HACKER Q&A
📣 sourabh03agr

Do we need 100B+ parameters in a large language model?


Cerebras-GPT and Databricks' Dolly perform reasonably well on many instruction-based tasks while being significantly smaller than GPT-3, challenging the notion that bigger is always better!

In my personal experience, the quality of the model depends a lot more on the fine-tuning data than on sheer size. If you choose your fine-tuning data carefully, you can tune a smaller model to outperform the state-of-the-art GPT-X on your task. The future of LLMs might look a lot more open-source than we imagined 3 months ago!
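
To make that concrete, here is a minimal sketch of the kind of fine-tuning I mean, assuming a HuggingFace-style stack. The model name and the two toy examples are placeholders, not a recipe; the point is that the curation of the examples is where the quality comes from:

    # Minimal sketch: fine-tune a small open model on a curated instruction set.
    # Assumes the HuggingFace transformers/datasets stack; the model name and
    # the toy examples below are placeholders.
    from datasets import Dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    model_name = "cerebras/Cerebras-GPT-590M"   # any small causal LM works here
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # The curated data is the important part -- swap in your own examples.
    examples = [
        {"text": "Instruction: Summarize the support ticket.\n"
                 "Response: Customer cannot log in after a password reset."},
        {"text": "Instruction: Classify the sentiment of the review.\n"
                 "Response: Negative."},
    ]
    dataset = Dataset.from_list(examples).map(
        lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
        remove_columns=["text"],
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="finetuned", num_train_epochs=3,
                               per_device_train_batch_size=1),
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()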

Would love to hear everyone's opinions on how they see the future of LLMs evolving. Will it be a few players (OpenAI) cracking AGI and conquering the whole world, or lots of smaller open-source models that ML engineers fine-tune for their own use cases?

P.S. I am kinda betting on the latter and building UpTrain (https://github.com/uptrain-ai/uptrain), an open-source project which helps you collect that high-quality fine-tuning dataset.


  👤 retrac Accepted Answer ✓
You've asked a very deep question. What is the information density of language? Of general knowledge? How many bits, in theory, are required to describe the vocabulary and grammar of a language like English in a way that lets software handle a variety of natural-language tasks? How many bits are required, in theory, to contain a database of general knowledge?

ChatGPT has been compared to a "blurry JPEG of the web" by Ted Chiang [1], and I think that is a very apt analogy. There's a close relationship between deep learning and compression. In a sense, a generative model like ChatGPT is a lossy compression algorithm that re-synthesizes outputs approximating its inputs. (Unsurprisingly, deep-learning-based methods blow traditional compression algorithms out of the water on compression ratio.)
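
To put a rough number on that: a language model's cross-entropy on a piece of text is, in principle, the number of bits an arithmetic coder driven by the model would need to store it. Here's a hedged sketch of that comparison against gzip; GPT-2 and the sample paragraph are arbitrary stand-ins, not a benchmark:

    # Rough sketch: bits/char under a small LM's predictive distribution
    # (roughly what an arithmetic coder driven by the model could achieve)
    # versus gzip. GPT-2 and the sample text are arbitrary choices.
    import gzip
    import math

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    text = (
        "Large language models are trained to predict the next token in a "
        "stream of text. A model that predicts well assigns high probability "
        "to what actually comes next, which is exactly what a compressor "
        "needs: feed those probabilities into an arithmetic coder and the "
        "text can be stored in roughly its cross-entropy in bits."
    )

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # .loss is the mean next-token cross-entropy in nats
        nats_per_token = model(ids, labels=ids).loss.item()

    lm_bits = nats_per_token / math.log(2) * (ids.numel() - 1)
    gzip_bits = len(gzip.compress(text.encode("utf-8"))) * 8

    print(f"model-based estimate: {lm_bits / len(text):.2f} bits/char")
    print(f"gzip:                 {gzip_bits / len(text):.2f} bits/char")

On ordinary English prose the model-based figure usually comes out several times smaller than gzip's, which is the "blow traditional compression out of the water" effect in miniature.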

I suspect the question is essentially the same as "How small could the complete text of Wikipedia be compressed?" or "How few bits does it take to lossily compress a recording of a human voice while the speaker remains recognizable?"

It's an unsolved philosophical problem. I don't know of any attempts to determine the lower bound. Intuitively, something like hundreds of kilobytes seems inadequate, and hundreds of gigabytes is adequate. So it's somewhere in between.
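
For a back-of-the-envelope feel for where "in between" might land, assume (loudly, as assumptions) that English Wikipedia's plain text is on the order of 20 billion characters and that Shannon-style estimates of roughly 1 bit per character of English are about right:

    # Crude order-of-magnitude arithmetic; both inputs are assumptions,
    # not measurements.
    wiki_chars = 20e9        # assumed size of English Wikipedia's plain text
    bits_per_char = 1.0      # assumed entropy of English (Shannon-style estimate)

    floor_gb = wiki_chars * bits_per_char / 8 / 1e9
    print(f"rough floor for the raw text: ~{floor_gb:.1f} GB")
    # ~2.5 GB: a few gigabytes, comfortably between the "hundreds of
    # kilobytes" and "hundreds of gigabytes" intuitions, and it says nothing
    # about how much smaller a model that merely paraphrases could get.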

[1] https://www.newyorker.com/tech/annals-of-technology/chatgpt-...