And in 10-20 years it’ll be capable of some crazy stuff
I might be ignorant of the field but why do we assume this?
How do we know it won’t just plateau in performance at some point?
Or that, say, the compute requirements become impractically high?
No one has hit a model/dataset size where the scaling curves break down, and they're fairly smooth. Simple models that accurately predict performance usually work pretty well near existing scales, so I expect trillion- or 10-trillion-parameter models to be on the same curve.
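To make that concrete, here's a minimal sketch of the kind of simple predictive model I mean: a power-law fit of loss against parameter count in log-log space, extrapolated out to 1T/10T parameters. The numbers are made up purely for illustration, not real benchmark data.

```python
# Illustrative power-law scaling fit: loss ~ a * N**(-b), fit on smaller
# models and extrapolated to larger ones. All numbers are made up.
import numpy as np

params = np.array([1e8, 1e9, 1e10, 1e11, 1e12])   # model sizes (made up)
loss   = np.array([3.9, 3.2, 2.7, 2.3, 2.0])      # eval loss (made up)

# Fit log(loss) = intercept + slope * log(N), a straight line in log-log space.
slope, intercept = np.polyfit(np.log(params), np.log(loss), 1)

def predicted_loss(n_params: float) -> float:
    return float(np.exp(intercept) * n_params ** slope)

for n in (1e12, 1e13):   # extrapolate to 1T and 10T parameters
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.2f}")
```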
What we haven't seen yet (that I'm aware of) is whether the specializations to existing models (LoRA, RLHF, different attention methods, etc.) follow similar scaling laws, since most of the effort has been focused on achieving similar performance with smaller/sparser models rather than investing large amounts of money into huge experiments. It will be interesting to see what DeepMind's Gemini reveals.
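For reference, the LoRA trick itself is small enough to sketch in a few lines: freeze the pretrained weight and learn a low-rank update on top of it. The shapes below are toy values, purely illustrative.

```python
# Minimal numpy sketch of the LoRA idea: keep W frozen and train only the
# low-rank factors A and B, so the update B @ A has rank r << d.
import numpy as np

d_out, d_in, r = 512, 512, 8                  # toy layer sizes, rank r << d
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))        # "pretrained" weight, kept frozen
A = rng.standard_normal((r, d_in)) * 0.01     # trainable low-rank factor
B = np.zeros((d_out, r))                      # trainable; zero init => no change at start

def lora_forward(x: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """x has shape (d_in,); adds the low-rank correction B @ A on top of W."""
    return W @ x + scale * (B @ (A @ x))

x = rng.standard_normal(d_in)
print(lora_forward(x).shape)
# Trainable parameters: r * (d_in + d_out) = 8,192 vs. d_in * d_out = 262,144.
```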
Data
Compute
Algorithms
All three are just scratching the surface of what is possible.
Data: What has been scraped off the internet is just <0.001% of human knowledge, since most platforms can't be scraped easily and much of it is in non-text formats like video and audio, or sits on plain old undigitized paper. Finally, there are probably techniques to increase data through synthetic means, which is purportedly OpenAI's secret sauce behind GPT-4's quality.
Compute: While 3nm processes are approaching an atomic limit (0.21nm for Si), there is still room to explore more densely packed transistors or other materials like gallium nitride or optical computing. Not only that, but there is also a lot of room in hardware architecture for more parallelism and 3-D stacked transistors.
Algorithms: The Transformer and other attention mechanisms have several sub-optimal components, like the fairly arbitrary design decisions baked into the architecture and the quadratic time complexity of attention. There also seems to be a large space of LLM augmentations, like RLHF for instruction following, improvements in factuality, and other mechanisms.
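To illustrate the quadratic-attention point, here's a bare-bones sketch of scaled dot-product attention (single head, no masking, toy sizes assumed); the (n, n) score matrix is where the n^2 time and memory come from.

```python
# Vanilla attention is quadratic in sequence length n because the score
# matrix has shape (n, n). Toy sizes, single head, no masking.
import numpy as np

def naive_attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V each have shape (n, d)."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                      # (n, n) -- the quadratic part
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # (n, d)

n, d = 2048, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = naive_attention(Q, K, V)
print(out.shape, "score matrix entries:", n * n)       # ~4.2M entries at n=2048
```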
And these ideas are just from my own limited experience. So I think it's fair to say that LLMs have plenty of room to improve.
That doesn't mean there isn't a plateau somewhere, but it's likely way off in the distance.
But that's just my opinion and no one knows the future. If you read papers on arxiv.org, progress is being made. Papers are being written, low-hanging fruit consumed. So we're going to try because PhDs are there for the taking on the academic side, and generational wealth is there for the taking on the business side.
E. F. Codd invented the relational database and won the Turing Award. Larry Ellison founded Oracle to sell relational databases and that worked out well for him, too.
There's plenty of motivation to go around.
Digital computer architecture evolved the way it did because there was no other practical way to get the job done besides enforcing a strict separation of powers between the ALU, memory, mass storage, and I/O. We are no longer held to those constraints, technically, but they still constitute a big comfort zone. Maybe someone tinkering with a bunch of FPGAs duct-taped together in their basement will be the first to break out of it in a meaningful way.
Good LLMs like ChatGPT are a relatively new technology so I think it's hard to say either way. There might be big unrealized gains by just adding more compute, or adding/improving training data. There might be other gains in implementation, like some kind of self-improvement training, a better training algorithm, a different kind of neural net, etc. I think it's not unreasonable to believe there are unrealized improvements given the newness of the technology.
On the other hand, there might be limitations to the approach. We might never be able to solve for frequent hallucinations, and we might not find much more good training data as things get polluted by LLM output. Data could even end up being further restricted by new laws meaning this is about the best version we will have and future versions will have worse input data. LLMs might not have as many "emergent" behaviors as we thought and may be more reliant on past training data than previously understood, meaning they struggle to synthesize new ideas (but do well at existing problems they've trained on). I think it's also not unreasonable to believe LLMs can't just improve infinitely to AGI without more significant developments.
Speculation is always just speculation, not a guarantee. We can sometimes extrapolate from what we've seen, but sometimes we haven't seen enough to know the long term trend.
I think I have a corollary-type idea: why are LLMs not perhaps like "Linux," something that never really needs to be REWRITTEN from scratch, merely added to or improved on? In other words, isn't it fair to think that LoRAs are the really important thing to pay attention to?
(And perhaps, like Google Fuchsia or whatever, new LLMs might just be mostly a waste of time from an innovator's POV?)
It's not infeasible that in the future you'll have a box at home that you can ask a fairly complicated question, like "how do I build a flying car", and it will have the ability to
- give you step-by-step instructions for what you need to order
- write and run code to simulate certain things
- analyze your work from video streams and provide feedback
- possibly even have a robotic arm with attachments that can do some work.
From a software perspective, I've wondered for a while if, as LLM usage matures, there will be an effort to optimize hotspots, like what happened with VMs, or auto-indexing in relational DBs. I'm sure there are common data paths that get more usage, which could somehow be prioritized, either through pre-processing or dynamically, helping speed up inference.
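As a toy illustration of that idea (not how any production system does it; `run_model` is a hypothetical stand-in for the actual inference call), a hot-path cache could look something like this:

```python
# Memoize completions for prompts that recur often, the way a VM JIT-compiles
# hot methods or a database auto-indexes hot queries. Purely illustrative.
from collections import Counter
from functools import lru_cache

def run_model(prompt: str) -> str:
    # Hypothetical stand-in for the expensive inference call.
    return f"completion for: {prompt}"

prompt_counts = Counter()              # profile which prompts are "hot"

@lru_cache(maxsize=10_000)             # reuse results for repeated prompts
def cached_completion(prompt: str) -> str:
    return run_model(prompt)

def serve(prompt: str) -> str:
    prompt_counts[prompt] += 1         # the counts could drive smarter policies
    return cached_completion(prompt)

print(serve("What is a B-tree?"))
print(serve("What is a B-tree?"))      # second call is served from the cache
```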
Also, GPT-4 seems to include multiple LLMs working in concert. There's bound to be way more fruit to be picked along that route as well. In short, there are tons of areas where improvements, large and small, can be made.
As always in computer science, the maxim, "Make it work, make it work well, then make it work fast," applies here as well. We're collectively still at step one.
Great video to talk about this: https://www.youtube.com/watch?v=ARf0WyFau0A
In threads on LLMs, this point doesn't get brought up as much as I'd expect, so I'm curious whether I'm missing talks on this or whether it's just wrong. But I see this as the way forward: models generating tons of answers, other models picking out the correct ones, the combination reaching beyond human ability, after which humans can do their own verification.
Edit:
Think of it this way. Trying to create something isn't easy. If I was to write a short story, it'd be very difficult, even if I spent years reading what others have written to learn their patterns. If I then tried to write and publish a single one myself, no chance it'd be any good.
But _judging_ short stories is much easier to do. So if I said screw it, read a couple of stories to get the initial framework, then wrote 100 stories in the same amount of time I'd have spent reading and learning more about short stories, I could then go through the 100, pick out the one I think is best, and publish that.
That's where I see LLMs going and what the video and papers mentioned in the video say.
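A minimal best-of-N sketch of that "generate many, judge the results, keep the best" loop looks like this; `generate_story` and `score_story` are hypothetical stand-ins for a generator model and a separate judge model, stubbed out here so the example runs.

```python
# Generate N candidates cheaply, score them with a judge, keep the best one.
import random

def generate_story(prompt: str) -> str:
    return f"{prompt} -- draft #{random.randint(0, 10_000)}"    # stand-in generator

def score_story(prompt: str, story: str) -> float:
    return random.random()                                      # stand-in judge

def best_of_n(prompt: str, n: int = 100) -> str:
    candidates = [generate_story(prompt) for _ in range(n)]     # cheap to produce
    scored = [(score_story(prompt, c), c) for c in candidates]  # easier to judge
    return max(scored, key=lambda pair: pair[0])[1]             # keep the best one

print(best_of_n("A story about a lighthouse keeper"))
# A human can still do their own verification on the single surviving candidate.
```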
I'm not an expert here either, but I wonder if there will be the same "leap" we saw from GPT-3 to GPT-4, or if there's a diminishing curve to performance, i.e., adding another trillion parameters has less of a noticeable effect than the first few hundred billion.
[0] https://fortune.com/2023/09/09/ai-chatgpt-usage-fuels-spike-... -- I am fairly certain they paid for that water, but it was not a commensurate price given the circumstances, and if they'd had to ask first, the answer from any reasonable environmental stewardship organization would have been no.
I, of course, already know how to do all this for a mere $80B.
Anything that has seen continual growth is assumed to keep growing at a similar rate.
Or, how I mentally model it even if it's a bit incorrect: People see sigmoidal growth as exponential.
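A quick numerical way to see that (toy parameters, purely illustrative): early on, a logistic curve and an exponential are nearly indistinguishable, and they only separate once the logistic approaches its ceiling.

```python
# Compare an exponential with a logistic (sigmoid) curve that has a ceiling
# of 1000. The two track each other closely before the inflection point.
import numpy as np

t = np.linspace(0, 10, 11)
exponential = np.exp(0.8 * t)
logistic = 1000 / (1 + 999 * np.exp(-0.8 * t))   # carrying capacity 1000

for ti, e, s in zip(t, exponential, logistic):
    print(f"t={ti:4.1f}  exp={e:10.1f}  logistic={s:10.1f}")
# The columns stay close until the logistic curve nears its ceiling,
# then it flattens while the exponential keeps climbing.
```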
I suspect that we've already seen the shape of the curve: a 1B parameter model can index a book; a 4B model can converse; a 14B model can be a little more eloquent. Beyond that, no real gains will be seen.
The "technology advancement" phase has already happened mostly, but the greater understanding of theory, that would discourage foolish investments hasn't propagated yet. So there's probably at least another full year of hype cycle before the next buzzword is brought out to start hoovering up excess investment funds.
So if we have that much compute power already, why can't we just configure it in the right way to match a human brain?
I'm not sure I totally buy that logic though, since I would think the architecture/efficiency of a brain is way different from a computer's.
But even if you’re looking just at the LLM it seems like there’s a lot of ways it can be improved still.
We don't.
But that's also the sort of thing you can't say when seeking huge amounts of funding for your LLM company.