What do you think? Are there any other alternatives or solutions in sight?
The latest connotation of RAG includes mixing in real-time data from tools or RPC calls, e.g. fetching data specific to the user issuing the query (their orders, history, etc.) and adding that to the context.
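A minimal sketch of that pattern; the order-fetching function, its stubbed data, and the prompt wording are hypothetical placeholders, not any particular framework's API:

```python
# Sketch of "RAG as real-time data injection": fetch data specific to the
# requesting user and prepend it to the prompt. The data source and the
# call_llm function are stand-ins for your own RPC client and LLM API.

def fetch_recent_orders(user_id: str) -> list[str]:
    # In a real system this would be an RPC or database call; stubbed here.
    return ["#1043: wireless keyboard (shipped)", "#1051: USB-C hub (processing)"]

def call_llm(prompt: str) -> str:
    # Placeholder for an actual LLM API call.
    return f"(model answer based on a prompt of {len(prompt)} chars)"

def answer_with_user_context(user_id: str, question: str) -> str:
    orders = fetch_recent_orders(user_id)
    context = "\n".join(f"- {o}" for o in orders)
    prompt = (
        "Answer using the customer's recent orders below.\n"
        f"Recent orders:\n{context}\n\n"
        f"Question: {question}"
    )
    return call_llm(prompt)

print(answer_with_user_context("u-123", "Where is my keyboard?"))
```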
So will very large context windows (1M tokens!) "kill RAG"?
- at the simple end of the app complexity spectrum: when you're spinning up a prototype or your "corpus" is not very large, yes-- you can skip the complexity of RAG and just dump everything into the window.
- but there are always more complex use-cases that will want to shape the answer by limiting what they put into the context window.
- cost-- filling up a significant fraction of a 1M window is expensive, both in money and latency. So at scale, you'll want to retrieve only the relevant info (i.e. RAG) rather than indiscriminately dump everything into the window.
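Back-of-envelope arithmetic on that last point (the per-token price is an assumed, illustrative figure, not any vendor's actual pricing):

```python
# Cost comparison: dump-everything vs. retrieve-then-generate.
# The price below is an assumption purely for illustration; it varies widely by model.
price_per_million_input_tokens = 3.00   # assumed USD

full_dump_tokens = 800_000      # filling most of a 1M-token window
rag_tokens = 8_000              # a handful of retrieved chunks + the question

cost_full = full_dump_tokens / 1_000_000 * price_per_million_input_tokens
cost_rag = rag_tokens / 1_000_000 * price_per_million_input_tokens

print(f"full dump: ${cost_full:.2f} per query")   # $2.40 per query
print(f"RAG:       ${cost_rag:.4f} per query")    # $0.024 per query
```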
I still think LLMs are the best AI tech/tools since I started getting paid to be an AI practitioner in 1982, but that is a low bar of achievement given that some forms of Symbolic AI failed to ever scale to solve real problems.
Since you asked about alternatives...
(a) "World models" where LLMs structure information into code, structured data, etc. and query those models will likely be a thing. AlphaGeometry uses this[1], and people have tried to abstract this in different ways[2].
(b) Depending on how you define RAG, knowledge graphs could be a form of RAG or an alternative to it. Companies like Elemental Cognition[3] are building distinct alternatives to RAG that use such graphs and give LLMs the ability to run queries on them. Another approach here is to build "fact databases", where you structure observations about the world into standalone concepts/ideas/observations and reference those[4]. Again, similar to RAG but not quite RAG as we know it today (a toy sketch follows the links below).
[1] https://deepmind.google/discover/blog/alphageometry-an-olymp...
[2] https://arxiv.org/abs/2306.12672
[3] https://ec.ai/
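To make (b) concrete, here is a toy "fact database" in which the model would emit a structured query instead of receiving similarity-matched text chunks. The schema, example facts, and query shape are made up for illustration:

```python
# Toy fact database: standalone structured observations that an LLM can query,
# rather than text chunks matched by embedding similarity.
facts = [
    {"subject": "ACME-3000", "relation": "max_operating_temp_c", "value": 85},
    {"subject": "ACME-3000", "relation": "released",             "value": 2021},
    {"subject": "ACME-2000", "relation": "max_operating_temp_c", "value": 70},
]

def run_query(subject: str, relation: str) -> list[dict]:
    # In a knowledge-graph setup the LLM would write this query itself
    # (e.g. as JSON, Cypher, or SPARQL); here we call it directly.
    return [f for f in facts if f["subject"] == subject and f["relation"] == relation]

# e.g. the model turns "How hot can the ACME-3000 run?" into this query,
# then phrases the structured result back as a natural-language answer.
print(run_query("ACME-3000", "max_operating_temp_c"))
# [{'subject': 'ACME-3000', 'relation': 'max_operating_temp_c', 'value': 85}]
```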
Try it out: https://github.com/truefoundry/cognita
That’s RAG. Doesn’t matter that you didn’t use vectors or knowledge graphs or FTS or what have you.
Then the jump from “this whole document” to “well actually I only need this particular bit” puts you immediately into the territory of needing some sort of semantic map of the document.
I don’t think it makes conceptual sense to think about using LLMs without some sort of domain relevance function.
From the video in this IBM post [0], I understand that it is a way for the LLM to check what its sources are and how recent its information is. Based on that, it could, in principle, say “I don’t know” instead of “hallucinating” an answer. RAG is one way to implement this feature for LLMs.
[0] https://research.ibm.com/blog/retrieval-augmented-generation...
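In practice, that “I don’t know” behaviour mostly comes from the prompt wrapped around the retrieved passages. A minimal sketch with a hypothetical template, no particular framework assumed:

```python
# Minimal grounding prompt: the model is told to answer only from the
# retrieved passages and to admit when they don't contain the answer.
def build_grounded_prompt(question: str, passages: list[str]) -> str:
    sources = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using ONLY the sources below. "
        "Cite the source number you used. "
        "If the sources do not contain the answer, reply exactly: I don't know.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )

print(build_grounded_prompt("When was the policy last updated?",
                            ["Policy v3 was published on 2023-11-02."]))
```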
I am sure there are more elegant ways to do prompt injection, but for the most part the LLM is either summarizing the injected prompt or regurgitating it.
if the output is satisfactory, it is still more convenient than writing custom rules for answers for each kind of question you want to address.
I think LLM context is going to be like cache levels. The first level is small but super fast (like working memory). The next level is larger but slower, and so on.
RAG is basically a bad version of attention mechanisms. RAG is used to focus your attention on relevant documents. The problem is that RAG systems are not trained to minimize loss; retrieval is driven by a similarity score instead.
Obligatory note that I could be wrong and it's just my armchair opinion
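For concreteness, the "similarity score" step in a typical RAG pipeline looks something like this; the embedding model is a separate, frozen component, and random vectors stand in for real embeddings here:

```python
import numpy as np

# Typical RAG retrieval: rank chunks by cosine similarity of embeddings.
# Nothing here is trained against the generator's loss; it's a fixed heuristic.
def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_vec: np.ndarray, chunk_vecs: list[np.ndarray], k: int = 3) -> list[int]:
    scores = [cosine(query_vec, c) for c in chunk_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# In a real system these vectors come from a frozen embedding model.
rng = np.random.default_rng(0)
chunk_vecs = [rng.normal(size=384) for _ in range(10)]
query_vec = rng.normal(size=384)
print(top_k(query_vec, chunk_vecs))  # indices of the 3 most similar chunks
```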
Transformer Memory as a Differentiable Search Index:
Both waste compute because you have to re-encode things as text each time and RAG needs a lot of heuristics + a separate embedding model.
Instead, it makes a lot more sense to pre-compute KV for each document, then score them against each incoming query, only surfacing values when the attention score is high enough.
The challenge here is to encode global position information in the surfaced values and to get them to work with generation. I suspect it can't be done out of the box, but that it will work with training.
This approach has echoes of both infinite context length and RAG but is an intermediate method that can be parallelized and is more efficient than either one.
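A toy sketch of that intermediate method, with random matrices standing in for a real model's learned projections and a made-up threshold; it only illustrates the precompute-then-gate idea, not a trained system:

```python
import numpy as np

# Idea: cache K/V per document offline, then at query time compute attention
# scores against the cached keys and only "surface" values that clear a gate.
rng = np.random.default_rng(0)
d_model, n_docs = 64, 5

W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
doc_embs = rng.normal(size=(n_docs, d_model))   # one representation per document

# Offline: pre-compute and cache keys/values for every document.
K_cache = doc_embs @ W_k
V_cache = doc_embs @ W_v

# Online: project the query, score the cached keys, surface high-scoring values.
query_emb = rng.normal(size=d_model)
q = query_emb @ W_q
scores = K_cache @ q / np.sqrt(d_model)
keep = scores >= np.quantile(scores, 0.6)       # arbitrary gate: keep the top ~40%
surfaced_values = V_cache[keep]

print(scores.round(2))
print(surfaced_values.shape)                    # e.g. (2, 64)
```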
It's 1000x more efficient to give it a look-aside buffer of info than to try to teach it ab initio.
Why do more work when the data is already there?
So you'd still want to use RAG as a performance optimization, even though today it's being used as more of a "there is no other way to supply enough of your own data to the LLM" must-have.
Longer term it gets more interesting.
Assuming we can solve long (approaching infinite) context, and solve the issues with reasoning over long context that LangChain correctly identified[1], then it becomes a cost and performance (speed) issue.
It is currently very very expensive to run a full scan of all knowledge for every inference call.
And there are good reasons why databases use indexes instead of table scans (ie, performance).
But maybe we find a route forward towards adaptive compute over the next two years. Then we can use low compute to find items of interest in the infinite context window, and then use high compute to reason over them. Maybe this could provide a way forward on the cost issues at least.
Performance is going to remain an issue. It's not clear to me how solvable that is (sure you can imagine ways it could be parallelized but it seems likely there will be a cost penalty on planning that)
The only issue right now is the cost. You can make a bet that GPU performance will double every year, or even every 6 months according to Elon. RAG addresses cost issues today as well by only retrieving relevant context. Once LLMs get cheaper and context windows widen, which they will, RAG will be easier, dare I say trivial.
I would argue RAG is important today on its own and as a grounding, no pun intended, for agent workflows.
RAG can augment the LLM with specific knowledge, which may make it more likely to give factually correct answers in those domains, but is mostly orthogonal to the hallucination problem (except to the extent that LLMs hallucinate when asked questions on a subject they don't know).
It is "search and summarize." It is not "glean new conclusions." That being said, "search and summarize" is probably good for 80%.
LoRA is an improvement, but I have seen benchmarks showing that it struggles to make as deep inferences as regular training does.
There isn't a one-size fits all... Yet.
“Stuffing relevant data into the context window rather than relying purely on training” is a solution to confabulation, though, just like providing relevant reference information to a person who is being pressured to answer a question is.
https://www.aryn.ai/post/rag-is-a-band-aid-we-need-llm-power...
We do see a world where LLMs are used to answer questions (Luna), but it’s a more complex compound AI system that references a corpus (knowledge source), and uses LLMs to process that data.
The discussion around context sizes is a red herring. They can’t grow as fast as the demand for data.
https://myscale.com/blog/prompt-engineering-vs-finetuning-vs...
With RAG tools that exist today, we can already do things like
- providing summaries
- hierarchical summarization (see the sketch after this list)
- generation of questions / more prompts to nudge the model
- caching
- using knowledge graphs, function calling, or database connectors for non-semantic data querying
etc.
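A minimal sketch of the hierarchical-summarization item; the summarize function is a placeholder for a real LLM call:

```python
# Hierarchical summarization: summarize chunks, then summarize the summaries,
# until the whole corpus collapses into one final summary.
def summarize(text: str) -> str:
    return text[:120] + "..."          # placeholder for an LLM summary call

def hierarchical_summary(chunks: list[str], fan_in: int = 4) -> str:
    level = [summarize(c) for c in chunks]
    while len(level) > 1:
        # Merge groups of `fan_in` summaries and summarize each group again.
        level = [summarize("\n".join(level[i:i + fan_in]))
                 for i in range(0, len(level), fan_in)]
    return level[0]

docs = [f"Section {i}: ... long text ..." for i in range(10)]
print(hierarchical_summary(docs))
```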
Inefficiency (in other words, higher expense) is sometimes even easier for decision-makers to perceive.
The solution is finding a way for models to recognise the absence of knowledge.
1. Training an LLM is expensive.
2. Due to the cost of training, it’s hard to keep an LLM up to date with the latest information.
3. Observability is lacking. When you ask an LLM a question, it’s not obvious how it arrived at its answer.
There’s a different approach: Retrieval-Augmented Generation (RAG). Instead of asking the LLM to generate an answer immediately, frameworks like LlamaIndex:
1. retrieve information from your data sources first,
2. add it to your question as context, and
3. ask the LLM to answer based on the enriched prompt.
RAG overcomes all three weaknesses of the fine-tuning approach:
1. There’s no training involved, so it’s cheap.
2. Data is fetched only when you ask for it, so it’s always up to date.
3. The framework can show you the retrieved documents, so it’s more trustworthy.
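For reference, that whole loop is a few lines in LlamaIndex. This assumes the current llama_index package layout and an LLM/embedding backend already configured (e.g. via an API key); the "data" folder and the query are placeholders:

```python
# pip install llama-index
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# 1. Retrieve: index your own documents so relevant chunks can be fetched.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# 2 + 3. Add the retrieved chunks to the prompt and ask the LLM.
query_engine = index.as_query_engine()
response = query_engine.query("What changed in our refund policy this year?")

print(response)                        # the grounded answer
for node in response.source_nodes:     # the retrieved chunks behind it
    print(node.score, node.get_content()[:80])
```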
The models aren't actually capable of taking into account everything in their context window with industrial yields.
These are stochastic processes, not in the "stochastic parrot" sense, but in the sense of "you are manufacturing emissions and have some measurable rate of success." Like a condom factory.
When you reduce the amount of information you inject, you both decrease cost and improve yield.
"RAG" is application specific methods of estimating which information to admit to the context window. In other words, we use domain knowledge and labor to reduce computational load.
When to do that is a matter of economy.
The economics of RAG in 2024 differ from 2022, and will differ in 2026.
So the question that matters is, "given my timeframe, and current pricing, do I need RAG to deliver my application?"
The second question is, "what's an acceptable yield, and how do I measure it?"
You can't answer that for 2026, because, frankly, you don't even know what you'll be working on.
If LLMs are akin to a "low resolution jpeg of the internet", RAG allows the facts to be checked against the retrieved sources.