HACKER Q&A
📣 Gooblebrai

Is RAG the Future of LLMs?


It seems to be in vogue to say that RAG is one of the best solutions for reducing the problem of hallucinations in LLMs.

What do you think? Are there any other alternatives or solutions in sight?


  👤 gandalfgeek Accepted Answer ✓
#1 motivation for RAG: you want to use the LLM to provide answers about a specific domain. You want to not depend on the LLM's "world knowledge" (what was in its training data), either because your domain knowledge is in a private corpus, or because your domain's knowledge has shifted since the LLM was trained.

The latest connotation of RAG includes mixing in real-time data from tools or RPC calls. E.g. getting data specific to the user issuing the query (their orders, history etc) and adding that to the context.

So will very large context windows (1M tokens!) "kill RAG"?

- at the simple end of the app complexity spectrum: when you're spinning up a prototype or your "corpus" is not very large, yes-- you can skip the complexity of RAG and just dump everything into the window.

- but there are always more complex use-cases that will want to shape the answer by limiting what they put into the context window.

- cost-- filling up a significant fraction of a 1M window is expensive, both in terms of money and latency. So at scale, you'll want to filter down to the relevant info and RAG it in, rather than indiscriminately dump everything into the window.
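
To make the "filter, don't dump" point concrete, here is a minimal, self-contained sketch of that retrieval step. Everything in it is a toy assumption: the corpus is made up, the bag-of-words scorer stands in for a real embedding model or search index, and the final llm() call is a hypothetical placeholder for whatever model API you actually use.

```python
# Sketch: retrieve only the relevant chunks, then stuff them into the prompt.
from collections import Counter
import math

def bow_vector(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; swap for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

corpus = [
    "Order #123 shipped on 2024-04-02 via ground freight.",
    "Our return policy allows refunds within 30 days of delivery.",
    "The cafeteria menu for Friday includes lentil soup.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    q = bow_vector(query)
    return sorted(corpus, key=lambda doc: cosine(q, bow_vector(doc)), reverse=True)[:k]

query = "When did order 123 ship?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# response = llm(prompt)  # hypothetical call to whichever LLM API you use
print(prompt)
```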


👤 mark_l_watson
I wrote a book on LangChain and LlamaIndex about 14 months ago, and at the time I thought that RAG-style applications were great, but now I am viewing them as being more like material for demos. I am also less enthusiastic about LangChain and LlamaIndex; they are still useful, but the libraries are a moving target and often it seems best to just code up what I need by hand. The moving-target issue is huge for me; updating my book frequently has been a major time sink.

I still think LLMs are the best AI tech/tools since I started getting paid to be an AI practitioner in 1982, but that is a low bar of achievement given that some forms of Symbolic AI failed to ever scale to solve real problems.


👤 cl42
RAG will have a place in the LLM world, since it's a way to obtain data/facts/info for relevant queries.

Since you asked about alternatives...

(a) "World models" where LLMs structure information into code, structured data, etc. and query those models will likely be a thing. AlphaGeometry uses this[1], and people have tried to abstract this in different ways[2].

(b) Depending on how you define RAG, knowledge graphs could be considered a form of RAG or an alternative to it. Companies like Elemental Cognition[3] are building distinct alternatives to RAG that use such graphs and give LLMs the ability to run queries on them. Another approach here is to build "fact databases" where you structure observations about the world into standalone concepts/ideas/observations and reference those[4]. Again, similar to RAG but not quite RAG as we know it today.

[1] https://deepmind.google/discover/blog/alphageometry-an-olymp...

[2] https://arxiv.org/abs/2306.12672

[3] https://ec.ai/

[4] https://emergingtrajectories.com/


👤 supreetgupta
TrueFoundry has recently introduced a new open-source framework called Cognita, which uses Retrieval-Augmented Generation (RAG) to simplify the transition to production by providing robust, scalable solutions for deploying AI applications.

Try it out: https://github.com/truefoundry/cognita


👤 darkteflon
Unless we’re going to paste a whole domain corpus into the context window, we’re going to continue to need some sort of “relevance function” - a means of discriminating what needs to go in from what doesn’t. That could be as simple as “document A goes in, document B doesn’t”.

That’s RAG. Doesn’t matter that you didn’t use vectors or knowledge graphs or FTS or what have you.

Then the jump from “this whole document” to “well actually I only need this particular bit” puts you immediately into the territory of needing some sort of semantic map of the document.

I don’t think it makes conceptual sense to think about using LLMs without some sort of domain relevance function.
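
To illustrate how crude a "relevance function" can be while still counting as RAG, here is a toy sketch of the "document A goes in, document B doesn't" decision. The keyword-overlap scorer and the threshold are arbitrary assumptions, not a recommendation; any signal (vectors, FTS, a knowledge graph, business rules) could sit behind the same interface.

```python
# Toy document-level relevance function: include a document only if it clears a bar.
def relevant(query: str, document: str, threshold: int = 2) -> bool:
    q_terms = set(query.lower().split())
    d_terms = set(document.lower().split())
    return len(q_terms & d_terms) >= threshold  # arbitrary cutoff

docs = {
    "returns_policy.txt": "Refunds are issued within 30 days of delivery for unused items.",
    "release_notes.txt": "Version 2.3 adds dark mode and fixes the export bug.",
}
query = "how many days do I have to request a refund after delivery"
selected = [name for name, text in docs.items() if relevant(query, text)]
print(selected)  # only the selected documents go into the context window
```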


👤 mif
For those of us who don’t know what RAG is (including myself), RAG stands for Retrieval Augmented Generation.

From the video in this IBM post [0], I understand that it is a way for the LLM to check what its sources are and how recent its information is. Based on that, it could, in principle, say “I don’t know” instead of “hallucinating” an answer. RAG is a way to implement this behavior for LLMs.

[0] https://research.ibm.com/blog/retrieval-augmented-generation...


👤 rldjbpin
To me, RAG is just a glued-on solution that tries to counter the obvious limitation of LLMs: that they essentially predict the next token in a very convincing fashion.

I am sure there can be newer, more elegant ways to do this kind of prompt injection, but for the most part the LLM is either summarizing the injected content or regurgitating it.

if the output is satisfactory, it is still more convenient than writing custom rules for answers for each kind of question you want to address.


👤 spencerchubb
I believe RAG is a temporary hack until we figure out virtually infinite context.

I think LLM context is going to be like cache levels. The first level is small but super fast (like working memory). The next level is larger but slower, and so on.

RAG is basically a bad version of attention mechanisms. RAG is used to focus your attention on relevant documents. The problem is that RAG systems are not trained to minimize a loss; retrieval is just a similarity score.

Obligatory note that I could be wrong and it's just my armchair opinion


👤 p1esk
It’s strange: most answers here assume the next-gen models won’t be able to perform RAG on their own. IMO, it would be wise to assume the opposite: anything humans currently do to make models smarter will be built in.

👤 teleforce
Another potent alternative is perhaps the Differentiable Search Index (DSI), based on Transformer memory:

Transformer Memory as a Differentiable Search Index:

https://arxiv.org/abs/2202.06991


👤 machinelearning
Both RAG and infinite contexts in their current states are hacks.

Both waste compute because you have to re-encode things as text each time and RAG needs a lot of heuristics + a separate embedding model.

Instead, it makes a lot more sense to pre-compute the KV cache for each document, then compute values for each query, only surfacing values when the attention score is high enough.

The challenge here is to encode global position information in the surfaced values and to get them to work with generation. I suspect it can't be done out of the box, but it will work with training.

This approach has echoes of both infinite context length and RAG but is an intermediate method that can be parallelized and is more efficient than either one.


👤 sigmoid10
The latest research suggests that the best thing you can do is RAG + finetuning on your target domain. Both give roughly equal percentage gains, but they are independent (i.e. they accumulate if you do both). As context windows constantly grow and very recent architectures move more towards linear context complexity, we'll probably see current RAG mechanisms lose importance. I can totally imagine a future where if you have a research level question about physics, you just put a ton of papers and every big graduate physics textbook into the current context instead of searching text snippets using embeddings etc.

👤 nimish
RAG is an easy way to incorporate domain knowledge into a generalized model.

It's 1000x more efficient to give it a look-aside buffer of info than to try to teach it ab initio.

Why do more work when the data is already there?


👤 cjbprime
It's hard to imagine what could happen instead. Even with a model with infinite context, where we imagine you could supply e.g. your entire email archive with each message in order to ask questions about one email, the inference time still scales with the number of input tokens.

So you'd still want to use RAG as a performance optimization, even though today it's being used as more of a "there is no other way to supply enough of your own data to the LLM" must-have.


👤 nl
In the ~2 year timeframe we'll be using RAG.

Longer term it gets more interesting.

Assuming we can solve long (approaching infinite) context, and solve the issues with reasoning over long context that LangChain correctly identified[1], then it becomes a cost and performance (speed) issue.

It is currently very very expensive to run a full scan of all knowledge for every inference call.

And there are good reasons why databases use indexes instead of table scans (ie, performance).

But maybe we find a route forward towards adaptive compute over the next two years. Then we can use low compute to find items of interest in the infinite context window, and then use high compute to reason over them. Maybe this could provide a way forward on the cost issues at least.

Performance is going to remain an issue. It's not clear to me how solvable that is (sure, you can imagine ways it could be parallelized, but it seems likely there will be a cost penalty on planning that).

[1] https://blog.langchain.dev/multi-needle-in-a-haystack/


👤 sc077y
RAG is a fantastic solution and I think it's here to stay one way or another. Yes, the libraries surrounding it are lacking because the field is moving so fast, and yes, I'm mainly talking about LangChain. RAG is just one way of grounding; that being said, I think it's agent workflows that will really be the killer here. The idea that you can assist or perhaps even replace an entire task-fulfilling unit, a.k.a. a worker, with an LLM assisted by RAG is going to be revolutionary.

The only issue right now is the cost. You can bet that GPU performance will double every year, or even every 6 months according to Elon. RAG addresses cost issues today as well by retrieving only the relevant context. Once LLMs get cheaper and context windows widen, which they will, RAG will become easier, dare I say trivial.

I would argue RAG is important today on its own and as a grounding, no pun intended, for agent workflows.


👤 haolez
I don't think so. Token windows are always increasing, and new architectures (LeCun is proposing some interesting stuff with world models) might make it cheaper to add knowledge to the model itself. I think it's more of a necessity of our current state of the art than something that I'd bet on.

👤 redskyluan
What I observe: simple RAG is fading, but complex RAG will persist and evolve, involving query rewriting, data cleaning, reflection, vector search, graph search, rerankers, and more intelligent chunking. Large models should not just be knowledge providers, but tool users and process drivers.

👤 waldrews
I work on statistical quality control methods for the hallucination problem. Model how difficult/error-prone a query is, and prioritize sending it to humans to verify the LLM's answer if it's high risk. Some form of human control like that is the only way to really cut hallucinations down to something like a human-equivalent level (human answers are unreliable too, and should be subject to quality control with reputation scores and incentives as well).

RAG can augment the LLM with specific knowledge, which may make it more likely to give factually correct answers in those domains, but is mostly orthogonal to the hallucination problem (except to the extent that LLMs hallucinate when asked questions on a subject they don't know).
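
A minimal sketch of the routing idea described above, under made-up assumptions: estimate_risk() is a hypothetical stand-in for a trained difficulty/error model, and the threshold is arbitrary. The point is only the control flow: high-risk answers get queued for human verification instead of going straight to the user.

```python
# Sketch of risk-based routing for answer quality control (not the author's actual system).
from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    needs_human_review: bool

RISK_THRESHOLD = 0.7  # arbitrary; would be tuned against measured error rates

def estimate_risk(query: str, draft_answer: str) -> float:
    """Hypothetical placeholder for a model of how error-prone this query/answer is."""
    return 0.9 if "guarantee" in query.lower() else 0.2  # toy heuristic

def answer_with_qc(query: str, draft_answer: str) -> Answer:
    risk = estimate_risk(query, draft_answer)
    # Route high-risk answers to a human review queue before they reach the user.
    return Answer(text=draft_answer, needs_human_review=risk >= RISK_THRESHOLD)

print(answer_with_qc("Do you guarantee next-day delivery?", "Yes, always."))
```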


👤 zamalek
RAG can't create associations to data that isn't superficially (i.e., as found by the indexing strategy) associated with the query. For example, you might query about one presidential candidate and lose out on the context of all other presidential candidates (probably a bad example, but it gets the point across).

It is "search and summarize." It is not "glean new conclusions." That being said, "search and summarize" is probably good for 80%.

LoRA is an improvement, but I have seen benchmarks showing that it struggles to make as deep inferences as regular training does.

There isn't a one-size fits all... Yet.


👤 throwaway74432
I suspect we'll discover/invent an IR (intermediate representation) that behaves like RAG in that it primes the LLM to produce a specific bit of knowledge/facts, but the IR is a lot less like normal English, and more like a strange pseudo-English.

👤 dragonwriter
RAG is mostly a hack to address limited context windows, limited use of wide context windows (some models have large windows but don’t make good use of content that isn’t near the beginning or end), or expensive context windows (LLM-as-a-service typically charges by the token, so RAG can reduce cost).

“Stuffing relevant data into the context window rather than relying purely on training” is a solution to confabulation, though, just like providing relevant reference information to a person who is being pressured to answer a question is.


👤 simonw
I wouldn't call it the "future of LLMs". I do see it as both the present and future of one of the application areas of LLMs, which is answering questions against a custom collection of content.

👤 mehulashah
We think that RAG is fundamentally limited:

https://www.aryn.ai/post/rag-is-a-band-aid-we-need-llm-power...

We do see a world where LLMs are used to answer questions (Luna), but it’s a more complex compound AI system that references a corpus (knowledge source), and uses LLMs to process that data.

The discussion around context sizes is a red herring. They can’t grow as fast as the demand for data.


👤 sc077y
Thinking back, if LLMs are able to have memory storage and access, then RAG becomes useless. RAG is like a system that shoves bits into RAM (the context window) and asks the CPU (the LLM) to compute something. But if you expand the RAM to a ridiculous amount, or you use the HDD, it's no longer necessary to do that. RAG is a suboptimal way of having long-term memory. That being said, today it is useful. And when or if this problem gets solved is not easy to say. In the meantime, RAG is the way to go.

👤 lqhl
LLM applications can benefit from Retrieval-Augmented Generation (RAG) in a similar way that humans benefit from search engines like Google. Therefore, I believe RAG cannot be replaced by prompts or fine-tuning.

https://myscale.com/blog/prompt-engineering-vs-finetuning-vs...


👤 0x008
What people are not always considering is that RAG has many more applications than just selecting the relevant context chunks. After all, the R in RAG does not stand for vector search; it stands for "Retrieval".

With RAG tools that exist today, we can already do things like

- providing summaries

- hierarchical summarization

- generation of questions / more prompts to nudge the model

- caching

- using knowledge graphs, function calling, or database connectors for non-semantic data querying

etc.
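
As a concrete example of one item on that list, hierarchical summarization is easy to sketch as a map-reduce over chunks. This is a toy illustration: summarize() just truncates and stands in for a real LLM call, and the chunk sizes are arbitrary.

```python
# Toy hierarchical (map-reduce) summarization: summarize chunks, then summarize the summaries.
def summarize(text: str, max_words: int = 12) -> str:
    """Placeholder for an LLM summarization call; here it simply truncates."""
    return " ".join(text.split()[:max_words]) + " ..."

def hierarchical_summary(document: str, chunk_size: int = 200) -> str:
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    chunk_summaries = [summarize(chunk) for chunk in chunks]      # map step
    return summarize(" ".join(chunk_summaries), max_words=40)     # reduce step

print(hierarchical_summary("some very long document " * 500))
```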


👤 atleastoptimal
What will happen as inference cost goes down is that RAG will just become one master LLM calling a bunch of smaller LLMs whose context windows together span every document you are querying. Context windows went from 8k to like 128k or more in like a year. In a few years we will have practically unlimited context windows for minimal cost.
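
A hedged sketch of that fan-out pattern, purely illustrative: each "worker" call sees one shard of the documents, and a "master" call reconciles the partial answers. llm() is a hypothetical placeholder, not a real API.

```python
# Sketch: master LLM aggregating answers from worker LLMs that each see one document shard.
def llm(prompt: str) -> str:
    """Hypothetical model call; returns a stub answer here."""
    return f"<answer based on: {prompt[:40]}...>"

def fan_out_answer(query: str, documents: list[str], shard_size: int = 3) -> str:
    shards = [documents[i:i + shard_size] for i in range(0, len(documents), shard_size)]
    partial_answers = [
        llm("Context:\n" + "\n".join(shard) + f"\n\nQuestion: {query}")  # worker calls
        for shard in shards
    ]
    # Master call: reconcile the partial answers into one response.
    return llm("Combine these partial answers:\n" + "\n".join(partial_answers) + f"\n\nQuestion: {query}")

docs = [f"document {i} contents" for i in range(7)]
print(fan_out_answer("What changed in Q3?", docs))
```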

👤 stainlu
Yes. So basically RAG is the RAM through which humans and AI interact with each other. Doing no RAG is on one side imprecise (for lack of context) and on the other side inefficient (think of non-RAG as having more general attention).

Inefficiency (in other words, higher expense) is sometimes even easier for decision-makers to perceive.


👤 syndacks
Are there any best practices for doing RAG over, say, a novel (50k-100k words)? Things that would make this unique compared to, say, RAG over smaller docs or research papers: the ability to return specific sentences/passages about a character while also keeping their arc in mind from the beginning to the end of the story.

👤 Lerc
It doesn't matter how much knowledge augmentation is provided: if it is less than infinite, hallucinations are going to be a problem. This is a mitigation, not a solution.

The solution is finding a way for models to recognise the absence of knowledge.


👤 tslmy
To make an LLM relevant to you, your intuition might be to fine-tune it with your data, but:

1. Training an LLM is expensive.

2. Due to the cost of training, it’s hard to update an LLM with the latest information.

3. Observability is lacking. When you ask an LLM a question, it’s not obvious how the LLM arrived at its answer.

There’s a different approach: Retrieval-Augmented Generation (RAG). Instead of asking the LLM to generate an answer immediately, frameworks like LlamaIndex:

1. retrieve information from your data sources first,

2. add it to your question as context, and

3. ask the LLM to answer based on the enriched prompt.

RAG overcomes all three weaknesses of the fine-tuning approach:

1. There’s no training involved, so it’s cheap.

2. Data is fetched only when you ask for it, so it’s always up to date.

3. The framework can show you the retrieved documents, so it’s more trustworthy.

(https://lmy.medium.com/why-rag-is-big-aa60282693dc)
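
For reference, those three steps map onto the canonical LlamaIndex quickstart roughly as below. This is a sketch, not the author's code: import paths and defaults vary across LlamaIndex versions, the "data" directory is a made-up example, and it assumes an LLM/embedding backend is configured (e.g. OPENAI_API_KEY in the environment).

```python
# Rough LlamaIndex-style retrieve -> augment -> generate flow (details vary by version).
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()  # 1. load your data sources
index = VectorStoreIndex.from_documents(documents)     #    embed and index them once

query_engine = index.as_query_engine()                 # 2.-3. retrieval + enriched prompt
response = query_engine.query("What changed in the 2024 pricing policy?")

print(response)                                        # answer from the enriched prompt
for node in response.source_nodes:                     # observability: the retrieved chunks
    print(node.score, node.node.metadata)
```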


👤 edude03
I’m bullish on RAG since you’ll always need a way to work with new information without retraining or fine-tuning an LLM. Even as humans, we essentially do RAG with Google.

👤 darby_eight
It's certainly the near future; it's the main option that offers parameterization of behavior outside the initial prompt and training data.

👤 thatjoeoverthr
With this technology, faster chips will solve it _for an application-specific definition of "solved."_

The models aren't actually capable of taking into account everything in their context window with industrial yields.

These are stochastic processes, not in the "stochastic parrot" sense, but in the sense of "you are manufacturing emissions and have some measurable rate of success." Like a condom factory.

When you reduce the amount of information you inject, you both decrease cost and improve yield.

"RAG" is application specific methods of estimating which information to admit to the context window. In other words, we use domain knowledge and labor to reduce computational load.

When to do that is a matter of economy.

The economics of RAG in 2024 differ from 2022, and will differ in 2026.

So the question that matters is, "given my timeframe, and current pricing, do I need RAG to deliver my application?"

The second question is, "what's an acceptable yield, and how do I measure it?"

You can't answer that for 2026, because, frankly, you don't even know what you'll be working on.


👤 intended
The best solution to reducing the problem of hallucinations starts with someone telling us what their error rates in production are.

👤 stuaxo
I think it's a good complement.

If LLMs are akin to a "low-resolution JPEG of the internet", RAG allows checking of facts.


👤 ayushl
We'll get token-level RAG, something similar to the routing mechanisms in MoE.

👤 hbarka
Does RAG depend on a vector database?