What is the best approach for feeding a custom set of documents to an LLM and getting non-hallucinating, decent results in Dec 2023?
UPD: The question is generally about how to "teach" an LLM to answer questions using your set of documents (not necessarily training your own model, so approaches like RAG count).
[1] https://news.ycombinator.com/item?id=36832572
You still do RAG. LlamaIndex is still the best option that I know of. Most of the startups that have working products are likely using LlamaIndex. All of the ones that say they are training on documents are actually using RAG.
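Roughly, the quick-start pattern looks like this. This is only a sketch against the llama_index 0.9-era API from around the time of this thread (imports have moved around in later versions), and the ./docs path and the question are placeholders:

  # Minimal LlamaIndex RAG sketch (llama_index ~0.9; assumes OPENAI_API_KEY is set
  # and ./docs holds your files).
  from llama_index import SimpleDirectoryReader, VectorStoreIndex

  documents = SimpleDirectoryReader("./docs").load_data()   # parse txt/pdf/docx/...
  index = VectorStoreIndex.from_documents(documents)        # chunk + embed + store
  query_engine = index.as_query_engine(similarity_top_k=5)  # retrieve top-5 chunks per query

  response = query_engine.query("What does the contract say about termination?")
  print(response)               # answer grounded in the retrieved chunks
  print(response.source_nodes)  # the supporting passages, for spot-checking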
Test it out. If it really and truly doesn't work, search for a script that creates question-and-answer pairs automatically with GPT-4. Then try using that for QLoRA. I have never heard of anyone successfully using that for a private document knowledge base, though. Only for skills like math, reasoning, Python, etc. I think the issue is that you need a LOT of data, and it needs to repeat any concepts or facts you need learned many, many times in different supporting ways.
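For what it's worth, the kind of Q&A-generation script being described is roughly the following. This is a hedged sketch using the OpenAI Python client (v1.x); the prompt, chunking, and model name are illustrative, and parsing the model output may need to be more robust in practice:

  # Rough sketch: auto-generate Q&A pairs from document chunks with GPT-4 to build
  # a fine-tuning dataset (e.g. for QLoRA). Not a recipe known to work well for
  # private knowledge bases -- see the caveat above.
  import json
  from openai import OpenAI

  client = OpenAI()  # reads OPENAI_API_KEY from the environment

  def qa_pairs_for_chunk(chunk: str, n: int = 3) -> list[dict]:
      resp = client.chat.completions.create(
          model="gpt-4",
          messages=[
              {"role": "system",
               "content": "Write question/answer pairs that are fully answerable from "
                          "the given text. Respond as a JSON list of objects with "
                          "'question' and 'answer' keys and nothing else."},
              {"role": "user", "content": f"Text:\n{chunk}\n\nWrite {n} pairs."},
          ],
      )
      return json.loads(resp.choices[0].message.content)  # may need retry/repair logic

  chunks = ["...your documents, split into ~1000-token chunks..."]
  dataset = [pair for chunk in chunks for pair in qa_pairs_for_chunk(chunk)]
  print(dataset[:2])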
What absolutely does not work is just feeding a set of documents into fine-tuning. I have personally proven that dozens of times because I had a client who was determined to do it. He has been misled.
What it will do is learn the patterns that are in those documents.
1. RAG: Most popular, and works really well on smaller datasets. It is limited by the number of vectors/embeddings; a typical embedded chunk might be around 1,000 tokens. LlamaIndex did a lot of engineering on this and their techniques work pretty well. The problem with large datasets is almost always that users don't like writing long prompts/queries, so the answers come out more generic.
2. Finetuning + RAG: You can finetune a model on the expected outputs. If your datasets contain knowledge that is already on the open internet (blog posts, articles, anything non-proprietary), then finetuning works really well in combination with RAG, especially for large datasets. It may not work if you are dealing with proprietary knowledge that is hard to find on the open internet.
3. Continual pretraining: for very large datasets, and when the knowledge is proprietary. I talked to a firm with 70GB worth of data. No way a RAG pipeline would give them results. They are struggling to get LLMs to work for them. They need a model that is pretrained on their data, with instruction tuning on top of that. Most likely you won't need to do this.
You have to upload your documents to S3 and create a "Knowledge Base", then sync your documents into a vector database like OpenSearch or Pinecone. You are then good to go via their playground or the AWS API.
I made a video here describing the process, check around 14 minutes in:
https://ensembleanalytics.io/blog/introducing-bedrock-knowle...
Bedrock is a decent product I think. All of the models in one place (apart from the big dogs from OpenAI) and a common API across them.
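Once the knowledge base is synced, querying it through the API is a few lines of boto3. A sketch against the Bedrock Agent Runtime API as it existed around then; the knowledge base ID and model ARN are placeholders, so check the current docs for exact parameter names:

  # Query a Bedrock Knowledge Base: retrieval + generation in one call.
  import boto3

  client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

  response = client.retrieve_and_generate(
      input={"text": "What is our refund policy?"},
      retrieveAndGenerateConfiguration={
          "type": "KNOWLEDGE_BASE",
          "knowledgeBaseConfiguration": {
              "knowledgeBaseId": "YOUR_KB_ID",  # placeholder
              "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-v2",
          },
      },
  )
  print(response["output"]["text"])  # generated answer
  print(response["citations"])       # retrieved passages backing the answer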
Fine tuning does result in degradation of the overall model (https://twitter.com/xaiguydotagi/status/1737082280835703142) and so various RAG techniques may be desirable. As others have mentioned, LlamaIndex is a neat solution to build RAG pipelines: https://docs.llamaindex.ai/en/stable/optimizing/production_r...
The most feature complete implementation I've seen is h2ogpt[0] (not affiliated).
The code is kind of a mess (most of the logic is in an ~8,000-line Python file), but it supports ingestion of everything from YouTube videos to docx, pdf, etc., either offline or from the web interface. It uses LangChain and a ton of additional open-source libraries under the hood. It can run directly on Linux, via Docker, or with one-click installers for Mac and Windows.
It has various model hosting implementations built in (transformers, exllama, llama.cpp), as well as support for model-serving frameworks like vLLM and HF TGI, or just OpenAI.
You can also define your preferred embedding model along with various other parameters, but I've found the out-of-the-box defaults to be pretty sane and usable.
Then:

  make ingest /path/to/folder/with/files

Then chat with the LLM.
Done.
Docs: https://docs.privategpt.dev/overview/welcome/quickstart
Cheshire Cat [0] looks promising. It's a framework for building AI assistants by providing them with documents that they store as "memories" that can be retrieved later. I'm not sure how well it works yet, but it has an active community on Discord and seems to be developing rapidly.
The main perk over the cloud options is that you can point it at any language model, including fully local ones; my install pointed at my local Ollama running Mistral.
For the first (fine tuning) follow “AI Jason” on YouTube. He has some great tutorials.
For the second (RAG or similar), fire up a cloud VM with GPUs or use Ollama locally and read through the LlamaIndex docs on how to build a RAG pipeline.
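If you go the Ollama route, pointing LlamaIndex at it is only a few extra lines. Again a sketch against the llama_index 0.9-era API; it assumes Ollama is running with the Mistral model pulled, and a local embedding model installed via the llama-index local-model extras:

  # LlamaIndex RAG backed by a local Ollama model instead of OpenAI.
  from llama_index import SimpleDirectoryReader, VectorStoreIndex, ServiceContext
  from llama_index.llms import Ollama

  llm = Ollama(model="mistral")
  service_context = ServiceContext.from_defaults(
      llm=llm,
      embed_model="local",   # small local HF embedding model instead of OpenAI's
  )

  documents = SimpleDirectoryReader("./docs").load_data()
  index = VectorStoreIndex.from_documents(documents, service_context=service_context)
  print(index.as_query_engine().query("Summarize the onboarding process."))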
Other than OpenAI, there is the newly introduced "continued pre-training" in Amazon Bedrock, but I haven't tried it.
RAG: I think that's a fundamentally flawed concept, but RAGfluencers will disagree. ;-)
- https://notebooklm.google/ (US or VPN): uses the Gemini Pro model.
- poe.com: You need to "create a new bot", disable "Make bot publicly accessible", and then "add a knowledge source". This offers many models, although the best ones require a subscription.
I am currently working on a demo use case to generate documents, and I intend to feed in a few documents as samples, e.g. leasing documents. Such documents vary by leasing company, state, etc., so there is no single template.
I understand that I could create embeddings for each template and then use them to ask ChatGPT to generate documents where certain entities change. I've set up a basic project, but I'm stuck at the stage where I don't know how to tell ChatGPT that the provided documents are samples and that it needs to generate similar ones based on prompt engineering.
i.e. given a text, always return a certain set of fields, and for some keys there is a fixed set of possible enum values, etc. One-shot prompting does work, but I'm curious how others approach this when you have training data on hand.
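One way to pin down "always return these fields" is to push the schema into an OpenAI tool/function call, which constrains both the keys and the enum values. A sketch with the v1.x Python client; the field names and enums below are made up for illustration:

  # Force a fixed schema (keys + enums) out of a document via function calling.
  import json
  from openai import OpenAI

  client = OpenAI()
  lease_text = "...one of your sample leasing documents..."

  schema = {
      "name": "extract_lease_fields",
      "description": "Extract structured fields from a leasing document.",
      "parameters": {
          "type": "object",
          "properties": {
              "lessee_name": {"type": "string"},
              "state": {"type": "string"},
              "lease_term_months": {"type": "integer"},
              "payment_frequency": {"type": "string",
                                    "enum": ["monthly", "quarterly", "annual"]},
          },
          "required": ["lessee_name", "state", "lease_term_months", "payment_frequency"],
      },
  }

  resp = client.chat.completions.create(
      model="gpt-4-1106-preview",
      messages=[{"role": "user", "content": f"Extract the fields from:\n{lease_text}"}],
      tools=[{"type": "function", "function": schema}],
      tool_choice={"type": "function", "function": {"name": "extract_lease_fields"}},
  )
  fields = json.loads(resp.choices[0].message.tool_calls[0].function.arguments)
  print(fields)  # always the same keys, enum values constrained by the schema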
I built a vector database from the documents, and I query the questions against it, which is very fast. This is the RAG (retrieval-augmented generation) step others mentioned.
The results, which are literal but short extracts from the documents, are given to the model, which produces an answer. This is the slow part.
I used many of Langchain's tools to manage the whole process.
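The whole pipeline is only a handful of lines with LangChain's pre-0.1 API from around this time (imports have since moved into langchain_community and friends). Paths, model names, and the question below are placeholders, and DirectoryLoader needs the unstructured package for mixed file types:

  # Build the vector DB once, then answer questions from short literal extracts.
  from langchain.document_loaders import DirectoryLoader
  from langchain.text_splitter import RecursiveCharacterTextSplitter
  from langchain.embeddings import OpenAIEmbeddings
  from langchain.vectorstores import Chroma
  from langchain.chat_models import ChatOpenAI
  from langchain.chains import RetrievalQA

  docs = DirectoryLoader("./docs").load()
  chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

  vectordb = Chroma.from_documents(chunks, OpenAIEmbeddings())   # fast part: embed + index
  qa = RetrievalQA.from_chain_type(
      llm=ChatOpenAI(model="gpt-3.5-turbo"),                     # slow part: generation
      retriever=vectordb.as_retriever(search_kwargs={"k": 4}),
      return_source_documents=True,                              # keep the literal extracts
  )
  result = qa({"query": "What were the Q3 action items?"})
  print(result["result"])
  print(result["source_documents"])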
You can try it on Supawiki, with one of the featured wikis. Then, if you are ok with a solution that hosts your documents for you, you can upload them and use our solution.
I've tried out working with custom documents in two different ways for different types of data:
* Once using LlamaIndex + Chroma[0] to transcribe and then conversationally query video contents (using GPT 3.5 or 4 as the backing LLM).
* Once using GPT Plus, uploading long-form PDFs of my own fiction books to the GPT's knowledge base. I use this to help me remember character names and timelines (not always accurate, so results need to be treated with caution) and help brainstorm story or tech ideas for my world.
Both work for what I'm using them for. I feel like option one is more customizable and easier to tweak for the types of results I would want, if I have concrete requirements about what kind of output I'm looking for. Option two has a lower barrier to entry and is just a little lower effort (no need to run your own app).
For the next iteration, I'd like to try out AWS Bedrock and compare the workflow and results.
[0] https://www.daily.co/blog/search-your-video-content-library-...
I'm very happy with its results, even though the system is still young and a little bit janky. You can use it with either the GPT API or your local models through LiteLLM. (I'm running Ollama + dolphin-mixtral.)
We've built what we call an Unstructured Data Transformation Layer. Think about it like an assembly line from raw text to tables in your data warehouse.
We don't use Llamaindex, we have our own (proprietary) piece of tech that does this. We can and have been outputting gold-standard tables on top of a lot of different types of context (legal docs, call transcripts, customer reviews, chat logs, emails, blog posts, social media posts, etc...) and looking to expand to more interesting domains soon.
If anyone wants to hear more hit me up at tom [at] flexor [dot] ai (does this still work or are scrapers smart enough nowadays to just grep for this too lol)
The conventional approach is suitable for software engineers who may be less familiar with AI. The configurable approach is suitable for ML engineers who have sophisticated use cases and want to configure chunking, indexing, and retrieval strategies.
Would a five-year-old laptop work? Do you need a beefy GPU? Are you using some prepackaged software?
Based on this assumption, I have a couple of questions:
1. Is chunking the most significant factor in RAG quality?
2. If there are no limitations, would humans who are experts in that dataset be the best people to create chunks?
[0] https://blog.llamaindex.ai/running-mixtral-8x7-locally-with-...
Then create an index over the metadata of each doc, so that you can ask the RAG bot what it can answer questions about.
Another way to ensure it stays on-domain is to generate synthetic questions & check for similarity against user queries. There's a whole rabbit hole of query decomposition to avoid straying off topic as well.
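That synthetic-question check can be as simple as a cosine-similarity gate. A sketch using OpenAI embeddings and numpy; the example questions and the 0.80 threshold are arbitrary and would need tuning:

  # Embed pre-generated synthetic questions once; only answer user queries that
  # land close enough to one of them, otherwise treat the query as off-domain.
  import numpy as np
  from openai import OpenAI

  client = OpenAI()

  def embed(texts: list[str]) -> np.ndarray:
      resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
      return np.array([d.embedding for d in resp.data])

  synthetic_questions = [
      "What is the warranty period for product X?",
      "How do I file an expense report?",
  ]
  synth_vecs = embed(synthetic_questions)

  def on_domain(user_query: str, threshold: float = 0.80) -> bool:
      q = embed([user_query])[0]
      sims = synth_vecs @ q / (np.linalg.norm(synth_vecs, axis=1) * np.linalg.norm(q))
      return float(sims.max()) >= threshold

  print(on_domain("what's the warranty on product X?"))   # likely True
  print(on_domain("write me a poem about pirates"))        # likely False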
Instead you can extract text embeddings from your documents, put them in a vector DB, and then you have a super search. You can convert your search query to an embedding, search the DB and keep the e.g. 10 closest matches.
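That "super search" is a few lines with an off-the-shelf vector DB. A sketch with Chroma's in-memory client and its default local embedding model; the chunks and query are placeholders:

  # Add document chunks once, then pull the closest matches for any query and
  # hand them to the LLM as context.
  import chromadb

  client = chromadb.Client()  # in-memory; use a persistent client to keep the index
  collection = client.create_collection("docs")

  chunks = ["...chunk 1 of your documents...", "...chunk 2...", "...chunk 3..."]
  collection.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])

  hits = collection.query(query_texts=["how do refunds work?"],
                          n_results=min(10, collection.count()))
  for doc, dist in zip(hits["documents"][0], hits["distances"][0]):
      print(round(dist, 3), doc[:80])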
Early next year I’m preparing something similar for my team, so I’ll surely look into the useful links/recommendations posted by fellow HNers :-)
For instance, if I search for a specific product name but accidentally mistype it, the resulting encoded vector might not be similar to the vector for the correctly spelled product name?
Tried it this summer, and it kinda worked!
Well over 40 years ago I certainly wasn't working on language models with only 16 kilobytes of memory and a 1MHz microprocessor.
No "high-level" or human-readable language anyway.
OTOH I always attempted to use the electronics closer to the limit of what they could provide, compared to average, and once the resource limit was reached (which occurred fairly quickly with only 16K), the entire effort concentrated on maximizing the amount of machine learning that could be accomplished by code that fit in memory.
There was no distraction preparing for more powerful hardware to come, it wasn't going to be coming during the time period needed.
No real artificial intelligence evolved, and there was nothing "general" about it.
The idea was to select & collect the desired most useful inferences from the raw data and make them available to the operator's natural intelligence for all the high-level decision-making.
Definitely no room to have data in memory, since it would waste the space you need for more thoughtful code. You can't have the resulting factors building up in memory either so they had to go to external storage as they were generated. Naturally to be used later by completely different code which is geared to process the rudimentary findings in relation to new data, and present that to the operator in order to enhance their pattern recognition and decision-making efforts.
I could only imagine what it would be like if memory came in megabytes rather than merely the lowly kilobytes.
One thing I think might still be true today, whatever amount of memory you have, you should be able to handle so much raw data that it makes the amount of memory look insignificant.
OTOH, if you can't highly leverage a naturally intelligent operator without some huge resource requirements, you might not be on the right track when it comes to maximizing hardware utilization.
And then there's the concept of analog noise amplification. You really need to be careful that there is nothing wrong or unrelated in the fundamental data set you are using at the time. Starting with a raw input signal, each stage of amplification will increase the amount of noise proportionally, and depending on the number of layers of amplification, any noise can cascade into top prominence when the desired signal is unfortunately weak. But the same level of noise-in-place-of-signal does not go away when it is dwarfed by a strong signal, the noise is merely masked during the high-signal passages but remains a considerable component.
Now, when the raw data only gets one initial pass, anything that's missed the first time is lost forever. And if the missing nuance is something important, then when a strong need for it arises, anything related to that nuance would be unreliable, incorrect, false, or downright hallucinatory if the performance was advanced enough.
GIGO is still the name of the game so I would think if it's custom training you have to step up to the plate and take the good with the bad. You've got to laboriously handle all the training data yourself anyway, so might as well take the opportunity to seriously babysit that data thoroughly in advance like you would never be able to do if you were only dealing with somebody else's already trained model.
Plus you can't usually take an adequately huge data set and remove all the undesired artifacts in one pass. And if one pass results in a processed dataset that can no longer be considered very huge at all, you've probably lost too much valuable information, and you may still not have eliminated all the undesirable noise.
This is somewhat analogous to lossy digital data compression, but focused on retaining only the most prominent meaning that can be gleaned from the data. As opposed to plain compression which retains the most prominent data regardless of meaning.
The more intelligently the raw input data is handled, the more realistically intelligent you can expect the final outcome to turn out.
An under-appreciated benefit of RAG is the ability to have the LLM cite sources for its answers (which are in principle automatically/manually verifiable). You lose this citation ability when you finetune on your documents.
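A framework-free sketch of what that looks like: label each retrieved chunk with its source, ask the model to cite the labels, and print the source list next to the answer. The retrieval step and the example chunks here are stand-ins:

  # Citeable RAG answers: numbered, labeled context + an instruction to cite.
  from openai import OpenAI

  client = OpenAI()

  retrieved = [  # (source, chunk) pairs coming out of your retriever
      ("handbook.pdf p.12", "Employees accrue 1.5 vacation days per month..."),
      ("policy.md", "Unused vacation days expire at the end of the calendar year..."),
  ]
  context = "\n\n".join(f"[{i+1}] ({src}) {text}" for i, (src, text) in enumerate(retrieved))

  resp = client.chat.completions.create(
      model="gpt-4",
      messages=[
          {"role": "system",
           "content": "Answer only from the numbered sources and cite them like [1]."},
          {"role": "user",
           "content": f"{context}\n\nQuestion: How does vacation accrual work?"},
      ],
  )
  print(resp.choices[0].message.content)
  for i, (src, _) in enumerate(retrieved):
      print(f"[{i+1}] {src}")   # verifiable pointers back into the documents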
In Langroid (the multi-agent framework from ex-CMU/UW-Madison researchers) https://github.com/langroid/langroid we've implemented a number of RAG techniques: our DocChatAgent uses a combination of lexical and semantic retrieval, reranking, and relevance extraction to improve precision and recall: https://github.com/langroid/langroid/blob/main/langroid/agen... All the code is laid out clearly so it can be tweaked. We have companies using Langroid in production (e.g. for customer support); they especially like the RAG and multi-agent features.
One of the interesting techniques in Langroid is a numbering trick for the relevance extraction stage (having the LLM extract verbatim relevant parts of passages): instead of having the LLM "parrot" back the relevant portions, wasting time and tokens, we have it just spit out the relevant sentence numbers from a pre-annotated passage. We use a tool/function call for this, leveraging Langroid's task loop, which seamlessly handles tools as well as sub-task handoff.
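To illustrate the idea generically (this is not Langroid's actual code, just the numbering trick sketched with the plain OpenAI client; the sentences and query are made up):

  # Number the sentences, ask the model only for the relevant numbers, then map
  # them back locally -- verbatim extraction for a handful of output tokens.
  import json
  from openai import OpenAI

  client = OpenAI()

  sentences = [
      "The lease term is 36 months.",
      "The office kitchen is cleaned on Fridays.",
      "Early termination requires 90 days written notice.",
  ]
  numbered = "\n".join(f"({i+1}) {s}" for i, s in enumerate(sentences))

  resp = client.chat.completions.create(
      model="gpt-4-1106-preview",
      response_format={"type": "json_object"},
      messages=[{
          "role": "user",
          "content": f'{numbered}\n\nReturn a JSON object of the form {{"relevant": [...]}} '
                     f'with the numbers of the sentences relevant to: lease termination terms.',
      }],
  )
  keep = json.loads(resp.choices[0].message.content)["relevant"]
  print([sentences[i - 1] for i in keep])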
Many interesting RAG applications often require more than simple question-answering (e.g extract structured info, match doc against requirements etc) and these scenarios benefit immensely from having multiple agents (so you get separation of concerns, modularity and easier state management). Langroid simplifies this type of multi-agent setup with its unique conversational task loops, e.g https://langroid.github.io/langroid/examples/agent-tree/
Colab quick start that builds up to a 2-agent system for extracting structured info from a document:
https://colab.research.google.com/github/langroid/langroid/b...
Among many other things, Langroid also has full support for the OpenAI Assistants API, so you could use the "built-in" RAG from this API, which is convenient but a black box, i.e. you don't know what retrieval algorithm it is using, how it is filling the context, or how many tokens are consumed.
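For reference, the "built-in" route looked roughly like this with the beta Assistants API at the time (the "retrieval" tool and file_ids parameter). The API has since been revised, so treat this as a sketch, with the file name and question as placeholders:

  # Upload a file, attach it to an assistant with the retrieval tool, and ask a
  # question in a thread -- retrieval and context-filling happen server-side.
  import time
  from openai import OpenAI

  client = OpenAI()

  f = client.files.create(file=open("handbook.pdf", "rb"), purpose="assistants")
  assistant = client.beta.assistants.create(
      model="gpt-4-1106-preview",
      instructions="Answer questions using the attached documents.",
      tools=[{"type": "retrieval"}],
      file_ids=[f.id],
  )

  thread = client.beta.threads.create()
  client.beta.threads.messages.create(thread_id=thread.id, role="user",
                                      content="What is the vacation policy?")
  run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
  while run.status not in ("completed", "failed", "cancelled", "expired"):
      time.sleep(1)
      run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)
  print(client.beta.threads.messages.list(thread_id=thread.id).data[0].content)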