HACKER Q&A
📣 divan

How do I train a custom LLM/ChatGPT on my own documents in Dec 2023?


There is a 5-month-old thread [1] on this, but it might already be outdated.

What is the best approach in Dec 2023 for feeding a custom set of documents to an LLM and getting non-hallucinating, decent results?

UPD: The question is generally about how to "teach" an LLM to answer questions using your set of documents (not necessarily training your own, so approaches like RAG count).

[1] https://news.ycombinator.com/item?id=36832572


  👤 ilaksh Accepted Answer ✓
You don't train on documents. There are many startups claiming that but they are deliberately using a misleading term because they know that's what people are searching for.

You still do RAG. Llamaindex is still the best option that I know of. Most of the startups that have working products are likely using llamaindex. All of the ones that say they are training on documents are actually using RAG.
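
For a sense of scale, a bare-bones LlamaIndex RAG pipeline is only a few lines. This is a rough sketch, not a drop-in solution; the directory path and question are placeholders, and import paths have shifted between LlamaIndex versions:

    # Minimal RAG sketch with LlamaIndex: index a folder of documents, then query it.
    from llama_index import SimpleDirectoryReader, VectorStoreIndex

    # Load every readable file under ./docs into Document objects.
    documents = SimpleDirectoryReader("docs").load_data()

    # Chunk, embed, and store the documents in an in-memory vector index.
    index = VectorStoreIndex.from_documents(documents)

    # At query time, the most relevant chunks are retrieved and stuffed into the prompt.
    query_engine = index.as_query_engine()
    print(query_engine.query("What does the contract say about termination?"))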

Test it out. If it really and truly doesn't work, search for a script that creates question-and-answer pairs automatically with GPT-4, then try using that for QLoRA. I have never heard of anyone successfully using that for a private document knowledge base, though; only for skills like math, reasoning, Python, etc. I think the issue is that you need a LOT of data, and it needs to repeat concepts or any facts you need it to learn many, many times in different supporting ways.

What absolutely does not work is trying to just feed a set of documents into fine-tuning. I have personally proven that dozens of times because I had a client who was determined to do it. He had been misled.

What it will do is learn the patterns that are in those documents.


👤 ankit219
I think the answer depends on how many documents you have. Think in terms of tokens (assuming 750-1000 tokens per page): if you have a good estimate of the number of pages you want to query, you can decide on the approach. Three popular approaches:

1. RAG: Most popular and works really well on smaller datasets. It is limited by the number of vectors/embeddings; a typical embedded chunk might be around 1,000 tokens. LlamaIndex did a lot of engineering on this and their techniques work pretty well. The problem with large datasets is almost always that users don't like writing long prompts/queries, so the answers come out more generic.

2. Finetuning + RAG: You can finetune a model on the expected outputs. If your dataset contains knowledge that is already on the open internet (blog posts, articles, anything non-proprietary), then finetuning works really well in combination with RAG, especially for large datasets. It may not work if you are dealing with proprietary knowledge that is hard to find on the open internet.

3. Continual pretraining: for very large datasets, and when the knowledge is proprietary. I talked to a firm with 70GB worth of data; no way a RAG pipeline would give them results. They are struggling to get LLMs to work for them. This needs a model that is trained on their data and then instruction tuning on top of that. Most likely you won't need to do this.


👤 benjaminwootton
AWS Bedrock is fairly easy. You can do it in 5 or 6 clicks.

You have to upload your documents to S3, create a "Knowledge Base", then sync your documents into a vector database like OpenSearch or Pinecone. You are then good to go via their playground or the AWS API.

I made a video here describing the process, check around 14 minutes in:

https://ensembleanalytics.io/blog/introducing-bedrock-knowle...

Bedrock is a decent product I think. All of the models in one place (apart from the big dogs from OpenAI) and a common API across them.
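
Once the Knowledge Base is synced, querying it from code is basically one call. A rough boto3 sketch; the knowledge base ID, region, and model ARN are placeholders you would replace with your own:

    # Query a Bedrock Knowledge Base (RAG) via the bedrock-agent-runtime API.
    import boto3

    client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

    response = client.retrieve_and_generate(
        input={"text": "What is our refund policy?"},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": "YOUR_KB_ID",  # placeholder
                "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-v2",
            },
        },
    )
    print(response["output"]["text"])  # generated answer grounded in retrieved chunks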


👤 ignoramous
Here's a (video) guide on fine-tuning Mistral 7B with QLoRA: https://www.harpercarroll.com/articles/ai/llm-finetune-own-d... / https://ghostarchive.org/varchive/kmkcNVvEz-k

Fine tuning does result in degradation of the overall model (https://twitter.com/xaiguydotagi/status/1737082280835703142) and so various RAG techniques may be desirable. As others have mentioned, LlamaIndex is a neat solution to build RAG pipelines: https://docs.llamaindex.ai/en/stable/optimizing/production_r...
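
For anyone curious what the QLoRA setup roughly looks like in code, here is a sketch using the transformers + peft + bitsandbytes stack. The hyperparameters are illustrative and the training loop is omitted:

    # Load Mistral 7B in 4-bit and attach LoRA adapters (QLoRA-style fine-tuning).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    model_id = "mistralai/Mistral-7B-v0.1"
    bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb, device_map="auto"
    )

    # Only the small LoRA adapter weights are trained; the 4-bit base model stays frozen.
    lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()
    # ...then train on your instruction pairs with the HF Trainer or TRL's SFTTrainer.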


👤 kkielhofner
As others have said, you want RAG.

The most feature complete implementation I've seen is h2ogpt[0] (not affiliated).

The code is kind of a mess (most of the logic is in an ~8000 line python file) but it supports ingestion of everything from YouTube videos to docx, pdf, etc - either offline or from the web interface. It uses langchain and a ton of additional open source libraries under the hood. It can run directly on Linux, via docker, or with one-click installers for Mac and Windows.

It has various model hosting implementations built in - transformers, exllama, llama.cpp as well as support for model serving frameworks like vLLM, HF TGI, etc or just OpenAI.

You can also define your preferred embedding model along with various other parameters, but I've found the out-of-the-box defaults to be pretty sane and usable.

[0] - https://github.com/h2oai/h2ogpt


👤 ukuina
PrivateGPT is one of the better-known examples, but most people are not aware that GPT4 Assistants handle RAG natively now: https://platform.openai.com/docs/assistants/overview
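
For the curious, the Assistants route is roughly this (openai Python v1.x, beta endpoints; the file name and model are placeholders, and the API surface may still change):

    # Built-in retrieval with the OpenAI Assistants API: upload a file, attach it, ask.
    from openai import OpenAI

    client = OpenAI()

    file = client.files.create(file=open("handbook.pdf", "rb"), purpose="assistants")
    assistant = client.beta.assistants.create(
        model="gpt-4-1106-preview",
        instructions="Answer questions using the attached document only.",
        tools=[{"type": "retrieval"}],
        file_ids=[file.id],
    )

    thread = client.beta.threads.create()
    client.beta.threads.messages.create(thread_id=thread.id, role="user",
                                        content="What is the vacation policy?")
    run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
    # Poll run.status until "completed", then read the assistant's reply from the thread.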

👤 bmgoau
Run https://github.com/imartinez/privateGPT

Then

make ingest /path/to/folder/with/files

Then chat to the LLM.

Done.

Docs: https://docs.privategpt.dev/overview/welcome/quickstart


👤 lolinder
I haven't personally tried this for anything serious yet, but to get the thread started:

Cheshire Cat [0] looks promising. It's a framework for building AI assistants: you provide it with documents that it stores as "memories" which can be retrieved later. I'm not sure how well it works yet, but it has an active community on Discord and seems to be developing rapidly.

The main perk over the cloud options is that you can point it at any language model, including fully local ones; my install points at my local Ollama running Mistral.

[0] https://github.com/cheshire-cat-ai/core


👤 monkeydust
Did this in the summer via RAG. One thing we realized is that pure vector-embedding retrieval doesn't work so well for docs with acronyms (which, let's face it, all businesses have). We created a hybrid solution using embeddings plus BM25, a traditional ranking function. This hybrid gave the best results.
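
A rough sketch of the idea, not our exact code, using rank_bm25 and sentence-transformers as stand-ins; the documents, model name, and blend weight are placeholders:

    # Hybrid retrieval: blend lexical BM25 scores with dense embedding similarity.
    import numpy as np
    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer

    docs = ["ACME Q3 SLA report ...", "CRM onboarding guide ...", "..."]  # placeholders

    bm25 = BM25Okapi([d.lower().split() for d in docs])      # handles acronyms/exact terms
    model = SentenceTransformer("all-MiniLM-L6-v2")
    doc_emb = model.encode(docs, normalize_embeddings=True)  # handles semantic similarity

    def hybrid_search(query, alpha=0.5, k=3):
        lexical = bm25.get_scores(query.lower().split())
        lexical = lexical / (lexical.max() + 1e-9)            # normalize to [0, 1]
        dense = doc_emb @ model.encode(query, normalize_embeddings=True)
        scores = alpha * lexical + (1 - alpha) * dense        # weighted blend
        return [docs[i] for i in np.argsort(scores)[::-1][:k]]

    print(hybrid_search("What does the SLA say about ACME uptime?"))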

👤 galacticaactual
Train on your own documents or analyze your own documents for answers? Very different things.

For the first (fine tuning) follow “AI Jason” on YouTube. He has some great tutorials.

For the second (RAG or similar), fire up a cloud VM with GPUs or use Ollama locally and read through the LlamaIndex docs on how to build a RAG pipeline.


👤 orenlindsey
You can pay for a ChatGPT account and upload your own documents. I didn't do this myself but my dad uploaded 6 years of sermon transcripts from our church. It sounds exactly like the pastor.

👤 ndr_
With OpenAI, you can first build question-and-answer pairs derived from your documents and use the OpenAI fine-tuning feature to build yourself a custom model. This method goes beyond just learning behavior, in that facts do get recalled. I have written about it here, with a toy demo use case: https://ndurner.github.io/training-own-model-finetuning Note that I have yet to use this in a real-world use case, and I would love to hear feedback.
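
The mechanics are roughly this (openai Python v1.x; the Q&A pairs and model name are placeholders, and how you generate and curate the pairs matters far more than the API calls):

    # Fine-tune an OpenAI model on Q&A pairs derived from your documents.
    import json
    from openai import OpenAI

    client = OpenAI()

    # Each training example is a short chat: a question about your docs plus the answer.
    pairs = [
        {"q": "What is the notice period in the lease?", "a": "Sixty days."},
        # ...many more pairs, ideally generated from the documents and reviewed by hand
    ]
    with open("train.jsonl", "w") as f:
        for p in pairs:
            f.write(json.dumps({"messages": [
                {"role": "user", "content": p["q"]},
                {"role": "assistant", "content": p["a"]},
            ]}) + "\n")

    training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
    job = client.fine_tuning.jobs.create(training_file=training_file.id,
                                         model="gpt-3.5-turbo")
    print(job.id)  # poll the job; the resulting fine-tuned model ID is used like any other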

Other than OpenAI, there is the newly introduced "continued pre-training" of Amazon Bedrock, but I haven't tried it.

RAG: I think that‘s a fundamentally flawed concept, but RAGfluencers will disagree. ;-)


👤 ohthehugemanate
NGL, I think this one has passed the point on the tech maturity curve where it makes sense to roll your own. I played with MS Office's Copilot builder the other day and it's amazing. Point it at a set of base URLs or uploaded files, public or behind authentication. In literal seconds you have a copilot that can be embedded anywhere, including messengers. I gave it the root of the Azure documentation, the root of the Red Hat documentation, and the root of the Ansible documentation, and it's excellent. It uses MS's open source LLM copilot framework, and you can swap out models for an open source one (instead of GPT) if you like.

👤 sophiebits
If you’re looking for something that is hosted for you, at Notion we launched a feature for this a few weeks ago and it works quite well in my experience. RAG is one of the techniques used. https://www.notion.so/blog/introducing-q-and-a

👤 CrypticShift
Here are ways to do it by simply adding files to an online interface. I mention them only because they are quite straightforward (and free) to set up.

- https://notebooklm.google/ (US or VPN): uses the "gemini pro" model.

- poe.com: You need to "create a new bot", disable "Make bot publicly accessible", and then "add a knowledge source". This offers many models, although the best ones require a subscription.


👤 vikbehal
Hello,

I am currently working on a demo use case to generate documents, and I intend to feed in a few documents as samples, e.g., leasing documents. Such documents vary by leasing company, state, etc., so there is no single template.

I do understand that I could create embeddings for each template and then use them to ask ChatGPT to generate documents where certain entities would change. I've set up a basic project, but I am stuck at the stage where I don't know how to tell ChatGPT that the provided documents are samples and that it needs to generate similar ones based on prompt engineering.


👤 kmkarim
Slightly off topic but is there recommended advice on how to tune / train not for document retrieval but for consistent JSON output with specific enums?

i.e., given a text, always return a certain set of fields, where for some keys there is a fixed set of possible enum values. One-shot prompting does work, but I'm curious how others approach this if you have training data on hand.
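
To make the question concrete, here is the kind of output contract I mean, expressed as an enum-constrained function-calling schema (field names and enum values are made up):

    # Constrain output to a fixed JSON schema with enums via OpenAI function calling.
    import json
    from openai import OpenAI

    client = OpenAI()

    tools = [{
        "type": "function",
        "function": {
            "name": "extract_ticket",
            "parameters": {
                "type": "object",
                "properties": {
                    "category": {"type": "string", "enum": ["billing", "bug", "feature_request"]},
                    "severity": {"type": "string", "enum": ["low", "medium", "high"]},
                    "summary": {"type": "string"},
                },
                "required": ["category", "severity", "summary"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user", "content": "The invoice page 500s every time I open it."}],
        tools=tools,
        tool_choice={"type": "function", "function": {"name": "extract_ticket"}},
    )
    print(json.loads(resp.choices[0].message.tool_calls[0].function.arguments))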


👤 ZunarJ5
I'm a fan of Khoj. Been using it for months. https://github.com/khoj-ai/khoj

👤 xnx
GPT-4 Turbo has a 128K (~300 pages) context window, which probably handles a lot of use cases that might previously have needed extra training/refinement.
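
In practice that approach is just prompt stuffing, something like the sketch below (the model name and file are placeholders; cost and latency grow with however much text you stuff in):

    # No RAG, no training: put the whole document in the context window.
    from openai import OpenAI

    client = OpenAI()
    document_text = open("handbook.txt").read()  # placeholder document

    resp = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {"role": "system", "content": "Answer only from the provided document."},
            {"role": "user", "content": f"Document:\n{document_text}\n\nQuestion: What is the PTO policy?"},
        ],
    )
    print(resp.choices[0].message.content)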

👤 BeetleB
Since no one has mentioned it so far: I did just this recently with txtai in a few lines of code.

https://neuml.github.io/txtai/


👤 quickthrower2
Easiest is OpenAI assistants api. Use the playground and it’s a no code experience.

👤 rplp
My approach was not to train the model on the documents, as others mentioned.

I built a vector database from the documents and query the questions against it, which is very fast. This is the RAG (retrieval-augmented generation) step others mentioned.

The results, which are short, literal extracts from the documents, are given to the model, which produces the answer. This is the slow part.

I used many of Langchain's tools to manage the whole process.
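
The shape of that pipeline in LangChain terms is roughly this (not my exact code, just a sketch; the loader, chunk sizes, and model are placeholders):

    # Vector-store RAG with LangChain: split, embed, retrieve, then generate.
    from langchain.document_loaders import DirectoryLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.vectorstores import Chroma
    from langchain.chat_models import ChatOpenAI
    from langchain.chains import RetrievalQA

    docs = DirectoryLoader("docs/").load()
    chunks = RecursiveCharacterTextSplitter(chunk_size=1000,
                                            chunk_overlap=100).split_documents(docs)

    # Fast part: embed the chunks once, then query the vector store per question.
    vectordb = Chroma.from_documents(chunks, OpenAIEmbeddings())

    # Slow part: the LLM reads the retrieved extracts and writes the answer.
    qa = RetrievalQA.from_chain_type(
        llm=ChatOpenAI(model_name="gpt-3.5-turbo"),
        retriever=vectordb.as_retriever(search_kwargs={"k": 4}),
    )
    print(qa.run("What does the handbook say about remote work?"))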

You can try it on Supawiki, with one of the featured wikis. Then, if you are ok with a solution that hosts your documents for you, you can upload them and use our solution.


👤 joeyrobert
Gpt4all is a local desktop app with a Python API that can be trained on your documents: https://gpt4all.io/

👤 _giorgio_
If you wanted a simpler task, like training a Mistral, Llama, etc. on your documents to act as a document completer, how would you proceed instead? Probably much easier. Thanks.

👤 drakonka
As mentioned above, I don't think you'd need to train your own model for this (or for most use cases of this, anyway). You'd use RAG.

I've tried out working with custom documents in two different ways for different types of data:

* Once using LlamaIndex + Chroma[0] to transcribe and then conversationally query video contents (using GPT 3.5 or 4 as the backing LLM).

* Once using GPT Plus, uploading long-form PDFs of my own fiction books to the GPT's knowledge base. I use this to help me remember character names and timelines (not always accurate, so results need to be treated with caution) and help brainstorm story or tech ideas for my world.

Both work for what I'm using them for. I feel like option one is more customizable and easier to tweak for the types of results I would want, if I have concrete requirements about what kind of output I'm looking for. Option two has a lower barrier to entry and is just a little lower effort (no need to run your own app).

For the next iteration, I'd like to try out AWS Bedrock and compare the workflow and results.

[0] https://www.daily.co/blog/search-your-video-content-library-...


👤 viraptor
So far the recommendations are mostly hosted, so here's one local: https://github.com/weaviate/Verba

I'm very happy with its results, even though the system is still young and a little bit janky. You can use it with either GPT API, or your local models through LiteLlm. (I'm running ollama + dolphin-mixtral)


👤 throwaway421967
A bit unrelated, but one could open any binary file as text. With enough training data, could an llm just learn the format?

👤 tomgs
We do just that at Flexor!

We've built what we call an Unstructured Data Transformation Layer. Think about it like an assembly line from raw text to tables in your data warehouse.

We don't use Llamaindex, we have our own (proprietary) piece of tech that does this. We can and have been outputting gold-standard tables on top of a lot of different types of context (legal docs, call transcripts, customer reviews, chat logs, emails, blog posts, social media posts, etc...) and looking to expand to more interesting domains soon.

If anyone wants to hear more hit me up at tom [at] flexor [dot] ai (does this still work or are scrapers smart enough nowadays to just grep for this too lol)


👤 novaRom
We have to add LLMs and MMMs (multi-modal models) to all standard Linux distributions. A service would index all local files, creating embeddings; these would be used to augment user prompts, and voila, we can search for anything with natural language.

👤 min76
I have a related question. I have a fair idea of the LLM ecosystem (thanks to this very nice blog post called Emerging Architectures for LLM Applications). The problem is there are way too many options for each component (e.g., too many vector store implementations, ingestion engines, etc.). What is the easiest way to get started, primarily for RAG on my own PDF files? Also, what is the best/easiest option for hosting? That blog lists Vercel, Streamlit, Steamship, and Modal. I know Vercel at a high level and found it very good. I am not well versed in JavaScript/TypeScript, though. I believe the best option for UI generation is to use one of their templates.

👤 staranjeet
You can use Embedchain [1] to connect various data sources and then get a RAG application running locally and in production very easily. Embedchain is an open source RAG framework that follows a conventional but configurable approach.

The conventional approach is suitable for software engineers who may be less familiar with AI. The configurable approach is suitable for ML engineers who have sophisticated use cases and want to configure chunking, indexing, and retrieval strategies.

[1]: https://github.com/embedchain/embedchain


👤 narag
I'd very much appreciate if someone could clarify what exactly is needed in terms of hardware and software to implement these suggestions.

Would a five-year-old laptop work? Do you need a beefy GPU? Are you using some prepackaged software?


👤 mrbonner
I'm curious about this as well, but my data is mostly (95%) numerical metrics. Is there a "RAG" mechanism for numerical data instead of text? My use case is data analysis and insight discovery, for example.

👤 ryanSrich
What would be nice is some type of box or device I connect to a computer. I then give it full access to the system. It trains itself on all of the data on that computer. The device is now a portable LLM.

👤 627467
I've been looking for the answer to this to get a chat interface to my Obsidian markdown notes (the whole vault, not just RAG over individual notes). Will be following these threads closely.

👤 gsharma
I have usually seen people recommend chunking by sentences, paragraphs, or some fixed number of characters. IMO, all of these are suggested because they are easy to write code for, but in reality the length of a meaningful chunk depends entirely on the data. The way we chunk an FAQ document vs. a PRD is different.

Based on this assumption, I have a couple of questions:

1. Is chunking the most significant factor in RAG quality?

2. If there are no limitations, would humans who are experts in that dataset be the best people to create chunks?


👤 tucared
Hi all, I was asking myself the same thing today and I followed a recent blog post from LlamaIndex[0] to create the following repo https://github.com/tucared/llm-file-explorer

[0] https://blog.llamaindex.ai/running-mixtral-8x7-locally-with-...


👤 iAkashPaul
A go-to method is to ingest different chunk sizes based on the document hierarchy and then use langchain with a bunch of retrievers depending on the doc type.

Then create an index of the metadata of each doc, so that you can ask the RAGbot what it can answer about.

Another way to ensure it stays on-domain is to generate synthetic questions & check for similarity against user queries. There's a whole rabbit hole of query decomposition to avoid straying off topic as well.
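
A bare-bones version of that on-domain check, with sentence-transformers as a stand-in for whatever embedding model you use (the questions and threshold are illustrative):

    # On-domain check: compare a user query against synthetic questions
    # pre-generated for your corpus; refuse or fall back if nothing is close.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    synthetic_questions = [
        "What is the refund window for annual plans?",
        "How do I rotate API keys?",
        # ...generated ahead of time by an LLM from your documents
    ]
    question_emb = model.encode(synthetic_questions, convert_to_tensor=True)

    def is_on_domain(user_query, threshold=0.5):
        query_emb = model.encode(user_query, convert_to_tensor=True)
        best = util.cos_sim(query_emb, question_emb).max().item()
        return best >= threshold

    print(is_on_domain("Can I get my money back on a yearly subscription?"))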


👤 liampulles
What is your use case? If you want to search your documents for relevant info and you want to avoid hallucination, you might avoid text generation altogether.

Instead you can extract text embeddings from your documents, put them in a vector DB, and then you have a super search. You can convert your search query to an embedding, search the DB and keep the e.g. 10 closest matches.
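
A minimal sketch of that generation-free search using Chroma as the vector DB (the collection name, texts, and query are placeholders):

    # Embedding search only: no LLM generation, so nothing to hallucinate.
    import chromadb

    client = chromadb.Client()
    collection = client.create_collection("docs")

    # Chroma embeds the documents with its default embedding function.
    collection.add(
        documents=["Invoices are due within 30 days.", "The VPN config lives in /etc/vpn."],
        ids=["doc1", "doc2"],
    )

    # The query is embedded the same way; you get back the closest passages verbatim.
    results = collection.query(query_texts=["when do invoices have to be paid?"], n_results=2)
    print(results["documents"][0])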


👤 NKosmatos
There was something similar about Retrieval Augmented Generation (RAG) recently on HN: https://news.ycombinator.com/item?id=38491251

Early next year I’m preparing something similar for my team, so I’ll surely look into the useful links/recommendations posted by fellow HNers :-)


👤 d7y
Try https://github.com/SecureAI-Tools/SecureAI-Tools -- it's an open-source application layer for Retrieval-Augmented Generation (RAG). It allows you to use any LLM -- you can use OpenAI APIs, or run models locally with Ollama.

👤 JacobiX
I have a question for which I haven't found a definitive answer yet: how can one effectively manage typos and out-of-vocabulary (OOV) words in RAG systems?

For instance, if I search for a specific product name but accidentally mistype it, the resulting encoded vector might not be similar to the vector for the correctly spelled product name.


👤 Igor_Wiwi
It's super easy, an example could be found here https://technoclub.bearblog.dev/creating-a-simple-ai-chat-wi...

👤 constantinum
Unstract - https://unstract.com/ They are a month away from launch (both open source and cloud). The team might be able to give you a quick demo on your specific requirements.

👤 csbartus
https://khoj.dev/

Tried this summer, and kinda worked!


👤 assafe
Hey, GPT Researcher shows exactly how to do that with RAG. See here https://github.com/assafelovic/gpt-researcher

👤 jrpt
What are you trying to do more specifically? You can use https://docalysis.com/ for most document RAG tasks.

👤 jerpint
If you’re looking for an open source RAG solution, try our library:

https://www.github.com/jerpint/Buster


👤 cf1241290841
Show HN from two weeks ago mentioned this. https://news.ycombinator.com/item?id=38587052

👤 soultrees
Are there any open source front ends out there? I know of AnythingLLM, but I'm hoping to plug my own home-built RAG system into a nice front end.

👤 tesdinger
You cannot get a non-hallucinating AI in 2023.

👤 yu3zhou4
And what’s the correct answer in December 2023 if one wants to narrow down only to tools and services provided on Azure?

👤 noiv
How do you do RAG with embeddings NOT in English? I mean, there are a few thousand more languages.

👤 newstio
How do I run a local LLM for RAG apps? The retrieval documents are Turkish, but I want to analyze these documents with an LLM and I don't have a Turkish local LLM. How do I solve this problem, aside from fine-tuning and training?

👤 elvyscruz
anything-llm looks pretty interesting and easy to use https://github.com/Mintplex-Labs/anything-llm

👤 fuzzfactor
I would think you really need to get into action almost non-stop, maybe a couple all-nighters, since 2024 is almost here ;)

Well over 40 years ago I certainly wasn't working on language models with only 16 kilobytes of memory and a 1MHz microprocessor.

No "high-level" or human-readable language anyway.

OTOH I always attempted to use the electronics further toward the limit of what it could provide, compared to average, and once the resource limit was reached (which occurred fairly quickly with only 16k) then the entire effort concentrated on maximizing the amount of machine learning that could be accomplished by code that fit in the memory.

There was no distraction preparing for more powerful hardware to come, it wasn't going to be coming during the time period needed.

No real artificial intelligence evolved, and there was nothing "general" about it.

The idea was to select & collect the desired most useful inferences from the raw data and make them available to the operator's natural intelligence for all the high-level decision-making.

Definitely no room to have data in memory, since it would waste the space you need for more thoughtful code. You can't have the resulting factors building up in memory either so they had to go to external storage as they were generated. Naturally to be used later by completely different code which is geared to process the rudimentary findings in relation to new data, and present that to the operator in order to enhance their pattern recognition and decision-making efforts.

I could only imagine what it would be like if memory came in megabytes rather than merely the lowly kilobytes.

One thing I think might still be true today, whatever amount of memory you have, you should be able to handle so much raw data that it makes the amount of memory look insignificant.

OTOH, if you can't highly leverage a naturally intelligent operator without some huge resource requirements, you might not be on the right track when it comes to maximizing hardware utilization.

And then there's the concept of analog noise amplification. You really need to be careful that there is nothing wrong or unrelated in the fundamental data set you are using at the time. Starting with a raw input signal, each stage of amplification will increase the amount of noise proportionally, and depending on the number of layers of amplification, any noise can cascade into top prominence when the desired signal is unfortunately weak. But the same level of noise-in-place-of-signal does not go away when it is dwarfed by a strong signal, the noise is merely masked during the high-signal passages but remains a considerable component.

Now when the raw data only gets one initial pass, anything that's missed the first time is lost forever. If the missing nuance is something important, then whenever there's a strong need for it, anything related to that nuance will be unreliable, incorrect, false, or downright hallucinatory if the performance is advanced enough.

GIGO is still the name of the game so I would think if it's custom training you have to step up to the plate and take the good with the bad. You've got to laboriously handle all the training data yourself anyway, so might as well take the opportunity to seriously babysit that data thoroughly in advance like you would never be able to do if you were only dealing with somebody else's already trained model.

Plus you can't usually take an adequately huge data set and in one pass remove all the undesired artifacts. And if one pass results in a processed dataset which can no longer be considered very huge at all, you've probably lost too much valuable information and you may still not have eliminated all the undesirable noise.

This is somewhat analogous to lossy digital data compression, but focused on retaining only the most prominent meaning that can be gleaned from the data. As opposed to plain compression which retains the most prominent data regardless of meaning.

The more intelligently the raw input data is handled, the more realistically intelligent you can expect the final outcome to turn out.


👤 GabrieleR
Try the privateGPT GitHub repo.

👤 d4rkp4ttern
Many services/platforms are careless/disingenuous when they claim they "train" on your documents, when they actually mean they do RAG.

An under-appreciated benefit of RAG is the ability to have the LLM cite sources for its answers (which are in principle automatically/manually verifiable). You lose this citation ability when you finetune on your documents.

In Langroid (the Multi-Agent framework from ex-CMU/UW-Madison researchers) https://github.com/langroid/langroid we’ve implemented a number of RAG techniques: our DocChatAgent uses a combination of lexical and semantic retrieval, reranking and relevance extraction to improve precision and recall: https://github.com/langroid/langroid/blob/main/langroid/agen... All the code is laid out clearly so can be tweaked. We have companies using Langroid in production (e.g. for customer-support); they especially like the RAG and multi-agent features.

One of the interesting techniques in Langroid is a numbering trick for the relevance extraction stage (having the LLM extract verbatim relevant parts of passages) — instead of having the LLM “parrot” out the relevant portions, thus wasting time and tokens, we have it just spit out the relevant sentence numbers from a pre-annotated passage. We use a tool/function-call for this, leveraging Langroid’s task loop that seamlessly works for tool handling as well as sub-task handoff.
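
A generic illustration of that numbering idea, outside of Langroid, using plain OpenAI function calling (the passage, question, and schema are made up; Langroid's actual implementation is in the repo linked above):

    # Numbering trick: number the sentences, then ask the model to return only the
    # relevant sentence numbers instead of parroting the text back.
    import json
    from openai import OpenAI

    client = OpenAI()

    sentences = [
        "The warranty lasts 24 months.",
        "Shipping is free over $50.",
        "Returns must be initiated within 30 days.",
    ]
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(sentences))

    tools = [{
        "type": "function",
        "function": {
            "name": "report_relevant",
            "parameters": {
                "type": "object",
                "properties": {
                    "sentence_numbers": {"type": "array", "items": {"type": "integer"}},
                },
                "required": ["sentence_numbers"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user", "content":
                   f"Question: How long is the warranty?\n\nPassage:\n{numbered}\n\n"
                   "Return the numbers of the relevant sentences."}],
        tools=tools,
        tool_choice={"type": "function", "function": {"name": "report_relevant"}},
    )
    args = json.loads(resp.choices[0].message.tool_calls[0].function.arguments)
    print([sentences[i] for i in args["sentence_numbers"]])  # verbatim extraction, few output tokens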

Many interesting RAG applications often require more than simple question-answering (e.g extract structured info, match doc against requirements etc) and these scenarios benefit immensely from having multiple agents (so you get separation of concerns, modularity and easier state management). Langroid simplifies this type of multi-agent setup with its unique conversational task loops, e.g https://langroid.github.io/langroid/examples/agent-tree/

Colab quick start that builds up to a 2-agent system for extracting structured info from a document:

https://colab.research.google.com/github/langroid/langroid/b...

Among many other things, Langroid also has full support for the OpenAI Assistants API, so you could use the “built-in” RAG from this API, which is a convenience but is a black box, I.e you don’t know what retrieval algo it is using, how it is filling context, and the tokens consumed.