HACKER Q&A
📣 cranberryturkey

How would I make an AI model that is trained on 2000 text documents?


I'd like to use Node.js to create an Ollama-compatible AI model that is trained on 2000+ texts I have.

The idea is to query an AI model that is knowledgeable about a particular subject matter.

I have no idea where to begin.

I've looked at RAG and embeddings, but they both appear to require you to send the context (i.e. the book content) with the query, which would be way too large for one query. So I'm thinking it'd be better to just train an entire AI model.


  👤 cdrini Accepted Answer ✓
Your two options are RAG or fine-tuning.

RAG (with a vector DB) I've found to be a little finicky; it requires a decent amount of preprocessing to split your data into reasonable chunks that vector-encode well. I also think that, due to the nature of how RAG works, it might not handle certain types of queries unless you allow for a more complicated back-and-forth of querying. But I've only toyed with it and built some proof-of-concepts, nothing production-ready. I found llama-index to be useful here; it lets you spin up a barebones RAG system with sensible defaults in like 20 lines of code. But of course for any real-world application you'll have to start making a bunch of mods. Would love feedback from folks who have used RAG in production systems -- was it difficult to split your documents up? Did you have trouble with it using irrelevant chunks from your vector DB? Etc.
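For reference, a barebones setup with the TypeScript package looks roughly like this. Treat it as a sketch only: the exact call signatures have shifted between llamaindex versions, file paths and the question are made up, and it defaults to OpenAI unless you configure a local model.

    // Rough sketch of the ~20-line llama-index setup (Node 18+, ESM).
    // The "llamaindex" npm package is the TypeScript port; exact signatures
    // differ between versions, and it calls OpenAI unless pointed at a local model.
    import fs from "node:fs/promises";
    import { Document, VectorStoreIndex } from "llamaindex";

    const text = await fs.readFile("./docs/some-book.txt", "utf-8");

    // Chunking, embedding, and the in-memory vector store all use defaults here.
    const index = await VectorStoreIndex.fromDocuments([new Document({ text })]);

    const queryEngine = index.asQueryEngine();
    const response = await queryEngine.query({ query: "What is this document about?" });
    console.log(response.toString());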

Fine-tuning is modifying an existing LLM to have knowledge of your data. It's been on my to-try list for a while, but it has a higher barrier to entry. There are good video tutorials on YouTube, though. It seems like this would allow a more in-depth understanding of your documents, making it more likely to answer complicated prompts. But would love feedback on that hypothesis from folks with experience using it!
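If you do go down that road, most of the work is turning your documents into question/answer pairs in whatever JSONL schema your training framework expects. A rough sketch of that preparation step, where the chat-style schema is just one common convention (not universal) and the example pair is made up:

    // Sketch: build a JSONL fine-tuning file from your documents.
    // The exact schema depends on the trainer you pick; this chat-style
    // layout is one common convention. The example pair is illustrative only.
    import fs from "node:fs";

    type Pair = { question: string; answer: string };

    // In practice you'd generate or hand-write several pairs per document.
    const pairs: Pair[] = [
      { question: "What does the manual say about X?", answer: "It says ..." },
    ];

    const jsonl = pairs
      .map((p) =>
        JSON.stringify({
          messages: [
            { role: "user", content: p.question },
            { role: "assistant", content: p.answer },
          ],
        })
      )
      .join("\n");

    fs.writeFileSync("train.jsonl", jsonl);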


👤 asimpleusecase
If you build a RAG vector database with your 2000 documents, future queries will be compared to the content of that database and the closest matches will be given to an LLM as source content for its reply to you. If you were trying to build a game element or a simple responder with a very narrow purpose, you might get by with one of the small 7B open-source LLMs and host the whole thing on your personal computer.

👤 muzani
To my understanding, RAG is perfect for this.

Models are more for completion. Think of it like autocomplete. If you wanted a model to be good at storytelling, you'd train a model for that. Or say, writing Assembly code. It's like you write "Go to" and the completion model figures out the next word, which may be "jail", "Mexico" or "END".

Fine-tuning is a way to bias the completion towards something. In general, it's better to fine-tune a general model like Llama or GPT-4 than to train one from scratch.

Embedding models decide which words are related to one another. So you might say cat and dog are near each other. Or cat and gato. But cat and "go to" are far from each other. Where encoding turns letters and numbers into bits, embeddings turn words, phrases, images, and sounds into vectors.
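To make that concrete, here's roughly what it looks like against a local Ollama instance (assuming Node 18+ and that you've pulled an embedding model such as nomic-embed-text; the model name is just an example):

    // Sketch (Node 18+ for global fetch and top-level await).
    // Assumes a local Ollama server and `ollama pull nomic-embed-text`.
    async function embed(text: string): Promise<number[]> {
      const res = await fetch("http://localhost:11434/api/embeddings", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
      });
      return (await res.json()).embedding;
    }

    // Cosine similarity: higher = closer in meaning.
    function cosine(a: number[], b: number[]): number {
      let dot = 0, na = 0, nb = 0;
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
      }
      return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    const [cat, dog, goTo] = await Promise.all(["cat", "dog", "go to"].map(embed));
    console.log("cat vs dog:  ", cosine(cat, dog));  // expect: relatively high
    console.log("cat vs go to:", cosine(cat, goTo)); // expect: noticeably lower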

Since vectors are a little different from bits, they're stored in vector DBs. Vector DBs are often a pain in the ass to deal with, and embedding is super cheap, so tutorials often have RAG re-embed the entire book each time. This is not good practice.

RAG is really a fancy term meaning query, then generate based on that query. So tutorials would embed a million words, toss that in memory, query the memory, then throw it out. That's... wasteful. But not as wasteful as training a model. You should store it in a vector DB, then query that DB.
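In code, that "store, then query" loop is just: embed the question, pull the closest chunks, and hand them to the model. Reusing the embed/cosine helpers from the sketch above (the model name is a placeholder, and the in-memory sort stands in for what would be a vector-DB query):

    // Sketch of the retrieval + generation step. Assumes `chunks` were
    // embedded once up front and kept around (in a real setup, in a vector DB).
    type Chunk = { text: string; embedding: number[] };

    async function answer(question: string, chunks: Chunk[]): Promise<string> {
      const q = await embed(question);
      const top = chunks
        .map((c) => ({ text: c.text, score: cosine(q, c.embedding) }))
        .sort((a, b) => b.score - a.score)
        .slice(0, 5); // keep only the most relevant chunks

      const prompt =
        "Answer using only the context below.\n\n" +
        top.map((c) => c.text).join("\n---\n") +
        "\n\nQuestion: " + question;

      const res = await fetch("http://localhost:11434/api/generate", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ model: "llama3", prompt, stream: false }),
      });
      return (await res.json()).response;
    }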

If it's for a few uses, just use LangChain for RAG. If it's for over 100 queries, you want to convert your text into embeddings and put them in a vector DB. If it's small and static, LanceDB is fine. Or pgvector (Supabase supports this; rough sketch below).

If you want scale, there are plenty of others, but the price goes up fast. Zilliz and Qdrant seem to be good at higher levels, especially if the text is updated continuously.
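For the pgvector route, the store-and-query part looks roughly like this with node-postgres. Table name, vector dimension, and connection string are all assumptions, and the embed helper is the Ollama call from the earlier sketch:

    // Sketch using the "pg" client and pgvector. 768 dims matches
    // nomic-embed-text; adjust for whatever embedding model you use.
    import pg from "pg";

    const db = new pg.Client({ connectionString: process.env.DATABASE_URL });
    await db.connect();

    await db.query("CREATE EXTENSION IF NOT EXISTS vector");
    await db.query(
      "CREATE TABLE IF NOT EXISTS chunks (id serial PRIMARY KEY, text text, embedding vector(768))"
    );

    // pgvector accepts vectors serialized as '[0.1,0.2,...]' strings.
    const toSql = (e: number[]) => "[" + e.join(",") + "]";

    await db.query("INSERT INTO chunks (text, embedding) VALUES ($1, $2::vector)", [
      "some chunk of one of the 2000 documents",
      toSql(await embed("some chunk of one of the 2000 documents")),
    ]);

    // Retrieval: nearest neighbours by cosine distance (the <=> operator).
    const { rows } = await db.query(
      "SELECT text FROM chunks ORDER BY embedding <=> $1::vector LIMIT 5",
      [toSql(await embed("the user's question"))]
    );
    console.log(rows.map((r) => r.text));

    await db.end();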


👤 verdverm
RAG breaks your 2k docs down and then provides only the relevant parts at query/answer time. Current context lengths should be more than sufficient.

2000 texts is far too low a number to train from scratch. Fine-tuning is often used to refine a previously trained model; I'm not sure 2k is enough for this.

RAG is really the first method you should be implementing. LlamaIndex has the best examples that are easily repurposed (imho).


👤 malteg