I'm an IB diploma candidate (in HS), and writing a research paper is an important part of that curriculum. I am hoping to write my paper on how an LLM's training data impacts its output, comparing a model trained on, say, Wikipedia with one trained on Reddit.
I have access to some reasonably powerful Nvidia GPUs and plenty of time to train.
I'm fairly decent at "technology," as wide of an umbrella as that is -- I use Linux, have messed with things like koboldcpp, etc. -- but my programming abilities are weak; all I've done is 6.00.1x (intro to python) through edX.
Does this seem like a reasonable project? I know the results will be bad, but will they be enough to measure differences in some way?
Consider a finetune - it's faster and relatively cheap (like, under $30 in rented compute time). The link above lists the steps, but broadly they are: gather a dataset, do the training, and evaluate your results. LLMs lend themselves to instruction and evaluation, so it's easy to show results, measure perplexity, and compare against the base model.
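Perplexity, for what it's worth, is just the exponential of the average per-token negative log-likelihood, so it's easy to compute once you have a model's token probabilities. A toy sketch with hand-picked probabilities (no real model involved, the numbers are made up for illustration):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the observed tokens)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Probabilities a hypothetical model assigned to each correct next token.
confident = [0.9, 0.8, 0.95, 0.85]   # model that fits the text well
uncertain = [0.2, 0.1, 0.3, 0.25]    # model that fits poorly

print(perplexity(confident))  # close to 1: low perplexity, good fit
print(perplexity(uncertain))  # several times higher: poor fit
```

Lower is better; a model that assigned probability 1.0 to every correct token would have perplexity exactly 1. Libraries will compute this for you from the cross-entropy loss, but it's worth knowing what the number means.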
If you're interested in building a limited dataset, fun ideas might be quotes or conversations from your classmates, lessons or syllabi from your program, or other specific, local, testable information. Datasets aren't plug and play, and they're the most important part of a model.
However, even using the same dataset can yield different results based on training parameters. I'd keep it simple and either make the test about the impact of differences in training parameters using a single dataset, or pick two already created datasets and train using the same parameters for comparison.
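One way to keep that second kind of comparison controlled is to hold every training parameter fixed and vary only the dataset. A minimal sketch (the parameter names and file names here are illustrative, not from any particular framework):

```python
# Shared hyperparameters for both runs -- only the dataset differs.
base_config = {
    "learning_rate": 3e-4,
    "batch_size": 16,
    "epochs": 3,
    "seed": 42,  # fix the random seed so each run is repeatable
}

run_a = {**base_config, "dataset": "wikipedia_subset.txt"}
run_b = {**base_config, "dataset": "reddit_subset.txt"}

# Sanity check before training: the runs agree on everything except the data.
diff = {k for k in run_a if run_a[k] != run_b[k]}
print(diff)  # {'dataset'}
```

Writing the configs down like this (and checking the diff) makes it easy to show in your paper that the only independent variable was the training data.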
Good luck in IB! I was in it until I moved cities, and it was a blast.
Simon (simonw on HN) writes really approachable blog posts on LLMs.
Karpathy's nanoGPT project[2] is handy if you want to dip your toe into training.
Fine-tuning is very doable. The hard part is making a novel dataset with input/output pairs. You might consider just combining datasets you find on HuggingFace as an experiment.
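Those input/output pairs are usually stored as JSON Lines: one JSON object per line, one pair per object. The exact field names vary by framework ("instruction"/"output" is a common convention, but check what your training code expects); the examples below are made up:

```python
import json

# Hypothetical instruction/output pairs -- your dataset would have hundreds.
pairs = [
    {"instruction": "What room is biology taught in?", "output": "Room 204."},
    {"instruction": "Who teaches IB Chemistry?", "output": "Ms. Rivera."},
]

# Write one JSON object per line (the JSONL format).
with open("dataset.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")

# Reading it back: each line parses as an independent JSON object.
with open("dataset.jsonl") as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded))  # 2
```

The same format works whether you write the pairs by hand or merge existing datasets together.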
replicate.com has a dead simple fine tuning API.
Predibase is also an easy-to-use option. But again, for something custom you need a dataset with hundreds of examples. Normally people use GPT-4 to generate the dataset, as long as OpenAI doesn't block them.
There are so many things you can use LLMs for. Training an LLM is a possible project, but why? Why not do a project where you show how to use an LLM to get interesting results?
Here's a proposal. Do you know the TclTutor? It's a fantastic interactive tutorial for the programming language Tcl:
http://www.msen.com/~clif/TclTutor.html
Why don't you do a project where you have an LLM create a similar tutorial for a different language, say, Python? And then "templatize" that, so you can quickly create tutorials for many other languages.
Not to mention, if you do such a project, you can publish it on GitHub, and then your resume becomes immensely stronger.
Disregarding the quality of the results, yes, you can train a small LM on any data. I don't know what the threshold is for usefulness and coherence of the final model.
LLMs trained on just Wikipedia or just Reddit are probably going to be very limited in capability, since there's not enough well-rounded data (especially in Wikipedia's case). Of course you'll find differences: Reddit contains more profanity, at the very least, so the Reddit-trained LLM is going to swear and use slang more. But beyond generating gibberish and comparing the gibberish, there doesn't seem to be much point to the exercise, unless that's a project you really want to do.
Without knowing how IB scores students' research papers, I can't comment on whether this is feasible for a reasonable grade. But as I said, unless you really want to do it, and can somehow measure the Reddit model understanding slang better and swearing more readily, I personally don't see much point, given that the results will likely, as you mentioned, be somewhat "bad".
The thing about bleeding edge research on LLMs is that nobody really knows what will happen unless you actually try it out.
FWIW, you generally don't have to do much proper "programming" to train models these days. There are many projects on GitHub with code to train SoTA models (which in turn are just hundreds or low thousands of lines of code). The main difficulties are getting the hardware, the OS, and the dependencies to work correctly; getting high-quality training data (which you don't have to do for your project); and tuning the hyperparameters (if you're concerned with performance).
So in terms of technical feasibility, yeah. But I'm kind of concerned that the most likely headline result would be that the Reddit model knows internet slang and swears more than the Wikipedia one, which doesn't seem to mesh well with a high school project :D
Didn't do too well. Make sure to focus on the theory and cite lots of papers. Don't focus on practicals (at least, that was the advice I got from my school).
A note in terms of the project goals:
Make sure to remember when interpreting your results that your findings will only apply to similarly sized models as what you trained.
So you'll have found the differences between using Reddit vs. Wikipedia for a 7B model (or whatever size you go with), and those results shouldn't be assumed to extrapolate to larger models.