I'm an IB diploma candidate (in HS), and writing a research paper is an important part of that curriculum. I am hoping to write my paper on how an LLM's training data impacts its output, comparing a model trained on, say, Wikipedia with one trained on Reddit.
I have access to some reasonably powerful Nvidia GPUs and plenty of time to train.
I'm fairly decent at "technology," as wide of an umbrella as that is -- I use Linux, have messed with things like koboldcpp, etc. -- but my programming abilities are weak; all I've done is 6.00.1x (intro to python) through edX.
Does this seem like a reasonable project? I know the results will be bad, but will they be enough to measure differences in some way?
Consider a finetune - it's faster and relatively cheap (like, under $30 in rented compute time). The link above lists the steps, but broadly they are: gather a dataset, do the training, and evaluate your results. LLMs lend themselves to instruction and evaluation, so it's easy to show results, measure perplexity, and compare against the base model.
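Perplexity, for what it's worth, is just the exponential of the average per-token negative log-likelihood, so it's easy to compute once you have a model's token probabilities. A toy sketch with hand-picked probabilities (no real model involved, the numbers are made up for illustration):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the observed tokens)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Probabilities a hypothetical model assigned to each correct next token.
confident = [0.9, 0.8, 0.95, 0.85]   # model that fits the text well
uncertain = [0.2, 0.1, 0.3, 0.25]    # model that fits poorly

print(perplexity(confident))  # close to 1: low perplexity, good fit
print(perplexity(uncertain))  # several times higher: poor fit
```

Lower is better; a model that assigned probability 1.0 to every correct token would have perplexity exactly 1. Libraries will compute this for you from the cross-entropy loss, but it's worth knowing what the number means.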
If you're interested in building a limited dataset, fun ideas might be quotes or conversations from your classmates, lessons or syllabi from your program, or other specific, local, testable information. Datasets aren't plug and play, and they're the most important part of a model.
However, even using the same dataset can yield different results based on training parameters. I'd keep it simple and either make the test about the impact of differences in training parameters using a single dataset, or pick two already created datasets and train using the same parameters for comparison.
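One way to keep that second kind of comparison controlled is to hold every training parameter fixed and vary only the dataset. A minimal sketch (the parameter names and file names here are illustrative, not from any particular framework):

```python
# Shared hyperparameters for both runs -- only the dataset differs.
base_config = {
    "learning_rate": 3e-4,
    "batch_size": 16,
    "epochs": 3,
    "seed": 42,  # fix the random seed so each run is repeatable
}

run_a = {**base_config, "dataset": "wikipedia_subset.txt"}
run_b = {**base_config, "dataset": "reddit_subset.txt"}

# Sanity check before training: the runs agree on everything except the data.
diff = {k for k in run_a if run_a[k] != run_b[k]}
print(diff)  # {'dataset'}
```

Writing the configs down like this (and checking the diff) makes it easy to show in your paper that the only independent variable was the training data.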
Good luck in IB! I was in it until I moved cities, and it was a blast.
Simon (simonw on HN) writes really approachable blog posts on LLMs.
Karpathy's nanoGPT project[2] is handy if you want to dip your toe into training.
Fine-tuning is very doable. The hard part is making a novel dataset with input/output pairs. You might consider just combining datasets you find on HuggingFace as an experiment.
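Those input/output pairs are usually stored as JSON Lines: one JSON object per line, one pair per object. The exact field names vary by framework ("instruction"/"output" is a common convention, but check what your training code expects); the examples below are made up:

```python
import json

# Hypothetical instruction/output pairs -- your dataset would have hundreds.
pairs = [
    {"instruction": "What room is biology taught in?", "output": "Room 204."},
    {"instruction": "Who teaches IB Chemistry?", "output": "Ms. Rivera."},
]

# Write one JSON object per line (the JSONL format).
with open("dataset.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")

# Reading it back: each line parses as an independent JSON object.
with open("dataset.jsonl") as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded))  # 2
```

The same format works whether you write the pairs by hand or merge existing datasets together.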
replicate.com has a dead simple fine tuning API.
Predibase is also an easy-to-use option. But again, for something custom you need a dataset with hundreds of examples. Normally people use GPT-4 to generate the dataset, as long as OpenAI doesn't block them.
There are so many things you can use LLMs for. Training an LLM is a possible project, but why? Why not do a project where you show how to use an LLM to get interesting results?
Here's a proposal. Do you know the TclTutor? It's a fantastic interactive tutorial for the programming language Tcl:
http://www.msen.com/~clif/TclTutor.html
Why don't you do a project where you have an LLM create a similar tutorial for a different language, say, Python? And then "templatize" that, so you can quickly create tutorials for many other languages.
Not to mention, if you do such a project, you can publish it on GitHub, and then your resume becomes immensely stronger.
Disregarding the quality of the results, yes, you can train a small LM on any data. I don't know what the threshold is for usefulness and coherence of the final model.
LLMs trained on just Wikipedia or just Reddit are probably going to be very limited in capability, since there's not enough well-rounded data (especially in Wikipedia's case). Of course you'll find differences: Reddit contains more profanity, at the very least, so the Reddit-trained LLM is going to swear and use slang more. But beyond generating gibberish and comparing the gibberish, there doesn't seem to be much point to the exercise, unless that's a project you really want to do.
Without knowing how IB scores students' research papers, I can't comment on whether this is feasible for a reasonable grade. But as I said, unless you really want to do it, and can somehow measure the Reddit model understanding slang better and swearing more readily, I personally don't see much point, given that the results will likely, as you mentioned, be somewhat "bad".
The thing about bleeding edge research on LLMs is that nobody really knows what will happen unless you actually try it out.
FWIW, you generally don't have to do much proper "programming" to train models these days. There are many projects on GitHub with code to train SoTA models (which in turn are just hundreds or low thousands of lines of code). The main difficulties are getting the hardware, the OS, and the dependencies to work correctly; getting high-quality training data (which you don't have to do for your project); and tuning the hyperparameters (if you're concerned with performance).
So in terms of technical feasibility, yeah. But I'm kind of concerned that the most likely headline result would be that the Reddit model knows internet slang and swears more than the Wikipedia one, which doesn't seem to mesh well with a high school project :D
Didn't do too well. Make sure to focus on the theory and cite lots of papers. Don't focus on practicals (at least, that was the advice I got from my school).
A note in terms of the project goals:
Make sure to remember when interpreting your results that your findings will only apply to similarly sized models as what you trained.
So you'll have found the differences between using Reddit vs. Wikipedia for a 7B model (or whatever size you go with), and those results shouldn't be assumed to extrapolate to larger models.