HACKER Q&A
📣 holomorphiclabs

Most efficient way to fine-tune an LLM in 2024?


In April 2024, what is the most efficient way to fine-tune an LLM?

In particular we are trying to understand performance vs. cost trade-offs. We don't have a budget to train from scratch.

We are working with a proprietary data set on the order of 100M tokens and are looking to fine-tune a general purpose language model and also create task-specific models based on the same corpus.

Any help would be appreciated!


  👤 dhouston Accepted Answer ✓
QLoRA + axolotl + a good foundation model (Llama/Mistral/etc., usually instruction fine-tuned) + RunPod works great.

A single A100 or H100 with 80GB of VRAM can fine-tune 70B open models (and obviously scaling out to many nodes/GPUs is faster, or you can use much cheaper GPUs for fine-tuning smaller models).

The localllama Reddit sub at https://www.reddit.com/r/LocalLLaMA/ is also an awesome community for the GPU poor :)
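For context, that stack boils down to something like the following minimal QLoRA setup under the hood. This is a sketch using Hugging Face transformers/peft/bitsandbytes; the model ID and hyperparameters are illustrative, not prescribed by the comment above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "mistralai/Mistral-7B-v0.1"  # any strong open base/instruct model

# 4-bit NF4 quantization is what makes QLoRA fit on a single GPU.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# The low-rank adapters are the only trainable parameters.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of total weights
```

Tools like axolotl wrap this pattern (plus data loading and training loops) behind a config file.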


👤 gardnr
You probably want to build a retrieval-augmented generation (RAG) pipeline.

If you do end up wanting to fine-tune, use QLoRA with axolotl or unsloth to prove your hypothesis on a smaller model, then evaluate whether you want the marginal gains from full-precision training.

After you fine-tune on the 100M-token dataset, use DPO (direct preference optimization) to polish it off. You need to create a preference dataset for that, but it can be relatively small and still give some great gains.
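For a rough illustration of what the DPO step involves, here is a sketch using TRL's DPOTrainer; the model, dataset rows, and hyperparameters are placeholders, and exact keyword arguments vary across TRL versions.

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

# Placeholder model; in practice this would be your QLoRA-tuned checkpoint.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token

# A DPO dataset is just (prompt, chosen, rejected) triples; a few thousand
# carefully chosen pairs often go a long way.
pairs = Dataset.from_dict({
    "prompt":   ["Summarize: ..."],
    "chosen":   ["A concise, faithful summary."],
    "rejected": ["A rambling, inaccurate summary."],
})

trainer = DPOTrainer(
    model,
    ref_model=None,              # TRL derives a frozen reference copy
    beta=0.1,                    # strength of the preference constraint
    args=TrainingArguments(
        output_dir="dpo-out",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=5e-6,
        remove_unused_columns=False,
    ),
    train_dataset=pairs,
    tokenizer=tokenizer,
)
trainer.train()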

After that, look at applying grammars during inference if you are expecting structured results like JSON.
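For example, with a GGUF model you can constrain decoding to valid JSON using llama.cpp's GBNF grammars. This is a sketch via llama-cpp-python; the model path and grammar are placeholders.

```python
from llama_cpp import Llama, LlamaGrammar

# Tiny illustrative GBNF grammar forcing a JSON object with one "label" field.
GRAMMAR = r'''
root   ::= "{" ws "\"label\"" ws ":" ws string ws "}"
string ::= "\"" [A-Za-z_ ]* "\""
ws     ::= [ \t\n]*
'''

llm = Llama(model_path="model-q4_k_m.gguf", n_ctx=4096)  # path is a placeholder
out = llm(
    "Classify the sentiment of: 'great product'. Respond as JSON.",
    grammar=LlamaGrammar.from_string(GRAMMAR),
    max_tokens=64,
)
print(out["choices"][0]["text"])
```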

You should be able to run the experiments on 4090s from vast.ai, RunPod, or a similar service.

It can cost less than $100 depending on your requirements.


👤 dosshell
I know this is maybe not the answer you want, but if you are just interested in getting the job done, there are companies that specialize in this, for example:

https://fortune.com/2024/03/11/adaptive-startup-funding-falc...


👤 luke-stanley
For my ChillTranslator project I spent maybe a few dollars fine-tuning Phi-2 to generate less spicy variations of inflammatory Hacker News comments, with very little data (especially compared to your 100M tokens), just to see how well it worked. I'll improve it when I have time. I mostly followed the Brev fine-tuning tutorial, but I wanted a 2 GB GGUF-quantised model I could run on any device with a specific JSON grammar. It uses Transformers PEFT and QLoRA. I haven't tried Axolotl or OpenPipe yet, but I hope to. Actual compute time was probably much less than the time I spent: I wasted time dealing with drivers, figuring out how to merge the fine-tuned weights, serialising to old-fashioned Pickle rather than safetensors, converting to GGUF, quantising it, and rsyncing it.
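For anyone hitting the same merge-and-convert wall, the adapter-merging step is roughly this. A sketch with transformers/peft; the base model and paths are placeholders, and the llama.cpp commands in the comments are approximate and vary by version.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

# Load the trained LoRA adapter on top of the base weights, then fold it in.
merged = PeftModel.from_pretrained(base, "path/to/adapter").merge_and_unload()
merged.save_pretrained("phi2-merged")        # full merged checkpoint
tokenizer.save_pretrained("phi2-merged")

# From here, llama.cpp's convert + quantize tools produce the GGUF file, e.g.:
#   python convert-hf-to-gguf.py phi2-merged --outfile phi2.gguf
#   ./quantize phi2.gguf phi2-q4_k_m.gguf q4_k_m
```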

👤 danielhanchen
A bit late, but Unsloth makes LoRA / QLoRA finetuning 2x faster and reduces VRAM by 80% with 0% degradation in accuracy! (no approximations are done!)

Mistral 7b is 2x faster than HuggingFace + Flash Attention 2. Gemma 7b is 2.4x faster than HF + FA2.

Check out https://github.com/unslothai/unsloth for full benchmarks!
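A minimal sketch of that API, assuming the model name, sequence length, and LoRA settings from Unsloth's example notebooks (they are illustrative, not requirements):

```python
from unsloth import FastLanguageModel

# 4-bit load with Unsloth's fused kernels.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Attach LoRA adapters; only these weights are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
# `model` then plugs into a standard TRL SFTTrainer / Trainer loop.
```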


👤 jasonjmcghee
The approach I see used is axolotl with QLoRA on cloud GPUs, which can be quite cheap.

https://github.com/OpenAccess-AI-Collective/axolotl

Someone from one of the cloud GPU vendors wrote a guide: https://brev.dev/blog/fine-tuning-mistral


👤 stanbiryukov
I recommend reviewing Stanford's DSPy library [1] - it has great examples of few-shot learning that works by generating and tuning prompts for LLMs, and even of distilling instruction-following tasks down to smaller models like T5. Second, as others mentioned, use QLoRA for supervised fine-tuning followed by DPO/KTO for preference optimization; this strategy put Hugging Face's Zephyr and IBM's Neural Chat on the leaderboards for 7B-parameter models. I also recommend the Unsloth library [2], which has excellent accelerated examples of these methods, along with the axolotl library [3]. Lastly, SkyPilot [4] and Modal [5] both have excellent examples showcasing how to use axolotl to efficiently finetune models on cloud GPUs.

[1] https://github.com/stanfordnlp/dspy
[2] https://github.com/unslothai/unsloth
[3] https://github.com/OpenAccess-AI-Collective/axolotl
[4] https://github.com/skypilot-org/skypilot
[5] https://github.com/modal-labs/llm-finetuning
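A rough sketch of the DSPy part: bootstrap few-shot demonstrations for a task over your corpus. The LM client, signature, trainset, and metric below are placeholders, and API details differ between DSPy versions.

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Assumes an OpenAI-compatible endpoint; swap in any LM client DSPy supports.
lm = dspy.OpenAI(model="gpt-3.5-turbo", max_tokens=256)
dspy.settings.configure(lm=lm)

class AnswerFromDocs(dspy.Signature):
    """Answer a question about the proprietary corpus."""
    question = dspy.InputField()
    answer = dspy.OutputField()

program = dspy.ChainOfThought(AnswerFromDocs)

# A handful of labeled examples is enough for the optimizer to bootstrap demos.
trainset = [dspy.Example(question="...", answer="...").with_inputs("question")]

# Placeholder metric: crude substring match between gold and predicted answers.
optimizer = BootstrapFewShot(
    metric=lambda ex, pred, trace=None: ex.answer.lower() in pred.answer.lower()
)
compiled = optimizer.compile(program, trainset=trainset)
```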

👤 HarHarVeryFunny
A possible alternative to fine-tuning is in-context learning, especially if you are using a model with long context where you can provide a lot of examples. Models can do one/few-shot learning, but in-context learning improves the more examples you give. You could experiment cheaply with Claude Haiku to see if this works for you.
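A cheap way to try that with Haiku, sketched with the anthropic Python SDK; the few-shot example messages are placeholders drawn from your own corpus.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Few-shot examples go straight into the message history; more examples
# generally improve in-context performance, up to the context limit.
examples = [
    {"role": "user", "content": "Input: <example 1 from your corpus>"},
    {"role": "assistant", "content": "<desired output 1>"},
    {"role": "user", "content": "Input: <example 2 from your corpus>"},
    {"role": "assistant", "content": "<desired output 2>"},
]

resp = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=512,
    system="Follow the format shown in the examples.",
    messages=examples + [{"role": "user", "content": "Input: <new case>"}],
)
print(resp.content[0].text)
```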

👤 magdyks
Fine-tuning a LoRA-based adapter with a tool like predibase.com is really fast. If you want to go fully open source and have your own hardware, you can do the same thing yourself with a Ludwig + LoRAX stack.
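For the open-source route, a Ludwig fine-tune is roughly a declarative config plus one train call. This is a sketch; the field names follow Ludwig's LLM fine-tuning docs and may differ across versions, and the data is a placeholder.

```python
import pandas as pd
from ludwig.api import LudwigModel

# Illustrative config: 4-bit quantized base model with a LoRA adapter.
config = {
    "model_type": "llm",
    "base_model": "mistralai/Mistral-7B-v0.1",
    "adapter": {"type": "lora"},
    "quantization": {"bits": 4},
    "input_features": [{"name": "prompt", "type": "text"}],
    "output_features": [{"name": "completion", "type": "text"}],
    "trainer": {"type": "finetune", "epochs": 1, "batch_size": 1},
}

df = pd.DataFrame({"prompt": ["..."], "completion": ["..."]})
LudwigModel(config).train(dataset=df)
```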

👤 tdba
What's your measure of performance?

There's no one-size-fits-all answer yet, but if you just want to test it out, there are many commercial offerings where you should be able to get some results for under $10k.


👤 objektif
Apologies if this is off topic, but could anyone please point me to a resource on best practices for implementing RAG with proprietary LLMs like GPT?

👤 netdur
I understand the methods for addressing the fine-tuning and RAG issues but lack the time, and possibly the technical skills, to implement the solution. Fine-tuning can potentially dumb down a perfectly good model, and RAG has context limitations and may not cover all the content. My thinking: we should vectorize the text and embed these vectors into all layers of the model at inference time. This would bypass the context-size limitations and the resource waste of fine-tuning, since vectorization is fast. I believe this vectorization-and-embedding strategy is the solution.

👤 Redster
What LLM are you hoping to use? Have you considered HelixML? If I'm reading you right, the primary concern is compute cost, not human time?

👤 dvt
I think you may be misunderstanding what fine-tuning does. It does not teach the model new knowledge. In fact, Meta has a paper arguing that you only need a dataset of about 1,000 examples [1] to achieve pretty good alignment (fine-tuning) results. (100M tokens is way overkill.) For knowledge retrieval, you need RAG (usually using the context window).

[1] https://arxiv.org/pdf/2305.11206.pdf


👤 viksit
If I understand the problem correctly - you'd like to feed xMM documents directly into an LLM so that it uses this context to "reason" answers to questions, vs. offloading the retrieval to a vector DB and merely assembling results into an "answer"?

and since your dataset is large, the longest context windows are insufficient.


👤 xianshou
Single-GPU, optimal efficiency: unsloth + qlora + mistral-7b on runpod/vast/lambda

Blazing fast compared to out-of-the-box Transformers. Also make sure to use Flash Attention if you have A100s or better and a context length >= 2k.

Add FAISS (https://github.com/facebookresearch/faiss) if you need fast local RAG
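A minimal local-RAG index with FAISS looks something like this; the embedding model and document chunks are placeholders.

```python
import faiss
from sentence_transformers import SentenceTransformer

# Any local embedding model works; this is a common lightweight choice.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["chunk one of the corpus...", "chunk two of the corpus..."]

# Normalized embeddings + inner product index == cosine similarity search.
vecs = embedder.encode(docs, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(vecs.shape[1])
index.add(vecs)

query = embedder.encode(["what does the corpus say about X?"],
                        normalize_embeddings=True).astype("float32")
scores, ids = index.search(query, 2)   # top-2 nearest chunks
print([docs[i] for i in ids[0]])
```

The retrieved chunks then get pasted into the fine-tuned model's prompt as context.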


👤 alxgt
Interested

👤 FezzikTheGiant
I was just gonna ask this question and saw this at the top of Ask. Interested.