HACKER Q&A
📣 Exorust

What local machines are people using to train LLMs?


How are people building local rigs to train LLMs?


  👤 malux85 Accepted Answer ✓
I don’t train LLMs from scratch, but I have:

3x 4090s, 1x Tesla A100

Lots of fine-tuning, attention visualisation, and evaluation of embeddings and different embedding-generation methods. Not just LLMs, though I use those a lot; deep nets of many kinds

Both for my day job (hedge fund) and my hobby project https://atomictessellator.com

It’s summer here in NZ and I have these in servers mounted in a freestanding server rack beside my desk, and it is very hot in here XD


👤 rgbrgb
Some people have been fine-tuning Mistral 7B and Phi-2 on their high-end Macs. Unified memory is a hell of a thing. The resulting model here is not spectacular, but as a proof of concept it's pretty exciting what you can get in 3.5 hours on a consumer machine.

- Apple M2 Max 64GB shared RAM

- Apple Metal (GPU), 8 threads

- 1152 iterations (3 epochs), batch size 6, trained over 3 hours 24 minutes

https://www.reddit.com/r/LocalLLaMA/comments/18ujt0n/using_g...
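For anyone curious, the stated numbers pin down the run's rough shape. A quick sketch (assuming no gradient accumulation, which the post doesn't mention):

```python
# Back-of-envelope from the run above: 1152 iterations over 3 epochs
# at batch size 6, finishing in 3 h 24 m.
iterations = 1152
epochs = 3
batch_size = 6

steps_per_epoch = iterations // epochs      # 384
examples = steps_per_epoch * batch_size     # ~2304 training examples
sec_per_step = (3 * 60 + 24) * 60 / iterations  # ~10.6 s per step

print(examples, round(sec_per_step, 1))
```

So on the order of a couple of thousand examples, at about ten seconds a step, which is what "slow but workable on a laptop" looks like.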


👤 buildbot
A self-built machine with dual 4090s, soon to be 3x. Watercooled for quieter operation.

Did the math on what using RunPod per day would cost, and bought this setup instead.

Using fully sharded data parallel (FSDP) and bfloat16, I can train a 7B-param model, very slowly. But that's fine when I'm only going 2,000 steps!
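A back-of-envelope sketch of why sharding matters here. The byte accounting below is my assumption, not the poster's stated config: bf16 weights and grads plus fp32 master weights and two fp32 Adam moments, a common mixed-precision layout.

```python
# Hypothetical mixed-precision Adam accounting, per parameter:
#   bf16 weights (2) + bf16 grads (2) + fp32 master weights (4)
#   + fp32 Adam m (4) + fp32 Adam v (4) = 16 bytes
params = 7e9
bytes_per_param = 2 + 2 + 4 + 4 + 4           # = 16
total_gib = params * bytes_per_param / 2**30  # ~104 GiB of state

for n_gpus in (2, 3):
    shard = total_gib / n_gpus
    print(f"{n_gpus} GPUs: ~{shard:.0f} GiB per GPU (vs 24 GiB on a 4090)")
```

Even sharded three ways, the optimizer state alone overshoots a 4090's 24 GiB, so the "very slowly" part presumably also involves CPU offload or activation checkpointing (both of which FSDP supports).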


👤 bearjaws
I doubt many people are using local setups for serious work.

Even fine-tuning Mixtral takes 4x H100s for 4 days, and that's a ~$200k server currently.

To fully train (not just fine-tune) even a small model, say Llama 2 7B, you need over 128 GiB of VRAM, so you're still in multi-GPU territory, likely A100s or H100s.

This all depends on the settings you use: increase the batch size and you will see even more memory utilization.
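A toy illustration of that batch-size effect. The layer count and hidden size below match Llama 2 7B, but the "10 activation tensors per layer" figure is an arbitrary assumption for the sketch, not a measured number:

```python
# Weights are a fixed cost, but activations kept for backprop grow
# linearly with batch size (and sequence length).
layers, hidden, seq = 32, 4096, 4096  # Llama-2-7B-ish dims
bytes_fp16 = 2
# Very rough: assume ~10 saved activation tensors of [seq, hidden]
# per layer per sample (a made-up figure for illustration).
per_sample_gib = layers * 10 * seq * hidden * bytes_fp16 / 2**30

for batch in (1, 4, 8):
    print(f"batch {batch}: ~{batch * per_sample_gib:.0f} GiB of activations")
```

Whatever the exact constant, the linear scaling is the point: doubling the batch roughly doubles activation memory, on top of the fixed weight and optimizer-state footprint.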

I believe a lot of people see these models running locally and assume training is similar, but it isn't.