3x RTX 4090, 1x Tesla A100
Lots of fine-tuning, attention visualisation, and evaluation of embeddings and of different embedding-generation methods. Not just LLMs, though I use them a lot; I train deep nets of many kinds.
Both for my day job (hedge fund) and my hobby project https://atomictessellator.com
It’s summer here in NZ and I have these in servers mounted in a freestanding server rack beside my desk, and it is very hot in here XD
- Apple M2 Max 64GB shared RAM
- Apple Metal (GPU), 8 threads
- 1152 iterations (3 epochs), batch size 6, trained over 3 hours 24 minutes
https://www.reddit.com/r/LocalLLaMA/comments/18ujt0n/using_g...
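For reference, a minimal sketch of what a training loop targeting Apple Metal via PyTorch's MPS backend looks like (the model, data shapes, and hyperparameters here are placeholders, not the actual run linked above):

    import torch
    from torch import nn, optim

    # Prefer the Metal (MPS) backend on Apple Silicon, fall back to CPU otherwise.
    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
    torch.set_num_threads(8)  # CPU-side threads; matches the "8 threads" above

    # Placeholder model; the real run fine-tuned a much larger network.
    model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2)).to(device)
    optimizer = optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()

    def train_step(batch_x, batch_y):
        optimizer.zero_grad()
        logits = model(batch_x.to(device))
        loss = loss_fn(logits, batch_y.to(device))
        loss.backward()
        optimizer.step()
        return loss.item()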
Did the math on how much using RunPod per day would cost, and bought this setup instead.
Using fully sharded data parallel (FSDP) and bfloat16, I can train a 7B-param model, albeit very slowly. But that's fine when I'm only going 2000 steps!
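Roughly, that setup looks like the sketch below: wrap the model in PyTorch FSDP with a bf16 mixed-precision policy and launch one process per GPU via torchrun. The model name and learning rate are illustrative assumptions, not the commenter's actual config:

    import os
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.distributed.fsdp import MixedPrecision
    from transformers import AutoModelForCausalLM

    # Launched with torchrun, so each process gets a LOCAL_RANK.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Keep params, gradient reductions, and buffers in bfloat16.
    bf16_policy = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )

    # Illustrative 7B model; FSDP shards its parameters across the GPUs.
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    model = FSDP(model, mixed_precision=bf16_policy, device_id=local_rank)

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)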
Even fine-tuning Mixtral takes 4x H100s for about 4 days, and that's a ~$200k server currently.
To fully train (not just fine-tune) even a small model, say Llama 2 7B, you need over 128 GiB of VRAM, so it's still multi-GPU territory, likely A100s or H100s.
This all depends on the settings you use; increase the batch size and you will see even more memory utilization.
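As a rough sanity check on that 128 GiB figure, here is the standard back-of-envelope estimate for optimizer-state memory when training with Adam in mixed precision (the per-parameter byte counts are the usual assumptions of bf16 weights/grads plus fp32 Adam state, not numbers stated above):

    # Back-of-envelope memory estimate for fully training a 7B model.
    params = 7e9

    bytes_per_param = (
        2 +  # bf16 weights
        2 +  # bf16 gradients
        4 +  # fp32 master weights
        4 +  # Adam first moment (fp32)
        4    # Adam second moment (fp32)
    )

    gib = params * bytes_per_param / 2**30
    print(f"~{gib:.0f} GiB before activations")  # ~104 GiB

That is already ~104 GiB before activations, KV buffers, or any batch-size headroom, which is how you end up past 128 GiB of total VRAM.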
I believe a lot of people see these models running locally and assume training is similar, but it isn't.