3x RTX 4090, 1x Tesla A100
Lots of fine-tuning, attention visualisation, and evaluation of embeddings and of different embedding-generation methods. Not just LLMs, though I use them a lot; I train deep nets of many kinds.
Both for my day job (hedge fund) and my hobby project https://atomictessellator.com
It’s summer here in NZ and I have these in servers mounted in a freestanding server rack beside my desk, and it is very hot in here XD
- Apple M2 Max 64GB shared RAM
- Apple Metal (GPU), 8 threads
- 1152 iterations (3 epochs), batch size 6, trained over 3 hours 24 minutes
https://www.reddit.com/r/LocalLLaMA/comments/18ujt0n/using_g...
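For reference, a minimal sketch of what a training loop targeting Apple Metal via PyTorch's MPS backend looks like (the model, data shapes, and hyperparameters here are placeholders, not the actual run linked above):

    import torch
    from torch import nn, optim

    # Prefer the Metal (MPS) backend on Apple Silicon, fall back to CPU otherwise.
    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
    torch.set_num_threads(8)  # CPU-side threads; matches the "8 threads" above

    # Placeholder model; the real run fine-tuned a much larger network.
    model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2)).to(device)
    optimizer = optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()

    def train_step(batch_x, batch_y):
        optimizer.zero_grad()
        logits = model(batch_x.to(device))
        loss = loss_fn(logits, batch_y.to(device))
        loss.backward()
        optimizer.step()
        return loss.item()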
Did the math on how much using RunPod per day would cost, and bought this setup instead.
Using fully sharded data parallel (FSDP) and bfloat16, I can train a 7B-param model, albeit very slowly. But that's fine when I'm only going 2000 steps!
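Roughly, that setup looks like the sketch below: wrap the model in PyTorch FSDP with a bf16 mixed-precision policy and launch one process per GPU via torchrun. The model name and learning rate are illustrative assumptions, not the commenter's actual config:

    import os
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.distributed.fsdp import MixedPrecision
    from transformers import AutoModelForCausalLM

    # Launched with torchrun, so each process gets a LOCAL_RANK.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Keep params, gradient reductions, and buffers in bfloat16.
    bf16_policy = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )

    # Illustrative 7B model; FSDP shards its parameters across the GPUs.
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    model = FSDP(model, mixed_precision=bf16_policy, device_id=local_rank)

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)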
Even fine-tuning Mixtral takes 4x H100s for about 4 days, and that's a ~$200k server currently.
To fully train (not just fine-tune) even a small model, say Llama 2 7B, you need over 128 GiB of VRAM, so it's still multi-GPU territory, likely A100s or H100s.
This all depends on the settings you use; increase the batch size and you will see even more memory utilization.
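As a rough sanity check on that 128 GiB figure, here is the standard back-of-envelope estimate for optimizer-state memory when training with Adam in mixed precision (the per-parameter byte counts are the usual assumptions of bf16 weights/grads plus fp32 Adam state, not numbers stated above):

    # Back-of-envelope memory estimate for fully training a 7B model.
    params = 7e9

    bytes_per_param = (
        2 +  # bf16 weights
        2 +  # bf16 gradients
        4 +  # fp32 master weights
        4 +  # Adam first moment (fp32)
        4    # Adam second moment (fp32)
    )

    gib = params * bytes_per_param / 2**30
    print(f"~{gib:.0f} GiB before activations")  # ~104 GiB

That is already ~104 GiB before activations, KV buffers, or any batch-size headroom, which is how you end up past 128 GiB of total VRAM.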
I believe a lot of people see these models running locally and assume training is similar, but it isn't.