Running LLMs locally–what hardware should I get–M2 Ultra, PC, cloud GPU?

Question

I must make hundreds of queries, experiment with various model setups and prompts, and run LLMs. Here's what I have tried:- API solutions: I tried https://openrouter.ai to get access to llama-2-70b-chat models but it was so slow (high latency) that I gave up.- On my MacBook Pro with M1 Pro chip, I can only run models up to 34B, but the inference speed is not great.- The Mac Studio with M2 Ultra costs around $7000 after tax.> It's not upgradable but I think it's quite future-proof already with 192GB unified memory, no?> I won't be able to run games on it but I'm not much of a gamer anyway.> It weighs almost 8 pounds, meaning that I can carry it to work if I want to.> It's energy-efficient and doesn't make me hate electricity...> It's mostly compatible with llama.cpp, so no CUDA support (no exl2 or GPTQ).> I might want to finetune/train models in the future. Is it possible to do LORA/QLORA on Mac?- On the other hand, a PC:> Is upgradable, but the question is: at what cost? If I want to add more VRAM I'll have to buy GPUs that cost between $1000-$2000.> Draws so much power, esp. with multiple GPUs, so I'll have to keep it at work and SSH into it.> The case will be heavy and I can't just carry it to places.> I get to run games on it if I want.> But even with 2x4090s I get 48GB VRAM, way less than 192GB on the Mac.> I get full CUDA support for ML and finetuning.> More hassle to setup, configure, and maintain (esp. if I use Linux) compared to Mac which works OOTB.- I've also tried cloud GPUs but the costs quickly add up. A100s are basically gone, and the rest are so-so. Since I can't let the VM run 24/7, I have to configure the VM every single time I want to run something on GPU, which takes around 30-40 minutes (including downloading the 70B models...)I appreciate any comments you have about what I should do... Thanks!

pk-protect-ai · Accepted Answer

Is this for private/home use? Regardless, I was in a similar situation at the beginning of last year. However, I chose to upgrade my existing PC, which has 64GB RAM and an RTX 3060 TI, with an additional RTX A4000 16G. I bought this used one for about €600 from eBay.
Reasoning: This setup is sufficient for the inference of models up to 13B (quantized) and for QLoRA training of the 7B models (4bit q). It is also adequate for experimental training of smaller models like 300M. Thus, this is suitable for tweaking and experimentation on a very small scale.
When I need to train larger models, there is always the option of using the cloud. Lambda Labs offers economical prices for this. For training with huge datasets and 34B and larger models, you might want to consider DeepSpeed with ZERO3, which can significantly reduce the cost of training, even with Amazon or Azure instances.
Running 70B models will end up being costly. If you will not run this in production mode 24/7 with a profit turnover, then it does not make sense to invest in your own infrastructure.

PUSH_AX · Answer

When I did some quick math regarding buying hardware outright vs spinning up cloud versions, cloud seemed to win pretty handily.
I've used runpod a bit, they have serverless which is good for sporadic inference. For training LoRAs I used their regular cloud offering (which can be switched off), they have volumes so you shouldn't need to re-download models (although if they're stale for a long period of time I think they reclaim them).
It's certainly not perfect, their product is also not mature. But I got a lot done, and for not a lot of money.