HACKER Q&A
📣 triyambakam

What's the best hardware to run small/medium models locally?


M2 MacBook? A certain graphics card? Running inference on CPU on my old Thinkpad isn't fun.


  👤 MPSimmons Accepted Answer ✓
I think there are a couple of basic questions that need to be answered before we can find a good solution:

1) What are you trying to do?

2) What's your budget?

Generically saying "run inference" doesn't narrow it down much; you can do that on your current ThinkPad if you pick a small enough model. If you want to run 7B, 13B, or 34B models for document or sentiment analysis, or whatever, then you can move on to the budget question.

When I was faced with this question, I bought the cheapest 4060 Ti with 16GB I could find. It does "okay". Here's an example run:

  Llama.generate: prefix-match hit
  
  llama_print_timings:        load time =     627.53 ms
  llama_print_timings:      sample time =     415.30 ms /   200 runs   (    2.08 ms per token,   481.58 tokens per second)
  llama_print_timings: prompt eval time =     162.12 ms /    62 tokens (    2.61 ms per token,   382.44 tokens per second)
  llama_print_timings:        eval time =    8587.32 ms /   199 runs   (   43.15 ms per token,    23.17 tokens per second)
  llama_print_timings:       total time =    9498.89 ms
  Output generated in 9.79 seconds (20.43 tokens/s, 200 tokens, context 63, seed 1836128893)

I'm using text-generation-webui to provide an OpenAI-compatible API. It's pretty easy to hit:

  import os
  import openai

  # Local text-generation-webui endpoint exposing an OpenAI-compatible API
  url = "http://localhost:7860/v1"
  openai_api_key = os.environ.get("OPENAI_API_KEY")
  client = openai.OpenAI(base_url=url, api_key=openai_api_key)

  result = client.chat.completions.create(
      model="wizardlm_wizardcoder-python-13b-v1.0",
      messages=[
          {"role": "system", "content": "You are a helpful AI agent. You are honest and truthful"},
          {"role": "user", "content": "What is the best approach when writing recursive functions?"},
      ],
  )
  print(result)
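
If you want the tokens as they're generated instead of waiting for the whole reply, the same client can stream them (a rough sketch; it assumes the webui's OpenAI-compatible endpoint supports streaming):

  # Reuses the `client` from above and prints tokens as they arrive
  stream = client.chat.completions.create(
      model="wizardlm_wizardcoder-python-13b-v1.0",
      messages=[{"role": "user", "content": "What is the best approach when writing recursive functions?"}],
      stream=True,
  )
  for chunk in stream:
      delta = chunk.choices[0].delta.content
      if delta:
          print(delta, end="", flush=True)
  print()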

But again, it just depends on what you want to do.

👤 zer00eyz
1. The GPU market is a mess! https://www.tweaktown.com/news/94394/amds-top-end-rdna-3-sal... Insiders who watch the prices and talk to VARs all say that the channels seem stuffed and that prices are holding back sales.

2. AMD: They may change the landscape in the coming months. And it looks like the US government restrictions on GPUs are going to impact prices in the server market in 2024.

3. The stacks are evolving quickly. What you buy today may be superseded by something tomorrow, which means you should have spent more or could have spent less.

If you want to play: RAM is what matters most. GPU RAM and system RAM (in that order). Get the best GPU you can (RAM-wise), underclock it, and then add system memory if you can. Once you have a test bed that works for you, renting/cloud is a way to scale and play with bigger toys until you have a better sense of what you want and/or need.


👤 gchadwick
Have you considered running on a cloud machine instead? You can rent machines on https://vast.ai/ for under $1 an hour that should work for small/medium models (I've mostly been playing with Stable Diffusion, so I don't know what you'd need for an LLM offhand).

Good GPUs and Apple hardware are pricey. Get a bit of automation set up with some cloud storage (e.g. Backblaze B2) and you can have a machine ready to run your personally fine-tuned model rapidly with a CLI command or two.

There will be a break-even point, of course, though a major advantage of renting is that you can move easily as the tech does. You don't want to sink a large amount of money into a GPU only to find the next hot new open model needs more memory than you've got.


👤 modeless
A gaming desktop PC with an Nvidia 3060 12GB or better. Upgrade the GPU first if you can afford it, prioritizing VRAM capacity and bandwidth. Nvidia GPU performance will blow any CPU, including the M3, out of the water, and the software ecosystem pretty much assumes you are using Nvidia. Laptop GPUs are not equivalent to the desktop ones with the same model number, so don't be fooled. 8x used 3090s is a popular configuration for people who have money and want to run the biggest models, but splitting models between GPUs requires extra work.
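
The "extra work" for splitting is mostly handled for you if you go through Hugging Face; a rough sketch with transformers + accelerate (the model ID is just a placeholder for whatever you actually run):

  # Shards one model across every visible GPU (and spills to CPU if needed).
  # Assumes: pip install transformers accelerate, plus enough combined VRAM.
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_id = "meta-llama/Llama-2-70b-chat-hf"  # placeholder; gated on HF
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(
      model_id,
      device_map="auto",          # accelerate decides which layers go where
      torch_dtype=torch.float16,
  )

  inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
  print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))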

Personally I have 1x 4090 because I like gaming too, but it isn't really a big improvement over 3090 for ML unless you have a specific use for FP8, because VRAM capacity and bandwidth are very similar.


👤 MandieD
A data point for you: 7B models at 5-bit quantization run quite comfortably under llama.cpp on the AMD Radeon RX 6700 XT, which has 12GB VRAM and was part of a lot of gaming PC builds around 2021-22.

I can’t give this as an unqualified recommendation - there are far more tools available for Nvidia GPUs - but from what I can see, larger VRAM is available on AMD GPUs at lower prices.


👤 KolenCh
(If you want a Mac,) Apple silicon has the advantage of unified memory, and with llama.cpp it can run those models locally and quickly. I’d say start with the largest model you want to run, load it in llama.cpp, which will tell you the amount of memory needed, and buy a Mac with at least that much memory that you can afford. If you have more budget, prioritize more memory, because you may want to be able to run larger models later.
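
If you want a rough number before you even download anything, a back-of-the-envelope estimate (weights only; the KV cache and context add more on top, so treat it as a floor) is just parameter count times bits per weight:

  def approx_weight_gb(params_billion: float, bits_per_weight: float) -> float:
      """Very rough weight memory in GB; ignores KV cache and runtime overhead."""
      return params_billion * bits_per_weight / 8

  print(f"{approx_weight_gb(34, 4):.1f} GB")  # 34B at 4-bit: ~17 GB
  print(f"{approx_weight_gb(70, 4):.1f} GB")  # 70B at 4-bit: ~35 GB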

If not a Mac, follow the other advice here and get an Nvidia GPU. In terms of the software ecosystem, Nvidia >> Apple >> AMD > Intel. (I think I got the ordering right, but the magnitude of the differences might be subjective.)


👤 oaththrowaway
4060Ti w/ 16GB VRAM or 3090 w/ 24 GB VRAM

Of course with those you'll also have to spend some money on a motherboard, RAM, SSD, PSU, CPU, etc.

I think the best bang for the buck is probably a Mac Studio with as much RAM as you can afford.

I bought an RTX A2000 (12GB VRAM), and it's fine for 7B models and some 13B models with 4 bit quantization, but I kind of regret not getting something with more VRAM.


👤 CJefferson
In my experience, an Nvidia card with the most memory you can get; that's more important than speed, as models are tending to get bigger, and streaming a model that doesn't fit in VRAM really hurts speed.

I don’t have any Mac experience.


👤 anonzzzies
Somewhat related: how do you run an uncensored model locally? I run llamafile ones (llamafile-server-0.1-llava-v1.5-7b-q4 and mistral-7b-instruct-v0.1-Q4_K_M-server) on my MacBook M1 and they run fine (fast enough for playing), but they both seem neutered quite a bit. It's hard to get them off the rails, and Mistral (the above one) actually barfs really quickly, just repeating the same letter (fffffff usually) where it should've said fuck. Now I'm not looking for something that writes porn or whatnot, but the online models are so PC it's getting on my nerves.

👤 jocaal
Nvidia GPUs are really your only choice. There is no framework as mature as CUDA, and Nvidia has been making the fastest hardware for decades. They know their stuff when it comes to architecture, so it's unlikely that the hot new thing will actually be able to compete.

👤 White_Wolf
I'm using a Duo 16 (2023) with a 4090 16GB and a Ryzen 9 7945HX (16c/32t). It can also use another 32GB of shared system RAM, which effectively makes it a 48GB 4090. It's quite a bit slower than a full desktop 4090, but it can load decent-sized models and works well.

Tested on both Linux (some things will need manual patching) and Windows. Works like a charm.


👤 rkagerer
Any links to setting up a ChatGPT-like experience that is entirely local, i.e. no connectivity to the web/cloud?

👤 l72
To add to this, I have a laptop with 32GB of RAM and am able to run some 7B models on CPU, but I'd like to work with some larger models. Are there any eGPUs that can aid in this?

👤 PeterStuer
A used RTX 3090 off eBay is the most interesting budget option by far.

If you have twice the cash, go for a new RTX 4090 for roughly twice the performance.

If you need more than 24GB of VRAM, you want to get comfortable with sharding across a few 3090s, or spend a lot more on a 48, 80, or 100GB card.

If you feel adventurous, you can go a non-Nvidia route, but expect a lot of friction and elbow grease, at least for now.


👤 jaimex2
Anything from Nvidia.

An RTX 3060 with 12GB VRAM if you're on a budget; dial up from there.

Steer away from Apple unless all you do is work from a laptop.


👤 RandomWorker
I’m running Mistral 7B on an 8GB M1 Mac, just barely. It's an ask-a-question, get-a-coffee type of thing. No idea how this works, as 32-bit floats require 4 bytes each, and with 7B parameters it would need to be swapping to the SSD.
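
(Presumably it works because the build being run is quantized rather than fp32; 4-bit weights shrink 7B parameters to a few GB, which squeezes into 8GB alongside macOS. Rough weights-only math:)

  # Assumes a 4-bit quantized 7B model, as llama.cpp-style builds commonly ship
  params = 7e9
  print(f"fp32:  {params * 4 / 1e9:.0f} GB")    # ~28 GB -- would indeed swap
  print(f"4-bit: {params * 0.5 / 1e9:.1f} GB")  # ~3.5 GB -- fits in unified memory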

If I had the cash I would go for a 24GB M2/M3 Pro. That would allow me to comfortably load the 7B model into RAM.


👤 calamari4065
I run a 13B Q4 Llama variant on my ten-year-old server with two Xeon E5-2670s, 128GB of RAM, and no GPU.

It runs at under 3 tokens per second. I usually just give it my prompt and go make a coffee or something. The server is in my basement, so you can barely hear the fans screaming at all.


👤 pogue
I don't want to derail the OP's question, but would the same kind of system used to run an LLM also be suitable for an image generator like Stable Diffusion, or does that work through different methods?

👤 zvr
If you're willing to wait a few days, remember that Intel Core Ultra processors (Meteor Lake) are supposed to be available on December 14th. The embedded NPU should make a difference.

👤 steve_adams_86
Somewhat related: I’ve got an M2 Max Mac Studio with 32GB of RAM. Is there anything interesting I can do with it in terms of ML? What’s the scene like on moderately powered equipment like this?

👤 catketch
M1 MacBooks are still available new in high-memory configs... I picked up a 64GB M1 Max for less than $2500. It's a good setup because of the shared CPU/GPU memory scheme.

👤 adastra22
MacBook, thanks to Apple's new MLX framework.

👤 egman_ekki
I’ve been running Orca 2 13B on an M1 Pro with 32GB of RAM using LM Studio with GPU acceleration, quite nicely.

https://huggingface.co/TheBloke/Orca-2-13B-GGUF


👤 mobiuscog
I was interested in Stable Diffusion / images, and also text generation.

I started playing with ComfyUI and Ollama.

An M1 Studio Ultra would generate a 'base' 512x512 image in around 6 seconds, and Ollama responses seemed easily 'quick enough' - faster than I could read.

On an i7-3930K, purely on CPU, a similar image would take around 2.5 minutes, and Ollama was painful, as I would be left waiting for the next word.

Then I switched to a 3080 Ti, which I hadn't been using for gaming as it got stupidly hot and I regretted having it. Suddenly it was redeemed.

On the 3080 Ti, the same images come out in less than a second, and Ollama generation is even faster. Sure, I'm limited to 7B models for text (the Mac could go much higher) and there will be limits on image size/complexity, but this thing is so much faster than I expected, and it hardly generates any heat/noise at the same time - completely different from gaming. This is all a simple install under Linux (Pop!_OS in this case).
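
For what it's worth, hitting Ollama from code is just a local HTTP call; a minimal sketch (assumes the server is running on its default port and you've already pulled a model, e.g. `ollama pull mistral`):

  import requests

  r = requests.post(
      "http://localhost:11434/api/generate",
      json={"model": "mistral", "prompt": "Why is the sky blue?", "stream": False},
  )
  print(r.json()["response"])  # full completion, since streaming is disabled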

tl;dr - A Linux PC with a high-end GPU is the best value by far unless you really need big models, in my experience.


👤 rickette
Llama models run fine on the M2/M3 MacBooks thanks to llama.cpp/GGML.
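
A rough sketch of driving that from Python via the llama-cpp-python bindings (the GGUF filename is a placeholder for whatever model you download; a Metal-enabled build offloads layers to the GPU):

  from llama_cpp import Llama

  # n_gpu_layers=-1 asks it to offload every layer to the GPU (Metal on Apple silicon)
  llm = Llama(model_path="./mistral-7b-instruct.Q4_K_M.gguf", n_gpu_layers=-1)
  out = llm("Q: Name the planets in the solar system. A:", max_tokens=128)
  print(out["choices"][0]["text"])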

👤 _vojam
You can try edgeimpulse.com; they support a lot of “small” hardware for running different models.

👤 rootusrootus
I started to put together a second machine to be good at inference, then decided to just make my daily driver capable enough. Ended up upgrading my laptop to an MBP with an M2 Max and 96GB. It runs even bigger models fairly well.

👤 f0000
Just run them on AWS

👤 chpatrick
I'm happy with my used 3090.

👤 xhdusux
M2? That's some cope.