1) What are you trying to do?
2) What's your budget?
Generically saying "run inference" is like... you can do that on your current ThinkPad if you pick a small enough model. If you want to run 7B, 13B, or 34B models for document or sentiment analysis, or whatever, then you can move on to the budget question.
When I was faced with this question, I bought the cheapest 4060 Ti with 16GB I could find. It does "okay". Here's an example run:
Llama.generate: prefix-match hit
llama_print_timings: load time = 627.53 ms
llama_print_timings: sample time = 415.30 ms / 200 runs ( 2.08 ms per token, 481.58 tokens per second)
llama_print_timings: prompt eval time = 162.12 ms / 62 tokens ( 2.61 ms per token, 382.44 tokens per second)
llama_print_timings: eval time = 8587.32 ms / 199 runs ( 43.15 ms per token, 23.17 tokens per second)
llama_print_timings: total time = 9498.89 ms
Output generated in 9.79 seconds (20.43 tokens/s, 200 tokens, context 63, seed 1836128893)
I'm using text-generation-webui to provide the OpenAI-compatible API interface. It's pretty easy to hit:

import os
import openai

# Point the OpenAI client at the local text-generation-webui endpoint.
url = "http://localhost:7860/v1"
openai_api_key = os.environ.get("OPENAI_API_KEY")

client = openai.OpenAI(base_url=url, api_key=openai_api_key)
result = client.chat.completions.create(
    model="wizardlm_wizardcoder-python-13b-v1.0",
    messages=[
        {"role": "system", "content": "You are a helpful AI agent. You are honest and truthful."},
        {"role": "user", "content": "What is the best approach when writing recursive functions?"},
    ],
)
print(result)
But again, it just depends on what you want to do.
2. AMD: they may change the landscape in the coming months. And it looks like the US government restrictions on GPUs are going to affect prices in the server market in 2024.
3. The stacks are evolving quickly. What you buy today may be superseded tomorrow by something that means you should have spent more, or could have spent less.
If you want to play: RAM is what matters most, GPU RAM and system RAM (in that order). Get the best GPU you can (RAM-wise), underclock it, and then add system memory if you can. Once you have a test bed that works for you, renting/cloud is a way to scale and play with bigger toys until you have a better sense of what you want and/or need.
Good GPUs and Apple hardware are pricey. Get a bit of automation set up with some cloud storage (e.g. Backblaze B2) and you can rapidly have a machine ready to run your personally fine-tuned model with a CLI command or two (see the sketch below).
There will be a break-even point, of course, though a major advantage of renting is that you can easily move as the tech does. You don't want to sink a large amount of money into a GPU only to find that the next hot open model needs more memory than you've got.
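Not this poster's exact setup, but here's a minimal sketch of what that automation could look like, assuming you've installed the Backblaze b2 CLI and have a hypothetical bucket named my-llm-models holding your fine-tuned weights:

import subprocess
from pathlib import Path

# Hypothetical names: change the bucket path and local directory to your own setup.
BUCKET_PATH = "b2://my-llm-models/wizardcoder-13b"
LOCAL_DIR = Path.home() / "models" / "wizardcoder-13b"

def pull_model():
    """Sync the fine-tuned weights from Backblaze B2 down to the local or rented box."""
    LOCAL_DIR.mkdir(parents=True, exist_ok=True)
    # 'b2 sync' only transfers files that have changed, so re-runs are cheap.
    subprocess.run(["b2", "sync", BUCKET_PATH, str(LOCAL_DIR)], check=True)

if __name__ == "__main__":
    pull_model()
    print(f"Model ready in {LOCAL_DIR}; point your inference server at it.")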
Personally I have one 4090 because I like gaming too, but it isn't really a big improvement over a 3090 for ML unless you have a specific use for FP8, because VRAM capacity and bandwidth are very similar.
I can't quite recommend this route, since there are far more tools available for Nvidia GPUs, but from what I can see AMD GPUs offer more VRAM at lower prices.
If not a Mac, follow the other advice and get an Nvidia GPU. In terms of the software ecosystem, Nvidia >> Apple >> AMD > Intel. (I think I got the ordering right, but the magnitude of the differences might be subjective.)
Of course, with those you'll also have to spend some money on a motherboard, RAM, SSD, PSU, CPU, etc.
I think the best bang for the buck is probably a Mac Studio with as much RAM as you can afford.
I bought an RTX A2000 (12GB VRAM), and it's fine for 7B models and some 13B models with 4-bit quantization, but I kind of regret not getting something with more VRAM.
I don’t have any Mac experience.
Tested on both Linux (some things will need manual patching) and Windows. Works like a charm.
If you have twice the cash, go for a new RTX 4090 for roughly twice the performance.
If you need more than 24GB of VRAM, you want to get comfortable with sharding a model across a few 3090s (see the sketch below), or spend a lot more on a 48, 80, or 100 GB card.
If you feel adventurous, you can go a non-Nvidia route, but expect a lot of friction and elbow grease, at least for now.
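One common way to do the sharding mentioned above (not necessarily what this poster uses) is Hugging Face transformers with device_map="auto", which spreads the layers across whatever GPUs are visible; the model id here is just a placeholder:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model id; swap in whatever you actually run.
model_id = "meta-llama/Llama-2-70b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" lets accelerate split the layers across all visible GPUs
# (e.g. a couple of 3090s), spilling to CPU RAM if the weights still don't fit.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

prompt = "What is the best approach when writing recursive functions?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))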
An RTX 3060 with 12GB VRAM if you're on a budget; dial up from there.
Steer away from Apple unless all you do is work from a laptop.
If I had the cash I would go for a 24GB M2/M3 Pro. That would allow me to comfortably load a 7B model into RAM.
It runs at under 3 tokens per second. I usually just give it my prompt and go make a coffee or something. The server is in my basement; you can barely hear the fans screaming at all.
I started playing with ComfyUI and Ollama.
An M1 Ultra Mac Studio would generate a 'base' 512x512 image in around 6 seconds, and Ollama responses seemed easily 'quick enough'. Faster than I could read.
On an i7-3930K, CPU only, a similar image would take around 2.5 minutes, and Ollama was painful, as I would be sitting there waiting for the next word.
Then I switched to a 3080 Ti, which I hadn't been using for gaming as it got stupidly hot and I regretted having it. Suddenly it was redeemed.
On the 3080 Ti, the same images come out in less than a second, and Ollama generation is even faster. Sure, I'm limited to 7B models for text (the Mac could go much higher) and there will be limits on image size/complexity, but this thing is so much faster than I expected, and it hardly generates any heat/noise at the same time - completely different to gaming. This is all a simple install under Linux (Pop!_OS in this case).
tl;dr - A Linux PC with a high-end GPU is the best value by far unless you really need big models, in my experience.
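If you want to script against an Ollama setup like the one above, its local HTTP API is easy to hit. A minimal sketch (the model name is just an example):

import json
import urllib.request

# Ollama listens on localhost:11434 by default; the model name is an example.
payload = {
    "model": "llama2:7b",
    "prompt": "What is the best approach when writing recursive functions?",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])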