Which LLMs can run locally on most consumer computers
Are there any? I was thinking about LLM-based agents and games, and this will probably only be viable when most devices can handle LLMs running locally.
I've been curious as to when games would start implementing these new technologies, but I think they're simply too slow for now?
I think we're at least 10-15 years from being able to run low-latency agents that RAG themselves into the games they're part of, where there are hundreds of them: some of them NPCs, others controlling some game mechanic or checking whether the output from other agents is acceptable or needs to be run again.
At the moment a 16 GB MacBook Air can run Phi-3 Medium (14B), which is extremely impressive, but at roughly 7 tokens per second it's way too slow for any kind of gaming. You'd need around 100x the performance, and we're 5+ hardware generations away before I can see this happening.
Unless there's some other application?
See llamafile (https://github.com/Mozilla-Ocho/llamafile), a standalone packaging of llama.cpp that runs an LLM locally. It will use the GPU, but falls back on the CPU. CPU-only performance of small, quantized models is still pretty decent, and the page lists estimated memory requirements for currently popular models.
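Once a llamafile is running in server mode it also exposes an OpenAI-compatible HTTP endpoint, so you can script against it from anything. A minimal sketch in Python, assuming the default localhost:8080 port; the model name is just a placeholder:

```python
# Minimal sketch: query a llamafile running in server mode, which by default
# serves an OpenAI-compatible endpoint on localhost:8080. Port and model name
# are assumptions -- adjust to whatever your llamafile actually reports.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local",  # llamafile generally accepts any model name here
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```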
Maybe a dumb question, but I think anyone reading this question would know a good answer for me. If I have a big pile of PDFs and wanted to get an LLM to be really good at answering questions about what's in all those PDFs, would it be best for me to try running this locally? "Best" in this case means I want to get the best/smartest answers to my questions about these PDFs. They're all full-text PDFs, studies and results on a specific genetic condition that I'd like to understand better by asking it some smart questions.
Is there any validity to the idea of using a higher-level LLM to generate the initial data, and then copying that data to a lower-level LLM for actual use?
For example, another comment asked:
"If I have a big pile of PDFs and wanted to get an LLM to be really good at answering questions about what's in all those PDFs, would it be best for me to try running this locally?"
So what if you used a paid LLM to analyze these PDFs and create the data, but then moved that data to a weaker LLM in order to run question-answer sessions on it? The idea being that you don't need the better LLM at this point, as you've already extracted the data into a more efficient form.
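That two-stage idea is basically "distill once with a strong model, then do retrieval and answering with a cheap local one." A rough sketch of how it could look, assuming the `pypdf`, `openai`, and `ollama` packages, an OpenAI API key for the one-time extraction pass, and a local `llama3` pulled in Ollama; the model names, the `papers/` directory, and the naive keyword retrieval are all placeholders:

```python
# Stage 1 uses a paid model once to distill each PDF into dense notes;
# Stage 2 answers questions locally against those notes with a small model.
from pathlib import Path
from pypdf import PdfReader
from openai import OpenAI
import ollama

client = OpenAI()
notes = []  # distilled facts, one entry per PDF

# Stage 1 (run once, paid model): distill each PDF into terse factual notes.
for pdf in Path("papers").glob("*.pdf"):
    text = "\n".join(page.extract_text() or "" for page in PdfReader(pdf).pages)
    summary = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": "Extract the key findings as terse bullet points:\n\n"
                              + text[:30000]}],  # crude truncation to stay within context
    ).choices[0].message.content
    notes.append((pdf.name, summary))

# Stage 2 (run locally, free): answer questions against the distilled notes.
def ask(question: str) -> str:
    # Naive retrieval: keep the notes sharing the most words with the question.
    scored = sorted(notes, key=lambda n: -len(set(question.lower().split())
                                              & set(n[1].lower().split())))
    context = "\n\n".join(f"[{name}]\n{body}" for name, body in scored[:3])
    reply = ollama.chat(model="llama3", messages=[
        {"role": "system", "content": "Answer using only the provided notes."},
        {"role": "user", "content": f"Notes:\n{context}\n\nQuestion: {question}"},
    ])
    return reply["message"]["content"]

print(ask("Which genes are most consistently implicated?"))
```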
I was able to successfully run Llama 3 8B, Mistral 7B, Phi, and other 7B models using Ollama [1] on my M1 MacBook Air (minimal usage sketch after the link).
[1] https://ollama.com
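For anyone wanting to try the same setup, a minimal sketch with the `ollama` Python client; it assumes the Ollama server is running and `llama3` has been pulled, and the prompts are just examples:

```python
# Minimal sketch using the `ollama` Python client (pip install ollama).
import ollama

# One-shot chat call.
response = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "Give me one sentence about the M1 MacBook Air."}],
)
print(response["message"]["content"])

# Streaming keeps the UI responsive while tokens arrive.
for chunk in ollama.chat(model="llama3",
                         messages=[{"role": "user", "content": "Count to five."}],
                         stream=True):
    print(chunk["message"]["content"], end="", flush=True)
```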
The general rule of thumb is that required VRAM in GB roughly equals parameter count in billions (I'm generalizing over quantized GGUF finetunes here; a quick estimate sketch follows at the end of this comment):
8GB vram cards can run 7B models
16GB vram cards can run 13B models
24GB vram cards can run up to 33B models
Now to your question: what can most computers run? You need to look at the tiny but specialized models. I would think 3B models could be run reasonably well even on the CPU. IntelliJ has an absolutely microscopic <1B model that it uses for code completion locally. It's quite good and I don't notice any delay.
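To put rough numbers on that VRAM rule of thumb, a quick back-of-the-envelope sketch; the ~4.5 bits per weight and ~20% overhead figures are assumptions, not measurements:

```python
# Back-of-the-envelope VRAM estimate for a quantized model. The 4.5 bits/weight
# and ~20% overhead (KV cache, activations) figures are rough assumptions.
def approx_vram_gb(params_billions: float,
                   bits_per_weight: float = 4.5,
                   overhead: float = 1.2) -> float:
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

for n in (3, 7, 13, 33, 70):
    print(f"{n:>3}B -> ~{approx_vram_gb(n):.1f} GB")
# 7B -> ~4.7 GB, 13B -> ~8.8 GB, 33B -> ~22.3 GB: roughly matching the tiers above.
```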
I run Mistral 7B and Llama 3 locally using jan.ai on a 32 GB Dell laptop and get about 6 tokens per second with a context window of 8k. It's definitely usable if you're patient. I'm glad I also have a Hugging Face account, though.
Related question: what's the minimum GPU that's roughly equivalent to Microsoft's Copilot+ spec NPU?
I imagine that Copilot+ will become the target minimum spec for many local LLM products and that most local LLM vendors will use GPU instead of NPU if a good GPU is available.
“Caniuse” equivalent for LLMs depending on machine specs would be extremely useful!
Running them at the edge is definitely possible on most hardware, but not ideal by any means. You'll have to set latency and throughput expectations fairly low if you don't have a GPU to utilize. This is why I'd disagree with your statement on viability: it's really going to be most viable if you centralize the inference in a distributed cloud environment off-device.
Thankfully, between Llama 3 8B [1] and Mistral 7B [2] you have two really capable generic instruction models that many folks could run locally out of the box. And the base models are straightforward to finetune if you need different capabilities more specific to your game use cases.
CPU/system-memory offloading is an option with GGUF-based models, but it will hurt your latency and throughput significantly (rough offload sketch after the links below).
The quantized versions of the above models fit easily in many consumer-grade GPUs (4-5 GB for the weights themselves quantized at 4 bpw), but it really depends on how much of your VRAM overhead you want to dedicate to the model weights versus actually running your game.
[1] https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
[2] https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
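For the offloading point above, a rough sketch with llama-cpp-python showing partial GPU offload of a 4-bit GGUF; the model path and layer count are placeholders you'd tune against your actual VRAM budget:

```python
# Sketch of partial GPU offload with llama-cpp-python (pip install llama-cpp-python).
# Offload only takes effect if the package was built with CUDA/Metal support.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=20,  # keep some VRAM free for rendering; -1 offloads every layer
    n_ctx=2048,       # shorter context = smaller KV cache = less memory pressure
)

out = llm.create_chat_completion(
    messages=[{"role": "user",
               "content": "Greet the player as a grumpy blacksmith, one sentence."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```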
Check out Ollama; it's built to run models locally. Llama 3 8B runs great locally for me, while 70B is very slow. Plenty of options.
Quantized 6-8B models run well on consumer GPUs. My concern would be VRAM limits, given you'll likely be expecting the card to do compute _and_ graphics.
Without a GPU I think it will likely be a poor experience, but it won't be long until you'll have to go out of your way to buy consumer hardware that doesn't integrate some kind of TPU.
Quantized 4/5-bit 8B models with a medium-short context might be shippable. Still, it's going to require a nice GPU for all that RAM. Plus you would have to support AMD; I would experiment with llama.cpp, as it runs on many architectures.
Hope your game doesn’t have a big texture budget.
Seems like there is high potential for NPC text generation from LLMs, especially a model that is trained to produce NPC dialogue alongside discrete data that can be processed to correlate the content of the speech with the state of the game (rough sketch below). This is going to be a tough challenge with a lot of room for research and creative approaches to producing immersive experiences. Unfortunately, only single-player and cooperative experiences will be practical for the foreseeable future, since it's trivial to totally break the immersion with some prompt injection.
Even more than LLMs, I'm curious how transformers can be used to produce more convincing game AI in the areas where game AI is notoriously bad, like 4X games.
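A rough sketch of the dialogue-plus-discrete-data idea, using the `ollama` client's JSON output mode with a small local model; the schema fields, game-state keys, and `llama3` tag are made up for the example, and a real game would still validate the model's output:

```python
# Ask a local model to return JSON the game engine can act on, not just prose.
import json
import ollama

game_state = {"player_reputation": "low", "time_of_day": "night", "location": "tavern"}

resp = ollama.chat(
    model="llama3",
    format="json",  # ask Ollama to constrain the reply to valid JSON
    messages=[
        {"role": "system", "content":
            "You are an innkeeper NPC. Reply ONLY with JSON: "
            '{"dialogue": str, "mood": "friendly"|"wary"|"hostile", "quest_hint": str|null}'},
        {"role": "user", "content":
            f"Game state: {json.dumps(game_state)}. The player asks for a room."},
    ],
)
npc = json.loads(resp["message"]["content"])
print(npc.get("dialogue"), "| mood:", npc.get("mood"))
```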
Gemma 2B and Phi-3 3B, if you run them at Q4 quantization. I wouldn't bother with anything larger than 4B parameters; you're just not going to be able to reliably expect an end-user to run that size of model on a phone yet.
I assume the question is rather which LLM can cover most of the tasks while delivering decent quality. I would prefer an architecture using different LLMs for different tasks, more like 'specialists' instead of simple 'agents'. I usually take the main task, divide it into smaller tasks, and see what I can use to solve each one. Sometimes a rule-based approach is already enough for a sub-task, and an LLM would be not only overkill but also more difficult to implement and maintain.
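A toy sketch of that 'specialists' routing idea: try a cheap rule first and only fall back to a small local model for free-form sub-tasks. The intent patterns and the `llama3` tag are assumptions for illustration:

```python
# Route each sub-task to the cheapest specialist that can handle it.
import re
import ollama

def handle(request: str) -> str:
    # Rule-based specialist: trivial, deterministic sub-tasks need no LLM at all.
    if re.fullmatch(r"\s*open (the )?(door|chest|gate)\s*", request, re.IGNORECASE):
        return "ACTION: open_object"
    if re.fullmatch(r"\s*(hi|hello|hey)[!.]?\s*", request, re.IGNORECASE):
        return "Greetings, traveller."
    # LLM specialist: free-form dialogue falls through to a small local model.
    reply = ollama.chat(model="llama3", messages=[
        {"role": "system", "content": "You are a terse village guard."},
        {"role": "user", "content": request},
    ])
    return reply["message"]["content"]

print(handle("open the door"))
print(handle("What happened to the old mill?"))
```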
I imagine you would have to solve some tricky scheduling issues to run an LLM on the GPU while it's also busy rendering the game. Frames need to be rendered at a more or less consistent rate no matter what, but the LLM would likely have erratic, spiky GPU utilisation depending on what the agents are doing, so you would have to throttle the LLM execution very carefully. Probably doable but I don't think there's any existing framework support for that.
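One piece you can handle today is keeping generation off the main loop. A rough sketch that streams tokens from a worker thread and drains only a couple per frame; it only decouples the game loop from token arrival and does not solve GPU contention, which you'd still have to manage by capping batch size or running the model on CPU/iGPU. The frame loop and `llama3` tag are stand-ins for a real engine and whatever model you'd actually ship:

```python
# Worker thread streams tokens into a queue; the render loop drains a bounded
# number per frame so slow or bursty generation never blocks rendering.
import queue
import threading
import time
import ollama

tokens: "queue.Queue[str]" = queue.Queue()

def generate(prompt: str) -> None:
    for chunk in ollama.chat(model="llama3", stream=True,
                             messages=[{"role": "user", "content": prompt}]):
        tokens.put(chunk["message"]["content"])

threading.Thread(target=generate, args=("Describe the haunted forest.",), daemon=True).start()

dialogue = ""
for frame in range(600):        # pretend 10 seconds at 60 fps
    for _ in range(2):          # consume at most 2 tokens per frame
        try:
            dialogue += tokens.get_nowait()
        except queue.Empty:
            break
    # ... render the frame here ...
    time.sleep(1 / 60)
print(dialogue)
```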
You can 100% do that with quantized models that are 8b and below. Take a look at ollama to experiment. For incorporating in a game I would probably use llama.cpp or candle.
The game itself is not going to have much VRAM to work with on older GPUs, though. Unless you use something fairly tiny like Phi-3 Mini.
There are a lot more options if you can establish that the user has a 3090 or 4090.
There definitely are smaller LLMs that can run on consumer computers, but as for their performance... You would be lucky to get a full sentence. On the other hand, sending and receiving responses as text is probably the fastest and most realistic way to implement these things in games.
Check out this subreddit for a decent "source of truth": reddit.com/r/localllama
A MacBook Pro with 128 GB of RAM runs Llama 3 70B entirely in memory and on the GPU. It's remarkable to have a performant LLM that smart and that fast on a (pro)sumer laptop.
Mistral is pretty good, and delivers solid results.