Also open to other solutions. I have a Mac M1 (8GB RAM), and upgrading the computer itself would be cost-prohibitive for me.
I was getting 2.2 tokens/s with the llama-2-13b-chat.Q4_K_M.gguf and 3.3 tokens/s with llama-2-13b-chat.Q3_K_S.gguf. With Mistral and Zephyr, the Q4_K_M versions, I was getting 4.4 tokens/s.
A few days ago I bought another stick of 16GB RAM ($30) and for some reason that escapes me, the inference speed doubled. So now I'm getting 6.5 tokens/s with llama-2-13b-chat.Q3_K_S.gguf, which for my needs gives the same results as Q4_K_M, and 9.1 tokens/s with Mistral and Zephyr. Personally, I can barely keep up with reading at 9 tokens/s (if I also have to process the text and check for errors).
If I weren't considering getting an Nvidia 4060 Ti for Stable Diffusion, I would seriously consider a used RX 580 8GB ($75) and run Llama Q4_K_M entirely on the GPU, or offload some layers when using a 30B model.
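If you do end up offloading layers with llama.cpp, it's a single knob. Here's a minimal sketch with the llama-cpp-python bindings (the model path is a placeholder, and the right n_gpu_layers value depends on how much VRAM your card actually has):

    # pip install llama-cpp-python  (built with GPU support for your backend)
    from llama_cpp import Llama

    # n_gpu_layers controls how many transformer layers live on the GPU:
    # -1 offloads everything, 0 keeps the whole model on the CPU, and anything
    # in between is the partial offload you'd use for a 30B quant on an 8GB card.
    llm = Llama(
        model_path="./llama-2-13b-chat.Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=-1,   # full offload; lower this until it fits in VRAM
        n_ctx=2048,
    )

    out = llm("Q: What does Q4_K_M mean? A:", max_tokens=64)
    print(out["choices"][0]["text"])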
Here is the best explanation I’ve found so far, covering various trade-offs and scenarios: https://www.hardware-corner.net/guides/computer-to-run-llama...
In your shoes, not being in a position to spend much right now, I'd try a few different 7B models at 4- and 5-bit quantizations on the Mac, which is going to be better than just about any other 8GB RAM system, and look into using the cloud for larger stuff (remember to fully deallocate the VM when you're done for the day!).
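If you want to compare quantizations before settling on one, a rough tokens/s check with the llama-cpp-python bindings looks something like this (the GGUF file names below are placeholders for whichever 7B quants you download; on Apple Silicon the library uses Metal when built with it):

    import time
    from llama_cpp import Llama

    prompt = "Explain the trade-off between 4-bit and 5-bit quantization."

    # Placeholder file names; swap in whatever 7B quants you downloaded.
    for path in ["mistral-7b-instruct.Q4_K_M.gguf",
                 "mistral-7b-instruct.Q5_K_M.gguf"]:
        llm = Llama(model_path=path, n_ctx=2048, verbose=False)
        start = time.time()
        out = llm(prompt, max_tokens=128)
        generated = out["usage"]["completion_tokens"]
        print(f"{path}: {generated / (time.time() - start):.1f} tokens/s")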
Start with a 7B model and go from there. I used KoboldAI, but it didn't seem too well recommended for macOS.
A Raspberry Pi 4B can do 3B models, or a 7B at roughly one question per hour for now. You can quantize them for faster inference, but then the answers are worse.
Then, if you want something that is extremely quick and easy to set up and provides a convenient REST API for completions/embeddings along with some other nice features, you might want to check out my project here:
https://github.com/Dicklesworthstone/swiss_army_llama
Especially if you use Docker to set it up, you can go from a brand new box to a working setup in under 20 minutes and then access it via the Swagger page from any browser.
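For example, calling it from Python is only a couple of lines once the container is up. The port, route, and request fields below are just for illustration; the Swagger page lists the actual endpoints and their request schemas for the version you're running:

    import requests

    # Illustrative only: check the Swagger UI for the real route names, port,
    # and request fields exposed by the version you're running.
    resp = requests.post(
        "http://localhost:8089/get_embedding_vector_for_string",
        json={"text": "an example sentence to embed"},
        timeout=120,
    )
    print(resp.json())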
Not the cheapest option by far, but I recently bought an M2 Pro Mac mini with 32GB of internal memory. I can run about four 7B models concurrently. I was able to run a quantized 30B model without page faults, but I had to kill most userland processes.
Also not what you are asking for, but I pay Google $10/month for Colab Pro and can usually get an A100 whenever I request one. Between Colab and my 32GB M2 box, I am very satisfied. Before I found good quantized models to run, I would rent a GPU instance by the hour from Lambda Labs, and that was a great experience, but I don't need to do that now.
EDIT: on the M2 Pro, I get 25 to 30 tokens per second.
EDIT #2: I wrote a short blog post yesterday on the best resources I have found so far for running LLMs on my Mac: https://marklwatson.substack.com/p/running-open-llm-models-o...
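On the Colab side, a quick way to confirm you actually got the A100 and then load a 4-bit model is something like this (the model name is just an example; it assumes transformers, accelerate, and bitsandbytes are installed in the runtime):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    print(torch.cuda.get_device_name(0))  # should report the A100 if you were given one

    model_id = "mistralai/Mistral-7B-Instruct-v0.1"  # example model
    bnb_config = BitsAndBytesConfig(load_in_4bit=True)

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,  # 4-bit weights via bitsandbytes
        device_map="auto",
    )

    inputs = tokenizer("Why does quantization save memory?", return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output[0], skip_special_tokens=True))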
In case it's the latter, I recently used Ollama[1], and boy was it good! Installation was a breeze, downloading/using models is very easy, and performance on my M1 was quite good with the Mistral 7B model.
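Once Ollama is running, it also exposes a local HTTP API on its default port, which makes scripting against it trivial. A minimal sketch, assuming you've already pulled the model with "ollama pull mistral":

    import requests

    # Ollama's server listens on port 11434 by default; stream=False returns a
    # single JSON object instead of a stream of chunks.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "mistral", "prompt": "Why is the sky blue?", "stream": False},
        timeout=300,
    )
    print(resp.json()["response"])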
Your best bet is to run a quantized 7B model on your M1 Mac using LM Studio or Ollama, like Intel's Neural Chat v3.1 (a Mistral 7B fine-tune).
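If you go the LM Studio route, its local server speaks the OpenAI chat-completions format, so hitting it from Python is just as short. A sketch under the assumption that the server is on its default port (1234) and you already have a model loaded in the app:

    import requests

    # Port 1234 is LM Studio's default; the "model" field is a placeholder here,
    # since the server answers with whichever model is currently loaded.
    resp = requests.post(
        "http://localhost:1234/v1/chat/completions",
        json={
            "model": "local-model",
            "messages": [{"role": "user", "content": "One sentence on what Q4_K_M means."}],
            "max_tokens": 64,
        },
        timeout=300,
    )
    print(resp.json()["choices"][0]["message"]["content"])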
Let's say I point my resources at getting one up and running that outputs tokens at an acceptable rate - then what? What can I do with a local LLM?
...which is usually crap (because it's only 3B) and needs to be regenerated anyway. It's not a viable solution for any generative use case. Mechanical Turk is faster and more reliable.
There are smaller models that I could try, but 7B is already the lower limit of my patience. YMMV.
A token per second... what is that token going to say? Accurate information?