The ecosystem has matured: DGX Spark, high-end Mac Studios, AMD Strix Halo, upcoming DGX Station. Models are getting smaller and more efficient. Inference engines (llama.cpp, vLLM, SGLang) and frontends (Ollama, LMStudio, Jan) have made local deployment accessible. Yet I keep meeting more people researching this than actually deploying it.
For those running local inference:
- What's your setup and use case?
- Is it personal or shared across a team?
- What's the real driver: privacy, regulation, latency, cost, tinkering?
I'm skeptical of the cost argument (cloud inference scales better, and APIs are subsidized, for now at least!), but I'm curious if I'm missing something.
What would make local AI actually worth it for you?
I do local AI with Qwen, Whisper and another I can't remember right now.
These are all Qwen:
We do AI invoice OCR: PDF -> image -> Excel. It works much better than other solutions because the model has invoice context, so it looks for the particular data we want to extract and ignores the rest. Why local? I proved it worked, and there's no need to send our data outside for processing.
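For a rough idea of the shape of it, here's a minimal sketch of that kind of pipeline, assuming Ollama serving a Qwen vision model locally. The model tag, prompt, and extracted fields are illustrative placeholders, not our actual code:

```python
# Minimal sketch: PDF -> image -> local Qwen VL via Ollama -> Excel.
# Assumes Ollama is running with a Qwen vision model pulled; the model
# tag, prompt, and field names below are illustrative.
import base64
import io
import json

import requests
from pdf2image import convert_from_path  # needs poppler installed
from openpyxl import Workbook

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen2.5vl"  # assumption: whichever Qwen VL tag you have pulled

def page_to_b64(page) -> str:
    """Render a pdf2image page to base64 PNG for Ollama's images field."""
    buf = io.BytesIO()
    page.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()

def extract_invoice(pdf_path: str) -> dict:
    page = convert_from_path(pdf_path, dpi=200)[0]  # first page only
    resp = requests.post(OLLAMA_URL, json={
        "model": MODEL,
        "prompt": ("This is an invoice. Return JSON with keys: "
                   "supplier, invoice_number, date, total. Nothing else."),
        "images": [page_to_b64(page)],
        "format": "json",   # ask Ollama to constrain the output to JSON
        "stream": False,
    }, timeout=600)
    return json.loads(resp.json()["response"])

if __name__ == "__main__":
    wb = Workbook()
    ws = wb.active
    ws.append(["supplier", "invoice_number", "date", "total"])
    data = extract_invoice("invoice.pdf")
    ws.append([data.get(k) for k in ("supplier", "invoice_number", "date", "total")])
    wb.save("invoices.xlsx")
```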
We deal with photos of food packaging, so I built a "photograph the ingredients list and check it against our expected ingredients" flow. The downside is it takes 2 minutes per photo, so I might actually push this one outside.
Ingredients classifier: is it animal-derived (and if so, what species), vegetarian, vegan, halal, kosher, alcoholic, nut-based, does it contain peanuts, and more. There's simply no need to send this outside.
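A sketch of how that classifier call might look, again assuming Ollama and its JSON mode; the model tag and exact field set are my placeholders, not the production setup:

```python
# Minimal sketch of the ingredient classifier: one local call per
# ingredient, output constrained to JSON.
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen2.5"  # assumption: whatever Qwen tag you run locally

PROMPT = """Classify the food ingredient '{name}'.
Return JSON with keys: animal_derived (bool), species (string or null),
vegetarian (bool), vegan (bool), halal (bool), kosher (bool),
alcoholic (bool), nut_based (bool), contains_peanuts (bool)."""

def classify(ingredient: str) -> dict:
    resp = requests.post(OLLAMA_URL, json={
        "model": MODEL,
        "prompt": PROMPT.format(name=ingredient),
        "format": "json",   # Ollama's JSON mode keeps the output parseable
        "stream": False,
    }, timeout=120)
    return json.loads(resp.json()["response"])

print(classify("gelatine"))
```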
I've got a Linux chatbot helper on the "test this" pile, using Qwen Coder. I haven't evaluated it yet, but the idea is "type a command, get it wrong, ask Qwen for the answer". I use Claude for this today, but it seems a bit heavyweight and I'm curious.
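The whole helper could be as small as something like this (a sketch to alias in your shell; the model tag is a guess):

```python
#!/usr/bin/env python3
# Minimal sketch of the "type command, get it wrong, ask Qwen" helper.
# Usage:  ask how do I untar a .tar.zst file
# Assumes Ollama serving a Qwen coder model locally; model tag is a guess.
import sys
import requests

question = " ".join(sys.argv[1:])
resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "qwen2.5-coder",  # assumption
    "prompt": ("You are a terse Linux shell assistant. Answer with the "
               f"correct command and one line of explanation.\n\n{question}"),
    "stream": False,
}, timeout=120)
print(resp.json()["response"])
```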
Tbh, some of it is solution hunting: we spent $1,000 on the kit to evaluate whether it was worth it, so I try to get some value out of it.
But it is slow: 3 hours for a recent task that took the Claude API 2 minutes.
My favourite use is Whisper. I voice->text almost all of my typing now.
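If anyone wants to try it, the local loop is tiny. For example with faster-whisper, one of several ways to run Whisper locally; the model size and audio file here are placeholders:

```python
# Minimal dictation sketch with faster-whisper.
from faster_whisper import WhisperModel

model = WhisperModel("small", compute_type="int8")  # CPU-friendly settings
segments, info = model.transcribe("dictation.wav", language="en")
print(" ".join(seg.text.strip() for seg in segments))
```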
I've also bought an Nvidia Orin Nano, but I haven't set it up yet. I want to run Whisper in the car to take voice dictation as I drive.
I work with ML professionally, almost all of it in the cloud. I just wanted something “off grid” and unmetered, and I needed a computer anyway, so I decided to pay a bit more and get the one I wanted. It’s “personal” in that it’s exclusively for me, but I have a business and bought it for that.
Still figuring out the best software; so far it looks like llama.cpp with Vulkan, though I have a lot of experimenting to do and I don’t currently find it optimal for what I want.
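For what it's worth, Vulkan is a build-time backend choice in llama.cpp; the client side doesn't change with the backend. A minimal sketch of talking to llama-server through its OpenAI-compatible endpoint, with the port and model name depending on how you launched it:

```python
# Minimal sketch: querying a local llama.cpp server (llama-server)
# through its OpenAI-compatible endpoint. Port and model depend on your
# launch flags; llama-server serves whatever model it was started with.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
reply = client.chat.completions.create(
    model="local",  # placeholder; the server ignores this field
    messages=[{"role": "user", "content": "Say hello from the edge."}],
)
print(reply.choices[0].message.content)
```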
What's your stack?
And none of that hardware can run the larger models. Small ones, or highly quantized versions of larger ones, sure. Or do you have something important to say?