HACKER Q&A
📣 jiggawatts

Are there any ready-to-use Docker images for running LLMs locally?


I've been having a lot of trouble spinning up the various stacks for running open LLMs like Alpaca or Vicuna because they often require specific CUDA versions, specific gcc toolchains, etc...

Has anyone got a Dockerfile or published container image that "just works" and can run 4-bit quantized models on CPUs and/or GPUs? Ideally something that will run StableLM.

I've tried to build such a thing myself, but I found that the vague instructions in blog posts aren't sufficient for a reproducible build. Too many instances of "clone this (ever-changing) Git repo" or "just curl & execute this", leading to very rapid bit-rot where even instructions from a month ago can't be reproduced!
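
For concreteness, here's roughly the kind of thing I'm hoping already exists, written as a CPU-only sketch around llama.cpp. Treat it as pseudocode rather than a tested recipe: the ref to pin, the model filename and the sampling flags are placeholders, and llama.cpp's build and CLI change often.

    # CPU-only sketch: build llama.cpp and run a 4-bit ggml model mounted at /models.
    FROM ubuntu:22.04

    RUN apt-get update && apt-get install -y --no-install-recommends \
            build-essential git ca-certificates \
        && rm -rf /var/lib/apt/lists/*

    # Pin to a known-good commit instead of whatever master is today
    # ("master" below is a placeholder -- substitute a real SHA).
    ARG LLAMA_CPP_REF=master
    RUN git clone https://github.com/ggerganov/llama.cpp /llama.cpp \
        && cd /llama.cpp \
        && git checkout "$LLAMA_CPP_REF" \
        && make

    WORKDIR /llama.cpp
    # Mount your quantized model into /models at run time; the filename is an example.
    VOLUME /models
    ENTRYPOINT ["./main", "-m", "/models/ggml-model-q4_0.bin"]
    CMD ["-p", "Hello from a container", "-n", "128"]

Then it's just docker build, and docker run with the models directory mounted in. But ideally someone has already published and maintains exactly this, which is what I'm asking for.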


  👤 brucethemoose2 Accepted Answer ✓
4-bit LLaMA is messy because there are essentially three variants:

- 4-bit CUDA

- 4-bit Triton

- 4-bit CPU

https://github.com/oobabooga/text-generation-webui/blob/main...

Models have to be quantized specifically for each variant, and these branches are under heavy (daily) development... you really want to git pull them all the time.
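
If you do try to freeze this into an image, about the only way to fight the bit-rot is to pin each variant to an exact commit in the Dockerfile and bump it deliberately. Rough sketch for the CUDA variant below; the base image tag, arch list and build commands are from memory, so double-check them against the repo before trusting it:

    # GPU sketch: pin qwopqwop200/GPTQ-for-LLaMa (the 4-bit CUDA variant) to one commit.
    FROM nvidia/cuda:11.7.1-devel-ubuntu22.04

    RUN apt-get update && apt-get install -y --no-install-recommends \
            git python3 python3-pip \
        && rm -rf /var/lib/apt/lists/*

    # Lets the CUDA extension compile without a GPU visible at build time;
    # set this to your card's compute capability.
    ENV TORCH_CUDA_ARCH_LIST="8.6"

    # Placeholder -- pin a commit from the repo's cuda branch and bump it on purpose.
    ARG GPTQ_COMMIT=replace-with-a-known-good-sha
    RUN git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa /gptq \
        && cd /gptq \
        && git checkout "$GPTQ_COMMIT" \
        && pip3 install -r requirements.txt \
        && python3 setup_cuda.py install

Same idea for the Triton and CPU variants, just different repos/branches and build steps.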

What OS are you running? A Linux distro, I presume?


👤 smoldesu
I'm using Serge[0] as an API for a local Discord bot. You probably won't find anything for StableLM this soon after release, but this will download and run the Ll*ma stuff with a decent web UI.
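
Getting it running was basically clone-and-compose for me. Going from memory here, so check the README for the current invocation, ports and image names:

    # Assumes the docker-compose.yml shipped in the repo; see the README if it moved.
    git clone https://github.com/nsarrazin/serge.git
    cd serge
    docker compose up -d
    # The web UI and API listen on whatever port the compose file maps.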

[0] https://github.com/nsarrazin/serge