HACKER Q&A
📣 jiggawatts

Are there any ready-to-use Docker images for running LLMs locally?


I've been having a lot of trouble spinning up the various stacks for running open LLMs like Alpaca or Vicuna because they often require specific CUDA versions, specific gcc toolchains, etc...

Has anyone got a Dockerfile or published container image that "just works" and can run 4-bit quantized models on CPUs and/or GPUs? Ideally something that will run StableLM.

I've tried to build such a thing myself, but I found that the vague instructions in blog posts aren't sufficient for a reproducible build. Too many instances of "clone this (ever-changing) Git repo" or "just curl & execute this", leading to very rapid bit-rot where even instructions from a month ago can't be reproduced!
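
For concreteness, here's roughly the kind of thing I'm hoping already exists, written as a CPU-only sketch around llama.cpp. Treat it as pseudocode rather than a tested recipe: the ref to pin, the model filename and the sampling flags are placeholders, and llama.cpp's build and CLI change often.

    # CPU-only sketch: build llama.cpp and run a 4-bit ggml model mounted at /models.
    FROM ubuntu:22.04

    RUN apt-get update && apt-get install -y --no-install-recommends \
            build-essential git ca-certificates \
        && rm -rf /var/lib/apt/lists/*

    # Pin to a known-good commit instead of whatever master is today
    # ("master" below is a placeholder -- substitute a real SHA).
    ARG LLAMA_CPP_REF=master
    RUN git clone https://github.com/ggerganov/llama.cpp /llama.cpp \
        && cd /llama.cpp \
        && git checkout "$LLAMA_CPP_REF" \
        && make

    WORKDIR /llama.cpp
    # Mount your quantized model into /models at run time; the filename is an example.
    VOLUME /models
    ENTRYPOINT ["./main", "-m", "/models/ggml-model-q4_0.bin"]
    CMD ["-p", "Hello from a container", "-n", "128"]

Then it's just docker build, and docker run with the models directory mounted in. But ideally someone has already published and maintains exactly this, which is what I'm asking for.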


  👤 brucethemoose2 Accepted Answer ✓
4-bit LLaMA is messy because there are essentially three variants:

- 4-bit CUDA

- 4-bit Triton

- 4-bit CPU

https://github.com/oobabooga/text-generation-webui/blob/main...

Models have to be quantized specifically for each variant, and these branches are under heavy (daily) development... you really want to git pull them all the time.
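
If you do try to freeze this into an image, about the only way to fight the bit-rot is to pin each variant to an exact commit in the Dockerfile and bump it deliberately. Rough sketch for the CUDA variant below; the base image tag, arch list and build commands are from memory, so double-check them against the repo before trusting it:

    # GPU sketch: pin qwopqwop200/GPTQ-for-LLaMa (the 4-bit CUDA variant) to one commit.
    FROM nvidia/cuda:11.7.1-devel-ubuntu22.04

    RUN apt-get update && apt-get install -y --no-install-recommends \
            git python3 python3-pip \
        && rm -rf /var/lib/apt/lists/*

    # Lets the CUDA extension compile without a GPU visible at build time;
    # set this to your card's compute capability.
    ENV TORCH_CUDA_ARCH_LIST="8.6"

    # Placeholder -- pin a commit from the repo's cuda branch and bump it on purpose.
    ARG GPTQ_COMMIT=replace-with-a-known-good-sha
    RUN git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa /gptq \
        && cd /gptq \
        && git checkout "$GPTQ_COMMIT" \
        && pip3 install -r requirements.txt \
        && python3 setup_cuda.py install

Same idea for the Triton and CPU variants, just different repos/branches and build steps.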

What OS are you running? A Linux distro, I presume?


👤 smoldesu
I'm using Serge[0] as an API for a local Discord bot. You probably won't find anything for StableLM this soon after release, but this will download and run the Ll*ma stuff with a decent web UI.
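
Getting it running was basically clone-and-compose for me. Going from memory here, so check the README for the current invocation, ports and image names:

    # Assumes the docker-compose.yml shipped in the repo; see the README if it moved.
    git clone https://github.com/nsarrazin/serge.git
    cd serge
    docker compose up -d
    # The web UI and API listen on whatever port the compose file maps.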

[0] https://github.com/nsarrazin/serge