Has anyone got a Dockerfile or published container image that "just works" for running 4-bit quantized models on CPU and/or GPU? Ideally something that will run StableLM.
I've tried to build such a thing myself, but the vague instructions in blog posts aren't sufficient for a reproducible build. Too many instances of "clone this (ever-changing) Git repo" or "just curl and execute this", leading to very rapid bit-rot where even instructions from a month ago can't be reproduced!
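For concreteness, here's roughly what I'm hoping for: a minimal, untested sketch that pins every moving part to a fixed version (the commit hash is a placeholder; you'd substitute one that's actually been tested):

    FROM nvidia/cuda:11.8.0-devel-ubuntu22.04
    RUN apt-get update && apt-get install -y --no-install-recommends \
          git python3 python3-pip \
        && rm -rf /var/lib/apt/lists/*
    WORKDIR /app
    # Pin to a specific commit, not HEAD, so the build doesn't bit-rot.
    RUN git clone https://github.com/oobabooga/text-generation-webui.git . \
        && git checkout <known-good-commit>
    RUN pip3 install -r requirements.txt
    EXPOSE 7860
    CMD ["python3", "server.py", "--listen"]

Even something at that level of specificity, with real hashes that were known to work at publication time, would beat the blog posts that track HEAD.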
- 4-bit CUDA
- 4-bit Triton
- 4-bit CPU
https://github.com/oobabooga/text-generation-webui/blob/main...
Models have to be quantized specifically for each backend, and these branches are under heavy (daily) development... you really do want to git pull them all the time.
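That said, if reproducibility matters more to you than the latest fixes, you can pin a backend to a specific commit instead of pulling. A hedged fragment for a Dockerfile like the one above (assuming the usual qwopqwop200/GPTQ-for-LLaMa repo and the webui's repositories/ directory layout; the hash is a placeholder):

    # Untested sketch: pin the Triton backend to a fixed commit.
    RUN git clone -b triton \
          https://github.com/qwopqwop200/GPTQ-for-LLaMa.git \
          repositories/GPTQ-for-LLaMa \
        && cd repositories/GPTQ-for-LLaMa \
        && git checkout <known-good-commit>

You trade freshness for reproducibility, which is exactly the trade the daily churn forces.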
What OS are you running? A Linux distro, I presume?