However, it's unlikely OpenAI will just release the model the way they did with Whisper, due to its size and commercial constraints. For it to be applied across industries, it requires experimentation and adaptation on our side.
When will ChatGPT have its Stable Diffusion moment? Is it possible for any machine to run these kinds of models?
For BLOOM-176B, an alternative to GPT-3, you may need a cluster or machine with 512GB of GPU memory. That's expensive.
On a more normal machine, you can run something like GPT-J 6B, but it's very limited in comparison to ChatGPT.
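Those numbers follow from simple arithmetic on the parameter count. A back-of-envelope sketch (weights only, ignoring activations, KV cache, and framework overhead):

```python
# Rough memory needed just to hold the weights, assuming fp16 (2 bytes per parameter).
def weight_memory_gb(n_params_billion: float, bytes_per_param: int = 2) -> float:
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

print(f"BLOOM-176B: {weight_memory_gb(176):.0f} GB")  # ~328 GB before overhead, hence a 512GB cluster
print(f"GPT-J-6B:   {weight_memory_gb(6):.0f} GB")    # ~11 GB, fits on a single consumer GPU
```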
Maybe we will find a way to shrink these models while preserving their capabilities.
You can already run smaller language models on your own hardware if you have a GPU with enough VRAM. For example, with quantization you can run gpt-neox-20b (512-token context window) or gpt-pythia-13b (full context window) on an RTX 3090 with 24GB of VRAM. Quantization lets the model run in less memory: each parameter uses 8 or 4 bits instead of 16 or 32.
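As a rough illustration, here is a minimal sketch of loading a model with 8-bit weights via Hugging Face transformers plus bitsandbytes (assuming both libraries and accelerate are installed). At 8 bits per parameter, a 20B-parameter model needs roughly 20GB just for weights, which is why it can squeeze onto a 24GB card:

```python
# Minimal sketch: load GPT-NeoX-20B with 8-bit quantized weights.
# Assumes transformers, accelerate, and bitsandbytes are installed
# and a GPU with ~24GB of VRAM is available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neox-20b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # place layers on available GPU/CPU memory automatically
    load_in_8bit=True,   # 8-bit weights instead of 16/32-bit
)

prompt = "The advantage of quantization is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```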
Another possibility is to use reinforcement learning from human feedback (RLHF) to tune smaller models so they give results comparable to larger ones.
I've also been using RWKV with good results. It's a language model built on an RNN, so inference only needs matrix-vector multiplications instead of matrix-matrix ones and runs much faster. The 7B model uses about 14GB of VRAM without quantization. A 14B model is currently in training, but progress checkpoints are available. You can also run inference on a CPU, although it's much slower than on a GPU.
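To make the matrix-vector point concrete, here is a toy sketch (not the actual RWKV formulation, and the sizes are made up) of recurrent generation: the model carries a fixed-size state, so each new token costs only a few matrix-vector products instead of attending over the whole sequence with matrix-matrix products:

```python
# Toy RNN-style decoding loop: per-token cost is matrix-vector only.
import torch

d = 1024                              # hidden size (hypothetical)
W_state = torch.randn(d, d) * 0.01    # recurrence weights
W_in = torch.randn(d, d) * 0.01       # input projection
W_out = torch.randn(d, d) * 0.01      # output projection

state = torch.zeros(d)
for step in range(16):                # one iteration per generated token
    x = torch.randn(d)                # stand-in for the current token's embedding
    state = torch.tanh(W_state @ state + W_in @ x)  # matrix-vector update of the state
    logits = W_out @ state            # matrix-vector projection to vocabulary space
    # sample the next token from logits and feed its embedding back in as x ...
```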
Stable Diffusion needed to fit in RAM because you were iterating over all the pixels, but if ChatGPT only emits one token at a time, it might just be possible, right?