I'm missing something fundamental. How can I understand these technologies?
Easiest way to run a local LLM these days is Ollama. You don't need PyTorch or even Python installed.
https://mobiarch.wordpress.com/2024/02/19/run-rag-locally-us...
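You don't need Python to use it, but Ollama does expose a local HTTP API (default port 11434) that you can script against from any language. A minimal sketch in Python, assuming you've already run "ollama pull mistral" (any pulled model name works):

    # Minimal sketch: call a locally running Ollama server on its default port.
    # Assumes you've already pulled a model, e.g. "ollama pull mistral".
    import json
    import urllib.request

    payload = {
        "model": "mistral",
        "prompt": "Explain quantization in one sentence.",
        "stream": False,  # return one JSON object instead of a token stream
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])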
Hugging Face can be confusing, but in the end it's a very well-designed framework.
https://mobiarch.wordpress.com/2024/03/02/start-using-mistra...
His videos really helped me build an intuition for how LLMs work. What I like is that he builds very simple versions of things that are easier to wrap your head around.
Hugging Face is basically 3 things: 1) a repo for models, 2) the Transformers library (basically some classes on top of core PyTorch that define the transformer architecture, plus code to auto-download models from Hugging Face by name), and 3) the Accelerate library, which is basically multi-device training/inference.
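To make that concrete, here's a minimal sketch of the usual Transformers flow (the repo id is just an example; device_map="auto" is what pulls in Accelerate under the hood):

    # Minimal sketch of the usual Transformers flow. The repo id is just an
    # example; from_pretrained downloads it from the Hub by name and caches it.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # example model from the Hub
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",  # keep whatever precision the checkpoint ships in
        device_map="auto",   # needs the accelerate package; places layers across devices
    )

    inputs = tokenizer("Hugging Face is basically", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))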
The first thing to understand about LLMs is quantization. Most original models are uploaded in fp16 format, with different parameter counts. Higher parameter count = better performance. If you want to fine-tune a model on your own data set, you have to keep it in fp16, because training gradients need the higher resolution. However, fp16 also takes a shitload of RAM to store.
Inference is pretty much just picking the statistically most likely token, which can be done without that resolution. As such, these models are usually quantized to lower bit widths. There are 3 main quantization formats: GPTQ (GPU first), GGUF (CPU first, born from the llama.cpp project, but supports offloading layers to the GPU), and AWQ (newer method, supposedly faster than GPTQ). It's generally accepted that 4-bit quantization is accurate enough for most things, but there is still value in higher-bit quantization.
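To put rough numbers on that, here's a back-of-the-envelope sketch (real files are a bit bigger because quant formats store scales and metadata, and you still need room for context/activations):

    # Rough model size: parameter count x bits per weight / 8 bytes.
    def approx_size_gb(params_billions, bits_per_weight):
        return params_billions * 1e9 * bits_per_weight / 8 / 1e9

    for params in (7, 13, 70):
        print(f"{params}B params: ~{approx_size_gb(params, 16):.1f} GB in fp16, "
              f"~{approx_size_gb(params, 4):.1f} GB at 4-bit")

    # 7B params: ~14.0 GB in fp16, ~3.5 GB at 4-bit
    # 13B params: ~26.0 GB in fp16, ~6.5 GB at 4-bit
    # 70B params: ~140.0 GB in fp16, ~35.0 GB at 4-bit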
Read this: https://archive.ph/2023.11.21-144133/https://towardsdatascie...
If you just want to run LLMs locally, use Ollama, and use the CLI to download the models (iirc most models that Ollama downloads through the CLI are GGUF 4-bit quantized). If you are using the CPU and want decent inference speed, use the smallest-parameter model. Otherwise use the highest parameter count that will fit in your VRAM (or RAM if you are on a Mac, since Ollama supports Apple Silicon).
If you want to do a little more tinkering (like running larger models on a limited-resource laptop), you need to become familiar with the Accelerate library. Hugging Face has most of the models already quantized by the user TheBloke, so you can just use the example code on each model's Hugging Face page to load it, then use Accelerate to split it up across devices, as in the sketch below.
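A sketch of what that splitting looks like (the repo id and memory budgets are placeholders; device_map and max_memory are handled by Accelerate under the hood, and for TheBloke's pre-quantized GPTQ/AWQ builds you'd also follow the extra install notes on the model card):

    # Sketch: cap how much of the model lives on the GPU and spill the rest to CPU RAM.
    # Repo id and memory budgets are placeholders; adjust to your hardware.
    from transformers import AutoModelForCausalLM

    model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # or a pre-quantized TheBloke repo
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",                       # let Accelerate decide layer placement
        max_memory={0: "6GiB", "cpu": "24GiB"},  # 6 GiB on GPU 0, the rest in CPU RAM
        offload_folder="offload",                # spill to disk if even CPU RAM runs out
    )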
Once you get a version up and running, make a copy before you update it; updates have broken my working version and caused headaches several times.
A decent explanation of parameters, short of reading arXiv papers: https://github.com/oobabooga/text-generation-webui/wiki/03-%...
An AI news website: https://www.emergentmind.com/
Reddit's LocalLLaMA, and how to prompt an LLM: https://old.reddit.com/r/LocalLLaMA/comments/1atyxqz/better_...
Since you mention silly text generation, there is also SillyTavern, which runs as a frontend on top of other LLM software such as text-generation-webui. https://docs.sillytavern.app/
The setup is relatively easy: install the .NET runtime, download the 4.5 GB model file from BitTorrent, unpack a small ZIP file, and run the EXE.
TL;DR: There are many ways to go about it.
Quick start?
Clone the llama.cpp repo or download the .exe or main Linux binary from "Releases" on GitHub, on the right. If you care about security, do this in a virtual machine (unless you plan to only use unquantised safetensors).
Example syntax: ./llama.cpp/main -i -ins --color -c 0 --split-mode layer --keep -1 --top-k 40 --top-p 0.9 --min-p 0.02 --temp 2.0 --repeat-penalty 1.1 -n -1 --multiline-input -ngl 3 -m mixtral-8x7b-instruct-v0.1.Q8_0.gguf
In this example, I'm running Mixtral at quantisation Q8, with 3 layers offloaded to the GPU, for about 45GB RAM usage and 7GB VRAM (GPU) usage. To make sense of quants, this is the general rule: you pick the largest quant you can run with your RAM.
If you go look for TheBloke models, they all have a handy model card stating how much RAM each quantisation uses.
I tend to use GGUF versions, which run on the CPU but can have some layers offloaded to the GPU.
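If you'd rather drive the same GGUF files from Python, the llama-cpp-python bindings expose the same offloading knob; a quick sketch (model path and layer count are just placeholders, mirroring the CLI example above):

    # Sketch: same GGUF file, driven from Python via llama-cpp-python.
    # Model path and n_gpu_layers are placeholders; mirror whatever you use on the CLI.
    from llama_cpp import Llama

    llm = Llama(
        model_path="mixtral-8x7b-instruct-v0.1.Q8_0.gguf",
        n_ctx=4096,       # context window
        n_gpu_layers=3,   # same idea as -ngl 3 above
    )
    out = llm("Explain GGUF offloading in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])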
I definitely recommend reading the https://github.com/ggerganov/llama.cpp documentation.