Serverless AI is quickly becoming popular precisely because of the scenario you’re describing: it’s currently still pretty hard to deploy your own GPU stack, not to mention crazy expensive to run e.g. an A100 24/7, plus the orchestration needed to scale up and down. It’s why so many model authors don’t host their own demos anymore and simply toss them on HuggingFace or Replicate directly.
Serverless providers will basically do the infra for you, as well as make the necessary GPU reservations, then charge you a slight premium on the reserved price, so you’d pay less than on-demand on GCP/AWS while they benefit from economies of scale.
I do imagine that at some point soon GPUs will become cheaper and more widely available, so you could get away with hosting your own VPS in the distant (or near?) future.
Each model is built on some predefined model architecture, and the majority of today's LLMs are implementations of the Transformer architecture from the "Attention Is All You Need" paper (2017). When you fine-tune a model, you usually start from a checkpoint and then, using techniques like LoRA or QLoRA, you compute new weights. You do this in your training/fine-tuning script using PyTorch or some other framework.
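As a rough illustration, a LoRA fine-tune with the Hugging Face peft library looks something like this (a minimal sketch; the base checkpoint, target modules, and hyperparameters are placeholder assumptions, not a recipe):

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "meta-llama/Llama-2-7b-hf"  # placeholder starting checkpoint
    model = AutoModelForCausalLM.from_pretrained(base)
    tokenizer = AutoTokenizer.from_pretrained(base)

    # LoRA trains small low-rank adapter matrices instead of the full weight set
    lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # only a tiny fraction of weights are trainable

    # ...run a normal PyTorch training loop or the HF Trainer here...
    model.save_pretrained("my-lora-adapter")  # the new weights you'll load for inference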
Once the training is done you get the final weights -- a binary blob of floats. Now you need to plug those weights back into the inference architecture of the model. You do that by using the same framework you trained with (PyTorch) to construct the inference pipeline. You can also build your own framework/inference engine if you want and try to beat PyTorch :) The pipeline will consist of steps like the following (see the sketch after this list):
- loading the model weights
- doing pre-processing on your input
- building the inference graph
- running your input (embeddings/vectors) through the graph
- generating predictions/results
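In PyTorch/transformers terms, the whole pipeline can be sketched like this (hedged: paths and generation settings are placeholders):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # 1. Load the model weights (your trained/fine-tuned checkpoint)
    model = AutoModelForCausalLM.from_pretrained("path/to/my-finetuned-model")
    tokenizer = AutoTokenizer.from_pretrained("path/to/my-finetuned-model")
    model.eval()

    # 2. Pre-process the input into token ids
    inputs = tokenizer("What does serverless GPU hosting mean?", return_tensors="pt")

    # 3./4. Run the input through the inference graph (no gradients needed)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=64)

    # 5. Decode the generated tokens into the final prediction/result
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))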
Now, the execution of this pipeline can be done on GPU(s), so all the computations (matrix multiplications) are super fast and the results are generated quickly, or it can still run on good old CPUs, just much slower. Tricks like quantization of the model weights can be used here to reduce the model size and speed up execution by trading off some precision/accuracy.
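For example, one common route is loading the weights in 4-bit via the transformers + bitsandbytes integration (a sketch; the path is a placeholder and this is just one of several ways to quantize):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Load weights in 4-bit: roughly 4x smaller than fp16, at some cost in quality
    quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
    model = AutoModelForCausalLM.from_pretrained(
        "path/to/my-finetuned-model",     # placeholder path
        quantization_config=quant_config,
        device_map="auto",                # place layers on GPU(s) if available
    )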
Tools like Ollama or vLLM abstract away all of the above steps, and that's why they are so popular -- they also let you bring your own (fine-tuned) model.
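For instance, vLLM's offline API collapses the whole pipeline into a few lines (a sketch; the model path and sampling settings are placeholders):

    from vllm import LLM, SamplingParams

    # vLLM handles weight loading, batching and GPU scheduling for you
    llm = LLM(model="path/to/my-finetuned-model")
    params = SamplingParams(temperature=0.7, max_tokens=64)
    outputs = llm.generate(["What does serverless GPU hosting mean?"], params)
    print(outputs[0].outputs[0].text)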
On top of the pure model execution, you can create a web service that serves your model via an HTTP or gRPC endpoint. It could accept a user query/input and return JSON with the results. Then it can be incorporated into any application, become part of another service, etc.
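A minimal sketch of such a wrapper with FastAPI (the generate() stub and route name are made up for illustration; you'd swap in the real inference call from the pipeline above):

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class Query(BaseModel):
        prompt: str

    def generate(prompt: str) -> str:
        # placeholder: replace with your real inference call
        return f"echo: {prompt}"

    @app.post("/generate")
    def generate_endpoint(query: Query):
        return {"result": generate(query.prompt)}

    # run with: uvicorn server:app --host 0.0.0.0 --port 8000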
So, the answer is much more than "get the GPU and run with it", and I think it's important to be aware of all the steps required if you want to really understand what goes into deploying custom ML models and putting them to good use.
1/ bought a gaming rig off craigslist
2/ set it up to run as a server (auto-start services, auto-start if power is cut, etc.)
3/ set up a cloud Redis server
4/ the gaming rig fetches tasks from the Redis queue, processes them, and updates them
Queueing in front of ML models is important. GPUs are easily overwhelmed and will basically fail 90% of traffic and hit max latency if you send them too much at once. Let the GPU server run at its own pace, pulling work off the queue, roughly as in the sketch below.
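A hedged sketch of that worker loop (the Redis host, queue name, and task format are made up for illustration):

    import json
    import redis

    r = redis.Redis(host="my-cloud-redis.example.com", port=6379)  # placeholder host

    def run_inference(prompt: str) -> str:
        # placeholder: call your local model here
        return f"echo: {prompt}"

    # The GPU box pulls one task at a time, so it never takes on more than it can handle
    while True:
        _, raw = r.blpop("inference:tasks")    # block until a task is available
        task = json.loads(raw)
        result = run_inference(task["prompt"])
        r.set(f"inference:result:{task['id']}", json.dumps(result))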
It’s “serverless” and you only pay for the compute you use.
I'm nothing even close to a knowledgeable expert on this, but I've dabbled enough to have an idea how to answer this.
The special thing about why GPUs are used to train models is that their architecture is well suited to parallelizing the kind of vector math that goes on under the hood when adjusting the weights between nodes in all the layers of a neural network.
The weight files (GGUF, etc.) are just data describing how the network needs to be built up to be functional. Think compressed ZIP file versus uncompressed text document.
You can run a lot of models on just a CPU, but it's gonna be slooooooow. For example, I've been tinkering with running a Mixtral 8x7B model on my 2019 Intel MacBook Pro with llama.cpp. It works, but it runs at maybe 1-2 tokens per second at most, and that's even with the limited GPU offloading I figured out how to do.
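For reference, the same kind of setup through the llama-cpp-python bindings looks roughly like this (a sketch; the GGUF file name and layer count are placeholders):

    from llama_cpp import Llama

    # n_gpu_layers controls how many layers are offloaded to the GPU:
    # 0 means pure CPU, -1 means offload as many as will fit
    llm = Llama(model_path="mixtral-8x7b-instruct.Q4_K_M.gguf", n_gpu_layers=8, n_ctx=4096)
    out = llm("Explain what a GGUF file is in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])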
1. Train a student model from your fine-tuned model. (Known as "knowledge distillation").
2. Quantize the student model so that it uses integers.
You might also prune the model to get rid of some close-to-zero weights.
This will get you a smaller model that can probably run OK on a CPU, but will also be much more efficient on GPU.
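For step 2, PyTorch's post-training dynamic quantization is the simplest starting point (a sketch; the stand-in model below takes the place of whatever came out of the distillation step):

    import torch
    import torch.nn as nn

    # Example stand-in for the distilled student (replace with your real model)
    student_model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))

    # Convert the Linear layers to int8 for cheaper CPU inference
    quantized_student = torch.quantization.quantize_dynamic(
        student_model, {nn.Linear}, dtype=torch.qint8
    )
    torch.save(quantized_student.state_dict(), "student-int8.pt")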
Next: architect your code so that the inference step sits behind a queue. You do not generally want the user interface waiting on an inference event, because you can't guarantee latency or resource availability, and your model's inference processing will be the biggest, slowest thing in your stack, so you can't afford to overprovision.
So have a queue of "things to infer", and have your inference process run in the background, chomping through the backlog and storing the results in your database. When it infers something, notify your front-end clients somehow that the result is ready in the database for them to retrieve. In this model, you can potentially run your model somewhere cheaper than AWS (e.g. a cheaper provider, or a machine under your desk).
Or, for the genius move: compile the model to ONNX and run it in a background thread in the user's browser, and then you don't have to worry about provisioning; users will wonder why their computer runs so slowly, though.
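The export side of that is straightforward from PyTorch (a sketch; the model and example input are placeholders, and the resulting .onnx file would then be loaded by something like onnxruntime-web in the browser):

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))  # stand-in model
    example_input = torch.randn(1, 768)                                       # placeholder input

    # Trace the model once with an example input and write out the ONNX graph
    torch.onnx.export(
        model,
        example_input,
        "model.onnx",
        input_names=["input"],
        output_names=["output"],
        dynamic_axes={"input": {0: "batch"}},
    )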
- Replicate: $0.05 input, $0.25 output
- Octo: $0.15
- Anyscale: $0.15
- Together: $0.20
- Fireworks: $0.20
If you're looking for an end-to-end flow that will help you gather the training data, validate it, run the fine tune and then define evaluations, you could also check out my company, OpenPipe (https://openpipe.ai/). In addition to hosting your model, we help you organize your training data, relabel if necessary, define evaluations on the finished fine-tune, and monitor its performance in production. Our inference prices are higher than the above providers, but once you're happy with your model you can always export your weights and host them on one of the above!
They have examples on GitHub on how to deploy; I did it last year and it was pretty straightforward.
You'll need GPUs for inference + you'll have to quantize the model + have it hosted in the cloud. The platform I've built is around the same workflow (but all of it is automated, along with autoscaling, and you get an API endpoint; you only pay for the compute you host on).
Generally, the GPU(s) you choose will depend on how big the model is + how many tokens/sec you're looking to get out of it.
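A back-of-the-envelope sizing rule: the weights alone take roughly params x bytes-per-param of VRAM, plus headroom for the KV cache and activations. For example:

    def weight_vram_gb(n_params_billion: float, bits_per_param: int) -> float:
        # Rough VRAM needed just for the weights (ignores KV cache/activations)
        return n_params_billion * 1e9 * bits_per_param / 8 / 1e9

    print(weight_vram_gb(7, 16))  # ~14 GB for a 7B model in fp16
    print(weight_vram_gb(7, 4))   # ~3.5 GB for the same model quantized to 4-bit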
The next easiest step would be OpenAI. They offer extremely easy fine-tuning: more or less, you upload a file with training data to their web app, then a few hours later it's done, and you just use your API key + a model ID for your particular fine-tuned model.
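Through their Python SDK the same flow looks roughly like this (a sketch; the file name and base model are placeholders, so check their docs for the currently fine-tunable models):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Upload the JSONL training data, then kick off a fine-tuning job
    f = client.files.create(file=open("training_data.jsonl", "rb"), purpose="fine-tune")
    job = client.fine_tuning.jobs.create(training_file=f.id, model="gpt-4o-mini-2024-07-18")

    # When the job finishes, it yields a fine-tuned model id you call like any other model
    print(job.id)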
If you need data privacy and don't trust OpenAI and Anthropic when they say they don't train on API call data, then you will need a local model.
If it’s a small model you might be able to host it on a regular server with CPU inference (see llama.cpp)
Or a big model on CPU, but really slowly
But realistically you’ll probably want to use GPU inference
Either running on GPUs all the time (no cold start times) or on serverless GPUs (but then the downside is the instances need to start up when needed, which might take 10 seconds)
Disclaimer: I’m the creator of dstack.
It's faster to use a GPU. If you tried to play a game on a laptop with onboard graphics vs. a good external graphics card, it might technically work, but a good GPU gives you more processing power and VRAM to make it a faster experience. Same thing here.
When is a GPU needed:
You need it both for initial training (which it sounds like you've done) and also when someone prompts the LLM and it processes their query to generate a response (called inference). So to answer your question: your web server that handles incoming LLM queries also needs a great GPU, because with any amount of user activity it will be running effectively 24/7 as users continually prompt it, just as they would use any other site you have online.
When is a GPU not needed:
Computationally, inference is just "next token prediction", but depending on how the user enters their prompt, it's sometimes able to provide those predictions (called completions) from pre-computed embeddings, in other words by performing a simple lookup, and the GPU is not invoked. For example, in this autocompletion/token-prediction library I wrote that uses an ngram language model (https://github.com/bennyschmidt/next-token-prediction), a GPU is only needed for the initial training on text data; there's no heavy inference component, so completions are fast lookups that don't invoke the GPU. An LM like this could be trained offline and deployed cheaply, with no cloud GPU needed. You'll notice that LLMs sometimes work this way too, especially with follow-up prompting once the needed embeddings from the initial prompt are already available; for some responses, an LLM is fast like this.
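To illustrate the "completion as a lookup" idea in general terms (a toy sketch, not the linked library's actual API):

    from collections import Counter, defaultdict

    # Build a bigram table once (the "training" step)...
    counts = defaultdict(Counter)
    tokens = "the cat sat on the mat because the cat was tired".split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1

    # ...then "inference" is just a dictionary lookup, no GPU involved
    def complete(word: str) -> str:
        return counts[word].most_common(1)[0][0] if counts[word] else ""

    print(complete("the"))  # -> "cat"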
On-prem:
Beyond the GPU requirement, it's not fundamentally different from any other web server. You can buy/build a gaming PC with a decent GPU, forward ports, get a domain, install a cert, run your model locally, and now you have an LLM server online. If you like the Raspberry Pi, you might look into the NVIDIA Jetson Nano (https://www.nvidia.com/en-us/autonomous-machines/embedded-sy...) as it's basically a tiny computer like the Pi, but with a GPU and designed for AI. So you can cheaply and easily get an AI/LLM server running out of your apartment.
Cloud & serverless:
Hosting is not very different from conventional web servers, except that the hardware has more VRAM and the software is designed for LLM serving rather than a typical web backend (different DB technologies, different frameworks/libraries). Of course AWS already has options for deploying your own models, and there are a number of tutorials showing how to deploy Ollama on EC2. There are also serverless providers (Replicate, Lightning.AI); these are your Vercels and Herokus that you might pay a little more for, but you get the convenience of getting up and running quickly.
TLDR: It's like any other web server except you need more GPU/VRAM to do training and inference. Whether you want to run it yourself on-prem, host in the cloud, use a PaaS, etc., those are mostly the same decisions as any other project.