Serverless AI is quickly becoming popular precisely because of the scenario you’re describing: it’s currently still pretty hard to deploy your own GPU stack, not to mention crazy expensive to run e.g. an A100 24/7, plus the orchestration needed to scale up and down. It’s why so many model authors don’t host their own demos anymore and simply toss them on HuggingFace or Replicate directly.
Serverless providers will basically do the infra for you, as well as make the necessary GPU reservations, then charge you a slight premium on the reserved price, so you’d pay less than on-demand on GCP/AWS while they benefit from economies of scale.
I do imagine that at some point soon GPUs will become cheaper and more widely available, so you could get away with hosting your own VPS in the distant (or near?) future.
Each model is built on some predefined model architecture, and the majority of today's LLMs are implementations of the Transformer architecture from the "Attention Is All You Need" paper (2017). When you fine-tune a model, you usually start from a checkpoint and then, using techniques like LoRA or QLoRA, you compute new weights. You do this in your training/fine-tuning script using PyTorch or some other framework.
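As a rough illustration, a LoRA fine-tune with the Hugging Face peft library looks something like this (a minimal sketch; the base checkpoint, target modules, and hyperparameters are placeholder assumptions, not a recipe):

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "meta-llama/Llama-2-7b-hf"  # placeholder starting checkpoint
    model = AutoModelForCausalLM.from_pretrained(base)
    tokenizer = AutoTokenizer.from_pretrained(base)

    # LoRA trains small low-rank adapter matrices instead of the full weight set
    lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # only a tiny fraction of weights are trainable

    # ...run a normal PyTorch training loop or the HF Trainer here...
    model.save_pretrained("my-lora-adapter")  # the new weights you'll load for inference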
Once the training is done you get the final weights -- a binary blob of floats. Now you need to plug those weights back into the inference architecture of the model. You do that by using the same framework you trained with (PyTorch) to construct the inference pipeline. You can also build your own framework/inference engine if you want and try to beat PyTorch :) The pipeline will consist of steps like the following (see the sketch after this list):
- loading the model weights
- doing pre-processing on your input
- building the inference graph
- running your input (embeddings/vectors) through the graph
- generating predictions/results
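In PyTorch/transformers terms, the whole pipeline can be sketched like this (hedged: paths and generation settings are placeholders):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # 1. Load the model weights (your trained/fine-tuned checkpoint)
    model = AutoModelForCausalLM.from_pretrained("path/to/my-finetuned-model")
    tokenizer = AutoTokenizer.from_pretrained("path/to/my-finetuned-model")
    model.eval()

    # 2. Pre-process the input into token ids
    inputs = tokenizer("What does serverless GPU hosting mean?", return_tensors="pt")

    # 3./4. Run the input through the inference graph (no gradients needed)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=64)

    # 5. Decode the generated tokens into the final prediction/result
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))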
Now, the execution of this pipeline can be done on GPU(s), so all the computations (matrix multiplications) are super fast and the results are generated quickly, or it can still run on good old CPUs, just much slower. Tricks like quantization of the model weights can be used here to reduce the model size and speed up execution by trading off some precision/accuracy.
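For example, one common route is loading the weights in 4-bit via the transformers + bitsandbytes integration (a sketch; the path is a placeholder and this is just one of several ways to quantize):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Load weights in 4-bit: roughly 4x smaller than fp16, at some cost in quality
    quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
    model = AutoModelForCausalLM.from_pretrained(
        "path/to/my-finetuned-model",     # placeholder path
        quantization_config=quant_config,
        device_map="auto",                # place layers on GPU(s) if available
    )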
Tools like Ollama or vLLM abstract away all of the above steps, and that's why they are so popular -- they also let you bring your own (fine-tuned) model.
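For instance, vLLM's offline API collapses the whole pipeline into a few lines (a sketch; the model path and sampling settings are placeholders):

    from vllm import LLM, SamplingParams

    # vLLM handles weight loading, batching and GPU scheduling for you
    llm = LLM(model="path/to/my-finetuned-model")
    params = SamplingParams(temperature=0.7, max_tokens=64)
    outputs = llm.generate(["What does serverless GPU hosting mean?"], params)
    print(outputs[0].outputs[0].text)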
On top of the pure model execution, you can create a web service that serves your model via an HTTP or gRPC endpoint. It could accept a user query/input and return JSON with the results. Then it can be incorporated into any application, become part of another service, etc.
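A minimal sketch of such a wrapper with FastAPI (the generate() stub and route name are made up for illustration; you'd swap in the real inference call from the pipeline above):

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class Query(BaseModel):
        prompt: str

    def generate(prompt: str) -> str:
        # placeholder: replace with your real inference call
        return f"echo: {prompt}"

    @app.post("/generate")
    def generate_endpoint(query: Query):
        return {"result": generate(query.prompt)}

    # run with: uvicorn server:app --host 0.0.0.0 --port 8000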
So, the answer is much more than "get the GPU and run with it", and I think it's important to be aware of all the steps required if you want to really understand what goes into deploying custom ML models and putting them to good use.
1/ bought a gaming rig off craigslist
2/ set it up to run as a server (auto-start services, auto-start if power is cut, etc.)
3/ set up a cloud Redis server
4/ the gaming rig fetches tasks from the Redis queue, processes them, and updates them
Queueing in front of ML models is important. GPUs are easily overwhelmed and will basically fail 90% of traffic and hit max latency if you send them too much at once. Let the GPU server run at its own pace, pulling work off the queue, roughly as in the sketch below.
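A hedged sketch of that worker loop (the Redis host, queue name, and task format are made up for illustration):

    import json
    import redis

    r = redis.Redis(host="my-cloud-redis.example.com", port=6379)  # placeholder host

    def run_inference(prompt: str) -> str:
        # placeholder: call your local model here
        return f"echo: {prompt}"

    # The GPU box pulls one task at a time, so it never takes on more than it can handle
    while True:
        _, raw = r.blpop("inference:tasks")    # block until a task is available
        task = json.loads(raw)
        result = run_inference(task["prompt"])
        r.set(f"inference:result:{task['id']}", json.dumps(result))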
It’s “serverless” and you only pay for the compute you use.
I'm nothing even close to a knowledgeable expert on this, but I've dabbled enough to have an idea how to answer this.
The special thing about why GPUs are used to train models is that their architecture is well suited to parallelizing the kind of vector math that goes on under the hood when adjusting the weights between nodes in all the layers of a neural network.
The weight files (GGUF, etc.) are just data describing how the network needs to be built up to be functional. Think compressed ZIP file versus uncompressed text document.
You can run a lot of models on just a CPU, but it's gonna be slooooooow. For example, I've been tinkering with running a Mixtral 8x7B model on my 2019 Intel MacBook Pro with llama.cpp. It works, but it runs at maybe 1-2 tokens per second at most, and that's even with the limited GPU offloading I figured out how to do.
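For reference, the same kind of setup through the llama-cpp-python bindings looks roughly like this (a sketch; the GGUF file name and layer count are placeholders):

    from llama_cpp import Llama

    # n_gpu_layers controls how many layers are offloaded to the GPU:
    # 0 means pure CPU, -1 means offload as many as will fit
    llm = Llama(model_path="mixtral-8x7b-instruct.Q4_K_M.gguf", n_gpu_layers=8, n_ctx=4096)
    out = llm("Explain what a GGUF file is in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])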
1. Train a student model from your fine-tuned model. (Known as "knowledge distillation").
2. Quantize the student model so that it uses integers.
You might also prune the model to get rid of some close-to-zero weights.
This will get you a smaller model that can probably run OK on a CPU, but will also be much more efficient on GPU.
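For step 2, PyTorch's post-training dynamic quantization is the simplest starting point (a sketch; the stand-in model below takes the place of whatever came out of the distillation step):

    import torch
    import torch.nn as nn

    # Example stand-in for the distilled student (replace with your real model)
    student_model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))

    # Convert the Linear layers to int8 for cheaper CPU inference
    quantized_student = torch.quantization.quantize_dynamic(
        student_model, {nn.Linear}, dtype=torch.qint8
    )
    torch.save(quantized_student.state_dict(), "student-int8.pt")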
Next: architect your code so that the inference step sits behind a queue. You do not generally want the user interface waiting on an inference event, because you can't guarantee latency or resource availability, and your model's inference processing will be the biggest, slowest thing in your stack, so you can't afford to overprovision.
So have a queue of "things to infer", and have your inference process run in the background, chomping through the backlog and storing the results in your database. When it infers something, notify your front-end clients somehow that the result is ready in the database for them to retrieve. In this model, you can potentially run your model somewhere cheaper than AWS (e.g. a cheaper provider, or a machine under your desk).
Or, for the genius move: compile the model to ONNX and run it in a background thread in the user's browser, and then you don't have to worry about provisioning; users will wonder why their computer runs so slowly, though.
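The export side of that is straightforward from PyTorch (a sketch; the model and example input are placeholders, and the resulting .onnx file would then be loaded by something like onnxruntime-web in the browser):

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))  # stand-in model
    example_input = torch.randn(1, 768)                                       # placeholder input

    # Trace the model once with an example input and write out the ONNX graph
    torch.onnx.export(
        model,
        example_input,
        "model.onnx",
        input_names=["input"],
        output_names=["output"],
        dynamic_axes={"input": {0: "batch"}},
    )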
- Replicate: $0.05 input, $0.25 output
- Octo: $0.15
- Anyscale: $0.15
- Together: $0.20
- Fireworks: $0.20
If you're looking for an end-to-end flow that will help you gather the training data, validate it, run the fine tune and then define evaluations, you could also check out my company, OpenPipe (https://openpipe.ai/). In addition to hosting your model, we help you organize your training data, relabel if necessary, define evaluations on the finished fine-tune, and monitor its performance in production. Our inference prices are higher than the above providers, but once you're happy with your model you can always export your weights and host them on one of the above!
They have examples on GitHub on how to deploy; I did it last year and it was pretty straightforward.
You'll need GPUs for inference + you'll have to quantize the model + have it hosted in the cloud. The platform I've built is around the same workflow (but all of it is automated, along with autoscaling, and you get an API endpoint; you only pay for the compute you host on).
Generally, the GPU(s) you choose will depend on how big the model is + how many tokens/sec you're looking to get out of it.
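A back-of-the-envelope sizing rule: the weights alone take roughly params x bytes-per-param of VRAM, plus headroom for the KV cache and activations. For example:

    def weight_vram_gb(n_params_billion: float, bits_per_param: int) -> float:
        # Rough VRAM needed just for the weights (ignores KV cache/activations)
        return n_params_billion * 1e9 * bits_per_param / 8 / 1e9

    print(weight_vram_gb(7, 16))  # ~14 GB for a 7B model in fp16
    print(weight_vram_gb(7, 4))   # ~3.5 GB for the same model quantized to 4-bit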
The next easiest step would be OpenAI. They offer extremely easy fine-tuning: more or less, you upload a file with training data to their web app, then a few hours later it's done, and you just use your API key + a model ID for your particular fine-tuned model.
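Through their Python SDK the same flow looks roughly like this (a sketch; the file name and base model are placeholders, so check their docs for the currently fine-tunable models):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Upload the JSONL training data, then kick off a fine-tuning job
    f = client.files.create(file=open("training_data.jsonl", "rb"), purpose="fine-tune")
    job = client.fine_tuning.jobs.create(training_file=f.id, model="gpt-4o-mini-2024-07-18")

    # When the job finishes, it yields a fine-tuned model id you call like any other model
    print(job.id)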
If you need data privacy and don't trust OpenAI and Anthropic when they say they don't train on API call data, then you will need a local model.
If it’s a small model you might be able to host it on a regular server with CPU inference (see llama.cpp)
Or a big model on CPU, but really slowly
But realistically you’ll probably want to use GPU inference
Either running on GPUs all the time (no cold start times) or on serverless GPUs (but then the downside is the instances need to start up when needed, which might take 10 seconds)
Disclaimer: I’m the creator of dstack.
It's faster to use a GPU. If you tried to play a game on a laptop with onboard graphics vs. a good external graphics card, it might technically work, but a good GPU gives you more processing power and VRAM to make it a faster experience. Same thing here.
When is a GPU needed:
You need it both for initial training (which it sounds like you've done) and also when someone prompts the LLM and it processes their query to generate a response (called inference). So to answer your question: your web server that handles incoming LLM queries also needs a great GPU, because with any amount of user activity it will be running effectively 24/7 as users continually prompt it, just as they would use any other site you have online.
When is a GPU not needed:
Computationally, inference is just "next token prediction", but depending on how the user enters their prompt, it's sometimes able to provide those predictions (called completions) from pre-computed embeddings, in other words by performing a simple lookup, and the GPU is not invoked. For example, in this autocompletion/token-prediction library I wrote that uses an ngram language model (https://github.com/bennyschmidt/next-token-prediction), a GPU is only needed for the initial training on text data; there's no heavy inference component, so completions are fast lookups that don't invoke the GPU. An LM like this could be trained offline and deployed cheaply, with no cloud GPU needed. You'll notice that LLMs sometimes work this way too, especially with follow-up prompting once the needed embeddings from the initial prompt are already available; for some responses, an LLM is fast like this.
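To illustrate the "completion as a lookup" idea in general terms (a toy sketch, not the linked library's actual API):

    from collections import Counter, defaultdict

    # Build a bigram table once (the "training" step)...
    counts = defaultdict(Counter)
    tokens = "the cat sat on the mat because the cat was tired".split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1

    # ...then "inference" is just a dictionary lookup, no GPU involved
    def complete(word: str) -> str:
        return counts[word].most_common(1)[0][0] if counts[word] else ""

    print(complete("the"))  # -> "cat"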
On-prem:
Beyond the GPU requirement, it's not fundamentally different from any other web server. You can buy/build a gaming PC with a decent GPU, forward ports, get a domain, install a cert, run your model locally, and now you have an LLM server online. If you like the Raspberry Pi, you might look into the NVIDIA Jetson Nano (https://www.nvidia.com/en-us/autonomous-machines/embedded-sy...) as it's basically a tiny computer like the Pi, but with a GPU and designed for AI. So you can cheaply and easily get an AI/LLM server running out of your apartment.
Cloud & serverless:
Hosting is not very different from conventional web servers, except that the hardware has more VRAM and the software is designed for LLM serving rather than a typical web backend (different DB technologies, different frameworks/libraries). Of course AWS already has options for deploying your own models, and there are a number of tutorials showing how to deploy Ollama on EC2. There are also serverless providers (Replicate, Lightning.AI); these are your Vercels and Herokus that you might pay a little more for, but you get the convenience of getting up and running quickly.
TLDR: It's like any other web server except you need more GPU/VRAM to do training and inference. Whether you want to run it yourself on-prem, host in the cloud, use a PaaS, etc., those are mostly the same decisions as any other project.