HACKER Q&A
📣 lavren1974

Code Llama 70B on a dedicated server


I want to rent a server with 128 GB of RAM for my web projects, but primarily for running Code Llama 70B models. Is this possible without video memory (i.e., no GPU)?


  👤 paulzain Accepted Answer ✓
I run CodeLlama 70B in my home lab using two RTX 4090s and the MLC AI framework. It provides multi-GPU support and is extremely fast on consumer-grade hardware. I'm seeing about 30 tokens/sec.

https://blog.mlc.ai/2023/10/19/Scalable-Language-Model-Infer...


👤 mrtimo
I wanted my students to be able to run open source models. The easiest way was Ollama plus Ollama Web UI (now Open WebUI) on Google Cloud for about $0.18 an hour, on a spot instance with 4 vCPUs, an NVIDIA GPU, and 16 GB of RAM. I created a tutorial for my students: https://docs.google.com/document/d/1OpZl4P3d0WKH9XtErUZib5_2...
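
If it helps, here's a minimal sketch of querying the Ollama server over its HTTP API from Python once it's up on the VM; the model tag and prompt are just placeholders, and it assumes the model has already been pulled:

    # Query a running Ollama server; assumes `ollama pull codellama:7b` was done first.
    import requests

    OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

    payload = {
        "model": "codellama:7b",  # smaller variant; 70b needs far more memory
        "prompt": "Write a Python function that reverses a string.",
        "stream": False,          # return a single JSON object instead of a stream
    }

    resp = requests.post(OLLAMA_URL, json=payload, timeout=300)
    resp.raise_for_status()
    print(resp.json()["response"])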

👤 loudmax
It's technically possible, in the sense that you will eventually get a complete response, but it will be extremely slow. Inference will be something like 1 word per second. Far too slow to use as an assistant for writing code.

If you don't want to rent something with a GPU, you should look into running non-70B models and experimenting with varying levels of quantization. Without a GPU you really want to start with 7B models, not 70B models. The Mistral variants in particular are worth a try.
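
For example, a rough CPU-only sketch with llama-cpp-python and a Q4 GGUF quant of a 7B model; the file name is a placeholder for whatever quant you download:

    # CPU-only inference with a quantized 7B model via llama-cpp-python.
    # The GGUF path is a placeholder; grab a Q4_K_M quant of e.g. Mistral 7B.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./mistral-7b-instruct.Q4_K_M.gguf",  # ~4-5 GB on disk
        n_ctx=4096,      # context window
        n_threads=8,     # roughly match your physical core count
    )

    out = llm("Write a Python function that checks whether a number is prime.",
              max_tokens=256, temperature=0.2)
    print(out["choices"][0]["text"])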


👤 jesprenj
I had a similar idea yesterday; I'd like to run Code Llama 70B on a GPU supercomputer cluster I have access to. But the wall I hit is the licence agreement by Facebook.

I'd rather download the models from 3rd-party unofficial sources/torrents, since that'd be legal where I live for personal research usage. Does anyone know of a server that hosts the Code Llama model weights, or a torrent? I have downloaded the torrent with the Llama weights, but I'd rather use the Code variant, if possible.


👤 windexh8er
As an alternative, MassedCompute [0] has some interesting rental options (billed by the minute). They recently changed their site so it's harder to see the options available, but I've used it a few times and it's generally competitive with respect to price and the environments/features offered. Looks like you unfortunately need to sign up for an account now.

[0] https://massedcompute.com/


👤 jamal-kumar
The only system without a discrete GPU I've seen which pulls this off at a usable 8-ish tokens a second is an M3 MacBook Pro with 128 GB of unified memory [1], key word here being unified... Something like a $6,000 investment. You're possibly better off renting a GPU somewhere or investing in a couple of RTX cards.

[1] https://www.nonstopdev.com/llm-performance-on-m3-max/


👤 TriangleEdge
It might be cheaper for you to call an API to run the inference instead of renting a machine. GPT-4-Turbo goes for $0.01 per 1k input tokens and $0.03 per 1k output tokens on Azure. An x2gd.2xlarge instance on AWS has 8 vCPUs and 128 GB of memory and goes for ~$200 per month.
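
As a back-of-the-envelope comparison (the monthly token volume below is just an assumed example workload):

    # Rough cost comparison using the prices above; the token volume is a made-up example.
    gpt4_in_per_1k = 0.01     # $ per 1k input tokens
    gpt4_out_per_1k = 0.03    # $ per 1k output tokens
    server_monthly = 200.0    # $ per month for the x2gd.2xlarge

    monthly_in, monthly_out = 5_000_000, 1_000_000   # assumed usage
    api_cost = monthly_in / 1000 * gpt4_in_per_1k + monthly_out / 1000 * gpt4_out_per_1k
    print(f"API:    ${api_cost:.2f}/month")          # -> $80.00/month
    print(f"Server: ${server_monthly:.2f}/month")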

👤 mirekrusin
Matt Williams just posted a quick video you may be interested in [0], where he describes an easy setup with Brev and Tailscale.

[0] https://www.youtube.com/watch?v=QRot1WtivqI


👤 theolivenbaum
Hetzner just added GPU servers; might be worth checking out too.

👤 hnfong
"Yes", but it's going to be unbearably slow on CPU without some form of GPU acceleration.

You'll probably get more info/responses from https://www.reddit.com/r/LocalLLaMA/


👤 wkoszek
It should work on a Mac Studio with Ollama.

👤 cheema33
Is it correct to assume that you want to do this for privacy reasons only?

I currently use GPT-4 for software development work that does not have those concerns. I am assuming that someone in my position would not benefit from running their own server for this sort of thing. Even with a couple of 4090s.


👤 2099miles
Do you guys really get that much help from these local LLMs? ChatGPT has SO much more functionality and it's still quite limited. Why is it worth it for you to run these things locally? What are they doing that provides you so much value?


👤 cjbprime
Possible? Yes. Unusably slow? Also yes.

(You can rent a server with two 3090s for around $1/hour, or buy two used 3090s for around $1700.)


👤 ilaksh
You might look into something like together.ai or RunPod, which offer token-based or per-minute usage.

👤 Havoc
It works but is painfully slow. Unless you don't need results in real time, I wouldn't.

👤 lhl
You can run a Q4 quant of a 70B model in about 40GB of RAM (+context). Your single-user (batch size 1, bs=1) inference speed will be basically memory-bandwidth bottlenecked, so on a dual-channel dedicated box you'd expect somewhere around 1 token/s. And that's just decoding; prefill/prompt processing will take even longer (as your chat history grows) on CPU. So it falls into the realm of technically possible, but not for real-world use.
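
The rough arithmetic behind that estimate, with an assumed bandwidth figure for a typical dual-channel DDR4-3200 box:

    # At bs=1, every generated token streams the whole quantized model through RAM once,
    # so tokens/s is roughly memory bandwidth / model size.
    model_size_gb = 40.0       # Q4 quant of a 70B model
    mem_bandwidth_gbs = 50.0   # ~2 x 25.6 GB/s dual-channel DDR4-3200 (assumed)

    print(f"~{mem_bandwidth_gbs / model_size_gb:.1f} tokens/s upper bound")  # ~1.2 tokens/s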

If you're looking specifically for CodeLlama 70B, Artificial Analysis https://artificialanalysis.ai/models/codellama-instruct-70b/... lists Perplexity, Together.ai, Deep Infra, and Fireworks as potential hosts, with Together.ai and Deep Infra at about $0.90/1M tokens, about 30 tokens/s, and about 300 ms latency (time to first token).
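
If you go the hosted route, most of those providers expose an OpenAI-compatible endpoint; here's a sketch against Together.ai, where the base URL and model identifier are my assumptions, so check the provider's docs for current values:

    # Calling a hosted CodeLlama 70B endpoint via an OpenAI-compatible API.
    # Base URL and model identifier are assumptions; verify against the provider's docs.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.together.xyz/v1",       # assumed Together.ai endpoint
        api_key="YOUR_TOGETHER_API_KEY",              # placeholder
    )

    resp = client.chat.completions.create(
        model="codellama/CodeLlama-70b-Instruct-hf",  # assumed model id
        messages=[{"role": "user",
                   "content": "Write a shell one-liner that counts lines of code."}],
        max_tokens=256,
    )
    print(resp.choices[0].message.content)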

For those looking for local coding models specifically, I keep a list of LLM coding evals here: https://llm-tracker.info/evals/Code-Evaluation

On the EvalPlus Leaderboard, there are about 10 open models that rank higher than CodeLlama 70B, all of them smaller models: https://evalplus.github.io/leaderboard.html

A few other evals (worth cross-referencing to counter contamination and overfitting):

* CRUXEval Leaderboard https://crux-eval.github.io/leaderboard.html

* CanAiCode Leaderboard https://huggingface.co/spaces/mike-ravkine/can-ai-code-resul...

* Big Code Models Leaderboard https://huggingface.co/spaces/bigcode/bigcode-models-leaderb...

From the various leaderboards, deepseek-ai/deepseek-coder-33b-instruct still looks like the best-performing open model (it has a very liberal ethical license), followed by ise-uiuc/Magicoder-S-DS-6.7B (a deepseek-coder-6.7b-base fine-tune). The former can be run as a Q4 quant on a single 24GB GPU (a used 3090 should run you about $700 atm), and the latter, if it works for you, will run about 4X faster and fit on even cheaper/weaker GPUs.
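
If you do pick up a 24GB card, a sketch of running a Q4 quant fully offloaded with llama-cpp-python (a CUDA build is assumed, and the GGUF filename is a placeholder):

    # Offload all layers of a Q4 quant to a single 24 GB GPU (CUDA build of llama-cpp-python).
    from llama_cpp import Llama

    llm = Llama(
        model_path="./deepseek-coder-33b-instruct.Q4_K_M.gguf",  # placeholder filename
        n_ctx=4096,
        n_gpu_layers=-1,   # -1 = offload every layer to the GPU
    )

    out = llm("Write a function that merges two sorted lists.", max_tokens=256)
    print(out["choices"][0]["text"])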

There's always recent developments, but two worth pointing out:

OpenCodeInterpreter is a new system, fine-tuned from the DeepSeek code models, that uses execution feedback and outperforms ChatGPT-4's Code Interpreter: https://opencodeinterpreter.github.io/

StarCoder2-15B just dropped and also looks competitive. Announcement and relevant links: https://huggingface.co/blog/starcoder2


👤 senthilnayagam
How many users do you plan to support with this setup?