https://blog.mlc.ai/2023/10/19/Scalable-Language-Model-Infer...
If you don't want to rent something with a GPU, you should look into running non-70B models and experimenting with varying levels of quantization. Without a GPU you really want to start with 7B models, not 70B models. The Mistral variants in particular are worth a try.
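If you want to see what that looks like in practice, here's a minimal sketch using llama-cpp-python to run a quantized Mistral 7B entirely on CPU. The GGUF filename and settings are assumptions; substitute whatever quant you download:

```python
# Minimal CPU-only sketch with llama-cpp-python (pip install llama-cpp-python).
# The model path is a placeholder: point it at any quantized GGUF you have,
# e.g. a Q4_K_M quant of Mistral 7B Instruct.
from llama_cpp import Llama

llm = Llama(
    model_path="./mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # hypothetical local path
    n_ctx=4096,    # context window
    n_threads=8,   # tune to your CPU core count
)

out = llm(
    "[INST] Write a Python function that reverses a string. [/INST]",
    max_tokens=256,
)
print(out["choices"][0]["text"])
```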
I'd rather download the models from third-party unofficial sources/torrents, since that would be legal where I live for personal research use. Does anyone know of a server that hosts the Code Llama model weights, or a torrent? I've downloaded the torrent with the Llama weights, but I'd rather use the Code variant if possible.
You'll probably get more info/responses from https://www.reddit.com/r/LocalLLaMA/
I currently use GPT-4 for software development work that does not have those concerns. I am assuming that someone in my position would not benefit from running my own server for this sort of thing, even with a couple of 4090s.
(You can rent a server with two 3090s for around $1/hour, or buy two used 3090s for around $1700.)
If you're looking specifically for CodeLlama 70B, Artificial Analysis https://artificialanalysis.ai/models/codellama-instruct-70b/... lists Perplexity, Together.ai, Deep Infra, and Fireworks as potential hosts. Together.ai and Deep Infra are at about $0.9/1M tokens, with roughly 30 tokens/s throughput and about 300ms latency (time to first token).
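Most of these hosts expose an OpenAI-compatible API, so switching between them is largely a matter of changing the base URL. A hedged sketch; the endpoint and exact model id below are assumptions, so check the provider's docs:

```python
# Sketch of calling a hosted CodeLlama 70B via an OpenAI-compatible endpoint.
# base_url and model id are assumptions for Together.ai; other hosts differ.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",  # assumed Together.ai endpoint
    api_key="YOUR_API_KEY",
)

resp = client.chat.completions.create(
    model="codellama/CodeLlama-70b-Instruct-hf",  # assumed model id
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    max_tokens=512,
)
print(resp.choices[0].message.content)
```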
For those looking specifically for local coding models, I keep a list of LLM coding evals here: https://llm-tracker.info/evals/Code-Evaluation
On the EvalPlus Leaderboard, there are about 10 open models that rank higher than CodeLlama 70B, all of them smaller models: https://evalplus.github.io/leaderboard.html
A few other evals worth cross-referencing (to counter contamination and overfitting):
* CRUXEval Leaderboard https://crux-eval.github.io/leaderboard.html
* CanAiCode Leaderboard https://huggingface.co/spaces/mike-ravkine/can-ai-code-resul...
* Big Code Models Leaderboard https://huggingface.co/spaces/bigcode/bigcode-models-leaderb...
From the various leaderboards, deepseek-ai/deepseek-coder-33b-instruct still looks like the best-performing open model (it has a very liberal license, with only ethical-use restrictions), followed by ise-uiuc/Magicoder-S-DS-6.7B (a deepseek-coder-6.7b-base fine-tune). The former can be run as a Q4 quant on a single 24GB GPU (a used 3090 should run you about $700 atm), and the latter, if it works for you, will run about 4X faster and fit on even cheaper/weaker GPUs.
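For reference, a 4-bit load with transformers + bitsandbytes is roughly what "Q4 on a single 24GB GPU" looks like outside of llama.cpp. A minimal sketch, with illustrative (untuned) settings:

```python
# Sketch: load deepseek-coder-33b-instruct in 4-bit on a single GPU.
# Requires transformers, accelerate, and bitsandbytes; settings are
# illustrative assumptions, not tuned values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/deepseek-coder-33b-instruct"
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb,
    device_map="auto",  # spread layers across available GPU memory
)

prompt = "Write a function that checks whether a number is prime."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tok.decode(out[0], skip_special_tokens=True))
```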
There are always new developments, but two are worth pointing out:
OpenCodeInterpreter - a new system, fine-tuned from the DeepSeek code models, that uses execution feedback and outperforms ChatGPT-4's Code Interpreter (a toy sketch of the feedback loop is below, after the links): https://opencodeinterpreter.github.io/
StarCoder2-15B just dropped and also looks competitive. Announcement and relevant links: https://huggingface.co/blog/starcoder2
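To make the execution-feedback idea behind OpenCodeInterpreter concrete, here's a toy sketch of the loop: generate code, run it, and feed any traceback back into the prompt. The generate() function is a hypothetical stand-in for whatever model you're calling, and the real project is considerably more sophisticated than this:

```python
# Toy sketch of an execution-feedback loop: generate code, execute it,
# and append any error output to the prompt for the next attempt.
import subprocess
import tempfile

def generate(prompt: str) -> str:
    """Placeholder: call your LLM here and return a code string."""
    raise NotImplementedError

def solve_with_feedback(task: str, max_rounds: int = 3) -> str:
    prompt = f"Write a Python script that does the following:\n{task}"
    code = ""
    for _ in range(max_rounds):
        code = generate(prompt)
        # Write the attempt to a temp file and run it in a subprocess.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run(
            ["python", path], capture_output=True, text=True, timeout=30
        )
        if result.returncode == 0:
            return code  # ran cleanly; accept this attempt
        # Feed the traceback back so the next attempt can correct itself.
        prompt += (
            f"\n\nThe previous attempt failed with:\n{result.stderr}\nPlease fix it."
        )
    return code
```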