[^1]: https://news.ycombinator.com/item?id=36134249
You’re stuck with OpenAI, and you’re stuck with whatever rules, limitations, or changes they give you.
There are other models, but if you’re actively using GPT-4 and find GPT-3.5 to be below the quality you require…
Too bad. You’re out of luck.
Wait for better open-source models, wait patiently for someone to release a meaningful competitor, or wait for OpenAI to release a better version.
That’s it. Right now, no one else is giving people access to a model equivalent to GPT-4.
A quick test of the huggingface demo gives reasonable results[1]. The actual model behind the space is here[2], and should be self-hostable with reasonable effort.
[0] https://arxiv.org/abs/2305.14314
[1] https://huggingface.co/spaces/uwnlp/guanaco-playground-tgi
[2] https://huggingface.co/timdettmers/guanaco-33b-merged
And this spreadsheet shows a pretty comprehensive list of LLMs: https://anania.ai/chatgpt-alternatives/
Currently the "best" ones seem to be Llama and Dolly. Dolly can be used commercially, and Llama cannot, so it's best for personal use.
I myself have been trying to get [the huggingface chat ui](https://github.com/huggingface/chat-ui) running on my own system, but it's finicky. Right now I'm focused on getting immediate income so I can't spend too much effort on it.
Overall, no open model gets close to the accuracy of GPT-3 or GPT-4 (though Llama does decently), but I can definitely imagine open source matching or even exceeding the capabilities of OpenAI's models in three years or so.
The precipitating factor is cost: running large models for research is very expensive, but that pales in comparison to putting these things into production. Expenses rise steeply with model size. Everyone is looking for ways to make the models smaller and run them at the edge. I will note that PaLM 2 is smaller than PaLM, the first time I can remember something like that happening. The smallest version of PaLM 2 can run at the edge. Small is beautiful.
Works on all platforms, but runs much better on Linux.
Running this in Docker on my 2080 Ti, I can barely fit 13B 4-bit models into 11 GB of VRAM, but it works fine and produces around 10-15 tokens/second most of the time. It also has an API that you can use with something like LangChain.
Supports multiple ways to run the models: purely with CUDA (I think AMD support is coming too) or on CPU with llama.cpp (it's also possible to offload part of the model to GPU VRAM, but the performance is still nowhere near CUDA).
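For reference, here's a minimal sketch of that llama.cpp route with partial GPU offload, using the llama-cpp-python bindings. The model file, layer count, and prompt format are placeholders; adjust for whatever quantized model and VRAM you actually have:

```python
# Minimal sketch: load a 4-bit quantized model and offload part of it to the GPU.
# Assumes `pip install llama-cpp-python` built with CUDA support; paths are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/wizardlm-13b.ggmlv3.q4_0.bin",  # any 4-bit quantized model file
    n_gpu_layers=32,   # how many layers to push into VRAM; lower this if you run out
    n_ctx=2048,        # context window
)

out = llm("### Instruction: Summarize what a LoRA is.\n### Response:", max_tokens=128)
print(out["choices"][0]["text"])
```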
Don't expect open-source models to perform as well as ChatGPT, though; they're still pretty limited in comparison. A good place to get the models is TheBloke's page: https://huggingface.co/TheBloke. Tom converts popular LLM builds into multiple formats that you can use with textgen, and he's a pillar of the local LLM community.
I'm still learning how to fine-tune/train LoRAs; it's pretty finicky but promising. I'd like to be able to feed personal data into the model and have it reliably answer questions about it.
In my opinion, these developments are way more exciting than whatever OpenAI is doing. No way I'm pushing my chatlogs into some corp datacenter, but running locally and storing checkpoints safely would achieve my end-goal of having it "impersonate" myself on the web.
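For anyone curious what that LoRA fine-tuning setup roughly looks like, here's a minimal sketch with Hugging Face `peft`. The base model, target modules, and hyperparameters are just placeholders, not recommendations:

```python
# Rough sketch of attaching a LoRA adapter to a causal LM with peft.
# Assumes `pip install transformers peft`; model name and target modules are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "openlm-research/open_llama_3b"  # placeholder small base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; depends on the architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable

# From here you'd train with the usual transformers Trainer on your own data.
```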
The “best” self-hostable model is a moving target. As of this writing it’s probably one of Vicuña 13B, Wizard 30B, or maybe Guanaco 65B. I’d like to say that Guanaco is wildly better than Vicuña, what with its 5x larger size. But… that seems very task dependent.
As anecdata: my experience is that none of these is as good as even GPT-3.5 for summarization, extraction, sentiment analysis, or assistance with writing code. Figuring out how to run them is painful. The speed at which their unquantized variants run on any hardware I have access to is painful. Sorting through licensing is… also painful.
And again: they’re nowhere close to GPT-4.
https://chat.lmsys.org/?leaderboard
The short answer is that nothing self-hosted comes close to GPT-4. The only thing that comes close, period, is Anthropic's Claude.
There are open source models that are fine tuned for different tasks, and if you're able to pick a specific model for a specific use case you'll get better results.
---
For example, there are models like `mpt-7b-chat`, `GPT4All-13B-snoozy`, or `vicuna` that do okay for chat, but are not great at reasoning or code.
Other models, like `mpt-7b-instruct`, are designed for direct instruction following but are worse at chat.
Meanwhile, there are models designed for code completion, like Replit's model and HuggingFace's `starcoder`, that do decently for programming but not other tasks.
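If you want to poke at one of these from Python rather than a UI, a minimal sketch with the `transformers` pipeline looks like the following (using `mpt-7b-chat` as the example since it's mentioned above; MPT needs `trust_remote_code=True`, and prompt formatting is model-specific):

```python
# Minimal sketch: run one of the chat-tuned models mentioned above with transformers.
# Assumes enough RAM/VRAM for a 7B model; `pip install transformers accelerate`.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mosaicml/mpt-7b-chat",
    trust_remote_code=True,   # MPT ships custom model code
    device_map="auto",        # put the model on a GPU if one is available
)

print(generator("What is a good self-hosted LLM UI?", max_new_tokens=100)[0]["generated_text"])
```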
---
For UI the easiest way to get a feel for quality of each of the models (or, chat models at least) is probably https://gpt4all.io/.
And as others have mentioned, for providing an API that's compatible with OpenAI, https://github.com/go-skynet/LocalAI seems to be the frontrunner at the moment.
---
For the project I'm working on (in bio) we're currently struggling with this problem too since we want a nice UI, good performance, and the ability for people to keep their data local.
So at least for the moment, there's no single drop-in replacement for all tasks. But things are changing week by week, and I believe that open-source, local models can be competitive in the end.
For compatibility with the OpenAI API one project to consider is https://github.com/go-skynet/LocalAI
None of the open models are close to GPT-4 yet, but some of the LLaMA derivatives feel similar to GPT3.5.
Licenses are a big question though: if you want something you can use for commercial purposes your options are much more limited.
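As a sketch of what that OpenAI-API compatibility buys you: you can point the regular `openai` Python client (pre-1.0 style) at a local LocalAI endpoint instead of api.openai.com. The port and model name below are assumptions; use whatever your LocalAI instance is actually configured to serve:

```python
# Sketch: use the standard openai client (pre-1.0 API) against a local OpenAI-compatible server.
# The URL, port, and model name are placeholders for whatever LocalAI is set up to serve.
import openai

openai.api_base = "http://localhost:8080/v1"  # your LocalAI endpoint
openai.api_key = "not-needed-locally"         # the client requires some value

resp = openai.ChatCompletion.create(
    model="ggml-gpt4all-j",  # model name as registered in the local server
    messages=[{"role": "user", "content": "Hello from a self-hosted model!"}],
)
print(resp["choices"][0]["message"]["content"])
```

The same trick is why any ChatGPT front-end that lets you change the endpoint URL will work against a local server.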
I'm the founder of Mirage Studio, and we created https://www.mirage-studio.io/private_chatgpt, a privacy-first ChatGPT alternative that can be hosted on-premise or on a leading EU cloud provider.
Wizardlm-uncensored-30B is fun to play with.
(You can use any ChatGPT front-end which lets you change the OpenAI endpoint URL.)
[0] https://huggingface.co/TheBloke/guanaco-65B-HF (a QLoRA finetune of LLaMA-65B by Tim Dettmers; paper: https://arxiv.org/abs/2305.14314)
As for open models, HuggingFace has a nice leaderboard to see which ones are decent: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...
For general use, Falcon seems to be the current best[0][1].
For code specifically, Replit's model seems to be the best.
Most of the open source stuff people are talking about is things like running a quantized 33B parameter LLaMA model on a 3090. That can be done on consumer hardware, but isn't quite as good at general-purpose queries as GPT-4. Depending on your use case and your ability to fine-tune it, that might be sufficient for a number of applications. Particularly if you've got a very specific task.
However, there are bigger models available (e.g. Falcon 40B, LLaMA 65B) that can be run on data-center-class machines, if you're willing to spend $15-20K.
Will that get you GPT-4 level inference? Probably not (though it is difficult to quantify); will it get you a high-quality model that can be further fine-tuned on your own data? Yes.
For the smaller models, the fine-tunes for various tasks can be fairly effective; in a few more weeks I expect they'll have continued to improve significantly. There are new capabilities being added every week.
The biggest weakness that's been highlighted in research is that the open source models aren't as good at the wide range of tasks that OpenAI's RLHF has covered; that's partly a data issue and partly a training issue.
[0]: https://huggingface.co/tiiuae/falcon-40b-instruct [1]: https://huggingface.co/tiiuae/falcon-40b-instruct/blob/main/...
EDIT: I just realized you seem to be asking for a fully realized, turn-key commercial solution. Yeah, refer to others who say there's no alternative. It's true. Something like this gives you a lot more power and flexibility, but at the cost of a lot more work building the solution as you try to apply it.
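To give a feel for the build-it-yourself side: loading one of those bigger models is not much code, even if the hardware is. A rough sketch with transformers + bitsandbytes 4-bit loading, using Falcon-40B-instruct from the footnotes as the example (it needs `trust_remote_code=True`, and you still need serious GPU memory even at 4-bit):

```python
# Rough sketch: load a larger open model (Falcon-40B-instruct) quantized to 4-bit across GPUs.
# Assumes `pip install transformers accelerate bitsandbytes` and multiple large GPUs.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,        # bitsandbytes 4-bit quantization (same family of tricks as QLoRA)
    device_map="auto",        # shard across whatever GPUs are available
    trust_remote_code=True,   # Falcon ships custom modelling code
)

inputs = tokenizer("Explain what fine-tuning on proprietary data involves.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=100)[0]))
```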
I'm especially interested since the data center I'm working for is sitting on a bunch of A100 and I get daily requests of people asking for LLMs tuned to specific cases, who can't or won't use OpenAI for various reasons.
They also have A/B testing with a leaderboard, where Vicuna wins among the self-hostable ones: https://chat.lmsys.org/?leaderboard
https://lmsys.org/blog/2023-05-25-leaderboard/
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...
https://assets-global.website-files.com/61fd4eb76a8d78bc0676...
https://www.mosaicml.com/blog/mpt-7b
Also keep up to date with r/LocalLLaMA where new best open models are posted all the time.
https://lmsys.org/blog/2023-05-25-leaderboard/
But unfortunately for now it seems there aren't any viable self-hosted options...
It's a simple app download and allows you to select from multiple available models. No hacking required.
GPT AI actually gives me hope. What if we can store and run an AI in a phone-sized device that is superior to a similarly sized library of books? Can we have a rugged, solar-powered device that could survive the fall of civilization and help us rebuild?
It would certainly have military applications in warfare. Imagine being the 21st-century equivalent of a 1940s US Marine on Guadalcanal who needs to know some survival skills. ChatGPT-on-a-phone would be handy if you could keep the battery charged.
With a 4090, you can get ChatGPT 3.5 level results from Guanaco 33B. Vicuna 13B is a solid performer on more resource-constrained systems.
I'd urge the naysayers who tried the OPT and LLaMA models and gave up to note that the LLM field is moving very quickly - the current set of models is already vastly superior to the LLaMA models from just two months ago. And there is no sign the progress is slowing - in fact, it seems to be accelerating.
No kidding, and I am calling it on the record right here.
OpenAI will release an 'open source' model to try to recoup their moat in the self-hosted/local space.
https://www.theinformation.com/briefings/openai-readies-new-...
The big models, if even available, need >100GB of graphics memory to run and would likely take minutes to warm up.
The pricing available via OpenAI/GCP/etc is only effective when you can multi-tenant many users. The cost to run one of these systems for private use would be ~$250k per year.
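A back-of-the-envelope check on those figures (all numbers here are rough assumptions, not quotes): a 65B-parameter model in fp16 is ~130 GB of weights alone, and an 8x A100 cloud instance at roughly $30/hour running 24/7 lands right in that ballpark:

```python
# Rough, assumption-laden arithmetic behind the ">100GB" and "~$250k/year" ballparks.
params_b = 65                      # billions of parameters (e.g. LLaMA-65B)
fp16_weights_gb = params_b * 2     # 2 bytes per parameter -> ~130 GB, before KV cache/overhead

hourly_rate = 30                   # assumed $/hour for an 8x A100 cloud instance
annual_cost = hourly_rate * 24 * 365
print(fp16_weights_gb, annual_cost)  # ~130 GB, ~$262,800/year
```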
It’s actually impressive how good the open models are, considering the limited resources their creators have.