Who is using small OS LLMs in production?
It's not clear to me whether relatively small open-source LLMs (for instance Replit Code 3B, Llama 2 7B, CodeGen, ...) are actually being used in production or not. What are the motivations for using those models over the prompted GPT API?
I'm integrating Llama 2 7B with an application I'm currently building, and one of the biggest reasons was privacy, followed closely by price, and lastly by being able to get it working locally in a few minutes.
I built a now-abandoned project using the GPT API. It was fine and not terribly expensive for my use case, but customers didn't like the pay-per-usage model, and the alternative was weird UX to limit prompt abuse to something I could afford while bootstrapping a side project.
I see comments here about running Llama on 4090s, which is fine for local development and testing, but getting into production is a significant leap and a significant cost.
The thing that I keep running into in my SLA plans is concurrency. Yes, you can have a Llama 2 model running on an A100 somewhere, but that will support one concurrent prompt. Anything at a higher concurrency needs another GPU, or your end users will be waiting a while. Want to rent an 8-GPU machine in the cloud for inference? Be prepared to pay a lot of money for it.
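To make the cost point concrete, here's a rough back-of-envelope sketch. The hourly rate, decode speed, and concurrency figures are illustrative assumptions, not quotes from any provider:

    # Back-of-envelope: per-token cost of a single rented GPU at assumed rates.
    A100_HOURLY_USD = 2.00     # assumed cloud rent for one A100
    TOKENS_PER_SEC = 30        # assumed single-stream decode speed for a 7B model
    CONCURRENT_STREAMS = 1     # one prompt at a time without batching

    tokens_per_hour = TOKENS_PER_SEC * 3600 * CONCURRENT_STREAMS
    cost_per_1k_tokens = A100_HOURLY_USD / (tokens_per_hour / 1000)
    print(f"~${cost_per_1k_tokens:.4f} per 1k tokens at full utilization")
    # Every additional concurrent user either needs another GPU (cost scales
    # linearly) or a batching inference server so streams share one GPU.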
Facebook is working very hard to make the main dividing line in generative AI not company vs. company but commercial vs free. Starting from way behind, they are trying to make that irrelevant.
Price.
Data privacy.
Controlled latency.
Plenty of reasons to not send arbitrary data to a third party service.
Data security and privacy. Our clients (in aviation, finance, etc.) need this due to legal and regulatory reasons. Also, the new Llama 2 models are very powerful. In my testing, Llama 2 70b is comparable to GPT-3.5 in capability.
(Shameless plug: here's our website: https://www.amw.ai/)
Although we haven't gone down the path of deploying a fine-tuned model on our own infrastructure, we do see that as an eventual reality. Our current feature is disabled for any customer who signs a BAA with us because we can't get a DPA signed with OpenAI, and not for lack of trying. Maybe that resolves itself over time, but the most reliable option available is to fine-tune a model and run it ourselves. It's also likely a more expensive and challenging one, though, hence we're not doing it yet.
To me, the simple models might not cross the boundary where LLMs start to be useful versus, say, a fixed menu with choices in a helpdesk app.
It's a paradox: to really feel human-like and not make huge mistakes, we need these huge LLMs, and they are expensive... and the alternative is not-so-smart traditional code.
So what I'm trying to say is that I think the small LLMs might not be that useful before they cross some arbitrary quality threshold (which they may never do, considering that more parameters => better model, in general).
A friend of mine trained his own GPT-2 because it's faster and cheaper to fine-tune.
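For context, a minimal sketch of what fine-tuning GPT-2 looks like with Hugging Face transformers; the corpus file and hyperparameters are placeholders, not the commenter's actual setup:

    # Minimal GPT-2 fine-tuning sketch (placeholder data and hyperparameters).
    from datasets import load_dataset
    from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                              GPT2TokenizerFast, Trainer, TrainingArguments)

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})  # placeholder

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="gpt2-finetuned",
                               per_device_train_batch_size=4,
                               num_train_epochs=1),
        train_dataset=tokenized["train"],
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()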
I am using one of the uncensored versions of LLaMA 2 to allow chatbot roleplay without constant moralizing and without it replying to every other request with "I am just an AI, I don't have any opinions, emotions, or feelings, I don't like anything", etc.
Exploring a few options in my off-time. Main motivation for an OS LLM is to get it to do things which GPT-3.5/4 are somewhat promising at - but not good enough for applications.
We have several customers who aren't using the OpenAI / Anthropic APIs for privacy reasons. We are spinning up infrastructure and making the features that rely on those APIs work with OS LLMs as well.
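One common way to do this is to keep a single client and route between the hosted API and a self-hosted model behind an OpenAI-compatible server (vLLM, llama.cpp's server, etc.). A rough sketch; the local URL and model name are assumptions:

    # Sketch: one client, two backends, switched by base_url (URL/model assumed).
    from openai import OpenAI

    def make_client(use_local: bool) -> OpenAI:
        if use_local:
            # Self-hosted model behind an OpenAI-compatible server, e.g. vLLM
            return OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
        return OpenAI()  # hosted API; reads OPENAI_API_KEY from the environment

    client = make_client(use_local=True)
    resp = client.chat.completions.create(
        model="meta-llama/Llama-2-7b-chat-hf",  # whatever the local server serves
        messages=[{"role": "user", "content": "Summarize this support ticket..."}],
    )
    print(resp.choices[0].message.content)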
Tangential question - how well does Llama 2 do on coding tasks on less-mainstream languages like Rust?
Running llama-2-7b-chat at 8-bit quantization, completions are essentially at GPT-3.5 level (and instant) on a single RTX 4090 using 15 GB of VRAM. I don't think most people realize just how small and efficient these models are going to become.
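For anyone who wants to reproduce that setup, a sketch using transformers with bitsandbytes 8-bit loading (the model is gated, so you need to accept Meta's license on Hugging Face first; the prompt is just an example):

    # Sketch: load llama-2-7b-chat in 8-bit with transformers + bitsandbytes.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "meta-llama/Llama-2-7b-chat-hf"  # gated: accept Meta's license first
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto",  # should fit in ~15 GB VRAM on a single RTX 4090
    )

    inputs = tokenizer("Explain quantization in one sentence.",
                       return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(out[0], skip_special_tokens=True))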