The biggest backend things: models can run on CPU architectures. You still need highly performant CPUs, but in the cloud you can significantly discount that cost by relying on spot instances (this doesn't mean GPU isn't viable; we just found a real cost lever here that CPU has supported... but things may change and we know how to test that). Further, distributed and parallel (network) processing matter, especially in retrieval-augmented architectures... so we booted Python long ago, and the lower-level APIs aren't simply serializing standard JSON -- think protobuf land (TensorFlow Serving offers inspiration). BTW, we never needed any "vector DB"... integrating real data is complex: embeddings get created through platforms like Airflow, metadata lives in a document store, and everything is made available for fast retrieval on low-gravity disk (e.g. built on top of something like RocksDB... then ANN is easy with tools like FAISS).
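To sketch just the retrieval piece of that (the file names and the flat index are illustrative, not a description of our actual stack): embeddings built offline land in a FAISS index, and metadata is looked up by id from the document store.

    # Sketch: embeddings produced offline (e.g. by an Airflow job) served through a FAISS index,
    # with metadata looked up by id from the document store. File names and the flat index are illustrative.
    import numpy as np
    import faiss

    dim = 768
    embeddings = np.load("embeddings.npy").astype("float32")   # shape: (n_chunks, dim)
    chunk_ids = np.load("chunk_ids.npy")                        # parallel array of ids

    faiss.normalize_L2(embeddings)          # cosine similarity via normalized inner product
    index = faiss.IndexFlatIP(dim)          # exact search; swap for IVF/HNSW variants at scale
    index.add(embeddings)

    def retrieve(query_vec, k=5):
        q = np.asarray(query_vec, dtype="float32").reshape(1, -1)
        faiss.normalize_L2(q)
        scores, rows = index.search(q, k)
        # metadata (source, text, timestamps) would be fetched from the document store by chunk id
        return [(chunk_ids[r], float(s)) for r, s in zip(rows[0], scores[0])]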
The biggest UX thing: integration into real-life workflows has to be approached in a design-centered way, very carefully. Responsible AI means not blindly throwing these kinds of things into critical workflows. This implies what I assume is a fairly complicated frontend effort for most (it was for us, especially as we integrate into real existing applications... we started prototyping with a Chrome extension).
EDIT: btw, OSS has definitely made some of these things easier (especially for greenfield and simple product architectures), but I'm skeptical of the new popular kids on the block that are VC-funded and building abstractions you'll likely need to break through across software, data, and inference serving for anything in a scaled, integrated commercial enterprise.
EDIT2: Monitoring, metrics, and analytics architecture were big efforts as well. Ask yourself: how do you measure "value added by AI"?
My workflow is based on testing: start by defining a set of representative test cases and using them to guide prompting. I tend to prefer programmatic test cases over LLM-based evals, but LLM evals seem popular these days. Then, I create a hypothesis, run an eval, and if the results show improvement I share them with the team. In some of my projects, this is integrated with CI.
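For illustration, a minimal programmatic eval in the style I mean, runnable under pytest and easy to wire into CI. Everything here is a made-up example, not a specific framework; `generate_answer` is a placeholder for whatever calls your model.

    # Sketch of a programmatic (non-LLM) eval: plain Python checks over model outputs.
    import json
    import pytest

    def generate_answer(prompt: str) -> str:
        raise NotImplementedError("replace with your model call")

    CASES = [
        # (prompt, check) pairs; checks are ordinary assertions, no LLM judge involved
        ("Extract the total from: 'Total due: $1,204.50'. Reply with just the number.",
         lambda out: "1204.50" in out),
        ("Return the ISO date for 'March 3rd, 2024' as JSON like {\"date\": \"...\"}.",
         lambda out: json.loads(out)["date"] == "2024-03-03"),
    ]

    @pytest.mark.parametrize("prompt,check", CASES)
    def test_representative_cases(prompt, check):
        assert check(generate_answer(prompt))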
The next step is closing the feedback loop and gathering real-world examples for your evals. This can be difficult to do if you respect the privacy of your users, which is why I prefer a local, open-source CLI. You'll have to set up the appropriate opt-ins, etc., if you gather this data at all.
These tools are not mature enough; no automated decision-making process should rely on them without human intervention, especially when handling other humans. I'm not saying all LLMs are bad. I'm saying their architectural foundation is prone to these problems.
Our measurement was the rate at which we lost customers after they opened tickets complaining about an issue. In our A/B test, we saw roughly 12x more losses when using the LLM as a replacement for our support team.
- Using LLMs to summarize, create structure, etc., rather than relying on them to generate responses
- Having my apps rely on receiving LLM output in an exact format (e.g. an OpenAI function call) and raising an error otherwise (see the sketch below)
- Never showing generated outputs directly to a user or saving them directly to the database (avoids a lot of potential for unexpected behavior)
This kind of stuff is less flashy than chatbots for everything, but IMO has a lot more potential to actually impact how software works at scale in the next ~year or two.
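As a sketch of the exact-format point above (the schema, field names, and error class are made up, not from any particular library):

    # Sketch: require LLM output to match an exact schema; raise instead of guessing.
    import json

    REQUIRED_FIELDS = {"title": str, "due_date": str, "priority": int}

    class LLMFormatError(ValueError):
        pass

    def parse_task(raw_output: str) -> dict:
        try:
            data = json.loads(raw_output)
        except json.JSONDecodeError as e:
            raise LLMFormatError(f"not valid JSON: {e}") from e
        for field, ftype in REQUIRED_FIELDS.items():
            if field not in data or not isinstance(data[field], ftype):
                raise LLMFormatError(f"missing or mistyped field: {field}")
        # only validated, structured data leaves this function -- never raw model text
        return {k: data[k] for k in REQUIRED_FIELDS}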
(1) Observability
Being able to see and store the flow of inputs and outputs for your pipelines. In development, this is invaluable for seeing under the hood and debugging. In production, it's essential for collecting data for eval and training.
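A stripped-down sketch of the idea; a real setup would feed a proper tracing/analytics store, and the JSONL file and names here are just for illustration.

    # Sketch: wrap pipeline steps so every input/output pair is captured and replayable.
    import functools, json, time, uuid

    def traced(step_name):
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                started = time.time()
                result = fn(*args, **kwargs)
                record = {
                    "id": str(uuid.uuid4()),
                    "step": step_name,
                    "inputs": {"args": [repr(a) for a in args],
                               "kwargs": {k: repr(v) for k, v in kwargs.items()}},
                    "output": repr(result),
                    "duration_s": round(time.time() - started, 3),
                }
                with open("trace.jsonl", "a") as f:   # stand-in for a real tracing store
                    f.write(json.dumps(record) + "\n")
                return result
            return wrapper
        return decorator

    @traced("summarize")
    def summarize(text: str) -> str:
        return "..."  # model call goes here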
(2) Configurability
Being able to quickly swap out and try different configurations -- whether it's "prompt templates", model parameters / providers, etc.
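A minimal sketch of what I mean; the config names, providers, and templates are made-up examples.

    # Sketch: prompt templates and model settings live in data, so variants swap without code changes.
    CONFIGS = {
        "summarize-v1": {
            "provider": "openai", "model": "gpt-4", "temperature": 0.2,
            "template": "Summarize the following text in 3 bullet points:\n\n{text}",
        },
        "summarize-v2": {
            "provider": "anthropic", "model": "claude-3-sonnet", "temperature": 0.0,
            "template": "You are a terse analyst. Summarize:\n\n{text}",
        },
    }

    def build_request(config_name, **variables):
        cfg = CONFIGS[config_name]
        return {
            "provider": cfg["provider"],
            "model": cfg["model"],
            "temperature": cfg["temperature"],
            "prompt": cfg["template"].format(**variables),
        }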
To that end, I have my own internal framework & platform that tackles these problems in a holistic way. Here's a short example:
- The basic building block: https://www.loom.com/share/d09d38d2c316468fa9f38f5b386fc114
- Tracing for a more complex system: https://screenbud.com/shot/61dc59a6-5c73-4610-8168-753b79062...
- Code example in TypeScript & Rust: https://gist.github.com/Palmik/42940e3887d6446244f8b74ce0243...
I recently wrote a bit more about this here: https://news.ycombinator.com/item?id=36760727
Would be happy to chat and brainstorm with folks working on similar problems.
(*) I am basing this on experience from working on Google's Bard and now on my own products.
Sometimes it invents answers, which in our environment we cannot ever tolerate. When this major deficiency is fixed and it can simply say "I don't know," then we can consider working with it. One false positive makes it ineffective as a knowledge tool. It would be analogous to a report from an unknown source that contains bad data: you are better off burning it and starting over with data you can trust, rather than attempting to fix it.
At least among the people I've talked with, how much your team invests in evaluations/tests seems to be the major differentiator between the people rushing out hot garbage and those seriously building products in this space.
More specifically: we're running hundreds or thousands of evaluations across a wide range of scenarios for the prompts we're using. We can very quickly see if there are regressions, and we have a small team of people working on improving this functionality.
These problems become multiplicative: if you have 3-4 ChatGPT calls chained together, your failure rate is closer to 10%.
Unfortunately, in the health tech space even a 1% failure rate is unacceptable. I have some theories and PoC work I'm doing to improve the rate. As with all "AI/ML" projects, it's not as simple as taking a problem and applying AI to solve it.
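To make the compounding concrete (the 3% per-call failure rate here is just an assumed number for illustration):

    # If each chained call fails independently with probability p, failures compound:
    p = 0.03    # assumed per-call failure rate (illustrative only)
    for n in (1, 2, 3, 4):
        print(n, "calls ->", round(1 - (1 - p) ** n, 3))
    # 1 -> 0.03, 2 -> 0.059, 3 -> 0.087, 4 -> 0.115 (roughly the ~10% figure for a 3-4 call chain)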
Some improvements we've made: 1) Agents use a global Supabase instance to store "successful" code from tasks, so new users don't have to go through regenerations to find the same solution.
2) Each user has their own Supabase instance to grow valuable business data, and our autonomous sales agents can now maintain conversations across email, SMS, and voice calls. This is our defensive moat vs. Microsoft, since they can't hire overnight to copy this data and can't sell it back to customers. It allows agents to implicitly learn things like high-converting copy, and our new users don't have to start from zero when starting a new business. We even had a customer reply to one agent 5+ times: https://twitter.com/CheatLayer/status/1676959310562349059?s=...
3) We designed a new 'guidelines' framework that dynamically constructs the system prompt based on context (rough sketch after this list), which allowed us to push the limits in specific use cases.
4) A library to divide up tasks and solve them as a hierarchy https://www.youtube.com/watch?v=FFb59WmQoFU
5) The most exciting recent update was adding voice synthesis to our autonomous sales agents, which can now continue conversations across email, SMS, and voice. https://www.youtube.com/watch?v=2s4iQ_joToc
6) We set up a testing framework for the guidelines I mentioned above to make sure we constantly iterate. In this way we can also prove the old GPT-4 model is definitely worse, and some automations don't work at all if you take full advantage of the new model.
Check out our weekly live streams for all the updates; we've been working on this since 2021, so there's been a lot: https://www.youtube.com/@CheatLayer/streams
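The general shape of the guidelines idea in 3), sketched very roughly; the conditions and guideline text here are illustrative, not our production code.

    # Rough shape: each guideline carries an applicability condition, and the system prompt
    # is assembled from whichever guidelines match the current context.
    GUIDELINES = [
        {"when": lambda ctx: ctx.get("channel") == "email",
         "text": "Keep replies under 150 words and include a clear call to action."},
        {"when": lambda ctx: ctx.get("channel") == "sms",
         "text": "Reply in 1-2 short sentences; no links unless asked."},
        {"when": lambda ctx: ctx.get("stage") == "closing",
         "text": "Offer to schedule a call and propose two time slots."},
    ]

    def build_system_prompt(base, ctx):
        applicable = [g["text"] for g in GUIDELINES if g["when"](ctx)]
        return base + "\n\nGuidelines:\n" + "\n".join("- " + t for t in applicable)

    print(build_system_prompt("You are a sales assistant.", {"channel": "sms", "stage": "closing"}))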
Who has deployed an LLM in production, is profitable right now with their LLM startup, hit at least $500K+ MRR in under a year, and is bootstrapped with zero VC financing?
We need to cut through the hype and see if there is sustainable demand behind the abundance of so-called AI startups that seem to keep reporting they are unprofitable or continuously loss-making for years.
What we ended up building looks something like:
- We added a way to log which embedded chunks are being referenced by the searches that result from a prompt, so you can see the relationships between prompts and their search results. Unlike a direct query, this is ambiguous. And lots and lots of logging of what's going on and where it looks, or more importantly where it doesn't look but should. In particular, you want to see if/where it reaches for a custom tool vs. just answering from the core LLM (usually undesirable).
- We built a user-facing thumbs up/down (with a reason) mechanic that feeds into a separate table we use to improve results.
- We added a way to trace entire conversations, including the user-facing prompts and responses as well as the many internal logs generated by the agent chain. This way, we can see where in the stepped process it breaks down and improve it.
- We built what amounts to a library for 'prompt testing': basically, we take user prompts that led to hits (where we want to maintain quality over time) or misses (where something wasn't as good as it could be), and we run those for each new build. Essentially it's just a file for each 'capability' (e.g. each agent) that contains a bunch of QA prompts and answers. The model doesn't generate the same answer every time even at temperature 0, so we have to manually eyeball some of the answers for a new build to see whether it's better, worse, or the same. This has enabled a lot, like changing LLMs to see the effects, which wasn't possible without a structured, unit-test-like approach to QA. We could probably have an LLM evaluate whether answers are better/worse/the same, but we haven't done it yet.
- We look at all the prompt/response pairs that customers are sending through and use an LLM to summarize and assess the quality of the answers (rough sketch after this list). It is remarkably good at this and has made it much easier to identify which things are an issue.
- We are using a vector DB, which is weird and not ideal. Thankfully, large-scale databases are starting to add support for vector data types, and we will move to one as soon as we possibly can, because really the only difference is how well the database indexes and reads/writes vector data.
- We have established a dashboard with metrics, including obvious things like how many unique people used it each day and overall message volume, but also less obvious things like the 'resolves within 3 prompts' rate, which is a proxy for whether it gets the job done or whether users constantly have to chase it. Depending on the problem you're solving, similar metrics will evolve.
There are probably other things the team is doing that I'm not even read into yet, since it's such an immature and fast-moving process, but these are the general areas.
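A rough sketch of the LLM-assisted quality pass mentioned above; the rubric and the `chat` callable are placeholders, not our actual setup.

    # Sketch: run prompt/response pairs through a grader model and tally the verdicts.
    from collections import Counter

    GRADER_TEMPLATE = (
        "You are reviewing answers from our assistant.\n"
        "Reply with exactly one word: GOOD, OK, or BAD.\n\n"
        "User prompt: {prompt}\nAssistant answer: {answer}"
    )

    def grade_pairs(pairs, chat):
        verdicts = []
        for pair in pairs:   # each pair is {"prompt": ..., "answer": ...}
            raw = chat(GRADER_TEMPLATE.format(**pair)).strip().upper()
            verdicts.append(raw if raw in {"GOOD", "OK", "BAD"} else "UNGRADED")
        return Counter(verdicts)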
For context, I built in some API calls and the LLM asks for the required data. My proof of concept was enough to make the client say "ship it!", and we've been rolling with it since.
They also have a registry of evals built in.
We spent a lot of time working with various companies on GenAI use cases before LLMs were a thing, and captured that experience in our library, LangKit. It's designed to be generic and pluggable into many different systems, including LangChain: https://github.com/whylabs/langkit/. It goes beyond prompt engineering and aims to provide automated ways to monitor LLMs once deployed. Happy to answer any questions here!
For instance, we keep track of the edits users make to generated summaries so that we can later manage user-specific fine-tuning data we might pass along to the model, but I couldn't describe a plan for making use of it today.
We haven’t seen any usage of the more exciting-looking demos of semi-autonomous agents in production yet, and probably won’t for some time.
Rapidly moved from demos to people actively using it for their day to day work and automation :)
We're taking advantage of new features as they become available, such as OpenAI's functions and larger context windows. Things are evolving quickly!
(and no, we're not using LangChain!)
Simple example of a GPT-4 generated report using custom data: https://flowch.ai/shared/73523ec6-4d1d-48a4-bb16-4e9cc01adf1...
A summary of the conversation so far, adding the text of this comment thread to the system: https://flowch.ai/shared/95dd82d1-39d4-4750-b9df-ab2b83cf7ed...
One strategy I've applied is to classify a question into one of several categories and then use a small, hyperfocused context.
Another is to remove commas and currency symbols from money values.
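A minimal sketch of both strategies; the categories, contexts, and `chat` callable are made up for illustration.

    # Sketch: route a question to a small, focused context by category, and
    # strip commas/currency symbols from money values before they hit the prompt.
    import re

    CONTEXTS = {
        "billing":  "Context: refund policy, invoice fields, billing FAQ...",
        "shipping": "Context: carrier list, delivery SLAs, tracking formats...",
        "other":    "Context: general product overview...",
    }

    def pick_context(question, chat):
        prompt = ("Classify this question as one of: billing, shipping, other. "
                  "Reply with the single category word.\n\n" + question)
        category = chat(prompt).strip().lower()
        return CONTEXTS.get(category, CONTEXTS["other"])

    def normalize_money(text):
        # "$1,204.50" -> "1204.50"
        return re.sub(r"[$€£]|(?<=\d),(?=\d)", "", text)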