The biggest backend things: models can run on CPU architectures. You still need highly performant CPUs, but in the cloud you can significantly discount that cost by relying on spot instances (this doesn't mean GPU isn't viable; we just found a real cost lever here that CPU has supported... but things may change and we know how to test that). Further, distributed and parallel (network) processing matter, especially in retrieval-augmented architectures... so we booted Python long ago, and the lower-level APIs aren't simply serializing standard JSON -- think protobuf land (TensorFlow Serving offers inspiration). BTW, we never needed any "vector DB"... integrating real data is complex: embeddings get created through platforms like Airflow, metadata lives in a document store, and everything is made available for fast retrieval on low-gravity disk (e.g. built on top of something like RocksDB... then ANN is easy with tools like FAISS).
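To sketch just the retrieval piece of that (the file names and the flat index are illustrative, not a description of our actual stack): embeddings built offline land in a FAISS index, and metadata is looked up by id from the document store.

    # Sketch: embeddings produced offline (e.g. by an Airflow job) served through a FAISS index,
    # with metadata looked up by id from the document store. File names and the flat index are illustrative.
    import numpy as np
    import faiss

    dim = 768
    embeddings = np.load("embeddings.npy").astype("float32")   # shape: (n_chunks, dim)
    chunk_ids = np.load("chunk_ids.npy")                        # parallel array of ids

    faiss.normalize_L2(embeddings)          # cosine similarity via normalized inner product
    index = faiss.IndexFlatIP(dim)          # exact search; swap for IVF/HNSW variants at scale
    index.add(embeddings)

    def retrieve(query_vec, k=5):
        q = np.asarray(query_vec, dtype="float32").reshape(1, -1)
        faiss.normalize_L2(q)
        scores, rows = index.search(q, k)
        # metadata (source, text, timestamps) would be fetched from the document store by chunk id
        return [(chunk_ids[r], float(s)) for r, s in zip(rows[0], scores[0])]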
The biggest UX thing: integration into real-life workflows has to be approached in a design-centered way, very carefully. Responsible AI means not blindly throwing these kinds of things into critical workflows. This implies what I assume is a fairly complicated frontend effort for most (it was for us, especially as we integrate into real existing applications... we started prototyping with a Chrome extension).
EDIT: btw, OSS has definitely made some of these things easier (especially for greenfield and simple product architectures), but I'm skeptical of the new popular kids on the block that are VC-funded and building abstractions you'll likely need to break through across software, data, and inference serving for anything in a scaled, integrated commercial enterprise.
EDIT2: Monitoring, metrics, and analytics architecture were big efforts as well. Ask yourself: how do you measure "value added by AI"?
My workflow is based on testing: start by defining a set of representative test cases and using them to guide prompting. I tend to prefer programmatic test cases over LLM-based evals, but LLM evals seem popular these days. Then, I create a hypothesis, run an eval, and if the results show improvement I share them with the team. In some of my projects, this is integrated with CI.
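For illustration, a minimal programmatic eval in the style I mean, runnable under pytest and easy to wire into CI. Everything here is a made-up example, not a specific framework; `generate_answer` is a placeholder for whatever calls your model.

    # Sketch of a programmatic (non-LLM) eval: plain Python checks over model outputs.
    import json
    import pytest

    def generate_answer(prompt: str) -> str:
        raise NotImplementedError("replace with your model call")

    CASES = [
        # (prompt, check) pairs; checks are ordinary assertions, no LLM judge involved
        ("Extract the total from: 'Total due: $1,204.50'. Reply with just the number.",
         lambda out: "1204.50" in out),
        ("Return the ISO date for 'March 3rd, 2024' as JSON like {\"date\": \"...\"}.",
         lambda out: json.loads(out)["date"] == "2024-03-03"),
    ]

    @pytest.mark.parametrize("prompt,check", CASES)
    def test_representative_cases(prompt, check):
        assert check(generate_answer(prompt))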
The next step is closing the feedback loop and gathering real-world examples for your evals. This can be difficult to do if you respect the privacy of your users, which is why I prefer a local, open-source CLI. You'll have to set up the appropriate opt-ins, etc., if you gather this data at all.
These tools are not mature enough; no automated decision-making process should rely on them without human intervention, especially when handling other humans. I'm not saying all LLMs are bad. I'm saying their architectural foundation is prone to these problems.
Our measurement was the rate at which we lost customers after they opened tickets complaining about an issue. In our A/B test, we saw roughly 12x more losses when using the LLM as a replacement for our support team.
- Using LLMs to summarize, create structure, etc., rather than relying on them to generate responses
- Having my apps rely on receiving LLM output in an exact format (e.g. an OpenAI function call) and raising an error otherwise (see the sketch below)
- Never showing generated outputs directly to a user or saving them directly to the database (avoids a lot of potential for unexpected behavior)
This kind of stuff is less flashy than chatbots for everything, but IMO has a lot more potential to actually impact how software works at scale in the next ~year or two.
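As a sketch of the exact-format point above (the schema, field names, and error class are made up, not from any particular library):

    # Sketch: require LLM output to match an exact schema; raise instead of guessing.
    import json

    REQUIRED_FIELDS = {"title": str, "due_date": str, "priority": int}

    class LLMFormatError(ValueError):
        pass

    def parse_task(raw_output: str) -> dict:
        try:
            data = json.loads(raw_output)
        except json.JSONDecodeError as e:
            raise LLMFormatError(f"not valid JSON: {e}") from e
        for field, ftype in REQUIRED_FIELDS.items():
            if field not in data or not isinstance(data[field], ftype):
                raise LLMFormatError(f"missing or mistyped field: {field}")
        # only validated, structured data leaves this function -- never raw model text
        return {k: data[k] for k in REQUIRED_FIELDS}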
(1) Observability
Being able to see and store the flow of inputs and outputs for your pipelines. In development, this is invaluable for seeing under the hood and debugging. In production, it's essential for collecting data for eval and training.
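A stripped-down sketch of the idea; a real setup would feed a proper tracing/analytics store, and the JSONL file and names here are just for illustration.

    # Sketch: wrap pipeline steps so every input/output pair is captured and replayable.
    import functools, json, time, uuid

    def traced(step_name):
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                started = time.time()
                result = fn(*args, **kwargs)
                record = {
                    "id": str(uuid.uuid4()),
                    "step": step_name,
                    "inputs": {"args": [repr(a) for a in args],
                               "kwargs": {k: repr(v) for k, v in kwargs.items()}},
                    "output": repr(result),
                    "duration_s": round(time.time() - started, 3),
                }
                with open("trace.jsonl", "a") as f:   # stand-in for a real tracing store
                    f.write(json.dumps(record) + "\n")
                return result
            return wrapper
        return decorator

    @traced("summarize")
    def summarize(text: str) -> str:
        return "..."  # model call goes here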
(2) Configurability
Being able to quickly swap out and try different configurations -- whether it's "prompt templates", model parameters / providers, etc.
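A minimal sketch of what I mean; the config names, providers, and templates are made-up examples.

    # Sketch: prompt templates and model settings live in data, so variants swap without code changes.
    CONFIGS = {
        "summarize-v1": {
            "provider": "openai", "model": "gpt-4", "temperature": 0.2,
            "template": "Summarize the following text in 3 bullet points:\n\n{text}",
        },
        "summarize-v2": {
            "provider": "anthropic", "model": "claude-3-sonnet", "temperature": 0.0,
            "template": "You are a terse analyst. Summarize:\n\n{text}",
        },
    }

    def build_request(config_name, **variables):
        cfg = CONFIGS[config_name]
        return {
            "provider": cfg["provider"],
            "model": cfg["model"],
            "temperature": cfg["temperature"],
            "prompt": cfg["template"].format(**variables),
        }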
To that end, I have my own internal framework & platform that tackles these problems in a holistic way. Here's a short example:
- The basic building block: https://www.loom.com/share/d09d38d2c316468fa9f38f5b386fc114
- Tracing for a more complex system: https://screenbud.com/shot/61dc59a6-5c73-4610-8168-753b79062...
- Code example in TypeScript & Rust: https://gist.github.com/Palmik/42940e3887d6446244f8b74ce0243...
I recently wrote a bit more about this here: https://news.ycombinator.com/item?id=36760727
Would be happy to chat and brainstorm with folks working on similar problems.
(*) I am basing this on experience from working on Google's Bard and now on my own products.
Sometimes it invents answers, which in our environment we cannot ever tolerate. When this major deficiency is fixed and it can simply say "I don't know," then we can consider working with it. One false positive makes it ineffective as a knowledge tool. It would be analogous to a report from an unknown source that contains bad data: you are better off burning it and starting over with data you can trust, rather than attempting to fix it.
At least among the people I've talked with, how much your team invests in evaluations/tests seems to be the major differentiator between the people rushing out hot garbage and those seriously building products in this space.
More specifically: we're running hundreds or thousands of evaluations across a wide range of scenarios for the prompts we're using. We can very quickly see if there are regressions, and we have a small team of people working on improving this functionality.
These problems become multiplicative: if you have 3-4 ChatGPT calls chained together, your failure rate is closer to 10%.
Unfortunately, in the health tech space even a 1% failure rate is unacceptable. I have some theories and PoC work I'm doing to improve the rate. As with all "AI/ML" projects, it's not as simple as taking a problem and applying AI to solve it.
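To make the compounding concrete (the 3% per-call failure rate here is just an assumed number for illustration):

    # If each chained call fails independently with probability p, failures compound:
    p = 0.03    # assumed per-call failure rate (illustrative only)
    for n in (1, 2, 3, 4):
        print(n, "calls ->", round(1 - (1 - p) ** n, 3))
    # 1 -> 0.03, 2 -> 0.059, 3 -> 0.087, 4 -> 0.115 (roughly the ~10% figure for a 3-4 call chain)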
Some improvements we've made: 1) Agents use a global Supabase instance to store "successful" code from tasks, so new users don't have to go through regenerations to find the same solution.
2) Each user has their own Supabase instance to grow valuable business data, and our autonomous sales agents can now maintain conversations across email, SMS, and voice calls. This is our defensive moat vs. Microsoft, since they can't hire overnight to copy this data and can't sell it back to customers. It allows agents to implicitly learn things like high-converting copy, and our new users don't have to start from zero when starting a new business. We even had a customer reply to one agent 5+ times: https://twitter.com/CheatLayer/status/1676959310562349059?s=...
3) We designed a new 'guidelines' framework that dynamically constructs the system prompt based on context (rough sketch after this list), which allowed us to push the limits in specific use cases.
4) A library to divide up tasks and solve them as a hierarchy https://www.youtube.com/watch?v=FFb59WmQoFU
5) The most exciting recent update was adding voice synthesis to our autonomous sales agents, which can now continue conversations across email, SMS, and voice. https://www.youtube.com/watch?v=2s4iQ_joToc
6) We set up a testing framework for the guidelines I mentioned above to make sure we constantly iterate. In this way we can also prove the old GPT-4 model is definitely worse, and some automations don't work at all if you take full advantage of the new model.
Check out our weekly live streams for all the updates; we've been working on this since 2021, so there's been a lot: https://www.youtube.com/@CheatLayer/streams
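The general shape of the guidelines idea in 3), sketched very roughly; the conditions and guideline text here are illustrative, not our production code.

    # Rough shape: each guideline carries an applicability condition, and the system prompt
    # is assembled from whichever guidelines match the current context.
    GUIDELINES = [
        {"when": lambda ctx: ctx.get("channel") == "email",
         "text": "Keep replies under 150 words and include a clear call to action."},
        {"when": lambda ctx: ctx.get("channel") == "sms",
         "text": "Reply in 1-2 short sentences; no links unless asked."},
        {"when": lambda ctx: ctx.get("stage") == "closing",
         "text": "Offer to schedule a call and propose two time slots."},
    ]

    def build_system_prompt(base, ctx):
        applicable = [g["text"] for g in GUIDELINES if g["when"](ctx)]
        return base + "\n\nGuidelines:\n" + "\n".join("- " + t for t in applicable)

    print(build_system_prompt("You are a sales assistant.", {"channel": "sms", "stage": "closing"}))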
Who has deployed an LLM in production, is profitable right now with their LLM startup, hit at least $500K+ MRR in under a year, and is bootstrapped with zero VC financing?
We need to cut through the hype and see if there is sustainable demand behind the abundance of so-called AI startups that seem to keep reporting they are unprofitable or continuously loss-making for years.
What we ended up building looks something like:
- We added a way to log which embedded chunks are being referenced by the searches that result from a prompt, so you can see the relationships between prompts and their search results. Unlike a direct query, this is ambiguous. And lots and lots of logging of what's going on and where it looks, or more importantly where it doesn't look but should. In particular, you want to see if/where it reaches for a custom tool vs. just answering from the core LLM (usually undesirable).
- We built a user-facing thumbs up/down (with a reason) mechanic that feeds into a separate table we use to improve results.
- We added a way to trace entire conversations, including the user-facing prompts and responses as well as the many internal logs generated by the agent chain. This way, we can see where in the stepped process it breaks down and improve it.
- We built what amounts to a library for 'prompt testing': basically, we take user prompts that led to hits (where we want to maintain quality over time) or misses (where something wasn't as good as it could be), and we run those for each new build. Essentially it's just a file for each 'capability' (e.g. each agent) that contains a bunch of QA prompts and answers. The model doesn't generate the same answer every time even at temperature 0, so we have to manually eyeball some of the answers for a new build to see whether it's better, worse, or the same. This has enabled a lot, like changing LLMs to see the effects, which wasn't possible without a structured, unit-test-like approach to QA. We could probably have an LLM evaluate whether answers are better/worse/the same, but we haven't done it yet.
- We look at all the prompt/response pairs that customers are sending through and use an LLM to summarize and assess the quality of the answers (rough sketch after this list). It is remarkably good at this and has made it much easier to identify which things are an issue.
- We are using a vector DB, which is weird and not ideal. Thankfully, large-scale databases are starting to add support for vector data types, and we will move to one as soon as we possibly can, because really the only difference is how well the database indexes and reads/writes vector data.
- We have established a dashboard with metrics, including obvious things like how many unique people used it each day and overall message volume, but also less obvious things like the 'resolves within 3 prompts' rate, which is a proxy for whether it gets the job done or whether users constantly have to chase it. Depending on the problem you're solving, similar metrics will evolve.
There are probably other things the team is doing that I'm not even read into yet, since it's such an immature and fast-moving process, but these are the general areas.
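A rough sketch of the LLM-assisted quality pass mentioned above; the rubric and the `chat` callable are placeholders, not our actual setup.

    # Sketch: run prompt/response pairs through a grader model and tally the verdicts.
    from collections import Counter

    GRADER_TEMPLATE = (
        "You are reviewing answers from our assistant.\n"
        "Reply with exactly one word: GOOD, OK, or BAD.\n\n"
        "User prompt: {prompt}\nAssistant answer: {answer}"
    )

    def grade_pairs(pairs, chat):
        verdicts = []
        for pair in pairs:   # each pair is {"prompt": ..., "answer": ...}
            raw = chat(GRADER_TEMPLATE.format(**pair)).strip().upper()
            verdicts.append(raw if raw in {"GOOD", "OK", "BAD"} else "UNGRADED")
        return Counter(verdicts)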
For context, I built in some API calls and the LLM asks for the required data. My proof of concept was enough to make the client say "ship it!", and we've been rolling with it since.
They also have a registry of evals built in.
We spent a lot of time working with various companies on GenAI use cases before LLMs were a thing, and captured that experience in our library, LangKit. It's designed to be generic and pluggable into many different systems, including LangChain: https://github.com/whylabs/langkit/. It goes beyond prompt engineering and aims to provide automated ways to monitor LLMs once deployed. Happy to answer any questions here!
For instance, we keep track of the edits users make to generated summaries so that we can later manage user-specific fine-tuning data we might pass along to the model, but I couldn't describe a plan for making use of it today.
We haven’t seen any usage of the more exciting-looking demos of semi-autonomous agents in production yet, and probably won’t for some time.
Rapidly moved from demos to people actively using it for their day to day work and automation :)
We're taking advantage of new features as they become available, such as OpenAI's functions and larger context windows. Things are evolving quickly!
(and no, we're not using LangChain!)
Simple example of a GPT-4 generated report using custom data: https://flowch.ai/shared/73523ec6-4d1d-48a4-bb16-4e9cc01adf1...
A summary of the conversation so far, adding the text of this comment thread to the system: https://flowch.ai/shared/95dd82d1-39d4-4750-b9df-ab2b83cf7ed...
One strategy I've applied is to classify a question into one of several categories and then use a small, hyperfocused context.
Another is to remove commas and currency symbols from money values.
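A minimal sketch of both strategies; the categories, contexts, and `chat` callable are made up for illustration.

    # Sketch: route a question to a small, focused context by category, and
    # strip commas/currency symbols from money values before they hit the prompt.
    import re

    CONTEXTS = {
        "billing":  "Context: refund policy, invoice fields, billing FAQ...",
        "shipping": "Context: carrier list, delivery SLAs, tracking formats...",
        "other":    "Context: general product overview...",
    }

    def pick_context(question, chat):
        prompt = ("Classify this question as one of: billing, shipping, other. "
                  "Reply with the single category word.\n\n" + question)
        category = chat(prompt).strip().lower()
        return CONTEXTS.get(category, CONTEXTS["other"])

    def normalize_money(text):
        # "$1,204.50" -> "1204.50"
        return re.sub(r"[$€£]|(?<=\d),(?=\d)", "", text)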