That all makes sense to me and I think is the right direction to be headed. However, it's been a bit since the inception of some of these projects/cool demos but I haven't seen anyone who uses agents as a core/regular part of their workflow.
I'm curious if you use these agents regularly or know someone that does. Or, if you're working on one of these, I'd love to know: what are some of the hidden challenges to making a useful product with agents? What's the main bottleneck?
Any thoughts are welcome!
One thing that is still confusing to me is that we've been building products with machine learning pretty heavily for a decade now, and yet we seem to have abandoned everything we learned about the process now that we're building "AI".
The biggest thing any ML practitioner realizes when they step out of a research setting is that, for most tasks, accuracy has to be very high for it to be productizable.
You can do handwritten digit recognition with 90% accuracy? Sounds pretty good, but if you need to turn that into recognizing a 12-digit account number you now have roughly a 70% chance of getting at least one digit incorrect. This means a product-worthy digit classifier needs to be much more accurate.
Go look at some of the LLM benchmarks out there; even in these happy cases it's rare to see any LLM getting above 90%. Then consider that you want to chain these calls together to create proper agent-based workflows. Even with 90% accuracy on each task, chain 3 of them together and you're down to 0.9 × 0.9 × 0.9 ≈ 0.73, i.e. 73% accuracy.
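Spelled out, the arithmetic behind both figures looks like this (just a quick calculation, nothing model-specific):

```python
# How per-step accuracy compounds across independent steps.
per_digit_accuracy = 0.90

# 12-digit account number: all 12 digits must be correct.
p_all_correct = per_digit_accuracy ** 12
print(f"12-digit number fully correct: {p_all_correct:.2f}")      # ~0.28
print(f"at least one digit wrong:      {1 - p_all_correct:.2f}")  # ~0.72

# Chaining 3 agent steps at 90% accuracy each.
print(f"3 chained steps all succeed:   {0.9 ** 3:.3f}")           # 0.729
```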
This is by far the biggest obstacle to seeing more useful products built with agents. There are cases where lower-accuracy results are acceptable, but most people don't even consider this before embarking on their journey to build an AI product/agent.
Some recent, actual use cases of mine where an agent would NOT be able to help me, although I really wish it could:
1. An agent to automate generating web pages from design images - Given an image, produce the HTML and CSS. LLMs couldn't do this for my simple page from a web designer, not even close, even mixing up vertical/horizontal flex arrangement. When I cropped the image to just a small section, it still couldn't do it. I tried a couple of LLMs; none came close. And these are pretty simple, basic designs! I had to do it all manually.
2. Story Generator Agent - Write a story from a given outline (for educational purposes). Even with a very detailed outline and a large context window, it kept forgetting key points, repeating language, and failing to develop the plot. I just have to write the story myself.
3. Illustrator Agent - Image generation for the above story. The images end up looking very "LLM", often miss key elements of the story, and, worst of all, there are no persistent characters. This is already a big problem with text, but an even bigger problem with images. Every image for the same story has a character who looks different, but I want them to be the same.
4. Publisher Agent - Package the above together so I get a complete set of illustrated stories on various topics, available on web/mobile for viewing and progress tracking, at varying levels.
Just some examples of where LLMs are currently not moving the needle much if at all.
Then I asked it to add a test suite to a Rails side project. It created missing factories, corrected a broken test database configuration, and wrote tests for the classes and controllers that I asked it to.
I didn't have to get involved with the mundane details. I did have to intervene here and there, but not much. The tests aren't as good as an experienced person would write, but IMO they add value by at least covering the happy path.
I did spend a non-trivial amount of time fiddling with the prompts I used to teach OI about Promptr as well as the prompts I used to get it to successfully create the test suite.
The total cost was around $11 using GPT-4 Turbo.
In this case it was a fun experiment, but I think this type of tooling will be ubiquitous in the future.
For example we use it for:
- Website Loading: Automate proxy and browser selection to load sites effectively. Start with the cheapest and simplest way of extracting data, which is fetching the site without any JS or an actual browser. If that doesn't work, the agent tries to load the site with a browser and a simple proxy, and so on (a sketch of this escalation loop follows this list).
- Navigation: Detect navigation elements and handle actions like pagination or infinite scroll automatically.
- Network Analysis: Identify desired data within network calls.
- Validation: Hallucination checks and verification that the data is actually on the website and in the right format. (this is mostly traditional code though)
- Data transformation: Clean and map the data into the desired format. Finetuned small and performant LLMs are great at this task with a high reliability.
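A minimal sketch of what this kind of escalation loop could look like (the helper functions here are placeholders, not Kadoa's actual implementation):

```python
# Try the cheapest extraction method first, fall back to heavier ones only if needed.
import urllib.request

def fetch_plain(url: str) -> str | None:
    # Cheapest option: plain HTTP fetch, no JS execution, no real browser.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def fetch_with_browser(url: str, proxy: str) -> str | None:
    # Placeholder for a headless-browser fetch (e.g. Playwright) behind a proxy tier;
    # omitted here to keep the sketch self-contained.
    return None

def looks_complete(html: str | None) -> bool:
    # Stand-in for the validation step: is the data actually on the page?
    return bool(html) and "<body" in html.lower()

def extract(url: str) -> str | None:
    attempts = [
        lambda: fetch_plain(url),
        lambda: fetch_with_browser(url, proxy="simple"),
        lambda: fetch_with_browser(url, proxy="residential"),
    ]
    for attempt in attempts:
        try:
            html = attempt()
        except Exception:
            html = None
        if looks_complete(html):
            return html
    return None
```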
The main challenge:
We quickly realized that doing this for a few data sources with low complexity is one thing, doing it for thousands of websites in a reliable, scalable, and cost-efficient way is a whole different beast.
The integration of tightly constrained agents with traditional engineering methods effectively solved this issue for us.
Edit: You can try out a simplified version of this in our playground: https://www.kadoa.com/add
The use cases are pretty straightforward and low risk:
1. Run a Google web search.
2. Query a news API.
3. Write a document based on the above, while citing sources.
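A rough sketch of that three-step flow; all helper functions here are placeholders, not the actual Emerging Trajectories implementation:

```python
# Gather sources from search and a news API, then prompt an LLM to write a cited document.

def web_search(query: str) -> list[dict]:
    # Placeholder: call a search API, return [{"title", "url", "snippet"}, ...]
    return []

def query_news_api(query: str) -> list[dict]:
    # Placeholder: call a news API, return recent articles in the same shape.
    return []

def call_llm(prompt: str) -> str:
    # Placeholder for whatever LLM client is in use.
    return ""

def write_forecast(question: str) -> str:
    sources = web_search(question) + query_news_api(question)
    numbered = "\n".join(
        f"[{i + 1}] {s.get('title', '')} ({s.get('url', '')}): {s.get('snippet', '')}"
        for i, s in enumerate(sources)
    )
    prompt = (
        f"Question: {question}\n\nSources:\n{numbered}\n\n"
        "Write a short forecast answering the question. "
        "Cite sources by their [number] after each claim."
    )
    return call_llm(prompt)
```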
Here's an example of something written yesterday, where I'm forecasting whether July 2024 will be the hottest on record: https://emergingtrajectories.com/a/forecast/74
This is working well in that the writeups are great and there are some "aha" moments, like the agent finding and referencing the National Snow and Ice Data Center (NSIDC)... Very cool! I wouldn't have thought of it.
Then there's the part where the agent also tells me that the Oregon Department of Transportation has holidays during the summer, which doesn't matter at all.
So, YMMV, as they say... But I am more productive with these agents. I wouldn't publish anything formally without confirming and reviewing the content, though.
You can improve on the retrieved documents in many ways, like:
- better chunking,
- better embedding,
- embedding several rephrased versions of the query,
- embedding a hypothetical answer to the prompt,
- hybrid retrieval (vector similarity + keyword/tfidf/bm25 related search),
- massively incorporating meta data,
- introducing additional (or hierarchical) summaries of the documents,
- returning not only the chunks but also adjacent text,
- re-ranking the candidate documents,
- fine tuning the LLM and much, much more.
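As one concrete example from the list, here is a minimal sketch of hybrid retrieval, with a placeholder embed() standing in for a real embedding model and TF-IDF standing in for the keyword side (BM25 would slot in the same way):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def embed(texts: list[str]) -> np.ndarray:
    # Placeholder: call your embedding model here; returns (n, dim) vectors.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384))

def hybrid_search(query: str, docs: list[str], alpha: float = 0.5, k: int = 5):
    # Keyword side: TF-IDF cosine similarity.
    tfidf = TfidfVectorizer().fit(docs)
    kw_scores = cosine_similarity(tfidf.transform([query]), tfidf.transform(docs))[0]

    # Dense side: cosine similarity between query and document embeddings.
    dense_scores = cosine_similarity(embed([query]), embed(docs))[0]

    # Normalize each score to [0, 1] and blend.
    def norm(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-9)

    combined = alpha * norm(dense_scores) + (1 - alpha) * norm(kw_scores)
    top = np.argsort(combined)[::-1][:k]
    return [(docs[i], float(combined[i])) for i in top]
```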
However, at the end of the day a RAG system usually still has a hard time answering questions that require an overview of your data. Example questions are:
- "What are the key differences between the new and the old version of document X?"
- "Which documents can I ask you questions about?"
- "How do the regulations differ between case A and case B?"
In these cases it is really helpful to incorporate LLMs to decide how to process the prompt. This can be something simple like query-routing, or rephrasing/enhancing the original prompt until something useful comes up. But it can also be agents that come up with sub-queries and a plan on how to combine the partial answers. You can also build a network of agents with different roles (like coordinator/planner, reviewer, retriever, ...) to come up with an answer.
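A minimal sketch of that query-routing plus sub-query idea, with call_llm and rag_answer as placeholders for a real LLM client and retriever:

```python
def call_llm(prompt: str) -> str:
    # Placeholder for your LLM client.
    return ""

def rag_answer(question: str) -> str:
    # Placeholder for a standard retrieve-then-answer call.
    return ""

def answer(question: str) -> str:
    # Let the LLM decide how the prompt should be processed.
    route = call_llm(
        "Classify this question as SIMPLE (answerable from a few chunks) "
        "or OVERVIEW (needs comparing/aggregating across documents). "
        f"Answer with one word.\n\nQuestion: {question}"
    ).strip().upper()

    if route != "OVERVIEW":
        return rag_answer(question)

    # Overview questions: plan sub-queries, answer each, then combine the partial answers.
    sub_queries = call_llm(
        f"Break this question into independent sub-questions, one per line:\n{question}"
    ).splitlines()
    partials = [rag_answer(q) for q in sub_queries if q.strip()]
    return call_llm(
        f"Original question: {question}\n\n"
        "Partial answers:\n" + "\n---\n".join(partials) + "\n\n"
        "Combine these into one coherent answer."
    )
```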
* edited the formatting
But they're universally garbage because they require the LLM to do a lot of things that LLMs are completely incompetent at. It's just way too early to expect to be able to remove that work and have it be done by an LLM.
The fact is LLMs are useful because they easily do some work that you're terrible at, and you easily do a lot of work that it's terrible at, and this makes the LLM a good tool because you+LLM is better than either part of that equation alone.
It's natural to think of the things that come effortlessly to you as easy, and to not even notice you're doing any work. But that doesn't change the fact that the LLM is completely incompetent at many of these things. It's way too early to remove the human from the loop.
The more notable common paradigm of agent workflows that will persist even if there's an AI crash is retrieval-augmented generation (RAG), which at a high level is essentially few-shot text generation grounded in prior existing examples. There will always be value in aligning LLM output to be much more expected, such as "generate text in the style of these examples" or "use these examples to answer the user's question."
Startups that just market themselves as "chat with your data!", even though they are RAG based, are gimmicks though and won't survive because they have no moat.
If you are using AI agents to automate the execution of a workflow [1], then the question to ask is where the non-determinism in the workflow is. As in, where do humans scratch their heads, as opposed to relying on deterministic computation.
It turns out that, a lot of the time, we humans scratch our heads just once for a given kind of objective, to come up with a plan. Once we devise a plan, we execute the same plan over and over again without much difficulty.
This inherent pattern in how humans solve problems somewhat diminishes the value of AI agents, because even in the best-case scenario the agents would only be solving a one-time, front-loaded pain. The value add would have been immense if the pain were recurrent for a given objective.
That is not to say there is no role for AI agents. We are trying to infuse AI agents into an environment where we as humans adapted pretty well. AI agents will have to create newer objectives and goals that we humans have not realized. Finding that uncharted territory, or blue ocean, is where the opportunity is.
[1] By 'workflow' I mean a series of steps to take in order to achieve an overall objective.
1. Planning is hard, and errors compound exponentially: Most demos try to start with a single sentence, e.g. "order me a Dominos pizza", and go do the whole thing. It turns out planning has been one of the things that LLMs are not that good at. Also, even for a low probability p of failure at a given step, you get all steps right only with probability (1-p)^n, which gets bad as n grows (see the quick calculation after this list).
2. Reliability matters and vision is not quite there yet: GPT4V is great, and there have been a handful of domain-specific open source models more focused on understanding screenshots but most of them are not good enough yet to work reliably. And for most applications, reliability is key if you are going to trust the agent to do things on your behalf.
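To put numbers on the decay in point 1, here is (1-p)^n for a modest 5% per-step failure rate:

```python
# Chance of completing an n-step plan when each step fails with probability p.
p = 0.05
for n in (1, 5, 10, 20, 40):
    print(f"{n:>2} steps: {(1 - p) ** n:.2f}")
# 1: 0.95, 5: 0.77, 10: 0.60, 20: 0.36, 40: 0.13
```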
Disclaimer: I'm one of the founders of Autotab (https://www.autotab.com/), we're building a desktop app that lets anyone teach an AI to do a task just by showing it once. We've gone all in on reliability, building our own browser on top of Chromium to give us the bare metal control needed to deliver 98%+ reliability without any site-specific fine tuning.
The other opinionated thing we've done is to focus on "Show, don't tell". We've found that for most important automations it is easier to show the agent the workflow than it would be to write a paragraph describing the steps. If you were to train a human, would you explain where to click or just share your screen & explain with a voice over?
Some stories from our users: One works in IT and sometimes spends hours on- and off-boarding employees (60,000 people company), they need to do 20 different steps across 8 different software applications. Another example is a recruiting company that has many employees looking for candidates and sending messages on LinkedIn all day. In general we mostly see automations that take action or sync data across different software applications.
The problem is temporary: good AI agents don't exist, because sufficiently intelligent AI doesn't yet exist.
(Agency and broad-domain intelligence are basically the same thing. Being able to answer questions relevant to planning is planning.)
This state of affairs is in stark contrast to the crypto/Web3 space, where no one ever presented a use case even conditional on the existence of good blockchain technology.
1. Find, annotate, aggregate, organize, summarize, etc all of my knowledge from notes
2. A Google substitute with direct answers in place of SEO junktext and countless ads
3. Writing boilerplate code, especially in unfamiliar languages
4. Dynamic, general, richly nuanced multimodal content moderation without the human labor bill
5. As an extremely effective personal tutor for learning nearly anything
I view AI as commoditizing general intelligence. You can supply it, like turning on the tap, wherever intelligence helps. I inject intelligence into moderating Discord message harassment, to detect when my 3D prints fail, to filter fluff from articles, clean up unstructured data, flag inappropriate images, etc. (All with the same model!) The world is overwhelmingly starved of intelligence. What extremely limited supply we have of this scarce resource (via humans) is woefully insufficient, and often extreme overkill where deployed. I now have access to a pennies-on-the-dollar supply of (low/mediocre quality) intelligence. Bet that I'll use it anywhere possible to unlock personal value and free up my intelligence for use where it's actually needed.
I'm pretty convinced at this point that the term "agents" is almost useless, because so many people are carrying entirely different mental models of what the term means - so it invites conversations where no-one is actually talking about the same exact idea.
This is a consequence of the "auto-regressive" model and its lack of in-built self-correction, and it is a limiting factor in actual applications.
LeCun's tweet:
But then again, it's just another search engine, essentially. So for how long would it stay useful before it accepts payments to promote certain offers?
Well, except customer service bots (assuming the goal is to inexpensively absorb the energy of unhappy customers so they give up rather than actually getting the result they want or leaving, both of which cost the company money).
I've had success building multi-agent workflows, which in a sense are an ensemble of experts with different prompts that bounce answers off and validate each other. For example, one LLM prompt can ask a question and another can validate the answer. A bit of strength-in-numbers defense against hallucinations.
I wrote an example doing this in this article: https://medium.com/neuml/ai-powered-parenting-can-ai-help-yo...
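A minimal sketch of that answer-then-validate loop, with call_llm standing in for a real LLM client (not the article's exact code):

```python
def call_llm(prompt: str) -> str:
    # Placeholder for your LLM client.
    return ""

def answer_with_validation(question: str, max_retries: int = 2) -> str:
    answer = call_llm(f"Answer the following question:\n{question}")
    for _ in range(max_retries):
        # A second prompt acts as the reviewer of the first prompt's answer.
        verdict = call_llm(
            "You are a strict reviewer. Does the answer below actually address "
            "the question and avoid unsupported claims? Reply VALID or INVALID "
            f"with a one-line reason.\n\nQuestion: {question}\n\nAnswer: {answer}"
        )
        if verdict.strip().upper().startswith("VALID"):
            break
        # Ask for a revision using the reviewer's feedback.
        answer = call_llm(
            f"Question: {question}\n\nPrevious answer: {answer}\n\n"
            f"Reviewer feedback: {verdict}\n\nWrite an improved answer."
        )
    return answer
```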
They're simply better than naive RAG, especially when you need to access APIs, format content or compare different sections of the knowledge base.
Here are a few demos we have in the open:
> HackerNews AI: Interacts with the hackernews API - https://hn.aidev.run
> ArXiv AI: Reads, summarizes and compares arxiv papers - https://arxiv.aidev.run
(love that it can give you a comparison between 2 papers)
These use cases are only possible using agents (or whatever that term means).
I can honestly say that my use of search engines has decreased drastically and replaced with SOTA LLMs + Web retrieval.
https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
- Cleaning up / changing something in bulk (e.g. cleaning attributes from a class)
- Generating unit tests! (just follow up on what it actually tests, though)
Feed in a collection of docs about applications in use at an organization including their user guides; summarize what the capability of each application is; identify what capabilities are high risk; prioritize which applications need the most security visibility
Usually this is a classic difficult problem of inventory and 100 meetings.
Perfect? Nope. A huge leap forward? Yes.
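A minimal sketch of what that doc-review pipeline could look like, with call_llm as a placeholder for a real LLM client and a made-up JSON output format:

```python
import json

def call_llm(prompt: str) -> str:
    # Placeholder for your LLM client.
    return "{}"

def triage_applications(app_docs: dict[str, str]) -> list[dict]:
    results = []
    for app_name, doc_text in app_docs.items():
        raw = call_llm(
            "From this user guide, return JSON with keys 'capability' (one "
            "sentence) and 'risk' (1-5, 5 = highest security risk):\n\n"
            + doc_text[:8000]  # naive truncation to stay within context
        )
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            parsed = {"capability": "unparsed", "risk": 0}
        results.append({"app": app_name, **parsed})
    # Highest-risk applications first: these need security visibility soonest.
    return sorted(results, key=lambda r: r.get("risk", 0), reverse=True)
```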