HACKER Q&A
📣 toddmorey

AI-powered feature ready to deploy. Tools I can use to evaluate it?


As the title says, I want to test a new AI feature at scale. I've seen enough "oops" moments from Google, Figma, etc. on AI product rollouts to be nervous. I know it'll never be perfect, but I want to test as thoroughly as possible.

I was thinking of running some LLM-as-judge workflows to create synthetic prompts and evaluate model outputs, doing this at scale to catch edge cases, hallucinations, and so on. Basically, I want to feed the model a huge set of prompts, have it produce outputs, and then have other LLMs evaluate the results. Our team can then dig into the areas where the judges disagree.
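
Roughly, the loop I have in mind looks like the sketch below (Python; `call_llm`, the judge model names, and the scoring prompt are all placeholders for whatever provider/client we end up using, not a specific vendor API):

```python
import json
from statistics import pstdev

# Placeholder: swap in your actual client (OpenAI, Anthropic, a local model
# behind vLLM, etc.). Everything below is provider-agnostic.
def call_llm(model: str, prompt: str) -> str:
    raise NotImplementedError("wire up your provider's client here")

JUDGE_MODELS = ["judge-model-a", "judge-model-b", "judge-model-c"]  # hypothetical names

JUDGE_PROMPT = """Rate the RESPONSE to the PROMPT on a 1-5 scale for factual
accuracy and instruction-following. Reply with JSON: {{"score": <int>, "reason": "<short>"}}

PROMPT: {prompt}
RESPONSE: {response}"""

def evaluate(prompts: list[str], product_model: str) -> list[dict]:
    results = []
    for p in prompts:
        response = call_llm(product_model, p)   # the feature under test
        scores, reasons = [], []
        for judge in JUDGE_MODELS:              # several independent judges
            raw = call_llm(judge, JUDGE_PROMPT.format(prompt=p, response=response))
            verdict = json.loads(raw)           # assumes the judge obeys the JSON format;
            scores.append(verdict["score"])     # add parsing fallbacks in practice
            reasons.append(verdict["reason"])
        results.append({
            "prompt": p,
            "response": response,
            "scores": scores,
            "reasons": reasons,
            "disagreement": pstdev(scores),     # high spread => a human should look
        })
    # Surface the cases the judges disagree on most for manual review.
    return sorted(results, key=lambda r: r["disagreement"], reverse=True)
```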

What tools, services, stacks, etc. might help with this? I'm sure this is a solved problem somewhere, but I don't know how other teams approach it. Thanks!!


  👤 ziggyzecat Accepted Answer ✓
> Basically, I want to feed the model a huge set of prompts, have it produce outputs, and then have other LLMs evaluate the results. Our team can then dig into the areas where the judges disagree.

A few weeks later: "oops".

You have to build your own tools and test sets, for RAG included, but you know that already. Don't slack off: when it comes to testing and evaluating LLMs, nobody can be trusted yet.
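
Concretely, pair the judge scores with a hand-labeled golden set and some dumb deterministic checks, so you can tell when the judges themselves go sideways. A minimal sketch, assuming a JSONL file of labeled cases and a `generate` callable that hits your feature (both are placeholders):

```python
import json

# Golden set: prompts your team has labeled by hand, one JSON object per line, e.g.
# {"prompt": "...", "must_contain": ["refund policy"], "must_not_contain": ["guarantee"]}
def load_golden_set(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Cheap deterministic checks catch regressions that LLM judges can miss,
# and they never hallucinate.
def rule_checks(case: dict, response: str) -> list[str]:
    failures = []
    for needle in case.get("must_contain", []):
        if needle.lower() not in response.lower():
            failures.append(f"missing required text: {needle!r}")
    for needle in case.get("must_not_contain", []):
        if needle.lower() in response.lower():
            failures.append(f"contains forbidden text: {needle!r}")
    return failures

def run_golden_set(cases: list[dict], generate) -> None:
    # `generate` is whatever function calls your deployed feature.
    failed = 0
    for case in cases:
        response = generate(case["prompt"])
        failures = rule_checks(case, response)
        if failures:
            failed += 1
            print(case["prompt"][:60], "->", failures)
    print(f"{failed}/{len(cases)} golden cases failed")
```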