Was thinking of running some LLM-as-judge workflows to create synthetic prompts and evaluate model outputs. Trying to do this at scale to surface edge cases, hallucinations, etc. Basically I want to feed the model a huge set of prompts, have it produce outputs, and then have other LLMs evaluate the results. Then our team can dig into the areas where the judges disagree.
What tools, services, stacks, etc. may help with this? I'm sure it's a thing, but I don't know how other teams approach it. Thanks!!
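For context, here's roughly the skeleton I'm picturing, just so the question is concrete. Nothing here is tied to a specific vendor: `call_model()`, the model names, the 1-5 rubric, and the std-dev "disagreement" metric are all placeholders I made up for the sketch.

```python
"""Rough sketch of an LLM-as-judge disagreement pipeline.
All names, prompts, and metrics below are illustrative placeholders."""
import re
import statistics
from dataclasses import dataclass

CANDIDATE_MODEL = "candidate-model"               # model under test (placeholder)
JUDGE_MODELS = ["judge-a", "judge-b", "judge-c"]  # judge models (placeholders)

JUDGE_TEMPLATE = (
    "You are grading another model's answer.\n"
    "Prompt: {prompt}\n"
    "Answer: {answer}\n"
    "Rate factual accuracy from 1 (hallucinated) to 5 (fully grounded). "
    "Reply with the number only."
)


def call_model(model: str, prompt: str) -> str:
    """Stand-in for whatever inference client you actually use.
    The stub just returns something parseable so the script runs end to end."""
    return "3"


@dataclass
class Verdict:
    prompt: str
    answer: str
    scores: dict          # judge name -> score
    disagreement: float   # spread of judge scores (population std dev)


def parse_score(text: str) -> float:
    """Pull the first digit 1-5 out of a judge reply; fall back to 3 if unparseable."""
    match = re.search(r"[1-5]", text)
    return float(match.group()) if match else 3.0


def evaluate(prompts: list[str]) -> list[Verdict]:
    verdicts = []
    for prompt in prompts:
        answer = call_model(CANDIDATE_MODEL, prompt)
        scores = {
            judge: parse_score(
                call_model(judge, JUDGE_TEMPLATE.format(prompt=prompt, answer=answer))
            )
            for judge in JUDGE_MODELS
        }
        verdicts.append(Verdict(prompt, answer, scores, statistics.pstdev(scores.values())))
    # Highest-disagreement items go to the top of the human review queue.
    return sorted(verdicts, key=lambda v: v.disagreement, reverse=True)


if __name__ == "__main__":
    synthetic_prompts = [
        "Explain why the moon landing was filmed in 1972.",
        "List three side effects of ibuprofen.",
    ]
    for v in evaluate(synthetic_prompts):
        print(f"disagreement={v.disagreement:.2f}  prompt={v.prompt!r}")
```

The idea is just to sort by judge disagreement and put the top of that list in front of humans first.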
A few weeks later: "oops".
You have to build your own tools and eval sets for RAG, but you know that. Don't slack off: when it comes to testing and evaluating LLMs, nobody can be trusted yet.
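Even something as small as a hand-written golden set plus a recall check beats trusting someone else's benchmark. Rough sketch below; `retrieve()` and the golden-set shape are placeholders for whatever your stack actually uses.

```python
"""Minimal sketch of a hand-built "golden set" check for the RAG side.
Everything here is illustrative, not a specific library's API."""


def retrieve(question: str, k: int = 5) -> list[str]:
    """Stand-in for your retrieval step; should return the top-k chunk/doc IDs."""
    return ["doc-001", "doc-042", "doc-007", "doc-013", "doc-002"]


def recall_at_k(golden_set: list[dict], k: int = 5) -> float:
    """Fraction of golden questions whose expected source shows up in the top-k results."""
    hits = sum(
        1 for case in golden_set
        if case["expected_doc"] in retrieve(case["question"], k)
    )
    return hits / len(golden_set) if golden_set else 0.0


if __name__ == "__main__":
    # Tiny hand-written golden set; in practice this would live in a versioned file.
    golden = [
        {"question": "What is our refund window?", "expected_doc": "doc-042"},
        {"question": "Which regions is the API hosted in?", "expected_doc": "doc-113"},
    ]
    print(f"recall@5 = {recall_at_k(golden):.2%}")
```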