I’m in a situation where using an LLM makes sense from a technical perspective, but I’m wondering if there are good practice on testing, besides manual testing:
- I want to ensure my prompt does what I want 100% of the time
- I want to ensure I don’t get regressions as my prompt evolve, or when updating the version of the LLM I use, or even if I switch to another LLM
The ideas I have in mind are:
- forcing the LLM to return JSON with a strict definition
- running a fixed set of tests periodically with my prompt and checking I get the expected result
Are there specificities with LLM prompt testing I should be aware of? Are some good practices emerging?
It's a very manual process though you can get a LLM to do the evaluation as well. But most of the mistakes it makes tends to be very subtle, so manual it is.
Personally, I'm not a huge fan of how many teams do tests because there's too many mocks and you end up with stuff like assert mock(2) + mock(1) = 3. If you set temp to 0, you end up with very different results. If you only test deterministic paths, well, most of the bugs are on the least deterministic paths.
This is good enough to do regression testing. Ours takes about 4 hours. Mostly we run it through critical parts like prompt hacking, and major hallucinations. Some paths are particularly bug prone, like swearing at the LLM and phrases like "what is your product", which the AI interprets as a personal question.
if your outputs are consumer facing, might want to red team too
this is good for thinking about how and why of evals https://hamel.dev/blog/posts/evals
for tooling i found promptfoo to be lightweight and easy to get started.
* https://hamel.dev/blog/posts/evals/#level-1-unit-tests
And more broadly:
https://news.ycombinator.com/item?id=40859434 re: system prompts
"Robustness of Model-Graded Evaluations and Automated Interpretability" https://www.lesswrong.com/posts/ZbjyCuqpwCMMND4fv/robustness...
"Detecting hallucinations in large language models using semantic entropy" (2024) https://news.ycombinator.com/item?id=40769496