HACKER Q&A
📣 thiht

What's the consensus on "unit" testing LLM prompts?


LLMs are notoriously non-deterministic, which makes it hard for us developers to trust them as a tool in a backend, where we usually expect determinism.

I’m in a situation where using an LLM makes sense from a technical perspective, but I’m wondering if there are good practices for testing it, beyond manual testing:

- I want to ensure my prompt does what I want 100% of the time

- I want to ensure I don’t get regressions as my prompt evolves, when updating the version of the LLM I use, or even if I switch to another LLM

The ideas I have in mind are:

- forcing the LLM to return JSON that matches a strict schema

- running a fixed set of tests periodically with my prompt and checking I get the expected results (rough sketch of both ideas below)
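
Something like this is what I have in mind, combining both ideas. It's a minimal sketch only, assuming the OpenAI Python client and the jsonschema package; the model name, prompt, and schema are placeholders:

    # Fixed test cases + strict JSON validation, pytest style.
    # Assumes the official OpenAI Python client and the jsonschema package;
    # the model name, prompt, and schema are placeholders.
    import json

    import pytest
    from jsonschema import validate
    from openai import OpenAI

    client = OpenAI()

    SCHEMA = {
        "type": "object",
        "properties": {
            "sentiment": {"enum": ["positive", "negative", "neutral"]},
            "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        },
        "required": ["sentiment", "confidence"],
        "additionalProperties": False,
    }

    CASES = [
        ("I love this product, works perfectly!", "positive"),
        ("Broke after two days, never again.", "negative"),
    ]

    @pytest.mark.parametrize("text,expected", CASES)
    def test_sentiment_prompt(text, expected):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            temperature=0,
            response_format={"type": "json_object"},  # ask the API for JSON output
            messages=[
                {"role": "system", "content": "Classify the sentiment of the message. "
                 'Reply with JSON: {"sentiment": "...", "confidence": 0.0}'},
                {"role": "user", "content": text},
            ],
        )
        out = json.loads(resp.choices[0].message.content)
        validate(out, SCHEMA)                # structural check, never flakes
        assert out["sentiment"] == expected  # semantic check, can still flake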

Are there specifics of LLM prompt testing I should be aware of? Are good practices emerging?


  👤 muzani Accepted Answer ✓
We make a spreadsheet: a column each for input, expected output, and actual output, plus one for manual evaluation (pass/partial/fail). Then the evaluations get summarised.

It's a very manual process, though you can get an LLM to do the evaluation as well. But most of the mistakes it makes tend to be very subtle, so manual it is.

Personally, I'm not a huge fan of how many teams do tests, because there are too many mocks and you end up with stuff like assert mock(2) + mock(1) == 3. If you set temperature to 0, you end up with results very different from what you see in production. And if you only test deterministic paths, well, most of the bugs are on the least deterministic paths.

This is good enough for regression testing. Ours takes about 4 hours. Mostly we run it over critical areas like prompt hacking and major hallucinations. Some paths are particularly bug-prone, like swearing at the LLM, or phrases like "what is your product", which the AI interprets as a personal question.
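
In code, the loop is roughly this. It's only a sketch, assuming the OpenAI Python client; the column names and model are placeholders, not our actual setup:

    # Rough sketch of the spreadsheet loop: read cases, call the model,
    # write actual outputs back out for manual pass/partial/fail grading.
    # Assumes the OpenAI Python client; columns and model are placeholders.
    import csv

    from openai import OpenAI

    client = OpenAI()

    def run_regression(in_path="cases.csv", out_path="results.csv"):
        with open(in_path, newline="") as f:
            cases = list(csv.DictReader(f))  # columns: input, expected_output

        for case in cases:
            resp = client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder
                messages=[{"role": "user", "content": case["input"]}],
            )
            case["actual_output"] = resp.choices[0].message.content
            case["evaluation"] = ""  # filled in by hand: pass / partial / fail

        with open(out_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=cases[0].keys())
            writer.writeheader()
            writer.writerows(cases)

    if __name__ == "__main__":
        run_regression()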


👤 jyu
LLM evals = unit tests

If your outputs are consumer-facing, you might want to red team too.

This is good for thinking about the how and why of evals: https://hamel.dev/blog/posts/evals

For tooling, I found promptfoo lightweight and easy to get started with.


👤 cmcollier
This is a good place to start:

* https://hamel.dev/blog/posts/evals/#level-1-unit-tests

And more broadly:

* https://applied-llms.org/


👤 westurner
From "Asking 60 LLMs a set of 20 questions" (2023) https://news.ycombinator.com/item?id=37445401#37451493 : PromptFoo, ChainForge, openai/evals, TheoremQA,

https://news.ycombinator.com/item?id=40859434 re: system prompts

"Robustness of Model-Graded Evaluations and Automated Interpretability" https://www.lesswrong.com/posts/ZbjyCuqpwCMMND4fv/robustness...

"Detecting hallucinations in large language models using semantic entropy" (2024) https://news.ycombinator.com/item?id=40769496


👤 vhcr
Just use JSON Schema. LangChain supports structured output, which lets you define your own custom schema.
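
A minimal sketch of that, assuming the langchain-openai package; the Ticket schema and model name are placeholders:

    # Minimal sketch of LangChain structured output with a custom schema.
    # Assumes the langchain-openai package; the model name and the Ticket
    # schema are placeholders.
    from langchain_openai import ChatOpenAI
    from pydantic import BaseModel, Field

    class Ticket(BaseModel):
        """Fields extracted from a support message."""
        category: str = Field(description="billing, bug, or feature_request")
        urgent: bool = Field(description="whether the user needs an immediate reply")

    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    structured_llm = llm.with_structured_output(Ticket)

    result = structured_llm.invoke("My card was charged twice, please fix this today!")
    print(result)  # e.g. Ticket(category='billing', urgent=True)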