HACKER Q&A
📣 k11kirky

Do people want to A/B test LLMs?


Hi Hacker News,

My name is Peter and I'm the founder of Props, an AI gateway that lets product teams A/B test models and providers using traditional business metrics (rather than evals) to measure performance.

For example: a traditional call center measures performance with NPS, CSAT, ticket close rate, etc. Our theory is that whether the call center is human or AI, model A or model B, performance should be measured the same way. So a customer support AI team would use Props to quantify the changes they make to their app. They might set up an experiment with gpt-4o as the control and Llama 3.2 as the variant; Props automatically splits traffic between the variants with our model router and lets the team evaluate the results in our dashboard.
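For illustration, here's a rough sketch in Python of what setting up that experiment could look like. The client and method names are placeholders, not our actual API:

    # Hypothetical sketch only -- the client and method names are
    # placeholders, not Props' actual API.
    from props import Props

    client = Props(api_key="PROPS_API_KEY")

    experiment = client.experiments.create(
        name="support-bot-model-test",
        control={"model": "gpt-4o"},
        variant={"model": "llama-3.2"},
        split=0.5,  # 50/50 traffic split handled by the model router
        metrics=["csat", "ticket_close_rate"],
    )

    # Requests flow through the gateway, which assigns each one to an arm.
    reply = client.chat.completions.create(
        experiment=experiment.id,
        messages=[{"role": "user", "content": "Where is my order?"}],
    )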

Our ultimate goal: to build the smartest LLM router on the market.

But… to build the “smartest” router, we need to solve the quality, cost and latency equation.

Cost is easy enough to measure across models, and so is latency. Quality is hard because it depends on the business and the metrics that matter to it, so really each company needs its own custom LLM router. For that we need data. The way we're collecting high-quality labeled data is through experiments: experimentation gives us a consistent channel for data collection that we can keep using to fine-tune new models, or to route among the hundreds of available models depending on real-time needs.
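As a toy sketch of that equation: once a team's own metric data exists, a custom router could score candidate models with business-specific weights. All numbers and weights below are invented for illustration:

    # Toy routing sketch: choose the model that maximizes a
    # business-specific utility. All figures are invented.
    candidates = {
        # quality is a per-business metric in [0, 1]; cost is $/1M tokens;
        # latency is p50 seconds.
        "gpt-4o":    {"quality": 0.86, "cost": 5.00, "latency": 1.2},
        "llama-3.2": {"quality": 0.81, "cost": 0.90, "latency": 0.8},
    }

    # The weights encode what this business cares about; in practice they
    # would be learned from experiment data rather than hand-set.
    W_QUALITY, W_COST, W_LATENCY = 1.0, 0.05, 0.1

    def utility(m: dict) -> float:
        return (W_QUALITY * m["quality"]
                - W_COST * m["cost"]
                - W_LATENCY * m["latency"])

    best = max(candidates, key=lambda name: utility(candidates[name]))
    print(best)  # "llama-3.2" under these example weights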

I soft-launched this experimentation product last week at Llama Lounge in SF (the demo: https://getprops.ai/demo) and am looking for early design partners / customers who want to work directly with us to test it.

I also want feedback from builders. Is this useful? What would make it more useful? I am happy to set up custom demos for any use case you throw my way.

Website: https://getprops.ai
Email: Peter@getprops.ai

Thank you!


  👤 jneagu Accepted Answer ✓
The challenge with A/B experiments is designing them with enough statistical power to draw a meaningful conclusion: you either need a big % difference between test and control, or a large number of samples, and LLM apps usually meet neither criterion. Have you run into this with your users?
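For example, detecting a 30% -> 33% lift in a binary metric like ticket close rate (two-sided test, alpha = 0.05, 80% power) takes on the order of 3,800 samples per arm. A quick sketch with statsmodels; the metric and lift are just illustrative:

    # Samples per arm to detect a 30% -> 33% lift in a binary metric
    # (two-sided test, alpha = 0.05, power = 0.80).
    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    effect = proportion_effectsize(0.33, 0.30)  # Cohen's h for the lift
    n_per_arm = NormalIndPower().solve_power(
        effect_size=effect, alpha=0.05, power=0.80, ratio=1.0
    )
    print(round(n_per_arm))  # about 3,760 conversations per arm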

👤 noemit
To answer the headline: no.

I find that working on and adjusting my prompts and context is far higher value than A/B testing LLMs.

After all, I never expect 100% accuracy.

I feel like LLMs are reaching commodity status, and result quality is so similar that it doesn't really matter which one you use.