HACKER Q&A
📣 greglee

How are you testing your LLM applications?


I'm looking to test an LLM application that uses a combination of LLM calls, tool calls, code, and RAG. Testing will happen both in production (errors, regressions) and in development (isolate a part of the system, iterate, run experiments). Would you use a platform or roll your own solution?

LangSmith looks interesting with its tracing + playground, but I'm not using LangChain. I'm seeing established companies like Weights & Biases and Arize, but they seem more model-focused than application-focused. I'm also seeing startups like Parea and Scorecard that could be interesting.
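
For concreteness, here is a minimal sketch of what "roll your own" could look like in development: isolate one component (the RAG retrieval step in this example) and score it against a small golden set. Everything below is hypothetical placeholder code, not tied to any of the tools mentioned.

```python
# Hypothetical sketch: regression-test one component (RAG retrieval) in isolation.
# `fake_retrieve`, the golden set, and the threshold are all illustrative placeholders.
from typing import Callable, List


def recall_at_k(retrieve: Callable[[str, int], List[str]], cases: List[dict], k: int = 5) -> float:
    """Fraction of cases where the expected document id shows up in the top-k results."""
    hits = sum(case["expected_doc_id"] in retrieve(case["query"], k) for case in cases)
    return hits / len(cases)


if __name__ == "__main__":
    golden_set = [
        {"query": "How do I reset my password?", "expected_doc_id": "kb-042"},
        {"query": "What is the refund policy?", "expected_doc_id": "kb-017"},
    ]

    def fake_retrieve(query: str, k: int) -> List[str]:
        # Stand-in for your real retriever (vector store, BM25, hybrid, ...).
        return ["kb-042", "kb-099", "kb-017"][:k]

    score = recall_at_k(fake_retrieve, golden_set)
    print(f"recall@5 = {score:.2f}")
    assert score >= 0.8, "regression: retrieval quality dropped below threshold"
```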


  👤 farouqaldori Accepted Answer ✓
I'm the co-founder & CTO at www.finetunedb.com. Before launching FinetuneDB, I founded a GenAI app in the application layer (visus.ai), where I built my own in-house testing toolkit.

Whether you build or buy, what you want to achieve is the following:

1. Visibility into what's going on in production.
2. The ability to identify both good and bad outputs.
3. Performance measurement (custom evaluations; rough sketch below).
4. A feedback loop with fine-tuning.

With those pieces in place, you can start making data-driven decisions.
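
To make points 2 and 3 concrete, here is a minimal, tool-agnostic sketch of a custom evaluation run over logged production outputs. The log schema, the file name, and the scoring rule are assumptions for illustration, not FinetuneDB's API.

```python
# Tool-agnostic sketch: score logged production outputs with a custom evaluation,
# then surface the worst ones for review / fine-tuning data collection.
# The JSONL schema and the scoring rule are illustrative assumptions.
import json
from statistics import mean


def faithfulness_score(record: dict) -> float:
    """Toy custom eval: fraction of cited sources that actually appear in the retrieved context."""
    context = record["retrieved_context"].lower()
    cited = [s.lower() for s in record.get("cited_sources", [])]
    if not cited:
        return 0.0
    return sum(1.0 for s in cited if s in context) / len(cited)


def evaluate_logs(path: str, worst_n: int = 10) -> None:
    with open(path) as f:
        records = [json.loads(line) for line in f]  # one JSON object per production call
    scores = [faithfulness_score(r) for r in records]
    print(f"avg faithfulness: {mean(scores):.2f} over {len(records)} calls")
    # Lowest-scoring outputs are candidates for human review and for the fine-tuning feedback loop.
    for score, record in sorted(zip(scores, records), key=lambda pair: pair[0])[:worst_n]:
        print(f"{score:.2f}  {record.get('request_id', '<no id>')}")


if __name__ == "__main__":
    evaluate_logs("production_logs.jsonl")  # hypothetical export of your traces
```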

What we've learned so far is that each application is unique, so if you could tell me more about your app I could share some insights.


👤 Joschkabraun
Co-founder of Parea here, thanks for the mention!

We offer testing/evaluation in development ([1], [2]) and in production ([3]). You can use pre-built evals ([4]) or create your own ([5]), do logging ([6]), and go from a trace to the playground ([7]). Available for TypeScript & Python.

[1]: https://docs.parea.ai/evaluation/offline/experiments

[2]: https://docs.parea.ai/platform/test_hub/benchmarking

[3]: https://docs.parea.ai/evaluation/evals-in-trace

[4]: https://docs.parea.ai/evaluation/list_of_autoeval_functions

[5]: https://docs.parea.ai/evaluation/evaluation-functions/create

[6]: https://docs.parea.ai/observability/logging_and_tracing

[7]: https://docs.parea.ai/observability/open_trace_in_playground


👤 resiros
I am biased, but I would use a platform rather than roll your own solution. It's easy to underestimate the depth of capabilities needed for an eval framework.

Now for solutions, shameless plug here: we are building an open-source platform for experimenting with and evaluating complex LLM apps (https://github.com/agenta-ai/agenta). We offer automatic evaluators as well as human annotation capabilities. Currently we only provide testing before deployment, but we plan to add evaluations in production as well.

Other tools I would look at in the space are promptfoo (also open source, more dev-oriented), humanloop (one of the most feature-complete tools in the space, but more enterprise-oriented and costly), and vellum (a YC company, more focused on product teams).


👤 hwchase17
Cofounder of LangChain here!

LangSmith is designed to be independent of LangChain. Some resources:

Tracing without LangChain (cookbook example): https://github.com/langchain-ai/langsmith-cookbook/blob/main...

Docs on tracing without LangChain: https://docs.smith.langchain.com/tracing/tracing-faq#how-do-...

We're also actively revamping our documentation on this topic, so hopefully it will be better soon!
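
In the meantime, here is a rough sketch of what tracing a plain (non-LangChain) pipeline looks like with the LangSmith Python SDK's traceable decorator. The exact import path, environment variables, and run types are assumptions based on the docs linked above, so double-check them there.

```python
# Rough sketch of LangSmith tracing without LangChain; verify import paths and
# env var names against the FAQ linked above, they may differ by SDK version.
import os

from langsmith import traceable  # older versions: from langsmith.run_helpers import traceable

os.environ.setdefault("LANGCHAIN_TRACING_V2", "true")  # enable tracing
# os.environ["LANGCHAIN_API_KEY"] = "..."              # your LangSmith API key
# os.environ["LANGCHAIN_PROJECT"] = "my-llm-app"       # optional project name


@traceable(run_type="retriever")
def retrieve(question: str) -> list:
    return ["doc-1", "doc-2"]  # placeholder for your own retrieval step


@traceable(run_type="llm")
def call_llm(question: str, docs: list) -> str:
    return "stub answer"  # placeholder for a raw OpenAI/Anthropic/etc. call


@traceable(run_type="chain")
def answer_question(question: str) -> str:
    # Nested @traceable calls show up as child runs of this trace in LangSmith.
    docs = retrieve(question)
    return call_llm(question, docs)


if __name__ == "__main__":
    print(answer_question("How do I test my LLM app?"))
```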
