LangSmith looks interesting with tracing+playground, but I'm not using LangChain. I'm seeing established companies like Weights & Biases and Arize, but they seem more model focused than application focused. I'm also seeing startups like Parea and Scorecard that could be interesting.
Whether you build or buy, what you want to achieve is the following:
1. Visibility into what's going on in production.
2. The ability to identify both good and bad outputs.
3. Performance measurement (custom evaluations).
4. A feedback loop into fine-tuning.
With those pieces in place, you can start making data-driven decisions.
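To make item 3 concrete: a custom evaluation can be as simple as a scoring function you run over logged outputs. A minimal, vendor-neutral sketch (the function names and 0–1 score scale are illustrative, not any platform's actual API):

```python
# Illustrative custom evals, independent of any vendor SDK.

def exact_match_eval(output: str, expected: str) -> float:
    """Score 1.0 if the model output matches the reference exactly."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def keyword_coverage_eval(output: str, required_keywords: list[str]) -> float:
    """Fraction of required keywords that appear in the output."""
    if not required_keywords:
        return 1.0
    hits = sum(1 for kw in required_keywords if kw.lower() in output.lower())
    return hits / len(required_keywords)

# Run the evals over a small batch of logged outputs to separate
# good results from bad ones (items 1 and 2 above).
logged = [
    {"output": "Paris is the capital of France.",
     "expected": "Paris is the capital of France."},
    {"output": "I am not sure.",
     "expected": "Paris is the capital of France."},
]
scores = [exact_match_eval(r["output"], r["expected"]) for r in logged]
print(scores)  # → [1.0, 0.0]
```

Once scores like these are attached to production traces, the "data-driven decisions" part mostly falls out: filter for low-scoring traces, inspect them, and feed the corrected examples back into fine-tuning.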
What we've learned so far is that each application is unique, so if you could tell me more about your app I could share some insights.
We offer testing/evaluation in development ([1], [2]) and in production ([3]). You can use pre-built evals ([4]) or create your own ([5]). We also support logging ([6]) and jumping from a trace straight into the playground ([7]). SDKs are available for both TypeScript and Python.
[1]: https://docs.parea.ai/evaluation/offline/experiments
[2]: https://docs.parea.ai/platform/test_hub/benchmarking
[3]: https://docs.parea.ai/evaluation/evals-in-trace
[4]: https://docs.parea.ai/evaluation/list_of_autoeval_functions
[5]: https://docs.parea.ai/evaluation/evaluation-functions/create
[6]: https://docs.parea.ai/observability/logging_and_tracing
[7]: https://docs.parea.ai/observability/open_trace_in_playground
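The core mechanic behind the logging/tracing piece ([6]) is just a decorator that records inputs, output, and latency for each call. A stdlib-only sketch of the idea (the `trace` name, record fields, and in-memory `TRACES` list are illustrative, not Parea's actual SDK, which ships records to a backend instead):

```python
import functools
import time
import uuid

TRACES: list[dict] = []  # in-memory stand-in for a logging backend

def trace(fn):
    """Sketch of a tracing decorator: records inputs, output, latency,
    and a trace id for each call of the wrapped function."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        TRACES.append({
            "trace_id": str(uuid.uuid4()),
            "function": fn.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
            "latency_ms": round((time.perf_counter() - start) * 1000, 2),
        })
        return result
    return wrapper

@trace
def answer(question: str) -> str:
    # Stand-in for an LLM call.
    return f"Echo: {question}"

answer("What is tracing?")
print(TRACES[0]["function"], TRACES[0]["output"])
# → answer Echo: What is tracing?
```

With records in this shape, "trace to playground" ([7]) is a matter of loading a stored trace's inputs back into an interactive prompt editor.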
Now for solutions, shameless plug here, we are building an open-source platform for experimenting and evaluating complex LLM apps (https://github.com/agenta-ai/agenta). We offer automatic evaluators as well as human annotation capabilities. Currently, we only provide testing before deployment, but we have plans to include post-production evaluations as well.
Other tools I would look at in the space are promptfoo (also open-source, more dev oriented), humanloop (one of the most feature-complete tools in the space, however more enterprise oriented / costly), and vellum (YC company, more focused towards product teams).
LangSmith is designed to be independent of LangChain. Some resources:
tracing without LangChain: https://github.com/langchain-ai/langsmith-cookbook/blob/main...
Docs on tracing without LangChain: https://docs.smith.langchain.com/tracing/tracing-faq#how-do-...
We're also actively revamping our documentation on this topic! Hopefully it will be better soon.