HACKER Q&A
📣 astrobotanical

Where can I find a real-time comparison of LLM chatbot performance?


I'd like to know which chatbot I should use for a particular task, as I assume different tools are better suited to different applications.

I've seen formal studies that have examined different dimensions of LLM chatbot performance (e.g. informational or linguistic quality, logical reasoning, creativity), and many anecdotal reports by the HN commentariat. I assume these analyses become outdated quickly, considering the rate at which the tools are evolving.

Are there entities that are evaluating LLMs and publishing the results just as quickly?


  👤 ubutler Accepted Answer ✓
LMSYS’ Chatbot Arena (https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...) is widely regarded as one of, if not the, most reliable open benchmarks for LLMs. Real users provide prompts to chatbots and then blindly pick the best response. The only drawback is that the leaderboard is restricted to the most popular models and, even then, it can take a while for new models to be added. This is understandable given the considerable ongoing costs of continuously updating the leaderboard.

There is also the Open LLM Leaderboard by HuggingFace (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...), which aggregates a number of benchmarks, some (e.g., MMLU) more trustworthy than others (e.g., TruthfulQA). There are real concerns, however, that ML practitioners are gaming the leaderboard by contaminating their training data with evaluation data.

There are a number of other leaderboards, such as OpenCompass (https://rank.opencompass.org.cn/leaderboard-llm-v2) and Yet Another LLM Leaderboard (https://huggingface.co/spaces/mlabonne/Yet_Another_LLM_Leade...), that I have seen suggested, although I have personally found the most success with Chatbot Arena and the Open LLM Leaderboard.

I would also suggest checking out the LLM Explorer (https://llm.extractum.io/), which has all of these benchmarks and more in a single location and allows you to sort and filter by a wide range of variables. That has been particularly helpful for me when trying to find models that will fit on my GPUs.

N.B. I am not affiliated with any of the benchmarks and services mentioned above.


👤 verdverm
https://chat.lmsys.org/ is the go-to for general comparison (click the leaderboard tab)

As for task-specific comparisons, you'll probably have to dig into papers. Typically, if you have something non-generic, you'll want to fine-tune


👤 extractum
Start with LLM Explorer. https://llm.extractum.io

👤 solardev
Wouldn't it be simpler and more accurate for you to just try X variations of your personal prompt(s) across Y services over Z runs? There's so much variance that no single study would capture every use case.
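Here's a rough sketch of that loop in Python, assuming each service exposes an OpenAI-compatible chat endpoint (many do). The base URLs, model names, keys, and the manual 0-10 scoring are all placeholders to swap for your own services and rubric:

  # X prompt variations x Y services x Z runs; score each response and compare.
  from statistics import mean
  from openai import OpenAI  # pip install openai

  SERVICES = {  # hypothetical endpoints -- substitute your own
      "service_a": {"base_url": "https://api.service-a.example/v1",
                    "api_key": "...", "model": "model-a"},
      "service_b": {"base_url": "https://api.service-b.example/v1",
                    "api_key": "...", "model": "model-b"},
  }
  PROMPTS = ["<prompt variation 1>", "<prompt variation 2>"]  # your X variations
  RUNS = 3  # Z runs each, since outputs vary from call to call

  def ask(cfg, prompt):
      # Each provider gets its own client, pointed at its own endpoint.
      client = OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
      resp = client.chat.completions.create(
          model=cfg["model"],
          messages=[{"role": "user", "content": prompt}],
      )
      return resp.choices[0].message.content

  def score(response):
      # Stand-in rubric: rate by hand, or plug in an automated check.
      return float(input(f"Score 0-10:\n{response}\n> "))

  for name, cfg in SERVICES.items():
      scores = [score(ask(cfg, p)) for p in PROMPTS for _ in range(RUNS)]
      print(f"{name}: avg {mean(scores):.1f} over {len(scores)} responses")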

👤 RecycledEle
Try them yourself.

Ask the same question in several chatbots, and disqualify the bad ones.