Are you using LLM-based evaluations?
Is anyone here using LLM-based evaluations heavily? What core problems do they solve for you, and how difficult are they to set up? For context, I'm building an AI-powered chatbot that works over internal company documents, and I'm wondering what to do about evaluations.
Like, to evaluate whether you like the LLM's output? Or are you using LLM 2 to estimate how well LLM 1 did?
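For what it's worth, the second sense (often called "LLM-as-judge") usually boils down to a small harness: build a rubric prompt around the question and the chatbot's answer, send it to a judge model, and parse a score out of the reply. A minimal sketch, where `fake_judge` is a hypothetical stand-in for a real API call and the rubric wording is just illustrative:

```python
import re

def build_judge_prompt(question: str, answer: str) -> str:
    # Hypothetical rubric; real setups typically add few-shot examples
    # and grade multiple axes (faithfulness, relevance, tone) separately.
    return (
        "You are grading a chatbot answer against internal company docs.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with a single line 'Score: N', where N is 1 (poor) to 5 (excellent)."
    )

def parse_score(judge_reply: str):
    # Extract 'Score: N'; return None if the judge's output is malformed,
    # so malformed replies can be retried or logged rather than crash the run.
    m = re.search(r"Score:\s*([1-5])", judge_reply)
    return int(m.group(1)) if m else None

def fake_judge(prompt: str) -> str:
    # Stand-in for a real LLM call (e.g. an OpenAI or Anthropic request).
    return "Score: 4"

prompt = build_judge_prompt(
    "What is our refund policy?",
    "Refunds are available within 30 days of purchase.",
)
print(parse_score(fake_judge(prompt)))  # 4
```

The fragile part in practice is `parse_score`: judge models occasionally ignore the output format, so scoring needs to tolerate malformed replies instead of assuming clean output.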
What do I win if I guess what "evaluations" means?