- Where do you store your prompts?
- Do you use version control?
- How do you test prompts after editing?
- Where do you store your test sets?
- Do you evaluate results? If so, how?
- Are you fine-tuning models like GPT-3.5 for better/cheaper results?
Most of the score dimensions are deterministic, but we've added some where we integrated an LLM to do the scoring (...which brings the new problem of scoring the scoring prompt!). We also do a manual scan of the outputs as a sanity check. We're not doing any fine-tuning yet, as we're getting pretty good results with prompting alone.
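For concreteness, here's a minimal sketch of the kind of harness I mean: a couple of deterministic score dimensions plus one LLM-judged dimension. Everything in it is illustrative rather than our actual code: the function names, the judge prompt, the dimension names, and the choice of the OpenAI client are all assumptions.

```python
# Minimal eval-harness sketch: deterministic score dimensions plus one
# LLM-as-judge dimension. Names and prompts are illustrative only.
import json
from openai import OpenAI  # assumed client library; swap in whatever you use

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def score_length(output: str, max_words: int = 120) -> float:
    """Deterministic: 1.0 if within the word budget, scaled down otherwise."""
    words = len(output.split())
    return 1.0 if words <= max_words else max_words / words

def score_coverage(output: str, required: list[str]) -> float:
    """Deterministic: fraction of required phrases present in the output."""
    if not required:
        return 1.0
    hits = sum(phrase.lower() in output.lower() for phrase in required)
    return hits / len(required)

# Doubled braces escape the literal JSON example for str.format().
JUDGE_PROMPT = (
    "Rate the following answer for factual accuracy on a scale of 1-5. "
    'Reply with JSON only: {{"score": <int>, "reason": "<short reason>"}}.\n\n'
    "Question: {question}\nAnswer: {answer}"
)

def score_with_llm(question: str, answer: str) -> float:
    """LLM-judged dimension -- the part that itself needs sanity checking."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        temperature=0,  # keep the judge as repeatable as possible
    )
    # May raise if the judge strays from JSON; handle/retry as needed.
    data = json.loads(resp.choices[0].message.content)
    return data["score"] / 5.0  # normalise to 0..1

def evaluate(question: str, output: str, required: list[str]) -> dict:
    """Combine all score dimensions for one test case."""
    return {
        "length": score_length(output),
        "coverage": score_coverage(output, required),
        "accuracy": score_with_llm(question, output),
    }
```

The LLM-judged dimension is where "scoring the scoring prompt" bites: pinning temperature to 0 and demanding a JSON-only reply tames it somewhat, but a manual scan of the judge's reasons is still worth doing.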
Please share your current status :)