Do you use a standard prompt when evaluating LLM capabilities?

Question

I'm wondering which prompts (if any) can be used to compare the effectiveness of different LLMs, or of new versions of the same model (e.g. updated models released by OpenAI, Anthropic, etc.).

OSRL · Accepted Answer

"If I had 16 apples in my possession 54 hours ago, and 26 hours ago I exchanged 9 apples for 17 oranges, how many apples would remain in my possession today?"

The correct answer: 7 apples (16 - 9 = 7; the hours and the oranges are distractors and do not change the apple count).
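If you want to check replies automatically rather than by eye, here is a minimal sketch. The names `EXPECTED_APPLES`, `extract_apple_count` and `is_correct` are my own for illustration, and the regex is only a crude heuristic that assumes the model states its final apple count last.

```python
import re
from typing import Optional

# 16 apples to start, 9 traded away; the hours and the oranges are distractors.
EXPECTED_APPLES = 16 - 9


def extract_apple_count(reply: str) -> Optional[int]:
    """Return the last number in the reply that is followed by 'apple(s)'.

    Heuristic only: assumes the final apple count is mentioned last.
    """
    matches = re.findall(r"(\d+)\s*(?:x\s*)?apples?\b", reply, flags=re.IGNORECASE)
    return int(matches[-1]) if matches else None


def is_correct(reply: str) -> bool:
    """True if the reply's final apple count equals the expected answer."""
    return extract_apple_count(reply) == EXPECTED_APPLES
```

A reply that never mentions a number of apples is treated as incorrect.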

I'm not positive this is a rigorous test, but I used this question to check whether certain mobile apps were truly using ChatGPT-4. Whether the app answered the question correctly told me whether its claim was likely to be true.

Typically, ChatGPT-3 would fail to answer this question correctly, so an incorrect answer told me the app was not using ChatGPT-4. In my testing, ChatGPT-4 always gave the correct answer, so I reused this exact question whenever I needed to test the ability level of the LLM I was using. On more complex tasks and problems, ChatGPT-4 also has a much higher success rate than earlier ChatGPT models.

This type of deeper logic-based question can be used to judge the overall quality of an LLM, to differentiate between older (less advanced) and newer (more advanced) models, or to decide which one you should use.
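As a rough sketch of how such a comparison could be scripted: each model is represented as a plain function that takes a prompt and returns a reply. `older_model_stub` and `newer_model_stub` are hypothetical stand-ins I made up, not real API calls, and `pass_rate` and `compare_models` are illustrative names. Asking several times is an assumption on my part, since replies can vary between runs.

```python
from typing import Callable, Dict


def pass_rate(ask: Callable[[str], str], prompt: str,
              check: Callable[[str], bool], trials: int = 5) -> float:
    """Ask the same prompt `trials` times and return the fraction judged correct."""
    passed = sum(1 for _ in range(trials) if check(ask(prompt)))
    return passed / trials


def compare_models(models: Dict[str, Callable[[str], str]], prompt: str,
                   check: Callable[[str], bool], trials: int = 5) -> Dict[str, float]:
    """Return a pass rate per model name for one prompt."""
    return {name: pass_rate(ask, prompt, check, trials) for name, ask in models.items()}


# Hypothetical stand-ins for real model calls (assumption: swap in your own client code).
def older_model_stub(prompt: str) -> str:
    return "You have 24 apples."


def newer_model_stub(prompt: str) -> str:
    return "You have 7 apples."
```

To use it with real models, replace the two stubs with functions that send the prompt to whichever service you are testing and return the text of the reply.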

You can rearrange or reword the question, or use different formats, to increase the difficulty when comparing more advanced models. Try to turn the question into a deeper logical-reasoning test, and keep making the problem progressively more complicated until the models diverge: eventually one of them should respond incorrectly and another correctly.
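One possible way to make harder variants with a known answer is to generate them, so the correct count is computed rather than worked out by hand. This is only a sketch: `make_apple_question` is a hypothetical helper, and chaining more trades is just one of many ways to raise the difficulty.

```python
import random
from typing import Tuple


def make_apple_question(num_trades: int, seed: int = 0) -> Tuple[str, int]:
    """Build an apples-for-oranges question with `num_trades` trades.

    Returns the question text and the correct number of apples remaining.
    """
    rng = random.Random(seed)
    # Start high enough that apples and hours stay positive after every trade.
    apples = rng.randint(20, 40) + 10 * num_trades
    hours = rng.randint(60, 100) + 10 * num_trades
    parts = [f"If I had {apples} apples in my possession {hours} hours ago"]
    for _ in range(num_trades):
        hours -= rng.randint(2, 10)   # each trade happens later than the previous one
        given = rng.randint(1, 9)     # apples traded away
        oranges = rng.randint(2, 30)  # distractor: oranges never affect the answer
        apples -= given
        parts.append(f"and {hours} hours ago I exchanged {given} apples for {oranges} oranges")
    question = ", ".join(parts) + ", how many apples would remain in my possession today?"
    return question, apples
```

Calling `make_apple_question(1)` gives a question shaped like the original, and raising `num_trades` adds more steps for the model to track.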

This is a relatively simple and commonly used technique. I'm not sure if it will help, but I hope it helps somebody in this area, at least a little.

   Cheers

If you like, add me! DISCORD USERNAME: oldschool.gg