I then ask some basic queries, like “where is person Y on each day this week?”
It gets about 80% of the answers right, but completely at random gets the remainder utterly wrong. I then tell it what is wrong and what the right answer should be, and it apologises and updates the answer, and then performs equally poorly for the next employee name.
I have spent all day prompt-tuning and trying to explain exactly what the table means and what I'm after, and it reassures me that it knows exactly what it is doing, and then fails again completely.
Seems strange that this thing can allegedly score in the 90th centile on the LSAT but completely botch very basic table reading?
You'd think this example is near the best-case scenario, with easily parsable tabular data published widely and publicly every year, all season. Yet it was comically wrong, and embarrassingly confident that it was not.