Somehow, for AI agents, taking longer is getting praised under the framing of "maintaining attention over long time horizons"?
Have we collectively dropped to room-temperature IQs since COVID?
Why would the time dimension matter for a tool limited by its context window? It doesn't matter whether you fill the window in 1 second or 60 minutes. It's also super easy to game: insert random lags, reduce tokens/sec, and there you have a model that "maintains attention over long time horizons."
Maybe more importantly, how do people in this field buy into these easily gameable non-indicators? How did they not develop the instinct to call out metrics like lines of code, number of tokens burned, or time taken to process a task as BS the instant they hear them?
How do they benchmark their own code? The longer it runs the better? By the number of CPU cycles spent?
This is not "how long does the AI take to do ${thing}"; it is "how long does a *human* take to do ${thing}", where ${thing} is drawn from the set of tasks that the AI has probability n of getting right, and n happens to be 50% or 80% in the METR studies.
At least, that's the short answer; here's a video with more depth: https://www.youtube.com/watch?v=evSFeqTZdqs
In my experience, the AI actually completes the task in a few minutes, even when it's a 2-ish-hour task and the AI has a time horizon of 2 hours at P(correct) = 0.8. It is I, the human, not the AI I'm using, who would have taken 2 hours.
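To make the definition concrete, here's a minimal sketch of how a METR-style time horizon can be computed. This is hypothetical code with made-up data, not METR's actual pipeline: fit a logistic curve of AI success probability against the log of human completion time, then invert the fitted curve to find the human-time at which predicted success equals 50% or 80%.

    # Hypothetical sketch of a METR-style time horizon calculation,
    # not METR's actual code. Fit P(AI success) against log(human time),
    # then invert the fitted sigmoid to get the 50% and 80% horizons.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Made-up data: how long each task takes a human (minutes), and
    # whether the AI agent solved that task.
    human_minutes = np.array([2, 5, 10, 30, 60, 120, 240, 480])
    ai_solved     = np.array([1, 1,  1,  1,  1,   0,   0,   0])

    model = LogisticRegression()
    model.fit(np.log(human_minutes).reshape(-1, 1), ai_solved)
    w, b = model.coef_[0, 0], model.intercept_[0]

    def horizon(p):
        # Invert the sigmoid: human-minutes where predicted P(success) = p.
        return np.exp((np.log(p / (1 - p)) - b) / w)

    print(f"50% horizon: {horizon(0.5):.0f} human-minutes")
    print(f"80% horizon: {horizon(0.8):.0f} human-minutes")

Note that the only clock in there is the human's, on tasks the AI does or doesn't solve; nothing in the metric rewards the AI for burning wall-clock time, so the 80% horizon comes out shorter than the 50% one.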