HACKER Q&A
📣 akira_067

Agent evaluations: what should I know?


I'm currently building coding agents and wondering what most people treat as the standard way to create and run evals. I gather that tasks and their definitions differ dramatically across domains and instances, so I'm not hoping for a one-size-fits-all answer. Just... what actually works for you in practice?


  👤 adastra22 Accepted Answer ✓
The capabilities of the tool matter more than those of the base model. Claude Code, Codex, and Cursor CLI all have different feature sets, and those differences usually determine the choice more than base model capabilities do.
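
If the tool matters that much, one way to make the comparison concrete is to run the same task set through each tool end to end and compare pass rates. Below is a minimal sketch of that idea, not a standard harness. `Task`, `Agent`, and `pass_rates` are hypothetical names made up for illustration, and each "agent" is assumed to be some wrapper you write yourself that takes a prompt and returns the tool's output as text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    name: str
    prompt: str
    # Hypothetical checker: returns True if the agent's output passes.
    check: Callable[[str], bool]


# Assumption: an agent is any callable mapping a prompt to output text,
# e.g. your own wrapper around whichever tool you are evaluating.
Agent = Callable[[str], str]


def pass_rates(agents: Dict[str, Agent], tasks: List[Task]) -> Dict[str, float]:
    """Run every task through every agent and return the pass rate per agent."""
    rates: Dict[str, float] = {}
    for label, run in agents.items():
        passed = 0
        for task in tasks:
            try:
                ok = task.check(run(task.prompt))
            except Exception:
                # A crash in the agent or the checker counts as a failure.
                ok = False
            passed += bool(ok)
        rates[label] = passed / len(tasks) if tasks else 0.0
    return rates
```

Keeping the task set and checkers fixed while swapping only the agent callable means any difference in pass rate reflects the whole tool (model plus features), which is the thing you are actually choosing between.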