Approaches I've tried or seen:

- Separate validation layer before tool execution
- Hard-coded pre/post conditions in the tool wrapper
- Secondary model auditing planned actions before they run
The secondary-model approach roughly doubles inference cost per action. Tool wrappers work, but they require defensive code written for every single tool.
What's actually working in production? Specifically for agents that write to databases, send emails, or call APIs where mistakes are hard to undo.
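For context on what I mean by pre/post conditions in the wrapper, here's a minimal sketch. Everything here is illustrative (the `guarded` decorator, `send_email`, the limits) and not from any real library; the idea is just that validation runs before the tool can cause a side effect, and the result is sanity-checked after:

```python
class PreconditionError(Exception):
    pass

def guarded(precheck, postcheck=None):
    """Wrap a tool so its arguments are validated before execution
    and its result is validated after (hypothetical helper)."""
    def decorator(tool):
        def wrapper(*args, **kwargs):
            precheck(*args, **kwargs)       # reject bad calls before any side effect
            result = tool(*args, **kwargs)
            if postcheck is not None:
                postcheck(result)           # sanity-check what the tool reported
            return result
        return wrapper
    return decorator

# Illustrative limits for an email tool
MAX_RECIPIENTS = 5
ALLOWED_DOMAIN = "example.com"

def email_precheck(to, subject, body):
    if len(to) > MAX_RECIPIENTS:
        raise PreconditionError(f"refusing to email {len(to)} recipients")
    for addr in to:
        if not addr.endswith("@" + ALLOWED_DOMAIN):
            raise PreconditionError(f"recipient outside allowed domain: {addr}")

@guarded(email_precheck)
def send_email(to, subject, body):
    # a real implementation would call the mail API here
    return {"sent": len(to)}
```

The pain point is exactly what was said above: `email_precheck` has to be hand-written per tool, and every new tool needs its own version.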
- gate access to secrets via an external service that substitutes placeholder values with the actual secrets at call time, so the LLM never sees them, e.g. something like agentvault.co
- have the agent perform the action in a staging environment with fake data, then replay the recorded actions against real data without any LLM involvement (e.g. use something like Stagehand / director.ai to write the initial browser automation script, then replay the recorded actions deterministically once you've seen them work)
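The second bullet boils down to a record-then-replay pattern: log every tool call the agent makes against staging, review the log, then re-execute the identical sequence against production with no model in the loop. A minimal sketch, with all names (`Recorder`, `replay`) being hypothetical rather than any specific tool's API:

```python
import json

class Recorder:
    """Captures the tool calls an agent makes during a staging run."""
    def __init__(self):
        self.log = []

    def record(self, tool_name, **kwargs):
        self.log.append({"tool": tool_name, "args": kwargs})

    def save(self, path):
        # persist the run so it can be reviewed before replay
        with open(path, "w") as f:
            json.dump(self.log, f, indent=2)

def replay(path, tools):
    """Re-execute a recorded run with real tool implementations.
    Deterministic: no model calls happen here, only the logged steps."""
    with open(path) as f:
        log = json.load(f)
    results = []
    for step in log:
        tool = tools[step["tool"]]
        results.append(tool(**step["args"]))
    return results
```

The reviewable JSON log is the safety boundary: a human (or a cheap validator) inspects it once, and production only ever sees the approved sequence.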