But we're giving agents terminal access and API keys now. The attack vector is becoming natural language. An agent gets "socially engineered" by a prompt; another hallucinates fake data and passes it down the chain.
Trying to secure these systems feels like trying to write a regex that catches every possible lie. We've shifted the foundation of security from numbers to words, and I don't think we've figured out what that means yet.
Is anyone thinking about actual architectural solutions to this? Not just "use another LLM to guard the LLM" — that feels like circular logic. Something fundamentally different.
(Not a native English speaker, used AI to clean up the grammar.)
If I were ever to run Claude against a production AWS account, for instance, you'd best believe the role it assumed would use temporary access keys scoped to the bare minimum of permissions.
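To make that concrete, here's a rough sketch of how you can scope temporary credentials down with an STS session policy. The bucket name, prefix, role ARN, and account ID below are all made-up placeholders; the real permissions set would depend entirely on what the agent actually needs.

```python
import json

# Hypothetical least-privilege session policy: the agent can only read
# objects under one S3 prefix. Anything not listed is implicitly denied.
def scoped_session_policy(bucket: str, prefix: str) -> str:
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
        }],
    })

# Passed as the Policy parameter to sts:AssumeRole, this further narrows
# the temporary credentials to the *intersection* of the role's own
# permissions and this policy, and they expire after DurationSeconds:
#
#   aws sts assume-role \
#       --role-arn arn:aws:iam::123456789012:role/agent-role \
#       --role-session-name claude-agent \
#       --duration-seconds 900 \
#       --policy "$(python policy.py)"

print(scoped_session_policy("my-agent-bucket", "inbox"))
```

Even if the agent gets prompt-injected, the blast radius is capped at what those short-lived, scoped-down credentials can do.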
This is akin to saying "we're fully committed to concatenating SQL queries directly from request data, but I wonder if that's risky?"
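Worth noting why the analogy cuts both ways: SQL injection got a structural fix, because parameterized queries keep attacker data out of the code channel entirely. A minimal illustration with Python's built-in sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

malicious = "alice' OR '1'='1"

# String concatenation: the attacker's input is parsed as SQL,
# so the WHERE clause matches every row.
unsafe = conn.execute(
    "SELECT name FROM users WHERE name = '" + malicious + "'"
).fetchall()

# Parameterized query: the driver treats the input as pure data,
# so it's matched literally and finds nothing.
safe = conn.execute(
    "SELECT name FROM users WHERE name = ?", (malicious,)
).fetchall()

print(unsafe)  # [('alice',)]
print(safe)    # []
```

With prompts there's no equivalent separation between instructions and data, which is exactly the problem: every token goes through the same channel.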
Part of security awareness is knowing when something is simply not worth the risks.