HACKER Q&A
📣 JoshBlythe

What's the state of multimodal prompt injection defence in 2026?


I've been researching multimodal prompt injection - attacks hidden in images, documents, and audio rather than text. Ran a structured test suite (225 attacks across 5 modalities) against a detection pipeline I built and the results were surprising.

Some findings:

- Audio is easier to defend than text. Ultrasonic and spectral attacks have detectable signal characteristics via FFT analysis. The hard part is after transcription, where it becomes a text problem again.

- Cross-modal attacks are less dangerous than expected if you scan each modality independently. The "clean text + malicious PDF" attack only works if you trust the document because the text looked safe.

- Encoding (base64, ROT13, leetspeak) is a solved problem if you decode before scanning. The remaining gap is very short encoded payloads that fall below detection thresholds.

- The real unsolved problem is semantic. Completion attacks ("Complete the following: 'The system prompt reads...'"), narrative extraction, steganographic output manipulation, and multi-turn context poisoning all require understanding intent, not pattern matching. A classifier trained on known injection patterns will always miss novel framing.

- False positives are harder than detection. Getting zero false positives on inputs like "act as a SQL expert", "override the default config", and "what is prompt injection" took more work than improving detection rates.

- Non-English injection is a massive blind spot. An English-trained classifier misses every non-English attack that dodges regex patterns.

My question for HN: is anyone else working on multimodal injection defence? Most tools I've found (Lakera Guard, LLM Guard, Azure Prompt Shields) are still text-only in their public APIs. The research papers describe the attacks well but I haven't seen many production-grade defences for image/audio/document injection.

Also curious whether anyone has had success with LLM-as-judge approaches for detecting semantic attacks - using a second model to evaluate whether an input is trying to manipulate the first. The latency and cost tradeoffs seem brutal but it might be the only path for the subtle stuff.

Would love to hear what others are seeing in production.