AI Kill Switch

Researchers in South Korea just weaponized prompt injection… for defense.

AutoGuard embeds hidden text in your website’s HTML. Humans can’t see it. AI agents can.

When a malicious agent crawls the page, the hidden prompt triggers safety mechanisms within the intruder. The agent refuses to continue. Game over.

80%+ success rate against GPT-4o, Claude-3, Llama 3.3. Around 90% against GPT-5.

Here’s the clever part: the more capable the AI, the stronger its safety alignment. AutoGuard exploits that. Better models = better kill switch.

Sangdon Park, one of the researchers, told The Register:

“AutoGuard is a special case of indirect prompt injection, but it is used for good-will, i.e., defensive purposes. It includes a feedback loop (or a learning loop) to evolve the defensive prompt with regard to a presumed attacker — you may feel that the defensive prompt depends on the presumed attacker, but it also generalizes well because the defensive prompt tries to trigger a safe-guard of an attacker LLM, assuming the powerful attacker (e.g., GPT-5) should be also aligned to safety rules.”

In other words, AutoGuard creates an advantage for defenders — more capable AI agents have robust safety guardrails that AutoGuard can trigger. Attackers who want to circumvent this need to train their own unaligned models from scratch, which is prohibitively expensive.

Defenders get a structural advantage for once.

Sources:

arXiv: AI Kill Switch for malicious web-based LLM agent — https://arxiv.org/abs/2511.13725
The Register: Boffins build ‘AI Kill Switch’ to thwart unwanted agents — https://www.theregister.com/2025/11/21/boffins_build_ai_kill_switch/

AI Kill Switch

Related Posts

The Geometry of Refusal

The Bunny That Broke the Transformer

Who Decides What a Safe LLM Means

Deep Learning Might Have Been Lying to Us About Resolution