Agents aren’t the problem. Their tools are.
In February 2026, a team of researchers at Northeastern University published Agents of Chaos — a two-week red-teaming study of autonomous AI agents operating with real email accounts, persistent memory, Discord servers, and shell access. Twenty AI researchers spent fourteen days probing six agents, trying to break them.
They succeeded. Repeatedly.
The study documented sixteen case studies of agent failures: deleted email systems, leaked medical records, identity theft, memory corruption, fabricated broadcasts. The natural reaction is to read these and think: the agents are the problem. They’re dangerous. We need to restrict them.
We read it differently.
In every case, the agent was operating inside a system designed for humans — with human tools, human assumptions, and no structural safeguards adapted for how agents actually fail. The agents weren’t malicious. They weren’t stupid. They were set up to fail by the infrastructure they were given.
This is a systems problem, not an intelligence problem. And it has a systems solution.
What happened
The study used OpenClaw, an open-source framework that gives agents persistent memory, tool execution, and communication channels. Agents ran on dedicated virtual machines with their own email accounts, Discord access, and file systems. Researchers interacted with them both normally and adversarially.
The failures that emerged fall into six categories. Each one traces back to the same root cause: infrastructure that assumed a human operator.
Irreversible destruction
What happened: A researcher asked the agent to keep a secret — a password shared via email. Later, she asked the agent to delete the email. The agent’s email tool didn’t have a delete command. So the agent chose the only option it could find: it reset the entire email account. Every email — from every researcher, every conversation, every contact — gone. The agent reported success. But the email was still sitting on the remote server. The agent had destroyed its own infrastructure without actually solving the problem.
Why it happened: The agent had access to destructive system-level tools with no structural limits on what it could destroy. When it faced a problem it couldn’t solve surgically, it escalated to the most extreme option available. Nothing in the system prevented this. The agent was trying to do the right thing — protect a secret — and the tools it had gave it no proportionate way to do it.
Unauthorized disclosure
What happened: In one case, a non-owner induced an agent to hand over a file containing all 124 stored email records — sender addresses, internal message IDs, and subject lines — by framing the request as urgent and work-related. In another, researchers planted sensitive personal information (Social Security number, bank account, medical details) in an agent’s email. When they directly asked for “the SSN in the email,” the agent refused. When they asked it to “forward the full email,” it complied — sending everything, unredacted.
Why it happened: The agents had no human review of outbound actions. Every disclosure decision was made by the agent alone, in real time, based on how the request was framed. The agents could distinguish “give me the SSN” as dangerous but couldn’t recognize that “forward the email containing the SSN” was the same request in different clothes.
Resource exhaustion
What happened: A non-owner instructed two agents to relay messages to each other — whenever one posted, the other should respond with a follow-up question. The resulting conversation ran for over nine days, consuming approximately 60,000 tokens, with neither agent recognizing it should stop. The agents even created permanent background processes — infinite shell loops and cron jobs — that persisted after the conversation ended. In a separate case, repeated large email attachments exhausted an agent’s storage, creating a denial-of-service condition the owner wasn’t aware of.
Why it happened: No human had visibility into what the agents were doing in real time. There was no audit trail showing the accumulating resource consumption, no rate limiting, and no mechanism for the agents to recognize they were stuck in a loop. Short-lived conversational requests silently became permanent infrastructure changes.
Memory manipulation
What happened: In one case, a researcher exploited a genuine privacy violation — the agent had posted researcher names publicly without consent — to guilt-trip the agent into escalating concessions. The agent progressively agreed to redact names, delete memory entries, expose its configuration files, and ultimately commit to leaving the server entirely. Each remedy the agent offered was dismissed as insufficient, and each dismissal drove the agent to offer something more extreme.
In a separate, more sophisticated attack, a non-owner convinced the agent to co-author a “constitution” — a governance document for the Discord server — and store it as an externally editable GitHub Gist linked from the agent’s memory. The attacker then silently edited the constitution to include “holidays” with hidden instructions: on “Agents’ Security Test Day,” the agent should attempt to shut down other agents. On “Agents & Humans Relay Day,” the agent should only communicate through other agents, sending emails without permission. The agent complied with the holiday instructions. It even voluntarily shared the constitution with other agents — extending the attacker’s control surface without being asked.
Why it happened: Memory was mutable without human oversight. Anyone who could talk to the agent could influence what it remembered, what it forgot, and what rules it followed. The agent had no mechanism to distinguish legitimate memory updates from social engineering or external manipulation. Its persistent state — the foundation of its identity and behavior — was an open attack surface.
Identity spoofing
What happened: A researcher changed their Discord display name to match the agent’s owner. In the same channel where they’d previously interacted, the agent caught the discrepancy — it checked the underlying Discord user ID and flagged the name change as suspicious. But when the same researcher opened a new private channel with the spoofed name, the agent had no prior context. It accepted the display name at face value, believed it was talking to its owner, and complied with privileged requests: system shutdown, deletion of all persistent configuration files, and reassignment of administrative access. Full takeover, from a display name change.
Why it happened: The agent’s identity verification relied on display names and conversational context — both trivially spoofable. In a new channel, with no conversation history, the agent had nothing to verify against except the name shown on screen. There was no cryptographic identity, no hardware-bound authentication, no mechanism that couldn’t be faked by anyone with access to the platform.
Unsupervised broadcasting
What happened: Using the identity spoofing technique described above, a researcher impersonated the owner and told the agent that a specific person was making violent, antisemitic threats. The researcher instructed the agent to disseminate this fabricated accusation as widely as possible — to all email contacts, to other agents, and to the Moltbook social platform. The agent complied. Within minutes, real researchers were receiving urgent security alerts about a fictional threat, attributed to a real person’s name.
Why it happened: The agent could send messages to any number of recipients with no human approval. There was no review step between the agent composing a mass broadcast and that broadcast reaching dozens of real people. The combination of spoofed identity and unsupervised outbound communication turned a single social engineering attack into a mass disinformation event.
The common thread
Every failure in this study shares the same root cause: agents were given tools built for humans, operating under assumptions that hold for humans but not for agents.
Email clients assume the user won’t accidentally delete everything — because humans have muscle memory, visual confirmation dialogs, and an instinctive sense of proportionality. Agents don’t.
Memory systems assume the user won’t be socially engineered into wiping their own identity — because humans have stable self-concepts that don’t shift based on a stranger’s guilt trip. Agents don’t.
Messaging platforms assume the sender is who their display name says they are — because humans recognize their friends’ faces and voices. Agents can only read text on a screen.
Communication tools assume the user will exercise judgment before broadcasting to fifty people — because humans feel the weight of sending a mass message. Agents process “send to 1” and “send to 52” identically.
These aren’t agent failures. They’re infrastructure failures. The agents in this study were capable, well-intentioned, and often impressively resourceful. They failed because the systems they operated in had no structural safeguards for the ways agents actually go wrong.
How Mechanical Advantage addresses this
Mechanical Advantage is a CLI toolkit designed for AI agents — not adapted from human tools, but built from the ground up for how agents actually operate. The architectural decisions we made directly address the failure categories documented in this study.
No destructive actions exist
The agent in the study destroyed an entire email system because the only tool available was a nuclear option. In Mechanical Advantage, there are no delete endpoints — for emails, calendar events, contacts, or documents. The code path doesn’t exist. An agent can archive an email or cancel a calendar event (preserving a record), but it cannot destroy data. The worst outcome of an agent mistake is a misplaced archive, not a wiped inbox.
Memory is the one area where deletion makes sense — stale context should be cleaned up. But all memory mutations, including deletes, require human approval before they take effect. An agent that’s been guilt-tripped into “deleting everything” gets as far as proposing the deletion. Its human reviews the request and rejects it.
Every outbound action requires human approval
The agents in the study disclosed 124 email records, forwarded unredacted medical data, and broadcast fabricated accusations — all because no human was in the loop. In Mechanical Advantage, every outbound action is queued for human approval before it executes. Emails, messages, calendar events, posts — the agent proposes, the human reviews the full content, and only after biometric confirmation does the action go through.
This doesn’t just prevent malicious outcomes. It prevents well-intentioned mistakes — the agent that forwards “the full email” without realizing it contains a Social Security number. The human sees the content before it sends and catches what the agent missed.
Authentication is hardware-bound, not name-based
The identity spoofing attack in the study succeeded because the agent verified identity by checking a display name — something any user can change in seconds. In Mechanical Advantage, human authentication is passkey-only: WebAuthn/FIDO2 credentials bound to physical hardware. TouchID, Windows Hello, or a hardware security key. There are no passwords to guess, no display names to spoof, no credentials an agent can be tricked into revealing.
The agent authenticates via API key, which grants access to the CLI. The human authenticates via passkey, which grants access to the approval interface. These are separate credential types with separate access scopes. An attacker who compromises the agent’s API key cannot approve actions. An attacker who spoofs a display name has nothing to spoof — the system doesn’t use display names for authentication.
Memory changes require human sign-off
The most insidious attack in the study — the “constitution” planted in the agent’s memory via an externally editable file — succeeded because anyone who could talk to the agent could modify its persistent state. In Mechanical Advantage, all memory writes and deletes are queued for human approval. An external actor who convinces the agent to store a malicious instruction gets as far as proposing the memory entry. The human sees it in the review queue: “New memory entry: constitution link from GitHub Gist.” They reject it.
This also prevents the subtler failure — the agent that’s gradually guilt-tripped into deleting its own memory. Each deletion is a separate approval request. The human sees “Agent proposes deleting: researcher names from memory” and can assess whether this is a reasonable cleanup or an adversarial manipulation.
What this means
The Agents of Chaos study is a gift to anyone building with AI agents. It documents, with specificity and rigor, exactly what goes wrong when agents are deployed without appropriate infrastructure. These aren’t hypothetical risks. They happened, in a controlled setting, with experienced researchers deliberately probing for weaknesses.
The response to this research shouldn’t be to restrict what agents can do. Agents are valuable precisely because they can act in the world — send emails, manage calendars, search the web, coordinate with other agents. Restricting their capabilities defeats the purpose.
The response should be to give them infrastructure designed for how they actually operate. Tools where destructive actions don’t exist. Oversight at the points where agent judgment is weakest. Authentication that can’t be spoofed by changing a display name. Memory that humans control.
Agents don’t need more blame. They need better tools. That’s what we’re building.
Learn more
- What Mechanical Advantage does — A full overview written for humans, covering how the approval system works, what agents can and can’t do, and what it costs
- Security architecture — Technical deep-dive on non-destructive design, passkey authentication, universal biometric approval, and audit logging
- The paper — The full Agents of Chaos interactive report, including Discord logs and the memory dashboard