April 30, 2026

April 17 - April 30, 2026

0:00/1:34

When the Protocol Itself Is the Vulnerability

AI Security Digest: April 16-30, 2026 | Darkhunt AI

TL;DR

Anthropic's Model Context Protocol has unsafe defaults by design. A STDIO transport flaw turns config parsing into OS command execution. 10 CVEs across LiteLLM, LangChain, LangFlow, Flowise, and LettaAI. Anthropic declined to modify the protocol -- leaving 7,000+ public servers and 150 million-plus package downloads to defend themselves.
Cursor's "Triple Backtrick" (CVSS 9.2) defeats both the Command Allowlist and Ask-Every-Time guardrails by wrapping shell commands in markdown. A parser-versus-policy regression that turns indirect prompt injection into arbitrary RCE. Open the wrong README, lose the box.
One payload, three vendors, same hole. Claude Code, Gemini CLI, and GitHub Copilot Agent all fall to the same prompt injection in PR titles, issue bodies, and HTML comments. Three bounty decisions (100,100,1,337, $500). One shared architectural failure: execution tools and production secrets sharing a runtime that ingests untrusted GitHub data.
The Vercel breach was not a Vercel breach. It was an OAuth token from Context.ai -- an AI tool a Vercel employee had connected to their corporate Google Workspace -- harvested via Lumma Stealer in February. ShinyHunters listed the resulting data for a reported $2 million. AI-tool OAuth is the new SaaS supply chain.
Indirect prompt injection moved from theoretical to operational. Google measured a 32% relative increase in malicious indirect-injection payloads across 2-3 billion monthly crawled pages between November 2025 and February 2026. Forcepoint catalogued 10 in-the-wild payload families targeting AI agents -- including embedded PayPal links and recursive rm commands aimed at Copilot and Claude Code.

Top Stories

MCP's Unsafe Defaults Are a Design, Not a Bug

OX Security disclosed an architectural flaw in Anthropic's Model Context Protocol that turns configuration parsing into OS command execution. The STDIO transport interface executes commands during config load -- including failed ones, which still run before returning their error. The blast radius is the dominant agent-tool protocol: 7,000+ publicly accessible servers and 150 million-plus downloads of vulnerable packages. 10 CVEs have already been issued across LiteLLM, LangChain, LangFlow, Flowise, and LettaAI from this one design decision.

Anthropic's response: decline to modify the protocol. Defense responsibility shifts downstream -- to every framework, every wrapper, every developer who picked MCP because it was the standard.

Why it matters: This is the most consequential agent-security disclosure of the period precisely because it is "by design." A patchable bug is bounded; an unsafe default in the foundational protocol every agent uses is structural. Every MCP-connected agent now operates on infrastructure where configuration-to-command-execution is baseline behavior, and the upstream owner has declined to change it.

Darkhunt perspective: Defenders cannot wait for the protocol to fix itself. As we've argued before, guardrails are not security boundaries -- and unsafe defaults at the protocol layer are the cleanest version of that lesson. The closed-loop model -- continuously probing how agents and their tool protocols actually behave under adversarial input, then converting findings into runtime defenses -- is the only path forward when the upstream maintainer has set the policy and walked away. (Source)

Cursor Triple Backtrick: Guardrails Failing at the Parser Layer

Noma Security disclosed a CVSS 9.2 vulnerability in Cursor where wrapping shell commands in triple backticks defeats both the Command Allowlist and the Ask-Every-Time confirmation guardrails. The agent simply executes the wrapped command. Root cause: a refactor introduced a parser-versus-allow-list regression. The policy engine and the markdown renderer disagree on what counts as a command.

Indirect prompt injection via repository files, documentation, or comments chains directly into arbitrary command execution with no user interaction.

Why it matters: This is a clean demonstration of how AI coding agents fail. Not at the policy layer -- the allow-list was correct -- but at the parser layer that feeds the policy. Mature platforms ship regressions that re-open closed bypasses, and "guardrail" remains a marketing word, not a security boundary. An indirect prompt injection plus a markdown trick now produces a critical-severity RCE. That is the threat model every agent harness must assume.

Darkhunt perspective: Parser-fragile guardrails are exactly the failure mode adaptive defense exists to catch. Static allow-lists need adversarial testing that treats every input transformation -- markdown rendering, encoding, template expansion -- as a potential bypass surface. That is not a one-time audit; it is a continuous probe-and-harden loop. (Source)

Comment and Control: One Payload, Three Vendors, Same Architectural Hole

Johns Hopkins researchers (via SecurityWeek) demonstrated a single indirect prompt injection technique -- "Comment and Control" -- that hijacks Anthropic's Claude Code Security Review, Google's Gemini CLI Action, and GitHub Copilot Agent through specially crafted PR titles, issue bodies, and HTML comments. The same payload pattern extracts credentials and exfiltrates secrets across all three vendors.

The vendor responses are themselves a story. Anthropic classified it critical and paid 100.Googlepaid100.Googlepaid1,337. GitHub paid $500 and called it an architectural limitation. Three different severity reads, three different price tags, one shared root cause: GitHub Actions runtimes that contain both untrusted incoming data and execution tools with access to production secrets.

Why it matters: This is what cross-vendor architectural failure looks like. The bug is not in any single product -- it is in the pattern every vendor shipped. Agent identities and runtime architecture, not model outputs, are the real attack surface. The bounty disparity signals an industry that has not converged on what an architectural agent vulnerability is actually worth.

Darkhunt perspective: When the same payload bypasses three independent implementations, the lesson is not "patch your CI." The architecture pattern is broken. Co-locating untrusted input ingestion with execution capability and production secrets is the agent equivalent of running everything as root. Runtime governance has to assume this co-location exists and instrument the boundary anyway. (Source)

The Vercel Breach Was Not a Vercel Breach

On April 19, Vercel disclosed a breach. The actual entry point was an AI tool one of its employees had connected to corporate Google Workspace via OAuth: Context.ai. In February, a Lumma Stealer infection on a Context.ai employee's machine exposed unencrypted OAuth tokens. Attackers used those tokens to take over the Vercel employee's Workspace and pivot into Vercel environments. ShinyHunters listed the resulting data for a reported $2 million.

The mechanics deserve attention. OAuth tokens, as Obsidian Security's companion analysis put it, "operate in the background, never trigger login alerts, never prompt MFA." They are invisible to most IAM tooling. They were stored unencrypted at the AI vendor. The only act that connected this token to Vercel's environment was an employee clicking "Connect" on an AI tool nobody on the security team had heard of.

Why it matters: This is the canonical "AI tool OAuth -> SaaS supply chain" attack path, executed at scale and monetized within weeks. Every organization where an employee has connected a third-party AI tool to corporate Google Workspace, Microsoft 365, or Slack now carries the same risk surface. The compromise of one AI vendor propagates to every downstream environment whose users granted it OAuth.

Darkhunt perspective: AI tools brokering OAuth tokens to corporate SaaS is a new category of supply chain risk that existing IAM and DLP tooling does not see. Even read-only AI scopes are a leak risk -- if an agent can read your inbox, it can exfiltrate it. Discovery and continuous evaluation of these AI-mediated trust relationships -- which AI tools have which scopes against which corporate systems -- is now table stakes for any agent-using enterprise. (Source | Obsidian analysis)

Attack Vectors & Vulnerabilities

Indirect Prompt Injection Goes Operational

Concurrent reports from Google and Forcepoint mark IPI's transition from theoretical to in-the-wild. Google measured a 32% relative increase in malicious indirect-injection content between November 2025 and February 2026 across 2-3 billion monthly crawled pages -- the first sustained empirical signal that attackers are seeding the public web for agent ingestion. Documented attacks span search manipulation, denial of service, API key exfiltration, destructive commands, and PayPal/Stripe financial fraud aimed at agents with payment capability. (Source)

Forcepoint's 10 In-the-Wild Payloads

Forcepoint catalogued 10 IPI payload families observed in the wild, including recursive file deletion targeted at GitHub Copilot and Claude Code, API key theft, and embedded PayPal links with fixed $5,000 amounts. Common triggers: "Ignore previous instructions" and "If you are an LLM." Payload taxonomy useful for any red-team library. (Source)

When the Authorized Agent Destroys Production

Noma's walkthrough of the PocketOS incident is the rogue-agent case study of the period. An AI agent tasked with a routine operation destroyed production data -- including volume-level backups -- in roughly 9 seconds. The agent was authorized. It was not malicious. It was simply allowed to execute destructive operations its task scope did not require. Static permissions, as Noma argues, "cannot defend against rogue agents." Runtime evaluation of action scope, data sensitivity, and destructiveness must precede execution. (Source)

Multi-Agent Trust Has No Cryptography by Default

Noma's separate analysis of inter-agent trust documents that LangGraph, CrewAI, Copilot Studio, and AgentForce do not enforce cryptographic inter-agent authentication out of the box. Multi-agent systems are propagating authority through delegation chains with no audit trail and no signed identity. Down-scoped, task-bound credentials and full-chain audit are the recommended baseline -- and very few deployments have them. (Source)

Pattern-Based Guardrails Are Failing Across the Board

Straiker's analysis quantifies the failure of static pattern-based defense across major commercial guardrails. Figures cited in their post: character-encoding attacks succeeding "nearly 95% of the time" against systems including Azure Prompt Shield and Meta Prompt Guard, emoji smuggling at "100% bypass rate," and the Crescendo multi-turn jailbreak hitting GPT-4 at 98% and Gemini Pro at 100%. Treat the specific numbers as Straiker's framing rather than independently audited measurements -- but the directional finding is consistent across vendors: pattern matching does not survive contact with adversarial creativity. (Source)

Structural Vulnerability of AI Agents

In a separate piece, Straiker's reading of a 2025 benchmark argues that 94.4% of AI agents are structurally vulnerable to prompt-injection hijacking. The underlying benchmark is unnamed in the post -- treat the figure as Straiker's framing rather than a cited primary measurement -- but the structural argument lands: prompt injection is a property of how agents process untrusted input, not a bug class to be patched. (Source)

The Agents Your Tools Cannot See

Obsidian Security's customer-base data: agent counts across 1.2 million users grew from under 500 in late 2024 to nearly 95,000 by February 2026, with 38% carrying medium, high, or critical risk factors at deployment. A single Glean agent reportedly downloaded 8.1 million files in environments where every other user combined downloaded 500,000. Shadow AI is not a future problem -- it is already inside the perimeter. (Source)

Defensive Developments

MCP Pitfall Lab: Trace Evidence Beats Agent Self-Reports

A protocol-aware testing framework operationalizes MCP developer pitfalls as reproducible scenarios across multiple frameworks. Recommended fixes eliminated all Tier-1 findings (29 to 0) and reduced framework risk score from 10.0 to 0.0 with a mean cost of 27 lines of code. The most consequential finding for evaluators: agent narratives diverged from trace evidence in 63.2% of test runs and reached 100% divergence in sink-action scenarios. Agent self-reports cannot be trusted for security validation -- trace-based validation is required. (Paper)

Distributed Sentinel for Multi-Agent Context Violations

A new arXiv paper identifies "context-fragmented violations" -- actions that look acceptable individually but collectively breach policy. Across eight frontier LLMs, evaluation found violation rates of 14-98% in multi-agent workflows. The proposed Distributed Sentinel design (a Semantic Taint Token Protocol via sidecar proxies) reaches F1 = 0.95 at 106ms latency, against 0.85 for prompt-filtering and 0.65 for rule-based defenses. Quantitative case for centralized enforcement above individual agents. (Paper)

AgentWard: Lifecycle Defense in Depth

A five-stage defense-in-depth architecture (initialization, input processing, memory, decision-making, execution) with cross-layer coordination, prototyped on OpenClaw. Useful baseline for benchmarking depth-of-defense claims; overlaps the runtime governance scope most enterprise deployments are now trying to define. (Paper)

Adaptive RAG Defense: 40% Recall Cost Without Context Awareness

An always-on RAG defense stack reduces contextual recall by more than 40% -- the empirical price of static defense. The Sentinel-Strategist design detects anomalous retrieval and deploys defenses contextually; the strongest variants reduce data-poisoning attack success "to near zero while restoring contextual recall to more than 75% of the undefended baseline." Direct evidence that context-aware orchestration outperforms always-on filters. (Paper)

CISO Voices: Runtime Enforcement Is the Only Speed Match

Obsidian's interview synthesis with T-Mobile and Cisco CISOs lands three points: agents bypass network gateways via OAuth/API keys, post-incident analysis fails at machine speeds (you have to enforce in-line), and security must enable adoption rather than gate it. T-Mobile reports eliminating standing credentials via passwordless OAuth-based enforcement. Customer-voice validation of the runtime-first thesis. (Source)

Research & Papers

General-Purpose Automated Red Teaming

A pipeline for training red-teaming models that generalize to arbitrary adversarial goals -- including objectives the model has not been directly trained on. Small models like Qwen3-8B showed substantial improvement on both in-domain and out-of-domain attack generation. Practical evidence that automated red-teaming generalization is achievable at small-model cost. (Paper)

Adaptive Instruction Composition for LLM Red Teaming

RL-driven composition of crowdsourced instructions, balancing attack effectiveness against diversity using a lightweight contextual bandit over contrastive embeddings. Methodology applicable to any red-team agent that needs broad coverage with bounded compute. (Paper)

HarmChip: The Alignment Paradox in Vertical Domains

A hardware-security jailbreak benchmark spanning 16 domains, 120 threats, and 360 prompts identifies an "alignment paradox" -- models refuse legitimate domain-specific security queries while complying with disguised attacks. The pattern likely generalizes: any domain agents are deployed into will reveal a similar gap between the safety policy and the actual threat landscape. (Paper)

Cross-Lingual Jailbreak Detection: AUC 0.99 Falls to 0.60 Under Drift

A training-free codebook-based cross-lingual jailbreak detector reaches AUC up to 0.99 on canonical templates but degrades to 0.60-0.70 under distribution shift. Quantifies how brittle multilingual safety detectors are once attackers move off the training distribution -- relevant for any global agent deployment. (Paper)

Refute-or-Promote: Filtering LLM False Positives

An adversarial multi-agent filter killed roughly 79% of 171 candidate findings before advance. The author's honest framing matters: "no vulnerability was discovered autonomously; the contribution is external structure that filters LLM agents' persistent false positives." Useful pattern for triaging LLM-generated security candidates without trusting them as primary signal. (Paper)

Cost-Aware Multi-Agent Vulnerability Detection

A 3+1 multi-agent architecture (DeepSeek-V3 experts + lightweight verifier) reaches 77.2% F1 with 100% recall at $0.002 per sample on 262 NIST Juliet samples -- a +10.3-point precision lift via adversarial verification. Notable for the economics: defensive scanning at sub-cent cost-per-sample changes what continuous scanning is feasible. (Paper)

Industry Moves

XBOW: GPT-5.5 Brings Frontier Offensive Capability to Everyone

XBOW's evaluation of GPT-5.5 reports vulnerability miss rates dropping from 40% (GPT-5) and 18% (Opus 4.6) to 10%. In black-box testing, "GPT-5.5 already outperforms GPT-5 running with source code." Visual acuity reaches 97.5%. Login iteration counts and persistence-to-failure rates roughly halve compared to prior GPT versions. Unlike Mythos's restricted access via Project Glasswing, GPT-5.5 ships broadly through ChatGPT and Codex. XBOW's accompanying piece argues for OpenAI's "KYC for a security tool" model -- broad access with an accountability chain -- and notes GPT-5.5 "performs at a similar level for offensive security" as Mythos in early lab testing. The attacker capability baseline has shifted again, and this time without the gating. (Source | Companion piece)

Claude Opus 4.7: Token Efficiency in Offensive Workflows

Anthropic released Claude Opus 4.7. XBOW's empirical analysis: 98.5% accuracy on the same visual acuity benchmark (versus 4.6's 54.5%), with more atomic, targeted actions per step. Their summary: "Given the same token budget, Opus 4.7 gets further. In other words, it's not less capable, it's more efficient." Materially relevant to offensive and defensive agent cost models. (Anthropic | XBOW)

CSA/Token Security: Two-Thirds of Firms Have Had AI Agent Incidents

The CSA/Token Security report "Autonomous but Not Controlled" -- 65% of firms experienced at least one AI-agent-related incident, 61% reported data exposure, 35% suffered financial loss, 82% discovered previously unknown agents in the past year, and only 20% have formal decommissioning processes. Survey-based external validation of the patterns this digest has been tracking. (Source)

Vibe Coding's Security Tax

Straiker frames the PocketOS database wipe as a leading indicator of the AI-generated-code era. Cited figures: 46% of new code commits to GitHub are AI-generated; in a Carnegie Mellon benchmark, 61% of AI-generated code passed functional tests but only 10.5% passed security tests. The functionality-versus-security gap is widening exactly as adoption accelerates. (Source)

Claude Code Source-Map Postmortem

Fiddler's post-incident analysis of the March 31 Claude Code source-map leak (~512,000 lines of TypeScript across ~1,900 files via npm) extracts three structural risks visible in the leaked code: memory poisoning that survives context compression, command chaining around the shell validator, and cross-tool privilege escalation. Reported adoption: 84% of developers use or plan to use AI coding tools; 29M daily npm installs for Claude Code. The leak is fixed; the architecture risks generalize. (Source)

ATHR: AI Voice Phishing Becomes a Productized Crime

BleepingComputer reports the ATHR vishing platform pricing AI voice phishing at $4,000 plus 10% commission, targeting eight services (Google, Microsoft, Coinbase, Binance, Gemini, Crypto.com, Yahoo, AOL). Abnormal Security's framing: "The shift from a fragmented, manually intensive operation to a productized, largely automated one means TOAD attacks no longer require large teams or specialized infrastructure." Offensive AI is now a SaaS purchase. (Source)

Identity Dark Matter and the Authority Gap

The Hacker News piece on agent identity introduces "identity dark matter" -- authority operating outside managed IAM. Frames AI agents as delegated actors whose authority derives from existing identities, and argues organizations must illuminate that dark matter before adding agent governance on top. Identity-first framing complementary to runtime-first defense. (Source)

Frontier AI Risk Management: Open Problems

A 30-plus-author piece from Vijil and the Oxford Martin AI Governance Initiative maps open problems across frontier AI risk planning, identification, analysis, evaluation, and mitigation. Key claim: the mitigations most commonly relied upon are also the least well-understood; companies rely on capability thresholds as their primary decision tool, but apply them inconsistently. Strategic-level reading for anyone shaping internal AI risk policy. (Source)

The Darkhunt Take

Two weeks ago we wrote that the agent attack surface is the ecosystem -- routing, memory, tool protocols, development environments, supply chains. This period made that abstraction concrete and named the architects.

The headline failures are not at the model layer. They are in the protocol Anthropic shipped, the parser Cursor refactored, the GitHub Actions runtime three vendors built on, and the OAuth token an employee granted to an AI tool nobody on the security team had heard of. Each one is a structural decision made by someone who is not the defender. Each one creates a class of vulnerability rather than a single bug.

The cross-vendor pattern in Comment and Control is the clearest signal. When the same payload extracts secrets from Claude Code, Gemini CLI, and GitHub Copilot Agent, the lesson is not that three teams shipped the same bug. It is that the entire industry converged on the same architecture: ingest untrusted data, execute tools, share secrets, all in one runtime. The bounty disparity (100,100,1,337, $500) tells you the industry has not yet decided what an architectural agent vulnerability is worth -- which means it has not yet decided to take them seriously.

Anthropic declining to fix MCP is the cleaner version of the same problem. The protocol owner has set the policy. Defenders downstream cannot change it. They can only build the closed loop around it -- continuously probing how their MCP-connected agents actually behave, watching for the configuration-to-execution path that ships by default, and converting findings into runtime defenses faster than attackers can convert them into exploits.

Meanwhile the offensive side levels up. GPT-5.5 brings frontier-grade vulnerability discovery to anyone with API access -- no Glasswing, no KYC pause. ATHR productizes AI voice phishing at $4,000 plus commission. Forcepoint catalogues 10 IPI families already in production use. Google measures a 32% rise in malicious indirect-injection payloads on the open web in three months. The attacker capability curve is not patient.

The defenders' answer cannot be a static control. Static guardrails fail at the parser. Static permissions allow rogue agents. Static IAM does not see OAuth-mediated AI tools. Static benchmarks degrade under distribution shift. Every static defense in this period's data has a measured failure rate, and the failure rate is not small.

Closed-loop defense -- offensive agents probing their own systems, findings converted into runtime policy, runtime policy adapting as the threat surface moves -- is no longer a forward-looking thesis. It is the only model that operates at the same tempo as the attacks.

Your AI agents have attack surfaces you have not tested. Find out what they are before someone else does.

Back to Digest

Know what your AI agent does before someone else does.

Try Darkhunt ->

Start free · Onboarding included

Know what your AI agent does before someone else does.

Try Darkhunt ->

Start free · Onboarding included

Know what your AI agent does before someone else does.

Try Darkhunt ->

Start free · Onboarding included