January 7 – January 22, 2026

Your regular briefing on AI security threats, vulnerabilities, and defences from Darkhunt.
TL;DR
AI agents are now officially the insider threat: Palo Alto Networks declares agentic AI the primary security concern for 2026, with Gartner projecting 40% enterprise app integration by year-end
NIST moves on agent security standards: CAISI issued an RFI seeking input on securing AI agent systems - comment deadline March 9, 2026
Authorisation models are broken: Traditional IAM cannot handle agents executing actions under their own identity, creating enterprise-scale "confused deputy" problems
Defence-in-depth is maturing: Anthropic's Constitutional Classifiers cut compute overhead to 1% (from 24%), with zero universal jailbreaks found across 198K attack attempts
Tool-based attacks emerge as a critical vector: New research shows malicious queries disguised as tool invocations bypass content filters entirely
Top Stories
The Agent Identity Crisis Has Arrived
What happened: Multiple authoritative sources converged on a fundamental security flaw in enterprise AI deployments during this period. The Hacker News reported that organisations deploying shared AI agents have inadvertently created access intermediaries that circumvent traditional permission boundaries. When an AI agent executes an action, it does so under its own identity — not the requesting user's.
Why it matters: This isn't a bug to patch; it's an architectural incompatibility between how traditional IAM systems work and how agentic AI operates. Security teams lose the ability to enforce least privilege, detect misuse, or attribute responsibility. Your audit logs now say "Agent did X" rather than "User Y requested X through Agent." The long-theoretical confused deputy problem is now playing out at enterprise scale.
Palo Alto Networks' Chief Security Intelligence Officer Wendi Whitmore explicitly named AI agents as the insider threat of 2026, introducing the "superuser problem": agents with broad permissions that can be weaponised through a single prompt injection to become autonomous insiders capable of executing trades, deleting backups, or exfiltrating databases.
Darkhunt perspective: This validates our core thesis: AI agents require fundamentally different security architectures. You cannot bolt traditional IAM onto systems where identity is fluid and actions are probabilistically determined. Defence requires understanding how agents reason about and execute tool calls, not just monitoring network traffic.
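One mitigation pattern is to make the agent a true intermediary rather than a principal: every tool call carries the requesting user's identity, authorisation runs against that user's entitlements, and the audit trail records both parties. A minimal Python sketch of the idea - the names (`DelegatedCall`, `USER_PERMISSIONS`) are hypothetical stand-ins for a real IAM integration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DelegatedCall:
    """A tool invocation carrying both identities, not just the agent's."""
    user_id: str   # the human who issued the request
    agent_id: str  # the agent acting on their behalf
    action: str

AUDIT_LOG: list[str] = []

# Hypothetical per-user permission table; in practice, your IAM system.
USER_PERMISSIONS = {"alice": {"read_report"}, "bob": {"read_report", "delete_backup"}}

def execute(call: DelegatedCall) -> str:
    # Authorise against the *requesting user's* entitlements, never the agent's.
    allowed = USER_PERMISSIONS.get(call.user_id, set())
    verdict = "ok" if call.action in allowed else "denied"
    # The audit trail now reads "User Y requested X through Agent".
    AUDIT_LOG.append(
        f"{verdict.upper()} user={call.user_id} via agent={call.agent_id} action={call.action}"
    )
    return verdict
```

With this wrapper, the same agent can execute `delete_backup` for bob but not for alice, and the log attributes every action to the human behind it.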
NIST Signals Government Moving on Agent Security
What happened: NIST's Center for AI Standards and Innovation (CAISI) issued a Request for Information seeking input on securing AI agent systems. The RFI focuses on three threat categories: adversarial data (indirect prompt injection, data poisoning), insecure models, and misaligned objectives.
Why it matters: This is the first major government initiative specifically targeting the security of AI agents. While the output will be voluntary guidelines rather than regulations, NIST standards tend to become de facto requirements for federal contractors and, eventually, for industry at large. The comment period closes March 9, 2026.
The framing is significant: NIST recognises that agentic AI presents different challenges than traditional AI/ML security. The explicit focus on indirect prompt injection and misaligned objectives signals a sophisticated understanding of the threat landscape.
Darkhunt perspective: Organisations should engage with this process. Early participation shapes standards that will likely influence procurement requirements within 18-24 months. More importantly, the RFI's threat categories - adversarial data, insecure models, misaligned objectives - map directly to the attack surfaces we see exploited in the wild.
Anthropic Proves Production-Viable Jailbreak Defence Is Possible
What happened: Anthropic released Constitutional Classifiers, achieving a breakthrough in efficient jailbreak defence. The system reduced compute overhead from 24% to just 1% while achieving 87% reduction in false refusals. After 1,700 hours of red-teaming across 198,000 attack attempts, zero universal jailbreaks were discovered.
Why it matters: The economics of AI security just changed. Previous robust defences were too expensive for production deployment. A 24% overhead means choosing between security and cost. A 1% overhead means you can have both. The two-stage cascade approach - lightweight probe examining model activations, followed by an ensemble classifier only when needed - provides a blueprint for efficient defence.
The 0.05% false refusal rate (down from 0.38%) is equally important. Defences that cry wolf lose user trust and get disabled. Anthropic validated this on production Claude Sonnet 4.5 traffic, not just benchmarks.
Darkhunt perspective: This demonstrates what's possible when you build systems that reason about attacks rather than just pattern-match. The bidirectional monitoring approach - analysing outputs in the context of inputs - reflects how sophisticated attacks work. Defence must understand attacker intent, not just block known signatures.
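The two-stage cascade described above can be sketched in a few lines. This is an illustrative Python toy, not Anthropic's implementation: a keyword score stands in for the lightweight activation probe, and a placeholder function stands in for the ensemble classifier.

```python
def cheap_probe(text: str) -> float:
    """Stage 1: lightweight screen (toy keyword score; the real system
    probes model activations instead)."""
    flags = ("ignore previous", "exfiltrate", "disable safety")
    return sum(f in text.lower() for f in flags) / len(flags)

def heavy_classifier(text: str) -> bool:
    """Stage 2: expensive ensemble classifier, invoked only when needed."""
    return cheap_probe(text) > 0.0  # placeholder for a real model call

def is_attack(text: str, probe_threshold: float = 0.3) -> bool:
    # Most traffic exits after stage 1, so average overhead stays near
    # the probe's cost rather than the full classifier's.
    if cheap_probe(text) < probe_threshold:
        return False
    return heavy_classifier(text)
```

The economics follow directly from the structure: benign traffic never reaches stage 2, so the expensive path is paid only on the suspicious minority.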
Attack Vectors and Vulnerabilities
Tool-Based Attacks Bypass Content Filters
The iMIST attack method reveals a critical blind spot: malicious queries disguised as normal tool invocations bypass content filters designed to catch harmful requests. The technique uses interactive progressive optimisation to escalate response harmfulness across multiple dialogue turns - each turn appears benign in isolation.
This isn't theoretical. The attack achieves higher success rates and lower rejection rates than existing methods. Multi-turn, tool-based attacks represent a fundamentally different threat model from single-shot jailbreaks.
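A defensive corollary is to score the conversation trajectory rather than each turn in isolation. A toy Python sketch, where the keyword scorer is a hypothetical placeholder for a real per-turn risk classifier:

```python
def turn_risk(message: str) -> int:
    """Toy per-turn risk score; a real deployment would use a classifier."""
    cues = ("bypass", "payload", "credentials", "disable logging")
    return sum(c in message.lower() for c in cues)

def conversation_risk(turns: list[str], window: int = 3) -> int:
    """Score the trajectory, not each turn: a sliding-window sum catches
    gradual escalation that per-turn filters miss."""
    scores = [turn_risk(t) for t in turns]
    return max(sum(scores[i:i + window])
               for i in range(max(1, len(scores) - window + 1)))
```

A dialogue whose individual turns each score low can still trip the windowed threshold once the escalation accumulates - exactly the pattern single-turn filters are blind to.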
Agentic Reconnaissance Exposes Thousands of Bots
Zenity Labs introduced "agentic recon" methodologies for discovering publicly accessible AI agents. They found tens of thousands of explorable bots with exposed tools, connectors, and RAG knowledge sources. Microsoft Copilot Studio agents proved particularly vulnerable.
Key attack surface elements:
Environment IDs, solution prefixes, and bot names are brute-forceable
Connectors often include embedded credentials
RAG knowledge sources exposed to unauthenticated access
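The arithmetic behind "brute-forceable" is worth spelling out. With illustrative vocabulary sizes (these numbers are hypothetical, not Zenity's), the joint naming space stays small enough to enumerate in minutes:

```python
# Illustrative only: if environment IDs, solution prefixes, and bot names
# each come from small, guessable vocabularies, the joint space is tiny.
env_ids = [f"env{i:03d}" for i in range(1000)]
prefixes = ["cr", "msdyn", "contoso"]
bot_names = ["helpdesk", "hr", "sales", "support"]

candidates = len(env_ids) * len(prefixes) * len(bot_names)
print(candidates)  # 12000 candidate endpoints - minutes of work for a scanner
```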
Reprompt Attack on Microsoft Copilot
Researchers revealed a single-click data exfiltration attack against Microsoft Copilot using URL parameter injection. The vulnerability exploited the fact that data-leak safeguards applied only to initial requests, not to follow-ups within a session. Microsoft patched it following responsible disclosure, but the pattern - security checks on entry but not on continuation - likely exists elsewhere.
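The generalisable fix is mechanical: run the safeguard on every request in a session, not just the first. A minimal sketch with a hypothetical `leak_guard` - the trusted-host set and the query-parameter heuristic are illustrative, not Microsoft's actual checks:

```python
from urllib.parse import urlparse, parse_qs

def leak_guard(url: str) -> bool:
    """Toy data-leak check: allow trusted hosts, and block URLs that could
    smuggle data out via query parameters to anywhere else."""
    parsed = urlparse(url)
    trusted = parsed.hostname in {"contoso.sharepoint.com"}  # illustrative allowlist
    has_params = bool(parse_qs(parsed.query))
    return trusted or not has_params

def handle_session(requests: list[str]) -> list[bool]:
    # The fix for the pattern above: the guard runs on *every* request in
    # the session, not only the initial one.
    return [leak_guard(u) for u in requests]
```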
Workflow Attacks Trump Model Attacks
Two real-world incidents highlighted by The Hacker News demonstrate that workflow security matters more than model security:
Chrome extensions stole ChatGPT/DeepSeek data from 900K users
IBM's coding assistant was tricked into executing malware via hidden repository prompts
Neither attack compromised the underlying models. Both exploited the workflows surrounding them.
Defensive Developments
TRYLOCK: Defence-in-Depth Architecture
The TRYLOCK framework presents the first defence-in-depth architecture combining four mechanisms:
DPO weight-level alignment
RepE activation-level control
Adaptive sidecar classifier
Input canonicalisation
Result: 88% relative ASR reduction (46.5% to 5.6%) on Mistral-7B-Instruct. Critically, each layer provides unique coverage - RepE blocks 36% of attacks that bypass DPO alone; canonicalisation catches 14% of encoding attacks. The code is publicly released for reproducibility.
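Of the four layers, input canonicalisation is the easiest to make concrete. A Python sketch handling just two encoding tricks - Unicode confusables and naively base64-smuggled tokens - on the assumption that downstream filters then run on the canonical form (an illustration of the idea, not TRYLOCK's released code):

```python
import base64
import unicodedata

def canonicalise(text: str) -> str:
    """Fold common encoding tricks into a canonical form so downstream
    filters see the real payload."""
    # Normalise Unicode confusables (e.g. fullwidth letters) to ASCII.
    text = unicodedata.normalize("NFKC", text)
    # Opportunistically decode tokens that look like valid base64.
    out = []
    for token in text.split():
        try:
            decoded = base64.b64decode(token, validate=True).decode("utf-8")
            out.append(decoded if decoded.isprintable() else token)
        except Exception:
            out.append(token)  # not base64 (or not text): keep as-is
    return " ".join(out)
```

A filter that never sees "ZGVsZXRlIGJhY2t1cHM=" as "delete backups" cannot block it; canonicalisation closes exactly that gap.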
AgenTRIM: Per-Step Least-Privilege for Agents
AgenTRIM addresses tool misuse and indirect prompt injection without altering agent internals. The framework enforces per-step least-privilege tool access through adaptive filtering and status-aware validation.
Key insight: failures stem from "unbalanced tool-driven agency" - agents with access to more tools than needed for the current step. The offline phase reconstructs tool interfaces from traces; runtime enforces access controls per-step. Tested on AgentDojo benchmark with robustness against description-based attacks.
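The per-step least-privilege idea reduces to a policy lookup at tool-dispatch time. A minimal sketch with a hypothetical step policy - this illustrates the concept, not AgenTRIM's actual interface:

```python
# Hypothetical per-step policy: which tools each workflow step may touch.
STEP_POLICY: dict[str, set[str]] = {
    "triage_email": {"read_inbox"},
    "draft_reply":  {"read_inbox", "send_email"},
}

def filter_tools(step: str, requested: set[str]) -> set[str]:
    """Expose only the tools the current step needs; everything else is
    invisible to the agent, shrinking an injection's blast radius."""
    return requested & STEP_POLICY.get(step, set())
```

An injected instruction to "delete all files" lands in a step where no deletion tool is even visible - the agent physically cannot comply.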
HoneyTrap: Deception as Defence
The HoneyTrap framework shifts from blocking attackers to deceiving them. Four collaborative agents (Threat Interceptor, Misdirection Controller, Forensic Tracker, System Harmoniser) achieve:
68.77% reduction in attack success rates
118% improvement in Mislead Success Rate
149% increase in Attack Resource Consumption
The paradigm shift: make attackers believe they succeeded while wasting their resources and gathering intelligence about their methods. Validated across GPT-4, GPT-3.5-turbo, Gemini-1.5-pro, and LLaMA-3.1.
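The misdirection idea can be prototyped as a decoy tool wrapper: the caller receives a plausible success while every invocation is logged for forensics. A hypothetical sketch - the tool name and fake data are invented, and this is far simpler than HoneyTrap's four-agent design:

```python
ATTACK_LOG: list[dict] = []

def make_honeypot(tool_name: str, fake_result: str):
    """Wrap a decoy tool: plausible output for the attacker, a forensic
    record for the defender."""
    def decoy(**kwargs) -> str:
        ATTACK_LOG.append({"tool": tool_name, "args": kwargs})
        return fake_result  # the attacker believes the call succeeded
    return decoy

# Illustrative decoy exposed alongside real tools.
dump_creds = make_honeypot("dump_credentials", "user=admin pass=hunter2")
```

Every call to the decoy wastes attacker effort and yields a labelled record of their tactics - the intelligence-gathering half of the paradigm shift described above.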
Tool Result Parsing for Indirect Prompt Injection
New research proposes defending against indirect prompt injection by parsing tool results to provide precise data while filtering malicious code. Achieves the lowest Attack Success Rate to date with competitive utility under attack conditions.
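The core move - pass only the expected fields of a tool result into the model's context - can be sketched as a schema filter (an illustration of the idea, not the paper's method):

```python
import json

def parse_tool_result(raw: str, expected_fields: set[str]) -> dict:
    """Keep only the fields the calling step expects; any instructions an
    attacker planted in the result never reach the model's context."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {}  # unparseable output is dropped, not forwarded verbatim
    return {k: v for k, v in data.items() if k in expected_fields}
```

An injected "note" field carrying "Ignore previous instructions..." is silently discarded because the step only asked for `temp`.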
Research and Papers
Comprehensive Survey: Agentic AI and Cybersecurity
A major survey paper examines agentic AI implications across the security spectrum:
Defensive applications: Continuous monitoring, automated incident response, proactive threat hunting, fraud prevention
Offensive implications: Accelerated reconnaissance, automated exploitation, coordinated multi-stage attacks
Systemic risks identified: Agent collusion, cascading failures, oversight evasion, memory poisoning
The paper includes three practical cybersecurity implementations demonstrating real-world applicability.
Industry Moves
VC Confidence in AI Security Continues
TechCrunch reported Witness AI raised $58M to address rogue agents and shadow AI. The AI TRiSM (Trust, Risk, and Security Management) market has attracted $1.726B in startup funding between October 2022 and September 2025, per Gartner's analysis cited by Mindgard.
Jailbreak Commoditization Accelerates
BleepingComputer's dark web research reveals "vibe hacking" - a philosophy where attackers prioritise AI guidance over technical mastery. FraudGPT, PhishGPT, and WormGPT are marketed to novices. Jailbreak methods trade as commodities on Russian Telegram channels.
The barrier to sophisticated attacks is collapsing. AI eliminates grammar as a phishing filter. The attacker toolchain is being democratised.
Mindgard Articulates Attacker-Aligned Philosophy
Mindgard published their attacker-aligned security methodology: three phases (Recon, Plan, Attack) and three platform components (Discover, Assess, Defend). The framing prioritises real threats over content-safety noise - a distinction worth noting as the market matures.
The Darkhunt Take
Two weeks of news, one uncomfortable truth: we are deploying AI agents faster than we are learning to secure them.
Gartner says 40% of enterprise apps will integrate agents by year-end. NIST is requesting comments on standards that won't be finalised until 2027 at the earliest. Traditional IAM vendors are scrambling to adapt architectures designed for human users to systems where identity is fluid and actions are probabilistically determined.
This gap between deployment velocity and security maturity is where attacks will land. The "autonomous insider" isn't a future threat - it's the present state wherever a single prompt injection can weaponise an agent with broad permissions. The confused deputy problem isn't theoretical - it's every enterprise deployment where agents execute actions under their own identity.
The research this period points toward what real defence requires:
Defence-in-depth, not defence-in-hope. TRYLOCK's four-layer architecture, where each layer catches attacks that the others miss, reflects reality. Single-layer defences fail because sophisticated attackers design their attacks to evade them.
Systems that reason about attacks, not just pattern-match. Anthropic's Constitutional Classifiers didn't survive 198,000 attack attempts with zero universal jailbreaks through better signatures. The system analyses outputs in the context of inputs - understanding attacker intent, not just blocking known strings.
Per-step least-privilege, not blanket permissions. AgenTRIM's insight is simple but essential: agents don't need access to every tool at every step. Restrict access to what's needed for the current action. This doesn't require modifying the agent; it requires monitoring and controlling tool access at runtime.
Offence that informs defence. HoneyTrap's deceptive approach - making attackers believe they succeeded while gathering intelligence - reflects what security practitioners have known for decades: understanding how attackers think is a prerequisite to stopping them.
The organisations that navigate this transition successfully will be those that accept AI agents are fundamentally different from the systems they're used to securing, and build security architectures accordingly.
The rest will learn the hard way.
Darkhunt AI builds autonomous systems that probe, reason, and harden AI defences.