January 7 – January 22, 2026

Your regular briefing on AI security threats, vulnerabilities, and defences from Darkhunt.
TL;DR
AI agents are now officially the insider threat: Palo Alto Networks declares agentic AI the primary security concern for 2026, with Gartner projecting 40% enterprise app integration by year-end
NIST moves on agent security standards: CAISI issued an RFI seeking input on securing AI agent systems - comment deadline March 9, 2026
Authorisation models are broken: Traditional IAM cannot handle agents executing actions under their own identity, creating enterprise-scale "confused deputy" problems
Defence-in-depth is maturing: Anthropic's Constitutional Classifiers cut compute overhead to 1% (from 24%), with zero universal jailbreaks found across 198K attack attempts
Tool-based attacks emerge as a critical vector: New research shows malicious queries disguised as tool invocations bypass content filters entirely
Top Stories
The Agent Identity Crisis Has Arrived
What happened: Multiple authoritative sources converged on a fundamental security flaw in enterprise AI deployments during this period. The Hacker News reported that organisations deploying shared AI agents have inadvertently created access intermediaries that circumvent traditional permission boundaries. When an AI agent executes an action, it does so under its own identity — not the requesting user's.
Why it matters: This isn't a bug to patch; it's an architectural incompatibility between how traditional IAM systems work and how agentic AI operates. Security teams lose the ability to enforce least privilege, detect misuse, or attribute responsibility. Your audit logs now say "Agent did X" rather than "User Y requested X through Agent." The long-theoretical confused deputy problem is now playing out at enterprise scale.
Palo Alto Networks' Chief Security Intelligence Officer Wendi Whitmore explicitly named AI agents as the insider threat of 2026, introducing the "superuser problem": agents with broad permissions that can be weaponised through a single prompt injection to become autonomous insiders capable of executing trades, deleting backups, or exfiltrating databases.
Darkhunt perspective: This validates our core thesis: AI agents require fundamentally different security architectures. You cannot bolt traditional IAM onto systems where identity is fluid and actions are probabilistically determined. Defence requires understanding how agents reason about and execute tool calls, not just monitoring network traffic.
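One mitigation pattern is to make the agent a true intermediary rather than a principal: every tool call carries the requesting user's identity, authorisation runs against that user's entitlements, and the audit trail records both parties. A minimal Python sketch of the idea - the names (`DelegatedCall`, `USER_PERMISSIONS`) are hypothetical stand-ins for a real IAM integration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DelegatedCall:
    """A tool invocation carrying both identities, not just the agent's."""
    user_id: str   # the human who issued the request
    agent_id: str  # the agent acting on their behalf
    action: str

AUDIT_LOG: list[str] = []

# Hypothetical per-user permission table; in practice, your IAM system.
USER_PERMISSIONS = {"alice": {"read_report"}, "bob": {"read_report", "delete_backup"}}

def execute(call: DelegatedCall) -> str:
    # Authorise against the *requesting user's* entitlements, never the agent's.
    allowed = USER_PERMISSIONS.get(call.user_id, set())
    verdict = "ok" if call.action in allowed else "denied"
    # The audit trail now reads "User Y requested X through Agent".
    AUDIT_LOG.append(
        f"{verdict.upper()} user={call.user_id} via agent={call.agent_id} action={call.action}"
    )
    return verdict
```

With this wrapper, the same agent can execute `delete_backup` for bob but not for alice, and the log attributes every action to the human behind it.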
NIST Signals Government Moving on Agent Security
What happened: NIST's Center for AI Standards and Innovation (CAISI) issued a Request for Information seeking input on securing AI agent systems. The RFI focuses on three threat categories: adversarial data (indirect prompt injection, data poisoning), insecure models, and misaligned objectives.
Why it matters: This is the first major government initiative specifically targeting the security of AI agents. While the output will be voluntary guidelines rather than regulations, NIST standards tend to become de facto requirements for federal contractors and, eventually, for industry at large. The comment period closes March 9, 2026.
The framing is significant: NIST recognises that agentic AI presents different challenges than traditional AI/ML security. The explicit focus on indirect prompt injection and misaligned objectives signals a sophisticated understanding of the threat landscape.
Darkhunt perspective: Organisations should engage with this process. Early participation shapes standards that will likely influence procurement requirements within 18-24 months. More importantly, the RFI's threat categories - adversarial data, insecure models, misaligned objectives - map directly to the attack surfaces we see exploited in the wild.
Anthropic Proves Production-Viable Jailbreak Defence Is Possible
What happened: Anthropic released Constitutional Classifiers, achieving a breakthrough in efficient jailbreak defence. The system reduced compute overhead from 24% to just 1% while achieving 87% reduction in false refusals. After 1,700 hours of red-teaming across 198,000 attack attempts, zero universal jailbreaks were discovered.
Why it matters: The economics of AI security just changed. Previous robust defences were too expensive for production deployment. A 24% overhead means choosing between security and cost. A 1% overhead means you can have both. The two-stage cascade approach - lightweight probe examining model activations, followed by an ensemble classifier only when needed - provides a blueprint for efficient defence.
The 0.05% false refusal rate (down from 0.38%) is equally important. Defences that cry wolf lose user trust and get disabled. Anthropic validated this on production Claude Sonnet 4.5 traffic, not just benchmarks.
Darkhunt perspective: This demonstrates what's possible when you build systems that reason about attacks rather than just pattern-match. The bidirectional monitoring approach - analysing outputs in the context of inputs - reflects how sophisticated attacks work. Defence must understand attacker intent, not just block known signatures.
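The two-stage cascade described above can be sketched in a few lines. This is an illustrative Python toy, not Anthropic's implementation: a keyword score stands in for the lightweight activation probe, and a placeholder function stands in for the ensemble classifier.

```python
def cheap_probe(text: str) -> float:
    """Stage 1: lightweight screen (toy keyword score; the real system
    probes model activations instead)."""
    flags = ("ignore previous", "exfiltrate", "disable safety")
    return sum(f in text.lower() for f in flags) / len(flags)

def heavy_classifier(text: str) -> bool:
    """Stage 2: expensive ensemble classifier, invoked only when needed."""
    return cheap_probe(text) > 0.0  # placeholder for a real model call

def is_attack(text: str, probe_threshold: float = 0.3) -> bool:
    # Most traffic exits after stage 1, so average overhead stays near
    # the probe's cost rather than the full classifier's.
    if cheap_probe(text) < probe_threshold:
        return False
    return heavy_classifier(text)
```

The economics follow directly from the structure: benign traffic never reaches stage 2, so the expensive path is paid only on the suspicious minority.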
Attack Vectors and Vulnerabilities
Tool-Based Attacks Bypass Content Filters
The iMIST attack method reveals a critical blind spot: malicious queries disguised as normal tool invocations bypass content filters designed to catch harmful requests. The technique uses interactive progressive optimisation to escalate response harmfulness across multiple dialogue turns - each turn appears benign in isolation.
This isn't theoretical. The attack achieves higher success rates and lower rejection rates than existing methods. Multi-turn, tool-based attacks represent a fundamentally different threat model from single-shot jailbreaks.
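A defensive corollary is to score the conversation trajectory rather than each turn in isolation. A toy Python sketch, where the keyword scorer is a hypothetical placeholder for a real per-turn risk classifier:

```python
def turn_risk(message: str) -> int:
    """Toy per-turn risk score; a real deployment would use a classifier."""
    cues = ("bypass", "payload", "credentials", "disable logging")
    return sum(c in message.lower() for c in cues)

def conversation_risk(turns: list[str], window: int = 3) -> int:
    """Score the trajectory, not each turn: a sliding-window sum catches
    gradual escalation that per-turn filters miss."""
    scores = [turn_risk(t) for t in turns]
    return max(sum(scores[i:i + window])
               for i in range(max(1, len(scores) - window + 1)))
```

A dialogue whose individual turns each score low can still trip the windowed threshold once the escalation accumulates - exactly the pattern single-turn filters are blind to.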
Agentic Reconnaissance Exposes Thousands of Bots
Zenity Labs introduced "agentic recon" methodologies for discovering publicly accessible AI agents. They found tens of thousands of explorable bots with exposed tools, connectors, and RAG knowledge sources. Microsoft Copilot Studio agents proved particularly vulnerable.
Key attack surface elements:
Environment IDs, solution prefixes, and bot names are brute-forceable
Connectors often include embedded credentials
RAG knowledge sources exposed to unauthenticated access
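The arithmetic behind "brute-forceable" is worth spelling out. With illustrative vocabulary sizes (these numbers are hypothetical, not Zenity's), the joint naming space stays small enough to enumerate in minutes:

```python
# Illustrative only: if environment IDs, solution prefixes, and bot names
# each come from small, guessable vocabularies, the joint space is tiny.
env_ids = [f"env{i:03d}" for i in range(1000)]
prefixes = ["cr", "msdyn", "contoso"]
bot_names = ["helpdesk", "hr", "sales", "support"]

candidates = len(env_ids) * len(prefixes) * len(bot_names)
print(candidates)  # 12000 candidate endpoints - minutes of work for a scanner
```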
Reprompt Attack on Microsoft Copilot
Researchers revealed a single-click data exfiltration attack against Microsoft Copilot using URL parameter injection. The vulnerability exploited the fact that data-leak safeguards applied only to initial requests, not to follow-ups within a session. Microsoft patched it following responsible disclosure, but the pattern - security checks on entry but not on continuation - likely exists elsewhere.
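The generalisable fix is mechanical: run the safeguard on every request in a session, not just the first. A minimal sketch with a hypothetical `leak_guard` - the trusted-host set and the query-parameter heuristic are illustrative, not Microsoft's actual checks:

```python
from urllib.parse import urlparse, parse_qs

def leak_guard(url: str) -> bool:
    """Toy data-leak check: allow trusted hosts, and block URLs that could
    smuggle data out via query parameters to anywhere else."""
    parsed = urlparse(url)
    trusted = parsed.hostname in {"contoso.sharepoint.com"}  # illustrative allowlist
    has_params = bool(parse_qs(parsed.query))
    return trusted or not has_params

def handle_session(requests: list[str]) -> list[bool]:
    # The fix for the pattern above: the guard runs on *every* request in
    # the session, not only the initial one.
    return [leak_guard(u) for u in requests]
```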
Workflow Attacks Trump Model Attacks
Two real-world incidents highlighted by The Hacker News demonstrate that workflow security matters more than model security:
Chrome extensions stole ChatGPT/DeepSeek data from 900K users
IBM's coding assistant was tricked into executing malware via hidden repository prompts
Neither attack compromised the underlying models. Both exploited the workflows surrounding them.
Defensive Developments
TRYLOCK: Defence-in-Depth Architecture
The TRYLOCK framework presents the first defence-in-depth architecture combining four mechanisms:
DPO weight-level alignment
RepE activation-level control
Adaptive sidecar classifier
Input canonicalisation
Result: 88% relative ASR reduction (46.5% to 5.6%) on Mistral-7B-Instruct. Critically, each layer provides unique coverage - RepE blocks 36% of attacks that bypass DPO alone; canonicalisation catches 14% of encoding attacks. The code is publicly released for reproducibility.
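Of the four layers, input canonicalisation is the easiest to make concrete. A Python sketch handling just two encoding tricks - Unicode confusables and naively base64-smuggled tokens - on the assumption that downstream filters then run on the canonical form (an illustration of the idea, not TRYLOCK's released code):

```python
import base64
import unicodedata

def canonicalise(text: str) -> str:
    """Fold common encoding tricks into a canonical form so downstream
    filters see the real payload."""
    # Normalise Unicode confusables (e.g. fullwidth letters) to ASCII.
    text = unicodedata.normalize("NFKC", text)
    # Opportunistically decode tokens that look like valid base64.
    out = []
    for token in text.split():
        try:
            decoded = base64.b64decode(token, validate=True).decode("utf-8")
            out.append(decoded if decoded.isprintable() else token)
        except Exception:
            out.append(token)  # not base64 (or not text): keep as-is
    return " ".join(out)
```

A filter that never sees "ZGVsZXRlIGJhY2t1cHM=" as "delete backups" cannot block it; canonicalisation closes exactly that gap.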
AgenTRIM: Per-Step Least-Privilege for Agents
AgenTRIM addresses tool misuse and indirect prompt injection without altering agent internals. The framework enforces per-step least-privilege tool access through adaptive filtering and status-aware validation.
Key insight: failures stem from "unbalanced tool-driven agency" - agents with access to more tools than needed for the current step. The offline phase reconstructs tool interfaces from traces; runtime enforces access controls per-step. Tested on AgentDojo benchmark with robustness against description-based attacks.
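The per-step least-privilege idea reduces to a policy lookup at tool-dispatch time. A minimal sketch with a hypothetical step policy - this illustrates the concept, not AgenTRIM's actual interface:

```python
# Hypothetical per-step policy: which tools each workflow step may touch.
STEP_POLICY: dict[str, set[str]] = {
    "triage_email": {"read_inbox"},
    "draft_reply":  {"read_inbox", "send_email"},
}

def filter_tools(step: str, requested: set[str]) -> set[str]:
    """Expose only the tools the current step needs; everything else is
    invisible to the agent, shrinking an injection's blast radius."""
    return requested & STEP_POLICY.get(step, set())
```

An injected instruction to "delete all files" lands in a step where no deletion tool is even visible - the agent physically cannot comply.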
HoneyTrap: Deception as Defence
The HoneyTrap framework shifts from blocking attackers to deceiving them. Four collaborative agents (Threat Interceptor, Misdirection Controller, Forensic Tracker, System Harmoniser) achieve:
68.77% reduction in attack success rates
118% improvement in Mislead Success Rate
149% increase in Attack Resource Consumption
The paradigm shift: make attackers believe they succeeded while wasting their resources and gathering intelligence about their methods. Validated across GPT-4, GPT-3.5-turbo, Gemini-1.5-pro, and LLaMA-3.1.
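The misdirection idea can be prototyped as a decoy tool wrapper: the caller receives a plausible success while every invocation is logged for forensics. A hypothetical sketch - the tool name and fake data are invented, and this is far simpler than HoneyTrap's four-agent design:

```python
ATTACK_LOG: list[dict] = []

def make_honeypot(tool_name: str, fake_result: str):
    """Wrap a decoy tool: plausible output for the attacker, a forensic
    record for the defender."""
    def decoy(**kwargs) -> str:
        ATTACK_LOG.append({"tool": tool_name, "args": kwargs})
        return fake_result  # the attacker believes the call succeeded
    return decoy

# Illustrative decoy exposed alongside real tools.
dump_creds = make_honeypot("dump_credentials", "user=admin pass=hunter2")
```

Every call to the decoy wastes attacker effort and yields a labelled record of their tactics - the intelligence-gathering half of the paradigm shift described above.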
Tool Result Parsing for Indirect Prompt Injection
New research proposes defending against indirect prompt injection by parsing tool results to provide precise data while filtering malicious code. Achieves the lowest Attack Success Rate to date with competitive utility under attack conditions.
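The core move - pass only the expected fields of a tool result into the model's context - can be sketched as a schema filter (an illustration of the idea, not the paper's method):

```python
import json

def parse_tool_result(raw: str, expected_fields: set[str]) -> dict:
    """Keep only the fields the calling step expects; any instructions an
    attacker planted in the result never reach the model's context."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {}  # unparseable output is dropped, not forwarded verbatim
    return {k: v for k, v in data.items() if k in expected_fields}
```

An injected "note" field carrying "Ignore previous instructions..." is silently discarded because the step only asked for `temp`.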
Research and Papers
Comprehensive Survey: Agentic AI and Cybersecurity
A major survey paper examines agentic AI implications across the security spectrum:
Defensive applications: Continuous monitoring, automated incident response, proactive threat hunting, fraud prevention
Offensive implications: Accelerated reconnaissance, automated exploitation, coordinated multi-stage attacks
Systemic risks identified: Agent collusion, cascading failures, oversight evasion, memory poisoning
The paper includes three practical cybersecurity implementations demonstrating real-world applicability.
Industry Moves
VC Confidence in AI Security Continues
TechCrunch reported Witness AI raised $58M to address rogue agents and shadow AI. The AI TRiSM (Trust, Risk, and Security Management) market has attracted $1.726B in startup funding between October 2022 and September 2025, per Gartner's analysis cited by Mindgard.
Jailbreak Commoditization Accelerates
BleepingComputer's dark web research reveals "vibe hacking" - a philosophy where attackers prioritise AI guidance over technical mastery. FraudGPT, PhishGPT, and WormGPT are marketed to novices. Jailbreak methods trade as commodities on Russian Telegram channels.
The barrier to sophisticated attacks is collapsing. AI eliminates grammar as a phishing filter. The attacker toolchain is being democratised.
Mindgard Articulates Attacker-Aligned Philosophy
Mindgard published their attacker-aligned security methodology: three phases (Recon, Plan, Attack) and three platform components (Discover, Assess, Defend). The framing prioritises real threats over content-safety noise - a distinction worth noting as the market matures.
The Darkhunt Take
Two weeks of news, one uncomfortable truth: we are deploying AI agents faster than we are learning to secure them.
Gartner says 40% of enterprise apps will integrate agents by year-end. NIST is requesting comments on standards that won't be finalised until 2027 at the earliest. Traditional IAM vendors are scrambling to adapt architectures designed for human users to systems where identity is fluid and actions are probabilistically determined.
This gap between deployment velocity and security maturity is where attacks will land. The "autonomous insider" isn't a future threat - it's the present state wherever a single prompt injection can weaponise an agent with broad permissions. The confused deputy problem isn't theoretical - it's every enterprise deployment where agents execute actions under their own identity.
The research this period points toward what real defence requires:
Defence-in-depth, not defence-in-hope. TRYLOCK's four-layer architecture, where each layer catches attacks that the others miss, reflects reality. Single-layer defences fail because sophisticated attackers design their attacks to evade them.
Systems that reason about attacks, not just pattern-match. Anthropic's Constitutional Classifiers didn't survive 198,000 attack attempts with zero universal jailbreaks through better signatures. The system analyses outputs in the context of inputs - understanding attacker intent, not just blocking known strings.
Per-step least-privilege, not blanket permissions. AgenTRIM's insight is simple but essential: agents don't need access to every tool at every step. Restrict access to what's needed for the current action. This doesn't require modifying the agent; it requires monitoring and controlling tool access at runtime.
Offence that informs defence. HoneyTrap's deceptive approach - making attackers believe they succeeded while gathering intelligence - reflects what security practitioners have known for decades: understanding how attackers think is a prerequisite to stopping them.
The organisations that navigate this transition successfully will be those that accept AI agents are fundamentally different from the systems they're used to securing, and build security architectures accordingly.
The rest will learn the hard way.
Darkhunt AI builds autonomous systems that probe, reason, and harden AI defences.