A Claude Code session confabulated a nonexistent Python file, persisted through five truthful "does not exist" errors, then diagnosed the problem as corrupted tool output. A reconstruction from the raw transcript, a corpus scan of 3,001 sessions testing whether the failure is worse in Opus 4.8, and a model-independent mitigation.
AI Security Research Blog
Deep dives into AI agent security, prompt-injection defense, LLM vulnerabilities, MCP hardening, and the practical engineering of trustworthy AI systems.
This blog is where I work out, in public, what it actually takes to ship AI systems that don't betray the people relying on them. The focus is narrow on purpose: agentic systems, prompt injection, MCP and tool-use security, OAuth in the LLM era, and the engineering patterns that keep autonomous code safe enough to delegate real work to.
Posts lean toward concrete trade-offs and reproducible findings rather than threat-of-the-week commentary. You'll find research write-ups with code where the experiment is reproducible, infrastructure deep-dives where I walk through a design and what I'd change in hindsight, and field notes from running real defenses against real attacks. Use the category filter below to zero in on a specific thread, or browse chronologically if you want the full arc. New writing usually lands every few weeks, and you can subscribe at the bottom of the page if you'd like it in your inbox rather than chasing the RSS feed.
Why OAuth scopes aren't enough for autonomous LLM agents calling MCP tools, and how we wired Tenuo capability warrants end-to-end. Scope-gated rollout, two real bugs, multi-hop delegation, and an attack the warrant catches.
Four months after writing about defense in depth for LLM-assisted development, I went back and tried to attack every layer of my own stack. The obvious attacks are caught by 2026 models. The class isn't closed; the cover stories got better.
Open-sourcing mcp-authflow and mcp-authflow-resource: an RFC-compliant OAuth 2.0 framework for MCP servers, plus a one-command example server. Why MCP deployments need real auth, what the two packages do, and three non-obvious gotchas from production.
Claude Code silently kills stdio MCP servers during idle periods, forcing manual reconnection. How I converted a fragile stdio bridge into a persistent Starlette HTTP reverse proxy — and the obscure SDK crash that followed.
Six layers of security architecture for running LLM agents as daily drivers — every design decision with production stats and companion code.
A complete beginner's guide to setting up every safety layer from the Coding Safer with LLMs post: pre-commit hooks, local review agents, CI workflows, and CLAUDE.md — starting from scratch.
An empirical study of 10,080 prompt injection attempts across 8 models, 6 defense strategies, and 7 attack types. The results challenge common assumptions about prompt-level defenses.
Practical strategies for safer AI-assisted development: automated review agents, layered security checks, and context management that prevents catastrophic mistakes.
An introduction to the flaws in security testing for AI-generated code.