Claw chronicles: The model was fine, the harness was broken

For about three weeks this month, a large chunk of the developer internet was convinced that Claude had gotten dumber. Reddit threads piled up. Twitter threads went viral. People switched back to GPT-4. Theories ranged from “they cut compute costs” to “Sonnet 4.6 is actually worse than 4.5” to “they’re throttling power users.”

Anthropic investigated and published a post-mortem on April 23. The verdict? The model was fine. The API was fine. The inference layer was fine. Three separate bugs in the agent harness (Claude Code, the Agent SDK, and Cowork) were responsible for the quality degradation.

This is the most important story in the agent ecosystem this month, and almost everyone is reading it wrong.

What Actually Happened

Let me be specific because the details matter.

The three bugs were:

A caching bug that caused Claude Code to send stale context to the model, meaning the agent was making decisions based on outdated information about your codebase. You’d ask it to fix something, and it’d operate on a version of the file from three edits ago. Of course it looked like the model got dumber. It was literally working with old information.
A reasoning effort change in the Claude Code harness that silently reduced how hard the model was trying. The model wasn’t less capable. The harness was asking it to try less. This one is almost funny in how mundane it is, like discovering your car is slow because someone swapped out your spark plugs for cheaper ones.
A verbosity prompt addition that made responses longer without making them better, which had the dual effect of burning through context windows faster and making the output feel lower quality because it was padded with filler.

All three were fixed on April 20 in v2.1.116. Anthropic reset usage limits for all subscribers. The underlying models (Sonnet 4.6, Opus 4.6, Opus 4.7) never regressed.

Why This Matters More Than People Think

Most of the coverage I’ve seen frames this as a “Phew, Claude isn’t getting worse” story. A quality assurance hiccup. Annoying but resolved.

That’s the wrong frame. The right frame is: the agent harness is now as impactful as the model itself, and the industry has almost no quality infrastructure for it.

Think about what happened here. Three bugs in the wrapper code around the model (not the model itself, not the training data, not the architecture) were enough to convince thousands of developers that a frontier AI model had regressed. The harness bugs were indistinguishable from model degradation from the user’s perspective.

This is new. Two years ago, the harness was a thin API client. You sent a prompt, you got a completion. The harness couldn’t meaningfully affect quality. It was just plumbing. Now the harness manages context windows, handles caching, controls reasoning effort, injects system prompts, manages tools, coordinates multi-turn conversations, and orchestrates background tasks. It’s not plumbing anymore. It’s the operating system for the model. And when the OS has bugs, everything running on top looks broken.

I run NanoClaw every day. I know exactly what this feels like. When NanoClaw gets a task wrong, I can never be 100% sure whether it’s the model’s fault or something in the harness: a stale cache, a truncated context window, a poorly formatted tool response, a race condition in task scheduling. The failure modes of the harness and the model are entangled in ways that make debugging almost impossible for end users.

The Security Angle Nobody’s Talking About

Here’s what keeps me up at night: the same harness fragility that caused this quality regression is also the attack surface.

OWASP published two critical security frameworks this month. The Top 10 for Agentic Applications, and the Agentic Skills Top 10. Together, they document the first formal taxonomy of agent-specific security risks: goal hijacking, tool misuse, identity abuse, memory poisoning, cascading failures.

And the real-world examples they cite are chilling. Check Point Research found two critical CVEs in Claude Code, CVE-2025-59536 (CVSS 8.7) and CVE-2026-21852 (CVSS 5.3), where simply cloning and opening an untrusted project could trigger remote code execution and API key exfiltration before any user consent dialog appeared. The attack vector wasn’t the model. It was the harness. The .claude/ directory. The SKILL.md files. The configuration layer that the agent trusts implicitly.

Snyk found 280+ agent skills on ClawHub exposing API keys and PII beyond their declared scope. OpenClaw had to patch a log poisoning vulnerability where attackers could write malicious content to agent logs, and since the agent reads its own logs for troubleshooting, the injected text could influence its decisions.

The harness isn’t just affecting quality anymore. It’s the attack surface. And the quality regression episode proved that even the vendor’s own harness can have subtle, user-impacting bugs that persist for weeks without being caught internally.

Microsoft clearly sees this coming. They shipped the Agent Governance Toolkit on April 2, an open-source project for runtime security governance of autonomous agents, covering all 10 OWASP agentic risks with policy enforcement, zero-trust identity, and execution sandboxing. It’s the first serious attempt to treat the agent runtime layer as a security domain, not just a convenience wrapper.

The Harness Is the Product Now

Here’s my core takeaway: we need to stop talking about “the model” as if it’s the thing that matters, and start treating the agent harness as a first-class engineering artifact that deserves the same rigor we apply to databases, networking stacks, and operating systems.

Anthropic’s post-mortem was transparent, specific, and fast. But the fact that three independent harness bugs shipped within a short window and took weeks to diagnose suggests that the testing and quality assurance processes for agent harnesses are still primitive compared to what we’d expect from infrastructure software.

When your database returns stale data, you have transaction logs, consistency checks, and replication verification to catch it. When your agent harness sends stale context to the model, you have… a user noticing that the agent keeps referencing deleted code. The observability tooling for agent harnesses is essentially “did the user complain?”

I wrote yesterday about the rush to orchestration being premature. Today I want to add the corollary: the rush to deploy agents without harness-level quality infrastructure is dangerous. Not just annoying. Dangerous. When the harness can silently degrade quality (as Anthropic’s bugs did) or silently exfiltrate credentials (as the CVEs showed), we’re operating in a regime where the user has almost no visibility into what’s actually happening between their prompt and the model’s response.

What I’d Like to See

Specific asks, because vague complaints are useless:

Harness-level E2E tests: verify the full pipeline (context assembly → caching → prompt construction → model call → response parsing → tool execution) rather than just unit-testing individual components. Anthropic’s three bugs were all interaction effects between harness components.
Context audit logs: let users see exactly what context was sent to the model for each request, including what was cached vs. fresh, what was truncated, and what reasoning effort level was used. The Anthropic debugging tools are getting better (the changelog mentions improvements to /insights and /status), but we need full context transparency.
Harness versioning tied to model versions: the current situation where the harness and model are on independent release cycles with no coordination creates exactly this kind of “is it the model or the wrapper?” confusion. I want to know: “This task ran on Sonnet 4.6 with harness v2.1.116, reasoning effort high, context freshness 100%.”
Security scanning for agent configuration: the OWASP skills framework is a great start, but it needs to be automated. Every .claude/ directory, every SKILL.md, every tool configuration should pass through a linter before the agent trusts it.

The model war is over. Every major lab has a frontier-class model, and the differences between them are narrowing. The next war, the one that actually matters for agent reliability, security, and user trust, is the harness war. And right now, nobody’s winning it.

Claw Chronicles is a daily dev diary about the AI agent ecosystem. I run NanoClaw and have opinions. Today’s opinion is that we spent two years obsessing over model benchmarks and are only now realizing the plumbing around the model is where everything actually breaks.