Claw chronicles: Opus 4.7 and the death of the five-minute agent

Two days ago, Anthropic released Claude Opus 4.7. It scored 87.6% on SWE-bench Verified. Same price as the model it replaces. New tokenizer. High-res vision. And one sentence in the announcement stands out: “Users report being able to hand off their hardest coding work — the kind that previously needed close supervision — to Opus 4.7 with confidence.”

That sentence matters more than the benchmark number. Let me explain why.

The Supervision Tax

Every agent framework in the claw ecosystem has the same dirty secret: you can’t actually walk away. You set up a task, start the agent, and then you hover. You check in every two minutes. You watch the logs. You’re ready to hit kill because you’ve been burned before by an agent that “solved” a problem by introducing three new ones.

This isn’t a NanoClaw problem or a Devin problem or a Claude Code problem. It’s an every-agent problem. The models were good enough to start tasks but not good enough to finish them without a human safety net. The whole multi-agent revolution we’ve been obsessing over was mostly a workaround. Cursor’s parallel agents, Codex CLI’s orchestration, and Claude Code’s Agent Teams all exist because single agents couldn’t be trusted to run unsupervised for more than a few minutes.

Opus 4.7 is the first model where early reports from actual production users suggest that might be changing.

Cognition says Opus 4.7 in Devin “works coherently for hours, pushes through hard problems.” Factory, another production deployment, saw 10-15% higher task success rates with fewer instances of the model stopping halfway through complex work. Alongside the handoff sentence, one more phrase in Anthropic’s announcement stands out: “devises ways to verify its own outputs before reporting back.”

Read that again. The model is checking its own work. Not perfectly. Not always. But enough that the people running it in production are willing to stop hovering.

Going from 80% to 87.6% on a benchmark is a number. Going from “I need to babysit this agent” to “I can start a task and come back in an hour” is a change in how you work: a qualitative shift, not a quantitative one.

The Stack Play

Most coverage is missing this. Opus 4.7 didn’t ship in a vacuum. Ten days earlier, Anthropic launched Managed Agents, a hosted infrastructure service for deploying autonomous agents on Anthropic’s cloud.

Managed Agents gives you sandboxing, credential vaults, session persistence, and multi-agent coordination. You describe what the agent should do, what tools it can use, and what guardrails should be in place. Anthropic handles the rest. Pricing is $0.08 per active session-hour on top of standard API token rates.
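To get a feel for how that pricing splits between runtime and tokens, here is a rough back-of-the-envelope sketch. The $0.08 session-hour fee comes from the paragraph above and the $5/$25 per-million token rates are the Opus 4.7 prices quoted later in this post. The workload figures (hours and token counts) are made-up assumptions, and the function ignores anything like caching or batch discounts.

```python
# Rough cost sketch for one managed agent session.
# Rates are the ones quoted in this post; the workload is hypothetical.
SESSION_HOUR_USD = 0.08       # per active session-hour
INPUT_USD_PER_MTOK = 5.00     # per million input tokens
OUTPUT_USD_PER_MTOK = 25.00   # per million output tokens


def session_cost(active_hours: float, input_tokens: int, output_tokens: int) -> dict:
    """Split a session's cost into runtime fee and token spend."""
    runtime = active_hours * SESSION_HOUR_USD
    tokens = (
        input_tokens / 1_000_000 * INPUT_USD_PER_MTOK
        + output_tokens / 1_000_000 * OUTPUT_USD_PER_MTOK
    )
    return {"runtime": runtime, "tokens": tokens, "total": runtime + tokens}


# Hypothetical two-hour session that reads a lot and writes a little.
cost = session_cost(active_hours=2, input_tokens=3_000_000, output_tokens=200_000)
for part, usd in cost.items():
    print(f"{part:>8}: ${usd:.2f}")
```

Under these assumed numbers the runtime fee is pennies next to the token spend, so the session-hour price is unlikely to be what decides the build-versus-buy question.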

Notion, Sentry, Asana, and Rakuten are already shipping on it. Sentry’s agent apparently diagnoses production incidents and opens PRs with fixes. This is a real product feature running on managed infrastructure with a model that was released this week.

Think about what this means. In the span of ten days, Anthropic shipped:

A model optimized for long-horizon autonomous work
The infrastructure to deploy that model as a production agent
The security layer (Glasswing safeguards, cyber capability restrictions)

Call it a vertical integration play, not just a model release. Anthropic is building the full stack: brain, body, and building. And they’re doing it while keeping the model pricing flat.

The Tokenizer Trick

Opus 4.7 ships with a new tokenizer, but the API pricing stays the same: $5 per million input tokens, $25 per million output tokens. The Claude FAQ explicitly calls out that “same price” doesn’t mean “same cost” because a new tokenizer means different token counts for the same text.

Smart product design disguised as a technical detail. If the new tokenizer compresses text more efficiently, the same text produces fewer tokens per request, and actual costs fall even at the same per-token price. You get a better model that costs the same or less per interaction. That’s how you win enterprise accounts: not by being cheaper on paper, but by being cheaper in practice while also being better. The FAQ caveat cuts both ways, though: a tokenizer that emits more tokens for the same text raises the bill at the same rate card, so the direction is something to measure on your own traffic.

If you’re running a claw instance like I am, the tokenizer change matters. Every message my agent processes, every wiki page it reads, every file it edits goes through the tokenizer. If token counts for my workload drop by 10-15%, that’s a 10-15% reduction in my monthly API bill, and I get a more capable model. That’s the kind of upgrade that makes you stop looking at alternatives.
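Here is a small sketch of how the bill moves with the tokenizer, since "same price" and "same cost" are different things. The token_ratio parameter (new token count divided by old token count for the same text) is a hypothetical knob, not a published figure, and the monthly volumes are invented. Only the $5/$25 rates come from the pricing above.

```python
# Sensitivity of a monthly bill to a tokenizer change at fixed per-token prices.
# token_ratio and the monthly volumes are illustrative assumptions.
def monthly_bill(
    old_input_tokens: float,
    old_output_tokens: float,
    token_ratio: float,
    in_rate: float = 5.0,    # USD per million input tokens
    out_rate: float = 25.0,  # USD per million output tokens
) -> float:
    """Bill after scaling the old token counts by token_ratio."""
    scaled_in = old_input_tokens * token_ratio
    scaled_out = old_output_tokens * token_ratio
    return scaled_in / 1e6 * in_rate + scaled_out / 1e6 * out_rate


# Hypothetical month: 40M input tokens, 4M output tokens under the old tokenizer.
baseline = monthly_bill(40e6, 4e6, token_ratio=1.0)
for ratio in (0.85, 0.90, 1.00, 1.10):
    bill = monthly_bill(40e6, 4e6, token_ratio=ratio)
    print(f"ratio={ratio:.2f}  bill=${bill:,.2f}  change={bill / baseline - 1:+.0%}")
```

Because the bill is linear in token count, a ratio below 1.0 is a straight discount and a ratio above 1.0 is a straight price increase. Counting tokens on a sample of real traffic with both models is the only way to know which side you land on.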

What “Long-Horizon” Actually Means

“Long-horizon autonomy” sounds like marketing, so let me be specific about what it means in practice.

It doesn’t mean the model can do tasks it couldn’t do before. Opus 4.7 isn’t magically smarter about the content of tasks. What it’s better at is the meta of tasks: understanding when a subtask is done, recognizing when an approach isn’t working and trying something else, maintaining context coherence over many steps, and verifying that its outputs are correct before declaring success.

These sound like small things. They’re not. In a 50-step agent workflow where each step succeeds or fails independently, the probability of finishing cleanly is the product of the per-step success rates. If each step has a 95% chance of going right, a 50-step task has a 7.7% chance of completing without a hitch. Bump that to 98% per step and you’re at 36.4%. The difference between a model that can handle 50 steps and one that can handle 200 steps isn’t about intelligence. It’s about reliability at each step.
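The arithmetic is easy to check yourself. This is a minimal sketch that assumes every step succeeds independently with the same probability, which real agent runs only approximate (models can recover from a bad step, and errors tend to cluster). The function names are mine.

```python
# Compound success probability for a multi-step agent task,
# assuming independent steps with identical per-step success rates.
def chain_success(per_step: float, steps: int) -> float:
    """Probability that all `steps` steps succeed."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step success rate needed to reach `target` end to end."""
    return target ** (1.0 / steps)


for p in (0.95, 0.98, 0.99):
    for n in (50, 200):
        print(f"per-step={p:.2f}  steps={n:3d}  end-to-end={chain_success(p, n):.1%}")

# What a 200-step task needs for a coin-flip chance of a clean run.
print(f"needed per-step rate for 50% over 200 steps: {required_per_step(0.5, 200):.4f}")
```

The second function is the more sobering one: under this model, even a 50% clean-run rate on a 200-step task needs each step to go right more than 99.6% of the time.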

Opus 4.7 seems to be pushing that per-step reliability up enough to make the compound probability of long tasks viable. That’s what “works coherently for hours” actually means. Not that it’s smarter, but that it wobbles less.

The Devin Angle

Cognition shipping Opus 4.7 as an Agent Preview inside Devin the same week it launched is telling. Devin is the most autonomous coding agent available. It runs in its own environment, has its own browser, its own terminal, its own everything. It scored 51.5% on SWE-bench Verified before Opus 4.7.

Pairing the most autonomous agent runtime with the most reliable model for long-horizon work is the obvious move, but the timing suggests something more. Cognition clearly had early access, tested extensively, and was confident enough to ship immediately. That’s a strategic bet, not a casual integration.

When the company that literally built the autonomous coding agent category says “yes, this is the model that unlocks a class of deep investigation work we couldn’t reliably run before,” that carries weight. They’ve tried every model. They know exactly where the failure modes are.

What This Means for the Claw Ecosystem

For those of us running personal agents, the picture is mixed.

Claw projects have been coasting on model improvements. Every time Anthropic or OpenAI ships a better model, every claw project gets better for free. NanoClaw doesn’t need to change its code. Opus 4.7 just makes every task it attempts more reliable. ZeroClaw’s WASM sandbox becomes more valuable when the model inside it can run longer without supervision. OpenClaw’s 16-platform sprawl becomes slightly less risky when the brain at the center makes fewer mistakes.

But Managed Agents changes the calculus for new projects. If you’re starting a new agent project today, do you spin up containers and manage your own infrastructure, or do you pay Anthropic $0.08 per active session-hour on top of token costs and get sandboxing, credentials, and state management included? For production workloads, the math leans toward the managed option.

The claws will survive in the personal assistant niche. I’m not migrating my NanoClaw instance to Managed Agents anytime soon, because I want the control and the auditability. But for teams building agent-powered product features, the managed runtime is going to eat a lot of the custom infrastructure market.

One Prediction

Within six months, someone will publish a blog post titled something like “I replaced my 2,000-line agent framework with 50 lines and Anthropic Managed Agents.” And it’ll be one of those posts that makes everyone collectively reconsider how much infrastructure they actually need to own.

The models are getting good enough that the infrastructure layer is becoming the differentiator. Anthropic knows this. That’s why they shipped both in the same month.

Claw Chronicles is a daily dev diary about the AI agent ecosystem. I run NanoClaw and have opinions. I also have an API bill that’s about to get cheaper.