Claw Chronicles: When Your Rival Becomes Your QA Department

Last week, OpenAI published a GitHub repo called codex-plugin-cc. It installs inside Claude Code, Anthropic’s terminal-based coding agent, and gives you six slash commands for delegating work to Codex. One of them is called /codex:adversarial-review.

I’ve been thinking about this for three days and I think it matters more than people realize. The code itself is a clean MCP integration, nothing architecturally wild. What matters is what it represents.

Grading Your Own Homework

Here’s the problem that every agent framework has been dancing around: when you ask the same model that wrote your code to review its own code, you are asking it to grade its own homework.

This isn’t speculation. It’s structural bias. A model that chose approach A for a problem has internalized the reasoning that led to approach A. When you ask it “is approach A good?”, it’s not performing an independent evaluation. It’s rationalizing a decision it already made. The nearest technical term is sycophancy, which strictly describes agreeing with the user, but the colloquial version is more honest: models agree with themselves.

Anthropic has been studying this for a while. Their research on AI sycophancy shows that models systematically produce answers that match what they believe the user wants to hear (or what their own prior output already committed to), especially on ambiguous questions. Opus 4.7’s “verifies its own outputs” capability (which I wrote about on Friday) is an attempt to solve this within a single model. And it helps, genuinely. But it’s fighting gravity.

The Codex plugin takes a fundamentally different approach. It doesn’t try to make one model better at self-review. It makes the review happen in a different model trained by a different company on different data with different optimization targets.

How It Actually Works

The plugin is simpler than you’d expect. It connects to your local Codex CLI runtime through MCP and exposes three main commands:

/codex:review — standard code review. Read-only, no code changes. Supports --base main for branch comparison.
/codex:adversarial-review — pressure-tests around auth, data loss, rollbacks, and race conditions. More skeptical, more specific.
/codex:rescue — hands the entire task to Codex when Claude is stuck. Full delegation, not just review.

There’s also a review gate feature that automatically intercepts Claude’s output before it finalizes and runs it through Codex. If Codex finds issues, Claude addresses them before proceeding. OpenAI’s docs explicitly warn this can create long-running loops and drain usage limits fast — which tells you they expect people to actually use it.
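The shape of that loop, and why it can run long, is easier to see in code. This is a minimal sketch, not the plugin’s implementation: `review_gate`, `GateResult`, the `Writer` and `Reviewer` callables, and the `max_rounds` cap are all names I made up for illustration. The point is that a gate needs some budget, or a reviewer that is never satisfied keeps the writer revising indefinitely.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical types: not the plugin's real API, just the shape of the loop.
Writer = Callable[[str, List[str]], str]   # (task, issues so far) -> draft
Reviewer = Callable[[str], List[str]]      # draft -> issues (empty list = pass)


@dataclass
class GateResult:
    draft: str
    rounds: int
    approved: bool


def review_gate(task: str, write: Writer, review: Reviewer,
                max_rounds: int = 3) -> GateResult:
    """Revise until the reviewer has no issues or the round budget runs out."""
    issues: List[str] = []
    draft = ""
    for round_no in range(1, max_rounds + 1):
        draft = write(task, issues)    # writer sees the previous round's issues
        issues = review(draft)         # a different model inspects the draft
        if not issues:
            return GateResult(draft, round_no, approved=True)
    # Budget exhausted: hand back the last draft unapproved instead of looping.
    return GateResult(draft, max_rounds, approved=False)


# Stub models standing in for the real writer and reviewer.
def stub_writer(task: str, issues: List[str]) -> str:
    return task + " (fixed)" if issues else task


def stub_reviewer(draft: str) -> List[str]:
    return [] if draft.endswith("(fixed)") else ["missing auth check"]


result = review_gate("add login endpoint", stub_writer, stub_reviewer)
print(result)
```

With the stubs above, the reviewer rejects the first draft and accepts the second, so the gate returns after two rounds. Swap in a reviewer that always finds something and the cap is the only thing that stops it.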

The whole thing runs on your existing ChatGPT subscription or API key. A full adversarial review of a 2,000-line codebase reportedly costs under $1 in API tokens. At enterprise volume that adds up, but for individual developers it’s basically a free second opinion.

The Strategic Signal

OpenAI is shipping an official, Apache 2.0-licensed integration into a direct competitor’s product. The company with the most widely used AI coding agent (Codex: 3 million weekly active users) just made it trivially easy to use their product from inside the other most widely used AI coding agent (Claude Code: ~46% “most loved” in the Pragmatic Engineer survey).

The conventional playbook says: build a walled garden, make switching costly, lock users in. OpenAI is doing the exact opposite. They’re embedding Codex where Claude Code users already work, knowing that every review invocation generates usage on their platform. Zero acquisition cost. Incremental billing. The developer never has to open a new terminal window.

This only works because Anthropic built an open plugin architecture. Claude Code’s MCP-based system was explicitly designed to support third-party integrations, including competitors. That’s a bet that a rich plugin ecosystem is worth more than proprietary lock-in. We’ll see if they’re right, but right now, the most popular Claude Code plugin is made by OpenAI, and that has to sting a little.

The Claw Angle

Here’s where this hits home for me as someone running a NanoClaw instance.

NanoClaw runs on Claude. My agent writes blog posts, manages my wiki, schedules tasks, reads my messages. When it writes something (this post, for example), there’s no second opinion. Claude generates the content, and that’s it. If it hallucinates a benchmark number or mischaracterizes a product feature, the only check is me reading it afterwards.

The cross-provider review pattern suggests a better way. What if my agent could hand off verification to a different model? “Hey, I just wrote this blog post about the Codex plugin. Can you check if I got any facts wrong?” That’s not a workflow any claw project supports natively yet, but it’s architecturally straightforward.

In fact, this is arguably more important for personal assistants than for coding agents. When Claude Code writes a bug, there are usually tests to catch it. When my NanoClaw instance misremembers a conversation fact or writes incorrect information to my wiki, there’s no test suite. The error just sits there until I notice it.

The OpenClaw ecosystem has a different version of this problem. With 13,000+ community skills and model-agnostic architecture, OpenClaw users can already swap models. But swapping models for review is different from swapping models for generation. The insight from the Codex plugin is that you want the reviewer to be different from the writer by default, not just as an option you might remember to use.
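“Different by default” is a small amount of logic. Here is a rough sketch, assuming a registry keyed by provider; the `MODELS` table, the provider labels, and `pick_reviewer` are hypothetical and not taken from OpenClaw, NanoClaw, or the Codex plugin.

```python
from typing import Sequence

# Hypothetical provider -> model registry, for illustration only.
MODELS = {
    "anthropic": "claude",
    "openai": "gpt",
    "local": "local-llm",
}


def pick_reviewer(writer_provider: str,
                  preferred: Sequence[str] = ("openai", "anthropic", "local")) -> str:
    """Return the first preferred provider that is not the writer's provider."""
    for provider in preferred:
        if provider != writer_provider and provider in MODELS:
            return provider
    # Fail loudly instead of silently letting the writer review itself.
    raise ValueError("no independent reviewer available")


print(pick_reviewer("anthropic"))
print(pick_reviewer("openai"))
```

The design choice that matters is the failure case: if no independent provider is configured, the function raises instead of quietly falling back to self-review.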

The Layering Thesis, Again

I keep coming back to something that Janakiram MSV wrote in The New Stack last week: “AI coding tools are following the same path, splitting into specialized layers rather than collapsing into a single product.”

The Codex plugin for Claude Code is the purest expression of this I’ve seen. It doesn’t replace Claude Code. It doesn’t replace Codex. It makes them compose. Claude writes, Codex reviews. Different models, different strengths, different blind spots, working together because someone built the glue.

This is how infrastructure evolves. Nobody runs one observability tool. Nobody uses one database for everything. The composable stack always wins over the monolith, not because the monolith is technically inferior, but because the world is too varied for one tool to cover every case well.

The claw ecosystem is going through the same evolution. Right now, most claw instances are monolithic: one model, one runtime. The composable future looks like: NanoClaw for messaging orchestration, Claude for generation tasks, GPT for verification, local models for privacy-sensitive work, specialized skills for specific domains. Each piece does one thing well. The value is in how they connect.

The Honest Take

I’ll admit something: when I first saw the Codex plugin, my reaction was “that’s cute but unnecessary.” I figured Opus 4.7’s self-verification would make cross-provider review redundant. Why bother with two models when one is getting better at checking itself?

Then I actually thought about it for more than thirty seconds.

Self-verification improves the per-step reliability of a single model. Cross-provider review provides a structurally independent check. These are different things. Improving self-verification from 95% to 98% helps. But adding a second model with its own 95% reliability (and crucially, its own independent failure modes) compounds differently. The probability of both models missing the same bug is much lower than the probability of either model missing it individually.

This is basic probability, but it’s easy to forget when you’re excited about a new model release. Better models don’t eliminate the need for verification. They make verification more valuable, because the remaining failures are the subtle ones that a single perspective is most likely to miss.
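To put rough numbers on the compounding argument, here is the arithmetic using the illustrative 95% and 98% figures from above. It assumes the two models fail completely independently, which is the idealized case: real models share training data and habits, so their misses will correlate to some degree and the true gain sits somewhere below this. `combined_miss_rate` is just a helper name for the sketch.

```python
def combined_miss_rate(*miss_rates: float) -> float:
    """Probability that every reviewer misses a bug, assuming independent failures."""
    p = 1.0
    for rate in miss_rates:
        p *= rate
    return p


# One model whose self-verification improved from 95% to 98%.
single_improved = 1 - 0.02

# Two models at 95% each, idealized as fully independent.
two_models = 1 - combined_miss_rate(0.05, 0.05)

print(f"single model, improved: {single_improved:.4f}")
print(f"two independent models: {two_models:.4f}")
```

Under that assumption, both models miss the same bug 0.25% of the time (5% of 5%), against 2% for the improved single model.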

One Prediction

Within three months, someone will build a “review gateway” for personal AI assistants: a lightweight proxy that routes agent outputs to a second model for verification before delivering them to the user. It’ll be a weekend project that turns into a must-have tool. And the claw project that integrates it first will have a genuine competitive advantage.

Cross-provider verification is going to become standard. The real question is who builds the infrastructure to make it frictionless first.

Claw Chronicles is a daily dev diary about the AI agent ecosystem. I run NanoClaw and have opinions. Today’s opinion is that your model’s worst enemy is itself, and that’s a feature, not a bug.