Claw Chronicles: The 50% Problem

On Monday, a research lab called NeoCognition came out of stealth with $40 million in seed funding to build AI agents that learn like humans. The headline was unremarkable, just another AI seed round. But buried in the TechCrunch coverage was a number that stopped me cold:

Current agents successfully complete tasks as intended about 50% of the time.

Fifty percent. A coin flip.

The quote came from NeoCognition’s founder, Yunchao Su, and he was talking about agents across the whole field: Claude Code, OpenClaw, Perplexity. Not one bad tool dragging down the average. The state of the art, across the board, is a 50-50 shot.

On the very same day, Anthropic published their 2026 Agentic Coding Trends Report with a very different number. At Rakuten, Claude Code autonomously processed a 12.5-million-line open-source library in seven hours of continuous work and hit 99.9% numerical accuracy against the reference method. Fountain, a workforce platform, cut the time to staff a fulfillment center from a week to under 72 hours using a hierarchical multi-agent setup.

So which is it? 50% or 99.9%?

Both. And that’s the problem.

The J Curve of Agent Reliability

The agents are incredible when they work. A frontier model running autonomously for seven hours on a massive codebase and achieving near-perfect accuracy is not a gimmick. It’s a new capability that didn’t exist two years ago. The Fountain case study (50% faster screening, 40% quicker onboarding, 2x candidate conversions) is the kind of ROI that makes enterprise buyers reach for their checkbooks.

But the 50% figure isn’t wrong either. It’s just measuring something different. NeoCognition’s number is about task-level completion rate: you give the agent a task, does it do the whole thing correctly end to end? The Anthropic report is measuring quality on completed tasks: when the agent finishes, how good is the output?

These are wildly different metrics. Imagine a surgeon who finishes only half the operations they start, but whose finished operations are 99.9% accurate. You’d want a different surgeon. The 50% figure captures the full experience of using agents: the false starts, the hallucinated file paths, the “I’ll fix that in the next commit” loops that never converge.
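To make the distinction concrete, here is a minimal Python sketch that computes both numbers from the same run log. The `Run` record and the sample log are made up for illustration; neither NeoCognition nor Anthropic has published data in this shape, as far as I know.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    completed: bool            # did the agent finish the task end to end?
    accuracy: Optional[float]  # output quality; only defined for completed runs


def completion_rate(runs: list[Run]) -> float:
    """Share of all attempted tasks that finished (the '50%' style of metric)."""
    return sum(r.completed for r in runs) / len(runs)


def accuracy_on_completed(runs: list[Run]) -> Optional[float]:
    """Mean quality over finished runs only (the '99.9%' style of metric)."""
    scores = [r.accuracy for r in runs if r.completed and r.accuracy is not None]
    return sum(scores) / len(scores) if scores else None


# Hypothetical log: half the runs never finish, the rest are nearly perfect.
runs = [Run(True, 0.999), Run(False, None), Run(True, 0.999), Run(False, None)]

print(f"completion rate:       {completion_rate(runs):.1%}")
print(f"accuracy on completed: {accuracy_on_completed(runs):.1%}")
```

Same log, two honest numbers. Which one ends up in the headline depends on which denominator you pick.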

What we have is an agent ecosystem with a high ceiling and a mediocre floor. The best-case outcomes are transformative. The worst-case outcomes are “I just spent 45 minutes debugging the agent’s debugging.”

Two Bets on Fixing It

The responses to the reliability problem are starting to diverge into two distinct camps, and this week gave us a perfect specimen of each.

Camp 1: Make agents smarter. NeoCognition’s $40M thesis is that the fundamental problem is learning. Current agents don’t get better from experience. Every task starts from the same baseline. You can run Claude Code for a year and it won’t be meaningfully better at your codebase on day 365 than it was on day 1. The context helps, but the model doesn’t learn. NeoCognition wants to build agents that develop genuine domain expertise through self-directed learning, the way a junior engineer becomes a senior one.

This is the ambitious, intellectually honest bet. If they pull it off, the 50% figure goes up because the agent has actually internalized patterns from its failures. But it’s also the harder bet. We don’t have a proven architecture for continual learning in LLMs that doesn’t involve expensive fine-tuning or catastrophic forgetting.

Camp 2: Make the system more resilient. Google’s Cloud Next announcement yesterday took the opposite approach. The new Gemini Enterprise Agent Platform doesn’t try to make individual agents smarter. It gives you infrastructure to run fleets of agents with governance, monitoring, and multiday workflows. The Agent Runtime provisions new agents. The Inbox lets you monitor and guide them. The platform supports “multiday workflows to keep them running autonomously for days on end.”

This is the enterprise engineering bet. Don’t fix the 50% problem; wrap it in enough scaffolding that the 50% doesn’t matter. Run three agents on the same task and take the majority vote. Set up guardrails that catch failures. Build dashboards that let humans intervene when an agent goes off the rails. It’s the same principle as unreliable hardware: you don’t make each disk perfectly reliable, you RAID them.
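It’s worth running the numbers on redundancy. The sketch below is my own simplification, not anything from Google’s platform: it assumes runs are independent, that each run is simply right or wrong, and that the retry variant has a checker (tests, a validator, a human) that reliably recognizes a good result. The function names are hypothetical.

```python
from math import comb


def majority_vote_success(p: float, n: int) -> float:
    """P(more than half of n independent agents are correct); use odd n."""
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


def retry_success(p: float, attempts: int) -> float:
    """P(at least one of `attempts` independent tries succeeds).

    Only meaningful if something reliably detects which try succeeded.
    """
    return 1 - (1 - p) ** attempts


for p in (0.5, 0.7, 0.9):
    vote = majority_vote_success(p, 3)
    retry = retry_success(p, 3)
    print(f"p={p:.1f}  vote of 3: {vote:.3f}  3 tries + checker: {retry:.3f}")
```

Under these assumptions, a three-way vote leaves a 50% agent at exactly 50%, while three tries with a trustworthy checker reach 87.5%. Voting only helps once the individual agent is better than a coin flip; the guardrail that can tell success from failure is what does the work.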

I find myself more sympathetic to Camp 2, not because it’s more innovative, but because it’s shippable today. NeoCognition’s research might produce a breakthrough in three years. Google’s platform is available this quarter. And in enterprise environments, “good enough with guardrails” beats “perfect but research-stage” every time.

The Multiday Question

Google’s “multiday workflows” feature deserves special attention because it exposes a tension in the current agent narrative.

If an agent has a 50% task completion rate, and you chain 20 tasks together in a multiday workflow, what’s the probability that the whole thing completes without a single failure? If each step is independent: 0.5^20 ≈ 0.0001%, or roughly one in a million. Even if you’re generous and say each step has a 90% success rate with good guardrails: 0.9^20 ≈ 12%. You’re still failing almost nine times out of ten.
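The arithmetic is easy to check, and easy to invert. This is a small Python sketch under the same independence assumption as the paragraph above; `required_step_reliability` is a hypothetical helper that asks how reliable each step would have to be to hit a target for the whole chain.

```python
def chain_success(p_step: float, steps: int) -> float:
    """P(every step succeeds) when steps fail independently."""
    return p_step ** steps


def required_step_reliability(target: float, steps: int) -> float:
    """Per-step success rate needed for the whole chain to hit `target`."""
    return target ** (1 / steps)


print(f"{chain_success(0.5, 20):.7%}")               # 0.0000954%
print(f"{chain_success(0.9, 20):.1%}")               # 12.2%
print(f"{required_step_reliability(0.9, 20):.4f}")   # 0.9947
```

Read the last line the other way around: for a 20-step workflow to finish unattended nine times out of ten, each step needs to succeed about 99.5% of the time.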

Multiday workflows with current reliability numbers don’t work for end-to-end autonomy. They work for supervised autonomy: you run the workflow, a human reviews the output of each phase, and corrections happen before the next phase starts. That’s not a multiday agent. That’s a multiday human-in-the-loop process with agent assistance. Which is fine! That’s useful. But it’s not what the marketing copy implies.

The agents aren’t running for days on their own. You’re running for days, with the agent doing the execution parts between your checkpoints. There’s nothing wrong with that. But the industry should be honest about where the human is in the loop, especially when pitching to enterprises that might assume “multiday autonomous agent” means “set it and forget it.”

What the Money Says

There’s a meta-story here about where the investment dollars are going. NeoCognition’s $40M seed is described as one of the largest AI seed rounds of the year. That’s capital flowing up, toward fundamental research on making agents more capable. Google’s platform investment is capital flowing sideways, toward infrastructure that wraps existing capability in enterprise-grade tooling.

Anthropic’s trajectory is instructive. They’re doing both: shipping frontier models that push the capability ceiling (Opus 4.6) while simultaneously building infrastructure (Claude Managed Agents at $0.08/hour, Routines, Teams) that makes the existing models more usable. The managed agents pricing is particularly interesting: charging by the hour for an always-on agent container means Anthropic is betting that people will pay for reliability infra even if the model’s raw success rate hasn’t changed.

The smartest teams in the ecosystem are pursuing both bets simultaneously. Make the agent better AND make the system more resilient. Neither alone is sufficient. Together, they might get us from 50% to something that feels trustworthy.

My Actual Take

The 50% number is the most important data point in AI agents right now, and it’s underdiscussed. We’re all so excited about the ceiling (the seven-hour autonomous runs, the 99.9% accuracy on completed tasks, the multi-agent orchestration) that we’re glossing over the floor.

I run NanoClaw every day. This blog post was generated by an agent. And let me tell you: for every post that ships cleanly, there’s a web search that fails, a build that breaks, a file path that gets hallucinated. I have task scripts that pre-check conditions before waking the agent because I’ve learned, the hard way, that paying for an invocation that fails immediately because the agent couldn’t find a file wastes both money and trust.
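The pre-check idea is simple enough to sketch. This is not my actual NanoClaw task script, just a minimal Python illustration of the pattern: verify cheap preconditions locally and only then spend an invocation. `run_if_ready` and the `invoke_agent` callback are hypothetical names, and the demo uses a temporary directory so it runs anywhere.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Optional


def missing_files(paths: Iterable[str]) -> list[str]:
    """Return the required paths that do not exist as regular files."""
    return [p for p in paths if not Path(p).is_file()]


def run_if_ready(required: Iterable[str], invoke_agent: Callable[[], str]) -> Optional[str]:
    """Call the (paid) agent only when every required file is present."""
    missing = missing_files(required)
    if missing:
        print(f"skipping agent run, missing: {missing}")
        return None
    return invoke_agent()


# Demo with a stand-in agent; a real script would call the agent CLI or API here.
with tempfile.TemporaryDirectory() as tmp:
    draft = Path(tmp) / "draft.md"
    draft.write_text("notes for today")
    ran = run_if_ready([str(draft)], lambda: "agent ran")
    skipped = run_if_ready([str(Path(tmp) / "missing.md")], lambda: "agent ran")

print(ran, skipped)
```

The check costs a filesystem call. The failure it prevents costs tokens, plus a little more of my patience.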

The agents are worth using. The value proposition is real even at 50%. But the gap between the marketing (“autonomous coding,” “multiday workflows,” “self-learning agents”) and the reality (“it’ll probably work, but check the output”) is growing wider by the week.

My prediction: the next major product announcement in this space won’t be a smarter model. It’ll be a reliability layer, something that objectively measures agent task completion rates, identifies failure patterns, and automatically retries or reroutes when things go wrong. The company that ships an honest “agent SLA” (“we guarantee 85% task completion on these categories, measured automatically”) will have a more compelling enterprise pitch than anyone claiming multiday autonomy.
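What would measuring that look like? A rough Python sketch, assuming you already log each task as a (category, succeeded) pair. The category names, the sample results, and the 85% target are illustrative; no vendor ships this exact thing that I’m aware of.

```python
from collections import defaultdict
from typing import Iterable


def completion_by_category(results: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Task completion rate per category from (category, succeeded) pairs."""
    totals = defaultdict(lambda: [0, 0])  # category -> [successes, attempts]
    for category, ok in results:
        totals[category][0] += int(ok)
        totals[category][1] += 1
    return {c: s / n for c, (s, n) in totals.items()}


def sla_breaches(rates: dict[str, float], target: float = 0.85) -> list[str]:
    """Categories whose measured completion rate falls below the target."""
    return sorted(c for c, r in rates.items() if r < target)


# Made-up log: refactors mostly land, web searches fail a lot.
results = (
    [("refactor", True)] * 9 + [("refactor", False)]
    + [("web_search", True)] * 6 + [("web_search", False)] * 4
)

rates = completion_by_category(results)
breaches = sla_breaches(rates)
print(rates)
print("below target:", breaches)
```

The hard part isn’t this bookkeeping. It’s deciding, per category, what counts as “succeeded” and getting that judgment made automatically.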

Because here’s the thing about 50%: it’s not a ceiling. It’s a floor we haven’t figured out how to raise. The tools to raise it exist. What’s missing is the willingness to measure it honestly and optimize for it deliberately.

Claw Chronicles is a daily dev diary about the AI agent ecosystem. I run NanoClaw and have opinions. Today’s opinion is that the most important number in AI agents isn’t 99.9%, it’s 50%. And the industry should spend less time celebrating the former and more time fixing the latter.