When Your Monitoring Becomes the Incident

I was on-call once for a service that went down because the Datadog agent consumed all available memory on the host. Not the service. The thing watching the service. The patient was fine; the heart monitor flatlined and took the patient with it.

That was four years ago. Since then, I’ve seen the same pattern repeat with unsettling regularity. Monitoring systems causing the incidents they were supposed to prevent. Not metaphorically, not in some abstract “complexity creates risk” sense. The observability infrastructure itself became the blast radius.

And yet every year, we add more of it.

The observability budget swallowed the room

The observability market hit $25 billion in 2026. Twenty-five billion dollars spent annually on watching software run. For many companies, their slice of that is not a rounding error. It’s the rounding error’s older, angrier sibling.

A recent analysis of SaaS spending found that companies using Datadog spend an average of $30,809 per year on that tool alone. A thread on r/sre titled “Our observability costs are now higher than our AWS bill” got dozens of replies saying some version of “same.” One commenter described doing a Datadog demo, loving the product, and then being told the monthly bill would be “just barely less than what my monthly cloud bill was.” That’s not an observability strategy. That’s a hostage situation.

The pricing model is perverse, and everyone knows it but nobody wants to say it out loud at the procurement meeting. You pay per host. Per metric. Per log line. Per trace span. Doing the right engineering thing (more granular services, better instrumentation, richer logging) directly increases your bill. You don’t get rewarded for being thorough. You get penalized for it. The message from every observability vendor: instrument everything, visibility is key, trust is good but verification is better. That’ll be $0.23 per ingested log line.

The economics push teams toward a miserable set of compromises. Sample traces to save money. Suppress logs that might be useful. Choose between knowing what’s happening and being able to afford the infrastructure that’s doing the thing you want to know about.

Alert fatigue isn’t a bug. It’s the product.

Sixty-three percent of alerts go unaddressed. Sixty-three. Not in some niche survey of underfunded startups. Across the industry. Roughly three out of every five times something screams for attention, nobody looks. And this is after teams have already tried to fix the problem.

I’ve been in the room when PagerDuty goes off at 3 AM. You stumble to your laptop, squint at the alert, realize it’s the same CPU utilization threshold that’s been firing every night since the last deployment changed the baseline, acknowledge it, and go back to bed. Next morning, you mutter about tuning that threshold. You never do. There are forty-seven other thresholds to tune, and you have actual work.

The real damage is the erosion of trust. When engineers stop believing their alerts mean anything, they stop responding to them. And the moment an alert fires that actually matters, a real degradation or outage, it lands in the same Slack channel as the forty-seven false positives, gets the same glazed-over reaction, and the incident clock starts ticking while someone finishes their coffee.

The InfoQ article on agent-assisted observability published earlier this year put it well: “The maintenance burden grows with the system. Teams spend significant time just keeping their observability infrastructure current.” You built a dashboard for a service that was deprecated six months ago. You set up an alert for a metric that changed semantics after a refactor. You’re paying for log retention of data nobody has queried since the quarter it was collected. The observability infrastructure has become its own software system, with its own maintenance burden and its own on-call rotation.

You now need to monitor your monitoring. And if you’re not careful, you’ll need to monitor the monitoring of your monitoring.

The instrumentation arms race

What bugs me most about modern observability culture is the moral framing. “Observability” isn’t just a technical practice anymore. It’s presented as a professional virtue. You’re a “bad engineer” if you don’t instrument thoroughly. You’re “flying blind” without distributed tracing. You’re “negligent” if you can’t answer arbitrary questions about your system’s internal state at any moment.

I’ve seen job postings that list “observability culture” as a requirement, as if it’s a character trait rather than a tooling choice. I’ve seen conference talks where the speaker argues that if you can’t debug a production issue by querying your observability platform, you’ve failed as an organization. The subtext is always the same: more telemetry, more visibility, more instrumentation.

Most services don’t need most of this. A well-structured log with correlation IDs and a health check endpoint will catch 90% of production issues for 90% of services. The remaining cases are real: subtle performance regressions, cascading failures across service boundaries, weird state machines that only break under specific load patterns. But they’re rare, and they’re usually better served by targeted instrumentation than by carpet-bombing every function with tracing spans.
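To make “a well-structured log with correlation IDs” concrete, here is a minimal sketch using only the Python standard library. The names (correlation_id, JsonFormatter, build_logger, handle_request, health) and the JSON field layout are my own illustrations, not a standard, and a real service would wire this into whatever web framework it already uses.

```python
import contextvars
import json
import logging
import uuid

# Hypothetical name: holds the correlation ID for the request in flight.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def build_logger(name, stream):
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, incoming_id=None):
    """Reuse the caller's correlation ID if present, otherwise mint one."""
    token = correlation_id.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("request started")
        logger.info("request finished")
    finally:
        # Restore the previous value so IDs never leak between requests.
        correlation_id.reset(token)


def health():
    """Body a health check endpoint might return; the shape is an assumption."""
    return {"status": "ok"}
```

Every line a request emits carries the same ID, so one grep reconstructs the request’s path without a tracing backend. That is the whole trick, and it costs nothing per span.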

The problem runs deeper than tooling. A full observability stack teaches engineers to reach for the dashboard before they reach for their brain. I’ve watched engineers spend twenty minutes hunting for the right Grafana panel to understand a problem that five minutes of reading the code would have solved. The tools become a crutch. Instead of understanding the system, you learn to query it.

What actually works

After a decade of watching this cycle, I’ve noticed the teams that handle incidents well share one trait: they did the boring work of thinking about failure modes before building the observability stack.

They treat alerts as a scarce resource. They budget every fire as costing something like $1000 in engineer time and attention, so adding an alert requires justification and removing one is celebrated. The default state is silence, and noise is the enemy. Before instrumenting a new metric, they ask “What will I do differently if this number changes?” If the answer is “I don’t know” or “I’d look at other things first,” the metric doesn’t get instrumented. This sounds obvious but is rare in practice.

When an alert fires, the on-call engineer has a human-readable runbook waiting, not just a dashboard link. “Check if the cache is warming up. If so, wait 5 minutes.” That beats “investigate elevated latency on the P99 dashboard” every time.

These teams also delete their observability infrastructure regularly. Dashboards nobody looks at. Alerts that haven’t fired meaningfully in months. Metrics that were useful during a migration and are now noise. The discipline to delete is harder than the discipline to create, and it’s what separates functional observability from observability hoarding.
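The deletion pass does not need tooling beyond a list and a cutoff. As a rough sketch, assuming you can export each alert rule with the date it last led to a human doing something (the AlertRule shape, the field name, and the 90-day window are all hypothetical choices of mine):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class AlertRule:
    name: str
    # Last time a fire led to a human taking action; None if it never has.
    last_actionable_fire: Optional[datetime]


def deletion_candidates(
    rules: List[AlertRule],
    now: datetime,
    max_idle: timedelta = timedelta(days=90),
) -> List[str]:
    """Return names of rules with no actionable fire within max_idle."""
    cutoff = now - max_idle
    return [
        rule.name
        for rule in rules
        if rule.last_actionable_fire is None
        or rule.last_actionable_fire < cutoff
    ]
```

The output is a list of candidates, not a verdict. An alert guarding a rare but catastrophic failure can legitimately stay quiet for a year, so someone still has to read the list before deleting anything.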

And they measure the actual cost of their monitoring. Not just the dollar cost (a $50K/month Datadog bill deserves scrutiny), but the cognitive cost. How much engineer time goes into maintaining the observability stack versus shipping features? How many false positives per week? What’s the time-to-acknowledge on real incidents, and is it getting better or worse?
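Two of those questions reduce to arithmetic over a page log. Here is one hedged way to compute them, assuming each page has been labeled after the fact as actionable or not; the Page record and both function names are illustrative, and the labeling itself is the hard, human part.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class Page:
    seconds_to_ack: float
    actionable: bool  # did a human actually need to do anything?


def false_positive_rate(pages: List[Page]) -> float:
    """Fraction of pages that required no action (0.0 for an empty log)."""
    if not pages:
        return 0.0
    noise = sum(1 for page in pages if not page.actionable)
    return noise / len(pages)


def median_ack_seconds_real(pages: List[Page]) -> Optional[float]:
    """Median time-to-acknowledge over actionable pages only."""
    real = [page.seconds_to_ack for page in pages if page.actionable]
    return median(real) if real else None
```

Run it over one week of pages at a time and watch the trend rather than the absolute numbers. A rising false positive rate alongside a rising median acknowledge time on real pages is the trust erosion described earlier, showing up in data.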

The uncomfortable calculus

There’s a point where adding another layer of observability makes your system less reliable, not more. Not because the tools are bad. Many of them are excellent. But every piece of infrastructure you add is a piece of infrastructure that can fail, needs maintenance, and draws resources away from the thing it’s supposed to be watching.

I don’t think we need less observability. I think we need a more honest relationship with it. One where “add more monitoring” isn’t the default answer to every production incident. One where deleting a dashboard signals a healthy team, not a neglectful one. One where we admit that sometimes the most valuable thing an on-call engineer can have is a good night’s sleep and a solid understanding of the codebase. Not a seventeen-panel Grafana dashboard they’ve never looked at.

The observability industry has done a masterful job of selling us on the idea that more data equals more reliability. But data without discernment is just noise. And noise, at 3 AM, is the most expensive thing in software engineering.