Observability Is Not Monitoring: A Manifesto for Modern Operations
Your monitoring dashboard has 47 panels. Twelve of them are red right now. Your on-call engineer is staring at them, trying to figure out which red panel is the cause and which are symptoms.
Fifteen minutes pass. They’re still staring. They’ve opened three terminal windows, tailing logs from different services. They’ve checked the recent deployments list — nothing was deployed in the last 6 hours. They’ve scrolled through the Slack #incidents channel looking for similar past events. They still don’t know what’s wrong.
This is not observability. This is staring at a wall of Christmas lights and trying to figure out which bulb is burned out by looking at which other bulbs dimmed.
The Distinction That Matters
Monitoring answers pre-defined questions: Is CPU above 80%? Is the error rate above 1%? Is free disk space below 10%? Is the response time above 500ms? These are questions you thought to ask in advance — questions about failure modes you’ve already experienced or anticipated.
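Mechanically, every one of those checks reduces to a fixed predicate over a current value. A minimal sketch (metric names and thresholds are illustrative, not from any particular monitoring system):

```python
# Each rule is a pre-defined question: a metric and a threshold comparison.
# Monitoring can only answer questions that appear in this list.
RULES = [
    ("cpu_percent",      lambda v: v > 80),
    ("error_rate",       lambda v: v > 0.01),
    ("disk_free_pct",    lambda v: v < 10),
    ("response_time_ms", lambda v: v > 500),
]

def evaluate(snapshot: dict) -> list:
    """Return the names of metrics whose pre-defined threshold is breached."""
    return [name for name, check in RULES if check(snapshot.get(name, 0))]

firing = evaluate({"cpu_percent": 91, "error_rate": 0.002,
                   "disk_free_pct": 42, "response_time_ms": 617})
print(firing)  # ['cpu_percent', 'response_time_ms']
```

Any question not encoded in `RULES` ahead of time simply cannot be asked — which is the limitation the next section is about.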
Observability answers questions you haven’t thought of yet: Why are requests from European users 3x slower on Tuesdays? Why does this particular customer’s API integration fail every time a specific microservice deploys? Why did latency spike for 47 seconds at 3:17 AM without any deployment, configuration change, or traffic anomaly?
Monitoring is necessary. Observability is what makes it sufficient. You need both. But most organizations have invested heavily in monitoring — beautiful Grafana dashboards with dozens of panels — and barely at all in observability: the ability to ask arbitrary questions of your production systems in real time.
The difference becomes painfully clear during novel incidents — failures that don’t match any pattern you’ve seen before. Monitoring dashboards show symptoms: latency is up, error rate is up, CPU is up. Observability tools let you explore: which specific endpoints are affected? Which users? Which upstream dependencies changed their behavior? What’s different about the requests that fail versus the ones that succeed?
The Three Pillars (And Why They’re Not Enough)
The industry has settled on “three pillars of observability”: logs, metrics, and traces. This framing is helpful as a mental model but dangerously incomplete as an implementation guide.
Logs
Logs tell you what happened — in excruciating, voluminous detail. A log line says “user authentication failed at 14:02:37 for user_id=47293 with error=invalid_token.” Individually useful. At scale, overwhelming.
A mid-size microservices architecture generates millions of log lines per hour. Finding the relevant log line among millions requires knowing what you’re looking for — which is precisely the information you don’t have during a novel incident.
Structured logging (JSON-formatted log lines with consistent fields) helps. Log aggregation (Elasticsearch, Loki, CloudWatch Logs) helps. But logs alone can’t tell you why something happened, whether it’s part of a larger pattern, or what happened to the request before and after the failure.
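A structured log line is just consistent key-value fields serialized once, so aggregators can filter on them instead of regex-parsing prose. A minimal sketch using Python's stdlib `logging` and `json` (the field names are illustrative):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object with consistent, queryable fields."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Merge structured context passed via logging's `extra=` argument.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

logger = logging.getLogger("auth")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

# Downstream, a query like `user_id:47293 AND error:invalid_token`
# finds this line without knowing its message text.
logger.warning("user authentication failed",
               extra={"fields": {"user_id": 47293, "error": "invalid_token"}})
```

In production you would reach for a maintained library rather than a hand-rolled formatter, but the principle is the same: one schema, every service, every line.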
Metrics
Metrics tell you how much happened — aggregated over time. A metric says “the error rate is 4.7% over the last 5 minutes.” But it doesn’t tell you which users are affected, which endpoints are failing, what the error responses contain, or what changed to cause the increase.
Metrics are essential for alerting (SLO-based alerts trigger when error budgets are consumed) and for trend analysis (is latency gradually increasing over weeks?). But they compress detail in a way that makes root cause analysis difficult. The 4.7% error rate might be 100% of one customer’s traffic failing and 0% of everyone else’s — a very different situation than 4.7% of all users experiencing errors.
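The compression problem is easy to demonstrate: the same aggregate error rate can come from wildly different per-customer distributions. A sketch with synthetic request records:

```python
from collections import defaultdict

def error_rates(requests):
    """Overall error rate plus the per-customer breakdown a metric hides."""
    per_customer = defaultdict(lambda: [0, 0])  # customer -> [errors, total]
    for customer, ok in requests:
        per_customer[customer][0] += 0 if ok else 1
        per_customer[customer][1] += 1
    errors = sum(e for e, _ in per_customer.values())
    total = sum(t for _, t in per_customer.values())
    return errors / total, {c: e / t for c, (e, t) in per_customer.items()}

# 1,000 requests: customer A's 47 requests all fail; everyone else succeeds.
requests = [("A", False)] * 47 + [("B", True)] * 953
overall, by_customer = error_rates(requests)
print(f"{overall:.1%}")                    # 4.7% — the only number the metric shows
print(by_customer["A"], by_customer["B"])  # 1.0 0.0 — the story it hides
```

This is why high-cardinality dimensions (customer, endpoint, region) matter: without them, the 4.7% is indistinguishable from 4.7% of everyone failing.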
Traces
Traces tell you the path — a single request flowing through 7 microservices with timing for each hop. They’re invaluable for identifying which service in a call chain introduced latency or errors.
But traces have a sampling problem: at high traffic volumes, you can’t trace every request (the cost would be prohibitive), so you sample — typically 1-10% of requests. If the failure affects 0.1% of requests, your trace sampling might not capture any failing requests at all.
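You can quantify the sampling gap directly: if a failure affects a fraction f of traffic and you sample a fraction s uniformly at random, the probability that a window of n requests yields at least one traced failure is 1 − (1 − s·f)^n. A quick sketch:

```python
def p_traced_failure(sample_rate: float, failure_rate: float,
                     n_requests: int) -> float:
    """Probability that sampling captures at least one failing request,
    assuming independent, uniform random sampling."""
    return 1 - (1 - sample_rate * failure_rate) ** n_requests

# 1% sampling, 0.1% failure rate, 10,000 requests in the incident window:
print(f"{p_traced_failure(0.01, 0.001, 10_000):.1%}")  # 9.5%
```

At roughly one chance in ten of seeing even a single failing trace, most incident windows capture nothing useful — which is why tail-based sampling (deciding after the request completes, keeping all errors) exists.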
The Missing Piece: Correlation
The missing pillar is correlation — the ability to move seamlessly between logs, metrics, and traces for the same event with a single interaction.
When your trace shows a 2-second delay in the payment service, you should be able to click through to the CPU and memory metrics for that specific pod at that moment, then drill into the specific log lines for that request, then see the database query plan that caused the slowdown.
Without correlation, you have three separate systems that each tell part of the story but never the whole narrative. It’s like assembling one puzzle from the pieces of three different jigsaw sets mixed together.
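The join key that makes correlation possible is a shared identifier stamped on every signal. A minimal sketch of the idea (the field names are illustrative; real systems use W3C `traceparent` headers and OpenTelemetry SDKs rather than hand-rolled IDs):

```python
import json
import time
import uuid

def new_trace_id() -> str:
    """One identifier for every signal a single request produces."""
    return uuid.uuid4().hex

def log_event(trace_id: str, message: str, **fields) -> str:
    """A structured log line that joins against spans and metric exemplars."""
    return json.dumps({"trace_id": trace_id, "ts": time.time(),
                       "message": message, **fields})

trace_id = new_trace_id()

# The same trace_id appears on the span, the log line, and the metric
# exemplar, so one click in any tool can pivot to the other two.
span = {"trace_id": trace_id, "service": "payments", "duration_ms": 2000}
log_line = log_event(trace_id, "slow ledger query", table="ledger")
exemplar = {"metric": "http_request_duration_seconds",
            "value": 2.0, "trace_id": trace_id}

assert json.loads(log_line)["trace_id"] == span["trace_id"] == exemplar["trace_id"]
```

Everything else — click-through from a slow span to its pod metrics to its log lines — is UI built on top of this one join.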
The Organizational Problem
The technical challenges of observability are largely solved. OpenTelemetry provides vendor-neutral instrumentation. Cloud providers offer integrated observability platforms. Open-source tools like Grafana, Prometheus, Jaeger, and Loki are mature, well-documented, and widely deployed.
The real challenges are organizational — and no vendor or open-source project solves them:
The Dashboard Proliferation Problem
Every team creates their own dashboards. The platform team monitors infrastructure — CPU, memory, disk, network. The API team monitors endpoint latency and error rates. The data team monitors pipeline throughput and freshness. The frontend team monitors Core Web Vitals and JavaScript errors.
Nobody monitors the customer experience end-to-end. Nobody has a single view that traces a customer action from the browser click through the CDN, load balancer, API gateway, application service, database, third-party API, and back.
When an incident occurs, the first 15 minutes are spent figuring out which dashboard to look at and which team’s metrics are relevant. This is organizational failure, not tooling failure — and it’s solved by defining shared customer-journey dashboards that every team contributes signals to.
The Alert Fatigue Problem
Most on-call rotations generate 50-100 alerts per week. Of those, 80% are noise: thresholds set too aggressively, alerts for metrics that self-resolve within minutes, duplicate alerts for the same underlying issue firing from three different monitoring rules, and alerts for services that have known issues with no available fix.
The result: engineers start ignoring alerts. The critical P1 alert that fires at 2 AM gets the same tired response as every other alert — a glance at the phone screen through half-closed eyes and an assumption that it’ll resolve itself.
Alert fatigue doesn’t happen because your alerting system is bad. It happens because your organization hasn’t invested in alert quality as a discipline — pruning noisy alerts, consolidating duplicates, and ensuring that every alert represents an actionable condition that requires human intervention.
The Incentive Problem
Nobody gets promoted for improving observability. The SRE who reduces mean time to detection from 15 minutes to 2 minutes has prevented dozens of extended outages — but prevention is invisible. Nobody can point to a specific incident that didn’t happen.
Observability is operational insurance. Like all insurance, its value is invisible when it’s working and catastrophically obvious when it’s absent. This creates a chronic underinvestment cycle: observability is deprioritized until a major incident reveals its absence, then frantically invested in until the crisis fades, then deprioritized again.
What Good Looks Like
Organizations with mature observability practices share five characteristics:
1. Service Level Objectives (SLOs) drive alerting. Instead of monitoring 47 metrics per service and alerting on arbitrary thresholds, they define 2-3 SLOs per service — latency P99, error rate, availability — and alert only when error budgets are being consumed at a rate that threatens the SLO. This reduces alert volume by 80-90%.
2. Every request has a trace ID. From the edge load balancer to the deepest database query, every request carries a unique identifier that connects every log line, metric datapoint, and span. When something goes wrong, you follow the trail.
3. Context propagation is automatic. Engineers don’t manually add tracing code to every handler. The platform instruments services automatically through middleware, libraries, sidecar proxies, and service mesh. Adding observability to a new service is zero effort for the development team.
4. Runbooks are linked to alerts. Every alert has a corresponding runbook that describes what to check first, how to diagnose common causes, how to remediate, and when to escalate. The runbook is embedded in the alert definition, not a separate wiki page that requires searching during a 3 AM incident.
5. Observability is a team sport. A dedicated observability function — within the platform team or as a standalone team — owns the tooling, the standards, and the training. Product teams own their dashboards and SLOs but use standardized tooling and patterns that the observability team maintains.
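Point 1 in particular is mechanical enough to sketch: instead of alerting on a raw error-rate threshold, alert when the error budget is burning fast enough to threaten the SLO. A minimal sketch (the 99.9% SLO and the 14.4x fast-burn threshold are illustrative; the latter is the commonly cited rate that spends about 2% of a 30-day budget per hour):

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How many times faster than 'sustainable' the error budget is burning.
    A burn rate of 1.0 exactly exhausts the budget over the SLO window."""
    budget = 1 - slo_target  # e.g. a 99.9% SLO leaves a 0.1% error budget
    return observed_error_rate / budget

def should_page(observed_error_rate: float, slo_target: float = 0.999,
                threshold: float = 14.4) -> bool:
    """Page only on fast burn; slower burns go to a ticket queue instead."""
    return burn_rate(observed_error_rate, slo_target) >= threshold

print(should_page(0.02))   # True  — 20x burn, the budget is vanishing
print(should_page(0.005))  # False — 5x burn, urgent but not a 2 AM page
```

Production implementations evaluate this over paired long/short windows so a momentary spike doesn't page, but the core idea — alert on budget consumption, not raw thresholds — is exactly this.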
The Investment Case
Observability investment has one of the clearest ROI models in engineering:
- Reduce MTTR by 50% → Direct savings in incident response engineering time
- Reduce alert noise by 80% → Recovered engineering time, reduced on-call burnout, improved engineer retention
- Prevent 2-3 major incidents per year through early detection → Revenue protection, customer trust preservation
- Faster everyday debugging → Average engineer saves 3-5 hours per week
Even if only a fraction of those hours are realized, for a 50-person engineering organization at an $85/hour loaded cost, improved observability typically recovers 4,000-6,000 engineering hours per year — $340K-$510K annually in recovered productive time. Add the revenue protection from preventing extended outages, and the ROI case is overwhelming.
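The dollar figures follow directly from the hours and the loaded cost above:

```python
LOADED_COST_PER_HOUR = 85  # fully loaded engineering cost, per the estimate above

# Annual savings at the low and high ends of the recovered-hours range.
savings = {hours: hours * LOADED_COST_PER_HOUR for hours in (4_000, 6_000)}
for hours, dollars in savings.items():
    print(f"{hours:,} h/yr -> ${dollars:,}/yr")
# 4,000 h/yr -> $340,000/yr
# 6,000 h/yr -> $510,000/yr
```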
The Garnet Grid perspective: We help organizations evolve from monitoring to observability — starting with SLO definition and ending with a self-service observability platform that every team can use. Because you can’t fix what you can’t see — and you can’t see what you’re not correlating. Start with an architecture audit →