Observability vs Monitoring: The Stack Every Team Needs
At 3 AM on a Tuesday, our pager went off: "API latency exceeds 2 seconds." I opened Grafana, confirmed the spike, and then stared at the screen for 20 minutes trying to figure out why. CPU was normal. Memory was normal. Database queries looked fine. Error rates were flat. Every individual metric looked healthy, but the system was clearly broken.
The problem turned out to be a single upstream dependency that had started returning responses 1.5 seconds slower than usual. We didn't have distributed tracing, so there was no way to see that one specific external call in the request chain was the bottleneck. Our monitoring told us something was wrong. It couldn't tell us what or why.
That incident crystallized the difference between monitoring and observability for me. Monitoring answers "is this thing working?" Observability answers "why isn't this thing working?" — and more importantly, it lets you answer questions you didn't think to ask before the system broke.
This article covers the three pillars of observability (logs, metrics, traces), the major tools in the ecosystem (OpenTelemetry, Grafana, Datadog, Prometheus), and a practical guide to building an observability stack that doesn't cost $10,000 per month.
Monitoring vs Observability: The Real Difference
Monitoring and observability are related but distinct concepts. The confusion between them causes teams to either under-invest (basic monitoring without debugging capability) or over-invest (expensive observability platforms for simple applications).
Monitoring is the practice of collecting, aggregating, and alerting on predefined metrics. You decide in advance what to measure (CPU usage, error rate, request latency) and set thresholds that trigger alerts. Monitoring is reactive — it tells you when known failure modes occur.
Observability is the ability to understand the internal state of a system from its external outputs. It's proactive — it lets you explore and diagnose problems you've never encountered before. A system is "observable" when you can ask arbitrary questions about its behavior without deploying new instrumentation.
The distinction matters in practice. Charity Majors (Honeycomb co-founder) puts it well: "Monitoring is for known-unknowns. Observability is for unknown-unknowns." In simple systems, monitoring is sufficient. In distributed systems with complex interactions, observability becomes essential.
| Aspect | Monitoring | Observability |
|---|---|---|
| Question | "Is it working?" | "Why isn't it working?" |
| Approach | Predefined dashboards and alerts | Exploratory, ad-hoc querying |
| Data Model | Aggregated metrics | High-cardinality events, traces, structured logs |
| Failure Mode | Handles known failure patterns | Handles novel, unexpected failures |
| Cost | Lower (less data, simpler tools) | Higher (more data, more sophisticated tools) |
| Best For | Simple systems, known patterns | Distributed systems, complex interactions |
The Three Pillars: Logs, Metrics, and Traces
Pillar 1: Logs
Logs are timestamped records of discrete events. They're the oldest and most fundamental form of system telemetry. Every developer has written `console.log("here")` at some point — that's logging at its most basic.
Structured vs Unstructured: The single most impactful thing you can do for your logging is make it structured. Instead of:
```
User 12345 placed order 67890 for $49.99 at 2026-03-15T10:30:00Z
```
Use structured JSON:
```json
{"timestamp": "2026-03-15T10:30:00Z", "level": "info", "event": "order_placed", "user_id": "12345", "order_id": "67890", "amount": 49.99, "currency": "USD"}
```
Structured logs enable querying, filtering, and aggregation. You can ask "show me all orders over $100 from user 12345 in the last 24 hours" and get an answer in seconds — a query that's practically impossible against unstructured text.
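As a concrete sketch, here's one way to emit structured JSON logs with Python's standard logging module. The field names mirror the example above; `JsonFormatter` is a hypothetical helper for illustration, not a library API:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line with consistent fields."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
        }
        # Merge structured context attached via logging's `extra` mechanism.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

logger = logging.getLogger("orders")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order_placed", extra={"context": {
    "user_id": "12345", "order_id": "67890", "amount": 49.99, "currency": "USD",
}})
```

In a real codebase you'd pull a formatter from a library (python-json-logger, structlog, and similar tools exist for most languages) rather than hand-rolling one, but the principle is the same: one JSON object per event, consistent field names everywhere.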
Log Aggregation: In distributed systems, logs are scattered across dozens of services and instances. You need a central place to search them. The major options:
- ELK Stack (Elasticsearch + Logstash + Kibana): The classic open-source solution. Powerful but operationally heavy — Elasticsearch clusters require significant tuning and maintenance. Elastic's documentation estimates 30-40 hours of initial setup for production.
- Loki (Grafana Labs): A newer approach that indexes only log labels (not full text), making it 10-100x cheaper than Elasticsearch for most use cases. Pairs naturally with Grafana (see the Grafana Loki documentation).
- CloudWatch Logs (AWS) / Cloud Logging (GCP): Managed solutions that integrate with their respective cloud platforms. Simple but can get expensive at scale.
Pillar 2: Metrics
Metrics are numeric measurements collected at regular intervals. Unlike logs (which record individual events), metrics aggregate data over time — average response time, total request count, 95th percentile latency.
The RED Method (from Tom Wilkie at Grafana Labs) provides a framework for what to measure in request-driven services:
- Rate: Number of requests per second
- Errors: Number of failed requests per second
- Duration: Distribution of request latency (p50, p95, p99)
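To make RED concrete, here's a toy Python sketch of per-service RED bookkeeping. In production you'd use a real metrics library (such as the Prometheus client) instead of this hand-rolled `RedTracker`; the class is purely illustrative:

```python
from bisect import insort

class RedTracker:
    """Toy RED bookkeeping for one service: Rate, Errors, Duration percentiles."""

    def __init__(self) -> None:
        self.requests = 0
        self.errors = 0
        self.durations: list[float] = []  # kept sorted for percentile lookup

    def observe(self, duration_s: float, ok: bool) -> None:
        self.requests += 1
        if not ok:
            self.errors += 1
        insort(self.durations, duration_s)

    def percentile(self, p: float) -> float:
        # Nearest-rank percentile over all observed durations.
        idx = min(len(self.durations) - 1, int(p / 100 * len(self.durations)))
        return self.durations[idx]

red = RedTracker()
for ms in (12, 18, 25, 40, 950):           # five sample requests
    red.observe(ms / 1000, ok=(ms < 900))  # the slow one also failed
print(red.requests, red.errors, red.percentile(95))
```

Rate falls out of `requests` divided by the window length; a real metrics backend computes percentiles from histogram buckets rather than storing every observation, but the three numbers you alert on are the same.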
The USE Method (from Brendan Gregg) is for infrastructure resources:
- Utilization: Percentage of resource in use
- Saturation: Amount of work queued (waiting)
- Errors: Count of error events
Prometheus is the de facto standard for metrics collection. It uses a pull-based model (scraping endpoints) and a powerful query language (PromQL). According to the CNCF Annual Survey, Prometheus is used by over 80% of Kubernetes users for monitoring.
Pillar 3: Traces
Distributed traces follow a single request as it propagates through multiple services. Each service contributes a "span" — a unit of work with a start time, duration, and metadata. The collection of spans for a single request forms a trace.
Traces answer the question that logs and metrics cannot: "Which specific service in the call chain caused this request to be slow?" Without tracing, diagnosing latency in a microservices architecture is essentially guesswork.
Key concepts:
- Trace ID: A unique identifier that follows the request across all services
- Span: A single unit of work (database query, HTTP call, function execution)
- Span context: Metadata propagated between services (trace ID, span ID, sampling flags)
- Sampling: Collecting traces for a fraction of requests to manage cost (1%, 10%, or adaptive)
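These concepts can be illustrated with a toy Python model of spans. A real system uses a tracing SDK such as OpenTelemetry, which handles ID generation and context propagation for you; this sketch only shows how the pieces relate:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """One unit of work in a trace (toy model, not a real SDK)."""
    name: str
    trace_id: str                    # shared by every span in the same request
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None  # links this span to its caller
    start: float = field(default_factory=time.monotonic)
    duration_s: float = 0.0

    def child(self, name: str) -> "Span":
        # A child inherits the trace ID; parent_id records the call chain.
        return Span(name=name, trace_id=self.trace_id, parent_id=self.span_id)

    def finish(self) -> None:
        self.duration_s = time.monotonic() - self.start

# One incoming request gets one trace ID that follows it everywhere.
root = Span(name="GET /checkout", trace_id=uuid.uuid4().hex)
db = root.child("SELECT orders")
db.finish()
root.finish()
print(db.trace_id == root.trace_id, db.parent_id == root.span_id)
```

The crucial property is that the trace ID never changes as the request hops between services — that's what lets a backend reassemble the full call tree and show you exactly which span ate the 1.5 seconds.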
OpenTelemetry: The Standard That Won
OpenTelemetry (OTel) is the CNCF project that provides a vendor-neutral standard for generating, collecting, and exporting telemetry data (logs, metrics, and traces). It's the merger of two earlier projects (OpenTracing and OpenCensus) and has become the industry standard.
Why OpenTelemetry matters:
- Vendor neutrality: Instrument your code once, export to any backend (Grafana, Datadog, New Relic, Jaeger, etc.). No vendor lock-in.
- Auto-instrumentation: Libraries for most languages (Java, Python, Node.js, Go, .NET) that automatically instrument HTTP clients, database drivers, and message queues without code changes.
- Correlation: Traces, metrics, and logs share the same context (trace ID, span ID), making it possible to jump from a metric anomaly to the specific trace that caused it.
- Community: OTel is the second most active CNCF project after Kubernetes. The ecosystem is large and growing.
The OpenTelemetry Collector
The OTel Collector is a proxy that receives, processes, and exports telemetry data. It sits between your application and your observability backend, providing a single configuration point for routing, filtering, and transforming data.
A typical setup:
- Applications emit telemetry via OTel SDKs (auto-instrumented or manual)
- Telemetry is sent to the OTel Collector (running as a sidecar or daemon)
- The Collector routes data: metrics → Prometheus, traces → Jaeger/Tempo, logs → Loki
- Grafana dashboards query all three backends for unified visibility
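A minimal Collector configuration along those lines might look like the sketch below. The endpoints are placeholders, and the exact exporter names depend on which Collector distribution you run (the Loki exporter, for example, ships in the contrib build):

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}            # apps send OTLP over gRPC to the Collector

exporters:
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push      # placeholder endpoint
  otlp/tempo:
    endpoint: tempo:4317                         # placeholder endpoint
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push  # placeholder endpoint

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      exporters: [loki]
```

The key idea is that applications only ever know about one endpoint — the Collector — and all routing decisions live in this one file, so swapping backends never requires touching application code.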
Tool Comparison: Open Source vs Commercial
| Tool | Type | Strengths | Cost |
|---|---|---|---|
| Grafana + Prometheus + Loki + Tempo | Open Source (LGTM stack) | Free, full control, OTel native, large community | $0 (self-host) or Grafana Cloud free tier |
| Datadog | Commercial SaaS | Best UX, unified platform, excellent integrations | $15-34/host/mo + per-feature costs |
| New Relic | Commercial SaaS | 100GB free/mo, full-stack observability | Free tier, then $0.30/GB ingested |
| Honeycomb | Commercial SaaS | Best-in-class tracing and exploration | Free tier (20M events/mo), then per-event |
| Elastic Observability (ELK) | Open Source + Commercial | Powerful log search, APM, mature ecosystem | $0 (self-host) or Elastic Cloud from $95/mo |
| AWS CloudWatch | Managed (AWS) | Native AWS integration, zero setup | Per-metric, per-log-GB, per-trace pricing |
| Jaeger | Open Source | CNCF project, excellent distributed tracing | $0 (self-host) |
My Opinionated Recommendation: The LGTM Stack
For most teams — especially those with fewer than 50 engineers and tight budgets — the open-source LGTM stack from Grafana Labs is the right answer:
- Loki: Log aggregation
- Grafana: Visualization and dashboards
- Tempo: Distributed tracing
- Mimir: Long-term metrics storage (Prometheus-compatible)
This stack is fully open source, integrates natively with OpenTelemetry, and provides a unified interface for all three pillars of observability. Grafana Cloud offers a generous free tier (50GB logs, 10K metrics series, 50GB traces per month) that covers most small-to-medium applications.
The alternative — Datadog — is genuinely excellent. Its UX is the best in the industry, its integrations are comprehensive, and its AI-powered root cause analysis saves real debugging time. But Datadog's cost model scales aggressively. A team running 20 hosts with APM, logs, and infrastructure monitoring can easily spend $3,000-5,000/month. For a startup or mid-sized team, that's hard to justify when open-source alternatives exist.
My rule of thumb: use Grafana Cloud's free tier to start. Upgrade to Grafana Cloud paid when you need more capacity. Only consider Datadog when you have the budget ($5,000+/month) AND the team size (50+ engineers) to justify the UX premium.
Building Your Observability Stack: Action Plan
Phase 1: Foundation (Week 1-2)
- Adopt structured logging. Update all services to emit JSON-formatted logs with consistent fields: timestamp, level, service name, trace ID (if available), and relevant business context.
- Deploy Prometheus. Add Prometheus scraping for basic infrastructure metrics (CPU, memory, disk, network) and application metrics (request rate, error rate, latency).
- Set up Grafana. Create dashboards for the RED method (Rate, Errors, Duration) for each service. Set up alerts for critical thresholds.
Phase 2: Logs and Correlation (Week 3-4)
- Deploy Loki. Ship structured logs from all services to Loki. Create log-based dashboards and alerts in Grafana.
- Add correlation IDs. Generate a unique request ID at the entry point (API gateway) and propagate it through all service calls. Include it in every log line.
- Build log-to-metric alerts. Use Loki's log-based alerting to detect patterns that metrics miss (specific error messages, unusual log patterns).
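The correlation-ID step above can be sketched in Python with `contextvars`, so every log line picks up the request ID without threading it through function arguments. The `X-Request-ID` header name is a common convention, not a standard; treat the whole snippet as an illustrative pattern:

```python
import contextvars
import json
import uuid

# Request ID stored in a context variable so any log call can read it
# without it being passed explicitly down the call stack.
request_id: contextvars.ContextVar[str] = contextvars.ContextVar("request_id", default="-")

def log(event: str) -> None:
    # Every log line automatically carries the current request's ID.
    print(json.dumps({"event": event, "request_id": request_id.get()}))

def charge_card() -> None:
    log("card_charged")   # no request ID parameter needed

def handle_request(headers: dict) -> str:
    # Reuse the caller's ID if present, otherwise mint a new one at the edge.
    rid = headers.get("X-Request-ID", uuid.uuid4().hex)
    request_id.set(rid)
    log("request_started")
    charge_card()
    return rid

handle_request({"X-Request-ID": "req-42"})
```

When the downstream call is an HTTP request to another service, the same ID goes out as a header, and that service repeats the pattern — which is exactly what OTel's context propagation automates in Phase 3.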
Phase 3: Distributed Tracing (Week 5-8)
- Integrate OpenTelemetry. Add the OTel SDK to each service. Start with auto-instrumentation (HTTP clients, database drivers, message queues).
- Deploy Tempo. Configure the OTel Collector to export traces to Tempo. Set up sampling (start with 10% of requests).
- Connect the pillars. Use Grafana's exemplar feature to link metrics → traces. Add trace IDs to log lines for log → trace correlation.
Phase 4: Operationalize (Ongoing)
- Create SLOs (Service Level Objectives). Define target reliability (e.g., 99.9% of requests under 500ms) and alert when SLO budgets are consumed.
- Build runbooks. For every alert, create a runbook that starts with the observability tools needed to diagnose the issue.
- Practice incident response. Run game days where you inject failures and use your observability stack to diagnose them. This builds muscle memory for real incidents.
- Manage costs. Monitor telemetry volume. Use sampling for traces, log level management, and metric aggregation to control costs as your system grows.
Common Mistakes to Avoid
- Alert fatigue: Too many alerts desensitize the team. Alert only on symptoms (latency, errors), not causes (CPU usage). If an alert fires more than once a week without requiring action, delete it.
- Dashboard sprawl: 50 dashboards that nobody looks at are worse than 5 dashboards that everyone uses. Create team-specific dashboards with the RED method and a single "service overview" dashboard.
- Logging everything: High-volume debug logging in production is expensive and creates noise. Use log levels appropriately. Debug logs should be off by default and enabled dynamically when investigating issues.
- Ignoring cardinality: High-cardinality labels in Prometheus (user IDs, request IDs, URLs with query parameters) can cause memory explosions. Use traces for high-cardinality data, metrics for low-cardinality aggregates.
- Skipping correlation: Logs without trace IDs, metrics without exemplars, traces without service context — each pillar in isolation provides limited value. The power comes from correlation.
Sources
- OpenTelemetry
- Grafana Loki
- The RED Method — Tom Wilkie
- The USE Method — Brendan Gregg
- Observability 101 — Honeycomb
- CNCF Annual Survey 2024
- ELK Stack — Elastic
- CNCF Projects
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
