observability-expert
Expert-level observability covering the three pillars (metrics, logs, traces), OpenTelemetry instrumentation, Prometheus metric types and PromQL, Grafana dashboard design using RED/USE methods, structured logging, distributed tracing with sampling strategies, SLO-based alerting, and Loki log aggregation.
Observability Expert
Observability is not monitoring. Monitoring tells you when something is broken by checking known failure
modes. Observability lets you debug unknown failures by asking questions of your system after the fact.
The three pillars — metrics, logs, traces — are complementary, not interchangeable. You need all three.
Core Mental Model
Metrics tell you what is happening at aggregate level (request rate, error rate, latency). They're
cheap to store and query but lose individual request detail. Logs tell you what happened for specific
events, with full context. They're expensive at scale but essential for debugging. Traces tell you where
time was spent across service boundaries for a single request. The power is in correlation: a metric
spike leads you to a time window, logs give you the error details, a trace shows you the slow span. Design
your telemetry so all three can be linked by a common trace ID.
The Three Pillars
Metrics (Prometheus/OTEL):
"5% of requests are failing" → WHERE to look
Aggregated, sampled, cheap to store and query
Logs (Loki/CloudWatch/ELK):
"ORDER-123 failed: constraint violation on user_id" → WHAT happened
Event-level detail, expensive at scale, essential for context
Traces (Jaeger/Tempo/Zipkin):
"Payment service took 2.3s; 1.8s was in the DB query" → WHY it was slow
Request-level, cross-service, shows causality
Correlation: trace_id links all three
Metric alert fires → log query filters by time + service → trace ID found → full trace loaded
OpenTelemetry: Architecture
Your App
│
▼
OTel SDK (instrumentation)
│ OTLP (gRPC/HTTP)
▼
OTel Collector
├── Receivers: OTLP, Jaeger, Zipkin, Prometheus scrape
├── Processors: batch, memory_limiter, resource detection, sampling
└── Exporters: Jaeger, Tempo, Prometheus, CloudWatch, Datadog, OTLP
Backends:
Traces → Jaeger / Grafana Tempo / Zipkin
Metrics → Prometheus / Mimir / Datadog
Logs → Loki / ELK / CloudWatch Logs
FastAPI + OpenTelemetry Auto-Instrumentation
# requirements.txt
# opentelemetry-sdk opentelemetry-exporter-otlp opentelemetry-instrumentation-fastapi
# opentelemetry-instrumentation-httpx opentelemetry-instrumentation-sqlalchemy
import os

from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.sdk.resources import Resource

def setup_telemetry(app):
    resource = Resource.create({
        "service.name": "order-api",
        "service.version": os.environ.get("APP_VERSION", "unknown"),
        "deployment.environment": os.environ.get("ENVIRONMENT", "development"),
    })

    # Traces
    tracer_provider = TracerProvider(resource=resource)
    tracer_provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
    )
    trace.set_tracer_provider(tracer_provider)

    # Metrics
    reader = PeriodicExportingMetricReader(
        OTLPMetricExporter(endpoint="http://otel-collector:4317"),
        export_interval_millis=10000,
    )
    metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[reader]))

    # Auto-instrument
    FastAPIInstrumentor.instrument_app(app)
    HTTPXClientInstrumentor().instrument()
Manual Spans with Semantic Conventions
from opentelemetry import trace
from opentelemetry.semconv.trace import SpanAttributes

tracer = trace.get_tracer(__name__)

async def process_payment(order_id: str, amount: float) -> dict:
    with tracer.start_as_current_span(
        "payment.process",
        kind=trace.SpanKind.CLIENT,
        attributes={
            SpanAttributes.DB_SYSTEM: "postgresql",
            "app.order.id": order_id,
            "app.payment.amount": amount,
        },
    ) as span:
        try:
            result = await payment_gateway.charge(order_id, amount)
            span.set_attribute("app.payment.transaction_id", result["transaction_id"])
            span.set_status(trace.Status(trace.StatusCode.OK))
            return result
        except PaymentDeclinedError as e:
            # Business error — mark as error but not exception
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            span.set_attribute("app.payment.decline_reason", e.reason)
            raise
        except Exception as e:
            # Unexpected error — record exception with stack trace
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR))
            raise
OTel Collector Configuration
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  # Scrape Prometheus metrics from services
  prometheus:
    config:
      scrape_configs:
        - job_name: 'fastapi-services'
          scrape_interval: 15s
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true

processors:
  batch:
    timeout: 1s
    send_batch_size: 1000
    send_batch_max_size: 2000
  memory_limiter:
    check_interval: 1s   # required; the limiter fails to start without it
    limit_mib: 512
    spike_limit_mib: 128
  # Tail-based sampling: keep 100% of errors, 10% of successful traces
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces-policy
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    labels:
      resource:
        service.name: "service_name"
        deployment.environment: "environment"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
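The tail-sampling policies are OR-combined: a trace is kept if any one of them matches. A pure-Python sketch of that decision logic (the function name and defaults are illustrative, not collector internals):

```python
def keep_trace(has_error: bool, duration_ms: float, rand: float,
               latency_threshold_ms: float = 1000.0,
               sampling_percentage: float = 10.0) -> bool:
    """Sketch of OR-combined tail-sampling policies: keep every errored
    trace, every slow trace, and ~10% of the rest."""
    if has_error:                            # errors-policy
        return True
    if duration_ms >= latency_threshold_ms:  # slow-traces-policy
        return True
    return rand * 100 < sampling_percentage  # probabilistic-policy
```

Note that `decision_wait: 10s` exists precisely because this decision needs the whole trace: the collector buffers spans until the trace is (probably) complete.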
Prometheus Metric Types
| Type | Use When | Example |
|------|----------|---------|
| Counter | Things that only go up | requests_total, errors_total |
| Gauge | Things that go up and down | active_connections, queue_depth, memory_bytes |
| Histogram | Distribution of values (latency, size) | request_duration_seconds |
| Summary | Client-side quantiles (avoid at scale) | Legacy; prefer Histogram |

Counters are queried with rate() or increase() in PromQL. Gauges are used directly.
from prometheus_client import Counter, Histogram, Gauge

# Counter: always increment, never decrement
requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'path', 'status_code']
)

# Histogram: buckets should match your SLO
request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'path'],
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)

# Gauge: current state
active_users = Gauge('active_users_current', 'Currently active users')

# Usage
with request_duration.labels(method='GET', path='/orders').time():
    result = process_request()
requests_total.labels(method='GET', path='/orders', status_code='200').inc()
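Prometheus histogram buckets are cumulative: each observation increments every bucket whose upper bound (`le`) is at or above the value, which is why the boundaries must bracket your SLO threshold. A minimal pure-Python illustration of the semantics:

```python
import math

def observe(buckets: dict[float, int], value: float) -> None:
    """Record one observation: increment every bucket whose upper
    bound (le) is >= value — Prometheus buckets are cumulative."""
    for le in buckets:
        if value <= le:
            buckets[le] += 1

buckets = {0.1: 0, 0.3: 0, math.inf: 0}  # the +Inf bucket always exists
for latency in (0.05, 0.2, 0.2, 2.0):
    observe(buckets, latency)
# le="0.3" now counts 3 (it includes the le="0.1" observation);
# +Inf counts all 4
```

This cumulative shape is what `histogram_quantile()` interpolates over, and what the Apdex query below exploits.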
PromQL Essentials
# Rate of requests per second (5m window)
rate(http_requests_total[5m])
# Error rate percentage
rate(http_requests_total{status_code=~"5.."}[5m]) /
rate(http_requests_total[5m]) * 100
# 99th percentile latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
# Apdex score (SLO: 100ms target, 300ms frustrated)
(
sum(rate(http_request_duration_seconds_bucket{le="0.1"}[5m])) +
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
) / 2 / sum(rate(http_request_duration_seconds_count[5m]))
# Multi-window burn rate for SLO alerting
# 1h window: fast burn detection
rate(http_requests_total{status_code=~"5.."}[1h]) /
rate(http_requests_total[1h])
# Recording rules (precompute expensive queries)
# In prometheus rules YAML:
# - record: job:http_requests:rate5m
# expr: sum(rate(http_requests_total[5m])) by (job)
Grafana Dashboard Design
RED Method (for services receiving requests)
- Rate: Requests per second
- Errors: Error rate %
- Duration: Latency (p50, p95, p99)
USE Method (for resources: CPU, disk, network)
- Utilization: % of time resource is busy
- Saturation: Queued/waiting work
- Errors: Error events
// Grafana panel: SLO status with threshold coloring
{
  "title": "Error Rate (SLO: <0.1%)",
  "type": "stat",
  "targets": [{
    "expr": "sum(rate(http_requests_total{status_code=~'5..'}[5m])) / sum(rate(http_requests_total[5m])) * 100",
    "legendFormat": "Error Rate %"
  }],
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {"color": "green", "value": null},
          {"color": "yellow", "value": 0.05},
          {"color": "red", "value": 0.1}
        ]
      },
      "unit": "percent"
    }
  }
}
Structured Logging
import structlog
import logging
# Configure structlog with JSON output
structlog.configure(
processors=[
structlog.contextvars.merge_contextvars,
structlog.processors.add_log_level,
structlog.processors.TimeStamper(fmt="iso"),
structlog.stdlib.add_logger_name,
structlog.processors.CallsiteParameterAdder([
structlog.processors.CallsiteParameter.FILENAME,
structlog.processors.CallsiteParameter.LINENO,
]),
structlog.processors.JSONRenderer()
],
wrapper_class=structlog.BoundLogger,
context_class=dict,
logger_factory=structlog.PrintLoggerFactory(),
)
log = structlog.get_logger()
# Always log with structured fields — never string interpolation
log.info("order.processed",
order_id="ord-123",
user_id="usr-456",
amount=99.99,
duration_ms=142,
trace_id=get_current_trace_id() # Correlate with traces!
)
# Output: {"event": "order.processed", "order_id": "ord-123", "level": "info",
# "timestamp": "2024-01-15T10:30:00Z", "trace_id": "abc123..."}
SLO-Based Alerting
# Prometheus alerting rule: multi-window burn rate
groups:
  - name: slo_alerts
    rules:
      # Fast burn: 2% budget consumed in 1 hour → page immediately
      - alert: SLOHighBurnRate
        expr: |
          (
            rate(http_requests_total{job="order-api",status_code=~"5.."}[1h])
            / rate(http_requests_total{job="order-api"}[1h])
          ) > (14.4 * 0.001) # 14.4x burn rate on 0.1% error budget
          and
          (
            rate(http_requests_total{job="order-api",status_code=~"5.."}[5m])
            / rate(http_requests_total{job="order-api"}[5m])
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Order API SLO burn rate critical"
          description: "Error rate {{ $value | humanizePercentage }} exceeds 14.4x burn rate"
          runbook_url: "https://wiki.example.com/runbooks/order-api-error-rate"

      # Slow burn: 5% budget consumed in 6 hours → ticket
      - alert: SLOLowBurnRate
        expr: |
          (
            rate(http_requests_total{job="order-api",status_code=~"5.."}[6h])
            / rate(http_requests_total{job="order-api"}[6h])
          ) > (6 * 0.001)
        for: 15m
        labels:
          severity: warning
Anti-Patterns
❌ Logging at DEBUG level in production — use sampling or dynamic log level adjustment
❌ High-cardinality label values in Prometheus (user IDs, order IDs as labels) — causes cardinality explosion
❌ Histograms with wrong bucket boundaries — buckets must bracket your SLO threshold
❌ Tracing 100% of requests — tail-based sampling keeps 100% of errors, sample the rest
❌ Threshold-based alerts instead of SLO-based — "CPU > 80%" tells you nothing about user impact
❌ Alerts without runbook URLs — an alert that fires without guidance causes MTTR inflation
❌ Ignoring cold start metrics — Lambda/Cloud Run p99 is dominated by cold starts; track separately
❌ String interpolation in log messages — log.info(f"processed {order_id}") is unsearchable
❌ Missing trace ID in logs — without this, metrics → logs → traces correlation is manual and slow
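The cardinality explosion is multiplicative: every new label multiplies the worst-case series count for a metric by its number of distinct values. A back-of-envelope check (the counts are illustrative):

```python
from math import prod

def worst_case_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_cardinalities.values())

# method x path x status_code: manageable
assert worst_case_series({"method": 5, "path": 50, "status_code": 5}) == 1250
# adding user_id as a label multiplies every existing series
assert worst_case_series({"method": 5, "path": 50, "status_code": 5,
                          "user_id": 100_000}) == 125_000_000
```

This is why user IDs and order IDs belong in logs and span attributes, never in metric labels.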
Quick Reference
Metric naming conventions:
<namespace>_<unit>_total (counter: requests_total, errors_total)
<namespace>_<unit>_bytes (gauge: memory_bytes, queue_bytes)
<namespace>_duration_seconds (histogram: http_request_duration_seconds)
<namespace>_<unit>_created (auto-created timestamp for counters)
PromQL cheat sheet:
rate(counter[5m]) → per-second rate over 5min window
increase(counter[1h]) → total increase over 1 hour
histogram_quantile(0.99, ...) → 99th percentile
avg_over_time(gauge[5m]) → average of gauge over time
topk(5, metric) → top 5 series by value
sum by (label) (metric) → aggregate by label
OTel span status guide:
UNSET → default, didn't set status (treated as OK by backends)
OK → explicitly mark successful (use sparingly)
ERROR → something went wrong (set description)
Use record_exception() for stack traces on unexpected errors
Sampling strategy:
Head-based: decision at trace start (fast, lose tail errors)
Tail-based: decision after trace complete (catches errors, needs collector buffer)
Parent-based: inherit parent's sampling decision (distributed systems default)