Metrics & Monitoring System
Monitoring provides observability through three pillars: metrics (counters, histograms, gauges), logs, and traces. Common stacks include Prometheus + Grafana, the ELK stack, and Jaeger. This article explains metric types, collection, storage, alerting, and SLI/SLO, with examples and a reference table.
Overview
- Metric types: Counter (monotonic), Gauge (current value), Histogram (distribution and percentiles), Summary (client-side percentiles). Used for QPS, latency, error rate, connection count, and more.
- Collection: Pull mode (Prometheus scrape) or push mode (StatsD, app push). Apps expose /metrics or use an agent to collect; see the scrape-config sketch after this list.
- Storage: Time-series databases (Prometheus, InfluxDB, VictoriaMetrics); query and aggregate by time range and labels.
- Alerting: Rules (e.g. error_rate > 0.01) trigger alerts; Alertmanager deduplicates, groups, and routes to notification channels.
- SLI/SLO: Service Level Indicators measure behavior; Service Level Objectives define targets (e.g. 99.9% availability, P99 < 200ms).
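As a sketch of the pull model described in the Collection bullet above, a minimal Prometheus scrape config could look like the following; the job name and target addresses are hypothetical:

```yaml
scrape_configs:
  - job_name: "api"                # hypothetical job covering the API instances
    scrape_interval: 15s           # how often Prometheus pulls /metrics
    static_configs:
      - targets: ["api-1:8080", "api-2:8080"]   # hypothetical hosts exposing /metrics
```

In practice the static target list is usually replaced by service discovery, covered in the Prometheus and Grafana section below.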
Examples
Example 1: Common metrics
| Metric | Type | Description |
|---|---|---|
| http_requests_total | Counter | Total request count |
| http_request_duration_seconds | Histogram | Latency distribution |
| active_connections | Gauge | Current connection count |
| error_rate | Derived | errors / total requests |
| queue_depth | Gauge | Current backlog size |
Example 2: Golden signals
- Latency: Request duration (P50, P99, P999). Use histograms or summaries.
- Traffic: QPS, RPS. Use counters and rate().
- Errors: Error rate, 5xx count. Use counters by status code.
- Saturation: Queue depth, CPU, memory usage. Use gauges.
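As a sketch, one PromQL query per signal, assuming the metric names from the Example 1 table:

```promql
# Latency: P99 from histogram buckets, aggregated across instances
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Traffic: overall QPS
sum(rate(http_requests_total[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Saturation: worst per-instance queue backlog
max by (instance) (queue_depth)
```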
Example 3: Prometheus metric format
```
# Counter
http_requests_total{method="GET", path="/api/users", status="200"} 1523

# Histogram (buckets + count + sum; the +Inf bucket always equals the count)
http_request_duration_seconds_bucket{le="0.1"} 100
http_request_duration_seconds_bucket{le="0.5"} 250
http_request_duration_seconds_bucket{le="+Inf"} 300
http_request_duration_seconds_count 300
http_request_duration_seconds_sum 45.2
```
- Labels add dimensions; avoid high-cardinality labels (e.g. userId) to prevent series explosion.
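To make the bucket arithmetic concrete, here is how a quantile falls out of the counts above (a sketch; in practice histogram_quantile is fed rate()-d buckets, and interpolation within a bucket is linear):

```promql
# P50 of the 300 observations above is the 150th observation.
# 100 observations are <= 0.1s, so the median lies in the (0.1, 0.5] bucket:
#   0.1 + (150 - 100) / (250 - 100) * (0.5 - 0.1) ≈ 0.233s
histogram_quantile(0.5, rate(http_request_duration_seconds_bucket[5m]))
```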
Example 4: Alert rule (Prometheus)
```yaml
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1%"
```
- rate() computes a per-second rate over the window; for requires the condition to hold for the full duration before the alert fires; annotations provide human-readable context.
Example 5: Alert design best practices
- Avoid alert storms: Aggregate by service; use grouping and inhibition; tier alerts (P0/P1/P2).
- Actionable: Every alert should have a runbook or clear next step.
- Avoid always/never firing: Tune thresholds; avoid rules that fire constantly or never.
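A minimal Alertmanager sketch of the grouping and inhibition mentioned above; receiver names are placeholders, and the matchers syntax assumes a recent Alertmanager:

```yaml
route:
  receiver: default-channel             # placeholder receiver
  group_by: ["alertname", "service"]    # one notification per group, not per alert
  group_wait: 30s
  repeat_interval: 4h
  routes:
    - matchers: ['severity="critical"'] # tiering: critical alerts page
      receiver: pager                   # placeholder P0 channel
inhibit_rules:
  - source_matchers: ['severity="critical"']  # while a critical alert fires...
    target_matchers: ['severity="warning"']   # ...suppress related warnings
    equal: ["service"]                        # ...for the same service
receivers:
  - name: default-channel
  - name: pager
```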
Example 6: Cardinality
- High-cardinality labels (userId, requestId) multiply series count. Prefer low-cardinality labels: service, instance, method, status.
- Sample or aggregate high-cardinality data; store raw events in logs if needed.
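A few inspection queries help locate where series count comes from; a sketch (the second query touches every series, so run it sparingly):

```promql
# Series count behind one metric name
count(http_requests_total)

# Top 10 metric names by series count (expensive: scans all series)
topk(10, count by (__name__) ({__name__=~".+"}))

# Distinct values contributed by a single label
count(count by (path) (http_requests_total))
```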
Core Mechanism / Behavior
- Counter: Only increases; use rate() or irate() for per-second values (see the sketch after this list). Counter resets require care in rate calculations.
- Histogram: Buckets define the distribution; percentiles are computed from buckets. Summary computes percentiles on the client; it cannot be aggregated across instances.
- Labels: Dimensions for filtering and grouping; each unique label combination is a time series. Cardinality = product of label value counts.
- Retention: Prometheus defaults ~15 days; long-term storage (Thanos, Cortex) extends retention.
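A short sketch of the rate() vs irate() distinction from the Counter bullet above:

```promql
# rate(): average per-second increase over the window; smooth and reset-aware
rate(http_requests_total[5m])

# irate(): slope of the last two samples in the window; responsive but noisy
irate(http_requests_total[5m])
```

Both functions detect a counter dropping (a reset after a restart) and compensate, which is why raw deltas on counters are avoided.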
SLI and SLO
- SLI: A measurable indicator (e.g. "percentage of requests with status 2xx").
- SLO: Target for the SLI (e.g. "99.9% of requests succeed").
- Error budget: 100% - SLO; e.g. 99.9% leaves 0.1% "budget" for failures. Use it to decide when to pause releases or invest in reliability.
- SLA: Contract with users; SLO is internal target; SLA may include penalties.
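Expressed in PromQL, assuming 2xx responses define success (long windows such as 30d are expensive to evaluate; recording rules help):

```promql
# SLI: fraction of successful requests over a 30-day window
sum(rate(http_requests_total{status=~"2.."}[30d])) / sum(rate(http_requests_total[30d]))

# Burn rate against a 99.9% SLO: error ratio divided by the 0.1% budget.
# Values above 1 mean the error budget is being spent faster than allowed.
(1 - (sum(rate(http_requests_total{status=~"2.."}[1h])) / sum(rate(http_requests_total[1h])))) / 0.001
```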
Prometheus and Grafana in Practice
- Prometheus: Pull-based; scrapes /metrics from targets. Define scrape config (targets, interval). Use service discovery (K8s, Consul) for dynamic targets. Retention is typically 15 days; use remote write (Thanos, Cortex) for long-term storage.
- Grafana: Query Prometheus (or other backends); build dashboards with panels (graph, table, stat). Use variables for service/instance selection. Share dashboards and set up alerting from Grafana or Prometheus.
- Recording rules: Pre-compute expensive queries (e.g. rate, aggregation) to reduce query load and speed up dashboards; see the sketch after this list.
- Labels vs logs: Metrics for numbers (counts, histograms); logs for events and context. Do not put high-cardinality data in metrics.
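A sketch of the recording-rules file described above, using the conventional level:metric:operations naming:

```yaml
groups:
  - name: api-precomputed
    interval: 30s
    rules:
      # Cheap, pre-aggregated per-service QPS for dashboards to query
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))
      # Pre-computed per-service P99 latency
      - record: service:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
```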
Key Rules
- Metrics need dimensions (service, instance, endpoint, status) for filtering and aggregation.
- Cardinality: Avoid high-cardinality labels; sample or aggregate when necessary.
- Alerts must be actionable; each alert should have a runbook.
- SLI/SLO: Define what you measure and your target; use error budget for release and reliability decisions.
What's Next
See Distributed Tracing and Java Profiling; for performance monitoring, see Slow Query and GC.