Metrics & Monitoring System

Metrics and monitoring provide observability through three pillars: metrics (counters, histograms, gauges), logs, and traces. Common stacks include Prometheus + Grafana for metrics, the ELK stack for logs, and Jaeger for traces. This article covers metric types, collection, storage, alerting, and SLI/SLO, with examples and a reference table.

Overview

  • Metric types: Counter (monotonic), Gauge (current value), Histogram (distribution and percentiles), Summary (client-side quantiles). Used for QPS, latency, error rate, connection count, and more.
  • Collection: Pull mode (Prometheus scrape) or push mode (StatsD, app push). Apps expose /metrics or use an agent to collect; see the sketch after this list.
  • Storage: Time-series databases (Prometheus, InfluxDB, VictoriaMetrics); query and aggregate by time range and labels.
  • Alerting: Rules (e.g. error_rate > 0.01) trigger alerts; Alertmanager deduplicates, groups, and routes to notification channels.
  • SLI/SLO: Service Level Indicators measure behavior; Service Level Objectives define targets (e.g. 99.9% availability, P99 < 200ms).
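
A minimal sketch of the above in pull mode, using the Python prometheus_client library: the application declares one metric of each core type and exposes them on /metrics for Prometheus to scrape. The port and metric names are illustrative.

Python
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# One metric of each core type; label names add dimensions.
requests_total = Counter("http_requests_total", "Total requests",
                         ["method", "path", "status"])
active_connections = Gauge("active_connections", "Current connection count")
latency = Histogram("http_request_duration_seconds", "Request latency")

start_http_server(8000)  # exposes /metrics on :8000 for scraping

requests_total.labels("GET", "/api/users", "200").inc()
latency.observe(0.042)   # record one request that took 42 ms

while True:
    time.sleep(1)        # keep the process alive so Prometheus can scrape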

Example

Example 1: Common metrics

Metric                          Type       Description
http_requests_total             Counter    Total request count
http_request_duration_seconds   Histogram  Latency distribution
active_connections              Gauge      Current connection count
error_rate                      Derived    errors / total requests
queue_depth                     Gauge      Current backlog size

Example 2: Golden signals

  • Latency: Request duration (P50, P99, P999). Use histograms or summaries.
  • Traffic: QPS, RPS. Use counters and rate().
  • Errors: Error rate, 5xx count. Use counters by status code.
  • Saturation: Queue depth, CPU, memory usage. Use gauges. All four signals are instrumented in the sketch below.
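
A minimal sketch instrumenting all four signals around a single handler with prometheus_client; the handler and metric names are illustrative, not from a specific framework.

Python
import time
from prometheus_client import Counter, Gauge, Histogram

REQUESTS = Counter("http_requests_total", "Requests", ["status"])        # traffic; errors = 5xx statuses
LATENCY = Histogram("http_request_duration_seconds", "Request latency")  # latency (P50/P99 via buckets)
IN_FLIGHT = Gauge("http_in_flight_requests", "Requests in progress")     # saturation

def handle(request):
    IN_FLIGHT.inc()
    start = time.time()
    status = "500"                 # assume failure until the work succeeds
    try:
        # ... real request handling would go here ...
        status = "200"
        return "ok"
    finally:
        REQUESTS.labels(status=status).inc()   # traffic and errors, by status code
        LATENCY.observe(time.time() - start)   # latency
        IN_FLIGHT.dec()                        # saturation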

Example 3: Prometheus metric format

Plain text
# Counter
http_requests_total{method="GET", path="/api/users", status="200"} 1523

# Histogram (cumulative buckets + count + sum; the +Inf bucket always equals _count)
http_request_duration_seconds_bucket{le="0.1"} 100
http_request_duration_seconds_bucket{le="0.5"} 250
http_request_duration_seconds_bucket{le="+Inf"} 300
http_request_duration_seconds_count 300
http_request_duration_seconds_sum 45.2

  • Labels add dimensions; avoid high-cardinality labels (e.g. userId) to prevent series explosion.

Example 4: Alert rule (Prometheus)

YAML
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Error rate above 1%"

  • rate() computes a per-second rate over the lookback window; for requires the condition to hold for that long before the alert fires; annotations provide human-readable context.

Example 5: Alert design best practices

  • Avoid alert storm: Aggregate by service; use grouping and inhibition; tier alerts (P0/P1/P2).
  • Actionable: Every alert should have a runbook or clear next step.
  • Avoid always/never firing: Tune thresholds; avoid rules that fire constantly or never.

Example 6: Cardinality

  • High-cardinality labels (userId, requestId) multiply the series count; see the arithmetic below. Prefer low-cardinality labels: service, instance, method, status.
  • Sample or aggregate high-cardinality data; store raw events in logs if needed.
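
A back-of-the-envelope check of the first point: cardinality is the product of each label's distinct-value count, so a single high-cardinality label dwarfs everything else. The counts below are illustrative.

Python
methods, statuses, instances = 5, 10, 20
print(methods * statuses * instances)             # 1,000 series: manageable

user_ids = 1_000_000                              # adding a userId label...
print(methods * statuses * instances * user_ids)  # 1,000,000,000 series: explosion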

Core Mechanism / Behavior

  • Counter: Only increases; use rate() or irate() for per-second values. Counters reset to zero on restart; rate() detects the decrease and compensates.
  • Histogram: Buckets define the distribution; percentiles are computed from the buckets (see the sketch after this list). Summary computes quantiles on the client and cannot be aggregated across instances.
  • Labels: Dimensions for filtering and grouping; each unique label combination is a separate time series. Cardinality = product of label value counts.
  • Retention: Prometheus defaults to ~15 days; long-term storage (Thanos, Cortex) extends retention.
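
A sketch of how a percentile is estimated from cumulative buckets: find the bucket containing the target rank and interpolate linearly inside it (the same idea behind PromQL's histogram_quantile). The bucket values are taken from Example 3.

Python
def bucket_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    total = buckets[-1][1]        # the +Inf bucket carries the total count
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into an unbounded bucket
            # assume observations are evenly spread within the bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

buckets = [(0.1, 100), (0.5, 250), (float("inf"), 300)]
print(bucket_quantile(0.5, buckets))  # ~0.233s: rank 150 falls inside the 0.1-0.5 bucket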

SLI and SLO

  • SLI: A measurable indicator (e.g. "percentage of requests with status 2xx").
  • SLO: Target for the SLI (e.g. "99.9% of requests succeed").
  • Error budget: 100% - SLO; e.g. 99.9% leaves a 0.1% "budget" for failures (worked out below). Use it to decide when to pause releases or invest in reliability.
  • SLA: Contract with users; SLO is internal target; SLA may include penalties.
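
A small worked example of the budget math, assuming a 99.9% availability SLO over a 30-day window.

Python
slo = 0.999
window_minutes = 30 * 24 * 60          # 43,200 minutes in 30 days
print(window_minutes * (1 - slo))      # 43.2 minutes of allowed downtime

# Fraction of the budget burned, from observed request counts:
good, total = 9_995_000, 10_000_000
sli = good / total                     # 0.9995
print((1 - sli) / (1 - slo))           # 0.5: half the budget is spent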

Prometheus and Grafana in Practice

  • Prometheus: Pull-based; scrapes /metrics from targets. Define scrape config (targets, interval). Use service discovery (K8s, Consul) for dynamic targets. Retention is typically 15 days; use remote write (Thanos, Cortex) for long-term.
  • Grafana: Query Prometheus (or other backends) and build dashboards with panels (graph, table, stat). Use variables for service/instance selection. Share dashboards and set up alerting from Grafana or Prometheus; the query API Grafana calls is sketched after this list.
  • Recording rules: Pre-compute expensive queries (e.g. rate, aggregation) to reduce query load and speed dashboards.
  • Metrics vs logs: Metrics for numbers (counts, histograms); logs for events and context. Do not put high-cardinality data in metrics.
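
A minimal sketch of calling Prometheus's query API directly, the same /api/v1/query endpoint Grafana panels hit; it assumes a Prometheus server on localhost:9090 and reuses the counter from Example 1.

Python
import requests

resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": 'sum(rate(http_requests_total[5m])) by (status)'},
)
for series in resp.json()["data"]["result"]:
    # each result carries the label set and an [unix_time, "value"] pair
    print(series["metric"], series["value"])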

Key Rules

  • Metrics need dimensions (service, instance, endpoint, status) for filtering and aggregation.
  • Cardinality: Avoid high-cardinality labels; sample or aggregate when necessary.
  • Alerts must be actionable; each alert should have a runbook.
  • SLI/SLO: Define what you measure and your target; use error budget for release and reliability decisions.

What's Next

See Distributed Tracing and Java Profiling. See Slow Query and GC for performance monitoring.