Metrics & Monitoring System

Metrics and monitoring provide observability through three pillars: metrics (counters, histograms, gauges), logs, and traces. Common stacks include Prometheus + Grafana for metrics, the ELK stack for logs, and Jaeger for traces. This article covers metric types, collection, storage, alerting, and SLI/SLO, with examples and a reference table.

Overview

  • Metric types: Counter (monotonic), Gauge (current value), Histogram (distribution and percentiles), Summary (client-side quantiles). Used for QPS, latency, error rate, connection count, and more.
  • Collection: Pull mode (Prometheus scrape) or push mode (StatsD, app push). Apps expose /metrics or use an agent to collect; see the sketch after this list.
  • Storage: Time-series databases (Prometheus, InfluxDB, VictoriaMetrics); query and aggregate by time range and labels.
  • Alerting: Rules (e.g. error_rate > 0.01) trigger alerts; Alertmanager deduplicates, groups, and routes to notification channels.
  • SLI/SLO: Service Level Indicators measure behavior; Service Level Objectives define targets (e.g. 99.9% availability, P99 < 200ms).
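
A minimal sketch of the above in pull mode, using the Python prometheus_client library: the application declares one metric of each core type and exposes them on /metrics for Prometheus to scrape. The port and metric names are illustrative.

Python
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# One metric of each core type; label names add dimensions.
requests_total = Counter("http_requests_total", "Total requests",
                         ["method", "path", "status"])
active_connections = Gauge("active_connections", "Current connection count")
latency = Histogram("http_request_duration_seconds", "Request latency")

start_http_server(8000)  # exposes /metrics on :8000 for scraping

requests_total.labels("GET", "/api/users", "200").inc()
latency.observe(0.042)   # record one request that took 42 ms

while True:
    time.sleep(1)        # keep the process alive so Prometheus can scrape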

Example

Example 1: Common metrics

Metric                          Type       Description
http_requests_total             Counter    Total request count
http_request_duration_seconds   Histogram  Latency distribution
active_connections              Gauge      Current connection count
error_rate                      Derived    errors / total requests
queue_depth                     Gauge      Current backlog size

Example 2: Golden signals

  • Latency: Request duration (P50, P99, P999). Use histograms or summaries.
  • Traffic: QPS, RPS. Use counters and rate().
  • Errors: Error rate, 5xx count. Use counters by status code.
  • Saturation: Queue depth, CPU, memory usage. Use gauges. All four signals are instrumented in the sketch below.
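
A minimal sketch instrumenting all four signals around a single handler with prometheus_client; the handler and metric names are illustrative, not from a specific framework.

Python
import time
from prometheus_client import Counter, Gauge, Histogram

REQUESTS = Counter("http_requests_total", "Requests", ["status"])        # traffic; errors = 5xx statuses
LATENCY = Histogram("http_request_duration_seconds", "Request latency")  # latency (P50/P99 via buckets)
IN_FLIGHT = Gauge("http_in_flight_requests", "Requests in progress")     # saturation

def handle(request):
    IN_FLIGHT.inc()
    start = time.time()
    status = "500"                 # assume failure until the work succeeds
    try:
        # ... real request handling would go here ...
        status = "200"
        return "ok"
    finally:
        REQUESTS.labels(status=status).inc()   # traffic and errors, by status code
        LATENCY.observe(time.time() - start)   # latency
        IN_FLIGHT.dec()                        # saturation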

Example 3: Prometheus metric format

Plain text
# Counter
http_requests_total{method="GET", path="/api/users", status="200"} 1523

# Histogram (cumulative buckets + count + sum; the +Inf bucket always equals _count)
http_request_duration_seconds_bucket{le="0.1"} 100
http_request_duration_seconds_bucket{le="0.5"} 250
http_request_duration_seconds_bucket{le="+Inf"} 300
http_request_duration_seconds_count 300
http_request_duration_seconds_sum 45.2

  • Labels add dimensions; avoid high-cardinality labels (e.g. userId) to prevent series explosion.

Example 4: Alert rule (Prometheus)

YAML
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Error rate above 1%"

  • rate() computes a per-second rate over the lookback window; for requires the condition to hold for that long before the alert fires; annotations provide human-readable context.

Example 5: Alert design best practices

  • Avoid alert storm: Aggregate by service; use grouping and inhibition; tier alerts (P0/P1/P2).
  • Actionable: Every alert should have a runbook or clear next step.
  • Avoid always/never firing: Tune thresholds; avoid rules that fire constantly or never.

Example 6: Cardinality

  • High-cardinality labels (userId, requestId) multiply the series count; see the arithmetic below. Prefer low-cardinality labels: service, instance, method, status.
  • Sample or aggregate high-cardinality data; store raw events in logs if needed.
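
A back-of-the-envelope check of the first point: cardinality is the product of each label's distinct-value count, so a single high-cardinality label dwarfs everything else. The counts below are illustrative.

Python
methods, statuses, instances = 5, 10, 20
print(methods * statuses * instances)             # 1,000 series: manageable

user_ids = 1_000_000                              # adding a userId label...
print(methods * statuses * instances * user_ids)  # 1,000,000,000 series: explosion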

Core Mechanism / Behavior

  • Counter: Only increases; use rate() or irate() for per-second values. Counters reset to zero on restart; rate() detects the decrease and compensates.
  • Histogram: Buckets define the distribution; percentiles are computed from the buckets (see the sketch after this list). Summary computes quantiles on the client and cannot be aggregated across instances.
  • Labels: Dimensions for filtering and grouping; each unique label combination is a separate time series. Cardinality = product of label value counts.
  • Retention: Prometheus defaults to ~15 days; long-term storage (Thanos, Cortex) extends retention.
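
A sketch of how a percentile is estimated from cumulative buckets: find the bucket containing the target rank and interpolate linearly inside it (the same idea behind PromQL's histogram_quantile). The bucket values are taken from Example 3.

Python
def bucket_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    total = buckets[-1][1]        # the +Inf bucket carries the total count
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into an unbounded bucket
            # assume observations are evenly spread within the bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

buckets = [(0.1, 100), (0.5, 250), (float("inf"), 300)]
print(bucket_quantile(0.5, buckets))  # ~0.233s: rank 150 falls inside the 0.1-0.5 bucket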

SLI and SLO

  • SLI: A measurable indicator (e.g. "percentage of requests with status 2xx").
  • SLO: Target for the SLI (e.g. "99.9% of requests succeed").
  • Error budget: 100% - SLO; e.g. 99.9% leaves a 0.1% "budget" for failures (worked out below). Use it to decide when to pause releases or invest in reliability.
  • SLA: Contract with users; SLO is internal target; SLA may include penalties.
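
A small worked example of the budget math, assuming a 99.9% availability SLO over a 30-day window.

Python
slo = 0.999
window_minutes = 30 * 24 * 60          # 43,200 minutes in 30 days
print(window_minutes * (1 - slo))      # 43.2 minutes of allowed downtime

# Fraction of the budget burned, from observed request counts:
good, total = 9_995_000, 10_000_000
sli = good / total                     # 0.9995
print((1 - sli) / (1 - slo))           # 0.5: half the budget is spent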

Prometheus and Grafana in Practice

  • Prometheus: Pull-based; scrapes /metrics from targets. Define scrape config (targets, interval). Use service discovery (K8s, Consul) for dynamic targets. Retention is typically 15 days; use remote write (Thanos, Cortex) for long-term.
  • Grafana: Query Prometheus (or other backends) and build dashboards with panels (graph, table, stat). Use variables for service/instance selection. Share dashboards and set up alerting from Grafana or Prometheus; the query API Grafana calls is sketched after this list.
  • Recording rules: Pre-compute expensive queries (e.g. rate, aggregation) to reduce query load and speed dashboards.
  • Metrics vs logs: Metrics for numbers (counts, histograms); logs for events and context. Do not put high-cardinality data in metrics.
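
A minimal sketch of calling Prometheus's query API directly, the same /api/v1/query endpoint Grafana panels hit; it assumes a Prometheus server on localhost:9090 and reuses the counter from Example 1.

Python
import requests

resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": 'sum(rate(http_requests_total[5m])) by (status)'},
)
for series in resp.json()["data"]["result"]:
    # each result carries the label set and an [unix_time, "value"] pair
    print(series["metric"], series["value"])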

Key Rules

  • Metrics need dimensions (service, instance, endpoint, status) for filtering and aggregation.
  • Cardinality: Avoid high-cardinality labels; sample or aggregate when necessary.
  • Alerts must be actionable; each alert should have a runbook.
  • SLI/SLO: Define what you measure and your target; use error budget for release and reliability decisions.

What's Next

See Distributed Tracing and Java Profiling. See Slow Query and GC for performance monitoring.