Distributed Tracing Basics

Distributed tracing records the call path of a request across services for latency and fault diagnosis. Common implementations: Jaeger, Zipkin, SkyWalking. This article explains Trace, Span, sampling, and instrumentation with a reference table.

Overview

  • Trace: The full call tree for one request, identified by a unique TraceId.
  • Span: A node in the trace, representing one unit of work such as an in-process operation or an RPC call. Each span has a SpanId, ParentSpanId, start time, duration, and tags.
  • Context propagation: TraceId and SpanId are passed in HTTP headers or RPC context so that spans from different services link into one chain.
  • Sampling: Tracing every request is costly, so production systems usually record only a fraction (e.g. 1% or 10%) or favor error and slow requests.
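
The fields listed above map naturally onto a small record type. The following is a minimal sketch, not the data model of any particular tracer: the Span class, the new_id helper, and the ID sizes (16-byte trace IDs, 8-byte span IDs) are assumptions chosen for illustration.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


def new_id(num_bytes: int) -> str:
    """Return a random lowercase hex identifier (hypothetical helper)."""
    return secrets.token_hex(num_bytes)


@dataclass
class Span:
    trace_id: str                          # shared by every span in one trace
    name: str
    span_id: str = field(default_factory=lambda: new_id(8))
    parent_span_id: Optional[str] = None   # None marks the root span
    start_time: float = field(default_factory=time.time)
    duration_ms: Optional[float] = None    # filled in when the span ends
    tags: Dict[str, str] = field(default_factory=dict)

    def child(self, name: str) -> "Span":
        """Create a child span that belongs to the same trace."""
        return Span(trace_id=self.trace_id, name=name, parent_span_id=self.span_id)


root = Span(trace_id=new_id(16), name="gateway")
rpc = root.child("order-service")
rpc.tags["http.method"] = "POST"
```

The child span reuses the parent's TraceId and stores the parent's SpanId as its ParentSpanId, which is all a backend needs to rebuild the tree.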

Example

Example 1: Call chain

TraceId: abc123
  Span1: gateway (10ms)
    Span2: order-service (8ms)
      Span3: user-service RPC (5ms)
      Span4: db query (3ms)
    Span5: inventory-service RPC (6ms)
  • The tree shows the total time, each step's share of it, and which service is slow.
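
A tracing UI typically receives spans as a flat list and rebuilds this tree from the parent IDs. The sketch below shows one way to do that with the numbers from the example above. The tuple layout and the render function are assumptions for illustration.

```python
from collections import defaultdict

# (span_id, parent_span_id, name, duration_ms), copied from the example above
SPANS = [
    ("1", None, "gateway", 10),
    ("2", "1", "order-service", 8),
    ("3", "2", "user-service RPC", 5),
    ("4", "2", "db query", 3),
    ("5", "1", "inventory-service RPC", 6),
]


def render(spans):
    """Return indented lines with each span's share of the root duration."""
    children = defaultdict(list)
    for span in spans:
        children[span[1]].append(span)   # group spans by their parent ID
    root = children[None][0]             # the span without a parent is the root
    total = root[3]
    lines = []

    def walk(span, depth):
        span_id, _, name, ms = span
        lines.append(f"{'  ' * depth}{name} ({ms}ms, {ms / total:.0%})")
        for child in children[span_id]:
            walk(child, depth + 1)

    walk(root, 0)
    return lines


print("\n".join(render(SPANS)))
```

Each share is measured against the root span, so sibling shares can add up to more than 100% when calls overlap in time.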

Example 2: Propagation

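  • HTTP: custom headers such as X-Trace-Id, X-Span-Id, and X-Parent-Span-Id, or a standard format such as W3C Trace Context (traceparent) or B3 (X-B3-TraceId, X-B3-SpanId).
  • Dubbo: via RpcContext attachments, written and read in a Filter.
  • Frameworks: Spring Cloud Sleuth, OpenTelemetry, Brave, and similar libraries handle propagation automatically.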
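
To make header-based propagation concrete, here is a minimal sketch that uses the custom header names from the list above. The inject and extract functions are hypothetical helpers, not the API of any framework; headers are modeled as a plain dict.

```python
import secrets
from typing import Dict, Optional, Tuple

TRACE_HEADER = "X-Trace-Id"
SPAN_HEADER = "X-Span-Id"
PARENT_HEADER = "X-Parent-Span-Id"


def inject(trace_id: str, span_id: str, parent_span_id: Optional[str]) -> Dict[str, str]:
    """Build the headers to attach to an outgoing request."""
    headers = {TRACE_HEADER: trace_id, SPAN_HEADER: span_id}
    if parent_span_id:
        headers[PARENT_HEADER] = parent_span_id
    return headers


def extract(headers: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Read the incoming context and return (trace_id, caller_span_id).

    A new trace ID is generated when the caller sent none.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    trace_id = lowered.get(TRACE_HEADER.lower()) or secrets.token_hex(16)
    return trace_id, lowered.get(SPAN_HEADER.lower())


# Caller side: attach the current context to the request.
outgoing = inject("abc123", "span-2", "span-1")
# Callee side: the caller's span becomes the parent of the callee's new span.
trace_id, parent_id = extract(outgoing)
```

The callee keeps the TraceId unchanged and uses the caller's SpanId as the ParentSpanId of the span it creates; that pair of values is what links the two processes.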

Example 3: Key concepts

Concept        Description
TraceId        Unique for the whole chain
SpanId         Unique for current node
ParentSpanId   Parent node for building tree
Sampling       Controls volume and cost

Example 4: Sampling strategies

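  • Fixed-rate (probabilistic): sample a constant fraction of requests, e.g. 10%.
  • Rate-limited: cap the volume, e.g. at 100 traces/second.
  • Priority-based: sample errors and slow requests at a higher rate. This usually means deciding after the request completes (tail-based sampling), since the outcome is not known at the start.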
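
The first two strategies can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not how any specific tracer implements its samplers: the class names are hypothetical, the fixed-fraction decision hashes the trace ID with SHA-256, and the rate limit uses a simple token bucket.

```python
import hashlib
import time


class ProbabilisticSampler:
    """Keep a fixed fraction of traces, decided from the trace ID."""

    def __init__(self, rate: float):
        self.threshold = int(rate * 2**64)

    def should_sample(self, trace_id: str) -> bool:
        # Hashing the ID means every service reaches the same decision
        # for the same trace without any coordination.
        digest = hashlib.sha256(trace_id.encode()).digest()
        return int.from_bytes(digest[:8], "big") < self.threshold


class RateLimitedSampler:
    """Keep at most `per_second` traces per second using a token bucket."""

    def __init__(self, per_second: float, clock=time.monotonic):
        self.per_second = per_second
        self.clock = clock
        self.tokens = per_second   # start with a full bucket
        self.last = clock()

    def should_sample(self, trace_id: str) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        self.tokens = min(self.per_second,
                          self.tokens + (now - self.last) * self.per_second)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


ten_percent = ProbabilisticSampler(0.10)
hundred_per_second = RateLimitedSampler(100)
```

Deciding from the trace ID rather than from a fresh random number keeps traces whole: a trace is either recorded by every service or by none.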

Core Mechanism / Behavior

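  • Instrumentation: Either automatic (an agent or SDK hooks into frameworks) or manual (code starts and ends spans and adds tags). In both cases the context must be propagated across process boundaries.
  • Export: Finished spans are sent to a collector, stored in a backend such as Elasticsearch or Cassandra, and queried through a UI such as Jaeger or Zipkin.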

Key Rules

  • Full chain propagation: Every hop (gateway, services, MQ, DB access) must participate; a single hop that drops the context breaks the chain.
  • Sampling: Use sampling in production, and raise the rate for errors and slow requests so those traces are available for debugging.
  • Storage and query: Trace volume is high, so choose storage that can handle the write load and a UI suited to analysis.
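
The sampling rule above can be expressed as a decision made once a trace has finished. This is a minimal sketch: the keep_trace function, the 500 ms slow threshold, and the 1% base rate are all assumed values for illustration, not recommendations.

```python
import random


def keep_trace(has_error: bool, duration_ms: float, *,
               slow_ms: float = 500.0, base_rate: float = 0.01,
               rng=random.random) -> bool:
    """Always keep failed or slow traces; keep the rest at the base rate."""
    if has_error or duration_ms >= slow_ms:
        return True
    return rng() < base_rate


print(keep_trace(True, 12.0))     # failed request: always kept
print(keep_trace(False, 830.0))   # slow request: always kept
```

Because the decision depends on the outcome, spans have to be buffered until the request completes, which trades some memory in the collector for much better coverage of the traces that matter.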

What's Next

See API Gateway and RPC for more on the call chain, and Metrics/Monitoring for the wider observability picture.