Distributed Tracing Basics

Distributed tracing records the path a request takes across services so that latency and faults can be diagnosed. Common implementations include Jaeger, Zipkin, and SkyWalking. This article explains traces, spans, sampling, and instrumentation, and includes a reference table of the key concepts.

Overview

Trace: The full call tree for one request, identified by a unique TraceId.
Span: A node in the trace, representing one unit of work such as an in-process operation or a single RPC call. Each span has a SpanId, a ParentSpanId, a start time, a duration, and tags.
Context propagation: The TraceId and SpanId are passed in HTTP headers or the RPC context so that spans from different processes join the same trace.
Sampling: Tracing every request is costly, so production systems usually record only a fraction (e.g. 1% or 10%) and may additionally keep requests that fail or are slow.
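
The fields listed above can be sketched as a small data type. This is a minimal illustration, not the data model of Jaeger, Zipkin, or SkyWalking; the names Span, new_id, and child, and the ID lengths, are assumptions made for the example.

```python
import secrets
from dataclasses import dataclass, field
from typing import Dict, Optional


def new_id(num_bytes: int = 8) -> str:
    """Return a random hex identifier (illustrative; real tracers define their own ID sizes)."""
    return secrets.token_hex(num_bytes)


@dataclass
class Span:
    trace_id: str                          # shared by every span of one request
    name: str                              # e.g. a service or operation name
    span_id: str = field(default_factory=new_id)
    parent_span_id: Optional[str] = None   # None marks the root span
    start_ms: float = 0.0
    duration_ms: float = 0.0
    tags: Dict[str, str] = field(default_factory=dict)

    def child(self, name: str) -> "Span":
        """Create a child span that inherits the TraceId and points back at this span."""
        return Span(trace_id=self.trace_id, name=name, parent_span_id=self.span_id)


root = Span(trace_id=new_id(16), name="gateway")
rpc = root.child("order-service")
```

The child shares the root's TraceId and stores the root's SpanId as its ParentSpanId, which is all a backend needs to rebuild the tree.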

Example

Example 1: Call chain

TraceId: abc123
  Span1: gateway (10ms)
    Span2: order-service (8ms)
      Span3: user-service RPC (5ms)
      Span4: db query (3ms)
    Span5: inventory-service RPC (6ms)

This view shows the total time for the request, how much of it each step accounts for, and which service is slow.
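
A tracing UI rebuilds this tree from a flat list of spans by following ParentSpanId. The sketch below does that for the trace above and prints each span's share of the root duration. The tuple layout and the render function are assumptions made for the example.

```python
from collections import defaultdict

# (span_id, parent_span_id, name, duration_ms), taken from the trace above
SPANS = [
    ("1", None, "gateway", 10),
    ("2", "1", "order-service", 8),
    ("3", "2", "user-service RPC", 5),
    ("4", "2", "db query", 3),
    ("5", "1", "inventory-service RPC", 6),
]


def render(spans):
    """Return indented lines with each span's duration and its share of the root's time."""
    children = defaultdict(list)
    for span in spans:
        children[span[1]].append(span)   # group spans by their parent id
    root = children[None][0]             # the span without a parent is the root
    total = root[3]
    lines = []

    def walk(span, depth):
        span_id, _, name, duration = span
        share = 100 * duration / total
        lines.append(f"{'  ' * depth}{name}: {duration}ms ({share:.0f}%)")
        for child in children[span_id]:
            walk(child, depth + 1)

    walk(root, 0)
    return lines


print("\n".join(render(SPANS)))
```

Shares are relative to the root span, and a child's time is already counted inside its parent's, so the percentages are not expected to add up to 100%.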

Example 2: Propagation

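HTTP: The identifiers travel in request headers such as X-Trace-Id, X-Span-Id, and X-Parent-Span-Id. Standardized formats include the W3C Trace Context traceparent header and Zipkin's B3 headers (X-B3-TraceId, X-B3-SpanId).
Dubbo: The identifiers travel as RpcContext attachments, typically written and read in a Filter.
Frameworks: Spring Cloud Sleuth, OpenTelemetry, Brave, and similar libraries handle propagation automatically.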
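
For HTTP, propagation comes down to two steps: the caller writes its identifiers into the outgoing headers, and the callee reads them back before starting its own span. The sketch below uses the header names from this example; the inject and extract functions and the ID lengths are assumptions, not the API of any of the frameworks listed.

```python
import secrets
from typing import Dict, Optional, Tuple

TRACE_HEADER = "X-Trace-Id"
SPAN_HEADER = "X-Span-Id"
PARENT_HEADER = "X-Parent-Span-Id"


def inject(trace_id: str, span_id: str, parent_span_id: Optional[str]) -> Dict[str, str]:
    """Caller side: copy the current span's identifiers into outgoing headers."""
    headers = {TRACE_HEADER: trace_id, SPAN_HEADER: span_id}
    if parent_span_id:
        headers[PARENT_HEADER] = parent_span_id
    return headers


def extract(headers: Dict[str, str]) -> Tuple[str, str, Optional[str]]:
    """Callee side: continue the caller's trace, or start a new one if no context arrived.

    Returns (trace_id, span_id, parent_span_id) for the span the callee is about to start.
    """
    lowered = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    trace_id = lowered.get(TRACE_HEADER.lower()) or secrets.token_hex(16)
    caller_span_id = lowered.get(SPAN_HEADER.lower())     # becomes the new span's parent
    return trace_id, secrets.token_hex(8), caller_span_id


outgoing = inject("abc123", "span-1", None)
trace_id, span_id, parent_span_id = extract(outgoing)
```

The callee keeps the TraceId, generates a fresh SpanId for itself, and records the caller's SpanId as its parent. If no headers arrive, it starts a new trace, which is how a hop that drops the headers splits one request into two unrelated traces.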

Example 3: Key concepts

Concept	Description
TraceId	Unique for the whole chain
SpanId	Unique for current node
ParentSpanId	Parent node for building tree
Sampling	Controls volume and cost

Example 4: Sampling strategies

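Probabilistic (fixed rate): sample a fixed fraction, e.g. 10% of requests.
Rate-limited: cap the volume, e.g. at most 100 traces/second.
Error- and latency-biased: sample errors and slow requests at a higher rate. Because the outcome is known only once the request has finished, this decision is usually made at the end of the trace (tail-based sampling).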
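
The three strategies can be sketched as follows: a fixed-fraction sampler, a rate limiter built on a token bucket, and a rule that keeps errors and slow requests regardless of the base rate. All class and function names, and the 500ms threshold, are assumptions chosen for the example rather than settings from a particular tracer.

```python
import random
import time
from typing import Optional


class FractionSampler:
    """Keep roughly `rate` of all traces, e.g. rate=0.1 for 10%."""

    def __init__(self, rate: float, rng: Optional[random.Random] = None):
        self.rate = rate
        self.rng = rng or random.Random()

    def should_sample(self) -> bool:
        return self.rng.random() < self.rate


class RateLimitedSampler:
    """Keep at most about `per_second` traces per second using a token bucket."""

    def __init__(self, per_second: float, clock=time.monotonic):
        self.per_second = per_second
        self.clock = clock
        self.tokens = per_second          # start with a full bucket
        self.last = clock()

    def should_sample(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond the bucket size.
        self.tokens = min(self.per_second, self.tokens + (now - self.last) * self.per_second)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def keep_trace(is_error: bool, duration_ms: float, base: FractionSampler,
               slow_ms: float = 500) -> bool:
    """Always keep errors and slow requests; otherwise defer to the base sampler."""
    return is_error or duration_ms >= slow_ms or base.should_sample()
```

keep_trace needs the request's outcome and duration, so it can only run once the request has completed, whereas the first two samplers can decide at the start of the request.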

Core Mechanism / Behavior

Instrumentation: Either automatic (an agent or an auto-instrumenting SDK) or manual (start and end spans in code, add tags). In both cases the context must be propagated across process boundaries.
Export: Finished spans are sent to a collector, stored (e.g. in Elasticsearch or Cassandra), and queried through a UI (Jaeger, Zipkin).

Key Rules

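Full chain propagation: The gateway, every service, and MQ producers and consumers must all pass the context on, and DB calls should be recorded as spans by the calling service; a missing hop breaks the chain.
Sampling: Sample in production to control cost; raise the rate for errors and slow requests so they remain available for debugging.
Storage and query: Trace volume is high, so the storage must scale with it and the UI must support search and analysis.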
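
One visible symptom of a broken chain is a span whose ParentSpanId matches nothing in the stored trace. A simple check for that, assuming spans are available as dictionaries with span_id and parent_span_id keys (a layout invented for this example), might look like this:

```python
def find_orphans(spans):
    """Return spans whose ParentSpanId does not match any span in the same trace."""
    known = {span["span_id"] for span in spans}
    return [
        span for span in spans
        if span["parent_span_id"] is not None and span["parent_span_id"] not in known
    ]


spans = [
    {"span_id": "1", "parent_span_id": None, "name": "gateway"},
    # order-service (span 2) never reported, so its children have no visible parent
    {"span_id": "3", "parent_span_id": "2", "name": "user-service RPC"},
    {"span_id": "4", "parent_span_id": "2", "name": "db query"},
]

orphans = find_orphans(spans)
```

Here the two spans under the missing order-service span are reported as orphans, while the root span is not, because a ParentSpanId of None is expected for a root.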

What's Next

See API Gateway and RPC for the components that make up the call chain, and Metrics/Monitoring for the other parts of observability.