Distributed Tracing Basics
Distributed tracing records the call path of a request across services for latency and fault diagnosis. Common implementations: Jaeger, Zipkin, SkyWalking. This article explains Trace, Span, sampling, and instrumentation with a reference table.
Overview
- Trace: Full call tree for one request; has a unique TraceId.
- Span: A node in the trace, representing one in-process unit of work or one RPC call. Carries a SpanId, ParentSpanId, start time, duration, and tags.
- Context propagation: TraceId and SpanId are passed via HTTP headers or RPC context to link the chain.
- Sampling: Recording every request is costly, so tracers keep only a fraction of traces (e.g. 1% or 10%) or favor error and slow requests. Production systems usually sample.
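The relationship between these fields can be sketched as a small data model. This is a minimal illustration, not the schema of Jaeger, Zipkin, or SkyWalking; the names `Span`, `new_id`, and `start_trace` and the hex ID format are assumptions made for the example.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


def new_id(num_bytes: int = 8) -> str:
    """Return a random hex identifier (hypothetical ID format)."""
    return secrets.token_hex(num_bytes)


@dataclass
class Span:
    trace_id: str                  # shared by every span in the trace
    span_id: str                   # unique for this node
    parent_span_id: Optional[str]  # None marks the root span
    name: str
    start_time: float = field(default_factory=time.time)
    duration_ms: Optional[float] = None
    tags: Dict[str, str] = field(default_factory=dict)
    # Monotonic clock reading, used only to measure the duration.
    _started: float = field(default_factory=time.monotonic, repr=False)

    def child(self, name: str) -> "Span":
        """Create a child span: same TraceId, ParentSpanId set to this span."""
        return Span(self.trace_id, new_id(), self.span_id, name)

    def finish(self) -> None:
        """Record the elapsed time since the span started."""
        self.duration_ms = (time.monotonic() - self._started) * 1000.0


def start_trace(name: str) -> Span:
    """Create a root span with a fresh TraceId."""
    return Span(new_id(16), new_id(), None, name)


root = start_trace("gateway")
rpc = root.child("order-service")
rpc.tags["peer"] = "order-service"
rpc.finish()
root.finish()
```

Every span created through `child` reuses the root's TraceId, which is what lets a backend group the spans of one request together.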
Example
Example 1: Call chain

    TraceId: abc123
    Span1: gateway (10ms)
    Span2: order-service (8ms)
    Span3: user-service RPC (5ms)
    Span4: db query (3ms)
    Span5: inventory-service RPC (6ms)

- Shows total time, per-step share, and which service is slow.
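As a rough illustration of the per-step share, the durations from the trace above can be expressed as a percentage of the request's total time. The sketch assumes the gateway span is the root and covers the whole request; `share_of_total` is a hypothetical helper name.

```python
# Durations from the example trace, in milliseconds.
durations_ms = {
    "gateway": 10,
    "order-service": 8,
    "user-service RPC": 5,
    "db query": 3,
    "inventory-service RPC": 6,
}

# Assumption: the gateway span is the root, so its duration is the total.
total_ms = durations_ms["gateway"]


def share_of_total(durations, total):
    """Return each span's duration as a whole-number percentage of the total."""
    return {name: round(100 * ms / total) for name, ms in durations.items()}


shares = share_of_total(durations_ms, total_ms)
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The shares overlap, because a parent span's duration includes the time spent in its children, so they are not expected to add up to 100%.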
Example 2: Propagation
- HTTP: X-Trace-Id, X-Span-Id, X-Parent-Span-Id, etc. in headers.
- Dubbo: via RpcContext, Filter.
- Frameworks: Spring Cloud Sleuth, OpenTelemetry, Brave, etc.
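Header-based propagation can be sketched as a pair of functions: the caller injects its identifiers into the outgoing request and the callee extracts them. This is a hand-rolled illustration using the header names listed above, not the API of Sleuth, OpenTelemetry, or Brave; `inject` and `extract` are hypothetical names.

```python
import secrets
from typing import Dict, Optional, Tuple

TRACE_HEADER = "X-Trace-Id"
SPAN_HEADER = "X-Span-Id"
PARENT_HEADER = "X-Parent-Span-Id"


def inject(trace_id: str, span_id: str, parent_span_id: Optional[str],
           headers: Dict[str, str]) -> Dict[str, str]:
    """Caller side: copy the current span's identifiers into outgoing headers."""
    out = dict(headers)
    out[TRACE_HEADER] = trace_id
    out[SPAN_HEADER] = span_id
    if parent_span_id is not None:
        out[PARENT_HEADER] = parent_span_id
    return out


def extract(headers: Dict[str, str]) -> Tuple[str, str, Optional[str]]:
    """Callee side: return (trace_id, span_id, parent_span_id) for the new server span.

    Continues the caller's trace when context is present; otherwise starts a new root.
    """
    # HTTP header names are case-insensitive, so compare in lower case.
    lowered = {key.lower(): value for key, value in headers.items()}
    trace_id = lowered.get(TRACE_HEADER.lower())
    caller_span_id = lowered.get(SPAN_HEADER.lower())
    if not trace_id:
        return secrets.token_hex(16), secrets.token_hex(8), None
    # The caller's span becomes the parent of the span created here.
    return trace_id, secrets.token_hex(8), caller_span_id


outgoing = inject("abc123", "span-1", None, {"Accept": "application/json"})
trace_id, span_id, parent_span_id = extract(outgoing)
```

If a hop forwards a request without copying these headers, `extract` on the next service starts a new trace and the chain is split in two.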
Example 3: Key concepts
| Concept | Description |
|---|---|
| TraceId | Unique for the whole chain |
| SpanId | Unique for current node |
| ParentSpanId | Parent node for building tree |
| Sampling | Controls volume and cost |
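To show how ParentSpanId is used, the sketch below rebuilds a call tree from flat span records. The span names come from Example 1, but the parent links are assumed for illustration, and `build_tree` and `render` are hypothetical helpers. Spans whose parent was never recorded are reported as orphans, which is how a hop that dropped the context shows up.

```python
from collections import defaultdict

# (span_id, parent_span_id, name); parent links are assumed for illustration.
records = [
    ("1", None, "gateway"),
    ("2", "1", "order-service"),
    ("3", "2", "user-service RPC"),
    ("4", "3", "db query"),
    ("5", "2", "inventory-service RPC"),
]


def build_tree(spans):
    """Group span ids by parent; return (roots, children, orphans)."""
    known = {span_id for span_id, _, _ in spans}
    children = defaultdict(list)
    roots, orphans = [], []
    for span_id, parent_id, _ in spans:
        if parent_id is None:
            roots.append(span_id)
        elif parent_id in known:
            children[parent_id].append(span_id)
        else:
            orphans.append(span_id)  # parent span was never reported
    return roots, children, orphans


def render(spans):
    """Return the tree as indented lines, one per span."""
    names = {span_id: name for span_id, _, name in spans}
    roots, children, _ = build_tree(spans)
    lines = []

    def walk(span_id, depth):
        lines.append("  " * depth + names[span_id])
        for child_id in children[span_id]:
            walk(child_id, depth + 1)

    for root_id in roots:
        walk(root_id, 0)
    return lines


print("\n".join(render(records)))
```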
Example 4: Sampling strategies
- Probabilistic: sample a fixed fraction of requests, e.g. 10%.
- Rate-limited: cap the number of sampled traces, e.g. 100 traces/second.
- Error- and latency-biased: sample errors and slow requests at a higher rate than normal traffic.
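The strategies above can be sketched as three small decision functions. This is an illustrative sketch, not the sampler API of any tracing library; the class names, the CRC32-based bucketing, and the `slow_ms` threshold of 500 ms are all assumptions. Deciding from a hash of the TraceId is one way to make every service reach the same decision for a given trace.

```python
import time
import zlib


class ProbabilisticSampler:
    """Keep a fixed fraction of traces, decided from the TraceId."""

    def __init__(self, rate: float):
        self.rate = rate  # e.g. 0.1 for 10%

    def should_sample(self, trace_id: str) -> bool:
        # Map the TraceId to a stable bucket in [0, 10000).
        bucket = zlib.crc32(trace_id.encode("utf-8")) % 10_000
        return bucket < self.rate * 10_000


class RateLimitedSampler:
    """Token bucket: at most `per_second` sampled traces per second on average."""

    def __init__(self, per_second: float, clock=time.monotonic):
        self.per_second = per_second
        self.clock = clock
        self.tokens = per_second
        self.last = clock()

    def should_sample(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at one second's worth.
        refill = (now - self.last) * self.per_second
        self.tokens = min(self.per_second, self.tokens + refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def keep_trace(trace_id: str, is_error: bool, duration_ms: float,
               base: ProbabilisticSampler, slow_ms: float = 500) -> bool:
    """Always keep errors and slow requests; otherwise defer to the base sampler."""
    if is_error or duration_ms >= slow_ms:
        return True
    return base.should_sample(trace_id)
```

Note that `keep_trace` needs the outcome and duration of the request, so it can only run after the request has finished, whereas the first two samplers can decide when the trace starts.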
Core Mechanism / Behavior
- Instrumentation: Automatic (agent, SDK) or manual (start/end span, add tags). Either way, context must be propagated across process boundaries.
- Export: Spans are sent to a collector, stored in a backend (e.g. Elasticsearch, Cassandra), and queried through a UI (Jaeger, Zipkin).
Key Rules
- Full chain propagation: The gateway, services, MQ producers and consumers, and DB clients must all participate; a hop that drops the context breaks the chain.
- Sampling: Sample in production, and raise the sampling rate for errors and slow requests to aid debugging.
- Storage and query: Trace volume is high, so choose storage that scales with it and a UI suited to trace analysis.
What's Next
See API Gateway, RPC for call chain. See Metrics/Monitoring for observability.