Common Production Pitfalls in RPC
The most common RPC problems in production are missing timeouts, retries on non-idempotent calls, serialization errors, connection leaks, registry or network failures, oversized objects, and slow calls. This article groups the typical pitfalls with their fixes and ends with a quick reference.
Overview
- Timeout: Missing or too large → threads/connections exhausted, cascading failure. Every call needs a reasonable timeout.
- Retry: Retrying non-idempotent calls → duplicate charges, duplicate orders. For writes and operations with side effects, use retries=0.
- Serialization: Large objects, cycles, incompatible types → failure or OOM. Keep parameters and return values small; avoid deep nesting.
- Connections: Leaked connections or too-small pools → NoRouteToHost, connection timeout. Tune pool size and ensure proper close.
- Registry: Registry outage → no service discovery. Cached provider list may keep things working briefly; monitor and recover.
- Network: Cross-DC, cross-region latency. Set timeouts accordingly; consider proximity routing and multi-site.
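As a minimal sketch of the timeout rule above, the following plain-Java helper bounds how long a caller waits for a result. The class name `BoundedCall`, the method `callWithTimeout`, and the fallback parameter are hypothetical and not part of any RPC framework; a real client would normally rely on the framework's own timeout setting instead.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical helper: illustrates a total-call timeout, not a framework API.
final class BoundedCall {
    /**
     * Runs the call asynchronously and waits at most timeoutMs for its result.
     * Returns the fallback when the deadline passes or the caller is interrupted.
     */
    static <T> T callWithTimeout(Supplier<T> call, long timeoutMs, T fallback) {
        CompletableFuture<T> future = CompletableFuture.supplyAsync(call);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Marks the future as cancelled; the underlying task is not interrupted.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return fallback;
        } catch (ExecutionException e) {
            // The call itself failed before the deadline: surface the cause.
            throw new RuntimeException(e.getCause());
        }
    }
}
```

The caller thread is released after `timeoutMs` even if the remote side never answers, which is what keeps threads from piling up; the abandoned task may still run to completion in the background.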
Examples
Example 1: Common issues and fixes
| Issue | Symptom | Fix |
|---|---|---|
| No timeout | Thread pool full, no response | Set timeout for all calls |
| Non-idempotent retry | Duplicate charge, duplicate order | retries=0 for writes |
| Large objects | Slow serialization, OOM, timeout | Pagination, lean DTOs, chunking |
| Connection leak | Growing connections, NoRouteToHost | Check close, pool config |
| Registry failure | No discovery, all fail | Local cache, multi-registry, health check |
| Slow calls | Backlog, more timeouts | Optimize downstream, rate limit, circuit breaker |
Example 2: Idempotency and retry
```java
// Non-idempotent: no retry
@DubboReference(retries = 0)
OrderService orderService;

// Idempotent: allow retry
@DubboReference(retries = 2)
UserService userService;
```
Example 3: Large objects
- Avoid passing large List or Map in RPC. Use pagination, streaming, or pass IDs and fetch on demand. Monitor serialization size and duration.
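One way to apply the "pass IDs and fetch on demand" advice is to split a large ID list into small batches and issue one call per batch. The sketch below assumes a hypothetical `fetchBatch` function standing in for the remote call; `PagedFetch`, `partition`, and `fetchInBatches` are illustrative names, not an existing API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper: keeps each RPC payload bounded by batchSize.
final class PagedFetch {
    /** Splits ids into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> ids, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < ids.size(); from += batchSize) {
            int to = Math.min(from + batchSize, ids.size());
            batches.add(new ArrayList<>(ids.subList(from, to)));
        }
        return batches;
    }

    /** Invokes fetchBatch (the assumed remote call) once per batch and concatenates the results. */
    static <K, V> List<V> fetchInBatches(List<K> ids, int batchSize,
                                         Function<List<K>, List<V>> fetchBatch) {
        List<V> results = new ArrayList<>();
        for (List<K> batch : partition(ids, batchSize)) {
            results.addAll(fetchBatch.apply(batch));
        }
        return results;
    }
}
```

Each request and response stays small, so serialization time and memory use per call are bounded by `batchSize` rather than by the total result size. The trade-off is more round trips, so the batch size is worth tuning against measured latency.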
Example 4: Timeout sizing
- Set the timeout to 2–3× the observed P99 latency, or to the maximum latency the caller can accept. Too small → false timeouts; too large → resource exhaustion.
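The sizing rule can be turned into a small calculation. The sketch below is one possible reading of it: it computes a nearest-rank P99 from latency samples, multiplies it, and caps the result at the business-acceptable maximum. Taking the smaller of the two values is an assumption of this sketch, and `TimeoutSizing` and its method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper: derives a timeout from observed latency samples.
final class TimeoutSizing {
    /** Nearest-rank percentile of the samples, for p in (0, 100]. */
    static long percentile(long[] latenciesMs, double p) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** multiplier × P99, capped at maxAcceptableMs (assumed interpretation of the rule). */
    static long recommendedTimeoutMs(long[] latenciesMs, double multiplier, long maxAcceptableMs) {
        long p99 = percentile(latenciesMs, 99.0);
        long scaled = Math.round(p99 * multiplier);
        return Math.min(scaled, maxAcceptableMs);
    }
}
```

For 100 samples of 1 ms to 100 ms, the nearest-rank P99 is 99 ms, so a multiplier of 2 suggests 198 ms unless the acceptable maximum is lower.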
Core Mechanism / Behavior
- Timeout: Applied at connect, read, or total-call level. Must be set on every call path.
- Retry: Retry only on retryable errors (network failures, timeouts), never on 4xx-style client errors or business errors, and only when the call is idempotent.
- Connections: Long-lived, pooled. Leak = not returned to pool; monitor connection count.
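The retry behavior described above can be sketched in plain Java. This is an illustration under stated assumptions, not how any particular framework implements it: `RetryPolicy` is a hypothetical class, `IOException` and `TimeoutException` stand in for "network" and "timeout" errors, and every other exception is treated as a non-retryable business error.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: retries only idempotent calls, and only on transport-level errors.
final class RetryPolicy {
    /** Assumption: IOException and TimeoutException represent retryable transport failures. */
    static boolean isRetryable(Exception e) {
        return e instanceof IOException || e instanceof TimeoutException;
    }

    /** Makes one attempt, plus up to maxRetries more when the call is idempotent. */
    static <T> T call(Callable<T> rpc, boolean idempotent, int maxRetries) throws Exception {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        int attemptsAllowed = idempotent ? maxRetries + 1 : 1;
        Exception last = null;
        for (int attempt = 1; attempt <= attemptsAllowed; attempt++) {
            try {
                return rpc.call();
            } catch (Exception e) {
                if (!isRetryable(e)) {
                    throw e; // business or client error: never retry
                }
                last = e; // transport error: retry if attempts remain
            }
        }
        throw last; // all allowed attempts failed with retryable errors
    }
}
```

The sketch retries immediately; a production policy would usually add backoff between attempts and keep the total time within the call's timeout budget.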
Key Rules
- Always set timeout; use 2–3× P99 or business-acceptable max.
- Retry only for idempotent calls; for writes and side-effect operations, disable retry.
- Monitor: Timeout rate, error rate, connection count, serialization size; alert and investigate when abnormal.
What's Next
See Timeout/Retry/Fallback, Idempotency Design, Circuit Breaker. See Serialization and Dubbo Architecture for protocol and connection details.