Common Production Pitfalls in RPC

RPC failures in production usually trace back to a small set of causes: missing timeouts, retries on non-idempotent calls, serialization errors, connection leaks, registry or network failures, oversized payloads, and slow downstream calls. This article groups these pitfalls with their fixes and gives a quick reference.

Overview

  • Timeout: Missing or overly large timeouts → exhausted threads and connections, then cascading failure. Every call needs a reasonable timeout.
  • Retry: Retrying non-idempotent calls → duplicate charges, duplicate orders. For writes and other operations with side effects, use retries=0 (Dubbo retries twice by default).
  • Serialization: Large objects, cyclic references, incompatible types → serialization failure or OOM. Keep parameters and return values small; avoid deep nesting.
  • Connections: Leaked connections or undersized pools → NoRouteToHostException, connection timeouts. Tune the pool size and make sure every connection is closed or returned.
  • Registry: Registry outage → no service discovery. The locally cached provider list may keep existing calls working for a while; monitor the registry and recover it quickly.
  • Network: Cross-DC and cross-region calls add latency. Set timeouts accordingly; consider proximity routing and multi-site deployment.
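The timeout rule above can be sketched in plain Java, independent of any RPC framework. This is a minimal illustration, not how Dubbo enforces timeouts internally: TimedCall, callWithTimeout, and CallTimeoutException are hypothetical names introduced here, and the remote call is modeled as a Supplier.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical helper: bounds how long the caller waits for a result.
final class TimedCall {
    /** Unchecked exception raised when the deadline passes (illustrative name). */
    static final class CallTimeoutException extends RuntimeException {
        CallTimeoutException(String message) {
            super(message);
        }
    }

    /** Runs the call asynchronously and waits at most timeoutMillis for its result. */
    static <T> T callWithTimeout(Supplier<T> call, long timeoutMillis) {
        CompletableFuture<T> future = CompletableFuture.supplyAsync(call);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Marks the future as cancelled; it does not interrupt the running task.
            future.cancel(true);
            throw new CallTimeoutException("no response within " + timeoutMillis + " ms");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("call failed", e.getCause());
        }
    }
}
```

Note that this only frees the calling thread. The worker running the call stays busy until the call itself returns, which is why a real client should also enforce timeouts at the connection and read level.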

Example

Example 1: Common issues and fixes

Issue | Symptom | Fix
No timeout | Thread pool full, no response | Set timeout for all calls
Non-idempotent retry | Duplicate charge, duplicate order | retries=0 for writes
Large objects | Slow serialization, OOM, timeout | Pagination, lean DTOs, chunking
Connection leak | Growing connections, NoRouteToHost | Check close, pool config
Registry failure | No discovery, all fail | Local cache, multi-registry, health check
Slow calls | Backlog, more timeouts | Optimize downstream, rate limit, circuit breaker
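The connection-leak entry can be made concrete with a small sketch. It assumes a pool modeled as a counting semaphore. LeasePool, Lease, borrow, and inUse are hypothetical names, not part of Dubbo or any real pool library. The point is that a lease used with try-with-resources is always returned, while a forgotten close() eventually exhausts the pool.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical fixed-size pool: each permit stands for one connection slot.
final class LeasePool {
    private final Semaphore permits;
    private final int size;

    LeasePool(int size) {
        this.size = size;
        this.permits = new Semaphore(size);
    }

    /** A borrowed slot; close() returns it to the pool at most once. */
    final class Lease implements AutoCloseable {
        private boolean closed;

        @Override
        public synchronized void close() {
            if (!closed) {
                closed = true;
                permits.release();
            }
        }
    }

    /** Waits up to waitMillis for a free slot and fails fast if none is available. */
    Lease borrow(long waitMillis) {
        try {
            if (!permits.tryAcquire(waitMillis, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException("pool exhausted: " + inUse() + " in use");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a slot", e);
        }
        return new Lease();
    }

    /** Number of slots currently borrowed; a value that only grows suggests a leak. */
    int inUse() {
        return size - permits.availablePermits();
    }
}
```

Callers would write try (LeasePool.Lease lease = pool.borrow(100)) { ... } so the slot is released even when the call throws. Exporting inUse() as a metric gives the "growing connections" symptom an early signal.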

Example 2: Idempotency and retry

Java
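import org.apache.dubbo.config.annotation.DubboReference;

// OrderClient is an illustrative holder class for the two references.
public class OrderClient {
    // Non-idempotent (e.g. placing an order): never retry automatically
    @DubboReference(retries = 0)
    private OrderService orderService;

    // Idempotent (e.g. read-only lookups): up to 2 retries after the first attempt
    @DubboReference(retries = 2)
    private UserService userService;
}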

Example 3: Large objects

  • Avoid passing a large List or Map in a single RPC call. Use pagination or streaming, or pass IDs and fetch the details on demand. Monitor serialized size and serialization time.
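As a rough sketch of the pagination advice, the helper below pulls data in bounded pages and hands each page to a handler, so no single call carries the whole collection. PagedFetch and forEachPage are hypothetical names, and the remote method is modeled as a function of (offset, limit); a real service would expose its own paging parameters.

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Consumer;

// Hypothetical helper: iterates a remote collection page by page.
final class PagedFetch {
    /**
     * Calls fetchPage(offset, limit) until a page shorter than pageSize is returned.
     * Each non-empty page is passed to the handler; the total item count is returned.
     */
    static <T> int forEachPage(BiFunction<Integer, Integer, List<T>> fetchPage,
                               int pageSize,
                               Consumer<List<T>> handler) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        int offset = 0;
        int total = 0;
        while (true) {
            List<T> page = fetchPage.apply(offset, pageSize);
            if (!page.isEmpty()) {
                handler.accept(page);
                total += page.size();
            }
            if (page.size() < pageSize) {
                return total; // a short page means there is nothing left
            }
            offset += pageSize;
        }
    }
}
```

Processing each page as it arrives keeps both the per-call payload and the caller's memory bounded by pageSize. Offset paging is used here only for brevity; cursor-based paging is a common alternative when the data changes between calls.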

Example 4: Timeout sizing

  • Use 2–3× the observed P99 latency or the maximum latency the business can accept. Too small → false timeouts on healthy calls; too large → threads and connections held by slow calls until resources are exhausted.
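The sizing rule can be turned into a small calculation. The sketch below assumes latency samples in milliseconds and a nearest-rank percentile, and it interprets the rule as "multiplier × P99, capped at the acceptable maximum". That cap is an assumption of this example, as are the names TimeoutSizing, percentile, and suggestTimeout.

```java
import java.util.Arrays;

// Hypothetical helper: derives a timeout from observed latencies.
final class TimeoutSizing {
    /** Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it. */
    static long percentile(long[] samplesMillis, int pct) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (pct < 1 || pct > 100) {
            throw new IllegalArgumentException("pct must be in 1..100");
        }
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int rank = (pct * sorted.length + 99) / 100; // integer ceil(pct * n / 100)
        return sorted[rank - 1];
    }

    /** multiplier × P99, capped at the maximum latency the caller can accept (assumed policy). */
    static long suggestTimeout(long[] samplesMillis, double multiplier, long maxAcceptableMillis) {
        long p99 = percentile(samplesMillis, 99);
        long scaled = Math.round(p99 * multiplier);
        return Math.min(scaled, maxAcceptableMillis);
    }
}
```

For example, with a P99 of 99 ms and a multiplier of 3, the suggestion is 297 ms unless the acceptable maximum is lower. The samples should come from production traffic at peak load, since a P99 measured at idle understates real latency.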

Core Mechanism / Behavior

  • Timeout: Applied at the connect, read, or total-call level. Must be set on every call path.
  • Retry: Retry only on retryable errors (network failures, timeouts), not on 4xx-style client errors or business errors, and only for idempotent operations.
  • Connections: Long-lived and pooled. A leak is a connection that is never returned to the pool; monitor the connection count.
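The retry rule above can be sketched as a small wrapper. This is an illustration under stated assumptions, not Dubbo's retry implementation: RetryPolicy, isRetryable, and call are hypothetical names, transport failures are modeled as UncheckedIOException, and any other exception is treated as a business error.

```java
import java.io.UncheckedIOException;
import java.util.function.Supplier;

// Hypothetical helper: retries only transient failures of idempotent operations.
final class RetryPolicy {
    /** Assumption of this sketch: only wrapped I/O failures count as transient. */
    static boolean isRetryable(RuntimeException e) {
        return e instanceof UncheckedIOException;
    }

    /**
     * Runs op once, plus up to maxRetries more times if it is idempotent and
     * fails with a retryable error. Non-retryable errors propagate immediately.
     */
    static <T> T call(Supplier<T> op, boolean idempotent, int maxRetries) {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        int attempts = idempotent ? maxRetries + 1 : 1;
        RuntimeException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                if (!isRetryable(e)) {
                    throw e; // business or validation error: retrying cannot help
                }
                last = e;
            }
        }
        throw last; // every attempt failed with a retryable error
    }
}
```

The sketch retries immediately to stay short. In production, retries are usually spaced with backoff and jitter and bounded by an overall deadline, so that retries do not amplify load on a struggling provider.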

Key Rules

  • Always set timeout; use 2–3× P99 or business-acceptable max.
  • Retry only for idempotent calls; for writes and side-effect operations, disable retry.
  • Monitor: Timeout rate, error rate, connection count, serialization size; alert and investigate when abnormal.
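The monitoring rule can be illustrated with a minimal counter for timeout rate. CallStats and its methods are hypothetical names; a real deployment would more likely export these counts to an existing metrics system and alert there, over a sliding window rather than since startup.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical in-process counter for the timeout rate of one call path.
final class CallStats {
    private final LongAdder total = new LongAdder();
    private final LongAdder timeouts = new LongAdder();

    /** Records one finished call and whether it ended in a timeout. */
    void record(boolean timedOut) {
        total.increment();
        if (timedOut) {
            timeouts.increment();
        }
    }

    /** Fraction of recorded calls that timed out; 0.0 when nothing was recorded. */
    double timeoutRate() {
        long n = total.sum();
        return n == 0 ? 0.0 : (double) timeouts.sum() / n;
    }

    /** True when the timeout rate exceeds the given threshold, e.g. 0.01 for 1%. */
    boolean shouldAlert(double threshold) {
        return timeoutRate() > threshold;
    }
}
```

The same pattern applies to error rate, connection count, and serialized payload size: record per call, compare against a threshold, and investigate when the value moves away from its baseline.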

What's Next

See Timeout/Retry/Fallback, Idempotency Design, Circuit Breaker. See Serialization and Dubbo Architecture for protocol and connection details.