Common Production Pitfalls in RPC

RPC problems that recur in production include missing timeouts, retries on non-idempotent calls, serialization failures, connection leaks, registry and network failures, oversized objects, and slow calls. This article groups these pitfalls with their fixes and gives a quick reference.

Overview

Timeout: A missing or overly large timeout lets slow calls hold threads and connections until they are exhausted, and the failure cascades to callers. Every call needs a reasonable timeout.
Retry: Retrying non-idempotent calls causes duplicate charges and duplicate orders. For writes and other operations with side effects, set retries=0.
Serialization: Large objects, cyclic references, and incompatible types cause serialization failures or OOM. Keep parameters and return values small; avoid deep nesting.
Connections: Leaked connections or an undersized pool surface as NoRouteToHost errors and connection timeouts. Size the pool for the load and make sure every connection is closed or returned.
Registry: A registry outage stops service discovery. A cached provider list may keep calls working for a while; monitor the registry and recover it quickly.
Network: Cross-DC and cross-region calls add latency. Set timeouts accordingly; consider proximity routing and multi-site deployment.
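
As a minimal sketch of the "every call needs a timeout" rule, the helper below runs a blocking call on an executor and stops waiting after a deadline. `TimedCall` and `callWithTimeout` are hypothetical names used only for illustration; where the RPC framework offers its own timeout setting, that setting is normally the better choice.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of any RPC framework.
final class TimedCall {
    private TimedCall() {}

    /** Runs the call on the executor and waits at most timeoutMillis for its result. */
    static <T> T callWithTimeout(ExecutorService executor, Callable<T> call, long timeoutMillis)
            throws TimeoutException, ExecutionException, InterruptedException {
        Future<T> future = executor.submit(call);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupt the worker so a stuck call does not hold its thread indefinitely.
            future.cancel(true);
            throw e;
        }
    }
}
```

The caller still has to size the executor: a bounded wait frees the calling thread, but worker threads are only released if the underlying call responds to interruption.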

Example

Example 1: Common issues and fixes

Issue	Symptom	Fix
No timeout	Thread pool full, no response	Set timeout for all calls
Non-idempotent retry	Duplicate charge, duplicate order	`retries=0` for writes
Large objects	Slow serialization, OOM, timeout	Pagination, lean DTOs, chunking
Connection leak	Growing connections, NoRouteToHost	Check close, pool config
Registry failure	No discovery, all fail	Local cache, multi-registry, health check
Slow calls	Backlog, more timeouts	Optimize downstream, rate limit, circuit breaker

Example 2: Idempotency and retry

Java
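import org.apache.dubbo.config.annotation.DubboReference;

// Non-idempotent (e.g. order creation): disable retries explicitly,
// because Dubbo's default failover cluster retries failed calls.
@DubboReference(retries = 0)
private OrderService orderService;

// Idempotent (e.g. user lookup): allow up to 2 retries after the first attempt.
@DubboReference(retries = 2)
private UserService userService;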
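
Outside a framework, the same rule can be sketched by hand. The helper below retries only when the caller declares the call idempotent and the failure is an `IOException`, which stands in here for network and timeout errors. `RetryingCaller` and its parameters are hypothetical names, and the sketch deliberately omits backoff.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper illustrating the retry rule; not a Dubbo API.
final class RetryingCaller {
    private RetryingCaller() {}

    /**
     * Invokes the call once, then up to maxRetries more times, but only if the
     * caller declares it idempotent and the failure is an IOException
     * (assumed here to represent network and timeout errors).
     */
    static <T> T call(Callable<T> call, boolean idempotent, int maxRetries) throws Exception {
        int retriesLeft = idempotent ? maxRetries : 0;
        while (true) {
            try {
                return call.call();
            } catch (IOException e) {
                if (retriesLeft-- <= 0) {
                    throw e; // out of retries, or not safe to retry at all
                }
            }
            // Any other exception (e.g. a business error) propagates immediately.
        }
    }
}
```

With `idempotent = false` the call is attempted exactly once, which mirrors `retries = 0`.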

Example 3: Large objects

Avoid passing large List or Map values over RPC. Use pagination or streaming, or pass IDs and let the receiver fetch details on demand. Monitor serialized payload size and serialization time.
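
One way to keep each payload small is to split a large list of IDs into fixed-size batches and issue one call per batch. The sketch below shows only the splitting step; `Batches.partition` is a hypothetical helper, and the batch size is an assumption to be tuned against measured payload sizes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for chunking a large argument list.
final class Batches {
    private Batches() {}

    /** Splits ids into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> ids, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += batchSize) {
            int end = Math.min(start + batchSize, ids.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(ids.subList(start, end)));
        }
        return batches;
    }
}
```

A caller would then loop over the batches and invoke the remote method once per batch, so that no single request or response carries the full collection.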

Example 4: Timeout sizing

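Set the timeout to 2–3× the call's P99 latency, or to the maximum latency the business can accept. A timeout that is too small fails healthy calls; one that is too large holds threads and connections long enough to exhaust them.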
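
The sizing rule can be sketched as a small calculation over recorded latencies. The code below assumes one particular reading of the rule: scale P99 by a multiplier, then cap the result at the maximum acceptable latency. `TimeoutSizing` and its method names are hypothetical, and the nearest-rank percentile is a simplification of what a metrics system would report.

```java
import java.util.Arrays;

// Hypothetical helper; in practice P99 usually comes from a metrics system.
final class TimeoutSizing {
    private TimeoutSizing() {}

    /** Nearest-rank P99 of the given latency samples, in milliseconds. */
    static long p99(long[] latenciesMillis) {
        if (latenciesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = latenciesMillis.clone();
        Arrays.sort(sorted);
        // 1-based rank = ceil(0.99 * n), computed with integer arithmetic.
        int rank = (99 * sorted.length + 99) / 100;
        return sorted[rank - 1];
    }

    /** multiplier x P99, capped at the maximum latency the business can accept. */
    static long suggestTimeoutMillis(long[] latenciesMillis, double multiplier, long maxAcceptableMillis) {
        long scaled = Math.round(p99(latenciesMillis) * multiplier);
        return Math.min(scaled, maxAcceptableMillis);
    }
}
```

For example, with samples of 1 to 100 ms the P99 is 99 ms, so a multiplier of 3 suggests 297 ms, or 200 ms if the acceptable maximum is 200 ms.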

Core Mechanism / Behavior

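Timeout: Can be applied at the connect, read, or total-call level. Must be set on every call path.
Retry: Retry only retryable errors (network failures, timeouts), never 4xx or business errors, and only for idempotent calls. A timed-out request may already have been executed by the provider, which is why retrying a write can duplicate it.
Connections: Long-lived and pooled. A leak means a connection is never returned to the pool; monitor the connection count.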
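
To illustrate what "not returned to the pool" means, here is a minimal sketch of a pool that hands out leases and counts how many are in use. `LeasePool` is a hypothetical stand-in for a real connection pool: it tracks slots rather than sockets, which is enough to show how try-with-resources prevents leaks and how an in-use counter exposes them.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical pool that tracks slots only; a real pool would hold connections.
final class LeasePool {
    private final Semaphore permits;
    private final int capacity;

    LeasePool(int capacity) {
        this.capacity = capacity;
        this.permits = new Semaphore(capacity);
    }

    /** Borrows a slot, waiting at most timeoutMillis; closing the lease returns it. */
    Lease borrow(long timeoutMillis) throws InterruptedException, TimeoutException {
        if (!permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS)) {
            throw new TimeoutException("pool exhausted: possible leak or pool too small");
        }
        return new Lease();
    }

    /** Slots currently borrowed; a value that only grows suggests a leak. */
    int inUse() {
        return capacity - permits.availablePermits();
    }

    final class Lease implements AutoCloseable {
        private final AtomicBoolean closed = new AtomicBoolean();

        @Override
        public void close() {
            // Release exactly once, even if close() is called twice.
            if (closed.compareAndSet(false, true)) {
                permits.release();
            }
        }
    }
}
```

Borrowing inside `try (LeasePool.Lease lease = pool.borrow(100)) { ... }` returns the slot even when the body throws, and exporting `inUse()` as a metric gives the connection count to alert on.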

Key Rules

Always set a timeout; use 2–3× P99 or the business-acceptable maximum.
Retry only for idempotent calls; for writes and side-effect operations, disable retry.
Monitor: Timeout rate, error rate, connection count, serialization size; alert and investigate when abnormal.

What's Next

See Timeout/Retry/Fallback, Idempotency Design, Circuit Breaker. See Serialization and Dubbo Architecture for protocol and connection details.