Service Discovery

Service discovery lets services call each other by service name instead of hardcoded IP addresses. Providers register themselves with a registry; consumers resolve the service name to a list of instances and choose where to call. This article covers client-side vs server-side discovery, common implementations, and operational considerations with a comparison table.

Overview

  • Client-side discovery: The consumer queries the registry, gets the provider list, performs load balancing, and calls the chosen instance directly. Examples: Dubbo + Nacos, Eureka client.
  • Server-side discovery: The consumer only knows the gateway or load balancer address. The gateway or LB queries the registry and forwards the request. Examples: K8s Service, Consul with Nginx.
  • Registry: Nacos, Eureka, Consul, Zookeeper, Etcd. Stores service name → list of (address, port, metadata). Supports health checks and heartbeat renewal.
  • Choice: Client-side reduces a hop and gives clients control over load balancing; server-side keeps clients simple and centralizes control. A sketch of the client-side contract follows this list.
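
To make the contract concrete, here is a minimal sketch of what a client-side discovery SDK exposes. The names (RegistryClient, ServiceInstance) are illustrative, not from any specific library.

Java
import java.util.List;
import java.util.Map;

// Hypothetical client-side discovery contract (names are illustrative).
interface RegistryClient {
    // Provider side: announce this instance under a service name.
    void register(String serviceName, String host, int port, Map<String, String> metadata);

    // Provider side: remove this instance on graceful shutdown.
    void deregister(String serviceName, String host, int port);

    // Consumer side: resolve a service name to the current live instances.
    List<ServiceInstance> discover(String serviceName);
}

// Minimal view of what the registry hands back per instance.
record ServiceInstance(String host, int port, Map<String, String> metadata) {}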

Example

Example 1: Registration and discovery flow

Plain text
Provider starts → registers with Registry (serviceName, ip:port, metadata)
                → sends heartbeats to stay registered
Consumer calls → queries Registry for serviceName → gets [ip1:port1, ip2:port2, ...]
              → load balances to pick one → makes RPC/HTTP call
              → (optional) caches list locally for next call
  • Registration happens at startup; discovery happens per call or is cached and refreshed periodically.

Example 2: Client-side vs server-side

Mode        | Pros                                  | Cons
Client-side | One less hop, flexible load balancing | Client needs SDK, aware of registry
Server-side | Simple client, centralized control    | Extra hop, gateway/LB can be bottleneck
  • Client-side: consumer has the full list and can implement custom load balancing (e.g. consistent hash, least active); a round-robin sketch follows below. Server-side: consumer just calls the gateway; gateway does discovery and routing.
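
As a minimal illustration of client-side load balancing, a thread-safe round-robin pick over the discovered list might look like the following sketch (not tied to any particular SDK).

Java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Round-robin selection over whatever list discovery returned.
class RoundRobinBalancer<T> {
    private final AtomicLong counter = new AtomicLong();

    // Returns the next instance in rotation; assumes a non-empty list.
    T pick(List<T> instances) {
        int index = Math.floorMod(counter.getAndIncrement(), instances.size());
        return instances.get(index);
    }
}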

Example 3: Health check

  • Heartbeat: Provider sends periodic heartbeats to the registry. Missed heartbeats → instance removed. Used by Nacos and Eureka (see the sketch after this list).
  • Probe: Registry or sidecar probes the instance (TCP connect, HTTP GET). Fails → instance removed. Used by K8s (liveness, readiness).
  • Health checks ensure traffic is not sent to dead or overloaded instances.
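
On the provider side, the heartbeat loop is typically just a scheduled task. A minimal sketch, with sendHeartbeat standing in for whatever call the registry SDK actually provides:

Java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class HeartbeatSender {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // Renew the registration every 5 seconds; the registry evicts the instance
    // after several missed renewals (intervals and thresholds vary by registry).
    void start(String serviceName, String host, int port) {
        scheduler.scheduleAtFixedRate(
                () -> sendHeartbeat(serviceName, host, port), 0, 5, TimeUnit.SECONDS);
    }

    // Placeholder for the real SDK call (e.g. an HTTP PUT to the registry).
    private void sendHeartbeat(String serviceName, String host, int port) {
        // registry-specific heartbeat goes here
    }
}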

Example 4: K8s Service (server-side)

  • A K8s Service provides a stable DNS name (e.g. user-service) that resolves to the Service's cluster IP (or directly to Pod IPs for a headless Service).
  • Kube-proxy (via iptables or IPVS rules) routes traffic to a healthy backend Pod. The consumer needs no explicit registry; K8s handles discovery and load balancing.
  • Pods carry labels; the Service selects matching Pods through its label selector, and K8s keeps the endpoint list up to date.

Example 5: Nacos registration (client-side)

Java
import com.alibaba.nacos.api.naming.NamingFactory;
import com.alibaba.nacos.api.naming.NamingService;
import com.alibaba.nacos.api.naming.pojo.Instance;
import java.util.List;

// Provider: register this instance under the service name at startup
// (createNamingService and registerInstance throw NacosException)
NamingService naming = NamingFactory.createNamingService(serverAddr);
naming.registerInstance("user-service", "192.168.1.1", 8080);

// Consumer: fetch the current instance list, then pick one via load balancing
List<Instance> instances = naming.getAllInstances("user-service");
  • Provider registers on startup; consumer gets the list when needed. Nacos supports both ephemeral instances (removed when heartbeats stop) and persistent instances (deregistered explicitly; health-checked by the server).

Example 6: Metadata and versioning

  • Registry can store metadata (version, region, tags). Consumers filter by metadata for canary, region affinity, or version routing.
  • Example: user-service with version=2.0 and group=gray for canary traffic, as in the Nacos sketch below.
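
With the Nacos Java SDK, metadata goes on the Instance object at registration and the consumer filters on it. A sketch, assuming serverAddr points at a running Nacos server:

Java
import com.alibaba.nacos.api.naming.NamingFactory;
import com.alibaba.nacos.api.naming.NamingService;
import com.alibaba.nacos.api.naming.pojo.Instance;
import java.util.List;
import java.util.Map;

// Provider: attach version/group metadata to the registered instance
NamingService naming = NamingFactory.createNamingService(serverAddr);
Instance instance = new Instance();
instance.setIp("192.168.1.1");
instance.setPort(8080);
instance.setMetadata(Map.of("version", "2.0", "group", "gray"));
naming.registerInstance("user-service", instance);

// Consumer: route canary traffic by filtering on metadata
List<Instance> canary = naming.getAllInstances("user-service").stream()
        .filter(i -> "gray".equals(i.getMetadata().get("group")))
        .toList();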

Core Mechanism / Behavior

  • Registration: Provider sends (serviceName, address, port, metadata) on startup. Sends heartbeats to stay registered. On shutdown, should deregister (or rely on heartbeat timeout).
  • Discovery: Consumer subscribes (push) or polls (pull) for serviceName. Receives list updates when instances change. Push reduces latency; pull is simpler. A push-based sketch follows this list.
  • Caching: Consumers typically cache the instance list locally. Registry outage does not immediately break calls if the cache is valid. Cache TTL and refresh strategy matter.
  • Consistency: Registry may be eventually consistent. Brief windows of stale data (e.g. instance down but still in list) are possible. Load balancing and retries mitigate.
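
For the push model, Nacos lets the consumer subscribe and receive the full updated list whenever membership changes. A sketch that keeps a local cache fresh, reusing the naming client from Example 5:

Java
import com.alibaba.nacos.api.naming.listener.NamingEvent;
import com.alibaba.nacos.api.naming.pojo.Instance;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Local cache refreshed by push notifications from the registry
AtomicReference<List<Instance>> cache = new AtomicReference<>(List.of());

naming.subscribe("user-service", event -> {
    if (event instanceof NamingEvent namingEvent) {
        cache.set(namingEvent.getInstances()); // replace the cached list atomically
    }
});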

Registry High Availability

  • Multi-node: Registry should run as a cluster (e.g. Nacos cluster, Eureka peer replication, Consul cluster). No single point of failure.
  • Persistence: Some registries persist data; others are in-memory with replication. Persistence helps recovery after restart.
  • Local cache: Consumers cache the instance list. When the registry is down, consumers use the cache; for a short period, stale data is preferred over total failure (see the sketch after this list).
  • Fallback: Define behavior when registry is unreachable (e.g. use last known list, fail fast, or use static config).
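
A minimal sketch of that fallback behavior: serve fresh data when the registry answers, otherwise fall back to the last known list (fetchFromRegistry stands in for the real SDK call):

Java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Wraps registry lookups with a last-known-good cache for registry outages.
class CachingResolver {
    private final AtomicReference<List<String>> lastKnown = new AtomicReference<>(List.of());

    List<String> resolve(String serviceName) {
        try {
            List<String> fresh = fetchFromRegistry(serviceName);
            lastKnown.set(fresh);      // remember the latest good answer
            return fresh;
        } catch (Exception registryDown) {
            return lastKnown.get();    // stale, but better than failing outright
        }
    }

    // Placeholder for the real registry query.
    private List<String> fetchFromRegistry(String serviceName) throws Exception {
        throw new UnsupportedOperationException("registry SDK call goes here");
    }
}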

Failure Modes and Resilience

When the registry is unavailable, consumers should rely on a local cache of the last known instance list. This allows calls to continue for a limited time, at the cost of potentially using stale data (e.g. calling an instance that has gone down). Define a cache TTL and refresh strategy: too short increases load on the registry; too long increases the window of stale data. When the registry comes back, consumers should refresh. When an instance is unreachable, the client should retry another instance from the list; combining discovery with retry and circuit breaker improves resilience. Avoid a single registry as a critical dependency that brings down all callers when it fails; use clustering and local cache to tolerate registry outages.
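
Retrying another instance can be as simple as iterating the discovered list until one call succeeds. A sketch, where the call function is a placeholder for the actual RPC/HTTP invocation:

Java
import java.util.List;
import java.util.function.Function;

class FailoverCaller {
    // Try each discovered instance in turn; throw only after all have failed.
    static <T> T callWithFailover(List<String> addresses, Function<String, T> call) {
        RuntimeException last = null;
        for (String address : addresses) {
            try {
                return call.apply(address);  // first successful response wins
            } catch (RuntimeException e) {
                last = e;                    // remember the failure, try the next one
            }
        }
        throw last != null ? last : new IllegalStateException("no instances available");
    }
}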

Choosing a Registry

Nacos supports both service discovery and configuration; it fits the Dubbo and Spring Cloud ecosystem. Eureka is simple and battle-tested; it is peer-replicated and AP. Consul offers service discovery, health checks, and KV store; it can run as a single node or cluster. Zookeeper is CP and used by Kafka, Dubbo, and others; it has stronger consistency but higher operational complexity. Etcd is used by Kubernetes and is suitable when you are in the K8s ecosystem. Consider your existing stack, consistency requirements, and operational preferences when choosing.

Key Rules

  • Registry must be highly available; use multi-node, persistence, and local caching so consumers can operate when the registry has issues.
  • Health checks remove failing instances promptly; avoid sending traffic to dead or unhealthy nodes.
  • Instance up/down events should be propagated quickly (push) or picked up by periodic refresh (pull) to reduce calls to instances that are already down.
  • Metadata supports versioning, canary, and regional routing; use it for advanced routing needs.

What's Next

See RPC Fundamentals, Load Balancing, API Gateway. See High Availability for multi-instance and failover patterns.