High Availability Basics
High availability (HA) means the system keeps serving requests despite failures. It is usually expressed as an availability percentage (e.g. 99.9%, 99.99%), and recovery speed is tracked as MTTR (mean time to recover). This article covers redundancy, failover, health checks, and isolation, with reference tables.
Overview
- Redundancy: No single point of failure; multiple instances for services, DB, middleware. Primary-replica, multi-active, multi-DC.
- Failover: When the primary fails, a standby takes over. Requires heartbeat, election, data sync; has switchover delay and consistency considerations.
- Health check: Periodically probe services and dependencies; remove unhealthy instances from rotation or raise an alert. Liveness/Readiness, TCP/HTTP probes.
- Isolation: Prevent failure from spreading. Thread pool isolation, circuit breaker, rate limit, bulkhead. Fail fast when a downstream dependency fails, so that blocked requests do not pile up and exhaust the caller's resources.
Example
Example 1: Availability percentages
| Availability | Approx. yearly downtime |
|---|---|
| 99% | 3.65 days |
| 99.9% | 8.76 hours |
| 99.99% | 52.6 minutes |
| 99.999% | 5.26 minutes |
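
The figures in the table follow from simple arithmetic: downtime is the unavailable fraction multiplied by the length of a year. A minimal sketch, assuming a 365-day year (the function name `yearly_downtime_minutes` is illustrative, not from any library):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def yearly_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime budget per 365-day year, in minutes."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    for pct in (99, 99.9, 99.99, 99.999):
        minutes = yearly_downtime_minutes(pct)
        print(f"{pct}% -> {minutes:.2f} minutes/year ({minutes / 60:.2f} hours)")
```

Each additional "nine" cuts the downtime budget by a factor of ten, which is why moving from 99.9% to 99.99% usually requires automated recovery rather than manual intervention.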
Example 2: Common techniques
| Technique | Role |
|---|---|
| Multi-instance | Remove single points of failure; spread load across instances |
| Primary-replica / multi-active | Redundancy for data and service |
| Health check | Detect failure, remove unhealthy instances |
| Circuit breaker | Fail fast when downstream fails, avoid backlog |
| Rate limit | Prevent overload from taking down the system |
| Degradation | Prioritize core features; non-core can be turned off |
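
To make the circuit breaker row concrete, here is a minimal single-threaded sketch. The class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all assumptions chosen for illustration. A production breaker would also need locking, metrics, and a more careful half-open policy.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast: do not touch the failing downstream at all.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers wrap each downstream call as `breaker.call(fetch_profile, user_id)` (where `fetch_profile` is a hypothetical downstream function) and treat `CircuitOpenError` as a signal to return a fallback or degraded response.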
Example 3: Fault domains
- Spread instances across racks, availability zones (AZs), and data centers (DCs) so that a single rack, AZ, or DC failure does not take everything down. In K8s, use pod anti-affinity and multi-AZ deployment, for example.
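
The placement idea can be illustrated with a round-robin assignment of replicas to zones. This is only a sketch of the principle, not how a real scheduler decides. The zone names and the `spread_replicas` function are hypothetical.

```python
from collections import Counter
from itertools import cycle


def spread_replicas(replica_count: int, zones: list[str]) -> list[str]:
    """Assign replicas to zones round-robin so per-zone counts differ by at most one."""
    if not zones:
        raise ValueError("at least one zone is required")
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(replica_count)]


if __name__ == "__main__":
    placement = spread_replicas(5, ["az-a", "az-b", "az-c"])
    print(placement)
    print(Counter(placement))
```

With five replicas over three zones, losing any one zone leaves at least three replicas running.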
Example 4: Liveness vs Readiness (K8s)
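- Liveness: Is the container still alive and able to make progress? If the probe fails, the kubelet restarts the container.
- Readiness: Is the pod ready for traffic? If the probe fails, the pod is removed from Service endpoints, so the load balancer stops sending it requests; it is not restarted.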
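
The difference is easiest to see from the application side: the two probes answer different questions and should look at different state. A minimal sketch, assuming the service exposes each check over HTTP and that the flags below are maintained elsewhere (the class and attribute names are hypothetical):

```python
class ServiceHealth:
    """Tracks the state that the two probe handlers report on."""

    def __init__(self):
        self.deadlocked = False    # assumed to be set by an internal watchdog
        self.cache_warmed = False  # startup work still in progress
        self.db_connected = False  # dependency needed to serve requests

    def liveness(self) -> int:
        """HTTP status for the liveness probe: fail only if a restart would help."""
        return 500 if self.deadlocked else 200

    def readiness(self) -> int:
        """HTTP status for the readiness probe: fail while the service cannot serve."""
        ready = self.cache_warmed and self.db_connected and not self.deadlocked
        return 200 if ready else 503
```

Keeping dependency checks out of `liveness` is a deliberate choice: if the database is down, restarting every instance does not help, but marking them not ready stops traffic until the dependency recovers. K8s HTTP probes treat any status from 200 to 399 as success.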
Core Mechanism / Behavior
- Redundancy: Multiple copies; traffic distributed; failure of one does not stop the system.
- Failover: Heartbeats detect failure; an election or static configuration chooses the new primary; data sync keeps the new primary consistent with the old one, within the guarantees of the chosen consistency model.
- Isolation: Limit blast radius; one bad dependency does not exhaust shared resources.
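
The failover flow above can be sketched as a heartbeat monitor with priority-based promotion. This is a simplified model under stated assumptions: a single monitor process, a fixed priority order in place of a real election, and no handling of data sync or split-brain. The class and parameter names are illustrative.

```python
import time


class FailoverMonitor:
    def __init__(self, nodes, timeout=5.0, clock=time.monotonic):
        self.nodes = list(nodes)  # ordered by promotion priority
        self.timeout = timeout    # seconds without a heartbeat before a node is presumed dead
        self.clock = clock        # injectable for testing
        self.last_seen = {node: self.clock() for node in self.nodes}
        self.primary = self.nodes[0]

    def heartbeat(self, node):
        """Record that a node reported in."""
        self.last_seen[node] = self.clock()

    def is_alive(self, node) -> bool:
        return self.clock() - self.last_seen[node] <= self.timeout

    def check(self):
        """Promote the highest-priority live node if the primary missed its heartbeats."""
        if not self.is_alive(self.primary):
            for node in self.nodes:
                if self.is_alive(node):
                    self.primary = node
                    break
        return self.primary
```

The `timeout` value is the main trade-off: a short timeout shortens the switchover delay but risks promoting a standby while the old primary is only slow, which is where the consistency considerations come in. In practice the monitor itself must also be redundant, typically via a consensus-based coordinator.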
Key Rules
- Remove single points: Services, DB, registry, config center must be redundant; consider fault domain distribution.
- Detect and isolate quickly: Health checks, circuit breaker, rate limit; prevent one failure from affecting the whole chain.
- Recoverability: Runbooks, monitoring, alerts, drills; lower MTTR improves availability.
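
Two standard formulas quantify these rules. Steady-state availability is MTBF / (MTBF + MTTR), where MTBF (mean time between failures) is a term not used elsewhere in this article. For N replicas where any one healthy replica is enough, availability is 1 - (1 - a)^N, assuming failures are independent. A small sketch with illustrative function names:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to recover."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(single: float, replicas: int) -> float:
    """Availability of N replicas when one healthy replica suffices.

    Assumes replica failures are independent, which shared fault domains violate.
    """
    return 1 - (1 - single) ** replicas


if __name__ == "__main__":
    # Same failure rate, faster recovery: availability rises.
    print(availability(mtbf_hours=999, mttr_hours=1))
    print(availability(mtbf_hours=999, mttr_hours=0.1))
    # Two independent 99% replicas.
    print(redundant_availability(0.99, 2))
```

The independence assumption is the reason fault domain distribution matters: two replicas on the same rack fail together, so the second one adds far less than the formula suggests.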
What's Next
See Circuit Breaker, Load Balancing, Distributed Lock. See CAP and Consistency for replica strategy.