High Availability Basics

High availability (HA) means the system keeps serving requests despite failures. It is typically tracked through an availability percentage (e.g. 99.9%, 99.99%) and MTTR (mean time to recover). This article covers redundancy, failover, health checks, and isolation, with reference tables.

Overview

  • Redundancy: No single point of failure; run multiple instances of services, databases, and middleware. Common layouts: primary-replica, multi-active, multi-DC.
  • Failover: When the primary fails, a standby takes over. This requires heartbeats, election, and data sync, and it comes with switchover delay and consistency trade-offs.
  • Health check: Periodically probe services and their dependencies; remove unhealthy instances from rotation or raise an alert. Typical forms: liveness/readiness probes, TCP/HTTP probes.
  • Isolation: Prevent a failure from spreading. Techniques: thread pool isolation, circuit breaker, rate limiting, bulkhead. When a downstream dependency fails, fail fast instead of letting requests pile up and exhaust the caller's resources.
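Rate limiting, one of the isolation techniques listed above, can be sketched with a token bucket. This is a minimal single-threaded illustration, not a production limiter; the class name `TokenBucket` and its parameters `rate`, `capacity`, and `clock` are assumptions made for this example.

```python
import time


class TokenBucket:
    """Hypothetical token-bucket limiter: admits at most `rate` requests
    per second on average, with bursts of up to `capacity` requests."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum tokens the bucket can hold
        self.clock = clock          # injectable so tests can control time
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill according to elapsed time, never above capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A caller would check `bucket.allow()` before doing the work and reject the request immediately when it returns False, so excess load is shed instead of queued.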

Examples

Example 1: Availability percentages

Availability    Approx. yearly downtime
99%             3.65 days
99.9%           8.76 hours
99.99%          52.6 minutes
99.999%         5.26 minutes
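The downtime figures above follow directly from the availability percentage. A small sketch of the arithmetic, assuming a 365-day year; the function name `yearly_downtime_seconds` is made up for this example.

```python
def yearly_downtime_seconds(availability_percent, days_per_year=365.0):
    """Return the downtime budget per year, in seconds, for a given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    # Fraction of the year during which the system may be unavailable.
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * days_per_year * 24 * 60 * 60
```

For example, `yearly_downtime_seconds(99.9) / 3600` comes out at about 8.76 hours, and `yearly_downtime_seconds(99.99) / 60` at about 52.6 minutes.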

Example 2: Common techniques

Technique                        Role
Multi-instance                   Remove single point, load balance
Primary-replica / multi-active   Redundancy for data and service
Health check                     Detect failure, remove unhealthy instances
Circuit breaker                  Fail fast when downstream fails, avoid backlog
Rate limit                       Prevent overload from taking down the system
Degradation                      Prioritize core features; non-core can be turned off
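To make the circuit breaker entry concrete, here is a minimal sketch of the usual closed / open / half-open behavior. It is an illustration under simplifying assumptions (single-threaded, consecutive-failure counting), not the API of any particular library; `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `reset_timeout` are names chosen for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of waiting on a broken dependency.
                raise CircuitOpenError("downstream marked unhealthy")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers wrap each downstream call as `breaker.call(fetch_profile, user_id)` (where `fetch_profile` is a placeholder) and handle `CircuitOpenError` with a fallback or a degraded response.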

Example 3: Fault domains

  • Spread across racks, AZs, and DCs so a single rack or DC failure does not take everything down. Use K8s pod anti-affinity, multi-AZ, etc.
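The effect of spreading replicas across fault domains can be shown with a toy placement routine. This is only a sketch of the idea; a real scheduler weighs many more constraints, and the function names and zone labels here are hypothetical.

```python
from collections import defaultdict


def spread_replicas(replica_count, zones):
    """Assign replicas to zones round-robin so no zone holds more than its share."""
    if not zones:
        raise ValueError("at least one zone is required")
    placement = defaultdict(list)
    for i in range(replica_count):
        placement[zones[i % len(zones)]].append(f"replica-{i}")
    return dict(placement)


def survivors_after_zone_loss(placement, failed_zone):
    """List the replicas that remain when one whole zone goes down."""
    return [
        replica
        for zone, replicas in placement.items()
        if zone != failed_zone
        for replica in replicas
    ]
```

With three replicas over three zones, losing any single zone still leaves two replicas running.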

Example 4: Liveness vs Readiness (K8s)

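  • Liveness: Is the process alive? If the probe fails, the container is restarted.
  • Readiness: Is the process ready for traffic? If the probe fails, the pod is removed from load balancing until it passes again; it is not restarted.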
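On the application side, the two probes are often exposed as separate HTTP endpoints. The sketch below uses Python's standard `http.server`; the paths `/healthz` and `/readyz`, the `ready` flag, and the `serve` helper are conventions assumed for this example, not something Kubernetes mandates.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Set once warm-up (cache load, DB connection, etc.) has finished.
ready = threading.Event()


def probe_status(path, is_ready):
    """Map a probe path to an HTTP status code."""
    if path == "/healthz":
        # Liveness: the process is up and able to answer at all.
        return 200
    if path == "/readyz":
        # Readiness: only report OK once the service can take traffic.
        return 200 if is_ready else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(probe_status(self.path, ready.is_set()))
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the logs


def serve(port=8080):
    """Run the probe server; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), ProbeHandler).serve_forever()
```

The application would call `ready.set()` after initialization and `ready.clear()` when it needs to drain traffic, while `/healthz` keeps returning 200 as long as the process can respond.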

Core Mechanism / Behavior

  • Redundancy: Multiple copies; traffic distributed; failure of one does not stop the system.
  • Failover: Heartbeat detects failure; election or config chooses new primary; sync ensures data is consistent (within chosen model).
  • Isolation: Limit blast radius; one bad dependency does not exhaust shared resources.
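The failover flow above (heartbeat, detection, choosing a new primary) can be sketched as follows. This example picks the new primary from a fixed priority list, i.e. the "config" variant; it deliberately leaves out data sync, fencing, and split-brain protection, which a real system needs. The class `FailoverMonitor` and its parameters are hypothetical.

```python
import time


class FailoverMonitor:
    def __init__(self, nodes, timeout=5.0, clock=time.monotonic):
        self.nodes = list(nodes)    # priority order; first entry is the initial primary
        self.timeout = timeout      # seconds without a heartbeat before a node counts as down
        self.clock = clock          # injectable so tests can control time
        self.last_seen = {}
        self.primary = self.nodes[0]

    def heartbeat(self, node):
        """Record that `node` reported in just now."""
        self.last_seen[node] = self.clock()

    def is_alive(self, node):
        seen = self.last_seen.get(node)
        return seen is not None and self.clock() - seen <= self.timeout

    def check(self):
        """Return the current primary, promoting a standby if the primary timed out."""
        if self.is_alive(self.primary):
            return self.primary
        for node in self.nodes:
            if self.is_alive(node):
                self.primary = node
                break
        # If no node is alive, the old primary is kept and the caller should alert.
        return self.primary
```

The monitor is non-preemptive: once a standby has been promoted it stays primary even if the old primary comes back, which avoids flapping. The switchover delay is bounded below by `timeout` plus the interval at which `check()` runs.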

Key Rules

  • Remove single points: Services, DB, registry, config center must be redundant; consider fault domain distribution.
  • Detect and isolate quickly: Health checks, circuit breaker, rate limit; prevent one failure from affecting the whole chain.
  • Recoverability: Runbooks, monitoring, alerts, drills; lower MTTR improves availability.
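The link between MTTR and availability can be made explicit with the standard steady-state formula, availability = MTBF / (MTBF + MTTR). MTBF (mean time between failures) is not discussed in this article and is introduced here only for the calculation; the function name is likewise an assumption.

```python
def steady_state_availability(mtbf_hours, mttr_hours):
    """Return availability as a fraction, given mean time between failures
    and mean time to recover (both in hours)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)
```

With failures arriving at the same rate, cutting recovery time from 1 hour to 30 minutes raises availability: `steady_state_availability(999, 1)` is about 0.999, while `steady_state_availability(999, 0.5)` is higher.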

What's Next

See Circuit Breaker, Load Balancing, Distributed Lock. See CAP and Consistency for replica strategy.