High Availability Basics

High availability (HA) means the system keeps serving requests despite failures. It is usually expressed as an availability percentage (e.g. 99.9%, 99.99%) and tracked alongside MTTR (mean time to recover). This article covers redundancy, failover, health checks, and isolation, with reference tables for availability levels and common techniques.

Overview

Redundancy: No single point of failure; multiple instances for services, DB, middleware. Primary-replica, multi-active, multi-DC.
Failover: When the primary fails, a standby takes over. Requires heartbeat, election, data sync; has switchover delay and consistency considerations.
Health check: Periodically probe services and dependencies; remove unhealthy instances from rotation or raise an alert. Liveness/Readiness, TCP/HTTP probes.
Isolation: Prevent failure from spreading. Thread pool isolation, circuit breaker, rate limit, bulkhead. When a downstream dependency fails, fail fast instead of letting blocked requests pile up and exhaust the caller's resources.
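
As an illustration of the bulkhead idea, here is a minimal Python sketch. It assumes a thread-based service; the names Bulkhead and BulkheadFullError are hypothetical and not taken from any library. Each dependency gets a fixed number of slots, and callers are rejected immediately once the slots are used up, so one slow dependency cannot tie up every worker.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when every slot for a dependency is already in use."""


class Bulkhead:
    """Caps concurrent calls to one dependency (hypothetical helper)."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Non-blocking acquire: reject right away rather than queue up.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot; rejecting instead of queueing")
        try:
            return func(*args, **kwargs)
        finally:
            # Always give the slot back, even if func raised.
            self._slots.release()
```

A caller would typically create one Bulkhead per downstream dependency and treat BulkheadFullError as a signal to return a fallback or an error quickly.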

Examples

Example 1: Availability percentages

Availability	Approx. yearly downtime
99%	3.65 days
99.9%	8.76 hours
99.99%	52.6 minutes
99.999%	5.26 minutes
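
The figures in the table follow directly from the availability fraction. A short Python sketch, assuming a 365-day year (the function name yearly_downtime_hours is illustrative), reproduces them:

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year, as the table does


def yearly_downtime_hours(availability_percent: float) -> float:
    """Return the allowed downtime per year, in hours."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for pct in (99.0, 99.9, 99.99, 99.999):
    hours = yearly_downtime_hours(pct)
    print(f"{pct}% -> {hours:.2f} hours ({hours * 60:.1f} minutes)")
```

Each extra "nine" cuts the downtime budget by a factor of ten, which is why moving from 99.9% to 99.99% usually requires automated rather than manual recovery.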

Example 2: Common techniques

Technique	Role
Multi-instance	Remove single point, load balance
Primary-replica / multi-active	Redundancy for data and service
Health check	Detect failure, remove unhealthy instances
Circuit breaker	Fail fast when downstream fails, avoid backlog
Rate limit	Prevent overload from taking down the system
Degradation	Prioritize core features; non-core can be turned off
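
To make the circuit breaker row concrete, here is a minimal single-threaded Python sketch. The class names, the failure threshold, and the reset timeout are assumptions chosen for illustration; production libraries add locking, metrics, and finer-grained half-open handling.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("downstream marked unhealthy; failing fast")
            # Timeout elapsed: let this call through as a trial (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate CircuitOpenError instead of waiting on timeouts, which is what keeps requests from backing up.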

Example 3: Fault domains

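Spread instances across racks, availability zones (AZs), and data centers (DCs) so that the failure of a single rack, AZ, or DC does not take everything down. Kubernetes (K8s) pod anti-affinity and multi-AZ deployment are common ways to achieve this.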
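
The effect of spreading can be sketched in a few lines of Python. This is only an illustration of the placement idea, not how any scheduler works internally; the function spread_replicas and the zone names are hypothetical.

```python
from collections import Counter
from itertools import cycle


def spread_replicas(replica_count, zones):
    """Assign replicas to zones round-robin so zone counts differ by at most one."""
    if not zones:
        raise ValueError("at least one zone is required")
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(replica_count)]


placement = spread_replicas(5, ["az-a", "az-b", "az-c"])

# Simulate losing one fault domain and count what is left per zone.
survivors = Counter(zone for zone in placement if zone != "az-a")
print(placement, dict(survivors))
```

With five replicas over three zones, losing any one zone removes at most two replicas, so the remaining capacity has to be sized to carry the full load.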

Example 4: Liveness vs Readiness (K8s)

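Liveness: Is the process alive? If the probe keeps failing, the kubelet restarts the container.
Readiness: Is the process ready for traffic? If not, the Pod is removed from the Service endpoints (and therefore from load balancing) but is not restarted.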
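
On the application side, the two probes are usually served as separate HTTP endpoints. The following Python sketch uses only the standard library; the paths /healthz and /readyz, the port, and the STATE flag are assumptions for illustration, and the probe configuration itself lives in the Pod spec.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical flag: a real service would set it after warm-up
# (caches loaded, connections established) and clear it on shutdown.
STATE = {"ready": False}


def probe_status(path: str) -> int:
    """Map a probe path to an HTTP status code."""
    if path == "/healthz":   # liveness: the process can still answer
        return 200
    if path == "/readyz":    # readiness: safe to receive traffic right now
        return 200 if STATE["ready"] else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(probe_status(self.path))
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Blocking; call from the service's main entry point."""
    HTTPServer(("", port), ProbeHandler).serve_forever()
```

Keeping the liveness check cheap and free of dependency calls avoids restart loops when a downstream system is down; dependency state belongs in the readiness check.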

Core Mechanism / Behavior

Redundancy: Multiple copies; traffic distributed; failure of one does not stop the system.
Failover: Missed heartbeats signal failure; an election or static configuration chooses the new primary; replication keeps its data consistent within the chosen consistency model.
Isolation: Limit blast radius; one bad dependency does not exhaust shared resources.
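
The heartbeat and "config chooses new primary" parts of failover can be sketched as follows. This is a simplified, single-process Python illustration: HeartbeatMonitor, choose_primary, the timeout value, and the node names are all hypothetical. It leaves out data sync and does not guard against split brain, which real systems handle with quorum or fencing.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and flags nodes that go silent."""

    def __init__(self, timeout=5.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock       # injectable so tests can control time
        self.last_seen = {}      # node name -> time of last heartbeat

    def heartbeat(self, node):
        self.last_seen[node] = self.clock()

    def is_alive(self, node):
        seen = self.last_seen.get(node)
        return seen is not None and self.clock() - seen <= self.timeout


def choose_primary(monitor, current_primary, priority):
    """Keep the current primary while it is alive; otherwise
    promote the first live node in the configured priority order."""
    if monitor.is_alive(current_primary):
        return current_primary
    for node in priority:
        if monitor.is_alive(node):
            return node
    return None  # no healthy candidate left
```

The timeout is the main tuning knob: a short value shortens switchover delay but raises the risk of promoting a standby while the primary is only briefly slow.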

Key Rules

Remove single points: Services, DB, registry, config center must be redundant; consider fault domain distribution.
Detect and isolate quickly: Health checks, circuit breaker, rate limit; prevent one failure from affecting the whole chain.
Recoverability: Runbooks, monitoring, alerts, and drills shorten MTTR; for the same failure rate, a lower MTTR means higher availability.
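
The link between MTTR and availability can be made explicit with the common steady-state formula availability = MTBF / (MTBF + MTTR), where MTBF is taken here as the mean uptime between outages. The Python sketch below uses made-up numbers purely for illustration.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up, given mean uptime and mean repair time."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Same failure rate (one outage per 1000 hours of uptime), different recovery speed.
slow_recovery = steady_state_availability(1000, 1.0)   # 1 hour to recover
fast_recovery = steady_state_availability(1000, 0.1)   # 6 minutes to recover

print(f"{slow_recovery:.5%} vs {fast_recovery:.5%}")
```

In this example, cutting recovery time by a factor of ten gains roughly one extra "nine" without making failures any rarer.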

What's Next

See Circuit Breaker, Load Balancing, Distributed Lock. See CAP and Consistency for replica strategy.