Distributed Systems Patterns
Overview
Don't distribute unless you must. When you must, choose the lowest escalation level that meets requirements. Each level adds complexity, failure modes, and operational burden — only escalate when you've outgrown the current level.
Escalation Decision
code
Single node handles the load?
YES → Level 0 (don't distribute)
NO → Do you need shared state across nodes?
NO → Level 1 (stateless multi-node)
YES → Is eventual consistency acceptable?
YES → Is it just presence/registry?
YES → Level 2 (lightweight)
NO → Level 4 (CRDTs)
NO → Level 3 (Raft consensus)
Cluster > 50 nodes? → Level 5
See escalation-ladder.md for full details on each level with code examples and migration triggers.
Quick Reference
| Problem | Pattern | Tool/Library |
|---|---|---|
| Large clusters (>50 nodes) | Partisan overlay network | partisan |
| Strong consistency | Raft consensus | :ra, raft |
| Eventual consistency | CRDTs | delta_crdt, lasp |
| Leader election | Consensus or external coordinator | :ra, etcd |
| Service discovery | DNS, Consul, K8s | libcluster |
| Distributed state sync | PubSub | Phoenix.PubSub |
| Presence tracking | CRDT-based tracker | Phoenix.Tracker |
| Clock synchronization | Hybrid Logical Clocks | hlclock |
| Ephemeral distributed KV | LWW Registers + HLC | groot |
| Feature flags (distributed) | LWW Registers + HLC | rollout |
| Split-brain prevention | Quorum + fencing | :ra, external coordinator |
| Causal ordering | Vector clocks | hlclock (HLC) or manual (vector clocks) |
| Partition testing | Cookie-based node isolation | schism |
Common Mistakes
- •Distributing too early: Most Elixir apps never need distribution. A single BEAM node handles millions of processes. Start at Level 0.
- •Choosing Raft when CRDTs suffice: If eventual consistency is acceptable (shopping carts, presence, likes), CRDTs avoid all consensus complexity.
- •Ignoring asymmetric partitions: Most test tools simulate symmetric partitions. Real networks also produce one-way failures that violate Raft assumptions.
- •Single Global Process: Using a GenServer as a global cache works on one node but becomes a bottleneck and single point of failure on a cluster.
- •Trusting CP/AP labels: Many systems are neither strictly CP nor AP. Verify actual guarantees, don't trust vendor marketing.
Reference Files
- •
escalation-ladder.md— Full Distribution Escalation Ladder (Levels 0-5) with code examples - •
consensus.md— Raft via:ra, CRDTs viadelta_crdt, Multi-Raft, Vector Clocks, Paxos - •
clustering.md— Distributed Erlang, Partisan, libcluster configs, PubSub, debugging patterns - •
failure-modes.md— Split-brain, fencing tokens, clock drift, gray failures, metastable failures, limplocks, concurrency bugs, race conditions, blast radius, redundancy conditions - •
cap-tradeoffs.md— CAP theorem critique, fundamental limits, consistency models, when to use which approach - •
production-checklist.md— 20-item deployment checklist - •
resilience-principles.md— Architectural principles from Hebert, Brooker, Luu, Kleppmann, Cook et al.
Related Skills
- •elixir-patterns: GenServer, Supervisor, OTP patterns, overload management
- •production-quality: Monitoring, observability, error handling
- •phoenix-liveview: Phoenix.PubSub, Phoenix.Tracker
Use the distributed-systems-expert agent for deep analysis of distributed architectures, consensus algorithm selection, and distributed bug investigation.