System Architecture Expert
When to use this Skill
Use this Skill when:
- •Designing distributed systems
- •Writing system design documentation
- •Preparing for system design interviews
- •Creating architecture diagrams
- •Analyzing trade-offs between design choices
- •Reviewing or improving existing system designs
System Design Framework
1. Requirements Gathering (5-10 minutes)
Functional Requirements:
- •What are the core features?
- •What actions can users perform?
- •What are the inputs and outputs?
Non-Functional Requirements:
- •Scale: How many users? How much data?
- •Performance: Latency requirements? (p50, p95, p99)
- •Availability: What uptime is needed? (99.9%, 99.99%)
- •Consistency: Strong or eventual consistency?
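An uptime target is easier to reason about as a downtime budget. The sketch below converts the availability percentages listed above into allowed downtime per year; the function name and the 365-day year are assumptions made for illustration.

```python
# Convert an availability target into a yearly downtime budget.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: ignore leap years


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Return the maximum downtime per year, in seconds, for an uptime target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99):
        hours = allowed_downtime_seconds(target) / 3600
        print(f"{target}% uptime allows about {hours:.2f} hours of downtime per year")
```

Each extra "nine" cuts the budget by a factor of ten, which is why the jump from 99.9% to 99.99% usually changes the architecture and not just the configuration.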
Constraints:
- •Budget limitations
- •Technology stack constraints
- •Team expertise
- •Timeline
Example Questions:
- How many daily active users?
- What's the read:write ratio?
- What's the average data size?
- What's the peak load vs average load?
- Do we need real-time updates?
- Can we have data loss?
2. Capacity Estimation (Back-of-the-envelope)
Calculate:
Traffic:
- DAU = 100M users
- Each user makes 10 requests/day
- QPS = 100M * 10 / 86,400 ≈ 11,574 QPS
- Peak QPS = 2-3x average ≈ 30,000 QPS
Storage:
- 100M users * 1KB of new data per user per day = 100GB/day
- With 3x replication = 300GB/day
- Growth: 300GB/day * 365 days = 109.5TB/year
Bandwidth:
- QPS * average request size
- 11,574 * 10KB ≈ 115.74MB/s
Memory/Cache:
- •80-20 rule: 20% of data gets 80% of traffic
- •Cache = 20% of total data for hot data
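The arithmetic above is easy to get wrong by a factor of 1,000, so it can help to script it. The following is a minimal sketch using the example figures from this section. The function name, the 2.5x peak factor (midpoint of the 2-3x range), and the reading of "1KB per user" as 1KB of new data per user per day are assumptions, not part of the original template.

```python
SECONDS_PER_DAY = 86_400


def estimate_capacity(
    dau=100_000_000,                 # daily active users
    requests_per_user_per_day=10,
    avg_request_kb=10,
    new_data_per_user_kb_per_day=1,  # ASSUMPTION: 1 KB of new data per user per day
    replication_factor=3,
    peak_factor=2.5,                 # ASSUMPTION: midpoint of the 2-3x rule of thumb
    hot_fraction=0.20,               # 80-20 rule: cache the hottest 20%
):
    """Back-of-the-envelope estimates using decimal units (1 GB = 1,000,000 KB)."""
    avg_qps = dau * requests_per_user_per_day / SECONDS_PER_DAY
    raw_daily_gb = dau * new_data_per_user_kb_per_day / 1_000_000
    replicated_daily_gb = raw_daily_gb * replication_factor
    return {
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * peak_factor,
        "bandwidth_mb_per_s": avg_qps * avg_request_kb / 1_000,
        "storage_gb_per_day": replicated_daily_gb,
        "storage_tb_per_year": replicated_daily_gb * 365 / 1_000,
        "cache_gb": raw_daily_gb * hot_fraction,  # hot set of one day's unreplicated data
    }


if __name__ == "__main__":
    for name, value in estimate_capacity().items():
        print(f"{name}: {value:,.2f}")
```

Keeping every input as a named parameter makes it cheap to rerun the estimate when the interviewer or product owner changes an assumption.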
3. High-Level Design
Core Components:
- •Client Layer (Web, Mobile, Desktop)
- •API Gateway / Load Balancer
- •Application Servers (Business logic)
- •Cache Layer (Redis, Memcached)
- •Database (SQL, NoSQL, or both)
- •Message Queue (Kafka, RabbitMQ)
- •Object Storage (S3, GCS)
- •CDN (CloudFront, Akamai)
Draw Architecture:
[Clients] → [CDN]
↓
[Load Balancer]
↓
[Application Servers]
↙ ↓ ↘
[Cache] [DB] [Queue] → [Workers]
↓
[Object Storage]
4. Database Design
SQL vs NoSQL Decision:
Use SQL when:
- •ACID transactions required
- •Complex queries with JOINs
- •Structured data with relationships
- •Examples: PostgreSQL, MySQL
Use NoSQL when:
- •Massive scale (horizontal scaling)
- •Flexible schema
- •High write throughput
- •Examples: Cassandra, DynamoDB, MongoDB
Sharding Strategy:
- •Hash-based: user_id % num_shards
- •Range-based: Users 1-100M on shard 1
- •Geographic: US users on US shard
- •Consistent hashing: For even distribution
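To make the difference between hash-based sharding and consistent hashing concrete, here is a minimal sketch of both. The class name `ConsistentHashRing`, the shard names, the choice of SHA-256, and the 100 virtual nodes per shard are illustrative assumptions; production systems typically rely on a library or the datastore's built-in partitioner.

```python
import bisect
import hashlib


def hash_shard(user_id: int, num_shards: int) -> int:
    """Hash-based sharding: simple, but changing num_shards remaps most keys."""
    return user_id % num_shards


def _hash(key: str) -> int:
    # Stable 64-bit hash (Python's built-in hash() is randomized per process).
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class ConsistentHashRing:
    """Hypothetical ring with virtual nodes for a more even key distribution."""

    def __init__(self, nodes, vnodes: int = 100):
        self._ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add_node(node, vnodes)

    def add_node(self, node: str, vnodes: int = 100) -> None:
        for i in range(vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get_node(self, key: str) -> str:
        if not self._ring:
            raise ValueError("ring has no nodes")
        # First point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


if __name__ == "__main__":
    ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
    print(ring.get_node("user:42"))
```

When a shard is removed from the ring, only the keys that lived on that shard move; with `user_id % num_shards`, changing the shard count reshuffles nearly every key.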
Schema Design:
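-- Example: URL Shortener (PostgreSQL syntax)
CREATE TABLE urls (
    id BIGSERIAL PRIMARY KEY,
    short_url VARCHAR(10) UNIQUE NOT NULL, -- UNIQUE already creates an index
    long_url TEXT NOT NULL,
    user_id BIGINT,
    created_at TIMESTAMP DEFAULT NOW(),
    expires_at TIMESTAMP,
    click_count INT DEFAULT 0
);
-- PostgreSQL does not accept inline INDEX clauses; create secondary indexes separately.
CREATE INDEX idx_urls_user_id ON urls (user_id);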
5. Deep Dive Components
Caching Strategy:
- •Cache-Aside: App reads from cache, loads from DB on miss
- •Write-Through: Write to cache and DB together
- •Write-Behind: Write to cache, async write to DB
Eviction Policies:
- •LRU (Least Recently Used) - Most common
- •LFU (Least Frequently Used)
- •TTL (Time To Live)
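The cache-aside pattern combined with LRU eviction can be sketched in a few lines. This is an in-process illustration only: `LRUCache`, `cache_aside_get`, and the `load_from_db` callback are hypothetical names, and a real deployment would usually put Redis or Memcached behind the same interface.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


def cache_aside_get(key, cache: LRUCache, load_from_db):
    """Cache-aside read: try the cache, fall back to the DB, then populate the cache.

    Simplification: a stored value of None is indistinguishable from a miss.
    """
    value = cache.get(key)
    if value is None:
        value = load_from_db(key)
        cache.put(key, value)
    return value


if __name__ == "__main__":
    cache = LRUCache(capacity=2)
    print(cache_aside_get("user:1", cache, lambda k: {"id": k}))
```

Write-through and write-behind differ only on the write path: the first updates the cache and DB in the same operation, the second queues the DB write and accepts a window of possible data loss.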
Load Balancing:
- •Round Robin: Simple, equal distribution
- •Least Connections: Route to least busy server
- •Consistent Hashing: Minimize redistribution
- •Weighted: Based on server capacity
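The first two strategies above differ mainly in how much state the balancer keeps. A minimal sketch, with hypothetical class names and server labels, assuming the caller reports when a request finishes:

```python
import itertools


class RoundRobinBalancer:
    """Stateless rotation: each server gets the next request in turn."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))

    def pick(self) -> str:
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Tracks in-flight requests and routes to the least busy server."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self) -> str:
        # Ties are broken by insertion order of the servers.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server: str) -> None:
        # Call when the request completes so the count stays accurate.
        self.active[server] -= 1


if __name__ == "__main__":
    rr = RoundRobinBalancer(["app-1", "app-2", "app-3"])
    print([rr.pick() for _ in range(4)])
```

Round robin works well when requests cost roughly the same; least connections is a better fit when request durations vary widely, at the cost of tracking per-server state.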
Message Queue Patterns:
- •Pub/Sub: One-to-many (notifications)
- •Work Queue: Task distribution (job processing)
- •Fan-out: Broadcast to multiple queues
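The key distinction between these patterns is who receives each message. The in-memory sketch below contrasts pub/sub (every subscriber sees every message) with a work queue (each task goes to exactly one consumer). `PubSub` and `WorkQueue` are hypothetical toy classes, not the Kafka or RabbitMQ APIs.

```python
from collections import deque


class PubSub:
    """One-to-many: every subscriber of a topic receives every message."""

    def __init__(self):
        self._subscribers = {}  # topic -> list of handler callables

    def subscribe(self, topic: str, handler) -> None:
        self._subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic: str, message) -> int:
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)  # number of deliveries


class WorkQueue:
    """Task distribution: each task is handed to exactly one consumer."""

    def __init__(self):
        self._tasks = deque()

    def enqueue(self, task) -> None:
        self._tasks.append(task)

    def dequeue(self):
        # Returns None when the queue is empty.
        return self._tasks.popleft() if self._tasks else None


if __name__ == "__main__":
    bus = PubSub()
    bus.subscribe("order.created", lambda m: print("email:", m))
    bus.subscribe("order.created", lambda m: print("push:", m))
    bus.publish("order.created", {"order_id": 1})
```

Fan-out to multiple queues combines the two: the broker copies each message into several work queues, and each queue then distributes its copy to one worker.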
6. Scalability Patterns
Horizontal Scaling:
- •Add more servers
- •Use load balancers
- •Stateless application servers
- •Session stored in cache/DB
Vertical Scaling:
- •Add more CPU/RAM to servers
- •Limited by hardware
- •Simpler but has limits
Microservices:
Monolith:
[Single App] → [DB]
Microservices:
[User Service] → [User DB]
[Post Service] → [Post DB]
[Feed Service] → [Feed DB]
Benefits:
- •Independent scaling
- •Technology flexibility
- •Fault isolation
Drawbacks:
- •Increased complexity
- •Network latency
- •Distributed transactions
7. Reliability & Availability
Replication:
- •Master-Slave: One writer, multiple readers
- •Master-Master: Multiple writers (conflict resolution needed)
- •Multi-region: Geographic redundancy
Failover:
- •Active-Passive: Standby server takes over
- •Active-Active: Both servers handle traffic
Rate Limiting:
- •Token bucket algorithm
- •Leaky bucket algorithm
- •Fixed window counter
- •Sliding window log
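Of the algorithms listed, the token bucket is the one most often asked about because it allows short bursts while enforcing an average rate. A minimal single-process sketch follows; the class name, parameter names, and injectable `clock` are assumptions for illustration, and a distributed limiter would keep this state in a shared store such as Redis.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained rate of `rate` tokens/second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock          # injectable so tests can control time
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill based on elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=5, capacity=10)
    allowed = sum(bucket.allow() for _ in range(20))
    print(f"{allowed} of 20 burst requests allowed")
```

A leaky bucket smooths output to a constant rate instead of permitting bursts, while fixed-window counters are cheaper but let up to twice the limit through at window boundaries.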
Circuit Breaker:
States:
Closed → Normal operation
Open → Reject requests immediately
Half-Open → Test if service recovered
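The three states map onto a small state machine. The sketch below is one possible implementation under stated assumptions: the class and exception names, the consecutive-failure threshold, and the fixed recovery timeout are all illustrative choices, and it is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, recovery_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock  # injectable so tests can control time
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at >= self.recovery_timeout:
                self.state = "half_open"  # let one trial request through
            else:
                raise CircuitOpenError("circuit is open; request rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self.state = "closed"
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        # A failed trial request, or too many consecutive failures, opens the circuit.
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()
```

Wrap each outbound dependency in its own breaker so that one failing service cannot exhaust threads or connections needed by healthy ones.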
8. Common System Design Patterns
Content Delivery:
- •Use CDN for static assets
- •Geo-distributed edge servers
- •Cache at edge locations
Data Consistency:
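- •Strong Consistency: Every read reflects the latest committed write (typical of single-leader ACID databases)
- •Eventual Consistency: Replicas converge over time, so reads may briefly return stale data (BASE)
- •CAP Theorem: Network partitions cannot be ruled out in a distributed system, so during a partition you choose between Consistency and Availability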
API Design:
RESTful:
GET /api/users/{id}
POST /api/users
PUT /api/users/{id}
DELETE /api/users/{id}
GraphQL:
query {
user(id: "123") {
name
posts {
title
}
}
}
9. System Design Template
Use this structure (based on system_design/00_template.md):
# {System Name}
## 1. Requirements
### Functional
- [List core features]
### Non-Functional
- Scale: [Users, QPS, Data]
- Performance: [Latency requirements]
- Availability: [Uptime target]
## 2. Capacity Estimation
- Traffic: [QPS calculations]
- Storage: [Data size, growth]
- Bandwidth: [Network requirements]
## 3. API Design
[endpoint] - [description]
## 4. High-Level Architecture
[Diagram]
## 5. Database Schema
[Tables and relationships]
## 6. Detailed Design
### Component 1
[Deep dive]
### Component 2
[Deep dive]
## 7. Scalability
[How to scale each component]
## 8. Trade-offs
[Decisions and alternatives]
10. Real-World Examples
Reference case studies in system_design/:
- •Netflix: Video streaming, recommendation
- •Twitter: Timeline, tweet storage, trending
- •Uber: Real-time matching, location tracking
- •Instagram: Image storage, feed generation
- •WhatsApp: Message delivery, presence
Common Patterns:
- •News Feed: Fan-out on write vs fan-out on read
- •Rate Limiter: Token bucket with Redis
- •URL Shortener: Base62 encoding, hash collision
- •Chat System: WebSocket, message queue
- •Notification: Push notification service, APNs/FCM
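For the URL shortener pattern, Base62 encoding of an auto-increment ID avoids hash collisions entirely, since each ID maps to a unique short code. A minimal sketch, assuming a digits-lowercase-uppercase alphabet (the ordering is a convention, not a standard) and hypothetical function names:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)  # 62


def base62_encode(number: int) -> str:
    """Encode a non-negative integer (e.g. a database row ID) as a short code."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, BASE)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def base62_decode(code: str) -> int:
    """Inverse of base62_encode; raises ValueError on characters outside ALPHABET."""
    number = 0
    for char in code:
        number = number * BASE + ALPHABET.index(char)
    return number


if __name__ == "__main__":
    row_id = 2_009_215_674_938
    code = base62_encode(row_id)
    print(code, base62_decode(code) == row_id)
```

Seven Base62 characters cover 62^7 (about 3.5 trillion) IDs. The trade-off is that sequential IDs produce guessable codes, which is why some designs hash or shuffle the ID first and then have to handle collisions.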
Interview Tips
Time Management:
- •Requirements: 10%
- •High-level design: 25%
- •Deep dive: 50%
- •Wrap up: 15%
Communication:
- •Think out loud
- •Ask clarifying questions
- •Discuss trade-offs
- •Acknowledge limitations
What interviewers look for:
- •Problem-solving approach
- •Technical depth
- •Trade-off analysis
- •Scale awareness
- •Communication skills
Common Mistakes to Avoid
- •Jumping to solution without requirements
- •Over-engineering simple problems
- •Under-estimating scale requirements
- •Ignoring single points of failure
- •Not considering monitoring/alerting
- •Forgetting about data consistency
- •Missing security considerations
Project Context
- •Templates in system_design/00_template.md
- •Case studies in system_design/*.md
- •Reference materials in doc/system_design/
- •Follow the established documentation pattern