
system-design-architect

Distributed systems design architect. CAP theorem, consistency models, caching patterns, sharding, message queues, circuit breakers, load balancing, and the numbers every engineer should know.

Installation

npx clawhub@latest install system-design-architect

View the full skill documentation and source below.

Documentation

System Design Architect

The Design Framework (Use for Every System)

1. Requirements (5-10 min)
   - Functional: What does the system DO?
   - Non-functional: Scale, latency, consistency, availability

2. Estimates (2-5 min)
   - QPS (queries per second) — read vs write ratio
   - Storage: data size * growth rate * retention
   - Bandwidth: QPS * payload size

3. High-Level Design (10-15 min)
   - API design: endpoints, request/response
   - Core components: draw the boxes
   - Data flow: how does a request travel?

4. Deep Dive (20-30 min)
   - Bottlenecks and how to solve them
   - Database choice and schema
   - Caching strategy
   - Failure modes and resilience

5. Trade-offs
   - What you chose and why
   - What you sacrificed
   - What you'd change with more time
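Step 2 of the framework is pure arithmetic. A back-of-envelope sketch for a hypothetical photo-sharing service (every input below is an assumed number, chosen only to show the method):

```python
# Back-of-envelope estimate for a hypothetical photo service.
SECONDS_PER_DAY = 86_400

daily_uploads = 100_000_000           # 100M photos/day (assumed)
read_write_ratio = 100                # 100 reads per write (assumed)
photo_size_bytes = 300 * 1024         # ~300 KB per photo
retention_years = 5

write_qps = daily_uploads / SECONDS_PER_DAY
read_qps = write_qps * read_write_ratio
storage_per_day = daily_uploads * photo_size_bytes
storage_total = storage_per_day * 365 * retention_years
bandwidth_in = write_qps * photo_size_bytes          # ingress bytes/s

print(f"Write QPS:    {write_qps:,.0f}")             # ~1,157
print(f"Read QPS:     {read_qps:,.0f}")              # ~115,741
print(f"Storage/day:  {storage_per_day / 1e12:.1f} TB")
print(f"5-yr storage: {storage_total / 1e15:.1f} PB")
```

The point is the shape of the calculation, not the inputs: QPS falls out of daily volume, storage out of size × growth × retention, bandwidth out of QPS × payload.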

Numbers Every Engineer Should Know

L1 cache hit:               1 ns
L2 cache hit:               4 ns
L3 cache hit:               10 ns
RAM read:                   100 ns
NVMe SSD read:              120 µs
Network (same datacenter):  500 µs
Disk seek (HDD):            10 ms
Network (cross-region):     50-150 ms

Order of magnitude:
1 server handles:           ~100,000 req/s (simple)
1 database handles:         ~10,000 QPS (Postgres, with indexes)
Redis:                      ~1,000,000 ops/s
1 Kafka partition:          ~100 MB/s throughput

Storage rough estimates:
1 char = 1 byte
1 tweet-sized object = 256 bytes
1 photo = 300 KB
1 minute HD video = 50 MB
1 TB = 10^12 bytes

Scale estimates:
Twitter: 500M tweets/day = ~6K tweets/s
YouTube: 500 hours of video uploaded per minute
Instagram: 100M photos/day = ~1,000 photos/s

CAP Theorem & Consistency Models

CAP: A distributed system can guarantee only 2 of 3:
  C (Consistency) — All nodes see the same data simultaneously
  A (Availability) — Every request gets a (non-error) response
  P (Partition tolerance) — System works when network partitions occur

Since network partitions ALWAYS happen, choose CP or AP:
  CP: Consistent + Partition-tolerant (accept downtime over stale data)
      Examples: HBase, ZooKeeper, Etcd, MongoDB
  AP: Available + Partition-tolerant (accept stale data over downtime)
      Examples: Cassandra, DynamoDB, CouchDB, DNS

Consistency Levels (Ordered Strong → Weak)

Level             | Description                                               | Use when
------------------|-----------------------------------------------------------|----------------------
Linearizable      | Like a single machine; reads always see the latest write  | Financial transactions
Sequential        | All nodes see the same order, not necessarily real-time   | Collaborative editing
Causal            | Causally related ops arrive in order                      | Social feeds
Eventual          | Nodes converge eventually                                 | DNS, social likes
Read-your-writes  | You always see your own writes                            | Profile updates
Monotonic reads   | Reads never go backwards in time                          | Social timeline

(Read-your-writes and monotonic reads are client-centric session guarantees rather than points on the strong → weak scale.)

Load Balancing

Algorithms:
  Round Robin     → simple, works if servers are identical
  Weighted RR     → different server capacities
  Least Connections → server with fewest active connections
  IP Hash         → same client → same server (session stickiness)
  Consistent Hashing → distribute with minimal redistribution on changes
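The consistent-hashing idea above can be sketched as a sorted ring of virtual-node hashes; a key is routed to the first vnode clockwise from its own hash. This is a minimal illustration (real implementations such as ketama differ in hash function and vnode count):

```python
# Minimal consistent-hash ring with virtual nodes (illustrative sketch).
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        self._ring = []                     # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                h = self._hash(f"{node}#{i}")
                bisect.insort(self._ring, (h, node))

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key: str) -> str:
        h = self._hash(key)
        # first vnode clockwise from the key's hash; wrap to the start
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
print(ring.get_node("user:42"))  # always the same server for this key
```

Because each node owns many small arcs of the ring, adding or removing a node only remaps the keys on that node's arcs instead of reshuffling everything.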

Layer 4 (Transport): Route by TCP/IP. Fast, no app visibility.
  Use for: TCP traffic, very high throughput

Layer 7 (Application): Route by HTTP headers, URL, cookies.
  Use for: HTTP services, A/B testing, canary deployments

Health checks:
  Active: LB sends probe requests every 5-30s
  Passive: Monitor actual traffic for errors
  
Session stickiness options:
  1. Stateless apps (best) — any server can handle any request
  2. Server-side sessions + shared cache (Redis)
  3. JWT tokens — client carries state, stateless servers

Caching Strategy

Cache Tiers

Browser → CDN → API Cache → Application Cache → Database

CDN (CloudFront, Fastly, Cloudflare):
  - Static assets: CSS, JS, images, videos
  - Cache-Control: max-age=31536000 (1 year) for versioned assets
  - Purge on deploy

Application Cache (Redis/Memcached):
  - Session data
  - API response cache
  - Computed results, leaderboards, counters

Database Query Cache:
  - Materialized views
  - Query result caching

Cache Patterns

Cache-Aside (Lazy Loading):
  1. Read from cache
  2. Miss: read from DB, write to cache, return
  3. TTL-based invalidation
  Good for: read-heavy, can tolerate stale data
  Problem: cache miss on first request, thundering herd

Write-Through:
  1. Write to cache AND DB synchronously
  Good for: consistency, no stale reads
  Problem: write latency, cache full of data never read

Write-Behind (Async):
  1. Write to cache immediately (return success)
  2. Async: flush to DB
  Good for: write performance
  Problem: data loss risk if cache fails

Read-Through:
  Cache sits in front of DB, handles misses
  Good for: simple app code (cache is transparent)

Refresh-Ahead:
  Proactively refresh before expiry
  Good for: predictable access patterns, critical data

# Cache-aside pattern (db below stands in for your data-access layer)
import redis
import json

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

def get_user(user_id: str) -> dict:
    cache_key = f"user:{user_id}"
    
    # Try cache first
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)
    
    # Cache miss — fetch from DB
    user = db.query_user(user_id)  # Slow
    if user:
        r.setex(cache_key, 3600, json.dumps(user))  # TTL: 1 hour
    
    return user

# Cache invalidation on write
def update_user(user_id: str, data: dict):
    db.update_user(user_id, data)
    r.delete(f"user:{user_id}")  # Invalidate cache
    
    # Or: r.set(f"user:{user_id}", json.dumps(updated_user), ex=3600)

Cache Busting Problems

Thundering herd: Cache expires, 10K requests hit DB simultaneously
  Solution: 
    - Cache lock: Only 1 request rebuilds cache, others wait
    - Jitter: Randomize TTL ±10% to stagger expiry
    - Cache warming: Proactively rebuild before expiry

Cache penetration: Request for non-existent data bypasses cache, hits DB every time
  Solution: Cache negative results (NULL) with short TTL (60s)
  
Cache stampede: Cache server restart / invalidation causes all caches to miss at once
  Solution: Circuit breaker, per-key rebuild locking, cache warming script
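The "cache lock" fix for the thundering herd can be shown in-process: only one caller rebuilds an expired entry while concurrent callers wait for its result. This is a single-machine sketch; a distributed version would use something like Redis SET NX with a short TTL as the lock.

```python
import threading
import time

_cache = {}            # key -> (value, expires_at)
_locks = {}            # key -> per-key rebuild lock
_locks_guard = threading.Lock()

def get_or_rebuild(key, rebuild, ttl=60):
    entry = _cache.get(key)
    if entry and entry[1] > time.time():
        return entry[0]                       # fresh hit, no lock needed
    with _locks_guard:
        lock = _locks.setdefault(key, threading.Lock())
    with lock:                                # only one rebuilder per key
        entry = _cache.get(key)               # re-check: another thread may
        if entry and entry[1] > time.time():  # have rebuilt while we waited
            return entry[0]
        value = rebuild()                     # the expensive DB/compute call
        _cache[key] = (value, time.time() + ttl)
        return value
```

The double-check inside the lock is the important part: threads that queued behind the rebuilder return the freshly cached value instead of rebuilding again.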

Database Design Patterns

Sharding Strategies

Horizontal Sharding (partitioning rows across databases):

1. Range-based sharding
   - User IDs 1 to 1M → Shard 1, 1M+1 to 2M → Shard 2
   - Problem: Hot spots if range is popular
   
2. Hash-based sharding
   - shard = hash(user_id) % num_shards
   - Problem: Resharding requires migrating data
   
3. Consistent hashing
   - Virtual nodes on a ring
   - Adding a shard only moves 1/N data
   - Used by: DynamoDB, Cassandra, Redis Cluster
   
4. Directory-based sharding
   - Lookup table: user_id → shard_id
   - Flexible, but lookup table is a bottleneck

Vertical sharding: Split by table/feature (users in DB1, orders in DB2)
  Good for: feature isolation, different scaling needs
  Problem: Cross-shard joins are expensive (do in app layer)
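The resharding pain of hash-based sharding (strategy 2 above) is easy to demonstrate: growing from 4 to 5 shards with plain modulo remaps about 80% of keys, which is exactly what consistent hashing avoids.

```python
# Why naive modulo sharding is painful to reshard.
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    # stable hash (Python's built-in hash() is salted per process)
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return h % num_shards

keys = [f"user:{i}" for i in range(10_000)]
moved = sum(1 for k in keys if shard_for(k, 4) != shard_for(k, 5))
print(f"{moved / len(keys):.0%} of keys move going from 4 to 5 shards")  # roughly 80%
```

With consistent hashing, the same change would move only about 1/5 of the keys: the ones claimed by the new shard's arcs.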

Read Replicas

Primary (read + write) → Replica 1 (read only)
                        → Replica 2 (read only)
                        → Replica 3 (read only + analytics)

Route reads to replicas:
  - Most reads go to replicas
  - Writes always go to primary
  - Inconsistency window: replication lag (usually <1s)
  
Use read-your-writes routing:
  - After a write, route that user's reads to primary for a short time
  - Or: read from primary for operations that must see latest data
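The read-your-writes routing above can be sketched as a router that pins a user's reads to the primary for a short window after each write. The 1-second window is an assumed bound on replication lag, and the servers here are just labels:

```python
import time

class ReplicaRouter:
    def __init__(self, primary, replicas, pin_seconds=1.0):
        self.primary = primary
        self.replicas = replicas
        self.pin_seconds = pin_seconds
        self._last_write = {}                 # user_id -> last write time

    def for_write(self, user_id):
        self._last_write[user_id] = time.time()
        return self.primary                   # writes always hit primary

    def for_read(self, user_id):
        wrote_at = self._last_write.get(user_id, 0.0)
        if time.time() - wrote_at < self.pin_seconds:
            return self.primary               # still inside the lag window
        # trivially spread reads; real routing would weigh health and load
        return self.replicas[hash(user_id) % len(self.replicas)]
```

Only the writing user pays the primary-read cost; everyone else keeps reading from replicas.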

Message Queues and Event-Driven Architecture

Point-to-point queue (RabbitMQ, SQS):
  Producer → [Queue] → Consumer (one consumer per message)
  Use for: task distribution, work queues, one-time processing

Pub/Sub topic (Kafka, SNS, Google Pub/Sub):
  Producer → [Topic] → Consumer Group 1 (all messages)
                     → Consumer Group 2 (all messages)
  Use for: event streaming, audit logs, multiple consumers

Kafka key concepts:
  Topic: named stream of events
  Partition: parallel lane within a topic (ordered within partition)
  Consumer group: set of consumers sharing partition processing
  Offset: position within a partition (Kafka stores until retention)
  
Ordering guarantee:
  - Within a partition: always ordered
  - Across partitions: no ordering (use single partition for strict ordering)
  - To keep related events together: partition by entity ID (user_id, order_id)
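Partitioning by entity ID boils down to hashing the key into a partition number, so every event for one entity lands in the same (ordered) partition. Kafka clients do this automatically when a message key is set; this sketch just shows the idea, with a made-up partition count:

```python
import hashlib

NUM_PARTITIONS = 12     # assumed topic configuration

def partition_for(entity_id: str) -> int:
    # stable hash of the key -> partition index
    digest = hashlib.md5(entity_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Every event for order-7841 maps to the same partition:
events = ["created", "paid", "shipped"]
parts = {partition_for("order-7841") for _ in events}
print(parts)  # a single partition number
```

Changing NUM_PARTITIONS remaps keys, which is why growing a keyed topic's partition count breaks per-key ordering for in-flight data.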

Event-Driven Patterns

Event Sourcing:
  Store ALL events, not just current state
  State = replay of all events from beginning
  
  + Complete audit trail
  + Easy to rebuild state at any point in time
  + Event-driven by design
  - Replay can be slow (snapshots solve this)
  - Schema evolution is hard
  
CQRS (Command Query Responsibility Segregation):
  Commands → Write Model (normalized, consistent)
  Queries  → Read Model (denormalized, fast, eventual)
  
  + Read model optimized for queries
  + Write model optimized for consistency
  - Eventual consistency between models
  - More complexity
  
Saga Pattern (distributed transactions):
  Choreography: Each service emits events, others react
  Orchestration: Central coordinator tells services what to do
  
  Order Service → emit OrderCreated
               ← Inventory Service reserves items
               ← Payment Service charges card
               ← Shipping Service creates shipment
               
  If any step fails: compensating transactions to undo
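The orchestration flavor of the saga above fits in a few lines: run each step's action in order, and on failure run the compensations of completed steps in reverse. The service calls here are stand-in lambdas, not real APIs:

```python
class SagaFailed(Exception):
    pass

def run_saga(steps):
    """steps: list of (action, compensation) pairs, run in order."""
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception as exc:
        for compensate in reversed(done):     # undo completed steps in reverse
            compensate()
        raise SagaFailed(f"rolled back after: {exc}") from exc

log = []

def fail_shipping():
    raise RuntimeError("no courier available")

steps = [
    (lambda: log.append("reserve items"), lambda: log.append("release items")),
    (lambda: log.append("charge card"),   lambda: log.append("refund card")),
    (fail_shipping,                       lambda: log.append("cancel shipment")),
]
try:
    run_saga(steps)
except SagaFailed:
    pass
print(log)  # ['reserve items', 'charge card', 'refund card', 'release items']
```

Note the compensation of the failed step itself never runs; only steps that completed get undone.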

Circuit Breaker Pattern

from enum import Enum
import time
from threading import Lock

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call while the circuit is open."""

class State(Enum):
    CLOSED = "closed"       # Normal operation, requests flow
    OPEN = "open"           # Failing, reject requests fast
    HALF_OPEN = "half_open" # Testing recovery

class CircuitBreaker:
    def __init__(self, failure_threshold=5, success_threshold=2, timeout=60):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout = timeout
        
        self.state = State.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self._lock = Lock()
    
    def call(self, func, *args, **kwargs):
        with self._lock:
            if self.state == State.OPEN:
                if time.time() - self.last_failure_time > self.timeout:
                    self.state = State.HALF_OPEN
                    self.success_count = 0  # start counting trial successes fresh
                else:
                    raise CircuitOpenError("Circuit is open")
        
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise
    
    def _on_success(self):
        with self._lock:
            if self.state == State.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.success_threshold:
                    self.state = State.CLOSED
                    self.failure_count = 0
    
    def _on_failure(self):
        with self._lock:
            self.failure_count += 1
            self.last_failure_time = time.time()
            # A single failure during the half-open trial reopens immediately
            if self.state == State.HALF_OPEN or self.failure_count >= self.failure_threshold:
                self.state = State.OPEN

Design Patterns for Common Systems

URL Shortener (bit.ly)

Key decisions:
  1. ID generation: Base62 encode auto-increment OR random 7-char string
  2. Storage: Redis for hot links, SQL for persistence
  3. Redirect: 301 (permanent, cached by browser) vs 302 (temporary, track every click)
  4. Scale: 100M URLs, 10B redirects/day = ~115K redirects/s
  
Schema:
  short_code TEXT PRIMARY KEY
  long_url TEXT
  created_at TIMESTAMP
  expires_at TIMESTAMP
  click_count INTEGER  -- eventually consistent counter
  
Bottleneck: 115K reads/s → Redis cache with LRU eviction
  Cache key: short_code → long_url
  Cache hit ratio goal: 95%+ (Zipf distribution: 20% of links get 80% traffic)
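The first ID scheme above (Base62-encode an auto-increment ID) is a short function pair; 62^7 ≈ 3.5 trillion codes, far more than 100M URLs need:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_base62(n: int) -> str:
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)        # peel off base-62 digits, least first
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

def decode_base62(s: str) -> int:
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n

print(encode_base62(125))            # "21"
```

Sequential IDs make codes guessable; if that matters, the random 7-char alternative (or encrypting the counter before encoding) avoids it.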

Rate Limiter

Token bucket: Most common
  - Bucket starts with N tokens
  - Refill at rate R tokens/second
  - Request consumes 1 token; reject if empty
  Implementation: Redis + Lua script (atomic)
  
Fixed window: Simple
  - Count requests per minute window
  - Problem: burst at window boundary (2x limit in 2s)
  
Sliding window log:
  - Store timestamp of each request
  - Count requests in last N seconds
  - Problem: memory-intensive (1 entry per request)
  
Sliding window counter:
  - Interpolate between two fixed windows
  - Balance of accuracy and memory

Redis Lua sketch (atomic check-and-decrement; a full token bucket would also refill tokens based on elapsed time):
  local tokens = tonumber(redis.call('GET', KEYS[1]) or ARGV[1])
  if tokens > 0 then
    redis.call('SET', KEYS[1], tokens - 1, 'EX', ARGV[2])
    return 1  -- allowed
  end
  return 0   -- rejected
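An in-process sketch of the full token-bucket algorithm, including the refill step: tokens accrue at `rate` per second up to `capacity`, computed lazily on each call.

```python
import time

class TokenBucket:
    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity
        self.rate = rate                  # tokens added per second
        self.tokens = capacity            # start full, allowing a burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # lazy refill since the last call, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(capacity=5, rate=1)   # burst of 5, then 1 req/s
print([bucket.allow() for _ in range(7)])  # first 5 True, then False
```

A production version would live in a Redis Lua script so the read-refill-decrement sequence stays atomic across app servers.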

Reliability Patterns

SLA/SLO/SLI:
  SLI: The measurement itself (e.g., fraction of requests served in <200ms)
  SLO: Internal target on that SLI (e.g., 99.5% of requests in <200ms, measured monthly)
  SLA: External contract with consequences (credits, penalties) if the SLO is missed

Error budget: 
  99.9% availability = 8.7 hours downtime/year
  99.99% = 52 minutes/year
  Use error budget to decide when to ship new features vs focus on reliability
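The yearly figures above come from one multiplication; the same arithmetic per 30-day month, which is how budgets are usually tracked:

```python
def error_budget_minutes(availability: float, days: int = 30) -> float:
    # total minutes in the window times the allowed failure fraction
    return days * 24 * 60 * (1 - availability)

print(f"99.9%:  {error_budget_minutes(0.999):.1f} min per 30-day month")   # 43.2
print(f"99.99%: {error_budget_minutes(0.9999):.1f} min per 30-day month")  # 4.3
```

Each extra nine divides the budget by ten, which is why 99.99% leaves no room for slow manual incident response.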

Bulkhead pattern: Isolate failures
  - Separate thread pools per downstream dependency
  - Database timeouts can't exhaust HTTP thread pool
  
Retry with exponential backoff:
  wait = min(cap, base * 2^attempt) + jitter
  jitter prevents synchronized retries from causing thundering herd
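The backoff formula above as a small retry helper, using "full jitter" (sleep a uniform random amount up to the exponential cap):

```python
import random
import time

def retry(func, attempts=5, base=0.1, cap=10.0):
    """Call func, retrying with capped exponential backoff plus full jitter."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise                     # out of attempts: surface the error
            wait = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(wait)

calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(retry(flaky, base=0.01))  # "ok" after two failed attempts
```

In real code you would retry only on errors known to be transient (timeouts, 503s), never on 4xx-style failures that will just fail again.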
  
Health checks:
  Shallow: "Am I up?" → responds 200 (load balancer check)
  Deep: "Are my dependencies up?" → checks DB, cache, downstream (startup check only)