
system-design-architect

Design scalable, reliable distributed systems. Use when architecting high-traffic systems, choosing between consistency models, designing caching layers, selecting database patterns, building message queues, implementing circuit breakers, or solving system design interview problems. Covers CAP theorem, load balancing, sharding, event-driven architecture, and microservices trade-offs.

System Design Architect

The Design Framework (Use for Every System)

1. Requirements (5-10 min)
   - Functional: What does the system DO?
   - Non-functional: Scale, latency, consistency, availability

2. Estimates (2-5 min)
   - QPS (queries per second) — read vs write ratio
   - Storage: data size * growth rate * retention
   - Bandwidth: QPS * payload size

3. High-Level Design (10-15 min)
   - API design: endpoints, request/response
   - Core components: draw the boxes
   - Data flow: how does a request travel?

4. Deep Dive (20-30 min)
   - Bottlenecks and how to solve them
   - Database choice and schema
   - Caching strategy
   - Failure modes and resilience

5. Trade-offs
   - What you chose and why
   - What you sacrificed
   - What you'd change with more time
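
The estimate step above is plain arithmetic; here is a minimal sketch for a hypothetical tweet-like service (the traffic numbers, read/write ratio, and retention period are all assumed for illustration):

```python
# Back-of-envelope estimation for a hypothetical tweet-like service.
SECONDS_PER_DAY = 86_400

writes_per_day = 500_000_000     # assumed: 500M posts/day
read_write_ratio = 100           # assumed: 100 reads per write
object_size_bytes = 256          # tweet-sized object
retention_years = 5

write_qps = writes_per_day / SECONDS_PER_DAY
read_qps = write_qps * read_write_ratio
storage_bytes = writes_per_day * object_size_bytes * 365 * retention_years
bandwidth_in = write_qps * object_size_bytes  # bytes/s ingress

print(f"write QPS: {write_qps:,.0f}")      # 5,787
print(f"read QPS:  {read_qps:,.0f}")       # 578,704
print(f"storage:   {storage_bytes / 1e12:.0f} TB over {retention_years} years")  # 234 TB
print(f"ingress:   {bandwidth_in / 1e6:.1f} MB/s")  # 1.5 MB/s
```

The point is the shape of the calculation, not the inputs: interviewers expect you to round aggressively (86,400 s/day → ~100K) and sanity-check each result against the "numbers every engineer should know" below.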

Numbers Every Engineer Should Know

L1 cache hit:               1 ns
L2 cache hit:               4 ns
L3 cache hit:               10 ns
RAM read:                   100 ns
NVMe SSD read:              120 µs
Network (same datacenter):  500 µs
Network (cross-region):     50-150 ms
Disk seek (HDD):            10 ms

Order of magnitude:
1 server handles:           ~100,000 req/s (simple)
1 database handles:         ~10,000 QPS (Postgres, with indexes)
Redis:                      ~1,000,000 ops/s
1 Kafka partition:          ~100 MB/s throughput

Storage rough estimates:
1 char = 1 byte
1 tweet-sized object = 256 bytes
1 photo = 300 KB
1 minute HD video = 50 MB
1 TB = 10^12 bytes

Scale estimates:
Twitter: 500M tweets/day = ~6K tweets/s
YouTube: 500 hours of video uploaded per minute
Instagram: 100M photos/day = ~1,200 photos/s

CAP Theorem & Consistency Models

CAP: A distributed system can guarantee only 2 of 3:
  C (Consistency) — All nodes see the same data simultaneously
  A (Availability) — Every request gets a (non-error) response
  P (Partition tolerance) — System works when network partitions occur

Since network partitions ALWAYS happen, choose CP or AP:
  CP: Consistent + Partition-tolerant (accept downtime over stale data)
      Examples: HBase, ZooKeeper, Etcd, MongoDB
  AP: Available + Partition-tolerant (accept stale data over downtime)
      Examples: Cassandra, DynamoDB, CouchDB, DNS

Consistency Levels (Ordered Strong → Weak)

Level             Description                                            Use When
Linearizable      Like a single machine; reads always see latest write   Financial transactions
Sequential        All nodes see same order, not necessarily real-time    Collaborative editing
Causal            Causally related ops in order                          Social feeds
Eventual          Nodes converge eventually                              DNS, social likes
Read-your-writes  You always see your own writes                         Profile updates
Monotonic reads   Reads never go backwards in time                       Social timeline

Load Balancing

Algorithms:
  Round Robin     → simple, works if servers are identical
  Weighted RR     → different server capacities
  Least Connections → server with fewest active connections
  IP Hash         → same client → same server (session stickiness)
  Consistent Hashing → distribute with minimal redistribution on changes
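
Consistent hashing, the last algorithm above, can be sketched as a sorted ring of virtual-node points (a minimal illustration; the node names, MD5 hash, and 100 vnodes are arbitrary choices, not a production implementation):

```python
import bisect
import hashlib

def stable_hash(key: str) -> int:
    # MD5 for stability across processes (Python's built-in hash() is randomized)
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each physical node gets `vnodes` points on the ring to even out the split
        self.ring = sorted((stable_hash(f"{n}#{i}"), n)
                           for n in nodes for i in range(vnodes))
        self.points = [h for h, _ in self.ring]

    def get_node(self, key: str) -> str:
        # First ring point clockwise from the key's hash; wrap to 0 at the end
        idx = bisect.bisect(self.points, stable_hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
print(ring.get_node("user:42"))  # stable: same node every run
```

Removing a node only remaps the keys that pointed at its vnodes; every other key keeps its owner, which is the "minimal redistribution" property the list above refers to.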

Layer 4 (Transport): Route by IP address and TCP/UDP port. Fast, but no application-level visibility.
  Use for: TCP traffic, very high throughput

Layer 7 (Application): Route by HTTP headers, URL, cookies.
  Use for: HTTP services, A/B testing, canary deployments

Health checks:
  Active: LB sends probe requests every 5-30s
  Passive: Monitor actual traffic for errors
  
Session stickiness options:
  1. Stateless apps (best) — any server can handle any request
  2. Server-side sessions + shared cache (Redis)
  3. JWT tokens — client carries state, stateless servers

Caching Strategy

Cache Tiers

Browser → CDN → API Cache → Application Cache → Database

CDN (CloudFront, Fastly, Cloudflare):
  - Static assets: CSS, JS, images, videos
  - Cache-Control: max-age=31536000 (1 year) for versioned assets
  - Purge on deploy

Application Cache (Redis/Memcached):
  - Session data
  - API response cache
  - Computed results, leaderboards, counters

Database Query Cache:
  - Materialized views
  - Query result caching

Cache Patterns

Cache-Aside (Lazy Loading):
  1. Read from cache
  2. Miss: read from DB, write to cache, return
  3. TTL-based invalidation
  Good for: read-heavy, can tolerate stale data
  Problem: cache miss on first request, thundering herd

Write-Through:
  1. Write to cache AND DB synchronously
  Good for: consistency, no stale reads
  Problem: write latency, cache full of data never read

Write-Behind (Async):
  1. Write to cache immediately (return success)
  2. Async: flush to DB
  Good for: write performance
  Problem: data loss risk if cache fails

Read-Through:
  Cache sits in front of DB, handles misses
  Good for: simple app code (cache is transparent)

Refresh-Ahead:
  Proactively refresh before expiry
  Good for: predictable access patterns, critical data

# Cache-aside pattern (`db` is a placeholder for your data-access layer)
import redis
import json

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

def get_user(user_id: str) -> dict | None:
    cache_key = f"user:{user_id}"

    # Try cache first
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)

    # Cache miss — fetch from DB
    user = db.query_user(user_id)  # Slow path
    if user:
        r.setex(cache_key, 3600, json.dumps(user))  # TTL: 1 hour

    return user

# Cache invalidation on write
def update_user(user_id: str, data: dict):
    db.update_user(user_id, data)
    r.delete(f"user:{user_id}")  # Invalidate; next read repopulates

    # Or write the fresh value instead of deleting:
    # r.set(f"user:{user_id}", json.dumps(updated_user), ex=3600)

Cache Busting Problems

Thundering herd: Cache expires, 10K requests hit DB simultaneously
  Solution: 
    - Cache lock: Only 1 request rebuilds cache, others wait
    - Jitter: Randomize TTL ±10% to stagger expiry
    - Cache warming: Proactively rebuild before expiry

Cache penetration: Request for non-existent data bypasses cache, hits DB every time
  Solution: Cache negative results (NULL) with short TTL (60s)
  
Cache stampede: Cache server restart / mass invalidation causes all caches to miss at once
  Solution: Request coalescing (one in-flight rebuild per key), circuit breaker, cache warming script
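
Two of the mitigations above, TTL jitter and a per-key rebuild lock, sketched in-process (single server; across servers you would hold the lock in Redis instead, e.g. with SET NX):

```python
import random
import threading

BASE_TTL = 3600  # seconds; matches the 1-hour TTL used in the cache-aside example

def jittered_ttl(base: int = BASE_TTL, spread: float = 0.10) -> int:
    # Randomize TTL by ±10% so keys written together don't all expire together
    return int(base * random.uniform(1 - spread, 1 + spread))

# One lock per cache key: only the first miss rebuilds; others wait, then re-read
_rebuild_locks: dict[str, threading.Lock] = {}
_locks_guard = threading.Lock()

def lock_for(key: str) -> threading.Lock:
    with _locks_guard:
        return _rebuild_locks.setdefault(key, threading.Lock())

print(jittered_ttl())  # somewhere in [3240, 3960]
```

A caller would do `with lock_for(cache_key): ...` around the DB read in the miss path; the lock dictionary here grows unboundedly, which a real implementation would cap or expire.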

Database Design Patterns

Sharding Strategies

Horizontal Sharding (partitioning rows across databases):

1. Range-based sharding
   - User IDs 1-1M → Shard 1, 1M-2M → Shard 2
   - Problem: Hot spots if range is popular
   
2. Hash-based sharding
   - shard = hash(user_id) % num_shards
   - Problem: Resharding requires migrating data
   
3. Consistent hashing
   - Virtual nodes on a ring
   - Adding a shard only moves 1/N data
   - Used by: DynamoDB, Cassandra, Redis Cluster
   
4. Directory-based sharding
   - Lookup table: user_id → shard_id
   - Flexible, but lookup table is a bottleneck

Vertical sharding: Split by table/feature (users in DB1, orders in DB2)
  Good for: feature isolation, different scaling needs
  Problem: Cross-shard joins are expensive (do in app layer)
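
The resharding problem with modulo hashing (strategy 2 above) is easy to demonstrate: growing from 4 to 5 shards remaps roughly 4 out of 5 keys, which is exactly the cost consistent hashing avoids:

```python
import hashlib

def shard(user_id: int, num_shards: int) -> int:
    # Stable hash -> shard index (built-in hash() is randomized per process)
    h = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
    return h % num_shards

# Count how many of 10,000 users land on a different shard after adding one shard
moved = sum(1 for u in range(10_000) if shard(u, 4) != shard(u, 5))
print(f"{moved / 10_000:.0%} of keys move when going 4 -> 5 shards")  # ~80%
```

For a uniform hash, a key stays put only when its hash has the same residue mod 4 and mod 5, which happens about 1 time in 5; hence ~80% of the data migrates.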

Read Replicas

Primary (read + write) → Replica 1 (read only)
                        → Replica 2 (read only)
                        → Replica 3 (read only + analytics)

Route reads to replicas:
  - Most reads go to replicas
  - Writes always go to primary
  - Inconsistency window: replication lag (usually <1s)
  
Use read-your-writes routing:
  - After a write, route that user's reads to primary for a short time
  - Or: read from primary for operations that must see latest data
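
The read-your-writes routing above can be sketched as a router that pins a user's reads to the primary for a short window after their last write (the 1-second window is a stand-in for your measured replication lag):

```python
import time

class ReplicaRouter:
    def __init__(self, pin_seconds: float = 1.0):
        self.pin_seconds = pin_seconds        # assumed replication-lag upper bound
        self.last_write: dict[str, float] = {}

    def record_write(self, user_id: str) -> None:
        self.last_write[user_id] = time.monotonic()

    def target_for_read(self, user_id: str) -> str:
        # Recent writer -> primary (guaranteed to see own writes); others -> replica
        last = self.last_write.get(user_id, float("-inf"))
        if time.monotonic() - last < self.pin_seconds:
            return "primary"
        return "replica"

router = ReplicaRouter()
router.record_write("alice")
print(router.target_for_read("alice"))  # primary
print(router.target_for_read("bob"))    # replica
```

In a multi-server deployment this last-write map would itself live in a shared store (or a signed cookie carrying the write timestamp), since any app server may serve the next read.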

Message Queues and Event-Driven Architecture

Point-to-point queue (RabbitMQ, SQS):
  Producer → [Queue] → Consumer (one consumer per message)
  Use for: task distribution, work queues, one-time processing

Pub/Sub topic (Kafka, SNS, Google Pub/Sub):
  Producer → [Topic] → Consumer Group 1 (all messages)
                     → Consumer Group 2 (all messages)
  Use for: event streaming, audit logs, multiple consumers

Kafka key concepts:
  Topic: named stream of events
  Partition: parallel lane within a topic (ordered within partition)
  Consumer group: set of consumers sharing partition processing
  Offset: position within a partition (Kafka stores until retention)
  
Ordering guarantee:
  - Within a partition: always ordered
  - Across partitions: no ordering (use single partition for strict ordering)
  - To keep related events together: partition by entity ID (user_id, order_id)
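
Partitioning by entity ID is just a stable hash of the message key; a minimal sketch of the idea (not Kafka's actual murmur2 partitioner):

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    # Same key -> same partition -> per-entity ordering preserved
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_partitions

# All of order-123's events land in one partition, so they stay ordered
events = ["created", "paid", "shipped"]
print([partition_for("order-123", 12) for _ in events])  # same value three times
```

Note the trade-off this encodes: a single very hot key (one celebrity user, one huge order stream) pins all its traffic to one partition, so key choice is also a load-balancing decision.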

Event-Driven Patterns

Event Sourcing:
  Store ALL events, not just current state
  State = replay of all events from beginning
  
  + Complete audit trail
  + Easy to rebuild state at any point in time
  + Event-driven by design
  - Replay can be slow (snapshots solve this)
  - Schema evolution is hard
  
CQRS (Command Query Responsibility Segregation):
  Commands → Write Model (normalized, consistent)
  Queries  → Read Model (denormalized, fast, eventual)
  
  + Read model optimized for queries
  + Write model optimized for consistency
  - Eventual consistency between models
  - More complexity
  
Saga Pattern (distributed transactions):
  Choreography: Each service emits events, others react
  Orchestration: Central coordinator tells services what to do
  
  Order Service → emit OrderCreated
               ← Inventory Service reserves items
               ← Payment Service charges card
               ← Shipping Service creates shipment
               
  If any step fails: compensating transactions to undo
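
The orchestration variant can be sketched as a coordinator that runs steps in order and fires compensations in reverse on failure (the service calls are stand-in functions, not a real framework):

```python
def run_saga(steps):
    """Run (action, compensate) pairs; on failure, undo completed steps in reverse."""
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for undo in reversed(done):
                undo()  # compensating transaction
            return False
    return True

log = []

def charge_card():
    raise RuntimeError("card declined")  # simulated failure at the payment step

ok = run_saga([
    (lambda: log.append("inventory reserved"), lambda: log.append("inventory released")),
    (charge_card,                              lambda: log.append("refund issued")),
])
print(ok, log)  # False ['inventory reserved', 'inventory released']
```

A real orchestrator also has to persist saga state and make compensations idempotent, because the coordinator itself can crash mid-saga and must be able to resume.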

Circuit Breaker Pattern

from enum import Enum
import time
from threading import Lock

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call while open."""
    pass

class State(Enum):
    CLOSED = "closed"       # Normal operation, requests flow
    OPEN = "open"           # Failing, reject requests fast
    HALF_OPEN = "half_open" # Testing recovery

class CircuitBreaker:
    def __init__(self, failure_threshold=5, success_threshold=2, timeout=60):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout = timeout
        
        self.state = State.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self._lock = Lock()
    
    def call(self, func, *args, **kwargs):
        with self._lock:
            if self.state == State.OPEN:
                if time.time() - self.last_failure_time > self.timeout:
                    self.state = State.HALF_OPEN
                    self.success_count = 0  # start a fresh trial window
                else:
                    raise CircuitOpenError("Circuit is open")
        
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception as e:
            self._on_failure()
            raise
    
    def _on_success(self):
        with self._lock:
            if self.state == State.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.success_threshold:
                    self.state = State.CLOSED
                    self.failure_count = 0
            else:
                self.failure_count = 0  # a success in CLOSED clears the failure streak
    
    def _on_failure(self):
        with self._lock:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = State.OPEN

Design Patterns for Common Systems

URL Shortener (bit.ly)

Key decisions:
  1. ID generation: Base62 encode auto-increment OR random 7-char string
  2. Storage: Redis for hot links, SQL for persistence
  3. Redirect: 301 (permanent, cached by browser) vs 302 (temporary, track every click)
  4. Scale: 100M URLs, 10B redirects/day = ~115K redirects/s
  
Schema:
  short_code TEXT PRIMARY KEY
  long_url TEXT
  created_at TIMESTAMP
  expires_at TIMESTAMP
  click_count INTEGER  -- eventually consistent counter
  
Bottleneck: 115K reads/s → Redis cache with LRU eviction
  Cache key: short_code → long_url
  Cache hit ratio goal: 95%+ (Zipf distribution: 20% of links get 80% traffic)
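
Decision 1's Base62 option is a few lines: encode an auto-increment ID into a short code (sketch only; the random-string alternative and collision handling are omitted):

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62(n: int) -> str:
    # 62^7 ≈ 3.5 trillion codes fit in 7 characters
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

print(base62(125))                      # "21"
print(len(base62(3_500_000_000_000)))   # 7
```

Encoding a sequential counter leaks creation order and makes codes guessable; if that matters, mix in a secret permutation of the ID or fall back to random codes with a uniqueness check.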

Rate Limiter

Token bucket: Most common
  - Bucket starts with N tokens
  - Refill at rate R tokens/second
  - Request consumes 1 token; reject if empty
  Implementation: Redis + Lua script (atomic)
  
Fixed window: Simple
  - Count requests per minute window
  - Problem: burst at window boundary (2x limit in 2s)
  
Sliding window log:
  - Store timestamp of each request
  - Count requests in last N seconds
  - Problem: memory-intensive (1 entry per request)
  
Sliding window counter:
  - Interpolate between two fixed windows
  - Balance of accuracy and memory
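
A token bucket is simple in-process; a single-server sketch (a distributed deployment keeps the token count and timestamp in Redis instead, as below):

```python
import time

class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # max burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Lazy refill: credit tokens for the time elapsed since the last call
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=3, refill_rate=1)  # burst of 3, then 1 req/s
print([bucket.allow() for _ in range(5)])  # [True, True, True, False, False]
```

Note the lazy-refill trick: no background timer is needed, because each call settles the elapsed time since the previous one.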

Redis Lua script (atomic token bucket with lazy refill):
  -- KEYS[1]=bucket key; ARGV[1]=capacity, ARGV[2]=refill rate (tokens/s), ARGV[3]=now (unix seconds)
  local b = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
  local tokens = tonumber(b[1]) or tonumber(ARGV[1])
  local ts = tonumber(b[2]) or tonumber(ARGV[3])
  tokens = math.min(tonumber(ARGV[1]), tokens + (tonumber(ARGV[3]) - ts) * tonumber(ARGV[2]))
  local allowed = 0
  if tokens >= 1 then tokens = tokens - 1; allowed = 1 end
  redis.call('HSET', KEYS[1], 'tokens', tokens, 'ts', ARGV[3])
  return allowed  -- 1 = allowed, 0 = rejected

Reliability Patterns

SLA/SLO/SLI:
  SLI: The measurement itself (e.g., fraction of requests served in <200ms)
  SLO: Internal target for that SLI (e.g., 99.5% of requests in <200ms, measured monthly)
  SLA: External contract with consequences (credits, penalties) if the SLO is missed

Error budget:
  99.9% availability = ~8.8 hours downtime/year
  99.99% = ~53 minutes/year
  Use the error budget to decide when to ship new features vs focus on reliability
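
The downtime figures above come from simple arithmetic; a quick calculator:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760

def downtime_per_year(availability: float) -> str:
    # Allowed downtime is just the unavailable fraction of a year
    hours = (1 - availability) * HOURS_PER_YEAR
    return f"{hours:.1f} h" if hours >= 1 else f"{hours * 60:.1f} min"

for target in (0.999, 0.9999, 0.99999):
    print(target, "->", downtime_per_year(target))
# 0.999   -> 8.8 h
# 0.9999  -> 52.6 min
# 0.99999 -> 5.3 min
```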

Bulkhead pattern: Isolate failures
  - Separate thread pools per downstream dependency
  - Database timeouts can't exhaust HTTP thread pool
  
Retry with exponential backoff:
  wait = min(cap, base * 2^attempt) + jitter
  jitter prevents synchronized retries from causing thundering herd
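
The backoff formula above, directly in code (the 10% jitter fraction here is an arbitrary choice; the "full jitter" variant randomizes the entire wait instead):

```python
import random

def backoff(attempt: int, base: float = 0.1, cap: float = 30.0) -> float:
    # wait = min(cap, base * 2^attempt) + jitter (up to 10% extra)
    wait = min(cap, base * 2 ** attempt)
    return wait + random.uniform(0, wait * 0.1)

print([round(backoff(a), 2) for a in range(6)])  # roughly [0.1, 0.2, 0.4, 0.8, 1.6, 3.2] plus jitter
```

The cap matters as much as the exponent: without it, attempt 20 would ask clients to wait ~29 hours instead of 30 seconds.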
  
Health checks:
  Shallow: "Am I up?" → responds 200 (load balancer liveness check)
  Deep: "Are my dependencies up?" → checks DB, cache, downstream (startup/readiness only; a deep liveness check lets one flaky dependency take out the whole fleet)

Skill Information

Source
MoltbotDen
Category
Coding Agents & IDEs