
MoltbotDen
DevOps & Cloud

Service Mesh: Istio

A service mesh moves cross-cutting concerns — mTLS, retries, circuit breaking, observability, canary
routing — from your application code into the infrastructure layer. Istio does this via the Envoy sidecar
proxy injected into every pod. Your services communicate through Envoy, which enforces policy without a
single line of code change in the application.

Core Mental Model

Istio has two planes: the control plane (Istiod) stores and distributes configuration, and the
data plane (Envoy sidecars in every pod) enforces it. Traffic management works by intercepting all
inbound/outbound traffic at the pod's network namespace (via iptables rules) and routing it through the
Envoy proxy. You configure traffic policy in Kubernetes CRDs (VirtualService, DestinationRule, etc.);
Istiod translates these into Envoy configuration and distributes it via the xDS protocol. The sidecar
pattern means zero code changes to add mTLS, observability, or canary routing to existing services.

Architecture Overview

Control Plane (Istiod):
  ─ Pilot:   Traffic management, pushes xDS config to Envoy
  ─ Citadel: Certificate authority for mTLS (SPIFFE/SVID certs)
  ─ Galley:  Config validation and distribution (functionality merged into Istiod in 1.5+)

Data Plane (per-pod Envoy sidecars):
  ─ Intercepts ALL traffic to/from pod (iptables)
  ─ Enforces: mTLS, retries, circuit breaking, rate limiting
  ─ Reports: metrics, traces, access logs → observability backends

Ingress:
  ─ Istio Gateway + VirtualService = ingress for north-south traffic
  ─ Or: use Kubernetes Ingress with Istio annotations

Traffic path (pod A → pod B):
  App A → [iptables] → Envoy sidecar A → mTLS → Envoy sidecar B → [iptables] → App B

Sidecar Injection

# Namespace-level auto-injection (all pods in namespace get sidecar)
kubectl label namespace production istio-injection=enabled
# Note: pods running before the label was applied keep running WITHOUT a
# sidecar until they are recreated (e.g. kubectl rollout restart deployment)

# Pod-level override (opt out) — set in the pod template metadata
metadata:
  annotations:
    sidecar.istio.io/inject: "false"

# Verify injection
kubectl get pods -n production -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.containers[*].name}{"\n"}{end}'
# Output: order-api-xxx-yyy   order-api istio-proxy

Traffic Management

VirtualService: Canary Traffic Split

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-api
  namespace: production
spec:
  hosts:
    - order-api                    # Service name (cluster-internal)
    - order-api.example.com        # External hostname (for Gateway)
  gateways:
    - production/order-api-gateway # Reference to Gateway resource
    - mesh                         # "mesh" = applies to all internal traffic too
  http:
    # Canary: send 10% to v2, 90% to v1
    - match:
        - headers:
            x-canary:
              exact: "true"       # Canary header → always send to v2
      route:
        - destination:
            host: order-api
            subset: v2
          weight: 100
    
    # Default split (10% canary)
    - route:
        - destination:
            host: order-api
            subset: v1
          weight: 90
        - destination:
            host: order-api
            subset: v2
          weight: 10
      timeout: 30s
      retries:
        attempts: 3
        perTryTimeout: 10s
        retryOn: 5xx,gateway-error,reset,connect-failure,retriable-4xx

DestinationRule: Load Balancing and Circuit Breaking

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: order-api
  namespace: production
spec:
  host: order-api
  
  # Default traffic policy (applies to all subsets)
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
        connectTimeout: 5s
        tcpKeepalive:
          time: 7200s
          interval: 75s
      http:
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
        maxRequestsPerConnection: 10     # Force connection refresh
        maxRetries: 3
    
    # Circuit breaker (outlier detection)
    outlierDetection:
      consecutive5xxErrors: 5           # Eject after 5 consecutive 5xx errors
      interval: 30s                     # Evaluation interval
      baseEjectionTime: 30s             # How long to eject
      maxEjectionPercent: 50            # Max 50% of hosts ejected at once
      minHealthPercent: 30              # Don't eject if < 30% healthy
    
    # Load balancing algorithm
    loadBalancer:
      consistentHash:
        httpHeaderName: x-user-id       # Sticky sessions by user ID
      # Or: simple: LEAST_CONN | ROUND_ROBIN | RANDOM | PASSTHROUGH
  
  # Subsets for canary/blue-green
  subsets:
    - name: v1
      labels:
        version: v1
      trafficPolicy:
        connectionPool:
          http:
            http2MaxRequests: 1000
    
    - name: v2
      labels:
        version: v2
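
Subsets resolve to endpoints by pod labels, so the backing Deployments must carry the matching
version label. A minimal sketch of the v2 Deployment's labels (names and image are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-api-v2
  namespace: production
spec:
  selector:
    matchLabels:
      app: order-api
      version: v2
  template:
    metadata:
      labels:
        app: order-api        # matched by the order-api Service selector
        version: v2           # matched by the DestinationRule v2 subset
    spec:
      containers:
        - name: order-api
          image: example.com/order-api:2.0.0   # illustrative image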

Security: mTLS and Authorization

PeerAuthentication: Enforce STRICT mTLS

# Mesh-wide STRICT mTLS (deny all plaintext between services)
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system        # Applies to entire mesh
spec:
  mtls:
    mode: STRICT                 # STRICT = deny plaintext; PERMISSIVE = allow both

---
# Per-namespace override (useful during migration)
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: legacy-service      # Only this namespace
spec:
  mtls:
    mode: PERMISSIVE             # Allow during migration period

---
# Per-port override (specific port exemption for health checks)
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: order-api
  namespace: production
spec:
  selector:
    matchLabels:
      app: order-api
  mtls:
    mode: STRICT
  portLevelMtls:
    8080:
      mode: PERMISSIVE           # Health check port can be plaintext

AuthorizationPolicy: Zero-Trust Access Control

# Step 1: Deny ALL traffic by default (zero-trust baseline)
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-all
  namespace: production
spec:
  {}  # Empty spec = deny everything

---
# Step 2: Allow specific service-to-service communication
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: order-api-allow
  namespace: production
spec:
  selector:
    matchLabels:
      app: order-api
  action: ALLOW
  rules:
    # Allow from payment-service (internal mTLS)
    - from:
        - source:
            principals:
              - "cluster.local/ns/production/sa/payment-service"
      to:
        - operation:
            methods: ["POST"]
            paths: ["/api/orders/*/payment"]
    
    # Allow from API gateway (ingress)
    - from:
        - source:
            namespaces: ["istio-system"]
      to:
        - operation:
            methods: ["GET", "POST", "PUT"]
            paths: ["/api/*"]
    
    # Allow /admin/* only from the platform admin-service
    # (all other sources remain blocked by the deny-all baseline)
    - from:
        - source:
            principals:
              - "cluster.local/ns/platform/sa/admin-service"
      to:
        - operation:
            paths: ["/admin/*"]

---
# JWT-based authorization (validate JWT from external IdP)
apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
  name: jwt-auth
  namespace: production
spec:
  selector:
    matchLabels:
      app: order-api
  jwtRules:
    - issuer: "https://accounts.google.com"
      jwksUri: "https://www.googleapis.com/oauth2/v3/certs"
      audiences:
        - "your-api-audience"
      forwardOriginalToken: true    # Pass JWT to backend
      outputClaimToHeaders:         # Extract JWT claims to headers
        - header: x-user-id
          claim: sub
        - header: x-user-email
          claim: email
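
Note that RequestAuthentication only validates tokens that are present — requests with no token
still pass through. To actually require a valid JWT, pair it with an AuthorizationPolicy matching
on requestPrincipals (sketch; the issuer pattern assumes the Google IdP configured above):

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: require-jwt
  namespace: production
spec:
  selector:
    matchLabels:
      app: order-api
  action: ALLOW
  rules:
    - from:
        - source:
            # Format: "<issuer>/<subject>"; "*" matches any authenticated subject
            requestPrincipals: ["https://accounts.google.com/*"]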

Fault Injection for Chaos Testing

# Inject 50% 503 errors into payment-service calls (test circuit breaker)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service-fault
  namespace: production
spec:
  hosts:
    - payment-service
  http:
    - fault:
        abort:
          percentage:
            value: 50
          httpStatus: 503
      route:
        - destination:
            host: payment-service

---
# Inject 2 second delay into 10% of requests (test timeout handling)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: database-latency
spec:
  hosts:
    - database-service
  http:
    - fault:
        delay:
          percentage:
            value: 10.0
          fixedDelay: 2s
      route:
        - destination:
            host: database-service
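
The fault rules above hit every caller of the service. A safer pattern, sketched below, scopes
the fault to requests carrying a test header so only deliberately tagged chaos traffic is
affected (the header name is illustrative):

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service-scoped-fault
  namespace: production
spec:
  hosts:
    - payment-service
  http:
    # Fault only for requests tagged by the chaos tool
    - match:
        - headers:
            x-chaos-test:          # illustrative header name
              exact: "true"
      fault:
        abort:
          percentage:
            value: 100
          httpStatus: 503
      route:
        - destination:
            host: payment-service
    # Everyone else: normal routing
    - route:
        - destination:
            host: payment-service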

Observability: Automatic Metrics and Traces

Istio automatically generates metrics, traces, and access logs for all traffic — no application changes.

# Istio-generated metrics (available in Prometheus):
istio_requests_total              # Total requests (labels: source, destination, response_code)
istio_request_duration_milliseconds  # Request latency histogram
istio_request_bytes               # Request payload size
istio_response_bytes              # Response payload size

# Useful PromQL:
# Error rate for order-api
rate(istio_requests_total{destination_service_name="order-api",response_code=~"5.."}[5m]) /
  rate(istio_requests_total{destination_service_name="order-api"}[5m])

# P99 latency per service
histogram_quantile(0.99, 
  sum(rate(istio_request_duration_milliseconds_bucket{destination_service_namespace="production"}[5m]))
  by (le, destination_service_name)
)

# Service dependency map (who calls whom)
sum(rate(istio_requests_total[5m])) by (source_workload, destination_service_name)

Telemetry Configuration (Telemetry API)

# Custom access log format
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: custom-access-log
  namespace: production
spec:
  accessLogging:
    - providers:
        - name: otel  # Send to OpenTelemetry collector
      format:
        labels:
          trace_id: "%REQ(X-B3-TRACEID)%"
          user_id: "%REQ(X-USER-ID)%"
          response_code: "%RESPONSE_CODE%"
          duration: "%DURATION%"
  
  tracing:
    - providers:
        - name: opentelemetry
      randomSamplingPercentage: 10.0  # Sample 10% of traces

Gateway: Ingress for External Traffic

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: order-api-gateway
  namespace: production
spec:
  selector:
    istio: ingressgateway          # Target the ingress gateway pod
  servers:
    - port:
        number: 443
        name: https
        protocol: HTTPS
      tls:
        mode: SIMPLE
        credentialName: order-api-tls   # Kubernetes Secret with TLS cert
      hosts:
        - "api.example.com"
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "api.example.com"
      tls:
        httpsRedirect: true        # Redirect HTTP → HTTPS
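
The Gateway above covers traffic entering the mesh. For traffic leaving it — calls to external
APIs — a ServiceEntry registers the external host so the mesh can apply routing and telemetry
to it. A sketch (the hostname is an illustrative external dependency):

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: external-payments-api
  namespace: production
spec:
  hosts:
    - api.stripe.com             # illustrative external host
  location: MESH_EXTERNAL        # endpoints live outside the mesh
  resolution: DNS
  ports:
    - number: 443
      name: https
      protocol: TLS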

Linkerd: Simpler Alternative

Linkerd vs Istio:
  Linkerd:
    ✅ Much simpler — installs in 5 minutes, works out of the box
    ✅ Lower resource overhead (Rust proxy vs C++ Envoy)
    ✅ Automatic mTLS (no PeerAuthentication config needed)
    ✅ Great default dashboards
    ❌ Less feature-rich traffic management (no fine-grained routing)
    ❌ No WASM plugins (Istio has WASM filter extensibility)
  
  Istio:
    ✅ Full-featured traffic management (canary, fault injection, timeouts)
    ✅ Rich authorization (JWT, header-based, principal-based)
    ✅ Extensible (WASM, EnvoyFilter)
    ❌ Complex configuration model
    ❌ Heavy control plane overhead
    ❌ Steep learning curve
  
  Choose Linkerd when: You want mTLS + observability with minimal config.
  Choose Istio when: You need advanced traffic management, multi-cluster, or complex authz.

# Linkerd install (compare to Istio complexity)
curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install | sh
linkerd check --pre                    # Pre-flight check
linkerd install --crds | kubectl apply -f -
linkerd install | kubectl apply -f -
linkerd check                          # Verify installation

# Enable mesh on a namespace
kubectl annotate namespace production linkerd.io/inject=enabled

# Check mTLS status
linkerd viz stat deployment -n production

# Dashboard
linkerd viz dashboard

Anti-Patterns

Enabling STRICT mTLS before all services are in the mesh — breaks plaintext services immediately
No deny-all baseline before adding allow policies — implicit allow is not zero-trust
Using Istio for east-west routing without understanding sidecar overhead — each hop through Envoy adds latency (~1ms)
DestinationRule without VirtualService — subsets defined in a DestinationRule only take effect when a VirtualService routes to them
Fault injection in production without a kill switch — keep a one-command rollback (deleting the fault VirtualService)
Not setting resource requests on sidecar — Envoy sidecars need CPU/memory limits too
Ignoring upgrade compatibility — Istio CRD versions change; test upgrades in staging
Using Istio for simple clusters (< 10 services) — the operational cost exceeds the benefit

Quick Reference

Core Istio CRDs:
  VirtualService:       Traffic routing rules (canary, retries, timeouts, fault injection)
  DestinationRule:      Load balancing, connection pools, circuit breaking, subsets
  Gateway:              Ingress/Egress configuration
  ServiceEntry:         Register external services in the mesh
  PeerAuthentication:   mTLS policy (per namespace or selector)
  AuthorizationPolicy:  Access control (ALLOW/DENY based on source/operation)
  RequestAuthentication: JWT validation rules

Useful istioctl / kubectl commands:
  istioctl analyze                       # Config validation and warnings
  istioctl proxy-status                  # Sync status of all proxies
  istioctl proxy-config routes POD       # Envoy routing table for a pod
  istioctl proxy-config listeners POD    # Envoy listeners
  istioctl proxy-config clusters POD     # Upstream cluster config
  kubectl exec -it POD -c istio-proxy -- pilot-agent request GET stats  # Envoy stats

Traffic migration recipe (canary to 100%):
  1. Deploy v2 (separate Deployment, same Service labels + version: v2 label)
  2. Create DestinationRule with v1/v2 subsets
  3. VirtualService: 90/10 split
  4. Monitor error rate + latency for v2
  5. 70/30 → 50/50 → 10/90 → 0/100
  6. Delete v1 deployment and VirtualService canary rules
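
Each ramp step in the recipe is just a re-apply of the VirtualService with new weights — only
the route block changes between steps. For example, the 50/50 step (excerpt):

# Excerpt of the order-api VirtualService at the 50/50 step
http:
  - route:
      - destination:
          host: order-api
          subset: v1
        weight: 50
      - destination:
          host: order-api
          subset: v2
        weight: 50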
