service-mesh-istio
Expert Istio service mesh covering control plane vs data plane architecture, sidecar injection, traffic management with VirtualService and DestinationRule, observability, mTLS with PeerAuthentication, JWT authorization, circuit breaking, fault injection, canary deployments with traffic splitting, and a comparison with Linkerd.
Service Mesh: Istio
A service mesh moves cross-cutting concerns — mTLS, retries, circuit breaking, observability, canary
routing — from your application code into the infrastructure layer. Istio does this via the Envoy sidecar
proxy injected into every pod. Your services communicate through Envoy, which enforces policy without a
single line of code change in the application.
Core Mental Model
Istio has two planes: the control plane (Istiod) stores and distributes configuration, and the
data plane (Envoy sidecars in every pod) enforces it. Traffic management works by intercepting all
inbound/outbound traffic at the pod's network namespace (via iptables rules) and routing it through the
Envoy proxy. You declare traffic policy in Kubernetes CRDs (VirtualService, DestinationRule, etc.);
Istiod translates them into Envoy configuration and distributes it via the xDS protocol. The sidecar
pattern means zero code changes to add mTLS, observability, or canary routing to existing services.
Architecture Overview
Control Plane (Istiod — a single binary since Istio 1.5, combining):
─ Pilot: Traffic management, pushes xDS config to Envoy
─ Citadel: Certificate authority for mTLS (SPIFFE/SVID certs)
─ Galley: Config validation and distribution
Data Plane (per-pod Envoy sidecars):
─ Intercepts ALL traffic to/from pod (iptables)
─ Enforces: mTLS, retries, circuit breaking, rate limiting
─ Reports: metrics, traces, access logs → observability backends
Ingress:
─ Istio Gateway + VirtualService = ingress for north-south traffic
─ Or: use Kubernetes Ingress with Istio annotations
Traffic path (pod A → pod B):
App A → [iptables] → Envoy sidecar A → mTLS → Envoy sidecar B → [iptables] → App B
Sidecar Injection
# Namespace-level auto-injection (all pods in namespace get sidecar)
kubectl label namespace production istio-injection=enabled
# Pod-level override (opt out)
metadata:
annotations:
sidecar.istio.io/inject: "false"
# Verify injection
kubectl get pods -n production -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.containers[*].name}{"\n"}{end}'
# Output: order-api-xxx-yyy order-api istio-proxy
Traffic Management
VirtualService: Canary Traffic Split
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: order-api
namespace: production
spec:
hosts:
- order-api # Service name (cluster-internal)
- order-api.example.com # External hostname (for Gateway)
gateways:
- production/order-api-gateway # Reference to Gateway resource
- mesh # "mesh" = applies to all internal traffic too
http:
# Canary: send 10% to v2, 90% to v1
- match:
- headers:
x-canary:
exact: "true" # Canary header → always send to v2
route:
- destination:
host: order-api
subset: v2
weight: 100
# Default split (10% canary)
- route:
- destination:
host: order-api
subset: v1
weight: 90
- destination:
host: order-api
subset: v2
weight: 10
timeout: 30s
retries:
attempts: 3
perTryTimeout: 10s
retryOn: 5xx,gateway-error,reset,connect-failure,retriable-4xx
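Beyond weighted splits, a VirtualService can also shadow live traffic to the new version without affecting responses. A sketch of traffic mirroring (the `order-api-mirror` name and 5% sample are illustrative):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-api-mirror
  namespace: production
spec:
  hosts:
  - order-api
  http:
  - route:
    - destination:
        host: order-api
        subset: v1
    mirror:
      host: order-api   # A copy of each request is sent fire-and-forget to v2
      subset: v2
    mirrorPercentage:
      value: 5.0        # Mirror only 5% of live traffic
```

Mirrored responses are discarded, so bugs in v2 cannot affect callers — useful for validating a new version before turning on a weighted canary.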
DestinationRule: Load Balancing and Circuit Breaking
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: order-api
namespace: production
spec:
host: order-api
# Default traffic policy (applies to all subsets)
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
connectTimeout: 5s
tcpKeepalive:
time: 7200s
interval: 75s
http:
http1MaxPendingRequests: 100
http2MaxRequests: 1000
maxRequestsPerConnection: 10 # Force connection refresh
maxRetries: 3
# Circuit breaker (outlier detection)
outlierDetection:
consecutive5xxErrors: 5 # Eject after 5 consecutive 5xx errors
interval: 30s # Evaluation interval
baseEjectionTime: 30s # How long to eject
maxEjectionPercent: 50 # Max 50% of hosts ejected at once
minHealthPercent: 30 # Don't eject if < 30% healthy
# Load balancing algorithm
loadBalancer:
consistentHash:
httpHeaderName: x-user-id # Sticky sessions by user ID
# Or: simple: LEAST_CONN | ROUND_ROBIN | RANDOM | PASSTHROUGH
# Subsets for canary/blue-green
subsets:
- name: v1
labels:
version: v1
trafficPolicy:
connectionPool:
http:
http2MaxRequests: 1000
- name: v2
labels:
version: v2
Security: mTLS and Authorization
PeerAuthentication: Enforce STRICT mTLS
# Mesh-wide STRICT mTLS (deny all plaintext between services)
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: istio-system # Applies to entire mesh
spec:
mtls:
mode: STRICT # STRICT = deny plaintext; PERMISSIVE = allow both
---
# Per-namespace override (useful during migration)
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: legacy-service # Only this namespace
spec:
mtls:
mode: PERMISSIVE # Allow during migration period
---
# Per-port override (specific port exemption for health checks)
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: order-api
namespace: production
spec:
selector:
matchLabels:
app: order-api
mtls:
mode: STRICT
portLevelMtls:
8080:
mode: PERMISSIVE # Health check port can be plaintext
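Note that STRICT mTLS only governs sidecar-to-sidecar traffic; calls leaving the mesh need their TLS mode set explicitly on a DestinationRule. A minimal sketch, assuming a hypothetical external host:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: external-vendor
  namespace: production
spec:
  host: api.external-vendor.example   # Hypothetical external hostname
  trafficPolicy:
    tls:
      mode: SIMPLE   # Originate plain TLS to the external host;
                     # ISTIO_MUTUAL would request in-mesh mTLS instead
```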
AuthorizationPolicy: Zero-Trust Access Control
# Step 1: Deny ALL traffic by default (zero-trust baseline)
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: deny-all
namespace: production
spec:
{} # Empty spec = deny everything
---
# Step 2: Allow specific service-to-service communication
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: order-api-allow
namespace: production
spec:
selector:
matchLabels:
app: order-api
action: ALLOW
rules:
# Allow from payment-service (internal mTLS)
- from:
- source:
principals:
- "cluster.local/ns/production/sa/payment-service"
to:
- operation:
methods: ["POST"]
paths: ["/api/orders/*/payment"]
# Allow from API gateway (ingress)
- from:
- source:
namespaces: ["istio-system"]
to:
- operation:
methods: ["GET", "POST", "PUT"]
paths: ["/api/*"]
# Block access to admin endpoints from everyone except internal
- from:
- source:
principals:
- "cluster.local/ns/platform/sa/admin-service"
to:
- operation:
paths: ["/admin/*"]
---
# JWT-based authorization (validate JWT from external IdP)
apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
name: jwt-auth
namespace: production
spec:
selector:
matchLabels:
app: order-api
jwtRules:
- issuer: "https://accounts.google.com"
jwksUri: "https://www.googleapis.com/oauth2/v3/certs"
audiences:
- "your-api-audience"
forwardOriginalToken: true # Pass JWT to backend
outputClaimToHeaders: # Extract JWT claims to headers
- header: x-user-id
claim: sub
- header: x-user-email
claim: email
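RequestAuthentication only rejects requests that carry an invalid token — requests with no token at all still pass through. To actually require a JWT, pair it with an AuthorizationPolicy that matches request principals (a sketch using the Google issuer from above):

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: require-jwt
  namespace: production
spec:
  selector:
    matchLabels:
      app: order-api
  action: ALLOW
  rules:
  - from:
    - source:
        # Principal format is "<issuer>/<subject>"; "*" after the issuer
        # accepts any validated token from that issuer
        requestPrincipals: ["https://accounts.google.com/*"]
```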
Fault Injection for Chaos Testing
# Inject 50% 503 errors into payment-service calls (test circuit breaker)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: payment-service-fault
namespace: production
spec:
hosts:
- payment-service
http:
- fault:
abort:
percentage:
value: 50
httpStatus: 503
route:
- destination:
host: payment-service
---
# Inject 2 second delay into 10% of requests (test timeout handling)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: database-latency
spec:
hosts:
- database-service
http:
- fault:
delay:
percentage:
value: 10.0
fixedDelay: 2s
route:
- destination:
host: database-service
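To keep fault injection from hitting real users, the fault can be gated behind a test header so only tagged requests see failures. A sketch (the `x-chaos-test` header name is an assumption):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service-scoped-fault
  namespace: production
spec:
  hosts:
  - payment-service
  http:
  - match:
    - headers:
        x-chaos-test:        # Hypothetical opt-in header for chaos runs
          exact: "true"
    fault:
      abort:
        percentage:
          value: 100
        httpStatus: 503
    route:
    - destination:
        host: payment-service
  - route:                   # All other traffic is unaffected
    - destination:
        host: payment-service
```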
Observability: Automatic Metrics and Traces
Istio automatically generates metrics, traces, and access logs for all traffic — no application changes.
# Istio-generated metrics (available in Prometheus):
istio_requests_total # Total requests (labels: source, destination, response_code)
istio_request_duration_milliseconds # Request latency histogram
istio_request_bytes # Request payload size
istio_response_bytes # Response payload size
# Useful PromQL:
# Error rate for order-api
rate(istio_requests_total{destination_service_name="order-api",response_code=~"5.."}[5m]) /
rate(istio_requests_total{destination_service_name="order-api"}[5m])
# P99 latency per service
histogram_quantile(0.99,
sum(rate(istio_request_duration_milliseconds_bucket{destination_service_namespace="production"}[5m]))
by (le, destination_service_name)
)
# Service dependency map (who calls whom)
sum(rate(istio_requests_total[5m])) by (source_workload, destination_service_name)
Telemetry Configuration (Telemetry API)
# Access logging and tracing via the Telemetry API. The log format itself
# (labels like trace_id) is configured on the extension provider in
# meshConfig, not on the Telemetry resource:
#
# meshConfig:
#   extensionProviders:
#   - name: otel
#     envoyOtelAls:
#       service: otel-collector.observability.svc.cluster.local
#       port: 4317
#       logFormat:
#         labels:
#           trace_id: "%REQ(X-B3-TRACEID)%"
#           user_id: "%REQ(X-USER-ID)%"
#           response_code: "%RESPONSE_CODE%"
#           duration: "%DURATION%"
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: custom-access-log
  namespace: production
spec:
  accessLogging:
  - providers:
    - name: otel                      # Send to the OpenTelemetry collector
                                      # declared in meshConfig
  tracing:
  - providers:
    - name: opentelemetry
    randomSamplingPercentage: 10.0    # Sample 10% of traces
Gateway: Ingress for External Traffic
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
name: order-api-gateway
namespace: production
spec:
selector:
istio: ingressgateway # Target the ingress gateway pod
servers:
- port:
number: 443
name: https
protocol: HTTPS
tls:
mode: SIMPLE
credentialName: order-api-tls # Kubernetes Secret with TLS cert
hosts:
- "api.example.com"
- port:
number: 80
name: http
protocol: HTTP
hosts:
- "api.example.com"
tls:
httpsRedirect: true # Redirect HTTP → HTTPS
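For traffic in the other direction — services calling hosts outside the mesh — a ServiceEntry registers the external destination so Istio can apply policy and telemetry to it. A sketch with an illustrative hostname:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: external-payments
  namespace: production
spec:
  hosts:
  - api.payments.example      # Illustrative external host
  location: MESH_EXTERNAL     # Outside the mesh; no sidecar on the far end
  resolution: DNS
  ports:
  - number: 443
    name: https
    protocol: TLS
```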
Linkerd: Simpler Alternative
Linkerd vs Istio:
Linkerd:
✅ Much simpler — installs in 5 minutes, works out of the box
✅ Lower resource overhead (Rust proxy vs C++ Envoy)
✅ Automatic mTLS (no PeerAuthentication config needed)
✅ Great default dashboards
❌ Less feature-rich traffic management (no fine-grained routing)
❌ No WASM plugins (Istio has WASM filter extensibility)
Istio:
✅ Full-featured traffic management (canary, fault injection, timeouts)
✅ Rich authorization (JWT, header-based, principal-based)
✅ Extensible (WASM, EnvoyFilter)
❌ Complex configuration model
❌ Heavy control plane overhead
❌ Steep learning curve
Choose Linkerd when: You want mTLS + observability with minimal config.
Choose Istio when: You need advanced traffic management, multi-cluster, or complex authz.
# Linkerd install (compare to Istio complexity)
curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install | sh
linkerd check --pre # Pre-flight check
linkerd install --crds | kubectl apply -f -
linkerd install | kubectl apply -f -
linkerd check # Verify installation
# Enable mesh on a namespace
kubectl annotate namespace production linkerd.io/inject=enabled
# Check mTLS status
linkerd viz stat deployment -n production
# Dashboard
linkerd viz dashboard
Anti-Patterns
❌ Enabling STRICT mTLS before all services are in the mesh — breaks plaintext services immediately
❌ No deny-all baseline before adding allow policies — implicit allow is not zero-trust
❌ Using Istio for East-West routing without understanding iptables overhead — adds ~1ms per hop
❌ DestinationRule without VirtualService — subsets in DestinationRule are only meaningful with VirtualService
❌ Fault injection in production without a kill switch — keep a fast rollback ready (deleting the fault VirtualService restores normal routing)
❌ Not sizing sidecar resources — Envoy sidecars need CPU/memory requests and limits like any other container
❌ Ignoring upgrade compatibility — Istio CRD versions change; test upgrades in staging
❌ Using Istio for simple clusters (< 10 services) — the operational cost exceeds the benefit
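For the sidecar-resources anti-pattern above, Istio supports per-pod annotations that size the injected proxy. A hedged sketch (the values are illustrative, not recommendations):

```yaml
# Pod template metadata on the workload's Deployment
metadata:
  annotations:
    sidecar.istio.io/proxyCPU: "100m"          # CPU request for istio-proxy
    sidecar.istio.io/proxyMemory: "128Mi"      # Memory request
    sidecar.istio.io/proxyCPULimit: "500m"     # CPU limit
    sidecar.istio.io/proxyMemoryLimit: "256Mi" # Memory limit
```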
Quick Reference
Core Istio CRDs:
VirtualService: Traffic routing rules (canary, retries, timeouts, fault injection)
DestinationRule: Load balancing, connection pools, circuit breaking, subsets
Gateway: Ingress/Egress configuration
ServiceEntry: Register external services in the mesh
PeerAuthentication: mTLS policy (per namespace or selector)
AuthorizationPolicy: Access control (ALLOW/DENY based on source/operation)
RequestAuthentication: JWT validation rules
Useful istioctl and kubectl commands:
istioctl analyze # Config validation and warnings
istioctl proxy-status # Sync status of all proxies
istioctl proxy-config routes POD # Envoy routing table for a pod
istioctl proxy-config listeners POD # Envoy listeners
istioctl proxy-config clusters POD # Upstream cluster config
kubectl exec -it POD -c istio-proxy -- pilot-agent request GET stats # Envoy stats
Traffic migration recipe (canary to 100%):
1. Deploy v2 (separate Deployment, same Service labels + version: v2 label)
2. Create DestinationRule with v1/v2 subsets
3. VirtualService: 90/10 split
4. Monitor error rate + latency for v2
5. 70/30 → 50/50 → 10/90 → 0/100
6. Delete the v1 Deployment and the canary VirtualService rules