aws-architect
Expert-level AWS architecture patterns covering the Well-Architected Framework, IAM least privilege design, VPC networking, CDK infrastructure-as-code, compute tradeoffs, database selection, and cost optimization. Trigger phrases: designing AWS infrastructure, AWS CDK, IAM policies, VPC design, Lamb
AWS Architect
AWS is not just a catalog of services; it is a composable platform with opinionated primitives. Expert AWS
architecture means understanding why each service exists, when it is the right choice, and how to wire
services together securely, cheaply, and in a way that stays operable over time.
Core Mental Model
The Well-Architected Framework is your north star: Operational Excellence, Security, Reliability,
Performance Efficiency, Cost Optimization, and Sustainability. Every architecture decision maps to at least one pillar.
Security is the load-bearing wall — you cannot retrofit least-privilege after the fact. Design IAM first,
networking second, compute third. The blast radius of any failure should be bounded by your account/VPC/subnet
topology before code runs. Cost is architecture — a Lambda with a 1 GB memory ceiling and a 15-minute timeout
is a design decision, not a tuning knob.
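To make the "cost is architecture" point concrete, here is a rough sketch of how memory and timeout settings bound what an invocation can cost. The prices are assumed list prices for x86 Lambda and should be checked against current pricing; the workload numbers are made up for illustration.

```python
# Assumed list prices (x86); verify against the current AWS pricing page.
PRICE_PER_GB_SECOND = 0.0000166667
PRICE_PER_REQUEST = 0.20 / 1_000_000


def invocation_cost(memory_mb: int, duration_s: float) -> float:
    """Cost of one invocation: GB-seconds consumed plus the per-request fee."""
    gb_seconds = (memory_mb / 1024) * duration_s
    return gb_seconds * PRICE_PER_GB_SECOND + PRICE_PER_REQUEST


def monthly_cost(memory_mb: int, duration_s: float, invocations: int) -> float:
    """Monthly bill if every invocation runs for duration_s seconds."""
    return invocation_cost(memory_mb, duration_s) * invocations


# Hypothetical workload: 10M invocations per month at 1 GB.
typical = monthly_cost(memory_mb=1024, duration_s=0.2, invocations=10_000_000)
worst_case = monthly_cost(memory_mb=1024, duration_s=900, invocations=10_000_000)
print(f"typical: ${typical:,.2f}  every call hits the 15-minute timeout: ${worst_case:,.2f}")
```

The gap between the two figures is the financial blast radius of a hung dependency, which is why the timeout belongs in the design review rather than in later tuning.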
IAM: Least Privilege in Depth
Principal Hierarchy
AWS Organizations (SCPs)
└── Account boundary (resource-based policies)
    └── IAM Role (identity-based policy)
        └── Permission boundary (ceiling)
            └── Session policy (further narrowing)
SCPs are guardrails, not grants. An SCP Allow means "accounts in this OU may have this permission" — the
IAM policy still needs to grant it explicitly.
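The "guardrails, not grants" rule can be sketched as a small evaluation model. This is a deliberate simplification, not the full IAM evaluation logic: it assumes exact action names (no wildcards), and it ignores resources, conditions, and resource-based policies. The `Layer` type and `is_allowed` function are hypothetical names for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Layer:
    """One policy layer, reduced to sets of exact action names."""
    allow: set[str] = field(default_factory=set)
    deny: set[str] = field(default_factory=set)


def is_allowed(action: str, identity: Layer, guardrails: list[Layer]) -> bool:
    layers = [identity, *guardrails]
    # An explicit Deny in any layer always wins.
    if any(action in layer.deny for layer in layers):
        return False
    # Otherwise every layer must allow the action: guardrails cap, they never grant.
    return all(action in layer.allow for layer in layers)


scp = Layer(allow={"s3:GetObject", "s3:PutObject", "iam:CreateRole"})
boundary = Layer(allow={"s3:GetObject", "s3:PutObject"}, deny={"iam:CreateRole"})
role_policy = Layer(allow={"s3:GetObject", "iam:CreateRole"})

print(is_allowed("s3:GetObject", role_policy, [scp, boundary]))    # True
print(is_allowed("s3:PutObject", role_policy, [scp, boundary]))    # False: the SCP permits it, but the role never grants it
print(is_allowed("iam:CreateRole", role_policy, [scp, boundary]))  # False: explicitly denied by the boundary
```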
Permission Boundaries Pattern
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowServicesInBoundary",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "dynamodb:GetItem",
        "dynamodb:PutItem"
      ],
      "Resource": [
        "arn:aws:s3:::my-app-bucket/*",
        "arn:aws:dynamodb:us-east-1:123456789012:table/my-app-*"
      ]
    },
    {
      "Sid": "DenyPrivilegeEscalation",
      "Effect": "Deny",
      "Action": [
        "iam:CreateRole",
        "iam:AttachRolePolicy",
        "iam:PutRolePolicy",
        "sts:AssumeRole"
      ],
      "Resource": "*"
    }
  ]
}
IAM Conditions to Always Include
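{
  "Condition": {
    "StringEquals": {
      "aws:RequestedRegion": ["us-east-1", "us-west-2"]
    },
    "Bool": {
      "aws:SecureTransport": "true",
      "aws:MultiFactorAuthPresent": "true"
    },
    "ArnLike": {
      "aws:PrincipalArn": "arn:aws:iam::*:role/allowed-role-*"
    }
  }
}
All conditions in one block are ANDed. Apply aws:MultiFactorAuthPresent only to statements meant for human principals, because requests from services and workloads do not carry MFA and would be blocked by it.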
VPC Design: Three-Tier Architecture
 Internet Gateway
         │
┌────────▼────────┐     AZ-A            AZ-B
│  Public Subnet  │     10.0.1.0/24     10.0.2.0/24
│  (ALB, NAT GW)  │
└────────┬────────┘
         │ (private route via NAT GW)
┌────────▼────────┐     10.0.11.0/24    10.0.12.0/24
│ Private Subnet  │     (App servers, ECS, Lambda)
│   (App Tier)    │
└────────┬────────┘
         │ (VPC endpoint or isolated)
┌────────▼────────┐     10.0.21.0/24    10.0.22.0/24
│ Isolated Subnet │     (RDS, ElastiCache)
│   (Data Tier)   │     No outbound route
└─────────────────┘
Security Groups vs NACLs
| Dimension | Security Group | NACL |
| --- | --- | --- |
| Stateful? | Yes (return traffic auto-allowed) | No (must allow inbound AND outbound) |
| Scope | ENI-level | Subnet-level |
| Rules | Allow only | Allow + Deny |
| Use for | Fine-grained resource access | Subnet-level DDoS/IP blocking |
VPC Endpoints vs NAT Gateway
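- Interface endpoint (PrivateLink): SSM, Secrets Manager, and most other AWS APIs (S3 and DynamoDB are also available). Keeps traffic off the internet, but is billed per endpoint-hour per AZ plus a per-GB processing fee, which is well below the NAT Gateway data rate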
- Gateway endpoint: S3 and DynamoDB only, free, route table entry
- NAT Gateway: ~$0.045/hr + $0.045/GB — expensive at scale. Use endpoints to avoid it for AWS APIs.
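A rough break-even sketch for the trade-off above, for the case where endpoints let you drop the NAT Gateway altogether. The NAT prices are the ones quoted in the list; the interface endpoint prices ($0.01 per hour per AZ and $0.01 per GB) are assumed list prices and should be verified for your region. Gateway endpoints for S3 and DynamoDB are free, so they are left out of the math.

```python
HOURS_PER_MONTH = 730

# NAT prices come from the list above; endpoint prices are assumptions.
NAT_HOURLY, NAT_PER_GB = 0.045, 0.045
ENDPOINT_HOURLY, ENDPOINT_PER_GB = 0.01, 0.01


def nat_monthly(gb: float, gateways: int = 1) -> float:
    """Monthly NAT Gateway cost: hourly charge plus data processing."""
    return gateways * NAT_HOURLY * HOURS_PER_MONTH + gb * NAT_PER_GB


def endpoints_monthly(gb: float, services: int, azs: int) -> float:
    """Monthly interface endpoint cost: one endpoint per service per AZ."""
    return services * azs * ENDPOINT_HOURLY * HOURS_PER_MONTH + gb * ENDPOINT_PER_GB


def break_even_gb(services: int, azs: int, gateways: int = 1) -> float:
    """Monthly GB above which interface endpoints cost less than NAT."""
    fixed_gap = (services * azs * ENDPOINT_HOURLY - gateways * NAT_HOURLY) * HOURS_PER_MONTH
    return max(0.0, fixed_gap / (NAT_PER_GB - ENDPOINT_PER_GB))


# Hypothetical setup: 3 interface endpoints across 2 AZs versus a single NAT Gateway.
print(f"break-even: {break_even_gb(services=3, azs=2):.0f} GB/month")
print(f"at 1 TB: NAT ${nat_monthly(1000):.2f} vs endpoints ${endpoints_monthly(1000, 3, 2):.2f}")
```

If the NAT Gateway has to stay for general internet egress, only the per-GB difference is saved and each endpoint's hourly charge is pure overhead, so the break-even volume is higher.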
# Terraform: VPC with endpoints to avoid NAT Gateway for AWS APIs
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}

resource "aws_vpc_endpoint" "ssm" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ssm"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}
AWS CDK: L1 / L2 / L3 Constructs
L3 (Patterns) ┌─ ApplicationLoadBalancedFargateService
│ aws-ecs-patterns — opinionated, fewer knobs
L2 (Constructs)┌─ aws_ecs.FargateService, aws_ec2.Vpc
│ Sensible defaults, escape hatches via .node.defaultChild
L1 (Cfn*) ┌─ CfnTaskDefinition — direct CloudFormation
│ Full control, verbose, no defaults
CDK VPC with Full Tier Isolation
import * as ec2 from 'aws-cdk-lib/aws-ec2';

// Runs inside a Stack or Construct, where `this` is the scope
const vpc = new ec2.Vpc(this, 'AppVpc', {
  ipAddresses: ec2.IpAddresses.cidr('10.0.0.0/16'),
  maxAzs: 3,
  natGateways: 1, // Cost optimization: single NAT GW (add per-AZ for HA)
  subnetConfiguration: [
    {
      name: 'Public',
      subnetType: ec2.SubnetType.PUBLIC,
      cidrMask: 24,
    },
    {
      name: 'Private',
      subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS,
      cidrMask: 24,
    },
    {
      name: 'Isolated',
      subnetType: ec2.SubnetType.PRIVATE_ISOLATED,
      cidrMask: 24,
    },
  ],
  gatewayEndpoints: {
    S3: { service: ec2.GatewayVpcEndpointAwsService.S3 },
    DYNAMODB: { service: ec2.GatewayVpcEndpointAwsService.DYNAMODB },
  },
});

// Interface endpoints for SSM (no NAT needed for EC2 management).
// Session Manager commonly needs the ssmmessages and ec2messages endpoints as well.
const ssmServices = {
  Ssm: ec2.InterfaceVpcEndpointAwsService.SSM,
  SsmMessages: ec2.InterfaceVpcEndpointAwsService.SSM_MESSAGES,
  Ec2Messages: ec2.InterfaceVpcEndpointAwsService.EC2_MESSAGES,
};
for (const [name, service] of Object.entries(ssmServices)) {
  vpc.addInterfaceEndpoint(`${name}Endpoint`, {
    service,
    subnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
  });
}
Lambda Handler with Structured Logging
import json

from aws_lambda_powertools import Logger, Tracer, Metrics
from aws_lambda_powertools.metrics import MetricUnit
from aws_lambda_powertools.utilities.typing import LambdaContext

logger = Logger(service="order-processor")
tracer = Tracer(service="order-processor")
metrics = Metrics(namespace="MyApp", service="order-processor")


def process_order(order_id: str) -> dict:
    # Placeholder for the real business logic (not part of the original snippet)
    if not order_id:
        raise ValueError("order_id is required")
    return {"order_id": order_id, "status": "processed"}


# log_event=True logs the full event; turn it off if events can carry sensitive data
@logger.inject_lambda_context(log_event=True)
@tracer.capture_lambda_handler
@metrics.log_metrics(capture_cold_start_metric=True)
def handler(event: dict, context: LambdaContext) -> dict:
    order_id = event.get("order_id")
    logger.info("Processing order", extra={"order_id": order_id, "source": event.get("source")})
    try:
        result = process_order(order_id)
        metrics.add_metric(name="OrdersProcessed", unit=MetricUnit.Count, value=1)
        return {"statusCode": 200, "body": json.dumps(result)}
    except ValueError as e:
        # Client error: report it, do not trigger a retry
        logger.warning("Validation error", extra={"order_id": order_id, "error": str(e)})
        return {"statusCode": 400, "body": json.dumps({"error": str(e)})}
    except Exception:
        logger.exception("Unexpected error processing order", extra={"order_id": order_id})
        metrics.add_metric(name="OrderProcessingErrors", unit=MetricUnit.Count, value=1)
        raise  # Re-raise for Lambda retry / DLQ
Lambda: Cold Starts and SnapStart
Cold Start Anatomy
Container provisioning (~100–500ms)
→ Runtime init (JVM: ~1–5s, Python: ~100ms, Node: ~100ms)
→ Handler init code (your module-level code)
→ Handler invocation
Mitigation strategies:
- Provisioned concurrency: Pre-warm N instances (costs money even when idle)
- SnapStart (introduced for Java; newer Python and .NET runtimes are also supported): Snapshot after init, restore from snapshot (AWS cites up to ~10x faster startup for Java)
- ARM64 (Graviton2): ~20% cheaper, often faster cold starts
- Minimize package size: Fewer imports = faster init
import os

import boto3

# ✅ Initialize heavyweight clients OUTSIDE handler (reused across warm invocations)
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table(os.environ['TABLE_NAME'])

# ✅ Use environment variable for config (not SSM on every invocation)
REGION = os.environ['AWS_REGION']


def handler(event, context):
    # table was created at import time, so warm invocations pay no init cost here
    response = table.get_item(Key={'id': event['id']})
    return response.get('Item')
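When a value cannot live in an environment variable (a rotated secret, for example), the same "initialize once, reuse while warm" idea can be applied with a small time-based cache. This is a minimal sketch: `cached` and `fetch_api_key` are hypothetical names, and the fetch function stands in for a real Parameter Store or Secrets Manager call.

```python
import time
from typing import Callable


def cached(fetch: Callable[[], str], ttl_seconds: float,
           clock: Callable[[], float] = time.monotonic) -> Callable[[], str]:
    """Wrap a slow fetch so warm invocations reuse the value until it expires."""
    value = None
    expires_at = 0.0

    def get() -> str:
        nonlocal value, expires_at
        now = clock()
        if value is None or now >= expires_at:
            value = fetch()
            expires_at = now + ttl_seconds
        return value

    return get


calls = []


def fetch_api_key() -> str:
    # Hypothetical stand-in for a Parameter Store / Secrets Manager lookup.
    calls.append(1)
    return "example-value"


# Module level: the cache lives as long as the execution environment stays warm.
get_api_key = cached(fetch_api_key, ttl_seconds=300)


def handler(event, context):
    return {"key_length": len(get_api_key())}
```

The TTL bounds how stale a rotated value can get, at the cost of one extra fetch per expiry in each warm environment.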
ECS Fargate vs EKS Trade-offs
| Dimension | ECS Fargate | EKS (Managed) |
| --- | --- | --- |
| Control plane cost | Free | ~$0.10/hr/cluster |
| Operational burden | Low | Medium-High |
| Container density | Each task gets its own vCPU/memory allocation (from 0.25 vCPU) | Bin-packing via scheduler |
| Networking | awsvpc (ENI per task), simpler | CNI (VPC-native or overlay) |
| Ecosystem | AWS-native | CNCF ecosystem (Helm, Argo, etc.) |
| Auto-scaling | Service Auto Scaling (target tracking, step) | HPA/VPA/KEDA, Cluster Autoscaler |
| Best for | Simpler workloads, cost-sensitive | Complex orchestration, 50+ services |
RDS Multi-AZ vs Aurora Global
RDS Multi-AZ:
  Primary (writes+reads) ──sync replication──► Standby (AZ-B)
  Failover: ~1-2 min (DNS flip), standby promotes

Aurora Cluster:
  Writer instance ──shared storage (6 copies, 3 AZs)──► Reader instance(s)
  Failover: ~30s (reader promotes), storage always consistent

Aurora Global:
  Primary region ──async replication (<1s lag)──► Secondary region(s)
  Use for: DR, read scaling across regions, near-local latency reads
Choose Aurora when: connection count > 1000 (pgBouncer helps RDS), need <30s failover, multi-region reads,
serverless auto-pause for dev/staging, or storage auto-scaling without pre-provisioning.
S3: Storage Classes and Lifecycle
Standard            → Hot data, frequent access ($0.023/GB-month)
Intelligent-Tiering → Unknown access patterns (monitoring fee + automatic tiering)
Standard-IA         → Infrequent access, retrieval fee, 30-day minimum
Glacier Instant     → Archive, ms retrieval, 90-day minimum
Glacier Flexible    → Archive, minutes-to-hours retrieval, 90-day minimum
Deep Archive        → Long-term archive, retrieval within ~12 hr, 180-day minimum, cheapest storage ($0.00099/GB-month)
Lifecycle rule: Standard → Standard-IA (30d) → Glacier Instant (90d) → Deep Archive (365d)
{
  "Rules": [{
    "ID": "auto-archive",
    "Status": "Enabled",
    "Filter": {"Prefix": "logs/"},
    "Transitions": [
      {"Days": 30, "StorageClass": "STANDARD_IA"},
      {"Days": 90, "StorageClass": "GLACIER_IR"},
      {"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
    ],
    "Expiration": {"Days": 2555},
    "NoncurrentVersionTransitions": [
      {"NoncurrentDays": 7, "StorageClass": "GLACIER_IR"}
    ],
    "NoncurrentVersionExpiration": {"NoncurrentDays": 90}
  }]
}
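To sanity-check a rule like the one above before applying it, a small sketch can report which storage class a current object version should be in at a given age. The thresholds mirror the rule; `storage_class_at` is a hypothetical helper, and it ignores noncurrent versions.

```python
from typing import Optional

# (age in days, storage class) pairs, mirroring the "Transitions" block above.
TRANSITIONS = [(30, "STANDARD_IA"), (90, "GLACIER_IR"), (365, "DEEP_ARCHIVE")]
EXPIRATION_DAYS = 2555  # roughly seven years


def storage_class_at(age_days: int) -> Optional[str]:
    """Storage class of a current object version at a given age, or None once expired."""
    if age_days >= EXPIRATION_DAYS:
        return None
    current = "STANDARD"
    for days, storage_class in TRANSITIONS:
        if age_days >= days:
            current = storage_class
    return current


for age in (0, 45, 120, 400, 3000):
    print(age, storage_class_at(age))
```

S3 applies lifecycle actions asynchronously, so real transitions can lag these thresholds somewhat.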
CloudFront with Lambda@Edge
Viewer Request → Lambda@Edge (A/B testing, auth header injection, URL rewrites)
Origin Request → Lambda@Edge (cache key manipulation, auth to origin)
Origin Response → Lambda@Edge (response header normalization, fallback origins)
Viewer Response → Lambda@Edge (security headers, cookie manipulation)
CloudFront Functions (cheaper, faster, limited):
- JS only, <1ms execution, no network calls
- Use for: URL normalization, query string manipulation, simple auth
Always add security headers, for example with a CloudFront Function attached to the viewer-response event:
// CloudFront Function: security-headers (viewer-response)
function handler(event) {
  var response = event.response;
  var headers = response.headers;
  headers['strict-transport-security'] = { value: 'max-age=63072000; includeSubDomains; preload' };
  headers['x-content-type-options'] = { value: 'nosniff' };
  headers['x-frame-options'] = { value: 'DENY' };
  // Legacy header: modern browsers ignore it, so rely on the CSP below instead
  headers['x-xss-protection'] = { value: '1; mode=block' };
  headers['referrer-policy'] = { value: 'strict-origin-when-cross-origin' };
  // 'unsafe-inline' weakens the policy; tighten it once inline scripts and styles are gone
  headers['content-security-policy'] = {
    value: "default-src 'self'; script-src 'self' 'unsafe-inline'; style-src 'self' 'unsafe-inline'"
  };
  return response;
}
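For distributions that already run Lambda@Edge on the response path, the same headers can be set there. The sketch below assumes a Python handler attached to the origin-response or viewer-response trigger. The header values are a subset of those in the function above, and the handler name is arbitrary.

```python
SECURITY_HEADERS = {
    "Strict-Transport-Security": "max-age=63072000; includeSubDomains; preload",
    "X-Content-Type-Options": "nosniff",
    "X-Frame-Options": "DENY",
    "Referrer-Policy": "strict-origin-when-cross-origin",
}


def handler(event, context):
    """Lambda@Edge response trigger: add security headers to the response."""
    response = event["Records"][0]["cf"]["response"]
    headers = response["headers"]
    for name, value in SECURITY_HEADERS.items():
        # Lambda@Edge keys headers by lowercase name; each entry is a list of key/value dicts.
        headers[name.lower()] = [{"key": name, "value": value}]
    return response
```

Attaching it to origin-response means the headers are cached with the object, so the function does not run on every cache hit.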
AWS Organizations and Cost Management
Root
├── Management Account (billing only, no workloads)
├── Security OU
│   ├── Audit Account (CloudTrail, Config aggregator)
│   └── Log Archive Account (centralized S3 logs)
├── Infrastructure OU
│   └── Shared Services Account (Transit Gateway, Route 53, ECR)
└── Workloads OU
    ├── Production OU → prod account(s)
    └── SDLC OU → dev/staging accounts
SCP: Prevent disabling CloudTrail
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyCloudTrailDisable",
    "Effect": "Deny",
    "Action": [
      "cloudtrail:DeleteTrail",
      "cloudtrail:StopLogging",
      "cloudtrail:UpdateTrail"
    ],
    "Resource": "*"
  }]
}
Anti-Patterns
❌ S3 public-read on entire bucket — use CloudFront OAC (Origin Access Control) instead
❌ Hardcoded credentials — use IAM roles, instance profiles, IRSA
❌ Security groups with 0.0.0.0/0 ingress on port 22/3389 — use SSM Session Manager
❌ Single-AZ RDS in production — always Multi-AZ, even for cost-sensitive workloads
❌ Lambda timeouts set to 15 minutes as default — set the minimum viable timeout + 20% buffer
❌ All infrastructure in default VPC — default VPC has no isolation; create purpose-built VPCs
❌ Missing resource tagging — you cannot do cost allocation, compliance, or automation without tags
❌ IAM users with long-lived access keys — use roles, OIDC federation, or IAM Identity Center
❌ CloudWatch alarms without actions — alarms that don't alert are theater
❌ ECS tasks with task role = AdministratorAccess — scope task roles to exactly what the app needs
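Some of these anti-patterns are easy to detect automatically. As one example, the sketch below flags security group rules that expose SSH or RDP to the whole internet. It assumes input shaped like an entry of the `SecurityGroups` list returned by EC2's DescribeSecurityGroups; `open_admin_rules` is a hypothetical helper name.

```python
ADMIN_PORTS = {22, 3389}
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}


def open_admin_rules(security_group: dict) -> list[tuple[str, int]]:
    """Return (cidr, port) pairs where SSH/RDP ingress is open to the world."""
    findings = []
    for rule in security_group.get("IpPermissions", []):
        protocol = rule.get("IpProtocol")
        if protocol == "-1":
            ports = ADMIN_PORTS  # "-1" means all protocols and all ports
        elif protocol == "tcp":
            low, high = rule.get("FromPort"), rule.get("ToPort")
            if low is None or high is None:
                continue
            ports = {p for p in ADMIN_PORTS if low <= p <= high}
        else:
            continue  # UDP/ICMP rules are out of scope for this check
        cidrs = [r["CidrIp"] for r in rule.get("IpRanges", [])]
        cidrs += [r["CidrIpv6"] for r in rule.get("Ipv6Ranges", [])]
        for cidr in cidrs:
            if cidr in OPEN_CIDRS:
                findings.extend((cidr, port) for port in sorted(ports))
    return findings


example_group = {
    "GroupId": "sg-example",
    "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    ],
}
print(open_admin_rules(example_group))
```

In practice a managed check such as an AWS Config rule covers the same ground. The sketch only shows what the rule amounts to.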
Quick Reference
IAM Decision Tree:
Human needs access? → IAM Identity Center (SSO)
EC2/ECS needs AWS access? → Instance profile / Task role
Lambda needs AWS access? → Execution role
Cross-account access? → Role assumption (add ExternalId when the caller is a third party)
CI/CD pipeline? → OIDC identity provider (GitHub Actions → OIDC → role)
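For the CI/CD branch of the tree, the role's trust policy is what ties it to a specific repository. The sketch below builds one for GitHub Actions, assuming the GitHub OIDC provider is already registered in the account. The account ID is the placeholder used elsewhere in this document and the repository name is made up.

```python
import json


def github_oidc_trust_policy(account_id: str, repo: str, branch: str = "main") -> dict:
    """Trust policy letting one branch of one GitHub repo assume the role via OIDC."""
    provider = "token.actions.githubusercontent.com"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    f"{provider}:aud": "sts.amazonaws.com",
                    # Pin the subject so other repos or branches cannot assume the role.
                    f"{provider}:sub": f"repo:{repo}:ref:refs/heads/{branch}",
                },
            },
        }],
    }


policy = github_oidc_trust_policy("123456789012", "example-org/example-repo")
print(json.dumps(policy, indent=2))
```

Leaving out the `sub` condition would let any repository on GitHub assume the role, so that key is the part worth reviewing most carefully.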
VPC CIDR Planning (avoid overlaps with corporate/VPN):
Dev: 10.0.0.0/16
Staging: 10.1.0.0/16
Prod: 10.2.0.0/16
Shared: 10.100.0.0/16
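Overlaps in a plan like this can be checked mechanically before any VPC is created. A minimal sketch using only the standard library follows. The plan mirrors the ranges above, and the corporate range in the usage line is a made-up example.

```python
import ipaddress
from itertools import combinations

PLAN = {
    "dev": "10.0.0.0/16",
    "staging": "10.1.0.0/16",
    "prod": "10.2.0.0/16",
    "shared": "10.100.0.0/16",
}


def find_overlaps(plan: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of named CIDR blocks that overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in plan.items()}
    return [
        (a, b)
        for a, b in combinations(networks, 2)
        if networks[a].overlaps(networks[b])
    ]


print(find_overlaps(PLAN))
# Hypothetical corporate VPN range that collides with prod:
print(find_overlaps({**PLAN, "corp-vpn": "10.2.128.0/17"}))
```

Running a check like this in CI against the list of all allocated ranges catches collisions before peering or Transit Gateway attachments make them expensive to fix.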
Compute Decision Tree:
Stateless, event-driven, <15min? → Lambda
Long-running, simple container? → ECS Fargate
Complex orchestration, 20+ svcs? → EKS
Batch processing, spot-friendly? → ECS/EKS with Spot Instances or AWS Batch
Cost Quick Wins:
1. Right-size EC2 (Compute Optimizer recommendations)
2. S3 Intelligent-Tiering for unknown patterns
3. Delete unattached EBS volumes (often forgotten)
4. NAT Gateway → VPC endpoints for AWS API traffic
5. Reserved Instances / Savings Plans for stable workloads (1yr = ~30% savings)
Skill Information
- Source: MoltbotDen
- Category: DevOps & Cloud