cost-optimization-cloud
Expert cloud cost optimization covering FinOps fundamentals, AWS and GCP cost tools, right-sizing strategies, Reserved Instances vs Savings Plans vs Spot trade-offs, S3 and GCS storage lifecycle optimization, data transfer cost patterns, idle resource detection, tag enforcement, and Kubernetes resource requests/limits.
Cost Optimization: Cloud
Cloud costs are architecture decisions made visible. Every overprovisioned instance, every GB of cross-AZ
traffic, every unattached EBS volume is a choice someone made (or failed to make). FinOps is the practice
of making those choices intentional, tracked, and continuously improved.
Core Mental Model
FinOps follows the Inform → Optimize → Operate cycle: you can't optimize what you can't see (Inform),
and you can't sustain improvements without process (Operate). The highest-leverage optimizations are almost always:
(1) delete things you're not using, (2) right-size things that are running, (3) commit to discounts for
things that will keep running. In that order. Don't buy Reserved Instances for over-provisioned instances —
right-size first, then commit.
FinOps: Inform → Optimize → Operate
INFORM (visibility):
─ Tagging strategy enforced (cost allocation by team/service/env)
─ Cost dashboards by team (chargeback or showback)
─ Anomaly detection alerts
─ Daily/weekly cost reports to team leads
OPTIMIZE (right-sizing, discount programs):
─ Instance right-sizing (CPU/memory utilization analysis)
─ Savings Plans / Reserved Instances for stable workloads
─ Storage class optimization (S3 Intelligent-Tiering)
─ Idle/orphaned resource cleanup
─ Architecture optimization (serverless, batch, Spot)
OPERATE (process and culture):
─ Cloud cost in sprint planning
─ Cost targets per team in OKRs
─ Monthly FinOps review
─ Engineer-level cost visibility
─ "You build it, you pay for it" accountability
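To see why "right-size first, then commit" matters, here is a quick sketch. The prices and the ~28% Savings Plan discount are illustrative assumptions (roughly m5.xlarge vs m5.large On-Demand rates), not live AWS pricing:

```python
# Hypothetical figures: m5.xlarge ≈ $0.192/hr, m5.large ≈ $0.096/hr On-Demand,
# and a ~28% discount for a 1yr no-upfront Compute Savings Plan.
HOURS_PER_YEAR = 8760

def annual_cost(hourly_rate, sp_discount=0.0):
    """Annual cost of one instance running 24/7 at the given hourly rate."""
    return hourly_rate * (1 - sp_discount) * HOURS_PER_YEAR

commit_oversized  = annual_cost(0.192, sp_discount=0.28)  # SP bought before right-sizing
commit_rightsized = annual_cost(0.096, sp_discount=0.28)  # right-size to m5.large, then commit

print(f"SP on oversized instance: ${commit_oversized:,.0f}/yr")
print(f"Right-size, then SP:      ${commit_rightsized:,.0f}/yr")
print(f"Locked-in waste:          ${commit_oversized - commit_rightsized:,.0f}/yr")
```

Committing before right-sizing locks in the waste for the full term: here, half the annual commitment pays for capacity the workload never uses.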
AWS Cost Tools
Cost Explorer
import boto3
from datetime import datetime, timedelta
ce = boto3.client('ce', region_name='us-east-1')
# Get daily costs by service for last 30 days
response = ce.get_cost_and_usage(
TimePeriod={
'Start': (datetime.now() - timedelta(days=30)).strftime('%Y-%m-%d'),
'End': datetime.now().strftime('%Y-%m-%d'),
},
Granularity='DAILY',
Filter={
'Tags': {
'Key': 'Environment',
'Values': ['production']
}
},
GroupBy=[
{'Type': 'DIMENSION', 'Key': 'SERVICE'},
],
Metrics=['UnblendedCost']
)
for result in response['ResultsByTime']:
date = result['TimePeriod']['Start']
for group in result['Groups']:
service = group['Keys'][0]
cost = float(group['Metrics']['UnblendedCost']['Amount'])
if cost > 1.0: # Only show non-trivial costs
print(f"{date} | {service}: ${cost:.2f}")
Cost Anomaly Detection
# Create AWS Cost Anomaly Detection monitor
ce.create_anomaly_monitor(
AnomalyMonitor={
'MonitorName': 'production-cost-monitor',
'MonitorType': 'DIMENSIONAL',
'MonitorDimension': 'SERVICE',
}
)
# Create alert subscription (email when anomaly > $100)
ce.create_anomaly_subscription(
AnomalySubscription={
'MonitorArnList': ['arn:aws:ce::123456789012:anomalymonitor/xxx'],
'SubscriptionName': 'cost-anomaly-alert',
'Threshold': 100.0,
'Frequency': 'DAILY',
'Subscribers': [
{
'Address': '[email protected]',
'Type': 'EMAIL',
},
{
'Address': 'arn:aws:sns:us-east-1:123456789012:cost-alerts',
'Type': 'SNS',
}
]
}
)
Right-Sizing: EC2 and RDS
What to Look At (CloudWatch Metrics)
| Metric | Over-provisioned if | Action |
| CPUUtilization | < 20% consistently | Downsize instance type |
| FreeableMemory | > 50% of total | Downsize instance class |
| DatabaseConnections | < 10% of max | Downsize or use RDS Proxy |
| NetworkIn/Out | < 10% of baseline | May indicate zombie instance |
| VolumeReadOps/WriteOps | Near zero | Unattached or idle volume |
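The thresholds in the table can be codified as a screening function. The metric names and cutoffs mirror the table above; treat them as starting points to tune, not hard rules:

```python
# Screen a resource's 14-day average metrics against the right-sizing table.
def rightsizing_findings(metrics):
    """metrics: dict of 14-day averages. Returns a list of (metric, action) findings."""
    findings = []
    if metrics.get("cpu_utilization_pct", 100) < 20:
        findings.append(("CPUUtilization", "downsize instance type"))
    if metrics.get("freeable_memory_pct", 0) > 50:
        findings.append(("FreeableMemory", "downsize instance class"))
    if metrics.get("db_connections_pct_of_max", 100) < 10:
        findings.append(("DatabaseConnections", "downsize or use RDS Proxy"))
    if metrics.get("network_pct_of_baseline", 100) < 10:
        findings.append(("NetworkIn/Out", "possible zombie instance"))
    return findings

# Example: an over-provisioned database-style profile
print(rightsizing_findings({
    "cpu_utilization_pct": 12,
    "freeable_memory_pct": 65,
    "db_connections_pct_of_max": 4,
}))
```

Feed it averages pulled from CloudWatch (as in the CLI example below) and act only on findings that hold consistently, not on one quiet week.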
# AWS CLI: average daily CPU for one instance over 14 days (consistently < 5% → right-size candidate)
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
--start-time $(date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 86400 \
--statistics Average \
--query 'Datapoints[*].Average' \
--output table
# AWS Compute Optimizer recommendation
aws compute-optimizer get-ec2-instance-recommendations \
--filters name=Finding,values=OVER_PROVISIONED \
--query 'instanceRecommendations[*].{Instance:instanceArn,Savings:estimatedMonthlySavings.value}'
Reserved Instances vs Savings Plans vs Spot
SPOT INSTANCES (70-90% discount):
✅ Best for: batch jobs, CI/CD workers, ML training, stateless apps
❌ Can be reclaimed with 2-minute notice
Pattern: request with InstanceInterruptionBehavior=terminate (the default) and add checkpoint logic
Spot best practices:
─ Use instance fleet (multiple types/AZs) for availability
─ Leave the max price at its default (capped at On-Demand); Spot pricing no longer works by bidding
─ Use ECS Capacity Providers or EKS Karpenter for automated Spot handling
─ Never Spot: databases, ZooKeeper, Kafka brokers, anything stateful
SAVINGS PLANS (1yr: ~30%, 3yr: ~50% discount):
Compute Savings Plans: Most flexible (any instance type, region, OS, ECS, Lambda)
EC2 Instance Savings Plans: up to 72% savings (3yr, all upfront), locked to instance family + region
SageMaker Savings Plans: ML workloads only
✅ Buy when: Stable compute baseline you're confident will run 1+ year
Start with Compute Savings Plans for flexibility, then add EC2 if you know family
RESERVED INSTANCES (similar savings to Savings Plans, older model):
Standard RI: Locked to specific instance type + region
Convertible RI: Can exchange for different type (less discount)
✅ Still useful for: RDS, ElastiCache, Redshift (Savings Plans don't cover these)
DECISION:
Stable 24/7 workload? → Savings Plans / RIs
Interruptible batch? → Spot
Unpredictable/bursty? → On-Demand (optimize architecture first)
Savings calculation:
On-Demand: $0.192/hr × 730h/month = $140/month
1yr Savings Plan (no upfront): $0.131/hr × 730h = $95.63/month → $534/year savings
1yr Savings Plan (all upfront): ~$1,050 vs ~$1,682 On-Demand/year → ~$632/year savings, ~$98 more than no-upfront
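The same comparison, re-derived in one place so you can plug in your own rates. The hourly figures are the illustrative ones used above, not live AWS pricing:

```python
# Compare annual cost of On-Demand vs 1yr Savings Plan payment options.
HOURS_PER_MONTH = 730

def annual(hourly_rate):
    """Annual cost at a given effective hourly rate, running 24/7."""
    return hourly_rate * HOURS_PER_MONTH * 12

on_demand   = annual(0.192)   # ≈ $1,682/yr
no_upfront  = annual(0.131)   # 1yr Compute SP, pay monthly
all_upfront = 1050.0          # 1yr Compute SP, single upfront payment (assumed quote)

for name, cost in [("On-Demand", on_demand),
                   ("1yr SP no-upfront", no_upfront),
                   ("1yr SP all-upfront", all_upfront)]:
    print(f"{name:20s} ${cost:7,.0f}/yr  (saves ${on_demand - cost:,.0f})")
```

All-upfront buys a slightly deeper discount in exchange for cash flow and flexibility; run the numbers before assuming it's worth it.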
S3 Storage Classes and Lifecycle Rules
# Boto3: apply lifecycle rule to reduce storage costs
import boto3
s3 = boto3.client('s3')
lifecycle_config = {
'Rules': [
{
'ID': 'logs-auto-archive',
'Status': 'Enabled',
'Filter': {'Prefix': 'logs/'},
'Transitions': [
# Day 0: Standard ($0.023/GB)
{'Days': 30, 'StorageClass': 'STANDARD_IA'}, # $0.0125/GB
{'Days': 90, 'StorageClass': 'GLACIER_IR'}, # $0.004/GB
{'Days': 365, 'StorageClass': 'DEEP_ARCHIVE'}, # $0.00099/GB
],
'Expiration': {'Days': 2555}, # Delete after 7 years
'NoncurrentVersionTransitions': [
{'NoncurrentDays': 7, 'StorageClass': 'GLACIER_IR'},
],
'NoncurrentVersionExpiration': {'NoncurrentDays': 90},
'AbortIncompleteMultipartUpload': {'DaysAfterInitiation': 7},
},
{
'ID': 'intelligent-tiering-user-content',
'Status': 'Enabled',
'Filter': {'Prefix': 'user-uploads/'},
'Transitions': [
# Unknown access pattern → let AWS optimize automatically
{'Days': 0, 'StorageClass': 'INTELLIGENT_TIERING'},
],
},
]
}
s3.put_bucket_lifecycle_configuration(
Bucket='my-app-data',
LifecycleConfiguration=lifecycle_config
)
S3 Cost Quick Math
Standard: $0.023/GB/month
Standard-IA: $0.0125/GB/month + $0.01/GB retrieval
Intelligent-Tiering: $0.023 frequent, $0.0125 infrequent (auto-tiered)
+ $0.0025/1000 objects monitoring fee
Glacier Instant: $0.004/GB/month + $0.03/GB retrieval
Glacier Flexible: $0.0036/GB/month + $0.01/GB retrieval (hours)
Deep Archive: $0.00099/GB/month + $0.02/GB retrieval (12 hours)
Minimum storage duration:
Standard-IA: 30 days (charged for 30 even if deleted sooner)
Glacier Instant: 90 days
Glacier Flexible: 90 days
Deep Archive: 180 days
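The quick math above is easy to turn into a comparison for a concrete workload. Prices are the per-GB figures quoted in this section (a subset of the classes), and the example access pattern (10 TB stored, 1% retrieved monthly) is hypothetical:

```python
# Monthly cost per storage class: storage + retrieval, using the rates above.
CLASSES = {
    # name: (storage $/GB/mo, retrieval $/GB, minimum billable days)
    "STANDARD":     (0.023,   0.00, 0),
    "STANDARD_IA":  (0.0125,  0.01, 30),
    "GLACIER_IR":   (0.004,   0.03, 90),
    "DEEP_ARCHIVE": (0.00099, 0.02, 180),
}

def monthly_cost(cls, gb_stored, gb_retrieved_per_month):
    storage, retrieval, _min_days = CLASSES[cls]
    return gb_stored * storage + gb_retrieved_per_month * retrieval

# 10 TB of logs, 1% read back per month: colder classes win despite retrieval fees
for cls in CLASSES:
    print(f"{cls:13s} ${monthly_cost(cls, 10_000, 100):8,.2f}/month")
```

Remember the minimum duration column: objects deleted or transitioned before it are billed for the full minimum, which is why short-lived data should stay in Standard.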
Data Transfer Cost Patterns
Data transfer is often the biggest hidden cost. Key rules:
FREE:
─ Inbound to AWS from internet: FREE
─ S3 → CloudFront: FREE
─ Within the same AZ over private IPs: FREE
─ VPC endpoints for S3/DynamoDB: FREE (gateway endpoints)
EXPENSIVE:
─ EC2/Lambda → Internet egress: $0.09/GB first 10TB
─ Cross-AZ within same region: $0.01/GB in EACH direction (billed on both ends → $0.02/GB effective)
─ Cross-region: $0.02/GB
─ NAT Gateway processing: $0.045/GB
CROSS-AZ IS THE HIDDEN KILLER:
Service A (us-east-1a) → Service B (us-east-1b)
→ $0.01/GB × 2 (bidirectional) = $0.02/GB
At 100GB/day: $2/day = $60/month just for cross-AZ traffic
Fix:
─ Deploy paired services in same AZ (use affinity rules in k8s)
─ Know your LB defaults: ALB cross-zone is on by default (its cross-AZ traffic is free); NLB cross-zone is off by default and billed if enabled
─ Compress data before sending cross-AZ
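The k8s affinity fix above can be sketched as pod affinity keyed on zone. The `app: service-a` label is a hypothetical stand-in for whatever the pod talks to most:

```yaml
# Prefer scheduling this pod in the same zone as service-a pods, keeping their
# chatter within one AZ. Trade-off: weaker zone-failure isolation, so use
# "preferred" (soft) rather than "required" (hard) affinity.
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: service-a          # hypothetical label of the paired service
          topologyKey: topology.kubernetes.io/zone
```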
# Detect cross-AZ traffic with VPC Flow Logs query (Athena)
"""
SELECT
srcaddr,
dstaddr,
sum(bytes) as total_bytes,
sum(bytes) / 1073741824.0 as gb,
(sum(bytes) / 1073741824.0) * 0.01 as estimated_cost_usd
FROM vpc_flow_logs
WHERE
srcaddr LIKE '10.0.1.%' -- AZ-A subnet
AND dstaddr LIKE '10.0.2.%' -- AZ-B subnet
AND log_status = 'OK'
AND start >= to_unixtime(current_timestamp - interval '7' day)
GROUP BY srcaddr, dstaddr
ORDER BY total_bytes DESC
LIMIT 20
"""
Idle Resource Detection
#!/usr/bin/env bash
# find-idle-resources.sh — Run monthly to find waste
# Unattached EBS volumes (paying for storage with no instance)
echo "=== Unattached EBS Volumes ==="
aws ec2 describe-volumes \
--filters Name=status,Values=available \
--query 'Volumes[*].{ID:VolumeId,Size:Size,Type:VolumeType,Created:CreateTime}' \
--output table
# Stopped EC2 instances (still paying for EBS, Elastic IP, etc.)
echo "=== Stopped EC2 Instances ==="
aws ec2 describe-instances \
--filters Name=instance-state-name,Values=stopped \
--query 'Reservations[*].Instances[*].{ID:InstanceId,Type:InstanceType,Stopped:StateTransitionReason}' \
--output table
# Unused Elastic IPs ($0.005/hr ≈ $3.60/month each; since Feb 2024 ALL public IPv4 addresses are billed)
echo "=== Unattached Elastic IPs ==="
aws ec2 describe-addresses \
--query 'Addresses[?AssociationId==null].{IP:PublicIp,AllocationId:AllocationId}' \
--output table
# Old snapshots (sorted by age)
echo "=== EBS Snapshots Older Than 90 Days ==="
aws ec2 describe-snapshots --owner-ids self \
--query "Snapshots[?StartTime<='$(date -d '90 days ago' --iso-8601)'].{ID:SnapshotId,Size:VolumeSize,Date:StartTime,Description:Description}" \
--output table
# Unused load balancers (no healthy targets)
echo "=== Load Balancers with No Targets ==="
aws elbv2 describe-load-balancers \
--query 'LoadBalancers[*].LoadBalancerArn' --output text | \
tr '\t' '\n' | while read arn; do
targets=$(aws elbv2 describe-target-groups --load-balancer-arn "$arn" \
--query 'TargetGroups[*].TargetGroupArn' --output text | \
tr '\t' '\n' | while read tg; do
aws elbv2 describe-target-health --target-group-arn "$tg" \
--query 'length(TargetHealthDescriptions)' --output text
done | awk '{sum+=$1} END{print sum}')
[ "$targets" = "0" ] && echo "IDLE: $arn"
done
Kubernetes: Resource Requests/Limits Impact
# The cost of Kubernetes is determined by what you REQUEST, not what you USE
# Nodes must have enough allocatable capacity to satisfy ALL pod requests
# Over-requesting (wasteful):
resources:
requests:
cpu: "2" # Reserves 2 vCPU on the node always
memory: "4Gi" # Reserves 4GB on the node always
limits:
cpu: "2"
memory: "4Gi"
# Under-requesting (unstable — OOMKilled, CPU throttled):
resources:
requests:
cpu: "10m" # 1% CPU — too low, pod gets throttled
memory: "32Mi" # Way too low — will OOM
# Right-sized (VPA recommendations + 20% buffer):
resources:
requests:
cpu: "250m" # What the app typically uses
memory: "512Mi" # P99 of actual memory usage
limits:
cpu: "1000m" # Allow burst, prevent monopolizing
memory: "512Mi" # Keep limit = request to avoid OOM surprises
# Use Vertical Pod Autoscaler in recommendation mode
kubectl apply -f - <<EOF
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: order-api-vpa
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: order-api
updatePolicy:
updateMode: "Off" # Recommend only, don't auto-apply
EOF
# After 24h, check recommendations
kubectl describe vpa order-api-vpa | grep -A 20 "Recommendation"
Tagging Strategy for Cost Allocation
# Mandatory tags (enforce via AWS SCPs / GCP Org Policy):
tags:
Environment: "production" # production, staging, development
Team: "platform" # Team responsible (maps to cost center)
Service: "order-api" # Individual service/application
CostCenter: "engineering-01" # Finance's cost center code
ManagedBy: "terraform" # Who manages this resource
# AWS SCP: deny resource creation without required tags
{
"Effect": "Deny",
"Action": ["ec2:RunInstances", "rds:CreateDBInstance"],
"Resource": "*",
"Condition": {
"Null": {
"aws:RequestTag/Environment": "true",
"aws:RequestTag/Team": "true",
"aws:RequestTag/Service": "true"
}
}
}
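The SCP blocks new untagged resources, but existing ones still need auditing. A minimal sketch of the compliance check, written as a pure function over tag lists shaped like the EC2 API's `[{"Key": ..., "Value": ...}]` output (feed it `describe_instances` results; `CostCenter` etc. mirror the mandatory tags above):

```python
# Audit a resource's tags against the mandatory tag set.
REQUIRED_TAGS = {"Environment", "Team", "Service", "CostCenter"}

def missing_tags(tag_list):
    """Return the required tag keys absent from one resource's tag list."""
    present = {t["Key"] for t in (tag_list or [])}
    return sorted(REQUIRED_TAGS - present)

# Example: a resource missing most of its cost-allocation tags
tags = [{"Key": "Environment", "Value": "production"},
        {"Key": "Name", "Value": "web-1"}]
print(missing_tags(tags))  # → ['CostCenter', 'Service', 'Team']
```

Run this across every account on a schedule and route non-empty results to the owning team; untagged spend is unattributable spend.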
Anti-Patterns
❌ Buying RIs/Savings Plans before right-sizing — committing to waste
❌ No tagging → no cost allocation — you can't hold teams accountable without attribution
❌ Over-provisioning "just in case" — use auto-scaling; don't manually buffer
❌ Dev/staging same size as production — right-size lower environments aggressively
❌ Ignoring cross-AZ traffic — often adds 10-20% to EC2 bills
❌ Not setting S3 lifecycle rules — old data sits in Standard class forever
❌ Lambda memory set to 3GB because "more CPU" — cost scales linearly with memory
❌ Manual cost reviews — automate anomaly detection, don't rely on monthly reviews
❌ Reserved Instances for non-stable workloads — lock-in without savings
Quick Reference
AWS cost savings ladder (easiest → most effort):
1. Delete unattached EBS volumes (Immediate, free money)
2. Delete unused EIPs, snapshots, LBs (Immediate)
3. Set S3 lifecycle rules (Low effort, high savings at scale)
4. Right-size over-provisioned EC2/RDS (Medium — need data)
5. Buy Compute Savings Plans for stable workloads (Low effort once you decide)
6. Move batch workloads to Spot (Medium — needs architecture change)
7. Move to ARM/Graviton (~20% cheaper, often drop-in) (Medium)
8. Redesign for serverless/Fargate where appropriate (High effort, high savings)
GCP cost savings ladder:
1. Delete idle VMs and unattached disks
2. Set Cloud Storage lifecycle rules
3. Enable CUD (Committed Use Discounts) for stable workloads
4. Move to Spot VMs for batch/CI
5. Use Cloud Run (scale-to-zero) instead of always-on VMs
6. BigQuery: partition + cluster tables (query cost reduction)
Lambda cost formula:
Cost = requests × $0.0000002 + GB-seconds × $0.0000166667
Example: 1M requests, 512MB, avg 500ms:
= 1,000,000 × $0.0000002 + (1,000,000 × 0.5 × 0.5) × $0.0000166667
= $0.20 + $4.17 = $4.37 per million invocations
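The formula above as a function, useful for sizing experiments (e.g. does doubling memory to halve duration pay off?). Rates are the figures quoted above; free tier is ignored:

```python
# Lambda cost model: per-request fee + GB-seconds of compute.
REQUEST_RATE   = 0.0000002      # $ per request
GB_SECOND_RATE = 0.0000166667   # $ per GB-second

def lambda_cost(requests, memory_mb, avg_duration_ms):
    """Monthly Lambda cost in USD for the given invocation profile."""
    gb_seconds = requests * (memory_mb / 1024) * (avg_duration_ms / 1000)
    return requests * REQUEST_RATE + gb_seconds * GB_SECOND_RATE

# The worked example: 1M requests, 512MB, 500ms average
print(f"${lambda_cost(1_000_000, 512, 500):.2f}")  # → $4.37
```

Because compute cost scales linearly with memory, doubling memory only pays off if it more than halves duration (it sometimes does, since CPU scales with memory).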