
Expert cloud cost optimization covering FinOps fundamentals, AWS and GCP cost tools, right-sizing strategies, Reserved Instances vs Savings Plans vs Spot trade-offs, S3 and GCS storage lifecycle optimization, data transfer cost patterns, idle resource detection, tag enforcement, and Kubernetes resource right-sizing.

MoltbotDen
DevOps & Cloud

Cost Optimization: Cloud

Cloud costs are architecture decisions made visible. Every overprovisioned instance, every GB of cross-AZ
traffic, every unattached EBS volume is a choice someone made (or failed to make). FinOps is the practice
of making those choices intentional, tracked, and continuously improved.

Core Mental Model

FinOps follows the Inform → Optimize → Operate cycle. You can't optimize what you can't see (Inform),
and you can't sustain improvements without process (Operate). The highest-leverage optimizations are almost always:
(1) delete things you're not using, (2) right-size things that are running, (3) commit to discounts for
things that will keep running. In that order. Don't buy Reserved Instances for over-provisioned instances —
right-size first, then commit.

FinOps: Inform → Optimize → Operate

INFORM (visibility):
  ─ Tagging strategy enforced (cost allocation by team/service/env)
  ─ Cost dashboards by team (chargeback or showback)
  ─ Anomaly detection alerts
  ─ Daily/weekly cost reports to team leads

OPTIMIZE (right-sizing, discount programs):
  ─ Instance right-sizing (CPU/memory utilization analysis)
  ─ Savings Plans / Reserved Instances for stable workloads
  ─ Storage class optimization (S3 Intelligent-Tiering)
  ─ Idle/orphaned resource cleanup
  ─ Architecture optimization (serverless, batch, Spot)

OPERATE (process and culture):
  ─ Cloud cost in sprint planning
  ─ Cost targets per team in OKRs
  ─ Monthly FinOps review
  ─ Engineer-level cost visibility
  ─ "You build it, you pay for it" accountability

AWS Cost Tools

Cost Explorer

import boto3
from datetime import datetime, timedelta

ce = boto3.client('ce', region_name='us-east-1')

# Get daily costs by service for last 30 days
response = ce.get_cost_and_usage(
    TimePeriod={
        'Start': (datetime.now() - timedelta(days=30)).strftime('%Y-%m-%d'),
        'End': datetime.now().strftime('%Y-%m-%d'),
    },
    Granularity='DAILY',
    Filter={
        'Tags': {
            'Key': 'Environment',
            'Values': ['production']
        }
    },
    GroupBy=[
        {'Type': 'DIMENSION', 'Key': 'SERVICE'},
    ],
    Metrics=['UnblendedCost']
)

for result in response['ResultsByTime']:
    date = result['TimePeriod']['Start']
    for group in result['Groups']:
        service = group['Keys'][0]
        cost = float(group['Metrics']['UnblendedCost']['Amount'])
        if cost > 1.0:  # Only show non-trivial costs
            print(f"{date} | {service}: ${cost:.2f}")

Cost Anomaly Detection

# Create AWS Cost Anomaly Detection monitor (returns the monitor's ARN)
monitor = ce.create_anomaly_monitor(
    AnomalyMonitor={
        'MonitorName': 'production-cost-monitor',
        'MonitorType': 'DIMENSIONAL',
        'MonitorDimension': 'SERVICE',
    }
)

# Create alert subscription (notify when anomaly impact exceeds $100)
ce.create_anomaly_subscription(
    AnomalySubscription={
        'MonitorArnList': [monitor['MonitorArn']],
        'SubscriptionName': 'cost-anomaly-alert',
        'Threshold': 100.0,
        'Frequency': 'DAILY',
        'Subscribers': [
            {
                'Address': '[email protected]',
                'Type': 'EMAIL',
            },
            {
                'Address': 'arn:aws:sns:us-east-1:123456789012:cost-alerts',
                'Type': 'SNS',
            }
        ]
    }
)

Right-Sizing: EC2 and RDS

What to Look At (CloudWatch Metrics)

Metric                    Over-provisioned if     Action
CPUUtilization            < 20% consistently      Downsize instance type
FreeableMemory            > 50% of total          Downsize instance class
DatabaseConnections       < 10% of max            Downsize or use RDS Proxy
NetworkIn/Out             < 10% of baseline       May indicate zombie instance
VolumeReadOps/WriteOps    Near zero               Unattached or idle volume

# AWS CLI: find EC2 instances with < 5% average CPU over 14 days
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time $(date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 86400 \
  --statistics Average \
  --query 'Datapoints[*].Average' \
  --output table

# AWS Compute Optimizer recommendation
aws compute-optimizer get-ec2-instance-recommendations \
  --filters name=Finding,values=OVER_PROVISIONED \
  --query 'instanceRecommendations[*].{Instance:instanceArn,Savings:estimatedMonthlySavings.value}'

Reserved Instances vs Savings Plans vs Spot

SPOT INSTANCES (70-90% discount):
  ✅ Best for: batch jobs, CI/CD workers, ML training, stateless apps
  ❌ Can be reclaimed with 2-minute notice
  Pattern: request with --instance-interruption-behavior terminate and add checkpoint logic
  
  Spot best practices:
  ─ Use instance fleet (multiple types/AZs) for availability
  ─ Set price ceiling at On-Demand price (avoid overbidding)
  ─ Use ECS Capacity Providers or EKS Karpenter for automated Spot handling
  ─ Never Spot: databases, ZooKeeper, Kafka brokers, anything stateful

SAVINGS PLANS (1yr: ~30%, 3yr: ~50% discount):
  Compute Savings Plans: Most flexible (any instance type, region, OS, ECS, Lambda)
  EC2 Instance Savings Plans: up to 72% savings, locked to instance family + region
  SageMaker Savings Plans: ML workloads only
  
  ✅ Buy when: Stable compute baseline you're confident will run 1+ year
  Start with Compute Savings Plans for flexibility, then add EC2 if you know family

RESERVED INSTANCES (similar savings to Savings Plans, older model):
  Standard RI: Locked to specific instance type + region
  Convertible RI: Can exchange for different type (less discount)
  
  ✅ Still useful for: RDS, ElastiCache, Redshift (Savings Plans don't cover these)

DECISION:
  Stable 24/7 workload? → Savings Plans / RIs
  Interruptible batch?   → Spot
  Unpredictable/bursty? → On-Demand (optimize architecture first)
  
  Savings calculation (example rates):
  On-Demand: $0.192/hr × 730h/month = $140/month (~$1,682/year)
  1yr Savings Plan (no upfront): $0.131/hr × 730h = $96/month → ~$534/year savings
  1yr Savings Plan (all upfront): ~$1,050/year → ~$632/year savings (~$98/year better than no-upfront)
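
The arithmetic above generalizes into a small helper (a sketch — the rates are illustrative; plug in current ones from the AWS pricing pages):

```python
# Sketch: compare On-Demand vs Savings Plan cost for a steady 24/7 workload.
# Rates are illustrative; look up real pricing for your instance family.

HOURS_PER_MONTH = 730

def monthly_cost(hourly_rate: float, hours: int = HOURS_PER_MONTH) -> float:
    return hourly_rate * hours

def annual_savings(on_demand_rate: float, committed_rate: float) -> float:
    """Annual savings from committing, assuming the workload runs 24/7."""
    return (on_demand_rate - committed_rate) * HOURS_PER_MONTH * 12

on_demand = 0.192      # $/hr On-Demand
sp_no_upfront = 0.131  # $/hr 1yr Compute Savings Plan, no upfront

print(f"On-Demand:    ${monthly_cost(on_demand):.2f}/month")
print(f"Savings Plan: ${monthly_cost(sp_no_upfront):.2f}/month")
print(f"Annual savings: ${annual_savings(on_demand, sp_no_upfront):.0f}")
```

The 24/7 assumption matters: a commitment only pays off at high utilization, which is why the decision tree above routes bursty workloads to On-Demand.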

S3 Storage Classes and Lifecycle Rules

# Boto3: apply lifecycle rule to reduce storage costs
import boto3

s3 = boto3.client('s3')

lifecycle_config = {
    'Rules': [
        {
            'ID': 'logs-auto-archive',
            'Status': 'Enabled',
            'Filter': {'Prefix': 'logs/'},
            'Transitions': [
                # Day 0: Standard ($0.023/GB)
                {'Days': 30, 'StorageClass': 'STANDARD_IA'},    # $0.0125/GB
                {'Days': 90, 'StorageClass': 'GLACIER_IR'},     # $0.004/GB
                {'Days': 365, 'StorageClass': 'DEEP_ARCHIVE'},  # $0.00099/GB
            ],
            'Expiration': {'Days': 2555},  # Delete after 7 years
            'NoncurrentVersionTransitions': [
                {'NoncurrentDays': 7, 'StorageClass': 'GLACIER_IR'},
            ],
            'NoncurrentVersionExpiration': {'NoncurrentDays': 90},
            'AbortIncompleteMultipartUpload': {'DaysAfterInitiation': 7},
        },
        {
            'ID': 'intelligent-tiering-user-content',
            'Status': 'Enabled',
            'Filter': {'Prefix': 'user-uploads/'},
            'Transitions': [
                # Unknown access pattern → let AWS optimize automatically
                {'Days': 0, 'StorageClass': 'INTELLIGENT_TIERING'},
            ],
        },
    ]
}

s3.put_bucket_lifecycle_configuration(
    Bucket='my-app-data',
    LifecycleConfiguration=lifecycle_config
)

S3 Cost Quick Math

Standard:           $0.023/GB/month
Standard-IA:        $0.0125/GB/month + $0.01/GB retrieval
Intelligent-Tiering: $0.023 frequent, $0.0125 infrequent (auto-tiered)
                     + $0.0025/1000 objects monitoring fee
Glacier Instant:    $0.004/GB/month + $0.03/GB retrieval
Glacier Flexible:   $0.0036/GB/month + $0.01/GB retrieval (hours)
Deep Archive:       $0.00099/GB/month + $0.02/GB retrieval (12 hours)

Minimum storage duration:
  Standard-IA:      30 days (charged for 30 even if deleted sooner)
  Glacier Instant:  90 days
  Glacier Flexible: 90 days
  Deep Archive:     180 days
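
Using the rates above, the Standard vs Standard-IA decision reduces to retrieval volume (a sketch; rates as listed, per-request fees and minimum-duration charges ignored):

```python
# Sketch: monthly $ for Standard vs Standard-IA at a given retrieval volume.
# Rates from the table above; request fees and 30-day minimum ignored.

STANDARD = 0.023      # $/GB-month storage
IA_STORAGE = 0.0125   # $/GB-month storage
IA_RETRIEVAL = 0.01   # $/GB retrieved

def standard_cost(gb: float) -> float:
    return gb * STANDARD

def ia_cost(gb: float, retrieved_gb: float) -> float:
    return gb * IA_STORAGE + retrieved_gb * IA_RETRIEVAL

# Break-even: IA wins while retrieved/stored < (0.023 - 0.0125) / 0.01
# ≈ 1.05 GB retrieved per GB stored per month.
for fraction in (0.1, 0.5, 1.0, 1.5):
    gb = 1000
    print(f"retrieve {fraction:>4.0%}/month: "
          f"Standard ${standard_cost(gb):.2f} vs "
          f"IA ${ia_cost(gb, gb * fraction):.2f}")
```

If you can't estimate the retrieval fraction, that's exactly the case for Intelligent-Tiering, as in the lifecycle rule above.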

Data Transfer Cost Patterns

Data transfer is often the biggest hidden cost. Key rules:

FREE:
  ─ Inbound to AWS from internet: FREE
  ─ S3 → CloudFront: FREE
  ─ Within same AZ (same region, same AZ): FREE
  ─ VPC endpoints for S3/DynamoDB: FREE (gateway endpoints)
  
EXPENSIVE:
  ─ EC2/Lambda → Internet egress: $0.09/GB first 10TB
  ─ Cross-AZ within same region: $0.01/GB EACH WAY ($0.02/GB round-trip!)
  ─ Cross-region: $0.02/GB
  ─ NAT Gateway processing: $0.045/GB
  
CROSS-AZ IS THE HIDDEN KILLER:
  Service A (us-east-1a) → Service B (us-east-1b)
  → $0.01/GB × 2 (bidirectional) = $0.02/GB
  
  At 100GB/day: $2/day = $60/month just for cross-AZ traffic
  
  Fix:
  ─ Deploy paired services in same AZ (use affinity rules in k8s)
  ─ Disable cross-zone load balancing where latency allows (NLB: off by default; ALB: on)
  ─ Compress data before sending cross-AZ

-- Detect cross-AZ traffic with a VPC Flow Logs query (Athena)
SELECT
  srcaddr,
  dstaddr,
  sum(bytes) AS total_bytes,
  sum(bytes) / 1073741824.0 AS gb,
  (sum(bytes) / 1073741824.0) * 0.01 AS estimated_cost_usd
FROM vpc_flow_logs
WHERE
  srcaddr LIKE '10.0.1.%'      -- AZ-A subnet
  AND dstaddr LIKE '10.0.2.%'  -- AZ-B subnet
  AND log_status = 'OK'
  AND start >= to_unixtime(current_timestamp - interval '7' day)
GROUP BY srcaddr, dstaddr
ORDER BY total_bytes DESC
LIMIT 20

Idle Resource Detection

#!/usr/bin/env bash
# find-idle-resources.sh — Run monthly to find waste

# Unattached EBS volumes (paying for storage with no instance)
echo "=== Unattached EBS Volumes ==="
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].{ID:VolumeId,Size:Size,Type:VolumeType,Created:CreateTime}' \
  --output table

# Stopped EC2 instances (still paying for EBS, Elastic IP, etc.)
echo "=== Stopped EC2 Instances ==="
aws ec2 describe-instances \
  --filters Name=instance-state-name,Values=stopped \
  --query 'Reservations[*].Instances[*].{ID:InstanceId,Type:InstanceType,Stopped:StateTransitionReason}' \
  --output table

# Unattached Elastic IPs (~$3.60/month each; since Feb 2024 AWS bills
# $0.005/hr for every public IPv4, so release any you don't need)
echo "=== Unattached Elastic IPs ==="
aws ec2 describe-addresses \
  --query 'Addresses[?AssociationId==`null`].{IP:PublicIp,AllocationId:AllocationId}' \
  --output table

# Old snapshots (sorted by age)
echo "=== EBS Snapshots Older Than 90 Days ==="
aws ec2 describe-snapshots --owner-ids self \
  --query "Snapshots[?StartTime<='$(date -d '90 days ago' --iso-8601)'].{ID:SnapshotId,Size:VolumeSize,Date:StartTime,Description:Description}" \
  --output table

# Unused load balancers (no healthy targets)
echo "=== Load Balancers with No Targets ==="
aws elbv2 describe-load-balancers \
  --query 'LoadBalancers[*].LoadBalancerArn' --output text | \
  tr '\t' '\n' | while read arn; do
    targets=$(aws elbv2 describe-target-groups --load-balancer-arn "$arn" \
      --query 'TargetGroups[*].TargetGroupArn' --output text | \
      tr '\t' '\n' | while read tg; do
        aws elbv2 describe-target-health --target-group-arn "$tg" \
          --query 'length(TargetHealthDescriptions)' --output text
      done | awk '{sum+=$1} END{print sum}')
    [ "$targets" = "0" ] && echo "IDLE: $arn"
  done
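
To put a dollar figure on what the script finds, a boto3 sketch pricing the unattached volumes (the $/GB-month rates are illustrative us-east-1 numbers, not live pricing):

```python
# Sketch: estimate monthly cost of unattached ('available') EBS volumes.
# Rates are illustrative us-east-1 $/GB-month; adjust for your region.

EBS_RATES = {"gp2": 0.10, "gp3": 0.08, "io1": 0.125, "st1": 0.045, "standard": 0.05}

def unattached_volume_cost(ec2_client) -> float:
    """Sum estimated monthly storage cost of all unattached volumes."""
    total = 0.0
    paginator = ec2_client.get_paginator("describe_volumes")
    pages = paginator.paginate(Filters=[{"Name": "status", "Values": ["available"]}])
    for page in pages:
        for vol in page["Volumes"]:
            rate = EBS_RATES.get(vol["VolumeType"], 0.10)  # default if type unknown
            total += vol["Size"] * rate
    return total

if __name__ == "__main__":
    import boto3  # imported here so the helper is testable without boto3
    ec2 = boto3.client("ec2")
    print(f"Unattached EBS waste: ~${unattached_volume_cost(ec2):.2f}/month")
```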

Kubernetes: Resource Requests/Limits Impact

# The cost of Kubernetes is determined by what you REQUEST, not what you USE
# Nodes must have enough allocatable capacity to satisfy ALL pod requests

# Over-requesting (wasteful):
resources:
  requests:
    cpu: "2"        # Reserves 2 vCPU on the node always
    memory: "4Gi"   # Reserves 4GB on the node always
  limits:
    cpu: "2"
    memory: "4Gi"

# Under-requesting (unstable — OOMKilled, CPU throttled):
resources:
  requests:
    cpu: "10m"    # 1% CPU — too low, pod gets throttled
    memory: "32Mi" # Way too low — will OOM

# Right-sized (VPA recommendations + 20% buffer):
resources:
  requests:
    cpu: "250m"       # What the app typically uses
    memory: "512Mi"   # P99 of actual memory usage
  limits:
    cpu: "1000m"      # Allow burst, prevent monopolizing
    memory: "512Mi"   # Keep limit = request to avoid OOM surprises

# Use Vertical Pod Autoscaler in recommendation mode
kubectl apply -f - <<EOF
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: order-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-api
  updatePolicy:
    updateMode: "Off"  # Recommend only, don't auto-apply
EOF

# After 24h, check recommendations
kubectl describe vpa order-api-vpa | grep -A 20 "Recommendation"
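
The "you pay for requests" point can be made concrete: the scheduler bin-packs requests onto nodes, so node count (and cost) tracks requested resources, not used ones. A rough sketch with a hypothetical node shape and hourly rate (real packing also loses capacity to DaemonSets and fragmentation):

```python
# Sketch: node count is driven by packing pod *requests*, taking the max of
# CPU-bound and memory-bound packing. Node shape and price are hypothetical.
import math

NODE_CPU_M = 4000    # 4 vCPU node, in millicores
NODE_MEM_MI = 16384  # 16 GiB node, in Mi
NODE_HOURLY = 0.154  # hypothetical $/hr per node

def nodes_needed(pods: list[tuple[int, int]]) -> int:
    """pods = [(cpu_request_millicores, memory_request_Mi), ...]"""
    cpu = sum(p[0] for p in pods)
    mem = sum(p[1] for p in pods)
    return max(math.ceil(cpu / NODE_CPU_M), math.ceil(mem / NODE_MEM_MI), 1)

over = [(2000, 4096)] * 20   # over-requested: 2 vCPU / 4Gi each
right = [(250, 512)] * 20    # right-sized:    250m / 512Mi each

for label, pods in (("over-requested", over), ("right-sized", right)):
    n = nodes_needed(pods)
    print(f"{label}: {n} nodes ≈ ${n * NODE_HOURLY * 730:.0f}/month")
```

Same 20 pods, 5× the node bill — which is why feeding VPA recommendations back into requests is usually the biggest single Kubernetes cost lever.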

Tagging Strategy for Cost Allocation

# Mandatory tags (enforce via AWS SCPs / GCP Org Policy):
tags:
  Environment:  "production"      # production, staging, development
  Team:         "platform"        # Team responsible (maps to cost center)
  Service:      "order-api"       # Individual service/application
  CostCenter:   "engineering-01"  # Finance's cost center code
  ManagedBy:    "terraform"       # Who manages this resource

# AWS SCP: deny resource creation without required tags
{
  "Effect": "Deny",
  "Action": ["ec2:RunInstances", "rds:CreateDBInstance"],
  "Resource": "*",
  "Condition": {
    "Null": {
      "aws:RequestTag/Environment": "true",
      "aws:RequestTag/Team": "true",
      "aws:RequestTag/Service": "true"
    }
  }
}
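
The SCP blocks new untagged resources, but existing ones drift; a small validator for the mandatory tag set (keys from the list above) can run in CI or a scheduled audit job:

```python
# Sketch: check a resource's tags against the mandatory set defined above.
REQUIRED_TAGS = {"Environment", "Team", "Service", "CostCenter", "ManagedBy"}

def missing_tags(tags: dict[str, str]) -> set[str]:
    """Return mandatory tag keys that are absent or empty on this resource."""
    present = {k for k, v in tags.items() if v}
    return REQUIRED_TAGS - present

resource_tags = {"Environment": "production", "Team": "platform"}
gaps = missing_tags(resource_tags)
if gaps:
    print(f"Missing mandatory tags: {sorted(gaps)}")
```

In practice you'd feed this from the Resource Groups Tagging API and file the gaps back to the owning team, since untagged spend can't be attributed.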

Anti-Patterns

Buying RIs/Savings Plans before right-sizing — committing to waste
No tagging → no cost allocation — you can't hold teams accountable without attribution
Over-provisioning "just in case" — use auto-scaling; don't manually buffer
Dev/staging same size as production — right-size lower environments aggressively
Ignoring cross-AZ traffic — often adds 10-20% to EC2 bills
Not setting S3 lifecycle rules — old data sits in Standard class forever
Lambda memory set to 3GB because "more CPU" — cost scales linearly with memory; only worth it if duration drops proportionally, so profile before bumping
Manual cost reviews — automate anomaly detection, don't rely on monthly reviews
Reserved Instances for non-stable workloads — lock-in without savings

Quick Reference

AWS cost savings ladder (easiest → most effort):
  1. Delete unattached EBS volumes (Immediate, free money)
  2. Delete unused EIPs, snapshots, LBs (Immediate)
  3. Set S3 lifecycle rules (Low effort, high savings at scale)
  4. Right-size over-provisioned EC2/RDS (Medium — need data)
  5. Buy Compute Savings Plans for stable workloads (Low effort once you decide)
  6. Move batch workloads to Spot (Medium — needs architecture change)
  7. Move to ARM/Graviton2 (~20% cheaper, often drop-in) (Medium)
  8. Redesign for serverless/Fargate where appropriate (High effort, high savings)

GCP cost savings ladder:
  1. Delete idle VMs and unattached disks
  2. Set Cloud Storage lifecycle rules
  3. Enable CUD (Committed Use Discounts) for stable workloads
  4. Move to Spot VMs for batch/CI
  5. Use Cloud Run (scale-to-zero) instead of always-on VMs
  6. BigQuery: partition + cluster tables (query cost reduction)

Lambda cost formula:
  Cost = requests × $0.0000002 + GB-seconds × $0.0000166667
  Example: 1M requests, 512MB, avg 500ms:
  = 1,000,000 × $0.0000002 + (1,000,000 × 0.5 × 0.5) × $0.0000166667
  = $0.20 + $4.17 = $4.37/million invocations
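
The formula above as a helper for comparing memory settings (x86 us-east-1 rates as listed; remember that more memory also means more CPU, so duration may drop):

```python
# Lambda cost per the formula above (x86, us-east-1 rates as listed).
REQUEST_RATE = 0.0000002       # $ per request
GB_SECOND_RATE = 0.0000166667  # $ per GB-second

def lambda_cost(requests: int, memory_mb: int, avg_duration_s: float) -> float:
    gb_seconds = requests * (memory_mb / 1024) * avg_duration_s
    return requests * REQUEST_RATE + gb_seconds * GB_SECOND_RATE

# The worked example: 1M requests, 512MB, 500ms average
print(f"${lambda_cost(1_000_000, 512, 0.5):.2f}")  # → $4.37
```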
