The Wake-Up Call
Our AWS bill was growing 20% month-over-month. Something had to change. Here's exactly what we did to cut costs by 40% while maintaining, and even improving, performance.
Strategy 1: Right-Sizing Instances
Most instances are over-provisioned. We used CloudWatch metrics to identify the under-utilized ones:
# Check average CPU utilization
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-31T23:59:59Z \
  --period 3600 \
  --statistics Average

Finding: 60% of our instances averaged < 20% CPU utilization.
Action: Downsized from m5.xlarge → m5.large where appropriate.
Savings: ~$15,000/year
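As a rough sketch of how that CloudWatch output could be turned into a downsizing shortlist, the Python below averages the hourly `Average` datapoints per instance and flags anything under a threshold. The function name `find_underutilized`, the instance IDs, and the sample numbers are illustrative assumptions, not our actual tooling; the input is shaped like the `Datapoints` list that `get-metric-statistics` returns.

```python
from statistics import mean


def find_underutilized(datapoints_by_instance, threshold=20.0):
    """Return {instance_id: mean CPU %} for instances averaging below threshold.

    datapoints_by_instance maps an instance ID to a list of CloudWatch-style
    datapoints, e.g. {"Average": 12.5, "Unit": "Percent"}.
    Instances with no datapoints are skipped rather than flagged.
    """
    flagged = {}
    for instance_id, datapoints in datapoints_by_instance.items():
        if not datapoints:
            continue
        avg_cpu = mean(dp["Average"] for dp in datapoints)
        if avg_cpu < threshold:
            flagged[instance_id] = round(avg_cpu, 1)
    return flagged


# Hypothetical sample data, not real measurements
sample = {
    "i-aaaa": [{"Average": 10.0}, {"Average": 14.0}],
    "i-bbbb": [{"Average": 55.0}, {"Average": 65.0}],
    "i-cccc": [],
}
print(find_underutilized(sample))
```

Average CPU alone can hide short spikes, so a shortlist like this is a starting point for review rather than an automatic resize list.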
Strategy 2: Reserved Instances for Baseline
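For workloads that run 24/7, Reserved Instances are a no-brainer: you commit to a one- or three-year term and pay a lower hourly rate than on-demand in return.
Tip: Start with Convertible RIs for flexibility. They can be exchanged for a different instance family or size if your needs change, at a somewhat smaller discount than Standard RIs.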
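To make the "runs 24/7" condition concrete, here is a minimal break-even sketch. A Reserved Instance is billed for every hour of the term whether or not the instance runs, so it only pays off above a certain utilization. The $0.096 on-demand rate is the m5.large figure used elsewhere in this post; the $0.060 effective RI rate is a made-up placeholder, so substitute the real rate for your region and term.

```python
HOURS_PER_YEAR = 8760


def break_even_utilization(on_demand_hourly, ri_effective_hourly):
    """Fraction of the year an instance must run before the RI is cheaper."""
    if on_demand_hourly <= 0:
        raise ValueError("on_demand_hourly must be positive")
    return ri_effective_hourly / on_demand_hourly


def annual_savings(on_demand_hourly, ri_effective_hourly, utilization=1.0):
    """Yearly saving from an RI for an instance running `utilization` of the time.

    A negative result means the RI costs more than staying on-demand.
    """
    on_demand_cost = on_demand_hourly * HOURS_PER_YEAR * utilization
    ri_cost = ri_effective_hourly * HOURS_PER_YEAR  # billed for every hour
    return on_demand_cost - ri_cost


ON_DEMAND = 0.096     # m5.large on-demand rate quoted in this post
RI_EFFECTIVE = 0.060  # hypothetical effective RI rate, for illustration only

print(f"Break-even utilization: {break_even_utilization(ON_DEMAND, RI_EFFECTIVE):.1%}")
print(f"Savings at 100% uptime: ${annual_savings(ON_DEMAND, RI_EFFECTIVE):.2f}/year")
print(f"Savings at 50% uptime: ${annual_savings(ON_DEMAND, RI_EFFECTIVE, 0.5):.2f}/year")
```

With these placeholder rates, an instance that runs only half the time loses money on an RI, which is why we reserve only the always-on baseline.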
Strategy 3: Spot Instances for Batch Jobs
For non-critical workloads (CI/CD runners, batch processing):
# Example: GitLab Runner on Spot
Resources:
  SpotFleet:
    Type: AWS::EC2::SpotFleet
    Properties:
      SpotFleetRequestConfigData:
        IamFleetRole: !GetAtt SpotFleetRole.Arn
        TargetCapacity: 5
        AllocationStrategy: lowestPrice
        LaunchSpecifications:
          - InstanceType: m5.large
            SpotPrice: "0.04"  # vs $0.096 on-demand

Savings: 60-70% compared to on-demand.
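The arithmetic behind that comparison can be sketched as follows. This assumes roughly 730 hours in a month and uses the prices and fleet size from the template; the function names are just for illustration. Note that `SpotPrice` is a maximum: Spot capacity is billed at the current Spot price, so the real bill can come in below this ceiling.

```python
HOURS_PER_MONTH = 730  # approximate average, an assumption for this sketch


def savings_pct(spot_hourly, on_demand_hourly):
    """Percentage saved by paying spot_hourly instead of on_demand_hourly."""
    if on_demand_hourly <= 0:
        raise ValueError("on_demand_hourly must be positive")
    return (1 - spot_hourly / on_demand_hourly) * 100


def fleet_monthly_cost(hourly_price, instance_count, hours=HOURS_PER_MONTH):
    """Cost of running instance_count instances for the given number of hours."""
    return hourly_price * instance_count * hours


# Prices and capacity from the template: m5.large, 5 instances
on_demand = fleet_monthly_cost(0.096, 5)
spot_ceiling = fleet_monthly_cost(0.04, 5)
print(f"On-demand: ${on_demand:.2f}/month")
print(f"Spot at max price: ${spot_ceiling:.2f}/month")
print(f"Savings at max price: {savings_pct(0.04, 0.096):.1f}%")
```

At the $0.04 ceiling the saving works out to about 58%, so the 60-70% figure reflects actual Spot prices sitting below our maximum.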
Strategy 4: S3 Lifecycle Policies
Data accumulates. Set up automatic tiering:
{
  "Rules": [
    {
      "ID": "MoveToIA",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        }
      ],
      "Expiration": {
        "Days": 365
      }
    }
  ]
}

Savings: 70% on storage costs for old data.
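To sanity-check a rule like this before applying it, a small helper can walk the transitions and report where an object of a given age would end up. This is only a sketch: `storage_class_for_age` is a hypothetical helper, it assumes objects start in STANDARD, and it ignores the fact that S3 applies lifecycle actions asynchronously, so real transitions may lag by a day or so.

```python
def storage_class_for_age(rule, age_days, initial_class="STANDARD"):
    """Return the storage class for an object of age_days, or None if expired."""
    expiration_days = rule.get("Expiration", {}).get("Days")
    if expiration_days is not None and age_days >= expiration_days:
        return None  # object would have been deleted

    current = initial_class
    # Apply transitions in order of age so the latest matching one wins
    for transition in sorted(rule.get("Transitions", []), key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


# Same thresholds as the policy in this section
rule = {
    "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 90, "StorageClass": "GLACIER"},
    ],
    "Expiration": {"Days": 365},
}

for age in (10, 45, 200, 400):
    print(age, storage_class_for_age(rule, age))
```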
Strategy 5: NAT Gateway Optimization
NAT Gateways are expensive ($0.045/hour plus $0.045 per GB processed), and the per-GB charge applies even to traffic bound for AWS services such as S3.
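A quick back-of-the-envelope calculation shows how those two rates add up. The sketch below assumes about 730 hours per month and a hypothetical 10,000 GB of monthly traffic; the function name and traffic volume are placeholders, not our measured numbers.

```python
HOURS_PER_MONTH = 730  # approximate average, an assumption for this sketch


def nat_gateway_monthly_cost(gb_processed, gateways=1,
                             hourly_rate=0.045, per_gb_rate=0.045):
    """Return (hourly_charge, data_charge, total) in dollars for one month."""
    hourly_charge = gateways * hourly_rate * HOURS_PER_MONTH
    data_charge = gb_processed * per_gb_rate
    return hourly_charge, data_charge, hourly_charge + data_charge


# Hypothetical workload: one gateway, 10,000 GB processed per month
hourly, data, total = nat_gateway_monthly_cost(10_000)
print(f"Hourly charge: ${hourly:.2f}")
print(f"Data charge: ${data:.2f}")
print(f"Total: ${total:.2f}")
```

At that volume the data processing charge is more than ten times the hourly charge, so reducing the bytes that flow through the gateway matters far more than the gateway count.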
Solution: Use VPC endpoints for AWS services:
# S3 Gateway Endpoint (free!)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-1234567890abcdef0 \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-1234567890abcdef0

The Results
Key Takeaways
- Measure first - You can't optimize what you don't measure
- Right-size everything - Most resources are over-provisioned
- Use RIs for baseline - Predictable workloads = predictable savings
- Spot for burst - Accept interruption for massive savings
- Automate cleanup - Old data and unused resources add up