SkylineAI is an AI-powered media analytics startup that launched in 2023 and quickly grew a user base processing video and image data. Their architecture reflected this growth: an Application Load Balancer and Amazon CloudFront fed web traffic to containerized microservices on Amazon ECS/Fargate, while heavy compute (video transcoding and ML training) ran on EC2 instances.
By late 2024 the team faced a dilemma: rapid growth meant exponentially rising AWS bills. A preliminary audit found many idle or oversized resources – classic FinOps red flags. For example, other companies at scale have seen similar trends: DoorDash grew S3 data ~2.6× while using lifecycle policies to cut per-GB costs 30%, saving millions.
SkylineAI needed to apply these lessons: aggressively optimize costs without degrading performance or scalability.
Typical AWS microservices architecture: SkylineAI's initial design included a load balancer, ECS/Fargate services, and S3 storage. It scaled well but lacked cost controls.
Identifying the Cost Challenge
By January 2025 the SkylineAI AWS bill had tripled year-over-year. Key pain points emerged:
- Compute (EC2): Dozens of On-Demand instances ran video processing jobs even during idle periods.
- Containers (ECS/Fargate): Many services were configured with conservative (overprovisioned) vCPU/memory, and all tasks ran On-Demand.
- Storage (S3): Every image, thumbnail, and log stayed in S3 Standard storage indefinitely, despite 90% of old data being rarely accessed.
- Serverless (Lambda): API-facing functions had cold-start latency spikes; the team suspected they might need warmers or provisioned concurrency.
In short, usage scaled but purchase options did not. AWS experts note that Spot Instances and Savings Plans can cut costs dramatically – up to 90% off for fault-tolerant compute and 72% off for baseline usage.
📊 Real-World Examples:
A startup (Gett) saved ~$800K/year by moving to EC2 Spot, and another (AdRoll) saw fixed costs fall 75% and overall ops costs 83% by mixing On-Demand, Reserved, and Spot instances.
Cost Optimization Strategy
SkylineAI tackled each layer with specific techniques, always measuring performance impact:
- EC2/Compute: Use Spot and Savings Plans for batch workloads.
- Containers (ECS/Fargate): Add Fargate Spot, right-size tasks, and use Savings Plans.
- Storage (S3): Apply lifecycle rules and intelligent tiering for older data.
- Serverless (Lambda): Reduce cold starts via lightweight runtimes, smaller packages, SnapStart/provisioned concurrency where needed.
Each change was rolled out iteratively. Detailed effects are described below.
Optimizing EC2 with Spot and Savings Plans
SkylineAI's heavy batch jobs (video transcoding, ML training) were interruptible and non-time-sensitive. These were moved to EC2 Spot Instances. AWS data shows Spot can be up to 90% cheaper than On-Demand.
In practice, we switched ~70% of non-critical workloads to Spot (mixing instance types for availability) and used Compute Savings Plans for the rest. This parallels AWS guidance: cover steady baseline usage with Reserved Instances or Savings Plans, keep On-Demand for unpredictable demand, and push flexible, fault-tolerant bursts to Spot.
💰 The Impact:
EC2 compute costs dropped by ~60%. One team member noted, "We simply wouldn't have paid for those machines if we hadn't realized how safe Spot could be here."
We also scaled the cluster with AWS Auto Scaling, ensuring idle EC2 instances were terminated between jobs.
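As a hedged sketch of what this looks like in practice, the ~70/30 Spot/On-Demand split can be expressed as an Auto Scaling group mixed-instances policy. All names, instance types, and numbers below are illustrative assumptions, not SkylineAI's actual configuration; with boto3, this dict would be passed to `autoscaling.create_auto_scaling_group`:

```python
# Illustrative mixed-instances policy for an interruptible batch fleet.
# Names and numbers are example values, not SkylineAI's real config.
mixed_instances_policy = {
    "LaunchTemplate": {
        "LaunchTemplateSpecification": {
            "LaunchTemplateName": "video-transcode",  # hypothetical template
            "Version": "$Latest",
        },
        # Diversify instance types so a Spot interruption in one capacity
        # pool doesn't stall the whole fleet.
        "Overrides": [
            {"InstanceType": t}
            for t in ("c5.2xlarge", "c5a.2xlarge", "c6i.2xlarge", "m5.2xlarge")
        ],
    },
    "InstancesDistribution": {
        # ~30% On-Demand above the baseline -> ~70% of capacity on Spot
        "OnDemandBaseCapacity": 0,
        "OnDemandPercentageAboveBaseCapacity": 30,
        # Prefer Spot pools with the deepest spare capacity
        "SpotAllocationStrategy": "capacity-optimized",
    },
}

spot_share = 100 - mixed_instances_policy["InstancesDistribution"][
    "OnDemandPercentageAboveBaseCapacity"
]
print(f"Target Spot share: {spot_share}%")
```

An Auto Scaling group built from a policy like this also covers the scale-in behavior: when the job queue drains, the group terminates idle instances instead of leaving them running.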
Right-Sizing Containers and Using Fargate Spot
For microservices running on ECS/Fargate, we improved utilization in two ways.
1. Right-Sizing Task Definitions
Many services had been provisioned with generous CPU/memory "just in case," which AWS warns leads to 30–70% waste. Using AWS Compute Optimizer and custom load tests, we resized tasks to match actual peaks.
Example:
A frontend service went from 2 vCPU / 4 GB RAM to 1 vCPU / 2 GB RAM without affecting performance.
In aggregate, right-sizing is estimated to cut container costs ~50%.
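To make the right-sizing step concrete, here is a small sketch of our own (not an AWS tool) that picks the smallest valid Fargate CPU/memory combination covering an observed peak plus headroom. The size table is simplified to whole-GiB memory granularity:

```python
import math

# Simplified Fargate task sizes: CPU units -> (min MiB, max MiB).
FARGATE_SIZES = {
    256: (512, 2048),
    512: (1024, 4096),
    1024: (2048, 8192),
    2048: (4096, 16384),
    4096: (8192, 30720),
}

def right_size(peak_cpu_units, peak_mem_mib, headroom=1.2):
    """Return the smallest valid (cpu, memory) covering peak usage + headroom."""
    need_cpu = peak_cpu_units * headroom
    need_mem = peak_mem_mib * headroom
    for cpu in sorted(FARGATE_SIZES):
        lo, hi = FARGATE_SIZES[cpu]
        if cpu < need_cpu:
            continue
        # Round memory up to the next GiB, clamped to the valid range.
        mem = max(lo, math.ceil(need_mem / 1024) * 1024)
        if mem <= hi:
            return cpu, mem
    raise ValueError("peak exceeds largest size in table")

# The frontend service from the text: if it peaked around 0.7 vCPU and
# 1.4 GiB (illustrative numbers), 1 vCPU / 2 GiB suffices instead of
# the provisioned 2 vCPU / 4 GiB.
print(right_size(700, 1400))  # -> (1024, 2048)
```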
2. Deploying Fargate Spot
AWS launched Fargate Spot in 2019; Fargate Spot tasks run on spare AWS capacity at discounts of roughly 50–70% off standard Fargate pricing. We set our ECS capacity provider strategy to mix ~75% FARGATE_SPOT with 25% regular FARGATE tasks.
💰 The Impact:
This change alone slashed our Fargate spend by roughly 60–70%. We also purchased a Fargate Compute Savings Plan (1-year) to lock in further discounts (up to 50% off).
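The 75/25 mix maps directly onto an ECS capacity provider strategy. A minimal sketch of the structure that would be passed to `ecs.create_service` in boto3 (the `base` value is an illustrative assumption):

```python
# Weights of 3:1 route ~75% of tasks to FARGATE_SPOT and ~25% to FARGATE.
# "base": 1 keeps at least one task on regular Fargate so the service
# stays up even if Spot capacity is reclaimed entirely.
capacity_provider_strategy = [
    {"capacityProvider": "FARGATE_SPOT", "weight": 3},
    {"capacityProvider": "FARGATE", "weight": 1, "base": 1},
]

total_weight = sum(p["weight"] for p in capacity_provider_strategy)
spot_pct = 100 * capacity_provider_strategy[0]["weight"] // total_weight
print(f"~{spot_pct}% of non-base tasks on Fargate Spot")
```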
Efficient Storage: S3 Lifecycle and Tiering
SkylineAI's S3 growth was the steepest expense curve. We had dozens of TBs in Standard storage. To tame this, we enabled lifecycle policies and intelligent tiering.
Storage Strategy:
- Objects older than 30 days transitioned to S3 Standard-Infrequent Access (Standard-IA)
- Objects older than 90 days moved to S3 Glacier Instant Retrieval
- S3 Storage Class Analysis identified which data was truly cold before rules were applied
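In boto3 terms, the two transitions above correspond to a lifecycle configuration of roughly this shape (bucket prefix and rule ID are placeholders), which would be passed to `s3.put_bucket_lifecycle_configuration`:

```python
# Illustrative lifecycle rule: Standard -> Standard-IA at 30 days,
# then -> Glacier Instant Retrieval at 90 days. Prefix is a placeholder.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "tier-old-media",
            "Status": "Enabled",
            "Filter": {"Prefix": "media/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER_IR"},
            ],
        }
    ]
}

for t in lifecycle_configuration["Rules"][0]["Transitions"]:
    print(t["Days"], t["StorageClass"])
```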
This reflects best practices: move "write-once, read-rarely" data to archives. For example, Canva migrated inactive user designs to Glacier Instant Retrieval and saved $3M+ per year.
💰 The Impact:
By right-sizing our storage classes and compressing images before upload, SkylineAI reduced its S3 bill by roughly half.
Lambda Cold-Start Mitigation and Optimization
For latency-sensitive APIs, we examined Lambda cold-starts. AWS reports cold starts impact <1% of invocations in typical workloads, so we first verified cold starts weren't a major bottleneck.
Mitigation Strategies:
- Using lightweight Node.js/Python runtimes, which initialize much faster than heavier runtimes such as Java
- Minimizing deployment package size
- Experimenting with AWS SnapStart for Java/Python functions
- Using Provisioned Concurrency for mission-critical Lambdas
AWS Lambda SnapStart:
When you publish a new function version, AWS initializes it and takes a microVM snapshot. On each invocation, Lambda restores the snapshot instead of re-running init, dramatically reducing cold-start latency. We used SnapStart for our heaviest Java-based functions to ensure sub-100ms cold starts with minimal additional cost.
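Enabling SnapStart is a one-line configuration change per function. A hedged sketch (the function name is a placeholder, not one of ours):

```python
# Illustrative SnapStart settings. With boto3 this would be applied via:
#   lambda_client.update_function_configuration(
#       FunctionName="media-metadata-api",
#       SnapStart={"ApplyOn": "PublishedVersions"},
#   )
# followed by publish_version(), since SnapStart applies to published
# versions rather than $LATEST.
snapstart_config = {
    "FunctionName": "media-metadata-api",  # placeholder name
    "SnapStart": {"ApplyOn": "PublishedVersions"},
}
print(snapstart_config["SnapStart"]["ApplyOn"])
```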
In practice, the cold-start cost turned out to be only a small fraction of our Lambda spend: at $0.0021–$0.0025 per 512 MB cold start and a ~0.1% cold-start rate, that works out to just a few dollars per million invocations.
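The back-of-the-envelope arithmetic behind that estimate, with the rate assumptions made explicit:

```python
# Estimated cold-start overhead using the figures quoted above.
invocations = 1_000_000        # invocations per month (illustrative)
cold_start_rate = 0.001        # ~0.1% of invocations hit a cold start
cost_per_cold_start = 0.0023   # midpoint of $0.0021-$0.0025 (512 MB)

monthly_cold_cost = invocations * cold_start_rate * cost_per_cold_start
print(f"~${monthly_cold_cost:.2f} per million invocations")
```

Even at a pessimistic 1% cold-start rate the same math lands in the low tens of dollars per million invocations, which is why we stopped short of blanket provisioned concurrency.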
Results: Before vs. After
By mid-2025, after implementing the above changes, SkylineAI's AWS expenses had dropped dramatically without impacting scale or feature velocity.
| Service | Before (monthly) | After (monthly) | Savings |
|---|---|---|---|
| EC2 | $8,000 | $3,200 | 60% |
| ECS/Fargate | $5,000 | $1,750 | 65% |
| S3 Storage | $3,500 | $1,750 | 50% |
| Lambda | $500 | $450 | 10% |
| Total | $17,000 | $7,150 | ~58% |
🎯 Key Result:
System performance remained robust. Auto Scaling groups handled workload spikes; ECS services maintained the same throughput (now at lower vCPU footprint); S3 response times were unchanged for active data. Users saw no downtime or latency regressions.
Key Takeaways for Startups
Startups on AWS can achieve large cost savings without sacrificing scalability. SkylineAI's experience highlights these lessons:
1. Audit and Monitor Continuously
Use tools like AWS Cost Explorer and S3 Storage Lens to spot inefficiencies. Tag resources and review underutilized assets regularly.
2. Mix Pricing Models Intelligently
Pay for baseline steady workloads with Reserved/Savings Plans (up to ~72% off) and scale with On-Demand. Offload flexible jobs to Spot Instances or Fargate Spot (up to 90% off).
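A quick illustration of why the mix pays off. The workload shares and discount rates below are assumptions for the sake of arithmetic, not prescriptive targets:

```python
# Blended cost relative to an all-On-Demand baseline (normalized to 1.0).
on_demand_rate = 1.00     # normalized On-Demand price
savings_plan_rate = 0.28  # ~72% off for steady baseline usage
spot_rate = 0.10          # ~90% off for flexible, interruptible jobs

baseline_share = 0.40  # steady workload covered by a Savings Plan
flexible_share = 0.50  # interruptible work offloaded to Spot
burst_share = 0.10     # unpredictable bursts stay On-Demand

blended = (baseline_share * savings_plan_rate
           + flexible_share * spot_rate
           + burst_share * on_demand_rate)
print(f"Blended cost: {blended:.0%} of On-Demand")
```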
3. Right-Size Everything
Don't just deploy with default or maximum configs. Use AWS Compute Optimizer or load tests to resize EC2/ECS resources. In practice, right-sizing can cut cloud spend ~30–70%.
4. Optimize Storage Tiers
Implement S3 lifecycle rules to move old data to cheaper classes (Standard-IA, Glacier Instant Retrieval, or Deep Archive). Automated tiering can halve storage bills as usage grows.
5. Weigh Lambda Cold-Start Trade-offs
First, measure if cold starts really impact SLAs – AWS notes they affect <1% of calls for most apps. If mitigation is needed, consider AWS SnapStart or Provisioned Concurrency for the hottest functions.
6. Commit to a FinOps Mindset
Regularly review the bill, experiment with new AWS features (e.g. Graviton instances or new storage classes), and build "cost-awareness" into engineering culture. Even incremental changes add up.
Final Thoughts
By following these principles, other startups can similarly scale their AWS infrastructure efficiently, keeping costs low while handling growth. The key is combining the right AWS features (Spot, Savings Plans, lifecycle rules, etc.) with diligent measurement and tuning.
As the skylines in our revenue charts climb, our cloud spending charts now happily slope downward – a win-win for innovation and the bottom line.
📚 Referenced Case Studies:
- DoorDash: Saved millions using S3 Storage Lens (30% cost reduction, 2.6× data growth)
- Gett: Saved $800K/year with EC2 Spot
- AdRoll: Reduced fixed costs by 75% and ops costs by 83% with mixed pricing
- Arm: Cut costs 40% using EC2 Spot (65% workload on Spot)
- Canva: Saved $3M+ annually with Glacier Instant Retrieval