AWS ECS Placement Strategy
When I took a look at our AWS ECS instances, I noticed that we weren’t using them in the most efficient, resilient, and cost-effective way.
In one cluster, some EC2 instances had 0 or 1 tasks running on them, while others had 7 or 8.
In another, we had unused instances running 0 tasks and some running only 1 task, while others were running 6.
I also noticed that some services had all of their tasks running in only 1 or 2 availability zones (AZs) when 3 were available. We thought we had redundancy and resiliency.
Why does this matter?
This meant that if 1 instance went down for a planned or unplanned event we would lose 6 containers at once. Worse, some services had 3 tasks running on 1 instance in 1 availability zone.
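To make that risk concrete, here is a minimal sketch in Python that counts how many tasks would be lost in the worst case if a single instance or a single AZ failed. The task names, instance IDs, and AZ names are made up for illustration; in practice you would build the list from your own cluster's task listing.

```python
from collections import Counter

# Hypothetical placement: (task name, instance id, availability zone).
placement = [
    ("web-1", "i-aaa", "eu-west-1a"),
    ("web-2", "i-aaa", "eu-west-1a"),
    ("web-3", "i-aaa", "eu-west-1a"),
    ("api-1", "i-aaa", "eu-west-1a"),
    ("api-2", "i-bbb", "eu-west-1b"),
    ("api-3", "i-ccc", "eu-west-1c"),
]

def worst_case_loss(rows, key_index):
    """Return (key, task count) for the instance or AZ holding the most tasks."""
    counts = Counter(row[key_index] for row in rows)
    return counts.most_common(1)[0]

# Index 1 groups by instance, index 2 groups by availability zone.
worst_instance = worst_case_loss(placement, 1)
worst_az = worst_case_loss(placement, 2)

print("Worst single-instance failure:", worst_instance)
print("Worst single-AZ failure:", worst_az)
```

In this made-up layout, one instance carries four of the six tasks, including every task of the web service, which is the situation described above.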
According to Amazon, when a task that uses the EC2 launch type is launched, Amazon ECS must determine where to place the task based on the requirements specified in the task definition, such as CPU and memory. Similarly, when you scale down the task count, Amazon ECS must determine which tasks to terminate. You can apply task placement strategies and constraints to customize how Amazon ECS places and terminates tasks.
A task placement strategy is an algorithm for selecting instances for task placement or tasks for termination. For example, Amazon ECS can select instances at random, or it can select instances such that tasks are distributed evenly across a group of instances.
What happened?
An availability zone became unavailable. Because our tasks were unevenly placed and many of them sat on instances in that single AZ, we lost a disproportionate number of tasks and had business impact.
How to fix this?
Amazon ECS task placement strategy
Currently, container instances are evenly spread across 3 Availability Zones and our placement strategy is spread(instanceId). This means that a single ECS service can have all 3 of its tasks running on different instances but in the same availability zone. This is because the placement strategy applies at the service level and not at the cluster level.
One approach that could improve the spread would be to add a second placement strategy that also spreads the tasks across availability zones.
The strategy would therefore be:
spread(attribute:ecs.availability-zone), spread(instanceId)
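To see what the ordered strategy changes compared with spread(instanceId) alone, here is a toy simulation in Python. It is only a sketch of the idea, not how ECS is implemented: it ignores CPU and memory requirements, looks at one service at a time, and assumes ties always go to the first candidate in the list (real tie-breaking is up to ECS). The instance IDs and AZ names are hypothetical.

```python
from collections import Counter

# Hypothetical cluster: instance id -> availability zone, 3 instances per AZ.
INSTANCES = {
    "i-a1": "az-a", "i-a2": "az-a", "i-a3": "az-a",
    "i-b1": "az-b", "i-b2": "az-b", "i-b3": "az-b",
    "i-c1": "az-c", "i-c2": "az-c", "i-c3": "az-c",
}

def group_of(instance_id, field):
    """Map an instance to the group a spread rule balances over."""
    return INSTANCES[instance_id] if field == "az" else instance_id

def place(task_count, strategy):
    """Place one service's tasks; strategy is an ordered list of 'az' / 'instance'."""
    placed = []  # instance ids already chosen for this service
    for _ in range(task_count):
        candidates = list(INSTANCES)
        # Apply each spread rule in order, keeping only the least-loaded groups.
        for field in strategy:
            counts = Counter(group_of(p, field) for p in placed)
            fewest = min(counts[group_of(c, field)] for c in candidates)
            candidates = [c for c in candidates
                          if counts[group_of(c, field)] == fewest]
        # Assumption: ties go to the first remaining candidate (a worst case).
        placed.append(candidates[0])
    return placed

instance_only = place(3, ["instance"])
az_then_instance = place(3, ["az", "instance"])

print(instance_only, {INSTANCES[i] for i in instance_only})
print(az_then_instance, {INSTANCES[i] for i in az_then_instance})
```

Under these assumptions, spreading by instance alone can put all three tasks on different instances in the same AZ, while spreading by AZ first puts one task in each AZ.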
This still does not guarantee a better spread across the cluster, because ECS only considers the placement strategy at a per-service level, but it seemed worth testing.
And testing we did.
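Things started to look a lot better. This setting is per service. ECS applies the strategy only when it places a task, so tasks that are already running are not moved; for us the change took effect once each instance was stopped, started, or rebooted and its tasks were placed again. After the instances rebooted we saw much better results.
We started off with an unbalanced cluster, and after rebooting each EC2 instance the cluster distributed all of the tasks evenly. Below is the Terraform configuration (host has the same effect as instanceId for the spread strategy; newer versions of the Terraform AWS provider name this block ordered_placement_strategy):
"placement_strategy": [
  {"type": "spread", "field": "attribute:ecs.availability-zone"},
  {"type": "spread", "field": "host"}
]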
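If you manage services through the ECS API rather than Terraform, the same ordered strategy can be written as a list of type/field rules. The following Python sketch builds and sanity-checks that list. The check_strategy helper is hypothetical and only illustrative, and the boto3 call in the comment uses placeholder cluster, service, and task definition names; check it against the current boto3 documentation before relying on it.

```python
import json

# Ordered strategy from this post: spread by AZ first, then by instance.
PLACEMENT_STRATEGY = [
    {"type": "spread", "field": "attribute:ecs.availability-zone"},
    {"type": "spread", "field": "instanceId"},
]

def check_strategy(strategy):
    """Basic sanity checks; not a substitute for the ECS API's own validation."""
    if len(strategy) > 5:
        raise ValueError("ECS allows at most 5 strategy rules per service")
    for rule in strategy:
        if rule["type"] not in {"random", "spread", "binpack"}:
            raise ValueError(f"unknown strategy type: {rule['type']}")
    return strategy

strategy_json = json.dumps(check_strategy(PLACEMENT_STRATEGY))
print(strategy_json)

# Hypothetical usage with boto3 (not executed here; all names are placeholders):
#   import boto3
#   boto3.client("ecs").create_service(
#       cluster="my-cluster", serviceName="my-service",
#       taskDefinition="my-task:1", desiredCount=3,
#       placementStrategy=PLACEMENT_STRATEGY)
```

The order of the rules matters: the AZ rule is listed first so that it is applied before the per-instance rule.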
Unlike some other cloud providers, AWS does not support balancing or re-balancing tasks across multiple services. We could create a Lambda function that listens for scaling event triggers and then tweaks the scaling rules from there. This idea is inspired by the following blog post:
https://aws.amazon.com/blogs/compute/how-to-automate-container-instance-draining-in-amazon-ecs/
As we made this change to several clusters, we not only made the application more resilient and distributed the tasks evenly, we were also able to shut down unused instances by scaling down, which saved the teams and the company a lot of money.
I hope this helps your team.