I faced a challenge with our EFS throughput, which caused significant load increases on our EC2 instances (upwards of 30,000!) as well as extreme delays reading from and writing to the EFS volume.
The EFS volume (network storage) was set up in Bursting Throughput mode, which means that Amazon scales the throughput based on the size of our file system. Our size at the time of the incident was 26.81 GB, which gave us a permitted (baseline) throughput of about 1 MiB/sec. However, we were generally pushing around 1.1 MiB/sec, which exceeded that limit. Because we were constantly pushing more throughput than permitted, we were unable to accrue burst credits.
For background: a file system can drive throughput at its baseline metered rate continuously. It accumulates burst credits whenever it is inactive or driving throughput below its baseline metered rate, and accumulated burst credits give it the ability to drive throughput above its baseline rate. For example, a 100 GiB file system can burst to 300 MiB/sec for up to 72 minutes.
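To make the credit mechanics concrete, here is a minimal sketch of the balance under a constant load. It is a simplified model, not AWS's actual metering: it assumes credits change by (baseline minus actual throughput) every second, never drop below zero, and it ignores the maximum credit balance EFS enforces. The function name and the starting balances are made up for illustration; the 1 MiB/sec and 1.1 MiB/sec figures are the ones from our incident.

```python
def simulate_credit_balance(start_credits_mib, baseline_mib_s, actual_mib_s, seconds):
    """Return the burst credit balance (in MiB) after `seconds` of constant load.

    Simplified model (an assumption, not AWS's exact metering): the balance
    changes by (baseline - actual) MiB per second and is clamped at zero.
    The upper cap on credits is deliberately ignored.
    """
    net_rate = baseline_mib_s - actual_mib_s  # positive = accruing, negative = draining
    return max(0.0, start_credits_mib + net_rate * seconds)


if __name__ == "__main__":
    # Our situation: baseline 1 MiB/sec, sustained load 1.1 MiB/sec, no credits left.
    print(simulate_credit_balance(0.0, 1.0, 1.1, 3600))
    # If the load dropped to 0.5 MiB/sec for an hour, credits would start to build.
    print(simulate_credit_balance(0.0, 1.0, 0.5, 3600))
```

Under this model, any sustained load above the baseline keeps the balance pinned at zero, which matches what we observed.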
You can see how from 5/30 to 6/1 we had 100 MiB/sec of throughput available to us, as well as the burst credits we had accrued.
This stopped on 6/1, and we had only a brief period early yesterday morning when we could burst.
WHY DID THIS HAPPEN?
We were continuously driving more throughput than was permitted, so our credit balance would always sit at 0 unless our throughput dropped below the permitted 1 MiB/sec.
HOW DID WE SOLVE IT?
We changed from Bursting Throughput mode to Provisioned Throughput mode and saw immediate relief.
The mode change will help in the future because our application requires high throughput while storing a low volume of data, so a throughput limit tied to storage size will always be too small for us.
HOW DO WE TEST THIS IN THE FUTURE?
We should cover this as part of our endurance testing. Because our EFS volume is small, I would say about 30 minutes of sustained load should give us plenty of time to exhaust our credit balance.
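The 30 minute figure is my estimate, and it depends heavily on how many credits the file system holds when the test starts. A rough way to sanity-check the test length is sketched below. It assumes a constant test load and the same simplified drain model (load minus baseline per second); the function names and the example credit balances are hypothetical.

```python
import math


def minutes_to_exhaust(credits_mib, baseline_mib_s, test_load_mib_s):
    """Minutes of sustained load until the credit balance reaches zero.

    Returns math.inf when the load does not exceed the baseline, because
    the balance never drains in that case.
    """
    drain_mib_s = test_load_mib_s - baseline_mib_s
    if drain_mib_s <= 0:
        return math.inf
    return credits_mib / drain_mib_s / 60.0


def load_to_exhaust_in(credits_mib, baseline_mib_s, minutes):
    """Sustained load (MiB/sec) needed to drain the balance in `minutes`."""
    return baseline_mib_s + credits_mib / (minutes * 60.0)


if __name__ == "__main__":
    # Hypothetical starting balance of 178,200 MiB with our 1 MiB/sec baseline.
    print(minutes_to_exhaust(178200.0, 1.0, 100.0))
    print(load_to_exhaust_in(178200.0, 1.0, 30))
```

Reading the actual credit balance before the test and plugging it in tells us whether 30 minutes is enough or whether the test load needs to be higher.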
HOW OFTEN ARE THE CREDITS RESET?
Credits are not reset on a schedule. Whenever we are driving throughput lower than our permitted throughput, we begin to accrue burst credits.
Each file system earns credits over time at a baseline rate that is determined by the size of the file system that is stored in the EFS Standard or One Zone storage class. It spends credits when it drives throughput above that baseline rate and accumulates them whenever it is inactive or driving throughput below it.
WHAT IMPLICATIONS DO WE HAVE WITH THE THROUGHPUT CHANGE?
There are no technical challenges or concerns with the change made. We can monitor our throughput and scale back the provisioned rate as needed. This change will affect our bill: we are billed for the throughput that we provision above what Amazon already provides us. There is a calculator on this page. Please keep in mind that this is list/retail pricing and does not reflect any discounts.
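As a rough illustration of how the bill scales, the sketch below charges only for the throughput provisioned above what is already included. The function names are made up, and the price is a placeholder parameter, not a real AWS rate; use the calculator or the pricing page for actual numbers.

```python
def billable_throughput_mib_s(provisioned_mib_s, included_mib_s):
    """Throughput we pay for: only the part provisioned above what is included."""
    return max(0.0, provisioned_mib_s - included_mib_s)


def monthly_throughput_cost(provisioned_mib_s, included_mib_s, price_per_mib_s_month):
    """Estimated monthly charge for provisioned throughput.

    `price_per_mib_s_month` is a placeholder input (an assumption), not an
    actual AWS price.
    """
    billable = billable_throughput_mib_s(provisioned_mib_s, included_mib_s)
    return billable * price_per_mib_s_month


if __name__ == "__main__":
    # Hypothetical: provision 10 MiB/sec, 1 MiB/sec included, placeholder price of 2.0.
    print(monthly_throughput_cost(10.0, 1.0, 2.0))
```

Scaling the provisioned rate back toward what we actually use is the main lever for keeping this cost down.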
WHAT ELSE CAN WE DO?
Monitor throughput and burst credits
Use asynchronous write operations
Asynchronous writes are buffered on our EC2 instances before being written to EFS. This gives lower latency, but the tradeoff is consistency and the time taken to complete the write.
We should use the settings recommended by AWS (https://docs.aws.amazon.com/efs/latest/ug/mounting-fs.html). Use NFS 4.1, as it provides better performance. We should also increase the read and write buffers for NFS clients to 1 MB (rsize=1048576, wsize=1048576) if this is not already in place.
Use multiple EFS volumes
We can add a second, third, etc. EFS volume to split logs and other data.
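For the monitoring item above, one possible approach is to pull the BurstCreditBalance metric from CloudWatch and flag the periods where it runs low. The sketch below assumes boto3 is installed and AWS credentials are configured. The function names, the 24 hour window, the 5 minute period, and the threshold are my own choices for illustration, and the file system ID in the usage comment is a placeholder.

```python
from datetime import datetime, timedelta, timezone


def fetch_burst_credit_datapoints(file_system_id, hours=24):
    """Fetch 5-minute minimums of BurstCreditBalance (bytes) from CloudWatch."""
    import boto3  # imported here so the helper below can be used without AWS access

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/EFS",
        MetricName="BurstCreditBalance",
        Dimensions=[{"Name": "FileSystemId", "Value": file_system_id}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=300,
        Statistics=["Minimum"],
    )
    return response["Datapoints"]


def low_credit_datapoints(datapoints, threshold_bytes):
    """Return datapoints whose minimum balance is below the threshold, oldest first."""
    low = [p for p in datapoints if p["Minimum"] < threshold_bytes]
    return sorted(low, key=lambda p: p["Timestamp"])


# Example usage (placeholder file system ID and an arbitrary 1 GiB threshold):
# points = fetch_burst_credit_datapoints("fs-12345678")
# for p in low_credit_datapoints(points, 1024 ** 3):
#     print(p["Timestamp"], p["Minimum"])
```

The same check could be set up as a CloudWatch alarm instead of a script, so we get notified before the balance reaches zero.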
I hope this helps.