Redis & hidden AWS network throughput limits

Dale Frohman
Apr 3, 2021

Throughout the morning we were receiving reports of users losing their sessions. This was frustrating for them, as they had to log out and log back in every 30 minutes.

The sessions were stored in AWS ElastiCache (Redis). After checking our dashboards, we noticed that something was not right.

A team member reached out:

“I’m trying to interpret what I’m seeing and want to get some input…”

  • Right around 3pm, there was a significant spike in CPU utilization for Redis 0002–001.
  • (This graph shows CPU, not Engine CPU. The Engine CPU didn’t even hit 70%, so it didn’t saturate.)
  • Immediately following, 0002–002 had significantly lower CPU utilization. Thoughts on why?
  • New connections spiked, much higher than they should have for just re-adding a task.
  • There’s no indication that this hit any capacity threshold, but I do suspect this was related to the tasks falling over, although I don’t have any ideas as to which was the symptom and which was the cause.

Time to dive in!

We are often exceeding the network baseline throughput of roughly 100 MB/s per node.

For this node we are also seeing packet shaping being performed due to the high utilization, as the outbound bandwidth allowance is being exceeded. This affects performance because our packets are being delayed or dropped. You can see 0002–001 constantly exceeding its outbound network allowance.
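
To confirm this kind of throttling, you can query the host-level allowance metrics that ElastiCache publishes to CloudWatch. Here is a minimal sketch with boto3, where the cluster and node ids are placeholders; NetworkBandwidthOutAllowanceExceeded counts packets queued or dropped because outbound traffic exceeded the instance's baseline:

```python
# Sketch: confirm outbound bandwidth throttling on one ElastiCache node.
# Assumes boto3 credentials are configured; cluster/node ids are placeholders.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

end = datetime.now(timezone.utc)
start = end - timedelta(hours=6)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ElastiCache",
    MetricName="NetworkBandwidthOutAllowanceExceeded",  # packets shaped past the baseline
    Dimensions=[
        {"Name": "CacheClusterId", "Value": "sessions-0002-001"},  # placeholder id
        {"Name": "CacheNodeId", "Value": "0001"},
    ],
    StartTime=start,
    EndTime=end,
    Period=300,               # 5-minute buckets
    Statistics=["Sum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], int(point["Sum"]))
```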

After this traffic shaping kicks in, we observe a high number of new connections.

There is a limit of 65,000 TCP connections per node. Once this is reached, additional connection attempts are dropped. Naturally, this results in connection retries. You can see that these excess connections cause spikes in CPU: the CPU has to work harder when clients are constantly reconnecting.
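
You can also watch how close a node is getting to that ceiling from the Redis side. A quick sketch with redis-py (the endpoint is a placeholder); INFO clients reports connected_clients, which you can compare against the 65,000 cap:

```python
# Sketch: watch how close a node is to the 65,000-connection ceiling.
# The endpoint below is a placeholder for the node's address.
import time

import redis

r = redis.Redis(host="sessions-0002-001.example.cache.amazonaws.com", port=6379)

MAX_CLIENTS = 65000  # ElastiCache's connection ceiling per node

while True:
    clients = r.info("clients")["connected_clients"]
    print(f"connected_clients={clients} ({clients / MAX_CLIENTS:.1%} of limit)")
    time.sleep(30)
```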

We are also seeing a reduction in hash-based commands. Because a percentage of the traffic is being dropped and clients are busy retrying and reconnecting, fewer actual commands make it through.

0002–002 also continuously breaches the network baseline throughput of ~100 MB/s, exceeding the baseline limit on an m5.xlarge. We see the same peak in incoming traffic and about 30,000 new connections, but we did not reach the 65,000 limit.

Since we had a large influx of data to the Master, we see replication volumes of 160M as the read replica tries to keep in sync with the Master.

The pattern continues, with new connections exceeding the 65,000 limit and network bytes out exceeding the baseline throughput again.

So we have two immediate challenges:

  1. Network traffic is exceeding the network throughput limit. We may need to upgrade the cluster to bigger EC2 instances.
  2. We are hitting the 65,000-connection TCP limit. There is nothing we can do on the Redis side to fix this. We may want to use connection pooling on the app side so that connections are re-used (see the sketch after this list). We should also look at using all nodes in the cluster to distribute the load.
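
For the second challenge, here is a minimal sketch of what app-side connection pooling could look like with redis-py; the endpoint and pool size are placeholders, not our production settings. A shared pool keeps the connection count bounded instead of opening a fresh TCP connection per request, and a cluster-aware client spreads keys across all shards:

```python
# Sketch: reuse connections instead of opening one per request.
# Endpoint and pool size are illustrative placeholders.
import redis

# One shared pool per process, capped well below the node's 65,000 limit.
pool = redis.ConnectionPool(
    host="sessions.example.cache.amazonaws.com",
    port=6379,
    max_connections=50,
)

def get_session(session_id: str) -> dict:
    # Each call borrows a connection from the pool and returns it when done.
    client = redis.Redis(connection_pool=pool)
    return client.hgetall(f"session:{session_id}")

# For a cluster-mode-enabled setup, redis-py's cluster client pools per node
# and routes keys to the right shard, spreading load across all nodes:
# from redis.cluster import RedisCluster
# rc = RedisCluster(host="sessions.example.cache.amazonaws.com", port=6379)
```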

We found that each EC2 instance type has a baseline network limit. The advertised 10 Gbps is the burst limit; the baseline for this instance class is roughly 750 Mbps, which works out to about 94 MB per second, right around the ~100M per node we were seeing in the graphs.

We are seeing this behavior mostly on shard 2.

We were able to identify that this was happening because the client-side session token used to authenticate with the server was not being refreshed. We are doing further analysis to find out why the refreshToken is not working on the client side. In the meantime, we can increase the session validity period so that a token remains usable for longer.

The current value was 30 minutes. The short-term fix was to increase this.
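
If the session entries in Redis carry their own TTL, the stop-gap amounts to writing them with a longer expiry. A minimal sketch with redis-py; the endpoint, key name, and TTL values are illustrative assumptions, not our actual schema:

```python
# Sketch: extend the session validity window as a stop-gap.
# Endpoint, key name, and TTLs are illustrative only.
import json

import redis

r = redis.Redis(host="sessions.example.cache.amazonaws.com", port=6379)

OLD_TTL = 30 * 60       # 30 minutes: the original validity window
NEW_TTL = 4 * 60 * 60   # e.g. 4 hours while the refreshToken bug is investigated

def store_session(session_id: str, payload: dict) -> None:
    # SETEX writes the value and its expiry in one command.
    r.setex(f"session:{session_id}", NEW_TTL, json.dumps(payload))
```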

The root cause was that all the records were going into the same map/key, so the entire load concentrated on a single shard.

The permanent fix is to split the data out across more keys, and therefore more shards. Splitting the keys solved this challenge.
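
To illustrate the idea: a single key lives on exactly one shard, so one giant map pins all of its traffic there. Deriving a bucket from the record id spreads the data across many smaller keys, and therefore across shards. A rough sketch; the key names and bucket count are assumptions:

```python
# Sketch: split one giant hash into many bucketed keys so the load
# spreads across shards. Key names and bucket count are assumptions.
import zlib

import redis

r = redis.Redis(host="sessions.example.cache.amazonaws.com", port=6379)

NUM_BUCKETS = 64  # illustrative; enough buckets to spread across all shards

def bucketed_key(record_id: str) -> str:
    bucket = zlib.crc32(record_id.encode()) % NUM_BUCKETS
    return f"sessions:{bucket}"

def store_record(record_id: str, value: str) -> None:
    # Before: HSET sessions <record_id> <value>  -> everything on one shard.
    # After: the same field lands in one of NUM_BUCKETS smaller hashes.
    r.hset(bucketed_key(record_id), record_id, value)

def fetch_record(record_id: str):
    return r.hget(bucketed_key(record_id), record_id)
```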

Hope this helps someone else :)

Until next time.

