One night in production, we hit a serious problem with our Hazelcast cluster.
We noticed that one of the four nodes in the cluster was at 100% CPU with over 320 threads, and the process consuming the threads and CPU was java (Hazelcast). The cluster became unhealthy. Nodes have come and gone hundreds of times in the past; whether intentional or unintentional, it has always been uneventful and routine. This night it was not.
We started digging through the logs. From the logs, the system CPU load was indeed at 100%, but only 50% of it was being used by the Hazelcast instance, and the peak thread count never exceeded 150. This is indicated by logs emitted by com.hazelcast.internal.diagnostics.HealthMonitor, which are triggered when certain performance indicators exceed a threshold. That raised the question: was any other Java process running alongside Hazelcast on our servers?
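For reference, the HealthMonitor's behavior is controlled by a handful of Hazelcast system properties. A sketch of the relevant ones (names are from the Hazelcast documentation; the values shown are the usual defaults, which may differ by version):

```properties
# OFF, SILENT (log only when a threshold is exceeded, the default) or NOISY (log every interval)
hazelcast.health.monitoring.level=SILENT
# Seconds between health monitor checks
hazelcast.health.monitoring.delay.seconds=20
# Thresholds that trigger logging in SILENT mode
hazelcast.health.monitoring.threshold.memory.percentage=70
hazelcast.health.monitoring.threshold.cpu.percentage=70
```

These can also be passed as -D system properties on the JVM command line.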
Our monitoring tool (AppDynamics) showed the 300+ threads. It was odd that the log numbers and what we observed differed by a factor of two.
The only other Java process running was the AppDynamics (monitoring) agent.
What we were seeing was the application timing out against the Hazelcast cluster. Once we took the bad node out of the cluster, the remaining healthy nodes sat waiting on migrations and backup synchronization.
Before the node went down, I saw errors in both app segments related to slow operations and cluster connectivity issues. At the same time, the Management Center and the Hazelcast logs showed backup sync activity churning unusually fast in the cluster. I suggested we take the bad node down, hoping the cluster would re-initiate the executor service (which it should normally do), but it did not make any difference. Taking one node out of the cluster normally re-initiates the executor service and stabilizes the cluster most of the time.
This time it did not. It made the problem worse.
Summary of the Hazelcast cluster events observed in Splunk:
WARNING com.hazelcast.internal.partition.impl.MigrationManager — partitionId=2052 is completely lost!
host1 reports no heartbeat from host2 and times out after 60000 ms
host1 concludes host2 to be dead
host1 rejects a client authentication request
host1 logs an authentication-failed request
host1 assigns new owners for lost partition 2053
host1 is merging tcp/ip to host2
host1 is merging tcp/ip to host2, instructed by master
host2 starts ignoring heartbeats from member host3
The events show a possible split-brain cluster state, with both host1 and host2 acting as master, originating from heartbeat failures
heartbeat timeout: 60000 ms, suspicion level: 1.00
is suspected to be dead for reason: explicit suspicion
Member having UUID is not part of the cluster. Client Authentication rejected.
Resetting heartbeat timestamps because of huge system clock jump! Clock-Jump: 44124 ms, Heartbeat-Timeout: 60000 ms
Thousands of very scary entries.
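The 60000 ms timeout appearing in those entries corresponds to Hazelcast's heartbeat configuration. As a sketch (property names are from the Hazelcast documentation; the values shown are the defaults, matching what we were running):

```properties
# How often members send heartbeats to each other
hazelcast.heartbeat.interval.seconds=5
# A member is suspected dead when no heartbeat arrives within this window;
# this is the 60000 ms timeout in the log entries above
hazelcast.max.no.heartbeat.seconds=60
```

Raising the timeout can hide transient network stalls, but it would not have fixed the underlying problem here.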
I concluded that there was a split-brain between host1 and the other members. Since this behavior looked to precede the high CPU usage, I wanted to investigate further to understand what was causing the excessive heap usage on host1 compared to the others.
I needed to dig deeper. We had a maintenance window where we took another member out of the cluster cleanly and added it back in.
We felt the same effects. We were seeing a lot of waiting migrations and backup synchronizations in progress.
We then took the new node out of the cluster and all migrations happened except 1.
We also noticed that 1 of our larger maps that normally has 5.9M entries stopped at 5.1M entries.
To summarize: upon restarting this node, it joined the cluster and attempted to migrate partition 718 from host1 to host2, which failed and was retried 75 times. During this time application calls slowed to a crawl. The problem was only fixed by a clean restart of all cluster nodes.
Things were not looking very positive at this point. Was the only way to recover from this really to take the entire cluster down and bring it back up?
We then noticed these errors, which prompted us to start tcp dumps and Wireshark analysis of the network traffic:
WARNING com.hazelcast.internal.networking.nio.NioChannelOptions — The configured tcp send buffer size conflicts with the value actually being used by the socket and can lead to sub-optimal performance. Configured 1048576 bytes, actual 124928 bytes. On Linux look for kernel parameters 'net.ipv4.tcp_wmem' and 'net.core.wmem_max'. This warning will only be shown once.
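As the warning suggests, the kernel capped Hazelcast's requested 1 MB send buffer at roughly 122 KB. To let sockets actually get the configured size, the kernel limits can be raised, for example in /etc/sysctl.conf (the values below simply match Hazelcast's 1048576-byte request and are illustrative, not a tuning recommendation):

```
# Largest send buffer a socket may request via setsockopt (bytes)
net.core.wmem_max = 1048576
# TCP send buffer autotuning: min, default, max (bytes)
net.ipv4.tcp_wmem = 4096 65536 1048576
```

Apply with `sudo sysctl -p`. In our case, though, this warning turned out to be a symptom rather than the root cause.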
After further analysis we found the culprit: LRO (Large Receive Offload) was enabled in our VMware environment.
To resolve the production issue, we made the following changes on the production Hazelcast nodes:
# Disable LRO immediately on the active interface
sudo ethtool -K eth2 lro off
# Persist the change across reboots (sudo does not apply to shell redirection, so use tee)
echo "ethtool -K eth2 lro off" | sudo tee -a /etc/rc.local
sudo cat /etc/rc.local
# Verify that LRO is now off
sudo ethtool -k eth2 | grep large-receive-offload
A server reboot was not required, as the production and lower environments all run on the VMXNET3 adapter type.
Disabling LRO is vendor recommended, and Hazelcast documents this as a known networking issue:
Known Networking Issues: Network performance issues, including timeouts, might occur with LRO (Large Receive Offload) enabled on Linux virtual machines and ESXi/ESX hosts. We have specifically had this reported in VMware environments, but it could potentially impact other environments as well. We strongly recommend disabling LRO when running in virtualized environments.
LRO interferes with the kernel's ability to read data from the NIC (network interface). Hazelcast uses TCP connections in a particular way: each pipe is unidirectional, dedicated to either reads or writes and never a mixture of both. Normally the number of bytes written on one end matches the number of bytes read on the other. When the issue is about to happen, the written and read byte counts no longer align; the data gets grouped into different-sized packets, and then communication stops for that particular pipe. LRO does not play nicely with packet routing/bridging. This is not specific to Hazelcast; it is just how LRO works. Bridging/routing is routinely used in virtualized environments, which is why you should disable LRO there.
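Afterwards we wanted a quick way to audit that LRO stays off across interfaces. A minimal sketch that wraps ethtool (the interface name and the subprocess call are illustrative; the parser only assumes ethtool's `large-receive-offload: on/off` output line):

```python
import subprocess

def lro_enabled(ethtool_output: str) -> bool:
    """Return True if `ethtool -k` output reports LRO as enabled."""
    for line in ethtool_output.splitlines():
        if line.strip().startswith("large-receive-offload:"):
            # The value is "on", "off", or "off [fixed]"
            return line.split(":", 1)[1].strip().startswith("on")
    return False  # no LRO line at all: treat as disabled

def check_interface(iface: str) -> bool:
    """Query a live interface; requires ethtool and usually root."""
    out = subprocess.run(["ethtool", "-k", iface],
                         capture_output=True, text=True, check=True).stdout
    return lro_enabled(out)

# Example against captured output rather than a live NIC:
sample = "Features for eth2:\n  large-receive-offload: on\n"
print(lro_enabled(sample))  # prints: True
```

Wiring a check like check_interface("eth2") into monitoring would make an LRO regression visible before it bites again.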
It was a long few days, but high fives all around to the team for being relentless with the solution. 1 down 999,999 challenges to go! :)