This post was authored by Gabe Contreras, Enterprise Architect at Nutanix.
With many big data applications like Elasticsearch, one question invariably comes up: "Why have RF2 at the Elasticsearch layer and still do RF2 at the Nutanix layer?"
There are many operational reasons to run Elasticsearch on Nutanix, especially with AHV, rather than on bare metal: avoiding application silos, leveraging spare compute and storage, having a single pane of glass for all your workloads, and supporting a single platform. All of these lead to operational efficiencies, especially at scale. They help you turn your big data deployments into a private cloud for all applications and let you use spare capacity for workloads such as extra compute to run Kafka consumers and producers.
There are also space-saving features on Nutanix such as compression and EC-X. Elasticsearch does its own compression, but it is scoped to each data node; Nutanix compresses across the cluster and achieves additional savings. For this post we will focus on the single technical question of RF levels when an application can do its own replication, with a focus on Elasticsearch. These nuances play directly into resiliency, especially in how each platform handles rebuilds during failures.
First, let's go over the two main deployment patterns I see with customers running production, mission-critical Elasticsearch workloads on bare metal. These workloads are almost always on all-flash. The first pattern relies only on Elasticsearch for resiliency: bare metal nodes in a RAID 0 configuration to get maximum performance from each individual node, with Elasticsearch set to RF3, meaning one primary with two copies. The second adds node-level resiliency: RAID 10 on the node and RF2 in Elasticsearch. With a RAID 10 configuration you will not need to rebuild an entire node's worth of data for a single disk failure.
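The capacity cost of the two patterns differs, which is worth making concrete. A quick sketch of the math, using illustrative numbers (node count and raw capacity per node are assumptions, not measurements):

```python
# Rough usable-capacity math for the two bare-metal patterns above.
# Node count and raw TB per node are illustrative assumptions.
RAW_TB_PER_NODE = 10
NODES = 12
raw_total = RAW_TB_PER_NODE * NODES  # 120 TB raw across the cluster

# Pattern 1: RAID 0 per node + Elasticsearch RF3 (primary + 2 copies).
# RAID 0 wastes nothing locally; every byte is stored 3x by Elasticsearch.
usable_rf3 = raw_total / 3

# Pattern 2: RAID 10 per node + Elasticsearch RF2 (primary + 1 copy).
# RAID 10 halves usable space on each node; Elasticsearch stores 2x on top.
usable_raid10_rf2 = raw_total / 2 / 2

print(f"RAID 0 + RF3 usable:  {usable_rf3:.0f} TB")
print(f"RAID 10 + RF2 usable: {usable_raid10_rf2:.0f} TB")
```

Note that the RAID 10 + RF2 pattern actually costs more capacity (4x overhead versus 3x) in exchange for surviving single-disk failures without a full shard rebuild.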
The other reason for multiple copies in Elasticsearch is search parallelization. Unless your search volume is very heavy, one primary and one copy is normally enough for fast searches, which points to the copy structure of Elasticsearch being more about resiliency than performance. Elasticsearch also offers node and rack awareness for replica placement, which corresponds to how we recommend our customers build Elasticsearch on Nutanix for large deployments: use a node attribute to split the Elasticsearch nodes evenly between two separate Nutanix clusters, building resiliency across racks or availability zones.
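The rack awareness mentioned above is configured through Elasticsearch's shard allocation awareness settings. A minimal sketch of the cluster-level payload, assuming an attribute named rack_id with two rack values (the attribute and rack names are illustrative; each data node would also declare a matching node.attr.rack_id in its own elasticsearch.yml):

```python
import json

# Shard allocation awareness settings for rack awareness. The attribute
# name "rack_id" and the rack values are illustrative assumptions; each
# data node must also set a matching node.attr.rack_id locally.
body = {
    "persistent": {
        # Tell the allocator to spread shard copies across rack_id values.
        "cluster.routing.allocation.awareness.attributes": "rack_id",
        # Forced awareness: keep primary and replica in different racks
        # even while one rack is down, rather than re-packing one rack.
        "cluster.routing.allocation.awareness.force.rack_id.values": "rack_one,rack_two",
    }
}

# This payload would be sent as: PUT _cluster/settings
print(json.dumps(body, indent=2))
```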
We will go over the main failure scenarios to show how each platform handles them and why Nutanix handles them better.
Prolonged Node Failures
The first scenario is a complete server hardware failure where the node could be down for days waiting on a replacement part. Elasticsearch has the setting index.unassigned.node_left.delayed_timeout, whose default is one minute. In a bare metal world many customers leave this at the default because if a node fails, they want the rebuild to start as quickly as possible, though some set it to around five minutes to allow a little extra time. Once this threshold is met, Elasticsearch starts reassigning the shards for rebuild.
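Raising that timeout is a per-index settings change. A sketch of the payload, assuming a five-minute delay as described above (the index pattern "logs-*" is an illustrative placeholder):

```python
import json

# Sketch: raising index.unassigned.node_left.delayed_timeout from its
# one-minute default to five minutes. The index pattern the payload is
# applied to (e.g. "logs-*") is an illustrative placeholder.
body = {
    "settings": {
        "index.unassigned.node_left.delayed_timeout": "5m"
    }
}

# This payload would be sent as: PUT logs-*/_settings
print(json.dumps(body, indent=2))
```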
The rebuild happens per shard. Large deployments are often designed with a maximum shard size of 50GB. So if you have hundreds of shards that need to be rebuilt, each one is a one-to-one copy: the node holding the primary copies that data to the newly assigned secondary. Depending on how busy your nodes are, this behavior can slow the rebuild. And if one node happens to hold the primaries for three shards that need rebuilding, it becomes a one-to-three source for the rebuild.
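A back-of-envelope estimate makes the scale of such a rebuild concrete. All numbers below (shard count on the failed node, shard size, aggregate copy rate) are illustrative assumptions, not measured values:

```python
# Back-of-envelope rebuild math after a full node loss. All numbers are
# illustrative assumptions, not measurements.
SHARDS_ON_FAILED_NODE = 200     # shards whose copies lived on the dead node
SHARD_SIZE_GB = 50              # sized to the common 50 GB maximum
COPY_RATE_GB_PER_HOUR = 1250    # assumed aggregate recovery throughput

data_to_rebuild_gb = SHARDS_ON_FAILED_NODE * SHARD_SIZE_GB
rebuild_hours = data_to_rebuild_gb / COPY_RATE_GB_PER_HOUR

print(f"Data to re-replicate: {data_to_rebuild_gb} GB")
print(f"Best-case rebuild time: {rebuild_hours:.1f} hours")
# Each shard copies 1:1 from the single node holding its surviving copy,
# so a node that sources several shards becomes a hotspot and the real
# time stretches well past this best case.
```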
This can create hotspots in your environment, slowing indexing and searches as the rebuild process consumes resources and fewer copies are available to serve queries. If this occurs during your peak times, it can lead to poor user experience or possibly other failures. When we looked at customers' production workloads, rebuild times on all-flash bare metal clusters were 8 hours or more until the cluster was healthy. During peak times the rebuild could stretch beyond 16 hours.
The rebuild process and rate are controlled by Elasticsearch, and while the rebuild runs in the background, data durability is compromised. You can tune how fast Elasticsearch rebuilds data, but the throttle must be sized for your peak workload. So if you plan for peak, then during off hours, when spare capacity is available to rebuild, the rebuild speed is artificially limited.
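The throttle in question is Elasticsearch's cluster-wide recovery rate limit. A sketch of the settings payload, assuming the long-standing 40mb-per-second default as the peak-safe value (raising it speeds rebuilds but takes I/O and bandwidth away from indexing and search, which is exactly the tradeoff described above):

```python
import json

# Sketch: the cluster-wide recovery throttle. "40mb" is Elasticsearch's
# long-standing conservative default; a higher off-hours value is shown
# only as an illustration of the tuning tradeoff.
peak_safe_rate = "40mb"

body = {
    "persistent": {
        # Caps the bandwidth each node spends on shard recovery, so it
        # must be sized for the peak workload, not for idle periods.
        "indices.recovery.max_bytes_per_sec": peak_safe_rate,
    }
}

# This payload would be sent as: PUT _cluster/settings
print(json.dumps(body, indent=2))
```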