Specifying Data Nodes for Hadoop
The reason so many new data storage projects are meeting enterprise scaling requirements at lower cost is primarily the use of commodity-class nodes in a scale-out environment. The nuances that various projects and companies bring to that scale-out environment account for the rest.
Practically, scale out means you have limited ability to fine-tune a single node, so you focus your scaling energy on “how many?” rather than “what specification?”. Quite often, though, I find myself pushing back on the idea that the node specification should never be revisited as the cluster scales out. Most clusters can contain nodes of varying specification and, since they are built out over time in agile fashion, multiple specifications are almost inevitable. While this introduces potential performance variation, better-specified nodes, even if it is not practical to deploy them throughout the cluster, deliver better performance where they are used and let you fold new learnings into the Hadoop journey.
In other words, no environment should truly be thought of as entirely scale out.
That aside, whether to check the “scale out” box or for its genuine advantages, many enterprise platforms that used to be scale-up only can now scale out and make multiple nodes work together. This is admittedly a vastly different scale out from a Hadoop or NoSQL cluster; we’re talking about scaling out to 8 nodes, for example, instead of possibly hundreds. In these smaller clusters, you definitely pay close attention to the specification of each node.
In a Hadoop cluster, you need to strike a balance among hard drive count, CPU core count, and memory density. You want to look at different CPU technologies and CPU bus speeds, data and infrastructure node memory allocations, and the type of hard drive controllers. The correct balance and budget depend on anticipated usage, performance expectations, and data activity/temperature. Be sure to get some outside, forward-looking perspective on these ‘anticipated’ aspects of your Hadoop workload; it’s about more than the initial requirements in front of you.
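To make that balance concrete, here is a minimal back-of-the-envelope sketch in Python. The drive size, core count, memory, and 3x replication factor are illustrative assumptions, not recommendations for your workload; plug in your own anticipated numbers.

```python
# Back-of-the-envelope data node balance check.
# All inputs are illustrative assumptions; substitute your own anticipated figures.

def node_balance(cores, spindles, drive_tb, memory_gb, replication=3):
    """Report rough ratios for a candidate data node specification."""
    raw_tb = spindles * drive_tb
    usable_tb = raw_tb / replication          # HDFS default replication keeps 3 copies
    return {
        "raw_tb": raw_tb,
        "usable_tb_after_replication": round(usable_tb, 1),
        "spindles_per_core": round(spindles / cores, 2),
        "memory_per_core_gb": round(memory_gb / cores, 1),
    }

# Example candidate node: dual 8-core CPUs, 16 x 2 TB drives, 64 GB RAM (assumed figures).
print(node_balance(cores=16, spindles=16, drive_tb=2, memory_gb=64))
```

Roughly one spindle per core and a few gigabytes of memory per core is the kind of balance the guidance below aims at; heavily I/O-bound workloads would push the spindles-per-core ratio higher.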
I see too many clusters put together with the thought that, because Hadoop is so much better suited to the workload, the specification of the cluster itself does not matter. Here is some guidance on specifying a data node in a cluster, once you get to that point by determining cluster size. Note that I am not referring to administrative nodes like name nodes, trackers, standby name nodes, or high availability nodes.
- Architect with at least one spindle per core, with each spindle mounted as a separate file system (see the configuration sketch after this list). I/O-bound workloads, and many are, can use more spindles per core. Optimizing the nodes for I/O versus CPU is an important decision and requires a good vision of the eventual workloads.
- Use large drives, a terabyte minimum. This assumes you have at least 3 terabytes to manage in the cluster. Since Hadoop workloads tend to be large sequential I/O scans, the slower seek times that come with larger drives are usually an acceptable tradeoff.
- Use dual multi-core CPUs running at 2.5 GHz.
- Have at least 64 GB of memory per node. This will work even if you want to use HBase. This much memory, combined with somewhere between 12 and 20 cores per node, will allow for dozens of concurrent tasks (see the arithmetic sketch below).
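Following up on the first bullet, here is a minimal sketch of how separate per-spindle file systems are typically handed to HDFS: each mount point is listed in the data node's dfs.datanode.data.dir property (dfs.data.dir in older 1.x releases), and the data node spreads blocks across them. The /grid mount paths are hypothetical.

```python
# Build the value for dfs.datanode.data.dir from one mount point per spindle.
# The /grid/0 ... /grid/11 paths are hypothetical; use whatever mount points your nodes expose.

mounts = [f"/grid/{i}/hdfs/data" for i in range(12)]   # e.g. 12 spindles -> 12 file systems
data_dir_value = ",".join(mounts)

# The resulting comma-separated list goes into hdfs-site.xml on each data node:
print("dfs.datanode.data.dir =", data_dir_value)
```

To put the last bullet's "dozens of concurrent tasks" into numbers, here is a rough arithmetic sketch. The memory reserved for the OS and daemons and the per-task memory figure are assumptions you would tune for your own stack.

```python
# Rough concurrent-task estimate for a single data node.
# Reserved memory and per-task memory are assumptions, not fixed rules.

node_memory_gb = 64          # from the guidance above
reserved_gb = 12             # assumed OS + DataNode + (optional) HBase daemon overhead
per_task_gb = 2              # assumed memory allotted to each map/reduce task
cores = 16                   # within the suggested 12-20 core range

tasks_by_memory = (node_memory_gb - reserved_gb) // per_task_gb
tasks_by_cpu = cores * 2     # a common rule of thumb: roughly 1-2 tasks per core

print("concurrent tasks, memory-bound:", tasks_by_memory)
print("concurrent tasks, cpu-bound:   ", tasks_by_cpu)
print("practical ceiling:             ", min(tasks_by_memory, tasks_by_cpu))
```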
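Under these assumed figures the node lands in the high twenties of concurrent tasks, which is consistent with the "dozens" the last bullet promises; a heavier per-task memory footprint or a bigger daemon reservation would pull that number down.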
I did not go into administration nodes, cooling, existing skills, reusable available nodes, network configuration, or racking today, but those considerations also matter when specifying data nodes for your Hadoop cluster.
This post is brought to you by SAS.