Hadoop Implementation Questions
Many midsize companies are now sizing up Hadoop. Some want to take advantage of unstructured data; others are drawn by its storage price-performance. Either way, there are some basic questions about Hadoop adoption that many are weighing as they consider a move. I’ll provide some of those questions, and answers, in this post.
How are customers evolving with their Hadoop environments?
Two to three years ago, only the innovators were using Hadoop. There was Google, whose published work on its distributed file system and MapReduce inspired Hadoop, and Yahoo, which drove much of Hadoop's early development, and others like them.
Hadoop harnesses an area of impending competitive advantage – the utilization of big data. Structured data utilization is closer to the point of diminishing returns in many organizations as compared to unstructured data, where Hadoop excels.
Now that many companies are using it, adopters want not only to run Hadoop but also to optimize their clusters, so that they get the most work out of the hardware they have.
What are the key concerns for customers when configuring Hadoop nodes?
When you start looking at different configurations (number of nodes, memory, disk, etc.), you want to strike a balance among hard drive count, CPU core count and memory density. You also want to compare CPU technologies and CPU bus speeds, memory allocations for data nodes and infrastructure nodes, and the type of hard drive controllers.
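One way to make that balance concrete is to compare candidate node configurations by simple ratios, such as memory per core and disks per core. The sketch below is only an illustration: the `NodeSpec` class, the candidate names and all of the hardware numbers are hypothetical, and the right ratios depend on your workload.

```python
from dataclasses import dataclass


@dataclass
class NodeSpec:
    """Hypothetical description of one candidate data node."""
    name: str
    cores: int
    memory_gb: int
    disks: int

    def memory_per_core(self) -> float:
        # GB of RAM available to each CPU core
        return self.memory_gb / self.cores

    def disks_per_core(self) -> float:
        # Hard drives (spindles) available to each CPU core
        return self.disks / self.cores


def summarize(specs):
    """Return one human-readable line of balance ratios per candidate."""
    return [
        f"{s.name}: {s.memory_per_core():.1f} GB/core, "
        f"{s.disks_per_core():.2f} disks/core"
        for s in specs
    ]


# Illustrative numbers only, not vendor recommendations.
candidates = [
    NodeSpec("storage-heavy", cores=12, memory_gb=64, disks=12),
    NodeSpec("compute-heavy", cores=24, memory_gb=64, disks=4),
]

for line in summarize(candidates):
    print(line)
```

Putting the candidates side by side this way makes it easier to see when one configuration is lopsided, for example many cores fed by very few disks.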
How should Hadoop be tested in the environment?
Start with a proof of concept (POC) on a small five-node cluster. Use this configuration to understand what the data looks like, what cleansing it may need and how the pieces (nodes and jobs) work together. Plan to spend 2-3 months in the POC. To get started, you can run VMs on local desktops for testing and development.
Should Hadoop be virtualized?
Many adopters have virtualized their entire environments, which makes it hard to dedicate hardware to anything. Hadoop can clearly be virtualized, and it can run on a hypervisor. I’d be reluctant at this stage to do a lot of sharing of the Hadoop nodes, as Hadoop will take up quite a bit of resources, or at least disk (relatively speaking), but VM-hosted Hadoop works well.
Hadoop can clearly exist in the cloud as well, which I’ve argued may be the best deployment method for Hadoop for many. Though a public cloud will yield some performance degradation compared with a more controlled environment, the use of more VMs can compensate. Public clouds will need to adhere to extra commitments for Hadoop, such as not moving VMs around.
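As a rough illustration of how extra VMs can offset that degradation, the sketch below estimates how many cloud VMs would match a given number of dedicated nodes. It assumes that throughput scales linearly with node count and that each VM delivers a fixed fraction of a dedicated node's throughput; both are simplifying assumptions, and the function name and figures are hypothetical.

```python
import math


def vms_needed(baseline_nodes: int, relative_throughput: float) -> int:
    """Estimate the VMs needed to match a cluster of dedicated nodes.

    relative_throughput is the assumed fraction (0 < f <= 1) of a
    dedicated node's throughput that one cloud VM delivers.
    Assumes linear scaling, which real clusters only approximate.
    """
    if baseline_nodes < 1:
        raise ValueError("baseline_nodes must be at least 1")
    if not 0 < relative_throughput <= 1:
        raise ValueError("relative_throughput must be in (0, 1]")
    return math.ceil(baseline_nodes / relative_throughput)


# Example: if each VM delivered 80% of a dedicated node, matching a
# 10-node cluster would take 13 VMs under these assumptions.
print(vms_needed(10, 0.8))
```

The actual fraction has to come from measuring your own jobs in both environments, which is one more thing a proof of concept can tell you.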
This post was written as part of the IBM for Midsize Business program, which provides midsize businesses with the tools, expertise and solutions they need to become engines of a smarter planet. I’ve been compensated to contribute to this program, but the opinions expressed in this post are my own and don’t necessarily represent IBM’s positions, strategies or opinions.