The State of the Elephant
Since its creation in the mid-2000s to serve the needs of the Silicon Valley data elite, Hadoop has come a long way. Those early adopters had data volumes that far exceeded any realistic budget for a commercial DBMS. Their scale was also an order of magnitude beyond the workloads a DBMS was designed to handle, and given the variability of the data, the timing of that growth was unpredictable.
This early-adopter profile is now commonplace across numerous industries, which has led to wide adoption. Hadoop is quickly making its way from the largest Web-data companies through the Fortune 1000, and it will reach certain upper-midmarket companies in the next five to 10 years.
One measure of its staying power is the wide array of drivers behind it, spanning business, technical and financial perspectives.
For many businesses, Hadoop makes possible the move from reactive to proactive operations. By capturing the granular movement of parts and products through the supply chain, a user's every action on a webpage, or every data point on a customer, companies can deepen their customer understanding, anticipate next moves and intervene when necessary and appropriate for company goals.
Those tasked with delivering these capabilities are finding that predictive model development and run times can drop dramatically on Hadoop, so data warehouses and other DBMSs no longer have to punch above their weight. With data volumes growing exponentially and overwhelming existing systems, and with new types of data informing analytic processes, the technical drivers for Hadoop are many.
Without the financial drivers, there might be no Hadoop phenomenon at all, and those incentives are strong. The cost of data systems as a percentage of overall IT spend is at an unprecedented high and continues to grow; it would be irresponsible to ignore Hadoop in that light. Hadoop's basic architecture runs on commodity nodes, and the open source model provides functionality at no license cost. Both factors keep the platform's costs down.
The first wave of Hadoop addressed these key drivers, but in a far more limited capacity than Hadoop 2 does. With Hadoop 2 (and YARN), all data can now live in a single cluster that acts as a multi-tenant central interaction point for all data access and simultaneous processing. Each workload (batch, interactive and real-time) can be processed on the engine best suited to it. This enables the ecosystem to move applications natively into Hadoop and leverage existing skills like SQL.
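To make that multi-tenancy concrete, here is a minimal sketch, assuming YARN's Capacity Scheduler, of how a single cluster might be carved into queues so that batch, interactive and real-time workloads each receive a guaranteed share of resources. The queue names and percentages below are illustrative assumptions, not a recommended configuration.

```xml
<!-- Hypothetical capacity-scheduler.xml excerpt: dividing one YARN
     cluster into queues for different workload types. Queue names and
     capacity percentages are illustrative only. -->
<configuration>
  <!-- Define three sibling queues under the root queue. -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>batch,interactive,realtime</value>
  </property>
  <!-- Guarantee each workload a share of cluster resources;
       sibling capacities must sum to 100. -->
  <property>
    <name>yarn.scheduler.capacity.root.batch.capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.interactive.capacity</name>
    <value>30</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.realtime.capacity</name>
    <value>20</value>
  </property>
</configuration>
```

Because all three queues draw on the same cluster, capacity sitting idle in one queue can be borrowed by the others, which is what allows a single set of machines to serve every workload at once.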
What immaturity Hadoop retains is largely in the area of nonfunctional requirements. Security and governance in particular require more work to get right than they do in legacy environments. Fortunately, the distributions are making strides to harden Hadoop and make it enterprise-ready on both fronts. Consider Cloudera's recent acquisition of Gazzang, whose products encrypt data stored in Hadoop environments and manage who can access the keys, tokens and other secrets that grant access to the data.
According to blended estimates from IDC, Wikibon and Allied Market Research, Hadoop is about a $6 billion market in 2015, growing to a $50 billion market by 2020, at which point it will represent 50% of the big data market. The Hadoop market is split almost evenly among software, hardware and services. The consensus is that this is a net new market, not displacing the existing data storage market but adding to it.
Ultimately, Hadoop will replace some, but not much, of what now lives in a DBMS. The big data of Hadoop is data that is not in a database, and for the most part should not be. By volume, it will ultimately exceed what sits in the data warehouse, reaching into the petabytes. It is the data we have willfully ignored to date, and it is forming the backbone of modern applications.
This post is brought to you by SAS.