Exploring Low Value Density Data with Hadoop
Traditional business intelligence, analytics, and reporting practices have always focused on learning what we know we don’t know.
BI projects began with requirements: the business user stating his or her needs as, "I'd like to see this, this, and this." Everything was designed around obtaining answers to known questions, because those answers provided consistent added value.
What about learning that which is unknown? Analysts, statisticians, and data scientists have developed practices of data exploration, discovery, and mining. Data discovery allows us to dive into untouched, stagnant pools (or lakes) of data and uncover new information that we might miss with our conventional BI and analytic methods.
The premise was built on the notion of hidden gems buried in a mass of otherwise useless data. However, building a business case for data discovery was problematic. Often, the cost of the expertise and time needed to find these hypothetical hidden gems was enough to keep businesses from putting forth the effort. Many leaders also feared the effort would only lead to uncovering fool's gold. Thus, many retreated to the consistent and proven value of traditional BI and analysis.
Compounding the issue, data discovery has been severely hampered by conventional database schema design. With an RDBMS or data warehouse, the schema has to be designed up front, and data has to be transformed and conformed to fit that imposed structure. Traditional BI and analytics focused on data with a high value density that fit the requirements and the design structure. Each piece of data is important because it is there for a reason: it has a business process purpose, analytic value, or both.
Data that doesn't fit the specified structure is ignored (or, worse still, lost) and blown out as exhaust from the BI engine. Exhaust data has neither a business process purpose nor any immediate or perceived analytic value, and that assumed lack of value has often been the justification for ignoring it.
Exhaust data has a low value density, so in small amounts there was never a good enough reason to commit effort and resources to managing or analyzing it. In the days when storage was limited and expensive, that was justifiable from an ROI standpoint. Now, however, data volumes have grown so large that the amount of exhaust in our data streams and sources is exponentially greater than ever before. Even though exhaust data has a low value density, the mass of big data is mountainous enough that the weight of its collective value is beginning to equal (and possibly surpass) that of the high value density data on which traditional BI and analytics have focused.
For example, consider a convenience store chain that implements a customer loyalty program offering reward incentives on fuel and convenience retail items. A conventional BI approach would integrate the sales and discount data into a warehouse model, aggregated by the dimensions inherent in the reporting requirements and the point-of-sale system. Certainly an analyst could draw insights on the effectiveness of the program by tracking sales over time.
However, there is a wash of data being blown out the BI exhaust pipe. The customer loyalty participants are actually generating a "stream" of big data: retail product choices, product pairings, day of the week, time of day, geo-location distance from the home address, fuel volume, and other data that is too big or too "messy" for a typical BI reporting environment. All of this data could be explored to discover patterns and behaviors that the aggregated totals in a data warehouse would overlook.
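To make that concrete, here is a minimal sketch of how such exploration might look on a Hadoop cluster using Hadoop Streaming, with the mapper written in Python. The input layout (tab-delimited lines of timestamp, loyalty member ID, and a comma-separated basket of items) is a hypothetical loyalty-log format, not anything prescribed above; the job counts product pairings by day of week and hour of day.

```python
#!/usr/bin/env python
# mapper.py -- a minimal Hadoop Streaming mapper over a hypothetical input
# format: "timestamp<TAB>member_id<TAB>item1,item2,...".
# Emits one "day|hour|itemA+itemB<TAB>1" record per product pairing.
import sys
from datetime import datetime
from itertools import combinations

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        continue  # skip malformed "exhaust" lines rather than failing the job
    timestamp, member_id, basket = fields[0], fields[1], fields[2]
    try:
        ts = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        continue
    bucket = "%s|%02d" % (ts.strftime("%a"), ts.hour)  # e.g. "Sat|17"
    for a, b in combinations(sorted(basket.split(",")), 2):
        print("%s|%s+%s\t1" % (bucket, a, b))
```

The companion reducer would simply sum the counts per key, and the pair of scripts would be submitted with the standard Hadoop Streaming jar against the raw log files sitting in HDFS. Nothing about those files has to be modeled in advance; a question about distance from home or fuel volume would just be a different mapper over the same raw data.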
Hadoop offers an alternative to burning off our data exhaust. Given inexpensive storage and the massive scale to which Hadoop can grow, we can collect low value density data and extract its value through data discovery. Hadoop grants us the flexibility of relaxed schemas, so value can be extracted when the data are read (schema-on-read), versus a traditional RDBMS, where high value data must be shoved into a rigid schema at write time.
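As a sketch of what schema-on-read can look like in practice (assuming Spark running on the Hadoop cluster, with hypothetical file paths and field names), the raw loyalty events can be landed in HDFS as-is and given structure only at the moment a question is asked:

```python
# A minimal schema-on-read sketch using PySpark on a Hadoop cluster.
# The HDFS path and field names (event_time, fuel_volume) are assumptions
# for illustration, not a schema anyone designed before landing the data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("loyalty-discovery").getOrCreate()

# Raw loyalty events sit in HDFS as JSON lines; the schema is inferred
# when the data are read, not imposed at write time.
events = spark.read.json("hdfs:///data/raw/loyalty/*.json")

# Impose only the structure this particular question needs, at read time:
# average fuel volume by hour of day.
(events
    .selectExpr("hour(event_time) AS hour", "fuel_volume")
    .groupBy("hour")
    .avg("fuel_volume")
    .orderBy("hour")
    .show())
```

If tomorrow's question concerns geo-location or product pairings instead, the same raw files can answer it; only the read-time query changes.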
This post was brought to you by IBM for Midsize Business and opinions are my own. To read more on this topic, visit IBM’s Midsize Insider. Dedicated to providing businesses with expertise, solutions and tools that are specific to small and midsized companies, the Midsize Business program provides businesses with the materials and knowledge they need to become engines of a smarter planet.