An Environment Ready for Hadoop
I recently spoke with a company that had adopted Hadoop. Its story carries implications for companies of all sizes, even SMBs, as they think about their future use of this technology. What may be surprising is how pedestrian the workload this company moved to Hadoop is: one that most companies have.
Operating in a highly competitive environment, the company sees keeping customers as critical to its success. It costs roughly ten times as much to secure a new customer as it does to retain an existing one. With average annual industry churn of 10%, a company that can move the needle on retention will see revenues soar.
Naturally, extensive measures have been taken over the years to understand the cycle of attrition and safeguard the customer base. The environment for Retention Processing (the set of operations undertaken to reduce customer attrition) has been at the leading edge of relational database technologies for years.
But as the company's analytic models expanded over the years, processing cycles became a constraint, one that was evidently growing more restrictive by the day. The environmental constraints forced the researchers to scale back their analysis, schedule model execution for off-hours, run models infrequently, and work on subsets rather than the full data. They were unable to innovate at the speed they needed to.
By focusing on a subset of the data, the models suffered from sampling bias and were fitting to noise rather than generalizing to future data.
They were unable to run their Retention Processing on the entire data set in the relational data warehouse. Processing customer data from just one of the 50 US states took 9.5 hours. The environment simply was not performant enough to keep up with data sizes encompassing information from all 50 states, or with the analytic complexity demanded by their data scientists and users. For a company whose core business is the ability to predict, this required serious attention and out-of-the-box thinking.
One thing being compromised was the automated cleansing and transformation of data. These steps are not optional, and since the corporate warehouse could not keep up with them at the growing scale of the business, the data scientists were left spending 80% of their time on these activities instead of doing actual analysis.
Predictive modeling was reduced to focusing on the 10% of customers determined to be most likely to leave in the short term. That 10% threshold reflected the average attrition following a rate increase, but it increasingly felt arbitrary to the data scientists: the cutoff should vary with the specific increase, matched to customer attributes and other data, including service interactions and external/public data. The goal is to “understand everything we can about the customer.”
To make matters worse, the data under management was doubling every six months.
There was also a desire to create profiles for various forms of outreach depending on the circumstances, and to produce prioritized, not just flat, lists of customers. Simply prioritizing outreach in the legacy way, using core customer attributes, was not enough. Reaching these progressive levels with the current environment was unthinkable.
Naturally, additional resources (hardware, software) kept being added to the legacy environment; in fact, there seemed to be “no end” to that activity.
Rather than continue to sacrifice productive Retention Processing cycles, and budget, they embarked on a path to investigate what Hadoop could offer. Since moving to Hadoop, their Retention Processing program is three times as effective at retaining customers, driving many millions of dollars annually to the bottom line on an investment of less than one million dollars.
While many Hadoop success stories have focused on online marketing and technology companies with at least tens of terabytes of data to manage, Hadoop offered this company in a traditional vertical a tremendous advantage. What about yours?
This post is brought to you by Cloudera.