Throw Out the Garbage In, Garbage Out
I wish I had a nickel for every time I have heard “Garbage In, Garbage Out” (GIGO for the rest of this post), or some variation of it, in my career moving data from place to place.
Data warehouses and analytical databases, by definition, receive data from other systems. Unless you are dealing with a data source experienced in delivering its data to other systems, that process inevitably opens Pandora’s box. Since each system ideally, and usually, delivers its data to only one other system, if you are tapping it, you are likely doing so for the first time.
These efforts are usually underestimated. In addition to the technical hurdles, there are interpersonal issues to deal with, including justifying the need for yet another environment or database in the shop with this data. You will usually hear some variation of “Aren’t we already providing everything that is needed?”
Rather than expound on the answer to that question here, I’m going to focus on the data itself – once that question is answered.
Ultimately, to have a successful workload, the data must be at an acceptable standard of quality, and that standard differs from system to system. Operational systems have a different sense of quality than analytical systems. Operational systems are often built to take in data record by record, and they never surface the lack of attention to complete, accurate data.
For an analytical database that aims to support business decisions with high-quality data, and that will look at trends across that data, this may not be acceptable. That is why it is important to profile the source system’s data early on, in order to effectively gauge the work effort, and perhaps the feasibility, of bringing over the data.
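Source system profiling does not have to be elaborate to be useful. The sketch below, a minimal example using only the Python standard library, computes two of the most telling metrics per column: the missing-value rate and the distinct-value count. The sample rows are invented for illustration; a real profile would run over an actual source extract.

```python
from collections import Counter, defaultdict

def profile(rows):
    """Basic per-column quality metrics: percentage of
    missing values and count of distinct values."""
    total = 0
    missing = Counter()
    distinct = defaultdict(set)
    for row in rows:
        total += 1
        for col, val in row.items():
            if val is None or val.strip() == "":
                missing[col] += 1  # empty or null counts as missing
            else:
                distinct[col].add(val)
    return {
        col: {
            "missing_pct": round(100 * missing[col] / total, 1),
            "distinct": len(distinct[col]),
        }
        for col in distinct.keys() | missing.keys()
    }

# Invented sample standing in for a source extract
sample = [
    {"store_id": "101", "product": "Combo #1", "amount": "5.99"},
    {"store_id": "101", "product": "", "amount": "3.49"},
    {"store_id": "102", "product": "Combo #3", "amount": ""},
]
print(profile(sample))
```

Even a report this crude, run early, tells you whether a column you planned to trend on is one-third empty before you have committed to a timeline.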
Usually the situation is not so dire. A few tweaks can be made and the data is ready to begin its analytical journey. These tweaks may be:
- Desired by the source environment, in which case the source environment should ensure the data is entered to that standard.
- Not desired by the source environment, in which case the data should be transformed en route to the analytic store.
There is no option here for bringing in sub-standard data from the source environment. That’s why there’s a “T” in “ETL”! The source environment should not be “required” to acquiesce to the data standard of the analytic environment. Furthermore, it is very likely that significant operational system enhancement is not going to happen on your project’s timetable.
For example, in a retail situation each store can map its own keyboard. Most won’t, but some will, to take advantage of local trends. A store that sells mostly Combination #3 might change F1 from Combination #1 to Combination #3. It makes life easier at the store, but it can create a headache at corporate if corporate is not prepared to make the adjustment to the data. The store is concerned with getting paid what it is due. Headquarters is concerned with what is selling. In this case, the source doesn’t change; the analytic store will change the data.
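That kind of adjustment is a natural fit for the “T” in ETL. Here is a minimal sketch of how the load could normalize store-local key mappings into corporate product names; the store IDs, key names, and override table are all invented for illustration.

```python
# Default keyboard layout shared by all stores (hypothetical).
DEFAULT_LAYOUT = {
    "F1": "Combination #1",
    "F2": "Combination #2",
    "F3": "Combination #3",
}

# Per-store overrides, maintained by the analytic team as stores
# remap keys. Stores not listed here use the default layout.
STORE_OVERRIDES = {
    "store_042": {"F1": "Combination #3"},  # remapped F1 to its best seller
}

def translate(store_id, key):
    """Resolve a raw keypress from a store's sales feed to the
    corporate product name, honoring any store-local remapping."""
    layout = {**DEFAULT_LAYOUT, **STORE_OVERRIDES.get(store_id, {})}
    return layout[key]

# The same keypress means different things depending on the store:
print(translate("store_001", "F1"))  # Combination #1
print(translate("store_042", "F1"))  # Combination #3
```

The design choice worth noting is that the translation table lives in the transform layer, not in the store’s system: the source keeps doing what works locally, and corporate still gets consistent product names.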
There’s also history data to deal with! Will a system have data at multiple standards in different time periods?
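When the answer is yes, the transform has to know which standard was in force when each record was created. One way to handle that is an effective-dated mapping; the code below is a sketch of the idea, with invented region codes and an invented changeover date.

```python
from datetime import date

# Hypothetical example: the source switched from one-letter to
# two-letter region codes on 2020-01-01. Each entry is
# (effective_date, mapping), newest first.
CODE_STANDARDS = [
    (date(2020, 1, 1), {"NE": "Northeast", "SE": "Southeast"}),
    (date(1900, 1, 1), {"N": "Northeast", "S": "Southeast"}),
]

def normalize_region(code, record_date):
    """Translate a raw region code using the standard that was
    in force on the date the record was created."""
    for effective, mapping in CODE_STANDARDS:
        if record_date >= effective:
            return mapping[code]
    raise ValueError(f"no standard covers {record_date}")

print(normalize_region("N", date(2019, 6, 1)))    # Northeast (old standard)
print(normalize_region("NE", date(2021, 3, 15)))  # Northeast (new standard)
```

Without something like this, a single load routine will quietly misread every record written under the older standard.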
Ultimately, the success of your analytic project will depend on the data being up to a standard, and there are tools at your disposal to make sure you can get it there. If the source data is, from your perspective, garbage, then citing the GIGO mantra as if you had no options, especially well into the project, will be your project’s demise.
These issues affect analytic projects at companies large and small.
Throw out the garbage in, garbage out.
This post was written as part of the IBM for Midsize Business program, which provides midsize businesses with the tools, expertise and solutions they need to become engines of a smarter planet. I’ve been compensated to contribute to this program, but the opinions expressed in this post are my own and don’t necessarily represent IBM’s positions, strategies or opinions.