The Practice of Curation in Modern Data Architecture
Last week, I had the privilege of attending the Teradata Influencers Summit #TD3PI. Teradata does a tremendous job bringing together their best with the best from the information management analyst community. One of my favorite sessions was “Analytical Architecture Foundation” led by Mark Madsen, leader of global analytics architecture at Teradata.
In his presentation, Mark laid out a model of the current state of complexity in modern information architecture. I have reproduced the drawing he made on a virtual whiteboard, with a few nuances of my own.
One will recognize many familiar faces on this diagram: data sources, data consumption, and the analytical artifacts we build to meet the needs of our business decision makers. Most of the information needs and uses of the modern organization are well represented.
However, it’s very easy to spot the flaw in this architecture pattern: silos. Ah, yes, “silo” is an offensive four-letter word in the data community, because each system gets built vertically, virtually isolated from the rest of the information ecosystem. We are all exhaustively aware of the pain silos cause, so I need not discuss it further.
Next, Mark showed how we attempted to solve this architectural challenge with more architecture. Organizations began digging a moat for data around their monolithic data warehouses, or replacing them outright, with another four-letter data word: the data lake. “Put all your data in one place where everyone and everything can access it.” The problem is that, without proper management of the lake’s inflows and growth, it spilled over, flooded fertile data soil, and became the marshy data swamp. Worse still, the engine running the lake could never compete performance-wise with modern analytical and data warehouse engines.
Don’t mistake me, the data lake has its applications and uses, but it neither is nor ever was going to be the end-all.
Mark offered an approach to architecture he called the 3 C’s—data collection, curation, and consumption. Collection (the data sources) and consumption (BI and analytics) are nothing new, but the curation component intrigued me.
Wikipedia defines data curation as “a broad term used to indicate processes and activities related to the organization and integration of data.” Some might say, “We do that already. We have a data governance program and a BI semantic layer.” While I applaud those efforts, I want to think of data curation from a broader perspective.
To me, data curation is more than activities, technologies, and architectures. For example, data governance is an organization of people who come together to make data decisions, define data, and solve data issues, but by itself it’s not data curation. A BI semantic layer provides a layer of abstraction, but, again, by itself it’s not curation. I could go on with a long list of activities and technologies (taxonomies, data dictionaries, retention policies, Chief Data Officers, stewardship, master data management, and so forth), all of which fall to some degree under the umbrella of data curation.
Instead, I think that data curation is a strategy and a practice as it relates to data, applications, people, architecture, and sources. We envision a data world where data from all sources is defined, organized, de-siloed, timely, consistent, and reliable; where it is available to all authorized users, applications, analytical tools, and any imaginable use case; and where it brings answers to business questions across the organization. I propose that we, as an information management community of practitioners, develop this practice and the strategies to bring it to fruition.
Data curation enables us to leverage the architecture already in place: the fit-for-purpose tools and engines like MPP data warehouses, advanced analytical engines, and yes, even Excel.
Guess what? Data curation is nothing new. The practice to accomplish this has been under development for thousands of years. It’s called a library. Librarianship and information science have too often been forgotten as the data race has reached petabyte scale and technology has learned to perform at astronomical speeds, and we are paying the price today with much data pain.