Analytic Database Extensions
In my last post, I looked at trends in database management systems and the divide between operational and analytical systems. Now I want to look at the top three extensions that add analytic capabilities to a modern database: columnar orientation, in-database analytics, and in-memory processing. These are features that should be used in almost every analytic database, at companies of every size.
The first extension is columnar orientation. In a columnar database, the administrator controls which columns are physically "grouped" together. Neither the SQL nor the logical design of the database changes, and databases with this capability are still considered relational databases.
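As a sketch of what this looks like in practice, the DDL below declares a hypothetical sales table with column-wise storage. The ORGANIZE BY COLUMN clause follows DB2 10.5 (BLU Acceleration) syntax; other products use different clauses, but in every case the queries against the table are unchanged.

```sql
-- Hypothetical table stored column-wise; the ORGANIZE BY COLUMN clause
-- is DB2 10.5 (BLU)-style and varies by product. Queries do not change.
CREATE TABLE sales (
  sale_id     BIGINT       NOT NULL,
  sale_date   DATE         NOT NULL,
  store_id    INTEGER      NOT NULL,
  product_id  INTEGER      NOT NULL,
  sale_amount DECIMAL(9,2) NOT NULL
) ORGANIZE BY COLUMN;
```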
The pointer system within the data pages that helps the data manager navigate a page can be eliminated, since most columns are fixed-length and therefore reside at fixed offsets within the page. Compression possibilities also expand dramatically with a columnar orientation because of the increased likelihood of repeating values. For example, the DBMS can store the range of rows a value occupies instead of repeating the value for each row.
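To see why repeating values compress so well, consider a run-length view of a column. The standard SQL below is illustrative only, and assumes the hypothetical sales table above is laid out sorted by store_id: it derives the row ranges an engine could store in place of the repeated values themselves.

```sql
-- Illustrative: compute (value, first_row, last_row, run_length), the same
-- information a run-length-encoded column store keeps for a sorted column.
SELECT store_id,
       MIN(rn)  AS first_row,
       MAX(rn)  AS last_row,
       COUNT(*) AS run_length
FROM (SELECT store_id,
             ROW_NUMBER() OVER (ORDER BY store_id) AS rn
      FROM sales) numbered
GROUP BY store_id
ORDER BY first_row;
```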
Queries with high column selectivity (those that reference a small percentage of the record's overall bytes in the SELECT and WHERE clauses) perform better in a column-oriented DBMS, and most analytic workloads fit this profile.
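For example, the query below (against the hypothetical sales table) references only three of the record's columns; a column store reads just those columns from disk, while a row store must read every page holding the full records.

```sql
-- High column selectivity: only 3 columns are referenced, so a column
-- store scans a small fraction of the table's total bytes.
SELECT store_id,
       SUM(sale_amount) AS revenue
FROM   sales
WHERE  sale_date >= DATE '2013-01-01'
GROUP  BY store_id;
```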
The second extension is in-database analytics, which moves the processing to the data rather than the data to the processing, an architectural change that speeds up the analytics and pays off most when large volumes of data are involved. By moving processing into the DBMS, these "embedded" analytics become available through SQL and other DBMS interfaces.
With in-database analytics, calculations are managed and executed in one central location and are available to any user, regardless of location, application, or user interface. End users get the performance and robust functionality of in-database analytics without needing to be aware of how the processing is done.
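As a minimal sketch of moving the processing to the data: the SQL:2003 regression aggregates REGR_SLOPE and REGR_INTERCEPT, supported by most major DBMSs, fit a revenue trend entirely inside the database, so only two numbers travel to the client instead of every detail row. The table and columns here are hypothetical.

```sql
-- Fit a simple linear trend in-database with standard SQL aggregates;
-- only the slope and intercept leave the server. Hypothetical table.
SELECT REGR_SLOPE(monthly_revenue, month_number)     AS trend_slope,
       REGR_INTERCEPT(monthly_revenue, month_number) AS trend_intercept
FROM   store_monthly_revenue;
```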
To deliver in-database analytics, the DBMS needs to be extended with user-defined functions, preferably ones that expose R, the open-source data science language, which in-database implementations can run in parallel, close to the data.
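As one hedged example of such an extension, PostgreSQL's PL/R add-on registers an R function as an ordinary SQL UDF; the DDL below follows that convention (other products wire R in differently), and the function itself is hypothetical.

```sql
-- Hedged sketch using the PL/R convention: the function body is R code,
-- and unnamed arguments arrive in R as arg1, arg2, and so on.
CREATE FUNCTION r_median(float8[])
RETURNS float8 AS
$$ median(arg1) $$
LANGUAGE 'plr';

-- Once registered, the R routine is plain SQL to every user and tool:
SELECT store_id,
       r_median(ARRAY_AGG(CAST(sale_amount AS float8))) AS median_sale
FROM   sales
GROUP  BY store_id;
```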
The third extension is in-memory processing. Main memory, the current fastest commercial medium for data storage, is thousands of times faster than hard disk drives (HDDs). It is also more expensive, or it would be more pervasive; however, a lower price point has created an in-memory hardware renaissance.
With multi-core processors that truly maximize the value of the hardware, and with the price of memory dropping precipitously, we can now hold more data in memory than we have for quite some time, though still only a small slice of it. Next-generation CPUs will have thousands of cores, expanding the addressable memory tremendously.
In-memory capabilities will be the corporate standard for the near future, especially as an answer to traditional databases, where disk I/O is the undisputed weak link and bottleneck; in-memory systems eliminate that disk I/O. Access to a database in main memory can be up to 10,000 times faster than access from storage drives. Near-future blade servers will carry up to 500 gigabytes of RAM, and systems are already being sold with up to 50 terabytes of main memory. Compression techniques (e.g., those enabled by a columnar orientation) can effectively multiply that capacity 10 to 20 times, so 50 terabytes of RAM can hold what would otherwise be 500 terabytes to a petabyte of data.
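To make that concrete, here is one hedged sketch of declaring a table that lives entirely in main memory. The WITH clause follows SQL Server's memory-optimized-table syntax; DB2 BLU, SAP HANA, and others expose in-memory placement through their own, equally product-specific DDL.

```sql
-- Hedged, product-specific sketch (SQL Server memory-optimized syntax):
-- the table's rows live in RAM; DURABILITY governs whether they are logged.
CREATE TABLE hot_sales (
  sale_id     BIGINT       NOT NULL PRIMARY KEY NONCLUSTERED,
  sale_date   DATE         NOT NULL,
  store_id    INTEGER      NOT NULL,
  sale_amount DECIMAL(9,2) NOT NULL
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
```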
This post was written as part of the IBM for Midsize Business program, which provides midsize businesses with the tools, expertise and solutions they need to become engines of a smarter planet. I’ve been compensated to contribute to this program, but the opinions expressed in this post are my own and don’t necessarily represent IBM’s positions, strategies or opinions.