Scale-Out Architectures: NoSQL and Hadoop
Scaling to a certain level (data, complexity, etc.) is less expensive with many smaller machines than with fewer larger machines. In the early days of computing, as systems began to tackle scaling challenges with larger, more complex machines, costs took off exponentially. A 10-terabyte system could cost not 10 times a 1-terabyte system, but nearly 100 times as much. This is scaling up.
Many systems today scale out instead by multiplying that 1-terabyte system 10 times, at 10 times the cost rather than 100. By dividing the work across a single data set spread over many machines, each pairing processing power with its own disk, scale-out systems keep scaling affordable. This scale-out provides the majority of the value proposition of NoSQL and HDFS (the Hadoop Distributed File System); the distinguishing features of the implementations provide the rest.
HDFS and NoSQL file systems focus almost exclusively on adding nodes to increase performance (scale-out), but even they require node configurations with elements of scale-up. On these systems, you need to strike a balance between hard drive count, CPU core count, and memory density. You will want to look at different CPU technologies and CPU bus speeds, data and infrastructure node memory allocations, and the type of hard drive controllers.
NoSQL systems are scale-out, file-based (HDD) systems that are making more use of memory in the nodes. Gone are the pointer system of the database management system (DBMS), the division of data into structured pages, and the de facto immediate transaction integrity.
NoSQL systems make many workloads possible: some that were previously too costly, and others, like gaming, shopping carts, online event logging, and content management systems, that were simply not possible before. NoSQL makes the most sense when the application is based on data with varying data types and the data can be stored in key-value notation.
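To make "key-value notation" concrete, here is a minimal sketch of a shopping cart in a key-value store, using a plain Python dict as a stand-in for any real NoSQL product. The keys and field names are illustrative, not from any particular system; the point is that each value is schemaless, so different carts can carry different fields and types under the same store.

```python
import json

# A minimal key-value sketch: keys are strings, values are opaque documents.
# All key names and fields here are hypothetical, for illustration only.
store = {}

def put(key, value):
    # Values are schemaless; each record can carry different fields and types.
    store[key] = json.dumps(value)

def get(key):
    raw = store.get(key)
    return json.loads(raw) if raw is not None else None

put("cart:alice", {"items": [{"sku": "A1", "qty": 2}], "coupon": "SAVE10"})
put("cart:bob", {"items": [], "gift_wrap": True})  # different fields, same store

print(get("cart:alice")["coupon"])  # SAVE10
```

Because there is no fixed schema, adding a new field to one cart requires no migration of the others, which is exactly the flexibility that suits varying data types.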
HDFS runs on a large cluster of commodity-class nodes. When a node is placed in an IP range specified by the “NameNode” (one of the necessary Java virtual machines), it becomes eligible for data storage in the file system and will thereafter report a heartbeat to the NameNode. Sharding can be used to spread the data set to nodes across data centers, potentially all across the world, if required.
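The register-then-heartbeat relationship can be modeled in a few lines. This is a toy sketch, not HDFS code: the class and method names are mine, and the staleness threshold is arbitrary (real HDFS DataNodes heartbeat over RPC on a short, configurable interval).

```python
import time

# Toy model of the NameNode tracking which storage nodes are alive.
# Names and the stale_after threshold are illustrative assumptions.
class NameNode:
    def __init__(self, stale_after=10.0):
        self.last_heartbeat = {}        # node_id -> timestamp of last report
        self.stale_after = stale_after  # seconds before a node is considered dead

    def register(self, node_id):
        # Once registered, a node is eligible to store file-system blocks.
        self.last_heartbeat[node_id] = time.time()

    def heartbeat(self, node_id):
        # Nodes report in periodically; the NameNode just records the time.
        self.last_heartbeat[node_id] = time.time()

    def live_nodes(self, now=None):
        now = now if now is not None else time.time()
        return [n for n, t in self.last_heartbeat.items()
                if now - t < self.stale_after]

nn = NameNode()
nn.register("datanode-1")
nn.register("datanode-2")
nn.heartbeat("datanode-1")
print(sorted(nn.live_nodes()))
```

The design point this illustrates: the NameNode never polls the cluster; it passively collects heartbeats and treats silence as failure, which is what lets the cluster keep growing by simply adding nodes.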
Hadoop data is not sequenced; it is stored in 64 MB to 256 MB blocks of delimited record values, with schema applied on read based on HCatalog or other means of tagging the values.
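Schema-on-read means the stored block is just delimited bytes, and column names are imposed only at query time. The sketch below shows the idea in miniature: a hard-coded column list stands in for an HCatalog table definition, and the field names and sample values are invented for illustration.

```python
import csv
import io

# Schema-on-read sketch: the stored "block" is just delimited values.
# The schema list below stands in for an HCatalog definition; column
# names and data are hypothetical.
raw_block = "101|2024-01-05|37.2\n102|2024-01-05|36.8\n"

schema = ["sensor_id", "date", "temp_c"]  # applied only when reading
reader = csv.reader(io.StringIO(raw_block), delimiter="|")
records = [dict(zip(schema, row)) for row in reader]

print(records[0]["temp_c"])  # 37.2
```

Note that swapping in a different schema list reinterprets the same stored bytes without rewriting them, which is the opposite of a DBMS, where structure is enforced on write.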
This means of data storage, and its MapReduce framework for getting the data out, make sense for analysis of sensor data, clickstream data, social data, server logs, smart grid data, electronic medical records, video, pictures, unstructured text, geolocation data, and any high-volume data that outscales the value of a DBMS.
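The MapReduce pattern mentioned above can be sketched in pure Python with the classic word-count example. This is only the single-machine shape of the algorithm; Hadoop's value is running the same three phases (map, shuffle, reduce) in parallel across the cluster's nodes.

```python
from collections import defaultdict

# Pure-Python sketch of the MapReduce pattern (word count).
# Hadoop distributes these same phases across many nodes.

def map_phase(record):
    # Map: emit (key, value) pairs; here, one (word, 1) per word.
    for word in record.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: collapse each key's values to a single result.
    return key, sum(values)

records = ["big data big cluster", "big throughput"]
pairs = [p for r in records for p in map_phase(r)]
results = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(results["big"])  # 3
```

Because each map call touches only its own record and each reduce call only its own key, the work partitions cleanly across nodes, which is why this model fits the high-volume data types listed above.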
Though there are many similarities in how the architectures are constructed, and even the use of MapReduce, NoSQL and Hadoop solutions target different workloads.
Hadoop provides high throughput but does not provide low-latency data access. It rocks for large batch appends, but small batches may be slower than you think. If your queries process terabytes of data, Hadoop is best.
I think of Hadoop as ideal for the analytical, post-operational, data-warehouse-ish type of workload. Together with a relational data warehouse, the two can form a very effective data warehouse infrastructure.
NoSQL systems do provide low-latency access and accommodate many concurrent users. If you need to insert data quickly, fetch data in real time, or handle web-scale, unpredictable concurrent usage, NoSQL will be a better solution than Hadoop. These traits put NoSQL firmly in the operational/application camp.
It is not uncommon for the two to exchange data. NoSQL is frequently the collection point for big data. The data is passed to Hadoop for analytic processing, and sometimes the results are sent back to NoSQL, or at least to an operational relational data store that can be accessed together with NoSQL, for operations.
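That round trip can be sketched end to end. Everything below is hypothetical: a dict stands in for the NoSQL collection point, a batch function stands in for the Hadoop job, and the key naming is invented. The shape of the flow (collect, batch-analyze, write results back for fast operational reads) is the point.

```python
# Illustrative sketch of the NoSQL -> Hadoop -> NoSQL exchange.
# All names are hypothetical; the dict stands in for a NoSQL store
# and batch_analyze() for a Hadoop batch job.
operational_store = {}

def collect_event(user, event):
    # NoSQL side: fast, concurrent ingest of raw events.
    operational_store.setdefault(f"events:{user}", []).append(event)

def batch_analyze():
    # "Hadoop" side: high-throughput batch pass over everything collected.
    return {key.split(":", 1)[1]: len(events)
            for key, events in operational_store.items()
            if key.startswith("events:")}

def write_back(summary):
    # Results returned to the operational side for low-latency lookups.
    for user, count in summary.items():
        operational_store[f"summary:{user}"] = count

collect_event("alice", {"page": "/home"})
collect_event("alice", {"page": "/cart"})
write_back(batch_analyze())
print(operational_store["summary:alice"])  # 2
```

The division of labor mirrors the article's point: the operational store absorbs the write load in real time, while the batch layer does the heavy analysis on its own schedule.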
This post is brought to you by SAS.