Understanding Big Data Persistence
Relational Databases have been the dominant data persistence technology for many years. Feature-rich, mature ecosystem has been developed around them. Effectively relational databases are the default option for any enterprise applications as well as data warehouse projects. This statement holds true in present days even with proliferated NOSQL techniques. That being said, relational databases may become the bottleneck of an enterprise application with constantly growing data size and extreme performance requirement. Scalability is sacrificed for strong consistency, transaction support, normalized data model, ad hoc query, and table-join capabilities. NOSQL (with exceptions) as well as NewSQL with automated (or transparent) sharding, re-sharding, and failover, was invented for easy scale out, but falls short on some features as a price. CAP theorem dictates that it is practically impossible to have all the merits available in a distributed data storage system.
First glance at CAP theorem likely makes you feel it is simple. According to wiki, it is impossible to simultaneously provide all three guarantees of C-Consistency, A-Availability, and P-Partition Tolerance to a clustered system. While, it turned out that the definition of the three guarantees are intricate and inconsistent across different resources. Cloudera made a good effort in describing the CAP theorem. But the author believes that the best interpretation comes out of the book "NoSQL Distilled".
Consistency means all nodes in a cluster see the same data at the same time. Availability has a particular meaning in the context of CAP, every request received by a non-failing node in the system must result in a response. If the response is undetermined to a running node, it is not available. A failed, unresponsive node doesn’t infer a lack of CAP availability. Partition tolerance means that the cluster can survive communication breakages that separate the cluster into multiple partitions, which are unable to communicate with each other. At least one partition should remain functioning and accessible to the clients. Bear in mind this network partition is accidental, destructive, and undesired.
CAP theorem states that only two out of the three guarantees can be achieved at any moment. To get a better insight, the author quotes a small section from the book blended with interpretations from his own perspectives.
First consider a CA (consistency and availability) system. A single-server system is a CA system with consistency and availability but not partition tolerance. Apparently, a single machine can’t partition. If it’s up, it’s available. When it is down, a failed, unresponsive node doesn’t infer a lack of CAP availability. Relational Databases are in general a CA system.
It is theoretically possible to have a CA cluster. However, this would mean that in event of network partitions, all the nodes in the cluster would go down so that no client can talk to a node. Why? Servers under different partitions can't communicate with each other. Copies of the same data under different partitions would be out of synch when the servers keep serving read/write requests. That breaks consistency. If you choose to make all nodes ready-only, which preserves consistency, it impairs availability because the nodes reject all write requests. As a result, we have to bring all the nodes offline. Note a failed, unresponsive node doesn’t infer a lack of CAP availability. Since the entire cluster is out of service due to a network partition, it is to say that the cluster has no partition tolerance. This indeed defeats the purpose of having a cluster.
Evidently partition tolerance is a must for a clustered system. The choice is narrowed down to CP and AP, meaning that you trade between consistency and availability. An example should make it clear. Martin and Pramod are both trying to book the last hotel room on a system that uses peer-to-peer distribution with two nodes, London for Martin and Mumbai for Pramod, assuming a multiple data center topology. (The two nodes could reside in the same data center say in New York.) If we want to ensure consistency (CP), when Martin tries to book his room on the London node, that node must communicate with the Mumbai node before confirming the booking. Essentially, both nodes must agree on the serialization of their requests. But should the network link break between the two data centers (or just the two nodes in the New York data center), then neither system can book any hotel room, sacrificing availability. The status of the room may still be available as read only so long as the record remains in synch in both partitions – impaired availability. In a master-slave topology, the master node can still book the room, while the slave either is taken offline (weakens partition tolerance) or rejects the request (impairs availability) since its copy of data is no longer up-to-date until the network connection is resumed.
If we choose availability (AP), both partitions would allow their user to book the last hotel room in the peer-to-peer topology. That leads to data inconsistency, copies of the same record under different network partitions are out of synch. It is described as a write-write conflict to be resolved once the network connection is resumed. A common solution is optimistic locking, in which the first write commits and the second fails by comparing a timestamp value (or version count) on different copies of the same record. In reality, some hotels always keep a few rooms clear even when they are fully booked, in order to be able to swap a guest out of a room with problems or to accommodate a double-booking. Another example is Amazon shopping cart. In this case you are always allowed to write to your shopping cart, even if network failures mean you end up with multiple shopping carts under different partitions. The checkout process can merge the multiple shopping carts by putting the union of the items from the carts into a single cart and returning that. In the meantime, a master-slave topology favoring AP remains with a writable master node and a read-only slave node(s). The master node is always up-to-date, while the slave may return obsolete data, sacrificing consistency. Eventually, of course, the updates will propagate across the network, and data on different nodes will be consistent, generally referred as Eventually Consistency.
There is no definitive interpretation of CAP theorem so far. After all, it is still a little debatable. According to the description of this article, Apache Cassandra possesses a peer-to-peer topology, and is highly tunable from CP to AP through the configurable READ and WRITE consistency levels per each database operation. MongoDB has master-slave architecture, and both reads and writes go to the master by default, which suggests a CP system. When secondaries (slaves) are allowed to serve reads, it becomes AP as the secondaries may return out-of-date results. Apache HBase is practically a peer-to-peer architecture with peers the region servers, considering HBase Master does very little concrete work in the cluster. HBase is believed a CP system in that each data region is served by a single region server for both reads and writes, which guarantees consistency. On the other hand, it used to bear a weak form of Partition Tolerance due to Name Node the single point of failure (SPOF). A partition losing contact with the Name Node is out-of-service. HA (high availability) NameNodes introduced in Hadoop 2.0.0 mitigates the situation by running multiple Name Nodes in the same cluster. Availability falls short in that should a region server fail, clients will have data inaccessible for the time till the region comes up on some other server. This failover process is not instantaneous.
A general principle in database scaling is that replication helps on read performance as all the replicas can serve read requests, little help on write performance (avoiding write-write conflicts). On the other hand, sharding/partitioning improves both read and write scaling, since reads and writes are spread out to different nodes hosting different data shards.
As we’ve discussed, relational databases alone are not intended to run in a cluster. Client-side sharding approach for scaling is very involving and complex in terms of programming, configuration, and maintenance. Besides, we lose table joins, referential integrity (foreign keys), and transactions that cross shards. People were motivated to design brand new, innovative database systems. NOSQL databases are introduced to tackle the spots not being well-addressed by relational databases, for instance, extreme capacity and performance, unstructured or semi-structured data, continuously evolving schema, and complex relationship graphs.
There are 4 types of NOSQL databases, namely, column family-oriented databases, key-value databases, document databases, and graph databases, among which column-family databases are well-tuned for big data because of the automated and efficient sharding/re-sharding mechanisms.
Column Family-Oriented NOSQL
Apache HBase is a column family-oriented, schema-free database modeled after Google’s BigTable, the database behind the Google web search engine. HBase is indeed a Hadoop database running on top of HDFS (Hadoop Distributed File System). Schema-free means no database constraints, no foreign key relationships (referential integrity), rows are no longer lined up in a table. Thinking of a column family as a table in SQL databases, different rows under the same table can have a different number of columns, as well as different column names. Note schema-free doesn't infer disorganized data; rather the data model is elastic to accommodate ongoing changes. Disorganized data only leads to chaos. Multi-row transaction is not supported, while, an operation on any single row is atomic. There are two fashions in storing data in HBase, tall table versus wide table. For instance, when storing customer orders, with a tall table, you save each order as a new row. However, in a wide table you save each customer as a row, and orders are columns in that row. Obviously wide tables are highly de-normalized as opposite to tall tables. Data belonging to the same column family is partitioned into regions and saved to region servers in the HBase cluster. HBase supports indexed search by row key, row key range query with filters, full table scan with filters. Note filters do not improve query performance as they are applied after the fact. There is no table-join support in HBase, which implies join has to be done on the client side. In wide tables, HBase does not index column names in contradiction to Cassandra, which has column names sorted. To enable ad-hoc type queries, secondary indexes are to be created on columns at the client side either through MapReduce jobs (batch) or through HBase Observer Coprocessors (real-time). The most important thing in designing HBase tables is the row key. HBase works well if you prepare the data for the queries you need - a query driven table design approach, but not for general search. HBase has the best integration with Hadoop since both run on HDFS to take full advantage of data locality. For example, HBase enables large map-side joins for Hadoop.
Google web search engine is a use case of column-family databases. Google runs web crawlers to collect all the web pages on the internet and store their data in BigTable. A MapReduce process makes use of BigTable as the data source to produce the search indexes. When a user submits a search request, a web search application queries the search indexes and retrieves the matching document from the BigTable to display to the user. The open source counterpart, Apache Nutch project has a storage abstraction via Apache Gora for big data stores such as HBase, Cassandra, Apache Accumulo (a distributed key-value store), Apache Avro, as well as various SQL stores.
Apache Cassandra on the other hand is modeled after Amazon DynamoDB. With a decentralized, peer-to-peer topology, Cassandra is highly scalable and available. Internally, Cassandra is implemented as staged event driven architecture (SEDA, like in Spring Integration), each stage in a work is resilient to scale up or down independently. Tunable consistency level on individual queries gives developers choices between performance and consistency. A high consistency level makes Cassandra behave more like a relational database, while a low consistency level boosts read (or write) performance and throughput, but may leave you some data conflicts to solve. The latest release of Cassandra provides native support on secondary indexes. The primary index in Cassandra allows fast lookup of rows by their row key. Each node in a Cassandra cluster knows the range of the keys it manages, requested rows can be efficiently located by scanning the row indexes only on the relevant nodes. In HBase, this job is taken care by ZooKeeper. Cassandra automatically writes a row to the proper node depending on the key regardless to which node a client connects. In addition to the query types supported by HBase, Cassandra offers pageable slice queries, where given a wide row, users can retrieve a subset of the columns falling within a given column name range (with column names sorted in a column family). Apart from many discrepancies, Cassandra and HBase share the same design principle - carefully planning your queries before designing the database.
A few new projects are stemmed from Cassandra, for instance, Brisk (running Hadoop on Cassandra File System), Lucandra (using Cassandra as the underlying data storage for Lucene). With Cages - a distributed synchronization library in Java, multi-row transactions can be simulated in Cassandra.
Amazon SimpleDB, Apache Accumulo, Riak, Redis, and Kyoto Cabinet are typical key value stores with extremely high performance. The primary data access pattern is by key, while search by indexed metadata is often supported. They do not in general support join or ad-hoc queries.
Document databases are essentially key-value stores, of which the value is XML, JSON types of documents. In regular key value stores, the value is just a piece of binary data, which isn't queryable. While the rich, hierarchical documents in a document database are visible and queryable from the perspective of the database. MongoDB, CouchDB, and RavenDB are popular document databases.
MongoDB is fully capable in querying and manipulating JSON/BSON (Binary JSON) documents. Ad-hoc query and indexing features make you feel you are working with a regular SQL database, even without join or transaction support. Replica sets in a MongoDB cluster provide data redundancy, automated failover, and can distribute read loads across machines. MongoDB has been designed to scale horizontally with automated sharding and re-sharding, although re-sharding is not quite optimized. Unblocked, asynchronous write mode improves write performance if you could afford occasional data loss on less-valued data sets like server logs. MongoDB runs blazing fast when it works like an in-memory database. At minimum, you need to make sure that all indexes fit in RAM. Ideally, both indexes and a working data set fit in RAM. A working set is the subset of total data commonly queried and updated. When you can’t fit the indexes in the physical RAM of one machine, that is the moment you need to shard the database. With optimized configurations, MongoDB could outperform relational databases in orders of magnitude under a similar hardware spec.
A graph database like Noe4j and InfiniteGraph offers a normalized data model with nodes (vertices), edges (relationships), and properties that makes it better at handling data with complex relationships, for instance, user relationships in a social networking application. Query performance in a relational database degrades significantly with the increasing number of joins. Normalized relational schema trades query efficiency for data efficiency. Graph databases are built to serve joins (hops in a graph). Each hop is executed with a constant speed irrelevant to the total size of the graph. Query performance is only related to the number of hops despite the growing size of the database. The relationship-oriented data structure makes Neo4j difficult to shard and scale out. On the other hand, it enables full JTA transaction support. Lucene is the default indexing engine for Neo4j; in addition, Lucene empowers Neo4j on full-text search. Boosted by Spring Data, Neo4j resembles Hibernate in various aspects. A POJO data model with a few annotations abstracts the native Neo4j API. Full-text search resembling Hibernate Search takes Lucene query strings. The Open Session In View pattern is also available for on-demand, lazy data access. Apache Giraph (modeled on Google Pregel graph processing framework) along with Gremlin (a Groovy based graph traversal language) enables complex graph analysis on Neo4j-like graph databases.
Where to Use
Column family databases are built for big data OLTP (OnLine Transaction Processing) with extreme capacity, performance, and throughput. Key-value stores are for shopping carts, user profiles, session implementations, caching libraries, while not for ad-hoc queries, general OLTP. Document databases with ad-hoc query capability are for general OLTP, such as ecommerce applications. Graph databases are for social networking, collaborative filtering/recommendation engines. Except graph, most NOSQL databases don't support table joins nor transactions. That effectively pushes join and transaction executions to the client application. It gives extra work to the application developers. In general, NOSQL databases alone are not designed for OLAP (OnLine Analytical Processing) workloads. But this doesn't prevent NOSQL being the data source of a big data processing/analytics system, for example, the collaboration between Hadoop and HBase, in situ data sources in fast/real-time big data processing with tools such as Google Dremel, Cloudera Impala, and Apache Drill (discussed in the second article of this series).
Yahoo has the largest Hadoop cluster worldwide. Hence, they are the largest client and proponent of HBase. Even though Cassandra was incubated in Facebook, ironically customized MySQL clusters remain the primary OLTP with Facebook. Likewise, Twitter uses Gizzard (MySQL add-on) as the primary storage for tweets and user profiles. Netflix has perhaps the largest production cluster of Cassandra.
NewSQL is a class of relational database management systems that provides comparable scalability and performance of NOSQL systems for OLTP workloads while still supporting SQL syntax and maintaining the ACID guarantees. Supporting SQL and transaction is a huge benefit to application developers. While CAP theorem still applies to a NewSQL database cluster, consequently, there are restrictions on certain SQL operations (e.g. distributed joins across-shards), and some performance penalties. MySQL Cluster NDB, Gizzard from Twitter, ScaleBase, ScaleDB, NuoDB, Tokutek, VoltDB, OceanBase, Amazon Relational Database Service (RDS), and Microsoft SQL Azure are regarded NewSQL databases, among which many are MySQL add-ons.
Data persistence technologies are not limited to Relational, NOSQL, and NewSQL. There are Data Grids, File Systems, Content Management Systems (CMS), Version Control Systems, as well as Object databases. NOSQL and NewSQL are big data stores for structured or semi-structured data. For unstructured big data such as music, photos, and videos, CDN (content distribution network) offers a distributed, scalable, and highly available solution. Due to the massive size of each record, network latency can be very significant in delivering the content to a user. CDN evaluates a user's request primarily based on his geographic location, and then elects a node in a data center with the best proximity to serve that user.
When data and system grow big, there won't be a one-size-fits-all data-persistence solution. Developers need to carefully evaluate the requirement and select the most suited approach for different aspects of an enterprise application. Such a blended database solution is named as Polyglot Persistence. With an SOA (Service-Oriented Architecture) mindset, developers can wrap the persistence logic around each data store as a reusable data service to be integrated via the infrastructure of SOA.
Real Life Challenges
Relational databases have been the de facto standard persistence technology in JavaEE for many years. A rigid and well-defined schema carries enormous information on business requirement, such that a RAD tool (like Spring Roo, SpringFuse, even IDE plugins) can generate a decent application barebone covering presentation, service, and persistence layers through database introspection like a no brainer. JPA binds tables and relationships to a POJO data model, which is friendly to all programming purposes including web. Database constraints are automatically mapped to validators in the presentation layer. All these bring developers a tremendous relief on their intensive coding job. NOSQL is gradually picking up in that regards. For instance, MongoDB and Neo4j are supported database types in Spring Roo today. Nevertheless, the schema-less nature of the column-family databases make them difficult to support by RAD tools.
Programming NOSQL and NewSQL needs more skilled developers than relational. Managing them in a cluster is a big challenge to system administrators. In case any outstanding issue occurs, developers often have to take the ownership to get things straightened out. It could be more stressful and hectic when you are required to keep the application running during the process of troubleshooting, bug fixing, or data migration.
Although losing a lot of ground to NOSQL and NewSQL, relational databases should remain the default option for any data persistence and analysis requirement. When relational databases can't meet your extreme performance and scalability expectations, you start adding NOSQL and NewSQL to your checklist.
This article covers new data persistence technologies including NOSQL and NewSQL databases. CAP theorem is clarified with examples. Different data access patterns and good practices are discussed. The bottom line is that relational databases should remain the default option for any data persistence and analysis requirement. NOSQL and NewSQL come into the picture when relational databases fall short to meet performance and scalability expectations.
About the Author
Xinyu Liu, as a Sun Microsystems certified enterprise architect, Xinyu Liu has intensive software design and development experience with cutting-edge server-side technologies. He took his graduate degree from George Washington University and currently serves as the application architect for Virginia Workers' Compensation Commission. Dr. Liu has written for Java.net, JavaWorld.com, and IBM developerWorks on topics such as JSF, Spring Security, Hibernate Search, Spring Web Flow, the Servlet 3.0 specification, and Drools Rule Engine. He also has worked for Packt Publishing reviewing the books Spring Web Flow 2 Web Development, Grails 1.1 Web Application Development, and Application Development for IBM WebSphere Process Server 7 and Enterprise Service Bus 7.
- Clustering for High Availability (HA) with JBoss AS7: Online presentation from JBoss.
- Hibernate Shards: Simplifying Hibernate programming with sharded databases.
- Google MapReduce: Simplified data processing on large clusters.
- Google BigTable: A distributed storage system for managing structured data.
- Apache Hadoop: A reliable, scalable, distributed computing framework.
- Apache HBase: A distributed, scalable, big data store.
- Apache Cassandra: A distributed, scalable, big data store.
- MongoDB: A JSON document database.
- Neo4j: A graph database.
- Twitter Storm: A free and open source distributed realtime computation system.
- Apache S4: A general-purpose, distributed, and scalable data stream processing system.
- Google Dremel: Interactive analysis of web-scale datasets.
- Cloudera Impala: Real-time queries in Apache Hadoop.
- Apache Drill: Provide low latency ad-hoc queries to many different data sources.
- Big Data: Introduce the Lambda Architecture.
- Kundera: JPA 1.0 ORM library for the Cassandra/Hbase/MongoDB database.
- Apache Gora: An in-memory data model and persistence for big data.
- Spring Data: Modern data access for enterprise Java.
- Spring XD: A unified, distributed, and extensible system for data ingestion, real time analytics, batch processing, and data export.
- Spring Social: Connect your applications with SaaS providers such as Facebook and Twitter.
- Hadoop in Action: A book on Hadoop.
- Hadoop in Practice: A book on Hadoop ecosystem.
- HBase in Action: A book on HBase.
- MongoDB in Action: A book on MongoDB.
- NoSQL Distilled: A brief guide to the emerging world of polyglot persistence.