For several decades, enterprise IT applications focused almost exclusively on structured data, which follow a predefined model and are organized in a well-defined form that makes them easy to store, query, process and manage. Indeed, the vast majority of transactional enterprise applications, such as Customer Relationship Management (CRM) and Enterprise Resource Planning (ERP) systems, handle large amounts of structured data very effectively through popular Relational Database Management Systems (RDBMS) and the Structured Query Language (SQL).
In recent years, however, we have witnessed the emergence of new applications that also deal with large volumes of unstructured data, such as flat binary files containing text, video and audio. This is, for example, the case with social media data and with data used in multi-sensor and Internet-of-Things (IoT) applications. Unstructured data are not necessarily devoid of structure, as they may carry an encoding structure and associated metadata. Nevertheless, they need a totally different approach to storage and handling, which led to the emergence of an entirely new class of databases called NoSQL databases. NoSQL databases are used to handle the vast amounts of unstructured data that are nowadays part of trending Big Data applications.
NoSQL Database Types
NoSQL databases have no strict schema requirements and do not necessarily guarantee the Atomicity, Consistency, Isolation and Durability (ACID) properties of RDBMS. Rather, NoSQL systems tend to trade strict consistency for high availability while adhering to the “BASE properties”, which are outlined below:
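- Basically Available, i.e. the system keeps responding to requests even when some nodes have failed.
- Soft State, i.e. the state may change over time, even without new input, as updates propagate.
- Eventual Consistency, i.e. once updates stop, all replicas will eventually converge to the same state.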
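To make eventual consistency more concrete, the following is a minimal sketch rather than how any particular NoSQL product works. The `Replica` class, the `sync` function and the last-writer-wins rule based on a version number are all assumptions made for illustration. It shows a write accepted by one replica, a stale read on another, and convergence after a synchronization pass.

```python
class Replica:
    """A toy replica holding key -> (version, value) pairs (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, version):
        # Last-writer-wins: keep only the entry with the highest version.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(a, b):
    """Exchange entries in both directions so the two replicas converge."""
    for key, (version, value) in list(a.data.items()):
        b.write(key, value, version)
    for key, (version, value) in list(b.data.items()):
        a.write(key, value, version)


r1, r2 = Replica("r1"), Replica("r2")

# The write is accepted by r1 immediately (basically available) ...
r1.write("cart:42", ["book"], version=1)

# ... but r2 has not seen it yet, so a read there is stale (soft state).
stale = r2.read("cart:42")

# After a synchronization pass both replicas agree (eventual consistency).
sync(r1, r2)
fresh = r2.read("cart:42")
```

Real systems use more elaborate conflict resolution, but the trade-off is the same: the write is never blocked while it waits for every replica to agree.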
Different types of NoSQL databases meet these properties, each developed to serve particular purposes and applications. The most prominent types include:
- Document Stores: These are NoSQL databases that store documents made up of tagged elements. Documents are kept in a standard format or encoding such as XML or JSON, while binary formats such as PDF are usually handled as Binary Large Objects (BLOBs). Moreover, documents are usually indexed, which enables document stores to outperform conventional file systems in document-oriented operations. MongoDB and CouchDB are two popular document stores.
- Graph Databases: In these databases, data are represented as nodes (vertices) and edges, i.e. in graph form. Graph databases are suited to applications involving graph-like queries, e.g. finding the shortest path between two elements of a graph structure. Popular graph databases include Neo4j and VertexDB.
- Key-Value Stores: These databases are based on a hash table in which each key is mapped to a value, which may be as complex as an entire list. Because records are located by key alone, the data can be distributed easily across nodes. They typically support the regular CRUD (Create, Read, Update, Delete) operations but offer no join or aggregate functions. Examples of key-value stores are Amazon DynamoDB and Apache Cassandra (the latter is often also classified as a wide-column store).
- Column Stores: Column stores are characterized by the fact that each storage block contains data from only one column of a relational structure. As such, column stores are a hybrid of RDBMS and key-value stores. In practice, values are stored in groups of zero or more columns, in column order rather than in row order, and queries retrieve values by key matching. Examples of column stores include HBase and Vertica.
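The difference between row order and column order can be sketched with plain Python data structures. The sample records and field names below are invented for illustration, and a real column store adds compression, partitioning and on-disk blocks on top of this idea.

```python
# Row order: each record keeps all of its fields together.
rows = [
    {"id": 1, "city": "Athens", "temp": 31},
    {"id": 2, "city": "Oslo", "temp": 14},
    {"id": 3, "city": "Lima", "temp": 19},
]

# Column order: each column is stored contiguously, one list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over one column only touches that column's values,
# instead of scanning every field of every record.
avg_temp = sum(columns["temp"]) / len(columns["temp"])

# A full record can still be rebuilt by position when needed.
second_record = {name: values[1] for name, values in columns.items()}
```

Reading a single column this way is cheap, whereas rebuilding a whole record requires one lookup per column, which is one reason the layout suits scans over a few attributes.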
Making the Right Choice
Given the different types of NoSQL databases, system architects and developers have various options for handling unstructured data in their applications. Selecting the right option is primarily a matter of understanding their properties and uses. In particular, it’s good practice to use a document store when your data comprise collections of similar entities that are semi-structured and sparse rather than conventionally tabular. As a prominent example, a document store is a good choice for storing the data of a blog, because blog posts can be stored as indexed documents that can be ordered and/or retrieved by properties like their author, subject and authoring date. Posts are typically unstructured or semi-structured, yet they carry the metadata listed above.
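The blog example can be sketched as a tiny in-memory document store. This is an illustration only: the `PostStore` class, its methods and the sample posts are hypothetical and do not correspond to the API of MongoDB, CouchDB or any other product. The point is that posts may have different fields while still being retrievable through an index on shared metadata.

```python
from collections import defaultdict
from datetime import date


class PostStore:
    """Toy document store with a secondary index on the author field."""

    def __init__(self):
        self.posts = {}                      # post id -> document
        self.by_author = defaultdict(list)   # author -> list of post ids
        self._next_id = 1

    def insert(self, doc):
        post_id = self._next_id
        self._next_id += 1
        self.posts[post_id] = doc
        self.by_author[doc.get("author")].append(post_id)
        return post_id

    def find_by_author(self, author):
        # Use the index instead of scanning every document.
        docs = [self.posts[i] for i in self.by_author.get(author, [])]
        return sorted(docs, key=lambda d: d.get("date", date.min))


store = PostStore()
# Documents are sparse: only the first post carries a "tags" field.
store.insert({"author": "ana", "subject": "NoSQL", "date": date(2020, 5, 2),
              "body": "Choosing a store", "tags": ["databases"]})
store.insert({"author": "ana", "subject": "Graphs", "date": date(2020, 1, 15),
              "body": "Traversing edges"})
store.insert({"author": "bo", "subject": "SQL", "date": date(2020, 3, 1),
              "body": "Joins revisited"})

subjects = [d["subject"] for d in store.find_by_author("ana")]
```

Here `subjects` lists the posts by "ana" ordered by authoring date, even though the two documents do not share exactly the same fields.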
It’s probably a good idea to use a key-value store or a column store when scalability is your main concern. In this context, scalability refers to the size of the data and the overall load placed on the system. In many applications, the main requirement is to write unstructured data very fast. This is why a column store can be a good fit for a Twitter-based application, e.g. one that leverages vast amounts of Twitter data for branding or sentiment analysis. Such applications need to deal with many posts arriving at high speed, i.e. they feature very high ingestion rates, as hundreds of gigabytes of data are posted on Twitter every few minutes. Moreover, processing Twitter data requires queries based on the user or the date of the tweet, which both key-value stores and column stores can support.
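Queries by user and by date map naturally onto a keyed layout. The sketch below assumes a hypothetical `TweetStore` in which the user name is the row key and each row keeps its entries sorted by an ISO-formatted timestamp string. It is loosely inspired by how wide-column stores organize data, not a model of any specific system.

```python
import bisect
from collections import defaultdict


class TweetStore:
    """Toy keyed store: user -> list of (timestamp, text) sorted by timestamp."""

    def __init__(self):
        self.rows = defaultdict(list)

    def put(self, user, timestamp, text):
        # Writes only touch the row of a single key, so rows can be
        # spread across machines by key.
        bisect.insort(self.rows[user], (timestamp, text))

    def get_range(self, user, start, end):
        """Return texts for `user` with start <= timestamp < end."""
        row = self.rows.get(user, [])
        lo = bisect.bisect_left(row, (start,))
        hi = bisect.bisect_left(row, (end,))
        return [text for _, text in row[lo:hi]]


store = TweetStore()
store.put("alice", "2020-05-02T09:00", "second")
store.put("alice", "2020-05-01T10:00", "first")
store.put("alice", "2020-06-01T08:00", "later")
store.put("bob", "2020-05-01T11:00", "hi")

may_tweets = store.get_range("alice", "2020-05-01", "2020-06-01")
```

Both the write and the range read need only the row for one user, which is what lets this kind of layout scale out; a query that spans all users (a join or a global aggregate) is exactly what such stores do not offer cheaply.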
Graph databases are the right choice when data traversal is a primary concern. This is usually the case in applications where data are modeled as a graph in order to represent the relationships between the different entities intuitively. The most characteristic examples are social networking applications, where analyzing social graphs is one of the most important tasks. In such applications, social network participants (e.g. persons) are represented as nodes and their relationships (e.g. friend or follower relationships) as edges (arcs). In this context, graph databases facilitate queries based on the names of the nodes while also easing the creation of node groups. Moreover, they are an excellent choice when multiple nodes of the graph need to be traversed (e.g. in order to understand how two or more people can get to know each other through their social graph).
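The "how can two people get to know each other" question is a shortest-path traversal. As a rough sketch, the snippet below runs a breadth-first search over an adjacency list. The `shortest_path` function and the sample `friends` graph are made up for illustration, and a graph database would expose this kind of traversal through its own query language rather than application code.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes from start to goal,
    or None if the two are not connected."""
    if start == goal:
        return [start]
    previous = {start: None}   # node -> node it was reached from
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for friend in graph.get(person, ()):
            if friend in previous:
                continue       # already visited
            previous[friend] = person
            if friend == goal:
                # Walk back through the predecessors to rebuild the path.
                path = [friend]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(friend)
    return None


# Nodes are people, edges are (symmetric) friend relationships.
friends = {
    "ana": ["bo", "cy"],
    "bo": ["ana", "dee"],
    "cy": ["ana"],
    "dee": ["bo", "eli"],
    "eli": ["dee"],
}

path = shortest_path(friends, "ana", "eli")
```

In this sample graph, "ana" reaches "eli" through "bo" and "dee". Expressing the same query in a relational schema would typically require one self-join per hop, which is why traversal-heavy workloads favor a graph model.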
While there are many uses for NoSQL databases, it’s important not to be deceived by the hype around NoSQL and Big Data. Conventional RDBMS remain the primary choice for most applications, especially when you need to process structured data and produce reports. NoSQL is not the ideal tool for reporting, which is a very important function in most enterprise applications. Hence, NoSQL is best avoided for uniform, structured data and for legacy systems that already make use of an RDBMS.
By and large, NoSQL is certainly a powerful tool in your arsenal for surviving in a data-driven society, especially for Big Data applications. We hope that our recommendations will help you choose the right database for your storage needs.