Fishing in the Big Data Lake

by Sanjeev Kapoor 07 Sep 2017

A company’s business data is among its most critical assets, which is why enterprises are investing heavily in data management infrastructures and services, such as databases, data warehouses, data mining services, business intelligence, reporting and more. In recent years, the importance of data assets has been rising due to the emergence of data-driven enterprises, which process and use factual data to drive their operations and decisions. At the same time, new data management infrastructures have emerged. They are associated with various Big Data services and give enterprises exceptional opportunities to derive business value from arbitrarily large volumes of data. In most cases, such Big Data infrastructures and services co-exist with legacy data management infrastructures such as data warehouses and data mining tools, which still play a significant role in enterprise data management. Hence, modern businesses must understand the value of both legacy and emerging data management infrastructures in order to make proper choices about their business data architectures and services.


Conventional Enterprise Data Management Infrastructures

Nowadays, most enterprises deploy some sort of conventional enterprise data management infrastructure, such as:

  • Centralized databases: Centralized databases form the most basic and common data management infrastructure. They are usually deployed as scalable, high-performance, multi-processor cluster computing systems at a single location. Having all corporate data at a single location has some advantages (e.g., opportunities for tighter security and lower risks), but also some drawbacks (e.g., poor disaster recovery options).
  • Distributed databases: When data demands are decentralized (e.g., multiple enterprise locations or branches), a distributed database is usually a less costly and more flexible option than a centralized database. Databases can be distributed in different ways, such as partitioning and replication. In the case of partitioning, the database is split into distinct segments, which are maintained in different locations or regions. Replication, on the other hand, involves maintaining complete copies of the database in multiple locations, which requires frequent synchronization of the separate copies (e.g., during off-hours).
  • Data Warehouse: A data warehouse can be considered a database that stores both current and historic data. It consolidates data for management analysis and decision making, while providing a rich set of reporting and querying tools. Enterprises leverage warehouses for improved and easy access to information from the entire enterprise, as a warehouse consolidates data from many databases across different departments and business units. Furthermore, a warehouse enables an enterprise to appropriately model and remodel the data, as a means of deriving insights that solve enterprise problems and optimize business processes.
  • Data Mart: A data mart is a subset of a data warehouse, which contains a summarized or highly focused portion of data that addresses a specific user group. For example, one data mart can focus on financial & accounting operations, while another can focus on sales & marketing data.
  • Data Mining Services and Tools: Data mining provides the means for analyzing large pools of data, usually residing in a warehouse or data mart. It enables one to find hidden patterns, while at the same time facilitating the extraction of business rules. Typical information obtained through data mining includes associations (i.e. occurrences linked to a single event), sequences (i.e. events linked over time), classifications (i.e. assigning an item to a group among a list of predefined groups), clusters (i.e. assigning items to clusters based on some similarity measure and without having a list of predefined groups) and forecasts (i.e. predicting future values based on a list of past values).
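
The difference between partitioning and replication, described in the distributed databases item above, can be illustrated with a minimal Python sketch. The records, the `region` partition key, the site names and the overwrite-style synchronization are all assumptions made for illustration; real distributed databases use far more elaborate mechanisms.

```python
from collections import defaultdict

# Hypothetical customer records; "region" is the assumed partition key.
RECORDS = [
    {"id": 1, "region": "EU", "name": "Alice"},
    {"id": 2, "region": "US", "name": "Bob"},
    {"id": 3, "region": "EU", "name": "Carla"},
]
SITES = ["EU", "US"]  # hypothetical enterprise locations


def partition(records, key):
    """Partitioning: each site stores only the segment that belongs to it."""
    segments = defaultdict(list)
    for record in records:
        segments[record[key]].append(record)
    return dict(segments)


def replicate(records, sites):
    """Replication: every site stores a complete copy of the database."""
    return {site: [dict(r) for r in records] for site in sites}


def synchronize(replicas, source_site):
    """Simplified off-hours sync: overwrite every copy with the source copy."""
    source = replicas[source_site]
    for site in replicas:
        if site != source_site:
            replicas[site] = [dict(r) for r in source]


partitions = partition(RECORDS, "region")
replicas = replicate(RECORDS, SITES)

# A local write makes the copies diverge until the next synchronization.
replicas["EU"].append({"id": 4, "region": "EU", "name": "Dmitri"})
diverged = len(replicas["EU"]) != len(replicas["US"])
synchronize(replicas, "EU")
```

With partitioning, each site holds fewer rows but a query across regions must visit several sites. With replication, every site can answer any query locally, at the price of keeping the copies in sync.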

The combination of data warehouses and data mining tools is, for many enterprises, the preferred way of identifying patterns that drive their decisions and operations. Despite the advent of Big Data, enterprises are still very keen on maintaining data warehouses, as these ensure that data is structured and processed in easy and reliable ways.
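
As a rough illustration of how a data mart and two of the mining outputs listed above (associations and forecasts) relate to a warehouse, consider the following Python sketch. The warehouse rows, the `dept` field, the support threshold and the moving-average forecast are hypothetical simplifications, not the behavior of any specific product.

```python
from collections import Counter
from itertools import combinations

# Hypothetical warehouse rows consolidated from several departments.
WAREHOUSE = [
    {"dept": "sales", "order": 1, "items": ["laptop", "mouse"]},
    {"dept": "sales", "order": 2, "items": ["laptop", "mouse", "bag"]},
    {"dept": "sales", "order": 3, "items": ["mouse", "bag"]},
    {"dept": "finance", "invoice": 77, "amount": 1200.0},
]


def data_mart(warehouse, dept):
    """A data mart as a focused subset of the warehouse for one user group."""
    return [row for row in warehouse if row["dept"] == dept]


def associations(rows, min_support=2):
    """Count item pairs that occur together in at least min_support orders."""
    pairs = Counter()
    for row in rows:
        pairs.update(combinations(sorted(set(row["items"])), 2))
    return {pair: n for pair, n in pairs.items() if n >= min_support}


def moving_average_forecast(values, window=3):
    """Naive forecast: the next value is the mean of the last `window` values."""
    recent = values[-window:]
    return sum(recent) / len(recent)


sales_mart = data_mart(WAREHOUSE, "sales")
frequent_pairs = associations(sales_mart)
next_month = moving_average_forecast([100, 120, 140, 160])
```

Production data mining tools use more robust algorithms, but the division of labor is the same: the warehouse or mart supplies clean, structured rows and the mining step looks for patterns in them.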


Riding the Wave of Big Data

Enterprises are increasingly adopting Big Data technologies, as these can process large and complex data sets that are almost impossible to process with conventional data management infrastructures and their tools. Typical examples of such data include web logs, sensor data, data from social media and social networking platforms, call detail records, genomics data, large scale e-commerce data, as well as various types of multimedia data such as collections of medical images, movie databases and more.

Big Data are in several cases unstructured or poorly structured, while featuring high velocity and uncertain veracity, which makes their collection, processing and management very challenging. Big Data tools such as distributed filesystems (e.g., the Hadoop Distributed File System (HDFS)) and streaming engines (e.g., Apache Spark and Apache Storm) provide the means for dealing with high volume and high velocity datasets, which is where conventional tools fail. Enterprises are therefore deploying such tools in order to benefit from Big Data, without abandoning their conventional infrastructure.

The advent of Big Data has also given rise to the concept of a “Data Lake”, which is a storage repository that can hold vast amounts of data in raw format, including structured, semi-structured and unstructured data. A Data Lake accommodates data of varying structures, which it resolves at the application delivery level, i.e. when the data structuring requirements are known. Some people consider Data Lakes a re-creation of data warehouses in the Big Data era. While there is truth in this argument, it should be underlined that data lakes are significantly different from warehouses in terms of the ways they structure and manage data. The main difference is that data warehouses deal with structured data only, while data lakes store raw data and transform it to some structure when it is time to use the data (e.g., as part of an application). Also, data lakes are closely affiliated with Big Data technologies, as in most cases they leverage tools and techniques from the Hadoop/Big Data ecosystem. Finally, data lakes provide some agility in terms of their processing, as data schemas can flexibly change, while data warehouses adhere to given schemas in order to benefit from well-structured data.
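
The contrast between a warehouse that enforces a schema when data is loaded and a lake that applies structure only when data is used can be sketched in a few lines of Python. The raw events, the two-column `WAREHOUSE_SCHEMA` and the function names are assumptions chosen for illustration only.

```python
import json

# Hypothetical raw events as they might land in a lake: heterogeneous JSON lines.
RAW_LAKE = [
    '{"user": "u1", "amount": "19.90", "channel": "web"}',
    '{"user": "u2", "amount": 5, "extra": {"coupon": "X"}}',
    '{"user": "u3", "note": "no amount recorded"}',
]

# Assumed fixed warehouse schema: column name -> type used to coerce values.
WAREHOUSE_SCHEMA = {"user": str, "amount": float}


def load_into_warehouse(raw_lines, schema):
    """Warehouse style: coerce each record to the fixed schema, reject the rest."""
    rows, rejected = [], []
    for line in raw_lines:
        record = json.loads(line)
        try:
            rows.append({col: cast(record[col]) for col, cast in schema.items()})
        except (KeyError, ValueError, TypeError):
            rejected.append(line)
    return rows, rejected


def read_from_lake(raw_lines, fields):
    """Lake style: keep raw data untouched, project fields when an application asks."""
    for line in raw_lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


warehouse_rows, rejected = load_into_warehouse(RAW_LAKE, WAREHOUSE_SCHEMA)
channel_view = list(read_from_lake(RAW_LAKE, ["user", "channel"]))
```

The warehouse loader yields clean, typed rows but drops the record that does not fit. The lake keeps all three raw records, and a later application can ask for fields, such as `channel`, that the warehouse schema never anticipated.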


Establishing an Enterprise Big Data Architecture

With so many different solutions and infrastructures for managing data, enterprises are challenged to create data management architectures that are responsive and scale in a cost-effective way. Modern enterprise data architectures should combine the merits of warehouses for processing structured data with the capabilities of Big Data infrastructures for scalable processing of large volumes of data, including data of low business value (e.g., raw social media streams).

There are several architectural patterns that strive to combine data lakes with data warehouses. Some of these employ data warehouses as the primary means for data analytics, complemented by a data lake infrastructure for data of lower business value. In this case, the business value of the collected data is assessed prior to deciding whether to move them from the data lake to the warehouse. However, there are also architecture patterns that work the other way around: they deploy a data lake as the central repository of information and selectively move the data deemed to have the highest business value into a complementary warehouse. No matter the selected pattern, enterprises need to take into account cost/benefit considerations, as each of the presented architectures comes with certain costs and requires considerable effort for its deployment. Hence, data lakes and data warehouses are likely to co-exist in a data management architecture: the former will deal with large volumes of raw data and their transformations, while the latter will deal with data mining over well-structured data.
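
The two patterns can be contrasted with a small Python sketch. It assumes that each dataset carries a numeric business value score and that a single threshold decides what reaches the warehouse. The score, the threshold of 0.7 and the dataset names are hypothetical; in practice the assessment is an organizational decision rather than a single number.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    business_value: float  # hypothetical score in [0, 1]


@dataclass
class Architecture:
    lake: list = field(default_factory=list)
    warehouse: list = field(default_factory=list)


def warehouse_first(datasets, threshold=0.7):
    """Warehouse is primary; lower-value data stays in the complementary lake."""
    arch = Architecture()
    for ds in datasets:
        target = arch.warehouse if ds.business_value >= threshold else arch.lake
        target.append(ds.name)
    return arch


def lake_first(datasets, threshold=0.7):
    """Lake is the central repository; only high-value data is also promoted."""
    arch = Architecture()
    for ds in datasets:
        arch.lake.append(ds.name)  # everything lands in the lake
        if ds.business_value >= threshold:
            arch.warehouse.append(ds.name)
    return arch


DATASETS = [
    Dataset("sales_orders", 0.9),
    Dataset("raw_social_stream", 0.2),
    Dataset("web_logs", 0.5),
]

primary_warehouse = warehouse_first(DATASETS)
central_lake = lake_first(DATASETS)
```

In both sketches only `sales_orders` reaches the warehouse. The patterns differ in what the lake holds: only the lower-value datasets in the first case, everything in the second, which raises storage cost but keeps all raw data available for later reassessment.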

Overall, firms are offered remarkable data management opportunities to assist them in their effort to improve decision making and to increase the efficiency of their business processes. The selection of a proper data management infrastructure therefore becomes key to a business’s competitiveness. Aside from the technology factor, the ever-important organizational (e.g., organizational obstacles) and management (e.g., data administration and governance) factors should not be underestimated. Hence, data management is an area where optimal decisions can make a real difference and yield improved business results.
