by Sanjeev Kapoor 07 Sep 2017
Fishing in the Big Data Lake

A company’s business data is among its most critical assets, which is the reason why enterprises are heavily investing in data management infrastructures and services, such as databases, data warehouses, data mining services, business intelligence, reporting and more. In recent years, the importance of data assets has been rising due to the emergence of data-driven enterprises. Such enterprises process and use factual data in order to drive their operations and decisions. Likewise, there has also been an emergence of new data management infrastructures, which are associated with various Big Data services and provide enterprises with exceptional opportunities to derive business value out of arbitrarily large volumes of data. In most cases, such Big Data infrastructures and services co-exist with the legacy data management infrastructures such as data warehouses and data mining tools. The latter infrastructures still play a significant role in enterprise data management. Hence, modern businesses must be able to understand the value of both legacy and emerging data management infrastructures in order to make proper choices about their business data architectures and services.

 

Conventional Enterprise Data Management Infrastructures

Nowadays, most enterprises deploy some sort of conventional enterprise data management infrastructure, such as data warehouses and data mining tools.

The combination of data warehouses and data mining tools is, for many enterprises, the preferred way of identifying patterns that drive their decisions and operations. Despite the advent of Big Data, enterprises are still very keen on maintaining data warehouses, as these ensure that data is structured and can be processed easily and reliably.

 

Riding the Wave of Big Data

Enterprises are increasingly adopting Big Data technologies because they can process large and complex data sets that are almost impossible to process with conventional data management infrastructures and their tools. Typical examples of such data include web logs, sensor data, data from social media and social networking platforms, call detail records, genomics data, large-scale e-commerce data, as well as various types of multimedia data such as collections of medical images, movie databases and more.

Big Data are in several cases unstructured or poorly structured, while featuring high velocity and uncertain veracity, which makes their collection, processing and management very challenging. Big Data tools such as distributed filesystems (e.g., the Hadoop Distributed File System (HDFS)) and streaming engines (e.g., Apache Spark and Apache Storm) provide the means for dealing with high-volume and high-velocity datasets, which is exactly where conventional tools fail. Enterprises are therefore deploying such tools in order to benefit from Big Data, without abandoning their conventional infrastructure.

The advent of Big Data has also given rise to the concept of a “Data Lake”, which is a storage repository that can hold vast amounts of data in raw format, including structured, semi-structured and unstructured data. A Data Lake accommodates data of varying structures, which it resolves at the application delivery level, i.e., when the data structuring requirements are known. Some people consider Data Lakes to be a re-creation of data warehouses in the Big Data era. While there is truth in this argument, it should be underlined that data lakes are significantly different from warehouses in terms of the ways they structure and manage data. The main difference is that data warehouses deal with structured data only, while data lakes store raw data and transform it to some structure when it is time to use the data (e.g., as part of an application). Also, data lakes are closely affiliated with Big Data technologies, as in most cases they leverage tools and techniques from the Hadoop/Big Data ecosystem. Finally, data lakes provide some agility in terms of their processing, as data schemas can flexibly change, while data warehouses adhere to given schemas in order to benefit from well-structured data.
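The idea of storing raw data and imposing structure only at the time of use can be illustrated with a small sketch. The Python example below is a simplified illustration under stated assumptions, not a description of any particular data lake product: the in-memory list RAW_LAKE, the function read_with_schema and the two schemas are hypothetical names introduced here. It shows two applications reading the same raw records with different structuring requirements.

```python
import json

# Hypothetical raw records as they might land in a data lake: stored exactly
# as received, with no schema enforced at write time.
RAW_LAKE = [
    '{"user": "alice", "amount": "19.90", "channel": "web"}',
    '{"user": "bob", "amount": 5}',
    'not json at all',
    '{"user": "carol", "amount": "n/a", "channel": "store"}',
]


def read_with_schema(raw_records, schema):
    """Apply a schema at read time and skip records that do not fit it.

    `schema` maps a field name to a callable that casts the raw value.
    """
    rows = []
    for raw in raw_records:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unparseable data stays in the lake, but not in this view
        if not isinstance(record, dict):
            continue
        try:
            row = {field: cast(record[field]) for field, cast in schema.items()}
        except (KeyError, ValueError, TypeError):
            continue  # a required field is missing or cannot be cast
        rows.append(row)
    return rows


# Two applications with different structuring requirements (both assumed).
billing_schema = {"user": str, "amount": float}
marketing_schema = {"user": str, "channel": str}

billing_rows = read_with_schema(RAW_LAKE, billing_schema)
marketing_rows = read_with_schema(RAW_LAKE, marketing_schema)
```

In this sketch the raw records are never modified, so a third application could later read them with yet another schema. A warehouse would instead validate and structure each record once, before storing it.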

 

Establishing an Enterprise Big Data Architecture

With so many different solutions and infrastructures for managing data, enterprises are challenged to create data management architectures that are responsive and scale in a cost-effective way. Modern enterprise data architectures should combine the merits of warehouses for processing structured data with the capabilities of Big Data infrastructures for scalable processing of large volumes of data, including data of low business value (e.g., raw social media streams).

There are several architectural patterns that strive to combine data lakes with data warehouses. Some of these employ a data warehouse as the primary means for data analytics and complement it with a data lake infrastructure for data of lower business value. In this case, the business value of the collected data is assessed before deciding whether to move the data from the data lake to the warehouse. However, there are also architecture patterns that work the other way around: they deploy a data lake as the central repository of information and selectively move the data deemed to have the highest business value into a complementary warehouse. No matter the selected pattern, enterprises need to take into account cost/benefit considerations, as each of the presented architectures comes with certain costs and requires considerable effort for its deployment. Hence, data lakes and data warehouses are likely to co-exist in a data management architecture: the former will deal with large volumes of raw data and their transformations, while the latter will deal with data mining over well-structured data.
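The lake-centric pattern described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a reference architecture: the in-memory list LAKE stands in for a data lake, an in-memory SQLite database stands in for the warehouse, and the rule in is_high_value (only orders are worth warehousing) is a hypothetical business rule introduced for the example.

```python
import json
import sqlite3

# Hypothetical raw events held in the lake, stored as received.
LAKE = [
    '{"type": "order", "customer": "alice", "total": 120.0}',
    '{"type": "page_view", "customer": "alice", "url": "/home"}',
    '{"type": "order", "customer": "bob", "total": 35.5}',
    '{"type": "tweet", "text": "nice weather"}',
]


def is_high_value(event):
    """Assumed business rule: only order events are moved to the warehouse."""
    return event.get("type") == "order"


def promote_to_warehouse(lake, conn):
    """Copy high-value events from the lake into a fixed-schema table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "customer TEXT NOT NULL, total REAL NOT NULL)"
    )
    promoted = 0
    for raw in lake:
        event = json.loads(raw)
        if is_high_value(event):
            conn.execute(
                "INSERT INTO orders (customer, total) VALUES (?, ?)",
                (event["customer"], float(event["total"])),
            )
            promoted += 1
    conn.commit()
    return promoted


# An in-memory SQLite database plays the role of the warehouse here.
conn = sqlite3.connect(":memory:")
promoted = promote_to_warehouse(LAKE, conn)

# Analytics then run over the well-structured warehouse table.
revenue = conn.execute("SELECT SUM(total) FROM orders").fetchone()[0]
```

The lake keeps every event, including those of low business value, while the warehouse holds only the selected, well-structured subset. The reverse pattern would differ mainly in where the value assessment happens and which store serves as the primary target for analytics.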

Overall, firms are offered remarkable data management opportunities to help them improve decision making and increase the efficiency of their business processes. The selection of a proper data management infrastructure therefore becomes key to a business’s competitiveness. Aside from the technology factor, the ever-important organizational factors (e.g., organizational obstacles) and management factors (e.g., data administration and governance) should not be underestimated. Hence, data management is an area where good decisions can make a real difference and yield improved business results.
