To Turn Big Data from Buzzword to Reality, Build a Data Lake
Edo Interactive runs a card link offer platform which enables merchants to offer coupons to consumers based on their buying behavior.
The platform pulls around 50 million card transaction details from banks and credit card companies daily, runs them through a predictive model, and generates customized coupons which individual cardholders can redeem towards their next purchase.
For Edo, speed and accuracy of analytics is mission critical. In 2013 they faced a serious crisis when their existing data warehouse took 27 hours to process 24 hours worth of raw data.
To solve this problem Edo built a data lake using a Hadoop cluster which processes the daily build in 4 hours, and slashed the time needed to run models to minutes as opposed to days.
Data warehouses were the traditional repositories of enterprise data. Here’s how the process of data collection, storage and reporting looked like in a pre-Big Data environment.
The data warehouse stores curated, scrubbed, and largely structured data from sources like ERP systems. Therefore the possibilities of analysis are limited.
In sharp contrasts here’s how Big Data is organized using data lakes.
To sum up here are the differences between a data lake and a data warehouse:
Two of the biggest benefits of data lakes are speed and lower cost of hardware.
When Webtrends, a company which collects clickstream data from websites and mobiles which corporate marketers use to perform ad-hoc analysis implemented a data lake using a Hadoop cluster:
…Internet clickstream data (could) be streamed into the cluster and prepared for analysis in just 20 to 40 milliseconds…much faster than with the older system…hardware costs are (also) 25% to 50% lower on it.
Similarly, another company called Razorsight which offers cloud based analytics tools for telecommunication companies found that using a Hadoop cluster as a data lake cost $2,000 per TB, about one-tenth of their data warehousing solution.
But these benefits of flexibility, cost, and speed are only realized if the data lake architecture is well designed, and if there are rules in place governing how data is stored.
While a data lake will hold unstructured or raw data it in itself cannot be a data dump like a badly managed Sharepoint portal.
Data lakes need a fair amount of governance because of the nature of data stored in the former.
A data lake works on the ingest-analyze-understand philosophy.
Many flawed data lake implementations focus exclusively on the Ingest phase. When that happens, disaster strikes. In the case of one company:
“All of the context of the data — where it came from, why it was created, who created it — was lost…By the time the company fixed this issue and went back to their old platform, they had lost two-thirds of their customers and almost went out of business.”
That would scare anyone off data lakes.
If you want to keep your data lake from becoming a data sinkhole:
Adopting a schema for data lakes will also maintain data lineage which is extremely imprtant in regulated industries like finance and healthcare.
Data lakes offer unprecented flexibility for data driven companies to suss out insights that would have been otherwise impossible with traditional data warehouses.
Data lakes are particularly suitable when you need to store massive amounts of unstructured data like tweets along with structured data like server logs but don’t want them to be in silos.
However without a proper architecture a data lake becomes a data dump, and all you will have is a case of garbage-in garbage-out.
Enterprise Software Development: Core Competence or Commodity?
Virtual Reality: A Powerful Tool for the Present and Future of Business Enterprises
CIOs in 2021: New Mindsets, Cultures and Leadership Rules
DataOps: Maximizing Efficiency of your Data
Getting the most from your Multi-Cloud Environment
Six Ingredients of Data Management Intelligence
Top Strategic Priorities for CIOs in 2021
Positioning Your IT for Success in 2021
Customer Centric Processes: From CRM to Customer Data Platforms
Robotic Process Automation: A Driver for Cost-Efficient Enterprises Processes
We're here to help!
No obligation quotes in 48 hours. Teams setup within 2 weeks.
If you are a Service Provider looking to register, please fill out this Information Request and someone will get in touch.
Outsource with Confidence to high quality Service Providers.
If you are a Service Provider looking to register, please fill out
this Information Request and someone will get in
Enter your email id and we'll send a link to reset your password to the address
we have for your account.
The IT Exchange service provider network is exclusive and by-invite. There is
no cost to get on-board;
if you are competent in your areas of focus, then you are welcome. As a part of this exclusive