The Concept of Hadoop Data Lake and Its Implications

Author: Larry Xu

Businesses have been collecting and capturing data for quite some time. Gathering and analyzing business data is essential for management decision making because human mathematical intuition is often limited and flawed. People therefore need help making decisions, whether to manage a business or their daily lives. Business Intelligence (BI) is a system for extracting valuable, insightful information from collected and stored data. The decision support that a Business Intelligence system offers has become commonplace in modern times.

Over the years, collected data has been fed into Databases, Data Marts, and larger Data Warehouses. Now the latest buzzword is the Data Lake. “The Data Lake is somewhat unique, in that we have new technologies, like Hadoop for example, that enable you to collect massive amounts of information and store that in a file system; and really store it in an unprecedented, high scale, singular fashion.” – Steve Lucas, President, SAP Platform Solutions, openSAP course “Driving Business Results with Big Data”, June 2015.

[Figure: Data Lake graph]

Unlike a traditional Data Warehouse, which organizes data into a predefined structure before it is stored, a Hadoop Data Lake is a large data storage repository that holds data in its native format until it is needed. An analogy can be made between a Hadoop Data Lake and a natural lake that serves as a water source for city residents. The water in the lake comes from rain, rivers, streams, and creeks, and it remains in its native, untreated state. When people need the water, it flows to a treatment plant to be treated for human use.
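
Holding data in its native format until it is needed is often called schema-on-read. The following is a minimal Python sketch of that idea, not a Hadoop implementation: the `RAW_LAKE` list is a hypothetical stand-in for raw files in the lake, and the field names `sensor_id` and `temp_c` are assumptions made for illustration.

```python
import csv
import io
import json

# Hypothetical raw records, kept exactly as they arrived (no upfront schema).
RAW_LAKE = [
    ("json", '{"sensor_id": "T-17", "temp_c": "81.5"}'),
    ("csv", "sensor_id,temp_c\nT-18,64.0\n"),
]


def read_with_schema(raw_lake):
    """Parse each raw record only at read time and yield typed rows."""
    for fmt, payload in raw_lake:
        if fmt == "json":
            rows = [json.loads(payload)]
        elif fmt == "csv":
            rows = list(csv.DictReader(io.StringIO(payload)))
        else:
            continue  # unknown formats stay in the lake untouched
        for row in rows:
            # The schema (string id, float temperature) is applied here,
            # at the moment the data is needed, not when it was stored.
            yield {
                "sensor_id": str(row["sensor_id"]),
                "temp_c": float(row["temp_c"]),
            }


readings = list(read_with_schema(RAW_LAKE))
```

In lake terms, the raw payloads are the untreated water and `read_with_schema` plays the role of the treatment plant.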

Just like the water in a real lake, the data in a Hadoop Data Lake comes from a variety of sources. These include non-relational data such as log files, Internet clickstream records, sensor data, JSON objects, images, and social media posts, as well as data pulled from relational databases. The unique nature of a Hadoop Data Lake has the following implications:

  • A Less Expensive Alternative for Storing Massive Amounts of Data.

    A Hadoop Data Lake stores data on the Hadoop Distributed File System (HDFS), using clusters of readily available, mass-produced commodity servers. Thus, the cost of storing data is comparatively low.

  • A More Suitable Platform for Big Data Management.

    A Hadoop Data Lake can store a diverse mixture of structured, semi-structured, and unstructured data. This capability makes it a more suitable platform for Big Data management and analytics applications than data warehouses based on relational database management systems (RDBMS). A BI system can leverage this platform to extract valuable and insightful information from the data collected and stored in a Hadoop Data Lake.

  • A Complement to the Enterprise Data Warehouse.

    Even though the Hadoop Data Lake has promising features, it is premature and difficult to entirely supplant time-tested traditional data warehouses. Ideally, a hybrid platform combining an Enterprise Data Warehouse (EDW) and a Hadoop Data Lake reaps the combined strengths of different data stores and processing engines. For example, a company can use SAP HANA as a platform for real-time processing of hot data, SAP IQ (Near Line Storage) for warm data storage and processing, and a Hadoop Data Lake for cold data archiving and batch processing.

Here is a possible use case for a Hadoop Data Lake in the utilities industry. As a power grid operator, a utility company can leverage both the massive amount of sensor data and historical data, such as maintenance schedules and weather data, stored in a Hadoop Data Lake to anticipate when major equipment will need service. Proactive replacements or repairs can then be performed to avoid untimely and costly disruptions of electricity delivery caused by sudden, unexpected equipment breakdowns.
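
To make the utilities use case more concrete, here is a rough Python sketch that combines sensor readings with maintenance history to flag equipment for proactive service. It is a simple rule-based stand-in for a real predictive model. The equipment names, readings, dates, and thresholds are all hypothetical values invented for illustration, and a real system would read them from the data lake rather than from in-memory dictionaries.

```python
from datetime import date
from statistics import mean

# Hypothetical sample data standing in for records read from the data lake.
SENSOR_TEMPS_C = {
    "transformer-A": [71.0, 74.5, 88.0, 91.5],
    "transformer-B": [60.0, 61.5, 59.0, 62.0],
}
LAST_SERVICE = {
    "transformer-A": date(2012, 3, 1),
    "transformer-B": date(2014, 9, 15),
}

# Illustrative thresholds only; real limits would come from engineering data.
MAX_MEAN_TEMP_C = 80.0
MAX_DAYS_SINCE_SERVICE = 730


def needs_service(equipment_id, today):
    """Flag equipment whose recent readings run hot and whose service is overdue."""
    recent = SENSOR_TEMPS_C[equipment_id][-3:]  # last three readings
    days_since = (today - LAST_SERVICE[equipment_id]).days
    return mean(recent) > MAX_MEAN_TEMP_C and days_since > MAX_DAYS_SINCE_SERVICE


flagged = [eq for eq in SENSOR_TEMPS_C if needs_service(eq, date(2015, 6, 1))]
```

With this sample data, only `transformer-A` is flagged, because its recent readings run above the assumed temperature limit and its last service is more than two years old. Weather data could be added as another input in the same way.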