Book Image

Data Lake Development with Big Data

By : Pradeep Pasupuleti, Beulah Salome Purra
Book Image

Data Lake Development with Big Data

By: Pradeep Pasupuleti, Beulah Salome Purra

Overview of this book

A Data Lake is a highly scalable platform for storing huge volumes of multistructured data from disparate sources with centralized data management services. This book explores the potential of Data Lakes and explores architectural approaches to building data lakes that ingest, index, manage, and analyze massive amounts of data using batch and real-time processing frameworks. It guides you on how to go about building a Data Lake that is managed by Hadoop and accessed as required by other Big Data applications. This book will guide readers (using best practices) in developing Data Lake's capabilities. It will focus on architect data governance, security, data quality, data lineage tracking, metadata management, and semantic data tagging. By the end of this book, you will have a good understanding of building a Data Lake for Big Data.
Table of Contents (13 chapters)

Key benefits of Data Lake

Having understood the need for the Data Lake and the business/technology context of its evolution, let us now summarize the important benefits in the following list:

  • Scale as much as you can: Theoretically, the HDFS-based storage of Hadoop gives you the flexibility to support arbitrarily large clusters while maintaining a constant price per performance curve even as it scales. This means, your data storage can be scaled horizontally to cater to any need at a judicious cost. To gain more space, all you have to do is plug in a new cluster and then Hadoop scales seamlessly. Hadoop brings you the incredible facility to run the code close to storage, allowing quicker processing of massive data sets. The usage of Hadoop for underlying storage makes the Data Lake more scalable at a better price point than Data Warehouses by an order of magnitude. This allows for the retention of huge amounts of data.

  • Plug in disparate data sources: Unlike a data warehouse that can only ingest structured data, a Hadoop-powered Data Lake has an inherent ability to ingest multi-structured and massive datasets from disparate sources. This means that the Data Lake can store literally any type of data such as multimedia, binary, XML, logs, sensor data, social chatter, and so on. This is one huge benefit that removes data silos and enables quick integration of datasets.

  • Acquire high-velocity data: In order to efficiently stream high-speed data in huge volumes, the Data Lake makes use of tools that can acquire and queue it. The Data Lake utilizes tools such as Kafka, Flume, Scribe, and Chukwa to acquire high-velocity data. This data could be the incessant social chatter in the form of Twitter feeds, WhatsApp messages, or it could be sensor data from the machine exhaust. This ability to acquire high-velocity data and integrate with large volumes of historical data gives Data Lake the edge over Data Warehousing systems, which could not do any of these as effectively.

  • Add a structure: To make sense of vast amounts of data stored in the Data Lake, we should create some structure around the data and pipe it into analysis applications. Applying a structure on unstructured data can be done while ingesting or after being stored in the Data Lake. A structure such as the metadata of a file, word counts, parts of speech tagging, and so on, can be created out of the unstructured data. The Data Lake gives you a unique platform where we have the ability to apply a structure on varied datasets in the same repository with a richer detail; hence, enabling you to process the combined data in advanced analytic scenarios.

  • Store in native format: In a Data Warehouse, the data is premodeled as cubes that are the best storage structures for predetermined analysis routines at the time of ingestion. The Data Lake eliminates the need for data to be premodeled; it provides iterative and immediate access to the raw data. This enhances the delivery of analytical insights and offers unmatched flexibility to ask business questions and seek deeper answers.

  • Don't worry about schema: Traditional data warehouses do not support the schemaless storage of data. The Data Lake leverages Hadoop's simplicity in storing data based on schemaless write and schema-based read modes. This is very helpful for data consumers to perform exploratory analysis and thus, develop new patterns from the data without worrying about its initial structure and ask far-reaching, complex questions to gain actionable intelligence.

  • Unleash your favorite SQL: Once the data is ingested, cleansed, and stored in a structured SQL storage of the Data Lake, you can reuse the existing PL-SQL scripts. The tools such as HAWQ and IMPALA give you the flexibility to run massively parallel SQL queries while simultaneously integrating with advanced algorithm libraries such as MADLib and applications such as SAS. Performing the SQL processing inside the Data Lake decreases the time to achieving results and also consumes far less resources than performing SQL processing outside of it.

  • Advanced algorithms: Unlike a data warehouse, the Data Lake excels at utilizing the availability of large quantities of coherent data along with deep learning algorithms to recognize items of interest that will power real-time decision analytics.

  • Administrative resources: The Data Lake scores better than a data warehouse in reducing the administrative resources required for pulling, transforming, aggregating, and analyzing data.