Data Lake for Enterprises

By: Vivek Mishra, Tomcy John, Pankaj Misra

Overview of this book

The term "data lake" has recently become prominent in the big data industry. Data scientists can use a data lake to derive meaningful insights that businesses can use to redefine or transform the way they operate. Lambda architecture is also emerging as one of the prominent patterns in the big data landscape, as it not only helps derive useful information from historical data but also correlates real-time data to enable businesses to make critical decisions. This book brings these two important aspects, data lake and Lambda architecture, together. The book is divided into three main sections. The first introduces you to the concept of data lakes and their importance in enterprises, and gets you up to speed with the Lambda architecture. The second section delves into the principal components of building a data lake using the Lambda architecture. It introduces you to popular big data technologies such as Apache Hadoop, Spark, Sqoop, Flume, and Elasticsearch. The third section is a highly practical demonstration of putting it all together, and shows you how an enterprise data lake can be implemented, along with several real-world use cases. It also shows you how other peripheral components can be added to the lake to make it more efficient. By the end of this book, you will be able to choose the right big data technologies using the Lambda architectural patterns to build your enterprise data lake.
Table of Contents (23 chapters)
Title Page
About the Authors
About the Reviewers
Customer Feedback
Part 1 - Overview
Part 2 - Technical Building blocks of Data Lake
Part 3 - Bringing It All Together

About the Reviewers

Wei Di is currently a staff member in a business analytics data mining team. As a data scientist, she is passionate about creating smart and scalable analytics and data mining solutions that can impact millions of individuals and empower successful businesses.

Her interests cover a wide range of areas, including artificial intelligence, machine learning, and computer vision. She was previously associated with the eBay human language technology team and eBay research labs, with a focus on image understanding for large-scale applications and joint learning from both visual and text information. Prior to that, she worked on large-scale data mining and machine learning models in the areas of record linkage, search relevance, and ranking. She received her PhD from Purdue University in 2011, with a focus on data mining and image classification.

Vivek Mishra is an IT professional with more than 9 years of experience in various technologies such as Java, J2EE, Hibernate, SCA4J, Mule, Spring, Cassandra, HBase, MongoDB, Redis, Hive, and Hadoop. He has been a contributor to open source software such as Apache Cassandra and is the lead committer for Kundera (a JPA 2.0-compliant object-datastore mapping library for NoSQL datastores such as Cassandra, HBase, MongoDB, and Redis).

In his previous roles, Vivek has enjoyed long-lasting partnerships with the most recognizable names in the SCM, banking, and finance industries, employing industry-standard, full software life cycle methodologies such as Agile and Scrum. He is currently employed with Impetus Infotech.

He has undertaken speaking engagements at cloud camp and Nasscom big data seminars and is an active blogger.


Rubén Oliva Ramos is a computer systems engineer with a master's degree in computer and electronic systems engineering, with a specialization in teleinformatics and networking, from University of Salle Bajio in Leon, Guanajuato, Mexico. He has more than 5 years of experience in developing web applications to control and monitor devices connected with Arduino and Raspberry Pi, using web frameworks and cloud services to build Internet of Things applications.

He is a mechatronics teacher at the University of Salle Bajio, where he teaches students in the master's program in design and engineering of mechatronics systems. He also works at Centro de Bachillerato Tecnologico Industrial 225 in Leon, Guanajuato, Mexico, teaching electronics, robotics and control, automation, and microcontrollers in the Mechatronics Technician program. He has worked on consulting and development projects in areas such as monitoring systems and datalogger data, using technologies such as Android, iOS, Windows Phone, Visual Studio .NET, HTML5, PHP, CSS, Ajax, JavaScript, Angular, ASP.NET, databases (SQLite, MongoDB, and MySQL), and web servers (Node.js and IIS). Rubén has done hardware programming on Arduino, Raspberry Pi, Ethernet Shield, GPS and GSM/GPRS, and ESP8266, as well as control and monitoring systems for data acquisition and programming.

He is the author of Internet of Things Programming with JavaScript, published by Packt.

His current job involves monitoring, control, and data acquisition with Arduino and Visual Basic .NET for Alfaomega Editor Group.

"I want to thank God for helping me review this book; my wife, Mayte, and my sons, Ruben and Dario, for their support; my parents, my brother, and my sister, whom I love; and all my beautiful family."