Data Lake Development with Big Data

By: Pradeep Pasupuleti, Beulah Salome Purra

Overview of this book

A Data Lake is a highly scalable platform for storing huge volumes of multistructured data from disparate sources with centralized data management services. This book explores the potential of Data Lakes and the architectural approaches to building them so that they ingest, index, manage, and analyze massive amounts of data using batch and real-time processing frameworks. It guides you through building a Data Lake that is managed by Hadoop and accessed as required by other Big Data applications, using best practices to develop the Data Lake's capabilities. It focuses on architecting data governance, security, data quality, data lineage tracking, metadata management, and semantic data tagging. By the end of this book, you will have a good understanding of building a Data Lake for Big Data.
Table of Contents (13 chapters)

Before the Data Lake

In this section, let us quickly look at how the Data Lake has evolved from a historical perspective.

From the time data-intensive applications were used to solve business problems, we have seen many evolutionary steps in the way data has been stored, managed, analyzed, and visualized.

The earlier systems were designed to answer questions about the past. Questions such as "What were my total sales in the last year?" were answered by machines built around monolithic processors that ran COBOL, accessing data from tapes and disks. With the arrival of faster processors and better storage, businesses were able to slice and dice data to find fine-grained answers from subsets of data; these questions resembled "What was the sales performance of x unit in y geography in z timeframe?"

If we extract one common pattern, all the earlier systems were developed for business users, in order to help them make decisions for their businesses. The current breed of data systems empowers people like you and me to make decisions and improve the way we live. This is a fundamental paradigm shift brought about by advances in a myriad of technologies.

For many of us, the technologies that run in the background are transparent, while we consult applications that help us make decisions that alter our immediate future profoundly. We use applications to help us navigate to an address (mapping), decide on our holidays (weather and holiday planning sites), get a summary of product reviews (review sites), get similar products (recommendation engines), connect and grow professionally (professional social networks), and the list goes on.

All these applications use enabling technologies that understand natural languages, process humongous amounts of data, effortlessly store and process our personal data such as images and audio, and even extract intelligence from them by tagging our faces and finding relationships. Each of us, in a way, contributes to the flooding of these application servers with our personal data in the form of our preferences, likes, affiliations, networks, hobbies, friends, images, and videos.

If we can attribute one fundamental cause to today's explosion of data, it would be the proliferation of ubiquitous internet connectivity and the smartphone; with them comes an exponentially growing number of applications that transmit and store a variety of data.

Juxtaposing the growth of smartphones and the internet with the rapid decline in storage costs and the rise of cloud computing, which also brings down processing costs, we can immediately see that traditional data architectures do not scale to handle this volume and variety of data and thus cannot answer the questions that you and I want answered. They work well, extremely well in fact, for business users, but not directly for us.

In order to democratize the value hidden in data and thus empower common customers to use data for day-to-day decision making, organizations should first store and extract value from the different types of data being collected in such huge quantities. For all this to happen, the following two key developments have had a revolutionary impact:

  • The development of distributed computing architectures that can scale linearly and perform computations at an unbelievable pace

  • The development of new-age algorithms that can analyze natural languages, comprehend the semantics of the spoken words and special types, run Neural Nets, perform deep learning, graph social network interactions, perform constraint-based stochastic optimization, and so on

Earlier systems were simply not architected to scale linearly or to store and analyze this many types of data. They are good at the purpose they were initially built for: they excel as historical data stores that offload structured data from Online Transaction Processing (OLTP) systems, perform transformations, cleanse it, slice-dice and summarize it, and then feed it to Online Analytical Processing (OLAP) systems. Business Intelligence tools consume the exhaust of the OLAP systems and dutifully produce good-looking reports at regular intervals so that business users can make decisions.
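The traditional flow described above (offload from OLTP, cleanse, transform, summarize, feed to OLAP) can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the record fields (`unit`, `geo`, `quarter`, `amount`), the sample rows, and the function names are hypothetical and do not come from the book or from any particular product.

```python
from collections import defaultdict

# Hypothetical rows offloaded from an OLTP system; field names are illustrative.
oltp_rows = [
    {"unit": "Retail", "geo": "EMEA", "quarter": "Q1", "amount": "120.50"},
    {"unit": "Retail", "geo": "EMEA", "quarter": "Q1", "amount": "79.50"},
    {"unit": "Retail", "geo": "APAC", "quarter": "Q1", "amount": None},  # dirty record
    {"unit": "Online", "geo": "EMEA", "quarter": "Q2", "amount": " 300 "},
]


def cleanse(rows):
    """Drop rows with a missing amount and convert the amount text to a float."""
    for row in rows:
        if row["amount"] is None:
            continue
        yield {**row, "amount": float(row["amount"].strip())}


def summarize(rows):
    """Total the amounts per (unit, geo, quarter), like one cell of an OLAP cube."""
    cube = defaultdict(float)
    for row in rows:
        cube[(row["unit"], row["geo"], row["quarter"])] += row["amount"]
    return dict(cube)


def run_etl(rows):
    """Chain the steps: cleanse first, then summarize for the reporting layer."""
    return summarize(cleanse(rows))


cube = run_etl(oltp_rows)
for key, total in sorted(cube.items()):
    print(key, total)
```

Everything here runs in a single process over fully structured rows, which is exactly the shape of workload these earlier systems handle well, and it hints at why the same design struggles once the data no longer fits one machine or one schema.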

We can immediately grasp the glaring differences between the earlier systems and the new age systems by looking at these major aspects:

  • Storage and processing differ in the way they scale (distributed versus monolithic)

  • In earlier systems, data is managed in relational systems versus NoSQL, MPP, and CEP systems in the new age Big Data systems

  • Traditional systems cannot handle high-velocity data that is efficiently ingested and processed by Big Data applications

  • Structured data is predominantly used in earlier systems versus unstructured data being used in Big Data systems along with structured data

  • Traditional systems have limitations around the scale of data that they can handle; Big Data systems are scalable and can handle humongous amounts of data

  • Traditional analytic algorithms such as linear/logistic regression are used in earlier systems, versus cutting-edge algorithms such as random forests and other ensemble methods, stochastic optimization, deep learning, and NLP being regularly used in Big Data systems

  • Reports and drilldowns are the mainstay of earlier systems, versus advanced visualizations such as tag clouds and heat maps, which are some of the choicest reporting advances in the Big Data era

The Data Lake is one such architecture that has evolved to address the need of organizations to adapt to the new business reality. Organizations today listen to the customer's voice more than ever; they are sensitive to customer feedback and negative remarks, because it hurts their bottom line if they are not. Organizations understand their customers more intimately than ever before; they know your every move, literally, through behavioral profiling. Finally, organizations use all the data at their disposal to help customers leverage it for their personal benefit. In order to catch up with the changing business landscape, there is immense potential in building a Data Lake to store, process, and analyze huge amounts of structured and unstructured data.

The following figure elucidates the vital differences between traditional and Big Data systems:

Traditional versus Big Data systems