Book Image

Data Lake for Enterprises

By : Vivek Mishra, Tomcy John, Pankaj Misra
Book Image

Data Lake for Enterprises

By: Vivek Mishra, Tomcy John, Pankaj Misra

Overview of this book

The term "Data Lake" has recently emerged as a prominent term in the big data industry. Data scientists can make use of it in deriving meaningful insights that can be used by businesses to redefine or transform the way they operate. Lambda architecture is also emerging as one of the very eminent patterns in the big data landscape, as it not only helps to derive useful information from historical data but also correlates real-time data to enable business to take critical decisions. This book tries to bring these two important aspects — data lake and lambda architecture—together. This book is divided into three main sections. The first introduces you to the concept of data lakes, the importance of data lakes in enterprises, and getting you up-to-speed with the Lambda architecture. The second section delves into the principal components of building a data lake using the Lambda architecture. It introduces you to popular big data technologies such as Apache Hadoop, Spark, Sqoop, Flume, and ElasticSearch. The third section is a highly practical demonstration of putting it all together, and shows you how an enterprise data lake can be implemented, along with several real-world use-cases. It also shows you how other peripheral components can be added to the lake to make it more efficient. By the end of this book, you will be able to choose the right big data technologies using the lambda architectural patterns to build your enterprise data lake.
Table of Contents (23 chapters)
Title Page
Credits
Foreword
About the Authors
About the Reviewers
www.PacktPub.com
Customer Feedback
Preface
Part 1 - Overview
Part 2 - Technical Building blocks of Data Lake
Part 3 - Bringing It All Together

Where does this data live in an enterprise?


The data in an enterprise lives in different formats in the form of raw data, binaries (images and videos), and so on, and in different application's persistent storage internally within an organization or externally in a private or public cloud. Let's first classify these different types of data. One way to categorize where the data lives is as follows:

  • Intranet (within enterprise)
  • Internet (external to enterprise)

Another way in which data living in an enterprise can be categorized is in the form of different formats in which they exist, as follows:

  • Data stores or persistent stores (RDBMS or NoSQL)
  • Traditional data warehouses (making use of RDBMS, NoSQL etc.)
  • File stores

Now let's get into a bit more detail about these different data categories.

Intranet (within enterprise)

In simple terms, enterprise data that only exists and lives within its own private network falls in the category of intranet.

Various applications within an enterprise exist within the enterprise's own network, and access is denied to others apart from designated employees. Due to this reason, the data captured using these applications lives within an enterprise in a secure and private fashion.

The applications churning out this data can be data of the employees or various transactional data captured while using enterprises in day-to-day applications.

Technologies used to establish intranet for an enterprise include Local Area Network (LAN) and Wide Area Network (WAN). Also, there are multiple application platforms that can be used within an enterprise, enabling intranet culture within the enterprise and its employees. The data could be stored in a structured format in different stores, such as traditional RDBMS and NoSQL databases. In addition to these stores, there lies unstructured data in the form of different file types. Also, most enterprises have traditional data warehouses, where data is cleansed and made ready to be analyzed.

Internet (external to enterprise)

A decade or so ago, most enterprises had their own data centers, and almost all the data would reside in that. But with the evolution of cloud, enterprises are looking to put some data outside their own data center into cloud, with security and controls in place, so that the data is never accessed by unauthorized people. Going the cloud way also takes a good amount of operational costs away from the enterprise, and that is one of the biggest advantages. Let's get into the subcategories in this space in more detail.

Business applications hosted in cloud

With the various options provided by cloud providers, such as SaaS, PaaS, IaaS, and so on, there are ways in which business applications can be hosted in cloud, taking care of all the essential enterprise policies and governance. Because of this, many enterprises have chosen this as a way to host internally developed applications in these cloud providers. Employees use these applications from the cloud and go about doing their day-to-day operations very similar to how they would have for a business application hosted within an enterprise's own data center.

Third–party cloud solutions

With so many companies now providing their applications/services hosted in cloud, enterprises needing them could use these as is and not worry about maintaining and managing on-premises infrastructure. These products, just by being on the cloud, provide enterprises with huge incentives with regard to how they charge for these services.

Due to this benefit, enterprises favorably choose these cloud products, and due to its mere nature, the enterprises now save their data (very specific to their business)in the cloud on someone else infrastructure, with the cloud provider having full control on how these data live in there.

Google BigQuery is one such piece of software, which, as a service product, allows us to export the enterprise data to their cloud, running this software for various kinds of analysis work. The good thing about these products is that after the analysis, we can decide on whether to keep this data for future use or just discard it. Due to the elastic (ability to expand and contract at will, with regard to hardware in this case) nature of cloud, you can very well ask for a big machine if your analysis is complex, and after use, you can just discard or reduce these servers back to their old configuration.

Due to this nature, Google BigQuery calls itself anEnterprise Cloud Data Warehouse, and it does stay true to its promise. It gives speed and scale to enterprises along with the important security, reliability, and availability. It also gives integration with other similar software products again in cloud for various other needs.

Google BigQuery is just one example; there are other similar software available in cloud with varying degrees of features. Enterprises nowadays need to do many things quickly, and they don't want to spend time doing research on this and hosting these in their own infrastructure due to various overheads; these solutions give all they want without much trouble and at a very handy price tag.

The list of such solutions at this stage is ever growing, and I don't think that naming these is required. So we picked BigQuery as an example to explain this very nature.

Similar to software as a service available in the cloud, there are many business applications available in cloud as services. One such example is Salesforce. Basically, Salesforce is a Customer Relationship Management (CRM) solution, but it does have many packaged features in it. It's not a sales pitch, but I just want to give some very important features such business applications in cloud bring to the enterprise. Salesforce brings all the customer information together and allows enterprises to build a customer-centric business model from sales, business analysis, and customer service.

Being in cloud, it also brings many of the features that software as a service in cloud brings.

Because of the ever-increasing impact of cloud on enterprises, a good amount of enterprise data now lives on the Internet (in cloud), obviously taking care of privacy and other common features an enterprise data should comply with to safeguard enterprise’s business objectives.

Social data (structured and unstructured)

Social connection of an enterprise nowadays is quite crucial, and even though enterprise data doesn’t live in social sites, it does have rich information fed by the real customer on enterprise business and its services.

Comments and suggestions on these special sites can indeed be used to reinvent the way enterprises do business and interact with the customers.

Comments in these social sites can damage the reputation and brand of an enterprise if no due diligence in taken on these comments from customers. The enterprise takes these social sites really seriously nowadays, because of which even though it doesn't have enterprise data, it does have customer reviews and comments, which, in a way, show how customer perceive the brand.

Because of this nature, I would like to classify this data also as enterprise data fed in by non-enterprise users. Its very important to take care of the fourth V, namely veracity in big data while analyzing this data as there are people out there who want to use these just as channels to get some undue advantages while dealing with the enterprise in the process of the business.Another way of categorizing enterprise data is by the way the data is finally getting stored. Let's see this categorization in more detail in the following section.

Data stores or persistent stores (RDBMS or NoSQL)

This data, whether on premises (enterprise infrastructure) or in cloud, is stored as structured data in the so-called traditional RDBMS or new generation NoSQL persistent stores. This data comes into these stores through business applications, and most of the data is scattered in nature, and enterprises can easily find a sense of each and every data captured without much trouble. The main issue when data is stored in a traditional RDBMS kind of store is when the amount of data grows beyond an acceptable state. In that situation, the amount of analysis that we can make of the data takes a good amount of effort and time. Because of this, enterprises force themselves to segregate this data into production (data that can be queried and made use of by the business application) and non-production (data that is old and not in the production system, rather moved to a different storage).

Because of this segregation, analysis usually spans a few years and doesn't give enterprises a large span of how the business was dealing with certain business parameters. Say for example, if the production has five years of sales data, and 15 years of sales data is in the non-production storage, the users, when dealing with sales data analysis, just have a view of the last five years of data. There might be trends that are changing every five years, and this can only be known when we do an analysis of 20 years of sales data. Most of the time, because of RDBMS, storing and analyzing huge data is not possible. Even if this is possible, it is time consuming and doesn't give a great deal of flexibility, which an analyst looks for. This renders to the analyst a certain restricted analysis, which can be a big problem if the enterprise is looking into this data for business process tweaks.

The so-called new generation NoSQL (different databases in this space have different capabilities) gives more flexibility on analysis and the amount of data storage. It also gives the kind of performance and other aspects that analysts look for, but it still lacks certain aspects.

Even though the data is stored in an individual business application, it doesn't have a single view from various business application data, and that is what implementing a proper Data lake would bring into the enterprise.

Traditional data warehouse

As explained in the previous section, due to the amount of data captured in production business applications, almost all the time, the data in production is segregated from non-production. The non-production data usually lives in different forms/areas of the enterprise and flows into a different data store (usually RDBMS or NoSQL) called the data warehouse. Usually, the data is cleansed and cut out as required by the data analyst. Cutting out the data again puts a boundary on the type of analysis an analyst can do on the data. In most cases, there should be hidden gems of data that haven’t flown into the data warehouse, which would result in more analysis, using which the enterprises can tweak certain processes; however, since they are cleansed and cut out, this innovative analysis never happens. This aspect is also something that needs correction. The Data lake approach explained in this book allows the analyst to bring in any data captured in the production business application to do any analysis as the case may be.

The way these data warehouses are created today is by employing an ETL (Extract, Transform, Load) from the production database to the data warehouse database. ETL is entrusted with cleaning the data as needed by the analyst who works with these data warehouses for various analyses.

File stores

Business applications are ever changing, and new applications allow the end users to capture data in different formats apart from keying in data (using a keyboard), which are structured in nature.

Another way in which the end users now feed in data is in the form of documents in different formats. Some of the well-known formats are as follows:

  • Different document formats (PDF, DOC, XLS, and so on)
  • Binary formats
    • Image-based formats (JPG, PNG, and so on)
    • Audio formats (MP3, RAM, AC3)
    • Video formats (MP4, MPEG, MKV)

As you saw in the previous sections, dealing with structured data itself is in question, and now we are bringing in the analysis of unstructured data. But analysis of this data is also as important nowadays as structured ones. By implementing Data lake, we could bring in new technologies surrounding this lake, which will allow us to make some good value out of this unstructured data as well, using the latest and greatest technologies in this space.

Apart from various file formats and data living in it, we have many applications that allow end users to capture a huge amount of data in the form of sentences, which also need analysis. To deal with these comments from end users manually is a Herculean task, and in this modern age, we need to decipher the sentences/comments in an automatic fashion and get a view of their sentiment. Again, there are many such technologies available that can make sense of this data (free flowing text) and help enterprises deal with it in the right fashion.

For example, if we do have a suggestion capturing system in place for an enterprise and (let's say) we have close to 1000 suggestions that we get in a day, because of the nature of the business, it's very hard to get into the filtering of these suggestions. Here, we could use technologies aiding in the sentiment analysis of these comments, and according to the rating these analysis tools provide, perform an initial level of filtering and then hand it over to the human who can understand and make use of it.