Practical Big Data Analytics

By: Nataraj Dasgupta

Overview of this book

Big Data analytics relates to the strategies used by organizations to collect, organize, and analyze large amounts of data to uncover valuable business insights that cannot be analyzed through traditional systems. Crafting an enterprise-scale cost-efficient Big Data and machine learning solution to uncover insights and value from your organization’s data is a challenge. Today, with hundreds of new Big Data systems, machine learning packages, and BI tools, selecting the right combination of technologies is an even greater challenge. This book will help you do that. With the help of this guide, you will be able to bridge the gap between the theoretical world of technology and the practical reality of building corporate Big Data and data science platforms. You will get hands-on exposure to Hadoop and Spark, build machine learning dashboards using R and R Shiny, create web-based apps using NoSQL databases such as MongoDB, and even learn how to write R code for neural networks. By the end of the book, you will have a very clear and concrete understanding of what Big Data analytics means, how it drives revenues for organizations, and how you can develop your own Big Data analytics solution using the different tools and methods articulated in this book.

Preface

This book introduces the reader to a broad spectrum of topics related to big data as used in the enterprise. Big data is a vast area that encompasses elements of technology, statistics, visualization, business intelligence, and many other related disciplines. To get true value from data that often remains inaccessible, due to either volume or technical limitations, companies must leverage proper tools at both the software and the hardware level.

To that end, the book not only covers the theoretical and practical aspects of big data, but also supplements the information with high-level topics such as the use of big data in the enterprise, big data and data science initiatives and key considerations such as resources, hardware/software stack and other related topics. Such discussions would be useful for IT departments in organizations that are planning to implement or upgrade the organizational big data and/or data science platform.

The book focuses on three primary areas:

1. Data mining on large-scale datasets

Big data is ubiquitous today, just as the term data warehouse was omnipresent not too long ago. There are a myriad of solutions in the industry. In particular, Hadoop and products in the Hadoop ecosystem have become both popular and increasingly common in the enterprise. Further, more recent innovations such as Apache Spark have also found a permanent presence in the enterprise: Hadoop clients, realizing that they may not need the complexity of the Hadoop framework, have shifted to Spark in large numbers. Finally, NoSQL solutions such as MongoDB, Redis, and Cassandra, and commercial solutions such as Teradata, Vertica, and kdb+, have taken the place of more conventional database systems.

This book covers these areas in a fair degree of depth. Hadoop and related products such as Hive, HBase, Pig Latin, and others are covered. We also cover Spark and explain key Spark concepts such as Actions and Transformations. NoSQL solutions such as MongoDB and kdb+ are covered to a fair extent, and hands-on tutorials are provided.
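To make the distinction between transformations and actions more concrete, here is a minimal plain-Python sketch. It does not use the Spark API: `LazyDataset`, `square`, and the sample numbers are hypothetical names introduced only for illustration. The idea it imitates is that a transformation merely records work to be done, while an action triggers the computation and returns a result.

```python
from typing import Callable, Iterable, List


class LazyDataset:
    """Toy stand-in for a distributed dataset (hypothetical, not the Spark API)."""

    def __init__(self, source: Callable[[], Iterable[int]]):
        # Store a recipe for producing the data, not the data itself.
        self._source = source

    # Transformations: return a new LazyDataset and compute nothing yet.
    def map(self, fn: Callable[[int], int]) -> "LazyDataset":
        return LazyDataset(lambda: (fn(x) for x in self._source()))

    def filter(self, pred: Callable[[int], bool]) -> "LazyDataset":
        return LazyDataset(lambda: (x for x in self._source() if pred(x)))

    # Actions: run the recorded pipeline and return a concrete result.
    def collect(self) -> List[int]:
        return list(self._source())

    def count(self) -> int:
        return sum(1 for _ in self._source())


calls: List[int] = []


def square(x: int) -> int:
    calls.append(x)  # record when the work actually happens
    return x * x


numbers = LazyDataset(lambda: range(1, 6))
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)

calls_before_action = len(calls)  # nothing has been computed so far
result = evens_squared.collect()  # the action triggers the computation
```

In this sketch, chaining `map` and `filter` costs nothing until `collect` is called, which is the behavior the terms transformation and action are meant to capture.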

2. Machine learning and predictive analytics

The second topic covered is machine learning, also known by various other names, such as Predictive Analytics and Statistical Learning. Detailed explanations are provided, with corresponding machine learning code written using R and machine learning packages in R. Algorithms such as random forest, support vector machines, neural networks, stochastic gradient boosting, and decision trees are discussed. Further, key concepts in machine learning such as bias and variance, regularization, feature selection, and data pre-processing are also covered.
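As a small illustration of what regularization does, the following sketch fits a single slope through the origin with an L2 (ridge) penalty. It is written in plain Python rather than R so that it stands alone, and it is only a toy under stated assumptions: one feature, no intercept, and made-up data. The function name `fit_slope` and the `penalty` parameter are hypothetical.

```python
def fit_slope(xs, ys, penalty=0.0):
    """Least-squares slope through the origin with an L2 (ridge) penalty.

    Minimizes sum((y - w * x) ** 2) + penalty * w ** 2, whose closed-form
    solution is w = sum(x * y) / (sum(x * x) + penalty).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


# Made-up values that roughly follow y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

unregularized = fit_slope(xs, ys)              # ordinary least squares
regularized = fit_slope(xs, ys, penalty=10.0)  # coefficient shrunk toward zero
```

A larger penalty pulls the fitted coefficient toward zero, trading a little extra bias for lower variance.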

3. Data mining in the enterprise

In general, books that cover theoretical topics seldom discuss the more high-level aspects of big data, such as the key requirements for a successful big data initiative. The book includes survey results from IT executives and highlights the shared needs that are common across the industry. The book also includes a step-by-step guide on how to select the right use cases, whether for big data or for machine learning, based on lessons learned from deploying production solutions in large IT departments.

We believe that with a strong foundational knowledge of these three areas, any practitioner can deliver successful big data and/or data science projects. That is the primary intention behind the overall structure and content of the book.

Who this book is for

The book is intended for a diverse audience. In particular, readers who are keen on understanding the concepts of big data, data science, and/or machine learning at a holistic level, namely how they are all interrelated, will gain the most benefit from the book.

Technical audience: For technically minded readers, the book contains detailed explanations of the key industry tools for big data and machine learning. It covers hands-on exercises using Hadoop, developing machine learning use cases with the R programming language, and building comprehensive production-grade dashboards with R Shiny. Other tutorials in Spark and NoSQL are also included. Besides the practical aspects, the theoretical underpinnings of these key technologies are also explained.

Business audience: The extensive theoretical and practical treatment of big data has been supplemented with high level topics around the nuances of deploying and implementing robust big data solutions in the workplace. IT management, CIO organizations, business analytics and other groups who are tasked with defining the corporate strategy around data will find such information very useful and directly applicable.

What this book covers

Chapter 1, A Gentle Primer on Big Data, covers the basic concepts of big data and machine learning and the tools used, and gives a general understanding of what big data analytics pertains to.

Chapter 2, Getting started with Big Data Mining, introduces concepts of big data mining in an enterprise and provides an introduction to the software and hardware architecture stack for enterprise big data.

Chapter 3, The Analytics Toolkit, discusses the various tools used for big data and machine learning and provides step-by-step instructions on where users can download and install tools such as R, Python, and Hadoop.

Chapter 4, Big Data with Hadoop, looks at the fundamental concepts of Hadoop and delves into the detailed technical aspects of the Hadoop ecosystem. Core components of Hadoop such as Hadoop Distributed File System (HDFS), Hadoop YARN, and Hadoop MapReduce, and concepts in Hadoop 2 such as ResourceManager, NodeManager, and Application Master, are explained in this chapter. A step-by-step tutorial on using Hive via the Cloudera Distribution of Hadoop (CDH) is also included in the chapter.

Chapter 5, Big Data Analytics with NoSQL, looks at the various emerging and unique database solutions popularly known as NoSQL, which have upended the traditional model of relational databases. We will discuss the core concepts and technical aspects of NoSQL. The various types of NoSQL systems such as In-Memory, Columnar, Document-based, Key-Value, Graph, and others are covered in this chapter. A tutorial on MongoDB and the MongoDB Compass interface, as well as an extremely comprehensive tutorial on creating a production-grade R Shiny Dashboard with kdb+, are included.

Chapter 6, Spark for Big Data Analytics, looks at how to use Spark for big data analytics. Both high-level concepts and technical topics are covered, including key concepts such as SparkContext, Directed Acyclic Graphs, and Actions & Transformations. There is also a complete tutorial on using Spark on Databricks, a platform via which users can leverage Spark.

Chapter 7, A Gentle Introduction to Machine Learning Concepts, speaks about the fundamental concepts in machine learning. Further, core concepts such as supervised vs unsupervised learning, classification, regression, feature engineering, data preprocessing and cross-validation have been discussed. The chapter ends with a brief tutorial on using an R library for Neural Networks.

Chapter 8, Machine Learning Deep Dive, delves into some of the more involved aspects of machine learning. Algorithms, bias, variance, regularization, and various other concepts in machine learning are discussed in depth. The chapter also includes explanations of algorithms such as random forest, support vector machines, and decision trees. The chapter ends with a comprehensive tutorial on creating a web-based machine learning application.

Chapter 9, Enterprise Data Science, discusses the technical considerations for deploying enterprise-scale data science and big data solutions. We will also discuss the various ways enterprises across the world are implementing their big data strategies, including cloud-based solutions. A step-by-step tutorial on using AWS - Amazon Web Services has also been provided in the chapter.

Chapter 10, Closing Thoughts on Big Data, discusses corporate big data and Data Science strategies and concludes with some pointers on how to make big data related projects successful.

Appendix A, Further Reading on Big Data, contains links for a wider understanding of big data.

To get the most out of this book

  1. A general knowledge of Unix would be very helpful, although it isn't mandatory.
  2. Access to a computer with an internet connection will be needed in order to download the necessary tools and software used in the exercises.
  3. No prior knowledge of the subject area is assumed.
  4. Installation instructions for all the software and tools are provided in Chapter 3, The Analytics Toolkit.

Download the example code files

You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

  1. Log in or register at www.packtpub.com.
  2. Select the SUPPORT tab.
  3. Click on Code Downloads & Errata.
  4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR/7-Zip for Windows
  • Zipeg/iZip/UnRarX for Mac
  • 7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Practical-Big-Data-Analytics. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/PracticalBigDataAnalytics_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "The results are stored in HDFS under the /user/cloudera/output."

A block of code is set as follows:

{
   "_id" : ObjectId("597cdbb193acc5c362e7ae97"),
   "firstName" : "Nina",
   "age" : 53,
   "frequentFlyer" : [
          "Delta",
          "JetBlue",
          "Delta"
   ]
}

Any command-line input or output is written as follows:

$ cd Downloads/ # cd to the folder where you have downloaded the zip file

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "This sort of additional overhead can easily be alleviated by using virtual machines (VMs)"

Note

Warnings or important notes appear like this.

Tip

Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packtpub.com.