Practical Big Data Analytics

By : Nataraj Dasgupta

Overview of this book

Big Data analytics relates to the strategies used by organizations to collect, organize, and analyze large amounts of data to uncover valuable business insights that cannot be analyzed through traditional systems. Crafting an enterprise-scale cost-efficient Big Data and machine learning solution to uncover insights and value from your organization’s data is a challenge. Today, with hundreds of new Big Data systems, machine learning packages, and BI tools, selecting the right combination of technologies is an even greater challenge. This book will help you do that. With the help of this guide, you will be able to bridge the gap between the theoretical world of technology and the practical reality of building corporate Big Data and data science platforms. You will get hands-on exposure to Hadoop and Spark, build machine learning dashboards using R and R Shiny, create web-based apps using NoSQL databases such as MongoDB, and even learn how to write R code for neural networks. By the end of the book, you will have a very clear and concrete understanding of what Big Data analytics means, how it drives revenues for organizations, and how you can develop your own Big Data analytics solution using the different tools and methods articulated in this book.

Types of Big Data

Data can be broadly classified as being structured, unstructured, or semi-structured. Although these distinctions have always existed, the classification of data into these categories has become more prominent with the advent of big data.

Structured

Structured data, as the name implies, refers to datasets that have a defined organizational structure, such as the contents of Microsoft Excel or CSV files. In pure database terms, the data should be representable using a schema. As an example, the following table, which lists the top five happiest countries in the world as published by the United Nations in its 2017 World Happiness Report, is a typical representation of structured data.

We can clearly define the data types of the columns: Rank, Score, GDP per capita, Social support, Healthy life expectancy, Generosity, Trust, and Dystopia are numerical columns, whereas Country is represented using letters, or more specifically, strings.

Refer to the following table for a little more clarity:

Rank	Country	Score	GDP per capita	Social support	Healthy life expectancy	Generosity	Trust	Dystopia
1	Norway	7.537	1.616	1.534	0.797	0.362	0.316	2.277
2	Denmark	7.522	1.482	1.551	0.793	0.355	0.401	2.314
3	Iceland	7.504	1.481	1.611	0.834	0.476	0.154	2.323
4	Switzerland	7.494	1.565	1.517	0.858	0.291	0.367	2.277
5	Finland	7.469	1.444	1.54	0.809	0.245	0.383	2.43

World Happiness Report, 2017 [Source: https://en.wikipedia.org/wiki/World_Happiness_Report#cite_note-4]

Commercial databases such as Teradata and Greenplum, as well as Redis, Cassandra, and Hive in the open source domain, are examples of technologies that provide the ability to manage and query structured data.
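To make the idea of a schema concrete, here is a minimal Python sketch that loads a few rows of the table above into typed records. The class name HappinessRow, the helper load_rows, and the choice to keep only four of the columns are assumptions made for illustration; they are not taken from the book.

```python
import csv
import io
from dataclasses import dataclass
from typing import List

# First three rows of the table above, tab-separated, with only four
# of the columns kept to keep the sketch short.
RAW = (
    "Rank\tCountry\tScore\tGDP per capita\n"
    "1\tNorway\t7.537\t1.616\n"
    "2\tDenmark\t7.522\t1.482\n"
    "3\tIceland\t7.504\t1.481\n"
)


@dataclass
class HappinessRow:
    """Hypothetical schema: every row has the same fields and types."""
    rank: int
    country: str
    score: float
    gdp_per_capita: float


def load_rows(text: str) -> List[HappinessRow]:
    """Parse tab-separated text and convert each column to its declared type."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        HappinessRow(
            rank=int(row["Rank"]),
            country=row["Country"],
            score=float(row["Score"]),
            gdp_per_capita=float(row["GDP per capita"]),
        )
        for row in reader
    ]


rows = load_rows(RAW)
# Because the types are known, numeric comparisons work without extra parsing.
top = max(rows, key=lambda r: r.score)
print(top.country, top.score)
```

Because each column has a known type, a value that does not fit the schema (for example, text in the Score column) fails loudly at load time rather than silently corrupting later calculations.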

Unstructured

Unstructured data consists of any dataset that, unlike the table in the prior section, does not have a predefined organizational schema. Spoken words, music, videos, and even books, including this one, would be considered unstructured. This by no means implies that the content doesn’t have organization. Indeed, a book has a table of contents, chapters, subchapters, and an index; in that sense, it follows a definite organization.

However, it would be futile to represent every word and sentence as being part of a strict set of rules. A sentence can consist of words, numbers, punctuation marks, and so on and does not have a predefined data type as spreadsheets do. To be structured, the book would need to have an exact set of characteristics in every sentence, which would be both unreasonable and impractical.

Note

Data from social media, such as posts on Twitter, messages from friends on Facebook, and photos on Instagram, are all examples of unstructured data.

Unstructured data can be stored in various formats: as blobs (binary large objects) or, in the case of textual data, as freeform text held in a data storage medium. For textual data, technologies such as Lucene/Solr, Elasticsearch, and others are generally used for indexing, querying, and other operations.
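The following toy sketch illustrates what indexing and querying freeform text can mean in practice. It builds a simple inverted index (a map from each word to the documents that contain it) in plain Python. This is only an illustration of the general idea, not a description of how Lucene, Solr, or Elasticsearch work internally; the sample documents and the function names tokenize, build_index, and search are assumptions.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map every token to the set of document ids that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Made-up freeform documents with no shared schema.
docs = {
    "post-1": "Big data is everywhere!",
    "post-2": "Unstructured data: photos, posts, and videos.",
    "post-3": "A book has chapters and an index.",
}

index = build_index(docs)
print(sorted(search(index, "data")))
print(sorted(search(index, "data posts")))
```

The text itself stays unstructured; the index is a structure derived from it so that queries do not have to scan every document.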

Semi-structured

Semi-structured data refers to data that has both elements of an organizational schema and aspects that are arbitrary. A personal phone diary (increasingly rare these days!) with columns for name, address, phone number, and notes could be considered a semi-structured dataset. The user might not know the addresses of all individuals, and hence some of the entries may have just a phone number, or just an address.

Similarly, the column for notes may contain additional descriptive information (such as a facsimile number, name of a relative associated with the individual, and so on). It is an arbitrary field that allows the user to add complementary information. The columns for name, address, and phone number can thus be considered structured in the sense that they can be presented in a tabular format, whereas the notes section is unstructured in the sense that it may contain an arbitrary set of descriptive information that cannot be represented in the other columns in the diary.
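The phone diary can be sketched in Python as a list of records in which some fields are optional and the notes field is free text. The class DiaryEntry, the helper missing_fields, and the sample entries are hypothetical and exist only to illustrate the point.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DiaryEntry:
    """Hypothetical diary record: name is required, the rest may be absent."""
    name: str
    address: Optional[str] = None   # structured, but may be unknown
    phone: Optional[str] = None     # structured, but may be unknown
    notes: str = ""                 # unstructured free text


# Made-up entries: one without an address, one without a phone number.
diary = [
    DiaryEntry(
        name="A. Example",
        phone="555-0100",
        notes="fax 555-0101; sister of B. Example",
    ),
    DiaryEntry(name="B. Example", address="1 Sample Street"),
]


def missing_fields(entry: DiaryEntry) -> List[str]:
    """List the structured fields that were left empty for this entry."""
    return [
        field for field in ("address", "phone")
        if getattr(entry, field) is None
    ]


for entry in diary:
    print(entry.name, missing_fields(entry))
```

The name, address, and phone fields can be laid out as table columns, while whatever ends up in notes follows no fixed pattern.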

In computing, semi-structured data is usually represented in formats such as JSON, which can encapsulate both structured fields and schemaless or arbitrary associations, generally using key-value pairs. A more common example is an email message, which has a structured part that is common to all email messages (the name of the sender, the time when the message was received, and so on) and an unstructured part, namely the body or content of the email.

Platforms such as MongoDB and CouchDB are generally used to store and query semi-structured datasets.
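As a rough illustration of the email example, the sketch below uses Python's standard email and json modules to split a message into its structured headers and its unstructured body, and then prints the result as JSON key-value pairs. The sample message, the addresses, and the function name email_to_record are made up for this example.

```python
import json
from email import message_from_string

# A made-up, single-part plain-text message.
RAW_EMAIL = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Date: Mon, 02 Jan 2017 10:00:00 +0000\n"
    "Subject: Lunch?\n"
    "\n"
    "Hi Bob,\n"
    "are you free at noon? Bring the report.\n"
)


def email_to_record(raw: str) -> dict:
    """Separate the structured headers from the freeform body."""
    msg = message_from_string(raw)
    return {
        # Structured part: the same keys exist for every message.
        "from": msg["From"],
        "to": msg["To"],
        "date": msg["Date"],
        "subject": msg["Subject"],
        # Unstructured part: arbitrary text (a string for single-part mail).
        "body": msg.get_payload(),
    }


record = email_to_record(RAW_EMAIL)
print(json.dumps(record, indent=2))
```

A record of this shape, with predictable keys plus one freeform field, is the kind of document that stores for semi-structured data are designed to hold.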
