Practical Data Analysis - Second Edition

By: Hector Cuesta, Dr. Sampath Kumar

Overview of this book

Beyond buzzwords like Big Data or Data Science, there are great opportunities to innovate in many businesses by using data analysis to build data-driven products. Data analysis involves asking many questions about data in order to discover insights and generate value for a product or service. This book explains basic data algorithms without the theoretical jargon, and you'll get hands-on experience turning data into insights using machine learning techniques. We will work through data-driven innovation for several types of data, such as text, images, social network graphs, documents, and time series, and show you how to implement large-scale data processing with MongoDB and Apache Spark.

What about big data?


Big data is a term used when data exceeds the processing capacity of a typical database. The integration of computer technology into science and daily life has enabled the collection of massive volumes of data, such as climate data, website transaction logs, customer data, and credit card records. However, such big datasets cannot practically be managed on a single commodity computer, because they are too large to fit in memory or take too long to process. To overcome this obstacle, we can turn to parallel and distributed architectures, with multicore and cloud computing platforms providing access to hundreds or thousands of processors. Parallel and distributed architectures bring new capabilities for storing and manipulating big data.
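The divide-and-combine idea behind these architectures can be sketched in a few lines of Python. This is only an illustration on one machine with threads, not a real cluster: the dataset is partitioned into chunks, each worker processes one partition independently, and the partial results are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    """Process one partition of the data (here: a sum of squares)."""
    return sum(x * x for x in chunk)

data = list(range(1_000_000))      # stand-in for a dataset too big for one pass
n_workers = 4
size = len(data) // n_workers
chunks = [data[i * size:(i + 1) * size] for i in range(n_workers)]

# Each worker handles one partition and the partial results are combined,
# mirroring how a cluster splits work across many machines.
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    total = sum(pool.map(process_chunk, chunks))

print(total == sum(x * x for x in data))  # → True
```

On a real cluster the partitions would live on different machines and the combine step would happen over the network, but the structure of the computation is the same.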

Now, big data is a reality: the variety, volume, and velocity of data coming from the Web, sensors, devices, audio, video, networks, log files, social media, and transactional applications have reached exceptional levels. Big data has also hit the business, government, and science sectors. This phenomenal growth means that we must understand not only big data itself, in order to interpret the information that truly counts, but also the possibilities of big data analytics.

There are three main features of big data:

  • Volume: Large amounts of data

  • Variety: Different types of structured, unstructured, and multistructured data

  • Velocity: Needs to be analyzed quickly

The following image shows the interaction between these three Vs:

We need big data analytics when data grows quickly and we need to uncover hidden patterns, unknown correlations, and other useful information that can be used to make better decisions. With big data analytics, data scientists and others can analyze huge volumes of data that conventional analytics and business intelligence solutions cannot handle, in order to transform business decisions for the future. Big data analytics is a workflow that distils terabytes of low-value data into a much smaller amount of high-value data.

Big data is an opportunity for any company to take advantage of data aggregation, data exhaust, and metadata. This makes big data a useful business analytics tool, although there is still a common misunderstanding of what big data actually is.

The most common architecture for big data processing is MapReduce, a programming model for processing large datasets in parallel on a distributed cluster.
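The model can be sketched with the classic word-count example. This is a minimal single-machine sketch of the three conceptual phases, not a distributed implementation: the map phase emits key-value pairs, the shuffle phase groups values by key, and the reduce phase combines each group into one result.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

documents = ["big data needs big ideas", "data beats ideas"]
mapped = list(chain.from_iterable(map_phase(d) for d in documents))
reduced = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
print(reduced)  # → {'big': 2, 'data': 2, 'needs': 1, 'ideas': 2, 'beats': 1}
```

In a real cluster such as Hadoop, the map calls run on many machines at once and the shuffle moves data between them over the network; the programmer still only writes the map and reduce functions.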

Apache Hadoop is the most popular implementation of MapReduce, and it is used to solve large-scale distributed data storage, analysis, and retrieval tasks. However, MapReduce is just one of three classes of technologies that store and manage big data. The other two classes are NoSQL and Massively Parallel Processing (MPP) data stores. In this book, we will implement MapReduce functions and NoSQL storage through MongoDB in Chapter 12, Data Processing and Aggregation with MongoDB, and Chapter 13, Working with MapReduce.

MongoDB provides us with document-oriented storage, high availability, and flexible map/reduce-style aggregation for data processing.

A paper published by IEEE in 2009, The Unreasonable Effectiveness of Data, says the following:

"But invariably, simple models and a lot of data trump over more elaborate models based on less data."

This is a fundamental idea in big data (you can find the full paper at http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35179.pdf). The trouble with real-world data is that the probability of finding false correlations is high, and it gets higher as the dataset grows. That's why, in this book, we will focus on meaningful data instead of big data.
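The false-correlation problem is easy to demonstrate: if we generate many columns of pure random noise and compute the correlation between every pair, some pairs will look "correlated" by chance alone, and the more columns we have, the more such pairs appear. The sketch below is a small illustration of this effect, using a plain-Python Pearson correlation.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spurious_pairs(n_vars, n_obs, threshold=0.5, seed=0):
    """Count pairs of independent noise series that still look 'correlated'."""
    rng = random.Random(seed)
    data = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    return sum(
        1
        for i in range(n_vars)
        for j in range(i + 1, n_vars)
        if abs(pearson(data[i], data[j])) > threshold
    )

# Every variable is independent noise, so any pair crossing the
# threshold is a false correlation; more variables mean more pairs.
print(spurious_pairs(5, 20), spurious_pairs(50, 20))
```

With 50 variables there are 1,225 pairs to test instead of 10, so chance alone produces far more apparent correlations, which is exactly why blindly mining ever-larger datasets is dangerous.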

One of the main challenges of big data is how to store, protect, back up, organize, and catalog data at petabyte scale. Another is the concept of data ubiquity: with the proliferation of smart devices carrying several sensors and cameras, the amount of data available for each person increases every minute. Big data systems must be able to process all of that data in real time.