Practical Data Analysis - Second Edition

By : Hector Cuesta, Dr. Sampath Kumar

Overview of this book

Beyond buzzwords like Big Data or Data Science, there are great opportunities to innovate in many businesses by using data analysis to build data-driven products. Data analysis involves asking many questions about data in order to discover insights and generate value for a product or a service. This book explains the basic data algorithms without the theoretical jargon, and you’ll get hands-on experience turning data into insights using machine learning techniques. We will perform data-driven innovation processing for several types of data such as text, images, social network graphs, documents, and time series, showing you how to implement large-scale data processing with MongoDB and Apache Spark.

Credits

About the Authors

About the Reviewers

www.PacktPub.com

Preface

Getting Started

Computer science

Artificial intelligence

Data, information, and knowledge

The data analysis process

Quantitative versus qualitative data analysis

Importance of data visualization

What about big data?

Quantified self

Tools and toys for this book

Summary

Preprocessing Data

Data sources

Data scrubbing

Data formats

Data reduction methods

Getting started with OpenRefine

Summary

Getting to Grips with Visualization

What is visualization?

Working with web-based visualization

Exploring scientific visualization

Visualization in art

The visualization life cycle

Visualizing different types of data

Getting started with D3.js

Interaction and animation

Data from social networks

An overview of visual analytics

Summary

Text Classification

Learning and classification

Bayesian classification

E-mail subject line tester

The data

The algorithm

Classifier accuracy

Summary

Similarity-Based Image Retrieval

Image similarity search

Dynamic time warping

Processing the image dataset

Implementing DTW

Analyzing the results

Summary

Simulation of Stock Prices

Financial time series

Random Walk simulation

Monte Carlo methods

Generating random numbers

Implementation in D3.js

Quantitative analyst

Summary

Predicting Gold Prices

Working with time series data

Smoothing time series

Linear regression

The data - historical gold prices

Nonlinear regressions

Summary

Working with Support Vector Machines

Understanding the multivariate dataset

Dimensionality reduction

Getting started with SVM

Summary

Modeling Infectious Diseases with Cellular Automata

Introduction to epidemiology

The epidemic models

Modeling with Cellular Automaton

Simulation of the SIRS model in CA with D3.js

Summary

Working with Social Graphs

Structure of a graph

Social networks analysis

Acquiring the Facebook graph

Working with graphs using Gephi

Statistical analysis

Degree distribution

Transforming GDF to JSON

Graph visualization with D3.js

Summary

Working with Twitter Data

The anatomy of Twitter data

Using OAuth to access Twitter API

Getting started with Twython

Summary

Data Processing and Aggregation with MongoDB

Getting started with MongoDB

Data preparation

Group

Aggregation framework

Summary

Working with MapReduce

An overview of MapReduce

Programming model

Using MapReduce with MongoDB

Filtering the input collection

Grouping and aggregation

Counting the most common words in tweets

Summary

Online Data Analysis with Jupyter and Wakari

Getting started with Wakari

Getting started with IPython notebook

Introduction to image processing with PIL

Getting started with pandas

Sharing your Notebook

Summary

Understanding Data Processing using Apache Spark

Platform for data processing

An introduction to the distributed file system

An introduction to Apache Spark

Summary

Quantified self

Quantified self is self-knowledge through self-tracking with technology. With this approach, people can collect data about their own daily activities in terms of inputs, states, and performance. For example, inputs are food consumption or the quality of the surrounding air, states are mood or blood pressure, and performance is mental or physical condition. To collect this data, we can use wearable sensors and life logging. The quantified self process lets individuals quantify biometrics that they never knew existed, and it makes data collection cheaper and more convenient. People can track their insulin and cortisol levels and sequence their DNA. Using quantified self data, they can keep a closer watch on their overall health, diet, and level of physical activity.
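The input, state, and performance categories above can be sketched as a small record type. This is only an illustration: the class name, field names, and sample values are hypothetical and do not come from any particular device or from the book's own code.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class DailyLog:
    """One day of self-tracking data (all field names are hypothetical)."""
    calories_in: float  # input: food consumption
    mood: int           # state: self-reported mood, 1 (low) to 5 (high)
    systolic_bp: int    # state: blood pressure in mmHg
    steps: int          # performance: physical activity


def summarize(logs):
    """Average each tracked quantity over the logged days."""
    return {
        "calories_in": mean(log.calories_in for log in logs),
        "mood": mean(log.mood for log in logs),
        "systolic_bp": mean(log.systolic_bp for log in logs),
        "steps": mean(log.steps for log in logs),
    }


# Made-up sample values for three days.
logs = [
    DailyLog(2100, 4, 118, 9000),
    DailyLog(2500, 3, 122, 4000),
    DailyLog(1900, 5, 116, 11000),
]
summary = summarize(logs)
print(summary)
```

Keeping one record per day makes it straightforward to load the logs into a table later and to compare inputs against states or performance over time.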

These days, the use of self-tracking gadgets is increasing rapidly. If we pool the quantified self data of a specific group of people, we can apply predictive algorithms to this data to help diagnose patients in that group. That means quantified self data is very useful in certain medical contexts.

In the following screenshot, we can see some electronic gadgets that gather quantitative data:

Sensors and cameras

Interaction with the outside world is highly important in data analysis. Using sensors such as Radio-Frequency Identification (RFID) tags, or a smartphone to scan a Quick Response (QR) code, is an easy way of interacting directly with the customer, making recommendations, and analyzing consumer trends.

People also use their smartphones all the time, and the camera is one of their main tools. In Chapter 5, Similarity-Based Image Retrieval, we will use these digital images to perform a search by image. This can be used, for example, in face recognition or for finding recommendations of a restaurant just by taking a picture of the front door.

This interaction with the real world can give you a competitive advantage and a real-time data source directly from the customer.

Social network analysis

Nowadays, the Internet brings people together in many ways, most visibly through social media such as Facebook, Twitter, and LinkedIn. On these social networks, users work, play, socialize online, and demonstrate new forms of collaboration. Social networks play a crucial role in reshaping business models and open up numerous possibilities for studying human interaction and collective behavior.

In fact, if we want to identify the key individuals in a social system, we can build models by applying analytical techniques to social network data. This process is called Social Network Analysis (SNA).

Formally, SNA analyzes social relationships in terms of network theory, with nodes representing individuals and ties representing the relationships between them. Social networks create groups of related individuals (friendships) based on different aspects of their interaction. We can find out important information such as hobbies (for product recommendation) or who has the most influential opinion in a group (centrality). In Chapter 10, Working with Social Graphs, we will present a project, Who is your closest friend?, and we will show a solution for Twitter clustering.
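As a rough illustration of centrality, the following sketch computes degree centrality, one of the simplest centrality measures, for a tiny friendship graph. It is a minimal example under stated assumptions: the names and ties are made up, the function name is hypothetical, and degree centrality is only one of several measures that could stand in for "most influential".

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return each node's degree divided by the maximum possible degree (n - 1).

    `edges` is a list of undirected ties given as (node_a, node_b) pairs.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Hypothetical friendships: nodes are individuals, pairs are ties.
friendships = [
    ("ana", "ben"),
    ("ana", "cho"),
    ("ana", "dee"),
    ("ben", "cho"),
]
scores = degree_centrality(friendships)
most_central = max(scores, key=scores.get)
print(most_central, scores[most_central])
```

Here the individual with the most ties gets the highest score. On real social graphs, other measures such as betweenness or closeness centrality may rank people differently.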

Social networks are strongly connected, and these connections are often asymmetric. This makes SNA computationally expensive, so it calls for high-performance solutions that are less statistical and more algorithmic. Visualizing a social network can help us gain insight into how people are connected. We explore a graph by displaying its nodes and ties in various colors, sizes, and distributions. D3.js has animation capabilities that enable us to visualize a social graph with interactive animations. These help us to simulate behaviors like information diffusion or the distance between nodes.
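The distance between two nodes can be sketched with a breadth-first search over a directed graph, which also shows the effect of asymmetric connections. This is an illustrative example only: the "follows" data and the function name are hypothetical, and the distance is simply the number of ties on the shortest path.

```python
from collections import deque


def shortest_distance(graph, source, target):
    """Return the number of ties on the shortest directed path, or None if unreachable.

    `graph` maps each node to the list of nodes it points to.
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Hypothetical asymmetric ties: "ana follows ben" does not imply the reverse.
follows = {
    "ana": ["ben"],
    "ben": ["cho"],
    "cho": ["ana"],
    "dee": ["ana"],
}
print(shortest_distance(follows, "ana", "cho"))
print(shortest_distance(follows, "cho", "ana"))
print(shortest_distance(follows, "ana", "dee"))
```

Because the ties are directed, the distance from one person to another can differ from the distance back, and some people may not be reachable at all.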

Facebook processes more than 500 TB of data daily (images, text, video, likes, and relationships), and this amount of data calls for non-conventional tools such as NoSQL databases and MapReduce frameworks. In this book, we will work with MongoDB, a document-based NoSQL database, which also has great functions for aggregations and MapReduce processing.
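To give a feel for the MapReduce style of processing, here is a minimal in-memory sketch in plain Python. It does not use MongoDB or any MapReduce framework; the function names and the sample events are hypothetical and only mimic the map, group, and reduce steps on a handful of records.

```python
from collections import defaultdict


def map_phase(record):
    """Emit one (key, value) pair per record: the content type and a count of 1."""
    yield (record["type"], 1)


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single total."""
    return key, sum(values)


def map_reduce(records):
    """Run map, group the emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(key, values) for key, values in groups.items())


# Made-up activity events, one dictionary per document.
events = [
    {"user": "ana", "type": "like"},
    {"user": "ben", "type": "image"},
    {"user": "ana", "type": "like"},
    {"user": "cho", "type": "text"},
    {"user": "dee", "type": "like"},
]
counts = map_reduce(events)
print(counts)
```

In a real deployment, the map and reduce steps would run in parallel across many machines or inside the database, but the shape of the computation is the same.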
