Practical Data Wrangling

By: Allan Visochek

Overview of this book

Around 80% of the time in data analysis is spent on cleaning and preparing data. This work is a prerequisite to the rest of the data analysis workflow, including visualization, analysis, and reporting. Python and R are popular tools for data analysis, and both have packages for manipulating different kinds of data. This book will show you a range of data wrangling techniques and how to implement them with Python and R packages. You'll start by understanding the data wrangling process and building a solid foundation for working with different types of data. You'll work with different data structures and acquire and parse data from various locations. You'll also see how to reshape the layout of data and how to manipulate, summarize, and join datasets. Finally, the book concludes with a quick primer on conducting data exploration and on storing, accessing, and processing data using databases. Each of these points is illustrated with practical examples that use simple, real-world datasets. By the end of the book, you'll have a thorough understanding of the main data wrangling concepts and how to implement them.

Title Page

Credits

About the Author

About the Reviewer

www.PacktPub.com

Customer Feedback

Preface

Programming with Data

Understanding data wrangling

The tools for data wrangling

Summary

Introduction to Programming in Python

External resources

Logistical overview

Running programs in Python

Data types, variables, and the Python shell

Compound statements

Making annotations within programs

A programmer's resources

Summary

Reading, Exploring, and Modifying Data - Part I

External resources

Logistical overview

Introducing a basic data wrangling work flow

Introducing the JSON file format

Opening and closing a file in Python using file I/O

Reading the contents of a file

Exploring the contents of a data file

Modifying a dataset

Outputting the modified data to a new file

Specifying input and output file names in the Terminal

Summary

Reading, Exploring, and Modifying Data - Part II

Logistical overview

Understanding the CSV format

Introducing the CSV module

Using the CSV module to read CSV data

Using the CSV module to write CSV data

Using the pandas module to read and process data

Handling non-standard CSV encoding and dialect

Understanding XML

Using the XML module to parse XML data

Summary

Manipulating Text Data - An Introduction to Regular Expressions

Logistical overview

Understanding the need for pattern recognition

Introducing regular expressions

Looking for patterns

Quantifying the existence of patterns

Extracting patterns

Summary

Cleaning Numerical Data - An Introduction to R and RStudio

Logistical overview

Introducing R and RStudio

Familiarizing yourself with RStudio

Conducting basic outlier detection and removal

Handling NA values

Variable names and contents

Summary

Simplifying Data Manipulation with dplyr

Logistical overview

Introducing dplyr

Getting started with dplyr

Chaining operations together

Filtering the rows of a dataframe

Summarizing data by category

Rewriting code using dplyr

Summary

Getting Data from the Web

Logistical overview

Introducing APIs

Using Python to retrieve data from APIs

Using URL parameters to filter the results

Summary

Working with Large Datasets

Logistical overview

Understanding computer memory

Understanding databases

Introducing MongoDB

Interfacing with MongoDB from Python

Summary

Chapter 1. Programming with Data

It takes a lot of time and effort to deliver data in a format that is ready for its end use. Take the example of an online gaming site that wants to post the high score for each of its games every month. To make this data available, the site's developers would need to set up a database to store all of the scores. They would also need a system to retrieve the top scores from that database every month and display them to the end users.
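
To make the example concrete, here is a minimal sketch of the "retrieve the top scores every month" step. It assumes the scores are already available as simple in-memory records rather than in a real database; the game names, player names, and the `monthly_high_scores` function are hypothetical and used only for illustration.

```python
from datetime import date

# Hypothetical score records: (game, player, score, date played).
# A real site would read these from its database instead.
scores = [
    ("asteroids", "ana", 1200, date(2017, 10, 3)),
    ("asteroids", "ben", 1850, date(2017, 10, 17)),
    ("asteroids", "ana", 2100, date(2017, 11, 2)),
    ("tetris", "cai", 9400, date(2017, 10, 9)),
    ("tetris", "ben", 8700, date(2017, 10, 21)),
]


def monthly_high_scores(records):
    """Return {(game, year, month): (player, score)} holding the top score per group."""
    best = {}
    for game, player, score, played in records:
        key = (game, played.year, played.month)
        # Keep the record only if it beats the best score seen so far for this key.
        if key not in best or score > best[key][1]:
            best[key] = (player, score)
    return best


high_scores = monthly_high_scores(scores)
for (game, year, month), (player, score) in sorted(high_scores.items()):
    print(f"{year}-{month:02d} {game}: {player} with {score}")
```

Because every user wants the same summary, this logic only has to be written once and can then run on a schedule.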

For the users of our hypothetical gaming site, getting this month's high scores is fairly straightforward. This is because finding out what the high scores are is a rather general use case. A lot of people will want that specific data in that specific form, so it makes sense to develop a system to deliver the monthly high scores.

Unlike the users of our hypothetical gaming site, data programmers have very specialized use cases for the data that they work with. A data journalist following politics may want to visualize trends in government spending over the last few years. A machine learning engineer working in the medical industry may want to develop an algorithm to predict a patient's likelihood of returning to the hospital after a visit. A statistician working for the board of education may want to investigate the correlation between attendance and test scores. In the gaming site example, a data analyst may want to investigate how the distribution of scores changes based on the time of the day.

Note

A short side note on terminology: data science as an all-encompassing term can be a bit elusive. Because it is such a new field, the definition of a data scientist can change depending on who you ask. To be more general, this book uses the term data programmer to refer to anyone who will find data wrangling useful in their work.

Drawing insight from data requires that all of the information needed is in a format that you can work with. Organizations that produce data (for example, governments, schools, hospitals, and web applications) can't anticipate the exact information that any given data programmer might need for their work. There are too many possible scenarios to make it worthwhile. Data is therefore generally made available in its raw format. Sometimes this is enough to work with, but usually it is not. Here are some common reasons why raw data falls short:

There may be extra steps involved in getting the data
The information needed may be spread across multiple sources
Datasets may be too large to work with in their original format
There may be far more fields or information in a particular dataset than needed
Datasets may have misspellings, missing fields, mixed formats, incorrect entries, outliers, and so on
Datasets may be structured or formatted in a way that is not compatible with a particular application

Due to this, it is often the responsibility of the data programmer to perform the following functions:

Discover and gather the data that is needed (getting data)
Merge data from different sources if necessary (merging data)
Fix flaws in the data entries (cleaning data)
Extract the necessary data and put it in the proper structure (shaping data)
Store it in the proper format for further use (storing data)
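
The five functions above can be sketched as one small Python script. This is only an illustration under stated assumptions, not the approach taken later in the book: the two data sources are hypothetical and inlined as CSV text, the field names are invented, and the `read_rows` and `wrangle` helpers are made up for this example.

```python
import csv
import io
import json

# Getting data: two hypothetical raw sources, inlined here as CSV text.
ATTENDANCE_CSV = """student_id,days_present
1,170
2,
3,165
"""

SCORES_CSV = """student_id,name,test_score
1, Ana ,88
2,Ben,92
3,cai,79
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def wrangle(attendance_text, scores_text):
    attendance = read_rows(attendance_text)
    scores = read_rows(scores_text)

    # Merging data: join the two sources on student_id.
    by_id = {row["student_id"]: row for row in attendance}
    merged = [{**row, **by_id.get(row["student_id"], {})} for row in scores]

    # Cleaning data: drop rows with missing attendance and tidy up names.
    cleaned = []
    for row in merged:
        if not row.get("days_present"):
            continue
        row["name"] = row["name"].strip().title()
        cleaned.append(row)

    # Shaping data: keep only the needed fields and convert them to numbers.
    return [
        {
            "name": row["name"],
            "days_present": int(row["days_present"]),
            "test_score": int(row["test_score"]),
        }
        for row in cleaned
    ]


shaped = wrangle(ATTENDANCE_CSV, SCORES_CSV)

# Storing data: serialize the result as JSON for further use.
stored = json.dumps(shaped, indent=2)
print(stored)
```

In this sketch, the student with no attendance value is dropped during cleaning. Whether to drop, fill in, or flag such entries is a judgment call that depends on the end use of the data.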

This perspective helps give some context to the relevance and importance of data wrangling. Data wrangling is sometimes seen as the grunt work of the data programmer, but it is nevertheless an integral part of drawing insights from data. This book will guide you through the various skill sets, most common tools, and best practices for data wrangling. In the following section, I will break down the tasks involved in data wrangling and provide a broad overview of the rest of the book. I will discuss the following steps in detail and provide some examples:

Getting data
Cleaning data
Merging and shaping data
Storing data

Following the high-level overview, I will briefly discuss Python and R, the tools used in this book to conduct data wrangling.
