Practical Data Wrangling

By: Allan Visochek

Overview of this book

Around 80% of the time in data analysis is spent cleaning and preparing data. This is an important task, however, and a prerequisite to the rest of the data analysis workflow, including visualization, analysis, and reporting. Python and R are popular tools for data analysis, and both offer packages for manipulating different kinds of data. This book shows you different data wrangling techniques and how to use Python and R packages to implement them. You’ll start by understanding the data wrangling process and building a solid foundation for working with different types of data. You’ll work with different data structures, and acquire and parse data from various locations. You’ll also see how to reshape the layout of data and how to manipulate, summarize, and join datasets. Finally, the book concludes with a quick primer on accessing and processing data from databases, conducting data exploration, and storing and retrieving data quickly using databases. The book includes practical examples of each of these points, using simple, real-world datasets to make them easier to understand. By the end of the book, you’ll have a thorough understanding of the core data wrangling concepts and how to implement them.

Title Page

Credits

About the Author

About the Reviewer

www.PacktPub.com

Customer Feedback

Preface

Programming with Data

Understanding data wrangling

The tools for data wrangling

Summary

Introduction to Programming in Python

External resources

Logistical overview

Running programs in Python

Data types, variables, and the Python shell

Compound statements

Making annotations within programs

A programmer's resources

Summary

Reading, Exploring, and Modifying Data - Part I

External resources

Logistical overview

Introducing a basic data wrangling work flow

Introducing the JSON file format

Opening and closing a file in Python using file I/O

Reading the contents of a file

Exploring the contents of a data file

Modifying a dataset

Outputting the modified data to a new file

Specifying input and output file names in the Terminal

Summary

Reading, Exploring, and Modifying Data - Part II

Logistical overview

Understanding the CSV format

Introducing the CSV module

Using the CSV module to read CSV data

Using the CSV module to write CSV data

Using the pandas module to read and process data

Handling non-standard CSV encoding and dialect

Understanding XML

Using the XML module to parse XML data

Summary

Manipulating Text Data - An Introduction to Regular Expressions

Logistical overview

Understanding the need for pattern recognition

Introducing regular expressions

Looking for patterns

Quantifying the existence of patterns

Extracting patterns

Summary

Cleaning Numerical Data - An Introduction to R and RStudio

Logistical overview

Introducing R and RStudio

Familiarizing yourself with RStudio

Conducting basic outlier detection and removal

Handling NA values

Variable names and contents

Summary

Simplifying Data Manipulation with dplyr

Logistical overview

Introducing dplyr

Getting started with dplyr

Chaining operations together

Filtering the rows of a dataframe

Summarizing data by category

Rewriting code using dplyr

Summary

Getting Data from the Web

Logistical overview

Introducing APIs

Using Python to retrieve data from APIs

Using URL parameters to filter the results

Summary

Working with Large Datasets

Logistical overview

Understanding computer memory

Understanding databases

Introducing MongoDB

Interfacing with MongoDB from Python

Summary

Understanding data wrangling

Data wrangling, broadly speaking, is the process of gathering data in its raw form and molding it into a form that is suitable for its end use. Preparing data for its end use can branch out into a number of different tasks, depending on the exact use case. This makes it hard to pin down exactly what data wrangling entails and to formulate how to go about it. Nevertheless, the data wrangling process has a number of common steps, as outlined in the following subsections. The approach that I will take in this book is to introduce a number of tools and practices that are often involved in data wrangling. Each chapter consists of one or more exercises or projects that demonstrate the application of a particular tool or approach.

Getting and reading data

The first step is to retrieve a dataset and open it with a program capable of manipulating the data. The simplest way of retrieving a dataset is to find a data file. Python and R can be used to open, read, modify, and save data stored in static files. In Chapter 3, Reading, Exploring, and Modifying Data - Part I, I will introduce the JSON data format and show how to use Python to read, write, and modify JSON data. In Chapter 4, Reading, Exploring, and Modifying Data - Part II, I will walk through how to use Python to work with data files in the CSV and XML data formats. In Chapter 6, Cleaning Numerical Data - An Introduction to R and RStudio, I will introduce R and RStudio, and show how to use R to read and manipulate data.
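As a rough illustration of the read, modify, and write cycle for a static file, the following sketch uses Python's built-in json module. The sample records, the file names, and the add_field helper are all hypothetical and exist only to keep the example self-contained; the book's own exercises may look quite different.

```python
import json
import os
import tempfile


def add_field(records, key, value):
    """Return a copy of each record with one extra key/value pair."""
    return [{**record, key: value} for record in records]


# Hypothetical sample data, written to a temporary file so the sketch
# does not depend on any existing file.
sample = [{"city": "A", "population": 1200}, {"city": "B", "population": 540}]
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "input.json")
out_path = os.path.join(workdir, "output.json")

with open(in_path, "w") as f:
    json.dump(sample, f)

# Read: parse the JSON file into Python lists and dictionaries.
with open(in_path) as f:
    records = json.load(f)

# Modify: tag every record with a (hypothetical) source label.
modified = add_field(records, "source", "example")

# Write: save the modified data to a new file.
with open(out_path, "w") as f:
    json.dump(modified, f, indent=2)
```

Writing the result to a new file, rather than overwriting the input, keeps the raw data available if a later step turns out to be wrong.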

Larger data sources are often made available through web interfaces called application programming interfaces (APIs). APIs allow you to retrieve specific bits of data from a larger collection of data. Web APIs can be great resources for data that is otherwise hard to get. In Chapter 8, Getting Data from the Web, I discuss APIs in detail and walk through the use of Python to extract data from APIs.
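To make the idea of retrieving specific bits of data more concrete, here is a minimal sketch of how a request to a web API is typically expressed as a URL with parameters. The endpoint https://api.example.com/records and the parameter names city and limit are invented for illustration; a real API documents its own URL and parameters.

```python
from urllib.parse import urlencode


def build_query_url(base_url, params):
    """Append URL parameters to a base URL to request a subset of the data."""
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and parameter names, for illustration only.
url = build_query_url(
    "https://api.example.com/records",
    {"city": "Springfield", "limit": 10},
)
print(url)

# Against a real JSON API, the response could then be fetched and parsed
# along these lines (commented out because the endpoint above is made up):
# import json
# from urllib.request import urlopen
# with urlopen(url) as response:
#     data = json.load(response)
```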

Another possible source of data is a database. I won't go into detail on the use of databases in this book, though in Chapter 9, Working with Large Datasets, I will show how to interact with a particular database using Python.

Note

Databases are collections of data that are organized to optimize the quick retrieval of data. They can be particularly useful when we need to work incrementally on very large datasets, and of course may be a source of data.

Cleaning data

When working with data, you can generally expect to find human errors, missing entries, and numerical outliers. These types of errors usually need to be corrected, handled, or removed to prepare a dataset for analysis.

In Chapter 5, Manipulating Text Data - An Introduction to Regular Expressions, I will demonstrate how to use regular expressions, a tool to identify, extract, and modify patterns in text data. That chapter includes a project that uses regular expressions to extract street names.
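A small sketch can give a feel for what extracting street names with a regular expression might look like. The pattern, the list of street suffixes, and the sample addresses below are assumptions made for this example and are not taken from the Chapter 5 project.

```python
import re

# Assumed pattern: a house number, then the street name, ending in one of a
# few common suffixes. Real address data would need a longer suffix list.
STREET_PATTERN = re.compile(
    r"^\s*\d+\s+(?P<street>.+?\s(?:Street|St|Avenue|Ave|Road|Rd))\b",
    re.IGNORECASE,
)


def extract_street(address):
    """Return the street name from an address string, or None if no match."""
    match = STREET_PATTERN.search(address)
    return match.group("street") if match else None


# Hypothetical addresses, including one that does not fit the pattern.
addresses = ["123 Main Street, Apt 4", "45 elm ave", "PO Box 12"]
streets = [extract_street(a) for a in addresses]
```

Returning None for entries that do not match makes it easy to count or inspect the addresses the pattern fails on, which is often where the data errors are.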

In Chapter 6, Cleaning Numerical Data - An Introduction to R and RStudio, I will demonstrate how to use RStudio to conduct two common tasks for cleaning numerical data: outlier detection and NA handling.
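The book carries out these two tasks in R. Purely as an illustration of the ideas, the sketch below does something similar in Python, representing missing entries as None and flagging values that lie far from the median. The outlier rule (more than k median absolute deviations from the median), the value of k, and the sample readings are assumptions chosen for this example, not the method used in Chapter 6.

```python
import statistics


def drop_missing(values):
    """Remove missing entries (represented here as None)."""
    return [v for v in values if v is not None]


def split_outliers(values, k=5.0):
    """Split values into (kept, outliers) by distance from the median.

    A value counts as an outlier when it lies more than k median absolute
    deviations (MAD) from the median. Both the rule and k are assumptions.
    """
    center = statistics.median(values)
    mad = statistics.median(abs(v - center) for v in values)
    kept = [v for v in values if abs(v - center) <= k * mad]
    outliers = [v for v in values if abs(v - center) > k * mad]
    return kept, outliers


# Hypothetical readings with one missing entry and one extreme value.
readings = [10, 12, 11, None, 13, 12, 500, 11]
present = drop_missing(readings)
kept, outliers = split_outliers(present)
```

Whether a flagged value should be removed, corrected, or kept depends on the dataset; the code only separates the candidates so that they can be reviewed.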

Shaping and structuring data

Preparing data for its end use often requires both structuring and organizing the data in the correct manner.

To illustrate this, suppose you have a hierarchical dataset of city populations, as shown in Figure 01:

Figure 01: Hierarchical structure of the population of cities

If the goal is to create a histogram of city populations, the previous data format would be hard to work with. Not only is the information of the city populations nested within the data structure, but it is nested to varying degrees of depth. For the purposes of creating a histogram, it is better to represent the data as a list of numbers, as shown in Figure 02:

Figure 02: List of populations for histogram visualization

Making structural changes like this for large datasets requires you to build programs that can extract the data from one format and put it into another format. Shaping data is an important part of data wrangling because it ensures that the data is compatible with its intended use. In Chapter 4, Reading, Exploring, and Modifying Data - Part II, I will walk through exercises to convert between data formats.
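One way to sketch the kind of program described here is a short recursive function that walks a nested structure and collects the numbers at its leaves. The hierarchy below is hypothetical (it is not the data from Figure 01), and collect_numbers is a name introduced only for this example.

```python
def collect_numbers(node):
    """Recursively gather every numeric leaf from nested dicts and lists."""
    if isinstance(node, dict):
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        # A leaf: keep it only if it is a number.
        return [node] if isinstance(node, (int, float)) else []
    numbers = []
    for child in children:
        numbers.extend(collect_numbers(child))
    return numbers


# Hypothetical hierarchy with populations nested at different depths.
regions = {
    "Region A": {"City 1": 1000, "City 2": 2500},
    "Region B": {"Subregion X": {"City 3": 400}, "City 4": 7200},
}
populations = collect_numbers(regions)
```

Because the function recurses until it reaches a leaf, it does not matter how deeply a given population is nested; the output is always a flat list that a histogram function can consume.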

Changing the form of data does not necessarily need to involve changing its structure. Changing the form of a dataset can involve filtering the data entries, reducing the data by category, changing the order of the rows, and changing the way columns are set up.

All of the previously mentioned tasks are features of the dplyr package for R. In Chapter 7, Simplifying Data Manipulation with dplyr, I will show how to use dplyr to easily and intuitively manipulate data.
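dplyr is an R package, so the following is not dplyr code. It is a plain Python sketch of the same kinds of operations (filtering entries, reducing by category, reordering, and changing the columns) on a small hypothetical dataset, intended only to show what each operation does to the data.

```python
from collections import defaultdict

# Hypothetical rows; each dictionary is one data entry.
rows = [
    {"city": "A", "state": "X", "population": 1000},
    {"city": "B", "state": "X", "population": 250},
    {"city": "C", "state": "Y", "population": 4000},
    {"city": "D", "state": "Y", "population": 600},
]

# Filter: keep only entries at or above a threshold.
large = [r for r in rows if r["population"] >= 500]

# Reduce by category: total population per state.
totals = defaultdict(int)
for r in large:
    totals[r["state"]] += r["population"]

# Reorder: sort the summary from largest to smallest total.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Change the columns: keep only the field needed downstream.
names_only = [{"city": r["city"]} for r in large]
```

None of these steps changes the structure of the data: the input and the output are both flat tables of rows and columns.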

Storing data

The last step after manipulating a dataset is to store the data for future use. The easiest way to do this is to store the data in a static file. I show how to output data to a static file in Python in Chapter 3, Reading, Exploring, and Modifying Data - Part I, and Chapter 4, Reading, Exploring, and Modifying Data - Part II. I show how to do this in R in Chapter 6, Cleaning Numerical Data - An Introduction to R and RStudio.
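As a minimal sketch of storing data in a static file, the example below writes a few hypothetical rows to a CSV file with Python's built-in csv module and then reads them back. The file name and the rows are made up for illustration.

```python
import csv
import os
import tempfile

# Hypothetical rows to store.
rows = [
    {"city": "A", "population": 1000},
    {"city": "B", "population": 250},
]
out_path = os.path.join(tempfile.mkdtemp(), "cities.csv")

# newline="" lets the csv module control line endings itself.
with open(out_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["city", "population"])
    writer.writeheader()
    writer.writerows(rows)

# Reading the file back shows that CSV stores every value as text.
with open(out_path, newline="") as f:
    restored = list(csv.DictReader(f))
```

Note that the populations come back as strings, so numeric columns need to be converted again after loading a CSV file.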

When working with large datasets, it can be helpful to have a system that allows you to store and quickly retrieve large amounts of data when needed.

In addition to being a potential source of data, databases can be very useful in the process of data wrangling as a means of storing data locally. In Chapter 9, Working with Large Datasets, I will briefly demonstrate the use of databases to store data.
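The book's table of contents indicates that Chapter 9 works with MongoDB. To keep this illustration self-contained, the sketch below swaps in SQLite, through Python's built-in sqlite3 module, as a stand-in database. The table name, columns, and records are hypothetical; the point is only to show storing data locally and retrieving just the part that is needed.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained; pass a file path
# instead of ":memory:" to persist the data on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT, population INTEGER)")

# Hypothetical records to store.
records = [("A", 1000), ("B", 250), ("C", 4000)]
conn.executemany("INSERT INTO cities VALUES (?, ?)", records)
conn.commit()

# Retrieve only the rows that are needed rather than loading everything.
large = conn.execute(
    "SELECT name, population FROM cities "
    "WHERE population > ? ORDER BY population DESC",
    (500,),
).fetchall()
conn.close()
```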
