Book Image

Practical Data Wrangling

By : Allan Visochek
Book Image

Practical Data Wrangling

By: Allan Visochek

Overview of this book

Around 80% of time in data analysis is spent on cleaning and preparing data for analysis. This is, however, an important task, and is a prerequisite to the rest of the data analysis workflow, including visualization, analysis and reporting. Python and R are considered a popular choice of tool for data analysis, and have packages that can be best used to manipulate different kinds of data, as per your requirements. This book will show you the different data wrangling techniques, and how you can leverage the power of Python and R packages to implement them. You’ll start by understanding the data wrangling process and get a solid foundation to work with different types of data. You’ll work with different data structures and acquire and parse data from various locations. You’ll also see how to reshape the layout of data and manipulate, summarize, and join data sets. Finally, we conclude with a quick primer on accessing and processing data from databases, conducting data exploration, and storing and retrieving data quickly using databases. The book includes practical examples on each of these points using simple and real-world data sets to give you an easier understanding. By the end of the book, you’ll have a thorough understanding of all the data wrangling concepts and how to implement them in the best possible way.
Table of Contents (16 chapters)
Title Page
Credits
About the Author
About the Reviewer
www.PacktPub.com
Customer Feedback
Preface

Chapter 1. Programming with Data

It takes a lot of time and effort to deliver data in a format that is ready for its end use. Let's use an example of an online gaming site that wants to post the high score for each of its games every month. In order to make this data available, the site's developers would need to set up a database to keep data on all of the scores. In addition, they would need a system to retrieve the top scores every month from that database and display it to the end users.

For the users of our hypothetical gaming site, getting this month's high scores is fairly straightforward. This is because finding out what the high scores are is a rather general use case. A lot of people will want that specific data in that specific form, so it makes sense to develop a system to deliver the monthly high scores.

Unlike the users of our hypothetical gaming site, data programmers have very specialized use cases for the data that they work with. A data journalist following politics may want to visualize trends in government spending over the last few years. A machine learning engineer working in the medical industry may want to develop an algorithm to predict a patient's likelihood of returning to the hospital after a visit. A statistician working for the board of education may want to investigate the correlation between attendance and test scores. In the gaming site example, a data analyst may want to investigate how the distribution of scores changes based on the time of the day.

Note

A short side note on terminologyData science as an all encompassing term can be a bit elusive. As it is such a new field, the definition of a data scientist can change depending on who you ask. To be more general, the term data programmer will be used in this book to refer to anyone who will find data wrangling useful in their work.

Drawing insight from data requires that all the information that is needed is in a format that you can work with. Organizations that produce data (for example, governments, schools, hospitals, and web applications) can't anticipate the exact information that any given data programmer might need for their work. There are too many possible scenarios to make it worthwhile. Data is therefore generally made available in its raw format. Sometimes this is enough to work with, but usually it is not. Here are some common reasons:

  • There may be extra steps involved in getting the data
  • The information needed may be spread across multiple sources
  • Datasets may be too large to work with in their original format
  • There may be far more fields or information in a particular dataset than needed
  • Datasets may have misspellings, missing fields, mixed formats, incorrect entries, outliers, and so on
  • Datasets may be structured or formatted in a way that is not compatible with a particular application

Due to this, it is often the responsibility of the data programmer to perform the following functions:

  • Discover and gather the data that is needed (getting data)
  • Merge data from different sources if necessary (merging data)
  • Fix flaws in the data entries (cleaning data)
  • Extract the necessary data and put it in the proper structure (shaping data)
  • Store it in the proper format for further use (storing data)

This perspective helps give some context to the relevance and importance of data wrangling. Data wrangling is sometimes seen as the grunt work of the data programmer, but it is nevertheless an integral part of drawing insights from data. This book will guide you through the various skill sets, most common tools, and best practices for data wrangling. In the following section, I will break down the tasks involved in data wrangling and provide a broad overview of the rest of the book. I will discuss the following steps in detail and provide some examples:

  • Getting data
  • Cleaning data
  • Merging and shaping data
  • Storing data

Following the high-level overview, I will briefly discuss Python and R, the tools used in this book to conduct data wrangling.