Book Image

Practical Data Wrangling

By : Allan Visochek
Book Image

Practical Data Wrangling

By: Allan Visochek

Overview of this book

Around 80% of time in data analysis is spent on cleaning and preparing data for analysis. This is, however, an important task, and is a prerequisite to the rest of the data analysis workflow, including visualization, analysis and reporting. Python and R are considered a popular choice of tool for data analysis, and have packages that can be best used to manipulate different kinds of data, as per your requirements. This book will show you the different data wrangling techniques, and how you can leverage the power of Python and R packages to implement them. You’ll start by understanding the data wrangling process and get a solid foundation to work with different types of data. You’ll work with different data structures and acquire and parse data from various locations. You’ll also see how to reshape the layout of data and manipulate, summarize, and join data sets. Finally, we conclude with a quick primer on accessing and processing data from databases, conducting data exploration, and storing and retrieving data quickly using databases. The book includes practical examples on each of these points using simple and real-world data sets to give you an easier understanding. By the end of the book, you’ll have a thorough understanding of all the data wrangling concepts and how to implement them in the best possible way.
Table of Contents (16 chapters)
Title Page
Credits
About the Author
About the Reviewer
www.PacktPub.com
Customer Feedback
Preface

The tools for data wrangling


The most popular languages used for data wrangling are Python and R. I will use the remaining part of this chapter to introduce Python and R, and briefly discuss the differences between them.

Python

Python is a generalized programming language used for everything from web development (Django and Flask) to game development, and for scientific and numerical computation. See Python.org/about/apps/.

Python is really useful for data wrangling and scientific computing in general because it emphasizes simplicity, readability, and modularity.

To see this, take a look at a Python implementation of the hello world program, which prints the words Hello World!:

Print("Hello World!")

To do the same thing in Java, another popular programming language, we need something a bit more verbose:

System.out.println("Hello World!");

While this may not seem like a huge difference, extra research and consultation of documentation can add up, adding time to the data wrangling process.

Python also has built-in data structures that are relatively flexible in the way that they handle data.

Note

Data structures are abstractions that help organize the data in a program for easy manipulation. We will explore the various data structures in Python and R in Chapter 2, Introduction to Programming in Python.

This contributes to Python's relative ease of use, particularly when working with data on a low level.

Finally, because of Python's modularity and popularity within the scientific community, there are a number of packages built around Python that can be quite useful to us in data wrangling.

Note

Packages/modules/libraries are extensions of a language, or prewritten code in that language--typically built by individual users and the open source community--that add on functionality that is not built into the language. They can be imported in a program to include new tools. We will be leveraging packages throughout the book, both in R and Python, to extract, read, clean, shape, and store data.

R

R is both a programming language and an environment built specifically for statistical computing. This definition has been taken from the R website, r-project.org/about.html:

The term 'environment' is intended to characterize [R] as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.

In other words, one of the major differences between R and Python is that some of the most common functionalities for working with data--data handling and storage, visualization, statistical computation, and so on--come built in. A good example of this is linear modeling, a basic statistical method for modelling numerical data.

In R, linear modeling is a built-in functionality that is made very intuitive and straightforward, as we will see in Chapter 5, Manipulating Text Data - An Introduction to Regular Expressions. There are a number of ways to do linear modeling in Python, but they all require using external libraries and often doing extra work to get the data in the right format.

R also has a built-in data structure called a dataframe that can make manipulation of tabular data more intuitive. 

The big takeaway here is that there are benefits and trade-offs to both languages. In general, being able to use the right tool for the job can save an immense amount of time spent on data wrangling. It is therefore quite useful as a data programmer to have a good working knowledge of each language and know when to use one or the other.