Book Image

Data Wrangling with R

By : Gustavo R Santos
Book Image

Data Wrangling with R

By: Gustavo R Santos

Overview of this book

In this information era, where large volumes of data are being generated every day, companies want to get a better grip on it to perform more efficiently than before. This is where skillful data analysts and data scientists come into play, wrangling and exploring data to generate valuable business insights. In order to do that, you’ll need plenty of tools that enable you to extract the most useful knowledge from data. Data Wrangling with R will help you to gain a deep understanding of ways to wrangle and prepare datasets for exploration, analysis, and modeling. This data book enables you to get your data ready for more optimized analyses, develop your first data model, and perform effective data visualization. The book begins by teaching you how to load and explore datasets. Then, you’ll get to grips with the modern concepts and tools of data wrangling. As data wrangling and visualization are intrinsically connected, you’ll go over best practices to plot data and extract insights from it. The chapters are designed in a way to help you learn all about modeling, as you will go through the construction of a data science project from end to end, and become familiar with the built-in RStudio, including an application built with Shiny dashboards. By the end of this book, you’ll have learned how to create your first data model and build an application with Shiny in R.
Table of Contents (21 chapters)
1
Part 1: Load and Explore Data
5
Part 2: Data Wrangling
12
Part 3: Data Visualization
16
Part 4: Modeling

What is data wrangling?

Data wrangling is the process of modifying, cleaning, organizing, and transforming data from one given state to another, with the objective of making it more appropriate for use in analytics and data science.

This concept is also referred to as data munging, and both words are related to the act of changing, manipulating, transforming, and incrementing your dataset.

I bet you’ve already performed data wrangling. It is a common task for all of us. Since our primary school years, we have been taught how to create a table and make counts to organize people’s opinions in a dataset. If you are familiar with MS Excel or similar tools, remember all the times you have sorted, filtered, or added columns to a table, not to mention all of those lookups that you may have performed. All of that is part of the data-wrangling process. Every task performed to somehow improve the data and make it more suitable for analysis can be considered data wrangling.

As a data scientist, you will constantly be provided with different kinds of data, with the mission of transforming the dataset into insights that will, consequentially, form the basis for business decisions. Unlike a few years ago, when the majority of data was presented in a structured form such as text or tables, nowadays, data can come in many other forms, including unstructured formats such as video, audio, or even a combination of those. Thus, it becomes clear that most of the time, data will not be presented ready to work and will require some effort to get it in a ready state, sometimes more than others.

Figure 1.1 – Data before and after wrangling

Figure 1.1 – Data before and after wrangling

Figure 1.1 is a visual representation of data wrangling. We see on the left-hand side three kinds of data points combined, and after sorting and tabulating, the data is clearer to be analyzed.

A wrangled dataset is easier to understand and to work with, creating the path to better analysis and modeling, as we shall see in the next section when we will learn why data wrangling is important to a data science project.