Book Image

The Data Wrangling Workshop - Second Edition

By : Brian Lipp, Shubhadeep Roychowdhury, Dr. Tirthajyoti Sarkar
Book Image

The Data Wrangling Workshop - Second Edition

By: Brian Lipp, Shubhadeep Roychowdhury, Dr. Tirthajyoti Sarkar

Overview of this book

While a huge amount of data is readily available to us, it is not useful in its raw form. For data to be meaningful, it must be curated and refined. If you’re a beginner, then The Data Wrangling Workshop will help to break down the process for you. You’ll start with the basics and build your knowledge, progressing from the core aspects behind data wrangling, to using the most popular tools and techniques. This book starts by showing you how to work with data structures using Python. Through examples and activities, you’ll understand why you should stay away from traditional methods of data cleaning used in other languages and take advantage of the specialized pre-built routines in Python. Later, you’ll learn how to use the same Python backend to extract and transform data from an array of sources, including the internet, large database vaults, and Excel financial tables. To help you prepare for more challenging scenarios, the book teaches you how to handle missing or incorrect data, and reformat it based on the requirements from your downstream analytics tool. By the end of this book, you will have developed a solid understanding of how to perform data wrangling with Python, and learned several techniques and best practices to extract, clean, transform, and format your data efficiently, from a diverse array of sources.
Table of Contents (11 chapters)
Preface

About the Book

While a huge amount of data is readily available to us, it is not useful in its raw form. For data to be meaningful, it must be curated and refined.

If you're a beginner, then The Data Wrangling Workshop, Second Edition will help to break down the process for you. You'll start with the basics and build your knowledge, progressing from the core aspects behind data wrangling, to using the most popular tools and techniques.

This book starts by showing you how to work with data structures using Python. Through examples and activities, you'll understand why you should stay away from traditional methods of data cleaning used in other languages and take advantage of the specialized pre-built routines in Python. Later, you'll learn how to use the same Python backend to extract and transform data from an array of sources, including the internet, large database vaults, and Excel financial tables. To help you prepare for more challenging scenarios, the book teaches you how to handle missing or incorrect data, and reformat it based on the requirements from your downstream analytics tool.

By the end of this book, you will have developed a solid understanding of how to perform data wrangling with Python, and learned several techniques and best practices to extract, clean, transform, and format your data efficiently, from a diverse array of sources.

Audience

The Data Wrangling Workshop, Second Edition is designed for developers, data analysts, and business analysts who are looking to pursue a career as a full-fledged data scientist or analytics expert. Although this book is for beginners who want to start data wrangling, prior working knowledge of the Python programming language is necessary to easily grasp the concepts covered here. It will also help to have rudimentary knowledge of relational databases and SQL.

About the Chapters

Chapter 1, Introduction to Data Wrangling with Python, describes the importance of data wrangling in data science and introduces the basic building blocks that are used in data wrangling.

Chapter 2, Advanced Operations on Built-in Data Structures, discusses advanced built-in data structures that can be used for complex data wrangling problems faced by data scientists. The chapter will also talk about working with files using standard Python libraries.

Chapter 3, Introduction to NumPy, Pandas, and Matplotlib, will introduce you to the fundamentals of the NumPy, Pandas, and Matplotlib libraries. These are fundamental libraries for when you are performing data wrangling. This chapter will teach you how to calculate descriptive statistics of a one-dimensional/multi-dimensional DataFrame.

Chapter 4, A Deep Dive into Data Wrangling with Python, will introduce working with pandas DataFrames, including coverage of advanced concepts such as subsetting, filtering, grouping, and much more.

Chapter 5, Getting Comfortable with Different Kinds of Data Sources, introduces you to the several diverse data sources you might encounter as a data wrangler. This chapter will provide you with the knowledge to read CSV, Excel, and JSON files into pandas DataFrames.

Chapter 6, Learning the Hidden Secrets of Data Wrangling, discusses data problems that arise in business use cases and how to resolve them. This chapter will give you the knowledge needed to be able to clean and handle real-life messy data.

Chapter 7, Advanced Web Scraping and Data Gathering, introduces you to the concepts of advanced web scraping and data gathering. It will enable you to use Python libraries such as requests and BeautifulSoup to read various web pages and gather data from them.

Chapter 8, RDBMS and SQL, will introduce you to the basics of using RDBMSes to query databases using Python and convert data from SQL and store it into a pandas DataFrame. A large part of the world's data is stored in RDBMSes, so it is necessary to master this topic if you want to become a successful data-wrangling expert.

Chapter 9, Applications in Business Use Cases and Conclusion of the Course, will enable you to utilize the skills you have learned through the course of the previous chapters. By the end of this chapter, you will be able to easily handle data wrangling tasks for business use cases.

Conventions

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "This will return the value associated with it – ["list_element1", 34]".

A block of code is set as follows:

list_1 = []
    for x in range(0, 10):
    list_1.append(x)
list_1

Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Click New and choose Python 3."

Code Presentation

Lines of code that span multiple lines are split using a backslash ( \ ). When the code is executed, Python will ignore the backslash, and treat the code on the next line as a direct continuation of the current line.

For example:

history = model.fit(X, y, epochs=100, batch_size=5, verbose=1, \
                    validation_split=0.2, shuffle=False)

Comments are added into code to help explain specific bits of logic. Single-line comments are denoted using the # symbol, as follows:

# Print the sizes of the dataset
print("Number of Examples in the Dataset = ", X.shape[0])
print("Number of Features for each example = ", X.shape[1])

Multi-line comments are enclosed by triple quotes, as shown below:

"""
Define a seed for the random number generator to ensure the 
result will be reproducible
"""
seed = 1
np.random.seed(seed)
random.set_seed(seed)

Setting up Your Environment

Before we explore the book in detail, we need to set up specific software and tools. In the following section, we shall see how to do that.

Installing Python

Installing Python on Windows

To install Python on Windows, do the following:

  1. Find your desired version of Python on the official installation page at https://www.anaconda.com/distribution/#windows.
  2. Ensure that you select Python 3.7 on the download page.
  3. Ensure that you install the correct architecture for your computer system, that is, either 32-bit or 64-bit. You can find out this information in the System Properties window of your OS.
  4. After you have downloaded the installer, simply double-click the file and follow the user-friendly prompts onscreen.

Installing Python on Linux

To install Python on Linux, you have a couple of options. Here is one option:

  1. Open the command line and verify that Python 3 is not already installed by running python3 --version.
  2. To install Python 3, run this:
    sudo apt-get update
    sudo apt-get install python3.7
  3. If you encounter problems, there are numerous sources online that can help you troubleshoot the issue.

Alternatively, install Anaconda Linux by downloading the installer from https://www.anaconda.com/distribution/#linux and following the instructions.

Installing Python on MacOS

Similar to the case with Linux, you have a couple of methods for installing Python on a Mac. To install Python on macOS X, do the following:

  1. Open the Terminal for Mac by pressing CMD + Spacebar, type terminal in the open search box, and hit Enter.
  2. Install Xcode through the command line by running xcode-select –install.
  3. The easiest way to install Python 3 is using Homebrew, which is installed through the command line by running ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)".
  4. Add Homebrew to your $PATH environment variable. Open your profile in the command line by running sudo nano ~/.profile and inserting export PATH="/usr/local/opt/python/libexec/bin:$PATH" at the bottom.
  5. The final step is to install Python. In the command line, run brew install python.
  6. You can also install Python via the Anaconda installer available from https://www.anaconda.com/distribution/#macos.

Installing Libraries

pip comes pre-installed with Anaconda. Once Anaconda is installed on your machine, all the required libraries can be installed using pip, for example, pip install numpy. Alternatively, you can install all the required libraries using pip install –r requirements.txt. You can find the requirements.txt file at https://packt.live/30UUshh.

The exercises and activities will be executed in Jupyter Notebooks. Jupyter is a Python library and can be installed in the same way as the other Python libraries – that is, with pip install jupyter, but fortunately, it comes pre-installed with Anaconda. To open a notebook, simply run the command jupyter notebook in the Terminal or Command Prompt.

Project Jupyter

Project Jupyter is open source, free software that gives you the ability to run code written in Python and some other languages interactively from a special notebook, similar to a browser interface. It was born in 2014 from the IPython project and has since become the default choice for the entire data science workforce.

Once you are running the Jupyter server, click New and choose Python 3. A new browser tab will open with a new and empty notebook. Rename the Jupyter file:

Figure 0.1: Jupyter server interface

Figure 0.1: Jupyter server interface

The main building blocks of Jupyter notebooks are cells. There are two types of cells: In (short for input) and Out (short for output). You can write code, normal text, and Markdown in In cells, press Shift + Enter (or Shift + Return), and the code written in that particular In cell will be executed. The result will be shown in an Out cell, and you will land in a new In cell, ready for the next block of code. Once you get used to this interface, you will slowly discover the power and flexibility it offers.

When you start a new cell, by default, it is assumed that you will write code in it. However, if you want to write text, then you have to change the type. You can do that using the following sequence of keys: Escape -> m -> Enter:

Figure 0.2: Jupyter notebook

Figure 0.2: Jupyter notebook

When you are done with writing the text, execute it using Shift + Enter. Unlike the code cells, the result of the compiled Markdown will be shown in the same place as the In cell.

To have a cheat sheet of all the handy key shortcuts in Jupyter Notebook, you can bookmark this Gist: https://gist.github.com/kidpixo/f4318f8c8143adee5b40. With this basic introduction and the image ready to be used, we are ready to embark on the exciting and enlightening journey.

Accessing the Code Files

You can find the complete code files of this book at https://packt.live/2YenXcb. You can also run many activities and exercises directly in your web browser by using the interactive lab environment at https://packt.live/2YKlrJQ.

We've tried to support interactive versions of all activities and exercises, but we recommend a local installation as well for instances where this support isn't available.

If you have any issues or questions about installation, please email us at [email protected].