Data Wrangling with Python

By: Dr. Tirthajyoti Sarkar, Shubhadeep Roychowdhury
Overview of this book

For data to be useful and meaningful, it must be curated and refined. Data Wrangling with Python teaches you the core ideas behind these processes and equips you with knowledge of the most popular tools and techniques in the domain. The book starts with the absolute basics of Python, focusing mainly on data structures. It then delves into the fundamental tools of data wrangling, such as the NumPy and pandas libraries. You'll learn why you should stay away from traditional ways of data cleaning, as done in other languages, and instead take advantage of the specialized pre-built routines in Python. You'll also see how to use the same Python backend to extract and transform data from an array of sources, including the internet, large database vaults, and Excel financial tables. To help you prepare for more challenging scenarios, you'll cover how to handle missing or incorrect data and reformat it based on the requirements of the downstream analytics tool. The book will further help you grasp concepts through real-world examples and datasets. By the end of this book, you will be confident in using a diverse array of sources to extract, clean, transform, and format your data efficiently.

Preface

Note

About

This section briefly introduces the authors, the coverage of this book, the technical skills you'll need to get started, and the hardware and software required to complete all of the included activities and exercises.

About the Book

For data to be useful and meaningful, it must be curated and refined. Data Wrangling with Python teaches you all the core ideas behind these processes and equips you with knowledge about the most popular tools and techniques in the domain.

The book starts with the absolute basics of Python, focusing mainly on data structures, and then quickly jumps into the NumPy and pandas libraries as the fundamental tools for data wrangling. We emphasize why you should stay away from the traditional way of data cleaning, as done in other languages, and take advantage of the specialized pre-built routines in Python. Thereafter, you will learn how, using the same Python backend, you can extract and transform data from a diverse array of sources, such as the internet, large database vaults, or Excel financial tables. Then, you will also learn how to handle missing or incorrect data, and reformat it based on the requirements from the downstream analytics tool. You will learn about these concepts through real-world examples and datasets.

By the end of this book, you will be confident enough to handle a myriad of sources to extract, clean, transform, and format your data efficiently.

About the Authors

Dr. Tirthajyoti Sarkar works as a senior principal engineer in the semiconductor technology domain, where he applies cutting-edge data science/machine learning techniques to design automation and predictive analytics. He writes regularly about Python programming and data science topics. He holds a Ph.D. from the University of Illinois, and certifications in artificial intelligence and machine learning from Stanford and MIT.

Shubhadeep Roychowdhury works as a senior software engineer at a Paris-based cybersecurity start-up, where he is applying state-of-the-art computer vision and data engineering algorithms and tools to develop cutting-edge products. He often writes about algorithm implementation in Python and similar topics. He holds a master's degree in computer science from West Bengal University of Technology and certifications in machine learning from Stanford.

Learning Objectives

  • Use and manipulate complex and simple data structures

  • Harness the full potential of DataFrames and numpy.array at run time

  • Perform web scraping with BeautifulSoup4 and html5lib

  • Execute advanced string search and manipulation with RegEX

  • Handle outliers and perform data imputation with Pandas

  • Use descriptive statistics and plotting techniques

  • Practice data wrangling and modeling using data generation techniques

Approach

Data Wrangling with Python takes a practical approach to equip beginners with the most essential data analysis tools in the shortest possible time. It contains multiple activities that use real-life business scenarios for you to practice and apply your new skills in a highly relevant context.

Audience

Data Wrangling with Python is designed for developers, data analysts, and business analysts who are keen to pursue a career as a full-fledged data scientist or analytics expert. Although this book is for beginners, prior working knowledge of Python is necessary to easily grasp the concepts covered here. It will also help to have a rudimentary knowledge of relational databases and SQL.

Minimum Hardware Requirements

For the optimal student experience, we recommend the following hardware configuration:

  • Processor: Intel Core i5 or equivalent

  • Memory: 8 GB RAM

  • Storage: 35 GB available space

Software Requirements

You'll also need the following software installed in advance:

  • OS: Windows 7 SP1 64-bit, Windows 8.1 64-bit or Windows 10 64-bit, Ubuntu Linux, or the latest version of macOS

  • Processor: Intel Core i5 or equivalent

  • Memory: 4 GB RAM (8 GB Preferred)

  • Storage: 35 GB available space

Conventions

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "This will return the value associated with it: ["list_element1", 34]"

A block of code is set as follows:

list_1 = []
for x in range(0, 10):
    list_1.append(x)
list_1

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Click on New and choose Python 3."

Installation and Setup

Each great journey begins with a humble step. Our upcoming adventure in the land of data wrangling is no exception. Before we can do awesome things with data, we need to be prepared with the most productive environment. In this short section, we shall see how to do that.

The only prerequisite regarding the environment for this book is to have Docker installed. If you have never heard of Docker or you have only a very faint idea what it is, then fear not. All you need to know about Docker for the purpose of this book is this: Docker is a lightweight containerization engine that runs on all three major platforms (Linux, Windows, and macOS). The main idea behind Docker is to give you safe, easy, and lightweight virtualization on top of your native OS.

Install Docker

  1. To install Docker on a Mac or Windows machine, create an account on Docker and download the latest version. It's easy to install and set up.

  2. Once you have set up Docker, open a shell (or Terminal if you are a Mac user) and type the following command to verify that the installation has been successful:

    docker version

    If the output shows you the server and client version of Docker, then you are all set up.
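The same check can also be scripted. The following is only a minimal sketch, assuming Python 3 is available on the host machine; the helper names docker_cli_path and docker_version_ok are hypothetical and are not part of the book's code bundle. It looks for the docker executable on the PATH and then runs docker version, treating a non-zero exit code as a failed check.

```python
import shutil
import subprocess


def docker_cli_path():
    """Return the path of the docker executable, or None if it is not on PATH."""
    return shutil.which("docker")


def docker_version_ok(timeout=30):
    """Run `docker version` and report whether it exited successfully.

    A non-zero exit code usually means the client is installed but could
    not reach the Docker server.
    """
    if docker_cli_path() is None:
        return False
    try:
        result = subprocess.run(
            ["docker", "version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("docker found at:", docker_cli_path())
    print("docker version succeeded:", docker_version_ok())
```

If the second line prints False even though the executable was found, the Docker server is most likely not running yet.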

Pull the image

  1. Pull the image and you will have all the necessary packages (including Python 3.6.6) installed and ready for you to start working. Type the following command in a shell:

    docker pull rcshubhadeep/packt-data-wrangling-base

  2. If you want to know the full list of all the packages and their versions included in this image, you can check out the requirements.txt file in the setup folder of the source code repository of this book. Downloading the image may take some time, depending on your connection speed. Once the image is there, you are ready to roll.
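Once the environment is up and running, a short cell such as the one below could serve as a sanity check that the main libraries named in this preface can be imported. This is a sketch, not the book's own procedure: the module list is an assumption based on the libraries mentioned earlier (BeautifulSoup4 is imported under the name bs4), and the authoritative list remains the requirements.txt file.

```python
import importlib.util
import sys

# Import names are assumptions; BeautifulSoup4 is imported as "bs4".
EXPECTED_MODULES = ["numpy", "pandas", "bs4", "html5lib"]


def missing_modules(names):
    """Return the subset of `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Python version:", sys.version.split()[0])
print("Missing modules:", missing_modules(EXPECTED_MODULES))
```

An empty list on the second line means every module in the list was found.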

Run the environment

  1. Run the image using the following command:

    docker run -p 8888:8888 -v "$(pwd)":/notebooks -it rcshubhadeep/packt-data-wrangling-base

    This will give you a ready-to-use environment.

  2. Open a browser tab in Chrome or Firefox and go to http://localhost:8888. You will be prompted to enter a token. The token is dw_4_all.

  3. Before you run the image, create a new folder and navigate to it from the shell using the cd command.

    Once you have created a notebook and saved it as an .ipynb file, you can use Ctrl + C to stop running the image.
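To make the parts of the run command easier to see, here is a small sketch that assembles it in Python. The function name build_run_command and the host_port parameter are hypothetical; the image name and container port come from the command above. The -p option publishes the container's port 8888 on the host, and the -v option mounts a host folder at /notebooks inside the container.

```python
import os
import shlex

IMAGE = "rcshubhadeep/packt-data-wrangling-base"  # image name from the text


def build_run_command(workdir, host_port=8888):
    """Build the `docker run` argument list for a given working folder.

    `workdir` is mounted at /notebooks in the container, and the
    container's port 8888 is published on `host_port` of the host.
    """
    workdir = os.path.abspath(workdir)
    return [
        "docker", "run",
        "-p", "{}:8888".format(host_port),
        "-v", "{}:/notebooks".format(workdir),
        "-it", IMAGE,
    ]


command = build_run_command(os.getcwd())
# Print a copy-pasteable version of the command for a POSIX shell.
print(" ".join(shlex.quote(part) for part in command))
```

Passing a different host_port (for example, 9000) would be one way to work around port 8888 already being in use; the browser address would then change accordingly.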

Introduction to Jupyter notebook

Project Jupyter is open source, free software that gives you the ability to run code, written in Python and some other languages, interactively from a special notebook that you work with in a browser. It was born in 2014 from the IPython project and has since become a default choice for much of the data science workforce.

  1. Once you are running the Jupyter server, click on New and choose Python 3. A new browser tab will open with a new and empty notebook. Rename the Jupyter file:

    Figure 0.1: Jupyter server interface

    The main building blocks of Jupyter notebooks are cells. There are two types of cells: In (short for input) and Out (short for output). You can write code, normal text, and Markdown in In cells. When you press Shift + Enter (or Shift + Return), the code written in that particular In cell is executed, the result is shown in an Out cell, and you land in a new In cell, ready for the next block of code. Once you get used to this interface, you will slowly discover the power and flexibility it offers.

  2. One final thing you should know about Jupyter cells is that when you start a new cell, by default, it is assumed that you will write code in it. However, if you want to write text, then you have to change the type. You can do that using the following sequence of keys: Escape->m->Enter:

    Figure 0.2: Jupyter notebook

  3. When you have finished writing the text, execute it using Shift + Enter. Unlike with code cells, the rendered Markdown is shown in the same place as the "In" cell.
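Behind the interface, a saved .ipynb file is a JSON document in which every cell records whether it holds code or Markdown. The sketch below builds such a document by hand, purely for illustration; the cell contents are made up, and in practice Jupyter writes this file for you when you save the notebook.

```python
import json

# A minimal notebook with one Markdown cell and one code cell.
# Field names follow version 4 of the notebook format; the cell
# contents are illustrative only.
notebook = {
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# My first notebook\n", "Some *explanatory* text."],
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["list_1 = list(range(0, 10))\n", "list_1"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 2,
}

# Serialize to the text that would be stored in the .ipynb file.
text = json.dumps(notebook, indent=1)

cell_types = [cell["cell_type"] for cell in notebook["cells"]]
print(cell_types)
```

Switching a cell with Escape->m->Enter corresponds to changing its type from code to markdown in this structure.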

Note

For a cheat sheet of all the handy keyboard shortcuts in Jupyter, you can bookmark this Gist: https://gist.github.com/kidpixo/f4318f8c8143adee5b40.

With this basic introduction and the image ready to be used, we are ready to embark on the exciting and enlightening journey that awaits us!

Installing the Code Bundle

Copy the code bundle for the class to the C:/Code folder.

Additional Resources

The code bundle for this book is also hosted on GitHub at: https://github.com/TrainingByPackt/Data-Wrangling-with-Python.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!