Hands-On Data Preprocessing in Python

By: Roy Jafari

Overview of this book

Hands-On Data Preprocessing is a primer on data cleaning and preprocessing techniques, written by an expert who has developed college-level courses on data preprocessing and related subjects. The book approaches data preprocessing from multiple perspectives so that you get the best possible insights from your data. You'll learn about the technical and analytical aspects of data preprocessing (data collection, data cleaning, data integration, data reduction, and data transformation) and get to grips with implementing them in the open source Python programming environment. The hands-on examples and easy-to-follow chapters will help you build a comprehensive understanding of data preprocessing, its whys and hows, and identify opportunities where data analytics could lead to more effective decision making. As you progress through the chapters, you'll also understand the role of data management systems and technologies in effective analytics and how to use APIs to pull data. By the end of this book, you'll be able to use Python to read, manipulate, and analyze data; perform data cleaning, integration, reduction, and transformation; and handle outliers and missing values to prepare data for analytic tools.
Table of Contents (24 chapters)

  • Part 1: Technical Needs
  • Part 2: Analytic Goals
  • Part 3: The Preprocessing
  • Part 4: Case Studies

Overview of the Jupyter Notebook

The Jupyter Notebook is becoming increasingly popular as a User Interface (UI) for Python programming. As a UI, the Jupyter Notebook provides an interactive environment where you can run your Python code, see immediate outputs, and take notes.

Fernando Pérez and Brian Granger, the architects of the Jupyter Notebook, outline the following as what they were looking for in an innovative programming UI:

  • Space for individual exploratory work
  • Space for collaboration
  • Space for learning and education

If you have used the Jupyter Notebook already, you can attest that it delivers on all these promises, and if you have not yet used it, I have good news for you: we will be using the Jupyter Notebook for the entirety of this book. Some of the code that I will be sharing will be in the form of screenshots from the Jupyter Notebook UI.

The UI design of the Jupyter Notebook is very simple. You can think of it as one column of material, where each piece of material sits in either a code chunk or a Markdown chunk. The solution development and the actual coding happen in the code chunks, whereas notes for yourself or other developers go in Markdown chunks. The following screenshot shows an example of both a Markdown chunk and a code chunk. The code chunk has been executed, and the output of the requested print appears immediately after it:

Figure 1.1 – Code for printing Hello World in a Jupyter notebook
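
Since the figure is a screenshot, here is a minimal sketch of what the code chunk in it might contain. The variable name `message` is an assumption added for illustration; the figure itself only establishes that the chunk prints Hello World.

```python
# Contents of a single code chunk; running the chunk shows the
# printed text directly underneath it in the notebook.
message = "Hello World"  # hypothetical variable name, not from the figure
print(message)
```

Typing this into a code chunk and clicking Run should display the text right below the chunk, as described above.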

To create a new chunk, you can click on the + sign on the top ribbon of the UI. The newly added chunk will be a code chunk by default. You can switch the code chunk to a Markdown chunk by using the drop-down list on the top ribbon. Moreover, you can move a chunk up or down by using the corresponding arrow buttons on the ribbon. You can see these controls in the following screenshot:

Figure 1.2 – Jupyter Notebook control ribbon

You can see the following in the preceding screenshot:

  • The ribbon shown in the screenshot also allows you to Cut, Copy, and Paste the chunks.
  • The Run button on the ribbon is to execute the code of a chunk.
  • The Stop button is to stop running code. You normally use this button if your code has been running for a while with no output.
  • The Restart button wipes the slate clean; it removes all of the variables you have defined so you can start over.
  • Finally, the Restart & Run button restarts the kernel and then runs all of the code chunks in the notebook from top to bottom.
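
To make the effect of Restart more concrete, the following is a rough sketch that imitates it in plain Python. It is a simplified model and not how the Jupyter kernel is implemented: the dictionary `kernel_namespace` and the flag `survived_restart` are hypothetical names that stand in for the memory shared by all chunks, and `exec` stands in for running a chunk.

```python
# Simplified model: one namespace that every chunk shares.
kernel_namespace = {}

# Chunk 1 defines a variable.
exec("greeting = 'Hello World'", kernel_namespace)

# Chunk 2, run later, can still use that variable.
exec("shout = greeting.upper()", kernel_namespace)
print(kernel_namespace["shout"])

# "Restart" is modeled here as starting over with an empty namespace.
kernel_namespace = {}

# Running chunk 2 again without rerunning chunk 1 now fails,
# because the variable it depends on no longer exists.
try:
    exec("shout = greeting.upper()", kernel_namespace)
    survived_restart = True
except NameError:
    survived_restart = False

print("greeting survived restart:", survived_restart)
```

This is also why Restart & Run is handy: rerunning every chunk in order after a restart rebuilds the variables that later chunks depend on.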

There is more to the Jupyter Notebook, such as keyboard shortcuts that speed up development and specific Markdown syntax for formatting the text in Markdown chunks. However, the introduction here is just enough for you to start meaningfully analyzing data using Python through the Jupyter Notebook UI.