Hands-On Data Preprocessing in Python

By: Roy Jafari

Overview of this book

Hands-On Data Preprocessing is a primer on the best data cleaning and preprocessing techniques, written by an expert who’s developed college-level courses on data preprocessing and related subjects. With this book, you’ll be equipped with the optimum data preprocessing techniques from multiple perspectives, ensuring that you get the best possible insights from your data. You'll learn about different technical and analytical aspects of data preprocessing – data collection, data cleaning, data integration, data reduction, and data transformation – and get to grips with implementing them using the open source Python programming environment. The hands-on examples and easy-to-follow chapters will help you gain a comprehensive articulation of data preprocessing, its whys and hows, and identify opportunities where data analytics could lead to more effective decision making. As you progress through the chapters, you’ll also understand the role of data management systems and technologies for effective analytics and how to use APIs to pull data. By the end of this Python data preprocessing book, you'll be able to use Python to read, manipulate, and analyze data; perform data cleaning, integration, reduction, and transformation techniques, and handle outliers or missing values to effectively prepare data for analytic tools.
Table of Contents (24 chapters)
Part 1: Technical Needs
Part 2: Analytic Goals
Part 3: The Preprocessing
Part 4: Case Studies

Overview of the basic functions of NumPy

In short, as the name suggests, NumPy is a Python module brimming with useful functions for dealing with numbers. The name is short for Numerical Python. There you have it. If you have numbers and you are in Python, you know what you need to import: NumPy, simple as that. See the following screenshot:

Figure 1.3 – Code for importing the NumPy module

As you can see, we have given the alias np to the module after importing it. You can actually assign any alias that you wish and your code would function; however, I suggest sticking with np. I have two compelling reasons for doing so:

  • First, everyone else uses this alias, so if you share your code with others, they know what you are doing throughout your project.
  • Second, a lot of the time, you end up using code written by others in your projects, so consistency will make your job easier. You will see that most of the famous modules also have a famous alias, for example, pd for Pandas, and plt for matplotlib.pyplot.
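For reference, a minimal sketch of the import with the conventional alias. Printing the version is only a quick check that the import worked and is my own addition.

```python
# Import NumPy under its conventional alias, np.
import numpy as np

# Quick sanity check that the module loaded; the version will differ per machine.
print(np.__version__)
```

The same pattern applies to the other aliases mentioned above, for example import pandas as pd, assuming those modules are installed.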

    Good practice advice

    NumPy can handle a wide range of mathematical and statistical calculations for a collection of numbers, such as mean, median, standard deviation (std), and variance (var). If you have something else in mind and are not sure whether NumPy has it, I suggest googling it before trying to write your own. If it involves numbers, chances are NumPy has it.
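As a small illustration of the four calculations named in the advice, here is a sketch on a hypothetical list of numbers (the values in nums are made up for this example). Note that np.std() and np.var() compute the population versions by default (ddof=0).

```python
import numpy as np

# Hypothetical sample values, chosen so the results are round numbers.
nums = [2, 4, 4, 4, 5, 5, 7, 9]

print(np.mean(nums))    # 5.0
print(np.median(nums))  # 4.5
print(np.std(nums))     # 2.0  (population standard deviation, ddof=0)
print(np.var(nums))     # 4.0  (population variance, ddof=0)
```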

The following screenshot shows the mean, for example, applied to a collection of numbers:

Figure 1.4 – Example of using the np.mean() NumPy function and the .mean() NumPy array function
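The two approaches can be sketched roughly as follows. The names lst_nums and ary_nums come from the text; the values in the list are hypothetical.

```python
import numpy as np

# Hypothetical values; the variable names follow the text.
lst_nums = [1, 2, 3, 4, 5]

# Way 1: call the module-level function directly on a plain list.
print(np.mean(lst_nums))  # 3.0

# Way 2: convert the list to a NumPy array, then call the array's own method.
ary_nums = np.array(lst_nums)
print(ary_nums.mean())    # 3.0
```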

As shown in Figure 1.4, there are two ways to do this. The first one, portrayed in the top chunk, uses np.mean(). This function belongs to the NumPy module and can be called directly. The great aspect of this approach is that, most of the time, you do not need to change your data type before NumPy honors your request. You can input lists, Pandas series, or DataFrames. You can see in the top chunk that np.mean() easily calculated the mean of lst_nums, which is of the list type.

The second way, as shown in the bottom chunk, is to first use np.array() to transform the list into a NumPy array and then use the .mean() function, which is a method of any NumPy array.

Before continuing with this chapter, take a moment and use the Python type() function to see the different types of lst_nums and ary_nums, as shown in the following screenshot:

Figure 1.5 – The application of the type() function
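A minimal sketch of that check, again with hypothetical values in lst_nums:

```python
import numpy as np

lst_nums = [1, 2, 3, 4, 5]      # hypothetical values
ary_nums = np.array(lst_nums)   # same numbers as a NumPy array

print(type(lst_nums))  # <class 'list'>
print(type(ary_nums))  # <class 'numpy.ndarray'>
```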

Next, we will learn about four NumPy functions: np.arange(), np.zeros(), np.ones(), and np.linspace().

The np.arange() function

This function, as shown in the following screenshot, produces a sequence of numbers with equal increments. You can see in the figure that by changing the inputs, you can get the function to output many different sequences of numbers that are required for your analytic purposes:

Figure 1.6 – Examples of using the np.arange() function

Pay attention to the three chunks of code in the preceding figure to see the default behavior of np.arange() when only one or two inputs are passed.

  • When only one input is passed, as in the first chunk of code, the default of np.arange() is that you want a sequence of numbers from zero up to, but not including, the input number with increments of one.
  • When two inputs are passed, as in the second chunk of code, the default of the function is that you want a sequence of numbers from the first input up to, but not including, the second input with increments of one.
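These defaults can be sketched as follows. The specific numbers and variable names are my own and are not necessarily the ones in the figure. The third call shows what happens when a third input, the increment, is also passed.

```python
import numpy as np

# One input: start defaults to 0, increment defaults to 1, stop is excluded.
one_input = np.arange(5)
print(one_input)     # [0 1 2 3 4]

# Two inputs: start and stop, increment defaults to 1, stop is excluded.
two_inputs = np.arange(2, 7)
print(two_inputs)    # [2 3 4 5 6]

# Three inputs: start, stop, and increment.
three_inputs = np.arange(0, 10, 2)
print(three_inputs)  # [0 2 4 6 8]
```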

The np.zeros() and np.ones() functions

np.ones() creates a NumPy array filled with ones, and np.zeros() does the same thing with zeros. Unlike np.arange(), which takes the input to calculate what needs to be included in the output array, np.zeros() and np.ones() take the input to structure the output array. For instance, the top chunk of the following screenshot specifies the request for an array with four rows and five columns filled with zeros. As you can see in the bottom chunk, if you only pass in one number, the output array will have only one dimension:

Figure 1.7 – Examples of np.zeros() and np.ones()
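A short sketch of the two cases described above. The shape of four rows by five columns follows the text; the variable names and the length of the one-dimensional array are my own choices.

```python
import numpy as np

# A list of two numbers is read as a shape: 4 rows and 5 columns of zeros.
zeros_2d = np.zeros([4, 5])
print(zeros_2d.shape)  # (4, 5)

# A single number gives a one-dimensional array.
ones_1d = np.ones(3)
print(ones_1d.shape)   # (3,)
```

Both functions fill the array with floating-point values by default.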

These two functions are excellent resources for creating a placeholder to keep the results of calculations in a loop. For instance, review the following example and observe how np.zeros() facilitates the coding.

Example – Using a placeholder to accommodate analytics

Given the grade data of 10 students, write code using NumPy that calculates and reports each student's grade average.

The data of the 10 students and the solution to this example are provided in the following screenshots. Please review and try this code before progressing:

Figure 1.8 – Grade data for the example

Now that you've had a chance to engage with this example, allow me to highlight a few matters about the provided solution presented in Figure 1.9:

  • Notice how np.zeros() streamlined the solution: it creates the placeholder array up front, so the loop only has to fill in one slot per student. After the code is done, all of the average grades are calculated and saved already. Compare the printed values before and after the for loop.
  • The enumerate() function in the for loop might look unfamiliar to you. What that does is help the code to have both an index (i) and the item (name) from the collection (Names).
  • The .format() function is an invaluable method of any string variable. If there are any placeholders such as {} in the string, this function will replace them, in order, with the values that have been input.
  • # better-looking report is a comment in the second chunk of the code. Comments are ignored by the Python interpreter and their only purpose is to communicate something with whoever reads the source code.

Figure 1.9 – Solution to the preceding example
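The following is a minimal sketch of the kind of solution these points describe, not a reproduction of Figure 1.9. The names Names, i, and name come from the text. The Grades dictionary, its values, and the name Average_Grades are hypothetical stand-ins, and only three students are used to keep the sketch short.

```python
import numpy as np

# Hypothetical data: three students instead of ten, with made-up grades.
Names = ["Ana", "Ben", "Cam"]
Grades = {
    "Ana": [90, 80, 100],
    "Ben": [70, 75, 80],
    "Cam": [60, 90, 90],
}

# Placeholder array with one slot per student, filled with zeros for now.
Average_Grades = np.zeros(len(Names))
print(Average_Grades)  # all zeros before the loop

# enumerate() gives both the index (i) and the item (name) on each pass.
for i, name in enumerate(Names):
    Average_Grades[i] = np.mean(Grades[name])

print(Average_Grades)  # now holds one average per student

# better-looking report
for i, name in enumerate(Names):
    print("Average grade for {}: {:.2f}".format(name, Average_Grades[i]))
```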

The np.linspace() function

This function returns evenly spaced numbers over a specified interval. The function takes three inputs. The first two inputs specify the start and end of the interval, and the third specifies the number of elements that the output will have. For example, refer to the following screenshot:

Figure 1.10 – Examples of using the np.linspace() function

In the first code block, 19 evenly spaced numbers are placed between the endpoints 0 and 1, altogether creating an array with 21 numbers. The second gives another example. After trying out the two examples in the screenshot, try np.linspace(0,1,20) and after investigating the results, think about why I chose 21 over 20 in my example.
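One way to carry out that investigation is to compare the gap between neighboring elements in the two arrays. This is only a sketch; the variable names are my own.

```python
import numpy as np

# 21 elements from 0 to 1, both endpoints included.
with_21 = np.linspace(0, 1, 21)
print(len(with_21))                       # 21
print(round(with_21[1] - with_21[0], 4))  # 0.05

# 20 elements over the same interval.
with_20 = np.linspace(0, 1, 20)
print(len(with_20))                       # 20
print(round(with_20[1] - with_20[0], 4))  # 0.0526
```

Because both endpoints are included, the interval is divided into one fewer gap than there are elements.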

np.linspace() is a very handy function for situations where you need to try out different values to find the one that best fits your needs. The following example showcases a simple situation like that.

Example – np.linspace() to create solution candidates

We are interested in finding the value(s) that satisfy the following mathematical statement: x² − 5x + 6 = 0.

Imagine that we don't know that the statement can be simplified easily to show that either 2 or 3 satisfies it: (x − 2)(x − 3) = 0.

So, we would like to use NumPy to try out every whole number between -1000 and 1000 and find the answer.

The following screenshot shows Python code that provides a solution to this problem:

Figure 1.11 – Solution to the preceding example

Please review and try this code before moving on.

Now that you've had a chance to engage with this example, allow me to highlight a couple of things:

  • Notice how smart use of np.linspace() leads to an array with all of the numbers that we were interested in trying out.
  • Uncomment #print(Candidates) and review all of the numbers that were tried out to establish the desired answers.
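A minimal sketch of this approach follows, assuming the statement in question is x² − 5x + 6 = 0, which is an equation whose only solutions are 2 and 3. The name Candidates comes from the text; the loop body and the solutions list are my own illustration and may differ from the code in the figure.

```python
import numpy as np

# 2001 evenly spaced values from -1000 to 1000 gives every whole number in the range.
Candidates = np.linspace(-1000, 1000, 2001)
#print(Candidates)

solutions = []
for candidate in Candidates:
    # Assumed statement: x**2 - 5*x + 6 == 0
    if candidate**2 - 5 * candidate + 6 == 0:
        solutions.append(int(candidate))

print(solutions)  # [2, 3]
```

The count of 2001 is what makes the increment exactly one: 2000 gaps between -1000 and 1000, plus one for the starting point.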

This concludes our review of the NumPy module. Next, we will review another very useful Python module, Pandas.