Mastering Numerical Computing with NumPy

By : Umit Mert Cakmak, Tiago Antao, Mert Cuhadaroglu

Overview of this book

NumPy is one of the most important scientific computing libraries available for Python. Mastering Numerical Computing with NumPy teaches you how to achieve expert-level competency in performing complex operations, with in-depth coverage of advanced concepts. Beginning with NumPy's arrays and functions, you will familiarize yourself with linear algebra concepts to perform vector and matrix math operations. You will thoroughly understand and practice data processing, exploratory data analysis (EDA), and predictive modeling. You will then move on to practical examples that teach you how to use NumPy statistics to explore US housing data and develop a predictive model using simple and multiple linear regression techniques. Once you have got to grips with the basics, you will explore unsupervised learning and clustering algorithms, and then learn how to write better NumPy code while keeping advanced considerations in mind. The book also demonstrates the use of different high-performance numerical computing libraries and their relationship with NumPy. You will study how to benchmark the performance of different configurations and choose the best one for your system. By the end of this book, you will have become an expert in handling and performing complex data manipulations.

NumPy array operations

This section will guide you through the creation and manipulation of numerical data with NumPy. The examples assume that NumPy has been imported as np. Let's start by creating a NumPy array from a list:

In [17]: my_list = [2, 14, 6, 8]
my_array = np.asarray(my_list)
type(my_array)
Out[17]: numpy.ndarray

Let's do some addition, subtraction, multiplication, and division with scalar values:

In [18]: my_array + 2
Out[18]: array([ 4, 16, 8, 10])
In [19]: my_array - 1
Out[19]: array([ 1, 13, 5, 7])
In [20]: my_array * 2
Out[20]: array([ 4, 28, 12, 16])
In [21]: my_array / 2
Out[21]: array([ 1. , 7. , 3. , 4. ])

It's much harder to do the same operations with a list, because lists do not support vectorized operations and you need to iterate over their elements. There are many ways to create NumPy arrays; next, you will use np.zeros to create an array full of zeros and add 3 to it. Then, you will perform some arithmetic operations to see how NumPy behaves in element-wise operations between two arrays:

In [22]: second_array = np.zeros(4) + 3
second_array
Out[22]: array([ 3., 3., 3., 3.])
In [23]: my_array - second_array
Out[23]: array([ -1., 11., 3., 5.])
In [24]: second_array / my_array
Out[24]: array([ 1.5 , 0.21428571, 0.5 , 0.375 ])
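To make the contrast with plain lists concrete, here is a minimal sketch that repeats the scalar addition with a list comprehension and with an array. The names list_plus_two, array_plus_two, and repeated_list are illustrative and do not come from the book:

```python
import numpy as np

my_list = [2, 14, 6, 8]
my_array = np.asarray(my_list)

# A plain list has no element-wise arithmetic, so each element is visited explicitly.
list_plus_two = [value + 2 for value in my_list]

# The array applies the same operation to every element in one expression.
array_plus_two = my_array + 2

# With a list, `*` repeats the list instead of multiplying its elements.
repeated_list = my_list * 2

print(list_plus_two)            # [4, 16, 8, 10]
print(array_plus_two.tolist())  # [4, 16, 8, 10]
print(repeated_list)            # [2, 14, 6, 8, 2, 14, 6, 8]
```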

In the same way, you can create an array full of ones with np.ones or an identity matrix with np.identity and repeat the same arithmetic operations:

In [25]: second_array = np.ones(4) + 3
second_array
Out[25]: array([ 4., 4., 4., 4.])
In [26]: my_array - second_array
Out[26]: array([ -2., 10., 2., 4.])
In [27]: second_array / my_array
Out[27]: array([ 2. , 0.28571429, 0.66666667, 0.5 ])

It works as expected with np.ones, but when you use the identity matrix, the calculation returns a (4,4) array as follows:

In [28]: second_array = np.identity(4)
second_array
Out[28]: array([[ 1., 0., 0., 0.],
                [ 0., 1., 0., 0.],
                [ 0., 0., 1., 0.],
                [ 0., 0., 0., 1.]])
In [29]: second_array = np.identity(4) + 3
second_array
Out[29]: array([[ 4., 3., 3., 3.],
                [ 3., 4., 3., 3.],
                [ 3., 3., 4., 3.],
                [ 3., 3., 3., 4.]])
In [30]: my_array - second_array
Out[30]: array([[ -2., 11., 3., 5.],
                [ -1., 10., 3., 5.],
                [ -1., 11., 2., 5.],
                [ -1., 11., 3., 4.]])

Here, my_array is treated as a row and matched against each row of second_array: the first column of the result is the first element of my_array minus each element of the first column of second_array, the second column uses the second element, and so on. Keep in mind that array operations can succeed even when the two arrays do not have exactly the same shape, as long as the shapes are compatible; this is called broadcasting. Later in this chapter, you will learn about the broadcasting errors raised when a computation cannot be done because the shapes are incompatible. The same rule applies to division:

In [31]: second_array / my_array
Out[31]: array([[ 2. , 0.21428571, 0.5 , 0.375 ],
                [ 1.5 , 0.28571429, 0.5 , 0.375 ],
                [ 1.5 , 0.21428571, 0.66666667, 0.375 ],
                [ 1.5 , 0.21428571, 0.5 , 0.5 ]])
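As a rough illustration of when shapes are compatible and when they are not, the sketch below repeats the subtraction and then tries an array whose length does not match. The name short_array is a hypothetical array introduced only for this example:

```python
import numpy as np

my_array = np.array([2, 14, 6, 8])      # shape (4,)
second_array = np.identity(4) + 3       # shape (4, 4)

# Shapes are compared from the trailing dimension: the (4,) array is
# stretched across the four rows of the (4, 4) array.
result = my_array - second_array
print(result.shape)                     # (4, 4)

# A length-3 array cannot be matched against a trailing dimension of 4.
short_array = np.array([1, 2, 3])       # hypothetical, for illustration only
try:
    short_array - second_array
    broadcast_failed = False
except ValueError as error:
    broadcast_failed = True
    print("Broadcast error:", error)
```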

One of the most useful functions for creating NumPy arrays is arange. It returns evenly spaced values within a given interval. The first argument is the start value, the second is the end value (where it stops creating values; the end value itself is not included), and the third is the step between values. Optionally, you can set the dtype as the fourth argument. The default step is 1:

In [32]: x = np.arange(3,7,0.5)
x
Out[32]: array([ 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. , 6.5])
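The defaults and the excluded end value are easiest to see with integer arguments. This is a small sketch; the variable names a, b, and c are illustrative:

```python
import numpy as np

# With a single argument, the start defaults to 0 and the step to 1.
a = np.arange(5)

# The end value is excluded, so 7 itself never appears.
b = np.arange(3, 7)

# dtype can be passed by keyword to control the element type.
c = np.arange(3, 7, 2, dtype=float)

print(a.tolist())  # [0, 1, 2, 3, 4]
print(b.tolist())  # [3, 4, 5, 6]
print(c.tolist())  # [3.0, 5.0]
```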

When you know how many values the array should contain but not what the interval between them should be, use linspace instead. It returns num evenly spaced values between the start and stop points, including both end points by default:

In [33]: x = np.linspace(1.2, 40.5, num=20)
x
Out[33]: array([ 1.2 , 3.26842105, 5.33684211, 7.40526316, 9.47368421,
                11.54210526, 13.61052632, 15.67894737, 17.74736842, 19.81578947,
                21.88421053, 23.95263158, 26.02105263, 28.08947368, 30.15789474,
                32.22631579, 34.29473684, 36.36315789, 38.43157895, 40.5 ])
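If you also want to know which interval linspace chose, the retstep keyword returns it alongside the array. The following sketch reuses the same start and stop values; the names step and half_open are illustrative:

```python
import numpy as np

# retstep=True additionally returns the spacing that linspace computed.
x, step = np.linspace(1.2, 40.5, num=20, retstep=True)

print(len(x))   # 20
print(step)     # (40.5 - 1.2) / 19, roughly 2.0684

# endpoint=False leaves the stop value out of the sequence.
half_open = np.linspace(0, 1, num=4, endpoint=False)
print(half_open.tolist())   # [0.0, 0.25, 0.5, 0.75]
```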

Two other functions are similar in usage but return different sequences of numbers because they space values evenly on a logarithmic scale rather than a linear one, so the distribution of the numbers is different as well. The first one is geomspace, which returns numbers in a geometric progression, where each value is a constant multiple of the previous one:

In [34]: np.geomspace(1, 625, num=5)
Out[34]: array([ 1., 5., 25., 125., 625.])

The other important function is logspace, which returns values evenly spaced on a log scale between your start and stop points:

In [35]: np.logspace(3, 4, num=5)
Out[35]: array([ 1000. , 1778.27941004, 3162.27766017, 5623.4132519 ,
                10000. ])

What are these arguments? With a starting point of 3 and an ending point of 4, the function returns numbers that are much higher than the given range. That is because the arguments are exponents: by default, the sequence starts at 10**start and ends at 10**stop, so in this example the starting point is 10**3 and the ending point is 10**4. If you want the sequence to start and end at the values you pass as arguments, the trick is to pass their base 10 logarithms:

In [36]: np.logspace(np.log10(3) , np.log10(4) , num=5)
Out[36]: array([ 3. , 3.2237098 , 3.46410162, 3.72241944, 4. ])
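The logarithm trick gives the same sequence that geomspace produces directly from the end points, and logspace also accepts a base other than 10. A brief sketch under those assumptions, with illustrative variable names:

```python
import numpy as np

# geomspace takes the actual end points, so no logarithms are needed.
direct = np.geomspace(3, 4, num=5)
via_logspace = np.logspace(np.log10(3), np.log10(4), num=5)
print(np.allclose(direct, via_logspace))   # True

# With base=5, the exponents 0, 1, 2, 3, 4 are applied to 5 instead of 10.
powers_of_five = np.logspace(0, 4, num=5, base=5)
print(powers_of_five)
```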

By now, you should be familiar with different ways of creating arrays with different distributions. You have also learned how to do some basic operations with them. Let's continue with other useful functions that you will use in your day-to-day work. Most of the time, you will have to work with multiple arrays and compare them quickly. NumPy makes this straightforward; you can compare arrays as you would compare two integers:

In [37]: x = np.array([1,2,3,4])
y = np.array([1,3,4,4])
x == y
Out[37]: array([ True, False, False, True], dtype=bool)

The comparison is done element-wise and returns a Boolean array indicating whether the elements at each position match. This works well for small arrays and gives you the most detail: each False in the output marks an index where the two arrays differ. If you have a large array, you may prefer a single answer to the question of whether all elements of the two arrays match:

In [38]: x = np.array([1,2,3,4])
y = np.array([1,3,4,4])
np.array_equal(x,y)
Out[38]: False

Here, you have a single Boolean output. You only know that the arrays are not equal, but not which elements differ. Comparison is not limited to checking equality. You can also do element-wise greater-than and less-than comparisons between two arrays:

In [39]: x = np.array([1,2,3,4])
y = np.array([1,3,4,4])
x < y
Out[39]: array([False, True, True, False], dtype=bool)
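For large arrays, a middle ground between the full Boolean array and the single answer is to ask only for the positions that differ. The sketch below also shows a tolerance-based check for floating-point data; all variable names are illustrative:

```python
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([1, 3, 4, 4])

# Indices where the two arrays disagree.
mismatch_positions = np.flatnonzero(x != y)
print(mismatch_positions.tolist())   # [1, 2]

# Number of matching elements.
match_count = np.count_nonzero(x == y)
print(match_count)                   # 2

# For floating-point data, exact equality is often too strict;
# np.allclose compares within a tolerance instead.
a = np.array([0.1 + 0.2])
b = np.array([0.3])
print(np.array_equal(a, b))   # False
print(np.allclose(a, b))      # True
```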

When you need logical operations (AND, OR, XOR), NumPy provides np.logical_and, np.logical_or, and np.logical_xor, which work element-wise on your arrays as follows:

In [40]: x = np.array([0, 1, 0, 0], dtype=bool)
y = np.array([1, 1, 0, 1], dtype=bool)
np.logical_or(x,y)
Out[40]: array([ True, True, False, True], dtype=bool)
In [41]: np.logical_and(x,y)
Out[41]: array([False, True, False, False], dtype=bool)
In [42]: x = np.array([12,16,57,11])
np.logical_or(x < 13, x > 50)
Out[42]: array([ True, False, True, True], dtype=bool)
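XOR is mentioned above but not demonstrated, so here is a small sketch of it, together with the operator forms that give the same results on Boolean arrays. The names values and mask are illustrative:

```python
import numpy as np

x = np.array([0, 1, 0, 0], dtype=bool)
y = np.array([1, 1, 0, 1], dtype=bool)

# True where exactly one of the two inputs is True.
xor_result = np.logical_xor(x, y)
print(xor_result.tolist())   # [True, False, False, True]

# On Boolean arrays, |, & and ^ match logical_or, logical_and and logical_xor.
print(np.array_equal(x ^ y, xor_result))   # True

# Comparisons need parentheses because | binds more tightly than < and >.
values = np.array([12, 16, 57, 11])
mask = (values < 13) | (values > 50)
print(values[mask].tolist())   # [12, 57, 11]
```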

So far, algebraic operations such as addition and multiplication have been covered. NumPy applies transcendental functions such as the exponential function, logarithms, and trigonometric functions element-wise in the same way:

In [43]: x = np.array([1, 2, 3,4 ])
np.exp(x)
Out[43]: array([ 2.71828183, 7.3890561 , 20.08553692, 54.59815003])
In [44]: np.log(x)
Out[44]: array([ 0. , 0.69314718, 1.09861229, 1.38629436])
In [45]: np.sin(x)
Out[45]: array([ 0.84147098, 0.90929743, 0.14112001, -0.7568025 ])
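Two details are worth checking when using these functions: np.log is the natural logarithm, and the trigonometric functions expect radians. A short sketch, with illustrative variable names:

```python
import numpy as np

x = np.array([1, 2, 3, 4])

# np.log is the natural logarithm, so it undoes np.exp.
round_trip = np.log(np.exp(x))
print(np.allclose(round_trip, x))       # True

# Trigonometric functions expect radians; convert degrees first.
angles_deg = np.array([0, 30, 90])
sines = np.sin(np.deg2rad(angles_deg))
print(np.round(sines, 3).tolist())      # [0.0, 0.5, 1.0]

# Base-10 logarithms are available separately as np.log10.
decades = np.log10(np.array([1, 10, 100]))
print(np.allclose(decades, [0, 1, 2]))  # True
```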

What about the transpose of a matrix? First, you will use the reshape function with arange to set the desired shape of the matrix:

In [46]: x = np.arange(9)
x
Out[46]: array([0, 1, 2, 3, 4, 5, 6, 7, 8])
In [47]: x = np.arange(9).reshape((3, 3))
x
Out[47]: array([[0, 1, 2],
                [3, 4, 5],
                [6, 7, 8]])
In [48]: x.T
Out[48]: array([[0, 3, 6],
                [1, 4, 7],
                [2, 5, 8]])

You transposed a 3×3 array, so the shape doesn't change because both dimensions are 3. Let's see what happens when you don't have a square array:

In [49]: x = np.arange(6).reshape(2,3)
x
Out[49]: array([[0, 1, 2],
                [3, 4, 5]])
In [50]: x.T
Out[50]: array([[0, 3],
                [1, 4],
                [2, 5]])
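One detail that is easy to miss is that .T returns a view of the same data rather than a copy. The following sketch checks the shapes and the shared memory; the names t and v are illustrative:

```python
import numpy as np

x = np.arange(6).reshape(2, 3)
t = x.T

print(x.shape, t.shape)          # (2, 3) (3, 2)

# The transpose shares memory with the original array.
print(np.shares_memory(x, t))    # True
t[0, 1] = 99                     # element (0, 1) of t is element (1, 0) of x
print(x[1, 0])                   # 99

# A one-dimensional array has only one axis, so transposing leaves it unchanged.
v = np.arange(3)
print(v.T.shape)                 # (3,)
```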

The transpose works as expected and the dimensions are switched as well. You can also get summary statistics from arrays such as mean, median, and standard deviation. Let's start with methods that NumPy offers for calculating basic statistics:

Method              Description
np.sum              Returns the sum of all array values or along the specified axis
np.amin             Returns the minimum of all array values or along the specified axis
np.amax             Returns the maximum of all array values or along the specified axis
np.percentile       Returns the given qth percentile of all array values or along the specified axis
np.nanmin           The same as np.amin, but ignores NaN values in an array
np.nanmax           The same as np.amax, but ignores NaN values in an array
np.nanpercentile    The same as np.percentile, but ignores NaN values in an array

These functions are very useful because you can apply them to a whole array or along an axis, according to your needs. Note that SciPy, which uses NumPy multidimensional arrays as its data structure, offers more fully featured implementations of such methods. The following code block gives examples of the preceding statistical functions:

In [51]: x = np.arange(9).reshape((3,3))
x
Out[51]: array([[0, 1, 2],
                [3, 4, 5],
                [6, 7, 8]])
In [52]: np.sum(x)
Out[52]: 36
In [53]: np.amin(x)
Out[53]: 0
In [54]: np.amax(x)
Out[54]: 8
In [55]: np.amin(x, axis=0)
Out[55]: array([0, 1, 2])
In [56]: np.amin(x, axis=1)
Out[56]: array([0, 3, 6])
In [57]: np.percentile(x, 80)
Out[57]: 6.4000000000000004

The axis argument determines the dimension along which the function operates. For a two-dimensional array, axis=0 runs down the rows, so np.amin(x, axis=0) returns the minimum of each column, while axis=1 runs across the columns, so np.amin(x, axis=1) returns the minimum of each row. Without an axis, amin(x) returns a single value because it takes the minimum over the whole array. Now imagine you have a large array: you can find the maximum value with amax, but what if you need to pass the index of this value to another function? In such cases, argmin and argmax come to the rescue, as shown in the following snippet:

In [58]: x = np.array([1,-21,3,-3])
np.argmax(x)
Out[58]: 2
In [59]: np.argmin(x)
Out[59]: 1
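For arrays with more than one dimension, argmax without an axis returns an index into the flattened array. A minimal sketch of turning that into a row and column position with np.unravel_index follows; the 2x3 array here is made up for illustration:

```python
import numpy as np

x = np.array([[1, -21, 3],
              [7, 0, -3]])

# Without an axis, the index refers to the flattened array.
flat_index = np.argmax(x)
print(flat_index)                    # 3

# unravel_index converts the flat index back to (row, column).
row, col = np.unravel_index(flat_index, x.shape)
print(int(row), int(col))            # 1 0
print(x[row, col])                   # 7

# With an axis, one index is returned per column (axis=0) or per row (axis=1).
print(np.argmax(x, axis=0).tolist()) # [1, 1, 0]
print(np.argmax(x, axis=1).tolist()) # [2, 0]
```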

Let's continue with more statistical functions:

Method          Description
np.mean         Returns the mean of all array values or along the specified axis
np.median       Returns the median of all array values or along the specified axis
np.std          Returns the standard deviation of all array values or along the specified axis
np.nanmean      The same as np.mean, but ignores NaN values in an array
np.nanmedian    The same as np.median, but ignores NaN values in an array
np.nanstd       The same as np.std, but ignores NaN values in an array

The following code gives more examples of the preceding statistical methods of NumPy. These methods are heavily used in data discovery phases, where you analyze your data features and distribution:

In [60]: x = np.array([[2, 3, 5], [20, 12, 4]])
x
Out[60]: array([[ 2, 3, 5],
                [20, 12, 4]])
In [61]: np.mean(x)
Out[61]: 7.666666666666667
In [62]: np.mean(x, axis=0)
Out[62]: array([ 11. , 7.5, 4.5])
In [63]: np.mean(x, axis=1)
Out[63]: array([ 3.33333333, 12. ])
In [64]: np.median(x)
Out[64]: 4.5
In [65]: np.std(x)
Out[65]: 6.3944420310836261
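The NaN-aware variants listed in the tables are not exercised above, so here is a small sketch of how they differ from the regular functions. The data array is made up for illustration. The sketch also shows the ddof argument of np.std, which by default computes the population standard deviation:

```python
import numpy as np

data = np.array([2.0, 3.0, np.nan, 5.0])   # hypothetical data with a missing value

# A single NaN propagates through the regular functions.
print(np.mean(data))                       # nan

# The nan* variants skip NaN values.
print(np.nanmean(data))                    # (2 + 3 + 5) / 3
print(np.nanmin(data), np.nanmax(data))    # 2.0 5.0
print(np.nanmedian(data))                  # 3.0

# np.std divides by N by default (ddof=0); ddof=1 divides by N - 1.
x = np.array([[2, 3, 5], [20, 12, 4]])
population_std = np.std(x)
sample_std = np.std(x, ddof=1)
print(population_std < sample_std)         # True
```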