Book Image

Mastering Numerical Computing with NumPy

By : Umit Mert Cakmak, Tiago Antao, Mert Cuhadaroglu
Book Image

Mastering Numerical Computing with NumPy

By: Umit Mert Cakmak, Tiago Antao, Mert Cuhadaroglu

Overview of this book

NumPy is one of the most important scientific computing libraries available for Python. Mastering Numerical Computing with NumPy teaches you how to achieve expert level competency to perform complex operations, with in-depth coverage of advanced concepts. Beginning with NumPy's arrays and functions, you will familiarize yourself with linear algebra concepts to perform vector and matrix math operations. You will thoroughly understand and practice data processing, exploratory data analysis (EDA), and predictive modeling. You will then move on to working on practical examples which will teach you how to use NumPy statistics in order to explore US housing data and develop a predictive model using simple and multiple linear regression techniques. Once you have got to grips with the basics, you will explore unsupervised learning and clustering algorithms, followed by understanding how to write better NumPy code while keeping advanced considerations in mind. The book also demonstrates the use of different high-performance numerical computing libraries and their relationship with NumPy. You will study how to benchmark the performance of different configurations and choose the best for your system. By the end of this book, you will have become an expert in handling and performing complex data manipulations.
Table of Contents (11 chapters)

Working with multidimensional arrays

This section will give you a brief understanding of multidimensional arrays by going through different matrix operations.

In order to do matrix multiplication in NumPy, you have to use dot() instead of *. Let's see some examples:

In [66]: c = np.ones((4, 4))
c*c
Out[66]: array([[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.]])
In [67]: c.dot(c)
Out[67]: array([[ 4., 4., 4., 4.],
[ 4., 4., 4., 4.],
[ 4., 4., 4., 4.],
[ 4., 4., 4., 4.]])

The most important topic in working with multidimensional arrays is stacking, in other words how to merge two arrays. hstack is used for stacking arrays horizontally (column-wise) and vstack is used for stacking arrays vertically (row-wise). You can also split the columns with the hsplit and vsplit methods in the same way that you stacked them:

In [68]: y = np.arange(15).reshape(3,5)
x = np.arange(10).reshape(2,5)
new_array = np.vstack((y,x))
new_array
Out[68]: array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9]])
In [69]: y = np.arange(15).reshape(5,3)
x = np.arange(10).reshape(5,2)
new_array = np.hstack((y,x))
new_array
Out[69]: array([[ 0, 1, 2, 0, 1],
[ 3, 4, 5, 2, 3],
[ 6, 7, 8, 4, 5],
[ 9, 10, 11, 6, 7],
[12, 13, 14, 8, 9]])

These methods are very useful in machine learning applications, especially when creating datasets. After you stack your arrays, you can check their descriptive statistics by using Scipy.stats. Imagine a case where you have 100 records, and each record has 10 features, which means you have a 2D matrix which has 100 rows and 10 columns. The following example shows how you can easily get some descriptive statistics for each feature:

In [70]: from scipy import stats
x= np.random.rand(100,10)
n, min_max, mean, var, skew, kurt = stats.describe(x)
new_array = np.vstack((mean,var,skew,kurt,min_max[0],min_max[1]))
new_array.T
Out[70]: array([[ 5.46011575e-01, 8.30007104e-02, -9.72899085e-02,
-1.17492785e+00, 4.07031246e-04, 9.85652100e-01],
[ 4.79292653e-01, 8.13883169e-02, 1.00411352e-01,
-1.15988275e+00, 1.27241020e-02, 9.85985488e-01],
[ 4.81319367e-01, 8.34107619e-02, 5.55926602e-02,
-1.20006450e+00, 7.49534810e-03, 9.86671083e-01],
[ 5.26977277e-01, 9.33829059e-02, -1.12640661e-01,
-1.19955646e+00, 5.74237697e-03, 9.94980830e-01],
[ 5.42622228e-01, 8.92615897e-02, -1.79102183e-01,
-1.13744108e+00, 2.27821933e-03, 9.93861532e-01],
[ 4.84397369e-01, 9.18274523e-02, 2.33663872e-01,
-1.36827574e+00, 1.18986562e-02, 9.96563489e-01],
[ 4.41436165e-01, 9.54357485e-02, 3.48194314e-01,
-1.15588500e+00, 1.77608372e-03, 9.93865324e-01],
[ 5.34834409e-01, 7.61735119e-02, -2.10467450e-01,
-1.01442389e+00, 2.44706226e-02, 9.97784091e-01],
[ 4.90262346e-01, 9.28757119e-02, 1.02682367e-01,
-1.28987137e+00, 2.97705706e-03, 9.98205307e-01],
[ 4.42767478e-01, 7.32159267e-02, 1.74375646e-01,
-9.58660574e-01, 5.52410464e-04, 9.95383732e-01]])

NumPy has a great module named numpy.ma, which is used for masking array elements. It's very useful when you want to mask (ignore) some elements while doing your calculations. When NumPy masks, it will be treated as an invalid and does not take into account computation:

In [71]: import numpy.ma as ma
x = np.arange(6)
print(x.mean())
masked_array = ma.masked_array(x, mask=[1,0,0,0,0,0])
masked_array.mean()
2.5
Out[71]: 3.0

In the preceding code, you have an array x = [0,1,2,3,4,5]. What you do is mask the first element of the array and then calculate the mean. When an element is masked as 1(True), the associated index value in the array will be masked. This method is also very useful while replacing the NAN values:

In [72]: x = np.arange(25, dtype = float).reshape(5,5)
x[x<5] = np.nan
x
Out[72]: array([[ nan, nan, nan, nan, nan],
[ 5., 6., 7., 8., 9.],
[ 10., 11., 12., 13., 14.],
[ 15., 16., 17., 18., 19.],
[ 20., 21., 22., 23., 24.]])
In [73]: np.where(np.isnan(x), ma.array(x, mask=np.isnan(x)).mean(axis=0), x)
Out[73]: array([[ 12.5, 13.5, 14.5, 15.5, 16.5],
[ 5. , 6. , 7. , 8. , 9. ],
[ 10. , 11. , 12. , 13. , 14. ],
[ 15. , 16. , 17. , 18. , 19. ],
[ 20. , 21. , 22. , 23. , 24. ]])

In preceding code, we changed the value of the first five elements to nan by putting a condition with index. x[x<5] refers to the elements which indexed for 0, 1, 2, 3, and 4. Then we overwrite these values with the mean of each column(excluding nan values). There are many other useful methods in array operations in order help your code be more concise:

Method
Description
np.concatenate
Join to the matrix in a sequence with a given matrix
np.repeat
Repeat the element of an array along a specific axis
np.delete
Return a new array with the deleted subarrays
np.insert
Insert values before the specified axis
np.unique
Find unique values in an array
np.tile
Create an array by repeating a given input for a given number of repetitions