Basics of NumPy array objects

As mentioned in the preceding section, what makes NumPy special is its use of multidimensional arrays, called ndarrays. All items of an ndarray are homogeneous and occupy the same amount of memory per element. Let's start by importing NumPy, creating an array, and analyzing its structure. You can import the library by typing the following statement into your console. You can use any alias instead of np, but np will be used throughout this book as it is the standard convention. Let's create a simple array and look at the metadata that NumPy keeps about it behind the scenes, its so-called attributes:

In [2]: import numpy as np
x = np.array([[1,2,3],[4,5,6]])
x
Out[2]: array([[1, 2, 3],[4, 5, 6]])
In [3]: print("We just created a", type(x))
Out[3]: We just created a <class 'numpy.ndarray'>
In [4]: print("Our array has a shape of", x.shape)
Out[4]: Our array has a shape of (2, 3)
In [5]: print("Total size is", x.size)
Out[5]: Total size is 6
In [6]: print("The dimension of our array is", x.ndim)
Out[6]: The dimension of our array is 2
In [7]: print("Data type of elements is", x.dtype)
Out[7]: Data type of elements is int32
In [8]: print("It consumes", x.nbytes, "bytes")
Out[8]: It consumes 24 bytes

As you can see, the type of our object is a NumPy array. x.shape returns a tuple that gives you the dimensions of the array, such as (n, m). You can get the total number of elements in an array by using x.size; in our example, we have six elements in total. Knowing attributes such as the shape and dimensions is very important: the more you know about your array, the more comfortable you will be with computations, and it wouldn't be wise to start computing with an array whose size and dimensions you don't know. In NumPy, you can use x.ndim to check the number of dimensions of your array. There are other attributes, such as dtype and nbytes, which are very useful when you are checking memory consumption and deciding which data type to use in the array. In our example, each element has the int32 data type and the array consumes 24 bytes in total. You can also set some of these attributes, such as dtype, when creating your array. Previously, the data type was an integer. Let's switch it to float, complex, or uint (unsigned integer) and see how the byte consumption changes, as shown in the following code:

In [9]: x = np.array([[1,2,3],[4,5,6]], dtype = np.float64)
print(x)
print(x.nbytes)
Out[9]: [[ 1. 2. 3.]
[ 4. 5. 6.]]
48
In [10]: x = np.array([[1,2,3],[4,5,6]], dtype = np.complex128)
print(x)
print(x.nbytes)
Out[10]: [[ 1.+0.j 2.+0.j 3.+0.j]
[ 4.+0.j 5.+0.j 6.+0.j]]
96
In [11]: x = np.array([[1,2,3],[4,-5,6]], dtype = np.uint32)
print(x)
print(x.nbytes)
Out[11]: [[ 1 2 3]
[ 4 4294967291 6]]
24

As you can see, each data type consumes a different number of bytes. Note also what happened in the uint32 example: since unsigned integers cannot represent negative values, -5 wrapped around to 4294967291 (recent NumPy versions may raise an error for such out-of-range conversions instead). Now imagine you have the following matrix and you are deciding between int64 and int32 as the data type:

In [12]: x = np.array([[1,2,3],[4,5,6]], dtype = np.int64)
print("int64 consumes",x.nbytes, "bytes")
x = np.array([[1,2,3],[4,5,6]], dtype = np.int32)
print("int32 consumes",x.nbytes, "bytes")
Out[12]: int64 consumes 48 bytes
int32 consumes 24 bytes
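
If you want to see where these numbers come from, every dtype has a fixed per-element size that you can query directly; the following is a minimal check using np.dtype:

# Each dtype has a fixed per-element size in bytes; nbytes = size * itemsize
print(np.dtype(np.int64).itemsize)   # 8 bytes per element
print(np.dtype(np.int32).itemsize)   # 4 bytes per element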

As the output shows, the memory requirement doubles if you use int64. Ask yourself which data type would suffice: as long as your numbers stay between -2,147,483,648 and 2,147,483,647, int32 is enough. Now imagine you have a huge array whose size exceeds 100 MB. In such cases, this choice of data type plays a crucial role in memory usage and performance.
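
If you are unsure about the exact limits of an integer type, you don't have to memorize them; np.iinfo can report them for you. A minimal sketch:

# np.iinfo describes the representable range of an integer dtype
print(np.iinfo(np.int32))   # min is -2147483648, max is 2147483647
print(np.iinfo(np.int64))   # a much wider range, at the cost of 8 bytes per element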

As you may have noticed in the previous examples, every time you changed the data type you created a new array. Technically, you cannot change the dtype of an array after it has been created. What you can do is either create the array again with a new dtype or copy the existing one using the astype method. Let's first create a copy of the array with a new dtype, and then see how you can also change the dtype with astype:

In [13]: x_copy = np.array(x, dtype = np.float64)
x_copy
Out[13]: array([[ 1., 2., 3.],
[ 4., 5., 6.]])
In [14]: x_copy_int = x_copy.astype(np.int64)
x_copy_int
Out[14]: array([[1, 2, 3],
[4, 5, 6]])

Please keep in mind that astype does not change the dtype of x_copy itself, even though you called it on x_copy. It leaves x_copy untouched and returns a new array, x_copy_int:

In [15]: x_copy
Out[15]: array([[ 1., 2., 3.],
[ 4., 5., 6.]])
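
If you want to double-check this behavior, comparing the dtype attribute of both arrays makes it explicit; a quick sketch:

# astype returns a new array; the original keeps its dtype
print(x_copy.dtype)       # float64, unchanged
print(x_copy_int.dtype)   # int64, the dtype requested from astype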

Let's imagine a case where you are working in a research group that tries to identify and quantify the cancer risk of individual patients. You have 100,000 records (rows), where each row represents a single patient, and each patient has 100 features (the results of various tests). As a result, you have an array of shape (100000, 100):

In [16]: Data_Cancer = np.random.rand(100000,100)
print(type(Data_Cancer))
print(Data_Cancer.dtype)
print(Data_Cancer.nbytes)
Data_Cancer_New = np.array(Data_Cancer, dtype = np.float32)
print(Data_Cancer_New.nbytes)
Out[16]: <class 'numpy.ndarray'>
float64
80000000
40000000

As you can see from the preceding code, the array's size decreases from 80 MB to 40 MB simply by changing the dtype. What you give up in return is precision after the decimal point: instead of roughly 16 significant decimal digits, you get only about 7. In some machine learning algorithms, this loss of precision can be negligible. In such cases, you should feel free to adjust the dtype so that it minimizes your memory usage.
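
You can inspect this trade-off yourself: np.finfo reports the properties of a floating-point type, including its approximate decimal precision and machine epsilon. A minimal sketch:

# np.finfo describes a floating-point dtype; 'precision' is the approximate
# number of reliable decimal digits and 'eps' is the machine epsilon
print(np.finfo(np.float64).precision, np.finfo(np.float64).eps)   # roughly 15-16 digits
print(np.finfo(np.float32).precision, np.finfo(np.float32).eps)   # roughly 6-7 digits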