Book Image

Hands-On Data Analysis with NumPy and Pandas

By : Curtis Miller
5 (1)
Book Image

Hands-On Data Analysis with NumPy and Pandas

5 (1)
By: Curtis Miller

Overview of this book

Python, a multi-paradigm programming language, has become the language of choice for data scientists for visualization, data analysis, and machine learning. Hands-On Data Analysis with NumPy and Pandas starts by guiding you in setting up the right environment for data analysis with Python, along with helping you install the correct Python distribution. In addition to this, you will work with the Jupyter notebook and set up a database. Once you have covered Jupyter, you will dig deep into Python’s NumPy package, a powerful extension with advanced mathematical functions. You will then move on to creating NumPy arrays and employing different array methods and functions. You will explore Python’s pandas extension which will help you get to grips with data mining and learn to subset your data. Last but not the least you will grasp how to manage your datasets by sorting and ranking them. By the end of this book, you will have learned to index and group your data for sophisticated data analysis and manipulation.
Table of Contents (12 chapters)

Exploring alternatives to Jupyter


Now we will consider alternatives to Jupyter Notebooks. We will look at:

  • Jupyter QT Console
  • Spyder
  • Rodeo
  • Python interpreter
  • ptpython

The first alternative we will consider is the Jupyter QT Console; this is a Python interpreter with added functionality, aimed specifically for data analysis.

The following screenshot shows the Jupyter QT Console:

It is very similar to the Jupyter Notebook. In fact, it is effectively the Console version of the Jupyter Notebook. Notice here that we have some interesting syntax. We have In [1], and then let's suppose you were to type in a command, for example:

print ("Hello, world!")

We see some output and then we see In [2].

Now let's try something else:

1 + 1

Right after In [2], we see Out[2]. What does this mean? This is a way to track historical commands and their outputs in a session. To access, say, the command for In [42], we type _i42. So, in this case, if we want to see the input for command 2, we type in i2. Notice that it gives us a string, 1 + 1. In fact, we can run this string.

If we type in eval and then _i2, notice that it gives us the same output as the original command, In [2], did. Now, how about Out[2]? How can we access the actual output? In this case, all we would do is just _ and then the number of the output, say 2. This should give us 2. So this gives you a more convenient way to access historical commands and their outputs.

Another advantage of Jupyter Notebooks is that you can see images. For example, let's get Matplotlib running. First we're going to import Matplotlib with the following command:

import matplotlib.pyplot as plt

After we've imported Matplotlib, recall that we need to run a certain magic, the Matplotlib magic:

%matplotlib inline

We need to give it the inline parameter, and now we can create a Matplotlib figure. Notice that the image shows up right below the command. When we type in _8, it shows that a Matplotlib object was created, but it does not actually show the plot itself. As you can see, we can use the Jupyter console in a more advanced way than the typical Python console. For example, let's work with a dataset called Iris; import it using the following line:

from sklearn.datasets import load_iris

This is a very common dataset used in data analysis. It's often used as a way to evaluate training models. We will also use k-means clustering on this:

from sklearn.cluster import KMeans

The load_Iris function isn't actually the Iris dataset; it is a function that we can use to get the Iris dataset. The following command will actually give us access to that dataset:

iris  = load_iris()

Now we will train a k-means clustering scheme on this dataset:

iris_clusters = KMeans(n_clusters = 3, init =  "random").fit(iris.data)

We can see the documentation right away when we're typing in a function. For example, I know what the end clusters parameter means; it is actually the original doc string from the function. Here, I want the number of clusters to be 3, because I know that there are actually three real clusters in this dataset. Now that a clustering scheme has been trained, we can plot it using the following code:

plt.scatter(iris.data[:, 0], iris.data[:, 1], c = iris_clusters.labels_)

Spyder

Spyder is an IDE unlike the Jupyter Notebook or the Jupyter QT Console. It integrates NumPy, SciPy, Matplotlib, and IPython. It is extensible with plugins, and it is included with Anaconda.

The following screenshot shows Spyder, an actual IDE intended for data analysis and scientific computing:

Spyder Python 3.6

On the right, you can go to File explorer to search for new files to load. Here, we want to open up iris_kmeans.py. This is a file that contains all the commands that we used before in the Jupyter QT Console. Notice on the right that the editor has a console; that is in fact the IPython console, which you saw as the Jupyter QT Console. We can run this entire file by clicking on the Run tab. It will run in the console, shown as follows:

The following screenshot will be the output:

Notice that at the end we see the result of the clustering that we saw before. We can type in commands interactively as well; for example, we can make our computer say Hello, world!.

In the editor, let's type in a new variable, let's say n = 5. Now let's run this file in the editor. Notice that n is a variable that the editor is aware of. Now let's make a change, say n = 6. Unless we were to actually run this file again, the console will be unaware of the change. So if I were to type n in the console again, nothing changes, and it's still 5. You would need to run this line in order to actually see a change.

We also have a variable explorer where we can see the values of variables and change them. For example, I can change the value of n from 6 to 10, shown as follows:

The following screenshot shows the output:

Then, when I go to the console and ask what n is, it will say 10:

n10

That concludes our discussion of Spyder.

Rodeo

Rodeo is a Python IDE developed by Yhat, and is intended for data analysis applications exclusively. It is intended to emulate the RStudio IDE, which is popular among R users, and it can be downloaded from Rodeo's website. The only advantage of the base Python interpreter is that every Python installation includes it, shown as follows:

ptpython

What may be a lesser known console-based Python REPL is ptpython, designed by Jonathan Slenders. It exists only in the console and is an independent project by him. You can find it on GitHub. It has lightweight features, yet it also includes syntax highlighting, autocompletion, and even IPython. It can be installed with the following command:

pip install ptpython

That concludes our discussion on alternatives to the Jupyter Notebooks.