Applied Supervised Learning with Python

By: Benjamin Johnston, Ishita Mathur

Overview of this book

Machine learning, the ability of a machine to give right answers based on input data, has revolutionized the way we do business. Applied Supervised Learning with Python provides a rich understanding of how you can apply machine learning techniques in your data science projects using Python. You'll explore Jupyter Notebooks, a technology widely used in academic and commercial circles that supports running code inline. With the help of fun examples, you'll gain experience working with the Python machine learning toolkit, from performing basic data cleaning and processing to working with a range of regression and classification algorithms. Once you've grasped the basics, you'll learn how to build and train your own models using advanced techniques such as decision trees, ensemble modeling, validation, and error metrics. You'll also learn data visualization techniques using powerful Python libraries such as Matplotlib and Seaborn. This book also covers random forest classifiers along with other methods for combining results from multiple models, and concludes by delving into cross-validation to test your algorithm and check how well the model works on unseen data. By the end of this book, you'll be equipped not only to work with machine learning algorithms, but also to create some of your own!
Table of Contents (9 chapters)

Jupyter Notebooks


One aspect of the data science development environment that distinguishes it from other Python projects is the use of IPython Jupyter notebooks (https://jupyter.org). Jupyter notebooks provide a means of creating and sharing interactive documents with live, executable code snippets and plots, as well as the rendering of mathematical equations through the LaTeX (https://www.latex-project.org) typesetting system. This section of the chapter will introduce you to Jupyter notebooks and some of their key features to ensure your development environment is correctly set up.

Throughout this book, we will make frequent reference to the documentation for each of the introduced tools/packages. The ability to effectively read and understand the documentation for each tool is extremely important. Many of the packages we will use contain so many features and implementation details that it is very difficult to memorize them all. The following documentation may come in handy for the upcoming section on Jupyter notebooks:

  • The Anaconda documentation can be found at https://docs.anaconda.com.

  • The Anaconda user guide can be found at https://docs.anaconda.com/anaconda/user-guide.

  • The Jupyter Notebook documentation can be found at https://jupyter-notebook.readthedocs.io/en/stable/.

Exercise 1: Launching a Jupyter Notebook

In this exercise, we will launch our Jupyter notebook. Ensure you have correctly installed Anaconda with Python 3.7, as per the Preface:

  1. There are two ways of launching a Jupyter notebook through Anaconda. The first method is to open Jupyter using the Anaconda Navigator application available in the Anaconda folder of the Windows Start menu. Click on the Launch button and your default internet browser will then launch at the default address, http://localhost:8888, and will start in a default folder path.

  2. The second method is to launch Jupyter via the Anaconda prompt. To launch the Anaconda prompt, simply click on the Anaconda Prompt menu item, also in the Windows Start menu, and you should see a pop-up window similar to the following screenshot:

    Figure 1.4: Anaconda prompt

  3. Once in the Anaconda prompt, change to the desired directory using the cd (change directory) command. For example, to change into the Desktop directory for the Packt user, do the following:

    C:\Users\Packt> cd C:\Users\Packt\Desktop

  4. Once in the desired directory, launch a Jupyter notebook using the following command:

    C:\Users\Packt\Desktop> jupyter notebook

    The notebook will launch with its working directory set to the one you specified earlier. This allows you to navigate and save your notebooks in the directory of your choice as opposed to the default, which can vary between systems, but is typically your home or My Computer directory. Irrespective of the method of launching Jupyter, a window similar to the following will open in your default browser. If there are existing files in the directory, you should also see them here:

    Figure 1.5: Jupyter notebook launch window
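Once a notebook is open, one way to confirm which directory Jupyter was launched from is to ask Python directly. The following is a minimal sketch using only the standard library; the variable names are illustrative and not part of the exercise:

```python
import os

# Directory the notebook kernel (or this script) is currently running in.
working_directory = os.getcwd()
print(working_directory)

# Contents of that directory; when Jupyter was launched from here, these
# should match the files and folders listed in the launch window.
entries = sorted(os.listdir(working_directory))
print(entries[:5])  # show at most the first five entries
```

If the printed path is not the one you expected, close Jupyter, change directory in the Anaconda prompt, and launch it again.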

Exercise 2: Hello World

The Hello World exercise is a rite of passage, so you certainly cannot be denied that experience! So, let's print Hello World in a Jupyter notebook in this exercise:

  1. Start by creating a new Jupyter notebook by clicking on the New button and selecting Python 3. Jupyter allows you to run different versions of Python and other languages, such as R and Julia, all in the same interface. We can also create new folders or text files here. But for now, we will start with a Python 3 notebook:

    Figure 1.6: Creating a new notebook

    This will launch a new Jupyter notebook in a new browser window. We will first spend some time looking over the various tools that are available in the notebook:

    Figure 1.7: The new notebook

    There are three main sections in each Jupyter notebook, as shown in the following screenshot: the title bar (1), the toolbar (2), and the body of the document (3). Let's look at each of these components in order:

    Figure 1.8: Components of the notebook

  2. The title bar simply displays the name of the current Jupyter notebook and allows the notebook to be renamed. Click on the Untitled text and a popup will appear allowing you to rename the notebook. Enter Hello World and click Rename:

    Figure 1.9: Renaming the notebook

  3. For the most part, the toolbar contains all the normal functionality that you would expect. You can open, save, copy, or create Jupyter notebooks in the File menu. You can search, replace, copy, and cut content in the Edit menu and adjust the view of the document in the View menu. As we discuss the body of the document, we will also describe some of the other functionalities in more detail, such as the ones included in the Insert, Cell, and Kernel menus. One aspect of the toolbar that requires further examination is the far right-hand side: the outline of the circle on the right of Python 3.

    Hover your mouse over the circle and you will see the Kernel Idle popup. This circle indicates whether the Python kernel is currently processing; while it is processing, the circle is filled in. If you ever suspect that something is running or is not running, you can refer to this icon for more information. When the Python kernel is idle, you will see this:

    Figure 1.10: Kernel idle

    When the Python kernel is busy, you will see this:

    Figure 1.11: Kernel busy

  4. This brings us to the body of the document, where the actual content of the notebook will be entered. Jupyter notebooks differ from standard Python scripts or modules, in that they are divided into separate executable cells. While Python scripts or modules will run the entirety of the script when executed, Jupyter notebooks can run all of the cells sequentially, or can also run them separately and in a different order if manually executed.

    Double-click on the first cell and enter the following:

    print('Hello World!')

  5. Click on Run (or use the Ctrl + Enter keyboard shortcut):

    Figure 1.12: Running a cell

Congratulations! You just completed Hello World in a Jupyter notebook.

Exercise 3: Order of Execution in a Jupyter Notebook

In the previous exercise, notice how the output of the print statement is displayed beneath the cell. Now let's take it a little further. As mentioned earlier, Jupyter notebooks are composed of a number of separately executable cells; it is best to think of them as blocks of code you have entered into the Python interpreter, where the code is not executed until you press Ctrl + Enter. Although the cells are run at different times, all of the variables and objects remain in the session within the Python kernel. Let's investigate this a little further:

  1. Launch a new Jupyter notebook and then, in three separate cells, enter the code shown in the following screenshot:

    Figure 1.13: Entering code into multiple cells

  2. Click Restart & Run All.

    Notice that there are three executable cells, and the order of execution is shown in square brackets; for example, In [1], In [2], and In [3]. Also note how the hello_world variable is declared (and thus executed) in the second cell and remains in memory, and thus is printed in the third cell. As we mentioned before, you can also run the cells out of order.

  3. Click on the second cell, containing the declaration of hello_world, change the value to add a few more exclamation points, and run the cell again:

    Figure 1.14: Changing the content of the second cell

    Notice that the second cell is now the most recently executed cell (In [4]), and that the output of the print statement after it has not been updated. To update the output, you would need to execute the cell below it.

    Warning: be careful about your order of execution. Because notebooks no longer require you to run the entire script at once, you can easily overwrite values or declare variables in cells below their first use. As such, it is good practice to regularly click Kernel | Restart & Run All. This will clear all variables from memory and run all cells from top to bottom in order. There is also the option to run all cells below or above a particular cell in the Cell menu:

    Figure 1.15: Restarting the kernel

    Note

    Write and structure your notebook cells as if you were to run them all in order, top to bottom. Use manual cell execution only for debugging/early investigation.

  4. You can also move cells around using either the up/down arrows on the left of Run or through the Edit toolbar. Move the cell that prints the hello_world variable to above its declaration:

    Figure 1.16: Moving cells

  5. Click on Restart & Run All cells:

    Figure 1.17: Variable not defined error

    Notice the error reporting that the variable is not defined. This is because it is being used before its declaration. Also, notice that the cell after the error has not been executed as shown by the empty In [ ].
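The behavior seen in this exercise can be imitated in plain Python. The sketch below is not how Jupyter is implemented; it simply treats each string as one cell and runs them top to bottom in a single shared namespace, which plays the role of the kernel's memory. The cell contents and the run_all helper are assumptions made for illustration (the first cell in particular is hypothetical), not the code from the screenshots:

```python
import contextlib
import io


def run_all(cells):
    """Run cell sources top to bottom in one fresh namespace.

    Returns the captured output of each executed cell. Execution stops at
    the first error, whose type name is recorded in place of an output.
    """
    namespace = {}  # stands in for the kernel's memory after a restart
    results = []
    for source in cells:
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(source, namespace)
        except Exception as error:
            results.append(type(error).__name__)
            break  # cells after the error are never executed
        results.append(buffer.getvalue().strip())
    return results


# Hypothetical cell contents: declaration in the second cell, use in the third.
in_order = [
    "print('first cell')",
    "hello_world = 'Hello World!'",
    "print(hello_world)",
]
# The printing cell moved above the declaration, as in step 4.
moved = [in_order[0], in_order[2], in_order[1]]

print(run_all(in_order))
print(run_all(moved))
```

With the cells in their original order every cell runs, whereas the reordered list stops at the cell that uses hello_world before it exists, leaving the declaration cell unexecuted.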

Exercise 4: Advantages of Jupyter Notebooks

There are a number of additional features of Jupyter notebooks that make them very useful. In this exercise, we will examine some of these features:

  1. Jupyter notebooks can execute system commands, such as those you would type in the Anaconda prompt, directly from a cell when the command is prefixed with an exclamation point (!). Enter the code shown in the following screenshot and run the cell:

    Figure 1.18: Running Anaconda commands

  2. One of the best features of Jupyter notebooks is the ability to create live reports that contain executable code. Not only does this save time in preventing separate creation of reports and code, but it can also assist in communicating the exact nature of the analysis being completed. Through the use of Markdown and HTML, we can embed headings, sections, images, or even JavaScript for dynamic content.

    To use Markdown in our notebook, we first need to change the cell type. First, click on the cell you want to change to Markdown, then click on the Code drop-down menu and select Markdown:

    Figure 1.19: Running Anaconda commands

    Notice that In [ ] has disappeared and the color of the box lining the cell is no longer blue.

  3. You can now enter valid Markdown syntax and HTML by double-clicking in the cell and then clicking Run to render the Markdown. Enter the syntax shown in the following screenshot and run the cell to see the output:

    Figure 1.20: Markdown syntax

    The output will be as follows:

    Figure 1.21: Markdown output

    Note

    For a quick reference on Markdown, refer to the Markdown Syntax.ipynb Jupyter notebook in the code files for this chapter.
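The exclamation point prefix from step 1 is notebook syntax and will not work in an ordinary Python script. As a rough plain-Python counterpart, the standard library's subprocess module can run a command and capture its output. The following sketch assumes the command of interest is the interpreter's own version check, which is not necessarily the command shown in the screenshot:

```python
import subprocess
import sys

# In a notebook cell, `!python --version` would pass the command to the
# system shell. Here the current interpreter is invoked explicitly instead.
completed = subprocess.run(
    [sys.executable, "--version"],
    capture_output=True,
    text=True,
    check=True,  # raise an error if the command fails
)

# Fall back to stderr in case the version text is written there.
version_text = (completed.stdout or completed.stderr).strip()
print(version_text)
```

Inside a notebook the prefixed form is shorter and is usually preferable; the subprocess form is useful when the same code must also run as a script.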

Python Packages and Modules

While the standard library that is included with Python is certainly feature-rich, the true power of Python lies in the additional libraries (also known as packages in Python), which, thanks to open source licensing, can be easily downloaded and installed through a few simple commands. In an Anaconda installation, it is even easier as many of the most common packages come pre-installed with Anaconda. You can get a complete list of the pre-installed packages in the Anaconda environment by running the following command in a notebook cell:

!conda list

In this book, we will be using the following additional Python packages:

  • NumPy (pronounced Num Pie and available at https://www.numpy.org/): NumPy (short for numerical Python) is one of the core components of scientific computing in Python. NumPy provides the foundational array data types on which a number of other data structures are built, along with linear algebra routines, vector and matrix operations, and key random number functionality.

  • SciPy (pronounced Sigh Pie and available at https://www.scipy.org): SciPy, along with NumPy, is a core scientific computing package. SciPy provides a number of statistical tools, signal processing tools, and other functionality, such as Fourier transforms.

  • pandas (available at https://pandas.pydata.org/): pandas is a high-performance library for loading, cleaning, analyzing, and manipulating data structures.

  • Matplotlib (available at https://matplotlib.org/): Matplotlib is the foundational Python library for creating graphs and plots of datasets and is also the base package from which other Python plotting libraries derive. The Matplotlib API has been designed in alignment with the MATLAB plotting library to facilitate an easy transition to Python.

  • Seaborn (available at https://seaborn.pydata.org/): Seaborn is a plotting library built on top of Matplotlib, providing attractive color and line styles as well as a number of common plotting templates.

  • Scikit-learn (available at https://scikit-learn.org/stable/): Scikit-learn is a Python machine learning library that provides a number of data mining, modeling, and analysis techniques in a simple API. Scikit-learn includes a number of machine learning algorithms out of the box, including classification, regression, and clustering techniques.
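A quick way to check that these packages are available in the current environment is to ask Python whether each one can be found, without actually importing it. The sketch below uses only the standard library; the is_installed helper is a name chosen for illustration, and note that Scikit-learn is imported under the name sklearn:

```python
import importlib.util

# Import names of the packages listed above (Scikit-learn imports as sklearn).
packages = ["numpy", "scipy", "pandas", "matplotlib", "seaborn", "sklearn"]


def is_installed(name):
    """Return True if a top-level module called `name` can be located."""
    return importlib.util.find_spec(name) is not None


availability = {name: is_installed(name) for name in packages}

for name, found in availability.items():
    print(f"{name}: {'available' if found else 'missing'}")
```

Any package reported as missing would need to be installed before the corresponding import statement will succeed.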

These packages form the foundation of a versatile machine learning development environment, with each package contributing a key set of functionalities. As discussed, by using Anaconda, you will already have all of the required packages installed and ready for use. If you require a package that is not included in the Anaconda installation, it can be installed by entering and executing the following in a Jupyter notebook cell (the -y flag answers conda's confirmation prompt automatically, since a notebook cell cannot respond to it interactively):

!conda install -y <package name>

As an example, if we wanted to install Seaborn, we'd run this:

!conda install -y seaborn

To use one of these packages in a notebook, all we need to do is import it:

import matplotlib
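Python offers a few forms of the import statement. The sketch below shows them using standard library modules so that it runs in any environment; the same forms apply to the third-party packages listed above, where an alias such as np for numpy is a widely used convention. The values used here are purely illustrative:

```python
import math                 # import the whole module under its own name
import statistics as stats  # import a module under a shorter alias
from math import sqrt       # import a single name from a module

values = [1.0, 2.0, 3.0, 4.0]

mean_value = stats.mean(values)  # accessed through the alias
root = sqrt(16)                  # used directly, without the module prefix
circle_area = math.pi * 2 ** 2   # accessed through the module name

print(mean_value, root, circle_area)
```

Importing a module under an alias keeps the origin of each function visible while shortening the code, which is why it is a common choice in notebooks.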