Hands-On Data Science and Python Machine Learning

By: Frank Kane

Overview of this book

Join Frank Kane, who worked on Amazon and IMDb’s machine learning algorithms, as he guides you on your first steps into the world of data science. Hands-On Data Science and Python Machine Learning gives you the tools that you need to understand and explore the core topics in the field, and the confidence and practice to build and analyze your own machine learning models. With the help of interesting and easy-to-follow practical examples, Frank Kane explains potentially complex topics such as Bayesian methods and K-means clustering in a way that anybody can understand. Based on Frank’s successful data science course, Hands-On Data Science and Python Machine Learning empowers you to conduct data analysis and perform efficient machine learning using Python. Let Frank help you unearth the value in your data using the various data mining and data analysis techniques available in Python, and develop efficient predictive models to predict future results. You will also learn how to perform large-scale machine learning on Big Data using Apache Spark. The book covers preparing your data for analysis, training machine learning models, and visualizing the final data analysis.
Table of Contents (11 chapters)

Installing Enthought Canopy

Let's dive right in and get what you need installed to develop Python code for data science on your desktop. I'm going to walk you through installing a package called Enthought Canopy, which has both the development environment and all the Python packages you need pre-installed. It makes life really easy, but if you already know Python, you might have an existing Python environment on your PC, and if you want to keep using it, you may be able to, as long as it meets the requirements below.

The most important thing is that your Python environment has Python 3.5 or newer, that it supports Jupyter Notebooks (because that's what we're going to use in this course), and that the key packages you need for this book are installed in your environment. I'll explain exactly how to achieve a full installation in a few simple steps; it's going to be very easy.

Let's first go over those key packages, most of which Canopy will install for us automatically. Canopy will install Python 3.5 for us, along with some further packages we need, including scikit_learn, xlrd, and statsmodels. We'll need to use the pip command manually to install one more package, called pydotplus. And that will be it; it's very easy with Canopy!
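
If you are reusing an existing Python environment, the requirements above can be checked from any Python prompt. The following is a minimal sketch, not something Canopy provides: the helper names are made up for illustration, and it assumes the usual import names for the packages (scikit-learn is imported as sklearn).

```python
import importlib.util
import sys

# Import names assumed for the packages this book relies on.
# Note that scikit-learn is imported as "sklearn".
REQUIRED_MODULES = ["sklearn", "xlrd", "statsmodels", "pydotplus"]


def python_is_new_enough(minimum=(3, 5)):
    """Return True if the running interpreter is at least `minimum`."""
    return sys.version_info[:2] >= minimum


def missing_modules(names):
    """Return the subset of top-level module `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python version OK:", python_is_new_enough())
    print("Missing packages:", missing_modules(REQUIRED_MODULES) or "none")
```

An empty list of missing packages suggests the environment is ready; anything that is reported missing can usually be added with pip.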

Once the following installation steps are complete, we'll have everything we need to actually get up and running, and so we'll open up a little sample file and do some data science for real. Now let's get you set up with everything you need to get started as quickly as possible:

  1. The first thing you will need is a development environment, called an IDE, for Python code. What we're going to use for this book is Enthought Canopy. It's a scientific computing environment, and it's going to work well with this book.
  2. To get Canopy installed, just go to www.enthought.com and click on DOWNLOADS: Canopy.
  3. Enthought Canopy is free for the Canopy Express edition, which is what you want for this book. You must then select your operating system and architecture. For me, that's Windows 64-bit, but you'll want to click on the corresponding Download button for your operating system, with the Python 3.5 option.
  4. We don't have to give them any personal information at this step. There's a pretty standard Windows installer, so just let that download.
  5. After that's downloaded, go ahead and open up the Canopy installer and run it! You might want to read the license before you agree to it; that's up to you. Then just wait for the installation to complete.
  6. Once you hit the Finish button at the end of the install process, allow it to launch Canopy automatically. You'll see that Canopy then sets up the Python environment by itself, which is great, but this will take a minute or two.
  7. Once the installer is done setting up your Python environment, you should get a screen that says Welcome to Canopy and shows a bunch of big friendly buttons.
  8. The beautiful thing is that pretty much everything you need for this book comes pre-installed with Enthought Canopy; that's why I recommend using it!
  9. There is just one last thing we need to set up, so go ahead and click the Editor button there on the Canopy Welcome screen. You'll then see the Editor screen come up, and if you click down in the window at the bottom, I want you to just type in:
!pip install pydotplus
  10. Type that line in at the bottom of the Canopy Editor window, and don't forget to press the Return key, of course.
  11. Once you hit the Return key, this will install the one extra module that we need for later on in the book, when we get to talking about decision trees and rendering decision trees.
  12. Once it has finished installing pydotplus, it should come back and say it's successfully installed and, voila, you have everything you need now to get started! The installation is done at this point, but let's just take a few more steps to confirm our installation is running nicely.
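
For readers who would rather do the pydotplus step from a script than from the Canopy prompt, here is a hedged sketch of the same idea. The function names are hypothetical, and invoking pip as a module of the running interpreter is a common convention, not something Canopy requires.

```python
import importlib.util
import subprocess
import sys


def pip_install_command(package):
    """Build the pip command for the interpreter that is currently running."""
    return [sys.executable, "-m", "pip", "install", package]


def ensure_installed(module_name, package=None):
    """Install `package` with pip only if `module_name` is not importable.

    Returns True if pip was invoked, False if the module was already present.
    """
    if importlib.util.find_spec(module_name) is not None:
        return False
    subprocess.check_call(pip_install_command(package or module_name))
    return True


# Example (needs network access the first time it runs):
# ensure_installed("pydotplus")
```

Checking for the module first means the script can be run repeatedly without reinstalling anything.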

Giving the installation a test run

  1. Let's now give your installation a test run. The first thing to do is to close the Canopy window entirely! This is because we're not actually going to be editing and running our code within the Canopy editor. Instead, we're going to use something called an IPython Notebook, which is now known as the Jupyter Notebook.
  2. Let me show you how that works. Open a window in your operating system to view the accompanying book files that you downloaded, as described in the Preface of this book. You should see the set of .ipynb code files you downloaded for this book.

Now go down to the Outliers file in the list (that's the Outliers.ipynb file) and double-click it. What should happen is that Canopy starts up first and then kicks off your web browser! This is because IPython/Jupyter Notebooks actually live within your web browser. There can be a small pause at first, and it can be a little bit confusing the first time, but you'll soon get used to the idea.

You should soon see Canopy come up, followed by your default web browser, which for me is Chrome. Since we double-clicked on the Outliers.ipynb file, the browser should show the Jupyter Notebook page for that file, about Dealing with Outliers.

If you see this screen, it means that everything's working great in your installation and you're all set for the journey across the rest of this book!

If you occasionally get problems opening your IPYNB files

Just occasionally, I've noticed that things can go a little bit wrong when you double-click on a .ipynb file. Canopy can sometimes get a little bit flaky, and you might see a screen that asks for a password or token, or a screen that says it can't connect at all.

Don't panic if either of those things happens to you. They are just random quirks: sometimes things don't start up in the right order, or don't start up in time on your PC, and that's okay.

All you have to do is go back and try to open that file a second time. Sometimes it takes two or three tries to get it loaded properly, but if you try a couple of times it should pop up eventually, and you should see a Jupyter Notebook screen like the Dealing with Outliers one we saw previously.
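
If repeated attempts still fail, it may be worth ruling out a damaged download. A .ipynb file is a JSON document, so a small script can list the notebooks in a folder and check that each one parses. This is only a sketch: the function name is hypothetical, and the check for a top-level list of cells assumes the version 4 notebook layout.

```python
import json
from pathlib import Path


def check_notebooks(folder):
    """Map each .ipynb file name in `folder` to True if it looks readable.

    "Readable" here means the file parses as JSON and has a top-level
    "cells" list (an assumption based on the version 4 notebook layout).
    """
    results = {}
    for path in sorted(Path(folder).glob("*.ipynb")):
        try:
            with path.open(encoding="utf-8") as handle:
                notebook = json.load(handle)
        except (OSError, ValueError):
            # Unreadable file, bad encoding, or invalid JSON.
            results[path.name] = False
            continue
        results[path.name] = (
            isinstance(notebook, dict)
            and isinstance(notebook.get("cells"), list)
        )
    return results


# Example: check_notebooks(".") run from the folder holding the book files.
```

Any file mapped to False is a candidate for downloading again before blaming Canopy.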