Learning Data Mining with Python

Using Python and the IPython Notebook


In this section, we will cover installing Python and the environment that we will use for most of the book, the IPython Notebook. Furthermore, we will install the numpy module, which we will use for the first set of examples.

Installing Python

Python is a fantastic, versatile, and easy-to-use language.

For this book, we will be using Python 3.4, which is available for your system from the Python Organization's website: https://www.python.org/downloads/.

You will see two major versions to choose from, Python 3.4 and Python 2.7. Download and install Python 3.4, which is the version tested throughout this book.

In this book, we will be assuming that you have some knowledge of programming and Python itself. You do not need to be an expert with Python to complete this book, although a good level of knowledge will help.

If you do not have any experience with programming, I recommend that you pick up the Learning Python book from.

The Python organization also maintains a list of two online tutorials for those new to Python:

  • For nonprogrammers who want to learn programming through the Python language: https://wiki.python.org/moin/BeginnersGuide/NonProgrammers

  • For those who already know how to program, but need to learn Python specifically: https://wiki.python.org/moin/BeginnersGuide/Programmers

    Note

    Windows users will need to set an environment variable in order to use Python from the command line. First, find where Python 3 is installed; the default location is C:\Python34. Next, enter this command into the command line (cmd program): set PATH=%PATH%;C:\Python34. Remember to change C:\Python34 if Python is installed in a different directory.
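The command line finds programs by searching the directories listed in the PATH environment variable. As a minimal sketch, the following script reports whether a python command is visible on that search path. The helper name find_on_path is made up for this illustration. If the command line cannot find Python yet, you can run the script from IDLE or by calling the interpreter by its full path.

```python
import shutil


def find_on_path(command):
    """Return the full path of `command` if it is on the search path, else None."""
    # shutil.which searches the directories listed in the PATH variable.
    return shutil.which(command)


for name in ("python", "python3"):
    location = find_on_path(name)
    print(name, "->", location if location else "not found on PATH")
```

If both names are reported as not found, the directory containing the Python executable is probably missing from PATH.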

Once you have Python running on your system, you should be able to open a command prompt and run the following code:

$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11)
[GCC 4.8.2] on Linux
Type "help", "copyright", "credits" or "license" for more information.
>>> print("Hello, world!")
Hello, world!
>>> exit()

Note that we will be using the dollar sign ($) to denote that a command is to be typed into the terminal (also called a shell or cmd on Windows). You do not need to type this character (or the space that follows it). Just type in the rest of the line and press Enter.
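Because Python 2.7 and Python 3.4 can be installed side by side, it can be worth confirming which interpreter is actually running your code. The following is a minimal sketch, not part of the book's code; is_python3 is a hypothetical helper name chosen for this illustration.

```python
import sys


# Hypothetical helper, named here for illustration only.
def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple belongs to Python 3 or later."""
    return version_info[0] >= 3


print("Running Python", sys.version.split()[0])
if not is_python3():
    print("This looks like Python 2; the code in this book assumes Python 3.")
```

You can paste these lines at the >>> prompt or save them to a file and run that file with python3.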

After you have the above "Hello, world!" example running, exit the program and move on to installing a more advanced environment to run Python code, the IPython Notebook.

Note

Python 3.4 includes a program called pip, a package manager that helps you install new libraries on your system. You can verify that pip is working by running the $ pip3 freeze command, which lists the packages installed on your system.
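The freeze command prints one requirement per line, usually in the form name==version. If you want to inspect that listing from Python, a small parser along the following lines may help. This is only a sketch: parse_freeze is a hypothetical helper, the sample text is made up for illustration, and lines in other formats (such as editable installs) are skipped.

```python
def parse_freeze(text):
    """Map package names to versions from `pip freeze`-style text."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        # Keep only plain "name==version" lines; skip blanks, comments,
        # and other forms such as editable installs.
        if not line or line.startswith(("#", "-")) or "==" not in line:
            continue
        name, version = line.split("==", 1)
        packages[name.lower()] = version
    return packages


# Made-up sample text for illustration, not output from a real system.
sample = "numpy==1.9.0\nscipy==0.14.0\n"
print(parse_freeze(sample))
```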

Installing IPython

IPython is a platform for Python development that contains a number of tools and environments for running Python and has more features than the standard interpreter. It contains the powerful IPython Notebook, which allows you to write programs in a web browser. It also formats your code, shows output, and allows you to annotate your scripts. It is a great tool for exploring datasets and we will be using it as our main environment for the code in this book.

To install IPython on your computer, you can type the following into a command line prompt (not into Python):

$ pip3 install ipython[all]

You will need administrator privileges to install this system-wide. If you do not want to (or can't) make system-wide changes, you can install it for just the current user by running this command:

$ pip3 install --user ipython[all]

This will install the IPython package into a user-specific location. You will be able to use it, but other users of your computer will not. If you are having difficulty with the installation, check the official documentation for more detailed installation instructions: http://ipython.org/install.html.

With the IPython Notebook installed, you can launch it with the following:

$ ipython3 notebook

This will do two things. First, it will create an IPython Notebook instance that runs in the command prompt you just used. Second, it will launch your web browser and connect to this instance, allowing you to create a new notebook. It will look similar to the following screenshot (where home/bob will be replaced by your current working directory):

To stop the IPython Notebook from running, open the command prompt that has the instance running (the one you used earlier to run the IPython command). Then, press Ctrl + C and you will be prompted with "Shutdown this notebook server (y/[n])?". Type y and press Enter, and the IPython Notebook will shut down.

Installing scikit-learn

The scikit-learn package is a machine learning library written in Python. It contains numerous algorithms, datasets, utilities, and frameworks for performing machine learning. Built upon the scientific Python stack, scikit-learn uses libraries such as numpy and scipy, which are often optimized for speed. This makes scikit-learn fast and scalable in many instances, and useful for all skill levels, from beginners to advanced research users. We will cover more details of scikit-learn in Chapter 2, Classifying with scikit-learn Estimators.

To install scikit-learn, you can use the pip utility that comes with Python 3, which will also install the numpy and scipy libraries if you do not already have them. Open a terminal with administrator/root privileges and enter the following command:

$ pip3 install -U scikit-learn
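Once the command finishes, you may want to confirm that Python can locate the three libraries. The sketch below uses importlib.util.find_spec from the standard library (available from Python 3.4); is_installed is a hypothetical helper name. Note that scikit-learn is imported under the name sklearn.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# scikit-learn is imported under the name "sklearn".
for name in ("numpy", "scipy", "sklearn"):
    print(name, "found" if is_installed(name) else "MISSING")
```

A module reported as MISSING usually means the installation went to a different Python version than the one running the script.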

Note

Windows users may need to install the numpy and scipy libraries before installing scikit-learn. Installation instructions are available at www.scipy.org/install.html for those users.

Users of major Linux distributions such as Ubuntu or Red Hat may wish to install the official package from their package manager. Not all distributions have the latest versions of scikit-learn, so check the version before installing it. The minimum version needed for this book is 0.14.
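One way to check the installed version against that minimum is to read the package's __version__ attribute. The following is a rough sketch rather than an official check: meets_minimum is a hypothetical helper that compares only the major and minor numbers at the start of the version string.

```python
import re


def meets_minimum(version, minimum=(0, 14)):
    """Return True if a version string such as "0.15.2" is at least `minimum`."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    return (int(match.group(1)), int(match.group(2))) >= minimum


try:
    import sklearn  # import name of the scikit-learn package
except ImportError:
    print("scikit-learn is not installed")
else:
    status = "OK" if meets_minimum(sklearn.__version__) else "too old"
    print("scikit-learn", sklearn.__version__, status)
```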

Those wishing to install the latest version by compiling the source, or to view more detailed installation instructions, can consult the official scikit-learn installation documentation at http://scikit-learn.org/stable/install.html.