Large Scale Machine Learning with Python

By: Bastiaan Sjardin, Alberto Boschetti

Overview of this book

Large Python machine learning projects involve new problems associated with specialized machine learning architectures and designs that many data scientists have yet to tackle. But finding algorithms and designing and building platforms that deal with large sets of data is a growing need. Data scientists have to manage and maintain increasingly complex data projects, and with the rise of big data comes an increasing demand for computational and algorithmic efficiency. Large Scale Machine Learning with Python uncovers a new wave of machine learning algorithms that meet scalability demands together with high predictive accuracy. Dive into scalable machine learning and the three forms of scalability. Speed up algorithms that can be used on a desktop computer with tips on parallelization and memory allocation. Get to grips with new algorithms that are specifically designed for large projects and can handle bigger files, and learn about machine learning in big data environments. We will also cover the most effective machine learning techniques on a MapReduce framework in Hadoop and Spark in Python.

Python packages

The packages introduced in this section are used frequently throughout the book. If you are not using a scientific distribution, we offer a walkthrough of which versions to choose and how to install them quickly and successfully.


NumPy

NumPy, Travis Oliphant's creation, is at the core of every analytical solution in the Python language. It provides the user with multidimensional arrays along with a large set of functions to perform mathematical operations on these arrays. Arrays are blocks of data arranged along multiple dimensions, which implement mathematical vectors and matrices. Arrays are useful not just for storing data, but also for fast matrix operations (vectorization), which are indispensable when you wish to solve ad hoc data science problems.

  • Website:

  • Version at the time of writing: 1.11.1

  • Suggested install command:

    $ pip install numpy


By a convention widely adopted in the Python community, NumPy is imported under the alias np:

import numpy as np
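As a minimal sketch of the vectorized operations described above, the following lines build a small matrix and a vector and operate on them without explicit Python loops. The values and variable names are made up for illustration only.

```python
import numpy as np

# A 2-D array (matrix) and a 1-D array (vector); the values are illustrative.
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])
vector = np.array([10.0, 20.0])

# Element-wise arithmetic is applied to every entry at once.
scaled = matrix * 2.0

# Matrix-vector product, executed in compiled code rather than a Python loop.
product = matrix.dot(vector)

# Reduction along an axis: axis=0 collapses the rows, giving one mean per column.
column_means = matrix.mean(axis=0)

print(scaled)
print(product)
print(column_means)
```

Each of the three operations replaces what would otherwise be a nested loop over rows and columns, which is where most of the speed-up on large arrays comes from.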


SciPy

An original project by Travis Oliphant, Pearu Peterson, and Eric Jones, SciPy complements NumPy's functionality, offering a larger variety of scientific algorithms for linear algebra, sparse matrices, signal and image processing, optimization, fast Fourier transforms, and much more.

  • Website:

  • Version at the time of writing: 0.17.1

  • Suggested install command:

    $ pip install scipy
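To give an idea of one of the features listed above, here is a small sketch of SciPy's sparse matrices, which store only the nonzero entries. The matrix content is invented for the example, and the compressed sparse row (CSR) format is just one of several formats that scipy.sparse offers.

```python
import numpy as np
from scipy import sparse

# A mostly empty matrix; the values are illustrative.
dense = np.array([[0.0, 0.0, 3.0],
                  [4.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0]])

# Convert to compressed sparse row (CSR) format: only nonzero entries are stored.
csr = sparse.csr_matrix(dense)
stored_entries = csr.nnz

# Sparse matrix times dense vector; the result is a regular NumPy array.
row_sums = csr.dot(np.ones(3))

print(stored_entries)
print(row_sums)
```

On data where most entries are zero, such as bag-of-words text features, this representation can reduce memory use considerably compared with a dense array.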


Pandas

Pandas covers much of what NumPy and SciPy cannot do. In particular, thanks to its specific data structures, DataFrame and Series, it handles complex tables of data of different types (something that NumPy's arrays cannot do) as well as time series. With Wes McKinney's creation, you will be able to easily load data from a variety of sources, and then slice, dice, handle missing elements, add, rename, aggregate, reshape, and finally visualize it at will.


Conventionally, pandas is imported as pd:

import pandas as pd
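The following sketch illustrates a few of the operations mentioned above on a tiny, invented table: columns of different types, a missing value, and an aggregation by group. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# A small table mixing strings, floats, and booleans; one temperature is missing.
df = pd.DataFrame({
    "city": ["Rome", "Rome", "Oslo", "Oslo"],
    "temp": [21.0, np.nan, 5.0, 7.0],
    "rainy": [False, True, True, False],
})

# Handle the missing element by replacing it with the mean of the observed values.
df["temp"] = df["temp"].fillna(df["temp"].mean())

# Aggregate: average temperature per city, returned as a Series indexed by city.
mean_by_city = df.groupby("city")["temp"].mean()

print(df)
print(mean_by_city)
```

Filling with the column mean is only one possible strategy for missing data; it is used here because it keeps the example short.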


Scikit-learn

Started as part of SciKits (SciPy Toolkits), Scikit-learn is the core of data science operations in Python. It offers all that you may need in terms of data preprocessing, supervised and unsupervised learning, model selection, validation, and error metrics. Expect us to talk at length about this package throughout the book.

Scikit-learn started in 2007 as a Google Summer of Code project by David Cournapeau. Since 2013, it has been taken over by the researchers at Inria (French Institute for Research in Computer Science and Automation).

Scikit-learn offers modules for data processing (sklearn.preprocessing and sklearn.feature_extraction), model selection and validation (sklearn.cross_validation, sklearn.grid_search, and sklearn.metrics), and a complete set of methods (sklearn.linear_model) in which the target value, being a number or probability, is expected to be a linear combination of the input variables.


Note that the imported module is named sklearn.
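As a rough sketch of how some of these modules fit together, the example below standardizes two input variables with sklearn.preprocessing, fits a model from sklearn.linear_model, and scores it with sklearn.metrics. The data is synthetic and built so that the target is an exact linear combination of the inputs; it is an assumption made for illustration, not an example from the book.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Synthetic data: the target is exactly 2*x1 + 3*x2 + 1.
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0]])
y = 2 * X[:, 0] + 3 * X[:, 1] + 1

# Preprocessing: rescale each column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Linear model: the target is modeled as a linear combination of the inputs.
model = LinearRegression().fit(X_scaled, y)
predictions = model.predict(X_scaled)

# Error metric, evaluated here on the training data only to keep the sketch short.
mse = mean_squared_error(y, predictions)
print(mse)
```

Because the target is exactly linear in the inputs, the error should be close to zero; with real data you would evaluate on held-out examples instead.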

The matplotlib package

Originally developed by John Hunter, matplotlib is the library containing all the building blocks to create quality plots from arrays and visualize them interactively.

You can find all the MATLAB-like plotting frameworks inside the PyLab module.

  • Website:

  • Version at the time of writing: 1.5.1

  • Suggested install command:

    $ pip install matplotlib

You can simply import just what you need for your visualization purposes:

import matplotlib as mpl
from matplotlib import pyplot as plt
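With these imports in place, a basic plot takes only a few lines. The sketch below is a minimal example under two assumptions that are not in the text: it selects the non-interactive Agg backend so that it also runs without a display, and it writes the figure to an in-memory buffer instead of a file.

```python
import io

import matplotlib as mpl
mpl.use("Agg")  # assumption: non-interactive backend, so no display is required
from matplotlib import pyplot as plt
import numpy as np

# Data to plot: one period of a sine wave.
x = np.linspace(0, 2 * np.pi, 100)

# Build the figure from its basic blocks: a figure, an axes, a line, labels.
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# Render the figure as PNG into memory; pass a filename instead to write a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
png_bytes = buffer.getvalue()
```

In an interactive session you would typically skip the backend selection and call plt.show() to open the plot window.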


Gensim

Gensim, programmed by Radim Řehůřek, is an open source package for analyzing large textual collections using parallel, distributable online algorithms. Among its advanced functionalities, it implements Latent Semantic Analysis (LSA), topic modeling by Latent Dirichlet Allocation (LDA), and Google's word2vec, a powerful algorithm that transforms texts into vector features to be used in supervised and unsupervised machine learning.


H2O

H2O is an open source framework for big data analysis created by the start-up (previously named 0xdata). It can be used from the R, Python, Scala, and Java programming languages. H2O runs on a standalone machine (leveraging multiprocessing) or on a Hadoop cluster (for example, a cluster in an AWS environment), thus helping you scale up and out.

To install the package, first download and install Java on your system (H2O is Java-based, so you need the Java Development Kit (JDK) 1.8). Then you can refer to the online instructions provided at

The following lines summarize all the installation steps.

You can install both H2O and its Python API, as used in this book, with the following instructions:

$ pip install -U requests
$ pip install -U tabulate
$ pip install -U future
$ pip install -U six

These steps will install the required packages, and then we can install the framework, taking care to remove any previous installation:

$ pip uninstall h2o
$ pip install h2o

To install the same version that we use in this book, replace the last pip install command with the following:

$ pip install

If you run into problems, please visit the H2O Google groups page, where you can get help with your problems:!forum/h2ostream


XGBoost

XGBoost is a scalable, portable, and distributed gradient boosting library (a tree ensemble machine learning algorithm). It is available for Python, R, Java, Scala, Julia, and C++, and it can work on a single machine (leveraging multithreading) as well as in Hadoop and Spark clusters.

Detailed instructions to install XGBoost on your system can be found at

Installing XGBoost on Linux and Mac OS is quite straightforward, whereas it is a little trickier for Windows users. For this reason, we provide specific installation steps to get XGBoost working on Windows:

  1. First of all, download and install Git for Windows (

  2. Then you need a Minimalist GNU for Windows (MinGW) compiler present on your system. You can download it from according to the characteristics of your system.

  3. From the command line, execute the following:

    $ git clone --recursive
    $ cd xgboost
    $ git submodule init
    $ git submodule update
  4. Then, from the command line, copy the configuration for 64-bit systems to be the default one:

    $ copy make\

    Alternatively, you can copy the plain 32-bit version:

    $ copy make\
  5. After copying the configuration file, you can run the compiler, setting it to use four threads in order to speed up the compiling procedure:

    $ make -j4
  6. Finally, if the compiler completed its work without errors, you can install the package in your Python by executing the following commands:

    $ cd python-package
    $ python setup.py install


Theano

Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multidimensional arrays efficiently. Basically, it provides you with all the building blocks that you need to create deep neural networks.

The installation of Theano should be straightforward as it is now a package on PyPI:

$ pip install Theano

If you want the most recent version of the package, you can get it by cloning the GitHub repository:

$ git clone git://

Then you can proceed with the direct Python installation:

$ cd Theano
$ python setup.py install

To test your installation, you can run the following from the shell/CMD and verify the reports:

$ pip install nose
$ pip install nose-parameterized
$ nosetests theano

If you are working on a Windows OS and the previous instructions don't work, you can try these steps:

  1. Install TDM-GCC x64 (

  2. Open the Anaconda command prompt and execute the following:

    $ conda update conda
    $ conda update --all
    $ conda install mingw libpython
    $ pip install git+git://


Theano needs libpython, which isn't yet compatible with Python 3.5, so if your Windows installation is not working, that is a likely cause.

In addition, Theano's website provides some information to Windows users that could support you when everything else fails:

An important requirement for Theano to scale out on GPUs is the installation of the NVIDIA CUDA drivers and SDK, which handle code generation and execution on the GPU. If you are not familiar with the CUDA Toolkit, you can start from this web page to understand more about the technology being used:

If your computer has an NVIDIA GPU, you can find all the necessary instructions to install CUDA on this tutorial page from NVIDIA itself:


TensorFlow

Just like Theano, TensorFlow is an open source software library for numerical computation, but it uses data flow graphs instead of just arrays. Nodes in such a graph represent mathematical operations, whereas the graph edges represent the multidimensional data arrays (the so-called tensors) moved between the nodes. TensorFlow was originally developed by researchers on the Google Brain Team, who recently released it as open source.

For the installation of TensorFlow on your computer, follow the instructions found at the following link:

Windows support is not present at the moment but it is in the current roadmap:

For Windows users, a good compromise could be to run the package on a Linux-based virtual machine or Docker machine. (The preceding OS set-up page offers directions to do so.)

The sknn library

The sknn library (in full, scikit-neuralnetwork) is a wrapper for Pylearn2 that helps you implement deep neural networks without requiring you to become an expert on Theano. As a bonus, the library is compatible with the Scikit-learn API.

Optionally, if you want to take advantage of the most advanced features such as convolution, pooling, or upscaling, you have to complete the installation as follows:

$ pip install -r

After installation, you also have to execute the following:

$ git clone
$ cd scikit-neuralnetwork
$ python setup.py develop

As seen for XGBoost, this will make the sknn package available in your Python installation.


Theanets

The theanets package is a deep learning and neural network toolkit written in Python that uses Theano to accelerate computations. Just as with sknn, it tries to make it easier to interface with Theano functionalities in order to create deep learning models.

You can also download the current version from GitHub and install the package directly in Python:

$ git clone
$ cd theanets
$ python setup.py develop


Keras

Keras is a minimalist, highly modular neural networks library written in Python and capable of running on top of either TensorFlow or Theano.

  • Website:

  • Version at the time of writing: 1.0.5

  • Suggested installation from PyPI:

    $ pip install keras

You can also install the latest available version (advisable as the package is in continuous development) using the following command:

$ pip install git+git://

Other useful packages to install on your system

To conclude this long tour of the many packages that you will see in action in this book, we close with three simple yet quite useful packages that need little introduction but must be installed on your system: memory profiler, climate, and NeuroLab.

Memory profiler is a package that monitors the memory usage of a process. It also helps dissect the memory consumption of a specific Python script, line by line. It can be installed as follows:

$ pip install -U memory_profiler

Climate just consists of some basic command-line utilities for Python. It can be promptly installed as follows:

$ pip install climate

Finally, NeuroLab is a very basic neural network package loosely based on the Neural Network Toolbox (NNT) in MATLAB. It is based on NumPy and SciPy, not Theano; consequently, do not expect astonishing performance, but know that it is a good learning toolbox. It can be easily installed as follows:

$ pip install neurolab
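Once the installations are complete, it can be handy to check which versions actually ended up on your system and compare them with the versions quoted in this section. The script below is a minimal sketch of such a check. It assumes Python 3.8 or later (for the standard library's importlib.metadata), and the distribution names in PACKAGES are our assumption of the names under which pip registers each package; installed_versions is a hypothetical helper, not part of any of these libraries.

```python
from importlib import metadata  # standard library, Python 3.8+

# Assumed pip distribution names for the packages presented in this section.
PACKAGES = [
    "numpy", "scipy", "pandas", "scikit-learn", "matplotlib", "gensim",
    "h2o", "xgboost", "Theano", "tensorflow", "keras",
    "memory_profiler", "climate", "neurolab",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Print one line per package, marking the ones that are not installed.
    for name, version in installed_versions(PACKAGES).items():
        print("{:<16} {}".format(name, version or "not installed"))
```

The check only reads package metadata and does not import the packages themselves, so it stays fast and does not fail when a package is installed but broken.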