Large Scale Machine Learning with Python

By: Bastiaan Sjardin, Alberto Boschetti

Overview of this book

Large Python machine learning projects involve new problems associated with specialized machine learning architectures and designs that many data scientists have yet to tackle. Yet finding algorithms and designing and building platforms that deal with large sets of data is a growing need. Data scientists have to manage and maintain increasingly complex data projects, and with the rise of big data comes an increasing demand for computational and algorithmic efficiency. Large Scale Machine Learning with Python uncovers a new wave of machine learning algorithms that meet scalability demands while retaining high predictive accuracy. Dive into scalable machine learning and the three forms of scalability. Speed up algorithms that can be used on a desktop computer with tips on parallelization and memory allocation. Get to grips with new algorithms that are specifically designed for large projects and can handle bigger files, and learn about machine learning in big data environments. We will also cover the most effective machine learning techniques on a MapReduce framework in Hadoop and Spark in Python.



"The nice thing about having a brain is that one can learn, that ignorance can be supplanted by knowledge, and that small bits of knowledge can gradually pile up into substantial heaps."

 --Douglas Hofstadter

Machine learning is often referred to as the part of artificial intelligence that actually works. Its aim is to find a function based on an existing set of data (the training set) in order to predict outcomes for a previously unseen dataset (the test set) as accurately as possible. The predicted outcome takes the form of either labels and classes (classification problems) or a continuous value (regression problems). Tangible examples of machine learning in real-life applications range from predicting future stock prices to classifying the gender of an author from a set of documents. Throughout this book, the most important machine learning concepts, together with methods suitable for larger datasets, will be made clear to the reader through practical examples in Python. We will look at supervised learning (classification and regression), as well as unsupervised learning methods (such as Principal Component Analysis (PCA), clustering, and topic modeling) that have been found to be applicable to larger datasets.

Large IT corporations such as Google, Facebook, and Uber have generated a lot of buzz by claiming that they successfully applied such machine learning methods at a large scale. With the onset and availability of big data, the demand for scalable machine learning solutions has grown exponentially, and many other companies and individuals have started aspiring to reap the fruits of hidden correlations in big datasets. Unfortunately, most learning algorithms don't scale well, straining CPUs and memory either on a desktop computer or on a larger computing cluster. Even though big data has passed the peak of its hype, scalable machine learning solutions are still not plentiful.

Frankly, we still need to work around a lot of bottlenecks even with datasets we would hardly categorize as big data (think of datasets of up to 2 GB or even smaller). The mission of this book is to provide methods (sometimes unconventional ones) to apply the most powerful open source machine learning methods at a larger scale, without the need for expensive enterprise solutions or large computing clusters. Throughout this book, we will use Python and some other readily available solutions that integrate well in scalable machine learning pipelines. Reading the book is a journey that will redefine what you knew about machine learning, setting you on the starting blocks of real big data analysis.

What this book covers

Chapter 1, First Steps to Scalability, puts the problem of scalable machine learning in the right perspective and familiarizes you with the tools that we will be using in this book.

Chapter 2, Scalable Learning in Scikit-learn, discusses strategies for stochastic gradient descent (SGD) that reduce memory consumption; it is based on the theme of out-of-core learning. We will also cover data preparation techniques that can handle a variety of data, such as the hashing trick.

Chapter 3, Fast-Learning SVMs, covers streaming algorithms that are capable of discovering non-linearity in the form of support vector machines. We will present alternatives to Scikit-learn, such as LIBLINEAR and Vowpal Wabbit, which, although operating as external shell commands, can be easily wrapped and directed by Python scripts.

Chapter 4, Neural Networks and Deep Learning, provides useful tactics for applying deep neural networks within the Theano framework together with large-scale applications with H2O. Even though deep learning is a hot topic, it can be quite a challenge to apply it successfully, let alone provide scalable solutions. We will also resort to unsupervised pre-training with autoencoders using the theanets package.

Chapter 5, Deep Learning with TensorFlow, covers interesting deep learning techniques together with an online method for neural networks. Although TensorFlow is only in its infancy, the framework provides elegant machine learning solutions. We will also utilize the convolutional neural network capabilities of Keras within the TensorFlow environment.

Chapter 6, Classification and Regression Trees at Scale, explains scalable solutions for random forest, gradient boosting, and XGBoost. CART, an acronym for classification and regression trees, is a machine learning method usually applied in the framework of ensemble methods. We will also provide examples of a large-scale application using H2O.

Chapter 7, Unsupervised Learning at Scale, dives into unsupervised learning, as we will cover PCA, cluster analysis, and topic modeling using the right approach for scaling them up.

Chapter 8, Distributed Environments – Hadoop and Spark, teaches us how to set up Spark within a virtual machine environment, shifting from a single machine to a computational network paradigm. As Python can easily glue and power up our efforts on a cluster of machines, it becomes a piece of cake to leverage the power of a Hadoop cluster.

Chapter 9, Practical Machine Learning with Spark, gets into action with Spark, teaching all the essentials you need to immediately start manipulating data and building predictive models on large datasets.

Appendix, Introduction to GPUs and Theano, will cover the basics of Theano and GPU-computation. It will help you install and prepare your environment for using Theano on the GPU, if your system allows it.
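
As a rough illustration of the out-of-core theme that Chapter 2 describes, the following sketch combines the hashing trick with stochastic gradient descent in plain Python. It is not the Scikit-learn implementation covered in the book: the function names, the toy data stream, the learning rate, and the use of zlib.crc32 as the hash function are all assumptions made for this example. The point is only that each example is hashed into a fixed-size feature space and used for a single weight update, so the full dataset never has to sit in memory.

```python
import math
import zlib

N_FEATURES = 2 ** 16  # size of the hashed feature space (an arbitrary choice)


def hash_features(tokens, n_features=N_FEATURES):
    """Map tokens to a sparse {index: count} dict using the hashing trick."""
    vector = {}
    for token in tokens:
        # crc32 is deterministic across runs, unlike the built-in hash() on strings
        index = zlib.crc32(token.encode("utf-8")) % n_features
        vector[index] = vector.get(index, 0.0) + 1.0
    return vector


def predict_proba(weights, vector):
    """Logistic prediction for one sparse example."""
    score = sum(weights[i] * v for i, v in vector.items())
    return 1.0 / (1.0 + math.exp(-score))


def sgd_update(weights, vector, label, learning_rate=0.1):
    """One stochastic gradient step on the log loss for a single example."""
    error = predict_proba(weights, vector) - label
    for i, v in vector.items():
        weights[i] -= learning_rate * error * v


def stream_examples():
    """Hypothetical data stream; in practice this would read a file line by line."""
    data = [
        ("cheap pills buy now", 1),
        ("meeting agenda for monday", 0),
        ("buy cheap watches now", 1),
        ("monday project meeting notes", 0),
    ]
    for text, label in data:
        yield text.split(), label


# Only the weight vector is kept in memory; examples are seen one at a time.
weights = [0.0] * N_FEATURES
for epoch in range(20):  # several passes over the stream
    for tokens, label in stream_examples():
        sgd_update(weights, hash_features(tokens), label)

spam_score = predict_proba(weights, hash_features("buy cheap pills".split()))
ham_score = predict_proba(weights, hash_features("project meeting".split()))
print(spam_score, ham_score)
```

Because the feature space has a fixed size, new tokens never require the model to grow; the trade-off is that two different tokens may occasionally collide in the same index.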

What you need for this book

The execution of the code examples provided in this book requires an installation of Python 2.7 or higher versions on macOS, Linux, or Microsoft Windows.

The examples throughout the book will make frequent use of Python's essential libraries, such as SciPy, NumPy, Scikit-learn, and StatsModels, and to a minor extent, matplotlib and pandas, for scientific and statistical computing. We will also make use of an out-of-core cloud computing application called H2O.

This book is highly dependent on Jupyter and its Notebooks powered by the Python kernel. We will use its most recent version, 4.1, for this book.

The first chapter will provide you with all the step-by-step instructions and some useful tips to set up your Python environment, these core libraries, and all the necessary tools.
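
Before working through the setup instructions, a quick check such as the following sketch can report which of the libraries mentioned above are already importable. The helper name installed_versions and the REQUIRED list are assumptions made for this illustration; the entries are the usual import names (Scikit-learn is imported as sklearn), and H2O and Jupyter are left out.

```python
import importlib
import sys

# Import names of the libraries listed above (Scikit-learn is imported as "sklearn").
REQUIRED = ["scipy", "numpy", "sklearn", "statsmodels", "matplotlib", "pandas"]


def installed_versions(module_names):
    """Return {module name: version string, or None if the import fails}."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # Any import failure is reported as "not installed"
            versions[name] = None
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


print("Python %d.%d.%d" % tuple(sys.version_info[:3]))
report = installed_versions(REQUIRED)
for name in REQUIRED:
    print("%-12s %s" % (name, report[name] or "not installed"))
```

Any library reported as not installed can then be added by following the instructions in the first chapter.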

Who this book is for

This book is suitable for aspiring and actual data science practitioners, developers, and everyone who intends to work with large and complex datasets. We strive to make this book as accessible as possible to a wider audience. Yet, considering that the topics in this book are quite advanced, it is recommended, but not strictly compulsory, that readers are familiar with basic machine learning concepts such as classification and regression, error-minimizing functions, and cross-validation.

We also assume some experience with Python, Jupyter Notebooks, and command-line execution, together with a reasonable level of mathematical knowledge to grasp the concepts behind the various large-scale solutions we propose. The text is written in a style that programmers of other languages (R, Java, and MATLAB) can follow. Ideally, it is highly suitable for (but not limited to) a data scientist who is familiar with machine learning and interested in leveraging Python over other languages such as R or MATLAB because of its computational, memory, and I/O capabilities.


In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "When inspecting the linear model, first check the coef_ attribute."

A block of code is set as follows:

from sklearn import datasets
iris = datasets.load_iris()

Since we will be using Jupyter Notebooks for most of the examples, expect to always have an input (marked as In:) and often an output (marked as Out:) from the cell containing the block of code. On your computer, you just have to enter the code after In: and check whether the results correspond to the Out: content:

In:, y)
Out: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0, kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)

When a command should be given in the terminal command line, you'll find it with the prefix $>. If it is meant for the Python REPL instead, it will be preceded by >>>:

>>> import sys
>>> print(sys.version_info)

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "As a rule, you just have to type the code after In: in your cells and run it."


Warnings or important notes appear in a box like this.


Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail us and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account. If you purchased this book elsewhere, you can register to have the files e-mailed directly to you.

You can download the code files by following these steps:

  1. Log in or register to our website using your e-mail address and password.

  2. Hover the mouse pointer on the SUPPORT tab at the top.

  3. Click on Code Downloads & Errata.

  4. Enter the name of the book in the Search box.

  5. Select the book for which you're looking to download the code files.

  6. From the drop-down menu, choose where you purchased this book.

  7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged into your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows

  • Zipeg / iZip / UnRarX for Mac

  • 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub. We also have other code bundles from our rich catalog of books and videos available. Check them out!


On GitHub, you will also find Vowpal Wabbit executables for Windows.

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from


Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, enter the name of the book in the search field. The required information will appear under the Errata section.


Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.


If you have a problem with any aspect of this book, you can contact us, and we will do our best to address the problem.