Book Image

Large Scale Machine Learning with Python

By : Bastiaan Sjardin, Alberto Boschetti
Book Image

Large Scale Machine Learning with Python

By: Bastiaan Sjardin, Alberto Boschetti

Overview of this book

Large Python machine learning projects involve new problems associated with specialized machine learning architectures and designs that many data scientists have yet to tackle. But finding algorithms and designing and building platforms that deal with large sets of data is a growing need. Data scientists have to manage and maintain increasingly complex data projects, and with the rise of big data comes an increasing demand for computational and algorithmic efficiency. Large Scale Machine Learning with Python uncovers a new wave of machine learning algorithms that meet scalability demands together with a high predictive accuracy. Dive into scalable machine learning and the three forms of scalability. Speed up algorithms that can be used on a desktop computer with tips on parallelization and memory allocation. Get to grips with new algorithms that are specifically designed for large projects and can handle bigger files, and learn about machine learning in big data environments. We will also cover the most effective machine learning techniques on a map reduce framework in Hadoop and Spark in Python.
Table of Contents (17 chapters)
Large Scale Machine Learning with Python
About the Authors
About the Reviewer

Chapter 9. Practical Machine Learning with Spark

In the previous chapter, we saw the main functionalities of data processing with Spark. In this chapter, we will focus on data science with Spark on a real data problem. During the chapter, you will learn the following topics:

  • How to share variables across a cluster's nodes

  • How to create DataFrames from structured (CSV) and semi-structured (JSON) files, save them on disk, and load them

  • How to use SQL-like syntax to select, filter, join, group, and aggregate datasets, thus making the preprocessing extremely easy

  • How to handle missing data in the dataset

  • Which algorithms are available out of the box in Spark for feature engineering and how to use them in a real case scenario

  • Which learners are available and how to measure their performance in a distributed environment

  • How to run cross-validation for hyperparameter optimization in a cluster