This chapter introduces Apache Spark from a machine learning (ML) and data analytics perspective, and discusses how machine learning relates to Spark computing. We first present an overview of Apache Spark and its advantages for data analytics compared with MapReduce and other computing platforms. We then discuss five main topics:
Machine learning algorithms and libraries
Spark RDD and dataframes
Machine learning frameworks
Spark pipelines
Spark notebooks
These are topics that any data scientist or machine learning professional is expected to master in order to take full advantage of Apache Spark computing. Together with the overview, they make up the six topics this chapter covers:
Spark overview and Spark advantages
ML algorithms and ML libraries for Spark
Spark RDD and dataframes
ML frameworks, RM4Es and Spark computing
ML workflows and Spark pipelines
Spark notebooks introduction