Apache Spark 2.x Cookbook

By: Rishi Yadav

Overview of this book

While Apache Spark 1.x gained a lot of traction and adoption in its early years, Spark 2.x delivers notable improvements in API design, schema awareness, performance, and Structured Streaming, and it simplifies the building blocks for creating better, faster, smarter, and more accessible big data applications. This book covers all these features in the form of structured recipes to analyze and mature large and complex sets of data. Starting with installing and configuring Apache Spark with various cluster managers, you will learn to set up development environments. Further on, you will be introduced to working with RDDs, DataFrames, and Datasets to operate on schema-aware data, and to real-time streaming with sources such as Twitter Stream and Apache Kafka. You will also work through recipes on machine learning, including supervised learning, unsupervised learning, and recommendation engines in Spark. Last but not least, the final few chapters delve deeper into graph processing using GraphX, securing your implementations, cluster optimization, and troubleshooting.
Table of Contents (19 chapters)
Title Page
Credits
About the Author
About the Reviewer
www.PacktPub.com
Customer Feedback
Preface

Dimensionality reduction with singular value decomposition


Often, the original dimensions do not represent the data in the best way possible. As we saw with PCA, you can sometimes project data onto fewer dimensions and still retain most of the useful information.

Sometimes, the best approach is to align the dimensions along the directions in which the data exhibits the most variation. This approach helps eliminate dimensions that are not representative of the data.
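For reference, the idea of realigning dimensions can be written with the standard definition of the singular value decomposition. The symbols below ($A$, $U$, $\Sigma$, $V$, $m$, $n$, $k$) are notation chosen here for illustration and do not come from the original text; treat this as a sketch of the textbook formulation, not as the book's own derivation.

```latex
% A is an m x n data matrix (m rows/observations, n columns/features).
A = U \Sigma V^{\top},
\qquad
U \in \mathbb{R}^{m \times m},\;
\Sigma \in \mathbb{R}^{m \times n},\;
V \in \mathbb{R}^{n \times n}

% U and V are orthogonal; Sigma is diagonal with non-negative entries
% sorted in decreasing order:
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_{\min(m,n)} \ge 0

% Keeping only the first k singular values and vectors gives a rank-k
% approximation, and A V_k is the data expressed in k dimensions:
A_k = U_k \Sigma_k V_k^{\top},
\qquad
A V_k = U_k \Sigma_k \in \mathbb{R}^{m \times k}
```

The columns of $V$ play the role of the realigned dimensions, and the size of each $\sigma_i$ indicates how much of the data's variation lies along the corresponding direction (exactly so when the columns of $A$ have been mean-centered).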

Let's look at the following figure again, which shows the best-fitting line in two dimensions:

The projection line shows the best approximation of the original data in one dimension. If we take the points where the gray lines intersect the black line and isolate them, we get a reduced representation of the original data that retains as much variation as possible, as shown in the following figure:
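The projection described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the book's Spark code: the function names `principal_direction` and `project` and the sample points are hypothetical, and the direction of maximum variance is found from the 2x2 covariance matrix using the closed-form angle of its leading eigenvector (for mean-centered data, this direction coincides with the first right singular vector).

```python
import math


def principal_direction(points):
    """Return (mean, unit vector of maximum variance) for 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Entries of the symmetric 2x2 covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Angle of the leading eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))


def project(points, mean, direction):
    """Return the 1-D coordinate of each point along the given direction."""
    (mx, my), (dx, dy) = mean, direction
    return [(x - mx) * dx + (y - my) * dy for x, y in points]


# Made-up sample points lying exactly on the line y = x (an assumption
# for illustration only).
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
mean, direction = principal_direction(points)

# One number per point: the reduced, one-dimensional representation.
reduced = project(points, mean, direction)

# Coordinates along the perpendicular direction: the variation that the
# one-dimensional representation leaves out.
perpendicular = (-direction[1], direction[0])
residual = project(points, mean, perpendicular)
```

Because the sample points are perfectly collinear, everything in `residual` is zero up to floating-point rounding, so the single coordinate in `reduced` loses no information. With noisy data, the residual values would be small but nonzero.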

 

Let's draw a line perpendicular to the first projection line, as shown in the following figure:

This line captures as much variation...