Spark Cookbook

By: Rishi Yadav

Dimensionality reduction with principal component analysis


Dimensionality reduction is the process of reducing the number of dimensions, or features, in a dataset. Real-world data often has a very large number of features; thousands are not uncommon. The goal is to narrow these down to the features that matter.

Dimensionality reduction serves several purposes, such as:

  • Data compression

  • Visualization

Reducing the number of dimensions shrinks both the disk footprint and the memory footprint of the data. It also helps algorithms run much faster, and it allows highly correlated dimensions to be collapsed into one.

Humans can visualize at most three dimensions, but data can have far more. Visualization can reveal hidden patterns in the data, and dimensionality reduction makes it possible by compacting multiple features into a few that can be plotted.

The most popular algorithm for dimensionality reduction is principal component analysis (PCA).
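To make the idea concrete, here is a minimal sketch of PCA that reduces two features to one. It is plain Python using only the standard library, not the Spark API, and the function name `pca_2d_to_1d` and the sample points are made up for illustration. It assumes at least two points that are not all identical.

```python
import math


def pca_2d_to_1d(points):
    """Project 2-D points onto their first principal component.

    Hypothetical helper for illustration only; it handles exactly two
    features so the eigen-decomposition can be done in closed form.
    Returns (direction, projected_values, explained_variance_ratio).
    """
    n = len(points)

    # Step 1: center the data so each feature has zero mean.
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Step 2: sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(cx * cx for cx, _ in centered) / (n - 1)
    syy = sum(cy * cy for _, cy in centered) / (n - 1)
    sxy = sum(cx * cy for cx, cy in centered) / (n - 1)

    # Step 3: largest eigenvalue of the symmetric 2x2 matrix.
    trace = sxx + syy
    det = sxx * syy - sxy * sxy
    lam = trace / 2 + math.sqrt(max(trace * trace / 4 - det, 0.0))

    # Step 4: matching eigenvector, i.e. the direction of maximum variance.
    if abs(sxy) > 1e-12:
        vx, vy = lam - syy, sxy
    elif sxx >= syy:
        vx, vy = 1.0, 0.0  # features uncorrelated; x varies more
    else:
        vx, vy = 0.0, 1.0  # features uncorrelated; y varies more
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm

    # Step 5: each 2-D point becomes a single coordinate along that direction.
    projected = [cx * vx + cy * vy for cx, cy in centered]
    explained = lam / trace if trace > 0 else 0.0
    return (vx, vy), projected, explained


if __name__ == "__main__":
    # Made-up, perfectly correlated sample: the second feature is twice the first.
    sample = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
    direction, projected, explained = pca_2d_to_1d(sample)
    print(direction, projected, explained)
```

Because the two features in the sample are perfectly correlated, the single retained component points along the line y = 2x and keeps all of the variance (an explained-variance ratio of 1, up to floating-point rounding). With noisier data the ratio falls below 1, which indicates how much information the reduction discards.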

Let's look at the following dataset:

Let's say the goal is to...