Book Image

Apache Spark 2.x Cookbook

By: Rishi Yadav

Overview of this book

While Apache Spark 1.x gained a lot of traction and adoption in the early years, Spark 2.x delivers notable improvements in the areas of API, schema awareness, performance, and Structured Streaming, and simplifies the building blocks for faster, smarter, and more accessible big data applications. This book uncovers all these features in the form of structured recipes to analyze and mature large and complex sets of data. Starting with installing and configuring Apache Spark with various cluster managers, you will learn to set up development environments. Further on, you will be introduced to working with RDDs, DataFrames, and Datasets to operate on schema-aware data, and to real-time streaming with various sources such as the Twitter stream and Apache Kafka. You will also work through recipes on machine learning, including supervised learning, unsupervised learning, and recommendation engines in Spark. Last but not least, the final few chapters delve deeper into the concepts of graph processing using GraphX, securing your implementations, cluster optimization, and troubleshooting.
Table of Contents (19 chapters)

Using PageRank


PageRank measures the relative importance of each vertex in a graph. It was developed by Google's founders, based on the theory that the most important pages on the Internet are those with the most links leading to them. PageRank also weighs the importance of the pages linking to the target page: if a given web page has incoming links from higher-ranked pages, it will itself be ranked higher.
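This intuition corresponds to the classic PageRank update rule, sketched here for reference (d is the damping factor, commonly set to 0.85; N is the total number of vertices; note that GraphX's implementation normalizes ranks slightly differently):

$$
PR(v) = \frac{1-d}{N} + d \sum_{u \to v} \frac{PR(u)}{L(u)}
$$

where the sum runs over all pages u that link to v, and L(u) is the number of outgoing links on page u. A page thus inherits a share of the rank of every page pointing to it, which is why links from high-rank pages count for more.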

Getting ready

We are going to use Wikipedia's page link data to calculate the page rank. Wikipedia publishes its data in the form of a database dump; the link data we need comes in two files:

  • links-simple-sorted.txt
  • titles-sorted.txt

Note

I have put both of them on Amazon S3 at s3a://com.infoobjects.wiki/links and s3a://com.infoobjects.wiki/nodes. Since the data size is fairly large, it is recommended that you run this recipe on either Databricks Cloud or EMR.

How to do it...

  1. Import the GraphX-related classes:
scala> import org.apache.spark.graphx._
  2. Load the edges from Amazon S3:
scala> ...
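The recipe is truncated here, but the remaining steps presumably build the graph and invoke GraphX's built-in PageRank. The following is a minimal sketch, assuming the edge file at s3a://com.infoobjects.wiki/links can be read in a GraphLoader-compatible "srcId dstId" edge-list format (the raw links-simple-sorted.txt dump may need reformatting first), and using an illustrative convergence tolerance of 0.001:

```scala
scala> import org.apache.spark.graphx._

// Build the graph from the edge-list file on S3
scala> val graph = GraphLoader.edgeListFile(sc, "s3a://com.infoobjects.wiki/links")

// Run PageRank iteratively until ranks change by less than the tolerance
scala> val ranks = graph.pageRank(0.001).vertices

// Inspect the five highest-ranked vertices (page IDs paired with their rank)
scala> ranks.sortBy(_._2, ascending = false).take(5).foreach(println)
```

The vertex IDs printed here are line numbers into titles-sorted.txt, so a join against that file would be needed to recover human-readable page titles.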