Spark Cookbook

By: Rishi Yadav

Using PageRank


PageRank measures the importance of each vertex in a graph. It was developed by Google's founders, who started from the idea that the most important pages on the Internet are the ones with the most links leading to them. PageRank also takes into account the importance of each page that links to the target page. So, if a given web page has incoming links from higher-ranked pages, it will be ranked higher.
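
To make the idea concrete, here is a minimal sketch of PageRank as a power iteration in plain Python. It is not Spark's implementation; the function name pagerank, the damping factor of 0.85, the iteration count, and the toy graph are all assumptions chosen for illustration. In the toy graph, pages A and E each have exactly one incoming link, but A's link comes from the popular page C, so A should end up ranked higher than E.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Illustrative PageRank by power iteration (hypothetical helper).

    links maps each page to the list of pages it links to.
    Returns a dict of page -> rank; the ranks sum to 1.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with a uniform rank for every page.
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among its outgoing links.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * ranks[page] / n
                for target in pages:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Toy graph (made up for this example): C is linked from A, D, and E.
toy_links = {"A": ["C"], "B": ["E"], "C": ["A"], "D": ["C"], "E": ["C"]}
ranks = pagerank(toy_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

The comparison between A and E is the point of the example: the number of incoming links is the same, and only the rank of the linking page differs.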

Getting ready

We are going to use Wikipedia page link data to calculate PageRank. Wikipedia publishes its data in the form of a database dump. We are going to use link data from http://haselgrove.id.au/wikipedia.htm, which provides the data in two files:

  • links-simple-sorted.txt

  • titles-sorted.txt

I have put both of them on Amazon S3 at s3n://com.infoobjects.wiki/links and s3n://com.infoobjects.wiki/nodes. Since the dataset is large, it is recommended that you run it on either Amazon EC2 or your local cluster. The sandbox may be very slow.
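
As a rough sketch of how the two files fit together, the following plain Python snippet parses a few made-up sample lines. It assumes that each line of links-simple-sorted.txt has the form "from: to1 to2 ..." with numeric page IDs, and that the line number in titles-sorted.txt (starting at 1) is the page ID; check these assumptions against the files you download. The helper name parse_link_line and the sample data are hypothetical.

```python
def parse_link_line(line):
    """Turn one 'from: to1 to2 ...' line into (source, target) pairs.

    The line format is an assumption about links-simple-sorted.txt.
    """
    source, _, targets = line.partition(":")
    return [(int(source), int(target)) for target in targets.split()]


# Made-up sample lines standing in for the real files.
sample_links = ["1: 2 3", "2: 3", "3:"]
sample_titles = ["Page_One", "Page_Two", "Page_Three"]

# Flatten all lines into a single edge list.
edges = [edge for line in sample_links for edge in parse_link_line(line)]

# Assumed convention: the 1-based line number is the page ID.
titles = {page_id: title for page_id, title in enumerate(sample_titles, start=1)}

for source, target in edges:
    print(titles[source], "->", titles[target])
```

The edge list corresponds to the links file and the ID-to-title mapping to the titles file, which is how vertex IDs can later be turned back into readable page names.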

You can load the files into HDFS using the following commands:

$ hdfs...