PageRank measures the importance of each vertex in a graph. It was developed by Google's founders, who worked from the idea that the most important pages on the Internet are the pages with the most links leading to them. PageRank also takes into account the importance of each page that links to the target page. So, if a given web page has incoming links from higher-ranked pages, it will be ranked higher.
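To make the idea concrete, here is a minimal sketch of PageRank as a plain power iteration on a tiny made-up graph. It is not the GraphX implementation; the function name `pagerank`, the damping factor of 0.85, the iteration count, and the toy graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Illustrative power iteration; links maps a vertex to the vertices it links to."""
    # Collect every vertex, including ones that only appear as link targets.
    vertices = set(links)
    for targets in links.values():
        vertices.update(targets)
    n = len(vertices)

    # Start with rank spread evenly across all vertices.
    ranks = {v: 1.0 / n for v in vertices}

    for _ in range(iterations):
        # Rank held by vertices without outgoing links is shared by everyone.
        dangling = sum(ranks[v] for v in vertices if not links.get(v))
        new_ranks = {
            v: (1.0 - damping) / n + damping * dangling / n for v in vertices
        }
        # Each vertex passes an equal share of its rank to every page it links to.
        for source, targets in links.items():
            if targets:
                share = damping * ranks[source] / len(targets)
                for target in targets:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical four-page graph: C receives links from A, B, and D.
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this toy graph, C ends up with the highest rank because it has the most incoming links, and A ranks above B because its only incoming link comes from the highly ranked C.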
We are going to use Wikipedia page link data to calculate PageRank. Wikipedia publishes its data in the form of a database dump. We are going to use link data from http://haselgrove.id.au/wikipedia.htm, which has the data in two files:
links-simple-sorted.txt
titles-sorted.txt
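As a rough sketch of how the link data could be read, the snippet below assumes that each line of links-simple-sorted.txt has the form `from: to1 to2 to3`, with numeric page IDs. That layout is an assumption not stated here, so check it against the downloaded files. The helper name `parse_links_line` and the sample lines are hypothetical.

```python
def parse_links_line(line):
    """Parse one 'from: to1 to2 ...' line into (from_id, [to_ids])."""
    # Everything before the first colon is the source page ID (assumed format).
    source, _, targets = line.partition(":")
    return int(source), [int(t) for t in targets.split()]


# Made-up sample lines in the assumed format, not real Wikipedia data.
sample_lines = ["1: 2 3", "2: 3", "3: 1"]
adjacency = dict(parse_links_line(line) for line in sample_lines)
print(adjacency)
```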
I have put both of them on Amazon S3 at s3n://com.infoobjects.wiki/links and s3n://com.infoobjects.wiki/nodes. Because the dataset is large, it is recommended that you run it on either Amazon EC2 or your local cluster. The sandbox may be very slow.
You can load the files into HDFS using the following commands:
$ hdfs...