Given an initial page rank vector (weights) and a transition matrix, the algorithm iteratively updates the page rank from the previous page rank and the transition matrix:
PageRank(n) = (1 - beta) * PageRank(n-1) * TransitionMatrix + beta * PageRank(n-1)
where beta is a teleporting factor that guards against dead ends (pages with no outgoing links, which would otherwise drain all page ranks to 0) and spider traps (link cycles that would let a single page dominate the rank). The iteration is implemented on Hadoop as two chained MapReduce jobs: unitMultiplication, which computes the cell-wise products of the transition matrix and the previous page rank, and unitSum, which sums those partial products into the new page rank.
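For intuition, here is a minimal in-memory sketch of the update step in plain Java (not part of this repo; the class name, the 3-page matrix, and all numbers are made up for illustration):

// IterationDemo.java - illustrates the formula above in memory;
// the actual computation in this project is distributed over the
// unitMultiplication and unitSum MapReduce jobs.
public class IterationDemo {

    // One step: pr(n) = (1 - beta) * pr(n-1) * M + beta * pr(n-1)
    static double[] update(double[] pr, double[][] m, double beta) {
        double[] next = new double[pr.length];
        for (int j = 0; j < pr.length; j++) {
            double sum = 0.0;
            for (int i = 0; i < pr.length; i++) {
                sum += pr[i] * m[i][j]; // row vector pr times column j of M
            }
            next[j] = (1 - beta) * sum + beta * pr[j]; // teleporting term
        }
        return next;
    }

    public static void main(String[] args) {
        // Hypothetical 3-page web: page 0 links to 1 and 2,
        // page 1 links to 2, page 2 links back to 0.
        double[][] m = {
                {0.0, 0.5, 0.5},
                {0.0, 0.0, 1.0},
                {1.0, 0.0, 0.0}
        };
        double[] pr = {1.0 / 3, 1.0 / 3, 1.0 / 3}; // uniform initial rank
        for (int iter = 0; iter < 40; iter++) {
            pr = update(pr, m, 0.2); // 40 iterations, beta = 0.2, as in the Run step below
        }
        System.out.printf("%.4f %.4f %.4f%n", pr[0], pr[1], pr[2]);
    }
}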
- Create a directory on HDFS for the transition matrix
hdfs dfs -mkdir /transition
- Put the transition matrix (transition.txt) into the "transition" directory
hdfs dfs -put ./transition/transition.txt /transition
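Each line of transition.txt maps a page to its outgoing links. The exact delimiter depends on how the file was generated; the layout below (tab-separated page and comma-separated targets, hypothetical values) is an assumption:

1	2,8,9
2	3,4

Here page 1 links to pages 2, 8, and 9.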
- Create a directory on HDFS for the initial page rank
hdfs dfs -mkdir /pagerank0
- Put the initial page rank (pr.txt) into the "pagerank0" directory
hdfs dfs -put ./pr0/pr.txt /pagerank0
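pr.txt presumably lists one page per line with its initial rank, e.g. the same starting weight for every page (again an assumed, hypothetical layout):

1	1
2	1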
- Compile
hadoop com.sun.tools.javac.Main *.java
- Pack the classes into a jar
jar cf pr.jar *.class
- Run
hadoop jar pr.jar Driver /transition /pagerank /output 40 0.2
//args[0]: HDFS directory containing transition.txt
//args[1]: prefix of the page rank directories; the iteration number is appended, so the initial ranks come from /pagerank0
//args[2]: output directory of the first MapReduce job (unitMultiplication)
//args[3]: number of iterations (40 here)
//args[4]: beta, the teleporting factor (0.2 here)
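For orientation, below is a sketch of how a Driver like this typically chains the two jobs per iteration. The class name DriverSketch, the "beta" configuration key, and the path bookkeeping (iteration number appended to the page rank and output prefixes) are assumptions; the repo's actual Driver, mapper, and reducer wiring may differ, and the mapper/reducer setup is elided here:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DriverSketch {
    public static void main(String[] args) throws Exception {
        String transition = args[0];   // e.g. /transition
        String prPrefix   = args[1];   // e.g. /pagerank -> /pagerank0, /pagerank1, ...
        String unitPrefix = args[2];   // e.g. /output   -> /output0, /output1, ...
        int iterations    = Integer.parseInt(args[3]); // e.g. 40
        float beta        = Float.parseFloat(args[4]); // e.g. 0.2

        for (int i = 0; i < iterations; i++) {
            // Job 1: unitMultiplication - cell-wise products of the
            // transition matrix and the current page rank vector.
            Configuration conf1 = new Configuration();
            conf1.setFloat("beta", beta); // assumed configuration key
            Job multiplication = Job.getInstance(conf1, "unitMultiplication");
            // Mapper/reducer classes elided; a job with two inputs like this
            // would typically use MultipleInputs with one mapper per input.
            FileInputFormat.addInputPath(multiplication, new Path(transition));
            FileInputFormat.addInputPath(multiplication, new Path(prPrefix + i));
            FileOutputFormat.setOutputPath(multiplication, new Path(unitPrefix + i));
            multiplication.waitForCompletion(true);

            // Job 2: unitSum - sum the partial products into the next
            // page rank vector, written under the next pagerank directory.
            Configuration conf2 = new Configuration();
            conf2.setFloat("beta", beta);
            Job sum = Job.getInstance(conf2, "unitSum");
            FileInputFormat.addInputPath(sum, new Path(unitPrefix + i));
            FileOutputFormat.setOutputPath(sum, new Path(prPrefix + (i + 1)));
            sum.waitForCompletion(true);
        }
    }
}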
Source of test data: https://www.limfinity.com/ir/