GitHub - zonagit/MapReduceAndHadoop: Map Reduce Matrix-Matrix multiplication and PageRank in Hadoop

zonagit / MapReduceAndHadoop Public

Notifications You must be signed in to change notification settings
Fork 1
Star 3

Map Reduce Matrix-Matrix multiplication and PageRank in Hadoop

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
.settings		.settings
lib		lib
src		src
webgraphdata		webgraphdata
.classpath		.classpath
.gitignore		.gitignore
.project		.project
readme		readme

Repository files navigation

There are 2 packages in this project:

matrixmultiply
pagerank

a) In MatrixMultiply there are 3 classes with a main method 

To run the different mains from eclipse create a run/debug
configuration for each then enter arguments for command line.

or use the jar (to create a jar file from this project do from within 
Eclipse

Right Click the project->Export Java Jar file->Enter location
and name of jar)

To run one of the mains in the jar in a machine with hadoop 
installed type

./bin/hadoop jar jarname.jar matrixmultiply.ConversionMain true input_filename output_filename false path

or 
 
./bin/hadoop jar jarname.jar matrixmultiply.OnePassMatrixMultiplyMain 2 3 2 true optional_input_path optional_output_path 

or

./bin/hadoop jar jarname.jar matrixmultiply.TwoPassMatrixMultiplyMain 2 3 2 true optional_input_path optional_output_path optional_temp_path

(temp_path is the location where the output of step1 will be written to)

a.1) ConversionMain.java

This converts a file from binary to Hadoop SequenceFile or the other way
The command line arguments are

tobinary input_filename output_filename debug path

so to convert a given input_filename from binary to SequenceFile do

false input_filename output_filename false path_where_to_put_output_filename

to convert a given input_filename from SequenceFile to binary do

true input_filename output_filename false path_where_to_put_output_filename

if the path (last argument) is omitted the input and output files will be
read/written to a folder specified in the constant INPUT_PATH_FOLDER defined
at the top of ConversionMain.java
  
Note that the binary format consists of a first line with numrows numcols
then each line after that is one entry of the matrix in the order 
0 0
0 1
...
0 numcols-1
1 0
1 1
...
1 numcols-1
...
 
a.2) The arguments for OnePassMatrixMultiplyMain.java
are the dimensions
of the 2 matrices M (IxK) and N (KxJ), the number of groups and whether 
debugging should be enabled or not (if debug is on it will print out 
information along the different stages of the map reduce 
pipeline, including at the end if the map reduce matrix multiplication matches
regular matrix multiplication)
 
The number of groups should be a multiple of the size of both
matrices. So for 6x4x2 matrices use 2 groups, for
100x 100x1000 matrices use 5,10,20,25,50,100 (100 means that each row in 
the first matrix is in its own group and there are 100 groups of 10 columns each
in the second matrix).
  
 For 50x50x50 matrices (square matrices) note that choosing a grouping of
 1 means all rows/columns are in one group.

So the arguments for 2 groups would be

6 6 6 2 false optional_input_path optional_output_path

If not using grouping use 0 or -1 for the group argument
So enter say for the product of a 2x3 times by 3x2 matrix with 
no grouping

2 3 2 -1 false optional_input_path optional_output_path

a.2) No grouping was implemented for the 2 pass algorithm so the
arguments for TwoPassMatrixMultiplyMain.java are just the dimensions
of the 2 matrices

2 3 2 false optional_input_path optional_temp_path optional_output_path

for the product of a matrix 2x3 by a matrix 3x2

In both the one pass and the 2 pass algorithms the matrices are dense
i.e. none of their entries is zero. Also in both cases those matrices are
written to 2 different files called M and N in the input folder which are 
read by the mapper/first mapper.

b) In the package pagerank there are two mains IdsToLabelsMain.java and
 PageRankMain.java.
 
b.1) IdsToLabelsMain.java accepts as arguments

basenamelocation nodeid1 nodeid2 ... nodeidn

and then reads basename.fcl and outputs the labels for the requested nodeids

Here it is how I ran it from the command line (the fastutil, dsiutils and jsap jars
should be at the same location as this jar MapReduceAndHadoop.jar and
webbase-2001.fcl should be at the same location as well; all these jars are under the lib
folder of the git commit)

/mnt/scratch/u0082100/pagerank$ java -classpath "fastutil-6.5.15.jar:dsiutils-2.2.2.jar:jsap-2.1.jar:MapReduceAndHadoop.jar" pagerank.IdsToLabelsMain webbase-2001 93799279 38758143 69428479 112916492 111078997 4224907 30031331 20898292 112074159 30033865

b.2) PageRankMain.java

Run it with 3 arguments: the folder and basename containing the 
.properties and .graph files for that basename; the output folder and
the sort folder.

The output folder contains subfolders one for each iteration in pagerank.
The sort folder reads the last such subfolder, sorts by page rank
and outputs the result (in increasing page rank so the last line of the
file corresponds to the largest page rank node) to the sort folder.

For example to run page rank on the enron dataset from the jar I ran
(from /mnt/scratch/u0082100/)

/usr/local/hadoop-2.5.0/bin/hadoop jar MapReduceAndHadoop.jar  pagerank.PageRankMain pagerank/enron /user/u0082100/output /user/u0082100/sort

To find the top 30 page ranks I then ran

/usr/local/hadoop-2.5.0/bin/hdfs dfs -cat /user/u0082100/sort/part* |  tail  -n 30

b.3) Trust rank: Run PageRankMain.java with 4 arguments. The first 3 
arguments are similar to b.2) above: the location of the .properties,.graph files,
the output and sort dirs (use different output and sort dirs for
trust rank). The 4th argument is the path to a file 
containing the top N page ranks (as produced by b.2 above
namely the tail command). These will
form the trusted set.

I have one of these files under the folder webgraphdata, called enronTop300 
To build it I ran

/usr/local/hadoop-2.5.0/bin/hdfs dfs -cat /user/u0082100/sort/part* |  tail  -n 300

b.4) Spam mass: Run SpamMassMain.java with 3 arguments

1: the location of the last iteration of pagerank 
2: the location of the last iteration of trustpagerank
3: the location of the output of spam mass calculation
Some examples of command line cmds  I ran in the apt cluster

One pass
-----------
/usr/local/hadoop-2.5.0/bin/hadoop jar /home/u0082100/matrixmultiply.jar matrixmultiply.OnePassMatrixMultiplyMain 100 1000 1000 10 false /user/u0082100/input/ /user/u0082100/output/

Two Pass
---------
/usr/local/hadoop-2.5.0/bin/hadoop jar /home/u0082100/matrixmultiply.jar matrixmultiply.TwoPassMatrixMultiplyMain 100 100 1000 false /user/u0082100/input/ /user/u0082100/output/ /user/u0082100/temp

Kill a job
-----------
/usr/local/hadoop-2.5.0/bin/hadoop job -kill job_1412285424336_1065


Page Rank with taxation (running from /mnt/scratch/u0082100/)
-----------------------
/usr/local/hadoop-2.5.0/bin/hadoop jar MapReduceAndHadoop.jar  pagerank.PageRankMain pagerank/enron /user/u0082100/output /user/u0082100/sort

/usr/local/hadoop-2.5.0/bin/hadoop jar MapReduceAndHadoop.jar  pagerank.PageRankMain pagerank/dblp-2011-hc /user/u0082100/output /user/u0082100/sort

/usr/local/hadoop-2.5.0/bin/hadoop jar MapReduceAndHadoop.jar  pagerank.PageRankMain pagerank/hollywood-2011-hc /user/u0082100/output /user/u0082100/sort

/usr/local/hadoop-2.5.0/bin/hadoop jar MapReduceAndHadoop.jar  pagerank.PageRankMain pagerank/webbase-2001-hc /user/u0082100/output /user/u0082100/sort

Retrieve the top 30 pageranks
------------------------------
/usr/local/hadoop-2.5.0/bin/hdfs dfs -cat /user/u0082100/sort/part* |  tail  -n 30

Finally there is a jar file called MapReduceAndHadoop.jar that can be run
as described above.