Spark-DS-example

Illustration of how to use Apache Spark to analyse raw data, for MLDS meetup GBG 2015

The example is an attempt to do unsupervised graph clustering of Västtrafik stops using timetable pdfs converted to txt files (stored in timetables.tar.gz). The model is a power iteration graph clustering algorithm.

Running

Download apache spark binary suitable for your OS, and R if not installed either.

Mac OS X example:

brew install apache-spark

brew tap homebrew/science

brew install gcc

brew install Caskroom/cask/xquartz

brew install r

Install the required python packaged:

pip install -r requirements.txt

3a) Run the following code to execute the spark python script from a terminal:

spark-submit --driver-memory=6G py/readPdfs.py

The code automatically extracts the timetable files so no need to do it manually.

This might take a while, since it parses more than 5k files, parses them and trains a model.

3b)

Download spark binary from spark.apache.org (http://apache.mirrors.spacedump.net/spark/spark-1.5.2/spark-1.5.2-bin-hadoop2.6.tgz) and unpack


./sbin/start-master.sh
./sbin/start-slave.sh spark://yourhostname:7077

spark-submit --driver-memory=6G py/readPdfs.py spark://yourhostname:7077

This way you can access webui from yourhostname:8080 and watch progress of the app.

Install R libraries

R -f R/install_libs.R

Visualise the result and compare PIC with Louvain graph clustering done in R.

R -f R/plotgraph.R

Plots saved in plots/ directory.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Spark-DS-example

Running

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
R		R
plots		plots
py		py
README.md		README.md
requirements.txt		requirements.txt
timetables.tar.gz		timetables.tar.gz

jotsif/Spark-DS-example

Folders and files

Latest commit

History

Repository files navigation

Spark-DS-example

Running

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages