Skip to content

Illustration of how to use Apache Spark to analyse raw data, for MLSD meetup GBG 2015

Notifications You must be signed in to change notification settings

jotsif/Spark-DS-example

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Spark-DS-example

Illustration of how to use Apache Spark to analyse raw data, for MLDS meetup GBG 2015

The example is an attempt to do unsupervised graph clustering of Västtrafik stops using timetable pdfs converted to txt files (stored in timetables.tar.gz). The model is a power iteration graph clustering algorithm.

alt tag

Running

  1. Download apache spark binary suitable for your OS, and R if not installed either.

Mac OS X example:

brew install apache-spark

brew tap homebrew/science

brew install gcc

brew install Caskroom/cask/xquartz

brew install r

  1. Install the required python packaged:

pip install -r requirements.txt

3a) Run the following code to execute the spark python script from a terminal:

spark-submit --driver-memory=6G py/readPdfs.py

The code automatically extracts the timetable files so no need to do it manually.

This might take a while, since it parses more than 5k files, parses them and trains a model.

3b)

Download spark binary from spark.apache.org (http://apache.mirrors.spacedump.net/spark/spark-1.5.2/spark-1.5.2-bin-hadoop2.6.tgz) and unpack


./sbin/start-master.sh
./sbin/start-slave.sh spark://yourhostname:7077

spark-submit --driver-memory=6G py/readPdfs.py spark://yourhostname:7077

This way you can access webui from yourhostname:8080 and watch progress of the app.

  1. Install R libraries

R -f R/install_libs.R

  1. Visualise the result and compare PIC with Louvain graph clustering done in R.

R -f R/plotgraph.R

Plots saved in plots/ directory.

About

Illustration of how to use Apache Spark to analyse raw data, for MLSD meetup GBG 2015

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published