GitHub - TheClimateCorporation/lemur: Lemur is a tool to launch hadoop jobs locally or on EMR, based on a configuration file, referred to as a jobdef. The jobdef file describes your EMR cluster, local environment, pre- and post-actions and zero or more "steps".

Overview

Lemur is a tool to launch Hadoop jobs locally or on EMR, based on a configuration file referred to as a jobdef. The jobdef file describes your EMR cluster, local environment, pre- and post-actions (aka hooks) and zero or more "steps". A step is Amazon's name for a task or job submitted to the cluster. Lemur reads your jobdef; at the end of the jobdef, you call (fire! ...) to make things happen. Also keep in mind that the jobdef is an interpreted clj file, so you can insert arbitrary Clojure code to be executed anywhere in the file (but see HOOKS below for a better way).

Features

Launch EMR cluster and submit step(s); or run against local hadoop (usually hadoop standalone for dev and testing)
Basic configuration options include:
-- Bootstrap actions
-- Hadoop config
-- Uploads (files to transfer to S3, or local)
-- Cluster details (num instances, master instance type, etc)
-- Output paths to use for data, logs, main jar, etc.
-- Support for spot market instances
Profile support provides packages of options and functionality that can be enabled or disabled (e.g. you can have a :test profile or a :live profile)
Validation for your command line options and environment before launching EMR and your job
Override configured options via command line
Hooks for actions that should be triggered before or after job launch (e.g. one hook in use at Climate Corporation does a diff on the results of a local run, as a full integration test; another hook posts a detailed message to HipChat (previously IRC) when a new job is started)
Optionally wait for an EMR job to complete
A dry-run feature, so you can check the final cluster configuration, arguments that will be sent to your hadoop main, etc.
All the details from dry-run (cluster/step config, etc) are persisted with each job run
All settings can be literal values, interpolated strings (e.g. set the S3 bucket as "com.your-co.${env}.hadoop"), or functions for ultimate flexibility
Import common options, functionality and behavior to avoid duplication (i.e. DRY principle)
Pass-through command-line options, which let you specify extra args on the command line that are meaningful to your hadoop main function but are unknown to lemur or your jobdef
Submit a step to an already running jobflow

A Note About the Ruby elastic-mapreduce CLI tool

Lemur does not try to replace elastic-mapreduce. While there is some overlap, lemur is focused on launching. It provides no replacement for many common activities that elastic-mapreduce supports, for example "elastic-mapreduce --list". I recommend that you install elastic-mapreduce alongside lemur (or rely on the AWS Console for those activities).

Installation

Download the latest tar-gzip (.tgz) from http://download.climate.com/lemur/releases/lemur-1.4.6.tgz
Expand into some install location
Set LEMUR_HOME to the top of the install path
cd $LEMUR_HOME
lein jar # assuming you have Leiningen installed and on your PATH
Set LEMUR_EXTRA_CLASSPATH to any classpath entries (colon-separated) that you want lemur to include when it runs your jobdef, for example the entries that contain your base files, or other functions or libraries used by your jobdefs.
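
The two environment variables from the steps above could be set in a shell profile roughly as follows. This is only a sketch: the install location and the jobdef directories are placeholders, and join_classpath is an illustrative helper that is not part of lemur.

```shell
# Placeholder install location: wherever you expanded the lemur tarball.
export LEMUR_HOME="${LEMUR_HOME:-$HOME/lemur}"

# Illustrative helper (not part of lemur): join its arguments with colons.
# The body runs in a subshell so the IFS change does not leak to the caller.
join_classpath() (
  IFS=:
  printf '%s\n' "$*"
)

# Placeholder directories holding base files and shared jobdef libraries.
LEMUR_EXTRA_CLASSPATH="$(join_classpath "$HOME/jobdefs/base" "$HOME/jobdefs/lib")"
export LEMUR_EXTRA_CLASSPATH
```

Building the value from a list keeps each entry on its own argument, which is easier to maintain than one long colon-separated string.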

AWS Credentials

Lemur uses DefaultAWSCredentialsProviderChain to gather AWS credentials to access various AWS services.
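
DefaultAWSCredentialsProviderChain in the AWS SDK for Java looks in several places, including the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables and, in SDK versions that support it, the shared credentials file at ~/.aws/credentials. A quick pre-flight check for those two sources might look like the sketch below. The function name is hypothetical, and the check ignores the other sources in the chain (Java system properties, instance profiles), so a failure here does not necessarily mean lemur cannot find credentials.

```shell
# Hypothetical pre-flight check (not part of lemur).
# Returns 0 if credentials are visible through the environment variables or
# the shared credentials file; other sources in the chain are not checked.
have_aws_credentials() {
  if [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    return 0
  fi
  [ -f "${HOME}/.aws/credentials" ]
}
```

For example, `have_aws_credentials || echo "no credentials found in env or ~/.aws/credentials" >&2` could run before launching a job.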

Compatibility

v0.9.7 Clojure 1.2

v1.0.1+ Clojure 1.3

v1.4.0+ Clojure 1.5

I've used lemur on Mac OS X and Linux. It MAY work on Windows (if you use cygwin). If you try it on Windows, I would be interested in hearing about your experience (patches welcome).

Usage

The general command line format is:

bin/lemur <command> <jobdef-file> [options] [remaining]

bin/lemur help                    - display this help text
bin/lemur run ./jobdef.clj        - Run a job on EMR
bin/lemur dry-run ./jobdef.clj    - Dry-run, i.e. just print out what would be done
bin/lemur start ./jobdef.clj      - Start an EMR cluster, but don't run the steps (jobs)
bin/lemur local ./jobdef.clj      - Run the job using local hadoop (e.g. standalone mode)
bin/lemur submit ./jobdef.clj --jobflow j-123456789  - Submit steps to an existing jobflow (running cluster)

Examples

lemur run clj/wb-clj/scripts/launch/hrap-jobdef.clj --dataset ahps --num-days 10
lemur start clj/wb-clj/src/weatherbill/lemur/sample-jobdef.clj
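
Since dry-run prints what would be done without launching anything, one possible habit is to run it before every real launch. The wrapper below is a sketch of that idea using only the dry-run and run commands listed under Usage. The function name and the LEMUR_BIN variable are hypothetical, and it assumes that dry-run exits with a non-zero status when something is wrong, which this README does not state.

```shell
# Path to the lemur launcher; defaults to the layout described under Installation.
LEMUR_BIN="${LEMUR_BIN:-${LEMUR_HOME:-.}/bin/lemur}"

# dry_run_then_run JOBDEF [options...]
# Hypothetical wrapper: show the planned configuration first, then launch
# only if the dry-run succeeded.
# Assumption: dry-run returns a non-zero exit status on failure.
dry_run_then_run() {
  jobdef="$1"
  shift
  "$LEMUR_BIN" dry-run "$jobdef" "$@" || {
    echo "dry-run failed; not launching" >&2
    return 1
  }
  "$LEMUR_BIN" run "$jobdef" "$@"
}
```

With this in place, `dry_run_then_run ./jobdef.clj --dataset ahps --num-days 10` would pass the same options to both commands.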

Help

Check out the wiki on GitHub: https://github.com/TheClimateCorporation/lemur/wiki
Look at examples/sample-jobdef.clj for details on all options that you can use in your jobdef
Open Issues via GitHub
You can ask questions at the Google Group

Feedback and feature requests are welcome!
