PredictionIO classification engine for Heroku
A machine learning classifier deployable to Heroku with the PredictionIO buildpack.
Spark's Random Forests algorithm is used to predict a label using decision trees. See A Visual Introduction to Machine Learning to learn why decision trees are so effective.
Based on the attribute-based classifier template, modified to use an alternative algorithm. Originally this engine implemented Spark's Naive Bayes algorithm. We soon switched to Random Forests, which improves predictions by accounting for correlations between attributes. Naive Bayes treats attributes as independent, a well-known weakness. The Bayes algorithm is still available in the engine source.
This engine demonstrates prediction of the best fitting service plan for a mobile phone user based on their voice, data, and text usage. The model is trained with a small, example data set.
The service plans labelled in the included training data are:
- 0 Low Usage: no services significantly utilized
- 1 More Voice: expanded talk time to 1000 minutes
- 2 More Data: expanded transfer quota to 1000 megabytes
- 3 More Texts: expanded SMS to 1000 messages
- 4 Voice + Data: expanded talk time & transfer quota
- 5 Data + Text: expanded transfer quota & SMS
- 6 Voice + Text: expanded talk time & SMS
- 7 More Everything: all services used evenly
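When reading predictions it can help to translate a numeric label back into its plan name. The following is a minimal sketch, not part of the engine: `plan_name` is a hypothetical helper, and accepting a trailing `.0` is an assumption in case the label is returned as a floating-point number.

```shell
# Hypothetical helper: translate a numeric label into the plan name listed above.
plan_name() {
  # Assumption: tolerate "3.0" as well as "3" by stripping a trailing ".0".
  case "${1%.0}" in
    0) echo "Low Usage" ;;
    1) echo "More Voice" ;;
    2) echo "More Data" ;;
    3) echo "More Texts" ;;
    4) echo "Voice + Data" ;;
    5) echo "Data + Text" ;;
    6) echo "Voice + Text" ;;
    7) echo "More Everything" ;;
    *) echo "Unknown label: $1" >&2; return 1 ;;
  esac
}
```

For example, `plan_name 2` prints `More Data`, and an unrecognized label returns a non-zero status so scripts can detect it.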
✏️ Throughout this document, code terms that start with $ represent a value (shell variable) that should be replaced with a customized value, e.g. $EVENTSERVER_NAME, $ENGINE_NAME, $POSTGRES_ADDON_ID…
Please follow steps in order.
Once deployed, see the usage notes below for how to work with the engine.
- Heroku account
- Heroku CLI, command-line tools
- git
git clone \
https://github.com/heroku/predictionio-engine-classification.git \
pio-engine-classi
cd pio-engine-classi
heroku create $ENGINE_NAME
heroku buildpacks:set https://github.com/heroku/predictionio-buildpack.git
heroku addons:create heroku-postgresql:hobby-dev
heroku config:set \
PIO_EVENTSERVER_APP_NAME=classi \
PIO_EVENTSERVER_ACCESS_KEY=$RANDOM-$RANDOM-$RANDOM-$RANDOM
Initial training data is automatically imported from data/initial-events.json.
When you're ready to begin working with your own data, see data import methods in CUSTOM docs.
# Wait to deploy until the database is ready
heroku pg:wait
git push heroku master
# Follow the logs to see web process start-up
heroku logs -t
Once deployed, scale up the processes. These are paid, professional dyno types:
heroku ps:scale \
web=1:Standard-2X \
release=0:Performance-L \
train=0:Performance-L
When the release (pio train) fails due to memory constraints or another transient error, you may use the Heroku CLI releases:retry plugin to rerun the release without pushing a new deployment:
# First time, install it.
heroku plugins:install heroku-releases-retry
# Re-run the release & watch the logs
heroku releases:retry
heroku logs -t
Once deployment completes, the engine is ready to predict the best fitting service plan for a mobile phone user based on their voice, data, and text usage.
Submit queries containing these three user attributes to get predictions using Spark's Random Forests algorithm:
# Fits low usage, `0`
curl -X "POST" "https://$ENGINE_NAME.herokuapp.com/queries.json" \
-H "Content-Type: application/json; charset=utf-8" \
-d "{\"voice_usage\":12,\"data_usage\":0,\"text_usage\":4}"
# Fits more voice, `1`
curl -X "POST" "https://$ENGINE_NAME.herokuapp.com/queries.json" \
-H "Content-Type: application/json; charset=utf-8" \
-d "{\"voice_usage\":480,\"data_usage\":0,\"text_usage\":121}"
# Fits more data, `2`
curl -X "POST" "https://$ENGINE_NAME.herokuapp.com/queries.json" \
-H "Content-Type: application/json; charset=utf-8" \
-d "{\"voice_usage\":25,\"data_usage\":1000,\"text_usage\":80}"
# Fits more texts, `3`
curl -X "POST" "https://$ENGINE_NAME.herokuapp.com/queries.json" \
-H "Content-Type: application/json; charset=utf-8" \
-d "{\"voice_usage\":5,\"data_usage\":80,\"text_usage\":1000}"
# Extreme voice & data, `4`
curl -X "POST" "https://$ENGINE_NAME.herokuapp.com/queries.json" \
-H "Content-Type: application/json; charset=utf-8" \
-d "{\"voice_usage\":450,\"data_usage\":1104,\"text_usage\":43}"
# Extreme data & text, `5`
curl -X "POST" "https://$ENGINE_NAME.herokuapp.com/queries.json" \
-H "Content-Type: application/json; charset=utf-8" \
-d "{\"voice_usage\":24,\"data_usage\":770,\"text_usage\":482}"
# Extreme voice & text, `6`
curl -X "POST" "https://$ENGINE_NAME.herokuapp.com/queries.json" \
-H "Content-Type: application/json; charset=utf-8" \
-d "{\"voice_usage\":450,\"data_usage\":80,\"text_usage\":332}"
# Everything equal / balanced usage, `7`
curl -X "POST" "https://$ENGINE_NAME.herokuapp.com/queries.json" \
-H "Content-Type: application/json; charset=utf-8" \
-d "{\"voice_usage\":450,\"data_usage\":432,\"text_usage\":390}"
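The queries above differ only in their three usage numbers, so the request can be wrapped in a small function. This is a sketch under assumptions rather than anything shipped with the engine: `build_query` and `predict` are hypothetical names, and `predict` expects $ENGINE_NAME to be set in the shell as in the commands above.

```shell
# Hypothetical helper: build the JSON body for a query from three usage numbers
# (voice, data, text, in that order).
build_query() {
  printf '{"voice_usage":%d,"data_usage":%d,"text_usage":%d}' "$1" "$2" "$3"
}

# Hypothetical helper: POST a query to the deployed engine.
# Assumes $ENGINE_NAME holds the name of your Heroku app.
predict() {
  curl -sS -X POST "https://$ENGINE_NAME.herokuapp.com/queries.json" \
    -H "Content-Type: application/json; charset=utf-8" \
    -d "$(build_query "$1" "$2" "$3")"
}

# Usage, equivalent to the "more data" query:
# predict 25 1000 80
```

Building the body with single-quoted `printf` avoids the escaped double quotes in the raw `curl` commands, and `%d` rejects non-numeric input before anything is sent.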
For a production model, more aspects of a user account and their correlations might be taken into consideration, including: account type (individual, business, or family), frequency of roaming, international usage, device type (smart phone or feature phone), age of device, etc.
If you hit any snags with the engine serving queries, check the logs:
heroku logs -t --app $ENGINE_NAME
If errors are occurring, sometimes a restart will help:
heroku restart --app $ENGINE_NAME
If you want to customize an engine, then you'll need to get it running locally on your computer.
➡️ Set up local development
bin/pio app new classi
PIO_EVENTSERVER_APP_NAME=classi data/import-events -f data/initial-events.json
bin/pio build
bin/pio train
bin/pio deploy