ORTModule Examples

This example uses ORTModule to fine-tune several popular HuggingFace models.

1 Setup

  1. Clone this repo and initialize git submodule
git clone
cd onnxruntime-training-examples
git submodule update --init --recursive
git submodule foreach git pull origin main
  2. Make sure Python 3.8+ is installed

We recommend using conda to manage your Python environment. If you do not have conda installed, you can follow the instructions to install conda here. Once conda is installed, create a new Python environment with

conda create --name myenv python=3.8
  3. Install azureml-core

Activate the conda environment you just created.

conda activate myenv

Install the AzureML dependency needed for script submission.

pip install azureml-core
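
The setup steps above can be sanity-checked from Python before submitting anything. This is a minimal sketch, not part of the recipe: the helper name `check_environment` is hypothetical, and it only reports whether the interpreter is 3.8+ and whether the `azureml-core` distribution is visible in the active environment.

```python
import sys
from importlib import metadata


def check_environment(min_python=(3, 8), package="azureml-core"):
    """Return (python_ok, package_version_or_None) for the current interpreter.

    `check_environment` is a hypothetical helper, not part of the recipe.
    """
    python_ok = sys.version_info[:2] >= tuple(min_python)
    try:
        package_version = metadata.version(package)
    except metadata.PackageNotFoundError:
        package_version = None  # package is not installed in this environment
    return python_ok, package_version


if __name__ == "__main__":
    python_ok, azureml_version = check_environment()
    print(f"Python 3.8+: {python_ok}")
    print(f"azureml-core: {azureml_version or 'not installed'}")
```

If the second line reports "not installed", the conda environment is probably not activated or the pip install step was run in a different environment.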

2 Run on AzureML

2.1 Prerequisites

  1. An AzureML subscription is required to run this example. Provide the workspace either through a config.json file (How to get config.json file from Azure Portal) or by passing subscription_id, resource_group, and workspace_name as parameters.
  2. The subscription should have a GPU cluster. This example was tested on a GPU cluster with SKU Standard_ND40rs_v2. See this document for creating a GPU cluster.
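
The two ways of identifying the workspace can be sketched as follows. This is an illustration under assumptions, not the recipe's actual code: the helper names `workspace_source` and `load_workspace` are hypothetical, and giving explicit arguments precedence over config.json is an assumed policy. `Workspace.get` and `Workspace.from_config` are the azureml-core calls used; the import is deferred so the selection logic can be tried without the package installed.

```python
from pathlib import Path


def workspace_source(config_path="config.json", subscription_id=None,
                     resource_group=None, workspace_name=None):
    """Decide how to locate the workspace (hypothetical helper).

    Assumed policy: a complete set of explicit arguments wins,
    otherwise fall back to config.json.
    """
    explicit = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "name": workspace_name,
    }
    if all(explicit.values()):
        return "arguments", explicit
    if any(explicit.values()):
        missing = [key for key, value in explicit.items() if not value]
        raise ValueError(f"incomplete workspace arguments, missing: {missing}")
    if Path(config_path).is_file():
        return "config", {"path": str(config_path)}
    raise FileNotFoundError(
        f"no workspace arguments given and {config_path} not found")


def load_workspace(**kwargs):
    """Open the AzureML workspace; azureml-core is imported lazily."""
    from azureml.core import Workspace  # requires `pip install azureml-core`

    source, details = workspace_source(**kwargs)
    if source == "arguments":
        return Workspace.get(**details)
    return Workspace.from_config(path=details["path"])
```

Raising on a partial set of arguments is a design choice for the sketch: it surfaces a forgotten --resource_group early instead of silently falling back to a config.json that may point at a different workspace.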

2.2 Run this recipe

Download the config.json file from 2.1 to the huggingface/script directory. Alternatively, append the AzureML workspace information to the run command below, for example --workspace_name <your_workspace_name> --resource_group <resource_group> --subscription_id <your_subscription_id>.

Here's an example that runs bert-large with ORTModule. The script builds a Docker image from the Dockerfile and submits the run script to AzureML according to the model and run configuration. The default Docker image uses CUDA 11.1.

cd huggingface/script
python --gpu_cluster_name <gpu_cluster_name> --hf_model bert-large --run_config ort

To run other models or configurations, see the tables below.

This table summarizes whether model changes are required.

| Model | Performance Comparison | Model Change |
|---|---|---|
| bart-large | See BART | No model change required |
| bert-large | See BERT | No model change required |
| deberta-v2-xxlarge | See DeBERTa | See this commit |
| distilbert-base | See DistilBERT | No model change required |
| gpt2 | See GPT2 | No model change required |
| roberta-large | See RoBERTa | See this commit |
| t5-large | See T5 | See this PR |

Here are the configurations that the recipe script accepts through the --run_config parameter, with their descriptions.

| Config | Description |
|---|---|
| pt-fp16 | PyTorch mixed precision |
| ort | ORTModule mixed precision |
| ds_s1 | PyTorch + DeepSpeed stage 1 |
| ds_s1_ort | ORTModule + DeepSpeed stage 1 |

Other parameters are listed below. Please also see the parameters in script/

| Name | Description |
|---|---|
| --model_batchsize | Model batch size per GPU |
| --max_steps | Maximum number of steps that a model will run |
| --process_count | Total number of GPUs (not GPUs per node). Adjust this if the target cluster does not have 8 GPUs |
| --node_count | Node count |
| --skip_docker_build | Skip the Docker build (use the last built image saved in the AzureML environment) |
| --use_cu102 | Use the CUDA 10.2 Dockerfile |
| --local_run | Run the model locally; AzureML-related parameters will be ignored |
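
To make the relationship between these flags concrete, here is a hypothetical argparse mirror of the tables above. It is a sketch for experimentation, not the recipe script's real parser: the flag names come from the tables, the 8000-step and 8-GPU defaults follow the text of this README, and the remaining defaults (such as one node) are assumptions. The `gpus_per_node` helper illustrates that --process_count is the total across all nodes.

```python
import argparse

# Values accepted by --run_config, taken from the configuration table.
RUN_CONFIGS = ["pt-fp16", "ort", "ds_s1", "ds_s1_ort"]


def build_parser():
    """Hypothetical mirror of the recipe flags (not the real script's parser)."""
    parser = argparse.ArgumentParser(description="Sketch of the recipe flags")
    parser.add_argument("--hf_model", required=True)
    parser.add_argument("--run_config", choices=RUN_CONFIGS, required=True)
    parser.add_argument("--model_batchsize", type=int, default=None)
    parser.add_argument("--max_steps", type=int, default=8000)
    parser.add_argument("--process_count", type=int, default=8)
    parser.add_argument("--node_count", type=int, default=1)  # assumed default
    parser.add_argument("--skip_docker_build", action="store_true")
    parser.add_argument("--use_cu102", action="store_true")
    parser.add_argument("--local_run", action="store_true")
    return parser


def gpus_per_node(args):
    """Derive GPUs per node from the total GPU count and the node count."""
    if args.node_count < 1 or args.process_count % args.node_count != 0:
        raise ValueError("process_count must be a positive multiple of node_count")
    return args.process_count // args.node_count


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--hf_model", "bert-large", "--run_config", "ort",
         "--process_count", "16", "--node_count", "2"])
    print(gpus_per_node(args))  # 16 GPUs over 2 nodes
```

For example, two 8-GPU nodes would be described as --process_count 16 --node_count 2, not --process_count 8.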


  • Benchmark methodology: We report samples/sec on ND40rs_v2 VMs (V100 32G x 8) with CUDA 11 and the stable release onnxruntime_training-1.8.0+cu111-cp36-cp36m-manylinux2014_x86_64.whl. A CUDA 10.2 option is also available through the --use_cu102 flag. Please check the Dockerfile for dependency details. We use the stable_train_samples_per_second metric in the log, which discards the first step because it includes setup time. Because ORTModule takes some time for initial setup, a small --max_steps value may lead to a longer total run time for ORTModule than for PyTorch. If you only want fine-tuning to finish sooner, you can still lower --max_steps. Lastly, we do not recommend running this recipe on [NC] series VMs, which use an older architecture (K80).
  • Cost and VM availability: The fine-tuning job runs for ~1 hr with the default 8000 steps on ND40rs_v2 VMs, which cost $22.03/hr. Additional costs include Azure Container Registry charges for Docker image storage and Azure Storage charges for run history storage. Please note that ND40rs_v2 is not publicly available by default. To get access, create a support ticket here after the subscription is created; the ND series will then become available.
  • On the first run, the script takes ~20 mins to submit the fine-tuning job because it builds a new Docker image from the Dockerfile. On later runs, the image build step hf_ort_env.register(ws).build(ws).wait_for_completion() can be skipped by passing --skip_docker_build.
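
The idea behind a "stable" throughput metric, and the rough cost of a shorter run, can be illustrated with a small sketch. Both functions are assumptions for illustration: the README only says the metric discards the first step, so the exact formula used in the logs may differ, and the cost estimate simply scales the ~1 hr per 8000 steps figure linearly while ignoring setup time and storage charges.

```python
def stable_samples_per_second(step_seconds, samples_per_step):
    """Throughput over all steps except the first, which carries setup time.

    Assumed formula; the logged metric may be computed differently.
    """
    if len(step_seconds) < 2:
        raise ValueError("need at least two steps to discard the first one")
    stable = step_seconds[1:]
    return samples_per_step * len(stable) / sum(stable)


def estimated_compute_cost(max_steps, default_steps=8000,
                           hours_for_default=1.0, usd_per_hour=22.03):
    """Linear scaling of the ~1 hr / 8000 steps figure; setup time is ignored."""
    return max_steps / default_steps * hours_for_default * usd_per_hour


if __name__ == "__main__":
    # Made-up timings: a slow first step followed by three steady steps.
    timings = [30.0, 0.5, 0.5, 0.5]
    print(stable_samples_per_second(timings, samples_per_step=64))
    print(estimated_compute_cost(max_steps=4000))
```

With the made-up timings above, including the first step would make throughput look far lower than the steady-state rate, which is why short runs understate ORTModule's speed.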

3 Run locally

3.1 Prerequisites

  1. A machine with GPUs that you can access. This recipe was tested on a machine with 8 x 32G V100 GPUs.
  2. The number of GPUs on the machine. This value needs to be passed to the --process_count parameter.
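
One way to find the GPU count is to parse the output of `nvidia-smi -L`, which prints one line per device. The sketch below assumes the NVIDIA driver utilities are on the PATH; the helper names are hypothetical, and the function returns 0 rather than failing when `nvidia-smi` is missing.

```python
import shutil
import subprocess


def parse_gpu_list(nvidia_smi_output):
    """Count lines such as 'GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-...)'."""
    return sum(1 for line in nvidia_smi_output.splitlines()
               if line.startswith("GPU "))


def count_local_gpus():
    """Return the number of GPUs listed by `nvidia-smi -L`, or 0 if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return 0
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    if result.returncode != 0:
        return 0
    return parse_gpu_list(result.stdout)


if __name__ == "__main__":
    print(f"--process_count {count_local_gpus()}")
```

Inside the container, `torch.cuda.device_count()` should report the same number, provided PyTorch is installed there.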

3.2 Run this recipe

Build docker image.

cd huggingface/docker
sudo docker build -t hf-recipe-local-docker -f Dockerfile .

Run built docker image

  • Replace <onnxruntime-training-examples_path> with the full local path to onnxruntime-training-examples
    • Usually it's located at ~/onnxruntime-training-examples/
  • -v /dev/shm:/dev/shm mounts the host's /dev/shm at /dev/shm inside the container. Similarly, -v <onnxruntime-training-examples_path>:/onnxruntime-training-examples mounts <onnxruntime-training-examples_path> at /onnxruntime-training-examples/ inside the container
sudo docker run -it -v /dev/shm:/dev/shm -v <onnxruntime-training-examples_path>:/onnxruntime-training-examples --gpus all hf-recipe-local-docker
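
If you prefer to generate the command rather than edit the placeholder by hand, the following sketch expands the repository path and assembles the same arguments as the command above. The helper name `docker_run_command` is hypothetical, and the default path is only the usual location mentioned in the notes; Docker needs an absolute host path for bind mounts, which is why the path is expanded first.

```python
import os
import shlex


def docker_run_command(repo_path="~/onnxruntime-training-examples",
                       image="hf-recipe-local-docker"):
    """Build the `docker run` argument list with an absolute host path.

    Hypothetical helper; it only assembles the command, it does not run it.
    """
    repo = os.path.abspath(os.path.expanduser(repo_path))
    return [
        "sudo", "docker", "run", "-it",
        "-v", "/dev/shm:/dev/shm",                       # shared memory
        "-v", f"{repo}:/onnxruntime-training-examples",  # repository checkout
        "--gpus", "all",
        image,
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command line.
    print(shlex.join(docker_run_command()))
```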

Run script

  • Remember to pass the number of GPUs available locally to the --process_count parameter
  • Depending on the memory available on the local GPUs, you might need to override the default batch size by passing --model_batchsize
  • --local_run runs the script locally
cd /onnxruntime-training-examples/huggingface/script/
python --hf_model {hf_model} --run_config {run_config} --process_count <process_count> --local_run


Problem with Azure Authentication

If there's an Azure authentication issue, install the Azure CLI here and run az login --use-device-code

In case of RuntimeError: CUDA out of memory error

The issue is most likely caused by hitting a hardware limitation on the target. It can be mitigated with the following switches:

--model_batchsize - Change to smaller batchsize

--process_count - Change the number of GPUs to activate

For example

python --hf_model bart-large --run_config pt-fp16 --process_count 1 --local_run --model_batchsize 1 --max_steps 20
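
Finding a batch size that fits can be automated with a simple back-off loop. This is a sketch of one possible approach, not something the recipe script does: `run_with_backoff` and the `train_fn` callable are hypothetical, and `train_fn` stands in for whatever launches a short run with a given --model_batchsize and raises RuntimeError on failure.

```python
def run_with_backoff(train_fn, batch_size, min_batch_size=1):
    """Call train_fn(batch_size), halving the batch size after each CUDA OOM.

    Returns (batch_size_that_worked, result). Errors other than out-of-memory,
    or an OOM at the minimum batch size, are re-raised unchanged.
    """
    while True:
        try:
            return batch_size, train_fn(batch_size)
        except RuntimeError as error:
            if "out of memory" not in str(error) or batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)


if __name__ == "__main__":
    def fake_train(batch_size):
        # Stand-in for a real run: pretend anything above 2 does not fit.
        if batch_size > 2:
            raise RuntimeError("CUDA out of memory")
        return "finished"

    print(run_with_backoff(fake_train, batch_size=16))
```

Pairing this with a small --max_steps value keeps each probe short, as in the example command above.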


RoBERTa and DeBERTa are currently decommissioned from the script because of unresolved issues.

RoBERTa requires ORT >= 1.12.0 according to this issue (#11268), which was resolved in ORT 1.12.0. However, running ORT 1.12.0 in the PTCA Docker container on the machine specified for benchmarking causes this issue (#12312).

DeBERTa has the following unresolved issues when using Optimum's ORTTrainer: #15 and #305