From b6bc7afafbdc3d11114c3577d565dbac780dc743 Mon Sep 17 00:00:00 2001 From: stiebels <15859432+stiebels@users.noreply.github.com> Date: Tue, 6 Feb 2024 22:51:35 +0100 Subject: [PATCH] Improve CIFAR10 example (#13) --- .../cifar10_pytorch/CIFAR10-PyTorch.ipynb | 118 ++++++++++++++++-- computer_vision/cifar10_pytorch/README.md | 8 +- computer_vision/cifar10_pytorch/adaptive.yaml | 1 + computer_vision/cifar10_pytorch/const.yaml | 3 +- .../cifar10_pytorch/distributed.yaml | 1 + ...xample.yaml => distributed_inference.yaml} | 6 +- 6 files changed, 119 insertions(+), 18 deletions(-) rename computer_vision/cifar10_pytorch/{distributed_inference_example.yaml => distributed_inference.yaml} (81%) diff --git a/computer_vision/cifar10_pytorch/CIFAR10-PyTorch.ipynb b/computer_vision/cifar10_pytorch/CIFAR10-PyTorch.ipynb index 8e0c1b1..08775a8 100644 --- a/computer_vision/cifar10_pytorch/CIFAR10-PyTorch.ipynb +++ b/computer_vision/cifar10_pytorch/CIFAR10-PyTorch.ipynb @@ -15,7 +15,7 @@ "metadata": {}, "outputs": [], "source": [ - "# Test importing Determined. In Determined is properly installed, you should see no output.\n", + "# Test importing Determined. If Determined is properly installed, you should see no output.\n", "import determined as det" ] }, @@ -80,11 +80,17 @@ "metadata": {}, "source": [ "### const.yaml\n", - "For our first Determined experiment, we'll run this model training job with fixed hyperparameters. Note the following sections:\n", - "- `description`: A short description of the experiment\n", - "- `data`: A section for user to provide custom key value pairs. Here we specify where the data resides. \n", - "- `hyperparameters`: area for user to define hyperparameters that will be injected into the trial class at runtime. There are constant values for this configuration\n", - "- `searcher`: hyperparameter search algorithm for the experiment" + "For our first Determined experiment, we'll run this model training job with fixed hyperparameters.
Note the following sections (each keyword links to the [official API docs](https://hpe-mlde.determined.ai/latest/reference/training/experiment-config-reference.html)):\n", + "\n", + "- [`name`](https://hpe-mlde.determined.ai/latest/reference/training/experiment-config-reference.html#name): A short human-readable name for the experiment.\n", + "- [`description`](https://hpe-mlde.determined.ai/latest/reference/training/experiment-config-reference.html#description): A short description of the experiment (ideally <255 chars).\n", + "- [`hyperparameters`](https://hpe-mlde.determined.ai/latest/reference/training/experiment-config-reference.html#hyperparameters): An area for the user to define hyperparameters that will be injected into the trial class at runtime; this configuration uses constant values.\n", + "- [`records_per_epoch`](https://hpe-mlde.determined.ai/latest/reference/training/experiment-config-reference.html#records-per-epoch): The number of records in the training data set. Mandatory since we're also setting `min_validation_period`.\n", + "- [`searcher`](https://hpe-mlde.determined.ai/latest/reference/training/experiment-config-reference.html#searcher): The hyperparameter search algorithm for the experiment.\n", + "- [`entrypoint`](https://hpe-mlde.determined.ai/latest/reference/training/experiment-config-reference.html#experiment-config-entrypoint): A model definition trial class specification or Python launcher script, which is the model processing entrypoint.\n", + "- [`min_validation_period`](https://hpe-mlde.determined.ai/latest/reference/training/experiment-config-reference.html#min-validation-period): Specifies the minimum frequency at which validation should be run for each trial.\n", + "\n", + "Not all of these settings are always mandatory. See the referenced API documentation for details."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -163,17 +169,111 @@ "When the experiment finishes, note that your best performing model achieves a lower validation error than our first experiment that ran with constant hyperparameter values. From the Determined experiment detail page, you can drill in to a particular trial and view the hyperparameter values used. You can also access the saved checkpoint of your best-performing model and load it for real-time or batch inference as described in the PyTorch documentation [here](https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-a-general-checkpoint-for-inference-and-or-resuming-training)." ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Distributed training on multiple GPUs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "See also the introduction to implementing distributed training, which you can find [here](https://docs.determined.ai/latest/model-dev-guide/dtrain/dtrain-implement.html#multi-gpu-training-implement)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### distributed.yaml" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you have a multi-GPU cluster set up that's running Determined AI, you can distribute your training on multiple GPUs by changing a few settings in your experiment configuration." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!cat -n distributed.yaml" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note the slight differences from `const.yaml`:\n", + "- We added `slots_per_trial` and set it to the number of GPUs we're training on (here: 16).\n", + "- Since we're training on 16 GPUs and we want a per-GPU batch size of 32, we're setting `global_batch_size` to (32*16=)512."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!det -m {determined_master} experiment create distributed.yaml ." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Distributed Batch Inference" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When using PyTorch, you can use the distributed training workflow with PyTorchTrial to accelerate inference workloads. This workflow is not yet officially supported; therefore, users must specify certain training-specific artifacts that are not used for inference. This is covered below. Also, you can find further documentation [here](https://docs.determined.ai/latest/model-dev-guide/dtrain/dtrain-implement.html#distributed-inference)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### distributed_inference.yaml" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!cat -n distributed_inference.yaml" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, launch batch inference the same way you would launch a training job." + ] + }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], - "source": [] + "source": [ + "!det -m {determined_master} experiment create distributed_inference.yaml ."
+ ] } ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, @@ -187,7 +287,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.5" + "version": "3.9.18" } }, "nbformat": 4, diff --git a/computer_vision/cifar10_pytorch/README.md b/computer_vision/cifar10_pytorch/README.md index 68c464b..e692d7e 100644 --- a/computer_vision/cifar10_pytorch/README.md +++ b/computer_vision/cifar10_pytorch/README.md @@ -9,8 +9,9 @@ example](https://github.com/keras-team/keras/blob/keras-2/examples/cifar10_cnn.p ### Configuration Files * **const.yaml**: Train the model with constant hyperparameter values. -* **distributed.yaml**: Same as `const.yaml`, but trains the model with multiple GPUs (distributed training). * **adaptive.yaml**: Perform a hyperparameter search using Determined's state-of-the-art adaptive hyperparameter tuning algorithm. +* **distributed.yaml**: Same as `const.yaml`, but trains the model with multiple GPUs (distributed training). +* **distributed_inference.yaml**: Use the distributed training workflow with PyTorchTrial to accelerate batch inference workloads. ## Data The CIFAR-10 dataset is downloaded from https://www.cs.toronto.edu/~kriz/cifar.html. @@ -19,10 +20,9 @@ The CIFAR-10 dataset is downloaded from https://www.cs.toronto.edu/~kriz/cifar.h If you have not yet installed Determined, installation instructions can be found under `docs/install-admin.html` or at https://docs.determined.ai/latest/index.html -Run the following command: `det -m experiment create -f +Run the following command: `det -m <master-url> experiment create -f const.yaml .`. The other configurations can be run by specifying the appropriate configuration file in place of `const.yaml`. ## Results -Training the model with the hyperparameter settings in `const.yaml` should yield -a validation accuracy of ~74%.
+Training the model with the hyperparameter settings in `const.yaml` should yield a validation accuracy of ~74%. diff --git a/computer_vision/cifar10_pytorch/adaptive.yaml b/computer_vision/cifar10_pytorch/adaptive.yaml index f8deaf5..12c5dcd 100644 --- a/computer_vision/cifar10_pytorch/adaptive.yaml +++ b/computer_vision/cifar10_pytorch/adaptive.yaml @@ -1,4 +1,5 @@ name: cifar10_pytorch_adaptive_search +description: An example experiment of hyperparameter tuning using Determined AI with CIFAR10 and PyTorch. hyperparameters: learning_rate: type: log diff --git a/computer_vision/cifar10_pytorch/const.yaml b/computer_vision/cifar10_pytorch/const.yaml index 243ff25..ee784c7 100644 --- a/computer_vision/cifar10_pytorch/const.yaml +++ b/computer_vision/cifar10_pytorch/const.yaml @@ -1,4 +1,5 @@ name: cifar10_pytorch_const +description: An example experiment using Determined AI with CIFAR10 and PyTorch. hyperparameters: learning_rate: 1.0e-4 learning_rate_decay: 1.0e-6 @@ -14,4 +15,4 @@ searcher: epochs: 32 entrypoint: model_def:CIFARTrial min_validation_period: - epochs: 1 + epochs: 1 \ No newline at end of file diff --git a/computer_vision/cifar10_pytorch/distributed.yaml b/computer_vision/cifar10_pytorch/distributed.yaml index 3a931d3..52edf68 100644 --- a/computer_vision/cifar10_pytorch/distributed.yaml +++ b/computer_vision/cifar10_pytorch/distributed.yaml @@ -1,4 +1,5 @@ name: cifar10_pytorch_distributed +description: An example experiment using Determined AI with CIFAR10, PyTorch and distributed multi-GPU training.
hyperparameters: learning_rate: 1.0e-4 learning_rate_decay: 1.0e-6 diff --git a/computer_vision/cifar10_pytorch/distributed_inference_example.yaml b/computer_vision/cifar10_pytorch/distributed_inference.yaml similarity index 81% rename from computer_vision/cifar10_pytorch/distributed_inference_example.yaml rename to computer_vision/cifar10_pytorch/distributed_inference.yaml index e94ad6d..2c1be9f 100644 --- a/computer_vision/cifar10_pytorch/distributed_inference_example.yaml +++ b/computer_vision/cifar10_pytorch/distributed_inference.yaml @@ -1,11 +1,10 @@ -name: distributed_inference_example +name: cifar10_pytorch_distributed_inference +description: An example using Determined AI with CIFAR10, PyTorch and distributed batch inference. entrypoint: >- python3 -m determined.launch.torch_distributed python3 inference_example.py - resources: slots_per_trial: 2 - searcher: name: grid metric: x @@ -31,7 +30,6 @@ hyperparameters: - 12 - 13 - 14 - max_restarts: 0 bind_mounts: - host_path: /tmp
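The distributed training notes in the notebook cells above set `global_batch_size` to the per-GPU batch size times `slots_per_trial` (32 × 16 = 512). A minimal sketch of that arithmetic, including the divisibility constraint a global batch must satisfy when it is split across slots (the helper names here are illustrative, not part of the Determined API):

```python
def global_batch_size(per_slot_batch_size: int, slots_per_trial: int) -> int:
    """Global batch size for a distributed trial: per-slot batch times slot count."""
    return per_slot_batch_size * slots_per_trial


def per_slot_batch_size(global_batch: int, slots_per_trial: int) -> int:
    """Invert the calculation; the global batch must divide evenly across slots."""
    if global_batch % slots_per_trial != 0:
        raise ValueError(
            f"global_batch_size={global_batch} is not divisible by "
            f"slots_per_trial={slots_per_trial}"
        )
    return global_batch // slots_per_trial


# The distributed.yaml values from the patch: 32 per GPU on 16 GPUs.
print(global_batch_size(32, 16))  # -> 512
```

Keeping `global_batch_size` an exact multiple of `slots_per_trial` avoids uneven per-GPU batches when the trainer partitions each batch.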
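For the distributed batch inference configuration, each worker should score a disjoint slice of the dataset (the CIFAR-10 test set has 10,000 images). One way to sketch that sharding, assuming contiguous index ranges per worker — a hypothetical helper for illustration, not part of Determined's API or of `inference_example.py`:

```python
def shard_indices(num_items: int, num_workers: int, rank: int) -> range:
    """Return the contiguous slice of dataset indices owned by worker `rank`.

    Remainder items are spread over the first few workers, so shard sizes
    differ by at most one.
    """
    if not 0 <= rank < num_workers:
        raise ValueError("rank must be in [0, num_workers)")
    base, extra = divmod(num_items, num_workers)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


# Example: 10,000 test images split across 16 workers -> 625 indices each.
print(len(shard_indices(10_000, 16, 0)))  # -> 625
```

Together the shards cover every index exactly once, so concatenating the per-worker predictions reconstructs results for the full dataset.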