Determined implementation of CycleGAN
- determined_model_def.py is the user model definition file for Determined-managed training. It is ported from cyclegan.py to implement the Determined PyTorch API.
- const.yaml is the user configuration for Determined-managed training.
- startup-hook.sh is the startup Bash script that downloads and extracts the training data.
- cyclegan.py is the original training script from the original repository.
Prerequisites: A Determined cluster must be installed to run this example. Follow the Determined installation directions to set one up.
To run the example, submit the experiment to the cluster by running the following command from this directory:
det -m <master host:port> experiment create 1-gpu.yaml .
- Change the resources.
resources:
  slots_per_trial: 64
- Change the global batch size.
hyperparameters:
  global_batch_size: 64
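The two settings above interact: the global batch is divided among the slots (GPUs) of the trial. The following is a minimal sketch of that arithmetic, assuming an even split. The helper name `per_slot_batch_size` and the strict divisibility check are illustrative choices, not part of this example's code.

```python
def per_slot_batch_size(global_batch_size: int, slots_per_trial: int) -> int:
    """Return the batch size each slot (GPU) processes per step.

    Assumes the global batch is split evenly across slots; the strict
    divisibility check below is an illustrative choice for this sketch.
    """
    if slots_per_trial <= 0:
        raise ValueError("slots_per_trial must be positive")
    if global_batch_size % slots_per_trial != 0:
        raise ValueError(
            f"global_batch_size={global_batch_size} is not divisible by "
            f"slots_per_trial={slots_per_trial}"
        )
    return global_batch_size // slots_per_trial


# Global batch sizes used in the benchmark, on 64 slots (8 nodes x 8 GPUs).
for gbs in (64, 128, 256):
    print(f"global_batch_size={gbs} -> per-slot batch size "
          f"{per_slot_batch_size(gbs, 64)}")
```

With 64 slots, a global batch size of 64 leaves each GPU a single record per step, which is why the larger global batch sizes give each GPU more work between communication rounds.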
Who Manages Training | GPU Type | GPUs per Node | Nodes | Global Batch Size | Aggregation Frequency | Throughput |
---|---|---|---|---|---|---|
User | V100 | 1 | 1 | 1 | 1 | 4.948 records / sec |
Determined | V100 | 1 | 1 | 1 | 1 | 4.878 records / sec |
Determined | V100 | 8 | 8 | 64 | 1 | 71.44-164.45 records / sec* |
Determined | V100 | 8 | 8 | 64 | 2 | 157.18 records / sec |
Determined | V100 | 8 | 8 | 64 | 4 | 157.18 records / sec |
Determined | V100 | 8 | 8 | 128 | 1 | 232.85 records / sec |
Determined | V100 | 8 | 8 | 256 | 1 | 299.38 records / sec |
\* The throughput is unstable because of inter-node communication when the global batch size is 64 and the aggregation frequency is 1. Use a larger batch size or a higher aggregation frequency to improve the scaling efficiency. See Distributed Training Performance Optimization for details.
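Scaling efficiency can be estimated from the table as the measured throughput divided by the ideal linear throughput (number of GPUs times single-GPU throughput). The sketch below uses the Determined-managed single-GPU row as the baseline, which is an assumption of this illustration. That baseline ran with a per-GPU batch size of 1, while the larger global batch sizes give each GPU 2 or 4 records, so the result is a rough indicator rather than a strict comparison.

```python
# Throughput figures copied from the benchmark table (records / sec).
SINGLE_GPU_THROUGHPUT = 4.878  # Determined-managed, 1 V100, global batch size 1
TOTAL_GPUS = 8 * 8             # 8 GPUs per node x 8 nodes

# (global batch size, aggregation frequency) -> measured throughput
MEASURED = {
    (64, 2): 157.18,
    (64, 4): 157.18,
    (128, 1): 232.85,
    (256, 1): 299.38,
}


def scaling_efficiency(throughput: float, gpus: int, baseline: float) -> float:
    """Measured throughput as a fraction of ideal linear scaling."""
    return throughput / (gpus * baseline)


EFFICIENCY = {
    config: scaling_efficiency(value, TOTAL_GPUS, SINGLE_GPU_THROUGHPUT)
    for config, value in MEASURED.items()
}

for (batch_size, agg_freq), eff in EFFICIENCY.items():
    print(f"batch size {batch_size}, aggregation frequency {agg_freq}: {eff:.1%}")
```

Under these assumptions the 64-GPU runs reach roughly 50% of linear scaling at a global batch size of 64 with aggregation, about 75% at 128, and about 96% at 256.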