
Add option to specify a subset of CUDA devices for the job to run on #736

Closed
miqueljubert wants to merge 1 commit

Conversation

miqueljubert

Summary:
Add a new parameter, cuda_visible_devices_subset, which contains a list of GPU indices. If set, auto_set_cuda_visible_devices will only use indices from this list when distributing devices.

This allows masking out some GPUs. It is useful on hosts shared between multiple users, where the first GPUs are often in use by processes that default to them, and on hosts with heterogeneous GPUs where only a subset should be used.

Differential Revision: D47208267
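
For illustration, the intended masking semantics might look like the following sketch (the function name and signature are assumptions for this example, not code from the patch):

from typing import List


def mask_devices(device_count: int, subset: List[int]) -> List[int]:
    # Keep only detected device indices that also appear in the configured
    # subset; auto_set_cuda_visible_devices would then distribute the
    # remaining indices across replicas as usual.
    return [d for d in range(device_count) if d in subset]


# A host with 4 GPUs where GPUs 0 and 1 are usually busy:
# mask_devices(4, subset=[2, 3]) -> [2, 3]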

@facebook-github-bot added the CLA Signed and fb-exported labels Jul 5, 2023
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D47208267

@codecov

codecov bot commented Jul 26, 2023

Codecov Report

Merging #736 (ae66354) into main (966c96f) will decrease coverage by 0.04%.
Report is 4 commits behind head on main.
The diff coverage is 84.21%.

@@            Coverage Diff             @@
##             main     #736      +/-   ##
==========================================
- Coverage   92.80%   92.77%   -0.04%     
==========================================
  Files          96       96              
  Lines        6071     6087      +16     
==========================================
+ Hits         5634     5647      +13     
- Misses        437      440       +3     
Flag        Coverage Δ
unittests   92.77% <84.21%> (-0.04%) ⬇️


Files Changed                           Coverage Δ
torchx/schedulers/local_scheduler.py    93.77% <84.21%> (-0.45%) ⬇️


@d4l3k requested a review from kiukchung July 26, 2023 07:05
Contributor

@d4l3k left a comment

This seems pretty reasonable to me, though @kiukchung has better context on this logic.

…ytorch#736)

Summary:
Pull Request resolved: pytorch#736

Add a new parameter, cuda_visible_devices_subset, which contains a list of GPU indices. If set, auto_set_cuda_visible_devices will only use indices from this list when distributing devices.

This allows masking out some GPUs. It is useful on hosts shared between multiple users, where the first GPUs are often in use by processes that default to them, and on hosts with heterogeneous GPUs where only a subset should be used.

Differential Revision: D47208267

fbshipit-source-id: 9a15d0e1202b4332d0a38ab06465eed04c9bf282
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D47208267

@@ -168,6 +168,7 @@ class LocalOpts(TypedDict, total=False):
log_dir: str
prepend_cwd: Optional[bool]
auto_set_cuda_visible_devices: Optional[bool]
auto_set_cuda_visible_devices_ids: List[str]
Contributor

Should the list be optional? I think we need explicit optional semantics instead of an empty list.

Author

TorchX right now does not support optional scheduler arguments:

Cache hits: 100%. Commands: 85 (cached: 85, remote: 0, local: 0)
Traceback (most recent call last):
  File "<string>", line 51, in <module>
  File "<string>", line 37, in __run
  File "/usr/local/fbcode/platform010/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/fbcode/platform010/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data/users/jmiquel/fbsource/buck-out/v2/gen/fbcode/da4de3c780a17bfa/torchx/cli/__torchx__/torchx#link-tree/torchx/cli/main.py", line 120, in <module>
    main()
  File "/data/users/jmiquel/fbsource/buck-out/v2/gen/fbcode/da4de3c780a17bfa/torchx/cli/__torchx__/torchx#link-tree/torchx/cli/main.py", line 116, in main
    run_main(get_sub_cmds(), argv)
  File "/data/users/jmiquel/fbsource/buck-out/v2/gen/fbcode/da4de3c780a17bfa/torchx/cli/__torchx__/torchx#link-tree/torchx/cli/main.py", line 112, in run_main
    args.func(args)
  File "/data/users/jmiquel/fbsource/buck-out/v2/gen/fbcode/da4de3c780a17bfa/torchx/cli/__torchx__/torchx#link-tree/torchx/cli/cmd_run.py", line 248, in run
    self._run(runner, args)
  File "/data/users/jmiquel/fbsource/buck-out/v2/gen/fbcode/da4de3c780a17bfa/torchx/cli/__torchx__/torchx#link-tree/torchx/cli/cmd_run.py", line 184, in _run
    cfg = scheduler_opts.cfg_from_str(args.scheduler_args)
  File "/data/users/jmiquel/fbsource/buck-out/v2/gen/fbcode/da4de3c780a17bfa/torchx/cli/__torchx__/torchx#link-tree/torchx/specs/api.py", line 866, in cfg_from_str
    cfg[key] = _cast_to_type(val, runopt_.opt_type)
  File "/data/users/jmiquel/fbsource/buck-out/v2/gen/fbcode/da4de3c780a17bfa/torchx/cli/__torchx__/torchx#link-tree/torchx/specs/api.py", line 860, in _cast_to_type
    return opt_type(value)
  File "/usr/local/fbcode/platform010/lib/python3.10/typing.py", line 957, in __call__
    result = self.__origin__(*args, **kwargs)
  File "/usr/local/fbcode/platform010/lib/python3.10/typing.py", line 387, in __call__
    raise TypeError(f"Cannot instantiate {self!r}")
TypeError: Cannot instantiate typing.Union

I will look at whether it can be cleanly added to the type instantiation.
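
For reference, one way the cast step could unwrap Optional before instantiating looks roughly like this (a sketch against the _cast_to_type behavior shown in the traceback; the function name and the comma-separated list format are assumptions):

import typing


def cast_to_type(value: str, opt_type: object) -> object:
    # Sketch only: unwrap Optional[T] (i.e. Union[T, None]) so that
    # opt_type(value) never receives a bare typing.Union.
    if typing.get_origin(opt_type) is typing.Union:
        args = [a for a in typing.get_args(opt_type) if a is not type(None)]
        if len(args) == 1:
            return cast_to_type(value, args[0])
        raise TypeError(f"unsupported union type: {opt_type}")
    if typing.get_origin(opt_type) is list:
        (elem_type,) = typing.get_args(opt_type)
        # Assumes list values arrive as a comma-separated string.
        return [elem_type(v) for v in value.split(",")]
    return opt_type(value)


# cast_to_type("0,2,3", typing.Optional[typing.List[int]]) -> [0, 2, 3]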

@@ -780,6 +789,11 @@ def _cuda_device_count(self) -> int:
except subprocess.CalledProcessError as e:
log.exception(f"Got exception while listing GPUs: {e.stderr}")
return 0
except FileNotFoundError as e:
Contributor

Add a test?
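
For example, a test along these lines could cover the new branch (illustrative only: it assumes the handler returns 0 when nvidia-smi is absent, that _cuda_device_count shells out via subprocess.run, and that these constructor arguments suffice):

from unittest import TestCase
from unittest.mock import patch

from torchx.schedulers.local_scheduler import CWDImageProvider, LocalScheduler


class CudaDeviceCountTest(TestCase):
    def test_device_count_zero_when_nvidia_smi_missing(self) -> None:
        sched = LocalScheduler(
            session_name="test", image_provider_class=CWDImageProvider
        )
        # Simulate a host without nvidia-smi on the PATH.
        with patch(
            "torchx.schedulers.local_scheduler.subprocess.run",
            side_effect=FileNotFoundError("nvidia-smi"),
        ):
            self.assertEqual(0, sched._cuda_device_count())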

@kiukchung
Collaborator

I think the logic here can be simplified quite a bit if we do the following:

  1. If auto_set_cuda_visible_devices_ids is passed, then trust the user input and simply use that list of cuda_devices to auto-set CUDA_VISIBLE_DEVICES based on the number of ddp procs to run (e.g. the product of -j #x#). No need to validate it against nvidia-smi. (A sketch of this is below.)

  2. Make this option type List[int] instead of List[str]. This way you get "is_number" validation for free.

I'm all for input validation, but this is one of the cases where the code is getting hard to read/maintain.
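
In code, the suggested simplification might look roughly like this (assumed names, not code from the PR):

from typing import Dict, List


def assign_visible_devices(device_ids: List[int], num_procs: int) -> Dict[int, str]:
    # Trust the user-supplied ids (no nvidia-smi validation) and chunk them
    # evenly across the local procs, e.g. num_procs == 4 for -j 1x4.
    if num_procs <= 0 or len(device_ids) % num_procs != 0:
        raise ValueError(
            f"cannot split {len(device_ids)} device ids across {num_procs} procs"
        )
    per_proc = len(device_ids) // num_procs
    return {
        i: ",".join(str(d) for d in device_ids[i * per_proc : (i + 1) * per_proc])
        for i in range(num_procs)
    }


# assign_visible_devices([4, 5, 6, 7], num_procs=2) -> {0: "4,5", 1: "6,7"}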

@kurman
Contributor

kurman commented Aug 18, 2023

  1. No need to validate it against nvidia-smi.

+1 on not relying on a binary on the PATH. In addition, jobs typically run on homogeneous hardware.

@miqueljubert
Author

miqueljubert commented Aug 28, 2023

I will push another diff with some changes to torchx/specs/api.py. Otherwise, neither Optional[List[str]] nor Optional[List[int]] is a supported CfgNode type, as per api.py.

But adding those types will likely require non-trivial refactoring of RunOpts, since there are a lot of baked-in assumptions that optionals are not supported and that only lists of strings are. Is that a feature that will have wider value to TorchX? It is not clear to me that the additional complexity and code paths will be worth it if they only serve this feature.

@miqueljubert
Author

I think the logic here can be simplified quite a bit if we do the following:

  1. If auto_set_cuda_visible_devices_ids is passed, then trust the user-input and simply use that list of cuda_devices to auto-set CUDA_VISIBLE_DEVICES based on the number of ddp procs to run (e.g. the product of -j #x#). No need to validate it against nvidia-smi.
  2. Make this option type List[int] instead of List[str]. This way you get "is_number" validation for free.

I'm all for input validation, but this is one of the cases where the code is getting hard to read/maintain.

  1. sounds good. On 2., as I mentioned above, this will require adding support for additional types to RunOpts, as only List[str] is supported at the moment.

@kiukchung
Collaborator

Is this PR still relevant?

@kiukchung
Collaborator

Closing as there is no activity. Feel free to reopen!

@kiukchung closed this Nov 7, 2023