Add option to specify a subset of CUDA devices for the job to run on #736
Conversation
This pull request was exported from Phabricator. Differential Revision: D47208267
Codecov Report
@@ Coverage Diff @@
## main #736 +/- ##
==========================================
- Coverage 92.80% 92.77% -0.04%
==========================================
Files 96 96
Lines 6071 6087 +16
==========================================
+ Hits 5634 5647 +13
- Misses 437 440 +3
This seems pretty reasonable to me, though @kiukchung has better context on this logic.
…ytorch#736)
Summary: Pull Request resolved: pytorch#736
Adds a new parameter, cuda_visible_devices_subset, which contains a list of GPU indices. If set, auto_set_cuda_visible_devices will only use indices from this list when distributing the indices. This allows masking out some GPUs, which is useful on hosts shared between multiple users, where the first GPUs are often already in use by default processes, and on hosts that have different types of GPUs, where it is desired to use only a subset of those.
Differential Revision: D47208267
fbshipit-source-id: 9a15d0e1202b4332d0a38ab06465eed04c9bf282
Force-pushed from f6eb2fd to ae66354.
@@ -168,6 +168,7 @@ class LocalOpts(TypedDict, total=False):
     log_dir: str
     prepend_cwd: Optional[bool]
     auto_set_cuda_visible_devices: Optional[bool]
+    auto_set_cuda_visible_devices_ids: List[str]
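For context, here is a sketch of what the LocalOpts declaration would look like after this hunk (field names taken from the diff; all other fields elided). Note that because the TypedDict is declared with total=False, every key is already optional at the dict level; that is separate from the question of Optional-vs-empty-list semantics for the value itself.

```python
from typing import List, Optional, TypedDict


class LocalOpts(TypedDict, total=False):
    """Subset of the fields shown in the diff hunk; other fields elided."""

    log_dir: str
    prepend_cwd: Optional[bool]
    auto_set_cuda_visible_devices: Optional[bool]
    auto_set_cuda_visible_devices_ids: List[str]  # field added by this PR


# total=False means callers may pass any subset of keys:
opts: LocalOpts = {"auto_set_cuda_visible_devices_ids": ["2", "3"]}
print(opts)
```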
Should the list be Optional? I think we need explicit optional semantics here rather than treating an empty list as "unset".
TorchX right now does not understand optional scheduler arguments.
Cache hits: 100%. Commands: 85 (cached: 85, remote: 0, local: 0)
Traceback (most recent call last):
File "<string>", line 51, in <module>
File "<string>", line 37, in __run
File "/usr/local/fbcode/platform010/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/fbcode/platform010/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/data/users/jmiquel/fbsource/buck-out/v2/gen/fbcode/da4de3c780a17bfa/torchx/cli/__torchx__/torchx#link-tree/torchx/cli/main.py", line 120, in <module>
main()
File "/data/users/jmiquel/fbsource/buck-out/v2/gen/fbcode/da4de3c780a17bfa/torchx/cli/__torchx__/torchx#link-tree/torchx/cli/main.py", line 116, in main
run_main(get_sub_cmds(), argv)
File "/data/users/jmiquel/fbsource/buck-out/v2/gen/fbcode/da4de3c780a17bfa/torchx/cli/__torchx__/torchx#link-tree/torchx/cli/main.py", line 112, in run_main
args.func(args)
File "/data/users/jmiquel/fbsource/buck-out/v2/gen/fbcode/da4de3c780a17bfa/torchx/cli/__torchx__/torchx#link-tree/torchx/cli/cmd_run.py", line 248, in run
self._run(runner, args)
File "/data/users/jmiquel/fbsource/buck-out/v2/gen/fbcode/da4de3c780a17bfa/torchx/cli/__torchx__/torchx#link-tree/torchx/cli/cmd_run.py", line 184, in _run
cfg = scheduler_opts.cfg_from_str(args.scheduler_args)
File "/data/users/jmiquel/fbsource/buck-out/v2/gen/fbcode/da4de3c780a17bfa/torchx/cli/__torchx__/torchx#link-tree/torchx/specs/api.py", line 866, in cfg_from_str
cfg[key] = _cast_to_type(val, runopt_.opt_type)
File "/data/users/jmiquel/fbsource/buck-out/v2/gen/fbcode/da4de3c780a17bfa/torchx/cli/__torchx__/torchx#link-tree/torchx/specs/api.py", line 860, in _cast_to_type
return opt_type(value)
File "/usr/local/fbcode/platform010/lib/python3.10/typing.py", line 957, in __call__
result = self.__origin__(*args, **kwargs)
File "/usr/local/fbcode/platform010/lib/python3.10/typing.py", line 387, in __call__
raise TypeError(f"Cannot instantiate {self!r}")
TypeError: Cannot instantiate typing.Union
I will look at whether it can be cleanly added to the type instantiation.
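The root cause in the traceback above is that _cast_to_type calls the declared opt_type as a constructor, and Optional[...] (i.e. typing.Union) cannot be instantiated. A minimal repro of just that step, independent of TorchX (the exact error text varies by Python version; on the 3.10 runtime in the traceback it is "Cannot instantiate typing.Union"):

```python
from typing import List, Optional

# What a runopt declared as Optional[List[str]] would carry as its opt_type:
opt_type = Optional[List[str]]

raised = None
try:
    # This is effectively what _cast_to_type does: opt_type(value)
    opt_type("0,1,2")
except TypeError as e:
    raised = e  # e.g. "Cannot instantiate typing.Union" on Python 3.10

print(raised)
```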
@@ -780,6 +789,11 @@ def _cuda_device_count(self) -> int:
     except subprocess.CalledProcessError as e:
         log.exception(f"Got exception while listing GPUs: {e.stderr}")
         return 0
+    except FileNotFoundError as e:
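The hunk above adds a FileNotFoundError handler around the nvidia-smi call, so a missing binary degrades to "no GPUs" instead of crashing. A hedged sketch of the overall shape, as a standalone function rather than the actual TorchX method (the specific nvidia-smi flags here are an assumption):

```python
import subprocess


def cuda_device_count() -> int:
    """Count GPUs via nvidia-smi, returning 0 on any expected failure.

    Sketch of the behavior discussed above, not the actual TorchX code.
    """
    try:
        proc = subprocess.run(
            ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
        )
        return len(proc.stdout.splitlines())
    except subprocess.CalledProcessError as e:
        # nvidia-smi exists but exited non-zero (e.g. driver problem)
        print(f"Got exception while listing GPUs: {e.stderr}")
        return 0
    except FileNotFoundError:
        # nvidia-smi is not on the PATH, e.g. a CPU-only host
        return 0


print(cuda_device_count())
```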
Add a test?
I think the logic here can be simplified quite a bit if we do the following:
I'm all for input validation, but this is one of the cases where the code is getting hard to read/maintain.
+1 on not relying on a binary on the PATH. In addition to that, jobs typically run on homogeneous hardware.
I will push another diff with some changes to torchx/specs/api.py. Otherwise neither Optional[List[str]] nor Optional[List[int]] is a supported CfgNode type, as per api.py. But adding those types will likely require non-trivial refactoring of RunOpts, since there are a lot of baked-in assumptions that optionals are not supported and that only lists of strings are supported. Is that a feature that will have wider value to TorchX? It is not clear to me that the additional complexity and code paths will be worth it if they exist only for this feature.
Is this PR still relevant?
Closing as there is no activity. Feel free to reopen!
Summary:
Adds a new parameter, cuda_visible_devices_subset, which contains a list of GPU indices. If set, auto_set_cuda_visible_devices will only use indices from this list when distributing the indices.
This allows masking out some GPUs. It is useful on hosts shared between multiple users, where the first GPUs are often already in use by default processes, and on hosts that have different types of GPUs available, where it is desired to use only a subset of those.
Differential Revision: D47208267
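To illustrate the intended behavior, here is a hedged sketch of how a device subset might be split evenly across local replicas. distribute_devices is a hypothetical helper written for this illustration, not the actual TorchX implementation:

```python
from typing import Dict, List


def distribute_devices(subset: List[int], num_replicas: int) -> Dict[int, List[int]]:
    """Split the allowed GPU indices evenly across replicas.

    Hypothetical sketch of what auto_set_cuda_visible_devices might do
    when cuda_visible_devices_subset is set.
    """
    per_replica = len(subset) // num_replicas
    assert per_replica > 0, "need at least one GPU per replica"
    return {
        r: subset[r * per_replica : (r + 1) * per_replica]
        for r in range(num_replicas)
    }


# e.g. mask out GPUs 0 and 1 on a shared 4-GPU host:
assignment = distribute_devices([2, 3], num_replicas=2)
print(assignment)  # {0: [2], 1: [3]}
```

Each replica would then have CUDA_VISIBLE_DEVICES set from its slice (replica 0 gets "2", replica 1 gets "3"), leaving GPUs 0 and 1 untouched.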