
Add option to specify a subset of CUDA devices for the job to run on #736

Closed
miqueljubert wants to merge 1 commit

Conversation

miqueljubert

Summary:
Add a new parameter, cuda_visible_devices_subset, which contains a list of GPU indices. If set, auto_set_cuda_visible_devices will only use indices from this list when distributing devices.

This allows masking out some GPUs. It is useful on hosts shared between multiple users, where the first GPUs are often in use by processes that default to them, and on hosts with heterogeneous GPUs where only a subset should be used.

Differential Revision: D47208267
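
For illustration, the intended masking semantics might look like the following sketch (the function name and signature are assumptions for this example, not code from the patch):

from typing import List


def mask_devices(device_count: int, subset: List[int]) -> List[int]:
    # Keep only detected device indices that also appear in the configured
    # subset; auto_set_cuda_visible_devices would then distribute the
    # remaining indices across replicas as usual.
    return [d for d in range(device_count) if d in subset]


# A host with 4 GPUs where GPUs 0 and 1 are usually busy:
# mask_devices(4, subset=[2, 3]) -> [2, 3]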

@facebook-github-bot added the CLA Signed and fb-exported labels Jul 5, 2023
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D47208267

@codecov

codecov bot commented Jul 26, 2023

Codecov Report

Merging #736 (ae66354) into main (966c96f) will decrease coverage by 0.04%.
Report is 4 commits behind head on main.
The diff coverage is 84.21%.

@@            Coverage Diff             @@
##             main     #736      +/-   ##
==========================================
- Coverage   92.80%   92.77%   -0.04%     
==========================================
  Files          96       96              
  Lines        6071     6087      +16     
==========================================
+ Hits         5634     5647      +13     
- Misses        437      440       +3     
Flag        Coverage Δ
unittests   92.77% <84.21%> (-0.04%) ⬇️


Files Changed                           Coverage Δ
torchx/schedulers/local_scheduler.py    93.77% <84.21%> (-0.45%) ⬇️


@d4l3k requested a review from kiukchung July 26, 2023 07:05
Contributor

@d4l3k left a comment

This seems pretty reasonable to me, though @kiukchung has better context on this logic.

…ytorch#736)

Summary:
Pull Request resolved: pytorch#736

Add a new parameter, cuda_visible_devices_subset, which contains a list of GPU indices. If set, auto_set_cuda_visible_devices will only use indices from this list when distributing devices.

This allows masking out some GPUs. It is useful on hosts shared between multiple users, where the first GPUs are often in use by processes that default to them, and on hosts with heterogeneous GPUs where only a subset should be used.

Differential Revision: D47208267

fbshipit-source-id: 9a15d0e1202b4332d0a38ab06465eed04c9bf282
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D47208267

@@ -168,6 +168,7 @@ class LocalOpts(TypedDict, total=False):
log_dir: str
prepend_cwd: Optional[bool]
auto_set_cuda_visible_devices: Optional[bool]
auto_set_cuda_visible_devices_ids: List[str]
Contributor

Should the list be optional? I think we need explicit optional semantics instead of an empty list.

Author

TorchX right now does not support optional scheduler arguments:

Cache hits: 100%. Commands: 85 (cached: 85, remote: 0, local: 0)
Traceback (most recent call last):
  File "<string>", line 51, in <module>
  File "<string>", line 37, in __run
  File "/usr/local/fbcode/platform010/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/fbcode/platform010/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data/users/jmiquel/fbsource/buck-out/v2/gen/fbcode/da4de3c780a17bfa/torchx/cli/__torchx__/torchx#link-tree/torchx/cli/main.py", line 120, in <module>
    main()
  File "/data/users/jmiquel/fbsource/buck-out/v2/gen/fbcode/da4de3c780a17bfa/torchx/cli/__torchx__/torchx#link-tree/torchx/cli/main.py", line 116, in main
    run_main(get_sub_cmds(), argv)
  File "/data/users/jmiquel/fbsource/buck-out/v2/gen/fbcode/da4de3c780a17bfa/torchx/cli/__torchx__/torchx#link-tree/torchx/cli/main.py", line 112, in run_main
    args.func(args)
  File "/data/users/jmiquel/fbsource/buck-out/v2/gen/fbcode/da4de3c780a17bfa/torchx/cli/__torchx__/torchx#link-tree/torchx/cli/cmd_run.py", line 248, in run
    self._run(runner, args)
  File "/data/users/jmiquel/fbsource/buck-out/v2/gen/fbcode/da4de3c780a17bfa/torchx/cli/__torchx__/torchx#link-tree/torchx/cli/cmd_run.py", line 184, in _run
    cfg = scheduler_opts.cfg_from_str(args.scheduler_args)
  File "/data/users/jmiquel/fbsource/buck-out/v2/gen/fbcode/da4de3c780a17bfa/torchx/cli/__torchx__/torchx#link-tree/torchx/specs/api.py", line 866, in cfg_from_str
    cfg[key] = _cast_to_type(val, runopt_.opt_type)
  File "/data/users/jmiquel/fbsource/buck-out/v2/gen/fbcode/da4de3c780a17bfa/torchx/cli/__torchx__/torchx#link-tree/torchx/specs/api.py", line 860, in _cast_to_type
    return opt_type(value)
  File "/usr/local/fbcode/platform010/lib/python3.10/typing.py", line 957, in __call__
    result = self.__origin__(*args, **kwargs)
  File "/usr/local/fbcode/platform010/lib/python3.10/typing.py", line 387, in __call__
    raise TypeError(f"Cannot instantiate {self!r}")
TypeError: Cannot instantiate typing.Union

I will look at whether it can be cleanly added to the type instantiation.
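
For reference, one way the cast step could unwrap Optional before instantiating looks roughly like this (a sketch against the _cast_to_type behavior shown in the traceback; the function name and the comma-separated list format are assumptions):

import typing


def cast_to_type(value: str, opt_type: object) -> object:
    # Sketch only: unwrap Optional[T] (i.e. Union[T, None]) so that
    # opt_type(value) never receives a bare typing.Union.
    if typing.get_origin(opt_type) is typing.Union:
        args = [a for a in typing.get_args(opt_type) if a is not type(None)]
        if len(args) == 1:
            return cast_to_type(value, args[0])
        raise TypeError(f"unsupported union type: {opt_type}")
    if typing.get_origin(opt_type) is list:
        (elem_type,) = typing.get_args(opt_type)
        # Assumes list values arrive as a comma-separated string.
        return [elem_type(v) for v in value.split(",")]
    return opt_type(value)


# cast_to_type("0,2,3", typing.Optional[typing.List[int]]) -> [0, 2, 3]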

@@ -780,6 +789,11 @@ def _cuda_device_count(self) -> int:
except subprocess.CalledProcessError as e:
log.exception(f"Got exception while listing GPUs: {e.stderr}")
return 0
except FileNotFoundError as e:
Contributor

Add a test?
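
For example, a test along these lines could cover the new branch (illustrative only: it assumes the handler returns 0 when nvidia-smi is absent, that _cuda_device_count shells out via subprocess.run, and that these constructor arguments suffice):

from unittest import TestCase
from unittest.mock import patch

from torchx.schedulers.local_scheduler import CWDImageProvider, LocalScheduler


class CudaDeviceCountTest(TestCase):
    def test_device_count_zero_when_nvidia_smi_missing(self) -> None:
        sched = LocalScheduler(
            session_name="test", image_provider_class=CWDImageProvider
        )
        # Simulate a host without nvidia-smi on the PATH.
        with patch(
            "torchx.schedulers.local_scheduler.subprocess.run",
            side_effect=FileNotFoundError("nvidia-smi"),
        ):
            self.assertEqual(0, sched._cuda_device_count())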

@kiukchung
Collaborator

I think the logic here can be simplified quite a bit if we do the following:

  1. If auto_set_cuda_visible_devices_ids is passed, then trust the user input and simply use that list of cuda_devices to auto-set CUDA_VISIBLE_DEVICES based on the number of ddp procs to run (e.g. the product of -j #x#). No need to validate it against nvidia-smi. (A sketch of this is below.)

  2. Make this option type List[int] instead of List[str]. This way you get "is_number" validation for free.

I'm all for input validation, but this is one of the cases where the code is getting hard to read/maintain.
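
In code, the suggested simplification might look roughly like this (assumed names, not code from the PR):

from typing import Dict, List


def assign_visible_devices(device_ids: List[int], num_procs: int) -> Dict[int, str]:
    # Trust the user-supplied ids (no nvidia-smi validation) and chunk them
    # evenly across the local procs, e.g. num_procs == 4 for -j 1x4.
    if num_procs <= 0 or len(device_ids) % num_procs != 0:
        raise ValueError(
            f"cannot split {len(device_ids)} device ids across {num_procs} procs"
        )
    per_proc = len(device_ids) // num_procs
    return {
        i: ",".join(str(d) for d in device_ids[i * per_proc : (i + 1) * per_proc])
        for i in range(num_procs)
    }


# assign_visible_devices([4, 5, 6, 7], num_procs=2) -> {0: "4,5", 1: "6,7"}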

@kurman
Contributor

kurman commented Aug 18, 2023

  1. No need to validate it against nvidia-smi.

+1 on not relying on a binary on the PATH. In addition, jobs typically run on homogeneous hardware.

@miqueljubert
Author

miqueljubert commented Aug 28, 2023

I will push another diff with some changes to torchx/specs/api.py. Otherwise, neither Optional[List[str]] nor Optional[List[int]] is a supported CfgNode type, as per api.py.

But adding those types will likely require non-trivial refactoring of RunOpts, since there are a lot of baked-in assumptions that optionals are not supported and that only lists of strings are. Is that a feature that will have wider value to TorchX? It is not clear to me that the additional complexity and code paths will be worth it if they only serve this feature.

@miqueljubert
Author

I think the logic here can be simplified quite a bit if we do the following:

  1. If auto_set_cuda_visible_devices_ids is passed, then trust the user-input and simply use that list of cuda_devices to auto-set CUDA_VISIBLE_DEVICES based on the number of ddp procs to run (e.g. the product of -j #x#). No need to validate it against nvidia-smi.
  2. Make this option type List[int] instead of List[str]. This way you get "is_number" validation for free.

I'm all for input validation, but this is one of the cases where the code is getting hard to read/maintain.

  1. sounds good. On 2., as I mentioned above, this will require adding support for additional types to RunOpts, as only List[str] is supported at the moment.

@kiukchung
Collaborator

Is this PR still relevant?

@kiukchung
Collaborator

Closing as there is no activity. Feel free to reopen!

@kiukchung closed this Nov 7, 2023