Update v6e-256 KubeRay Sample #2466
Signed-off-by: Ryan O'Leary <[email protected]>
ray-operator/config/samples/ray-cluster.tpu-v6e-256-multihost.yaml
Signed-off-by: ryanaoleary <[email protected]>
import ray
import jax
import time

from jax.experimental import multihost_utils
remove this import as it's no longer used
I think we should leave that in, along with multihost_utils.sync_global_devices("sync"). I wasn't able to schedule a v6e-256, but I just tested with a multi-host v6e-16 slice, and adding that line ensures the JAX code runs once on each TPU host with Ray. I added it back in 372a081. Output of my manual test:
-------------------------------------------------------
Job 'raysubmit_EKeMpf1wY3pYYTzf' submitted successfully
-------------------------------------------------------
Next steps
Query the logs of the job:
ray job logs raysubmit_EKeMpf1wY3pYYTzf
Query the status of the job:
ray job status raysubmit_EKeMpf1wY3pYYTzf
Request the job to be stopped:
ray job stop raysubmit_EKeMpf1wY3pYYTzf
Tailing logs until the job exits (disable with --no-wait):
2024-11-06 22:52:41,758 INFO job_manager.py:528 -- Runtime env is setting up.
2024-11-06 22:52:54,414 INFO worker.py:1461 -- Using address 10.48.3.43:6379 set in the environment variable RAY_ADDRESS
2024-11-06 22:52:54,414 INFO worker.py:1601 -- Connecting to existing Ray cluster at address: 10.48.3.43:6379...
2024-11-06 22:52:54,420 INFO worker.py:1777 -- Connected to Ray cluster. View the dashboard at 10.48.3.43:8265
Number of TPU Workers: 4
(tpu_cores pid=503, ip=10.48.8.7) TPU Worker: 1
['TPU cores:16', 'TPU cores:16', 'TPU cores:16', 'TPU cores:16']
(tpu_cores pid=487, ip=10.48.1.7) TPU Worker: 3 [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
------------------------------------------
Job 'raysubmit_EKeMpf1wY3pYYTzf' succeeded
------------------------------------------
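For reference, a script that would produce output shaped like the log above can be sketched roughly as follows. This is a minimal sketch, not the exact file under review: the helper `worker_count`, the wrapper `main`, and the default of 4 TPU chips per host are assumptions made for illustration. The imports are done lazily inside `main` so the helper can be exercised on a machine without Ray or JAX.

```python
def worker_count(resources, chips_per_host=4):
    """Number of TPU hosts implied by a Ray resources dict.

    Hypothetical helper. Returns 0 when the "TPU" key is missing, which is
    what a script would see if it ran before the Raylets registered the TPUs.
    """
    return int(resources.get("TPU", 0)) // chips_per_host


def main():
    # Lazy imports: only needed when actually running on the Ray cluster.
    import ray
    import jax
    from jax.experimental import multihost_utils

    ray.init()

    # Assumption: each TPU host exposes 4 chips, so requesting {"TPU": 4}
    # places at most one task on each host.
    @ray.remote(resources={"TPU": 4})
    def tpu_cores():
        # Barrier across all JAX processes in the slice: every host must
        # reach this call before any of them continues.
        multihost_utils.sync_global_devices("sync")
        print("TPU Worker: " + str(jax.process_index()))
        return "TPU cores:" + str(jax.device_count())

    num_workers = worker_count(ray.available_resources())
    print(f"Number of TPU Workers: {num_workers}")
    print(ray.get([tpu_cores.remote() for _ in range(num_workers)]))
```

Calling `main()` from the job entrypoint runs the check. On the v6e-16 slice from the test above, 16 chips at 4 per host gives the 4 workers shown in the log.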
Why are these changes needed?
This PR adds recommended fields to the v6e-256 RayCluster and RayJob sample manifests. For the larger slice size, adding `privileged: true` resolves the following error:
`UNKNOWN: TPU initialization failed: open(/dev/vfio/vfio): No such file or directory: No such file or directory; Couldn't open vfio container /dev/vfio/vfio`
Adding `resources: '"{\"TPU\": 4}"'` to the rayStartParams resolves a race condition that sometimes occurs in RayServices and RayJobs, where Python script execution begins before the Raylets have detected the TPU devices, causing `ray.available_resources()["TPU"]` to return 0.
This PR was manually tested as follows:
Related issue number
Checks