
Update v6e-256 KubeRay Sample #2466

Merged
14 commits merged into ray-project:master on Nov 7, 2024

Conversation

ryanaoleary
Contributor

Why are these changes needed?

This PR adds recommended fields to the v6e-256 RayCluster and RayJob sample manifests. For the larger slice size, setting privileged: true resolves the error "UNKNOWN: TPU initialization failed: open(/dev/vfio/vfio): No such file or directory: No such file or directory; Couldn't open vfio container /dev/vfio/vfio". Adding resources: '"{\"TPU\": 4}"' to the rayStartParams resolves a race condition that sometimes occurs in RayServices and RayJobs, where Python script execution begins before the Raylets detect the TPU devices, causing ray.available_resources()["TPU"] to return 0.
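For context, a minimal Python sketch (illustrative only, not taken from the sample manifests or scripts) of how the race condition shows up on the driver side:

import ray

ray.init()  # connect to the existing RayCluster; RAY_ADDRESS is set in the submitter pod

# Symptom described above: immediately after cluster startup, the Raylets may not have
# registered their TPU devices yet, so this can transiently report 0.
print("TPU resources visible to Ray:", ray.available_resources().get("TPU", 0))

Declaring resources: '"{\"TPU\": 4}"' in rayStartParams makes each worker advertise its 4 TPU chips at ray start time instead of relying on autodetection, so the value above reflects the full slice once the workers have joined.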

This PR was manually tested as follows:

  1. Create the RayJob CR:
kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-job.tpu-v6e-256-multihost.yaml
  2. View the Job output:
kubectl logs -l=job-name=v6e-256-job

2024-10-23 02:37:52,871 INFO cli.py:39 -- Job submission server address: http://v6e-256-job-raycluster-xj4wj-head-svc.default.svc.cluster.local:8265
2024-10-23 02:37:53,716 SUCC cli.py:63 -- ----------------------------------------------
2024-10-23 02:37:53,716 SUCC cli.py:64 -- Job 'v6e-256-job-4mhms' submitted successfully
2024-10-23 02:37:53,716 SUCC cli.py:65 -- ----------------------------------------------
2024-10-23 02:37:53,716 INFO cli.py:289 -- Next steps
2024-10-23 02:37:53,716 INFO cli.py:290 -- Query the logs of the job:
2024-10-23 02:37:53,716 INFO cli.py:292 -- ray job logs v6e-256-job-4mhms
2024-10-23 02:37:53,716 INFO cli.py:294 -- Query the status of the job:
2024-10-23 02:37:53,716 INFO cli.py:296 -- ray job status v6e-256-job-4mhms
2024-10-23 02:37:53,716 INFO cli.py:298 -- Request the job to be stopped:
2024-10-23 02:37:53,716 INFO cli.py:300 -- ray job stop v6e-256-job-4mhms
2024-10-23 02:37:53,742 INFO cli.py:307 -- Tailing logs until the job exits (disable with --no-wait):
2024-10-23 02:37:53,447 INFO job_manager.py:528 -- Runtime env is setting up.
2024-10-23 02:38:14,855 INFO worker.py:1461 -- Using address 10.96.6.73:6379 set in the environment variable RAY_ADDRESS
2024-10-23 02:38:14,856 INFO worker.py:1601 -- Connecting to existing Ray cluster at address: 10.96.6.73:6379...
2024-10-23 02:38:14,870 INFO worker.py:1777 -- Connected to Ray cluster. View the dashboard at 10.96.6.73:8265 
['TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256']
2024-10-23 02:38:44,974 SUCC cli.py:63 -- ---------------------------------
2024-10-23 02:38:44,974 SUCC cli.py:64 -- Job 'v6e-256-job-4mhms' succeeded
2024-10-23 02:38:44,974 SUCC cli.py:65 -- ---------------------------------

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: Ryan O'Leary <[email protected]>
import ray
import jax
import time

from jax.experimental import multihost_utils
Collaborator

remove this import as it's no longer used

Contributor Author

I think we should leave that import in, along with multihost_utils.sync_global_devices("sync"). I wasn't able to schedule a v6e-256, but I tested just now with a multi-host v6e-16 slice, and adding that line ensures the JAX code runs once on each TPU host with Ray. I added it back in 372a081. Output of my manual test (a sketch of this pattern follows the output below):

-------------------------------------------------------
Job 'raysubmit_EKeMpf1wY3pYYTzf' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_EKeMpf1wY3pYYTzf
  Query the status of the job:
    ray job status raysubmit_EKeMpf1wY3pYYTzf
  Request the job to be stopped:
    ray job stop raysubmit_EKeMpf1wY3pYYTzf

Tailing logs until the job exits (disable with --no-wait):
2024-11-06 22:52:41,758 INFO job_manager.py:528 -- Runtime env is setting up.
2024-11-06 22:52:54,414 INFO worker.py:1461 -- Using address 10.48.3.43:6379 set in the environment variable RAY_ADDRESS
2024-11-06 22:52:54,414 INFO worker.py:1601 -- Connecting to existing Ray cluster at address: 10.48.3.43:6379...
2024-11-06 22:52:54,420 INFO worker.py:1777 -- Connected to Ray cluster. View the dashboard at 10.48.3.43:8265 
Number of TPU Workers: 4
(tpu_cores pid=503, ip=10.48.8.7) TPU Worker: 1
['TPU cores:16', 'TPU cores:16', 'TPU cores:16', 'TPU cores:16']
(tpu_cores pid=487, ip=10.48.1.7) TPU Worker: 3 [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)

------------------------------------------
Job 'raysubmit_EKeMpf1wY3pYYTzf' succeeded
------------------------------------------
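For readers following the thread, here is a hedged sketch of the pattern discussed in the comment above; the function name, worker count, and per-host chip count are illustrative assumptions, not a copy of the sample script:

import ray
import jax
from jax.experimental import multihost_utils

ray.init()

NUM_WORKERS = 4  # assumption: one Ray task per TPU VM host in the slice (4 hosts for v6e-16)


@ray.remote(resources={"TPU": 4})  # pin each task to a host exposing 4 TPU chips
def tpu_cores(worker_id: int) -> str:
    print(f"TPU Worker: {worker_id}")
    # Barrier across hosts so every JAX runtime is initialized before reporting;
    # this is the multihost_utils.sync_global_devices("sync") call discussed above.
    multihost_utils.sync_global_devices("sync")
    return f"TPU cores:{jax.device_count()}"  # global device count across the slice


print(f"Number of TPU Workers: {NUM_WORKERS}")
print(ray.get([tpu_cores.remote(i) for i in range(NUM_WORKERS)]))

Scheduled this way, each task lands on a different TPU host, since each task requests 4 TPU and each host only advertises 4, which matches the v6e-16 output above: 4 workers, each reporting 16 devices.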

@andrewsykim merged commit b9f0209 into ray-project:master on Nov 7, 2024
29 checks passed