Describe the bug

Asked to create an issue from this thread. The workaround is simple, but I'm documenting the issue in any case.

I'm setting up the Flyte integration with Ray, and it works nicely when creating a fresh RayCluster using:

@task(task_config=RayJobConfig(worker_node_config=[WorkerNodeConfig(…)]))
I can see the cluster starting, the job getting scheduled and distributed, and completing successfully.
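For context, here is roughly what that working setup looks like spelled out, modeled on the plugin's documented usage; the group name, replica count, and task body are illustrative assumptions rather than my actual config:

import typing

import ray
from flytekit import task
from flytekitplugins.ray import RayJobConfig, WorkerNodeConfig

@ray.remote
def square(x: int) -> int:
    return x * x

@task(
    task_config=RayJobConfig(
        # A fresh RayCluster is provisioned for this task run.
        worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=2)],
    )
)
def ray_job(n: int) -> typing.List[int]:
    futures = [square.remote(i) for i in range(n)]
    return ray.get(futures)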
I'm having trouble using an existing RayCluster (in the same Kubernetes cluster), though. From the docs here I read that I should be able to use:

@task(task_config=RayJobConfig(address="<RAY_CLUSTER_ADDRESS>"))

However, when trying that, worker_node_config turns out to be a required argument. I tried passing an empty list instead:

@task(
    container_image=...,
    task_config=RayJobConfig(
        # No need to create a Ray cluster, but the argument is required;
        # maybe just setting it to an empty list helps?
        worker_node_config=[],
        # Tried different ports here as well, like 10001
        address="http://kuberay-cluster-head-svc.kuberay.svc.cluster.local:8265/",
        runtime_env=...,
    ),
)
But then it still tries to start a new RayCluster instead of using the existing one found at address:
❯ k get rayclusters.ray.io -A
NAMESPACE                        NAME                                         DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
<flyte-project>-<flyte-domain>   ahvfr924w8k2vgvf97wp-n0-0-raycluster-crb9z                                         100m   500Mi    0      ready    2m25s
kuberay                          kuberay-cluster                              1                 1                   2      3G       0      ready    3h37m
...
The address works fine if I just run:

k run kuberay-test --rm --tty -i --restart='Never' --image ... --command -- \
  ray job submit --address http://kuberay-cluster-head-svc.kuberay.svc.cluster.local:8265/ -- \
  python -c "import ray; ray.init(); print(ray.cluster_resources())"
It looks like the worker_node_config argument has been required since the initial commit, and we can't find any code that submits a job to an existing cluster without creating a new one, so I'm not sure how the docs example could ever have worked.

This seems to work as a simple workaround:
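Roughly: a plain Flyte task with no RayJobConfig at all, connecting to the existing head service itself via the Ray client. This is a sketch; the task name and the 10001 client port are assumptions on my part:

import typing

import ray
from flytekit import task

@task
def submit_to_existing_cluster() -> typing.Dict[str, float]:
    # Skip the Ray plugin entirely and connect to the existing KubeRay
    # head service via the Ray client (served on port 10001 by default).
    ray.init(address="ray://kuberay-cluster-head-svc.kuberay.svc.cluster.local:10001")
    return ray.cluster_resources()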
Expected behavior

I'm not sure whether submitting to an existing cluster is something Flyte wants to support (the docs suggest it is, but there is no code that does it). Either the option to submit to an existing cluster should be removed from the docs, or the example there should work as written: instead of starting a new RayCluster, the job should be submitted to the cluster already running at address when worker_node_config is omitted, i.e. when doing:

@task(task_config=RayJobConfig(address="<RAY_CLUSTER_ADDRESS>"))
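Spelled out as a full task definition, this is the shape I'd expect to work (it currently doesn't, since worker_node_config is required); the task body and return type are illustrative:

import typing

import ray
from flytekit import task
from flytekitplugins.ray import RayJobConfig

@task(
    task_config=RayJobConfig(
        # Expectation: with worker_node_config omitted, the job is
        # submitted to the existing cluster at this address instead
        # of a new RayCluster being created.
        address="http://kuberay-cluster-head-svc.kuberay.svc.cluster.local:8265/",
    )
)
def submit_by_address() -> typing.Dict[str, float]:
    ray.init()
    return ray.cluster_resources()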
I think the easiest thing to do is to rework the documentation to match the current behavior; updating the module to actually submit work to an existing cluster looks non-trivial.
Additional context to reproduce
No response
Screenshots
No response