Describe the bug

Asked to create an issue from this thread. The workaround is simple, but I'm documenting the issue in any case.

I'm setting up the Flyte integration with Ray, and it works nicely when creating a fresh RayCluster using:

@task(task_config=RayJobConfig(worker_node_config=[WorkerNodeConfig(…)]))
I can see the cluster starting, the job getting scheduled and distributed, and completing successfully.
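For context, here is roughly what that working setup looks like spelled out, modeled on the plugin's documented usage; the group name, replica count, and task body are illustrative assumptions rather than my actual config:

import typing

import ray
from flytekit import task
from flytekitplugins.ray import RayJobConfig, WorkerNodeConfig

@ray.remote
def square(x: int) -> int:
    return x * x

@task(
    task_config=RayJobConfig(
        # A fresh RayCluster is provisioned for this task run.
        worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=2)],
    )
)
def ray_job(n: int) -> typing.List[int]:
    futures = [square.remote(i) for i in range(n)]
    return ray.get(futures)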
I'm having trouble using an existing RayCluster (in the same Kubernetes cluster), though. From the docs here I read that I should be able to use:

@task(task_config=RayJobConfig(address="<RAY_CLUSTER_ADDRESS>"))

However, when trying that, worker_node_config turns out to be a required argument. I tried passing an empty list instead:

@task(
    container_image=...,
    task_config=RayJobConfig(
        # No need to create a Ray cluster, but the argument is required;
        # maybe just setting it to an empty list helps?
        worker_node_config=[],
        # Tried different ports here as well, like 10001
        address="http://kuberay-cluster-head-svc.kuberay.svc.cluster.local:8265/",
        runtime_env=...,
    ),
)
But then it still tries to start a new RayCluster instead of using the existing one found at address:
❯ k get rayclusters.ray.io -A
NAMESPACE                        NAME                                         DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
<flyte-project>-<flyte-domain>   ahvfr924w8k2vgvf97wp-n0-0-raycluster-crb9z                                         100m   500Mi    0      ready    2m25s
kuberay                          kuberay-cluster                              1                 1                   2      3G       0      ready    3h37m
...
The address works fine if I just run:

k run kuberay-test --rm --tty -i --restart='Never' --image ... --command -- \
  ray job submit --address http://kuberay-cluster-head-svc.kuberay.svc.cluster.local:8265/ -- \
  python -c "import ray; ray.init(); print(ray.cluster_resources())"
It looks like the worker_node_config argument has been required since the initial commit, and we can't find any code that submits a job to an existing cluster without creating a new one, so I'm not sure how the docs example could ever have worked.

This seems to work as a simple workaround:
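Roughly: a plain Flyte task with no RayJobConfig at all, connecting to the existing head service itself via the Ray client. This is a sketch; the task name and the 10001 client port are assumptions on my part:

import typing

import ray
from flytekit import task

@task
def submit_to_existing_cluster() -> typing.Dict[str, float]:
    # Skip the Ray plugin entirely and connect to the existing KubeRay
    # head service via the Ray client (served on port 10001 by default).
    ray.init(address="ray://kuberay-cluster-head-svc.kuberay.svc.cluster.local:10001")
    return ray.cluster_resources()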
Expected behavior

I'm not sure whether submitting to an existing cluster is something Flyte wants to support (the docs suggest it is, but there is no code that does it). Either the option to submit to an existing cluster should be removed from the docs, or the example there should work as written: instead of starting a new RayCluster, the job should be submitted to the cluster already running at address when worker_node_config is omitted, i.e. when doing:

@task(task_config=RayJobConfig(address="<RAY_CLUSTER_ADDRESS>"))
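Spelled out as a full task definition, this is the shape I'd expect to work (it currently doesn't, since worker_node_config is required); the task body and return type are illustrative:

import typing

import ray
from flytekit import task
from flytekitplugins.ray import RayJobConfig

@task(
    task_config=RayJobConfig(
        # Expectation: with worker_node_config omitted, the job is
        # submitted to the existing cluster at this address instead
        # of a new RayCluster being created.
        address="http://kuberay-cluster-head-svc.kuberay.svc.cluster.local:8265/",
    )
)
def submit_by_address() -> typing.Dict[str, float]:
    ray.init()
    return ray.cluster_resources()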
I think the easiest thing to do is to rework the documentation to match the current behavior; updating the module to actually submit work to an existing cluster looks non-trivial.
Additional context to reproduce
No response
Screenshots
No response