
Mounting a writable existing persistent disk? #251

Open
slagelwa opened this issue Nov 15, 2022 · 2 comments

@slagelwa

I've had some good success with mounting existing read-only persistent disks to the VM running a dsub job, and it's very cool that one can do this. However, I was wondering about attaching writable disks. According to the Life Sciences API documentation:

If all Mount references to this disk have the readOnly flag set to true, the disk will be attached in read-only mode and can be shared with other instances. Otherwise, the disk will be available for writing but cannot be shared.

I'm not exactly sure what they mean by Mount references. Do they mean that the disk is attached to zero or more VMs in read-only mode? That would seem to be what the description implies. (I'm not sure how GCP would know, from outside the VM, how the disk is actually mounted.) I've done some testing with a persistent disk that wasn't attached to any VM, and with one that was already attached to a VM in read-only mode; in either case, when I launch a dsub job the persistent disk is always attached in read-only mode.
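(For reference, the attachment mode that GCP records can be checked from outside the VM with something like the following; the instance and zone names are just placeholders.)

```bash
# Placeholder instance/zone names. Each entry under "disks" reports a
# "mode" of READ_ONLY or READ_WRITE, which is how Compute Engine
# records the attachment from outside the VM.
gcloud compute instances describe my-dsub-vm \
  --zone us-central1-a \
  --format="yaml(disks)"
```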

@wnojopra (Contributor)

By "Mount reference" I'm assuming they mean this object in the request: https://cloud.google.com/life-sciences/docs/reference/rest/v2beta/projects.locations.pipelines/run#mount. There is a readOnly flag on that object.

For persistent and existing disks, dsub always mounts these read-only (e.g. https://github.com/DataBiosphere/dsub/blob/main/dsub/providers/google_v2_base.py#L721). We do briefly mention that resource data can be mounted read-only: https://github.com/DataBiosphere/dsub#mounting-resource-data
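For reference, mounting an existing persistent disk with the google-cls-v2 provider looks roughly like the sketch below (the project, zone, disk, and bucket names are placeholders). The provider translates the --mount flag into a Mount reference with readOnly set to true, and the job sees the disk at the path in the corresponding environment variable:

```bash
# Sketch only; placeholder names throughout. The existing disk is
# attached read-only and its mount point is exposed to the command
# via the ${RESOURCES} environment variable.
dsub \
  --provider google-cls-v2 \
  --project my-project \
  --zones "us-central1-*" \
  --logging gs://my-bucket/logging/ \
  --mount RESOURCES="https://www.googleapis.com/compute/v1/projects/my-project/zones/us-central1-a/disks/my-resource-disk" \
  --command 'ls -l ${RESOURCES}' \
  --wait
```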

The idea is that these disks contain resource data to be used as inputs for your workflows, so it made sense for dsub to mount them read-only. Would you mind describing your use case for mounting these disks as writable?

@slagelwa (Author) commented Nov 17, 2022

Yep, that's what I was referring to with regard to Mount references. Later, that same document goes on to talk about existing disks and whether the readOnly flag is true or not. After rereading the description in the context of the previous section, maybe it was just restating, in a different way, that you can only attach a writable Compute Engine disk to a single VM (usually, that is). I was just inferring something different from the documentation.

I would heartily agree that it almost always makes sense for dsub to mount resource/reference data read-only. I was chaining together an analysis that could benefit from a shared persistent disk for performance: one step collectively operated on a set of large files, and then a second step used the combined result from the first step plus each individual file. The second step, of course, is very easily run in parallel using dsub. Since I have to copy all of the files out of GCS for the first step, using a persistent disk here would save copying the files out of storage twice. Not a particularly lengthy operation, I know (though it is a sizable amount of data), but it's nice not to have to do it multiple times. There are also more downstream steps that use the whole and individual results from the first two steps. And it's useful to have all the data on the one persistent disk when something inevitably goes wrong during development/processing.

But the real reason was a desire to make setting up the resource/reference disk part of the whole process, to make it easier to maintain: e.g., create the disk and populate it with a dsub job, run the combined step in the next dsub job, then do the multiple runs with a dsub tasks job (roughly as sketched below). And sure, I could resort to a workflow orchestrator (Nextflow, Snakemake, etc.) for this, but it is a pretty straightforward couple of steps.
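Roughly, I had in mind something like the sketch below (every name here, including the merge command and the tasks file, is a placeholder, and the first two steps are exactly the part that would need the disk mounted writable, which dsub doesn't currently do):

```bash
# Hypothetical sketch of the chaining described above; all names are
# placeholders, and steps 1 and 2 assume a writable mount, which dsub
# does not currently support (existing disks are attached read-only).
COMMON=(--provider google-cls-v2 --project my-project --zones "us-central1-*"
        --logging gs://my-bucket/logging/
        --mount DATA="https://www.googleapis.com/compute/v1/projects/my-project/zones/us-central1-a/disks/my-shared-disk")

# 1) Populate the shared disk from GCS (needs the disk writable).
JOB1=$(dsub "${COMMON[@]}" \
  --input-recursive RAW=gs://my-bucket/raw/ \
  --command 'mkdir -p ${DATA}/raw && cp -r ${RAW}/* ${DATA}/raw/')

# 2) Combined step over the whole set, writing the merged result back
#    to the disk. "merge_files" is a stand-in for the real tool.
JOB2=$(dsub "${COMMON[@]}" --after "${JOB1}" \
  --command 'mkdir -p ${DATA}/combined && merge_files ${DATA}/raw/* > ${DATA}/combined/result.dat')

# 3) Per-file steps fan out as a dsub tasks job (one row per file),
#    reading the raw files and the combined result from the disk.
dsub "${COMMON[@]}" --after "${JOB2}" --tasks per-file-tasks.tsv --wait
```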
