how: use DVC when data is stored in an external drive #563

Closed
dashohoxha opened this issue Aug 15, 2019 · 16 comments
Labels: A: docs (Area: user documentation, gatsby-theme-iterative); C: guide (Content of /doc/user-guide); status: stale (You've been groomed!)

Comments

@dashohoxha
Contributor

dashohoxha commented Aug 15, 2019

E: Check whether #520 was done first... See also #899


This doc should explain the best solution (or a couple of possible solutions) for this situation.

Example: the data is located on a 16TB partition of an external drive, while the DVC project lives under /home on a 320GB partition.

Context: https://discordapp.com/channels/485586884165107732/485596304961962003/611244643685892153

@shcheklein shcheklein changed the title use-case: How to use DVC when data is stored in an external drive how to use DVC when data is stored in an external drive Aug 15, 2019
@shcheklein shcheklein added type: enhancement Something is not clear, small updates, improvement suggestions p1-important Active priorities to deal within next sprints labels Aug 15, 2019

@shcheklein
Member

These are related/solve similar problems:

#455 (fixes #103 )
https://dvc.org/doc/use-cases/multiple-data-scientists-on-a-single-machine

Keep in mind:

#497


@dashohoxha
Contributor Author

The solution described by @efiop (tracking a data file that lives outside the DVC project) seems to be a different one. Having a remote DVC cache (the same as multiple-users-on-a-single-machine) is another solution. The NFS case seems to have a similar solution to multiple-users-on-a-single-machine.

@shcheklein
Member

@dashohoxha gotcha. This is a different one indeed. These sections - https://dvc.org/doc/user-guide/external-outputs and https://dvc.org/doc/user-guide/external-dependencies - should be reorganized/taken into account.

Also, keep in mind: my take on this is that there should be a very strong reason to complicate your workflow with external deps/outs/cache in the case of multiple drives. As I mentioned on Discord, I think in most cases the ideal scenario is to use an external cache and symlinks (similar to the NFS and shared cache scenarios).
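
As a rough illustration of the external cache plus symlinks setup mentioned above, the sketch below lists the DVC commands involved and can run them from Python. It assumes `dvc` is installed and that the large drive is mounted at a hypothetical path, `/mnt/bigdrive`; the helper names are made up for this example and are not part of DVC.

```python
import shutil
import subprocess


def external_cache_commands(cache_dir):
    """Return the DVC CLI calls that point a project at a cache on another drive."""
    return [
        # Keep cached file contents on the large drive instead of .dvc/cache.
        ["dvc", "cache", "dir", cache_dir],
        # Link workspace files to the cache rather than copying them.
        ["dvc", "config", "cache.type", "symlink"],
        # Re-create already tracked workspace files using the new link type.
        ["dvc", "checkout", "--relink"],
    ]


def apply(commands, project_dir="."):
    """Run the commands inside a DVC project; requires `dvc` on PATH."""
    if shutil.which("dvc") is None:
        raise RuntimeError("dvc executable not found on PATH")
    for cmd in commands:
        subprocess.run(cmd, cwd=project_dir, check=True)


if __name__ == "__main__":
    # /mnt/bigdrive/dvc-cache is a hypothetical location on the large drive.
    for cmd in external_cache_commands("/mnt/bigdrive/dvc-cache"):
        print(" ".join(cmd))
```

Running the script only prints the commands; calling `apply(...)` from inside a DVC project would execute them. With this kind of setup the project on the small partition holds only links, while the file contents stay on the large drive.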

@dashohoxha
Contributor Author

These sections - https://dvc.org/doc/user-guide/external-outputs and https://dvc.org/doc/user-guide/external-dependencies - should be reorganized/taken into account.

They seem accurate to me (unless there is some missing information that I don't know).
The problem is that it is difficult for users to read all the details and intricacies in user guides and manual pages and then find the best solution for their case. Showing them what the best solution would be in a particular case (or a similar one) should be helpful.

@shcheklein
Member

@dashohoxha your PR looks good. There are some improvements that can be made, which I'll review and let you know about. But first I would like to better understand the "use case" itself: what the possible solutions for it are, how we should improve those sections in the User Guide, and how all this corresponds with the shared machine case (when there is a single cache set up on a separate partition). Without this holistic plan, we are potentially duplicating information, we are not properly communicating the use case, and we are not properly structuring User Guide.

To give just some concerns:

  1. The title, "Huge data on external local drive". It's a very confusing title for the use case, from "external local drive" (is it external or local, after all?) to the way it's formulated (huge data is probably not the problem; versioning or managing it is). "Huge" is a very vague term as well. Some people use a single huge drive for everything.

Some better titles off the top of my head: Managing data storage on a separate drive, Versioning data and processing data outside your repo, etc.

  2. No matter how good a name we come up with, there should be some integration with other parts of the docs (user guide, versioning examples). For example, in most cases we assume that data is part of your workspace. Why don't we clarify somehow that if your data is substantially large, there are ways to manage it "externally"?

  3. Back to the use case. It's basically about trying to version files that are located on the second large drive (it can be second large HDD, it can be some shared NAS, etc - the point is it's a second large volume with tons of data and tons of space on it). Using external outs/deps is not the only way to deal with this. It's also not ideal. Should we include different ways of doing this in this use case, like "local external cache" + links? They overlap substantially to my mind.

  4. The User Guide part of it. While a use case (especially its title) should be written in a way that immediately matches the user's request (rule of thumb: what words would I use to describe this situation if I needed to ask a question on chat?), the User Guide is more like a well-structured manual. For example, "Managing External Data" is a good section that should actually combine external deps, external outs, and some intro and overview of the use cases, with links and instructions on how specific cases could be solved.

So please, let's discuss and work out the strategy behind this.

@jorgeorpinel would love to hear your opinion on this.

@jorgeorpinel
Contributor

Without this holistic plan, we are potentially duplicating information, we are not properly communicating the use case, and we are not properly structuring User Guide.

Yes!

It's funny because I've been noticing significant confusion around external X topics so I opened #566 recently. I also feel like we may need to regroup and figure out the connections between all the external data stuff before deciding which docs to change.

That said it's good to have more use cases and I'll review the PR but if we don't figure out the big picture, this doc may only add to the confusion of some users, like Dashamir mentioned in #563 (comment).

@shcheklein shcheklein added the A: docs Area: user documentation (gatsby-theme-iterative) label Nov 13, 2019
@jorgeorpinel jorgeorpinel changed the title how to use DVC when data is stored in an external drive how to: use DVC when data is stored in an external drive Jan 20, 2020
@jorgeorpinel
Contributor

I think that at this point it's unclear whether a how-to is needed, and most of the content will probably be covered by #520. Can we close this, @shcheklein? Thanks

@jorgeorpinel jorgeorpinel removed the p1-important Active priorities to deal within next sprints label May 17, 2021
@shcheklein
Member

Might be. It's not clear to me how #520 will evolve. This one is quite precise, and I would close it only when we clearly see that it is addressed (by #520 or whatever else). And this one is important indeed. It might be more important than clarifying --external, for example.

@jorgeorpinel jorgeorpinel changed the title how to: use DVC when data is stored in an external drive how: use DVC when data is stored in an external drive Feb 22, 2022
@jorgeorpinel jorgeorpinel added the C: guide Content of /doc/user-guide label Feb 22, 2022
@jorgeorpinel jorgeorpinel added p1-important Active priorities to deal within next sprints and removed type: enhancement Something is not clear, small updates, improvement suggestions labels Jul 28, 2022
@jorgeorpinel
Contributor

jorgeorpinel commented Jul 28, 2022

My take on this in general is that you have 4 routes when working with data from external drives:

  1. Download it (get, import(-url)) -- not useful when the local drive is smaller than the data
  2. Manage it in-place with an external cache
    Potentially in a shared cache
    See also guide: consolidate external data mgmt guides #520
  3. Transfer it directly to remote storage to use later in an env with a larger drive or appropriate cache setup.
  4. Ad hoc methods like virtually mounting external folders inside the DVC project dir, or other "tricks".

Other than route 4, which we probably don't need to document, we have info about all of this in the docs. We may just need to consolidate it somewhere in the future Data Management guides. I added a bullet there, and with that and #520 I think we should close this as redundant.
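
One quick way to rule route 1 in or out is to compare the size of the data with the space available on the project's drive. The sketch below does that in Python. The function names and the `headroom` factor are hypothetical, and the `--to-remote` commands mentioned in the comments assume a reasonably recent DVC version.

```python
import os
import shutil


def tree_size(path):
    """Sum the sizes in bytes of regular files under path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def fits_locally(data_bytes, free_bytes, headroom=1.0):
    """True if the data, scaled by a hypothetical headroom factor, fits in free space."""
    return data_bytes * headroom <= free_bytes


def suggest_routes(data_dir, project_dir="."):
    """Return the route numbers from the list above that remain practical."""
    free = shutil.disk_usage(project_dir).free
    if fits_locally(tree_size(data_dir), free):
        return [1, 2, 3]
    # Too big to download: manage it in place (2) or send it straight to a
    # remote (3), e.g. `dvc add <path> --to-remote` or
    # `dvc import-url <url> --to-remote`.
    return [2, 3]


# The example from this issue: 16TB of data and a 320GB partition under /home.
# The partition size is used as an upper bound on the free space.
print(fits_locally(16 * 10**12, 320 * 10**9))  # prints False
```

The `headroom` factor is there because a copy-based cache keeps the data both in the cache and in the workspace, so a value above 1.0 may be more realistic depending on the link type in use.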

@jorgeorpinel jorgeorpinel added status: stale You've been groomed! and removed p1-important Active priorities to deal within next sprints labels Jul 28, 2022
@jorgeorpinel
Contributor

jorgeorpinel commented Jul 29, 2022

Back to the use case. It's basically about trying to version files that are located on the second large drive (it can be second large HDD, it can be some shared NAS, etc

Should we repurpose this ticket to focus specifically on managing external data on NAS? @shcheklein

More details in https://discuss.dvc.org/t/setup-dvc-to-work-with-shared-data-on-nas-server/180 (top forum question)

@dberenbaum
Collaborator

We made updates to the guide about external data as part of the 3.0 release, so closing since I don't see additional actions we can take right now. Feel free to reopen if I missed something.

4 participants