Skip to content
This repository has been archived by the owner on Oct 16, 2024. It is now read-only.

MLEM models as DVC outputs #205

Closed
igordertigor opened this issue Oct 31, 2022 · 10 comments
Closed

MLEM models as DVC outputs #205

igordertigor opened this issue Oct 31, 2022 · 10 comments
Assignees
Labels
🐛 type: bug Something isn't working.

Comments

@igordertigor
Copy link
Contributor

I'm using a DVC pipeline that specifies MLEM models as outputs. Mlem is configured to store models using dvc as described here. Unfortunately, DVC complains:

ERROR: Path '/home/ingo/tmp/dvc-mlem-issue/models/example.mlem' is ignored by
.dvcignore:1:/**/?*.mlem

Which is correct: I don't wont to track .mlem files in dvc, but I do want to track the binaries.

Here is a testscript that attempts to isolate the issue as much as possible:

git init
dvc init
mlem init

git add .dvc/ .mlem.yaml
git commit -m 'Basic setup'

mlem config set core.storage.type dvc
echo '/**/?*.mlem' > .dvcignore

git add .dvcignore .mlem.yaml
git commit -m 'MLEM works with DVC'

echo 'import mlem' > example.py
echo '' >> example.py
echo 'def example(x: float) -> float:' >> example.py
echo '    return x*x' >> example.py
echo '' >> example.py
echo 'mlem.api.save(example, "models/example")' >> example.py

git add example.py
git commit -m 'Add some code'

# # Works ok
# python example.py
# 
# dvc add models/example
# git add models/example.mlem models/example.dvc models/.gitignore
# git commit -m 'Add metadata for model'

dvc run \
  --name="Create example model" \
  --deps=example.py \
  --outs=models/example \
  --outs-no-cache=models/example.mlem \
  python example.py

I'm posting this here, as it is hopefully an issue with the documentation and not with mlem or dvc.

@aguschin aguschin added the 🐛 type: bug Something isn't working. label Oct 31, 2022
@aguschin
Copy link
Contributor

Hi @igordertigor! Thanks for reporting this!
Yep, you need to remove those lines from .dvcignore. Then this will work.

In the User Guide we suggest to remove .dvcignore at all (since we just created it there), but in more general case you should just delete these lines.

Could you please check and confirm it works?

@aguschin aguschin self-assigned this Oct 31, 2022
@igordertigor
Copy link
Contributor Author

Seems to work. Thank you.

@jorgeorpinel
Copy link
Contributor

@aguschin do docs need an update or is this something to transfer to the core repo and/or close? Thanks

@aguschin
Copy link
Contributor

aguschin commented Nov 1, 2022

@igordertigor, I tried to make it more clear. Could you please check if it's better now? https://mlem-ai-improve-user-gu-az7lgx.herokuapp.com/doc/user-guide/dvc#example

@igordertigor
Copy link
Contributor Author

I feel that this is still easy to misread, because the critical information is now contained in an example at the end. Better to have it a bit further up where you start the example and actually ignore the files in the first place. I was going to make a pull request, but I feel that there are two ways to communicate it here:

  1. Add a note that mentions that this should be temporary if you plan to write mlem artifacts from dvc pipelines. So when you say

Also, let’s add .mlem files to .dvcignore so that metafiles are ignored by DVC.

 $ echo "/**/?*.mlem" > .dvcignore
 $ git add .dvcignore

I would add a sentence afterwards along the lines: "Note: We do this here in order to simplify the migration from git-tracked files to dvc-tracked files. If you are planning to write mlem-artifacts from dvc pipelines, you would afterwards have to remove the ignore pattern."
2. Alternatively, you could in this case, explicitly add the models/rf.* files (and remove them from the git-cache). In that case, you could drop the full .dvcignore instruction to avoid confusion, but then you would use
$ git rm -r --cached models/rf
(Not sure about the data folder there) and you would further down do
$ git add models/rf.* models/.gitignore

Do you have a preference? I'm happy to adjust the docs accordingly.

@aguschin
Copy link
Contributor

aguschin commented Nov 1, 2022

Thanks for suggesting your help @igordertigor! IMO echo "/**/?*.mlem" > .dvcignore is important: this is a way to not track .mlem files with DVC by accident, so I think it's better to keep that and the 1st way is better.

Let's also split this page in 3 chunks:

  1. Set up (common for both cases)
  2. Versioning binaries manually
  3. Using DVC Pipeline

The 1st part will end just before

 $ echo "/**/?*.mlem" > .dvcignore
 $ git add .dvcignore

In the end of 1st part we'll say: now there are 2 options: Version binaries manually or using DVC PL, and give links to both. WDYT?

@jorgeorpinel
Copy link
Contributor

TBH I'm also confused as to why the general guide tells you to ignore .mlem files but then the example kind of does the opposite. So what is it an example of? These seem like 2 different situations.

@igordertigor
Copy link
Contributor Author

@aguschin , I think it's a good idea to separate points 2 and 3 of what you write. I created a pull request #211 to address these changes.

@aguschin
Copy link
Contributor

aguschin commented Nov 3, 2022

@igordertigor thanks for the contribution! I went ahead and updated #211, and merge it. Unfortunately, it doesn't get deployed to preview env. But now it should appear shorty at https://mlem.ai/doc/user-guide/dvc

If you like, you can check the page and bring any more feedback if you have some :)

Otherwise feel free to close this issue)

@igordertigor
Copy link
Contributor Author

Thanks. I'll close this one now. If I come across something else, I'll let you know.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
🐛 type: bug Something isn't working.
Projects
No open projects
Status: Done
Development

Successfully merging a pull request may close this issue.

3 participants