
Standardises GTDB execution and allows pre-uncompressed GTDB input #477

Merged · 11 commits into dev from improve-database-handling · Aug 10, 2023

Conversation

@jfy133 (Member) commented Jul 13, 2023

To close #424

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs?
  • If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@github-actions bot commented Jul 13, 2023

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit e6a2c71

  • ✅ 156 tests passed
  • ❔ 1 test was ignored
  • ❗ 1 test had warnings

❗ Test warnings:

  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline

❔ Tests ignored:

  • files_unchanged - File ignored due to lint config: lib/NfcoreTemplate.groovy

✅ Tests passed:

Run details

  • nf-core/tools version 2.9
  • Run at 2023-08-07 16:10:53

@jfy133 jfy133 marked this pull request as draft July 13, 2023 08:24
@jfy133 (Member Author) commented Jul 13, 2023

Need to test; currently GTDB is not being executed at all.

@jfy133 (Member Author) commented Jul 29, 2023

First pass tests:

  • Standard test profile (i.e., skipping GTDB-Tk ensures no download of the database)

    nextflow run ../main.nf -profile singularity,test --outdir ./results

  • Standard test profile (still skipping) but with a GTDB-Tk database path: still results in no download/execution

    nextflow run ../main.nf -profile singularity,test --outdir ./results --gtdb_db ~/cache/databases/gtdbtk_r202_data.tar.gz

  • Run GTDB-Tk with a pre-supplied tar archive (should DB-prep it, then run gtdbtk)

    nextflow run ../main.nf -profile singularity,test --outdir ./results --gtdb_db /home/james/cache/databases/gtdbtk_r202_data.tar.gz --skip_gtdbtk false

    Working: but the test dataset gets no completeness (all bins 'discarded' during ch_bins_metric).
    Trying new data (a subset of Maixner 2021).
    Working: but broken database

  • Run GTDB-Tk with an already-decompressed archive, with the input given as a directory (no DB prep, but still runs gtdbtk)

    time nextflow run ../main.nf -profile singularity,test --input "*_{R1,R2}.fastq.gz" --outdir ./results --gtdb_db /home/james/cache/databases/database --skip_gtdbtk false -dump-channels -resume --input samplesheet.2612.csv

    Working: but broken database

  • Run GTDB-Tk but with no supplied database (i.e., should auto-download)

    time nextflow run ../main.nf -profile singularity,test --input "*_{R1,R2}.fastq.gz" --outdir ./results --skip_gtdbtk false -dump-channels -resume --input samplesheet.2612.csv

    Working: but broken database

  • Run the command but skip bin QC (should not auto-download)

Remember: remove the prints and dumps!
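The tar.gz-versus-directory branching that the tests above exercise can be sketched in plain shell. This is a sketch only: the real logic lives in the pipeline's Nextflow channel handling, and the file names here are illustrative assumptions.

```shell
# Hedged sketch of the branching the tests above exercise: if --gtdb_db points
# at a .tar.gz, the pipeline must run the DB-preparation (untar) step first;
# if it points at an already-unpacked directory, that step is skipped.
set -eu

prepare_gtdb_db() {
    db_input="$1"
    case "$db_input" in
        *.tar.gz)
            # Compressed archive: needs the DB_PREPARATION (untar) step
            echo "untar-then-run"
            ;;
        *)
            if [ -d "$db_input" ]; then
                # Pre-uncompressed directory: use directly, no untar
                echo "run-directly"
            else
                echo "error: unrecognised database input" >&2
                return 1
            fi
            ;;
    esac
}

mkdir -p /tmp/fake_gtdb_dir
prepare_gtdb_db "gtdbtk_r214_data.tar.gz"   # prints: untar-then-run
prepare_gtdb_db "/tmp/fake_gtdb_dir"        # prints: run-directly
```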

@jfy133 (Member Author) commented Jul 29, 2023

TODO: same for BUSCO

@jfy133 (Member Author) commented Jul 30, 2023

Changed my mind: BUSCO is a little more tricky, so I will do that in a follow-up PR (if I do more than just accepting directory input).

@jfy133 jfy133 marked this pull request as ready for review July 30, 2023 06:07
nextflow.config: review thread (outdated, resolved)

@prototaxites (Contributor) left a comment
Changes look fine to me, but I don't have time (or a downloaded copy of the GTDB database) to test myself!

Working: but broken database

Can I check what this means?

@jfy133 (Member Author) commented Jul 31, 2023

It was saying something like it couldn't find the 'TIGRFAM database', regardless of whether I used the auto-downloaded or the manually downloaded GTDB release 202 🤷 unless the error message was a bit funny and it meant it couldn't find features within the data, maybe.

But regardless, the module definitely executed and got halfway through :)

@prototaxites (Contributor) commented

Poking around: the r202 database release is listed with a maximum GTDB-Tk version compatibility of 1.7.0, while the version in mag dev at the moment is 2.1.1.

Worth retrying with the R214 or R207v2 release, which are listed as compatible? A little bit of googling suggests TIGRFAM is some kind of HMM database; maybe this has since been added to the GTDB database. Might be we have to bump the default database version as well.

https://ecogenomics.github.io/GTDBTk/installing/index.html

@jfy133 (Member Author) commented Jul 31, 2023

Huh, interesting... I will check that later. I guess somehow the module got updated at some point but not the URL?

@jfy133 (Member Author) commented Jul 31, 2023

Thanks for the investigation :D (will likely finish next week though, as I am teaching all this week)

@CarsonJM (Contributor) commented

@jfy133 @prototaxites I think it would be great to get the r214 database set as default, since it is a pretty significant increase over r207. I'm happy to work on updating that this week if that would be helpful!

@jfy133 (Member Author) commented Jul 31, 2023

That would be great @CarsonJM !

I honestly think it will just take updating the URL in nextflow.config and maybe some docs. You're welcome to try it on my branch if you want and push the changes!

If you could also test on your own (small) data the three database cases, auto-download, supplied as a tar.gz, and an unpacked tar (i.e. a directory), that would also be really helpful.

The database is too large for the GitHub CI nodes, so I fear it's not tested sufficiently :( thus the more manual tests the better.

We should also maybe consider a release checklist, to also run a range of e.g. full AWS runs/local HPC runs with all the large databases activated...
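For reference, the change discussed here would be roughly the following edit in nextflow.config. This is a sketch only: the parameter name is taken from the `--gtdb_db` commands earlier in this thread, and the URL is illustrative and must be checked against the official GTDB downloads page before use.

```groovy
// Sketch only: bump the default GTDB database to release 214.
// The exact URL below is NOT verified; take it from the official
// GTDB downloads page.
params {
    gtdb_db = "https://data.gtdb.ecogenomic.org/releases/release214/.../gtdbtk_r214_data.tar.gz"
}
```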

@CarsonJM (Contributor) commented

Thanks for the guidance @jfy133, I will work on that this week and keep you all posted!

@CarsonJM (Contributor) commented Aug 7, 2023

Sorry for falling behind on this. I made the code changes and started some tests last Thursday, but sorely underestimated the amount of time I would need to request for downloading/unpacking this database. Re-running the tests now!

@CarsonJM (Contributor) commented Aug 7, 2023

Finished running all three tests (auto-download, .tar.gz, and directory): all worked great, and it looks like all CI tests are going to pass as well. One thought on this would be to add the label "process_high_memory" to GTDBTK_DB_PREPARATION so that by default it requests a lot of memory. Would that make sense?
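If the extra resources did turn out to be needed, an alternative to tagging the module with a label would be a per-process override in a custom config passed with -c. A sketch only: the process name is taken from this PR's module, and the figures mirror the manual run reported later in this thread.

```groovy
// Sketch: per-process resource override in a custom -c config, as an
// alternative to adding the process_high_memory label to the module.
process {
    withName: 'GTDBTK_DB_PREPARATION' {
        cpus   = 16
        memory = '100.GB'
        time   = '6.h'
    }
}
```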

@jfy133 (Member Author) commented Aug 7, 2023

Thanks for adding that @CarsonJM! Why do you think that process needs lots of memory? Isn't it just running untar?

@CarsonJM (Contributor) commented Aug 7, 2023

Good point, @jfy133! I initially ran this without modifying the resource request, and it failed after >4 hrs (our default queue's max runtime). When I requested 16 threads and 100 GB mem, it completed in 40 min. After your comment I looked at the trace, and the memory requirement is definitely low! Would it be running faster because more cores were available?

Trace below:

  • task_id / hash / native_id: 5 / 8b/0ce534 / 62494
  • name: NFCORE_MAG:MAG:GTDBTK:GTDBTK_DB_PREPARATION (gtdbtk_r214_data.tar.gz)
  • status / exit: COMPLETED / 0
  • submit: 2023-08-07 07:33:11.880
  • duration / realtime: 40m 25s / 40m 23s
  • %cpu: 37.7%
  • peak_rss / peak_vmem: 6.7 MB / 11.1 MB
  • rchar / wchar: 158.6 GB / 161.9 GB
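The trace (tiny peak_rss, huge rchar/wchar, ~38% CPU) is consistent with the step being an I/O-bound untar rather than a memory-hungry one. gzip inflation is inherently serial, so extra cores help little; a helper like pigz can offload checksumming and I/O to extra threads for a modest speedup, but it cannot parallelise gzip decompression itself. A minimal sketch on a toy tarball (pigz availability and GNU tar's -I option are assumptions):

```shell
set -eu
workdir=$(mktemp -d)

# Build a tiny stand-in for the GTDB tarball
mkdir -p "$workdir/database"
echo "gtdb marker" > "$workdir/database/metadata.txt"
tar -C "$workdir" -czf "$workdir/gtdbtk_data.tar.gz" database
rm -rf "$workdir/database"

# Extract: delegate decompression to pigz if present, else plain gzip.
# Note: even pigz cannot parallelise gzip *de*compression, so the gain
# from extra cores is modest; wall time is dominated by disk I/O.
if command -v pigz >/dev/null 2>&1; then
    tar -C "$workdir" -I pigz -xf "$workdir/gtdbtk_data.tar.gz"
else
    tar -C "$workdir" -xzf "$workdir/gtdbtk_data.tar.gz"
fi

extracted_content=$(cat "$workdir/database/metadata.txt")
echo "$extracted_content"   # prints: gtdb marker
rm -rf "$workdir"
```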

@jfy133 (Member Author) commented Aug 8, 2023

AFAIK it's also single-core... so I have no idea. Maybe the first time, the node you were sent to had lots of RAM-intensive jobs going on?

I didn't have that issue myself (40m every time) when running on my laptop, so I'm more inclined to leave it as is for now?

Otherwise, if you're happy with the PR @CarsonJM please give the ✔️ and then we can merge, and dare I say it, make the release?

@jfy133 jfy133 merged commit b004f03 into dev Aug 10, 2023
15 checks passed
@jfy133 jfy133 deleted the improve-database-handling branch August 10, 2023 13:45