Standardise GTDB execution and allow pre-uncompressed GTDB input #477

Merged 11 commits on Aug 10, 2023
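As a rough illustration of how the split parameters introduced by this PR might be used, the fragment below sketches a user-supplied Nextflow config. It is a minimal sketch only: the file name `custom.config` and the database path are hypothetical placeholders, and only the parameter names `skip_gtdbtk` and `gtdb_db` come from the diff itself.

```groovy
// custom.config (hypothetical file name), passed with `-c custom.config`
params {
    // Keep GTDB-Tk classification enabled (false is the pipeline default in this PR)
    skip_gtdbtk = false

    // Point at an already-extracted GTDB-Tk database directory.
    // The path is a placeholder; a .tar.gz archive path or URL is also accepted.
    gtdb_db     = '/path/to/gtdbtk_r214_data'

    // To skip GTDB-Tk and the automatic database download instead:
    // skip_gtdbtk = true
}
```

Note that, per the checks changed in this PR, GTDB-Tk is still omitted when `skip_binqc` is set, regardless of `gtdb_db`.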
4 changes: 3 additions & 1 deletion CHANGELOG.md
@@ -11,8 +11,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [#422](https://github.com/nf-core/mag/pull/422) - Adds support for normalization of read depth with BBNorm (added by @erikrikarddaniel and @fabianegli)
- [#439](https://github.com/nf-core/mag/pull/439) - Adds ability to enter the pipeline at the binning stage by providing a CSV of pre-computed assemblies (by @prototaxites)
- [#459](https://github.com/nf-core/mag/pull/459) - Adds ability to skip damage correction step in the ancient DNA workflow and just run pyDamage (by @jfy133)
- [#364](https://github.com/nf-core/mag/pull/364) - Added geNomad nf-core modules for identifying viruses in assemblies (by @PhilPalmer and @CarsonJM)
- [#364](https://github.com/nf-core/mag/pull/364) - Adds geNomad nf-core modules for identifying viruses in assemblies (by @PhilPalmer and @CarsonJM)
- [#481](https://github.com/nf-core/mag/pull/481) - Adds MetaEuk for annotation of eukaryotic MAGs, and MMSeqs2 to enable downloading databases for MetaEuk (by @prototaxites)
- [#437](https://github.com/nf-core/mag/pull/429) - `--gtdb_db` now also supports directory input of a pre-uncompressed GTDB archive (reported by @alneberg, fix by @jfy133)

### `Changed`

@@ -22,6 +23,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [#442](https://github.com/nf-core/mag/pull/442) - Remove warning when BUSCO finds no genes in bins, as this can be expected in some datasets (reported by @Lumimar, fix by @jfy133).
- [#444](https://github.com/nf-core/mag/pull/444) - Moved BUSCO bash code to script (by @jfy133)
- [#428](https://github.com/nf-core/mag/pull/429) - Update to nf-core 2.9 `TEMPLATE` (by @jfy133)
- [#437](https://github.com/nf-core/mag/pull/429) - `--gtdb` parameter is split into `--skip_gtdbtk` and `--gtdb_db` to allow finer control over GTDB database retrieval (fix by @jfy133)

### `Fixed`

2 changes: 1 addition & 1 deletion conf/test.config
@@ -28,6 +28,6 @@ params {
max_unbinned_contigs = 2
busco_reference = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
busco_clean = true
gtdb = false
skip_gtdbtk = true
skip_concoct = true
}
2 changes: 1 addition & 1 deletion conf/test_adapterremoval.config
@@ -28,7 +28,7 @@ params {
min_length_unbinned_contigs = 1
max_unbinned_contigs = 2
busco_reference = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
gtdb = false
skip_gtdbtk = true
clip_tool = 'adapterremoval'
skip_concoct = true
bin_domain_classification = true
2 changes: 1 addition & 1 deletion conf/test_ancient_dna.config
@@ -27,7 +27,7 @@ params {
min_length_unbinned_contigs = 1
max_unbinned_contigs = 2
busco_reference = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
gtdb = false
skip_gtdbtk = true
ancient_dna = true
binning_map_mode = 'own'
skip_spades = false
2 changes: 1 addition & 1 deletion conf/test_bbnorm.config
@@ -34,7 +34,7 @@ params {
max_unbinned_contigs = 2
busco_reference = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
busco_clean = true
gtdb = false
skip_gtdbtk = true
bbnorm = true
coassemble_group = true
}
2 changes: 1 addition & 1 deletion conf/test_binrefinement.config
@@ -28,7 +28,7 @@ params {
min_length_unbinned_contigs = 1
max_unbinned_contigs = 2
busco_reference = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
gtdb = false
skip_gtdbtk = true
refine_bins_dastool = true
refine_bins_dastool_threshold = 0
postbinning_input = 'both'
2 changes: 1 addition & 1 deletion conf/test_busco_auto.config
@@ -24,7 +24,7 @@ params {
skip_spades = true
min_length_unbinned_contigs = 1
max_unbinned_contigs = 2
gtdb = false
skip_gtdbtk = true
skip_prokka = true
skip_prodigal = true
skip_quast = true
2 changes: 1 addition & 1 deletion conf/test_full.config
@@ -22,7 +22,7 @@ params {
centrifuge_db = "s3://ngi-igenomes/test-data/mag/p_compressed+h+v.tar.gz"
kraken2_db = "s3://ngi-igenomes/test-data/mag/minikraken_8GB_202003.tgz"
cat_db = "s3://ngi-igenomes/test-data/mag/CAT_prepare_20210107.tar.gz"
gtdb = "s3://ngi-igenomes/test-data/mag/gtdbtk_r202_data.tar.gz"
gtdb_db = "s3://ngi-igenomes/test-data/mag/gtdbtk_r202_data.tar.gz"

// reproducibility options for assembly
spades_fix_cpus = 10
2 changes: 1 addition & 1 deletion conf/test_host_rm.config
@@ -25,6 +25,6 @@ params {
min_length_unbinned_contigs = 1
max_unbinned_contigs = 2
busco_reference = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
gtdb = false
skip_gtdbtk = true
skip_concoct = true
}
2 changes: 1 addition & 1 deletion conf/test_hybrid.config
@@ -24,6 +24,6 @@ params {
min_length_unbinned_contigs = 1
max_unbinned_contigs = 2
busco_reference = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
gtdb = false
skip_gtdbtk = true
skip_concoct = true
}
1 change: 1 addition & 0 deletions conf/test_hybrid_host_rm.config
@@ -26,4 +26,5 @@ params {
max_unbinned_contigs = 2
skip_binqc = true
skip_concoct = true
skip_gtdbtk = true
}
2 changes: 1 addition & 1 deletion conf/test_nothing.config
@@ -38,6 +38,6 @@ params {
skip_concoct = true
skip_prokka = true
skip_binqc = true
gtdb = false
skip_gtdbtk = true
skip_concoct = true
}
2 changes: 1 addition & 1 deletion conf/test_virus_identification.config
@@ -27,7 +27,7 @@ params {
// For computational efficiency
reads_minlength = 150
coassemble_group = true
gtdb = false
skip_gtdbtk = true
skip_binning = true
skip_prokka = true
skip_spades = true
4 changes: 2 additions & 2 deletions lib/WorkflowMag.groovy
@@ -119,8 +119,8 @@ class WorkflowMag {
Nextflow.error('Both --busco_auto_lineage_prok and --busco_reference are specified! Invalid combination, please specify either --busco_auto_lineage_prok or --busco_reference.')
}

if (params.skip_binqc && params.gtdb) {
log.warn '--skip_binqc and --gtdb are specified! GTDB-tk will be omitted because GTDB-tk bin classification requires bin filtering based on BUSCO or CheckM QC results to avoid GTDB-tk errors.'
if (params.skip_binqc && !params.skip_gtdbtk) {
log.warn '--skip_binqc is specified, but GTDB-tk is set to run (--skip_gtdbtk is not specified)! GTDB-tk will be omitted because GTDB-tk bin classification requires bin filtering based on BUSCO or CheckM QC results to avoid GTDB-tk errors.'
}

// Check if CAT parameters are valid
2 changes: 1 addition & 1 deletion modules/local/gtdbtk_db_preparation.nf
@@ -10,7 +10,7 @@ process GTDBTK_DB_PREPARATION {
path(database)

output:
tuple val("${database.toString().replace(".tar.gz", "")}"), path("database/*")
tuple val("${database.toString().replace(".tar.gz", "")}"), path("database/*"), emit: db

script:
"""
3 changes: 2 additions & 1 deletion nextflow.config
@@ -84,7 +84,8 @@ params {
cat_db_generate = false
cat_official_taxonomy = false
save_cat_db = false
gtdb = "https://data.ace.uq.edu.au/public/gtdb/data/releases/release202/202.0/auxillary_files/gtdbtk_r202_data.tar.gz"
skip_gtdbtk = false
gtdb_db = "https://data.ace.uq.edu.au/public/gtdb/data/releases/release214/214.1/auxillary_files/gtdbtk_r214_data.tar.gz"
gtdbtk_min_completeness = 50.0
gtdbtk_max_contamination = 10.0
gtdbtk_min_perc_aa = 10
12 changes: 8 additions & 4 deletions nextflow_schema.json
@@ -511,11 +511,15 @@
"type": "boolean",
"description": "Only return official taxonomic ranks (Kingdom, Phylum, etc.) when running CAT."
},
"gtdb": {
"skip_gtdbtk": {
"type": "boolean",
"description": "Skip running GTDB-Tk, as well as the automatic download of its database.",
"default": false
},
"gtdb_db": {
"type": "string",
"default": "https://data.gtdb.ecogenomic.org/releases/release202/202.0/auxillary_files/gtdbtk_r202_data.tar.gz",
"description": "GTDB database for taxonomic classification of bins with GTDB-tk.",
"help_text": "For information which GTDB reference databases are compatible with the used GTDB-tk version see https://ecogenomics.github.io/GTDBTk/installing/index.html#gtdb-tk-reference-data."
"description": "Specify the location of a GTDB-Tk database. Can be either an uncompressed directory or a `.tar.gz` archive. If not specified, the database will be downloaded for you unless GTDB-Tk or binning QC is skipped.",
"default": "https://data.ace.uq.edu.au/public/gtdb/data/releases/release214/214.1/auxillary_files/gtdbtk_r214_data.tar.gz"
},
"gtdbtk_min_completeness": {
"type": "number",
4 changes: 2 additions & 2 deletions subworkflows/local/binning.nf
@@ -130,9 +130,9 @@ workflow BINNING {
ch_versions = ch_versions.mix(GUNZIP_UNBINS.out.versions.first())

emit:
bins = ch_binning_results_gunzipped.dump(tag: "ch_binning_results_gunzipped")
bins = ch_binning_results_gunzipped
bins_gz = ch_binning_results_gzipped_final
unbinned = ch_splitfasta_results_gunzipped.dump(tag: "ch_splitfasta_results_gunzipped")
unbinned = ch_splitfasta_results_gunzipped
unbinned_gz = SPLIT_FASTA.out.unbinned
metabat2depths = METABAT2_JGISUMMARIZEBAMCONTIGDEPTHS.out.depth
versions = ch_versions
18 changes: 16 additions & 2 deletions subworkflows/local/gtdbtk.nf
@@ -59,10 +59,24 @@ workflow GTDBTK {
return [it[0], it[1]]
}

GTDBTK_DB_PREPARATION ( gtdb )
if ( gtdb.extension == 'gz' ) {
// Expects to be tar.gz!
ch_db_for_gtdbtk = GTDBTK_DB_PREPARATION ( gtdb ).db
} else if ( gtdb.isDirectory() ) {
// Make up meta id to match expected channel cardinality for GTDBTK
ch_db_for_gtdbtk = Channel
.of(gtdb)
.map{
[ it.toString().split('/').last(), it ]
}
.collect()
} else {
error("Unsupported object given to --gtdb_db, database must be supplied as either a directory or a .tar.gz file!")
}

GTDBTK_CLASSIFYWF (
ch_filtered_bins.passed.groupTuple(),
GTDBTK_DB_PREPARATION.out
ch_db_for_gtdbtk
)

GTDBTK_SUMMARY (
49 changes: 27 additions & 22 deletions workflows/mag.nf
@@ -31,7 +31,7 @@ log.info logo + paramsSummaryLog(workflow) + citation
WorkflowMag.initialise(params, log, hybrid)

// Check input path parameters to see if they exist
def checkPathParamList = [ params.input, params.multiqc_config, params.phix_reference, params.host_fasta, params.centrifuge_db, params.kraken2_db, params.cat_db, params.gtdb, params.lambda_reference, params.busco_reference ]
def checkPathParamList = [ params.input, params.multiqc_config, params.phix_reference, params.host_fasta, params.centrifuge_db, params.kraken2_db, params.cat_db, params.gtdb_db, params.lambda_reference, params.busco_reference ]
for (param in checkPathParamList) { if (param) { file(param, checkIfExists: true) } }

/*
@@ -205,13 +205,12 @@ if (params.genomad_db){
ch_genomad_db = Channel.empty()
}

gtdb = params.skip_binqc ? false : params.gtdb
gtdb = ( params.skip_binqc || params.skip_gtdbtk ) ? false : params.gtdb_db

if (gtdb) {
ch_gtdb = Channel
.value(file( "${gtdb}" ))
gtdb = file( "${gtdb}", checkIfExists: true)
} else {
ch_gtdb = Channel.empty()
gtdb = []
}

if(params.metaeuk_db && !params.skip_metaeuk) {
@@ -720,12 +719,12 @@ workflow MAG {


} else {
ch_binning_results_bins = BINNING.out.bins.dump(tag: 'BINNING.out.bins')
ch_binning_results_bins = BINNING.out.bins
.map { meta, bins ->
def meta_new = meta + [domain: 'unclassified']
[meta_new, bins]
}
ch_binning_results_unbins = BINNING.out.unbinned.dump(tag: 'BINNING.out.unbins')
ch_binning_results_unbins = BINNING.out.unbinned
.map { meta, bins ->
def meta_new = meta + [domain: 'unclassified']
[meta_new, bins]
@@ -877,25 +876,31 @@ workflow MAG {
/*
* GTDB-tk: taxonomic classifications using GTDB reference
*/
ch_gtdbtk_summary = Channel.empty()
if ( gtdb ){

ch_gtdb_bins = ch_input_for_postbinning_bins_unbins
.filter { meta, bins ->
meta.domain != "eukarya"
}
if ( !params.skip_gtdbtk ) {

GTDBTK (
ch_gtdb_bins,
ch_busco_summary,
ch_checkm_summary,
ch_gtdb
)
ch_versions = ch_versions.mix(GTDBTK.out.versions.first())
ch_gtdbtk_summary = GTDBTK.out.summary
ch_gtdbtk_summary = Channel.empty()
if ( gtdb ){

ch_gtdb_bins = ch_input_for_postbinning_bins_unbins
.filter { meta, bins ->
meta.domain != "eukarya"
}

GTDBTK (
ch_gtdb_bins,
ch_busco_summary,
ch_checkm_summary,
gtdb
)
ch_versions = ch_versions.mix(GTDBTK.out.versions.first())
ch_gtdbtk_summary = GTDBTK.out.summary
}
} else {
ch_gtdbtk_summary = Channel.empty()
}

if ( ( !params.skip_binqc ) || !params.skip_quast || gtdb){
if ( ( !params.skip_binqc ) || !params.skip_quast || !params.skip_gtdbtk){
BIN_SUMMARY (
ch_input_for_binsummary,
ch_busco_summary.ifEmpty([]),