diff --git a/CITATIONS.md b/CITATIONS.md index ddff8d5..99290f7 100644 --- a/CITATIONS.md +++ b/CITATIONS.md @@ -10,6 +10,10 @@ ## Pipeline tools +- [bbmap](https://sourceforge.net/projects/bbmap/) + + > Bushnell B. (2022) BBMap, URL: http://sourceforge.net/projects/bbmap/ + - [blastn](https://blast.ncbi.nlm.nih.gov/Blast.cgi) > Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. Journal of Molecular Biology 215, 403–410 (1990). doi:10.1016/s0022-2836(05)80360-2. diff --git a/README.md b/README.md index 5e24ef3..68c6e3b 100644 --- a/README.md +++ b/README.md @@ -19,16 +19,16 @@ ## Introduction -**nf-core/detaxizer** is a bioinformatics pipeline that checks for the presence of a specific taxon in (meta)genomic fastq files and offers the option to filter out this taxon or taxonomic subtree. The process begins with preprocessing (adapter trimming, quality cutting and optional length and quality filtering) using fastp and quality assessment via FastQC, followed by taxon classification with kraken2, and employs blastn for validation of the reads associated with the identified taxa. Users must provide a samplesheet to indicate the fastq files and, if utilizing the validation step, a fasta file for creating the blastn database to verify the targeted taxon. +**nf-core/detaxizer** is a bioinformatics pipeline that checks for the presence of a specific taxon in (meta)genomic fastq files and offers the option to filter out this taxon or taxonomic subtree. The process begins with quality assessment via FastQC and optional preprocessing (adapter trimming, quality cutting and optional length and quality filtering) using fastp, followed by taxon classification with kraken2 and/or bbduk, and optionally employs blastn for validation of the reads associated with the identified taxa. Users must provide a samplesheet to indicate the fastq files and, if utilizing bbduk in the classification and/or the validation step, fasta files for usage of bbduk and creating the blastn database to verify the targeted taxon. ![detaxizer metro workflow](docs/images/Detaxizer_metro_workflow.png) 1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)) -2. Pre-processing ([`fastp`](https://github.com/OpenGene/fastp)) -3. Classification of reads ([`Kraken2`](https://ccb.jhu.edu/software/kraken2/)) +2. Optional pre-processing ([`fastp`](https://github.com/OpenGene/fastp)) +3. Classification of reads ([`Kraken2`](https://ccb.jhu.edu/software/kraken2/), and/or [`bbduk`](https://sourceforge.net/projects/bbmap/)) 4. Optional validation of searched taxon/taxa ([`blastn`](https://blast.ncbi.nlm.nih.gov/Blast.cgi)) -5. Optional filtering of the searched taxon/taxa from the reads (either from the raw files or the preprocessed reads, using either the output from kraken2 or blastn) -6. Summary of the processes (how many reads were initially present after preprocessing, how many were classified as the `tax2filter` plus potential taxonomic subtree and optionally how many were validated) +5. Optional filtering of the searched taxon/taxa from the reads (either from the raw files or the preprocessed reads, using either the output from the classification (kraken2 and/or bbduk) or blastn) +6. Summary of the processes (how many were classified and optionally how many were validated) 7. Present QC for raw reads ([`MultiQC`](http://multiqc.info/)) ## Usage @@ -45,6 +45,9 @@ CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,A Each row represents a fastq file (single-end) or a pair of fastq files (paired end). A third fastq file can be provided if long reads are present in your project. For more detailed information about the samplesheet, see the [usage documentation](docs/usage.md). +> [!NOTE] +> Be aware that the `tax2filter` (default _Homo sapiens_) has to be in the provided kraken2 database (if kraken2 is used) and that the reference for bbduk (provided by the `fasta_bbduk` parameter) should contain the taxa to filter/assess if it is wanted to assess/remove the same taxa as in `tax2filter`. This overlap in the databases is not checked by the pipeline. To filter out/assess taxa with bbduk only, the `tax2filter` parameter is not needed but a fasta file with references of these taxa has to be provided. + Now, you can run the pipeline using: ```bash diff --git a/conf/modules.config b/conf/modules.config index b690eb9..88cd561 100644 --- a/conf/modules.config +++ b/conf/modules.config @@ -135,7 +135,7 @@ process { withName: MERGE_IDS { publishDir = [ - path: { "${params.outdir}/ids" }, + path: { "${params.outdir}/classification/ids" }, mode: params.publish_dir_mode, pattern: '*ids.txt', enabled: params.save_intermediates diff --git a/docs/images/Detaxizer_metro_workflow.png b/docs/images/Detaxizer_metro_workflow.png index 4d40d5c..afb864b 100644 Binary files a/docs/images/Detaxizer_metro_workflow.png and b/docs/images/Detaxizer_metro_workflow.png differ diff --git a/docs/images/Detaxizer_metro_workflow.svg b/docs/images/Detaxizer_metro_workflow.svg index e871717..1c9f895 100644 --- a/docs/images/Detaxizer_metro_workflow.svg +++ b/docs/images/Detaxizer_metro_workflow.svg @@ -1,4 +1,4 @@ -
Fastq
Fastq
fastq.gz
FastQC
fastp
Blastn
Summarizer
Filtered results
Fastq
Fastq
fastq.gz
kraken2
Filter
kraken2 DB
Blastn DB
tsv
Summary
short reads
- single-end
- paired-end
long reads
QC, preprocessing and classification
v1.0.0
Exclusive OR
Optional validation
Optional filtering
Databases for classification and validation
Summary
\ No newline at end of file +
Fastq
Fastq
fastq.gz
FastQC
fastp
blastn
Summarizer
filtered reads
Fastq
Fastq
fastq.gz
kraken2
Filter
Standard route
v1.1.0
Exclusive OR
Optional filtering
Optional routes
bbduk
tsv
summary
short reads
- single-end
- paired-end
long reads
Fastq
Fastq
fastq.gz
removed
reads
kraken2
txt
txt
kraken2
reports
removed
reads
kraken2
reports filtered
reads
Classification
\ No newline at end of file diff --git a/docs/output.md b/docs/output.md index 9292836..98d0b99 100644 --- a/docs/output.md +++ b/docs/output.md @@ -11,15 +11,17 @@ The directories listed below will be created in the results directory after the The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps: - [FastQC](#fastqc) - Raw read QC - Output not in the results directory by default -- [fastp](#fastp) - Preprocessing of raw reads -- [kraken2](#kraken2) - Classification of the preprocessed reads and extracting the searched taxa from the results -- [blastn](#blastn) - Validation of the reads classified as the searched taxa and extracting ids of validated reads -- [filter](#filter) - (Optional) filtering of the raw or preprocessed reads using either the read ids from kraken2 output or blastn output +- [fastp](#fastp) - (Optional) preprocessing of raw reads +- [kraken2](#kraken2) - Classification of the (preprocessed) reads and extracting the searched taxa from the results +- [bbduk](#bbduk) - Classification of the (preprocessed) reads +- [classification](#classification) - Preparation of the read IDs for filtering and/or validation +- [blastn](#blastn) - (Optional) validation of the reads classified as the searched taxa and extracting ids of validated reads +- [filter](#filter) - (Optional) filtering of the raw or preprocessed reads using either the read ids from kraken2 and/or bbduk output or blastn output - [summary](#summary) - The summary of the classification and the optional validation - [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline - [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution -Only the filtering results, the summary, MultiQC and pipeline information are shown by default in the results folder. +Only the filtering results, the summary, MultiQC and pipeline information are shown by default in the results folder. Also, if the output from the filter are classified using kraken2, a kraken2 folder, containing a `filtered/` and a `removed/`folder, will be shown. ### FastQC @@ -51,10 +53,16 @@ kraken2 classifies the reads. The important files are `*.classifiedreads.txt`, `
Output files -- `kraken2/`: Contains the output from the classification step. +- `kraken2/`: Contains the output from the kraken2 classification steps. + - `filtered/`: Contains the classification of the filtered reads (post-filtering). + - `.classifiedreads.txt`: The whole kraken2 output for filtered reads. + - `.kraken2.report.txt`: Statistics on how many reads where assigned to which taxon/taxonomic group in the filtered reads. - `isolated/`: Contains the isolated lines and ids for the taxon/taxa mentioned in the `tax2filter` parameter. - `.classified.txt`: The whole kraken2 output for the taxon/taxa mentioned in the `tax2filter` parameter. - `.ids.txt`: The ids from the whole kraken2 output assigned to the taxon/taxa mentioned in the `tax2filter` parameter. + - `removed/`: Contains the classification of the removed reads (post-filtering). + - `.classifiedreads.txt`: The whole kraken2 output for removed reads. + - `.kraken2.report.txt`: Statistics on how many reads where assigned to which taxon/taxonomic group in the removed reads. - `summary/`: Summary of the kraken2 process. - `.kraken2_summary.tsv`: Contains two three columns, column 1 is the sample name, column 2 the amount of lines in the untouched kraken2 output and column 3 the amount of lines in the isolated output. - `taxonomy/`: Contains the list of taxa to filter/to assess for. @@ -64,6 +72,33 @@ kraken2 classifies the reads. The important files are `*.classifiedreads.txt`, `
+### bbduk + +bbduk classifies the reads. The important files are `*.bbduk.log` and `ids/*.bbduk.txt`. `` can be replaced by `_longReads`, `_R1` or left as `` depending on the cases mentioned in [fastp](#fastp). + +
+Output files + +- `bbduk/`: Contains the output from the bbduk classification step. + - `ids/`: Contains the files with the IDs classified by bbduk. + - `.bbduk.txt`: Contains the classified IDs per sample. + - `.bbduk.log`: Contains statistics on the bbduk run. + +
+ +### classification + +Either the merged IDs from [bbduk](#bbduk) and [kraken2](#kraken2) or the ones produced by one of the tools are shown in this folder. Also, the summary files of the classification step are shown. + +
+Output files + +- `classification/`: Contains the results and the summaries of the classification step. + - `ids/`: Contains either the merged ID files of the classification step or the ones from one classification tool. + - `.ids.txt`: Contains the classified IDs. + - `summary/`: Contains the summary files of either the classification step or the ones from one classification tool. - `.classification_summary.tsv`: Contains the count of reads classified. +
+ ### blastn blastn can validate the reads classified by kraken2 as the taxon/taxa to be assessed/to be filtered. To reduce computational burden only the highest scoring hit per input sequence is returned. If in any case one would need more information this can be done via the `max_hsps`- and `max_target_seqs`-flags in the `modules.config` file. @@ -89,17 +124,20 @@ In this folder, the filtered and re-renamed reads can be found. This result has Output files - `filter/`: Folder containing the filtered and re-renamed reads. - - `_filtered.fastq.gz`: The filtered reads, `` can stay as `` for single-end short reads, take the pattern `_{R1,R2}` for paired-end reads and `_longReads` for long reads. + - `filtered/`: Folder containing the decontaminated reads + - `_filtered.fastq.gz`: The filtered reads, `` can stay as `` for single-end short reads, take the pattern `_{R1,R2}` for paired-end reads and `_longReads` for long reads. +- `removed/`: Folder containing the removed reads (optional) + - `_removed.fastq.gz`: The removed reads, `` can stay as `` for single-end short reads, take the pattern `_{R1,R2}` for paired-end reads and `_longReads` for long reads. ### summary -The summary file lists all statistics of kraken2 and blastn per sample. It is a combination of the summary files of kraken2 and blastn and can be used for a quick overview of the pipeline run. If blastn is skipped, then only the statistics of kraken2 is shown. +The summary file lists all statistics of kraken2 and/or bbduk (and optionally blastn) per sample. It is a combination of the summary files of the classification step and blastn and can be used for a quick overview of the pipeline run. By default, only the summary of the classification step is shown. -| | kraken2 | isolatedkraken2 | blastn_unique_ids | blastn_lines | filteredblastn_unique_ids | filteredblastn_lines | -| ------------------------------------------------------------------------------------------------------------------ | -------------------------- | --------------------------------------- | ------------------------------------------------------------------------- | ------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------- | -| `` (For short reads it is the same as in the `samplesheet.csv`, for long reads it is `_longReads`) | Read IDs in kraken2 output | Read IDs in the isolated kraken2 output | Number of unique IDs in blastn output, should be the same as blastn_lines | Number of lines in the blastn output | Number of IDs in the blastn output after the filtering for identity and coverage, should be the same as filteredblastn_lines | Number of lines in the blastn output after the filtering for identity and coverage | +| | classified with \* | blastn_unique_ids | blastn_lines | filteredblastn_unique_ids | filteredblastn_lines | +| ------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------- | ------------------------------------------------------------------------- | ------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------- | +| `` (For short reads it is the same as in the `samplesheet.csv`, for long reads it is `_longReads`) | Number of IDs classified in the classification step | Number of unique IDs in blastn output, should be the same as blastn_lines | Number of lines in the blastn output | Number of IDs in the blastn output after the filtering for identity and coverage, should be the same as filteredblastn_lines | Number of lines in the blastn output after the filtering for identity and coverage |
Output files diff --git a/docs/usage.md b/docs/usage.md index 8ad5eec..d2611d0 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -6,7 +6,7 @@ ## Introduction -nf-core/detaxizer is a pipeline to assess raw (meta)genomic data for contaminations and optionally filter reads which were classified as contamination. Default taxa classified as contamination are **_Homo_** and **_Homo sapiens_**. +nf-core/detaxizer is a pipeline to assess raw (meta)genomic data for contaminations and optionally filter reads which were classified as contamination. Default taxon classified as contamination is **_Homo sapiens_**. ## Samplesheet input @@ -48,13 +48,17 @@ The task of decontamination has to be balanced out between false positives and f ### kraken2 -To reduce false negatives a larger kraken2 database should be used. This comes at costs in terms of hardware requirements. For the largest kraken2 standard database (which can be found [here](https://benlangmead.github.io/aws-indexes/k2)) at least 100 GB of memory should be available, depending on the size of your data the required memory may be higher. For standard decontamination tasks the Standard-8 database can be used (which is the default), but it should always be kept in mind that this may lead to false negatives to some extend. +To reduce false negatives a larger kraken2 database should be used. This comes at costs in terms of hardware requirements. For the largest kraken2 standard database (which can be found [here](https://benlangmead.github.io/aws-indexes/k2)) at least 100 GB of memory should be available, depending on the size of your data the required memory may be higher. For standard decontamination tasks the Standard-8 GB database can be used (which is the default), but it should always be kept in mind that this may lead to false negatives to some extend. -Also, pangenome databases of the organism(s) classified as contamination could increase the amount of true positives while reducing the hardware requirements. For human such a database can be found [here](https://zenodo.org/doi/10.5281/zenodo.8339731). Such a database will increase false positives, unless a custom database is built together with the data of the organisms not classified as contamination. To build your own database refer to [this site](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#custom-databases). +To build your own database refer to [this site](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#custom-databases). + +### bbduk + +bbduk uses a fasta file which contains sequences from the taxon/taxa classified as contamination. Default is the `GRCh38` human reference genome. Provide a custom file using the `fasta_bbduk` parameter. ### blastn -The blastn database is built from a fasta file. Default is the `GRCh38` human reference genome. To decrease the amount of false negatives in this step or include different taxa, a database of several taxa can be used. The fasta containing desired sequences has to be provided to the pipeline by using the `fasta` parameter. +The blastn database is built from a fasta file. Default is the `GRCh38` human reference genome. To decrease the amount of false negatives in this step or include different taxa, a database of several taxa can be used. The fasta containing desired sequences has to be provided to the pipeline by using the `fasta_blastn` parameter. ## Running the pipeline @@ -101,13 +105,21 @@ You can also generate such `YAML`/`JSON` files via [nf-core/launch](https://nf-c Before and after (if using the filter) the execution of the pipeline the headers inside the `.fastq.gz` files are renamed. This step is necessary to avoid difficulties with different header formats in the pipeline. The renamed headers will never be shown to you, except when looking into the work directory. Only the original read headers are shown in the results. -To change the taxon or taxonomic subtree which is classified by kraken2 as contamination use the `tax2filter` parameter (default `Homo`). The taxon has to be in the kraken2 database used, which can be specified using the `kraken2db` parameter. +To change the taxon or taxonomic subtree which is classified by kraken2 as contamination use the `tax2filter` parameter (default `Homo sapiens`). The taxon has to be in the kraken2 database used, which can be specified using the `kraken2db` parameter. + +To change what is classified by `bbduk`, a fasta containing the sequences of the contaminant taxon/taxa has to be provided using the `fasta_bbduk` parameter. + +If you want to run `bbduk` use the `--classification_bbduk` flag. For running both classification steps and use the merged output for filtering, use both flags (`--classification_kraken2` and `--classification_bbduk`). + +To change the organism(s) which should be validated as contamination(s) by blasting against a database, you have to provide a fasta from which the blastn database is built using the `fasta_blastn` parameter. Also, if just one reference genome is needed for blastn and it is in [igenomes.config](../conf/igenomes.config) use the according name (e.g. `'GRCh38'`) as `genome` parameter. + +blastn can be turned on using the `validation_blastn` parameter. -To change the organism(s) which should be validated as contamination(s) by blasting against a database, you have to provide a fasta from which the blastn database is built using the `fasta` parameter. Also, if just one reference genome is needed for blastn and it is in [igenomes.config](../conf/igenomes.config) use the according name (e.g. `'GRCh38'`) as `genome` parameter. +Optionally enabling the filter can be done by using `--enable_filter`. There are two options for the input of the filter, either the raw reads or the preprocessed ones. The first is the default option. Also, for the definition of the reads to be filtered by their IDs two options are available. Either the default is taken, the output from the classification step (kraken2), or using the output from the `blastn` step. -Skipping blastn can be done by using `--skip_blastn`. +If you want to output the removed reads, use `--output_removed_reads`. -Optionally enabling the filter can be done by using `--enable_filter`. There are two options for the input of the filter, either the raw reads or the preprocessed ones. The first is the default option. Also, for the definition of the reads to be filtered by their IDs two options are available. Either the default is taken, the output from the `blastn` step, or using the output from the `kraken2` step. If `blastn` is skipped, the classified read IDs of `kraken2` are automatically used in the filtering step. +Optional classification of the filtered (and removed) reads can be done using `--classification_kraken2_post_filtering`. This uses the kraken2 database provided by `kraken2db`. ### Updating the pipeline diff --git a/subworkflows/local/utils_nfcore_detaxizer_pipeline/main.nf b/subworkflows/local/utils_nfcore_detaxizer_pipeline/main.nf index cd103fd..4947453 100644 --- a/subworkflows/local/utils_nfcore_detaxizer_pipeline/main.nf +++ b/subworkflows/local/utils_nfcore_detaxizer_pipeline/main.nf @@ -190,10 +190,11 @@ def toolCitationText() { def citation_text = [ "Tools used in the workflow included:", "FastQC (Andrews 2010),", - "fastp (Chen et al. 2018)", - "Kraken2 (Wood et al. 2019),", + params["preprocessing"] ? "fastp (Chen et al. 2018),": "", + params["classification_kraken2"] | !params["classification_bbduk"] & !params["classification_kraken2"] ? "Kraken2 (Wood et al. 2019)," : "", + params["classification_bbduk"] ? "BBMap (Bushnell B. 2022)," : "", params["validation_blastn"] ? "BLAST (Altschul et al. 1990)," : "", - params["validation_blastn"] | params["enable_filter"] ? "seqkit (Shen et al. 2016)," : "", + params["validation_blastn"] | params["enable_filter"] | params["classification_bbduk"] ? "seqkit (Shen et al. 2016)," : "", "MultiQC (Ewels et al. 2016)", "." ].join(' ').trim() @@ -204,11 +205,12 @@ def toolCitationText() { def toolBibliographyText() { def reference_text = [ - "
  • Andrews S, (2010) FastQC, URL: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
  • ", - "
  • Chen, S., Zhou, Y., Chen, Y. & Gu, J. (2018) fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890. doi: 10.1093/bioinformatics/bty560
  • ", - "
  • Wood, D. E., Lu, J. & Langmead, B. (2019) Improved metagenomic analysis with Kraken 2. Genome Biol 20, 257. doi: 10.1186/s13059-019-1891-0
  • ", + "
  • Andrews, S. (2010) FastQC, URL: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  • ", + params["preprocessing"] ? "
  • Chen, S., Zhou, Y., Chen, Y. & Gu, J. (2018) fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890. doi: 10.1093/bioinformatics/bty560
  • " : "", + params["classification_kraken2"] | !params["classification_bbduk"] & !params["classification_kraken2"] ? "
  • Wood, D. E., Lu, J. & Langmead, B. (2019) Improved metagenomic analysis with Kraken 2. Genome Biol 20, 257. doi: 10.1186/s13059-019-1891-0
  • " : "", + params["classification_bbduk"] ? "
  • Bushnell, B. (2022) BBMap, URL: http://sourceforge.net/projects/bbmap/
  • " : "", params["validation_blastn"] ? "
  • Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990) Basic local alignment search tool. Journal of Molecular Biology 215, 403–410. doi: 10.1016/s0022-2836(05)80360-2.
  • " : "", - params["validation_blastn"] | params["enable_filter"] ? "
  • Shen, W., Le, S., Li, Y., & Hu, F. (2016). SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. In Q. Zou (Ed.), PLOS ONE (Vol. 11, Issue 10, p. e0163962). Public Library of Science (PLoS). doi: 10.1371/journal.pone.0163962
  • " : "", + params["validation_blastn"] | params["enable_filter"] | params["classification_bbduk"] ? "
  • Shen, W., Le, S., Li, Y., & Hu, F. (2016). SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. In Q. Zou (Ed.), PLOS ONE (Vol. 11, Issue 10, p. e0163962). Public Library of Science (PLoS). doi: 10.1371/journal.pone.0163962
  • " : "", "
  • Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics , 32(19), 3047–3048. doi: /10.1093/bioinformatics/btw354
  • " ].join(' ').trim()