Merge pull request nf-core#35 from jannikseidelQBiC/feature_bbduk_docu

Feature bbduk documentation
jannikseidelQBiC · Sep 5, 2024 · 40e31aa
2 parents 64a91c4 + 517838f
commit 40e31aa

Showing 8 changed files with 92 additions and 33 deletions.
diff --git a/CITATIONS.md b/CITATIONS.md
@@ -10,6 +10,10 @@
 
 ## Pipeline tools
 
+- [bbmap](https://sourceforge.net/projects/bbmap/)
+
+  > Bushnell B. (2022) BBMap, URL: http://sourceforge.net/projects/bbmap/
+
 - [blastn](https://blast.ncbi.nlm.nih.gov/Blast.cgi)
 
   > Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. Journal of Molecular Biology 215, 403–410 (1990). doi:10.1016/s0022-2836(05)80360-2.

diff --git a/README.md b/README.md
@@ -19,16 +19,16 @@
 
 ## Introduction
 
-**nf-core/detaxizer** is a bioinformatics pipeline that checks for the presence of a specific taxon in (meta)genomic fastq files and offers the option to filter out this taxon or taxonomic subtree. The process begins with preprocessing (adapter trimming, quality cutting and optional length and quality filtering) using fastp and quality assessment via FastQC, followed by taxon classification with kraken2, and employs blastn for validation of the reads associated with the identified taxa. Users must provide a samplesheet to indicate the fastq files and, if utilizing the validation step, a fasta file for creating the blastn database to verify the targeted taxon.
+**nf-core/detaxizer** is a bioinformatics pipeline that checks for the presence of a specific taxon in (meta)genomic fastq files and offers the option to filter out this taxon or taxonomic subtree. The process begins with quality assessment via FastQC and optional preprocessing (adapter trimming, quality cutting and optional length and quality filtering) using fastp, followed by taxon classification with kraken2 and/or bbduk, and optionally employs blastn for validation of the reads associated with the identified taxa. Users must provide a samplesheet to indicate the fastq files and, if using bbduk in the classification and/or the validation step, fasta files for bbduk and for creating the blastn database used to verify the targeted taxon.
 
 ![detaxizer metro workflow](docs/images/Detaxizer_metro_workflow.png)
 
 1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
-2. Pre-processing ([`fastp`](https://github.com/OpenGene/fastp))
-3. Classification of reads ([`Kraken2`](https://ccb.jhu.edu/software/kraken2/))
+2. Optional pre-processing ([`fastp`](https://github.com/OpenGene/fastp))
+3. Classification of reads ([`Kraken2`](https://ccb.jhu.edu/software/kraken2/), and/or [`bbduk`](https://sourceforge.net/projects/bbmap/))
 4. Optional validation of searched taxon/taxa ([`blastn`](https://blast.ncbi.nlm.nih.gov/Blast.cgi))
-5. Optional filtering of the searched taxon/taxa from the reads (either from the raw files or the preprocessed reads, using either the output from kraken2 or blastn)
-6. Summary of the processes (how many reads were initially present after preprocessing, how many were classified as the `tax2filter` plus potential taxonomic subtree and optionally how many were validated)
+5. Optional filtering of the searched taxon/taxa from the reads (either from the raw files or the preprocessed reads, using either the output from the classification (kraken2 and/or bbduk) or blastn)
+6. Summary of the processes (how many reads were classified and optionally how many were validated)
 7. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
 
 ## Usage
@@ -45,6 +45,9 @@ CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,A
 
 Each row represents a fastq file (single-end) or a pair of fastq files (paired end). A third fastq file can be provided if long reads are present in your project. For more detailed information about the samplesheet, see the [usage documentation](docs/usage.md).
 
+> [!NOTE]
+> The taxon given by `tax2filter` (default _Homo sapiens_) must be present in the provided kraken2 database if kraken2 is used. If bbduk should assess/remove the same taxa as `tax2filter`, the reference for bbduk (provided by the `fasta_bbduk` parameter) should contain those taxa. The pipeline does not check this overlap between the databases. To filter out/assess taxa with bbduk only, the `tax2filter` parameter is not needed, but a fasta file with references of these taxa has to be provided.
+
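Because the pipeline does not check the overlap described in the note, it can be checked by hand before a run. The following is a minimal Python sketch, assuming the kraken2 database has been summarised into a report-style text file (for example with `kraken2-inspect`) in which the scientific name is the last tab-separated field of each line. The function name `taxon_in_report` and the example lines are illustrative and not part of the pipeline.

```python
def taxon_in_report(report_lines, taxon):
    """Return True if `taxon` appears as a name in report-style lines.

    Assumption: each data line is tab-separated and ends with the
    (space-indented) scientific name; lines starting with '#' are comments.
    """
    wanted = taxon.strip().lower()
    for line in report_lines:
        if not line.strip() or line.startswith("#"):
            continue
        # The name is taken from the last column; indentation is stripped.
        name = line.rstrip("\n").split("\t")[-1].strip().lower()
        if name == wanted:
            return True
    return False


# Illustrative excerpt only, not real database content.
example_report = [
    "# Database options: nucleotide db, k = 35, l = 31\n",
    "100.00\t1000\t0\tR\t1\troot\n",
    " 60.00\t600\t600\tS\t9606\t        Homo sapiens\n",
]

print(taxon_in_report(example_report, "Homo sapiens"))
print(taxon_in_report(example_report, "Mus musculus"))
```

The comparison is an exact, case-insensitive name match, so a misspelled `tax2filter` value is reported as absent rather than matched loosely.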
 Now, you can run the pipeline using:
 
 ```bash

diff --git a/conf/modules.config b/conf/modules.config
@@ -135,7 +135,7 @@ process {
 
     withName: MERGE_IDS {
         publishDir = [
-            path: { "${params.outdir}/ids" },
+            path: { "${params.outdir}/classification/ids" },
             mode: params.publish_dir_mode,
             pattern: '*ids.txt',
             enabled: params.save_intermediates
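
With this change, files from `MERGE_IDS` matching `*ids.txt` are published under `classification/ids` inside the output directory, and only when `save_intermediates` is enabled. The following Python sketch shows how a downstream script might locate and read such a file. The one-ID-per-line layout, the temporary demo directory and the helper names `ids_path` and `read_classified_ids` are assumptions for illustration, not part of the pipeline.

```python
from pathlib import Path
import tempfile


def ids_path(outdir, sample):
    """Build the expected location of the ID file for one sample."""
    return Path(outdir) / "classification" / "ids" / f"{sample}.ids.txt"


def read_classified_ids(path):
    """Read read IDs into a set, assuming one ID per line."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


# Demo with a throwaway directory standing in for the pipeline's outdir.
with tempfile.TemporaryDirectory() as outdir:
    target = ids_path(outdir, "CONTROL_REP1")
    target.parent.mkdir(parents=True)
    target.write_text("read_1\nread_2\nread_2\n")
    demo_ids = read_classified_ids(target)
    print(sorted(demo_ids))
```

Scripts that still read from the old `ids/` location would need the extra `classification/` path component after this commit.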

diff --git a/docs/images/Detaxizer_metro_workflow.png b/docs/images/Detaxizer_metro_workflow.png
diff --git a/docs/images/Detaxizer_metro_workflow.svg b/docs/images/Detaxizer_metro_workflow.svg
diff --git a/docs/output.md b/docs/output.md
@@ -11,15 +11,17 @@ The directories listed below will be created in the results directory after the
 The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:
 
 - [FastQC](#fastqc) - Raw read QC - Output not in the results directory by default
-- [fastp](#fastp) - Preprocessing of raw reads
-- [kraken2](#kraken2) - Classification of the preprocessed reads and extracting the searched taxa from the results
-- [blastn](#blastn) - Validation of the reads classified as the searched taxa and extracting ids of validated reads
-- [filter](#filter) - (Optional) filtering of the raw or preprocessed reads using either the read ids from kraken2 output or blastn output
+- [fastp](#fastp) - (Optional) preprocessing of raw reads
+- [kraken2](#kraken2) - Classification of the (preprocessed) reads and extracting the searched taxa from the results
+- [bbduk](#bbduk) - Classification of the (preprocessed) reads
+- [classification](#classification) - Preparation of the read IDs for filtering and/or validation
+- [blastn](#blastn) - (Optional) validation of the reads classified as the searched taxa and extracting ids of validated reads
+- [filter](#filter) - (Optional) filtering of the raw or preprocessed reads using either the read ids from kraken2 and/or bbduk output or blastn output
 - [summary](#summary) - The summary of the classification and the optional validation
 - [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
 - [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
 
-Only the filtering results, the summary, MultiQC and pipeline information are shown by default in the results folder.
+Only the filtering results, the summary, MultiQC and pipeline information are shown by default in the results folder. If the output of the filtering step is classified using kraken2, a kraken2 folder containing a `filtered/` and a `removed/` folder is also shown.
 
 ### FastQC
 
@@ -51,10 +53,16 @@ kraken2 classifies the reads. The important files are `*.classifiedreads.txt`, `
 <details markdown="1">
 <summary>Output files</summary>
 
-- `kraken2/`: Contains the output from the classification step.
+- `kraken2/`: Contains the output from the kraken2 classification steps.
+  - `filtered/`: Contains the classification of the filtered reads (post-filtering).
+    - `<sample>.classifiedreads.txt`: The whole kraken2 output for filtered reads.
+    - `<sample>.kraken2.report.txt`: Statistics on how many reads were assigned to which taxon/taxonomic group in the filtered reads.
   - `isolated/`: Contains the isolated lines and ids for the taxon/taxa mentioned in the `tax2filter` parameter.
     - `<sample>.classified.txt`: The whole kraken2 output for the taxon/taxa mentioned in the `tax2filter` parameter.
     - `<sample>.ids.txt`: The ids from the whole kraken2 output assigned to the taxon/taxa mentioned in the `tax2filter` parameter.
+  - `removed/`: Contains the classification of the removed reads (post-filtering).
+    - `<sample>.classifiedreads.txt`: The whole kraken2 output for removed reads.
+    - `<sample>.kraken2.report.txt`: Statistics on how many reads were assigned to which taxon/taxonomic group in the removed reads.
   - `summary/`: Summary of the kraken2 process.
     - `<sample>.kraken2_summary.tsv`: Contains three columns: column 1 is the sample name, column 2 the number of lines in the untouched kraken2 output and column 3 the number of lines in the isolated output.
   - `taxonomy/`: Contains the list of taxa to filter/to assess for.
@@ -64,6 +72,33 @@ kraken2 classifies the reads. The important files are `*.classifiedreads.txt`, `
 
 </details>
 
+### bbduk
+
+bbduk classifies the reads. The important files are `*.bbduk.log` and `ids/*.bbduk.txt`. `<sample>` can be replaced by `<sample>_longReads`, `<sample>_R1` or left as `<sample>` depending on the cases mentioned in [fastp](#fastp).
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `bbduk/`: Contains the output from the bbduk classification step.
+  - `ids/`: Contains the files with the IDs classified by bbduk.
+    - `<sample>.bbduk.txt`: Contains the classified IDs per sample.
+  - `<sample>.bbduk.log`: Contains statistics on the bbduk run.
+
+</details>
+
+### classification
+
+Either the merged IDs from [bbduk](#bbduk) and [kraken2](#kraken2) or the ones produced by one of the tools are shown in this folder. Also, the summary files of the classification step are shown.
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `classification/`: Contains the results and the summaries of the classification step.
+  - `ids/`: Contains either the merged ID files of the classification step or the ones from one classification tool.
+    - `<sample>.ids.txt`: Contains the classified IDs.
+  - `summary/`: Contains the summary files of either the classification step or the ones from one classification tool.
+    - `<sample>.classification_summary.tsv`: Contains the count of reads classified.
+
+</details>
+
 ### blastn
 
 blastn can validate the reads classified by kraken2 as the taxon/taxa to be assessed/to be filtered. To reduce the computational burden, only the highest-scoring hit per input sequence is returned. If more information is needed, this can be changed via the `max_hsps` and `max_target_seqs` flags in the `modules.config` file.
@@ -89,17 +124,20 @@ In this folder, the filtered and re-renamed reads can be found. This result has
 <summary>Output files</summary>
 
 - `filter/`: Folder containing the filtered and re-renamed reads.
-  - `<sample>_filtered.fastq.gz`: The filtered reads, `<sample>` can stay as `<sample>` for single-end short reads, take the pattern `<sample>_{R1,R2}` for paired-end reads and `<sample>_longReads` for long reads.
+  - `filtered/`: Folder containing the decontaminated reads
+    - `<sample>_filtered.fastq.gz`: The filtered reads, `<sample>` can stay as `<sample>` for single-end short reads, take the pattern `<sample>_{R1,R2}` for paired-end reads and `<sample>_longReads` for long reads.
+  - `removed/`: Folder containing the removed reads (optional)
+    - `<sample>_removed.fastq.gz`: The removed reads, `<sample>` can stay as `<sample>` for single-end short reads, take the pattern `<sample>_{R1,R2}` for paired-end reads and `<sample>_longReads` for long reads.
 
 </details>
 
 ### summary
 
-The summary file lists all statistics of kraken2 and blastn per sample. It is a combination of the summary files of kraken2 and blastn and can be used for a quick overview of the pipeline run. If blastn is skipped, then only the statistics of kraken2 is shown.
+The summary file lists all statistics of kraken2 and/or bbduk (and optionally blastn) per sample. It is a combination of the summary files of the classification step and blastn and can be used for a quick overview of the pipeline run. By default, only the summary of the classification step is shown.
 
-|                                                                                                                    | kraken2                    | isolatedkraken2                         | blastn_unique_ids                                                         | blastn_lines                         | filteredblastn_unique_ids                                                                                                    | filteredblastn_lines                                                               |
-| ------------------------------------------------------------------------------------------------------------------ | -------------------------- | --------------------------------------- | ------------------------------------------------------------------------- | ------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------- |
-| `<sample>` (For short reads it is the same as in the `samplesheet.csv`, for long reads it is `<sample>_longReads`) | Read IDs in kraken2 output | Read IDs in the isolated kraken2 output | Number of unique IDs in blastn output, should be the same as blastn_lines | Number of lines in the blastn output | Number of IDs in the blastn output after the filtering for identity and coverage, should be the same as filteredblastn_lines | Number of lines in the blastn output after the filtering for identity and coverage |
+|                                                                                                                    | classified with \*                                  | blastn_unique_ids                                                         | blastn_lines                         | filteredblastn_unique_ids                                                                                                    | filteredblastn_lines                                                               |
+| ------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------- | ------------------------------------------------------------------------- | ------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------- |
+| `<sample>` (For short reads it is the same as in the `samplesheet.csv`, for long reads it is `<sample>_longReads`) | Number of IDs classified in the classification step | Number of unique IDs in blastn output, should be the same as blastn_lines | Number of lines in the blastn output | Number of IDs in the blastn output after the filtering for identity and coverage, should be the same as filteredblastn_lines | Number of lines in the blastn output after the filtering for identity and coverage |
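
The table states that `blastn_unique_ids` should equal `blastn_lines` and that `filteredblastn_unique_ids` should equal `filteredblastn_lines`. The following Python sketch checks this for a summary table. It assumes a tab-separated layout with a header row and the sample name in the first column; the `sample` header label, the `classified with kraken2` column name, the second sample name and all numbers are made up for illustration.

```python
import csv
import io

# Column pairs that the table documents as expected to be equal.
PAIRS = [
    ("blastn_unique_ids", "blastn_lines"),
    ("filteredblastn_unique_ids", "filteredblastn_lines"),
]


def inconsistent_samples(tsv_text):
    """Return sample names whose paired blastn counts differ."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    sample_col = reader.fieldnames[0]  # first column holds the sample name
    flagged = []
    for row in reader:
        for unique_col, lines_col in PAIRS:
            unique_val, lines_val = row.get(unique_col), row.get(lines_col)
            # Skip pairs that are absent, e.g. when blastn was not run.
            if unique_val in (None, "") or lines_val in (None, ""):
                continue
            if int(unique_val) != int(lines_val):
                flagged.append(row[sample_col])
                break
    return flagged


# Illustrative content only, not output of a real run.
example_summary = (
    "sample\tclassified with kraken2\tblastn_unique_ids\tblastn_lines\t"
    "filteredblastn_unique_ids\tfilteredblastn_lines\n"
    "CONTROL_REP1\t120\t100\t100\t80\t80\n"
    "CONTROL_REP2\t90\t70\t72\t60\t60\n"
)

print(inconsistent_samples(example_summary))
```

Pairs with missing columns are skipped, so the same check also runs on a summary that only contains the classification column.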
 
 <details markdown="1">
 <summary>Output files</summary>