4.1.8.0
Download release: gatk-4.1.8.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.1.8.0 release:
- A major new release of GenomicsDB (1.3.0), with enhanced support for shared filesystems such as NFS and Lustre, support for MNVs, and better compression leading to a roughly 50% reduction in workspace size in our tests. This also includes a fix for an error in `GenotypeGVCFs` that several users were encountering when reading from GenomicsDB.
- A major overhaul of the PathSeq microbial detection pipeline, containing many improvements
- Initial/prototype support for reading from HTSGET services in GATK
  - Over the next several releases, we intend for HTSGET support to propagate to more tools in the GATK
- Fixes for a couple of frequently-reported errors in `HaplotypeCaller` and `Mutect2` (#6586 and #6516)
- Significant updates to our Python/R library dependencies and Docker image
Full list of changes:
- New Tools
  - `HtsgetReader`: an experimental tool to localize files from an HTSGET service (#6611)
    - Over the next several releases, we intend for HTSGET support to propagate to more tools in the GATK
  - `ReadAnonymizer`: a tool to anonymize reads with information from the reference (#6653)
    - This tool is useful in the case where you want to use data for analysis, but cannot publish the data without anonymizing the sequence information.
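As an illustration only, localizing a file from an htsget endpoint might look like the sketch below. The server URL, ID, and argument names here are assumptions, not confirmed by these notes; check `gatk HtsgetReader --help` for the actual interface of this experimental tool.

```shell
# Hypothetical example; the server URL, sample ID, and argument names
# are placeholders and may not match the tool's real interface.
gatk HtsgetReader \
    --url https://htsget.example.org/reads \
    --id NA12878 \
    -O NA12878.bam
```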
- HaplotypeCaller/Mutect2
  - Fixed an "evidence provided is not in sample" error in `HaplotypeCaller` when performing contamination downsampling (#6593)
    - This fixes the issue reported in #6586
  - Fixed a "String index out of range" error in the `TandemRepeat` annotation with `HaplotypeCaller` and `Mutect2` (#6583)
    - This addresses an edge case reported in #6516 where an alt haplotype starts with an indel, and hence the variant start is one base before the assembly region due to padding with a leading matching base
  - Better documentation for `FilterAlignmentArtifacts` (#6638)
  - Updated the `CreateSomaticPanelOfNormals` documentation (#6584)
  - Improved the tests for `NuMTFilterTool` (#6569)
- PathSeq
  - Major overhaul of the PathSeq WDLs (#6536)
    - This new PathSeq WDL redesigns the workflow for improved performance in the cloud.
    - Downsampling can be applied to BAMs with high microbial content (i.e., >10M reads) that normally cause performance issues.
    - Removed the microbial fasta input, as only the sequence dictionary is needed.
    - Broke the pipeline down into smaller tasks. This helps reduce costs by a) provisioning fewer resources at the filter and score phases of the pipeline and b) reducing job wall time to minimize the likelihood of VM preemption.
    - Added a filter-only option, which can be used to cheaply estimate the number of microbial reads in the sample.
    - Metrics are now parsed so they can be fed as output to the Terra data model.
    - Added CRAM-to-BAM capability
    - Updated the WDL readme
    - Deleted the unneeded WDL json configuration, as the configuration can be provided in Terra
  - Added an `--ignore-alignment-contigs` argument to PathSeq filtering that lets users specify any contigs that should be ignored (#6537)
    - This is useful for BAMs aligned to hg38, which contains the Epstein-Barr virus (chrEBV)
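For illustration, the new argument might be passed to the filtering step as sketched below. Only `--ignore-alignment-contigs` is confirmed by these notes; the tool name and other arguments are from memory of the PathSeq interface and may differ, and the usual host k-mer and BWA image inputs are omitted.

```shell
# Illustrative sketch; file names are placeholders and other required
# filtering inputs (e.g., host k-mer and BWA image files) are omitted.
gatk PathSeqFilterSpark \
    -I aligned_hg38.bam \
    --ignore-alignment-contigs chrEBV \
    --paired-output filtered_paired.bam \
    --unpaired-output filtered_unpaired.bam
```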
- GenomicsDB
  - Upgraded to GenomicsDB version 1.3.0 (#6654)
    - Added a new argument `--genomicsdb-shared-posixfs-optimizations` to help with shared POSIX filesystems like NFS and Lustre. This disables file locking and, for GenomicsDB import, minimizes writes to disk. On NFS, the import time for about 10 samples from one of the GATK datasets went from 23.72m to 6.34m, comparable to importing to a local filesystem. Hopefully this helps with issues #6487 and #6627. Also fixes issue #6519.
    - This version of GenomicsDB also uses pre-compression filters for offset and compression files for new workspaces and GenomicsDB arrays. Using the same dataset and 10 samples as above, the total size of a GenomicsDB workspace went from 313MB to 170MB with no change in import and query times. Smaller GenomicsDB arrays also help with performance on distributed and cloud file systems.
    - This version adds support for handling MNVs similar to deletions, as described in issue #6500.
    - There is added support in `GenomicsDBImport` for multiple contigs in the same GenomicsDB partition/array. This will hopefully help import times in cases where users have many thousands of contigs. Changes are still needed on the GATK side to make use of this support.
    - Logging has been improved somewhat, with the native C/C++ code using spdlog and fmt, and the Java layer using Apache log4j with a log4j.properties provided by the application. Also, info messages like "No valid combination operation found for INFO field AA - the field will NOT be part of INFO fields in the generated VCF records" will only be output once per operation.
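As a sketch, an import onto an NFS-mounted workspace might enable the new flag like this. The flag itself comes from these notes; the workspace path, interval, and sample GVCFs are placeholders.

```shell
# Illustrative invocation; workspace path, interval, and GVCFs are placeholders.
gatk GenomicsDBImport \
    --genomicsdb-workspace-path /nfs/shared/pond_workspace \
    --genomicsdb-shared-posixfs-optimizations true \
    -L chr20 \
    -V sample1.g.vcf.gz \
    -V sample2.g.vcf.gz
```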
  - Made `VCFCodec` the default for query streams from GenomicsDB (#6675)
    - This fixes the frequently-reported `NullPointerException` in `GenotypeGVCFs` when reading from GenomicsDB (see #6667)
    - Added a `--genomicsdb-use-bcf-codec` argument to opt back in to using the `BCFCodec`, which is faster but prone to the above error on certain datasets
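Opting back in to the faster BCF codec might look like the following sketch. Only `--genomicsdb-use-bcf-codec` is confirmed by these notes; reference, workspace path, and output names are placeholders.

```shell
# Illustrative invocation; paths are placeholders.
gatk GenotypeGVCFs \
    -R reference.fasta \
    -V gendb:///nfs/shared/pond_workspace \
    --genomicsdb-use-bcf-codec true \
    -O genotyped.vcf.gz
```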
- CNV Tools
- Docker/Conda Overhaul (#5026)
  - Our docker image is now built off of Ubuntu 18.04 instead of 16.04
    - This brings in newer versions of several important packages such as `samtools`
  - Updated many of the Python libraries installed via our conda environment and included in our Docker image to newer versions, resolving several outstanding issues in the process
  - R dependencies are now installed via conda in our Docker build instead of the now-removed `install_R_packages.R` script
    - Due to this change, we recommend that tools that use R packages (e.g., to create plots) be run using the GATK docker image or the conda environment.
  - NOTE: significant updates and changes to the Ubuntu version, native packages, and R/Python packages may result in corresponding numerical changes in results.
- Mitochondrial Pipeline
  - Minor updates to the mitochondrial pipeline WDLs (#6597)
- Notable Enhancements
  - `RevertSamSpark` now supports CRAMs (#6641)
  - Fixed a `VariantAnnotator` performance issue that could cause the tool to run very slowly on certain inputs (#6672)
  - More flexible matching of dbSNP variants during variant annotation (#6626)
    - All dbSNP IDs that match a particular variant are now added to the variant's ID, instead of just the first one found in the dbSNP VCF.
    - Matching is less brittle to variant normalization issues, and matches differing representations of the same underlying variant. This is implemented by splitting and trimming multiallelics before checking for a match; normalization differences are suspected to be the predominant cause of these types of matching failures.
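The split-and-trim idea can be sketched as follows. This is a minimal illustration of the matching strategy, not GATK's actual implementation; the tuple record representation is invented for the example.

```python
def trim(pos, ref, alt):
    """Trim shared trailing, then leading, bases; adjust the position."""
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt

def normalized_keys(chrom, pos, ref, alts):
    """Split a possibly multiallelic record into trimmed biallelic keys."""
    return {(chrom,) + trim(pos, ref, alt) for alt in alts}

def matches(rec_a, rec_b):
    """True if the records share any normalized biallelic representation."""
    return bool(normalized_keys(*rec_a) & normalized_keys(*rec_b))

# A CA->C deletion matches a multiallelic dbSNP record written as CAA->CA,C,
# even though the raw REF/ALT strings differ:
print(matches(("1", 100, "CA", ["C"]), ("1", 100, "CAA", ["CA", "C"])))  # True
```

Comparing the trimmed biallelic keys, rather than raw REF/ALT strings, is what lets two differently-padded encodings of the same indel match.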
  - Added a `--min-num-bases-for-segment-funcotation` argument to `FuncotateSegments` (#6577)
    - This allows segments shorter than 150 bases to be annotated if specified at run time (defaults to 150 bases to preserve the previous behavior).
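As a sketch, the new argument might be supplied as below. Only `--min-num-bases-for-segment-funcotation` is confirmed by these notes; the other arguments are from memory of the Funcotator-family interface and may differ, and all paths are placeholders.

```shell
# Illustrative invocation; companion arguments and paths are placeholders.
gatk FuncotateSegments \
    --segments tumor.called.seg \
    -R reference.fasta \
    --ref-version hg38 \
    --data-sources-path funcotator_dataSources \
    --output-file-format SEG \
    --min-num-bases-for-segment-funcotation 25 \
    -O tumor.called.funcotated.seg
```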
  - `SplitIntervals` can now handle more than 10,000 shards (#6587)
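For instance, scattering an interval list into 20,000 shards might look like this sketch (reference, interval list, and output directory names are placeholders):

```shell
# Illustrative invocation; inputs and output directory are placeholders.
gatk SplitIntervals \
    -R reference.fasta \
    -L intervals.interval_list \
    --scatter-count 20000 \
    -O scattered_intervals/
```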
- Bug Fixes
  - Fixed interval summary files being empty in `DepthOfCoverage` (#6609)
  - Fixed a crash in the BQSR R script with newer versions of R (#6677)
  - Fixed a crash when reporting an error while trying to build GATK with a JRE (#6676)
  - Fixed an issue where `ReadsSourceSpark.getHeader()` wasn't propagating the reference when a CRAM file input resides on GCS, so it always resulted in a "no reference was provided" error, even when a reference was provided (#6517)
  - Fixed an issue where `ReadsSourceSpark.checkCramReference()` always tried to create a Hadoop `Path` object for the reference regardless of which file system it lives on, which fails when using a reference on GCS (#6517)
  - Fixed an issue where the tab completion integration tests weren't emitting any output (#6647)
- Miscellaneous Changes
  - Created a new `ReadsDataSource` interface (#6633)
  - Migrated read arguments and downstream code to `GATKPath` (#6561)
  - Renamed `GATKPathSpecifier` to `GATKPath` (#6632)
  - Added a read/write roundtrip Spark integration test for a CRAM and reference on HDFS (#6618)
  - Deleted redundant methods in `SVCigarUtils`, and rewrote and moved the rest to `CigarUtils` (#6481)
  - Re-enabled tests for HTSGET now that the reference server is back to a stable version (#6668)
  - Disabled `SortSamSparkIntegrationTest.testSortBAMsSharded()` (#6635)
  - Fixed a typo in a `SortSamSpark` log message (#6636)
  - Removed an incorrect logger from `DepthOfCoverage` (#6622)
- Documentation
  - Fixed annotation equation rendering in the tool docs (#6606)
  - Added a note on how to filter on MappingQuality in `DepthOfCoverage` (#6619)
  - Clarified the docs for the `--gcs-project-for-requester-pays` argument to mention the need for `storage.buckets.get` permission on the bucket being accessed (#6594)
  - Fixed a dead forum link in the `SelectVariants` documentation (#6595)
- Dependencies