4.2.1.0
Download release: gatk-4.2.1.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.2.1.0 release:
- Several important fixes to HaplotypeCaller and the new DRAGEN-GATK code introduced in GATK 4.2.0.0
- Started laying the groundwork in `Mutect2` for `Mutect3`, which will be more machine learning focused
- `LocalAssembler`: a new tool that performs local assembly of small regions to discover structural variants (#6989)
- Support for multi-sample segmentation in `ModelSegments`
- Major speed improvements and several important fixes to `Funcotator`
- A new version of the Intel Genomics Kernel Library (GKL), with many important fixes and improvements
- A new version of GenomicsDB, with improved cloud support
- A GATK-wide option to shard VCFs on output, which is often useful for pipelining
- GATK support for block compressed interval (`.bci`) files, which is useful when working with extremely large interval lists
Full list of changes:
- New Tools
  - `LocalAssembler`: a new tool that performs local assembly of small regions to discover structural variants (#6989)
- HaplotypeCaller
  - Fixed a rare edge case in DRAGEN mode that could result in negative GQs when `USE_POSTERIOR_PROBABILITIES` is set (#7120)
  - Fixed a rare edge case (mainly affecting DRAGEN mode) that could cause the PL arrays to be deleted when genotyping in `HaplotypeCaller` (#7148)
  - Fixed a bug in `AlleleLikelihoods` that could result in new evidence X being assigned arbitrary likelihoods left over from previous evidence (#7154)
  - Fixed a "Padded span must contain active span" error caused by invalid feature file intervals that weren't being checked for validity against the sequence dictionary (#7295)
  - The artificial haplotype read group is no longer added to the bamout file when `--bam-writer-type NO_HAPLOTYPES` is specified (#7141)
  - Suppressed excessive log output related to `JumboAnnotation` warnings in `HaplotypeCaller` (#7358)
- DRAGEN-GATK
- Mutect2
  - Added a training data mode (`--training-data-mode`) to `Mutect2` to prepare for `Mutect3` (#7109)
    - Training data mode collects data on variant- and artifact-supporting read sets for fitting a deep learning filtering model
  - Better error bars for samples with small contamination in `CalculateContamination` (#7003)
- Funcotator
  - Greatly improved `Funcotator` performance by optimizing the VCF sanitization code (#7370)
    - In our tests, this change appears to speed up the tool by roughly 2x
  - Updated the Gencode GTF Codec to be more permissive with transcript and gene types (#7166)
    - The Gencode GTF Codec no longer restricts `transcriptType` and `geneType` to a limited set of values. These fields are now each stored as a String. This allows for arbitrary values in these fields and will help to future-proof (and species-proof) the GTF parser.
    - Fixes "IndexFeatureFile Error to Run Funcotator with Mouse Ensembl GTF" (#7054)
  - Can now decode codons containing IUPAC bases into amino acids. (#7188)
  - Updated the tool to allow for protein changes with N / IUPAC bases. (#6778)
  - Added the ability to have IUPAC bases in either the ref/alt alleles or in the reference when calculating the amino acid sequence. In this case, the code no longer throws a user exception; instead it logs a warning and produces `?` amino acids when they cannot be decoded from the amino acid table. Currently this happens any time an N or IUPAC base is in the region to be coded into amino acids.
  - Added `AminoAcid.UNDECODABLE` as a placeholder for any unknown / undecodable amino acid (such as in the case of an ambiguous IUPAC base).
  - `Funcotator` now checks whether the input has already been annotated, and by default throws an error in that case.
    - We also added a `--reannotate-vcf` override argument to explicitly allow reannotation (#7349)
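As a rough illustration of what decoding a codon that contains IUPAC bases involves, the Python sketch below expands each ambiguity code into the concrete bases it can stand for and reports an amino acid only when every expansion agrees; otherwise it returns `?`. This is not Funcotator's implementation. The names `decode_codon`, `IUPAC_BASES`, and `UNDECODABLE` are hypothetical, and the expand-and-compare strategy is an assumption, since these notes do not describe how Funcotator resolves ambiguous bases.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# IUPAC nucleotide codes mapped to the concrete bases they represent.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Hypothetical placeholder for an amino acid that cannot be decoded.
UNDECODABLE = "?"


def decode_codon(codon):
    """Translate a 3-base codon that may contain IUPAC ambiguity codes.

    Returns a single-letter amino acid ('*' for stop) when every concrete
    expansion of the codon translates to the same residue, else UNDECODABLE.
    """
    codon = codon.upper()
    if len(codon) != 3 or any(base not in IUPAC_BASES for base in codon):
        raise ValueError(f"not a valid IUPAC codon: {codon!r}")
    # Translate every concrete codon the ambiguous codon could represent.
    residues = {
        CODON_TABLE["".join(concrete)]
        for concrete in product(*(IUPAC_BASES[base] for base in codon))
    }
    return residues.pop() if len(residues) == 1 else UNDECODABLE
```

Under these assumptions, `decode_codon("GCN")` yields `A` because all four `GC*` codons encode alanine, while `decode_codon("ANA")` yields `?` because its expansions encode different residues.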
- CNV Calling
- SV Calling
  - Added `LocalAssembler`, a new tool that performs local assembly of small regions to discover structural variants (#6989)
- The Genomics Kernel Library (GKL)
  - Updated to GKL version 0.8.8, and removed the FPGA PairHMM as an option (#7203)
    - This is a significant update to the GKL that comes with many fixes and improvements:
      - Updated ISAL and OTC Zlib libraries to latest version (Q1 2021)
      - Fixed 3 reproducible issues and retested out of 4 more in GKL
      - Updated build for CentOS 7 and current Mac
      - Ran valgrind on limited C unit tests (passed)
      - Major improvements to input validation
      - Major updates to error handling and propagation
      - Added negative space unit testing coverage
      - Regular static code scanning
      - Good overall quality of life improvement for the software
- GenomicsDB
  - Moved to GenomicsDB 1.4.1, and added a toggle between the GCS Connector and native GCS support (#7224)
    - This release allows for the direct use of the native GCS C++ client instead of the GCS Cloud Connector via HDFS. The GCS Cloud Connector can still be used with GenomicsDB via the `--genomicsdb-use-gcs-hdfs-connector` option.
    - Using the native client with GCS allows GenomicsDB to use the standard paradigms for authentication, retries with exponential backoff, configuring credentials, etc., and also helps with performance issues with GCS. See #7070.
  - Allow specifying S3 and Azure blob storage URIs to GenomicsDB in addition to GCS and HDFS (#7271)
  - Fixes related to the GenomicsDB upgrade (#7257)
  - Improved the error message in `GenomicsDBImport` when failing to open a `FeatureReader` (#7375)
- Mitochondrial pipeline
  - Added median coverage metric to the mitochondrial pipeline (#7253)
- Notable Enhancements
  - Added a GATK-wide option (`--max-variants-per-shard`) to shard VCFs on output (#6959)
    - Sharded output is often extremely useful for pipelining
  - Added GATK support for block compressed interval (`.bci`) files (#7142)
  - Added an `AlleleDepthPseudoCounts` (DD) genotype annotation. (#7303)
    - Similar to AD, the new annotation (DD) captures the depth of each allele's supporting evidence or reads; however, it does so by following a variational Bayes approach looking into the likelihoods rather than applying a fixed threshold. This turns out to be more robust in some instances.
    - To get the new non-standard annotation in `HaplotypeCaller` you need to add `-A AllelePseudoDepth`
  - We now track the source of variants in `MultiVariantWalkers`, which is important for some tools such as `VariantEval` (#7219)
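To make the idea of sharding by a maximum number of variants concrete, here is a minimal Python sketch that splits any stream of records into consecutive chunks of at most a given size. It only illustrates the chunking concept: `shard_records` and its parameters are hypothetical names, and the sketch says nothing about how GATK names, indexes, or writes the shard files.

```python
from itertools import islice


def shard_records(records, max_per_shard):
    """Yield consecutive lists of at most max_per_shard records.

    Works on any iterable, so the input can be streamed rather than
    loaded into memory all at once. The final shard may be smaller.
    """
    if max_per_shard < 1:
        raise ValueError("max_per_shard must be at least 1")
    iterator = iter(records)
    while True:
        # Pull the next chunk off the shared iterator.
        shard = list(islice(iterator, max_per_shard))
        if not shard:
            return
        yield shard
```

For example, seven records with a limit of three produce shards of sizes 3, 3, and 1, each of which a downstream pipeline step could process independently.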
- Bug Fixes
  - Fixed key ordering bugs in the implementations of `Histogram.median()` and `CompressedDataList.iterator()` (#7131)
    - These bugs could result in incorrect RankSumTest annotations in some cases
  - Fixed the `DepthPerSampleHC` and `StrandBiasBySample` annotations to not spam the logs with "Annotation will not be calculated" warnings (#7357)
  - `VariantEval`: fixed contig stratification to defer to user-defined intervals (#7238)
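The following Python sketch shows, in general terms, why key order matters when taking a median from a value-to-count histogram: the cumulative count only locates the middle element if the values are visited in ascending order. It is an analogy, not GATK's Java code; `histogram_median` is a hypothetical name, and these notes do not describe the exact nature of the bug fixed in #7131.

```python
def histogram_median(counts):
    """Return the median of the data summarized by a {value: count} mapping.

    Keys are visited in sorted order; walking them in insertion or hash
    order instead can place the cumulative count at the wrong value.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot take the median of an empty histogram")
    # 0-based ranks of the lower and upper middle elements
    # (the two ranks are identical when total is odd).
    lower_rank = (total - 1) // 2
    upper_rank = total // 2
    lower_value = upper_value = None
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        if lower_value is None and seen > lower_rank:
            lower_value = value
        if seen > upper_rank:
            upper_value = value
            break
    return (lower_value + upper_value) / 2
```

With the histogram `{9: 2, 1: 2, 5: 1}` the underlying data are 1, 1, 5, 9, 9, so the median is 5; accumulating counts in insertion order (9, then 1, then 5) would cross the midpoint at 1 instead.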
- Miscellaneous Changes
  - The `ProgressMeter` can now be completely disabled for all tools / traversals by overriding `GATKTool.disableProgressMeter()` (#7354)
  - We now authenticate with Dockerhub in our Travis builds, to help avoid tests failing due to quota issues (#7204) (#7256)
  - Migrated `VariantEval` to be a `MultiVariantWalkerGroupedOnStart` (#6973)
  - `VariantEval`: added an argument to specify the `PedigreeValidationType` (#7240)
  - Converted `InfoFieldAnnotation`/`GenotypeAnnotation` into interfaces. (#7041)
  - Allow `MultiVariantWalkerGroupedOnStart` subclasses to view/set `ignoreIntervalsOutsideStart` (#7301)
  - `PedigreeAnnotation`: consolidate code, provide getters, and allow `PedigreeValidationType` to be set (#7277)
  - `ASEReadCounter`: added a warning for variants lacking GT fields (#7326)
  - Added filters to `dockstore.yml` so that only the master branch and the releases get synced to Dockstore (#7217)
  - Fixed a compatibility issue between Java 11 and `log4j2` (#7339)
  - We now update the gcloud package signing key at the start of every docker build (#7180)
  - Updated our Artifactory key (#7208)
  - Disabled some Spark dataproc tests because of dependency issues. (#7170)
  - Removed some embedded licenses from scripts (#7340)
- Documentation
  - Variant annotation documentation: removed broken links to related annotations from the tool docs (#7307)
  - Updated the link to an article on Jexl expressions (#7317)
  - Fixed several broken links in docs for the CNV tools (#7309)
  - Fixed broken links in the docs for `Funcotator`, `VariantRecalibrator`, and `ASEReadCounter` (#7270)
  - Fixed typos in the tool documentation for `HaplotypeCaller` and `LeftAlignAndTrimVariants` (#6440)
  - Clarified pipeline inputs in documentation for `GnarlyGenotyper` (#7231)
- Dependencies