
4.2.1.0

@droazen droazen released this 30 Jul 21:27
· 367 commits to master since this release
9951f77

Download release: gatk-4.2.1.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.2.1.0 release:

  • Several important fixes to HaplotypeCaller and the new DRAGEN-GATK code introduced in GATK 4.2.0.0

  • Started laying the groundwork in Mutect2 for Mutect3, which will be more machine learning focused

  • LocalAssembler: a new tool that performs local assembly of small regions to discover structural variants (#6989)

  • Support for multi-sample segmentation in ModelSegments

  • Major speed improvements and several important fixes to Funcotator

  • A new version of the Intel Genomics Kernel Library (GKL), with many important fixes and improvements

  • A new version of GenomicsDB, with improved cloud support

  • A GATK-wide option to shard VCFs on output, which is often useful for pipelining

  • GATK support for block compressed interval (.bci) files, which is useful when working with extremely large interval lists

Full list of changes:

  • New Tools

    • LocalAssembler: a new tool that performs local assembly of small regions to discover structural variants (#6989)
  • HaplotypeCaller

    • Fixed a rare edge case in DRAGEN mode that could result in negative GQs when USE_POSTERIOR_PROBABILITIES is set (#7120)
    • Fixed a rare edge case (mainly affecting DRAGEN mode) that could cause the PL arrays to be deleted when genotyping in HaplotypeCaller (#7148)
    • Fixed a bug in AlleleLikelihoods that could result in new evidence being assigned arbitrary likelihoods left over from previous evidence (#7154)
    • Fixed a "Padded span must contain active span" error caused by invalid feature file intervals that weren't being checked for validity against the sequence dictionary (#7295)
    • HaplotypeCaller no longer adds the artificial haplotype read group to the bamout file when --bam-writer-type NO_HAPLOTYPES is specified (#7141)
    • Suppressed excessive log output related to JumboAnnotation warnings in HaplotypeCaller (#7358)
  • DRAGEN-GATK

    • CalibrateDragstrModel: fixed a sporadic out-of-memory error (#7212)
    • CalibrateDragstrModel: fixed an "IllegalArgumentException: Start cannot exceed end" error (#7212)
  • Mutect2

    • Added a training data mode (--training-data-mode) to Mutect2 to prepare for Mutect3 (#7109)
      • Training data mode collects data on variant- and artifact-supporting read sets for fitting a deep learning filtering model
    • Better error bars for samples with small contamination in CalculateContamination (#7003)
  • Funcotator

    • Greatly improved Funcotator performance by optimizing the VCF sanitization code (#7370)
      • In our tests, this change appears to speed up the tool by roughly 2x
    • Updated the Gencode GTF Codec to be more permissive with transcript and gene types (#7166)
      • Now the Gencode GTF Codec no longer restricts transcriptType and geneType to a limited set of values. These fields are now each stored as a String. This allows for arbitrary values in these fields and will help to future-proof (and species-proof) the GTF parser.
      • Fixes "IndexFeatureFile Error to Run Funcotator with Mouse Ensembl GTF" (#7054)
    • Funcotator can now decode codons containing IUPAC bases into amino acids (#7188)
    • Updated the tool to allow for protein changes with N / IUPAC bases. (#6778)
      • Added support for IUPAC bases in the ref/alt alleles or in the reference when calculating the amino acid sequence. In this case, the code no longer throws a user exception. Instead, it logs a warning and produces ? amino acids wherever they cannot be decoded from the amino acid table. Currently this happens any time an N or IUPAC base is in the region to be coded into amino acids.
      • Added AminoAcid.UNDECODABLE as a placeholder for any unknown / undecodable amino acid (such as in the case of an ambiguous IUPAC base).
    • Funcotator now checks whether the input has already been annotated, and by default throws an error in that case.
      • We also added a --reannotate-vcf override argument to explicitly allow reannotation (#7349)
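The IUPAC handling described in the Funcotator notes above can be illustrated with a minimal Python sketch. This is not Funcotator's code (the tool itself is written in Java). The names decode_codon, translate, and UNDECODABLE are hypothetical, and the sketch assumes the standard genetic code and that any base other than A, C, G, or T makes the whole codon undecodable, which matches the current behavior described above.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1).
# Codons are enumerated with bases in T, C, A, G order, third base varying fastest.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Hypothetical placeholder for an amino acid that cannot be decoded.
UNDECODABLE = "?"


def decode_codon(codon: str) -> str:
    """Return the one-letter amino acid for a codon, or '?' if it is not in the table.

    Assumption: a codon containing N or any other IUPAC ambiguity code is
    treated as undecodable, even if every expansion would give the same residue.
    """
    return CODON_TABLE.get(codon.upper(), UNDECODABLE)


def translate(sequence: str) -> str:
    """Translate a coding sequence codon by codon, ignoring a trailing partial codon."""
    seq = sequence.upper()
    usable_length = len(seq) - len(seq) % 3
    return "".join(
        decode_codon(seq[i:i + 3]) for i in range(0, usable_length, 3)
    )
```

With this sketch, translate("ATGGCNTAA") yields a methionine, a ? placeholder for the codon containing N, and a stop codon, rather than raising an error.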
  • CNV Calling

    • Enabled multi-sample segmentation in ModelSegments (#6499)
    • Removed mapping error rate from estimate of denoised copy ratios output by gCNV, and updated sklearn. (#7261)
    • Moved gCNV sample QA check into the Postprocessing task in the WDL (#7150)
  • SV Calling

    • Added LocalAssembler, a new tool that performs local assembly of small regions to discover structural variants (#6989)
  • The Genomics Kernel Library (GKL)

    • Updated to GKL version 0.8.8, and removed the FPGA PairHMM as an option (#7203)
      • This is a significant update to the GKL that comes with many fixes and improvements:
        • Updated the ISAL and OTC Zlib libraries to their latest versions (Q1 2021)
        • Fixed 3 reproducible issues and retested out of 4 more in GKL
        • Updated the build for CentOS 7 and current Mac
        • Ran valgrind on limited C unit tests (passed)
        • Major improvements to input validation
        • Major updates to error handling and propagation
        • Added negative-space unit test coverage
        • Regular static code scanning
        • Good overall quality of life improvement for the software
  • GenomicsDB

    • Moved to GenomicsDB 1.4.1, and added a toggle between the GCS Connector and native GCS support (#7224)
      • This release allows for the direct use of the native GCS C++ client instead of the GCS Cloud Connector via HDFS. The GCS Cloud Connector can still be used with GenomicsDB via the --genomicsdb-use-gcs-hdfs-connector option.
      • Using the native client with GCS allows for GenomicsDB to use the standard paradigms to help with authentication, retries with exponential backoff, configuring credentials, etc., and also helps with performance issues with GCS. See #7070.
    • Added support for specifying S3 and Azure blob storage URIs to GenomicsDB, in addition to GCS and HDFS (#7271)
    • Fixes related to the GenomicsDB upgrade (#7257)
      • Fixed the combine operation for certain fields so that it takes care not to remap missing fields to NON_REF
      • Fixes "Regression in GenomicsDBImport progress meter" #7222
      • Adds tests for "GenomicsDBImport Creating Workspace Where REF is Inappropriately N?" #7089
    • Improved the error message in GenomicsDBImport when failing to open a FeatureReader (#7375)
  • Mitochondrial pipeline

    • Added median coverage metric to the mitochondrial pipeline (#7253)
  • Notable Enhancements

    • Added a GATK-wide option (--max-variants-per-shard) to shard VCFs on output (#6959)
      • Sharded output is often extremely useful for pipelining
    • Added GATK support for block compressed interval (.bci) files (#7142)
    • Added an AlleleDepthPseudoCounts (DD) genotype annotation. (#7303)
      • Like AD, the new annotation (DD) captures the depth of each allele's supporting evidence or reads. However, it does so by following a variational Bayes approach that looks at the likelihoods rather than applying a fixed threshold. This turns out to be more robust in some instances.
      • To get the new non-standard annotation in HaplotypeCaller you need to add -A AllelePseudoDepth
    • We now track the source of variants in MultiVariantWalkers, which is important for some tools such as VariantEval (#7219)
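As a rough illustration of how the two new command-line options above might be combined in a pipeline script, here is a minimal Python sketch that only assembles an argument list and does not run anything. The function name build_haplotypecaller_command, the file names, and the shard size are hypothetical. The --max-variants-per-shard and -A AllelePseudoDepth arguments come from the notes above, and -R, -I, and -O are the usual GATK reference, input, and output arguments. How the shard files are named on disk is not covered here.

```python
from typing import List, Optional


def build_haplotypecaller_command(
    reference: str,
    input_bam: str,
    output_vcf: str,
    max_variants_per_shard: Optional[int] = None,
    add_pseudo_depth: bool = False,
) -> List[str]:
    """Assemble a HaplotypeCaller argument list (hypothetical helper, nothing is executed)."""
    command = [
        "gatk", "HaplotypeCaller",
        "-R", reference,
        "-I", input_bam,
        "-O", output_vcf,
    ]
    if max_variants_per_shard is not None:
        if max_variants_per_shard <= 0:
            raise ValueError("max_variants_per_shard must be positive")
        # GATK-wide option added in this release to shard VCFs on output.
        command += ["--max-variants-per-shard", str(max_variants_per_shard)]
    if add_pseudo_depth:
        # Requests the new non-standard DD genotype annotation.
        command += ["-A", "AllelePseudoDepth"]
    return command


# Example with placeholder file names and an arbitrary shard size.
example = build_haplotypecaller_command(
    "ref.fasta", "sample.bam", "out.vcf.gz",
    max_variants_per_shard=100000,
    add_pseudo_depth=True,
)
```

The resulting list can be passed to subprocess.run or joined into a shell command. Building the list separately keeps the optional arguments easy to test.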
  • Bug Fixes

    • Fixed key ordering bugs in the implementations of Histogram.median() and CompressedDataList.iterator() (#7131)
      • These bugs could result in incorrect RankSumTest annotations in some cases
    • Fixed the DepthPerSampleHC and StrandBiasBySample annotations to not spam the logs with "Annotation will not be calculated" warnings (#7357)
    • VariantEval: fixed contig stratification to defer to user-defined intervals (#7238)
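To see why key ordering matters for a histogram median like the one fixed above, consider this minimal Python sketch. It is not GATK's Histogram class. The function name histogram_median is hypothetical, and the convention of averaging the two middle observations for an even count is an assumption made for illustration. The point is that the cumulative count only identifies the middle observation when the keys are visited in sorted order.

```python
from typing import Dict


def histogram_median(counts: Dict[float, int]) -> float:
    """Median of the observations summarized by a value -> count histogram."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot take the median of an empty histogram")

    # 0-based ranks of the middle observation(s); equal when total is odd.
    lower_rank = (total - 1) // 2
    upper_rank = total // 2

    lower = upper = None
    seen = 0
    # Iterating in sorted key order is essential: with insertion order,
    # the running count would not correspond to ranks in the sorted data.
    for value in sorted(counts):
        seen += counts[value]
        if lower is None and seen > lower_rank:
            lower = value
        if seen > upper_rank:
            upper = value
            break
    return (lower + upper) / 2
```

For example, a histogram built in the order 3, 1, 2 (one observation each) has a median of 2, but walking the keys in insertion order would stop at 1 instead.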
  • Miscellaneous Changes

    • The ProgressMeter can now be completely disabled for all tools / traversals by overriding GATKTool.disableProgressMeter() (#7354)
    • We now authenticate with Dockerhub in our Travis builds, to help avoid tests failing due to quota issues (#7204) (#7256)
    • Migrated VariantEval to be a MultiVariantWalkerGroupedOnStart (#6973)
    • VariantEval: added an argument to specify the PedigreeValidationType (#7240)
    • Converted InfoFieldAnnotation/GenotypeAnnotation into interfaces. (#7041)
    • Allowed MultiVariantWalkerGroupedOnStart subclasses to view/set ignoreIntervalsOutsideStart (#7301)
    • PedigreeAnnotation: consolidated code, provided getters, and allowed PedigreeValidationType to be set (#7277)
    • ASEReadCounter: added a warning for variants lacking GT fields (#7326)
    • Added filters to dockstore.yml so that only the master branch and the releases get synced to Dockstore (#7217)
    • Fixed a compatibility issue between Java 11 and log4j2 (#7339)
    • We now update the gcloud package signing key at the start of every docker build (#7180)
    • Updated our Artifactory key (#7208)
    • Disabled some Spark dataproc tests because of dependency issues. (#7170)
    • Removed some embedded licenses from scripts (#7340)
  • Documentation

    • Variant annotation documentation: removed broken links to related annotations from the tool docs (#7307)
    • Updated the link to an article on Jexl expressions (#7317)
    • Fixed several broken links in docs for the CNV tools (#7309)
    • Fixed broken links in the docs for Funcotator, VariantRecalibrator, and ASEReadCounter (#7270)
    • Fixed typos in the tool documentation for HaplotypeCaller and LeftAlignAndTrimVariants (#6440)
    • Clarified pipeline inputs in documentation for GnarlyGenotyper (#7231)
  • Dependencies

    • Updated HTSJDK to version 2.24.1 (#7149)
    • Updated Picard to version 2.25.4 (#7255)
    • Updated GenomicsDB to version 1.4.1 (#7224)
    • Updated the Genomics Kernel Library (GKL) to version 0.8.8 (#7203)