Releases: broadinstitute/gatk

4.6.0.0

29 Jun 23:24
64348bc

Download release: gatk-4.6.0.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.6.0.0 release:

  • We've fixed a serious CRAM writing bug that affects GATK versions 4.3 through 4.5 and Picard versions 2.27.3 through 3.1.1. This bug can, in limited cases, lead to reads with an incorrect base sequence being written. See this comment to GATK issue 8768 and the full release notes below for more details on what conditions trigger the bug.

    • To help users detect whether their CRAM files are affected, we've released a CRAM scanning tool called CRAMIssue8768Detector that can detect whether a particular CRAM file is affected by this bug. If you suspect that some of your CRAM files may have been affected, please run this tool on them for confirmation!
  • By overwhelming popular demand, we've switched back to using the standard ./. representation for no-calls in GenotypeGVCFs and GenomicsDB instead of 0/0 with DP=0. This reverts the change described in our article GenotypeGVCFs and the death of the dot.

    • We intend to publish a new article shortly to replace that older article with further details on this change. When we do so, we'll link to it from here.
  • The Mutect2 germline resource can now have split multiallelic format

  • Added an --inverted-read-filter argument to allow for selecting reads that fail read filters from the command line easily

  • We've fixed a number of issues with HTTP support, mainly affecting the loading of side inputs such as indices over HTTP

  • Reduced the number of layers in the GATK docker image to help users running into docker quota issues

Full list of changes:

  • Important CRAM writing bug fix and detection tool

    • We've updated to HTSJDK 4.1.1 and Picard 3.2.0 (#8900), which fix a serious bug in the CRAM writing code first reported in GATK issue 8768
    • This issue affects GATK versions 4.3.0.0 through 4.5.0.0, and is fixed in GATK 4.6.0.0.
    • This issue also affects Picard versions 2.27.3 through 3.1.1, and is fixed in Picard 3.2.0.
    • The bug is triggered when writing a CRAM file using one of the affected GATK/Picard versions, and both of the following conditions are met:
      • At least one read is mapped to the very first base of a reference contig
      • The file contains more than one CRAM container (10,000 reads) with reads mapped to that same reference contig
    • When both of these conditions are met, the resulting CRAM file may have corrupt containers associated with that contig containing reads with an incorrect sequence.
    • Since many common references such as hg38 have N's at the very beginning of the autosomes and X/Y, many pipelines will not be affected by this bug. However, users of a telomere-to-telomere reference, users doing mitochondrial calling, and users with reads aligned to the alt sequences will want to scan their CRAM files for possible corruption.
    • The other mitigating circumstance is that when a CRAM is affected, the signal will be overwhelmingly obvious, with the mismatch rate typically jumping from sub-1% to 80-90% for the affected regions, making it likely to be caught by standard QC processes.
    • We've released a CRAM scanning tool called CRAMIssue8768Detector (#8819) that can detect whether a particular CRAM file is affected by this bug. If you suspect that some of your CRAM files may have been affected, please run this tool on them for confirmation!
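The two trigger conditions listed above can be written as a small predicate. This is an illustrative sketch only, not the logic used by CRAMIssue8768Detector: the function name and the input shape (a mapping from contig name to 1-based alignment starts) are hypothetical, and "more than one container" is approximated as "more than 10,000 reads on the contig".

```python
from typing import Dict, List

READS_PER_CONTAINER = 10_000  # container size quoted in the release notes


def contigs_at_risk(starts_by_contig: Dict[str, List[int]],
                    reads_per_container: int = READS_PER_CONTAINER) -> List[str]:
    """Return contigs that meet both trigger conditions for GATK issue 8768.

    starts_by_contig maps a contig name to the 1-based alignment start of
    every read mapped to it (a hypothetical input shape for this sketch).
    """
    at_risk = []
    for contig, starts in starts_by_contig.items():
        # Condition 1: at least one read is mapped to the very first base.
        has_read_at_first_base = any(start == 1 for start in starts)
        # Condition 2 (approximation): the contig's reads cannot fit in a
        # single container, so more than one container covers this contig.
        spans_multiple_containers = len(starts) > reads_per_container
        if has_read_at_first_base and spans_multiple_containers:
            at_risk.append(contig)
    return sorted(at_risk)
```

A contig flagged here is only a candidate for corruption; confirming it still requires running the detector tool on the CRAM file itself.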
  • Joint Calling

    • We've switched back to using the standard ./. representation for no-calls in GenotypeGVCFs and GenomicsDB instead of 0/0 with DP=0 (#8715) (#8741) (#8759)
    • Fix for GenotypeGVCFs with mixed ploidy sites (#8862)
    • Fix for GnarlyGenotyper when PLs are null (#8878)
    • Fixed bug in ReblockGVCF when removing annotations (#8870)
    • Enable ReblockGVCF to subset AS annotations that aren't "raw" (pipe-delimited) (#8771)
    • Remove header lines in ReblockGVCF when we remove FORMAT annotations (#8895)
    • ReblockGVCF: Add malaria spanning deletion exception regression test with fix (#8802)
    • Restore some GnarlyGenotyper tests (#8893)
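Downstream scripts that must read output produced both before and after the no-call change in this section may want to treat the two representations alike. A minimal sketch, assuming each genotype arrives as a GT string plus an optional DP value; the function name and input shape are hypothetical and not part of GATK.

```python
import re
from typing import Optional


def is_no_call(gt: str, dp: Optional[int]) -> bool:
    """Treat both './.' and the '0/0 with DP=0' form as no-calls."""
    alleles = re.split(r"[/|]", gt)
    # Standard representation: every allele is missing, e.g. './.' or '.'.
    if all(allele == "." for allele in alleles):
        return True
    # Representation being reverted: hom-ref genotype with zero depth.
    return all(allele == "0" for allele in alleles) and dp == 0
```

A hom-ref genotype with a missing DP is deliberately not treated as a no-call here, since only an explicit DP=0 carried that meaning.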
  • HaplotypeCaller

    • Fix to long deletions that overhang into the assembly window causing exceptions in HaplotypeCaller (#8731)
  • Mutect2

    • The Mutect2 germline resource can now have split multiallelic format (#8837)
    • Make the Mutect2 haplotype and clustered events filters smarter about germline events (#8717)
    • Added the DragSTR model to the Mutect2 WDL (#8716)
    • Improvements to Mutect2's Permutect training data mode (#8663)
    • Bigger Permutect tensors and Permutect test datasets can be annotated with truth VCF (#8836)
    • Mutect2 WDL and GetSampleName can handle multiple sample names in BAM headers (#8859)
    • Permutect dataset engine outputs contig and read group indices, not names (#8860)
    • Normal artifact LOD is now defined without the extra minus sign (#8668)
  • CNV Calling

    • Fixed the GT header in PostprocessGermlineCNVCalls's --output-genotyped-intervals output (#8621)
  • SV Calling

    • Reduced SVConcordance memory footprint (#8623)
    • Rewrote complex SV functional annotation in SVAnnotate (#8516)
    • We now handle the CTX_INV subtype in SVAnnotate (#8693)
  • Flow-based Calling

    • SNVQ recalibration tool added for flow-based reads (#8697)
    • Bug fix in flow-based allele filtering (#8775)
    • Fixed a bug in flow-based AlleleFiltering that ignored more than a single sample (#8841)
    • Fixed an edge case in flow-based variant annotation (#8810)
  • Notable Enhancements

    • Added an --inverted-read-filter argument to allow for selecting reads that fail read filters from the command line easily (#8724)
    • Inverted SoftClippedReadFilter to conform to the standard filtering logic (#8888)
    • Reduced the number of docker layers in the GATK image from 44 to 16 (#8808)
    • VariantFiltration: added a --mask-description argument to write custom mask filter description in VCF header (#8831)
    • GatherVcfsCloud is no longer beta (#8680)
  • Miscellaneous Changes

    • GetPileupSummaries now uses the standard MappingQualityReadFilter instead of a custom --min-mapping-quality argument (#8781)
    • Funcotator: suppress a log message about b37 contigs when not doing b37/hg19 conversion (#8758)
    • Output the new image name at the end of a successful cloud docker build (#8627)
    • Exclude the test folder from code coverage calculations (#8744)
    • Removed deprecated genomes in the cloud docker image that was causing CNN WDL test failures (#8891)
    • Re-commit large test files as lfs stubs (#8769)
    • Standardize test results directory between normal/docker tests (#8718)
    • Improve failure message in VariantContextTestUtils (#8725)
    • Update the setup_cloud github action (#8651)
    • Parameterize the logging frequency for ProgressLogger in GatherVcfsCloud (#8662)
  • Documentation

    • Updated the README to include list of popular software included in docker image (#8745)
  • Dependencies

    • Updated HTSJDK to 4.1.1, which fixes the CRAM writing bug described above (#8900)
    • Updated Picard to 3.2.0, which fixes the CRAM writing bug described above (#8900)
    • Updated GenomicsDB to 1.5.3, which supports M1 Macs and switches no-call representation back to ./. (#8710) (#8759)
    • Updated http-nio to 1.1.1, which fixes several URL-handling bugs with HTTP support (#8889)
    • Updated several miscellaneous dependencies to fix security vulnerabilities (#8898)

4.5.0.0

13 Dec 22:53
8317d8b

Download release: gatk-4.5.0.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.5.0.0 release:

  • HaplotypeCaller now supports custom ploidy regions that can be specified via a new --ploidy-regions argument, overriding the global -ploidy setting

  • The default SmithWaterman implementation for HaplotypeCaller and Mutect2 is now the hardware-accelerated version, resulting in a significant speedup

  • Funcotator has a new datasource release that brings in the latest version of Gencode and several other key data sources

  • We've updated our dependencies and our docker environment to greatly cut down on known security vulnerabilities

  • We've greatly improved support for http/https inputs in GATK-native tools (though most Picard tools bundled with GATK do not yet support it)

  • We've ported some additional DRAGEN features to HaplotypeCaller that bring us closer to functional equivalence with DRAGEN v3.7.8

  • GenomicsDBImport now has support for Azure storage az:// URIs

  • GnarlyGenotyper now has haploid support

  • Lots of important bug fixes, including a fix for a bug in the Intel GKL that could cause output files to intermittently fail to be compressed properly

Full list of changes:

  • HaplotypeCaller

    • HaplotypeCaller now supports custom ploidy regions (#8609)
      • Added a new argument to HaplotypeCaller called --ploidy-regions which allows the user to input a .bed or .interval_list with the "name" column equal to a positive integer for the ploidy to use when calling variants in that region
      • The main use case is for calling haploid variants outside the PAR for XY individuals as required by the VCF spec, but this provides a much more flexible interface for other similar niche applications, like genotyping individuals with other known aneuploidies
      • The global -ploidy flag will still provide the background default (or the built-in ploidy of 2 for humans), but the user-supplied values will supersede these in overlapping regions
    • Changed the SmithWaterman implementation to default to FASTEST_AVAILABLE (#8485)
    • Fixed a bug in pileup calling mode relating to the number of haplotypes (#8489)
    • Huge simplification of genotyping likelihoods calculations -- no change in output (#6351)
    • Be explicit about when variants are biallelic (#8332)
    • Fixed debug log severity for read threading assembler messages (#8419)
    • Fixed issue with visibility of the --dont-use-softclipped-bases argument (#8271)
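The precedence rule for --ploidy-regions described in this section (region values supersede the global -ploidy default where they overlap) can be sketched as follows. This illustrates the documented behavior only and is not GATK's implementation; the function names are hypothetical, the coordinates are made up, and BED-style 0-based half-open intervals are an assumption of the sketch.

```python
from typing import List, NamedTuple


class PloidyRegion(NamedTuple):
    contig: str
    start: int   # 0-based, inclusive (BED convention, assumed)
    end: int     # 0-based, exclusive
    ploidy: int


def parse_ploidy_bed(text: str) -> List[PloidyRegion]:
    """Parse BED lines whose 4th ("name") column is a positive integer ploidy."""
    regions = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        contig, start, end, name = line.split("\t")[:4]
        ploidy = int(name)
        if ploidy <= 0:
            raise ValueError(f"ploidy must be a positive integer: {line!r}")
        regions.append(PloidyRegion(contig, int(start), int(end), ploidy))
    return regions


def ploidy_at(contig: str, pos0: int, regions: List[PloidyRegion],
              default_ploidy: int = 2) -> int:
    """Return the region ploidy if pos0 falls in a region, else the default."""
    for region in regions:
        if region.contig == contig and region.start <= pos0 < region.end:
            return region.ploidy
    return default_ploidy


# Hypothetical haploid region on chrX; everything else stays at the default.
regions = parse_ploidy_bed("chrX\t1000\t2000\t1\n")
print(ploidy_at("chrX", 1500, regions))  # 1
print(ploidy_at("chrX", 2500, regions))  # 2
```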
  • Mutect2

    • Added a --base-qual-correction-factor to allow a scale factor to be provided to modify the base qualities reported by the sequencer and used in the Mutect2 substitution error model (#8447)
      • Set to zero to turn off the error model changes introduced in GATK 4.1.9.0
    • Fixed a bug in FilterMutectCalls for GVCFs (#8458)
      • When using GVCFs with Mutect2 (for example with the Mitochondria mode), in the filtering step ADs for symbolic alleles are set to 0 so they don't contribute to the overall AD. There was an off-by-one error that removed the alt allele AD rather than the <NON_REF> allele AD. This led to NaNs and errors when a site had no ref reads (for example a GT of [ref,alt,<NON_REF>] and AD of [0,300,0] would accidentally be changed to an AD of [0,0,0] if the alt index was removed instead of the <NON_REF> index).
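The off-by-one error is easiest to see on the example from the description above. The sketch below is not the FilterMutectCalls code; it is a hypothetical reconstruction in which the intended behavior zeroes the AD at the index of <NON_REF>, while the buggy variant zeroes the entry one position earlier.

```python
from typing import List

NON_REF = "<NON_REF>"


def zero_symbolic_ad(alleles: List[str], ad: List[int]) -> List[int]:
    """Intended behavior: set the AD of the symbolic <NON_REF> allele to 0."""
    fixed = list(ad)
    fixed[alleles.index(NON_REF)] = 0
    return fixed


def zero_symbolic_ad_buggy(alleles: List[str], ad: List[int]) -> List[int]:
    """Illustrative off-by-one: zeroes the allele just before <NON_REF>."""
    broken = list(ad)
    broken[alleles.index(NON_REF) - 1] = 0
    return broken


alleles = ["ref", "alt", NON_REF]
ad = [0, 300, 0]
print(zero_symbolic_ad(alleles, ad))        # [0, 300, 0]
print(zero_symbolic_ad_buggy(alleles, ad))  # [0, 0, 0]
```

With every AD at zero the total depth is zero, so any allele fraction computed from it divides zero by zero, which is consistent with the NaNs mentioned above.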
  • DRAGEN-GATK

    • Added implementations of the "columnwise detection" and "PDHMM" (partially-determined HMM) features from DRAGEN to bring us much closer to functional equivalence with DRAGEN v3.7.8 (#8083)
    • Development work to prepare the way for the final missing DRAGEN 3.7.8 feature, "joint detection":
      • Graph method for PDHMM event groups that unifies finding/merging and overlap/mutual exclusion (#8366)
      • Rewrote haplotype construction methods in PartiallyDeterminedHaplotypeComputationEngine (#8367)
      • More refactoring in PartiallyDeterminedHaplotypeComputationEngine and preparing for joint detection (#8492)
      • Innocuous housekeeping changes in the partially-determined haplotypes code (#8361)
      • Clarify cryptic bitwise operations in the partially-determined haplotype EventGroup subclass (#8400)
  • Joint Calling

    • Added haploid support to GnarlyGenotyper (#7750)
    • Fix to allow GenotypeGVCFs to properly handle events not in minimal representation (#8567)
    • ReblockGVCF: added a --keep-site-filters argument to keep site-level filters (#8304) (#8308)
    • ReblockGVCF: added a --add-site-filters-to-genotype argument to move site-level filters to genotype-level filters (#8484)
    • ReblockGVCF: added a --format-annotations-to-remove argument to specify format-level annotations to remove from all genotypes in final GVCF (#8411)
    • ReblockGVCF: added a check to make sure the input VCF is a GVCF rather than a single sample VCF (#8411)
    • Improved an error message in GnarlyGenotyper (#8270)
    • Added a mergeWithRemapping() method in ReferenceConfidenceVariantContextMerger to perform allele remapping prior to genotyping (#8318)
    • GVS (Genomic Variant Store) development:
      • Incorporated changes from the GVS branch to existing files (#8256)
      • Incorporated build changes from the GVS branch (#8249)
      • Merged non-GVS bits required by the GVS branch [VS-971] (#8362)
  • GenomicsDB

    • Allow GenomicsDBImport to accept Azure az:// URIs as input (#8438)
    • Updated to a newer GenomicsDB release with Java 17 support, improved error messages/logging, and generally improved performance (#8358)
  • Funcotator

    • New data source release V1.8 (#8512)
      • Updated Gencode to version 43, and also updated COSMIC, Clinvar, and several other datasources to their latest versions
      • The data sources are now split by reference into separate hg19 and hg38 bundles to cut down on size
    • Fixed support for newer Gencode GTF versions by making the GencodeGTFField parsing more permissive (#8351)
    • Fixed Funcotator VCF output renderer to correctly preserve B37 contig names on output for B37 aligned files (#8539)
    • Fix bug in VCF comparison code that causes Funcotator to crash with certain datasources (#8445)
    • Connected the splice site window size to CLI parameters (#8463)
    • Allow LocatableXsvFuncotationFactory to read gzipped files (#8363)
  • CNV Calling

    • Matched gCNV pipeline arguments to those that were shown to have good performance in running large exome cohorts (#8234)
    • Added resource usage section to the GermlineCNVCaller java doc (#8064)
  • SV Calling

    • Added support for breakend replacement alleles in SVCluster (#8408)
      • Implements allele collapsing for "breakend replacement" BND alleles, as described in section 5.4 of the VCFv4.2 spec
    • Size similarity linkage and bug fixes for SV matching tools (#8257)
      • Added size similarity criterion to the SVConcordance and SVCluster tools. This is particularly useful for accurately matching smaller SVs that have a high degree of breakpoint uncertainty, in which case reciprocal overlap does not work well. PESR/mixed variant types must have size similarity, reciprocal overlap, and breakend window criteria met. Depth-only variants may have either size similarity + reciprocal overlap OR breakend window criteria met (or both).
    • Updated SV split-read strand validation and clustering (#8378)
      • Adds some flexibility to the allowed split-read strand annotations on SV records:
        • Allow INS -+ strands
        • Allow INV null strands
        • When clustering, only require that strands match for INV/BND records
    • Sample set and annotation improvements for SVConcordance (#8211)
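The linkage rules for SVConcordance and SVCluster described in this section can be written out as boolean logic. This is a sketch under stated assumptions rather than the tools' implementation: size similarity is taken to be the ratio of the smaller length to the larger, the breakend window test requires both ends to lie within the window, and all names and threshold values are hypothetical.

```python
from typing import NamedTuple


class SV(NamedTuple):
    start: int        # 1-based, inclusive
    end: int          # 1-based, inclusive
    depth_only: bool  # True if supported by depth evidence only


def length(sv: SV) -> int:
    return sv.end - sv.start + 1


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Smallest fraction of either variant covered by their intersection."""
    overlap = min(a.end, b.end) - max(a.start, b.start) + 1
    if overlap <= 0:
        return 0.0
    return min(overlap / length(a), overlap / length(b))


def size_similarity(a: SV, b: SV) -> float:
    """Assumed definition: smaller length divided by larger length."""
    return min(length(a), length(b)) / max(length(a), length(b))


def within_breakend_window(a: SV, b: SV, window: int) -> bool:
    return abs(a.start - b.start) <= window and abs(a.end - b.end) <= window


def is_linked(a: SV, b: SV, min_overlap: float = 0.5,
              min_size_similarity: float = 0.5, window: int = 500) -> bool:
    overlap_ok = reciprocal_overlap(a, b) >= min_overlap
    size_ok = size_similarity(a, b) >= min_size_similarity
    window_ok = within_breakend_window(a, b, window)
    if a.depth_only and b.depth_only:
        # Depth-only: (size similarity AND reciprocal overlap) OR window.
        return (size_ok and overlap_ok) or window_ok
    # PESR/mixed: all three criteria must be met.
    return size_ok and overlap_ok and window_ok
```

Size similarity helps for small variants with uncertain breakpoints because two calls of nearly equal length can still score a low reciprocal overlap when one is shifted relative to the other.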
  • Mitochondrial pipeline

    • Added a variable for the user to specify the java heap size in Picard in the MT pipeline (#8406)
    • Exposed runtime attributes as arguments in the MT pipeline (#8413) (#8417)
  • Flow-based Calling

    • New/updated flow-based read tools (#8579)
      • Added a new GroundTruthScorer tool to score reads against a reference/ground truth
      • Updated FlowFeatureMapper
    • Created an AddFlowBaseQuality tool that writes reads from flow-based SAM/BAM/CRAM files that pass criteria to a new file while adding a base-quality attribute (BQ) (#8235)
    • Added an experimental tool FlowPairHMMAlignReadsToHaplotypes that aligns flow-based reads to a set of haplotypes / templates (#8305)
    • Fixed an issue with reads that contain the tp tag sometimes being incorrectly identified as flow-based (#8337)
    • Minor changes and fixes to flow-based annotations (#8442)
    • Removed a line in FlowBasedAnnotation that contained a bug and thus was meaningless (#8421)
    • Additional annotation in FeatureMap (#8347)
    • Removed unnecessary flow-based argument and option (#8342)
    • GroundTruthScorer doc update (#8597)
    • Removed unnecessary and buggy validation check (#8580)
  • Notable Enhancements

    • Major security fixes in our dependencies and docker environment
      • Updated the GATK base docker image to Ubuntu 22.04 for security fixes and newer versions of genomics packages like samtools and bcftools (#8610)
      • Updated GATK dependencies to address known security vulnerabilities, and added a vulnerability scanner to build.gradle (#8607)
    • Greatly improved HTTP support (#8611)
      • Updated the http-nio library and made tweaks to HTSJDK to make it available in more places. The new version of http-nio should provide much more reliable access to http(s) file paths. This is supported by all methods accessing Paths, and includes SAM/BAM/CRAM and VCF/Feature file...

4.4.0.0

16 Mar 19:08
2dbc025

Download release: gatk-4.4.0.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.4.0.0 release:

  • We've moved to Java 17, the latest long-term support (LTS) Java release, for building and running GATK! Previously we required Java 8, which is now end-of-life.

    • Newer non-LTS Java releases such as Java 18 or Java 19 may work as well, but since they are untested by us we only officially support running with Java 17.
  • Significant enhancements to SelectVariants, including arguments to enable GVCF filtering support and to work with genotype fields more easily.

  • A new tool SVConcordance, that calculates SV genotype concordance between an "evaluation" VCF and a "truth" VCF

  • Bug fixes and enhancements to the support for the Ultima Genomics flow-based sequencing platform introduced in GATK 4.3.0.0

Full list of changes:

  • Flow-based Variant Calling

    • FlowFeatureMapper: added surrounding-median-quality-size feature (#8222)
    • Removed hardcoded limit on max homopolymer call (#8088)
    • Fixed bug in dynamic read disqualification (#8171)
    • Fixed a bug in the parsing of the T0 tag (#8185)
    • Updated flow-based calling Mutect2 parameters to make them consistent with the HaplotypeCaller parameters (#8186)
  • SelectVariants

    • Enabled GVCF type filtering support in SelectVariants (#7193)
      • Added an optional argument --ignore-non-ref-in-types to support correct handling of VariantContexts that contain a NON_REF allele. This is necessary because every variant in a GVCF file would otherwise be assigned the type MIXED, which makes it impossible to filter for e.g. SNPs.
      • Note that this only enables correct handling of GVCF input. The filtered output files are VCF (not GVCF) files, since reference blocks are not extended when a variant is filtered out.
    • SelectVariants: added new arguments for controlling genotype JEXL filtering (#8092)
      • -select-genotype: with this new genotype-specific JEXL argument, we support easily filtering by genotype fields with expressions like 'GQ > 0', where the behavior in the multi-sample case is 'GQ > 0' in at least one sample. It's still possible to manually access genotype fields using the old -select argument and expressions such as vc.getGenotype('NA12878').getGQ() > 0.
      • --apply-jexl-filters-first: This flag is provided to allow the user to do JEXL filtering before subsetting the format fields, in particular the case where the filtering is done on INFO fields only, which may improve speed when working with a large cohort VCF that contains genotypes for thousands of samples.
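The multi-sample behavior of -select-genotype described above ('GQ > 0' in at least one sample) amounts to an any-sample test. A minimal sketch in plain Python, assuming genotypes are represented as a dict of per-sample field dicts; it does not use JEXL or any GATK API, and the function names and GQ values are hypothetical.

```python
from typing import Callable, Mapping

GenotypeFields = Mapping[str, int]


def passes_in_any_sample(genotypes: Mapping[str, GenotypeFields],
                         predicate: Callable[[GenotypeFields], bool]) -> bool:
    """Keep a site if the genotype expression holds for at least one sample."""
    return any(predicate(fields) for fields in genotypes.values())


def gq_positive(fields: GenotypeFields) -> bool:
    """Stand-in for the expression 'GQ > 0'; a missing GQ counts as 0."""
    return fields.get("GQ", 0) > 0


site = {"NA12878": {"GQ": 0}, "NA12891": {"GQ": 35}}
print(passes_in_any_sample(site, gq_positive))  # True
```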
  • SV Calling

    • Added a new tool SVConcordance, that calculates SV genotype concordance between an "evaluation" VCF and a "truth" VCF (#7977)
    • Recognize MEI DELs with ALT format DEL:ME in SVAnnotate (#8125)
    • Don't sort rejected reads output from AnalyzeSaturationMutagenesis (#8053)
  • Notable Enhancements

    • GenotypeGVCFs: added a --keep-specific-combined-raw-annotation argument to keep specified raw annotations (#7996)
    • VariantAnnotator now warns instead of fails when the variant contains too many alleles (#8075)
    • Read filters now output total reads processed in addition to the number of reads filtered (#7947)
    • Added GenomicsDB arguments to the CreateSomaticPanelOfNormals tool (#6746)
    • Added a DeprecatedFeature annotation and a process for officially marking GATK tools as deprecated (#8100)
    • Prevent tool close() methods from hiding underlying errors (#7764)
  • Bug Fixes

    • Fixed issue causing VariantRecalibrator to sometimes fail if user provided duplicate -an options (#8227)
    • ReblockGVCF: remove A, R, and G length attributes when ReblockGVCF subsets an allele (#8209)
      • Previously if an input gVCF had allele length, reference length, or genotype length annotations in the FORMAT field, ReblockGVCF would not remove all of them at sites where an allele was dropped. This makes the output gVCF invalid since the annotation length no longer matches the length described in the header at those sites. Now we fix up F1R2, F2R1, and AF annotations and remove any other annotations that are not already handled that are defined as A, R, or G length in the header.
    • Fixed a gCNV bug that breaks the inference when only 2 intervals are provided (#8180)
    • Fixed NPE from uninitialized logger in GenotypingEngine (#8159)
    • Fixed asynchronous Python exception propagation in StreamingPythonExecutor/CNNScoreVariants (#7402)
    • Fixed issue in ShiftFasta where the interval list output was never written (#8070)
    • Bugfix for the type of some output files in the somatic CNV WDL (#6735) (#8130)
    • MergeAnnotatedRegions now requires a reference as asserted in its documentation (#8067)
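The ReblockGVCF fix for A, R, and G length attributes in this list follows from the VCF convention that an annotation's declared Number fixes how many values it carries at a site. The expected counts can be computed as below; this sketches the rule from the VCF specification, not ReblockGVCF's code, and the function name is hypothetical.

```python
from math import comb


def expected_value_count(number: str, n_alt: int, ploidy: int = 2) -> int:
    """Number of values an annotation must carry at a site with n_alt alts."""
    if number == "A":
        return n_alt                 # one value per alternate allele
    if number == "R":
        return n_alt + 1             # one value per allele, including ref
    if number == "G":
        n_alleles = n_alt + 1
        # One value per possible genotype: multisets of size `ploidy`.
        return comb(n_alleles + ploidy - 1, ploidy)
    raise ValueError(f"unsupported Number: {number!r}")


# Dropping one of two alt alleles changes every expected length.
for number in ("A", "R", "G"):
    print(number, expected_value_count(number, 2), expected_value_count(number, 1))
```

For a diploid site this goes from 2, 3, and 6 values to 1, 2, and 3, which is why annotations left at their old length no longer match the header after an allele is dropped.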
  • Miscellaneous Changes

    • Deprecated an untested VariantRecalibrator argument and an old ReblockGVCF argument that produced invalid GVCFs (#8140)
    • Removed old GnarlyGenotyper code with a diploid assumption to prepare for adding haploid support to GnarlyGenotyper (#8140)
    • ReblockGVCF: add error message for when tree-score-threshold is set but the TREE_SCORE annotation is not present (#8218)
    • TransferReadTags: allow empty unaligned bams as input (#8198)
    • Refactored JointVcfFiltering WDL and expanded tests (#8074)
    • Updated the carrot github action workflow to the most recent version, which supports using #carrot_pr to trigger branch vs master comparison runs (#8084)
    • Replaced uses of File.createTempFile() with IOUtils.createTempFile() to ensure that temp files are deleted on shutdown (#6780)
    • Don't require python just to instantiate the CNNScoreVariants tool classes (#8128)
    • Made several Funcotator methods and fields protected so it is easier to extend the tool (#8124) (#8166)
    • Test for presence of ack result message and simplify ProcessControllerAckResult API (#7816)
    • Fixed the path reported by the gatkbot when there are test failures (#8069)
    • Fixed incorrect boolean value in DirichletAlleleDepthAndFractionIntegrationTest (#7963)
    • Removed two ancient and unused HaplotypeCaller test files that are no longer needed (#7634)
    • Added scattered gCNV case WDL to dockstore file (#8217)
  • Documentation

    • Updated instructions for installing Java in the README (#8089)
    • Added documentation on OMP_NUM_THREADS and MKL_NUM_THREADS to GermlineCNVCaller and DetermineGermlineContigPloidy (#8223)
    • Improvements to PileupDetectionArgumentCollection documentation (#8050)
    • Fixed typo in documentation for VariantAnnotator (#8145)
  • Dependencies

    • Moved to Java 17, the latest LTS Java release, for building/running GATK (#8035)
    • Updated Gradle to 7.5.1 (#8098)
    • Updated the GATK base docker image to 3.0.0 (#8228)
    • Updated HTSJDK to 3.0.5 (#8035)
    • Updated Picard to 3.0.0 (#8035)
    • Updated Barclay to 5.0.0 (#8035)
    • Updated GenomicsDB to 1.4.4 (#7978)
    • Updated Spark to 3.3.1 (#8035)
    • Updated Hadoop to 3.3.1 (#8102)
    • Require commons-text 1.10.0 to fix a security vulnerability (#8071)

4.3.0.0

13 Oct 01:13
8dbb78f

Download release: gatk-4.3.0.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.3.0.0 release:

  • Support for the Ultima Genomics flow-based sequencing platform

  • A next-generation suite of tools for variant filtration based on site-level annotation, intended to eventually supersede the older VariantRecalibrator workflow

  • CompareReferences and CheckReferenceCompatibility: new tools for comparing and checking compatibility with genomic references

  • Support in HaplotypeCaller/Mutect2 for supplementing the variants discovered in local assembly with variants discovered via a pileup-based approach

Full list of changes:

  • Support for the Ultima Genomics flow-based sequencing platform (#7876)

    • Added a new --flow-mode argument to HaplotypeCaller which better supports flow-based calling
      • Added a new Haplotype Filtering step after assembly which removes suspicious haplotypes from the genotyper
      • Added two new likelihood models, FlowBasedHMM and the FlowBasedAlignmentLikelihoodEngine
    • Added a new --flow-mode argument to Mutect2 which better supports flow-based calling
    • Added support for uncertain read end-positions in MarkDuplicatesSpark
    • Added a new tool FlowFeatureMapper for quick heuristic calling of bams for diagnostics
    • Added a new tool GroundTruthReadsBuilder to generate ground truth files for Basecalling
    • Added a new diagnostic tool HaplotypeBasedVariantRecaller for recalling VCF files using the HaplotypeCallerEngine
    • Added a new tool, SplitCram, that breaks up CRAM files by their blocks
    • Added a new read interface called FlowBasedRead that manages the new features for FlowBased data
    • Added a number of flow-specific read filters
    • Added a number of flow-specific variant annotations
    • Added support for read annotation-clipping as part of clipreads and GATKRead
    • Added a new PartialReadsWalker that supports terminating before traversal is finished
  • Next-generation suite of tools for variant filtration based on site-level annotations (#7954) (#8049)

    • This tool suite is intended to eventually supersede the older VariantRecalibrator workflow
    • The new tools include:
      • ExtractVariantAnnotations: extracts site-level variant annotations, labels, and other metadata from a VCF file to HDF5 files
      • TrainVariantAnnotationsModel: trains a model for scoring variant calls based on site-level annotations
      • ScoreVariantAnnotations: scores variant calls in a VCF file based on site-level annotations using a previously trained model
  • New Reference Comparison Tools

    • CompareReferences: a new tool for analyzing the differences between references at both the dictionary and the base level (#7930) (#7987) (#7973)
      • In its default mode, this tool uses the reference dictionaries to generate an MD5-keyed table comparing the specified references, and does an analysis to summarize the differences between the references provided.
      • Comparisons are made against a "primary" reference, specified with the -R argument. Subsequent references to be compared may be specified using the --references-to-compare argument.
      • A supplementary table keyed by sequence name can be displayed using the --display-sequences-by-name argument; to display only sequence names for which the references are not consistent, run with the --display-only-differing-sequences argument as well.
      • MD5s can be recalculated from the actual sequence when missing from the dictionary
      • When run with --base-comparison FULL_ALIGNMENT, the tool performs full-sequence alignment on the differing reference sequences to produce a VCF with SNPs and Indels. However, this mode ignores IUPAC / N bases.
      • Running with --base-comparison FIND_SNPS_ONLY finds single-base differences between differing reference sequences of the same length. This mode can handle IUPAC / N bases correctly, but not indels.
      • To perform the full-sequence alignment, GATK now packages a distribution of MUMmer for x86_64 Mac and Linux, which can be invoked from within the GATK using the new MummerExecutor class.
    • CheckReferenceCompatibility: a new tool to check a BAM/CRAM/VCF for compatibility against a set of references (#7959) (#7973)
      • This tool generates a table analyzing the compatibility of a BAM/CRAM/VCF input file against provided references.
      • The tool works to compare BAM/CRAMs (specified using the -I argument) as well as VCFs (specified using the -V argument) against provided reference(s), specified using the --references-to-compare argument.
      • When MD5s are present, the tool decides compatibility based on all sequence information (MD5, name, length); when MD5s are missing, the tool makes compatibility calls based only on sequence name and length.
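The compatibility rule stated above (use MD5, name, and length when MD5s are present; fall back to name and length when they are missing) can be sketched for a single pair of sequence dictionary entries. This is an illustration of the described rule, not CheckReferenceCompatibility's code; the types, names, and example values are hypothetical.

```python
from typing import NamedTuple, Optional


class DictEntry(NamedTuple):
    name: str
    length: int
    md5: Optional[str] = None  # None when the dictionary has no M5 value


def entries_compatible(query: DictEntry, reference: DictEntry) -> bool:
    """Compare one input sequence entry against one reference entry."""
    if query.name != reference.name or query.length != reference.length:
        return False
    if query.md5 is not None and reference.md5 is not None:
        # Both MD5s present: the sequence content must match as well.
        return query.md5 == reference.md5
    # MD5 missing on either side: name and length are all we can check.
    return True


ref = DictEntry("chr1", 1000, "aaa")
print(entries_compatible(DictEntry("chr1", 1000, "aaa"), ref))  # True
print(entries_compatible(DictEntry("chr1", 1000, "bbb"), ref))  # False
print(entries_compatible(DictEntry("chr1", 1000), ref))         # True
```

The last case shows why a call made without MD5s is weaker: two sequences with the same name and length but different bases would still be reported as compatible.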
  • HaplotypeCaller/Mutect2

    • Added an optional "Pileup Detection" step to Mutect2 and HaplotypeCaller before assembly that supplements the variants from local assembly with variants that show up in the pileups (#7432)
    • Fixed a Mutect2 IndexOutOfBoundsException with germline resource (#7979)
    • Mutect3 dataset enhancements: optional truth VCF for labels, seq error likelihood annotation (#7975)
    • Added Mutect3 dataset generation to the Mutect2 WDL (#7992)
    • GetPileupSummaries now streams its output rather than storing it in memory (#7664)
    • Fixed a rare edge case in the AdaptiveChainPruner where the JavaPriorityQueue is undefined for tied elements (#7851)
  • SV Calling

    • CondenseDepthEvidence: a new tool that combines adjacent intervals in DepthEvidence files (#7926)
    • LocusDepthtoBAF: a new tool that merges locus-sorted LocusDepth evidence files, calculates the bi-allelic frequency (baf) for each sample and site, and writes these values as a BafEvidence output file (#7776)
    • PrintReadCounts: a new tool that prints (and optionally subsets) a read depth (DepthEvidence) file or a counts file as one or more (for multi-sample DepthEvidence files) counts files for CNV determination (#8015)
    • CollectSVEvidence: fixed a bug where trailing SNP sites and depth intervals without read coverage were being omitted from the output (#8045)
    • CollectSVEvidence: added read depth generation and raw-counts output (#8015)
    • Improved PrintSVEvidence performance by tweaking the MultiFeatureWalker traversal (#7869)
    • Fixes related to BafEvidence (biallelic-frequency of a sample at some locus) (#7861)
    • Fixed a bug where the end coordinate was being incorrectly compared when sorting discordant read pair evidence (#7835)
    • Sort output from SVClusterEngine (#7779)
    • Remove abandoned SV filtering project and unneeded build dependency (#7950)
  • CNV Calling

    • Fix a no-call genotype ploidy bug in JointGermlineCNVSegmentation (#7779)
    • Added numerical-stability tests and updated test data for all ModelSegments single-sample and multiple-sample modes (#7652)
    • Added a gCNV integration test to detect numerical differences in the outputs (#7889)
  • GenomicsDB

    • GenomicsDBImport: added the ability to specify explicit index locations via the sample name map file (#7967)
      • Each line in the sample name map file may now optionally contain a third column with the path/URI to the index. This is useful when the index is not in the same location as the corresponding GVCF.
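A sample name map with the optional third column might be read as in the sketch below. It assumes tab-separated columns in the order sample name, GVCF path, index path; the parser, type names, and example paths are hypothetical and are not GATK code.

```python
from typing import List, NamedTuple, Optional


class SampleMapEntry(NamedTuple):
    sample: str
    gvcf: str
    index: Optional[str]  # None means "look for the index next to the GVCF"


def parse_sample_map(text: str) -> List[SampleMapEntry]:
    """Parse sample map lines with two or three tab-separated columns."""
    entries = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) not in (2, 3):
            raise ValueError(f"line {line_no}: expected 2 or 3 columns, got {len(fields)}")
        index = fields[2] if len(fields) == 3 else None
        entries.append(SampleMapEntry(fields[0], fields[1], index))
    return entries


# Hypothetical paths: the second sample keeps its index in another location.
example = (
    "sampleA\tgs://bucket-1/sampleA.g.vcf.gz\n"
    "sampleB\tgs://bucket-1/sampleB.g.vcf.gz\tgs://bucket-2/sampleB.g.vcf.gz.tbi\n"
)
for entry in parse_sample_map(example):
    print(entry.sample, entry.index)
```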
  • Bug Fixes

    • Fixed an issue where we weren't properly merging AD values when combining GVCFs and no PLs were present (#7836)
    • Fixed a bug in ReblockGVCF that could cause the first position on a contig to be dropped (#8028)
    • Fixed an allele-ordering issue in the allele-specific annotation code (#7585)
    • VariantRecalibrator: type change int -> long to prevent tranche novel variant count overflow (#7864)
    • Fixed an issue with tabix index generation (#7858)
    • Fixed a bug in SiteDepthCodec (#7910)
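
To illustrate why the VariantRecalibrator int -> long change matters, the sketch below imitates 32-bit signed arithmetic in Python, whose own integers never overflow. The helper `to_int32` is hypothetical and is not part of GATK. It only shows what happens to a counter that passes the 32-bit limit.

```python
INT32_MAX = 2**31 - 1  # largest value a 32-bit signed int can hold


def to_int32(value: int) -> int:
    """Wrap an integer the way 32-bit two's-complement arithmetic would."""
    return (value + 2**31) % 2**32 - 2**31


# A counter sitting at the limit wraps to a large negative number on the next increment.
wrapped = to_int32(INT32_MAX + 1)

# A 64-bit counter has ample headroom for the same value.
fits_in_int64 = INT32_MAX + 1 <= 2**63 - 1
```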
  • Miscellaneous Changes

    • VariantsToTable now includes all fields when none are specified (#7911)
    • SelectVariants now warns the user about poor performance when the sample names in the VCF header are unsorted (#7887)
    • VariantRecalibrator now has a --dont-run-rscript argument to disable execution of its R script but still output the actual R script file (#7900)
    • Added some generic read tag/expression filters for use on numeric tags (#7746)
    • Replaced Travis CI with Github Actions for our continuous testing (#7754)
    • Switched over to Github Actions for building our nightly docker image (#7775)
    • Created a new build_docker_remote.sh script for building the docker image remotely with Google Cloud Build (#7951)
    • Added an argument mode manager for group arguments and a demonstration of how it might be used in HaplotypeCaller --dragen-mode (#7745)
    • Added unit tests for the Utils.concat() methods (#7918)
    • Added a test to validate WDLs in the scripts directory. (#7826)
    • Added a use_allele_specific_annotation arg and fixed task with empty input in the JointVcfFiltering WDL (#8027)
    • Fixed an issue in the GATK stats script in which the first day's downloads on a new release were set to 0 (#7794)
    • Fixed a typo in the Dockerfile that broke git lfs pull (#7806)
    • Removed unused code in the utils.solver package (#7922)
    • Corrected the time for GATK nightly build cron jobs (#7784)
    • Disabled the red "X" from failing CodeCov builds and de...

4.2.6.1

13 Apr 19:24
33bda5e

Download release: gatk-4.2.6.1.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.2.6.1 release:

This release contains a single bug fix for GenotypeGVCFs to fix an erroneous IllegalStateException ("No likelihood sum exceeded zero -- method was called for variant data with no variant information.") in the edge case where unnormalized PLs are present at monomorphic sites.

4.2.6.0

08 Apr 19:27
3b0bc03

Download release: gatk-4.2.6.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.2.6.0 release:

  • Important bug fixes for the joint calling tools (GenotypeGVCFs / GenomicsDB)

    • GATK 4.2.5.0 contained two joint genotyping bugs that are now fixed in GATK 4.2.6.0:
      • GenotypeGVCFs can throw NullPointerExceptions in some cases with many alternate alleles.
      • The expectation-maximization component of the QUAL calculation was disabled, leading to false positive, low quality alleles at some multi-allelic sites.
    • If you are running these tools in 4.2.5.0 we strongly recommend updating to 4.2.6.0
  • Fixed a "Bucket is a requester pays bucket but no user project provided" error that occurred when accessing requester pays buckets in Google Cloud Storage even when the --gcs-project-for-requester-pays argument was specified

    • If you continue to encounter problems accessing requester pays Google Cloud Storage buckets in 4.2.6.0, please let us know by filing a Github issue!
  • Two new tools for the Structural Variation calling pipeline: SVAnnotate and PrintSVEvidence

  • Some fixes to genotype-given-alleles mode in HaplotypeCaller and Mutect2

Full list of changes:

  • Joint Calling (GenotypeGVCFs / GenomicsDB)

    • GATK 4.2.5.0 contained two joint genotyping bugs which are now fixed in 4.2.6.0:
      • GenotypeGVCFs can throw NullPointerExceptions in some cases with many alternate alleles.
        • Fixed in:
          • Fix for NullPointerException when GenomicsDB has more ALT alleles than specified maximum and many GQ0 hom-ref genotypes allow variants to pass the QUAL filter (#7738)
      • The expectation-maximization component of the QUAL calculation was disabled, leading to false positive, low quality alleles at some multi-allelic sites.
        • Fixed in:
          • Fix multi-allelic QUAL calculation and restore some missing ALT annotation data in ReblockGVCFs (#7670)
    • Mention acceptable compressed VCF file extensions in GenomicsDBImport error message (#7692)
  • SV Calling

    • Added a new tool SVAnnotate (#7431)
      • SVAnnotate adds functional annotations for SVs called by GATK-SV (#7431)
    • Added a new tool PrintSVEvidence (#7695)
      • PrintSVEvidence is a tool that can merge any number of files containing one of five types of evidence of structural variation. It's also capable of subsetting regions or samples. It's used to merge evidence from a cohort in the GATK-SV pipeline.
    • Added start/end coordinate validation to SVCallRecord (#7714)
  • HaplotypeCaller / Mutect2

    • Fixed an edge case in HaplotypeCaller where filtered alleles in the vicinity of forced-calling alleles could result in empty calls (#7740)
      • This affects users who run genotype given alleles mode in non-GVCF mode
    • Fixed a bug in HaplotypeCaller and Mutect2 where force-calling alleles were lost upon trimming by placing allele injection after trimming (#7679)
    • Added a debug --pair-hmm-results-file argument that dumps the exact inputs/outputs of the PairHMM to a file (#7660)
    • Some changes to Mutect2 to support the future Mutect3 (#7663)
      • Added training data for the Mutect3 normal artifact filter
      • Output tensors for Mutect3 as plain text rather than VCF
  • RNA Tools

    • TransferReadTags: a new tool that transfers a read tag from an unaligned bam to the matching aligned bam (#7739).
      • This tool allows us to retrieve read tags that get lost when converting a SAM file to fastqs, then back to SAM (which is necessary if e.g. running fastp to clip adapter bases before alignment).
    • PostProcessReadsForRSEM: a new tool that re-orders and filters reads before running RSEM, which has stringent requirements on the input SAM (https://github.com/deweylab/RSEM) (#7752).
  • Funcotator

    • Added custom VariantClassification severity ordering. (#7673)
      • Users can now customize the severity ratings of the various VariantClassifications using the new --custom-variant-classification-order argument
    • Added logging statements to the b37 conversion process explaining why the automatic b37 conversion does or does not take place on the user's VCF (#7760)
  • VariantRecalibrator

    • Added regularization to covariance in GMM maximization step to fix convergence issues in VariantRecalibrator (#7709)
      • This makes the tool more robust in cases where annotations are highly correlated
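
The notes do not say which form of regularization VariantRecalibrator applies. As a rough illustration of why regularizing a covariance matrix helps with highly correlated annotations, the sketch below adds a small constant to the diagonal, which is one common choice and an assumption here. The function names and the value of `epsilon` are hypothetical.

```python
def covariance_2x2(xs, ys):
    """Population covariance matrix of two equally long annotation vectors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs) / n
    syy = sum((y - mean_y) ** 2 for y in ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return [[sxx, sxy], [sxy, syy]]


def regularize(cov, epsilon):
    """Add epsilon to the diagonal (assumed form of regularization)."""
    return [[cov[0][0] + epsilon, cov[0][1]], [cov[1][0], cov[1][1] + epsilon]]


def determinant(cov):
    return cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]


# Two perfectly correlated annotations: the second is exactly twice the first.
annotation_a = [1.0, 2.0, 3.0, 4.0]
annotation_b = [2.0, 4.0, 6.0, 8.0]

cov = covariance_2x2(annotation_a, annotation_b)
singular_det = determinant(cov)                       # zero: cov cannot be inverted
regularized_det = determinant(regularize(cov, 1e-3))  # positive: invertible again
```

A singular covariance matrix cannot be inverted when evaluating a Gaussian density, which is one way a maximization step can fail to converge.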
  • Bug Fixes

    • Fixed a "Bucket is a requester pays bucket but no user project provided" error that occurred when accessing requester pays buckets in Google Cloud Storage even when --gcs-project-for-requester-pays was specified (#7700) (#7730)
    • Fix for the PossibleDeNovo annotation to work without Genotype Likelihoods (#7662)
      • PossibleDeNovo checks each trio's genotype (including parent hom ref genotypes) for likelihoods even though it doesn't actually use the PLs. The PLs can get dropped if GVCFs are reblocked which means this annotation no longer works as expected. This changes the check to look for GQs instead of PLs as the GQs are used as part of the annotation.
    • Fixed a bug with the --mate-too-distant-length in MateDistantReadFilter not being configurable (#7701)
  • GATK Engine

    • Added a new MultiFeatureWalker traversal to the GATK engine (#7695)
    • Removed an ancient, unused option to track unique reads in a LocusIteratorByState (#6410)
  • Miscellaneous Changes

    • Added back the jcenter repository resolver to our gradle build, fixing a "Could not find biz.k11i:xgboost-predictor:0.3.0" error when building GATK from source (#7665)
    • We now properly update the latest tag in the broadinstitute/gatk-nightly Dockerhub repo (#7703)
    • The docker build now only does a git lfs pull on src/main/resources/large (#7727)
    • Install git lfs with --force in the Dockerfile (#7682)
    • Fix WDL generation for MultiVariantWalkers by adding a companion index to the MultiVariantWalker input variant arg (#7689)
    • Added a Google Apps Script to automatically update GATK release stats. (#7637)
    • Updated the GATK stats script to be more universally usable (#7759)
    • Added JointCallExomeCNVs to .dockstore.yml and included a note in the WDL (#7719)
  • Documentation

    • Corrected the docs for the --heterozygosity argument in the GenotypeCalculationArgumentCollection (#7661)
  • Dependencies

    • Updated Picard to 2.27.1 (#7766)
    • Updated google-cloud-nio to 0.123.25 (#7730)

4.2.5.0

04 Feb 22:26
da7cd83

Download release: gatk-4.2.5.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.2.5.0 release:

  • Fixed a GenotypeGVCFs IllegalStateException error reported by multiple users in #7639

  • Added a new tool SVCluster that clusters structural variants based on coordinates, event type, and supporting algorithms.

Full list of changes:

  • Joint Calling (GenotypeGVCFs / GenomicsDB)

    • Fixed an IllegalStateException in GenotypeGVCFs arising from GenomicsDB output with too many alts and no likelihoods, and also added a --genomicsdb-max-alternate-alleles argument that is separate from the --max-alternate-alleles argument used by GenotypeGVCFs (#7655)
      • This fixes the GenotypeGVCFs error reported in #7639
      • The new --genomicsdb-max-alternate-alleles argument is required to be at least one greater than the --max-alternate-alleles argument, to account for the NON_REF allele.
    • ReblockGVCF: fixed an edge case where hom-ref "variant" records with no data had wrong-sized PLs and didn't merge with adjacent blocks (#7644)
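
The constraint between the two alternate-allele arguments described above can be written as a simple check. This is a sketch of the stated rule only, not GATK's validation code, and the function name is hypothetical.

```python
def is_valid_alt_allele_config(
    max_alternate_alleles: int, genomicsdb_max_alternate_alleles: int
) -> bool:
    """Return True when the GenomicsDB limit leaves room for the NON_REF allele.

    Mirrors the stated rule: --genomicsdb-max-alternate-alleles must be at least
    one greater than --max-alternate-alleles.
    """
    return genomicsdb_max_alternate_alleles >= max_alternate_alleles + 1


ok = is_valid_alt_allele_config(6, 7)       # one extra slot for NON_REF
too_low = is_valid_alt_allele_config(6, 6)  # no room for NON_REF
```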
  • SV Calling

    • Added a new tool SVCluster that clusters structural variants based on coordinates, event type, and supporting algorithms. (#7541)
      • Primary use cases include:
        • Clustering SVs produced by multiple callers, based on interval overlap, breakpoint proximity, and sample overlap.
        • Merging multiple SV VCFs with disjoint sets of samples and/or variants.
        • Defragmentation of copy number variants produced with depth-based callers.
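
As a rough illustration of clustering on interval overlap and breakpoint proximity, the sketch below compares two SV records. It is not SVCluster's algorithm: the `SVRecord` type, the reciprocal-overlap measure, both thresholds, and the decision to require both criteria are assumptions, and sample overlap is left out entirely.

```python
from typing import NamedTuple


class SVRecord(NamedTuple):
    contig: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    sv_type: str


def reciprocal_overlap(a: SVRecord, b: SVRecord) -> float:
    """Smaller of the two overlap fractions, or 0.0 when the records do not overlap."""
    if a.contig != b.contig:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start) + 1
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start + 1), overlap / (b.end - b.start + 1))


def should_cluster(
    a: SVRecord, b: SVRecord, min_overlap: float = 0.5, breakpoint_window: int = 500
) -> bool:
    """Cluster records of the same type whose intervals and breakpoints both agree."""
    if a.contig != b.contig or a.sv_type != b.sv_type:
        return False
    breakpoints_close = (
        abs(a.start - b.start) <= breakpoint_window
        and abs(a.end - b.end) <= breakpoint_window
    )
    return breakpoints_close and reciprocal_overlap(a, b) >= min_overlap


deletion_a = SVRecord("chr1", 1000, 1999, "DEL")
deletion_b = SVRecord("chr1", 1100, 2099, "DEL")
duplication = SVRecord("chr1", 1100, 2099, "DUP")
```

With these hypothetical records, the two deletions share 900 of their 1000 bases and would be clustered, while the duplication is kept apart because its event type differs.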
  • Mutect2

    • The palindrome ITR artifact transformer now skips reads whose contigs are not in the sequence dictionary (#6968)
      • This fixes a NullPointerException error in Mutect2 reported in #6851
  • GATK Engine

    • Added a new read filter, ExcessiveEndClippedReadFilter (#7638)
      • This filter will keep reads that have fewer than the specified number of clipped bases on either end.
      • Designed with long reads in mind, and as a result has a default value of 1000.
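
A minimal sketch of the idea behind ExcessiveEndClippedReadFilter, working on CIGAR strings. It is not the filter's implementation. Counting only soft clips, ignoring hard clips, and using a strict "fewer than" comparison are assumptions taken from the wording above. The function names are hypothetical, and the default of 1000 is the one stated in the notes.

```python
import re
from typing import Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def end_soft_clips(cigar: str) -> Tuple[int, int]:
    """Return the number of soft-clipped bases at the left and right ends of a read."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Hard clips sit outside any soft clips; drop them so the soft clips are outermost.
    core = [(length, op) for length, op in ops if op != "H"]
    left = core[0][0] if core and core[0][1] == "S" else 0
    right = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    return left, right


def passes_end_clip_filter(cigar: str, max_clipped_bases: int = 1000) -> bool:
    """Keep a read only if each end has fewer than max_clipped_bases soft-clipped bases."""
    left, right = end_soft_clips(cigar)
    return left < max_clipped_bases and right < max_clipped_bases


kept = passes_end_clip_filter("10S5000M20S")       # small clips on both ends
dropped = passes_end_clip_filter("1200S3000M5S")   # 1200 clipped bases on the left end
```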

4.2.4.1 the log4j strikes back

04 Jan 22:10
4.2.4.1
5b87367

Download release: gatk-4.2.4.1.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.2.4.1 release:

  • Fix more newly discovered log4j2 vulnerabilities. Now that people are paying attention they are finding all sorts of things.

Full list of changes:

  • Build System

    • Upgrade our build from Gradle 5.6 to the newest 7.3.2 (#7609)
      • This fixes some gradle bugs which were blocking development
  • GenomicsDB

    • Update to genomicsdb 1.4.3 (#7613) which fixes #7598
    • Fix a bug which caused --max-alternate-alleles to be ignored when using GenomicsDB (#7576)
  • Miscellaneous Changes

    • Update .dockstore.yml (#7595)
    • Fix developer doc in AS_RMSMappingQuality (#7607)
  • Dependencies

    • Update log4j to 2.17.1 (#7624) (#7615)
    • Upgrade to Barclay 4.0.2. (#7602)
    • Update to genomicsdb 1.4.3 (#7613)

4.2.4.0 the log4shell edition

15 Dec 19:33
4.2.4.0
2d3d4aa

Download release: gatk-4.2.4.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.2.4.0 release:

  • Fix a major security bug due to log4j vulnerability. (CVE-2021-44228)
  • Improvement to calculation of ExcessHet in joint genotyping. (GenotypeGVCFs, GnarlyGenotyper, ExcessHet).

Full list of changes:

  • Funcotator

    • Aligned the Funcotator checkIfAlreadyAnnotated test with the Funcotator engine code. (#7555)
  • GenotypeGVCFs / ExcessHet

    • Removed undocumented mid-p correction to p-values in exact test of Hardy-Weinberg equilibrium and updated corresponding tests. We now report the same value as ExcHet in bcftools. Note that previous values of 3.0103 (corresponding to mid-p values of 0.5) will now be 0.0000. (#7394)
    • Updated expected ExcessHet values in integration test resources and added an update toggle to GnarlyGenotyperIntegrationTest.
    • Updated ExcessHet documentation.
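
The two ExcessHet values quoted above follow from Phred scaling of the p-value. The short check below reproduces them. The helper name is hypothetical, and reading 0.0000 as the Phred-scaled form of a p-value of 1.0 is an inference from the arithmetic rather than something the notes state.

```python
import math


def phred_scale(p_value: float) -> float:
    """Phred-scale a p-value: 10 * log10(1 / p), equivalent to -10 * log10(p)."""
    return 10.0 * math.log10(1.0 / p_value)


old_value = f"{phred_scale(0.5):.4f}"  # mid-p value of 0.5
new_value = f"{phred_scale(1.0):.4f}"  # p-value of 1.0
```

Here `old_value` is "3.0103" and `new_value` is "0.0000", matching the figures in the note.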
  • Miscellaneous Changes

    • Delete an unused .gitattributes file which was unintentionally stored in git-lfs and caused an error message to appear sometimes when checking out the repository. (#7594)
    • Remove trailing tab in VariantsToTable output header (#7559)
  • Documentation

    • Updated AUTHORS file to remove a contributor's name at their request. (#7580)
    • Remove outdated javadoc line in AssemblyBasedCallerUtils (#7554)
  • Dependencies

4.2.3.0

02 Nov 22:08
f31f019

Download release: gatk-4.2.3.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.2.3.0 release:

  • Notable bug fixes for Mutect2 and Funcotator

  • Support in CombineGVCFs and GenotypeGVCFs for "reblocked" GVCFs as produced by the ReblockGVCF tool. Reblocked GVCFs have a significantly reduced storage footprint.

  • More control over the Smith-Waterman parameters in HaplotypeCaller and Mutect2

  • A new Fragment Allele Depth (FAD) variant annotation similar to the AD annotation except that allele support is considered per read pair, not per individual read

  • GenomicsDB bug fixes and enhancements

Full list of changes:

  • HaplotypeCaller/Mutect2

    • Fixed a bug where Mutect2 failed to filter germline variants with alternate representations (#7103)
      • In some cases, variants with alternative representations in gnomAD were not recognized as being the same as the called variants. As a result, some called variants that should have been filtered as "germline" were left unfiltered.
    • Exposed Smith-Waterman parameters as tool arguments in HaplotypeCaller, Mutect2, and FilterAlignmentArtifacts. (#6885)
      • Enables use of alternative parameters for different event representation (e.g. three consecutive SNPs instead of two small indels)
    • Can now specify the Smith-Waterman implementation in FilterAlignmentArtifacts (#7105)
    • Added a --debug-assembly-variants-out diagnostic option to output a side VCF with variants detected by assembly for HaplotypeCaller and Mutect2 (#7384)
    • Mutect2: the --genotype-germline-sites argument is no longer marked as experimental (#7533)
  • GenotypeGVCFs / CombineGVCFs

    • Updated CombineGVCFs and GenotypeGVCFs to handle "reblocked" GVCFs with diploid data that are potentially missing hom-ref genotype PLs (#7223)
    • Homozygous reference genotypes with no PLs and zero depth are now output as no-calls by GenotypeGVCFs (#7471)
    • Bug fixes for GenotypeGVCFs/GnarlyGenotyper when allele-specific annotations have empty values due to lack of informative reads or no depth (#7491) (#7186)
  • GenomicsDB

    • Added a new --call-genotypes GenomicsDB argument, enabling output of called genotypes (i.e. not ./.) when tools like CombineGVCFs and SelectVariants read from a GenomicsDB workspace (#7223)
    • Added a --bypass-feature-reader argument to GenomicsDBImport to allow the C-based htslib VCF reader implementation to be used instead of the Java implementation (#7393)
      • Using this option will reduce memory usage and potentially speed up the import process
    • Updated to GenomicsDB 1.4.2 (#7520)
  • Funcotator

    • Fixed a StringIndexOutOfBoundsException in the protein change prediction code that could be triggered by certain indels. The fix avoids the crash by adding additional bounds checking. (#7513)
    • Allow FilterFuncotations to process multi-transcript genes (#7506)
  • CNV Calling

    • CNV WDLs now handle BAM/CRAM index paths explicitly, to support cases where the index is not in the same location as its file (#7518)
    • gCNV in the CASE mode now fills in all hidden DenoisingModelConfig and CopyNumberCallingConfig arguments from the input model configuration (#7464)
    • Exposed number of samples used for estimating denoised copy ratios in gCNV via a new --num-samples-copy-ratio-approx argument (#7450)
  • SV Calling

    • JointGermlineCNVSegmentation: bug fixes and refactoring (#7243)
      • A number of bugs, particularly with max-clique clustering, have been fixed, as well as a parameter swap bug in JointGermlineCNVSegmentation
      • Reworks classes used by JointGermlineCNVSegmentation for SV clustering and defragmentation. The design of SVClusterEngine has been overhauled to enable the implementation of CNVDefragmenter and BinnedCNVDefragmenter subclasses. Logic for producing representative records from a collection of clustered SVs has been separated into an SVCollapser class, which provides enhanced functionality for handling genotypes for SVs more generally.
  • Notable Enhancements

    • Added a new Fragment Allele Depth (FAD) variant annotation (#7511)
      • This annotation is identical to the AD annotation except that allele support is considered per read pair, not per individual read
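
The difference between counting allele support per read and per read pair can be sketched as follows. This is an illustration only, not the FAD implementation: the input format (read name, supported allele) and the choice to skip fragments whose mates disagree are assumptions.

```python
from collections import Counter
from typing import Iterable, Tuple

Read = Tuple[str, str]  # (read name shared by both mates, allele the read supports)


def allele_depth_per_read(reads: Iterable[Read]) -> Counter:
    """AD-style count: every read contributes one observation."""
    return Counter(allele for _, allele in reads)


def allele_depth_per_fragment(reads: Iterable[Read]) -> Counter:
    """FAD-style count: each read pair (fragment) contributes at most one observation."""
    alleles_by_fragment = {}
    for name, allele in reads:
        alleles_by_fragment.setdefault(name, set()).add(allele)
    counts = Counter()
    for alleles in alleles_by_fragment.values():
        if len(alleles) == 1:  # assumed: skip fragments whose mates disagree
            counts[next(iter(alleles))] += 1
    return counts


# Hypothetical pileup: frag1 and frag2 have overlapping mates, frag3 has a single read.
reads = [
    ("frag1", "ref"), ("frag1", "ref"),
    ("frag2", "alt"), ("frag2", "alt"),
    ("frag3", "alt"),
]
ad = allele_depth_per_read(reads)
fad = allele_depth_per_fragment(reads)
```

In this example the per-read counts are ref=2, alt=3, while the per-fragment counts are ref=1, alt=2, because overlapping mates are counted once.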
  • Miscellaneous Changes

    • SplitIntervals: added new tool arguments to control output file naming (#7488)
    • Fixed an issue that caused the Travis CI test suite reports to fail to be uploaded (#7525)
    • Updated Travis CI authentication information (#7521)
  • Documentation

    • Updated StrandBiasBySample documentation (#7283)
    • Updated MarkDuplicatesSpark documentation (#7191) (#7535)
    • Added a comment to .travis.yml about the checkout depth (#7421)
  • Dependencies

    • Updated to GenomicsDB 1.4.2 (#7520)
    • Updated sqlite-jdbc library to a newer version to support M1 Macs (#7519)