illumina dragen bio-it platorm v3.3 user guide ... · revisionhistory document date...

Illumina DRAGEN Bio-IT Platform v3.3User Guide

Document # 1000000101034 v00

January 2020

ILLUMINA PROPRIETARY

For Research Use Only. Not for use in diagnostic procedures.

This document and its contents are proprietary to Illumina, Inc. and its affiliates ("Illumina"), and are intended solely forthe contractual use of its customer in connection with the use of the product(s) described herein and for no otherpurpose. This document and its contents shall not be used or distributed for any other purpose and/or otherwisecommunicated, disclosed, or reproduced in any way whatsoever without the prior written consent of Illumina. Illuminadoes not convey any license under its patent, trademark, copyright, or common-law rights nor similar rights of any thirdparties by this document.

The instructions in this document must be strictly and explicitly followed by qualified and properly trained personnel inorder to ensure the proper and safe use of the product(s) described herein. All of the contents of this document mustbe fully read and understood prior to using such product(s).

FAILURE TO COMPLETELY READ AND EXPLICITLY FOLLOW ALL OF THE INSTRUCTIONS CONTAINED HEREINMAY RESULT IN DAMAGE TO THE PRODUCT(S), INJURY TO PERSONS, INCLUDING TO USERS OR OTHERS,AND DAMAGE TO OTHER PROPERTY, AND WILL VOID ANY WARRANTY APPLICABLE TO THE PRODUCT(S).

ILLUMINA DOES NOT ASSUME ANY LIABILITY ARISING OUT OF THE IMPROPER USE OF THE PRODUCT(S)DESCRIBED HEREIN (INCLUDING PARTS THEREOF OR SOFTWARE).

© 2020 Illumina, Inc. All rights reserved.

All trademarks are the property of Illumina, Inc. or their respective owners. For specific trademark information, seewww.illumina.com/company/legal.html.

Document # 1000000101034 v00

For Research Use Only. Not for use in diagnostic procedures.ii

Illumina DRAGEN Bio-IT Platform v3.3 User Guide

http://www.illumina.com/company/legal.html

Revision History

Document Date Description of Change

Document #1000000101034 v00

January2020

Added v3.3 to the document title.Changed the document number to 1000000101034.

Document #1000000070494 v04

April 2019 Added version history for v03, added South Korean Technical Support phonenumber.

Document #1000000070494 v03

April 2019 • Removed reference to --cram-reference option, which is no longer needed.• --annotation-sj-file option removed. Use --annotation-file instead.• Added information on the following options:

• --vc-enable-phasing• --vc-tlod-filter-threshold• --vc-enable-triallelic-filter• --cnv-merge-distance• --cnv-enable-tracks

• Added the following sections:• Down-sampling Options for Germline Small Variant Calling• Phasing and Phased Variants• SMA Calling• De Novo Quality Scoring• Multisample CNV Calling• Joint Segmentation• De Novo Calling Stage• Multisample CNV VCF Outuput• MapQ and BQ Coverage Filtering• Unique Molecular Identifiers (UMIs)• Gene Quantification

• Updated the following sections:• Mitochondrial Calling• De Novo Joint Calling• Variant Hard Filtering• Repeat Genotyping• Structural Variant Calling• gVCF, Combine gVCF, and Joint VCF• Quality Scoring• Mapping and Aligning

Document #1000000070494 v02

March2019

Updated the following application names:• DRAGEN Genome Pipeline is now DRAGEN DNA Applications• DRAGEN Transcriptome Pipeline is now DRAGEN RNA Applications• DRAGEN Epigenome Pipeline is now DRAGEN Methylation ApplicationsRepeat Genotyping is now Repeat Expansion Detection

Document #1000000070494 v01

January2019

Updated for version 3.2.5 of the software.Added Copy Number Variant Calling section.Added new options to Appendix A.

Document #1000000070494 v00

December2018

Initial release.

Document # 1000000101034 v00

For Research Use Only. Not for use in diagnostic procedures.iii


Table of ContentsRevision History iii

Chapter 1 Illumina DRAGEN Bio-IT Platform 1DRAGEN DNA Applications 1DRAGEN RNA Applications 2DRAGEN Methylation Applications 2System Updates 2Additional Resources and Support 2Getting Started 3

Chapter 2 DRAGEN Host Software 6Command-line Options 6Autogenerated MD5SUM for BAMs 14Configuration Files 14

Chapter 3 DRAGEN DNA Applications 15DNA Mapping 15DNA Aligning 17ALT-Aware Mapping 23Sorting 24Duplicate Marking 24Small Variant Calling 26Copy Number Variant Calling 40Multisample CNV Calling 57Repeat Expansion Detection with Expansion Hunter 60SMA Calling 62Structural Variant Calling 63Structural Variant De Novo Quality Scoring 65QC Metrics and Coverage (Enrichment) Reports 66Variant Quality Score Recalibration 75Virtual Long Read Detection 82Force Genotyping 84Unique Molecular Identifiers 85

Chapter 4 DRAGEN RNA Applications 86RNA Mapping and Intron Handling 86RNA Aligning 86Input Files 87Output Files 89Gene Fusion Detection 91Gene Expression Quantification 94

Chapter 5 DRAGEN Methylation Applications 96

Document # 1000000101034 v00

For Research Use Only. Not for use in diagnostic procedures.iv

DRAGEN Methylation Calling 97Methylation-Related BAM Tags 98Methylation Cytosine and M-Bias Reports 98Using Bismark for Methylation Calling 99

Chapter 6 Preparing a Reference Genome 100Hash Table Background 100Command Line Options 103Pipeline Specific Hash Tables 109

Chapter 7 Tools and Utilities 110Illumina BCL Data Conversion 110Monitoring System Health 112Hardware-Accelerated Compression and Decompression 114Usage Reporting 114

Chapter 8 Troubleshooting 116How to Identify if the System is Hanging 116Sending Diagnostic Information to Illumina Support 116Resetting Your System after a Crash or Hang 116

Appendix A Command Line Options 117Host Software Options 117Mapper Options 122Aligner Options 122Variant Caller Options 124Repeat Expansion Detection Options 131

Technical Assistance 132

Document # 1000000101034 v00

For Research Use Only. Not for use in diagnostic procedures.v


Chapter 1 Illumina DRAGEN Bio-IT PlatformThe Illumina DRAGEN™ Bio-IT Platform is based on the highly reconfigurable DRAGEN Bio-ITProcessor, which is integrated on a Field Programmable Gate Array (FPGA) card and is available in apreconfigured server that can be seamlessly integrated into bioinformatics workflows. The platform canbe loaded with highly optimized algorithms for many different NGS secondary analysis pipelines,including the following:

u Whole genome

u Exome

u RNA-Seq

u Methylome

u Cancer

All user interaction is accomplished via DRAGEN software that runs on the host server and manages allcommunication with the DRAGEN board.

This user guide summarizes the technical aspects of the system and provides detailed information forall DRAGEN command line options.

DRAGENDNAApplications

Figure 1 DRAGEN DNA Applications

The DRAGEN DNA Applications massively accelerate the secondary analysis of NGS data. Forexample, the time taken to process an entire human genome at 30x coverage is reduced fromapproximately 10 hours (using the current industry standard, BWA-MEM+GATK-HC software) toapproximately 20 minutes. Time scales linearly with coverage depth.

These applications harness the tremendous power of the DRAGEN Bio-It Platform and include highlyoptimized algorithms for mapping, aligning, sorting, duplicate marking, and haplotype variant calling.They also use platform features such as compression and BCL conversion, together with the full set ofplatform tools.

Unlike all other secondary analysis methods, DRAGEN DNA Applications do not reduce accuracy toachieve speed improvements. Accuracy for both SNPs and INDELs is improved over that of BWA-MEM+GATK-HC in side-by-side comparisons.

Document # 1000000101034 v00

For Research Use Only. Not for use in diagnostic procedures.1

DRAGENRNAApplicationsThe DRAGEN RNA Applications share many components with the DNA Applications, but the RNA-Seqread alignment stage is executed with a spliced aligner. Mapping of short seed sequences from RNA-Seq reads is performed similarly to the DRAGEN DNA Applications, but detecting splice junctions (thejoining of noncontiguous exons in RNA transcripts) near the mapped seeds and incorporating thejunctions into alignments of the entire read sequences are specific to the RNA Applications.

DRAGEN uses hardware accelerated algorithms to map and align RNA-Seq–based reads faster andmore accurately than popular software tools. For instance, it can align 100 million paired-end RNA-Seq–based reads in about three minutes. With simulated benchmark RNA-Seq data sets, its splice junctionsensitivity and specificity are unsurpassed.

The output BAM generated by the DRAGEN RNA Applications is compatible with popular downstreamanalysis tools.

DRAGENMethylation ApplicationsThe DRAGEN Methylation Applications provide support for automating the processing of bisulfitesequencing data, generating a BAM with the tags required for methylation analysis.

System UpdatesDRAGEN is a flexible and extensible platform that is highly reconfigurable. Your DRAGEN subscriptionallows you to download updates to the DRAGEN processors and software. These updates providespeed, performance, throughput, and accuracy improvements.

Additional Resources and SupportFor additional information, resources, system updates, and support, please visit the DRAGEN supportpage on the Illumina website.

Document # 1000000101034 v00



Getting StartedDRAGEN provides tests you can run to make sure that your DRAGEN system is properly installed andconfigured. Before running the tests, make sure that the DRAGEN server has adequate power andcooling, and is connected to a network that is fast enough to move your data to and from the machinewith adequate performance.

Running the System CheckAfter powering up, you can make sure that your DRAGEN system is functioning properly by running/opt/edico/self_test/self_test.sh, which does the following:

u Automatically indexes chromosome M from the hg19 reference genome

u Loads the reference genome and index

u Maps and aligns a set of reads

u Saves the aligned reads in a BAM file

u Asserts that the alignments exactly match the expected results

Each system ships with the test input FASTQ data for this script, which is located in /opt/edico/self_test. The system check takes approximately 25–30 minutes.

The following example shows how to run the script and shows the output from a successful test.[root@edico2 ~]# /opt/edico/self_test/self_test.sh

---------------------------------test hash creatingtest hash created---------------------------------reference loading /opt/edico/self_test/ref_data/chrM/hg19_chrMreference loaded---------------------------------

real0m0.640suser0m0.047ssys0m0.604snot properly paired and unmapped input records percentages: PASS---------------------------------md5sum check dbam sorted: PASS---------------------------------SELF TEST COMPLETEDSELF TEST RESULT : PASS

If the output BAM file does not match expected results, then the last line of the above text is as follows:SELF TEST RESULT : FAIL

If you experience a FAIL result after running this test script immediately after powering up yourDRAGEN system, contact Illumina Technical Support.

Document # 1000000101034 v00



Running Your Own TestWhen you are satisfied that your DRAGEN system is performing as expected, you are ready to runsome of your own data through the machine, as follows:

u Load the reference table for the reference genome

u Determine location of input and output files

u Process input data

Loading the ReferenceGenomeBefore a reference genome can be used with DRAGEN, it must be converted from FASTA format into acustom binary format for use with the DRAGEN hardware. For more information, see Preparing aReference Genome on page 100.

Use the following command to load the hash table for your reference genome:dragen -r <reference_hash-table_directory>

Make sure that the reference hash table directory is on the fast file IO drive.

The default location for the hash table for hg19 is as follows./staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149

The command to load reference genome hg19 from the default location is as follows.dragen -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149

This command loads the binary reference genome into memory on the DRAGEN board, where it is usedfor processing any number of input data sets. You do not need to reload the reference genome unlessyou restart the system or need to switch to a different reference genome. It can take up to a minute toload a reference genome.

DRAGEN checks whether the specified reference genome is already resident on the board. If it is, thenthe upload of the reference genome is automatically skipped. You can force reloading of the samereference genome using the force-load-reference (-l) command line option.

The command to load the reference genome prints the software and hardware versions to standardoutput. For example:

DRAGEN Host Software Version 01.001.035.01.00.30.6682 and

Bio-IT Processor Version 0x1001036

After the reference genome has been loaded, the following message is printed to standard output:DRAGEN finished normally

Determine Input andOutput File LocationsThe DRAGEN Bio-IT Platform is very fast, which requires careful planning for the locations of the inputand output files. If the input or output files are on a slow file system, then the overall performance of thesystem is limited by the throughput of that file system.

The DRAGEN system is preconfigured with at least one fast file system consisting of a set of fast SSDdisks grouped with RAID-0 for performance. This file system is mounted at /staging. This name waschosen to emphasize the fact that this area was built to be large and fast, but is not redundant. Failureof any of the file system's constituent disks leads to the loss of all data stored there.

Document # 1000000101034 v00



Before running any analyses, the following are recommended:

u Copy your input data to /staging.

u Direct your outputs to /staging.

u Save a copy of the input data from the staging area in a separate location.

Process Your Input DataTo analyze FASTQ data, use the dragen command. For example, the following command can be usedto analyze a single-ended FASTQ file:

dragen-r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \-1 /staging/test/data/SRA056922.fastq \--output-directory /staging/test/output \--output-file-prefix SRA056922_dragen

For information on the command line options, see DRAGEN Host Software on page 6

Document # 1000000101034 v00



Chapter 2 DRAGEN Host SoftwareYou use the DRAGEN host software program dragen to build and load reference genomes, and then toanalyze sequencing data by decompressing the data, mapping, aligning, sorting, duplicate marking withoptional removal, and variant calling.

Invoke the software using the dragen command. The command line options are described in thefollowing sections.

Command line options can also be set in a configuration file. For more information on configuration files,see Configuration Files on page 14. If an option is set in the configuration file and is also specified onthe command-line, the command line option overrides the configuration file.

Command-line OptionsThe usage summary for the dragen command is as follows:

u Build Reference/Hash Tabledragen --build-hash-table true --ht-reference <REF_FASTA> \

--output-directory <REF_DIR> [options]

u Run Map/Align and Variant Caller (*.fastq to *.vcf)dragen -r <REF_DIR> --output-directory <OUT_DIR> \

--output-file-prefix <FILE_PREFIX> [options] -1 <FASTQ> \[-2 <FASTQ>] --enable-variant-caller true \--vc-sample-name <SAMPLE>

u Run Map/Align (*.fastq to *.bam)dragen -r <REF_DIR> --output-directory <OUT_DIR> \

--output-file-prefix <FILE_PREFIX> [options] -1 <FASTQ> [-2 <FASTQ>]

u Run Variant Caller (*.bam to *.vcf)dragen -r <REF_DIR> --output-directory <OUT_DIR> \

--output-file-prefix <FILE_PREFIX> [options] -b <BAM> \--enable-variant-caller true --vc-sample-name <SAMPLE>

u Run BCL Converter (BCL to *.fastq)dragen --bcl-conversion-only=true --bcl-input-dir <BCL dir> \

--bcl-output-dir <output directory>

u Run RNA Map/Align (*.fastq to *.bam)dragen -r <REF_DIR> --output-directory <OUT_DIR> \

--output-file-prefix <FILE_PREFIX> [options] -1 <FASTQ> \[-2 <FASTQ>] --enable-rna true

For a complete list of command line options, see Command Line Options on page 117.

ReferenceGenomeOptionsBefore you can use the DRAGEN system for aligning reads, you must load a reference genome and itsassociated hash tables onto the PCIe card. For information on preprocessing a reference genome’sFASTA files into the native DRAGEN binary reference and hash table formats, see Preparing aReference Genome on page 100. You must also specify the directory containing the preprocessed binaryreference and hash tables with the --ref-dir option. This argument is always required.

Document # 1000000101034 v00


You can load the reference genome and hash tables to DRAGEN card memory separately fromprocessing reads, as follows.

dragen -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149

You can use the -l (--force-load-reference) option to force the reference genome to load even if it isalready loaded, as follows.

dragen -l -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149

The time needed to load the reference genome depends on the size of the reference, but for typicalrecommended settings, it takes approximately 30–60 seconds.

OperatingModesDRAGEN has two primary modes of operation, as follows:

u Mapper/aligner

u Variant caller

The DRAGEN system is capable of performing each mode independently or as an end-to-end solution.The DRAGEN system also allows you to enable and disable decompression, sorting, duplicate marking,and compression along the DRAGEN pipeline.

u Full pipeline mode

To execute full pipeline mode, set --enable-variant-caller to true and provide input as unmappedreads in *.fastq, *.bam, or *.cram formats.

DRAGEN performs decompression, mapping, aligning, sorting, and optional duplicate marking andfeeds directly into the variant caller to produce a VCF file. In this mode, DRAGEN uses parallelstages throughout the pipeline to drastically reduce the overall run time.

u Map/align mode

Map/align mode is enabled by default. Input is unmapped reads in *.fastq, *.bam, or *.cram format.DRAGEN produces an aligned and sorted BAM or CRAM file. To mark duplicate reads at the sametime, set --enable-duplicate-marking to true.

u Variant caller mode

To execute variant caller mode, set the --enable-variant-caller option to true.

Input is a mapped and aligned BAM file. DRAGEN produces a VCF file. If the BAM file is alreadysorted, sorting can be skipped by setting --enable-sort to false. BAM files cannot be duplicatemarked in the DRAGEN pipeline prior to variant calling if they have not already been marked. Usethe end-to-end mode of operation to take advantage of the mark-duplicates feature.

u RNA-Seq data

To enable processing of RNA-Seq–based data, set --enable-rna to true.

DRAGEN uses the RNA spliced aligner during the mapper/aligner stage. The DRAGEN Bio-ITProcessor dynamically switches between the required modes of operation.

Document # 1000000101034 v00



u Bisulfite MethylSeq data

To enable processing of Bisulfite MethylSeq data, set the --enable-methylation-calling option to true.DRAGEN automates the processing of data for Lister and Cokus (aka directional and nondirectional)protocols, generating a single BAM with bismark-compatible tags.

Alternatively, you can run DRAGEN in a mode that produces a separate BAM file for eachcombination of the C->T and G->A converted reads and references. To enable this mode ofprocessing, you need to build a set of reference hash tables with --ht-methylated enabled, and rundragen with the appropriate --methylation-protocol setting.

The remainder of this section provides more information on options that provide finer control of theDRAGEN pipeline.

Output OptionsThe following command line options for output are mandatory:

u --output-directory <out_dir> specifies the output directory for generated files.

u --output-file-prefix <out_prefix> specifies the output file prefix. DRAGEN appends the appropriate fileextension onto this prefix for each generated file.

u -r [ --ref-dir ] specifies the reference hash table.

For brevity, the following examples do not include these mandatory options.

For mapping and aligning, the output is sorted and compressed into BAM format by default before savingto disk. The user can control the output format from the map/align stage with the --output-format<SAM|BAM|CRAM> option. If the output file exists, the software issues a warning and exits. To forceoverwrite if the output file already exists, use the -f [ --force ] option.

For example, the following commands output to a compressed BAM file, and forces overwrite:dragen ... -f

dragen ... -f --output-format=bam

To generate a BAI-format BAM index file (.bai file extension), set --enable-bam-indexing to true.

The following example outputs to a SAM file, and forces overwrite:dragen ... -f --output-format=sam

The following example outputs to a CRAM file, and forces overwrite:dragen ... -f --output-format=cram

DRAGEN can generate mismatch difference (MD) tags, as described in the BAM standard. However,because there is a small performance cost to generating these strings, it is turned off by default. Togenerate MD tags, set --generate-md-tags to true.

To generate ZS:Z alignment status tags, set --generate-zs-tags to true. These tags are only generated inthe primary alignment, and only when a read has any suboptimal alignments qualifying for secondaryoutput (even if none were output because --Aligner.sec-aligns was set to 0). Valid tag values are:

u ZS:Z:R—Multiple alignments with similar score were found

u ZS:Z:NM—No alignment was found

u ZS:Z:QL—An alignment was found but it was below the quality threshold

To generate SA:Z tags, set --generate-sa-tags to true. These tags provide alignment information(position, cigar, orientation) of groups of supplementary alignments, which are useful in structural variantcalling.

Document # 1000000101034 v00



Input OptionsThe DRAGEN system is capable of processing reads in FASTQ format or BAM/CRAM format. If yourinput FASTQ files end in .gz, DRAGEN automatically decompresses the files using hardware-accelerated decompression.

FASTQ Input FilesFASTQ input files can be single-ended or paired-end, as shown in the following examples.

u Single-ended in one FASTQ file (-1 option)dragen -r <ref_dir> -1 <fastq> --output-directory <out_dir> \

--output-file-prefix <out_prefix>

u Paired-end in two matched FASTQ files (-1 and -2 options)dragen -r <ref_dir> -1 <fastq1> -2 <fastq2> \

--output-directory <out_dir> --output-file-prefix <out_prefix>

u Paired-end in a single interleaved FASTQ file (--interleaved (-i) option)dragen -r <ref_dir> -1 <interleaved_fastq> -i

FASTQ samples can be segmented into multiple files to limit file size or to decrease the time togenerate them. Both bcl2fastq and the DRAGEN BCL command use a common file naming convention,as follows:

<SampleID>_S<#>_<Lane>_<Read>_<segment#>.fastq.gz

For Example:RDRS182520_S1_L001_R1_001.fastq.gz

RDRS182520_S1_L001_R1_002.fastq.gz

...

RDRS182520_S1_L001_R1_008.fastq.gz

The file naming convention is different for HiSeqX and NextSeq instruments.

These files do not need to be concatenated to be processed together by DRAGEN. To map/align anysample, provide it with the first file in the series (-1 <FileName>_001.fastq). DRAGEN reads allsegment files in the sample consecutively for both of the FASTQ file sequences specified using the -1and -2 options for paired-end input, and for compressed fastq.gz files as well. This behavior can beturned off by setting --enable-auto-multifile to false on the command line.

DRAGEN can also optionally read multiple files by the sample name given in the file name, which canbe used to combine samples that have been distributed across multiple BCL lanes or flow cells. Toenable this feature, set the --combine-samples-by-name option to true.

If the FASTQ files specified on the command line use the Casava 1.8 file naming convention shownabove, and additional files in the same directory share that sample name, those files and all theirsegments are processed automatically. Note that sample name, read number, and file extension mustmatch. Index barcode and lane number may differ.

Input files must be located on a fast file system to avoid impact on system performance.

fastq-list input fileTo provide multiple FASTQ input files, use the --fastq-list <csv file name> option to specify the name ofa CSV file containing the list of FASTQ files. For example:

Document # 1000000101034 v00



dragen -r <ref_dir> --fastq-list <csv_file> \--fastq-list-sample-id <Sample ID> \--output-directory <out_dir> --output-file-prefix <out_prefix>

Using a CSV file allows the FASTQ input files to have arbitrary names, to be input from multiplesubdirectories, and to have tags specified explicitly for multiple read groups. DRAGEN automaticallygenerates a CSV file of the correct format during BCL conversion to FASTQ. This CSV file is namedfastq_list.csv, and contains an entry for each FASTQ file or file pair (for paired-end data) producedduring that run.

NOTEFor a single run, only one BAM and VCF output file are produced because all input read groups areexpected to belong to the same sample.

A CSV file of the correct format is automatically generated during BCL conversion to FASTQ usingDRAGEN. This CSV file is named fastq_list.csv, and contains an entry for each fastq file or file pair (forpaired-end data) produced during that run. ITo use this as a template for map-align input, copy the fastq_list.csv file and delete lines that do not correspond to inputs to the run being prepared.

FASTQCSV File FormatThe first line of the CSV file specifies the title of each column, and is followed by one or more datalines. All lines in the CSV file must contain the same number of comma-separated values and shouldnot contain white space or other extraneous characters.

Column titles are case-sensitive. The following column titles are required:

u RGID—Read Group

u RGSM—Sample ID

u RGLB—Library

u Lane—Flow cell lane

u Read1File—Full path to a valid FASTQ input file

u Read2File—Full path to a valid FASTQ input file (required for paired-end input, leave empty if notpaired-end)

Each FASTQ file referenced in the CSV list can be referenced only once. All values in the Read2Filecolumn must be either nonempty and reference valid files, or they must all be empty.

When generating a BAM file using fastq-list input, one read group is generated per unique RGID value.The BAM header contains RG tags for the following:

u ID (from RGID)

u SM (from RGSM)

u LB (from RGLB)

Additional tags can be specified for each read group by adding a column title that is four characters, all-uppercase, and begins with RG. For example, to add a PU (platform unit) tag, add a column namedRGPU and specify the value for each read group in this column. All column titles must be unique.

A fastq-list file can contain files for more than one sample. If a fastq-list file contains only one uniqueRGSM entry, then no additional options need to be specified, and DRAGEN processes all files listed inthe fastq-list file. If there is more than one unique RGSM entry in a fastq-list file, one of the followingtwo options must also be specified in addition to --fastq-list <filename>.

Document # 1000000101034 v00



u To process a specific sample from the CSV file, use --fastq-list-sample-id <SampleID>. Only theentries in the fastq-list file with an RGSM value matching the specified SampleID are processed.

u To process all samples together in the same run, regardless of the RGSM value, set --fastq-list-all-samples to true.

There is no option to specify groupings or subsets of RGSM values for more complex filtering, but thefastq-list file can be modified to achieve the same effect.

The following is an example FASTQ list CSV file with the required columns:RGID,RGSM,RGLB,Lane,Read1File,Read2File

CACACTGA.1,RDSR181520,UnknownLibrary,1,/staging/RDSR181520_S1_L001_R1_001.fastq, /staging/RDSR181520_S1_L001_R2_001.fastqAGAACGGA.1,RDSR181521,UnknownLibrary,1,/staging/RDSR181521_S2_L001_R1_001.fastq, /staging/RDSR181521_S2_L001_R2_001.fastqTAAGTGCC.1,RDSR181522,UnknownLibrary,1,/staging/RDSR181522_S3_L001_R1_001.fastq, /staging/RDSR181522_S3_L001_R2_001.fastqAGACTGAG.1,RDSR181523,UnknownLibrary,1,/staging/RDSR181523_S4_L001_R1_001.fastq, /staging/RDSR181523_S4_L001_R2_001.fastq

If you use the --tumor-fastq-list option for somatic input, use the --tumor-fastq-list-sample-id<SampleID> option to specify the sample ID for the corresponding FASTQ list, as shown in thefollowing example:

dragen -r <ref_dir> --tumor-fastq-list <csv_file> \--tumor-fastq-list-sample-id <Sample ID> \--output-directory <out_dir> \--output-file-prefix <out_prefix> --fastq-list <csv_file_2> \--fastq-list-sample-id <Sample ID 2>

BAM input filesTo use BAM files as input to the mapper/aligner, set --enable-map-align to true. If you leave this optionset to false (the default), you can use the BAM file as input to the variant caller.

When you specify a BAM file as input, DRAGEN ignores any alignment information contained in theinput file, and outputs new alignments for all reads. If the input file contains paired-end reads, it isimportant to specify that the input data should be sorted so that pairs can be processed together. Otherpipelines would require you to re-sort the input data set by read name. DRAGEN vastly increases thespeed of this operation by pairing the input reads, and sending them on to the mapper/aligner when pairsare identified. Use the --pair-by-name option to enable or disable this feature (the default is true).

Specify single-ended input in one BAM file with the (-b) and --pair-by-name=false options, asfollows:

dragen -r <ref_dir> -b <bam> --output-directory <out_dir> --output-file-prefix <out_prefix> --pair-by-name=false

Specify paired-end input in one BAM file with the (-b) and --pair-by-name=true options, as follows:dragen -r <ref_dir> -b <bam> --output-directory <out_dir> --output-file-

prefix <out_prefix> --pair-by-name=true

CRAM inputYou can use CRAM files as input to the DRAGEN mapper/aligner and variant caller. The DRAGENfunctionality available when using CRAM input is the same as when using BAM input.

Document # 1000000101034 v00



The --cram-reference option is no longer needed. The CRAM compressor and decompressor uses theDRAGEN reference.

The following options are used for providing a CRAM input to either mapper/aligner or variant caller:

u --cram-input—The name and path for the CRAM file

u --cram-input—One usage example is paired-end input in a single CRAM file. In addition, set the --pair-by-name option to true.dragen -r <ref_dir> --cram-input <cram> --output-directory <out_dir> \

--output-file-prefix <out_prefix> --pair-by-name=true

Handling of N basesOne of the techniques that DRAGEN uses for optimizing the handling of sequences can lead to theoverwriting of the base quality score assigned to N base calls.

When you use the --fastq-n-quality and --fastq-offset options, the base quality scores are overwrittenwith a fixed base quality. The default values for these options are 2 and 33, respectively, and theycombine to match the Illumina minimum quality of 35 (ASCII character ‘#’).

ReadNames for Paired-End ReadsBy a common convention, read names can include suffixes (such as “/1” or “/2”) that indicate which endof a pair the read represents. For BAM input with the --pair-by-name option, DRAGEN ignores thesesuffixes to find matching pair names. By default, DRAGEN uses the forward slash character as thedelimiter for these suffixes and ignores the “/1” and “/2” when comparing names. By default, DRAGENstrips these suffixes from the original read names.

DRAGEN has the following options to control how suffixes are used:

u To change the delimiter character, for suffixes, use the --pair-suffix-delimiter option. Valid values forthis option include forward-slash (/), dot (.), and colon (:).

u To preserve the entire name, including the suffixes, set --strip-input-qname-suffixes to false.

u To append a new set of suffixes to all read names, set --append-read-index-to-name to true, wherethe delimiter is determined by the --pair-suffix-delimiter option. By default the delimiter is a slash, so/1 and /2 are added to the names.

Gene Annotation Input FilesWhen processing RNA-Seq data, you can supply a gene annotations file by using the --annotation-fileoption. Providing this file improves the accuracy of the mapping and aligning stage (see Input Files onpage 87). The file should conform to the GTF/GFF format specification and should list annotatedtranscripts that match the reference genome being mapped against. The similar GFF3 format is currentlynot supported.

DRAGEN can take the SJ.out.tab file (see SJ.out.tab on page 89) as an annotations file to help guidethe aligner in a two-pass mode of operation.

Preservation or Stripping of BQSRTagsThe Picard Base Quality Score Recalibration (BQSR) tool produces output BAM files that include tagsBI and BD. BQSR calculates these tags relative to the exact sequence for a read. If a BAM file withBI and BD tags is used as input to mapper/aligner with hard clipping enabled, the BI and/or BD tags canbecome invalid.

Document # 1000000101034 v00



The recommendation is to strip these tags when using BAM files as input. To remove the BI and BDtags, set the --preserve-bqsr-tags option to false. If you preserve the tags, DRAGEN warns you todisable hard clipping.

ReadGroupOptionsDRAGEN assumes that all the reads in a given FASTQ belong to the same read group. The DRAGENsystem creates a single @RG read group descriptor in the header of the output BAM file, with the abilityto specify the following standard BAM attributes:

Attribute Argument Description

ID --RGID Read group identifier. If you include any of the read groupparameters, RGID is required. It is the value written into each outputBAM record.

LB --RGLB Library.

PL --RGPL Platform/technology used to produce the reads. The BAM standardallows for values CAPILLARY, LS454, ILLUMINA, SOLID,HELICOS, IONTORRENT and PACBIO.

PU --RGPU Platform unit, eg, flowcell-barcode.lane.

SM --RGSM Sample.

CN --RGCN Name of the sequencing center that produced the read.

DS --RGDS Description.

DT --RGDT Date the run was produced.

PI --RGPI Predicted mean insert size.

If any of these arguments are present, the DRAGEN software adds an RG tag to all the output recordsto indicate that they are members of a read group. The following example shows a command line thatincludes read group parameters:

dragen --RGID 1 --RGCN Broad --RGLB Solexa-135852 \--RGPL Illumina --RGPU 1 --RGSM NA12878 \-r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \-1 SRA056922.fastq --output-directory /staging/tmp/ \--output-file-prefix rg_example

Future versions of the DRAGEN system will allow FASTQ records to be individually tagged asbelonging to different read groups.

LicenseOptionsTo suppress the license status message at the end of the run, use the --lic-no-print option. The followingshows an example of the license status message:

LICENSE_MSG| =====================================================LICENSE_MSG| License reportLICENSE_MSG| Genome status [ACxxxxxxxxxxx] : used 1263.9 Gbasessince 2018-Feb-15 (1263886160894 bases, unlimited)LICENSE_MSG| Genome bases [ACxxxxxxxxxxx] : 202000000LICENSE_MSG| Genome bases [total] : 202000000

Document # 1000000101034 v00



AutogeneratedMD5SUM for BAMsAn MD5SUM file is generated automatically for BAM output files. This file has the same name as theBAM output file, with an .md5sum extension appended (eg, whole_genome_run_123.bam.md5sum). TheMD5SUM file is a single-line text file that contains the md5sum of the BAM output file, which exactlymatches the output of the Linux md5sum command.

The MD5SUM calculation is performed as the BAM output file is written, so there is no measurableperformance impact (compared to the Linux md5sum command, which can take several minutes for a30x BAM).

Configuration FilesCommand line options can be stored in a configuration file. The location of the default configuration fileis /opt/edico/config/dragen-user-defaults.cfg. You can override this file by using the --config-file (-c)option to specify a different file. The configuration file used for a given run supplies the default settingsfor that run, any of which can be overridden by command line options.

The recommended approach is to use the dragen-user-defaults.cfg file as a template to create defaultsettings for different use cases. Copy dragen-user-defaults.cfg, rename the copy, then modify the newfile for the specific use-case. Best practice is to put options that rarely change into the configuration fileand to specify options that vary from run to run on the command line.

Document # 1000000101034 v00



Chapter 3 DRAGEN DNA ApplicationsFigure 2 DNA Applications for DRAGEN

DNAMapping

Seed Density OptionThe seed-density option controls how many (normally overlapping) primary seeds from each read themapper looks up in its hash table for exact matches. The maximum density value of 1.0 generates aseed starting at every position in the read, ie, (L-K+1) K-base seeds from an L-base read.

Seed density must be between 0.0 and 1.0. Internally, an available seed pattern equal or close to therequested density is selected. The sparsest pattern is one seed per 32 positions, or density 0.03125.

u Accuracy Considerations—Generally, denser seed lookup patterns improve mappingaccuracy. However, for modestly long reads (eg, 50 bp+) and low sequencer error rates, there islittle to be gained beyond the default 50% seed lookup density.

u Speed Considerations—Denser seed lookup patterns generally slow down mapping, and sparserseed patterns speed it up. However, when the seed mapping stage can run faster than the aligningstage, a sparser seed pattern does not make the mapper much faster.

Relationship to Reference Seed IntervalFunctionally, a denser or sparser seed lookup pattern has an impact very similar to a shorter or longerreference seed interval (build hash table option --ht-ref-seed-interval). Populating 100% of reference seedpositions and looking up 50% of read seed positions has the same effect as populating 50% of referenceseed positions and looking up 100% of read seed positions. Either way, the expected density of seedhits is 50%.

More generally, the expected density of seed hits is the product of the reference seed density (theinverse of the reference seed interval) and the seed lookup density. For example, if 50% of referenceseeds are populated and 33.3% (1/3) of read seed positions are looked up, then the expected seed hitdensity should be 16.7% (1/6).

Document # 1000000101034 v00


DRAGEN automatically adjusts its precise seed lookup pattern to ensure it does not systematicallymiss the seed positions populated from the reference. For example, the mapper does not look up seedsmatching only odd positions in the reference when only even positions are populated in the hash table,even if the reference seed interval is 2 and seed-density is 0.5.

MapOrientations OptionThe --map-orientations option is used in mapping reads for bisulfite methylation analysis. It is setautomatically based on the value set for --methylation-protocol.

The --map-orientations option can restrict the orientation of read mapping to only forward in the referencegenome, or only reverse-complemented. The valid values for --map-orientations are as follows.

u 0—Either orientation (default)

u 1—Only forward mapping

u 2—Only reverse-complemented mapping

If mapping orientations are restricted and paired end reads are used, the expected pair orientation canonly be FR, not FF or RF.

Seed-Editing OptionsAlthough DRAGEN primarily maps reads by finding exact reference matches to short seeds, it can alsomap seeds differing from the reference by one nucleotide by also looking up single-SNP editedseeds. Seed editing is usually not necessary with longer reads (100 bp+), because longer reads have ahigh probability of containing at least one exact seed match. This is especially true when paired endsare used, because a seed match from either mate can successfully align the pair. But seed editing can,for example, be useful to increase mapping accuracy for short single-ended reads, with some cost inincreased mapping time. The following options control seed editing:

Command-Line Option Name Configuration File Option Name

--Mappper.seed-density seed-density

--Mapper.edit-mode edit-mode

--Mapper.edit-seed-num edit-seed-num

--Mapper.edit-read-len edit-read-len

--Mapper.edit-chain-limit edit-chain-limit

Table 1 Seed Editing Options

edit-mode and edit-chain-limitThe edit-mode and edit-chain-limit options control when seed editing is used. The following four edit-mode values are available:

Mode Description

0 No editing (default)

1 Chain length test

2 Paired chain length test

3 Full seed editing

Document # 1000000101034 v00



Edit mode 0 requires all seeds to match exactly. Mode 3 is the most expensive because every seed thatfails to match the reference exactly is edited. Modes 1 and 2 employ heuristics to look up edited seedsonly for reads most likely to be salvaged to accurate mapping.

The main heuristic in edit modes 1 and 2 is a seed chain length test. Exact seeds are mapped to thereference in a first pass over a given read, and the matching seeds are grouped into chains of similarlyaligning seeds. If the longest seed chain (in the read) exceeds a threshold edit-chain-limit, the read isjudged not to require seed editing, because there is already a promising mapping position.

Edit mode 1 triggers seed editing for a given read using the seed chain length test. If no seed chainexceeds edit-chain-limit (including if no exact seeds match), then a second seed mapping pass isattempted using edited seeds. Edit mode 2 further optimizes the heuristic for paired-end reads. If eithermate has an exact seed chain longer than edit-chain-limit, then seed editing is disabled for the pair,because a rescue scan is likely to recover the mate alignment based on seed matches from oneread. Edit mode 2 is the same as mode 1 for single-ended reads.

edit-seed-num and edit-read-lenFor edit modes 1 and 2, when the heuristic triggers seed editing, these options control how many seedpositions are edited in the second pass over the read. Although exact seed mapping can use a denselyoverlapping seed pattern, such as seeds starting at 50% or 100% of read positions, most of the value ofseed editing can be obtained by editing a much sparser pattern of seeds, even a nonoverlappingpattern. Generally, if a user application can afford to spend some additional amount of mapping time onseed editing, a greater increase in mapping accuracy can be obtained for the same time cost by editingseeds in sparse patterns for a large number of reads, than by editing seeds in dense patterns for a smallnumber of reads.

Whenever seed editing is triggered, these two options request edit-seed-num seed editing positions,distributed evenly over the first edit-read-len bases of the read. For example, with 21-base seeds, edit-seed-num=6 and edit-read-len=100, edited seeds can begin at offsets {0, 16, 32, 48, 64, 80} from the 5’end, consecutive seeds overlapping by 5 bases. Because sequencing technologies often yield betterbase qualities nearer the (5’) beginning of each read, this can focus seed editing where it is most likelyto succeed. When a particular read is shorter than edit-read-len, fewer seeds are edited.

Seed editing is more expensive when the reference seed interval (build hash table option --ht-ref-seed-interval) is greater than 1. For edit modes 1 and 2, additional seed editing positions are automaticallygenerated to avoid missing the populated reference seed positions. For edit mode 3, the time cost canincrease dramatically because query seeds matching unpopulated reference positions typically miss andtrigger editing.

DNA Aligning

Smith-Waterman Alignment Scoring SettingsThe first stage of mapping is to generate seeds from the read and look for exact matches in thereference genome. These results are then refined by running full Smith-Waterman alignments on thelocations with the highest density of seed matches. This well-documented algorithm works by comparingeach position of the read against all the candidate positions of the reference. These comparisonscorrespond to a matrix of potential alignments between read and reference. For each of these candidatealignment positions, Smith-Waterman generates scores that are used to evaluate whether the bestalignment passing through that matrix cell reaches it by a nucleotide match or mismatch (diagonal

Document # 1000000101034 v00



movement), a deletion (horizontal movement), or an insertion (vertical movement). A match betweenread and reference provides a bonus, on the score, and a mismatch or indel imposes a penalty. Theoverall highest scoring path through the matrix is the alignment chosen.

The specific values chosen for scores in this algorithm indicate how to balance, for an alignment withmultiple possible interpretations, the possibility of an indel as opposed to one or more SNPs, or thepreference for an alignment without clipping. The default DRAGEN scoring values are reasonable foraligning moderate length reads to a whole human reference genome for variant calling applications. Butany set of Smith-Waterman scoring parameters represents an imprecise model of genomic mutation andsequencing errors, and differently tuned alignment scoring values can be more appropriate for someapplications.

The following alignment options control Smith-Waterman Alignment:

Command-Line Option Name Configuration File Option Name

--Aligner.global global

--Aligner.match-score match-score

--Aligner.match-n-score match-n-score

--Aligner.mismatch-pen mismatch-pen

--Aligner.gap-open-pen gap-open-pen

--Aligner.gap.ext.pen gap-ext-pen

--Aligner.unclip-score unclip-score

--Aligner.no-unclip-score no-unclip-score

--Aligner.aln-min-score aln-min-score

u global

The global option (value can be 0 or 1) controls whether alignment is forced to be end-to-end in theread. When set to 1, alignments are always end-to-end, as in the Needleman-Wunsch globalalignment algorithm (although not end-to-end in the reference), and alignment scores can be positiveor negative. When set to 0, alignments can be clipped at either or both ends of the read, as in theSmith-Waterman local alignment algorithm, and alignment scores are nonnegative.

Generally, global=0 is preferred for longer reads, so significant read segments after a break of somekind (large indel, structural variant, chimeric read, and so forth) can be clipped without severelydecreasing the alignment score. Setting global=1 might not have the desired effect with longer readsbecause insertions at or near the ends of a read can function as pseudoclipping. Also, with global=0,multiple (chimeric) alignments can be reported when various portions of a read match widelyseparated reference positions.

Using global=1 is sometimes preferable with short reads, which are unlikely to overlap structuralbreaks, unable to support chimeric alignments, and are suspected of incorrect mapping if theycannot align well end-to-end.

Consider using the unclip-score option, or increasing it, instead of setting global=1, to make a softpreference for unclipped alignments.

u match-score

The match-score option is the score for a read nucleotide matching a reference nucleotide (A, C, G,or T). Its value is an unsigned integer, from 0 to 15. match_score=0 can only be used whenglobal=1. A higher match score results in longer alignments, and fewer long insertions.

Document # 1000000101034 v00



u match-2-score

The match-2-score option is the score for a read nucleotide matching a 2-base IUPAC-IUB code inthe reference (K, M, R, S, W, or Y). This option is a signed integer, ranging from -16 to 15.

u match-3-score

The match-3-score option is the score for a read nucleotide matching a 3-base IUPAC-IUB code inthe reference (B, D, H, or V). This option is a signed integer, from -16 to 15.

u match-n-score

The match-n-score option is the score for a read nucleotide matching an N code in the reference.This option is a signed integer, from -16 to 15.

u mismatch-pen

The mismatch-pen option is the penalty (negative score) for a read nucleotide mismatching anyreference nucleotide or IUPAC-IUB code (except ‘N’, which cannot mismatch). This option is anunsigned integer, from 0 to 63. A higher mismatch penalty results in alignments with moreinsertions, deletions, and clipping to avoid SNPs.

u gap-open-pen

The gap-open-pen option is the penalty (negative score) for opening a gap (ie, an insertion ordeletion). This value is only for a 0-base gap. It is always added to the gap length times gap-ext-pen. This option is an unsigned integer, from 0 to 127. A higher gap open penalty causes fewerinsertions and deletions of any length in alignment CIGARs, with clipping or alignment throughSNPs used instead.

u gap-ext-pen

The gap-ext-pen option is the penalty (negative score) for extending a gap (ie, an insertion ordeletion) by one base. This option is an unsigned integer, from 0 to 15. A higher gap extensionpenalty causes fewer long insertions and deletions in alignment CIGARs, with short indels, clipping,or alignment through SNPs used instead.

u unclip-score

The unclip-score option is the score bonus for an alignment reaching the beginning or end of theread. An end-to-end alignment receives twice this bonus. This option is an unsigned integer, from 0to 127. A higher unclipped bonus causes alignment to reach the beginning and/or end of a read moreoften, where this can be done without too many SNPs or indels.

A nonzero unclip-score is useful when global=0 to make a soft preference for unclipped alignments.Unclipped bonuses have little effect on alignments when global=1, because end-to-end alignmentsare forced anyway (although 2 × unclip-score does add to every alignment score unless no-unclip-score = 1). It is recommended to use the default for unclip-score when global=1, because someinternal heuristics consider how local alignments would have been clipped.

Note that, especially with longer reads, setting unclip-score much higher than gap-open-pen canhave the undesirable effect of insertions at or near one end of a read being utilized aspseudoclipping, as happens with global=1.

u no-unclip-score

The no-unclip-score option can be 0 or 1. The default is 1. When no-unclip-score is set to 1, anyunclipped bonus (unclip-score) contributing to an alignment is removed from the alignment scorebefore further processing, such as comparison with aln-min-score, comparison with other alignmentscores, and reporting in AS or XS tags. However, the unclipped bonus still affects the best-scoringalignment found by Smith-Waterman alignment to a given reference segment, biasing toward

Document # 1000000101034 v00



unclipped alignments.

When unclip-score > 0 causes a Smith-Waterman local alignment to extend out to one or both endsof the read, the alignment score stays the same or increases if no-unclip-score=0, whereas it staysthe same or decreases if no-unclip-score=1.

The default, no-unclip-score=1, is recommended when global=1, because every alignment is end-to-end, and there is no need to add the same bonus to every alignment.

When changing no-unclip-score, consider whether aln-min-score should be adjusted. When no-unclip-score=0, unclipped bonuses are included in alignment scores compared to the aln-min-scorefloor, so the subset of alignments filtered out by aln-min-score can change significantly with no-unclip-score.

u aln-min-score

The aln-min-score option specifies a minimum acceptable alignment score. Any alignment resultsscoring lower are discarded. Increasing or decreasing aln-min-score can reduce or increase thepercentage of reads mapped. This option is a signed integer (negative alignment scores are possiblewith global=0).

aln-min-score also affects MAPQ estimates. The primary contributor to MAPQ calculation is thedifference between the best and second-best alignment scores, and aln-min-score serves as thesuboptimal alignment score if nothing higher was found except the best score. Therefore, increasingaln-min-score can decrease reported MAPQ for some low-scoring alignments.

Paired-EndOptionsDRAGEN can process paired-end data, passed in either via a pair of FASTQ files, or in a singleinterleaved FASTQ file. The hardware maps the two ends separately, and then determines a set ofalignments that seem most likely to form a pair in the expected orientation and having roughly theexpected insert size. The alignments for the two ends are evaluated for the quality of their pairing, withlarger penalties for insert sizes far from the expected size. The following options control processing ofpaired-end data:

u Reorientation

The pe-orientation option specifies the expected paired-end orientation. Only pairs with thisorientation can be flagged as proper pairs. Valid values are as follows:

u 0—FR (default)

u 1—RF

u 2—FF

u unpaired-pen

For paired end reads, best mapping positions are determined jointly for each pair, according to thelargest pair score found, considering the various combinations of alignments for each mate. A pairscore is the sum of the two alignment scores minus a pairing penalty, which estimates theunlikelihood of insert lengths further from the mean insert than this aligned pair.

The unpaired-pen option specifies how much alignment pair scores should be penalized when thetwo alignments are not in properly paired position or orientation. This option also serves as themaximum pairing penalty for properly paired alignments with extreme insert lengths.

The unpaired-pen option is specified in Phred scale, according to its potential impact on MAPQ.Internally, it is scaled into alignment score space based on Smith-Waterman scoring parameters.

u pe-max-penalty

Document # 1000000101034 v00



The pe-max-penalty option limits how much the estimated MAPQ for one read can increase becauseits mate aligned nearby. A paired alignment is never assigned MAPQ higher than the MAPQ that itwould have received mapping single-ended, plus this value. By default, pe-max-penalty = mapq-max= 255, effectively disabling this limit.

The key difference between unpaired-pen and pe-max-penalty is that unpaired-pen affects calculatedpair scores and thus which alignments are selected and pe-max-penalty affects only reported MAPQfor paired alignments.

Mean Insert Size DetectionWhen working with paired-end data, DRAGEN must choose among the highest-quality alignments forthe two ends to try to choose likely pairs. To make this choice, DRAGEN uses a Gaussian statisticalmodel to evaluate the likelihood that a pair of alignments constitutes a pair. This model is based on theintuition that a particular library prep tends to create fragments of roughly similar size, thus producingpairs whose insert lengths cluster well around some mean insert length.

If you know the statistics of your library prep for an input file (and the file consists of a single readgroup), you can specify the characteristics of the insert-length distribution: mean, standard deviation,and three quartiles. These characteristics can be specified with the Aligner.pe-stat-mean-insert,Alinger.pe-stat-stddev-insert, Aligner.pe-stat-quartiles-insert, and Aligner.pe-stat-mean-read-len options. However, it is typically preferable to allow DRAGEN to detect these characteristicsautomatically.

To enable automatic sampling of the insert-length distribution, set --enable-sampling to true. When thesoftware starts execution, it runs a sample of up to 100,000 pairs through the aligner, calculates thedistribution, and then uses the resulting statistics for evaluating all pairs in the input set.

The DRAGEN host software reports the statistics in its stdout log in a report, as follows:Final paired-end statistics detected for read group 0, based on 79935

high quality pairs for FR orientation

Quartiles (25 50 75) = 398 410 421

Mean = 410.151

Standard deviation = 14.6773

Boundaries for mean and standard deviation: low = 352, high = 467

Boundaries for proper pairs: low = 329, high = 490

NOTE: DRAGEN's insert estimates include corrections for clipping (sothey are no identical to TLEN)

When the number of sample pairs is very small, there is not enough information to characterize thedistribution with high confidence. In this case, DRAGEN applies default statistics that specify a verywide insert distribution, which tends to admit pairs of alignments as proper pairs, even if they may lietens of thousands of bases apart. In this situation, DRAGEN outputs a message, as follows:

WARNING: Less than 28 high quality pairs found - standard deviation iscalculated from the small samples formula

The small samples formula calculates standard deviation as follows:if samples < 3 then

standard deviation = 10000

else if samples < 28 then

standard deviation = 25 * (standard deviation + 1) / (samples – 2)

Document # 1000000101034 v00



end if

if standard deviation < 12 then

standard deviation = 12

end if

The default model is “standard deviation = 10000”. If the first 100000 reads are unmapped or if all pairsare improper pairs, then the standard deviation is set to 10000 and the mean and quartiles are set to 0.Note that the minimum value for standard deviation is 12, which is independent of the number ofsamples.

For RNA-Seq data, the insert size distribution is not normal due to pairs containing introns. TheDRAGEN software estimates the distribution using a kernel density estimator to fit a long tail to thesamples. This estimate leads to a more accurate mean and standard deviation for RNA-Seq data andproper pairing.

Rescue ScansFor paired-end reads, where a seed hit is found for one mate but not the other, rescue scans hunt formissing mate alignments within a rescue radius of the mean insert length. Normally, the DRAGEN hostsoftware sets the rescue radius to 2.5 standard deviations of the empirical insert distribution. But incases where the insert standard deviation is large compared to the read length, the rescue radius isrestricted to limit mapping slowdowns. In this case, a warning message is displayed, as follows:

Rescue radius = 220

Effective rescue sigmas = 0.5

WARNING: Default rescue sigmas value of 2.5 was overridden by host software!

The user may wish to set rescue sigmas value explicitly with --Aligner.rescue-sigmas

Although the user can ignore this warning, or specify an intermediate rescue radius to maintain mappingspeed, it is recommended to use 2.5 sigmas for the rescue radius to maintain mapping sensitivity. Todisable rescue scanning, set max-rescues to 0.

Output OptionsDRAGEN can track up to four independent alignments for each read. These alignments can be chimeric,meaning they map different regions of the read, or they may be suboptimal mappings of the read todifferent areas of the alignment.

As DRAGEN tracks the four best alignments for a read, it gives priority to chimeric (supplemental)alignments over suboptimal (secondary). In other words, it tracks as many chimeric alignments aspossible up to its limit of four. If there are fewer than four chimeric alignments, then the remaining slotsare used to track suboptimal alignments

You can use the following configuration options to control how many of each type of alignment toinclude in DRAGEN output.

u mapq-max

The mapq-max option specifies a ceiling on the estimated MAPQ that can be reported for anyalignment, from 0 to 255. If the calculated MAPQ is higher, this value is reported instead. Thedefault is 60.

u supp-aligns, sec-aligns

The supp-aligns and sec-aligns options restrict the maximum number of supplementary (ie, chimericand SAM FLAG 0x800) alignments and secondary (ie, suboptimal and SAM FLAG 0x100)alignments, respectively, that can be reported for each read.

Document # 1000000101034 v00



A maximum of 31 alignments are reported for any read total, including primary, supplementary, andsecondary. Therefore, supp-aligns and sec-aligns each range from 0 to 30. Supplementaryalignments are tracked and output with higher priority than secondary ones.

High settings for these two options impact speed so it is advisable to increase only as needed.

u sec-phred-delta

The sec-phred-delta option controls which secondary alignments are emitted based on the alignmentscore relative to the primary reported alignment. Only secondary alignments with likelihood withinthis Phred value of the primary are reported.

u sec-aligns-hard

The sec-aligns-hard option suppresses the output of all secondary alignments if there are moresecondary alignments than can be emitted. Set sec-aligns-hard to 1 to force the read to beunmapped when not all secondary alignments can be output.

u supp-as-sec

When the supp-as-sec option is set to 1, then supplementary (chimeric) alignments are reported withSAM FLAG 0x100 instead of 0x800. The default is 0. The supp-as-sec option provides compatibilitywith tools that do not support FLAG 0x800.

u hard-clips

The hard-clips option is used as a field of 3 bits, with values ranging from 0 to 7. The bits specifyalignments, as follows:

u Bit 0—primary alignmentsu Bit 1—supplementary alignmentsu Bit 2—secondary alignments

Each bit determines whether local alignments of that type are reported with hard clipping (1) or softclipping (0). The default is 6, meaning primary alignments use soft clipping and supplementary andsecondary alignments use hard clipping.

ALT-AwareMappingThe GRCh38 human reference contains many more alternate haplotypes (ALT contigs) than previousversions of the reference. Generally, including ALT contigs in the mapping reference improves mappingand variant calling specificity, because misalignments are eliminated for reads matching an ALT contigbut scoring poorly against the primary assembly. However, mapping with GRCh38’s ALT contigs withoutspecial treatment can substantially degrade variant calling sensitivity in corresponding regions, becausemany reads align equally well to an ALT contig and to the corresponding position in the primaryassembly. DRAGEN ALT-aware mapping eliminates this issue, and instead obtains both sensitivity andspecificity improvements from ALT contigs.

ALT-aware mapping requires hash tables that are built with ALT liftover alignments specified (see ALT-Aware Hash Tables on page 102). If a hash table built with liftover alignments is provided, DRAGENautomatically runs with ALT-aware mapping. Set the --alt-aware option to false to disable ALT-Awaremapping with a liftover reference.

DRAGEN requires ALT-Aware hash tables for any hg19 or GRCh38 reference where ALT contigs aredetected. To disable this requirement in DRAGEN, set the --ht-alt-aware-validate option to false.

When ALT-aware mapping is enabled, the mapper and aligner are aware of the liftover relationshipbetween ALT contig positions and corresponding primary assembly positions. Seed matches within ALTcontigs are used to obtain corresponding primary assembly alignments, even if the latter score poorly.

Document # 1000000101034 v00



Liftover groups are formed, each containing a primary assembly alignment candidate, and zero or moreALT alignment candidates that lift to the same location. Each liftover group is scored according to itsbest-matching alignments, taking properly paired alignments into account. The winning liftover groupprovides its primary assembly representative as the primary output alignment, with MAPQ calculatedbased on the score difference to the second-best liftover group. Emitting primary alignments within theprimary assembly maintains normal aligned coverage and facilitates variant calling there. If the --Aligner.en-alt-hap-aln option is set to 1 and --Aligner.supp-aligns is greater than 0, then correspondingalternate haplotype alignments can also be output, flagged as supplementary alignments.

The following is a comparison of alternative options for dealing with alternate haplotypes.

u Mapping without ALT contigs in the reference:u False-positive variant calls result when reads matching an alternate haplotype misalign

somewhere else.u Poor mapping and variant calling sensitivity where reads matching an ALT contig differ greatly

from the primary assembly.

u Mapping with ALT contigs but no ALT awareness:u False-positive variant calls from misaligned reads matching ALT contigs are eliminated.u Low or zero aligned coverage in primary assembly regions covered by alternate haplotypes, due

to some reads mapping to ALT contigs.u Low or zero MAPQ in regions covered by alternate haplotypes, where they are similar or

identical to the primary assembly.u Variant calling sensitivity is dramatically reduced throughout regions covered by alternate

haplotypes.

u Mapping with ALT contigs and alt awareness:u False-positive variant calls from misaligned reads matching ALT contigs are eliminated.u Normal aligned coverage in regions covered by alternate haplotypes because primary alignments

are to the primary assembly.u Normal MAPQs are assigned because alignment candidates within a liftover group are not

considered in competition.u Good mapping and variant calling sensitivity where reads matching an ALT contig differ greatly

from the primary assembly.

SortingThe map/align system produces a BAM file sorted by reference sequence and position bydefault. Creating this BAM file typically eliminates the requirement to run samtools sort or any equivalentpostprocessing command. The --enable-sort option can be used to enable or disable creation of theBAM file, as follows:

u To enable, set to true.

u To disable, set to false.

On the reference hardware system, running with sort enabled increases run time for a 30x full genomeby about 6–7 minutes.

DuplicateMarkingMarking or removing duplicate aligned reads is a common best practice in whole-genome sequencing.Not doing so can bias variant calling and lead to incorrect results.

Document # 1000000101034 v00



The DRAGEN system can mark or remove duplicate reads, and produces a BAM file with duplicatesmarked in the FLAG field, or with duplicates entirely removed.

In testing, enabling duplicate marking adds minimal run time over and above the time required toproduce the sorted BAM file. The additional time is approximately 1–2 minutes for a 30x whole humangenome, which is a huge improvement over the long run times of open source tools.

TheDuplicateMarking AlgorithmThe DRAGEN duplicate-marking algorithm is modeled on the Picard toolkit’s MarkDuplicates feature. Allthe aligned reads are grouped into subsets in which all the members of each subset are potentialduplicates.

For two pairs to be duplicates, they must have the following:

u Identical alignment coordinates (position adjusted for soft- or hard-clips from the CIGAR) at bothends.

u Identical orientations (direction of the two ends, with the left-most coordinate being first).

In addition, an unpaired read may be marked as a duplicate if it has identical coordinate and orientationwith either end of any other read, whether paired or not.

Unmapped or secondary alignments are never marked as duplicates.

When DRAGEN has identified a group of duplicates, it picks one as the best of the group, and marksthe others with the BAM PCR or optical duplicate flag (0x400, or decimal 1024). For this comparison,duplicates are scored based on the average sequence Phred quality. Pairs receive the sum of thescores of both ends, while unpaired reads get the score of the one mapped end. The idea of this scoreis to try, all other things being equal, to preserve the reads with the highest-quality base calls.

If two reads (or pairs) have exactly matching quality scores, DRAGEN breaks the tie by choosing thepair with the higher alignment score. If there are multiple pairs that also tie on this attribute, thenDRAGEN chooses a winner arbitrarily.

The score for an unpaired read R is the average Phred quality score per base, calculated as follows:

Where R is a BAM record, QUAL is its array of Phred quality scores, and dedup-min-qual is a DRAGENconfiguration option with default value of 15. For a pair, the score is the sum of the scores for the twoends.

This score is stored as a one-byte number, with values rounded down to the nearest one-quarter. Thisrounding may lead to different duplicate marks from those chosen by Picard, but because the readswere very close in quality this has negligible impact on variant calling results.

DuplicateMarking LimitationsThe limitations to DRAGEN duplicate marking implementation are as follows:

u When there are two duplicate reads or pairs with very close Phred sequence quality scores,DRAGEN might choose a different winner from that chosen by Picard. These differences havenegligible impact on variant calling results.

u The dragen executable program accepts only a single library ID as a command-line argument(PGLB). For this reason, the FASTQ inputs to the system must be already separated by library ID.Library ID cannot be used as a criterion for distinguishing non-duplicates.

Document # 1000000101034 v00



DuplicateMarking SettingsThe following options can be used to configure duplicate marking in DRAGEN:

u --enable-duplicate-marking

Set to true to enable duplicate marking. When --enable-duplicate-marking is enabled, the output issorted, regardless of the value of the enable-sort option.

u --remove-duplicates

Set to true to suppress the output of duplicate records. If set to false, set the 0x400 flag in theFLAG field of duplicate BAM records. When --remove-duplicates is enabled, then enable-duplicate-marking is forced to enabled as well.

u --dedup-min-qual

Specifies the Phred quality score below which a base should be excluded from the quality scorecalculation used for choosing among duplicate reads.

Small Variant CallingThe DRAGEN small variant caller is a high-speed haplotype caller implemented with a hybrid ofhardware and software. The approach taken includes performing localized de novo assembly in regionsof interest to generate candidate haplotypes and performing read likelihood calculations using a hiddenMarkov model (HMM).

Variant calling is disabled by default. You can enable variant calling by setting the --enable-variant-calleroption to true.

The Variant Caller AlgorithmThe DRAGEN Haplotype Caller performs the following steps:

u Active Region Identification—Areas where multiple reads disagree with the reference are identified,and windows around them (active regions) are selected for processing.

u Localized Haplotype Assembly—In each active region, all the overlapping reads are assembled intoa de Bruijn graph (DBG), a directed graph based on overlapping K-mers (length K subsequences) ineach read or multiple reads. When all reads are identical, the DBG is linear. Where there aredifferences, the graph forms bubbles of multiple paths diverging and rejoining. If the local sequenceis too repetitive and K is too small, cycles can form, which invalidate the graph. Values of K=10 and25 are tried by default. If those values produce an invalid graph, then additional values of K = 35,45, 55, 65 are tried until a cycle-free graph is obtained. From this cycle-free DBG, every possiblepath is extracted to produce a complete list of candidate haplotypes, ie, hypotheses for what thetrue DNA sequence may be on at least one strand.

u Haplotype Alignment—Each extracted haplotype is Smith-Waterman aligned back to the referencegenome, to determine what variations from the reference it implies.

u Read Likelihood Calculation—Each read is tested against each haplotype, to estimate a probabilityof observing the read assuming the haplotype was the true original DNA sampled. This calculation isperformed by evaluating a pair hidden Markov model (HMM), which accounts for the various possibleways the haplotype might have been modified by PCR or sequencing errors into the read observed.The HMM evaluation uses a dynamic programming method to calculate the total probability of anyseries of Markov state transitions arriving at the observed read.

Document # 1000000101034 v00



u Genotyping—The possible diploid combinations of variant events from the candidate haplotypes areformed, and for each of them, a conditional probability of observing the entire read pileup iscalculated, using the constituent probabilities of observing each read given each haplotype from thepair HMM evaluation. These feed into the Bayesian formula to calculate a likelihood that eachgenotype is the truth, given the entire read pileup observed. Those genotypes with maximumlikelihood are reported.

Variant Caller OptionsThe following options control the variant caller stage of the DRAGEN host software.

u --enable-variant-caller

Set --enable-variant-caller to true to enable the variant caller stage for the DRAGEN pipeline.

u --vc-target-bed

Restricts processing to regions specified in the BED file.

u --vc-sample-name

The --vc-sample-name option specifies the sample name being processed. This field is requiredwhen running the variant caller in stand-alone mode. In end-to-end mode, you can use this option oruse the --RGSM map/align option and that option is passed through to the variant caller.

u --vc-target-coverage

The --vc-target-coverage option specifies the target coverage for downsampling. The default value is500 for germline mode and 1000 for somatic mode.

u --vc-enable-gatk-acceleration

If --vc-enable-gatk-acceleration is set to true, the variant caller runs in GATK mode (concordant withGATK 4.0).

u --vc-remove-all-soft-clips

If --vc-remove-all-soft-clips is set to true, the variant caller does not use soft clips of reads todetermine variants.

u --vc-decoy-contigs

The --vc-decoy-contigs option specifies a comma-separated list of contigs to skip during variantcalling. This option can be set in the configuration file.

u --vc-enable-decoy-contigs

If --vc-enable-decoy-contigs is set to true, variant calls on the decoy contigs are enabled. Thedefault value is false.

u --vc-enable-phasing

The –vc-enable-phasing option enables variants to be phased when possible. The default value istrue.

Down-sampling Options for Germline Small Variant CallingThe options for down-sampling reads in the germline small variant calling pipeline are as follows:

u --vc-target-coverage specifies the maximum number of reads with a start position overlapping anygiven position.

u --vc-max-reads-per-active-region specifies the maximum number of reads covering a given activeregion.

Document # 1000000101034 v00



u --vc-max-reads-per-raw-region specifies the maximum number of reads covering a given raw region.

u --vc-min-reads-per-start-pos specifies the minimum number of reads with a start position overlappingany given position.

The target coverage and max/min reads in raw/active region options are not directly related and could betriggered independently.

The target coverage option runs first and is meant to limit the number of reads that share the same startposition at any given position. It is not a limit on the total coverage at a given position.

The following example that shows that the DP reported in a variant record can far exceed the --vc-target-coverage default value (of say 500):

For example, assume the default value of --vc-target-coverage is 500. If there are 400 reads starting atposition 1, another 400 starting at position 2, and another 400 starting at position 3, the target coverageoption is not triggered (because 400 < 500). If there is a variant at position 4, its reported depth could beas high as 1200. This example shows that the DP reported in a variant record can far exceed the --vc-target-coverage value.

After the target coverage step, the maximum number of reads that share the same position is 500 (if --vc-target-coverage is set to 500).

The next down-sampling step is to apply the --vc-max-reads-per-raw-region and --vc-max-reads-per-active-region limits. In this step, the maximum number of reads that share the same position can befurther reduced from the 500 maximum value from the first step. These options are used to limit the totalnumber of reads in an entire region using a leveling down-sampling method.

The down-sampling mechanism scans each start position from the start boundary of the region anddiscards one read from that position, then moves on to the next position, until the total number of readsfalls below the threshold. It can potentially take several passes across the entire region for the totalnumber of reads in the entire region to fall below the threshold. After the threshold is met, the down-sampling step is stopped regardless of which position was considered last in the region.

If the number of reads at any position with same start position is equal to or lower than the --vc-min-reads-per-start-pos, that position is skipped to ensure that there is always at least a minimum number ofreads (set to --vc-min-reads-per-start-pos) at any start position.

When down-sampling occurs, the choice of which reads to keep or remove is somewhat random.However, the random number generator is seeded to a default value to ensure that it produces the sameset of values in each run. This ensures exactly reproducible results, which means there is no run to runvariation when using the same input data.

Phasing and Phased VariantsDRAGEN supports output of phased variant records in the germline VCF and gVCF file. When two ormore variants are phased together, the phasing information is encoded in a sample-level annotation,FORMAT/PS, that identifies which set the phased variant is in. The value in the field in an integerrepresenting the position of the first phased variant in the set. All records in the same contig withmatching PS values belong to the same set.

##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Physical phasing IDinformation, where each unique ID within a given sample (but notacross samples) connects records within a phasing group">

The following is an example of a DRAGEN single sample gVCF, where two SNPs are phased together.

Document # 1000000101034 v00



chr1 1947645 . C T,<NON_REF> 48.44 PASSDP=35;MQ=250.00;MQRankSum=4.983;ReadPosRankSum=3.217;FractionInformativeReads=1.000;R2_5P_bias=0.000 GT:AD:AF:DP:F1R2:F2R1:GQ:PL:SPL:ICNT:GP:PRI:SB:MB:PS0|1:20,15,0:0.429:35:9,7,0:11,8,0:47:83,0,50,572,758,622:255,0,255:19,0:4.844e+01,8.387e-05,5.300e+01,4.500e+02,4.500e+02,4.500e+02:0.00,34.77,37.77,34.77,69.54,37.77:11,9,10,5:12,8,8,7:1947645

chr1 1947648 . G A,<NON_REF> 50.00 PASSDP=36;MQ=250.00;MQRankSum=5.078;ReadPosRankSum=2.563;FractionInformativeReads=1.000;R2_5P_bias=0.000 GT:AD:AF:DP:F1R2:F2R1:GQ:PL:SPL:ICNT:GP:PRI:SB:MB:PS1|0:16,20,0:0.556:36:8,9,0:8,11,0:48:85,0,49,734,613,698:255,0,255:16,0:5.000e+01,7.067e-05,5.204e+01,4.500e+02,4.500e+02,4.500e+02:0.00,34.77,37.77,34.77,69.54,37.77:10,6,11,9:8,8,12,8:1947645

During the genotyping step, all haplotypes and all variants are considered over an active region. Foreach pair of variants, if both variants occur on all of the same haplotypes, or if either is a homozygousvariant, then they are phased together. If the variants only occur on different haplotypes, then they arephased opposite to each other. If any heterozygous variants are present on some of the samehaplotypes but not others, phasing is aborted and no phasing information is output for the active region.

Mitochondrial CallingFor the small variant calling processing of the mitochondrial chromosome, significant changes weremade starting in DRAGEN version 3.2 compared to previous versions.

In previous versions, chrM was either handled as diploid (if no sex was provided on the command line)or haploid (if sex was provided on the command line). Due to the nature of chromosome M, neither ahaploid or a diploid model is adequate because a given cell has many copies of the haploidmitochondrial chromosome and these mitochondria copies don’t share the exact same DNA sequence.Typically, there are approximately 100 mitochondria in each mammalian cell, and each mitochondrionharbors 2–10 copies of mitochondrial DNA (mtDNA). For example, if 20% of the chrM copies have avariant, then the allele frequency (AF) is 20%. This is also referred to as continuous allele frequency.The expectation is that the AF of variants on chrM is anywhere between 0% and 100%.

Starting in DRAGEN 3.2, chrM is now processed through a continuous AF pipeline, which is similar tothe somatic variant calling pipeline. In this case, a single ALT allele is considered, and the AF isestimated, and can be anywhere between 0% and 100%.

The mitochondrial chromosome processing is now more accurate than in previous versions, because inDRAGEN versions prior to 3.2, low AF calls were not output (due to expecting the AF being close to100% in a haploid model). In the current version, you will be able to see all the single ALT allele variantson the mitochondrial chromosome, across the whole range of AF (from low AF to high AF).

QUAL is not output in the chrM variant records. Instead, the confidence score is INFO/LOD.##INFO=<ID=LOD,Number=1,Type=Float,Description="Variant LOD score">

which gives the confidence that a variant is present at a given locus.

GQ is not output in the chrM variant records, because we don’t test for multiple diploid genotypecandidates. Instead, an ALT allele is considered as a candidate variant, and if FORMAT/LOD>emit_threshold (default = 4), then the FOMAT/GT is hard-coded to “0/1”, and the FORMAT/AF yields anestimate on the variant allele frequency, which range is anywhere within [0,1].

Document # 1000000101034 v00



u If FORMAT/LOD > emit_threshold, the variant is not emitted in the VCF

u If FORMAT/LOD > emit_threshold but FORMAT/LOD < lod_fstar_threshold the variant is emitted inthe VCF but FILTER=lod_fstar.

u If FORMAT/LOD > emit_threshold and FORMAT/LOD > lod_fstar_threshold the variant is emitted inthe VCF and FILTER=PASS.

The following are example VCF records on the chrM, showing one call with very high AF and anotherwith very low AF. In both cases FORMAT/LOD > emit_threshold. FORMAT/LOD is also > lod_fstarthreshold, so the FILTER annotation is PASS.

chrM 2259 . C T . PASSDP=9791;MQ=60.00;LOD=38838.40;FractionInformativeReads=0.994GT:AD:AF:F1R2:F2R1:DP:SB:MB0/1:5,9729:0.999:1,5007:4,4722:9734:3,2,4885,4844:1,4,4807,4922

chrM 16192 . C T . PASSDP=9644;MQ=60.00;LOD=26.12;FractionInformativeReads=0.992GT:AD:AF:F1R2:F2R1:DP:SB:MB0/1:9537,26:0.003:5530,16:4007,10:9563:4484,5053,13,13:4961,4576,19,

gVCF and joint VCF modeIn gVCF mode, the NON_REF regions are output alongside the variant records. The following is anexample of NON_REF and variant regions output in chrM gVCF mode:

chrM 751 . A <NON_REF> . PASS END=1437 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT0/0:6920,9:6929:99:4077:0,120,1800:0,255,255:40,4

chrM 1438 . A G,<NON_REF> . PASSDP=8500;MQ=57.39;LOD=30441.87;FractionInformativeReads=0.871GT:AD:AF:F1R2:F2R1:DP:SB:MB0/1:0,7400,0:1.000:0,3765,0:0,3635,0:7400:0,0,3994,3406:0,0,3633,3767

chrM 1439 . A <NON_REF> . PASS END=2258 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT0/0:6120,10:6130:99:4190:0,120,1800:0,255,255:40,14

FORMAT/GTIn NON_REF regions, the FORMAT/GT is hard-coded to 0/0 and at variant loci, the FORMAT/GT ishard-coded to 0/1.

The FORMAT/GT for chrM is not determined by FORMAT/AF. It is determined by whether a variant isemitted at a position or not. The decision to emit the variant or not is determined by comparing theINFO/LOD score to a threshold. All variants with FORMAT/LOD > emit_threshold get emitted. Variantswith FORMAT/LOD < emit_threshold are not emitted in the gVCF and get banded in a NON_REF regionwith FORMAT/GT=0/0.

One of the following two scenarios can occur at a given position.

u A variant is detected by the genotyper at a given position, and the FORMAT/LOD > emit_threshold.In this case, the FORMAT/GT is hard-coded to 0/1, and DRAGEN outputs FORMAT/AD,FORMAT/DP and FORMAT/AF computed at that position.

u No variant is detected at the given position, or, a variant is detected but the FORMAT/LOD < emit_threshold. In this case, the FORMAT/GT is hard-coded to 0/0, and the position gets banded withcontiguous positions if they are in the same scenario. All positions with FORMAT/GT = 0/0 getbanded together. The FORMAT/DP reported for the band is computed as the median DP across all

Document # 1000000101034 v00



positions within the band. The FORMAT/AD values reported in the band are the AD values selectedat the position where FORMAT/DP = median DP. In the joint VCF, FORMAT/AF is computed basedon FORMAT/AD.

The following are examples of variant record on chrM in a trio joint VCF.

The first sample shows a variant, but FORMAT/FT is lod_fstar because FORMAT/LOD < lod_fstar_threshold.

The second sample does not have a variant and the FORMAT/AF=0.

The third sample does not have a variant even though the FORMAT/AF>0, this means that that positionfor the sample is in a NON_REF region over which no variant was detected with sufficient confidence.

chrM 2622 . G A . . DP=11199;MQ=59.73 GT:AD:AF:DP:FT:LOD:F1R2:F2R10/1:3375,8:0.002:3383:lod_fstar:4.4:1689,2:1686,60/0:5629,1:0:5118:PASS:.:.:. 0/0:3505,8:0.003:2645:PASS:.:.:.

Somatic modeThe DRAGEN Somatic Pipeline allows ultrarapid analysis of NGS data to identify cancer-associatedmutations. DRAGEN calls SNVs and indels from both matched tumor-normal pairs and tumor-onlysamples.

For the tumor-normal pipeline, both samples are analyzed jointly such that germline variants areexcluded, generating an output specific to tumor mutations. The tumor-only pipeline produces a VCF filethat can be further analyzed to identify tumor mutations. Both pipelines make no ploidy assumptions,enabling detection of low-frequency alleles.

The output, after multiple filtering steps, is in the form of a VCF file. Variants that fail the filtering stepsare kept in the output VCF, with a FILTER annotation indicating which filtering steps have failed.

Somatic ModeOptionsSomatic mode has the following command line options:

u --tumor-fastq1 and --tumor-fastq2

The --tumor-fastq1 and --tumor-fastq2 options are used to input a pair of FASTQ files into themapper aligner and somatic variant caller. These options can be used with OTHER FASTQ optionsto run in tumor-normal mode. For example:

dragen -f -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149--tumor-fastq1 <TUMOR_FASTQ1>--tumor-fastq2 <TUMOR_FASTQ2> -1 <NORMAL_FASTQ1> \-2 <NORMAL_FASTQ2>--enable-variant-caller true --vc-sample-name RGSM--output-directory /staging/examples/ \--output-file-prefix SRA056922_30x_e10_50M

u --tumor-fastq-list

The --tumor-fastq-list option is used to input a list of FASTQ files into the mapper aligner andsomatic variant caller. This option can be used with other FASTQ options to run in tumor-normalmode. For example:

dragen -f \-r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \--tumor-fastq-list <TUMOR_FASTQ_LIST> \--fastq-list <NORMAL_FASTQ_LIST> \

Document # 1000000101034 v00



--enable-variant-caller true \--output-directory /staging/examples/ \--output-file-prefix SRA056922_30x_e10_50M

u --tumor-bam-input

The --tumor-bam-input option is used to input a mapped BAM file into the somatic variant caller. Thisoption can be used with other BAM options to run in tumor-normal mode.

u --vc-min-tumor-read-qual

The --vc-min-tumor-read-qual option specifies the minimum read quality (MAPQ) to be considered forvariant calling. The default value is 20.

Post Somatic Calling FilteringThe following options are available for post somatic calling filtering:

u --vc-tlod-filter-threshold

Set the TLOD filter threshold. The default is 6.5.

u --vc-enable-clustered-events-filter

Enables the clustered events filter. The default is true.

u --vc-enable-triallelic-filter

Enables the multi-allelic filter. The default is true.

Somatic Mode Filter ID Description

Tumor-Only &Tumor-Normal

clustered_events Clustered events (≥ 3) were observed, in a given active region.


t_lod Variant does not meet likelihood threshold (t_lod < 6.5).


multiallelic Site filtered if there are two or more alt alleles at this location in thetumor.


str_contraction Suspected PCR error where the alt allele is one repeat unit less thanthe reference. i.e. ACTACTACT-> ACTACT. Enabled only whenusing --vc-enable-gatk-acceleration=true.


base_quality Median base quality of alt reads at this locus is < 20.


mapping_quality Median mapping quality of alt reads at this locus is < 30.


fragment_length Absolute difference between the median fragment length of alt readsand median fragment length of ref reads at a given locus > 10000.


read_position Median of distances between start/end of read and a given locus > 5(the variant is too close to edge of all the reads)


panel_of_normals Seen in at least one sample in the panel of normal VCF.

Document # 1000000101034 v00



Post Somatic Calling Filtering in The Presence of aMatched Normal ControlSample

SomaticMode

Filter ID Description

Tumor-Normal

germline_risk Likelihood of allele being present in normal > 0.025.

Tumor-Normal

artifact_in_normal TLOD of the normal read set (Normal artifact LOD) is> 0.0. Not called normal artifact if allele fraction innormal is much smaller than allele fraction in tumor(normalAlleleFraction < (0.1 * tumorAlleleFraction).

gVCF, Combine gVCF, and Joint VCFJoint calling is a mode in which an individual's genotype sensitivity and specificity can be improved byusing information from a related cohort.

gVCF, combine gVCF, and joint calling are implemented in a multistep approach First, a gVCF file iscreated for each individual. The gVCF is an enriched VCF that contains information not only at thepositions where variants were detected, but also at positions that are homozygous reference. However,the gVCF may have discontinuity, ie, there may be gaps between regions reported in the gVCF.Additional information is available that indicates how well the evidence (reads) supports the absence ofvariants (reference) or alternative alleles.

The DRAGEN gVCF is not necessarily continuous, ie, it may contain gaps, which correspond to regionsthat are non-callable, and those regions are omitted from the gVCF output.

A region is non-callable if it does not meet the following criterion:

u At least 1 read is mapped with MAPQ>0.

Examples of non-callable regions:

u No reads are mapped to the region.

u More than 1 read are mapped to the regions, but all reads have MAPQ=0.

DRAGEN identifies callable regions upstream from the variant caller, so any gaps in the callable regionsremain a gap in the gVCF because the non-callable regions are not passed down through the VC.

In joint calling, if a position is in a gVCF gap for some samples, and in a homref block for othersamples, then the position in the joint gVCF is reported as a homref block, but the samples with gapshave their format field as “./.:.:.” all dots, ie, no call. That position is not reported in the joint VCFbecause there are no variant calls for any of the individuals.

The following is an example of a position from the joint VCF where one individual has a variant and theother two samples are in a gVCF gap at that same position (their FORMAT field is all dots):

1 605262 . G A 13.41 DRAGENHardQUALAC=2;AF=1.000;AN=2;DP=2;FS=0.000;MQ=14.00;QD=6.70;SOR=0.693GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP ./.:.:.:.:.:LowDepth1/1:0,2:1.000:2:4:PASS:0,0:0,2:50,6,0:1.383e+01,4.943e+00,1.951e+00./.:.:.:.:.:LowDepth

For the next step, you can choose one of the following options:

Document # 1000000101034 v00



u Combine gVCF, which takes multiple gVCF files as input and produces one combined gVCF file thatrepresents all the input samples. The combine gVCF file can then be sent to the joint caller.

u Run joint caller directly with multiple gVCF files as input. The gVCF files are jointly genotyped togenerate a single VCF. The joint VCF contains multiple genotype columns, each corresponding to asample in the cohort.

gVCFOptionsIn addition to the standard parameters for the variant caller stage of the DRAGEN host software, thefollowing parameters are available for gVCF generation:

u --vc-emit-ref-confidence

To enable base pair resolution gVCF generation, set the --vc-emit-ref-confidence option to BP_RESOLUTION. To enable banded gVCF generation, set to GVCF.

u --vc-gvcf-gq-bands

The --vc-gvcf-gq-bands option is optional and is used to define GQ bands for gVCF output. Thedefault value is 10,20,30,40,60,80.

u --vc-max-alternate-alleles

The --vc-max-alternate-alleles option specifies the maximum number of ALT alleles that are output ina VCF or gVCF. The default value is 6.

Combine gVCFOptionsCombine gVCF can be called after the gVCF files are created. The Combine gVCF tool is used to mergeinput gVCF files to generate a single gVCF file that represents all the input gVCF files.

The following options are required by DRAGEN host software for combine gVCF.

u --enable-combinegvcfs

To enable a combine gVCF run, set to true.

u --output-directory

Specifies the output directory.

u --output-file-prefix

Specifies the prefix used to label all output files for the run.

u -r

Specifies the directory in which the hash table resides.

u --variant, --variant-list

Specifies the path to a single gVCF file. Multiple --variant options can be used on the command line,one for each gVCF. Up to 500 gVCFs are supported. The --variant-list option can be used to providethe path to a file that contains a list of input gVCF files that need to be combined, one variant filepath per line.

Joint CallingJoint calling can be called after the required gVCF files have been created. The following options arerequired by the DRAGEN host software during joint calling.

u --enable-joint-genotyping

To enable a combine gVCF run, set to true.

Document # 1000000101034 v00




Specifies the output directory.


Specifies the prefix used to label all output files for the run.

u -r

Specifies the directory in which the hash table resides.

u --variant, or --variant-list

Specifies the path to a single gVCF file. Multiple --variant options can be used on the command line,one for each gVCF. Up to 500 gVCFs are supported. You can use the --variant-list option to specifya file that contains a list of input gVCF files that need to be combined, one variant file path per line.

The following options are optional:

u --pedigree-file

The --pedigree-file option specifies the path to a PED pedigree file containing a structureddescription of the familial relationships between samples. Using this option allows the joint caller toincorporate pedigree information in the analysis. Currently, only pedigree files that contain trios(mother, father, child) are supported.

u --enable-multi-sample-gvcf

To have the joint caller generate a multisample gVCF file, set --enable-multi-sample-gvcf to true.This option requires a combined gVCF file as input.

File to Generate population joint-called multi-samplegVCF

family joint-calledmulti-sample gVCF

population joint-called multi-sampleVCF

family joint-calledmulti-sample VCF

Input File multi-samplecombined gVCF file

multi-samplecombined gVCF file

multi-samplecombined gVCF fileor X individual gVCFfiles

multi-samplecombined gVCF fileor X individual gVCFfiles

Use Pedigree File no yes no yes

Command-lineOptions

--enable-joint-genotyping true --enable-multi-sample-gvcf=TRUE

--enable-joint-genotyping true --enable-multi-sample-gvcf=TRUE --pedigree-file file.ped

--enable-joint-genotyping true

--enable-joint-genotyping true --pedigree-file file.ped

Table 2 Joint-calling Modes, Input Files, and Command-line Options

In the joint VCF output records, samples that have FORMAT GT=”0/0” do not have a variant at thatposition, and therefore may be banded together with other adjacent positions that are also homozygousreference (or have too little evidence for a variant to be called).

For those positions that are banded, the FORMAT/AD, FORMAT/AF, and FORMAT/DP have specialmeaning, as follows:

u The FORMAT/DP reported for the band is computed as the minimum DP across all positions withinthe band.

u The FORMAT/AD values reported in the band are the AD values selected at the position in the bandwhere DP = median DP.

u The FORMAT/AF is computed based on FORMAT/AD.

For example:

Document # 1000000101034 v00



chr1 10230 . AC A 66.08 PASSAC=2;AF=0.333;AN=6;DP=767;FS=2.949;MQ=30.17;MQRankSum=-

0.808;QD=0.19;ReadPosRankSum=-1.942;SOR=0.866GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GL:GP0/1:108,65:0.376:173:19:PASS:63,35:45,30:50,0,45:-5.006,0,-4.51:1.911e+01,5.366e-02,4.815e+010/1:114,70:0.380:184:18:PASS:68,35:46,35:48,0,45:-4.825,0,-4.486:1.831e+01,6.462e-02,4.792e+010/0:223,74:0.249:297:0:LowGQ:.:.:0,0,3318:.:.

DeNovo Joint CallingWhen a pedigree file is used on the DRAGEN joint calling command line (--pedigree-file file.ped),indicating the familial relationships between samples, DRAGEN identifies all variants in the Probandthat are in Mendelian conflict and computes an associated DeNovo Quality (DQ) score, whichcorresponds to the probability of the variant being a DeNovo mutation. The higher the score value, themore likely the variant is DeNovo.

The DeNovo information is output in a slightly different format between the prefilter joint VCF and thepostfilter joint VCF.

Prefilter Joint VCFIn the prefilter joint VCF, all variants in the Proband that are in Mendelian conflict are marked withFORMAT/DN as DeNovo, and all have a FORMAT/DQ score printed. All variants in the Proband thatare consistent with inheritance from parents are marked with FORMAT/DN as Inherited.

The FORMAT/FT field is also printed as PASS or other annotation, depending on whether the single-sample variant is PASS or not. The FILTER status is marked as ‘.’.

The following examples show a variant record taken out of the joint prefilter VCF, when called with apedigree file, with the proband variant marked as DeNovo. In this case, both parents are highconfidence homozygous reference, and the child is heterozygous ALT. FORMAT order isproband/father/mother.

chr1 861154 . T C 27.62 .AC=1;AF=0.167;AN=6;DP=61;FS=0.000;MQ=41.44;MQRankSum=-

0.617;QD=0.84;ReadPosRankSum=3.574;SOR=0.850GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GL:GP:PP:DQ:DN0/1:21,12:0.364:33:31:PASS:11,5:10,7:66,0,45:-6.635,0,-4.45:3.159e+01,3.092e-

03,4.750e+01:24,0,83:6.8575e-04:DeNovo0/0:13,0:0.000:11:30:PASS:.:.:0,30,450:.:.:0,26,2530/0:16,1:0.059:17:6:PASS:.:.:0,6,592:.:.:22,0,254

chr1 234710899 . T C 44.74 .

AC=1;AF=0.167;AN=6;DP=73;FS=4.720;MQ=250.00;MQRankSum=5.310;QD=1.15;ReadPosRankSum=1.366;SOR=0.251

GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GL:GP:PP:DQ:DN0/1:21,18:0.462:39:48:PASS:14,10:7,8:84,0,50:-8.427,0,-5:4.950e+01,7.041e-05,5.300e+01:15,0,120:3.2280e-01:DeNovo

Document # 1000000101034 v00



0/0:13,0:0.000:11:30:PASS:.:.:0,30,450:.:.:10,0,2270/0:25,0:0.000:22:60:PASS:.:.:0,60,899:.:.:0,33,227

Postfilter Joint VCFIn the postfilter joint VCF, all variants in the Proband that are in Mendelian conflict have a FORMAT/DQscore printed, and those that have FORMAT/DQ>= DQ threshold are marked with FORMAT/DN asDeNovo. Those that have FORMAT/DQ<DQ threshold are marked with FORMAT/DN as LowDQ. Allvariants in the Proband that are consistent with inheritance from parents are marked with FORMAT/DNas Inherited.

The FORMAT/FT field is also printed as PASS or other annotation, depending on whether the single-sample variant is PASS or not. The FILTER status is marked as PASS or other annotation depending onwhether the joint QUAL is > a threshold or not.

Taking the same examples given for the prefilter joint VCF, the following are the corresponding recordstaken out of the joint postfilter VCF. In this case, the first record is no longer marked as DeNovo, but ismarked as LowDQ, because FORMAT/DQ is below the threshold. FORMAT order isproband/father/mother.

chr1 861154 . T C 27.62 DRAGENHardQUALAC=1;AF=0.167;AN=6;DP=61;FS=0.000;MQ=41.44;MQRankSum=-

0.617;QD=0.84;ReadPosRankSum=3.574;SOR=0.850GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GL:GP:PP:DQ:DN0/1:21,12:0.364:33:31:PASS:11,5:10,7:66,0,45:-6.635,0,-4.45:3.159e+01,3.092e-

03,4.750e+01:24,0,83:6.8575e-04:LOWDQ0/0:13,0:0.000:11:30:PASS:.:.:0,30,450:.:.:0,26,2530/0:16,1:0.059:17:6:PASS:.:.:0,6,592:.:.:22,0,254

chr1 234710899 . T C 44.74 PASS

AC=1;AF=0.167;AN=6;DP=73;FS=4.720;MQ=250.00;MQRankSum=5.310;QD=1.15;ReadPosRankSum=1.366;SOR=0.251

GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GL:GP:PP:DQ:DN0/1:21,18:0.462:39:48:PASS:14,10:7,8:84,0,50:-8.427,0,-5:4.950e+01,7.041e-

05,5.300e+01:15,0,120:3.2280e-01:DeNovo0/0:13,0:0.000:11:30:PASS:.:.:0,30,450:.:.:10,0,2270/0:25,0:0.000:22:60:PASS:.:.:0,60,899:.:.:0,33,227

DeNovo MetricsIn the VC metrics report, the number of DeNovo variants (SNP and INDEL) is counted and reported, forall DeNovo variants above a default DQ threshold. You can specify these thresholds by using thefollowing options:

u --qc-snp-DeNovo-quality-threshold <value>, default is 0.05

u --qc-indel-DeNovo-quality-threshold <value>, default is 0.02

Document # 1000000101034 v00



These thresholds affect the DeNovo counts reported in the VC metrics, as well as the content of thejoint postfilter VCF VCF (only those variants in the Proband that are in Mendelian conflict withFORMAT/DQ> DQ threshold are marked with FORMAT/DN as DeNovo). The default threshold valueswere chosen to achieve a good tradeoff between specificity and sensitivity but can be modified asneeded.

In the scenario where there are multiple trios specified in the pedigree file (eg, multi-generationpedigree), DRAGEN automatically detects the trios and assesses the DeNovo variants on the probandsample of each trio.

Germline Variant Hard FilteringDRAGEN provides post-VCF variant filtering based on annotations present in the VCF records. Defaultand non-default variant hard filtering are described below. However, due to the nature of DRAGEN’salgorithms, which incorporate the hypothesis of correlated errors from within the core of variant caller,the pipeline has improved capabilities in distinguishing the true variants from noise, and therefore thedependency on post-VCF filtering is substantially reduced. For this reason, the default post-VCF filteringin DRAGEN is very simple.

Default Variant Hard FilteringThe default filters in the germline pipeline are as follows:

u ##FILTER=<ID=DRAGENHardQUAL,Description="Set if true:QUAL < 10.4139">

u ##FILTER=<ID=PloidyConflict,Description="Genotype call from variant caller not consistent withchromosome ploidy">

u ##FILTER=<ID=lod_fstar,Description="Variant does not meet likelihood threshold (default thresholdis 6.3)">

u DRAGENHardQUAL: For all contigs other than the mitochondrial contig, the default hard filteringconsists of thresholding the QUAL value only.

u lod_fstar: For the mitochondrial contig, the default hard filtering consists of thresholding the LODscore only.

u PloidyConflict: This filter is applied to all variant calls on chrY of a female subject, if female isspecified on the DRAGEN command line.

Non-Default Variant Hard FilteringDRAGEN supports basic filtering of variant calls as described in the VCF standard. You can apply anynumber of filters with the --vc-hard-filter option, which takes a semicolon-delimited list of expressions,as follows:

<filter ID>:<snp|indel|all>:<list of criteria>,

where the list of criteria is itself a list of expressions, delimited by the || (OR) operator in this format:

<annotation ID> <comparison operator> <value>

The meaning of these expression elements is as follows:

u filterID—The name of the filter, which is entered in the FILTER column of the VCF file for calls thatare filtered by that expression.

u snp/indel/all—The subset of variant calls to which the expression should be applied.

u annotation ID—The variant call record annotation for which values should be checked for the filter.Supported annotations include FS, MQ, MQRankSum, QD, and ReadPosRankSum.

Document # 1000000101034 v00



u comparison operator—The numeric comparison operator to use for comparing to the specified filtervalue. Supported operators include <, ≤, =, ≠, ≥, and >.

For example, the following expression would mark with the label "SNP filter" any SNPs with FS < 2.1 orwith MQ < 100, and would mark with "indel filter" any records with FS < 2.2 or with MQ < 110:

--vc-hard-filter=”SNP filter:snp:FS < 2.1 || MQ < 100; indelfilter:indel:FS < 2.2 || MQ < 110”

This example is for illustration purposes only and is NOT recommended for use with DRAGEN V3output. Illumina recommends using the default hard filters.

The only supported operation for combining value comparisons is OR, and there is no support forarithmetic combinations of multiple annotations. More complex expressions may be supported in thefuture.

Orientation Bias FilterThe orientation bias filter is designed to reduce noise typically associated with the following:

u Pre-adapter artifacts introduced during genomic library preparation (eg, a combination of heat,shearing, and metal contaminates can result in the 8-oxoguanine base pairing with either cytosine oradenine, ultimately leading to G→T transversion mutations during PCR amplification), or

u FFPE (formalin-fixed paraffin-embedded) artifact. FFPE artifacts stem from formaldehydedeamination of cytosines, which results in C to T transition mutations.

The orientation bias filter can only be used on somatic pipelines. To enable the filter, set the --vc-enable-orientation-bias-filter option to true. The default is false.

The artifact type to be filtered can be specified with the --vc-orientation-bias-filter-artifacts option. Thedefault is C/T,G/T, which correspond to OxoG and FFPE artifacts. Valid values include C/T, or G/T, orC/T,G/T,C/A.

An artifact (or an artifact and its reverse compliment) cannot be listed twice. For example, C/T,G/A isnot valid, because C→G and T→A are reverse compliments.

dbSNPAnnotationIn Germline, Tumor-Normal somatic, or Tumor-Only somatic modes, DRAGEN can look up variant callsin a dbSNP database and add annotations for any matches that it finds there. To enable the dbSNPdatabase search, set the --dbsnp option to the full path to the dbSNP database VCF or .vcf.gz file,which must be sorted in reference order.

For each variant call in the output VCF, if the call matches a database entry for CHROM, POS, REF,and at least one ALT, then the rsID for the matching database entry is copied to the ID column for thatcall in the output VCF. In addition, DRAGEN adds a DB annotation to the INFO field for calls that arefound in the database.

DRAGEN matches variant calls based on the name of the reference sequence/contig, but there is noadditional way to assert that the reference used for constructing the dbSNP is the same as thereference used for alignment and variant calling. Make sure that the contigs in the selected annotationdatabase match those in the alignment/variant calling reference.

Document # 1000000101034 v00



Panel of Normals VCFWhen DRAGEN is used in the Tumor-Normal or Tumor-Only somatic mode, it looks up variant calls in apanel of normals (PON) VCF. The PON VCF must be generated ahead of time and represents a set ofvariants that were detected by the DRAGEN Somatic Pipeline when run on a set of normal samples(which are not necessarily matched to the subject from whom the tumor sample was taken). Ideallythough, the PON VCF should come from normal samples collected on the same library prep/sequencinginstrument, so that if there are systematic errors that occur during sequencing/library prep, they getcaptured in the PON VCF.

u --panel-of-normals

Specifies a PON VCF file. When a PON VCF file is used as input, if a somatic variant is found in atleast one sample in the file (which may contain several dozen samples), it is marked as panel_of_normals in the FILTER column of the output VCF.

AutogeneratedMD5SUM for VCF FilesAn MD5SUM file is generated automatically for VCF output files. This file is in the same outputdirectory and has the same name as the VCF output file, but with an .md5sum extension appended. Forexample, whole_genome_run_123.vcf.md5sum. The MD5SUM files is a single-line text file that containsthe md5sum of the VCF output file. This md5sum exactly matches the output of the Linux md5sumcommand.

Copy Number Variant CallingThe DRAGEN Copy Number Variant (CNV) pipeline can call CNV events using next-generationsequencing (NGS) data. This pipeline supports multiple applications in a single interface via theDRAGEN Host Software, including processing of whole-genome sequencing data and whole-exomesequencing data for germline analysis.

The DRAGEN CNV pipeline supports two normalization modes of operation. The two modes applydifferent normalization techniques to handle biases that differ based on the application, for example,WGS versus WES. While the default option settings attempt to provide the best trade-off in terms ofspeed and accuracy, a specific workflow may require more finely tuned option settings.

CNVWorkflowThe DRAGEN CNV pipeline follows the workflow shown in the following figure.

Figure 3 DRAGEN CNV Pipeline Workflow

Document # 1000000101034 v00



This pipeline uses many aspects of the DRAGEN platform available in other pipelines, such ashardware acceleration and efficient I/O processing. To enable CNV processing in the DRAGEN HostSoftware, set the --enable-cnv command line option to true.

The CNV pipeline has the following processing modules:

u Target Counts—binning of the read counts and other signals from alignments.

u Bias Correction—correction of intrinsic system biases.

u Normalization—detection of normal ploidy levels and normalization of the case sample.

u Segmentation—breakpoint detection via segmentation of the normalized signal.

u Calling / Genotyping—thresholding, scoring, qualifying, and filtering of putative events as copynumber variants.

The normalization module can optionally take in a panel of normals (PoN), which is used when a cohortor population samples are readily available. All other modules are shared between the different CNValgorithms.

Signal FlowAnalysisThe following figures show a high-level overview of the steps in the DRAGEN CNV Pipeline as thesignal traverses through the various stages. These figures are examples and are not identical to theplots that are generated from the DRAGEN CNV Pipeline.

The first step in the DRAGEN CNV Pipeline is the target counts stage, which extracts signals such asread count and improper pairs and puts them into target intervals.

Figure 4 Read Count Signal

Figure 5 Improper Pairs Signal

Next, the case sample is normalized against the panel of normals, or against the estimated normalploidy level, and any other biases are subtracted out of the signal to amplify any event level signals.

Document # 1000000101034 v00



Figure 6 Pre/Post Tangent Normalization

The normalized signal is then segmented using one of the available segmentation algorithms, andevents are called from the segments.

Figure 7 Segments

Figure 8 Called Events

The events are then scored and emitted in the output VCF.

CNVPipeline OptionsThe following are the top-level options that are shared with the DRAGEN Host Software to control theCNV pipeline. The input into the DRAGEN CNV can be a BAM or CRAM file. If you are using theDRAGEN mapper and aligner, FASTQ files can be used.

u --bam-input—The BAM file to be processed.

u --cram-input—The CRAM file to be processed.

u --enable-cnv—Enable or disable CNV processing. Set to true to enable CNV processing.

u --enable-map-align—Enables the mapper and aligner module. The default is true, so all input readsare remapped and aligned unless this option is set to false.

u --fastq-file1, --fastq-file2—FASTQ file, or files, to be processed.

u --output-directory—Output directory where all results are stored.

u --output-file-prefix—Output file prefix that will be prepended to all result file names.

u --ref-dir—The DRAGEN reference genome hashtable directory.

Document # 1000000101034 v00



CNVPipeline InputThe DRAGEN CNV pipeline supports multiple input formats. The most common format is an alreadymapped and aligned BAM or CRAM file. If you have data that has not yet been mapped and aligned, seeGenerate an Alignment File on page 43.

To run the DRAGEN CNV Pipeline directly with FASTQ input without generating a BAM or CRAM file,then see Streaming Alignments on page 44, which outlines steps for streaming alignment recordsdirectly from the DRAGEN map/align stage.

Reference HashtableFor the DRAGEN CNV pipeline, the hashtable must be generated with the --enable-cnv option set totrue, in addition to any other options required by other pipelines. When --enable-cnv is true, dragengenerates an additional kmer uniqueness map that the CNV algorithm uses to counteract mapabilitybiases. The kmer uniqueness map file only needs to be generated once per reference hashtable andtakes about 1.5 hours per whole human genome.

The reference hashtable is a pregenerated binary representation of the reference genome. Forinformation on generating a hashtable, see Preparing a Reference Genome on page 100.

The following is an example of a command to generate a hashtable.dragen \

--build-hash-table true \--ht-reference $FASTA \--output-directory $OUTPUT \--enable-cnv true \--enable-rna true

Generate an Alignment FileThe following command line examples show how to run the DRAGEN map/align pipeline depending onyour input type. The map/align pipeline generates an alignment file in the form of a BAM or CRAM filethat can then be used in the DRAGEN CNV Pipeline.

You need to generate alignment files for all samples that have not already been mapped and aligned,including any samples to be used as references for normalization. Each sample must have a uniquesample identifier, which is specified with the --RGSM option. For BAM and CRAM input files, the sampleidentifier is taken from the file, so the --RGSM option is not required.

Example command to map/align a FASTQ file:dragen \

-r $HASHTABLE \-1 $FASTQ1 \-2 $FASTQ2 \--RGSM $SAMPLE \--RGID $RGID \--output-directory $OUTPUT \--output-file-prefix $SAMPLE \--enable-map-align true

Example command to map/align an existing BAM file:

Document # 1000000101034 v00



dragen \-r $HASHTABLE \--bam-input $BAM \--output-directory $OUTPUT \--output-file-prefix $SAMPLE \--enable-map-align true

Example command to map/align an existing CRAM file:dragen \

-r $HASHTABLE \--cram-input $CRAM \--output-directory $OUTPUT \--output-file-prefix $SAMPLE \--enable-map-align true

Streaming AlignmentsDRAGEN can map and align FASTQ samples and then directly stream them to downstream callers.Examples of downstream callers include the CNV Caller and the Haplotype Variant Caller. This processallows you to skip generation of a BAM or CRAM file, bypassing the need to store additional files.

To stream alignments directly to the DRAGEN CNV pipeline, run the FASTQ sample through a regularDRAGEN map/align workflow, and then provide additional arguments to enable CNV. The followingshows an example command line to map/align a FASTQ file and send it to the CNV pipeline.

dragen \-r $HASHTABLE \-1 $FASTQ1 \-2 $FASTQ2 \--RGSM $SAMPLE \--RGID $RGID \--output-directory $OUTPUT \--output-file-prefix $SAMPLE \--enable-map-align true \--enable-cnv true \--cnv-enable-self-normalization true

For information on running CNV concurrently with the Haplotype Variant Caller, see Concurrent CNVand Haplotype Variant Calling on page 55.

Target CountsThe target counts stage is the first processing stage for the DRAGEN CNV pipeline. This step bins thealignments into intervals. The primary analysis format for CNV processing is the target counts file,which contains the feature signals that are extracted from the alignments to be used in downstreamprocessing. The binning strategy, interval sizes, and their boundaries are controlled by the target countsgeneration options, and the normalization technique used.

When working with whole genome sequence data, the intervals are autogenerated from the referencehashtable. Only the primary contigs from the reference hashtable are considered for binning. You canspecify additional contigs to bypass with the --cnv-wgs-skip-contig-list option.

With whole exome sequence data, the target BED file supplied with the --cnv-target-bed option is usedto determine the intervals for analysis.

Document # 1000000101034 v00



The targets counts stage generates a .target.counts file, which can be later used in place of any BAM orCRAM by specifying it with the --cnv-input option for the normalization stage. The .target.counts file isan intermediate file for the DRAGEN CNV pipeline and should not be modified.

The .target.counts file is a tab-delimited text file with the following columns:

u Contig identifier

u Start position

u End position

u Target interval name

u Count of alignments in this interval

u Count of improperly paired alignments in this interval

An example of a *.target.counts file is shown below.

contig start stop name SampleName improper_pairs1 565480 565959 target-wgs-1-565480 7 61 566837 567182 target-wgs-1-566837 9 01 713984 714455 target-wgs-1-713984 34 41 721116 721593 target-wgs-1-721116 47 11 724219 724547 target-wgs-1-724219 24 211 725166 725544 target-wgs-1-725166 43 121 726381 726817 target-wgs-1-726381 47 141 753243 753655 target-wgs-1-753243 31 21 754322 754594 target-wgs-1-754322 27 01 754594 755052 target-wgs-1-754594 41 0

Whole GenomeIf the samples are whole genome, then the effective target intervals width is specified with the --cnv-wgs-interval-width option. The higher the coverage of a sample, the higher the resolution that can bedetected. This option is important when running with a panel of normal because all the samples musthave matching intervals. For self normalization, the effective width may be larger than the specifiedvalue.

WGS Coverage Recommended Resolution*

5x 1000 bp

10x 1000 bp

20x 500 bp

30x 250 bp

50x 250 bp

*You can choose to trade off between resolution and speed.

The intervals are autogenerated for every contig in the reference. You can specify a list of contigs toskip by using the --cnv-wgs-skip-contig-list option. This option takes comma-separateded list of contigidentifiers. The contig identifiers must match the reference hashtable that you are using. By default, onlythe mitochondrial chromosomes are skipped. Nonprimary contigs are never processed.

For example, to skip chromosome M, X, and Y, use the following option:--cnv-wgs-skip-contig-list "chrM,chrX,chrY"

Document # 1000000101034 v00



Whole ExomeIf the samples are whole exome samples, a target BED file should be supplied with the --cnv-target-bed$TARGET_BED option.

The target BED file requires a header line and a fourth column, which indicates the target name. Thefollowing is a simple awk command that translates an input target BED file (without any headers orcomment lines) to one that is suitable for CNV.

cat <(echo -e "contig\tstart\tstop\tname") \

<(awk '{print $1"\t"$2"\t"$3"\ttarget-"NR}' $ORIGINAL_BED)

A regular BED file without a header is also allowed, in which case the fourth column target names areauto generated by the CNV algorithm during the target counts stage. To use a standard BED file, makesure that there is no header present in the file. In this case, all columns after the third column areignored, similar to the operation of DRAGEN Variant Caller.

Target Counts OptionsThe following options control the generation of target counts.

u --cnv-counts-method—Specifies the counting method for an alignment to be counted in a target bin.Values are midpoint, start, or overlap. The default value is overlap when using the panel of normalapproach, which means if an alignment overlaps any part of the target bin, it is counted for that bin.In the self normalization mode, the default counting method is start.

u --cnv-enable-split-intervals—When this option is set to true, splits up all target BED intervals into twoequally spaced intervals. When this option is enabled, then all samples (case and panel of normals)must be executed with this option enabled. The default is false.

u --cnv-min-mapq—Specifies the minimum MAPQ for a read to be counted during target countsgeneration. Note that MAPQ 0 reads are always counted, independent of this setting. The defaultvalue is 20.

u --cnv-target-bed—Specifies a properly formatted BED file that indicates the target intervals to samplecoverage over. For use in WES analysis.

u --cnv-wgs-interval-width—Specifies the width of the sampling interval for CNV WGS processing. Thisoption controls the effective window size. The default is 1000.

u --cnv-wgs-skip-contig-list—Specifies a comma-separated list of contig identifiers to skip whengenerating intervals for WGS analysis. The default contigs that are skipped, if not specified, are"chrM,MT,m,chrm".

GCBias CorrectionBiases are introduced in a typical NGS workflow, which make calling CNV events difficult. Thesebiases can arise from library prep, capture kits, sequencer differences, and even mapping biases. TheDRAGEN CNV pipeline attempts to correct for these biases in the various processing stages.

The GC bias correction module immediately follows the target counts stage and operates on the.target.count file. GC bias correction generates a GC bias corrected version of the file, which has a.target.counts.gc-corrected extension in the file name. The GC bias corrected versions arerecommended for any downstream processing when working with WGS data. For WES, if there areenough target regions, then the GC bias corrected counts can also be used.

Document # 1000000101034 v00



Typical capture kits have over 200,000 targets spanning the regions of interest. If your BED file hasfewer than 200,000 targets, or if the target regions are localized to a specific region in the genome (suchthat GC bias statistics may be skewed), then GC bias correction should be disabled.

The following options control the GC bias correction module.

u --cnv-enable-gcbias-correction—Enable or disable GC bias correction when performing target countsgeneration. The default is true.

u --cnv-enable-gcbias-smoothing—Enable or disable smoothing the GC bias correction across adjacentGC bins with an exponential kernel. The default is true.

u --cnv-num-gc-bins—Specifies the number of bins for GC bias correction. Each bin represents the GCcontent percentage. Allowed values are 10, 20, 25, 50, or 100. The default is 25.

NormalizationThe DRAGEN CNV pipeline supports two normalization algorithms:

u Self Normalization—Uses statistics from the sample under analysis to determine the baseline ploidylevels.

u Panel of Normals—A reference-based normalization algorithm that uses additional matched normalsamples to determine a baseline level from which to call CNV events. The matched normal samplesin this case means it has undergone the same library prep and sequencing workflow as the casesample.

Which algorithm to use depends on the available data and the application. Use the following guidelinesto select the mode of normalization.

Self Normalizationu Whole genome sequencing

u Single sample analysis

u Additional matched samples are not readily available

u Simpler workflow via a single invocation

Panel of Normalsu Whole genome sequencing

u Whole exome sequencing

u Targeted panels

u Additional matched samples are available

u Tumor/Matched-Normal analysis

Self NormalizationThe DRAGEN CNV pipeline provides the self normalization mode that does not require a referencesample or a panel of normals. Enable this mode by setting --cnv-enable-self-normalization to true. Thisoperating mode bypasses the need to run two stages and can save time. It uses the statistics within thecase sample itself to determine the baseline from which to make a call.

Because self normalization uses the statistics within the case sample, this mode is not recommendedfor WES or targeted sequencing analysis due to the potential for insufficient data.

Document # 1000000101034 v00



The self normalization mode is the recommended approach for whole-genome sequencing single sampleprocessing. The pipeline continues through to the segmentation and calling stage, producing the finalcalled events.

dragen \

-r $HASHTABLE \

--bam-input $BAM \

--output-directory $OUTPUT \

--output-file-prefix $SAMPLE \

--enable-map-align false \

--enable-cnv true \

--cnv-enable-self-normalization true \

"

If you are running from a FASTQ sample, then the default mode of operation is self normalization.

When operating in self normalization mode, the --cnv-wgs-interval-width option that is used during thetarget counts stage becomes the effective interval width based on the number of unique kmer positions.You typically do not have to modify this option.

Panel of NormalsThe Panel of Normals approach uses a set of matched normal samples to determine the baseline levelfrom which to call CNV events. These matched normal samples should be derived from the same libraryprep and sequencing workflow that was used for the case sample. This allows the algorithm to subtractout system level biases that are not sample specific.

In this mode of operation, the DRAGEN CNV pipeline is broken down into two distinct stages. Thetarget counts stage is performed on each sample, case, and normals, to bin the alignments. Thenormalization and call detection stage is then performed with the case sample against the panel ofnormals to determine the events.

Target Counts StageTarget counts should be performed for all your samples, whether they are to be used as references orare the case samples under investigation. The case sample and all samples to be used as a panel ofnormals sample must have identical intervals and therefore should be generated with identical settings.The target counts stage also performs GC Bias correction, which is enabled by default.

The examples below are for WGS processing. For exome processing, see Whole Exome on page 46.

The following is an example command for processing a BAM file.dragen \

-r $HASHTABLE \--bam-input $BAM \--output-directory $OUTPUT \--output-file-prefix $SAMPLE \--enable-map-align false \--enable-cnv true \"

Document # 1000000101034 v00



The following is an example command for processing a CRAM file.dragen \

-r $HASHTABLE \--cram-input $CRAM \--output-directory $OUTPUT \--output-file-prefix $SAMPLE \--enable-map-align false \--enable-cnv true \"

Normalization and Call Detection StageThe next step in the CNV pipeline when using a panel of normals is to perform the normalization and tomake the calls. This step involves selecting a panel of normals, which is a list of target counts files tobe used for reference-based median normalization.

You can run the analysis in other workflow combinations, keeping in mind that the CNV events arecalled for the reference samples used. Ideally the panel of normals samples follow library prep andsequencing workflows that are identical to the workflows of the case sample under analysis. For callingon sex chromosomes, it is recommended that you use sex matched references in the panel. Becausethe normalization is performed on a per-target basis against the panel of normals median, having sexmatched references is critical to detecting copy number events on the sex chromosomes.

To generate a Panel of Normals (PON), create a plain text file in which each line in the file contains apath pointing to a target.counts file generated from the target counts stage. Relative paths are supportedprovided the paths are relative to the current working directory. Absolute paths are recommended incase the workflow is used later or shared with other users.

The following is an example PON file, in which uses a subset of the GC corrected files from the targetcounts stage.

/data/output_trio1/sample1.target.counts.gc-corrected






Alternatively, the files to be used in the panel of normals can be specified with the --cnv-normals-fileoption. This option takes a single file name, and can be specified multiple times.

After you have created a PON file, you can run the caller by specifying your case sample with the --cnv-input option and the PON file with the --cnv-normals-list option. Because we recommend using the GCbias corrected counts from the previous stage, there is no need to run GC bias correction again.GC bias correction can be disabled by setting --cnv-enable-gcbias-correction to false. For example:

dragen \

-r $HASHTABLE \

--output-directory $OUTPUT \

--output-file-prefix $SAMPLE \

--enable-map-align false \

--enable-cnv true \

Document # 1000000101034 v00



--cnv-input $CASE_COUNTS \

--cnv-normals-list $NORMALS \

--cnv-enable-gcbias-correction false \

"

This command normalizes the case sample against the panel of normals and then continuesdownstream to the segmentation and calling stage.

Normalization OptionsThese options control the preconditioning of the panel of normals and the normalization of the casesample.

u --cnv-enable-self-normalization—Enable/disable self normalization mode, which does not require apanel of normals.

u --cnv-extreme-percentile—Specifies the extreme median percentile value at which to filter outsamples. The default is 2.5.

u --cnv-input—Specifies a target counts file for the case sample under investigation when using a panelof normals.

u --cnv-matched-normal—Specifies the target counts file of the matched normal sample.

u --cnv-normals-file—Specifies a target.counts file to be used in the panel of normals. This option canbe specified multiple times, once for each file.

u --cnv-normals-list—Specifies a text file containing paths to the list of reference target counts files tobe used as a panel of normals. Absolute paths are recommended in case the workflow is used lateror shared with other users. Relative paths are supported provided the paths are relative to thecurrent working directory.

u --cnv-max-percent-zero-samples—Specifies a threshold for filtering out targets with too many zerocoverage samples. The default is 5%.

u --cnv-max-percent-zero-targets—Specifies a threshold for filtering out samples with too many zerocoverage targets. The default is 2.5%.

u --cnv-target-factor-threshold—Specifies a percentage of median target factor threshold to filter outuseable targets. The default is 1% for whole genome processing and 10% for targeted sequencingprocessing.

u --cnv-truncate-threshold—Specifies a percentage threshold for truncating extreme outliers. Thedefault is 0.1%.

SegmentationAfter a case sample has been normalized, it goes through a segmentation stage. There are multiplesegmentation algorithms implemented in DRAGEN, including the following:

u CBS (Circular Binary Segmentation)

u SLM (Shifting Level Models)

u FPOP (Functional Pruning Optimal Partitioning)

The SLM algorithm has two variants, SLM and HSLM. HSLM (Heterogeneous SLM) is for use in exomeanalysis and handles target capture kits that are not equally spaced.

The default segmentation algorithm in use is SLM for whole genome processing, and CBS for wholeexome processing.

Document # 1000000101034 v00



u --cnv-segmentation-mode—Specifies the segmentation algorithm to perform. Values are cbs, slm,hslm, or fpop. The default value is slm or cbs depending on whether the intervals are whole genomeintervals or targeted sequencing intervals.

u --cnv-merge-distance—Specifies the minimum number of base pairs between two segments thatwould allow them to be merged. The default value is 0, which means the segments must be directlyadjacent.

u --cnv-merge-threshold—Specifies the maximum segment mean difference at which two adjacentsegments should be merged. The segment mean is represented as a linear copy ratio value. Thedefault is 0.2. Set to 0 to disable merging.

External R PackagesTo use the SLM or FPOP segmentation algorithms, you must install external R packages. R packagesare executed externally outside of the DRAGEN Host Software. A simple installation script is packagedwith DRAGEN and can be found at /opt/edico/R/install_R_packages.R. These R Packages areadditional segmentation methods that can be used. They are not officially maintained by the DRAGENHost Software. If your system does not allow the installation of R or third party libraries, then you candefault to Circular Binary Segmentation (CBS).

Circular Binary SegmentationCircular Binary Segmentation is implemented directly in DRAGEN and is based on A faster circularbinary segmentation for the analysis of array CGH data with enhancements to improve sensitivity forNGS data.

The following options control Circular Binary Segmentation.

u --c-alpha—Specifies the significance level for the test to accept change points. The default is 0.01.

u --cnv-cbs-eta—Specifies the Type I error rate of the sequential boundary for early stopping whenusing the permutation method. The default is 0.05.

u --cnv-cbs-kmax—Specifies maximum width of smaller segment for permutation. The default is 25.

u --cnv-cbs-min-width—Specifies the minimum number of markers for a changed segment. The defaultis 2.

u --cnv-cbs-nmin—Specifies the minimum length of data for maximum statistic approximation. Thedefault is 200.

u --cnv-cbs-nperm—Specifies the number of permutations used for p-value computation. The default is10000.

u --cnv-cbs-trim—Specifies the proportion of data to be trimmed for variance calculations. The defaultis 0.025.

Shifting Level Models SegmentationThe Shifting Level Models (SLM) segmentation mode follows from the R implementation of SLMSuite: asuite of algorithms for segmenting genomic profiles.

u --cnv-slm-eta—Baseline probability that the mean process changes its value. The default is 1e-5.

u --cnv-slm-fw—Minimum number of data points for a CNV to be emitted. The default is 0, whichmeans segments with one design probe could in effect be emitted.

u --cnv-slm-omega—Scaling parameter modulating relative weight between experimental/biologicalvariance. The default is 0.3.

Document # 1000000101034 v00



https://www.ncbi.nlm.nih.gov/pubmed/17234643

https://www.ncbi.nlm.nih.gov/pubmed/17234643

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1734-5

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1734-5

u --cnv-slm-stepeta—Distance normalization parameter. The default is 10000. This option is only validfor HSLM.

Functional Pruning Optimal Partitioning SegmentationThe FPOP segmentation mode follows from the R implementation of On Optimal Multiple ChangepointAlgorithms for Large Data.

u --cnv-fpop-penalty—Penalty option for changepoint detection. The default is 0.03.

Quality ScoringQuality scores are computed using a probabilistic model that uses a mixture of heavy tailed probabilitydistributions (one per integer copy number) with a weighting for event length. Noise variance isestimated. The output VCF contains a phred-scale metric measuring confidence in called amplification(CN > 2 for diploid locus), deletion (CN < 2 for diploid locus), or copy neutral (CN=2 for diploid locus)events.

The scoring algorithm also calculates exact-copy-number quality scores that are inputs to the DeNovoCNV detection pipeline.

Output FilesThe DRAGEN host software generates many intermediate files. The final call file that contains theamplification and deletion events is the *.seg.called.merged file.

In addition to the segment file, DRAGEN emits the calls in the standard VCF format. By default, theVCF file includes only copy number gain and loss events. For copy neutral segments, refer to the*.seg.called.merged file. To have copy neutral (REF) calls included in the output VCF, set --cnv-enable-ref-calls to true.

For more information about the .seg.called.merged file, and how to use the output files to aid in debugand analysis, see Signal Flow Analysis on page 41.

CNVVCF FileThe CNV VCF file follows the standard VCF format. Due to the nature of how CNV events arerepresented versus how structural variants are represented, not all fields are applicable. In general, ifmore information is available about an event, then it is annotated. Some fields in the DRAGEN CNVVCF are unique to CNVs.

The following is an example of the header lines that are specific to CNV.

##fileformat=VCFv4.1##ALT=<ID=CNV,Description="Copy number variant region">##ALT=<ID=DEL,Description="Deletion relative to the reference">##ALT=<ID=DUP,Description="Region of elevated copy number relative to thereference">##contig=<ID=1,length=249250621>##contig=<ID=2,length=243199373>##contig=<ID=3,length=198022430>##contig=<ID=4,length=191154276>##contig=<ID=5,length=180915260>…##reference=file:///reference_genomes/Hsapiens/hs37d5/DRAGEN##INFO=<ID=REFLEN,Number=1,Type=Integer,Description="Number of REF positions

Document # 1000000101034 v00



https://arxiv.org/pdf/1409.1842.pdf

https://arxiv.org/pdf/1409.1842.pdf

included in this record">##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variantdescribed in this record">##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval aroundPOS">##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval aroundEND">##FILTER=<ID=cnvQual,Description="CNV with quality below 10">##FILTER=<ID=cnvCopyRatio,Description="CNV with copy ratio within +/- 0.2 of1.0">##FORMAT=<ID=SM,Number=1,Type=Float,Description="Linear copy ratio of the segmentmean">##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Estimated copy number">##FORMAT=<ID=BC,Number=1,Type=Integer,Description="Number of bins in the region">##FORMAT=<ID=PE,Number=2,Type=Integer,Description="Number of improperly pairedend reads at start and stop breakpoints">

The ID column is used to represent the event.

The REF column contains an N for all CNV events.

The ALT column specifies the type of CNV event. Because the VCF contains only CNV events, onlythe DEL or DUP entry is used.

The QUAL column contains an estimated quality score for the CNV event, which is used in hardfiltering.

The FILTER column contains PASS if the CNV event passes all filters, otherwise it contains the nameof the failed filter.

The INFO column contains information representing the event, mostly identical to the ID column. TheREFLEN entry indicates the length of the event. The SVTYPE entry is always CNV. The END entryindicates the end position of the event. The CIPOS and CIEND entries are currently not used.

The FORMAT fields are described in the header.

u SM—Linear copy ratio of the segment mean

u CN—Estimated copy number

u BC—Number of bins in the region

u PE—Number of improperly paired end reads at start and stop breakpoints

Visualization and BigWig FilesTo perform analysis on a known truth set, you can use the intermediate output files from the pipelinestages. These files can be parsed to aid in fine-tuning options.

All files have a structure similar to a BED file, with an optional header line.

u *.target.counts—Contains the number of read counts per target interval. This is the raw signal asextracted from the alignments of the BAM or CRAM file. The format is identical for both the casesample and any panel of normals samples. There is also a bigwig representation of atarget.counts.diploid file, which is normalized to the normal ploidy level of 2 instead of raw counts.

u *.tn.tsv—The case sample’s tangent normalized signal, per target interval. This file contains the log-normalized signal. A strong signal deviation from 0.0 indicates a potential for a CNV event.

Document # 1000000101034 v00



u *.seg.called.merged—Contains the segments produced from the segmentation algorithm.

u *.cnv.vcf—Output CNV VCF file indicating events.

To generate additional equivalent bigwig and gff files, set the --enable-cnv-tracks option to true. Thesefiles can be loaded into IGV along with other tracks that are available, such as RefSeq genes. Usingthese tracks alongside publicly available tracks allows for easier interpretation of calls. An example isshown in the following figure.

Figure 9 IGV Example

Output and Filtering OptionsThe output and filtering options control the CNV output files.

u --cnv-blacklist-bed—Specifies a BED file indicating intervals to exclude from the final output CNVVCF. If a call overlaps any interval from the blacklist BED file by at least 50%, it is suppressed.

u --cnv-enable-plots—Generate plots as part of the CNV pipeline. The default is false.

u If you perform WGS CNV analysis with high-resolution intervals (less than 1000 bp), then plotgeneration can take longer to complete. Illumina recommends that you use the default (disabled).

u --cnv-enable-ref-calls—Emit copy neutral (REF) calls in the output VCF file. The default is false.

u --cnv-enable-tracks—Generate track files that can be imported into IGV for viewing. When this optionis enabled, a *.gff file for the output variant calls is generated, as well as *.bw files for the tangentnormalized signal. The default is true.

u --cnv-filter-bin-support-ratio—Filters out a candidate event if the span of supporting bins is less thanthe specified ratio with respect to the overall event length. The default ratio is 0.2 (20% support). Asan example, if an event is called and has a length of 100,000 bp, but the target interval bins thatsupport the call only spans a total of 15,000 bp (15,000/100,000 = 0.15), then it will be filtered out.

u --cnv-filter-copy-ratio—Specifies the minimum copy ratio threshold value centered about 1.0 at whicha reported event is marked as PASS in the output VCF file. The default value is 0.2, leading to callsless than CR=0.8 or greater than CR=1.2.

u --cnv-filter-length—Specifies the minimum event length in bases at which a reported event is markedas PASS in the output VCF file. The default is 10000.

Document # 1000000101034 v00



u --cnv-filter-qual—Specifies the QUAL value at which a reported event is marked as PASS in theoutput VCF file. The default is 10.

u --cnv-min-qual—Specifies the minimum reported QUAL. The default is 3.

u --cnv-max-qual—Specifies the maximum reported QUAL. The default is 200.

u --cnv-ploidy—Specifies the normal ploidy value. This option is used only for estimation of the copynumber value emitted in the output VCF file. The default is 2.

u --cnv-qual-length-scale—Specifies the bias weighting factor to adjust QUAL estimates for segmentswith longer lengths. This is an advanced option and should not need to be modified. The default is0.9303 (2-0.1).

u --cnv-qual-noise-scale—Specifies the bias weighting factor to adjust QUAL estimates based onsample variance. This is an advanced option and should not need to be modified. The default is 1.0.

Concurrent CNV and Haplotype Variant CallingDRAGEN can perform mapping and aligning of FASTQ samples and then directly stream the data todownstream callers. A single sample can run through both the CNV and the Haplotype VC if the input isa FASTQ sample. This triggers self normalization by default.

Run the FASTQ sample through a regular DRAGEN map/align workflow, and then provide additionalarguments to either enable the CNV or VC, or both. The options that apply to CNV in the standaloneworkflows are also applicable here.

The following examples show different commands.

Map/align FASTQ with CNVdragen \

-r $HASHTABLE \-1 $FASTQ1 \-2 $FASTQ2 \--RGSM $SAMPLE \--RGID $RGID \--output-directory $OUTPUT \--output-file-prefix $SAMPLE \--enable-map-align true \--enable-cnv true \--cnv-enable-self-normalization true

Map/Align FASTQ w/ VCdragen \

-r $HASHTABLE \-1 $FASTQ1 \-2 $FASTQ2 \--RGSM $SAMPLE \--RGID $RGID \--output-directory $OUTPUT \--output-file-prefix $SAMPLE \--enable-map-align true \--enable-variant-caller true \--vc-sample-name $SAMPLE

Document # 1000000101034 v00



Map/Align FASTQ w/ CNV and VCdragen \

-r $HASHTABLE \-1 $FASTQ1 \-2 $FASTQ2 \--RGSM $SAMPLE \--RGID $RGID \--output-directory $OUTPUT \--output-file-prefix $SAMPLE \--enable-map-align true \--enable-cnv true \--cnv-enable-self-normalization true \--enable-variant-caller true \--vc-sample-name $SAMPLE \"

BAM Input to CNV and VCdragen \

-r $HASHTABLE \--bam-input $BAM \--output-directory $OUTPUT \--output-file-prefix $SAMPLE \--enable-map-align false \--enable-cnv true \--cnv-enable-self-normalization true \--enable-variant-caller true \--vc-sample-name $SAMPLE

Sample Correlation and SexGenotyperWhen running the target counts stage or the normalization stage, the DRAGEN CNV pipeline alsoprovides the following information about the samples in the run.

u A correlation metric of the read count profile between the case sample and any panel of normalssamples. A correlation metric greater than 0.90 is recommended for confident analysis, but there isno hard restriction enforced by the software.

u The predicted sex of each sample in the run. The sex is predicted based on the read countinformation in the sex chromosomes and the autosomal chromosomes. The median value for thecounts is printed to the screen for the autosomal chromosomes, the X chromosome, and the Ychromosome.

The results are printed to the screen when running the pipeline. For example:=============================================Correlation Table=============================================Correlation of case sample PlatinumGenomes_50X_NA12877 againstPlatinumGenomes_50X_NA12878: 0.984092

Sex Genotyper=============================================

Document # 1000000101034 v00



Predicted sex of samplesPlatinumGenomes_50X_NA12877: MALE XY 0.99737PlatinumGenomes_50X_NA12878: FEMALE XX 0.968929

To perform analysis on the sex chromosomes using a panel of normals, it is recommended that you usesex matched samples in the panel of normals.

You can override the sex of the sample with the –sample-sex option .

Multisample CNVCallingMultisample CNV calling is possible starting from tangent normalized counts files (*.tn.tsv) specifiedwith the --cnv-input option (one per sample). Multisample CNV analysis benefits from using jointsegmentation to increase the sensitivity of detection of copy number variable segments. For each copynumber variable segment identified, the copy number genotype of each sample is emitted in a singleVCF entry to facilitate annotation and interpretation.

The following is an example command line for running a trio analysis:

dragen \-r $HASHTABLE \--output-directory $OUTPUT \--output-file-prefix $SAMPLE \--enable-cnv true \--cnv-input $FATHER_TN_TSV \--cnv-input $MOTHER_TN_TSV \--cnv-input $PROBAND_TN_TSV \--pedigree-file $PEDIGREE_FILE

De Novo CNV Calling OptionsThe following options are used in DeNovo CNV calling:

u --cnv-input—For DeNovo CNV calling, this specifies the input tangent-normalized signal files(*.tn.tsv) from the single sample runs. This option can be specified multiple times, once for eachinput sample.

u --cnv-filter-de-novo-qual—Phred-scaled threshold at which a putative event in the proband sample ismarked as DeNovo. Default value is 0.10.

u --pedigree-file—Pedigree file specifying the relationship between the input samples.

Joint SegmentationFirst, CNV calling is performed on each sample independently. Joint segmentation then uses the copynumber variable segments from each single sample analysis to derive a set of joint copy numbervariable segments. This set of joint segments is determined simply by taking the union of all breakpointsfrom the copy number variable segments of all samples. This results in the splitting of any partiallyoverlapping segments across different samples. For example:

Document # 1000000101034 v00



Following joint segmentation, copy number calling is again performed independently on each sampleusing the joint segments. Segments can be merged as with the single sample analysis, but each jointsegment is emitted in the mutlisample VCF as a single entry. The quality score (QS in the VCF) fromthe sample's merged segment, if applicable, is used for filtering the call. Sample calls are filtered usingthe sample's FT field in the multisample VCF. The QUAL column of the multisample VCF is alwaysmissing (ie, "."). The FILTER column of the mutlisample VCF is "SampleFT" if none of the sample's FTfields are "PASS", and "PASS" if any of the sample's FT fields are "PASS".

DeNovo Calling StageA de novo event is defined as the existence of a genotype at a particular locus in a proband’s genomethat did not result from standard Mendelian inheritance from the parents. The de novo calling stageidentifies putative de novo events in the proband of each trio of a multisample analysis. In some casesthese putative de novo events may be real, but they can also arise from sequencing or analysisartifacts. Consequently, a de novo quality score is assigned to each putative de novo event and used tofilter out low-quality de novo events. Trios are specified by specifying a .ped file with the --pedigree-fileoption. Multiple trios can be specified (eg, quad analysis), and all valid trios will be processed.

For each joint segment in a trio, the de novo caller determines if there is a Mendelian inheritance conflictfor the called copy number genotypes. The CNV caller does not identify the copy number for each alleleof a given diploid segment, which means assumptions are made about the possible allelic compositionof the parent genotypes.

The assumption is that the copy number 0 allele is not present for diploid regions of a parent's genome(sex dependent) when the assigned copy number is 2 or greater. This results in simplifications, asfollows:

Parent Copy NumberGenotype

Possible Copy NumberAlleles

Assumed Possible Copy NumberAlleles

2 0/2, 1/1 1/1

3 0/3, 1/2 1/2

4 0/4, 1/3, 2/2 1/3, 2/2

N x/(N-x) for x <= N/2 x/(N-x) for 1 <= x <= N/2

The following are examples of consistent and inconsistent copy number genotypes for diploid regionsusing these assumptions:

Mother Copy Number Father Copy Number Proband Copy Number Mendelian Consistent?2 2 2 Yes

2 2 1 No

Document # 1000000101034 v00



Mother Copy Number Father Copy Number Proband Copy Number Mendelian Consistent?3 2 4 No

3 2 2 Yes

If a joint segment has a Mendelian inheritance conflict, a Phred-scaled de novo quality score (DQ field inthe VCF) is calculated using the likelihoods for each copy number state (see Quality Scoring section) ofeach sample in the trio, combined with a prior for the trio genotypes:

DQ = -10log(1-Sum over conflicting genotypes(p(CNm|data)*p(CNf|data)*p(CNp|data)*p(CNm,CNf,CNp))/Sum over all genotypes(p(CNm|data)*p(CNf|data)*p(CNp|data)*p(CNm,CNf,CNp)))

Where

u CNm = Mother copy number

u CNf = Father copy number

u CNp = Proband copy number

u p(CNm,CNf,CNp) = the prior for the trio genotype

The DN field in the VCF is used to indicate the de novo status for each segment. Possible values are:

u Inherited - the called trio genotype is consistent with Mendelian inheritance

u LowDQ - the called trio genotype is inconsistent with Mendelian inheritance and DQ is less than thede novo quality threshold (default 0.1)

u DeNovo - the called trio genotype is inconsistent with Mendelian inheritance and DQ is greater thanor equal to the de novo quality threshold (default 0.1)

Multisample CNVVCFOutputThe records in a multisample CNV VCF differ slightly from the single sample case. The majordifferences are as follows:

u The per-record entries are broken down into the segments among the union of all the input samplesbreakpoints, which means there are more entries in the overall VCF.

u The QUAL column is not used and its value is “.”. The per-sample quality is carried over into theSAMPLE columns with the QS tag.

u The FILTER column indicates PASS if any of the individual SAMPLE columns PASS. Otherwise, itindicates SampleFT.

u The per-sample annotations are carried over from their originating calls. The single sample filters areapplied at the sample level and are emitted in the FT annotation.

Additionally, if a valid pedigree is used, then de novo calling is performed, which adds the following twoannotations to the proband sample.

##FORMAT=<ID=DQ,Number=1,Type=Float,Description="De novo quality">

##FORMAT=<ID=DN,Number=1,Type=String,Description="Possible values are‘Inherited’, ‘DeNovo’ or ‘LowDQ’. Threshold for a passing de novo callis DQ > 0.100000">

Document # 1000000101034 v00



While the VCF contains many entries, due to the joint segmentation stage, the number of de novoevents can be found by extracting entries that have a DN and DQ annotation. These records are alsoextracted and are converted to GFF3 in the de novo calling case.

Repeat Expansion Detection with Expansion HunterShort tandem repeats (STRs) are regions of the genome consisting of repetitions of short DNAsegments called repeat units. STRs can expand to lengths beyond the normal range and thereby causemutations called repeat expansions. Repeat expansions are responsible for many diseases includingFragile X syndrome, amyotrophic lateral sclerosis, and Huntington's disease.

DRAGEN includes a repeat-expansion detection method called ExpansionHunter. This method works byperforming an accurate sequence-graph based realignment of reads that originate inside and around eachtarget repeat. It then genotypes the length of the repeat in each allele based on these graph alignments.More information and analysis is available in the following ExpansionHunter papers:

u ExpansionHunter (original version)

u Graph ExpansionHunter (new version integrated in DRAGEN)

It is important to note that these methods work only for whole human genome samples generated withPCR-free methods. Repeats are only genotyped if the coverage at the locus is at least 10x.

Repeat Expansion Detection OptionsTo enable DRAGEN repeat expansion detection, the following command-line options are required.

u --repeat-genotype-enable = true

u --repeat-genotype-specs=<path to spec file>

In addition the sex of the sample should be set using the --sample-sex option.

The following options are optional.

u --repeat-genotype-region-extension-length=<length of region around repeat to examine> (default1000bp)

u --repeat-genotype-min-baseq=<Minimum base quality for ‘high confidence’ bases> (default 20)

For more information on the spec file specified by --repeat-genotype-specs option, see RepeatExpansion Specification Files on page 60.

The main output of repeat expansion detection is a VCF file, containing the variants found via thisanalysis.

Repeat Expansion Specification FilesThe repeat-specification (also called variant catalog) JSON file defines the repeat regions forExpansionHunter to analyze. Default repeat-specification for some pathogenic repeats are in the/opt/edico/repeat-specs/ directory (based on the reference genome used with DRAGEN).

You can create specification files for new repeat regions by using one of the provided specification filesas a template. See the ExpansionHunter documentation for details on the format.

Document # 1000000101034 v00



Repeat Expansion Detection Output Files

VCFOutput FileThe results of repeat genotyping are output as a separate VCF file, giving the length of each allele ateach callable repeat defined in the repeat-specification catalog file. The name is<outputPrefix>.repeats.vcf (.gz).

The VCF output file begins with the following fields.

Field Description

CHROM Chromosome identifier

POS Position of the first base before the repeat region in the reference

ID Always .

REF The reference base at position POS

ALT List of repeat alleles in format <STRn> where n is the number of repeatunits

QUAL Always .

FILTER Always PASS

Table 3 Core VCF Fields

Field Description

SVTYPE Always STR

END Position of the last base of the repeat region in the reference

REF Number of repeat units spanned by the repeat in thereference

RL Reference length in bp

RU Repeat unit in the reference orientation

REPID Repeat id from the repeat-specification file

Table 4 Additional INFO Fields

Field Description

GT Genotype

SO Type of reads that support the allele; can be SPANNING, FLANKING, or INREPEAT meaning that thereads span, flank, or are fully contained in the repeat

CI Confidence interval called repeat length of each allele

AD_SP Number of spanning reads consistent with the allele

AD_FL Number of flanking reads consistent with the allele

AD_IR Number of in-repeat reads consistent with the allele

Table 5 GENOTYPE (Per Sample) Fields

For example, the following VCF entry describes the state of C9orf72 repeat in a sample with IDLP6005616-DNA_A03.

QUAL FILTER INFO FORMAT LP6005616-DNA_A03chr9 27573526 . C <STR2>,<STR349> . PASSSVTYPE=STR;END=27573544;REF=3;RL=18;RU=GGCCCC;REPID=ALS GT:SO:CN:CI:AD_SP:AD_

Document # 1000000101034 v00



FL:AD_IR1/2:SPANNING/INREPEAT:2/349:2-2/323-376:19/0:3/6:0/459

In this example, the first allele spans 2 repeat units while the second allele spans 349 repeat units. Therepeat unit is GGCCCC (RU INFO field), so the sequence of the first allele is GGCCCCGGCCCC andthe sequence of the second allele is GGCCCC x 349. The repeat spans three repeat units in thereference (REF INFO field).

The length of the short allele was estimated from spanning reads (SPANNING) while the length of theexpanded allele was estimated from in-repeat reads (INREPEAT). The confidence interval for the size ofthe expanded allele is (323,376). There are 19 spanning and 3 flanking reads consistent with the repeatallele of size 2 (that is 19 reads fully contain the repeat of size 2 and 2 flanking reads overlap at most 2repeat units). Also, there are 6 flanking and 459 in-repeat reads consistent with the repeat allele of size349.

Additional Output FilesIn addition to the main VCF output, the repeat genotyper also outputs the same information in a JSONfile for easy parsing.

The new sequence-graph alignments of reads in the targeted repeat regions are output in a BAM file.The alignment position of each read is set to the position in the repeat-region to which it realigned. Thefull alignment (CIGAR) against the sequence graph is given in the custom XG tag in the format<LocusName>,<StartPosition>,<GraphCIGAR>

StartPosition is the first alignment position of the read in the first node and GraphCIGAR describes thealignment against the graph from there. For reference see https://git.illumina.com/Bioinformatics/graph-tools/blob/master/docs/alignment.md. Quality scores in this BAM file are transformed into binaryrepresentation, one high score (40) and one low score (0), used by graph-genotyping.

SMACallingDisruption of all copies of the SMN1 gene in an individual causes spinal muscular atrophy (SMA), aprogressive disease affecting ~1 in 10,000 live births. SMN1 has a very high identity paralog, SMN2,with differs only in approximately 10 SNVs and small indels. One of these (hg19 chr5:70247773 C->T)affects splicing and largely disrupts the production of functional SMN protein from SMN2. StandardWGS analysis does not produce complete variant calling results for SMN due to this high-similarityduplication combined with common copy-number variation. However, approximately 95% of SMA casescan be detected by determining the absence of the functional C (SMN1) allele in any copy of SMN.

DRAGEN uses sequence-graph realignment (also used for repeat expansion calling) to align reads to asingle reference representing SMN1 and SMN2. In addition to the standard diploid genotype call, theprogram uses a direct statistical test to check for presence of any C allele. If no C allele is detected,the sample is called affected, otherwise unaffected.

SMA calling is only supported for human whole-genome sequencing samples with PCR-free libraries.

UsageSMA calling is implemented together with repeat expansion detection. For information on graph-alignment and options, see Repeat Expansion Detection with Expansion Hunter on page 60.

Document # 1000000101034 v00



https://git.illumina.com/Bioinformatics/graph-tools/blob/master/docs/alignment.md.

https://git.illumina.com/Bioinformatics/graph-tools/blob/master/docs/alignment.md.

SMA calling is enabled, along with repeat expansion detection, by setting the --repeat-genotype-enableoption to true. To activate SMA calling, the variant specification catalog file must include a descriptionof the targeted SMN1/2 variant. Example files are available in the /opt/edico/repeat-specs/experimentalfolder.

SMN output is included along with any targeted repeats in <outputPrefix>.repeat.vcf. SMN output isrepresented as a single SNV call at the key (splice-affecting) position in SMN1, with SMA status incustom fields:

Field Description

VARID SMN marks the SMN call.

GT Genotype call at this position using a normal (diploid) genotype model.

DST SMA status call: + indicates affected, - indicates unaffected, ? indicates uncertain,

AD Total read counts supporting the C and T allele.

RPL Log10 Likelihood ratio between the affected and unaffected models. Positive scores are in favor ofunaffected.

Table 6 SMA Result in repeat.vcf Output

Structural Variant CallingDRAGEN has integrated the methods of the Manta Structural Variant Caller based on version 1.5.1. Fordescriptions of these methods, see Manta Documentation.

Structural variants (SVs) and indels are called from mapped paired-end sequencing reads. The SV calleris optimized for analysis of germline variation in small sets of individuals and somatic variation in tumor-normal sample pairs.

The SV caller does the following:

u Discovers, assembles, and scores large-scale SVs, medium-sized indels, and large insertions withina single efficient workflow.

u Combines paired and split-read evidence during SV discovery and scoring to improve accuracy, butdoes not require split-reads or successful breakpoint assemblies to report a variant in cases wherethere is strong evidence otherwise.

u Provides scoring models for germline variants in small sets of diploid samples and somatic variantsin matched tumor-normal sample pairs.

There is experimental support for analysis of unmatched tumor samples as well. All SV and indelinferences are output in VCF 4.1 format.

Structural Variant Calling OptionsThe following command line options are supported for the Structural Variant Caller:

u --enable-sv—Enable or disable the structural variant caller. The default is false.

u --sv-reference—Specifies a reference file in FASTA format.

u --sv-call-regions-bed—Specifies a BED file containing the set of regions to call. The file can beunzipped text or Bgzip-compressed/tabix-indexed.

u --sv-region—Limit the analysis to a specified region of the genome for debugging purposes. Thisoption can be specified multiple times to build a list of regions. The value must be in the format“chr:startPos-endPos”.

Document # 1000000101034 v00



https://github.com/Illumina/manta/tree/master/docs

u --sv-exome—When set to true, sets options for WES input, which includes disabling depth filters.The default is false.

u --sv-output-contigs—Set to true to have assembled contig sequences output in a VCF file. Thedefault is false.

u --sv-quiet—Set to true to suppress log output to stderr (still writes to log file). The default is true.

Modes of OperationStructural Variant calling can run in the following modes:

u Standalone—Uses mapped BAM/CRAM input files. This mode requires the following options:u --enable-map-align=falseu --enable-sv=true

u Integrated—Automatically runs on the output of the DRAGEN mapper/aligner. This mode requires thefollowing options:u --enable-map-align=trueu --enable-sv=trueu --enable-map-align-output=trueu --output-format=bam

Structural Variant calling can be enabled along with any other caller as well.

The following is an example command line for Integrated mode:dragen -f

--sv-reference=<FASTA> \--ref-dir=<HASH_TABLE> \--enable-map-align=true \--enable-map-align-output=true \--output-format=BAM \--enable-sv=true \--output-directory=<OUT_DIR> \--output-file-prefix=<PREFIX> \-1 <FASTQ1> \-2 <FASTQ2>

The following is an example command line for Joint Diploid calling in Standalone mode:dragen -f

--sv-reference=<FASTA> \--ref-dir=<HASH_TABLE> \--bam-input=<BAM1> \--bam-input=<BAM2> \--bam-input=<BAM3> \--enable-map-align=false \--enable-sv=true \--output-directory=<OUT_DIR> \--output-file-prefix=<PREFIX>

Output FilesThe structural variants VCF output file is available in the output directory. The file is named <output-file-prefix>.sv.vcf.gz.

Document # 1000000101034 v00



The Structural Variant caller also produces also output in the <output-directory>/sv/results and <output-directory>/sv/workspace folders. The results folder contains stats files and variants. The variants areoutput in a Bgzip-compressed/tabix-indexed VCF file. The workspace folder contains logs and temporaryfiles that are used for debugging. The temporary files are automatically deleted, but can be kept bysetting the --sv-retain-temp-files option to true.

Structural Variant De NovoQuality ScoringDe novo quality scoring can be enabled for structural variant joint diploid calling, by setting --sv-denovo-scoring to true and supplying a pedigree file. This adds a FORMAT/DQ field to the output VCF file.

The following example shows a command line for enabling the de novo quality scoring for a joint diploidrun.

dragen -f--sv-reference=<FASTA> \--ref-dir=<HASH_TABLE> \--bam-input=<BAM1> \--bam-input=<BAM2> \--bam-input=<BAM3> \--enable-map-align=false \--enable-sv=true \--output-directory=<OUT_DIR> \--output-file-prefix=<PREFIX> \--sv-denovo-scoring=true \--pedigree-file=<PED_FILE>

Dragen can also be run on an existing Structural Variant output VCF containing multiple samples (ie, aTrio with a Proband and Parents) to generate a modified VCF file that contains a FORMAT/DQ field (theoriginal file is not changed).

The following example shows a command line for deriving the de novo quality score from an existing SVtrio.

dragen -f--variant=<TRIO_VCF_FILE> \--pedigree-file=<PED_FILE> \--enable-map-align=false \--sv-denovo-scoring=true \--output-directory=<OUT_DIR> \--output-file-prefix=<PREFIX>

The DQ field is defined as follows:##FORMAT=<ID=DQ,Number=1,Type=Float,Description="Denovo quality">

The DQ field represents a score of the posterior probability of the variant being denovo in the proband. Ifit can be calculated, the score in Phred scale is added to the proband, while the other samples aremarked with a period ( . ) to indicate missing.

The inputs to the function are the VCF file and the Pedigree file that specifies which sample in the trio isthe proband, mother, or father.

Document # 1000000101034 v00



QCMetrics and Coverage (Enrichment) ReportsPipeline-specific metrics are generated during each run. The metrics are autogenerated and do notrequire any activation or specific commands. Metric calculation is performed during analysis so that itdoes not impact the DRAGEN run time.

Report name Equivalent commandDefaultreport?

DRAGEN output

full_res DRAGEN/samtools genomecov No <output prefix>_<coverage region>_full_res.bed

cov_report DRAGEN's depth of coverage report No <output prefix>_<coverage region>_cov_report.bed

hist DRAGEN's default histogram Yes <output prefix>_<coverage region>_hist.txt

mean_cov Mean coverage over region Yes <output prefix>_<coverage region>_mean_cov.txt

Table 7 Metrics Reports

Where <coverage region> is one of the following:

u wgs

u target_bed

u qc-coverage-region-1



Mapping and AligningMapping and aligning metrics, like the metrics computed by the Samtools Flagstat command, areavailable on an aggregate level (over all input data), and on a per read group level. Unless explicitlystated, the metrics units are in reads (ie, not in terms of pairs or alignments).

u Total input reads—Total number of reads in the input FASTQ files.

u Number of duplicate marked reads—Reads marked as duplicates as a result of the --enable-duplicate-marking option being used.

u Number of duplicate marked and mate reads removed—Reads marked as duplicates, along withany mate reads, that are removed when the --remove-duplicatesoption is used.

u Number of unique reads—Total number of reads minus the duplicate marked reads.

u Reads with mate sequenced—Number of reads with a mate.

u Reads without mate sequenced—Total number of reads minus number of reads with matesequenced.

u QC-failed reads—Reads not passing platform/ vendor quality checks (SAM flag 0x200).

u Mapped reads—Total number of mapped reads minus number of unmapped reads

u Number of unique and mapped reads—Number of mapped reads minus number of duplicatemarked reads.

u Unmapped reads—Total number of reads that could not be mapped.

Document # 1000000101034 v00



u Singleton reads—Number of reads where the read could be mapped, but the paired mate could notbe read.

u Paired reads—Count of reads in which both reads in the pair are mapped.

u Properly paired reads—Both reads in the pair are mapped and fall within an acceptable range fromeach other based on the estimated insert length distribution.

u Not properly paired reads (discordant)—The number of paired reads minus the number of properlypaired reads.

u Reads with indel R1— The percentage of R1 reads containing at least 1 indel.

u Reads with indel R2— The percentage of R2 reads containing at least 1 indel.

u Soft-clipped bases R1— The percentage of bases in R1 reads that are soft clipped.

u Soft-clipped bases R2— The percentage of bases in R2 reads that are soft clipped.

u Histogram of reads map qualitiesu Reads with MAPQ [40:inf)u Reads with MAPQ [30:40)u Reads with MAPQ [20:30)u Reads with MAPQ [10:20)u Reads with MAPQ [0:10)

u Total alignments—Total number of loci reads aligned to with > 0 quality.

u Secondary alignments—Number of secondary alignment loci.

u Supplementary (chimeric) alignments—A chimeric read is split over multiple loci (possibly due tostructural variants). One alignment is referred to as the representative alignment, the other aresupplementary.

u Estimated read length—Total number of input bases divided by the number of reads.

u Average sequenced coverage over genome/target region—Number of all reads in Fastqs * readlength divided by the number of sites in the target bed or genome.

u Average alignment coverage over genome/target region—Number of uniquely mapped bases(ignoring duplicate reads and any clipped bases, only including reads with MAPQ > 0 ) divided bythe number of sites in the target bed or genome.

u Histogram—See Hist Coverage Report on page 72.

u Average chromosome X coverage over genome/target region—Total number of bases thataligned to chromosome X (or to the intersection of chromosome X with the target region) divided bythe total number of loci in chromosome X (or to the intersection of chromosome X with the targetregion). If there is no chromosome X in the reference genome, this metric shows as NA.

u Average chromosome Y coverage over genome/target region—Total number of bases thataligned to chromosome Y (or to the intersection of chromosome Y with the target region) divided bythe total number of loci in chromosome Y (or in the intersection of chromosome Y with the targetregion). If there is no chromosome Y in the reference genome, this metric shows as NA.

u XAvgCov/YAvgCov ratio over genome/target region—Average chromosome X alignment coveragein the genome (or in the target region) divided by the average chromosome Y alignment coverage inthe genome (or in the target region). If there was chromosome X or chromosome Y in the referencegenome, this metric shows as NA.

Document # 1000000101034 v00



u Average mitochondrial coverage over genome/target region—Total number of bases that alignedto the mitochondrial chromosome (or to the intersection of the mitochondrial chromosome with thetarget region) divided by the total number of loci in the mitochondrial chromosome (or in theintersection of the mitochondrial chromosome with the target region). If there was no mitochondrialchromosome in the reference genome, this metric shows as NA.

u Average autosomal coverage over genome/target region—Total number of bases that aligned toautosomes (or to the autosomal loci in the target region) divided by the total number of loci in theautosomes (or to the autosomal loci in the target region). If there was no autosome in the referencegenome, this metric shows as NA.

u Median autosomal coverage over genome/target region—Median alignment coverage over theautosomes (or over the autosomal loci in the target region). If there was no autosome in thereference genome, this metric shows as NA.

u Mean/Median autosomal coverage ratio over genome/target region—Mean autosomal coveragein the genome (or in the target region) divided by the median autosomal coverage in the genome (orin the target region). If there was no autosome in the reference genome, this metric shows as NA.

u PCT of bases aligned that fell inside the interval region—Number of bases inside the intervalregion and the target region divided by the total number of bases aligned.

u Capture specificity—Number of unique (nonduplicate) reads that mapped inside the target regionwith a MAP quality greater than 0, divided by the number of unique (nonduplicate) reads that mappedanywhere in the reference with a MAP quality greater than 0.

u Estimated sample contamination—When variant calling is conducted with any bioinformatics tool,prediction accuracy is highly affected by cross-sample contamination. Even small levels ofcontamination can lead to many FP calls, especially in pipelines where the aim is to detect variantswith low allele frequencies.

The DRAGEN cross-sample contamination module uses a probabilistic mixture model to estimatethe fraction of reads in a sample that may be from another human source. This samplecontamination fraction is estimated as the parameter value in the mixture model that maximizes thelikelihood of the observed reads at multiple pileup locations. The mixture model accounts for thepopulation allele frequencies and the inferred sample genotypes.

To enable this metric, you must provide the file path on the command line to a VCF that includesmarker sites (RSIDs) with population allele frequencies. For example:

--qc-cross-cont-vcf /opt/edico/config/sample_cross_contamination_resource_hg19.vcf

The VCF resource files that are included with DRAGEN can be reconstructed from the Ensambldatabase. The VCFs included in DRAGEN’s config folder contain ~5000 marker locations where thepopulation AFs are close to 0.5. The files are reference specific (hg19/GRCh37/hg38). DRAGEN willabort if an incompatible resource and reference file is used (eg, CRCh37 resource file and hg19reference).

The following shows example output for a sample with no contamination. This value is provided as afraction, so a value of 0.011 the same as 1.1% estimated contamination.

MAPPING/ALIGNING SUMMARY Estimated sample contamination 0.000

Document # 1000000101034 v00



Variant CallingThe generated variant calling metrics are similar to the metrics computed by RTG vcfstats. Metrics arereported for each sample in multi sample VCF and gVCF files. Based on the run case, metrics arereported either as standard VARIANT CALLER or JOINT CALLER. Metrics are reported both for the raw(PREFILTER) and hard filtered (POSTFILTER) VCFs.

Number of samples—Number of samples in the population/ joint VCF.

Reads Processed—The number of reads used for variant calling, excluding any duplicate marked readsand reads falling outside of the target region.

Total—The total number of variants (SNPs + MNPs + INDELS).

Biallelic—Number of sites in a genome that contains two observed alleles, counting the reference asone, and therefore allowing for one variant allele.

Multiallelic—Number of sites in the VCF that contain three or more observed alleles. The reference iscounted as one, therefore allowing for two or more variant alleles.

SNPs—A variant is counted as an SNP when the reference, allele 1, and allele2 are all length 1.

INDELs—Every other variant side is considered an Indel, ie, if allele 1 is of length 1, but allele 2 is not,then the variant site is classified as an indel.

MNPs—A variant is counted as a MNP when the alleles 1 and 2 and also the ref sequence are of equallength and longer than 1.

DeNovo SNPs—DeNovo marked SNPs, with DQ > 0.05. Sett the --qc-snp-denovo-quality-thresholdoption to the required threshold. The default is 0.05.

DeNovo INDELs—DeNovo marked indels with DQ values > 0.02. This DQ threshold can be specified bySetting the --qc-indel-denovo-quality-threshold option to the required DQ threshold. The defaultis 0.02.

DeNovo MNPs—Same as the option for SNPs. Set the --qc-snp-denovo-quality-threshold to the requiredthreshold. The default is 0.05.

(Chr X SNPs)/(Chr Y SNPs) ratio in the genome (or the target region)—Number of SNPs inchromosome X (or in the intersection of chromosome X with the target region) divided by the number ofSNPs in chromosome Y (or in the intersection of chromosome Y with the target region). If there was noalignment to either chromosome X or chromosome Y, this metric shows as NA.

SNP Transitions—An interchange of two purines (A<->G) or two pyrimidines (C<->T).

SNP Transversions—An interchange of purine and pyrimidine bases "Ti/Tv ratio": ratio of transitions totransitions.

Heterozygous—Number of heterozygous variants.

Homozygous—Number of homozygous variants.

Het/Hom ratio—Heterozygous/ homozygous ratio.

In dbSNP—Number of variants detected that are present in the dbsnp reference file. If no dbsnp file isprovided via the --bsnp option, then both the In dbSNP and Novel metrics show as NA.

Novel—Total number of variants minus number of variants in dbSNP.

DurationA breakdown of the run duration for each process, eg, reference loading, aligning reads, and variantcalling.

Document # 1000000101034 v00



QCMetrics Output FormatThe QC metrics are printed to the standard out in a human friendly format and csv files are written to therun output directory.

u <output prefix>.mapping_metrics.csv

u <output prefix>.time_metrics.csv

u <output prefix>.vc_metrics.csv

Each row is self-explained, making it easy to parse.

Section RG/Sample Metric Count/Ration/Time Percentage/Seconds

MAPPING/ALIGNINGSUMMARY

Total input reads 816360354

MAPPING/ALIGNINGSUMMARY

Number of duplicate reads(marked not removed)

15779031 1.93

...

MAPPING/ALIGNINGPER RG

RGID_1 Total reads in RG 816360354 100

MAPPING/ALIGNINGPER RG

RGID_1 Number of duplicate reads(marked)

15779031 1.93

...

VARIANT CALLERSUMMARY

Number of samples 1

VARIANT CALLERSUMMARY

Reads Processed 738031938

...

VARIANT CALLERPREFILTER

SAMPLE_1 Total 4918287 100

VARIANT CALLERPREFILTER

SAMPLE_1 Biallelic 4856654 98.75

...

RUN TIME Time loading reference 00:18.6 18.65

RUN TIME Time aligning reads 19:24.4 1164.42

Coverage/Enrichment ReportsDRAGEN generates a set of default coverage reports for either the whole genome, or, if the --vc-target-bed option is specified, for the target region.

In addition to these default reports, the coverage reports feature allows the user to specify up to threeregions of interest (coverage regions). For each of these regions, DRAGEN generates the defaultreports, and any optional report requested for the region.

Document # 1000000101034 v00



You can enable the region coverage reports with the –qc-coverage-region-i options, where i can equal 1,2, or 3. Each –qc-coverage-region-i option specifies a bed file and has an optional associated –qc-coverage-reports-i option, which specifies the type of reports that should be generated for each specificregion in addition to the default reports.

All default and optional coverage reports listed in the two tables below use the following default rules forcounting reads and bases:

u Duplicate reads are ignored.

u Soft and/or hard-clipped bases are ignored

u Reads with MAPQ=0 are ignored.

u Overlapping mates are double-counted.

Non-default settings:

It is possible to override the min MAPQ and min BQ to apply for a given region using the qc-coverage-filters. For more information, see MAPQ and BQ Coverage Filtering on page 74.

The reports are available with or without running either the mapper and aligner, or the variant caller.However, the --enable-sort options needs to be set to true (the default is true).

Any combination of the optional reports can be requested for each region. If multiple report types areselected per region, they should be space-separated.

Report NameDRAGEN OutputFile Type

Hist Coverage _hist.csv

Overall meancoverage

_overall_mean_cov.csv

Per contig meancoverage

_contig_mean_cov.csv

Predicted ploidy _ploidy.csv

Table 8 Default Reports

ReportName

DRAGEN Output FileType

full_res _full_res.bed

cov_report _cov_report.bed

Table 9 Optional Reports

Coverage Reports Use Cases and ExpectedOutput

--vc-target-bed specified?Y/N

--qc- coverage-region-i specified?

Y/NExpected Output Files

N N • Four .wgs_ default reports

N Y • Four .wgs_ default reports• Four coverage region defaultreports

• Up to two optional coverageregion reports as specified onthe command line

Document # 1000000101034 v00



--vc-target-bed specified?Y/N

--qc- coverage-region-i specified?

Y/NExpected Output Files

Y N • Four .target_bed_ defaultreports

Y Y • Four .target_bed_ defaultreports

• Four <coverage region> defaultreports

• Up to two optional <coverageregion> reports as requested bythe user

The following is an example of the additional arguments required to generate coverage reports:$ dragen … \

--qc-coverage-region-1 <bed file 1> \--qc-coverage-report-1 full_res mean_cov \--qc-coverage-region-2 <bed file 2> \--qc-coverage-region-3 <bed file 3> \--qc-coverage-report-3 cov_report mean_cov

Hist Coverage ReportThe hist report outputs a _hist.csv file, which provides the following:

u Percentage of bases in the genome/target region/coverage region that fall within a certain range ofcoverage.

u Duplicate reads are ignored if DRAGEN is run with --enable-duplicate-marking true.

The following ranges are used:"[100x:inf)", "[50x:100x)", "[20x:50x)", "[10x:20x)", "[3x:10x)", "[0x:

3x)"

Overall Mean Coverage ReportThe Overall Mean Coverage report generates an _overall_mean_cov.csv file, which contains the averagealignment coverage over the coverage bed/target bed/wgs, as applicable.

The following is an example of the contents of the _overall_mean_cov.csv file:Average alignment coverage over target_bed,80.69

Contig Mean Coverage ReportThe Contig Mean Coverage report generates a _contig_mean_cov.csv file, which contains the estimatedcoverage for all contigs, and an autosomal estimated coverage. The file includes the following threecolumns:

Document # 1000000101034 v00



Column 1 Column 2 Column3

Contig name Number of bases aligned to thatcontig, which excludes bases fromduplicate marked reads, reads withMAPQ=0, and clipped bases.

Estimated coverage, as follows:<number of bases aligned to thecontig (ie, Col2)> divided by<length of the contig or (if a targetbed is used) the total length of thetarget region spanning thatcontig>.

Contig lengths and target bed region lengths (used as denominators) include regions with N in theFASTA.

Ploidy ReportThe ploidy report generates a predicted ploidy in a _whg_ploidy.txt file.

The following is an example of the contents of the _ploidy.txt file:Predicted sex chromosome ploidy XX

Full Res ReportThe Full Res Report outputs a _full_res.bed file. This file is tab-delimited file and has as its first threecolumns the standard BED fields, and as its fourth column the depth. Each record in the file is for agiven interval that has a constant depth. If the depth changes, then a new record is written to the file.Alignments that have a mapping quality value of 0, duplicate reads, and clipped bases are not countedtowards the depth.

Only base positions that fall under the user-specified coverage-region bed regions are present in the _full_res.bed output file.

The _full_res.bed file structure is similar to the output file of bedtools genomecov -bg. The contents areidentical if the bedtools command line is executed after filtering out alignments with mapping qualityvalue of 0, and possibly filtering by a target BED (if specified).

The following is an example of the contents of the _full_res.bed file:chr1 121483984 121483985 10

chr1 121483985 121483986 9

chr1 121483986 121483989 8

chr1 121483989 121483991 7

chr1 121483991 121483992 6

chr1 121483992 121483993 4

chr1 121483993 121483994 2

chr1 121483994 121484039 1

chr1 121484039 121484043 2

chr1 121484043 121484048 3

chr1 121484048 121484050 7

chr1 121484050 121484051 11

chr1 121484051 121484052 17

chr1 121484052 121484053 149

Document # 1000000101034 v00



chr1 121484053 121484054 323

chr1 121484054 121484055 2414

Coverage ReportThe cov_report report generates a _cov_report.bed file. This tab-delimited file has the standard BEDfields as its first three columns. The column fields that follow are statistics calculated over the intervalregion specified on the same record line. The additional columns are as follows:

u total_cvg—The total coverage value.

u mean_cvg—The mean coverage value.

u Q1_cvg—The lower quartile (25th percentile) coverage value.

u median_cvg—The median coverage value.

u Q3_cvg—The upper quartile (75th percentile) coverage value.

u min_cvg—The minimum coverage value.

u max_cvg—The maximum coverage value.

u pct_above_X—Indicates the percentage of bases over the specified interval region that had a depthcoverage greater than X.

By default, if an interval has a total coverage of 0, then the record is written to the output file. To filterout intervals with zero coverage, set vc-emit-zero-coverage-intervals to false in the configuration file.

The following is an example of the contents of the _cov_report.bed file:

chrom start end total_cvg mean_cvg Q1_cvg median_cvg Q3_cvg min_cvg max_cvg pct_above_5

…

chr5 34190121 34191570 76636 52.89 44.00 54.00 60.00 32 76 100.00

…

chr5 34191751 34192380 39994 63.58 57.00 61.00 69.00 50 85 100.00

…

chr5 34192440 34192642 10074 49.87 47.00 49.00 51.00 44 62 100.00

…

chr9 66456991 66457682 31926 46.20 39.00 45.00 52.00 27 65 100.00

…

chr9 68426500 68426601 4870 48.22 42.00 48.00 54.00 39 58 100.00

…

chr17 41465818 41466180 24470 67.60 4.00 66.00 124.00 2 153 66.30

…

chr20 29652081 29652203 5738 47.03 40.00 49.00 52.00 34 58 100.00

…

chr21 9826182 9826283 4160 41.19 23.00 52.00 58.00 5 60 99.01

…

MAPQ and BQ Coverage FilteringCoverage filtering is implemented to allow filtering out of reads below a certain mapping quality, and/ornot counting bases below a certain base call quality in any read, over a specified coverage region. Thisfiltering is applied to all reports for this region.

Document # 1000000101034 v00



The coverage filtering can be enabled by using one of the --qc-coverage-filters-i options, where i canequal 1, 2, or 3, in combination with the associated --qc-coverage-region-i option:

u --qc-coverage-region-i=<targetedregions.bed>

u --qc-coverage-filters-i <filters string>

For example, the following options are used to enable 1 bp resolution coverage output with filtering:

--qc-coverage-region-1= <targetedregions.bed>--qc-coverage-filters-1 'mapq<10,bq<30’--qc-coverage-reports-1 full_res

u The argument syntax is mapq<value,bq<value, which means that reads that have a mapping qualityless than the specified value are not counted, and/or bases with a base call quality below thespecified value.

u Valid filter arguments are mapq and bq only. Either, or both, can be specified.

u Only one operator < is supported. <=, >, >=, = are not supported.

u When filtering is enabled for a targeted region, DRAGEN outputs the filtered report files for thisregion. Unfiltered report files are not output for the targeted filtered region.

u When metrics compression is enabled, a _full_res.bw file is output instead of the _full_res.bed.

BigWig compression of coverage metricsThe 1 bp resolution coverage metrics output bed file (_full_res.bed) can become very large. Bigwigformat compression of this output can be enabled by setting the --enable-metrics-compression option totrue.

Variant Quality Score RecalibrationThe Variant Quality Score Recalibration (VQSR) module applies machine learning algorithms on an inputVCF with training data from databases of known variants to produce a metric (VQSLOD). This metric isa better representation of whether a call is a true variant or a false variant. The VQSLOD metric isadded to the INFO field of each variant call record and can be used later to filter out calls based on athreshold. The VQSLOD metric is the log odds ratio of the call being a true variant versus the call beinga false variant.

AlgorithmThe VQSR module attempts to build a Gaussian mixture model using the distribution of annotationvalues from a subset of high quality variant call sites. These variant call sites are determined from thespecified resource files, or training sets, such as HapMap3 or Omni 2.5M SNP. From the model, eachvariant call site from the input set is then given a co-varying estimate based on the specifiedannotations, which indicates the likelihood that the call is a true genetic variant or not. The algorithmimplements the variational inference version of the expectation maximization algorithm over a Gaussianmixture model. The model generation is done iteratively until convergence is met. Convergence may notbe possible if there are too few sites in the input data set or the training sets.

Both a positive model and a negative model are generated for each pass. The negative model isgenerated from the worst performing sites via the --vqsr-lod-cutoff threshold. After both models aregenerated, the log likelihood that a call site was generated from each model is calculated, and then thelog likelihood ratio is taken. The log likelihood ratio is annotated in the output VQSR VCF as VQSLODunder the INFO column.

Document # 1000000101034 v00



You can further filter out call sites at a specific threshold by specifying tranche values, which indicatethe target sensitivity levels. The VQSR module then calculates the minimum VQSLOD score required tomeet this target sensitivity level. The --vqsr-filter-level option determines the threshold to filter out calls.

Usage and SettingsEnabling VQSR postprocessing adds only minutes to the DRAGEN pipeline for a whole genomeanalysis. The VQSLOD annotations are calculated for both SNPs and indels and are output in anadditional VCF file, <output-file-prefix>.vqsr.vcf. To enable VQSR, use the --enable-vqsr true option onthe DRAGEN command line.

The VQSR-specific options are as follows.

u --enable-vqsr

Enables the VQSR post processing module. When used with --enable-variant-caller or --enable-joint-genotyping, the VQSR processing occurs after the variant calling stages. When used with --vqsr-input, the VQSR engine works in standalone mode to process VCF files.

u --vqsr-config

Specifies the path to the VQSR configuration file. DRAGEN can optionally read in a configurationfile that contains the VQSR options. VQSR options that are specified in the configuration file can beoverridden with options specified on the command line. The configuration file is useful to storesettings that are used across multiple runs, such as the tranche settings or resource files. The/opt/edico/config/dragen-VQSR.cfg file provides an example of a VSQR configuration file.

u --vqsr-input

Specifies the VQSR input VCF to be processed. If this option is specified on the DRAGENcommand line, VQSR runs in standalone mode, which allows you to run VQSR on VCF files thatwere generated ahead of a time.

u --vqsr-annotation

Specifies the annotations to be used for building the models as a comma-separated string thatspecifies the mode of operation followed by a list of annotations to be used under that mode. Forexample: <mode>,<annotation>,<annotation>...

The <mode> is either SNP or INDEL. If only SNP is specified, then only SNPs are processed withVQSR. If only INDEL is specified, then only indels are processed. If both SNP and INDEL arespecified, by using this option twice on the command line, then both SNPs and indels areprocessed.

The list of annotations come from the INFO column of a VCF. You can specify up to eightannotations. These annotations and their associated values are used to build the model.

The following is an example of VQSR annotations options:--vqsr-annotation SNP,DP,QD,FS,ReadPosRankSum,MQRankSum,MQ

--vqsr-annotation INDEL,DP,QD,FS,ReadPosRankSum,MQRankSum

u --vqsr-resource

Specifies the training resource files to be used for determining which variant call sites are true. Thisoption can be specified multiple times to include multiple resource files, each with a different priorvalue. DRAGEN does not distinguish between truth or training resource files. All resource files willbe used for both truth and training.

Document # 1000000101034 v00



The format of this option is a comma-separated string that specifies the mode of operation, the priorvalue to weight this resource, then the path the resource file. For example:<mode>,<prior>,<resource file>.

<mode> is either SNP or INDEL. The <mode> indicates how to apply the resource files.

<prior> applies a weight to the variant call sites from the specified resource file with a priorprobability of being correct.

<resource file> provides the path and file name of the VCF file to be used as the resource file. Forexample:

--vqsr-resource "SNP,15.0,<path>/hapmap_3.3.vcf"

--vqsr-resource "SNP,12.0,<path>/1000G_omni2.5.vcf"

--vqsr-resource "SNP,10.0,<path>/1000G_phase1.snps.high_confidence.vcf"

--vqsr-resource "INDEL,12.0,<path>/Mills_and_1000G_gold_standard.indels.vcf"

u --vqsr-tranche

Specifies the truth sensitivity levels to calculate the LOD cutoff values. These values are specifiedin percentages. This option can be specified multiple times with a different truth sensitivity level. Ifnone are specified, the default values of 100.0, 99.99, 99.90, 99.0, and 90.0 are used. For example:

--vqsr-tranche 100.0





u --vqsr-filter-level

Specifies the truth sensitivity level as a percentage to filter out variant calls. From the truthsensitivity level, the corresponding minimum VQSLOD score is calculated, and all annotated callsthat are below this threshold are marked as filtered. The FILTER field is marked as failing with thefilter VQSLODThresholdSNP or VQSLODThresholdINDEL. If not specified, none of the calls arefiltered. The filter value must be specified independently for SNPs and indels. For example:

--vqsr-filter-level SNP,99.5

--vqsr-filter-level INDEL,90.0

u --vqsr-lod-cutoff

Specifies the LOD cutoff score for selecting which variant call sites to use in building the negativemodel. The default value is -5.0.

u --vqsr-num-gaussians

Specifies the number of Gaussians to generate the positive and negative models. This option isspecified as a comma-separated string of the following four integer values: <SNP positive>,<SNPnegative>,<INDEL positive>,<INDEL negative>. If not specified, the default values of 8,2,4,2 areused.

The number of Gaussians to use per model must be greater than 0 and at most 8. The number ofpositive Gaussians must be greater than the number of negative Gaussians. As an example, togenerate the models with 6 Gaussians for SNP positive, 2 Gaussians for SNP negative, 4Gaussians for indel positive, and 2 Gaussians for indel negative, use the following option:

Document # 1000000101034 v00



--vqsr-num-gaussians 6,2,4,2


Specifies the output directory where output files are stored.


Specifies the output file prefix for all output files.

VSQRExample OutputIf you already have a VCF that needs to be annotated and filtered, the VQSR module can be run as astandalone tool. The following is an example of the output:

==================================================================DRAGEN Variant Quality Score Recalibration

==================================================================Input file: /home/username/input.vcfOutput file: output/dragen.vqsr.vcfTranches: 100, 99.99, 99.9, 99, 90Annotations:

SNP: MQ FS QD MQRankSum ReadPosRankSum DPINDEL: FS QD MQRankSum ReadPosRankSum DP

Priors and training files:Q15.0:/variant_dbases/hapmap_3.3.vcfQ12.0:/variant_dbases/Mills_and_1000G_gold_standard.indels.vcfQ12.0:/variant_dbases/1000G_omni2.5.vcfQ10.0:/variant_dbases/1000G_phase1.snps.high_confidence.vcf

Number of Gaussians SNP INDELPositive model: 8 4Negative model: 2 2

==================================================================Building SNP TrainingSet==================================================================

Number of valid records in input file: 5266144Number of training records detected: 4170912

WARNING: Annotation 'QD' was missing in 41 records.WARNING: Annotation 'ReadPosRankSum' was missing in 702218 records.WARNING: Annotation 'MQRankSum' was missing in 702185 records.

Training set statistics:DP - mean: 140.838 std dev: 24.8402MQ - mean: 59.9042 std dev: 0.946757QD - mean: 22.7226 std dev: 6.67918FS - mean: 3.01561 std dev: 4.18792 ReadPosRankSum - mean:

0.73308 std dev: 0.964922

MQRankSum - mean: 0.220358 std dev: 0.893418

Number of outliers removed from the full set: 363154 out of 5266144Number of outliers removed from training set: 12046 out of 4170912

Document # 1000000101034 v00



==================================================================Generating SNP Positive Model==================================================================Number of Gaussians: 8Number of data points: 4158866

Initializing model with k-means algorithmK-means stabilized after 86 iterationsRunning expectation-maximization algorithm........................................................Expectation-maximization algorithm converged after 112 iterations

Cluster weight assigned to Gaussian #0 is 0.0109316Cluster weight assigned to Gaussian #1 is 0.00988Cluster weight assigned to Gaussian #2 is 0.646337Cluster weight assigned to Gaussian #3 is 0.160654Cluster weightassigned to Gaussian #4 is 0.00700244

Cluster weight assigned to Gaussian #5 is 0.005172Cluster weight assigned to Gaussian #6 is 0.00252354Cluster weight assigned to Gaussian #7 is 0.157501

==================================================================Building SNP Negative Training Set==================================================================LOD Cutoff: -5Number of data points: 359971

==================================================================Generating SNP Negative Model==================================================================Number of Gaussians: 2Number of data points: 359971

Initializing model with k-means algorithmK-means stabilized after 15 iterationsRunning expectation-maximization algorithm.............Expectation-maximization algorithm converged after 25 iterations

Cluster weight assigned to Gaussian #0 is 0.366887Cluster weight assigned to Gaussian #1 is 0.633116

==================================================================Calculating SNP LOD Ratios==================================================================Minimum LOD: -1317.29Maximum LOD: 21.9196

Document # 1000000101034 v00



==================================================================Calculating SNP Truth Sensitivity Tranches==================================================================Tranche ts = 100.00 minVQSLOD = -1317.2891Tranche ts = 99.99 minVQSLOD = -2.1371Tranche ts = 99.90 minVQSLOD = -1.1866Tranche ts = 99.00 minVQSLOD = 0.0199Tranche ts = 90.00 minVQSLOD = 15.9062

==================================================================Building INDEL Training Set==================================================================Number of valid records in input file: 1156226Number of training records detected: 440621

WARNING: Annotation 'QD' was missing in 41 records.WARNING: Annotation 'ReadPosRankSum' was missing in 77316 records.WARNING: Annotation 'MQRankSum' was missing in 76319 records.

Training set statistics:DP - mean: 137.2 std dev: 41.05QD - mean: 18.83 std dev: 8.258FS - mean: 2.88 std dev: 4.756ReadPosRankSum - mean: 0.2631 std dev: 0.9896MQRankSum - mean: 0.1943 std dev: 0.907

Number of outliers removed from the full set: 8150 out of 1156226Number of outliers removed from training set: 686 out of 440621

==================================================================Generating INDEL Positive Model==================================================================Number of Gaussians: 4Number of data points: 439935

Initializing model with k-means algorithmK-means stabilized after 52iterations

Running expectation-maximization algorithm..................Expectation-maximization algorithm converged after 36 iterations

Cluster weight assigned to Gaussian #0 is 0.02934Cluster weight assigned to Gaussian #1 is 0.2792Cluster weight assigned to Gaussian #2 is 0.3979Cluster weight assigned to Gaussian #3 is 0.2936

==================================================================Building INDEL Negative Training Set==================================================================LOD Cutoff: -5

Document # 1000000101034 v00



Number of data points: 61977

==================================================================Generating INDEL Negative Model==================================================================Number of Gaussians: 2Number of data points: 61977

Initializing model with k-means algorithmK-means stabilized after 13 iterationsRunning expectation-maximization algorithm...................Expectation-maximization algorithm converged after 38 iterations

Cluster weight assigned to Gaussian #0 is 0.5351Cluster weight assigned to Gaussian #1 is 0.4649

==================================================================Calculating INDEL LOD Ratios==================================================================Minimum LOD: -679.9Maximum LOD: 6.154

==================================================================Calculating INDEL Truth Sensitivity Tranches==================================================================Tranche ts = 100.00 minVQSLOD = -679.9446Tranche ts = 99.99 minVQSLOD = -3.7305Tranche ts = 99.90 minVQSLOD = -1.4809Tranche ts = 99.00 minVQSLOD = -0.1606Tranche ts = 90.00 minVQSLOD = 1.6201

==================================================================Merging SNP and INDEL Records==================================================================Number of records processed as SNPs: 5266144Number of records processed as INDELs: 1156226

==================================================================Generating Output VCF==================================================================Number of total records from input file: 6422370Number of records annotated with VQSLOD: 6422370VQSR annotated VCF written to: output/dragen.vqsr.vcf

Document # 1000000101034 v00



Virtual Long Read DetectionDRAGEN Virtual Long Read Detection (VLRD) is an alternate and more accurate variant caller focusedon processing homologous/similar regions of the genome. A conventional variant caller relies on themapper/aligner to determine which reads likely originated from a given location. It also detects theunderlying sequence at that location independently of other regions not immediately adjacent toit. Conventional variant calling works well when the region of interest does not resemble any other regionof the genome over the span of a single read (or a pair of reads for paired-end sequencing).

However, a significant fraction of the human genome does not meet this criterion. Many regions of thegenome have near-identical copies elsewhere, and as a result, the true source location of a read mightbe subject to considerable uncertainty. If a group of reads is mapped with low confidence, a typicalvariant caller might ignore the reads, even though they contain useful information. If a read ismismapped (ie, the primary alignment is not the true source of the read), it can result in detection errors.Short-read sequencing technologies are especially susceptible to these problems. Long-read sequencingcan mitigate these problems, but it typically has much higher cost and/or higher error rates, or othershortcomings.

DRAGEN VLRD attempts to tackle the complexities presented by the genome's redundancy from aperspective driven by the short-read data. Instead of considering each region in isolation, VLRDconsiders all locations from which a group of reads may have originated and attempts to detect theunderlying sequences jointly using all available information.

Running DRAGENVLRDLike the DRAGEN variant caller, VLRD takes either FASTQ or sorted BAM files as input and producesan output VCF file. VLRD supports processing a set of only two homologous regions. DRAGEN cannotprocess a set of three or more homologous regions. Support for three or more homologous regions willbe added in a future release.

VLRD is not enabled by default. To run VLRD, set the --enable-vlrd option to true. The following is anexample DRAGEN command to run VLRD.

/opt/edico/bin/dragen \

-r $REF \

-1 $FQ1 \

-2 $FQ2 \

--output-dir $OUTPUT \

--output-file-prefix $PREFIX \

--enable-map-align=true \

--enable-sort=true \

--enable-duplicate-marking=true \

--vc-sample-name=test \

--enable-vlrd true \

--vc-target-bed similar_regions.bed \

Document # 1000000101034 v00



VLRDUpdatedMap/Align OutputDRAGEN VLRD can output a remapped BAM/SAM file in addition to the regular DRAGEN Map/Alignoutput. To enable output of a remapped BAM/SAM file, set the --enable-vlrd-map-align-output option totrue. The default for this option is false.

The additional map/align output by VLRD only contains reads mapped to the regions that wereprocessed by VLRD.

VLRD considers the information available from all the homologous regions jointly to update the readalignments (mapping position and/or mapping quality, and so on). The updated VLRD map/align output isprimarily useful for pile-up analysis involving homologous regions.

VLRDSettingsThe following options are specific to VLRD in the DRAGEN host software.

u --enable-vlrd

If set to true, VLRD is enabled for the DRAGEN pipeline.

u --vc-target-bed

Specifies the input bed file. DRAGEN requires an input target bed file specifying the homologousregions that are to be processed by VLRD. This bed file has a special format that is required byVLRD to correctly process the homologous regions. The maximum region span processed by VLRDis 900 bp.

For example:chr1 161497562 161498362 0 0

chr1 161579204 161580004 0 0

chr1 21750837 21751637 1 0

chr1 21809355 21810155 1 1

u The first three columns are like traditional bed files: column 1 is chromosome description,column 2 is region start, and column 3 is region end.

u Column 4 is homologous region Group ID. This groups regions that are homologous to eachother.

u Rows 1 and 2 have the same value in column 4 indicating that these should be processed as aset of homologous regions, independent of the next group in row 3 and 4. If not set correctly,software might group regions that are not homologous to each other, leading to incorrect variantcalls.

u Column 5 indicates whether a region is reverse complemented with respect to the otherhomologous region. A value of 1 denotes that the region is reverse complemented with respectto other regions in the same group.

u Row 4, column 5 is set to 1. This indicates that the region is homologous to the region in row 3only if it is reverse complemented.

The DRAGEN installation package contains two VRLD bed files for hg19 and hs37d5 referencegenomes under /opt/edico/examples/VLRD. You can use these bed files as is for running VLRD, oras an example to generate a custom bed file.

u --enable-vlrd-map-align-output

If set to true, VLRD outputs a remapped BAM/SAM file that only contains reads mapped to the

Document # 1000000101034 v00



regions that were processed by VLRD.

ForceGenotypingDRAGEN now supports force genotyping (ForceGT) for Germline SNV variant calling. To use ForceGT,use the --vc-forcegt-vcf option with a list of small variants to force genotype. The input list of smallvariants can be a .vcf or .vcf.gz file.

The current limitations of ForceGT are as follows:

u ForceGT is supported for Germline SNV variant calling in the V3 mode. The V1, V2, and V2+ modesare not supported.

u ForceGT is not supported for Somatic SNV variant calling.

u ForceGT variants do not propagate through Joint Genotyping.

ForceGT InputDRAGEN software supports only a single ForceGT VCF input file, which must meet the followingrequirements

u Have the same reference contigs as the VCF used for variant calling.

u Be sorted by reference contig name and position.

u Be normalized (parsimonious and left-aligned).

u Contain no complex variants (variants that require more than one substitution / insertion / deletion togo from ref allele to alt allele). For example, any variant in the ForceGT VCF similar to the followingresults in undefined behavior in the DRAGEN software:

chrX 153592402 GTTGGGGATGCTGAC CACCCTGAAGGG

The following nonnormalized variants cause undefined behavior in the DRAGEN software:

u Not parsimonious: chrX 153592402 GC GCG

u Parsimonious representation: chrX 153592403 C CG

ForceGTOperation and ExpectedOutcomeWhen running Germline SNV variant calling with ForceGT, a single sample gVCF is generated, using theForceGT VCF as input on the DRAGEN command line. The single sample gVCF output file contains allregular calls and the forceGT calls, as follows:

u For a ForceGT call that was not called by the variant caller (not common), the call is tagged withFGT in the INFO field.

u For a ForceGT call that was also called by the variant caller and filter field is PASS (common), thecall is tagged with NML:FGT in the INFO field (NML denotes normal)

u For a normal call (and PASS) by the variant caller, with no ForceGT call (normal), no extra tags areadded (no NML tag, no FGT tag)

This scheme allows us to distinguish between calls that are present due to FGT only, common in bothForceGT input and normal calling, and normal calls.

All the variants in the input ForceGT VCF are genotyped and present in the output single sample gVCFfile. The reported GT for the variants are as follows:

Document # 1000000101034 v00



Condition Reported GT

At a position with no coverage ./.

At a position with coverage but no reads supportingALT allele

0/0

At a position with coverage and reads supportingALT allele

0/0 or 0/1 or 1/1or 1/2

At a position where the variant call made by default DRAGEN software is different from the onespecified in the input ForceGT vcf, there are multiple entries at the same position in the output gVCF, asfollows:

u One entry for the default DRAGEN variant call, and

u One entry each for every variant call present in the input ForceGT VCF at that position.chrX 100 G C [Default DRAGEN variant call]

chrX 100 G A [Variant in ForceGT vcf]

If a target bed file is provided along with the input ForceGT VCF, then the output gVCF file onlycontains ForceGT variants that overlap the bed file positions.

UniqueMolecular IdentifiersDRAGEN version 3.3 includes a beta release of the UMI pipeline.

The DRAGEN UMI pipeline supports the TruSight Oncology (TSO) UMI Reagents which use uniquemolecular identifiers (UMIs) to lower the rate of PCR or sequencing errors aiding in the detection of rareand low frequency somatic variants present in DNA samples, such as cfDNA isolated from plasma.After libraries that contain UMIs are sequenced, the DRAGEN UMI pipeline aligns the reads, thencollapses the sequences with shared UMIs (families) down to unique reads. This process enables thefiltering of false positives, which are excluded from the collapsed reads, reducing the error rate. Theresulting reads have higher per-base quality and lower noise from various sources.

To use the UMI pipeline, the input reads files must be from a paired-end run and must contain UMI tagsin the read name field in the standard Illumina format. The following example shows the UMI for bothreads of the pair appended to the end of the read name:

NDX550136:7:H2MTNBDXX:1:13302:3141:10799:AAGGATG+TCGGAGA

To enable read collapsing, use the --enable-umi and --remove-duplicates options. The following is anexample DRAGEN command to run the UMI pipeline:

/opt/edico/bin/dragen \-r $REF \-1 $FQ1 \-2 $FQ2 \--output-dir $OUTPUT \--output-file-prefix $PREFIX \--enable-map-align=true \--enable-sort=true \--enable-umi=true \--remove-duplicates=true

Document # 1000000101034 v00



Chapter 4 DRAGEN RNA ApplicationsThe DRAGEN RNA Applications use the DRAGEN RNA-Seq spliced aligner. Most of the functionalityand options described in Host Software Options on page 117, DRAGEN Host Software on page 6, andDRAGEN DNA Applications on page 15DRAGEN DNA Applications on page 15 are also applicable toRNA-Seq processing. At a minimum, it is advisable to first read through Host Software Options on page117, and DNA Mapping on page 15 through Duplicate Marking on page 24 before proceeding.

RNAMapping and Intron HandlingThe following options control the mapping stage of the RNA spliced aligner.

u Mapper.min-intron-bases

For RNA-Seq mapping, a reference alignment gap can be interpreted as a deletion or an intron. Inthe absence of an annotated splice junction, the min-intron-bases option is a threshold gap lengthseparating this distinction. Reference gaps at least this long are interpreted and scored as introns,and shorter reference gaps are interpreted and scored as deletions. However, alignments can bereturned with annotated splice junctions shorter than this threshold.

u Mapper.max-intron-bases

The max-intron-bases option controls the largest possible intron that is reported, which useful forpreventing false splice junctions that would otherwise be reported. Set this option to a value that issuitable to the species you are mapping against.

u Mapper.ann-sj-max-indel

For RNA-seq, seed mapping can discover a reference gap in the position of an annotated intron, butwith slightly different length. If the length difference does not exceed this option, the mapperinvestigates the possibility that the intron is present exactly as annotated, but an indel on one sideor the other near the splice junction explains the length difference. Indels longer than this optionchanged parameter to option

and very near annotated splice junctions are not likely to be detected. Higher values may increasemapping time and false detections.

RNAAligningThe following options control the aligner stage of the RNA spliced aligner.

Smith-Waterman Alignment Scoring OptionsRefer to Smith-Waterman Alignment Scoring Settings on page 17 for more details about the alignmentalgorithm used within DRAGEN. The following scoring options are specific to the processing ofcanonical and noncanonical motifs within introns.

u --Aligner.intron-motif12-pen

The --Aligner.intron-motif12-pen option controls the penalty for canonical motifs 1/2 (GT/AG,CT/AC). The default value calculated by the host software is 1*(match-score + mismatch-pen).


The --Aligner.intron-motif34-pen option controls the penalty for canonical motifs 3/4 (GC/AG,CT/GC). The default value calculated by the host software is 3*(match-score + mismatch-pen).


The --Aligner.intron-motif56-pen option controls the penalty for canonical motifs 5/6 (AT/AC, GT/AT).

Document # 1000000101034 v00


The default value calculated by the host software is 4*(match-score + mismatch-pen).


The --Aligner.intron-motif0-pen option controls the penalty for noncanonical motifs. The default valuecalculated by the host software is 6*(match-score + mismatch-pen).

Compatibility with CufflinksCufflinks might require spliced alignments to emit the XS:A strand tag. This tag is present in the SAMrecord if the alignment contains a splice junction. The values for XS:A strand tag are as follows:

‘.’ (undefined), ‘+’ (forward strand), ‘-’ (reverse strand), or ‘*’ (ambiguous).

If the spliced alignment has an undefined strand or a conflicting strand, then the alignment can besuppressed by setting the no-ambig-strand option to 1.

Cufflinks also expects that the MAPQ for a uniquely mapped read is a single value. This value isspecified by the --rna-mapq-unique option. To force all uniquely mapped reads to have a MAPQ equal tothis value, set --rna-mapq-unique to a nonzero value.

MAPQ ScoringBy default, the MAPQ calculation for RNA-Seq is identical to DNA-Seq. The primary contributor toMAPQ calculation is the difference between the best and second-best alignment scores. Therefore,adjusting the alignment scoring parameters impacts the MAPQ estimate. These adjustments are outlinedin Smith-Waterman Alignment Scoring Settings on page 17 and Smith-Waterman Alignment ScoringSettings on page 1.

The --mapq-strict-sjs option is specific to RNA, and applies where at least one exon segment is alignedconfidently, but there is ambiguity about some possible splice junction. When this option is set to 0, ahigher MAPQ value is returned, expressing confidence that the alignment is at least partially correct.When this option is set to 1, a lower MAPQ value is returned, expressing the splice junction ambiguity.

Some downstream tools, such as Cufflinks, expect the MAPQ value to be a unique value for all uniquelymapped reads. This value is specified with the --rna-mapq-unique option. Setting this option to a nonzerovalue overrides all MAPQ estimates based on alignment score. Instead, all uniquely mapped reads havea MAPQ set to the value of --rna-mapq-unique. All multimapped reads have a MAPQ value of int(-10*log10(1 - 1/NH), where the NH value is the number of hits (primary and secondary alignments) forthat read.

Input FilesIn addition to the standard input files (*.fastq, *.bam, and so on), DRAGEN can also take a geneannotations file as input. A gene annotations file aids in the detection of splice junctions during RNA-Seq mapping and aligning.

Gene Annotation FileTo specify a gene annotation file, use the -a (--annotation-file) command line option. The input file mustconform to the GTF/GFF specification (http://uswest.ensembl.org/info/website/upload/gff.html). The filemust contain features of type exon, and the record must contain attributes of type gene_id andtranscript_id. An example of a valid GTF file is shown below.

chr1 HAVANA transcript 11869 14409 . + . gene_id"ENSG00000223972.4"; transcript_id "ENST00000456328.2"; …chr1 HAVANA exon 11869 12227 . + . gene_id

Document # 1000000101034 v00



"ENSG00000223972.4"; transcript_id "ENST00000456328.2"; …chr1 HAVANA exon 12613 12721 . + . gene_id"ENSG00000223972.4"; transcript_id "ENST00000456328.2"; …chr1 HAVANA exon 13221 14409 . + . gene_id"ENSG00000223972.4"; transcript_id "ENST00000456328.2"; …chr1 ENSEMBL transcript 11872 14412 . + . gene_id"ENSG00000223972.4"; transcript_id "ENST00000515242.2"; …chr1 ENSEMBL exon 11872 12227 . + . gene_id"ENSG00000223972.4"; transcript_id "ENST00000515242.2"; …chr1 ENSEMBL exon 12613 12721 . + . gene_id"ENSG00000223972.4"; transcript_id "ENST00000515242.2"; …chr1 ENSEMBL exon 13225 14412 . + . gene_id"ENSG00000223972.4"; transcript_id "ENST00000515242.2"; ……

Similarly, a GFF file can be used. Each exon feature must have as a Parent a transcript identifier that isused to group exons. An example of a valid GFF file is shown below.

1 ensembl_havana processed_transcript 11869 14409 . + . ID=transcript:ENST00000456328;1 havana exon 11869 12227 . + . Parent=transcript:ENST00000456328; …1 havana exon 12613 12721 . + . Parent=transcript:ENST00000456328; …1 havana exon 13221 14409 . + . Parent=transcript:ENST00000456328; ……

The DRAGEN host software parses the file for exons within the transcripts and produces splicejunctions. The following output displays the number of splice junctions detected.

==================================================================Generating annotated splice junctions==================================================================Input annotations file: ./gencode.v19.annotation.gtfSplice junctions database file: output/rna.sjdb.annotations.out.tab

Number of genes: 27459

Number of transcripts: 196520Number of exons: 1196293Number of splice junctions: 343856

The splice junctions that are detected are also written to an output file, *.sjdb.annotations.out.tab, whichis viewable by the user. When forming the splice junctions from the input gene annotations file, a simplefilter is used that discards any splice junctions that do not meet the minimum required length. This helpsto lower the false detection rate for falsely annotated junctions. This minimum annotation splice junctionlength is controlled by the --rna-ann-sj-min-len option, which has a default value of 6.

Document # 1000000101034 v00



Two-PassModeIn addition to reading in GTF/GFF files for use as annotated splice junctions, the DRAGEN software isalso capable of reading in an SJ.out.tab file (see SJ.out.tab on page 89). The SJ.out.tab file enablesDRAGEN to run in a two-pass mode, where the splice junctions (denoted by the SJ.out.tab file)discovered in the first pass are used to guide the mapping and alignment of a second invocation ofDRAGEN. This mode of operation is useful when a gene annotations file is not readily available for thedata set.

Output FilesThe output files generated when running DRAGEN in RNA mode are similar to those generated in DNAmode. RNA mode also produces extra information related to spliced alignments. Details regarding thesplice junctions are present both in the SAM alignment record and an additional file, the SJ.out.tab file.

Alignments in BAM/SAMThe output BAM or SAM file meets the SAM specification and is compatible with downstream RNA-Seqanalysis tools.

RNA-Seq SAM TagsThe following SAM tags are emitted alongside spliced alignments.

u XS:A—The XS tag denotes the strand orientation of an intron. See Compatibility with Cufflinks onpage 87.

u jM:B—The jM tag lists the intron motifs for all junctions in the alignments. It has the followingdefinitions:u 0: non-canonicalu 1: GT/AGu 2: CT/ACu 3: GC/AGu 4: CT/GCu 5: AT/ACu 6: GT/AT

If a gene annotations file is used during the map/align stage, and the splice junction is detected asan annotated junction, then 20 is added to its motif value.

NH:i—A standard SAM tag indicating the number of reported alignments that contains the query in thecurrent record. This tag may be used for downstream tools such as featureCounts.

HI:i—A standard SAM tag denoting the query hit index, with its value indicating that this alignment is thei-th one stored in the SAM. Its value ranges from 1 … NH. This tag may be used for downstream toolssuch as featureCounts.

SJ.out.tabAlong with the alignments emitted in the SAM/BAM file, an additional SJ.out.tab file summarizes thehigh confidence splice junctions in a tab-delimited file. The columns for this file are as follows:

1 contig name

2 first base of the splice junction (1-based)

Document # 1000000101034 v00



3 last base of the splice junction (1-based)strand (0: undefined, 1: +, 2: -)

4 intron motif: 0: noncanonical, 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT

5 0: unannotated, 1: annotated, only if an input gene annotations file was used

6 number of uniquely mapping reads spanning the splice junction

7 number of multimapping reads spanning the splice junction

8 maximum spliced alignment overhang

The maximum spliced alignment overhang (column 9) field in the SJ.out.tab file is the anchoringalignment overhang. For example, if a read is spliced as ACGTACGT------------ACGT, then the overhangis 4. For the same splice junction, across all reads that span this junction, the maximum overhang isreported. The maximum overhang is a confidence indicator that the splice junction is correct based onanchoring alignments.

There are two SJ.out.tab files generated by the DRAGEN host software, an unfiltered version and afiltered version. The records in the unfiltered file are a consolidation of all spliced alignment records fromthe output SAM/BAM. However, the filtered version has a much higher confidence for being correct dueto the use of the following filters.

A splice junction entry in the SJ.out.tab file is filtered out if any of these conditions are met:

u SJ is a noncanonical motif and is only supported by < 3 unique mappings.

u SJ of length > 50000 and is only supported by < 2 unique mappings.



u SJ is a noncanonical motif and the maximum spliced alignment overhang is < 30.

u SJ is a canonical motif and the maximum spliced alignment overhang is < 12.

The filtered SJ.out.tab is recommended for use with any downstream analysis or post processing tools.Alternatively, you can use the unfiltered SJ.out.tab and apply your own filters (for example, with basicawk commands).

Note that the filter does not apply to the alignments present in the BAM or SAM file.

Chimeric.out.junction FileIf there are chimeric alignments present in the sample, then a supplementary Chimeric.out.junction fileis also output. This file contains information about split-reads that can be used to perform downstreamgene fusion detection. Each line contains one chimerically aligned read. The columns of the file are asfollows:

1 Chromosome of the donor.

2 First base of the intron of the donor (1-based).

3 Strand of the donor.

4 Chromosome of the acceptor.

5 First base of the intron of the acceptor (1-based).

6 Strand of the acceptor.

7 N/A—not used, but is present to be compatible with other tools. It will always be 1.

8 N/A—not used, but is present to be compatible with other tools. It will always be *.

Document # 1000000101034 v00



9 N/A—not used, but is present to be compatible with other tools. It will always be *.

10 Read name.

11 First base of the first segment, on the + strand.

12 CIGAR of the first segment.

13 First base of the second segment.

14 CIGAR of the second segment.

CIGARs in this file follow the standard CIGAR operations as found in the SAM specification, with theaddition of a gap length L that is encoded with the operation p. For paired end reads, the sequence ofthe second mate is always reverse complemented before determining strandedness.

The following is an example entry that shows two chimerically aligned read pairs, in which one of themates is split, mapping segments of chr19 to chr12. Also shown are the corresponding SAM recordsassociated with these entries.

chr19 580462 + chr12 120876182 + 1 * * R_15448 571532 49M8799N26M8p49M26S120876183 49H26Mchr19 580462 + chr12 120876182 + 1 * * R_15459 57155229M8799N46M8p29M46S 120876183 29H46M

R_15448:1 99 chr19 571531 60 49M8799N26M = 580413R_15448:2 147 chr19 580413 60 49M26S = 571531R_15448:2 2193 chr12 120876182 15 49H26M chr19 571531

R_15459:1 99 chr19 571551 60 29M8799N46M = 580433R_15459:2 147 chr19 580433 4 29M46S = 571551R_15459:2 2193 chr12 120876182 15 29H46M chr19 571551

Gene Fusion DetectionThe DRAGEN Gene Fusion module uses the DRAGEN RNA spliced aligner for detection of gene fusionevents. It performs a split-read analysis on the supplementary (chimeric) alignments to detect potentialbreakpoints. The putative fusion events then go through various filtering stages to mitigate potentialfalse positives. The detection parameters are highly configurable and allow for a trade-off in sensitivityand specificity. Furthermore, all potential candidates (unfiltered) are output, which you can use toperform your own filtering.

Running DRAGENGene FusionThe DRAGEN Gene Fusion module can be run in parallel with a regular RNA-Seq map/align job. Thisadditional module adds minimal processing to the overall run time, while providing additional informationto your RNA-Seq experiments.

To enable the DRAGEN Gene Fusion module, set the --enable-rna-gene-fusion option to true in yourcurrent RNA-Seq command-line scripts. The DRAGEN Gene Fusion module requires the use of a geneannotations file in the format of a GTF or GFF file.

The following is an example command line for running an end to end RNA-Seq experiment./opt/edico/bin/dragen \

" -r "$REF" \"" -1 "$FQ1" \"" -2 "$FQ2" \"

Document # 1000000101034 v00



" -a "$GTF" \"" --output-dir "$OUTPUT" \"" --output-file-prefix "$PREFIX" \"--enable-rna true \--enable-rna-gene-fusion true \

At the end of a run, a summary of detected gene fusion events is output, which is similar to thefollowing example.

==================================================================Loading gene annotations file================================================================== Input annotations file: ref_annot.gtf

Number of genes: 27459Number of transcripts: 196520Number of exons: 1196293

==================================================================Launching DRAGEN Gene Fusion Detection==================================================================annotation-file: ref_annot.gtfrna-gf-blast-pairs: blast_pairs.outfmt6rna-gf-exon-snap: 50rna-gf-min-anchor: 25rna-gf-min-neighbor-dist: 15rna-gf-max-partners: 3rna-gf-min-score-ratio: 0.15rna-gf-min-support: 2rna-gf-min-support-be: 10rna-gf-restrict-genes true

==================================================================Completed DRAGEN Gene Fusion Detection==================================================================Chimeric alignments: 107923Total fusion candidates: 38 (2116 before filters)

Time loading annotations: 00:00:08.543Time running gene fusion: 00:00:18.470Total runtime: 00:00:27.760***********************************************************DRAGEN finished normally

RunGene Fusion StandaloneThe DRAGEN Gene Fusion module can be run as a standalone utility, taking the *.Chimeric.out.junctionfile as input and the gene annotations file as a GTF/GFF file. Running the Gene Fusion modulestandalone is useful for trying out various configuration options at the gene fusion detection stage,without having to map and align the RNA-Seq data multiple times.

To execute the DRAGEN Gene Fusion module as a standalone utility, use the --rna-gf-input-file option tospecify the already generated *.Chimeric.out.junction file.

Document # 1000000101034 v00



The following is an example command line for running the gene fusion module as a standalone utility./opt/edico/bin/dragen \

" -a "$GTF" \"" --rna-gf-input-file "$INPUT_CHIMERIC" \"" --output-dir "$OUTPUT" \"" --output-file-prefix "$PREFIX" \"--enable-rna true \--enable-rna-gene-fusion true \

Gene Fusion CandidatesThe detected gene fusion events are listed in the fusion_candidate output files. Two files reside in theoutput directory, *.fusion_candidates.preliminary and *.fusion_candidates.final. The preliminary file is aprefiltered list of candidate events detected. The final file contains the candidate events with asufficiently high confidence after passing all filters (content of the file is described below). The filescontain the following five columns:

u Fusion gene

u A score

u The two breakpoints

u The supporting reads

The breakpoints are with respect to the split-read breakpoints as observed from the alignments. The + orin the breakpoint indicates which strand, with respect to the reference FASTA, that the gene istranscribed from. The read names are delimited by semicolons. The file is sorted by the score, which isan indicator of the number of reads supporting that fusion event.

#FusionGene Score LeftBreakpoint RightBreakpoint ReadNames

BSG--AL021546.6 108 chr19:580461:+ chr12:120876182:+ R_1;R_2;R_3; …

FIS1--PMEL 70 chr7:100884111:- chr12:56351868:- …

CBX3--G3BP2 57 chr7:26242642:+ chr4:76572341:- …

ELOVL5--SF3B14 46 chr6:53139888:- chr2:24291329:-

ATXN10--GORASP2 44 chr22:46134719:+ chr2:171804860:+

FADS3--MTOR 39 chr11:61646784:- chr1:11184690:-

AKR1B1--HMGB1 33 chr7:134135538:- chr13:31036849:-

Gene FusionOptions and FiltersVarious filters are implemented to help mitigate the number of false positive gene fusion candidates. Thefollowing thresholds and options are configurable. These filters are applied and all candidates that stillqualify are emitted in the *.fusion_candidates.final output file. For all prefiltered fusion candidates, seethe *.fusion_candidates.preliminary output file.

u --rna-gf-blast-pairs

A file listing gene pairs that have a high level of similarity. This list of gene pairs is used as ahomology filter to reduce false positives. One method to generate this file is to follow theinstructions as described in Fusion Filter.

Document # 1000000101034 v00



https://github.com/FusionFilter/FusionFilter/wiki/Building-a-Custom-FusionFilter-Dataset

u --rna-gf-restrict-genes

When parsing the gene annotations file (GTF/GFF) for use in DRAGEN Gene Fusion, this option canbe used to restrict the entries of interest to only protein-coding regions. Restricting the GTF to onlythe protein-coding genes reduces false positive rates in currently studied fusion events. The defaultvalue is true.

u --rna-gf-min-support

Specifies the minimum number of supporting reads to call a gene fusion event. A breakpoint isconsidered a broken exon if it is not precisely at an intron/exon boundary. A double broken exon iswhen both breakpoints do not lie on an intron/exon boundary. The default value is 2.

u --rna-gf-min-support-be

Specifies the minimum number of supporting reads to call a double broken exon gene fusion event.The default value is 10.

u --rna-gf-min-anchor

Specifies the minimum anchor length to call a gene fusion event. The anchor length is the number ofnucleotides adjacent to a breakpoint. This minimum value applies to both breakpoints for the fusionevent. The default value is 25.

u --rna-gf-min-neighbor-dist

Specifies the minimum distance to neighboring candidate fusion events for them to be consideredunique. If two candidate breakpoints are within this distance, then they are merged and their relativescores and attributes combined. The default value is 15.

u --rna-gf-min-score-ratio

Minimum score ratio for a fusion event, for the highest scoring event. For all isoforms of a genefusion pair, only accept it as a separate fusion event if it scores above this threshold. The dominantisoform is used as the maximum score. The default value is 0.15 (must score at least 15% of thehighest scoring isoform).

u --rna-gf-max-partners

Specifies the maximum number of gene partners a single gene can fuse with. Spurious gene fusionevents can be indicative of false positives if a single gene fuses with multiple partner genes. If thenumber of partners exceeds this value, then filter it out as a false positive. The default value is 3.

u --rna-gf-exon-snap

Specifies the breakpoint distance to an exon boundary for snapping. Any split-reads with breakpointsnear an exon boundary are automatically clustered together to support a fusion event at that exonboundary. If the split-read breakpoint distance to the boundary is greater than this threshold, then itis not pulled in.

Gene Expression QuantificationThe DRAGEN RNA pipeline contains a gene expression quantification module, that estimates theexpression of each transcript and gene in an RNA-seq dataset. First, it internally translates the genomicmapping of each read (read pair) to the corresponding transcript mappings. Then it uses an Expectation-Maximization (EM) algorithm to infer the transcript expression values that best match all the observedreads. The EM algorithm can also model GC-bias and correct for it in the reported quantification results.

Document # 1000000101034 v00



Running QuantificationTo enable the quantification module, set the --enable-rna-quantification option to true in your currentRNA-seq command-line scripts. In addition, quantification requires the use of a gene annotations file(GTF/GFF), which provides the genomic position of all transcripts to quantify. This is specified with the-a (or --annotation-file) option.

Currently, only paired-end RNA-seq libraries in standard (inward) orientation are supported.

Quantification OutputsTranscript quantification results are reported in the <outputPrefix>.quant.sf file. This is a text file, inwhich each line lists results for one transcript. For example:

Name Length EffectiveLength TPM NumReadsENST00000364415.1 116 12.3238 5.2328 1ENST00000564138.1 2775 2105.58 1.28293 41.8885

u Name lists the transcriptID of the transcript

u Length is the length of the (spliced) transcript in basepairs

u EffectiveLength is the length as accessible to RNA-seq, accounting for insert-size and edge effects

u TPM is Transcripts per Million, which represents the expression of the transcript, normalized fortranscript length and sequencing depth

u NumReads is the estimated number of reads from the transcript (not normalized).

This file can be used as input for differential gene expression using tools such as tximport and DESeq2.

Similarly, the <outputPrefix>.quant.genes.sf file contains quantification results at the gene level. Theseare produced by summing together all transcripts with the same geneID in the annotation (GTF). Lengthand EffectiveLength are the (expression-)weighted means of the individual transcripts in the gene.

Quantification Optionsu --enable-rna-quantification

If set to true, enables RNA quantification. Requires enable-rna to be true as well.

u --rna-quantification-library-type

Specifies the type of RNA-seq library. Only paired-end libraries in standard (inward) orientation arecurrently supported.u IU—Unstranded libraryu ISR—Stranded library in which read2 matches the transcript strand (eg, TruSeq RNA)u ISF—Stranded library in which read1 matches the transcript strand

u --rna-quantification-gc-bias

If set to true, enables GC bias correction. This estimates the effect of transcript %GC onsequencing coverage and accounts for it when estimating expression.

Document # 1000000101034 v00



Chapter 5 DRAGENMethylation ApplicationsThe epigenetic methylation of cytosine bases in DNA can have a dramatic effect on gene expression,and bisulfite sequencing is the gold standard for detecting epigenetic methylation patterns at single-baseresolution. This technique involves chemically treating DNA with sodium bisulfite, which convertsunmethylated cytosine bases to uracil, but does not alter methylated cytosines. Subsequent PCRamplification converts any uracils to thymines.

A bisulfite sequencing library can either be nondirectional or directional. For nondirectional, each double-stranded DNA fragment yields four distinct strands for sequencing, post-amplification, as shown in thefollowing figure:

Figure 10 Nondirectional Bisulfite Sequencing

u Bisulfite Watson (BSW), reverse complement of BSW (BSWR),u Bisulfite Crick (BSC), reverse complement of BSC (BSCR)

For directional libraries, the four strand types are generated, but adapters are attached to the DNAfragments such that only the BSW and BSC strands are sequenced (Lister protocol). Less commonly,the BSWR and BSCR strands are selected for sequencing (directional-complement protocol).

BSW and BSC strands:

u A, G, T: unchanged

u Methylated C remains C

u Unmethylated C converted to T

BSWR and BSCR strands:

u Bases complementary to original Watson/Crick A, G, T bases remain unchanged

u G complementary to original Watson/Crick methylated C remains G

u G complementary to original Watson/Crick unmethylated C becomes A

Standard DNA sequencing is used to produce sequencing reads. Reads containing more unconvertedC’s and G’s complementary to unconverted C’s are less drastically affected by the bisulfite treatment,and have a higher likelihood of mapping to the reference than reads with more bases affected. The

Document # 1000000101034 v00


standard protocol to minimize this mapping bias is to perform multiple alignments per read, wherespecific combinations of read and reference genome bases are converted in-silico prior to eachalignment run. Each alignment run has a set of constraints and base-conversions that corresponds toone of the bisulfite+PCR strand types expected from the protocol. By comparing the alignment resultsacross runs, you can determine the best alignment and most likely strand type for each read or readpair. This information is required for downstream methylation calling.

DRAGENMethylation CallingDifferent methylation protocols require the generation of two or four alignments per input read, followedby an analysis to choose a best alignment and determine which cytosines are methylated. DRAGENcan automate this process, generating a single output BAM file with Bismark-compatible tags (XR, XG,and XM) that can be used in downstream pipelines, such as Bismark’s methylation extraction scripts.

When the --methylation-protocol option is set to a valid value other than none, DRAGEN automaticallyproduces the required set of alignments, each with its appropriate conversions on the reads,conversions on the reference, and constraints on whether reads must be forward-aligned or reversecomplement (RC) aligned with the reference. When --enable-methylation-calling is set to true, DRAGENanalyzes the multiple alignments to produce a single methylation-tagged BAM file. When --enable-methylation-calling is set to false, DRAGEN outputs a separate BAM file per alignment run.

The following table describes these alignment runs:

Protocol BAM Reference Read 1 Read 2OrientationConstraint

Directional

1 C->T C->T G->A forward-only

2 G->A C->T G->A RC-only

Nondirectional, or directional-complement

1 C->T C->T G->A forward-only

2 G->A C->T G->A RC-only

3 C->T G->A C->T RC-only

4 G->A G->A C->T forward-only

In directional protocols, the library is prepared such that only the BSW and BSC strands aresequenced. Thus, alignment runs are performed with the two combinations of base conversions andorientation constraints that correspond to these strands (directional runs 1 & 2 above). However, withnondirectional protocols, reads from each of the four strands are equally likely, so alignment runs mustbe performed with two more combinations of base conversions and orientation constraints(nondirectional runs 3 & 4 above).

The directional-complement protocol resembles the directional protocol in purpose, except thatsequencing reads are only derived from the reverse-complement strands of the BSW and BSC strands.For the directional-complement protocol, very few good alignments are expected from runs 1 and 2, soDRAGEN is automatically tuned to a faster analysis mode for those runs.

For any protocol, you must run with a reference generated with --ht-methylated enabled, as described inPipeline Specific Hash Tables on page 109.

The following is an example dragen command line for the directional protocol:

Document # 1000000101034 v00



dragen --enable-methylation-calling=true \--methylation-protocol directional \--ref-dir /staging/ref/mm10/methylation --RGID RG1 --RGCN CN1 \--RGLB LIB1 --RGPL illumina --RGPU 1 --RGSM Samp1 \--intermediate-results-dir /staging/tmp \-1 /staging/reads/samp1_1.fastq.gz \-2 /staging/reads/samp1_2.fastq.gz \--output-directory /staging/outdir \--output-file-prefix samp1_directional_prot

Methylation-Related BAM TagsWhen --enable-methylation-calling is set to true, DRAGEN analyzes the alignments produced for theconfigured --methylation-protocol, and generates a single output BAM file that includes methylation-related tags for all mapped reads. As in Bismark, reads without a unique best alignment are excludedfrom the output BAM. The added tags are as follows:

Tag Brief Description Description

XR:Z Read conversion For the best alignment, which base-conversion was performed on the read:CT or GA.

XG:Z Reference conversion For the best alignment, which base-conversion was performed on thereference: CT or GA

XM:Z Methylation call A byte-per-base methylation string.

The XM:Z (methylation call) tag contains a byte corresponding to each base in the read’ssequence. There is a period (‘.’) for each position not involving a cytosine, and a letter at cytosinepositions. The letter indicates the context (CpG, CHG, CHH, or unknown), and the case indicatesmethylation, with upper-case positions being methylated and lower-case unmethylated. The letters usedat cytosine positions are as follows:

Character Methylated? Context

. not cytosine not cytosine

z No CpG

Z Yes CpG

X No CHG

X Yes CHG

h No CHH

H Yes CHH

u No Unknown

U Yes Unknown

Methylation Cytosine andM-Bias ReportsTo generate a genomewide cytosine methylation report, set --methylation-generate-cytosine-report totrue. The position and strand of each C in the genome are given in the first three fields of the report.A record with a '-' in the strand field is for a G in the reference FASTA. The counts of methylated and

Document # 1000000101034 v00



unmethylated C's covering the position are given in the fourth and fifth fields, respectively. The C-context in the reference (CG, CHG, or CHH) is given in the sixth field, and the trinucleotide sequencecontext is given in the last field (eg, CCC, CGT, CGA, etc.). The following is an example cytosinereport record:

chr2 24442367 + 18 0 CG CGC

To generate an M-bias report, set --methylation-generate-mbias-report to true. This report contains threetables for a single-ended data with one table for each C-context, and six tables for a paired-end data.Each table is a series of records, with one record per read base position. For example, the first recordfor the CHG table contains the counts of methylated C's (field 2) and unmethylated C's (field 3)that occur in the first read base position, restricting to those reads in which the first base is aligned to aCHG location in the genome. Each record of a table also includes the percent methylated C bases (field4) and the sum of methylated and unmethylated C counts (field 5).

The following is an example M-bias record for read base position 10:10 7335 2356 75.69 9691

For data sets with paired-end reads that overlap, both the Cytosine and M-bias reports skip reportingany C’s in the second read that overlaps the first read. In addition, 1-based coordinates are used forpositions in both reports.

To match the bismark_methylation_extractor cytosine and M-bias reports generated by Bismark version0.19.0, set the --methylation-match-bismark option to true. The ordering of records in Bismark andDRAGEN cytosine reports may differ. DRAGEN reports are sorted by genomic position.

Using Bismark for Methylation CallingThe recommended approach to methylation calling is to have DRAGEN automate the multiple requiredalignments, and add the XM, XR, XG tags as described previously. However, you can set --enable-methylation-calling to false to have DRAGEN generate a separate BAM file for each of the constraintsand conversions by the methylation protocol. You can then feed these BAM files into a third-party toolfor methylation calling. For example, you could modify Bismark to pipe reads from these BAM filesinstead of having Bismark internally run Bowtie or Bowtie2. Please contact Illumina Technical Supportfor assistance.

When you run in this mode, a single DRAGEN run produces multiple BAM files in the directory specifiedby --output-directory, each containing alignments in the same order as the input reads. It is not possibleto enable sorting or duplicate marking for these runs. Alignments include MD tags, and have “/1” or “/2”appended to the names of paired-end reads for Bismark-compatibility. These BAM files use the followingnaming conventions:

u Single-end reads—output-directory/output-file-prefix.{CT,GA}read{CT,GA}reference.bam

u Paired-end-reads—output-directory/output-file-prefix.{CT,GA}read1{CT,GA}read2{CT,GA}reference.bam,

Where output-directory and output-file-prefix are specified by the corresponding options, and CT and GAcorrespond to the base conversions listed in the table above.

Bismark does not have a directional-complement mode, but you can process such samples usingBismark’s nondirectional mode with the expectation that runs 1 and 2 produce very few goodalignments. For that reason, when running the nondirectional protocol, DRAGEN is automatically tunedto invest less work into the alignments on those runs.

Document # 1000000101034 v00



Chapter 6 Preparing a Reference GenomeBefore a reference genome can be used with DRAGEN, it must be converted from FASTA format into acustom binary format for use with the DRAGEN hardware. The options used in this preprocessing stepoffer tradeoffs between performance and mapping quality.

The DRAGEN system is shipped with reference genomes hg19 and GRCh37. Both reference genomesare preinstalled based on recommended settings for general applications. If you find that performanceand mapping quality are adequate, there is a good chance that you can simply work with these suppliedreference genomes. Depending on your read lengths and other particular aspects of your application, youmay be able to improve mapping quality and/or performance by tuning the reference preprocessingoptions.

Hash Table BackgroundThe DRAGEN mapper extracts many overlapping seeds (subsequences or K-mers) from each read, andlooks up those seeds in a hash table residing in memory on its PCIe card, to identify locations in thereference genome where the seeds match. Hash tables are ideal for extremely fast lookups of exactmatches. The DRAGEN hash table must be constructed from a chosen reference genome using thedragen --build-hash-table option, which extracts many overlapping seeds from the reference genome,populates them into records in the hash table, and saves the hash table as a binary file.

Reference Seed IntervalThe size of the DRAGEN hash table is proportionate to the number of seeds populated from thereference genome. The default is to populate a seed starting at every position in the reference genome,ie, roughly 3 billion seeds from a human genome. This default requires at least 32 GB of memory on theDRAGEN PCIe board.

To operate on larger, nonhuman genomes or to reduce hash table congestion, it is possible to populateless than all reference seeds using the --ht-ref-seed-interval option to specify an average referenceinterval. The default interval for 100% population is --ht-ref-seed-interval 1, and 50% population isspecified with --ht-ref-seed-interval 2. The population interval does not need to be an integer. Forexample, --ht-ref-seed-interval 1.2 indicates 83.3% population, with mostly 1-base and some 2-baseintervals to achieve a 1.2 base interval on average.

Hash Table OccupancyIt is characteristic of hash tables that they are allocated a certain size, but always retain some emptyrecords, so they are less than 100% occupied. A healthy amount of empty space is important for quickaccess to the DRAGEN hash table. Approximately 90% occupancy is a good upper bound. Empty spaceis important because records are pseudo-randomly placed in the hash table, resulting in an abnormallyhigh number of records in some places. These congested regions can get quite large as the percentageof empty space approaches zero, and queries by the DRAGEN mapper for some seeds can becomeincreasingly slow.

Hash Table / Seed LengthThe hash table is populated with reference seeds of a single common length. This primary seed length iscontrolled with the --ht-seed-len option, which defaults to 21.

The longest primary seed supported is 27 bases when the table is 8 GB to 31.5 GB in size. Generally,longer seeds are better for run time performance, and shorter seeds are better for mapping quality(success rate and accuracy). A longer seed is more likely to be unique in the reference genome,

Document # 1000000101034 v00


facilitating fast mapping without needing to check many alternative locations. But a longer seed is alsomore likely to overlap a deviation from the reference (variant or sequencing error), which preventssuccessful mapping by an exact match of that seed (although another seed from the read may stillmap), and there are fewer long seed positions available in each read.

Longer seeds are more appropriate for longer reads, because there are more seed positions available toavoid deviations.

Value for -ht-seed-len Read Length

21 100 bp to 150 bp

17 to 19 shorter reads (36 bp)

27 250+ bp

Table 10 Seed Length Recommendations

Hash Table / Seed ExtensionsDue to repetitive sequences, some seeds of any given length match many locations in the referencegenome. DRAGEN uses a unique mechanism called seed extension to successfully map such high-frequency seeds. When the software determines that a primary seed occurs at many referencelocations, it extends the seed by some number of bases at both ends, to some greater length that ismore unique in the reference.

For example, a 21-base primary seed may be extended by 7 bases at each end to a 35-base extendedseed. A 21-base primary seed may match 100 places in the reference. But 35-base extensions of these100 seed positions may divide into 40 groups of 1–3 identical 35-base seeds. Iterative seed extensionsare also supported, and are automatically generated when a large set of identical primary seeds containsvarious subsets that are best resolved by different extension lengths.

The maximum extended seed length, by default equal to the primary seed length plus 128, can becontrolled with the --ht-max-ext-seed-len option. For example, for short reads, it is advisable to set themaximum extended seed shorter than the read length, because extensions longer than the whole readcan never match.

It is also possible to tune how aggressively seeds are extended using the following options (advancedusage):

u --ht-cost-coeff-seed-len

u --ht-cost-coeff-seed-freq

u --ht-cost-penalty

u --ht-cost-penalty-incr

There is a tradeoff between extension length and hit frequency. Faster mapping can be achieved usinglonger seed extensions to reduce seed hit frequencies, or more accurate mapping can be achieved byavoiding seed extensions or keeping extensions short, while tolerating the higher hit frequencies thatresult. Shorter extensions can benefit mapping quality both by fitting seeds better between SNPs, andby finding more candidate mapping locations at which to score alignments. The default extensionsettings along with default seed frequency settings, lean aggressively toward mapping accuracy, withrelatively short seed extensions and high hit frequencies.

Document # 1000000101034 v00



The defaults for the seed frequency options are as follows:

Option Default

--ht-cost-coeff-seed-len 1

--ht-cost-coeff-seed-freq 0.5

--ht-cost-penalty 0

--ht-cost-penalty-incr 0.7

--ht-max-seed-freq 16

--ht-target-seed-freq 4

Seed Frequency Limit and TargetOne primary or extended seed can match multiple places in the reference genome. All such matches arepopulated into the hash table, and retrieved when the DRAGEN mapper looks up a corresponding seedextracted from a read. The multiple reference positions are then considered and compared to generatealigned mapper output. However, dragen enforces a limit on the number of matches, or frequency, ofeach seed, which is controlled with the --ht-max-seed-freq option. By default, the frequency limit is16. In practice, when the software encounters a seed with higher frequency, it extends it to a sufficientlylong secondary seed that the frequency of any particular extended seed pattern falls within the limit.However, if a maximum seed extension would still exceed the limit, the seed is rejected, and notpopulated into the hash table. Instead, dragen populates a single High Frequency record.

This seed frequency limit does not tend to impact DRAGEN mapping quality notably, for two reasons.First, because seeds are rejected only when extension fails, only extremely high-frequency primaryseeds, typically with many thousands of matches are rejected. Such seeds are not very useful formapping. Second, there are other seed positions to check in a given read. If another seed position isunique enough to return one or more matches, the read can still be properly mapped. However, if allseed positions were rejected as high frequency, often this means that the entire read matches similarlywell in many reference positions, so even if the read were mapped it would be an arbitrary choice, withvery low or zero MAPQ.

Thus, the default frequency limit of 16 for --ht-max-seed-freq works well. However, it may be decreasedor increased, up to a maximum of 256. A higher frequency limit tends to marginally increase the numberof reads mapped (especially for short reads), but commonly the additional mapped reads have very lowor zero MAPQ. This also tends to slow down DRAGEN mapping, because correspondingly largenumbers of possible mappings are occasionally considered.

In addition to a frequency limit, a target seed frequency can be specified with --ht-target-seed-freqoption. This target frequency is used when extensions are generated for high frequency primary seeds.Extension lengths are chosen with a preference toward extended seed frequencies near the target. Thedefault of 4 for --ht-target-seed-freq means that the software is biased toward generating shorter seedextensions than necessary to map seeds uniquely.

ALT-Aware Hash TablesTo enable ALT-aware mapping in DRAGEN, build GRch38 (and other references with ALT contigs) witha liftover file by using the --ht-alt-liftover option. The hash table builder classifies each referencesequence as primary or alternate based on the liftover file, and packs primaries before alternates inreference.bin. SAM liftover files for hg38DH and hg19 are in the /opt/edico/liftover folder. The --ht-alt-liftover option specifies the path to the liftover file to build an ALT-aware hash table.

Document # 1000000101034 v00



To override the liftover file requirement, set the --ht-alt-aware-validate option to false when building thehash tables and when running dragen.

The hash table builder detects when hg19 and hg38 references are missing decoy contigs andautomatically adds them to the hash table. The decoys file is found at /opt/edico/liftover/hs_decoys.fa.Set the --ht-suppress-decoys option to true to suppress adding these decoys to the hash table.

Custom Liftover FilesCustom liftover files can be used in place of those provided with DRAGEN. Liftover files must be SAMformat, but no SAM header is required. SEQ and QUAL fields can be omitted (‘*’). Each alignment recordshould have an alternate haplotype reference sequence name as QNAME, indicating the RNAME andPOS of its liftover alignment in a destination (normally primary assembly) reference sequence.

Reverse-complemented alignments are indicated by bit 0x10 in FLAG. Records flagged unmapped (0x4)or secondary (0x100) are ignored. The CIGAR may include hard or soft clipping, leaving parts of the ALTcontig unaligned.

A single reference sequence cannot serve as both an ALT contig (appearing in QNAME) and a liftoverdestination (appearing in RNAME). Multiple ALT contigs can align to the same primary assemblylocation. Multiple alignments can also be provided for a single ALT contig (extras optionally be flagged0x800 supplementary), such as to align one portion forward and another portion reverse-complemented.However, each base of the ALT contig only receives one liftover image, according to the first alignmentrecord with an M CIGAR operation covering that base.

SAM records with QNAME missing from the reference genome are ignored, so that the same liftover filemay be used for various reference subsets, but an error occurs if any alignment has its QNAME presentbut its RNAME absent.

Command LineOptionsUse the --build-hash-table option to transform a reference FASTA into the hash table for DRAGENmapping. It takes as input a FASTA file (multiple reference sequences being concatenated) and apreexisting output directory and generates the following set of files:

reference.bin The reference sequences, encoded in 4 bits per base. Four-bit codes are used, so thesize in bytes is roughly half the reference genome size. In between referencesequences, N are trimmed and padding is automatically inserted. For example, hg19 has3,137,161,264 bases in 93 sequences. This is encoded in 1,526,285,312 bytes = 1.46GB, where 1 GB means 1 GiB or 230 bytes.

hash_table.cmp Compressed hash table. The hash table is decompressed and used by the DRAGENmapper to look up primary seeds with length specified by the --ht-seed-len option andextended seeds of various lengths.

hash_table.cfg A list of parameters and attributes for the generated hash table, in a text format. This fileprovides key information about the reference genome and hash table.

hash_table.cfg.bin A binary version of hash_table.cfg used to configure the DRAGEN hardware.

hash_table_stats.txt A text file listing extensive internal statistics on the constructed hash including the hashtable occupancy percentages. This table is for information purposes. It is not used byother tools.

Build command usage is as follows:dragen --build-hash-table true [options] --ht-reference <reference.fasta>

--output-directory <outdir>

Document # 1000000101034 v00



The sections that follow provide information on the options for building a hash table.

Input/Output OptionsThe --ht-reference and --output-directory options are required for building a hash table. The --ht-referenceoption specifies the path to the reference FASTA file, while --output-directory specifies a preexistingdirectory where the hash table output files are written. Illumina recommends organizing various hashtable builds into different folders. As a best practice, folder names should include any nondefaultparameter settings used to generate the contained hash table.

Primary Seed LengthThe --ht-seed-len option specifies the initial length in nucleotides of seeds from the reference genome topopulate into the hash table. At run time, the mapper extracts seeds of this same length from each read,and looks for exact matches (unless seed editing is enabled) in the hash table.

The maximum primary seed length is a function of hash table size. The limit is k=27 for table sizes from16 GB to 64 GB, covering typical sizes for whole human genome, or k=26 for sizes from 4 GB to 16 GB.

The minimum primary seed length depends mainly on the reference genome size and complexity. Itneeds to be long enough to resolve most reference positions uniquely. For whole human genomereferences, hash table construction typically fails with k < 16. The lower bound may be smaller forshorter genomes, or higher for less complex (more repetitive) genomes. The uniqueness threshold of --ht-seed-len 16 for the 3.1Gbp human genome can be understood intuitively because log4(3.1 G) ≈ 16, soit requires at least 16 choices from 4 nucleotides to distinguish 3.1 G reference positions.

Accuracy ConsiderationsFor read mapping to succeed, at least one primary seed must match exactly (or with a single SNP whenedited seeds are used). Shorter seeds are more likely to map successfully to the reference, becausethey are less likely to overlap variants or sequencing errors, and because more of them fit in eachread. So for mapping accuracy, shorter seeds are mainly better.

However, very short seeds can sometimes reduce mapping accuracy. Very short seeds often map tomultiple reference positions, and lead the mapper to consider more false mapping locations. Due toimperfect modeling of mutations and errors by Smith-Waterman alignment scoring and other heuristics,occasionally these noise matches may be reported. Run time quality filters such as --Aligner.alignn_min_score can control the accuracy issues with very short seeds.

Speed ConsiderationsShorter seeds tend to slow down mapping, because they map to more reference locations, resulting inmore work such as Smith-Waterman alignments to determine the best result. This effect is mostpronounced when primary seed length approaches the reference genome’s uniqueness threshold, eg,K=16 for whole human genome.

Application Considerationsu Read Length—Generally, shorter seeds are appropriate for shorter reads, and longer seeds for

longer reads. Within a short read, a few mismatch positions (variants or sequencing errors) can chopthe read into only short segments matching the reference, so that only a short seed can fit betweenthe differences and match the reference exactly. For example, in a 36 bp read, just one SNP in themiddle can block seeds longer than 18 bp from matching the reference. By contrast, in a 250 bpread, it takes 15 SNPs to exceed a 0.01% chance of blocking even 27 bp seeds.

Document # 1000000101034 v00



u Paired Ends—The use of paired end reads can make longer seeds yield good mappingaccuracy. DRAGEN uses paired end information to improve mapping accuracy, including withrescue scans that search the expected reference window when only one mate has seeds mapping toa given reference region. Thus, paired end reads have essentially twice the opportunity for an exactmatching seed to find their correct alignments.

u Variant or Error Rate—When read differences from the reference are more frequent, shorter seedsmay be required to fit between the difference positions in a given read and match the referenceexactly.

u Mapping Percentage Requirement—If the application requires a high percentage of reads to bemapped somewhere (even at low MAPQ), short seeds may be helpful. Some reads that do not matchthe reference well anywhere are more likely to map using short seeds to find partial matches to thereference.

Maximum Seed LengthThe --ht-max-ext-seed-len option limits the length of extended seeds populated into the hash table.Primary seeds (length specified by --ht-seed-len) that match many reference positions can be extendedto achieve more unique matching, which may be required to map seeds within the maximum hitfrequency (--ht-max-seed-freq).

Given a primary seed length k, the maximum seed length can be configured between k and k+128. Thedefault is the upper bound, k+128.

When to Limit Seed ExtensionThe --ht-max-ext-seed-len option is recommended for short reads, eg, less than 50 bp. In such cases, itis helpful to limit seed extension to the read length minus a small margin, such as 1–4 bp. For example,with 36 bp reads, setting --ht-max-ext-seed-len to 35 might be appropriate. This ensures that the hashtable builder does not plan a seed extension longer than the read causing seed extension and mappingto fail at run time, for seeds that could have fit within the read with shorter extensions.

While seed extension can be similarly limited for longer reads, eg, setting --ht-max-ext-seed-len to 99 for100 bp reads, there is little utility in this because seeds are extended conservatively in any event. Evenwith the default k+128 limit, individual seeds are only extended to the lengths required to fit under themaximum hit frequency (--ht-max-seed-freq), and at most a few bases longer to approach the target hitfrequency (--ht-target-seed-freq), or to avoid taking too many incremental extension steps.

Maximum Hit FrequencyThe --ht-max-seed-freq option sets a firm limit on the number of seed hits (reference genome locations)that can be populated for any primary or extended seed. If a given primary seed maps to more referencepositions than this limit, it must be extended long enough that the extended seeds subdivide into smallergroups of identical seeds under the limit. If, even at the maximum extended seed length (--ht-max-ext-seed-len), a group of identical reference seeds is larger than this limit, their reference positions are notpopulated into the hash table. Instead, dragen populates a single High Frequency record.

The maximum hit frequency can be configured from 1 to 256. However, if this value is too low, hashtable construction can fail because too many seed extensions are needed. The practical minimum for awhole human genome reference, other options being default, is 8.

Document # 1000000101034 v00



Accuracy ConsiderationsGenerally, a higher maximum hit frequency leads to more successful mapping. There are two reasons forthis. First, a higher limit rejects fewer reference positions that cannot map under it. Second, a higherlimit allows seed extensions to be shorter, improving the odds of exact seed matching withoutoverlapping variants or sequencing errors.

However, as with very short seeds, allowing high hit counts can sometimes hurt mappingaccuracy. Most of the seed hits in a large group are not to the true mapping location, and occasionallyone of these noise hits may be reported due to imperfect scoring models. Also, the mapper limits thetotal number of reference positions it considers, and allowing very high hit counts can potentially crowdout the actual best match from consideration.

Speed ConsiderationsHigher maximum hit frequencies slow down read mapping, because seed mapping finds more referencelocations, resulting in more work, such as Smith-Waterman alignments, to determine the best result.

ALT-Aware Liftover File OptionsThe following options control See ALT-Aware Hash Tables on page 102 for more information on buildinga custom liftover file.

u --ht-alt-liftover

The --ht-alt-liftover option specifies the path to the liftover file to build an ALT-aware hash table. Thisoption is required when building from a reference with ALT contigs. SAM liftover files for hg38DHand hg19 are provided in /opt/edico/liftover.

u --ht-alt-aware-validate

When building hash tables from a reference that contains ALT-contigs, building with a liftover file isrequired. To disable this requirement, set the --ht-alt-aware-validate option to false.

u --ht-decoys

The DRAGEN software automatically detects the use of hg19 and hg38 references and adds decoysto the hash table when they are not found in the FASTA file. Use the --ht-decoys option to specifythe path to a decoys file. The default is /opt/edico/liftover/hs_decoys.fa.

u --ht-suppress-decoys

Use the --ht-suppress-decoys option to suppress the use of the decoys file when building the hashtable.

DRAGEN Software Optionsu --ht-num-threads

The --ht-num-threads option determines the maximum number of worker CPU threads that are usedto speed up hash table construction. The default for this option is 8, with a maximum of 32 threadsallowed.

If your server supports execution of more threads, it is recommended that you use the maximum.For example, the DRAGEN servers contain 24 cores that have hyperthreading enabled, so a value of32 should be used. When using a higher value, adjust --ht-max-table-chunks needs to be adjusted aswell. The servers have 128 GB of memory available.

Document # 1000000101034 v00



u --ht-max-table-chunks

The --ht-max-table-chunks option controls the memory footprint during hash table construction bylimiting the number of ~1 GB hash table chunks that reside in memory simultaneously. Eachadditional chunk consumes roughly twice its size (~2 GB) in system memory during construction.

The hash table is divided into power-of-two independent chunks, of a fixed chunk size, X, whichdepends on the hash table size, in the range 0.5 GB < X ≤ 1 GB. For example, a 24 GB hash tablecontains 32 independent 0.75 GB chunks that can be constructed by parallel threads with enoughmemory and a 16 GB hash table contains 16 independent 1 GB chunks.

The default is --ht-max-table-chunks equal to --ht-num-threads, but with a minimum default --ht-max-table-chunks of 8. It makes sense to have these two options match, because building one hashtable chunk requires one chunk space in memory and one thread to work on it. Nevertheless, thereare build-speed advantages to raising --ht-max-table-chunks higher than --ht-num-threads, or toraising --ht-num-threads higher than --ht-max-table-chunks.

SizeOptionsu --ht-mem-limit—Memory Limit

The --ht-mem-limit option controls the generated hash table size by specifying the DRAGEN boardmemory available for both the hash table and the encoded reference genome. The --ht-mem-limitoption defaults to 32 GB when the reference genome approaches WHG size, or to a generous sizefor smaller references. Normally there is little reason to override these defaults.

u --ht-size—Hash Table Size

This option specifies the hash table size to generate, rather than calculating an appropriate tablesize from the reference genome size and the available memory (option --ht-mem-limit). Using defaulttable sizing is recommended and using --ht-mem-limit is the next best choice.

Seed Population Optionsu --ht-ref-seed-interval—Seed Interval

The --ht-ref-seed-interval option defines the step size between positions of seeds in the referencegenome populated into the hash table. An interval of 1 (default) means that every seed position ispopulated, 2 means 50% of positions are populated, etc. Noninteger values are supported, eg, 2.5yields 40% populated.

Seeds from a whole human reference are easily 100% populated with 32 GB memory on DRAGENboards. If a substantially larger reference genome is used, change this option.

u --ht-soft-seed-freq-cap and --ht-max-dec-factor—Soft Frequency Cap and Maximum DecimationFactor for Seed Thinning

Seed thinning is an experimental technique to improve mapping performance in high-frequencyregions. When primary seeds have higher frequency than the cap indicated by the --ht-soft-seed-freq-cap option, only a fraction of seed positions are populated to stay under the cap. The --ht-max-dec-factor option specifies a maximum factor by which seeds can be thinned. For example, --ht-max-dec-factor 3 retains at least 1/3 of the original seeds. --ht-max-dec-factor 1 disables any thinning.

Seeds are decimated in careful patterns to prevent leaving any long gaps unpopulated. The idea isthat seed thinning can achieve mapped seed coverage in high frequency reference regions where themaximum hit frequency would otherwise have been exceeded. Seed thinning can also keep seedextensions shorter, which is also good for successful mapping. Based on testing to date, seedthinning has not proven to be superior to other accuracy optimization methods.

Document # 1000000101034 v00



u --ht-rand-hit-hifreq and --ht-rand-hit-extend—Random Sample Hit with HIFREQ Record andEXTEND Record

Whenever a HIFREQ or EXTEND record is populated into the hash table, it stands in place of alarge set of reference hits for a certain seed. Optionally, the hash table builder can choose a randomrepresentative of that set, and populate that HIT record alongside the HIFREQ or EXTEND record.

Random sample hits provide alternative alignments that are very useful in estimating MAPQaccurately for the alignments that are reported. They are never used outside of this context forreporting alignment positions, because that would result in biased coverage of locations thathappened to be selected during hash table construction.

To include a sample hit, set --ht-rand-hit-hifreqto 1. The --ht-rand-hit-extend option is a minimum pre-extension hit count to include a sample hit, or zero to disable. Modifying these options is notrecommended.

Seed Extension ControlDRAGEN seed extension is dynamic, applied as needed for particular K-mers that map to too manyreference locations. Seeds are incrementally extended in steps of 2–14 bases (always even) from aprimary seed length to a fully extended length. The bases are appended symmetrically in each extensionstep, determining the next extension increment if any.

There is a potentially complex seed extension tree associated with each high frequency primary seed.Each full tree is generated during hash table construction and a path from the root is traced by iterativeextension steps during seed mapping. The hash table builder employs a dynamic programming algorithmto search the space of all possible seed extension trees for an optimal one, using a cost function thatbalances mapping accuracy and speed. The following options define that cost function:

u --ht-target-seed-freq—Target Hit Frequency

The --ht-target-seed-freq option defines the ideal number of hits per seed for which seed extensionshould aim. Higher values lead to fewer and shorter final seed extensions, because shorter seedstend to match more reference positions.

u --ht-cost-coeff-seed-len—Cost Coefficient for Seed Length

The --ht-cost-coeff-seed-len option assigns the cost component for each base by which a seed isextended. Additional bases are considered a cost because longer seeds risk overlapping variants orsequencing errors and losing their correct mappings. Higher values lead to shorter final seedextensions.

u --ht-cost-coeff-seed-freq—Cost Coefficient for Hit Frequency

The --ht-cost-coeff-seed-freq option assigns the cost component for the difference between thetarget hit frequency and the number of hits populated for a single seed. Higher values result primarilyin high-frequency seeds being extended further to bring their frequencies down toward the target.

u --ht-cost-penalty—Cost Penalty for Seed Extension

The --ht-cost-penalty option assigns a flat cost for extending beyond the primary seed length. Ahigher value results in fewer seeds being extended at all. Current testing shows that zero (0) isappropriate for this parameter.

u --ht-cost-penalty-incr—Cost Increment for Extension Step

The --ht-cost-penalty-incr option assigns a recurring cost for each incremental seed extension steptaken from primary to final extended seed length. More steps are considered a higher cost becauseextending in many small steps requires more hash table space for intermediate EXTEND records,and takes substantially more run time to execute the extensions. A higher value results in seed

Document # 1000000101034 v00



extension trees with fewer nodes, reaching from the root primary seed length to leaf extended seedlengths in fewer, larger steps.

Pipeline Specific Hash TablesWhen building a hash table, DRAGEN configures the options to work for DNA-seq processing bydefault. To run RNA-Seq data, you must build an RNA-Seq hash table using the --ht-build-rna-hashtabletrue option. For an RNA-Seq alignment run, refer to the original --output-directory, not to theautomatically generated subdirectory.

The CNV pipeline requires that the hash table be built with --enable-cnv set to true, which generates anadditional kmer hashmap that is used in the CNV algorithm. Illumina recommends that that you alwaysuse the --enable-cnv option, in case you wish to perform CNV calling with the same hash table that isused for mapping and aligning.

DRAGEN methylation runs require building a special pair of hash tables with reference bases convertedfrom C->T for one table, and G->A for the other. When running the hash table generation with the --ht-methylated option, these conversions are done automatically, and the converted hash tables aregenerated in a pair of subdirectories of the target directory specified with --output-directory. Thesubdirectories are named CT_converted and GA_converted, corresponding to the automatic baseconversions. When using these hash tables for methylated alignment runs, refer to the original --output-directory and not to either of the automatically generated subdirectories.

These base conversions remove a significant amount of information from the hashtables, so you mayfind it necessary to tune the hash table parameters differently than you would in a conventional hashtable build. The following options are recommended for building hash tables for mammalian species:

dragen --build-hash-table=true --output-directory $REFDIR \--ht-reference $FASTA --ht-max-seed-freq 16 \--ht-seed-len 27 --ht-num-threads 40 --ht-methylated=true

Document # 1000000101034 v00



Chapter 7 Tools and UtilitiesIllumina BCLData ConversionBCL format is the native output format of Illumina sequencing systems, and consists of a directorycontaining many data and metadata files. The data files are organized according to the sequencer’s flowcell layout, so converting this data to sample-separated FASTQ files is an intensive operation that canbe time-consuming.

DRAGEN provides a custom implementation of the conversion software that is faster. To run thisconversion, use the --bcl-input-directory <BCL_ROOT>, --output-directory <DIR>, and --bcl-conversion-only true options.

The DRAGENconversions implementation is designed to output FASTQ files that match Illumina'sbcl2fastq2 2.19 output for HiSeq, MiSeq, HiSeq X, NextSeq, iSeq, and NovaSeq sequencers. DRAGENsupports most of the features of bcl2fastq2, including demultiplexing samples by barcode with optionalmismatch tolerance, adapter sequence masking or trimming with adjustable matching stringency, andUMI sequence tagging and trimming. Sample sheet settings not supported by DRAGEN includeFindAdapterWithIndels, CreateFastqForIndexReads, and ReverseComplement.

Command LineOptionsThe required options for BCL conversion in DRAGEN are shown in the following example command:

dragen --bcl-conversion-only --bcl-input-dir <…> --bcl-output-dir <…>

The following additional options can be specified on the command line:

u --sample-sheet—Specifies the path to SampleSheet.csv file. Optional if the SampleSheet.csv file isin the --bcl-input-directory directory

u --strict-mode—If set to true, dragen aborts if any files are missing. The default is false.

u --first-tile-only—If set to true, dragen only converts the first tile of input (for testing and debugging).The default is false.

The input BCL root directory and the output directory must be specified. The specified input path is notthe BaseCalls directory but three levels higher, and should contain the following files and directories,among others:

Config\

Data\

Logs\

runParameters.xml

RunInfo.xml

The directory where output FASTQ files are to be stored is specified by the --output-dir option.

Sample Sheet OptionsIn addition to the command line options, which control the behavior of bcl conversion, the sample sheetaccepts settings in the [Settings] section of the configuration file specifying how the samples areprocessed. The sample sheet settings for bcl conversion are as follows:

Document # 1000000101034 v00


Option Default Value Description

AdapterBehavior trim trim, mask Whether adaptershould betrimmed ormasked.

AdapterRead1 None Read 1 adapter sequencecontaining A, C, G, or T

The sequence totrim or mask fromthe end of read 1.

AdapterRead2 None Read 2 adapter sequencecontaining A, C, G, or T

The sequence totrim or mask fromthe end of read 2.

AdapterStringency 0.9 Float between 0.5 and 1.0 The stringencyfor matching theread to theadapter using thesliding windowalgorithm.

BarcodeMismatchesIndex1 1 0, 1, or 2 The number ofallowedmismatchesbetween the firstindex read andindex sequence.

BarcodeMismatchesIndex2 1 0, 1, or 2 The number ofallowedmismatchesbetween thesecond indexread and indexsequence.

MinimumTrimmedReadLength The minimum of 35 and theshortest non-indexed readlength.

0 to the shortest non-indexedread length

Reads trimmedbelow this pointbecome maskedat that point.

MaskShortReads The minimum of 22 andMinimumTrimmedReadLength.

0 toMinimumTrimmedReadLength

Reads trimmedbelow this pointbecomecompletelymasked out.

OverrideCycles None Y: Specifies a sequencingreadI: Specifies an indexing readU: Specifies a UMI length tobe trimmed from read

String used tospecify UMIcycles and maskout cycles of aread.

The OverrideCycles mask elements are semi-colon separated. For example:OverrideCycles,N1Y150;I8;I7N1;Y151

LimitationsThe BCL conversion in DRAGEN has been tested on, and is supported for, output from HiSeq 2000 and2500, MiSeq, HiSeq X Ten, NextSeq, iSeq, and NovaSeq sequencer models with single-end and paired-end adapters. Related models may also operate correctly.

Document # 1000000101034 v00



Monitoring System HealthWhen you power up your DRAGEN system a daemon (dragen_mond) is started that monitors the cardfor hardware issues. This daemon is also started when your DRAGEN system is installed or updated.The main purpose of this daemon is to monitor DRAGEN Bio-IT Processor temperature and abortDRAGEN when the temperature exceeds a configured threshold.

To manually start, stop, or restart the monitor, run the following as root:sudo service dragen_mond [stop|start|restart]

By default, the monitor polls for hardware issues once per minute and logs temperature once every hour.

The /etc/sysconfig/dragen_mond file specifies the command line options used to start dragen_mondwhen the service command is run. Edit DRAGEN_MOND_OPTS in this file to change the defaultoptions. For example, the following changes the poll time to 30 seconds and the log time to once every2 hours:

DRAGEN_MOND_OPTS="-d -p 30 -l 7200"

The -d option is required to run the monitor as a daemon.

The dragen_mond command line options are as follows:

Option Description

-m --swmaxtemp<n>

Maximum software alarm temperature (Celsius). Default is 85.

-i --swmintemp<n>

Minimum software alarm temperature (Celsius). Default is 75.

-H --hwmaxtemp<n>

Maximum hardware alarm temperature (Celsius). Default is 100.

-p --polltime <n> Time between polling chip status register (seconds). Default is 60.

-l --logtime <n> Log FPGA temp every n seconds. Default is 3600. Must be amultipleof polltime

-d --daemon Detach and run as a daemon.

-h --help Print help and exit.

-V --version Print the version and exit.

To display the current temperature of the DRAGEN Bio-IT Processor, use the dragen_info -tcommand. This command does not execute if dragen_mond is not running.

% dragen_info -tFPGA Temperature: 42C (Max Temp: 49C, Min Temp: 39C)

LoggingAll hardware events are logged to /var/log/messages and /var/log/dragen_mond.log. The following showsan example in /var/log/messages of a temperature alarm:

Jul 16 12:02:34 komodo dragen_mond[26956]: WARNING: FPGA software over temperature alarm has been

triggered -- temp threshold: 85 (Chip status: 0x80000001)

Jul 16 12:02:34 komodo dragen_mond[26956]: Current FPGA temp: 86, Max temp: 88, Min temp: 48

Jul 16 12:02:34 komodo dragen_mond[26956]: All dragen processes will be stopped until alarm

Document # 1000000101034 v00



clears

Jul 16 12:02:34 komodo dragen_mond[26956]: Terminating dragen in process 1510 with SIGUSR2

signal

By default, temperature is logged to /var/log/dragen_mond.log every hour:Aug 01 09:16:50 Setting FPGA hardware max temperature threshold to 100

Aug 01 09:16:50 Setting FPGA software max temperature threshold to 85

Aug 01 09:16:50 Setting FPGA software min temperature threshold to 75

Aug 01 09:16:50 FPGA temperatures will be logged every 3600 seconds

Aug 01 09:16:50 Current FPGA temperature is 52 (Max temp = 52, Min temp = 52)



If DRAGEN is executing when a thermal alarm is detected, the following is displayed in the terminalwindow of the DRAGEN process:

************************************************************ Received external signal -- aborting dragen. **** An issue has been detected with the dragen card. **** Check /var/log/messages for details. **** **** It may take up to a minute to complete shutdown. ************************************************************

If you see this message, stop running the DRAGEN software. Do the following to alleviate theoverheating condition on the card:

u Be sure that there is ample air flow over the card. Consider moving the card to a slot where there ismore air flow, adding another fan or increasing the fan speed.

u Give the card more space in the box. If there are available PCIe slots, move the card so that it hasempty slots on either side.

Contact Illumina Technical Support if you are having trouble resolving the thermal alarm on your system.

Hardware AlarmsThe following table lists the hardware events logged by the monitor when an alarm is triggered:

ID Description Monitor Action

0 Softwareoverheating

Terminate usage until DRAGEN Bio-IT Processor cools to software minimum temperature.

1 Hardwareoverheating

Fatal. Aborts dragen software; system reboot required

2 Board SPDoverheating

Logged as nonfatal

3 SODIMMoverheating

Logged as nonfatal

4 Power 0 Fatal. Aborts dragen software; system reboot required

5 Power 1 Fatal. Aborts dragen software; system reboot required

Document # 1000000101034 v00



ID Description Monitor Action

6 DRAGEN Bio-IT Processorpower

Logged as nonfatal

7 Fan 0 Logged as nonfatal

8 Fan 1 Logged as nonfatal

9 SE5338 Fatal. Aborts dragen software; system reboot required

10–30

Undefined(Reserved)

Fatal. Aborts dragen software; system reboot required

Fatal alarms prevent the DRAGEN host software from running and require a system reboot. When asoftware overheating alarm is triggered, the monitor looks for and aborts any running DRAGENprocesses. The monitor continues to abort any new DRAGEN processes until the temperaturedecreases to the minimum threshold and the hardware clears the chip status alarm. When the softwareoverheating alarm clears, DRAGEN jobs can resume executing.

Contact Illumina Technical Support with details from the log files if any of these alarms are triggered onyour system.

Hardware-Accelerated Compression and DecompressionGzip compression is ubiquitous in bioinformatics. FASTQ files are often gzipped, and the BAM formatitself is a specialized version of gzip. For that reason, the DRAGEN BioIT processor provides hardwaresupport for accelerating compression and decompression of gzipped data. If your input files are gzipped,DRAGEN detects that and decompresses the files automatically. If your output is BAM files, then thefiles are automatically compressed.

DRAGEN provides standalone command-line utilities to enable you to compress or decompress arbitraryfiles. These utilities are analogous to the Linux gzip and gunzip commands, but are named dzip anddunzip (dragen zip and dragen unzip). Both utilities are able to accept as input a single file, and producea single output file with the .gz file extension removed or added, as appropriate. For example:

dzip file1 # produces output file file1.gzdunzip file2.gz # produces output file file2

Currently, dzip and dunzip have the following limitations and differences from gzip/gunzip:

u Each invocation of these tools can handle only a single file. Additional file names (including thoseproduced by a wildcard * character) are ignored.

u They cannot be run at the same time as the DRAGEN host software.

u They do not support the command line options found in gzip and gunzip (eg, recursive, --fast, --best,--stdout).

Usage ReportingAs part of the install process, a daemon (dragen_licd) is created (or stopped then restarted). Thisbackground process self-activates at the end of each day to upload DRAGEN host software usage to anIllumina server. Information includes date, duration, size (number of bases), status of each run, andsoftware version used.

Document # 1000000101034 v00



Communication to the Illumina server is secured by encryption. If there is a communication error, thedaemon retries until the next morning. If the upload continues to fail, communication is tried again thefollowing night until communication succeeds. This means that during working hours, the systemresources remain fully available and are not in any way hampered by this background activity.

To check the current license usage, use the dragen_lic command.

Document # 1000000101034 v00



Chapter 8 TroubleshootingIf the DRAGEN system does not seem to be responding, do the following:

1 To determine if the DRAGEN system is hanging, follow the instructions in How to Identify if theSystem is Hanging on page 116.

2 Collect diagnostic information after a hang, or a crash, as described in Sending DiagnosticInformation to Illumina Support on page 116.

3 After all information has been collected, reset your system. if needed, as described in ResettingYour System after a Crash or Hang on page 116.

How to Identify if the System is HangingThe DRAGEN system has a watchdog to monitor the system for hangs. If a run seems to be takinglonger than it should, the watchdog may not be detecting the hang. Here are some things to try:

u Run the top command to find the active DRAGEN process. If your run is healthy, you should expectto see it consuming over 100% of the CPU. If it is consuming 100% or less, then your system maybe hanging.

u Run the du -s command in the directory of the output BAM/SAM file. During a normal run, thisdirectory should be growing with either intermediate output data (when sort is enabled) or BAM/SAMdata.

Sending Diagnostic Information to Illumina SupportIllumina would like your feedback on your DRAGEN system, including any reports of systemmalfunction. In the event of a crash, hang, or watchdog fault, run the sosreport command to collectdiagnostic and configuration information, as follows:

sudo sosreport --batch

This command takes several minutes to execute and reports the location where it has saved thediagnostic information in /tmp. Please include the report when you submit a support ticket for IlluminaTechnical Support.

Resetting Your System after a Crash or HangIf the DRAGEN system crashes or hangs, the dragen_reset utility must be run to reinitialize thehardware and software. This utility is automatically executed by the host software any time it detects anunexpected condition. In this case, the host software shows the following message:

Running dragen_reset to reset DRAGEN Bio-IT processor and software

If the software is hanging, please collect diagnostic information as described in subsection SendingDiagnostic Information to Illumina Support on page 116 and then execute dragen_reset manually, asfollows:

/opt/edico/bin/dragen_reset

Any execution of dragen_reset requires the reference genome to be reloaded to the DRAGEN board. Thehost software automatically reloads the reference on the next execution.

Document # 1000000101034 v00


Appendix A Command Line OptionsHost Software OptionsThe following options are in the default section of the configuration file. The default section does nothave a section name (eg, [Aligner]) associated with it. The default section is at the top of theconfiguration file. Note that some mandatory fields must be specified on the command line and are notpresent in configuration files.

Name Description Command Line Equivalent Range

alt-aware Enables special processing for alt contigs,if alt liftover was used in hashtable. Enabled by default if reference wasbuilt with liftover.

--alt-aware true/false

annotation-file Transcript annotation file (RNA). --annotation-file, -a

append-read-index-to-name By default, DRAGEN names both mateends of pairs the same. When set to true,DRAGEN appends /1 and /2 to the twoends.

true/false

bam-input Aligned BAM file for input to the DRAGENvariant caller.

-b, --bam-input

bcl-conversion-only Perform Illumina BCL conversion toFASTQ format

--bcl-conversion-only

bcl-input-directory Input BCL directory for BCL conversion. --bcl-input-directory

sample-sheet For BCL input, path to SampleSheet.csvfile. Default location is the BCL rootdirectory.

--sample-sheet

build-hash-table Generate a reference/hash table. --build-hash-table true/false

cram-input CRAM file for input to theDRAGEN variant caller.

--cram-input

dbsnp Path to the variant annotation databaseVCF (or .vcf.gz) file.

--dbsnp

enable-auto-multifile Import subsequent segments of the *_001.{dbam,fastq} files.

--enable-auto-multifile true/false

enable-bam-indexing Enable generation of a BAI index file. --enable-bam-indexing true/false

enable-cnv Enable copy number variant (CNV). --enable-cnv true/false

enable-duplicate-marking Enable the flagging of duplicate outputalignment records.

--enable-duplicate-marking true/false

enable-map-align-output Enables saving the output from themap/align stage. Default is true when onlyrunning map/align. Default is false ifrunning the variant caller.

--enable-map-align-output true/false

enable-methylation-calling Whether to automatically add methylation-tags and output a single BAM formethylation protocols.

true/false

enable-rna Enable processing of RNS-seq data. --enable-rna true/false

enable-sampling Automatically detect paired-endparameters by running a sample throughthe mapper/aligner.

true/false

Document # 1000000101034 v00



enable-sort Enable sorting after mapping/alignment. true/false

enable-variant-caller Enables the variant caller. --enable-variant-caller true/false

enable-vcf-compression Enable compression of VCF output files.Default is true.

true/false

fastq-file1 FASTQ file to input to the DRAGENpipeline (may be gzipped).

-1, --fasatq-file1

fastq-file2 Second FASTQ file with paired-end readsfor input.

-2, --fastq-file2

fastq-list CSV file that contains a list of FASTQ filesto process.

--fastq-list

fastq-list-all-samples Enable/disable processing of all samplestogether, regardless of the RGSM value.

--fastq-list-all-samples true/false

fastq-n-quality Base call quality to output for Nbases. Automatically added to fastq-n-quality for all output N’s.

--fastq-n-quality 0 to 255

fastq-offset FASTQ quality offset value. --fastq-offset 33 or 64

filter-flags-from-output Filter output alignments with any bits setin val present in the flags field. Hex anddecimal values accepted.

--filter-flags-from-output

first-tile-only Only convert the first tile of each lane. --first-tile-only

force Force overwrite of existing output file. -f

force-load-reference Force loading of the reference and hashtables before starting the DRAGENpipeline.

-l

generate-md-tags Whether to generate MD tags withalignment output records. Default is false.

true/false

generate-sa-tags Whether to generate SA:Z tags forrecords that have chimeric/supplementalalignments.

true/false

generate-zs-tags Whether to generate ZS tags foralignment output records. Default is false.

true/false

ht-alt-liftover SAM format liftover file of alternatecontigs in reference.

--ht-alt-liftover

ht-alt-aware-validate Disable requirement for a liftover filewhen building a hash table from areference that contains alt-contigs.

--ht-alt-aware-validate true/false

ht-build-rna-hashtable Enable generation of RNA hash table.Default is false.

--ht-build-rna-hashtable true/false

ht-cost-coeff-seed-freq Cost coefficient of extended seedfrequency.

--ht-cost-coeff-seed-freq

ht-cost-coeff-seed-len Cost coefficient of extended seed length. --ht-cost-coeff-seed-len

ht-cost-penalty-incr Cost penalty to incrementally extend aseed another step.

--ht-cost-penalty-incr

ht-cost-penalty Cost penalty to extend a seed by anynumber of bases.

--ht-cost-penalty

ht-decoys Specifies the path to a decoys file. --ht-decoys

Document # 1000000101034 v00




ht-max-dec-factor Maximum decimation factor for seedthinning.

--ht-max-dec-factor

ht-max-ext-incr Maximum bases to extend a seed by inone step.

--ht-max-ext-incr

ht-max-ext-seed-len Maximum extended seed length. -- ht-max-ext-seed-len

ht-max-seed-freq Maximum allowed frequency for a seedmatch after extension attempts.

--ht-max-seed-freq 1–256

ht-max-table-chunks Maximum ~1 GB thread table chunks inmemory at once.

--ht-max-table-chunks

ht-mem-limit Memory limit (hash table + reference),units B|KB|MB|GB.

--ht-mem-limit

ht-methylated Automatically generate C->T and G->Aconverted reference hashtables.

--ht-methylated true/false

ht-num-threads Maximum worker CPU threads forbuilding hash table.

--ht-num-threads

ht-rand-hit-extend Include a random hit with each EXTENDrecord of this freq record.

--ht-rand-hit-extend

ht-rand-hit-hifreq Include a random hit with each HIFREQrecord.

--ht-rand-hit-hifreq

ht-ref-seed-interval Number of positions per reference seed. --ht-ref-seed-interval

ht-reference Reference file in .fasta format to be usedto build a hash table.

--ht-reference

ht-seed-len Initial seed length to store in hash table. --ht-seed-len

ht-size Size of hash table, units B|KB|MB|GB. --ht-size

ht-soft-seed-freq-cap Soft seed frequency cap for thinning --ht-soft-seed-freq-cap

ht-suppress-decoys Suppress the use of a decoys file whenbuilding a hash table.

--ht-suppress-decoys

ht-target-seed-freq Target seed frequency for seedextension.

--ht-target-seed-freq

input-qname-suffix-delimiter Controls the delimiter used for append-read-index-to-name and for detectingmatching pair names with BAM input.

/ or . or:

interleaved Interleaved paired-end reads in singleFASTQ.

-i

intermediate-results-dir Directory to store intermediate results in(eg, sort partitions).

lic-no-print Suppress the license status message atthe end of a run.

--lic-no-print true/false

mapq-strict-js Specific to RNA. When set to 0, a higherMAPQ value is returned, expressingconfidence that the alignment is at leastpartially correct. When set to 1, a lowerMAPQ value is returned, expressing thesplice junction ambiguity.

--mapq-strict-js 0/1

methylation-generate-cytosine-report

Generate a genomewide cytosinemethylation report.

--methylation-generate-cytosine-report

true/false

Document # 1000000101034 v00




methylation-generate-mbias-report

Generate a per-sequencer-cycle methylation bias report.

true/false

methylation-match-bismark If true, match bismark’s tags exactly,including bugs.

--methylation-match-bismark true/false

methylation-protocol Library protocol for methylation analysis. none /directional /nondirectional /directional-complement

num-threads The number of processor threads to use. -n, --num-threads

output-directory Output directory. --output-directory

output-file-prefix Output file name prefix to use for all filesgenerated by the pipeline.

--output-file-prefix

output-format The format of the output file from themap/align stage. Valid values are bam(the default), sam, or dbam (a proprietarybinary format).

--output-format BAM / SAM /DBAM

pair-by-name Whether to shuffle the order of BAM inputrecords such that paired-end mates areprocessed together.

pair-suffix-delimiter Change the delimiter character forsuffixes.

--pair-suffix-delimiter / . :

preserve-bqsr-tags Whether to preserve input BAM file’s BIand BD flags. Note this may causeproblems with hard clipping.

true/false

preserve-map-align-order Produce output file that preserves originalorder of reads in the input file.

true/false

qc-coverage-region-1 First bed file to report coverage on. --qc-coverage-region-1

qc-coverage-region-2 Second bed file to report coverage on. --qc-coverage-region-2

qc-coverage-region-3 Third bed file to report coverage on. --qc-coverage-region-3

qc-coverage-reports-1 Types of reports requested for qc-coverage-region-1.

--qc-coverage-reports-1 full_res / cov_report

qc-coverage-reports-2 Types of reports requested for qc-coverage-region-2.


qc-coverage-reports-3 Types of reports requested for qc-coverage-region31.


ref-dir Directory containing the reference hashtable. This reference is automaticallyloaded to the DRAGEN card, if it is notalready there.

-r, --ref-dir

ref-sequence-filter Output only reads mapping to thisreference sequence.

--ref-sequence-filter

remove-duplicates If true, remove duplicate alignmentrecords instead of just flagging them.

true/false

RGCN Read group sequencing center name. --RGCN

RGCN-tumor Read group sequencing center name fortumor input.

--RGCN-tumor

Document # 1000000101034 v00




RGDS Read group description. --RGDS

RGDS-tumor Read group description for tumor input. --RGDS-tumor

RGDT Read group run date. --RGDT

RGDT-tumor Read group run date for tumor input. --RGDT-tumor

RGID Read group ID. --RGID

RGID-tumor Read group ID for tumor input. --RGID-tumor

RGLB Read group library. --RGLB

RGLB-tumor Read group library for tumor input. --RGLB-tumor

RGPI Read group predicted insert size. --RGPI

RGPI-tumor Read group predicted insert size fortumor input.

--RGPI-tumor

RGPL Read group sequencing technology. --RGPL

RGPL-tumor Read group sequencing technology fortumor input.

--RGPL-tumor

RGPU Read group platform unit. --RGPU

RGPU-tumor Read group platform unit for tumor input. --RGPU-tumor

RGSM Read group sample name. --RGSM

RGSM-tumor Read group sample name for tumor input. --RGSM-tumor

rna-ann-sj-min-len Discard splice junctions that have lengthless than this value, during the generationof splice junctions from an annotations file(GTF/GFF/SJ.out.tab).

rna-gf-input-file An already generated.Chimeric.out.junction file. When this file isprovided, DRAGEN Gene Fusion Moduleruns in standalone mode.

--rna-gf-input-file

rna-mapq-unique For compatibility with Cufflinks, enablethis parameter with a nonzero value.Unique mappers have a MAPQ set to thisvalue. Multimappers have a MAPQ of int(-10*log10(1 – 1 /NH)).

0, 1 to 255

sample-size Number of reads to sample when enable-sampling is true.

strict-mode Abort if any files are missing --strict-mode

strip-input-qname-suffixes Whether to strip read-index suffixes (eg,/1 and /2) from input QNAMEs.

true/false

tumor-bam-input Aligned BAM file to for the DRAGENvariant caller in somatic mode.

--tumor-bam-input

tumor-fastq-list A CSV file containing a list of FASTQ filesfor the mapper, aligner, and somaticvariant caller.

--tumor-fastq-list

tumor-fastq-list-sample-id The sample ID for the list of FASTQ filesspecified by tumor-fastq-list.

--tumor-fastq-list-sample-id

Document # 1000000101034 v00




tumor-fastq1 FASTQ file for the DRAGEN pipelineusing the variant caller in somatic mode(may be gzipped).

--tumor-fastq1

tumor-fastq2 Second FASTQ file with reads paired totumor-fastq1 reads for the DRAGENpipeline using the variant caller in somaticmode (may be gzipped).

--tumor-fastq2

umi-enable Enable UMI-based read processing. --umi-enable true/false

verbose Enable verbose output from DRAGEN. -v

version Print the version and exit. -V

Mapper OptionsThe following options are in the [Mapper] section of the configuration file. For more detailed informationon these options, see DNA Mapping on page 15.


ann-sj-max-indel

Maximum indel length to expect near anannotated splice junction.

--Mapper.ann-sj-max-indel 0 to 63

edit-chain-limit For edit-mode 1 or 2: Maximum seed chainlength in a read to qualify for seed editing.

--Mapper.edit-chain-limit edit-chain-limit >= 0

edit-mode 0 = No edits, 1 = Chain len test, 2 = Pairedchain len test, 3 = Edit all std seeds.

--Mapper.edit-mode 0 to 3

edit-read-len For edit-mode 1 or 2: Read length in which totry edit-seed-num edited seeds.

--Mapper.edit-read-len edit-read-len > 0

edit-seed-num For edit-mode 1 or 2: Requested number ofseeds per read to allow editing on.

--Mapper.edit-seed-num edit-seed-num >= 0

enable-map-align

Enable use of BAM input files formapper/aligner.

--enable-map-align true/false

map-orientations

0=Normal, 1=No Rev Comp, 2=No Forward(paired end requires Normal).

--Mapper.map-orientations 0 to 2

max-intron-bases

Maximum intron length reported. --Mapper.max-intron-bases

min-intron-bases

Minimum reference deletion length reportedas an intron.

--Mapper.min-intron-bases

seed-density Requested density of seeds from readsqueried in the hash table

--Mapper.seed-density 0 > seed-density > 1

Aligner OptionsThe following options are in the [Aligner] section of the configuration file. For more information, seeDNA Aligning on page 17

Document # 1000000101034 v00



Name Description Command Line Equivalent Value

aln-min-score (signed) Minimum alignment score toreport; baseline for MAPQ.When using local alignments (global =0), aln-min-score is computed by thehost software as "22 * match-score".When using global alignments (global =1), aln-min-score is set to -1000000.Host software computation may beoverridden by setting aln-min-score inconfiguration file.

--Aligner.aln-min-score −2,147,483,648to2,147,483,647

dedup-min-qual Minimum base quality for calculatingread quality metric for deduplication.

--Aligner.dedup-min-qual 0–63

en-alt-hap-aln Allows chimeric alignments to beoutput, as supplementary.

--Aligner.en-alt-hap-aln 0–1

en-chimeric-aln Allows chimeric alignments to beoutput, as supplementary.

--Aligner.en-chimeric-aln 0–1

gap-ext-pen Score penalty for gap extension. --Aligner.gap-ext-pen 0–15

gap-open-pen Score penalty for opening a gap(insertion or deletion).

gap-open-pen 0–127

global If alignment is global (Needleman-Wunsch) rather than local (Smith-Waterman).

--Aligner.global 0–1

hard-clips Flags for hard clipping: [0] primary, [1]supplementary, [2] secondary.

--Aligner.hard-clips 3 bits

map-orientations Constrain orientations to acceptforward-only, reverse-complementonly, or any alignments.

--Aligner.map-orientations 0 (any)1 (forwardonly)2 (reverse only)

mapq-max Ceiling on reported MAPQ. --Aligner.mapq-max 0 to 255

match-n-score (signed) Score increment for matchinga reference 'N' nucleotide IUB code.

--Aligner.match-n-score -16–15

match-score Score increment for matchingreference nucleotide.

--Aligner.match-score When global =0, match-score> 0; Whenglobal = 1,match-score >=0

min-score-coeff Adjustment to aln-min-score per readbase.

--Aligner. min-score-coeff -64–63.999

mismatch-pen Score penalty for a mismatch. --Aligner.mismatch-pen 0–63

no-unpaired If only properly paired alignmentsshould be reported for paired reads.

--Aligner. no-unpaired 0–1

pe-max-penalty Maximum pairing score penalty, forunpaired or distant ends.

--Aligner.pe-max-penalty 0–255

pe-orientation Expected paired-end orientation:0=FR, 1=RF, 2=FF.

--Aligner.pe-orientation 0, 1, 2

Document # 1000000101034 v00



Name Description Command Line Equivalent Value

sec-aligns Maximum secondary (suboptimal)alignments to report per read.

--Aligner.sec-aligns 0–30

sec-aligns-hard Set to force unmapped when not allsecondary alignments can be output.

--Aligner.sec-aligns-hard 0–1

sec-phred-delta Only secondary alignments withlikelihood within this Phred of theprimary are reported.

--Aligner.sec-phred-delta 0–255

sec-score-delta Secondary aligns allowed with pairscore no more than this far belowprimary.

--Aligner. sec-score-delta

supp-aligns Maximum supplementary (chimeric)alignments to report per read.

--Aligner.supp-aligns 0–30

supp-as-sec If supplementary alignments should bereported with secondary flag.

--Aligner.supp-as-sec 0–1

supp-min-score-adj Amount to increase minimumalignment score for supplementaryalignments. This score is computed byhost software as "8 * match-score" forDNA, and is default 0 for RNA.

--Aligner. supp-min-score-adj

unclip-score Score bonus for reaching each edge ofthe read.

--Aligner.unclip-score 0–127

unpaired-pen Penalty for unpaired alignments inPhred scale.

--Aligner.unpaired-pen 0–255

If you disable automatic detection of insert-length statistics via the --enable-sampling option, you mustoverride all the following options to specify the statistics. For more information, see Mean Insert SizeDetection on page 21. These options are part of the [Aligner] section of the configuration file.

Option Description Command Line Equivalent Value

pe-stat-mean-insert Average template length. 0–65535

pe-stat-mean-read-len Average read length. 0–65535

pe-stat-quartiles-insert A comma-delimited trio of numbersspecifying 25th, 50th, and 75th percentiletemplate lengths.

0–65535

pe-stat-stddev-insert Standard deviation of template lengthdistribution.

0–65535

Variant Caller OptionsThe following options are in the Variant Caller section of the configuration file. For more information onthese options, see Variant Caller Options on page 27.

Name DescriptionCommand LineEquivalent

Value

cosmic A COSMIC VCF input file. --cosmic

cnv-cbs-alpha Significance level for the test toaccept change points. Default is 0.01.

cnv-cbs-alpha

Document # 1000000101034 v00




Value

cnv-cbs-eta Type I error rate of the sequentialboundary for early stopping whenusing the permutation method.Default is 0.05.

--cnv-cbs-eta

cnv-cbs-kmax Maximum width of smaller segmentfor permutation. Default is 25.

--cnv-cbs-kmax

cnv-cbs-min-width Minimum number of markers for achanged segment. Default is 2.

--cnv-cbs-min-width

cnv-cbs-nmin Minimum length of data for maximumstatistic approximation. Default is200.

--cnv-cbs-nmin

cnv-cbs-nperm Number of permutations used for p-value computation. Default is 10000.

--cnv-cbs-nperm

cnv-cbs-trim Proportion of data to be trimmed forvariance calculations. Default is0.025.

--cnv-cbs-trim

cnv-counts-method Counting method for an alignment tobe counted in a target bin. Default isoverlap for panel of normalapproach. For self normalization, thedefault is start.

--cnv-counts-method midpoint / start /overlap

cnv-enable-gcbias-correction

Enable/disable GC bias correction. --cnv-enable-gcbias-correction

true/false

cnv-enable-gcbias-smoothing

Enable/disable smoothing of the GCbias correction across adjacent GCbins with an exponential kernel.Default is true.

--cnv-enable-gcbias-smoothing

true/false

cnv-enable-ref-calls When true, copy neutral (REF) callsare included in the output VCF.

--cnv-enable-ref-calls true/false

cnv-enable-self-normalization

Enable/disable self-normalization. --cnv-enable-self-normalization

true/false

cnv-enable-split-intervals Splits up all target BED intervals intotwo equally spaced intervals. If thisoption is enabled, then all samples(case and panel of normals) must beexecuted with this option enabled.Default is false.

--cnv-enable-split-intervals

cnv-enable-tracks Enable/disable generation of trackfiles that can be imported into IGV forviewing. Default is true.

--cnv-enable-tracks true/false

cnv-extreme-percentile Extreme median percentile value atwhich to filter out samples. Default is2.5.

--cnv-extreme-percentile

cnv-filter-bin-support-ratio

Filters out a candidate event if thespan of supporting bins is less thanthe specified ratio with respect to theoverall event length. The default ratiois 0.2 (20% support).

--cnv-filter-bin-support-ratio

Document # 1000000101034 v00




Value

cnv-filter-copy-ratio Minimum copy ratio threshold valuecentered about 1.0 at which areported event is marked as PASS inthe output VCF file. Default is 0.2

--cnv-filter-copy-ratio

cnv-filter-de-novo-quality Phred-scale threshold for calling anevent as de novo in the proband.

--cnv-filter-de-novo-quality

cnv-filter-length Minimum event length in bases atwhich a reported event is marked asPASS in the output VCF file. Defaultis 10000.

--cnv-filter-length

cnv-filter-qual The QUAL value at which a reportedevent is marked as PASS in theoutput VCF file. Default is 10.

--cnv-filter-qual

cnv-fpop-penalty Penalty parameter for change pointdetection. Default is 0.03.

--cnv-fpop-penalty

cnv-input Sample input. --cnv-input

cnv-matched-normal Target counts file of the matchednormal sample.

--cnv-matched-normal

cnv-max-percent-zero-samples

Threshold for filtering out targets withtoo many zero coverage samples.Default is 5%.

--cnv-max-percent-zero-samples

cnv-max-percent-zero-targets

Threshold for filtering out sampleswith too many zero coverage targets.Default is 2.5%.

--cnv-max-percent-zero-targets

cnv-merge-distance Specifies the minimum number ofbase pairs between two segmentsthat would allow them to be merged.The default value is 0, which meansthe segments must be directlyadjacent.

--cnv-merge-distance

cnv-merge-threshold The maximum segment meandifference at which two adjacentsegments should be merged. Thesegment mean is represented as alinear copy ratio value. Default is 0.2.Set to 0 to disable merging.

--cnv-merge-threshold

cnv-min-mapq Enable or disable splitting all targetBED intervals into two equally spacedintervals. When enabled, all samples(case and panel of normals) must beexecuted with this option enabled.Default is false.

--cnv-min-mapq true/false

cnv-normals-file A single file to be used in the panel ofnormals. Can be specified multipletimes, once for each file.

--cnv-normals-file

cnv-normals-list A panel of normals file. --cnv-normals-list

cnv-num-gc-bins Number of bins for GC biascorrection. Each bin represents theGC content percentage. Default is25.

--cnv-num-gc-bins 10 / 20 / 25/ 50 / 100

Document # 1000000101034 v00




Value

cnv-ploidy The normal ploidy value. Used forestimation of the copy number valueemitted in the output VCF file. Defaultis 2.

--cnv-ploidy

cnv-qual-length-scale Bias weighting factor to adjust QUALestimates for segments with longerlengths. Advanced option that shouldnot have to be modified. Default is0.9303 (2-0.1).

--cnv-qual-length-scale

cnv-qual-noise-scale Bias weighting factor to adjust QUALestimates based on sample variance.Advanced option that should not haveto be modified. Default is 1.0.

--cnv-qual-noise-scale

cnv-segmentation-mode Segmentation algorithm to perform.The default value is slm or cb"depending on whether the intervalsare whole genome intervals ortargeted sequencing intervals.

--cnv-segmentation-mode

cbs / slm / hslm /fpop

cnv-slm-eta Baseline probability that the meanprocess changes its value. Default is1e-5.

--cnv-slm-eta

cnv-slm-fw Minimum number of data points for aCNV to be emitted. Default is 0.

--cnv-slm-fw

cnv-slm-omega Scaling parameter modulatingrelative weight betweenexperimental/biological variance.Default is 0.3.

--cnv-slm-omega

cnv-slm-stepeta Distance normalization parameter.The default value is 10000. Only validfor “HSLM”.

--cnv-slm-stepeta

cnv-target-bed A properly formatted BED file thatspecifies the target intervals tosample coverage over. For use inWES analysis.

--cnv-target-bed

cnv-target-factor-threshold

Percentage of median target factorthreshold to filter out useable targets.The default value is 1% for wholegenome processing and 10% fortargeted sequencing processing.

--cnv-target-factor-threshold

cnv-truncate-threshold Extreme outliers are truncated basedon this percent threshold. Default is0.1%.

--cnv-truncate-threshold

cnv-wgs-interval-width Width of the sampling interval forCNV WGS processing. Default is1000.

--cnv-wgs-interval-width

cnv-wgs-skip-contig-list A comma-separated list of contigidentifiers to skip when generatingintervals for WGS analysis. Thedefault contigs that are skipped, if notspecified, are "chrM,MT,m,chrm".

--cnv-wgs-skip-contig-list

Document # 1000000101034 v00




Value

enable-cnv-tracks Enable/disable generation of bigwigand gff files.

--enable-cnv-tracks true/false

enable-combinegvcfs Enable/disable combine gVCF run. --enable-combinegvcfs true/false

enable-joint-genotyping To enable combine gVCF run, set totrue.

--enable-joint-genotyping true/false

enable-multi-sample-gvcf Enable/disable generation of amultisample gVCF file. If set to true,requires a combined gVCF file asinput.

--enable-multi-sample-gvcf

true/false

enable-vqsr Enable/disable the VQSR postprocessing module.

--enable-vqsr true/false

panel-of-normals The path to the panel of normalsVCF file.

--panel-of-normals

variant The path to a single gVCF file.Multiple --variant options can be usedon the command line, onefor each gVCF. Up to 500 gVCFs aresupported.

--variant

pedigree-file Specific to joint calling. The path to aPED pedigree file containing astructured description of the familialrelationships between samples. Onlypedigree files that contain trios aresupported.

--pedigree-file

variant-list The path to a file containing a list ofinput gVCF files, one file per line, thatneed to be combined.

--variant-list

vc-alt-allele-in-normal-filter

Set to false to disable the alt-allele-in-normal filter. Default is true.

--vc-alt-allele-in-normal-filter

true/false

vc-decoy-contigs The path to a comma-separated listof contigs to skip during variantcalling.

--vc-decoy-contigs

vc-emit-ref-confidence To enable base pair gVCFgeneration, set to BP_RESOLUTION.To enable banded gVCF generation,set to GVCF.

--vc-emit-ref-confidence BP_RESOLUTION /GVCF

vc-enable-clustered-events-filter

Enable/disable the clustered eventsfilter. The default is true.

--vc-enable-clustered-events-filter

true/false

vc-enable-decoy-contigs Enable/disable variant calls on decoycontigs. Default is false.

--vc-enable-decoy-contigs

true/false

vc-enable-gatk-acceleration

Enable/disable running variant callerin GATK mode.

--vc-enable-gatk-acceleration

true/false

vc-enable-homologous-mapping-filter

Enable/disable the homologousmapping event filter. Default is true.

--vc-enable-homologous-mapping-filter

true/false

vc-enable-phasing Enable variants to be phased whenpossible. Default is true.

--vc-enable-phasing true/false

vc-enable-triallelic-filter Enable the multi-allelic. Default istrue.

-vc-enable-triallelic-filter

Document # 1000000101034 v00




Value

vc-gvcf-gq-bands Define GQ bands for gVCF output.Default is 10, 20, 30, 40, 60, 80.

--vc-gvcf-gq-bands

vc-hard-filter Boolean expression for filteringvariant calls. Default expression is:DRAGENHardQUAL:all: QUAL <10.4139;LowDepth:all: DP < 1

Parameters in theexpression caninclude QD, MQ, FS,MQRankSum,ReadPosRankSum,QUAL, DP, and GQ.

vc-limit-genomecov-output

Limit the size of the .genomecov.bedfile that is generated. Default is false.

--vc-limit-genomecov-output

true/false

vc-max-alternate-alleles Maximum number of ALT alleles thatare output in a VCF or gVCF. Defaultis 6.

--vc-max-alternate-alleles

vc-max-reads-per-active-region

Maximum number of reads perregion for downsampling. Default is1000.

vc-max-reads-per-active-region

Maximum number of reads perregion for downsampling; default is1000.

--vc-max-reads-per-active-region

vc-min-base-qual Minimum base quality to beconsidered for variant calling. Defaultis 10.

--vc-min-base-qual

vc-min-call-qual Minimum variant call quality foremitting a call. Default is 30.

--vc-min-call-qual

vc-min-read-qual Minimum read quality (MAPQ) to beconsidered for variant calling. Defaultis 20.

--vc-min-read-qual

vc-min-tumor- read-qual Minimum tumor read quality (MAPQ)to be considered for variant calling.

vc-min-reads-per-start-pos

Minimum number of reads per startposition for downsampling. Default is5.

--vc-min-reads-per-start-pos

vc-remove-all-soft-clips If set to true, variant caller does notuse soft clips of reads to determinevariants.

--vc-remove-all-soft-clips true/false

vc-sample-name Variant caller sample name,analogous field to RGSM.

--vc-sample-name

vc-target-bed Target regions BED file. --vc-target-bed

vc-target-coverage Target coverage for downsampling.Default is 2000.

--vc-target-coverage

vc-tlod-filter-threshold Set the TLOD filter threshold. Defaultis 6.5.

--vc-tlod-filter-threshold

Document # 1000000101034 v00




Value

vqsr-annotation A comma-separated string thatspecifies the annotations to use forbuilding models. String format is<mode>,<annotation>,<annotation>...where mode is INDEL or SNP. SeeVariant Quality Score Recalibrationon page 75.

--vqsr-annotation

vqsr-config The path to the VQSR configurationfile.

--vqsr-config

vqsr-filter-level Specifies the truth sensitivity level asa percentage to filter out variant calls.

--vqsr-filter-level

vqsr-input Specifies the VQSR input VCF to beprocessed.

--vqsr-input

vqsr-lod-cutoff Specifies the LOD cutoff score forselecting which variant call sites touse in building the negative model.Default is -5.0.

--vqsr-lod-cutoff

vqsr-num-gaussians The number of Gaussians togenerate the positive and negativemodels, specified as a comma-separated string of the following fourinteger values: <SNP positive>,<SNPnegative>,<INDEL positive>,<INDELnegative>. If not specified, defaultvalues are 8, 2, 4, 2.

--vqsr-num-gaussians

vqsr-resource The training resource files to be usedfor determining which variant callsites are true, specified as a comma-separated string that specifies themode of operation, the prior value toweight this resource, then the paththe resource file. For example:<mode>,<prior>,<resource file>.

--vqsr-resource

vqsr-tranche Specifies the truth sensitivity levels tocalculate the LOD cutoff values aspercentages. This option can bespecified multiple times with adifferent truth sensitivity level. If noneare specified, the default values are100.0, 99.99, 99.90, 99.0, and 90.0.

--vqsr-tranche

Document # 1000000101034 v00



Repeat Expansion Detection OptionsThe following options can be set in the RepeatGenotyping section of the configuration file or on thecommand line. For more information, see Repeat Expansion Detection with Expansion Hunter on page60.

ParameterName

DescriptionCommand LineEquivalent

Range

enable Enable or disable repeat expansion detection. --repeat-genotype-enable

true/false

specs The full path to the JSON file that contains the repeatvariant catalog (specification) describing the loci tocall.

--repeat-genotype-specs

Document # 1000000101034 v00



Technical AssistanceFor technical assistance, contact Illumina Technical Support.Website: www.illumina.comEmail: [email protected]

Illumina Customer Support Telephone Numbers

Region Toll Free Regional

North America +1.800.809.4566

Australia +1.800.775.688

Austria +43 800006249 +43 19286540

Belgium +32 80077160 +32 34002973

China 400.066.5835

Denmark +45 80820183 +45 89871156

Finland +358 800918363 +358 974790110

France +33 805102193 +33 170770446

Germany +49 8001014940 +49 8938035677

Hong Kong 800960230

Ireland +353 1800936608 +353 016950506

Italy +39 800985513 +39 236003759

Japan 0800.111.5011

Netherlands +31 8000222493 +31 207132960

New Zealand 0800.451.650

Norway +47 800 16836 +47 21939693

Singapore +1.800.579.2745

South Korea +82 80 234 5300

Spain +34 911899417 +34 800300143

Sweden +46 850619671 +46 200883979

Switzerland +41 565800000 +41 800200442

Taiwan 00806651752

United Kingdom +44 8000126019 +44 2073057197

Other countries +44.1799.534000

Safety data sheets (SDSs)—Available on the Illumina website at support.illumina.com/sds.html.

Product documentation—Available for download in PDF from the Illumina website. Go tosupport.illumina.com, select a product, then select Documentation & Literature.

Document # 1000000101034 v00



http://www.illumina.com/

mailto:[email protected]

http://support.illumina.com/sds.html

http://www.illumina.com/support.ilmn

Illumina5200 Illumina WaySan Diego, California 92122 U.S.A.+1.800.809.ILMN (4566)+1.858.202.4566 (outside North America)[email protected]

For Research Use Only. Not for use in diagnostic procedures.© 2020 Illumina, Inc. All rights reserved.

Document # 1000000101034 v00

illumina dragen bio-it platorm v3.3 user guide ... · revisionhistory document date...

Documents