Variant (SNPs/indels) calling in DNA sequences, part 1
DESCRIPTION
Abstract: This session will focus on the first steps involved in identifying SNPs from whole genome, exome capture or targeted resequencing data. The different approaches to mapping reads to a DNA reference sequence will be introduced and quality metrics discussed.
TRANSCRIPT
The Queensland Brain Institute |
Variant calling for disease association (1/2): Ordering the haystack
April 11, 2023
[www.absolutefab.com]
Quick recap: Production informatics
Sequencing -> Image -> FASTQ
• Sequencing -> Images -> Conversion (Demultiplexing)
• Resulting file type: FASTQ
• Several projects can be processed on one flowcell
• One project can have several samples
Quality Control
Projects
Product Time
fastq 5 days
bam, vcf,… 3 weeks
paper >6 months
Per one-flowcell project
Production Informatics and Bioinformatics
• Basic Production Informatics: Produce raw sequence reads
• Advanced Production Informatics: Map to genome and generate raw genomic features (e.g. SNPs)
• Bioinformatics Research: Analyze the data; uncover the biological meaning
Where in the genome do the reads come from?
Reads Alignment
Short read mapping
• A brute-force algorithm would take years to process one lane: data structures matter!
  – Constant trade-off: speed vs. sensitivity
  – To date >50 read mapping tools
• Two categories
  – Hash tables: MAQ, ELAND, SOAP, BFAST, RazerS, Novoalign
  – Suffix trees: BWA, SOAP2, Bowtie
Thomas Keane 9th European Conference on Computational Biology 26th September, 2010
Bao S, Jiang R, Kwan W, Wang B, Ma X, Song YQ. Evaluation of next-generation sequencing software in mapping and assembly. J Hum Genet. 2011 Apr 28. PubMed PMID: 21525877.
Hash table based aligners
• Modifications
  – Speed-up: spaced seeds, e.g. 111010010100110111
  – Gapped seeds: q-grams
• Hash the reads: MAQ, ELAND, ZOOM and SHRiMP
  – Potentially much smaller memory requirements
• Hash the reference: SOAP, BFAST and MOSAIK
  – Constant memory cost, one-time effort
Thomas Keane 9th European Conference on Computational Biology 26th September, 2010
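To make the "hash the reference" idea with spaced seeds concrete, here is a minimal Python sketch. It is not how any of the listed tools is implemented; the function names and the toy reference are made up for illustration. Only the seed pattern is taken from the slide, read as 1 = base must match, 0 = base is ignored.

```python
from collections import defaultdict

# Spaced-seed pattern from the slide: 1 = position must match, 0 = position is ignored
SEED_MASK = "111010010100110111"


def seed_key(window, mask=SEED_MASK):
    """Keep only the bases that sit under a '1' of the mask."""
    return "".join(base for base, bit in zip(window, mask) if bit == "1")


def index_reference(reference, mask=SEED_MASK):
    """Hash the reference once: spaced-seed key -> list of 0-based start positions."""
    index = defaultdict(list)
    span = len(mask)
    for start in range(len(reference) - span + 1):
        index[seed_key(reference[start:start + span], mask)].append(start)
    return index


def candidate_positions(read, index, mask=SEED_MASK):
    """Look up the first seed-length window of the read.

    The hits are only candidates; a real aligner would verify each one
    with a full alignment of the read.
    """
    return index.get(seed_key(read[:len(mask)], mask), [])


# Toy reference, invented for this example
reference = "ACGTTGCAAGCTTAGGCTAACGTAGCTAGGATCCA"
read = reference[5:23]
# Introduce a mismatch at read position 3, which is a '0' in the mask
mutated = read[:3] + ("A" if read[3] != "A" else "C") + read[4:]

index = index_reference(reference)
print(candidate_positions(read, index))
print(candidate_positions(mutated, index))
```

Because the mismatch falls on an ignored position, the mutated read still produces the same key and therefore the same candidate location. This is the sensitivity gain that spaced seeds offer over contiguous seeds of the same weight.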
Suffix tree and Burrows‐Wheeler Transformation
• Suffix trees are much faster
  – E.g. BWA is ~20 times faster than hash-based MAQ
• The Burrows-Wheeler transformation makes them practical by reducing the memory footprint
Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009 Jul 15;25(14):1754-60. PMID: 19451168
Reference: queensland

Rotated:
queensland$
ueensland$q
eensland$qu
ensland$que
nsland$quee
sland$queen
land$queens
and$queensl
nd$queensla
d$queenslan
$queensland

Sorted:
$queensland
and$queensl
d$queenslan
eensland$qu
ensland$que
land$queens
nd$queensla
nsland$quee
queensland$
sland$queen
ueensland$q

BWT(Ref), the last column of the sorted rotations: dlnuesae$nq
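The rotate-and-sort construction can be reproduced in a few lines of Python. This is a naive sketch for illustration only: it materialises every rotation, which is fine for "queensland" but not for a genome, where suffix-array based construction is used instead. The function names are my own.

```python
def rotations(text):
    """All cyclic rotations of text; text must already end in the sentinel '$'."""
    return [text[i:] + text[:i] for i in range(len(text))]


def bwt(reference):
    """Burrows-Wheeler transform: the last column of the sorted rotation matrix."""
    text = reference + "$"  # '$' sorts before all letters in ASCII
    return "".join(rotation[-1] for rotation in sorted(rotations(text)))


print(bwt("queensland"))  # dlnuesae$nq
```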
Find exact matches in transformed sequence
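P    BWT matrix     C
0    $queensland    1
8    and$queensl    1
10   d$queenslan    1
3    eensland$qu    1
4    ensland$que    1
7    land$queens    1
9    nd$queensla    1
5    nsland$quee    2
1    queensland$    1
6    sland$queen    2
2    ueensland$q    1
P = start position of the row in the reference; BWT = last column; C = running count of the last-column letter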
Read: ensl
Reference: q u e e n s l a n d
Position:  1 2 3 4 5 6 7 8 9 10
1. Search backwards
2. Find letter i in the last column
3. Jump to the count-th i letter in the first column
4. Set i to be the letter in the last column
5. Repeat 3 and 4 to the end
John Pearson, Winter School in Mathematical and Computational Biology, 5-9 July 2010
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25. PMID: 19261174.
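The backward search can be sketched in Python as follows. This is a simplified, unoptimised variant that tracks a range of matching rows rather than a single row, and it recounts letters on every step; production aligners precompute these counts. All names are hypothetical, and the suffix array is built by plain sorting, which only works for toy inputs.

```python
def build_index(reference):
    """Return (suffix array, BWT last column, first row per letter) for reference + '$'."""
    text = reference + "$"
    # Sorting suffixes gives the same order as sorting rotations, thanks to the unique '$'
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    last = [text[i - 1] for i in suffix_array]  # letter preceding each suffix
    first_row = {}
    for row, start in enumerate(suffix_array):
        first_row.setdefault(text[start], row)  # first row beginning with this letter
    return suffix_array, last, first_row


def occurrences(last, letter, row):
    """How often letter occurs in the last column above the given row."""
    return last[:row].count(letter)


def backward_search(read, suffix_array, last, first_row):
    """Return the sorted 1-based reference positions of all exact matches of read."""
    low, high = 0, len(last)  # half-open range of rows that still match
    for letter in reversed(read):  # step 1: search backwards
        if letter not in first_row:
            return []
        # Jump from the last column to the matching letters in the first column
        low = first_row[letter] + occurrences(last, letter, low)
        high = first_row[letter] + occurrences(last, letter, high)
        if low >= high:
            return []
    return sorted(suffix_array[row] + 1 for row in range(low, high))


suffix_array, last, first_row = build_index("queensland")
print(backward_search("ensl", suffix_array, last, first_row))  # [4]
```

The result agrees with the P column of the table: the row "ensland$que" starts at position 4 of the reference.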
Which aligner to use?
• Hash-based approaches are more suitable for divergent alignments
  – Rule of thumb:
    • <2% divergence -> BWT, e.g. human alignments
    • >2% divergence -> hash-based approach, e.g. wild mouse strain alignments
However, the field develops fast: don't be sentimental
Thomas Keane 9th European Conference on Computational Biology 26th September, 2010
File format: SAM/BAM
The SAM Format Specification (v1.4-r962), April 17, 2011
ref      AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
+r001/1        TTAGATAAAGGATA*CTG
+r002         aaaAGATAA*GGATA
+r003       gcctaAGCTAA
+r004                     ATAGCT..............TCAGC
-r003                            ttagctTAGGC
-r001/2                                        CAGCGCCAT
Plus an unlimited number of additional fields in TAG:TYPE:VALUE format, e.g. NM (edit distance)
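As a rough illustration of the layout, a SAM alignment line has 11 mandatory tab-separated fields followed by the optional TAG:TYPE:VALUE fields. The Python sketch below splits one such line; the example line is modelled on read r003 from the specification's example, and the class and function names are my own. For real work a library such as pysam is the safer choice.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str
    qual: str
    tags: dict   # optional fields, e.g. {"NM": 1}


def parse_sam_line(line):
    """Split one alignment line (not a header line) into its fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    tags = {}
    for item in fields[11:]:
        tag, value_type, value = item.split(":", 2)
        # Only integer tags are converted here; other types stay as strings
        tags[tag] = int(value) if value_type == "i" else value
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


line = "r003\t0\tref\t9\t30\t5H6M\t*\t0\t0\tAGCTAA\t*\tNM:i:1"
record = parse_sam_line(line)
print(record.qname, record.pos, record.cigar, record.tags)
```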
Flag
Hex    0x80  0x40  0x20  0x10  0x8  0x4  0x2  0x1
Bit    128   64    32    16    8    4    2    1
163 =  1     0     1     0     0    0    1    1
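The flag is the sum of the bits that are set, so 163 = 128 + 32 + 2 + 1. A small Python sketch that decodes the eight bits shown above follows; the descriptions are paraphrased from the SAM specification and the function name is my own.

```python
# Meaning of the eight flag bits shown in the table above
FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
}


def decode_flag(flag):
    """List the meaning of every bit that is set in the flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


print(decode_flag(163))
# ['read paired', 'read mapped in proper pair', 'mate on reverse strand', 'second in pair']
```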
CIGAR String
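A CIGAR string is a run-length encoding of the alignment operations: for example, 8M2I4M1D3M means 8 aligned bases, 2 inserted bases, 4 aligned bases, 1 deleted base and 3 aligned bases. The Python sketch below parses such a string and derives how many reference and read bases it covers. The operator sets follow the SAM specification; the function names are assumptions made for this example.

```python
import re

CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference
CONSUMES_READ = set("MIS=X")       # operations that advance along the read sequence


def parse_cigar(cigar):
    """Turn '8M2I4M1D3M' into [(8, 'M'), (2, 'I'), (4, 'M'), (1, 'D'), (3, 'M')]."""
    operations = [(int(length), op) for length, op in CIGAR_OPERATION.findall(cigar)]
    # Reject strings that contain anything besides well-formed operations
    if "".join(f"{length}{op}" for length, op in operations) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return operations


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(length for length, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def read_length(cigar):
    """Number of read bases described by the CIGAR (hard clips excluded)."""
    return sum(length for length, op in parse_cigar(cigar) if op in CONSUMES_READ)


print(parse_cigar("8M2I4M1D3M"))
print(reference_span("8M2I4M1D3M"), read_length("8M2I4M1D3M"))  # 16 17
```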
Visualizing BAM files: IGV
Exome capture
http://www.broadinstitute.org/igv/
Whole genome sequencing
BAM file: Quality control
• Percentage mapped
  – Aim for 80%
• Coverage
  – Aim for coverage >10x
• Duplicates
  – Aim for <1% (whole genome)
//cluster-vm.qbi.uq.edu/<yourProject>
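The three targets can be checked mechanically once the read counts are known. The Python sketch below assumes the counts have already been extracted from the BAM file by some other tool; the function name, its parameters and the example numbers are invented for illustration, and computing the duplicate rate relative to mapped reads is an assumption rather than something the slide specifies.

```python
def bam_qc(total_reads, mapped_reads, duplicate_reads, aligned_bases, target_size,
           min_pct_mapped=80.0, min_coverage=10.0, max_pct_duplicates=1.0):
    """Compare basic mapping statistics against the rule-of-thumb targets.

    aligned_bases: total number of bases in mapped reads
    target_size:   size of the genome or capture target in bases
    """
    pct_mapped = 100.0 * mapped_reads / total_reads
    pct_duplicates = 100.0 * duplicate_reads / mapped_reads  # assumption: relative to mapped reads
    coverage = aligned_bases / target_size                   # mean depth
    return {
        "pct_mapped": pct_mapped,
        "coverage": coverage,
        "pct_duplicates": pct_duplicates,
        "mapped_ok": pct_mapped >= min_pct_mapped,
        "coverage_ok": coverage > min_coverage,
        "duplicates_ok": pct_duplicates < max_pct_duplicates,
    }


# Made-up numbers for a small target region
report = bam_qc(total_reads=1_000_000, mapped_reads=900_000, duplicate_reads=4_500,
                aligned_bases=90_000_000, target_size=3_000_000)
for name, value in report.items():
    print(name, value)
```

The duplicate threshold is a parameter because the <1% target is stated for whole genome data only.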
Three things to remember
1. Getting the mapping right is critical
2. QC means checking the mapping stats and visualizing the BAM file
3. Knowing where the reads are does not necessarily tell you about their function
Next week: Part 2
Abstract: This session will focus on the steps involved in identifying genomic variants after an initial mapping has been achieved: improving the mapping, SNP and indel calling, and variant filtering/recalibration will be introduced and quality metrics discussed.
http://climbers.net/blog/Exhibiting-at-Cliffhanger-12-13th-July-Sheffield
Walk-in-clinic