key tasks in sequence analysis- alignment-variant call-assembly
TRANSCRIPT
-
8/19/2019 Key Tasks in Sequence Analysis- Alignment-Variant Call-Assembly
1/27
Aylwyn Scally, Wellcome Trust Sanger Institute
March 2012
-
Key tasks in sequence analysis
• Data handling
• Alignment to a reference sequence
• Alignment file handling
• Variant calling
  • SNPs, genotypes
  • structural variation
• Sequence assembly
-
Data handling
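• Important to have a data hierarchy corresponding to experimental factors
  • lane/run
  • library
  • sequencing technology
  • individual
  • strain/subspecies/population
  • species
[Figure: example hierarchy tree with one level per factor. Labels: species (hsa); population (YRI, CEU); individual (NA12878, NA19240); technology (SLX, 454); library (NA12878-WG, NA19240-WG); lane (297_1, 297_2, 505_7, 505_8)]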
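One way to make the hierarchy concrete is to map each factor onto a directory level. The sketch below is only illustrative: the `lane_path` helper, the order of the levels and the `.fastq` suffix are assumptions rather than part of the slides, and the particular combination of population, individual and lane in the example is hypothetical.

```python
from pathlib import PurePosixPath

# Hierarchy levels from most general to most specific (assumed ordering).
LEVELS = ("species", "population", "individual", "technology", "library", "lane")


def lane_path(root: str, **factors: str) -> PurePosixPath:
    """Build a path with one directory per experimental factor."""
    missing = [level for level in LEVELS if level not in factors]
    if missing:
        raise ValueError(f"missing hierarchy levels: {missing}")
    # Every level except the lane becomes a directory; the lane names the file.
    directories = [factors[level] for level in LEVELS[:-1]]
    return PurePosixPath(root, *directories, factors["lane"] + ".fastq")


if __name__ == "__main__":
    print(lane_path("data", species="hsa", population="CEU", individual="NA12878",
                    technology="SLX", library="NA12878-WG", lane="297_1"))
```

Keeping one level per factor means that merging by library or by individual later in the pipeline reduces to collecting everything below one directory.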
-
Raw sequence data
• FASTQ format
  • original Sanger standard for capillary data
  • derived from FASTA format
  • sequence and an associated per-base quality score
• PHRED quality scores encoded as ASCII printable characters (ASCII 33–126)
  • standard offset 33, but older Solexa/Illumina variants used 64
@title
sequence
+optional_text
quality
@SRR010930.8436795/1
ACCCCAGGATCAACACTTCACATGCATTAGCAGAGAGAGATAAATCAA
+
=>=??A?
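The four-line record layout above can be read with a few lines of code. This is a minimal sketch that assumes every record occupies exactly four lines (no wrapped sequences); the `FastqRecord` and `read_fastq` names and the example read are made up for illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    title: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream that uses exactly four lines per read."""
    while True:
        title = handle.readline().rstrip("\n")
        if not title:
            return  # end of input (or a blank line)
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not title.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {title!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {title!r}")
        yield FastqRecord(title[1:], sequence, quality)


if __name__ == "__main__":
    example = io.StringIO("@read1/1\nACGTN\n+\nIIII#\n")
    for record in read_fastq(example):
        print(record)
```

The length check matters in practice: a quality string shorter than its sequence usually indicates a truncated file.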
-
PHRED quality scores
• Encodes the probability of an erroneous call
  • quality score $Q = -10 \log_{10} P$
  • error probability $P = 10^{-Q/10}$
  • example: call with $Q = 30$ has error probability $P = 10^{-3}$ = 1 in 1000
• ASCII encoding
encoding  !  "  #  $  %  &  '  (  )  *  +   ,   -   .   /   0   1   2   3   4
Q score   0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19
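The two formulas and the ASCII table above translate directly into code. The sketch below assumes the standard offset of 33 by default; the function names are illustrative only.

```python
import math
from typing import List


def phred_to_error_probability(q: float) -> float:
    """P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Q = -10 log10 P."""
    return -10 * math.log10(p)


def decode_qualities(quality: str, offset: int = 33) -> List[int]:
    """Convert an ASCII-encoded quality string into integer Q scores."""
    scores = [ord(char) - offset for char in quality]
    if any(score < 0 for score in scores):
        raise ValueError("character below the offset; wrong encoding assumed?")
    return scores


if __name__ == "__main__":
    for q in decode_qualities("!+4"):
        print(q, phred_to_error_probability(q))
```

Passing `offset=64` decodes the older Solexa/Illumina variant mentioned on the previous slide; a negative score is a strong hint that the wrong offset was chosen.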
-
Alignment pipeline
• Get data
• Prepare and index reference
  • sequence names; alternate haplotypes etc.
• Align data
  • by lane or smaller unit, to optimise throughput
• Sort by position
• Merge alignments
• Improve alignments
• Merge libraries
• Index final alignment
[Figure labels: the steps are grouped into three stages: data processing, alignment, SAM file processing]
-
Alignment pipeline
[Figure: pipeline diagram. Each lane/plex Fastq file goes through Alignment to a BAM and through Improvement to a further BAM; the lane/plex BAMs are combined in a library merge into library BAMs, which are merged again into sample/platform and sample BAMs]
-
bwa
• bwa index [-a bwtsw|div|is] [-c]
  • Burrows-Wheeler transform construction algorithm
  • ‘bwtsw’ for vertebrate-sized genomes, ‘is’ for smaller genomes
• bwa aln [options]
  • align each single-ended fastq file individually
  • takes the name of the reference file as an argument
  • options control alignment parameters, scoring matrix, seed length
• bwa sampe [options]
  • generates paired-end alignments from the sai files produced by bwa aln
  • produces SAM output
• bwa bwasw [options]
  • alignment of long reads in query.fa
  • produces SAM output
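The subcommands above chain into a paired-end run: index once, run `bwa aln` on each fastq file, then combine the two sai files with `bwa sampe`. The sketch below only builds the argument lists and runs nothing. All file names are hypothetical, and the positional argument order and the `-f` output option reflect the bwa manual of that era, so they should be checked against the usage message of the installed version.

```python
from typing import List


def bwa_index(reference: str, algorithm: str = "bwtsw") -> List[str]:
    """Index the reference; 'bwtsw' suits vertebrate-sized genomes."""
    return ["bwa", "index", "-a", algorithm, reference]


def bwa_aln(reference: str, fastq: str, sai: str, trim_quality: int = 20) -> List[str]:
    """Align one single-ended fastq file, quality-clipping with -q."""
    return ["bwa", "aln", "-q", str(trim_quality), "-f", sai, reference, fastq]


def bwa_sampe(reference: str, sai1: str, sai2: str,
              fastq1: str, fastq2: str, sam: str) -> List[str]:
    """Pair up the two sai files and write SAM output."""
    return ["bwa", "sampe", "-f", sam, reference, sai1, sai2, fastq1, fastq2]


def paired_end_commands(reference: str, fastq1: str, fastq2: str,
                        prefix: str) -> List[List[str]]:
    """Return the commands for one lane, in the order they should run."""
    sai1, sai2 = prefix + "_1.sai", prefix + "_2.sai"
    return [
        bwa_aln(reference, fastq1, sai1),
        bwa_aln(reference, fastq2, sai2),
        bwa_sampe(reference, sai1, sai2, fastq1, fastq2, prefix + ".sam"),
    ]


if __name__ == "__main__":
    print(" ".join(bwa_index("ref.fa")))
    for command in paired_end_commands("ref.fa", "lane_1.fastq", "lane_2.fastq", "lane"):
        print(" ".join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` once the options have been confirmed.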
-
bwa usage notes
• bwa finds matches up to a finite edit distance
  • by default for 100-bp reads allows 5 edits
• Important to quality-clip reads
  • -q in bwa aln, e.g. set to 20
• Non-ACGT bases on reads are treated as mismatches
• Parallelise for speed
  • split data into 1 Gbp blocks
  • bwa takes ~8 hrs per block
• Check for truncated BAM files
  • e.g. with samtools flagstat
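Splitting the input into blocks of roughly 1 Gbp can be sketched as a simple accumulator over read sequences. This is an illustration under stated assumptions: reads are represented as plain strings, the `split_into_blocks` name is made up, and a real splitter would write each block to its own fastq file.

```python
from typing import Iterable, Iterator, List


def split_into_blocks(reads: Iterable[str],
                      max_bases: int = 1_000_000_000) -> Iterator[List[str]]:
    """Group reads into consecutive blocks holding at most max_bases bases.

    A single read longer than max_bases still gets a block of its own.
    """
    block: List[str] = []
    bases = 0
    for read in reads:
        # Start a new block if adding this read would exceed the limit.
        if block and bases + len(read) > max_bases:
            yield block
            block, bases = [], 0
        block.append(read)
        bases += len(read)
    if block:
        yield block


if __name__ == "__main__":
    for number, block in enumerate(split_into_blocks(["ACGT", "AC", "ACGTA", "A"], 6)):
        print(number, block)
```

For paired-end data the two fastq files must be cut at the same read boundaries so that mates stay in matching blocks; splitting on a fixed number of reads per block is one simple way to guarantee that.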
-
Alignment improvement
• Library duplicate removal
  • samtools, Picard
• Realignment around indels
  • GATK
• Base quality recalibration
  • GATK
-
Library duplicate removal
• PCR amplification step in library preparation can result in duplicate DNA fragments
  • PCR-free protocols exist but require larger volumes of input DNA
• Generally a low number of duplicates in good libraries; increases with depth of sequencing
• Duplicates can result in false SNP calls
  • manifest as high read depth support
• Removal method
  • Identify read-pairs where outer ends map to the same position on the genome and remove all but one copy
  • samtools rmdup
  • Picard/GATK MarkDuplicates
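The removal method described above can be sketched as a grouping step keyed on the outer coordinates of each read pair. This is not how samtools or Picard are implemented; the `ReadPair` type is hypothetical, strand and library are ignored for brevity, and keeping the copy with the highest summed quality is an assumed tie-breaking rule.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class ReadPair(NamedTuple):
    name: str
    chrom: str
    outer_start: int  # leftmost mapped coordinate of the fragment
    outer_end: int    # rightmost mapped coordinate of the fragment
    quality_sum: int  # summed base qualities, used to pick the copy to keep


def remove_duplicates(pairs: Iterable[ReadPair]) -> List[ReadPair]:
    """Keep one read pair per (chrom, outer_start, outer_end) position."""
    best: Dict[Tuple[str, int, int], ReadPair] = {}
    for pair in pairs:
        key = (pair.chrom, pair.outer_start, pair.outer_end)
        kept = best.get(key)
        # Assumed rule: prefer the copy with the higher summed quality.
        if kept is None or pair.quality_sum > kept.quality_sum:
            best[key] = pair
    return sorted(best.values(), key=lambda p: (p.chrom, p.outer_start, p.outer_end))


if __name__ == "__main__":
    pairs = [
        ReadPair("a", "1", 100, 400, 2500),
        ReadPair("b", "1", 100, 400, 2900),  # duplicate of "a", better quality
        ReadPair("c", "1", 150, 450, 2000),
    ]
    for pair in remove_duplicates(pairs):
        print(pair.name)
```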
-
Realignment
• Short indels in the sample relative to reference pose difficulties for alignment
  • Indels occurring near the ends of reads often not aligned correctly
  • Aligners prefer to introduce SNPs rather than an indel
• Realignment algorithm
  • Input set of known indel sites and a BAM file
    • Previously published indel sites, dbSNP, 1000 Genomes, or estimate from alignment
  • At each site, model the indel and reference haplotypes and select best fit with data
  • New BAM file produced, modified where indels have been introduced by realignment
• Implemented in GATK (IndelRealigner function)
-
Additional alignment issues
• Separate chromosomal BAMs
  • easier to process in parallel
• Realign/assemble unmapped reads
  • recover sequence missed due to reference incompatibility or incompleteness
-
SAM/BAM
• Sequence Alignment/Map format
  • unified format for storing read alignments to a reference genome
• BAM (Binary Alignment/Map) format
  • binary equivalent of SAM
• Features
  • stores alignments from most alignment programs
  • supports multiple sequencing technologies
  • supports indexing for quick retrieval
  • reads can be classed into logical groups
    • e.g. lanes, libraries, individuals
-
SAM file format
• Header
• Alignment lines (one per read)
  • 11 mandatory fields
  • several optional fields (format TAG:TYPE:VALUE)
Col  Field  Type  Description
1    QNAME  str   query name of the read or the read pair
2    FLAG   int   bitwise flag (pairing, mapped, mate mapped, etc.)
3    RNAME  str   reference sequence name
4    POS    int   1-based leftmost position of clipped alignment
5    MAPQ   int   mapping quality (Phred scaled)
6    CIGAR  str   extended CIGAR string (details of alignment)
7    RNEXT  str   mate reference name (‘=’ if same as RNAME)
8    PNEXT  int   position of mate/next segment
9    TLEN   int   observed template length
10   SEQ    str   segment sequence
11   QUAL   str   ASCII of Phred-scaled base quality
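The table above maps onto a small record type. The sketch below splits one tab-separated alignment line into the 11 mandatory fields and keeps optional TAG:TYPE:VALUE fields as raw strings; the `SamRecord` and `parse_sam_line` names and the example line are made up for illustration.

```python
from typing import Dict, NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: Dict[str, str]


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(fields)}")
    tags: Dict[str, str] = {}
    for item in fields[11:]:
        # Optional fields look like TAG:TYPE:VALUE; the value may contain ':'.
        tag, _type, value = item.split(":", 2)
        tags[tag] = value
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[6], int(fields[7]),
                     int(fields[8]), fields[9], fields[10], tags)


if __name__ == "__main__":
    print(parse_sam_line("r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"))
```

For real work the Pysam library wraps the samtools API and handles BAM as well as SAM; hand parsing like this is mainly useful for understanding the format.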
-
SAM format
• Example
  • http://picard.sourceforge.net/explain-flags.html
QNAME  IL4_315:7:105:408:43
FLAG   177
RNAME  X
POS    1741
MAPQ   0
CIGAR  1S35M
RNEXT  X
PNEXT  56845228
TLEN   0
SEQ    ATTTGGCTCTCTGCTTGTTTATTATTGGTGTATNGG
QUAL   +1,1+16;>;166>;>;;>>;>>>>>>,>>>>>+>>
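The FLAG value in the example (177) is a sum of bit values, which is what the explain-flags page decodes. A minimal sketch of the same decoding follows; the bit meanings are those of the SAM specification, while the short names and the `decode_flag` helper are illustrative.

```python
from typing import List

# Bit values from the SAM specification, with shortened descriptive names.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "read_unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "read_reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary_alignment"),
    (0x200, "fails_quality_checks"),
    (0x400, "duplicate"),
    (0x800, "supplementary_alignment"),
]


def decode_flag(flag: int) -> List[str]:
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


if __name__ == "__main__":
    print(decode_flag(177))
```

For the example record, 177 = 1 + 16 + 32 + 128: a paired read, both read and mate on the reverse strand, and this read is the second in its pair.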
-
SAM/BAM file processing tools
• samtools
  • C program and library
  • http://samtools.sourceforge.net
  • view: SAM-BAM conversion
  • sort, index, merge multiple BAM files
  • flagstat: summary counts of mapping flags
• Picard
  • Java program suite
  • http://picard.sourceforge.net
  • MarkDuplicates, CollectAlignmentSummaryMetrics, CreateSequenceDictionary, SamToFastq, MeanQualityByCycle
• Pysam
  • Python interface to samtools API
  • http://code.google.com/p/pysam/
-
Variant calling
• Call SNPs with genotypes (heterozygous and homozygous), indels and structural variants
• Tools
  • samtools, bcftools
  • GATK, SOAPsnp, Dindel
  • SVMerge
• File formats:
  • VCF, pileup
• Filters and calling protocols
  • depth, quality, strand bias, multiple samples
• Indels harder to call accurately than SNPs
  • structural variation harder still
-
Variant Call Format (VCF)
• Stores polymorphism data with annotation
  • SNPs, insertions, deletions and structural variants
• Can be indexed for fast data retrieval
• Variant calls across many samples
• Metadata
  • e.g. dbSNP accession, filter status, validation status
• Arbitrary tags can be used to describe new types of variant
• Note: binary BCF produced by samtools
  • get vcf with samtools mpileup | bcftools view
-
VCF Format
• Header
  • Arbitrary number of INFO definition lines starting with ‘##’
  • Column definition line starts with single ‘#’
• Mandatory columns
  • Chromosome (CHROM)
  • Position of the start of the variant (POS)
  • Unique identifiers of the variant (ID)
  • Reference allele (REF)
  • Comma-separated list of alternate non-reference alleles (ALT)
  • Phred-scaled quality score (QUAL)
  • Site filtering information (FILTER)
  • User-extensible annotation (INFO)
-
VCF format
• Example
CHROM    3
POS      74393
ID       .
REF      G
ALT      T
QUAL     999
FILTER   .
INFO     DP=31;AF1=0.7002;AC1=4;DP4=4,0,22,2;…
FORMAT   GT:PL:DP:GQ
sample1  1/1:181,57,0:19:57
sample2  1/1:90,15,0:5:16
sample3  0/0:0,12,85:4:7
see H. Li, Bioinformatics 27(21): 2987–2993 (2011) for details of likelihood and population genetic calculations
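The example record can be taken apart with a short parser. This sketch assumes tab-separated columns and a biallelic site; the function names and the `sample1`..`sample3` labels are illustrative, and the INFO string below simply omits the part elided with "…" on the slide. PL holds Phred-scaled genotype likelihoods in the order 0/0, 0/1, 1/1, so the smallest value marks the most likely genotype.

```python
from typing import Any, Dict, List

FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]
BIALLELIC_GENOTYPES = ["0/0", "0/1", "1/1"]  # PL order for one ALT allele


def parse_info(info: str) -> Dict[str, str]:
    """Split 'KEY=VALUE;FLAG;...' into a dict; flags map to ''."""
    if info == ".":
        return {}
    result: Dict[str, str] = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        result[key] = value
    return result


def parse_vcf_line(line: str, sample_names: List[str]) -> Dict[str, Any]:
    """Parse one VCF data line (not a '#' header line)."""
    fields = line.rstrip("\n").split("\t")
    record: Dict[str, Any] = dict(zip(FIXED_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")
    record["INFO"] = parse_info(record["INFO"])
    format_keys = fields[8].split(":") if len(fields) > 8 else []
    record["samples"] = {
        name: dict(zip(format_keys, value.split(":")))
        for name, value in zip(sample_names, fields[9:])
    }
    return record


def most_likely_genotype(pl: str) -> str:
    """Pick the genotype with the lowest Phred-scaled likelihood (biallelic only)."""
    values = [int(v) for v in pl.split(",")]
    return BIALLELIC_GENOTYPES[values.index(min(values))]


if __name__ == "__main__":
    line = "\t".join(["3", "74393", ".", "G", "T", "999", ".",
                      "DP=31;AF1=0.7002;AC1=4;DP4=4,0,22,2", "GT:PL:DP:GQ",
                      "1/1:181,57,0:19:57", "1/1:90,15,0:5:16", "0/0:0,12,85:4:7"])
    record = parse_vcf_line(line, ["sample1", "sample2", "sample3"])
    for name, call in record["samples"].items():
        print(name, call["GT"], most_likely_genotype(call["PL"]))
```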
-
More information
• SNP calling and genotyping
  • Samtools
    • http://bioinformatics.oxfordjournals.org/content/25/16/2078.long
    • http://samtools.sourceforge.net
  • GATK
    • http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
• VCF
  • VCFtools
    • http://vcftools.sourceforge.net
    • Danecek et al. Bioinformatics 27(15): 2156-2158 (2011)
  • http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
-
Structural variation
• Read depth and pairing information used to detect events
  • deviations from the expected fragment size
  • presence/absence of mate pairs
  • excessive/reduced read depth (CNV)
• Several methods/tools released
• SVMerge pipeline
  • makes SV predictions using a collection of callers
  • Input is one BAM file per sample
  • callers run individually & outputs converted into standard BED format
  • calls merged
  • computationally validated using local de novo assembly
  • http://svmerge.sourceforge.net/
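The pairing signals listed at the top of this slide (fragment size deviations and missing mates) can be illustrated with a toy classifier for a single read pair. This is a sketch, not the method used by SVMerge or any of its callers: the function name, the category labels and the threshold of 3 standard deviations are all assumptions.

```python
def classify_pair(insert_size: int, expected_mean: float, expected_sd: float,
                  n_sd: float = 3.0, mate_mapped: bool = True) -> str:
    """Label a read pair by how its mapped fragment size compares with the library."""
    if not mate_mapped:
        return "mate_missing"
    if insert_size > expected_mean + n_sd * expected_sd:
        # Mates map further apart on the reference than the fragment is long:
        # consistent with sequence deleted in the sample.
        return "possible_deletion"
    if insert_size < expected_mean - n_sd * expected_sd:
        # Mates map closer together than expected:
        # consistent with sequence inserted in the sample.
        return "possible_insertion"
    return "concordant"


if __name__ == "__main__":
    for size in (310, 900, 100):
        print(size, classify_pair(size, expected_mean=300, expected_sd=30))
```

A single discordant pair is weak evidence; callers typically require a cluster of pairs supporting the same event, often combined with read depth.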
-
Assembly
• Tools
  • ABySS
    • http://www.bcgsc.ca/platform/bioinfo/software/abyss
  • SGA
    • https://github.com/jts/sga
  • SOAPdenovo
    • http://soap.genomics.org.cn/soapdenovo.html
  • ALLPATHS-LG
    • http://www.broadinstitute.org/software/allpaths-lg/blog
  • Cortex
    • http://cortexassembler.sourceforge.net/
  • Velvet
    • http://www.ebi.ac.uk/~zerbino/velvet/
-
Assembly metrics
• N50, N10, N90 etc.
  • x% of the assembly is in fragments of length Nx or larger
• Number of contigs, mean/max contig length
• Realignment
  • fraction of read pairs mapped correctly
  • correct homozygous SNPs
  • identify breakpoints
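The Nx definition above can be computed by sorting contigs from longest to shortest and accumulating lengths until x% of the total is reached. A minimal sketch follows; the `nx` function name and the toy contig lengths are illustrative.

```python
from typing import Iterable


def nx(lengths: Iterable[int], x: float = 50.0) -> int:
    """Return Nx: the contig length at which the longest contigs cover x% of the assembly."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    threshold = sum(ordered) * x / 100.0
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return ordered[-1]  # only reached through floating-point rounding


if __name__ == "__main__":
    contigs = [2, 3, 4, 5, 6]
    for x in (10, 50, 90):
        print(f"N{x} =", nx(contigs, x))
```

With contigs of length 2, 3, 4, 5 and 6 (total 20), the two longest contigs already cover 11 bases, so N50 is 5.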
-
Thanks to Thomas Keane and the Vertebrate Resequencing team at WTSI for several slides