Sequence Alignment Continued
Lecture 3: August 28, 2012
Review from Last Lecture: Existing Tools
Different sequence alignment tasks and their tools:
• Database Search:
– BLAST, FASTA, HMMER
• Multiple Sequence Alignment:
– ClustalW, FSA
• Genomic Analysis:
– BLAT
• Short Read Sequence Alignment:
– BWA, Bowtie, drFAST, GSNAP, SHRiMP, SOAP, MAQ
Short Read Alignment Software
• Bowtie: a memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour.
• Burrows-Wheeler Aligner (BWA): an aligner that implements two algorithms, bwa-short and BWA-SW. The former works for query sequences shorter than 200 bp and the latter for longer sequences up to around 100 kbp.
Sequence Alignment/Map Format
Input: query and reference sequences -> Alignment Software -> SAM File -> Resequencing, RNA-Seq, SNPs
Understanding the Input and Output of BWA
Sequence Alignment/Map Format
Sequence Reads + Reference Sequence -> Alignment Software -> SAM File -> Resequencing, RNA-Seq, SNPs
• Reads: Illumina or 454 reads.
• Reference: whole genome, contig, chromosome.
• Alignment software: BWA, Bowtie, mrsFAST, GSNAP.
• Most of the analysis happens when considering the SAM files.
SAM format
“A tab-delimited text format consisting of a header section, which is optional, and an alignment section”
Example Headers:
@HD VN:1.0 SO:coordinate
@SQ SN:1 LN:249250621 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:1b22b98cdeb4a9304cb5d48026a85128
@SQ SN:2 LN:243199373 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:a0d9851da00400dec1098a9255ac712e
@SQ SN:3 LN:198022430 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:fdfd811849cc2fadebc929bb925902e5
@RG ID:UM0098:1 PL:ILLUMINA PU:HWUSI-EAS1707-615LHAAXX-L001 LB:80 DT:2010-05-05T20:00:00-0400 SM:SD37743
Example Alignments:
1:497:R:-272+13M17D24M 113 1 497 37 37M 15 100338662 0 CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG 0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>> XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37
19:20389:F:275+18M2D19M 99 1 17644 0 37M = 17919 314 TATGACTGCTAATAATACCTACACATGTTAGAACCAT >>>>>>>>>>>>>>>>>>>><<>>><<>>4::>>:<9 RG:Z:UM0098:1 XT:A:R NM:i:0 SM:i:0 AM:i:0 X0:i:4 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37
19:20389:F:275+18M2D19M 147 1 17919 0 18M2D19M = 17644 -314 GTAGTACCAACTGTAAGTCCTTATCTTCATACTTTGT ;44999;499<8<8<<<8<<><<<<><7<;<<<>><< XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:4 X1:i:0 XM:i:0 XO:i:1 XG:i:2 MD:Z:18^CA19
9:21597+10M2I25M:R:-209 83 1 21678 0 8M2I27M = 21469 -244 CACCACATCACATATACCAAGCCTGGCTGTGTCTTCT <;9<<5><<<<><<<>><<><>><9>><>>>9>>><> XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:5 X1:i:0 XM:i:0 XO:i:1 XG:i:2 MD:Z:35
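Each alignment line like the examples here is a tab-delimited record with eleven mandatory fields followed by optional TAG:TYPE:VALUE entries. A minimal Python sketch of splitting one record is shown below. The names SamRecord and parse_sam_line, the read name read_1, and the placeholder quality string are illustrative assumptions, not part of the lecture or of any library.

```python
from collections import namedtuple

# Field names follow the mandatory SAM columns; the tuple itself is illustrative.
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)

def parse_sam_line(line):
    """Split one tab-delimited SAM alignment line into named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    tags = {}
    for item in fields[11:]:
        # Optional fields look like NM:i:0 or RG:Z:UM0098:1.
        tag, typ, value = item.split(":", 2)
        tags[tag] = int(value) if typ == "i" else value
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq), cigar,
                     rnext, int(pnext), int(tlen), seq, qual, tags)

# Hypothetical record modeled on the slide; the quality string is a placeholder.
EXAMPLE = "\t".join([
    "read_1", "99", "1", "17644", "0", "37M", "=", "17919", "314",
    "TATGACTGCTAATAATACCTACACATGTTAGAACCAT", "I" * 37,
    "RG:Z:UM0098:1", "NM:i:0", "MD:Z:37",
])
record = parse_sam_line(EXAMPLE)
print(record.pos, record.mapq, record.cigar, record.tags["NM"])
```

Keeping the optional tags in a dictionary means aligner-specific fields such as X0 or XT can be looked up by name without changing the parser.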
The Alignment Column
Harvesting Information from SAM
• Query name, QNAME (SAM)/read_name (BAM).
• FLAG provides the following information:
– are there multiple fragments?
– are all fragments properly aligned?
– is this fragment unmapped?
– is the next fragment unmapped?
– is this query the reverse strand?
– is the next fragment the reverse strand?
– is this the last fragment?
– is this a secondary alignment?
– did this read fail quality controls?
– is this read a PCR or optical duplicate?
Bitwise Flags
FLAG: bitwise FLAG. Each bit is explained in the following table:
Bitwise Representation
1 = 00000001 paired-end read
2 = 00000010 mapped as proper pair
4 = 00000100 unmapped read
8 = 00001000 read mate unmapped
16 = 00010000 read mapped on reverse strand
Example: the flag 11 = 1 + 2 + 8 = 00001011 (conditions 1, 2, and 8)
• Flags 0, 4, and 16 are the flags most commonly used.
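The bitwise test can be sketched in a few lines of Python. This sketch covers only the five bits listed in the table, so any higher bits in a flag are ignored. The names FLAG_BITS and decode_flag are illustrative.

```python
# Bit values and descriptions taken from the table above; names are illustrative.
FLAG_BITS = {
    1: "paired-end read",
    2: "mapped as proper pair",
    4: "unmapped read",
    8: "read mate unmapped",
    16: "read mapped on reverse strand",
}

def decode_flag(flag):
    """Return the description of every listed bit that is set in a SAM FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

print(decode_flag(11))  # bits 1, 2, and 8 are set
print(decode_flag(16))  # a single-end read on the reverse strand
```

A flag of 0 sets none of these bits, which for a single-end read means it mapped to the forward strand.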
Mapping Quality
• Phred score, identical to the quality measure in the fastq file. For quality Q and error probability P:
P = 10 ^ (-Q / 10.0)
• If Q = 30, P = 1/1000: on average, one out of 1000 alignments will be wrong.
• As good as this sounds, it is not easy to compute such a quality.
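The conversion between Q and P is a one-line formula in either direction. A small Python sketch, with illustrative function names:

```python
import math

def mapq_to_error_prob(q):
    """Probability that the alignment is wrong, given Phred-scaled quality q."""
    return 10 ** (-q / 10.0)

def error_prob_to_mapq(p):
    """Phred-scaled quality for an error probability p (0 < p <= 1)."""
    return -10.0 * math.log10(p)

for q in (0, 10, 20, 30):
    print(q, mapq_to_error_prob(q))
```

Every 10 points of quality corresponds to a tenfold drop in the error probability, and Q = 0 corresponds to P = 1.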
Mapping Quality
• Repeat structure. Reads falling in repetitive regions usually get very low mapping quality.
• Base quality of the read. Low quality means the observed read sequence is possibly wrong, and a wrong sequence may lead to a wrong alignment.
• Sensitivity of the alignment algorithm. The true hit is more likely to be missed by an algorithm with low sensitivity, which also causes mapping errors.
• Paired end or not. Reads mapped in pairs are more likely to be correct.
BWA Specific High Scores
A read alignment with a mapping quality of 30 or above usually implies:
– The overall base quality of the read is good.
– The best alignment has few mismatches.
– The read has few or just one “good” hit on the reference, which means the current alignment is still the best even if one or two bases are actually mutations or sequencing errors.
BWA Specific Low Scores
Surprisingly difficult to track down the exact behavior.
• Q=0: if a read can be aligned equally well to multiple positions, BWA will randomly pick one position and give it a mapping quality of zero.
• Q=25: the edit distance equals the number of mismatches and is greater than zero.
What to do with low quality scores?
• Find repeat structures in the genome/contig.
• Determine if there is a problem with your alignment or data (e.g. all the reads mapped with low quality scores).
• Filter them out. It is very common to write a Perl/Python script to filter out poorly aligned reads.
• Many, many, many other possibilities.
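A filtering script of the kind mentioned in the list can be quite short. The Python sketch below keeps header lines and drops alignments whose MAPQ (column 5) is below a threshold. The default threshold of 30, the decision to also drop unmapped reads, and the function and file names are all assumptions made for illustration.

```python
def filter_sam_lines(lines, min_mapq=30):
    """Yield header lines and alignments with MAPQ of at least min_mapq."""
    for line in lines:
        if line.startswith("@"):
            yield line  # header lines pass through unchanged
            continue
        fields = line.split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & 4:
            continue  # unmapped read: dropped in this sketch
        if mapq >= min_mapq:
            yield line

def filter_sam_file(in_path, out_path, min_mapq=30):
    """Stream a SAM file through the filter; both paths are hypothetical."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(filter_sam_lines(src, min_mapq))
```

A call such as filter_sam_file("reads.sam", "reads.q30.sam") would write the filtered copy. Working line by line keeps memory use flat, which matters because SAM files are often many gigabytes.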
CIGAR String
• The CIGAR string is a compact representation of how the read aligned to the reference genome at that exact position.
• More specifically, the CIGAR string is a sequence of base lengths and the associated operations:
– match/mismatch with the reference.
– deleted/inserted from the reference.
Example of CIGAR
RefPos:      1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
Reference:   C  C  A  T  A  C  T  -  G  A  A  C  T  G  A  C  T  A  A  C
Read:                    A  C  T  A  G  A  A  -  T  G  G  C  T

In the SAM file you will have the following fields:
• POS: 5
• CIGAR: 3M1I3M1D5M
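To make the bookkeeping concrete, here is a small Python sketch that splits a CIGAR string into (length, operation) pairs and works out how many read bases and reference positions it covers. The slides only discuss M, I, and D; the other operation letters accepted below come from the SAM specification. Function names are illustrative.

```python
import re

# Operation letters defined by the SAM specification.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Return the CIGAR string as a list of (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops

def read_length(cigar):
    """Number of read bases described (operations that consume the read)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")

def reference_end(pos, cigar):
    """1-based position of the last reference base covered by the alignment."""
    span = sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")
    return pos + span - 1

print(parse_cigar("3M1I3M1D5M"))
print(read_length("3M1I3M1D5M"), reference_end(5, "3M1I3M1D5M"))
```

For the example above, the read has 12 bases and the alignment starting at POS 5 ends at reference position 16: insertions consume only the read and deletions consume only the reference.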
Final Comments
• BAM is a compressed version of the SAM file format. There are multiple programs that convert BAM files to SAM files and vice versa.
• Tablet (http://bioinf.scri.ac.uk/tablet/) is an easy-to-use program that allows you to visualize an alignment.
– You simply give it a SAM file and a FASTA file, and it reads the SAM file and shows you the alignment.
Finding Short Read Data
Where to obtain data?
• Answer: the NCBI website:
– NCBI contains multiple reference genomes (large fasta files) and short read data (fasta files that are primarily Illumina and 454 reads).
– Finding data is pretty trivial, either by going to the NCBI website directly or by using Google.
• For example: googling “e coli k12 reference genome fasta file” will take you directly to the Broad Institute's website and a link to the NCBI reference genome.
NCBI Short Read Archive: http://trace.ncbi.nlm.nih.gov/Traces/sra/
Using NCBI SRA
If you can download a movie or a TV show then you can download short read data from SRA (it's even easier). You just have to know what to look for:
1. Go to the “search”.
2. Type in the organism name and strain if you know it, e.g. “Escherichia coli str. K-12 substr. MG1655”.
3. Look at the query results, then download.
Running BWA
Steps in using BWA
Download and install BWA on Linux/Mac. If you are using the CS servers then you shouldn't have to do this step. Export the path or use the exact path.
    bunzip2 bwa-0.5.9.tar.bz2
    tar xvf bwa-0.5.9.tar
    cd bwa-0.5.9
    make
Download the reference genome using wget.
Create the index for the reference genome (assuming the reference sequences are in wg.fa). This only needs to be performed once for each genome. The example uses -a bwtsw, which suits large genomes such as human; use -a is for small genomes.
    bwa index -p hg19bwaidx -a bwtsw wg.fa
Mapping short reads to the reference genome.
1. Align sequences using multiple threads (e.g. 4 CPUs). Assume the short reads are in the s_3_sequence.txt.gz file.
    bwa aln -t 4 hg19bwaidx s_3_sequence.txt.gz > s_3_sequence.txt.bwa
2. Create the alignment in the SAM format (a generic format for storing large nucleotide sequence alignments):
    bwa samse hg19bwaidx s_3_sequence.txt.bwa s_3_sequence.txt.gz > s_3_sequence.txt.sam
Mapping long reads (454) can be done using the bwasw command:
    bwa bwasw hg19bwaidx 454seqs.txt > 454seqs.sam
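When many read files need mapping, the two short-read steps (bwa aln, then bwa samse) can be driven from a script. The Python sketch below only assembles the same command lines used in these slides and, if asked, runs them with the standard subprocess module. It assumes bwa is on the PATH and that the index already exists. The helper names and the file-naming rule (strip .gz, then append .bwa and .sam) are illustrative assumptions.

```python
import subprocess

def bwa_short_read_commands(index_prefix, reads, threads=4):
    """Build the aln and samse command lines with their output file names."""
    base = reads[:-3] if reads.endswith(".gz") else reads
    sai, sam = base + ".bwa", base + ".sam"
    return [
        (["bwa", "aln", "-t", str(threads), index_prefix, reads], sai),
        (["bwa", "samse", index_prefix, sai, reads], sam),
    ]

def run_commands(commands):
    """Run each command in order, redirecting its stdout to the paired file."""
    for argv, out_path in commands:
        with open(out_path, "wb") as out:
            subprocess.run(argv, stdout=out, check=True)

commands = bwa_short_read_commands("hg19bwaidx", "s_3_sequence.txt.gz")
for argv, out_path in commands:
    print(" ".join(argv), ">", out_path)
# run_commands(commands)  # uncomment once bwa and the index are in place
```

Passing check=True makes the script stop at the first failing step, so samse never runs on an incomplete aln output.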
Recap and Looking Forward
De novo vs. Re-sequencing
• De novo assembly (“from the beginning”) implies that you have no prior knowledge of the genome. No reference, no contigs, only reads.
• Re-sequencing assembly assumes you have a copy of the reference genome (that has been verified to a certain degree).
• The programs that work for re-sequencing will not work for de novo and vice versa. However, both can create copies of the genome.
De novo vs. Re-sequencing
Sample Preparation -> Fragments
Re-sequencing (LOCAS, SHRiMP) requires 15x to 30x coverage. With anything less, re-sequencing programs will either not produce results or produce questionable results.
Sample Preparation -> Fragments
De novo assembly requires higher coverage: at least 30x, and up to 100x. Most de novo assemblers require paired-end data.
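Coverage figures like these can be checked with the usual estimate: coverage = (number of reads x read length) / genome size. A Python sketch follows. The function names are illustrative, and the genome size in the example is a rough figure of about 4.6 Mbp assumed for E. coli K-12, not a value from the lecture.

```python
import math

def coverage(num_reads, read_length, genome_size):
    """Average number of times each base of the genome is sequenced."""
    return num_reads * read_length / genome_size

def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)

GENOME_SIZE = 4_600_000  # assumed approximate size of E. coli K-12, in bp
for target in (15, 30, 100):
    print(target, reads_needed(target, 35, GENOME_SIZE))
```

This is an average only. Repetitive or hard-to-sequence regions can end up well below the average, which is one reason the recommended coverage is far higher than 1x.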