recovered file 1 - cs.colostate.educs680/slides/lecture3.pdf · samformat’...


Page 1: Recovered File 1 - cs.colostate.educs680/Slides/lecture3.pdf · SAMformat’ “A’tabTdelimited’textformatconsis0ng’of’aheader’sec0on,’ which’is’op0onal,’and’an’alignmentsec0on”’

Sequence Alignment Continued

Lecture 3: August 28, 2012


Review from Last Lecture: Existing Tools


Different Sequence Alignment

• Database Search:
– BLAST, FASTA, HMMER
• Multiple Sequence Alignment:
– ClustalW, FSA
• Genomic Analysis:
– BLAT
• Short Read Sequence Alignment:
– BWA, Bowtie, drFAST, GSNAP, SHRiMP, SOAP, MAQ


Short Read Alignment SW

Bowtie: memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour.

Burrows-Wheeler Aligner (BWA): an aligner that implements two algorithms: bwa-short and BWA-SW. The former works for query sequences shorter than 200 bp and the latter for longer sequences up to around 100 kbp.


Sequence Alignment/Map Format

[Diagram: Input (query and reference sequences) → Alignment Software → SAM File → Resequencing, RNA Seq, SNPs]


Understanding the Input and Output of BWA


Sequence Alignment/Map Format

[Diagram: Sequence Reads + Reference Sequence → Alignment Software → SAM File → Resequencing, RNA Seq, SNPs]

Reads: Illumina or 454 reads. Reference: whole genome, contig, chromosome.

BWA, Bowtie, mrsFAST, GSNAP.

Most of the analysis happens when considering the SAM files.


SAM format

“A tab-delimited text format consisting of a header section, which is optional, and an alignment section”

Example Headers:

@HD VN:1.0 SO:coordinate
@SQ SN:1 LN:249250621 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:1b22b98cdeb4a9304cb5d48026a85128
@SQ SN:2 LN:243199373 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:a0d9851da00400dec1098a9255ac712e
@SQ SN:3 LN:198022430 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:fdfd811849cc2fadebc929bb925902e5
@RG ID:UM0098:1 PL:ILLUMINA PU:HWUSI-EAS1707-615LHAAXX-L001 LB:80 DT:2010-05-05T20:00:00-0400 SM:SD37743

Example Alignments:

1:497:R:-272+13M17D24M 113 1 497 37 37M 15 100338662 0 CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG 0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>> XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37

19:20389:F:275+18M2D19M 99 1 17644 0 37M = 17919 314 TATGACTGCTAATAATACCTACACATGTTAGAACCAT >>>>>>>>>>>>>>>>>>>><<>>><<>>4::>>:<9 RG:Z:UM0098:1 XT:A:R NM:i:0 SM:i:0 AM:i:0 X0:i:4 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37

19:20389:F:275+18M2D19M 147 1 17919 0 18M2D19M = 17644 -314 GTAGTACCAACTGTAAGTCCTTATCTTCATACTTTGT ;44999;499<8<8<<<8<<><<<<><7<;<<<>><< XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:4 X1:i:0 XM:i:0 XO:i:1 XG:i:2 MD:Z:18^CA19

9:21597+10M2I25M:R:-209 83 1 21678 0 8M2I27M = 21469 -244 CACCACATCACATATACCAAGCCTGGCTGTGTCTTCT <;9<<5><<<<><<<>><<><>><9>><>>>9>>><> XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:5 X1:i:0 XM:i:0 XO:i:1 XG:i:2 MD:Z:35
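The alignment records above can be split into their named columns with a few lines of Python. This is a minimal sketch, not part of the original slides: it assumes real tab-delimited input (the slide text shows spaces), and the names `SamRecord` and `parse_sam_line` are hypothetical. The eleven mandatory columns follow the SAM specification; anything after them is treated as an optional TAG:TYPE:VALUE field. The sample record reuses one of the example alignments with only three of its optional tags.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The 11 mandatory SAM columns plus a dict of optional tags."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: dict


def parse_sam_line(line: str) -> SamRecord:
    """Parse one tab-delimited alignment line (not a header line)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    tags = {}
    for item in fields[11:]:
        tag, typ, value = item.split(":", 2)
        # Only integer tags are converted; other types stay as strings.
        tags[tag] = int(value) if typ == "i" else value
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


# One of the example alignments, rebuilt with tabs between columns.
example = "\t".join([
    "19:20389:F:275+18M2D19M", "147", "1", "17919", "0", "18M2D19M",
    "=", "17644", "-314",
    "GTAGTACCAACTGTAAGTCCTTATCTTCATACTTTGT",
    ";44999;499<8<8<<<8<<><<<<><7<;<<<>><<",
    "XT:A:R", "NM:i:2", "MD:Z:18^CA19",
])
rec = parse_sam_line(example)
print(rec.qname, rec.flag, rec.pos, rec.mapq, rec.cigar, rec.tags)
```

Header lines (those starting with "@") should be skipped before calling the parser.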


The Alignment Column



Harvesting Information from SAM

• Query name, QNAME (SAM)/read_name (BAM).
• FLAG provides the following information:
– are there multiple fragments?
– are all fragments properly aligned?
– is this fragment unmapped?
– is the next fragment unmapped?
– is this query the reverse strand?
– is the next fragment the reverse strand?
– is this the last fragment?
– is this a secondary alignment?
– did this read fail quality controls?
– is this read a PCR or optical duplicate?


Bitwise Flags

FLAG: bitwise FLAG. Each bit is explained in the following table:

[Table of FLAG bits: image not recovered]


Bitwise Representation

1 = 00000001 paired-end read
2 = 00000010 mapped as proper pair
4 = 00000100 unmapped read
8 = 00001000 read mate unmapped
16 = 00010000 read mapped on reverse strand

Example:
The flag 11: 1 + 2 + 8 = 00001011 (conditions 1, 2, 8)

• Flags 0, 4, and 16 are the flags most commonly used.
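The bit arithmetic above can be checked with a small decoder. This is a sketch under stated assumptions: the bit values are the ones defined in the SAM specification (0x1 through 0x400), the descriptions paraphrase the FLAG questions listed earlier, and the names `FLAG_BITS` and `decode_flag` are hypothetical.

```python
# (bit value, meaning) pairs for the SAM FLAG field.
FLAG_BITS = [
    (0x1, "paired-end read (multiple fragments)"),
    (0x2, "mapped as proper pair"),
    (0x4, "unmapped read"),
    (0x8, "read mate unmapped"),
    (0x10, "read mapped on reverse strand"),
    (0x20, "mate mapped on reverse strand"),
    (0x40, "first fragment"),
    (0x80, "last fragment"),
    (0x100, "secondary alignment"),
    (0x200, "failed quality controls"),
    (0x400, "PCR or optical duplicate"),
]


def decode_flag(flag: int) -> list:
    """Return the meaning of every bit that is set in a FLAG value."""
    return [meaning for bit, meaning in FLAG_BITS if flag & bit]


# The slide's example: 11 = 1 + 2 + 8.
print(format(11, "08b"))
for meaning in decode_flag(11):
    print(" ", meaning)

# Flags taken from the example alignments shown earlier.
for flag in (113, 99, 147, 83):
    print(flag, decode_flag(flag))
```

A flag of 0 sets no bits at all, which means an unpaired read mapped to the forward strand.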


The Alignment Column


Mapping Quality

• Phred-scaled score, on the same scale as the quality measure in the fastq file. For quality Q, the probability P that the alignment is wrong is:

P = 10 ^ (-Q / 10.0)

• If Q=30, P=1/1000: on average, one out of 1000 alignments will be wrong.

• As good as this sounds, it is not easy to compute such a quality.
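The Phred relation above is easy to evaluate in both directions. The sketch below is illustrative only; the function names `phred_to_prob` and `prob_to_phred` are hypothetical, and the inverse formula Q = -10 log10(P) is simply the slide's equation solved for Q.

```python
import math


def phred_to_prob(q: float) -> float:
    """Error probability for a Phred-scaled quality: P = 10 ^ (-Q / 10)."""
    return 10 ** (-q / 10.0)


def prob_to_phred(p: float) -> float:
    """Phred-scaled quality for an error probability: Q = -10 * log10(P)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -10.0 * math.log10(p)


for q in (0, 10, 20, 30, 37):
    print(f"Q={q:2d}  P(wrong) = {phred_to_prob(q):.6f}")
```

Q=0 corresponds to P=1, which matches the convention described later that a mapping quality of zero carries no confidence in the reported position.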


Mapping Quality

• Repeat structure. Reads falling in repetitive regions usually get very low mapping quality.

• Base quality of the read. Low quality means the observed read sequence is possibly wrong, and wrong sequence may lead to a wrong alignment.

• Sensitivity of the alignment algorithm. The true hit is more likely to be missed by an algorithm with low sensitivity, which also causes mapping errors.

• Paired end or not. Reads mapped in pairs are more likely to be correct.

 


BWA Specific High Scores

A read alignment with a mapping quality 30 or above usually implies:

– The overall base quality of the read is good.
– The best alignment has few mismatches.
– The read has few or just one “good” hit on the reference, which means the current alignment is still the best even if one or two bases are actually mutations or sequencing errors.


BWA Specific Low Scores

Surprisingly difficult to track down the exact behavior.

• Q=0: if a read can be aligned equally well to multiple positions, BWA will randomly pick one position and give it a mapping quality of zero.

• Q=25: the edit distance equals the number of mismatches and is greater than zero.

 


What to do with low quality scores?

• Find repeat structures in the genome/contig.

• Determine if there is a problem with your alignment or data (e.g. all the reads mapped with low quality scores).

• Filter them out. It is very common to write a perl/python script to filter out poorly aligned reads.

• Many, many, many other possibilities.
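A filtering script of the kind mentioned above might look like the following. It is a minimal sketch, assuming tab-delimited SAM input; the name `filter_sam_lines` and the demo records are hypothetical, and the default cutoff of 30 is borrowed from the "mapping quality 30 or above" slide rather than being a recommended value.

```python
def filter_sam_lines(lines, min_mapq=30):
    """Yield header lines plus mapped alignments with MAPQ >= min_mapq."""
    for line in lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])   # column 2: bitwise FLAG
        mapq = int(fields[4])   # column 5: mapping quality
        if flag & 0x4:
            # Unmapped reads are dropped regardless of MAPQ.
            continue
        if mapq >= min_mapq:
            yield line


# Made-up records: one well mapped, one ambiguous (MAPQ 0), one unmapped.
demo = [
    "@HD\tVN:1.0\tSO:coordinate\n",
    "readA\t0\t1\t497\t37\t37M\t*\t0\t0\t*\t*\n",
    "readB\t16\t1\t17644\t0\t37M\t*\t0\t0\t*\t*\n",
    "readC\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*\n",
]
kept = list(filter_sam_lines(demo, min_mapq=30))
print("".join(kept), end="")
```

To filter a file on disk, the same generator can be fed an open file handle and its output written with `writelines` on a second handle.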


The Alignment Column


CIGAR String

• The CIGAR string is a compact representation of how the read aligned to the reference genome at that exact position.

• More specifically, the CIGAR string is a sequence of base lengths and the associated operations:
– match/mismatch with the reference.
– deleted from/inserted into the reference.


Example of CIGAR

RefPos:     1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T     G  A  A  C  T  G  A  C  T  A  A  C
Read:                   A  C  T  A  G  A  A     T  G  G  C  T

In the SAM file you will have the following fields:
• POS: 5
• CIGAR: 3M1I3M1D5M
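The example above can be replayed in code to see how POS and CIGAR together place the read on the reference. This is a sketch rather than a full implementation: the function names are hypothetical, the operation letters are those of the SAM specification, and soft or hard clipping and padding are handled only to the extent of skipping bases.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> list:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def query_length(ops) -> int:
    """Read bases consumed: M, I, S, = and X."""
    return sum(n for n, op in ops if op in "MIS=X")


def reference_span(ops) -> int:
    """Reference bases covered: M, D, N, = and X."""
    return sum(n for n, op in ops if op in "MDN=X")


def render(reference: str, pos: int, read: str, ops):
    """Return gapped reference and read rows; pos is 1-based as in SAM."""
    ref_row, read_row = [], []
    r, q = pos - 1, 0
    for n, op in ops:
        if op in "M=X":        # aligned bases (match or mismatch)
            ref_row.append(reference[r:r + n])
            read_row.append(read[q:q + n])
            r += n
            q += n
        elif op == "I":        # bases present only in the read
            ref_row.append("-" * n)
            read_row.append(read[q:q + n])
            q += n
        elif op in "DN":       # bases present only in the reference
            ref_row.append(reference[r:r + n])
            read_row.append("-" * n)
            r += n
        elif op == "S":        # soft-clipped read bases are skipped
            q += n
    return "".join(ref_row), "".join(read_row)


reference = "CCATACTGAACTGACTAAC"
read = "ACTAGAATGGCT"
ops = parse_cigar("3M1I3M1D5M")
ref_row, read_row = render(reference, 5, read, ops)
print(ref_row)
print(read_row)
print("last reference position covered:", 5 + reference_span(ops) - 1)
```

Note that M covers both matches and mismatches: the last 5M block aligns TGGCT in the read against TGACT in the reference.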


Final Comments

• BAM is a compressed binary version of the SAM file format. There are multiple programs that convert BAM files to SAM files and vice versa.

• Tablet (http://bioinf.scri.ac.uk/tablet/) is an easy-to-use program that allows you to visualize an alignment.
– You simply give it a sam file and a fasta file and it reads the sam file and shows you the alignment.


Finding Short Read Data


Where to obtain data?

• Answer: NCBI website:
– NCBI contains multiple reference genomes (large fasta files) and short read data (fasta files that are primarily Illumina and 454 reads).
– Finding data is pretty trivial by either going to the NCBI website directly or using Google.

• For example: googling “e coli k12 reference genome fasta file” will take you directly to the Broad Institute's website and a link to the NCBI reference genome.


NCBI Short Read Archive

http://trace.ncbi.nlm.nih.gov/Traces/sra/


Using NCBI SRA

If you can download a movie or a TV show then you can download short read data from SRA (it’s even easier). You just have to know what to look for:

1. Go to the “search”.
2. Type in the organism name and strain if you know it, e.g. “Escherichia coli str. K-12 substr. MG1655”.
3. Look at the query results, then download.


Running BWA


Steps in using BWA

Download and install BWA on Linux/Mac. If you are using the cs servers then you shouldn’t have to do this step. Export the path or use the exact path.

    bunzip2 bwa-0.5.9.tar.bz2
    tar xvf bwa-0.5.9.tar
    cd bwa-0.5.9
    make

Download the reference genome using wget.


Create the index for the reference genome (assuming the reference sequences are in wg.fa). This only needs to be performed once for each genome. Use “-a is” for small genomes and “-a bwtsw” for large genomes such as human.

    bwa index -p hg19bwaidx -a bwtsw wg.fa

Mapping short reads to the reference genome.

1. Align sequences using multiple threads (e.g. 4 CPUs). Assume the short reads are in the s_3_sequence.txt.gz file.

    bwa aln -t 4 hg19bwaidx s_3_sequence.txt.gz > s_3_sequence.txt.bwa


2. Create alignment in the SAM format (a generic format for storing large nucleotide sequence alignments):

    bwa samse hg19bwaidx s_3_sequence.txt.bwa s_3_sequence.txt.gz > s_3_sequence.txt.sam

Mapping long reads (454) can be done using the bwasw command:

    bwa bwasw hg19bwaidx 454seqs.txt > 454seqs.sam
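When the same index, aln and samse steps have to be repeated for many read files, it can help to generate the command lines from a script. The sketch below only assembles the commands shown on these slides and prints them. The helper names `bwa_short_read_steps` and `run_steps` are hypothetical, the output file naming simply mirrors the slide's example, and `run_steps` assumes a `bwa` executable is on the PATH.

```python
import subprocess


def bwa_short_read_steps(reference, reads, index_prefix,
                         threads=4, algorithm="bwtsw"):
    """Return (argv, stdout_path) pairs for bwa index, aln and samse."""
    stem = reads[:-3] if reads.endswith(".gz") else reads
    sai = stem + ".bwa"   # intermediate output of "bwa aln"
    sam = stem + ".sam"   # final alignment in SAM format
    return [
        (["bwa", "index", "-p", index_prefix, "-a", algorithm, reference], None),
        (["bwa", "aln", "-t", str(threads), index_prefix, reads], sai),
        (["bwa", "samse", index_prefix, sai, reads], sam),
    ]


def run_steps(steps):
    """Run each step in order, redirecting stdout to a file when one is given."""
    for argv, out_path in steps:
        if out_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(out_path, "wb") as out:
                subprocess.run(argv, check=True, stdout=out)


steps = bwa_short_read_steps("wg.fa", "s_3_sequence.txt.gz", "hg19bwaidx")
for argv, out_path in steps:
    redirect = f" > {out_path}" if out_path else ""
    print(" ".join(argv) + redirect)
# run_steps(steps)  # uncomment to actually invoke bwa
```

Since the index only needs to be built once per genome, the first step can be dropped from the list on later runs.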


Recap and Looking Forward


De novo vs. Re-sequencing

• De novo assembly (“from the beginning”) implies that you have no prior knowledge of the genome. No reference, no contigs, only reads.

• Re-sequencing assembly assumes you have a copy of the reference genome (that has been verified to a certain degree).

• The programs that work for re-sequencing will not work for de novo and vice versa. However, both can create copies of the genome.


De novo vs. Re-sequencing


Sample Preparation

Fragments

Re-sequencing (LOCAS, SHRiMP) requires 15x to 30x coverage. With anything less, re-sequencing programs will either produce no results or produce questionable results.



Sample Preparation

Fragments

De novo assembly requires higher coverage: at least 30x, and up to 100x. Most de novo assemblers require paired-end data.
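The coverage figures quoted for re-sequencing and de novo assembly can be related to a number of reads with the usual estimate coverage = (number of reads × read length) / genome size. That formula is standard but does not appear on the slides, so treat the sketch below as an assumption. The function names are hypothetical, and the 4.6 Mbp genome size is a round illustrative figure, roughly the size of an E. coli K-12 genome.

```python
import math


def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target * genome_size / read_length)


GENOME_SIZE = 4_600_000   # illustrative, roughly E. coli sized
READ_LENGTH = 35          # bp, as in the 35-bp reads mentioned for Bowtie

for target in (15, 30, 100):
    n = reads_needed(target, READ_LENGTH, GENOME_SIZE)
    print(f"{target:3d}x coverage needs about {n:,} reads of {READ_LENGTH} bp")
```

This is an average only: repeats and sequencing bias leave some regions well below the mean, which is one reason to aim above the minimum.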