MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei University College of Medicine Genome Informatics I (2015 Spring)

Upload: aubrey-alexander

Post on 29-Dec-2015


Genome Informatics I (2015 Spring)

MES7594-01 Genome Informatics I

- Lecture V. Short Read Alignment

Sangwoo Kim, Ph.D.
Assistant Professor,
Severance Biomedical Research Institute, Yonsei University College of Medicine

Overview

• Goal of this lecture
  – You will learn the principle of mapping NGS short reads to a reference genome and practice with alignment tools

• Short Read Alignment Theory
  – Why do we need a special algorithm?
  – The Burrows-Wheeler Transformation (BWT)
    • BWT indexing
    • LF search
    • Examples

• Practice with BWA
  – with NA18507 sequences

• Understanding alignment information
  – Viewing/Converting SAM/BAM format
  – Interpreting alignment information


SHORT READ ALIGNMENT THEORY

RAW NGS DATA (FASTQ)

@SRR764745.4352210/1
TAACACCTGGGAAATTCATCACAAAAAGATCTTAGCCTAGGCACATTGTCATTAGGTTATCCAAAGTTAAGACAAAGGAAAGAATCTTAAGAGCTGTGAGA
+
5FIFEFHFGHHEFFEEIFFIFHFGGGGKGFJHFEKJJIFKKJGHGGGJFKHGGGLLFGGHLKHJJMGGGJNJKIJJLLIIIKJIHIKJEGFACGEEEDC>
[email protected]/1
ATATATGAAGGAAAGATACAGTCATTTTCAGACAAACAAATGCTGACAGAATTTGCCATTACCAAGCCAGGACTCTAAGAACTGCTAAAAGGAGCTCTAAA
+
6FFDBDGDEGFEEEGEDBEEFDFEEDEEFFGEEFGFFFGFEHGGHEFFGFGEFFHGGFFFDGGGGHGGGHHGFHGGEGHGHFGIIGCFFFED?ADC>B<>>
@SRR764746.2695391/1
TAAAAGAGACAAAGAGAGACAGTATATCATCTGTCATCTGACAGTCTCATCCAACAGAAAAATATGACAATCCTAAACATATGTGAACCTAACACTGGAGC
+
6FIEEFDFEEEFEFEFEFEEEFDBECEFFEFFGEFFEFGHEFFGDGGFFEEGFGFFHFGGGGEDFHFFGHFGFHFGGGFFEFIGJFGGIHBDECCCD?;>
[email protected]/1
TTAAATAACCTGCTCCTGAATGAGCATTGGGTGAAAAACGAAATCAAGATGGAAATGTAAAAAATTTCTTCGAACTGGATGACACAACCTATCAAGACCTC
+
5FBCC@A*CHDFDDDDEFBDDGADFCBDFFEEGEGADEEAE4DEFFEGBEHE8;ADHD@DGGFCGDEDGFB==B?GNG@FMC@JFF>:FG=DDED=&>@A#
@SRR764746.5506495/1
CACAACCTATCAAGACCTCTGGGATACAGCAAAGGCAGTGCTAAGAGGAAAGTTTATAGCACTAAACACCTACGTCGAAAAGTCTGAAAGAGCACAGACAA
+
5HIDDDEEBDEEEFEEEFEFGFFEECFFGFFFFGFFFGDHGGCFGFGGFGGHDEFDFDHGGFGDGGFGFGFDFAEFBCFFFFJDIKCEEFACFBCA?;A@
[email protected]/1
CCATAGAAAGGAATGAATTAACAGCATTTCCTGTGACCTGGACGAGATTGGAGACTATTGTTCTAAGTGATGTAACCCAGGAATGGAAAACTCAACATTGT
+
5IHCBE@EEFFDEDGDEDDCFEEGFEEEDFDFGEHEFFFHEBHABHDEDHGDGFFGDFFHEEGGDGHFIFFIEDGFGHGHHCJCIGCEEEHFAB?B@<
[email protected]/1
TGTCCTTTCCAGGGACATGGATGAAGCTGGAAACCATCATTCTCAGCAAACTAACACAAGAAAAGAAAACCAGGCCAGGAGCAGTGGCTCATGCCTGTAGT
+
5JIAIHEDHHDHGGFFFEIJFFHDCIHHHKFGHIIGGFGGGGHIGDGGIIIIGGJGFGGIIFHHKHIJIJKHLKILGCIIHMHKDKMLKFJBHHHBGFABB
@SRR764745.944258/1
GAGAACACATGGACACAGGGAGGGGAACATCACACACTGGGGCCTGTCAAAGGGTGGGAGGCTGGGGGAGGAACAGCATTAGGAGAAATACCTAATGTAGA
+
5FFDEFEFEDIH?CECEHEHCHIJI>BCCCIDFFFFIHIBHBHFAAFEGGFHMM8FDCDGIEHGAGG@BGAAFKH?6>DKDDNIK?9<FHGBICDBG@<<
[email protected]/1
TGGGGAAAAAAAACATTCTCTGAAATTTGCTTTTATACCATTAAAGACTTATTTTTTATTACCAGCAATACAGGGCAACTCATTCAGGTTGAATCTTGAAG
+
6NMHHFBGGFFEGHEEEIHIDIFGFDFFHFFEFEEGFIJGGGEHHLHIJEFHGHGHFFGGFJKHJJHHFFMHKNBEIFMMGLEIGJHMJCM@CA?FCD;GB

Mapping back to genome

TAACACCTGGGAAATTCATCACAAAAAGATCTTAGCCTAGGCACATTGTCATTAGGTTATCCAAAGTTAAGACAAAGGAAAGAATCTTAAGAGCTGTGAGA

Where is this sequence in the human genome?

Do this as fast as possible!

brute force way

Find “GATTCAAA” in human genome

T G A C G T G T G A T T C A A A A A A G C   The reference genome (chr1, start). This is very long (3 billion)
G A T T C A A A                             Your query
  G A T T C A A A
    G A T T C A A A
      G A T T C A A A
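The brute-force idea on this slide can be sketched in a few lines of Python. This is only an illustration on the toy reference shown above; the function name `naive_find` is made up for this sketch.

```python
def naive_find(reference, query):
    """Return every 0-based offset where query occurs exactly in reference."""
    hits = []
    # Slide the query along the reference one position at a time.
    for offset in range(len(reference) - len(query) + 1):
        if reference[offset:offset + len(query)] == query:
            hits.append(offset)
    return hits


# The 21-base toy reference and the query from the slide.
reference = "TGACGTGTGATTCAAAAAAGC"
hits = naive_find(reference, "GATTCAAA")
print(hits)
```

Every read is compared against every offset of the reference, so the work grows with (reference length x read length) per read, which is why this approach does not scale to a 3 billion base genome.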

How fast should it be?

                      time per 1 read (sec)   time per 80x WGS (sec)   is equal to
eyeballing            3x10^9                  3.6x10^18                1x10^11 yrs
naïve matching        2400                    1.2x10^9                 7,608 yrs
improved algorithm    3                       3.6x10^8                 10 yrs
minimum required      0.01                    1.2x10^7                 11.5 days
desired               0.001                   1.2x10^6                 1.2 days

based on 200bp read length, 80x single-end WGS

Searching with index

• Assume you’re searching for “genome” in an English dictionary
  – You don’t search every line on every page
  – You first find the page range of “g” in the dictionary
  – in the above range (of ‘g’), you find the page range of “ge” in the dictionary
  – in the above range (of ‘ge’), you find the page range of “gen” in the dictionary
  – ...
  – until you find “genome”
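The dictionary lookup above can be sketched as repeated range narrowing on a sorted word list. This is a minimal illustration only; the word list and the function name `narrow_by_prefix` are assumptions made for the example, and the "\uffff" sentinel assumes no word contains that character.

```python
from bisect import bisect_left


def narrow_by_prefix(sorted_words, word):
    """Narrow a half-open range [lo, hi) of a sorted list, one prefix at a time."""
    lo, hi = 0, len(sorted_words)
    steps = []
    for length in range(1, len(word) + 1):
        prefix = word[:length]
        # Words starting with `prefix` form one contiguous block inside [lo, hi).
        new_lo = bisect_left(sorted_words, prefix, lo, hi)
        new_hi = bisect_left(sorted_words, prefix + "\uffff", lo, hi)
        lo, hi = new_lo, new_hi
        steps.append((prefix, lo, hi))
    return steps


# Hypothetical mini dictionary, sorted like the pages of a real one.
words = sorted(["apple", "gate", "gene", "genome", "genus", "giraffe", "read", "zebra"])
steps = narrow_by_prefix(words, "genome")
for prefix, lo, hi in steps:
    print(prefix, words[lo:hi])
```

Each step searches only inside the range left by the previous step, which is the same idea the BWT-based search uses on the genome.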

Indexing genome

• We are going to make an index for the genome
  – to make it possible to search for a read sequence as we do in an English dictionary

Burrows-Wheeler Transformation

BANANA

BANANA$   (“$” is lexicographically smallest)

Rotate the string one character at a time:

0 BANANA$
1 ANANA$B
2 NANA$BA
3 ANA$BAN
4 NA$BANA
5 A$BANAN
6 $BANANA

Sort the rotations (columns: sorted row, original rotation number, rotation):

0 6 $BANANA
1 5 A$BANAN
2 3 ANA$BAN
3 1 ANANA$B
4 0 BANANA$
5 4 NA$BANA
6 2 NANA$BA

last column: ANNB$AA

BWT(“BANANA$”) = “ANNB$AA”

1. BWT just changes the order of the string
2. BWT tends to collect similar characters together
3. With only the transformed string, we can easily get the original string
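The construction walked through on these slides (rotate, sort, take the last column) can be sketched directly in Python. This is a teaching sketch, not how BWA builds its index: it stores every rotation in memory, which is fine for “BANANA$” but not for a genome. The function name is made up for this example.

```python
def bwt_with_rotation_order(text):
    """Return (bwt, order) for a text that ends with the sentinel '$'.

    order[row] is the original rotation number of each sorted row,
    i.e. the second column of the sorted table on the slide.
    """
    rotations = [(text[i:] + text[:i], i) for i in range(len(text))]
    rotations.sort()  # '$' sorts before the letters in ASCII
    bwt = "".join(rotation[-1] for rotation, _ in rotations)
    order = [i for _, i in rotations]
    return bwt, order


bwt, order = bwt_with_rotation_order("BANANA$")
print(bwt)
print(order)
```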

Inverse BWT

We are given “ANNB$AA” (the last column of the sorted table). The table to recover is:

0 6 $BANANA
1 5 A$BANAN
2 3 ANA$BAN
3 1 ANANA$B
4 0 BANANA$
5 4 NA$BANA
6 2 NANA$BA

Given (last column):      A   N   N   B   $   A   A
sort:                     $   A   A   A   B   N   N
Attach the last column:   A$  NA  NA  BA  $B  AN  AN
sort:                     $B  A$  AN  AN  BA  NA  NA
Attach the last column:   A$B NA$ NAN BAN $BA ANA ANA

Repeat sort and attach until every row has 7 characters. After a final sort the rows are the table above, and the row ending with “$” is the original string.
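The sort-and-attach procedure above can be written as a short loop. This is the naive textbook inversion, shown only to mirror the slides; the function name is an assumption of this sketch.

```python
def inverse_bwt(last_column):
    """Recover the original '$'-terminated text from its BWT."""
    rows = [""] * len(last_column)
    for _ in range(len(last_column)):
        # Attach the last column in front of every row, then sort the rows.
        rows = sorted(ch + row for ch, row in zip(last_column, rows))
    # After len(last_column) rounds the rows are the sorted rotations;
    # the rotation that ends with '$' is the original string.
    return next(row for row in rows if row.endswith("$"))


original = inverse_bwt("ANNB$AA")
print(original)
```

The first round of the loop produces the sorted column `$AAABNN`, and the second round produces `$B A$ AN AN BA NA NA`, matching the intermediate states shown on the slides.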

LF Search

0 6 $BANANA
1 5 A$BANAN
2 3 ANA$BAN
3 1 ANANA$B
4 0 BANANA$
5 4 NA$BANA
6 2 NANA$BA

Question: Find “NAN” from BANANA

The query is extended one character at a time from its end: “N”, then “AN”, then “NAN”.

Step 1. The range of strings that start with “N” can be calculated from:
• the number of symbols that are lexicographically less than ‘N’
  – to determine the start point
  – = 5
• the number of ‘N’
  – to determine the end point
  – = 2
• so the range is rows 5 to 6

Step 2. The range of strings that start with “AN”:
• Using only the number of symbols that are lexicographically less than ‘A’ (= 1) and the number of ‘A’ (= 3) gives rows 1 to 3. This is a range for ‘A’, not ‘AN’!!
• Instead, count ‘A’ in the last column, relative to the previous range (rows 5 to 6):
  – the number of symbols that are lexicographically less than ‘A’ + number of ‘A’ before start point
    • to determine the start point
    • count of ‘A’ before start point = 1
    • = 1 + 1 = 2
  – the number of ‘A’ before end point
    • to determine the end point
    • = 3
• so the range is rows 2 to 3
• “Ax” (row 1, A$BANAN) is not “AN” and is less than “AN”, so it is skipped

Step 3. The range of strings that start with “NAN” can be calculated from:
• the number of symbols that are lexicographically less than ‘N’ + number of ‘N’ before start point
  – to determine the start point
  – = 5 + 1 = 6
• the number of ‘N’ before end point
  – to determine the end point
  – = 2
• so the range is row 6 only (start = end)

Result: row 6 is the 2nd row of the original permutation
= number of rotations of the original string
= “NAN” exists at the 3rd position of “BANANA”
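The three steps above can be sketched as a backward search over the BWT. This is a minimal illustration under simplifying assumptions: the counts are recomputed with `str.count` on every step (a real index precomputes them), the row range is kept half-open as `[start, end)`, and all names are made up for the example.

```python
def build_index(text):
    """Build the BWT, rotation order, and first-row table for text ending in '$'."""
    order = sorted(range(len(text)), key=lambda i: text[i:] + text[:i])
    # The last character of the rotation starting at i is text[i - 1].
    bwt = "".join(text[i - 1] for i in order)
    # first_row[c] = number of symbols in the text that are smaller than c.
    first_row = {}
    for row, i in enumerate(order):
        first_row.setdefault(text[i], row)
    return bwt, order, first_row


def backward_search(text, query):
    """Return sorted 0-based positions of query in text (text ends with '$')."""
    bwt, order, first_row = build_index(text)
    start, end = 0, len(bwt)  # half-open range of sorted rows
    for ch in reversed(query):
        if ch not in first_row:
            return []
        # (symbols smaller than ch) + (ch in the last column before the boundary)
        start, end = (first_row[ch] + bwt[:start].count(ch),
                      first_row[ch] + bwt[:end].count(ch))
        if start >= end:
            return []
    return sorted(order[row] for row in range(start, end))


positions = backward_search("BANANA$", "NAN")
print(positions)
```

For “NAN” the ranges go `[5, 7)`, `[2, 4)`, `[6, 7)`, the same start points 5, 2 and 6 as in the slides, and row 6 maps back to rotation 2, i.e. the 3rd position of “BANANA”.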

Genome query

imported from Mike Schatz’s slide
http://schatzlab.cshl.edu/teaching/2010/Lecture%202%20-%20Sequence%20Alignment.pdf


Inexact matching

T G A C G T G T G A T T C A A A A A A G C

G A T T G A A A

When an exact match does not exist:
• continue with the other possible candidates (G -> A, C, T) and increase the mismatch count
• If another mismatch occurs, branch out again.
• So the allowed edit distance is critical to alignment speed
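The branching described above can be sketched by extending the backward search so that, at each query position, every base is tried and a mismatch budget is spent when the base differs from the query. This sketch handles substitutions only (no insertions or deletions) and is not BWA's actual search strategy; all names are assumptions of the example.

```python
def search_with_mismatches(reference, query, max_mismatches):
    """Return sorted (offset, mismatches) pairs for query against reference."""
    text = reference + "$"
    order = sorted(range(len(text)), key=lambda i: text[i:] + text[:i])
    bwt = "".join(text[i - 1] for i in order)
    first_row = {}
    for row, i in enumerate(order):
        first_row.setdefault(text[i], row)
    alphabet = [c for c in first_row if c != "$"]
    hits = {}

    def extend(pos, start, end, mismatches):
        if pos < 0:
            # Whole query consumed: every row in the range is a hit.
            for row in range(start, end):
                hits[order[row]] = mismatches
            return
        for ch in alphabet:
            cost = mismatches + (ch != query[pos])
            if cost > max_mismatches:
                continue  # prune this branch: mismatch budget exceeded
            new_start = first_row[ch] + bwt[:start].count(ch)
            new_end = first_row[ch] + bwt[:end].count(ch)
            if new_start < new_end:
                extend(pos - 1, new_start, new_end, cost)

    extend(len(query) - 1, 0, len(bwt), 0)
    return sorted(hits.items())


reference = "TGACGTGTGATTCAAAAAAGC"
one_mismatch = search_with_mismatches(reference, "GATTGAAA", 1)
print(one_mismatch)
```

With a budget of 0 the search behaves like the exact search and finds nothing for “GATTGAAA”; each extra allowed mismatch opens more branches, which is why the allowed distance drives the running time.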

Goal achieved

                      time per 1 read (sec)   time per 80x WGS (sec)   is equal to
eyeballing            3x10^9                  3.6x10^18                1x10^11 yrs
naïve matching        2400                    1.2x10^9                 7,608 yrs
improved algorithm    3                       3.6x10^8                 10 yrs
minimum required      0.01                    1.2x10^7                 11.5 days
desired               0.001                   1.2x10^6                 1.2 days


PRACTICE WITH BWA


BWA

bwa practice

• In the cluster
  – >bwa

bwa process

• bwa index
  – to index the reference genome (one-time process)
  – = to create the BWT for the reference genome
• bwa aln
  – will calculate suffix array (SA) coordinates
• bwa samse (or bwa sampe for paired-end sequencing)
  – will convert the SA coordinates to chromosomal locations
• Input for bwa
  – reference genome
  – fastq file (the raw NGS data)
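The aln and samse steps listed above can be chained from a small Python wrapper. This is a sketch only: the helper names and file names are hypothetical, it assumes `bwa` is on the PATH and that the reference has already been indexed with `bwa index`, and on the class cluster the commands are submitted as job scripts rather than run directly.

```python
import subprocess


def bwa_single_end_commands(reference, fastq, prefix):
    """Return (argv, output_path) pairs for the aln -> samse steps."""
    sai = prefix + ".sai"
    sam = prefix + ".sam"
    return [
        (["bwa", "aln", reference, fastq], sai),           # SA coordinates
        (["bwa", "samse", reference, sai, fastq], sam),    # chromosomal locations
    ]


def run(commands):
    """Run each command, redirecting its standard output to the given file."""
    for argv, output_path in commands:
        with open(output_path, "wb") as out:
            subprocess.run(argv, stdout=out, check=True)


# Hypothetical file names; nothing is executed until run(commands) is called.
commands = bwa_single_end_commands("genome.fa", "yourdata.fastq", "yourdata")
for argv, output_path in commands:
    print(" ".join(argv), ">", output_path)
```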

reference data

“bwa index” will index the reference genome (so that the reference is ready). It is already done here; do not try to do it again.

sequence data

- Pick one chromosome for you
- copy the fastq file to your directory
- use “cp” command to do it
- example (copying chr8 NGS data to rachmani directory)

>cp NA18507_chr8.* /scratch/2015_GenomeInformatics/rachmani/

run bwa aln

>bwa aln reference yourdata.fastq > yourdata.sai

example
>bwa aln /data/resources/reference/human/UCSC/hg19/BWAIndex/genome.fa NA18507_chr8.01.fastq > NA18507_chr8.01.sai

write a job script: runbwaaln.sh

submit to cluster:
>qsub runbwaaln.sh

run bwa samse

>bwa samse reference yourdata.sai yourdata.fastq > yourdata.sam

example
>bwa samse /data/resources/reference/human/UCSC/hg19/BWAIndex/genome.fa NA18507_chr8.01.sai NA18507_chr8.01.fastq > NA18507_chr8.01.sam

write a job script: runbwasamse.sh

submit to cluster:
>qsub runbwasamse.sh


the output

>less NA18507_chr8.01.sam

This is your first alignment with real NGS data

break

• Please ask us any questions if you have problems (do not give up)

• If possible, try mapping in a paired-end mode
  – bwa sampe reference data01.sai data02.sai data01.fastq data02.fastq > output.sam

The SAM Format

For more details about the SAM format, please refer to:
https://samtools.github.io/hts-specs/SAMv1.pdf

SAM/BAM

• SAM and BAM are convertible (exactly the same information)
• SAM file
  – human-readable text file
• BAM file (binary)
  – binary file, not human-readable
  – compressed (much smaller size)
  – can be indexed (for random access)
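Because SAM is plain tab-separated text, one alignment line can be split into its 11 mandatory fields with a few lines of Python. This is a minimal sketch: the example record is made up for illustration, and only two FLAG bits (0x4 unmapped, 0x10 reverse strand) are decoded.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam_line(line):
    """Split one SAM alignment line into a dict of its mandatory fields."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["is_unmapped"] = bool(record["FLAG"] & 0x4)
    record["is_reverse"] = bool(record["FLAG"] & 0x10)
    record["tags"] = columns[11:]  # optional TAG:TYPE:VALUE fields
    return record


# Hypothetical record, not taken from the NA18507 data.
example = "read1\t16\tchr8\t1000\t37\t8M\t*\t0\t0\tGATTCAAA\tIIIIIIII\tNM:i:0"
record = parse_sam_line(example)
print(record["RNAME"], record["POS"], record["CIGAR"], record["is_reverse"])
```

Header lines in a SAM file start with “@” and should be skipped before calling a parser like this one.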

Converting SAM to BAM

• >samtools view -Sb yourdata.sam > yourdata.bam
  – -S option means input is SAM format
  – -b option means output is BAM format

Sorting and Indexing BAM

• samtools sort yourdata.bam yourdata.sorted
  – will create yourdata.sorted.bam
• samtools index yourdata.sorted.bam
  – will create yourdata.sorted.bam.bai
• Now everything’s ready


Visualizing alignment

• IGV (Integrative Genomics Viewer)

• samtools tview yourdata.bam reference
  – example:
    >samtools tview NA18507_chr8.01.sorted.bam /data/resource/reference/human/UCSC/hg19/BWAIndex/genome.fa
