16 assembly scs v2 - department of computer science

62
Assembly & Shortest Common Superstring Ben Langmead Department of Computer Science Please sign guestbook (www.langmead-lab.org/teaching-materials) to tell me briey how you are using the slides. For original Keynote les, email me ([email protected]).

Upload: others

Post on 03-Oct-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 16 assembly scs v2 - Department of Computer Science

Assembly & Shortest Common SuperstringBen Langmead

Department of Computer Science

Please sign guestbook (www.langmead-lab.org/teaching-materials) to tell me briefly how you are using the slides. For original Keynote files, email me ([email protected]).

Page 2: 16 assembly scs v2 - Department of Computer Science

Input DNA

Reads Reference genome

+

Assembly

XHow do we assemble puzzle without the benefit of knowing what the finished product should look like?

(That's what the Human Genome Project had to do!)

Page 3: 16 assembly scs v2 - Department of Computer Science

De novo shotgun assembly

Page 4: 16 assembly scs v2 - Department of Computer Science

Assembly

Whole-genome “shotgun” sequencing first copies the input DNA:

Input: GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT

Copy: GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT

Fragment: GGCGTCTA TATCTCGG CTCTAGGCCCTC ATTTTTT GGC GTCTATAT CTCGGCTCTAGGCCCTCA TTTTTT GGCGTC TATATCT CGGCTCTAGGCCCT CATTTTTT GGCGTCTAT ATCTCGGCTCTAG GCCCTCA TTTTTT

Then fragments it:

“Shotgun” refers to the random fragmentation of the whole genome; like it was fired from a shotgun

Page 5: 16 assembly scs v2 - Department of Computer Science

Assembly: Human Genome Project debate

Weber, James L., and Eugene W. Myers. "Human whole-genome shotgun sequencing." Genome Research 7.5 (1997): 401-409.

Page 6: 16 assembly scs v2 - Department of Computer Science

Assembly: Human Genome Project debate

Weber, James L., and Eugene W. Myers. "Human whole-genome shotgun sequencing." Genome Research 7.5 (1997): 401-409.

Page 7: 16 assembly scs v2 - Department of Computer Science

Assembly: Human Genome Project debate

Green, Philip. "Against a whole-genome shotgun." Genome Research 7.5 (1997): 410-417.

Page 8: 16 assembly scs v2 - Department of Computer Science

Assembly: Human Genome Project debate

Green, Philip. "Against a whole-genome shotgun." Genome Research 7.5 (1997): 410-417.

Page 9: 16 assembly scs v2 - Department of Computer Science

Assembly

Reconstruct thisFrom these

GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT

CTAGGCCCTCAATTTTT CTCTAGGCCCTCAATTTTT GGCTCTAGGCCCTCATTTTTT CTCGGCTCTAGCCCCTCATTTT TATCTCGACTCTAGGCCCTCA TATCTCGACTCTAGGCC TCTATATCTCGGCTCTAGG GGCGTCTATATCTCG GGCGTCGATATCT GGCGTCTATATCT

Page 10: 16 assembly scs v2 - Department of Computer Science

Assembly

Reconstruct thisFrom these

???????????????????????????????????

CTAGGCCCTCAATTTTT GGCGTCTATATCT CTCTAGGCCCTCAATTTTT TCTATATCTCGGCTCTAGG GGCTCTAGGCCCTCATTTTTT CTCGGCTCTAGCCCCTCATTTT TATCTCGACTCTAGGCCCTCA GGCGTCGATATCT TATCTCGACTCTAGGCC GGCGTCTATATCTCG

Page 11: 16 assembly scs v2 - Department of Computer Science

GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT

CTAGGCCCTCAATTTTT CTCTAGGCCCTCAATTTTT GGCTCTAGGCCCTCATTTTTT CTCGGCTCTAGCCCCTCATTTT TATCTCGACTCTAGGCCCTCA TATCTCGACTCTAGGCC TCTATATCTCGGCTCTAGG GGCGTCTATATCTCG GGCGTCGATATCT GGCGTCTATATCT

Coverage = 5

Coverage

Page 12: 16 assembly scs v2 - Department of Computer Science

GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT

CTAGGCCCTCAATTTTT CTCTAGGCCCTCAATTTTT GGCTCTAGGCCCTCATTTTTT CTCGGCTCTAGCCCCTCATTTT TATCTCGACTCTAGGCCCTCA TATCTCGACTCTAGGCC TCTATATCTCGGCTCTAGG GGCGTCTATATCTCG GGCGTCGATATCT GGCGTCTATATCT

Coverage = 5

Coverage

Page 13: 16 assembly scs v2 - Department of Computer Science

GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT

CTAGGCCCTCAATTTTT CTCTAGGCCCTCAATTTTT GGCTCTAGGCCCTCATTTTTT CTCGGCTCTAGCCCCTCATTTT TATCTCGACTCTAGGCCCTCA TATCTCGACTCTAGGCC TCTATATCTCGGCTCTAGG GGCGTCTATATCTCG GGCGTCGATATCT GGCGTCTATATCT 35 bases

177 bases

Average coverage = 177 / 35 ≈ 5-fold

Page 14: 16 assembly scs v2 - Department of Computer Science

TCTATATCTCGGCTCTAGG

TATCTCGACTCTAGGCC

Page 15: 16 assembly scs v2 - Department of Computer Science

TCTATATCTCGGCTCTAGG ||||||| ||||||| TATCTCGACTCTAGGCC

Page 16: 16 assembly scs v2 - Department of Computer Science

First law of assembly

If a suffix of read A is similar to a prefix of read B...

...then A and B might overlap in the genome

TCTATATCTCGGCTCTAGG ||||||| ||||||| TATCTCGACTCTAGGCC

TCTATATCTCGGCTCTAGG GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT TATCTCGACTCTAGGCC

Page 17: 16 assembly scs v2 - Department of Computer Science

TCTATATCTCGGCTCTAGG ||||||| ||||||| TATCTCGACTCTAGGCC

Why the differences?

1. Sequencing errors 2. Ploidy: e.g. humans have 2 copies of each

chromosome, and copies can differ

Page 18: 16 assembly scs v2 - Department of Computer Science

Second law of assembly

More coverage leads to more and longer overlaps

GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT

CTAGGCCCTCAATTTTT CTCGGCTCTAGCCCCTCATTTT TCTATATCTCGGCTCTAGG GGCGTCGATATCT

CTAGGCCCTCAATTTTT GGCTCTAGGCCCTCATTTTTT CTCGGCTCTAGCCCCTCATTTT TATCTCGACTCTAGGCCCTCA TCTATATCTCGGCTCTAGG GGCGTCTATATCTCG GGCGTCTATATCT

less coverage

more coverage

Page 19: 16 assembly scs v2 - Department of Computer Science

TCTATATCTCGGCTCTAGG ||||||| ||||||| TATCTCGACTCTAGGCC

Page 20: 16 assembly scs v2 - Department of Computer Science

TCTATATCTCGGCTCTAGG ||||||| ||||||| TATCTCGACTCTAGGCC

TATCTCGACTCTAGGCC |||| |||||| || CTCGGCTCTAGCCCCTCAT

Page 21: 16 assembly scs v2 - Department of Computer Science

Directed graph

PoloniusHamlet

NodeEdge

Page 22: 16 assembly scs v2 - Department of Computer Science

Directed graph

Polonius

Ophelia

Laertes

Hamlet

GertrudeKing

Hamlet

Claudius

Page 23: 16 assembly scs v2 - Department of Computer Science

Overlap graph

Each node is a read

Draw edge A -> B when suffix of A overlaps prefix of B

CTCGGCTCTAGCCCCTCATTTT

CTCGGCTCTAGCCCCTCATTTT

GGCTCTAGGCCCTCATTTTTT

Page 24: 16 assembly scs v2 - Department of Computer Science

Overlap graph

Nodes: all 6-mers from GTACGTACGAT

Edges: overlaps of length ≥4

TACGAT

ACGTAC

GTACGT

4

CGTACG5

GTACGA

4

TACGTA54 4

5

4

4

5

5

5

Page 25: 16 assembly scs v2 - Department of Computer Science

Overlap graph

Nodes: all 6-mers from GTACGTACGAT

Edges: overlaps of length ≥4

TACGAT

ACGTAC

GTACGT

4

CGTACG5

GTACGA

4

TACGTA54 4

5

4

4

5

5

5

Page 26: 16 assembly scs v2 - Department of Computer Science

Shortest common superstring

Given set of strings S, find SCS(S): shortest string containing the strings in S as substrings

BAA AAB BBA ABA ABB BBB AAA BABS:

BAAAABBBAABAABBBBBAAABABConcat(S):24

SCS(S): AAABBBABAA10

Page 27: 16 assembly scs v2 - Department of Computer Science

TACGAT

ACGTAC

GTACGT

4

CGTACG5

GTACGA

4

TACGTA54 4

5

4

4

5

5

5

>>> scs(['GTACGT', 'TACGTA', 'ACGTAC', 'CGTACG', 'GTACGA', 'TACGAT']) 'GTACGTACGAT'

Reads: all 6-mers from GTACGTACGAT

Page 28: 16 assembly scs v2 - Department of Computer Science

Shortest common superstring

NP-complete: no efficient algorithms for large inputs

Page 29: 16 assembly scs v2 - Department of Computer Science

Idea: pick order for strings in S and construct superstring

AAA AAB ABA ABB BAA BAB BBA BBBorder 1:

AAA

Page 30: 16 assembly scs v2 - Department of Computer Science

AAA AAB ABA ABB BAA BAB BBA BBBorder 1:

AAAB

Idea: pick order for strings in S and construct superstring

Page 31: 16 assembly scs v2 - Department of Computer Science

AAA AAB ABA ABB BAA BAB BBA BBBorder 1:

AAABA

Idea: pick order for strings in S and construct superstring

Page 32: 16 assembly scs v2 - Department of Computer Science

AAA AAB ABA ABB BAA BAB BBA BBBorder 1:

AAABABB

Idea: pick order for strings in S and construct superstring

Page 33: 16 assembly scs v2 - Department of Computer Science

AAA AAB ABA ABB BAA BAB BBA BBBorder 1:

AAABABBAABABBABBB

Idea: pick order for strings in S and construct superstring

superstring 1

Page 34: 16 assembly scs v2 - Department of Computer Science

AAA AAB ABA ABB BAA BAB BBA BBBorder 1:

AAABABBAABABBABBB

Idea: pick order for strings in S and construct superstring

AAA AAB ABA BAB ABB BBB BAA BBAorder 2:

AAABABBBAABBA

superstring 1

superstring 2

Try all possible orderings and pick shortest superstring

If S contains n strings, n ! (n factorial) orderings possible

Page 35: 16 assembly scs v2 - Department of Computer Science

AAA AAB ABA ABB BAA BAB BBA BBBorder 1:

AAABABBAABABBABBB

AAA AAB ABA BAB ABB BBB BAA BBAorder 2:

AAABABBBAABBA

superstring 1

superstring 2

If S contains n strings, n ! (n factorial) orderings possible

Page 36: 16 assembly scs v2 - Department of Computer Science

Greedy shortest common superstring

AAB

ABB

BBABBB

AAA

2

1112

1

2

2 1

12

Page 37: 16 assembly scs v2 - Department of Computer Science

Greedy shortest common superstring

AAB

ABB

BBABBB

AAA

2

1112

1

2

2 1

12

Page 38: 16 assembly scs v2 - Department of Computer Science

Greedy shortest common superstring

ABB

BBABBB

AAAB 2

11 1

2

2 1

2

Page 39: 16 assembly scs v2 - Department of Computer Science

Greedy shortest common superstring

ABB

BBABBB

AAAB 2

11 1

2

2 1

2

Page 40: 16 assembly scs v2 - Department of Computer Science

Greedy shortest common superstring

BBAABBB

AAAB

21

1

2

1

Page 41: 16 assembly scs v2 - Department of Computer Science

Greedy shortest common superstring

BBAABBB

AAAB

21

1

2

1

Page 42: 16 assembly scs v2 - Department of Computer Science

Greedy shortest common superstring

ABBBA

AAAB

21

Page 43: 16 assembly scs v2 - Department of Computer Science

Greedy shortest common superstring

ABBBA

AAAB

21

Page 44: 16 assembly scs v2 - Department of Computer Science

Greedy shortest common superstring

AAABBBA superstring, length=7

Page 45: 16 assembly scs v2 - Department of Computer Science

Shortest common superstring: greedy

Greedy-SCS: in each round, merge pair of strings with maximal overlap. Stop when there’s 1 string left. l = minimum overlap.

AAA AAB ABB BBB BBA Input strings

Algorithm in action (l = 1):

AAB

ABB

BBABBB

AAA

2

1112

1

2

2 1

12

Page 46: 16 assembly scs v2 - Department of Computer Science

Shortest common superstring: greedy

Greedy-SCS: in each round, merge pair of strings with maximal overlap. Stop when there’s 1 string left. l = minimum overlap.

Input strings

Algorithm in action (l = 1):

AAB

ABB

BBABBB

AAA

2

1112

1

2

2 1

12

AAA AAB ABB BBB BBA AAA AAB ABB BBB BBA

Page 47: 16 assembly scs v2 - Department of Computer Science

Shortest common superstring: greedy

Greedy-SCS: in each round, merge pair of strings with maximal overlap. Stop when there’s 1 string left. l = minimum overlap.

Input strings

Algorithm in action (l = 1):

ABB

BBABBB

AAAB

111

2

2 1

22

AAA AAB ABB BBB BBA AAA AAB ABB BBB BBA AAAB ABB BBB BBA

Page 48: 16 assembly scs v2 - Department of Computer Science

Shortest common superstring: greedy

Greedy-SCS: in each round, merge pair of strings with maximal overlap. Stop when there’s 1 string left. l = minimum overlap.

Input strings

Algorithm in action (l = 1):

ABB

BBABBB

AAAB

111

2

2 1

22

AAA AAB ABB BBB BBA AAA AAB ABB BBB BBA AAAB ABB BBB BBA

Page 49: 16 assembly scs v2 - Department of Computer Science

Shortest common superstring: greedy

Greedy-SCS: in each round, merge pair of strings with maximal overlap. Stop when there’s 1 string left. l = minimum overlap.

Input strings

Algorithm in action (l = 1):

ABB

BBBA

AAAB

1

2

2

AAA AAB ABB BBB BBA AAA AAB ABB BBB BBA AAAB ABB BBB BBA AAAB BBBA ABB

11

Page 50: 16 assembly scs v2 - Department of Computer Science

Shortest common superstring: greedy

Greedy-SCS: in each round, merge pair of strings with maximal overlap. Stop when there’s 1 string left. l = minimum overlap.

Input strings

Algorithm in action (l = 1):

ABB

BBBA

AAAB

1

2

2

AAA AAB ABB BBB BBA AAA AAB ABB BBB BBA AAAB ABB BBB BBA AAAB BBBA ABB

11

Page 51: 16 assembly scs v2 - Department of Computer Science

Shortest common superstring: greedy

Greedy-SCS: in each round, merge pair of strings with maximal overlap. Stop when there’s 1 string left. l = minimum overlap.

Input strings

Algorithm in action (l = 1):

BBBA

AAABB

2

AAA AAB ABB BBB BBA AAA AAB ABB BBB BBA AAAB ABB BBB BBA AAAB BBBA ABB AAABB BBBA

1

Page 52: 16 assembly scs v2 - Department of Computer Science

Shortest common superstring: greedy

Greedy-SCS: in each round, merge pair of strings with maximal overlap. Stop when there’s 1 string left. l = minimum overlap.

Input strings

Algorithm in action (l = 1):

AAA AAB ABB BBB BBA AAA AAB ABB BBB BBA AAAB ABB BBB BBA AAAB BBBA ABB AAABB BBBA AAABBBA

That’s the SCS

AAABBBA

Page 53: 16 assembly scs v2 - Department of Computer Science

Greedy shortest common superstring

AAA AAB ABB BBA BBB

AAAB ABB BBA BBB

Page 54: 16 assembly scs v2 - Department of Computer Science

Greedy shortest common superstring

AAA AAB ABB BBA BBB

AAAB ABB BBA BBB

AAAB ABBA BBB

Page 55: 16 assembly scs v2 - Department of Computer Science

Greedy shortest common superstring

AAA AAB ABB BBA BBB

AAAB ABB BBA BBB

AAAB ABBA BBB

AAABBA BBB

Page 56: 16 assembly scs v2 - Department of Computer Science

Greedy shortest common superstring

AAA AAB ABB BBA BBB

AAAB ABB BBA BBB

AAAB ABBA BBB

AAABBA BBB

AAABBABBB superstring, length=9

Page 57: 16 assembly scs v2 - Department of Computer Science

Greedy shortest common superstring

AAA AAB ABB BBA BBB

AAAB ABB BBA BBB

AAAB ABBA BBB

AAABBA BBB

AAABBABBB superstring, length=9

AAABBBA superstring, length=7

Greedy answer isn't necessarily optimal

Page 58: 16 assembly scs v2 - Department of Computer Science

Shortest common superstring: greedy

Greedy-SCS assembling all substrings of length 6 from: a_long_long_long_time. l = 3.

ng_lon _long_ a_long long_l ong_ti ong_lo long_t g_long g_time ng_tim ng_time ng_lon _long_ a_long long_l ong_ti ong_lo long_t g_long ng_time g_long_ ng_lon a_long long_l ong_ti ong_lo long_t ng_time long_ti g_long_ ng_lon a_long long_l ong_lo ng_time ong_lon long_ti g_long_ a_long long_l ong_lon long_time g_long_ a_long long_l long_lon long_time g_long_ a_long long_lon g_long_time a_long long_long_time a_long a_long_long_time

Foiled by repeat!

Page 59: 16 assembly scs v2 - Department of Computer Science

Shortest common superstring: greedy

Same example, but increased the substring length from 6 to 8

long_lon ng_long_ _long_lo g_long_t ong_long g_long_l ong_time a_long_l _long_ti long_tim long_time long_lon ng_long_ _long_lo g_long_t ong_long g_long_l a_long_l _long_ti _long_time long_lon ng_long_ _long_lo g_long_t ong_long g_long_l a_long_l _long_time a_long_lo long_lon ng_long_ g_long_t ong_long g_long_l _long_time ong_long_ a_long_lo long_lon g_long_t g_long_l g_long_time ong_long_ a_long_lo long_lon g_long_l g_long_time ong_long_ a_long_lon g_long_l g_long_time ong_long_l a_long_lon g_long_time a_long_long_l a_long_long_long_time a_long_long_long_time

Got the whole thing: a_long_long_long_time

Page 60: 16 assembly scs v2 - Department of Computer Science

Shortest common superstring: greedy

Why are substrings of length 8 long enough for Greedy-SCS to figure out there are 3 copies of long?

a_long_long_long_time

One length-8 substring spans all three longs

g_long_l

Page 61: 16 assembly scs v2 - Department of Computer Science

Third law of assembly

Repeats make assembly difficult; whether we can assemble without mistakes depends on length of reads and repetitive patterns in genome

a_long_long_long_time

a_long_long_time

Collapsing a tandem repeat:

Spurious rearrangement:

Page 62: 16 assembly scs v2 - Department of Computer Science

Repeats foil assembly

Portion of overlap graph involving repeat family A

As are longer than read length

A

Lots of overlaps among A reads

Even if we avoid collapsing copies of A, we can’t know which paths in correspond to which paths out

L1

L2

L3

L4

R1

R2

R3

R4

L1

L2

L3

L4

R1

R2

R3

R4

Stre

tche

s of

ge

nom

eRe

ads

RepetitiveUnique Unique