lecture: graph-based genome scaffolding - igmigm.univ-mlv.fr/~mweller/scaf_lec.pdf · sanger...

309
Lecture: Graph-Based Genome Scaffolding Mathias Weller [email protected] Montpellier, 2017

Upload: others

Post on 27-Jun-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Lecture: Graph-Based

Genome Scaffolding

Mathias [email protected]

Montpellier, 2017

Page 2: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

DNA

- double strand

- inside nucleus (safe)

RNA

- single strand

- outside nucleus

- transfers genetic code

- Thymine (T) → Uracil (U)

Polymerase

2/33

Page 3: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

DNA

- double strand

- inside nucleus (safe)

RNA

- single strand

- outside nucleus

- transfers genetic code

- Thymine (T) → Uracil (U)

Polymerase

2/33

Page 4: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAATGGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA

GGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix & create thousands of copies

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

unreliable after a couple hundred bp

chop up DNA into pieces and read those

3/33

Page 5: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAATGGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA

GGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix & create thousands of copies

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

unreliable after a couple hundred bp

chop up DNA into pieces and read those

3/33

Page 6: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT

GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTAGGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix & create thousands of copies

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

unreliable after a couple hundred bp

chop up DNA into pieces and read those

3/33

Page 7: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT

GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTAGGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix & create thousands of copies

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

unreliable after a couple hundred bp

chop up DNA into pieces and read those

3/33

Page 8: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT

GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTAGGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix & create thousands of copies

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

unreliable after a couple hundred bp

chop up DNA into pieces and read those

3/33

Page 9: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT

GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA

GGA*

GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix & create thousands of copies

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

unreliable after a couple hundred bp

chop up DNA into pieces and read those

3/33

Page 10: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT

GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA

GGA*GGACCTGCCCA*

GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix & create thousands of copies

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

unreliable after a couple hundred bp

chop up DNA into pieces and read those

3/33

Page 11: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT

GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA

GGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix & create thousands of copies

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

unreliable after a couple hundred bp

chop up DNA into pieces and read those

3/33

Page 12: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT

GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA

GGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix & create thousands of copies

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

unreliable after a couple hundred bp

chop up DNA into pieces and read those

3/33

Page 13: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT

GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA

GGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix & create thousands of copies

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

unreliable after a couple hundred bp

chop up DNA into pieces and read those

3/33

Page 14: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Next Generation Sequencing ( )

ACCA

AGTCTGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAGACCTCTCAG

ACTCATGGTCTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand

6. double strand is denaturized into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

8. read all strands from their anchor points outwards

Paired-End reads (distance between reads = “insert size”)

4/33

Page 15: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Next Generation Sequencing ( )

ACCA

AGTCTGGAGAGTC

TGAGTACCA

ACTCA......ACCTC

TGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAGACCTCTCAG

ACTCATGGTCTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand

6. double strand is denaturized into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

8. read all strands from their anchor points outwards

Paired-End reads (distance between reads = “insert size”)

4/33

Page 16: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Next Generation Sequencing ( )

ACCA

AGTCTGGAGAGTC

TGAGTACCA

ACTCA......ACCTC

TGGTACTCA......ACCTCTCAG

TGGTACTCA......ACCTCTCAGACCTCTCAG

ACTCATGGTCTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand

6. double strand is denaturized into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

8. read all strands from their anchor points outwards

Paired-End reads (distance between reads = “insert size”)

4/33

Page 17: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Next Generation Sequencing ( )

ACCA

AGTC

TGGAGAGTC

TGAGTACCA

ACTCA......ACCTC

TGGTACTCA......ACCTCTCAG

TGGTACTCA......ACCTCTCAGACCTCTCAG

ACTCATGGTCTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand

6. double strand is denaturized into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

8. read all strands from their anchor points outwards

Paired-End reads (distance between reads = “insert size”)

4/33

Page 18: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Next Generation Sequencing ( )

ACCA

AGTC

TGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAG

TGGTACTCA......ACCTCTCAG

ACCTCTCAG

ACTCATGGTCTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand

6. double strand is denaturized into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

8. read all strands from their anchor points outwards

Paired-End reads (distance between reads = “insert size”)

4/33

Page 19: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Next Generation Sequencing ( )

ACCA

AGTC

TGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAG

ACCTCTCAG

ACTCATGGT

CTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand

6. double strand is denaturized into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

8. read all strands from their anchor points outwards

Paired-End reads (distance between reads = “insert size”)

4/33

Page 20: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Next Generation Sequencing ( )

ACCA

AGTC

TGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAG

ACCTCTCAG

ACTCATGGT

CTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand

6. double strand is denaturized into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

8. read all strands from their anchor points outwards

Paired-End reads (distance between reads = “insert size”)

4/33

Page 21: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Next Generation Sequencing ( )

ACCA

AGTCTGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAGACCTCTCAG

ACTCATGGT

CTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand

6. double strand is denaturized into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

8. read all strands from their anchor points outwards

Paired-End reads (distance between reads = “insert size”)

4/33

Page 22: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Next Generation Sequencing ( )

ACCA

AGTCTGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAGACCTCTCAG

ACTCATGGT

CTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand

6. double strand is denaturized into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

8. read all strands from their anchor points outwards

Paired-End reads (distance between reads = “insert size”)

4/33

Page 23: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Next Generation Sequencing ( )

ACCA

AGTCTGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAGACCTCTCAG

ACTCATGGT

CTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand

6. double strand is denaturized into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

8. read all strands from their anchor points outwards

Paired-End reads (distance between reads = “insert size”)

4/33

Page 24: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Next Generation Sequencing ( )

ACCA

AGTCTGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAGACCTCTCAG

ACTCATGGT

CTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand

6. double strand is denaturized into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

8. read all strands from their anchor points outwards

Paired-End reads (distance between reads = “insert size”)

4/33

Page 25: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTCACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACA

TCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGNNNNCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Idea: overlap reads

5/33

Page 26: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACA

TCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGNNNNCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Idea: overlap reads

5/33

Page 27: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACA

TCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGNNNNCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Idea: overlap reads

Problem 1: parts of the sequence might not be covered by reads

sequence with “high coverage”

5/33

Page 28: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACA

TCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGNNNNCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Idea: overlap reads

Problem 1: parts of the sequence might not be covered by reads

sequence with “high coverage”

5/33

Page 29: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGNNNNCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Idea: overlap reads

Problem 1: parts of the sequence might not be covered by reads

sequence with “high coverage”

5/33

Page 30: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGNNNNCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Idea: overlap reads

Problem 1: parts of the sequence might not be covered by reads

sequence with “high coverage”

5/33

Page 31: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGNNNNCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Idea: overlap reads

Problem 2: Shortest Common Superstring is NP-hard

“Overlap-Layout-Consensus” assemblers

Problem: Θ(n2) too slow in practice

DeBruijn-graph based assembly

1. chop all reads into “k-mers”

2. builds overlap graph

(“DeBruijn graph”)

3. find

k = 4

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCT

CCTT

CTTG

TTGG

5/33

Page 32: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGNNNNCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Idea: overlap reads

Problem 2: Shortest Common Superstring is NP-hard

“Overlap-Layout-Consensus” assemblers

1. produce best pairwise overlaps

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problem: Θ(n2) too slow in practice

DeBruijn-graph based assembly

1. chop all reads into “k-mers”

2. builds overlap graph

(“DeBruijn graph”)

3. find

k = 4

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCT

CCTT

CTTG

TTGG

5/33

Page 33: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTC CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGNNNNCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Idea: overlap reads

Problem 2: Shortest Common Superstring is NP-hard

“Overlap-Layout-Consensus” assemblers

1. produce best pairwise overlaps

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problem: Θ(n2) too slow in practice

DeBruijn-graph based assembly

1. chop all reads into “k-mers”

2. builds overlap graph

(“DeBruijn graph”)

3. find

k = 4

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCT

CCTT

CTTG

TTGG

5/33

Page 34: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGNNNNCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Idea: overlap reads

Problem 2: Shortest Common Superstring is NP-hard

“Overlap-Layout-Consensus” assemblers

Problem: Θ(n2) too slow in practice

DeBruijn-graph based assembly

1. chop all reads into “k-mers”

2. builds overlap graph

(“DeBruijn graph”)

3. find

k = 4

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCT

CCTT

CTTG

TTGG

5/33

Page 35: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGNNNNCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Idea: overlap reads

Problem 2: Shortest Common Superstring is NP-hard

“Overlap-Layout-Consensus” assemblers

Problem: Θ(n2) too slow in practice

DeBruijn-graph based assembly

1. chop all reads into “k-mers”

2. builds overlap graph

(“DeBruijn graph”)

3. find path using all overlaps

k = 4

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCT

CCTT

CTTG

TTGG

5/33

Page 36: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGNNNNCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Idea: overlap reads

Problem 2: Shortest Common Superstring is NP-hard

“Overlap-Layout-Consensus” assemblers

Problem: Θ(n2) too slow in practice

DeBruijn-graph based assembly

1. chop all reads into “k-mers”

2. builds overlap graph

(“DeBruijn graph”)

3. find Eulerian path

k = 4

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCT

CCTT

CTTG

TTGG

5/33

Page 37: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGNNNNCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Idea: overlap reads

Problem 2: Shortest Common Superstring is NP-hard

“Overlap-Layout-Consensus” assemblers

Problem: Θ(n2) too slow in practice

DeBruijn-graph based assembly

1. chop all reads into “k-mers”

2. builds overlap graph

(“DeBruijn graph”)

3. find Eulerian path

k = 4

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCTCCTT

CTTG

TTGG

5/33

Page 38: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGNNNNCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Idea: overlap reads

Problem 2: Shortest Common Superstring is NP-hard

“Overlap-Layout-Consensus” assemblers

Problem: Θ(n2) too slow in practice

DeBruijn-graph based assembly

1. chop all reads into “k-mers”

2. builds overlap graph

(“DeBruijn graph”)

3. find Eulerian path

k = 4

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCTCCTT

CTTG

TTGG

5/33

Page 39: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGNNNNCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Idea: overlap reads

Problem 3: repeats (common in DNA) make assembly ambiguous

end product is a set of “contiguous regions”

Problem: “contig soup” not very useful

But: we have paired-end information!

5/33

Page 40: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGNNNNCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Idea: overlap reads

Problem 3: repeats (common in DNA) make assembly ambiguous

end product is a set of “contiguous regions”

Problem: “contig soup” not very useful

But: we have paired-end information!

5/33

Page 41: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGNNNNCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Idea: overlap reads

Problem 3: repeats (common in DNA) make assembly ambiguous

end product is a set of “contiguous regions”

Problem: “contig soup” not very useful

But: we have paired-end information!

5/33

Page 42: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGNNNNCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Idea: overlap reads

Problem 3: repeats (common in DNA) make assembly ambiguous

end product is a set of “contiguous regions”

Problem: “contig soup” not very useful

But: we have paired-end information!

5/33

Page 43: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGNNNNCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Idea: overlap reads

Problem 3: repeats (common in DNA) make assembly ambiguous

end product is a set of “contiguous regions”

Problem: “contig soup” not very useful

But: we have paired-end information!

5/33

Page 44: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGNNNNCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Idea: overlap reads

Problem 3: repeats (common in DNA) make assembly ambiguous

end product is a set of “contiguous regions”

Problem: “contig soup” not very useful

But: we have paired-end information!

5/33

Page 45: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGNNNNCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Idea: overlap reads

Problem 3: repeats (common in DNA) make assembly ambiguous

end product is a set of “contiguous regions”

Problem: “contig soup” not very useful

But: we have paired-end information!

5/33

Page 46: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Genome Scaffolding: Previous Work

Goal: order & orient contigs

Idea: use pairing information on reads to “link” contigs together

- SOPRA [Dayarian, Michael, Sengupta, BMC Bioinf. 11, ’10]

I removes reads in high-coverage area (likely repeats)I orientation step (heuristic) + ordering step (heuristic)I coded in Pearl (!!!)I (observed sparse contig graph)

- SSPACE [Boetzer & al., Bioinf. 27(4), ’11]

- OPERA [Gao, Sung, Ngaraja, JCB. 18(11), ’11]

- GRASS [Gritsenko & al., Bioinf. 28(11), ’12]

- SCARPA [Donmez, Brudno, Bioinf. 29(4), ’13]

- . . . [Huson & al., JACM, ’02][Nieuwerburgh & al., NAR, ’12]

6/33

Page 47: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Genome Scaffolding: Previous Work

Goal: order & orient contigs

Idea: use pairing information on reads to “link” contigs together

- SOPRA [Dayarian, Michael, Sengupta, BMC Bioinf. 11, ’10]

- SSPACE [Boetzer & al., Bioinf. 27(4), ’11]

I heuristic contig extensionI “reasonable time”

- OPERA [Gao, Sung, Ngaraja, JCB. 18(11), ’11]

- GRASS [Gritsenko & al., Bioinf. 28(11), ’12]

- SCARPA [Donmez, Brudno, Bioinf. 29(4), ’13]

- . . . [Huson & al., JACM, ’02][Nieuwerburgh & al., NAR, ’12]

6/33

Page 48: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Genome Scaffolding: Previous Work

Goal: order & orient contigs

Idea: use pairing information on reads to “link” contigs together

- SOPRA [Dayarian, Michael, Sengupta, BMC Bioinf. 11, ’10]

- SSPACE [Boetzer & al., Bioinf. 27(4), ’11]

- OPERA [Gao, Sung, Ngaraja, JCB. 18(11), ’11]

I np+O(1) time (p =#edge-deletions (≥ feedback edge set))I most work done by a heuristic “graph contraction”

- GRASS [Gritsenko & al., Bioinf. 28(11), ’12]

- SCARPA [Donmez, Brudno, Bioinf. 29(4), ’13]

- . . . [Huson & al., JACM, ’02][Nieuwerburgh & al., NAR, ’12]

6/33

Page 49: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Genome Scaffolding: Previous Work

Goal: order & orient contigs

Idea: use pairing information on reads to “link” contigs together

- SOPRA [Dayarian, Michael, Sengupta, BMC Bioinf. 11, ’10]

- SSPACE [Boetzer & al., Bioinf. 27(4), ’11]

- OPERA [Gao, Sung, Ngaraja, JCB. 18(11), ’11]

- GRASS [Gritsenko & al., Bioinf. 28(11), ’12]

I Mixed-Integer Quadratic ProgrammingI deals with uncertain data (slack variables)

“intractable even for small # of contigs”

I heuristic workaround:I solve relaxed formulation & use slack values ILP

- SCARPA [Donmez, Brudno, Bioinf. 29(4), ’13]

- . . . [Huson & al., JACM, ’02][Nieuwerburgh & al., NAR, ’12]

6/33

Page 50: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Genome Scaffolding: Previous Work

Goal: order & orient contigs

Idea: use pairing information on reads to “link” contigs together

- SOPRA [Dayarian, Michael, Sengupta, BMC Bioinf. 11, ’10]

- SSPACE [Boetzer & al., Bioinf. 27(4), ’11]

- OPERA [Gao, Sung, Ngaraja, JCB. 18(11), ’11]

- GRASS [Gritsenko & al., Bioinf. 28(11), ’12]

- SCARPA [Donmez, Brudno, Bioinf. 29(4), ’13]

I orientation step: use FPT algo for Odd Cycle TransersalI ordering step: heuristic

- . . . [Huson & al., JACM, ’02][Nieuwerburgh & al., NAR, ’12]

6/33

Page 51: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Genome Scaffolding: Previous Work

Goal: order & orient contigs

Idea: use pairing information on reads to “link” contigs together

- SOPRA [Dayarian, Michael, Sengupta, BMC Bioinf. 11, ’10]

- SSPACE [Boetzer & al., Bioinf. 27(4), ’11]

- OPERA [Gao, Sung, Ngaraja, JCB. 18(11), ’11]

- GRASS [Gritsenko & al., Bioinf. 28(11), ’12]

- SCARPA [Donmez, Brudno, Bioinf. 29(4), ’13]

- . . . [Huson & al., JACM, ’02][Nieuwerburgh & al., NAR, ’12]

6/33

Page 52: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

2

5

10

9

14

3

613

10

14

3

6

7 /33

Page 53: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

2

5

10

9

14

3

613

10

14

3

6

Strategy

1. map reads into contigs

2. pair contigs according to read-pairing (weighted)

3. cover “scaffold graph” with (heavy) alternating paths

each path corresponds to a chromosome

7 /33

Page 54: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGG

GTTTTTAC

CTACTGA

2

5

10

9

14

3

613

10

14

3

6

Strategy

1. map reads into contigs

2. pair contigs according to read-pairing (weighted)

3. cover “scaffold graph” with (heavy) alternating paths

each path corresponds to a chromosome

7 /33

Page 55: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

2

5

10

9

14

3

613

10

14

3

6

Strategy

1. map reads into contigs

2. pair contigs according to read-pairing (weighted)

3. cover “scaffold graph” with (heavy) alternating paths

each path corresponds to a chromosome

7 /33

Page 56: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

2

5

10

9

14

3

613

10

14

3

6

Strategy

1. map reads into contigs

2. pair contigs according to read-pairing (weighted)

3. cover “scaffold graph” with (heavy) alternating paths

each path corresponds to a chromosome

7 /33

Page 57: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

2

5

10

9

14

3

613

10

14

3

6

Strategy

1. map reads into contigs

2. pair contigs according to read-pairing (weighted)

3. cover “scaffold graph” with (heavy) alternating paths

each path corresponds to a chromosome

7 /33

Page 58: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

25

10

9

14

3

613

10

14

3

6

Strategy

1. map reads into contigs

2. pair contigs according to read-pairing (weighted)

3. cover “scaffold graph” with (heavy) alternating paths

each path corresponds to a chromosome

7 /33

Page 59: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

25

10

9

14

3

613

10

14

3

6

Strategy

1. map reads into contigs

2. pair contigs according to read-pairing (weighted)

3. cover “scaffold graph” with (heavy) alternating paths

each path corresponds to a chromosome

7 /33

Page 60: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

25

10

9

14

3

613

10

14

3

6

Strategy

1. map reads into contigs

2. pair contigs according to read-pairing (weighted)

3. cover “scaffold graph” with (heavy) alternating paths

each path corresponds to a chromosome

7 /33

Page 61: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

25

10

9

14

3

613

10

14

3

6

ScaffoldingInput: Graph G , perfect matching M, weights ω, k, σp ∈ N

Question: Can G be covered by

≤σp alternating paths

σc alternating cycles

of total weight ≥ k?

7 /33

Page 62: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

25

10

9

14

3

613

10

14

3

6

ScaffoldingInput: Graph G , perfect matching M, weights ω, k, σp, σc ∈ N

Question: Can G be covered by

≤σp alternating paths &

≤σc alternating cycles

of total weight ≥ k?

7 /33

Page 63: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

25

10

9

14

3

613

10

14

3

6

Exact ScaffoldingInput: Graph G , perfect matching M, weights ω, k, σp, σc ∈ N

Question: Can G be covered by

σp alternating paths &

σc alternating cycles

of total weight ≥ k?

7 /33

Page 64: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

19

10

71

26

22

32

59

71

78

2 60

85

1

38

1

17

19

7

25

8/33

Page 65: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

140

262

149

119

336

116

397

72

1

75

14

5

100

1

21

16

10

37

45

1

22

1

7

61

78

69

1

15

51

40

65

2

61

8/33

Page 66: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

1

63

94

25

9

28

29

55

2

17

1

67

62

1

5

71

26

24

47

83

71

34

71

68

72

1

51

19

79

54

151

67

6

66

34

68

59

2

23

11

12

1

5

1

2

6

1

49

13

71

28

71

5

6

11

33

67

24

88

82

2

42

25

1

39

14

65

5

2

70

24

89

46

1

1

1

80

1

73

64

36

72

33

73

67

59

1

14

1

1

70

3

1

33

53

2

2

1

135

4

2980

74

80

84

11

31

1

1

62

73

38

31

33

77

108

28

13

1 4

1

70

8

5

1

47

3

31

2

1

18

56

49

45

2

2

8/33

Page 67: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can G be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Construction

Given a directed graph D .

1. make a copy of D

2. duplicate all vertices M3. “slide” down all arrow tips & ignore directions

9/33

Page 68: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can G be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Construction

Given a directed graph D .

1. make a copy of D2. duplicate all vertices M

3. “slide” down all arrow tips & ignore directions

9/33

Page 69: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can G be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Construction

Given a directed graph D .

1. make a copy of D2. duplicate all vertices M3. “slide” down all arrow tips & ignore directions

9/33

Page 70: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can G be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Lemma

D admits a directed Hamiltonian path ⇔ M can be covered with a

single alternating path in G .

9/33

Page 71: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can G be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Lemma

D admits a directed Hamiltonian path ⇔ M can be covered with a

single alternating path in G .

“⇒”: replace each v in the Hamiltonian path by vup → vlow .

alternating X covers M X

9/33

Page 72: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can G be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Lemma

D admits a directed Hamiltonian path ⇔ M can be covered with a

single alternating path in G .

“⇒”: replace each v in the Hamiltonian path by vup → vlow .

alternating X covers M X

9/33

Page 73: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can G be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Lemma

D admits a directed Hamiltonian path ⇔ M can be covered with a

single alternating path in G .

“⇐”: contract each matching edge in the covering alternating path.

hits all vertices exactly once X is valid directed path X

9/33

Page 74: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can G be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Lemma

D admits a directed Hamiltonian path ⇔ M can be covered with a

single alternating path in G .

“⇐”: contract each matching edge in the covering alternating path.

hits all vertices exactly once X is valid directed path X

9/33

Page 75: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can G be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Theorem

Scaffolding is NP-hard, even restricted to

• bipartite graphs

• (σp, σc) ∈ {(0, 1), (1, 0)} and

• ω : E → {0}.

Corollary

Scaffolding with 2 weights is NP-hard in any

sufficiently dense graph class.

9/33

Page 76: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can G be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Theorem

Scaffolding is NP-hard, even restricted to

• supergraphs of bipartite graphs

• (σp, σc) ∈ {(0, 1), (1, 0)} and

• ω : E → {0, 1}.

Corollary

Scaffolding with 2 weights is NP-hard in any

sufficiently dense graph class.

9/33

Page 77: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can G be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Theorem

Scaffolding is NP-hard, even restricted to

• supergraphs of bipartite graphs

• (σp, σc) ∈ {(0, 1), (1, 0)} and

• ω : E → {0, 1}.

Corollary

Scaffolding with 2 weights is NP-hard in any

sufficiently dense graph class.9/33

Page 78: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can G be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Theorem

Exact Scaffolding is NP-hard, even restricted to

• supergraphs of bipartite graphs

• (σp, σc) ∈ {(0, 1), (1, 0)} and

• ω : E → {0, 1}.

Corollary

Exact Scaffolding with 2 weights is NP-hard in any

sufficiently dense graph class.9/33

Page 79: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Co-Bipartites

???

?

Wait, what?

Recap: Corollary

Scaffolding with 2 weights is NP-hard in any

sufficiently dense graph class.

Unweighted!

10 /33

Page 80: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Co-Bipartites

???

?

Wait, what?

Recap: Corollary

Scaffolding with 2 weights is NP-hard in any

sufficiently dense graph class.

Unweighted!

10 /33

Page 81: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Co-Bipartites

???

?

Wait, what?

Recap: Corollary

Scaffolding with 2 weights is NP-hard in any

sufficiently dense graph class.

Unweighted!

10 /33

Page 82: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Co-Bipartites

???

?

Wait, what?

Recap: Corollary

Scaffolding with 2 weights is NP-hard in any

sufficiently dense graph class.

Unweighted!

10 /33

Page 83: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Co-Bipartites

X Y

forbidden!

Observation

no edges between X & Y need 2 objects (paths/cycles)

otherwise can always cover G with 1 path

TODO

decide if we can cover with 1 cycle

11 / 33

Page 84: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Co-Bipartites

X Y

forbidden!

Observation

∃ alternating cycle with non-matching edge X extend to cover all M in G [X ]

Observation

#matching edges between X & Y even (and > 0) X

#matching edges between X & Y odd

find any non-matching edge between X & Y#matching edges between X & Y is 0

all other cases are X (tedious case analysis)

11 / 33

Page 85: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Co-Bipartites

X Y

forbidden!

Observation

∃ alternating cycle with non-matching edge X extend to cover all M in G [X ]

Observation

#matching edges between X & Y even (and > 0) X

#matching edges between X & Y odd

find any non-matching edge between X & Y#matching edges between X & Y is 0

all other cases are X (tedious case analysis)

11 / 33

Page 86: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Co-Bipartites

X Y

forbidden!

Observation

∃ alternating cycle with non-matching edge X extend to cover all M in G [X ]

Observation

#matching edges between X & Y even (and > 0) X

#matching edges between X & Y odd

find any non-matching edge between X & Y#matching edges between X & Y is 0

all other cases are X (tedious case analysis)

11 / 33

Page 87: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Co-Bipartites

X Y

forbidden!

Observation

∃ alternating cycle with non-matching edge X extend to cover all M in G [X ]

Observation

#matching edges between X & Y even (and > 0) X

#matching edges between X & Y odd

find any non-matching edge between X & Y#matching edges between X & Y is 0

all other cases are X (tedious case analysis)

11 / 33

Page 88: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Co-Bipartites

X Y

forbidden!

Observation

∃ alternating cycle with non-matching edge X extend to cover all M in G [X ]

Observation

#matching edges between X & Y even (and > 0) X#matching edges between X & Y odd

find any non-matching edge between X & Y#matching edges between X & Y is 0

all other cases are X (tedious case analysis)

11 / 33

Page 89: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Co-Bipartites

X Y

forbidden!

Observation

∃ alternating cycle with non-matching edge X extend to cover all M in G [X ]

Observation

#matching edges between X & Y even (and > 0) X#matching edges between X & Y odd

find any non-matching edge between X & Y

#matching edges between X & Y is 0

all other cases are X (tedious case analysis)

11 / 33

Page 90: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Co-Bipartites

X Y

forbidden!

Observation

∃ alternating cycle with non-matching edge X extend to cover all M in G [X ]

Observation

#matching edges between X & Y even (and > 0) X#matching edges between X & Y odd

find any non-matching edge between X & Y

#matching edges between X & Y is 0

all other cases are X (tedious case analysis)

11 / 33

Page 91: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Co-Bipartites

X Y

forbidden!

Observation

∃ alternating cycle with non-matching edge X extend to cover all M in G [X ]

Observation

#matching edges between X & Y even (and > 0) X#matching edges between X & Y odd

find any non-matching edge between X & Y

#matching edges between X & Y is 0

all other cases are X (tedious case analysis)

11 / 33

Page 92: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Co-Bipartites

X Y

forbidden!

Observation

∃ alternating cycle with non-matching edge X extend to cover all M in G [X ]

Observation

#matching edges between X & Y even (and > 0) X#matching edges between X & Y odd

find any non-matching edge between X & Y#matching edges between X & Y is 0

all other cases are X (tedious case analysis)

11 / 33

Page 93: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Co-Bipartites

X Y

forbidden!

Observation

∃ alternating cycle with non-matching edge X extend to cover all M in G [X ]

Observation

#matching edges between X & Y even (and > 0) X#matching edges between X & Y odd

find any non-matching edge between X & Y#matching edges between X & Y is 0

all other cases are X (tedious case analysis)

11 / 33

Page 94: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Co-Bipartites

X Y

forbidden!

Observation

∃ alternating cycle with non-matching edge X extend to cover all M in G [X ]

Observation

#matching edges between X & Y even (and > 0) X#matching edges between X & Y odd

find any non-matching edge between X & Y#matching edges between X & Y is 0

all other cases are X (tedious case analysis)

11 / 33

Page 95: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Co-Bipartites

X Y

forbidden!

Observation

∃ alternating cycle with non-matching edge X extend to cover all M in G [X ]

Observation

#matching edges between X & Y even (and > 0) X#matching edges between X & Y odd

find any non-matching edge between X & Y#matching edges between X & Y is 0

all other cases are X (tedious case analysis)

11 / 33

Page 96: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Co-Bipartites

X Y

forbidden!

Observation

∃ alternating cycle with non-matching edge X extend to cover all M in G [X ]

Observation

#matching edges between X & Y even (and > 0) X#matching edges between X & Y odd

find any non-matching edge between X & Y#matching edges between X & Y is 0

all other cases are X (tedious case analysis)

11 / 33

Page 97: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Co-Bipartites

X Y

forbidden!

Theorem

Scaffolding can be solved in O(n + m) time on co-bipartite graphs

11 / 33

Page 98: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Trees

Observation

no alternating cycles in a tree

Observation

consider a lowest leaf `

M is perfect ` matched

parent p of ` has only 1 child

Case 1parent g of p is matched “below”

g is matched to a leaf `′

always take `−p−g−`′

g

p

`

Case 2parent g of p is matched “above”

either p is the only child of g delete ` & g and reduce kor g has another child u u matched “below” ∃ “clone” of g−p−` take p−`

12 /33

Page 99: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Trees

Observation

no alternating cycles in a tree

Observation

consider a lowest leaf `

M is perfect ` matched

parent p of ` has only 1 child

Case 1parent g of p is matched “below”

g is matched to a leaf `′

always take `−p−g−`′

g

p

`

Case 2parent g of p is matched “above”

either p is the only child of g delete ` & g and reduce kor g has another child u u matched “below” ∃ “clone” of g−p−` take p−`

12 /33

Page 100: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Trees

Observation

no alternating cycles in a tree

Observation

consider a lowest leaf `M is perfect ` matched

parent p of ` has only 1 child

Case 1parent g of p is matched “below”

g is matched to a leaf `′

always take `−p−g−`′

g

p

`

Case 2parent g of p is matched “above”

either p is the only child of g delete ` & g and reduce kor g has another child u u matched “below” ∃ “clone” of g−p−` take p−`

12 /33

Page 101: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Trees

Observation

no alternating cycles in a tree

Observation

consider a lowest leaf `M is perfect ` matched

parent p of ` has only 1 child

Case 1parent g of p is matched “below”

g is matched to a leaf `′

always take `−p−g−`′

g

p

`

Case 2parent g of p is matched “above”

either p is the only child of g delete ` & g and reduce kor g has another child u u matched “below” ∃ “clone” of g−p−` take p−`

12 /33

Page 102: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Trees

Observation

no alternating cycles in a tree

Observation

consider a lowest leaf `M is perfect ` matched

parent p of ` has only 1 child

Case 1parent g of p is matched “below”

g is matched to a leaf `′

always take `−p−g−`′

g

p

`

Case 2parent g of p is matched “above”

either p is the only child of g delete ` & g and reduce kor g has another child u u matched “below” ∃ “clone” of g−p−` take p−`

12 /33

Page 103: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Trees

Observation

no alternating cycles in a tree

Observation

consider a lowest leaf `M is perfect ` matched

parent p of ` has only 1 child

Case 1parent g of p is matched “below”

g is matched to a leaf `′

always take `−p−g−`′

g

p

`

Case 2parent g of p is matched “above”

either p is the only child of g delete ` & g and reduce kor g has another child u u matched “below” ∃ “clone” of g−p−` take p−`

12 /33

Page 104: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Trees

Observation

no alternating cycles in a tree

Observation

consider a lowest leaf `M is perfect ` matched

parent p of ` has only 1 child

Case 1parent g of p is matched “below”

g is matched to a leaf `′

always take `−p−g−`′

g

p

`

Case 2parent g of p is matched “above”

either p is the only child of g delete ` & g and reduce kor g has another child u u matched “below” ∃ “clone” of g−p−` take p−`

12 /33

Page 105: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Trees

Observation

no alternating cycles in a tree

Observation

consider a lowest leaf `M is perfect ` matched

parent p of ` has only 1 child

Case 1parent g of p is matched “below”

g is matched to a leaf `′

always take `−p−g−`′

g

p

`

Case 2parent g of p is matched “above”

either p is the only child of g delete ` & g and reduce kor g has another child u u matched “below” ∃ “clone” of g−p−` take p−`

12 /33

Page 106: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Trees

Observation

no alternating cycles in a tree

Observation

consider a lowest leaf `M is perfect ` matched

parent p of ` has only 1 child

Case 1parent g of p is matched “below”

g is matched to a leaf `′

always take `−p−g−`′

g

p

`

Case 2parent g of p is matched “above”

either p is the only child of g delete ` & g and reduce kor g has another child u u matched “below” ∃ “clone” of g−p−` take p−`

12 /33

Page 107: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Trees

Observation

no alternating cycles in a tree

Observation

consider a lowest leaf `M is perfect ` matched

parent p of ` has only 1 child

Case 1parent g of p is matched “below”

g is matched to a leaf `′

always take `−p−g−`′

g

p

`

Case 2parent g of p is matched “above”

either p is the only child of g delete ` & g and reduce kor g has another child u u matched “below” ∃ “clone” of g−p−` take p−`

12 /33

Page 108: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Trees

Observation

no alternating cycles in a tree

Observation

consider a lowest leaf `M is perfect ` matched

parent p of ` has only 1 child

Case 1parent g of p is matched “below”

g is matched to a leaf `′

always take `−p−g−`′

g

p

`

Case 2parent g of p is matched “above”

either p is the only child of g delete ` & g and reduce kor g has another child u u matched “below” ∃ “clone” of g−p−` take p−`

12 /33

Page 109: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Trees

Observation

no alternating cycles in a tree

Observation

consider a lowest leaf `M is perfect ` matched

parent p of ` has only 1 child

Case 1parent g of p is matched “below”

g is matched to a leaf `′

always take `−p−g−`′

g

p

`

Case 2parent g of p is matched “above”

either p is the only child of g

delete ` & g and reduce kor g has another child u u matched “below” ∃ “clone” of g−p−` take p−`

12 /33

Page 110: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Trees

Observation

no alternating cycles in a tree

Observation

consider a lowest leaf `M is perfect ` matched

parent p of ` has only 1 child

Case 1parent g of p is matched “below”

g is matched to a leaf `′

always take `−p−g−`′

g

p

`

Case 2parent g of p is matched “above”

either p is the only child of g delete ` & g and reduce k

or g has another child u u matched “below” ∃ “clone” of g−p−` take p−`

12 /33

Page 111: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Trees

Observation

no alternating cycles in a tree

Observation

consider a lowest leaf `M is perfect ` matched

parent p of ` has only 1 child

Case 1parent g of p is matched “below”

g is matched to a leaf `′

always take `−p−g−`′

g

p

`

Case 2parent g of p is matched “above”

either p is the only child of g delete ` & g and reduce kor g has another child u

u matched “below” ∃ “clone” of g−p−` take p−`

12 /33

Page 112: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Trees

Observation

no alternating cycles in a tree

Observation

consider a lowest leaf `M is perfect ` matched

parent p of ` has only 1 child

Case 1parent g of p is matched “below”

g is matched to a leaf `′

always take `−p−g−`′

g

p

`

Case 2parent g of p is matched “above”

either p is the only child of g delete ` & g and reduce kor g has another child u u matched “below”

∃ “clone” of g−p−` take p−`

12 /33

Page 113: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Trees

Observation

no alternating cycles in a tree

Observation

consider a lowest leaf `M is perfect ` matched

parent p of ` has only 1 child

Case 1parent g of p is matched “below”

g is matched to a leaf `′

always take `−p−g−`′

g

p

`

Case 2parent g of p is matched “above”

either p is the only child of g delete ` & g and reduce kor g has another child u u matched “below” ∃ “clone” of g−p−`

take p−`

12 /33

Page 114: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Trees

Observation

no alternating cycles in a tree

Observation

consider a lowest leaf `M is perfect ` matched

parent p of ` has only 1 child

Case 1parent g of p is matched “below”

g is matched to a leaf `′

always take `−p−g−`′

g

p

`

Case 2parent g of p is matched “above”

either p is the only child of g delete ` & g and reduce kor g has another child u u matched “below” ∃ “clone” of g−p−` take p−`

12 /33

Page 115: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Unweighted Trees

Observation

no alternating cycles in a tree

Observation

consider a lowest leaf `M is perfect ` matched

parent p of ` has only 1 child

Case 1parent g of p is matched “below”

g is matched to a leaf `′

always take `−p−g−`′

g

p

`

Theorem

Scaffolding can be solved in O(n) time on unweighted trees

12 /33

Page 116: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Weighted Trees

Dynamic Programming Idea

bottom-up traversal; in each

vertex v , need to remember:

• #paths used below v• v incident with non-matching?

Semantics

[p, x ]v = max. weight collected

below v with p finished paths

“under x ”

up to vj(abbrev: last child [p, x ]v)

v

v0

v1 v2 v3v4

v5

v

vj

125

9 6

3

1 1 X 0

1 X 0

1 X 01 X 0

1 X 01 0 X 9

1 1 X 0

2 1 X 9

2 2 X 0

1 X 9

2 X 0

1 1 X 0

2 1 X 3

2 2 X 0

1 X 3

2 X 0

1 2 X 9

1 3 X 0

1 2 X 9

1 3 X 0

2 3 X 12

2 3 X 21

2 4 X 9

2 4 X 12

2 5 X 0

. . . .

Recurrence

Let v1, v2, . . . , vc be the children of v .

13 /33

Page 117: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Weighted Trees

Dynamic Programming Idea

bottom-up traversal; in each

vertex v , need to remember:

• #paths used below v• v incident with non-matching?

Semantics

[p, x ]v = max. weight collected

below v with p finished paths

“under x ”

up to vj(abbrev: last child [p, x ]v)

v

v0

v1 v2 v3v4

v5

v

vj

125

9 6

3

1 1 X 0

1 X 0

1 X 01 X 0

1 X 01 0 X 9

1 1 X 0

2 1 X 9

2 2 X 0

1 X 9

2 X 0

1 1 X 0

2 1 X 3

2 2 X 0

1 X 3

2 X 0

1 2 X 9

1 3 X 0

1 2 X 9

1 3 X 0

2 3 X 12

2 3 X 21

2 4 X 9

2 4 X 12

2 5 X 0

. . . .

Recurrence

Let v1, v2, . . . , vc be the children of v .

13 /33

Page 118: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Weighted Trees

Dynamic Programming Idea

bottom-up traversal; in each

vertex v , need to remember:

• #paths used below v• v incident with non-matching?

Semantics

[p, x ]v = max. weight collected

below v with p finished paths

“under x ”

up to vj(abbrev: last child [p, x ]v)

v

v0

v1 v2 v3v4

v5

v

vj

125

9 6

3

1 1 X 0

1 X 0

1 X 01 X 0

1 X 01 0 X 9

1 1 X 0

2 1 X 9

2 2 X 0

1 X 9

2 X 0

1 1 X 0

2 1 X 3

2 2 X 0

1 X 3

2 X 0

1 2 X 9

1 3 X 0

1 2 X 9

1 3 X 0

2 3 X 12

2 3 X 21

2 4 X 9

2 4 X 12

2 5 X 0

. . . .

Recurrence

Let v1, v2, . . . , vc be the children of v .[p,X]v := max

p1, p2, . . . , pc∑pi = p

∑1≤i≤c

maxx∈{X,X}

[pi , x ]vi

BAD IDEA!

13 /33

Page 119: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Weighted Trees

Dynamic Programming Idea

bottom-up traversal; in each

vertex v , need to remember:

• #paths used below v• v incident with non-matching?

Semantics

[p, x ]v = max. weight collected

below v with p finished paths

“under x ”

up to vj(abbrev: last child [p, x ]v)

v

v0

v1 v2 v3v4

v5

v

vj

125

9 6

3

1 1 X 0

1 X 0

1 X 01 X 0

1 X 01 0 X 9

1 1 X 0

2 1 X 9

2 2 X 0

1 X 9

2 X 0

1 1 X 0

2 1 X 3

2 2 X 0

1 X 3

2 X 0

1 2 X 9

1 3 X 0

1 2 X 9

1 3 X 0

2 3 X 12

2 3 X 21

2 4 X 9

2 4 X 12

2 5 X 0

. . . .

Recurrence

Let v1, v2, . . . , vc be the children of v .[p,X]v := max

p1, p2, . . . , pc∑pi = p

∑1≤i≤c

maxx∈{X,X}

[pi , x ]vi

BAD IDEA!

13 /33

Page 120: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Weighted Trees

Dynamic Programming Idea

bottom-up traversal; in each

vertex v , need to remember:

• #paths used below v• v incident with non-matching?

Semantics

[j , p, x ]v = max. weight collected

below v with p finished paths

“under x ” up to vj(abbrev: last child [p, x ]v)

v

v0

v1 v2 v3v4

v5

v

vj

125

9 6

3

1 1 X 0

1 X 0

1 X 01 X 0

1 X 01 0 X 9

1 1 X 0

2 1 X 9

2 2 X 0

1 X 9

2 X 0

1 1 X 0

2 1 X 3

2 2 X 0

1 X 3

2 X 0

1 2 X 9

1 3 X 0

1 2 X 9

1 3 X 0

2 3 X 12

2 3 X 21

2 4 X 9

2 4 X 12

2 5 X 0

. . . .

Recurrence

Let v1, v2, . . . , vc be the children of v .

13 /33

Page 121: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Weighted Trees

Dynamic Programming Idea

bottom-up traversal; in each

vertex v , need to remember:

• #paths used below v• v incident with non-matching?

Semantics

[j , p, x ]v = max. weight collected

below v with p finished paths

“under x ” up to vj(abbrev: last child [p, x ]v)

v

v0

v1 v2 v3v4

v5

v

vj

125

9 6

3

1 1 X 0

1 X 0

1 X 01 X 0

1 X 01 0 X 9

1 1 X 0

2 1 X 9

2 2 X 0

1 X 9

2 X 0

1 1 X 0

2 1 X 3

2 2 X 0

1 X 3

2 X 0

1 2 X 9

1 3 X 0

1 2 X 9

1 3 X 0

2 3 X 12

2 3 X 21

2 4 X 9

2 4 X 12

2 5 X 0

. . . .

Recurrence

Let v1, v2, . . . , vc be the children of v .

[0, 0,X]v :=0

[j , p, x ]v :=maxpj≤p

max{[pj ,X]vj , [pj ,X]vj }+ [j − 1, p − pj , x ]v if vvj /∈Mω(vvj ) + [pj + 1,X]vj + [j − 1, p − pj ,X]v if x = X& vvj /∈M{

[pj − 1,X]vj[pj − 1,X]vj

}+ [j − 1, p − pj , x ]v if vvj ∈M

13 /33

Page 122: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Weighted Trees

Dynamic Programming Idea

bottom-up traversal; in each

vertex v , need to remember:

• #paths used below v• v incident with non-matching?

Semantics

[j , p, x ]v = max. weight collected

below v with p finished paths

“under x ” up to vj(abbrev: last child [p, x ]v)

v

v0

v1 v2 v3v4

v5

v

vj

125

9 6

3

1 1 X 0

1 X 0

1 X 01 X 0

1 X 01 0 X 9

1 1 X 0

2 1 X 9

2 2 X 0

1 X 9

2 X 0

1 1 X 0

2 1 X 3

2 2 X 0

1 X 3

2 X 0

1 2 X 9

1 3 X 0

1 2 X 9

1 3 X 0

2 3 X 12

2 3 X 21

2 4 X 9

2 4 X 12

2 5 X 0

. . . .

Recurrence

Let v1, v2, . . . , vc be the children of v .

[0, 0,X]v :=0

[j , p, x ]v :=maxpj≤p

max{[pj ,X]vj , [pj ,X]vj }+ [j − 1, p − pj , x ]v if vvj /∈M

ω(vvj ) + [pj + 1,X]vj + [j − 1, p − pj ,X]v if x = X& vvj /∈M{[pj − 1,X]vj[pj − 1,X]vj

}+ [j − 1, p − pj , x ]v if vvj ∈M

13 /33

Page 123: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Weighted Trees

Dynamic Programming Idea

bottom-up traversal; in each

vertex v , need to remember:

• #paths used below v• v incident with non-matching?

Semantics

[j , p, x ]v = max. weight collected

below v with p finished paths

“under x ” up to vj(abbrev: last child [p, x ]v)

v

v0

v1 v2 v3v4

v5

v

vj

125

9 6

3

1 1 X 0

1 X 0

1 X 01 X 0

1 X 01 0 X 9

1 1 X 0

2 1 X 9

2 2 X 0

1 X 9

2 X 0

1 1 X 0

2 1 X 3

2 2 X 0

1 X 3

2 X 0

1 2 X 9

1 3 X 0

1 2 X 9

1 3 X 0

2 3 X 12

2 3 X 21

2 4 X 9

2 4 X 12

2 5 X 0

. . . .

Recurrence

Let v1, v2, . . . , vc be the children of v .

[0, 0,X]v :=0

[j , p, x ]v :=maxpj≤p

max{[pj ,X]vj , [pj ,X]vj }+ [j − 1, p − pj , x ]v if vvj /∈Mω(vvj ) + [pj + 1,X]vj + [j − 1, p − pj ,X]v if x = X& vvj /∈M

{[pj − 1,X]vj[pj − 1,X]vj

}+ [j − 1, p − pj , x ]v if vvj ∈M

13 /33

Page 124: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Weighted Trees

Dynamic Programming Idea

bottom-up traversal; in each

vertex v , need to remember:

• #paths used below v• v incident with non-matching?

Semantics

[j , p, x ]v = max. weight collected

below v with p finished paths

“under x ” up to vj(abbrev: last child [p, x ]v)

v

v0

v1 v2 v3v4

v5

v

vj

125

9 6

3

1 1 X 0

1 X 0

1 X 01 X 0

1 X 01 0 X 9

1 1 X 0

2 1 X 9

2 2 X 0

1 X 9

2 X 0

1 1 X 0

2 1 X 3

2 2 X 0

1 X 3

2 X 0

1 2 X 9

1 3 X 0

1 2 X 9

1 3 X 0

2 3 X 12

2 3 X 21

2 4 X 9

2 4 X 12

2 5 X 0

. . . .

Recurrence

Let v1, v2, . . . , vc be the children of v .

[0, 0,X]v :=0

[j , p, x ]v :=maxpj≤p

max{[pj ,X]vj , [pj ,X]vj }+ [j − 1, p − pj , x ]v if vvj /∈Mω(vvj ) + [pj + 1,X]vj + [j − 1, p − pj ,X]v if x = X& vvj /∈M{

[pj − 1,X]vj[pj − 1,X]vj

}+ [j − 1, p − pj , x ]v if vvj ∈M

13 /33

Page 125: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Weighted Trees

Dynamic Programming Idea

bottom-up traversal; in each

vertex v , need to remember:

• #paths used below v• v incident with non-matching?

Semantics

[j , p, x ]v = max. weight collected

below v with p finished paths

“under x ” up to vj(abbrev: last child [p, x ]v)

v

v0

v1 v2 v3v4

v5

vj

125

9 6

3

1 1 X 0

1 X 0

1 X 01 X 0

1 X 01 0 X 9

1 1 X 0

2 1 X 9

2 2 X 0

1 X 9

2 X 0

1 1 X 0

2 1 X 3

2 2 X 0

1 X 3

2 X 0

1 2 X 9

1 3 X 0

1 2 X 9

1 3 X 0

2 3 X 12

2 3 X 21

2 4 X 9

2 4 X 12

2 5 X 0

. . . .

Recurrence

Let v1, v2, . . . , vc be the children of v .

[0, 0,X]v :=0

[j , p, x ]v :=maxpj≤p

max{[pj ,X]vj , [pj ,X]vj }+ [j − 1, p − pj , x ]v if vvj /∈Mω(vvj ) + [pj + 1,X]vj + [j − 1, p − pj ,X]v if x = X& vvj /∈M{

[pj − 1,X]vj[pj − 1,X]vj

}+ [j − 1, p − pj , x ]v if vvj ∈M

13 /33

Page 126: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Weighted Trees

Dynamic Programming Idea

bottom-up traversal; in each

vertex v , need to remember:

• #paths used below v• v incident with non-matching?

Semantics

[j , p, x ]v = max. weight collected

below v with p finished paths

“under x ” up to vj(abbrev: last child [p, x ]v)

v

v0

v1 v2 v3v4

v5

vj

125

9 6

3

1 1 X 0

1 X 0

1 X 01 X 0

1 X 01 0 X 9

1 1 X 0

2 1 X 9

2 2 X 0

1 X 9

2 X 0

1 1 X 0

2 1 X 3

2 2 X 0

1 X 3

2 X 0

1 2 X 9

1 3 X 0

1 2 X 9

1 3 X 0

2 3 X 12

2 3 X 21

2 4 X 9

2 4 X 12

2 5 X 0

. . . .

Recurrence

Let v1, v2, . . . , vc be the children of v .

[0, 0,X]v :=0

[j , p, x ]v :=maxpj≤p

max{[pj ,X]vj , [pj ,X]vj }+ [j − 1, p − pj , x ]v if vvj /∈Mω(vvj ) + [pj + 1,X]vj + [j − 1, p − pj ,X]v if x = X& vvj /∈M{

[pj − 1,X]vj[pj − 1,X]vj

}+ [j − 1, p − pj , x ]v if vvj ∈M

13 /33

Page 127: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Weighted Trees

Dynamic Programming Idea

bottom-up traversal; in each

vertex v , need to remember:

• #paths used below v• v incident with non-matching?

Semantics

[j , p, x ]v = max. weight collected

below v with p finished paths

“under x ” up to vj(abbrev: last child [p, x ]v)

v

v0

v1 v2 v3v4

v5

vj

125

9 6

3

1 1 X 0

1 X 0

1 X 01 X 0

1 X 01 0 X 9

1 1 X 0

2 1 X 9

2 2 X 0

1 X 9

2 X 0

1 1 X 0

2 1 X 3

2 2 X 0

1 X 3

2 X 0

1 2 X 9

1 3 X 0

1 2 X 9

1 3 X 0

2 3 X 12

2 3 X 21

2 4 X 9

2 4 X 12

2 5 X 0

. . . .

Recurrence

Let v1, v2, . . . , vc be the children of v .

[0, 0,X]v :=0

[j , p, x ]v :=maxpj≤p

max{[pj ,X]vj , [pj ,X]vj }+ [j − 1, p − pj , x ]v if vvj /∈Mω(vvj ) + [pj + 1,X]vj + [j − 1, p − pj ,X]v if x = X& vvj /∈M{

[pj − 1,X]vj[pj − 1,X]vj

}+ [j − 1, p − pj , x ]v if vvj ∈M

13 /33

Page 128: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Weighted Trees

Dynamic Programming Idea

bottom-up traversal; in each

vertex v , need to remember:

• #paths used below v• v incident with non-matching?

Semantics

[j , p, x ]v = max. weight collected

below v with p finished paths

“under x ” up to vj(abbrev: last child [p, x ]v)

v

v0

v1 v2 v3v4

v5

vj

125

9 6

3

1 1 X 0

1 X 0

1 X 01 X 0

1 X 0

1 0 X 9

1 1 X 0

2 1 X 9

2 2 X 0

1 X 9

2 X 0

1 1 X 0

2 1 X 3

2 2 X 0

1 X 3

2 X 0

1 2 X 9

1 3 X 0

1 2 X 9

1 3 X 0

2 3 X 12

2 3 X 21

2 4 X 9

2 4 X 12

2 5 X 0

. . . .

Recurrence

Let v1, v2, . . . , vc be the children of v .

[0, 0,X]v :=0

[j , p, x ]v :=maxpj≤p

max{[pj ,X]vj , [pj ,X]vj }+ [j − 1, p − pj , x ]v if vvj /∈Mω(vvj ) + [pj + 1,X]vj + [j − 1, p − pj ,X]v if x = X& vvj /∈M{

[pj − 1,X]vj[pj − 1,X]vj

}+ [j − 1, p − pj , x ]v if vvj ∈M

13 /33

Page 129: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Weighted Trees

Dynamic Programming Idea

bottom-up traversal; in each

vertex v , need to remember:

• #paths used below v• v incident with non-matching?

Semantics

[j , p, x ]v = max. weight collected

below v with p finished paths

“under x ” up to vj(abbrev: last child [p, x ]v)

v

v0

v1 v2 v3v4

v5

vj

125

9 6

3

1 1 X 0

1 X 0

1 X 01 X 0

1 X 01 0 X 9

1 1 X 0

2 1 X 9

2 2 X 0

1 X 9

2 X 0

1 1 X 0

2 1 X 3

2 2 X 0

1 X 3

2 X 0

1 2 X 9

1 3 X 0

1 2 X 9

1 3 X 0

2 3 X 12

2 3 X 21

2 4 X 9

2 4 X 12

2 5 X 0

. . . .

Recurrence

Let v1, v2, . . . , vc be the children of v .

[0, 0,X]v :=0

[j , p, x ]v :=maxpj≤p

max{[pj ,X]vj , [pj ,X]vj }+ [j − 1, p − pj , x ]v if vvj /∈Mω(vvj ) + [pj + 1,X]vj + [j − 1, p − pj ,X]v if x = X& vvj /∈M{

[pj − 1,X]vj[pj − 1,X]vj

}+ [j − 1, p − pj , x ]v if vvj ∈M

13 /33

Page 130: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Weighted Trees

Dynamic Programming Idea

bottom-up traversal; in each

vertex v , need to remember:

• #paths used below v• v incident with non-matching?

Semantics

[j , p, x ]v = max. weight collected

below v with p finished paths

“under x ” up to vj(abbrev: last child [p, x ]v)

v

v0

v1 v2 v3v4

v5

vj

125

9 6

3

1 1 X 0

1 X 0

1 X 01 X 0

1 X 01 0 X 9

1 1 X 0

2 1 X 9

2 2 X 0

1 X 9

2 X 0

1 1 X 0

2 1 X 3

2 2 X 0

1 X 3

2 X 0

1 2 X 9

1 3 X 0

1 2 X 9

1 3 X 0

2 3 X 12

2 3 X 21

2 4 X 9

2 4 X 12

2 5 X 0

. . . .

Recurrence

Let v1, v2, . . . , vc be the children of v .

[0, 0,X]v :=0

[j , p, x ]v :=maxpj≤p

max{[pj ,X]vj , [pj ,X]vj }+ [j − 1, p − pj , x ]v if vvj /∈Mω(vvj ) + [pj + 1,X]vj + [j − 1, p − pj ,X]v if x = X& vvj /∈M{

[pj − 1,X]vj[pj − 1,X]vj

}+ [j − 1, p − pj , x ]v if vvj ∈M

13 /33

Page 131: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Weighted Trees

Dynamic Programming Idea

bottom-up traversal; in each

vertex v , need to remember:

• #paths used below v• v incident with non-matching?

Semantics

[j , p, x ]v = max. weight collected

below v with p finished paths

“under x ” up to vj(abbrev: last child [p, x ]v)

v

v0

v1 v2 v3v4

v5

vj

125

9 6

3

1 1 X 0

1 X 0

1 X 01 X 0

1 X 01 0 X 9

1 1 X 0

2 1 X 9

2 2 X 0

1 X 9

2 X 0

1 1 X 0

2 1 X 3

2 2 X 0

1 X 3

2 X 0

1 2 X 9

1 3 X 0

1 2 X 9

1 3 X 0

2 3 X 12

2 3 X 21

2 4 X 9

2 4 X 12

2 5 X 0

. . . .

Recurrence

Let v1, v2, . . . , vc be the children of v .

[0, 0,X]v :=0

[j , p, x ]v :=maxpj≤p

max{[pj ,X]vj , [pj ,X]vj }+ [j − 1, p − pj , x ]v if vvj /∈Mω(vvj ) + [pj + 1,X]vj + [j − 1, p − pj ,X]v if x = X& vvj /∈M{

[pj − 1,X]vj[pj − 1,X]vj

}+ [j − 1, p − pj , x ]v if vvj ∈M

13 /33

Page 132: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Weighted Trees

Dynamic Programming Idea

bottom-up traversal; in each

vertex v , need to remember:

• #paths used below v• v incident with non-matching?

Semantics

[j , p, x ]v = max. weight collected

below v with p finished paths

“under x ” up to vj(abbrev: last child [p, x ]v)

v

v0

v1 v2 v3v4

v5

vj

125

9 6

3

1 1 X 0

1 X 0

1 X 01 X 0

1 X 01 0 X 9

1 1 X 0

2 1 X 9

2 2 X 0

1 X 9

2 X 0

1 1 X 0

2 1 X 3

2 2 X 0

1 X 3

2 X 0

1 2 X 9

1 3 X 0

1 2 X 9

1 3 X 0

2 3 X 12

2 3 X 21

2 4 X 9

2 4 X 12

2 5 X 0

. . . .

Recurrence

Let v1, v2, . . . , vc be the children of v .

[0, 0,X]v :=0

[j , p, x ]v :=maxpj≤p

max{[pj ,X]vj , [pj ,X]vj }+ [j − 1, p − pj , x ]v if vvj /∈Mω(vvj ) + [pj + 1,X]vj + [j − 1, p − pj ,X]v if x = X& vvj /∈M{

[pj − 1,X]vj[pj − 1,X]vj

}+ [j − 1, p − pj , x ]v if vvj ∈M

13 /33

Page 133: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Weighted Trees

Dynamic Programming Idea

bottom-up traversal; in each

vertex v , need to remember:

• #paths used below v• v incident with non-matching?

Semantics

[j , p, x ]v = max. weight collected

below v with p finished paths

“under x ” up to vj(abbrev: last child [p, x ]v)

v

v0

v1 v2 v3v4

v5

vj

125

9 6

3

1 1 X 0

1 X 0

1 X 01 X 0

1 X 01 0 X 9

1 1 X 0

2 1 X 9

2 2 X 0

1 X 9

2 X 0

1 1 X 0

2 1 X 3

2 2 X 0

1 X 3

2 X 0

1 2 X 9

1 3 X 0

1 2 X 9

1 3 X 0

2 3 X 12

2 3 X 21

2 4 X 9

2 4 X 12

2 5 X 0

. . . .

Recurrence

Let v1, v2, . . . , vc be the children of v .

[0, 0,X]v :=0

[j , p, x ]v :=maxpj≤p

max{[pj ,X]vj , [pj ,X]vj }+ [j − 1, p − pj , x ]v if vvj /∈Mω(vvj ) + [pj + 1,X]vj + [j − 1, p − pj ,X]v if x = X& vvj /∈M{

[pj − 1,X]vj[pj − 1,X]vj

}+ [j − 1, p − pj , x ]v if vvj ∈M

13 /33

Page 134: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Scaffolding in Weighted Trees

Dynamic Programming Idea

bottom-up traversal; in each

vertex v , need to remember:

• #paths used below v• v incident with non-matching?

Semantics

[j , p, x ]v = max. weight collected

below v with p finished paths

“under x ” up to vj(abbrev: last child [p, x ]v)

v

v0

v1 v2 v3v4

v5

vj

125

9 6

3

1 1 X 0

1 X 0

1 X 01 X 0

1 X 01 0 X 9

1 1 X 0

2 1 X 9

2 2 X 0

1 X 9

2 X 0

1 1 X 0

2 1 X 3

2 2 X 0

1 X 3

2 X 0

1 2 X 9

1 3 X 0

1 2 X 9

1 3 X 0

2 3 X 12

2 3 X 21

2 4 X 9

2 4 X 12

2 5 X 0

. . . .

Recurrence

Let v1, v2, . . . , vc be the children of v .

[0, 0,X]v :=0

[j , p, x ]v :=maxpj≤p

max{[pj ,X]vj , [pj ,X]vj }+ [j − 1, p − pj , x ]v if vvj /∈Mω(vvj ) + [pj + 1,X]vj + [j − 1, p − pj ,X]v if x = X& vvj /∈M{

[pj − 1,X]vj[pj − 1,X]vj

}+ [j − 1, p − pj , x ]v if vvj ∈M

13 /33

Page 135: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding

1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

14 /33

Page 136: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

14 /33

Page 137: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

14 /33

Page 138: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

14 /33

Page 139: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

14 /33

Page 140: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

14 /33

Page 141: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

14 /33

Page 142: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

14 /33

Page 143: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

14 /33

Page 144: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

14 /33

Page 145: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

14 /33

Page 146: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

14 /33

Page 147: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

14 /33

Page 148: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

Proof

Result S∗ is a valid solution X

Note: taking an edge forbids ≤ 3 OPT edges

mark the ≤ 3 OPT-edges when taking an edge e e is heaviest among them

3ω(S∗) ≥ OPT

14 /33

Page 149: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

Proof

Result S∗ is a valid solution XNote: taking an edge forbids ≤ 3 OPT edges

mark the ≤ 3 OPT-edges when taking an edge e e is heaviest among them

3ω(S∗) ≥ OPT

14 /33

Page 150: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

Proof

Result S∗ is a valid solution XNote: taking an edge forbids ≤ 3 OPT edges

mark the ≤ 3 OPT-edges when taking an edge e e is heaviest among them

3ω(S∗) ≥ OPT

14 /33

Page 151: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

Proof

Result S∗ is a valid solution XNote: taking an edge forbids ≤ 3 OPT edges

mark the ≤ 3 OPT-edges when taking an edge e e is heaviest among them

3ω(S∗) ≥ OPT

14 /33

Page 152: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

Proof

Result S∗ is a valid solution XNote: taking an edge forbids ≤ 3 OPT edges

mark the ≤ 3 OPT-edges when taking an edge e e is heaviest among them

3ω(S∗) ≥ OPT

14 /33

Page 153: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

Theorem

Scaffolding in complete graphs can be

3-approximated in O(|V | log |V |) time.

Remark

For Exact Scaffolding, we have to keep an eye on the number of

components too.

14 /33

Page 154: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

Theorem

Scaffolding in complete (bipartite) graphs can be

3-approximated in O(|V | log |V |) time.

Remark

For Exact Scaffolding, we have to keep an eye on the number of

components too.

14 /33

Page 155: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

Theorem

Scaffolding in complete (bipartite) graphs can be

3-approximated in O(|V | log |V |) time.

Remark

For Exact Scaffolding, we have to keep an eye on the number of

components too.

14 /33

Page 156: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding

1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remain

15 /33

Page 157: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remain

15 /33

Page 158: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remain

15 /33

Page 159: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remain

15 /33

Page 160: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remain

15 /33

Page 161: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remain

15 /33

Page 162: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remain

15 /33

Page 163: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remain

15 /33

Page 164: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remainProof

Result S∗ is a valid solution X

ω(S∗) ≥ ω(fix) ≥ ω(S)/2 ≥ OPT/2

15 /33

Page 165: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remainProof

Result S∗ is a valid solution Xω(S∗) ≥ ω(fix) ≥ ω(S)/2 ≥ OPT/2

15 /33

Page 166: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remainTheorem

Exact Scaffolding in complete graphs can be

2-approximated in O(|V |2) time.

Remark

For Scaffolding, replace Step 3 by either merging cycles or

removing lightest edge, whatever looses less weight

15 /33

Page 167: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remainTheorem

Exact Scaffolding in complete (bipartite) graphs can be

2-approximated in O(|V |2) time.

Remark

For Scaffolding, replace Step 3 by either merging cycles or

removing lightest edge, whatever looses less weight

15 /33

Page 168: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remainTheorem

Exact Scaffolding in complete (bipartite) graphs can be

2-approximated in O(|V |2) time.

Remark

For Scaffolding, replace Step 3 by either merging cycles or

removing lightest edge, whatever looses less weight

15 /33

Page 169: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Exact Algorithms I: Brute Force

j ii − 2j ii − 2j ii − 2

[p, c , j ]i :=max. weight collectible before vi with p & cpaths/cycles plus one path starting at vj

[p, c , j ]i =[p, c , j ]i−2 + ω(vi−2vi−1) if j < i − 2 & vi−2vi−1 ∈ E

[p, c , i − 1]i = maxj<i−2j even

{

[p − 1, c , j ]i−2

[p, c − 1, j ]i−2 + ω(vjvi−2) if vjvi−2 ∈ E

Observation

An ordering of V (G ) certifies YES-instances of Scaffolding.

try all O(n!) certificates

contigs force every other vertex O(n!!)

16 /33

Page 170: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Exact Algorithms I: Brute Force

j ii − 2j ii − 2j ii − 2

[p, c , j ]i :=max. weight collectible before vi with p & cpaths/cycles plus one path starting at vj

[p, c , j ]i =[p, c , j ]i−2 + ω(vi−2vi−1) if j < i − 2 & vi−2vi−1 ∈ E

[p, c , i − 1]i = maxj<i−2j even

{

[p − 1, c , j ]i−2

[p, c − 1, j ]i−2 + ω(vjvi−2) if vjvi−2 ∈ E

Observation

An ordering of V (G ) certifies YES-instances of Scaffolding.

try all O(n!) certificates

contigs force every other vertex O(n!!)

16 /33

Page 171: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Exact Algorithms I: Brute Force

j ii − 2

j ii − 2j ii − 2

[p, c , j ]i :=max. weight collectible before vi with p & cpaths/cycles plus one path starting at vj

[p, c , j ]i =[p, c , j ]i−2 + ω(vi−2vi−1) if j < i − 2 & vi−2vi−1 ∈ E

[p, c , i − 1]i = maxj<i−2j even

{

[p − 1, c , j ]i−2

[p, c − 1, j ]i−2 + ω(vjvi−2) if vjvi−2 ∈ E

Observation

An ordering of V (G ) certifies YES-instances of Scaffolding.

try all O(n!) certificates

contigs force every other vertex O(n!!)

16 /33

Page 172: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Exact Algorithms I: Brute Force

j ii − 2

j ii − 2

j ii − 2

[p, c , j ]i :=max. weight collectible before vi with p & cpaths/cycles plus one path starting at vj

[p, c , j ]i =[p, c , j ]i−2 + ω(vi−2vi−1) if j < i − 2 & vi−2vi−1 ∈ E

[p, c , i − 1]i = maxj<i−2j even

{[p − 1, c , j ]i−2

[p, c − 1, j ]i−2 + ω(vjvi−2) if vjvi−2 ∈ E

Observation

An ordering of V (G ) certifies YES-instances of Scaffolding.

try all O(n!) certificates

contigs force every other vertex O(n!!)

16 /33

Page 173: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Exact Algorithms I: Brute Force

j ii − 2j ii − 2

j ii − 2

[p, c , j ]i :=max. weight collectible before vi with p & cpaths/cycles plus one path starting at vj

[p, c , j ]i =[p, c , j ]i−2 + ω(vi−2vi−1) if j < i − 2 & vi−2vi−1 ∈ E

[p, c , i − 1]i = maxj<i−2j even

{[p − 1, c , j ]i−2

[p, c − 1, j ]i−2 + ω(vjvi−2) if vjvi−2 ∈ E

Observation

An ordering of V (G ) certifies YES-instances of Scaffolding.

try all O(n!) certificates

contigs force every other vertex O(n!!)

16 /33

Page 174: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Exact Algorithms I: Brute Force

j ii − 2j ii − 2

j ii − 2

[p, c , j ]i :=max. weight collectible before vi with p & cpaths/cycles plus one path starting at vj

[p, c , j ]i =[p, c , j ]i−2 + ω(vi−2vi−1) if j < i − 2 & vi−2vi−1 ∈ E

[p, c , i − 1]i = maxj<i−2j even

{[p − 1, c , j ]i−2

[p, c − 1, j ]i−2 + ω(vjvi−2) if vjvi−2 ∈ E

Observation

An ordering of V (G ) certifies YES-instances of Scaffolding.

try all O(n!) certificates

contigs force every other vertex O(n!!)

16 /33

Page 175: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Exact Algorithms I: Brute Force

j ii − 2j ii − 2

j ii − 2

[p, c , j ]i :=max. weight collectible before vi with p & cpaths/cycles plus one path starting at vj

[p, c , j ]i =[p, c , j ]i−2 + ω(vi−2vi−1) if j < i − 2 & vi−2vi−1 ∈ E

[p, c , i − 1]i = maxj<i−2j even

{[p − 1, c , j ]i−2

[p, c − 1, j ]i−2 + ω(vjvi−2) if vjvi−2 ∈ E

Observation

An ordering of V (G ) certifies YES-instances of Scaffolding.

try all O(n!) certificates

contigs force every other vertex O(n!!)

16 /33

Page 176: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Exact Algorithms I: Brute Force

j ii − 2j ii − 2

j ii − 2

[p, c , j ]i :=max. weight collectible before vi with p & cpaths/cycles plus one path starting at vj

[p, c , j ]i =[p, c , j ]i−2 + ω(vi−2vi−1) if j < i − 2 & vi−2vi−1 ∈ E

[p, c , i − 1]i = maxj<i−2j even

{[p − 1, c , j ]i−2

[p, c − 1, j ]i−2 + ω(vjvi−2) if vjvi−2 ∈ E

Observation

An ordering of V (G ) certifies YES-instances of Scaffolding.

try all O(n!) certificates

contigs force every other vertex O(√

2n · n/2!)

16 /33

Page 177: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Exact Algorithms II: Dynamic Programming

S − xy

x yw

u

Semantics

[S , p, c , u, v ] = max. weight collectible in G [S ] by p alt. paths, calt. cycles and an alt. path starting at u & ending at v

Computation

Let xy ∈M. Then, [{xy}, 0, 0, x , y ] := 0 and

[S , p, c , u, y ] := maxw∈G [S−xy ]

u 6=w

[S − xy , p, c , u,w ] + ω(wx)

[S , p, c , x , y ] :=

maxu,w∈G [S−xy ]

{

[S − xy , p − 1, c , u,w ]

[S − xy , p, c − 1, u,w ] + ω(wu) if wu ∈ E (G ) \M

17 /33

Page 178: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Exact Algorithms II: Dynamic Programming

S − xy

x ywu

Semantics

[S , p, c , u, v ] = max. weight collectible in G [S ] by p alt. paths, calt. cycles and an alt. path starting at u & ending at v

Computation

Let xy ∈M. Then, [{xy}, 0, 0, x , y ] := 0 and

[S , p, c , u, y ] := maxw∈G [S−xy ]

u 6=w

[S − xy , p, c , u,w ] + ω(wx)

[S , p, c , x , y ] :=

maxu,w∈G [S−xy ]

{

[S − xy , p − 1, c , u,w ]

[S − xy , p, c − 1, u,w ] + ω(wu) if wu ∈ E (G ) \M

17 /33

Page 179: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Exact Algorithms II: Dynamic Programming

S − xy

x ywu

Semantics

[S , p, c , u, v ] = max. weight collectible in G [S ] by p alt. paths, calt. cycles and an alt. path starting at u & ending at v

Computation

Let xy ∈M. Then, [{xy}, 0, 0, x , y ] := 0 and

[S , p, c , u, y ] := maxw∈G [S−xy ]

u 6=w

[S − xy , p, c , u,w ] + ω(wx)

[S , p, c , x , y ] := maxu,w∈G [S−xy ]

{[S − xy , p − 1, c , u,w ]

[S − xy , p, c − 1, u,w ] + ω(wu) if wu ∈ E (G ) \M

17 /33

Page 180: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Exact Algorithms II: Dynamic Programming

S − xy

x ywu

Semantics

[S , p, c , u, v ] = max. weight collectible in G [S ] by p alt. paths, calt. cycles and an alt. path starting at u & ending at v

Computation

Let xy ∈M. Then, [{xy}, 0, 0, x , y ] := 0 and

[S , p, c , u, y ] := maxw∈G [S−xy ]

u 6=w

[S − xy , p, c , u,w ] + ω(wx)

[S , p, c , x , y ] := maxu,w∈G [S−xy ]

{[S − xy , p − 1, c , u,w ]

[S − xy , p, c − 1, u,w ] + ω(wu) if wu ∈ E (G ) \M

17 /33

Page 181: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Exact Algorithms II: Dynamic Programming

S − xy

x ywu

Semantics

[S , p, c , u, v ] = max. weight collectible in G [S ] by p alt. paths, calt. cycles and an alt. path starting at u & ending at v

Theorem

Scaffolding can be solved in O(√

2nn3σpσc) time.

17 /33

Page 182: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

Recall

• Scaffolding is hard in any sufficiently

dense graph class

• Scaffolding is easy in trees

A Shot at Sparsity

G is Quasi-forest ⇔ G −M is forest

Observation

Each leaf v of G −M has degree 2 in G if unweighted, can we take both?

remove all non-matching edges from parent u, except uv

But is it even NP-hard?

18 /33

Page 183: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

Recall

• Scaffolding is hard in any sufficiently

dense graph class

• Scaffolding is easy in trees

A Shot at Sparsity

G is Quasi-forest ⇔ G −M is forest

Observation

Each leaf v of G −M has degree 2 in G if unweighted, can we take both?

remove all non-matching edges from parent u, except uv

But is it even NP-hard?

18 /33

Page 184: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

Recall

• Scaffolding is hard in any sufficiently

dense graph class

• Scaffolding is easy in trees

A Shot at Sparsity

G is Quasi-forest ⇔ G −M is forest

Observation

Each leaf v of G −M has degree 2 in G if unweighted, can we take both?

remove all non-matching edges from parent u, except uv

But is it even NP-hard?

18 /33

Page 185: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

Recall

• Scaffolding is hard in any sufficiently

dense graph class

• Scaffolding is easy in trees

A Shot at Sparsity

G is Quasi-forest ⇔ G −M is forest

Observation

Each leaf v of G −M has degree 2 in G if unweighted, can we take both?

remove all non-matching edges from parent u, except uv

But is it even NP-hard?

18 /33

Page 186: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

Recall

• Scaffolding is hard in any sufficiently

dense graph class

• Scaffolding is easy in trees

A Shot at Sparsity

G is Quasi-forest ⇔ G −M is forest

Observation

Each leaf v of G −M has degree 2 in G if unweighted, can we take both?

remove all non-matching edges from parent u, except uv

u

v

Observation

• v in path & u in cycle 1 path X• v in path & u in path 2 paths X

unless it’s the same path!

But is it even NP-hard?

18 /33

Page 187: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

Recall

• Scaffolding is hard in any sufficiently

dense graph class

• Scaffolding is easy in trees

A Shot at Sparsity

G is Quasi-forest ⇔ G −M is forest

Observation

Each leaf v of G −M has degree 2 in G if unweighted, can we take both?

remove all non-matching edges from parent u, except uv

u

v

Observation

• v in path & u in cycle 1 path X• v in path & u in path 2 paths X

unless it’s the same path!

But is it even NP-hard?

18 /33

Page 188: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

Recall

• Scaffolding is hard in any sufficiently

dense graph class

• Scaffolding is easy in trees

A Shot at Sparsity

G is Quasi-forest ⇔ G −M is forest

Observation

Each leaf v of G −M has degree 2 in G if unweighted, can we take both?

remove all non-matching edges from parent u, except uv

u

v

Observation

• v in path & u in cycle 1 path X• v in path & u in path 2 paths X

unless it’s the same path!But is it even NP-hard?

18 /33

Page 189: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

Recall

• Scaffolding is hard in any sufficiently

dense graph class

• Scaffolding is easy in trees

A Shot at Sparsity

G is Quasi-forest ⇔ G −M is forest

Observation

Each leaf v of G −M has degree 2 in G if unweighted, can we take both?

remove all non-matching edges from parent u, except uv

u

v

Observation

• v in path & u in cycle 1 path X• v in path & u in path 2 paths X

unless it’s the same path!

But is it even NP-hard?

18 /33

Page 190: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

Recall

• Scaffolding is hard in any sufficiently

dense graph class

• Scaffolding is easy in trees

A Shot at Sparsity

G is Quasi-forest ⇔ G −M is forest

Observation

Each leaf v of G −M has degree 2 in G if unweighted, can we take both? X

remove all non-matching edges from parent u, except uv

u

v

Observation

• v in path & u in cycle 1 path X• v in path & u in path 2 paths X

unless it’s the same path!But is it even NP-hard?

18 /33

Page 191: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

Recall

• Scaffolding is hard in any sufficiently

dense graph class

• Scaffolding is easy in trees

A Shot at Sparsity

G is Quasi-forest ⇔ G −M is forest

Observation

Each leaf v of G −M has degree 2 in G if σp = 0, we have to take both!

remove all non-matching edges from parent u, except uv

u

v

Observation

• v in path & u in cycle 1 path X• v in path & u in path 2 paths X

unless it’s the same path!

But is it even NP-hard?

18 /33

Page 192: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

Recall

• Scaffolding is hard in any sufficiently

dense graph class

• Scaffolding is easy in trees

A Shot at Sparsity

G is Quasi-forest ⇔ G −M is forest

Observation

Each leaf v of G −M has degree 2 in G if σp = 0, we have to take both!

remove all non-matching edges from parent u, except uv

u

v

Observation

• v in path & u in cycle 1 path X• v in path & u in path 2 paths X

unless it’s the same path!

But is it even NP-hard?

18 /33

Page 193: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

Recall

• Scaffolding is hard in any sufficiently

dense graph class

• Scaffolding is easy in trees

A Shot at Sparsity

G is Quasi-forest ⇔ G −M is forest

Observation

Each leaf v of G −M has degree 2 in G if σp = 0, we have to take both!

remove all non-matching edges from parent u, except uv

Corollary

Scaffolding can be solved in O(n) on quasi-forests if σp = 0.

Scaffolding can be solved in O(n2σp+1) in quasi-forests.

But is it even NP-hard?

18 /33

Page 194: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

Recall

• Scaffolding is hard in any sufficiently

dense graph class

• Scaffolding is easy in trees

A Shot at Sparsity

G is Quasi-forest ⇔ G −M is forest

Observation

Each leaf v of G −M has degree 2 in G if σp = 0, we have to take both!

remove all non-matching edges from parent u, except uv

Corollary

Scaffolding can be solved in O(n) on quasi-forests if σp = 0.Scaffolding can be solved in O(n2σp+1) in quasi-forests.

But is it even NP-hard?

18 /33

Page 195: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

Recall

• Scaffolding is hard in any sufficiently

dense graph class

• Scaffolding is easy in trees

A Shot at Sparsity

G is Quasi-forest ⇔ G −M is forest

Observation

Each leaf v of G −M has degree 2 in G if σp = 0, we have to take both!

remove all non-matching edges from parent u, except uv

Corollary

Scaffolding can be solved in O(n) on quasi-forests if σp = 0.Scaffolding can be solved in O(n2σp+1) in quasi-forests.

But is it even NP-hard?18 /33

Page 196: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

Weighted 2-SATInput: ϕ on X in 2-CNF, weights ω : X × {0, 1} → N, k ∈ N

Question: is there a satisfying assignment for ϕ of weight ≤ k?

Remark

Independent Set is special case of Weighted 2-SAT

19 /33

Page 197: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

Weighted 2-SATInput: ϕ on X in 2-CNF, weights ω : X × {0, 1} → N, k ∈ N

Question: is there a satisfying assignment for ϕ of weight ≤ k?

Remark

Independent Set is special case of Weighted 2-SAT

19 /33

Page 198: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

19 /33

Page 199: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

19 /33

Page 200: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

19 /33

Page 201: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

19 /33

Page 202: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

19 /33

Page 203: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

19 /33

Page 204: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

Observation

∃ weight-k satisfying assignment

∃ weight-k cover with ≤ nalternating paths

Theorem

Scaffolding is NP-hard even if G −Mis a collection of paths with weights

0/1

Corollary

no 2o(n+m)-time algorithm (ETH)

no no(k)-time algorithm (FPT 6=W[t])

19 /33

Page 205: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

Observation

∃ weight-k satisfying assignment

∃ weight-k cover with ≤ nalternating paths

Theorem

Scaffolding is NP-hard even if G −Mis a collection of paths with weights

0/1

Corollary

no 2o(n+m)-time algorithm (ETH)

no no(k)-time algorithm (FPT 6=W[t])

19 /33

Page 206: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

Observation

∃ weight-k satisfying assignment

∃ weight-k cover with ≤ nalternating paths

Theorem

Scaffolding is NP-hard even if G −Mis a collection of paths with weights

0/1

Corollary

no 2o(n+m)-time algorithm (ETH)

no no(k)-time algorithm (FPT 6=W[t])

19 /33

Page 207: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

Observation

∃ weight-k satisfying assignment

∃ weight-k cover with ≤ nalternating paths

Theorem

Scaffolding is NP-hard even if G −Mis a collection of paths with weights

0/1

Corollary

no 2o(n+m)-time algorithm (ETH)

no no(k)-time algorithm (FPT 6=W[t])

19 /33

Page 208: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

Observation

∃ weight-k satisfying assignment

∃ weight-k cover with ≤ nalternating paths

Theorem

Scaffolding is NP-hard even if G −Mis a collection of paths with weights

0/1

Corollary

no 2o(n+m)-time algorithm (ETH)

no no(k)-time algorithm (FPT 6=W[t])

19 /33

Page 209: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

Observation

∃ weight-k satisfying assignment

∃ weight-k cover with ≤ nalternating paths

Theorem

Scaffolding is NP-hard even if G −Mis a collection of paths with weights

0/1

Corollary

no 2o(n+m)-time algorithm (ETH)

no no(k)-time algorithm (FPT 6=W[t])

19 /33

Page 210: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Sparse Graphs: Quasi-Forest

Observation

∃ weight-k satisfying assignment

∃ weight-k cover with ≤ nalternating paths

Theorem

Scaffolding is NP-hard even if G −Mis a collection of paths with weights

0/1

Corollary

no 2o(n+m)-time algorithm (ETH)

no no(k)-time algorithm (FPT 6=W[t])

19 /33

Page 211: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Other Forms of Tree-Likeness

Tree Decompositions

tree T , each vertex i associated to some Xi ⊆ V (G ) s.t.

1. ∀ e ∈ E (G ), there is some i ∈ V (T ) with e ∈ Xi2. ∀ v ∈ V (G ), bags containing v induce a connected subtree

treewidth tw = size of largest bag - 1

Hope

Practical instances of Scaffolding have low treewidth (they

originate from linear structure)

Nice Decompositions

Leaf: X = ∅Introduce v : i has single child j and Xi \ Xj = {v}Forget v : i has single child j and Xj \ Xi = {v}Introduce uv : i has single child j and uv ⊆ Xi = Xj

(each edge introduced exactly once)

Join: i has 2 children j and ` and Xi = Xj = X`

20/33

Page 212: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Other Forms of Tree-Likeness

Tree Decompositions

tree T , each vertex i associated to some Xi ⊆ V (G ) s.t.

1. ∀ e ∈ E (G ), there is some i ∈ V (T ) with e ∈ Xi2. ∀ v ∈ V (G ), bags containing v induce a connected subtree

treewidth tw = size of largest bag - 1

Hope

Practical instances of Scaffolding have low treewidth (they

originate from linear structure)

Nice Decompositions

Leaf: X = ∅Introduce v : i has single child j and Xi \ Xj = {v}Forget v : i has single child j and Xj \ Xi = {v}Introduce uv : i has single child j and uv ⊆ Xi = Xj

(each edge introduced exactly once)

Join: i has 2 children j and ` and Xi = Xj = X`

20/33

Page 213: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Other Forms of Tree-Likeness

Tree Decompositions

tree T , each vertex i associated to some Xi ⊆ V (G ) s.t.

1. ∀ e ∈ E (G ), there is some i ∈ V (T ) with e ∈ Xi2. ∀ v ∈ V (G ), bags containing v induce a connected subtree

treewidth tw = size of largest bag - 1

Hope

Practical instances of Scaffolding have low treewidth (they

originate from linear structure)

Nice Decompositions

Leaf: X = ∅Introduce v : i has single child j and Xi \ Xj = {v}Forget v : i has single child j and Xj \ Xi = {v}Introduce uv : i has single child j and uv ⊆ Xi = Xj

(each edge introduced exactly once)

Join: i has 2 children j and ` and Xi = Xj = X`

20/33

Page 214: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

How do Solutions Interact with Bags?

Xi

Gi [S ]

Ingredients

• degree-function d : X → {0, 1, 2}• “pairing” ⊆

(X2

)∪ X

#matchings possibilities O(|X ||X |/2)

• #paths and #cycles completed “below the bag”

21 /33

Page 215: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

How do Solutions Interact with Bags?

Xi

Gi [S ]

Ingredients

• degree-function d : X → {0, 1, 2}• “pairing” ⊆

(X2

)∪ X

#matchings possibilities O(|X ||X |/2)

• #paths and #cycles completed “below the bag”

21 /33

Page 216: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

How do Solutions Interact with Bags?

Xi

Gi [S ]

Ingredients

• degree-function d : X → {0, 1, 2}• “pairing” ⊆

(X2

)∪ X

#matchings possibilities O(|X ||X |/2)

• #paths and #cycles completed “below the bag”

21 /33

Page 217: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

How do Solutions Interact with Bags?

Xi

Gi [S ]

Ingredients

• degree-function d : X → {0, 1, 2}• “pairing” ⊆

(X2

)∪ X

#matchings possibilities O(|X ||X |/2)

• #paths and #cycles completed “below the bag”

21 /33

Page 218: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

How do Solutions Interact with Bags?

Xi

Gi [S ]

Ingredients

• degree-function d : X → {0, 1, 2}• “pairing” ⊆

(X2

)∪ X

#matchings possibilities O(|X ||X |/2)

• #paths and #cycles completed “below the bag”

21 /33

Page 219: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

How do Solutions Interact with Bags?

Xi

Gi [S ]

Ingredients

• degree-function d : X → {0, 1, 2}• “pairing” ⊆

(X2

)∪ X #matchings possibilities O(|X ||X |/2)

• #paths and #cycles completed “below the bag”

21 /33

Page 220: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

How do Solutions Interact with Bags?

Xi

Gi [S ]

Ingredients

• degree-function d : X → {0, 1, 2}• “pairing” ⊆

(X2

)∪ X #matchings possibilities O(|X ||X |/2)

• #paths and #cycles completed “below the bag”

Semantics

[d ,P, p, c]i = max. weight of any S with M∩ E (Gi ) ⊆ S ⊆ E (Gi ) and

1. each vertex v ∈ Xi has degree d(v) in Gi [S ],2. for each uv ∈ P , Gi [S ] contains an alternating path. . .

u = v : . . . from u avoiding d−1(1)u 6= v : . . . from u to v

3. Gi [S ] contains p alt. paths & c alt. cycles avoiding d−1(1)

21 /33

Page 221: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

How do Solutions Interact with Bags?

Xi

Gi [S ]

Ingredients

• degree-function d : X → {0, 1, 2}• “pairing” ⊆

(X2

)∪ X #matchings possibilities O(|X ||X |/2)

• #paths and #cycles completed “below the bag”

Leaf Bag

[∅,∅, 0, 0]i = 0

21 /33

Page 222: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

How do Solutions Interact with Bags?

Xi

Gi [S ]

Ingredients

• degree-function d : X → {0, 1, 2}• “pairing” ⊆

(X2

)∪ X #matchings possibilities O(|X ||X |/2)

• #paths and #cycles completed “below the bag”

Introduce v (single child j)[d ,P, p, c]i = [d |v→⊥,P, p, c]j

21 /33

Page 223: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

How do Solutions Interact with Bags?

Xi

Gi [S ]

Ingredients

• degree-function d : X → {0, 1, 2}• “pairing” ⊆

(X2

)∪ X #matchings possibilities O(|X ||X |/2)

• #paths and #cycles completed “below the bag”

Introduce uv (single child j)Case 1: d(u) = d(v) = 2

u v

21 /33

Page 224: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

How do Solutions Interact with Bags?

Xi

Gi [S ]

Ingredients

• degree-function d : X → {0, 1, 2}• “pairing” ⊆

(X2

)∪ X #matchings possibilities O(|X ||X |/2)

• #paths and #cycles completed “below the bag”

Introduce uv (single child j)Case 1: d(u) = d(v) = 2

u v [d ,P, p, c]i = [d |u→1,v→1,P + uv , p, c − 1]j

21 /33

Page 225: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

How do Solutions Interact with Bags?

Xi

Gi [S ]

Ingredients

• degree-function d : X → {0, 1, 2}• “pairing” ⊆

(X2

)∪ X #matchings possibilities O(|X ||X |/2)

• #paths and #cycles completed “below the bag”

Introduce uv (single child j)Case 1: d(u) = d(v) = 2

u v u v u v u v

21 /33

Page 226: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

How do Solutions Interact with Bags?

Xi

Gi [S ]

Ingredients

• degree-function d : X → {0, 1, 2}• “pairing” ⊆

(X2

)∪ X #matchings possibilities O(|X ||X |/2)

• #paths and #cycles completed “below the bag”

Forget v (single child j)

u

[d ,P, p, c]i = max

[d |v→1,P + vv , p − 1, c]j

maxuu∈P

[d |v→1, (P − uu) + uv , p, c]j

maxx∈{0,2}

[d |v→x ,P, p, c]j

21 /33

Page 227: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

How do Solutions Interact with Bags?

Xi

Gi [S ]

Ingredients

• degree-function d : X → {0, 1, 2}• “pairing” ⊆

(X2

)∪ X #matchings possibilities O(|X ||X |/2)

• #paths and #cycles completed “below the bag”

Forget v (single child j)

u

v

[d ,P, p, c]i = max

[d |v→1,P + vv , p − 1, c]j

maxuu∈P

[d |v→1, (P − uu) + uv , p, c]j

maxx∈{0,2}

[d |v→x ,P, p, c]j

21 /33

Page 228: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

How do Solutions Interact with Bags?

Xi

Gi [S ]

Ingredients

• degree-function d : X → {0, 1, 2}• “pairing” ⊆

(X2

)∪ X #matchings possibilities O(|X ||X |/2)

• #paths and #cycles completed “below the bag”

Forget v (single child j)u v

[d ,P, p, c]i = max

[d |v→1,P + vv , p − 1, c]j

maxuu∈P

[d |v→1, (P − uu) + uv , p, c]j

maxx∈{0,2}

[d |v→x ,P, p, c]j

21 /33

Page 229: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

How do Solutions Interact with Bags?

Xi

Gi [S ]

Ingredients

• degree-function d : X → {0, 1, 2}• “pairing” ⊆

(X2

)∪ X #matchings possibilities O(|X ||X |/2)

• #paths and #cycles completed “below the bag”

Forget v (single child j)

u

v

[d ,P, p, c]i = max

[d |v→1,P + vv , p − 1, c]j

maxuu∈P

[d |v→1, (P − uu) + uv , p, c]j

maxx∈{0,2}

[d |v→x ,P, p, c]j

21 /33

Page 230: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

How do Solutions Interact with Bags?

Xi

Gi [S ]

Ingredients

• degree-function d : X → {0, 1, 2}• “pairing” ⊆

(X2

)∪ X #matchings possibilities O(|X ||X |/2)

• #paths and #cycles completed “below the bag”

Join Bag (children j & `)

[d ,P, p, c]i = maxdj ,Pj ,pj ,cj

maxP`

Pj t P` = P

[dj ,Pj , pj , cj ]j + [d − dj ,P`, p − pj , c − cj ]`

O(3tw · twtw /2 ·σp · σc) table entries

and O((tw +2)tw · σp · σc · n) time

21 /33

Page 231: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

How do Solutions Interact with Bags?

Xi

Gi [S ]

Ingredients

• degree-function d : X → {0, 1, 2}• “pairing” ⊆

(X2

)∪ X #matchings possibilities O(|X ||X |/2)

• #paths and #cycles completed “below the bag”

Join Bag (children j & `)

[d ,P, p, c]i = maxdj ,Pj ,pj ,cj

maxP`

Pj t P` = P

[dj ,Pj , pj , cj ]j + [d − dj ,P`, p − pj , c − cj ]`

O(3tw · twtw /2 ·σp · σc) table entries

and O((tw +2)tw · σp · σc · n) time

21 /33

Page 232: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

How do Solutions Interact with Bags?

Xi

Gi [S ]

Ingredients

• degree-function d : X → {0, 1, 2}• “pairing” ⊆

(X2

)∪ X #matchings possibilities O(|X ||X |/2)

• #paths and #cycles completed “below the bag”

Join Bag (children j & `)

[d ,P, p, c]i = maxdj ,Pj ,pj ,cj

maxP`

Pj t P` = P

[dj ,Pj , pj , cj ]j + [d − dj ,P`, p − pj , c − cj ]`

O(2tw · twtw /2 ·σp · σc) table entries

and O((tw +2)tw · σp · σc · n) time

21 /33

Page 233: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

How do Solutions Interact with Bags?

Xi

Gi [S ]

Ingredients

• degree-function d : X → {0, 1, 2}• “pairing” ⊆

(X2

)∪ X #matchings possibilities O(|X ||X |/2)

• #paths and #cycles completed “below the bag”

Join Bag (children j & `)

[d ,P, p, c]i = maxdj ,Pj ,pj ,cj

maxP`

Pj t P` = P

[dj ,Pj , pj , cj ]j + [d − dj ,P`, p − pj , c − cj ]`

O(2tw · twtw /2 ·σp · σc) table entries and O((tw +2)tw · σp · σc · n) time

21 /33

Page 234: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Too slow

in prac-

tice!

22/33

Page 235: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Integer Linear Program Formulation

s

tc

t

p- chromosomes = disjoint s-t-paths- bin. variables yuv = 1⇔ u → v used

x{u,v} = yuv + yvu

- force contigs: ∀uv∈Mxuv = 1- path preservation: ∀u 6=s,t

∑v yvu =

∑v yuv

- path bounds:∑

v yvt ≤ σ- forbid cycles (row generation via callback):

∀ cycle C :∑

uv∈Cyuv < |C |

- objective: max∑e∈E

x{u,v} · ω(e)

- cycle consistency: ∀uyutc ≤ ysu

- jump mechanics

!!!

23/33

Page 236: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Integer Linear Program Formulation

s

tc

t

p

- chromosomes = disjoint s-t-paths

- bin. variables yuv = 1⇔ u → v usedx{u,v} = yuv + yvu

- force contigs: ∀uv∈Mxuv = 1- path preservation: ∀u 6=s,t

∑v yvu =

∑v yuv

- path bounds:∑

v yvt ≤ σ- forbid cycles (row generation via callback):

∀ cycle C :∑

uv∈Cyuv < |C |

- objective: max∑e∈E

x{u,v} · ω(e)

- cycle consistency: ∀uyutc ≤ ysu

- jump mechanics

!!!

23/33

Page 237: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Integer Linear Program Formulation

s

tc

t

p

- chromosomes = disjoint s-t-paths- bin. variables yuv = 1⇔ u → v used

x{u,v} = yuv + yvu

- force contigs: ∀uv∈Mxuv = 1- path preservation: ∀u 6=s,t

∑v yvu =

∑v yuv

- path bounds:∑

v yvt ≤ σ

- forbid cycles (row generation via callback):

∀ cycle C :∑

uv∈Cyuv < |C |

- objective: max∑e∈E

x{u,v} · ω(e)

- cycle consistency: ∀uyutc ≤ ysu

- jump mechanics

!!!

23/33

Page 238: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Integer Linear Program Formulation

s

tc

t

p

- chromosomes = disjoint s-t-paths- bin. variables yuv = 1⇔ u → v used

x{u,v} = yuv + yvu

- force contigs: ∀uv∈Mxuv = 1- path preservation: ∀u 6=s,t

∑v yvu =

∑v yuv

- path bounds:∑

v yvt ≤ σ- forbid cycles (row generation via callback):

∀ cycle C :∑

uv∈Cyuv < |C |

- objective: max∑e∈E

x{u,v} · ω(e)

- cycle consistency: ∀uyutc ≤ ysu

- jump mechanics

!!!

23/33

Page 239: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Integer Linear Program Formulation

s

tc

t

p

- chromosomes = disjoint s-t-paths- bin. variables yuv = 1⇔ u → v used

x{u,v} = yuv + yvu

- force contigs: ∀uv∈Mxuv = 1- path preservation: ∀u 6=s,t

∑v yvu =

∑v yuv

- path bounds:∑

v yvt ≤ σ- forbid cycles (row generation via callback):

∀ cycle C :∑

uv∈Cyuv < |C |

- objective: max∑e∈E

x{u,v} · ω(e)

- cycle consistency: ∀uyutc ≤ ysu

- jump mechanics

!!!

23/33

Page 240: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Integer Linear Program Formulation

s tc

tp- chromosomes = disjoint s-t-paths- bin. variables yuv = 1⇔ u → v used

x{u,v} = yuv + yvu

- force contigs: ∀uv∈Mxuv = 1- path preservation: ∀u 6=s,t

∑v yvu =

∑v yuv

- path bounds:∑

v yvt ≤ σ- forbid cycles (row generation via callback):

∀ cycle C :∑

uv∈Cyuv < |C |

- objective: max∑e∈E

x{u,v} · ω(e)

- cycle consistency: ∀uyutc ≤ ysu

- jump mechanics

!!!

23/33

Page 241: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Integer Linear Program Formulation

s tc

tp- chromosomes = disjoint s-{tp, tc}-paths- bin. variables yuv = 1⇔ u → v used

x{u,v} = yuv + yvu

- force contigs: ∀uv∈Mxuv = 1- path preservation: ∀u 6=s,tp ,tc

∑v yvu =

∑v yuv

- path & cycle bounds:∑

v yvt{p,c} ≤ σ{p,c}- forbid cycles (row generation via callback):

∀ cycle C :∑

uv∈C(yuv−yutc ) < |C |

- objective: max∑e∈E

x{u,v} · ω(e)

- cycle consistency: ∀uyutc ≤ ysu

- jump mechanics

!!!

23/33

Page 242: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Extension: Contig Jumps

Mean read length: 70bp

Mean insert size: 472bp

140

262

149

119

336

116

397

72

1

75

14

5

100

1

21

16

10

37

45

1

22

1

7

61

78

69

1

15

51

40

65

2

61

24/33

Page 243: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Extension: Contig Jumps

Mean read length: 70bp

Mean insert size: 472bp

140

262

149

119

336

116

397

72

1

75

14

5

100

1

21

16

10

37

45

1

22

1

7

61

78

69

1

15

51

40

65

2

61

24/33

Page 244: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Extension: Contig Jumps

Mean read length: 70bp

Mean insert size: 472bp

140

262

149

119

336

116

397

72

1

75

14

5

100

1

21

16

10

37

45

1

22

1

7

61

78

69

1

15

51

40

65

2

61

24/33

Page 245: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Extension: Contig Jumps

Mean read length: 70bp

Mean insert size: 472bp

140

262

149

119

336

116

397

72

1

75

14

5

100

1

21

16

10

37

45

1

22

1

7

61

78

69

1

15

51

40

65

2

61

24/33

Page 246: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Integer Linear Program Formulation

s tc

tp- chromosomes = disjoint s-{tp, tc}-paths- bin. variables yuv = 1⇔ u → v used

x{u,v} = yuv + yvu

- force contigs: ∀uv∈Mxuv = 1- path preservation: ∀u 6=s,tp ,tc

∑v yvu =

∑v yuv

- path & cycle bounds:∑

v yvt{p,c} ≤ σ{p,c}- forbid cycles (row generation via callback):

∀ cycle C :∑

uv∈C(yuv−yutc ) < |C |

- objective: max∑e∈E

x{u,v} · ω(e)

- cycle consistency: ∀uyutc ≤ ysu

- jump mechanics

!!!

25/33

Page 247: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Integer Linear Program Formulation

s tc

tp- chromosomes = disjoint s-{tp, tc}-paths- bin. variables yuv = 1⇔ u → v used

x{u,v} = yuv + yvu

- force contigs: ∀uv∈Mxuv = 1- path preservation: ∀u 6=s,tp ,tc

∑v yvu =

∑v yuv

- path & cycle bounds:∑

v yvt{p,c} ≤ σ{p,c}- forbid cycles (row generation via callback):

∀ cycle C :∑

uv∈C(yuv−yutc ) < |C |

- objective: max∑e∈E

x{u,v} · ω(e)

- cycle consistency: ∀uyutc ≤ ysu

- jump mechanics

!!!

25/33

Page 248: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Integer Linear Program Formulation

s tc

tp- chromosomes = disjoint s-{tp, tc}-paths- bin. variables yuv = 1⇔ u → v used

x{u,v} = yuv + yvu

- force contigs: ∀uv∈Mxuv = 1- path preservation: ∀u 6=s,tp ,tc

∑v yvu =

∑v yuv

- path & cycle bounds:∑

v yvt{p,c} ≤ σ{p,c}- forbid cycles (row generation via callback):

∀ cycle C :∑

uv∈C(yuv−yutc ) < |C |

- objective: max∑e∈E

x{u,v} · ω(e)

- cycle consistency: ∀uyutc ≤ ysu

- jump mechanics

!!!

25/33

Page 249: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Integer Linear Program Formulation

s tc

tp- chromosomes = disjoint s-{tp, tc}-paths- bin. variables yuv = 1⇔ u → v used

x{u,v} = yuv + yvu

- force contigs: ∀uv∈Mxuv = 1- path preservation: ∀u 6=s,tp ,tc

∑v yvu =

∑v yuv

- path & cycle bounds:∑

v yvt{p,c} ≤ σ{p,c}- forbid cycles (row generation via callback):

∀ cycle C :∑

uv∈C(yuv−yutc ) < |C |

- objective: max∑e∈E

x{u,v} · ω(e)

- cycle consistency: ∀uyutc ≤ ysu

- jump mechanics

!!!

25/33

Page 250: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Integer Linear Program Formulation

s tc

tp- chromosomes = disjoint s-{tp, tc}-paths- bin. variables yuv = 1⇔ u → v used

x{u,v} = yuv + yvu

- force contigs: ∀uv∈Mxuv = 1- path preservation: ∀u 6=s,tp ,tc

∑v yvu =

∑v yuv

- path & cycle bounds:∑

v yvt{p,c} ≤ σ{p,c}- forbid cycles (row generation via callback):

∀ cycle C :∑

uv∈C(yuv−yutc ) < |C |

- objective: max∑e∈E

x{u,v} · ω(e)

- cycle consistency: ∀uyutc ≤ ysu

- jump mechanics

!!!

25/33

Page 251: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Integer Linear Program Formulation

s tc

tp- chromosomes = disjoint s-{tp, tc}-paths- bin. variables yuv = 1⇔ u → v used

x{u,v} = yuv + yvu

- force contigs: ∀uv∈Mxuv = 1- path preservation: ∀u 6=s,tp ,tc

∑v yvu =

∑v yuv

- path & cycle bounds:∑

v yvt{p,c} ≤ σ{p,c}- forbid cycles (row generation via callback):

∀ cycle C :∑

uv∈C(yuv−yutc ) < |C |

- objective: max∑e∈E

x{u,v} · ω(e)

- cycle consistency: ∀uyutc ≤ ysu

- jump mechanics

!!!

Jump Mechanicsfor each non-contig uv ,

1. introduce a variable zuv

2. construct “jump network” between u and v that fits in the gap

3. add zuv to x{u,v}extra: preprocess instance to finish incomplete jumps

25/33

Page 252: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Integer Linear Program Formulation

s tc

tp- chromosomes = disjoint s-{tp, tc}-paths- bin. variables yuv = 1⇔ u → v used

x{u,v} = yuv + yvu+zuv + zvu

- force contigs: ∀uv∈Mxuv = 1- path preservation: ∀u 6=s,tp ,tc

∑v yvu =

∑v yuv

- path & cycle bounds:∑

v yvt{p,c} ≤ σ{p,c}- forbid cycles (row generation via callback):

∀ cycle C :∑

uv∈C(yuv−yutc ) < |C |

- objective: max∑e∈E

x{u,v} · ω(e)

- cycle consistency: ∀uyutc ≤ ysu

- jump mechanics

!!!

Jump Mechanicsfor each non-contig uv ,

1. introduce a variable zuv

2. construct “jump network” between u and v that fits in the gap

3. add zuv to x{u,v}extra: preprocess instance to finish incomplete jumps

25/33

Page 253: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Extension: Contig Jumps

Mean read length: 70bp

Mean insert size: 472bp

140

262

149

119

336

116

397

72

1

75

14

5

100

1

21

16

10

37

45

1

22

1

7

61

78

69

1

15

51

40

65

2

61

26/33

Page 254: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Extension: Contig Jumps

Mean read length: 70bp

Mean insert size: 472bp

140

262

149

119

336

116

397

72

1

75

14

5

100

1

21

16

10

37

45

1

22

1

7

61

78

69

1

15

51

40

65

2

61

26/33

Page 255: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Extension: Contig Jumps

Mean read length: 70bp

Mean insert size: 472bp

140

262

149

119

336

116

397

72

1

75

14

5

100

1

21

16

10

37

45

1

22

1

7

61

78

69

1

15

51

40

65

2

61

26/33

Page 256: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

ILP Extension: Multiplicities

GGTGCGAGAGAGGTCATGGATTGCAACGA

GGTGCGAGAGGCCACTCCAATTGCAACGA

×2

×1

×1

×2

27/33

Page 257: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

ILP Extension: Multiplicities

GGTGCGAGAGAGGTCATGGATTGCAACGA

GGTGCGAGAGGCCACTCCAATTGCAACGA

×2

×1

×1

×2

27/33

Page 258: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

ILP Extension: Multiplicities

GGTGCGAGAGAGGTCATGGATTGCAACGA

GGTGCGAGAGGCCACTCCAATTGCAACGA

×2

×1

×1

×2

27/33

Page 259: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

ILP Extension: Multiplicities

70

5

47

67

70

135

49

3931

5

49

56

80

80

34

6

11

2489

19

25

83

73

82

71

9

14

63

72

64

28

29

33

66

33

68

94

62

62

11

47

18

67

8

6

74

59

12

34

46

88

42

79

77

13

17

26

24

70

65

45

33

73

67

23

71

108

67

36

13

55

5

38

72

51

71

5

28

73

31

71

59

80

54

84

28

29

71

151

24

5

6

33

68

53

25

×2

28/33

Page 260: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

ILP Extension: Multiplicities

70

5

47

67

70

135

49

3931

5

49

56

80

80

34

6

11

2489

19

25

83

73

82

71

9

14

63

72

64

28

29

33

66

33

68

94

62

62

11

47

18

67

8

6

74

59

12

34

46

88

42

79

77

13

17

26

24

70

65

45

33

73

67

23

71

108

67

36

13

55

5

38

72

51

71

5

28

73

31

71

59

80

54

84

28

29

71

151

24

5

6

33

68

53

25

×2

28/33

Page 261: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Integer Linear Program Formulation

s tc

tp- chromosomes = disjoint s-{tp, tc}-paths- bin. variables yuv = 1⇔ u → v used

x{u,v} = yuv + yvu+zuv + zvu

- force contigs: ∀uv∈Mxuv = 1- path preservation: ∀u 6=s,tp ,tc

∑v yvu =

∑v yuv

- path & cycle bounds:∑

v yvt{p,c} ≤ σ{p,c}- forbid cycles (row generation via callback):

∀ cycle C :∑

uv∈C(yuv−yutc ) < |C |

- objective: max∑e∈E

x{u,v} · ω(e)

- cycle consistency: ∀uyutc ≤ ysu

- jump mechanics

!!!

Multiplicities1. make yuv , x{u,v} integers in domain [0,m({u, v})]2. change callback

29/33

Page 262: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Integer Linear Program Formulation

s tc

tp- chromosomes = disjoint s-{tp, tc}-paths- int. variables yuv = `⇔ u → v used ` times

x{u,v} = yuv + yvu+zuv + zvu

- force contigs: ∀uv∈Mxuv≥1- path preservation: ∀u 6=s,tp ,tc

∑v yvu =

∑v yuv

- path & cycle bounds:∑

v yvt{p,c} ≤ σ{p,c}- forbid cycles (row generation via callback):

∀ cycle C :∑

uv∈Cyuv ≤ |C | ·mmax ·

∑u∈C ,v /∈C

yuv

- objective: max∑e∈E

x{u,v} · ω(e)

- cycle consistency: ∀uyutc ≤ ysu

- jump mechanics !!!Multiplicities1. make yuv , x{u,v} integers in domain [0,m({u, v})]2. change callback

29/33

Page 263: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Problem

no unique chromosome-configuration explaining solution

Proof

30/33

Page 264: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Problem

no unique chromosome-configuration explaining solution

Proof

30/33

Page 265: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Problem

no unique chromosome-configuration explaining solution

Proof

30/33

Page 266: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

30/33

Page 267: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

3 2 2 1 3 3 5 5 71

2 1 2 1

Proof

30/33

Page 268: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇒”: contraposition; let p = ambigous path

2 2 21

11

(G ,M,m) not uniquely linearizable

30/33

Page 269: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇒”: contraposition; let p = ambigous path

2 2 21

11

(G ,M,m) not uniquely linearizable

30/33

Page 270: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇒”: contraposition; let p = ambigous path

2 2 21

11

(G ,M,m) not uniquely linearizable

30/33

Page 271: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇒”: contraposition; let p = ambigous path

2 2 21

11

(G ,M,m) not uniquely linearizable

30/33

Page 272: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇐”: let (G ,M,m) be free of ambigous paths

Reduction (does not decrease number of linearizations):

5

2

22

result is collection of alternating paths & cycles

30/33

Page 273: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇐”: let (G ,M,m) be free of ambigous paths

Reduction (does not decrease number of linearizations):

3 3 3

5

2

22

result is collection of alternating paths & cycles

30/33

Page 274: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇐”: let (G ,M,m) be free of ambigous paths

Reduction (does not decrease number of linearizations):

3

5

2

22

result is collection of alternating paths & cycles

30/33

Page 275: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇐”: let (G ,M,m) be free of ambigous paths

Reduction (does not decrease number of linearizations):

3

5

2

22

result is collection of alternating paths & cycles

30/33

Page 276: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇐”: let (G ,M,m) be free of ambigous paths

Reduction (does not decrease number of linearizations):

3

3

22

result is collection of alternating paths & cycles

30/33

Page 277: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇐”: let (G ,M,m) be free of ambigous paths

Reduction (does not decrease number of linearizations):

3

3

22

result is collection of alternating paths & cycles

30/33

Page 278: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proposals

1. decide arbitrarily

missassembly

2. isolate each ambiguity

information loss

3. cut as few ends as possible

computationally hard

4. cut as few multiplicities as possible

computationally hard

30/33

Page 279: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proposals

1. decide arbitrarily

missassembly

2. isolate each ambiguity

information loss

3. cut as few ends as possible

computationally hard

4. cut as few multiplicities as possible

computationally hard

30/33

Page 280: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguity

information loss

3. cut as few ends as possible

computationally hard

4. cut as few multiplicities as possible

computationally hard

30/33

Page 281: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguity

information loss

3. cut as few ends as possible

computationally hard

4. cut as few multiplicities as possible

computationally hard

30/33

Page 282: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguity information loss

3. cut as few ends as possible

computationally hard

4. cut as few multiplicities as possible

computationally hard

30/33

Page 283: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguity information loss

3. cut as few ends as possible

computationally hard

4. cut as few multiplicities as possible

computationally hard

30/33

Page 284: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguity information loss

3. cut as few ends as possible computationally hard

4. cut as few multiplicities as possible

computationally hard

30/33

Page 285: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguity information loss

3. cut as few ends as possible computationally hard

4. cut as few multiplicities as possible

computationally hard

30/33

Page 286: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguity information loss

3. cut as few ends as possible computationally hard

4. cut as few multiplicities as possible computationally hard

30/33

Page 287: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguity information loss

3. cut as few ends as possible computationally hard

4. cut as few multiplicities as possible computationally hard

Multiplicitiesone &

#non-matching adj. to contig

30/33

Page 288: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguity information loss

3. cut as few ends as possible computationally hard

4. cut as few multiplicities as possible computationally hard

Multiplicitiesone &

#non-matching adj. to contig

30/33

Page 289: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguity information loss

3. cut as few ends as possible computationally hard

4. cut as few multiplicities as possible computationally hard

Multiplicitiesone &

#non-matching adj. to contig

30/33

Page 290: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguity information loss

3. cut as few ends as possible computationally hard

4. cut as few multiplicities as possible computationally hard

Multiplicitiesone &

#non-matching adj. to contig

30/33

Page 291: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguity information loss

3. cut as few ends as possible computationally hard

4. cut as few multiplicities as possible computationally hard

30/33

Page 292: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguity information loss

3. cut as few ends as possible computationally hard

4. cut as few multiplicities as possible computationally hard

30/33

Page 293: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguity information loss

3. cut as few ends as possible computationally hard

4. cut as few multiplicities as possible computationally hard

30/33

Page 294: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguity information loss

3. cut as few ends as possible computationally hard

4. cut as few multiplicities as possible computationally hard

30/33

Page 295: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Linearization of Solutions

Theorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguity information loss

3. cut as few ends as possible computationally hard

4. cut as few multiplicities as possible computationally hard

30/33

Page 296: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Conclusion

What we saw

- 3-step sequencing technique:

1. produce paired-end reads

2. assemble reads to contigs

3. scaffold contigs to chromosomes using read-pairings

- computationally hard problem for dense graphs with weights 0/1

- no constant-factor approx or subexponential-time algorithm

for linear quasi trees with weights 0/1

- O(n2) time on unweighted cliques/co-bipartite/split

- O(n · σp · σc) time for constant treewidth

- 2-approximable in cliques/complete bipartite in O(n3) time

- O(√

2n

poly(n)) time exact algorithm

- ILP formulation with contig jumps & multiplicities

- Linearization problem raised by multiplicities in solution

31 /33

Page 297: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Conclusion

What we saw

- 3-step sequencing technique:

1. produce paired-end reads

2. assemble reads to contigs

3. scaffold contigs to chromosomes using read-pairings

- computationally hard problem for dense graphs with weights 0/1

- no constant-factor approx or subexponential-time algorithm

for linear quasi trees with weights 0/1

- O(n2) time on unweighted cliques/co-bipartite/split

- O(n · σp · σc) time for constant treewidth

- 2-approximable in cliques/complete bipartite in O(n3) time

- O(√

2n

poly(n)) time exact algorithm

- ILP formulation with contig jumps & multiplicities

- Linearization problem raised by multiplicities in solution

31 /33

Page 298: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Conclusion

What we saw

- 3-step sequencing technique:

1. produce paired-end reads

2. assemble reads to contigs

3. scaffold contigs to chromosomes using read-pairings

- computationally hard problem for dense graphs with weights 0/1

- no constant-factor approx or subexponential-time algorithm

for linear quasi trees with weights 0/1

- O(n2) time on unweighted cliques/co-bipartite/split

- O(n · σp · σc) time for constant treewidth

- 2-approximable in cliques/complete bipartite in O(n3) time

- O(√

2n

poly(n)) time exact algorithm

- ILP formulation with contig jumps & multiplicities

- Linearization problem raised by multiplicities in solution

31 /33

Page 299: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Conclusion

What we saw

- 3-step sequencing technique:

1. produce paired-end reads

2. assemble reads to contigs

3. scaffold contigs to chromosomes using read-pairings

- computationally hard problem for dense graphs with weights 0/1

- no constant-factor approx or subexponential-time algorithm

for linear quasi trees with weights 0/1

- O(n2) time on unweighted cliques/co-bipartite/split

- O(n · σp · σc) time for constant treewidth

- 2-approximable in cliques/complete bipartite in O(n3) time

- O(√

2n

poly(n)) time exact algorithm

- ILP formulation with contig jumps & multiplicities

- Linearization problem raised by multiplicities in solution

31 /33

Page 300: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Conclusion

What we saw

- 3-step sequencing technique:

1. produce paired-end reads

2. assemble reads to contigs

3. scaffold contigs to chromosomes using read-pairings

- computationally hard problem for dense graphs with weights 0/1

- no constant-factor approx or subexponential-time algorithm

for linear quasi trees with weights 0/1

- O(n2) time on unweighted cliques/co-bipartite/split

- O(n · σp · σc) time for constant treewidth

- 2-approximable in cliques/complete bipartite in O(n3) time

- O(√

2n

poly(n)) time exact algorithm

- ILP formulation with contig jumps & multiplicities

- Linearization problem raised by multiplicities in solution

31 /33

Page 301: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Conclusion

What we saw

- 3-step sequencing technique:

1. produce paired-end reads

2. assemble reads to contigs

3. scaffold contigs to chromosomes using read-pairings

- computationally hard problem for dense graphs with weights 0/1

- no constant-factor approx or subexponential-time algorithm

for linear quasi trees with weights 0/1

- O(n2) time on unweighted cliques/co-bipartite/split

- O(n · σp · σc) time for constant treewidth

- 2-approximable in cliques/complete bipartite in O(n3) time

- O(√

2n

poly(n)) time exact algorithm

- ILP formulation with contig jumps & multiplicities

- Linearization problem raised by multiplicities in solution

31 /33

Page 302: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Conclusion

What we saw

- 3-step sequencing technique:

1. produce paired-end reads

2. assemble reads to contigs

3. scaffold contigs to chromosomes using read-pairings

- computationally hard problem for dense graphs with weights 0/1

- no constant-factor approx or subexponential-time algorithm

for linear quasi trees with weights 0/1

- O(n2) time on unweighted cliques/co-bipartite/split

- O(n · σp · σc) time for constant treewidth

- 2-approximable in cliques/complete bipartite in O(n3) time

- O(√

2n

poly(n)) time exact algorithm

- ILP formulation with contig jumps & multiplicities

- Linearization problem raised by multiplicities in solution

31 /33

Page 303: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Conclusion

What we saw

- 3-step sequencing technique:

1. produce paired-end reads

2. assemble reads to contigs

3. scaffold contigs to chromosomes using read-pairings

- computationally hard problem for dense graphs with weights 0/1

- no constant-factor approx or subexponential-time algorithm

for linear quasi trees with weights 0/1

- O(n2) time on unweighted cliques/co-bipartite/split

- O(n · σp · σc) time for constant treewidth

- 2-approximable in cliques/complete bipartite in O(n3) time

- O(√

2n

poly(n)) time exact algorithm

- ILP formulation with contig jumps & multiplicities

- Linearization problem raised by multiplicities in solution

31 /33

Page 304: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Conclusion

Outlook

- 3rd generation sequencing: PacBio, Oxford Nanoporeproduces long reads (10-15kbp), but error-prone

correction using small reads?

- generally: multi-library scaffolding

- other sources for contig-connections (phylogenetic

information?)

- better parameters for Scaffolding and Scaffold Linearization

analyze practical instances

- approximation/heuristics for Scaffold Linearization

32/33

Page 305: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Conclusion

Outlook

- 3rd generation sequencing: PacBio, Oxford Nanoporeproduces long reads (10-15kbp), but error-prone

correction using small reads?

- generally: multi-library scaffolding

- other sources for contig-connections (phylogenetic

information?)

- better parameters for Scaffolding and Scaffold Linearization

analyze practical instances

- approximation/heuristics for Scaffold Linearization

32/33

Page 306: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Conclusion

Outlook

- 3rd generation sequencing: PacBio, Oxford Nanoporeproduces long reads (10-15kbp), but error-prone

correction using small reads?

- generally: multi-library scaffolding

- other sources for contig-connections (phylogenetic

information?)

- better parameters for Scaffolding and Scaffold Linearization

analyze practical instances

- approximation/heuristics for Scaffold Linearization

32/33

Page 307: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Conclusion

Outlook

- 3rd generation sequencing: PacBio, Oxford Nanoporeproduces long reads (10-15kbp), but error-prone

correction using small reads?

- generally: multi-library scaffolding

- other sources for contig-connections (phylogenetic

information?)

- better parameters for Scaffolding and Scaffold Linearization

analyze practical instances

- approximation/heuristics for Scaffold Linearization

32/33

Page 308: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Conclusion

Outlook

- 3rd generation sequencing: PacBio, Oxford Nanoporeproduces long reads (10-15kbp), but error-prone

correction using small reads?

- generally: multi-library scaffolding

- other sources for contig-connections (phylogenetic

information?)

- better parameters for Scaffolding and Scaffold Linearization

analyze practical instances

- approximation/heuristics for Scaffold Linearization

32/33

Page 309: Lecture: Graph-Based Genome Scaffolding - IGMigm.univ-mlv.fr/~mweller/scaf_lec.pdf · Sanger Sequencing 1.split helix & create thousands of copies 2.add polymerase & floating bases:

Fin!

33/33