david evans cs.virginia/evans

42
Class Class 26: 26: Computin Computin g g Genomes, Genomes, Genomes Genomes Computin Computin g g David Evans David Evans http://www.cs.virginia.edu/evans http://www.cs.virginia.edu/evans cs302: Theory of Computation cs302: Theory of Computation University of Virginia Computer University of Virginia Computer Science Science

Upload: scarlett-mcmillan

Post on 01-Jan-2016

33 views

Category:

Documents


1 download

DESCRIPTION

Class 26: Computing Genomes, Genomes Computing. David Evans http://www.cs.virginia.edu/evans. cs302: Theory of Computation University of Virginia Computer Science. Final Exam Sneak Preview. Handout available now - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: David Evans cs.virginia/evans

Class 26: Class 26: Computing Computing Genomes, Genomes, Genomes Genomes

ComputingComputing

David EvansDavid Evanshttp://www.cs.virginia.edu/evanshttp://www.cs.virginia.edu/evans

cs302: Theory of Computationcs302: Theory of ComputationUniversity of Virginia Computer ScienceUniversity of Virginia Computer Science

Page 2: David Evans cs.virginia/evans

2Lecture 26: Computing Genomes and Vice Versa

Final Exam Sneak Preview• Handout available now• Honor policy: you may discuss these problems with

others and use any resources you want until the Final• No notes or other resources may be used during the

final• Intent is to give you an idea what to expect on the final

and a chance to start thinking about some problems– Don’t attempt to memorize answers: need to understand

things since the actual questions may be different

Page 3: David Evans cs.virginia/evans

3Lecture 26: Computing Genomes and Vice Versa

Menu

• Computing Genomes (PS6, Problem 6)– Crash course in biology

• Busy Beaver result! • Computing with Genomes• Conclusion

Page 4: David Evans cs.virginia/evans

4Lecture 26: Computing Genomes and Vice Versa

Genome Assembly Problem

In order to assemble a genome, it is necessary to combine snippets from many reads into a single sequence. The input is a set of n genome snippets, each of which is a string of up to k symbols. The output is the smallest single string that contains all of the input snippets as substrings.

Page 5: David Evans cs.virginia/evans

5Lecture 26: Computing Genomes and Vice Versa

DNA

• Sequence of nucleotides: adenine (A), guanine (G), cytosine (C), and thymine (T)

• Two strands, A must attach to T and G must attach to C

A

G

T

CC

Page 6: David Evans cs.virginia/evans

6Lecture 26: Computing Genomes and Vice Versa

Central Dogma of Biology

• RNA makes copies of DNA segments• RNA describes sequences of amino acids• Chains of amino acids make proteins• Proteins make us

DNA

Transcription

RNA

Translation

ProteinImage from http://www.umich.edu/~protein/

Page 7: David Evans cs.virginia/evans

7Lecture 26: Computing Genomes and Vice Versa

Human Genome• 3 Billion Base Pairs

– Each nucleotide is 2 bits (4 possibilities)– 3 B pairs * 1 byte/4 pairs = 750 MB

• Every sequence of 3 base pairs one of 20 amino acids (or stop codon)– 21 possible codons, but 43 = 64 possible – So, really only 750MB * (21/64) ~ 250 MB

• Much of it (> 95%) is may be junk (doesn’t encode proteins, but some might be important)

1 CD ~ 650 MB

Page 8: David Evans cs.virginia/evans

8Lecture 26: Computing Genomes and Vice Versa

Human Genome Race

• UVa CLAS 1970• Yale PhD• Tenured Professor at U.

Michigan

Francis Collins(Director of public National Center for Human Genome Research)(Picture from UVa Graduation 2001)

Craig Venter(President of Celera Genomics)

vs.

• San Mateo College• Court-martialed• Denied tenure at SUNY

Buffalo

Page 9: David Evans cs.virginia/evans

9Lecture 26: Computing Genomes and Vice Versa

Reading the Genome

Whitehead Institute, MIT

Page 10: David Evans cs.virginia/evans

10Lecture 26: Computing Genomes and Vice Versa

Gene Reading Machines

• One read: about 700 base pairs• But…don’t know where they are on the

chromosome

AGGCATACCAGAATACCCGTGATCCAGAATAAGCActualGenome

ACCAGAATACCRead 1

TCCAGAATAARead 2

TACCCGTGATCCARead 3

Page 11: David Evans cs.virginia/evans

11Lecture 26: Computing Genomes and Vice Versa

Genome Assembly Problem

Input: Genome fragments (but without knowing where they are from)Ouput: The full genome

ACCAGAATACCRead 1

TCCAGAATAARead 2

TACCCGTGATCCARead 3

Page 12: David Evans cs.virginia/evans

12Lecture 26: Computing Genomes and Vice Versa

Decision Problem

GA = { < { x1, x2, …, xn }, m > | where each xi is a string and there is a string X of length m that includes all of the xi stringsas substrings }

If we had a decider for GA, can we find the length of the shortest common superstring?

Yes. Try all possible m values from 1, 2, …, Σ|xi|.

Page 13: David Evans cs.virginia/evans

13Lecture 26: Computing Genomes and Vice Versa

Is GA In Class NP?

GA = { < { x1, x2, …, xn }, m > | where each xi is a string and there is a string X of length m that includes all of the xi stringsas substrings }

Yes. The string X is a polynomial-verifiable certificate.- Check it includes each substring- Check its length is m

Page 14: David Evans cs.virginia/evans

14Lecture 26: Computing Genomes and Vice Versa

Is GA NP-Complete?

GA = { < { x1, x2, …, xn }, m > | where each xi is a string and there is a string X of length m that includes all of the xi stringsas substrings }

Page 15: David Evans cs.virginia/evans

15Lecture 26: Computing Genomes and Vice Versa

NP-CompleteA language B is in NP-complete if:

2. There is a polynomial-time reduction from every problem A NP to B.

1. B NP

B

NPB

NP

?

Page 16: David Evans cs.virginia/evans

16Lecture 26: Computing Genomes and Vice Versa

To Prove NP-Hardness

• Pick some known NP-Complete problem X.• Show that a polynomial-time solver for Y

could be used to build a polynomial-time solver for X.

• This proves that there is no polynomial-time solver for Y (unless P = NP).

Page 17: David Evans cs.virginia/evans

17Lecture 26: Computing Genomes and Vice Versa

Possible Choices…

HAMPATH

A

B

C

D

E

3SAT

(a b c) (a b c) …

SUBSET-SUM

{<{3, 5, 12, 13, 17}, 30}

By definition, all must work. Every NP-Complete problem can be reduced to every NP-Complete problem.

In practice, some will work much more easily than others. Try to pick a problem “close” to the target problem.

Page 18: David Evans cs.virginia/evans

18Lecture 26: Computing Genomes and Vice Versa

Busy Beaver ChallengeRuixin Yang

Page 19: David Evans cs.virginia/evans

19Lecture 26: Computing Genomes and Vice Versa

Doubly-Infinite Tape (Regular BB)

Page 20: David Evans cs.virginia/evans

20Lecture 26: Computing Genomes and Vice Versa

One-Way Infinite Tape (Challenge)

Page 21: David Evans cs.virginia/evans

21Lecture 26: Computing Genomes and Vice Versa

Winning (3, 2) Machine

Page 22: David Evans cs.virginia/evans

22Lecture 26: Computing Genomes and Vice Versa

Code

BusyBeaver32.cpp

BusyBeaver.cpp

Page 23: David Evans cs.virginia/evans

23Lecture 26: Computing Genomes and Vice Versa

Is GA NP-Complete?

GA = { < { x1, x2, …, xn }, m > | where each xi is a string and there is a string X of length m that includes all of the xi stringsas substrings }

Page 24: David Evans cs.virginia/evans

24Lecture 26: Computing Genomes and Vice Versa

Reducing HAMPATH to GA

• Take an arbitrary input to HAMPATH:HAMPATH = { G, s, t | G represents a graph, and

there is a path between s and t in G that includes every node in G exactly once }

• Construct a corresponding input to GA such that the input is in GA if and only if the original input is in HAMPATH.

So, we need to map the nodes and edges of G into thesubstrings input to GA.

Page 25: David Evans cs.virginia/evans

25Lecture 26: Computing Genomes and Vice Versa

Consider Each Node

Incoming Edges(x, a)

aOutgoing Edges(a, y)

In a Hamiltonian path, for each node (except start and end), exactly one incoming edge, and one outgoing edge must be used.

Page 26: David Evans cs.virginia/evans

26Lecture 26: Computing Genomes and Vice Versa

Consider Each Node

Incoming Edges(x, a)

aOutgoing Edges(a, y)

We need GA substrings to represent each edge, butsuch that only 1 can be actually followed. But, the GA superstring needs to include all substrings.

Idea: make it so the untaken edges can loop back!

Page 27: David Evans cs.virginia/evans

27Lecture 26: Computing Genomes and Vice Versa

Simple Nodes

If there is only one edge (a, b) out of a given node, thatedge must be used in the path. Add the substring: ab

a

b

Page 28: David Evans cs.virginia/evans

28Lecture 26: Computing Genomes and Vice Versa

Generating the Substrings

• For each edge (a, yi) add two substrings:ayia This is the “back” edge.

yiayi+1 This connects the possible destinations.

a

b

c

aba acabac cab

Possible (length 4) alignments:aba aca bac cab

If there is a Hamiltonian path, one of these must be used.

Page 29: David Evans cs.virginia/evans

29Lecture 26: Computing Genomes and Vice Versa

Start and End

A

C

D

E<G, A, E>

The start node must come first:Add string: @A (where @ is unused elsewhere)Add string: E$ (where $ is unused elsewhere)

B

Page 30: David Evans cs.virginia/evans

30Lecture 26: Computing Genomes and Vice Versa

What is m?

A

C

D

E

<G, A, D>

{ @A, ABA, BAC, ACA, CAB, BE, ED, DA, CBC, CDC, BCD, DBC, D$ }

B@A ACA CAB ABA BAC CBC BCD CDC DCB BE ED D$

m = Start and end = 2 + one-edge nodes * 1 + multi-edge nodes * 2 for each edge + 2 for align

Page 31: David Evans cs.virginia/evans

31Lecture 26: Computing Genomes and Vice Versa

Human Genome• 3 Billion base pairs• 600-700 bases per read• ~8X coverage required• So, n 37 Million sequence fragments• Celera used 27.2 Million reads (but could get

more than 700 bases per read)

How can we solve an NP-Complete problem for such large n?

Page 32: David Evans cs.virginia/evans

32Lecture 26: Computing Genomes and Vice Versa

Approaches

• Human Genome Project (Collins)– Start by producing a genome map (using biology,

chemistry, etc) to have a framework for knowing where the fragments should go

• Celera Solution (Venter)– Approximate: we can’t guarantee finding the

shortest possible, but we can develop clever algorithms that get close most of the time

Page 33: David Evans cs.virginia/evans

33Lecture 26: Computing Genomes and Vice Versa

Result: Draw

President Clinton announces Human Genome Sequenceessentially complete (with Venter and Collins), June 26, 2000

But, Human Genome Project mostly adopted Venter’s approach.

Page 34: David Evans cs.virginia/evans

34Lecture 26: Computing Genomes and Vice Versa

Genomes Computing

Page 35: David Evans cs.virginia/evans

35Lecture 26: Computing Genomes and Vice Versa

Solving HAMPATH with DNA• Make up a two random 4-nucleotide sequences for

each node:A: A1 = ACTT A2 = gcagB: B1 = TCGG B2 = actg

C: C1 = GGCT C2 = atgt

D: D1 = GATC D2 = tcca

• If there is a link between two cities (XY), create a nucleotide sequence: X2Y1 AB gcagTCGGBA actgACTT

Based on Fred Hapgood’s notes on Adelman’s talk

Page 36: David Evans cs.virginia/evans

36Lecture 26: Computing Genomes and Vice Versa

Encoding The Problem• Each city nucleotide sequence binds with its

complement (A T, G C) :A: A1 = ACTT A2 = gcag

A’: TGAA cgtcB: TCGGactg

B’: AGCCtgacC: GGCTatgt C’ = CCGAtacaD: GATCtcca D’ = CTAGaggt

• Mix up all the link and complement DNA strands – they will bind to show a path!

Page 37: David Evans cs.virginia/evans

37Lecture 26: Computing Genomes and Vice Versa

Path Binding

A

B

C

D

ACTTgcag

TCGGactg

GATCtcca

GGCTatgt

A’TGAAcgtc

gcagGGCTAB

B’CCGAtaca

atgtTCGG BC

C’AGCCtgac

D’CTAGaggt

actgGATC CD

Page 38: David Evans cs.virginia/evans

38Lecture 26: Computing Genomes and Vice Versa

Getting the Solution• Shake up all the DNA to get it to bind• Extract DNA strands starting with A and ending

with D – Can do this with chemical binding on start/end tags:

remove all strands that do not start with A, and then remove all strands that do not end with D

• Weigh remaining strands to find ones with the right weight (7 * 8 nucleotides)

• Select one of these and read its sequence

Page 39: David Evans cs.virginia/evans

39Lecture 26: Computing Genomes and Vice Versa

Is Church-Turning Wrong?• Time to solve problem with DNA computer

doesn’t scale with input size– Can shake up any amount of DNA in the same

amount of time!• Can DNA computers solve undecidable

problems?• Is TM model robust enough for P to be the same

for DNA computer?

No (at least not like this). Can simulate everything with TM.

No: DNA computer can solve NP-Hard problems in constant time! Volume of DNA needed grows exponentially with input size.

Page 40: David Evans cs.virginia/evans

40Lecture 26: Computing Genomes and Vice Versa

DNA-Enhanced PC

To solve HAMPATH for 45 vertices, you need ~20M gallons

Page 41: David Evans cs.virginia/evans

41Lecture 26: Computing Genomes and Vice Versa

Where to Go From Here• Talks today and tomorrow (both in new Library)

– 4:00pm today: Curtis Wong– 3:00 tomorrow: Bill Wulf

• Security and Theory Lunch Groups– See handout for links

• Courses:– CS660: Graduate Theory of Computation– CS432, MATH450, PHIL233, Cryptography

Page 42: David Evans cs.virginia/evans

42Lecture 26: Computing Genomes and Vice Versa

Thanks!