
Post on 19-Dec-2015


1

THE RNA DETECTIVE GAME: FINDING RNA CHAINS FROM FRAGMENTS

Fred Roberts, Rutgers University

RNA Detective

2

Deoxyribonucleic acid, DNA, is the basic building block of inheritance.

DNA can be thought of as a chain consisting of bases. Each base is one of four possible chemicals:

Thymine (T), Cytosine (C), Adenine (A), Guanine (G)

DNA and RNA

3

Some DNA chains: GGATCCTGG, TTCGCAAAAAGAATC

Real DNA chains are long:

Algae (P. salina): 6.6 x 10^5 bases long

Slime mold (D. discoideum): 5.4 x 10^7 bases long

DNA and RNA

4

Insect (D. melanogaster – fruit fly): 1.4 x 10^8 bases long

Bird (G. domesticus): 1.2 x 10^9 bases long

DNA and RNA

5

Human (H. sapiens): 3.3 x 10^9 bases long

The sequence of bases in DNA encodes certain genetic information.

In particular, it determines long chains of amino acids known as proteins.

DNA and RNA

6

How many possible DNA chains are there in humans?

DNA and RNA

7

Aside: Counting

Fundamental methods of combinatorics are important in mathematical biology.

8

The Product Rule

How many sequences of 0’s and 1’s are there of length 2?

There are 2 ways to choose the first digit and, no matter how we choose the first digit, there are 2 ways to choose the second digit.

Thus, there are 2 x 2 = 2^2 = 4 ways to choose the sequence.

00, 01, 10, 11

How many sequences are there of length 3?

By similar reasoning: 2 x 2 x 2 = 2^3 = 8.

9

The Product Rule

Is this interesting?

10

The Product Rule

Boring!

11

The Product Rule

Really boring!

12

The Product Rule

Counting may be boring at times, but we will see that it can be really powerful.

13

The Product Rule

Product Rule: If something can happen in n1 ways and, no matter how the first thing happens, a second thing can happen in n2 ways, then the two things together can happen in n1 x n2 ways.

More generally, if something can happen in n1 ways and, no matter how the first thing happens, a second thing can happen in n2 ways, and, no matter how the first two things happen, a third thing can happen in n3 ways, …, then all the things together can happen in n1 x n2 x n3 x … ways.
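The product rule can be checked by brute force on small cases. The sketch below lists every sequence over an alphabet and compares the count with the product of the number of choices at each position. The helper name `count_sequences` is ours and does not come from the slides.

```python
from itertools import product

def count_sequences(alphabet, length):
    """List every sequence of the given length over the alphabet."""
    listed = ["".join(s) for s in product(alphabet, repeat=length)]
    # Product rule: len(alphabet) choices at each of `length` positions.
    assert len(listed) == len(alphabet) ** length
    return listed

print(count_sequences("01", 2))        # ['00', '01', '10', '11']
print(len(count_sequences("01", 3)))   # 8
```

Listing only works for short sequences; the point of the rule is that the count is available without listing anything.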

14

How many possible DNA chains are there in humans?

How many DNA chains are there with two bases?

Answer (Product Rule): 4 x 4 = 4^2 = 16.

There are 4 choices for the first base and, for each such choice, 4 choices for the second base.

How many with 3 bases?

How many with n bases?

DNA and RNA

15

How many with 3 bases? 4^3 = 64

How many with n bases? 4^n

How many human DNA chains are possible?

4^(3.3 x 10^9)

This is greater than 10^(1.98 x 10^9) (1 followed by 1.98 billion zeroes!)
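The size of this number can be estimated with logarithms rather than by computing the power itself. A minimal sketch, assuming the chain length of 3.3 x 10^9 quoted earlier; the variable names are ours.

```python
import math

HUMAN_BASES = 3.3e9  # chain length quoted in the slides

# log10(4^n) = n * log10(4), so the huge power is never built explicitly.
exponent = HUMAN_BASES * math.log10(4)

# The exponent comes out a little above 1.98e9.
print(f"4^(3.3e9) is about 10^{exponent:.4g}")
```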

DNA and RNA

16

RNA is a “messenger molecule” whose links are defined from DNA.

An RNA chain has at each link one of four bases. The possible bases are the same as those in DNA except

that the base Uracil (U) replaces the base Thymine (T).

DNA and RNA

17

Sample RNA chains: GGCAUUGGA, UAUAUGCGGCUUC

RNA chains are very long. Can we discover what they look like without actually observing them?

Trick: Use enzymes.

The RNA Detective Game

18

Some enzymes break up an RNA chain into fragments after each G link.

Some enzymes break up the chain after each C or U link.

Consider the chain

CCGGUCCGAAAG

Applying the G enzyme breaks the chain into the following fragments:

G fragments: CCG, G, UCCG, AAAG

We know that these are the fragments, but we do not know the order in which they appear.
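The action of an enzyme is easy to simulate. The sketch below assumes a chain is a plain string over A, C, G, U; the function name `digest` and its `cut_after` parameter are illustrative, not part of the slides. Unlike the laboratory experiment, the simulation returns the fragments in chain order, which is exactly the information the detective lacks.

```python
def digest(chain, cut_after):
    """Split chain after every base listed in cut_after; keep chain order."""
    fragments, current = [], ""
    for base in chain:
        current += base
        if base in cut_after:
            fragments.append(current)
            current = ""
    if current:  # trailing piece that does not end in a cut base
        fragments.append(current)
    return fragments

print(digest("CCGGUCCGAAAG", "G"))   # ['CCG', 'G', 'UCCG', 'AAAG']
```

Passing `cut_after="UC"` models the enzyme that breaks the chain after each C or U link.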

How many possible chains have these four fragments?

The RNA Detective Game

19

Chain: CCGGUCCGAAAG

G fragments: CCG, G, UCCG, AAAG

Product rule again: 4 choices for the first fragment, for each such choice 3 choices for the second fragment, …

There are 4 x 3 x 2 x 1 = 4! = 24 possible chains, one chain corresponding to each permutation of these four fragments.

One such chain different from the original:

UCCGGCCGAAAG

The RNA Detective Game

20

Chain: CCGGUCCGAAAG

Suppose we instead apply the U,C enzyme. We get the following fragments:

U,C fragments: C, C, GGU, C, C, GAAAG

How many chains are there with these fragments? Is 6! = 720 the correct answer?

Consider two of the permutations: one takes the fragments in the order given; the other swaps the first two fragments and leaves the rest in place.

They give rise to the same chain.

The RNA Detective Game

21

So 6! is wrong. What is the answer?

The RNA Detective Game

What if the fragments were

C, C, C, C, C

There are 5! permutations of these fragments, but only one RNA chain with these fragments:

CCCCC

22

Aside: More Counting

23

Putting n distinguishable balls into k distinguishable boxes:

Multinomial Coefficients

The number of ways to put n1 balls into the first box, n2 balls into the second box, …, nk balls into the kth box is denoted by C(n;n1,n2,…,nk), where n = n1 + n2 + … + nk.

24

Theorem: C(n;n1,n2,…,nk) = n!/(n1! n2! … nk!)

Example: How many RNA chains of length 6 have 3 C’s and 3 A’s?

Think of 2 boxes, a C box and an A box. How many ways are there to put 3 positions (balls) into the C box and 3 into the A box?

Answer: C(6;3,3) = 6!/(3! 3!) = 20.

Some of these are: CACACA, ACACAC, AAACCC.
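The theorem translates directly into code. The sketch below defines a hypothetical helper `multinomial` (our name) and cross-checks the answer 20 by listing the distinct arrangements of the letters CCCAAA.

```python
from math import factorial
from itertools import permutations

def multinomial(*counts):
    """C(n; n1, ..., nk) = n! / (n1! n2! ... nk!), with n = n1 + ... + nk."""
    result = factorial(sum(counts))
    for c in counts:
        result //= factorial(c)  # each division is exact
    return result

print(multinomial(3, 3))  # 20

# Brute-force check: distinct orderings of three C's and three A's.
chains = {"".join(p) for p in permutations("CCCAAA")}
print(len(chains))        # 20
```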

Multinomial Coefficients

25

If a 6-link RNA chain is chosen at random, what is the probability of obtaining one with 3 C’s and 3 A’s?

Answer: There are 4^6 = 4096 possible RNA chains of length 6. The probability is therefore

C(6;3,3)/4^6 = 20/4096 ≈ 0.005.
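Since there are only 4096 chains of length 6, the probability can also be confirmed by checking every one of them. A small sketch, assuming each chain is equally likely; the variable names are ours.

```python
from fractions import Fraction
from itertools import product

total = 4 ** 6
# Count the chains with exactly three C's and three A's.
favorable = sum(
    1
    for s in product("ACGU", repeat=6)
    if s.count("C") == 3 and s.count("A") == 3
)

p = Fraction(favorable, total)
print(favorable, total, float(p))  # 20 4096 0.0048828125
```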

Multinomial Coefficients

26

The number of 10-link RNA chains consisting of 3 A’s, 2 C’s, 2 U’s, and 3 G’s is

C(10;3,2,2,3) = 25,200

What if we know they end in AAG?

Then, only the first 7 positions need to be filled, and 2 A’s and one G are already used up. Hence, the answer is

C(7;1,2,2,2) = 630

Notice how knowing the end of a chain can dramatically reduce the number of possible chains.

Multinomial Coefficients

27

Returning to the RNA Detective Game

28

Recall that we have the following U,C fragments:

C, C, GGU, C, C, GAAAG

The number of RNA chains with these fragments is not 6! = 720.

Think of having 6 positions (there are 6 fragments) and assigning 4 positions to the C box, 1 to the GGU box, and one to the GAAAG box.

Then the number of ways of doing this is

C(6;4,1,1) = 6!/(4! 1! 1!) = 30

The RNA Detective Game

29

U,C fragments: C, C, GGU, C, C, GAAAG

Actually, this computation is still a bit off, though not because the combinatorial argument is wrong.

Notice that the fragment GAAAG does not end in U or C. Thus, we know it comes last.

There are 5 remaining U,C fragments. The number of chains beginning with these 5 fragments is given by

C(5;4,1) = 5

Beginning of the chains: CCCCGGU, CCCGGUC, CCGGUCC, CGGUCCC, GGUCCCC

The RNA Detective Game

30

We get all chains with the given U,C fragments by adding GAAAG to the end of each of these:

CCCCGGUGAAAG, CCCGGUCGAAAG, CCGGUCCGAAAG, CGGUCCCGAAAG, GGUCCCCGAAAG
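Both counts, 30 arrangements of the fragments and 5 actual chains, can be reproduced by brute force. The sketch below is one way to do it, not the method of the slides: it forms every distinct ordering of the U,C fragments and keeps those whose simulated U,C digest gives back the same fragments. The helper `digest` is a hypothetical name.

```python
from itertools import permutations

def digest(chain, cut_after):
    """Split chain after every base listed in cut_after."""
    fragments, current = [], ""
    for base in chain:
        current += base
        if base in cut_after:
            fragments.append(current)
            current = ""
    if current:
        fragments.append(current)
    return fragments

uc_fragments = ["C", "C", "GGU", "C", "C", "GAAAG"]

# Distinct orderings of the six fragments: C(6;4,1,1).
arrangements = set(permutations(uc_fragments))
print(len(arrangements))  # 30

# Keep only orderings that really have these U,C fragments.
valid = sorted(
    "".join(a)
    for a in arrangements
    if sorted(digest("".join(a), "UC")) == sorted(uc_fragments)
)
print(len(valid))  # 5
```

The 25 orderings that are discarded are those in which GAAAG is not last: it then fuses with whatever follows it into a longer U,C fragment.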

The RNA Detective Game

31

Thus, there are 24 possible chains with the given G fragments and 5 with the given U,C fragments.

But: We have not yet combined our knowledge of both G and U,C fragments.

G fragments: CCG, G, UCCG, AAAG

U,C fragments: C, C, GGU, C, C, GAAAG

Which of the 5 chains with these U,C fragments has the right G fragments?

The RNA Detective Game

32

G fragments: CCG, G, UCCG, AAAG

U,C fragments: C, C, GGU, C, C, GAAAG

Which of the 5 chains with these U,C fragments has the right G fragments?

CCCCGGUGAAAG, CCCGGUCGAAAG, CCGGUCCGAAAG, CGGUCCCGAAAG, GGUCCCCGAAAG

CCCCGGUGAAAG does not: It has CCCCG as a G fragment.

What about the others?

The RNA Detective Game

33

Checking the remaining 4 possible RNA chains with the given U,C fragments shows that only the third one,

CCGGUCCGAAAG

has the given G fragments. Hence, we have recovered the initial chain.

This is an example of recovery of an RNA chain given a complete digest by enzymes.
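The whole recovery can be sketched as a brute-force search: try every ordering of the U,C fragments and keep the chains whose two simulated digests match the two given fragment lists. This is only practical for a handful of fragments, and the function names `digest` and `recover` are ours.

```python
from itertools import permutations

def digest(chain, cut_after):
    """Split chain after every base listed in cut_after."""
    fragments, current = [], ""
    for base in chain:
        current += base
        if base in cut_after:
            fragments.append(current)
            current = ""
    if current:
        fragments.append(current)
    return fragments

def recover(g_fragments, uc_fragments):
    """Return all chains consistent with both complete digests."""
    candidates = {"".join(p) for p in permutations(uc_fragments)}
    return sorted(
        c
        for c in candidates
        if sorted(digest(c, "G")) == sorted(g_fragments)
        and sorted(digest(c, "UC")) == sorted(uc_fragments)
    )

print(recover(["CCG", "G", "UCCG", "AAAG"],
              ["C", "C", "GGU", "C", "C", "GAAAG"]))  # ['CCGGUCCGAAAG']
```

If the returned list had more than one entry, the two digests would not determine the chain uniquely.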

How remarkable is it that we could recover the initial RNA chain this way?

The RNA Detective Game

34

CCGGUCCGAAAG

How many RNA chains are there with the same bases as this chain?

There are 12 bases: 4 C’s, 4 G’s, 3 A’s, and 1 U.

The number of chains with these bases is given by C(12;4,4,3,1) = 138,600

Thus, knowing how many times each base occurs is not nearly as useful as knowing the fragments.

The RNA Detective Game

35

Another example.

G fragments: UG, ACG, AC

U,C fragments: U, GAC, GAC

Step 1: Does any fragment have to come last?

The RNA Detective Game

36

G fragments: UG, ACG, AC

U,C fragments: U, GAC, GAC

Step 1: Does any fragment have to come last?

None of the U,C fragments has to come last. However, the G fragment AC has to come last.

Thus, the other two G fragments come first in some order and there are only two possible RNA chains with these G fragments: UGACGAC, ACGUGAC

The RNA Detective Game

37

G fragments: UG, ACG, AC

U,C fragments: U, GAC, GAC

There are only two possible RNA chains with these G fragments: UGACGAC, ACGUGAC

The latter has AC as a U,C fragment. So, the former is the correct chain.

The RNA Detective Game

38

Is it always possible to completely recover the original RNA chain given its G fragments and U,C fragments?

The RNA Detective Game

RNA

39

Is it always possible to completely recover the original RNA chain given its G fragments and U,C fragments?

No: sometimes the solution is ambiguous.

Exercise: Find two RNA chains with the same G and U,C fragments.

The RNA Detective Game

40

Surprisingly, eulerian paths in multidigraphs can be used to help with the RNA detective game.

When a digraph is allowed to have more than one arc from vertex x to vertex y, we call it a multidigraph.

A path in a multidigraph is called eulerian if it uses every arc once and only once. (Recall the Konigsberg Bridge Problem.)

A closed path (one that ends where it starts) is eulerian if it is eulerian as a path.

Eulerian Paths

41

Eulerian Paths

[Figure: multidigraph on vertices a, b, c, d, e]

eulerian closed path: a, b, c, d, b, e, a

42

Eulerian Paths

[Figure: multidigraph on vertices a, b, c, d, e]

eulerian path: a, b, c, d, b, e

43

When does a multidigraph have an eulerian path or closed path?

Theorem (I.J. Good, 1946): A connected multidigraph has an eulerian closed path iff for every vertex, the indegree (number of incoming arcs) equals the outdegree (number of outgoing arcs).

Theorem (I.J. Good, 1946): A connected multidigraph has an eulerian path iff indegree equals outdegree at every vertex, with the possible exception of two vertices: at one of these, outdegree exceeds indegree by one (the path starts there), and at the other, indegree exceeds outdegree by one (the path ends there).
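The degree conditions in these two theorems are simple to test mechanically. The sketch below assumes the multidigraph is given as a list of (tail, head) arcs and is already known to be connected; connectivity is not checked. The function name `euler_status` and its return strings are ours.

```python
from collections import Counter

def euler_status(arcs):
    """Apply Good's degree conditions to a connected multidigraph."""
    balance = Counter()  # outdegree minus indegree at each vertex
    for tail, head in arcs:
        balance[tail] += 1
        balance[head] -= 1
    off = sorted(b for b in balance.values() if b != 0)
    if not off:
        return "closed path"  # indegree equals outdegree everywhere
    if off == [-1, 1]:
        return "path"         # one start vertex and one end vertex
    return "none"

# Arcs read off the closed path a, b, c, d, b, e, a from the earlier slide.
closed = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "b"), ("b", "e"), ("e", "a")]
print(euler_status(closed))       # closed path
print(euler_status(closed[:-1]))  # path
```

A loop contributes +1 and -1 to the same vertex, so it leaves every balance unchanged.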

Eulerian Paths

44

Eulerian Paths

[Figure: example multidigraph(s); only the vertex labels a, b, c, d survive in this transcript]

45

Note that these theorems hold if there are loops from a vertex to itself.

A loop adds 1 to indegree and 1 to outdegree. Thus, loops do not affect the existence of eulerian paths or closed paths.

Eulerian Paths

46

Assume that there are at least two G fragments and at least two U,C fragments. Otherwise, we can recover the original chain directly: a single G fragment (or a single U,C fragment) is the entire chain.

Example:

G fragments: CCG, G, UCACG, AAAG, AA

U,C fragments: C, C, GGU, C, AC, GAAAGAA

Eulerian Paths and the RNA Detective Game

47

G fragments: CCG, G, UCACG, AAAG, AA

U,C fragments: C, C, GGU, C, AC, GAAAGAA

Step 1: Break down each fragment after each G, U, or C.

E.g. (x marks each break):

GAAAGAA becomes GxAAAGxAA

GGU becomes GxGxU

UCACG becomes UxCxACxG

Each piece is called an extended base. All extended bases in a fragment except the first and last are called interior extended bases.
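Step 1 amounts to cutting a fragment after every G, U, or C. A minimal sketch, with `extended_bases` as a hypothetical helper name:

```python
def extended_bases(fragment):
    """Break a fragment after each G, U, or C and return the pieces."""
    pieces, current = [], ""
    for base in fragment:
        current += base
        if base in "GUC":
            pieces.append(current)
            current = ""
    if current:  # trailing run of A's, possible only at the end of the chain
        pieces.append(current)
    return pieces

print(extended_bases("GAAAGAA"))  # ['G', 'AAAG', 'AA']
print(extended_bases("GGU"))      # ['G', 'G', 'U']
print(extended_bases("UCACG"))    # ['U', 'C', 'AC', 'G']
```

The interior extended bases of a fragment are then `extended_bases(fragment)[1:-1]`.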

Eulerian Paths and the RNA Detective Game

48

G fragments: CCG, G, UCACG, AAAG, AA

U,C fragments: C, C, GGU, C, AC, GAAAGAA

Step 2: Use the extended base breakup of fragments to find the beginning and end of the RNA chain. Start by making two lists:

All interior extended bases of all fragments: C, C, AC, G, AAAG

Fragments with one extended base: G, AAAG, AA, C, C, C, AC

Eulerian Paths and the RNA Detective Game

49

All interior extended bases of all fragments: C, C, AC, G, AAAG

Fragments with one extended base: G, AAAG, AA, C, C, C, AC

Theorem: Every entry on the first list is on the second list. There are always exactly two entries on the second list not on the first. One of these is the first extended base of the entire RNA chain and the other is the last.

Thus: chain begins in AA or C and ends in AA or C. How do you tell how it ends?

Eulerian Paths and the RNA Detective Game

50

Thus: chain begins in AA or C and ends in AA or C. How do you tell how it ends?

One of these must be from an abnormal fragment: a G fragment that doesn’t end in G or a U,C fragment that doesn’t end in U or C.

G fragments: CCG, G, UCACG, AAAG, AA

U,C fragments: C, C, GGU, C, AC, GAAAGAA

AA is such an abnormal fragment.

An abnormal fragment marks the end of the chain.

So: chain ends in AA and begins in C.
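Step 2 can be sketched with multisets. The code below builds the two lists, subtracts the first from the second, and uses the abnormal fragments to decide which leftover entry ends the chain. All names are ours, and the sketch assumes the fragment lists come from complete digests of a single chain.

```python
from collections import Counter

def extended_bases(fragment):
    """Break a fragment after each G, U, or C."""
    pieces, current = [], ""
    for base in fragment:
        current += base
        if base in "GUC":
            pieces.append(current)
            current = ""
    if current:
        pieces.append(current)
    return pieces

g_fragments = ["CCG", "G", "UCACG", "AAAG", "AA"]
uc_fragments = ["C", "C", "GGU", "C", "AC", "GAAAGAA"]

interior, singles = Counter(), Counter()
for frag in g_fragments + uc_fragments:
    pieces = extended_bases(frag)
    if len(pieces) == 1:
        singles[frag] += 1               # fragment with one extended base
    else:
        interior.update(pieces[1:-1])    # interior extended bases

ends = singles - interior                # multiset difference
print(sorted(ends.elements()))           # ['AA', 'C']

# Abnormal: a G fragment not ending in G, or a U,C fragment not ending in U or C.
abnormal = [f for f in g_fragments if f[-1] != "G"]
abnormal += [f for f in uc_fragments if f[-1] not in "UC"]

last = next(e for e in ends if e in abnormal)
first = next(e for e in ends if e != last)
print(first, last)                       # C AA
```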

Eulerian Paths and the RNA Detective Game

51

Step 3: Build a multidigraph. First, identify all normal fragments with more than one extended base. From each such fragment, use the first and last extended bases as vertices and draw an arc from the first to the last. Label the arc with the corresponding fragment.

G fragments: CCG, G, UCACG, AAAG, AA

U,C fragments: C, C, GGU, C, AC, GAAAGAA

Fragment UCACG gives rise to vertices U and G and we include an arc from U to G labeled UCACG.

Eulerian Paths and the RNA Detective Game

52

Eulerian Paths and the RNA Detective Game

[Figure: vertices U and G, with an arc from U to G labeled UCACG]

53

G fragments: CCG, G, UCACG, AAAG, AA

U,C fragments: C, C, GGU, C, AC, GAAAGAA

Fragment CCG means that we include an arc from C to G labeled CCG.

Fragment GGU means that we include an arc from G to U labeled GGU.

Eulerian Paths and the RNA Detective Game

54

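Eulerian Paths and the RNA Detective Game

[Figure: vertices U, G, and C, with arcs from U to G labeled UCACG, from G to U labeled GGU, and from C to G labeled CCG]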

55

There might be several arcs from a given extended base to another if there are several normal fragments from the first to the second. That is why we get a multidigraph.

Step 4: We add one additional arc.

Identify the longest abnormal fragment. Include an arc from the first (and perhaps only) extended base in this fragment to the first extended base in the chain.

Label this as X*Y, where X is the longest abnormal fragment in the chain and Y is the first extended base in the chain.

Eulerian Paths and the RNA Detective Game

56

G fragments: CCG, G, UCACG, AAAG, AA

U,C fragments: C, C, GGU, C, AC, GAAAGAA

GAAAGAA is the longest abnormal fragment. Put in an arc from G (the first extended base in this fragment) to C, the first extended base in the chain.

Label the arc as GAAAGAA*C

Eulerian Paths and the RNA Detective Game

57

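Eulerian Paths and the RNA Detective Game

[Figure: vertices U, G, and C, with arcs from U to G labeled UCACG, from G to U labeled GGU, from C to G labeled CCG, and from G to C labeled GAAAGAA*C]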

58

Theorem: This multidigraph has an eulerian closed path. The RNA chains with the given G and U,C fragments correspond to eulerian closed paths that end with the special arc X*Y.

In our example, it is easy to check it has an eulerian closed path. (Use I.J. Good’s Theorem.)

Eulerian Paths and the RNA Detective Game

59

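Eulerian Paths and the RNA Detective Game

[Figure: vertices U, G, and C, with arcs from U to G labeled UCACG, from G to U labeled GGU, from C to G labeled CCG, and from G to C labeled GAAAGAA*C]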

The only eulerian closed path that ends in GAAAGAA*C goes from C to G to U to G to C.

60

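Eulerian Paths and the RNA Detective Game

[Figure: vertices U, G, and C, with arcs from U to G labeled UCACG, from G to U labeled GGU, from C to G labeled CCG, and from G to C labeled GAAAGAA*C]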

Step 5: Use the corresponding labeling of arcs to obtain the chain:

CCGGUCACGAAAGAA

It is easy to check this has the right G and U,C fragments.
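Steps 3 to 5 can be put together in a short program. The sketch below builds the arcs from the normal fragments, adds the special arc X*Y, and searches for every eulerian closed path that starts at the first extended base and ends with the special arc. It takes the first extended base from Step 2 as an input, uses plain backtracking (fine for small examples only), and all function names are ours.

```python
def extended_bases(fragment):
    """Break a fragment after each G, U, or C."""
    pieces, current = [], ""
    for base in fragment:
        current += base
        if base in "GUC":
            pieces.append(current)
            current = ""
    if current:
        pieces.append(current)
    return pieces

def reconstruct(g_fragments, uc_fragments, first_base):
    """Chains given by eulerian closed paths that end with the arc X*Y."""
    arcs, abnormal = [], []
    tagged = [(f, "G") for f in g_fragments] + [(f, "UC") for f in uc_fragments]
    for frag, cut in tagged:
        pieces = extended_bases(frag)
        if frag[-1] not in cut:
            abnormal.append(frag)                        # marks the chain's end
        elif len(pieces) > 1:
            arcs.append((pieces[0], pieces[-1], frag))   # Step 3 arc
    longest = max(abnormal, key=len)                     # X in the label X*Y
    special_tail = extended_bases(longest)[0]            # Step 4 arc starts here

    chains = set()

    def walk(vertex, remaining, chain):
        if not remaining:
            if vertex == special_tail:                   # finish with X*Y
                chains.add(chain + longest[len(special_tail):])
            return
        for i, (tail, head, label) in enumerate(remaining):
            if tail == vertex:
                # Consecutive labels overlap in the shared extended base.
                walk(head, remaining[:i] + remaining[i + 1:],
                     chain + label[len(tail):])

    walk(first_base, arcs, first_base)
    return sorted(chains)

g = ["CCG", "G", "UCACG", "AAAG", "AA"]
uc = ["C", "C", "GGU", "C", "AC", "GAAAGAA"]
print(reconstruct(g, uc, "C"))  # ['CCGGUCACGAAAGAA']
```

When the returned list has more than one chain, the two digests are ambiguous, which is the situation the earlier exercise asks about.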

61

The “fragmentation stratagem” we have described was used by R.W. Holley and his colleagues at Cornell in 1965 to determine the first nucleic acid sequence.

The method is not used anymore and was only used for a short time before other, more efficient methods were adopted.

However, it has great historical significance and illustrates an important role for mathematical methods in biology.

The RNA Detective Game: Concluding Comments

62

Nowadays, by use of radioactive marking and high-speed computer analysis, it is possible to sequence long RNA and DNA chains rather quickly.

The RNA Detective Game: Concluding Comments

63

The mathematical power of the fragmentation stratagem, nevertheless, is a good illustration of the use of methods of discrete mathematics in modern molecular biology.

The RNA Detective Game: Concluding Comments

64

And of the power of counting!

The RNA Detective Game: Concluding Comments

65

The RNA Detective Game: Enjoy it with Your Students