restriction mapping an introduction to bioinformatics algorithms (jones and pevzner)

33
Restriction Mapping Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Upload: martina-johnson

Post on 02-Jan-2016

224 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

Restriction MappingRestriction Mapping

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Page 2: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

Molecular ScissorsMolecular Scissors(restriction enzymes)(restriction enzymes)

Molecular Cell Biology, 4th edition

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

HindII (first restriction enzyme): discovered accidentally in 1970 while studying how bacterium Haemophilus influenzae takes up DNA from the virus. Recognizes and cuts DNA at sequence GAATTC

Page 3: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

Discovering Restriction EnzymesDiscovering Restriction Enzymes

Werner Arber Daniel Nathans Hamilton Smith

Werner Arber – discovered restriction enzymesDaniel Nathans - pioneered the application of restriction for the construction of genetic mapsHamilton Smith - showed that restriction enzyme cuts DNA in the middle of a specific sequence

My father has discovered a servant who serves as a pair of scissors. If a foreign king invades a bacterium, this servant can cut him in small fragments, but he does not do any harm to his own king. Clever people use the servant with the scissors to find out the secrets of the kings. For this reason my father received the Nobel Prize for the discovery of the servant with the scissors".

Daniel Nathans’ daughter (from Nobel lecture)

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Page 4: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

Recognition Sites of Restriction EnzymesRecognition Sites of Restriction Enzymes

Molecular Cell Biology, 4th edition

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Page 5: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

Uses of Restriction EnzymesUses of Restriction Enzymes

Recombinant DNA technologyRecombinant DNA technologyCloningCloningcDNA/genomic library constructioncDNA/genomic library constructionDNA mappingDNA mapping

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Page 6: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

Restriction MapsRestriction Maps• A map showing positions of restriction sites in a DNA sequence• If DNA sequence is known then construction of restriction map is a trivial exercise• In early days of molecular biology DNA sequences were often unknown• Biologists had to solve the problem of constructing restriction maps without knowing DNA sequences

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Page 7: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

Full Restriction DigestFull Restriction Digest

• Cutting DNA at each restriction site creates

multiple restriction fragments:

• Is it possible to reconstruct the order of the fragments from the sizes of the fragments {3,5,5,9} ?

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Page 8: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

Full Restriction Digest: Multiple SolutionsFull Restriction Digest: Multiple Solutions

• Alternative ordering of restriction fragments:

vs

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Page 9: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

Measuring Length of Restriction FragmentsMeasuring Length of Restriction Fragments

Restriction enzymes break DNA into restriction Restriction enzymes break DNA into restriction fragments. fragments.

Gel electrophoresisGel electrophoresis is a process for separating DNA is a process for separating DNA by size and measuring sizes of restriction fragments by size and measuring sizes of restriction fragments

Can separate DNA fragments that differ in length in Can separate DNA fragments that differ in length in only 1 nucleotide for fragments up to 500 only 1 nucleotide for fragments up to 500 nucleotides longnucleotides long

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Page 10: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

Gel ElectrophoresisGel Electrophoresis

DNA fragments are injected into a gel DNA fragments are injected into a gel positioned in an electric fieldpositioned in an electric field

DNA are negatively charged near neutral DNA are negatively charged near neutral pHpH The ribose phosphate backbone of each The ribose phosphate backbone of each

nucleotide is acidic; DNA has an overall nucleotide is acidic; DNA has an overall negative chargenegative charge

DNA molecules move towards the positive DNA molecules move towards the positive electrodeelectrode

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Page 11: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

Gel ElectrophoresisGel Electrophoresis (cont’d) (cont’d)

DNA fragments of different lengths are DNA fragments of different lengths are separated according to sizeseparated according to size Smaller molecules move through the gel Smaller molecules move through the gel

matrix more readily than larger moleculesmatrix more readily than larger molecules

The gel matrix restricts random diffusion The gel matrix restricts random diffusion so molecules of different lengths separate so molecules of different lengths separate into different bandsinto different bands

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Page 12: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

Gel Electrophoresis: ExampleGel Electrophoresis: Example

Direction

of DNA

movement

Smaller fragments

travel farther

Molecular Cell Biology, 4th edition

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Distance traveled is (roughly) inversely proportional to the logarithm of molecule size

Different sized molecules form distinct bands

Page 13: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

Detecting DNA: Detecting DNA: AutoradiographyAutoradiography

One way to visualize separated DNA One way to visualize separated DNA bands on a gel is bands on a gel is autoradiographyautoradiography::

The DNA is radioactively labeledThe DNA is radioactively labeled

The gel is laid against a sheet of The gel is laid against a sheet of photographic film in the dark, exposing photographic film in the dark, exposing the film at the positions where the DNA is the film at the positions where the DNA is present.present.

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Page 14: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

Detecting DNA: FluorescenceDetecting DNA: Fluorescence

Another way to visualize DNA bands in Another way to visualize DNA bands in gel is gel is fluorescencefluorescence::

The gel is incubated with a solution The gel is incubated with a solution containing the fluorescent dye ethidiumcontaining the fluorescent dye ethidium

Ethidium binds to the DNAEthidium binds to the DNA

The DNA lights up when the gel is The DNA lights up when the gel is exposed to ultraviolet light.exposed to ultraviolet light.

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Page 15: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

Partial Restriction DigestPartial Restriction Digest

The sample of DNA is exposed to the restriction The sample of DNA is exposed to the restriction enzyme for only a limited amount of time to enzyme for only a limited amount of time to prevent it from being cut at all restriction sitesprevent it from being cut at all restriction sites

This experiment generates the set of all This experiment generates the set of all possible restriction fragments between every possible restriction fragments between every two (not necessarily consecutive) cutstwo (not necessarily consecutive) cuts

This set of fragment sizes is used to determine This set of fragment sizes is used to determine the positions of the restriction sites in the DNA the positions of the restriction sites in the DNA sequencesequence

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Page 16: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

Partial Digest ExamplePartial Digest Example Partial Digest results in the following 10 Partial Digest results in the following 10

restriction fragments:restriction fragments:

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

L = {3, 5, 5, 8, 9, 14, 14, 17, 19, 22}

X = {0, 5, 14, 19, 22}

Page 17: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

Partial Digest Problem:Partial Digest Problem:

GoalGoal:: Given all pairwise distances Given all pairwise distances between points on a line, reconstruct between points on a line, reconstruct the positions of those pointsthe positions of those points

InputInput: The multiset of pairwise : The multiset of pairwise distances distances LL, containing C(n,2) , containing C(n,2) integersintegers

OutputOutput: A set : A set XX, of , of nn integers, such integers, such that that ΔΔXX = = LL

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Page 18: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

Note:

It is not always possible to uniquely reconstruct a set X based

only on ΔX.

For example, the sets X = {0, 2, 5} and (X + 10) = {10, 12, 15}

both produce ΔX={2, 3, 5} as their partial digest set.

The sets {0,1,2,5,7,9,12} and {0,1,5,7,8,10,12} present a less

trivial example of non-uniqueness. They both digest into:

{1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5, 6, 7, 7, 7, 8, 9, 10, 11, 12}

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Page 19: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

0 1 2 5 7 9 12

0

1 2 5 7 9 12

1

1 4 6 8 11

2

3 5 7 10

5

2 4 7

7

2 5

9

3

12

0 1 5 7 8 10 12

0

1 5 7 8 10 12

1

4 6 7 9 11

5

2 3 5 7

7

1 3 5

8

2 4

10

2

12

Homometric Sets

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Two sets A and B are homometric if A = B

A = {0,1,2,5,7,9,12} B = {0,1,5,7,8,10,12}

Page 20: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

Partial Digest: Brute ForcePartial Digest: Brute Force(exhaustive search)(exhaustive search)

1.1. Find the restriction fragment of maximum length Find the restriction fragment of maximum length MM. . MM is the length of the DNA sequence. is the length of the DNA sequence.

2.2. For every possible set For every possible set

XX={={0, 0, xx22, … ,, … ,xxnn-1-1, , M}M}

compute corresponding compute corresponding ΔΔXX (i.e., pairwise distances) (i.e., pairwise distances)

3.3. If If ΔΔXX is equal to the experimental partial digest is equal to the experimental partial digest LL, , then then X X is the correct restriction mapis the correct restriction map

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Page 21: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

Partial Digest: Brute ForcePartial Digest: Brute ForceTo do this, we will need to know To do this, we will need to know nn. Note that . Note that C(n,2) is n!/[(n-2)!2!] = n(n-1)/2C(n,2) is n!/[(n-2)!2!] = n(n-1)/2

But |L| = C(n,2) = n(n-1)/2, so nBut |L| = C(n,2) = n(n-1)/2, so n22 – n – 2|L| = 0 – n – 2|L| = 0

For For L = {3, 5, 5, 8, 9, 14, 14, 17, 19, 22} (i.e., our L = {3, 5, 5, 8, 9, 14, 14, 17, 19, 22} (i.e., our previous example), |L| = 10 and n = 5. (Recall that previous example), |L| = 10 and n = 5. (Recall that X = {0, 5, 14, 19, 22} in that example.)X = {0, 5, 14, 19, 22} in that example.)

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Page 22: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

BruteForcePDP(BruteForcePDP(L, nL, n):):

MM ← ← maximum element in maximum element in LLfor every set of for every set of nn – 2 integers 0 < – 2 integers 0 < xx22 < … < … xxnn-1-1 < < MM

XX ← ← {0, {0, xx22, …, , …, xxnn-1-1, , MM}}

form form ΔΔX X from from XXif if ΔΔX X == L L

return return XXoutput “no solution”output “no solution”

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Page 23: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

AnotherBruteForcePDP(AnotherBruteForcePDP(L, nL, n))

MM ←← maximum element in maximum element in LL

for every set of for every set of nn – 2 integers 0 < – 2 integers 0 < xx22 < … < … xxnn-1-1 < < M M fromfrom L L

XX ← ← { 0, { 0, xx22, …, , …, xxnn-1-1, , M M }}

form form ΔΔX X from from XX

if if ΔΔX X == L L

return return XX

output “no solution”output “no solution”

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Page 24: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

AnotherBruteForcePDP(AnotherBruteForcePDP(L, nL, n))

MM ←← maximum element in maximum element in LL

for every set of for every set of nn – 2 integers 0 < – 2 integers 0 < xx22 < … < … xxnn-1-1 < < M M fromfrom L L

XX ← ← { 0, { 0, xx22, …, , …, xxnn-1-1, , M M }}

form form ΔΔX X from from XX

if if ΔΔX X == L L

return return XX

output “no solution”output “no solution”

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Example:

L = {3, 5, 5, 8, 9, 14, 14, 17, 19, 22}

n=5

Form all possible variations of

X = {0, a, b, c, M} until finding one for

which ΔX= L (where a, b, and c are

values < M from L)

Answer: X = {0, 5, 14, 19, 22}

Page 25: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

BruteForcePDP(BruteForcePDP(L, nL, n):):

MM ← ← maximum element in maximum element in LLfor every set of for every set of nn – 2 integers 0 < – 2 integers 0 < xx22 < … < … xxnn-1-1 < < MM

XX ← ← {0, {0, xx22, …, , …, xxnn-1-1, , MM}}

form form ΔΔX X from from XXif if ΔΔX X == L L

return return XXoutput “no solution”output “no solution”

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Efficiency:

1.There are C(M-1,n-2) sets of integers

having values in the range (0,M)

2.Creating X, forming ΔX from X, and

comparing ΔX to L each requires a constant

number of operations

3.So, efficiency is O(C(M-1,n-2)) O(Mn-2)

Page 26: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

AnotherBruteForcePDP(AnotherBruteForcePDP(L, nL, n))

MM ←← maximum element in maximum element in LL

for every set of for every set of nn – 2 integers 0 < – 2 integers 0 < xx22 < … < … xxnn-1-1 < < M M fromfrom L L

XX ← ← { 0, { 0, xx22, …, , …, xxnn-1-1, , M M }}

form form ΔΔX X from from XX

if if ΔΔX X == L L

return return XX

output “no solution”output “no solution”

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Efficiency:

1.There are C(|L|,n-2) sets of integers in L

having values in the range [0,M]. Note that |

L| = n(n-1)/2.

2.As before, the other processes each take

a constant number of operations

3.So, efficiency is O(C(|L|,n-2)) O(n2n-4)

Page 27: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Compare AnotherBruteForcePDP with BruteForcePDP

More efficient, but still slow

Consider L = {2, 998, 1000} (n = 3, M = 1000), BruteForcePDP will be extremely slow, but AnotherBruteForcePDP will be quite fast

Fewer sets are examined, but runtime is still exponential: O(n2n-4)

Page 28: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

PartialDigest(L)width ← Maximum element in LDELETE(width, L)X ← {0, width}PLACE(L, X)

PLACE(L, X)if L is empty

output Xreturn

y ← maximum element in Lif Δ(y, X ) L

Add y to X and remove lengths Δ(y, X) from LPLACE(L,X )Remove y from X and add lengths Δ(y, X) to L

if Δ(width-y, X ) LAdd width-y to X and remove lengths Δ(width-y, X) from LPLACE(L,X )Remove width-y from X and add lengths Δ(width-y, X) to L

return

A Better Algorithm…Notes:1.DELETE(y, L) removes the value y from L.2.Δ(y, X) denotes the multiset of distances between a point y and all points in a set X.3.After each recursive call in PLACE, X and L are restored to their condition before the call in case another branch in the search tree must be explored4.The algorithm lists all sets X with ΔX = L.

Consider an example whereL = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10}…

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Page 29: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

Example

Page 30: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

PartialDigest(L)width ← Maximum element in LDELETE(width, L)X ← {0, width}PLACE(L, X)

PLACE(L, X)if L is empty

output Xreturn

y ← maximum element in Lif Δ(y, X ) L

Add y to X and remove lengths Δ(y, X) from LPLACE(L,X )Remove y from X and add lengths Δ(y, X) to L

if Δ(width-y, X ) LAdd width-y to X and remove lengths Δ(width-y, X) from LPLACE(L,X )Remove width-y from X and add lengths Δ(width-y, X) to L

return

A Better Algorithm…Efficiency:For the ideal case, only one recursive call is made in PLACE each time PLACE is called. The amount of work done for the call is O(n) the first time, O(n-1) the second time, etc., and this continues for n times, so the total work is n+(n-1)+(n-2)+…+1 = n(n+1)/2 or O(n2).

For pathological cases where both recursive calls are made in PLACE (i.e., if both alternatives are viable) each time PLACE is called, the complexity is O(2n) where n is |X|.

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Page 31: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

Notes for Brute Force ApproachesBruteForcePDPM ← maximum element in Lfor every set of n – 2 integers 0 < x2 < … xn-1 < M X ← {0, x2, …, xn-1, M} form ΔX from X if ΔX = L return Xoutput “no solution”

AnotherBruteForcePDPM ← maximum element in Lfor every set of n – 2 integers 0 < x2 < … xn-1 < M from L X ← {0, x2, …, xn-1, M} form ΔX from X if ΔX = L return Xoutput “no solution”

compare

We would like to use the same algorithm to solve both. In order to do this, let’s put the values we will be using to create X into an array called workingArray. For BruteForcePDP we will be dealing with #s 1, 2, 3, 4, …, M-1. Let’s call this set of values allValues. For AnotherBruteForcePDP we will be dealing with #s from L except M and any duplicates. Let’s call this set of values reducedL. So, in order to use the same code for both BruteForcePDP and AnotherBruteForcePDP, all we need to do is put allValues or reducedL into workingArray, respectively, then use workingArray:

GenericBruteForcePDPif algorithm = BruteForcePDP workingArray ← allValueselse workingArray ← reducedLM ← maximum element in Lfor every set of n – 2 integers in workingArray X ← { 0, x2, …, xn-1, M } form ΔX from X if ΔX = L return Xoutput “no solution”

Page 32: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

Notes for Brute Force Approaches

Our next problem is to generate every possible set of n-2 integers from values in workingArray. One way is to envision this as a tree search problem where the leaf nodes represent the possible arrangements of the values in workingArray. For example, consider L = {2, 2, 5, 7, 9, 10} and n = 4. In this case, workingArray contains the values {2, 5, 7, 9}.

A simple (but naïve) tree would look like this: But this tree eliminates redundancy :

Of course, we really just want the leaf nodes. To produce them, we can simply perform a depth first search, adding to the set of values at each of the n-2 positions (from left to right) as we go deeper into the tree until all positions have been filled. Our choice of value at any time will be made from the unused values in workingArray. When we use a value, we must remove it from workingArray so that it cannot be used at the next level. However, in order for this to work, we must restore workingArray to its previous state when we backtrack to a node. The easiest way to do this is via recursion, in which case we only need to make a copy of workingArray before the next recursive call, remove the appropriate value from the copy, and then pass the copy. That way, when returning from the recursive call, workingArray will already be as it was before the recursion.

(continued)

Combinations:C(x,y) = x!/[(x-y)!y!] C(4,2) = 4!/[(4-2)!2!] = 6

Permutations:P(x,y) = x!/(x-y)! P(4,2) = 4!/(4-2)! = 12

Page 33: Restriction Mapping An Introduction to Bioinformatics Algorithms (Jones and Pevzner)

Notes for Brute Force Approaches(continued)

Let setOfIntegers be the collection of n-2 integers that we must generate (i.e., a candidate map). Recall that this will begin with no values. Here is pseudocode for a depth-first traversal of the search tree:

depthFirst (setOfIntegers, workingArray)if setOfIntegers is complete (i.e., has no unfilled positions) if ΔX = L show setOfIntegers returnfor each position in workingArray V ← value at current position in workingArray (i.e., next unused value) workingArrayCopy ← workingArray remove V from workingArrayCopy setOfIntegersCopy ← setOfIntegers next available position in setOfIntegersCopy ← V depthFirst (setOfIntegersCopy, workingArrayCopy)return

Note: Before each recursive call we are reducing the contents of workingArray and increasing the number of values in setOfIntegers (i.e., the candidate map). We make copies of these arrays and pass them so that upon return both workingArray and setOfIntegers are as they were before being modified for the recursive call.