comparative genome maps csci 7000-005: computational genomics debra goldberg [email protected]

Comparative Genome Maps

CSCI 7000-005: Computational Genomics

Debra Goldberg

[email protected]

What is a comparative map?

Why construct comparative maps?

Identify & isolate genes
• Crops: drought resistance, yield, nutrition...
• Human: disease genes, drug response,…

Infer ancestral relationships

Discover principles of evolution
• Chromosome
• Gene family

“key to understanding the human genome”

Why automate?

Time consuming, laborious
• Needs to be redone frequently

Codify a common set of principles

Nadeau and Sankoff: warn of “arbitrary nature of comparative map construction”

Definitions

Marker: identifiable chromosomal locus

Homology: genes with common ancestor

Homeology: chromosomal regions derived from a common ancestral linkage group

Synteny: loci on the same chromosome

Colinearity: syntenic regions with conserved gene order

Input/Output

Input:
• genetic maps of 2 species
• marker/gene correspondences (homologs)

Output:
• a comparative map
• homeologies identified

Map construction

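Maize 1 (target), Rice (base)

Wilson et al. Genetics 1999

[Figure: the maize 1 markers listed here, each with the rice location of its homolog, are grouped into rice segments 3S, 8L, 10L, 3L]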

pds1 (3S)

rz742a (2S)

rz103b (2L)

cdo1387b (3S)

isu040 (3)

rz574 (3S)

cdo38a (7L)

cdo938a (3S)

rz585a (3S)

rz672a (3S)

isu081b (3S 10L)

rz323a (8L)

cdo344c (12L)

rz296a (5L)

bcd734b (3S)

rz500 (10L)

rz421 (10L)

isu74 (3S)

cdo464a (8L)

isu73 (3S)

cdo475b (6S)

cdo595 (8L)

cdo116 (8L)

rz28a (8L)

cdo99 (8L)

rz698a (9L)

bcd207a (10L)

cdo94b (10L)

bcd386a (10L)

isu78 (5L)

csu77 (10L)

cdo98b (10L)

rz630e (3L)

rz403 (3L)

cdo795a (3L)

bcd1072c (5C)

isu92b (3L)

cdo122a (3L)

rz912a (3L)

bcd808a (11S)

cdo246 (3L)

adh1 (11S)

cdo353b (3L)

isu106a (3L)

phi1 (3L)

Go from this to this

Chromosome labeling

[Figure: Maize 1 (target) markers labeled with Rice (base) segments 3S, 8L, 10L, 3L; same marker list as on the Map construction slide]

A natural model?

[Figure: Maize 1 (target) markers labeled with Rice (base) segments 3S, 8L, 10L, 3L; same marker list as on the Map construction slide]

Scoring

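[Figure: segments 10L and 3L; s marks the segment boundary penalty between them, m marks the mismatched marker penalty for isu78 (5L)]

bcd207a (10L), cdo94b (10L), bcd386a (10L), isu78 (5L), csu77 (10L), cdo98b (10L), rz630e (3L), rz403 (3L), cdo795a (3L), isu92b (3L)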

Assumptions

Accept published marker order

All linkage groups of base are unique

Simplistic homeology criteria

At least one homeologous region

A natural model?

Dynamic programming

$l_i$ = location of homolog to marker $i$

$S[i,a]$ = penalty (score) for an optimal labeling of the submap from marker $i$ to the end, when labeling begins with label $a$

[Figure: markers 1 ... i ... n, with label a at marker i]

Recurrence relation

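$S[n,a] = m\,\delta(a, l_n)$

$S[i,a] = m\,\delta(a, l_i) + \min_{b \in L} \big( S[i+1,b] + s\,\delta(a,b) \big)$

where $\delta(x,y) = 0$ if $x = y$ and $1$ otherwise, and $L$ is the set of labels.

[Figure: marker i labeled a with homolog location $l_i$, marker i+1 labeled b with homolog location $l_{i+1}$, through marker n with homolog location $l_n$]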

Problem with linear model

s = 2, m = 1

a-b-c motif:

a a a b b b c c c labeled a | b | c, score: 2s = 4

a-b-a motif:

a a a b b b a a a labeled all a, score: 3m = 3 (cheaper than a | b | a at 2s = 4, so the b segment is lost)
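The linear recurrence and the two motifs can be checked with a short sketch. This is an illustration under assumptions, not the author's implementation: the function name `linear_labeling`, the single-character labels, and the value m = 1 (inferred from 3m = 3) are choices made here.

```python
from typing import List, Sequence, Tuple


def linear_labeling(homologs: Sequence[str], labels: Sequence[str],
                    s: int, m: int) -> Tuple[int, List[str]]:
    """Minimum-penalty labeling under the linear model (hypothetical helper).

    homologs[i] is the base-genome location of the homolog of marker i;
    s is the segment boundary penalty, m the mismatched marker penalty.
    """
    n = len(homologs)
    if n == 0:
        return 0, []

    def delta(x: str, y: str) -> int:
        return 0 if x == y else 1

    # S[i][a]: best score for markers i..n-1 when marker i is labeled a.
    S = [dict.fromkeys(labels, 0) for _ in range(n)]
    # nxt[i][a]: label chosen for marker i+1 in that optimum.
    nxt = [dict.fromkeys(labels, "") for _ in range(n)]

    for a in labels:
        S[n - 1][a] = m * delta(a, homologs[n - 1])
    for i in range(n - 2, -1, -1):
        for a in labels:
            b = min(labels, key=lambda x: S[i + 1][x] + s * delta(a, x))
            S[i][a] = m * delta(a, homologs[i]) + S[i + 1][b] + s * delta(a, b)
            nxt[i][a] = b

    # Trace back from the cheapest starting label.
    a = min(labels, key=lambda x: S[0][x])
    score = S[0][a]
    labeling = [a]
    for i in range(n - 1):
        a = nxt[i][a]
        labeling.append(a)
    return score, labeling
```

With s = 2 and m = 1, `linear_labeling("aaabbbccc", "abc", 2, 1)` scores 4 with the labeling a | b | c, while `linear_labeling("aaabbbaaa", "abc", 2, 1)` scores 3 and labels every marker a, which is the behavior the slide identifies as the problem.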

The stack model

Segment at top of the stack can be:
• pushed (remembered), later popped
• replaced

Push and replace cost s; pop is free.


Scoring

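[Figure: segments 7L, 9L, 7L; one segment boundary penalty s for starting 9L, a “free” pop back to 7L, and three mismatched marker penalties m]

uaz265a (7L), isu136 (2L), isu151 (7L), rz509b (7L), cdo59c (7L), rz698c (9L), bcd1087a (9L), rz206b (9L), bcd1088c (9L), csu40 (3S), cdo786a (9L), csu154 (7L), isu113a (7L), csu17 (7L), cdo337 (3L), rz530a (7L)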

Dynamic programming

$S[i,j,a]$ = score for an optimal labeling of:
• submap from marker $i$ to marker $j$
• when labeling begins with label $a$, i.e., marker $i$ is labeled $a$

[Figure: markers 1 ... i ... j ... n, with label a at marker i]

Recurrence relation

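$S[i,i,a] = m\,\delta(a, l_i)$

$S[i,j,a] = \min \begin{cases} m\,\delta(a, l_i) + \min_{b \in L} \big( S[i+1,j,b] + s\,\delta(a,b) \big) \\ \min_{i<k<j} \big( S[i,k,a] + S[k+1,j,a] \big) \end{cases}$

[Figure: the two cases. Either marker i is labeled a and marker i+1 is labeled b, or label a at marker i is resumed at marker k+1]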
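A memoized sketch of the stack-model recurrence follows. It is a hedged illustration rather than the author's code: the name `stack_score` is hypothetical, only the score (not the labeling) is returned, and the plain recursion is suitable only for short maps.

```python
from functools import lru_cache
from typing import Sequence


def stack_score(homologs: Sequence[str], labels: Sequence[str],
                s: int, m: int) -> int:
    """Optimal stack-model score for a marker list (hypothetical helper).

    homologs[i] is the base-genome location of the homolog of marker i;
    s is the push/replace (segment boundary) penalty, m the mismatch penalty.
    """
    n = len(homologs)
    if n == 0:
        return 0

    def delta(x: str, y: str) -> int:
        return 0 if x == y else 1

    @lru_cache(maxsize=None)
    def S(i: int, j: int, a: str) -> int:
        mismatch = m * delta(a, homologs[i])
        if i == j:
            return mismatch
        # Case 1: marker i is labeled a and marker i+1 is labeled b;
        # changing label costs s (a push or a replace).
        best = mismatch + min(S(i + 1, j, b) + s * delta(a, b) for b in labels)
        # Case 2: split the submap; the second part resumes label a
        # at no cost, which models the free pop.
        for k in range(i + 1, j):
            best = min(best, S(i, k, a) + S(k + 1, j, a))
        return best

    return min(S(0, n - 1, a) for a in labels)
```

With s = 2 and m = 1 (m assumed, as for the linear motifs), `stack_score("aaabbbaaa", "abc", 2, 1)` evaluates to 2: one push of b, followed by a free pop back to a. The a-b-c motif still scores 4.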

Results: infers evolutionary events

[Figure: Maize 1 (target), Rice (base); panels: Wilson et al., Stack]

Problem: Incomplete input

Gene order not always fully resolved.

Co-located genes can be ordered to give most parsimonious labeling.

As mapped (all at 33.0): Atp6b1 (8p), Comp (19), Jak3 (19p), Jund1 (19p), Lpl (8p), Mel (19p), Npy1r (4q), Pde4c (19), Slc18a1 (8p), Srebf1 (17p)

Reordered: Atp6b1 (8p), Lpl (8p), Slc18a1 (8p), Npy1r (4q), Srebf1 (17p), Comp (19), Jak3 (19p), Jund1 (19p), Mel (19p), Pde4c (19)

Resulting segments: 8p, 19p

The reordering algorithm

Uses a compression scheme
• Within a megalocus, group genes by location of related gene.
• Order these groups
• First, last groups interact with nearby genes
• Any ordering of internal groups is equally parsimonious
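The grouping step can be sketched as follows. This is a simplified greedy illustration, not the dynamic program from the slides: the function name `group_megalocus` is hypothetical, and the labels of the neighboring segments (`prev_label`, `next_label`) are assumed to be known in advance, whereas the actual algorithm chooses them as part of the optimization.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def group_megalocus(genes: Sequence[Tuple[str, str]],
                    prev_label: Optional[str] = None,
                    next_label: Optional[str] = None
                    ) -> List[Tuple[str, List[str]]]:
    """Group co-located genes by homolog location and order the groups.

    genes is a list of (gene name, homolog location) pairs from one
    megalocus. The group matching the preceding segment goes first and
    the group matching the following segment goes last; internal groups
    keep their first-seen order, since any internal order is equally
    parsimonious.
    """
    groups: Dict[str, List[str]] = {}
    for name, label in genes:
        groups.setdefault(label, []).append(name)

    order = list(groups)  # labels in first-seen order
    if prev_label in groups:
        order.remove(prev_label)
        order.insert(0, prev_label)
    if next_label in groups and next_label != prev_label:
        order.remove(next_label)
        order.append(next_label)
    return [(label, groups[label]) for label in order]


# Illustrative subset of a megalocus; neighbor labels are assumptions.
example = group_megalocus(
    [("Atp6b1", "8p"), ("Jak3", "19p"), ("Lpl", "8p"),
     ("Npy1r", "4q"), ("Mel", "19p")],
    prev_label="8p", next_label="19p")
```

Here `example` places the 8p group first, the 4q group in the interior, and the 19p group last.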

The reordering algorithm

Definitions

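$\delta$ extended to distance to a set $A$ of labels:

$\delta(a, A) = \begin{cases} 0 & \text{if } a \in A, \\ 1 & \text{otherwise} \end{cases}$

$S$ = the set of indices of supernode start elements

For simplicity, refer to the supernode starting at index $i \in S$ as supernode $i$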

Definitions

For $i \in S$:

$n_i$ = # markers in $i$

$n_i(a)$ = # markers in $i$ with a homolog on $a$

$l_i$ = set of labels matching markers in $i$

• $l_i = \{a \in L \mid n_i(a) \ge 1\}$

Definitions

$p_i(c)$ gives mismatched marker and segment boundary penalties for label $c$

$p_i(c) = \begin{cases} s & \text{if } m\,n_i(c) \ge s \\ m\,n_i(c) & \text{if } m\,n_i(c) \le s \end{cases}$
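The per-supernode quantities can be computed directly from the homolog locations of the markers in a supernode. The sketch below is illustrative: the function names, the toy label list, and the penalty values s = 2, m = 1 are assumptions, and the penalty is written as the smaller of s and m times the marker count, which is equivalent to the two-case definition.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def supernode_stats(homolog_labels: Iterable[str]) -> Tuple[int, Counter, Set[str]]:
    """Return (n_i, counts, l_i) for one supernode (hypothetical helper).

    counts[a] plays the role of n_i(a): the number of markers in the
    supernode with a homolog on a.
    """
    counts = Counter(homolog_labels)
    n_i = sum(counts.values())
    l_i = {a for a, k in counts.items() if k >= 1}
    return n_i, counts, l_i


def hidden_penalty(counts: Counter, c: str, s: int, m: int) -> int:
    """p_i(c): the cheaper of one segment boundary (s) or treating
    every marker with a homolog on c as a mismatch (m * n_i(c))."""
    return min(s, m * counts[c])


# Toy supernode: two markers on 8p, three on 19p, one on 4q.
n_i, counts, l_i = supernode_stats(["8p", "8p", "19p", "19p", "19p", "4q"])
p_19p = hidden_penalty(counts, "19p", s=2, m=1)  # boundary is cheaper
p_4q = hidden_penalty(counts, "4q", s=2, m=1)    # mismatch is cheaper
```

A label that does not occur in the supernode has count 0, so its penalty is 0.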

Definitions

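$p(i,a,b)$ gives the total mismatched marker and segment boundary penalties attributed to “hidden markers”

$p(i,a,b) = \begin{cases} \sum_{c \ne a,b} p_i(c) + m\,\epsilon_i(a,b) & \text{for } i \in S,\ a \ne b \\ \sum_{c \ne a} m\,n_i(c) + m\,\epsilon_i(a,b) & \text{for } i \in S,\ a = b \\ 0 & \text{otherwise} \end{cases}$

(The Greek letters naming the count and correction functions did not survive extraction; $\eta_i$ and $\epsilon_i$ are stand-in names.)

Definitions

For $i \in S$:

$\eta_i(a,b)$ = # labels in $\{a,b\}$ without matching marker in $i$

$\eta_i(a,b) = \delta(a, l_i) + \delta(b, l_i)$

$\eta_i(a,b) \in \{0,1,2\}$

Definitions

$\epsilon_i(a,b)$ corrects for mismatched marker penalties assigned twice for the same marker: once in the recurrence and once in $p(i,a,b)$

For example:
• $\epsilon_i(a,b) = 0$ if $\eta_i(a,b) = 0$ (if a, b are both represented in supernode)
• $\epsilon_i(a,a) = -2$ if $\eta_i(a,a) > 0$ (if a is not represented in supernode)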
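The set-valued distance and the count of labels in {a,b} without a matching marker in a supernode are small enough to state as code. This is a minimal sketch with hypothetical function names; the correction term is left out because the slides give only examples of it, not a full definition.

```python
from typing import Set


def delta_set(a: str, labels: Set[str]) -> int:
    """Distance from label a to a set of labels: 0 if a is in the set, else 1."""
    return 0 if a in labels else 1


def unmatched_count(a: str, b: str, l_i: Set[str]) -> int:
    """Sum of the two set distances, as in the slide's formula.

    l_i is the set of labels matching markers in supernode i.
    The result is always 0, 1, or 2.
    """
    return delta_set(a, l_i) + delta_set(b, l_i)


# Toy supernode whose markers have homologs on 8p and 19p (assumed data).
l_i = {"8p", "19p"}
both_present = unmatched_count("8p", "19p", l_i)
one_missing = unmatched_count("8p", "4q", l_i)
same_missing = unmatched_count("4q", "4q", l_i)
```

Note that when a = b and a is not represented, the formula counts the label twice and yields 2, which matches the a = a example on the slide where the count is positive.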

Recurrence relation

$S[i,i,a] = m\,\delta(a, l_i)$

$S[i,j,a] = \min \begin{cases} m\,\delta(a, l_i) + \min_{b \in L} \big( S[i+1,j,b] + s\,\delta(a,b) + p(i,a,b) \big) \\ \min_{i<k<j,\ k \in S} \big( S[i,k,a] + S[k+1,j,a] \big) \end{cases}$

Results: Fewer mismatches

[Figure: stack vs. reordering labelings; Mouse 5 (target), Human (base)]

Results: Mismatches placed between segments

[Figure: stack vs. reordering labelings; Mouse 8 (target), Human (base)]

Results: Detects new segments

[Figure: stack vs. reordering labelings; Mouse 13 (target), Human (base)]

Summary

Finds optimal comparative map
• Arranges markers in most parsimonious way

First algorithm to use megalocus data

Fast, objective, simple to use

Biologically meaningful results

Summary

Global view

Biologically meaningful results
• Provides testable hypotheses

Robust
• not species-specific
• high/low resolution, genetic/physical maps
• stable to errors in marker order

Future Directions

Algorithmic extensions
• 3rd species
• polyploidy
• search for ancient duplications

Deduce history of evolutionary events
• makes genome rearrangement measures tractable and robust
• infer common ancestor

Future Directions

Block-segmental sequence comparisons
• non-local sequence alignment
• protein domains

2D block-segmental comparisons
• comparison of regulatory networks
• image processing

Acknowledgments

Jon Kleinberg

Susan McCouch

Chris Pelkie

Sandra Harrington

Sam Cartinhour

Dave Schneider

NSF

AAUW

David and Lucile Packard Foundation

USDA Cooperative State Research, Education, and Extension Service

ONR
