Comparative genome maps
CSCI 7000-005: Computational Genomics
Debra Goldberg
Post on 21-Dec-2015
Why construct comparative maps?
Identify & isolate genes
• Crops: drought resistance, yield, nutrition...
• Human: disease genes, drug response, ...
Infer ancestral relationships
Discover principles of evolution
• Chromosome
• Gene family
“key to understanding the human genome”
Why automate?
Time consuming, laborious
• Needs to be redone frequently
Codify a common set of principles
Nadeau and Sankoff: warn of “arbitrary nature of comparative map construction”
Definitions
Marker: identifiable chromosomal locus
Homology: genes with a common ancestor
Homeology: chromosomal regions derived from a common ancestral linkage group
Synteny: loci on the same chromosome
Colinearity: syntenic regions with conserved gene order
Input/Output
Input:
• genetic maps of 2 species
• marker/gene correspondences (homologs)
Output:
• a comparative map
• homeologies identified
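A minimal sketch of how this input might be held in memory. The `Marker` type and `parse_marker` helper are hypothetical names, not part of the original slides; the marker strings follow the "name (location)" form used in the Maize 1 / Rice example.

```python
from typing import NamedTuple


class Marker(NamedTuple):
    """One marker on the target map (hypothetical container for illustration)."""
    name: str
    homologs: frozenset  # base-genome location(s) of the marker's homolog(s)


def parse_marker(text: str) -> Marker:
    """Parse a string such as 'isu081b (3S 10L)' into a Marker."""
    name, _, rest = text.partition("(")
    locations = rest.rstrip(") ").split()
    return Marker(name.strip(), frozenset(locations))


# The target map is an ordered list of markers; order follows the published map.
target_map = [parse_marker(t) for t in ["pds1 (3S)", "rz742a (2S)", "isu081b (3S 10L)"]]
```

A comparative map can then be represented as one base-genome label per marker in `target_map`.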
Map construction
Maize 1 (target), Rice (base). Wilson et al., Genetics 1999
Rice segment labels: 3S, 8L, 10L, 3L
Maize 1 markers, with the rice location of each homolog in parentheses:
pds1 (3S)
rz742a (2S)
rz103b (2L)
cdo1387b (3S)
isu040 (3)
rz574 (3S)
cdo38a (7L)
cdo938a (3S)
rz585a (3S)
rz672a (3S)
isu081b (3S 10L)
rz323a (8L)
cdo344c (12L)
rz296a (5L)
bcd734b (3S)
rz500 (10L)
rz421 (10L)
isu74 (3S)
cdo464a (8L)
isu73 (3S)
cdo475b (6S)
cdo595 (8L)
cdo116 (8L)
rz28a (8L)
cdo99 (8L)
rz698a (9L)
bcd207a (10L)
cdo94b (10L)
bcd386a (10L)
isu78 (5L)
csu77 (10L)
cdo98b (10L)
rz630e (3L)
rz403 (3L)
cdo795a (3L)
bcd1072c (5C)
isu92b (3L)
cdo122a (3L)
rz912a (3L)
bcd808a (11S)
cdo246 (3L)
adh1 (11S)
cdo353b (3L)
isu106a (3L)
phi1 (3L)
Go from this to this
Chromosome labeling
Maize 1 (target), Rice (base). Wilson et al., Genetics 1999
[Figure: the Maize 1 marker list from the previous slide, with regions of the map labeled by rice segments 3S, 8L, 10L, 3L.]
A natural model?
Maize 1 (target), Rice (base). Wilson et al., Genetics 1999
[Figure: the same labeled Maize 1 map (rice segments 3S, 8L, 10L, 3L).]
Scoring
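Segment labels: 10L, 3L
s = segment boundary penalty, m = mismatched marker penalty
bcd207a (10L)
cdo94b (10L)
bcd386a (10L)
isu78 (5L)
csu77 (10L)
cdo98b (10L)
rz630e (3L)
rz403 (3L)
cdo795a (3L)
isu92b (3L)
Labeling the first six markers 10L and the last four 3L costs one s (the boundary between 10L and 3L) and one m (isu78, whose homolog is on 5L).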
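The scoring of a given labeling can be sketched as below. This is only an illustration: the function name `linear_score` is hypothetical, and the values s = 2 and m = 1 are assumptions borrowed from the later "Problem with linear model" slide rather than stated on this one.

```python
def linear_score(homologs, labels, s, m):
    """Score a labeling: m per mismatched marker, s per segment boundary.

    homologs[i] is the set of base locations of marker i's homolog(s);
    labels[i] is the label assigned to marker i.
    """
    mismatches = sum(1 for h, a in zip(homologs, labels) if a not in h)
    boundaries = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return m * mismatches + s * boundaries


# Markers from the slide: six in the 10L region (one with a 5L homolog), four in 3L.
homologs = [{"10L"}, {"10L"}, {"10L"}, {"5L"}, {"10L"}, {"10L"},
            {"3L"}, {"3L"}, {"3L"}, {"3L"}]
labels = ["10L"] * 6 + ["3L"] * 4
score = linear_score(homologs, labels, s=2, m=1)  # one boundary + one mismatch = 3
```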
Assumptions
Accept published marker order
All linkage groups of base are unique
Simplistic homeology criteria
At least one homeologous region
Dynamic programming
$l_i$ = location of the homolog of marker $i$
$S[i,a]$ = penalty (score) for an optimal labeling of the submap from marker $i$ to the end (marker $n$), when labeling begins with label $a$, i.e., marker $i$ is labeled $a$
Recurrence relation
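$S[n,a] = m\,\delta(a, l_n)$
$S[i,a] = m\,\delta(a, l_i) + \min_{b \in L}\left(S[i+1,b] + s\,\delta(a,b)\right)$
Here $\delta(x,y)$ is 0 if $x = y$ and 1 otherwise, $m$ is the mismatched marker penalty, $s$ is the segment boundary penalty, and $L$ is the set of labels. Marker $i$ is labeled $a$ and marker $i+1$ is labeled $b$.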
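The recurrence can be filled in from marker n back to marker 1. The sketch below is one possible implementation, not the author's code: it uses 0-based indices, assumes the mismatch term is 0 when the label matches a homolog location and m otherwise, and the name `linear_dp` and the values s = 2, m = 1 are illustrative assumptions.

```python
def linear_dp(homologs, labels, s, m):
    """Return (best score, labeling) for the linear model.

    homologs[i]: set of base locations of marker i's homolog(s).
    labels: list of candidate labels (the set L in the recurrence).
    """
    n = len(homologs)

    def mismatch(a, i):
        return 0 if a in homologs[i] else m

    S = [{} for _ in range(n)]    # S[i][a] as in the recurrence
    nxt = [{} for _ in range(n)]  # label chosen for marker i+1, for traceback
    for a in labels:
        S[n - 1][a] = mismatch(a, n - 1)
    for i in range(n - 2, -1, -1):
        for a in labels:
            cost, b = min((S[i + 1][b] + (0 if a == b else s), b) for b in labels)
            S[i][a] = mismatch(a, i) + cost
            nxt[i][a] = b

    # The overall optimum is the best choice of first label.
    score, a = min((S[0][a], a) for a in labels)
    labeling = [a]
    for i in range(n - 1):
        a = nxt[i][a]
        labeling.append(a)
    return score, labeling


homologs = [{"10L"}, {"10L"}, {"10L"}, {"5L"}, {"10L"}, {"10L"},
            {"3L"}, {"3L"}, {"3L"}, {"3L"}]
best_score, best_labeling = linear_dp(homologs, ["10L", "3L", "5L"], s=2, m=1)
```

With n markers and |L| labels, this table has n·|L| entries and each takes O(|L|) work to fill.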
Problem with linear model
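s = 2 (and m = 1)
a-b-c motif: markers a a a b b b c c c
• labeling a b c, score: 2s = 4
a-b-a motif: markers a a a b b b a a a
• labeling a (a single segment), score: 3m = 3, which is cheaper than the a b a labeling at 2s = 4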
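A brute-force check of the a-b-a motif makes the problem concrete. This sketch enumerates every labeling of the nine markers under linear-model scoring; m = 1 is inferred from "3m = 3", and the helper name is hypothetical.

```python
from itertools import product


def linear_score(homologs, labels, s, m):
    """m per mismatched marker plus s per change of label between neighbors."""
    mismatches = sum(1 for h, a in zip(homologs, labels) if a != h)
    boundaries = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return m * mismatches + s * boundaries


motif = "aaabbbaaa"  # homolog location of each marker
s, m = 2, 1

# Try all 2^9 labelings over {a, b} and keep the cheapest.
best = min(product("ab", repeat=len(motif)),
           key=lambda lab: linear_score(motif, lab, s, m))
best_score = linear_score(motif, best, s, m)
true_score = linear_score(motif, "aaabbbaaa", s, m)
```

The cheapest labeling is all a (score 3), so the linear model hides the inserted b segment, which would cost 4 to report.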
The stack model
Segment at the top of the stack can be:
• pushed (remembered), later popped
• replaced
Push and replace cost s; pop is free.
Scoring
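Segment labels: 7L, 9L, 7L
uaz265a (7L)
isu136 (2L)
isu151 (7L)
rz509b (7L)
cdo59c (7L)
rz698c (9L)
bcd1087a (9L)
rz206b (9L)
bcd1088c (9L)
csu40 (3S)
cdo786a (9L)
csu154 (7L)
isu113a (7L)
csu17 (7L)
cdo337 (3L)
rz530a (7L)
Entering the 9L segment costs s; returning to 7L is a "free" pop. The three mismatched markers (isu136, csu40, cdo337) cost m each.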
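One simple way to score a given labeling under the stack model is sketched below. It is an illustrative simplification, not the author's procedure: a change to a label that is still on the stack is treated as a free pop, and any other change is charged s. The name `stack_score` and the values s = 2, m = 1 are assumptions.

```python
def stack_score(homologs, labels, s, m):
    """Score a labeling with free pops back to remembered segments."""
    mismatches = sum(1 for h, a in zip(homologs, labels) if a not in h)
    stack = [labels[0]]
    charged_changes = 0
    for a in labels[1:]:
        if a == stack[-1]:
            continue                  # still in the same segment
        if a in stack:
            while stack[-1] != a:     # pop back to the remembered segment: free
                stack.pop()
        else:
            stack.append(a)           # new segment: costs s
            charged_changes += 1
    return m * mismatches + s * charged_changes


locations = ["7L", "2L", "7L", "7L", "7L", "9L", "9L", "9L",
             "9L", "3S", "9L", "7L", "7L", "7L", "3L", "7L"]
homologs = [{loc} for loc in locations]
labels = ["7L"] * 5 + ["9L"] * 6 + ["7L"] * 5
score = stack_score(homologs, labels, s=2, m=1)  # s + 3m = 5
```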
Dynamic programming
$S[i,j,a]$ = score for an optimal labeling of:
• the submap from marker $i$ to marker $j$
• when labeling begins with label $a$, i.e., marker $i$ is labeled $a$
Recurrence relation
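$S[i,i,a] = m\,\delta(a, l_i)$
$S[i,j,a] = \min \begin{cases} m\,\delta(a, l_i) + \min_{b \in L}\left(S[i+1,j,b] + s\,\delta(a,b)\right) \\ \min_{i<k<j}\left(S[i,k,a] + S[k+1,j,a]\right) \end{cases}$
In the first case, marker $i$ is labeled $a$ and marker $i+1$ is labeled $b$. In the second case, the labeling returns to $a$ at marker $k+1$.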
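The interval recurrence can be sketched with memoized recursion. This is an illustration under assumptions, not the author's implementation: indices are 0-based, the mismatch term is 0 when the label matches a homolog location and m otherwise, and the name `stack_dp` is hypothetical.

```python
from functools import lru_cache


def stack_dp(homologs, labels, s, m):
    """Return the optimal stack-model score for the whole map."""
    n = len(homologs)

    @lru_cache(maxsize=None)
    def S(i, j, a):
        mismatch = 0 if a in homologs[i] else m
        if i == j:
            return mismatch
        # Case 1: marker i is labeled a, marker i+1 is labeled b.
        best = mismatch + min(S(i + 1, j, b) + (0 if a == b else s)
                              for b in labels)
        # Case 2: split at k; the labeling returns to a at marker k+1.
        for k in range(i + 1, j):
            best = min(best, S(i, k, a) + S(k + 1, j, a))
        return best

    return min(S(0, n - 1, a) for a in labels)


# a-b-a motif with s = 2 and m = 1 (m = 1 is inferred from "3m = 3").
motif = [{"a"}] * 3 + [{"b"}] * 3 + [{"a"}] * 3
motif_score = stack_dp(motif, ("a", "b"), s=2, m=1)  # 2: one push, free pop
```

There are O(n²·|L|) table entries and each takes O(n + |L|) work, so the stack model costs more than the linear model but can report the inserted b segment.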
Problem: Incomplete input
Gene order is not always fully resolved.
Co-located genes can be ordered to give the most parsimonious labeling.
Published order (all at position 33.0):
33.0 Atp6b1 (8p)
33.0 Comp (19)
33.0 Jak3 (19p)
33.0 Jund1 (19p)
33.0 Lpl (8p)
33.0 Mel (19p)
33.0 Npy1r (4q)
33.0 Pde4c (19)
33.0 Slc18a1 (8p)
33.0 Srebf1 (17p)
=
Atp6b1 (8p)
Lpl (8p)
Npy1r (4q)
Srebf1 (17p)
Comp (19)
Jak3 (19p)
Jund1 (19p)
Mel (19p)
Pde4c (19)
Slc18a1 (8p)
Segment labels: 8p, 19p
The reordering algorithm
Uses a compression scheme
• Within a megalocus, group genes by location of related gene.
• Order these groups
• First, last groups interact with nearby genes
• Any ordering of internal groups is equally parsimonious
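The grouping step of the compression scheme can be sketched as follows. The function name `group_megalocus` is hypothetical, and locations are matched as exact strings, which is an assumption: it keeps "19" and "19p" in separate groups, and a real implementation might merge them.

```python
from collections import defaultdict


def group_megalocus(markers):
    """Group the genes of one megalocus by the location of the related gene.

    markers: list of (gene name, location of related gene) pairs.
    """
    groups = defaultdict(list)
    for name, location in markers:
        groups[location].append(name)
    return dict(groups)


# Genes co-located at position 33.0 in the "Incomplete input" example.
megalocus = [("Atp6b1", "8p"), ("Comp", "19"), ("Jak3", "19p"), ("Jund1", "19p"),
             ("Lpl", "8p"), ("Mel", "19p"), ("Npy1r", "4q"), ("Pde4c", "19"),
             ("Slc18a1", "8p"), ("Srebf1", "17p")]
groups = group_megalocus(megalocus)
```

Only the choice of first and last group affects neighboring markers, so the internal groups can be emitted in any order.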
Definitions
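$\delta$ extended to the distance to a set $A$ of labels:
$\delta(a, A) = \begin{cases} 0 & \text{if } a \in A \\ 1 & \text{otherwise} \end{cases}$
$S$ = the set of indices of supernode start elements
For simplicity, refer to a supernode by its start index $i \in S$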
Definitions
For $i \in S$:
$n_i$ = # markers in $i$
$n_i(a)$ = # markers in $i$ with a homolog on $a$
$l_i$ = set of labels matching markers in $i$
• $l_i = \{a \in L \mid n_i(a) \geq 1\}$
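These three quantities can be computed directly from the homolog locations of the markers in a supernode. A minimal sketch, assuming each marker has a single homolog location; the function name `supernode_counts` is hypothetical.

```python
from collections import Counter


def supernode_counts(homolog_locations):
    """Return (n_i, n_i(a) for every a, l_i) for one supernode.

    homolog_locations: one base-genome label per marker in the supernode.
    """
    n_i_of = Counter(homolog_locations)                       # n_i(a)
    n_i = len(homolog_locations)                              # n_i
    l_i = {a for a, count in n_i_of.items() if count >= 1}    # l_i
    return n_i, n_i_of, l_i


# Homolog locations of the ten genes at position 33.0 in the earlier example.
locations = ["8p", "19", "19p", "19p", "8p", "19p", "4q", "19", "8p", "17p"]
n_i, n_i_of, l_i = supernode_counts(locations)
```

`Counter` returns 0 for labels that do not occur, which matches n_i(a) = 0 for any label a outside l_i.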
Definitions
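$p_i(c)$ gives mismatched marker and segment boundary penalties for label $c$
$p_i(c) = \begin{cases} s & : m\,n_i(c) \geq s \\ m\,n_i(c) & : m\,n_i(c) \leq s \end{cases}$
Equivalently, $p_i(c) = \min(s,\ m\,n_i(c))$.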
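Read as a minimum, the penalty for label c is the cheaper of two options: charge m for each of the n_i(c) markers, or charge a single s. The sketch below assumes that reading; the function name `hidden_penalty` and the values s = 2, m = 1 are illustrative.

```python
def hidden_penalty(n_i_c, s, m):
    """Penalty p_i(c) for the n_i(c) markers of a supernode with a homolog on c."""
    return min(s, m * n_i_c)


s, m = 2, 1
one_marker = hidden_penalty(1, s, m)     # m * 1 = 1 is cheaper than s = 2
three_markers = hidden_penalty(3, s, m)  # s = 2 is cheaper than m * 3 = 3
```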
Definitions
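$p(i,a,b)$ gives the total mismatched marker and segment boundary penalties attributed to "hidden markers"
$p(i,a,b) = \begin{cases} \sum_{c \neq a,b} p_i(c) + m\,\epsilon_i(a,b) & : \text{for } i \in S,\ a \neq b \\ \sum_{c \neq a} m\,n_i(c) + m\,\epsilon_i(a,b) & : \text{for } i \in S,\ a = b \\ 0 & : \text{otherwise} \end{cases}$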
Definitions
For $i \in S$:
$\eta_i(a,b)$ = # labels in $\{a,b\}$ without a matching marker in $i$
$\eta_i(a,b) = \delta(a, l_i) + \delta(b, l_i)$
$\eta_i(a,b) \in \{0,1,2\}$
Definitions
$\epsilon_i(a,b)$ corrects for mismatched marker penalties assigned twice for the same marker: in the recurrence and in $p(i,a,b)$
For example:
$\epsilon_i(a,b) = 0$ if $\eta_i(a,b) = 0$ (if $a$, $b$ are both represented in the supernode)
$\epsilon_i(a,a) = -2$ if $\eta_i(a,a) > 0$ (if $a$ is not represented in the supernode)
Recurrence relation
$S[i,i,a] = m\,\delta(a, l_i)$
$S[i,j,a] = \min \begin{cases} m\,\delta(a, l_i) + \min_{b \in L}\left(S[i+1,j,b] + s\,\delta(a,b) + p(i,a,b)\right) \\ \min_{i<k<j,\ k \in S}\left(S[i,k,a] + S[k+1,j,a]\right) \end{cases}$
Summary
Finds optimal comparative map
• Arranges markers in most parsimonious way
First algorithm to use megalocus data
Fast, objective, simple to use
Biologically meaningful results
Summary
Global view
Biologically meaningful results
• Provides testable hypotheses
Robust
• not species-specific
• high/low resolution, genetic/physical maps
• stable to errors in marker order
Future Directions
Algorithmic extensions
• 3rd species
• polyploidy
• search for ancient duplications
Deduce history of evolutionary events
• makes genome rearrangement measures tractable and robust
• infer common ancestor
Future Directions
Block-segmental sequence comparisons
• non-local sequence alignment
• protein domains
2D block-segmental comparisons
• comparison of regulatory networks
• image processing