snail montreal
Post on 16-Apr-2017
Unweaving regulatory networks: Automated
extraction from literature and statistical analysis
Overview of the talk
• Introduction: project participants & jigsaw puzzle analogy
• Project motivation.
• Duality of signal transduction language.
• How the whole system works.
• Good and ugly graphs.
--------------------------------------------
• Scale-free networks in biology and outside; mechanism of stochastic birth of scale-free networks
Our project is set up as a collaboration of three departments of Columbia University
Interdisciplinary Collaboration:
Department of Medical Informatics, Columbia University (Carol Friedman, Pauline Kra, Michael Krauthammer, Yu Hong, Andrey Rzhetsky)
Department of Computer Science, Columbia University (Vasileios Hatzivassiloglou, Pablo Ariel Duboue, Wubin Weng)
Columbia Genome Center, Columbia University (Pavel Morozov, Tomohiro Koike, Shawn Gomez, Sabina Kaplan, Sergey Kalachikov, Jim Russo, Andrey Rzhetsky)
Studying living organisms
is not unlike
playing with a jigsaw puzzle…
Starting point: before sequence data were available
“Stamp collecting”: some regularities start to emerge...
Defining families of sequences
Beginning assembly of pieces: where we are now
Future
Our long-term objective:
develop computational tools for automated compilation and analysis of complex cell
regulation cascades in vertebrates
Problem/Motivation:
A PubMed search with the keywords “cell cycle” and “apoptosis” currently returns 169,293 and 29,961 articles, respectively.
Clearly it is not feasible to scan all these papers “manually”...
We decided
(i) to develop tools for automatic retrieval of binary regulatory relationships between molecules from research literature using techniques of natural language processing,
and
(ii) to use extracted knowledge for editing, visualization, and superimposing/comparing homologous networks.
We call the system
GENIES (GENomics Information Extraction System)
[Diagram: GENIES workflow. Box labels: Relevant keywords; Retrieve collection of journal articles; Natural Language Processing; Save collection of statement/source pairs; Filter statements, resolve controversies, eliminate redundancies; Pathways; Visualize; Edit; Simulate; Compare; In silico knock-out or knock-in genes.]
An overview of our system.
[Diagram: processing stages of the system.
Stages: I. Collection; II. Preprocessing; III. Term extraction; IV. Relationship extraction; V. Postprocessing.
Box labels: WWW; Keyword search; Collect abstracts or articles; Regularize & tag components; Identify & tag terms; GenBank; Supplemental lexicon; Identify binary relation; Grammar; Filter statements, resolve controversies, eliminate redundancies, include or exclude statements; Collection of "flat" files; Visualize, edit, compare, simulate regulatory networks.]
Application of techniques of Artificial Intelligence:
Natural Language Processing
Goal: to identify binary relationships of the form
“protein A activates protein B”
“protein B inactivates gene C”
The language of regulatory pathways differs significantly from the language of metabolic pathways.
We represent a pathway as a series of overlapping “links”: substance/action/substance triplets.
Substance A → Substance B → Substance C → Substance D
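The triplet representation can be sketched in a few lines of Python. This is only an illustration: the `Link` class, its field names, and the action verbs in the example are assumptions made here, not the data model used by GENIES.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One substance/action/substance triplet (hypothetical field names)."""
    upstream: str
    action: str
    downstream: str


def pathway_links(substances, actions):
    """Chain substances into overlapping links: the downstream substance of
    one link is the upstream substance of the next."""
    if len(actions) != len(substances) - 1:
        raise ValueError("need exactly one action between each pair of neighbours")
    return [Link(a, act, b)
            for a, act, b in zip(substances, actions, substances[1:])]


# Illustrative pathway A -> B -> C -> D; the verbs are made up for the example.
links = pathway_links(
    ["Substance A", "Substance B", "Substance C", "Substance D"],
    ["activates", "activates", "inactivates"],
)
for link in links:
    print(link.upstream, link.action, link.downstream)
```

Four substances give three links, and consecutive links share a substance, which is what "overlapping" means here.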
Duality of actions in signal transduction literature

Logical representation (the same for every type of action): "A activates B", a single action; A activates B through an action.

Biochemical representation and example, by type of action:
• phosphorylation: A is a catalyst (kinase); inactive B is phosphorylated (ATP → ADP) and becomes active. Example: A = PI3K, B = AKT/PKB.
• dephosphorylation: A is a catalyst (phosphatase); inactive phosphorylated B loses -PO4 and becomes active. Example: A = protein phosphatase 2A, B = FAS-activated serine/threonine kinase.
• transport: A is a pump/channel that moves B between inside and outside, using ATP. Example: A = Ca2+ pump ATPase, B = Ca2+.
• cleavage: A is a catalyst (protease). Example: A = ICE, B = CPP32.
• binding: A is a ligand and B is a receptor; A and inactive B form the active complex [AB]. Example: A = FAS-L, B = FAS.
• transcription: A initiates transcription of B. Example: A = (C-Myc:Max) protein complex, B = cdc25A gene.
• translation: A initiates translation of B. Example: A = eIF2B, B = virtually any gene.
• other: A activates B through a process. Example: A = FAS-L, B = AKT/PKB.
We realized that the current research literature in molecular biology describes pathways on two different levels, logical and biochemical:
Logical: A activates B; A inactivates B; ...
Biochemical: A phosphorylates B; A methylates B; ...
Dualism: in the biochemical representation substance A is not a participant of the action, while it is in the logical representation.
Both logical and biochemical descriptions can be combined in the same sentence:
Activated raf-1 phosphorylates and activates mek-1.
(“phosphorylates” is the biochemical description; “activates” is the logical one.)
The paper describing a “knowledge model” (= ontology) will appear in Bioinformatics.
We represent a pathway as a series of overlapping “links”: substance/action/substance triplets.
Substance A → Substance B → Substance C → Substance D
“Actions” are relatively few: one can provide an exhaustive list of them.
Each action comes with a mechanism (biochemical representation) and a result (logical representation).
Gene and protein names are numerous (currently >80,000), and the number is growing.
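Because the list of actions is closed while the list of names is open, one way to encode this is an enumeration for mechanisms and results and plain strings for substances. The sketch below is an assumption-laden illustration: the class names are invented here, the mechanism list is taken from the action types named earlier in the talk, and the two results are inferred from the verbs "activates" and "inactivates".

```python
from dataclasses import dataclass
from enum import Enum


class Mechanism(Enum):
    """Biochemical representation: a small, closed vocabulary."""
    PHOSPHORYLATION = "phosphorylation"
    DEPHOSPHORYLATION = "dephosphorylation"
    TRANSPORT = "transport"
    CLEAVAGE = "cleavage"
    BINDING = "binding"
    TRANSCRIPTION = "transcription"
    TRANSLATION = "translation"
    OTHER = "other"


class Result(Enum):
    """Logical representation of the outcome."""
    ACTIVATION = "activation"
    INACTIVATION = "inactivation"


@dataclass(frozen=True)
class Action:
    agent: str    # substance names are open-ended, so they stay plain strings
    target: str
    mechanism: Mechanism
    result: Result


def logical_view(a):
    return f"{a.agent} -> {a.target}: {a.result.value}"


def biochemical_view(a):
    return f"{a.target}: {a.mechanism.value} (agent: {a.agent})"


# "Activated raf-1 phosphorylates and activates mek-1."
raf_mek = Action("raf-1", "mek-1", Mechanism.PHOSPHORYLATION, Result.ACTIVATION)
print(logical_view(raf_mek))
print(biochemical_view(raf_mek))
```

Keeping mechanism and result as separate fields lets one sentence carry both descriptions, as in the raf-1/mek-1 example.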
MedLEE (by Carol Friedman and colleagues) implements the various grammatical patterns associated with the same verb:
A activates B…
A is an activator of B…
A appeared to activate B…
A is activating B…
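To make the idea of "many surface patterns, one relation" concrete, here is a toy normalizer built on regular expressions. It is not how MedLEE works (MedLEE performs semantic analysis of the sentence); the patterns, the `extract` function, and the simple notion of a name are all assumptions chosen for illustration.

```python
import re

# A deliberately naive idea of a substance name: letters, digits, hyphens.
NAME = r"(?P<{}>[A-Za-z0-9-]+)"

# Four surface forms of the same verb, taken from the list above.
PATTERNS = [
    re.compile(NAME.format("a") + r" activates " + NAME.format("b")),
    re.compile(NAME.format("a") + r" is an activator of " + NAME.format("b")),
    re.compile(NAME.format("a") + r" appeared to activate " + NAME.format("b")),
    re.compile(NAME.format("a") + r" is activating " + NAME.format("b")),
]


def extract(sentence):
    """Return (A, 'activate', B) for the first matching pattern, else None."""
    for pattern in PATTERNS:
        match = pattern.search(sentence)
        if match:
            return (match["a"], "activate", match["b"])
    return None


for s in ["raf-1 activates mek-1",
          "raf-1 is an activator of mek-1",
          "raf-1 appeared to activate mek-1",
          "raf-1 is activating mek-1"]:
    print(extract(s))
```

All four sentences map to the same triplet, which is the behaviour the slide attributes to the grammar; a real system needs far more than string patterns to get there.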
MedLEE=Medical Language Extraction and Encoding System
It is an integral part of the Clinical Information Service at Columbia-Presbyterian Medical Center.
It routinely processes thousands of patient records a day.
MedLEE does semantic analysis of the complete sentence.
If a complete sentence cannot be parsed successfully, MedLEE re-analyzes it, trying to extract parts.
For details see, e.g.,
Friedman, C., G. Hripcsak, W. DuMouchel, S.B. Johnson, and P.D. Clayton. 1995. Natural language processing in an operational clinical system. Natural Language Engineering. 1 (1): 83-108.
Term identification:
To give you a feeling for the work of the complete conveyor line…
Consider a sentence from an actual Science article:
“rap1 functions as a negative regulator of Tcr-mediated il-2 gene transcription”
NLP module (term markup + MedLEE) produces
[action, inactivate, [protein, rap1], [action, activate, [complex, T-cell receptor][action, transcribe, [gene, gene encoding interleukin-2]]],[parsemode, mode1]]
Which is then converted into “shorthand” notation:
transcribe gene encoding interleukin-2
Tcr activates transcribe&il-2
Rap1 inactivates Tcr&transcribe&il-2
(Slide annotations: substance (gene); action; action on action.)
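The step from the nested structure to the shorthand can be sketched as a recursive walk. This is a guess at the logic, not the project's converter: the tuple encoding, the `flatten` function, the abbreviated names, and the rule of appending "s" to the verb are all assumptions made to reproduce the three shorthand statements.

```python
def flatten(node, statements):
    """Walk a nested (kind, ...) tuple, append shorthand statements to
    `statements`, and return the label that stands for this node."""
    kind = node[0]
    if kind != "action":
        return node[1]                      # a substance: its name is its label
    verb, args = node[1], node[2:]
    labels = [flatten(arg, statements) for arg in args]
    if len(labels) == 1:                    # action on a substance
        statements.append(f"{verb} {labels[0]}")
        return f"{verb}&{labels[0]}"
    agent, target = labels                  # action on a substance or an action
    statements.append(f"{agent} {verb}s {target}")
    return f"{agent}&{target}"


# Hand-written stand-in for the parser output above, with abbreviated names.
parsed = ("action", "inactivate",
          ("protein", "Rap1"),
          ("action", "activate",
           ("complex", "Tcr"),
           ("action", "transcribe", ("gene", "il-2"))))

statements = []
flatten(parsed, statements)
for line in statements:
    print(line)
```

The innermost action is emitted first, so the output reads from the gene outward, matching the order of the shorthand statements.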
Which is then further converted to a format readable by our pathway
visualization program
Protein{ Name{ "IL-3", } }
LogicalAction{ { UpstreamActionAgent { Protein{ Name{ "IL-3", } } }, DownstreamActionAgent { Complex{ Name{ "IL-3R" } } }, Result{ activation } } }
Complex{ Name{ "IL-3R" } Composition{ Protein{ Name{ "IL-3R alpha" } } Protein{ Name{ "IL-3R beta" } } } }
Which is then visualized...
[Figure: example of an actual human regulatory network, visualized. Node labels include IL-3, IL-3R, IGF1, IGF1R, IRS1, RAS, PI3-K, AKT/PKB, BAD, Bcl-XL, FAS-L, FAS, FADD/MORT, FLICE, ICE, CPP32, mitogen, Cyclin D1, pRb, E2F, Cyclin E, P53, P21, P16, P27, Cdk4, P107, C-Myc, Bin-1, Max, Mad, Cdc25A, and Cdk2, leading to apoptosis and cell proliferation.]
[Figure: the corresponding article.]
Drawing a complex graph is a separate problem of Computer Science.
We are using a Simulated Annealing Technique to find an optimum graph layout.
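For readers unfamiliar with simulated annealing, a minimal layout sketch follows. Everything specific in it is an assumption: the energy function (connected nodes about one unit apart, plus a mild repulsion between all pairs), the geometric cooling schedule, and the step sizes are illustrative choices, not the criteria used in the project.

```python
import math
import random


def layout_energy(pos, edges):
    """Lower is better: edges near unit length, nodes not on top of each other."""
    energy = sum((math.dist(pos[a], pos[b]) - 1.0) ** 2 for a, b in edges)
    nodes = list(pos)
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            energy += 0.1 / (math.dist(pos[a], pos[b]) + 1e-3)
    return energy


def anneal_layout(nodes, edges, steps=2000, t_start=1.0, t_end=0.01, seed=0):
    rng = random.Random(seed)
    pos = {n: (rng.uniform(0, 3), rng.uniform(0, 3)) for n in nodes}
    energy = start_energy = layout_energy(pos, edges)
    best_pos, best_energy = dict(pos), energy
    for step in range(steps):
        # Geometric cooling from t_start down towards t_end.
        t = t_start * (t_end / t_start) ** (step / steps)
        n = rng.choice(nodes)
        old = pos[n]
        pos[n] = (old[0] + rng.gauss(0, 0.3), old[1] + rng.gauss(0, 0.3))
        new_energy = layout_energy(pos, edges)
        delta = new_energy - energy
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            energy = new_energy
            if energy < best_energy:
                best_pos, best_energy = dict(pos), energy
        else:
            pos[n] = old
    return best_pos, start_energy, best_energy


nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("B", "E")]
layout, start_e, best_e = anneal_layout(nodes, edges)
print(f"energy: {start_e:.3f} -> {best_e:.3f}")
```

Accepting some uphill moves at high temperature is what lets the search escape poor local arrangements; a real pathway layout would add terms for edge crossings and label overlap.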
What is a good pathway graph?
• Every gene/protein name is easy to read
• Easy to trace connections between pairs of molecules
• Easy to read mechanism and result for each action
• Compact
• Shows tissue/stage/species/cell line specificity
• Beautiful
[Figure: Human Cell Cycle / Apoptosis Machinery, ~400 nodes, layered graph layout.]
Incompleteness of the graph: almost complete lack of feedback loops.
Scale-free networks in biology and outside; mechanism of stochastic birth of
scale-free networks
(collaboration with Shawn M. Gomez)
Motivation:
To understand and describe real networks we need to come up with a biologically sensible model that is capable of generating networks with properties close to those of real networks.
What is a scale-free network?
Geometry of a network: the frequency, p_k, of vertices having exactly k edges (k = 0, 1, 2, 3, 4, …). For directed edges, incoming (k_in) and outgoing (k_out) edges are counted separately.
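The definition of p_k translates directly into a counting exercise. The sketch below assumes the network is given as a list of directed (source, target) pairs; the function name and the tiny example graph are hypothetical.

```python
from collections import Counter


def degree_distributions(edges):
    """Return (p_k, p_k_in, p_k_out): for each, a dict mapping k to the
    fraction of vertices having exactly k (total, incoming, outgoing) edges."""
    nodes = {n for edge in edges for n in edge}
    k_in = Counter(dst for _, dst in edges)
    k_out = Counter(src for src, _ in edges)
    k_total = Counter()
    k_total.update(k_in)
    k_total.update(k_out)

    def frequencies(degree):
        # Vertices absent from `degree` have zero edges of that kind.
        counts = Counter(degree.get(v, 0) for v in nodes)
        return {k: c / len(nodes) for k, c in sorted(counts.items())}

    return frequencies(k_total), frequencies(k_in), frequencies(k_out)


# Made-up four-vertex network.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
p_k, p_k_in, p_k_out = degree_distributions(edges)
print(p_k, p_k_in, p_k_out)
```

Plotting these frequencies against k on log-log axes gives the kind of connectivity plot shown for the yeast network.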
Yeast edge distributions
[Figure: distribution of connectivity for the yeast protein network (DIP), log-log axes, P(k) versus edges (k). Series: outgoing and incoming edges, each with a regression line. $P(k) \sim k^{-\gamma}$, with fitted exponents γ = 1.97 and γ = 2.80.]
Network topology: Albert, et al. Nature 406, 378-382.
(Erdős–Rényi model: most nodes have approximately the same number of connections.)
There are other scale-free networks: actor collaboration, the power grid, and the World Wide Web (see www.nd.edu/~alb).
Scale-free connectivity: $p_k = c\,k^{-\gamma}$
• Scale-free networks are often characterized by a power law distribution
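A power law is a straight line on log-log axes, so a quick way to estimate γ and c is ordinary least squares on (log k, log p_k). The sketch below assumes that approach; the function name is hypothetical and the data are synthetic, generated from an exact power law rather than taken from any real network.

```python
import math


def fit_power_law(pk):
    """Least-squares line through (log10 k, log10 p_k).
    Returns (gamma, c) for the model p_k = c * k**(-gamma)."""
    points = [(math.log10(k), math.log10(p))
              for k, p in pk.items() if k > 0 and p > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, 10 ** intercept


# Synthetic check: data generated with gamma = 2.5 and c = 2.0.
synthetic = {k: 2.0 * k ** -2.5 for k in range(1, 50)}
gamma, c = fit_power_law(synthetic)
print(gamma, c)
```

On real, noisy degree counts a log-log regression is sensitive to the sparse tail, so the estimate should be read as a rough summary rather than a precise parameter.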
[Figures: actor collaboration, WWW, and power grid networks; WWW image by K. C. Claffy.]
Summary
Related concepts: Zipf’s laws
Network of English words
HAMLET
The body is with the king, but the king is not with the body. The king is a thing--
[Figure: network of the words in this passage, with nodes body, the, is, with, king, but, not, a, thing.]
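One plausible way to build such a word network is to link each word to the word that follows it. That construction is an assumption made here, as are the tokenization (lowercase, punctuation and sentence boundaries ignored) and the choice to count distinct neighbours; the talk does not spell out its exact definition of a connection.

```python
import re
from collections import defaultdict


def word_network(text):
    """Directed network: an edge from each word to the word that follows it.
    Returns (successors, predecessors) as dicts of sets."""
    words = re.findall(r"[a-z']+", text.lower())
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for a, b in zip(words, words[1:]):
        successors[a].add(b)
        predecessors[b].add(a)
    return successors, predecessors


passage = ("The body is with the king, but the king is not with the body. "
           "The king is a thing--")
successors, predecessors = word_network(passage)

# Number of distinct incoming connections per word, most connected first.
incoming = sorted(((len(p), w) for w, p in predecessors.items()), reverse=True)
print(incoming)
```

Even in this short passage "the" collects the most distinct predecessors, a small-scale hint of the heavy-tailed pattern seen over a whole play.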
Word frequencies in Shakespeare’s plays:
http://www.mta.link75.org/curriculum/english/shake/
Hamlet: “incoming connections”
The, 1101
And, 898
To, 726
Of, 657
I, 561
You, 544
My, 508
A, 498
In, 414
It, 414
That, 389
Is, 334
Not, 315
This, 296
His, 292
…
[Figures: log(number of vertices) versus log(number of connections per vertex) for three datasets: "Hamlet.txt" (γ = 0.706489, c = 4.939893), "AnthonyAndCleopatra.txt" (γ = 0.708199, c = 4.829506), and "Macbeth.txt" (γ = 0.741575, c = 4.814642).]
Metabolic networks in bacteria
There are networks that are not scale-free
Caenorhabditis elegans
“Some 300 of the 959 cells of the adult worm, for example, constitute a nervous system that can detect odor, taste, and respond to temperature and touch.”
[Figure: nematode neurons, number of vertices versus number of connections per vertex (linear axes).]
Not all biological networks are scale-free
A stochastic model of evolution of a molecular network
Molecular networks are composed of DNA, RNA, proteins, lipids, small molecules, …
(Over)Simplification: we assume that each node of the network has two domains (not necessarily distinct) responsible for the upstream and downstream connections.
A node of a molecular network
[Diagram: a node with an upstream domain and a downstream domain.]
Each substance has a pair of “domains,” upstream and downstream, which determine specific upstream and downstream network connections.
The model would not be appropriate if domain-rich proteins tend to have more network connections than
domain-poor proteins
Another way to put the same question: Is the number of
domains per protein correlated with the number of network
connections per protein?
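The question can be checked with a correlation coefficient. The sketch below computes Pearson's r from scratch; the per-protein numbers are invented purely to show the call and are not the yeast data, and the talk does not say which statistic was actually used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-protein values: number of domains and number of edges.
domains = [1, 2, 3, 4, 5, 6]
connections = [3, 1, 4, 2, 5, 1]
print(f"r = {pearson(domains, connections):.3f}")
```

A value of r near zero would support treating connectivity as independent of domain count; values near +1 would argue against the two-domain simplification.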
Connectivity is NOT affected by domain number
[Figure: number of edges (outgoing and incoming) versus number of domains per protein.]
Therefore, it is probably OK to use the “two-domain” model in the current setup.
[Diagram: a node with an upstream domain and a downstream domain.]
Evolutionary events affecting network growth
[Diagram: three kinds of events acting on nodes with upstream and downstream domains: (A) Duplication, (B) Innovation, (C) Replacement.]
We assume
that network rearrangements arrive independently according to a Poisson process…
…which leads to a system of ordinary differential equations corresponding to a continuous-time Markov model…
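The Poisson assumption can be illustrated with a small event simulation. Independent Poisson streams are equivalent to one stream with the summed rate, where each event's type is drawn in proportion to its rate. The rates below and, more importantly, the meaning given to each event in `grow` are assumptions made for this sketch; the talk's actual model is defined by its differential equations, which are not reproduced in the transcript.

```python
import random


def simulate_events(rates, t_max, seed=0):
    """Event times and types for independent Poisson processes up to t_max."""
    rng = random.Random(seed)
    names = list(rates)
    total_rate = sum(rates.values())
    t, events = 0.0, []
    while True:
        t += rng.expovariate(total_rate)      # exponential waiting time
        if t > t_max:
            return events
        kind = rng.choices(names, weights=[rates[n] for n in names])[0]
        events.append((t, kind))


def grow(initial_nodes, events, seed=0):
    """Apply events to a list of (upstream_domain, downstream_domain) nodes.
    ASSUMED semantics: duplication copies a random node, innovation adds a
    node with two new domains, replacement swaps one domain for a new one."""
    rng = random.Random(seed)
    nodes = list(initial_nodes)
    next_domain = 1 + max(d for pair in nodes for d in pair)
    for _, kind in events:
        if kind == "duplication":
            nodes.append(rng.choice(nodes))
        elif kind == "innovation":
            nodes.append((next_domain, next_domain + 1))
            next_domain += 2
        elif kind == "replacement":
            i = rng.randrange(len(nodes))
            up, down = nodes[i]
            nodes[i] = (next_domain, down) if rng.random() < 0.5 else (up, next_domain)
            next_domain += 1
    return nodes


# Hypothetical rates (events per unit time), chosen only for the example.
rates = {"duplication": 2.0, "innovation": 0.5, "replacement": 1.0}
initial_nodes = [(0, 1)]
events = simulate_events(rates, t_max=10.0)
final_nodes = grow(initial_nodes, events)
print(len(events), "events;", len(final_nodes), "nodes")
```

Repeating such runs and tabulating how often each domain occurs would give a Monte Carlo counterpart to the quantities the differential equations describe.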
Armed with resulting equations and programs, we can compute the topological properties of the
growing network for any (reasonable) combination of
parameter values
Data: signal transduction network in yeast
[Figure: distribution of connectivity for the yeast protein network (DIP), log-log axes, P(k) versus edges (k). Series: outgoing and incoming edges, each with a regression line. $P(k) \sim k^{-\gamma}$, with fitted exponents γ = 1.97 and γ = 2.80.]
We therefore can fit the model to the experimental data and estimate parameter values
Estimating parameters:
$$\sum_{k=1}^{N} \left(O_{k,in} - E_{k,in}\right)^2 + \sum_{k=1}^{N} \left(O_{k,out} - E_{k,out}\right)^2 \rightarrow \min.$$
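Read as a least-squares criterion, and assuming O and E denote the observed and model-expected frequencies of vertices with k incoming or outgoing edges, the objective is easy to compute. The sketch below pairs it with a crude grid search over a stand-in model, a normalized power law with one exponent per direction. That stand-in, the grid, and the synthetic "observations" are assumptions; the talk fits its own stochastic model.

```python
def power_law_pk(gamma, n):
    """Stand-in model: normalized p_k proportional to k**(-gamma), k = 1..n."""
    weights = [k ** -gamma for k in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def objective(o_in, e_in, o_out, e_out):
    """Sum of squared differences for incoming plus outgoing connections."""
    return (sum((o - e) ** 2 for o, e in zip(o_in, e_in))
            + sum((o - e) ** 2 for o, e in zip(o_out, e_out)))


def grid_fit(o_in, o_out, grid):
    """Try every pair of exponents on the grid; keep the pair with the
    smallest objective."""
    n = len(o_in)
    best = min((objective(o_in, power_law_pk(g_in, n),
                          o_out, power_law_pk(g_out, n)), g_in, g_out)
               for g_in in grid for g_out in grid)
    return best[1], best[2]


# Synthetic "observations" generated from two grid values, so the search
# should recover exactly those values.
grid = [1.0 + 0.1 * i for i in range(31)]
o_in = power_law_pk(grid[18], 20)
o_out = power_law_pk(grid[10], 20)
g_in, g_out = grid_fit(o_in, o_out, grid)
print(g_in, g_out)
```

Fitting both directions in one objective mirrors the "simultaneous" fit; fitting them separately amounts to minimizing each sum on its own.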
Fitting data for upstream and downstream connections simultaneously
[Figure: fitted model against the yeast data, log(proportion of vertices) versus log(number of connections per vertex); γ_in = 2.80 and γ_out = 1.97, each with its fitted model curve. Parameter values printed on the plot (several symbol names were lost in extraction): upstream (u): 0.000000; 18.638257; 43.313340; cu = -2.840558. Downstream (d): 298.902373; 18.638257; 43.313340; cd = -2.835264.]
Fitting data for upstream and downstream connections separately
[Figure: fitted model against the yeast data, log(proportion of vertices) versus log(number of connections per vertex); γ_in = 2.80 and γ_out = 1.97, each with its fitted model curve. Parameter values printed on the plot (several symbol names were lost in extraction): upstream (u): 0.000000; 8.581353; 20.042505; cu = -1.156492. Downstream (d): 0.000000; 13.167339; 16.230905; cd = -1.285549.]
Estimated parameters
[Figure: estimated parameters for replacement, duplication, and innovation.]
The analysis of the model suggests that the number of domain copies per genome
should also follow a power law
[Figure: frequency of domain appearances in E. coli (closed circles), S. cerevisiae (crosses), and H. sapiens (open rectangles); log-log axes, frequency versus number of domain occurrences per genome.]
How many distinct domains exist?
[Equation: the number of distinct domains, D, expressed through the number of genes, G, and the power-law constant, γ, with indices i and j running from 1 to V.]
For E. coli our estimate suggests the existence of D > 4,600 domains…
For yeast, D > 12,900 domains…
(The last!) overview of the talk
• Introduction: project participants & jigsaw puzzle analogy
• Project motivation.
• Duality of signal transduction language.
• How the whole system works.
• Good and ugly graphs.
---------------------------------------------
• Scale-free networks in biology and outside; stochastic birth of scale-free networks
Thank you!