snail montreal

Post on 16-Apr-2017

Category:

Technology


Unweaving regulatory networks: Automated extraction from literature and statistical analysis

Overview of the talk

• Introduction: project participants & jigsaw puzzle analogy

• Project motivation.

• Duality of signal transduction language.

• How the whole system works.

• Good and ugly graphs.

--------------------------------------------

• Scale-free networks in biology and outside; mechanism of stochastic birth of scale-free networks

Our project is set up as a collaboration of three departments of Columbia University

Interdisciplinary Collaboration:

Department of Medical Informatics, Columbia University (Carol Friedman, Pauline Kra, Michael Krauthammer, Yu Hong, Andrey Rzhetsky)

Department of Computer Science, Columbia University (Vasileios Hatzivassiloglou, Pablo Ariel Duboue, Wubin Weng)

Columbia Genome Center, Columbia University (Pavel Morozov, Tomohiro Koike, Shawn Gomez, Sabina Kaplan, Sergey Kalachikov, Jim Russo, Andrey Rzhetsky)

Studying living organisms is not unlike playing with a jigsaw puzzle…

Starting point: before sequence data were available

“Stamp collecting”: some regularities start to emerge...

Defining families of sequences

Beginning assembly of pieces: where we are now

Future


Our long-term objective:

develop computational tools for automated compilation and analysis of complex cell regulation cascades in vertebrates

Problem/Motivation:

A search through the PubMed system with the keywords “cell cycle” and “apoptosis” currently produces lists of 169,293 and 29,961 articles, respectively.

Clearly it is not feasible to scan all these papers “manually” ...

We decided

(i) to develop tools for automatic retrieval of binary regulatory relationships between molecules from research literature using techniques of natural language processing,

and

(ii) to use extracted knowledge for editing, visualization, and superimposing/comparing homologous networks.

We call the system

GENIES (GENomics Information Extraction System)

[Figure: system flowchart. Box labels:]

• Relevant keywords

• Retrieve collection of journal articles

• Natural language processing

• Save collection of statement/source pairs

• Filter statements, resolve controversies, eliminate redundancies

• Pathways: visualize, edit, simulate (in silico knock-out or knock-in of genes), compare

An overview of our system.

[Figure: processing pipeline]

I. Collection: keyword search (WWW); collect abstracts or articles

II. Preprocessing: regularize & tag components

III. Term extraction: identify & tag terms (GenBank, supplemental lexicon)

IV. Relationship extraction: identify binary relations (grammar)

V. Postprocessing: filter statements, resolve controversies, eliminate redundancies, include or exclude statements

Output: collection of "flat" files, used to visualize, edit, compare, and simulate regulatory networks

Application of techniques of Artificial Intelligence:

Natural Language Processing

Goal: to identify binary relationships of the form

“protein A activates protein B”

“protein B inactivates gene C”


The language of regulatory pathways differs significantly from the language of metabolic pathways

We represent a pathway as a series of overlapping “links” – substance/action/substance triplets

[Figure: representation of a pathway as a chain Substance A → Substance B → Substance C → Substance D]

Duality of actions in signal transduction literature

Logic representation (the same for every action below): "A activates B"

Biochemical representation and example, by action:

• phosphorylation: A is a catalyst (kinase); ATP → ADP; B (inactive) → B (active, phosphorylated). Example: A = PI3K, B = AKT/PKB

• dephosphorylation: A is a catalyst (phosphatase); PO4 is removed; B (inactive, phosphorylated) → B (active). Example: A = protein phosphatase 2A, B = FAS-activated serine/threonine kinase

• transport: A is a pump/channel; ATP → ADP; B moves between inside and outside. Example: A = Ca2+ pump ATPase, B = Ca2+

• cleavage: A is a catalyst (protease); B is cleaved into C + D, changing its activity state. Example: A = ICE, B = CPP32

• binding: A is a ligand, B is a receptor; A + B (inactive) → [AB] (active). Example: A = FAS-L, B = FAS

• transcription: A initiates transcription of B. Example: A = (C-Myc:Max) protein complex, B = cdc25A gene

• translation: A initiates translation of B. Example: A = eIF2B, B = virtually any gene

• process: A activates B through a process. Example: A = FAS-L, B = AKT/PKB

• other single action: A activates B through an action

We realized that the current research literature in molecular biology describes pathways on two different levels: logical and biochemical

Logical: A activates B; A inactivates B; ...

Biochemical: A phosphorylates B; A methylates B; ...

Dualism: in the biochemical representation substance A is not a participant of the action, while it is in the logical representation

Both logical and biochemical descriptions can be combined in the same sentence:

Activated raf-1 phosphorylates and activates mek-1.

(“phosphorylates” is the biochemical description; “activates” is the logical one)

The paper describing a “knowledge model” (=ontology) will appear in Bioinformatics

We represent a pathway as a series of overlapping “links” – substance/action/substance triplets

[Figure: chain Substance A → Substance B → Substance C → Substance D]

“Actions” are relatively few: one can provide an exhaustive list of them

Each action comes with a mechanism (biochemical representation) and result (logical representation)

Gene and protein names are numerous (currently >80,000) and the number is growing
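The triplet representation can be sketched in a few lines of Python. This is only an illustration, not the GENIES data model: the class name `Link`, its field names, and the helper `follow` are hypothetical, and the second link (mek-1 to "substance C") is invented purely to show how links overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One substance/action/substance triplet (hypothetical field names)."""
    upstream: str
    mechanism: str   # biochemical representation, e.g. "phosphorylation"
    result: str      # logical representation, e.g. "activation"
    downstream: str


def follow(links, start):
    """Walk overlapping links: the downstream substance of one link
    is the upstream substance of the next."""
    by_upstream = {link.upstream: link for link in links}
    path, seen = [start], {start}
    while path[-1] in by_upstream:
        nxt = by_upstream[path[-1]].downstream
        if nxt in seen:          # stop on a feedback loop
            break
        path.append(nxt)
        seen.add(nxt)
    return path


links = [
    # From the slide: "Activated raf-1 phosphorylates and activates mek-1."
    Link("raf-1", "phosphorylation", "activation", "mek-1"),
    # Invented link, for illustration only.
    Link("mek-1", "phosphorylation", "activation", "substance C"),
]
print(follow(links, "raf-1"))
```

Keeping mechanism and result as separate fields mirrors the duality described above: the same link can be read biochemically or logically.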

MedLEE (by Carol Friedman and colleagues) implements various grammatical patterns associated with the same verb:

A activates B…

A is an activator of B…

A appeared to activate B…

A is activating B…

MedLEE = Medical Language Extraction and Encoding System

It is an integral part of the Clinical Information Service at Columbia-Presbyterian Medical Center.

It routinely processes thousands of patient records a day.

MedLEE does semantic analysis of the complete sentence.

If a complete sentence cannot be parsed successfully, MedLEE does re-analysis, trying to extract parts.

For details see, e.g.,

Friedman, C., G. Hripcsak, W. DuMouchel, S.B. Johnson, and P.D. Clayton. 1995. Natural language processing in an operational clinical system. Natural Language Engineering. 1 (1): 83-108.
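To make the idea of several grammatical patterns mapping to one verb concrete, here is a toy sketch using regular expressions. It is not how MedLEE works (MedLEE uses a grammar and semantic analysis); the names `NAME`, `PATTERNS`, and `extract_activation` are hypothetical, and only the four surface forms listed on the slide are covered.

```python
import re

# A crude stand-in for a tagged gene/protein name (assumption).
NAME = r"([\w/-]+)"

# The four surface patterns from the slide, all meaning "A activates B".
PATTERNS = [
    re.compile(rf"^{NAME} activates {NAME}$"),
    re.compile(rf"^{NAME} is an activator of {NAME}$"),
    re.compile(rf"^{NAME} appeared to activate {NAME}$"),
    re.compile(rf"^{NAME} is activating {NAME}$"),
]


def extract_activation(sentence):
    """Return (A, "activate", B) if the sentence matches a known pattern."""
    text = sentence.strip().rstrip(".")
    for pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return (match.group(1), "activate", match.group(2))
    return None


print(extract_activation("raf-1 is an activator of mek-1."))
print(extract_activation("raf-1 binds mek-1."))
```

A pattern list like this breaks down quickly on real sentences, which is one reason a full parser with fallback re-analysis is used instead.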


Term identification:

To give you a feeling for how the complete pipeline works…

Consider a sentence from an actual Science article:

“rap1 functions as a negative regulator of Tcr-mediated il-2 gene transcription”

The NLP module (term markup + MedLEE) produces

[action, inactivate, [protein, rap1], [action, activate, [complex, T-cell receptor], [action, transcribe, [gene, gene encoding interleukin-2]]], [parsemode, mode1]]

which is then converted into “shorthand” notation:

transcribe gene encoding interleukin-2

Tcr activates transcribe&il-2

Rap1 inactivates Tcr&transcribe&il-2

(Slide annotations: “gene encoding interleukin-2” is a substance (gene); “transcribe” is an action; the second and third lines are actions on an action.)
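The conversion from the nested structure to shorthand statements can be sketched as a recursive walk. This is a guess at the general idea, not the project's converter: the function name `flatten` and the tuple encoding are assumptions, the `parsemode` element is omitted, and full names are used instead of the abbreviations (Tcr, il-2) that appear on the slide.

```python
def flatten(node):
    """Return (label, statements) for a nested term.

    A substance is ("protein" | "complex" | "gene", name).
    An action is ("action", verb, argument, ...), where an argument
    may itself be an action ("action on action").
    """
    if node[0] != "action":
        return node[1], []                 # a substance is labeled by its name
    verb, args = node[1], node[2:]
    labels, statements = [], []
    for arg in args:
        label, inner = flatten(arg)
        labels.append(label)
        statements.extend(inner)           # innermost statements come first
    if len(labels) == 1:                   # action on a single substance
        statements.append(f"{verb} {labels[0]}")
        return f"{verb}&{labels[0]}", statements
    agent, target = labels                 # agent acts on a target
    statements.append(f"{agent} {verb}s {target}")
    return f"{agent}&{target}", statements


parsed = (
    "action", "inactivate",
    ("protein", "rap1"),
    ("action", "activate",
     ("complex", "T-cell receptor"),
     ("action", "transcribe", ("gene", "gene encoding interleukin-2"))),
)
_, shorthand = flatten(parsed)
for line in shorthand:
    print(line)
```

The output has the same three-line shape as the shorthand on the slide, with each outer action referring to the inner one through an `&`-joined label.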

Which is then further converted to a format readable by our pathway visualization program

Protein{ Name{ "IL-3", } }

LogicalAction{ { UpstreamActionAgent { Protein{ Name{ "IL-3", } } }, DownstreamActionAgent { Complex{ Name{ "IL-3R" } } }, Result{ activation } } }

Complex{ Name{ "IL-3R" } Composition{ Protein{ Name{ "IL-3R alpha" } } Protein{ Name{ "IL-3R beta" } } }  }

Which is then visualized...

[Figure: example of an actual human regulatory network visualized. Node labels: IL-3, IL-3R, IGF1, IGF1R, IRS1, RAS, PI3-K, AKT/PKB, BAD, Bcl-XL, FAS-L, FAS, FADD/MORT, FLICE, ICE, CPP32, apoptosis, mitogen, Cyclin D1, pRb, E2F, Cyclin E, P53, P21, P16, P27, Cdk4, P107, C-Myc, Bin-1, Max, Mad, Cdc25A, Cdk2, cell proliferation.]


Drawing a complex graph is a separate problem of Computer Science.

We are using a Simulated Annealing Technique to find an optimum graph layout

What is a good pathway graph?

•Every gene/protein name is easy to read

•Easy to trace connections between pairs of molecules

•Easy to read mechanism and result for each action

•Compact

•Shows tissue/stage/species/cell line specificity

•Beautiful
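A minimal sketch of simulated annealing for graph layout follows. It is not the project's layout program: the energy function (short edges, no overlapping nodes), the cooling schedule, and every parameter value are assumptions chosen for brevity, and the criteria in the list above would each need their own energy term.

```python
import math
import random


def layout_energy(pos, edges, min_dist=1.0):
    """Lower is better: short edges and no overlapping nodes."""
    energy = sum(math.dist(pos[a], pos[b]) for a, b in edges)
    nodes = list(pos)
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            d = math.dist(pos[a], pos[b])
            if d < min_dist:                      # penalize crowded labels
                energy += 10.0 * (min_dist - d)
    return energy


def anneal_layout(nodes, edges, steps=5000, t_start=2.0, t_end=0.01,
                  size=10.0, seed=0):
    """Return (initial, best) layouts as dicts node -> (x, y)."""
    rng = random.Random(seed)
    pos = {n: (rng.uniform(0, size), rng.uniform(0, size)) for n in nodes}
    initial = dict(pos)
    energy = layout_energy(pos, edges)
    best, best_energy = dict(pos), energy
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / steps)
        node = rng.choice(nodes)
        old = pos[node]
        pos[node] = (old[0] + rng.gauss(0, t), old[1] + rng.gauss(0, t))
        new_energy = layout_energy(pos, edges)
        delta = new_energy - energy
        # Metropolis rule: always accept improvements, sometimes accept worse.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            energy = new_energy
            if energy < best_energy:
                best, best_energy = dict(pos), energy
        else:
            pos[node] = old                       # reject the move
    return initial, best


nodes = ["IL-3", "IL-3R", "FAS-L", "FAS", "PI3-K", "AKT/PKB"]
edges = [("IL-3", "IL-3R"), ("FAS-L", "FAS"), ("PI3-K", "AKT/PKB")]
initial, best = anneal_layout(nodes, edges)
print(layout_energy(initial, edges), layout_energy(best, edges))
```

Accepting occasional uphill moves at high temperature is what lets the search escape poor local arrangements before the layout settles.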

Human Cell Cycle / Apoptosis Machinery

~400 nodes

Layered graph layout

Incompleteness of the graph: almost complete lack of feedback loops


Scale-free networks in biology and outside; mechanism of stochastic birth of

scale-free networks

(collaboration with Shawn M. Gomez)

Motivation:

To understand and describe real networks we need to come up with a biologically sensible model that is capable of generating networks with properties close to those of real networks

What is a scale-free network?

[Figure: example vertices with k = 0, 1, 2, 3, 4, … edges; for directed networks, incoming edges (k_in) and outgoing edges (k_out) are counted separately]

Geometry of a network: frequency, $p_k$, of vertices having exactly k edges
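As a small illustration of this definition, the frequency of vertices with exactly k edges can be computed from an edge list. The function name `degree_distribution` and the toy edge list are hypothetical.

```python
from collections import Counter


def degree_distribution(edges, direction="out"):
    """Return {k: p_k}, the fraction of vertices with exactly k
    outgoing (or incoming) edges in a directed edge list."""
    nodes = {node for edge in edges for node in edge}
    index = 0 if direction == "out" else 1
    degree = Counter(edge[index] for edge in edges)
    counts = Counter(degree.get(node, 0) for node in nodes)
    return {k: counts[k] / len(nodes) for k in sorted(counts)}


# Toy network: A regulates B, C and D.
edges = [("A", "B"), ("A", "C"), ("A", "D")]
print(degree_distribution(edges, "out"))
print(degree_distribution(edges, "in"))
```

Vertices with no edges in the chosen direction are counted under k = 0, so the frequencies always sum to one.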

Yeast edge distributions

[Figure: Distribution of Connectivity for Yeast Protein Network (DIP); log-log plot of P(k) against edges (k), showing outgoing and incoming data with their regression lines]

$P(k) \sim k^{-\gamma}$, with $\gamma = 1.97$ (outgoing) and $\gamma = 2.80$ (incoming)

Network topology (Albert et al., Nature 406, 378-382)

Erdős–Rényi model: most nodes have approximately the same number of connections

There are other scale-free networks: actor collaboration, power grid, World Wide Web (see www.nd.edu/~alb)

Scale-free connectivity

$p_k = c\,k^{-\gamma}$

• Scale-free networks are often characterized by a power law distribution

[Figures: actor collaboration; WWW (by K. C. Claffy); power grid]
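One common way to estimate the power-law constants is a straight-line fit on log-log axes, since log p_k = log c - γ log k. The sketch below assumes that approach; the slides do not say how the regression lines were computed, and the function name `fit_power_law` is hypothetical.

```python
import math


def fit_power_law(ks, ps):
    """Least-squares line through (log k, log p_k).
    Returns (gamma, c) for the model p_k = c * k**(-gamma)."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(p) for p in ps]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic check: points generated from an exact power law.
ks = list(range(1, 50))
ps = [0.5 * k ** -2.0 for k in ks]
gamma, c = fit_power_law(ks, ps)
print(gamma, c)
```

On real data the fit is sensitive to the sparse high-k tail, so values of k with zero observed vertices have to be dropped or binned before taking logarithms.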

Related concepts: Zipf’s laws

Network of English words

HAMLET

The body is with the king, but the king is not with the body. The king is a thing--

[Figure: word network built from this passage, with vertices the, body, is, with, king, but, not, a, thing]
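A word network of this kind can be sketched as follows, under the assumption that each word is a vertex and each pair of consecutive words is a directed edge. The slides do not spell out the construction, so this definition and the function names are guesses.

```python
import re
from collections import Counter


def word_network(text):
    """Directed edges between consecutive words (assumed construction)."""
    words = re.findall(r"[a-z']+", text.lower())
    return list(zip(words, words[1:]))


def incoming_counts(edges):
    """Number of incoming connections per word, counting repeats."""
    return Counter(target for _, target in edges)


passage = ("The body is with the king, but the king is not with the body. "
           "The king is a thing--")
incoming = incoming_counts(word_network(passage))
print(incoming.most_common(3))
```

Under this construction a word's incoming count is its number of occurrences, minus one if it opens the text, which is why such counts track word frequencies closely.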

Word frequencies in Shakespeare’s plays:

http://www.mta.link75.org/curriculum/english/shake/

Hamlet: “incoming connections”

The, 1101

And, 898

To, 726

Of, 657

I, 561

You, 544

My, 508

A, 498

In, 414

It, 414

That, 389

Is, 334

Not, 315

This, 296

His, 292

[Figures: log-log plots of log(number of vertices) against log(number of connections per vertex) for three plays, with fitted constants as printed on the slides]

Dataset: "Hamlet.txt"; g=0.706489; c=4.939893

Dataset: "AnthonyAndCleopatra.txt"; g=0.708199; c=4.829506

Dataset: "Macbeth.txt"; g=0.741575; c=4.814642

Metabolic networks in bacteria

There are networks that are not scale-free

Caenorhabditis elegans

“Some 300 of the 959 cells of the adult worm, for example, constitute a nervous system that can detect odor, taste, and respond to temperature and touch.”

[Figure: Nematode neurons; number of vertices against number of connections per vertex]

Not all biological networks are scale-free


A stochastic model of evolution of a molecular network

Molecular networks are composed of DNA, RNA, proteins, lipids, small molecules, …

(Over)simplification: we assume that each node of the network has two domains (not necessarily distinct) responsible for the upstream and downstream connections

[Figure: a node of a molecular network, with an upstream and a downstream domain]

Each substance has a pair of “domains,” upstream and downstream, which determine specific upstream and downstream network connections.

The model would not be appropriate if domain-rich proteins tended to have more network connections than domain-poor proteins

Another way to put the same question: Is the number of

domains per protein correlated with the number of network

connections per protein?

Connectivity is NOT affected by domain number

[Figure: number of edges (outgoing and incoming) against number of domains]

Therefore, it is probably OK to use the “two-domain” model in the current setup

Evolutionary events affecting network growth

[Figure: three kinds of events acting on upstream and downstream domains: (A) Duplication, (B) Innovation, (C) Replacement]

We assume that network rearrangements arrive independently according to a Poisson process…

…which leads to a system of ordinary differential equations corresponding to a continuous-time Markov model…

Armed with the resulting equations and programs, we can compute the topological properties of the growing network for any (reasonable) combination of parameter values
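The Poisson assumption can be illustrated with a short event simulation. This sketch only draws the timing and type of events; it does not apply them to a network or reproduce the model's differential equations, and the rates and the function name `simulate_events` are hypothetical.

```python
import random


def simulate_events(rates, t_max, seed=0):
    """Superpose independent Poisson processes, one per event type.

    Waiting times between events are exponential with the total rate,
    and each event's type is drawn in proportion to its own rate.
    """
    rng = random.Random(seed)
    kinds = list(rates)
    weights = [rates[kind] for kind in kinds]
    total = sum(weights)
    t, events = 0.0, []
    while True:
        t += rng.expovariate(total)
        if t > t_max:
            return events
        events.append((t, rng.choices(kinds, weights=weights)[0]))


# Hypothetical rates, in events per unit time.
rates = {"duplication": 1.0, "innovation": 0.5, "replacement": 2.0}
events = simulate_events(rates, t_max=10.0)
print(len(events), events[:3])
```

In a full simulation each sampled event would modify the upstream or downstream domains of a node; averaging over many such runs is what the differential equations describe analytically.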

Data: signal transduction network in yeast

[Figure: Distribution of Connectivity for Yeast Protein Network (DIP); log-log plot of P(k) against edges (k), showing outgoing and incoming data with their regression lines]

$P(k) \sim k^{-\gamma}$, with $\gamma = 1.97$ (outgoing) and $\gamma = 2.80$ (incoming)

We therefore can fit the model to the experimental data and estimate parameter values

Estimating parameters:

$$\sum_{k=1}^{N} \left(O_{k,\mathrm{in}} - E_{k,\mathrm{in}}\right)^2 + \sum_{k=1}^{N} \left(O_{k,\mathrm{out}} - E_{k,\mathrm{out}}\right)^2 \to \min.$$
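The fitting criterion is a sum of squared differences over incoming and outgoing connection counts, minimized over the model parameters. The sketch below evaluates such a criterion, reading O as observed and E as model-expected frequencies (an assumption, since the slide does not define them); the function name and the numbers are invented for illustration.

```python
def fit_objective(obs_in, exp_in, obs_out, exp_out):
    """Sum of squared differences between observed (O) and expected (E)
    frequencies, for incoming and outgoing connections together."""
    squared_in = sum((o - e) ** 2 for o, e in zip(obs_in, exp_in))
    squared_out = sum((o - e) ** 2 for o, e in zip(obs_out, exp_out))
    return squared_in + squared_out


# Invented frequencies for k = 1, 2, 3.
value = fit_objective(
    obs_in=[0.6, 0.3, 0.1], exp_in=[0.5, 0.3, 0.2],
    obs_out=[0.7, 0.2, 0.1], exp_out=[0.7, 0.2, 0.1],
)
print(value)
```

Parameter estimation then amounts to searching for the parameter values whose expected frequencies make this value as small as possible.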

Fitting data for upstream and downstream connections simultaneously

[Figure: log-log plot of proportion of vertices against number of connections per vertex, showing the data ($\gamma_{in}$ = 2.80, $\gamma_{out}$ = 1.97) and the fitted model for each. Fitted values as printed on the slide, with some parameter symbols lost in extraction:]

u=0.000000; =18.638257; =43.313340; cu=-2.840558;

d=298.902373; =18.638257; =43.313340; cd=-2.835264.

Fitting data for upstream and downstream connections separately

[Figure: log-log plot of proportion of vertices against number of connections per vertex, showing the data ($\gamma_{in}$ = 2.80, $\gamma_{out}$ = 1.97) and the fitted model for each. Fitted values as printed on the slide:]

u=0.000000; u=8.581353; u=20.042505; cu=-1.156492;

d=0.000000; d=13.167339; d=16.230905; cd=-1.285549.

Estimated parameters

[Figure: estimated parameters for Replacement, Duplication, and Innovation]

The analysis of the model suggests that the number of domain copies per genome should also follow a power law

[Figure: frequency against number of domain occurrences per genome, log-log axes. Frequency of domain appearances in E. coli (closed circles), S. cerevisiae (crosses), and H. sapiens (open rectangles).]

How many distinct domains exist?

[Equation not recoverable from the transcript: it gives the number of distinct domains, D, in terms of the number of genes, G, and the power law constant, $\gamma$]

For E. coli our estimate suggests existence of D > 4,600 domains…

For yeast D > 12,900 domains…

(The last!) overview of the talk

• Introduction: project participants & jigsaw puzzle analogy

• Project motivation.

•Duality of signal transduction language.

• How the whole system works.

• Good and ugly graphs.

---------------------------------------------

• Scale-free networks in biology and outside; stochastic birth of scale-free networks

Thank you!
