Unweaving regulatory networks: Automated extraction from literature and statistical analysis

Upload: nirmala-last

Post on 16-Apr-2017


Page 1: Snail Montreal

Unweaving regulatory networks: Automated

extraction from literature and statistical analysis

Page 2: Snail Montreal

Overview of the talk

• Introduction: project participants & jigsaw puzzle analogy

• Project motivation.

• Duality of signal transduction language.

• How the whole system works.

• Good and ugly graphs.

--------------------------------------------

• Scale-free networks in biology and outside; mechanism of stochastic birth of scale-free networks

Page 3: Snail Montreal

Our project is set up as a collaboration of three departments of Columbia University

Page 4: Snail Montreal

Interdisciplinary Collaboration:

Department of Medical Informatics, Columbia University (Carol Friedman, Pauline Kra, Michael Krauthammer, Yu Hong, Andrey Rzhetsky)

Department of Computer Science, Columbia University (Vasileios Hatzivassiloglou, Pablo Ariel Duboue, Wubin Weng)

Columbia Genome Center, Columbia University (Pavel Morozov, Tomohiro Koike, Shawn Gomez, Sabina Kaplan, Sergey Kalachikov, Jim Russo, Andrey Rzhetsky)

Page 5: Snail Montreal

Studying living organisms

is not unlike

playing with a jigsaw puzzle…

Page 6: Snail Montreal

Starting point: before sequence data were available

Page 7: Snail Montreal

“Stamp collecting”: some regularities start to emerge...

Page 8: Snail Montreal

Defining families of sequences

Page 9: Snail Montreal

Beginning assembly of pieces: where we are now

Page 10: Snail Montreal

Future

Page 11: Snail Montreal

Overview of the talk

• Introduction: project participants & jigsaw puzzle analogy

• Project motivation.

• Duality of signal transduction language.

• How the whole system works.

• Good and ugly graphs.

--------------------------------------------

• Scale-free networks in biology and outside; mechanism of stochastic birth of scale-free networks

Page 12: Snail Montreal

Our long-term objective:

develop computational tools for automated compilation and analysis of complex cell

regulation cascades in vertebrates

Page 13: Snail Montreal

Problem/Motivation:

Currently, a search through the PubMed system with the keywords “cell cycle” and “apoptosis” produces lists of 169,293 and 29,961 articles, respectively.

Clearly it is not feasible to scan all these papers “manually” ...

Page 14: Snail Montreal

We decided

(i) to develop tools for automatic retrieval of binary regulatory relationships between molecules from research literature using techniques of natural language processing,

and

(ii) to use extracted knowledge for editing, visualization, and superimposing/comparing homologous networks.

Page 15: Snail Montreal

We call the system

GENIES (GENomics Information Extraction System)

Page 16: Snail Montreal

[Figure: An overview of our system. Boxes: Relevant keywords; Retrieve collection of journal articles; Natural Language Processing; Save collection of statement/source pairs; Filter statements, resolve controversies, eliminate redundancies; Pathways; Visualize; Edit; Simulate; Compare; In silico knock-out or knock-in genes.]

Page 17: Snail Montreal

[Figure: the processing pipeline. Stages: I. Collection; II. Preprocessing; III. Term extraction; IV. Relationship extraction; V. Postprocessing. Boxes: WWW; Keyword search; Collect abstracts or articles; Regularize & tag components; Identify & tag terms; GenBank; Supplemental lexicon; Identify binary relation; Grammar; Collection of "flat" files; Filter statements, resolve controversies, eliminate redundancies, include or exclude statements; Visualize, edit, compare, simulate regulatory networks.]

Application of techniques of Artificial Intelligence:

Natural Language Processing

Goal: to identify binary relationships of the form

“protein A activates protein B”

“protein B inactivates gene C”

Page 18: Snail Montreal

Overview of the talk

• Introduction: project participants & jigsaw puzzle analogy

• Project motivation.

• Duality of signal transduction language.

• How the whole system works.

• Good and ugly graphs.

---------------------------------------------

• Scale-free networks in biology and outside; mechanism of stochastic birth of scale-free networks

Page 19: Snail Montreal

The language of regulatory pathways differs significantly

from the language of metabolic pathways

Page 20: Snail Montreal

We represent a pathway as a series of overlapping “links”:

substance/action/substance triplets

Substance A Substance B Substance C Substance D

Representation

Page 21: Snail Montreal

Duality of actions in signal transduction literature

Page 22: Snail Montreal

[Table: logical representation, biochemical representation, and example for each action. In the logical representation every action reads "A activates B".]

phosphorylation (A is a catalyst, a kinase): A = PI 3K, B = AKT/PKB
dephosphorylation (A is a catalyst, a phosphatase): A = protein phosphatase 2A, B = FAS-activated serine/threonine kinase
transport (A is a pump/channel; B moves between inside and outside): A = Ca2+ pump ATPase, B = Ca2+
cleavage (A is a catalyst, a protease): A = ICE, B = CPP32
binding (A is a ligand, B is a receptor): A = FAS-L, B = FAS
transcription (A initiates transcription of B): A = (C-Myc:Max) protein complex, B = cdc25A gene
translation (A initiates translation of B): A = eIF2B, B = virtually any gene
process (A activates B through a process): A = FAS-L, B = AKT/PKB
other single action (A activates B through an action)

We realized that the current research literature in molecular biology describes pathways on two different levels: logical and biochemical.

Page 23: Snail Montreal

Logical: A activates B; A inactivates B; ...

Biochemical: A phosphorylates B; A methylates B; ...

Page 24: Snail Montreal

Dualism: in the biochemical representation substance A is not a participant in the

action, while in the logical representation it is

Logical Biochemical

Page 25: Snail Montreal

Both logical and biochemical descriptions can be combined in the same sentence:

Activated raf-1 phosphorylates (biochemical) and activates (logical) mek-1.

Page 26: Snail Montreal

The paper describing a “knowledge model” (= ontology) will appear in

Bioinformatics

Page 27: Snail Montreal

Ontology paper

Page 28: Snail Montreal

We represent a pathway as a series of overlapping “links”:

substance/action/substance triplets

Substance A Substance B Substance C Substance D

Page 29: Snail Montreal

“Actions” are relatively few: one can provide an exhaustive

list of them

Page 30: Snail Montreal

Each action comes with a mechanism (biochemical

representation) and result (logical representation)
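As an illustration only, a link with its mechanism and result could be held in a small record like the one below. The class name `Link`, its field names, and the helper `is_chain` are hypothetical and are not taken from GENIES; the raf-1/mek-1 example is the sentence quoted earlier in the talk.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Link:
    """One substance/action/substance triplet (hypothetical layout)."""
    upstream: str                    # substance A
    downstream: str                  # substance B
    result: str                      # logical representation, e.g. "activate"
    mechanism: Optional[str] = None  # biochemical representation, if stated


def is_chain(links: Sequence[Link]) -> bool:
    """True if consecutive links overlap: each downstream is the next upstream."""
    return all(a.downstream == b.upstream for a, b in zip(links, links[1:]))


# "Activated raf-1 phosphorylates and activates mek-1."
raf_mek = Link("raf-1", "mek-1", result="activate", mechanism="phosphorylation")

# A pathway as overlapping links over placeholder substances A..D.
pathway = [
    Link("Substance A", "Substance B", "activate"),
    Link("Substance B", "Substance C", "inactivate"),
    Link("Substance C", "Substance D", "activate"),
]
```

Keeping the mechanism optional reflects that a sentence may give only the logical level.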

Page 31: Snail Montreal

Gene and protein names are numerous (currently >80,000) and

the number is growing

Page 32: Snail Montreal

MedLEE (by Carol Friedman and colleagues) implements various grammatical patterns associated

with the same verb:

A activates B…

A is an activator of B…

A appeared to activate B…

A is activating B…
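To make the idea concrete, here is a toy sketch that maps the four surface patterns above onto one normalized triplet. It uses plain regular expressions, which is an assumption made for brevity: MedLEE itself is a grammar-based parser, and the names `PATTERNS` and `extract_activation` are invented for this example.

```python
import re

# A very rough stand-in for a gene/protein name (assumption).
NAME = r"([A-Za-z][\w-]*)"

# Four surface forms of the same verb, all normalized to "activate".
PATTERNS = [
    re.compile(rf"{NAME} activates {NAME}"),
    re.compile(rf"{NAME} is an activator of {NAME}"),
    re.compile(rf"{NAME} appeared to activate {NAME}"),
    re.compile(rf"{NAME} is activating {NAME}"),
]


def extract_activation(sentence):
    """Return (A, "activate", B) for the first matching pattern, else None."""
    for pattern in PATTERNS:
        match = pattern.search(sentence)
        if match:
            return (match.group(1), "activate", match.group(2))
    return None
```

A flat pattern list like this breaks down quickly on real sentences (coordination, nested actions, negation), which is the reason to use a full grammar instead.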

Page 33: Snail Montreal

MedLEE=Medical Language Extraction and Encoding System

It is an integral part of the Clinical Information Service at Columbia-Presbyterian Medical Center.

It routinely processes thousands of patient records a day.

MedLEE does semantic analysis of the complete sentence.

If a complete sentence cannot be parsed successfully, MedLEE re-analyzes it, trying to extract parts.

Page 34: Snail Montreal

For details see, e.g.,

Friedman, C., G. Hripcsak, W. DuMouchel, S.B. Johnson, and P.D. Clayton. 1995. Natural language processing in an operational clinical system. Natural Language Engineering. 1 (1): 83-108.

Page 35: Snail Montreal

Overview of the talk

• Introduction: project participants & jigsaw puzzle analogy

• Project motivation.

• Duality of signal transduction language.

• How the whole system works.

• Good and ugly graphs.

---------------------------------------------

• Scale-free networks in biology and outside; mechanism of stochastic birth of scale-free networks

Page 36: Snail Montreal

Term identification:

Page 37: Snail Montreal

To give you a feeling for how the complete conveyor line works…

Page 38: Snail Montreal

Consider a sentence from an actual Science article:

“rap1 functions as a negative regulator of Tcr-mediated il-2 gene transcription”

Page 39: Snail Montreal

NLP module (term markup + MedLEE) produces

[action, inactivate, [protein, rap1], [action, activate, [complex, T-cell receptor], [action, transcribe, [gene, gene encoding interleukin-2]]], [parsemode, mode1]]

Page 40: Snail Montreal

Which is then converted into “shorthand” notation

transcribe gene encoding interleukin-2

Tcr activates transcribe&il-2

Rap1 inactivates Tcr&transcribe&il-2

(Labels on the slide: Action; Substance (gene); Action on action)
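The conversion from the nested parse to shorthand can be sketched as a small recursive walk. This is an assumption about how such a flattening might work, not the GENIES code: the tuple encoding, the function name `flatten`, and the use of abbreviated names (`Tcr`, `il-2`) in the input are all choices made for this example.

```python
def flatten(node, out):
    """Append one shorthand line per action to `out`; return a key for `node`.

    A node is a tuple: ("protein" | "gene" | "complex", name) for a substance,
    or ("action", verb, arg, ...) for an action with one or two arguments.
    """
    if node[0] != "action":
        return node[1]                      # substance: its name is its key
    verb, args = node[1], node[2:]
    keys = [flatten(arg, out) for arg in args]
    if len(keys) == 1:                      # action on a substance, no agent
        out.append(f"{verb} {keys[0]}")
        return f"{verb}&{keys[0]}"
    agent, target = keys                    # agent acting on a substance/action
    out.append(f"{agent} {verb}s {target}")
    return f"{agent}&{target}"


parse = ("action", "inactivate",
         ("protein", "Rap1"),
         ("action", "activate",
          ("complex", "Tcr"),
          ("action", "transcribe", ("gene", "il-2"))))

lines = []
flatten(parse, lines)
# lines == ["transcribe il-2",
#           "Tcr activates transcribe&il-2",
#           "Rap1 inactivates Tcr&transcribe&il-2"]
```

Joining keys with `&` lets an "action on action" refer to the inner action by name, as in the shorthand above.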

Page 41: Snail Montreal

Which is then further converted to a format readable by our pathway

visualization program

Protein{ Name{ "IL-3", } }  

LogicalAction{ { UpstreamActionAgent { Protein{ Name{ "IL-3", } } }, DownstreamActionAgent { Complex{ Name{ "IL-3R" } } }, Result{ activation } } }

Complex{ Name{ "IL-3R" } Composition{ Protein{ Name{ "IL-3R alpha" } } Protein{ Name{ "IL-3R beta" } } }  }

Page 42: Snail Montreal

Which is then visualized...

Page 43: Snail Montreal

[Figure: Example of an actual human regulatory network, visualized. Node labels: IL-3, IL-3R, IGF1, IGF1R, IRS1, RAS, PI 3-K, AKT/PKB, BAD, Bcl-XL, FAS-L, FAS, FADD/MORT, FLICE, ICE, CPP32, apoptosis, mitogen, Cyclin D1, pRb, E2F, Cyclin E, P53, P21, P16, P27, Cdk4, P107, C-Myc, Bin-1, Max, Mad, Cdc25A, Cdk2, cell proliferation.]

Page 44: Snail Montreal

Corresponding article

Page 45: Snail Montreal

Overview of the talk

• Introduction: project participants & jigsaw puzzle analogy

• Project motivation.

• Duality of signal transduction language.

• How the whole system works.

• Good and ugly graphs.

---------------------------------------------

• Scale-free networks in biology and outside; mechanism of stochastic birth of scale-free networks

Page 46: Snail Montreal

Drawing a complex graph is a separate problem of Computer Science.

We are using a Simulated Annealing Technique to find an optimum graph layout
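A minimal sketch of simulated annealing for graph layout follows. The energy function (squared edge lengths plus a pairwise repulsion term), the linear cooling schedule, and all parameter values are assumptions chosen for illustration; the talk does not specify the actual objective.

```python
import math
import random


def energy(pos, edges):
    """Lower is better: short edges, but nodes kept apart (assumed objective)."""
    e = sum(math.dist(pos[a], pos[b]) ** 2 for a, b in edges)
    nodes = list(pos)
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            e += 1.0 / (math.dist(pos[a], pos[b]) + 1e-9)
    return e


def anneal_layout(nodes, edges, steps=5000, t0=1.0, seed=0):
    """Return (best positions, best energy) found by simulated annealing."""
    rng = random.Random(seed)
    pos = {n: (rng.random(), rng.random()) for n in nodes}
    cur = energy(pos, edges)
    best, best_e = dict(pos), cur
    for step in range(steps):
        t = t0 * (1 - step / steps) + 1e-6          # linear cooling
        n = rng.choice(nodes)
        old = pos[n]
        pos[n] = (old[0] + rng.gauss(0, 0.1), old[1] + rng.gauss(0, 0.1))
        new = energy(pos, edges)
        # Always accept improvements; accept worse moves with Boltzmann odds.
        if new <= cur or rng.random() < math.exp((cur - new) / t):
            cur = new
            if cur < best_e:
                best, best_e = dict(pos), cur
        else:
            pos[n] = old                            # reject: undo the move
    return best, best_e
```

Readability criteria such as label overlap or edge crossings would enter as extra terms in `energy`.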

Page 47: Snail Montreal

What is a good pathway graph?

•Every gene/protein name is easy to read

•Easy to trace connections between pairs of molecules

•Easy to read mechanism and result for each action

•Compact

•Shows tissue/stage/species/cell line specificity

•Beautiful

Page 48: Snail Montreal

Human Cell Cycle /

Apoptosis Machinery

Page 49: Snail Montreal

~400 nodes

Page 50: Snail Montreal

Layered graph layout

Page 51: Snail Montreal

Incompleteness of the graph: almost complete lack of feedback

loops

Page 52: Snail Montreal

Overview of the talk

• Introduction: project participants & jigsaw puzzle analogy

• Project motivation.

• Duality of signal transduction language.

• How the whole system works.

• Good and ugly graphs.

---------------------------------------------

• Scale-free networks in biology and outside; stochastic birth of scale-free networks

Page 53: Snail Montreal

Scale-free networks in biology and outside; mechanism of stochastic birth of

scale-free networks

(collaboration with Shawn M. Gomez)

Page 54: Snail Montreal

Motivation:

To understand and describe real networks we need to come up

with biologically sensible model that is capable of generating

networks with properties close to those of real networks

Page 55: Snail Montreal

What is a scale-free network?

Page 56: Snail Montreal

[Figure: example vertices with k = 0, 1, 2, 3, 4, … edges, and with k_in = 0, 1, 2, 3, 4, … incoming and k_out = 0, 1, 2, 3, 4, … outgoing edges.]

Geometry of a network: frequency, p_k, of vertices having exactly k edges
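As a small worked example, the frequencies p_k for incoming and outgoing edges can be computed from a directed edge list as below. The function name `degree_distribution` and the toy network are hypothetical.

```python
from collections import Counter


def degree_distribution(nodes, edges):
    """Return (p_in, p_out): dicts mapping k to the fraction of vertices
    with exactly k incoming (resp. outgoing) edges."""
    out_deg = Counter(a for a, _ in edges)
    in_deg = Counter(b for _, b in edges)
    n = len(nodes)
    # Counter returns 0 for vertices with no edges of that kind.
    p_in = Counter(in_deg[v] for v in nodes)
    p_out = Counter(out_deg[v] for v in nodes)
    return ({k: c / n for k, c in p_in.items()},
            {k: c / n for k, c in p_out.items()})


# Toy directed network: A -> B, A -> C, B -> C
p_in, p_out = degree_distribution(["A", "B", "C"],
                                  [("A", "B"), ("A", "C"), ("B", "C")])
```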

Page 57: Snail Montreal

Yeast edge distributions

[Figure: Distribution of Connectivity for Yeast Protein Network (DIP). Log-log plot of P(k) versus Edges (k), with P(k) ~ k^(-γ). Series: Outgoing, Outgoing Regr., Incoming, Incoming Regr.; γ = 1.97 (outgoing) and γ = 2.80 (incoming).]
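A regression line on a log-log plot corresponds to an ordinary least-squares fit of log P(k) against log k. The sketch below shows that calculation; the function name `fit_power_law` is made up here, and this simple regression is only one of several ways to estimate γ.

```python
import math


def fit_power_law(ks, ps):
    """Fit P(k) = c * k**(-gamma) by least squares in log-log space.

    Returns (gamma, c). Requires all k > 0 and all P(k) > 0.
    """
    xs = [math.log(k) for k in ks]
    ys = [math.log(p) for p in ps]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic data that follows an exact power law with gamma = 2, c = 0.5.
ks = [1, 2, 4, 8, 16]
gamma, c = fit_power_law(ks, [0.5 * k ** -2 for k in ks])
```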

Page 58: Snail Montreal

Network topology (Albert et al., Nature 406, 378-382)

(Erdős–Rényi model: most nodes have approximately the same number of connections)

Page 59: Snail Montreal

There are other scale-free networks

Page 60: Snail Montreal

References

Page 61: Snail Montreal

References

Page 62: Snail Montreal

Actor collaboration

Power grid

World Wide Web

Page 63: Snail Montreal

Actor collaboration

(see www.nd.edu/~alb)

Page 64: Snail Montreal
Page 65: Snail Montreal

Scale-free connectivity

$p_k = c\,k^{-\gamma}$

• Scale-free networks are often characterized by a power law distribution

Actor collaboration

WWW

Power grid

Page 66: Snail Montreal

WWW By K. C. Claffy

Page 67: Snail Montreal

Summary

Page 68: Snail Montreal

Related concepts: Zipf’s laws

Page 69: Snail Montreal

Network of English words

HAMLET

The body is with the king, but the king is not with the body. The king is a thing--

[Figure: network of the words in the quotation; nodes: the, body, is, with, king, but, not, a, thing.]
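One way to build such a word network is sketched below. It assumes that each word is linked to the word that follows it and that an "incoming connection" is a distinct preceding word. Neither assumption is stated in the talk, and the function name `incoming_connections` is hypothetical.

```python
import re
from collections import defaultdict


def incoming_connections(text):
    """Map each word to the number of distinct words that directly precede it."""
    words = re.findall(r"[a-z']+", text.lower())
    predecessors = defaultdict(set)
    for prev, nxt in zip(words, words[1:]):
        predecessors[nxt].add(prev)
    return {word: len(prevs) for word, prevs in predecessors.items()}


quote = ("The body is with the king, but the king is not with the body. "
         "The king is a thing--")
counts = incoming_connections(quote)
# "the" is preceded by "with", "but" and "body", so counts["the"] == 3
```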

Page 70: Snail Montreal

Word frequencies in Shakespeare’s plays:

http://www.mta.link75.org/curriculum/english/shake/

Page 71: Snail Montreal

Hamlet: “incoming connections”

The, 1101

And, 898

To, 726

Of, 657

I, 561

You, 544

My, 508

A, 498

In, 414

It, 414

That, 389

Is, 334

Not, 315

This, 296

His, 292

Page 72: Snail Montreal

[Figure: Dataset: "Hamlet.txt"; γ = 0.706489; c = 4.939893. Axes: log(number of connections per vertex) versus log(number of vertices).]

Page 73: Snail Montreal

[Figure: Dataset: "AnthonyAndCleopatra.txt"; γ = 0.708199; c = 4.829506. Axes: log(number of connections per vertex) versus log(number of vertices).]

Page 74: Snail Montreal

[Figure: Dataset: "Macbeth.txt"; γ = 0.741575; c = 4.814642. Axes: log(number of connections per vertex) versus log(number of vertices).]

Page 75: Snail Montreal

Metabolic networks in bacteria

Page 76: Snail Montreal
Page 77: Snail Montreal

There are networks that are not scale-free

Page 78: Snail Montreal

Caenorhabditis elegans

“Some 300 of the 959 cells of the adult worm, for example, constitute a nervous system that can detect odor, taste, and respond to temperature and touch.”

Page 79: Snail Montreal

[Figure: Nematode neurons. Axes: number of connections per vertex versus number of vertices.]

Not all biological networks are scale-free

Page 80: Snail Montreal

Overview of the talk

• Introduction: project participants & jigsaw puzzle analogy

• Project motivation.

• Duality of signal transduction language.

• How the whole system works.

• Good and ugly graphs.

---------------------------------------------

• Scale-free networks in biology and outside; mechanism of stochastic birth of scale-free networks

Page 81: Snail Montreal

A stochastic model of evolution of a molecular network

Page 82: Snail Montreal

Molecular networks are composed of DNA, RNA, proteins, lipids, small molecules, …

Page 83: Snail Montreal

(Over)Simplification: we assume that each node of the network has two domains (not necessarily distinct) responsible for the upstream and downstream connections

Page 84: Snail Montreal

A node of a molecular network

[Figure: a node with an upstream domain and a downstream domain.]

Each substance has a pair of “domains,” upstream and downstream, which determine specific upstream and downstream network connections.

Page 85: Snail Montreal

The model would not be appropriate if domain-rich proteins tend to have more network connections than

domain-poor proteins

Page 86: Snail Montreal

Another way to put the same question: Is the number of

domains per protein correlated with the number of network

connections per protein?

Page 87: Snail Montreal

Connectivity is NOT affected by domain number

[Figure: Number of Edges versus Number of Domains. Series: Outgoing, Incoming.]

Page 88: Snail Montreal

Therefore, it is probably OK to use the “two-domain” model in the current setup

Page 89: Snail Montreal

Evolutionary events affecting network growth

[Figure: events acting on the upstream and downstream domains of nodes: (A) Duplication, (B) Innovation, (C) Replacement.]

Page 90: Snail Montreal

We assume

that network rearrangements arrive independently according to a Poisson process…

…which leads to a system of ordinary differential equations corresponding to a continuous-time Markov model…
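To illustrate what independent Poisson arrivals mean in practice, the sketch below simulates event times for several event types. With independent Poisson processes, the waiting time to the next event is exponential with the summed rate, and the event type is chosen in proportion to its rate. The rates in the usage line are invented numbers, not the fitted parameters, and `simulate_events` is a hypothetical name.

```python
import random


def simulate_events(rates, t_end, seed=0):
    """Simulate independent Poisson event streams up to time t_end.

    rates: dict mapping event name to its rate (events per unit time).
    Returns a time-ordered list of (time, event name).
    """
    rng = random.Random(seed)
    names = list(rates)
    weights = [rates[name] for name in names]
    total_rate = sum(weights)
    t, history = 0.0, []
    while True:
        t += rng.expovariate(total_rate)     # waiting time to the next event
        if t > t_end:
            return history
        history.append((t, rng.choices(names, weights=weights)[0]))


# Hypothetical rates, for illustration only.
history = simulate_events(
    {"duplication": 1.0, "innovation": 0.5, "replacement": 0.2}, t_end=50.0)
```

The differential equations mentioned above describe the expected behavior of many such runs, so they replace simulation when only averages are needed.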

Page 91: Snail Montreal

Armed with resulting equations and programs, we can compute the topological properties of the

growing network for any (reasonable) combination of

parameter values

Page 92: Snail Montreal

Data: signal transduction network in yeast

[Figure: Distribution of Connectivity for Yeast Protein Network (DIP). Log-log plot of P(k) versus Edges (k), with P(k) ~ k^(-γ). Series: Outgoing, Outgoing Regr., Incoming, Incoming Regr.; γ = 1.97 (outgoing) and γ = 2.80 (incoming).]

Page 93: Snail Montreal

We therefore can fit the model to the experimental data and estimate parameter values

Page 94: Snail Montreal

Estimating parameters:

$$\sum_{k=1}^{N} \left(O_{k,\mathrm{in}} - E_{k,\mathrm{in}}\right)^2 + \sum_{k=1}^{N} \left(O_{k,\mathrm{out}} - E_{k,\mathrm{out}}\right)^2 \to \min.$$
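The criterion can be read as a sum of squared differences over incoming and outgoing connections. The sketch below assumes that O denotes observed frequencies and E the frequencies expected under the model, which is the usual reading but is not spelled out on the slide. The function name `fit_objective` is hypothetical.

```python
def fit_objective(obs_in, exp_in, obs_out, exp_out):
    """Sum of squared (observed - expected) differences for in and out edges.

    Each argument is a sequence indexed by k = 1..N.
    """
    in_term = sum((o - e) ** 2 for o, e in zip(obs_in, exp_in))
    out_term = sum((o - e) ** 2 for o, e in zip(obs_out, exp_out))
    return in_term + out_term


# Toy numbers: one mismatch of 0.1 in each term.
value = fit_objective([0.5, 0.3], [0.4, 0.3], [0.6, 0.2], [0.5, 0.2])
```

Parameter estimation then amounts to searching for the parameter values whose expected frequencies minimize this quantity.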

Page 95: Snail Montreal

Fitting data for upstream and downstream connections simultaneously

[Figure: log-log plot of log(proportion of vertices) versus log(number of connections per vertex); data with γ_in = 2.80 and γ_out = 1.97, each with its fitted model curve. Parameter values as printed on the slide: u=0.000000; =18.638257; =43.313340; cu=-2.840558; d=298.902373; =18.638257; =43.313340; cd=-2.835264.]

Page 96: Snail Montreal

Fitting data for upstream and downstream connections separately

[Figure: log-log plot of log(proportion of vertices) versus log(number of connections per vertex); data with γ_in = 2.80 and γ_out = 1.97, each with its fitted model curve. Parameter values as printed on the slide: u=0.000000; u=8.581353; u=20.042505; cu=-1.156492; d=0.000000; d=13.167339; d=16.230905; cd=-1.285549.]

Page 97: Snail Montreal

Estimated parameters

[Figure: estimated parameters for Replacement, Duplication, and Innovation.]

Page 98: Snail Montreal

The analysis of the model suggests that the number of domain copies per genome

should also follow a power law

Page 99: Snail Montreal

Frequency of domain appearances in E. coli (closed circles), S. cerevisiae (crosses), and H. sapiens (open rectangles).

[Figure: log-log plot of Frequency versus Number of Domain Occurrences per Genome. Series: E. coli, S. cerevisiae, H. sapiens.]

Page 100: Snail Montreal

How many distinct domains exist?

[Equation: D, the number of distinct domains, expressed through G, the number of genes, and γ_in, the power law constant, using sums over i and j from 1 to V.]

Page 101: Snail Montreal

For E. coli our estimate suggests existence of

D > 4,600 domains…

Page 102: Snail Montreal

For yeast D > 12,900 domains…

Page 103: Snail Montreal

(The last!) overview of the talk

• Introduction: project participants & jigsaw puzzle analogy

• Project motivation.

• Duality of signal transduction language.

• How the whole system works.

• Good and ugly graphs.

---------------------------------------------

• Scale-free networks in biology and outside; stochastic birth of scale-free networks

Page 104: Snail Montreal

Thank you!