
Post on 12-Jan-2016


Storytelling and Clustering for Cellular Signaling Pathways

M. Shahriar Hossain, Monika Akbar, Nicholas F. Polys

Department of Computer Science, Virginia Tech, Blacksburg, VA 24061.

Objective

• STKE Dataset: cell interactions through chemical signals
• Discover relationships between the pathways
• Graph structure: subgraph discovery problem
• Pathway relationships: clustering, storytelling

[Figure: Myocyte Adrenergic Pathway (CMP_9043)]

Dataset properties

Total Pathways = 50

[Figure: histogram of pathway sizes; x-axis: Size Range (1-10, 11-20, ..., 91-100, 100-110); y-axis: Number of Pathways in Size Range (0 to 12)]

Design Pipeline

[Diagram components: STKE Dataset, Preprocessor, Pathway Graphs, Frequent Subgraph Discovery, Frequent Subgraphs, Clustering, NN Storytelling]

Subsequent Candidate Generation

• Apriori: incremental approach [17]
• FSG [2]

Generate a (k+1)-edge candidate subgraph by combining two k-edge subgraphs that share a common core subgraph of (k-1) edges.

The cost of comparing subgraphs (and core subgraphs) is reduced by using the hash code of each subgraph object.
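The join step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes uniquely labeled nodes (so two subgraphs are equal exactly when their edge sets are equal), assumes k >= 2, and uses hypothetical helper names `edge` and `join_candidates`. Python's `frozenset` hashing plays the role of the per-subgraph hash code.

```python
from itertools import combinations

def edge(u, v):
    """Undirected edge stored in a canonical (sorted) order."""
    return tuple(sorted((u, v)))

def join_candidates(k_edge_subgraphs):
    """Return (k+1)-edge candidates from pairs sharing a (k-1)-edge core.

    Assumes k >= 2 and uniquely labeled nodes, so edge-set equality
    stands in for a subgraph isomorphism test.
    """
    candidates = set()
    for a, b in combinations(k_edge_subgraphs, 2):
        core = a & b                          # edges common to both subgraphs
        if len(a) == len(b) and len(core) == len(a) - 1:
            candidates.add(a | b)             # hashing drops duplicate candidates
    return candidates

# Three hypothetical 3-edge subgraphs; only s1 and s2 share a 2-edge core.
s1 = frozenset({edge("m", "n"), edge("n", "o"), edge("o", "l")})
s2 = frozenset({edge("m", "n"), edge("n", "o"), edge("o", "p")})
s3 = frozenset({edge("l", "r"), edge("r", "s"), edge("s", "t")})

candidates = join_candidates([s1, s2, s3])
print(candidates)
```

Every pair of k-edge subgraphs is inspected here, which is the comparison cost the slide refers to.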

[Figure: example subgraphs over nodes l, m, n, o, p, q illustrating the join of two k-edge subgraphs into a (k+1)-edge candidate]

Subsequent Candidate Generation

Instance:
• Number of 5-edge subgraphs: 21
• Core subgraph comparisons for s1: 20

[Figure: 5-edge subgraphs over nodes l, m, n, o, p, q, r, s, t, z compared against s1; one candidate is marked "Not generated"]

Master Pathway Graph (MPG)

Total Unique Nodes: 1205
Total Relations: 1376

SEG: Subgraph Extension Generation

• Neighborhood extension
• Neighborhood list: {q, r, s}
• Comparison is not required. The subgraph is extended from physical evidence.
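A rough sketch of neighborhood extension follows, assuming the Master Pathway Graph is available as a set of undirected edges with uniquely labeled nodes. The function name `extend_subgraph` and the small MPG fragment are hypothetical and chosen only so that the neighborhood list comes out as {q, r, s}.

```python
def edge(u, v):
    """Undirected edge stored in a canonical (sorted) order."""
    return tuple(sorted((u, v)))

def extend_subgraph(subgraph, mpg_edges):
    """Return every one-edge extension of `subgraph` that exists in the MPG.

    Each extension adds an MPG edge that touches a node already in the
    subgraph, so no pairwise comparison between subgraphs is performed.
    """
    nodes = {n for e in subgraph for n in e}
    extensions = []
    for e in sorted(mpg_edges):
        if e not in subgraph and (e[0] in nodes or e[1] in nodes):
            extensions.append(subgraph | {e})
    return extensions

# Hypothetical MPG fragment (not taken from the STKE data).
mpg = {edge("m", "n"), edge("n", "o"), edge("o", "l"), edge("o", "p"),
       edge("p", "q"), edge("l", "r"), edge("n", "s"), edge("x", "y")}
sub = frozenset({edge("m", "n"), edge("n", "o"), edge("o", "l"), edge("o", "p")})

extensions = extend_subgraph(sub, mpg)
sub_nodes = {n for e in sub for n in e}
new_nodes = sorted({n for ext in extensions for e in ext - sub for n in e} - sub_nodes)
print(new_nodes)
```

Because each extension is read directly from edges present in the MPG, no candidate is produced that lacks supporting evidence in the data.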

[Figure: a subgraph over nodes l, m, n, o, p extended by one edge to each of its neighbors q, r, and s]


Subgraph Discovery

min_sup = 2%

k     # of subgraphs generated   Time (sec.)
1     1,376                      Existing
2     5,380                      41
3     29,565                     149
4     187,508                    971
5     1,274,852                  7,518
...   ...                        ...

• What is so novel about pruning edges?

'Importance Factor' of a subgraph: sfipf

For the i-th subgraph and the j-th pathway:

Subgraph frequency: $sf_j = \frac{1}{n_j}$

Inverse pathway frequency: $ipf_i = \frac{|D|}{|\{p_j : s_i \in p_j\}|}$

$sfipf_{i,j} = sf_j \times ipf_i$
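As a small worked sketch, the score can be computed as below, reading the slide's definitions as sf_j = 1/n_j, ipf_i = |D| / |{p_j : s_i in p_j}|, and sfipf_{i,j} = sf_j x ipf_i. The slide does not define n_j; taking it to be the number of subgraphs in pathway j is an assumption made here, as are the function name `sfipf_scores` and the toy data.

```python
def sfipf_scores(pathways):
    """Compute sfipf for every (subgraph, pathway) pair.

    pathways: dict mapping a pathway id to the set of subgraph ids it contains.
    Assumption: n_j is the number of subgraphs in pathway j.
    """
    total = len(pathways)                      # |D|, the number of pathways
    containing = {}                            # subgraph id -> pathways containing it
    for subgraphs in pathways.values():
        for s in subgraphs:
            containing[s] = containing.get(s, 0) + 1

    scores = {}
    for j, subgraphs in pathways.items():
        if not subgraphs:
            continue                           # an empty pathway has no scores
        sf_j = 1.0 / len(subgraphs)            # subgraph frequency
        for s in subgraphs:
            ipf_i = total / containing[s]      # inverse pathway frequency
            scores[(s, j)] = sf_j * ipf_i
    return scores

# Toy data, not from the STKE dataset.
pathways = {"P1": {"a", "b"}, "P2": {"a"}, "P3": {"a", "c", "d", "e"}}
scores = sfipf_scores(pathways)
print(scores[("b", "P1")], scores[("a", "P1")])
```

In this toy input, subgraph "a" occurs in every pathway and so scores lower within P1 than "b", which occurs only there.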

Dataset Properties (sfipf)

Number of edges in MPG = 1376
Total pathways = 50

[Figure, panel 1: # of edges left (0 to 1400) versus min_sfipf (0.00 to 0.20)]

[Figure, panel 2: # of pathways left (0 to 50) versus min_sfipf (0.00 to 0.20)]

Subgraph Discovery

min_sup = 4.0%, min_sfipf = 0.01

[Figure: Time (ms) versus k (3 to 21) for FSG and SEG; y-axis from 0 to 400x10^3]

Subgraph Discovery

min_sup = 4.0%, min_sfipf = 0.01

[Figure: Time (ms) versus k (3 to 21) for FSG and SEG; y-axis from 0 to 3000]

Subgraph Discovery

min_sup = 4.0%, min_sfipf = 0.01

[Figure: # of Attempts (0 to 1,250,000) versus k (3 to 21) for FSG and SEG]

k    Number of Subgraphs   Time Saved (%)   Attempts Saved (%)
2    186                   99.83            98.98
3    246                   98.33            86.15
4    305                   98.57            86.38
5    323                   98.95            86.91
6    313                   98.96            85.64
7    279                   98.88            83.25
8    263                   98.67            78.91
9    292                   98.38            74.76
10   364                   98.58            74.75
11   470                   98.76            78.08
12   608                   99.04            81.84
13   785                   99.22            85.02
14   980                   99.38            87.63
15   1117                  99.48            89.48
16   1075                  99.53            90.26
17   804                   99.51            89.40
18   430                   99.34            85.22
19   141                   98.76            71.22
20   20                    96.15            9.19
21   1                     75.74            -574.47

Overall attempts saved = 89.52%

Overall time saved = 99.39%

Clustering

• Hierarchical Agglomerative Clustering (HAC)
• k-means
• Unsupervised measure of clusters' validity: Average Silhouette Coefficient (ASC) [19]
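For readers unfamiliar with the measure, the following sketch computes the average silhouette coefficient in its textbook form: for each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the coefficient is (b - a) / max(a, b). The function name, the convention that a singleton cluster contributes 0, and the one-dimensional toy data are assumptions for illustration only.

```python
def average_silhouette(points, labels, distance):
    """Average silhouette coefficient over all points (needs >= 2 clusters)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue                      # singleton cluster: coefficient taken as 0
        # a: mean distance to the other members of the same cluster
        a = sum(distance(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(distance(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

# Two well-separated 1-D clusters as a toy example.
points = [0.0, 1.0, 10.0, 11.0]
labels = [0, 0, 1, 1]
asc = average_silhouette(points, labels, lambda x, y: abs(x - y))
print(round(asc, 4))
```

Values near 1 indicate compact, well-separated clusters, while values near 0 or below indicate overlapping clusters.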

Clustering

min_sup = 4%, min_sfipf = 0.01

[Figure, k-means panel: ASC (0.0 to 0.4) versus # of Clusters (2 to 20); legend: Cosine, sfipf, Dice, Jaccard, Overlap]

[Figure, HAC panel: ASC (0.0 to 0.4) versus # of Clusters (2 to 20)]
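The Cosine, Dice, Jaccard, and Overlap measures named in the plot legend can be illustrated with their standard set-based forms. This sketch assumes each pathway is represented simply as the set of frequent subgraphs it contains; the slides do not say whether weighted vectors were used instead, and the function name and toy sets are hypothetical.

```python
import math

def set_similarities(a, b):
    """Standard set-based similarity coefficients between two sets."""
    if not a or not b:
        return {"cosine": 0.0, "dice": 0.0, "jaccard": 0.0, "overlap": 0.0}
    common = len(a & b)
    return {
        "cosine": common / math.sqrt(len(a) * len(b)),
        "dice": 2 * common / (len(a) + len(b)),
        "jaccard": common / len(a | b),
        "overlap": common / min(len(a), len(b)),
    }

# Two hypothetical pathways described by the ids of their frequent subgraphs.
pathway_a = {1, 2, 3}
pathway_b = {2, 3, 4, 5}
sims = set_similarities(pathway_a, pathway_b)
print(sims)
```

All four coefficients range from 0 (no shared subgraphs) to 1, but they penalize differences in set size differently, which is one reason the resulting clusterings can differ.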

Clustering

[Figure: ASC contour map for 10 clusters using HAC; x-axis: min_sup (4 to 12), y-axis: min_sfipf (0.01 to 0.05); contour levels from 0.08 to 0.20]

[Figure: ASC contour map for 10 clusters using k-means; x-axis: min_sup (4 to 12), y-axis: min_sfipf (0.01 to 0.05); contour levels from 0.04 to 0.14]


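Pathway Relations (Storytelling)

• Bidirectional search
• Cover tree for NN

[Figure: bidirectional search between a start pathway S and a target pathway T, through intermediate pathways p1, p2, p3 and p7, p8, p9]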
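The search idea can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: a brute-force nearest-neighbor lookup stands in for the cover tree, Jaccard distance over sets is an assumed distance, two items are treated as linked when one is among the other's b nearest neighbors, and all names and data are hypothetical.

```python
from collections import deque

def jaccard_distance(a, b):
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def nearest_neighbors(name, items, b):
    """Brute-force stand-in for a cover tree: the b closest items to `name`."""
    others = [n for n in items if n != name]
    others.sort(key=lambda n: (jaccard_distance(items[name], items[n]), n))
    return others[:b]

def find_story(items, start, target, b=2):
    """Bidirectional breadth-first search over b-nearest-neighbor links."""
    if start == target:
        return [start]
    parents = ({start: None}, {target: None})    # forward and backward trees
    frontiers = (deque([start]), deque([target]))
    side = 0
    while frontiers[0] or frontiers[1]:
        frontier, seen, other = frontiers[side], parents[side], parents[1 - side]
        for _ in range(len(frontier)):           # expand one level on this side
            node = frontier.popleft()
            for nb in nearest_neighbors(node, items, b):
                if nb in seen:
                    continue
                seen[nb] = node
                if nb in other:                  # the two searches have met
                    return _stitch(parents, nb)
                frontier.append(nb)
        side = 1 - side                          # alternate between the two ends
    return None

def _stitch(parents, meet):
    """Join the forward path (start..meet) with the backward path (meet..target)."""
    story, node = [], meet
    while node is not None:
        story.append(node)
        node = parents[0][node]
    story.reverse()
    node = parents[1][meet]
    while node is not None:
        story.append(node)
        node = parents[1][node]
    return story

# Hypothetical items described as sets of features.
items = {"S": {1, 2, 3}, "p1": {2, 3, 4}, "p2": {3, 4, 5},
         "p3": {4, 5, 6}, "T": {5, 6, 7}}
story = find_story(items, "S", "T", b=2)
print(story)
```

The branching factor b limits how many neighbors each item may link to, so a small b can leave the two searches unable to meet, while a larger b explores more candidate stories.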

Day-to-day life example

From: Roman Holiday, To: Terminator 3
Roman Holiday, Sabrina, Breakfast at Tiffany's, Some Like It Hot, Rear Window, 2001: A Space Odyssey, Golden Eye, Die Another Day, Terminator 3

From Terminator 3
Terminator 3, Collateral Damage, Lethal Weapon 4, Die Hard 2, Speed, Air Force One, U.S. Marshals, S.W.A.T., The Day After Tomorrow, Van Helsing, Blade: Trinity

From Roman Holiday
Roman Holiday, Sabrina, Funny Face, Deep in My Heart, Singing in the Rain, An American in Paris, Kismet, Kiss Me Kate, High Society, Anchors Aweigh, On the Town, Take Me Out to the Ball Game

Examples in STKE

http://people.cs.vt.edu/msh/infoviz/3/

Pathway Relations (Storytelling)

Number of stories of varying length for different branching factors

[Figure: Number of t-length stories (0 to 350) versus Story length, t (3 to 16); series: b=2, b=4, b=6, b=8]

Pathway Relations (Storytelling)

Number of stories of varying length for different branching factors

[Figure: Number of t-length stories (0 to 350) versus Story length, t (3 to 16); series: b=2 through b=10]

Pathway Relations (Storytelling)

[Figure, panel 1: Total stories from all pairs (0 to 1000) versus Branching factor, b (2 to 10)]

[Figure, panel 2: Time to generate all stories (ms) (0 to 1.4x10^6) versus Branching factor, b (2 to 10)]

[Figure, panel 3: Length of the longest story (4 to 16) versus Branching factor, b (2 to 10)]

Future Directions

Compare our SEG graph methods with text-based clustering and storytelling

Examine costs and benefits for combining text and graph mining techniques


References

[1] Science Signaling, The Signal Transduction Knowledge Environment (STKE), "The Database of Cell Signaling", http://stke.sciencemag.org/cm/

[2] Kuramochi, M. and Karypis, G., "An efficient algorithm for discovering frequent subgraphs", IEEE Transactions on KDE, Vol. 16(9), September 2004, pp. 1038-1051.

[3] Breslin, T., Krogh, M., Peterson, C., and Troein, C., "Signal transduction pathway profiling of individual tumor samples", BMC Bioinformatics, June 29, 2005.

[4] Kumar, D., Ramakrishnan, N., Helm, R. F., and Potts, M., "Algorithms for Storytelling", IEEE Transactions on KDE, Vol. 20(6), June 2008, pp. 736-751.

[5] Ratprasartporn, N., Cakmak, A., and Ozsoyoglu, G., "On Data and Visualization Models for Signaling Pathways", 18th SSDBM, 2006, pp. 133-142.

[6] Xu, X., and Yu, Y., "Modeling and Verifying WNT Signaling Pathway", 3rd Intl. Conf. on ICNC. 2007, Vol. 2, pp. 319 - 323.

[7] Schreiber, F., "Comparison of metabolic pathways using constraint graph drawing", 1st Asia-Pacific bioinformatics Conf. on Bioinfo., Australia, Vol. 19, 2003, pp. 105 - 110.

[8] Abello, J., van Ham, F., and Krishnan, N., "ASKGraphView: A Large Scale Graph Visualization System", IEEE Transactions on Visualization and Computer Graphics, Vol. 12(5), 2006, pp. 669 - 676.

[9] Miyake, S., Tohsato, A., Takenaka, Y., and Matsuda, H. "A clustering method for comparative analysis between genomes and pathways", 8th Intl. Conf. on Database Systems for Advanced Applications, March 2003 pp. 327 - 334.

[10] Yan, X., and Han, J. "gSpan: graph-based substructure pattern mining", IEEE ICDM, 2002, pp. 721-724.

[11] Moti, C., and Ehud, G. "Diagonally Subgraphs Pattern Mining", 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, 2004, pp. 51-58.

[12] Ketkar, N., Holder, L., Cook, D., Shah, R., and Coble, J. "Subdue: Compression-based Frequent Pattern Discovery in Graph Data", ACM KDD Workshop on Open-Source Data Mining, August 2005, pp. 71-76.

[13] Zhang, T., Ramakrishnan, R., and Livny, M., "BIRCH: An Efficient Data Clustering Method for Very Large Databases", ACM SIGMOD Intl. Conf. on Management of Data, Canada, 1996, pp. 103-114.

[14] Wagstaff, K., Cardie, C., Rogers, S., and Schroedl, S., "Constrained K-means Clustering with Background Knowledge", ICML 2001, pp. 577-584.

[15] Lin, F., and Hsueh, C. M., "Knowledge map creation and maintenance for virtual communities of practice", Intl. Journal of Information Processing and Management, ACM, Vol. 42(2), 2006, pp. 551-568.

[16] Beygelzimer, A., Kakade, S., Langford, J., "Cover trees for nearest neighbor", ICML 2006, pp. 97-104.

[17] Agrawal, R., and Srikant, R. "Fast Algorithms for Mining Association Rules", Intl. Conf. on Very Large Data Bases, Santiago, Chile, September 1994, pp. 487-499.

[18] Agrawal, R., Mehta, M., Shafer, J., Srikant, R., Arning, A. and Bollinger, T. "The Quest Data Mining System", KDD'96, USA, 1996, pp. 244-249.

[19] Tan, P. N., Steinbach, M., and Kumar, V., "Introduction to Data Mining", Addison-Wesley, ISBN: 0321321367, April 2005, pp. 539-547.

[20] http://people.cs.vt.edu/amonika/infoviz/


Thank You
