Storytelling and Clustering for Cellular Signaling Pathways

M. Shahriar Hossain, Monika Akbar, Nicholas F. Polys

Department of Computer Science, Virginia Tech, Blacksburg, VA 24061.

Objective

STKE Dataset: cell interactions through chemical signals

Discover relationships between the pathways

Graph structure: subgraph discovery problem

Pathway relationships: clustering, storytelling

Myocyte Adrenergic Pathway (CMP_9043)

Dataset properties

Total Pathways = 50

Size Range

Design Pipeline

Preprocessor

Frequent Subgraph Discovery

Pathway Graphs

Frequent Subgraph

Clustering

STKE Dataset

NN Storytelling

Subsequent Candidate Generation

Apriori: incremental approach [17]

FSG [2]

Generate a (k+1)-edge candidate subgraph by combining two k-edge subgraphs where these two k-edge subgraphs have a common core subgraph of (k-1)-edges.

Cost of comparison between subgraphs (and core subgraphs) is reduced using hash-code of each subgraph object.
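The join step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each subgraph is a frozenset of directed edges written as (source, target) tuples, and that node names are unique so an edge set identifies a subgraph. The function name join_candidates is hypothetical. Python's built-in hashing of frozensets plays the role of the per-subgraph hash code.

```python
from itertools import combinations


def nodes_of(subgraph):
    """Return the set of node names touched by a subgraph's edges."""
    return {n for edge in subgraph for n in edge}


def join_candidates(k_subgraphs):
    """Join pairs of k-edge subgraphs that share a (k-1)-edge core.

    Assumption: a subgraph is a frozenset of (source, target) tuples.
    Returns the set of (k+1)-edge candidates and the number of
    pairwise core comparisons that were needed.
    """
    k_subgraphs = list(k_subgraphs)
    candidates = set()
    comparisons = 0
    for a, b in combinations(k_subgraphs, 2):
        comparisons += 1
        if len(a) != len(b):
            continue
        core = a & b  # edges common to both subgraphs
        if len(core) != len(a) - 1:
            continue
        # With an empty core (k = 1) the two edges must still touch.
        if not core and not (nodes_of(a) & nodes_of(b)):
            continue
        candidates.add(a | b)  # frozenset hashing removes duplicates
    return candidates, comparisons


s1 = frozenset({("a", "b"), ("b", "c")})
s2 = frozenset({("b", "c"), ("c", "d")})
s3 = frozenset({("x", "y"), ("y", "z")})
candidates, comparisons = join_candidates([s1, s2, s3])
```

Every subgraph is compared against every other one at the same level, so the number of comparisons grows quadratically with the number of k-edge subgraphs.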

Subsequent Candidate Generation

Instance:

Number of 5-edge subgraphs: 21

Core subgraph comparisons for s1: 20

[Figure: candidate generation example with a "Not generated" label]

Master Pathway Graph (MPG)

Total unique nodes: 1205

Total relations: 1376

SEG: Subgraph Extension Generation

Neighborhood extension

Neighborhood list: {q, r, s}

Comparison is not required. Subgraph is extended from physical evidence.
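A minimal sketch of neighborhood extension, under assumptions: the MPG is given as a set of (source, target) edge tuples, a subgraph is a frozenset of such edges, and the function names neighborhood and extend are hypothetical. Each (k+1)-edge candidate is built by adding one MPG edge that touches the current subgraph, so no pairwise comparison of subgraphs is needed.

```python
def neighborhood(subgraph, mpg_edges):
    """Edges of the MPG that touch the subgraph but are not part of it."""
    nodes = {n for edge in subgraph for n in edge}
    return {
        edge
        for edge in mpg_edges
        if edge not in subgraph and (edge[0] in nodes or edge[1] in nodes)
    }


def extend(subgraph, mpg_edges):
    """Grow a k-edge subgraph into all (k+1)-edge subgraphs present in the MPG."""
    return {subgraph | {edge} for edge in neighborhood(subgraph, mpg_edges)}


# Toy MPG: node p has three neighbors q, r and s.
mpg_edges = {("o", "p"), ("p", "q"), ("p", "r"), ("p", "s"), ("x", "y")}
start = frozenset({("o", "p")})
extended = extend(start, mpg_edges)
```

Because every extension is taken from edges that exist in the MPG, each candidate corresponds to structure actually observed in the data. Support counting and pruning are left out of this sketch.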

Subgraph Discovery

k     # of subgraphs generated    Time (sec.)
1     1,376                       Existing
2     5,380                       41
3     29,565                      149
4     187,508                     971
5     1,274,852                   7,518
---   ----                        -----

min_sup = 2%

• What is so novel about pruning edges?

‘Importance Factor’ of a subgraph: sfipf

Subgraph frequency, sf

Inverse pathway frequency, ipf

For the i-th subgraph and the j-th pathway:

$sfipf_{i,j} = sf_{i,j} \times ipf_i$
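The slide does not spell out how subgraph frequency and inverse pathway frequency are computed. The sketch below assumes definitions analogous to tf-idf: sf is the share of a pathway's subgraph occurrences taken by subgraph i, and ipf is the logarithm of the total number of pathways divided by the number of pathways containing subgraph i. Both definitions, the input layout, and the function name sfipf_weights are assumptions made for illustration.

```python
import math


def sfipf_weights(counts):
    """Compute sfipf for every (subgraph, pathway) pair.

    counts[pathway][subgraph] = number of occurrences (assumed layout).
    Assumed definitions, by analogy with tf-idf:
      sf  = occurrences of the subgraph / all subgraph occurrences in the pathway
      ipf = log(number of pathways / number of pathways containing the subgraph)
    """
    n_pathways = len(counts)
    containing = {}
    for per_pathway in counts.values():
        for sub, n in per_pathway.items():
            if n > 0:
                containing[sub] = containing.get(sub, 0) + 1

    weights = {}
    for pathway, per_pathway in counts.items():
        total = sum(per_pathway.values())
        for sub, n in per_pathway.items():
            if n > 0:
                sf = n / total
                ipf = math.log(n_pathways / containing[sub])
                weights[(sub, pathway)] = sf * ipf
    return weights


counts = {"P1": {"s1": 1, "s2": 1}, "P2": {"s1": 1}}
weights = sfipf_weights(counts)
```

Under these assumed definitions, a subgraph that appears in every pathway gets weight zero, so a threshold such as min_sfipf would discard it as uninformative.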

Dataset Properties (sfipf)

[Figure: dataset properties plotted against min_sfipf; number of edges in MPG = 1376, total pathways = 50]

Subgraph Discovery

min_sup = 4.0%, min_sfipf = 0.01

[Figure: FSG vs. SEG for k = 3 to 21]

Subgraph Discovery

[Figure: FSG vs. SEG for k = 3 to 21]

Subgraph Discovery

[Figure: FSG vs. SEG for k = 3 to 10]

k     Number of Subgraphs    Time Saved (%)    Attempts Saved (%)
2     186                    99.83             98.98
3     246                    98.33             86.15
4     305                    98.57             86.38
5     323                    98.95             86.91
6     313                    98.96             85.64
7     279                    98.88             83.25
8     263                    98.67             78.91
9     292                    98.38             74.76
10    364                    98.58             74.75
11    470                    98.76             78.08
12    608                    99.04             81.84
13    785                    99.22             85.02
14    980                    99.38             87.63
15    1117                   99.48             89.48
16    1075                   99.53             90.26
17    804                    99.51             89.40
18    430                    99.34             85.22
19    141                    98.76             71.22
20    20                     96.15             9.19
21    1                      75.74             -574.47

Overall attempts saved = 89.52%

Overall time saved = 99.39%

Clustering

Hierarchical Agglomerative Clustering (HAC)

k-means

Unsupervised measure of clusters’ validity: Average Silhouette Coefficient (ASC)
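The Average Silhouette Coefficient can be illustrated with a short sketch. It uses the standard definition: for each item, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the silhouette is (b - a) / max(a, b). The function name, the toy one-dimensional data, and the convention of scoring singleton clusters as 0 are assumptions for illustration.

```python
def average_silhouette(points, labels, dist):
    """Mean silhouette value over all points (requires at least two clusters)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster: silhouette taken as 0
        # a: mean distance to the other members of the same cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of another cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def abs_diff(x, y):
    return abs(x - y)


points = [0.0, 1.0, 10.0, 11.0]
good = average_silhouette(points, [0, 0, 1, 1], abs_diff)
bad = average_silhouette(points, [0, 1, 0, 1], abs_diff)
```

Values close to 1 indicate compact, well-separated clusters; values near or below 0 indicate items that sit closer to another cluster than to their own.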

Clustering

min_sup = 4%, min_sfipf = 0.01

[Figure: k-means; x-axis: # of clusters (2 to 20); similarity measures: Cosine sfipf Dice Jaccard Overlap]

min_sup = 4%, min_sfipf = 0.01

[Figure: x-axis: # of clusters (2 to 20)]
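The plot legend names cosine, Dice, Jaccard and overlap similarity. The sketch below gives the standard definitions of these measures. It assumes, for illustration only, that a pathway is represented either as the set of frequent subgraphs it contains (for the set-based measures) or as a dictionary of subgraph weights (for cosine); the identifiers are hypothetical.

```python
import math


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|"""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a, b):
    """2 |A ∩ B| / (|A| + |B|)"""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def overlap(a, b):
    """|A ∩ B| / min(|A|, |B|)"""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def cosine(u, v):
    """Cosine similarity of two sparse weight vectors stored as dicts."""
    dot = sum(w * v.get(key, 0.0) for key, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


p1 = {"s1", "s2", "s3"}
p2 = {"s2", "s3", "s4", "s5"}
scores = (jaccard(p1, p2), dice(p1, p2), overlap(p1, p2))
```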

Clustering

[Figure: ASC contour map for 10 clusters using HAC; axis: min_sup (4 to 12); ASC levels 0.08 to 0.20]

[Figure: ASC contour map for 10 clusters using k-means; axis: min_sup (4 to 12); ASC levels 0.04 to 0.14]

Pathway Relations (StoryTelling)

Bidirectional Search

Cover tree for NN
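A minimal sketch of storytelling as bidirectional search over a nearest-neighbor graph. Several simplifications are assumptions made for illustration: each item is a set of features, nearest neighbors are found by brute force with Jaccard distance rather than with a cover tree, neighbor links are treated as undirected, and all function names are hypothetical. The parameter b is the branching factor, that is, the number of nearest neighbors linked from each item.

```python
from collections import deque


def jaccard_distance(a, b):
    union = a | b
    return 1.0 - (len(a & b) / len(union) if union else 1.0)


def knn_graph(items, b):
    """Link every item to its b nearest neighbors (brute force, undirected)."""
    graph = {name: set() for name in items}
    for name, feats in items.items():
        others = sorted(
            (n for n in items if n != name),
            key=lambda n: (jaccard_distance(feats, items[n]), n),
        )
        for n in others[:b]:
            graph[name].add(n)
            graph[n].add(name)
    return graph


def _expand(graph, frontier, seen, other_seen):
    """Expand one BFS level; return a node also reached from the other side."""
    for _ in range(len(frontier)):
        node = frontier.popleft()
        for nxt in sorted(graph[node]):
            if nxt not in seen:
                seen[nxt] = node
                if nxt in other_seen:
                    return nxt
                frontier.append(nxt)
    return None


def bidirectional_story(graph, start, goal):
    """Return a chain of items from start to goal, or None if none exists."""
    if start == goal:
        return [start]
    fwd, bwd = {start: None}, {goal: None}
    frontier_f, frontier_b = deque([start]), deque([goal])
    while frontier_f and frontier_b:
        if len(frontier_f) <= len(frontier_b):
            meet = _expand(graph, frontier_f, fwd, bwd)
        else:
            meet = _expand(graph, frontier_b, bwd, fwd)
        if meet is not None:
            path, node = [], meet
            while node is not None:  # walk back to start
                path.append(node)
                node = fwd[node]
            path.reverse()
            node = bwd[meet]
            while node is not None:  # walk on to goal
                path.append(node)
                node = bwd[node]
            return path
    return None


items = {"A": {1, 2}, "B": {2, 3}, "C": {3, 4}, "D": {4, 5}}
story = bidirectional_story(knn_graph(items, b=1), "A", "D")
```

Each consecutive pair in the returned chain is a nearest-neighbor link, so neighboring items in the story share features even when the two endpoints share none.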

Day-to-day life example

From: Roman Holiday To: Terminator 3

Roman Holiday, Sabrina, Breakfast at Tiffany’s, Some Like it Hot, Rear Window, 2001: A Space Odyssey, Golden Eye, Die Another Day, Terminator 3

From Terminator 3

Terminator 3, Collateral damage, Lethal Weapon 4, Die Hard 2, Speed, Air Force, U.S. Marshals, S.W.A.T., The day after Tomorrow, van Helsing, Blade: Trinity

From Roman Holiday

Roman Holiday, Sabrina, Funny Face, Deep in my Heart, Singing in the rain, An American in Paris, Kismet, Kiss me Kate, High Society, Anchors Aweigh, On the Town, Take me out to the Ball Game

Examples in STKE

http://people.cs.vt.edu/msh/infoviz/3/

[Figure: numbers of varying length stories for different branching factors; x-axis: story length, t (3 to 16); series: b = 2, 4, 6, 8]

[Figure: numbers of varying length stories for different branching factors; x-axis: story length, t (3 to 16); series: b = 2 to 10]

[Figures: three plots against branching factor, b (2 to 10)]

Future Directions

Compare our SEG graph methods with text-based clustering and storytelling

Examine costs and benefits of combining text and graph mining techniques

References

[1] Science Signaling, The Signal Transduction Knowledge Environment (STKE), "The Database of Cell Signaling", http://stke.sciencemag.org/cm/

[2] Kuramochi, M. and Karypis, G., "An efficient algorithm for discovering frequent subgraphs", IEEE Transactions on KDE, Vol. 16(9), September 2004, pp. 1038-1051.

[3] Breslin, T., Krogh, M., Peterson, C., and Troein, C., "Signal transduction pathway profiling of individual tumor samples", BMC Bioinformatics, June 29, 2005.

[4] Kumar, D., Ramakrishnan, N., Helm, R. F., and Potts, M., "Algorithms for Storytelling", IEEE Transactions on KDE, Vol. 20(6), June 2008, pp. 736-751.

[5] Ratprasartporn, N., Cakmak, A., and Ozsoyoglu, G., "On Data and Visualization Models for Signaling Pathways", 18th SSDBM, 2006, pp. 133-142.

[6] Xu, X., and Yu, Y., "Modeling and Verifying WNT Signaling Pathway", 3rd Intl. Conf. on ICNC. 2007, Vol. 2, pp. 319 - 323.

[7] Schreiber, F., "Comparison of metabolic pathways using constraint graph drawing", 1st Asia-Pacific bioinformatics Conf. on Bioinfo., Australia, Vol. 19, 2003, pp. 105 - 110.

[8] Abello, J., van Ham, F., and Krishnan, N., "ASKGraphView: A Large Scale Graph Visualization System", IEEE Transactions on Visualization and Computer Graphics, Vol. 12(5), 2006, pp. 669 - 676.

[9] Miyake, S., Tohsato, A., Takenaka, Y., and Matsuda, H. "A clustering method for comparative analysis between genomes and pathways", 8th Intl. Conf. on Database Systems for Advanced Applications, March 2003 pp. 327 - 334.

[10] Yan, X., and Han, J. "gSpan: graph-based substructure pattern mining", IEEE ICDM, 2002, pp. 721-

[11] Moti, C., and Ehud, G. "Diagonally Subgraphs Pattern Mining", 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, 2004, pp. 51-58.

[12] Ketkar, N., Holder, L., Cook, D., Shah, R., and Coble, J. "Subdue: Compression-based Frequent Pattern Discovery in Graph Data", ACM KDD Workshop on Open-Source Data Mining, August 2005, pp. 71-76.

[13] Zhang, T., Ramakrishnan, R., and Livny, M., "BIRCH: An Efficient Data Clustering Method for Very Large Databases", ACM SIGMOD Intl. Conf. on Management of Data, Canada, 1996, pp. 103-114.

[14] Wagstaff, K., Cardie, C., Rogers, S., and Schroedl, S., "Constrained K-means Clustering with Background Knowledge", ICML 2001, pp. 577-584.

[15] Lin, F., and Hsueh, C. M., "Knowledge map creation and maintenance for virtual communities of practice", Intl. Journal of Information Processing and Management, ACM, Vol. 42(2), 2006, pp. 551-568.

[16] Beygelzimer, A., Kakade, S., Langford, J., "Cover trees for nearest neighbor", ICML 2006, pp. 97-104.

[17] Agrawal, R., and Srikant, R. "Fast Algorithms for Mining Association Rules", Intl. Conf. on Very Large Data Bases, Santiago, Chile, September 1994, pp. 487-499.

[18] Agrawal, R., Mehta, M., Shafer, J., Srikant, R., Arning, A. and Bollinger, T. "The Quest Data Mining System", KDD'96, USA, 1996, pp. 244-249.

[19] Tan, P. N., Steinbach, M., and Kumar, V., "Introduction to Data Mining", Addison-Wesley, ISBN: 0321321367, April 2005, pp. 539-547.

[20] http://people.cs.vt.edu/amonika/infoviz/

Thank You
