
Page 1: Text Mining and Its Application in Bioinformatics Xiaohua Tony Hu College of Information Science & Technology Drexel University, USA

Text Mining and Its Application in Bioinformatics

Xiaohua Tony Hu
College of Information Science & Technology
Drexel University, USA

Page 2

Agenda

• Introduction
• Problems of Biomedical Literature Mining Approaches
• Related Work
• Our System: Bio-SET-DM
• Sub Network Modeling, Simulation and Evaluation
• Conclusion and Future Studies

Page 3

Biomedical Literature Mining

• Much biomedical and bioinformatics knowledge, and many experimental results, are published only in text documents, and these documents are collected in online digital libraries/databases (Medline, PubMedCentral, BioMedCentral).
• How big is Medline?
– Abstracts from more than 4800 journals, with over 16 million abstracts
– Over 10,000 papers per week are added

Page 4

Introduction

The exploding number of PubMed articles over the years

[Figure: chart of MEDLINE size (# of articles, axis from 0 to 16,000,000) by year, 1950 to 2006 (Apr.).]

Page 5

Introduction

• How to solve the information overload of biomedical literature?
– developing scalable searching & mining methods
– integrating information extraction and data mining methods to automatically
o search & retrieve biomedical literature efficiently and effectively
o extract the results into a structured format
o mine important biological relationships

Page 6

Major Issues in Biomedical Literature Mining

• Huge numbers of documents
• Lack of structure
• Many subdomains
• Many aliases and typographical variants for most biomedical objects
• Abbreviations, synonyms, polysemy, etc.

Page 7

The General Text Mining View

1. Selects what to read (Information Retrieval),
2. Identifies important entities and relations between those entities (Information Extraction),
3. Combines this new information with other documents and other knowledge into a database,
4. Mines the extracted results (Data Mining)

Page 8

Issues in Current Information Retrieval (IR)

• Keyword-based: retrieves many irrelevant documents and misses many relevant ones
• Query Expansion
• Probabilistic Language Modeling

Ex. (ambiguous keywords): mouse, bank, chip, apple, etc.

Page 9

Issues in Current Information Extraction (IE)

• Examining every document
– Doing so against Medline is extremely time-consuming
• Using filters to select promising abstracts for extraction
– Requires human involvement to maintain the filters and to adapt them to new topics or subdisciplines

Page 10

Our Approaches: Bio-SET-DM

• Information Retrieval: semantic query expansion (Xiaohua Zhou's Ph.D. thesis)
• Information Extraction Methods: mutual reinforcement learning for automatic pattern learning and tuple extraction (Illhoi Yoo's Ph.D. thesis)
• Text Mining: graph-based representation for text clustering and summarization (Xiaodan Zhang's Ph.D. thesis)
• Bio-SET-DM (Biomedical Literature Searching, Extracting and Text Data Mining)
• Biomedical Ontologies: UMLS and GO are the glue

Page 11

NSF Career: A Unified Architecture for Data Mining Biomedical Literature Databases (415K US$, March 2005-Feb 2010)

[Figure: system architecture diagram. Labels shown: Ontology Base; UMLS; Gene Ontology; Biomedical Literature DB (e.g. PubMed); Initial Query; Semantic-based Query Expansion; Query List; Information Retrieval; Categorized Documents; Information Extraction; Promising Document Set; Automatic Pattern & Relation Generation; Knowledge Base; Data Mining; Text Clustering; Text Summarization; Keyphrase Extraction; Rule Induction; Association Algorithm; Text Clusters; Summary Report.]

Page 12

Problem Descriptions of IR

• Descriptions
– Many biomedical literature searches are about relationships between biological entities.
– The co-occurrence of two keywords often does not mean that these two keywords are really related.
– Explicitly index and search documents with relationships

obesity [TIAB] AND hypertension [TIAB] AND hasabstract [text] AND ("1900"[PDAT] : "2005/03/08"[PDAT])

This query was used to retrieve documents addressing the interaction of obesity and hypertension from PubMed. A ranked hit list of 6687 documents was returned. We then took the top 100 abstracts for human relevance judgment. As expected, only 33 of them were relevant.

Page 13

Statistical Language Model

• Statistical language model
– It is a probabilistic mechanism for generating text.
• Text generation
– Suppose the word is the unit of a text (e.g. a document). The text generation process is as follows:
• Choose a language model in each step.
• Generate a word according to the chosen model.

Page 14

Language Modeling and IR

• Example:
– Document 1 = {(A,3), (B,5), (C,2)}
– Document 2 = {(A,4), (B,1), (C,5)}
– Query = {A, B}
– Which document is more relevant to the query?

Doc 1: 0.3*0.5=0.15

Doc 2: 0.4*0.1=0.04

Doc 1 is more relevant to the query than Doc 2
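The arithmetic in this example can be reproduced with a few lines of Python. This is a minimal sketch, assuming maximum-likelihood word probabilities (count divided by document length) and a query scored as the product of its term probabilities; the function names `mle_model` and `query_likelihood` are illustrative and do not come from the deck.

```python
def mle_model(term_counts):
    """Maximum-likelihood unigram model: p(w|d) = c(w, d) / |d|."""
    total = sum(term_counts.values())
    return {term: count / total for term, count in term_counts.items()}


def query_likelihood(query_terms, model):
    """Probability of generating the query from the model (product of term probabilities)."""
    score = 1.0
    for term in query_terms:
        score *= model.get(term, 0.0)  # unseen terms get probability 0 under MLE
    return score


doc1 = {"A": 3, "B": 5, "C": 2}
doc2 = {"A": 4, "B": 1, "C": 5}
query = ["A", "B"]

score1 = query_likelihood(query, mle_model(doc1))
score2 = query_likelihood(query, mle_model(doc2))
print(round(score1, 2), round(score2, 2))  # prints "0.15 0.04"
```

Doc 1 scores higher because it gives more of its probability mass to B, which matches the ranking on the slide.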

Page 15

Why Smoothing?

• Avoid Zero Probability
– Document 1 = {(A,3), (B,5), (C,2)}
– Document 2 = {(A,4), (B,1), (C,5)}
– Query = {A, D}
– Which document is more relevant to the query?

Doc 1: 0.3*0=0

Doc 2: 0.4*0=0

This result is not reasonable: both documents score zero and cannot be ranked, even though both contain A.
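One simple way to see how smoothing removes the zero scores is add-one (Laplace) smoothing. The sketch below is only an illustration of the zero-probability problem, not the semantic smoothing this deck proposes; it assumes a four-term vocabulary {A, B, C, D}, and the function names are hypothetical.

```python
def laplace_model(term_counts, vocabulary):
    """Add-one smoothed unigram model: p(w|d) = (c(w, d) + 1) / (|d| + |V|)."""
    total = sum(term_counts.values())
    return {
        term: (term_counts.get(term, 0) + 1) / (total + len(vocabulary))
        for term in vocabulary
    }


def query_likelihood(query_terms, model):
    """Product of the smoothed term probabilities for the query."""
    score = 1.0
    for term in query_terms:
        score *= model[term]
    return score


vocabulary = ["A", "B", "C", "D"]  # assumed vocabulary; D never occurs in either document
doc1 = {"A": 3, "B": 5, "C": 2}
doc2 = {"A": 4, "B": 1, "C": 5}
query = ["A", "D"]

smoothed1 = query_likelihood(query, laplace_model(doc1, vocabulary))
smoothed2 = query_likelihood(query, laplace_model(doc2, vocabulary))
print(smoothed1, smoothed2)
```

With these assumptions both scores become non-zero (4/196 and 5/196), so Doc 2, which contains A more often, now ranks above Doc 1.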

Page 16

Why Smoothing?

• Discount high-frequency terms: stop words (e.g. the, a, an, you…) occur frequently in documents. According to the Maximum Likelihood Estimate (MLE), their generative probability will be very high. However, stop words carry little information about those documents.
• Assign reasonable probability to unseen words (data sparsity)
– Test words may not appear in the training corpus.
– An effective smoothing method is needed, especially one that incorporates the semantic relationship between test words and training words into the model.
– Example: a document containing "auto" for the query "car" in a text retrieval task.
• With Laplacian smoothing or background smoothing, the document will not be returned for the query.
• With semantic smoothing, the document will be returned for the query.

Page 17

LM and IR

• Steps:
– Estimate the word distribution for each document, i.e., p(w|d_i), which is also referred to as the document language model or document model.
– Compute the probability of generating the query according to each document model.
– Rank all documents in the collection by their query-generating probabilities.
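The second step can be written out explicitly. The slide does not state the independence assumption; the formula below assumes the usual unigram model, in which query terms are generated independently from the document model, and in practice the log form is used to avoid numerical underflow:

```latex
$$
p(Q \mid d_i) = \prod_{q \in Q} p(q \mid d_i),
\qquad
\log p(Q \mid d_i) = \sum_{q \in Q} \log p(q \mid d_i)
$$
```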

Page 18

Language Modeling IR Formalism

• LM: view IR as a process of word sampling from the document. The higher the probability of generating the query, the more relevant the document is to the query (Ponte and Croft 1998)

$$
\begin{aligned}
\log \frac{p(r \mid Q, D)}{p(\bar{r} \mid Q, D)}
&= \log \frac{p(Q \mid D, r)}{p(Q \mid D, \bar{r})} + \log \frac{p(r \mid D)}{p(\bar{r} \mid D)} \\
&\overset{rank}{=} \log p(Q \mid D, r) + \log \frac{p(r \mid D)}{p(\bar{r} \mid D)} \\
&\overset{rank}{=} \log p(Q \mid D, r)
\end{aligned}
$$

The formula is from (Lafferty and Zhai 2002)

Page 19

Context-Sensitive Semantic Smoothing (Our Approach)

• Definition
– Like the statistical translation model, term semantic relationships are used for model smoothing.
– Unlike the statistical translation model, contextual and sense information is considered.
• Method
– Decompose a document into a set of context-sensitive topic signatures and then statistically translate topic signatures into individual words.

Page 20

Topic Signatures

• Concept Pairs
– A pair of concepts that are semantically and syntactically related to each other
– Example: computer and mouse, hypertension and obesity
– Extraction: Ontology-based approach (Zhou et al. 2006, SIGIR)
• Multiword Phrases
– Example: Space Program, Star Wars, White House
– Extraction: Xtract (Smadja 1993)

Page 21

Translation Probability Estimate

• Method
– Use cooccurrence counts (topic signature and individual words)
– Use a mixture model to remove noise from topic-free general words

$$
p(w \mid D_k) = (1 - \alpha)\, p(w \mid t_k) + \alpha\, p(w \mid C)
$$

D_k denotes the set of documents containing the topic signature t_k. The parameter α is the coefficient controlling the influence of the corpus model in the mixture model.

[Figure 2: bipartite indexing diagram linking topic signatures t1 to t5, documents D1 to D4 and words w1 to w4.]

Figure 2. Illustration of document indexing. Vt, Vd and Vw are topic signature set, document set and word set, respectively.

Page 22

Translation Probability Estimate

• Log likelihood of generating D_k

$$
\log p(D_k \mid t_k, C) = \sum_{w} c(w, D_k) \log p(w \mid D_k)
= \sum_{w} c(w, D_k) \log \big( (1 - \alpha)\, p(w \mid t_k) + \alpha\, p(w \mid C) \big)
$$

where c(w, D_k) is the document frequency of term w in D_k, i.e., the cooccurrence count of w and t_k in the whole collection.

• EM for estimation

$$
\hat{p}^{(n)}(w) = \frac{(1 - \alpha)\, p^{(n)}(w \mid t_k)}{(1 - \alpha)\, p^{(n)}(w \mid t_k) + \alpha\, p(w \mid C)}
$$

$$
p^{(n+1)}(w \mid t_k) = \frac{c(w, D_k)\, \hat{p}^{(n)}(w)}{\sum_{i} c(w_i, D_k)\, \hat{p}^{(n)}(w_i)}
$$
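The EM procedure for the mixture model can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the function name, the toy counts, the toy corpus model, the value α = 0.5 and the fixed iteration count are all assumptions made for the example.

```python
def estimate_translation_probs(cooccurrence, corpus_model, alpha=0.5, iterations=50):
    """Estimate p(w|t_k) from c(w, D_k) with the corpus model p(w|C) as the noise component."""
    total = sum(cooccurrence.values())
    # Initialise p(w|t_k) with the plain maximum-likelihood estimate.
    p_w_t = {w: c / total for w, c in cooccurrence.items()}
    for _ in range(iterations):
        # E-step: probability that an occurrence of w came from the topic signature model.
        posterior = {}
        for w in cooccurrence:
            topical = (1 - alpha) * p_w_t[w]
            denom = topical + alpha * corpus_model[w]
            posterior[w] = topical / denom if denom > 0 else 0.0
        # M-step: re-estimate p(w|t_k) from the posterior-weighted counts.
        weighted = {w: cooccurrence[w] * posterior[w] for w in cooccurrence}
        norm = sum(weighted.values())
        if norm == 0:
            break
        p_w_t = {w: weighted[w] / norm for w in cooccurrence}
    return p_w_t


# Toy data (hypothetical): counts of words cooccurring with one topic signature.
cooccurrence = {"shuttle": 30, "nasa": 25, "the": 45}
corpus_model = {"shuttle": 0.01, "nasa": 0.01, "the": 0.98}

translation = estimate_translation_probs(cooccurrence, corpus_model)
print(translation)
```

In this toy run the general word "the" ends up with less probability than its raw share of the counts, while the topical words gain, which is the noise-removal effect the mixture model is meant to provide.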

Page 23

Contrasting Translation Example

Space:
space 0.245; shuttle 0.057; launch 0.053; flight 0.042; air 0.035; program 0.031; center 0.030; administration 0.026; develop 0.025; like 0.023; look 0.022; world 0.020; director 0.020; plan 0.018; release 0.017; problem 0.017; work 0.016; place 0.016; mile 0.015; base 0.014;

Program:
program 0.193; washington 0.026; congress 0.026; administration 0.024; need 0.024; billion 0.023; develop 0.023; bush 0.020; plan 0.020; money 0.020; problem 0.020; provide 0.020; writer 0.018; d 0.018; help 0.018; work 0.017; president 0.017; house 0.017; million 0.016; increase 0.016;

Space Program:
space 0.101; program 0.071; NASA 0.048; shuttle 0.043; astronaut 0.041; launch 0.040; mission 0.038; flight 0.037; earth 0.037; moon 0.035; orbit 0.032; satellite 0.031; Mar 0.030; explorer 0.028; station 0.028; rocket 0.027; technology 0.026; project 0.025; science 0.023; budget 0.023;

Page 24

Topic Signature LM

• Basic Idea
– Linearly interpolate the topic signature based translation model with a simple language model.
– The document expansions based on context-sensitive semantic smoothing will be very specific.
– The simple language model can capture the points the topic signatures miss.

$$
p_{bt}(w \mid d) = (1 - \lambda)\, p_b(w \mid d) + \lambda\, p_t(w \mid d)
$$

where the translation coefficient (λ) controls the influence of the translation component in the mixture model.

Page 25

Topic Signature LM

• The Simple Language Model

$$
p_b(w \mid d) = (1 - \alpha)\, p_{ml}(w \mid d) + \alpha\, p(w \mid C)
$$

• The Topic Signature Translation Model

$$
p_t(w \mid d) = \sum_{k} p(w \mid t_k)\, p_{ml}(t_k \mid d)
$$

$$
p_{ml}(t_k \mid d) = \frac{c(t_k, d)}{\sum_{i} c(t_i, d)}
$$

c(t_i, d) is the frequency of topic signature t_i in document d.
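The two component models and their interpolation can be put together in a short Python sketch. Everything concrete here is an assumption for illustration: the function names, the toy document, the single topic signature, and its translation probabilities are hypothetical; the coefficient values 0.05 and 0.3 are taken from the settings reported elsewhere in the deck.

```python
def max_likelihood(counts):
    """Normalise raw counts into a maximum-likelihood distribution."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def simple_lm(word, doc_word_counts, corpus_model, alpha):
    """p_b(w|d): document MLE interpolated with the corpus (background) model."""
    p_ml = max_likelihood(doc_word_counts).get(word, 0.0)
    return (1 - alpha) * p_ml + alpha * corpus_model.get(word, 0.0)


def translation_lm(word, doc_signature_counts, translation):
    """p_t(w|d): sum over topic signatures of p(w|t_k) * p_ml(t_k|d)."""
    p_sig = max_likelihood(doc_signature_counts)
    return sum(translation.get(t, {}).get(word, 0.0) * p for t, p in p_sig.items())


def topic_signature_lm(word, doc_word_counts, doc_signature_counts,
                       corpus_model, translation, alpha, lam):
    """p_bt(w|d): linear interpolation of the simple LM and the translation model."""
    p_b = simple_lm(word, doc_word_counts, corpus_model, alpha)
    p_t = translation_lm(word, doc_signature_counts, translation)
    return (1 - lam) * p_b + lam * p_t


# Toy document that mentions "auto" but never "car" (hypothetical data).
doc_word_counts = {"auto": 2, "engine": 1}
doc_signature_counts = {("auto", "engine"): 1}
corpus_model = {"car": 0.01, "auto": 0.01, "engine": 0.01}
translation = {("auto", "engine"): {"car": 0.2, "auto": 0.3, "engine": 0.3, "vehicle": 0.2}}

p_car = topic_signature_lm("car", doc_word_counts, doc_signature_counts,
                           corpus_model, translation, alpha=0.05, lam=0.3)
print(p_car)
```

With these toy numbers the simple model alone gives "car" only the small background probability, while the translation component lifts it substantially, so a query for "car" can match a document that only says "auto".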

Page 26

Text Retrieval Experiments

• Collections
– TREC Genomics Track 2004 and 2005
– Use sub-collection
– 2004: 48,753 documents
– 2005: 41,018 documents
• Measures:
– Mean Average Precision (MAP), Recall
• Settings
– Simple language model as the baseline
– Use concept pairs as topic signatures
– Background coefficient: 0.05
– Pseudo-relevance feedback: top 50 documents, expand 10 terms
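For readers unfamiliar with the MAP measure, the sketch below shows the standard definition: average precision is the mean of the precision values at the ranks where relevant documents appear, divided by the total number of relevant documents, and MAP averages this over queries. The function names and the toy rankings are hypothetical; the deck does not say which evaluation tool was used.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision for one query, normalised by the number of relevant documents."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document's rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0


def mean_average_precision(runs):
    """Mean of the per-query average precision values; runs is a list of (ranking, relevant set)."""
    return sum(average_precision(ranking, relevant) for ranking, relevant in runs) / len(runs)


# Two toy queries (hypothetical rankings and relevance judgments).
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6"], {"d6"}),                    # AP = (1/2) / 1
]
map_score = mean_average_precision(runs)
print(round(map_score, 4))
```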


Page 28

Baseline Models

Table 1. Comparison of the baseline language model to the Okapi model. The Okapi formula is the same as the one in [10]. The number of relevant documents for TREC04 and TREC05 are 8266 and 4585, respectively. The asterisk indicates the initial query is weighted.

| Collection | Recall (SLM) | Recall (Okapi) | Change | MAP (SLM) | MAP (Okapi) | Change |
|---|---|---|---|---|---|---|
| TREC04 | 6411 | 6662 | +3.9% | 0.345 | 0.363 | +5.2% |
| TREC04* | 6527 | 6704 | +2.7% | 0.364 | 0.364 | +0.0% |
| TREC05 | 4084 | 4124 | +1.0% | 0.255 | 0.250 | -2.0% |
| TREC05* | 4135 | 4134 | -0.0% | 0.260 | 0.254 | -2.3% |

Page 29

Experiment Results

Table 2. The comparison of the baseline language model (DM0) to document smoothing model (DM2) and query smoothing model (FM1). Parameter settings shown on the slide: λ=0.3, γ=0.6.

| Collection | Measure | DM0 | DM2 | Change | FM1 | Change |
|---|---|---|---|---|---|---|
| TREC04 | MAP | 0.345 | 0.395 | +14.5% | 0.451 | +30.9% |
| TREC04 | Recall | 6411 | 6749 | +5.3% | 6929 | +8.0% |
| TREC04* | MAP | 0.364 | 0.414 | +13.7% | 0.460 | +26.9% |
| TREC04* | Recall | 6527 | 6905 | +5.8% | 7039 | +7.8% |
| TREC05 | MAP | 0.255 | 0.277 | +8.6% | 0.279 | +9.4% |
| TREC05 | Recall | 4084 | 4167 | +2.0% | 4227 | +3.5% |
| TREC05* | MAP | 0.260 | 0.288 | +10.8% | 0.287 | +10.4% |
| TREC05* | Recall | 4135 | 4214 | +1.9% | 4235 | +2.4% |

Page 30

Context-sensitive vs. Context-insensitive

Table 3. Comparison of the context-sensitive semantic smoothing (DM2) to the context-insensitive semantic smoothing (DM2’) on MAP. The rightmost column is the change of DM2 over DM2’.

| Collection | DM0 MAP | DM2’ MAP | DM2’ Change | DM2 MAP | DM2 Change | Change (DM2 over DM2’) |
|---|---|---|---|---|---|---|
| TREC04 | 0.346 | 0.367 | +6.1% | 0.395 | +14.5% | +7.6% |
| TREC04* | 0.364 | 0.384 | +5.5% | 0.414 | +13.7% | +7.8% |
| TREC05 | 0.255 | 0.260 | +2.0% | 0.277 | +8.6% | +6.5% |
| TREC05* | 0.260 | 0.269 | +3.5% | 0.288 | +10.8% | +7.1% |

• The context-sensitive semantic smoothing approach performs significantly better than context-insensitive semantic smoothing approaches.

Page 31

Relevant Publications

1. Hu X., Xu X., Mining Novel Connections from Online Biomedical Databases Using Semantic Query Expansion and Semantic-Relationship Pruning, International Journal of Web and Grid Services, 1(2), 2005, pp. 222-239
2. Zhou X., Hu X., Zhang X., Topic Signature Language Models for Ad-hoc Retrieval, in the IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE), September 2007
3. Song M., Song I-Y., Hu X., Allen B., Integration of Association Rules and Ontology for Semantic-based Query Expansion, in the Journal of Data & Knowledge Engineering
4. Zhou X., Hu X., Zhang X., Lin X., Song I-Y., Context-Sensitive Semantic Smoothing for the Language Modeling Approach to Genomic IR, in the Proc. of the 29th Annual International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR 2006)
5. Zhou X., Zhang X., Hu X., Semantic Smoothing of Document Models for Agglomerative Clustering, accepted in the Twentieth International Joint Conference on Artificial Intelligence (IJCAI 07), Hyderabad, India, Jan 6-12, 2007
6. Zhang X., Hu X., Zhou X., A Comparative Evaluation of Different Link Types on Enhancing Document Clustering, accepted in the 31st Annual International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR 2008)

Page 32

SPIE: Scalable and Portable Information Extraction

• The scalable and portable information extraction system (SPIE) is influenced by the idea of DIPRE introduced by Brin [Brin, 1998].
• The goal is to develop an efficient and portable information extraction system that automatically extracts various biological relationships from online biomedical literature with little or no human intervention.

Page 33

SPIE: Scalable and Portable Information Extraction

• The main ideas of SPIE:
– Automatic query generation and query expansion for effective search and retrieval from text databases
– Dual reinforcement information extraction for pattern generation and tuple extraction
– Scales well to huge collections of text files because it does not need to scan every text file

Page 34

SPIE: Scalable and Portable Information Extraction

[Figure: SPIE system diagram. Labels shown: Initial seed tuples; Search Engine; Biomedical Literature DB; Queries; Set of Documents; Automatic Query Generation & Document Categorization; Automatic Categorization of Documents; Data Mining to generate rules from categorized documents; Query List; Mutual Reinforcement of Pattern Generation - Instance Extraction; Extract text segment of interest; Find occurrence of seed tuples; Generate extraction pattern and store it in pattern base; Pattern Base; New instance extraction based on pattern matching; Instance Relation; tuples generated from IE.]

Page 35

SPIE (Scalable & Portable IE)

SPIE takes the following steps:

1. Starting with a set of user-provided seed tuples, SPIE retrieves a sample of documents from the biomedical literature library.
– The set of seed tuples can be quite small; normally 5 to 10 tuples are enough.
– SPIE constructs simple queries from the attribute values of the initial seed tuples and uses the search engine to retrieve a document sample of a pre-defined size.

Page 36

SPIE (Scalable & Portable IE)

2. The tuple set induces a binary partition (a split) on the documents:
– those that contain tuples and those that do not contain any tuple from the relation.
– The documents are thus labeled automatically as either positive or negative examples, respectively.
– The positive examples are the documents that contain at least one tuple.
– The negative examples are the documents that contain no tuples.
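Steps 1 and 2 can be illustrated with a minimal Python sketch. It is not the SPIE implementation: the function names, the "ProteinA"/"ProteinB" tuple, the sample documents, and the rule that a tuple "occurs" when all of its attribute values appear as substrings of the document are assumptions made for illustration only.

```python
def seed_queries(seed_tuples):
    """Build one simple conjunctive query per seed tuple from its attribute values."""
    return [" AND ".join(t) for t in seed_tuples]


def label_documents(documents, tuples):
    """Split documents into positive (contain at least one tuple) and negative."""
    positive, negative = [], []
    for doc in documents:
        text = doc.lower()
        # Assumption: a tuple occurs if every attribute value appears in the text.
        hit = any(all(value.lower() in text for value in t) for t in tuples)
        (positive if hit else negative).append(doc)
    return positive, negative


# Hypothetical seed tuples and documents, for illustration only.
seeds = [("HP1", "HDAC4"), ("ProteinA", "ProteinB")]
docs = [
    "HP1 interacts with HDAC4 in the two-hybrid system.",
    "Cell cycle regulation in yeast.",
    "ProteinA was expressed, but no partner was identified.",
]
queries = seed_queries(seeds)
positive, negative = label_documents(docs, seeds)
```

Only the first document contains both members of a seed tuple, so it is the single positive example; the other two become negative examples.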

Page 37

Query Generation/Expansion for Document Retrieval

STEP 3 consists of two stages:
– converting the positive and negative examples into an appropriate representation for training
– running the data mining algorithms on the training examples to generate a set of rules, then converting the rules into an ordered list of queries expected to retrieve new useful documents

Page 38

Query Generation/Expansion for Document Retrieval

Page 39

Query Generation/Expansion for Document Retrieval

• In STEP 3, three data mining algorithms are used for rule generation: Ripple, CBA & DB-Deci
• The rules are ranked by their Laplace measure
• The top 10% of rules are converted into a query list

Positive IF WORDS ~ protein AND binding
Positive IF WORDS ~ cell AND function

Query 1: protein AND binding
Query 2: cell AND function
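The ranking and conversion step can be sketched as follows. The slides do not give the exact formula, so this sketch assumes the common Laplace accuracy estimate (p + 1) / (n + k), where a rule covers n training documents, p of them positive, and k = 2 classes. The rule dictionaries and all counts below are made up for illustration.

```python
import math


def laplace(pos_covered, total_covered, n_classes=2):
    """Laplace accuracy estimate of a rule (assumed form: (p + 1) / (n + k))."""
    return (pos_covered + 1) / (total_covered + n_classes)


def rules_to_queries(rules, top_fraction=0.10):
    """Rank rules by Laplace measure and turn the top fraction into AND queries."""
    ranked = sorted(rules, key=lambda r: laplace(r["pos"], r["covered"]), reverse=True)
    keep = max(1, math.ceil(len(ranked) * top_fraction))
    return [" AND ".join(r["words"]) for r in ranked[:keep]]


# Hypothetical rules with made-up coverage counts.
rules = [
    {"words": ["cell", "function"], "pos": 30, "covered": 35},
    {"words": ["study"], "pos": 50, "covered": 100},
    {"words": ["protein", "binding"], "pos": 40, "covered": 42},
]
queries = rules_to_queries(rules, top_fraction=0.5)
```

With only three toy rules, a fraction of 0.5 is used here so that two queries survive; the slides use the top 10% of a much larger rule set.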

Page 40

Pattern Generation

• A pattern is
– a 5-tuple <prefix, entity_tag1, infix, entity_tag2, suffix>
– prefix, infix, and suffix are vectors associating weights with terms.
– prefix is the part of the sentence before entity1,
– infix is the part of the sentence between entity1 and entity2,
– suffix is the part of the sentence after entity2.

"HP1 interacts with HDAC4 in the two-hybrid system…"

{ "", <Protein>, "interacts with", <Protein>, "" }
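A minimal sketch of building such a 5-tuple from a sentence and two tagged entities is shown below. The function name and the use of raw term counts as the weights are assumptions; the slides do not say how the weights are computed. Unlike the slide's example pattern, which leaves the suffix empty, this sketch keeps every term that follows the second entity.

```python
from collections import Counter


def term_weights(text):
    """Assumed weighting: raw counts of lower-cased, whitespace-separated terms."""
    return Counter(text.lower().split())


def make_pattern(sentence, entity1, entity2, tag="<Protein>"):
    """Return (prefix, tag1, infix, tag2, suffix) around the two entity mentions."""
    start1 = sentence.index(entity1)
    end1 = start1 + len(entity1)
    start2 = sentence.index(entity2, end1)
    end2 = start2 + len(entity2)
    return (
        term_weights(sentence[:start1]),      # before entity1
        tag,
        term_weights(sentence[end1:start2]),  # between entity1 and entity2
        tag,
        term_weights(sentence[end2:]),        # after entity2
    )


pattern = make_pattern(
    "HP1 interacts with HDAC4 in the two-hybrid system", "HP1", "HDAC4"
)
```

Matching a stored pattern against a new sentence would then compare the three term vectors, for example with a cosine similarity; that step is not shown here.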

Page 41

Pattern Matching

Page 42

Experiment

• Keyword-based vs. SPIE
• Keyword-based experiment
– Input:
o around 7000 protein names (expanded from 1600 protein names using protein synonyms)
o 23 keywords
o 1.5 million abstracts (obtained by searching PubMed with those keywords)

• SPIE experiment
– Input:
o Only 10 protein-protein interaction (PPI) pairs
– Maximum number of documents used in each iteration is 10k
– Starting with 50k documents and stopping at 500k documents

Page 43

Experiment

[Figure panels: SPIE; Keyword based]

Page 44

Experiment

Experiment      Abstracts used   # of distinct PPI
Keyword base    1,444,002        9,980
SPIE            500k             9,483

• SPIE has a significant performance advantage over the keyword-based approach: it finds 9,483 distinct PPIs, about 95% of the 9,980 found by the keyword-based approach, while using roughly a third as many abstracts (500k vs. 1,444,002).

Page 45

Chromatin Protein Network

Page 46

Biomolecular Network Analysis

• Biomolecular networks dynamically respond to stimuli and implement cellular function
• Understanding these dynamic changes is the key challenge for cell biologists
• As biomolecular networks grow in size and complexity, computer simulation becomes an essential tool for understanding biomolecular network models
• A sub-network executes a specific cellular function and deserves to be studied

Page 47

Biomolecular Network Analysis

• Our method consists of two steps.
• First, a novel scale-free network clustering approach is applied to the biomolecular network to obtain various sub-networks.
• Second, computational models are generated for the sub-networks and simulated to predict their behavior in the cellular context.
• We discuss and evaluate three advanced computational models: state-space model, probabilistic Boolean network model, and fuzzy logic model.

Page 48

Mining the large-scale biomolecular network (1)

Main Algorithm SNBuilder(G, s, f, d)
1: G(V, E) is the input graph with vertex set V and edge set E.
2: s is the seed vertex; f is the affinity threshold; d is the distance threshold.
3: N ← {Adjacency list of s} ∪ {s}
4: C ← FindCore(N)
5: C' ← ExpandCore(C, f, d)
6: return C'

Page 49

Mining a large-scale biomolecular network (2)

Sub-Algorithm FindCore(N)
8: for each v ∈ N
9: calculate k_v^in(N)
10: end for
11: K_min ← min { k_v^in(N), v ∈ N }
12: K_max ← max { k_v^in(N), v ∈ N }
13: if K_min = K_max or (k_i^in(N) = k_j^in(N), ∀ i, j ∈ N, i, j ≠ s, i ≠ j) then return N
14: else return FindCore(N – {v}, k_v^in(N) = K_min)

Page 50

Mining a large-scale biomolecular network (3)

Sub-Algorithm ExpandCore(C, f, d)
16: D ← ∪ {v, w} over all (v, w) ∈ E with v ∈ C, w ∉ C
17: C' ← C
18: for each t ∈ D, t ∉ C, and distance(t, s) <= d
19: calculate k_t^in(D)
20: calculate k_t^out(D)
21: if k_t^in(D) > k_t^out(D) or k_t^in(D)/|D| > f then C' ← C' ∪ {t}
22: end for
23: if C' = C then return C
24: else return ExpandCore(C', f, d)
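The three routines can be sketched in Python as below. This is an illustrative reading of the pseudocode, not the authors' code, and it rests on several assumptions: the graph is an undirected adjacency dictionary; k^in(v, S) counts edges from v to other vertices inside S and k^out(v, S) counts edges from v to vertices outside S; distance is the hop count from the seed; FindCore never removes the seed and removes all minimum-degree vertices in one pass; and the recursion is written as loops.

```python
from collections import deque


def k_in(graph, v, group):
    """Edges from v to other vertices inside group (assumed definition)."""
    return sum(1 for u in graph[v] if u in group and u != v)


def k_out(graph, v, group):
    """Edges from v to vertices outside group (assumed definition)."""
    return sum(1 for u in graph[v] if u not in group)


def hop_distances(graph, s):
    """Breadth-first hop counts from the seed s."""
    dist, queue = {s: 0}, deque([s])
    while queue:
        v = queue.popleft()
        for u in graph[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


def find_core(graph, s, nodes):
    """Drop minimum in-degree vertices until all non-seed vertices tie."""
    nodes = set(nodes)
    while True:
        degree = {v: k_in(graph, v, nodes) for v in nodes if v != s}
        if not degree or min(degree.values()) == max(degree.values()):
            return nodes
        k_min = min(degree.values())
        nodes -= {v for v, k in degree.items() if k == k_min}


def expand_core(graph, s, core, f, d):
    """Grow the core with nearby vertices that attach strongly to its boundary set D."""
    dist, core = hop_distances(graph, s), set(core)
    while True:
        D = set()
        for v in core:
            for w in graph[v]:
                if w not in core:
                    D.update((v, w))
        grown = set(core)
        for t in D - core:
            if dist.get(t, float("inf")) > d:
                continue
            kin, kout = k_in(graph, t, D), k_out(graph, t, D)
            if kin > kout or kin / len(D) > f:
                grown.add(t)
        if grown == core:
            return core
        core = grown


def sn_builder(graph, s, f, d):
    """SNBuilder: core of the seed's neighborhood, then expansion."""
    neighborhood = set(graph[s]) | {s}
    return expand_core(graph, s, find_core(graph, s, neighborhood), f, d)


# Hypothetical toy graph: triangle a-b-c, pendant x on a, chain c-d-e.
graph = {
    "a": {"b", "c", "x"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e"}, "e": {"d"}, "x": {"a"},
}
strict = sn_builder(graph, "a", f=0.6, d=2)
loose = sn_builder(graph, "a", f=0.4, d=2)
```

On the toy graph, the core of the seed "a" is the triangle {a, b, c}. Expansion re-admits the pendant vertex x, because all of its edges point into D, and admits d only when the affinity threshold f is low enough.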

Page 51

Experiment Results
Promising Protein-Protein Interaction clusters

Page 52

Experiment Results

Fig 1 A sub-network obtained using the algorithm

Page 53

State-space model for simulation (1)

[Diagram: a gene regulatory network as a state-space model. Labels: external inputs; internal variables z1, …, zp (dynamic equations); observed genes x1, x2, …, xn (observation equations).]

Page 54

State-space model for simulation (2)

$$z(t+1) = A\,z(t) + B\,u(t) + n_1(t)$$

$$x(t) = C\,z(t) + n_2(t)$$

• x: gene expression data
• z: internal variables (promoters)
• A: state transition matrix
• B: control (input) matrix
• C: transformation matrix
• n1(t) and n2(t) stand for noise
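How the two equations generate an expression trajectory can be shown with a small noise-free simulation. This is only a sketch: the matrices, the initial state, and the input sequence u(t) are made-up numbers, the noise terms n1 and n2 are set to zero, and estimating A, B, and C from data is not covered.

```python
def mat_vec(M, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def simulate(A, B, C, z0, inputs):
    """Iterate z(t+1) = A z(t) + B u(t) and x(t) = C z(t), with noise set to zero."""
    z, xs = list(z0), []
    for u in inputs:
        xs.append(mat_vec(C, z))  # observation equation
        z = [a + b for a, b in zip(mat_vec(A, z), mat_vec(B, u))]  # dynamic equation
    return xs


# Hypothetical model: two internal variables, one external input, one observed gene.
A = [[0.5, 0.0], [0.0, 0.25]]
B = [[1.0], [0.0]]
C = [[1.0, 2.0]]
xs = simulate(A, B, C, z0=[2.0, 4.0], inputs=[[1.0], [0.0], [0.0]])
```

Because both diagonal entries of A have magnitude below 1, the internal state decays once the input stops, which is the kind of stability property examined for the inferred network.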

Page 55

State-space model for simulation (3)

• Applying the state-space modeling method to gene expression data of the 16 genes in Figure 1, we obtained an inferred gene regulatory network with nine internal variables
• The analysis shows that the inferred network is stable, robust, and periodic
• The model constructed from the training dataset Thy-Thy 3 is used to predict the expression profiles in the testing dataset Thy-Noc; the result is shown in Figure 2

Page 56

State-space model for simulation (4)

Fig 2 Comparison of experimental (solid lines) and predicted (dotted lines) gene expression profiles for DMTF (A), F2 (B), RRM2 (C) and TYR (D)

[Four panels, A to D: Expression level vs. Time; time axis 0 to 20, expression axis -0.5 to 1]

Page 57

Fuzzy logic model for simulation (1)

• The fuzzy biomolecular network model is a collection of rule sets, one for each node (in this case, gene) in the network, governing the response of that node (the output gene) to each fuzzy state of its input genes.
• Fuzzy rule sets are generated for the genes in the sub-network in Figure 1.
• The model constructed from the training dataset Thy-Thy 3 is used to predict the expression profiles in the testing dataset Thy-Noc; the results are shown in Figure 3

Page 58

Fuzzy logic model for simulation (2)

Fig 3 Best fit rule on training set "Thy-Thy 3" predicting gene expression on the test data set (solid line) compared to actual data from the test set "Thy-Noc" (dashed line) for CDK2 (A), BRCA1 (B), EP300 (C), and CDK4 (D)

[Four panels, A to D: Log(Expression Ratio) vs. Time; time axis 0 to 40]

Page 59

Probabilistic Boolean Networks for simulation (1)

• A probabilistic Boolean network (PBN) is a Markov chain capturing transition probabilities among different gene expression states.
• We construct PBNs from the microarray data set "Thy-Thy 3" and use the data set "Thy-Noc" to test the constructed PBNs
• The results are shown in Tables 1 through 3
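The train-then-test procedure can be illustrated with a much simpler stand-in: a first-order Markov chain over the discretized states of a single gene, fitted on one series and scored on another by one-step prediction accuracy. This is a simplification for illustration, not the authors' PBN construction, and both state sequences below are made up.

```python
from collections import Counter, defaultdict


def fit_transitions(states):
    """Estimate P(next state | current state) from consecutive pairs."""
    counts = defaultdict(Counter)
    for current, nxt in zip(states, states[1:]):
        counts[current][nxt] += 1
    return {
        s: {n: c / sum(row.values()) for n, c in row.items()}
        for s, row in counts.items()
    }


def prediction_accuracy(transitions, states):
    """Percent of one-step predictions (most probable next state) that are correct."""
    pairs = list(zip(states, states[1:]))
    hits = 0
    for current, actual in pairs:
        row = transitions.get(current)
        if row and max(row, key=row.get) == actual:
            hits += 1
    return 100.0 * hits / len(pairs)


# Hypothetical 2-state (0 = low, 1 = high) series standing in for training and test data.
train = [0, 1, 1, 1, 0, 1, 1, 0]
test = [0, 1, 1, 0, 1]
transitions = fit_transitions(train)
accuracy = prediction_accuracy(transitions, test)
```

A 3-state discretization would use the same code with states 0, 1, and 2; only the input sequences change.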

Page 60

Probabilistic Boolean Networks for simulation (2)

Gene       DMTF    BRCA1   HIFX    HE      PPP2R4  MYC     NR4A2   F2
2 states   66.67   55.56   77.78   22.22   55.56   44.44   72.22   61.11

Gene       PTEN    RRM2    PLAT    TYR     CAD     CDK2    CDK4    EP300
2 states   72.22   77.78   50.00   55.56   66.67   50.00   66.67   72.22

Table 1: Prediction accuracy (%) based on the given genetic network using 2-state microarray data.

Gene       DMTF    BRCA1   HIFX    HE      PPP2R4  MYC     NR4A2   F2
3 states   50.00   55.56   66.67   16.67   61.11   55.56   55.56   61.11

Gene       PTEN    RRM2    PLAT    TYR     CAD     CDK2    CDK4    EP300
3 states   44.44   72.22   66.67   50.00   61.11   38.89   66.67   61.11

Table 2: Prediction accuracy (%) based on the given genetic network using 3-state microarray data.

Page 61

Probabilistic Boolean Networks for simulation (3)

• To improve the prediction accuracy for HE, MYC and CDK2, we use the developed multivariate Markov chain to model the microarray data set. The results are shown in Table 3

Gene       HE              MYC             CDK2
2 states   55.56 (22.22)   61.11 (44.44)   66.67 (50.00)
3 states   27.78 (16.67)   55.56 (55.56)   38.89 (38.89)

Table 3: Prediction accuracy (%) based on the input genes estimated from the multivariate Markov chain model. Values in parentheses repeat the corresponding accuracies from Tables 1 and 2.

Page 62

Conclusions
• We present a new method for mining and dynamic simulation of sub-networks from a large biomolecular network.
• The presented method applies a scale-free network clustering approach to the biomolecular network to obtain biologically functional sub-networks.
• Three computational models (state-space model, probabilistic Boolean network, and fuzzy logic model) are employed to simulate the sub-network, using time-series gene expression data of the human cell cycle.
• The results indicate that the presented method is promising for mining and simulation of sub-networks.

Page 63

Relevant Publications

1. Hu X., Wu D., Data Mining and Predictive Modeling of Biomolecular Network from Biomedical Literature Databases, in IEEE/ACM Transactions on Computational Biology and Bioinformatics, (April-June 2007), p251-263
2. Hu X., Sokhansanj, Wu, D., Tang Y., A Novel Approach for Mining and Dynamic Fuzzy Simulation of Biomolecular Network, in IEEE Transactions on Fuzzy Systems
3. Hu X., Wu F.X., Ng M., Sokhansanj B., Mining and Dynamic Simulation of Sub-Networks from Large Biomolecular Networks, in 2007 International Conference on Artificial Intelligence, June 25-28, Las Vegas, USA (Best Paper Award, out of 500 submissions)
4. Hu X., Yoo I., Song I-Y., Song M., Han J., Lechner M., Extracting and Mining Protein-Protein Interaction Network from Biomedical Literature, in the Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (IEEE CIBCB 2004), Oct. 7-8, 2004, San Diego, USA, (Best Paper Award), pp 244-251
5. Tang Y.C., Zhang Y-Q, Huang Z., Hu X., and Zhao Y., Recursive Fuzzy Granulation for Gene Subsets Extraction and Cancer Classification, accepted to be published in the IEEE Transactions on Information Technology in Biomedicine

Page 64

Dragon Toolkit
• Software package designed for Language Modeling, Information Retrieval and Text Mining
• Free download: http://www.ischool.drexel.edu/dmbio/dragontool/default.asp
• 500 Java libraries in NLP, Search Engine, Entity Extraction. One of the most popular software packages for Information Retrieval, NLP etc. More than 1500 research groups in the world have downloaded it since Jul 2007

Page 65

Call for Papers
International Journal of Data Mining and Bioinformatics (IJDMB)

Editor-in-Chief: Xiaohua Hu

Inaugural Issue: July 2006

SCI Indexed: Oct, 2007

Page 66

Call For Participation
2008 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 08)

Philadelphia, USA, Nov 3-5, 2008

IEEE BIBM Steering Committee Chair: Xiaohua Hu

Page 67

My Ph.D. Students and Joint Ph.D. Students with Chinese Universities
1. Illhoi Yoo (graduated in 2006, tenure-track assistant professor at Univ. of Missouri-Columbia)
2. Xiaodan Zhang (4th year Ph.D. student, Text and Web Data Mining, Digital Library, Bioterrorism)
3. Daniel Wu (5th year Ph.D. student, Data Mining and Biomolecular Network Analysis)
4. Xuheng Xu (4th year Ph.D. student, Semantic-based Query Optimization and Intelligent Searching)
5. Davis Zhou (5th year Ph.D. student, Semantic-based Information Extraction and Retrieval)
6. Palakorn Achananuparp (4th year Ph.D. student, Text Mining)
7. Deima Elnatour (4th year Ph.D. student, Semantic-based Text Mining)
8. Guisu Li (2nd year Ph.D. student, Healthcare Informatics)
9. Zhong Huang (2nd year Ph.D. student, Bioinformatics, Computational Biology)
10. Xin Chen (fresh Ph.D. student, USTC)
11. Xiaoshi Yin (joint Ph.D. student with Prof. Zhoujun Li from BAUU)
12. Min Xu (joint Ph.D. student with Prof. Shuigeng Zhou from Fudan University)
13. Yaoyu Zuo (joint Ph.D. student with Prof. Ying Tong from Zhongshan University)

Page 68

Acknowledgements

• PI: NSF CAREER: A Unified Architecture for Data Mining Large Biomedical Literature Databases (NSF CAREER IIS 0448023, $415K, 03/15/2005-02/28/2010)
• PI: High Performance Rough Sets Data Analysis in Data Mining (NSF CCF 0514679, $102K, 08/01/2005-07/31/2008)
• Co-PI: The Drexel University GAANN Fellowship Program: Educating Renaissance Engineers (US Dept. of Education, 9/1/2006 to 8/31/2009, around $700K)
• Co-PI: Penn State Cancer Education Network Evaluation (PA Dept. of Health, 04/25/2006-07/31/2010, $1.2M)
• Co-PI: Center for Public Health Readiness and Communication (PA Dept. of Health, 08/01/2004-08/31/2007, $1.5M)
• Co-PI: Origin and Evolution of Genomic Instability in Breast Cancer (PA Dept. of Health, $100K, 05/01/2004-04/30/2005)
• Co-PI: Systems Biology Approach to Understanding Protein-Protein Interactions (PA Dept. of Health, $100K, 05/01/2004-04/30/2005)

Page 69

Thanks for your attention

Any comments or questions?