TRANSCRIPT
Text Mining and Its Application in Bioinformatics
Xiaohua Tony Hu
College of Information Science & Technology
Drexel University, USA
2
Agenda
• Introduction
• Problems of Biomedical Literature Mining Approaches
• Related Works
• Our System: Bio-SET-DM
• Sub-Network Modeling, Simulation and Evaluation
• Conclusion and Future Studies
3
Biomedical Literature Mining
• Much biomedical and bioinformatics knowledge, along with experimental results, is published only in text documents, and these documents are collected in online digital libraries/databases (Medline, PubMedCentral, BioMedCentral).
• How big is Medline?
  – Abstracts from more than 4,800 journals, with over 16 million abstracts
  – Over 10,000 papers are added per week
4
Introduction
The exploding number of PubMed articles over the years

[Figure: MEDLINE size (# of articles) by year, 1950–2006 (Apr.), growing from near zero to over 16,000,000.]
5
Introduction
• How to solve the information overload of biomedical literature?
  – developing scalable searching & mining methods
  – integrating information extraction and data mining methods to automatically
    o search & retrieve biomedical literature efficiently and effectively
    o extract the results into a structured format
    o mine important biological relationships
6
Major Issues in Biomedical Literature Mining
• Huge numbers of documents
• Lack of structure
• Many subdomains
• Many aliases and typographical variants for most biomedical objects
• Abbreviations, synonyms, polysemy, etc.
7
The General Text Mining View
1. Select what to read (Information Retrieval)
2. Identify important entities and relations between those entities (Information Extraction)
3. Combine this new information with other documents and other knowledge into a database
4. Mine the extracted results (Data Mining)
8
Issues in Current Information Retrieval (IR)
• Keyword-based retrieval returns many irrelevant documents and misses many relevant ones
• Query Expansion
• Probabilistic Language Modeling
• Ambiguous terms, e.g., mouse, bank, chip, apple
9
Issues in Current Information Extraction (IE)
• Examining every document
  – Doing so against Medline is extremely time-consuming
• Using filters to select promising abstracts for extraction
  – Requires human involvement to maintain and to adapt to new topics or subdisciplines
10
Our Approach: Bio-SET-DM
• Information Retrieval: semantic query expansion (Xiaohua Zhou's Ph.D. thesis)
• Information Extraction: mutual reinforcement learning for automatic pattern learning and tuple extraction (Illhoi Yoo's Ph.D. thesis)
• Text Mining: graph-based text clustering and summarization (Xiaodan Zhang's Ph.D. thesis)
• Bio-SET-DM: Biomedical Literature Searching, Extracting and Text Data Mining
• Biomedical ontologies: UMLS and GO are the glue
11
NSF CAREER: A Unified Architecture for Data Mining Biomedical Literature Databases (US$415K, March 2005 – Feb 2010)

[Architecture diagram: an initial query is expanded by semantic-based query expansion against an ontology base (UMLS, Gene Ontology); information retrieval over a biomedical literature DB (e.g., PubMed) with the resulting query list yields a promising document set; information extraction with automatic pattern & relation generation produces categorized documents; data mining (text clustering, text summarization, keyphrase extraction, rule induction, association algorithms, ...) then yields text clusters, summary reports, and a knowledge base.]
12
Problem Descriptions of IR
• Descriptions
  – Many biomedical literature searches are about relationships between biological entities.
  – The co-occurrence of two keywords does not necessarily mean the two keywords are really related.
  – Explicitly index and search documents with relationships

obesity [TIAB] AND hypertension [TIAB] AND hasabstract [text] AND ("1900"[PDAT] : "2005/03/08"[PDAT])

This query retrieves documents addressing the interaction of obesity and hypertension from PubMed. A ranked hit list of 6,687 documents is returned. We then took the top 100 abstracts for human relevance judgment; unfortunately, as expected, only 33 of them were relevant.
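Queries of this shape can be assembled programmatically; a minimal sketch (the `build_pubmed_query` helper is mine, not part of Bio-SET-DM):

```python
def build_pubmed_query(term1, term2, start="1900", end="2005/03/08"):
    """Restrict both terms to title/abstract [TIAB], require an abstract,
    and bound the publication date with [PDAT], as on the slide."""
    return (f"{term1} [TIAB] AND {term2} [TIAB] AND hasabstract [text] AND "
            f'("{start}"[PDAT] : "{end}"[PDAT])')

print(build_pubmed_query("obesity", "hypertension"))
```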
13
Statistical Language Model
• Statistical language model
  – A probabilistic mechanism for generating text.
• Text generation
  – Suppose the word is the unit of a text (e.g., a document). The text generation process looks as follows:
    • Choose a language model at each step.
    • Generate a word according to the chosen model.
14
Language Modeling and IR
• Example:
  – Document 1 = {(A, 3), (B, 5), (C, 2)}
  – Document 2 = {(A, 4), (B, 1), (C, 5)}
  – Query = {A, B}
  – Which document is more relevant to the query?
    Doc 1: 0.3 × 0.5 = 0.15
    Doc 2: 0.4 × 0.1 = 0.04
    Doc 1 is more relevant to the query than Doc 2.
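The arithmetic above can be reproduced in a few lines (a sketch; the function names are mine):

```python
def mle_model(doc):
    """Maximum-likelihood document model: p(w|d) = count(w, d) / |d|."""
    total = sum(doc.values())
    return {w: c / total for w, c in doc.items()}

def query_likelihood(query, doc):
    """Probability of generating the query from the document model:
    product of unigram probabilities (0 for unseen words)."""
    model = mle_model(doc)
    p = 1.0
    for w in query:
        p *= model.get(w, 0.0)
    return p

doc1 = {"A": 3, "B": 5, "C": 2}
doc2 = {"A": 4, "B": 1, "C": 5}
# Doc 1: 0.3 * 0.5 = 0.15,  Doc 2: 0.4 * 0.1 = 0.04
p1 = query_likelihood(["A", "B"], doc1)
p2 = query_likelihood(["A", "B"], doc2)
```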
15
Why Smoothing?
• Avoid zero probability
  – Document 1 = {(A, 3), (B, 5), (C, 2)}
  – Document 2 = {(A, 4), (B, 1), (C, 5)}
  – Query = {A, D}
  – Which document is more relevant to the query?
    Doc 1: 0.3 × 0 = 0
    Doc 2: 0.4 × 0 = 0
    Obviously, this result is not reasonable.
16
Why Smoothing?
• Discount high-frequency terms: stop words (e.g., the, a, an, you...) frequently occur in documents, so under the Maximum Likelihood Estimate (MLE) their generative probability will be very high. However, stop words are obviously trivial to those documents.
• Assign reasonable probability to unseen words (data sparsity)
  – Test words may not appear in the training corpus.
  – An effective smoothing method is needed, especially one that incorporates the semantic relationship between test words and training words into the model.
  – Example: a document containing "auto" for the query "car" in a text retrieval task.
    • With Laplacian smoothing or background smoothing, the document will not be returned for the query.
    • With semantic smoothing, the document will be returned for the query.
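Background (Jelinek-Mercer) smoothing interpolates the document model with a corpus model, which fixes the zero-probability problem but not the "auto"/"car" vocabulary mismatch. A sketch with an assumed background coefficient (0.05, matching the experimental settings later):

```python
def smoothed_model(doc, corpus, beta=0.05):
    """Jelinek-Mercer smoothing: p(w|d) = (1-beta)*p_ml(w|d) + beta*p(w|C).
    beta is the background coefficient (an assumed value here)."""
    d_total = sum(doc.values())
    c_total = sum(corpus.values())
    vocab = set(doc) | set(corpus)
    return {w: (1 - beta) * doc.get(w, 0) / d_total
               + beta * corpus.get(w, 0) / c_total
            for w in vocab}

doc1 = {"A": 3, "B": 5, "C": 2}
corpus = {"A": 7, "B": 6, "C": 7, "D": 4}   # collection counts, incl. "D"
model = smoothed_model(doc1, corpus)
# "D" is unseen in doc1 but now gets nonzero probability
# from the background model: p(D|d1) = 0.05 * 4/24
```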
17
LM and IR
• Steps:
  – Estimate the word distribution for each document, i.e., p(w|d_i), also referred to as the document language model or document model.
  – Compute the probability of generating the query according to each document model.
  – Rank all documents in the collection by their query-generating probabilities.
18
Language Modeling IR Formalism
• LM views IR as a process of word sampling from the document: the higher the probability of generating the query, the more relevant the document is to the query (Ponte and Croft 1998).

\log \frac{p(r \mid Q, D)}{p(\bar{r} \mid Q, D)} \overset{rank}{=} \log p(Q \mid D, r) + \log \frac{p(D \mid r)}{p(D \mid \bar{r})} \overset{rank}{=} \log p(Q \mid D, r)

The formula is from (Lafferty and Zhai 2002).
19
Context-Sensitive Semantic Smoothing (Our Approach)
• Definition
  – Like the statistical translation model, term semantic relationships are used for model smoothing.
  – Unlike the statistical translation model, contextual and sense information is considered.
• Method
  – Decompose a document into a set of context-sensitive topic signatures and then statistically translate topic signatures into individual words.
20
Topic Signatures
• Concept Pairs
  – A pair of two concepts that are semantically and syntactically related to each other
  – Examples: computer and mouse, hypertension and obesity
  – Extraction: ontology-based approach (Zhou et al. 2006, SIGIR)
• Multiword Phrases
  – Examples: Space Program, Star War, White House
  – Extraction: Xtract (Smadja 1993)
21
Translation Probability Estimate
• Method
  – Use co-occurrence counts (topic signature and individual words)
  – Use a mixture model to remove noise from topic-free general words

p(w \mid D_k) = (1-\alpha)\, p(w \mid t_k) + \alpha\, p(w \mid C)

where D_k denotes the set of documents containing the topic signature t_k, and the parameter α is the coefficient controlling the influence of the corpus model in the mixture model.

[Figure 2. Illustration of document indexing. Vt, Vd and Vw are the topic signature set, document set and word set, respectively.]
22
Translation Probability Estimate
• Log likelihood of generating D_k:

\log p(D_k \mid C) = \sum_{w} c(w, D_k)\, \log\big[(1-\alpha)\, p(w \mid t_k) + \alpha\, p(w \mid C)\big]

where c(w, D_k) is the document frequency of term w in D_k, i.e., the co-occurrence count of w and t_k in the whole collection.
• EM for estimation:

\hat{p}^{(n)}(w) = \frac{(1-\alpha)\, p^{(n)}(w \mid t_k)}{(1-\alpha)\, p^{(n)}(w \mid t_k) + \alpha\, p(w \mid C)}

p^{(n+1)}(w \mid t_k) = \frac{c(w, D_k)\, \hat{p}^{(n)}(w)}{\sum_{i} c(w_i, D_k)\, \hat{p}^{(n)}(w_i)}
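The EM updates above can be sketched directly on toy data (an illustration, not the authors' implementation; α and the counts are assumed):

```python
def estimate_translation_model(counts, corpus_model, alpha=0.5, iters=50):
    """EM estimation of p(w|t_k) for one topic signature t_k.

    counts[w]    : c(w, D_k), document frequency of w in D_k
    corpus_model : p(w|C), the background model
    alpha        : weight of the corpus model in the mixture
    Each word in D_k is assumed drawn from the mixture
    (1-alpha)*p(w|t_k) + alpha*p(w|C); EM recovers p(w|t_k).
    """
    total = sum(counts.values())
    p = {w: c / total for w, c in counts.items()}       # initialize with MLE
    for _ in range(iters):
        # E-step: posterior probability that w came from the topic model
        post = {w: (1 - alpha) * p[w] /
                   ((1 - alpha) * p[w] + alpha * corpus_model[w])
                for w in counts}
        # M-step: re-estimate p(w|t_k) from the fractional counts
        norm = sum(counts[w] * post[w] for w in counts)
        p = {w: counts[w] * post[w] / norm for w in counts}
    return p

counts = {"space": 40, "shuttle": 10, "the": 50}          # toy c(w, D_k)
corpus = {"space": 0.01, "shuttle": 0.005, "the": 0.985}  # "the" is a stop word
p_topic = estimate_translation_model(counts, corpus)
# The topic model concentrates mass on topical words and suppresses "the".
```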
23
Contrasting Translation Example
Space: space 0.245; shuttle 0.057; launch 0.053; flight 0.042; air 0.035; program 0.031; center 0.030; administration 0.026; develop 0.025; like 0.023; look 0.022; world 0.020; director 0.020; plan 0.018; release 0.017; problem 0.017; work 0.016; place 0.016; mile 0.015; base 0.014;
Program: program 0.193; washington 0.026; congress 0.026; administration 0.024; need 0.024; billion 0.023; develop 0.023; bush 0.020; plan 0.020; money 0.020; problem 0.020; provide 0.020; writer 0.018; d 0.018; help 0.018; work 0.017; president 0.017; house 0.017; million 0.016; increase 0.016;
Space Program: space 0.101; program 0.071; NASA 0.048; shuttle 0.043; astronaut 0.041; launch 0.040; mission 0.038; flight 0.037; earth 0.037; moon 0.035; orbit 0.032; satellite 0.031; Mar 0.030; explorer 0.028; station 0.028; rocket 0.027; technology 0.026; project 0.025; science 0.023; budget 0.023;
24
Topic Signature LM
• Basic Idea
  – Linearly interpolate the topic-signature-based translation model with a simple language model.
  – The document expansions based on context-sensitive semantic smoothing will be very specific.
  – The simple language model can capture the points the topic signatures miss.

p(w \mid d) = (1-\lambda)\, p_b(w \mid d) + \lambda\, p_t(w \mid d)

where the translation coefficient λ controls the influence of the translation component in the mixture model.
25
Topic Signature LM
• The Simple Language Model:

p_b(w \mid d) = (1-\beta)\, p_{ml}(w \mid d) + \beta\, p(w \mid C)

• The Topic Signature Translation Model:

p_t(w \mid d) = \sum_{k} p(w \mid t_k)\, p_{ml}(t_k \mid d)

p_{ml}(t_k \mid d) = \frac{c(t_k, d)}{\sum_{i} c(t_i, d)}

where c(t_i, d) is the frequency of topic signature t_i in document d, and β is the background coefficient.
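Putting the two components together, scoring a word under the topic signature LM can be sketched as follows (toy models; λ and the distributions are invented for illustration):

```python
def topic_signature_lm(w, doc_signatures, simple_model, translation, lam=0.3):
    """p(w|d) = (1-lam)*p_b(w|d) + lam*p_t(w|d), where
    p_t(w|d) = sum_k p(w|t_k) * p_ml(t_k|d).

    doc_signatures : {t_k: c(t_k, d)}, topic-signature counts in d
    simple_model   : p_b(w|d), the smoothed simple language model
    translation    : {t_k: {w: p(w|t_k)}}, translation probabilities
    """
    total = sum(doc_signatures.values())
    p_t = sum(translation[t].get(w, 0.0) * c / total
              for t, c in doc_signatures.items())
    return (1 - lam) * simple_model.get(w, 0.0) + lam * p_t

# Toy example: one signature ("space", "program") dominating the document
translation = {("space", "program"): {"nasa": 0.048, "shuttle": 0.043}}
doc_sigs = {("space", "program"): 3}
p_b = {"space": 0.1, "program": 0.07}
# "nasa" is unseen in the document but reachable through the signature:
p = topic_signature_lm("nasa", doc_sigs, p_b, translation)
```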
26
Text Retrieval Experiments
• Collections
  – TREC Genomics Track 2004 and 2005 (sub-collections)
  – 2004: 48,753 documents
  – 2005: 41,018 documents
• Measures
  – Mean Average Precision (MAP), Recall
• Settings
  – Simple language model as the baseline
  – Use concept pairs as topic signatures
  – Background coefficient: 0.05
  – Pseudo-relevance feedback: top 50 documents, expand 10 terms
28
Baseline Models
Table 1. Comparison of the baseline language model to the Okapi model. The Okapi formula is the same as the one in [10]. The numbers of relevant documents for TREC04 and TREC05 are 8,266 and 4,585, respectively. The asterisk indicates the initial query is weighted.

Collection | SLM Recall | Okapi Recall | Change | SLM MAP | Okapi MAP | Change
TREC04     | 6411       | 6662         | +3.9%  | 0.345   | 0.363     | +5.2%
TREC04*    | 6527       | 6704         | +2.7%  | 0.364   | 0.364     | +0.0%
TREC05     | 4084       | 4124         | +1.0%  | 0.255   | 0.250     | -2.0%
TREC05*    | 4135       | 4134         | -0.0%  | 0.260   | 0.254     | -2.3%
29
Experiment Results
Table 2. Comparison of the baseline language model (DM0) to the document smoothing model (DM2, λ=0.3) and the query smoothing model (FM1, γ=0.6).

Collection |        | DM0   | DM2   | Change | FM1   | Change
TREC04     | MAP    | 0.345 | 0.395 | +14.5% | 0.451 | +30.9%
           | Recall | 6411  | 6749  | +5.3%  | 6929  | +8.0%
TREC04*    | MAP    | 0.364 | 0.414 | +13.7% | 0.460 | +26.9%
           | Recall | 6527  | 6905  | +5.8%  | 7039  | +7.8%
TREC05     | MAP    | 0.255 | 0.277 | +8.6%  | 0.279 | +9.4%
           | Recall | 4084  | 4167  | +2.0%  | 4227  | +3.5%
TREC05*    | MAP    | 0.260 | 0.288 | +10.8% | 0.287 | +10.4%
           | Recall | 4135  | 4214  | +1.9%  | 4235  | +2.4%
30
Context-sensitive vs. Context-insensitive
Table 3. Comparison of the context-sensitive semantic smoothing (DM2) to the context-insensitive semantic smoothing (DM2') on MAP. The rightmost column is the change of DM2 over DM2'.

Collection | DM0 MAP | DM2' MAP | Change | DM2 MAP | Change | DM2 vs. DM2'
TREC04     | 0.346   | 0.367    | +6.1%  | 0.395   | +14.5% | +7.6%
TREC04*    | 0.364   | 0.384    | +5.5%  | 0.414   | +13.7% | +7.8%
TREC05     | 0.255   | 0.260    | +2.0%  | 0.277   | +8.6%  | +6.5%
TREC05*    | 0.260   | 0.269    | +3.5%  | 0.288   | +10.8% | +7.1%

• The context-sensitive semantic smoothing approach performs significantly better than context-insensitive semantic smoothing approaches.
31
Relevant Publications
1. Hu X., Xu X., Mining Novel Connections from Online Biomedical Databases Using Semantic Query Expansion and Semantic-Relationship Pruning, International Journal of Web and Grid Services, 1(2), 2005, pp. 222-239
2. Zhou X., Hu X., Zhang X., Topic Signature Language Models for Ad-hoc Retrieval, IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE), September 2007
3. Song M., Song I-Y., Hu X., Allen B., Integration of Association Rules and Ontology for Semantic-based Query Expansion, Journal of Data & Knowledge Engineering
4. Zhou X., Hu X., Zhang X., Lin X., Song I-Y., Context-Sensitive Semantic Smoothing for the Language Modeling Approach to Genomic IR, in Proc. of the 29th Annual International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR 2006)
5. Zhou X., Zhang X., Hu X., Semantic Smoothing of Document Models for Agglomerative Clustering, in the Twentieth International Joint Conference on Artificial Intelligence (IJCAI 07), Hyderabad, India, Jan 6-12, 2007
6. Zhang X., Hu X., Zhou X., A Comparative Evaluation of Different Link Types on Enhancing Document Clustering, in the 31st Annual International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR 2008)
32
SPIE: Scalable and Portable Information Extraction
• The Scalable and Portable Information Extraction (SPIE) system is influenced by the idea of DIPRE introduced by Brin [Brin, 1998].
• The goal is to develop an efficient and portable information extraction system that automatically extracts various biological relationships from online biomedical literature with little or no human intervention.
33
SPIE: Scalable and Portable Information Extraction
• The main ideas of SPIE:
  – Automatic query generation and query expansion for effective search and retrieval from text databases
  – Dual reinforcement information extraction for pattern generation and tuple extraction
  – Scales well to huge collections of text files because it does not need to scan every text file
34
SPIE: Scalable and Portable Information Extraction

[Architecture diagram: starting from initial seed tuples, SPIE issues automatically generated queries to a search engine over the biomedical literature DB; retrieved documents are automatically categorized; mutual reinforcement of pattern generation and instance extraction finds occurrences of seed tuples, extracts text segments of interest, generates extraction patterns into a pattern base, and extracts new instances by pattern matching into the instance relation; data mining generates rules from the categorized documents to produce a new query list.]
35
SPIE (Scalable & Portable IE)
SPIE takes the following steps:
1. Starting with a set of user-provided seed tuples, SPIE retrieves a sample of documents from the biomedical literature library.
  – the seed tuple set can be quite small; normally 5 to 10 tuples are enough
  – simple queries are constructed from the attribute values of the initial seed tuples to retrieve a document sample of a pre-defined size from the search engine
36
SPIE (Scalable & Portable IE)
2. The tuple set induces a binary partition (a split) on the documents:
  – those that contain tuples and those that do not contain any tuple from the relation
  – The documents are thus labeled automatically as either positive or negative examples, respectively.
  – The positive examples are the documents that contain at least one tuple.
  – The negative examples are the documents that contain no tuples.
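The automatic labeling in step 2 amounts to checking each document against the seed tuples; a minimal sketch (the helper name and the matching criterion, a plain substring test for both entities, are simplifying assumptions):

```python
def label_documents(docs, tuples):
    """Split documents into positive/negative examples: a document is
    positive iff it mentions both entities of at least one seed tuple."""
    positive, negative = [], []
    for doc in docs:
        text = doc.lower()
        if any(a.lower() in text and b.lower() in text for a, b in tuples):
            positive.append(doc)
        else:
            negative.append(doc)
    return positive, negative

seeds = [("HP1", "HDAC4")]
docs = ["HP1 interacts with HDAC4 in the two-hybrid system",
        "An unrelated abstract about obesity and hypertension"]
pos, neg = label_documents(docs, seeds)   # 1 positive, 1 negative
```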
37
Query Generation/Expansion for Document Retrieval
Step 3 consists of two stages:
  – converting the positive and negative examples into an appropriate representation for training
  – running the data mining algorithms on the training examples to generate a set of rules, then converting the rules into an ordered list of queries expected to retrieve new useful documents
38
Query Generation/Expansion for Document Retrieval
39
Query Generation/Expansion for Document Retrieval
• In Step 3, three data mining algorithms are used for rule generation: Ripple, CBA & DB-Deci
• The rules are ranked based on the Laplace measure
• The top 10% of rules are converted into a query list

Positive IF WORDS ~ protein AND binding
Positive IF WORDS ~ cell AND function

Query 1: protein AND binding
Query 2: cell AND function
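Ranking rules by the Laplace measure and converting the top fraction into queries can be sketched as follows (the rule data are invented, and I assume the standard Laplace accuracy (pos+1)/(pos+neg+2); the slide does not give its exact formula):

```python
def laplace(pos, neg):
    """Laplace accuracy of a rule: (pos + 1) / (pos + neg + 2)."""
    return (pos + 1) / (pos + neg + 2)

def rules_to_queries(rules, top_fraction=0.10):
    """Rank rules by the Laplace measure and convert the top fraction
    into an ordered list of boolean queries (words joined with AND)."""
    ranked = sorted(rules, key=lambda r: laplace(r["pos"], r["neg"]),
                    reverse=True)
    keep = max(1, int(len(ranked) * top_fraction))
    return [" AND ".join(r["words"]) for r in ranked[:keep]]

rules = [{"words": ["protein", "binding"], "pos": 90, "neg": 5},
         {"words": ["cell", "function"], "pos": 40, "neg": 30},
         {"words": ["the", "of"], "pos": 50, "neg": 50}]
queries = rules_to_queries(rules, top_fraction=0.34)
# keeps only the best-scoring rule: "protein AND binding"
```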
40
Pattern Generation
• A pattern is
  – a 5-tuple <prefix, entity_tag1, infix, entity_tag2, suffix>
  – prefix, infix, and suffix are vectors associating weights with terms
  – prefix is the part of the sentence before entity1
  – infix is the part of the sentence between entity1 and entity2
  – suffix is the part of the sentence after entity2

"HP1 interacts with HDAC4 in the two–hybrid system…"
→ { "", <Protein>, "interacts with", <Protein>, "" }
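Deriving such a pattern from a tagged sentence can be sketched as below (simplified: prefix/infix/suffix are kept as strings rather than the weighted term vectors the slide describes):

```python
def generate_pattern(sentence, entity1, entity2, tag="<Protein>"):
    """Build a 5-tuple pattern <prefix, tag1, infix, tag2, suffix>
    from a sentence containing two known entity mentions."""
    i1 = sentence.index(entity1)
    i2 = sentence.index(entity2, i1 + len(entity1))
    prefix = sentence[:i1].strip()
    infix = sentence[i1 + len(entity1):i2].strip()
    suffix = sentence[i2 + len(entity2):].strip()
    return (prefix, tag, infix, tag, suffix)

sent = "HP1 interacts with HDAC4 in the two-hybrid system"
pattern = generate_pattern(sent, "HP1", "HDAC4")
# → ("", "<Protein>", "interacts with", "<Protein>",
#    "in the two-hybrid system")
```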
41
Pattern Matching
42
Experiment
• Keyword-based vs. SPIE
• Keyword-based experiment
  – Input:
    o around 7,000 protein names (expanded from 1,600 protein names using protein synonyms)
    o 23 keywords
    o 1.5 million abstracts (obtained by searching those keywords in PubMed)
• SPIE experiment
  – Input:
    o only 10 protein-protein interaction (PPI) pairs
  – Maximum number of documents used in each iteration is 10k
  – Starting with 50k documents and stopping at 500k documents
43
Experiment
[Figure: comparison of SPIE vs. the keyword-based approach.]
44
Experiment

Experiment     | Abstracts used | # of distinct PPI
Keyword-based  | 1,444,002      | 9,980
SPIE           | 500k           | 9,483

• SPIE has a significant performance advantage over the keyword-based approach: it extracts nearly as many distinct PPIs from roughly one third of the abstracts.
45
Chromatin Protein Network
46
Biomolecular Network Analysis
• Biomolecular networks dynamically respond to stimuli and implement cellular functions
• Understanding these dynamic changes is the key challenge for cell biologists
• Biomolecular networks grow in size and complexity, so computer simulation is an essential tool for understanding biomolecular network models
• A sub-network executes a specific cellular function and deserves to be studied
47
Biomolecular Network Analysis
• Our method consists of two steps.
• First, a novel scale-free network clustering approach is applied to the biomolecular network to obtain various sub-networks.
• Second, computational models are generated for each sub-network and simulated to predict its behavior in the cellular context.
• We discuss and evaluate three advanced computational models: the state-space model, the probabilistic Boolean network model, and the fuzzy logic model.
48
Mining the Large-Scale Biomolecular Network (1)
Main Algorithm SNBuilder(G, s, f, d)
1: G(V, E) is the input graph with vertex set V and edge set E.
2: s is the seed vertex; f is the affinity threshold; d is the distance threshold.
3: N ← {adjacency list of s} ∪ {s}
4: C ← FindCore(N)
5: C' ← ExpandCore(C, f, d)
6: return C'
49

Mining a large-scale biomolecular network (2)

Sub-Algorithm FindCore(N)
 8: for each v ∈ N
 9:   calculate k_v^in(N)
10: end for
11: K_min ← min {k_v^in(N), v ∈ N}
12: K_max ← max {k_v^in(N), v ∈ N}
13: if K_min = K_max or (k_i^in(N) = k_j^in(N), ∀ i, j ∈ N, i ≠ j) then return N
14: else return FindCore(N − {v}), where k_v^in(N) = K_min
50

Mining a large-scale biomolecular network (3)

Sub-Algorithm ExpandCore(C, f, d)
16: D ← {v, w | (v, w) ∈ E, v ∈ C, w ∉ C}
17: C' ← C
18: for each t ∈ D, t ∉ C, and distance(t, s) <= d
19:   calculate k_t^in(D)
20:   calculate k_t^out(D)
21:   if k_t^in(D) > k_t^out(D) or k_t^in(D)/|D| > f then C' ← C' ∪ {t}
22: end for
23: if C' = C then return C
24: else return ExpandCore(C', f, d)
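The three routines can be rendered as a short Python sketch. This is a minimal reading of the pseudocode, assuming an undirected graph stored as an adjacency dict; all function and variable names are illustrative, not from the authors' implementation, the seed distance is taken as BFS hop count, and tie-breaking among vertices sharing the minimum in-degree is arbitrary.

```python
from collections import deque

def k_in(graph, v, S):
    """Number of neighbours of v inside the vertex set S."""
    return sum(1 for u in graph[v] if u in S)

def k_out(graph, v, S):
    """Number of neighbours of v outside the vertex set S."""
    return sum(1 for u in graph[v] if u not in S)

def find_core(graph, N):
    """Recursively strip a minimum-in-degree vertex until in-degrees equalise."""
    degrees = {v: k_in(graph, v, N) for v in N}
    k_min, k_max = min(degrees.values()), max(degrees.values())
    if k_min == k_max:
        return N
    v = next(v for v, k in degrees.items() if k == k_min)
    return find_core(graph, N - {v})

def distances_from(graph, s):
    """BFS hop distances from the seed vertex s."""
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def expand_core(graph, C, f, d, dist):
    """Grow the core C by absorbing boundary vertices that pass the tests."""
    D = C | {w for v in C for w in graph[v] if w not in C}
    C2 = set(C)
    for t in D - C:
        if dist.get(t, float("inf")) > d:
            continue  # too far from the seed
        kin, kout = k_in(graph, t, D), k_out(graph, t, D)
        if kin > kout or kin / len(D) > f:
            C2.add(t)
    if C2 == C:
        return C
    return expand_core(graph, C2, f, d, dist)

def sn_builder(graph, s, f, d):
    """SNBuilder: seed neighbourhood -> core -> expanded sub-network."""
    N = set(graph[s]) | {s}
    C = find_core(graph, N)
    return expand_core(graph, C, f, d, distances_from(graph, s))
```

On a toy graph (a 4-clique with one pendant vertex), `find_core` strips the pendant and `expand_core` then re-admits it because its in-degree exceeds its out-degree.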
51

Experiment Results

Promising Protein-Protein Interaction clusters
52

Experiment Results

Fig 1 A sub-network obtained using the algorithm
53

State-space model for simulation (1)

[Diagram: a gene regulatory network; observed gene expression levels x1, x2, ..., xn are linked to internal variables z1, ..., zp and external inputs through the dynamic and observation equations]
54

• x: gene expression data
• z: internal variables (promoters)
• A: state transition matrix
• B: control (input) matrix
• C: transformation matrix
• n1(t) and n2(t) stand for noise
z(t+1) = A z(t) + B u(t) + n1(t)   (dynamic equation)
x(t) = C z(t) + n2(t)   (observation equation)
State-space model for simulation (2)
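The two equations can be exercised with a small numerical sketch. The matrices A, B, C below are hypothetical examples (the actual matrices are estimated from the Thy-Thy 3 training data); the sketch only shows how the dynamic and observation equations roll a hidden state forward.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 4, 2  # n observed genes, p internal variables (promoters)

# Hypothetical system matrices; the real ones are inferred from data.
A = np.array([[0.9, -0.2],
              [0.1,  0.8]])          # state transition matrix
B = np.array([[0.5],
              [0.1]])                # control (input) matrix
C = rng.standard_normal((n, p))      # transformation matrix

def simulate(T, u, noise=0.01):
    """Iterate z(t+1) = A z(t) + B u(t) + n1(t), x(t) = C z(t) + n2(t)."""
    z = np.zeros(p)
    xs = []
    for t in range(T):
        x = C @ z + noise * rng.standard_normal(n)             # observation
        xs.append(x)
        z = A @ z + B @ u(t) + noise * rng.standard_normal(p)  # dynamics
    return np.array(xs)

X = simulate(T=20, u=lambda t: np.array([1.0]))
```

Because every eigenvalue of this example A lies inside the unit circle (and the pair is complex), the simulated trajectories stay bounded and oscillate, mirroring the stable, periodic behavior reported for the inferred network.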
55

• Applying the state-space modeling method to the gene expression data of the 16 genes in Figure 1, we obtained an inferred gene regulatory network with nine internal variables
• The analysis shows that the inferred network is stable, robust, and periodic
• We use the model constructed from the training dataset Thy-Thy 3 to predict the expression profiles in the testing dataset Thy-Noc; the result is shown in Figure 2

State-space model for simulation (3)
56

State-space model for simulation (4)

Fig 2 Comparison of experimental (solid lines) and predicted (dotted lines) gene expression profiles for DMTF (A), F2 (B), RRM2 (C), and TYR (D)
[Figure 2: four panels (A, B, C, D), each plotting expression level against time]
57

• The fuzzy biomolecular network model is a set of rule sets, one for each node (in this case, gene) in the network, governing the response to each fuzzy state of the input genes to that node (the output gene).
• Fuzzy rule sets are generated for the genes in the sub-network in Figure 1.
• We use the model constructed from the training dataset Thy-Thy 3 to predict the expression profiles in the testing dataset Thy-Noc; the results are shown in Figure 3

Fuzzy logic model for simulation (1)
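The rule-set idea can be illustrated with a toy fuzzy inference step. Everything here is hypothetical: the three triangular membership functions over log-expression ratios in [-1, 1] and the two-rule set for input genes g1 and g2 are stand-ins, not rules mined from the Figure 1 sub-network.

```python
def memberships(x):
    """Fuzzy state of one gene's log-expression ratio in [-1, 1]."""
    return {
        "low":    max(0.0, min(1.0, -x)),   # peaks at x = -1
        "medium": max(0.0, 1.0 - abs(x)),   # peaks at x =  0
        "high":   max(0.0, min(1.0, x)),    # peaks at x = +1
    }

LEVELS = {"low": -1.0, "medium": 0.0, "high": 1.0}

def apply_rules(inputs, rules):
    """Mamdani-style inference: AND = min over a rule's antecedents,
    then defuzzify as the strength-weighted average of output levels."""
    num = den = 0.0
    for antecedent, consequent in rules:
        strength = min(memberships(inputs[g])[s] for g, s in antecedent.items())
        num += strength * LEVELS[consequent]
        den += strength
    return num / den if den else 0.0

# Hypothetical rule set for one output gene with input genes g1, g2:
rules = [
    ({"g1": "high", "g2": "high"}, "high"),  # both inputs up -> output up
    ({"g1": "low"}, "low"),                  # g1 down -> output down
]
y = apply_rules({"g1": 0.8, "g2": 0.6}, rules)
```

Evaluating such a rule set at each time point of the training series, and scoring it against the held-out series, is the shape of the fit-and-predict loop described above.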
58

Fuzzy logic model for simulation (2)

Fig 3 Best-fit rule on training set "Thy-Thy 3" predicting gene expression on the test data set (solid line) compared to actual data from the test set "Thy-Noc" (dashed line) for CDK2 (A), BRCA1 (B), EP300 (C), and CDK4 (D)
[Figure 3: four panels (A, B, C, D), each plotting log(expression ratio) against time]
59

• A probabilistic Boolean network (PBN) is a Markov chain capturing transition probabilities among different gene expression states.
• We construct PBNs from the microarray data set "Thy-Thy 3" and use the data set "Thy-Noc" to test the constructed PBNs
• The results are shown in Tables 1 through 3

Probabilistic Boolean Networks for simulation (1)
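A probabilistic Boolean network can be sketched in a few lines. The two genes and their predictor functions below are hand-made placeholders; in practice each gene's candidate predictors and their selection probabilities are estimated from the "Thy-Thy 3" data.

```python
import random

# For each gene: a list of (selection probability, Boolean predictor);
# a predictor maps the full network state to the gene's next value.
predictors = {
    0: [(0.7, lambda s: s[1]),             # gene 0 usually follows gene 1
        (0.3, lambda s: not s[0])],        # ... or toggles itself
    1: [(1.0, lambda s: s[0] and s[1])],   # gene 1 stays on only if both are on
}

def step(state, rng):
    """One synchronous PBN update: independently draw one predictor
    per gene according to its probabilities, then apply them all."""
    new = []
    for gene in sorted(predictors):
        r, acc = rng.random(), 0.0
        for prob, f in predictors[gene]:
            acc += prob
            if r <= acc:
                break  # this predictor is selected
        new.append(int(f(state)))
    return tuple(new)

rng = random.Random(0)
state = (1, 1)
trajectory = [state]
for _ in range(10):
    state = step(state, rng)
    trajectory.append(state)
```

Averaging many such trajectories estimates the state-transition probabilities of the underlying Markov chain, which is what the prediction-accuracy tables below evaluate.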
60

Probabilistic Boolean Networks for simulation (2)

Gene      DMTF   BRCA1  HIFX   HE     PPP2R4  MYC    NR4A2  F2
2 states  66.67  55.56  77.78  22.22  55.56   44.44  72.22  61.11

Gene      PTEN   RRM2   PLAT   TYR    CAD     CDK2   CDK4   EP300
2 states  72.22  77.78  50.00  55.56  66.67   50.00  66.67  72.22
Table 1: Prediction accuracy based on the given genetic network using 2-state microarray data.
Gene      DMTF   BRCA1  HIFX   HE     PPP2R4  MYC    NR4A2  F2
3 states  50.00  55.56  66.67  16.67  61.11   55.56  55.56  61.11

Gene      PTEN   RRM2   PLAT   TYR    CAD     CDK2   CDK4   EP300
3 states  44.44  72.22  66.67  50.00  61.11   38.89  66.67  61.11
Table 2: Prediction accuracy based on the given genetic network using 3-state microarray data.
61

Probabilistic Boolean Networks for simulation (3)

• To improve the prediction accuracy for HE, MYC, and CDK2, we use the developed multivariate Markov chain to model the microarray data set. The results are shown in Table 3

Gene      HE             MYC            CDK2
2 states  55.56 (22.22)  61.11 (44.44)  66.67 (50.00)
3 states  27.78 (16.67)  55.56 (55.56)  38.89 (38.89)
Table 3: Prediction accuracy based on the input genes estimated from the multivariate Markov chain model.
62

Conclusions

• We present a new method for mining and dynamic simulation of sub-networks from a large biomolecular network.
• The presented method applies a scale-free network clustering approach to the biomolecular network to obtain biologically functional sub-networks.
• Three computational models (state-space model, probabilistic Boolean network, and fuzzy logic model) are employed to simulate the sub-networks, using time-series gene expression data of the human cell cycle.
• The results indicate that our presented method is promising for mining and simulation of sub-networks.
63

Relevant Publications

1. Hu X., Wu D., Data Mining and Predictive Modeling of Biomolecular Network from Biomedical Literature Databases, IEEE/ACM Transactions on Computational Biology and Bioinformatics (April-June 2007), pp. 251-263
2. Hu X., Sokhansanj B., Wu D., Tang Y., A Novel Approach for Mining and Dynamic Fuzzy Simulation of Biomolecular Network, IEEE Transactions on Fuzzy Systems
3. Hu X., Wu F.X., Ng M., Sokhansanj B., Mining and Dynamic Simulation of Sub-Networks from Large Biomolecular Networks, 2007 International Conference on Artificial Intelligence, June 25-28, Las Vegas, USA (Best Paper Award, out of 500 submissions)
4. Hu X., Yoo I., Song I-Y., Song M., Han J., Lechner M., Extracting and Mining Protein-Protein Interaction Network from Biomedical Literature, Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (IEEE CIBCB 2004), Oct. 7-8, 2004, San Diego, USA (Best Paper Award), pp. 244-251
5. Tang Y.C., Zhang Y-Q., Huang Z., Hu X., and Zhao Y., Recursive Fuzzy Granulation for Gene Subsets Extraction and Cancer Classification, accepted for publication in IEEE Transactions on Information Technology in Biomedicine
64

Dragon Toolkit

• Software package designed for language modeling, information retrieval, and text mining
• Free download: http://www.ischool.drexel.edu/dmbio/dragontool/default.asp
• 500 Java libraries for NLP, search engines, and entity extraction; one of the most popular software packages for information retrieval, NLP, etc. More than 1500 research groups in the world have downloaded it since Jul 2007
65

Call for Papers

International Journal of Data Mining and Bioinformatics (IJDMB)

Editor-in-Chief: Xiaohua Hu
Inaugural Issue: July 2006
SCI Indexed: Oct 2007
66

Call for Participation

2008 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 08)
Philadelphia, USA, Nov 3-5, 2008
IEEE BIBM Steering Committee Chair: Xiaohua Hu
67

My Ph.D. Students and Joint Ph.D. Students with Chinese Universities

1. Illhoi Yoo (graduated in 2006, tenure-track assistant professor at Univ. of Missouri-Columbia)
2. Xiaodan Zhang (4th year Ph.D. student, Text and Web Data Mining, Digital Library, Bioterrorism)
3. Daniel Wu (5th year Ph.D. student, Data Mining and Biomolecular Network Analysis)
4. Xuheng Xu (4th year Ph.D. student, Semantic-based Query Optimization and Intelligent Searching)
5. Davis Zhou (5th year Ph.D. student, Semantic-based Information Extraction and Retrieval)
6. Palakorn Achananuparp (4th year Ph.D. student, Text Mining)
7. Deima Elnatour (4th year Ph.D. student, Semantic-based Text Mining)
8. Guisu Li (2nd year Ph.D. student, Healthcare Informatics)
9. Zhong Huang (2nd year Ph.D. student, Bioinformatics, Computational Biology)
10. Xin Chen (fresh Ph.D. student, USTC)
11. Xiaoshi Yin (joint Ph.D. student with Prof. Zhoujun Li from BAUU)
12. Min Xu (joint Ph.D. student with Prof. Shuigeng Zhou from Fudan University)
13. Yaoyu Zuo (joint Ph.D. student with Prof. Ying Tong from Zhongshan University)
68

Acknowledgements

• PI: NSF CAREER: A Unified Architecture for Data Mining Large Biomedical Literature Databases (NSF CAREER IIS 0448023, $415K, 03/15/2005-02/28/2010)
• PI: High Performance Rough Sets Data Analysis in Data Mining (NSF CCF 0514679, $102K, 08/01/2005-07/31/2008)
• Co-PI: The Drexel University GAANN Fellowship Program: Educating Renaissance Engineers (US Dept. of Education, 9/1/2006 to 8/31/2009, around $700K)
• Co-PI: Penn State Cancer Education Network Evaluation (PA Dept. of Health, 04/25/2006-07/31/2010, $1.2M)
• Co-PI: Center for Public Health Readiness and Communication (PA Dept. of Health, 08/01/2004-08/31/2007, $1.5M)
• Co-PI: Origin and Evolution of Genomic Instability in Breast Cancer (PA Dept. of Health, $100K, 05/01/2004-04/30/2005)
• Co-PI: Systems Biology Approach to Understanding Protein-Protein Interactions (PA Dept. of Health, $100K, 05/01/2004-04/30/2005)
69

Thanks for your attention

Any comments or questions?