Towards Evidence-Based Discovery
Catherine Blake
School of Information and Library Science
University of North Carolina at Chapel Hill
http://www.ils.unc.edu/~cablake
[email protected]
Motivation
• Relentless increase in electronically available text
  – Life Sciences
    • 17 millionth entry added in April 2007
    • 5,200 journals indexed
    • 12,000 new articles each week!
  – Chemistry – more than 110,000 articles in 1 year alone
• Consequences:
  – Hundreds of thousands of relevant articles
  – Implicit connections between literature go unnoticed
Shift from Retrieval to Synthesis
Information Overload
“One of the diseases of this age is the multiplicity of books; they doth so overcharge the world that it is not able to digest the abundance of idle matter that is every day hatched and brought forth into the world”
- Barnaby Rich, 1613
Evidence-Based Discovery
“If I have seen further than others, it is by standing upon the shoulders of giants.”
- Sir Isaac Newton
“We can't solve problems using the same kind of thinking we used when we created them.”
- Albert Einstein
Goal: Facilitate Discovery from Text
• Facilitate: to make easy or easier (1)
• Discovery: a productive insight (1)
(1) American Heritage Dictionary
[Figure: research framework diagram. Labels: Education; Discovery Science; Evidence-based Practice; Natural Language Processing; Human Discovery and Synthesis; Human-assisted Discovery and Synthesis; Heterogeneous Literature; Core; Chemistry; Breast Cancer; Genomics; Synthesis and Discovery Work Practices; News; DocSouth]
Outline
• Motivation
• Case Studies
  – METIS
    • Human synthesis
    • Natural language processing
  – Claim Jumping through Scientific Literature
• Next Steps
• Summary
Systematic Review Process
– Formulate the problem
– Locate and select studies
– Assess quality of studies
– Collect data
– Analyze and present results
– Interpret results
– Improve and update review
28 months from initial idea to publication
Increased demand due to evidence-based medicine
Manual Synthesis
[Figure: synthesis process diagram. Labels: Iteration; Collaboration; Analysis; Extraction; Context Information; Hypothesis Projection; Retrieval; Corpus; MEDLINE; Embase; Verification; Facts]
Select, Extract, Analyze, Verify
Guesswork guided by scientifically trained intuition
Rescher (1978)
Context Information
• Study Information
  – e.g. date, location, ...
• Population Information
  – e.g. gender, age, ...
• Risk Factor or Intervention
  – e.g. duration of exposure, confounders
• Disease
  – e.g. stage, confounders
Loosely coupled to review focus
Tightly coupled to review focus
Collaborative Information Synthesis
[Figure: synthesis process diagram. Labels: Iteration; Collaboration; External Data; Analysis; Extraction; Context Information; Hypothesis Projection; Retrieval; Corpus; MEDLINE; Embase; Verification; Facts]
Key: Estimate Missing Information
• What are people with Breast Cancer exposed to?
• What are people in a similar population exposed to?
• Are these rates significantly different?
Studies with Breast Cancer patients
• Facts for each study
  – number of patients
  – age of patients
  – geographic location
  – risk-factor exposure …
Database of risk factors (BRFSS)
• Codebook
  – question asked
  – age, gender
  – % responses
T. Tengs & N. D. Osgood (2001) “The link between smoking and Impotence: Two Decades of Evidence”, Preventive Medicine, 32:447-52
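The question of whether two exposure rates differ significantly can be illustrated with a two-proportion z-test. This is a minimal sketch only: the slides do not name the statistical test METIS uses, and the function name and all counts shown are hypothetical.

```python
import math

def two_proportion_z_test(exposed_a, n_a, exposed_b, n_b):
    """Compare two exposure rates (assumed test; not taken from the slides).

    exposed_a / n_a: exposed count and sample size in the study population.
    exposed_b / n_b: exposed count and sample size in the comparison
    population (for example, survey respondents of similar age and gender).
    Returns the z statistic and the two-sided p-value.
    """
    rate_a = exposed_a / n_a
    rate_b = exposed_b / n_b
    # Pooled rate under the null hypothesis that both rates are equal.
    pooled = (exposed_a + exposed_b) / (n_a + n_b)
    variance = pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b)
    if variance == 0.0:
        raise ValueError("rates are all 0 or all 1; the test is undefined")
    z = (rate_a - rate_b) / math.sqrt(variance)
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical counts: 60 of 100 patients exposed vs. 40 of 100 in the survey.
z, p = two_proportion_z_test(60, 100, 40, 100)
```

A small p-value here would suggest that the patients' exposure rate differs from the rate in the comparison population.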
More than Automated Meta-Analysis
[Figure labels: Systematic Review; External database; Entire study; Main topic; Secondary Information; Key]
Information Synthesis
• Traditional analysis
  – same study design
  – medicine = RCT
  – epidemiology = cohort
• Information Synthesis
  – any study that includes required information
  – augment missing information
[Figure: research framework diagram, repeated. Section focus: Natural Language Processing]
METIS Information Extractor
• Semantic Grammar
• Features: words, numbers, and semantic types in the Unified Medical Language System (UMLS)
• Information extracted:
  – risk factor exposure (tobacco and alcohol)
  – gender
  – age (min, max, mean)
  – start and end dates
  – number of subjects with medical condition
  – geographical location
{term;’age’} {term:’of’} {number;10<n2<110}{term;’to’}{number;10<n2<110}
The age of breast cancer subjects ranged between 20 to 64 years old.
{semantic type: neoplastic process, or disease}
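The rule above can be approximated with a regular expression. This is a hedged sketch, not the METIS semantic grammar: the regex, the function name, and the tiny `DISEASE_TERMS` set (a stand-in for a UMLS semantic-type lookup) are all assumptions made for illustration. Only the 10 < n < 110 bounds come from the slide.

```python
import re

# Hypothetical stand-in for a UMLS lookup of terms whose semantic type is
# "neoplastic process" or "disease"; a real system would query UMLS.
DISEASE_TERMS = {"breast cancer", "impotence"}

# 'age' 'of' <condition> subjects|patients ... <number> to <number>
AGE_RANGE = re.compile(
    r"\bage\s+of\s+(?P<condition>.+?)\s+(?:subjects|patients)\b"
    r".*?(?P<low>\d{1,3})\s*(?:to|and|-)\s*(?P<high>\d{1,3})",
    re.IGNORECASE,
)

def extract_age_range(sentence):
    """Return the condition and age bounds found in a sentence, or None."""
    match = AGE_RANGE.search(sentence)
    if match is None:
        return None
    condition = match.group("condition").lower()
    low = int(match.group("low"))
    high = int(match.group("high"))
    # Semantic-type constraint from the rule (approximated by the term set).
    if condition not in DISEASE_TERMS:
        return None
    # Numeric constraint from the rule: both bounds between 10 and 110.
    if not (10 < low < 110 and 10 < high < 110 and low <= high):
        return None
    return {"condition": condition, "age_min": low, "age_max": high}

result = extract_age_range(
    "The age of breast cancer subjects ranged between 20 to 64 years old."
)
```

For the example sentence from the slide, `result` holds the condition "breast cancer" with a minimum age of 20 and a maximum of 64.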
METIS Info Extractor – Evaluation
• Diverse text corpus
  – epidemiology, surgery, biology, ...
  – cohort studies, case-control trials, ...
• Evaluation
  – Metrics (precision, recall)
  – Annotators (developer, domain expert, expert annotator, novice)
  – Primary topic (breast cancer, impotence)
  – Secondary information (tobacco and alcohol consumption)
METIS Info Extractor – Recall
[Figure: recall (0.0 to 1.0) by rank (1 to 5). Series: Development; Domain Expert; Expert Annotator; Novice Annotator]
METIS Info Extractor – Precision
[Figure: precision (0.0 to 1.0) by rank (1 to 5). Series: Development; Domain Expert; Expert Annotator; Novice Annotator]
METIS Verifier
Verify information extracted
[Figure labels: Electronic version of article; Converted Article]
METIS Analyzer
• Meta-Analysis
  – Developed for agricultural application
  – Requires empirical studies with a quantitative outcome
  – Unit of study is an article, not a person
  – Result: a unitless metric called an effect size
• Two common meta-analysis techniques
  – Fixed-effects model
  – Random-effects model
Evaluation: Compared generated effect size with examples in textbooks and published articles
Result: Same effect size
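The two techniques named above can be sketched with textbook inverse-variance formulas. This is an illustration under stated assumptions, not the METIS Analyzer: the slides do not say which estimators METIS implements, so the random-effects version below assumes the DerSimonian-Laird estimate of between-study variance, and the function names and example inputs are hypothetical.

```python
import math

def fixed_effect(effects, variances):
    """Inverse-variance weighted pooled effect size and its standard error."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total
    return pooled, math.sqrt(1.0 / total)

def random_effects(effects, variances):
    """Pooled effect with DerSimonian-Laird between-study variance (assumed)."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled_fixed, _ = fixed_effect(effects, variances)
    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(w * (y - pooled_fixed) ** 2 for w, y in zip(weights, effects))
    c = total - sum(w * w for w in weights) / total
    df = len(effects) - 1
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    adjusted = [1.0 / (v + tau2) for v in variances]
    total_adj = sum(adjusted)
    pooled = sum(w * y for w, y in zip(adjusted, effects)) / total_adj
    return pooled, math.sqrt(1.0 / total_adj), tau2

# Hypothetical effect sizes and variances, one pair per article.
pooled_fe, se_fe = fixed_effect([0.2, 0.4], [0.01, 0.01])
pooled_re, se_re, tau2 = random_effects([0.2, 0.4], [0.01, 0.01])
```

Each input pair is one article, matching the slide's point that the unit of study is an article rather than a person. When the studies disagree more than chance would predict, the random-effects standard error is wider than the fixed-effect one.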
Synthetic Estimate Evaluation
[Figure: two panels, Tobacco Consumption and Alcohol Consumption. Each plots actual vs. estimated control rate (0 to 1) by article identifier (1 to 4, plus the average)]
Outline
• Motivation
• Case Studies
  – METIS
  – Claim Jumping
    • Human discovery
    • Natural language processing
    • Human-assisted discovery and synthesis
• Next Steps
• Summary
[Figure: research framework diagram, repeated. Section focus: Human Discovery and Synthesis]
Human Discovery
• Day-to-day activities of scientists reflect
  – the complex socio-technical environments in which successful creativity tools will eventually be embedded
  – the human cognitive processing surrounding creativity
• Unit of analysis: a paper or grant proposal
How do chemists transform an idea into a publication?
How do chemists arrive at their research question?
Approach
• Recruitment
  – experienced scientists (7-45 yrs)
  – local chemists and chemical engineers
  – response rate 84% (21/25)
• Semi-structured interviews
• Critical incident technique
  1. seminal paper in their field
  2. recent paper authored by the participant
  3. paper authored by the participant that they were particularly proud of
Interview Questions
• Discovery Questions
  – What is your definition of discovery?
  – What evidence convinced you that the paper addressed the initial research questions?
  – What factors limited the adoption and deployment of the discovery?
  – How did you arrive at the research question?
  – What if any existing evidence prompted the study/experiment?
  – Were there any alternative explanations?
• Information Usage Questions
  – Other than the scientific literature, what information resources do you draw from to aid in your research processes?
  – How many articles did you read last month that related to each of those projects?
  – Is that typical of how many articles you read in a month for research projects?
  – Do you read articles for another purpose? If so, what?
  – How many hours do you spend reading journal articles for research projects?
  – Which journals do you typically read and draw from?
  – How would you characterize the journals that you read: are they only within your domain, or do you read journals that would be considered non-traditional in your research?
  – If you only have a few minutes to read an article, what parts would you read?
  – What do you do with the article once you have read it?
Chemists and Chemical Engineers
• Compared with other scientists, chemists and chemical engineers
  – read more (Brown, 1999)
  – have more personal subscriptions to journals (Noble & Coughlin, 1997)
  – spend more time reading (Tenopir & King, 2003)
  – visit the library more often (Brown, 1999)
• Consequences
  – information disseminated quickly
  – information has a relatively short lifespan
Human Discovery Findings
• Discovery definition
  – Novelty
  – Balance theory and experimentation
  – Build on existing ideas
  – Practical application
  – Simplicity
• Hypothesis generation
  – Discussion
  – Previous experiments
  – Combine expertise
  – Read literature
• Hypothesis validation
  – Iterative
  – Tightly coupled
[Figure: research framework diagram, repeated. Section focus: Natural Language Processing]
Causal Relationships
• Newspaper genre
  – Causal relationships (Khoo, Chan, & Niu, 1998)
• Biomedical genre
  – Causes and treats (Price & Delcambre, 2005)
  – Causal knowledge (Khoo, Chan, & Niu, 2000)
• Universal Grammar
  – Causatives (Comrie, 1974, 1981)
  – Action verbs (Thomson, 1987)
Claim Definition
• “To assert in the face of possible contradiction”
• Example sentence reporting a claim
  – “This study showed that Tamoxifen reduces the breast cancer risk”
• Example Claim Framework
  – Tamoxifen (agent)
  – reduces (change)
  – [breast cancer risk] (object)
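One way to picture the example is as a small record with one field per facet. This is a hypothetical representation for illustration only: the class name, field names, and the `as_text` helper are assumptions, not part of the Claim Framework as presented.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Claim:
    """Hypothetical container for the three facets shown in the example."""
    agent: Optional[str]   # what acts, e.g. a drug
    change: Optional[str]  # how the agent affects the object
    obj: str               # what is affected

    def as_text(self) -> str:
        # Join the facets that are present back into a short statement.
        return " ".join(part for part in (self.agent, self.change, self.obj) if part)

example = Claim(agent="Tamoxifen", change="reduces", obj="breast cancer risk")
```

Keeping `agent` and `change` optional is a design guess that leaves room for claims that do not state every facet.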
The Claim Framework
• Goal
  – go beyond genes and proteins
  – differentiate between different levels of confidence in the claim
  – consider claims made in the full text
• Working hypothesis
  – literature will report findings using constructs within the Claim Framework
  – human annotators will agree on facets
Preliminary Results
• 29 articles from TREC Genomics
  – Total number of sentences: 5535
  – Sentences with >=1 claim: 1250 (22.6%)
  – Total number of claims: 3228
  – Average claims per sentence: 2.51
  – Claims that did not fit in the Framework: 31
• Per article
  – Average number of sentences: 191
  – Average number of sentences with >=1 claim: 43
Distribution of Claim Categories
Category      Total   (%)     Pilot   (%)     Main    (%)
Explicit      2489    77.11   332     83.42   2157    76.63
Implicit      87      2.70    3       0.75    84      2.98
Observation   298     9.23    24      6.03    274     9.73
Correlation   174     5.39    12      3.02    162     5.75
Comparison    165     5.11    27      6.85    138     4.9
Total         3228    100     398     100     2830    100
All Documents
Annotation          Total   (%)     Words   (Avg)
Agent               2894    89.65   5221    1.80
Agent Direction     285     8.83    291     1.02
Agent Modifier      1246    38.60   4448    3.57
Object              3197    99.04   6849    2.14
Object Direction    271     8.40    283     1.04
Object Modifier     1561    48.36   5383    3.44
Change              1897    58.77   1953    1.03
Change Direction    1337    41.42   1358    1.02
Change Modifier     1147    35.53   1618    1.41
Claim Basis         165     5.11    394     2.39
Claim Basis Dir.    42      1.30    43      1.02
Claim Basis Mod.    86      2.66    266     3.09
Total               3228            28107   8.70
Inter Annotator Agreement
Information Facet   Kappa   Agreement
Agent               0.71    substantial
Object              0.77    substantial
Change              0.57    moderate
Change+ChangeDir    0.88    almost perfect
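For readers unfamiliar with kappa, the sketch below computes Cohen's kappa for two annotators and maps a value to the descriptive bands of Landis and Koch (1977). Both choices are assumptions: the slide gives neither the kappa variant nor the source of its labels, although the labels shown are consistent with those bands. The annotator labels in the example are made up.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

def agreement_band(kappa):
    """Descriptive band for a kappa value (Landis and Koch scale, assumed)."""
    if kappa < 0.0:
        return "poor"
    for upper, name in ((0.20, "slight"), (0.40, "fair"),
                        (0.60, "moderate"), (0.80, "substantial")):
        if kappa <= upper:
            return name
    return "almost perfect"

# Made-up labels: does each of four sentences contain an agent?
kappa = cohen_kappa(["y", "y", "n", "n"], ["y", "n", "n", "n"])
```

Applied to the values in the table, `agreement_band` yields the same labels the slide reports for 0.71, 0.77, 0.57, and 0.88.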
Location of Claims
Section        Sentences with claim   Total sentences   % of section   % of claims
Abstract       98                     309               31.72          7.84
Introduction   357                    979               36.47          28.56
Method         6                      1121              0.54           0.48
Result         293                    1829              16.02          23.44
Discussion     539                    1406              38.34          43.12
Total          1250                   5535              22.58          100.00
[Figure: research framework diagram, repeated. Section focus: Human-assisted Discovery and Synthesis]
User Study
• Timothy S. Carey, MD, MPH
  – Sarah Graham Kenan Professor of Medicine
  – Director, Cecil G Sheps Center for Health Services Research
• Ila Cote, PhD, DABT
  – Acting Division Director
  – US Environmental Protection Agency, National Center for Environmental Assessment
• Michael T Crimmins PhD.
  – Mary Ann Smith Distinguished Professor of Chemistry UNC and Department Chair, Department of Chemistry
• Paul Jones
  – Clinical Associate Professor, School of Information and Library Science
  – Director of ibiblio.org
• Rudy L Juliano PhD.
  – Boshamer Distinguished Professor of Pharmacology
  – Principal Investigator, Carolina Center of Cancer Nanotechnology Excellence
• Steven W. Matson Ph.D.
  – Professor and Chair, Department of Biology
• Robert C Millikan DVM PhD
  – Barbara Sorenson Hulka Distinguished Professor
  – Department of Epidemiology, School of Public Health
• Dr. Rosa Perelmuter, PhD
  – Director, Moore Undergraduate Research Apprentice Program
  – Professor of Spanish and Assistant Dean, Academic Advising Program
• Jan F. Prins PhD.
  – Professor of Computer Science and Chairman, Department of Computer Science
• Alexander Tropsha, Ph.D.
  – Professor and Chair
  – Director, Laboratory for Molecular Modeling
• Suzanne West, PhD
  – Researcher, Health, Social and Economics Research
  – RTI International
Closing Comments
• Accelerate synthesis
  – Breast cancer study without METIS would take >13 years
  – Without synthetic estimate = systematic review
• Accelerate discovery
  – Connections between literature
  – Speculative and orthogonal views
• Human discovery and synthesis
  – As important if not more so than automation
“Tap the vast reservoir of human knowledge”
- Louis Round Wilson, 1929
Acknowledgements
METIS
• Funded in part by
  – California Breast Cancer Research program
  – University of California, Irvine
• Thanks to user groups
  – Particularly to Dr. Adams and Dr. Tengs
• Academic mentoring
  – Primary Advisor: Dr. Wanda Pratt
  – Medical Mentor: Dr. Catherine Carpenter
  – Co-Advisors: Dr Dennis Kibler and Dr Michael Pazzani
  – Committee Member: Dr Paul Dourish
Claim Jumping
• Funded in part by
  – Faculty fellowship from the Renaissance Computing Institute
  – UNC Faculty Award
• Thanks to collaborators
  – Nassib Nassar and Mats Rynge (RENCI)
  – Amol Bapat and Ryan Jones (SILS)
Chemists and Chemical Engineers Study
• Funded in part by
  – NSF Center for Environmentally Responsible Solvents and Processes
Questions and Comments Welcome
Catherine Blake
[email protected]
School of Information and Library Science
University of North Carolina at Chapel Hill
http://www.ils.unc.edu/~cablake
Publication Bias
• Studies that find a correlation between a risk factor and disease are more likely to be published (Easterbrook et al., 1991; Ingelfinger et al., 1994)
• METIS provides a new way to explore this bias
Bias introduced by authors, editors, funding, ...