II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining
![Page 1: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/1.jpg)
Making Knowledge Discoverable:
The Role of Agile Text Mining
David Milward
Linguamatics
II-SDV, Nice, April 2012
![Page 2: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/2.jpg)
Overview
• Search vs. Text Mining
• Agile Text Mining
– Linguamatics I2E
• Relationship Extraction: Text Mining + Search
• Finding the Most Relevant Documents: Search + Text Mining
• Accelerating a Search Strategy
• Example Results from
– multiple documents
– within complex documents
• Reproducible Workflows
![Page 3: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/3.jpg)
Search vs. Text Mining
• Search Engine: filter to find most relevant documents, then read
• Text Mining: manipulate the text to discover what is there
• Natural Language Processing (NLP) to understand meaning
• Statistics to provide trends
• Sources: News Feeds, Scientific Literature, Patents, Internal Reports, Clinical Trials

| company | activity | company |
|---------|----------|----------|
| Sanofi  | bid      | Aventis  |
| Roche   | partner  | Antisoma |
![Page 4: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/4.jpg)
Agile Text Mining
• Text mining provides the ability to discover
– but typically queries have to be programmed in, and processing is slow
• Search provides the ability to filter quickly to relevant documents
– but is poor at answering open questions, e.g. “what are biomarkers for breast cancer”
• Combine text mining with search to discover within specific contexts, e.g. “What is a risk factor for diabetes”
– Filter to the context of interest
– Discover what is available
![Page 5: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/5.jpg)
Technology Adoption
• Linguamatics I2E first adopted in pharma/biotech, including 9 of top 10
• Used to answer a wide range of questions e.g.
– Dose durations of follow-on clinical trials
– Therapeutic usages of recombinant proteins
– Cofactors for Nuclear Receptors
• Wide range of application areas across the drug pipeline e.g.
– Target Prioritization, Safety/Toxicity, Clinical Trial Design
– Competitive Intelligence, Marketing
• Are the Life Sciences special?
– Particularly knowledge intensive, so high demand
– A lot of complex, ambiguous terminology, balanced by good resources
• I2E is a generic platform
– Now being used in chemicals, consumer products, health …
![Page 6: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/6.jpg)
Example Data Sources
• Scientific Literature
– Abstracts e.g. MEDLINE
– Full text journal articles e.g. via Quosa
• Patents
• News
• Conference Abstracts
• Internal documents
• Social Media
– Twitter
• Competitive Intelligence, R&D, Marketing
• Clinical Trial Design, Safety, Relative Efficacy of Drugs
– Clinical Trials
– FDA Drug Label Inserts
– Electronic Health Records
• Cloud Based Service
– MEDLINE (20 million abstracts)
– Patent Full Text: USPTO, WIPO, EPO …
– Clinical Trials …
![Page 7: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/7.jpg)
Linguamatics I2E: An Agile Text Mining Platform
• Documents: Internal/External Sources
• Indexing: Flexible, Highly Scalable Rich Indexes
• Ontologies: Domain knowledge – ontologies, thesauri, dictionaries
• Agile NLP Querying: NLP; Class, concept; Regular Expression; Chemical Structure
• Ad hoc queries, Smart queries, Multi queries, Batch queries
• Structured Results (Actionable Information)
• Decision Support
![Page 8: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/8.jpg)
Using I2E to Extract Relationships
![Page 9: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/9.jpg)
Improve Recall Using Terminologies
Which genes are known to affect breast cancer?
• Any Gene/Protein – Relationship – Breast Cancer
• (ESR1 OR ERBB2 OR CHEK2 OR BRCA1...) Affect Breast Cancer
– the gene list could be 10,000s of terms
• (ESR1 OR ERBB2 OR CHEK2 OR BRCA1...) Affect (“Breast Cancer” OR “Breast Carcinoma” OR “Cancer of Breast” OR ...)
– the disease list could be 100s of terms
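The expansion on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not I2E's query language: the term lists are the ones shown on the slide, and `or_group` and `expand_query` are hypothetical helper names.

```python
# Term lists copied from the slide; a real terminology could hold
# hundreds or tens of thousands of entries.
GENES = ["ESR1", "ERBB2", "CHEK2", "BRCA1"]
BREAST_CANCER = ["Breast Cancer", "Breast Carcinoma", "Cancer of Breast"]


def or_group(terms):
    """Render a list of terms as a quoted, parenthesized OR group."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def expand_query(subjects, relation, objects):
    """Build a flat Boolean query from two term lists and a relation word."""
    return f"{or_group(subjects)} {relation} {or_group(objects)}"


query = expand_query(GENES, "Affect", BREAST_CANCER)
print(query)
```

The point of using a terminology is that the user asks for "any gene" and "breast cancer" while the system maintains lists like these.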
![Page 10: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/10.jpg)
Extracting Relationships using NLP: Biomarkers
• Gene (from Entrez)
• Complex linguistic relationship
• Disease (from MedDRA)
• Relevant sentence extracted with terms highlighted
• Link to source document
![Page 11: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/11.jpg)
Extracting Numerical Data in Context: Safety
• Compound
• Potential safety issues
• In this organ
• At this dosage
![Page 12: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/12.jpg)
Standardizing/Clustering to Save Review
• Go directly to answers, e.g. find all the genes associated with a specific disease
• Highlighted evidence and link to the document
• Save time in review:
– Rather than reading all 470 documents for ERBB2, just read enough to check relationship exists
– Can then concentrate on the longer tail of less well-known genes/proteins
• Customer reports of order of magnitude speed-up
![Page 13: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/13.jpg)
Finding the Most Relevant Documents
![Page 14: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/14.jpg)
Finding Relevant Documents
• More comprehensive results
– Terminologies
– Regular Expressions
• Precise expressions e.g. for miRNA (simplified)
» let-?\d+.*
» mirn?a?-?\d+.*
– High throughput searches
• lists of 500 genes, chemicals etc.
– Chemical substructure and similarity searching
• Reduced noise in results
– Use of linguistics rather than distance (e.g. within n words)
– Regions
• abstract, methods, claims, claim, table
– Local negation e.g. “dead” for death, but not “dead time”
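A minimal Python sketch of two of the points above, under stated assumptions. The two miRNA patterns are the simplified ones from the slide; matching them against whole tokens, the negative lookahead used for "dead time", and the function names are assumptions made for illustration, not how I2E implements these features.

```python
import re

# The two simplified miRNA patterns from the slide, matched
# case-insensitively against whole tokens.
MIRNA_PATTERNS = [
    re.compile(r"let-?\d+.*", re.IGNORECASE),
    re.compile(r"mirn?a?-?\d+.*", re.IGNORECASE),
]

# Assumed rule for local negation: "dead" counts unless followed by "time".
DEAD_NOT_DEAD_TIME = re.compile(r"\bdead\b(?!\s+time\b)", re.IGNORECASE)


def find_mirnas(text):
    """Return the tokens that fully match one of the miRNA patterns."""
    tokens = re.findall(r"[\w-]+", text)
    return [t for t in tokens if any(p.fullmatch(t) for p in MIRNA_PATTERNS)]


def mentions_death(text):
    """True if "dead" occurs outside the phrase "dead time"."""
    return DEAD_NOT_DEAD_TIME.search(text) is not None


# Invented sentences, for illustration only.
print(find_mirnas("let-7a and miR-21 were compared with a mirror control."))
print(mentions_death("The column dead time was measured."))
print(mentions_death("Two animals were found dead."))
```

Requiring a digit after the prefix is what keeps ordinary words such as "mirror" or "letter" out of the results.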
![Page 15: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/15.jpg)
More Comprehensive Results: Terminologies
• Re-use terminologies with 10s of thousands of concepts, and 100s of thousands of synonyms
• If we are interested in genes associated with cancer, we don’t just want synonyms of “cancer” e.g. “malignant neoplasm”, but also any specific cancer
Ways of Expressing concept of “Cancer”
– Cancer
– Malignant neoplasm
– Malignant tumor
– Malignancy
– Carcinoma
– ….
Different kinds of Cancer
– Leukaemia
– Hamartoma
– Paraneoplastic Syndromes
– Adamantinoma
– Peutz-Jeghers Syndrome
– Hydatidiform Mole
– Plus 1000s more…
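The difference between synonyms and more specific concepts can be sketched as a walk over a small hierarchy. This is a toy illustration: the terms come from the slide, while the dictionary layout and the `expand_concept` name are assumptions.

```python
# Toy fragment of a disease terminology. Terms are taken from the slide;
# the data layout is an assumption made for this sketch.
SYNONYMS = {
    "Cancer": ["Malignant neoplasm", "Malignant tumor", "Malignancy"],
}
CHILDREN = {
    "Cancer": ["Leukaemia", "Hamartoma"],
}


def expand_concept(concept):
    """Collect the concept, its synonyms, and the same for all descendants."""
    terms = [concept] + SYNONYMS.get(concept, [])
    for child in CHILDREN.get(concept, []):
        terms.extend(expand_concept(child))
    return terms


cancer_terms = expand_concept("Cancer")
print(cancer_terms)
```

A search for "Cancer" expanded this way also retrieves documents that only mention a specific cancer and never use the word "cancer" itself.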
![Page 16: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/16.jpg)
Chemical Text Mining
• Ability to efficiently answer precise queries e.g.
– What chemicals with this substructure act as inhibitors
![Page 17: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/17.jpg)
Accelerating a Search Strategy
![Page 18: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/18.jpg)
Contextual Landscaping
• Looking for words before or after the word of interest
• Linguistics to reduce noise, to find types of chocolate
Results from 100K USPTO Patents
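A rough Python sketch of the "word before the word of interest" idea, using a plain one-word window rather than linguistics. The sentences are invented for illustration and `words_before` is a hypothetical helper.

```python
import re
from collections import Counter


def words_before(target, texts):
    """Count the word immediately preceding each occurrence of `target`."""
    pattern = re.compile(
        r"\b([\w-]+)\s+" + re.escape(target) + r"\b", re.IGNORECASE
    )
    counts = Counter()
    for text in texts:
        counts.update(m.group(1).lower() for m in pattern.finditer(text))
    return counts


# Invented sentences for illustration only; not drawn from real patents.
SAMPLE = [
    "A coating of dark chocolate is applied.",
    "The filling contains milk chocolate and dark chocolate.",
    "Adding chocolate to the mixture.",
]
landscape = words_before("chocolate", SAMPLE)
print(landscape.most_common())
```

The plain window also picks up "adding", which is not a type of chocolate. That kind of noise is what the linguistic filtering mentioned on the slide is meant to remove.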
![Page 19: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/19.jpg)
Which Regions Does this Term Appear In
• In search we can often restrict to a particular region of a document
• In text mining we can first check all the regions that the term appears in
– E.g. look for the regions that IC50 appears in
• See what you want and what you might lose
– We then have better evidence to restrict to particular regions
Results from 100K 2011 Patents
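As a sketch of checking regions before restricting to them, assume each document has already been split into named regions. The region names follow the earlier slide; the texts and the `region_counts` helper are invented for illustration.

```python
from collections import Counter

# Hypothetical pre-segmented documents: region name -> region text.
# The texts are invented and carry no real data.
DOCS = [
    {
        "abstract": "Novel kinase inhibitors.",
        "claims": "A compound of formula I.",
        "table": "Compound 1 IC50 12 nM",
    },
    {
        "abstract": "Compounds with IC50 below 1 uM.",
        "claims": "A compound with IC50 below 1 uM.",
        "table": "Compound 2 IC50 0.8 uM",
    },
]


def region_counts(term, docs):
    """Count, per region, how many documents mention `term` there."""
    counts = Counter()
    for doc in docs:
        for region, text in doc.items():
            if term.lower() in text.lower():
                counts[region] += 1
    return counts


ic50_regions = region_counts("IC50", DOCS)
print(ic50_regions.most_common())
```

In this toy data, restricting the search to claims would keep one document and lose the one that reports IC50 only in a table.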
![Page 20: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/20.jpg)
What IPC Codes are Assigned to a Company’s Patents
• Find most frequent codes in patents where a company or set of companies (e.g. pharma) is the Applicant or Assignee
Results from recent USPTO Patents
![Page 21: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/21.jpg)
Summarizing Results from Multiple Documents for Efficient Review and Integration
![Page 22: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/22.jpg)
Semantics: 1000s of ways of saying the same thing
• Find and extract patterns to find a particular concept
• For example: how do we distinguish people:
– willing to get a vaccine
– not willing
• If we can partition the two populations we can then see how they are influenced
Examples from Twitter
Concept of getting a vaccine
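A deliberately small sketch of partitioning messages by pattern. The regular expressions, labels, and example messages are all hypothetical; a real pattern set covering the many ways of expressing this concept would be far larger.

```python
import re

VACCINE = r"(?:vaccine|vaccinated|shot|jab)"

# Hypothetical patterns, checked in this order so that a negated
# phrase is not mistaken for a positive one.
NOT_WILLING = re.compile(
    r"\b(?:won't|will not|not|never|refuse to)\s+(?:get|getting|be)\b.*\b"
    + VACCINE + r"\b",
    re.IGNORECASE,
)
WILLING = re.compile(
    r"\b(?:getting|got|will get|going to get)\b.*\b" + VACCINE + r"\b",
    re.IGNORECASE,
)


def classify(message):
    """Assign a message to one of the two populations, or to neither."""
    if NOT_WILLING.search(message):
        return "not willing"
    if WILLING.search(message):
        return "willing"
    return "unknown"


# Invented messages, for illustration only.
for text in ["Getting my flu shot tomorrow", "I will not get the vaccine"]:
    print(text, "->", classify(text))
```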
![Page 23: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/23.jpg)
Semantics: Standardized Identifiers for Concepts
• Terminologies can be used to link individual synonyms to a semantic concept
• Concept is uniquely identified by the ontology and the node identifier (and, typically, by the preferred term)
• This allows better clustering of results, better statistics, and allows us to connect results from text mining with other databases, or the semantic web
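The mapping from synonyms to a single identified concept might look like the sketch below. The ontology name and node identifier are placeholders invented for this example, and `Concept` and `normalize` are hypothetical names.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Concept(NamedTuple):
    ontology: str
    node_id: str
    preferred_term: str


# Placeholder ontology name and node identifier, invented for this sketch.
BREAST_CANCER = Concept("EXAMPLE_ONTOLOGY", "NODE_0001", "Breast Cancer")

SYNONYM_TO_CONCEPT = {
    "breast cancer": BREAST_CANCER,
    "breast carcinoma": BREAST_CANCER,
    "cancer of breast": BREAST_CANCER,
}


def normalize(mention: str) -> Optional[Concept]:
    """Map a surface mention to its concept, or None if unknown."""
    return SYNONYM_TO_CONCEPT.get(mention.strip().lower())


mentions = ["Breast Carcinoma", "breast cancer", "Cancer of Breast"]
concept_counts = Counter(normalize(m) for m in mentions)
print(concept_counts[BREAST_CANCER])
```

Counting by concept rather than by surface string is what gives the better clustering and statistics, and the (ontology, node identifier) pair is the key for joining with other databases.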
![Page 24: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/24.jpg)
Semantics: Directed Relationships
• Extract the same relationship, even if it is expressed very differently e.g.
– SRC phosphorylates EGFR
– phosphorylation of EGFR by SRC
– EGFR is phosphorylated by SRC
• Establish the direction of the relationship
– SRC phosphorylates EGFR, not EGFR phosphorylates SRC
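The three phrasings on this slide can be normalized to one directed triple with named groups. This is a regex-only sketch for these exact phrasings, assumed for illustration; it is not a substitute for the NLP the slides describe.

```python
import re

# Each pattern names its groups, so agent and target come out in a fixed
# direction regardless of the surface word order.
PATTERNS = [
    re.compile(r"(?P<agent>\w+) phosphorylates (?P<target>\w+)"),
    re.compile(r"phosphorylation of (?P<target>\w+) by (?P<agent>\w+)"),
    re.compile(r"(?P<target>\w+) is phosphorylated by (?P<agent>\w+)"),
]


def extract(sentence):
    """Return (agent, relation, target) for the first matching pattern."""
    for pattern in PATTERNS:
        m = pattern.search(sentence)
        if m:
            return (m.group("agent"), "phosphorylates", m.group("target"))
    return None


for s in [
    "SRC phosphorylates EGFR",
    "phosphorylation of EGFR by SRC",
    "EGFR is phosphorylated by SRC",
]:
    print(extract(s))
```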
![Page 25: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/25.jpg)
Finding Information Across Multiple Documents
![Page 26: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/26.jpg)
Document Categorization
• Categorizing Patents according to Disease Area
– Linguistics and ontologies provide better context, greater insight
Freemind
400K Patents
![Page 27: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/27.jpg)
Tell me everything about X
• Search provides most relevant documents mentioning X
• Text mining can summarize distinct properties of X by clustering facts extracted from all documents
![Page 28: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/28.jpg)
Linking Knowledge: Indirect Relationships
• Thalidomide to a Gene/Protein
• A Gene/Protein to Angiogenesis
Visualization via Cytoscape
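Linking two sets of extracted relationships through a shared gene/protein amounts to a join. In the sketch below the gene names are placeholders and the pairs are not real findings; `indirect_links` is a hypothetical helper.

```python
# Placeholder gene names; these pairs are NOT real findings.
DRUG_TO_GENE = [("Thalidomide", "GENE_A"), ("Thalidomide", "GENE_B")]
GENE_TO_PROCESS = [("GENE_B", "Angiogenesis"), ("GENE_C", "Angiogenesis")]


def indirect_links(first_hop, second_hop):
    """Join two relation lists on the shared middle entity."""
    by_source = {}
    for mid, end in second_hop:
        by_source.setdefault(mid, []).append(end)
    return [
        (start, mid, end)
        for start, mid in first_hop
        for end in by_source.get(mid, [])
    ]


paths = indirect_links(DRUG_TO_GENE, GENE_TO_PROCESS)
print(paths)
```

Each resulting path is a candidate indirect relationship whose two hops may come from different documents.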
![Page 29: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/29.jpg)
Within a Document
![Page 30: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/30.jpg)
Linking Within Complex Patent Documents
• Linking information in one part of a patent to another e.g.
– Finding compounds with a particular substructure where a value is reported
…
![Page 31: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/31.jpg)
From One Claim to the Next
• For information in claims, often want to work back along the chain of claims
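Working back along a chain of claims can be sketched as following "claim N" back-references. The claim texts are invented, and the assumption that each dependent claim cites its parent with the literal wording "claim N" is a simplification made for this example.

```python
import re

# Invented claims, for illustration only.
CLAIMS = {
    1: "A compound of formula I.",
    2: "The compound of claim 1, wherein R1 is methyl.",
    3: "A composition comprising the compound of claim 2.",
}
CLAIM_REF = re.compile(r"\bclaim (\d+)\b", re.IGNORECASE)


def claim_chain(number, claims):
    """Follow back-references from a claim to the claim it depends on."""
    chain = [number]
    while True:
        m = CLAIM_REF.search(claims[chain[-1]])
        if m is None:
            return chain
        parent = int(m.group(1))
        # Stop on cycles or references to claims we do not have.
        if parent in chain or parent not in claims:
            return chain
        chain.append(parent)


print(claim_chain(3, CLAIMS))
```

With the chain in hand, information found in a dependent claim can be read together with the claims it builds on.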
![Page 32: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/32.jpg)
Reproducible Workflows
![Page 33: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/33.jpg)
Automation
• Queries can be run regularly as documents get updated
• Integrate text mining with workflow tools for
– up-to-date dashboards
– alerts
– integration with other structured data sources
– web portals
• Provide a range of analytics and visualization
Pipeline Pilot
![Page 34: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/34.jpg)
Clinical Trials Analysis from I2E Text Mining Results
Intervention
Using Pipeline Pilot Visualization
![Page 35: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/35.jpg)
Clinical Trials Analysis from I2E Text Mining Results
Dates
Colours indicate Phase
Using Pipeline Pilot visualization
![Page 36: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining](https://reader033.vdocuments.site/reader033/viewer/2022051817/547b43d1b4af9fb9158b4e90/html5/thumbnails/36.jpg)
Conclusions
• Agile Text Mining provides query power and flexibility
– Can address the long tail of less predictable questions
– Allows “queries of arbitrary complexity” combining
• NLP, ontologies, regular expressions, regions, numerical expressions, chemical substructure/similarity, disambiguation
– Uses the data itself to
• inform search strategies
• build terminologies
– Provides flexible, structured output to
• allow integration with existing structured data
• fit into existing workflows
• This enables very wide application of text mining: wherever there is a need to search, extract, or categorize information