Automating Discovery from Biomedical Texts Marti Hearst & Barbara Rosario UC Berkeley Agyinc Visit August 16, 2000

Post on 19-Dec-2015


Page 1

Automating Discovery from Biomedical Texts

Marti Hearst & Barbara Rosario
UC Berkeley

Agyinc Visit
August 16, 2000

Page 2

The LINDI Project
Linking Information for New Discoveries

Two Main Thrusts:
– UIs for building and reusing hypothesis-seeking strategies
– Statistical language analysis techniques for extracting propositions

Page 3

Scenario: Explore Functions of a Gene

Objective
– Determine the functions of a newly sequenced Gene X

Known facts
– Gene X co-expresses (is activated in the same cell) with Genes A, B, and C
– The relationship of Genes A, B, and C with certain types of diseases (from the medical literature)

Question
– What types of diseases is Gene X related to?

Page 4

Gene Co-expression: Role in the genetic pathway

[Diagram: candidate pathway arrangements of PSA, Kall., and PAP with unknown genes g? and h?]

Other possibilities as well

Page 5

Make use of the literature

Look up what is known about the other genes.
– Different articles in different collections

Look for commonalities
– Similar topics indicated by Subject Descriptors
– Similar words in titles and abstracts: adenocarcinoma, neoplasm, prostate, prostatic neoplasms, tumor markers, antibodies ...

Page 6

Developing Strategies

Different strategies seem needed for different situations
– First: see what is known about Kallikrein.
– 7341 documents. Too many
– AND the result with “disease” category
  » If result is non-empty, this might be an interesting gene
– Now get 803 documents

Page 7

Explore Functions of New Gene X

[Diagram labels: Medical Literature; Gene-A; Keywords; Query; Projection; Mapping]

Slide adapted from K. Patel

Page 8

Developing Strategies

Different strategies seem needed for different situations
– First: see what is known about Kallikrein.
– 7341 documents. Too many
– AND the result with “disease” category
  » If result is non-empty, this might be an interesting gene
– Now get 803 documents
– AND the result with PSA
  » Get 11 documents. Better!
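The narrowing steps on this slide amount to repeated boolean AND over result sets. A minimal sketch, assuming a toy inverted index; the index contents, document IDs, and the `and_query` helper are made up for illustration and are not part of LINDI:

```python
# Toy inverted index: term or category -> set of document IDs.
# All IDs here are invented; the real counts on the slide were 7341, 803, and 11.
index = {
    "kallikrein": {1, 2, 3, 4, 5, 6},
    "disease": {2, 3, 4, 7, 8},
    "psa": {3, 4, 9},
}

def and_query(index, *terms):
    """Return the IDs of documents that match every term (boolean AND)."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()

# Each step adds one more constraint, so the result can only shrink.
step1 = and_query(index, "kallikrein")
step2 = and_query(index, "kallikrein", "disease")
step3 = and_query(index, "kallikrein", "disease", "psa")
print(len(step1), len(step2), len(step3))
```

Because each step is an intersection, an empty result at any stage tells the researcher that the added constraint has no support in the collection.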

Page 9

Explore Functions of New Gene X

[Diagram labels: Medical Literature; Gene-A, Gene-B, Gene-C; Keywords; Query; Projection; Intersection]

Page 10

Developing Strategies

Look for commonalities among these documents
– Manual scan through ~100 category labels
– Would have been better if
  » Automatically organized
  » Intersections of “important” categories scanned for first
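One way the "automatically organized" wish could look in practice is to rank category labels by how many retrieved documents carry them, then check the intersection of the top labels first. This is only a sketch under that assumption; the labels, document names, and both helper functions are hypothetical:

```python
from collections import Counter

# Hypothetical category labels attached to retrieved documents (made-up data).
DOC_LABELS = {
    "doc1": ["Prostatic Neoplasms", "Tumor Markers", "Antibodies"],
    "doc2": ["Prostatic Neoplasms", "Tumor Markers"],
    "doc3": ["Prostatic Neoplasms", "Adenocarcinoma"],
}

def rank_labels(doc_labels):
    """Rank labels by the number of documents that carry them."""
    counts = Counter(
        label for labels in doc_labels.values() for label in set(labels)
    )
    return counts.most_common()

def docs_with_all(doc_labels, wanted):
    """Return the documents that carry every label in `wanted`."""
    wanted = set(wanted)
    return sorted(doc for doc, labels in doc_labels.items() if wanted <= set(labels))

ranked = rank_labels(DOC_LABELS)
top_two = [label for label, _ in ranked[:2]]
print(ranked[:2])
print(docs_with_all(DOC_LABELS, top_two))
```

Document frequency is just one possible notion of an "important" category; the slides do not say which measure was intended.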

Page 11

Explore Functions of New Gene X

[Diagram labels: Medical Literature; Gene-A, Gene-B, Gene-C; Keywords; Query; Projection; Intersection; Mapping; Slicing]

Slide adapted from K. Patel

Page 12

Try a new tack

Researcher uses knowledge of field to realize these are related to prostate cancer and diagnostic tests

New tack: intersect search on all three known genes
– Hope they all talk about diagnostics and prostate cancer
– Fortunately, 7 documents returned
– Bingo! A relation to regulation of this cancer

Page 13

Explore Functions of New Gene X

[Diagram labels: Medical Literature; Gene-A, Gene-B, Gene-C; Keywords; Query; Projection; Intersection; Mapping; Slicing; Possible Function for Gene-X]

Slide adapted from K. Patel

Page 14

Formulate a Hypothesis

Hypothesis: mystery gene has to do with regulation of expression of genes leading to prostate cancer

New tack: do some lab tests
– See if mystery gene is similar in molecular structure to the others
– If so, it might do some of the same things they do

Page 15

Strategies again

In hindsight, combining all three genes was a good strategy.
– Store this for later

Might not have worked
– Need a suite of strategies
– Build them up via experience and a good UI

Page 16

The System

Doing the same query with slightly different values each time is time-consuming and tedious

Same goes for cutting and pasting results
– IR systems don’t support varying queries like this very well.
– Each situation is a bit different

Some automatic processing is needed in the background to eliminate/suggest hypotheses

Page 17

The User Interface

A general search interface should support
– History
– Context
– Comparison
– Operators: Intersection, Union, Slicing
– Operator Reuse
– Visualization (where appropriate)

We have an initial implementation

It needs lots of work

Page 18

Architecture of LINDI UI

– Data Layer
– Annotation Layer
– User Interface Layer

Page 19

Data Layer

Purpose
– Hide different formats of text collections

Components
– Data: abstractions representing records of a text collection
– Operations: performed on the data

Data
– A set of records
– Each record is a set of tuples with types

Operations
– union, intersection, projection, mapping
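The data model on this slide (a set of records, each record a set of typed tuples) maps naturally onto sets of frozensets. The sketch below is one possible reading, not LINDI's actual implementation: the `(type, value)` tuple layout, the function signatures, and the sample records are all assumptions, and the slides do not define exactly what "mapping" returns.

```python
from typing import Callable, FrozenSet, List, Set, Tuple, TypeVar

Field = Tuple[str, str]        # assumed layout: (type, value)
Record = FrozenSet[Field]      # a record is a set of typed tuples
T = TypeVar("T")

def make_record(*fields: Field) -> Record:
    return frozenset(fields)

def union(a: Set[Record], b: Set[Record]) -> Set[Record]:
    return a | b

def intersection(a: Set[Record], b: Set[Record]) -> Set[Record]:
    return a & b

def projection(data: Set[Record], field_type: str) -> Set[Record]:
    """Keep only tuples of one type; drop records left empty."""
    projected = (frozenset(f for f in rec if f[0] == field_type) for rec in data)
    return {rec for rec in projected if rec}

def mapping(data: Set[Record], fn: Callable[[Record], T]) -> List[T]:
    """Apply a function to every record (assumed meaning of 'mapping')."""
    return [fn(rec) for rec in data]

# Made-up sample records.
r1 = make_record(("gene", "PSA"), ("mesh", "Prostatic Neoplasms"))
r2 = make_record(("gene", "PAP"), ("mesh", "Prostatic Neoplasms"))
r3 = make_record(("gene", "PAP"), ("title", "untitled example"))

a, b = {r1, r2}, {r2, r3}
print(intersection(a, b) == {r2})
print(projection(a, "mesh"))
print(sorted(mapping(union(a, b), len)))
```

Using immutable, hashable records means union and intersection come for free from Python's set type, which is one reason to hide collection-specific formats behind a layer like this.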

Page 20

Annotation Layer

Purpose
– Associate data sets with the operations that produced them (history)
– History is a first-class object

Advantage
– Streamline a sequence of operations
– Reuse operations
– Parameterize operations
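To make "history is a first-class object" concrete, the sketch below stores each result together with the operations that produced it and lets a recorded sequence be replayed with new parameter values. The class names `Annotated` and `Strategy`, the step functions, and the toy collection are hypothetical illustrations, not the LINDI design.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Params = Dict[str, Any]
Step = Tuple[str, Callable[[Any, Params], Any]]

@dataclass(frozen=True)
class Annotated:
    """A data set together with the operations that produced it."""
    data: Any
    history: Tuple[Tuple[str, Params], ...] = ()

class Strategy:
    """A recorded sequence of named operations that can be re-run."""

    def __init__(self) -> None:
        self.steps: List[Step] = []

    def add(self, name: str, fn: Callable[[Any, Params], Any]) -> "Strategy":
        self.steps.append((name, fn))
        return self

    def run(self, **params: Any) -> Annotated:
        result = Annotated(data=None)
        for name, fn in self.steps:
            new_data = fn(result.data, params)
            result = Annotated(new_data, result.history + ((name, dict(params)),))
        return result

# Made-up collection keyed by gene name.
COLLECTION = {
    "GA": [{"id": 1, "mesh": ["Prostatic Neoplasms"]}],
    "GB": [{"id": 2, "mesh": ["Tumor Markers"]}],
    "GC": [],
}

strategy = (
    Strategy()
    .add("query", lambda _data, p: COLLECTION.get(p["gene"], []))
    .add("project_mesh", lambda recs, _p: [r["mesh"] for r in recs])
)

# Reuse the same operator sequence with a different parameter each time.
results = {gene: strategy.run(gene=gene) for gene in ("GA", "GB", "GC")}
for gene, res in results.items():
    print(gene, res.data, [name for name, _ in res.history])
```

Keeping the history on the result itself is what allows a successful sequence to be stored and reapplied later, as the "Strategies again" slide suggests.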

Page 21

User Interface

Direct manipulation of information objects and access operations
– Query
– Intersection
– Union
– Mapping
– Slicing

Record and reuse of past operations

Parameterization of operations

Streamlining of operations

Page 22

Initial Palette

Page 23

Query Structure Determined by Collection Type

Page 24

Query Operation Results

Page 25

Projection Operation and Subsequent Results

Page 26

Parameterized Query: Repeat operations with different values

[Screenshot labels: GA, GB, GC]

Page 27

Intersection over Projected Attribute

Page 28

Intersection over Projected Attribute

Page 29

Example Interaction with UI Prototype

1. Query on gene names
2. Project out only MeSH headings
3. Intersect the results
4. Map to create a ranking
5. Slice out the top-ranked
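The five steps above can be sketched end to end on toy data. Everything here is assumed for illustration: the collection, the function names, and in particular the ranking rule (total occurrences of each shared heading), since the slides do not say how the ranking is computed.

```python
from collections import Counter

# Made-up records; real input would come from a literature collection.
COLLECTION = [
    {"gene": "GA", "mesh": ["Prostatic Neoplasms", "Tumor Markers"]},
    {"gene": "GA", "mesh": ["Prostatic Neoplasms", "Antibodies"]},
    {"gene": "GB", "mesh": ["Prostatic Neoplasms", "Tumor Markers"]},
    {"gene": "GC", "mesh": ["Prostatic Neoplasms", "Tumor Markers", "Adenocarcinoma"]},
]

def query(gene):
    """Step 1: retrieve the records for one gene name."""
    return [rec for rec in COLLECTION if rec["gene"] == gene]

def project(records, attr):
    """Step 2: keep only the values of one attribute."""
    return [value for rec in records for value in rec[attr]]

def intersect(value_lists):
    """Step 3: values shared by every result list."""
    sets = [set(values) for values in value_lists]
    return set.intersection(*sets) if sets else set()

def rank(shared, value_lists):
    """Step 4: rank shared values by total occurrences (assumed rule)."""
    counts = Counter(v for values in value_lists for v in values if v in shared)
    return counts.most_common()

def slice_top(ranked, k):
    """Step 5: keep the k top-ranked entries."""
    return ranked[:k]

genes = ["GA", "GB", "GC"]
projected = [project(query(g), "mesh") for g in genes]
shared = intersect(projected)
ranked = rank(shared, projected)
top = slice_top(ranked, 1)
print(shared)
print(top)
```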

Page 30

Future Work on UI

As currently designed
– Better labeling
– Better layout
  » Intuitive
  » Scalable
– Connection to real backend
– User testing
  » Does direct manipulation work?
  » What operator sequences help?
  » How to improve parameterization?

More advanced
– Support for strategies
– Incorporation of NLP

Page 31

Language Analysis Component

Goals:
– Extract Propositions from Text
– Make Inferences

Page 32

Language Analysis Component

Why Extract Propositions from Text?
– Text is how knowledge at the propositional level is communicated
– Text is continually being created and updated by the outside world

Page 33

Example: Statistical Semantic Grammar

To detect causal relationships between medical concepts
– Title: Magnesium deficiency implicated in increased stress levels.
– Interpretation: <nutrient><reduction> related-to <increase><symptom>
– Inference:
  » Increase(stress, decrease(mg))
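The title-to-inference example can be traced with a deliberately tiny, hand-written stand-in. Note the swap: this sketch uses a fixed lexicon and a single exact-match rule, not a statistical grammar, and the lexicon entries, `PATTERN`, and the `infer` function are assumptions made only to show how the interpretation yields the proposition on the slide.

```python
import re

# Hand-written lexicon: word -> (semantic category, symbol used in the output).
# A statistical semantic grammar would learn such assignments from data.
LEXICON = {
    "magnesium": ("nutrient", "mg"),
    "deficiency": ("reduction", "decrease"),
    "implicated": ("related-to", None),
    "increased": ("increase", "Increase"),
    "stress": ("symptom", "stress"),
}

# The interpretation from the slide, as a sequence of categories.
PATTERN = ["nutrient", "reduction", "related-to", "increase", "symptom"]

def tag(title):
    """Map known words to (category, symbol); unknown words are skipped."""
    tokens = re.findall(r"[a-z]+", title.lower())
    return [LEXICON[tok] for tok in tokens if tok in LEXICON]

def infer(title):
    """Return a proposition string if the title matches PATTERN, else None."""
    tagged = tag(title)
    if [category for category, _ in tagged] != PATTERN:
        return None
    nutrient, reduction, _rel, increase, symptom = (sym for _, sym in tagged)
    return f"{increase}({symptom}, {reduction}({nutrient}))"

print(infer("Magnesium deficiency implicated in increased stress levels."))
```

The brittleness of an exact-match rule like this is the motivation the next slide gives for moving to probabilistic grammars.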

Page 34

Statistical Semantic Grammars

Empirical NLP has made great strides
– But mainly applied to syntactic structure

Semantic grammars are powerful, but
– Brittle
– Time-consuming to construct

Idea:
– Use what we now know about statistical NLP to build up a probabilistic grammar
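One standard ingredient of a probabilistic grammar is estimating rule probabilities by relative frequency from annotated examples. The sketch below shows only that step; the rule inventory and counts are invented, and the slides do not state which estimation method the project intended.

```python
from collections import Counter

# Hypothetical observed rule expansions: (left-hand side, right-hand side).
observed = [
    ("CAUSE", ("nutrient", "reduction")),
    ("CAUSE", ("nutrient", "reduction")),
    ("CAUSE", ("drug", "increase")),
    ("EFFECT", ("increase", "symptom")),
]

def estimate_rule_probs(observed):
    """Relative-frequency estimate: P(rhs | lhs) = count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(observed)
    lhs_counts = Counter(lhs for lhs, _ in observed)
    return {
        (lhs, rhs): n / lhs_counts[lhs] for (lhs, rhs), n in rule_counts.items()
    }

probs = estimate_rule_probs(observed)
for (lhs, rhs), p in sorted(probs.items()):
    print(f"{lhs} -> {' '.join(rhs)}: {p:.2f}")
```

With probabilities attached, a rule that rarely applies is down-weighted rather than failing outright, which addresses the brittleness noted above.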

Page 35

LINDI: Target Components

1. Special UI for retrieving appropriate docs

2. Language analysis on docs to detect causal relationships between concepts

3. Probabilistic representation of concepts and relationships

4. UI + User: Hypothesis creation