Text Mining for Bioscience Applications:

The State of the Art

Marti Hearst

University of California, Berkeley

Outline

• Search vs. Discovery

• Why is text analysis difficult?

• Some current approaches

• Future directions

My Background

• Computer Scientist by training
  – NOT a biologist
• Professor in an interdisciplinary program
  – School of Information Management & Systems (SIMS)
  – Affiliated with the UCSF Bioinformatics Grad Group
• Research fields are
  – Computational Linguistics
  – Search (Information Retrieval)
  – User Interfaces and Information Visualization
• Have focused for a while on bioscience text
• Have received research support from Genentech

Monet, Haystack with Snow, Morning

Search vs. Discovery

Search: Finding hay in a haystack

Discovery: Creating a new kind of hay

Search Goals

• More accurate results

• More comprehensive results
  – Thesaurus expansion

• Intelligent summaries of results

• Organize results along biologically relevant lines

• Better user interfaces

Knowledge Discovery from Text

• How to discover new information …
• … as opposed to looking up what’s already known.
• Method:
  – Create hypotheses
  – Use large text collections to gather evidence to refute or support hypotheses
  – Do lab tests to verify promising results

Discovery Goals

• Genomics
  – Automatically build gene networks
  – Discover gene functions
• Pharmacology
  – Help determine which drugs can help cure a disease
  – Help determine which genetic traits will lead to a reaction to a drug
• Etiology
  – Discover underlying causes of disease

Why is Automated Text Analysis Difficult?

USA Today, 2/26/04, Szabo & Appleby

Why is automated text analysis difficult?

“Avastin, developed by South San Francisco-based Genentech (DNA), was approved for advanced colorectal cancer and for patients who haven't received other chemotherapy, according to the Food and Drug Administration.”

– What is “approved” doing in this sentence?
  • John was approved for advancement -> gets a promotion.
  • Avastin was approved for cancer -> to fight cancer.
  • Avastin was approved for patients -> to consume to fight cancer.
– What kind of patients is it approved for?
  • Ambiguous. Could be for anyone who hasn’t received chemotherapy, or only those patients with advanced colorectal cancer who haven’t received chemotherapy.



“This could easily be a multibillion-dollar drug,” McCamant says.

Refers to concepts mentioned in earlier sentences.



"Avastin opens up this new gateway for cancer care," says William Li, president of the Angiogenesis Foundation in Massachusetts. "It's the first in a fleet of other drugs.”

– Is Avastin a vehicle? It opens gateways and travels in a fleet!



• There are many indirect ways to say things:

– A two-dose combined hepatitis A and B vaccine would facilitate immunization programs.

• The vaccine helps prevent hep B.

– These results suggest that con A-induced hepatitis was ameliorated by pretreatment with TJ-135.

• The treatment TJ-135 helps cure hep.

– Effect of interferon on hepatitis B.

• There is an unspecified effect of interferon on hep B.

What do we do?

• Solve sub-problems
  – Extract certain types of entities
    • Gene/protein names
    • Abbreviation definitions
  – Classify the noun phrases using ontologies
    • MeSH, LocusLink, GO, etc.
  – Define relationship types; try to recognize them.
  – Many other subproblems are actively being worked on
    • Word sense disambiguation
    • Co-reference resolution

Two Main Approaches

Hand-built Rules Machine Learning

Two Main Approaches

• Hand-built rules
  – Can be very accurate
  – Are also very “brittle”
  – Don’t scale
• Machine learning
  – Usually requires labeled training data
    • Unsupervised methods under development
  – Can be made to scale
  – Is the way of the future

Abbreviation Definition Recognition

A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text, Ariel Schwartz and Marti Hearst, PSB 2003 Kauai, Jan 2003

• Fast, simple algorithm for recognizing abbreviation definitions.
  – Simpler and faster than the rest
    • Other approaches are cubic or quadratic in time
  – Higher precision and recall
  – Idea: Work backwards from the end
• Examples:
  – In eukaryotes, the key to transcriptional regulation of the Heat Shock Response is the Heat Shock Transcription Factor (HSF).
  – Gcn5-related N-acetyltransferase (GNAT)
• In future:
  – Use redundancy across abstracts to figure out abbreviation meaning even when definition is not present.
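The "work backwards from the end" idea can be sketched roughly as follows. This is a simplified illustration, not the published implementation: it assumes the "long form (SHORT)" pattern only, takes all text before the parenthesis as the candidate, and omits the checks on short-form validity and candidate length. The names `find_best_long_form` and `extract_definitions` are chosen for this sketch.

```python
import re


def find_best_long_form(short_form, candidate):
    """Match short-form characters right to left against the candidate text."""
    l_index = len(candidate) - 1
    for s_index in range(len(short_form) - 1, -1, -1):
        ch = short_form[s_index].lower()
        if not ch.isalnum():
            continue  # skip punctuation inside the short form
        # Move left until the character matches; the first short-form
        # character must additionally sit at the start of a word.
        while (l_index >= 0 and candidate[l_index].lower() != ch) or (
            s_index == 0 and l_index > 0 and candidate[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None  # ran out of text: no definition found
        l_index -= 1
    # Extend left to the start of the word where the match ended.
    start = candidate.rfind(" ", 0, l_index + 1) + 1
    return candidate[start:]


def extract_definitions(sentence):
    """Return (short form, long form) pairs for each parenthesized short form."""
    pairs = []
    for match in re.finditer(r"\(([^()]+)\)", sentence):
        short_form = match.group(1).strip()
        candidate = sentence[: match.start()].strip()  # simplification
        long_form = find_best_long_form(short_form, candidate)
        if long_form:
            pairs.append((short_form, long_form))
    return pairs


example = (
    "In eukaryotes, the key to transcriptional regulation of the Heat "
    "Shock Response is the Heat Shock Transcription Factor (HSF)."
)
print(extract_definitions(example))
```

Each character of the candidate text is visited at most once per call, which is why a scan of this kind stays cheap compared with approaches that align every short form against every possible span.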

Gene name co-occurrence

A literature network of human genes for high-throughput analysis of gene expression. Jenssen TK, Laegreid A, Komorowski J, Hovig E. Nat Genet. 2001 May;28(1):21-8.

PubGene Assumption:

If two genes are co-mentioned in a MEDLINE record, there is an underlying biological relationship.
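A minimal sketch of how a co-mention network could be built under this assumption. The gene names (`GENE_A`, `GENE_B`, `GENE_C`) and the records are placeholders invented for illustration, and the exact-token matching is a stand-in: recognizing real gene names with their synonyms and ambiguities is a much harder problem.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(records, gene_names):
    """Count, for each unordered gene pair, the records that mention both."""
    edges = Counter()
    for text in records:
        # Crude tokenization: strip commas and periods, split on whitespace.
        tokens = set(text.replace(",", " ").replace(".", " ").split())
        mentioned = sorted(g for g in gene_names if g in tokens)
        for pair in combinations(mentioned, 2):
            edges[pair] += 1
    return edges


# Hypothetical records standing in for MEDLINE titles and abstracts.
records = [
    "GENE_A interacts with GENE_B in fibroblasts.",
    "Expression of GENE_B and GENE_A, but not GENE_C.",
    "GENE_C alone.",
]
network = cooccurrence_network(records, ["GENE_A", "GENE_B", "GENE_C"])
for pair, weight in network.most_common():
    print(pair, weight)
```

The edge weight is the number of records mentioning both genes, so a threshold on the weight is one way to trade coverage against the share of incorrect pairs.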

Example: Genes highly upregulated at time point 6 h (6H) in the fibroblast serum response.

Green: upregulation

Red: downregulation

Evaluation: 29-40% of the pairs were incorrect

45% of OMIM pairs found

51% of DIP (Database of Interacting Proteins) pairs found

How to find functions of genes?

• Have the genetic sequence
• Don’t know what it does
• But …
  – Know which genes it coexpresses with
  – Some of these have known function
• So … infer function based on function of co-expressed genes
  – This is a problem suggested by Michael Walker and others at Incyte Pharmaceuticals

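Gene Co-expression: Role in the genetic pathway

[Diagram: possible positions of the unknown gene g? in a pathway with the co-expressed genes PSA, Kall., and PAP, in one variant together with a second unknown gene h?]

Other possibilities as well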

Make use of the literature

• Look up what is known about the other genes.

• Different articles in different collections
• Look for commonalities
  – Similar topics indicated by Subject Descriptors
  – Similar words in titles and abstracts: adenocarcinoma, neoplasm, prostate, prostatic neoplasms, tumor markers, antibodies ...
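Looking for commonalities can be sketched as counting which descriptors recur across the literature of the co-expressed genes. The descriptor-to-gene assignments below are made up for illustration (only the descriptor words themselves come from the slide), and `shared_descriptors` is a name chosen for this sketch.

```python
from collections import Counter


def shared_descriptors(descriptors_by_gene, min_genes=2):
    """Return (descriptor, gene count) for descriptors seen with several genes."""
    counts = Counter()
    for descriptors in descriptors_by_gene.values():
        counts.update(set(descriptors))  # count each gene at most once
    return [(d, n) for d, n in counts.most_common() if n >= min_genes]


# Illustrative data only: not taken from real article records.
descriptors_by_gene = {
    "PSA": ["prostatic neoplasms", "tumor markers", "antibodies"],
    "Kall.": ["prostatic neoplasms", "adenocarcinoma"],
    "PAP": ["prostatic neoplasms", "tumor markers"],
}
common = shared_descriptors(descriptors_by_gene)
print(common)
```

Descriptors shared by most of the known genes are the natural starting point for a hypothesis about the unknown one.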

Formulate a Hypothesis

• Hypothesis: mystery gene has to do with regulation of expression of genes leading to prostate cancer

• New tack: do some lab tests
  – See if mystery gene is similar in molecular structure to the others
  – If so, it might do some of the same things they do

Etiology Example

Complementary structures in disjoint science literatures. Don R. Swanson. In Proceedings of SIGIR ’91

• Goal: find cause of disease
  – Magnesium-migraine connection
• Given
  – medical titles and abstracts
  – a problem (incurable rare disease)
  – some medical expertise
• Find causal links among titles
  – symptoms
  – drugs
  – results

Gathering Evidence

[Diagram: evidence pairs involving migraine, magnesium, and the intermediate concepts stress, CCB, SCD, and PA]

Gathering Evidence

[Diagram: migraine and magnesium connected through the intermediate concepts stress, CCB, PA, and SCD]
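The evidence-gathering step can be sketched as a search for intermediate concepts that are each linked to both endpoints, while the endpoints themselves are never linked directly. The pair list below is a toy encoding of the diagram, treating each link as one concept pair co-mentioned in some title; the `unrelated concept` entry and the function name `find_links` are assumptions of this sketch.

```python
def find_links(pairs, source, target):
    """Concepts linked to both source and target when those two never co-occur."""
    neighbors = {}
    for a, b in pairs:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    if target in neighbors.get(source, set()):
        return []  # already directly connected: nothing new to discover
    return sorted(neighbors.get(source, set()) & neighbors.get(target, set()))


# Toy encoding of the diagram; not real bibliographic data.
pairs = [
    ("migraine", "stress"), ("stress", "magnesium"),
    ("migraine", "CCB"), ("CCB", "magnesium"),
    ("migraine", "SCD"), ("SCD", "magnesium"),
    ("migraine", "PA"), ("PA", "magnesium"),
    ("migraine", "unrelated concept"),
]
links = find_links(pairs, "migraine", "magnesium")
print(links)
```

Several independent intermediate concepts pointing at the same target is what makes the hypothesis worth testing, even though no single title states it.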

Swanson’s Linking Approach

• Two of his hypotheses have received some experimental verification.

• His technique
  – Only partially automated
  – Required medical expertise

• Recently others have made progress automating it.

Automating Swanson-style Discovery

Text Mining: Generating Hypotheses from MEDLINE, Padmini Srinivasan. To appear in JASIST.

• UMLS defines Semantic Types
• Every MeSH term is assigned one or more Semantic Types
  – Interferon type II falls within both:
    • Immunologic Factor and
    • Pharmacologic Substance
• Each PubMed article is assigned a set of MeSH terms
• The idea is to characterize a set of articles according to which semantic types their MeSH terms fall into.



Approach:
– User inputs topic T of interest
– User selects 2 sets from a small number of sets of UMLS semantic types
– System
  • Searches PubMed for articles about T
  • Selects out the important MeSH terms as determined by the user-chosen semantic type categories
  • Searches PubMed for articles that contain these MeSH terms
  • Combines the MeSH terms that result from these retrieved documents; call this result C
  • If a PubMed search on words from T and c from C is empty, places c as a candidate in a final result set R
  • Reports those terms in R that fall into the second user-selected semantic type set.
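The steps above can be sketched over a small in-memory corpus. Several things here are assumptions of the sketch rather than the published system: PubMed search is replaced by a scan over sets of MeSH terms, every term has exactly one semantic type, no weighting is used to pick "important" terms, and the corpus and type labels are invented for illustration rather than taken from PubMed or UMLS.

```python
def open_discovery(corpus, semantic_type, topic, link_types, target_types):
    """Propose terms never co-mentioned with the topic but reachable via links."""

    def search(term):
        # Stand-in for a PubMed query: articles indexed with this term.
        return [doc for doc in corpus if term in doc]

    # Articles about the topic T.
    topic_docs = search(topic)
    # MeSH terms from those articles that fall in the first type set.
    link_terms = {
        t for doc in topic_docs for t in doc
        if t != topic and semantic_type.get(t) in link_types
    }
    # Pool the MeSH terms of all articles containing a link term (the set C).
    pool = {t for b in link_terms for doc in search(b) for t in doc}
    # Keep c only if no article mentions both T and c (the set R).
    candidates = {c for c in pool if not any(c in doc for doc in topic_docs)}
    # Report candidates that fall in the second type set.
    return sorted(c for c in candidates if semantic_type.get(c) in target_types)


# Toy corpus: each article is the set of MeSH terms assigned to it.
corpus = [
    {"Raynaud Disease", "Blood Viscosity"},
    {"Blood Viscosity", "Fish Oils"},
    {"Raynaud Disease", "Vasodilator Agents"},
    {"Fish Oils", "Platelet Aggregation"},
]
# Simplified, made-up type labels (one per term).
semantic_type = {
    "Raynaud Disease": "disease",
    "Blood Viscosity": "function",
    "Platelet Aggregation": "function",
    "Fish Oils": "substance",
    "Vasodilator Agents": "substance",
}
result = open_discovery(
    corpus, semantic_type, "Raynaud Disease", {"function"}, {"substance"}
)
print(result)
```

Restricting both the link terms and the reported terms by semantic type is what keeps the candidate list short enough for a person to review.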



• Results: have successfully reproduced the 7 examples they tried, with very little manual intervention

• Example: input topic is Raynaud’s disease

Main Ideas for NLP Approach

• Assign Semantics using
  – Statistics
  – Hierarchical Lexical Ontologies to generalize
  – Redundancy in the data
• Build up Layers of Representation
  – Syntactic and Semantic
  – Use these in a feedback loop


Automated Relation Assignment

• Recall the problem:
  – A two-dose combined hepatitis A and B vaccine would facilitate immunization programs.
    • The vaccine helps prevent hep B.
• Identified 7 relations that can hold between Treatments and Diseases
• Used Machine Learning to address this
  – Graphical models
  – Neural nets
• Marked up the text with syntactic and semantic information
  – MeSH labels turn out to be very important



• Results

Future Directions

• In text analysis:
  – Move away from hand-built rules
  – More focus on labeling with semantics
• In problems tackled
  – There are so many possibilities!
  – Help with automated curation

Thank you!

Visit our site:

biotext.berkeley.edu
