A Bio Text Mining Workbench combined with Active Machine Learning Gary Geunbae Lee Postech 11/25 LBM2005


Page 1

A Bio Text Mining Workbench combined with Active Machine Learning

Gary Geunbae Lee

Postech

11/25 LBM2005

Page 2

Contents

• Introduction

• POSBIOTM/W Workbench

• POSBIOTM/NER System

• POSBIOTM/NER with Active Machine Learning

• POSBIOTM/Event System

• Current status (demo)

Page 3

Introduction

Exponentially growing biological publications

Page 4

Introduction

Two key issues in dealing with biological texts:

• Biological named entity recognition.

• Extraction of the biological interactions (events) between biological entities.

  • Important for biological pathways.

Page 5

Introduction

Bio-text mining workbench

• Development workbench (common in NLP)

  • Grammar development workbench

  • POS/Tree Tagging workbench

• Use of a large corpus

  • Machine learning methods are used in the NER task and the event extraction task.

  • An annotated corpus is essential for good results with machine-learning-based methods (both in quantity and quality).

  • Lack of annotated corpora (notorious in bio/medical fields)

• Need

  • Tools that support collecting, managing, creating, annotating and exploiting rich biomedical text resources.

  • Tools that interact with the automatic system to grow the high-quality annotated corpus.

Page 6

Contents

• Introduction

• POSBIOTM/W Workbench

• POSBIOTM/NER System

• POSBIOTM/NER with Active Machine Learning

• POSBIOTM/Event System

• Current status

Page 7

POSBIOTM/W: A development Workbench

Overall Design

Page 8

POSBIOTM/W Workbench

Managing Tool

• Goal

  • Help users search, collect and manage publications.

• Quick Search Bar

  • Provides quick access to PubMed.

• PubMed Search Assistant

  • Users can select specific abstracts for named-entity tagging and event extraction.

Page 9

POSBIOTM/W Workbench

Managing Tool

• PubMed Search Assistant

Page 10

POSBIOTM/W Workbench

NER Tool

• Named-entity recognition (NER) task

  • Identification of the names of the materials concerned.

  • Goal: automatically and effectively annotate biomedical entities.

• The NER Tool is a client tool of the POSBIOTM/NER System.

  • Currently, three NER models are provided: the GENIA-NER model, the GENE-NER model and the GPCR-NER model.

• Named-entity recognition with active learning

  • To minimize the human labeling effort

Page 11

POSBIOTM/W Workbench

NER Tool

• Named-entity recognition with active learning

Page 12

POSBIOTM/W Workbench

Event Extraction Tool

• Goal: to extract events, each consisting of an “interaction”, an “effecter” and a “reactant”.

• Named-entity types: protein (P), gene (G), small molecule (SM) and cellular process (CP).

• Interaction types: biological interaction (BI) and chemical interaction (CI).

• The Event Extraction Tool is a client tool of the POSBIOTM/Event System.

Page 13

POSBIOTM/W Workbench

Event Extraction Tool

• Extraction Result in XML format

<Result>
<NER>
....
<Sentence SNum="4"><protein>EDG-1</protein>, encoded by the <gene>endothelial_differentiation_gene-1</gene>, is a <protein>heterotrimeric_guanine_nucleotide_binding_protein-coupled_receptor</protein> (<protein>GPCR</protein>) for <small_molecule>sphingosine-1-phosphate</small_molecule> (<small_molecule>SPP</small_molecule>) that has been shown to stimulate <cellular_process>angiogenesis</cellular_process> and <cellular_process>cell_migration</cellular_process> in cultured endothelial cells.</Sentence>
.....
</NER>
<Event_Extraction>
<Event SNum="4">
<Interaction>stimulate</Interaction>
<Effecter>sphingosine-1-phosphate</Effecter>
<Reactant>angiogenesis</Reactant>
</Event>
.....
</Event_Extraction>
</Result>
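As a rough illustration of how a client might consume this output, the sketch below reads named entities and events back out of a result document with Python's standard `xml.etree.ElementTree`. The sample string is a shortened version of the sentence above, and the function name `read_result` is made up for this example; the actual POSBIOTM client code is not shown in the slides.

```python
import xml.etree.ElementTree as ET

# Shortened, hand-written sample in the same layout as the slide's XML.
SAMPLE = """<Result>
<NER>
<Sentence SNum="4"><protein>EDG-1</protein> is a receptor for <small_molecule>sphingosine-1-phosphate</small_molecule> that has been shown to stimulate <cellular_process>angiogenesis</cellular_process>.</Sentence>
</NER>
<Event_Extraction>
<Event SNum="4"><Interaction>stimulate</Interaction><Effecter>sphingosine-1-phosphate</Effecter><Reactant>angiogenesis</Reactant></Event>
</Event_Extraction>
</Result>"""


def read_result(xml_text):
    """Return (entities, events) found in a result document."""
    root = ET.fromstring(xml_text)

    # Each child element of a <Sentence> is one tagged named entity.
    entities = []
    for sentence in root.iter("Sentence"):
        snum = sentence.get("SNum")
        for child in sentence:
            entities.append((snum, child.tag, child.text))

    # Each <Event> carries exactly one interaction, effecter and reactant.
    events = []
    for event in root.iter("Event"):
        events.append({
            "sentence": event.get("SNum"),
            "interaction": event.findtext("Interaction"),
            "effecter": event.findtext("Effecter"),
            "reactant": event.findtext("Reactant"),
        })
    return entities, events


entities, events = read_result(SAMPLE)
for entity in entities:
    print(entity)
for event in events:
    print(event)
```

The `SNum` attribute is what links an event back to the sentence it was extracted from.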

Page 14

POSBIOTM/W Workbench

Event Extraction Tool

• Extraction Result

Page 15

POSBIOTM/W Workbench

Annotation Tool

• Goal

  • The GUI-based annotation tool is designed for editing annotations manually.

• Named-entity editing

  • Named entities are displayed in different colors, which can be changed.

  • Users can add, remove or correct named-entity tags, change the boundaries of named entities, etc.

Page 16

POSBIOTM/W Workbench

Annotation Tool

• Event editing

  • Extracted events are displayed in a table.

  • Double-clicking an event looks up the original sentence from which it was extracted.

• Upload function

  • Users can upload well-annotated data to the POSBIOTM system.

  • This allows the incremental build-up of a massive named-entity and event annotation corpus.

Page 17

POSBIOTM/W Workbench

Annotation Tool

Page 18

Contents

• Introduction

• POSBIOTM/W Workbench

• POSBIOTM/NER System

• POSBIOTM/NER with Active Machine Learning

• POSBIOTM/Event System

• Current status

Page 19

POSBIOTM/NER System

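Named Entity Recognition (NER)

• Approach

  • The named entity recognition problem is treated as a classification problem: each input token is marked up with a named entity category label.

• CRF

  • Conditional random fields (CRFs) [Lafferty et al. 2001] are a probabilistic framework for labeling and segmenting sequential data (s: state (tag); o: input):

$$P(S \mid O) = \frac{1}{Z_0} \exp\left( \sum_{i=1}^{N} \sum_{k} \lambda_k f_k(s_{i-1}, s_i, o_i) \right)$$

  • Here $\lambda_k$ is the weight of feature function $f_k$ and $Z_0$ is the normalization factor.

• For example:

$$f_k(s_{i-1}, s_i, o_i) = \begin{cases} 1 & \text{if } word(o_i) = \text{'EDG-1'},\ s_{i-1} = \text{I-DNA},\ s_i = \text{I-DNA}; \\ 0 & \text{otherwise.} \end{cases}$$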
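To make the formula concrete, here is a minimal sketch that scores tag sequences with weighted feature functions and normalizes by brute-force enumeration. The tag set, the second feature and both weights are invented for illustration; only the 'EDG-1' feature comes from the slide. Enumerating every tag sequence is only feasible for toy inputs; practical CRF implementations compute the normalizer with dynamic programming.

```python
import itertools
import math

TAGS = ["O", "B-DNA", "I-DNA"]  # assumed toy tag set


def f_word_edg1(prev_tag, tag, token):
    # The slide's example feature: fires on 'EDG-1' inside a DNA entity.
    return 1.0 if token == "EDG-1" and prev_tag == "I-DNA" and tag == "I-DNA" else 0.0


def f_capital_begins(prev_tag, tag, token):
    # Hypothetical extra feature: a capitalized token starting a DNA entity.
    return 1.0 if token[0].isupper() and tag == "B-DNA" else 0.0


# (weight lambda_k, feature f_k) pairs; the weights are made up.
FEATURES = [(2.0, f_word_edg1), (1.0, f_capital_begins)]


def score(tags, tokens):
    """Sum over positions i and features k of lambda_k * f_k(s_{i-1}, s_i, o_i)."""
    total = 0.0
    prev = "START"
    for tag, token in zip(tags, tokens):
        total += sum(weight * feature(prev, tag, token) for weight, feature in FEATURES)
        prev = tag
    return total


def probability(tags, tokens):
    """P(S | O) with the normalizer computed by enumerating all tag sequences."""
    z = sum(math.exp(score(candidate, tokens))
            for candidate in itertools.product(TAGS, repeat=len(tokens)))
    return math.exp(score(tags, tokens)) / z


tokens = ["Human", "EDG-1", "gene"]
print(probability(("B-DNA", "I-DNA", "I-DNA"), tokens))
```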

Page 20

POSBIOTM/NER System

Named Entity Recognition (NER)

• Feature Set

Feature | Description
Lexical word | Used only when the previous/current/next words are in the surface word dictionary.
Word feature | Orthographical features of the previous/current/next words: upper-case letters, numbers, non-alphabet letters, Greek words (e.g. alpha cells, beta hemolysis, tau interferon).
Prefix/suffix | Prefixes/suffixes contained in the prefix/suffix dictionary: biological prefix/suffix concepts (e.g. ase, blast, cyt, phore, plast).
Part-of-speech tag | POS tags of the previous/current/next words. The part of speech describes how a particular word is used (e.g. noun, verb).
Base noun phrase tag | Base noun phrase tags of the previous/current/next words.
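The word-level rows of the feature set can be sketched as a small extractor over a previous/current/next window. The dictionaries below are tiny stand-ins invented for this example (the system's real dictionaries are not given in the slides), and the POS and base-NP features are left out because they need an external tagger.

```python
# Hypothetical mini-dictionaries; the real ones are not listed in the slides.
GREEK_WORDS = {"alpha", "beta", "gamma", "delta", "kappa", "tau"}
BIO_AFFIXES = ("ase", "blast", "cyt", "phore", "plast")


def orthographic(token):
    """Orthographic and affix features of a single token."""
    lower = token.lower()
    affix = next((a for a in BIO_AFFIXES if lower.startswith(a) or lower.endswith(a)), None)
    return {
        "has_upper": any(c.isupper() for c in token),
        "has_digit": any(c.isdigit() for c in token),
        "has_non_alpha": any(not c.isalnum() for c in token),
        "is_greek": lower in GREEK_WORDS,
        "bio_affix": affix,
    }


def window_features(tokens, i):
    """Features of the previous, current and next token around position i."""
    features = {}
    for offset, name in ((-1, "prev"), (0, "cur"), (1, "next")):
        j = i + offset
        if 0 <= j < len(tokens):  # skip positions outside the sentence
            for key, value in orthographic(tokens[j]).items():
                features[f"{name}.{key}"] = value
    return features


tokens = ["tau", "interferon", "activates", "EDG-1", "kinase"]
print(window_features(tokens, 3))
```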

Page 21

POSBIOTM/NER System

NER Models

• Three NER models

  • GENIA model / GENE-NER model / GPCR-NER model

• GENIA model

  • Named entity classes used in the evaluation: DNA, RNA, protein, cell_line and cell_type.

  • The training data consists of 2,000 MEDLINE abstracts from the GENIA version 3 corpus. These abstracts were collected using the search terms “human”, “blood cell” and “transcription factor”.

  • The test data comes from a super-domain of the training data (“blood cell”, “transcription factor”).

Page 22

POSBIOTM/NER System

NER Models

• GENE-NER model

  • The GENE-NER module uses the BioCreative corpus.

  • Its aim is to identify which terms in a biomedical research article are gene and/or protein names.

  • The training corpus consists of 7.5k sentences, selected from MEDLINE according to their likelihood of containing gene names.

• GPCR-NER module (Postech)

  • Aims at recognizing four target named entity categories: protein, gene, small molecule and cellular process.

  • The training corpus consists of 50 full articles related to the GPCR (G-protein coupled receptor) signal transduction pathway.

Page 23

POSBIOTM/NER System

NER Models

• Evaluation of the three NER models

Model | Precision | Recall | F-Measure
GENIA-NER | 0.6960 | 0.6929 | 0.6945
GENE-NER | 0.7550 | 0.8404 | 0.7982
GPCR-NER | 0.6736 | 0.8135 | 0.7370

Page 24

Contents

• Introduction

• POSBIOTM/W Workbench

• POSBIOTM/NER System

• POSBIOTM/NER with Active Machine Learning

• POSBIOTM/Event System

• Current status

Page 25

POSBIOTM/NER with Active Learning

Active Learning in NER

• NER with Machine Learning

  • Enhances NER performance by re-using the annotated data to re-train the NER module.

• NER with Active Machine Learning

  • Minimizes the human labeling effort without degrading performance.

  • Selects the most informative samples for training.

Page 26

POSBIOTM/NER with Active Learning

Active Learning in NER Framework

Page 27

POSBIOTM/NER with Active Learning

Active Learning Scoring Strategy

• Uncertainty-based Sample Selection

  • An entropy-based measure quantifies the uncertainty of the current classifier (entropy or normalized entropy of the CRF conditional probability).

  • The most uncertain samples are selected for human annotation.
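A minimal sketch of entropy-based selection follows. It assumes the classifier exposes probabilities for a handful of candidate label sequences per sentence (an N-best list), which the slides do not spell out, and it normalizes by sentence length as one plausible reading of "normalized entropy". Both choices, and all function names, are assumptions made for illustration.

```python
import math


def sequence_entropy(nbest_probs):
    """Entropy of the renormalized probabilities of the candidate label sequences."""
    total = sum(nbest_probs)
    probs = [p / total for p in nbest_probs if p > 0]
    return -sum(p * math.log(p) for p in probs)


def normalized_entropy(nbest_probs, sentence_length):
    # Assumption: divide by the sentence length so long sentences are not favored.
    return sequence_entropy(nbest_probs) / sentence_length


def most_uncertain(pool, k):
    """pool maps a sentence id to its N-best probabilities; return the k most uncertain ids."""
    ranked = sorted(pool, key=lambda sid: sequence_entropy(pool[sid]), reverse=True)
    return ranked[:k]


pool = {
    "s1": [0.90, 0.05, 0.05],  # classifier is fairly sure
    "s2": [0.40, 0.30, 0.30],  # classifier is torn between candidates
    "s3": [1.00],              # only one candidate, no uncertainty
}
print(most_uncertain(pool, 2))
```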

Page 28

POSBIOTM/NER with Active Learning

Active Learning Scoring Strategy

• Diversity-based Sample Selection

  • To catch the most representative sentences in each sampling.

  • The divergence between two sentences is represented by the minimum similarity among the examples.

• The similarity score of two words:

$$sim(w_1, w_2) = \frac{2 \times Depth(w_1, w_2)}{Depth(w_1) + Depth(w_2)}$$

• The similarity score of two sentences:

$$similarity(S_1, S_2) = \frac{S_1 \cdot S_2}{\sqrt{S_1 \cdot S_1} \times \sqrt{S_2 \cdot S_2}}, \qquad S_1 \cdot S_2 = \sum_i \sum_j sim(w_{1i}, w_{2j})$$

(for syntactic path)
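The two similarity scores can be sketched on a toy hierarchy. The sketch assumes that the two-argument Depth is the depth of the deepest node shared by both words, and it normalizes the sentence score cosine-style so that a sentence has similarity 1 with itself. The hierarchy in `PARENT` is invented and merely stands in for whatever tree defines depth in the real system.

```python
import math

# Hypothetical hierarchy: child -> parent (the root has parent None).
PARENT = {
    "entity": None,
    "protein": "entity",
    "receptor": "protein",
    "kinase": "protein",
    "process": "entity",
    "migration": "process",
}


def ancestors(word):
    """Path from the word itself up to the root."""
    path = []
    while word is not None:
        path.append(word)
        word = PARENT[word]
    return path


def depth(word):
    return len(ancestors(word))  # the root has depth 1


def common_depth(w1, w2):
    """Depth of the deepest node shared by both paths (assumed meaning of Depth(w1, w2))."""
    seen = set(ancestors(w2))
    for node in ancestors(w1):  # walks upward, so the first hit is the deepest
        if node in seen:
            return depth(node)
    return 0


def sim(w1, w2):
    return 2 * common_depth(w1, w2) / (depth(w1) + depth(w2))


def dot(s1, s2):
    """Sum of word similarities over all word pairs of the two sentences."""
    return sum(sim(a, b) for a in s1 for b in s2)


def similarity(s1, s2):
    return dot(s1, s2) / (math.sqrt(dot(s1, s1)) * math.sqrt(dot(s2, s2)))


print(sim("receptor", "kinase"))
print(similarity(["receptor", "migration"], ["kinase", "process"]))
```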

Page 29

POSBIOTM/NER with Active Learning

Active Learning Scoring Strategy

• MMR (Maximal Marginal Relevance) method

  • The uncertainty and diversity measures are combined with the MMR method to give the sampling score in our active learning strategy:

$$score(s_i) \stackrel{def}{=} \lambda \times Uncertainty(s_i, M) - (1 - \lambda) \times \max_{s_j \in T_M} Similarity(s_i, s_j)$$
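The MMR score is simple to compute once an uncertainty value and a list of similarities are available for each candidate. The sketch below takes those numbers as given; it reads the set under the max as the sentences already used to train the current model, which is an interpretation rather than something the slide states, and the default weight of 0.5 is arbitrary.

```python
def mmr_score(uncertainty, similarities_to_training, lam=0.5):
    """lam * uncertainty minus (1 - lam) * highest similarity to any training sentence."""
    redundancy = max(similarities_to_training, default=0.0)
    return lam * uncertainty - (1.0 - lam) * redundancy


def select(candidates, k, lam=0.5):
    """candidates maps an id to (uncertainty, [similarities]); return the k best ids."""
    ranked = sorted(
        candidates,
        key=lambda sid: mmr_score(*candidates[sid], lam=lam),
        reverse=True,
    )
    return ranked[:k]


candidates = {
    "a": (0.9, [0.95]),      # very uncertain, but nearly a duplicate of training data
    "b": (0.8, [0.2, 0.1]),  # uncertain and novel
    "c": (0.3, [0.0]),       # confident and novel
}
print(select(candidates, 2))           # balances uncertainty against redundancy
print(select(candidates, 2, lam=1.0))  # lam = 1 reduces to pure uncertainty ranking
```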

Page 30

POSBIOTM/NER with Active Learning

Experiment and Discussion

• Training Data

  • 2,000 MEDLINE abstracts from the GENIA corpus

  • 5 named entity classes: DNA, RNA, protein, cell line, cell type

• Test Data

  • 404 abstracts

  • Half of them are from the same domain as the training data; the other half are from the super-domain of ‘blood cell’ and ‘transcription factor’.

Page 31

POSBIOTM/NER with Active Learning

Experiment and Discussion

• Pool-based sample selection

  • 100 abstracts were used to train the initial NER module.

  • Each time, we chose k examples (sentences) from the given pool to train the new NER module.

  • The number k varied from 1,000 to 17,000 with a step size of 1,000.

• Active learning methods tested

  • Random selection

  • Entropy-based uncertainty selection

  • Entropy combined with diversity

  • Normalized entropy combined with diversity

Page 32

POSBIOTM/NER with Active Learning

Experiment and Discussion

Page 33

POSBIOTM/NER with Active Learning

Experiment and Discussion

• All three active learning strategies outperform random selection.

  • The combined strategy needs 24.64% fewer training examples than random selection.

  • The normalized combined strategy needs 35.43% fewer training examples than random selection.

• Diversity increases the classifier’s performance when a large number of samples is selected.

  • Up to 4,000 sentences, the entropy strategy and the combined strategy perform similarly.

  • After the 11,000-sentence point, the combined strategy surpasses the entropy strategy.

Page 34

Contents

• Introduction

• POSBIOTM/W Workbench

• POSBIOTM/NER System

• POSBIOTM/NER with Active Machine Learning

• POSBIOTM/Event System

• Current status

Page 35

POSBIOTM/Event System

System Architecture

Page 36

POSBIOTM/Event System

Target Slot Definition

• Template Element

  • Entities: participants of an event

    • protein (P), gene (G), small molecule (SM), cellular process (CP)

  • Interaction: relationship between entities

    • Biological interaction (BI): functional interaction, about how/whether one component affects the other's status biologically

    • Chemical interaction (CI): molecular interaction, about the interaction among entities at the molecular structural level

• Event

  • One Interaction (I)

    • Connects the effecter and the reactant

    • Interaction keywords (BI, CI)

  • One Effecter (E)

    • Provokes an event

    • Template element (P, G, SM, CP) or nested event

  • One Reactant (R)

    • Responds to an effecter

    • Template element (P, G, SM, CP) or nested event

Page 37

POSBIOTM/Event System

Target Slot Definition

• Example

The cross-talk between PDGF and SPP is required for these embryonic cell movements.

• Template Element

  • Entities: PDGF (P), SPP (SM), cell movement (CP)

  • Interaction keywords: cross-talk (BI), require (BI)

• Event

  • cross-talk (I) : PDGF (E) : SPP (R)

  • require (I) : cross-talk (E) : cell movement (R)
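One way to picture the slot definition, including nested events, is as a small recursive data structure. The class and field names below are chosen for this illustration only and are not taken from the POSBIOTM code; the example values are the ones from the sentence about PDGF and SPP.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Entity:
    text: str
    kind: str  # "P", "G", "SM" or "CP"


@dataclass(frozen=True)
class Interaction:
    keyword: str
    kind: str  # "BI" or "CI"


@dataclass(frozen=True)
class Event:
    interaction: Interaction
    effecter: Union[Entity, "Event"]  # a template element or a nested event
    reactant: Union[Entity, "Event"]


def label(slot):
    """Plain text for an entity, bracketed description for a nested event."""
    return slot.text if isinstance(slot, Entity) else "[" + describe(slot) + "]"


def describe(event):
    return (f"{event.interaction.keyword} (I) : "
            f"{label(event.effecter)} (E) : {label(event.reactant)} (R)")


cross_talk = Event(Interaction("cross-talk", "BI"), Entity("PDGF", "P"), Entity("SPP", "SM"))
require = Event(Interaction("require", "BI"), cross_talk, Entity("cell movement", "CP"))

print(describe(cross_talk))
print(describe(require))
```

Here the effecter of the `require` event is itself the `cross_talk` event, which is the nesting that the slot definition allows.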

Page 38

POSBIOTM/Event System

Pre-Processor

• Sentence boundary detection

• Named entity annotation (NER)

  • Protein

  • Small molecule

  • Gene

  • Cellular process

• Compound/Complex Sentence Splitter

  • To simplify the complicated full texts

Page 39

POSBIOTM/Event System

Pre-Processor

• Compound/Complex Sentence Splitter

  • Simple splitting rules

    • [S] NP1 VP1 NP2 [SBAR] that|which VP2 [/SBAR] [/S] ==> NP1 VP1 NP2 + NP2 VP2

  • Example

    • “The best studied of these is EDG-1, which is implicated in cell migration and angiogenesis.”

      ==> 1. “The best studied of these is EDG-1.”

      2. “EDG-1 is implicated in cell migration and angiogenesis.”

Page 40

POSBIOTM/Event System

• Two-level Event Rule Learner

Biological Event Extraction

Page 41

POSBIOTM/Event System

Biological Event Extraction

• Event Rule Learner

  • Adapts a supervised machine learning algorithm: WHISK

    • Learns rules in the form of context-based regular expressions

    • Induces the rules in a top-down manner

    • Ex) “{NP} .*? (<CP>)[E] {/NP} {VP} (<BI>)[I] {/VP} {NP} both (<P>)[R] and .*? {/NP}”

  • Limitations of WHISK

    • The longer the distance between event components, the more difficult it is to extract the correct event.

    • WHISK considers all lexical words between event components.

    • Cannot handle nested biological events

  • We propose a two-level rule learning method to handle the limitations of the flat rule learning method.

Page 42

POSBIOTM/Event System

• Two-level Event Rule Learner

Biological Event Extraction

{NP} <BI>cross-talk</BI> between <P>PDGF</P> and <SM>SPP</SM> {/NP} {VP} is <BI>required</BI> {/VP} for {NP} these embryonic <CP>cell_movements</CP> {/NP}

<TAGS> B {interaction cross-talk} {effecter PDGF} {reactant SPP}

<TAGS> B {interaction require} {effecter cross-talk} {reactant cell movement}

1. Marking long NP boundary

2. Learn the short-span rule corresponding to the NP: “<BI>cross-talk</BI> between <P>PDGF</P> and <SM>SPP</SM>”

“ {NP} (<BI>)[I] between (<P>)[E] and (<SM>)[R] {/NP} “

3. Re-annotate the short-span interaction as one noun with regular expression format

{NP} <E>cross-talk_between_PDGF_and_SPP</E> {/NP} {VP} is <BI>required</BI> {/VP} for {NP} these embryonic <CP>cell_movements</CP> {/NP}

<TAGS> B {interaction require} {effecter cross-talk} {reactant cell movement}

4. Learn the long-span rule with the re-annotated sentence
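The four steps above can be traced with ordinary regular expressions. In this sketch both rules are written by hand to mirror the example; in the real system they would be induced by the rule learner, and the function and variable names here are invented.

```python
import re

SENTENCE = ("{NP} <BI>cross-talk</BI> between <P>PDGF</P> and <SM>SPP</SM> {/NP} "
            "{VP} is <BI>required</BI> {/VP} for "
            "{NP} these embryonic <CP>cell_movements</CP> {/NP}")

# Short-span rule inside one NP: interaction between effecter and reactant.
SHORT_RULE = re.compile(r"<BI>([^<]+)</BI> between <P>([^<]+)</P> and <SM>([^<]+)</SM>")

# Long-span rule over the re-annotated sentence: the merged <E> acts as effecter.
LONG_RULE = re.compile(
    r"\{NP\} <E>([^<]+)</E> \{/NP\} \{VP\} is <BI>([^<]+)</BI> \{/VP\} "
    r"for \{NP\} .*?<CP>([^<]+)</CP> \{/NP\}"
)


def two_level_extract(sentence):
    """Return (events, re-annotated sentence); events are (interaction, effecter, reactant)."""
    events = []
    short = SHORT_RULE.search(sentence)
    if short is None:
        return events, sentence
    events.append(short.groups())

    # Re-annotate the short-span interaction as a single noun.
    merged = re.sub(r"</?\w+>", "", short.group(0)).replace(" ", "_")
    reannotated = sentence[:short.start()] + "<E>" + merged + "</E>" + sentence[short.end():]

    long_match = LONG_RULE.search(reannotated)
    if long_match is not None:
        effecter, interaction, reactant = long_match.groups()
        events.append((interaction, effecter, reactant))
    return events, reannotated


events, reannotated = two_level_extract(SENTENCE)
print(reannotated)
for event in events:
    print(event)
```

Collapsing the inner event into one token shortens the distance the long-span rule has to bridge, which is the limitation of flat rules that the two-level method targets.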

Page 43

POSBIOTM/Event System

Biological Event Extraction

• Event Extractor

  • Extracts events with the automatically generated rules

    • by using regular expression pattern matching

  • Handles aliases and noun conjunctions

    • Aliases and noun conjunctions follow general patterns such as ‘sphingosine-1-phosphate(SPP)’ or ‘FP, IP, and TP receptors’.

    • They are handled with simple rules such as ‘A(B)’ or ‘A, B, C, and D’.

  • Removes sentences that include negative words

    • ‘not’, ‘never’, ‘fail’, etc.
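The alias, conjunction and negation handling can be approximated with a few regular expressions. The patterns below are guesses at what "simple rules" might look like, not the system's actual rules. In particular, distributing the shared head noun ("receptors") over every conjunct is an assumption, and the negation list contains only the three words named on the slide.

```python
import re

ALIAS = re.compile(r"^(.+?)\s*\(([^()]+)\)$")        # pattern 'A(B)'
CONJUNCTION = re.compile(r"^(.*), and (\S+)\s+(.+)$")  # pattern 'A, B, and C noun'


def expand_alias(text):
    """'A(B)' -> (A, B); None when the text is not an alias pattern."""
    match = ALIAS.match(text)
    return (match.group(1), match.group(2)) if match else None


def expand_conjunction(text):
    """'A, B, and C noun' -> ['A noun', 'B noun', 'C noun'] (assumed reading)."""
    match = CONJUNCTION.match(text)
    if match is None:
        return [text]
    heads = match.group(1).split(", ") + [match.group(2)]
    noun = match.group(3)
    return [f"{head} {noun}" for head in heads]


def has_negation(sentence):
    """True when the sentence contains 'not', 'never' or a form of 'fail'."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return any(w in ("not", "never") or w.startswith("fail") for w in words)


print(expand_alias("sphingosine-1-phosphate(SPP)"))
print(expand_conjunction("FP, IP, and TP receptors"))
print(has_negation("EDG-1 failed to stimulate angiogenesis."))
```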

Page 44

POSBIOTM/Event System

Event Component Verifier

Page 45

POSBIOTM/Event System

Event Component Verifier

• To remove the incorrectly extracted events

• Classifies template elements (P, G, SM, CP, BI, CI) into 4 classes

  • I (interaction), E (effecter), R (reactant), N (none)

  • I, E, R: event components

  • N: a template element, but not an event component

• Uses a Maximum Entropy Classifier

  • Features: POS tag, phrase chunks, the type of template element of neighboring words, and semantic information

Page 46

POSBIOTM/Event System

Event Component Verifier

Page 47

Event Component Verifier

POSBIOTM/Event System

• Example

Extracted Biological Events

Ev1: Requires (I) sphingosine_kinase(E) cell_migration (R)

Ev2: Requires (I) EDG-1 (E) cell_migration (R)

Ev3: Requires (I) EDG-1 (E) PDGF (R)

Event Component Verifier Results

I : Requires

E : EDG-1, sphingosine_kinase, PDGF

R : cell_migration

Verified Biological Extracted Events

Ev1: Requires (I) sphingosine_kinase (E) cell_migration (R)

Ev2: Requires (I) EDG-1 (E) cell_migration (R)
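The filtering step in this example amounts to checking each extracted event against the class assigned to each of its components. The sketch below hard-codes the class labels from the slide; in the real system they would come from the maximum entropy classifier, which is not reproduced here, and the function name `verify` is invented.

```python
def verify(events, component_class):
    """Keep events whose slots agree with the verifier's I / E / R labels."""
    kept = []
    for interaction, effecter, reactant in events:
        if (component_class.get(interaction) == "I"
                and component_class.get(effecter) == "E"
                and component_class.get(reactant) == "R"):
            kept.append((interaction, effecter, reactant))
    return kept


extracted = [
    ("Requires", "sphingosine_kinase", "cell_migration"),  # Ev1
    ("Requires", "EDG-1", "cell_migration"),               # Ev2
    ("Requires", "EDG-1", "PDGF"),                         # Ev3
]

# Labels copied from the verifier results on the slide.
component_class = {
    "Requires": "I",
    "EDG-1": "E",
    "sphingosine_kinase": "E",
    "PDGF": "E",
    "cell_migration": "R",
}

for event in verify(extracted, component_class):
    print(event)
```

Ev3 is dropped because PDGF sits in the reactant slot but was classified as an effecter.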

Page 48

POSBIOTM/Event System

Experiment and Discussion

• 500 MEDLINE abstracts including 2,314 biological events, with 10-fold cross validation

  • Flat rule learner vs. two-level rule learner

  • Before verification vs. after verification

• Performance comparison

  • Learning Information Extractors for Proteins and their Interactions (2004), Razvan Bunescu et al.

  • 1,000 abstracts, with 10-fold cross validation

System | Precision (%) | Recall (%) | F-measure
Flat rule learner, before verification | 38.3 | 58.0 | 46.1
Flat rule learner, after verification | 54.7 | 49.2 | 51.8
Two-level rule learner, before verification | 38.2 | 68.0 | 48.9
Two-level rule learner, after verification | 53.1 | 56.1 | 54.6
Comparison system | 39 | 63 | 48.2
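The F-measure column is consistent with the balanced F-measure, the harmonic mean of precision and recall. That the slides use the balanced form is an inference from the numbers, which the short check below reproduces to one decimal place.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) pairs copied from the results table.
results = {
    "flat, before verification": (38.3, 58.0),
    "flat, after verification": (54.7, 49.2),
    "two-level, before verification": (38.2, 68.0),
    "two-level, after verification": (53.1, 56.1),
    "comparison system": (39.0, 63.0),
}

for name, (p, r) in results.items():
    print(f"{name}: {f_measure(p, r):.1f}")
```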

Page 49

POSBIOTM/Event System

Experiment and Discussion

• Trade-off between precision and recall

  • Before verification: big gap between precision and recall

  • After verification: small gap between precision and recall

• Threshold: rules are cut according to a measure of how many of the events extracted by a rule are correct.

Page 50

POSBIOTM/Event System

Experiment and Discussion

• Consistently good performance regardless of the rule learner's threshold

Page 51

Other Corpora for Bio-Relation Extraction

• BC-PPI

  • From the BioCreative corpus for NER

  • Protein/gene interactions

  • 255 interactions in 1,000 sentences

• IEPA

  • Protein/protein interactions

  • 410 interactions in 498 sentences

• LLL05

  • Protein/gene interactions

  • 271 interactions in 80 sentences

• BioText

  • Disease/treatment relations

Page 52

Contents

• Introduction

• POSBIOTM/W Workbench

• POSBIOTM/NER System

• POSBIOTM/NER with Active Machine Learning

• POSBIOTM/Event System

• Current status

Page 53

Current Status & future works

• Re-implemented in Java (platform independent)

• To be integrated with J-Designer in the SBW consortium

• Integrated with the active learning method to automatically suggest corpus samples for human annotation

• Used for national large-scale BIT fusion projects: search for useful peptides (usable as ligands for drugs)

• Getting more feedback from biologists

• The system gets smarter with more usage: workbench + active learning

Workbench Demo