Biomedical Named Entity Recognition from Text using Genetic Algorithm Based Classifier Subset Selection
Nazife Dimililer
Supervisor: Asst. Prof. Dr. Ekrem Varoğlu


TRANSCRIPT

Page 1: Nazife Dimililer, Supervisor: Asst. Prof. Dr. Ekrem Varoğlu

Biomedical Named Entity Recognition from Text using Genetic Algorithm Based Classifier Subset Selection

Nazife Dimililer
Supervisor: Asst. Prof. Dr. Ekrem Varoğlu

Page 2:

Outline
- Motivation
- Background
  - Overview of IE tasks
  - Definition of NER
  - Corpus Used
- Objective of Thesis
- Related Work
- Proposed System
  - Corpus
  - Individual Classifiers
  - Multi-Classifier System
- Future Work

Page 3:

Motivation of the Thesis
- Vast amount of literature available online
- Need for:
  - Intelligent information retrieval
  - Automatically populating databases
  - Document understanding/summarization
- NER is the first step of all IE tasks
- Annotated corpora: GENIA, BioCreative, FlyBase
- Room for improvement
- Applicability to other domains

Page 4:

What is Named Entity Recognition?
- Named entity recognition (NER) is a subtask of information extraction.
- It identifies and labels strings of a text as belonging to predefined classes (named entities).
- Example NEs: persons, organizations, expressions of time, drugs, proteins, cell types.
- NER poses a significant challenge in the biomedical domain.

Page 5:

Overview of IE Tasks in Biomedical Domain
[Pipeline diagram: Articles → Article Preprocessing → Biomedical NER → Bio-Entity & Interaction Normalization → Bio-Entity Interaction Extraction, supported by an ontology and a terminology DB, and feeding applications such as Question Answering, Article Selection, and Text Summarization.]

Page 6:

Sources of Problems in Biomedical NER
- Irregularities and mistakes in:
  - Tokenization
  - Tagging
- (Irregular) use of special symbols
- Lack of standard naming conventions
- Changing names and notations
- Continuous introduction of new names
- Abbreviations, synonyms, variations
- Homonyms or ambiguous names
- Cascaded named entities
- Complicated constructions:
  - Comma-separated lists
  - Disjunctions and conjunctions
- Inclusion of adjectives as part of some NEs

Page 7:

State of Current Research for Biomedical NER
A large number of systems have been proposed for biomedical NER:
- Systems based on individual classifiers
- Multiple classifier systems with a small number of members
- Use of external sources
- Hand-crafted post-processing
- Corpora with differing NEs
- Different evaluation schemes

Page 8:

State of Current Research in Biomedical NER
- A very important milestone in this area was the Bio-Entity Recognition Task at JNLPBA in 2004.
- The same systems as in the newswire domain were used with slight changes; rich feature sets were exploited.
- Successful classifiers relied on external resources and post-processing.
- Similar systems were used in the BioCreative tasks in 2004, 2006, and 2009, and in other publications.

Page 9:

Objective of the Thesis
Improve biomedical NER performance:
- Use a benchmark corpus
- Apply classifier selection techniques to biomedical NER
- Train a reliable and diverse set of individual classifiers
- Utilize a large set of individual classifiers
- Use a Genetic Algorithm to form an ensemble performing vote-based classifier subset selection

Page 10:

Corpus Used
- JNLPBA data: based on the GENIA Corpus v. 3.02
- Contains 5 entities: Protein, RNA, DNA, Cell Line, Cell Type
- IOB2 tagged, giving 11 classes: B-protein, I-protein, B-RNA, I-RNA, B-DNA, I-DNA, B-cell_line, I-cell_line, B-cell_type, I-cell_type, and Outside (O)

Page 11:

Format of JNLPBA Data
Our/O data/O suggest/O that/O lipoxygenase/B-protein metabolites/I-protein activate/O ROI/O formation/O which/O then/O induce/O IL-2/B-protein expression/O via/O NF-kappa/B-protein B/I-protein activation/O ./O

The/O peri-kappa/B-DNA B/I-DNA site/I-DNA mediates/O human/B-DNA immunodeficiency/I-DNA virus/I-DNA type/I-DNA 2/I-DNA enhancer/I-DNA activation/O … Human/O immunodeficiency/O virus/O type/O 2/O
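The IOB2 scheme above can be decoded mechanically; the following is a minimal sketch (not the thesis code) that groups token/tag pairs into entity spans, using the JNLPBA tag names shown in the example:

```python
# Minimal sketch: grouping IOB2 (token, tag) pairs into entity spans.
# Tag names follow the JNLPBA convention shown above.

def iob2_to_entities(pairs):
    """Collect (entity_type, entity_text) spans from (token, tag) pairs."""
    entities, current = [], None
    for token, tag in pairs:
        if tag.startswith("B-"):          # a new entity begins
            current = (tag[2:], [token])
            entities.append(current)
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)      # continue the open entity
        else:                             # "O" or an inconsistent I- tag
            current = None
    return [(etype, " ".join(toks)) for etype, toks in entities]

pairs = [("lipoxygenase", "B-protein"), ("metabolites", "I-protein"),
         ("activate", "O"), ("IL-2", "B-protein"), ("expression", "O")]
print(iob2_to_entities(pairs))
# → [('protein', 'lipoxygenase metabolites'), ('protein', 'IL-2')]
```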

Page 12:

Data Set Statistics

              # of abstracts   # of sentences   # of words
Training Set  2,000            20,546           472,006
Test Set      404              4,260            96,780

                         Protein  DNA     RNA    Cell Type  Cell Line  All Entities
Training  # of Entities  30,269   9,533   951    6,718      3,830      51,301
Data      # of Tokens    55,117   25,307  2,481  15,466     11,217     109,588
Test      # of Entities  5,067    1,056   118    1,921      500        8,662
Data      # of Tokens    9,841    2,845   305    4,912      1,489      19,392

- Training data: MeSH terms "human", "blood cells" and "transcription factors"
- Test data: super domain of "blood cells" and "transcription factors"

Page 13:

Individual Classifier Architecture
Why use SVM?
- Successfully used in many NLP and bioinformatics tasks:
  - CoNLL 2000 and CoNLL 2004
  - BioCreAtIvE competition 2004
- Ability to handle large feature sets
- IOB2 notation is used to represent entities
- Multi-class classification problem
- Features are extracted from the training data only

Page 14:

Individual Classifier System Used
YamCha: a generic, customizable, open-source text chunker that uses Support Vector Machines.
Tunable parameters:
- Parsing direction: left-to-right / right-to-left
- Range of the context window
- Degree of the polynomial kernel

Page 15:

Context Window

The default setting is "F:-2..2:0.. T:-2..-1".

Page 16:

Training Individual Classifiers
All individual classifiers are trained using the one-vs-all approach, varying:
- Backward or forward parse direction
- Different context windows
- Different degrees of the polynomial kernel
- Different features and feature combinations

Page 17:

Individual Classifiers
All classifiers are based on SVM. Feature types:
- Lexical features
- Morphological features
- Orthographic features
- Surface word feature
Tokens and the predicted tags are also used as features.

Page 18:

Features Used
- Tokens: words in the training data; the token to be classified and the preceding and following tokens, as specified by the context window.
- Previously predicted tags: predicted tags of the preceding tokens, as specified by the context window.

Page 19:

Features Used (Cont.)
Lexical features represent grammatical functions of tokens.
- Part of Speech: tags from the Penn Treebank Project, added using the Geniatagger. Ex: Adverb, Determiner, Adjective
- Phrase Tag: phrasal categories, added using an SVM trained on newswire data. Ex: Noun Phrase, Verb Phrase, Adjective Phrase
- Base Noun Phrase Tag: basic noun phrases are tagged using the fnTBL tagger.

Page 20:

Features Used: Morphological
Different n-grams of the current token. An n-gram of a token is simply formed by using the last or first n characters of the token.
- Last 1/2/3/4 letters
- First 1/2/3/4 letters

Example: TRANSCRIPTION

n       1   2    3     4
Suffix  n   on   ion   tion
Prefix  t   tr   tra   tran
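The affix extraction above is straightforward to express in code. A small illustrative sketch (feature names are my own, not from the thesis):

```python
# Sketch of the prefix/suffix n-gram extraction described above.
# Feature names ("prefix1", "suffix4", ...) are illustrative.

def affix_features(token, max_n=4):
    """Return the first-n and last-n character n-grams of a token."""
    feats = {}
    for n in range(1, max_n + 1):
        feats[f"prefix{n}"] = token[:n].lower()
        feats[f"suffix{n}"] = token[-n:].lower()
    return feats

print(affix_features("TRANSCRIPTION"))
# suffix4 → "tion", prefix4 → "tran", matching the table above
```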

Page 21:

Features Used: Orthographic
Also known as word formation patterns: information about the form of the word (e.g., contains uppercase letters, digits, etc.).
Two different approaches are used:
- Simple: the existence of a particular word formation pattern is represented by a binary yes/no feature.
- Intricate: multiple word formation patterns are represented using a list based on representation score.

Page 22:

Features Used: Orthographic (Cont.)
Orthographic feature, intricate approach: a list of word formation patterns is formed in decreasing order of representation score. The representation score of an orthographic property i for the entity labeled j (RS_{i,j}) is calculated as:

RS_{i,j} = (number of tokens with orthographic property i) / (number of tokens constituting entity j)

Orthographic features that have a representation score of more than 10% for Outside-tagged tokens are eliminated from the list.
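The representation-score definition above can be sketched directly; the property predicates below are simplified stand-ins, not the thesis feature set:

```python
# Hedged sketch of the representation-score computation defined above:
# RS[(i, j)] = (# tokens of entity j with orthographic property i)
#              / (# tokens constituting entity j).
from collections import defaultdict

def representation_scores(tokens_with_entities, properties):
    counts = defaultdict(int)   # (property, entity) -> matching tokens
    totals = defaultdict(int)   # entity -> total tokens
    for token, entity in tokens_with_entities:
        totals[entity] += 1
        for name, pred in properties.items():
            if pred(token):
                counts[(name, entity)] += 1
    return {k: counts[k] / totals[k[1]] for k in counts}

# Illustrative properties and toy data (not from the corpus):
props = {"has_digit": lambda t: any(c.isdigit() for c in t),
         "all_upper": str.isupper}
data = [("IL-2", "protein"), ("kinase", "protein"), ("DNA", "DNA")]
print(representation_scores(data, props))
```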

Page 23:

Features Used: Orthographic (Cont.)
Orthographic features used:

Word Formation Pattern   Example
UpperCase                IL-2
InitCap                  D3
TwoUpper                 FasL
Alpha_and_Other          AML1/ETO
Hyphen                   product-albumin
Upper_or_Digit           3H
Digits                   40
Alpha_and_Digit          IL-1beta
Upper_and_Other          2-M
Lower_and_Upper          25-Dihydroxyvitamin
Upper_and_Digits         AP-1
Lower_and_Other          dehydratase/dimerization
Allupper                 DNA, GR, T
Greek                    NF-Kappa, beta
Lower_and_Digits         gp39
Start_with_Hyphen        -mediated

Page 24:

Features Used: Orthographic (Cont.)
Intricate use of the orthographic feature, priority based: each token is tagged with the first applicable word formation pattern on the list. Example:

Ca2+                  Initial letter capitalized
D3                    Initial letter capitalized
GR                    All letters uppercase
-acetate              Starts with -
25-Dihydroxyvitamin   Contains upper and other

Page 25:

Features Used: Orthographic (Cont.)
Intricate use of the orthographic feature, binary string: a binary string containing one bit to represent each word formation pattern in the list. Example:

Ca2+                  0010111100111110
D3                    0011001100000110
GR                    1111000100000000
-acetate              0000000011110000
25-Dihydroxyvitamin   0000111101111110

[Bit legend from the slide: initial letter capitalized; combination of upper-case letter and other symbol; combination of upper- and lower-case letters; combination of upper-case letter and number; contains upper-case letter; combination of lower-case letter and other symbols; combination of alphabetic characters and other symbols; combination of lower-case letter and number; combination of alphabetic characters and numbers; contains number.]
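The binary-string encoding above can be sketched with one predicate per pattern. The pattern list below is a small illustrative subset of the slide's list, so the strings are shorter than the examples:

```python
# Illustrative sketch of the binary-string orthographic feature:
# one bit per word-formation pattern (subset of the slide's list).
import re

PATTERNS = [
    ("init_cap",        lambda t: t[:1].isupper()),
    ("all_upper",       lambda t: t.isupper()),
    ("has_digit",       lambda t: any(c.isdigit() for c in t)),
    ("has_hyphen",      lambda t: "-" in t),
    ("starts_hyphen",   lambda t: t.startswith("-")),
    ("alpha_and_digit", lambda t: bool(re.search(r"[A-Za-z]", t))
                                  and any(c.isdigit() for c in t)),
]

def binary_orthographic(token):
    return "".join("1" if pred(token) else "0" for _, pred in PATTERNS)

for tok in ["Ca2+", "GR", "-acetate"]:
    print(tok, binary_orthographic(tok))
```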

Page 26:

Features Used (Cont.)
Surface words:
- A separate pseudo-dictionary for each entity, containing the tokens with the highest counts in the training data, such that x% of all tokens in that entity's names are in the dictionary.
- Pseudo-dictionaries with 50%, 60%, 70%, and 80% coverage.
- Each token is tagged with a 5-bit string, where each bit corresponds to the pseudo-dictionary of one entity.
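The pseudo-dictionary construction described above can be sketched as a greedy frequency cut-off. This is my own minimal reading of the description, with toy data:

```python
# Sketch of the pseudo-dictionary construction: for each entity, keep the
# most frequent tokens until the requested coverage of all tokens in that
# entity's names is reached. (Illustrative, not the thesis code.)
from collections import Counter

def pseudo_dictionary(entity_tokens, coverage=0.7):
    counts = Counter(entity_tokens)
    target = coverage * len(entity_tokens)
    covered, dictionary = 0, set()
    for token, count in counts.most_common():
        if covered >= target:
            break
        dictionary.add(token)
        covered += count
    return dictionary

# Toy token stream from one entity's names:
tokens = ["IL-2", "receptor", "IL-2", "kinase", "IL-2", "receptor",
          "alpha", "beta", "IL-2", "receptor"]
print(pseudo_dictionary(tokens, coverage=0.7))
# the two most frequent tokens already cover 7 of 10 tokens
```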

Page 27:

Effect of Feature Extraction
Each feature type improves performance from a different perspective:
- Precision
- Recall
- Boundaries
- Entity-based performance
Careful combination of features improves the overall performance.

Page 28:

Effect of Parse Direction and Lexical Features
Effect of backward parsing:
- Precision and recall increased for both boundaries
- Precision scores improved more than recall scores
- An overall increase in full recall, precision, and F-score
Effect of lexical features:
- Single lexical features: higher precision than recall
- Combinations: recall and precision values are more balanced
- Combinations slightly improve both the left-boundary and right-boundary F-scores

Page 29:

Effect of Morphological Features
- F-score improves compared to the baseline system
- Suffixes alone result in higher recall than precision
- Prefixes alone result in higher precision than recall
- Their combination improves the overall performance
- The morphological feature improves recall but degrades precision compared to the baseline

Page 30:

Effect of Orthographic Features
- Performance is improved by all orthographic features
- Best performance is achieved by the binary string
- For simple orthographic features, precision scores are slightly higher than recall scores
- Intricate orthographic features provide higher recall values, resulting in an overall improvement in F-scores

Page 31:

Effect of Surface Word Feature
- Precision scores improved more than recall scores compared to the baseline classifier
- Improvement on the right boundary is more pronounced
- Precision score is greater than the recall score: pseudo-dictionaries can be used to generate classifiers with higher precision values than recall values

Page 32:

Effect of Feature Combinations
- Some specific combinations do not yield a significant improvement in performance
- Careful combination of features is useful for improving overall performance
- Different combinations of feature/parameter sets favor different entities

Page 33:

Motivation for Multiple Classifier Systems
For individual classifiers:
- A set of carefully engineered features improves performance
- Unfortunately, performance is still NOT satisfactory
Combining multiple classifiers into ensembles:
- The combined opinion of a number of experts is more likely to be correct than that of a single expert

Page 34:

Classifier Pool
- Classifiers exploiting state-of-the-art feature sets => highest F-scores
- Classifiers with high precision or high recall: high precision but low recall, and vice versa
- One or more classifiers providing the highest F-score for each entity

Page 35:

Classifier Fusion Architecture
[Architecture diagram. Training phase: training data passes through feature extraction (feature sets 1..M, plus a dictionary and context words) into an SVM classifier set (SVM 1..M); GA-based classifier selection then produces the best fitting ensemble. Testing phase: test data passes through the same feature extraction and SVM classifier set; classifier fusion and post-processing produce the predicted class.]

Page 36:

Fusion Algorithm
Weighted majority voting:
- The full-object F-score of each classifier on cross-validation data is used as its weight
- The class that receives the highest weighted combination of votes wins the competition
- Ties are broken by a random coin toss
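The weighted majority voting just described can be sketched in a few lines; the F-score weights below are illustrative placeholders:

```python
# Minimal sketch of weighted majority voting: each classifier's vote is
# weighted by its full-object F-score; ties are broken randomly.
import random

def weighted_vote(predictions, weights, rng=random):
    scores = {}
    for pred, w in zip(predictions, weights):
        scores[pred] = scores.get(pred, 0.0) + w
    best = max(scores.values())
    winners = [c for c, s in scores.items() if s == best]
    return rng.choice(winners)   # random coin toss on ties

preds = ["B-protein", "B-protein", "O"]
weights = [0.64, 0.66, 0.71]     # per-classifier F-scores (illustrative)
print(weighted_vote(preds, weights))
# two weaker classifiers together outvote the single stronger one
```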

Page 37:

Weighted Majority Voting
Weight: full-object F-score

Page 38:

Genetic Algorithm Set-Up
- Initial population: randomly generated bit strings
- Population size: 100
- Mutation rate: 2%
- Crossover rate: 70%
- Crossover operators: two-point crossover, uniform crossover
- Tournament size: 40
- Elitist population: 20%

Page 39:

Flow Chart of the Genetic Algorithm
[Flow chart: Start → initialize population randomly → compute fitness of each chromosome → while not terminated: select parents and apply crossover → mutate offspring → compute fitness of each chromosome → apply elitist policy to form the new population → loop; on termination, select the best chromosome as the resultant ensemble → End.]
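The flow above is a standard generational GA with elitism. The sketch below follows the slides' rates (2% mutation, 70% two-point crossover, elitism); everything else, including the toy one-max fitness it optimizes instead of an ensemble F-score, is an illustrative stand-in:

```python
# Illustrative GA loop matching the flow chart above (toy fitness,
# not the thesis's ensemble evaluation).
import random

def genetic_algorithm(fitness, n_bits, pop_size=100, generations=50,
                      mutation=0.02, crossover=0.70, tournament=40,
                      elite_frac=0.20, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]

    def select():  # tournament selection
        return max(rng.sample(pop, tournament), key=fitness)

    for _ in range(generations):
        elite_n = int(elite_frac * pop_size)
        new_pop = sorted(pop, key=fitness, reverse=True)[:elite_n]  # elitism
        while len(new_pop) < pop_size:
            a, b = select(), select()
            if rng.random() < crossover:          # two-point crossover
                i, j = sorted(rng.sample(range(n_bits), 2))
                a = a[:i] + b[i:j] + a[j:]
            child = [bit ^ (rng.random() < mutation) for bit in a]  # mutate
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)

# Toy fitness: count of set bits ("one-max").
best = genetic_algorithm(sum, n_bits=20, pop_size=30, generations=30,
                         tournament=5)
print(sum(best))
```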

Page 40:

Genetic Algorithm Set-Up
- Chromosome: list of classifiers to be combined
- 3-fold cross-validation results are used for the individual classifiers
- Fitness of a chromosome: full-object F-score of the classifier ensemble
- Static classifier selection: each bit represents a classifier
- Proposed vote-based classifier selection: each bit represents the reliability of a classifier for predicting a class

Page 41:

Chromosome Structure of Static Classifier Selection
Example chromosome: 0 1 0 1 1 0 0 1 (one gene per classifier, Classifier 1 … Classifier M).
If a gene is 1, the corresponding classifier participates in the decision for all classes; otherwise it remains silent.
For M classifiers, the chromosome has M bits.

Page 42:

Chromosome Structure for the Proposed Vote-Based Classifier Selection
Example chromosome: 0 1 0 1 | 1 1 0 0 | … (for each classifier, one gene per class, Class 1 … Class N).
For each classifier, one gene is reserved to represent its participation in the decision for each class.
For M classifiers and N classes, the chromosome has N×M bits.
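Decoding such an N×M chromosome amounts to slicing it per classifier; a small sketch (function name is my own) using the slide's example bit string:

```python
# Sketch of decoding the proposed vote-based chromosome: bit (m, n) says
# whether classifier m may vote for class n. (Illustrative names.)

def allowed_votes(chromosome, n_classes):
    """Return, per classifier, the set of class indices it may vote for."""
    assert len(chromosome) % n_classes == 0
    return [
        {n for n in range(n_classes) if chromosome[m * n_classes + n] == 1}
        for m in range(len(chromosome) // n_classes)
    ]

# 2 classifiers x 4 classes, matching the slide's example 0101 1100:
print(allowed_votes([0, 1, 0, 1, 1, 1, 0, 0], n_classes=4))
# → [{1, 3}, {0, 1}]
```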

Page 43:

Motivation for Vote-Based Classifier Subset Selection
- A classifier cannot predict all classes with the same performance
- A subset of its predictions may be unreliable
- A subset of its predictions may be correlated with the predictions of other classifiers
- Allow a classifier to vote only for the classes it trusts

Page 44:

Multiple Classifier Systems Used
- Single Best (SB): not an MCS; included as a reference
- Full Ensemble (FE): ensemble containing all classifiers
- Forward Selection (FS): ensemble formed using forward selection
- Backward Selection (BS): ensemble formed using backward selection
- GA-generated Static Ensemble (GAS): ensemble formed using the GA
- Vote-Based Classifier Subset Selection using the GA (VBS): vote-based ensemble formed using the GA

Page 45:

Performance of Ensembles

Method              Precision (%)  Recall (%)  F-score (%)
Single Best         69.40          70.60       69.99
Full Ensemble       72.10          70.42       71.25
Forward Selection   71.64          71.58       71.61
Backward Selection  72.28          71.00       71.63
GA Static Ensemble  71.76          71.65       71.71
Proposed Method     71.45          73.60       72.51

Page 46:

Discussion on Ensembles
- All ensembles outperform SB
- VBS has the highest F-score; GA-based ensembles are better
- BS chose 38 classifiers; FE and BS are similar: precision >> recall
- FS and GAS chose 9 classifiers: precision and recall are more balanced
- VBS is different: it uses 46 classifiers partially; recall > precision

Page 47:

Discussion on Ensembles
- BS eliminates mainly classifiers using only two features; all eliminated classifiers are backward parsed
- FS and GAS are almost the same: 8 classifiers are identical, and the 9th classifier is forward parsed for GAS
- Even though the 9th classifier has a lower F-score, the GAS ensemble achieves a higher F-score: backward and forward parsed classifiers are more balanced

Page 48:

Entity-Based F-scores for the Ensembles

Method               DNA    RNA    Cell Line  Cell Type  Protein
Single Best          68.75  66.95  51.72      69.87      72.14
Full Ensemble        69.77  68.91  57.43      71.41      72.86
Forward Selection    70.45  67.22  56.02      72.12      73.27
Backward Selection   70.61  68.91  57.06      71.92      73.16
GA Static Ensemble   70.51  69.49  56.32      71.77      73.43
Proposed Method VBS  72.06  69.11  58.80      72.79      73.86

Page 49:

Discussion of Entity-Based Scores
- VBS achieves the best scores for all entities except RNA; the GAS ensemble outperforms VBS for RNA
- VBS: highest F-score for protein (largest data set), lowest F-score for cell line (smallest data set)

Page 50:

Distribution of Vote Counts for the VBS

Number of votes        0  1  2  3  4  5  6  7  8  9  10  11
Number of classifiers  0  3  5  3  7  8  8  6  0  2  4   0

- None of the classifiers is eliminated from the ensemble (no classifier has 0 votes)
- None of the classifiers votes for all eleven classes

Page 51:

Discussion of Vote Counts
- Each classifier contributes to the decision of at least one class
- Some classifiers contribute to almost all classes
- 7 of the 9 classifiers selected by GAS vote for more than 5 classes in VBS
- Classifiers that have only 1 vote in VBS are excluded by GAS

Page 52:

Post-Processing Approach Used
- Inconsistencies in tagging, mainly induced by ensembling, are fixed
- Boundaries are extended using separate context word lists for each entity
- A dictionary formed from the training data retags mistagged or untagged entities

Example 1, inconsistent tag correction:
Before: from/O B/B-cell_line or/I-cell_line HeLa/I-cell_line cells/I-cell_type stimulate/O transcription/I-protein of/O the/O
After:  from/O B/B-cell_line or/I-cell_line HeLa/I-cell_line cells/I-cell_line stimulate/O transcription/O of/O the/O

Example 2, right-boundary extension ("region" is in the right context of DNA):
Before: Different/O fragments/O of/O unrearranged/B-DNA human/I-DNA variable/I-DNA region/O of/O
After:  Different/O fragments/O of/O unrearranged/B-DNA human/I-DNA variable/I-DNA region/I-DNA of/O
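The right-boundary extension rule just illustrated can be sketched as follows; the context word list is an illustrative stand-in for the per-entity lists built from the training data:

```python
# Sketch of right-boundary extension: if the token after an entity appears
# in that entity's right-context word list, pull it into the entity.
# Context lists here are illustrative.

RIGHT_CONTEXT = {"DNA": {"region", "site", "element"}}

def extend_right_boundary(tokens, tags):
    tags = list(tags)
    for i in range(1, len(tokens)):
        prev = tags[i - 1]
        if tags[i] == "O" and prev != "O":
            etype = prev.split("-", 1)[1]
            if tokens[i] in RIGHT_CONTEXT.get(etype, ()):
                tags[i] = "I-" + etype    # extend the entity rightward
    return tags

tokens = ["unrearranged", "human", "variable", "region", "of"]
tags = ["B-DNA", "I-DNA", "I-DNA", "O", "O"]
print(extend_right_boundary(tokens, tags))
# → ['B-DNA', 'I-DNA', 'I-DNA', 'I-DNA', 'O']
```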

Page 53:

Effect of Post-Processing on VBS

                                   Improvement  F-score (%)
VBS                                —            72.51
VBS + inconsistent tag correction  +0.08        72.59
VBS + right-boundary correction    +0.14        72.65
VBS + left-boundary correction     +0.01        72.52
VBS + dictionary lookup            +0.08        72.59
VBS + all post-processing rules    +0.23        72.74

Page 54:

Effect of Post-Processing on Individual Entities

                        DNA    RNA    Cell Line  Cell Type  Protein
Before post-processing  72.06  69.11  58.80      72.79      73.86
After post-processing   72.19  71.55  59.96      73.36      73.87
Improvement             0.13   2.44   1.16       0.57       0.01

Page 55:

Discussion of Results
Post-processing provides more success for entities having:
- Lower F-scores
- Lower representation
- Longer names

Page 56:

Future Work
- The post-processing stage may be improved: external resources; more efficient normalization and lookup approaches; discovering post-processing rules through AI techniques
- Different classifier architectures may be used to increase diversity: CRF, HMM

Page 57:

Future Work
- The GA may employ a number of different evaluation metrics
- Different ensembling strategies may be employed: stacked generalization, bagging/boosting

Page 58:

Questions?

Page 59:

Appendix

Page 60:

Baseline System
Features used:
- Token to be classified
- Context window of -2..2: determines the preceding and following tokens and the preceding predictions used as features
- 2nd-degree polynomial kernel
- Forward parse direction
- One-vs-all method

Object Identification Scores of the Baseline System

        Full                         Left                         Right
Recall  Precision  F-score  Recall  Precision  F-score  Recall  Precision  F-score
0.6275  0.6574     0.6421   0.6669  0.6987     0.6825   0.7087  0.7425     0.7252

Page 61:

Discussion of Experiments
- All classifiers presented are trained using the one-vs-all approach, varying: backward/forward parse direction, context windows, and features and feature combinations
- Presented results are averages of the classifiers in each category

Page 62:

Training Individual Classifiers
All individual classifiers are trained using the one-vs-all approach, varying:
- Backward/forward parse direction
- Different context windows
- Different polynomial kernels
- Different features and feature combinations

Page 63:

Effect of Parse Direction

Parse      Full                        Left                        Right
Direction  Recall Precision F-score   Recall Precision F-score    Recall Precision F-score
Forward    65.21  65.30     65.23     69.18  69.28     69.20      72.92  73.02     72.94
Backward   66.85  68.13     67.45     70.51  71.87     71.15      74.93  76.38     75.61

- Precision and recall increased for both boundaries
- Precision scores improved more than recall scores
- An overall increase in full recall, precision, and F-score

Page 64:

Features Used
- Tokens: words in the training data; the token to be classified and the preceding and following tokens, as specified by the context window.
- Previously predicted tags: predicted tags of the preceding tokens, as specified by the context window.

Page 65:

Features Used (Cont.)
Lexical features represent grammatical functions of tokens.
- Part of Speech: tags from the Penn Treebank Project, added using the Geniatagger. Ex: Adverb, Determiner, Adjective
- Phrase Tag: phrasal categories, added using an SVM trained on newswire data. Ex: Noun Phrase, Verb Phrase, Adjective Phrase
- Base Noun Phrase Tag: basic noun phrases are tagged using the fnTBL tagger.

Page 66:

Effect of Lexical Features
- Single lexical features: higher precision than recall
- Combinations: recall and precision values are more balanced
- Combinations slightly improve both the left-boundary and right-boundary F-scores

Lexical   Full                        Left                        Right
Features  Recall Precision F-score   Recall Precision F-score    Recall Precision F-score
Single    63.46  67.51     65.41     67.24  71.54     69.31      71.00  75.53     73.18
Combined  65.53  66.63     66.07     69.46  70.62     70.03      73.08  74.30     73.68

Page 67:

Features Used: Morphological
Different n-grams of the current token. An n-gram of a token is simply formed by using the last or first n characters of the token.
- Last 1/2/3/4 letters
- First 1/2/3/4 letters

Example: TRANSCRIPTION

n       1   2    3     4
Suffix  n   on   ion   tion
Prefix  t   tr   tra   tran

Page 68:

Effect of Morphological Features

Morph.    Full                        Left                        Right
Features  Recall Precision F-score   Recall Precision F-score    Recall Precision F-score
Baseline  62.75  65.74     64.21     66.69  69.87     68.25      70.87  74.25     72.52
Prefix    64.48  65.13     64.79     68.43  69.12     68.76      72.51  73.23     72.86
Suffix    65.47  64.84     65.14     69.36  68.68     69.01      73.63  72.93     73.27
Combined  65.75  65.04     65.39     69.68  68.93     69.30      73.89  73.10     73.48

Page 69:

Effect of Morphological Features
- F-score improves compared to the baseline system
- Suffixes alone result in higher recall than precision
- Prefixes alone result in higher precision than recall
- Their combination improves the overall performance
- The morphological feature improves recall but degrades precision compared to the baseline

Page 70:

Features Used: Orthographic
Also known as word formation patterns: information about the form of the word (e.g., contains uppercase letters, digits, etc.).
Two different approaches are used:
- Simple: the existence of a particular word formation pattern is represented by a binary yes/no feature.
- Intricate: multiple word formation patterns are represented using a list based on representation score.

Page 71:

Features Used: Orthographic (Cont.)
Orthographic feature, intricate approach: a list of word formation patterns is formed in decreasing order of representation score. The representation score of an orthographic property i for the entity labeled j (RS_{i,j}) is calculated as:

RS_{i,j} = (number of tokens with orthographic property i) / (number of tokens constituting entity j)

Orthographic features that have a representation score of more than 10% for Outside-tagged tokens are eliminated from the list.

Page 72:

Features Used: Orthographic (Cont.)
Orthographic features used:

Word Formation Pattern   Example
UpperCase                IL-2
InitCap                  D3
TwoUpper                 FasL
Alpha_and_Other          AML1/ETO
Hyphen                   product-albumin
Upper_or_Digit           3H
Digits                   40
Alpha_and_Digit          IL-1beta
Upper_and_Other          2-M
Lower_and_Upper          25-Dihydroxyvitamin
Upper_and_Digits         AP-1
Lower_and_Other          dehydratase/dimerization
Allupper                 DNA, GR, T
Greek                    NF-Kappa, beta
Lower_and_Digits         gp39
Start_with_Hyphen        -mediated

Page 73:

Features Used: Orthographic (Cont.)
Intricate use of the orthographic feature, priority based: each token is tagged with the first applicable word formation pattern on the list. Example:

Ca2+                  Initial letter capitalized
D3                    Initial letter capitalized
GR                    All letters uppercase
-acetate              Starts with -
25-Dihydroxyvitamin   Contains upper and other

Page 74:

Features Used: Orthographic (Cont.)
Intricate use of the orthographic feature, binary string: a binary string containing one bit to represent each word formation pattern in the list. Example:

Ca2+                  0010111100111110
D3                    0011001100000110
GR                    1111000100000000
-acetate              0000000011110000
25-Dihydroxyvitamin   0000111101111110

[Bit legend from the slide: initial letter capitalized; combination of upper-case letter and other symbol; combination of upper- and lower-case letters; combination of upper-case letter and number; contains upper-case letter; combination of lower-case letter and other symbols; combination of alphabetic characters and other symbols; combination of lower-case letter and number; combination of alphabetic characters and numbers; contains number.]

Page 75:

Effect of Orthographic Features

Orthog.   Full                        Left                        Right
Features  Recall Precision F-score   Recall Precision F-score    Recall Precision F-score
Baseline  62.75  65.74     64.21     66.69  69.87     68.25      70.87  74.25     72.52
Simple    65.69  65.93     65.79     69.53  69.79     69.64      73.97  74.25     74.09
Priority  68.02  65.62     66.80     71.91  69.36     70.61      76.06  73.37     74.68
Binary    68.48  65.46     66.94     72.26  69.07     70.63      76.58  73.20     74.85

- Performance is improved by all orthographic features
- Best performance is achieved by the binary string

Page 76:

Effect of Orthographic Features (Cont.)
- For simple orthographic features, precision scores are slightly higher than recall scores
- The simple orthographic feature degrades precision on the left boundary only
- Intricate approaches degrade precision on both the left and right boundaries, as well as on full-object recognition
- Intricate orthographic features provide higher recall values, resulting in an overall improvement in F-scores
- Intricate orthographic features result in an imbalance between precision and recall

Page 77:

Features Used (Cont.)
Surface words:
- A separate pseudo-dictionary for each entity, containing the tokens with the highest counts in the training data, such that x% of all tokens in that entity's names are in the dictionary.
- Pseudo-dictionaries with 50%, 60%, 70%, and 80% coverage.
- Each token is tagged with a 5-bit string, where each bit corresponds to the pseudo-dictionary of one entity.

Page 78:

Effect of Surface Word Feature

Pseudo-dict  Full                        Left                        Right
size         Recall Precision F-score   Recall Precision F-score    Recall Precision F-score
Baseline     62.75  65.74     64.21     66.69  69.87     68.25      70.87  74.25     72.52
50%          63.24  66.99     65.06     67.02  70.98     68.94      71.53  75.77     73.58
60%          63.39  67.36     65.31     67.14  71.34     69.18      71.68  76.16     73.85
70%          63.48  67.65     65.50     67.23  71.64     69.36      71.74  76.45     74.02
80%          62.71  66.98     64.77     66.50  71.02     68.68      70.90  75.72     73.23

Page 79:

Effect of Surface Word Feature
- Precision scores improved more than recall scores compared to the baseline classifier
- Improvement on the right boundary is more pronounced
- Precision score is greater than the recall score: pseudo-dictionaries can be used to generate classifiers with higher precision values than recall values

Page 80:

Effect of Feature Combinations

Features used (Lexical / Morphological / Surface words / Orthographic)   Full Object Identification
                                                                         Recall  Precision  F-score
(none)                                                                   0.6275  0.6574     0.6421
Lexical                                                                  0.6399  0.6729     0.6558
Lexical + Morphological                                                  0.6744  0.6682     0.6712
Lexical + Morphological + Surface words                                  0.6844  0.6669     0.6755
Lexical + Morphological + Surface words + Orthographic                   0.7006  0.6791     0.6897

- Some specific combinations do not yield a significant improvement in performance
- Careful combination of features is useful

Page 81:

Additional Material

Page 82:

Sources of Problems in Biomedical NER
- Irregularities and mistakes in:
  - Tokenization
  - Tagging
- (Irregular) use of special symbols
- Lack of standard naming conventions
- Changing names and notations
- Continuous introduction of new names

Page 83:

Sources of Problems in Biomedical NER
- Abbreviations
- Homonyms or ambiguous names
- Synonyms
- Variations

Page 84:

Sources of Problems in Biomedical NER
- Cascaded named entities
- Complicated constructions:
  - Comma-separated lists
  - Disjunctions
  - Conjunctions
- Inclusion of adjectives as part of some NEs

Page 85:

Evaluation
[Venn diagram: the overlap of the NEs in the corpus and the NEs predicted by the classifier gives the true positives (TP); corpus-only NEs are false negatives (FN), predicted-only NEs are false positives (FP), and everything else is true negatives (TN).]

Page 86:

Evaluation
- Precision: the ratio of correctly identified NEs to the number of NEs identified by the system
- Recall: the ratio of correctly identified NEs to the number of NEs in the corpus

precision p = tp / (tp + fp)
recall r = tp / (tp + fn)

Page 87:

Evaluation
The F-score is based on precision and recall; the F1-score is the harmonic mean of precision and recall:

F1-score = 2pr / (p + r)
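The definitions above, expressed as code (entity-level counts assumed given):

```python
# Precision, recall, and F1 from true-positive, false-positive, and
# false-negative counts, as defined above.

def precision_recall_f1(tp, fp, fn):
    p = tp / (tp + fp)          # precision
    r = tp / (tp + fn)          # recall
    f1 = 2 * p * r / (p + r)    # harmonic mean of p and r
    return p, r, f1

p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(p, r, f1)   # 0.8, ~0.667, ~0.727
```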

Page 88:

Chromosome Structure for the Proposed Vote-Based Classifier Selection

For each classifier, one gene is reserved to represent its participation in the decision for each class.

Page 89:

Post-Processing Rules
- Inconsistent tag correction
- Boundary extension
- Dictionary lookup