Biomedical Named Entity Recognition from Text using Genetic Algorithm Based Classifier Subset Selection
Nazife Dimililer
Supervisor: Asst. Prof. Dr. Ekrem Varoğlu


TRANSCRIPT

Page 1: Nazife Dimililer, Supervisor: Asst. Prof. Dr. Ekrem Varoğlu

Biomedical Named Entity Recognition from Text using Genetic Algorithm Based Classifier Subset Selection

Nazife Dimililer
Supervisor: Asst. Prof. Dr. Ekrem Varoğlu

Page 2:

Outline
- Motivation
- Background
  - Overview of IE tasks
  - Definition of NER
  - Corpus Used
- Objective of Thesis
- Related Work
- Proposed System
  - Corpus
  - Individual Classifiers
  - Multi-Classifier System
- Future Work

Page 3:

Motivation of the Thesis
- Vast amount of literature available online
- Need for:
  - Intelligent information retrieval
  - Automatically populating databases
  - Document understanding/summarization
- NER is the first step of all IE tasks
- Annotated corpora: GENIA, BioCreative, FlyBase
- Room for improvement
- Applicability to other domains

Page 4:

What is Named Entity Recognition?
- Named entity recognition (NER) is a subtask of information extraction.
- It identifies and labels strings of a text as belonging to predefined classes (named entities).
- Example NEs: persons, organizations, expressions of time, drugs, proteins, cell types.
- NER poses a significant challenge in the biomedical domain.

Page 5:

Overview of IE Tasks in Biomedical Domain
[Pipeline diagram: Articles → Article Preprocessing → Biomedical NER → Bio-Entity & Interaction Normalization → Bio-Entity Interaction Extraction, supported by an ontology and a terminology DB, and feeding applications such as Question Answering, Article Selection, and Text Summarization.]

Page 6:

Sources of Problems in Biomedical NER
- Irregularities and mistakes in:
  - Tokenization
  - Tagging
- (Irregular) use of special symbols
- Lack of standard naming conventions
- Changing names and notations
- Continuous introduction of new names
- Abbreviations, synonyms, variations
- Homonyms or ambiguous names
- Cascaded named entities
- Complicated constructions:
  - Comma-separated lists
  - Disjunctions and conjunctions
- Inclusion of adjectives as part of some NEs

Page 7:

State of Current Research for Biomedical NER
A large number of systems have been proposed for biomedical NER:
- Systems based on individual classifiers
- Multiple classifier systems with a small number of members
- Use of external sources
- Hand-crafted post-processing
- Corpora with differing NEs
- Different evaluation schemes

Page 8:

State of Current Research in Biomedical NER
- A very important milestone in this area was the Bio-Entity Recognition Task at JNLPBA in 2004.
- The same systems as in the newswire domain were used with slight changes; rich feature sets were exploited.
- Successful classifiers relied on external resources and post-processing.
- Similar systems were used in the BioCreative tasks in 2004, 2006, and 2009, and in other publications.

Page 9:

Objective of the Thesis
Improve biomedical NER performance:
- Use a benchmark corpus
- Apply classifier selection techniques to biomedical NER
- Train a reliable and diverse set of individual classifiers
- Utilize a large set of individual classifiers
- Use a Genetic Algorithm to form an ensemble performing vote-based classifier subset selection

Page 10:

Corpus Used
- JNLPBA data: based on the GENIA Corpus v. 3.02
- Contains 5 entities: Protein, RNA, DNA, Cell Line, Cell Type
- IOB2 tagged, giving 11 classes: B-protein, I-protein, B-RNA, I-RNA, B-DNA, I-DNA, B-cell_line, I-cell_line, B-cell_type, I-cell_type, and Outside (O)

Page 11:

Format of JNLPBA Data
Our/O data/O suggest/O that/O lipoxygenase/B-protein metabolites/I-protein activate/O ROI/O formation/O which/O then/O induce/O IL-2/B-protein expression/O via/O NF-kappa/B-protein B/I-protein activation/O ./O

The/O peri-kappa/B-DNA B/I-DNA site/I-DNA mediates/O human/B-DNA immunodeficiency/I-DNA virus/I-DNA type/I-DNA 2/I-DNA enhancer/I-DNA activation/O … Human/O immunodeficiency/O virus/O type/O 2/O
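The IOB2 scheme above can be decoded mechanically; the following is a minimal sketch (not the thesis code) that groups token/tag pairs into entity spans, using the JNLPBA tag names shown in the example:

```python
# Minimal sketch: grouping IOB2 (token, tag) pairs into entity spans.
# Tag names follow the JNLPBA convention shown above.

def iob2_to_entities(pairs):
    """Collect (entity_type, entity_text) spans from (token, tag) pairs."""
    entities, current = [], None
    for token, tag in pairs:
        if tag.startswith("B-"):          # a new entity begins
            current = (tag[2:], [token])
            entities.append(current)
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)      # continue the open entity
        else:                             # "O" or an inconsistent I- tag
            current = None
    return [(etype, " ".join(toks)) for etype, toks in entities]

pairs = [("lipoxygenase", "B-protein"), ("metabolites", "I-protein"),
         ("activate", "O"), ("IL-2", "B-protein"), ("expression", "O")]
print(iob2_to_entities(pairs))
# → [('protein', 'lipoxygenase metabolites'), ('protein', 'IL-2')]
```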

Page 12:

Data Set Statistics

              # of abstracts   # of sentences   # of words
Training Set  2,000            20,546           472,006
Test Set      404              4,260            96,780

                         Protein  DNA     RNA    Cell Type  Cell Line  All Entities
Training  # of Entities  30,269   9,533   951    6,718      3,830      51,301
Data      # of Tokens    55,117   25,307  2,481  15,466     11,217     109,588
Test      # of Entities  5,067    1,056   118    1,921      500        8,662
Data      # of Tokens    9,841    2,845   305    4,912      1,489      19,392

- Training data: MeSH terms "human", "blood cells" and "transcription factors"
- Test data: super domain of "blood cells" and "transcription factors"

Page 13:

Individual Classifier Architecture
Why use SVM?
- Successfully used in many NLP and bioinformatics tasks:
  - CoNLL 2000 and CoNLL 2004
  - BioCreAtIvE competition 2004
- Ability to handle large feature sets
- IOB2 notation is used to represent entities
- Multi-class classification problem
- Features are extracted from the training data only

Page 14:

Individual Classifier System Used
YamCha: a generic, customizable, open-source text chunker that uses Support Vector Machines.
Tunable parameters:
- Parsing direction: left-to-right / right-to-left
- Range of the context window
- Degree of the polynomial kernel

Page 15:

Context Window

The default setting is "F:-2..2:0.. T:-2..-1".

Page 16:

Training Individual Classifiers
All individual classifiers are trained using the one-vs-all approach, varying:
- Backward or forward parse direction
- Different context windows
- Different degrees of the polynomial kernel
- Different features and feature combinations

Page 17:

Individual Classifiers
All classifiers are based on SVM. Feature types:
- Lexical features
- Morphological features
- Orthographic features
- Surface word feature
Tokens and the predicted tags are also used as features.

Page 18:

Features Used
- Tokens: words in the training data; the token to be classified and the preceding and following tokens, as specified by the context window.
- Previously predicted tags: predicted tags of the preceding tokens, as specified by the context window.

Page 19:

Features Used (Cont.)
Lexical features represent grammatical functions of tokens.
- Part of Speech: tags from the Penn Treebank Project, added using the Geniatagger. Ex: Adverb, Determiner, Adjective
- Phrase Tag: phrasal categories, added using an SVM trained on newswire data. Ex: Noun Phrase, Verb Phrase, Adjective Phrase
- Base Noun Phrase Tag: basic noun phrases are tagged using the fnTBL tagger.

Page 20:

Features Used: Morphological
Different n-grams of the current token. An n-gram of a token is simply formed by using the last or first n characters of the token.
- Last 1/2/3/4 letters
- First 1/2/3/4 letters

Example: TRANSCRIPTION

n       1   2    3     4
Suffix  n   on   ion   tion
Prefix  t   tr   tra   tran
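The affix extraction above is straightforward to express in code. A small illustrative sketch (feature names are my own, not from the thesis):

```python
# Sketch of the prefix/suffix n-gram extraction described above.
# Feature names ("prefix1", "suffix4", ...) are illustrative.

def affix_features(token, max_n=4):
    """Return the first-n and last-n character n-grams of a token."""
    feats = {}
    for n in range(1, max_n + 1):
        feats[f"prefix{n}"] = token[:n].lower()
        feats[f"suffix{n}"] = token[-n:].lower()
    return feats

print(affix_features("TRANSCRIPTION"))
# suffix4 → "tion", prefix4 → "tran", matching the table above
```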

Page 21:

Features Used: Orthographic
Also known as word formation patterns: information about the form of the word (e.g., contains uppercase letters, digits, etc.).
Two different approaches are used:
- Simple: the existence of a particular word formation pattern is represented by a binary yes/no feature.
- Intricate: multiple word formation patterns are represented using a list based on representation score.

Page 22:

Features Used: Orthographic (Cont.)
Orthographic feature, intricate approach: a list of word formation patterns is formed in decreasing order of representation score. The representation score of an orthographic property i for the entity labeled j (RS_{i,j}) is calculated as:

RS_{i,j} = (number of tokens with orthographic property i) / (number of tokens constituting entity j)

Orthographic features that have a representation score of more than 10% for Outside-tagged tokens are eliminated from the list.
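The representation-score definition above can be sketched directly; the property predicates below are simplified stand-ins, not the thesis feature set:

```python
# Hedged sketch of the representation-score computation defined above:
# RS[(i, j)] = (# tokens of entity j with orthographic property i)
#              / (# tokens constituting entity j).
from collections import defaultdict

def representation_scores(tokens_with_entities, properties):
    counts = defaultdict(int)   # (property, entity) -> matching tokens
    totals = defaultdict(int)   # entity -> total tokens
    for token, entity in tokens_with_entities:
        totals[entity] += 1
        for name, pred in properties.items():
            if pred(token):
                counts[(name, entity)] += 1
    return {k: counts[k] / totals[k[1]] for k in counts}

# Illustrative properties and toy data (not from the corpus):
props = {"has_digit": lambda t: any(c.isdigit() for c in t),
         "all_upper": str.isupper}
data = [("IL-2", "protein"), ("kinase", "protein"), ("DNA", "DNA")]
print(representation_scores(data, props))
```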

Page 23:

Features Used: Orthographic (Cont.)
Orthographic features used:

Word Formation Pattern   Example
UpperCase                IL-2
InitCap                  D3
TwoUpper                 FasL
Alpha_and_Other          AML1/ETO
Hyphen                   product-albumin
Upper_or_Digit           3H
Digits                   40
Alpha_and_Digit          IL-1beta
Upper_and_Other          2-M
Lower_and_Upper          25-Dihydroxyvitamin
Upper_and_Digits         AP-1
Lower_and_Other          dehydratase/dimerization
Allupper                 DNA, GR, T
Greek                    NF-Kappa, beta
Lower_and_Digits         gp39
Start_with_Hyphen        -mediated

Page 24:

Features Used: Orthographic (Cont.)
Intricate use of the orthographic feature, priority based: each token is tagged with the first applicable word formation pattern on the list. Example:

Ca2+                  Initial letter capitalized
D3                    Initial letter capitalized
GR                    All letters uppercase
-acetate              Starts with -
25-Dihydroxyvitamin   Contains upper and other

Page 25:

Features Used: Orthographic (Cont.)
Intricate use of the orthographic feature, binary string: a binary string containing one bit to represent each word formation pattern in the list. Example:

Ca2+                  0010111100111110
D3                    0011001100000110
GR                    1111000100000000
-acetate              0000000011110000
25-Dihydroxyvitamin   0000111101111110

[Bit legend from the slide: initial letter capitalized; combination of upper-case letter and other symbol; combination of upper- and lower-case letters; combination of upper-case letter and number; contains upper-case letter; combination of lower-case letter and other symbols; combination of alphabetic characters and other symbols; combination of lower-case letter and number; combination of alphabetic characters and numbers; contains number.]
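The binary-string encoding above can be sketched with one predicate per pattern. The pattern list below is a small illustrative subset of the slide's list, so the strings are shorter than the examples:

```python
# Illustrative sketch of the binary-string orthographic feature:
# one bit per word-formation pattern (subset of the slide's list).
import re

PATTERNS = [
    ("init_cap",        lambda t: t[:1].isupper()),
    ("all_upper",       lambda t: t.isupper()),
    ("has_digit",       lambda t: any(c.isdigit() for c in t)),
    ("has_hyphen",      lambda t: "-" in t),
    ("starts_hyphen",   lambda t: t.startswith("-")),
    ("alpha_and_digit", lambda t: bool(re.search(r"[A-Za-z]", t))
                                  and any(c.isdigit() for c in t)),
]

def binary_orthographic(token):
    return "".join("1" if pred(token) else "0" for _, pred in PATTERNS)

for tok in ["Ca2+", "GR", "-acetate"]:
    print(tok, binary_orthographic(tok))
```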

Page 26:

Features Used (Cont.)
Surface words:
- A separate pseudo-dictionary for each entity, containing the tokens with the highest counts in the training data, such that x% of all tokens in that entity's names are in the dictionary.
- Pseudo-dictionaries with 50%, 60%, 70%, and 80% coverage.
- Each token is tagged with a 5-bit string, where each bit corresponds to the pseudo-dictionary of one entity.
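The pseudo-dictionary construction described above can be sketched as a greedy frequency cut-off. This is my own minimal reading of the description, with toy data:

```python
# Sketch of the pseudo-dictionary construction: for each entity, keep the
# most frequent tokens until the requested coverage of all tokens in that
# entity's names is reached. (Illustrative, not the thesis code.)
from collections import Counter

def pseudo_dictionary(entity_tokens, coverage=0.7):
    counts = Counter(entity_tokens)
    target = coverage * len(entity_tokens)
    covered, dictionary = 0, set()
    for token, count in counts.most_common():
        if covered >= target:
            break
        dictionary.add(token)
        covered += count
    return dictionary

# Toy token stream from one entity's names:
tokens = ["IL-2", "receptor", "IL-2", "kinase", "IL-2", "receptor",
          "alpha", "beta", "IL-2", "receptor"]
print(pseudo_dictionary(tokens, coverage=0.7))
# the two most frequent tokens already cover 7 of 10 tokens
```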

Page 27:

Effect of Feature Extraction
Each feature type improves performance from a different perspective:
- Precision
- Recall
- Boundaries
- Entity-based performance
Careful combination of features improves the overall performance.

Page 28:

Effect of Parse Direction and Lexical Features
Effect of backward parsing:
- Precision and recall increased for both boundaries
- Precision scores improved more than recall scores
- An overall increase in full recall, precision, and F-score
Effect of lexical features:
- Single lexical features: higher precision than recall
- Combinations: recall and precision values are more balanced
- Combinations slightly improve both the left-boundary and right-boundary F-scores

Page 29:

Effect of Morphological Features
- F-score improves compared to the baseline system
- Suffixes alone result in higher recall than precision
- Prefixes alone result in higher precision than recall
- Their combination improves the overall performance
- The morphological feature improves recall but degrades precision compared to the baseline

Page 30:

Effect of Orthographic Features
- Performance is improved by all orthographic features
- Best performance is achieved by the binary string
- For simple orthographic features, precision scores are slightly higher than recall scores
- Intricate orthographic features provide higher recall values, resulting in an overall improvement in F-scores

Page 31:

Effect of Surface Word Feature
- Precision scores improved more than recall scores compared to the baseline classifier
- Improvement on the right boundary is more pronounced
- Precision score is greater than the recall score: pseudo-dictionaries can be used to generate classifiers with higher precision values than recall values

Page 32:

Effect of Feature Combinations
- Some specific combinations do not yield a significant improvement in performance
- Careful combination of features is useful for improving overall performance
- Different combinations of feature/parameter sets favor different entities

Page 33:

Motivation for Multiple Classifier Systems
For individual classifiers:
- A set of carefully engineered features improves performance
- Unfortunately, performance is still NOT satisfactory
Combining multiple classifiers into ensembles:
- The combined opinion of a number of experts is more likely to be correct than that of a single expert

Page 34:

Classifier Pool
- Classifiers exploiting state-of-the-art feature sets => highest F-scores
- Classifiers with high precision or high recall: high precision but low recall, and vice versa
- One or more classifiers providing the highest F-score for each entity

Page 35:

Classifier Fusion Architecture
[Architecture diagram. Training phase: training data passes through feature extraction (feature sets 1..M, plus a dictionary and context words) into an SVM classifier set (SVM 1..M); GA-based classifier selection then produces the best fitting ensemble. Testing phase: test data passes through the same feature extraction and SVM classifier set; classifier fusion and post-processing produce the predicted class.]

Page 36:

Fusion Algorithm
Weighted majority voting:
- The full-object F-score of each classifier on cross-validation data is used as its weight
- The class that receives the highest weighted combination of votes wins the competition
- Ties are broken by a random coin toss
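The weighted majority voting just described can be sketched in a few lines; the F-score weights below are illustrative placeholders:

```python
# Minimal sketch of weighted majority voting: each classifier's vote is
# weighted by its full-object F-score; ties are broken randomly.
import random

def weighted_vote(predictions, weights, rng=random):
    scores = {}
    for pred, w in zip(predictions, weights):
        scores[pred] = scores.get(pred, 0.0) + w
    best = max(scores.values())
    winners = [c for c, s in scores.items() if s == best]
    return rng.choice(winners)   # random coin toss on ties

preds = ["B-protein", "B-protein", "O"]
weights = [0.64, 0.66, 0.71]     # per-classifier F-scores (illustrative)
print(weighted_vote(preds, weights))
# two weaker classifiers together outvote the single stronger one
```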

Page 37:

Weighted Majority Voting
Weight: full-object F-score

Page 38:

Genetic Algorithm Set-Up
- Initial population: randomly generated bit strings
- Population size: 100
- Mutation rate: 2%
- Crossover rate: 70%
- Crossover operators: two-point crossover, uniform crossover
- Tournament size: 40
- Elitist population: 20%

Page 39:

Flow Chart of the Genetic Algorithm
[Flow chart: Start → initialize population randomly → compute fitness of each chromosome → while not terminated: select parents and apply crossover → mutate offspring → compute fitness of each chromosome → apply elitist policy to form the new population → loop; on termination, select the best chromosome as the resultant ensemble → End.]
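The flow above is a standard generational GA with elitism. The sketch below follows the slides' rates (2% mutation, 70% two-point crossover, elitism); everything else, including the toy one-max fitness it optimizes instead of an ensemble F-score, is an illustrative stand-in:

```python
# Illustrative GA loop matching the flow chart above (toy fitness,
# not the thesis's ensemble evaluation).
import random

def genetic_algorithm(fitness, n_bits, pop_size=100, generations=50,
                      mutation=0.02, crossover=0.70, tournament=40,
                      elite_frac=0.20, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]

    def select():  # tournament selection
        return max(rng.sample(pop, tournament), key=fitness)

    for _ in range(generations):
        elite_n = int(elite_frac * pop_size)
        new_pop = sorted(pop, key=fitness, reverse=True)[:elite_n]  # elitism
        while len(new_pop) < pop_size:
            a, b = select(), select()
            if rng.random() < crossover:          # two-point crossover
                i, j = sorted(rng.sample(range(n_bits), 2))
                a = a[:i] + b[i:j] + a[j:]
            child = [bit ^ (rng.random() < mutation) for bit in a]  # mutate
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)

# Toy fitness: count of set bits ("one-max").
best = genetic_algorithm(sum, n_bits=20, pop_size=30, generations=30,
                         tournament=5)
print(sum(best))
```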

Page 40:

Genetic Algorithm Set-Up
- Chromosome: list of classifiers to be combined
- 3-fold cross-validation results are used for the individual classifiers
- Fitness of a chromosome: full-object F-score of the classifier ensemble
- Static classifier selection: each bit represents a classifier
- Proposed vote-based classifier selection: each bit represents the reliability of a classifier for predicting a class

Page 41:

Chromosome Structure of Static Classifier Selection
Example chromosome: 0 1 0 1 1 0 0 1 (one gene per classifier, Classifier 1 … Classifier M).
If a gene is 1, the corresponding classifier participates in the decision for all classes; otherwise it remains silent.
For M classifiers, the chromosome has M bits.

Page 42:

Chromosome Structure for the Proposed Vote-Based Classifier Selection
Example chromosome: 0 1 0 1 | 1 1 0 0 | … (for each classifier, one gene per class, Class 1 … Class N).
For each classifier, one gene is reserved to represent its participation in the decision for each class.
For M classifiers and N classes, the chromosome has N×M bits.
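Decoding such an N×M chromosome amounts to slicing it per classifier; a small sketch (function name is my own) using the slide's example bit string:

```python
# Sketch of decoding the proposed vote-based chromosome: bit (m, n) says
# whether classifier m may vote for class n. (Illustrative names.)

def allowed_votes(chromosome, n_classes):
    """Return, per classifier, the set of class indices it may vote for."""
    assert len(chromosome) % n_classes == 0
    return [
        {n for n in range(n_classes) if chromosome[m * n_classes + n] == 1}
        for m in range(len(chromosome) // n_classes)
    ]

# 2 classifiers x 4 classes, matching the slide's example 0101 1100:
print(allowed_votes([0, 1, 0, 1, 1, 1, 0, 0], n_classes=4))
# → [{1, 3}, {0, 1}]
```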

Page 43:

Motivation for Vote-Based Classifier Subset Selection
- A classifier cannot predict all classes with the same performance
- A subset of its predictions may be unreliable
- A subset of its predictions may be correlated with the predictions of other classifiers
- Allow a classifier to vote only for the classes it trusts

Page 44:

Multiple Classifier Systems Used
- Single Best (SB): not an MCS; included as a reference
- Full Ensemble (FE): ensemble containing all classifiers
- Forward Selection (FS): ensemble formed using forward selection
- Backward Selection (BS): ensemble formed using backward selection
- GA-generated Static Ensemble (GAS): ensemble formed using the GA
- Vote-Based Classifier Subset Selection using the GA (VBS): vote-based ensemble formed using the GA

Page 45:

Performance of Ensembles

Method              Precision (%)  Recall (%)  F-score (%)
Single Best         69.40          70.60       69.99
Full Ensemble       72.10          70.42       71.25
Forward Selection   71.64          71.58       71.61
Backward Selection  72.28          71.00       71.63
GA Static Ensemble  71.76          71.65       71.71
Proposed Method     71.45          73.60       72.51

Page 46:

Discussion on Ensembles
- All ensembles outperform SB
- VBS has the highest F-score; GA-based ensembles are better
- BS chose 38 classifiers; FE and BS are similar: precision >> recall
- FS and GAS chose 9 classifiers: precision and recall are more balanced
- VBS is different: it uses 46 classifiers partially; recall > precision

Page 47:

Discussion on Ensembles
- BS eliminates mainly classifiers using only two features; all eliminated classifiers are backward parsed
- FS and GAS are almost the same: 8 classifiers are identical, and the 9th classifier is forward parsed for GAS
- Even though the 9th classifier has a lower F-score, the GAS ensemble achieves a higher F-score: backward and forward parsed classifiers are more balanced

Page 48:

Entity-Based F-scores for the Ensembles

Method               DNA    RNA    Cell Line  Cell Type  Protein
Single Best          68.75  66.95  51.72      69.87      72.14
Full Ensemble        69.77  68.91  57.43      71.41      72.86
Forward Selection    70.45  67.22  56.02      72.12      73.27
Backward Selection   70.61  68.91  57.06      71.92      73.16
GA Static Ensemble   70.51  69.49  56.32      71.77      73.43
Proposed Method VBS  72.06  69.11  58.80      72.79      73.86

Page 49:

Discussion of Entity-Based Scores
- VBS achieves the best scores for all entities except RNA; the GAS ensemble outperforms VBS for RNA
- VBS: highest F-score for protein (largest data set), lowest F-score for cell line (smallest data set)

Page 50:

Distribution of Vote Counts for the VBS

Number of votes        0  1  2  3  4  5  6  7  8  9  10  11
Number of classifiers  0  3  5  3  7  8  8  6  0  2  4   0

- None of the classifiers is eliminated from the ensemble (no classifier has 0 votes)
- None of the classifiers votes for all eleven classes

Page 51:

Discussion of Vote Counts
- Each classifier contributes to the decision of at least one class
- Some classifiers contribute to almost all classes
- 7 of the 9 classifiers selected by GAS vote for more than 5 classes in VBS
- Classifiers that have only 1 vote in VBS are excluded by GAS

Page 52:

Post-Processing Approach Used
- Inconsistencies in tagging, mainly induced by ensembling, are fixed
- Boundaries are extended using separate context word lists for each entity
- A dictionary formed from the training data retags mistagged or untagged entities

Example 1, inconsistent tag correction:
Before: from/O B/B-cell_line or/I-cell_line HeLa/I-cell_line cells/I-cell_type stimulate/O transcription/I-protein of/O the/O
After:  from/O B/B-cell_line or/I-cell_line HeLa/I-cell_line cells/I-cell_line stimulate/O transcription/O of/O the/O

Example 2, right-boundary extension ("region" is in the right context of DNA):
Before: Different/O fragments/O of/O unrearranged/B-DNA human/I-DNA variable/I-DNA region/O of/O
After:  Different/O fragments/O of/O unrearranged/B-DNA human/I-DNA variable/I-DNA region/I-DNA of/O
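The right-boundary extension rule just illustrated can be sketched as follows; the context word list is an illustrative stand-in for the per-entity lists built from the training data:

```python
# Sketch of right-boundary extension: if the token after an entity appears
# in that entity's right-context word list, pull it into the entity.
# Context lists here are illustrative.

RIGHT_CONTEXT = {"DNA": {"region", "site", "element"}}

def extend_right_boundary(tokens, tags):
    tags = list(tags)
    for i in range(1, len(tokens)):
        prev = tags[i - 1]
        if tags[i] == "O" and prev != "O":
            etype = prev.split("-", 1)[1]
            if tokens[i] in RIGHT_CONTEXT.get(etype, ()):
                tags[i] = "I-" + etype    # extend the entity rightward
    return tags

tokens = ["unrearranged", "human", "variable", "region", "of"]
tags = ["B-DNA", "I-DNA", "I-DNA", "O", "O"]
print(extend_right_boundary(tokens, tags))
# → ['B-DNA', 'I-DNA', 'I-DNA', 'I-DNA', 'O']
```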

Page 53:

Effect of Post-Processing on VBS

                                   Improvement  F-score (%)
VBS                                —            72.51
VBS + inconsistent tag correction  +0.08        72.59
VBS + right-boundary correction    +0.14        72.65
VBS + left-boundary correction     +0.01        72.52
VBS + dictionary lookup            +0.08        72.59
VBS + all post-processing rules    +0.23        72.74

Page 54:

Effect of Post-Processing on Individual Entities

                        DNA    RNA    Cell Line  Cell Type  Protein
Before post-processing  72.06  69.11  58.80      72.79      73.86
After post-processing   72.19  71.55  59.96      73.36      73.87
Improvement             0.13   2.44   1.16       0.57       0.01

Page 55:

Discussion of Results
Post-processing provides more success for entities having:
- Lower F-scores
- Lower representation
- Longer names

Page 56:

Future Work
- The post-processing stage may be improved: external resources; more efficient normalization and lookup approaches; discovering post-processing rules through AI techniques
- Different classifier architectures may be used to increase diversity: CRF, HMM

Page 57:

Future Work
- The GA may employ a number of different evaluation metrics
- Different ensembling strategies may be employed: stacked generalization, bagging/boosting

Page 58:

Questions?

Page 59:

Appendix

Page 60:

Baseline System
Features used:
- Token to be classified
- Context window of -2..2: determines the preceding and following tokens and the preceding predictions used as features
- 2nd-degree polynomial kernel
- Forward parse direction
- One-vs-all method

Object Identification Scores of the Baseline System

        Full                         Left                         Right
Recall  Precision  F-score  Recall  Precision  F-score  Recall  Precision  F-score
0.6275  0.6574     0.6421   0.6669  0.6987     0.6825   0.7087  0.7425     0.7252

Page 61:

Discussion of Experiments
- All classifiers presented are trained using the one-vs-all approach, varying: backward/forward parse direction, context windows, and features and feature combinations
- Presented results are averages of the classifiers in each category

Page 62:

Training Individual Classifiers
All individual classifiers are trained using the one-vs-all approach, varying:
- Backward/forward parse direction
- Different context windows
- Different polynomial kernels
- Different features and feature combinations

Page 63:

Effect of Parse Direction

Parse      Full                        Left                        Right
Direction  Recall Precision F-score   Recall Precision F-score    Recall Precision F-score
Forward    65.21  65.30     65.23     69.18  69.28     69.20      72.92  73.02     72.94
Backward   66.85  68.13     67.45     70.51  71.87     71.15      74.93  76.38     75.61

- Precision and recall increased for both boundaries
- Precision scores improved more than recall scores
- An overall increase in full recall, precision, and F-score

Page 64:

Features Used
- Tokens: words in the training data; the token to be classified and the preceding and following tokens, as specified by the context window.
- Previously predicted tags: predicted tags of the preceding tokens, as specified by the context window.

Page 65:

Features Used (Cont.)
Lexical features represent grammatical functions of tokens.
- Part of Speech: tags from the Penn Treebank Project, added using the Geniatagger. Ex: Adverb, Determiner, Adjective
- Phrase Tag: phrasal categories, added using an SVM trained on newswire data. Ex: Noun Phrase, Verb Phrase, Adjective Phrase
- Base Noun Phrase Tag: basic noun phrases are tagged using the fnTBL tagger.

Page 66:

Effect of Lexical Features
- Single lexical features: higher precision than recall
- Combinations: recall and precision values are more balanced
- Combinations slightly improve both the left-boundary and right-boundary F-scores

Lexical   Full                        Left                        Right
Features  Recall Precision F-score   Recall Precision F-score    Recall Precision F-score
Single    63.46  67.51     65.41     67.24  71.54     69.31      71.00  75.53     73.18
Combined  65.53  66.63     66.07     69.46  70.62     70.03      73.08  74.30     73.68

Page 67:

Features Used: Morphological
Different n-grams of the current token. An n-gram of a token is simply formed by using the last or first n characters of the token.
- Last 1/2/3/4 letters
- First 1/2/3/4 letters

Example: TRANSCRIPTION

n       1   2    3     4
Suffix  n   on   ion   tion
Prefix  t   tr   tra   tran

Page 68:

Effect of Morphological Features

Morph.    Full                        Left                        Right
Features  Recall Precision F-score   Recall Precision F-score    Recall Precision F-score
Baseline  62.75  65.74     64.21     66.69  69.87     68.25      70.87  74.25     72.52
Prefix    64.48  65.13     64.79     68.43  69.12     68.76      72.51  73.23     72.86
Suffix    65.47  64.84     65.14     69.36  68.68     69.01      73.63  72.93     73.27
Combined  65.75  65.04     65.39     69.68  68.93     69.30      73.89  73.10     73.48

Page 69:

Effect of Morphological Features
- F-score improves compared to the baseline system
- Suffixes alone result in higher recall than precision
- Prefixes alone result in higher precision than recall
- Their combination improves the overall performance
- The morphological feature improves recall but degrades precision compared to the baseline

Page 70:

Features Used: Orthographic
Also known as word formation patterns: information about the form of the word (e.g., contains uppercase letters, digits, etc.).
Two different approaches are used:
- Simple: the existence of a particular word formation pattern is represented by a binary yes/no feature.
- Intricate: multiple word formation patterns are represented using a list based on representation score.

Page 71:

Features Used: Orthographic (Cont.)
Orthographic feature, intricate approach: a list of word formation patterns is formed in decreasing order of representation score. The representation score of an orthographic property i for the entity labeled j (RS_{i,j}) is calculated as:

RS_{i,j} = (number of tokens with orthographic property i) / (number of tokens constituting entity j)

Orthographic features that have a representation score of more than 10% for Outside-tagged tokens are eliminated from the list.

Page 72:

Features Used: Orthographic (Cont.)
Orthographic features used:

Word Formation Pattern   Example
UpperCase                IL-2
InitCap                  D3
TwoUpper                 FasL
Alpha_and_Other          AML1/ETO
Hyphen                   product-albumin
Upper_or_Digit           3H
Digits                   40
Alpha_and_Digit          IL-1beta
Upper_and_Other          2-M
Lower_and_Upper          25-Dihydroxyvitamin
Upper_and_Digits         AP-1
Lower_and_Other          dehydratase/dimerization
Allupper                 DNA, GR, T
Greek                    NF-Kappa, beta
Lower_and_Digits         gp39
Start_with_Hyphen        -mediated

Page 73:

Features Used: Orthographic (Cont.)
Intricate use of the orthographic feature, priority based: each token is tagged with the first applicable word formation pattern on the list. Example:

Ca2+                  Initial letter capitalized
D3                    Initial letter capitalized
GR                    All letters uppercase
-acetate              Starts with -
25-Dihydroxyvitamin   Contains upper and other

Page 74:

Features Used: Orthographic (Cont.)
Intricate use of the orthographic feature, binary string: a binary string containing one bit to represent each word formation pattern in the list. Example:

Ca2+                  0010111100111110
D3                    0011001100000110
GR                    1111000100000000
-acetate              0000000011110000
25-Dihydroxyvitamin   0000111101111110

[Bit legend from the slide: initial letter capitalized; combination of upper-case letter and other symbol; combination of upper- and lower-case letters; combination of upper-case letter and number; contains upper-case letter; combination of lower-case letter and other symbols; combination of alphabetic characters and other symbols; combination of lower-case letter and number; combination of alphabetic characters and numbers; contains number.]

Page 75:

Effect of Orthographic Features

Orthog.   Full                        Left                        Right
Features  Recall Precision F-score   Recall Precision F-score    Recall Precision F-score
Baseline  62.75  65.74     64.21     66.69  69.87     68.25      70.87  74.25     72.52
Simple    65.69  65.93     65.79     69.53  69.79     69.64      73.97  74.25     74.09
Priority  68.02  65.62     66.80     71.91  69.36     70.61      76.06  73.37     74.68
Binary    68.48  65.46     66.94     72.26  69.07     70.63      76.58  73.20     74.85

- Performance is improved by all orthographic features
- Best performance is achieved by the binary string

Page 76:

Effect of Orthographic Features (Cont.)
- For simple orthographic features, precision scores are slightly higher than recall scores
- The simple orthographic feature degrades precision on the left boundary only
- Intricate approaches degrade precision on both the left and right boundaries, as well as on full-object recognition
- Intricate orthographic features provide higher recall values, resulting in an overall improvement in F-scores
- Intricate orthographic features result in an imbalance between precision and recall

Page 77:

Features Used (Cont.)
Surface words:
- A separate pseudo-dictionary for each entity, containing the tokens with the highest counts in the training data, such that x% of all tokens in that entity's names are in the dictionary.
- Pseudo-dictionaries with 50%, 60%, 70%, and 80% coverage.
- Each token is tagged with a 5-bit string, where each bit corresponds to the pseudo-dictionary of one entity.

Page 78:

Effect of Surface Word Feature

Pseudo-dict  Full                        Left                        Right
size         Recall Precision F-score   Recall Precision F-score    Recall Precision F-score
Baseline     62.75  65.74     64.21     66.69  69.87     68.25      70.87  74.25     72.52
50%          63.24  66.99     65.06     67.02  70.98     68.94      71.53  75.77     73.58
60%          63.39  67.36     65.31     67.14  71.34     69.18      71.68  76.16     73.85
70%          63.48  67.65     65.50     67.23  71.64     69.36      71.74  76.45     74.02
80%          62.71  66.98     64.77     66.50  71.02     68.68      70.90  75.72     73.23

Page 79:

Effect of Surface Word Feature
- Precision scores improved more than recall scores compared to the baseline classifier
- Improvement on the right boundary is more pronounced
- Precision score is greater than the recall score: pseudo-dictionaries can be used to generate classifiers with higher precision values than recall values

Page 80:

Effect of Feature Combinations

Features used (Lexical / Morphological / Surface words / Orthographic)   Full Object Identification
                                                                         Recall  Precision  F-score
(none)                                                                   0.6275  0.6574     0.6421
Lexical                                                                  0.6399  0.6729     0.6558
Lexical + Morphological                                                  0.6744  0.6682     0.6712
Lexical + Morphological + Surface words                                  0.6844  0.6669     0.6755
Lexical + Morphological + Surface words + Orthographic                   0.7006  0.6791     0.6897

- Some specific combinations do not yield a significant improvement in performance
- Careful combination of features is useful

Page 81:

Additional Material

Page 82:

Sources of Problems in Biomedical NER
- Irregularities and mistakes in:
  - Tokenization
  - Tagging
- (Irregular) use of special symbols
- Lack of standard naming conventions
- Changing names and notations
- Continuous introduction of new names

Page 83:

Sources of Problems in Biomedical NER
- Abbreviations
- Homonyms or ambiguous names
- Synonyms
- Variations

Page 84:

Sources of Problems in Biomedical NER
- Cascaded named entities
- Complicated constructions:
  - Comma-separated lists
  - Disjunctions
  - Conjunctions
- Inclusion of adjectives as part of some NEs

Page 85:

Evaluation
[Venn diagram: the overlap of the NEs in the corpus and the NEs predicted by the classifier gives the true positives (TP); corpus-only NEs are false negatives (FN), predicted-only NEs are false positives (FP), and everything else is true negatives (TN).]

Page 86:

Evaluation
- Precision: the ratio of correctly identified NEs to the number of NEs identified by the system
- Recall: the ratio of correctly identified NEs to the number of NEs in the corpus

precision p = tp / (tp + fp)
recall r = tp / (tp + fn)

Page 87:

Evaluation
The F-score is based on precision and recall; the F1-score is the harmonic mean of precision and recall:

F1-score = 2pr / (p + r)
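The definitions above, expressed as code (entity-level counts assumed given):

```python
# Precision, recall, and F1 from true-positive, false-positive, and
# false-negative counts, as defined above.

def precision_recall_f1(tp, fp, fn):
    p = tp / (tp + fp)          # precision
    r = tp / (tp + fn)          # recall
    f1 = 2 * p * r / (p + r)    # harmonic mean of p and r
    return p, r, f1

p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(p, r, f1)   # 0.8, ~0.667, ~0.727
```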

Page 88:

Chromosome Structure for the Proposed Vote-Based Classifier Selection

For each classifier, one gene is reserved to represent its participation in the decision for each class.

Page 89:

Post-Processing Rules
- Inconsistent tag correction
- Boundary extension
- Dictionary lookup