22.04.23
Biomedical Named Entity Recognition from Text using
Genetic Algorithm Based Classifier Subset Selection
Nazife Dimililer
Supervisor: Asst. Prof. Dr. Ekrem Varoğlu
Outline
- Motivation
- Background: overview of IE tasks, definition of NER, corpus used
- Objective of Thesis
- Related Work
- Proposed System: corpus, individual classifiers, multi-classifier system
- Future Work
Motivation of the Thesis
- Vast amount of literature available online
- Need for intelligent information retrieval, automatically populating databases, document understanding/summarization, ...
- NER is the first step of all IE tasks
- Annotated corpora available: GENIA, BioCreative, FlyBase
- Room for improvement; applicability to other domains
What is Named Entity Recognition?
Named entity recognition (NER) is a subtask of information extraction that identifies and labels strings of a text as belonging to predefined classes (named entities).
Example NEs: persons, organizations, expressions of time, drugs, proteins, cell types.
NER poses a significant challenge in the biomedical domain.
Overview of IE Tasks in the Biomedical Domain
[Pipeline diagram: Articles -> Article Selection -> Article Preprocessing -> Biomedical NER -> Bio-Entity & Interaction Normalization (supported by an ontology and a term DB) -> Bio-Entity Interaction Extraction -> Question Answering, Text Summarization, ...]
Sources of Problems in Biomedical NER
- Irregularities and mistakes in tokenization and tagging
- (Irregular) use of special symbols
- Lack of standard naming conventions
- Changing names and notations
- Continuous introduction of new names
- Abbreviations, synonyms, variations
- Homonyms or ambiguous names
- Cascaded named entities
- Complicated constructions: comma-separated lists, disjunctions and conjunctions
- Inclusion of adjectives as part of some NEs
State of Current Research for Biomedical NER
A large number of systems have been proposed for biomedical NER:
- systems based on individual classifiers
- multiple classifier systems with a small number of members
- systems relying on external sources
- hand-crafted post-processing
- corpora with differing NEs
- different evaluation schemes
State of Current Research in Biomedical NER
- A very important milestone in this area was the Bio-Entity Recognition Task at JNLPBA in 2004.
- The same systems as in the newswire domain were used with slight changes; rich feature sets were exploited.
- Successful classifiers relied on external resources and post-processing.
- Similar systems were used in the BioCreative tasks in 2004, 2006 and 2009, and in other publications.
Objective of the Thesis
- Improve biomedical NER performance
- Use a benchmark corpus
- Apply classifier selection techniques to biomedical NER
- Train a reliable and diverse set of individual classifiers
- Utilize a large set of individual classifiers
- Use a genetic algorithm to form an ensemble performing vote-based classifier subset selection
Corpus Used
- JNLPBA data: based on GENIA Corpus v3.02
- Contains 5 entities: Protein, RNA, DNA, Cell Line, Cell Type
- IOB2 tagged, giving 11 classes: B-protein, I-protein, B-RNA, I-RNA, B-DNA, I-DNA, B-cell_line, I-cell_line, B-cell_type, I-cell_type, and Outside (O)
Format of JNLPBA Data
Our/O data/O suggest/O that/O lipoxygenase/B-protein metabolites/I-protein activate/O ROI/O formation/O which/O then/O induce/O IL-2/B-protein expression/O via/O NF-kappa/B-protein B/I-protein activation/O ./O
The/O peri-kappa/B-DNA B/I-DNA site/I-DNA mediates/O human/B-DNA immunodeficiency/I-DNA virus/I-DNA type/I-DNA 2/I-DNA enhancer/I-DNA activation/O ... Human/O immunodeficiency/O virus/O type/O 2/O
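Tag sequences in this format can be decoded back into entity spans. A minimal sketch (the function name is illustrative, not part of the thesis system):

```python
def iob2_to_entities(tags):
    """Decode a sequence of IOB2 tags into (entity_type, start, end) spans."""
    entities, start, etype = [], None, None
    for i, tag in enumerate(tags):
        inside = tag.startswith("I-") and start is not None and tag[2:] == etype
        if not inside and start is not None:      # close any open entity
            entities.append((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):                  # open a new entity
            start, etype = i, tag[2:]
    if start is not None:                         # entity running to the end
        entities.append((etype, start, len(tags)))
    return entities

tags = ["B-protein", "O", "O", "B-protein", "I-protein", "O"]
print(iob2_to_entities(tags))
# → [('protein', 0, 1), ('protein', 3, 5)]
```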
Data Set Statistics

               # of abstracts   # of sentences   # of words
Training Set   2,000            20,546           472,006
Test Set       404              4,260            96,780

                            Protein   DNA      RNA     Cell Type   Cell Line   All Entities
Training   # of Entities    30,269    9,533    951     6,718       3,830       51,301
           # of Tokens      55,117    25,307   2,481   15,466      11,217      109,588
Test       # of Entities    5,067     1,056    118     1,921       500         8,662
           # of Tokens      9,841     2,845    305     4,912       1,489       19,392

Training data: retrieved with the MeSH terms "human", "blood cells" and "transcription factors".
Test data: drawn from the super-domain of "blood cells" and "transcription factors".
Individual Classifier Architecture
Why use SVM?
- Successfully used in many NLP and bioinformatics tasks: CoNLL 2000, CoNLL 2004, BioCreAtIvE competition 2004
- Ability to handle large feature sets
- IOB2 notation is used to represent entities, giving a multi-class classification problem
- Features are extracted from the training data only
Individual Classifier System Used
YamCha: a generic, customizable, open-source text chunker that uses Support Vector Machines.
Tunable parameters:
- Parsing direction: left-to-right / right-to-left
- Range of the context window
- Degree of the polynomial kernel
Context Window
The default setting is "F:-2..2:0.. T:-2..-1".
Training Individual Classifiers
All individual classifiers are trained using the one-vs-all approach, varying:
- backward or forward parse direction
- different context windows
- different degrees of the polynomial kernel
- different features and feature combinations
Individual Classifiers
- All classifiers are based on SVMs
- Feature types: lexical features, morphological features, orthographic features, surface word features
- Tokens and the previously predicted tags are also used as features
Features Used
- Tokens: words in the training data, i.e. the token to be classified and the preceding and following tokens, as specified by the context window.
- Previously Predicted Tags: predicted tags of the preceding tokens, as specified by the context window.
Features Used (Cont.)
Lexical Features: represent grammatical functions of tokens.
- Part of Speech: tags from the Penn Treebank Project, added using the Geniatagger. Ex: adverb, determiner, adjective.
- Phrase Tag: phrasal categories added using an SVM trained on newswire data. Ex: noun phrase, verb phrase, adjective phrase.
- Base Noun Phrase Tag: basic noun phrases are tagged using the fnTBL tagger.
Features Used: Morphological
Different n-grams of the current token. An n-gram of a token is formed from the last or first n characters of the token (last 1/2/3/4 letters, first 1/2/3/4 letters).

Example: TRANSCRIPTION
  n        1    2     3     4
  Suffix   n    on    ion   tion
  Prefix   t    tr    tra   tran
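A small sketch of this feature extraction (the helper name is illustrative):

```python
def char_ngrams(token, n_max=4):
    """Prefix and suffix character n-grams of a token, n = 1..n_max."""
    t = token.lower()
    return {
        "prefix": [t[:n] for n in range(1, n_max + 1)],
        "suffix": [t[-n:] for n in range(1, n_max + 1)],
    }

print(char_ngrams("TRANSCRIPTION"))
# prefixes: ['t', 'tr', 'tra', 'tran'], suffixes: ['n', 'on', 'ion', 'tion']
```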
Features Used: Orthographic
Also known as word formation patterns: information about the form of the word (e.g., contains uppercase letters, digits, etc.).
Two different approaches are used:
- Simple: the existence of a particular word formation pattern is represented by a binary (yes/no) feature.
- Intricate: multiple word formation patterns are represented using a list ordered by representation score.
Features Used: Orthographic (Cont.)
Orthographic Feature, Intricate Approach: a list of word formation patterns is formed in decreasing order of representation score. The representation score of an orthographic property i for tokens labeled with entity j is

  RS(i, j) = (number of tokens with orthographic property i) / (number of tokens constituting entity j)

Orthographic features that have a representation score of more than 10% for Outside-tagged tokens are eliminated from the list.
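A sketch of computing such a representation score for one property. `has_digit` is an illustrative stand-in for an orthographic property, and the 10% Outside-class filter described above would be applied on top of these scores:

```python
from collections import Counter

def has_digit(tok):                      # an example orthographic property
    return any(ch.isdigit() for ch in tok)

def representation_score(tokens, tags, prop):
    """For each class j, the fraction of its tokens exhibiting `prop`,
    i.e. RS(i, j) with the property i fixed to the given predicate."""
    total, with_prop = Counter(), Counter()
    for tok, tag in zip(tokens, tags):
        cls = tag.split("-", 1)[-1]      # B-protein / I-protein -> protein
        total[cls] += 1
        if prop(tok):
            with_prop[cls] += 1
    return {cls: with_prop[cls] / total[cls] for cls in total}

tokens = ["IL-2", "expression", "NF-kappa", "B", "the"]
tags = ["B-protein", "O", "B-protein", "I-protein", "O"]
print(representation_score(tokens, tags, has_digit))
# protein: 1/3 (only "IL-2" contains a digit); O: 0.0
```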
Features Used: Orthographic (Cont.)
Orthographic features used:

Word Formation Pattern   Example            Word Formation Pattern   Example
UpperCase                IL-2               Upper_and_Other          2-M
InitCap                  D3                 Lower_and_Upper          25-Dihydroxyvitamin
TwoUpper                 FasL               Upper_and_Digits         AP-1
Alpha_and_Other          AML1/ETO           Lower_and_Other          dehydratase/dimerization
Hyphen                   product-albumin    Allupper                 DNA, GR, T
Upper_or_Digit           3H                 Greek                    NF-Kappa, beta
Digits                   40                 Lower_and_Digits         gp39
Alpha_and_Digit          IL-1beta           Start_with_Hyphen        -mediated
Features Used: Orthographic (Cont.)
Intricate use of the orthographic feature, Priority Based: each token is tagged with the first applicable word formation pattern on the list.
Example:
  Ca2+                  Initial letter capitalized
  D3                    Initial letter capitalized
  GR                    All letters uppercase
  -acetate              Starts with "-"
  25-Dihydroxyvitamin   Contains upper and other
Features Used: Orthographic (Cont.)
Intricate use of the orthographic feature, Binary String: a binary string containing one bit to represent each word formation pattern in the list.
Example:
  Ca2+                  0010111100111110
  D3                    0011001100000110
  GR                    1111000100000000
  -acetate              0000000011110000
  25-Dihydroxyvitamin   0000111101111110
Bit positions represent patterns such as: initial letter capitalized; combination of upper-case letters and other symbols; combination of upper- and lower-case letters; combination of upper-case letters and numbers; contains an upper-case letter; combination of lower-case letters and other symbols; combination of alphabetic characters and other symbols; combination of lower-case letters and numbers; combination of alphabetic characters and numbers; contains a number.
Features Used (Cont.)
Surface Words: a separate pseudo-dictionary for each entity, containing the tokens with the highest counts in the training data, such that x% of all tokens in that entity's names are in the dictionary. Pseudo-dictionaries with 50%, 60%, 70% and 80% coverage were built. Each token is tagged with a 5-bit string where each bit corresponds to the pseudo-dictionary of one entity.
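The coverage rule can be sketched as follows. This is a simplification with illustrative names; in the thesis the token counts come from each entity's names in the training data:

```python
from collections import Counter

def pseudo_dictionary(entity_tokens, coverage=0.7):
    """Greedily take the most frequent tokens until the chosen set
    covers `coverage` of all token occurrences in the entity's names."""
    counts = Counter(entity_tokens)
    total = sum(counts.values())
    dictionary, covered = set(), 0
    for tok, c in counts.most_common():
        if covered / total >= coverage:
            break
        dictionary.add(tok)
        covered += c
    return dictionary

toks = ["cell", "cell", "cell", "T", "T", "line", "Jurkat", "HeLa", "B", "line"]
print(pseudo_dictionary(toks, coverage=0.5))
# → {'cell', 'T'}  (3 + 2 of 10 occurrences reach 50% coverage)
```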
Effect of Feature Extraction
- Each feature type improves performance from a different perspective: precision, recall, boundaries, entity-based performance
- Careful combination of features improves the overall performance
Effect of Parse Direction and Lexical Features
Effect of backward parsing:
- Precision and recall increased for both boundaries
- Precision scores improved more than recall scores
- An overall increase in full recall, precision and F-score
Effect of lexical features:
- Single lexical features: higher precision than recall
- Combinations: recall and precision values are more balanced
- Combinations slightly improve both the left-boundary and right-boundary F-scores
Effect of Morphological Features
- F-score improves compared to the baseline system
- Suffixes alone result in higher recall than precision
- Prefixes alone result in higher precision than recall
- Combination improves the overall performance
- The morphological feature improves recall but degrades precision compared to the baseline
Effect of Orthographic Features
- Performance is improved by all orthographic features
- Best performance is achieved by the binary string
- For simple orthographic features, precision scores are slightly higher than recall scores
- Intricate orthographic features provide higher recall values, resulting in an overall improvement in F-scores
Effect of Surface Word Feature
- Precision scores improved more than recall scores compared to the baseline classifier
- Improvement on the right boundary is more pronounced
- Precision is greater than recall: pseudo-dictionaries can be used to generate classifiers with higher precision than recall
Effect of Feature Combinations
- Some specific combinations do not yield a significant improvement in performance
- Careful combination of features is useful for improving the overall performance
- Different combinations of feature/parameter sets favor different entities
Motivation for Multiple Classifier Systems
- For individual classifiers, a set of carefully engineered features improves performance; unfortunately, performance is still not satisfactory
- Combining multiple classifiers into ensembles: the combined opinion of a number of experts is more likely to be correct than that of a single expert
Classifier Pool
- Classifiers exploiting state-of-the-art feature sets (highest F-scores)
- Classifiers with high precision but low recall, and vice versa
- One or more classifiers providing the highest F-score for each entity
Classifier Fusion Architecture
[Architecture diagram.
Training phase: training data -> feature extraction (feature sets 1..M, plus a dictionary and context words) -> SVM classifiers 1..M -> GA-based classifier selection -> best fitting ensemble.
Testing phase: test data -> feature extraction (feature sets 1..M) -> SVM classifiers 1..M -> classifier fusion using the best fitting ensemble -> post-processing -> predicted class.]
Fusion Algorithm: Weighted Majority Voting
- The full-object F-score of each classifier on cross-validation data is used as its weight
- Votes are combined as a weighted sum; the class that receives the highest vote wins the competition
- Ties are broken by a random coin toss
Weighted Majority Voting
Weight : Full Object F-score
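The voting rule can be sketched as follows; the weights stand in for the classifiers' full-object F-scores on cross-validation data, and the names are illustrative:

```python
from collections import defaultdict
import random

def weighted_vote(predictions, weights, rng=random.Random(0)):
    """Weighted majority voting: each classifier's vote for its predicted
    class is weighted by its F-score; ties are broken by a random choice."""
    scores = defaultdict(float)
    for pred, w in zip(predictions, weights):
        scores[pred] += w
    best = max(scores.values())
    winners = [cls for cls, s in scores.items() if s == best]
    return rng.choice(winners)

# three classifiers predicting the tag of one token
print(weighted_vote(["B-protein", "O", "B-protein"], [0.64, 0.72, 0.66]))
# → "B-protein"  (0.64 + 0.66 = 1.30 beats 0.72)
```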
Genetic Algorithm Set-Up
- Initial population: randomly generated bit strings
- Population size: 100
- Mutation rate: 2%
- Crossover rate: 70%
- Crossover operators: two-point crossover, uniform crossover
- Tournament size: 40
- Elitist population: 20%
Flow Chart of the Genetic Algorithm
1. Start: initialize the population randomly
2. Compute the fitness of each chromosome
3. If the termination criterion is met, select the best chromosome as the resultant ensemble and stop
4. Otherwise, select parents and apply crossover
5. Mutate the offspring and compute the fitness of each chromosome
6. Apply the elitist policy to form the new population; go to step 3
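The loop above can be sketched in code. This is a minimal illustration, not the thesis implementation: `fitness` stands in for the cross-validation F-score of the ensemble encoded by a bit string, and the toy fitness at the bottom is purely for demonstration. Parameters mirror the slides (population 100, mutation 2%, crossover 70%, tournament 40, elitism 20%):

```python
import random

def ga_select(fitness, n_bits, pop_size=100, gens=50, mut=0.02,
              cx=0.70, tour=40, elite=0.20, seed=0):
    """GA sketch: bit-string chromosomes, tournament selection,
    two-point crossover, bit-flip mutation, elitism."""
    rng = random.Random(seed)
    pop = [tuple(rng.randint(0, 1) for _ in range(n_bits))
           for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        nxt = pop[:int(elite * pop_size)]              # elitist survivors
        while len(nxt) < pop_size:
            p1 = max(rng.sample(pop, tour), key=fitness)
            p2 = max(rng.sample(pop, tour), key=fitness)
            child = list(p1)
            if rng.random() < cx:                      # two-point crossover
                i, j = sorted(rng.sample(range(n_bits), 2))
                child[i:j] = p2[i:j]
            child = [b ^ (rng.random() < mut) for b in child]  # mutation
            nxt.append(tuple(child))
        pop = nxt
    return max(pop, key=fitness)

# toy fitness standing in for ensemble F-score: prefer exactly 3 of 8 classifiers
best = ga_select(lambda c: -abs(sum(c) - 3), n_bits=8, gens=20)
print(sum(best))
```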
Genetic Algorithm Set-Up
- Chromosome: the list of classifiers to be combined
- 3-fold cross-validation results are used for the individual classifiers
- Fitness of a chromosome: full-object F-score of the classifier ensemble
- Static classifier selection: each bit represents a classifier
- Proposed vote-based classifier selection: each bit represents the reliability of a classifier for predicting a class
Chromosome Structure of Static Classifier Selection
With M classifiers the chromosome has M bits, one per classifier, e.g. 0 1 0 1 1 0 0 1 for classifiers 1..M. If a gene is 1, the corresponding classifier participates in the decision for all classes; otherwise it remains silent.
Chromosome Structure for the Proposed Vote-Based Classifier Selection
For each classifier, one gene is reserved per class to represent whether the classifier participates in the decision for that class. With M classifiers and N classes the chromosome has N x M bits, e.g. 0 1 0 1 | 1 1 0 0 | ... giving the genes for classes 1..4 of classifier 1, classifier 2, and so on up to classifier M.
Motivation for Vote-Based Classifier Subset Selection
- A classifier cannot predict all classes with the same performance
- A subset of its predictions may be unreliable
- A subset of its predictions may be correlated with the predictions of other classifiers
- Therefore, allow a classifier to vote only for the classes it trusts
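One way to realize this per-class gating, assuming a mask derived from the vote-based chromosome (all names illustrative):

```python
from collections import defaultdict

def masked_weighted_vote(predictions, weights, mask):
    """Vote-based selection sketch: mask[m][c] == 1 lets classifier m vote
    when its prediction is class c; otherwise that classifier stays silent."""
    scores = defaultdict(float)
    for m, (pred, w) in enumerate(zip(predictions, weights)):
        if mask[m].get(pred, 0):
            scores[pred] += w
    return max(scores, key=scores.get) if scores else "O"

preds = ["B-RNA", "B-RNA", "B-DNA"]
weights = [0.60, 0.62, 0.70]
mask = [{"B-RNA": 1}, {"B-RNA": 0}, {"B-DNA": 1}]   # classifier 2 distrusts its B-RNA votes
print(masked_weighted_vote(preds, weights, mask))
# → "B-DNA"  (0.70 beats the single trusted B-RNA vote of 0.60)
```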
Multiple Classifier Systems Used
- Single Best (SB): not an MCS; included as a reference
- Full Ensemble (FE): ensemble containing all classifiers
- Forward Selection (FS): ensemble formed using forward selection
- Backward Selection (BS): ensemble formed using backward selection
- GA-generated Static Ensemble (GAS): ensemble formed using the GA
- Vote-Based Classifier Subset Selection using the GA (VBS): vote-based ensemble formed using the GA
Performance of Ensembles

                       Precision (%)   Recall (%)   F-score (%)
Single Best            69.40           70.60        69.99
Full Ensemble          72.10           70.42        71.25
Forward Selection      71.64           71.58        71.61
Backward Selection     72.28           71.00        71.63
GA Static Ensemble     71.76           71.65        71.71
Proposed Method (VBS)  71.45           73.60        72.51
Discussion on Ensembles
- All ensembles outperform SB; VBS has the highest F-score; GA-based ensembles are better
- BS chose 38 classifiers; FE and BS are similar: precision >> recall
- FS and GAS chose 9 classifiers; precision and recall are more balanced
- VBS is different: it uses 46 classifiers partially, and recall > precision
Discussion on Ensembles
- BS eliminates mainly classifiers using only two features; all eliminated classifiers are backward parsed
- FS and GAS are almost the same: 8 classifiers are identical, and the 9th classifier is forward parsed for GAS. Even though the 9th classifier has a lower F-score, the GAS ensemble achieves a higher F-score
- Backward- and forward-parsed classifiers are more balanced
Entity-Based F-Scores for the Ensembles

                       DNA     RNA     Cell Line   Cell Type   Protein
Single Best            68.75   66.95   51.72       69.87       72.14
Full Ensemble          69.77   68.91   57.43       71.41       72.86
Forward Selection      70.45   67.22   56.02       72.12       73.27
Backward Selection     70.61   68.91   57.06       71.92       73.16
GA Static Ensemble     70.51   69.49   56.32       71.77       73.43
Proposed Method (VBS)  72.06   69.11   58.80       72.79       73.86
Discussion of Entity-Based Scores
- VBS achieves the best scores for all entities except RNA; the GAS ensemble outperforms VBS for RNA
- VBS: highest F-score for protein (the largest data set), lowest F-score for cell line (the smallest data set)
Distribution of Vote Counts for the VBS

Number of votes         0   1   2   3   4   5   6   7   8   9   10   11
Number of classifiers   0   3   5   3   7   8   8   6   0   2    4    0

- None of the classifiers is eliminated from the ensemble
- None of the classifiers votes for all eleven classes
Discussion of Vote Counts
- Each classifier contributes to the decision of at least one class
- Some classifiers contribute for almost all classes
- 7 of the 9 classifiers selected by GAS vote for more than 5 classes in VBS
- Classifiers that have only 1 vote in VBS are excluded by GAS
Post-Processing Approach Used
- Inconsistencies in tagging, mainly induced by ensembling, are fixed
- Boundaries are extended using separate context-word lists for each entity
- A dictionary formed from the training data retags mistagged or untagged entities

Example (inconsistent tag correction):
Before: from/O B/B-cell_line or/I-cell_line HeLa/I-cell_line cells/I-cell_type stimulate/O transcription/I-protein of/O the/O
After:  from/O B/B-cell_line or/I-cell_line HeLa/I-cell_line cells/I-cell_line stimulate/O transcription/O of/O the/O

Example (boundary extension; "region" is in the right context of DNA):
Before: Different/O fragments/O of/O unrearranged/B-DNA human/I-DNA variable/I-DNA region/O of/O
After:  Different/O fragments/O of/O unrearranged/B-DNA human/I-DNA variable/I-DNA region/I-DNA of/O
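The right-boundary extension rule can be sketched as follows; the context lists and names are illustrative, not the thesis's actual lists:

```python
def extend_right(tokens, tags, right_context):
    """Right-boundary extension sketch: a token immediately following an
    entity is absorbed into it if the token appears in that entity type's
    right-context word list."""
    out = list(tags)
    for i in range(1, len(tags)):
        prev = out[i - 1]
        if out[i] == "O" and prev != "O":
            etype = prev.split("-", 1)[1]
            if tokens[i] in right_context.get(etype, ()):
                out[i] = "I-" + etype
    return out

tokens = ["unrearranged", "human", "variable", "region", "of"]
tags = ["B-DNA", "I-DNA", "I-DNA", "O", "O"]
print(extend_right(tokens, tags, {"DNA": {"region"}}))
# → ['B-DNA', 'I-DNA', 'I-DNA', 'I-DNA', 'O']
```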
Effect of Post-Processing on VBS

                                     Improvement   F-score (%)
VBS                                  -             72.51
VBS + inconsistent tag correction    +0.08         72.59
VBS + right boundary correction      +0.14         72.65
VBS + left boundary correction       +0.01         72.52
VBS + dictionary lookup              +0.08         72.59
VBS + all post-processing rules      +0.23         72.74
Effect of Post-Processing on Individual Entities

                         DNA     RNA     Cell Line   Cell Type   Protein
Before post-processing   72.06   69.11   58.80       72.79       73.86
After post-processing    72.19   71.55   59.96       73.36       73.87
Improvement              0.13    2.44    1.16        0.57        0.01
Discussion of Results
Post-processing provides more success for entities having:
- lower F-scores
- lower representation
- longer names
Future Work
- The post-processing stage may be improved: external resources, more efficient normalization and lookup approaches, discovering post-processing rules through AI techniques
- Different classifier architectures (CRF, HMM) may be used to increase diversity
Future Work
- The GA may employ a number of different evaluation metrics
- Different ensembling strategies may be employed: stacked generalization, bagging/boosting
Questions?
Appendix
Baseline System
Features used:
- Token to be classified
- Context window of -2..2: determines the preceding and following tokens and the preceding predictions used as features
- 2nd-degree polynomial kernel
- Forward parse direction
- One-vs-all method

Object Identification Scores of the Baseline System

        Full                           Left                           Right
        Recall   Precision   F-score   Recall   Precision   F-score   Recall   Precision   F-score
        0.6275   0.6574      0.6421    0.6669   0.6987      0.6825    0.7087   0.7425      0.7252
Discussion of Experiments
- All classifiers presented are trained using the one-vs-all approach, varying backward/forward parse direction, context windows and feature combinations
- Presented results are averages over the classifiers in a category
Training Individual Classifiers
All individual classifiers are trained using the one-vs-all approach, varying: backward/forward parse direction, different context windows, different polynomial kernels, different feature combinations.
Effect of Parse Direction

Parse       Full                           Left                           Right
Direction   Recall   Precision   F-score   Recall   Precision   F-score   Recall   Precision   F-score
Forward     65.21    65.30       65.23     69.18    69.28       69.20     72.92    73.02       72.94
Backward    66.85    68.13       67.45     70.51    71.87       71.15     74.93    76.38       75.61

- Precision and recall increased for both boundaries
- Precision scores improved more than recall scores
- An overall increase in full recall, precision and F-score
Effect of Lexical Features

Lexical    Full                           Left                           Right
Features   Recall   Precision   F-score   Recall   Precision   F-score   Recall   Precision   F-score
Single     63.46    67.51       65.41     67.24    71.54       69.31     71.00    75.53       73.18
Combined   65.53    66.63       66.07     69.46    70.62       70.03     73.08    74.30       73.68

- Single lexical features: higher precision than recall
- Combinations: recall and precision values are more balanced
- Combinations slightly improve both the left-boundary and right-boundary F-scores
Effect of Morphological Features

Morph.     Full                           Left                           Right
Features   Recall   Precision   F-score   Recall   Precision   F-score   Recall   Precision   F-score
Baseline   62.75    65.74       64.21     66.69    69.87       68.25     70.87    74.25       72.52
Prefix     64.48    65.13       64.79     68.43    69.12       68.76     72.51    73.23       72.86
Suffix     65.47    64.84       65.14     69.36    68.68       69.01     73.63    72.93       73.27
Combined   65.75    65.04       65.39     69.68    68.93       69.30     73.89    73.10       73.48
Effect of Morphological Features
- F-score improves compared to the baseline system
- Suffixes alone result in higher recall than precision
- Prefixes alone result in higher precision than recall
- Combination improves the overall performance
- The morphological feature improves recall but degrades precision compared to the baseline
Effect of Orthographic Features

Orthog.    Full                           Left                           Right
Features   Recall   Precision   F-score   Recall   Precision   F-score   Recall   Precision   F-score
Baseline   62.75    65.74       64.21     66.69    69.87       68.25     70.87    74.25       72.52
Simple     65.69    65.93       65.79     69.53    69.79       69.64     73.97    74.25       74.09
Priority   68.02    65.62       66.80     71.91    69.36       70.61     76.06    73.37       74.68
Binary     68.48    65.46       66.94     72.26    69.07       70.63     76.58    73.20       74.85

- Performance is improved by all orthographic features
- Best performance is achieved by the binary string
Effect of Orthographic Features (Cont.)
- For simple orthographic features, precision scores are slightly higher than recall scores
- The simple orthographic feature degrades precision on the left boundary only
- Intricate approaches degrade precision on both the left and right boundary as well as on full-object recognition
- Intricate orthographic features provide higher recall values, resulting in an overall improvement in F-scores
- Intricate orthographic features result in an imbalance between precision and recall
Effect of Surface Word Feature

Pseudo-dict.   Full                           Left                           Right
size           Recall   Precision   F-score   Recall   Precision   F-score   Recall   Precision   F-score
Baseline       62.75    65.74       64.21     66.69    69.87       68.25     70.87    74.25       72.52
50%            63.24    66.99       65.06     67.02    70.98       68.94     71.53    75.77       73.58
60%            63.39    67.36       65.31     67.14    71.34       69.18     71.68    76.16       73.85
70%            63.48    67.65       65.50     67.23    71.64       69.36     71.74    76.45       74.02
80%            62.71    66.98       64.77     66.50    71.02       68.68     70.90    75.72       73.23
Effect of Surface Word Feature
- Precision scores improved more than recall scores compared to the baseline classifier
- Improvement on the right boundary is more pronounced
- Precision is greater than recall: pseudo-dictionaries can be used to generate classifiers with higher precision than recall
Effect of Feature Combinations
Full object identification scores:

Lexical   Morphological   Surface words   Orthographic   Recall   Precision   F-score
-         -               -               -              0.6275   0.6574      0.6421
x         -               -               -              0.6399   0.6729      0.6558
x         x               -               -              0.6744   0.6682      0.6712
x         x               x               -              0.6844   0.6669      0.6755
x         x               x               x              0.7006   0.6791      0.6897

- Some specific combinations do not yield a significant improvement in performance
- Careful combination of features is useful
Additional Material
Evaluation
[Venn diagram: NEs in the corpus vs. NEs predicted by the classifier. The overlap is TP; corpus-only NEs are FN; predicted-only NEs are FP; everything else is TN.]
Evaluation
- Precision: the ratio of correctly identified NEs to the number of NEs identified by the system
- Recall: the ratio of correctly identified NEs to the number of NEs in the corpus

  precision p = tp / (tp + fp)
  recall r = tp / (tp + fn)
Evaluation
The F-score is based on precision and recall; the F1-score is the harmonic mean of the two:

  F1 = 2pr / (p + r)
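A small helper putting these definitions together, given entity-level TP/FP/FN counts (the counts below are made-up illustration values):

```python
def prf(tp, fp, fn):
    """Entity-level precision, recall and F1 from TP/FP/FN counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

print(prf(tp=60, fp=20, fn=40))
# precision 0.75, recall 0.60, F1 ≈ 0.667
```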
Post-Processing Rules
- Inconsistent tag correction
- Boundary extension
- Dictionary look-up