
Named-Entity Recognition with Character-Level Models

Dan Klein, Joseph Smarr, Huy Nguyen, and Christopher D. Manning

Stanford University

CoNLL-2003: Seventh Conference on Natural Language Learning

klein@cs.stanford.edu jsmarr@stanford.edu htnguyen@stanford.edu manning@cs.stanford.edu


Unknown Words are a Central Challenge for NER

Recognizing known named entities (NEs) is relatively simple and accurate

Recognizing novel NEs requires recognizing context and/or word-internal features

External context and frequent internal words (e.g., “Inc.”) are the most commonly used features

The internal composition of NEs alone provides surprisingly strong evidence for classification (Smarr & Manning, 2002): e.g., Staffordshire, Abdul-Karim al-Kabariti, CentrInvest


Are Names Self-Describing?

NO: names can be opaque/ambiguous

Word-Level: “Washington” occurs as LOC, PER, and ORG

Char-Level: “-ville” suggests LOC, but there are exceptions like “Neville”

YES: names can be highly distinctive/descriptive

Word-Level: “National Bank” is a bank (i.e., ORG)

Char-Level: “Cotramoxazole” is clearly a drug name

Question: Overall, how informative are names alone?


How Internally Descriptive are Isolated Named Entities?

Classification accuracy of pre-segmented CoNLL NEs without context is ~90%

Using character n-grams as features instead of words yields a 25% error reduction

On single-word unknown NEs, the word model is at chance; the char n-gram model fixes 38% of its errors
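The 25% figure is the relative reduction in error rate between the two accuracies charted below; as a quick worked check (my arithmetic, using the numbers from this slide):

```latex
% Relative error reduction, all NEs:
% word accuracy 89.1%, char n-gram accuracy 91.8%
\frac{(100 - 89.1) - (100 - 91.8)}{100 - 89.1}
  = \frac{10.9 - 8.2}{10.9}
  \approx 0.25
```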

[Figure: NE classification accuracy (%) on the isolated-NE task (not the CoNLL task)]

                    Words   Char N-Grams
All NEs              89.1           91.8
Single-word UNKs     37.5           60.7


Exploiting Word-Internal Features

Many existing systems use some word-internal features (suffix, capitalization, punctuation, etc.)

e.g., Mikheev 1997, Wacholder et al. 1997, Bikel et al. 1997

These features are usually language-dependent (e.g., morphology)

Our approach: use char n-grams as primary representation

Use all substrings as classification features:

Char n-grams subsume word features

Features are language-independent (assuming the language is alphabetic)

Similar in spirit to Cucerzan and Yarowsky (1999), but uses ALL char n-grams vs. just prefix/suffix

“Tom” → #Tom# → #Tom#, #Tom, Tom#, #To, Tom, om#, #T, To, om, m#, T, o, m
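A minimal sketch of this feature extraction, assuming the boundary-marked substring scheme shown above (the helper name and the choice to drop marker-only substrings are mine):

```python
def char_ngrams(word):
    """All character substrings of a word, with '#' marking word boundaries.

    Hypothetical helper reproducing the slide's #Tom# example; the paper's
    exact feature set may differ (e.g., it could cap the n-gram length).
    """
    s = "#" + word + "#"
    grams = set()
    for n in range(1, len(s) + 1):          # every substring length
        for i in range(len(s) - n + 1):     # every start position
            g = s[i:i + n]
            if g.strip("#"):                # skip marker-only substrings
                grams.add(g)
    return grams

print(sorted(char_ngrams("Tom"), key=len, reverse=True))
# 13 features: #Tom#, #Tom, Tom#, #To, Tom, om#, #T, To, om, m#, T, o, m
```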


Character-Feature Based Classifier

Model I: independent classification at each word (see the sketch after the table below)

Maxent classifiers, trained using conjugate gradient

Equal-scale Gaussian priors for smoothing

Trained models with >800K features in ~2 hrs

POS tags and contextual features complement n-grams

Description         Added Features                              Overall F1 (English Dev.)
Words               w0                                          52.29
Official Baseline   -                                           71.18
Char N-Grams        n(w0)                                       73.10
POS Tags            t0                                          74.17
Simple Context      w-1, w0, t-1, t1                            82.39
More Context        ‹w-1, w0›, ‹w0, w1›, ‹t-1, t0›, ‹t0, w1›    83.09
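As a rough sketch of Model I in modern terms (not the authors' implementation): scikit-learn's LogisticRegression is a maxent model, and its L2 penalty plays the role of the equal-scale Gaussian prior; the toy words and labels here are illustrative stand-ins for CoNLL data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for CoNLL training data
train_words  = ["Staffordshire", "Neville", "Washington", "Cotramoxazole"]
train_labels = ["LOC", "PER", "LOC", "MISC"]

model = make_pipeline(
    # "char_wb" pads each word with spaces, much like the '#' boundary markers
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 5), lowercase=False),
    # Logistic regression = maxent; the L2 penalty acts as the Gaussian prior
    LogisticRegression(C=1.0, max_iter=1000),
)
model.fit(train_words, train_labels)

# Char n-gram features generalize to words never seen in training
print(model.predict(["Leicestershire"]))
```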


Character-Based CMM

Model II: joint classification along the sequence

Previous classification decisions are clearly relevant: “Grace Road” is a single location, not a person + location

Include neighboring classification decisions as features

Perform joint inference across the chain of classifiers

Conditional Markov Model (CMM, a.k.a. maxent Markov model): Borthwick 1999, McCallum et al. 2000
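A greedy sketch of the sequence model's feature flow, assuming a left-to-right pass where each decision sees the previous predicted state (the paper performs joint inference over the chain, e.g., beam search; `classify` below is a hypothetical stand-in for a trained maxent classifier):

```python
def classify(features):
    # Toy rule standing in for P(state | features); a real maxent model
    # would score every state from learned feature weights.
    if features["prev_state"] == "LOC" and features["word"][0].isupper():
        return "LOC"        # e.g., "Road" after "Grace" stays part of the LOC
    return "LOC" if features["word"][0].isupper() else "O"

def greedy_decode(words):
    states = []
    for i, word in enumerate(words):
        features = {
            "word": word,
            "prev_state": states[i - 1] if i > 0 else "START",  # the s-1 feature
        }
        states.append(classify(features))
    return states

print(greedy_decode(["Grace", "Road", "hosted", "the", "match"]))
# ['LOC', 'LOC', 'O', 'O', 'O']
```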


Character-Based CMM

Final extra features:

Letter-type patterns for each word: United → Xx, 12-month → d-x, etc. (see the sketch after this list)

Conjunction features, e.g., previous state and current signature

Repeated last words of multi-word names, e.g., Jones after having seen Doug Jones

… and a few more
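A small sketch of the letter-type pattern, assuming runs of the same symbol collapse to one occurrence (this reproduces the two examples above; the paper's exact scheme may differ):

```python
import re

def word_signature(word):
    """Map uppercase -> X, lowercase -> x, digits -> d; keep other characters."""
    sig = "".join(
        "X" if c.isupper() else "x" if c.islower() else "d" if c.isdigit() else c
        for c in word
    )
    return re.sub(r"(.)\1+", r"\1", sig)  # collapse runs: "Xxxxxx" -> "Xx"

print(word_signature("United"))    # Xx
print(word_signature("12-month"))  # d-x
```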

Description                  Added Features                              Overall F1 (English Dev.)
More Context                 ‹w-1, w0›, ‹w0, w1›, ‹t-1, t0›, ‹t0, w1›    83.09
Simple Sequence              s-1, ‹s-1, t-1, t0›                         85.44
More Sequence                ‹s-2, s-1›, ‹s-2, s-1, t-1, t0›             87.21
Final misc. extra features   (listed above)                              92.27


Final Results

The drop from English dev to test is largely due to inconsistent labeling

The lack of capitalization cues in German hurts recall more, because the maxent classifier is precision-biased when the evidence is weak

[Figure: precision, recall, and F1 by dataset; the recoverable F1 values are below]

            F1
Eng Dev   92.27
Eng Test  86.31
Ger Dev   67.03
Ger Test  71.90


Conclusions

Character substrings are valuable and underexploited model features

Named entities are internally quite descriptive: 25-30% error reduction vs. word-level models

Discriminative maxent models allow productive feature engineering: 30% error reduction vs. the basic model

What distinguishes our approach?

More and better features

Regularization is crucial for preventing overfitting
