
Named-Entity Recognition with Character-Level Models

Dan Klein, Joseph Smarr, Huy Nguyen, and Christopher D. Manning

Stanford University

CoNLL-2003: Seventh Conference on Natural Language Learning

klein@cs.stanford.edu jsmarr@stanford.edu htnguyen@stanford.edu manning@cs.stanford.edu


Unknown Words are a Central Challenge for NER

Recognizing known named entities (NEs) is relatively simple and accurate

Recognizing novel NEs requires recognizing context and/or word-internal features

External context and frequent internal words (e.g., “Inc.”) are the most commonly used features

The internal composition of NEs alone provides surprisingly strong evidence for classification (Smarr & Manning, 2002): e.g., Staffordshire, Abdul-Karim al-Kabariti, CentrInvest


Are Names Self-Describing?

NO: names can be opaque/ambiguous

Word-Level: “Washington” occurs as LOC, PER, and ORG

Char-Level: “-ville” suggests LOC, but there are exceptions like “Neville”

YES: names can be highly distinctive/descriptive

Word-Level: “National Bank” is a bank (i.e., ORG)

Char-Level: “Cotramoxazole” is clearly a drug name

Question: Overall, how informative are names alone?


How Internally Descriptive are Isolated Named Entities?

Classification accuracy of pre-segmented CoNLL NEs without context is ~90%

Using character n-grams as features instead of words yields a 25% error reduction

On single-word unknown NEs, the word model is at chance; the char n-gram model fixes 38% of its errors
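The 25% figure is the relative reduction in error rate between the two accuracies charted below; as a quick worked check (my arithmetic, using the numbers from this slide):

```latex
% Relative error reduction, all NEs:
% word accuracy 89.1%, char n-gram accuracy 91.8%
\frac{(100 - 89.1) - (100 - 91.8)}{100 - 89.1}
  = \frac{10.9 - 8.2}{10.9}
  \approx 0.25
```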

[Figure: NE classification accuracy (%) on the isolated-NE task (not the CoNLL task)]

                    Words   Char N-Grams
All NEs              89.1           91.8
Single-word UNKs     37.5           60.7


Exploiting Word-Internal Features

Many existing systems use some word-internal features (suffix, capitalization, punctuation, etc.)

e.g., Mikheev 1997, Wacholder et al. 1997, Bikel et al. 1997

These features are usually language-dependent (e.g., morphology)

Our approach: use char n-grams as primary representation

Use all substrings as classification features:

Char n-grams subsume word features

Features are language-independent (assuming the language is alphabetic)

Similar in spirit to Cucerzan and Yarowsky (1999), but uses ALL char n-grams vs. just prefix/suffix

“Tom” → #Tom# → #Tom#, #Tom, Tom#, #To, Tom, om#, #T, To, om, m#, T, o, m
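A minimal sketch of this feature extraction, assuming the boundary-marked substring scheme shown above (the helper name and the choice to drop marker-only substrings are mine):

```python
def char_ngrams(word):
    """All character substrings of a word, with '#' marking word boundaries.

    Hypothetical helper reproducing the slide's #Tom# example; the paper's
    exact feature set may differ (e.g., it could cap the n-gram length).
    """
    s = "#" + word + "#"
    grams = set()
    for n in range(1, len(s) + 1):          # every substring length
        for i in range(len(s) - n + 1):     # every start position
            g = s[i:i + n]
            if g.strip("#"):                # skip marker-only substrings
                grams.add(g)
    return grams

print(sorted(char_ngrams("Tom"), key=len, reverse=True))
# 13 features: #Tom#, #Tom, Tom#, #To, Tom, om#, #T, To, om, m#, T, o, m
```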


Character-Feature Based Classifier

Model I: independent classification at each word (see the sketch after the table below)

Maxent classifiers, trained using conjugate gradient

Equal-scale Gaussian priors for smoothing

Trained models with >800K features in ~2 hrs

POS tags and contextual features complement n-grams

Description         Added Features                              Overall F1 (English Dev.)
Words               w0                                          52.29
Official Baseline   -                                           71.18
Char N-Grams        n(w0)                                       73.10
POS Tags            t0                                          74.17
Simple Context      w-1, w0, t-1, t1                            82.39
More Context        ‹w-1, w0›, ‹w0, w1›, ‹t-1, t0›, ‹t0, w1›    83.09
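As a rough sketch of Model I in modern terms (not the authors' implementation): scikit-learn's LogisticRegression is a maxent model, and its L2 penalty plays the role of the equal-scale Gaussian prior; the toy words and labels here are illustrative stand-ins for CoNLL data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for CoNLL training data
train_words  = ["Staffordshire", "Neville", "Washington", "Cotramoxazole"]
train_labels = ["LOC", "PER", "LOC", "MISC"]

model = make_pipeline(
    # "char_wb" pads each word with spaces, much like the '#' boundary markers
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 5), lowercase=False),
    # Logistic regression = maxent; the L2 penalty acts as the Gaussian prior
    LogisticRegression(C=1.0, max_iter=1000),
)
model.fit(train_words, train_labels)

# Char n-gram features generalize to words never seen in training
print(model.predict(["Leicestershire"]))
```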


Character-Based CMM

Model II: joint classification along the sequence

Previous classification decisions are clearly relevant: “Grace Road” is a single location, not a person + location

Include neighboring classification decisions as features

Perform joint inference across the chain of classifiers

Conditional Markov Model (CMM, a.k.a. maxent Markov model): Borthwick 1999, McCallum et al. 2000
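A greedy sketch of the sequence model's feature flow, assuming a left-to-right pass where each decision sees the previous predicted state (the paper performs joint inference over the chain, e.g., beam search; `classify` below is a hypothetical stand-in for a trained maxent classifier):

```python
def classify(features):
    # Toy rule standing in for P(state | features); a real maxent model
    # would score every state from learned feature weights.
    if features["prev_state"] == "LOC" and features["word"][0].isupper():
        return "LOC"        # e.g., "Road" after "Grace" stays part of the LOC
    return "LOC" if features["word"][0].isupper() else "O"

def greedy_decode(words):
    states = []
    for i, word in enumerate(words):
        features = {
            "word": word,
            "prev_state": states[i - 1] if i > 0 else "START",  # the s-1 feature
        }
        states.append(classify(features))
    return states

print(greedy_decode(["Grace", "Road", "hosted", "the", "match"]))
# ['LOC', 'LOC', 'O', 'O', 'O']
```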


Character-Based CMM

Final extra features:

Letter-type patterns for each word: United → Xx, 12-month → d-x, etc. (see the sketch after this list)

Conjunction features, e.g., previous state and current signature

Repeated last words of multi-word names, e.g., Jones after having seen Doug Jones

… and a few more
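A small sketch of the letter-type pattern, assuming runs of the same symbol collapse to one occurrence (this reproduces the two examples above; the paper's exact scheme may differ):

```python
import re

def word_signature(word):
    """Map uppercase -> X, lowercase -> x, digits -> d; keep other characters."""
    sig = "".join(
        "X" if c.isupper() else "x" if c.islower() else "d" if c.isdigit() else c
        for c in word
    )
    return re.sub(r"(.)\1+", r"\1", sig)  # collapse runs: "Xxxxxx" -> "Xx"

print(word_signature("United"))    # Xx
print(word_signature("12-month"))  # d-x
```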

Description                  Added Features                              Overall F1 (English Dev.)
More Context                 ‹w-1, w0›, ‹w0, w1›, ‹t-1, t0›, ‹t0, w1›    83.09
Simple Sequence              s-1, ‹s-1, t-1, t0›                         85.44
More Sequence                ‹s-2, s-1›, ‹s-2, s-1, t-1, t0›             87.21
Final misc. extra features   (listed above)                              92.27


Final Results

The drop from English dev to test is largely due to inconsistent labeling

The lack of capitalization cues in German hurts recall more, because the maxent classifier is precision-biased when the evidence is weak

[Figure: precision, recall, and F1 by dataset; the recoverable F1 values are below]

            F1
Eng Dev   92.27
Eng Test  86.31
Ger Dev   67.03
Ger Test  71.90


Conclusions

Character substrings are valuable and underexploited model features

Named entities are internally quite descriptive: 25-30% error reduction vs. word-level models

Discriminative maxent models allow productive feature engineering: 30% error reduction vs. the basic model

What distinguishes our approach?

More and better features

Regularization is crucial for preventing overfitting
