
Extracting Personal Names from Email: Applying Named Entity Recognition to Informal Text

Einat Minkov & Richard C. Wang
Language Technologies Institute

William W. Cohen
Center for Automated Learning and Discovery

School of Computer Science
Carnegie Mellon University

October 7, 2005 CMU School of Computer Science

What is an informal text?

• A text that is…
  – Written for a narrow audience
    • Group/task-specific abbreviations often used
    • Not self-contained (context shared by a related group of people)
  – Not carefully prepared
    • Contains grammatical and spelling errors
    • Does not follow capitalization conventions

• Some examples are…
  – Instant messages
  – Newsgroup postings
  – Email messages


Objective / Outline

• Investigate named entity recognition (NER) for informal text
  – Conduct experiments on recognizing personal names in email
    • Examine indicative features in email and newswire
    • Suggest specialized features for email
    • Evaluate performance of a state-of-the-art extractor (CRF)
    • Analyze repetition of names in email and newswire
    • Suggest and evaluate a recall-enhancing method that is effective for email


Corpora

• Mgmt corpora – Emails from a management course at CMU in which students form teams to run simulated companies
  – Teams: Each set (train/tune/test) formed by different simulation teams
  – Game: Each set formed by different days during the simulation period

• Enron corpora – Emails from Enron Corporation
  – Meetings: Each set formed by randomly selected meeting-related emails
  – Random: Each set formed by repeatedly sampling a user then sampling an email from that user, both at random

Note: The numbers of words and names refer to the whole annotated corpora


Extraction Method

• Train Conditional Random Fields (CRF) to label and extract personal names
  – A machine-learning based probabilistic approach to labeling sequences of examples

• Learning reduces NER to the task of tagging, or classifying, each word using a set of five tags:
  – Unique: A one-token entity
  – Begin: The first token of a multi-token entity
  – End: The last token of a multi-token entity
  – Inside: Any other token of a multi-token entity
  – Outside: A token that is not part of an entity

Example:
Einat   and      Richard  Wang  met      William  W.      Cohen  today
Unique  Outside  Begin    End   Outside  Begin    Inside  End    Outside
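The five-tag scheme can be illustrated with a minimal sketch. The function name `tag_tokens` and the span representation (token indices, end exclusive) are assumptions made for this illustration and do not come from the slides; whitespace tokenization is also assumed.

```python
def tag_tokens(tokens, entity_spans):
    """Label each token with one of the five tags.

    entity_spans holds (start, end) token indices, end exclusive
    (a hypothetical representation chosen for this sketch).
    """
    tags = ["Outside"] * len(tokens)
    for start, end in entity_spans:
        if end - start == 1:
            tags[start] = "Unique"          # one-token entity
        else:
            tags[start] = "Begin"           # first token of the entity
            tags[end - 1] = "End"           # last token of the entity
            for i in range(start + 1, end - 1):
                tags[i] = "Inside"          # any token in between
    return tags


tokens = "Einat and Richard Wang met William W. Cohen today".split()
spans = [(0, 1), (2, 4), (5, 8)]  # Einat / Richard Wang / William W. Cohen
tags = tag_tokens(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token:8s} {tag}")
```

Run on the example sentence, the sketch yields the same tag sequence as shown on the slide.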


Top Learned Features

Features most indicative of a token being part of a name in a Conditional Random Fields (CRF) extractor

Note: A feature is denoted by its direction (left/right) relative to the focus word, its offset, and its lexical value

[Table: top learned features, in two columns: Newswire (MUC-6) and Email (Mgmt-Game). Legend: In Quoted Excerpt, In Email Signature, Name Titles, Job Titles]

Results show that…
Email and newswire text have very different characteristics


Our Proposed Features

Note: All features are instantiated for the focus word t and for the 3 tokens to the left and right of t


Feature Evaluation

• Entity-level F1 of learned extractor (CRF) using:
  – Basic features (B)
  – Basic and Email features (B+E)
  – Basic and Dictionary features (B+D)
  – All features (B+D+E)

B+D+E
Precision  Recall
93.8       81.3
95.3       87.8
83.6       70.2
83.0       69.4

Results show that…
1) Dictionary and Email features are useful (best when combined)
2) Generally high precision but low recall
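As a reminder of how entity-level scores relate, here is a minimal sketch of precision, recall, and F1 computed over extracted spans. It assumes exact-match scoring (a predicted name counts as correct only if both of its boundaries agree with an annotated name); the slides do not spell out the matching criterion, and the function name, span format, and sample data are hypothetical.

```python
def entity_prf(gold, predicted):
    """Entity-level precision, recall and F1.

    gold and predicted are collections of (doc_id, start, end) spans;
    a prediction is correct only on an exact match (an assumption).
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Made-up spans for illustration only.
gold_spans = {("m1", 0, 1), ("m1", 2, 4), ("m1", 5, 8), ("m2", 3, 4)}
pred_spans = {("m1", 0, 1), ("m1", 2, 4), ("m1", 6, 8)}
p, r, f1 = entity_prf(gold_spans, pred_spans)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In this toy case the extractor is right about most of what it predicts but misses half of the annotated names, which is the same high-precision, low-recall pattern the slide reports.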


What’s Next?

• Previous experiments show high precision but low recall
  – Next goal: Improve recall

• One recall-enhancing method
  – Look for multiple occurrences of names in a corpus

• We conduct experimental studies
  – Examine repetition patterns of names in email and newswire text
  – Examine occurrences of names within a single document and across multiple documents


Doc. Frequency of Names

Percentage of person-name tokens that appear in at most K distinct documents as a function of K

[Figure: x-axis: Document Frequency; y-axis: Percentage]

30% of names in Mgmt-Game appear only in one document

Nearly 80% of names in MUC-6 appear only in one document

About 20% of names in Mgmt-Game appear in 10+ documents

Only 1.3% of names in MUC-6 appear in 10+ documents

Results show that…

Repetition of names across multiple documents is more common in email corpora

unique(A): duplicates removed from set A
df(w): # of documents containing token w

$$F(K) = \frac{\sum_{k=1}^{K} \#\,\mathrm{unique}(w_i : \mathrm{df}(w_i) = k)}{\#\,\mathrm{unique}(w_i : \mathrm{df}(w_i) > 0)}$$
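The cumulative statistic described on this slide (the share of distinct person-name tokens whose document frequency is at most K) can be sketched as follows. The function names and the toy corpus are hypothetical, and each document is assumed to be already reduced to its list of person-name tokens.

```python
from collections import defaultdict


def document_frequency(docs):
    """Map each name token to the number of documents containing it."""
    df = defaultdict(int)
    for name_tokens in docs:
        for w in set(name_tokens):   # count each token once per document
            df[w] += 1
    return dict(df)


def cumulative_fraction(docs, K):
    """Fraction of distinct name tokens appearing in at most K documents."""
    df = document_frequency(docs)
    if not df:
        return 0.0
    return sum(1 for w in df if df[w] <= K) / len(df)


# Toy corpus: one list of person-name tokens per document (made up).
docs = [
    ["einat", "richard"],
    ["richard", "william"],
    ["richard", "william", "cohen"],
]
for K in (1, 2, 3):
    print(K, cumulative_fraction(docs, K))
```

A corpus in which names recur across many documents, as reported for email, would show this curve rising slowly with K; a corpus like newswire would be close to its maximum already at K = 1.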


Single vs. Multiple Documents

We define the following extractors:

1. CRF – baseline trained with all features

2. SDR (Single Document Repetition)
   Rules that extract person-name tokens that appear more than once within a single document; hence an upper bound on recall using only name repetition within a single document

3. MDR (Multiple Document Repetition)
   Rules that extract person-name tokens that appear in more than one document; hence an upper bound on recall using only name repetition across multiple documents

4. SDR+CRF
   Union of extractions by SDR and CRF; hence an upper bound on recall using CRF and name repetition within a single document

5. MDR+CRF
   Union of extractions by MDR and CRF; hence an upper bound on recall using CRF and name repetition across multiple documents
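The extractor definitions above can be read as simple set operations over annotated name-token occurrences. The sketch below is one possible reading under stated assumptions: each document is given as a list of (position, token) pairs for its true person-name tokens, the CRF output is a set of (document index, position) pairs, and all names and data are hypothetical.

```python
from collections import Counter


def sdr_hits(docs):
    """Occurrences of name tokens repeated within their own document."""
    hits = set()
    for d, names in enumerate(docs):
        counts = Counter(w for _, w in names)
        hits |= {(d, pos) for pos, w in names if counts[w] > 1}
    return hits


def mdr_hits(docs):
    """Occurrences of name tokens that appear in more than one document."""
    df = Counter()
    for names in docs:
        df.update({w for _, w in names})   # once per document
    return {(d, pos) for d, names in enumerate(docs)
            for pos, w in names if df[w] > 1}


def recall(hits, docs):
    """Token-level recall: share of true name-token occurrences covered."""
    total = sum(len(names) for names in docs)
    return len(hits) / total if total else 0.0


# Made-up annotated name tokens, as (position, token) per document.
docs = [
    [(0, "einat"), (5, "richard"), (9, "einat")],
    [(2, "richard"), (7, "william")],
    [(1, "cohen")],
]
crf_hits = {(0, 0), (1, 7)}  # hypothetical CRF predictions

print("SDR     ", recall(sdr_hits(docs), docs))
print("MDR     ", recall(mdr_hits(docs), docs))
print("SDR+CRF ", recall(sdr_hits(docs) | crf_hits, docs))
print("MDR+CRF ", recall(mdr_hits(docs) | crf_hits, docs))
```

Because the rules are applied to the annotated name tokens themselves, the resulting figures are upper bounds on recall rather than the output of a usable extractor.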


Single vs. Multiple Documents

Token-level upper bounds on recall and potential recall-gains associated with methods that look for name tokens that re-occur within a single document or across multiple documents

Results show that…
Higher recall and potential recall-gains can be obtained for email corpora using the MDR method

MUC-6 has highest recall-gain using SDR

MUC-6 has highest recall using SDR

MUC-6 has lowest recall using MDR

MUC-6 has lowest recall-gain using MDR


What’s Next?

• Our studies show the potential of exploiting repetition of names over multiple documents for improving recall in email corpora

• We suggest a recall-enhancing method:
  1. Auto-construct a dictionary of predicted names and their variants from the test set
  2. Statistically filter out noisy names from the dictionary
  3. Match names globally from the inferred dictionary onto the test set, exploiting repetition of names

Note: A “dictionary” is simply a list of one or more tokens


Name Dictionary Construction

Every name in the test set predicted by the learned extractor (CRF), trained with all features, is transformed into a set of name variants and inserted into a dictionary

Transformation Example
Name variants of “Benjamin Brown Smith”

Original name is included by default


Name Dictionary Filtering

• Previously constructed dictionary contains noisy names
  – e.g. “brown” can also refer to a color
  – Next goal: Filter out noisy names

• We suggest a filtering scheme to remove every single-token name w from the dictionary when PF.IDF(w) < Θ

PF.IDF: Predicted Frequency × Inverse Document Frequency

cpf(w): # of times w is predicted as a name-token in corpus
ctf(w): # of occurrences of w in corpus
df(w): document frequency of w in corpus
N: # of documents in corpus

Words that get low PF.IDF scores are either highly ambiguous names or very common words in corpus

Θ = 0.16 optimizes entity-level F1 in tune sets; thus, we apply the same threshold to our test sets

Note: “Corpus” mentioned here refers to the test set in our experiments
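The filtering step can be sketched in code. The exact PF.IDF formula did not survive on this slide, so the scoring function below is an assumption: it takes PF(w) = cpf(w) / ctf(w) and a normalized IDF(w) = log((N + 0.5) / df(w)) / log(N + 1), multiplied together. The function names, the statistics, and the whitespace-based test for "single-token" are likewise hypothetical; only the threshold Θ = 0.16 comes from the slide.

```python
import math


def pf_idf(cpf, ctf, df, n_docs):
    """Assumed score: predicted frequency times normalized inverse document frequency."""
    pf = cpf / ctf                                        # how often w is predicted as a name
    idf = math.log((n_docs + 0.5) / df) / math.log(n_docs + 1)
    return pf * idf


def filter_dictionary(names, stats, n_docs, theta=0.16):
    """Drop single-token names scoring below theta; keep multi-token names."""
    kept = []
    for name in names:
        if len(name.split()) == 1:                        # single-token entry (assumption)
            cpf, ctf, df = stats[name]
            if pf_idf(cpf, ctf, df, n_docs) < theta:
                continue
        kept.append(name)
    return kept


# Made-up corpus statistics: token -> (cpf, ctf, df), with N = 100 documents.
stats = {"brown": (2, 40, 30), "benjamin": (9, 10, 5)}
dictionary = ["benjamin brown smith", "brown", "benjamin"]
filtered = filter_dictionary(dictionary, stats, n_docs=100)
print(filtered)
```

With these invented counts, "brown" is rarely predicted as a name and occurs in many documents, so it falls below the threshold, while "benjamin" is kept.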


Name Matching

• A window slides through every token in the test set
• A match occurs when the tokens in a window start with the longest possible name variant in the dictionary
• All matched names are marked for evaluation

Name Matching Example

E-Mail:
I called Benjamin Brown Smith and left a message to send us an e-mail if he could come. I have not received his e-mail yet. He might not be able to come. We may want to postpone until tomorrow morning. Do you still have our class schedule? Please contact benjamin and confirm the meeting. I do not have classes tomorrow morning.

Filtered Dictionary:
…
benjamin brown smith
benjamin-brown smith
benjamin brown-smith
benjamin-brown-smith
benjamin brown s.
benjamin-b. smith
benjamin b. smith
benjamin brown-s.
benjamin-brown s.
benjamin-brown-s
benjamin-b. s.
benjamin-smith
benjamin smith
b. brown smith
benjamin b. s.
b. brown-smith
benjamin-s.
benjamin s.
b. brown s.
b. b. smith
b. brown-s.
benjamin
b. smith
b. b. s.
smith
b. s.
…

Legend: Predicted by CRF; Missed by CRF
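The matching procedure can be sketched as a greedy longest-match scan. The sketch makes several assumptions that the slide does not state: tokens are split on whitespace, comparison is case-insensitive, and the window jumps past a matched name instead of re-scanning inside it. The function name and the shortened dictionary are hypothetical.

```python
def match_names(tokens, dictionary):
    """Return (start, end) spans, end exclusive, of longest dictionary matches."""
    entries = {tuple(name.split()) for name in dictionary}
    max_len = max((len(e) for e in entries), default=0)
    spans = []
    i = 0
    while i < len(tokens):
        # Try the longest window first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            window = tuple(t.lower() for t in tokens[i:i + n])
            if window in entries:
                spans.append((i, i + n))
                i += n              # skip past the matched name (assumption)
                break
        else:
            i += 1                  # no variant starts here; slide by one token
    return spans


tokens = ("I called Benjamin Brown Smith today . "
          "Please contact benjamin and confirm the meeting .").split()
dictionary = ["benjamin brown smith", "benjamin smith", "benjamin", "smith"]
spans = match_names(tokens, dictionary)
print([" ".join(tokens[s:e]) for s, e in spans])
```

The second mention, a lowercase "benjamin", is the kind of occurrence a capitalization-sensitive extractor can miss but a dictionary lookup recovers.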


Experimental Results

Entity-level relative improvements (and final scores) after applying our recall-enhancing method on test sets
– Baseline: learned extractor (CRF) trained with all features

Results show that…
1) Recall improved significantly with a small sacrifice in precision
2) F1 scores improved in all cases


Conclusion

• Email and newswire text have different characteristics

• We suggested a set of specialized features for name extraction on email, exploiting structural regularities in email

• Exploiting name repetition over multiple documents is important for improving recall in email corpora

• We presented the PF.IDF recall-enhancing method that improves recall significantly with a small sacrifice in precision



Thank You!

