Sifter for classifying books dense (or not) in family history information & for selecting 3-page sequences for evaluation
Deryle Lonsdale, 1 Oct. 2013


Sifter

for classifying books dense (or not) in family history information & for selecting

3-page sequences for evaluation

Deryle Lonsdale

1 Oct. 2013

The task

• Develop a data-rich family history text range recognizer
  – Perl
  – Machine learning
  – Mostly OTS components
  – Fully automatic
  – Arbitrary text chunk size

• Evaluate performance

Method

• Document features
  – Language identifier (and confidence)
    • We only want English (for now)
    • Used a pre-existing Perl module (Simões)
  – Type/token ratio
    • We want narrow-domain
  – % FH lexical items
    • We want to prefer FH vocabulary
    • Hand-coded, 49 words (died, married, cremation, etc.)
  – % integer words, % person words, % date words, % organization words, % location words
    • We want it to be data-rich
    • Used Stanford named entity engine
  – Average sentence length
    • Maybe sentences are shorter in FH text??

• One vector (floating-point features) per text chunk (e.g. document)
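
The surface features listed above can be sketched in a few lines. This is a minimal illustration in Python, not the Perl implementation the slides describe. The tokenizer, the sentence splitter, and the function name `features` are assumptions, and `FH_WORDS` holds only the three words named on the slide, not the full 49-word list. The language-identifier and named-entity features are left out because they depend on external tools.

```python
import re

# Hypothetical stand-in for the hand-coded 49-word list; only these three
# words are actually named on the slide.
FH_WORDS = {"died", "married", "cremation"}

def features(text):
    """Return a dict of floating-point features for one text chunk."""
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s for s in re.split(r"[.!?]+\s+", text.strip()) if s]
    # Naive tokenizer: alphabetic words and integer strings, lowercased.
    tokens = re.findall(r"[A-Za-z']+|\d+", text.lower())
    n = len(tokens) or 1  # avoid division by zero on empty chunks
    return {
        "type_token_ratio": len(set(tokens)) / n,
        "pct_fh_words": 100.0 * sum(t in FH_WORDS for t in tokens) / n,
        "pct_integers": 100.0 * sum(t.isdigit() for t in tokens) / n,
        "avg_sentence_len": len(tokens) / (len(sentences) or 1),
    }

vec = features("John Smith died in 1913. He married Mary in 1890.")
```

Each chunk yields one such vector, so the chunk size (page, 3-page sequence, whole document) is just a matter of what text is passed in.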

Evaluation

• Gigaword corpus newswire
  – Associated Press Worldstream articles (Nov. 1994-May 1995)
  – 585 obituaries (192,000 words)
  – 649 non-obituaries (221,000 words, randomly selected from 85,000 articles)

• TiMBL machine learning
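
TiMBL is a memory-based learner: it stores the training vectors and labels a new vector by its nearest stored neighbors. The sketch below shows that general idea as a plain k-nearest-neighbor vote in Python. It is not TiMBL's API, and it uses Euclidean distance with unweighted features, whereas TiMBL has its own distance metrics and feature weighting. The function name, the value of `k`, and the toy vectors are all invented for illustration.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Label `query` by majority vote among its k nearest training vectors.

    train: list of (vector, label) pairs; query: a vector of the same length.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Invented toy data: (% FH lexical items, % integer words) per document.
TOY_TRAIN = [
    ((6.0, 5.0), "obit"),
    ((5.5, 4.0), "obit"),
    ((4.8, 6.1), "obit"),
    ((0.5, 1.0), "nonobit"),
    ((0.2, 2.0), "nonobit"),
    ((1.0, 0.5), "nonobit"),
]

label = knn_classify(TOY_TRAIN, (5.0, 4.5))
```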

Results

F-Score beta=1, microav: 0.939263
F-Score beta=1, macroav: 0.939184
AUC, microav: 0.940449
AUC, macroav: 0.940449
overall accuracy: 0.939222 (1159/1234), of which 128 exact matches

Confusion Matrix:
          nonobit   obit
          --------------
nonobit |     595     54
obit    |      21    564
-*-     |       0      0
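
The summary scores can be recomputed from the confusion matrix alone. The sketch below assumes rows are true classes and columns are predicted classes, which is consistent with the row sums (649 non-obituaries, 585 obituaries). It also assumes that the reported "microav" F-score is the class-frequency-weighted mean of the per-class F-scores and that the AUC is the mean of the two per-class recalls; those readings reproduce the reported figures but are inferred from the numbers, not stated on the slide.

```python
# Rows: true class; columns: predicted class (counts from the slide).
confusion = {
    "nonobit": {"nonobit": 595, "obit": 54},
    "obit": {"nonobit": 21, "obit": 564},
}
classes = list(confusion)
support = {c: sum(confusion[c].values()) for c in classes}  # true counts
total = sum(support.values())

accuracy = sum(confusion[c][c] for c in classes) / total

def f1(c):
    """F-score (beta=1) for class c, treating c as the positive class."""
    tp = confusion[c][c]
    fn = support[c] - tp
    fp = sum(confusion[t][c] for t in classes) - tp
    return 2 * tp / (2 * tp + fp + fn)

macro_f1 = sum(f1(c) for c in classes) / len(classes)
weighted_f1 = sum(support[c] * f1(c) for c in classes) / total
# Mean per-class recall; equals the single-point AUC for two classes.
mean_recall = sum(confusion[c][c] / support[c] for c in classes) / len(classes)

print(f"{accuracy:.6f} {macro_f1:.6f} {weighted_f1:.6f} {mean_recall:.6f}")
```

Under these assumptions the four values agree with the reported accuracy, macro-averaged F-score, micro-averaged F-score, and AUC to six decimal places.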

Feature ranking

• % FH lexical items
• % integers
• % person names
• % dates
• Average sentence length
• Type/token ratio
• % locations
• % organizations

Errors

False positives

• Articles about people perishing in concentration camps
• Crime stories (murders, serial killers, murder trial, terrorist acts)
• Accident stories

False negatives

• Lists of creative works

  Credits from George Abbott's stage career, compiled by his office and from theater reference books:
  The Misleading Lady, 1913, actor.
  Yeoman of the Guard, 1915, actor.
  The Queens Enemies, 1916, actor.
  Lightnin', 1918, rewrote scenes.
  …

• Tagging errors

  EDITORS:
  Two versions of Yugoslavia-Obit-Djilas moved on circuits. Please disregard the second, shorter, unbylined version.
  The AP

Caveats

• Obituaries, not FH data per se
• Newswire, not books
• One source
• Will it scale?
• Can it port to FSL?
• Didn’t do any ML tuning
• Binary acceptor; continuous values possible?
• Effect of OCR errors?