Sifter
for classifying books dense (or not) in family history information & for selecting
3-page sequences for evaluation
Deryle Lonsdale
1 Oct. 2013
The task
• Develop a data-rich family history text range recognizer
  – Perl
  – Machine learning
  – Mostly OTS components
  – Fully automatic
  – Arbitrary text chunk size
• Evaluate performance
Method
• Document features
  – Language identifier (and confidence)
    • We only want English (for now)
    • Used a pre-existing Perl module (Simões)
  – Type/token ratio
    • We want narrow-domain
  – % FH lexical items
    • We want to prefer FH vocabulary
    • Hand-coded, 49 words (died, married, cremation, etc.)
  – % integer words, % person words, % date words, % organization words, % location words
    • We want it to be data-rich
    • Used Stanford named entity engine
  – Average sentence length
    • Maybe sentences are shorter in FH text??
• One vector (floating-point features) per text chunk (e.g. document)
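As a rough illustration of the per-chunk feature vector, here is a minimal Python sketch of four of the simpler features (type/token ratio, share of FH lexical items, share of integer words, average sentence length). It is not the Perl implementation described above. The function name `extract_features`, the tokenization rules, and the three-word `FH_WORDS` set are assumptions for illustration; the actual 49-word list is not given here. The language-identification and named-entity features are omitted because they depend on external tools.

```python
import re

# Hypothetical stand-in for the hand-coded 49-word FH lexicon. Only the three
# words named on the slide are known, so the set is deliberately left short.
FH_WORDS = {"died", "married", "cremation"}


def extract_features(text):
    """Return a dict of floating-point features for one text chunk.

    Shares are fractions in [0, 1], not percentages.
    """
    # Crude sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Crude tokenization: lowercase runs of ASCII letters, digits, apostrophes.
    tokens = re.findall(r"[A-Za-z0-9']+", text.lower())
    n = len(tokens)
    if n == 0:
        return {"type_token_ratio": 0.0, "fh_share": 0.0,
                "integer_share": 0.0, "avg_sentence_len": 0.0}
    return {
        "type_token_ratio": len(set(tokens)) / n,
        "fh_share": sum(t in FH_WORDS for t in tokens) / n,
        "integer_share": sum(t.isdigit() for t in tokens) / n,
        "avg_sentence_len": n / len(sentences),
    }


if __name__ == "__main__":
    sample = "John Smith died in 1913. He married Mary in 1890."
    print(extract_features(sample))
```

Because the function takes any string, the same code applies to a whole document or to a smaller chunk such as a 3-page sequence.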
Evaluation
• Gigaword corpus newswire
  – Associated Press Worldstream articles (Nov. 1994-May 1995)
  – 585 obituaries (192,000 words)
  – 649 non-obituaries (221,000 words, randomly selected from 85,000 articles)
• TiMBL machine learning
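TiMBL is a memory-based (nearest-neighbour) learner. The sketch below shows the general idea in Python using plain unweighted k-NN with Euclidean distance. It does not reproduce TiMBL's own distance metrics or feature weighting, and no TiMBL settings are assumed. The function `knn_classify` and the two-dimensional `toy_train` vectors are invented purely for illustration and are not data from the experiment.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Label `query` by majority vote among its k nearest training vectors.

    train: list of (feature_vector, label) pairs
    query: feature vector of the same length
    """
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented toy vectors (two made-up features per chunk), for illustration only.
toy_train = [
    ([0.05, 0.10], "obit"),
    ([0.06, 0.12], "obit"),
    ([0.04, 0.09], "obit"),
    ([0.00, 0.02], "nonobit"),
    ([0.01, 0.03], "nonobit"),
    ([0.00, 0.01], "nonobit"),
]

if __name__ == "__main__":
    print(knn_classify(toy_train, [0.05, 0.11]))
```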
Results
F-Score beta=1, microav: 0.939263
F-Score beta=1, macroav: 0.939184
AUC, microav: 0.940449
AUC, macroav: 0.940449
overall accuracy: 0.939222 (1159/1234), of which 128 exact matches
Confusion Matrix:
          nonobit   obit
          --------------
nonobit |     595     54
   obit |      21    564
    -*- |       0      0
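The summary scores can be rechecked from the confusion matrix. The Python sketch below assumes rows are actual classes and columns are predicted classes, which is consistent with the row sums (649 and 585) matching the corpus sizes. It also assumes the reported "microav" F-score is the class-frequency-weighted mean of the per-class F-scores. Under those assumptions the computed values agree with the reported accuracy and both F-scores to six decimal places. The names `MATRIX`, `f1_per_class`, and `summarize` are illustrative.

```python
# Confusion matrix from the results: rows = actual class, columns = predicted
# class (assumed orientation), in the order nonobit, obit.
MATRIX = [[595, 54],
          [21, 564]]


def f1_per_class(matrix):
    """Return the F1 score for each class of a square confusion matrix."""
    scores = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                    # actual i, predicted other
        fp = sum(row[i] for row in matrix) - tp     # predicted i, actual other
        scores.append(2 * tp / (2 * tp + fp + fn))
    return scores


def summarize(matrix):
    """Return accuracy, macro-averaged F1, and frequency-weighted F1."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    f1 = f1_per_class(matrix)
    support = [sum(row) for row in matrix]          # instances per actual class
    return {
        "accuracy": correct / total,
        "f1_macro": sum(f1) / len(f1),
        "f1_weighted": sum(f * s for f, s in zip(f1, support)) / total,
    }


if __name__ == "__main__":
    for name, value in summarize(MATRIX).items():
        print(f"{name}: {value:.6f}")
```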
Feature ranking
• % FH lexical items
• % integers
• % person names
• % dates
• Average sentence length
• Type/token ratio
• % locations
• % organizations
Errors
False positives
• Articles about people perishing in concentration camps
• Crime stories (murders, serial killers, murder trial, terrorist acts)
• Accident stories
False negatives
• Lists of creative works
    Credits from George Abbott's stage career, compiled by his office and from theater reference books:
    The Misleading Lady, 1913, actor.
    Yeoman of the Guard, 1915, actor.
    The Queens Enemies, 1916, actor.
    Lightnin', 1918, rewrote scenes.
    …
• Tagging errors
    EDITORS:
    Two versions of Yugoslavia-Obit-Djilas moved on circuits. Please disregard the second, shorter, unbylined version.
    The AP
Caveats
• Obituaries, not FH data per se
• Newswire, not books
• One source
• Will it scale?
• Can it port to FSL?
• Didn’t do any ML tuning
• Binary acceptor; continuous values possible?
• Effect of OCR errors?