1 automating the extraction of domain-specific information from the web a case study for the...

25
1 Automating the Automating the Extraction of Domain- Extraction of Domain- Specific Information Specific Information from the Web from the Web A Case Study for the Genealogical Domain A Case Study for the Genealogical Domain Troy Walker Troy Walker Thesis Defense Thesis Defense November 19, 2004 November 19, 2004 Research funded by NSF grant #IIS-0083127 Research funded by NSF grant #IIS-0083127

Post on 21-Dec-2015

224 views

Category:

Documents


4 download

TRANSCRIPT

Page 1: 1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,

11

Automating the Extraction of Automating the Extraction of Domain-Specific Information Domain-Specific Information

from the Webfrom the WebA Case Study for the Genealogical DomainA Case Study for the Genealogical Domain

Troy WalkerTroy WalkerThesis DefenseThesis Defense

November 19, 2004November 19, 2004

Research funded by NSF grant #IIS-0083127Research funded by NSF grant #IIS-0083127

Page 2: 1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,

22

Genealogical Information on Genealogical Information on the Webthe Web

Hundreds of thousands of sitesHundreds of thousands of sites Some professional (Ancestry.com, Some professional (Ancestry.com,

Familysearch.org)Familysearch.org) Mostly hobbyist (240,400 indexed by Mostly hobbyist (240,400 indexed by

Cyndislist.com)Cyndislist.com) Search enginesSearch engines

““Walker genealogy” on Google: 523,000 resultsWalker genealogy” on Google: 523,000 results 1 page/minute = 1 year to go through1 page/minute = 1 year to go through

Why not enlist the help of a computer?Why not enlist the help of a computer?

Page 3: 1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,

33

ProblemsProblems

No standard way of presenting dataNo standard way of presenting data Sites have differing schemasSites have differing schemas Web pages changeWeb pages change New pages continuously come on New pages continuously come on

lineline

Page 4: 1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,

44

GeneTIQSGeneTIQS

Based on work done by BYU DEGBased on work done by BYU DEG Able to extract from:Able to extract from:

Single-record documentsSingle-record documents Simple multiple-record documentsSimple multiple-record documents Complex multiple-record documentsComplex multiple-record documents

Robust to changes in pagesRobust to changes in pages Immediately works for new pagesImmediately works for new pages

Page 5: 1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,

55

Person OntologyPerson Ontology

Page 6: 1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,

66

Person OntologyPerson Ontology

Page 7: 1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,

77

Value MatchersValue Matchers

Page 8: 1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,

88

Record SeparationRecord Separation

Separating data related to each Separating data related to each personperson

Previous techniquePrevious technique Combines many heuristicsCombines many heuristics Has problemsHas problems

Assumes multiple recordsAssumes multiple records Must be simple separationMust be simple separation

Page 9: 1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,

99

Single-Record DocumentSingle-Record Document

Page 10: 1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,

1010

Simple Multiple-Record Simple Multiple-Record DocumentDocument

Page 11: 1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,

1111

Complex Multiple-Record Complex Multiple-Record DocumentDocument

Page 12: 1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,

1212

Vector Space ModelingVector Space Modeling

Ontology VectorOntology Vector Compare to candidate recordsCompare to candidate records

Cosine measureCosine measure

Magnitude measureMagnitude measure

21 vv

1v

Page 13: 1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,

1313

Ontology VectorOntology Vector

{ 0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0, 3.0}{ 0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0, 3.0}

Page 14: 1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,

1414

Vector Space ModelingVector Space Modeling<!DOCTYPE…><!DOCTYPE…><html><html> <head><head> … … </head></head> <body><body> <div><div> … …header…header… </div></div> <div><div> … … <td><td> … …

{0, 0, 0, 0, 0, 0, 0, 0, 0}{0, 0, 0, 0, 0, 0, 0, 0, 0}{0, 149, 89, 76, 0, 0, 48, 23, 23}{0, 149, 89, 76, 0, 0, 48, 23, 23} {0, 1, 0, 0, 0, 0, 0, 0, 0, 0}{0, 1, 0, 0, 0, 0, 0, 0, 0, 0}

{0, 148, 89, 76, 0, 0, 48, 23, 23}{0, 148, 89, 76, 0, 0, 48, 23, 23} {0, 0, 0, 0, 0, 0, 0, 0, 0} {0, 0, 0, 0, 0, 0, 0, 0, 0}

{0, 146, 88, 76, 0, 0, 48, 23, 23}{0, 146, 88, 76, 0, 0, 48, 23, 23}…… {0, 1, 1, 1, 0, 0, 0, 0, 0}{0, 1, 1, 1, 0, 0, 0, 0, 0}

GenderGender

ChristeniChristeningng

BuriaBuriall

MarriageMarriage

Relation Relation NameName

BirthBirth

NameName

DeatDeathh

RelationshRelationshipip

Page 15: 1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,

1515

Problems and Problems and ImprovementsImprovements

Differing schemasDiffering schemas Low cosine measuresLow cosine measures Discarded dataDiscarded data Prune dimensionsPrune dimensions{0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0, 3.0}{0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0, 3.0}

{0.0, 141.0, 89.0, 76.0, 0.0, 0.0, 48.0, 23.0, 23.0}{0.0, 141.0, 89.0, 76.0, 0.0, 0.0, 48.0, 23.0, 23.0}

Richness of data in single-record documentsRichness of data in single-record documents High magnitude measureHigh magnitude measure Higher magnitude to split documentsHigher magnitude to split documents

Page 16: 1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,

1616

Problems and Problems and ImprovementsImprovements

Missed Simple PatternsMissed Simple Patterns More than 3 recordsMore than 3 records Valid Records:Total Records > 2:3Valid Records:Total Records > 2:3 Keep allKeep all Discard header and footerDiscard header and footer

Page 17: 1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,

1717

DemonstrationDemonstration

Page 18: 1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,

1818

Presenting ResultsPresenting Results

Page 19: 1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,

1919

EvaluationEvaluation

Semi-structured TextSemi-structured Text 21 single-record documents21 single-record documents 10 simple documents containing 130 10 simple documents containing 130

recordsrecords 20 complex documents with 238 records20 complex documents with 238 records

Precision and recall for record Precision and recall for record separation separation

Page 20: 1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,

2020

  records returned correct precision recall

single1 1 1 1 100.00% 100.00%

single2 1 1 1 100.00% 100.00%

single3 1 1 1 100.00% 100.00%

single4 1 1 1 100.00% 100.00%

single5 1 1 1 100.00% 100.00%

single6 1 1 1 100.00% 100.00%

single7 1 1 1 100.00% 100.00%

single8 1 4 0 0.00% 0.00%

single9 1 1 1 100.00% 100.00%

single10 1 1 1 100.00% 100.00%

single11 1 1 1 100.00% 100.00%

single12 1 3 0 0.00% 0.00%

single13 1 1 1 100.00% 100.00%

single14 1 1 1 100.00% 100.00%

single15 1 1 1 100.00% 100.00%

single16 1 1 1 100.00% 100.00%

single17 1 1 1 100.00% 100.00%

single18 1 1 1 100.00% 100.00%

single19 1 1 1 100.00% 100.00%

single20 1 1 1 100.00% 100.00%

single21 1 1 1 100.00% 100.00%

Total 21 26 19 73.08% 90.48%

Single-Single-Record Record

DocumentDocumentss

Page 21: 1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,

2121

  records returned correct precision recall

simple1 19 20 19 95.00% 100.00%

Simple2 19 17 17 100.00% 89.47%

Simple3 11 11 11 100.00% 100.00%

Simple4 9 9 9 100.00% 100.00%

Simple5 12 13 11 84.62% 91.67%

Simple6 12 11 10 90.91% 83.33%

Simple7 14 10 10 100.00% 71.43%

Simple8 5 7 5 71.43% 100.00%

Simple9 14 14 14 100.00% 100.00%

Simple10 15 15 15 100.00% 100.00%

Total 130 127 121 95.28% 93.08%

  records returned correct precision recall

simple1 19 22 19 86.36% 100.00%

simple2 19 20 0 0.00% 0.00%

simple3 11 14 11 78.57% 100.00%

simple4 9 10 9 90.00% 100.00%

simple5 12 16 12 75.00% 100.00%

simple6 12 23 9 39.13% 75.00%

simple7 14 22 13 59.09% 92.86%

simple8 5 10 0 0.00% 0.00%

simple9 14 16 14 87.50% 100.00%

simple10 15 16 0 0.00% 0.00%

Total 130 169 87 51.48% 66.92%

Simple Multiple-Record Simple Multiple-Record DocumentsDocuments

VSM SeparatorVSM Separator Highest-Fanout SeparatorHighest-Fanout Separator

Page 22: 1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,

2222

Complex Complex Multiple-Multiple-Record Record

DocumenDocumentsts

  records returned missed extra correct precision recall

complex1 10 10 0 0 10 100.00% 100.00%

complex2 15 15 0 0 15 100.00% 100.00%

complex3 12 12 0 0 12 100.00% 100.00%

complex4 7 9 1 3 6 66.67% 85.71%

complex5 16 15 1 0 15 100.00% 93.75%

complex6 15 16 2 3 13 81.25% 86.67%

complex7 13 12 1 0 12 100.00% 92.31%

complex8 10 10 0 0 10 100.00% 100.00%

complex9 19 20 1 2 18 90.00% 94.74%

complex10 10 10 1 1 9 90.00% 90.00%

complex11 15 11 4 0 11 100.00% 73.33%

complex12 15 15 0 0 15 100.00% 100.00%

complex13 11 11 0 0 11 100.00% 100.00%

complex14 16 18 1 3 15 83.33% 93.75%

complex15 8 8 2 2 6 75.00% 75.00%

complex16 8 9 0 1 8 88.89% 100.00%

complex17 10 11 0 0 11 100.00% 110.00%

complex18 4 1 3 0 1 100.00% 25.00%

complex19 8 11 0 3 8 72.73% 100.00%

complex20 16 13 4 1 12 92.31% 75.00%

Total 238 237 21 19 218 91.98% 91.60%

Page 23: 1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,

2323

ConclusionConclusion

Integrate, build on previous DEG workIntegrate, build on previous DEG work Accurate record separationAccurate record separation

Average recall: 92%Average recall: 92% Average precision: 93%Average precision: 93%

Ontology basedOntology based Robust to changes in pagesRobust to changes in pages Immediately works with new pagesImmediately works with new pages

Page 24: 1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,

2424

Future WorkFuture Work

ScaleScale Distribute computationDistribute computation Intelligent URL selectorIntelligent URL selector

More SourcesMore Sources TablesTables Forms and dynamic pagesForms and dynamic pages

Obtain more information behind linksObtain more information behind links

Page 25: 1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,

2525

Future WorkFuture Work

Improve VSM record separationImprove VSM record separation Weight object importanceWeight object importance Disambiguate before record separationDisambiguate before record separation Recognize patternsRecognize patterns Improve detection of single-record Improve detection of single-record

documentsdocuments