Information Extraction for Building Knowledge Bases

Post on 26-Jan-2015

DESCRIPTION

Presentation given at PUC Rio on March 8, 2012

TRANSCRIPT

WeST – Web Science & Technologies
University of Koblenz ▪ Landau, Germany

Information Extraction for Building Knowledge Bases

Steffen Staab
Saqib Mir – European Bioinformatics Institute
Ermelinda d'Oro, Massimo Ruffolo – Univ. Calabria, Italy

Steffen Staab staab@uni-koblenz.de

WeST – Web Science & Technologies

Slide 2

A FEW SLIDES WHERE WEST COMES FROM

Slide 3

Slide 4

Semantic Web

Web Retrieval

Social Web

Multimedia Web

Software Web

Institut WeST – Web Science & Technologies

GESIS

Slide 5

We (co-)organize conferences and schools

Slide 6

We build applications and develop methods…

BTC 1st Prize 2011
1st Prize, German Linked Open Gov Data Competition 2012
BTC 1st Prize 2008
German KM 1st Prize 2011

Slide 7

We teach Web Science

Master in Web Science@Koblenz
  Free tuition
  Start Fall 2012
  English

2012 Web Science Summer School
  Lorentz Center, Leiden, The Netherlands, 9-13 July 2012

Master in eGov@Koblenz
  Free tuition
  Start Fall 2012
  English

Slide 8

We are active in joint projects

EU Integrated Project ROBUST (10 Partners): Risk and Opportunity management of huge-scale BUSiness communiTy cooperation
EU Live+Gov – Reality Sensing, Mining and Augmentation for Mobile Citizen–Government Dialogue
EU WeGov – where eGovernment meets the eSociety
EU IP SocialSensor – Sensing User Generated Input for Improved Media Discovery and Experience
EU Net2 – a network for networked knowledge
EU MOST – Marrying Ontologies and Software Technologies

Slide 9

INFORMATION EXTRACTION FOR BUILDING KNOWLEDGE BASES

Steffen Staab
Saqib Mir, European Bioinformatics Institute
Ermelinda d'Oro, Massimo Ruffolo, Univ. Calabria, Italy

Slide 10

GENERAL MOTIVATION

Slide 11

General objective: Extracting to LOD

hasLivedIn
useAsExample

Slide 12

General objective: Analysing LOD

hasLivedIn
useAsExample

Slide 13

http://lisa.west.uni-koblenz.de/lisa-demo/

Family's analysis of Munich LOD + Open Street Map data

Slide 14

http://lisa.west.uni-koblenz.de/lisa-demo/

Entrepreneur's analysis of Munich LOD + Open Street Map data

Slide 15

OBSERVATIONS ON INFORMATION EXTRACTION

Slide 16

Challenges & Opportunities for IE

Not all web pages are created equal

Slide 17

Challenges & Opportunities for IE

Some challenges are the same, e.g. finding type instances

Slide 18

Challenges & Opportunities for IE

Some challenges are the same, e.g. finding relation instances

Slide 19

Challenges & Opportunities for IE

Some contain concepts and their descriptions, some don't

No types here, few relation types

Slide 20

Challenges & Opportunities for IE

Knowing that they are instances and of which type

Textual indication

Positional indication

Slide 21

Challenges & Opportunities for IE

To some extent, positional and layout indications work across languages and sites

Slide 22

Challenges & Opportunities for IE

owl:sameAs

We should not only think about Web pages, but about Web sites

Slide 24

Comparing related work to our objectives

Related work objectives
  IE on Web pages
  Acquiring instances and relationship instances
  IE based on linear text

Our objectives
  IE on Web sites
  Acquiring items
  Classifying items into
    Instances
    Concepts
    Relation instances
    Relationships
  IE also based on spatial position

There is overlap and there are a few exceptions in related work

Slide 25

Outline

The Bio-Case
The Social Media-Case
  Motivation
  State-of-the-Art
  Core idea of SXPath
  SXPath Language
    Spatial Data Model
    Syntax & Semantics
    Complexity
  Implementation
  Evaluation

Slide 26

Presentation-oriented documents

Music band profile

band photo

band name

Acquiring a music band profile: a music band photo that has its descriptive information to its east

Slide 27

Presentation-oriented documents

Slide 28

Presentation-oriented documents

• HTML DOM structure is site specific
• Spatial arrangements are rarely explicit
• Spatial layout is hidden in complex nesting of layout elements
• Intricate DOM tree structures are conceptually difficult to query for the user (or a tool!)

Slide 29

Related Work

Web query languages
  XPath 1.0 and XQuery 1.0
    Established
    Too difficult to use for scraping from intricate DOM structures

Visual languages
  Spatial Graph Grammars [Kong et al.] are quite complex in terms of both usability and efficiency
  Algebras for creating and querying multimedia interactive presentations (e.g. ppt) [Subrahmanian et al.]

Web wrapper induction exploiting visual interface [Gottlob et al.] [Sahuguet et al.]
  generate XPath location paths of DOM nodes
  can benefit from using Spatial XPath

Slide 30

Outline

The Bio-Case
The Social Media-Case
  Motivation
  State-of-the-Art
  Core idea of SXPath
  SXPath Language
    Spatial Data Model
    Syntax & Semantics
    Complexity
  Implementation
  Evaluation

Slide 31

Idea: Use Spatial Relations among DOM Nodes

Slide 32

Idea: Use Spatial Relations among DOM Nodes

Slide 33

Idea: Use Spatial Relations among DOM Nodes

Slide 34

Spatial DOM (SDOM)

Slide 35

Spatial Relations Among Nodes

Rectangular Cardinal Relations (RCR)

Topological Relations

r1 E:NE r2

Spatial models allow for expressing disjunctive relations among regions
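To make the tile notation concrete, here is a minimal Python sketch of how a relation such as r1 E:NE r2 could be derived from two bounding boxes. It assumes axis-aligned rectangles in screen coordinates (y grows downward, so "north" means a smaller y). The names Rect and cardinal_tiles and the example coordinates are illustrative assumptions, not part of SXPath.

```python
from dataclasses import dataclass
from math import inf


@dataclass(frozen=True)
class Rect:
    """Axis-aligned bounding box in screen coordinates (y grows downward)."""
    left: float
    top: float
    right: float
    bottom: float


def _overlaps(a_lo, a_hi, b_lo, b_hi):
    """True if the two intervals share a stretch of positive length."""
    return min(a_hi, b_hi) > max(a_lo, b_lo)


def cardinal_tiles(primary, reference):
    """Return the tiles around `reference` that `primary` overlaps.

    The plane is cut into nine tiles by extending the sides of `reference`:
    NW, N, NE, W, B (the box itself), E, SW, S, SE.
    """
    columns = [("W", -inf, reference.left),
               ("", reference.left, reference.right),
               ("E", reference.right, inf)]
    rows = [("N", -inf, reference.top),
            ("", reference.top, reference.bottom),
            ("S", reference.bottom, inf)]
    tiles = set()
    for row_name, row_lo, row_hi in rows:
        for col_name, col_lo, col_hi in columns:
            if (_overlaps(primary.top, primary.bottom, row_lo, row_hi)
                    and _overlaps(primary.left, primary.right, col_lo, col_hi)):
                tiles.add((row_name + col_name) or "B")
    return tiles


# Hypothetical layout: r1 sits to the right of r2 and sticks out above it.
r2 = Rect(left=0, top=100, right=100, bottom=200)
r1 = Rect(left=150, top=50, right=250, bottom=150)
relation = cardinal_tiles(r1, r2)
```

With these coordinates r1 overlaps both the east and the north-east tile of r2, which is the disjunctive case written as E:NE on the slide.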

Slide 37

XPath Example

Slide 38

SXPath Example

Slide 39

Slide 40

From XPath 1.0 towards Spatial Querying with SXPath

SXPath features
  adopts intuitive path notation: axis::nodetest [pred]*
  adds to XPath
    spatial axes
    spatial position functions
  natural semantics for spatial querying
  maintains polynomial time combined complexity

Slide 41

Why SXPath?

an XPath for Information extraction

web applications

familiarity

Simplicity

resilient wrappers

human oriented

efficiency

Slide 42

Outline

The Bio-Case
The Social Media-Case
  Motivation
  State-of-the-Art
  Core idea of SXPath
  SXPath Language
    Spatial Data Model
    Syntax & Semantics
    Complexity
  Implementation
  Evaluation

Slide 43

Spatial DOM (SDOM)

Slide 44

Spatial Navigation Axes

Slide 45

Spatial Navigation Axes

Slide 46

Syntax of SXPath

Slide 50

Complexity Results

Slide 51

Outline

The Bio-Case
The Social Media-Case
  Motivation
  State-of-the-Art
  Core idea of SXPath
  SXPath Language
    Spatial Data Model
    Syntax & Semantics
    Complexity
  Implementation
  Evaluation

Slide 52

SXPath System Architecture

Slide 53

SXPath System

Slide 54

Results of Experiments

Slide 55

Formative User Study

Slide 56

Summative User Study

Slide 57

Summative User Study

Slide 58

Summative User Study

Slide 59

Existing Extensions to PDF

Slide 60

Table

Page Header

Page Footer

Text Area and Paragraphs

Item List

Page Number

Slide 61

Outline

The Bio-Case
  Motivation
  The (Biochemical) Deep Web
  Contributions
    Page-level wrapper induction
    Site-wide wrapper generation
    Error Correction by Mutual Reinforcement
  Conclusions and Future Directions
The Social Media Case
  Motivation
  State-of-the-Art
  Core idea of SXPath
  SXPath Language
    Spatial Data Model
    Syntax & Semantics
    Complexity
  Implementation
  Evaluation

Slide 62

>1000 Life Science DBs, number growing quickly

Slide 63

Biochemical Web Sites: Observations - 1

Labeled Data

Total   Labeled   Unlabeled   Unlabeled (Redundant)
754     719       19          16

Table 1: Data fields across 20 Biochemical Web sites

Full survey: http://sabio.villa-bosch.de/labelsurvey.html (404)

Slide 64

Biochemical Web Sites: Observations - 2

Dynamic Web Pages

Slide 65

Biochemical Web Sites: Observations - 3

Rich Site Structure

Slide 66

Biochemical Web Sites: Observations - 4

Web Services
  Survey: 11 of 100 databases (1) provide APIs
  Incomplete coverage
  Varying granularity
  No semantics in the service description

(1) Databases indexed by the Nucleic Acids Research Journal (http://www3.oup.co.uk/nar/database/). Complete survey available at http://sabiork.villa-bosch.de/index.html/survey.html

Slide 67

Biochemical Web Sites: Implications

Induce Wrapper

Induce Wrapper

Induce Wrapper

Slide 68

Contributions

Unsupervised Page-Level Wrapper Induction

Unsupervised Site-Wide Wrapper Induction (Site Structure Discovery)

Automatic Error Detection and Correction by Mutual Reinforcement

Slide 69

Page-Level Wrapper Induction – 1

D1 = {C00221, beta-D-Glucose, …, R01520, 1.1.1.47, …}
O1 = {Entry, Name, …, Reaction, R00026, Enzyme, …, 3.2.1.21}

D2 = {C00185, Cellobiose, …, R00306, 1.1.99.18, …}
O2 = {Entry, Name, …, Reaction, R00026, Enzyme, …, 3.2.1.21}

//*[text()]
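The split into D and O sets can be illustrated with a small Python sketch. It reads the slide as follows, which is an interpretation rather than something the slide states: all text nodes of each sample page are collected (the role of //*[text()]), text that occurs on every sample page goes into O, and page-specific text goes into D. The two toy pages and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def text_nodes(markup):
    """Collect the stripped, non-empty text nodes of a well-formed page."""
    root = ET.fromstring(markup)
    found = set()
    for element in root.iter():
        # element.text precedes the first child; element.tail follows the element.
        for chunk in (element.text, element.tail):
            if chunk and chunk.strip():
                found.add(chunk.strip())
    return found


def split_data_and_other(pages):
    """Return one (D, O) pair per page: D is page-specific text, O is shared text."""
    texts = [text_nodes(page) for page in pages]
    shared = set.intersection(*texts)
    return [(text - shared, shared) for text in texts]


# Toy pages loosely modelled on the identifiers shown on the slide (assumption).
PAGE_1 = (
    "<table>"
    "<tr><th>Entry</th><td>C00221</td></tr>"
    "<tr><th>Name</th><td>beta-D-Glucose</td></tr>"
    "<tr><th>Enzyme</th><td><a>1.1.1.47</a><a>3.2.1.21</a></td></tr>"
    "</table>"
)
PAGE_2 = (
    "<table>"
    "<tr><th>Entry</th><td>C00185</td></tr>"
    "<tr><th>Name</th><td>Cellobiose</td></tr>"
    "<tr><th>Enzyme</th><td><a>1.1.99.18</a><a>3.2.1.21</a></td></tr>"
    "</table>"
)

(d1, o1), (d2, o2) = split_data_and_other([PAGE_1, PAGE_2])
```

In this toy input the enzyme number 3.2.1.21 appears on both pages and therefore lands in O, just as it does in O1 and O2 above, even though it is really a data value.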

Slide 70

Page-Level Wrapper Induction - 2

Reclassify – Growing Data Regions

Slide 71

Page-Level Wrapper Induction - 3

D1´ = {C00221, beta-D-Glucose, …, R01520, 1.1.1.47, 3.2.1.21 …}
O1´ = {Entry, Name, …, Reaction, R00026, Enzyme, …,}

D2´ = {C00185, Cellobiose, …, R00306, 1.1.99.18, 3.2.1.21 …}
O2´ = {Entry, Name, …, Reaction, R00026, Enzyme, …,}

Slide 72

Page-Level Wrapper Induction - 4

Selecting Labels for Data

html/…./table[1]/tr[8]/td[1]/…/code[1]/a[1] (“1.1.1.47”)

html/…./table[1]/tr[6]/th[1]/…/code[1]/ (“Reaction”)

html/…./table[1]/tr[8]/th[1]/…/code[1]/ (“Enzyme”)

Slide 73

Page-Level Wrapper Induction - 5

Anchor the Path

Enzyme - html/table[1]/tr[8]/th[1]/code[1]/

html/table[1]/tr[8]/td[1]/code[1]/a[1]
html/table[1]/tr[8]/td[1]/code[1]/a[2]

//*[text()='Enzyme'] ../…./../td[1]/code[1]/a[position()≥2]/text()

Pivot, Generalize, Relative
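The three steps named on this slide (pivot on the label, make the data path relative to it, generalize over sibling positions) can be sketched in Python as plain string manipulation. This is only a sketch under stated assumptions: relative_path and generalize are hypothetical helpers, paths are simple slash-separated steps, and the position bound is taken from the sample paths, so it does not reproduce the elided section or the exact bound of the slide's expression.

```python
import re

INDEX = re.compile(r"\[(\d+)\]$")  # trailing positional predicate such as [2]


def relative_path(label_path, data_path):
    """Rewrite data_path so that it starts from the label node."""
    label_steps = label_path.strip("/").split("/")
    data_steps = data_path.strip("/").split("/")
    common = 0
    for label_step, data_step in zip(label_steps, data_steps):
        if label_step != data_step:
            break
        common += 1
    ups = [".."] * (len(label_steps) - common)  # climb to the shared ancestor
    return "/".join(ups + data_steps[common:])


def generalize(paths):
    """Merge paths that differ only in their final positional index."""
    matches = [INDEX.search(path) for path in paths]
    if not all(matches):
        raise ValueError("every path must end in a positional index")
    stems = {path[:match.start()] for path, match in zip(paths, matches)}
    if len(stems) != 1:
        raise ValueError("paths differ in more than the final index")
    lowest = min(int(match.group(1)) for match in matches)
    return f"{stems.pop()}[position()>={lowest}]"


# Absolute paths as shown on the slide.
LABEL = "html/table[1]/tr[8]/th[1]/code[1]/"
DATA = [
    "html/table[1]/tr[8]/td[1]/code[1]/a[1]",
    "html/table[1]/tr[8]/td[1]/code[1]/a[2]",
]

relative = [relative_path(LABEL, path) for path in DATA]
wrapper = f"//*[text()='Enzyme']/{generalize(relative)}/text()"
```

Anchoring on the label text instead of an absolute position is what lets the expression survive pages where the "Enzyme" row moves up or down in the table.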

Slide 74

Selected Sources

KEGG, ChEBI, MSDChem
  Basic qualitative data
  Popular
  Overlapping/complementary data

Slide 75

Wrapper Induction - Evaluation

SOURCE          #L   #D    #S   TP    FN    FP   P      R
KEGG Compound   10   762   3    411   351   46   89.9   53.9
KEGG Compound   10   762   15   759   3     0    100    99.6
KEGG Reaction   10   205   3    173   32    0    100    84.4
KEGG Reaction   10   205   15   205   0     0    100    100
ChEBI           22   831   3    595   236   41   93.5   71.6
ChEBI           22   831   15   829   2     0    100    99.7
MSDChem         30   600   3    600   0     20   96.7   100
MSDChem         30   600   15   600   0     20   96.7   100
Average (based on final wrappers for each source) 99.1   99.8

Sources:
  KEGG Compound: http://www.genome.jp/kegg/compound/
  KEGG Reaction: http://www.genome.jp/kegg/reaction/
  ChEBI: http://www.ebi.ac.uk/chebi
  MSDChem: http://www.ebi.ac.uk/msd-srv/msdchem/

~9 samples – ~99% P, ~98% R

Table 2: Page-level wrapper induction results, 20 test pages (L=Labels, D=Data entries, S=Training pages)
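The P and R columns are consistent with the standard definitions of precision and recall, P = TP / (TP + FP) and R = TP / (TP + FN), expressed as percentages. The slide does not state the formulas, so this is an assumption, but it reproduces the KEGG Compound rows checked in the short Python sketch below.

```python
def precision_recall(tp, fn, fp):
    """Return (precision, recall) in percent from raw counts."""
    precision = 100.0 * tp / (tp + fp)
    recall = 100.0 * tp / (tp + fn)
    return precision, recall


# KEGG Compound with 3 training pages: TP=411, FN=351, FP=46.
p_small, r_small = precision_recall(tp=411, fn=351, fp=46)

# KEGG Compound with 15 training pages: TP=759, FN=3, FP=0.
p_final, r_final = precision_recall(tp=759, fn=3, fp=0)
```

Note that TP + FN equals the number of data entries #D in every row (for example 411 + 351 = 762), which is why recall rises as more training pages remove false negatives.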

Slide 76

Site-Wide Wrapper Induction: Observations

Not all pages contain data (e.g. legal disclaimers, contact pages, navigational menus)
  An efficient approach should ignore these pages
  We don't need to learn the entire site structure

Slide 77

Site-Wide Wrapper Induction: Observations - 2

Classified Link-Collections point to data-intensive pages of the same class.

Slide 78

Site-Wide Wrapper Induction: Observations - 3

Pages belonging to the same class describe the same concepts
  Some concepts are sometimes omitted
  Ordering is always the same

Slide 79

Site-Wide Wrapper Induction

1. Start with C0
2. Follow all classified link-collections
3. Generate wrappers for each set of target pages
4. Determine if new class is formed
5. Add navigation step
6. Repeat 2 – 5 for each new class formed in 4

[Figure: page class C0 with link-collections L1, L2, L3 pointing to target pages C1, C2, C3]

S = {C0}
If C0 != Ci (i>0): S = S + Ci
Navigation Steps W = {(C0 → L1 → C0), (C0 → L2 → C2), (C0 → L3 → C3)}
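The loop over steps 1 to 6 can be sketched as a breadth-first traversal. The Python below is a simplification under loud assumptions: the class of each link-collection's target pages is given directly in a hypothetical dictionary, whereas in the approach described here that class has to be determined by generating wrappers for the target pages and comparing them (steps 3 and 4). The function name and data layout are illustrative.

```python
from collections import deque


def discover_site_structure(start_class, link_collections):
    """Collect page classes S and navigation steps W starting from start_class.

    link_collections maps a page class to {link-collection name: target class}.
    """
    classes = [start_class]          # S, in order of discovery
    navigation_steps = []            # W, as (source, link-collection, target)
    queue = deque([start_class])
    while queue:
        current = queue.popleft()
        for link, target in link_collections.get(current, {}).items():
            navigation_steps.append((current, link, target))
            if target not in classes:    # step 4: a new class was formed
                classes.append(target)
                queue.append(target)     # step 6: repeat for the new class
    return classes, navigation_steps


# Toy site mirroring the slide: L1 leads back to pages of class C0.
SITE = {"C0": {"L1": "C0", "L2": "C2", "L3": "C3"}}

S, W = discover_site_structure("C0", SITE)
```

Because only classified link-collections are followed, pages without data (disclaimers, contact pages) are never visited, which matches the observation that the entire site structure need not be learned.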

Slide 80

Site-Wide Wrapper Induction – Evaluation

SOURCE    #C   #C'  #D     TP     FN     FP    P      R
MSDChem   1    1    N/A    N/A    N/A    N/A   N/A    N/A
ChEBI     3    1    1711   1195   516    0     100    69.8
KEGG      10   7    6223   5044   1179   188   97     81.1
Average                                        98.5   75.5

Table 3: Site-wide wrapper induction results, 20 test pages for each class (C=Classes, C'=Classes discovered, D=Data entries)

Slide 81

Error Detection and Correction: Mutual Reinforcement

Observation: Certain data reappear on more than one class of pages

Slide 82

Error Detection and Correction: Mutual Reinforcement

Reinforcement if reappearing data is correctly classified as Data

Otherwise it points to misclassification
  Label-Data Mismatch
    • Correction: Introduce more samples
  Label-Label Mismatch
    • Cannot be detected
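A minimal Python sketch of this cross-check, assuming each page class has already been wrapped into a set of data values and a set of labels. The function name, the dictionary layout, and the toy extractions are hypothetical; only the identifiers are borrowed from the earlier slides.

```python
def cross_check(extractions):
    """Compare extractions across page classes.

    extractions maps a class name to {"data": set, "labels": set}.
    Returns (reinforced, mismatches): values seen as data in two classes,
    and values seen as data in one class but as a label in another.
    """
    reinforced, mismatches = set(), set()
    names = list(extractions)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            a, b = extractions[first], extractions[second]
            reinforced |= a["data"] & b["data"]
            mismatches |= a["data"] & b["labels"]
            mismatches |= a["labels"] & b["data"]
    return reinforced, mismatches


# Toy extractions: the compound wrapper wrongly treats 3.2.1.21 as a label.
EXTRACTIONS = {
    "compound": {"data": {"C00221", "R01520", "1.1.1.47"},
                 "labels": {"Entry", "Name", "3.2.1.21"}},
    "enzyme": {"data": {"3.2.1.21", "1.1.1.47"},
               "labels": {"Entry", "Name"}},
}

reinforced, mismatches = cross_check(EXTRACTIONS)
```

A value misclassified as a label in both classes would appear only in the two label sets, so neither intersection flags it, which is the undetectable label-label case noted above.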

Slide 83

Where to go next?

Reverse engineering production
1. LOD
2. Navigation model
3. Interaction model
4. Layout model

Capture this generative model using machine learning
  Relational learning
    • Markov logic programmes?
    • …?

emitting RDF & RDFS

what belongs to what

(- not treated at all by us so far -)

spatial positioning

Slide 84

Bibliography

Linda d’Oro, Massimo Ruffolo, Steffen Staab. SXPath – Extending XPath towards Spatial Querying on Web Documents. In: PVLDB – Proceedings of the VLDB Endowment, 4(2): 129-140, 2010.

S. Mir, S. Staab, I. Rojas. Site-Wide Wrapper Induction for Life Science Deep Web Databases. In: DILS-2009 – Proc. of the Data Integration in the Life Sciences Workshop, Manchester, UK, July 20-22, LNCS, Springer, 2009.

Saqib Mir, Steffen Staab, Isabel Rojas. An Unsupervised Approach for Acquiring Ontologies and RDF Data from Online Life Science Databases. In: 7th Extended Semantic Web Conference (ESWC2010), Heraklion, Greece, May 30-June 3, 2010, pp. 319-333.

WeST – Web Science & Technologies
University of Koblenz ▪ Landau, Germany

Thank you for your attention!
