Information Extraction from Web Documents CS 652 Information Extraction and Integration Li Xu Yihong Ding

Post on 20-Dec-2015

Information Extraction from Web Documents

CS 652 Information Extraction and Integration

Li Xu, Yihong Ding

IR and IE

IR (Information Retrieval)
  Retrieves relevant documents from collections
  Information theory, probability theory, and statistics

IE (Information Extraction)
  Extracts relevant information from documents
  Machine learning, computational linguistics, and natural language processing

History of IE

Large amounts of both online and offline textual data
Message Understanding Conference (MUC)
  Quantitative evaluation of IE systems
  Tasks
    Latin American terrorism
    Joint ventures
    Microelectronics
    Company management changes

Evaluation Metrics

Precision

Recall

F-measure
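The slide names the three metrics without their formulas. A minimal sketch of the standard definitions follows, assuming extracted and relevant items can be compared as set members; the function name `precision_recall_f` and the `beta` weighting parameter are illustrative choices, not taken from the slides.

```python
def precision_recall_f(extracted, relevant, beta=1.0):
    """Return (precision, recall, F-measure) for two collections of items.

    precision = correct / number of extracted items
    recall    = correct / number of relevant items
    F         = (1 + beta^2) * P * R / (beta^2 * P + R)
    """
    extracted, relevant = set(extracted), set(relevant)
    correct = len(extracted & relevant)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


# Four items extracted, three are relevant, two of the extracted ones are correct.
p, r, f = precision_recall_f({"a", "b", "c", "d"}, {"a", "b", "e"})
```

With `beta=1.0` the F-measure is the harmonic mean of precision and recall; a larger `beta` weights recall more heavily.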

Web Documents

Unstructured (Free) Text
  Regular sentences and paragraphs
  Linguistic techniques, e.g., NLP

Structured Text
  Itemized information
  Uniform syntactic clues, e.g., table understanding

Semistructured Text
  Ungrammatical, telegraphic (e.g., missing attributes, multi-value attributes, …)
  Specialized programs, e.g., wrappers

Approaches to IE

Knowledge Engineering
  Grammars are constructed by hand
  Domain patterns are discovered by human experts through introspection and inspection of a corpus
  Much laborious tuning and “hill climbing”

Machine Learning
  Use statistical methods when possible
  Learn rules from annotated corpora
  Learn rules from interaction with user

Knowledge Engineering

Advantages
  With skills and experience, well-performing systems are not conceptually hard to develop.
  The best performing systems have been hand crafted.

Disadvantages
  Very laborious development process
  Some changes to specifications can be hard to accommodate
  Required expertise may not be available

Machine Learning

Advantages
  Domain portability is relatively straightforward
  System expertise is not required for customization
  “Data driven” rule acquisition ensures full coverage of examples

Disadvantages
  Training data may not exist, and may be very expensive to acquire
  Large volume of training data may be required
  Changes to specifications may require reannotation of large quantities of training data

Wrapper

A specialized program that identifies data of interest and maps them to some suitable format (e.g., XML or relational tables)

Challenge: recognizing the data of interest among many other uninteresting pieces of text

Tasks
  Source understanding
  Data processing
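To make the wrapper idea concrete, here is a minimal hand-written sketch: it picks out the data of interest from a page fragment and maps each record to XML. The page fragment, the field names `model` and `price`, and the function name `wrap` are all invented for illustration; a real wrapper would be tailored to its own source.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page fragment; the markup and the car data are invented.
PAGE = """
<html><body><h1>Used Cars</h1><p>Updated daily.</p>
<ul>
<li><b>Honda Civic</b> <i>$4,500</i></li>
<li><b>Ford Focus</b> <i>$3,900</i></li>
</ul><p>Contact us for details.</p></body></html>
"""

# Source understanding: each record of interest is one <li> holding a bold
# model name followed by an italic price. Everything else is ignored.
ITEM = re.compile(r"<li><b>(?P<model>[^<]+)</b>\s*<i>(?P<price>[^<]+)</i></li>")


def wrap(page):
    """Data processing: map every matched record to a <car> XML element."""
    root = ET.Element("cars")
    for match in ITEM.finditer(page):
        car = ET.SubElement(root, "car")
        ET.SubElement(car, "model").text = match.group("model")
        ET.SubElement(car, "price").text = match.group("price")
    return root


xml_text = ET.tostring(wrap(PAGE), encoding="unicode")
```

The heading and the surrounding paragraphs never match `ITEM`, which is how this sketch separates the data of interest from the rest of the page. Any change to the page markup would break the pattern, which is the brittleness that wrapper induction systems try to address.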

Free Text

AutoSlog
LIEP
PALKA
HASTEN
CRYSTAL
Webfoot
WHISK

AutoSlog [1993]

The Parliament building was bombed by Carlos.

LIEP [1995]

The Parliament building was bombed by Carlos.

PALKA [1995]

The Parliament building was bombed by Carlos.

HASTEN [1995]

The Parliament building was bombed by Carlos.

Egraphs(SemanticLabel, StructuralElement)

CRYSTAL [1995]

The Parliament building was bombed by Carlos.

CRYSTAL + Webfoot [1997]

WHISK [1999]

The Parliament building was bombed by Carlos.

WHISK Rule: *(PhyObj)*@passive *F ‘bombed’ * {PP ‘by’ *F (Person)}

Context-based patterns
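The rule above can be read as: skip to a physical object, require a passive construction containing "bombed", then find a person inside a "by" phrase. The sketch below is only a rough regular-expression analogue of that reading, not WHISK's actual matcher: it ignores syntactic field boundaries (the `*F` operator), approximates `@passive` with a list of auxiliary verbs, and uses tiny hypothetical word lists in place of the semantic classes PhyObj and Person. The slot names `target` and `perpetrator` are also assumptions.

```python
import re

# Hypothetical stand-ins for the semantic classes PhyObj and Person.
PHY_OBJ = ["Parliament building", "embassy"]
PERSON = ["Carlos", "Maria"]


def alternation(terms):
    """Build a regex alternation that matches any of the given terms."""
    return "|".join(re.escape(term) for term in terms)


PATTERN = re.compile(
    r".*?(?P<target>" + alternation(PHY_OBJ) + r")"   # *(PhyObj)
    r".*?\b(?:was|were|is|are|been)\b"                # crude @passive check
    r".*?\bbombed\b"                                  # 'bombed'
    r".*?\bby\b.*?(?P<perpetrator>" + alternation(PERSON) + r")"  # 'by' ... (Person)
)


def extract(sentence):
    """Return the two slots as a dict, or None if the pattern does not apply."""
    match = PATTERN.search(sentence)
    return match.groupdict() if match else None


result = extract("The Parliament building was bombed by Carlos.")
```

Both slots are filled by a single pattern, which illustrates what multi-slot extraction means. The active sentence "Carlos bombed the Parliament building." is not matched, because the pattern depends on the context around each slot, not only on the slot fillers.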

Web Documents

Semistructured and Unstructured
  RAPIER (E. Califf, 1997)
  SRV (D. Freitag, 1998)
  WHISK (S. Soderland, 1998)

Semistructured and Structured
  WIEN (N. Kushmerick, 1997)
  SoftMealy (C-H. Hsu, 1998)
  STALKER (I. Muslea, S. Minton, C. Knoblock, 1998)

Inductive Learning

Task
Inductive Inference
Learning Systems
  Zero-order
  First-order, e.g., Inductive Logic Programming (ILP)

RAPIER [1997]

Inductive Logic Programming
Extraction Rules
  Syntactic information
  Semantic information

Advantage
  Efficient learning (bottom-up)

Drawback
  Single-slot extraction

RAPIER Rule

SRV [1998]

Relational Algorithm (top-down)
Features
  Simple features (e.g., length, character type, …)
  Relational features (e.g., next-token, …)

Advantages
  Expressive rule representation

Drawbacks
  Single-slot rule generation
  Large volume of training data
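The difference between the two feature kinds can be sketched as follows: a simple feature maps a token to a value, while a relational feature maps a token to another token, so that simple features can then be tested on the neighbor. This is only an illustration of the idea, not SRV's implementation; the function names and the character-type categories are assumptions.

```python
def char_type(token):
    """Simple feature: a coarse character type for one token."""
    if token.isdigit():
        return "numeric"
    if token.isalpha():
        return "capitalized" if token[0].isupper() else "alphabetic"
    if all(not ch.isalnum() for ch in token):
        return "punctuation"
    return "mixed"


def simple_features(token):
    """Simple features map a token to plain values."""
    return {"length": len(token), "char_type": char_type(token)}


def next_token(tokens, index):
    """Relational feature: map a token position to its right neighbor.

    Returns the neighbor's index, or None at the end of the sequence.
    """
    return index + 1 if index + 1 < len(tokens) else None


tokens = "Room 652 at 3 pm".split()

# A relational test: "the token after tokens[0] is numeric".
neighbor = next_token(tokens, 0)
neighbor_is_numeric = simple_features(tokens[neighbor])["char_type"] == "numeric"
```

Chaining a relational feature with a simple one is what lets a rule refer to the context of a candidate token as well as to the token itself.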

SRV Rule

WHISK [1998]

Covering Algorithm (top-down)
Advantages
  Learns multi-slot extraction rules
  Handles various orders of the items to be extracted
  Handles document types from free text to structured text

Drawbacks
  Must see all the permutations of items
  Less expressive feature set
  Needs a large volume of training data

WHISK Rule

WIEN [1997]

Assumes
  Items are always in fixed, known order

Introduces several types of wrappers
Advantages
  Fast to learn and extract

Drawbacks
  Cannot handle permutations and missing items
  Must label entire pages
  Does not use semantic classes
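The fixed-order assumption can be illustrated with a sketch of executing a left-right (LR) delimiter wrapper, the simplest wrapper type described in the WIEN work: each attribute has one left and one right delimiter, and the page is scanned for them in a fixed cycle. The sketch covers only applying an already-known wrapper, not learning the delimiters, and the function name `lr_extract` and the sample page are illustrative.

```python
def lr_extract(page, delimiters):
    """Apply a left-right wrapper to a page.

    delimiters: one (left, right) string pair per attribute, in the fixed
    order in which the attributes appear in every record.
    Returns a list of rows, each row a list of extracted strings.
    """
    rows, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return rows          # no further record on the page
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows          # unterminated value: stop
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(row)


PAGE = "<B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR>"
DELIMITERS = [("<B>", "</B>"), ("<I>", "</I>")]
rows = lr_extract(PAGE, DELIMITERS)
```

Extraction is a single linear scan, which is why such wrappers are fast. The same scan shows the drawback listed above: if the first record lacks its `<I>…</I>` value, the wrapper silently pairs "Congo" with the next record's code, because nothing but position ties a value to its record.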

WIEN Rule

SoftMealy [1998]

Learns a transducer
Advantages
  Learns order of items
  Allows item permutations and missing items
  Allows both the use of semantic classes and disjunctions

Drawbacks
  Must see all possible permutations
  Cannot use delimiters that do not immediately precede and follow the relevant items

SoftMealy Rule

STALKER [1998, 1999, 2001]

Hierarchical Information Extraction
Embedded Catalog Tree (ECT) Formalism
Advantages
  Extracts nested data
  Allows item permutations and missing items
  Need not see all of the permutations
  One hard-to-extract item does not affect others

Drawbacks
  Does not exploit item order
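STALKER rules locate each item with landmarks, in the style of "skip to this string, then to that one". The sketch below is a simplified illustration of applying such rules, assuming plain string landmarks only (no wildcards or disjunctions) and a flat document with no ECT hierarchy; the function names, rule table, and sample document are all hypothetical.

```python
def skip_to(text, landmarks):
    """Start rule: consume each landmark in turn, scanning forward.

    Returns the position just after the last landmark, or None on failure.
    """
    pos = 0
    for landmark in landmarks:
        index = text.find(landmark, pos)
        if index == -1:
            return None
        pos = index + len(landmark)
    return pos


def back_to(text, landmarks):
    """End rule: consume each landmark in turn, scanning backward from the end.

    Returns the position where the last landmark begins, or None on failure.
    """
    pos = len(text)
    for landmark in landmarks:
        index = text.rfind(landmark, 0, pos)
        if index == -1:
            return None
        pos = index
    return pos


def extract(text, start_rule, end_rule):
    """Extract one item using its own start and end rule."""
    start, end = skip_to(text, start_rule), back_to(text, end_rule)
    if start is None or end is None or start > end:
        return None
    return text[start:end]


DOC = "<p>Name: <b>Yala</b><p>Cuisine: Thai<p>Phone: <i>(602) 508-1570</i>"

# One independent (start rule, end rule) pair per item.
RULES = {
    "name": (["Name:", "<b>"], ["</b>"]),
    "phone": (["Phone:", "<i>"], ["</i>"]),
    "address": (["Address:", "<i>"], ["</i>"]),
}
record = {item: extract(DOC, start, end) for item, (start, end) in RULES.items()}
```

Because every item has its own rules, the missing address yields `None` without disturbing the name or the phone number, and the items could appear in any order. The same independence is why item order is not exploited.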

STALKER Rule

Web IE Tools (main technique used)

Wrapper languages (TSIMMIS, Web-OQL)
HTML-aware (W4F, XWRAP, RoadRunner, Lixto)
NLP-based (RAPIER, SRV, WHISK)
Inductive learning (WIEN, SoftMealy, STALKER)
Modeling-based (NoDoSE, DEByE)
Ontology-based (BYU ontology)

Degree of Automation

Trade-off: page-layout dependent

RoadRunner
  Assumes target pages were automatically generated from some data sources
  The only fully automatic wrapper generator

BYU ontology
  Manually created with graphical editing tool
  Extraction process fully automatic

Support of Complex Objects

Complex objects: nested objects, graphs, trees, complex tables, …

Earlier tools, such as RAPIER, SRV, WHISK, and WIEN, do not support extraction of complex objects.
BYU ontology: supported

Page Contents

Semistructured data (table type, richly tagged)
Semistructured text (text type, rarely tagged)

NLP-based tools: text type only
Other tools (except ontology-based): table type only
BYU ontology: both types

Ease of Use

HTML-aware tools, easiest to use

Wrapper languages, hardest to use

Other tools, in the middle

Output

XML is the best output format for data sharing on the Web.

Support for Non-HTML Sources

NLP-based and ontology-based tools: supported automatically
Other tools: may support non-HTML sources, but need additional helpers such as syntactic and semantic analyzers

BYU ontology: supported

Resilience and Adaptiveness

Resilience: continuing to work properly when the target pages change
Adaptiveness: working properly with pages from other sources in the same application domain

Only BYU ontology has both features.

Summary of Qualitative Analysis

Graphical Perspective of Qualitative Analysis

Columns: Name, Structure, Semi, Free, Single-slot, Multi-slot, Missing items, Permutations, Nested data, Resilient

WIEN: X X X
SoftMealy: X X X X X X*
STALKER: X X X * X X X
RAPIER: X X ? X X X ?
SRV: X X ? X X X ?
WHISK: X X X X X X X* ?
AutoSlog: X X X X
RoadRunner: X X X X X
BYU Onto: X X ? X X X X X X

X means the information extraction system has the capability.
X* means the system has the capability as long as the training corpus can accommodate the required training data.
? means the system has the capability to some degree.
* means that the extraction pattern itself does not show the capability, but the overall system has it.

Problem of IE (unstructured documents)

[Figure: Information Extraction from Source to Target, with levels Meaning, Knowledge, Information, Data]

Problem of IE (structured documents)

[Figure: Information Extraction from Source to Target, with levels Meaning, Knowledge, Information, Data]

Problem of IE (semistructured documents)

[Figure: Information Extraction from Source to Target, with levels Meaning, Knowledge, Information, Data]

Solution of IE (the Semantic Web)

[Figure: Information Extraction from Source to Target, with levels Meaning, Knowledge, Information, Data]