information extraction

Information Extraction Joan Hurtado Ignacio Delgado


Post on 21-Jun-2015



DESCRIPTION

Presentation given during the Computer Linguistics course at UPF (Universitat Pompeu Fabra), covering the following topic: Information Extraction, by Jerry R. Hobbs (University of Southern California) and Ellen Riloff (University of Utah)

TRANSCRIPT

Page 1: Information Extraction

Information Extraction

Joan Hurtado, Ignacio Delgado

Page 2: Information Extraction

Contents

• INTRODUCTION

• IE TASKS

• IE WITH CASCADED FINITE-STATE TRANSDUCERS

• LEARNING-BASED APPROACHES TO IE

• HOW GOOD IS IE

Page 3: Information Extraction

0 INTRODUCTION

Page 4: Information Extraction

What is IE?

• Information Extraction is the process of scanning text for information relevant to some interest
• Extract: entities, relations, events
• Who did what to whom, when and where

Page 5: Information Extraction

Why IE?

• Need for efficient processing of texts in specialized domains
• Focus on relevant parts, ignore the rest
• Typical applications: gleaning business, government, and military intelligence; WWW searches (more specific than keywords); scientific literature searches; …

Page 6: Information Extraction

Most common uses

• Named Entity Recognition (NER): identify names and special entities (dates, times)
• NER uses textual patterns
• NER is important in biomedical applications
• IE is more than NER: recognition of events and their participants
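To make "uses textual patterns" concrete, here is a minimal sketch of pattern-based recognition for dates and times. The two regular expressions and the `find_entities` helper are hypothetical, written only for illustration; a real NER component would need far broader patterns plus dictionaries and context.

```python
import re

# Hypothetical patterns, chosen only for illustration:
# dates such as 21-Jun-2015 and 24-hour times such as 14:30.
DATE = re.compile(
    r"\b\d{1,2}-(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{4}\b"
)
TIME = re.compile(r"\b(?:[01]?\d|2[0-3]):[0-5]\d\b")


def find_entities(text):
    """Return (label, matched_text) pairs in order of appearance."""
    hits = [(m.start(), "DATE", m.group()) for m in DATE.finditer(text)]
    hits += [(m.start(), "TIME", m.group()) for m in TIME.finditer(text)]
    return [(label, value) for _, label, value in sorted(hits)]


print(find_entities("Posted on 21-Jun-2015 at 14:30."))
```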

Page 7: Information Extraction

How to measure performance

• Recall: what percentage of the correct answers did the system get
• Precision: what percentage of the system's answers were correct
• F-score: weighted harmonic mean of recall and precision
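The three measures can be sketched as follows, assuming the gold answers and the system answers are available as collections of comparable items. The function name and the `beta` parameter are choices made for this example; `beta = 1` weights recall and precision equally.

```python
def precision_recall_f(gold, predicted, beta=1.0):
    """Compute precision, recall and F-score for two collections of answers."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    # Weighted harmonic mean of precision and recall.
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score


# Hypothetical answers: 4 gold items, 3 system items, 2 of them correct.
p, r, f = precision_recall_f({"a", "b", "c", "d"}, {"a", "b", "x"})
print(p, r, f)
```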

Page 8: Information Extraction


1 IE TASKS

Page 9: Information Extraction

Unstructured vs. Semi-structured text

• Unstructured: natural language sentences
• Semantics depends on linguistic analysis
• Examples: news stories, magazine articles, books, …

• Semi-structured: structured data
• Semantics defined by its organization
• Physical layout plays a role in interpretation
• Examples: job postings, rental ads, …

Page 10: Information Extraction

Single-document vs. Multi-document

• Originally, IE systems were designed for individual documents
• Nowadays, new systems extract facts from the WWW
• Both use similar techniques
• Distinguishing issue: redundancy
• Multi-document systems can exploit redundancy
• However, they must address the challenge of cross-document coreference resolution
• Multi-document IE systems are also referred to as open-domain

Page 11: Information Extraction

Assumptions about Incoming Documents

• Relevant-only documents
• Single-event documents

Page 12: Information Extraction

2 IE WITH CASCADED FINITE-STATE TRANSDUCERS

Page 13: Information Extraction
Page 14: Information Extraction

Complex Words

• Identify multiwords, company names, people names, locations, dates, times, and other basic entities
• Recognition strategies: patterns, dictionaries, context

Page 15: Information Extraction

Basic Phrases

• Some syntactic constructs can be identified with reasonable reliability: noun groups, verb groups
• Strategy: simple finite-state grammars
• Ambiguities: noun-verb ambiguity; verbs are locally ambiguous
• Problem: not all languages have a clear distinction between noun and verb groups
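A simple finite-state grammar for noun groups can be sketched as a scan over part-of-speech tagged words. The grammar `DET? ADJ* NOUN+`, the tag names, and the `noun_groups` function are assumptions made for this example; they are not the grammar of any particular system.

```python
def noun_groups(tagged):
    """Greedy finite-state chunker for DET? ADJ* NOUN+ over (word, tag) pairs."""
    groups, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DET":          # optional determiner
            j += 1
        while j < n and tagged[j][1] == "ADJ":   # any number of adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] == "NOUN":  # one or more nouns
            k += 1
        if k > j:  # at least one noun was read: accept the group
            groups.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:      # no noun group starts here: move on
            i += 1
    return groups


# Hypothetical tagged sentence.
sentence = [("the", "DET"), ("local", "ADJ"), ("company", "NOUN"),
            ("hired", "VERB"), ("a", "DET"), ("new", "ADJ"),
            ("chief", "NOUN"), ("executive", "NOUN")]
print(noun_groups(sentence))
```

The sketch takes the tags as given, so it sidesteps the noun-verb ambiguity mentioned above; in practice that ambiguity has to be resolved before or during chunking.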

Page 16: Information Extraction

Complex Phrases

• Recognize complex noun and verb groups
• Complex noun groups: appositives, measure phrases, prepositional attachments (of, for), noun group conjunction
• Complex verb groups: verb conjunction, verb groups with the same significance
• Domain-relevant entities can be recognized

Page 17: Information Extraction

Domain Events

• Ignore anything not identified in previous phases
• Domain events require domain-specific patterns for identification
• Strategy: finite-state machines
• A certain kind of "pseudo-syntax" can be done
• Nowadays, IE systems are beginning to rely on full-sentence parsing
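A domain-specific pattern over the output of the earlier phases might look like the sketch below. The "hiring" event, the chunk types (`COMPANY`, `VG`, `PERSON`), and the template slots are all hypothetical; the point is only that the pattern matches typed chunks rather than raw words.

```python
# Hypothetical pattern for a "hiring" event:
# COMPANY, then the verb group "hired", then PERSON.
PATTERN = [("COMPANY", None), ("VG", "hired"), ("PERSON", None)]


def match_events(chunks):
    """chunks: (type, text) pairs from earlier stages; returns filled templates."""
    events = []
    for i in range(len(chunks) - len(PATTERN) + 1):
        window = chunks[i:i + len(PATTERN)]
        if all(ctype == ptype and (ptext is None or text == ptext)
               for (ctype, text), (ptype, ptext) in zip(window, PATTERN)):
            events.append({"company": window[0][1], "person": window[2][1]})
    return events


chunks = [("COMPANY", "Acme"), ("VG", "hired"), ("PERSON", "Ann Lee"),
          ("VG", "said"), ("COMPANY", "Globex")]
print(match_events(chunks))
```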

Page 18: Information Extraction

Template Generation: Merging Structures

• Previous stages operate within the bounds of single sentences
• This stage operates over the whole text to combine the previously collected information into a unified whole
• If recognizing multiple events: determine how many distinct events there are; assign each entity to the appropriate event
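One simple way to picture merging is to combine partial templates whenever their filled slots do not conflict. This is a deliberately naive sketch under that assumption: the `merge_templates` function and the slot names are hypothetical, and partial templates with no slot in common are merged too, which a real system would have to restrict.

```python
def merge_templates(partials):
    """Greedily merge partial templates whose filled slots do not conflict."""
    events = []
    for partial in partials:
        for event in events:
            shared = set(event) & set(partial)
            if all(event[slot] == partial[slot] for slot in shared):
                event.update(partial)   # compatible: same event
                break
        else:
            events.append(dict(partial))  # conflicts with all: new event
    return events


# Hypothetical sentence-level results for one text.
partials = [
    {"company": "Acme", "position": "CEO"},
    {"company": "Acme", "person": "Ann Lee"},
    {"company": "Globex", "person": "Bob Ray"},
]
print(merge_templates(partials))
```

Here the first two partial templates end up in one event and the third starts a second event, which covers both questions on the slide: how many events there are, and which entity belongs to which.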

Page 19: Information Extraction

3 LEARNING-BASED APPROACHES TO IE

Page 20: Information Extraction

Supervised Learning of Extraction Patterns & Rules

• Reduce the knowledge-engineering bottleneck required to create an IE system for a new domain
• Examples:
  • AutoSlog: creates lexico-syntactic patterns
  • PALKA: patterns generalized based on word semantics
  • LIEP: identifies syntactic paths related to roles
  • CRYSTAL: "concept nodes" with lexical, syntactic, and semantic constraints
  • WHISK: learns regular expressions
  • Many others: SRV, RAPIER, …

Page 21: Information Extraction

Supervised Learning of Sequential Classifier Models

• View IE as a classification problem that can be tackled using sequential learning models
• Read the text sequentially and label each word as an extraction or a non-extraction
• Typical labeling scheme: IOB (Inside, Outside, Beginning of a desired extraction)
• Strategies: Hidden Markov Models, Maximum Entropy classifiers, Support Vector Machines
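The IOB scheme can be illustrated by converting between token spans and tag sequences. The sketch uses untyped `B`/`I`/`O` tags and hypothetical function names; it shows only the encoding, not the classifier that predicts the tags.

```python
def spans_to_iob(n_tokens, spans):
    """spans: (start, end) token index pairs, end exclusive."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


def iob_to_spans(tags):
    """Recover (start, end) spans; a stray I after O is treated as a B."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            if start is not None:      # a B directly after another extraction
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:              # extraction running to the end
        spans.append((start, len(tags)))
    return spans


tokens = "Ann Lee joined Acme Corp yesterday".split()
tags = spans_to_iob(len(tokens), [(0, 2), (3, 5)])
print(list(zip(tokens, tags)))
```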

Page 22: Information Extraction

Weakly supervised and unsupervised approaches

• Annotating training text is still time-consuming and complex
• Further techniques learn extraction patterns using weakly supervised and unsupervised systems
• Examples:
  • AutoSlog-TS (preclassified corpus in which texts are identified as relevant or irrelevant)
  • Ex-Disco (manually defined seed; patterns are ranked, and the best patterns are selected and added to the seed)
  • Meta-bootstrapping (seed nouns that belong to a semantic class)
  • On-Demand Information Extraction (dynamically learns from queries)
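The general bootstrapping idea behind seed-based approaches can be sketched as a loop: score patterns by how many known seed words they extract, pick the best one, and add everything it extracts to the lexicon. This is a simplified illustration only, not the published algorithm of any system listed above; the scoring rule, the `bootstrap` function, and the data are assumptions.

```python
from collections import defaultdict


def bootstrap(extractions, seeds, rounds=2):
    """extractions: (pattern, noun) pairs; grows the seed lexicon per round."""
    by_pattern = defaultdict(set)
    for pattern, noun in extractions:
        by_pattern[pattern].add(noun)

    lexicon, used = set(seeds), set()
    for _ in range(rounds):
        candidates = [p for p in by_pattern
                      if p not in used and by_pattern[p] & lexicon]
        if not candidates:
            break

        def score(p):
            # Fraction of the pattern's nouns already in the lexicon;
            # the pattern string breaks ties deterministically.
            return (len(by_pattern[p] & lexicon) / len(by_pattern[p]), p)

        best = max(candidates, key=score)
        used.add(best)
        lexicon |= by_pattern[best]
    return lexicon, used


# Hypothetical pattern extractions and seed nouns for a "location" class.
extractions = [("offices in X", "london"), ("offices in X", "paris"),
               ("offices in X", "madrid"), ("flew to X", "paris"),
               ("flew to X", "lima"), ("bought X", "shares")]
print(bootstrap(extractions, {"london", "paris"}))
```

With these toy data the lexicon grows to include "madrid" and then "lima", while "bought X" is never selected because it extracts no known location.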

Page 23: Information Extraction

Discourse-oriented approaches to IE

• Most IE systems' patterns focus only on the surrounding local context
• Extend systems to have a more global view
• Strategy:
  • Add constraints to connect entities in different clauses
  • Decision trees (WRAP-UP)
  • Set of classifiers to identify new templates (ALICE)

Page 24: Information Extraction

4 HOW GOOD IS IE

Page 25: Information Extraction

How are IE systems progressing?

• The 60% barrier in performance
• The biggest mistakes are in entity and event coreference
• The implicit knowledge in natural language is not made explicit in texts
• Problems in the training data are not found in the test data
• Good IE systems typically recognize 90% of entities
• An event requires about 4 entities: 0.9 × 0.9 × 0.9 × 0.9 = 0.6561 (65.61%)
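The arithmetic above can be checked with a short sketch. It assumes, as the slide does implicitly, that each entity is recognized independently with the same accuracy; the function name is hypothetical.

```python
def event_accuracy(entity_accuracy, entities_per_event):
    """Chance that every entity of an event is recognized, assuming independence."""
    return entity_accuracy ** entities_per_event


print(f"{event_accuracy(0.9, 4):.2%}")  # prints 65.61%
```

Under the same assumption, events that need more entities fare worse, which is one way to read the 60% barrier.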

Page 26: Information Extraction

THANKS