Information Extraction

Joan Hurtado, Ignacio Delgado

Contents

• INTRODUCTION

• IE TASKS

• IE WITH CASCADED FINITE-STATE TRANSDUCERS

• LEARNING-BASED APPROACHES TO IE

• HOW GOOD IS IE

0 INTRODUCTION

What is IE?

Information Extraction is the process of scanning text for information relevant to some interest

Extract:
• Entities
• Relations
• Events

Who did what to whom, when and where

Why IE?

Need for efficient processing of texts in specialized domains

Focus on relevant parts, ignore the rest

Typical applications:
• Gleaning business, government, and military intelligence
• WWW searches (more specific than keywords)
• Scientific literature searches
• …

Most common uses

Named Entity Recognition (NER)
• Identify names and special entities (dates, times)
• Uses textual patterns
• Important in biomedical applications
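A minimal sketch of how textual patterns can pick out special entities such as dates and times. The two regular expressions and the function name `find_special_entities` are assumptions made for illustration; they cover only one date format and 24-hour times, and are not taken from any particular NER system.

```python
import re

# Hypothetical patterns for illustration only; real NER systems use far richer rules.
DATE_RE = re.compile(
    r"\b(?:\d{1,2}\s+)?(?:January|February|March|April|May|June|July|"
    r"August|September|October|November|December)\s+\d{4}\b"
)
TIME_RE = re.compile(r"\b(?:[01]?\d|2[0-3]):[0-5]\d\b")


def find_special_entities(text):
    """Return (label, matched_text) pairs for dates and times, in text order."""
    hits = [(m.start(), "DATE", m.group()) for m in DATE_RE.finditer(text)]
    hits += [(m.start(), "TIME", m.group()) for m in TIME_RE.finditer(text)]
    return [(label, matched) for _, label, matched in sorted(hits)]


entities = find_special_entities("The merger was announced on 3 March 2009 at 14:30.")
print(entities)
```

Names of people and companies usually need dictionaries and context as well, since patterns alone rarely suffice for them.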

IE is more than NER
• Recognition of events and their participants

How to measure performance

Recall
• What percentage of the correct answers did the system get?

Precision
• What percentage of the system’s answers were correct?

F-score
• Weighted harmonic mean of recall and precision
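The three measures can be sketched in a few lines. The function name `precision_recall_f` and the representation of answers as sets of (label, text) pairs are assumptions for illustration; the weighting uses the standard F-beta form, where `beta=1.0` weighs recall and precision equally.

```python
def precision_recall_f(gold, predicted, beta=1.0):
    """Compare system answers against the correct (gold) answers."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    # Precision: share of the system's answers that were correct.
    precision = correct / len(predicted) if predicted else 0.0
    # Recall: share of the correct answers that the system found.
    recall = correct / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    # Weighted harmonic mean of precision and recall.
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score


# Made-up example: four correct answers, the system returns two of them.
gold = {("PER", "Ann Lee"), ("ORG", "Acme"), ("LOC", "Paris"), ("DATE", "2009")}
predicted = {("PER", "Ann Lee"), ("ORG", "Acme")}
p, r, f = precision_recall_f(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f={f:.2f}")
```

In this made-up case every answer is correct but half the gold answers are missed, so the F-score sits between the two values, closer to the lower one.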


1 IE TASKS

Unstructured vs. Semi-structured text

Unstructured
• Natural language sentences
• Semantics depends on linguistic analysis
• Examples: news stories, magazine articles, books, …

Semi-structured
• Structured data
• Semantics defined by its organization
• Physical layout plays a role in interpretation
• Examples: job postings, rental ads, …

Single-document vs. Multi-document

Originally, IE systems were designed for individual documents

Nowadays, new systems extract facts from the WWW

Both use similar techniques

Distinguishing issue: redundancy
• Multi-document systems can exploit redundancy
• However, they must face the challenge of cross-document coreference resolution

Multi-document IE systems are also referred to as open-domain

Assumptions about Incoming Documents
• Relevant-only documents
• Single-event documents

2 IE WITH CASCADED FINITE-STATE TRANSDUCERS

Complex Words

Identify multiwords, company names, people names, locations, dates, times and basic entities

Recognition strategies:
• Patterns
• Dictionaries
• Context

Basic Phrases

Some syntactic constructs can be identified with reasonable reliability:
• Noun group
• Verb group

Strategies:
• Simple finite-state grammars

Ambiguities:
• Noun-verb ambiguity
• Verbs are locally ambiguous

Problems:
• Not all languages have a clear distinction between noun and verb groups
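A simple finite-state grammar for noun groups can be sketched as the pattern DET? ADJ* NOUN+ over part-of-speech tags. The tag names, the function `noun_groups` and the pre-tagged input are assumptions for illustration; a real system would also have to resolve the noun-verb ambiguities listed above before or during this step.

```python
def noun_groups(tagged):
    """Find maximal DET? ADJ* NOUN+ spans in a list of (word, tag) pairs."""
    groups, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        # Optional determiner.
        if j < n and tagged[j][1] == "DET":
            j += 1
        # Zero or more adjectives.
        while j < n and tagged[j][1] == "ADJ":
            j += 1
        # One or more nouns are required to accept the span.
        k = j
        while k < n and tagged[k][1] == "NOUN":
            k += 1
        if k > j:
            groups.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return groups


sentence = [
    ("the", "DET"), ("new", "ADJ"), ("company", "NOUN"), ("hired", "VERB"),
    ("a", "DET"), ("senior", "ADJ"), ("software", "NOUN"), ("engineer", "NOUN"),
]
print(noun_groups(sentence))
```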

Complex Phrases

Recognize complex noun and verb groups

Complex noun groups:
• Appositives
• Measure phrases
• Prepositional attachments (of, for)
• Noun group conjunction

Complex verb groups:
• Verb conjunction
• Verb groups with the same meaning

Domain-relevant entities can be recognized

Domain Events

Ignore anything not identified in previous phases

Domain events require domain-specific patterns for identification

Strategy: finite-state machines

A certain kind of “pseudo-syntax” can be done

Nowadays, IE systems are beginning to rely on full-sentence parsing
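A domain-specific pattern can be sketched as a scan over the chunks produced by earlier phases. Everything named here is hypothetical: the chunk labels (`COMPANY`, `VG`), the trigger verbs, and the acquisition domain are chosen only to illustrate the idea.

```python
# Hypothetical chunk labels and trigger verbs, chosen only for illustration.
ACQUIRE_VERBS = {"acquired", "bought", "purchased"}


def find_acquisitions(chunks):
    """Scan (label, text) chunks for the pattern COMPANY VG(acquire) COMPANY."""
    events = []
    for (l1, t1), (l2, t2), (l3, t3) in zip(chunks, chunks[1:], chunks[2:]):
        is_trigger = l2 == "VG" and any(w in ACQUIRE_VERBS for w in t2.split())
        if l1 == "COMPANY" and is_trigger and l3 == "COMPANY":
            events.append({"event": "ACQUISITION", "buyer": t1, "target": t3})
    return events


chunks = [
    ("COMPANY", "Acme Corp"), ("VG", "has acquired"), ("COMPANY", "Widget Ltd"),
    ("OTHER", "for"), ("MONEY", "$5 million"),
]
print(find_acquisitions(chunks))
```

Chunks that match no pattern are simply skipped, which mirrors the idea of ignoring anything not identified in previous phases.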

Template Generation: Merging Structures

Previous stages operate within the bounds of single sentences

This stage operates over the whole text to combine previously collected information into a unified whole

If recognizing multiple events:
• Determine how many distinct events there are
• Assign each entity to the appropriate event
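One naive way to merge structures is to combine partial templates whenever their filled slots do not conflict. This greedy heuristic, the slot names and the functions `compatible` and `merge_templates` are assumptions for illustration, not the method of any specific system; real systems also rely on coreference and discourse cues.

```python
def compatible(a, b):
    """Two partial templates are compatible if their shared slots agree."""
    return all(a[slot] == b[slot] for slot in a.keys() & b.keys())


def merge_templates(partials):
    """Greedily merge sentence-level partial templates into distinct events."""
    merged = []
    for partial in partials:
        for event in merged:
            if compatible(event, partial):
                event.update(partial)  # fill in the slots the event lacked
                break
        else:
            merged.append(dict(partial))  # conflicts with all: a new event
    return merged


partials = [
    {"type": "ACQUISITION", "buyer": "Acme Corp"},
    {"type": "ACQUISITION", "target": "Widget Ltd"},
    {"type": "ACQUISITION", "buyer": "Globex"},
]
print(merge_templates(partials))
```

Here the first two partial templates merge into one event, while the third has a conflicting buyer and therefore counts as a second, distinct event.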

3 LEARNING-BASED APPROACHES TO IE

Supervised Learning of Extraction Patterns & Rules

Reduce the knowledge engineering bottleneck required to create an IE system for a new domain

Examples:
• AutoSlog: creates lexico-syntactic patterns
• PALKA: patterns generalized based on word semantics
• LIEP: identifies syntactic paths related to roles
• CRYSTAL: “concept nodes” with lexical, syntactic and semantic constraints
• WHISK: learns regular expressions
• Many others: SRV, RAPIER, …

Supervised Learning of Sequential Classifier Models

View IE as a classification problem that can be tackled using sequential learning models

Read the text sequentially and label each word as an extraction or a non-extraction

Typical labeling scheme: IOB
• Inside the desired extraction
• Outside the desired extraction
• Beginning of the desired extraction

Strategies:
• Hidden Markov Models
• Maximum Entropy Classifiers
• Support Vector Machines
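The IOB scheme can be sketched by converting annotated spans into one tag per word, which is the form a sequential classifier is trained on. The function name `to_iob`, the `B-`/`I-` tag spelling and the span format (token start, exclusive end, label) are assumptions for illustration.

```python
def to_iob(tokens, spans):
    """Turn (start, end, label) token spans, end exclusive, into IOB tags."""
    tags = ["O"] * len(tokens)  # Outside by default
    for start, end, label in spans:
        tags[start] = f"B-{label}"  # Beginning of the extraction
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # Inside the extraction
    return tags


tokens = "Acme Corp hired Ann Lee".split()
spans = [(0, 2, "ORG"), (3, 5, "PER")]
for token, tag in zip(tokens, to_iob(tokens, spans)):
    print(token, tag)
```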

Weakly Supervised and Unsupervised Approaches

Annotating training text still takes time and is complex

Further techniques learn to extract using weakly supervised and unsupervised systems

Examples:
• AutoSlog-TS (preclassified corpus in which texts are identified as relevant or irrelevant)
• Ex-Disco (manually defined seed; patterns are ranked and the best patterns are added to the seed)
• Meta-bootstrapping (seed nouns that belong to a semantic class)
• On-Demand Information Extraction (dynamically learns from queries)

Discourse-Oriented Approaches to IE

Most IE systems' patterns focus only on the local surrounding context

Extend systems to have a more global view

Strategies:
• Add constraints to connect entities in different clauses
• Decision trees (WRAP-UP)
• Set of classifiers to identify new templates (ALICE)

4 HOW GOOD IS IE

How are IE systems progressing?

The 60% barrier in performance
• Biggest mistakes are in entity and event coreference
• The implicit knowledge in natural language is not made explicit in texts
• Problems in the training data are not found in the test data
• Good IE systems typically recognize 90% of entities
• An event requires about 4 entities
• 0.9 × 0.9 × 0.9 × 0.9 = 0.6561 (65.61%)
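The arithmetic behind this barrier can be sketched as a product of per-entity accuracies. The function name `event_accuracy` is hypothetical, and the calculation assumes that recognition errors on the entities of one event are independent, which is a simplification.

```python
def event_accuracy(entity_accuracy, entities_per_event):
    """Chance that every entity of an event is recognized correctly.

    Assumes errors on different entities are independent.
    """
    return entity_accuracy ** entities_per_event


print(f"{event_accuracy(0.9, 4):.2%}")  # prints 65.61%
```

Under the same assumption, raising entity accuracy to 95% would lift the event figure to only about 81%, which shows how quickly per-entity errors compound.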

THANKS
