information extraction
DESCRIPTION
Presentation given during the Computer Linguistics course at UPF (Universitat Pompeu Fabra), covering the following topic: Information Extraction, by Jerry R. Hobbs (University of Southern California) and Ellen Riloff (University of Utah).
TRANSCRIPT
Information Extraction
Joan Hurtado
Ignacio Delgado
Contents
• INTRODUCTION
• IE TASKS
• IE WITH CASCADED FINITE-STATE TRANSDUCERS
• LEARNING-BASED APPROACHES TO IE
• HOW GOOD IS IE
0 INTRODUCTION
What is IE?
• Information Extraction is the process of scanning text for information relevant to some interest
• Extract:
  • Entities
  • Relations
  • Events
• Who did what to whom, when and where
Why IE?
• Need for efficient processing of texts in specialized domains
• Focus on relevant parts, ignore the rest
• Typical applications:
  • Gleaning business, government and military intelligence
  • WWW searches (more specific than keywords)
  • Scientific literature searches
  • …
Most common uses
• Named Entity Recognition
  • Identify names and special entities (dates, times)
  • Uses textual patterns
  • Important in biomedical applications
• IE is more than NER
  • Recognition of events and their participants
How to measure performance
• Recall: what percentage of the correct answers did the system get?
• Precision: what percentage of the system’s answers were correct?
• F-score: weighted harmonic mean of recall and precision
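The three measures can be sketched in a few lines of Python. This is a minimal illustration, assuming that answers are compared as exact matches; the function name `precision_recall_f` and the `beta` weight parameter are illustrative choices, not something taken from the slides.

```python
def precision_recall_f(gold, predicted, beta=1.0):
    """Score a set of system answers against a set of correct answers.

    beta weights recall against precision: beta=1 gives the balanced
    F1 score, beta>1 favours recall, beta<1 favours precision.
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score


# Toy example: 4 correct answers, 3 system answers, 2 of them right.
p, r, f = precision_recall_f({"A", "B", "C", "D"}, {"A", "B", "E"})
print(p, r, f)
```

In the toy example, precision is 2/3, recall is 1/2, and the balanced F-score is 4/7.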
1 IE TASKS
Unstructured vs. Semi-structured text
• Unstructured
  • Natural language sentences
  • Semantics depends on linguistic analysis
  • Examples: news stories, magazine articles, books, …
• Semi-structured
  • Structured data
  • Semantics defined by its organization
  • Physical layout plays a role in interpretation
  • Examples: job postings, rental ads, …
Single-document vs. Multi-document
• Originally, IE systems were designed for individual documents
• Nowadays, new systems extract facts from the WWW
• Both use similar techniques
• Distinguishing issue: redundancy
  • Multi-document systems can exploit redundancy
  • However, they must face the challenge of cross-document coreference resolution
• Multi-document IE systems are also referred to as open-domain
Assumptions about Incoming Documents
• Relevant-only documents
• Single-event documents
2 IE WITH CASCADED FINITE-STATE TRANSDUCERS
Complex Words
• Identify multiwords, company names, people names, locations, dates, times and basic entities
• Recognition strategies:
  • Patterns
  • Dictionaries
  • Context
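A rough sketch of the three recognition strategies, assuming English text. Everything here is a toy assumption made for illustration: the regular expressions, the `LOCATIONS` dictionary and the function name `find_complex_words` are hypothetical and far smaller than what a real system would need.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Patterns: dates and times have a regular surface form.
# Context: a company suffix or a personal title signals the entity type.
PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b")),
    ("TIME", re.compile(r"\b\d{1,2}:\d{2}\b")),
    ("COMPANY", re.compile(r"\b(?:[A-Z][a-z]+ )+(?:Inc|Corp|Ltd)\.")),
    ("PERSON", re.compile(r"\b(?:Mr|Ms|Dr)\. [A-Z][a-z]+")),
]

# Dictionary: a (hypothetical, tiny) list of known location names.
LOCATIONS = {"Barcelona", "Salt Lake City"}


def find_complex_words(text):
    """Return (label, string) pairs in order of appearance in the text."""
    found = []
    for label, pattern in PATTERNS:
        for m in pattern.finditer(text):
            found.append((m.start(), label, m.group()))
    for name in LOCATIONS:
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((m.start(), "LOCATION", m.group()))
    return [(label, string) for _, label, string in sorted(found)]


sentence = ("Dr. Hobbs visited Acme Widgets Inc. in Barcelona "
            "on 3 March 2011 at 10:30.")
for label, string in find_complex_words(sentence):
    print(label, string)
```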
Basic Phrases
• Some syntactic constructs can be identified with reasonable reliability:
  • Noun group
  • Verb group
• Strategies:
  • Simple finite-state grammars
• Ambiguities
  • Noun-verb ambiguity
  • Verbs locally ambiguous
• Problems
  • Not all languages have a clear distinction between noun and verb groups
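A simple finite-state grammar for noun and verb groups can be sketched as a regular expression over part-of-speech tags. The tag set, the two grammar rules and the function name `chunk` are simplifying assumptions for illustration; the input is assumed to be already tagged, which sidesteps the noun-verb ambiguity mentioned above.

```python
import re

# Map each (assumed, Penn-style) tag to one character so that a regular
# expression over the tag string acts as a finite-state grammar.
TAG_CODES = {"DT": "D", "JJ": "J", "NN": "N", "MD": "M", "RB": "R", "VB": "V"}

# Noun group: optional determiner, any adjectives, one or more nouns.
# Verb group: optional modal, any adverbs, one or more verbs.
GROUPS = re.compile(r"(?P<NG>D?J*N+)|(?P<VG>M?R*V+)")


def chunk(tagged):
    """tagged: list of (word, tag) pairs. Returns (group_type, text) pairs."""
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged)
    chunks = []
    for m in GROUPS.finditer(codes):
        words = [word for word, _ in tagged[m.start():m.end()]]
        chunks.append((m.lastgroup, " ".join(words)))
    return chunks


tagged = [("The", "DT"), ("new", "JJ"), ("company", "NN"), ("will", "MD"),
          ("soon", "RB"), ("hire", "VB"), ("engineers", "NN")]
print(chunk(tagged))
```

Because each tag is one character, a match position in the tag string is also a position in the word list, which keeps the mapping back to the words trivial.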
Complex Phrases
• Recognize complex noun and verb groups
• Complex noun groups
  • Appositives
  • Measure phrases
  • Prepositional attachments (of, for)
  • Noun group conjunction
• Complex verb groups
  • Verb conjunction
  • Verb groups with same significance
• Domain-relevant entities can be recognized
Domain Events
• Ignore anything not identified in previous phases
• Domain events require domain-specific patterns for identification
• Strategy: finite-state machines
• A certain kind of “pseudo-syntax” can be done
• Nowadays, IE systems are beginning to rely on full-sentence parsing
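One possible domain-specific pattern, written as a small explicit finite-state machine over the phrases produced by earlier stages. The acquisition domain, the trigger verbs, the state names and the function name `find_acquisitions` are all hypothetical choices made for this sketch.

```python
# Hypothetical trigger verbs for an "acquisition" event.
TRIGGERS = {"acquired", "bought"}


def find_acquisitions(phrases):
    """phrases: list of (kind, text) pairs, kind is None if unidentified.

    Recognizes the pattern COMPANY, verb group with a trigger, COMPANY.
    """
    events = []
    state, buyer = "START", None
    for kind, text in phrases:
        if kind is None:
            continue  # not identified in previous phases: ignore it
        if state == "START":
            if kind == "COMPANY":
                state, buyer = "HAVE_BUYER", text
        elif state == "HAVE_BUYER":
            if kind == "VG" and TRIGGERS & set(text.lower().split()):
                state = "HAVE_VERB"
            elif kind == "COMPANY":
                buyer = text  # restart the pattern from this company
            else:
                state, buyer = "START", None
        else:  # state == "HAVE_VERB"
            if kind == "COMPANY":
                events.append(
                    {"event": "ACQUISITION", "buyer": buyer, "target": text}
                )
            state, buyer = "START", None
    return events


phrases = [("COMPANY", "Acme Corp."), (None, "reportedly"),
           ("VG", "has acquired"), ("COMPANY", "Widget Ltd."),
           (None, "yesterday")]
print(find_acquisitions(phrases))
```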
Template Generation: Merging Structures
• Previous stages operate within the bounds of single sentences
• Operate over the whole text to combine previously collected information into a unified whole
• If recognizing multiple events:
  • Determine how many distinct events there are
  • Assign each entity to the appropriate event
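A minimal sketch of merging, assuming each sentence yields a partial template stored as a dictionary of slots. The compatibility test used here (no slot filled with two different values) is a deliberately naive assumption; real systems also rely on coreference and domain knowledge. The names `compatible` and `merge_templates` are illustrative.

```python
def compatible(a, b):
    """Two partial templates are compatible if no shared slot disagrees."""
    return all(a[slot] == b[slot] for slot in a.keys() & b.keys())


def merge_templates(partials):
    """Greedily fold sentence-level templates into distinct events."""
    merged = []
    for partial in partials:
        for template in merged:
            if compatible(template, partial):
                template.update(partial)  # same event: fill in more slots
                break
        else:
            merged.append(dict(partial))  # conflict everywhere: new event
    return merged


partials = [
    {"event": "ACQUISITION", "buyer": "Acme"},
    {"event": "ACQUISITION", "target": "Widget"},
    {"event": "ACQUISITION", "buyer": "Globex", "date": "2011"},
]
print(merge_templates(partials))
```

In the toy input, the first two partial templates merge into one event, and the third starts a second event because its buyer conflicts.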
3 LEARNING-BASED APPROACHES TO IE
Supervised Learning of Extraction Patterns & Rules
• Reduce the knowledge engineering bottleneck required to create an IE system for a new domain
• Examples:
  • AutoSlog: creates lexico-syntactic patterns
  • PALKA: patterns generalized based on word semantics
  • LIEP: identifies syntactic paths related to roles
  • CRYSTAL: “concept nodes” with lexical, syntactic and semantic constraints
  • WHISK: learns regular expressions
  • Many others: SRV, RAPIER, …
Supervised Learning of Sequential Classifier Models
• View IE as a classification problem that can be tackled using sequential learning models
• Read sequentially and label each word as an extraction or a non-extraction
• Typical labeling scheme: IOB (Inside, Outside, Beginning of desired extraction)
• Strategies:
  • Hidden Markov Models
  • Maximum Entropy Classifiers
  • Support Vector Machines
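The IOB scheme can be illustrated without any learning model: one helper turns annotated spans into per-word labels (the training signal), and another turns a label sequence back into extractions (what is done with a classifier's output). The function names and the token-index span format are assumptions of this sketch.

```python
def spans_to_iob(tokens, spans):
    """spans: list of (start, end) token indices, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


def iob_to_extractions(tokens, tags):
    """Collect each B followed by its run of I tags as one extraction."""
    extractions, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                extractions.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:  # "O", or a stray "I" with no open extraction
            if current:
                extractions.append(" ".join(current))
            current = []
    if current:
        extractions.append(" ".join(current))
    return extractions


tokens = "Acme Corp. acquired Widget Ltd. yesterday".split()
tags = spans_to_iob(tokens, [(0, 2), (3, 5)])
print(tags)
print(iob_to_extractions(tokens, tags))
```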
Weakly Supervised and Unsupervised Approaches
• Annotating training text is still time-consuming and complex
• Further techniques learn extraction patterns using weakly supervised and unsupervised systems
• Examples:
  • AutoSlog-TS (preclassified corpus in which texts are identified as relevant or irrelevant)
  • Ex-Disco (manually defined seed; patterns ranked; best patterns selected and added to the seed)
  • Meta-bootstrapping (seed nouns that belong to a semantic class)
  • On-Demand Information Extraction (dynamically learns from queries)
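A generic bootstrapping loop in the spirit of the seed, rank, select cycle can be sketched as follows. This is not the algorithm of any system named above: the scoring function (fraction of a pattern's documents that are currently considered relevant), the document representation and the name `bootstrap_patterns` are all assumptions made for illustration.

```python
def bootstrap_patterns(docs, seed, rounds=1):
    """docs: list of sets, each holding the patterns found in one document.

    Each round: documents matching an accepted pattern count as relevant,
    candidate patterns are ranked by how often they occur in relevant
    documents, and the best one is added to the accepted set.
    """
    accepted = set(seed)
    for _ in range(rounds):
        relevant = [d for d in docs if d & accepted]
        candidates = set().union(*docs) - accepted
        best, best_score = None, 0.0
        for pattern in sorted(candidates):  # sorted for a stable tie-break
            total = sum(1 for d in docs if pattern in d)
            in_relevant = sum(1 for d in relevant if pattern in d)
            score = in_relevant / total
            if score > best_score:
                best, best_score = pattern, score
        if best is None:
            break  # no candidate occurs in any relevant document
        accepted.add(best)
    return accepted


docs = [
    {"<company> acquired <company>", "<company> bought <company>"},
    {"<company> bought <company>", "shares of <company>"},
    {"<person> said", "shares of <company>"},
    {"<person> said"},
]
print(bootstrap_patterns(docs, {"<company> acquired <company>"}))
```

Running more rounds on this toy corpus would also accept "shares of <company>", which hints at why such loops need careful ranking and stopping criteria.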
Discourse-oriented Approaches to IE
• Most IE systems' patterns focus only on the local surrounding context
• Extend systems to have a more global view
• Strategy:
  • Add constraints to connect entities in different clauses
  • Decision trees (WRAP-UP)
  • Set of classifiers to identify new templates (ALICE)
4 HOW GOOD IS IE
How are IE systems progressing?
• The 60% barrier in performance
• Biggest mistakes are in entity and event coreference
• Implicit knowledge in NL is not stated explicitly in texts
• Problems in the training data are not found in the test data
• Good IE systems typically recognize 90% of entities
• An event requires about 4 entities
• 0.9*0.9*0.9*0.9 = 65.61%
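The arithmetic behind the barrier can be checked directly. The sketch assumes that each entity is recognized independently of the others, which is the assumption implicit in multiplying the four probabilities; the function name `event_accuracy` is illustrative.

```python
def event_accuracy(entity_accuracy, entities_per_event):
    """Chance that every entity of an event is right, assuming independence."""
    return entity_accuracy ** entities_per_event


# 90% per entity, about 4 entities per event.
print(f"{event_accuracy(0.9, 4):.2%}")  # prints 65.61%
```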
THANKS