1 discovering and utilizing structure in large unstructured text datasets eugene agichtein math and...

Discovering and Utilizing Structure in Large Unstructured Text DatasetsEugene Agichtein

Math and Computer Science Department

Information Extraction Example Information extraction systems represent text in

structured form

May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire , is finding itself hard pressed to cope with the crisis…

Date Disease Name Location

Jan. 1995 Malaria Ethiopia

July 1995 Mad Cow Disease U.K.

Feb. 1995 Pneumonia U.S.

May 1995 Ebola Zaire

Disease Outbreaks in The New York Times

Information Extraction System

How can information extraction help?

… allow precise and efficient querying … allow returning answers instead of documents … support powerful query constructs … allow data integration with (structured) RDBMS … provide input to data mining & statistics analysis

Large Text Collection Structured Relation

Goal: Detect, Monitor, Predict Outbreaks

Current Patient Records: Diagnosis, physician’s notes, lab results/analysis, …

911 CallsTraffic accidents, …

Historical news, breaking news stories,wire, alerts, …

Hospital Records

IESys 4

IESys 3

IESys 2

IESys 1

Data Integration, Data Mining, Trend Analysis

Detection, Monitoring, Prediction

Challenges in Information Extraction

Portability Reduce effort to tune for new domains and tasks MUC systems: experts would take 8-12 weeks to tune

Scalability, Efficiency, Access Enable information extraction over large collections 1 sec / document * 5 billion docs = 158 CPU years

Approach: learn from data ( “Bootstrapping” ) Snowball: Partially Supervised Information Extraction Querying Large Text Databases for Efficient Information Extraction

Outline Information extraction overview

Partially supervised information extraction Adaptivity Confidence estimation

Text retrieval for scalable extraction Query-based information extraction Implicit connections/graphs in text databases

Current and future work Inferring and analyzing social networks Utility-based extraction tuning Multi-modal information extraction and data mining Authority/trust/confidence estimation

What is “Information Extraction”

Filling slots in a database from sub-segments of text.As a task:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME TITLE ORGANIZATION

What is “Information Extraction”

Filling slots in a database from sub-segments of text.As a task:

October 14, 2002, 4:00 a.m. PT

NAME TITLE ORGANIZATIONBill Gates CEO MicrosoftBill Veghte VP MicrosoftRichard Stallman founder Free Soft..

What is “Information Extraction”Information Extraction =

segmentation + classification + clustering + association

As a familyof techniques:

October 14, 2002, 4:00 a.m. PT

Microsoft CorporationCEOBill GatesMicrosoftGatesMicrosoftBill VeghteMicrosoftVPRichard StallmanfounderFree Software Foundation

segmentation + classification + association + clustering

October 14, 2002, 4:00 a.m. PT

Microsoft CorporationCEOBill GatesMicrosoftGatesMicrosoftBill VeghteMicrosoftVPRichard StallmanfounderFree Software Foundation N

TITLE ORGANIZATION

Bill Gates

Microsoft

Bill Veghte

Microsoft

Richard Stallman

founder

Free Soft..

IE in Context

Create ontology

SegmentClassifyAssociateCluster

Load DB

Spider

Query,Search

Data mine

Documentcollection

Database

Filter by relevance

Label training data

Train extraction models

Information Extraction Tasks Extracting entities and relations

Entities Named (e.g., Person) Generic (e.g., disease name)

Relations Entities related in a predefined way (e.g., Location of a Disease

outbreak) Discovered automatically

Common information extraction steps: Preprocessing: sentence chunking, parsing, morphological analysis Rules/extraction patterns: manual, machine learning, and hybrid Applying extraction patterns to extract new information

Postprocessing and complex extraction: not covered Co-reference resolution Combining Relations into Events, Rules, …

Two kinds of IE approaches

Knowledge Engineering

rule based developed by experienced

language engineers make use of human

intuition requires only small amount

of training data development could be very

time consuming some changes may be

hard to accommodate

Machine Learning

use statistics or other machine learning

developers do not need LE expertise

requires large amounts of annotated training data

some changes may require re-annotation of the entire training corpus

annotators are cheap (but you get what you pay for!)

Extracting Entities from Text

Any of these models can be used to capture words, formatting or both.

Lexicons

AlabamaAlaska…WisconsinWyoming

Sliding WindowClassify Pre-segmented

Candidates

Finite State Machines Context Free GrammarsBoundary Models

Abraham Lincoln was born in Kentucky.

member?

Abraham Lincoln was born in Kentucky.Abraham Lincoln was born in Kentucky.

Classifier

which class?

…and beyond

Classifier

which class?

Try alternatewindow sizes:

Classifier

which class?

BEGIN END BEGIN END

Most likely state sequence?

NNP V P NPVNNP

Hidden Markov ModelsS

Finite state model Graphical model

Parameters: for all states S={s1,s2,…} Start state probabilities: P(st ) Transition probabilities: P(st|st-1 ) Observation (emission) probabilities: P(ot|st )Training: Maximize probability of training observations (w/ prior)

11 )|()|(),(

ttttt soPssPosP

...transitions

observations

o1 o2 o3 o4 o5 o6 o7 o8

Generates:

State sequenceObservation sequence

Usually a multinomial over atomic, fixed alphabet

IE with Hidden Markov Models

Yesterday Lawrence Saul spoke this example sentence.

Person name: Lawrence Saul

Given a sequence of observations:

and a trained HMM:

Find the most likely state sequence: (Viterbi)

Any words said to be generated by the designated “person name”state extract as a person name:

),(maxarg osPs

HMM Example: “Nymble”

Other examples of shrinkage for HMMs in IE: [Freitag and McCallum ‘99]

Task: Named Entity Extraction

Train on 450k words of news wire text.

Case Language F1 .Mixed English 93%Upper English 91%Mixed Spanish 90%

[Bikel, et al 1998], [BBN “IdentiFinder”]

Person

(Five other name classes)

start-of-sentence

end-of-sentence

Transitionprobabilities

Observationprobabilities

P(st | st-1, ot-1 ) P(ot | st , st-1 )

Back-off to: Back-off to:

P(st | st-1 )

P(st )

P(ot | st , ot-1 )

P(ot | st )

P(ot )

Results:

Relation Extraction

Extract structured relations from text

May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire , is finding itself hard pressed to cope with the crisis…

Date Disease Name Location

Jan. 1995 Malaria Ethiopia

July 1995 Mad Cow Disease U.K.

Feb. 1995 Pneumonia U.S.

May 1995 Ebola Zaire

Disease Outbreaks in The New York Times

Relation Extraction Typically require Entity Tagging as preprocessing

Knowledge Engineering Rules defined over lexical items

“<company> located in <location>” Rules defined over parsed text

“((Obj <company>) (Verb located) (*) (Subj <location>))” Proteus, GATE, …

Machine Learning-based Learn rules/patterns from examples

Dan Roth 2005, Cardie 2006, Mooney 2005, … Partially-supervised: bootstrap from “seed” examples

Agichtein & Gravano 2000, Etzioni et al., 2004, …

Recently, hybrid models [Feldman2004, 2006]

Comparison of Approaches Use “language-engineering” environments

to help experts create extraction patterns GATE [2002], Proteus [1998]

Train system over manually labeled data Soderland et al. [1997], Muslea et al. [2000], Riloff et al. [1996]

Exploit large amounts of unlabeled data DIPRE [Brin 1998], Snowball [Agichtein & Gravano 2000] Etzioni et al. (’04): KnowItAll: extracting unary relations Yangarber et al. (’00, ’02): Pattern refinement, generalized names

detection

significanteffort

substantialeffort

minimaleffort

The Snowball System: Overview

Snowball

Text Database

Organization Location Conf

Microsoft Redmond 1

IBM Armonk 1

Intel Santa Clara 1

AG Edwards St Louis 0.9

Air Canada Montreal 0.8

7th Level Richardson 0.8

3Com Corp Santa Clara 0.8

3DO Redwood City 0.7

3M Minneapolis 0.7

MacWorld San Francisco 0.7

157th Street Manhattan 0.52

15th Party Congress

China 0.3

15th Century Europe

Dark Ages 0.1

... ... ..... ... ..

Snowball: Getting User Input

User input: • a handful of example instances• integrity constraints on the relation e.g., Organization is a “key”, Age > 0, etc…

GetExamples

Evaluate Tuples

Generate Extraction Patterns

Tag Entities

Extract Tuples

Find Example Occurrences in Text

ACM DL 2000

Organization Headquarters

Microsoft Redmond

IBM Armonk

Intel Santa Clara

Can use any Can use any full-text search full-text search engineengine

Snowball: Finding Example Occurrences Get

Examples

Evaluate Tuples

Tag Entities

Extract Tuples

Search Engine

Text Database

Organization Headquarters

Microsoft Redmond

IBM Armonk

Intel Santa Clara

Computer servers at Microsoft’s headquarters in Redmond…

In mid-afternoon trading, shares of Redmond, WA-based Microsoft Corp

The Armonk-based IBM introduced a new line…

Change of guard at IBM Corporation’s headquarters near Armonk, NY ...

Named Named entityentity taggerstaggers can recognize can recognize DatesDates, , PeoplePeople, , LocationsLocations, , OrganizationsOrganizations, …, … MITRE’s MITRE’s AlembicAlembic, IBM’s , IBM’s TalentTalent, , LingPipeLingPipe, …, …

Snowball: Tagging EntitiesGet

Examples

Evaluate Tuples

Tag Entities

Extract Tuples

Computer servers at Microsoft ’s headquarters in Redmond…

In mid-afternoon trading, shares of Redmond, WA -based Microsoft Corp

The Armonk -based IBM introduced a new line…

Change of guard at IBM Corporation‘s headquarters near Armonk, NY ...

Snowball: Extraction Patterns

General extraction pattern model: acceptor0, Entity, acceptor1, Entity, acceptor2

Acceptor instantiations: String Match (accepts string “’s headquarters in”) Vector-Space (~ vector [(-’s,0.5), (headquarters, 0.5), (in,

0.5)] ) Sequence Classifier (Prob(T=valid | ‘s, headquarters, in) )

HMMs, Sparse sequences, Conditional Random Fields, …

Computer servers at Microsoft’s headquarters in Redmond…

Snowball: Generating Patterns Get

Examples

Evaluate Tuples

Tag Entities

Extract Tuples

1 Represent occurrences Represent occurrences as vectors of as vectors of tagstags and and termsterms

LOCATIONORGANIZATION {<'s 0.57>, <headquarters 0.57>, < in 0.57>}

LOCATION ORGANIZATION{<- 0.71>, < based 0.71>}

LOCATIONORGANIZATION {<‘s 0.57>, <headquarters 0.57>, < near 0.57>}

2 Cluster Cluster similarsimilar occurrences.occurrences.

Snowball: Generating Patterns Get

Examples

Evaluate Tuples

Tag Entities

Extract Tuples

LOCATIONORGANIZATION { <'s 0.71>, <headquarters 0.71>}

Create Create patternspatterns as filtered as filtered clustercluster centroidscentroids

1Represent occurrences Represent occurrences as vectors of as vectors of tagstags and and termsterms

2 Cluster Cluster similarsimilar occurrences.occurrences.

Vector Space Clustering

Google 's new headquarters in Mountain View are …

Snowball: Extracting New TuplesMatch tagged text fragments against patterns

GetExamples

Evaluate Tuples

Tag Entities

Extract Tuples

ORGANIZATION {<'s 0.71>, <headquarters 0.71> }

{<located 0.71>, < in 0.71>}

LOCATION {<- 0.71>, <based 0.71>

Match=0.8

Match=0.4

Match=0

ORGANIZATION

LOCATION

V ORGANIZATION {<'s 0.5>, <new 0.5> <headquarters 0.5>, < in 0.5>} {<are 1>} LOCATION

Snowball: Evaluating Patterns

Automatically estimate Automatically estimate patternpattern confidenceconfidence::Conf(P4)= Conf(P4)= Positive / TotalPositive / Total

= 2/3 = 0.66= 2/3 = 0.66

GetExamples

Evaluate Tuples

Tag Entities

Extract Tuples

IBM, Armonk, reported… PositiveIntel, Santa Clara, introduced... Positive

“Bet on Microsoft”, New York-based analyst Jane Smith said... Negative

LOCATIONORGANIZATION { < , 1> } P4Organization Headquarters

IBM Armonk

Intel Santa Clara

Microsoft Redmond

Current seed tuples

Snowball: Evaluating Tuples

Automatically evaluate tuple confidence:

Conf(T) =

A tuple has high confidence if generated by high-confidence patterns.

GetExamples

Evaluate Tuples

Tag Entities

Extract Tuples

P4: 0.663COM Santa Clara

{<- 0.75>, <based 0.75>}P3: 0.95

Conf(T): 0.83

)PMatch(*)Conf(P-1-1 i

LOCATIONORGANIZATION { < , 1> }

LOCATION ORGANIZATION

Snowball: Evaluating TuplesGet

Examples

Evaluate Tuples

Tag Entities

Extract Tuples

Organization Headquarters Conf

Microsoft Redmond 1

IBM Armonk 1

Intel Santa Clara 1

3M Minneapolis 0.7

157th Street Manhattan 0.52

15th Party Congress

China 0.3

15th Century Europe

Dark Ages 0.1

... .... ..... .... .. ... .... ..... .... ..

Keep only Keep only high-confidencehigh-confidence tuples for next iterationtuples for next iteration

Snowball: Evaluating TuplesGet

Examples

Evaluate Tuples

Tag Entities

Extract Tuples

Organization Headquarters Conf

Microsoft Redmond 1

IBM Armonk 1

Intel Santa Clara 1

3M Minneapolis 0.7

Start new iteration with Start new iteration with expandedexpanded example setexample setIterate until no new tuples are extractedIterate until no new tuples are extracted

Pattern-Tuple Duality A “good” tuple:

Extracted by “good” patterns Tuple weight goodness

A “good” pattern: Generated by “good” tuples Extracts “good” new tuples Pattern weight goodness

Edge weight: Match/Similarity of tuple context

to pattern

How to Set Node Weights Constraint violation (from before)

Conf(P) = Log(Pos) Pos/(Pos+Neg) Conf(T) =

HITS [Hassan et al., EMNLP 2006] Conf(P) = ∑Conf(T) Conf(T) = ∑Conf(P)

URNS [Downey et al., IJCAI 2005]

EM-Spy [Agichtein, SDM 2006] Unknown tuples = Neg Compute Conf(P), Conf(T) Iterate

)PMatch(*)Conf(P-1-1 i

Evaluating Patterns and Tuples: Expectation Maximization

EM-Spy Algorithm “Hide” labels for some seed

tuples

Iterate EM algorithm to convergence on tuple/pattern confidence values

Set threshold t such that (t > 90% of spy tuples)

Re-initialize Snowball using new seed tuples

Organization Headquarters Initial Final

Microsoft Redmond 1 1

IBM Armonk 1 0.8

Intel Santa Clara 1 0.9

AG Edwards St Louis 0 0.9

Air Canada Montreal 0 0.8

7th Level Richardson 0 0.8

3Com Corp Santa Clara 0 0.8

3DO Redwood City 0 0.7

3M Minneapolis 0 0.7

MacWorld San Francisco 0 0.7

157th Street Manhattan 0 0.52

15th Party Congress

China 0 0.3

15th Century Europe

Dark Ages 0 0.1

Adapting Snowball for New Relations Large parameter space Initial seed tuples (randomly chosen, multiple runs) Acceptor features: words, stems, n-grams, phrases, punctuation, POS Feature selection techniques: OR, NB, Freq, ``support’’, combinations Feature weights: TF*IDF, TF, TF*NB, NB Pattern evaluation strategies: NN, Constraint violation, EM, EM-Spy

Automatically estimate parameter values: Estimate operating parameters based on occurrences of seed tuples Run cross-validation on hold-out sets of seed tuples for optimal perf. Seed occurrences that do not have close “neighbors” are discarded

Example Task: DiseaseOutbreaks

Proteus: 0.409Snowball: 0.415

SDM 2006

Snowball Used in Various Domains News: NYT, WSJ, AP [DL’00, SDM’06]

CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks

Medical literature: PDR, Micromedex… [Thesis] AdverseEffects, DrugInteractions,

RecommendedTreatments

Biological literature: GeneWays corpus [ISMB’03] Gene and Protein Synonyms

Current and future work Inferring and analyzing social networks Utility-based extraction tuning Multi-modal information extraction and data mining Authority/trust/confidence estimation

Extracting A Relation From a Large Text Database

Brute force approach: feed all docs to information extraction system

Only a tiny fraction of documents are often useful Many databases are not crawlable Often a search interface is available, with existing

keyword index How to identify “useful” documents?

InformationExtraction

System

Text Database StructuredRelation

]Expensive for large collections

An Abstract View of Text-Centric Tasks

Output tuples

…Extraction

System

Text Database

3. Extract output tuples2. Process documents1. Retrieve documents from database

Task tuple

Information Extraction Relation Tuple

Database Selection Word (+Frequency)

Focused Crawling Web Page about a Topic

[Ipeirotis, Agichtein, Jain, Gavano, SIGMOD 2006]

Executing a Text-Centric TaskOutput tuples

…Extraction

System

Text Database

3. Extract output tuples

2. Process documents

1. Retrieve documents from database

Similar to relational world

Two major execution paradigms Scan-based: Retrieve and process documents sequentially Index-based: Query database (e.g., [case fatality rate]), retrieve and process documents in results

Unlike the relational world

Indexes are only “approximate”: index is on keywords, not on tuples of interest Choice of execution plan affects output completeness (not only speed)

→underlying data distribution dictates what is best

ScanOutput tuples

…Extraction

System

Text Database

3. Extract output tuples

2. Process documents

1. Retrieve docs from database

ScanScan retrieves and processes documents sequentially (until reaching target recall)

Execution time = |Retrieved Docs| · (R + P)

Time for retrieving a document

Question: How many documents does Scan retrieve

to reach target recall?

Time for processing a document

Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents (details in paper)

Iterative Query ExpansionOutput tuples

…Extraction

System

Text Database

3. Extract tuplesfrom docs

2. Process retrieved documents

1. Query database with seed tuples

Execution time = |Retrieved Docs| * (R + P) + |Queries| * Q

Time for retrieving a document

Time for answering a query

Question: How many queries and how many documents

does Iterative Set Expansion need to reach target recall?

Time for processing a document

Generation

4. Augment seed tuples with new tuples

Question: How many queries and how many documents

does Iterative Set Expansion need to reach target recall?

(e.g., [Ebola AND Zaire])(e.g., <Malaria, Ethiopia>)

Extracted Relation

QXtract: Querying Text Databases for Robust Scalable Information EXtractionUser-Provided Seed Tuples

Queries

Promising Documents

Text Database

Search Engine

DiseaseName Location Date

Malaria Ethiopia Jan. 1995

Ebola Zaire May 1995

Mad Cow Disease The U.K. July 1995

Pneumonia The U.S. Feb. 1995

Query Generation

Problem: Learn keyword queries to retrieve “promising” documents

Learning Queries to Retrieve Promising Documents

1. Get document sample with “likely negative” and “likely positive” examples.

2. Label sample documents using information extraction system as “oracle.”

3. Train classifiers to “recognize” useful documents.

4. Generate queries from classifier model/rules. Queries

Query Generation

Seed Sampling

Classifier Training

tuple1tuple2tuple3tuple4tuple5

User-Provided Seed Tuples

Text Database

Search Engine

Training Classifiers to Recognize “Useful” Documents

disease reported epidemic expected area

virus reported expected infected patients

products made used exported far

past old homerun sponsored event

Ripper SVM

disease AND reported => USEFUL

virus 3

infected 2

1 discovering and utilizing structure in large unstructured text datasets eugene agichtein math and...

information extraction

open source

data integration

text databasescurrent

source software

subsegments of text

opensource concept

software code

Documents

mining unstructured reviews

mining competitors from large unstructured datasets€¦ ·...

lesson 6 - topics reading sas datasets subsetting sas...

semantic relations extraction from unstructured ... ·...

processing and securing healthcare datasets through hadoop...

baoli li, yandong liu , and eugene agichtein emory...

from unstructured to structured information in military...

sigir 2008 yandong liu, jiang bian, eugene agichtein from...

eugene agichtein microsoft research - emory university

data study group final report: global bank · 3.develop...

tracking of vector field singularities in … of vector...

fv, unstructured

development & validation of a high-resolution, wind … data...

date: 2013/9/25 author: mikhail ageev , dmitry lagun ,...

modeling information seeking behavior in social media eugene...

1 scalable information extraction eugene agichtein

“big data” machine“big data” machine learning for...

unstructured data solutions

creating high-quality financial datasets from unstructured...

ndtma 2018 annual conference · -facial...