instaread : an ide for ie
DESCRIPTION
InstaRead : An IDE for IE. Short Presentation in CSE454. Raphael Hoffmann January 17 th , 2013. From Text To Knowledge. Citigroup has taken over EMI, the British music label of the Beatles and Radiohead, under a restructuring of its debt, - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: InstaRead : An IDE for IE](https://reader035.vdocuments.site/reader035/viewer/2022081520/5681650b550346895dd78059/html5/thumbnails/1.jpg)
InstaRead: An IDE for IE
Raphael HoffmannJanuary 17th, 2013
Short Presentation in CSE454
![Page 2: InstaRead : An IDE for IE](https://reader035.vdocuments.site/reader035/viewer/2022081520/5681650b550346895dd78059/html5/thumbnails/2.jpg)
From Text To Knowledge
Citigroup has taken over EMI,
the British music label of the
Beatles and Radiohead, under
a restructuring of its debt,
EMI announced on Tuesday.
companyband
organization
thing
Beatles
EMICitigrouplabel
acquired
instOf
isA
instOf instOf
isAisA
E
![Page 3: InstaRead : An IDE for IE](https://reader035.vdocuments.site/reader035/viewer/2022081520/5681650b550346895dd78059/html5/thumbnails/3.jpg)
Human Effort
Evaluation #rels #ann words dev timeACE 2004 51 train 300K 12 monthsMR IC 2010 15 train 115K 12 monthsMR KBP 2011 16 train 118K 12 months
team of annotatorsworking in parallel
teams of researchersworking in parallel
Motivates research on learning with less supervision, but best-performing systems use more direct human input (eg. rules)
![Page 4: InstaRead : An IDE for IE](https://reader035.vdocuments.site/reader035/viewer/2022081520/5681650b550346895dd78059/html5/thumbnails/4.jpg)
A Typical IE System
Time-intensive to build
POS
Tagg
er
Cons
titue
nt P
arse
r
Toke
nize
r
Sent
ence
Spl
itter
Docu
men
t Cla
ssifi
er
Patte
rn M
atch
er
WordNetFreebase DomainRules
AnnotatedTraining
DataExtractionPatterns
FeatureSet
Entit
y Ex
trac
tor
Entit
y Li
nker
Rela
tion
Extr
acto
r
DomainEntity
Databases
Infe
renc
e
ParametersCo
refe
renc
e Re
solv
erAnnotated
Training Data
Rela
tion
Clas
sifier
Domain Ontology
![Page 5: InstaRead : An IDE for IE](https://reader035.vdocuments.site/reader035/viewer/2022081520/5681650b550346895dd78059/html5/thumbnails/5.jpg)
Development Cycle
Analyze
AdjustML algorithm
Integrate resource (e.g. WordNet)
Analyze
AnnotateData
Refine Rules
Analyze
Analyze
…
![Page 6: InstaRead : An IDE for IE](https://reader035.vdocuments.site/reader035/viewer/2022081520/5681650b550346895dd78059/html5/thumbnails/6.jpg)
Need Many Iterations
Analyze
AdjustML algorithm
Integrate resource (e.g. WordNet)
Analyze
AnnotateData
Refine Rules
Analyze
Analyze
![Page 7: InstaRead : An IDE for IE](https://reader035.vdocuments.site/reader035/viewer/2022081520/5681650b550346895dd78059/html5/thumbnails/7.jpg)
Iteration is Slow
No direct manipulation
Remove barriers!
Analyze
AdjustML algorithm
Integrate resource (e.g. WordNet)
Analyze
AnnotateData
Refine Rules
Analyze
Analyze
…
ML expertiserequired
No standardizedformats
Different progr. lang.,different data structures
No appropriate visualization
No advancedqueries
No expressive rule languageconnecting components
No interactive speeds
![Page 8: InstaRead : An IDE for IE](https://reader035.vdocuments.site/reader035/viewer/2022081520/5681650b550346895dd78059/html5/thumbnails/8.jpg)
InstaRead: An IDE for IELoad textdatasets
ProvideDBs of instances
Writeextraction
rules in logicSee rule
recommendations
Visualizations Seedistributionallysimilar words
Collect &organize
rulesStatistics
Create relations
![Page 9: InstaRead : An IDE for IE](https://reader035.vdocuments.site/reader035/viewer/2022081520/5681650b550346895dd78059/html5/thumbnails/9.jpg)
Key Ideas1. User writes rules in simple, expressive
languageUse First-Order Logic
2. User instantly sees extractions on lots of textUse Database Indexing
3. User gets automatic rule suggestionsUse Bootstrapping, Learning
![Page 10: InstaRead : An IDE for IE](https://reader035.vdocuments.site/reader035/viewer/2022081520/5681650b550346895dd78059/html5/thumbnails/10.jpg)
Demo• Browser-based tool• API
– cd /projects/pardosa/s1/raphaelh/github/readr/exp– source ../../init.sh– mvn exec:java –Dexec.mainClass=newexp.Materialize
![Page 11: InstaRead : An IDE for IE](https://reader035.vdocuments.site/reader035/viewer/2022081520/5681650b550346895dd78059/html5/thumbnails/11.jpg)
Project IdeasIn general: Create a new component & use InstaRead to explore, tie components together, and analyze
• Integrate exist. component (eg. Entity Linking, OpenIE)• Mine rules from WordNet, FrameNet, VerbNet, Wiktionary• Generate rule-suggestions through clustering• Generate rule-suggestions through (better)
bootstrapping• Set rule (and thus extraction) confidence based on
overlap with knowledge-base• Develop rules/code to handle specific linguistic
phenomenon (eg. time, modality, negation)• Experiment with joint-inference of parsing + rules
![Page 12: InstaRead : An IDE for IE](https://reader035.vdocuments.site/reader035/viewer/2022081520/5681650b550346895dd78059/html5/thumbnails/12.jpg)
More detailed slides (optional)
![Page 13: InstaRead : An IDE for IE](https://reader035.vdocuments.site/reader035/viewer/2022081520/5681650b550346895dd78059/html5/thumbnails/13.jpg)
Rules are Crucial
• Example
• Rules as patterns• Rules as features• Rules (or space of rules)
typically supplied
For both hand-engineeredand learned systems
ExtractionQuality
Time
1h
Goal: Enable experts to write quality rules extremely quickly
![Page 14: InstaRead : An IDE for IE](https://reader035.vdocuments.site/reader035/viewer/2022081520/5681650b550346895dd78059/html5/thumbnails/14.jpg)
Writing Rules in Logic• FOL1 is simple, expressive, extensible, widely
used• Introduce predicates for NER,
dependencies, ...
• Rules are deterministic, execute in defined order
1 We use the subset referred to as ‘safe domain-relational calculus’
![Page 15: InstaRead : An IDE for IE](https://reader035.vdocuments.site/reader035/viewer/2022081520/5681650b550346895dd78059/html5/thumbnails/15.jpg)
Rule Composition• Define new predicates for similar substructure
• More compact set of rules, better generalization
9 instead of 24 rules!
![Page 16: InstaRead : An IDE for IE](https://reader035.vdocuments.site/reader035/viewer/2022081520/5681650b550346895dd78059/html5/thumbnails/16.jpg)
Enabling Instant Execution• Translation to SQL allows dynamic
optimization
• Each predicate maps to a fragment of SQL• Intensional and extensional predicates• Indices, caching, SSDs important
![Page 17: InstaRead : An IDE for IE](https://reader035.vdocuments.site/reader035/viewer/2022081520/5681650b550346895dd78059/html5/thumbnails/17.jpg)
Experimental Setup
½ NYTimes1
½ NYTimes1
Develop • 4 relational extractors
in 55min each
22M news sentences
22M news sentences
5K selected sentences(some gold annotations) CoNLL042
Compare • To a weakly supervised system
• Bootstrap, Keyword, Morphology, and Decomposition features
• To gold annotations for error analysis
Datasets Procedure
1 LDC2008T19, 2 Roth & Yih, 2004
NYTimes0711M news sentences
![Page 18: InstaRead : An IDE for IE](https://reader035.vdocuments.site/reader035/viewer/2022081520/5681650b550346895dd78059/html5/thumbnails/18.jpg)
Comparison to Weakly Supervised Extraction
attendedSchool founded killed married
Precision 1.00 .91 .90 .90#extractions 1,411 997 189 4,694
Precision 0 .71 N/A .50
#extractions 5 14 N/A 2
Rules
Weaklysupervised
• Precision consistently at least 90%• Works well, even when weak supervision fails
![Page 19: InstaRead : An IDE for IE](https://reader035.vdocuments.site/reader035/viewer/2022081520/5681650b550346895dd78059/html5/thumbnails/19.jpg)
Development Phases1. Bootstrap Tool (0:15)
2. Keywords Tool (0:15)
3. Morphology Feature (0:05)mined morphology from Wiktionary (tense etc.)
4. Decomposition Feature (0:20)enable chaining of rules
![Page 20: InstaRead : An IDE for IE](https://reader035.vdocuments.site/reader035/viewer/2022081520/5681650b550346895dd78059/html5/thumbnails/20.jpg)
Comparison of Development Phases
• Bootstrap initially effective, but recall limited• Decomposition effective for some relations
![Page 21: InstaRead : An IDE for IE](https://reader035.vdocuments.site/reader035/viewer/2022081520/5681650b550346895dd78059/html5/thumbnails/21.jpg)
Error Analysis
False Negatives due to
Preprocessing (missing predictions)
NER 30
Dependencies 25
Co-references 7
Rules (missing predictions)
Lexical items 89
Syntactic variation 17
Reasoning chain 10
False Positives due to
Preprocessing (wrong predictions)
NER 0
Dependencies 2
Co-references 0
Rules (wrong predictions)
Lexical items 0
Syntactic variation 0
Reasoning chain 0
• 36% of all errors are due to incorrect pre-processing
• On CoNLL04 ‘killed’ wrt gold annotations: Re .34, Pr .98
![Page 22: InstaRead : An IDE for IE](https://reader035.vdocuments.site/reader035/viewer/2022081520/5681650b550346895dd78059/html5/thumbnails/22.jpg)
Joint Parsing & Relation Extraction
• If top parse is wrong ...– In 35% of cases: correct at top-2– In 50% of cases: correct among top-5– In 90% of cases: correct among top-50
StanfordDependency
Extractor
RelationExtraction
Rules
CJ PhraseStructure
Parser
Text PSP Deps Tuples
deterministic deterministicstatistical
K-best
Recall Precision F1
Pipeline .342 .978 .507
Joint .412 .973 .5818/19
flipped tocorrect parse Almost no
drop in precision
• Heuristic: Select first parse among top-k with most extractions
![Page 23: InstaRead : An IDE for IE](https://reader035.vdocuments.site/reader035/viewer/2022081520/5681650b550346895dd78059/html5/thumbnails/23.jpg)
Runtime Performanceavg median max
#SQL queries per relation (55min) 55 54 66#join tables per SQL query 4.3 4 11
Execution time (s) per SQL query 1.5 .074 52.5
• On 22M sentences, 3.7B rows in 75 tables, 140GB
• Most queries execute in 74ms or less• Outliers due to Bootstrap tool– Aggregation over millions of rows motivates
streaming or sampling
![Page 24: InstaRead : An IDE for IE](https://reader035.vdocuments.site/reader035/viewer/2022081520/5681650b550346895dd78059/html5/thumbnails/24.jpg)
Logic• Often Horn clauses are sufficient, but sometimes
we need ¬,∀,∃• Example 1: founded relation
Michael Dell built his first company in a dorm-room.
Mr. Harris built Dell into a formidable competitor to IBM.
Desired conjunct• Example 2: Integration of co-reference
“nearest noun-phrase which satisfies …”