![Page 1: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/1.jpg)
Advanced Internet Systems
CSE 454Daniel Weld
![Page 2: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/2.jpg)
Project Presentations - Monday
10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents11:15 - Freshipes11:30 - One Click Books 11:45 - ProjectNomNom12:00 - Read.me 12:15 - Developing Regions
Times are approximate
• 12 min presentation
• 2 min questions
Next groups sets up during Qs
![Page 3: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/3.jpg)
Presentation Guidelines
Every group member should talk Practice, practice
Use time wisely: 12 min Get length right
Assumption: own laptop Else…. 8:30am
![Page 4: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/4.jpg)
Presentation Content
Aspirations & Reality Demo Suprises
What was harder or easier than expected? What You learned Experiments & Validation Division of Labor
![Page 5: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/5.jpg)
Final Reports
Due Friday 12/17 12:00 noon Hardcopy of paper Digital copy of paper Digital copy of in-class presentation Code
Content See Web (and see past examples)
Length
![Page 6: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/6.jpg)
Search UI
CSE 454 Overview
Inverted Indicies
Supervised Learning
Information Extraction
Adverts
Vector Space, TF/IDFIndexing
HTTP, HTML
Cryptography & Security
Human Computation
Scalable Computing & Crawling
Malware
Search Processing
![Page 7: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/7.jpg)
Cyptography
Symmetric + asymmetric ciphers Stream + block ciphers; 1-way hash Z=YZ=YXX mod N mod N
DNS, HTTP, HTMLGet, put, postGet, put, postCookies, log file analysisCookies, log file analysis
![Page 8: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/8.jpg)
DNSHierarchical namespaceHierarchical namespaceSusceptible to DOS attacksSusceptible to DOS attacksRecent news: Google Public DNSRecent news: Google Public DNS
HTTP / HTTPSGet, put, postGet, put, postCookies, log file analysisCookies, log file analysis
![Page 9: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/9.jpg)
SSL / PCT / TLSYou (client) Merchant (server)
Here are the protocols + ciphers I know
Let’s use this protocol; here’s my pub key, a cert + nonce
Using your key, I’ve encrypted a random sym key + nonce
Symmetric CiphersStream + block
Asymmetric ciphersZ=YX mod NDiscrete root finding = hard w/o factors of N.
1-way hash
![Page 10: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/10.jpg)
Securing Requires Threat Modeling
Who are potential attackers? What are their
Resources?Capabilities?Motives?
What assets are you trying to protect? What might attackers try, to get assets?
Must answer these questions early!
Slide by Yoshi Kohno
![Page 11: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/11.jpg)
Form handlingForm handlingDrag & draw canvasDrag & draw canvasNative videoNative video
HTML““Semantic Formatting”Semantic Formatting”Link ExtractionLink Extraction
AJAX
HTML5
JavascriptJavascriptAsync communication of XMLAsync communication of XML
![Page 12: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/12.jpg)
Common Types of Clusters
Simple Web Farm Search Engine Cluster
Inktomi (2001) Supports programs (not users) Persistent data is partitioned across servers: capacity, but data loss if server fails
Layer 4 switches
From: Brewer Lessons from Giant-Scale Services
![Page 13: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/13.jpg)
Structure of Mercator Spider
1. Remove URL from queue2. Simulate network protocols & REP3. Read w/ RewindInputStream (RIS)4. Has document been seen before? (checksums and fingerprints)
5. Extract links6. Download new URL?7. Has URL been seen before?8. Add URL to frontier
Document fingerprints
![Page 14: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/14.jpg)
The Precision / Recall Tradeoff
Precision
Proportion of selected items that are correct
Recall Proportion of target items
that were selected
Precision-Recall curve Shows tradeoff
tn
fp tp fn
Tuples returned by System
Correct Tuples
fptp
tp
fntp
tp
Recall
Pre
cisi
onAuC
![Page 15: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/15.jpg)
15
Vector Space Representation Dot Product as Similarity Metric
TF-IDF for Computing Weights wij = f(i,j) * log(N/ni) Where q = … wordi… N = |docs| ni = |docs with wordi|
How Process Efficiently?
documents
term
s
q dj
t1
t2
Copyright © Weld 2002-2007
![Page 16: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/16.jpg)
04/18/23 22:03 16
Thinking about Efficiency
Clock cycle: 2 GHz Typically completes 2 instructions / cycle
~10 cycles / instruction, but pipelining & parallel execution Thus: 4 billion instructions / sec
Disk access: 1-10ms Depends on seek distance, published average is 5ms Thus perform 200 seeks / sec (And we are ignoring rotation and transfer times)
Disk is 20 Million times slower !!!20 Million times
![Page 17: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/17.jpg)
17
Inverted Files for Multiple Documents
107 4 322 354 381 405232 6 15 195 248 1897 1951 2192677 1 481713 3 42 312 802
WORD NDOCS PTR
jezebel 20
jezer 3
jezerit 1
jeziah 1
jeziel 1
jezliah 1
jezoar 1
jezrahliah 1
jezreel 39jezoar
34 6 1 118 2087 3922 3981 500244 3 215 2291 301056 4 5 22 134 992
DOCID OCCUR POS 1 POS 2 . . .
566 3 203 245 287
67 1 132. . .
“jezebel” occurs6 times in document 34,3 times in document 44,4 times in document 56 . . .
LEXICON
OCCURENCE INDEX
One method. Alta Vista uses alternative
…
Copyright © Weld 2002-2007
![Page 18: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/18.jpg)
04/18/23 22:03 19
AltaVista Basic Framework
Flat 64-bit address space Index Stream Readers: Loc, Next, Seek, Prev Constraints
Let E be ISR for word enddoc Constraints for conjunction a AND b
prev(E) loc(A) loc(A) loc(E) prev(E) loc(B) loc(B) loc(E)
b a e b a b e b
prev(E)
loc(A)
loc(E)loc(B)
![Page 19: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/19.jpg)
Web Search at 15
Number of pages indexed 7/94 Lycos – 54,000 pages 95 – 10^6 millions 97 – 10^7 98 – 10^8 01 – 10^9 billions 05 – 10^10 …
Types of content Web pages, newsgroups Images, videos, maps News, blogs, spaces Shopping, local, desktop Books, papers, many
formats Health, finance, travel …
What’s availableHow search is
accessed
Slide by Susan Dumais
![Page 20: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/20.jpg)
The search box Spelling
suggestions Query suggestions Auto complete Inline answers Richer snippets But, we can do
better
Support for Searchers
UW, CSE 454, Dec 8 2009
… … by understanding contextby understanding context
Slide by Susan Dumais
![Page 21: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/21.jpg)
Search Today
User User ContextContext
Task/Use Task/Use ContextContext
Query WordsQuery Words
Ranked List
Search and Search and ContextContext
Document Document ContextContext
Query WordsQuery Words
Ranked List
Slide by Susan Dumais
![Page 22: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/22.jpg)
Modeling Users Short vs. long term Individual vs. group Implicit vs. explicit
Using User Models Stuff I’ve Seen (re-finding) Personalized Search News Junkie (novelty) User Behavior in Ranking Domain Expertise at Web-scale
Evaluation• Many methods, scales• Individual components and their combinations
Systems/Prototypes• New capabilities and experiences• Algorithms and prototypes• Deploy, evaluate and iterate
Redundancy Temporal Dynamics
Inter-Relationships among DocumentsCategorization and Metadata Reuters, spam, landmarks, web categories … Domain-specific features, timeInterfaces and Interaction Stuff I’ve Seen, Phlat, Timelines, SWISH Tight coupling of browsing and search
Slide by Susan Dumais
![Page 23: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/23.jpg)
50-80% Page Views = Revisits
![Page 24: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/24.jpg)
Resonance
![Page 25: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/25.jpg)
A-B testing
Nine Changes in Site Above
![Page 26: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/26.jpg)
Landscape of IE Techniques: Models
Lexicons
AlabamaAlaska…WisconsinWyoming
Abraham Lincoln was born in Kentucky.
member?
Sliding Window
Abraham Lincoln was born in Kentucky.
Classifier
which class?
Try alternatewindow sizes:
Boundary ModelsAbraham Lincoln was born in Kentucky.
Classifier
which class?
BEGIN END BEGIN END
BEGIN
Context Free Grammars
Abraham Lincoln was born in Kentucky.
NNP V P NPVNNP
NP
PP
VP
VP
S
Mos
t lik
ely
pars
e?
Finite State MachinesAbraham Lincoln was born in Kentucky.
Most likely state sequence?
Slides from Cohen & McCallum
Classify Pre-segmentedCandidates
Abraham Lincoln was born in Kentucky.
Classifier
which class?
Any of these models can be used to capture words, formatting or both.
![Page 27: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/27.jpg)
Motivating Vision Next-Generation Search = Information Extraction + Ontology + Inference
Which German Scientists Taught
at US Universities? …
Einstein was a guest lecturer at the Institute for Advanced Study in New Jersey…
![Page 28: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/28.jpg)
Next-Generation Search
Information Extraction <Einstein, Born-In, Germany> <Einstein, ISA, Physicist> <Einstein, Lectured-At, IAS> <IAS, In, New-Jersey> <New-Jersey, In, United-States> …
Ontology Physicist (x) Scientist(x) …
Inference Einstein = Einstein…
ScalableMeans
Self-Supervised
![Page 29: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/29.jpg)
Why Wikipedia?
Comprehensive High Quality
[Giles Nature 05] Useful Structure
Unique IDs & LinksInfoboxesCategories & ListsFirst SentenceRedirection pages Disambiguation pagesRevision HistoryMultilingual
Comscore MediaMetrix – August 2007
ConsNatural-Language
Missing Data
Inconsistent
Low Redundancy
![Page 30: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/30.jpg)
TheIntelligence in Wikipedia
ProjectOutline
1. Self-Supervised Extraction from Wikipedia text (bootstrapping to the greater Web)
2. Automatic Taxonomy Generation
3. Scalable Probabilistic Inference for Q/A
![Page 31: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/31.jpg)
Outline1. Self-Supervised Extraction from Wikipedia text
Training on Infoboxes Improving Recall – Shrinkage, Retraining, Web
Extraction Community Content Creation
2. Automatic Taxonomy Generation3. Scalable Probabilistic Inference for Q/A
![Page 32: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/32.jpg)
Traditional, Supervised I.E.
Raw Data
Labeled Training
Data
LearningAlgorithm
Extractor
Kirkland-based Microsoft is the largest software company.Boeing moved it’s headquarters to Chicago in 2003.Hank Levy was named chair of Computer Science & Engr.
…
HeadquarterOf(<company>,<city>)
![Page 33: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/33.jpg)
Kylin: Self-Supervised Information Extraction from Wikipedia
[Wu & Weld CIKM 2007]
Its county seat is Clearfield.
As of 2005, the population density was 28.2/km².
Clearfield County was created in 1804 from parts of Huntingdon and Lycoming Counties but was administered as part of Centre County until 1812.
2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.
From infoboxes to a training set
![Page 34: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/34.jpg)
Kylin Architecture
![Page 35: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/35.jpg)
The Precision / Recall Tradeoff
Precision
Proportion of selected items that are correct
Recall Proportion of target items
that were selected
Precision-Recall curve Shows tradeoff
tn
fp tp fn
Tuples returned by System
Correct Tuples
fptp
tp
fntp
tp
Recall
Pre
cisi
onAuC
![Page 36: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/36.jpg)
Preliminary Evaluation
Kylin Performed Well on Popular Classes: Precision: mid 70% ~ high 90%
Recall: low 50% ~ mid 90%
... Floundered on Sparse Classes – Little Training Data
82% < 100 instances; 40% <10 instances
![Page 37: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/37.jpg)
Shrinkage?
performer(44)
.location
actor(8738)
comedian(106)
.birthplace
.birth_place
.cityofbirth
.origin
person(1201)
.birth_place
![Page 38: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/38.jpg)
Outline
1. Self-Supervised Extraction from Wikipedia Text Training on Infoboxes Improving Recall – Shrinkage, Retraining, Web Extraction Community Content Creation
2. Automatic Taxonomy Generation
3. Scalable Probabilistic Inference for Q/A
![Page 39: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/39.jpg)
KOG: Kylin Ontology Generator [Wu & Weld, WWW08]
![Page 40: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/40.jpg)
Subsumption Detection
Person
Scientist
Physicist
6/07
: Ein
stei
n
Binary Classification Problem
Nine Complex FeaturesE.g., String Features… IR Measures… Mapping to Wordnet… Hearst Pattern Matches… Class Transitions in Revision History
Learning AlgorithmSVM & MLN Joint Inference
![Page 41: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/41.jpg)
KOG Architecture
![Page 42: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/42.jpg)
Schema Mapping
Heuristics Edit History String Similarity
• Experiments• Precision: 94% Recall: 87%
• Future• Integrated Joint Inference
Person Performerbirth_datebirth_placenameother_names…
birthdatelocationnameothername…
![Page 43: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/43.jpg)
Outline
1. Self-Supervised Extraction from Wikipedia Text Training on Infoboxes Improving Recall – Shrinkage, Retraining, Web Extraction Community Content Creation
2. Automatic Ontology Generation
3. Scalable Probabilistic Inference for Q/A
![Page 44: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/44.jpg)
Improving Recall on Sparse Classes
[Wu et al. KDD-08]
Shrinkage Extra Training Examples from Related Classes
How Weight New Examples?
performer(44)
actor(8738)
comedian(106)
person(1201)
![Page 45: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/45.jpg)
Improving Recall on Sparse Classes
[Wu et al. KDD-08]
Retraining Compare Kylin Extractions with Tuples from Textrunner
Additional Positive Examples Eliminate False Negatives
TextRunner [Banko et al. IJCAI-07, ACL-08]Relation-Independent ExtractionExploits Grammatical StructureCRF Extractor with POS Tag Features
![Page 46: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/46.jpg)
Recall after Shrinkage / Retraining…
![Page 47: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/47.jpg)
Improving Recall on Sparse Classes
[Wu et al. KDD-08] Shrinkage
Retraining
Extract from Broader Web 44% of Wikipedia Pages = “stub”
Extractor quality irrelevant Query Google & Extract
How maintain high precision? Many Web pages noisy, describe multiple objects How integrate with Wikipedia extractions?
![Page 48: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/48.jpg)
Bootstrapping to the Web
![Page 49: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/49.jpg)
Outline
1. Self-Supervised Extraction from Wikipedia Text Training on Infoboxes Improving Recall – Shrinkage, Retraining, Web Extraction Community Content Creation
2. Automatic Ontology Generation
3. Scalable Probabilistic Inference for Q/A
![Page 50: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/50.jpg)
Problem
Information Extraction is Imprecise Wikipedians Don’t Want 90% Precision
How Improve Precision?
People!
![Page 51: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/51.jpg)
Motivation
Altruism
Self-Esteem, Community
Self-Interest
Money
![Page 52: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/52.jpg)
Accelerate
![Page 53: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/53.jpg)
Contributing as a Non-Primary Task
Encourage contributions Without annoying or abusing readers
Compared 5 different interfaces
![Page 54: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/54.jpg)
![Page 55: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/55.jpg)
Adwords Deployment Study
2000 articles containing writer infobox Query for “ray bradbury” would show
Redirect to mirror with injected JavaScript Round-robin interface selection:
baseline, popup, highlight, icon Track clicks, load, unload, and show survey
[Hoffman et al. 2008]
![Page 56: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/56.jpg)
Results
• Contribution Rate1.6% 13%
• Additional Training DataImproved Extraction Performance
![Page 57: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/57.jpg)
Outline
1. Self-Supervised Extraction from Wikipedia
2. Automatic Ontology Generation
3. Scalable Probabilistic Inference for Q/A
![Page 58: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/58.jpg)
Scalable Probabilistic Inference
Eight MLN Inference RulesTransitivity of predicates, etc.
Knowledge-Based Model ConstructionTested on 100 Million Tulples
Extracted by Textrunner from Web
[Schoenmacker et al. 2008]
![Page 59: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/59.jpg)
Effect of Limited Inference
![Page 60: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/60.jpg)
Conclusion
Self-Supervised Extraction from Wikipedia Training on Infoboxes
Works well on popular classes
Improving Recall – Shrinkage, Retraining, Web Extraction High precision & recall - even on sparse classes, stub articles
Community Content Creation
Automatic Ontology Generation Probabilistic Joint Inference
Scalable Probabilistic Inference for Q/A Simple Inference - Scales to Large Corpora Tested on 100 M Tuples
![Page 61: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/61.jpg)
Internet History 1960s1960 Nelson Xanadu1965 Salton “SMART Doc Retrieval” CACM1969 First ARPAnet message
![Page 62: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/62.jpg)
Internet History 1970s1960 Nelson Xanadu1965 Salton “SMART Doc Retrieval” CACM1969 First ARPAnet message1972 Email1974 Intel 8080 CPU; Design of TCP
![Page 63: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/63.jpg)
Internet History 1980s1960 Nelson Xanadu1965 Salton “SMART Doc Retrieval” CACM1969 First ARPAnet message1972 Email1974 Intel 8080 CPU; Design of TCP1983 ARPAnet (1000 nodes) uses TCP/IP; DNS1990 Berners-Lee creates WWW
![Page 64: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/64.jpg)
Internet History 1990s1960 Nelson Xanadu1965 Salton “SMART Doc Retrieval” CACM1969 First ARPAnet message1972 Email1974 Intel 8080 CPU; Design of TCP1983 ARPAnet (1000 nodes) uses TCP/IP; DNS1990 Berners-Lee creates WWW1994 Pinkerton’s Webcrawler; YHOO, AMZN, Netscape1997 Goto.com1998 Google
![Page 65: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/65.jpg)
Internet History 2000s1960 Nelson Xanadu1965 Salton “SMART Doc Retrieval” CACM1969 First ARPAnet message1972 Email1974 Intel 8080 CPU; Design of TCP1983 ARPAnet (1000 nodes) uses TCP/IP; DNS1990 Berners-Lee creates WWW1994 Pinkerton’s Webcrawler; YHOO, AMZN, Netscape1997 Goto.com1998 Google2001 Wikipedia; Bittorrent (by 2009 55% all traffic)2004-6 Facebook, Digg; Youtube, MTurk; Twitter
![Page 66: Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes](https://reader038.vdocuments.site/reader038/viewer/2022110323/56649d625503460f94a45416/html5/thumbnails/66.jpg)
Adoption