![Page 1: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/1.jpg)
Querying for relations from the semi-structured Web
Sunita Sarawagi
IIT Bombay
http://www.cse.iitb.ac.in/~sunita
Contributors
Rahul Gupta Girija Limaye Prashant Borole
Rakesh Pimplikar Aditya Somani
![Page 2: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/2.jpg)
Web Search
Mainstream web search: User → Keyword queries → Search engine → Ranked list of documents
15 glorious years of fitting all of users’ search needs into this least common denominator
Structured web search: User → Natural language queries ~/~ Structured queries → Search engine → Point answer, record sets
Many challenges in understanding both query and content
15 years of slow but steady progress
![Page 3: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/3.jpg)
The Quest for Structure
Vertical structured search engines
Structure → Schema → Domain-specific
• Shopping: Shopbot (Etzioni+ 1997)
  – Product name, manufacturer, price
• Publications: Citeseer (Lawrence, Giles+ 1998)
  – Paper title, author name, email, conference, year
• Jobs: Flipdog, Whizbang Labs (Mitchell+ 2000)
  – Company name, job title, location, requirement
• People: DBLife (Doan 07)
  – Name, affiliations, committees served, talks delivered
Triggered much research on extraction and IR-style search of structured data (BANKS ‘02).
![Page 4: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/4.jpg)
Horizontal Structured Search
Domain-independent structure
Small, generic set of structured primitives over entities, types, relationships, and properties
• <Entity> IsA <Type>
  – Mysore is a city
• <Entity> Has <Property>
  – <City> Average rainfall <Value>
• <Entity1> <related-to> <Entity2>
  – <Person> born-in <City>
  – <Person> CEO-of <Company>
![Page 5: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/5.jpg)
Types of Structured Search
Web + People → Structured databases (Ontologies)
• Created manually (Psyche), or semi-automatically (Yago)
• True Knowledge (2009), Wolfram Alpha (2009)
Web annotated with structured elements
• Queries: Keywords + structured annotations
  – Example: <Physicist> +cosmos
• Open-domain structure extraction and annotations of web docs (2005 onward)
![Page 6: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/6.jpg)
Users, Ontologies, and the Web
Users are from Venus
• Bi-syllabic, impatient, believe in mind-reading
Ontologies are from Mars
• One structure to fit all
Web content creators are from some other galaxy
– Ontologies=
– Let search engines bring the users
![Page 7: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/7.jpg)
What is missed in Ontologies
• The trivial, the transient, and the textual
• Procedural knowledge
  – What do I do on an error?
• Huge body of invaluable text of various types
  – Reviews, literature, commentaries, videos
• Context
  – By stripping knowledge to its skeletal form, context that is so valuable for search is lost.
As long as queries are unstructured, the redundancy and variety in unstructured sources is invaluable.
![Page 8: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/8.jpg)
Structured annotations in HTML
• Is-A annotations: KnowItAll (2004)
• Open-domain relationships: TextRunner (Banko 2007)
• Ontological annotations: SemTag and Seeker (2003)
• Wikipedia annotations (Wikify! 2007, CSAW 2009)
All view documents as a sequence of tokens
Challenging to ensure high accuracy
![Page 9: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/9.jpg)
WWT: Table queries over the semi-structured web
![Page 10: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/10.jpg)
Queries in WWT
Query by content (a few example rows):
Alan Turing | Turing Machine
E. F. Codd | Relational Databases

Desh | Late night
Bhairavi | Morning
Patdeep | Afternoon

Query by description (column descriptions):
Inventor | Computer science concept | Year
Indian states Airport City
![Page 11: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/11.jpg)
Answer: Table with ranked rows

| Person | Concept/Invention |
| --- | --- |
| Alan Turing | Turing Machine |
| Seymour Cray | Supercomputer |
| E. F. Codd | Relational Databases |
| Tim Berners-Lee | WWW |
| Charles Babbage | Babbage Engine |
![Page 12: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/12.jpg)
Keyword search to find structured records
Query: Computer science concept inventor year
• Verbose articles, not structured tables
• The only document with an unstructured list of some desired records
• Desired records spread across many documents
Correct answer is not one click away.
![Page 13: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/13.jpg)
The only list in one of the retrieved pages
![Page 14: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/14.jpg)
Highly relevant Wikipedia table not retrieved in the top-k
Ideal answer should be integrated from these incomplete sources
![Page 15: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/15.jpg)
Attempt 2: Include samples in query
Query: alan turing machine codd relational database (known examples)
• Documents relevant only to the keywords
• Ideal answer still spread across many documents
![Page 16: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/16.jpg)
WWT Architecture
[Figure: WWT architecture diagram. Components shown: Web, Extract record sources, Annotate (Ontology), Content+context index, Store (Offline); User, Query Table, Type Inference, Typesystem Hierarchy, Index Query Builder, Keyword Query, Sources L1,…,Lk; Extractor (Record labeler, CRF models), Tables T1,…,Tk; Consolidator (Resolver builder, Resolver, Cell resolver, Row resolver, Statistics), Consolidated Table; Ranker (Row and cell scores), Final consolidated table.]
![Page 17: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/17.jpg)
Offline: Annotating to an Ontology
Annotate table cells with entity nodes and table columns with type nodes
[Figure: ontology fragment next to a table column. Type nodes: All, movies, Indian_films, English_films, 2008_films, Terrorism_films, People, Entertainers, Indian_directors. Entity nodes: A_Wednesday, Black&White, Coffee_house (film), Coffee_house (Loc), Wednesday. Table cells: Coffee_house (film), Black&White, A_Wednesday.]
![Page 18: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/18.jpg)
Challenges
• Ambiguity of entity names
  – “Coffee house” is both a movie name and a place name
• Noisy mentions of entity names
  – Black&White versus Black and White
• Multiple labels
  – Yago Ontology has on average 2.2 types per entity
• Missing type links in Ontology: cannot use least common ancestor
  – Missing link: Black&White to 2008_films
  – Not a missing link: 1920 to Terrorism_films
• Scale: Yago has 1.9 million entities, 200,000 types
![Page 19: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/19.jpg)
A unified approach
Graphical model to jointly label cells and columns to maximize the sum of scores on:
• ycj = entity label of cell c of column j
• yj = type label of column j
• Score(ycj): string similarity between c and ycj
• Score(yj): string similarity between the header of column j and yj
• Score(yj, ycj)
  – Subsumed entity: inversely proportional to the distance between them
  – Outside entity: fraction of overlapping entities between yj and the immediate parent of ycj
  – Handles missing links: overlap of 2008_movies with 2007_movies is zero, but with Indian movies it is non-zero.
[Figure: type hierarchy (movies; Indian_films, English_films, Terrorism_films) with column type yj, subsumed entities y1j and y2j, and outside entity y3j.]
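The type/entity compatibility term Score(yj, ycj) can be illustrated with a small sketch. This is not the authors' implementation: the toy ontology, the entity `Titanic`, the `1/distance` form for subsumed entities, and the choice of the parent's entity set as the denominator of the overlap fraction are all assumptions made for illustration.

```python
# Toy ontology (hypothetical): type -> parent type, entity -> direct types.
TYPE_PARENT = {
    "Indian_films": "movies",
    "English_films": "movies",
    "2008_films": "movies",
    "Terrorism_films": "movies",
    "movies": None,
}
ENTITY_TYPES = {
    "A_Wednesday": {"Indian_films", "2008_films", "Terrorism_films"},
    "Black&White": {"Indian_films", "Terrorism_films"},  # link to 2008_films missing
    "Coffee_house_(film)": {"Indian_films"},
    "Titanic": {"English_films"},  # hypothetical extra entity
}


def distance(entity, col_type):
    """Fewest edges from entity up to col_type, or None if col_type is not an ancestor."""
    best = None
    for direct in ENTITY_TYPES[entity]:
        hops, current = 1, direct  # 1 edge: entity -> direct type
        while current is not None:
            if current == col_type:
                best = hops if best is None else min(best, hops)
            current = TYPE_PARENT[current]
            hops += 1
    return best


def entities_of(type_name):
    """Entities directly attached to a type."""
    return {e for e, types in ENTITY_TYPES.items() if type_name in types}


def compat_score(col_type, entity):
    """Assumed form of Score(yj, ycj) for a column type and a cell entity."""
    d = distance(entity, col_type)
    if d is not None:  # subsumed entity: inversely proportional to distance
        return 1.0 / d
    # Outside entity: best overlap fraction between the column type's entities
    # and the entities of an immediate parent of the cell entity.
    col_entities = entities_of(col_type)
    best = 0.0
    for parent in ENTITY_TYPES[entity]:
        parent_entities = entities_of(parent)
        best = max(best, len(col_entities & parent_entities) / len(parent_entities))
    return best
```

In this toy setup `compat_score("2008_films", "Black&White")` is non-zero even though the ontology link is missing, while `compat_score("English_films", "Black&White")` is zero, which mirrors the missing-link behaviour described on the slide.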
![Page 20: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/20.jpg)
![Page 21: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/21.jpg)
Extraction: Content queries
Extracting query columns from list records

Query: Q
Cornell University | Ithaca
State University of New York | Stony Brook
New York University | New York

A source: Li
• New York University (NYU), New York City, founded in 1831.
• Columbia University, founded in 1754 as King’s College.
• Binghamton University, Binghamton, established in 1946.
• State University of New York, Stony Brook, New York, founded in 1957
• Syracuse University, Syracuse, New York, established in 1870
• State University of New York, Buffalo, established in 1846
• Rensselaer Polytechnic Institute (RPI) at Troy.

Lists are often human generated.
![Page 22: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/22.jpg)
Extraction
• Rule-based extractor insufficient.
• Statistical extractor needs training data.
• Generating that is also not easy!
[Figure: the source list Li and the query Q from the previous slide, with the extracted table columns highlighted.]
![Page 23: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/23.jpg)
Extraction: Labeled data generation
Lists are unlabeled. Labeled records are needed to train a CRF.
A fast but naïve approach for generating labeled records

Query about colleges in NY:
New York University | New York
Monroe College | Brighton
State University of New York | Stony Brook

Fragment of a relevant list source:
• New York Univ. in NYC
• Columbia University in NYC
• Monroe Community College in Brighton
• State University of New York in Stony Brook, New York.
![Page 24: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/24.jpg)
Extraction: Labeled data generation
A fast but naïve approach
• In the list, look for matches of every query cell.
[Figure: the same query table and list fragment, with callouts "Another match for New York University" and "Another match for New York".]
![Page 25: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/25.jpg)
Extraction: Labeled data generation
A fast but naïve approach
• In the list, look for matches of every query cell.
• Greedily map each query row to the best match in the list.
[Figure: the same query table and list fragment, with the greedy mappings numbered 1 and 2.]
![Page 26: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/26.jpg)
Extraction: Labeled data generation
A fast but naïve approach
• Hard matching criteria have significantly low recall.
  – Missed segments.
  – Does not use natural clues like Univ = University.
• Greedy matching can lead to really bad mappings.
[Figure: the same query table and list fragment with greedy mappings 1 and 2; callouts: "Unmapped (hurts recall)", "Wrongly mapped", "Assumed as ‘Other’".]
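A minimal sketch of the naïve approach, to make its failure modes concrete. The rows are copied from the slides; the exact-substring test and the "most matched cells wins, first come first served" rule are assumptions about what hard, greedy matching means here, not the system's actual code.

```python
# Data copied from the slides; the matching rules below are illustrative assumptions.
QUERY = [
    ("New York University", "New York"),
    ("Monroe College", "Brighton"),
    ("State University of New York", "Stony Brook"),
]
SOURCE = [
    "New York Univ. in NYC",
    "Columbia University in NYC",
    "Monroe Community College in Brighton",
    "State University of New York in Stony Brook, New York.",
]


def matched_cells(query_row, source_row):
    """Query cells that occur verbatim (hard match) in the source row."""
    return [cell for cell in query_row if cell in source_row]


def greedy_map(query, source):
    """Map each query row, in order, to the unused source row with the most hard matches."""
    used, mapping = set(), {}
    for qi, query_row in enumerate(query):
        best_si, best_count = None, 0
        for si, source_row in enumerate(source):
            count = len(matched_cells(query_row, source_row))
            if si not in used and count > best_count:
                best_si, best_count = si, count
        if best_si is not None:  # query rows with no match stay unmapped
            used.add(best_si)
            mapping[qi] = best_si
    return mapping
```

Even when the mapping is right, `matched_cells(QUERY[0], SOURCE[0])` finds only "New York": the segment "New York Univ." is missed because Univ is not treated as University. With the list order reversed (`SOURCE[::-1]`), the first query row grabs the State University of New York line through its exact "New York" match, and the third query row is left unmapped.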
![Page 27: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/27.jpg)
Generating labeled data: Soft approach

List source rows:
• New York Univ. in NYC
• Columbia University in NYC
• Monroe Community College in Brighton
• State University of New York in Stony Brook, New York.

Query rows:
New York University | New York
Monroe College | Brighton
State University of New York | Stony Brook
![Page 28: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/28.jpg)
Generating labeled data: Soft approach
Match score for each query and source row:
• Score of the best segmentation of the source row into query columns
• Score of a segment s of column c: probability that cell c of the query row is the same as segment s
  – Computed by the Resolver module based on the type of the column
[Figure: the same query table and list fragment, with match scores 0.9, 0.3, and 1.8 shown on query-row to source-row edges.]
![Page 29: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/29.jpg)
Generating labeled data: Soft approach
Match score for each query and source row:
• Score of the best segmentation of the source row into query columns
• Score of a segment s of column c: probability that cell c of the query row is the same as segment s
  – Computed by the Resolver module based on the type of the column
[Figure: the same query table and list fragment, with match scores 2.0, 0.7, and 0.3 shown on query-row to source-row edges.]
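The "score of the best segmentation" can be sketched as a small dynamic program. Everything here is an assumption made for illustration: `difflib.SequenceMatcher` stands in for the Resolver's type-specific match probability, columns are assumed to appear in query order, and `max_len` caps segment length in tokens.

```python
from difflib import SequenceMatcher
from functools import lru_cache


def cell_sim(query_cell, segment):
    """Stand-in for the Resolver's match probability (an assumption, not the real model)."""
    return SequenceMatcher(None, query_cell.lower(), segment.lower()).ratio()


def best_segmentation(query_row, source_row, max_len=6):
    """Return (score, segments): best assignment of token spans to query columns.

    Tokens not assigned to any column are implicitly labeled 'Other';
    a column with no segment is reported as None.
    """
    tokens = source_row.split()
    n, m = len(tokens), len(query_row)

    @lru_cache(maxsize=None)
    def solve(pos, col):
        # Best result using tokens[pos:] for columns col..m-1.
        if col == m:
            return 0.0, ()
        if pos == n:
            return 0.0, (None,) * (m - col)
        best = solve(pos + 1, col)  # option 1: label this token 'Other'
        rest_score, rest_segs = solve(pos, col + 1)  # option 2: leave column empty
        if rest_score > best[0]:
            best = (rest_score, (None,) + rest_segs)
        for end in range(pos + 1, min(n, pos + max_len) + 1):  # option 3: take a span
            segment = " ".join(tokens[pos:end])
            rest_score, rest_segs = solve(end, col + 1)
            score = cell_sim(query_row[col], segment) + rest_score
            if score > best[0]:
                best = (score, (segment,) + rest_segs)
        return best

    return solve(0, 0)
```

Unlike the hard match, this soft score gives "New York Univ." credit against "New York University", so the source row can still be labeled and used for training.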
![Page 30: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/30.jpg)
Generating labeled data: Soft approach
• Compute the maximum weight matching.
  – Better than greedily choosing the best match for each row.
• Soft string-matching increases the labeled candidates significantly.
  – Vastly improves recall, leads to better extraction models.
[Figure: the same query table and list fragment with edge scores 0.9, 0.3, 1.8, 1.8, 0.7, 0.3, and 2; greedy matching shown in red.]
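The difference between greedy and maximum weight matching can be seen on a tiny example. The weight matrix below is hypothetical: it reuses score values from the slides, but which query/source pair each score belongs to is an assumption. Brute force over assignments is used only because the example is tiny; a real system would use a proper assignment algorithm.

```python
from itertools import permutations

# Hypothetical soft match scores: rows are query rows, columns are source rows.
W = [
    [0.9, 0.3, 0.0, 1.8],  # New York University | New York
    [0.0, 0.3, 0.7, 0.0],  # Monroe College | Brighton
    [0.3, 0.0, 0.0, 2.0],  # State University of New York | Stony Brook
]


def total(w, assignment):
    """Sum of weights when query row q is mapped to source row assignment[q]."""
    return sum(w[q][s] for q, s in enumerate(assignment))


def greedy_matching(w):
    """Each query row, in order, takes its best still-unused source row."""
    used, assignment = set(), []
    for row in w:
        _, si = max((score, i) for i, score in enumerate(row) if i not in used)
        used.add(si)
        assignment.append(si)
    return tuple(assignment), total(w, assignment)


def max_weight_matching(w):
    """Exhaustive search over one-to-one assignments (needs sources >= queries)."""
    best = max(permutations(range(len(w[0])), len(w)), key=lambda a: total(w, a))
    return best, total(w, best)
```

On this matrix the greedy pass sends the first query row to the fourth source row (1.8) and ends with a total near 2.8, while the maximum weight matching keeps that source row for the third query row and reaches about 3.6.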
![Page 31: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/31.jpg)
Extractor
Use a CRF on the generated labeled data
Feature set
• Delimiters, HTML tokens in a window around labeled segments
• Alignment features
Collective training of multiple sources
![Page 32: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/32.jpg)
Experiments
Aim: Reconstruct Wikipedia tables from only a few sample rows.
Sample queries
• TV Series: Character name, Actor name, Season
• Oil spills: Tanker, Region, Time
• Golden Globe Awards: Actor, Movie, Year
• Dadasaheb Phalke Awards: Person, Year
• Parrots: common name, scientific name, family
![Page 33: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/33.jpg)
Experiments: Dataset
Corpus: 16M lists from 500M pages from a web crawl.
• 45% of lists retrieved by index probe are irrelevant.
Query workload
• 65 queries. Ground truth hand-labeled by 10 users over 1300 lists.
• 27% of queries not answerable with one list (difficult).
• True consolidated table = 75% of Wikipedia table, 25% new rows not present in Wikipedia.
![Page 34: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/34.jpg)
Extraction performance
Benefits of soft training data generation, alignment features, staged-extraction on F1 score.
More than 80% F1 accuracy with just three query records
![Page 35: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/35.jpg)
![Page 36: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/36.jpg)
Extraction: Description queries
Lithium 3
Sodium 11
Beryllium 4
Non-informative headers
No headers
![Page 37: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/37.jpg)
Context to get at relevant tables
Ontological annotations
Context is the union of:
• Text around tables
• Headers
• Ontology labels when present
[Figure: ontology fragment with type nodes (All, People, Chemical_elements, Metals, Non_Metals, Alkali, Non alkali, Gas, Non-gas) and entity nodes (Aluminium, Lithium, Sodium, Hydrogen, Carbon), next to a table with rows Lithium 3, Sodium 11, Beryllium 4 and the labels Alkali and Chemical element.]
![Page 38: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/38.jpg)
Joint labeling of table columns
Given
• Candidate tables: T1, T2, …, Tn
• Query columns: q1, q2, …, qm
Task: label columns of Ti with {q1, q2,…, qm, } to maximize the sum of these scores
• Score(T, j, qk) = Ontology type match + header string match with qk
• Score(T, *, qk) = Match of description of T with qk
• Score(T, j, T’, j’, qk) = Content overlap of column j of table T with column j’ of table T’ when both are labeled qk
Inference in a graphical model, solved via Belief Propagation.
![Page 39: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/39.jpg)
![Page 40: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/40.jpg)
Step 3: Consolidation
Merging the extracted tables into one

First extracted table:
Cornell University | Ithaca
State University of New York | Stony Brook
New York University | New York City
Binghamton University | Binghamton

+ Second extracted table:
SUNY | Stony Brook
New York University (NYU) | New York
RPI | Troy
Columbia University | New York
Syracuse University | Syracuse

= Consolidated table (merging duplicates):
Cornell University | Ithaca
State University of New York OR SUNY | Stony Brook
New York University OR New York University (NYU) | New York City OR New York
Binghamton University | Binghamton
RPI | Troy
Columbia University | New York
Syracuse University | Syracuse
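The merge step above can be mimicked with a deliberately simple sketch. The real system resolves duplicates with a Bayesian network; here a hypothetical alias table and a prefix test stand in for that resolver, purely to show how matched rows accumulate "OR" variants.

```python
ALIASES = {"suny": "state university of new york"}  # hypothetical alias table


def canon(cell):
    """Lowercase, drop parentheses, collapse whitespace, expand known aliases."""
    text = " ".join(cell.lower().replace("(", " ").replace(")", " ").split())
    return ALIASES.get(text, text)


def cells_match(a, b):
    """Crude stand-in for the cell resolver: equal or one is a prefix of the other."""
    a, b = canon(a), canon(b)
    return a == b or a.startswith(b) or b.startswith(a)


def consolidate(tables):
    """Merge rows across tables; a merged row keeps every distinct cell variant."""
    merged = []  # each merged row: one list of variants per column
    for table in tables:
        for row in table:
            for target in merged:
                if all(any(cells_match(cell, v) for v in variants)
                       for cell, variants in zip(row, target)):
                    for cell, variants in zip(row, target):
                        if cell not in variants:
                            variants.append(cell)
                    break
            else:  # no existing row matched: start a new consolidated row
                merged.append([[cell] for cell in row])
    return [tuple(" OR ".join(variants) for variants in row) for row in merged]


TABLE_1 = [
    ("Cornell University", "Ithaca"),
    ("State University of New York", "Stony Brook"),
    ("New York University", "New York City"),
    ("Binghamton University", "Binghamton"),
]
TABLE_2 = [
    ("SUNY", "Stony Brook"),
    ("New York University (NYU)", "New York"),
    ("RPI", "Troy"),
    ("Columbia University", "New York"),
    ("Syracuse University", "Syracuse"),
]
```

Calling `consolidate([TABLE_1, TABLE_2])` on the slide's two tables yields seven rows, with the SUNY and NYU rows carrying both variants. A prefix test like this would over-merge in general, which is one reason a probabilistic resolver is used instead.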
![Page 41: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/41.jpg)
Consolidation
Challenge: resolving when two rows are the same in the face of
• Extraction errors
• Missing columns
• Open-domain
• No training
Our approach: a specially designed Bayesian Network with interpretable and generalizable parameters
![Page 42: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/42.jpg)
Resolver
Bayesian Network
• P(RowMatch | rows q, r)
• P(1st cell match | q1, r1), …, P(ith cell match | qi, ri), …, P(nth cell match | qn, rn)
Cell-level probabilities
• Parameters automatically set using list statistics
• Derived from user-supplied type-specific similarity functions
![Page 43: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/43.jpg)
Ranking
• Factors for ranking
  – Relevance: membership in overlapping sources
  – Support from multiple sources
  – Completeness: importance of columns present
    Penalize records with only common ‘spam’ columns like City and State
  – Correctness: extraction confidence

| School | Location | State | Merged Row Confidence | Support |
| --- | --- | --- | --- | --- |
| - | - | NY | 0.99 | 9 |
| - | NYC | New York | 0.95 | 7 |
| New York Univ. OR New York University | New York City OR New York | New York | 0.85 | 4 |
| University of Rochester OR Univ. of Rochester | Rochester | New York | 0.50 | 2 |
| University of Buffalo | Buffalo | New York | 0.70 | 2 |
| Cornell University | Ithaca | New York | 0.76 | 1 |
![Page 44: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/44.jpg)
Relevance ranking on set membership
Weighted sum approach
• Score of a set t: s(t) = fraction of query rows in t
• Relevance of consolidated row r: sum of s(t) over all sets t that contain r
Graph walk based approach
• Random walk from rows to table nodes, starting from query rows, with random restarts to query rows
[Figure: graph linking query rows, consolidated rows, and tables.]
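Both relevance measures can be sketched in a few lines. The table contents, the restart probability of 0.15, and the fixed iteration count are assumptions for illustration; the sketch also assumes every query row occurs in at least one table.

```python
def weighted_sum_relevance(tables, query_rows):
    """relevance(r) = sum of s(t) over tables t containing r, s(t) = fraction of query rows in t."""
    relevance = {}
    for rows in tables.values():
        s_t = len(rows & query_rows) / len(query_rows)
        for r in rows:
            relevance[r] = relevance.get(r, 0.0) + s_t
    return relevance


def random_walk_relevance(tables, query_rows, restart=0.15, iters=50):
    """Random walk rows -> tables -> rows, restarting at query rows (power iteration)."""
    rows = set().union(*tables.values())
    tables_of = {r: [t for t, members in tables.items() if r in members] for r in rows}
    start = {r: (1.0 / len(query_rows) if r in query_rows else 0.0) for r in rows}
    p_row = dict(start)
    for _ in range(iters):
        # Step from each row to a uniformly chosen table containing it.
        p_table = {t: 0.0 for t in tables}
        for r, p in p_row.items():
            for t in tables_of[r]:
                p_table[t] += p / len(tables_of[r])
        # Step from each table to a uniformly chosen member row, or restart.
        new_p = {r: restart * start[r] for r in rows}
        for t, p in p_table.items():
            for r in tables[t]:
                new_p[r] += (1.0 - restart) * p / len(tables[t])
        p_row = new_p
    return p_row


# Hypothetical example: three source tables given as sets of consolidated-row ids.
TABLES = {
    "T1": {"turing", "codd", "cray"},
    "T2": {"turing", "berners-lee"},
    "T3": {"spam"},
}
QUERY_ROWS = {"turing", "codd"}
```

In this example a row that only appears in a table with no query rows ("spam") gets zero relevance under both measures, while rows that share tables with the query rows are pulled up.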
![Page 45: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/45.jpg)
Ranking Criteria
• Score(Row r):
  × Graph-relevance of r
  × Importance of columns C present in r (high if C functionally determines the others)
  × Sum of cell extraction confidence: noisy-OR of cell extraction confidence from individual CRFs

| School | Location | State | Merged Row Confidence | Support |
| --- | --- | --- | --- | --- |
| New York Univ. OR New York University (0.90) | New York City OR New York (0.95) | New York (0.98) | 0.85 | 4 |
| University of Buffalo (0.88) | Buffalo (0.99) | New York (0.99) | 0.70 | 2 |
| Cornell University (0.92) | Ithaca (0.95) | New York (0.99) | 0.76 | 1 |
| University of Rochester OR Univ. of Rochester (0.80) | Rochester (0.95) | New York (0.99) | 0.50 | 2 |
| - | - | NY (0.99) | 0.99 | 9 |
| - | NYC (0.98) | New York (0.98) | 0.95 | 7 |
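The product form of the row score, with a noisy-OR per cell, can be written down directly. The function names and the example numbers are hypothetical; how graph relevance and column importance are computed is outside this sketch.

```python
from math import prod


def noisy_or(confidences):
    """Probability that at least one source extracted the cell correctly."""
    return 1.0 - prod(1.0 - c for c in confidences)


def row_score(graph_relevance, column_importance, cell_confidences):
    """Score(r) = graph relevance x column importance x sum of per-cell noisy-OR.

    cell_confidences holds, for each cell of the row, the list of extraction
    confidences reported by the individual CRFs that produced that cell.
    """
    cell_total = sum(noisy_or(confs) for confs in cell_confidences)
    return graph_relevance * column_importance * cell_total
```

For instance, a cell extracted by two sources with confidence 0.5 each gets a noisy-OR of 0.75, so agreement across sources raises a row's score without any single extractor having to be very confident.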
![Page 46: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/46.jpg)
Overall performance
Goal: justify sophisticated consolidation and resolution. So compare with:
• Processing only the magically known single best list => no consolidation/resolution required.
• Simple consolidation: no merging of approximate duplicates.
WWT has > 55% recall and beats the others. The gain is bigger for difficult queries.
[Figure: results in two panels, "All Queries" and "Difficult Queries".]
![Page 47: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/47.jpg)
Running time
• < 30 seconds with 3 query records.
• Can be improved by processing sources in parallel.
• Variance is high because time depends on number of columns, record length, etc.
![Page 48: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/48.jpg)
Related Work
Google Squared
• Developed independently. Launched in May 2009.
• User provides keyword query, e.g. “list of Italian joints in Manhattan”. Schema inferred.
• Technical details not public.
Prior methods for extraction and resolution
• Assume labeled data/pre-trained parameters.
• We generate labeled data, and automatically train resolver parameters from the list source.
![Page 49: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/49.jpg)
Summary
• Structured web search & the role of non-text, partially structured web sources
• WWT system
  – Domain-independent
  – Online: structure interpretation at query time
  – Relies heavily on unsupervised statistical learning
    Graphical model for table annotation
    Soft approach for generating labeled data
    Collective column labeling for descriptive queries
    Bayesian network for resolution and consolidation
    PageRank + confidence from a probabilistic extractor for ranking
![Page 50: Querying for relations from the semi-structured Web Sunita Sarawagi IIT Bombay sunita Contributors Rahul Gupta Girija Limaye](https://reader035.vdocuments.site/reader035/viewer/2022062511/551475c4550346414e8b630b/html5/thumbnails/50.jpg)
What next?
• Designing plans for non-trivial ways of combining sources
• Better ranking and user-interaction models
• Expanding query set
  – Aggregate queries: tables are rich in quantities
  – Point queries: attribute value and relationship queries
• Interplay between semi-structured web & Ontologies
  – Augmenting one with the other
• Quantify information in structured sources vis-à-vis text sources on typical query workloads