Structured Data on the Web
Alon Halevy, Google
May 23, 2014
Joint work with: Jayant Madhavan, Cong Yu, Fei Wu, Hongrae Lee, Warren Shen, Anish Das Sarma, Rahul Gupta, Boulos Harb, Zack Ives, Afshin Rostamizadeh, Sree Balakrishnan, Anno Langen, Steven Whang, Mohamed Yahya, and others
Structured Data in Search Results
Set Queries: Chicago restaurants
Association Queries
Data in Movies!
The Knowledge Graph
[Figure: Knowledge Graph fragment. Brazil is linked to Brasilia by a capital edge; other labels in the figure are population, 2014, 2001, and mayor.]
Query Reformulation
• Brazil capital
• What is the capital of Brazil
• “Google, tell me the capital of brazil”
• Brazil nuts
• Culture of Brazil
• “Google, will Brazil win the world cup?”
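One way to read this slide: several surface forms of a question reduce to the same (entity, attribute) lookup in the Knowledge Graph, while other queries that mention Brazil do not. The sketch below is illustrative only. The stopword list, the one-fact `KG` dictionary, and the `lookup` function are assumptions made for the example, not a description of how production search reformulates queries.

```python
import re

# Toy knowledge graph. Only the Brazil/capital fact comes from the slides.
KG = {("brazil", "capital"): "Brasilia"}

# Hypothetical filler words to ignore; a real system would need far
# richer query understanding than a stopword list.
STOPWORDS = {"google", "tell", "me", "what", "is", "the", "of"}


def lookup(query):
    """Return a KG value if the query reduces to exactly one (entity, attribute)."""
    tokens = {
        t for t in re.findall(r"[a-z]+", query.lower()) if t not in STOPWORDS
    }
    for (entity, attribute), value in KG.items():
        # Single-word entities and attributes only, to keep the sketch short.
        if tokens == {entity, attribute}:
            return value
    return None  # the query is not a simple attribute lookup


for q in ["Brazil capital", "What is the capital of Brazil", "Brazil nuts"]:
    print(q, "->", lookup(q))
```

The first two queries resolve to the same fact, while "Brazil nuts" falls through because "nuts" is not an attribute of Brazil in the toy graph.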
Other Sources of Data
[Figure: the query “Brazil capital” set against three sources of data: the Knowledge Graph (the Brazil fragment with capital, population, and mayor), Tables, and Text.]
The population of Brasilia is 2207718 according to the GeoNames geographical database
Answer Queries Directly from Web?
The Web vs. the Knowledge Graph
Tables, Tables
Fusion Tables: Enabling a broad range of users to create tabular content
WebTables: Finding good HTML tables on the Web
• City planning
• Sustainability: water, coffee, …
• Crisis response
• Advancing public discourse (e.g., gun control)
• Data philanthropy – corporations encouraged to contribute data for the good of society.
Background for Coffee Examples
Fusion Tables
google.com/fusiontables
[SIGMOD 2010, SIGMOD 2012]
• Goal: an easy-to-use database system that is integrated with the Web.
• Key: support common workflows
– Easy upload (CSV, KML, spreadsheets)
– Sharing (even outside your company)
– Visualizations front and center
– Easy publishing
• Goal 2: Fusion in the data cloud: discover others’ data and combine it with yours.
Coffee Producing Countries
Coffee Consumption Per Capita
Big Data for Regular People
Table Facts:
English poverty rates: 32,000 wards with a total of 1.8 million vertices. Colors indicate poverty levels.
2011 Rioting: 2100 incidents. Colors indicate addresses of rioting and rioters.
Best UK Internet Journalist
Knight-Batten Award for Innovations in Journalism
Crowd Sourcing
Data Integration as Search
Join with Population Data: What is a City?
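The slide's question hints that the join key is the hard part: a bare city name is rarely a reliable identifier. The sketch below shows one small aspect of that problem, using a normalized (city, state) key instead of the city name alone. The table contents are made-up placeholders, and the helpers `key` and `join` are hypothetical; this is not how Fusion Tables implements merging.

```python
# Illustrative rows only: every count and population below is a placeholder.
cafes = [
    {"city": "Springfield", "state": "IL", "cafes": 10},
    {"city": "Springfield", "state": "MA", "cafes": 20},
]
population = [
    {"city": "springfield", "state": "il", "population": 1000},
    {"city": "Springfield ", "state": "MA", "population": 2000},
]


def key(row):
    """Normalized join key. The state is included because the city name
    alone is ambiguous."""
    return (row["city"].strip().lower(), row["state"].strip().lower())


def join(left, right):
    """Inner join of two lists of dict rows on the normalized key."""
    index = {key(r): r for r in right}
    joined = []
    for row in left:
        match = index.get(key(row))
        if match is not None:
            joined.append({**row, "population": match["population"]})
    return joined


joined = join(cafes, population)
for row in joined:
    print(row)
```

Joining on the city name alone would collapse the two Springfield rows onto a single population figure. Deciding whether "city" means a municipality or a metropolitan area is a further question that a key normalization like this does not answer.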
Big Data Integration
Table Facts:
Texas Counties 2010 Census: 254 counties with 543,000 vertices. Colored based on various demographics.
See SIGMOD 2012 paper for details on scaling map visualizations
Crowdsourcing Cafes
HTML Tables
Search Engine for Data Sets
research.google.com/tables
[VLDB 2008, 2011, 2014]
Give Answers from Tables
It Better Be Right!
Answer with a Visualization
Long Term Goal: A Data-Guided Decision Engine
• Support decision making:
– Healthcare debate
– Should I install solar in my house?
– Which charity should I contribute to?
• Show relevant data
– Expose facets of the decision and enable drilldown
– Show opposing views
• Manually curated examples of decision engines:
– Justfacts.com, followthemoney.com, decide.com
WebTables on google.com!
HTML Lists
See Elmeleegy et al., VLDB 2009
Tree Search
Amish quilts
Parking tickets in India
Horses
The Deep Web [Madhavan et al., VLDB 2008]
Other Sources of Data
• Spreadsheets
• CSV files
• Tables embedded in PDF
• XML, RDF
• Visualizations
• Online databases (Fusion Tables, Tableau, …)
Each source has its particularities, but most problems are common to all.
Non-Tabular Data in HTML
Vertical Tables
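A vertical table lists one entity's attributes down the rows (attribute in the first cell, value in the second) instead of across column headers. As a minimal sketch, assuming the rows have already been extracted from the HTML into lists of cell strings, such a table can be turned into a single record. The function name `vertical_to_record` and the sample rows are illustrative assumptions.

```python
def vertical_to_record(rows):
    """Turn [[attribute, value], ...] rows into one {attribute: value} record."""
    record = {}
    for row in rows:
        if len(row) != 2:
            continue  # skip rows that are not attribute/value pairs
        attribute, value = (cell.strip() for cell in row)
        # Drop a trailing colon and lowercase the attribute name.
        record[attribute.rstrip(":").lower()] = value
    return record


rows = [["Country:", "Brazil"], ["Capital:", "Brasilia"], ["layout spacer"]]
record = vertical_to_record(rows)
print(record)
```

Detecting that a table is vertical in the first place is the harder problem and is not addressed here.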
Data Optimized for Page Layout
Tabular Data Optimized for Site Layout
See [Ling et al., IJCAI 2013] for stitching tables within a site.
Semantics Can Be Brittle
Semantics are in Text
The Big Challenge
• Analyze natural language text as it pertains to structured data.
• Different from (open) information extraction that builds databases entirely from text.
• Good news: natural language parsing technology is now scalable.
First Step: Annotating Columns [Venetis et al., VLDB 2011]
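To make column annotation concrete, the sketch below labels a column with the class that most of its cell values belong to, according to an isA dictionary. This is a deliberately simplified majority vote and should not be read as the model from the cited paper. The `IS_A` dictionary, the `annotate_column` function, and the `min_fraction` threshold are all assumptions for illustration.

```python
from collections import Counter

# Hypothetical isA dictionary (cell value -> class labels). A real system
# would mine such pairs from Web text at scale.
IS_A = {
    "brazil": {"country"},
    "el salvador": {"country"},
    "brasilia": {"city"},
    "chicago": {"city"},
}


def annotate_column(cells, min_fraction=0.5):
    """Return the class label supported by the most cells, or None if no
    label covers more than min_fraction of the column. Ties are broken
    alphabetically."""
    votes = Counter()
    for cell in cells:
        for label in IS_A.get(cell.strip().lower(), ()):
            votes[label] += 1
    if not votes:
        return None
    label, count = max(sorted(votes.items()), key=lambda kv: kv[1])
    return label if count / len(cells) > min_fraction else None


print(annotate_column(["Brazil", "El Salvador", "Chicago"]))
print(annotate_column(["Chicago", "Brasilia"]))
```

The threshold tolerates a few noisy or unknown cells, which matters because Web tables often mix entity types within a column.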
Step 2: Understanding Relationships
Dictionary of Attributes
• I want the list of all attributes that countries may have.
• Freebase doesn’t have coffee production.
• Is this an ontology?
– Not quite! I want an ontology suited for search.
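One plausible source for such a dictionary is the query stream itself: queries that pair a known country with some other phrase suggest candidate attributes. The sketch below is a rough illustration under that assumption and is not a description of any published pipeline. The seed set `COUNTRIES`, the two query patterns, and the function `candidate_attributes` are hypothetical.

```python
import re
from collections import Counter

COUNTRIES = {"brazil", "el salvador"}  # hypothetical seed entities


def candidate_attributes(queries):
    """Count phrases that co-occur with a seed country in two simple patterns:
    "<attribute> of <country>" and "<country> <attribute>"."""
    counts = Counter()
    for query in queries:
        text = query.lower().strip()
        for country in COUNTRIES:
            pattern = r"(?:what is the )?(.+) of " + re.escape(country)
            match = re.fullmatch(pattern, text)
            if match:
                counts[match.group(1)] += 1
            elif text.startswith(country + " "):
                counts[text[len(country) + 1:]] += 1
    return counts


counts = candidate_attributes(
    ["Brazil capital", "What is the capital of Brazil", "Brazil nuts"]
)
# Keep only candidates seen at least twice to filter noise such as "nuts".
attributes = [a for a, n in counts.items() if n >= 2]
print(attributes)
```

The frequency cut-off is what separates a real attribute ("capital") from a phrase that merely contains the country name ("nuts"); with real query logs the threshold and patterns would need much more care.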
Biperpedia: [VLDB 2014]
Ontology for Search Applications
Comparing to Freebase Coverage
Tower of Babel: Internet Style
In 2013, the coffee production of El Salvador dropped by 20% due to the coffee rust disease.
Coffee production el salvador 2013
El Salvador exports coffee 2013
Knowledge Graph
Tables Text
Conclusions
• This was a talk about Big Data:
– Millions of people creating data sets
– Billions of people seeing the data and being impacted by it
• Get out there and find your favorite application.
• Dreams do come true:
– At least as it pertains to structured data on the Web!
References
• Fusion Tables: SIGMOD 2010, 2012
• WebTables: VLDB 2008, 2009, 2011