Google's Deep-Web Crawl (VLDB 2008)
Google’s Deep-Web Crawl
Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, and
Alon Halevy, Google Inc.
Speaker: Tom
What is the Deep Web?
Content hidden behind HTML forms
Deep = not accessible through search engines
Why is it important?
Large source of structured data
  Forms present a search interface over backend databases
Significant gap in search engine coverage
  Potentially more content than the currently searchable web [Bergman+, Madhavan+, He+]
  More than 10 million distinct HTML forms
  Likely to increase as more data comes online
Challenge: make the Deep Web accessible to web search
What is in the Deep Web?
Yes: Informational forms
No: Login forms, anything that requires user information
Maybe: Interactive forms, e.g., airline reservations
Examples: store locations, used cars, radio stations, patents, recipes
Virtual Integration
Mediator forms per domain
Mappings between forms [Doan+, He+, Wu+]
Query routing/reformulation at run-time
Popular with vertical search engines
[Figure: a mediated form linked to deep-web sources by semantic mappings]
Impractical for web search!
  Modeling all domains in all languages might not be possible
  High cost of building and maintaining
  Query routing at run-time is very difficult
  Potentially high loads on deep-web sources
Surfacing the Deep Web
Pre-compute all interesting form submissions for each HTML form
Each form submission corresponds to a distinct URL
Add URLs for each form submission into the search engine index
Enables the reuse of existing search engine infrastructure
  Deep-web URLs are like any other URL
Reduced load on deep-web sites
  Sites are accessed only in response to user clicks on search results
Search engine performance not dependent on deep-web source
Surfacing Challenges
1. Predicting the appropriate values for text inputs
  Valid input values are required for retrieving data
  E.g., ingredients on recipes.com and zip codes on borderstores.com
2. Predicting the correct input combinations
  Generating all possible URLs is wasteful and unnecessary
  Cars.com has ~500K listings, but 250M possible queries
Surfacing for a Search Engine
Goal: access to as much Deep-Web content as possible
  Distribution of form-generated traffic is heavy-tailed
  More than 800,000 distinct forms in a week
  Overall coverage more important than site-specific coverage
Completely automatic and efficient solution required!
  Many domains and many languages
  No human in the loop, no site-specific scripts
Contributions and Impact
Research contributions
  Formulation: searching for informative query templates
  Algorithms: predicting input combinations
  Algorithms: predicting input values for text boxes
Google’s Deep-Web crawling system
  Affects more than 1000 queries per second
  Enables access to more than a million Deep-Web sites
  Spans 50+ languages and 100+ domains
Problem Formulation
Form Processing 101
GET and POST: the two submission methods of HTML forms
Only GET forms can be surfaced
<form action="http://www.borders.com/locator" method="GET">
  <select name="store"><option …/>… </select>
  …
  <input name="zip" type="text"/>
  <input name="search" type="submit" value="Go"/>
  <input name="site" type="hidden" value="homepage"/>
</form>
On submit, the browser requests the URL:
http://www.borders.com/locator?store=All&city=&state=&zip=94043&within=25&search=Go&site=homepage
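A minimal sketch of how a GET submission turns into a URL, using Python's standard urllib. The input names and values are copied from the example URL on this slide; the helper name surfaced_url is made up for illustration and is not part of the system described in the talk.

```python
from urllib.parse import urlencode


def surfaced_url(action, inputs):
    """Return the URL a browser would request when a GET form is submitted."""
    return action + "?" + urlencode(inputs)


# (name, value) pairs in form order, taken from the slide's example URL.
inputs = [
    ("store", "All"),
    ("city", ""),
    ("state", ""),
    ("zip", "94043"),
    ("within", "25"),
    ("search", "Go"),
    ("site", "homepage"),
]

url = surfaced_url("http://www.borders.com/locator", inputs)
print(url)
```

Because every input ends up in the query string, each distinct assignment of values yields a distinct URL that can be indexed like any other page.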
Problem Formulation
Form submission ~ SQL Query
select * from DB where I1 = V1 and … and IN = VN
Not all inputs impose selection predicates
E.g., sort order and results per page affect presentation
Problem: find the best set of SQL queries
Query Templates
Query Template: compact representation of a set of queries
  { select * from DB where PB }
  IB: binding inputs in the form
  PB: selection predicates only involving IB
  All queries with different values for IB
  Default values assigned to other inputs
Store locator with zip and type can have templates:
  <Z> { select * from DB where zip = z | z are valid zip codes }
  <T> { select * from DB where type = t | t are valid store types }
  <T, Z> { select * from DB where zip = z and type = t | … }
Problem: find the best possible query templates
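The store-locator templates above can be sketched as follows. This is an illustration only: the candidate values and defaults are invented, and queries_for_template is a hypothetical helper, not code from the system.

```python
from itertools import product


def queries_for_template(binding, values, defaults):
    """Yield one full input assignment per query in the template.

    binding:  names of the binding inputs (IB)
    values:   candidate values for each input
    defaults: default value used for every non-binding input
    """
    names = sorted(binding)
    for combo in product(*(values[name] for name in names)):
        assignment = dict(defaults)
        assignment.update(zip(names, combo))
        yield assignment


# Hypothetical candidate values for a store locator with two inputs.
values = {"zip": ["94043", "10001"], "type": ["books", "music", "cafe"]}
defaults = {"zip": "", "type": "All"}

template_z = list(queries_for_template({"zip"}, values, defaults))
template_t = list(queries_for_template({"type"}, values, defaults))
template_tz = list(queries_for_template({"type", "zip"}, values, defaults))

print(len(template_z), len(template_t), len(template_tz))
```

The template with both inputs bound produces the product of the two value lists, which is why large templates generate many more queries than small ones.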
Predicting Input Combinations
Forms can have multiple inputs
  Generating all possible URLs is wasteful! … and unnecessary!
Goal: minimize URLs while maximizing retrieval!
Other considerations
  Generated URLs must be good candidates for the index
  Only need URLs sufficient to drive traffic
  Only need URLs sufficient to seed the web crawler
Query Template Quality
Presentation input is binding
– There exists a template with fewer binding inputs
Large query templates (many binding inputs)
– Too many queries generated
– Numerous queries with empty results
+ Likely to ensure complete coverage
Small query templates (fewer binding inputs)
+ Smaller number of queries
– Lower actual coverage (restrictions on the results per page)
– Results of a single query not sufficiently related
Good Query Templates
Do not contain presentation inputs
Neither too small nor too large
  Dependent on database size?
  Dependent on potential query traffic?
Informative Query Templates
http://jobs.shrm.org/search?state=All&kw=&type=All
http://jobs.shrm.org/search?state=AL&kw=&type=All
http://jobs.shrm.org/search?state=AK&kw=&type=All
…
http://jobs.shrm.org/search?state=WV&kw=&type=All
Result pages different → informative
http://jobs.shrm.org/search?state=All&kw=&type=ALL
http://jobs.shrm.org/search?state=All&kw=&type=ANY
http://jobs.shrm.org/search?state=All&kw=&type=EXACT
Result pages similar → uninformative
Identifying Informative Templates
Generate a sampling of possible form submissions
Analyze and compare the contents of the result pages
  Compute content signatures for each corresponding web page
  Dist. Frac. = # Distinct Signatures / # URLs
  Dist. Frac. > Threshold → Informative Template
Content signatures must be robust to
  Changes in HTML layout
  Minor differences in content
  Presence of advertisements and transient content
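A rough sketch of the distinctness-fraction test. The signature here (a hash of the set of alphabetic words on the page) is a stand-in chosen for brevity, not the signature used by the system, and the threshold of 0.25 is an assumed value since the slide does not give one.

```python
import hashlib
import re


def content_signature(page_text):
    """Crude signature: hash of the sorted set of lowercase alphabetic words.

    Ignoring word order, repetition, and digits gives a little robustness
    to layout changes and transient counters. This is an assumption made
    for the sketch only.
    """
    words = sorted(set(re.findall(r"[a-z]+", page_text.lower())))
    return hashlib.sha1(" ".join(words).encode("utf-8")).hexdigest()


def distinctness_fraction(pages):
    """Number of distinct signatures divided by number of sampled URLs."""
    if not pages:
        return 0.0
    return len({content_signature(p) for p in pages}) / len(pages)


def is_informative(pages, threshold=0.25):  # threshold value is assumed
    return distinctness_fraction(pages) > threshold


# Toy result pages: varying "state" changes the content ...
pages_by_state = [
    "Jobs in Alabama: nurse, welder",
    "Jobs in Alaska: pilot",
    "Jobs in West Virginia: miner",
    "Jobs in Alaska: pilot",
]
# ... while varying "type" with an empty keyword only changes a timing line.
pages_by_type = [
    f"All jobs: nurse, pilot, miner. Served in {ms} ms" for ms in (12, 15, 11, 19, 14)
]

print(distinctness_fraction(pages_by_state), is_informative(pages_by_state))
print(distinctness_fraction(pages_by_type), is_informative(pages_by_type))
```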
URL Generation
A low distinctness fraction indicates one of
  Presentation inputs: many pages have similar results
  Very large template: many pages are empty
  Error template: all pages are the same with an error message
In each case, generated submissions are unlikely to be useful
URL generation strategy
  Enumerate all possible query templates
  Test each template for informativeness
  Generate all URLs from informative templates
Incremental Template Search
Determine informative templates with one binding input
Determine informative templates with two binding inputs
  Only consider pairs with one input known to be informative
Incrementally build candidate templates
  Only consider supersets of smaller informative templates
Halt when no larger templates are possible
ISIT: Incremental Search for Informative Templates
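The incremental search can be sketched as below. This follows the steps listed on the slide, with the informativeness test passed in as a function; the cap of three binding inputs, the input names, and the toy oracle are assumptions made for the example.

```python
def isit(inputs, is_informative, max_size=3):
    """Incrementally search for informative templates.

    inputs:         list of input names in the form
    is_informative: callable taking a frozenset of binding inputs
    max_size:       assumed cap on the number of binding inputs
    Returns the informative templates in the order they were found.
    """
    candidates = [frozenset([name]) for name in inputs]
    seen = set(candidates)
    informative = []
    size = 1
    while candidates and size <= max_size:
        found = [t for t in candidates if is_informative(t)]
        informative.extend(found)
        # Only extend templates that were informative at the current size.
        candidates = []
        for template in found:
            for name in inputs:
                bigger = template | {name}
                if bigger not in seen:
                    seen.add(bigger)
                    candidates.append(bigger)
        size += 1
    return informative


# Toy oracle: pretend only these templates pass the informativeness test.
GOOD = {
    frozenset({"state"}),
    frozenset({"category"}),
    frozenset({"state", "category"}),
}
tested = []


def toy_test(template):
    tested.append(template)
    return template in GOOD


found = isit(["state", "category", "sort", "per_page"], toy_test)
print(len(found), "informative templates;", len(tested), "of 15 possible templates tested")
```

In this toy run the pair of presentation inputs and most triples are never tested, because no informative smaller template leads to them.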
Scalable URL Generation
Our algorithm generates far fewer URLs
  Informativeness test plays a critical role
  Number of URLs generated depends on database size
Competitors
  • Cartesian: all possible URLs
  • Triple: templates with three binding inputs
[Figure: average URLs per form (log scale, 1 to 10,000,000,000) against number of inputs (1 to 10), for INFORMATIVE, CARTESIAN, and TRIPLE]
Other significant results
Larger templates are useful
  Compare with simple strategy: single binding input templates
  Among forms with informative templates with 3 inputs
    Templates of size 1 contribute 6% of search results on Google.com
    Templates of size 2 contribute 37%
    Templates of size 3 contribute 57%
Informative templates are discovered efficiently
  Among forms with 5 inputs, on average
    Only 12.6 (out of possible 31) templates are tested
    Only 1300 URLs are analyzed in total
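The figure of 31 possible templates for a form with 5 inputs is the number of non-empty subsets of the inputs, 2^5 - 1. A quick check, with placeholder input names:

```python
from itertools import combinations


def all_templates(inputs):
    """Every non-empty subset of the inputs, i.e. every possible template."""
    return [
        frozenset(combo)
        for size in range(1, len(inputs) + 1)
        for combo in combinations(inputs, size)
    ]


five_inputs = ["a", "b", "c", "d", "e"]  # placeholder names
template_count = len(all_templates(five_inputs))
print(template_count, 2 ** len(five_inputs) - 1)
```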
Predicting Text Values
Generic and Typed Text Boxes
Generic Search Boxes
  Accept any keywords
  Challenge: selecting the most appropriate values
Typed Text Boxes
  Only values belonging to specific types, e.g., zip codes
  Challenge: selecting the type of the input
Example: www.wipo.int
Input values for Generic Search
Iterative Probing for search boxes
  Select an initial list of candidate keywords
  Download pages based on current set of keywords
  Extract more candidate keywords from result pages
  Refine the current set of keywords
  Repeat until there are no new candidate keywords
  Prune list of candidate keywords
Related Work:
  Classifying Deep-Web sources [Ipeirotis+]
  Extracting text documents [Ntoulas+, Barbosa+]
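A toy sketch of the iterative probing loop. The search function, the four-letter minimum for keywords, the round limit, and the pruning rule (drop words that appear on more than half of the retrieved pages) are all assumptions made for illustration; the slide does not specify how keywords are extracted or pruned.

```python
import re
from collections import Counter


def iterative_probe(search, seeds, max_rounds=5, max_page_fraction=0.5):
    """Probe a search box, growing the keyword list from result pages.

    search: callable mapping a keyword to a list of result-page texts
    Returns (pruned keyword list, set of pages retrieved).
    """
    keywords = list(dict.fromkeys(seeds))  # de-duplicate, keep order
    known = set(keywords)
    frontier = list(keywords)
    page_count = Counter()  # number of retrieved pages containing each word
    pages = set()
    for _ in range(max_rounds):
        if not frontier:
            break  # no new candidate keywords
        new_words = []
        for keyword in frontier:
            for page in search(keyword):
                if page in pages:
                    continue
                pages.add(page)
                for word in dict.fromkeys(re.findall(r"[a-z]{4,}", page.lower())):
                    page_count[word] += 1
                    if word not in known:
                        known.add(word)
                        new_words.append(word)
        keywords.extend(new_words)
        frontier = new_words
    if not pages:
        return keywords, pages
    # Assumed pruning rule: words on most pages are likely boilerplate.
    pruned = [k for k in keywords if page_count[k] / len(pages) <= max_page_fraction]
    return pruned, pages


# Toy in-memory "site" standing in for a real search form.
docs = [
    "protein antibody assay",
    "antibody viscosity patent",
    "viscosity ozonizer patent",
    "metalworking lathe patent",
]


def toy_search(keyword):
    return [d for d in docs if keyword in d.split()]


keywords, pages = iterative_probe(toy_search, ["protein"])
print(keywords)
print(len(pages), "of", len(docs), "pages retrieved")
```

Starting from a single seed, each round follows words found on earlier result pages, so coverage grows until no unseen words remain.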
Example: www.wipo.int
Metalworking, Protein, Antibody, Pyrazole, Immobilizer, Vasoconstriction, Phosphinates, Nosepiece, Sandbridge, Viscosity, Carboxydiphenylsulphide, Ozonizer, …
Results Summary
Distribution of keywords extracted is heavy tailed
Large fraction of records can be retrieved
Text inputs and select menus are complementary and both are important
Web crawler can automatically retrieve additional content
Typed Text Boxes
Library of types that are common across domains
  Name patterns and sample values
  Zip codes, City Names, Prices, Dates
Re-use informativeness test
  Test singleton text boxes
  Informative only when using the correct type
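One way to picture the type library and the reuse of the informativeness test. Everything concrete here is hypothetical: the name patterns, the sample values, the 0.25 threshold, and the use of exact page equality in place of a real content signature.

```python
import re

# Hypothetical type library: a name pattern plus a few sample values per type.
TYPE_LIBRARY = {
    "zipcode": {
        "name_pattern": re.compile(r"zip|postal", re.I),
        "samples": ["94043", "10001", "60601", "98101"],
    },
    "city": {
        "name_pattern": re.compile(r"city|town", re.I),
        "samples": ["Boston", "Chicago", "Seattle", "Austin"],
    },
    "price": {
        "name_pattern": re.compile(r"price|cost", re.I),
        "samples": ["5", "50", "500", "5000"],
    },
    "date": {
        "name_pattern": re.compile(r"date|day", re.I),
        "samples": ["2008-08-24", "2008-08-25", "2008-08-26", "2008-08-27"],
    },
}


def candidate_types(input_name):
    """Types whose name pattern matches the input name come first."""
    matched = [t for t, spec in TYPE_LIBRARY.items() if spec["name_pattern"].search(input_name)]
    return matched + [t for t in TYPE_LIBRARY if t not in matched]


def detect_type(input_name, fetch_page, threshold=0.25):
    """Return the first type whose sample values yield distinct result pages."""
    for type_name in candidate_types(input_name):
        pages = [fetch_page(value) for value in TYPE_LIBRARY[type_name]["samples"]]
        # Simplification: identical page text stands in for identical signatures.
        if len(set(pages)) / len(pages) > threshold:
            return type_name
    return None


# Toy store locator that only understands five-digit zip codes.
def toy_locator(value):
    return f"Stores near {value}" if re.fullmatch(r"\d{5}", value) else "No stores found"


print(detect_type("zip", toy_locator))
print(detect_type("city", toy_locator))
```

The second call shows why the test matters: the input is named like a city field, but only zip-code samples produce distinct result pages, so the name pattern serves as a hint rather than the final answer.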
Summary
Google’s Deep-Web Crawl
Solution based on the idea of informative templates
Automatic descriptions learned for millions of forms
Spans many domains and 50+ languages
Affects more than 1000 queries per second
Results served from 400K+ distinct forms per day
Results served from 800K+ distinct forms per week
Results validate the utility of Deep-Web content
Future Work
Extending the coverage of crawlable forms
  Dependencies between inputs, which are currently being ignored
  JavaScript-based submissions, which involve complex URL generation
Surfacing is only part of the solution
  POST forms cannot be indexed by surfacing
  Surfacing flattens structure, which then cannot be exploited during ranking
Related to 3D-LBS
• Mobile application
• Accessibility
• Limited screen size, hard to fill in forms
• Recommendation
• Location-sensitive query suggestion
• Dependency of inputs
• Hong Kong Style Dim Sum Shatin
Q&A
Thanks!