
Accessing the Deep Web

Bin He, IBM Almaden Research Center in San Jose, CA
Mitesh Patel, Microsoft Corporation
Zhen Zhang, computer science at the University of Illinois at Urbana-Champaign
Kevin Chen-Chuan Chang, computer science at the University of Illinois at Urbana-Champaign

COMMUNICATIONS OF THE ACM May 2007/Vol. 50, No. 5

1

Introduction

• the Web has been rapidly “deepened” by massive databases online
  ▫ current search engines do not reach most of the data on the Internet
• surface Web
  ▫ linked static HTML pages
• a far more significant amount of information is believed to be “hidden” in the deep Web
  ▫ behind the query forms of searchable databases

2

Conceptual View of the Deep Web

3

Introduction (Cont.)

• this article reports a survey of the deep Web
  ▫ scale
  ▫ subject distribution
  ▫ search-engine coverage
  ▫ other access characteristics of online databases

4

Related Work

• BrightPlanet.com, 2000
  ▫ established interest in this area
  ▫ focused on only the scale aspect
  ▫ overlap analysis: 43,000–96,000 “deep Web sites”; informal estimate of 7,500 terabytes of data, 500 times larger than the surface Web
  ▫ underestimate: the analysis assumes two search engines obtain their data randomly and independently, whereas their coverage of deep Web data is actually highly correlated

5

Global Scale Estimation

• IP sampling approach
  ▫ randomly sampled 1,000,000 IPs
  ▫ from the entire space of 2,230,124,544 valid IP addresses
• for each IP
  ▫ downloaded HTML pages
  ▫ identified and analyzed the Web databases in this sample
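The sampling step can be sketched as follows. This is only an illustration: `VALID_BLOCKS` is a hypothetical stand-in built from documentation address ranges, since the slides do not list which ranges make up the 2,230,124,544 valid addresses, and `sample_ips` is a name chosen here, not one from the survey.

```python
import ipaddress
import random

# Hypothetical stand-in for the valid address space (assumption: the blocks
# are disjoint). The survey's actual list of valid ranges is not given.
VALID_BLOCKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]

def sample_ips(blocks, k, seed=None):
    """Draw k distinct addresses uniformly from the union of disjoint blocks."""
    rng = random.Random(seed)
    sizes = [block.num_addresses for block in blocks]
    total = sum(sizes)
    # Sample k distinct offsets into the concatenated address space,
    # then map each offset back to the block that contains it.
    offsets = rng.sample(range(total), k)
    result = []
    for offset in offsets:
        for block, size in zip(blocks, sizes):
            if offset < size:
                result.append(block[offset])
                break
            offset -= size
    return result

sample = sample_ips(VALID_BLOCKS, 10, seed=0)
```

Sampling offsets without replacement keeps every address equally likely and guarantees that the sampled IPs are unique, which matters when the sample count is later scaled up to the whole space.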

6

Site, Databases, and Interface

• distinguish three related notions for accessing the deep Web
  ▫ site, database, and interface
• a deep Web site
  ▫ a Web server that provides information maintained in one or more back-end Web databases
• each database is searchable through one or more HTML forms as its query interfaces

7

Site, Databases, and Interface (Cont.)

• find the number of query interfaces for each Web site, then the number of Web databases, and finally the number of deep Web sites

8

Query Interface

• exclude non-query HTML forms (which do not access back-end databases) from query interfaces
  ▫ login, subscription, registration, polling, and message posting
• exclude “site search”
  ▫ many Web sites now provide it for searching the HTML pages on their own sites
• remove duplicates

9

Web Database

• based on the discovered query interfaces
• compute the number of Web databases by finding the set of query interfaces (within a site) that refer to the same database
• if the objects from one interface can always be found in the other one
  ▫ the two interfaces are searching the same database
  ▫ tested with five randomly chosen objects
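A minimal sketch of this grouping step, under stated assumptions: each interface is modeled as a dict holding the set of objects it can return (a toy stand-in for issuing real queries), the five-object test is applied in both directions, and interfaces judged equal are merged with a small union-find. The names `same_database` and `count_databases` are illustrative, not from the survey.

```python
import random

def same_database(a, b, rng, n=5):
    """Heuristic test: n random objects from each interface are found via the other."""
    picks_a = rng.sample(sorted(a["objects"]), min(n, len(a["objects"])))
    picks_b = rng.sample(sorted(b["objects"]), min(n, len(b["objects"])))
    return (all(obj in b["objects"] for obj in picks_a)
            and all(obj in a["objects"] for obj in picks_b))

def count_databases(interfaces, seed=0):
    """Count groups of interfaces (within one site) that search the same database."""
    rng = random.Random(seed)
    parent = list(range(len(interfaces)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(interfaces)):
        for j in range(i + 1, len(interfaces)):
            if same_database(interfaces[i], interfaces[j], rng):
                parent[find(j)] = find(i)
    return len({find(i) for i in range(len(interfaces))})

# Toy site: a simple and an advanced form over the same book catalog,
# plus a separate music catalog.
books = {f"book-{k}" for k in range(6)}
music = {f"album-{k}" for k in range(6)}
site_interfaces = [
    {"name": "simple search", "objects": books},
    {"name": "advanced search", "objects": books},
    {"name": "music search", "objects": music},
]
num_databases = count_databases(site_interfaces)  # 2 for this toy site
```

Because only five objects are checked, the test is a sampling heuristic: it can merge two databases that merely overlap heavily, which is a limitation of the approach rather than of this sketch.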

10

Deep Web Site

•recognizing a deep Web site is rather simple

•a Web site is a deep Web site if it has at least one query interface

11

(Q1) Where to find “entrances” to databases?

• to access a Web database, we must first find its entrances: the query interfaces
• depth of a query interface
  ▫ the minimum number of hops from the root page of the site to the interface page
• because deep crawling is costly, analyzed only 1/10 of the total IP samples
  ▫ 100,000 IPs
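The depth definition above is a shortest-path distance, so it can be computed with a breadth-first traversal from the root page. The sketch below assumes the site's link structure is already available as a dictionary (`links`, a hypothetical toy graph) rather than being crawled live; `interface_depths` is an illustrative name.

```python
from collections import deque

def interface_depths(links, root, interface_pages, max_depth=10):
    """Return the minimum number of hops from root to each reachable interface page."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            continue  # do not follow links beyond the crawl limit
        for nxt in links.get(page, []):
            if nxt not in depth:  # first visit in BFS order is the shortest path
                depth[nxt] = depth[page] + 1
                queue.append(nxt)
    return {p: depth[p] for p in interface_pages if p in depth}

# Toy site graph: page -> pages it links to.
links = {
    "/": ["/about", "/search"],
    "/about": ["/contact"],
    "/search": ["/search/advanced"],
    "/contact": ["/search/advanced"],
}
depths = interface_depths(links, "/", {"/search", "/search/advanced"})
# depths == {"/search": 1, "/search/advanced": 2}
```

The advanced form is reachable by two paths, and breadth-first order ensures the shorter one (through `/search`) determines its depth.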

12

Results of Q1

• found 281 Web servers
• exhaustively crawling these servers to depth 10, we found 24 of them are deep Web sites
  ▫ they contained a total of 129 query interfaces representing 34 Web databases
• query interfaces tend to be located shallowly in their sites
  ▫ none of the 129 query interfaces had depth deeper than 5
  ▫ 72% (93 out of 129) of the interfaces were found within depth 3

13

Depth of Web Database

• since a Web database may be accessed through multiple interfaces
  ▫ measured its depth as the minimum depth over all its interfaces
• 94% (32 out of 34) of the Web databases appeared within depth 3
• 91.6% (22 out of 24) of the deep Web sites had their databases within depth 3
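As a small worked example of these definitions, the sketch below takes the depth of a database as the minimum over its interfaces and computes the share that falls within a depth limit. The toy depths are hypothetical; only the final three ratios use counts reported on the slides.

```python
def database_depth(interface_depths):
    """Depth of a database: the shallowest of its query interfaces."""
    return min(interface_depths)

def share_within(depths, limit=3):
    """Fraction of depths that are at most `limit`."""
    return sum(d <= limit for d in depths) / len(depths)

# Hypothetical toy site: database name -> depths of its interfaces.
toy = {"books": [1, 4], "music": [2], "archive": [5, 4]}
toy_depths = [database_depth(d) for d in toy.values()]  # [1, 2, 4]
toy_share = share_within(toy_depths)                    # 2 of 3 within depth 3

# Depth-3 coverage ratios recomputed from the reported counts.
interface_coverage = 93 / 129  # about 0.72
database_coverage = 32 / 34    # about 0.94
site_coverage = 22 / 24        # about 0.917 (shown as 91.6% on the slide)
```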

14

(Q2) What is the scale of the deep Web?

• tested and analyzed all of the 1,000,000 IP samples to estimate the scale of the deep Web
• high depth-three coverage
  ▫ almost all Web databases can be identified within depth 3
• crawled to depth 3 for these one million IPs

15

Results of Q2

• 2,256 Web servers
  ▫ 126 deep Web sites
  ▫ 406 query interfaces
  ▫ 190 Web databases

•s = 1,000,000 unique IP samples

•the entire IP space of t = 2,230,124,544 IPs

• extrapolating from the sample to the entire IP space gives estimates of
  ▫ the number of deep Web sites
  ▫ the number of Web databases
  ▫ the number of query interfaces
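The extrapolation formulas did not survive on this slide, so the following is a sketch under an assumption: each sample count is scaled by t/s and then divided by the depth-3 coverage measured in Q1, to correct for crawling only to depth 3. The function name `extrapolate` is illustrative. Under this assumption the database estimate comes out near 450,000, which agrees with the 102,000 + 348,000 split reported under Q3.

```python
S = 1_000_000      # unique IP samples
T = 2_230_124_544  # size of the entire valid IP space

def extrapolate(sample_count, depth3_coverage, s=S, t=T):
    """Scale a sample count to the whole IP space, corrected for depth-3 coverage.

    Assumption: estimate = count * (t / s) / coverage.
    """
    return sample_count * (t / s) / depth3_coverage

deep_web_sites = extrapolate(126, 0.916)    # roughly 307,000
web_databases = extrapolate(190, 0.94)      # roughly 451,000
query_interfaces = extrapolate(406, 0.72)   # roughly 1,258,000
```

Dividing by the coverage matters most for query interfaces: only 72% of them were found within depth 3, so the uncorrected count would understate the total by more than a quarter.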

16

(Q3) How “structured” is the deep Web?

• classified Web databases into two types
  ▫ unstructured databases provide data objects as unstructured media (text, images, audio, and video)
  ▫ structured databases provide data objects as structured “relational” records with attribute-value pairs

17

Results of Q3

• manual querying and inspection of the 190 Web databases sampled
  ▫ found 43 unstructured and 147 structured
  ▫ similarly estimated their total numbers to be 102,000 and 348,000
• the deep Web features mostly structured data sources
  ▫ a ratio of 3.4:1

18

(Q4) What is the subject distribution of Web databases?

• used the top-level categories of the yahoo.com directory as the taxonomy
• manually categorized the sampled 190 Web databases
• the distribution indicates great subject diversity among Web databases
  ▫ non-commerce categories account for 51% (97 out of 190 databases)

19

(Q5) How do search engines cover the deep Web?

• randomly selected 20 Web databases from the 190
• for each database, first manually sampled five objects (result pages) as test data

20

Coverage of Search Engines

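[Figure: coverage of the search engines on the sampled objects, annotated “indexing almost the same objects” and “entirely a subset of Yahoo”]

• contrasts with the surface Web
  ▫ there, the overlap between engines is low, so combining them greatly improves coverage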
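The coverage comparison can be illustrated with sets. The sketch below uses made-up numbers and anonymous engines (`engine_a`, `engine_b`, `engine_c` are hypothetical, not the survey's data); it only shows why heavily overlapping indexes gain little from being combined.

```python
def coverage(indexed, test_objects):
    """Fraction of the test objects that an engine has indexed."""
    return len(indexed & test_objects) / len(test_objects)

# 20 databases x 5 sampled objects each = 100 test objects.
test_objects = set(range(100))

# Hypothetical indexes: A and B overlap heavily, C is a subset of A.
engine_a = set(range(0, 30))
engine_b = set(range(5, 35))
engine_c = set(range(0, 10))

individual = {
    "A": coverage(engine_a, test_objects),  # 0.30
    "B": coverage(engine_b, test_objects),  # 0.30
    "C": coverage(engine_c, test_objects),  # 0.10
}
combined = coverage(engine_a | engine_b | engine_c, test_objects)  # 0.35
```

With these toy indexes, combining three engines raises coverage only from 0.30 to 0.35. If the three indexes were disjoint, the same individual figures would add up to 0.70.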

21

(Q6) What is the coverage of deep Web directories?

• several services provide deep Web directories
  ▫ they classify Web databases in some taxonomies
• for each directory, recorded the number of databases it claimed to have indexed
• low coverage
• manual classification of Web databases (directory-based indexing services)
  ▫ hardly scales for the deep Web

22

Conclusion

23

Conclusion (Cont.)

• poor coverage of both its data and databases
  ▫ access to the deep Web is not adequately supported
• vs. the surface Web
  ▫ same: large, fast-growing, and diverse
  ▫ different: more diversely distributed, mostly structured, and suffers an inherent limitation of crawling
• crawl-and-index techniques face
  ▫ “limit of coverage” and “structural heterogeneity”

24

Future Work

• database-centered, discover-and-forward access model
• automatically discover databases on the Web by crawling and indexing their query interfaces
• user querying -> forward users to the actual search of data
  ▫ use their data-specific interfaces
  ▫ fully leverage their structures
• recent projects
  ▫ MetaQuerier and WISE-Integrator
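To make the discover-and-forward idea concrete, here is a toy sketch. Everything in it is hypothetical: the `QueryInterface` record, the attribute-based index, and the example.com URLs are illustrations of the access model, not the design of MetaQuerier or WISE-Integrator.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QueryInterface:
    url: str               # where the discovered query form lives
    attributes: frozenset  # attribute names the form lets users query on

def build_index(interfaces):
    """Index discovered interfaces by the attributes they expose."""
    index = {}
    for qi in interfaces:
        for attr in qi.attributes:
            index.setdefault(attr, []).append(qi)
    return index

def forward(index, wanted_attributes):
    """Return URLs of interfaces that support every wanted attribute."""
    candidates = None
    for attr in wanted_attributes:
        urls = {qi.url for qi in index.get(attr, [])}
        candidates = urls if candidates is None else candidates & urls
    return sorted(candidates or [])

discovered = [
    QueryInterface("https://books.example.com/search",
                   frozenset({"title", "author", "isbn"})),
    QueryInterface("https://flights.example.com/find",
                   frozenset({"from", "to", "date"})),
    QueryInterface("https://library.example.com/find",
                   frozenset({"title", "author"})),
]
index = build_index(discovered)
targets = forward(index, ["title", "author"])
```

The index stores only interface descriptions, not the data behind them; the user is sent on to each matching form, so the actual search runs on the source's own structured interface.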

25
