index structures for querying the deep web

Index Structures for Querying the Deep Web. Jian Qiu, Feng Shao, Jayavel Shanmugasundaram (Cornell University); Misha Zatsman (Google)


Post on 02-Feb-2016


Page 1: Index Structures for Querying the Deep Web

Index Structures for Querying the Deep Web

Jian Qiu, Feng Shao, Jayavel Shanmugasundaram
Cornell University

Misha Zatsman
Google

Page 2: Index Structures for Querying the Deep Web

Deep Web

Surface web: keyword queries over static web pages

Page 3: Index Structures for Querying the Deep Web

Deep Web

Surface web: keyword queries over static web pages

Deep web: the databases behind sites such as eBay (www.ebay.com), CNN, Cars.com, Amazon, …

The deep web is 400-500 times the size of the surface web!

Page 4: Index Structures for Querying the Deep Web

Deep Glue

Deep Glue takes structured queries and returns query results

Deep web: eBay database, CNN databases, Cars.com database, Amazon database, …

The deep web is 400-500 times the size of the surface web!

Page 5: Index Structures for Querying the Deep Web

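Deep Glue System

[Figure: system architecture]

Query: Find textbooks with price < $50
Result: Database Concepts @ half.com, …
Query Engine: consults the index structures to obtain a superset of the relevant data sources
Indexer: builds the index structures from the data sources on the Internet (…, Half.com databases)
Index structures: our focus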

Page 6: Index Structures for Querying the Deep Web

Index structure for deep web: Challenges

Deal with structured data
  Underlying databases are structured
  Surface web is typically unstructured

Deal with large volumes
  Orders of magnitude larger than the surface web

Page 7: Index Structures for Querying the Deep Web

Our approach

Understand the structure/typing of the data

Support equality and range queries

Heavily compress the index
  Achieve a factor of 10 compression
  Tradeoff between the compression factor and the number of false positives
  A compression factor of 10 costs only ~10 false positives for 1000 data sources

Page 8: Index Structures for Querying the Deep Web

Outline

Query model
Index structures
Experimental evaluation
Related work and conclusion

Page 9: Index Structures for Querying the Deep Web

Assumptions

Data sources are classified into domains
  Online car dealers, online auctions, online travel agents, …

Data sources in the same domain use the same logical relational schema

Indexing attributes
  Price, date, make, model, ISBN, …
  Indexed by the Deep Glue system

Indexing data can be obtained via
  Crawling the deep web [Raghavan 01]
  Previously agreed-upon protocols [Froogle]

Page 10: Index Structures for Querying the Deep Web

Query Model

Supports equality and range queries, currently on a single indexing attribute

Schema: Car(Id, Make, Model, Year, Price)

Queries:
  Find all year 2003 cars: year = 2003
  Find all cars that cost less than $1000: price < 1000

Page 11: Index Structures for Querying the Deep Web

Outline

Query model
Index structures
Experimental evaluation
Related work and conclusion

Page 12: Index Structures for Querying the Deep Web

Overview

Uncompressed Index

Compressed indexes, which still support equality and range queries
  Value Clustered Index (VCI)
  DataSource Clustered Index (DCI)
  Value DataSource Clustered Index (VDCI)
  Histogram Based Index (HBI)

Page 13: Index Structures for Querying the Deep Web

Uncompressed Index (UI)

For each distinct value v of an indexing attribute, the UI stores the list of data sources that contain v

Data graph (X: the data source contains the value; d1: ebay.com, d2: amazon.com, …)

value  d1  d2  d3  d4  d5  d6  d7  d8
1                          X   X   X
2          X   X   X       X
3                  X
4      X           X   X
5          X   X           X
6      X                       X   X

UI: a B+tree on value

value  data sources
1      d6, d7, d8
2      d2, d3, d4, d6
3      d4
4      d1, d4, d5
5      d2, d3, d6
6      d1, d7, d8
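The lookups this slide implies can be sketched as follows. This is a minimal illustration, not the authors' implementation: a sorted Python list with `bisect` stands in for the B+tree, and the names `UI`, `equality_query`, and `range_query` are made up for the example. The data is the value-to-data-source list from this slide.

```python
from bisect import bisect_left, bisect_right

# Example data from the slide: value -> data sources that contain the value.
UI = {
    1: {"d6", "d7", "d8"},
    2: {"d2", "d3", "d4", "d6"},
    3: {"d4"},
    4: {"d1", "d4", "d5"},
    5: {"d2", "d3", "d6"},
    6: {"d1", "d7", "d8"},
}

# Assumption: a sorted key list stands in for the ordered keys of a B+tree.
SORTED_VALUES = sorted(UI)


def equality_query(value):
    """Return the data sources that contain exactly this value."""
    return set(UI.get(value, set()))


def range_query(low, high):
    """Return the data sources holding at least one value v with low <= v <= high."""
    start = bisect_left(SORTED_VALUES, low)
    stop = bisect_right(SORTED_VALUES, high)
    result = set()
    for v in SORTED_VALUES[start:stop]:
        result |= UI[v]
    return result
```

Because the uncompressed index stores an exact list per value, both lookups return no false positives; the cost is one list per distinct value.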

Page 14: Index Structures for Querying the Deep Web

Problems

A huge number of values and data sources in the deep web!
  Indexing every indexing attribute requires space

Need to compress the UI!
  Use gzip?
  The index would have to be uncompressed first, which makes index lookups too expensive!

Need new compression techniques

Page 15: Index Structures for Querying the Deep Web

Overview

Uncompressed Index

Compressed indexes, which still support equality and range queries
  Value Clustered Index (VCI)
  DataSource Clustered Index (DCI)
  Value DataSource Clustered Index (VDCI)
  Histogram Based Index (HBI)

Page 16: Index Structures for Querying the Deep Web

Value Clustered Index (VCI)

Intuition: “closely related” values are stored in “closely related” data sources
  e.g., the ISBN numbers of antique books are found at the online book retailers specializing in antique books

Cluster “closely related” values

Store the list of data sources only for each cluster, not for each value

Page 17: Index Structures for Querying the Deep Web

VCI Example

[Figure: value/data source matrix for values 1-6 and data sources d1-d8, with the values grouped into three clusters]

Cluster 1: {1, 6}
Cluster 2: {2, 5}
Cluster 3: {3, 4}

VCI structures (B+tree)

value  cluster id
1      c1
2      c2
3      c3
4      c3
5      c2
6      c1

cluster id  data sources (union over the values in the cluster)
c1          d1, d6, d7, d8
c2          d2, d3, d4, d6
c3          d1, d4, d5

False positives: e.g., a lookup of value 1 also returns data source d1

Tradeoff between space and accuracy
  Mapping all values into one cluster
  Mapping each distinct value into a separate cluster
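A minimal sketch of a VCI lookup, using the two tables from this slide (value to cluster id, cluster id to data sources). The function names and the `EXACT` dictionary are illustrative only; `EXACT` holds the uncompressed lists from the Uncompressed Index slide so that false positives can be counted.

```python
# Tables from the slide.
VALUE_TO_CLUSTER = {1: "c1", 2: "c2", 3: "c3", 4: "c3", 5: "c2", 6: "c1"}
CLUSTER_TO_SOURCES = {
    "c1": {"d1", "d6", "d7", "d8"},
    "c2": {"d2", "d3", "d4", "d6"},
    "c3": {"d1", "d4", "d5"},
}

# Exact per-value lists (from the Uncompressed Index slide), used only to
# measure how many false positives the compressed index returns.
EXACT = {
    1: {"d6", "d7", "d8"},
    2: {"d2", "d3", "d4", "d6"},
    3: {"d4"},
    4: {"d1", "d4", "d5"},
    5: {"d2", "d3", "d6"},
    6: {"d1", "d7", "d8"},
}


def vci_equality(value):
    """Return the data sources of the cluster that the value belongs to."""
    cluster = VALUE_TO_CLUSTER.get(value)
    return set(CLUSTER_TO_SOURCES.get(cluster, set()))


def vci_range(low, high):
    """Union the data sources of every cluster with a value in [low, high]."""
    clusters = {c for v, c in VALUE_TO_CLUSTER.items() if low <= v <= high}
    result = set()
    for c in clusters:
        result |= CLUSTER_TO_SOURCES[c]
    return result


def false_positives(value):
    """Data sources returned for the value that do not actually contain it."""
    return vci_equality(value) - EXACT[value]
```

For value 1 the lookup returns the whole of cluster c1, so d1 comes back as the false positive named on the slide. Only three lists are stored instead of six.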

Page 18: Index Structures for Querying the Deep Web

VCI Implementation

Use an existing scalable clustering algorithm
  Scales to large data sets: BIRCH framework [Zhang96]

Minimize the number of false positives

Specify the parameters for BIRCH
  Centroid: the mid-point of a cluster
  Radius: a measure of quality for a cluster
  Distance: the distance between clusters

[Figure: two clusters, cluster1 and cluster2, annotated with centroid, radius, and distance]

Page 19: Index Structures for Querying the Deep Web

VCI formulae

For a cluster having the set of values V, where ds(v) is the set of data sources for value v:

\[ centroid(V) = \bigcup_{v \in V} ds(v) \]
  (the data sources associated with the cluster)

\[ radius(V) = \frac{\sum_{v \in V} |centroid(V) \setminus ds(v)|}{|V|} \]
  (the numerator is the sum of the number of false positives)

\[ distance(V_1, V_2) = |V_1| \cdot |centroid(V_2) \setminus centroid(V_1)| + |V_2| \cdot |centroid(V_1) \setminus centroid(V_2)| \]
  (the additional number of false positives when merging two clusters)
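The three quantities can be sketched with Python sets. This is a hedged reading of the slide, not the authors' code: the set differences and the division by the number of values in `radius` are assumptions recovered from the garbled formulae, and `false_positive_sum` is a helper name introduced here. The data `DS` is the value-to-data-source list from the Uncompressed Index slide.

```python
# ds(v): value -> data sources (data from the Uncompressed Index slide).
DS = {
    1: {"d6", "d7", "d8"},
    2: {"d2", "d3", "d4", "d6"},
    3: {"d4"},
    4: {"d1", "d4", "d5"},
    5: {"d2", "d3", "d6"},
    6: {"d1", "d7", "d8"},
}


def centroid(values, ds):
    """Union of the data sources of all values in the cluster."""
    result = set()
    for v in values:
        result |= ds[v]
    return result


def false_positive_sum(values, ds):
    """Total false positives: sources in the centroid that a value lacks."""
    c = centroid(values, ds)
    return sum(len(c - ds[v]) for v in values)


def radius(values, ds):
    """Assumption: false positives averaged over the values in the cluster."""
    return false_positive_sum(values, ds) / len(values)


def distance(v1, v2, ds):
    """Additional false positives caused by merging two disjoint clusters."""
    c1, c2 = centroid(v1, ds), centroid(v2, ds)
    return len(v1) * len(c2 - c1) + len(v2) * len(c1 - c2)
```

For clusters {1, 6} and {2, 5}, `distance` gives 12, which equals the growth of `false_positive_sum` when the two clusters are merged (15 after the merge, 2 and 1 before).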

Page 20: Index Structures for Querying the Deep Web

Overview

Uncompressed Index

Compressed Index:
  Value Clustered Index (VCI)
  DataSource Clustered Index (DCI)
  Value DataSource Clustered Index (VDCI)
  Histogram Based Index (HBI)

Page 21: Index Structures for Querying the Deep Web

DataSource Clustered Index (DCI)

Intuition: “closely related” data sources may have “closely related” sets of values
  e.g., Amazon and B&N have similar sets of ISBN numbers

In the data graph, VCI clusters rows and DCI clusters columns

[Figure: value/data source matrix for values 1-6 and data sources d1-d8, with the data sources grouped into three clusters]

Cluster 1: {d2, d3, d6}
Cluster 2: {d4, d5}
Cluster 3: {d1, d7, d8}

Table structures are similar to VCI.

See paper for other details
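One way the DCI tables could look is sketched below. The slide defers the details to the paper, so the layout here is an assumption: each value maps to the ids of the data-source clusters that contain at least one matching source, and a lookup returns every member of those clusters. The cluster ids `k1` to `k3` and the function names are hypothetical; the clusters are the ones on this slide, and `EXACT` reuses the lists from the Uncompressed Index slide rather than this slide's matrix.

```python
# Data-source clusters from the slide (the ids k1..k3 are made up here).
SOURCE_CLUSTERS = {
    "k1": {"d2", "d3", "d6"},
    "k2": {"d4", "d5"},
    "k3": {"d1", "d7", "d8"},
}

# Exact per-value lists, taken from the Uncompressed Index slide.
EXACT = {
    1: {"d6", "d7", "d8"},
    2: {"d2", "d3", "d4", "d6"},
    3: {"d4"},
    4: {"d1", "d4", "d5"},
    5: {"d2", "d3", "d6"},
    6: {"d1", "d7", "d8"},
}


def build_dci(exact, source_clusters):
    """Map each value to the clusters holding at least one source with it."""
    index = {}
    for value, sources in exact.items():
        index[value] = {
            cid for cid, members in source_clusters.items() if members & sources
        }
    return index


def dci_equality(index, source_clusters, value):
    """Return all members of every cluster recorded for the value."""
    result = set()
    for cid in index.get(value, set()):
        result |= source_clusters[cid]
    return result
```

Under this layout, value 5 maps to cluster k1 alone and comes back exact, while value 1 touches k1 and k3 and picks up d1, d2, and d3 as false positives.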

Page 22: Index Structures for Querying the Deep Web

Overview

Uncompressed Index

Compressed Index:
  Value Clustered Index (VCI)
  DataSource Clustered Index (DCI)
  Value DataSource Clustered Index (VDCI)
  Histogram Based Index (HBI)

Page 23: Index Structures for Querying the Deep Web

Value-DataSource Clustered Index (VDCI)

VCI, DCI: cluster in 1 dimension
VDCI: clusters in 2 dimensions, generalizes VCI/DCI
Cluster: a set of values and a set of data sources

[Figure: value/data source matrix for values 1-6 and data sources d1-d8, with three two-dimensional clusters marked]

Cluster 1: { {2, 3}, {d2, d3, d4} }
Cluster 2: { {4, 5}, {d4, d5, d6} }
Cluster 3: { {1, 2}, {d6, d7, d8} }

Data source d4 is in two clusters

Value 2 is in two clusters

Table structures are similar to VCI.

See paper for other details

Page 24: Index Structures for Querying the Deep Web

Overview

Uncompressed Index

Compressed Index:
  Value Clustered Index (VCI)
  DataSource Clustered Index (DCI)
  Value DataSource Clustered Index (VDCI)
  Histogram Based Index (HBI)

Page 25: Index Structures for Querying the Deep Web

Histogram Based Index (HBI)

VCI/VDCI don’t consider the ordering among values
  Range queries imply the need for it

HBI groups adjacent values into the same cluster

Accuracy still has to be ensured
  Use a threshold to determine the boundary of a cluster
  Threshold: the average number of false positives in a cluster

Page 26: Index Structures for Querying the Deep Web

HBI Example

[Figure: value/data source matrix for values 1-6 and data sources d1-d8]

Threshold: 2

Cluster adjacent values

Cluster 1: {1}
Cluster 2: {2, 3, 4}
Cluster 3: {5, 6}
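A hedged sketch of threshold-based grouping of adjacent values. The slides do not give the construction algorithm, so the greedy left-to-right scan below is an assumption, and `build_hbi` and `average_false_positives` are names invented for the example. The data `DS` is the value-to-data-source list from the Uncompressed Index slide, not the matrix of this example.

```python
# value -> data sources (data from the Uncompressed Index slide).
DS = {
    1: {"d6", "d7", "d8"},
    2: {"d2", "d3", "d4", "d6"},
    3: {"d4"},
    4: {"d1", "d4", "d5"},
    5: {"d2", "d3", "d6"},
    6: {"d1", "d7", "d8"},
}


def average_false_positives(bucket, ds):
    """Average number of false positives per value if bucket shares one list."""
    union = set()
    for v in bucket:
        union |= ds[v]
    return sum(len(union - ds[v]) for v in bucket) / len(bucket)


def build_hbi(ds, threshold):
    """Assumed greedy scan: extend the current bucket with the next adjacent
    value while the average number of false positives stays within threshold."""
    buckets = []
    current = []
    for v in sorted(ds):
        candidate = current + [v]
        if current and average_false_positives(candidate, ds) > threshold:
            buckets.append(current)  # close the bucket before v
            current = [v]
        else:
            current = candidate
    if current:
        buckets.append(current)
    return buckets
```

Each bucket covers a contiguous run of values, so a range query only needs the buckets that overlap the range. Raising the threshold yields fewer, larger buckets and therefore more compression at the cost of more false positives.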

Page 27: Index Structures for Querying the Deep Web

Outline

Query model
Index structures
Experimental evaluation
Related work and conclusion

Page 28: Index Structures for Querying the Deep Web

Experimental setup

Synthetic data
  1000 data sources, 100,000 values, 4,000,000 (value, data source) pairs
  Other parameters are in the paper

Metrics
  Index creation time
  Compression factor
  False positives

Setup
  2.8GHz Pentium IV, 1GB memory, 80GB disk
  C++

Page 29: Index Structures for Querying the Deep Web

Index creation time

Index structure  Time (min)
UI               0.25
VCI              15
DCI              3
VDCI             180
HBI              2.5

Page 30: Index Structures for Querying the Deep Web

Equality queries (1000 data sources)

[Figure: false positives (y-axis, 0 to 50) versus compression factor (x-axis, 0 to 20) for VCI, DCI, VDCI, and HBI]

Page 31: Index Structures for Querying the Deep Web

Range Queries (1000 data sources)

[Figure: false positives (y-axis, 0 to 50) versus compression factor (x-axis, 0 to 20) for VCI, DCI, VDCI, and HBI]

Page 32: Index Structures for Querying the Deep Web

Outline

Query model
Index structures
Experimental evaluation
Related work and conclusion

Page 33: Index Structures for Querying the Deep Web

Related work

Distributed database & information integration
  Niagara system [Naughton01]
  GlOSS [Gravano99]
  …

Database/Inverted list compression
  Query Optimization in Compressed Databases [Chen 01]
  Compressing the Relations and Index [Goldstein 98]
  Improved Query Performance with Variant Indices [O’Neill 97]
  Implementation and Performance of Compressed Databases [Westmann 00]
  Size Reduction of Inverted Files [Weiss 90]
  …

Page 34: Index Structures for Querying the Deep Web

Conclusion

Space-efficient index structures for querying the deep web
  Support equality and range queries
  A factor of 10 compression with a little loss in precision

Future work
  Combine cluster-based and histogram-based indexes
  Multiple-attribute queries
  Joins
  Incremental index maintenance

Page 35: Index Structures for Querying the Deep Web

Questions?

Page 36: Index Structures for Querying the Deep Web

Experimental setup

Other parameters:

Number of groups
  The data sources in the same group use the same distribution to generate their values
  Default: 20

Group mode
  How many groups a data source belongs to
  Default: 1

Value correlation
  How the order in the value space maps to the value ordering over which the Gaussian distribution is used
  Default: 0.2