
Page 1: Anthony Okorodudu CSE 6392 2006-4-25

Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks
By Nikos Ntarmos, Peter Triantafillou, and Gerhard Weikum

Anthony Okorodudu
CSE 6392
2006-4-25

Page 2: Anthony Okorodudu CSE 6392 2006-4-25

2006/4/25 Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks


Outline
  Introduction
  Motivation
  Related Work
  Distributed Hash Tables (DHT)
  Hash Sketches
  Distributed Hash Sketches (DHS)
  Counting with DHS
  Conclusion

Page 3: Anthony Okorodudu CSE 6392 2006-4-25


Introduction
  Peer-to-peer (P2P) started as a way of sharing files/CPU cycles among end-users
  Evolved into the cutting-edge networks of today
  Distributed Hash Tables (DHT) made this feasible
  Probabilistic guarantees for degree of efficiency, fault tolerance, and availability
  Data management systems of huge scale

Page 4: Anthony Okorodudu CSE 6392 2006-4-25


Motivation
  Need for distributed counting mechanisms
    File-sharing P2P systems: total number of documents shared by users
    Sensor networks: compute aggregates in a duplicate-insensitive manner
    Internet-scale DB system: build histograms for query access plans

Page 5: Anthony Okorodudu CSE 6392 2006-4-25


Central Goals

1. Efficiency: number of nodes contacted for counting must be small

2. Scalability and availability: large numbers of nodes may need to add elements to a (multi-)set

3. Access and storage load balancing: counting and related overheads should be fairly distributed across all nodes

Page 6: Anthony Okorodudu CSE 6392 2006-4-25


Central Goals (continued)

4. Accuracy: tunable, robust, and highly accurate cardinality estimation

5. Simplicity and ease of integration: special, solution-based indexing structures should be avoided

6. Duplicate (in)sensitivity: count total number of items as well as the number of unique items in multi-sets

Page 7: Anthony Okorodudu CSE 6392 2006-4-25


Distributed Counting Protocols
  One-node-per-counter protocols
  Gossip-based protocols
  Broadcast/convergecast-type protocols
  Sampling-based protocols

Page 8: Anthony Okorodudu CSE 6392 2006-4-25


One-node-per-counter
  Select a node in the overlay of the DHT and use it to maintain the counter value
  Poor scalability
  Resembles a centralized system

Page 9: Anthony Okorodudu CSE 6392 2006-4-25


Gossip-based
  Provide weak probabilistic semantics of “eventual consistency” for the outcome
  Every node exchanges information with a set of nodes
  Low bandwidth
  Not efficient in terms of number of nodes to be contacted
  Low accuracy
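As a rough illustration of the "eventual consistency" behaviour, the sketch below uses plain pairwise averaging, a generic gossip scheme chosen here for brevity and not necessarily any protocol the slides have in mind. The function name `gossip_average` and its parameters are hypothetical. Each exchange keeps the global sum unchanged, so every node's value drifts toward the network-wide mean, and multiplying that mean by the number of nodes recovers the total.

```python
import random


def gossip_average(values, rounds, rng):
    """Run `rounds` pairwise exchanges; return each node's final value.

    Illustrative assumption: any two nodes can talk to each other directly.
    """
    vals = list(values)
    for _ in range(rounds):
        # Pick two distinct nodes and let both adopt their average.
        i, j = rng.sample(range(len(vals)), 2)
        vals[i] = vals[j] = (vals[i] + vals[j]) / 2
    return vals


local_counts = [10, 0, 4, 6, 0, 20, 8, 0]
final = gossip_average(local_counts, rounds=500, rng=random.Random(0))
estimated_total = final[0] * len(local_counts)  # mean times number of nodes
```

Note that every node takes part in many exchanges before the values settle, which is the inefficiency in contacted nodes the slide points to.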

Page 10: Anthony Okorodudu CSE 6392 2006-4-25


Broadcast/Convergecast-type
1. Broadcast phase
     Querying node broadcasts the query through the network, creating a tree of nodes as the query propagates through the overlay
2. Convergecast phase
     Each node sends its local part of the answer, along with the answers received from nodes deeper down the tree, to its “parent” node
  Similar to gossip-based
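The convergecast phase can be pictured as a bottom-up sum over the tree built during the broadcast. The following is a minimal sketch under the assumption that the tree is already known and given as a parent-to-children mapping; `convergecast_count`, `tree`, and `local_counts` are illustrative names, not part of the presented work.

```python
def convergecast_count(tree, local_counts, node):
    """Return the count for the subtree rooted at `node`.

    tree: dict mapping a node to the list of its children (hypothetical layout).
    local_counts: dict mapping a node to its local part of the answer.
    """
    # A node adds its own local answer to the answers reported by its children.
    return local_counts[node] + sum(
        convergecast_count(tree, local_counts, child)
        for child in tree.get(node, [])
    )


# Tree created by the broadcast phase: "q" is the querying node.
tree = {"q": ["a", "b"], "a": ["c"]}
local_counts = {"q": 1, "a": 2, "b": 3, "c": 4}
total = convergecast_count(tree, local_counts, "q")
```

Every node in the tree is visited once, so the number of contacted nodes grows with the size of the network.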

Page 11: Anthony Okorodudu CSE 6392 2006-4-25


Sampling-based
  Estimate the value of the counter by selectively querying a set of nodes in the network
  Sampling-based techniques suffer from accuracy issues
  Larger samples lead to higher accuracy, but more nodes need to be contacted
  Sampling-based techniques are usually duplicate-sensitive
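A small sketch of why naive sampling is duplicate-sensitive: it scales up the local item counts of the sampled nodes, so an item stored on several nodes is counted several times. The function `sampled_count` and the data layout are assumptions made for illustration only.

```python
import random


def sampled_count(node_items, sample_size, rng):
    """Estimate the total item count by scaling up a random sample of nodes."""
    sample = rng.sample(node_items, sample_size)
    local_total = sum(len(items) for items in sample)
    # Scale the sampled total to the whole network.
    return local_total * len(node_items) / sample_size


# Four nodes sharing only three distinct items overall.
nodes = [["a", "b"], ["b", "c"], ["a", "c"], ["a", "b"]]
estimate = sampled_count(nodes, sample_size=2, rng=random.Random(1))
distinct = len({item for items in nodes for item in items})
```

Even when all four nodes are sampled, the estimate is the total of 8 stored copies, not the 3 distinct items.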

Page 12: Anthony Okorodudu CSE 6392 2006-4-25


Distributed Hash Tables (DHT)
  Family of structured P2P network overlays exposing a hash-table-like interface
    1. insert(key, value)
    2. lookup(key)
  Highly efficient for point queries
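To make the two-call interface concrete, here is a toy, single-process sketch in which keys and nodes are hashed onto one identifier ring and each key is stored at the first node whose identifier is greater than or equal to the key's. The class `ToyDHT`, the helper `ring_id`, and the 32-bit ring size are assumptions for illustration; a real DHT routes between machines in multiple hops, which this sketch does not model.

```python
import bisect
import hashlib


def ring_id(name, bits=32):
    """Hash a string onto a ring of 2**bits identifiers."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


class ToyDHT:
    def __init__(self, node_names):
        self.ring = sorted((ring_id(n), n) for n in node_names)
        self.ids = [node_id for node_id, _ in self.ring]
        self.store = {n: {} for n in node_names}  # per-node local storage

    def _owner(self, key):
        # First node clockwise from the key's position, wrapping around.
        pos = bisect.bisect_left(self.ids, ring_id(key)) % len(self.ring)
        return self.ring[pos][1]

    def insert(self, key, value):
        self.store[self._owner(key)][key] = value

    def lookup(self, key):
        return self.store[self._owner(key)].get(key)


dht = ToyDHT(["node-%d" % i for i in range(8)])
dht.insert("song.mp3", "peer-17")
owner_value = dht.lookup("song.mp3")
```

A point query touches exactly one responsible node, which is why this interface suits exact-match lookups but offers no direct way to count items across the network.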

Page 13: Anthony Okorodudu CSE 6392 2006-4-25


Hash Sketches
  First proposed as a means of estimating the cardinality of a multiset in a database
  Used in many application domains for counting distinct elements in multi-sets
    Approximate query answering in very large DBs, data mining on the internet graph, stream processing

Page 14: Anthony Okorodudu CSE 6392 2006-4-25


Hash Sketches (continued)
  PCSA (Probabilistic Counting with Stochastic Averaging) algorithm assumes the existence of a pseudo-uniform hash function
  Super-LogLog algorithm relaxes the pseudo-uniform hash function constraints of PCSA
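For readers unfamiliar with PCSA, the following is a minimal centralized sketch, not the distributed version discussed later. It assumes SHA-256 as a stand-in for the pseudo-uniform hash function, and the names `PCSASketch` and `hash64`, the 256 bitmaps, and the 32-bit width are all illustrative choices. Each item selects one bitmap and sets the bit at the position of the lowest 1 bit in the rest of its hash; the estimate is derived from the average length of the run of 1 bits at the low end of each bitmap.

```python
import hashlib

PHI = 0.77351  # correction constant from Flajolet and Martin's analysis


def hash64(item):
    """Map an item to a 64-bit integer (SHA-256 is an assumed stand-in)."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class PCSASketch:
    def __init__(self, num_bitmaps=256, width=32):
        self.m = num_bitmaps
        self.width = width
        self.bitmaps = [0] * num_bitmaps  # each int is one bitmap vector

    def add(self, item):
        h = hash64(item)
        index = h % self.m   # which bitmap receives this item
        rest = h // self.m   # remaining hash bits
        if rest == 0:
            pos = self.width - 1
        else:
            # Position of the least-significant 1 bit, capped at the width.
            pos = min((rest & -rest).bit_length() - 1, self.width - 1)
        self.bitmaps[index] |= 1 << pos

    def merge(self, other):
        """Sketch of the union of two multisets: bitwise OR of the bitmaps."""
        merged = PCSASketch(self.m, self.width)
        merged.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]
        return merged

    def estimate(self):
        total = 0
        for bitmap in self.bitmaps:
            r = 0  # length of the run of 1 bits starting at bit 0
            while r < self.width and (bitmap >> r) & 1:
                r += 1
            total += r
        return self.m / PHI * 2 ** (total / self.m)


sketch = PCSASketch()
for i in range(20000):
    sketch.add(i)
    sketch.add(i)  # re-adding an item sets the same bit again
approx_distinct = sketch.estimate()
```

Because re-adding an item only sets a bit that is already set, the sketch is duplicate-insensitive, and because two sketches combine with a bitwise OR, partial sketches held by different nodes can be merged without double counting.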

Page 15: Anthony Okorodudu CSE 6392 2006-4-25


Distributed Hash Sketches (DHS)
  Fully decentralized, scalable, and efficient mechanism capable of providing estimates on the cardinality of multi-sets
  Satisfies all the central goals
  Implemented using PCSA (DHS-PCSA) or super-LogLog (DHS-sLL) hash sketches

Page 16: Anthony Okorodudu CSE 6392 2006-4-25


DHS
  O(log N) cost to insert an object in an N-node DHS
  O(b * log N) bandwidth consumption if the size of the data is b bytes
  Data items are deleted if not updated within their time-to-live, so deleting an item incurs no extra cost
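The time-to-live rule is a soft-state design: an entry survives only while it keeps being re-inserted. A minimal sketch of one node's local storage is shown below; the class `SoftStateStore` and its methods are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class SoftStateStore:
    """Per-node storage where entries vanish unless refreshed within `ttl`."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.expiry = {}  # key -> time at which the entry expires

    def put(self, key, now):
        # Inserting and refreshing are the same operation.
        self.expiry[key] = now + self.ttl

    def live_keys(self, now):
        # Drop everything whose time-to-live has run out, then report the rest.
        self.expiry = {k: t for k, t in self.expiry.items() if t > now}
        return set(self.expiry)


store = SoftStateStore(ttl=10)
store.put("bit-3", now=0)
store.put("bit-7", now=5)
still_live = store.live_keys(now=12)  # "bit-3" was never refreshed
```

No delete message is ever sent; an owner that stops refreshing an item simply lets it age out, which is why deletion adds no cost.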

Page 17: Anthony Okorodudu CSE 6392 2006-4-25


DHS (continued)
  Accuracy of hash sketches increases with multiple bitmap vectors
  Either the PCSA or the super-LogLog algorithm is applied for counting
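To put rough numbers on the accuracy claim, the sketch below uses the standard-error figures commonly quoted in the hash-sketch literature, about 0.78/sqrt(m) for PCSA and 1.05/sqrt(m) for super-LogLog with m bitmap vectors. These constants are not stated on the slides and are included here as an assumption; the function name `standard_error` is illustrative.

```python
import math

# Commonly quoted constants (assumed here, not taken from the slides).
STD_ERROR_CONSTANT = {"pcsa": 0.78, "sll": 1.05}


def standard_error(num_bitmaps, algorithm="pcsa"):
    """Approximate relative standard error for `num_bitmaps` bitmap vectors."""
    return STD_ERROR_CONSTANT[algorithm] / math.sqrt(num_bitmaps)


for m in (16, 64, 256, 1024):
    print(m, round(standard_error(m, "pcsa"), 4), round(standard_error(m, "sll"), 4))
```

Under these figures, quadrupling the number of bitmap vectors halves the expected error.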

Page 18: Anthony Okorodudu CSE 6392 2006-4-25


Counting with DHS

Page 19: Anthony Okorodudu CSE 6392 2006-4-25


Conclusion
  Distributed Hash Sketches is a fully decentralized, scalable, and efficient mechanism for providing estimates on the cardinality of multi-sets in internet-scale information systems
  DHS is implemented using either PCSA or super-LogLog hash sketches
  DHS histograms can introduce great performance savings during query optimization

Page 20: Anthony Okorodudu CSE 6392 2006-4-25


References
  N. Ntarmos, P. Triantafillou, and G. Weikum. Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks. ICDE 2006.

Page 21: Anthony Okorodudu CSE 6392 2006-4-25


Thanks