anthony okorodudu cse 6392 2006-4-25

Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks
By Nikos Ntarmos, Peter Triantafillou, and Gerhard Weikum

Anthony Okorodudu
CSE 6392
2006-4-25



Outline
- Introduction
- Motivation
- Related Work
- Distributed Hash Tables (DHT)
- Hash Sketches
- Distributed Hash Sketches (DHS)
- Counting with DHS
- Conclusion



Introduction
- Peer-to-peer (P2P) started as a way of sharing files/CPU cycles among end-users
- Evolved into the cutting-edge networks of today
- Distributed Hash Tables (DHT) made this feasible
- Probabilistic guarantees for degree of efficiency, fault tolerance, and availability
- Data management systems of huge scale



Motivation
- Need for distributed counting mechanisms
  - File-sharing P2P systems: total number of documents shared by users
  - Sensor networks: compute aggregates in a duplicate-insensitive manner
  - Internet-scale DB system: build histograms for query access plans



Central Goals

1. Efficiency: number of nodes contacted for counting must be small

2. Scalability and availability: large numbers of nodes may need to add elements to a (multi-)set

3. Access and storage load balancing: counting and related overheads should be fairly distributed across all nodes



Central Goals (continued)

4. Accuracy: tunable, robust, and highly accurate cardinality estimation

5. Simplicity and ease of integration: special, solution-based indexing structures should be avoided

6. Duplicate (in)sensitivity: count total number of items as well as the number of unique items in multi-sets
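
To make goal 6 concrete, here is a minimal sketch of the two counts on a single multiset. The file names are hypothetical and only serve as illustration.

```python
from collections import Counter

# Hypothetical multiset: the same document can be shared by several peers.
shared = ["a.mp3", "b.pdf", "a.mp3", "c.txt", "a.mp3"]

multiset = Counter(shared)
total_items = sum(multiset.values())   # duplicate-sensitive: every copy counts
unique_items = len(multiset)           # duplicate-insensitive: distinct items only
```

Here the duplicate-sensitive count is 5 and the duplicate-insensitive count is 3.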



Distributed Counting Protocols
- One-node-per-counter protocols
- Gossip-based protocols
- Broadcast/convergecast-type protocols
- Sampling-based protocols



One-node-per-counter
- Select a node in the overlay of the DHT and use it to maintain the counter value
- Poor scalability
- Resembles a centralized system
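
The scalability problem can be illustrated with a small simulation. This is only a sketch: the `owner` function and the modulo placement are assumptions made for illustration, not a scheme taken from the paper.

```python
import hashlib

def owner(counter_name, num_nodes):
    """Map a counter name to the single node that maintains it (hypothetical scheme)."""
    digest = hashlib.sha1(counter_name.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

num_nodes = 16
requests_per_node = [0] * num_nodes
counter_value = 0

# 1000 hypothetical peers each report one new document.
for _ in range(1000):
    requests_per_node[owner("documents", num_nodes)] += 1
    counter_value += 1
```

All 1000 updates land on the same node while the other 15 stay idle, which is the centralized behavior the slide warns about.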



Gossip-based
- Provide weak probabilistic semantics of “eventual consistency” for outcome
- Every node exchanges information with a set of nodes
- Low bandwidth
- Not efficient in terms of number of nodes to be contacted
- Low accuracy
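
As a rough illustration of the gossip idea, the sketch below uses pairwise averaging to estimate the network size. This is a commonly described gossip technique swapped in for illustration; the slides do not say which gossip protocol is meant. It assumes an idealized network with no failures and uniformly random peer selection.

```python
import random

def gossip_average(values, rounds, rng):
    """Pairwise averaging: each round, every node averages with one random peer."""
    values = list(values)
    n = len(values)
    exchanges = 0
    for _ in range(rounds):
        for i in range(n):
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1  # pick a peer other than i
            mean = (values[i] + values[j]) / 2
            values[i] = values[j] = mean
            exchanges += 1
    return values, exchanges

# Hypothetical 64-node network: one node starts at 1.0, the rest at 0.0,
# so every value drifts towards 1/N and 1/value estimates N.
num_nodes = 64
initial = [1.0] + [0.0] * (num_nodes - 1)
values, exchanges = gossip_average(initial, rounds=30, rng=random.Random(0))
size_estimates = [1 / v for v in values]
```

Even in this idealized setting the 64 nodes perform 1920 exchanges in total, which illustrates why the number of nodes contacted is the weak point of this family.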



Broadcast/Convergecast-type
1. Broadcast phase
   - Querying node broadcasts query through network, creating a tree of nodes as the query propagates through the overlay
2. Convergecast phase
   - Node sends its local part of the answer, along with answers received from nodes deeper down the tree, to its “parent” node
- Similar to gossip-based
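
The two phases can be sketched as a breadth-first flood followed by a bottom-up sum. The overlay graph, node names, and local counts below are hypothetical.

```python
from collections import deque

def broadcast(overlay, root):
    """Broadcast phase: flood from root; the first sender becomes the parent."""
    parent = {root: None}
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in overlay[node]:
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return parent, order

def convergecast(parent, order, local_counts):
    """Convergecast phase: each node adds its partial sum to its parent's."""
    partial = dict(local_counts)
    for node in reversed(order):  # deepest nodes report first
        if parent[node] is not None:
            partial[parent[node]] += partial[node]
    return partial

# Hypothetical five-node overlay and per-node item counts.
overlay = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
local_counts = {"A": 3, "B": 1, "C": 4, "D": 1, "E": 5}

parent, order = broadcast(overlay, "A")
total = convergecast(parent, order, local_counts)["A"]
```

The querying node "A" ends up with the exact sum, but only after every node in the overlay has been contacted.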



Sampling-based
- Estimate the value of the counter by selectively querying a set of nodes in the network
- Sampling-based techniques suffer from accuracy issues
- Large samples lead to higher accuracy but more nodes need to be contacted
- Sampling-based techniques are usually duplicate-sensitive
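
A minimal sketch of the trade-off, assuming a hypothetical 100-node network and a simple scale-up estimator (count what the sampled nodes hold, then multiply by the inverse sampling fraction):

```python
import random

def sample_estimate(node_items, sample_size, rng):
    """Scale the item count seen on a random sample of nodes up to the whole network."""
    sampled = rng.sample(sorted(node_items), sample_size)
    seen = sum(len(node_items[node]) for node in sampled)
    return seen * len(node_items) / sample_size

# Hypothetical network: every node shares "popular"; every tenth node also shares a rare item.
node_items = {
    node: ["popular"] + ([f"rare-{node}"] if node % 10 == 0 else [])
    for node in range(100)
}
distinct_items = len({item for items in node_items.values() for item in items})

rng = random.Random(1)
small_sample = sample_estimate(node_items, 10, rng)   # cheap, but varies with the sample
full_sample = sample_estimate(node_items, 100, rng)   # contacts every node
```

Even the full sample returns 110 (every copy of "popular" is counted) although only 11 distinct items exist, which is the duplicate-sensitivity noted above. The 10-node sample can land anywhere between 100 and 200.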



Distributed Hash Tables (DHT)
- Family of structured P2P network overlays exposing a hash-table-like interface
  1. insert(key, value)
  2. lookup(key)
- Highly efficient for point queries
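
The interface can be sketched with a toy, single-process ring. `ToyDHT`, `ring_id`, and `responsible_node` are hypothetical names; a real DHT would route each request across the overlay rather than resolve it locally as done here.

```python
import bisect
import hashlib

def ring_id(text):
    """Hash a node name or key onto the identifier ring."""
    return int.from_bytes(hashlib.sha1(text.encode()).digest(), "big")

class ToyDHT:
    def __init__(self, node_names):
        self.ring = sorted((ring_id(name), name) for name in node_names)
        self.ids = [node_id for node_id, _ in self.ring]
        self.storage = {name: {} for name in node_names}

    def responsible_node(self, key):
        # First node clockwise from the key's position, wrapping around the ring.
        index = bisect.bisect_left(self.ids, ring_id(key)) % len(self.ring)
        return self.ring[index][1]

    def insert(self, key, value):
        self.storage[self.responsible_node(key)][key] = value

    def lookup(self, key):
        return self.storage[self.responsible_node(key)].get(key)

dht = ToyDHT([f"node-{i}" for i in range(8)])
dht.insert("song.mp3", "peer-17")
```

A point query such as `dht.lookup("song.mp3")` touches exactly one node, which is why this interface suits point queries but offers no direct way to count across all nodes.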



Hash Sketches
- First proposed as a means of estimating the cardinality of a multiset in a database
- Used in many application domains for counting distinct elements in multi-sets
  - Approximate query answering in very large DBs, data mining on the internet graph, stream processing
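
A minimal single-bitmap sketch in the Flajolet-Martin style shows why hash sketches ignore duplicates. The use of SHA-1, the 32-bit bitmap, and the helper names are assumptions for illustration.

```python
import hashlib

BITMAP_BITS = 32
PHI = 0.77351  # Flajolet-Martin correction constant

def rho(value):
    """0-based position of the least-significant 1-bit."""
    if value == 0:
        return BITMAP_BITS - 1
    return (value & -value).bit_length() - 1

def item_hash(item):
    return int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")

def sketch(items):
    """Set bit rho(hash(item)) for every item; repeated items set the same bit."""
    bitmap = 0
    for item in items:
        position = min(rho(item_hash(item)), BITMAP_BITS - 1)
        bitmap |= 1 << position
    return bitmap

def estimate(bitmap):
    """Estimate the distinct count from the position of the lowest 0-bit."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return (2 ** r) / PHI

docs = ["a.mp3", "b.pdf", "c.txt"]
more_docs = ["c.txt", "d.doc"]
```

Inserting an item twice sets the same bit, so `sketch(docs)` equals `sketch(docs * 3)`, and the sketch of a union is the bitwise OR of the individual sketches. A single bitmap gives only a coarse estimate.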



Hash Sketches (continued)
- PCSA (Probabilistic Counting with Stochastic Averaging) algorithm assumes the existence of a pseudo-uniform hash function
- Super-LogLog algorithm relaxes the pseudo-uniform hash function constraints of PCSA
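
A compact sketch of PCSA, assuming SHA-1 as a stand-in for the pseudo-uniform hash function and 64 bitmaps. Each item is routed to one bitmap by its hash, and the estimate averages the lowest-zero-bit positions across bitmaps. The function names and the item set are hypothetical.

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin correction constant

def pcsa_sketch(items, num_bitmaps=64, bits=32):
    bitmaps = [0] * num_bitmaps
    for item in items:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        index, rest = h % num_bitmaps, h // num_bitmaps  # stochastic averaging
        position = (rest & -rest).bit_length() - 1 if rest else bits - 1
        bitmaps[index] |= 1 << min(position, bits - 1)
    return bitmaps

def lowest_zero(bitmap):
    r = 0
    while bitmap & (1 << r):
        r += 1
    return r

def pcsa_estimate(bitmaps):
    mean_r = sum(lowest_zero(b) for b in bitmaps) / len(bitmaps)
    return len(bitmaps) / PHI * 2 ** mean_r

items = [f"item-{i}" for i in range(10_000)]
bitmaps = pcsa_sketch(items)
approx = pcsa_estimate(bitmaps)
```

With 64 bitmaps the expected standard error is roughly 10%, so `approx` should fall near 10,000. Feeding every item in twice leaves the bitmaps unchanged.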



Distributed Hash Sketches (DHS)
- Fully decentralized, scalable, and efficient mechanism capable of providing estimates on the cardinality of multi-sets
- Satisfies all the central goals
- Implemented using PCSA (DHS-PCSA) or super-LogLog (DHS-sLL) hash sketches



DHS
- O(log N) cost to insert an object in an N-node DHS
- O(b * log N) bandwidth consumption if the size of the data is b bytes
- Data items are deleted if not updated within time-to-live, so deleting an item incurs no extra cost
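
The time-to-live rule can be sketched as a soft-state store in which inserting and refreshing are the same operation. `SoftStateStore`, the key names, and the integer clock are hypothetical; the slides do not describe the actual DHS storage layout.

```python
class SoftStateStore:
    """Entries disappear unless refreshed within `ttl` time units."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.expiry = {}

    def insert(self, key, now):
        # Inserting an existing key simply pushes its expiry time forward.
        self.expiry[key] = now + self.ttl

    def live_keys(self, now):
        # Expired entries are dropped lazily; no delete message is ever sent.
        self.expiry = {k: t for k, t in self.expiry.items() if t > now}
        return set(self.expiry)

store = SoftStateStore(ttl=10)
store.insert("bit-3", now=0)
store.insert("bit-5", now=0)
store.insert("bit-3", now=8)   # refresh before the TTL runs out
alive_at_12 = store.live_keys(now=12)
alive_at_20 = store.live_keys(now=20)
```

Only the refreshed entry survives at time 12, and nothing survives at time 20, so removing an item costs no messages: its owner just stops refreshing it.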



DHS (continued)
- Accuracy of hash sketches increases with multiple bitmap vectors
- Either PCSA or super-LogLog algorithm is applied for counting
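
The effect of the number of vectors can be made concrete with the asymptotic standard errors usually quoted for the two estimators: about 0.78/sqrt(m) for PCSA and about 1.05/sqrt(m) for super-LogLog, where m is the number of vectors (buckets). These constants come from the hash-sketch literature rather than from the slides, and the helper names are hypothetical.

```python
import math

# Asymptotic standard-error constants (assumed from the hash-sketch literature).
PCSA_CONSTANT = 0.78
SUPER_LOGLOG_CONSTANT = 1.05

def standard_error(constant, num_vectors):
    return constant / math.sqrt(num_vectors)

def vectors_needed(constant, target_error):
    """Smallest vector count whose standard error does not exceed target_error."""
    m = max(1, math.ceil((constant / target_error) ** 2))
    while standard_error(constant, m) > target_error:
        m += 1
    return m

pcsa_error_64 = standard_error(PCSA_CONSTANT, 64)
pcsa_vectors_for_5_percent = vectors_needed(PCSA_CONSTANT, 0.05)
sll_vectors_for_5_percent = vectors_needed(SUPER_LOGLOG_CONSTANT, 0.05)
```

Quadrupling the number of vectors halves the expected error, so accuracy can be tuned by choosing m.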



Counting with DHS



Conclusion
- Distributed Hash Sketches is a fully decentralized, scalable, and efficient mechanism for providing estimates on the cardinality of multi-sets in internet-scale information systems
- DHS is implemented using either PCSA or super-LogLog hash sketches
- DHS histograms can introduce great performance savings during query optimization



References
- N. Ntarmos, P. Triantafillou, and G. Weikum. Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks. ICDE 2006.



Thanks
