anthony okorodudu cse 6392 2006-4-25
DESCRIPTION
Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks By Nikon Ntarmos, Peter Triantafillou, and Gerhard Weikum. Anthony Okorodudu CSE 6392 2006-4-25. Outline. Introduction Motivation Related Work Distributed Hash Tables (DHT) Hash Sketches - PowerPoint PPT PresentationTRANSCRIPT
Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data NetworksBy Nikon Ntarmos, Peter Triantafillou, and Gerhard Weikum
Anthony OkoroduduCSE 63922006-4-25
2006/4/25 Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks
2
Outline Introduction Motivation Related Work Distributed Hash Tables (DHT) Hash Sketches Distributed Hash Sketches (DHS) Counting with DHS Conclusion
2006/4/25 Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks
3
Introduction Peer-to-peer (P2P) started as a way of
sharing files/CPU cycles among end-users Evolved into cutting networks of today
Distributed Hash Tables (DHT) made this feasible Probabilistic guarantees for degree of
efficiency, fault tolerance, and availability Data management systems of huge scale
2006/4/25 Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks
4
Motivation Need for distributed counting
mechanisms File-sharing P2P systems: total
number of documents shared by users Sensor networks: compute aggregates
in a duplicate-insensitive manner Internet-scale DB system: build
histograms for query access plans
2006/4/25 Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks
5
Central Goals
1. Efficiency: number of nodes contacted for counting must be small
2. Scalability and availability: large numbers of nodes may need to add elements to a (multi-) set
3. Access and storage load balancing: counting and related overheads should be fairly distributed across all nodes
2006/4/25 Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks
6
Central Goals (continued)4. Accuracy: tunable, robust, and
highly accurate cardinality estimation
5. Simplicity and ease of integration: special, solution-based indexing structures should be avoided
6. Duplicate (in)sensitivity: count total number of items as well as the number of unique items in multi-sets
2006/4/25 Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks
7
Distributed Counting Protocols One-node-per-counter protocols Gossip-based protocols Broadcast/convergecast-type
protocols Sampling-based protocols
2006/4/25 Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks
8
One-node-per-counter Select a node in the overlay of the
DHT and use it to maintain counter value
Poor scalability Resembles a centralized system
2006/4/25 Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks
9
Gossip-based Provide weak probabilistic semantics
of “eventual consistency” for outcome
Every node exchanges information with a set of nodes
Low bandwidth Not efficient in terms of number of
nodes to be contacted Low accuracy
2006/4/25 Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks
10
Broadcast/Convergecast-type1. Broadcast phase
Querying node broadcasts query through network, creating tree of nodes as query propagates the overlay
2. Convergecast phase Node sends its local part of the answer
along with answers received from nodes deeper down the tree to “parent” node
Similar to gossip-based
2006/4/25 Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks
11
Sampling-based Estimate the value of the counter by
selectively querying a set of nodes in the network
Sampling based techniques suffer from accuracy issues
Large samples lead to higher accuracy but more nodes need to be contacted
Sampling based techniques are usually duplicate-sensitive
2006/4/25 Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks
12
Distributed Hash Tables (DHT) Family of structured P2P network
overlays exposing hash-table like interface
1. insert(key, value)2. lookup(key)
Highly efficient for point queries
2006/4/25 Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks
13
Hash Sketches First proposed as a means of
estimating the cardinality of a multiset in a database
Used in many application domains for counting distinct elements in multi-sets Approximate query answering in very
large DBs, data mining on the internet graph, stream processing
2006/4/25 Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks
14
Hash Sketches (continued) PCSA (Probabilistic Counting with
Stochastic Averaging) algorithm assumes of a pseudo-uniform hash function
Super-LogLog algorithm relaxes pseudo-uniform hash function constraints of PCSA
2006/4/25 Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks
15
Distributed Hash Sketches (DHS) Fully decentralized, scalable, and
efficient mechanism capable of providing estimates on the cardinality of multi-sets
Satisfy all the central goals Implemented using PCSA (DHS-
PCSA) or super-LogLog (DHS-sLL) hash sketches
2006/4/25 Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks
16
DHS O(log N) cost to insert object in an N-
node DHS O(b * log N) bandwidth consumption
if size of data is b bytes Data items are deleted if not updated
within time-to-live so deleting an item incurs no extra cost
2006/4/25 Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks
17
DHS (continued) Accuracy of hash sketches
increases with multiple bitmap vectors
Either PCSA or super-LogLog algorithm is applied for counting
2006/4/25 Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks
18
Counting with DHS
2006/4/25 Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks
19
Conclusion Distributed Hash Sketches is a fully
decentralized, scalable, and efficient mechanism for providing estimates on the cardinality of multi-sets in internet-scale information systems
DHS implemented using either PCSA or the super-LogLog hash sketches
DHS histograms can introduce great performance savings during query optimization
2006/4/25 Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks
20
References N. Ntarmos, P. Triantafillou, and G.
Weikum. Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks. ICDE 2006.
2006/4/25 Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks
21
Thanks