hashing, random projections, and data streaming
8/14/2019 Hashing, random projections, and data streaming
1/17
Hashing, random projections, and data streaming
Ioana Cosma
Statistical Laboratory
University of Cambridge
September 28, 2009
Ioana Cosma Hashing, random projections, and data streaming
Data streaming: definition and problem
Streaming data
Transiently observed sequence of data elements that arrive unordered, with repetitions, and at a very high rate of transmission.
Problem
Estimation of summary statistics over streaming data with fast processing of data elements and modest storage requirements:
1. cardinality, $l_\alpha$ distances (quasi-distances), entropy;
2. quantiles, histograms, and other measures of distributional dissimilarity.
Consider a data stream $S_T$ of length $T$ with elements of the form
$$(i_t, d_t), \quad t = 1, \ldots, T,$$
where the item type $i_t$ belongs to a countable, possibly infinite set $D = \{c_1, \ldots, c_N\}$, and the associated quantity $d_t$ may be positive or negative.
The accumulation vector of $S_T$ at stage $T$ is $a_T = (a_1, a_2, \ldots)$, where
$$a_j = \sum_{t=1}^{T} d_t \, I(i_t = c_j), \quad j = 1, \ldots, N,$$
is the cumulative quantity of elements of type $c_j$ at stage $T$.
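As a point of reference, the accumulation vector can be maintained exactly with one counter per item type. The following is a minimal Python sketch, assuming a small stream that fits in memory; the function name `accumulate` and the example stream are illustrative and do not come from the talk.

```python
from collections import defaultdict


def accumulate(stream):
    """Return the accumulation vector as a dict {item type: cumulative quantity}."""
    a = defaultdict(float)
    for item, quantity in stream:
        # a_j <- a_j + d_t whenever i_t = c_j
        a[item] += quantity
    return dict(a)


# Hypothetical stream of (i_t, d_t) pairs; quantities may be negative.
stream = [("a", 3), ("b", 5), ("a", -1), ("c", 2), ("b", -5)]
a_T = accumulate(stream)
print(a_T)
```

Note that item type "b" ends with cumulative quantity zero, so it no longer contributes to the cardinality even though it was observed.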
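Computing summary statistics over streaming data
1. The $l_\alpha$ norm, defined for $\alpha \geq 1$:
$$l_\alpha(a_T) = \Big( \sum_{j \in D} |a_j|^\alpha \Big)^{1/\alpha}.$$
The cardinality is $c := \lim_{\alpha \to 0} l_\alpha(a_T)^\alpha$.
2. The $l_\alpha$ distance between streams $a_{T_1}$ and $b_{T_2}$ over $D$, for $\alpha \geq 1$:
$$d_\alpha(a_{T_1}, b_{T_2}) = l_\alpha(a_{T_1} - b_{T_2}).$$
3. Let $p_j = a_j / \sum_{d \in D} a_d$. The empirical Shannon entropy of $p = (p_1, p_2, \ldots)$:
$$H(p) = -\sum_{j \in D,\, p_j \neq 0} p_j \log p_j.$$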
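For comparison with the sketch-based estimators, these statistics can be computed exactly when the accumulation vector is small enough to hold in memory. The Python sketch below assumes the vector is stored as a dict; all function names are illustrative, and `entropy` additionally assumes non-negative cumulative quantities so that the $p_j$ form a probability vector.

```python
import math


def power_sum(a, alpha):
    """Sum of |a_j|^alpha over the stored item types."""
    return sum(abs(v) ** alpha for v in a.values())


def l_alpha(a, alpha):
    """l_alpha norm (a quasi-norm when alpha < 1)."""
    return power_sum(a, alpha) ** (1.0 / alpha)


def cardinality(a):
    """Number of item types with non-zero cumulative quantity."""
    return sum(1 for v in a.values() if v != 0)


def distance(a, b, alpha):
    """l_alpha distance between two accumulation vectors over the same D."""
    diff = {j: a.get(j, 0) - b.get(j, 0) for j in set(a) | set(b)}
    return l_alpha(diff, alpha)


def entropy(a):
    """Empirical Shannon entropy; assumes all cumulative quantities are >= 0."""
    total = sum(a.values())
    return -sum((v / total) * math.log(v / total) for v in a.values() if v != 0)


# Hypothetical accumulation vectors.
a = {"a": 2, "b": 0, "c": 2}
b = {"a": 1, "c": 5}
print(l_alpha(a, 1), cardinality(a), distance(a, b, 1), entropy(a))
```

Evaluating `power_sum(a, alpha)` for a small `alpha` illustrates the limit that defines the cardinality: each non-zero term tends to 1 and each zero term stays at 0.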
Computational complexity
Exact computation of functions of the accumulation vector requires maintaining the counter $a_j$ for each $j \in D$, updating it whenever $i_t = c_j$ by $a_j + d_t \to a_j$.
The associated storage cost of $O(c)$ is prohibitively large, so the vector $(a_1, a_2, \ldots)$ cannot be stored in main computer memory or on disk for fast and efficient access.
Data sketching via random projections
Process the data stream in a one-pass algorithm that retains sufficient information in the form of a low-dimensional vector of random projections.
The algorithm is fast, has small space requirements, and stores sufficient information for efficient estimation of these summary statistics.
Random projections, hashing, and estimation
The stable distribution
$X \sim F$ is stable if for every $n > 0$ and $X_1, \ldots, X_n \sim F$ i.i.d., there exist constants $a_n > 0$ and $b_n$ such that
$$X_1 + \ldots + X_n \overset{D}{=} a_n X + b_n.$$
The norming constant is $a_n = n^{1/\alpha}$, where $0 < \alpha \leq 2$ is the index of stability. If $b_n = 0$, then $X$ is said to be strictly stable.
Theorem on linear combinations
For constants $a_1, a_2 > 0$, if $X_1, X_2 \sim F$ are strictly stable of index $\alpha$ and independent, then
$$a_1 X_1 + a_2 X_2 \overset{D}{=} \big( |a_1|^\alpha + |a_2|^\alpha \big)^{1/\alpha} X, \quad X \sim F. \quad (1)$$
If, in addition, the distribution is symmetric, then (1) holds for all $a_1, a_2 \in \mathbb{R}$.
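Relation (1) can be checked numerically for the two symmetric stable laws that are easy to simulate with the standard library: the Cauchy distribution (index 1) and the Gaussian distribution (index 2). The Python sketch below is only an illustration under these assumptions; the constants, the sample size, and the helper names are arbitrary choices, and the scale is read off from the sample median of absolute values (Cauchy) and the sample standard deviation (Gaussian).

```python
import math
import random
import statistics


def sample_cauchy(rng):
    """Standard Cauchy draw by inverting the CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))


def combined_scale(a1, a2, alpha):
    """Scale factor on the right-hand side of relation (1)."""
    return (abs(a1) ** alpha + abs(a2) ** alpha) ** (1.0 / alpha)


rng = random.Random(0)
a1, a2 = 3.0, -4.0  # a negative constant is allowed because both laws are symmetric
n = 100_000

# Index 1: the median of |X| equals the scale of a centred Cauchy variable.
cauchy_mix = [a1 * sample_cauchy(rng) + a2 * sample_cauchy(rng) for _ in range(n)]
cauchy_scale = statistics.median(abs(x) for x in cauchy_mix)

# Index 2: the standard deviation scales by the same factor as the variable.
gauss_mix = [a1 * rng.gauss(0, 1) + a2 * rng.gauss(0, 1) for _ in range(n)]
gauss_scale = statistics.pstdev(gauss_mix)

print(cauchy_scale, combined_scale(a1, a2, 1.0))
print(gauss_scale, combined_scale(a1, a2, 2.0))
```

The empirical scales should be close to $|a_1| + |a_2|$ for the Cauchy case and to $\sqrt{a_1^2 + a_2^2}$ for the Gaussian case, up to Monte Carlo error.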
The method of random projections
Each element type $c_j$ in the data stream can be transformed to a distinct random variable $R(c_j) \sim F$, to adequate approximation, via a pseudo-random number generator as follows:
1. Hash $c_j$ to an integer (or vector of integers) via a deterministic hash function with low collision probability.
2. Use these integers to seed a pseudo-random number generator.
3. Use the seeded generator to simulate a sequence of independent random variables with distribution $F$; set $R(c_j)$ to the value at a fixed position in the sequence.
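The three steps above can be sketched in Python as follows. This is an illustration rather than the talk's implementation: SHA-256 as the hash function, Python's Mersenne Twister as the generator, and the Chambers-Mallows-Stuck formula for simulating symmetric stable variables are all assumptions made here, and the function names are hypothetical.

```python
import hashlib
import math
import random


def seed_from_item(item):
    """Step 1: hash the item type to an integer with a deterministic hash."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def symmetric_stable(rng, alpha):
    """One draw from a symmetric alpha-stable law (Chambers-Mallows-Stuck)."""
    v = math.pi * (rng.random() - 0.5)  # uniform on (-pi/2, pi/2)
    if alpha == 1.0:
        return math.tan(v)  # Cauchy case
    w = rng.expovariate(1.0)  # unit-rate exponential
    return (math.sin(alpha * v) / math.cos(v) ** (1.0 / alpha)) * (
        math.cos((1.0 - alpha) * v) / w
    ) ** ((1.0 - alpha) / alpha)


def projection(item, alpha, position=0):
    """Steps 2 and 3: seed the generator and return the draw at `position`."""
    rng = random.Random(seed_from_item(item))
    value = None
    for _ in range(position + 1):
        value = symmetric_stable(rng, alpha)
    return value


# The same item type always maps to the same value, so R(c_j) never has to be stored.
print(projection("item-42", 1.5), projection("item-42", 1.5))
```

Because the value is regenerated from the item type on demand, the memory cost is independent of the number of distinct item types.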
Processing and estimation
The projection is then accumulated online as
$$\sum_{t=1}^{T} d_t R(i_t) = \sum_{j=1}^{N} a_j R(c_j),$$
and the process is repeated independently a further $k - 1$ times to form a $k$-dimensional data sketch.
If $R(c_j) \sim F$ is strictly stable and symmetric of index $\alpha$, then
$$\sum_{j=1}^{N} a_j R(c_j) \overset{D}{=} \Big( \sum_{j=1}^{N} |a_j|^\alpha \Big)^{1/\alpha} R,$$
where $R \sim F$. The problem reduces to that of estimating a scale parameter in an observed sample of size $k$ from the $\alpha$-stable distribution.
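An end-to-end illustration for index 1 (Cauchy projections) might look like the Python sketch below. It is a sketch under stated assumptions, not the estimators proposed in the talk: the scale is estimated with the sample median of absolute sketch entries, a simple stand-in that works because the median of $|R|$ is 1 for a standard Cauchy variable. The hash, the generator, the value of `k`, and the example stream are likewise illustrative.

```python
import hashlib
import math
import random
import statistics


def cauchy_projections(item, k):
    """k pseudo-random Cauchy values determined entirely by the item type."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(k)]


def update_sketch(sketch, item, quantity):
    """Online update: add d_t * R_m(i_t) to each of the k sketch entries."""
    for m, r in enumerate(cauchy_projections(item, len(sketch))):
        sketch[m] += quantity * r


def estimate_l1(sketch):
    """Sample-median estimate of the Cauchy scale, i.e. of the l_1 norm."""
    return statistics.median(abs(s) for s in sketch)


# Hypothetical stream; the exact accumulation vector is a=2, b=6, c=2.
stream = [("a", 3), ("b", 5), ("a", -1), ("c", 2), ("b", 1)]
k = 4001
sketch = [0.0] * k
for item, quantity in stream:
    update_sketch(sketch, item, quantity)

exact_l1 = 10.0
estimate = estimate_l1(sketch)
print(exact_l1, estimate)
```

Only the `k` sketch entries are stored, regardless of how many distinct item types appear, and the estimate tightens as `k` grows.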
Contributions
We exploit properties of the $\alpha$-stable distribution and the idea of data sketching via random projections to propose efficient estimators of cardinality, $l_\alpha$ distances, and entropy over streaming data.
The proposed algorithms have fast running time and small storage requirements.
The resulting estimators are asymptotically efficient (or nearly so) with good small-sample performance (shown via simulations), recursively computable (for cardinality estimation), and have tail bounds that decrease exponentially with $k$ (for cardinality and entropy estimation, resulting in space complexity bounds on the size of the data sketch).
Future work
Shannon entropy as a characterisation of a data stream, applied in developing convergence diagnostics for monitoring extensive MCMC simulations in Bayesian inference problems.
Data sketching via random projections as a tool for dimension reduction in computationally expensive simulation algorithms such as particle filtering and particle MCMC.
References
http://www.statslab.cam.ac.uk/ioana