hashing, random projections, and data streaming

Upload: senor-smiles

Post on 30-May-2018


  • 8/14/2019 Hashing, random projections, and data streaming


    Hashing, random projections, and data streaming

    Ioana Cosma

    Statistical Laboratory

    University of Cambridge

    September 28, 2009

    Ioana Cosma Hashing, random projections, and data streaming


    Data streaming: definition and problem

    Streaming data

    Transiently observed sequence of data elements that arrive unordered, with repetitions, and at a very high rate of transmission.

    Problem

    Estimation of summary statistics over streaming data with fast processing of data elements and modest storage requirements:

    1. cardinality, $l_\alpha$ distances (quasi-distances), entropy.

    2. quantiles, histograms, and other measures of distributional dissimilarity.


    Consider a data stream $S_T$ of length $T$ with elements of the form

    $$(i_t, d_t), \quad t = 1, \ldots, T,$$

    where the item type, $i_t$, belongs to a countable, possibly infinite set $D = \{c_1, \ldots, c_N\}$, and the associated quantity is $d_t$ (either positive or negative).

    The accumulation vector of $S_T$ at stage $T$ is $a_T = (a_1, a_2, \ldots)$, where

    $$a_j = \sum_{t=1}^{T} d_t \, I(i_t = c_j), \quad j = 1, \ldots, N,$$

    is the cumulative quantity of elements of type $c_j$ at stage $T$.
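As a minimal sketch of the definition above, the accumulation vector can be computed exactly by keeping one counter per item type. The function name `accumulate` and the toy stream are illustrative assumptions, not part of the talk.

```python
from collections import defaultdict


def accumulate(stream):
    """Exact accumulation vector: maps each item type c_j to a_j = sum of d_t with i_t = c_j."""
    a = defaultdict(int)
    for item, quantity in stream:   # one element (i_t, d_t) at a time
        a[item] += quantity         # quantities may be positive or negative
    return dict(a)


# Hypothetical toy stream of (item type, quantity) pairs.
stream = [("x", 3), ("y", 1), ("x", -2), ("z", 5), ("y", -1)]
a_T = accumulate(stream)
print(a_T)  # {'x': 1, 'y': 0, 'z': 5}
```

Note that item type `"y"` ends with cumulative quantity 0, so it would not count towards the cardinality.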


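    Computing summary statistics over streaming data

    1. The $l_\alpha$ norm, defined for $\alpha \ge 1$:

    $$l_\alpha(a_T) = \Big( \sum_{j \in D} |a_j|^\alpha \Big)^{1/\alpha}.$$

    The cardinality is $c := \lim_{\alpha \to 0} l_\alpha(a_T)^\alpha$.

    2. The $l_\alpha$ distance between streams $a_{T_1}$ and $b_{T_2}$ over $D$ for $\alpha \ge 1$:

    $$d_\alpha(a_{T_1}, b_{T_2}) = l_\alpha(a_{T_1} - b_{T_2}).$$

    3. Let $p_j = a_j / \sum_{d \in D} a_d$. The empirical Shannon entropy of $p = (p_1, p_2, \ldots)$:

    $$H(p) = -\sum_{j \in D,\, p_j \neq 0} p_j \log p_j.$$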
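For reference, the three summary statistics can be computed exactly when the accumulation vector fits in memory. This is a sketch under assumptions: the vector is held as a Python dict, the function names are hypothetical, and the entropy function assumes non-negative cumulative quantities.

```python
import math


def power_sum(a, alpha):
    """Sum over j of |a_j|^alpha, skipping zero entries; equals l_alpha(a)^alpha."""
    return sum(abs(v) ** alpha for v in a.values() if v != 0)


def l_alpha_norm(a, alpha):
    """l_alpha norm of the accumulation vector."""
    return power_sum(a, alpha) ** (1.0 / alpha)


def cardinality(a):
    """Number of item types with non-zero cumulative quantity (the alpha -> 0 limit)."""
    return sum(1 for v in a.values() if v != 0)


def l_alpha_distance(a, b, alpha):
    """l_alpha norm of the difference of two accumulation vectors over a common domain."""
    diff = {key: a.get(key, 0) - b.get(key, 0) for key in set(a) | set(b)}
    return l_alpha_norm(diff, alpha)


def shannon_entropy(a):
    """Empirical Shannon entropy (natural log); assumes all a_j >= 0."""
    total = sum(a.values())
    return -sum((v / total) * math.log(v / total) for v in a.values() if v != 0)


a = {"x": 1, "y": 0, "z": 3}
b = {"x": 1, "y": 2}
print(cardinality(a), l_alpha_norm(a, 1), l_alpha_distance(a, b, 1))
```

Evaluating `power_sum(a, alpha)` for a very small `alpha` gives a value close to `cardinality(a)`, which illustrates the limit in item 1.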


    Computational complexity

    Exact computation of functions of the accumulation vector requires maintaining the counter $a_j$ for each $j \in D$, updating it whenever $i_t = c_j$ by $a_j + d_t \to a_j$.

    The associated storage cost of $O(c)$ is prohibitively large, so the vector $(a_1, a_2, \ldots)$ cannot be stored in main computer memory or on disk for fast and efficient access.

    Data sketching via random projections

    Process the data stream in a one-pass algorithm that retains sufficient information in the form of a low-dimensional vector of random projections.

    The algorithm is fast, has small space requirements, and stores sufficient information for efficient estimation of these summary statistics.


    Random projections, hashing, and estimation

    The stable distribution

    $X \sim F$ is stable if for every $n > 0$ and $X_1, \ldots, X_n \sim F$ i.i.d., there exist constants $a_n > 0$ and $b_n$ such that

    $$X_1 + \cdots + X_n \overset{D}{=} a_n X + b_n.$$

    The norming constant is $a_n = n^{1/\alpha}$, where $0 < \alpha \le 2$ is the index of stability. If $b_n = 0$, then $X$ is said to be strictly stable.

    Theorem on linear combinations

    For constants $a_1, a_2 > 0$, if $X_1, X_2 \sim F$ are strictly stable of index $\alpha$ and independent, then

    $$a_1 X_1 + a_2 X_2 \overset{D}{=} \big( |a_1|^\alpha + |a_2|^\alpha \big)^{1/\alpha} X, \quad X \sim F. \qquad (1)$$

    If, in addition, the distribution is symmetric, then (1) holds for all $a_1, a_2 \in \mathbb{R}$.
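The theorem can be checked numerically in the special case $\alpha = 1$, where the symmetric strictly stable law is the standard Cauchy distribution. The sketch below is an illustration only: the constants, the seed, and the sample size are arbitrary assumptions. It relies on the fact that the median of $|X|$ is 1 for a standard Cauchy $X$, so the median of $|a_1 X_1 + a_2 X_2|$ should be close to $|a_1| + |a_2|$.

```python
import math
import random
import statistics


def standard_cauchy(rng):
    """Draw a standard Cauchy variate by inverting its distribution function."""
    return math.tan(math.pi * (rng.random() - 0.5))


rng = random.Random(0)        # fixed seed so the check is reproducible
a1, a2 = 2.0, -3.0            # a negative constant is allowed because the Cauchy law is symmetric
n = 100_000

# Absolute values of the linear combination a1*X1 + a2*X2 for independent X1, X2.
combo = [abs(a1 * standard_cauchy(rng) + a2 * standard_cauchy(rng)) for _ in range(n)]

scale = abs(a1) + abs(a2)     # (|a1|^alpha + |a2|^alpha)^(1/alpha) with alpha = 1
med = statistics.median(combo)
print(f"predicted scale {scale:.3f}, sample median of |a1*X1 + a2*X2| = {med:.3f}")
```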


    The method of random projections

    Each element type $c_j$ in the data stream can be transformed to a distinct random variable $R(c_j) \sim F$ to adequate approximation via a pseudo-random number generator as follows:

    1. Hash $c_j$ to an integer (or vector of integers) via a deterministic hash function with low collision probability.

    2. Use these integers to seed a pseudo-random number generator.

    3. Use the seeded generator to simulate a sequence of independent random variables with distribution $F$; set $R(c_j)$ to the value at a fixed position in the sequence.
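The three steps above can be sketched as follows. The choices here are assumptions made for illustration, not the talk's implementation: SHA-256 as the hash function, Python's Mersenne Twister (`random.Random`) as the generator, the standard Cauchy law ($\alpha = 1$) as $F$, and the hypothetical function name `R` with a `position` argument.

```python
import hashlib
import math
import random


def R(item, position=0):
    """Pseudo-random projection value for an item type, reproducible across calls."""
    # Step 1: hash the item type to an integer with a deterministic hash function.
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    # Step 2: seed a pseudo-random number generator with that integer.
    rng = random.Random(seed)
    # Step 3: simulate a sequence from F (standard Cauchy here) and
    # return the value at the requested fixed position.
    value = None
    for _ in range(position + 1):
        value = math.tan(math.pi * (rng.random() - 0.5))
    return value


# The same item type always maps to the same value, so nothing needs to be stored per item.
print(R("c1") == R("c1"))
```

Because the value is regenerated from the item type on every arrival, the memory cost does not grow with the number of distinct item types.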


    Processing and estimation

    The projection is then accumulated online as

    $$\sum_{t=1}^{T} d_t R(i_t) = \sum_{j=1}^{N} a_j R(c_j),$$

    and the process is repeated independently a further $k - 1$ times to form a $k$-dimensional data sketch.

    If $R(c_j) \sim F$ is strictly stable and symmetric of index $\alpha$, then

    $$\sum_{j=1}^{N} a_j R(c_j) \overset{D}{=} \Big( \sum_{j=1}^{N} |a_j|^\alpha \Big)^{1/\alpha} R,$$

    where $R \sim F$. The problem reduces to that of estimating a scale parameter in an observed sample of size $k$ from the $\alpha$-stable distribution.
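A minimal end-to-end sketch for $\alpha = 1$ is given below, under stated assumptions. The $k$ replicates are taken as the first $k$ values of the generator seeded by the hashed item type, and the scale is estimated by the sample median of the absolute sketch entries. That median estimator is a simple stand-in chosen for illustration; it is not one of the estimators proposed in the talk. All function names and the toy stream are hypothetical.

```python
import hashlib
import math
import random
import statistics


def projections(item, k):
    """First k standard Cauchy values from a generator seeded by the hashed item type."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(k)]


def update(sketch, item, quantity):
    """Process one stream element (i_t, d_t): add d_t * R_r(i_t) to each coordinate r."""
    for r, value in enumerate(projections(item, len(sketch))):
        sketch[r] += quantity * value


def estimate_l1(sketch):
    """Scale estimate for Cauchy data: the median of |X| is 1 for a standard Cauchy X."""
    return statistics.median(abs(s) for s in sketch)


k = 2000                      # sketch dimension, an arbitrary illustrative choice
sketch = [0.0] * k
stream = [("x", 3), ("y", 1), ("x", -2), ("z", 5), ("y", -1), ("w", -4)]
for item, quantity in stream:
    update(sketch, item, quantity)

# Exact accumulation vector is x: 1, y: 0, z: 5, w: -4, so the true l_1 norm is 10.
estimate = estimate_l1(sketch)
print(f"estimated l_1 norm: {estimate:.2f} (exact value 10)")
```

The sketch occupies $k$ floating-point numbers regardless of how many distinct item types appear, at the price of an estimate whose accuracy improves as $k$ grows.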


    Contributions

    We exploit properties of the $\alpha$-stable distribution and the idea of data sketching via random projections to propose efficient estimators of cardinality, $l_\alpha$ distances, and entropy over streaming data.

    The proposed algorithms have fast running time and small storage requirements.

    The resulting estimators are asymptotically efficient (or nearly so) with good small-sample performance (shown via simulations), recursively computable (for cardinality estimation), and have tail bounds that are exponentially decreasing with $k$ (for cardinality and entropy estimation, resulting in space complexity bounds on the size of the data sketch).


    Future work

    Shannon entropy as a characterisation of a data stream, applied in developing convergence diagnostics for monitoring extensive MCMC simulations in Bayesian inference problems.

    Data sketching via random projections as a tool for dimension reduction in computationally expensive simulation algorithms such as particle filtering and particle MCMC.

    References

    http://www.statslab.cam.ac.uk/ioana
