Monoids and Sketches and CRDTs, oh my!

Posted on 14-Apr-2017

Kevin Scaldeferri, OSB 2016

How Do I Math with Big Data?

How?

Monoids and Sketches and CRDTs, oh my!

Monoids

超音波システム研究所 / http://bit.ly/26bBTQ1 / CC BY 3.0

A monoid is an algebraic structure with a single associative binary operation and an identity element. (Wikipedia)

http://bit.ly/1Wlrigv / CC0

It’s just a thing you can “add”

interface Monoid[T] {
  // (x + y) + z = x + (y + z)
  T add(T x, T y);

  // 0 + x = x = x + 0
  T unit();
}

One data type can have multiple monoids!

Operation Unit

Sum 0

Product 1

Max -∞

Min +∞
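The four rows above can be written down directly. What follows is a minimal Python sketch, not the code from the live demo: the `Monoid` class, its `fold` helper, and the sample `data` are illustrative names and values chosen here. It shows that folding each chunk separately and then folding the partial results gives the same answer as folding everything at once, which is what associativity and the unit buy you.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Generic, Iterable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Monoid(Generic[T]):
    add: Callable[[T, T], T]  # must be associative
    unit: T                   # identity element for add

    def fold(self, xs: Iterable[T]) -> T:
        """Combine every element of xs, starting from the unit."""
        return reduce(self.add, xs, self.unit)


# Four different monoids over the same data type (numbers).
SUM = Monoid(lambda x, y: x + y, 0)
PRODUCT = Monoid(lambda x, y: x * y, 1)
MAX = Monoid(max, float("-inf"))
MIN = Monoid(min, float("inf"))

data = [3, 1, 4, 1, 5, 9, 2, 6]
chunks = [data[:3], data[3:6], data[6:]]  # pretend each chunk lives on its own machine

for m in (SUM, PRODUCT, MAX, MIN):
    partials = [m.fold(chunk) for chunk in chunks]  # computed independently
    assert m.fold(partials) == m.fold(data)         # merged result matches
```

Because the unit is the starting value, folding an empty chunk is harmless: it contributes the unit and leaves the merged result unchanged.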

Live Demo!

More Monoids

Count
Boolean And
Boolean Or
Lists & String Concatenation
Set Union
Function Composition

Tuple Monoids

Monoid[U] & Monoid[V] ➜ Monoid[(U,V)]
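One way to read this slide: given a monoid on U and a monoid on V, you get a monoid on pairs by adding componentwise. A small Python sketch under that reading, where each monoid is represented as an `(add, unit)` pair and `tuple_monoid` is a helper name made up for this example:

```python
from functools import reduce


def tuple_monoid(m1, m2):
    """Combine two monoids, each given as (add, unit), into a monoid on pairs."""
    (add1, unit1), (add2, unit2) = m1, m2

    def add(p, q):
        # Componentwise: the first slots use m1, the second slots use m2.
        return (add1(p[0], q[0]), add2(p[1], q[1]))

    return add, (unit1, unit2)


SUM = (lambda x, y: x + y, 0)
MAX = (max, float("-inf"))

add, unit = tuple_monoid(SUM, MAX)
readings = [4, 8, 15, 16, 23, 42]

# One pass over the data computes both aggregates at once.
total, largest = reduce(add, ((x, x) for x in readings), unit)
```

The pair monoid inherits associativity and the unit from its components, so it can be split across machines and merged like any other monoid.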

Derived Monoids

Count & Sum ➜ Average

Count & Sum & SumOfSquares ➜ StdDev
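Average and standard deviation are not monoids themselves, but they can be derived at the end from a tuple that is. A hedged Python sketch: the `(count, sum, sum of squares)` layout and the function names are assumptions made for illustration, and the formula used is the population standard deviation.

```python
import math
from functools import reduce

UNIT = (0, 0.0, 0.0)  # (count, sum, sum of squares)


def lift(x):
    """Turn one observation into a summary."""
    return (1, x, x * x)


def add(p, q):
    """Componentwise sum: this is the monoid operation."""
    return (p[0] + q[0], p[1] + q[1], p[2] + q[2])


def average(s):
    count, total, _ = s
    return total / count


def stddev(s):
    count, total, squares = s
    mean = total / count
    return math.sqrt(squares / count - mean * mean)  # population std dev


# Two partial summaries, e.g. from two machines, merged before deriving anything.
part1 = reduce(add, map(lift, [2.0, 4.0, 4.0, 4.0]), UNIT)
part2 = reduce(add, map(lift, [5.0, 5.0, 7.0, 9.0]), UNIT)
merged = add(part1, part2)
```

Averaging the two partial averages would give the wrong answer whenever the parts differ in size; carrying the count along avoids that. The sum-of-squares formula can lose precision when the mean is large relative to the spread, so treat this as a sketch rather than a numerically robust implementation.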

Sets don’t scale

Dan Morgan / http://bit.ly/1UiFhGs / CC BY 2.0

Sketches = Monoids + Physics

Counting by Flipping Coins

HHTTTHHHHHTHTTHHTHTTT

TTTHTTTTTTHT

Unique Count by Hashing

0111101001 1110101100 0010010010 0100100011 1000111000 0100011011 1100100110 1111011011 0011100001 1001011100

1110100101 1001110101 1010111001 1011110111 0000101001 0100101001 0100110000 0011110100 1011011010 0010011011

Set Cardinality (uniqueCount) ≈ HyperLogLog

Aldo Schumann / http://bit.ly/1Yqzvme / public domain
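The coin-flip and hashing slides suggest the idea behind HyperLogLog: hash every item to a bit string, remember the longest run of leading zeros seen in each of many buckets, and estimate the number of distinct items from those runs. The following is a simplified Python sketch of that idea, not a production implementation. The class name, the choice of SHA-256, and the default of 1024 registers are assumptions; the bias constant is the standard one for 128 or more registers, and only the small-range correction is included.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=10):
        self.p = p                      # the first p hash bits choose a register (assumes p >= 7)
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
        index = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # Rank = position of the first 1 bit in the remaining 64 - p bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        """Monoid operation: elementwise max of the registers."""
        out = HyperLogLog(self.p)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw


a, b, union = HyperLogLog(), HyperLogLog(), HyperLogLog()
for i in range(5000):
    a.add(f"user-{i}")
for i in range(2500, 7500):
    b.add(f"user-{i}")
for i in range(7500):
    union.add(f"user-{i}")

merged = a.merge(b)  # sketches built separately, then combined
```

Merging is an elementwise max, which is associative and has the all-zero sketch as its unit, so the sketch is itself a monoid. With 1024 registers the expected relative error is on the order of a few percent.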

Set Membership

interface ExtensionalSet[T] {
  Iterator[T] iterator();
}

interface IntensionalSet[T] {
  boolean isMember(T t);
}

Intensional Sets ≈ Bloom Filters

HashSet

[Slide animation: A, B, and C are inserted into a HashSet one at a time. Ohnoes! Lookup D? Nopes! Lookup E? Hmmm. E? == Nope!]

BloomFilter

empty:          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
after A:        0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0
after A, B:     0 0 1 0 1 0 0 0 1 1 0 0 1 0 1 0
after A, B, C:  0 0 1 0 1 0 1 0 1 1 0 0 1 0 1 0

D? Nope!

A? Yes*

BloomFilter Monoid

  0 0 1 0 1 0 1 0 1 1 0 0 1 0 1 0
+ 0 1 1 0 0 0 0 1 0 1 0 0 0 0 0 1
= 0 1 1 0 1 0 1 1 1 1 0 0 1 0 1 1
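In the bit arrays above, the sum is a bitwise OR of the two filters. A minimal Python sketch of a Bloom filter with that merge follows. The class and method names, the SHA-256-based hashing, and the defaults of 16 bits and 3 hash functions are assumptions for illustration; the deck does not say which hash functions it used.

```python
import hashlib


class BloomFilter:
    def __init__(self, size=16, hashes=3, bits=0):
        self.size, self.hashes, self.bits = size, hashes, bits

    def _positions(self, item):
        # One bit position per seeded hash function.
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        """False means definitely absent; True means present or a false positive."""
        return all(self.bits >> pos & 1 for pos in self._positions(item))

    def merge(self, other):
        """Monoid operation: bitwise OR. The empty filter is the unit."""
        return BloomFilter(self.size, self.hashes, self.bits | other.bits)


left, right = BloomFilter(), BloomFilter()
left.add("A")
left.add("B")
right.add("C")
both = left.merge(right)

# The arithmetic from the slide, checked as a bitwise OR.
slide_left = int("0010101011001010", 2)
slide_right = int("0110000101000001", 2)
slide_sum = int("0110101111001011", 2)
```

This is why the answer to "A?" is "Yes*" rather than "Yes": a lookup can only say that all of an item's bits are set, which other items may have caused. A lookup that finds a zero bit is always right.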

Circling Back: BloomFilters are a scalable approximation to Sets

CountMinSketch

empty:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

after A:
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0

after A, B, C:
0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0 2 0
0 0 1 0 2 0 0 0 0 0 0 0 0 0 0 0

after A, B, C, and A again:
0 0 0 0 0 0 1 0 1 2 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0 3 0
0 0 2 0 2 0 0 0 0 0 0 0 0 0 0 0

D? Min(2,1,0) = 0

A? Min(2,2,3) = 2

CountMinSketch: Frequency of Occurrence
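The grids above can be read as one row of counters per hash function: adding an item increments one counter in each row, and a query takes the minimum over that item's counters. A small Python sketch of that reading, with assumed names and SHA-256-based row hashes (the deck does not specify its hash functions):

```python
import hashlib


class CountMinSketch:
    def __init__(self, depth=3, width=16):
        self.depth, self.width = depth, width
        self.rows = [[0] * width for _ in range(depth)]

    def _column(self, row, item):
        # Each row uses its own seeded hash to pick a column.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.rows[r][self._column(r, item)] += count

    def estimate(self, item):
        """Minimum over the item's counters: may overcount, never undercounts."""
        return min(self.rows[r][self._column(r, item)] for r in range(self.depth))

    def merge(self, other):
        """Monoid operation: add the counters cell by cell."""
        out = CountMinSketch(self.depth, self.width)
        out.rows = [[x + y for x, y in zip(ra, rb)]
                    for ra, rb in zip(self.rows, other.rows)]
        return out


first, second = CountMinSketch(), CountMinSketch()
for item in ("A", "B"):
    first.add(item)
for item in ("C", "A"):
    second.add(item)
combined = first.merge(second)  # same as sketching A, B, C, A in one stream
```

Collisions can only push a counter up, so taking the minimum across rows picks the least-contaminated counter. That is the sense in which Min(2,2,3) = 2 is the answer for A.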

Funnels: % of users who do A, then B

Size(A ∪ B) ≈ HyperLogLog

Size(A ∩ B) / Size(A ∪ B) ≈ MinHash
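The ratio Size(A ∩ B) / Size(A ∪ B) is the Jaccard similarity, and MinHash estimates it by comparing the minimum hash value of each set under many hash functions. A hedged Python sketch follows; the function names, the 128 seeded SHA-256 hashes, and the sample user IDs are illustrative assumptions, and the sets are assumed to be non-empty.

```python
import hashlib


def signature(items, k=128):
    """For each of k seeded hash functions, keep the minimum hash over the set."""
    return [
        min(int.from_bytes(hashlib.sha256(f"{seed}:{x}".encode()).digest()[:8], "big")
            for x in items)
        for seed in range(k)
    ]


def merge(sig_a, sig_b):
    """Monoid operation: elementwise min gives the signature of the union."""
    return [min(a, b) for a, b in zip(sig_a, sig_b)]


def similarity(sig_a, sig_b):
    """Fraction of positions where the minima agree, an estimate of the Jaccard ratio."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


a = {f"user-{i}" for i in range(300)}        # users who did A
b = {f"user-{i}" for i in range(150, 450)}   # users who did B
sig_a, sig_b = signature(a), signature(b)
jaccard = similarity(sig_a, sig_b)           # true value is 150 / 450
```

Multiplying this ratio by a HyperLogLog estimate of Size(A ∪ B) gives an estimate of Size(A ∩ B), the number of users in both sets, without ever storing the sets themselves.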

pedrik / http://bit.ly/25WzP1H / CC BY 2.0

What About Streaming Data?

Streaming is Distributed-in-Time Computation
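Read this way, a stream is a sequence of chunks that arrive at different times instead of on different machines, and the same monoid merge applies. A minimal Python sketch using a `(count, sum)` monoid; the batch contents and names are made up for illustration:

```python
from functools import reduce

UNIT = (0, 0)  # (count, sum)


def add(p, q):
    return (p[0] + q[0], p[1] + q[1])


def summarize(batch):
    """Fold one batch of numbers into a (count, sum) summary."""
    return reduce(add, ((1, x) for x in batch), UNIT)


batches = [[3, 5], [], [10], [2, 2, 2]]  # what arrived in each time window

state = UNIT
history = []
for batch in batches:
    state = add(state, summarize(batch))  # merge the new window into the running state
    history.append(state)
```

The running state after the last window equals the summary of all items at once, and an empty window contributes the unit, so quiet periods need no special handling.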

What About Mutable Data?

CRDTs: Conflict-Free Replicated Data Types

Available, Eventually Consistent Data Structures

How Can Two People Count?

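Shared Counter

[Slide animation: two replicas start at 0. After (+5), both read 5. One replica then applies (-4) and the other applies (-3); they read 1 and 2, and both reach -2 once each has seen the other's operation.]

Op-based Counter

[Slide animation: the same exchange, now labeled as an op-based counter. In the next step, after (+5) one replica reads 5 while the other goes from 5 to 10. Oops!]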
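A few lines of Python reproduce the numbers on these slides. The reading assumed here is that replicas exchange operations: addition commutes, so delivery order does not matter, but nothing protects against the same operation being delivered twice. The `replay` helper is a name invented for this sketch.

```python
def replay(ops):
    """Apply a sequence of increments and decrements to a counter starting at 0."""
    value = 0
    for delta in ops:
        value += delta
    return value


# Same operations, different delivery order: the replicas still agree.
replica_1 = replay([+5, -4, -3])
replica_2 = replay([+5, -3, -4])

# The (+5) operation delivered twice to one replica: the replicas diverge.
sender = replay([+5])
receiver = replay([+5, +5])
```

An op-based counter therefore depends on the network delivering every operation exactly once.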

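Naive Sets

[Slide animation: two replicas start with {}. Each applies (+X) and holds {X}. One replica then applies (-X). States shown after the adds: {X} {X}; after the remove: {} {}. Oops!]

Observed-Remove Sets

[Slide animation: two replicas start with {}. One applies (+Xa) and holds {Xa}; the other applies (+Xb) and holds {Xb}, then {XaXb} after seeing Xa. The first replica applies (-Xa), leaving {}. Both end with {Xb}.]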
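One common way to build an observed-remove set is to give every add a unique tag and have a remove cover only the tags that the removing replica has actually seen. The Python sketch below follows that approach in a state-based style with tombstones; it is one possible construction, not necessarily the one the talk had in mind, and the class and attribute names are assumptions.

```python
import uuid


class ORSet:
    def __init__(self):
        self.added = set()    # (element, unique tag) pairs, like Xa and Xb on the slide
        self.removed = set()  # pairs this replica observed and then removed

    def add(self, element):
        self.added.add((element, uuid.uuid4().hex))

    def remove(self, element):
        # Only tags observed locally are removed; concurrent adds elsewhere survive.
        self.removed |= {pair for pair in self.added if pair[0] == element}

    def merge(self, other):
        """Union both sides; unions are commutative, associative, and idempotent."""
        self.added |= other.added
        self.removed |= other.removed

    def value(self):
        return {element for element, _tag in self.added - self.removed}


r1, r2 = ORSet(), ORSet()
r1.add("X")          # Xa
r2.add("X")          # Xb, added concurrently
r2.merge(r1)         # r2 now holds Xa and Xb
r1.remove("X")       # r1 has only seen Xa, so only Xa is removed
after_remove = r1.value()

r1.merge(r2)
r2.merge(r1)
```

After the exchange both replicas contain X, because the add tagged Xb was never observed by the replica that issued the remove. The tombstone set in this sketch grows without bound; practical designs need some way to compact it.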

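State-based Counter

[Slide animation: two replicas start at 0. Replica a applies (+5) and (+4), moving through {a=5}=5 and {a=9}=9. Replica b receives {a=5}=5, applies (+3) to reach {a=5,b=3}=8, and after merging both replicas hold {a=9,b=3}=12.]

State-based Counter

[Slide animation: replica a applies (+5) and (+4), producing {a=5}=5 and {a=9}=9. The other replica already holds {a=9}=9 when {a=5}=5 arrives. ???]

Increment-only Counter

[Slide animation: the same delivery, merged by keeping the larger count. The other replica stays at {a=9}=9.]

PN Counter

[Slide animation: replica a applies (+5), (-4), and (+3), moving through {a=+5}=5, {a=+5,-4}=1, and {a=+8,-4}=4. The states {a=+5,-4}=1 and {a=+8,-4}=4 are also shown for the other replica.]

Versioned State

[Slide animation: the same operations with versioned state: {a:1:5}=5, {a:2:1}=1, and {a:3:4}=4. The states {a:2:1}=1 and {a:3:4}=4 are also shown for the other replica.]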
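The increment-only and PN counters can be sketched as per-replica totals merged with max. In the Python below, each replica tracks its own increments and decrements separately, so both maps only ever grow and a stale or duplicated state cannot move the value backwards. The class name and the use of `copy.deepcopy` to stand in for sending state over the network are assumptions of this sketch.

```python
import copy


class PNCounter:
    def __init__(self, replica):
        self.replica = replica
        self.p = {}  # replica id -> total increments
        self.n = {}  # replica id -> total decrements

    def update(self, delta):
        side = self.p if delta >= 0 else self.n
        side[self.replica] = side.get(self.replica, 0) + abs(delta)

    def merge(self, other):
        """Per-replica max on both maps: commutative, associative, idempotent."""
        for mine, theirs in ((self.p, other.p), (self.n, other.n)):
            for replica, total in theirs.items():
                mine[replica] = max(mine.get(replica, 0), total)

    def value(self):
        return sum(self.p.values()) - sum(self.n.values())


a, b = PNCounter("a"), PNCounter("b")
a.update(+5)
early = copy.deepcopy(a)  # state {a=+5}
a.update(-4)
a.update(+3)
late = copy.deepcopy(a)   # state {a=+8,-4}

b.merge(late)
b.merge(early)            # stale state arrives out of order
b.merge(late)             # duplicate delivery
```

Using only the `p` map gives the increment-only counter from the earlier slide; the `n` map is what makes decrements safe under a max-based merge.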

Replace exactly-once, in-order delivery with an idempotent merge strategy
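To make the claim concrete, here is a small Python check using the versioned states from the previous slide, written as `{replica: (version, value)}`. The merge keeps the higher version per replica; this assumes each replica is the only writer of its own entry, so its versions only increase. The function names are illustrative.

```python
from itertools import permutations


def merge(x, y):
    """Keep, per replica id, the (version, value) pair with the higher version."""
    out = dict(x)
    for replica, (version, value) in y.items():
        if replica not in out or version > out[replica][0]:
            out[replica] = (version, value)
    return out


def total(state):
    return sum(value for _version, value in state.values())


updates = [{"a": (1, 5)}, {"a": (2, 1)}, {"a": (3, 4)}]

results = set()
for order in permutations(updates + updates[:1]):  # every order, with one duplicate
    state = {}
    for update in order:
        state = merge(state, update)
    results.add(total(state))
```

Every delivery order, including the duplicated message, ends at the same value, so the network no longer needs to guarantee ordering or exactly-once delivery.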

Summing Up

Monoids allow computations to be done across many machines and merged

Sketches allow approximate results when the exact answers are computationally infeasible

CRDTs give an approach for mutable distributed data

Thank You

kevin@scaldeferri.com
@kscaldef
