Spark at Euclid
DESCRIPTION
Dave Strauss on Spark at Euclid

At Euclid, we are making the physical world just as machine readable, trackable, and actionable as cookies and click-throughs have made the online retail world. To do this, we process logs from sensors around the globe to understand the behaviors of people and their interactions with physical retail locations. This challenging task requires us to model each user’s behavior at the device level, meaning that we design, train, and deploy thousands of machine learning models daily. We have recently introduced Spark into the core of our analytics stack. Doing so has enabled greater flexibility in our analysis and improved our accuracy, reporting, and testing. There are two parts that we intend to discuss: the technological and programmatic aspects of switching to a strongly typed system for our small engineering team, and the continued challenges we face in deploying daily-tuned random forest and naive Bayes models at scale.

TRANSCRIPT
Spark at Euclid
Dave Strauss, PhD
July 1, 2014
who is Euclid Analytics?
quantify and measure retail customer behavior
assign unique random id to all wifi enabled devices
predict shopper duration, repeat visitors, etc.
[Diagram: smartphone → access point → cloud storage → web dashboard]
data processing
process logfiles from wifi access points
obtain a small amount of information about devices (see the parsing sketch below):
unique id
time
signal strength
search for patterns that correspond to user behavior
monitor those trends
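As a concrete illustration of those three fields, here is a minimal parsing sketch in Scala, assuming a hypothetical comma-separated log format (the talk does not show the real one):

// hypothetical log format for this sketch: <unique id>,<unix timestamp>,<signal dB>
case class Observation(deviceId: String, timestamp: Long, signalDb: Double)

def parseLine(line: String): Observation = {
  val Array(id, ts, rssi) = line.split(",")
  Observation(id, ts.toLong, rssi.toDouble)
}

// in a Spark job, e.g.: val observations = sc.textFile(logPath).map(parseLine)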
in a perfect world
[Figure: dwell time (minutes, 0–120) vs. signal strength (dB, −100 to −30); idealized clusters labeled Staff, Visitors, Next Door, and Walkby]
what our data really look like
[Figure: 2D histogram of dwell time (minutes, 0–120) vs. signal strength (dB, −100 to −30), colored by log10 device count (0.0–4.0)]
why we use Spark
code integration
functional programming in scala
scalable machine learning
extensible
iterative algorithms, complex data flows
challenges
launching clusters regularly and reliably
updating and applying models
distributed optimization problems
adoption and migration
python context
created SparkCluster class for managing AWS clusters
use python context management to manage launch and shutdown
provide method to execute remote jobs
with sparkClusterManager.cluster("production") as spc:
    spc.execute("com.euclidanalytics.foo", arguments)
uploadDataToDatabases()
leaves authentication to config files
should anything go wrong, the context will automatically terminate the cluster, saving resources
provide KEEP_ALIVE flag
update and apply
[Pipeline diagram: data store → preprocess in Spark → data + parameters → update / compute → rollup → web serve]
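Sketched in Scala under assumed types (Obs, Params, and the threshold rule are illustrative stand-ins, not Euclid's actual model), one daily pass might look like:

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// hypothetical types and a stubbed update step; the real logic is model-specific
case class Obs(sensorId: String, signalDb: Double)
case class Params(threshold: Double)

def dailyJob(data: RDD[(String, Obs)],
             prior: RDD[(String, Params)]): RDD[(String, Long)] = {
  // update: refit each sensor's parameters from today's data (stubbed as identity)
  val params: RDD[(String, Params)] = prior
  // compute + rollup: count observations passing each sensor's threshold
  data.join(params)
    .filter { case (_, (obs, p)) => obs.signalDb > p.threshold }
    .mapValues(_ => 1L)
    .reduceByKey(_ + _)  // per-sensor counts, handed to the web-serving layer
}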
binary parameter search
search to find optimal parameter in tree model
test for maximum changes in conditional entropy
do parameter search for each sensor
touch all of our data across thousands of sensors
distributed binary search
more work necessary to optimize through shuffle steps
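For intuition only, here is a local sketch of how one search step could narrow a per-sensor parameter interval by probing two candidate thresholds and keeping the better side. The entropyGain function stands in for the conditional-entropy test, and the narrowing rule is an assumption; in the distributed version (recursive code on a later slide) makeDecision plays this role per sensor key.

// local sketch; entropyGain is a hypothetical stand-in for the
// conditional-entropy test applied to a candidate threshold
case class Interval(low: Double, high: Double)

def step(iv: Interval, entropyGain: Double => Double): Interval = {
  val m1 = iv.low + (iv.high - iv.low) / 3
  val m2 = iv.high - (iv.high - iv.low) / 3
  // keep the side whose probe threshold scores better;
  // applied repeatedly, this shrinks the interval by a third per level
  if (entropyGain(m1) >= entropyGain(m2)) Interval(iv.low, m2)
  else Interval(m1, iv.high)
}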
large learning
[Diagram: RDD[data] updated via .map().reduce(); the single parameter is shipped to executors with sc.broadcast() (sketch below)]
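The pattern on this slide as a minimal runnable sketch; the scoring function and values are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("large-learning"))
val data = sc.parallelize(Seq(1.0, 2.0, 3.0))  // stand-in for the real RDD[data]

def score(p: Double, x: Double): Double = p * x  // hypothetical per-record score

// ship the single global parameter to every executor once
val param = sc.broadcast(0.5)
val total = data.map(x => score(param.value, x)).reduce(_ + _)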
distributed learning
[Diagram: RDD[(key,data)] .join() RDD[(key,parameter)], updated per key via .map().reduceByKey() (sketch below)]
a for loop does weird things
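The keyed version of the same pattern, sketched with toy values (the per-record update is a placeholder): joining the keyed data with the keyed parameters updates every sensor in one job, rather than a driver-side for loop launching one job per sensor.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val sc = new SparkContext(new SparkConf().setAppName("distributed-learning"))

// toy stand-ins: per-sensor observations and per-sensor parameters
val data   = sc.parallelize(Seq(("s1", 1.0), ("s1", 2.0), ("s2", 3.0)))
val params = sc.parallelize(Seq(("s1", 0.5), ("s2", 0.9)))

// one update step for all sensors in a single job
val updated = data.join(params)                // RDD[(key, (datum, param))]
  .map { case (k, (x, p)) => (k, p * x) }      // hypothetical per-record update
  .reduceByKey(_ + _)                          // aggregate updates per key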
recursive search
case class modelParam(low: Double, high: Double)

val data: RDD[record] = preprocess(rawData)

def binarySearch(params: RDD[modelParam], level: Int): RDD[modelParam] = {
  level match {
    case 0 => params
    case _ =>
      val updated = model.compute(data, params)   // score candidate intervals against the data
        .reduceByKey((a, b) => aggregate(a, b))   // combine per-key statistics
        .map(makeDecision)                        // narrow each (low, high) interval
      updated.checkpoint()                        // truncate the lineage that grows with each level
      binarySearch(updated, level - 1)
  }
}

val result = binarySearch(initialParams, 10)
Dave Strauss, PhD Spark at Euclid July 1, 2014 13 / 15
evolution of Spark at Euclid
pig scripts
introduced AWS Redshift for simple models
migrated to Spark and Scala
run a nightly job and a whole host of ETL using Spark
looking forward to streaming