spark at euclid

Post on 21-Jun-2015

DESCRIPTION

Dave Strauss on Spark at Euclid. At Euclid, we are making the physical world as machine-readable, trackable, and actionable as cookies and click-throughs have made the online retail world. To do this, we process logs from sensors around the globe to understand people’s behavior and their interactions with physical retail locations. This requires us to model each user’s behavior at the device level, meaning that we design, train, and deploy thousands of machine learning models daily. We recently introduced Spark into the core of our analytics stack, which has given us greater flexibility in our analysis and improved accuracy, reporting, and testing. We intend to discuss two things: the technological and programmatic aspects of switching to a strongly typed system for our small engineering team, and the continued challenges we face in deploying daily-tuned random forest and naive Bayes models at scale.

TRANSCRIPT

Spark at Euclid

Dave Strauss, PhD

July 1, 2014

Dave Strauss, PhD Spark at Euclid July 1, 2014 1 / 15

who is Euclid Analytics?

quantify and measure retail customer behavior

assign unique random id to all wifi enabled devices

predict shopper duration, repeat visitors, etc.

[Diagram: smartphone, access point, cloud storage, web dashboard]


data processing

process logfiles from wifi access points

obtain a small amount of information about devices: unique id, time, signal strength

search for patterns that correspond to user behavior

monitor those trends


in a perfect world

[Figure: dwell time (minutes) versus signal strength (dB), from −100 to −30 dB and 0 to 120 minutes, with regions labeled Staff, Visitors, Next Door, and Walkby]


what our data really look like

[Figure: dwell time (minutes) versus signal strength (dB), from −100 to −30 dB and 0 to 120 minutes; color scale shows log10 devices from 0.0 to 4.0]


why we use Spark

code integration

functional programming in scala

scalable machine learning

extensible

iterative algorithms, complex data flows


challenges

launching clusters regularly and reliably

updating and applying models

distributed optimization problems

adoption and migration


python context

created SparkCluster class for managing AWS clusters

use python context management to manage launch and shutdown

provide method to execute remote jobs

with sparkClusterManager.cluster("production") as spc:
    spc.execute("com.euclidanalytics.foo", arguments)
    uploadDataToDatabases()

leaves authentication for config files

should anything go wrong, the context will automatically terminate the cluster, saving resources

provide KEEP_ALIVE flag
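
The launch-and-shutdown pattern on this slide can be sketched with Python's `contextlib.contextmanager`. This is a minimal sketch, not Euclid's SparkCluster class: `FakeCluster`, `cluster`, the `keep_alive` argument, and the job arguments are hypothetical stand-ins, and no AWS calls are made.

```python
from contextlib import contextmanager


class FakeCluster:
    """Hypothetical stand-in for a cluster handle; makes no AWS calls."""

    def __init__(self, name):
        self.name = name
        self.running = True
        self.jobs = []

    def execute(self, main_class, arguments):
        # A real implementation would submit a remote job here.
        if not self.running:
            raise RuntimeError("cluster is not running")
        self.jobs.append((main_class, list(arguments)))

    def terminate(self):
        self.running = False


@contextmanager
def cluster(name, keep_alive=False):
    """Launch a cluster, hand it to the caller, and shut it down on exit."""
    handle = FakeCluster(name)  # stands in for the launch step
    try:
        yield handle
    finally:
        # Runs on normal exit and when the body raises, so a failed job
        # cannot leave the cluster running unless keep_alive is set.
        if not keep_alive:
            handle.terminate()


with cluster("production") as spc:
    spc.execute("com.euclidanalytics.foo", ["--date", "2014-07-01"])
```

Putting the shutdown in `finally` is what makes the "should anything go wrong" guarantee hold: the cleanup runs whether the body returns or raises.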


update and apply

[Diagram: pipeline with stages labeled data store, Spark, preprocess, data, parameters, update, compute, rollup, and web serve]


binary parameter search

search to find optimal parameter in tree model

test for maximum changes in conditional entropy

do parameter search for each sensor

touch all of our data across thousands of sensors

distributed binary search

more work necessary to optimize through shuffle steps
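
The entropy test mentioned above can be illustrated in isolation. The slide does not define the model, so this sketch assumes a single threshold on signal strength and labeled records of the form (signal strength, label); minimizing conditional entropy is used here as one reading of "maximum change", since it maximizes the reduction from the unconditional entropy. The function names, sensor names, and data are all made up, and the candidates are scanned directly rather than searched.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def conditional_entropy(records, threshold):
    """Entropy of the labels given which side of the threshold a value falls on."""
    if not records:
        return 0.0
    left = [label for value, label in records if value <= threshold]
    right = [label for value, label in records if value > threshold]
    n = len(records)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)


def best_threshold(records, candidates):
    """Candidate threshold with the lowest conditional entropy."""
    return min(candidates, key=lambda t: conditional_entropy(records, t))


# Made-up records per sensor: (signal strength in dB, label).
by_sensor = {
    "sensor-a": [(-90, "walkby"), (-85, "walkby"), (-50, "visitor"), (-45, "visitor")],
    "sensor-b": [(-80, "walkby"), (-75, "walkby"), (-70, "visitor"), (-40, "visitor")],
}
candidates = [-95, -72, -60]
best = {sensor: best_threshold(recs, candidates) for sensor, recs in by_sensor.items()}
```

Each sensor is handled independently, which is why the search parallelizes by key.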


large learning

RDD[data]

parameter

.map().reduce()

sc.broadcast()
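
The pattern on this slide (one parameter shared with every record, then a map and a reduce over the data) can be mimicked locally. This is a sketch in plain Python rather than Spark: the built-in `map` and `functools.reduce` stand in for the RDD operations, an ordinary variable stands in for the broadcast value, and the squared-error objective is an assumption chosen only for illustration.

```python
from functools import reduce

data = [2.0, 4.0, 6.0, 8.0]


def gradient(theta, x):
    # Derivative of 0.5 * (theta - x) ** 2 with respect to theta.
    return theta - x


def step(theta, data, learning_rate):
    # "Broadcast": every record sees the same theta.
    grads = map(lambda x: gradient(theta, x), data)
    # "Reduce": sum the per-record gradients into one number.
    total = reduce(lambda a, b: a + b, grads)
    return theta - learning_rate * total / len(data)


theta = 0.0
for _ in range(50):
    theta = step(theta, data, learning_rate=0.5)
```

With this objective, `theta` converges toward the mean of the data; only the small parameter travels to the workers, while the data stays where it is.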


distributed learning

RDD[(key,data)]

RDD[(key,parameter)]

.map().reduceByKey()

.join()

a for loop does weird things


recursive search

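import org.apache.spark.rdd.RDD

// record, preprocess, rawData, model, aggregate, makeDecision, and
// initialParams are defined elsewhere and not shown on the slide.
case class modelParam(low: Double, high: Double)

val data: RDD[record] = preprocess(rawData)

// Recurse once per level; stop and return the parameters when level reaches 0.
def binarySearch(params: RDD[modelParam], level: Int): RDD[modelParam] =
  level match {
    case 0 => params
    case _ =>
      val updated = model.compute(data, params)
        .reduceByKey((a, b) => aggregate(a, b))
        .map(makeDecision)
      // Checkpointing truncates the lineage, which otherwise grows with each
      // level. It requires a checkpoint directory (sc.setCheckpointDir).
      updated.checkpoint()
      binarySearch(updated, level - 1)
  }

val result = binarySearch(initialParams, 10)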
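
The shape of this recursion can be tried without a cluster. The sketch below is a plain-Python analogue, not the Spark job: a dict of per-key intervals stands in for the keyed RDD, and the `passes` test and the per-sensor cutoffs are assumptions invented to drive the example. Every level halves all intervals at once, as the recursive version does.

```python
def binary_search(intervals, passes, level):
    """Halve every key's (low, high) interval `level` times.

    `passes(key, x)` must be monotone in x: False below some cutoff,
    True at or above it.
    """
    if level == 0:
        return intervals
    updated = {}
    for key, (low, high) in intervals.items():
        mid = (low + high) / 2.0
        # Keep the half that still brackets the cutoff.
        updated[key] = (low, mid) if passes(key, mid) else (mid, high)
    return binary_search(updated, passes, level - 1)


# Made-up cutoffs per sensor, used only to drive the example.
cutoffs = {"sensor-a": -72.5, "sensor-b": -61.0}
initial = {key: (-100.0, -30.0) for key in cutoffs}
result = binary_search(initial, lambda key, x: x >= cutoffs[key], 10)
```

Ten levels shrink each 70 dB interval by a factor of 1024, so every sensor's cutoff is bracketed to within about 0.07 dB.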


evolution of Spark at Euclid

Pig scripts

introduced AWS Redshift for simple models

migrate to Spark, Scala

run nightly job and a whole host of ETL using Spark

looking forward to streaming


questions

Contact dstrauss@euclidanalytics.com for more
