Spark at Euclid



DESCRIPTION

Dave Strauss on Spark at Euclid. At Euclid, we are making the physical world just as machine readable, trackable, and actionable as cookies and click-throughs have made the online retail world. To do this, we process logs from sensors around the globe to understand how people behave in and interact with physical retail locations. This challenging task requires us to model each user's behavior at the device level, meaning that we design, train, and deploy thousands of machine learning models daily. We have recently introduced Spark into the core of our analytics stack. Doing so has given us greater flexibility in our analysis and improved our accuracy, reporting, and testing. We intend to discuss two topics: the technological and programmatic aspects of moving our small engineering team to a strongly typed system, and the continued challenges we face in deploying daily-tuned random forest and naive Bayes models at scale.

TRANSCRIPT

Page 1: Spark at Euclid

Spark at Euclid

Dave Strauss, PhD

July 1, 2014


Page 2: Spark at Euclid

who is Euclid Analytics?

quantify and measure retail customer behavior

assign unique random id to all wifi enabled devices

predict shopper duration, repeat visitors, etc.

[Diagram: smartphone → access point → cloud storage → web dashboard]


Page 3: Spark at Euclid

data processing

process logfiles from wifi access points

obtain a small amount of information about each device: unique id, time, signal strength (a parsing sketch follows this list)

search for patterns that correspond to user behavior

monitor those trends
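
The log format itself isn't shown in the talk, so the following is a hedged sketch, assuming one presence event per line with the three fields above; the names and layout are illustrative only.

    from collections import namedtuple

    # assumed layout: "<hashed device id> <unix time> <signal strength in dB>"
    Record = namedtuple("Record", ["device_id", "timestamp", "signal_db"])

    def parse(line):
        device_id, ts, rssi = line.split()
        return Record(device_id, int(ts), int(rssi))

    print(parse("3f2a9c71 1404172800 -67"))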


Page 4: Spark at Euclid

in a perfect world

[Figure: idealized dwell time (minutes, 0-120) vs. signal strength (dB, -100 to -30), with cleanly separated clusters labeled Staff, Visitors, Next Door, and Walkby]



Page 6: Spark at Euclid

what our data really look like

[Figure: observed device density on the same axes, dwell time (minutes, 0-120) vs. signal strength (dB, -100 to -30), shaded by log10 devices (0.0-4.0)]


Page 7: Spark at Euclid

why we use Spark

code integration

functional programming in Scala

scalable machine learning

extensible

iterative algorithms, complex data flows


Page 8: Spark at Euclid

challenges

launching clusters regularly and reliably

updating and applying models

distributed optimization problems

adoption and migration


Page 9: Spark at Euclid

python context

created SparkCluster class for managing AWS clusters

use python context management to manage launch and shutdown

provide method to execute remote jobs:

    with sparkClusterManager.cluster("production") as spc:
        spc.execute("com.euclidanalytics.foo", arguments)
    uploadDataToDatabases()

leaves authentication to config files

should anything go wrong, the context will automatically terminate the cluster, saving resources

provide KEEP_ALIVE flag (a minimal sketch of the context manager follows)
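
As a hedged sketch of how such a context manager could be built (an assumption for illustration, not Euclid's actual code; launch, execute, and terminate are placeholders for the real AWS and job-submission calls):

    from contextlib import contextmanager

    class SparkCluster(object):
        def __init__(self, name, keep_alive=False):
            self.name = name
            self.keep_alive = keep_alive

        def launch(self):
            # placeholder for the real AWS cluster-launch call
            print("launching cluster %s" % self.name)

        def execute(self, job_class, arguments):
            # placeholder for submitting the named job to the cluster
            print("running %s with %r" % (job_class, arguments))

        def terminate(self):
            # placeholder for the real AWS terminate call
            print("terminating cluster %s" % self.name)

    @contextmanager
    def cluster(name, keep_alive=False):
        spc = SparkCluster(name, keep_alive)
        spc.launch()
        try:
            yield spc
        finally:
            # runs even when the job raises, so a failed run still frees
            # the cluster unless the KEEP_ALIVE flag was requested
            if not spc.keep_alive:
                spc.terminate()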



Page 12: Spark at Euclid

update and apply

[Diagram: the data store feeds Spark, which preprocesses raw logs into data; stored parameters are updated and combined with the data in a compute step; the results are rolled up and pushed out for web serving]


Page 13: Spark at Euclid

binary parameter search

search to find optimal parameter in tree model

test for maximum changes in conditional entropy

do parameter search for each sensor

touch all of our data across thousands of sensors

distributed binary search

more work necessary to optimize through shuffle steps; the interval-halving step is sketched below
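
The talk doesn't spell out the decision rule, so the following is a hedged sketch of the halving step for a single sensor's parameter interval; entropy_change is an assumed stand-in for the conditional-entropy test described above.

    def narrow(interval, entropy_change):
        # halve the interval, keeping the side where the change in
        # conditional entropy is larger
        low, high = interval
        mid = (low + high) / 2.0
        if entropy_change(low, mid) >= entropy_change(mid, high):
            return (low, mid)
        return (mid, high)

Ten such levels shrink each sensor's interval by a factor of 2**10, which matches the level argument in the recursive Scala version on page 17.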


Page 14: Spark at Euclid

large learning

[Diagram: a single parameter is shipped to every worker with sc.broadcast(), and the RDD[data] is aggregated against it with .map().reduce(); see the sketch below]
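
A minimal PySpark sketch of this pattern, assuming nothing about Euclid's models: the "model" here is just a running mean, fit by broadcasting the current parameter on each pass.

    from pyspark import SparkContext

    sc = SparkContext(appName="large-learning-sketch")
    data = sc.parallelize([1.0, 2.0, 3.0, 4.0])  # stands in for the real RDD[data]

    param = 0.0
    for _ in range(20):
        b = sc.broadcast(param)  # ship the current parameter to every worker
        # map each record to its gradient against the broadcast value,
        # then reduce down to a single number on the driver
        grad = data.map(lambda x: x - b.value).reduce(lambda a, c: a + c)
        param += 0.1 * grad      # driver-side update
    print(param)                 # approaches the mean, 2.5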


Page 15: Spark at Euclid

distributed learning

[Diagram: RDD[(key,data)] is .join()ed against RDD[(key,parameter)], then updated with .map().reduceByKey()]

a for loop does weird things here: every pass stacks more work onto the RDD lineage (see the sketch below)
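
A matching PySpark sketch of the keyed version, again illustrative rather than Euclid's code; the toy update moves each key's parameter toward that key's mean value.

    from pyspark import SparkContext

    sc = SparkContext(appName="distributed-learning-sketch")
    data = sc.parallelize([("a", 1.0), ("a", 3.0), ("b", 10.0)])
    params = sc.parallelize([("a", 0.0), ("b", 0.0)])

    for _ in range(10):
        # join yields (key, (x, param)); reduceByKey sums per-key gradients
        grads = (data.join(params)
                     .mapValues(lambda v: v[0] - v[1])
                     .reduceByKey(lambda a, c: a + c))
        params = params.join(grads).mapValues(lambda v: v[0] + 0.1 * v[1])
        # without checkpointing, each pass stacks another join onto the
        # lineage: one of the "weird things" the slide warns about

    print(sorted(params.collect()))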



Page 17: Spark at Euclid

recursive search

case class modelParam(low: Double, high: Double)

val data: RDD[record] = preprocess(rawData)

def binarySearch(params: RDD[modelParam], level: Int): RDD[modelParam] = {
  level match {
    case 0 => params
    case _ =>
      // one halving step: score the candidate intervals against the data,
      // aggregate per key, and keep the better half of each interval
      val updated = model.compute(data, params)
        .reduceByKey((a, b) => aggregate(a, b))
        .map(makeDecision)
      // truncate the lineage so the recursion doesn't build an
      // ever-deeper DAG
      updated.checkpoint()
      binarySearch(updated, level - 1)
  }
}

val result = binarySearch(initialParams, 10)
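
A note on the design: the checkpoint() call is what tames the "weird things" from the earlier for-loop slide. Each level would otherwise stack another reduceByKey and map onto the RDD's lineage; checkpointing writes the intermediate result to stable storage and truncates that lineage. In Spark this requires sc.setCheckpointDir to be configured, and the checkpoint is only materialized once an action forces evaluation.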


Page 18: Spark at Euclid

evolution of Spark at Euclid

pig scripts

introduced AWS Redshift for simple models

migrate to Spark, Scala

run nightly job and a whole host of ETL using Spark

looking forward to streaming


Page 19: Spark at Euclid

questions

Contact [email protected] for more
