Spark at Euclid
DESCRIPTION
Dave Strauss on Spark at Euclid

At Euclid, we are making the physical world just as machine readable, trackable, and actionable as cookies and click-throughs have made the online retail world. To do this, we process logs from sensors around the globe to understand the behaviors of people and their interactions with physical retail locations. This challenging task requires us to model each user’s behavior at the device level, meaning that we design, train, and deploy thousands of machine learning models daily. We have recently introduced Spark into the core of our analytics stack. Doing so has enabled greater flexibility in our analysis and improved our accuracy, reporting, and testing. There are two parts that we intend to discuss: the technological and programmatic aspects of switching to a strongly typed system for our small engineering team, and the continued challenges we face in deploying daily-tuned random forest and naive Bayes models at scale.

TRANSCRIPT
Spark at Euclid
Dave Strauss, PhD
July 1, 2014
who is Euclid Analytics?
quantify and measure retail customer behavior
assign unique random id to all wifi enabled devices
predict shopper duration, repeat visitors, etc.
[Diagram: smartphone → access point → cloud storage → web dashboard]
data processing
process logfiles from wifi access points
obtain a small amount of information about devices (see the parsing sketch below):
unique id
time
signal strength
search for patterns that correspond to user behavior
monitor those trends
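As a concrete illustration of those three fields, here is a minimal parsing sketch in Scala, assuming a hypothetical comma-separated log format (the talk does not show the real one):

// hypothetical log format for this sketch: <unique id>,<unix timestamp>,<signal dB>
case class Observation(deviceId: String, timestamp: Long, signalDb: Double)

def parseLine(line: String): Observation = {
  val Array(id, ts, rssi) = line.split(",")
  Observation(id, ts.toLong, rssi.toDouble)
}

// in a Spark job, e.g.: val observations = sc.textFile(logPath).map(parseLine)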
in a perfect world
[Figure: dwell time (minutes, 0–120) vs. signal strength (dB, −100 to −30); idealized clusters labeled Staff, Visitors, Next Door, and Walkby]
what our data really look like
[Figure: 2D histogram of dwell time (minutes, 0–120) vs. signal strength (dB, −100 to −30), colored by log10 device count (0.0–4.0)]
why we use Spark
code integration
functional programming in scala
scalable machine learning
extensible
iterative algorithms, complex data flows
challenges
launching clusters regularly and reliably
updating and applying models
distributed optimization problems
adoption and migration
python context
created SparkCluster class for managing AWS clusters
use python context management to manage launch and shutdown
provide method to execute remote jobs
with sparkClusterManager.cluster("production") as spc:
    spc.execute("com.euclidanalytics.foo", arguments)
uploadDataToDatabases()
leaves authentication to config files
should anything go wrong, the context will automatically terminate the cluster, saving resources
provide KEEP_ALIVE flag
update and apply
[Pipeline diagram: data store → preprocess in Spark → data + parameters → update / compute → rollup → web serve]
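Sketched in Scala under assumed types (Obs, Params, and the threshold rule are illustrative stand-ins, not Euclid's actual model), one daily pass might look like:

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// hypothetical types and a stubbed update step; the real logic is model-specific
case class Obs(sensorId: String, signalDb: Double)
case class Params(threshold: Double)

def dailyJob(data: RDD[(String, Obs)],
             prior: RDD[(String, Params)]): RDD[(String, Long)] = {
  // update: refit each sensor's parameters from today's data (stubbed as identity)
  val params: RDD[(String, Params)] = prior
  // compute + rollup: count observations passing each sensor's threshold
  data.join(params)
    .filter { case (_, (obs, p)) => obs.signalDb > p.threshold }
    .mapValues(_ => 1L)
    .reduceByKey(_ + _)  // per-sensor counts, handed to the web-serving layer
}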
binary parameter search
search to find optimal parameter in tree model
test for maximum changes in conditional entropy
do parameter search for each sensor
touch all of our data across thousands of sensors
distributed binary search
more work necessary to optimize through shuffle steps
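For intuition only, here is a local sketch of how one search step could narrow a per-sensor parameter interval by probing two candidate thresholds and keeping the better side. The entropyGain function stands in for the conditional-entropy test, and the narrowing rule is an assumption; in the distributed version (recursive code on a later slide) makeDecision plays this role per sensor key.

// local sketch; entropyGain is a hypothetical stand-in for the
// conditional-entropy test applied to a candidate threshold
case class Interval(low: Double, high: Double)

def step(iv: Interval, entropyGain: Double => Double): Interval = {
  val m1 = iv.low + (iv.high - iv.low) / 3
  val m2 = iv.high - (iv.high - iv.low) / 3
  // keep the side whose probe threshold scores better;
  // applied repeatedly, this shrinks the interval by a third per level
  if (entropyGain(m1) >= entropyGain(m2)) Interval(iv.low, m2)
  else Interval(m1, iv.high)
}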
large learning
[Diagram: RDD[data] updated via .map().reduce(); the single parameter is shipped to executors with sc.broadcast() (sketch below)]
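The pattern on this slide as a minimal runnable sketch; the scoring function and values are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("large-learning"))
val data = sc.parallelize(Seq(1.0, 2.0, 3.0))  // stand-in for the real RDD[data]

def score(p: Double, x: Double): Double = p * x  // hypothetical per-record score

// ship the single global parameter to every executor once
val param = sc.broadcast(0.5)
val total = data.map(x => score(param.value, x)).reduce(_ + _)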
distributed learning
[Diagram: RDD[(key,data)] .join() RDD[(key,parameter)], updated per key via .map().reduceByKey() (sketch below)]
a for loop does weird things
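The keyed version of the same pattern, sketched with toy values (the per-record update is a placeholder): joining the keyed data with the keyed parameters updates every sensor in one job, rather than a driver-side for loop launching one job per sensor.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val sc = new SparkContext(new SparkConf().setAppName("distributed-learning"))

// toy stand-ins: per-sensor observations and per-sensor parameters
val data   = sc.parallelize(Seq(("s1", 1.0), ("s1", 2.0), ("s2", 3.0)))
val params = sc.parallelize(Seq(("s1", 0.5), ("s2", 0.9)))

// one update step for all sensors in a single job
val updated = data.join(params)                // RDD[(key, (datum, param))]
  .map { case (k, (x, p)) => (k, p * x) }      // hypothetical per-record update
  .reduceByKey(_ + _)                          // aggregate updates per key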
recursive search
case class modelParam(low: Double, high: Double)

val data: RDD[record] = preprocess(rawData)

def binarySearch(params: RDD[modelParam], level: Int): RDD[modelParam] = {
  level match {
    case 0 => params
    case _ =>
      val updated = model.compute(data, params)   // score candidate intervals against the data
        .reduceByKey((a, b) => aggregate(a, b))   // combine per-key statistics
        .map(makeDecision)                        // narrow each (low, high) interval
      updated.checkpoint()                        // truncate the lineage that grows with each level
      binarySearch(updated, level - 1)
  }
}

val result = binarySearch(initialParams, 10)
Dave Strauss, PhD Spark at Euclid July 1, 2014 13 / 15
evolution of Spark at Euclid
pig scripts
introduced AWS Redshift for simple models
migrated to Spark and Scala
run a nightly job and a whole host of ETL using Spark
looking forward to streaming