Spark at Hackathon, 8 Jan 2014
Spark @ Hackathon
Madhukara Phatak, Zinnia Systems
@madhukaraphatak
Hackathon
"A hackathon is a programming ritual, usually performed at night, where programmers solve overnight problems that might otherwise take years."
— Anonymous
Where / When
- Dec 6 and 7, 2013
- At the BigData Conclave
- Hosted by Flutura
- Solving real-world problem(s) in 24 hours
- No restriction on the tools to be used
- Deliverables: results, working code, and visualization
What?
Predict the global energy demand for the next year using the energy usage data available for the last four years, to enable utility companies to handle the energy demand effectively.

- Per-minute usage data collected from smart meters
- From 2008 to 2012
- Around 133 MB uncompressed
- 275,000 records
- Around 25,000 missing data records
Questions to be answered
1. What would be the energy consumption for the next day?
2. What would be the week-wise energy consumption for the next one year?
3. What would be the household's peak-time load (peak time is between 7 AM and 10 AM) for the next month, during weekdays and during weekends?
4. Assuming there was a full day of outage, calculate the revenue loss for a particular day next year by finding the average revenue per day (ARPD) of the household using the given tariff plan.
5. Can you identify the device usage patterns?
Who
- A four-member team from Zinnia Systems
- Chose Spark/Scala for solving these problems
- Everyone except me was new to Scala and Spark
- Solved the problems on time
- Won first prize at the hackathon
Why Spark?
- Faster prototyping
- Excellent in-memory performance
- Built on Akka, which can run about 2.5 million concurrent actors in 1 GB of RAM
- Easy to debug, with excellent IntelliJ and Eclipse integration
- Little code to write: about 500 lines
- Something new to try at a hackathon
Solution
- Uses only core Spark
- Uses the geometric Brownian motion algorithm for prediction
- Complete code available on GitHub, under the Apache license: https://github.com/zinniasystems/spark-energy-prediction
- Blog series: http://zinniasystems.com/blog/2013/12/12/predicting-global-energy-demand-using-spark-part-1/
Embrace Scala
- Scala is the JVM language in which Spark is implemented.
- Though Spark provides Java and Python APIs, Scala feels more natural.
- If you are coming from Pig, you will feel at home with the Scala API.
- The Spark source base is small, so knowing Scala lets you peek at the Spark source whenever needed.
- Excellent REPL support.
Go functional
Spark encourages you to use functional idioms over object-oriented ones. Some of the functional idioms available:

- Closures
- Function chaining
- Lazy evaluation

Example: standard deviation, built from Sum((xi − mean) * (xi − mean)).

The Map/Reduce way:
- Map calculates (xi − mean) * (xi − mean)
- Reduce does the sum
Spark way
```scala
// Functional Spark way: map computes the squared deviations,
// reduce sums them up (returns the sum used on the previous slide).
private def standDeviation(inputRDD: RDD[Double], mean: Double): Double = {
  inputRDD.map(value => (value - mean) * (value - mean))
          .reduce((firstValue, secondValue) => firstValue + secondValue)
}
```

```scala
// Imperative style does NOT work in Spark: `sum` is mutated inside a
// closure that runs on remote executors, so the driver's copy of `sum`
// never changes — and the lazy map is never even forced.
private def standDeviation(inputRDD: RDD[Double], mean: Double): Double = {
  var sum = 0.0
  inputRDD.map(value => sum += (value - mean) * (value - mean))
  sum
}
```

The code is available in EnergyUsagePrediction.scala.
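Outside of Spark, the same map/reduce shape can be checked with plain Scala collections. A minimal sketch (the sample values are made up for illustration):

```scala
// Plain-Scala analogue of the map/reduce standard-deviation computation:
// map computes squared deviations, reduce adds them up.
val data = List(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)
val mean = data.sum / data.size                     // 5.0

val sumOfSquares = data
  .map(value => (value - mean) * (value - mean))    // squared deviations
  .reduce((a, b) => a + b)                          // their sum

val stdDev = math.sqrt(sumOfSquares / data.size)
```

The same `map(...).reduce(...)` chain transfers directly to an `RDD[Double]`, which is why the functional version above is the natural Spark way.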
Use Tuples
- Map/Reduce is restricted to key/value pairs.
- Representing data such as grouped data is too difficult with key/value pairs.
- Writables are too much work to develop and maintain.
- There was a tuple Map/Reduce effort at some point in time.
- Spark (Scala) has tuples built in.
Tuple Example
Aggregating data over an hour:

```scala
def hourlyAggregator(inputRDD: RDD[Record]): RDD[((String, Long), Record)]
```

The resulting RDD has a tuple key combining the date and the hour of the day. These tuples can be passed as input to other functions.

Aggregating data over a day:

```scala
def dailyAggregation(inputRDD: RDD[((String, Long), Record)]): RDD[(Date, Record)]
```
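The idea behind the tuple key can be sketched with plain Scala collections. This is an illustration only, with a hypothetical `Record(date, hour, usage)` case class standing in for the project's real `Record` type, and a `Map` standing in for the `RDD`:

```scala
// Hypothetical record type; the real Record class lives in the project repo.
case class Record(date: String, hour: Long, usage: Double)

// Group readings by a (date, hour) tuple key and sum the usage per hour.
// The tuple key needs no custom Writable — it is just a Scala value.
def hourlyUsage(records: List[Record]): Map[(String, Long), Double] =
  records
    .groupBy(r => (r.date, r.hour))
    .map { case (key, rs) => key -> rs.map(_.usage).sum }

val readings = List(
  Record("2012-01-01", 7, 1.5),
  Record("2012-01-01", 7, 2.5),
  Record("2012-01-01", 8, 3.0)
)
val hourly = hourlyUsage(readings)
```

In Spark the same grouping is a `map` to `((date, hour), record)` pairs followed by `reduceByKey`, yielding the `RDD[((String, Long), Record)]` shown above.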
Use Lazy evaluation
Map/Reduce does not embrace lazy evaluation:
- The output of every job has to be written to HDFS.
- HDFS is the only way to share data in Map/Reduce.

Spark differs:
- Every operation other than actions is lazily evaluated.
- Write only critical data to disk; cache intermediate data in memory.
- Be careful when you use actions; try to delay calling them as late as possible.
- Refer to Main.scala.
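The transformation/action split can be mimicked with Scala's lazy collection views. This is an analogy only, not Spark code: `view` plays the role of an RDD transformation chain, and `sum` plays the role of an action:

```scala
// A view makes map/filter lazy, like Spark transformations:
// nothing runs until an action-like call (sum, toList) forces it.
var evaluations = 0
val pipeline = (1 to 10).view
  .map { n => evaluations += 1; n * n } // lazy: not evaluated yet
  .filter(_ % 2 == 0)                   // still lazy: just a plan

val before = evaluations                // 0 — no work has been done
val total = pipeline.sum                // "action": forces the whole chain
val after = evaluations                 // 10 — every element was mapped once
```

Just as in Spark, building the pipeline is cheap; only the terminal call does the work, so calling "actions" late and rarely keeps the intermediate data out of the expensive path.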
Thank you