Cassandra Day London 2015: British Gas Connected Homes: 5 things we wish we had known before...
TRANSCRIPT
Five Things... (we wish we had known)
British Gas Connected Homes
Josep Casals - Lead Data Engineer
Jim Anning - Head of Data & Analytics
Hive : Control Your Heating from your Phone
Connected Boiler: Proactive Maintenance
MyEnergy: Understand your Energy Usage
Hive
170K - 2 minutes
600K - Smart Meter
3.8M Monthly
Future - 10 seconds
Internet of Things
Data Science
Lots of Data
[Chart: 2011 to now, vertical axis 0 to 60,000]
C* + Spark
Lesson 1 : Not to race against bicycles
Spark is for parallel execution
• Makes sense when we have jobs that can’t run on a single machine
• The Spark master needs to distribute the job to workers
• If the job shuffles all data to one single node, parallelism is lost
• For small tasks, a simple script is often better
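The point about shuffling everything to a single node can be sketched without a cluster. The snippet below is a plain-Python illustration, not Spark code: `partition_sizes` is a hypothetical helper that mimics hash partitioning, and the device ids are made-up integers. It shows how the choice of key decides whether work is spread across workers or piles up on one of them.

```python
from collections import Counter

def partition_sizes(keys, num_partitions):
    """Count how many records land on each partition under hash partitioning."""
    counts = Counter(hash(k) % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

# Hypothetical data: 1000 readings, one per device id (small ints hash to themselves).
device_ids = range(1000)

# Keyed by device id: records spread evenly over the 4 partitions.
spread = partition_sizes(device_ids, 4)

# Keyed by a single constant (e.g. a global aggregate): one partition gets everything.
collapsed = partition_sizes([0] * 1000, 4)

print(spread)     # [250, 250, 250, 250]
print(collapsed)  # [1000, 0, 0, 0]
```

In the second case three of the four workers sit idle, which is the situation where a simple single-machine script would likely do just as well.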
techblog.netflix.com
Things that look like a Spark / C* cluster
A Large Hadron Collider
• It can achieve big energies
• It takes a lot of fine tuning
An Ion Thrust Engine
• It starts slow but in the long run goes very fast
Who wins?
It depends on how far you go…
Lesson 2 : Not to use Spark too much
Joining data from multiple sources
Think twice when you do that
Upserting data from multiple sources
Do that if possible
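A rough way to see why upserting can replace a join: Cassandra writes are upserts, so two sources that write different columns under the same primary key end up in one row without any join step. The sketch below uses an in-memory dict as a stand-in for a table; the `upsert` helper, the key layout, and the column names are all hypothetical.

```python
def upsert(table, key, **columns):
    """Write columns for a primary key. Columns already stored for that key are
    kept unless overwritten, mimicking insert-or-update write semantics."""
    table.setdefault(key, {}).update(columns)

# Stand-in for a table keyed by (home_id, day); not a real Cassandra client.
readings = {}

# Source 1: thermostat feed writes its own column.
upsert(readings, ("home-1", "2015-04-01"), avg_temp_c=19.5)

# Source 2: smart meter feed writes a different column under the same key.
upsert(readings, ("home-1", "2015-04-01"), gas_kwh=42.0)

row = readings[("home-1", "2015-04-01")]
print(row)  # {'avg_temp_c': 19.5, 'gas_kwh': 42.0}
```

Each source can be loaded independently and in any order, and the combined row is available to read afterwards, which is the work a Spark join would otherwise have to do at query time.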
Lesson 3 : Spark is stronger than Cassandra
Spark Properties & Cassandra-specific properties tuning
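One reading of this slide is that a Spark job can push writes faster than a Cassandra cluster can absorb them, so the write path needs throttling. The sketch below collects a few Spark and Spark Cassandra Connector properties and renders them as `spark-submit` `--conf` arguments. The values are illustrative assumptions, not recommendations from the talk, and connector property names have changed between connector versions, so they should be checked against the reference for the version in use.

```python
# Illustrative values only; tune against the actual cluster.
throttled_write_conf = {
    # General Spark sizing
    "spark.cores.max": "8",
    "spark.executor.memory": "4g",
    # Connector write path: fewer concurrent writes and a throughput cap
    "spark.cassandra.output.concurrent.writes": "2",
    "spark.cassandra.output.throughput_mb_per_sec": "5",
    "spark.cassandra.output.batch.size.rows": "100",
    # Connector read path: size of each Spark partition read from Cassandra
    "spark.cassandra.input.split.size_in_mb": "64",
}

def to_submit_args(conf):
    """Render a property dict as a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args

submit_args = to_submit_args(throttled_write_conf)
print(" ".join(submit_args))
```

Keeping the properties in one place makes it easier to change one setting at a time and compare runs while tuning.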
Lesson 4 : Mindset
Lesson 5 : Velocity
Idea -> Data Science -> Data Engineering -> Data Operations -> Value
Creative, Experimental, Incremental, Research
Robust, Defined, Maintainable, Scalable, Testable
R, Python
Java, Scala
Small Datasets, Offline, Single Machine
#BigData, Clustered, Realtime