Managing and mining
smart meter data – at scale
CSE Project Showcase 9 July 2013
Twitter: @cse_bristol #SmartMeterData
Introduction
Contents
- Introduction to the project, the data, and its applications
- Managing SM data at scale
- Getting valuable knowledge out of SM data
- Demo: Smart Meter Analytics, Scaled by Hadoop (SMASH)
- Where next?
- Discussion
Introduction
Project Background
“Generating Value from Smart Electricity Meter Data”
18 Month TSB-supported collaboration
CSE, University of Bristol, SSE and Western Power Distribution
Three themes:
• Managing the data at scale
• Extracting useful knowledge
• Integrating the above in a user-facing application
Introduction
The data
A half-hourly timeseries for each smart meter / register
Content: date, time, consumption in the half hour.
For a single register: 17,520 records per year.
This is what 18 months look like:
Introduction
The data
EDRP (Energy Demand Research Project):
• 18 months
• 16,250 smart metered households
• 16,250 smart electricity meters
• 9,364 smart gas meters
• 670m half-hourly records (E: 420m, G: 250m)
• 40GB of raw csv file data
Post rollout, per year, domestic only:
• 25m smart metered households
• 25m smart electricity meters
• 20m smart gas meters
• 800 billion half-hourly records (E: 450Bn, G: 350Bn)
• 50TB of raw csv file data
EDRP ~ 0.1% of a year’s domestic data
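The record counts quoted above can be roughly reproduced with simple arithmetic. The sketch below assumes 48 half-hourly readings per day and a 365-day year; the bytes-per-record figure is only implied by the quoted totals, not measured.

```python
# Back-of-envelope check of the record counts quoted above.
HALF_HOURS_PER_DAY = 48
DAYS_PER_YEAR = 365  # assumption: non-leap year

records_per_register_year = HALF_HOURS_PER_DAY * DAYS_PER_YEAR  # 17,520


def records(meters: int, years: float) -> float:
    """Half-hourly records produced by `meters` registers over `years`."""
    return meters * records_per_register_year * years


# EDRP trial: 18 months of data
edrp_elec = records(16_250, 1.5)
edrp_gas = records(9_364, 1.5)

# Post rollout: one year, domestic only
rollout_elec = records(25_000_000, 1)
rollout_gas = records(20_000_000, 1)

# Share of one post-rollout year represented by the whole EDRP dataset
edrp_share = (edrp_elec + edrp_gas) / (rollout_elec + rollout_gas)

# Average size of one csv record implied by the quoted EDRP totals
bytes_per_record = 40e9 / 670e6

print(f"records per register per year: {records_per_register_year:,}")
print(f"EDRP electricity: {edrp_elec / 1e6:.0f}m, gas: {edrp_gas / 1e6:.0f}m")
print(f"Rollout electricity: {rollout_elec / 1e9:.0f}bn, gas: {rollout_gas / 1e9:.0f}bn")
print(f"EDRP share of a rollout year: {edrp_share:.2%}")
print(f"implied bytes per record: {bytes_per_record:.0f}")
```

The results land close to the rounded figures on the slide, which suggests the slide totals are straightforward multiples of meters and half-hours.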
Introduction
What might we use it for?
Improve existing processes
• Settlement
• Billing, reconciliation, audit
• Demand profiling
• Customer profiling & segmentation
New processes not possible without HH data at scale
• Localised prediction
• Distribution network planning and modelling
• Automated DSM – prediction and verification
• System state detection
• Individualised consumer energy services
Introduction
What are the essential processes?
Ingestion – getting the data into the system
Storage – keeping it there securely
Analysis and reporting
• Ad-hoc queries
• Transaction reports
• Descriptives and summaries (e.g. OLAP)
• Mining and modelling
• Visualisation
Data management & processing
More fundamentally
Moving data between storage, memory and CPU
Transforming it in the CPU into desired forms
There are physical constraints on the speed of this.
(These are relevant at the scale of smart meter datasets).
Data management & processing
Single machine RDBMS
• CPU: ~2.5 GHz
• Memory: ~10s of GB per machine
• Storage: ~1 TB per disk
• Storage to memory: ~100 MB/s
• Memory to CPU: ~1,000 MB/s
Using SQL Server to sum half-hourly consumption:
• 4 bn records: ~1 hour
• 40 bn records: ~10 hours
• 1 year's worth: ~200 hours
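These timings are consistent with a roughly constant scan rate. The sketch below assumes run time grows linearly with record count, and takes the 800 billion records per year from the earlier data slide; it is an extrapolation, not a measurement.

```python
# Assumption: run time grows linearly with record count (constant rows/second).
SECONDS_PER_HOUR = 3600

measured_records = 4e9   # from the slide: 4 bn records ...
measured_hours = 1.0     # ... summed in about 1 hour

rows_per_second = measured_records / (measured_hours * SECONDS_PER_HOUR)


def estimated_hours(n_records: float) -> float:
    """Estimated run time in hours at the measured scan rate."""
    return n_records / rows_per_second / SECONDS_PER_HOUR


year_of_records = 800e9  # post-rollout annual volume quoted earlier

print(f"implied rate: {rows_per_second:,.0f} rows/second")
print(f"40 bn records: {estimated_hours(40e9):.0f} hours")
print(f"one year (800 bn): {estimated_hours(year_of_records):.0f} hours")
```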
Problem: the throughput of a single machine has not kept up with the growth in the size of datasets.
Solution: harness multiple individual machines (‘horizontal scaling’).
Problem: this is difficult and expensive using traditional relational database applications.
Data management & processing
Solution
Move away from traditional databases and use a purpose-designed (‘big data’) framework to get horizontal scaling:
• 1 machine (~£10k): 2.5 GHz CPU, 1 GB/s memory to CPU, 100 MB/s storage to memory: ~ a week
• 10 node cluster (~£50k): 25 GHz CPU, 10 GB/s memory to CPU, 1 GB/s storage to memory: ~ a day
• 100 node cluster (~£300k): 250 GHz CPU, 100 GB/s memory to CPU, 10 GB/s storage to memory: ~ an hour
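The indicative times above are roughly what ideal linear scaling would give. The sketch below assumes perfect speed-up (N nodes finish in 1/N of the time) and starts from the ~200 hour single-machine estimate for one year's data; real clusters carry coordination overhead, so treat the output as an order-of-magnitude guide only.

```python
# Assumption: perfect linear speed-up, i.e. N nodes finish in 1/N of the time.
SINGLE_MACHINE_HOURS = 200  # one year's data on one machine, from the earlier slide

# (nodes, approximate hardware cost in GBP) as quoted on the slide
configurations = [(1, 10_000), (10, 50_000), (100, 300_000)]


def ideal_hours(nodes: int) -> float:
    """Run time under ideal linear scaling."""
    return SINGLE_MACHINE_HOURS / nodes


for nodes, cost in configurations:
    hours = ideal_hours(nodes)
    print(f"{nodes:>3} node(s), ~GBP {cost:,}: {hours:6.1f} h "
          f"({hours / 24:.2f} days), GBP {cost / nodes:,.0f} per node")
```

Under these assumptions the quoted hardware cost per node falls as the cluster grows, while run time drops from about a week to a few hours.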
Data management & processing
Hadoop
Designed to solve the problem of exponentially growing data volumes (originally, Google’s searchable copy of the web).
Harness a large number of commodity machines and low cost networking and storage.
Software takes a job (query, calculation, whatever) and ‘maps’ it out across the cluster.
In parallel each node locally processes a subset of the problem, before the results are ‘reduced’ back to a single dataset.
(Hence ‘Map/Reduce’)
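The map and reduce steps can be illustrated in plain Python. This is a simplified, sequential sketch of the idea and does not use the Hadoop API; the record layout and function names are hypothetical. For brevity the map step here also pre-aggregates locally, which in Hadoop would be the job of a combiner.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, float]  # (meter_id, timestamp, kWh): hypothetical layout


def map_phase(partition: Iterable[Record]) -> Dict[str, float]:
    """Runs on one node: sum consumption per meter for the local subset."""
    local: Dict[str, float] = defaultdict(float)
    for meter_id, _timestamp, kwh in partition:
        local[meter_id] += kwh
    return dict(local)


def reduce_phase(partials: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Combine the per-node partial sums into a single result."""
    totals: Dict[str, float] = defaultdict(float)
    for partial in partials:
        for meter_id, subtotal in partial.items():
            totals[meter_id] += subtotal
    return dict(totals)


# Made-up readings, split into two partitions as if held on two nodes.
partitions: List[List[Record]] = [
    [("m1", "2013-07-09 00:00", 0.25), ("m2", "2013-07-09 00:00", 0.5)],
    [("m1", "2013-07-09 00:30", 0.75), ("m2", "2013-07-09 00:30", 0.25)],
]

# On a real cluster the map calls would run in parallel, one per node.
totals = reduce_phase(map_phase(p) for p in partitions)
print(totals)
```

Because each partition is summed independently, adding nodes shortens the map step without changing the reduce logic.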
Data management & processing
Experiments: SQL Server
Single high performance machine: bottlenecked by the speed of the hard drive
[Figure: Aggregation query performance versus dataset size. Series: SQL rows/second. Horizontal axis: dataset size, 0 to 6,000,000,000 records. Vertical axis: labelled "Runtime in seconds", 0 to 2,500,000. Annotation: ~400GB.]
Data management & processing
Experiments: Hadoop
11 node physical cluster (~£50k hardware cost)
[Figure: Aggregation query performance versus dataset size. Series: SMASH rows per second vs dataset size. Horizontal axis: dataset size, 0 to 40,000,000,000 records. Vertical axis: labelled "Runtime in seconds", 0 to 3,500,000. Annotation: ~2,500GB.]
Data management & processing
Experiments compared
Not straightforward to get SQL Server to run over ~10Bn records.
[Figure: Aggregation query performance versus dataset size. Series: SMASH rows per second vs dataset size; SQL rows/second. Horizontal axis: dataset size, 0 to 40,000,000,000 records. Vertical axis: labelled "Runtime in seconds", 0 to 3,500,000. Annotation: ~2,500GB.]
Data management & processing
Experiments: growing the cluster
Fixed dataset size of 500m records
[Figure: Aggregation query performance versus cluster size. Series: SMASH speed in records per second vs cluster size. Horizontal axis: cluster size, 1 to 11 nodes. Vertical axis: rows per second, 0 to 1,200,000. Annotation: R² = 0.9148.]
Data management & processing
Hadoop
Pros
• Open source software – free and customisable
• Adjustable data redundancy (data is replicated over the cluster)
• Incrementally scalable – on both performance and cost measures: just add machines, system adapts automatically.
• Responsive and cooperative developer community
Cons
• Not the last word in user-friendliness (but this is changing)
• Sledgehammer to crack a nut below a certain scale
• Less mature (but rapidly developing) software ecosystem
• Algorithms must fit the framework
Conclusion: low cost option for smart meter data processing
Data mining and visualisation
Finding value in the data
Improve existing processes
• Settlement
• Billing, reconciliation, audit
• Demand profiling
• Customer profiling & segmentation
New processes not possible without HH data at scale
• Localised prediction
• Distribution network planning and modelling
• Automated DSM – prediction and verification
• System state detection
• Individualised consumer energy services
Data mining and visualisation
Finding value in the data
Collaborative approach with industry partners to identify business needs
Focus on:
(1) Datamining for subgroup discovery – classifying end users
(2) Cluster analysis on demand data – finding profiles
(3) Innovative visualisation of consumption data and datamining results
Data mining and visualisation
Subgroup discovery
“Pattern features”: 14 variables describing each household
• Income, geography, access to gas, size of house, value of house etc.
“Target features”: describe the behaviour of interest
• Profile error: how different is usage from the assigned profile?
Outputs:
• Groups of households with significantly different profile errors
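A minimal sketch of the scoring step in subgroup discovery: group households by one pattern feature and compare each group's mean profile error with the overall mean. The data, field names and the `subgroup_means` helper are all made up for illustration. A full subgroup discovery method would also search over combinations of features and test whether the differences are statistically significant, which this sketch does not do.

```python
from collections import defaultdict
from statistics import mean

# Made-up households: one pattern feature and a profile error (percent).
households = [
    {"access_to_gas": True, "profile_error": 10.0},
    {"access_to_gas": True, "profile_error": 14.0},
    {"access_to_gas": False, "profile_error": 30.0},
    {"access_to_gas": False, "profile_error": 26.0},
]


def subgroup_means(rows, feature, target="profile_error"):
    """Mean of the target feature for each value of one pattern feature."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    return {value: mean(errors) for value, errors in groups.items()}


overall = mean(h["profile_error"] for h in households)
by_gas = subgroup_means(households, "access_to_gas")

for value, group_mean in by_gas.items():
    print(f"access_to_gas={value}: mean error {group_mean:.1f} "
          f"(overall {overall:.1f}, difference {group_mean - overall:+.1f})")
```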
Data mining and visualisation
Subgroup discovery
Looking at % annual profile error against sociodemographics
Data mining and visualisation
Clustering
Can we use demand data to create better profiles?
Define target features: waveform’s properties of interest
Two examples: using imposed and emergent properties.
Each using 3 clusters.
Data mining and visualisation
Clustering
E.g. 1: the average weekday as 5 pairs of numbers:
Data mining and visualisation
Clustering
E.g. 2: Frequency spectrum of the demand timeseries
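One way to obtain a frequency spectrum from a half-hourly series is a discrete Fourier transform. The slides do not say how the project computed it, so the sketch below is only an illustration: it uses a plain-Python DFT on a synthetic two-day series, and the resulting magnitudes are the kind of values that could be passed to a clustering algorithm as features.

```python
import cmath
import math
from typing import List


def magnitude_spectrum(series: List[float]) -> List[float]:
    """Magnitudes of the discrete Fourier transform, frequencies 0..N/2."""
    n = len(series)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(series))
        spectrum.append(abs(coeff) / n)
    return spectrum


# Synthetic two-day half-hourly series: a base load plus one cycle per day.
READINGS_PER_DAY = 48
series = [0.5 + 0.25 * math.cos(2 * math.pi * t / READINGS_PER_DAY)
          for t in range(2 * READINGS_PER_DAY)]

spectrum = magnitude_spectrum(series)

# Index of the strongest component, ignoring the constant (k = 0) term
dominant = max(range(1, len(spectrum)), key=lambda k: spectrum[k])

print(f"mean level: {spectrum[0]:.2f}")
print(f"dominant frequency: {dominant} cycles over "
      f"{len(series) // READINGS_PER_DAY} days")
```

The direct DFT is quadratic in the series length; for year-long series a fast Fourier transform from a numerical library would be the practical choice.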
Data mining and visualisation
Cluster analysis
Project competition results (the University won)
[Figure: Average % difference from the cluster centroid. Axis range: 0.25 to 0.35.]
Data mining and visualisation
Conclusions from datamining
Subgroup discovery results suggest the approach is useful as long as you have metadata on the households
Cluster analysis work suggests it is possible to improve on the standard profile classes using SM data
Further work needs to be carried out on more representative datasets
There are many other potential applications!
The SMASH application
Web application
Installation of Hadoop on UoB and CSE clusters
• 11 node physical cluster at the university (£50k)
• 8 node virtual cluster at CSE (£15k)
Integration of a range of Hadoop-friendly data management components
Development of a proof-of-concept web application for user interaction, job management, visualisation etc.
Deployment on both clusters
The SMASH application
Web application
Currently running on the CSE virtual Hadoop cluster
Generating Value from SM Data
Where next?
We have a proof-of-concept system developed with TSB R&D funding support.
We have mastered the underlying technologies and established that this approach has the potential to be a low-cost solution to a number of industry data challenges.
On a technical level the next steps are to:
• Further develop the web application
• Refine the datamining algorithms (with more data)
• Implement selected DM algorithms directly on the cluster
On a policy/programme level we want to ensure this knowledge is incorporated into SM rollout infrastructure decision making.
Questions and discussion
@cse_bristol #SmartMeterData
Contacts:
Simon Roberts
Joshua Thumim
Web: www.cse.org.uk Sign up to our monthly e-news through our website
Follow us on Twitter @cse_bristol