© 2015 IBM Corporation
How Spark Enables the Internet of Things: Efficient Integration of Multiple Spark Components for Smart City Use Cases
Paula Ta-Shma, IBM [email protected]
Joint work with:
Adnan Akbar, University of Surrey
Michael Factor, IBM Research
Guy Hadash, IBM Research
Juan Sancho, ATOS
The Evolution of Data Collection
Internet of Things
IoT: The Biggest Big Data
– The IoT market will grow to $1.7 trillion in 2020 (IDC)
– By 2020 the number of networked devices will be 30 billion (IDC), more than 4 times the entire global population
[Figure: Global Data Volume in Exabytes for 2005, 2012 and 2017, by source: Sensors (Internet of Things), VoIP, Social Media (video, audio and text), Enterprise Data]
EMT Madrid Bus Company Needs to Make Decisions According to Current and Predicted Future Traffic State
The Problem
– EMT needs to staff control rooms where employees manually analyze Madrid traffic sensor output. This can be slow and costly.
Objective
– Improve customer satisfaction and reduce costs by responding more efficiently and quickly to real-time traffic problems
Approach
– Monitor data from up to 3000 sensors. React by rerouting buses, modifying traffic lights, etc., based upon knowledge derived from historical data
[Figure: Today vs. Tomorrow]
Generic IoT Architecture – Data Flow
1. Collect historical time series data
– Collect data from devices
– Aggregate into objects
– Index and/or partition
[Diagram components: IoT, Secor, Swift]
Generic IoT Architecture – Data Flow
2. Learn patterns in data
– May be time/location dependent
– Generate thresholds, classifiers etc.
[Diagram components: Secor, Swift]
Generic IoT Architecture – Data Flow
3. Apply what was learned on real time data stream
– Take action
[Diagram components: IoT, Secor, CEP, Swift]
How Spark Enables the Internet of Things: Efficient Integration of Multiple Spark Components for Smart City Use Cases
Generic IoT Architecture – Data Flow
[Diagram components: IoT, CEP, Secor, Swift]
Green Flows: Real time
Purple Flows: Batch
IoT Architecture – Madrid Traffic – Ingestion Flow
Aim: Collect historical time series data for analysis
– Continuously collect data from up to 3000 Madrid council traffic sensors via web service
- Data includes traffic speeds and intensities, updated every 5 mins
– Push the messages to Kafka
– Use Secor to aggregate multiple messages into a single Swift object
- According to policy, e.g., every 60 mins
- Possibly partition the data, e.g., according to date
- Convert to Parquet format
- Annotate with metadata, e.g., min/max speed, start/end time
– Index Swift objects according to their metadata using ElasticSearch
[Diagram components: IoT, Secor, Swift]
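The aggregation and annotation step can be sketched in plain Python. This is only an illustration of the idea, not Secor's implementation: the `Reading` fields, their units, and the `aggregate` helper are hypothetical names introduced here, and the result is an in-memory dictionary rather than a Parquet object in Swift.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Reading:
    """One sensor message. Field names and units are illustrative only."""
    sensor_id: str
    timestamp: int     # seconds since some epoch
    speed: float
    intensity: float


def aggregate(readings: List[Reading], window_secs: int = 3600) -> List[Dict]:
    """Group readings into fixed time windows and annotate each group
    with min/max speed and start/end time."""
    windows: Dict[int, List[Reading]] = {}
    for r in readings:
        windows.setdefault(r.timestamp // window_secs, []).append(r)

    objects = []
    for key in sorted(windows):
        batch = windows[key]
        objects.append({
            "records": batch,
            "metadata": {
                "min_speed": min(r.speed for r in batch),
                "max_speed": max(r.speed for r in batch),
                "start_time": min(r.timestamp for r in batch),
                "end_time": max(r.timestamp for r in batch),
            },
        })
    return objects


readings = [
    Reading("s1", 0, 42.0, 300.0),
    Reading("s2", 300, 55.5, 120.0),
    Reading("s1", 3600, 18.0, 900.0),
]
objects = aggregate(readings)  # two objects, one per 60-minute window
```

The metadata dictionary is what an index would later be built on, so that objects can be selected without reading their contents.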
IoT Architecture – Madrid Traffic – Data Access
Aim: Access data efficiently and cost effectively
– Store IoT data in OpenStack Swift object storage
- Open source, low cost deployment, and highly scalable
– Parquet data is accessible via Spark SQL
– Optimized predicate pushdown
- Custom Spark SQL external data source driver
- Uses object metadata indexes
- Searches for Swift objects whose min/max values overlap requested ranges
Get all data for morning traffic:
SELECT codigo, intensidad, velocidad FROM madridtraffic WHERE tf >= '08:00:00' AND tf <= '12:00:00'
Brute force method: 13245 Swift requests
Optimized predicate pushdown: 616 Swift requests
21.5 times improvement
[Diagram component: Swift]
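The min/max overlap test behind this kind of pruning can be sketched as follows. This is a plain-Python illustration, not the custom Spark SQL driver: the index layout, the object names, and the `min_tf`/`max_tf` keys are assumptions made up for the example.

```python
from typing import Dict, List


def overlaps(obj_min: str, obj_max: str, lo: str, hi: str) -> bool:
    """True if the object's [min, max] range intersects the requested [lo, hi]."""
    return obj_min <= hi and obj_max >= lo


def select_objects(index: List[Dict[str, str]], lo: str, hi: str) -> List[str]:
    """Return only the names of objects that may contain matching rows."""
    return [o["name"] for o in index if overlaps(o["min_tf"], o["max_tf"], lo, hi)]


# Hypothetical metadata index: one entry per stored object.
# Zero-padded HH:MM:SS strings compare correctly as plain strings.
index = [
    {"name": "obj-0", "min_tf": "00:00:00", "max_tf": "05:59:59"},
    {"name": "obj-1", "min_tf": "06:00:00", "max_tf": "08:59:59"},
    {"name": "obj-2", "min_tf": "09:00:00", "max_tf": "11:59:59"},
    {"name": "obj-3", "min_tf": "12:00:01", "max_tf": "23:59:59"},
]

# Same range as the morning-traffic query: tf between 08:00:00 and 12:00:00.
selected = select_objects(index, "08:00:00", "12:00:00")
```

Only the selected objects need to be fetched; the remaining ones are skipped without issuing a request for them, which is where the reduction in Swift requests comes from.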
IoT Architecture – Madrid Traffic – Machine Learning
Aim: Learn to differentiate between ‘good’ and ‘bad’ traffic
– Depends on context
- Time (morning/evening), Day (weekday/weekend)
- Location
– Use Spark MLlib k-means clustering
– Produce threshold values for real-time decision making
– Re-run algorithm when quality of clusters decreases
- Can use silhouette index to measure quality
[Diagram component: Swift]
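As an illustration of the quality measure, the mean silhouette index can be computed directly from its standard definition. The sketch below is plain Python over small in-memory lists, not a Spark job; the `silhouette` function and the sample points are introduced here for the example only.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def silhouette(points: Sequence[Point], labels: Sequence[int]) -> float:
    """Mean silhouette index: values near 1 indicate tight, well separated
    clusters; values near or below 0 indicate poor clustering."""
    clusters: Dict[int, List[Point]] = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items() if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette(points, [0, 0, 1, 1])   # clusters match the data
bad = silhouette(points, [0, 1, 0, 1])    # clusters cut across the data
```

A drop in this score over time would be one possible trigger for re-running the clustering.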
IoT Architecture – Madrid Traffic – Machine Learning
Event Detection:
• Use Spark MLlib k-means clustering to separate data into 2 clusters
• Find the midpoint between the 2 cluster centres
• Use this midpoint to generate the thresholds
• Repeat for each context e.g. time period (morning, afternoon, evening, night)
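The event detection steps above can be sketched on one-dimensional speed data. This is a plain-Python stand-in for Spark MLlib k-means (Lloyd's algorithm with k = 2, initialised at the minimum and maximum values); the function names, context names and sample speeds are invented for illustration.

```python
from typing import Dict, List, Tuple


def two_means_1d(values: List[float], iterations: int = 100) -> Tuple[float, float]:
    """k-means with k=2 on scalar values; returns the two cluster centres."""
    c0, c1 = min(values), max(values)
    for _ in range(iterations):
        g0 = [v for v in values if abs(v - c0) <= abs(v - c1)]
        g1 = [v for v in values if abs(v - c0) > abs(v - c1)]
        if not g1:          # all values identical, nothing to separate
            break
        n0, n1 = sum(g0) / len(g0), sum(g1) / len(g1)
        if (n0, n1) == (c0, c1):   # centres stopped moving
            break
        c0, c1 = n0, n1
    return c0, c1


def threshold(values: List[float]) -> float:
    """Midpoint between the two cluster centres."""
    c0, c1 = two_means_1d(values)
    return (c0 + c1) / 2


# Hypothetical speed samples, one list per context.
speeds_by_context: Dict[str, List[float]] = {
    "morning": [12.0, 15.0, 14.0, 48.0, 52.0, 50.0],
    "night": [40.0, 42.0, 41.0, 70.0, 72.0, 71.0],
}
thresholds = {ctx: threshold(v) for ctx, v in speeds_by_context.items()}
```

Each context ends up with its own threshold, so the same speed can count as 'bad' traffic at night and 'good' traffic in the morning.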
Anomaly Detection:
• Use a single cluster and define an anomaly to be further than a certain distance from the cluster centre
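A minimal sketch of the single-cluster approach, assuming Euclidean distance and a cut-off chosen by the user. The `max_distance` value, the (speed, intensity) feature pairs and the function names are hypothetical; in practice the features would probably need scaling so that neither dominates the distance.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def centre(points: Sequence[Point]) -> Point:
    """Centre of a single cluster: the per-dimension mean of all points."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def is_anomaly(point: Point, cluster_centre: Point, max_distance: float) -> bool:
    """A point is anomalous if it lies further than max_distance from the centre."""
    return math.dist(point, cluster_centre) > max_distance


# Hypothetical historical (speed, intensity) readings.
history = [(50.0, 300.0), (52.0, 310.0), (48.0, 290.0), (50.0, 300.0)]
c = centre(history)

normal = is_anomaly((51.0, 305.0), c, max_distance=25.0)
outlier = is_anomaly((10.0, 900.0), c, max_distance=25.0)
```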
[Figure: Morning Traffic on Weekdays]
IoT Architecture – Madrid Traffic – Real Time Decision Making
Aim: Respond in real time to traffic conditions
– Use Complex Event Processing (CEP) approach
- Rule based
- Process events record by record
- CEP rules are typically defined manually but in many cases it is difficult to get them right
- We automate this process and make it smart
- uCEP has a small footprint, can be run at the edge
[Diagram components: CEP, IoT]
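To make the rule-based, record-by-record style concrete, here is a small plain-Python sketch. It is not the uCEP engine or its rule language: the rule itself (speed below a per-context threshold for a number of consecutive records from the same sensor), the event fields and the threshold values are all assumptions chosen for illustration.

```python
from typing import Dict, Iterable, Iterator


def detect_congestion(
    events: Iterable[Dict],
    thresholds: Dict[str, float],
    consecutive: int = 2,
) -> Iterator[Dict]:
    """Process events one at a time and emit an alert when a sensor reports
    a speed below its context threshold `consecutive` times in a row."""
    streak: Dict[str, int] = {}
    for e in events:
        below = e["speed"] < thresholds[e["context"]]
        streak[e["sensor"]] = streak.get(e["sensor"], 0) + 1 if below else 0
        if streak[e["sensor"]] == consecutive:
            yield {"type": "congestion", "sensor": e["sensor"], "speed": e["speed"]}


# Hypothetical learned thresholds, one per context.
thresholds = {"morning": 31.8, "night": 56.0}

events = [
    {"sensor": "s1", "context": "morning", "speed": 45.0},
    {"sensor": "s1", "context": "morning", "speed": 28.0},
    {"sensor": "s2", "context": "morning", "speed": 20.0},
    {"sensor": "s1", "context": "morning", "speed": 25.0},
    {"sensor": "s2", "context": "morning", "speed": 40.0},
]
alerts = list(detect_congestion(events, thresholds))
```

Because the thresholds are passed in as data, they can be replaced whenever the learning step produces new values, without rewriting the rule.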
Work in Progress
Proactive approach:
• Use Spark streaming linear regression to predict traffic behavior (e.g. speed, intensity) for near future
• Apply CEP on predicted data
• Respond proactively to predicted events such as traffic congestion
– e.g. EMT can proactively re-route buses
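The proactive idea can be sketched with a small incrementally updated regression. Note the substitution: Spark's streaming linear regression is trained with stochastic gradient descent, whereas this plain-Python stand-in keeps running sums and solves the one-variable least-squares fit exactly. The class name, the sample speeds and the threshold of 40.0 are hypothetical.

```python
class OnlineLinearRegression:
    """Least-squares fit of y = intercept + slope * x, updated one record at a time."""

    def __init__(self) -> None:
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, x: float, y: float) -> None:
        """Fold one new observation into the running sums."""
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def predict(self, x: float) -> float:
        """Predict y at x from everything seen so far."""
        denom = self.n * self.sxx - self.sx ** 2
        if self.n < 2 or denom == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept + slope * x


model = OnlineLinearRegression()
# Hypothetical speeds from one sensor at time steps 0..4, falling steadily.
for t, speed in enumerate([60.0, 55.0, 50.0, 45.0, 40.0]):
    model.update(float(t), speed)

predicted = model.predict(5.0)           # speed expected at the next time step
congestion_expected = predicted < 40.0   # rule applied to the prediction
```

Applying the rule to `predicted` rather than to the latest reading is what allows a response before the congestion is actually observed.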
Demo
Our Architecture Applies to Many IoT Use Cases
Energy/utilities
– Anomaly detection
- Pipe leakage
- Appliance malfunction
– Occupancy detection
Healthcare
– Healthcare patient monitoring/alert/response
Insurance
– Driver behavior and location monitoring
Transportation
– Connected vehicles, engine diagnostics, automated service scheduling
Logistics
– Goods tracking, sensitive goods management
The Madrid Traffic Use Case on IBM Bluemix
[Diagram labels: Data Sources, Madrid Traffic Sensors, Node-RED, MQTT, Message Bus, Secor, Data Storage, Object Storage, Data Analytics, Apache Spark, Data Visualization, Freeboard Dashboard]
Joint work with Naeem Altaf and team
Thank You!
Backup
COSMOS
Funding: EU FP7 at level of 2PY x 3 years
Started: Sept 2013
Coordinator: ATOS
Technical partners: IBM, NTUA, Univ Surrey, Siemens, ATOS
Use Case Partners: Hildebrand/Camden, EMT Madrid Bus Transport/Madrid Council, III Taiwan – Smart Cities use cases
Project Vision: Enable ‘things’ to interact with each other based on shared experience, trust, reputation etc.
IBM Bluemix Data Analytics for IoT Architecture
Kafka + Secor
What is it?
– Apache Kafka is a high throughput distributed publish/subscribe messaging system.
– Secor is an open source tool developed by Pinterest, which aggregates Kafka messages and saves them as an S3 object.
What extensions were needed?
– Support for OpenStack Swift as a Secor target. We also added support for Parquet format and for annotating objects with metadata to support search and indexing.
What is the value of integration with Swift?
– Enables bringing new data and applications to Swift, which is an open source solution. Parquet and metadata search enable improved performance for batch analytics.
Status
– We contributed OpenStack Swift support to the Secor community and it is now part of Secor.
Parquet
What is it?
– A column-based, semi-structured, schema-based storage format supported by Hadoop and Spark. Enables column-wise compression and projection pushdown.
What integration is needed?
– Since Swift is now part of the Hadoop ecosystem, no additional integration is needed. Data in Swift can be stored in Apache Parquet format, inheriting associated advantages.
Status
– Spark SQL supports storing tabular data in Parquet format in Hadoop compatible storage systems such as Swift.
elasticsearch
What is it?
– A distributed, scalable, real-time search and analytics engine, built on Apache Lucene.
What integration is needed?
– Index object metadata allowing search for objects by attributes.
What is the value of integration with Swift?
– Use search to select objects for further processing, e.g., relevant objects for analytics.
- Note that S3 does not yet have native search according to metadata.
Status
– The IBM SoftLayer object service includes a basic implementation of metadata search; at IBM Research, we added extensions such as data type support and range searches.
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
For up-to-date information and news about Spark and the Spark Technology Center, sign up for our newsletter at www.spark.tc