Introduction to WSO2 Data Analytics Platform
An Introduction to the WSO2 Analytics Platform
Srinath Perera, VP Research, WSO2; Apache Member (@srinath_perera)
Collect Data
• One Sensor API to publish events: REST, Thrift, Java, JMS, Kafka
  – Java clients, JavaScript clients*
• First you define streams (think of a stream as an infinite table in an SQL DB)
• Then publish events via the Sensor API
• “Publish once, process any way you like”
Collecting Data: Example
• Java example: create and send events
• Events are sent asynchronously
• See the client given in http://goo.gl/vIJzqc for more info

// Initialize stream
Agent agent = new Agent(agentConfiguration);
AsyncDataPublisher publisher = new AsyncDataPublisher("tcp://hostname:7612", .. );

// Define stream
StreamDefinition definition = new StreamDefinition(STREAM_NAME, VERSION);
definition.addPayloadData("sid", STRING);
...
publisher.addStreamDefinition(definition);
...

// Send events
Event event = new Event();
event.setPayloadData(eventData);
publisher.publish(STREAM_NAME, VERSION, event);
Data Collection Examples
• Collect data from inbuilt agents in WSO2 products, Tomcat, etc.
• Collect your log data via Logstash
• Collect JVM and JMX stats via an agent
• Ingest data from message queues such as JMS or Kafka
• Pull data from an RSS feed, or scrape a web page
• Write a custom agent to collect data from your system and push it to DAS
Photo credit http://www.torange.us/ CC license
Analysis: Batch Analytics
• Batch analytics reads data from disk (or some other storage) and processes it record by record
• “MapReduce” is the most widely used technology for batch analytics
  – Apache Hadoop
  – Apache Spark (30X faster and much more flexible)
• Analytics (min, max, average, correlation, histograms; might join or group data in many ways)
• Key Performance Indicators (KPIs)
  – E.g. profit per square foot for retail
• Presented as a dashboard
SQL-like Queries: Spark SQL
• Since many people understand SQL, Hive made large-scale data processing (Big Data) accessible to many
• Expressive, short, and sweet
• Defines core operations that cover 90% of problems
• Lets experts dig in when they like (via User Defined Functions)

Spark SQL Query
insert overwrite table BusSpeed
select getHour(ts) as hour, avg(v) as avgV, busID
from BusStream
group by busID, getHour(ts);
Count entries where the username is not empty, grouped by username and ordered by the count (top 10)
SELECT username, COUNT(*) AS count
FROM wikiData
WHERE username <> ''
GROUP BY username
ORDER BY count DESC
LIMIT 10
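To make the semantics of the query above concrete, here is a minimal in-memory sketch of the same filter, group, count, order and limit steps in plain Java. It is not how Spark executes the query; the class name TopUsernames and the tie-break by name are assumptions added for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of any WSO2 or Spark API.
class TopUsernames {
    // Counts non-empty usernames and returns the `limit` most frequent, highest count first.
    static Map<String, Long> topUsernames(List<String> usernames, int limit) {
        Map<String, Long> counts = usernames.stream()
                .filter(u -> u != null && !u.isEmpty())                          // WHERE username <> ''
                .collect(Collectors.groupingBy(u -> u, Collectors.counting()));  // GROUP BY + COUNT(*)
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()    // ORDER BY count DESC
                        .thenComparing(Map.Entry.<String, Long>comparingByKey())) // assumed tie-break
                .limit(limit)                                                    // LIMIT n
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));                       // keep sorted order
    }

    public static void main(String[] args) {
        List<String> rows = List.of("alice", "", "bob", "alice", "carol", "bob", "alice");
        System.out.println(topUsernames(rows, 2));
    }
}
```

The LinkedHashMap preserves the sorted order, which plays the role of the ordered result set.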
Use Case: API Usage
• Looking at different API calls by country
• Designed to draw attention to which APIs are used and where
The Value of Some Insights Degrades Fast!
• For some use cases (e.g. stock markets, traffic, surveillance, patient monitoring), the value of insights degrades very quickly with time.
• We need technology that can produce outputs fast
  – Static queries, but need very fast output (alerts, realtime control)
  – Dynamic and interactive queries (data exploration)
Realtime Analytics: Complex Event Processing
CEP Queries 1
Calculate the average temperature over a 1-minute sliding window, grouped by roomNo

define stream TempStream (roomNo string, temp double);

from TempStream#window.time(1 min)
select roomNo, avg(temp) as avgTemp
group by roomNo
insert all events into AvgRoomTempStream;
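As a rough illustration of what a time-based sliding window with a group-by computes, the sketch below keeps the readings from the last window in a queue and maintains a running sum and count per room. It is a simplified stand-in, not the CEP engine's implementation; the class SlidingRoomAverage, the explicit millisecond timestamps, and the assumption that events arrive in time order are all introduced here for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of avg(temp) over a sliding time window, grouped by roomNo.
class SlidingRoomAverage {
    private static final class Reading {
        final long timestampMs;
        final String roomNo;
        final double temp;

        Reading(long timestampMs, String roomNo, double temp) {
            this.timestampMs = timestampMs;
            this.roomNo = roomNo;
            this.temp = temp;
        }
    }

    private final long windowMs;
    private final Deque<Reading> window = new ArrayDeque<>();
    private final Map<String, double[]> stats = new HashMap<>(); // roomNo -> {sum, count}

    SlidingRoomAverage(long windowMs) {
        this.windowMs = windowMs;
    }

    // Adds one reading (timestamps assumed non-decreasing) and returns the
    // current windowed average for that reading's room.
    double onEvent(long timestampMs, String roomNo, double temp) {
        // Expire readings that have fallen out of the window.
        while (!window.isEmpty() && window.peekFirst().timestampMs <= timestampMs - windowMs) {
            Reading old = window.pollFirst();
            double[] s = stats.get(old.roomNo);
            s[0] -= old.temp;
            s[1] -= 1;
            if (s[1] == 0) {
                stats.remove(old.roomNo);
            }
        }
        window.addLast(new Reading(timestampMs, roomNo, temp));
        double[] s = stats.computeIfAbsent(roomNo, k -> new double[2]);
        s[0] += temp;
        s[1] += 1;
        return s[0] / s[1];
    }

    public static void main(String[] args) {
        SlidingRoomAverage avg = new SlidingRoomAverage(60_000); // 1 minute
        System.out.println(avg.onEvent(0, "A", 20.0));
        System.out.println(avg.onEvent(30_000, "A", 30.0));
        System.out.println(avg.onEvent(70_000, "A", 40.0));
    }
}
```

Keeping a running sum avoids rescanning the window on every event, at the cost of some floating-point drift over very long runs.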
CEP Queries 2
• Using data from a football game
• The kick stream shows kicks by players on the ball
• Ball possession is a hit by me, followed by any number of hits by me, followed by a hit by someone else

from every k1 = KickStream, KickStream[playerid == k1.playerid]*, KickStream[playerid != k1.playerid]
select ..
insert into BallPosessionStream;
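The possession pattern can be read as a small state machine over the kick sequence. The sketch below is a simplification under stated assumptions: it emits one result per maximal run of kicks by the same player (rather than one per starting event, as an `every` pattern may), and the class BallPossession and the "playerId:kickCount" output format are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: detect "kicks by one player, ended by a kick from someone else".
class BallPossession {
    // Returns "playerId:kickCount" for each run of kicks by one player
    // that is ended by a different player's kick.
    static List<String> possessions(List<String> kickerIds) {
        List<String> out = new ArrayList<>();
        String current = null;
        int run = 0;
        for (String id : kickerIds) {
            if (id.equals(current)) {
                run++;                              // same player keeps the ball
            } else {
                if (current != null) {
                    out.add(current + ":" + run);   // someone else kicked: possession ended
                }
                current = id;
                run = 1;
            }
        }
        return out; // the last run is still open, so it is not emitted
    }

    public static void main(String[] args) {
        System.out.println(possessions(List.of("p1", "p1", "p2", "p1", "p1", "p1")));
    }
}
```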
People Tracking via BLE
• Track people through BLE via triangulation
• Higher level logic via Complex Event Processing
• Traffic monitoring
• Smart retail
• Airport management
Realtime Soccer Analysis
Watch at: https://www.youtube.com/watch?v=nRI6buQ0NOM
Scaling CEP Queries on top of Storm
▪Accepts CEP queries with hints about how to partition streams
▪Partitions the streams, builds an Apache Storm topology running CEP nodes as Storm Spouts, and runs it. See http://goo.gl/pP3kdX for more info.
CEP Queries on Storm
• @dist(parallel='4') asks to run the query with 4 nodes
• Use a partition definition to break up the data so the queries can run in parallel

define partition on TempStream.region {
    @dist(parallel='4')
    from TempStream[temp > 33]
    insert into HighTempStream;
}

from HighTempStream#window(1h)
select max(temp) as max
insert into HourlyMaxTempStream;
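One way to picture partitioning by region across 4 parallel nodes is hash routing: every event with the same region goes to the same worker, so each worker can filter its share independently. The sketch below assumes hash-based assignment, which the slides do not specify; RegionPartitioner and its methods are hypothetical names, not WSO2 or Storm APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of key-based partitioning followed by a per-partition filter.
class RegionPartitioner {
    // Maps a region to a worker index in [0, parallelism); the same region always maps to the same worker.
    static int partitionFor(String region, int parallelism) {
        return Math.floorMod(region.hashCode(), parallelism);
    }

    // Routes readings to workers by region and keeps only temp > 33 (as in the HighTempStream filter).
    static Map<Integer, List<Double>> routeHighTemps(String[] regions, double[] temps, int parallelism) {
        Map<Integer, List<Double>> byWorker = new TreeMap<>();
        for (int i = 0; i < regions.length; i++) {
            if (temps[i] > 33) {
                byWorker.computeIfAbsent(partitionFor(regions[i], parallelism), k -> new ArrayList<>())
                        .add(temps[i]);
            }
        }
        return byWorker;
    }

    public static void main(String[] args) {
        String[] regions = {"north", "north", "south"};
        double[] temps = {35.0, 20.0, 40.0};
        System.out.println(routeHighTemps(regions, temps, 4));
    }
}
```

Math.floorMod is used instead of % so that negative hash codes still yield a valid worker index.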
Interactive Analytics
• The best way to explore data is by asking ad-hoc questions
• Interactive analytics (search) lets you query the system and receive fast results (<10s)
• Shows data in context (e.g. by grouping events from the same transaction together)
• Built using Lucene-based indexes

SparkSQL> SELECT * FROM TWITTER_DATA
Predictive Analytics
• Can you “write a program to drive a car?”
• Machine learning
  – Takes in a lot of examples and builds a program that matches those examples
  – We call that program a “model”
• Lots of tools
  – R (statistical language)
  – scikit-learn (Python)
  – Apache Spark’s MLBase and Apache Mahout (Java)
Predictive Analytics in DAS
• Building models
  – With the WSO2 Machine Learner product via a wizard (powered by MLlib)
  – Build models using R and export them as PMML
• Built models can be used with both WSO2 CEP and ESB
Using the Model

Within CEP
from InputStream#ml:predict('/../diabetes-model', 'double')
select *
insert into PredictionStream;

Within ESB
<predict>
    <model storage-location="../downloaded-ml-model"/>
    <features>
        <feature name="SI2" expression="$body/features/SI2"/>
        ..
    </features>
    <predictionOutput property="result"/>
</predict>
WSO2 Machine Learner
• Upload or select data
• Explore the data
• Train a machine learning model

WSO2 Machine Learner
• Compare results
• Understand why
• Iterate
Supported Algorithms
• Deep learning based classification (H2O’s Stacked Autoencoders Classifier)
• Classification algorithms: Decision Trees, Linear Regression, Lasso Regression, SVM, Naïve Bayes
• K-Means clustering for unsupervised learning on your data
• Anomaly detection using the K-Means algorithm to identify fraud, network penetration and other difficult scenarios
• Recommendations engine (Collaborative Filtering algorithm)
Predict Wait Time in the Airport
• Predicting the time to go through the airport
• Real-time updates and events to passengers
• Lets the airport manage by allocating resources
Predict Promising Customers
• A typical website can get millions of users
• Only a very small fraction converts
• For each user, we know what they access, where they work, country, browser, OS, etc.
• The problem is to predict which users will convert
• Used Logistic Regression, Random Forest, Survival Modeling, etc.
Predict Super Bowl
• Predicted 7 of the 11 games
• Done with the Random Forest algorithm
• Even the ones we missed are instructive
See Yuda’s post: Predicting the Super Bowl with Machine Learning
Anomaly Detection: Markov Models
• Can model the probability of a sequence
• Given a sequence, can predict its likelihood, and use that to detect anomalies.
• Implemented with WSO2 CEP
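The idea can be sketched with a first-order Markov chain: estimate transition probabilities from observed sequences, score a new sequence by multiplying its transition probabilities, and flag it when the score falls below a threshold. This is a minimal illustration, not the WSO2 CEP implementation; the class MarkovAnomaly, the lack of smoothing (unseen transitions get probability 0), and the threshold value are assumptions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical first-order Markov chain for scoring event sequences.
class MarkovAnomaly {
    private final Map<String, Map<String, Integer>> counts = new HashMap<>(); // from -> (to -> count)
    private final Map<String, Integer> totals = new HashMap<>();              // from -> total outgoing

    // Counts every adjacent (from, to) pair in a training sequence.
    void train(List<String> sequence) {
        for (int i = 0; i + 1 < sequence.size(); i++) {
            String from = sequence.get(i);
            String to = sequence.get(i + 1);
            counts.computeIfAbsent(from, k -> new HashMap<>()).merge(to, 1, Integer::sum);
            totals.merge(from, 1, Integer::sum);
        }
    }

    // Maximum-likelihood estimate; 0.0 for transitions never seen in training.
    double transitionProbability(String from, String to) {
        int total = totals.getOrDefault(from, 0);
        if (total == 0) {
            return 0.0;
        }
        return counts.get(from).getOrDefault(to, 0) / (double) total;
    }

    // Product of the transition probabilities along the sequence.
    double likelihood(List<String> sequence) {
        double p = 1.0;
        for (int i = 0; i + 1 < sequence.size(); i++) {
            p *= transitionProbability(sequence.get(i), sequence.get(i + 1));
        }
        return p;
    }

    boolean isAnomaly(List<String> sequence, double threshold) {
        return likelihood(sequence) < threshold;
    }

    public static void main(String[] args) {
        MarkovAnomaly model = new MarkovAnomaly();
        model.train(List.of("login", "browse", "pay", "logout"));
        model.train(List.of("login", "browse", "browse", "logout"));
        System.out.println(model.likelihood(List.of("login", "browse", "pay")));
        System.out.println(model.isAnomaly(List.of("login", "pay"), 0.01));
    }
}
```

For long sequences, summing log-probabilities instead of multiplying would avoid numerical underflow.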
Anomaly Detection: Clustering
• Use clustering to identify normal behavior as clusters
• Consider points away from all clusters as anomalies.
• A point is considered away from a cluster if it is outside the 99th percentile line for that cluster
• Included in WSO2 ML
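The percentile rule above can be sketched as follows, assuming the cluster centers have already been found (for example by K-Means, which is not shown). For each cluster, the threshold is the chosen percentile of the distances from its member points to the center; a new point is an anomaly only if it lies beyond the threshold of every cluster. The class ClusterAnomaly, Euclidean distance, and the nearest-rank percentile method are illustrative assumptions, not details of WSO2 ML.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of percentile-based anomaly detection on top of existing clusters.
class ClusterAnomaly {
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Index of the center closest to the point.
    static int nearest(double[] point, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (distance(point, centers[c]) < distance(point, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    // Per-cluster threshold: the given percentile (nearest-rank) of member-to-center distances.
    static double[] thresholds(double[][] points, double[][] centers, double percentile) {
        List<List<Double>> dists = new ArrayList<>();
        for (int c = 0; c < centers.length; c++) {
            dists.add(new ArrayList<>());
        }
        for (double[] p : points) {
            int c = nearest(p, centers);
            dists.get(c).add(distance(p, centers[c]));
        }
        double[] result = new double[centers.length];
        for (int c = 0; c < centers.length; c++) {
            List<Double> d = dists.get(c);
            if (d.isEmpty()) {
                continue; // empty cluster: threshold stays 0.0
            }
            Collections.sort(d);
            int rank = (int) Math.ceil(percentile / 100.0 * d.size());
            result[c] = d.get(Math.max(rank, 1) - 1);
        }
        return result;
    }

    // A point is anomalous only if it is outside the threshold of every cluster.
    static boolean isAnomaly(double[] point, double[][] centers, double[] thresholds) {
        for (int c = 0; c < centers.length; c++) {
            if (distance(point, centers[c]) <= thresholds[c]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        double[][] centers = {{0, 0}, {10, 10}};
        double[][] training = {{0, 1}, {1, 0}, {0, -1}, {10, 11}, {9, 10}};
        double[] t = thresholds(training, centers, 99);
        System.out.println(isAnomaly(new double[]{5, 5}, centers, t));
    }
}
```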
Communicate: Dashboards
• Dashboards give an “overall idea” at a glance (e.g. car dashboard)
  – Boring when everything is good!!
• Build your own dashboard.
  – WSO2 DAS supports a gadget generation wizard
  – You can write your own gadgets using D3 and JavaScript.
Gadget Generation Wizard
• Starts with data in tabular format
• Map each column to a dimension in your plot, like X, Y, color, point size, etc.
• Create a chart with a few clicks
Powered by the VizGrammar lib, which uses Vega underneath (see https://github.com/wso2/VizGrammar)
Communicate: Alerts
▪Done with CEP queries
▪Last mile
- Email, SMS
- Push notifications to a UI
- Pager
- Trigger a physical alarm
Real Life Use Cases
▪Cisco (OEM the platform with Cisco solutions, Health, Smart Parking)
▪Experian (Digital Marketing) - see video
▪Pacific Controls (Smart City Platform, vehicle tracking, building monitoring) - see video
▪Throttling and Anomaly Detection (by a group of telco companies)
▪API Analytics (13+ customers)
“No battle plan survives contact with the enemy”
-- Helmuth von Moltke
Key Differentiators
• Open source, under the Apache 2 license
• “Publish data once, analyze it any way you like” experience
• Flexible packaging or as a scalable cluster
• Rich, extensible, SQL-like configuration language
• Compact, easy-to-learn syntax addressing complex requirements, such as time windows, patterns, and sequences, which would be complex to develop in a programming language such as Java
• Rich set of data connectors, which can be easily extended
• Events only need to be published once from applications to the platform, and can be consumed by the batch or real-time pipeline
• Performance on a single node satisfies 90% of use cases
• Part of the overall WSO2 platform
More Information▪Introducing WSO2 Analytics Platform: Note for Architects, https://iwringer.wordpress.com/2015/03/18/introducing-wso2-analytics-platform-note-for-architects/
▪WSO2 Data Analytics Server, http://wso2.com/products/data-analytics-server/
▪WSO2 Complex Event Processor, http://wso2.com/products/complex-event-processor/
▪WSO2 Machine Learner, http://wso2.com/products/machine-learner/
Thank You