Quality-aware data analytics
Hong-Linh Truong
Faculty of Informatics, TU Wien
http://www.infosys.tuwien.ac.at/staff/truong
@linhsolar
Advanced Services Engineering (ASE), Summer 2018
What this lecture is about
Big Data analytics – general view
The meaning of quality-aware data analytics
Incident management for cloud-based big data
analytics systems
Concepts and approaches
Quality of analytics (QoA) for data analytics
Quality of data in data analytics workflows
Data elasticity management
What this lecture is about
After this lecture
Make sure that you can monitor incidents in your
systems
Apply and revise the analytics part in your project
Deal with quality of analytics and see how you could
offer quality-aware analytics in your project
Big Data
Data: facts, responses, events, measurement, etc.
{"station_id":"1160629000","datapoint_id":122,"alarm_id":310,"event_time":"2016-09-17T02:05:54.000Z","isActive":false,"value":6,"valueThreshold":10}
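As a minimal sketch (not part of the original slides), the sample record above can be read with Python's standard library. How the fields are interpreted, in particular that a value above `valueThreshold` is what makes an alarm interesting, is an assumption made only for illustration.

```python
import json
from datetime import datetime

# The sample record from the slide, joined back into one JSON document.
raw = ('{"station_id":"1160629000","datapoint_id":122,"alarm_id":310,'
       '"event_time":"2016-09-17T02:05:54.000Z","isActive":false,'
       '"value":6,"valueThreshold":10}')

record = json.loads(raw)

# Parse the ISO 8601 timestamp; the trailing "Z" marks UTC.
event_time = datetime.strptime(record["event_time"], "%Y-%m-%dT%H:%M:%S.%f%z")

# Assumption: an alarm is of interest when the value exceeds its threshold.
exceeds = record["value"] > record["valueThreshold"]
print(record["station_id"], event_time.isoformat(), exceeds)
```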
What does “Big Data” mean?
Big Data
Sources
Internet of Things (IoT), human participation, social
networks, software services, environment monitoring,
advanced science instruments, science discovery,
etc.
Several challenges in terms of data gathering,
integration, and analytics
H. V. Jagadish, Johannes Gehrke, Alexandros Labrinidis, Yannis Papakonstantinou, Jignesh M. Patel, Raghu Ramakrishnan,
and Cyrus Shahabi. 2014. Big data and its technical challenges. Commun. ACM 57, 7 (July 2014), 86-94.
DOI=10.1145/2611567 http://doi.acm.org/10.1145/2611567
Characterize big data
Big data is often characterized by the concepts of V*: Volume, Variety, Velocity, Veracity and Valence
Volume: size (big size, large data sets, massive amounts of small data)
Variety: complexity (formats, types of data)
Velocity: speed (generating speed, data movement speed)
Veracity: quality varies widely (bias, accuracy, etc.)
Valence: potential relationships among different types of data w.r.t. data combination
Data Management/Delivery
Systems
Static data – data at rest
Hadoop file systems
Large scale storage data systems
iRODS, BigQuery, and other NoSQL
Web services for Data-as-a-Service (e.g., GIS)
Real time data – data in motion
Cloud data platforms
Several MOM (Message-oriented Middleware)
E.g., Apache Kafka
Domain-specific streaming systems (e.g., images)
Data Processing Framework
Batch processing
MapReduce/Hadoop
Data pipelines/Data flows
Scientific workflows
(Near) real-time stream processing
Apache Flink, Apache Kafka Streams, Apache Apex, Apache Spark
Data Analytics
Data Analytics: Analysis + Decision
[Figure: data at rest and data in motion are handled by data processing frameworks (streaming/online, batch workflow and hybrid data processing), which support data analysis and decisions through analytics tools, processes and models]
Analysis: workflow models
[Figure components: Things, People, DaaS, Computation Service]
Important notes: structures and resources
Analysis: Stream data processing
Processing elements/operators are arranged in graphs
Streaming data comes to processing elements
Results from an element are passed to another
Source: Neumeyer, L.; Robbins, B.; Nair, A.; Kesari, A., "S4: Distributed Stream Computing Platform," 2010 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 170-177, 13 Dec. 2010
Check also: http://www.infosys.tuwien.ac.at/staff/truong/dst/pdfs/truong-dst2018-lecture5.pdf
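The idea of operators that receive streaming data and pass their results on can be sketched with chained Python generators. This is only an illustration with hypothetical operator names and a linear chain instead of a general graph; it is not how S4 or any other stream platform is implemented.

```python
from typing import Iterable, Iterator

def source(values: Iterable[int]) -> Iterator[dict]:
    """Emit one event per incoming value."""
    for v in values:
        yield {"value": v}

def threshold_filter(events: Iterable[dict], limit: int) -> Iterator[dict]:
    """Processing element: keep only events above the limit."""
    for e in events:
        if e["value"] > limit:
            yield e

def running_count(events: Iterable[dict]) -> Iterator[dict]:
    """Processing element: annotate each event with how many were seen so far."""
    count = 0
    for e in events:
        count += 1
        yield {**e, "count": count}

# Results from one element are passed to the next, one event at a time.
pipeline = running_count(threshold_filter(source([3, 12, 7, 15]), limit=10))
results = list(pipeline)
print(results)
```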
Analysis: hybrid data processing
Source: http://lambda-architecture.net/
Combine batch processing and stream processing
e.g., https://spark.apache.org/
In which scenarios should we use a combination?
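A rough sketch of the combination, assuming a simple per-station alarm count: a batch view is computed over historic data, a speed view over events that arrived after the last batch run, and a query merges both. The function and field names are hypothetical and the example ignores how views are stored or refreshed.

```python
from collections import Counter

def batch_view(master_dataset):
    """Precomputed view over all historic events (recomputed periodically)."""
    return Counter(e["station_id"] for e in master_dataset)

def speed_view(recent_events):
    """Incremental view over events that arrived after the last batch run."""
    return Counter(e["station_id"] for e in recent_events)

def query(station_id, batch, speed):
    """Serve a query by merging the batch view and the speed view."""
    return batch[station_id] + speed[station_id]

historic = [{"station_id": "A"}, {"station_id": "A"}, {"station_id": "B"}]
recent = [{"station_id": "A"}]

batch = batch_view(historic)
speed = speed_view(recent)
total_a = query("A", batch, speed)
print(total_a)
```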
Cloud services and big data analytics
Data sources (sensors, files, databases, queues, log services)
Messaging systems (e.g., Kafka, AMQP, MQTT)
Storage and databases (S3, Google BigQuery, InfluxDB, HDFS, Cassandra, MongoDB, Elasticsearch, etc.)
Batch data processing systems (e.g., Hadoop, Airflow, Spark)
Stream processing systems (e.g., Apex, Kafka, Flink, WSO2, Google Dataflow)
Elastic cloud infrastructures (VMs, dockers, OpenStack elastic resource management tools, storage)
Warehouse
Analytics
Operation/Management/Business Services
What do we mean by quality-aware
data analytics:
Able to determine quality and incidents,
establish their relationships and optimize
the system accordingly based on
constraints on quality and incidents
Incidents
System incidents
Data incidents
Processing incidents
Cross systems and cross layers
Check: https://en.wikipedia.org/wiki/ITIL
Quality of Analytics (QoA)
Characterize the results of analytics processes
Different elements of QoA
Performance (e.g. Execution time)
Quality of data/data quality
Cost
Data format of output results
Etc.
Customer: expects QoA
Provider: offers QoA and enforces QoA
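One way to make the customer/provider view concrete is to treat QoA as a small record and compare what was measured against what the customer expects. The sketch below is an assumption-laden illustration: the three fields and their units are hypothetical choices, and real QoA models are richer.

```python
from dataclasses import dataclass

@dataclass
class QoA:
    execution_time_s: float  # performance
    data_accuracy: float     # quality of data, between 0 and 1
    cost: float              # in some currency unit

def violations(measured: QoA, expected: QoA) -> list:
    """Return the names of the QoA elements that miss the expectation."""
    problems = []
    if measured.execution_time_s > expected.execution_time_s:
        problems.append("execution_time_s")
    if measured.data_accuracy < expected.data_accuracy:
        problems.append("data_accuracy")
    if measured.cost > expected.cost:
        problems.append("cost")
    return problems

expected = QoA(execution_time_s=60.0, data_accuracy=0.95, cost=5.0)
measured = QoA(execution_time_s=45.0, data_accuracy=0.90, cost=4.0)
print(violations(measured, expected))
```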
A simple QoA view
Questions around a data analytics task (data in, analytics processes, data out):
Execution time? Performance overhead? Memory consumption?
Is the data good enough?
How does bad data impact performance?
Is the data good enough to be stored and shared?
Which processes should be used?
Note: Data quality metrics and models are strongly domain-specific
INCIDENTS IN CLOUD-BASED
BIG DATA
Case Study BTS
Base Transceiver Station (BTS)
Large-scale systems (1K+ BTS)
Flexible back-end clouds
Generic enough for other applications (e.g., in smart agriculture)
With bad infrastructures for IoT and connectivity
[Figure components: Sensor, Actuator, IoT Gateway, MQTT Broker, BigQuery, InfluxDB, Hadoop FS, G. Storage, Analytics, Optimizer; public and private cloud infrastructures]
If you monitor alarms in BTSs and see this
What could have happened?
Challenges
The ultimate goal of the (domain) data scientist is
to meet
Quality of Analytics (QoA)
QoA: cost, performance (response time), quality of data (up-to-dateness, accuracy)
But there are many interactions that might cause
incidents that lead to unexpected QoA
Hong-Linh Truong, Aitor Murguzur, Erica Yang, Challenges in Enabling Quality of Analytics in the Cloud, ACM JDIQ Challenge paper, 2017.
Problem 1: the complexity of software stacks and subsystems
[Figure components: BTS Monitoring, SFTP, MQTT, RabbitMQ, Apache NiFi, Ingestion Service, big data storage (Hadoop FS/Google Storage), BigQuery, Apache Spark, Enrichment Service, Streaming Data Processing, BatchAnalytics Manager, Planner, Analytics Service, Analytics Web Service, web services, Client, ElasticSearch, Kibana Visualization; the flows between them carry data, analytics, results and notifications]
Source: Simplified version of the design from I & A Computing Lab, VN, www.inacomputing.com
Problem 2: Complexity of the underlying virtual computing and network infrastructures
Heavily based on virtual resources
IoT, Network functions and Clouds
IoT Big Data Analytics
The SINC Concept: http://sincconcept.github.io
Incident monitoring and analytics
Classification of incidents:
to quantify incidents and identify possible data
sources, monitoring techniques and analytics.
Measurement/Instrumentation:
to provide mechanisms for measurement and data
collection for incidents.
Incident analytics:
to find out the root cause and dependencies of
incidents.
[Figure: a pipeline of IoT Sensor, IoT Gateway, Message Broker/Data Logistics Service, Analysis/Transformation Tasks, Data Storage and resulting analytics, annotated with: large number of data sources (e.g., IoT devices); large-scale brokers & data transfer/logistics services; complex big data processing frameworks; other systems in the pipeline]
W3H: what, when, where and how for incidents
Too complex with many types of software. Can we have a simplified taxonomy for mapping incidents?
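A very small sketch of how W3H could be captured per incident, so that incidents from different subsystems can be grouped in one place. The record layout and the sample incidents are hypothetical and are not taken from the cited paper.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Incident:
    what: str       # what happened
    when: datetime  # when it was observed
    where: str      # which subsystem or layer
    how: str        # how it was detected

# Hypothetical incidents, for illustration only.
incidents = [
    Incident("missing data points", datetime(2018, 6, 1, 10, 0, tzinfo=timezone.utc),
             "IoT gateway", "log inspection"),
    Incident("queue overflow", datetime(2018, 6, 1, 10, 5, tzinfo=timezone.utc),
             "message broker", "broker metrics"),
    Incident("slow delivery", datetime(2018, 6, 1, 10, 7, tzinfo=timezone.utc),
             "message broker", "end-to-end latency probe"),
]

# A simple mapping of incidents to the place where they occurred.
by_where = Counter(i.where for i in incidents)
print(by_where.most_common())
```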
Hong-Linh Truong, Manfred Halper, Characterizing Incidents in Cloud-based IoT Data Analytics, The 42nd IEEE International Conference on Computers, Software & Applications, Tokyo, Japan, July 23-27, 2018.
Points of instrumentation for
gathering data for incident analytics
Hong-Linh Truong, Manfred Halper, Characterizing Incidents in Cloud-based IoT Data Analytics, The 42nd IEEE International Conference on Computers, Software & Applications, Tokyo, Japan, July 23-27, 2018.
Capture monitoring data to analyze and solve incidents,
especially incidents related to data quality, across
subsystems in ensembles to achieve quality of results
Classification of incidents
Hong-Linh Truong, Manfred Halper, Characterizing Incidents in Cloud-based IoT Data Analytics, The 42nd IEEE International Conference on Computers, Software & Applications, Tokyo, Japan, July 23-27, 2018.
Example of incident classification
See https://www.researchgate.net/publication/324170664_Characterizing_Incidents_in_Cloud-based_IoT_Data_Analytics
Large-scale brokers and
storage
Complex big data processing
frameworks and ML
applications (e.g., Spark)
Monitoring and Analytics
Not just fast, distributed and cross layer monitoring
Hard to collect some incident related data for
quality of data
Analytics: will be based on big data principles
with ML but dependency analysis is not trivial
One example of tools for
monitoring
Check: https://github.com/rdsea/bigdataincidentanalytics
QOA IN DATA ANALYTICS
WORKFLOWS
Data analytics workflow execution
models
[Figure components: data analytics workflows, Execution Engine, Local Scheduler with jobs, Web services, People]
Data analytics workflow execution
models
[Figure components: data analytics workflows, Execution Engine, Data Analysis Service Unit with input data and analytics results, running on Dockers/VMs/Servers/Cloud/Cluster]
A unit/an activity can be complex, e.g., complex batch processing (MapReduce/Hadoop)
Representing and programming
data analytics workflows/processes
Programming languages
General- and specific-purpose programming languages, such as Java, Python, Swift
Programming models
such as MapReduce, Hadoop, Complex event processing, Spark
Descriptive languages
BPEL and several languages designed for specific workflow engines
They can also be combined
Check also: http://www.infosys.tuwien.ac.at/staff/truong/dst/pdfs/truong-dst2018-lecture5.pdf
Some examples (3)
Source: Sudipto Das, Yannis Sismanis, Kevin S. Beyer, Rainer Gemulla, Peter J. Haas, and John McPherson. 2010.
Ricardo: integrating R and Hadoop. In Proceedings of the 2010 ACM SIGMOD International Conference on Management
of data (SIGMOD '10). ACM, New York, NY, USA, 987-998. DOI=10.1145/1807167.1807275
http://doi.acm.org/10.1145/1807167.1807275
Some examples (4): Airflow from
Airbnb
Workflow is a DAG (Directed Acyclic Graph)
http://airbnb.io/projects/airflow/
Task/Operator:
BashOperator, PythonOperator, EmailOperator,
HTTPOperator, SqlOperator, Sensor,
DockerOperator, HiveOperator, S3FileTransferOperator,
PrestoToMysqlOperator, SlackOperator
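To illustrate what "workflow is a DAG" implies for execution order, the sketch below uses Python's standard `graphlib` module rather than the Airflow API. The task names are hypothetical; an engine such as Airflow would additionally schedule, retry and monitor the tasks.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
dag = {
    "download_signal_file": set(),
    "validate": {"download_signal_file"},
    "analyze": {"validate"},
    "notify": {"analyze"},
    "store_result": {"analyze"},
}

# A valid execution order respects every dependency edge of the DAG.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

If the dependencies contained a cycle, `static_order()` would raise `graphlib.CycleError`, which is why the workflow must be acyclic.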
Example for processing signal file
Some examples (5): MapReduce
Source: Jeffrey Dean and Sanjay Ghemawat.
2008. MapReduce: simplified data processing on
large clusters. Commun. ACM 51, 1 (January
2008), 107-113. DOI=10.1145/1327452.1327492
http://doi.acm.org/10.1145/1327452.1327492
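The programming model can be sketched in a single process with the classic word count: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. This only shows the model; a real MapReduce system distributes these phases across a cluster.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit (word, 1) for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the values of one key."""
    return key, sum(values)

documents = ["big data analytics", "quality of data", "data in motion"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(counts["data"])
```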
Apache Beam
Goal: separate pipelines from backend engines
[Figure: pipeline stages: read data, analytics, post-processing of results, store analysis results]
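The separation of a pipeline from the engine that runs it can be sketched without the Beam SDK: the pipeline is only a description (a list of steps), and interchangeable runners decide how to execute it. All class and function names here are hypothetical and are not part of Apache Beam.

```python
from typing import Callable, Iterable, List

Step = Callable[[Iterable], Iterable]

def pipeline_steps() -> List[Step]:
    """The pipeline: what to compute, independent of any engine."""
    return [
        lambda rows: (r for r in rows if r is not None),  # drop missing values
        lambda rows: (r * 2 for r in rows),               # the analytics step
    ]

class EagerRunner:
    """Backend that materializes the result of every step."""
    def run(self, steps, data):
        for step in steps:
            data = list(step(data))
        return data

class LazyRunner:
    """Backend that streams elements through all steps at once."""
    def run(self, steps, data):
        for step in steps:
            data = step(data)
        return list(data)

data = [1, None, 3]
eager_result = EagerRunner().run(pipeline_steps(), data)
lazy_result = LazyRunner().run(pipeline_steps(), data)
print(eager_result, lazy_result)
```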
So how do we enable QoA-aware
analytics?
Solutions
Computational resources provisioning?
Replication of data analysis tasks?
Performance and cost measurement and optimization?
Improve the quality of input data?
Improve the quality of output data?
Which tools do you need for such
solutions?
We will focus on quality of data as it
has not been studied well
Mostly performance but not data
quality
If a job fails due to the quality of data, how do you know?
Well-addressed
concerns –
performance/cost
Source: David Chiu, Sagar Deshpande, Gagan
Agrawal, Rongxing Li: Cost and accuracy
sensitive dynamic workflow composition over grid
environments. GRID 2008: 9-16
Data Operations and cost with
BigQuery
Source: https://cloud.google.com/bigquery/pricing
Just think about a simple example:
If you want to account for cost together with data size and performance, what would your approach be?
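One possible starting point, sketched under strong assumptions: cost is taken to be proportional to the bytes a query scans, the price per TiB is a hypothetical parameter (not an actual BigQuery price), and each candidate plan comes with an estimated runtime. The cheapest plan that still meets the runtime constraint is selected.

```python
def query_cost(bytes_scanned: int, price_per_tib: float) -> float:
    """Cost model assumption: pay per TiB of data scanned."""
    return bytes_scanned / 2**40 * price_per_tib

def pick_plan(plans, max_runtime_s, price_per_tib):
    """Choose the cheapest plan whose estimated runtime is acceptable."""
    feasible = [p for p in plans if p["runtime_s"] <= max_runtime_s]
    if not feasible:
        return None
    return min(feasible, key=lambda p: query_cost(p["bytes_scanned"], price_per_tib))

# Hypothetical plans with estimated data size and runtime.
plans = [
    {"name": "full_scan", "bytes_scanned": 2 * 2**40, "runtime_s": 30},
    {"name": "partitioned", "bytes_scanned": 2**39, "runtime_s": 50},
    {"name": "sampled", "bytes_scanned": 2**40 // 8, "runtime_s": 120},
]

best = pick_plan(plans, max_runtime_s=60, price_per_tib=5.0)
print(best["name"], query_cost(best["bytes_scanned"], 5.0))
```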
Provenance info
If you are able to detect a quality problem in the analysis phase, can you trace it back to the data sources? What would your approach be?
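A minimal sketch of tracing back, assuming provenance is recorded as a "derived from" relation between data artifacts. The artifact names are hypothetical; the function walks the relation from a suspicious result until it reaches artifacts that have no recorded inputs.

```python
def trace_sources(artifact, derived_from):
    """Follow provenance links back to artifacts with no recorded inputs."""
    sources, stack, seen = set(), [artifact], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        parents = derived_from.get(node, [])
        if not parents:
            sources.add(node)  # nothing upstream: an original data source
        stack.extend(parents)
    return sources

# Hypothetical provenance: each artifact lists what it was derived from.
derived_from = {
    "report": ["aggregates"],
    "aggregates": ["cleaned_a", "cleaned_b"],
    "cleaned_a": ["sensor_a"],
    "cleaned_b": ["sensor_b"],
}

print(sorted(trace_sources("report", derived_from)))
```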
Research questions for QoD
What are the main QoD metrics, what are the relationships between QoD metrics and other service level objectives, and what are their roles and possible trade-offs?
How to support different domain-specific QoD models and link them to workflow structures?
How to model, evaluate and estimate QoD associated with data movement into, within, and out of workflows? When and where can software or scientists perform automatic or manual QoD measurement and analysis?
How to optimize the workflow composition and execution based on QoD specification?
How does QoD impact the provisioning of data services, computational services and supporting services?
Approach
Core models, techniques and algorithms for modeling and evaluating QoD metrics
QoD-aware composition and execution
QoD-aware service provisioning and infrastructure optimization
Modeling and evaluating QoD
metrics for data analytics
workflows
QoD-aware optimization for data
analytics workflow composition
and execution
How to integrate QoD evaluators? And
which concerns need to be considered?
QoD metrics evaluation
Domain-specific metrics
Need specific tools and expertise for determining metrics
Evaluation
Cannot be done by software only: humans are required
Exact versus inexact evaluation due to big and streaming data
Complex integration model
Where to put QoD evaluators and why?
How do evaluators obtain the data to be evaluated?
Impact of QoD evaluation on performance of data analytics workflows
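The exact-versus-inexact distinction can be illustrated with a completeness metric. This is a minimal sketch, not a method from the lecture: the record layout, the function names, and the use of reservoir sampling for the inexact case are all assumptions chosen for illustration.

```python
import random
from typing import Iterable

Record = dict  # hypothetical record type: field name -> value (None if missing)

def is_complete(record: Record, required: list[str]) -> bool:
    """A record is complete when every required field is present and not None."""
    return all(record.get(f) is not None for f in required)

def exact_completeness(records: list[Record], required: list[str]) -> float:
    """Exact evaluation: inspect every record (feasible only for bounded data)."""
    if not records:
        return 1.0
    return sum(is_complete(r, required) for r in records) / len(records)

def sampled_completeness(stream: Iterable[Record], required: list[str],
                         sample_size: int, seed: int = 0) -> float:
    """Inexact evaluation: reservoir-sample the stream, then evaluate the sample."""
    rng = random.Random(seed)
    reservoir: list[Record] = []
    for i, record in enumerate(stream):
        if i < sample_size:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < sample_size:
                reservoir[j] = record
    return exact_completeness(reservoir, required)
```

The sampled variant keeps memory bounded by `sample_size`, at the price of returning an estimate; how large a sample is acceptable is a domain decision.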
![Page 57: Hong-Linh Truong Faculty of Informatics, TU Wien http ... · Web services for Data-as-a-Service (e.g., GIS) Real time data –data in motion Cloud data platforms Several MOM (Message-oriented](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec56fcbc82f0c182427b1d1/html5/thumbnails/57.jpg)
Evaluating quality of data in
workflows
Michael Reiter, Uwe Breitenbücher, Schahram Dustdar, Dimka Karastoyanova, Frank Leymann, Hong Linh Truong: A
Novel Framework for Monitoring and Analyzing Quality of Data in Simulation Workflows. eScience 2011: 105-112
![Page 58: Hong-Linh Truong Faculty of Informatics, TU Wien http ... · Web services for Data-as-a-Service (e.g., GIS) Real time data –data in motion Cloud data platforms Several MOM (Message-oriented](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec56fcbc82f0c182427b1d1/html5/thumbnails/58.jpg)
QoD Evaluator
Software-based QoD evaluators
Can be provided as libraries integrated into invoked applications
Web services-based evaluators
Human-based QoD evaluators
Built on the concept of human-based services
Can be interfaced via Human-Task
Simple mapping at the moment
Human resources from clouds/crowds
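One way to treat software-based and human-based evaluators uniformly is to hide both behind a common interface. The sketch below is an assumption for illustration only: `QoDEvaluator`, `SoftwareEvaluator`, `HumanTaskEvaluator`, and the in-memory task queue are hypothetical names, and the human side is reduced to queuing a task rather than calling a real Human-Task system.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class QoDResult:
    metric: str
    value: Optional[float]  # None while a human task is still pending
    evaluator: str

class QoDEvaluator(ABC):
    """Common interface so a workflow does not care who performs the evaluation."""
    @abstractmethod
    def evaluate(self, data: list[dict]) -> QoDResult: ...

class SoftwareEvaluator(QoDEvaluator):
    """Library-style evaluator that runs in-process inside the invoked application."""
    def __init__(self, metric: str, fn: Callable[[list[dict]], float]):
        self.metric = metric
        self.fn = fn

    def evaluate(self, data: list[dict]) -> QoDResult:
        return QoDResult(self.metric, self.fn(data), "software")

class HumanTaskEvaluator(QoDEvaluator):
    """Human-based evaluator: queues a task for a person and returns a pending result."""
    def __init__(self, metric: str):
        self.metric = metric
        self.task_queue: list[dict] = []  # stand-in for a real human-task system

    def evaluate(self, data: list[dict]) -> QoDResult:
        # Hand only a small sample to the human reviewer.
        self.task_queue.append({"metric": self.metric, "sample": data[:5]})
        return QoDResult(self.metric, None, "human")
```

A web services-based evaluator would fit the same interface by wrapping a remote call in `evaluate`; it is left out here to avoid inventing an endpoint.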
![Page 59: Hong-Linh Truong Faculty of Informatics, TU Wien http ... · Web services for Data-as-a-Service (e.g., GIS) Real time data –data in motion Cloud data platforms Several MOM (Message-oriented](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec56fcbc82f0c182427b1d1/html5/thumbnails/59.jpg)
What kind of optimization can be done
with QoD?
![Page 60: Hong-Linh Truong Faculty of Informatics, TU Wien http ... · Web services for Data-as-a-Service (e.g., GIS) Real time data –data in motion Cloud data platforms Several MOM (Message-oriented](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec56fcbc82f0c182427b1d1/html5/thumbnails/60.jpg)
QoD-aware optimization for data
analytics workflows
Improving quality of analytics
Reducing analytics costs and time
Enabling early failure detection
Enabling elasticity of service provisioning
Enabling elastic data analytics support
Etc.
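Early failure detection, and with it reduced cost and time, can be sketched as a QoD gate in front of each workflow step. This is an illustrative assumption, not the lecture's implementation: `run_with_qod_gate`, `QoDGateError`, and the single scalar threshold are hypothetical.

```python
class QoDGateError(RuntimeError):
    """Raised when input QoD is below what the next step requires."""

def run_with_qod_gate(data, steps, evaluate_qod, min_qod):
    """Run (name, function) steps in order, checking QoD before each one.

    Stopping before a step whose input is too poor avoids paying for
    downstream analytics that would produce unusable results anyway.
    """
    executed = []
    for name, step in steps:
        score = evaluate_qod(data)
        if score < min_qod:
            raise QoDGateError(
                f"stopped before '{name}': QoD {score:.2f} < {min_qod:.2f}; "
                f"steps already run: {executed}"
            )
        data = step(data)
        executed.append(name)
    return data
```

In practice each step could carry its own threshold and metric; a single `min_qod` keeps the sketch short.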
![Page 61: Hong-Linh Truong Faculty of Informatics, TU Wien http ... · Web services for Data-as-a-Service (e.g., GIS) Real time data –data in motion Cloud data platforms Several MOM (Message-oriented](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec56fcbc82f0c182427b1d1/html5/thumbnails/61.jpg)
How to support QoA-driven analytics with
tradeoffs of multiple criteria?
QoA: QoD, performance, cost, etc.
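One common way to reason about tradeoffs among several criteria is to keep only the non-dominated options (a Pareto front) and let the user pick among them. The slide does not prescribe this technique; the sketch below, including the `Option` fields and units, is an assumption for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Option:
    name: str
    qod: float     # higher is better
    time_s: float  # lower is better
    cost: float    # lower is better

def dominates(a: Option, b: Option) -> bool:
    """a dominates b if it is no worse on every criterion and better on at least one."""
    no_worse = a.qod >= b.qod and a.time_s <= b.time_s and a.cost <= b.cost
    better = a.qod > b.qod or a.time_s < b.time_s or a.cost < b.cost
    return no_worse and better

def pareto_front(options: list[Option]) -> list[Option]:
    """Keep the options that no other option dominates, preserving input order."""
    return [o for o in options if not any(dominates(p, o) for p in options)]
```

The front makes the tradeoff explicit: every remaining option gives up something (QoD, time, or cost) relative to another one on the front.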
![Page 62: Hong-Linh Truong Faculty of Informatics, TU Wien http ... · Web services for Data-as-a-Service (e.g., GIS) Real time data –data in motion Cloud data platforms Several MOM (Message-oriented](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec56fcbc82f0c182427b1d1/html5/thumbnails/62.jpg)
Quality-of-analytics driven
workflows
Some basic steps
Conceptualize expected QoA
Associate the expected QoA with workflow activities
Use the expected QoA to match/select underlying services (e.g., data sources, cloud IaaS, etc.)
Utilize the expected QoA and the measured QoA and apply elasticity principles to:
Refine the workflow structure
Provision computation, network and data
Hong-Linh Truong, Aitor Murguzur, and Erica Yang. 2018. Challenges in Enabling Quality of Analytics in the Cloud.
J. Data and Information Quality 9, 2, Article 9 (January 2018), 4 pages. DOI: https://doi.org/10.1145/3138806
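The basic steps above can be sketched in a few lines. Everything here is a hypothetical illustration: the `ExpectedQoA` and `ServiceOffer` fields, the activity names, and the "cheapest matching offer" selection policy are assumptions, not part of the cited work.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ExpectedQoA:
    """Step 1: conceptualize expected QoA as explicit bounds."""
    min_qod: float
    max_time_s: float
    max_cost: float

@dataclass(frozen=True)
class ServiceOffer:
    name: str
    qod: float
    time_s: float
    cost: float

# Step 2: associate the expected QoA with workflow activities (names are made up).
workflow_qoa = {
    "ingest": ExpectedQoA(min_qod=0.80, max_time_s=30, max_cost=2.0),
    "analyze": ExpectedQoA(min_qod=0.90, max_time_s=60, max_cost=10.0),
}

def matches(offer: ServiceOffer, expected: ExpectedQoA) -> bool:
    return (offer.qod >= expected.min_qod
            and offer.time_s <= expected.max_time_s
            and offer.cost <= expected.max_cost)

def select_service(offers: list[ServiceOffer],
                   expected: ExpectedQoA) -> Optional[ServiceOffer]:
    """Step 3: pick the cheapest offer that satisfies the expected QoA, if any."""
    candidates = (o for o in offers if matches(o, expected))
    return min(candidates, key=lambda o: o.cost, default=None)

def violated(expected: ExpectedQoA, qod: float, time_s: float, cost: float) -> list[str]:
    """Step 4 input: compare measured QoA with expected QoA to trigger elasticity."""
    out = []
    if qod < expected.min_qod:
        out.append("qod")
    if time_s > expected.max_time_s:
        out.append("time")
    if cost > expected.max_cost:
        out.append("cost")
    return out
```

A non-empty result from `violated` is the signal to refine the workflow structure or to re-provision computation, network, and data.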
![Page 63: Hong-Linh Truong Faculty of Informatics, TU Wien http ... · Web services for Data-as-a-Service (e.g., GIS) Real time data –data in motion Cloud data platforms Several MOM (Message-oriented](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec56fcbc82f0c182427b1d1/html5/thumbnails/63.jpg)
Using Data Elasticity
Management Process to ensure QoA
Tien-Dung Nguyen, Hong Linh Truong, Georgiana Copil, Duc-Hung Le, Daniel Moldovan, Schahram Dustdar:
On Developing and Operating of Data Elasticity Management Process. ICSOC 2015: 105-119
![Page 64: Hong-Linh Truong Faculty of Informatics, TU Wien http ... · Web services for Data-as-a-Service (e.g., GIS) Real time data –data in motion Cloud data platforms Several MOM (Message-oriented](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec56fcbc82f0c182427b1d1/html5/thumbnails/64.jpg)
Data elasticity
Key techniques
Monitoring QoD for streaming and big data
Monitoring cloud resources
Having multiple data analysis algorithms
Using elasticity rules for cloud resources and
analysis algorithms
Building your own elasticity rules/models
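Elasticity rules that cover both cloud resources and analysis algorithms can start as simple condition-action pairs over monitored metrics. The thresholds, metric names, and action labels below are assumptions for illustration; a real system would map the actions to its own provisioning and algorithm-switching mechanisms.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Metrics:
    qod: float       # monitored quality of data, 0..1
    cpu_util: float  # monitored cloud resource utilization, 0..1

@dataclass(frozen=True)
class Rule:
    name: str
    condition: Callable[[Metrics], bool]
    action: str

# Hypothetical rule set: one rule targets the algorithm, two target resources.
RULES = [
    Rule("low-qod", lambda m: m.qod < 0.8, "switch_to_robust_algorithm"),
    Rule("overloaded", lambda m: m.cpu_util > 0.85, "scale_out"),
    Rule("idle", lambda m: m.cpu_util < 0.20, "scale_in"),
]

def decide(metrics: Metrics, rules: list[Rule] = RULES) -> list[str]:
    """Return the actions of all rules whose condition holds, in rule order."""
    return [r.action for r in rules if r.condition(metrics)]
```

Keeping rules as data makes it straightforward to build your own rule sets per pipeline and to test them against recorded monitoring traces.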
![Page 65: Hong-Linh Truong Faculty of Informatics, TU Wien http ... · Web services for Data-as-a-Service (e.g., GIS) Real time data –data in motion Cloud data platforms Several MOM (Message-oriented](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec56fcbc82f0c182427b1d1/html5/thumbnails/65.jpg)
Exercises
Read mentioned papers
Examine possible incidents in your data pipelines
Examine how QoD evaluators can be integrated into
different programming models for QoA-aware data
analytics workflows
Implement some QoD evaluators
Develop techniques for determining where QoD
evaluators should be placed in your mini projects
Support data elasticity management in your mini project
![Page 66: Hong-Linh Truong Faculty of Informatics, TU Wien http ... · Web services for Data-as-a-Service (e.g., GIS) Real time data –data in motion Cloud data platforms Several MOM (Message-oriented](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec56fcbc82f0c182427b1d1/html5/thumbnails/66.jpg)
Thanks for your attention
Hong-Linh Truong
Faculty of Informatics, TU Wien
http://www.infosys.tuwien.ac.at/staff/truong
@linhsolar