Scalable Data Mining on Parallel, Distributed
and Cloud Computing Systems
Domenico Talia
DIMES, Università della Calabria & DtoK Lab, Italy
Lecture Goals
§ Parallel systems, distributed computing infrastructures (like Grids and P2P networks), and Cloud computing platforms offer effective support for addressing both the computational and the data storage needs of Big Data mining and parallel analytics applications.
§ Complex data mining tasks involve data- and compute-intensive algorithms that require large storage facilities together with high-performance processors to obtain results in acceptable times.
Lecture Goals
§ In this lecture we introduce the most relevant topics and the main research issues in high-performance data mining, including
  § parallel data mining strategies,
  § distributed analysis techniques, and
  § knowledge services and Cloud-based data mining.
§ We discuss parallel models, scalable algorithms, data mining programming tools and applications.
General Outline
LESSONS
1. Parallel Data Mining Strategies
2. Distributed and Service-oriented Data Mining
3. Data Mining Tools on Clouds
LESSON 2
Data Mining Tools on Clouds
Talk outline
§ Big problems and Big data
§ Using Clouds for data mining
§ A collection of services for scalable data analysis
§ Data mining workflows
§ Data Mining Cloud Framework (DMCF)
§ JS4Cloud for programming service-oriented workflows
§ Final remarks
Goals (1)
§ KDD and data mining techniques are used in many domains to extract useful knowledge from big datasets.
§ KDD applications range from
  § single-task applications,
  § parameter-sweeping / regular parallel applications, and
  § complex applications (workflow-based, distributed, parallel).
§ Cloud computing can provide developers and end users with the computing and storage services and the scalable execution mechanisms needed to efficiently run all these classes of applications.
Goals (2)
§ How to use Cloud services for the scalable execution of data analysis workflows.
§ A programming environment for data analysis: the Data Mining Cloud Framework (DMCF).
§ A visual programming interface (VL4Cloud) and the script-based JS4Cloud language for implementing service-oriented workflows.
§ Evaluating the performance of data mining workflows on DMCF.
Data analysis as a service
§ PaaS (Platform as a Service) can be an appropriate model for building frameworks that allow users to design and execute data mining applications.
§ SaaS (Software as a Service) can be an appropriate model for implementing scalable data mining applications.
§ These two Cloud service models can be effectively exploited for delivering data analysis tools and applications as services.
Service-Oriented Data Mining
§ Knowledge discovery (KDD) and data mining (DM) are
  § compute- and data-intensive processes/tasks,
  § often based on distribution of data, algorithms, and users.
§ Large-scale service-oriented systems (like Clouds) can integrate both distributed computing and parallel computing, so they are useful platforms.
§ They also offer security, resource information, data access and management, communication, scheduling, SLAs, ...
Services for Distributed Data Mining
§ By exploiting the SOA model it is possible to define basic services for supporting distributed data mining tasks and applications.
§ Those services can address all the aspects of data mining and knowledge discovery processes:
  • data selection and transport services,
  • data analysis services,
  • knowledge model representation services, and
  • knowledge visualization services.
Services for Distributed Data Mining
§ It is possible to design services corresponding to:
Single KDD Steps
All the steps that compose a KDD process, such as preprocessing, filtering, and visualization, are expressed as services.
Single Data Mining Tasks
This level includes tasks such as classification, clustering, and association rules discovery.
Distributed Data Mining Patterns
This level implements, as services, patterns such as collective learning, parallel classification, and meta-learning models.
Data Mining Applications and KDD Processes
This level includes the previous tasks and patterns composed into multi-step workflows.
Services for Distributed Data Mining
§ This collection of data mining services implements an Open Service Framework for Distributed Data Mining, organized in layers: KDD Step Services, Data Mining Task Services, Distributed Data Mining Patterns, and Distributed Data Mining Applications and KDD Processes.
Services for Distributed Data Mining
§ It allows developers to program distributed KDD processes as a composition of single and/or aggregated services available over a service-oriented infrastructure.
§ Those services should exploit other basic Cloud services for data transfer, replica management, data integration and querying.
[Figure: a service-oriented Cloud workflow.]
Services for Distributed Data Mining
§ By exploiting Cloud service features it is possible to develop data mining services accessible at any time and from anywhere (remotely and from small devices).
§ This approach can produce not only service-based distributed data mining applications, but also
  § data mining services for communities and virtual organizations,
  § distributed data analysis services on demand, and
  § a sort of knowledge discovery eco-system formed of a large number of decentralized data analysis services.
The Data Mining Cloud Framework
§ The Data Mining Cloud Framework supports workflow-based KDD applications, expressed (visually or in a script language) as graphs that link together data sources, data mining algorithms, and visualization tools.
[Workflow example: a Splitter tool (task T1) partitions the CoverType dataset into training samples Sample1-Sample3 and a test set Sample4; three J48 instances (tasks T2-T4) are trained in parallel on the samples, producing Model1-Model3; a Voter tool (task T5) combines the three models on the test set to produce the final Model.]
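The same workflow can be expressed in JS4Cloud, introduced later in this lecture. The following is only a minimal sketch: Data.get, Data.define and the J48 invocation style match the examples shown later, while the Splitter and Voter parameter names are assumptions made for illustration.

// Sketch of the ensemble workflow above; the parameter names of
// Splitter and Voter are hypothetical.
var DRef = Data.get("CoverType");
var nMod = 3;
var SRef = Data.define("Sample", nMod);    // training samples Sample1..Sample3
var TeRef = Data.define("TestSample");     // test set (Sample4 in the figure)
Splitter({dataset:DRef, samples:SRef, testSet:TeRef});    // task T1
var MRef = Data.define("Model", nMod);
for(var i=0; i<nMod; i++)                  // tasks T2-T4 run in parallel
  J48({dataset:SRef[i], confidence:0.1, model:MRef[i]});
var FRef = Data.define("FinalModel");
Voter({models:MRef, testSet:TeRef, finalModel:FRef});     // task T5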
Architecture Components
§ Compute is the computational environment used to execute Cloud applications:
  § Web role: Web-based applications;
  § Worker role: batch applications;
  § VM role: virtual machine images.
§ Storage provides scalable storage elements:
  § Blobs: storing binary and text data;
  § Tables: non-relational databases;
  § Queues: communication between components.
§ Fabric controller links the physical machines of a single data center; the Compute and Storage services are built on top of this component.
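To make this execution model concrete, the following is an illustrative in-memory sketch in plain JavaScript (not DMCF source code; all names here are hypothetical) of how a Worker Role instance consumes tasks from the Task Queue and records their state in the Task Status Table:

// Hypothetical model of a worker role's task loop.
var taskQueue = [
  { id: "T2", run: function () { /* e.g. train J48 on Sample1 */ } },
  { id: "T3", run: function () { /* e.g. train J48 on Sample2 */ } }
];
var statusTable = {};                  // stands in for the Task Status Table
while (taskQueue.length > 0) {
  var task = taskQueue.shift();        // dequeue the next ready task
  statusTable[task.id] = "running";
  try {
    task.run();                        // inputs and results would live in Blobs
    statusTable[task.id] = "done";
  } catch (e) {
    statusTable[task.id] = "failed";
  }
}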
The Data Mining Cloud Framework: Architecture
[Figure: the Cloud platform architecture. Compute hosts Web Role instances (the website) and Worker Role instances; Storage provides a Task Queue (Queues), a Task Status Table (Tables), and Blobs holding the input datasets and the data mining models; everything runs on top of the Fabric.]
The Data Mining Cloud Framework: Mapping
The Data Mining Cloud Framework: Execution
Example applications (1)
§ Finance: prediction of personal income based on census data.
§ Networks: discovery of network attacks from log analysis.
§ E-Health: disease classification based on gene analysis.
Example applications (2)
§ Smart City: car trajectory pattern detection applications.
§ Biosciences: drug metabolism associations in pharmacogenomics.
The Cloud4SNP workflow
§ DMET (Drug Metabolism Enzymes and Transporters) has been designed specifically to test drug metabolism associations in pharmacogenomics case-control studies.
§ Cloud4SNP is a Cloud implementation of the DMET Analyzer, built using the DMCF.
DMET Microarray
§ Pharmacogenomics is the science that allows us to predict a response to drugs based on an individual's genetic makeup; it searches for correlations between gene expression or Single Nucleotide Polymorphisms (SNPs) of a patient's genome and the toxicity or efficacy of a drug.
§ SNP data, produced by microarray platforms, need to be preprocessed and analyzed in order to find those correlations.
§ The DMET platform comprises 225 genes involved in the Absorption, Distribution, Metabolism and Excretion (ADME) of drugs.
DMET Case-Control Data Analysis Workflow
[Figure: the case-control data analysis workflow implemented in DMCF.]
DMET-Analyzer Scalability Issues
§ Parameter sweeping application: researchers often need datasets to be analyzed in parallel by multiple instances of the same data mining algorithm with different parameters.
§ Graphical workflow applications: researchers need support in application design.
§ Computational aspects: running times and scalability.
Trajectory Pattern Detection
§ Analyze trajectories of mobile users to discover movement patterns and rules.
§ A workflow that integrates frequent region detection, trajectory data synthesization, and trajectory pattern extraction.
Application Main Steps
Frequent Region Detection
• Detect the areas most densely passed through.
• Density-based clustering algorithm (DBSCAN).
• Further analysis: movement through areas.
Trajectory Data Synthesization
• Each point is substituted by the dense region it belongs to.
• The trajectory representation changes from movements between points into movements between frequent regions.
Trajectory Pattern Extraction
• Discovery of patterns from structured trajectories.
• T-Apriori algorithm, i.e. an ad-hoc modified version of Apriori.
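A minimal JS4Cloud-style sketch of these three steps is shown below; the tool identifiers (DBScan, Synthesizer, TApriori) and their parameter names are assumptions for illustration, not the actual DMCF tool signatures.

// Frequent region detection, trajectory synthesization, pattern extraction.
var TRef = Data.get("Trajectories");
var RRef = Data.define("FrequentRegions");
DBScan({dataset:TRef, regions:RRef});                    // step 1
var SRef = Data.define("SynthTrajectories");
Synthesizer({dataset:TRef, regions:RRef, output:SRef});  // step 2: points -> regions
var PRef = Data.define("TrajectoryPatterns");
TApriori({dataset:SRef, patterns:PRef});                 // step 3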
Application Workflow
[Figure: the trajectory pattern detection workflow.]
Workflow Implementation
§ Workflow implementing the trajectory pattern detection algorithm:
  § each node represents either a data source or a data mining tool;
  § each edge represents an execution dependency among nodes;
  § some nodes are labeled with the array notation:
    § a compact way to represent multiple instances of the same dataset or tool;
    § very useful for building complex workflows (data/task parallelism, parameter sweeping, etc.).
§ 128 parallel tasks!
Experimental Evaluation: Turnaround Time
§ Turnaround time vs. data size (up to 128 timestamps), for different numbers of servers:
  • it increases proportionally with the input size;
  • it decreases proportionally as computing resources increase.
§ Turnaround time vs. number of servers (up to 64), for different data sizes:
  • comparison of parallel vs. sequential execution;
  • for dataset D16 (D128), the turnaround time drops from 8.3 (68) hours to about 0.5 (1.4) hours.
Experimental Evaluation: Scalability Indicators
§ Speedup:
  • notable trend up to 16 nodes (around 14.5);
  • good trend for higher numbers of nodes (up to 49.3), influenced by the sequential steps.
§ Scale-up:
  • comparable times when data size and number of servers increase proportionally;
  • the DBSCAN step (parallel) takes most of the total time;
  • the other (sequential) steps increase with larger datasets.
Script-based workflows
§ We extended DMCF by adding a script-based data analysis programming model as a more flexible programming interface.
§ Script-based workflows are an effective alternative to graphical programming.
§ A script language allows experts to program complex applications more rapidly, more concisely, and with greater flexibility.
§ The idea is to offer a script-based data analysis language as an additional, more flexible programming interface for skilled users.
The JS4Cloud script language
§ JS4Cloud (JavaScript for Cloud) is a language for programming data analysis workflows.
§ Main benefits of JS4Cloud:
  § it is based on a well-known scripting language, so users do not have to learn a new language from scratch;
  § it implements data-driven task parallelism that automatically spawns ready-to-run tasks onto the available Cloud resources;
  § it exploits implicit parallelism, so application workflows can be programmed in a fully sequential way (no user duties for work partitioning, synchronization, and communication).
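For example, in the following sketch the two Sampler calls have no data dependency (the Sampler tool and its parameters are taken from the pipeline example shown later), so the runtime can run them as two parallel tasks even though the script is written sequentially:

var DRef = Data.get("Census");
var S1Ref = Data.define("CensusSample1");
var S2Ref = Data.define("CensusSample2");
// No dependency between the two calls below: the runtime infers this
// from the data references and spawns them as two parallel tasks.
Sampler({input:DRef, percent:0.25, output:S1Ref});
Sampler({input:DRef, percent:0.50, output:S2Ref});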
JS4Cloud functions
JS4Cloud implements three additional functionalities, provided by the following set of functions:
§ Data.get, for accessing one or a collection of datasets stored in the Cloud;
§ Data.define, for defining new data elements that will be created at runtime as the result of a tool execution;
§ Tool, to invoke the execution of a software tool made available in the Cloud as a service.
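A minimal sketch combining these functions, using the direct tool-invocation style of the examples that follow (the Filter tool and its parameter names are hypothetical):

var DRef = Data.get("Census");               // access a dataset stored in the Cloud
var FRef = Data.define("FilteredCensus");    // output element created at runtime
Filter({input:DRef, output:FRef});           // invoke a Cloud tool as a service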
Script-based applications
§ Code-defined workflows are fully equivalent to graphically-defined ones.
JS4Cloud patterns: Single task

var DRef = Data.get("Customers");
var nc = 5;
var MRef = Data.define("ClustModel");
K-Means({dataset:DRef, numClusters:nc, model:MRef});
JS4Cloud patterns: Pipeline

var DRef = Data.get("Census");
var SDRef = Data.define("SCensus");
Sampler({input:DRef, percent:0.25, output:SDRef});
var MRef = Data.define("CensusTree");
J48({dataset:SDRef, confidence:0.1, model:MRef});
JS4Cloud patterns: Data partitioning

var DRef = Data.get("CovType");
var TrRef = Data.define("CovTypeTrain");
var TeRef = Data.define("CovTypeTest");
PartitionerTT({dataset:DRef, percTrain:0.70,
               trainSet:TrRef, testSet:TeRef});
JS4Cloud patterns: Data partitioning

var DRef = Data.get("NetLog");
var PRef = Data.define("NetLogParts", 16);
Partitioner({dataset:DRef, datasetParts:PRef});
JS4Cloud patterns: Data aggregation

var M1Ref = Data.get("Model1");
var M2Ref = Data.get("Model2");
var M3Ref = Data.get("Model3");
var BMRef = Data.define("BestModel");
ModelChooser({model1:M1Ref, model2:M2Ref,
              model3:M3Ref, bestModel:BMRef});
JS4Cloud patterns: Data aggregation

var MsRef = Data.get(new RegExp("^Model"));
var BMRef = Data.define("BestModel");
ModelChooser({models:MsRef, bestModel:BMRef});
JS4Cloud patterns: Parameter sweeping

var TRef = Data.get("TrainSet");
var nMod = 5;
var MRef = Data.define("Model", nMod);
var min = 0.1;
var max = 0.5;
for(var i=0; i<nMod; i++)
  J48({dataset:TRef, model:MRef[i],
       confidence:(min+i*(max-min)/(nMod-1))});
JS4Cloud patterns: Input sweeping

var nMod = 16;
var MRef = Data.define("Model", nMod);
for(var i=0; i<nMod; i++)
  J48({dataset:TsRef[i], model:MRef[i],
       confidence:0.1});
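The input sweeping and parameter sweeping patterns can also be combined; the following is a sketch under the assumption that TsRef holds the 16 partitions produced by the earlier Partitioner example:

// 16 partitions x 4 confidence values = 64 independent J48 tasks,
// all spawned in parallel by the data-driven runtime.
var nParts = 16;
var nConf = 4;
var MRef = Data.define("Model", nParts * nConf);
for (var i = 0; i < nParts; i++)
  for (var j = 0; j < nConf; j++)
    J48({dataset:TsRef[i], model:MRef[i*nConf + j],
         confidence:0.1 + j*0.1});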
Parallelism exploitation
Monitoring interface
§ A snapshot of the application during its execution, monitored through the programming interface.
Performance evaluation
• Input dataset: 46 million tuples.
• Cloud used: up to 64 virtual servers (single-core 1.66 GHz CPU, 1.75 GB of memory, 225 GB of disk).
Turnaround and speedup
[Charts: the turnaround time falls from 107 hours (4.5 days) on one server to 2 hours on 64 servers; the highlighted speedup values are 7.6 and 50.8.]
Efficiency
[Chart: highlighted efficiency values of 0.96, 0.9 and 0.8 as the number of servers grows.]
Another application example
• Ensemble learning workflow (gene analysis for classifying cancer types).
• Turnaround time: 162 minutes on 1 server, 11 minutes on 19 servers.
• Speedup: 14.8.
Final remarks
§ Data mining and knowledge discovery tools are needed to support finding what is interesting and valuable in big data.
§ Cloud computing systems can effectively be used as scalable platforms for service-oriented data mining.
§ Design and programming tools are needed for the simplicity and scalability of complex data analysis processes.
§ The DMCF and its programming interfaces support users in implementing and running scalable data mining applications.
Is "Big" a moving target?
✓ Yes, but not only because of the increasing size of data.
✓ "Big" is a misleading term.
✓ It must include the complexity (and difficulty) of handling huge amounts of heterogeneous data.
✓ In the first report (2001) on Big Data, the term "Big" was not actually used.
Is "Big" a moving target? Some examples
§ Some examples of data challenges we face today:
  § scientific data produced at a rate of hundreds of gigabits per second that must be stored, filtered and analyzed;
  § ten million images per day that must be mined (analyzed) in parallel;
  § one billion tweets/posts queried in real time on an in-memory database.
Are the challenges faced today different from the challenges faced 10 years ago?
§ Yes, because there are many more data sources than 10 years ago.
§ Yes, because we want to solve more complex problems.
§ No, because in data mining we still work on data produced with different goals.
Is Size/Volume the most important issue in Knowledge Discovery?
§ No, volume is only one dimension of the problem.
§ The most important issue is Value.
§ Size and complexity represent the problem; value is the real benefit.
Smart algorithms and scalable systems
The combination of
§ Big Data analytics and knowledge-discovery techniques with
§ scalable computing systems
is an effective strategy for producing new insights in a shorter time.
§ Clouds can help.
Data mining for social good
§ It is now time for the public sector to invest in Big Data collection and analysis.
§ It could improve the quality of life of citizens and the efficiency of public administrations.
§ European countries must do more in this area.
Ongoing & future work
§ DtoK Lab is a startup that originated from our work in this area: www.scalabledataanalytics.com
§ The DMCF system is delivered on public clouds as a high-performance Software-as-a-Service (SaaS) offering innovative data analysis tools and applications.
§ Applications in the areas of social data analysis, urban computing, air traffic and others have been developed with JS4Cloud.
Some publications
§ D. Talia, P. Trunfio, Service-Oriented Distributed Knowledge Discovery, CRC Press, USA, 2012.
§ D. Talia, "Clouds for Scalable Big Data Analytics", IEEE Computer, 46(5), pp. 98-101, 2013.
§ F. Marozzo, D. Talia, P. Trunfio, "A Cloud Framework for Parameter Sweeping Data Mining Applications", Proc. CloudCom 2011, IEEE CS Press, pp. 367-374, Dec. 2011.
§ F. Marozzo, D. Talia, P. Trunfio, "A Cloud Framework for Big Data Analytics Workflows on Azure", in: Clouds, Grids and Big Data, IOS Press, 2013.
§ F. Marozzo, D. Talia, P. Trunfio, "Using Clouds for Scalable Knowledge Discovery Applications", Euro-Par Workshops, LNCS, Springer, pp. 220-227, Aug. 2012.
Thanks
Questions?
Credits: Eugenio Cesario, Carmela Comito, Deborah Falcone, Paolo Trunfio