

Scalable Data Mining on Parallel, Distributed and Cloud Computing Systems

Domenico Talia
DIMES, Università della Calabria & DtoK Lab, Italy
[email protected]

Lecture Goals

§ Parallel systems, distributed computing infrastructures such as Grids and P2P networks, and Cloud computing platforms offer effective support for addressing both the computational and the data storage needs of Big Data mining and parallel analytics applications.

§ Complex data mining tasks involve data- and compute-intensive algorithms that require large storage facilities together with high-performance processors to obtain results in reasonable times.


Lecture Goals

§ In this lecture we introduce the most relevant topics and the main research issues in high-performance data mining, including
  § parallel data mining strategies,
  § distributed analysis techniques, and
  § knowledge services and Cloud-based data mining.

§ We discuss parallel models, scalable algorithms, data mining programming tools, and applications.

General Outline

LESSONS

1. Parallel Data Mining Strategies
2. Distributed and Service-oriented Data Mining
3. Data Mining Tools on Clouds


LESSON 2

Data Mining Tools on Clouds

Talk outline

§ Big problems and Big data
§ Using Clouds for data mining
§ A collection of services for scalable data analysis
§ Data mining workflows
§ The Data Mining Cloud Framework (DMCF)
§ JS4Cloud for programming service-oriented workflows
§ Final remarks


Goals (1)

§ KDD and data mining techniques are used in many domains to extract useful knowledge from big datasets.

§ KDD applications range from
  § single-task applications,
  § parameter-sweeping applications / regular parallel applications, to
  § complex applications (workflow-based, distributed, parallel).

§ Cloud computing can provide developers and end-users with the computing and storage services and the scalable execution mechanisms needed to efficiently run all these classes of applications.

Goals (2)

§ How to use Cloud services for the scalable execution of data analysis workflows.

§ A programming environment for data analysis: the Data Mining Cloud Framework (DMCF).

§ The visual programming interface VL4Cloud and the script-based JS4Cloud language for implementing service-oriented workflows.

§ Evaluating the performance of data mining workflows on DMCF.


Data analysis as a service

§ PaaS (Platform as a Service) can be an appropriate model for building frameworks that allow users to design and execute data mining applications.

§ SaaS (Software as a Service) can be an appropriate model for implementing scalable data mining applications.

§ These two Cloud service models can be effectively exploited for delivering data analysis tools and applications as services.

Service-Oriented Data Mining

§ Knowledge discovery (KDD) and data mining (DM) are
  § compute- and data-intensive processes/tasks,
  § often based on the distribution of data, algorithms, and users.

§ Large-scale service-oriented systems (like Clouds) can integrate both distributed computing and parallel computing, which makes them useful platforms.

§ They also offer
  § security, resource information, data access and management, communication, scheduling, SLAs, ...


Services for Distributed Data Mining

§ By exploiting the SOA model it is possible to define basic services for supporting distributed data mining tasks and applications.

§ These services can address all the aspects of data mining and knowledge discovery processes:
  • data selection and transport services,
  • data analysis services,
  • knowledge model representation services, and
  • knowledge visualization services.

Services for Distributed Data Mining

§ It is possible to design services corresponding to:

Single KDD Steps
All the steps that compose a KDD process, such as preprocessing, filtering, and visualization, are expressed as services.

Single Data Mining Tasks
This level includes tasks such as classification, clustering, and association rule discovery.

Distributed Data Mining Patterns
This level implements, as services, patterns such as collective learning, parallel classification, and meta-learning models.

Data Mining Applications and KDD Processes
This level includes the previous tasks and patterns composed into multi-step workflows.
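As a small illustration of composing the first two levels, a KDD-step service can feed a data mining task service inside one workflow. The sketch below uses the JS4Cloud notation introduced later in this lecture; the Filter tool name is a hypothetical placeholder for any preprocessing service.

// A KDD-step service (filtering) composed with a DM-task service (classification).
// "Filter" is a hypothetical preprocessing tool name; J48 is a classification tool.
var DRef = Data.get("RawData");           // dataset stored in the Cloud
var FDRef = Data.define("FilteredData");  // created at runtime by the KDD step
Filter({input: DRef, output: FDRef});     // KDD-step service
var MRef = Data.define("DataModel");      // created at runtime by the DM task
J48({dataset: FDRef, confidence: 0.1, model: MRef});  // data mining task service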


Services for Distributed Data Mining

§ This collection of data mining services implements an
  Open Service Framework for Distributed Data Mining.

[Layered diagram: KDD Step Services → Data Mining Task Services → Distributed Data Mining Patterns → Distributed Data Mining Applications and KDD Processes.]

Services for Distributed Data Mining

§ The framework allows developers to program distributed KDD processes as a composition of single and/or aggregated services available over a service-oriented infrastructure.

§ These services should exploit other basic Cloud services for data transfer, replica management, data integration, and querying.

[Figure: a service-oriented Cloud workflow.]


Services for Distributed Data Mining

§ By exploiting Cloud service features it is possible to develop data mining services accessible any time and anywhere (remotely and from small devices).

§ This approach can produce not only service-based distributed data mining applications, but also
  § data mining services for communities/virtual organizations,
  § distributed data analysis services on demand,
  § a sort of knowledge discovery eco-system formed of a large number of decentralized data analysis services.

The Data Mining Cloud Framework

§ The Data Mining Cloud Framework supports workflow-based KDD applications, expressed (visually or in a script language) as graphs that link together data sources, data mining algorithms, and visualization tools.

[Example workflow: a Splitter partitions the CoverType dataset into training samples Sample1-Sample3 and a test set Sample4 (task T1); three J48 instances are trained in parallel on the samples (tasks T2-T4), producing Model1-Model3; a Voter combines the three models on the test set into the final Model (task T5).]
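As a preview of the JS4Cloud script interface presented later in this lecture, the workflow above could be sketched as follows. This is only an illustrative sketch: the Splitter and Voter tool signatures are assumptions; only J48 appears with these parameters in the pattern examples below.

// Illustrative JS4Cloud sketch of the CoverType ensemble workflow.
var DRef = Data.get("CoverType");
var nMod = 3;
var SRef = Data.define("Sample", nMod);       // three training samples
var TeRef = Data.define("SampleTest");        // test set
Splitter({dataset: DRef, samples: SRef, testSet: TeRef});   // task T1 (hypothetical signature)
var MRef = Data.define("Model", nMod);
for (var i = 0; i < nMod; i++)                // tasks T2-T4 run in parallel
  J48({dataset: SRef[i], confidence: 0.1, model: MRef[i]});
var FMRef = Data.define("FinalModel");
Voter({models: MRef, testSet: TeRef, finalModel: FMRef});   // task T5 (hypothetical signature)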


Architecture Components

§ Compute is the computational environment for executing Cloud applications:
  § Web role: Web-based applications.
  § Worker role: batch applications.
  § VM role: virtual machine images.

§ Storage provides scalable storage elements:
  § Blobs: store binary and text data.
  § Tables: non-relational databases.
  § Queues: communication between components.

§ The Fabric controller links the physical machines of a single data center:
  § the Compute and Storage services are built on top of this component.

[Architecture diagram: on the Cloud platform, Web Role instances (the website) submit tasks to a Task Queue; Worker Role instances consume tasks, read input datasets from Blobs, write the resulting data mining models back to Blobs, and record progress in a Task Status Table; Compute and Storage run on top of the Fabric.]
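To make the execution flow in the diagram concrete, here is a minimal sketch of a worker-role loop. All helper names (dequeueTask, updateTaskStatus, readBlob, writeBlob, runTool) are hypothetical stand-ins for the platform's queue, table, and blob APIs, not DMCF code.

// Hypothetical worker-role loop: consume tasks from the queue, run the
// data mining tool, store the model, and update the status table.
function workerLoop() {
  while (true) {
    var task = dequeueTask("TaskQueue");         // hypothetical Queue API
    if (task === null) continue;                 // queue empty: poll again
    updateTaskStatus(task.id, "RUNNING");        // hypothetical Table API
    var input = readBlob(task.inputDataset);     // hypothetical Blob API
    var model = runTool(task.tool, input, task.parameters);  // execute the tool
    writeBlob(task.outputModel, model);          // store the mined model
    updateTaskStatus(task.id, "DONE");
  }
}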

The Data Mining Cloud Framework: Architecture

[Figure: overall DMCF architecture.]


The Data Mining Cloud Framework – Mapping

[Figure: mapping of workflow tasks onto Cloud resources.]

The Data Mining Cloud Framework – Execution

[Figure: workflow execution in DMCF.]


Example applications (1)

Finance: prediction of personal income based on census data.
Networks: discovery of network attacks from log analysis.
E-Health: disease classification based on gene analysis.

Example applications (2)

Smart City: car trajectory pattern detection applications.
Biosciences: drug metabolism associations in pharmacogenomics.


The Cloud4SNP workflow

§ DMET (Drug Metabolism Enzymes and Transporters) has been designed specifically to test drug metabolism associations in pharmacogenomics case-control studies.

§ Cloud4SNP is a Cloud implementation of the DMET Analyzer built using DMCF.

DMET Microarray

§ Pharmacogenomics is the science that predicts the response to drugs based on an individual's genetic makeup; it searches for correlations between gene expression or Single Nucleotide Polymorphisms (SNPs) of a patient's genome and the toxicity or efficacy of a drug.

§ SNP data, produced by microarray platforms, need to be preprocessed and analyzed in order to find such correlations.

§ The DMET (Drug Metabolism Enzymes and Transporters) platform comprises 225 genes involved in the Absorption, Distribution, Metabolism and Excretion (ADME) of drugs.


DMET Case-Control Data Analysis workflow

[Figure: the DMET case-control data analysis workflow implemented in DMCF.]

DMET-Analyzer Scalability Issues

§ Parameter-sweeping application: researchers often need datasets to be analyzed in parallel by multiple instances of the same data mining algorithm with different parameters.

§ Graphical workflow applications: researchers need support in application design.

§ Computational aspects: running times and scalability.
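A parameter-sweeping run of this kind can be expressed in JS4Cloud (introduced below). This is only an illustrative sketch: the DMETAnalyzer tool name, its parameter names, and the swept values are hypothetical stand-ins for the actual Cloud4SNP tools.

// Hypothetical sketch: run the same analysis on one SNP dataset
// with several significance thresholds, in parallel.
var DRef = Data.get("SNPDataset");                   // hypothetical dataset name
var pValues = [0.01, 0.02, 0.05, 0.1];               // thresholds to sweep
var RRef = Data.define("SNPResult", pValues.length); // one result per threshold
for (var i = 0; i < pValues.length; i++)             // instances run in parallel
  DMETAnalyzer({dataset: DRef, pValue: pValues[i], result: RRef[i]});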


Cloud4SNP workflow (execution)

[Figure: the Cloud4SNP workflow during execution.]

Cloud4SNP workflow (final result)

[Figure: the Cloud4SNP workflow after completion.]


Performance evaluation

[Charts: Cloud4SNP turnaround time decreases from about 150 minutes to about 10 minutes as computing resources increase.]


Trajectory Pattern Detection

§ Analyze trajectories of mobile users to discover movement patterns and rules.

§ A workflow that integrates frequent region detection, trajectory data synthesization, and trajectory pattern extraction.

Application Main Steps

Frequent Region Detection
• Detect the areas most densely passed through.
• Density-based clustering algorithm (DBSCAN).
• Further analysis: movement through areas.

Trajectory Data Synthesization
• Each point is substituted by the dense region it belongs to.
• The trajectory representation is changed from movements between points into movements between frequent regions.

Trajectory Pattern Extraction
• Discovery of patterns from structured trajectories.
• T-Apriori algorithm, i.e. an ad-hoc modified version of Apriori.


Application Workflow

[Figure: the trajectory pattern detection workflow.]

Workflow Implementation

§ A workflow implementing the trajectory pattern detection algorithm:
  § each node represents either a data source or a data mining tool;
  § each edge represents an execution dependency between nodes;
  § some nodes are labeled with the array notation,
    § a compact way to represent multiple instances of the same dataset or tool,
    § very useful for building complex workflows (data/task parallelism, parameter sweeping, etc.).

§ With array nodes the workflow expands to 128 parallel tasks (see the sketch below).
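As a rough illustration of how the array notation yields 128 parallel tasks, the sketch below partitions the trajectory data and runs DBSCAN on each part. The dataset name, the DBScan parameter names, and the partitioning step are assumptions for illustration, not the exact tools of the published workflow.

// Illustrative sketch of array notation: 128 parallel DBSCAN tasks.
var DRef = Data.get("Trajectories");              // hypothetical dataset name
var n = 128;
var PRef = Data.define("TrajParts", n);           // array node: 128 partitions
Partitioner({dataset: DRef, datasetParts: PRef});
var FRRef = Data.define("FrequentRegions", n);    // array node: 128 outputs
for (var i = 0; i < n; i++)                       // expands to 128 parallel tasks
  DBScan({dataset: PRef[i], regions: FRRef[i]});  // hypothetical parameter names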


Experimental Evaluation

§ Turnaround time

§ vs. data size (up to 128 timestamps), for different numbers of servers:
  • it increases proportionally with the input size;
  • it decreases proportionally as computing resources increase.

§ vs. the number of servers (up to 64), for different data sizes:
  • comparison of parallel vs. sequential execution;
  • for D16 (D128), it drops from 8.3 (68) hours to about 0.5 (1.4) hours.

Experimental Evaluation

§ Scalability indicators

§ Speed-up:
  • notable (near-linear) trend up to 16 nodes;
  • still a good trend for higher numbers of nodes (influence of the sequential steps).

§ Scale-up:
  • comparable times when data size and number of servers increase proportionally;
  • the DBSCAN step (parallel) takes most of the total time;
  • the other (sequential) steps increase with larger datasets.

[Charts: measured speed-up values of 14.5 and 49.3 are highlighted.]
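For reference, these indicators follow the standard definitions (a brief recap, not stated explicitly on the slides), where T(n) is the turnaround time on n servers:

\[
  \text{speed-up: } S(n) = \frac{T(1)}{T(n)}, \qquad
  \text{efficiency: } E(n) = \frac{S(n)}{n}
\]

If the 49.3 callout is the speed-up for D128 on 64 servers, it matches the reported times: 68/1.4 ≈ 48.6, consistent with 49.3 once the rounding of the measured times is taken into account.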


Script-based workflows

§ We extended DMCF with a script-based data analysis programming model as a more flexible programming interface.

§ Script-based workflows are an effective alternative to graphical programming.

§ A script language allows experts to program complex applications more rapidly, more concisely, and with higher flexibility.

§ The idea is to offer a script-based data analysis language as an additional, more flexible programming interface for skilled users.

The JS4Cloud script language

§ JS4Cloud (JavaScript for Cloud) is a language for programming data analysis workflows.

§ Main benefits of JS4Cloud:
  § it is based on a well-known scripting language, so users do not have to learn a new language from scratch;
  § it implements data-driven task parallelism that automatically spawns ready-to-run tasks on the available Cloud resources;
  § it exploits implicit parallelism, so application workflows can be programmed in a fully sequential way (no user duties for work partitioning, synchronization, or communication).


JS4Cloud functions

JS4Cloud provides three additional functionalities, implemented by the following functions:
§ Data.get, for accessing one or a collection of datasets stored in the Cloud;
§ Data.define, for defining new data elements that will be created at runtime as the result of a tool execution;
§ Tool, to invoke the execution of a software tool available in the Cloud as a service.

Script-based applications

§ Code-defined workflows are fully equivalent to graphically defined ones:

[Figure comparing a VL4Cloud workflow with the equivalent JS4Cloud script.]


JS4Cloud patterns

Single task:

var DRef = Data.get("Customers");       // input dataset stored in the Cloud
var nc = 5;                             // number of clusters
var MRef = Data.define("ClustModel");   // model created at runtime
K-Means({dataset: DRef, numClusters: nc, model: MRef});

JS4Cloud patterns

Pipeline:

var DRef = Data.get("Census");          // input dataset
var SDRef = Data.define("SCensus");     // sampled dataset (25% of Census)
Sampler({input: DRef, percent: 0.25, output: SDRef});
var MRef = Data.define("CensusTree");   // decision-tree model
J48({dataset: SDRef, confidence: 0.1, model: MRef});
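Note how the pipeline is chained purely through data dependencies: J48 consumes SCensus, which Sampler produces, so DMCF's data-driven task parallelism starts the J48 task only when its input becomes available, with no explicit synchronization in the script.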


JS4Cloud patterns

Data partitioning (train/test split):

var DRef = Data.get("CovType");
var TrRef = Data.define("CovTypeTrain");   // training set (70%)
var TeRef = Data.define("CovTypeTest");    // test set (30%)
PartitionerTT({dataset: DRef, percTrain: 0.70,
               trainSet: TrRef, testSet: TeRef});

JS4Cloud patterns

Data partitioning (into multiple parts):

var DRef = Data.get("NetLog");
var PRef = Data.define("NetLogParts", 16);   // array of 16 partitions
Partitioner({dataset: DRef, datasetParts: PRef});


JS4Cloud patterns

Data aggregation:

var M1Ref = Data.get("Model1");
var M2Ref = Data.get("Model2");
var M3Ref = Data.get("Model3");
var BMRef = Data.define("BestModel");   // best of the three input models
ModelChooser({model1: M1Ref, model2: M2Ref,
              model3: M3Ref, bestModel: BMRef});

JS4Cloud patterns

Data aggregation (with name patterns):

var MsRef = Data.get(new RegExp("^Model"));   // all datasets whose names start with "Model"
var BMRef = Data.define("BestModel");
ModelChooser({models: MsRef, bestModel: BMRef});


JS4Cloud patterns

Parameter sweeping:

var TRef = Data.get("TrainSet");
var nMod = 5;                             // number of models to train
var MRef = Data.define("Model", nMod);    // array of 5 model references
var min = 0.1;
var max = 0.5;
for (var i = 0; i < nMod; i++)            // one J48 run per confidence value
  J48({dataset: TRef, model: MRef[i],
       confidence: (min + i * (max - min) / (nMod - 1))});
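Although written as a sequential loop, the five J48 invocations have no data dependencies on each other (they share only the read-only training set), so DMCF's data-driven task parallelism executes them concurrently on the available Cloud resources.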

JS4Cloud patterns

Input sweeping:

// TsRef is assumed to be an array of 16 training sets, e.g. the
// partitions produced by the Partitioner pattern shown earlier.
var nMod = 16;
var MRef = Data.define("Model", nMod);
for (var i = 0; i < nMod; i++)            // one J48 run per input partition
  J48({dataset: TsRef[i], model: MRef[i], confidence: 0.1});


Parallelism exploitation

[Figure: parallelism exploitation during workflow execution.]

Monitoring interface

§ A snapshot of an application during its execution, monitored through the programming interface.


Performance evaluation

• Input dataset: 46 million tuples.
• Cloud used: up to 64 virtual servers (single-core 1.66 GHz CPU, 1.75 GB of memory, and 225 GB of disk).

Turnaround and speedup

[Charts: turnaround time decreases from 107 hours (4.5 days) on one server to 2 hours on 64 servers; measured speedup values include 7.6 and 50.8.]


Efficiency

[Chart: efficiency values of 0.96, 0.9, and 0.8 are highlighted.]

Another application example

• Ensemble learning workflow (gene analysis for classifying cancer types).
• Turnaround time: 162 minutes on 1 server, 11 minutes on 19 servers.
• Speedup: 14.8.


Final remarks

§ Data mining and knowledge discovery tools are needed to support finding what is interesting and valuable in big data.

§ Cloud computing systems can effectively be used as scalable platforms for service-oriented data mining.

§ Design and programming tools are needed for simplicity and scalability of complex data analysis processes.

§ DMCF and its programming interfaces support users in implementing and running scalable data mining.

Big is quite a moving target?

✓ Yes, but not only for the increasing size of data.
✓ Big is a misleading term.
✓ It must include the complexity (and difficulty) of handling huge amounts of heterogeneous data.
✓ In the first report (2001) on Big Data, the term "Big" was not actually used.


Big is quite a moving target? Some examples

§ Some data challenges we face today:
  § scientific data produced at a rate of hundreds of gigabits per second that must be stored, filtered, and analyzed;
  § ten million images per day that must be mined (analyzed) in parallel;
  § one billion tweets/posts queried in real time on an in-memory database.

Are the challenges faced today different from the challenges faced 10 years ago?

§ Yes, because there are many more data sources than 10 years ago.

§ Yes, because we want to solve more complex problems.

§ No, because in data mining we still work on data produced with different goals.


Is Size/Volume the most important issue in Knowledge Discovery?

§ No, volume is only one dimension of the problem.

§ The most important issue is Value.

§ Size and complexity represent the problem; value is the real benefit.

Smart algorithms and scalable systems

§ The combination of
  § Big Data analytics and knowledge discovery techniques with
  § scalable computing systems
  is an effective strategy for producing new insights in a shorter time.

§ Clouds can help.


Data mining for social good

§ It is now time for the public sector to invest in Big Data collection and analysis.

§ It could improve the quality of life of citizens and the efficiency of public administrations.

§ European countries must do more in this area.

Ongoing & future work

§ DtoK Lab is a startup that originated from our work in this area.
  www.scalabledataanalytics.com

§ The DMCF system is delivered on public Clouds as high-performance Software-as-a-Service (SaaS) to provide innovative data analysis tools and applications.

§ Applications in the areas of social data analysis, urban computing, air traffic, and others have been developed with JS4Cloud.


Some publications

§ D. Talia, P. Trunfio, Service-Oriented Distributed Knowledge Discovery, CRC Press, USA, 2012.

§ D. Talia, "Clouds for Scalable Big Data Analytics", IEEE Computer, 46(5), pp. 98-101, 2013.

§ F. Marozzo, D. Talia, P. Trunfio, "A Cloud Framework for Parameter Sweeping Data Mining Applications", Proc. CloudCom 2011, IEEE CS Press, pp. 367-374, Dec. 2011.

§ F. Marozzo, D. Talia, P. Trunfio, "A Cloud Framework for Big Data Analytics Workflows on Azure", in Clouds, Grids and Big Data, IOS Press, 2013.

§ F. Marozzo, D. Talia, P. Trunfio, "Using Clouds for Scalable Knowledge Discovery Applications", Euro-Par Workshops, LNCS, Springer, pp. 220-227, Aug. 2012.

Thanks

Questions?

Credits: Eugenio Cesario, Carmela Comito, Deborah Falcone, Paolo Trunfio