Parallel Computing in R
BioC 2009, Seattle, July 2009



Who are REvolution Computing?

REvolution Computing:
  – Is a commercial open-source company, founded in 2007
  – Provides services and products based on R
      • The "Red Hat"® for R
  – Produces free and subscription-based high-performance, enhanced distributions of R
  – Offers support, training, validation and other services around R
  – Has expertise in high-performance and distributed computing
  – Is a financial and technical contributor to the R community
  – Has operations in New Haven, Seattle, and San Francisco

The People of REvolution

• Martin Schultz, Chief Scientific Officer (Arthur K Watson Professor of Computer Science, Yale University; founder of Scientific Computing Associates; research in algorithm design, parallel programming environments and architectures)
• David Smith, Director of Community & R blogger (co-author of An Introduction to R, ESS)
• Bryan Lewis, Ambassador of Cool (aka Director of Systems Engineering; applied math interests in numerical analysis of inverse problems; former CEO of Rocketcalc)
• Danese Cooper, Open Source Diva (board of directors, Open Source Initiative; member, Apache Software Foundation; advisory board, Mozilla.org; previously senior director, open source strategies at Intel and Sun)
• Steve Weston, Senior Research Scientist, Director of Engineering (REvolution and Scientific Computing Associates; development of NetWorkSpaces for parallel programming with R, Python, Ruby, and Matlab; Network Linda, Paradise, and Piranha)
• Jay Emerson, Dept. of Statistics, Yale University (author of the bigmemory package)

What is REvolution R?

• REvolution R is the free distribution of R
  – Optimized for speed
  – Uses multiple CPUs/cores for performance
  – For Windows and MacOS (soon: Ubuntu)
  – Support via community forums
• REvolution R Enterprise is our enhanced, subscription-only distribution of R
  – Telephone/email support from real R experts
  – Suitable for use in regulated/validated environments
  – Includes proprietary ParallelR packages for reliable distributed computing with R
      • on clusters or in the cloud
  – Supported on 64-bit Windows, Linux

Supporting the R Community

We are an open source company supporting the R community:

• Benefactor of R Foundation
• Financial supporter of R conferences and user groups
• New functionality developed in core R to be contributed under GPL
  – 64-bit Windows support
  – Step-debugging support
• REvangelism

"Revolutions" Blog: blog.revolution-computing.com
Daily news about R, statistics, and open-source

In today's lab:

• Introduction to Parallel Processing
• Multi-Threaded Processing
  – Computing on the GPU
• Iterators
• The foreach loop
• Using multiple cores: SMP
• Cluster Computing
• Multi-Stratum parallelism
• Q&A / Exercises

Getting Started

• R 2.10.x, and 2.9.x on Windows:
    install.packages("foreach", type="source")
    install.packages("iterators", type="source")
• R 2.9.x, Mac/Linux only:
    install.packages("doMC")
    require(doMC)
• Windows/Mac:
  – Install REvolution R Enterprise 2.0 (R 2.7.2)
    require(doNWS)

Introduction to Parallel Processing

With an aside to High-Performance Computing

What is High-Performance Computing (HPC)?

HPC often means efficiently exploiting specialized hardware

Images copyright Cray, Xilinx, NVIDIA from upper-left, clockwise.

What is High-Performance Computing (HPC)?

• These days, HPC is frequently associated with COTS* cluster computing and with SIMD vectorization and pipelining (GPUs)
• New: cloud computing

*Commodity, off the shelf

Image from ncbr.sdsc.edu

What is High-Performance Computing (HPC)?

• HPC is often concerned with multi-processing (parallel processing), the coordination of multiple, simultaneously running (sub)programs
  – Threads
  – Processes
  – Clusters

Image Copyright Lawrence Livermore National Lab

What is High-Performance Computing (HPC)?

HPC often involves effectively managing huge data sets
  – Parallel file systems (GPFS, PVFS2, Lustre, GFS2, S3…)
  – Parallel data operations (map-reduce)
  – Working with high-performance databases
  – bigmemory package in R

Image Copyright HP (a multi-petabyte storage system)

A Taxonomy of Parallel Processing

• Multi-node / cluster / cloud computing (heavyweight processes)
  – Memory distributed across network
  – Examples: foreach, SNOW, Rmpi, batch processing
• Multi-core / multi-processor computing (heavyweight processes)
  – SMP: Symmetric Multi-Processing
  – Independent memory in shared space
  – Naturally scales to multi-node processing
  – Examples: multicore (Linux/MacOS), foreach
• Multi-threaded processing (lightweight processes)
  – Usually shared memory
  – Harder to scale out across networks
  – Examples: threaded linear-algebra libraries for R (ATLAS, MKL); GPU processors (CUDA/NVIDIA; Ct/Intel)

Multi-Threaded Processing

What is threaded programming?

• A thread is a kind of process that shares address space with its parent process
• Created, destroyed, managed and synchronized in C code
  – POSIX threads
  – OpenMP threads
• Fast, but difficult to program
  – Easy to overwrite variables
  – Need to worry about synchronization

Exploiting threads with R

• R links to BLAS (Basic Linear Algebra Subprograms) libraries for efficient vector/matrix operations
• Linux: Need to compile and link with threaded BLAS (ATLAS)
• Windows/Mac: REvolution R linked to Intel MKL libraries, uses as many threads as cores
  – Many higher-level operations optimized as well
• MacOS: CRAN binary uses veclib BLAS
  – threaded, pretty good performance
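A quick way to see whether a given R build benefits from a threaded BLAS is to time a BLAS-heavy operation. The sketch below uses only base R; the matrix size is an arbitrary assumption chosen to run in a moment, and the timing it prints depends entirely on the BLAS your R is linked against, so no expected figure is given.

```r
# Time two linear-algebra operations that are dispatched to BLAS/LAPACK.
# The 2000 x 200 size is an illustrative assumption, not taken from the slides.
set.seed(1)
X <- matrix(rnorm(2000 * 200), nrow = 2000)

# crossprod(X) computes t(X) %*% X through BLAS.
t_cross <- system.time(XtX <- crossprod(X))

# Singular values only (no singular vectors), computed through LAPACK.
t_svd <- system.time(d <- svd(X, nu = 0, nv = 0)$d)

cat("crossprod elapsed (s):", t_cross[["elapsed"]], "\n")
cat("svd elapsed (s):      ", t_svd[["elapsed"]], "\n")
```

Running the same script under a reference BLAS and under a threaded one (ATLAS, MKL, veclib) gives a rough, machine-specific comparison.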

REvolution R Performance: Multi-Threaded Processing

[Chart: REvolution R SVD Performance]
Example data matrix 150,000 x 500, fast.svd
Quad-core Intel Core2 CPU, Windows Vista 64-bit Workstation

GPU Programming

What is a GPU?

• Processing chip (or card) dedicated to fast floating-point operations
  – Originally for 3-D graphics calculations
• Highly parallel: 100's of processors on a single chip, capable of running 1000's of threads
• Usually includes dedicated high-speed RAM, accessible only by GPU
  – Need to transfer data in/out
• Programmed directly using custom C dialect/compilers
• >90% of new desktops/laptops have an integrated GPU

GeForce 8800 GT

• Launched Oct 29, 2007
• 512 MB of 256-bit memory
• 128 processors
• 512 simultaneous threads
• <$200
• Download NVIDIA CUDA Tools:
  – http://www.nvidia.com/object/cuda_home.html
• Tutorial
  – http://www.ddj.com/architect/207200659

Performance Comparison

Method                       Time (seconds)
Base R convolve function     9.89
AMD ACML                     6.29
FFTW (8 threads)             3.75
CUDA on GeForce 8800 GT      1.88 (single precision)

  Convolve 2 vectors of length 2^22
  60 MB of data
  Quad dual-core processor / GPU

Introducing Iterators

Thought Experiment: Drawing Cards

• You are the teacher of 10 grade-school pupils.
• Class project: draw each of the 52 playing cards as a poster.
• Each child has supplies of poster paper and crayons, but requires a reference card to copy.
• How to organize the pupils?

Iterators

> require(iterators)

• Generalized loop variable
• Value need not be atomic
  – Row of a matrix
  – Random data set
  – Chunk of a data file
  – Record from a database
• Create with: iter
• Get values with: nextElem
• Used as indexing argument with foreach

Iterators are memory friendly

• Allow data to be split into manageable pieces on the fly
• Help alleviate problems with processing large data structures
• Pieces can be processed in parallel
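As a small illustration of splitting data into pieces on the fly, the sketch below walks a matrix in blocks of rows and processes one block at a time. It assumes the `chunksize` argument of `iter()` for matrices in the iterators package; the 5 x 5 matrix and the block size of 2 are arbitrary choices, and `%do%` is used so that no parallel backend is required.

```r
library(iterators)
library(foreach)

M <- matrix(1:25, ncol = 5)

# Hand out the matrix two rows at a time; the last block holds the
# single remaining row.
blocks <- iter(M, by = "row", chunksize = 2)

# Each iteration only sees its own block. Swap %do% for %dopar% once a
# parallel backend is registered.
row_totals <- foreach(block = blocks, .combine = c) %do% rowSums(block)

print(row_totals)
```

With real data the blocks would typically come from a file or database iterator rather than from a matrix already held in memory.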

Iterators act as adaptors

• Allow your data to be processed by foreach without being converted
• Can iterate over matrices and data frames by row or by column:

    it <- iter(Boston, by="row")
    nextElem(it)

Numeric Iterator

> i <- iter(1:3)
> nextElem(i)
[1] 1
> nextElem(i)
[1] 2
> nextElem(i)
[1] 3
> nextElem(i)
Error: StopIteration
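Because an exhausted iterator signals an error with the message StopIteration, code that calls `nextElem` by hand usually wraps it in `tryCatch`. The helper below, `drain`, is a hypothetical name introduced for this sketch; it collects every remaining value and stops cleanly at the end, re-raising any other error.

```r
library(iterators)

# Collect all remaining values of an iterator into a list.
# `drain` is an illustrative helper, not part of the iterators package.
drain <- function(it) {
  out <- list()
  repeat {
    finished <- FALSE
    value <- tryCatch(
      nextElem(it),
      error = function(e) {
        # Only swallow the end-of-iteration signal; re-raise anything else.
        if (!identical(conditionMessage(e), "StopIteration")) stop(e)
        finished <<- TRUE
        NULL
      }
    )
    if (finished) break
    out[[length(out) + 1L]] <- value
  }
  out
}

values <- drain(iter(1:3))
str(values)
```

foreach performs this bookkeeping itself, so the pattern is only needed when driving an iterator manually.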

Long sequences

> i <- icount(1e9)
> nextElem(i)
[1] 1
> nextElem(i)
[1] 2
> nextElem(i)
[1] 3
> nextElem(i)
[1] 4
> nextElem(i)
[1] 5

Matrix dimensions

> M <- matrix(1:25, ncol=5)
> r <- iter(M, by="row")
> nextElem(r)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
> nextElem(r)
     [,1] [,2] [,3] [,4] [,5]
[1,]    2    7   12   17   22
> nextElem(r)
     [,1] [,2] [,3] [,4] [,5]
[1,]    3    8   13   18   23

Data File

> rec <- iread.table("MSFT.csv", sep=",", header=T, row.names=NULL)
> nextElem(rec)
  MSFT.Open MSFT.High MSFT.Low MSFT.Close MSFT.Volume MSFT.Adjusted
1     29.91     30.25     29.4      29.86    76935100         28.73
> nextElem(rec)
  MSFT.Open MSFT.High MSFT.Low MSFT.Close MSFT.Volume MSFT.Adjusted
1      29.7     29.97    29.44      29.81    45774500         28.68
> nextElem(rec)
  MSFT.Open MSFT.High MSFT.Low MSFT.Close MSFT.Volume MSFT.Adjusted
1     29.63     29.75    29.45      29.64    44607200         28.52
> nextElem(rec)
  MSFT.Open MSFT.High MSFT.Low MSFT.Close MSFT.Volume MSFT.Adjusted
1     29.65      30.1    29.53      29.93    50220200          28.8

Database

> library(RSQLite)
> m <- dbDriver('SQLite')
> con <- dbConnect(m, dbname="arrests")
> it <- iquery(con, 'select * from USArrests', n=10)
> nextElem(it)
            Murder Assault UrbanPop Rape
Alabama       13.2     236       58 21.2
Alaska        10.0     263       48 44.5
Arizona        8.1     294       80 31.0
Arkansas       8.8     190       50 19.5
California     9.0     276       91 40.6
Colorado       7.9     204       78 38.7
Connecticut    3.3     110       77 11.1
Delaware       5.9     238       72 15.8
Florida       15.4     335       80 31.9
Georgia       17.4     211       60 25.8

Infinite & Irregular sequences

iprime <- function() {
  lastPrime <- 1
  nextEl <- function() {
    lastPrime <<- as.numeric(nextprime(lastPrime))
    lastPrime
  }
  it <- list(nextElem=nextEl)
  class(it) <- c('abstractiter','iter')
  it
}

> require(gmp)
> p <- iprime()
> nextElem(p)
[1] 2
> nextElem(p)
[1] 3
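The same closure pattern works for any sequence that can be produced one value at a time. As a sketch that avoids the gmp dependency, here is an endless Fibonacci iterator; `ifib` is a hypothetical name chosen for this example, and the class vector mirrors the one used for `iprime`.

```r
library(iterators)

# An iterator over the Fibonacci numbers 1, 1, 2, 3, 5, 8, ...
# State lives in the enclosing function and is updated with <<-.
ifib <- function() {
  a <- 0
  b <- 1
  nextEl <- function() {
    value <- b
    nxt <- a + b
    a <<- b
    b <<- nxt
    value
  }
  it <- list(nextElem = nextEl)
  class(it) <- c("abstractiter", "iter")
  it
}

f <- ifib()
first_six <- vapply(1:6, function(k) nextElem(f), numeric(1))
print(first_six)  # 1 1 2 3 5 8
```

An iterator like this never signals StopIteration, so a loop that consumes it needs its own stopping rule.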


Looping with foreach

    foreach (var=iterator) %dopar% { statements }

  Evaluate statements until iterator terminates
  statements will reference variable var
  Values of { … } block collected into a list
  Runs sequentially (by default) (or force with %do%)

> foreach (j=1:4) %dopar% sqrt (j)
[[1]]
[1] 1

[[2]]
[1] 1.414214

[[3]]
[1] 1.732051

[[4]]
[1] 2

Warning message:
executing %dopar% sequentially: no parallel backend registered

Combining Results

> foreach(j=1:4, .combine=c) %dopar% sqrt(j)
[1] 1.000000 1.414214 1.732051 2.000000

> foreach(j=1:4, .combine='+', .inorder=FALSE) %dopar% sqrt(j)
[1] 6.146264

  When order of evaluation is unimportant, use .inorder=FALSE
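The effect of `.combine` is easiest to see by running one loop body with several combiners side by side. This sketch uses `%do%` so that it runs without any registered backend; the variable names are illustrative.

```r
library(foreach)

# Default: results are collected into a list.
as_list <- foreach(j = 1:4) %do% sqrt(j)

# c: results are concatenated into a vector.
as_vec <- foreach(j = 1:4, .combine = c) %do% sqrt(j)

# "+": results are reduced to a single sum.
total <- foreach(j = 1:4, .combine = "+") %do% sqrt(j)

# rbind: each iteration contributes one row of a matrix.
as_mat <- foreach(j = 1:4, .combine = rbind) %do% c(j = j, root = sqrt(j))

str(as_list)
print(as_vec)
print(total)
print(as_mat)
```

Reductions such as "+" do not depend on the order in which results arrive, which is why they pair naturally with `.inorder=FALSE`.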

Referencing global variables

> z <- 2
> f <- function (x) sqrt (x + z)
> foreach (j=1:4, .combine='+') %dopar% f(j)
[1] 8.417609

  foreach automatically inspects code and ensures unbound objects are propagated to the evaluation environment

Nested foreach execution

• foreach operations can be nested using the %:% operator
• Allows parallel execution across multiple loop levels, "unrolling" the inner loops

    foreach(i=1:3, .combine=cbind) %:%
      foreach(j=1:3, .combine=c) %dopar% (i + j)
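To see what the nested loop returns, the sketch below runs it sequentially with `%do%` and compares the result with the same table built by `outer`. Each inner loop produces one column (combined with `c`), and the outer loop binds the columns together.

```r
library(foreach)

nested <- foreach(i = 1:3, .combine = cbind) %:%
  foreach(j = 1:3, .combine = c) %do% (i + j)

# Reference: entry [j, i] is j + i.
reference <- outer(1:3, 1:3, "+")

print(nested)
print(all(unname(nested) == reference))
```

The column names that cbind attaches are dropped with unname() before comparing, since only the values matter here.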

Speeding up code with foreach

SMP Processing

Quick review of parallel and distributed computing in R

• NetWorkSpaces (package nws; SMP, distributed)
  – GPL, also commercially supported by REvolution Computing
  – Very cross-platform, distributed shared-memory paradigm
  – Fault-tolerant
• MultiCore (package multicore; SMP only)
  – Linux/MacOS (requires POSIX)
  – Uses fork to create new R processes
• Rmpi (package Rmpi; SMP, distributed)
  – Fine-grained control allows very high-performance calculations
  – Can be tricky to configure
  – Limited Windows and heterogeneous cluster support
• SNOW (package snow; SMP, distributed*)
  – Limited Windows support (*single machine only)
  – Meta-package: supports MPI, sockets, NWS, PVM

Parallel backends for foreach

• %dopar% behaviour depends on the currently "registered" parallel backend
• Modular parallel backends
  – registerDoSEQ (default)
  – registerDoNWS (NetWorkSpaces)
  – registerDoMC (multicore, MacOS/Linux)
      • From Terminal/ESS only! (R.app GUI will crash.)
  – registerDoSNOW
  – registerDoRMPI
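Before timing anything it helps to confirm which backend `%dopar%` will actually use. The sketch below registers the sequential backend explicitly and queries it with the `getDoPar*` helpers from the foreach package; with a parallel backend registered instead, the same queries report that backend and its worker count.

```r
library(foreach)

# Register the sequential backend explicitly; this also silences the
# "no parallel backend registered" warning from %dopar%.
registerDoSEQ()

cat("backend registered:", getDoParRegistered(), "\n")
cat("backend name:      ", getDoParName(), "\n")
cat("workers:           ", getDoParWorkers(), "\n")

# The loop itself does not change when the backend changes.
roots <- foreach(j = 1:4, .combine = c) %dopar% sqrt(j)
print(roots)
```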

Getting Started: Multi-core Processing

• R 2.10.x: wait until official release
• R 2.9.x
    require(doMC)
    registerDoMC(cores=2)
• REvolution R Enterprise
    require(doNWS)
    s <- sleigh(workerCount=2)
    registerDoNWS(s)

A simple simulation:

birthday <- function(n) {
  ntests <- 1000
  pop <- 1:365
  anydup <- function(i)
    any(duplicated(sample(pop, n, replace=TRUE)))
  sum(sapply(seq(ntests), anydup)) / ntests
}

x <- foreach (j=1:100) %dopar% birthday (j)
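A simulation like this is easy to sanity-check against the closed-form answer for the birthday problem. The sketch below is a compact variant of the simulation, not the slide's exact code: `birthday_exact` and `birthday_sim` are names introduced here, the group sizes are arbitrary, and `%do%` keeps the run sequential and reproducible under `set.seed`.

```r
library(foreach)

# Exact probability that at least two of n people share a birthday,
# assuming 365 equally likely days.
birthday_exact <- function(n) 1 - prod((365 - seq_len(n) + 1) / 365)

# Monte Carlo estimate of the same probability.
birthday_sim <- function(n, ntests = 1000) {
  hits <- vapply(
    seq_len(ntests),
    function(i) any(duplicated(sample(365, n, replace = TRUE))),
    logical(1)
  )
  mean(hits)
}

set.seed(42)
sizes <- c(10, 23, 40)
simulated <- foreach(n = sizes, .combine = c) %do% birthday_sim(n)
exact <- vapply(sizes, birthday_exact, numeric(1))

print(round(cbind(n = sizes, simulated, exact), 3))
```

With 1000 trials per group size, the simulated values should land within a few percentage points of the exact ones.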

Birthday Example: timings

Dual-core 2.4 GHz Intel MacBook:

    system.time({
      x <- foreach (j=1:100) %dopar% birthday (j)
    })   # Elapsed

Backend                        Time (s)
registerDoSEQ()                41s
registerDoMC()   # 2 cores     28s
registerDoNWS()  # 2 workers   26s (*)

Birthday Simulation: Multicore/NWS

> x <- foreach (j=1:100) %dopar% birthday (j)
> plot(1:100, unlist(x), type="l")

Using clusters with NetWorkSpaces

Setting Up a Cluster

1. Identify machines to form nodes on cluster
  – Easiest with Linux/MacOS
  – Possible with Windows
2. Select a server machine
  – OK for this one to be on Windows
3. Make sure passwordless ssh is enabled on each worker node
  – ssh nodename Revo --version should work

Setting up a cluster, part 2

4. Log in to server, start REvolution R
5. Create a sleigh

    require(doNWS)
    s <- sleigh(nodeList=c(rep("localhost",2),
                           rep("thor",8),
                           rep("loki",4)),
                launch=sshcmd)
    registerDoNWS(s)

6. Use foreach as before
7. (optional) use joinSleigh to add new nodes

Parallel Random Forest

# a simple parallel random forest
library(randomForest)
x <- matrix(runif(500), 100)
y <- gl(2, 50)
wc <- 2
n <- ceiling(1000 / wc)
registerDoNWS(s)
foreach(ntree=rep(n, wc), .combine=combine,
        .packages='randomForest') %dopar%
  randomForest(x, y, ntree=ntree)

• Easier: randomShrubberyNWS()

Converting existing code

• Convert these loops to foreach:
  – for: make body return iteration value and use .combine
  – apply: use iter(X, by="row") and .combine
      • Or iapply(X, 1)
  – lapply: use iter(mylist)
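The three conversions can be sketched side by side. The data below (`X`, `mylist`) is made up for illustration, and `%do%` is used so the example runs without a backend; switching to `%dopar%` is the only change needed once one is registered.

```r
library(foreach)
library(iterators)

X <- matrix(1:12, nrow = 3)
mylist <- list(a = 1:3, b = 4:6)

# for: instead of assigning into a result vector, let the body return
# the value and have .combine assemble the results.
squares_for <- numeric(5)
for (i in 1:5) squares_for[i] <- i^2
squares_fe <- foreach(i = 1:5, .combine = c) %do% { i^2 }

# apply over rows: iterate with iter(X, by = "row").
rows_apply <- apply(X, 1, sum)
rows_fe <- foreach(r = iter(X, by = "row"), .combine = c) %do% sum(r)

# lapply: iterate over the list; the default result is already a list.
len_lapply <- lapply(mylist, length)
len_fe <- foreach(el = iter(mylist)) %do% length(el)

print(squares_fe)
print(rows_fe)
str(len_fe)
```

In each case the loop body is unchanged; only the way values are handed in and collected differs.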

ppiStats Example

• Sequential:

    bpMats1 <- lapply(bpList, function(x) {
      bpMatrix(x, symMat = TRUE, homodimer = FALSE,
               baitAsPrey = FALSE, unWeighted = FALSE,
               onlyRecip = FALSE, baitsOnly = FALSE)
    })

• Parallel:

    bpMats1 <- foreach(x=iter(bpList), .packages = "ppiStats") %dopar% {
      bpMatrix(x, symMat = TRUE, homodimer = FALSE,
               baitAsPrey = FALSE, unWeighted = FALSE,
               onlyRecip = FALSE, baitsOnly = FALSE)
    }

ppiStats Example

• Sequential:

    bpGraphs <- lapply(bpMats1, function(x) {
      genBPGraph(x, directed = TRUE, bp = FALSE)
    })

• Parallel:

    bpGraphs <- foreach(x=iter(bpMats1),
                        .packages = "ppiStats") %dopar% {
      genBPGraph(x, directed = TRUE, bp = FALSE)
    }

Exercise

• Find other "embarrassingly parallel" Bioconductor examples, and convert them to parallel with foreach.

Multi-Stratum Parallelism

An example of explicit multi-stratum parallelism

[Diagram: an outer foreach(iterator) %dopar% {tasks} spans the CLUSTER level; each of its tasks runs an inner foreach whose tasks are spread across the SMP level.]

Multi-stratum template: NWS/Multicore

require ("doNWS")
require ("foreach")
require ("doMC")

s <- sleigh(nodeList=c(rep("localhost",2), rep("bladeserver",8)))
registerDoNWS(s)

foreach (iterator_i, .packages=c("foreach", "doMC")) %dopar% {
  registerDoMC()
  foreach (iterator_j) %dopar% {
    tasks…
  }
}

Pitfalls to avoid

• Sequential vs Parallel Programming
• Random Number Generation
  – library(sprngNWS)
  – sleigh(workerCount=8, rngType='sprngLFG')
• Node failure
• Cosmic Rays
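The random-number pitfall is that workers seeded carelessly can produce overlapping or unreproducible streams. The slide's remedy is sprngNWS; as a stand-in that runs on any current R installation, the sketch below uses a different technique, the L'Ecuyer-CMRG generator and `nextRNGStream` from R's bundled parallel package, to give each task its own stream. The seed, task count, and the helper `draw` are illustrative assumptions.

```r
library(parallel)

# Switch to a generator that supports multiple independent streams.
RNGkind("L'Ecuyer-CMRG")
set.seed(2009)

# Derive one stream seed per task by stepping from the initial seed.
n_tasks <- 4
streams <- vector("list", n_tasks)
seed <- .Random.seed
for (k in seq_len(n_tasks)) {
  streams[[k]] <- seed
  seed <- nextRNGStream(seed)
}

# A task installs its own stream before drawing random numbers.
# `draw` is a hypothetical helper for this sketch.
draw <- function(stream_seed) {
  assign(".Random.seed", stream_seed, envir = globalenv())
  runif(3)
}

first_run  <- lapply(streams, draw)
second_run <- lapply(streams, draw)

# Same streams give the same numbers, regardless of which worker runs them.
print(identical(first_run, second_run))
```

Because every task carries its own stream seed, the results do not depend on how tasks are scheduled across workers.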

Conclusions

• Parallel computing is easy!
• Write loops with foreach / %dopar%
  – Works fine in a single-processor environment
  – Third-party users can register backends for multiprocessor or cluster processing
  – Speed benefits without modifying code
• Easy performance gains on modern laptops / desktops
• Expand to clusters for meaty jobs

Thank You!

• David Smith
  – david@revolution-computing.com, @revodavid
• REvolution Computing
  – www.revolution-computing.com
• Revolutions, the R blog
  – blog.revolution-computing.com
• Downloads:
  – Slides: http://tinyurl.com/R-Bioc-slides
  – Script: http://tinyurl.com/R-Bioc-script