R in bioinformatics: parallelization and web

Context
Parallelizing code
Web applications
Large data sets and parallelization
R, C, and compression on the fly
Conclusions, future, et al.
Appendix

R, parallelization, massive data, and web applications: examples of the use of R in bioinformatics

Ramón Díaz-Uriarte

Dept. Bioquímica, Universidad Autónoma de Madrid
Madrid, Spain
rdiaz02@gmail.com
http://ligarto.org/rdiaz

III Jornadas de Usuarios de R, 17 November 2011

License and copyright

This work is Copyright © 2011, Ramón Díaz-Uriarte, and is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.

Please respect the copyright. This material is provided freely, and if you use it, I only ask that you use it according to the (very permissive) terms of the license: attribution, non-commercial use, and share-alike. If you have any doubts, ask me.

Outline

Context

Parallelizing code

Web applications

Large data sets and parallelization

R, C, and compression on the fly

Conclusions, future, et al.

Appendix

Context: biological context, computational context

Chromosomes

(Image from Wikipedia; original source: http://www.genome.gov/Pages/Hyperion//DIR/VIP/Glossary/Illustration/karyotype.shtml)

DNA → protein

(From O. Rueda’s PhD Thesis)

Data from a microarray experiment

(Slide from Gema Moreno Bueno, Department of Biochemistry, UAM)

More microarray data

(Image modified from http://www2.warwick.ac.uk/fac/sci/moac/students/peter_cock/r/heatmap/scaled_color_key.png)

aCGH

(Figures from Olshen, 2005; Barrett et al., 2004; Hupe & Barillot, 2005. Axes: log2(ratio) against position along the chromosome.)

Arrays: each dot is a DNA fragment, each array is one sample, and each array covers all chromosomes. (For the analysis, location in the chromosome matters.)

Calling gains and losses: hypothesis testing.

Inferring the number of copy gains/losses: estimation.

Data, data, data (in Gigabytes)

Expression arrays (mRNA): > 40,000 probes.
Copy number with aCGH: > 400,000 probes is common; some arrays have > 4 x 10^6.
. . .
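As a rough back-of-the-envelope check of why these probe counts translate into gigabytes, the sketch below computes the size of a dense matrix of 8-byte doubles. It is written in Python purely for illustration (the talk's own code is R and C); the function name and the sample counts are assumptions, and only the probe counts come from the slide.

```python
def dense_matrix_gib(n_probes: int, n_samples: int, bytes_per_value: int = 8) -> float:
    """Size in GiB of a dense probes x samples matrix of fixed-width values."""
    return n_probes * n_samples * bytes_per_value / 2**30


# Probe counts follow the slide; the sample counts are illustrative assumptions.
for n_probes, n_samples in [(40_000, 100), (400_000, 100), (4_000_000, 1_000)]:
    size = dense_matrix_gib(n_probes, n_samples)
    print(f"{n_probes:>9} probes x {n_samples:>5} samples: {size:8.2f} GiB")
```

Under these assumptions the largest case needs close to 30 GiB for a single copy of the data, before any intermediate results are stored.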

Multicores and computing clusters

Increases in CPU speed have slowed down (< 20% per year since 2002).
Increase in the number of “cores”: 2, 4, 8. Next 10 years?
Inexpensive computing clusters with off-the-shelf components.
We must design our programs for this from the start: parallel programming.

(Image from http://faq.distributed.net/)

Parallelizing code

Standalone

(Diagram: “Statistical Computing in Bioinformatics”. Stages: develop statistical methods; implement existing approaches; implement for statisticians and bioinformaticians; implement for wet lab users. Labels: parallel computing; fault tolerance; increased speed (40x - 60x); web apps: user friendly, no installation, statistical rigour, best practices.)

R code

Code is available for many procedures (but a few years ago none of it was parallelized!).
Many computations are embarrassingly parallelizable:
- bootstrapping and cross-validation
- arrays (or samples)
- arrays by chromosomes
- parallel chains in MCMC

Figure production can be parallelized.
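To make “embarrassingly parallelizable” concrete, here is a minimal Python sketch (not the talk's R code) in which every bootstrap replicate is an independent task mapped over a pool of workers. The function names and data are made up. A thread pool is used only to keep the sketch portable; for CPU-bound work in CPython a process pool would be needed to see a real speedup.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def bootstrap_mean(task):
    """One independent replicate: resample with replacement, return the mean."""
    data, seed = task
    rng = random.Random(seed)  # private generator, so tasks share no state
    resample = [rng.choice(data) for _ in data]
    return statistics.fmean(resample)


def bootstrap_means(data, n_replicates, max_workers=4):
    """Map the independent replicates over a pool of workers."""
    tasks = [(data, seed) for seed in range(n_replicates)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(bootstrap_mean, tasks))


data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3]  # made-up values
means = bootstrap_means(data, n_replicates=200)
print(len(means), min(means), max(means))
```

Because each task carries its own seed, the pooled run returns exactly what a sequential loop over the same seeds would, which makes the parallel version easy to check.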

Parallelizing R code

(Implement missing functionality: R/C)
MPI: R packages Rmpi, papply, snow, snowfall.
Load balanced.
Wrappers over “mid level” functions in the package: ease updating.
Parallelize:
- bootstrap samples / cross-validation runs
- arrays
- arrays by chromosomes
- (or a combination of both)
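Why “load balanced” matters when tasks differ in cost (for example, large and small chromosomes) can be shown with a small simulation. This Python sketch is an illustration under assumed task durations, not the scheduling code of any of the R packages listed; the dynamic variant follows the same idea as load-balanced apply functions such as `clusterApplyLB` in snow, where a free worker takes the next pending task.

```python
import heapq


def static_makespan(durations, n_workers):
    """Tasks pre-assigned round-robin; the run ends when the busiest worker ends."""
    loads = [0.0] * n_workers
    for i, duration in enumerate(durations):
        loads[i % n_workers] += duration
    return max(loads)


def load_balanced_makespan(durations, n_workers):
    """Each task goes to whichever worker becomes free first."""
    free_at = [0.0] * n_workers  # a list of equal values is already a valid heap
    for duration in durations:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + duration)
    return max(free_at)


durations = [5, 1, 1, 1, 5, 1, 1, 1]  # hypothetical task costs
print(static_makespan(durations, 2))         # 12.0
print(load_balanced_makespan(durations, 2))  # 8.0
```

With these made-up costs, static assignment leaves one worker idle for most of the run, while the load-balanced schedule reaches the ideal of 8 time units.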

Is it worth it?

Are speed improvements really worth the effort?
Over what range of problems do we see improvements?
With what hardware can we see improvements?

What do we gain?

(Figure: user wall time in seconds (log scale, 20 to 10000) against number of arrays (samples: 10, 50, 100, 150) for 20,000 genes. Panels: HMM, GLAD, CBS, BioHMM. Series: sequential code and parallelized code.)

What do we gain?

Are speed improvements really worth the effort?
Your effort: “R CMD INSTALL ADaCGH2”.

Over what range of problems do we see improvements?
10 to 10^3 arrays/samples; 10^4 to 10^6 spots/genes.

With what hardware can we see improvements?
2 cores to 120 cores.

Smaller clusters: more cost-effective.
Single node/multi-core: less communication overhead.
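One way to reason about why smaller clusters can be more cost-effective is Amdahl's law. The slides do not mention it; it is used here only as a simple model, and the parallel fraction below is a hypothetical value, not a measurement from the talk. The Python sketch prints the ideal speedup and the speedup per CPU for several cluster sizes.

```python
def amdahl_speedup(parallel_fraction: float, n_cpus: int) -> float:
    """Ideal speedup when only `parallel_fraction` of the run time parallelizes."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cpus)


def efficiency(parallel_fraction: float, n_cpus: int) -> float:
    """Speedup per CPU: 1.0 means every added CPU is fully used."""
    return amdahl_speedup(parallel_fraction, n_cpus) / n_cpus


p = 0.99  # hypothetical parallel fraction, not a measured value
for n in (2, 10, 20, 60, 120):
    print(f"{n:>3} CPUs: speedup {amdahl_speedup(p, n):6.1f}, "
          f"efficiency {efficiency(p, n):.2f}")
```

Under this model the speedup keeps growing with more CPUs, but each added CPU contributes less. Communication overhead, which the model ignores, would reduce the gain further on multi-node clusters.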

Where is this running?

varSelRF (CRAN)
ADaCGH2 (BioConductor)
SignS (launchpad: http://launchpad.net/signs)

Web applications (web apps: how)

Applications for wet lab researchers

Analyze data in a reasonably short time.
User-friendly access to methods that are statistically rigorous.

Web-based applications

User-friendly interface.
No hardware/software hassles for end users.
Parallelization is transparent.
Method selection can be partially transferred (to us).
Short user wall time: use (hardware/software) resources rarely available to individual biomedical researchers.
Just type in a URL: http://www.some-application

(Image modified from http://faq.distributed.net/)

Sometimes collaborations feel like . . .

(From http://www.bitacoradegalileo.com/2010/11/16/giordano-bruno-en-la-cara-oculta-de-la-luna/)

Parallelization in web applications

(Diagram: “Statistical Computing in Bioinformatics”. Stages: develop statistical methods; implement existing approaches; implement for statisticians and bioinformaticians; implement for wet lab users. Labels: parallel computing; fault tolerance; increased speed (40x - 60x); web apps: user friendly, no installation, statistical rigour, best practices.)

Main web-based applications

(Diagram of the main web-based applications, grouped by stage.)
- Dealing with raw data: remove artifacts from microarrays; missing data; replicate spots. Applications: DNMAD, preP.
- Statistical analysis (sensu stricto): differentially expressed genes; select genes for classification; molecular signatures with survival data; segment aCGH. Applications: Pomelo_II, Tnasas, GeneSrF, SignS, WaviCGH, ADaCGH.
- Annotation and interpretation: interpret results. Applications: IDClight, PaLS.

What do we gain?

(Figure: user wall time in seconds (250 to 4000) against number of simultaneous users (1, 5, 10, 20) for 15000 genes and 40 arrays. Panels: CBS, CGHseg, GLAD, HMM.)

How it works: some key ideas

Each run:
- Parallelization (transparent for users)
- Fault tolerance (network problems, machine crashes, bugs)
- Checkpointing

Periodic tasks (keep the system running 24 h, 365 d):
- Automatic monitoring
- Automated testing suite
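The checkpointing idea can be sketched in a few lines. The Python below is a minimal illustration with hypothetical function names and a JSON file as the checkpoint format; the slides do not show how the actual applications store their checkpoints. Finished items are recorded after each step, so a rerun after a crash skips them.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved results, or an empty dict if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {}
    with open(path) as fh:
        return json.load(fh)


def save_checkpoint(path, results):
    """Write to a temporary file and rename it, so a crash never leaves half a file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(results, fh)
    os.replace(tmp_path, path)


def run_with_checkpoints(items, work, path):
    """Process items in order, skipping those already recorded in the checkpoint."""
    results = load_checkpoint(path)
    for item in items:
        key = str(item)  # JSON object keys are strings
        if key in results:
            continue
        results[key] = work(item)
        save_checkpoint(path, results)
    return results


with tempfile.TemporaryDirectory() as directory:
    checkpoint = os.path.join(directory, "run.json")
    print(run_with_checkpoints(range(5), lambda x: x * x, checkpoint))
```

Saving after every item is the simplest policy; with many cheap items one would checkpoint less often to limit disk traffic.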

What happens

(Flow diagram.)
User → head node (LVS): sends the request to one of the servers.
CGI: data checking, file upload.
Execution: Python program
- Setting up LAM/MPI
- Starting R
- Fault tolerance
- Checking termination of R
- Checking run errors
- Formatting output
R program (sequential code, parallelized code).
Autorefreshing HTML until final results.

What happens: details

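(Flow diagram.)
The user's request reaches the head node (LVS), which passes it to one of servers 1 ... n, each running Apache. The server that takes the request acts as MPI master and the others as slaves.
The CGI on that server: reads the data, creates the MPI universe, launches R and Rmpi, monitors R execution, and maintains R process counters.
- Rmpi started OK? If yes, continue R execution till the end. If not after K attempts, stop execution, halt the MPI universe, and return an error.
- Is R done? If not, return an autorefreshing page. If yes, halt the MPI universe, then produce and return the results pages.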
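The “not after K attempts” branch of the diagram amounts to a bounded retry loop. The Python sketch below illustrates that pattern only; `start`, `StartupError`, and the parameter names are hypothetical, and the real applications launch LAM/MPI and Rmpi rather than a Python callable.

```python
import time


class StartupError(Exception):
    """Raised when the workers could not be started after all attempts."""


def start_with_retries(start, max_attempts=3, delay_seconds=0.0):
    """Call `start()` until it reports success, at most `max_attempts` times.

    Returns the number of the attempt that succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            if start():
                return attempt
        except OSError:
            pass  # treat a failed launch like a "not OK" answer
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    raise StartupError(f"not started after {max_attempts} attempts")


# Stand-in launcher that fails twice before succeeding.
outcomes = iter([False, False, True])
print(start_with_retries(lambda: next(outcomes), max_attempts=5))  # 3
```

On `StartupError` the caller would stop execution, halt the MPI universe, and return the error page, as in the diagram.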

MPI details

(Flow diagram.)
- Can we run? (Count other LAM daemons.) If not, sleep and ask again. If yes, boot a (new) LAM/MPI universe on servers 1 ... n.
- Start R: continue from the last checkpoint. R does the segmentation and the figures (over subjects and chromosomes), using NFS shared storage and NFS shared temporary storage.
- Sleep, then check: Run out of time? Are we done? R crashed (bugs)? Rmpi crashed? LAM/MPI/nodes crashed?
- If we are done: halt the MPI universe, then produce and return the results pages.
- After a crash: verify servers (modify LAM defs), boot a new LAM/MPI universe, and start R again from the last checkpoint.
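The sleep-and-check loop in the diagram can be sketched with the standard `subprocess` module. This is an assumption-laden Python illustration, not the applications' controller: a child Python process stands in for the R job, and the function name and return labels are made up.

```python
import subprocess
import sys
import time


def supervise(command, time_limit, poll_interval=0.05):
    """Run `command`, polling until it ends or `time_limit` seconds pass.

    Returns "done", "crashed" (non-zero exit status), or "timeout".
    """
    deadline = time.monotonic() + time_limit
    process = subprocess.Popen(command)
    while True:
        status = process.poll()  # None while the child is still running
        if status is not None:
            return "done" if status == 0 else "crashed"
        if time.monotonic() >= deadline:
            process.kill()
            process.wait()  # reap the child so it does not linger
            return "timeout"
        time.sleep(poll_interval)


# A child Python process stands in for the R job.
print(supervise([sys.executable, "-c", "print('segmenting')"], time_limit=30))
```

A "crashed" result would be followed by verifying the servers and restarting from the last checkpoint, and "timeout" by halting the run, following the branches of the diagram.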

Where is this running?

http://signs.bioinfo.cnio.es

http://wavi.bioinfo.cnio.es

http://genesrf.bioinfo.cnio.es

Large data sets and parallelization

Large data sets

Millions of spots.
Hundreds or thousands of subjects.
No need to hold everything in RAM at once.

Package ff: “memory-efficient storage of large data on disk and fast access functions.”

Combined with:
- parallelization
- shared storage

ff and parallelization

ff stores the object on disk.
That object can be read from various R processes.
Different R processes can write to different ff objects.
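The pattern described above (one shared on-disk object read by all workers, one private on-disk object written by each) can be imitated with plain binary files. The Python sketch below is only an analogy: it does not use ff or its API, all function names are hypothetical, and threads stand in for the separate R processes.

```python
import os
import tempfile
from array import array
from concurrent.futures import ThreadPoolExecutor

ITEM_SIZE = array("d").itemsize  # bytes per stored double


def write_values(path, values):
    """Store the values on disk as raw doubles."""
    with open(path, "wb") as fh:
        array("d", values).tofile(fh)


def read_values(path, start=0, count=None):
    """Read `count` doubles from element `start`, touching only that part of the file."""
    if count is None:
        count = os.path.getsize(path) // ITEM_SIZE - start
    chunk = array("d")
    with open(path, "rb") as fh:
        fh.seek(start * ITEM_SIZE)
        chunk.fromfile(fh, count)
    return list(chunk)


def worker(common_path, out_path, start, count):
    """Read a slice of the shared input (read only) and write a private output file."""
    values = read_values(common_path, start, count)
    write_values(out_path, [v * v for v in values])
    return out_path


def process_in_chunks(common_path, out_dir, chunk_size, max_workers=3):
    """Split the shared file into slices, process them, and gather the parts in order."""
    n = os.path.getsize(common_path) // ITEM_SIZE
    jobs = [(common_path, os.path.join(out_dir, f"part{i}.bin"), s, min(chunk_size, n - s))
            for i, s in enumerate(range(0, n, chunk_size))]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parts = list(pool.map(lambda job: worker(*job), jobs))
    return [v for part in parts for v in read_values(part)]


with tempfile.TemporaryDirectory() as directory:
    common = os.path.join(directory, "common.bin")
    write_values(common, range(10))
    print(process_in_chunks(common, directory, chunk_size=4))
```

Because no two workers write to the same file, no locking is needed, and no worker ever holds more than its own slice in memory.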

ff and parallelization (I)

(Diagram: R processes R1, R2, ..., Rn each read the common ff object (read only) and each write their own ff object (ff1, ff2, ..., ffn). Rmaster gathers these into ffall.)

ff and parallelization (II)

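(Diagram: Rmaster writes the data to ffin. Processes R1, R2, ..., Rn (multicore) read ffin (read only) and write ff1, ff2, ..., ffn. Further rounds of R1, R2, ..., Rn work on i1, i2, ..., in and on ff1, ff2, ..., ffn to produce Fig.1, Fig.2, ..., Fig.n. The results are collected in ffout.)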

Where is this running?

ADaCGH2 (BioConductor package)
Web-based application: http://wavi.bioinfo.cnio.es.

R, C, and compression on the fly

Store and access (large) pre-computed results

HMM for aCGH data with Reversible Jump: Viterbi.
Common regions: “count” on the Viterbi paths.

Fitting the HMM and finding common regions are distinct operations.

C: number-crunching.
R: wrapper and figures/tables.
C: creates large amounts of data.

In package RJaCGH (CRAN).

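(Diagram.)
Fit HMM: R calls C (HMM); C stores the Viterbi paths as a gzipped file and returns the filenames to R.
Find common regions: R passes the filenames to C (common regions); C reads the Viterbi data and returns the results to R.
R then produces the figures and tables.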
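The two-step flow above (store the paths compressed on disk, hand back only filenames, read them later to count) can be sketched with Python's standard `gzip` module. This is an illustration under assumptions: the file layout, the 0/1 state coding, and the function names are made up and are not RJaCGH's actual format or C code.

```python
import gzip
import os
import tempfile


def store_paths(paths, directory):
    """Write each state path as one gzipped text file; return only the filenames."""
    filenames = []
    for i, path in enumerate(paths):
        filename = os.path.join(directory, f"viterbi_{i}.txt.gz")
        with gzip.open(filename, "wt") as fh:  # compressed as it is written
            fh.write(" ".join(map(str, path)))
        filenames.append(filename)
    return filenames


def count_state(filenames, state):
    """For every position, count in how many stored paths it is in `state`."""
    counts = None
    for filename in filenames:
        with gzip.open(filename, "rt") as fh:  # decompressed as it is read
            path = [int(token) for token in fh.read().split()]
        if counts is None:
            counts = [0] * len(path)
        for position, value in enumerate(path):
            counts[position] += int(value == state)
    return counts


# Hypothetical paths: 1 marks a position called as altered, 0 as unchanged.
example_paths = [[0, 1, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
with tempfile.TemporaryDirectory() as directory:
    files = store_paths(example_paths, directory)
    print(count_state(files, state=1))  # [1, 3, 1, 0]
```

Only one path is held in memory at a time during the counting step, which is the point of keeping the large intermediate output on disk.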

Conclusions, future, et al.

Web-based: A few things we’ve learned

Configuration sucks (if you need to modify > 1 file).
Too many languages.
Adding test cases to the testing suites: web, R.
Documentation: in the code, web pages, LaTeX . . .

Too much R code to catch errors.
User interfaces: who designs them?

Fault tolerance and communication

Manual check for errors (R ain’t Erlang).
Too much network traffic.

(Flow diagram: boot a (new) LAM/MPI universe on servers 1 ... n, with NFS shared storage and NFS shared temporary storage; start R, continuing from the last checkpoint; sleep; then check whether we have run out of time, whether we are done, and whether R (coding errors), Rmpi, or LAM/MPI (includes node crashes) crashed. If we are done, halt the MPI universe, then produce and return the results pages. After a crash, verify servers (modify LAM defs) and boot again.)

Solutions?

Literate programming and org-mode.
Alternatives to MPI and/or use Erlang . . .
Keep things as they are (only a few painful events a year).

Single machine applications

Run applications on a single, multicore (e.g., 12 cores) machine.
Just verify that the machine is up.

Rethinking web-based applications

Users can get into trouble.

Sure, but we can do a good job . . .
- Provide state-of-the-art statistical and computational approaches (R ;-)
- Pedagogical examples and pipelines
- Minimize the chance of users getting into trouble

Web-based applications are here to stay.

. . . so . . .

Forget about them: just write R (plus C, etc.) code.
Go for it:
- R will do its job (which is only part of the job)
- HPC available
- No problem with large data sets
- But other tools and work are necessary

Regardless of web-based applications . . .

Parallel computing can be used routinely:
- (library(parallel) in R ≥ 2.14.0)

Large data sets with ff + parallelization.

Acknowledgements

O. M. Rueda, A. Alibés, A. Cañada, E. R. Morrissey, M. L. Neves, D. Rico.
Funding: Fundación de Investigación Médica Mutua Madrileña, Project TIC2003-09331-C02-02 of the Spanish MEC and BIO2009-12458 of the Spanish MICINN. Ramón y Cajal Programme of the Spanish Ministry of Education and Science.
CNIO (Spanish National Cancer Research Center).
The R users and developers, for a vibrant statistical computing community and an amazing platform.
The organizing and scientific committees of the III Jornadas de Usuarios de R.

Trying to fit it all

(Diagram: “Statistical Computing in Bioinformatics”. Stages: develop statistical methods; implement existing approaches; implement for statisticians and bioinformaticians; implement for wet lab users; information integration; answer biological question. Labels: parallel computing; fault tolerance; increased speed (40x - 60x); web apps: user friendly, no installation, statistical rigour, best practices.)

What do we gain?

(Figure, two panels: fold increase in speed relative to 1 CPU (log scale, 0.2 to 50.0). Left panel: effect of number of arrays (samples), with number of genes = 7399. Right panel: effect of number of genes (1000 to 48000), with number of arrays = 160. Series: original sequential; new sequential (1 CPU); parallelized with 2, 10, 20, and 60 CPUs.)

What do we gain?

(Figure, two panels: user wall time in seconds against number of simultaneous users (1, 5, 10, 15). Left panel: Breast data set (78 arrays x 4751 genes). Right panel: DLBCL data set (160 arrays x 7399 genes).)

Tools. . .

(Diagram of tools.)
- Org mode: R, Python, C/C++; literate programming; interactive debugging
- Config files
- Bazaar-NG; commit hooks
- FunkLoad: whole system testing, regression testing, stress testing
- “Continuous Integration”
- Web app. frameworks
- Web help
- Install script
- Distributed computing: parallelization (MPI, others), grid computing, web services
- Admin.

Too many languages

Impedance mismatch problem: “Building Web-based applications requires the mastering of a number of languages/technologies (e.g. HTML, CSS, CGI, ASP, PHP, XML, etc..). Such languages and technologies were created to address different aspects on a by-need evolutionary manner. The result is a plethora of tools that are fitted together in an ad hoc fashion.” El-Ansary, Grolaux, Van Roy, Rafea (2005) “Overcoming the Multiplicity of Languages and Technologies for Web-Based Development Using a Multi-paradigm Approach”.

R and C.
HTML and Python: CGI, data entry, display.
Python (and others): control and monitor MPI.
Javascript: AJAX and figures.

Other solutions?

Too many languages: use languages designed to overcome this problem: Hop, Links, QHTML.

Fault tolerance and too much traffic:
- Alternatives to MPI?
- Linda and tuple spaces (also between-language funct.)
- Roll-our-own based on Rserve
- Have Erlang control R processes?
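For readers unfamiliar with the Linda model mentioned above, the sketch below shows a minimal in-process tuple space in Python. It is an illustration of the coordination idea only; the class and method names are hypothetical, a real deployment would share the space across machines, and nothing here reflects code from the talk.

```python
import threading


class TupleSpace:
    """Minimal Linda-style space: producers put tuples, consumers take matches."""

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, *fields):
        """Add a tuple to the space and wake up any waiting consumers."""
        with self._cond:
            self._tuples.append(tuple(fields))
            self._cond.notify_all()

    def _find(self, pattern):
        """Return the first tuple matching the pattern; None in a pattern is a wildcard."""
        for candidate in self._tuples:
            if len(candidate) == len(pattern) and all(
                    p is None or p == f for p, f in zip(pattern, candidate)):
                return candidate
        return None

    def take(self, *pattern, timeout=None):
        """Linda's `in`: remove and return a matching tuple, waiting if necessary."""
        with self._cond:
            found = self._cond.wait_for(
                lambda: self._find(pattern) is not None, timeout)
            if not found:
                raise TimeoutError("no matching tuple")
            match = self._find(pattern)
            self._tuples.remove(match)
            return match


space = TupleSpace()
space.out("task", "chromosome-1")
space.out("result", "chromosome-1", 0.42)  # made-up result value
print(space.take("result", None, None))
```

Workers never address each other directly: they only put and take tuples, so a crashed worker's unfinished task could simply be put back into the space for another worker.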