Introducing a New Client/Server Framework for Big Data Analytics with the R Language

Drew Schmidt, Wei-Chen Chen, and George Ostrouchov

July 19, 2016

Slides: http://bit.do/cddXm

Acknowledgements

Support

This work was supported in part by the project “Harnessing Scalable Libraries for Statistical Computing on Modern Architectures and Bringing Statistics to Large Scale Computing” funded by the National Science Foundation Division of Mathematical Sciences under Grant No. 1418195. This work used resources supported by the National Science Foundation under Grant Number 1137097 and by the University of Tennessee through the Beacon Project. Just got an allocation on Comet! Thank you XSEDE!

Disclaimer

The findings and conclusions in this presentation have not been formally disseminated by the U.S. Department of Health & Human Services nor by the U.S. Department of Energy, and should not be construed to represent any determination or policy of University, Agency, Administration and National Laboratory.

Background and Motivation

1 Background and Motivation

2 pbdR Compute Backend

3 The Client/Server

4 Challenges and Future Work

Problems with “Big Data” Software

Many frameworks; what do they all do?

Don’t always play nice with HPC systems.

Often not as “high level” as advertised.

Almost exclusively batch!

Data Analysis Is An Interactive Activity

Data analysis is an interactive activity.ᵃ

ᵃ Data analysis is an interactive activity.

Why am I here?

XSEDE: a conference for users and service providers.

I want to talk to you!

pbdR Compute Backend

pbdR

Client/Server + pbdR = interactive big data analysis

This talk is not about the old stuff...

But we have to talk about it!

Programming with Big Data in R (pbdR)

Free/Open Source R packages.

Actively maintained, available on CRAN and GitHub.

Packages: Library interfaces, high-level frameworks, profiling and vis, ...

For today: MPI+ScaLAPACK stuff.

Syntax identical to R.

About the Name

R as an Interface Language

http://datascience.la/john-chambers-user-2014-keynote/

Distributed Matrices and Statistics with pbdDMAT

[Figure: “Fitting y~x With Fixed Local Size of ~43.4 MiB”. Run time (seconds) versus cores (504, 1008, 2016, 4032, 8064, 24000) for 500, 1000, and 2000 predictors; points are labeled 21.34, 42.68, 85.35, 170.7, 341.41, and 1016.09.]

library(pbdDMAT)
init.grid()

# m and n (global row and column counts) are assumed to be set beforehand
x <- ddmatrix("rnorm", nrow=m, ncol=n)
y <- ddmatrix("rnorm", nrow=m, ncol=1)

mdl <- lm.fit(x=x, y=y)

finalize()
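Because the pbdR syntax is meant to mirror base R, a serial analogue of the fit above may be a useful point of comparison. The following is only a sketch under stated assumptions, not the code behind the benchmark: it uses ordinary base R matrices in place of distributed ones, and the seed and the sizes `m` and `n` are small, made-up values chosen so it runs on a laptop.

```r
# Serial base-R analogue of the distributed fit (illustrative sizes only).
set.seed(1234)   # hypothetical seed, for reproducibility
m <- 1000        # rows (assumed; the runs in the talk are far larger)
n <- 5           # predictors (assumed)

x <- matrix(rnorm(m * n), nrow = m, ncol = n)
y <- rnorm(m)

# Same lm.fit() call as in the distributed version, on ordinary matrices.
mdl <- lm.fit(x = x, y = y)
print(mdl$coefficients)
```

The only lines that differ from the distributed version are the ones that create the data, which is the point the slide is making.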

pbdR Paradigms/“Opinions”

In core.

Focus on dense data.

Integrates well with traditional HPC.

C and Fortran for performance; R for the interface.

Exclusively batch.

(Mostly) Very different from Hadoop/Spark!

“OLCF Researchers Scale R to Tackle Big Science Data Sets”

A problem that takes several hours on Apache Spark [was analyzed] in less than a minute using R on OLCF high-performance hardware.

“... for situations where one needs interactive near-real-time analysis, the pbdR approach is much better.”

https://www.hpcwire.com/2016/07/06/olcf-researchers-scale-r-tackle-big-science-data-sets/

The Client/Server

The Client/Server Design and Architecture of the Client Server

Motivation

Connect local R session to a remote one.

The C/S Framework

No specialized software environment; just R packages.

Designed for the cloud or HPC resources.

Mostly “un-opinionated”; can use our compute backend, SparkR, whatever.

Optional security features: passwords and encryption.

A Quick Summary of the Main Components

pbdZMQ: ZeroMQ bindings. Use: our C/S packages; IRkernel, ...

remoter: client/server core. Use: cloud computing, “reference” big data framework.

pbdCS: Extension of remoter for the MPI backend stuff. Use: Bringing interactivity to pbdR “big data” system.

Comparison to Similar Utilities

“Moving computations to a remote machine”

Web frameworks: shiny, htmlwidgets, fiery, ...

Revolution Analytics (now Microsoft): Microsoft R Server and R Client

Jupyter/notebooks.

RStudio Server

rmote

The Client/Server Examples

Basics

Elsewhere:

R> remoter::server()

## [2016-07-19 12:53:23]: *** Launching secure server ***

In your local R session:

R> remoter::client()
remoter> x = 1
remoter> x
## [1] 1
remoter> exit()

R> x
## Error: object 'x' not found
R> remoter::client()

remoter> s2c(x)
remoter> exit()
R> x
## [1] 1

A Demo?

No time for a live demo...

Any time you see me at XSEDE16, ask and I’ll give you one!

For a non-live demo, see: https://github.com/snoweye/user2016.demo

More information in the remoter vignette: https://cran.r-project.org/web/packages/remoter/vignettes/remoter.html

The Client/Server Overhead

Size Costs

There is a size overhead in transmitted messages.

For almost everything you would do, no big deal.

See paper for full discussion.

Relays

Run Time Costs

Challenges and Future Work

Already done!

Batch computing with a remote server.

Graphics and help, integration with rmote.

Many small things we didn’t bother to mention.

Object Guarding

Currently possible to send huge amounts of data without realizing it.

pbdR> print(huge_thing)
## uh-oh!
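One way such a guard could look, purely as a sketch: check an object's size before allowing it to be printed (and therefore shipped to the client). The helper name `guarded_print` and the `max_bytes` threshold are hypothetical and are not part of remoter or pbdCS; the only real API used is base R's `object.size()`.

```r
# Hypothetical helper, not part of remoter/pbdCS: refuse to print objects
# whose in-memory size exceeds a threshold.
guarded_print <- function(x, max_bytes = 1e6) {
  size <- as.numeric(object.size(x))
  if (size > max_bytes) {
    message("Object is ", format(size, big.mark = ","), " bytes; not printing.")
    return(invisible(FALSE))
  }
  print(x)
  invisible(TRUE)
}

small_thing <- 1:10
huge_thing <- numeric(1e6)   # roughly 8 MB of doubles

guarded_print(small_thing)   # prints the vector
guarded_print(huge_thing)    # emits a message instead of printing
```

A real guard would presumably live on the server side and intercept printing automatically; this version only shows the size check itself.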

Relays and Integration with the Scheduler

Some day...

R> bigJob(machine="stampede", nodes=10, interactive=TRUE)

References

Schmidt, D., Chen, W.-C. and Ostrouchov, G. (2016), “Introducing a New Client/Server Framework for Big Data Analytics with the R Language,” XSEDE2016.

Chen, W.-C., Ostrouchov, G., Schmidt, D., Patel, P. and Yu, H. (2012), “A Quick Guide for the pbdMPI Package,” R Vignette, URL http://cran.r-project.org/package=pbdMPI

Schmidt, D., Chen, W.-C., Patel, P., and Ostrouchov, G. (2014), “Speaking Serial R with a Parallel Accent,” R Vignette, URL http://cran.r-project.org/package=pbdDEMO

Chen, W.-C., Ostrouchov, G., Pugmire, D., Prabhat, and Wehner, M. (2013), “A Parallel EM Algorithm for Model-Based Clustering Applied to the Exploration of Large Spatio-Temporal Data,” Technometrics, 55(4), 513-523.

∼Thanks!∼

Questions?

Email: wrathematics@gmail.com

GitHub: https://github.com/wrathematics

Web: http://wrathematics.info

Blog: http://librestats.com

Twitter: @wrathematics
