examples of use cases mining large-scale data: dealing with the data deluge

1
Biocep, towards a federative collaborative user-centric and grid-enabled computational open platform Karim Chine R is becoming the lingua franca of data analysis and statistical computing. It has a very powerful graphics system as well as cross- platform capabilities for packaging any computational code. Hundreds of available R packages implement the most up-to-date computational methods and reflect the state-of-the-art of research in various fields. R packages are foreseen as a reproducible research enabler. There is no obstacle to a large-scale deployment of R on public grids since it is a GPL software. However R is not multithreaded, doesn't operate as a server and has only a low-level non-object-oriented API. GUIs development for R remains non-standardized. R's potential as a computational back end engine for applications has yet to be fully exploited. While its user base is growing at a high rate, this growth rate would be significantly higher in the presence of a user-friendly and rich workbench Biocep is a general unified open source solution for integrating and virtualizing the access to R engines/servers and aims to become a federative user-friendly computational e-platform for research, finance and education. The Biocep virtual workbench enables the plugability for all the elements of a computational environment: the computational resource whether it is a local machine, a cluster, a grid or a cloud server via a simple URL, the computational components via the import of R packages and the computational GUIs via the import of plugins from repositories or the design of new views with a drag&drop GUI editor. Several dockable built-in views allow users to work interactively and collaboratively with R engines running anywhere. The views include a console, highly interactive remote graphic devices, a workspace explorer, PDF and SVG viewers, R data inspectors, linked plots and collaborative spreadsheets fully integrated with R functions and data. Biocep is also a toolbox that can be used to generate stateless and stateful web services automatically mapping R functions. It enables Python/Groovy scripting with remote R engines on client and on server sides. Using the Biocep frameworks, pools of R engines can be deployed on heterogeneous nodes, managed and used for parallel and distributed computing and for generating dynamic content on-the-fly for highly scalable web applications. A Biocep based R virtualization infrastructure has been successfully deployed on the National Grid Service. The result leaves no doubt about how useful this service would become for researchers. If the new platform was widely adopted, it would greatly enhance the usability of existing HPC infrastructures and would increase their usage. It may also work as an enabler of a new computing business model that would synergize the utility computing model (resources) and the pay-per-use software model (components/GUIs). Project Home : www.biocep.net Examples of use cases Mining large-scale data: dealing with the data deluge The volume of data generated in research is growing and the data can’t be moved any more to be analyzed. Biocep takes the computation to the data. An R engine is a « robotic hand » that can operate anywhere: “near” the big files or within a database. It is extensible via components (R packages and dynamically-loaded Java code) and remote-controllable from anywhere via the Biocep interfaces or the Virtual R Workbench (extensible with plugins, collaborative). R session alive for ever With Biocep, an R user can create an R engine anywhere (clusters, EC2 server..) and connect to it, use it and disconnect from it. Once reconnected again (from anywhere), the R user retrieves his full environment including his graphic devices. Workflows with computational Web Services The statefulness of the Web Services generated by Biocep solves the overhead problem caused by the transfer of intermediate results. Between Workflow nodes, only proxies referencing the data (kept in the memory of the R engine allocated for the computing session) are propagated. R Virtualization Collaborative R Workflows with generated stateful Web Services Automatic Web Sevices generation for R functions Scripting with one or many remote R engines NGS R engines pools deployment Forthcoming Roadmap Documentation: User manuals, deployment documents, Javadoc.. Workbench: Plugins architecture finalization. Parallel computing: Finish implementing ‘Snow ‘like APIs with Biocep. Biocep based MapReduce. Remote engines fine- grain control from the R console. Security: Finish implementing the security architecture. Cloud Computing: Provide VMWare virtual machines and public AMIs (EC2) with R & Biocep pre-installed. Integrate basic AMIs management to Biocep. Workflows: Implement the automatic generation of Knime nodes for R functions. Evangelization: Seminars and tutorials planned for several institutions and companies. Tutorial at the 4th IEEE International Conference on e-Science (Indianapolis). Search for funding: Please get in touch with the author if you are Interested in being a sponsor or a partner for Biocep. Author’s contact details: [email protected] - +44(0)7769961344 Acknowledgements AT&T Research Labs: Simon Urbanek BAH: Ghazi Ben Amor EBI: Alvis Brazma, Wolfgang Huber, Kimmo Kallio, Misha Kapushesky, Michael Kleen, Alberto Labarga, Philippe Rocca-

Upload: lukas

Post on 24-Feb-2016

28 views

Category:

Documents


2 download

DESCRIPTION

Biocep , towards a federative collaborative user-centric and grid-enabled computational open platform Karim Chine. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Examples  of use cases Mining large-scale data: dealing with the data deluge

Biocep, towards a federative collaborative user-centric and grid-enabled computational open platform

Karim Chine R is becoming the lingua franca of data analysis and statistical computing. It has a very powerful graphics system as well as cross-platform capabilities for packaging any computational code. Hundreds of available R packages implement the most up-to-date computational methods and reflect the state-of-the-art of research in various fields. R packages are foreseen as a reproducible research enabler. There is no obstacle to a large-scale deployment of R on public grids since it is a GPL software. However R is not multithreaded, doesn't operate as a server and has only a low-level non-object-oriented API. GUIs development for R remains non-standardized. R's potential as a computational back end engine for applications has yet to be fully exploited. While its user base is growing at a high rate, this growth rate would be significantly higher in the presence of a user-friendly and rich workbench

Biocep is a general unified open source solution for integrating and virtualizing the access to R engines/servers and aims to become a federative user-friendly computational e-platform for research, finance and education. The Biocep virtual workbench enables the plugability for all the elements of a computational environment: the computational resource whether it is a local machine, a cluster, a grid or a cloud server via a simple URL, the computational components via the import of R packages and the computational GUIs via the import of plugins from repositories or the design of new views with a drag&drop GUI editor. Several dockable built-in views allow users to work interactively and collaboratively with R engines running anywhere. The views include a console, highly interactive remote graphic devices, a workspace explorer, PDF and SVG viewers, R data inspectors, linked plots and collaborative spreadsheets fully integrated with R functions and data. Biocep is also a toolbox that can be used to generate stateless and stateful web services automatically mapping R functions. It enables Python/Groovy scripting with remote R engines on client and on server sides. Using the Biocep frameworks, pools of R engines can be deployed on heterogeneous nodes, managed and used for parallel and distributed computing and for generating dynamic content on-the-fly for highly scalable web applications.

A Biocep based R virtualization infrastructure has been successfully deployed on the National Grid Service. The result leaves no doubt about how useful this service would become for researchers. If the new platform was widely adopted, it would greatly enhance the usability of existing HPC infrastructures and would increase their usage. It may also work as an enabler of a new computing business model that would synergize the utility computing model (resources) and the pay-per-use software model (components/GUIs).

Project Home : www.biocep.net

Examples of use cases

Mining large-scale data: dealing with the data deluge The volume of data generated in research is growing and the data can’t be moved any more to be analyzed. Biocep takes the computation to the data. An R engine is a « robotic hand » that can operate anywhere: “near” the big files or within a database. It is extensible via components (R packages and dynamically-loaded Java code) and remote-controllable from anywhere via the Biocep interfaces or the Virtual R Workbench (extensible with plugins, collaborative).

R session alive for ever With Biocep, an R user can create an R engine anywhere (clusters, EC2 server..) and connect to it, use it and disconnect from it. Once reconnected again (from anywhere), the R user retrieves his full environment including his graphic devices.

Workflows with computational Web ServicesThe statefulness of the Web Services generated by Biocep solves the overhead problem caused by the transfer of intermediate results. Between Workflow nodes, only proxies referencing the data (kept in the memory of the R engine allocated for the computing session) are propagated.

R Virtualization Collaborative R

Workflows with generated stateful Web Services

Automatic Web Sevices generation for R functions

Scripting with one or many remote R engines

NGSR engines pools deployment Forthcoming Roadmap

Documentation: User manuals, deployment documents, Javadoc..Workbench: Plugins architecture finalization.Parallel computing: Finish implementing ‘Snow ‘like APIs with Biocep. Biocep based MapReduce. Remote engines fine-grain control from the R console.Security: Finish implementing the security architecture.Cloud Computing: Provide VMWare virtual machines and public AMIs (EC2) with R & Biocep pre-installed. Integrate basic AMIs management to Biocep.Workflows: Implement the automatic generation of Knime nodes for R functions. Evangelization: Seminars and tutorials planned for several institutions and companies. Tutorial at the 4th IEEE International Conference on e-Science (Indianapolis). Search for funding: Please get in touch with the author if you are Interested in being a sponsor or a partner for Biocep. Author’s contact details: [email protected] - +44(0)7769961344

Acknowledgements

AT&T Research Labs: Simon Urbanek BAH: Ghazi Ben Amor EBI: Alvis Brazma, Wolfgang Huber, Kimmo Kallio, Misha Kapushesky, Michael Kleen, Alberto Labarga, Philippe Rocca-Serra,Ugis Sarkans,Kirsten Williams EPFL: Darlene Goldstein ETH Zürich: Yohan Chalabi, Diethelm Würtz, Martin Mächler FHCRC: Seth Falcon, Martin Morgan Imperial College London: Asif Akram, Jeremy Cohen, Vasa Curcin, John Darlington, Brian Fuchs Lancaster University: Robert Crouchley OeRC: Dimitrina Spencer, Matteo Turilli, David Wallom, Steven Young Oracle: Andrew Bond Platform Computing: Christopher Smith Technische Universität Dortmund: Uwe Ligges University of Manchester: Richard D Pearson University of Split: Ivica Puljak