BioCoRE and GEMS: Cyber Infrastructure for Cyber Chemistry
Jesús A. Izaguirre
Computer Science & Engineering
University of Notre Dame
with Kirby Vandivort
NIH Resource for Macromolecular Modeling and Bioinformatics
University of Illinois
BioCoRE and GEMS 3 October 2004
Overview I
• Chemical applications such as virtual screening, protein kinetics and structure, and analysis and validation of molecular simulations require enormous resources that can be provided by CyberInfrastructure
• Successful solution of these problems requires collaborative approaches, also facilitated by CyberInfrastructure
Overview II
To make CyberInfrastructure effective, the following issues must be addressed:
• Users of CyberInfrastructure need a data-centric way of managing their computations and data
• Distributed databases on the grid need to address the problem of reliability and fault-tolerance of data
Overview III
• We will study examples of collaborative software that address these issues, primarily:
– BioCoRE: A Collaboratory for Structural Biology
– GEMS: Grid Enabled Molecular Simulations Toolset and Database
Sample CyberScience Projects
• Collaborative Biophysics: BioCoRE (K. Schulten, Illinois)
• Virtual Screening: The Screensaver Project (W.G. Richards, Oxford)
• Protein Kinetics: Folding@Home (V. Pande, Stanford)
• Distributed Database of Molecular Simulations: BioSimGrid (M. Sansom, Oxford)
What is BioCoRE?
BioCoRE: a collaborative work environment for biomedical research, research management and training.
BioCoRE assists the entire research process, from talking with collaborators to performing simulations and collecting data, to preparing papers and reports.
Sharing Documents
With the BioFS and WebDAV, scientists can exchange and edit files from anywhere with a web connection.
Setting Up and Running Simulations
• NAMDCFG: A “Simulation Setup Wizard”
• Online help and error checking for NAMD input files
• Job submission to supercomputers simplified
• Job status monitored for easy retrieval
• Job data archived for future reference
Sharing Molecular Views
Using VMD and BioCoRE, collaborators may exchange and manipulate 3-D models of molecules.
Emphasis on collaborative sessions.
Streamlined process of sharing views.
Communicating
• Control Panel provides instant messaging and notifications
• BioCoRE also provides message boards, Web site library, lab book
Programming Interface
• Provides a way for users to interact with BioCoRE programmatically.
• Communication (Control Panel), shared states (VMD)
• WebDAV
Availability
• Free
• Can be accessed from Illinois site, or server software can be installed locally
• Server software can be modified if necessary
• http://www.ks.uiuc.edu/Research/biocore/
Virtual Screening
• Combinatorial Complexity Lead Exploration
• Screen docking affinities based on a scoring function (interaction energies, RMSD, etc…)
• Modeled as an all pairs problem
• Logically independent computational requirements are well suited for wide area grid distribution
[Figure: leads (ligands) L0001 to L0005]
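The all-pairs formulation above can be sketched in Python as follows. This is a minimal illustration, not the method of any project named in these slides: the receptor names and the toy `score` function are hypothetical placeholders, and a real scoring function would compute interaction energies, RMSD, or similar.

```python
from itertools import product


def score(receptor: str, ligand: str) -> float:
    """Hypothetical stand-in for a docking score (lower is better)."""
    # Deterministic toy value so the example is reproducible.
    return float((len(receptor) * 7 + int(ligand[1:]) * 13) % 10)


def all_pairs_tasks(receptors, ligands):
    """Every (receptor, ligand) pair is one logically independent task."""
    return list(product(receptors, ligands))


def screen(receptors, ligands, top_n=3):
    """Score every pair and return the top_n best-scoring pairs."""
    tasks = all_pairs_tasks(receptors, ligands)
    # Each task depends only on its own pair, so this loop could be
    # distributed over grid nodes with no communication between tasks.
    results = [(score(r, l), r, l) for r, l in tasks]
    return sorted(results)[:top_n]


receptors = ["conf_A", "conf_B"]  # assumed receptor conformations
ligands = ["L0001", "L0002", "L0003", "L0004", "L0005"]
hits = screen(receptors, ligands)
```

Because the number of tasks is the product of the two list sizes, adding receptor conformations (for flexibility) multiplies the work, which is why wide-area distribution matters here.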
CyberInfrastructure Needs for Virtual Screening I
• Incorporate protein (receptor) flexibility
– Use multiple protein structures (hierarchical representations and algorithms)
• Iterative refinement of results
– Add new protein conformations to improve docking
– Use higher resolution models for promising hits (integration of data and work flow)
– Monitor status of results (not just jobs running)
CyberInfrastructure Needs for Virtual Screening II
• Manage computation and storage in the grid
– Declarative rather than imperative specification
• Automate usage of algorithms / tools
– Select software and optimal parameters for algorithms (recommender system)
– Example: MDSimAid (http://mdsimaid.cse.nd.edu) selects optimal MD simulation protocol (limited options)
BioSimGrid (Mark S. P. Sansom, Oxford)
• Database for biomolecular simulations
• Specifically: molecular dynamics trajectories
• Facilitate validation and analysis of simulations
• Provides “independence” from the specific simulation semantics (configuration parameters, architecture, simulation tools, etc.)
• Trajectory data stored in relational database tables per Data Schema
• Semi-automated deposition of trajectory files for certain formats (CHARMM, NAMD, etc.)
• Trajectory analysis modules
• Future goal to distribute database
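As a rough illustration of storing trajectory data in relational tables, the sketch below uses Python's built-in `sqlite3` module. The table and column names are assumptions made for this example and are not BioSimGrid's actual Data Schema; `deposit` and `mean_x_per_frame` are hypothetical helpers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trajectory (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    software TEXT NOT NULL      -- package that produced the trajectory
);
CREATE TABLE frame (
    trajectory_id INTEGER NOT NULL REFERENCES trajectory(id),
    frame_index INTEGER NOT NULL,
    time_ps REAL NOT NULL,
    PRIMARY KEY (trajectory_id, frame_index)
);
CREATE TABLE coordinate (
    trajectory_id INTEGER NOT NULL,
    frame_index INTEGER NOT NULL,
    atom_index INTEGER NOT NULL,
    x REAL, y REAL, z REAL,
    PRIMARY KEY (trajectory_id, frame_index, atom_index)
);
""")


def deposit(conn, name, software, frames):
    """Store frames given as [(time_ps, [(x, y, z), ...]), ...]."""
    cur = conn.execute(
        "INSERT INTO trajectory (name, software) VALUES (?, ?)",
        (name, software),
    )
    tid = cur.lastrowid
    for i, (time_ps, atoms) in enumerate(frames):
        conn.execute("INSERT INTO frame VALUES (?, ?, ?)", (tid, i, time_ps))
        conn.executemany(
            "INSERT INTO coordinate VALUES (?, ?, ?, ?, ?, ?)",
            [(tid, i, a, x, y, z) for a, (x, y, z) in enumerate(atoms)],
        )
    return tid


def mean_x_per_frame(conn, tid):
    """Toy analysis module: average x coordinate in each frame."""
    rows = conn.execute(
        "SELECT frame_index, AVG(x) FROM coordinate "
        "WHERE trajectory_id = ? GROUP BY frame_index ORDER BY frame_index",
        (tid,),
    )
    return rows.fetchall()


tid = deposit(conn, "demo", "NAMD", [
    (0.0, [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]),
    (1.0, [(1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]),
])
```

Once coordinates are in a common schema, the analysis query does not need to know which simulation package wrote the original file, which is one reading of the “independence” bullet above.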
CyberInfrastructure Needs for Distributed Databases I
• Metadata for trajectories
– Simulation protocol, software, etc.
• Distribution on the grid
– Storage fault tolerance / reliability
– Scalable solution: reduce storage requirements and centralization
CyberInfrastructure Needs for Distributed Databases II
• Data-driven model for the user
– Data organized around key themes (trajectories, molecules)
• Generic tools for developers
– Applicable to different applications
Solving Integration Problem
• We need to capture the data flow and the work flow
– Ecce project
– XML metadata
– Component architectures (e.g., JavaBeans, Common Component Architecture)
Solving Integration Problem
• BioCoRE (K. Schulten, Illinois)
– Use of programming interface
– Provides multiple services to applications (web file system, job management, shared visualization)
Solving Grid Management
• Current grid tools are task oriented: run this particular simulation code with these input files, etc.
– Web portals are an incremental improvement over command line or stand alone applications
• Problem: Controlling multiple resources
– For example, create 10,000 tasks & keep track of the data, as might be needed for virtual screening or @home applications
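To make the bookkeeping problem concrete, here is a small sketch of what tracking 10,000 tasks and their results involves even before any grid middleware is considered. The `Task` record, its status values, and the helper functions are all hypothetical names chosen for this example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    task_id: int
    params: dict
    status: str = "pending"          # pending -> done | failed
    result: Optional[float] = None


def make_tasks(n):
    """Create n independent tasks, one per (hypothetical) ligand."""
    return {i: Task(i, {"ligand": f"L{i:05d}"}) for i in range(n)}


def record(tasks, task_id, result=None):
    """Record the outcome of a task; a missing result means failure."""
    task = tasks[task_id]
    if result is None:
        task.status = "failed"
    else:
        task.status = "done"
        task.result = result


def summary(tasks):
    """Count tasks by status, i.e. the status of results, not of jobs."""
    return Counter(t.status for t in tasks.values())


tasks = make_tasks(10_000)
record(tasks, 0, result=-7.2)
record(tasks, 1)  # no result reported: marked as failed
counts = summary(tasks)
```

With task-oriented tools, the user has to maintain this mapping from parameters to results by hand; a data-centric interface would keep it for them.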
Solving Grid Management with GIPSE
• GIPSE: Grid Interface for Parameter-driven Simulation Environments
– Shift focus from management to research
– Result-driven interface
– Scripting capabilities
Solving Grid Management with GIPSE
• Simplify process
– XML Data format
– Missing “glue”
• Powerful searches
– Optimizations
– Control loops
[Figure: GEMS Toolset, HIV-1 Protease]
Solving Grid Management with GIPSE
• Manage data
– Storage
– Database retrieval
• Monitor progress
– Status
– Application-specific
[Figure: GEMS Toolset, HIV-1 Protease]
GEMS Database Toolset
• Grid Enabled Molecular Simulation
– Data Centric
– Wide area distributed storage
– Researchers have data and resource autonomy
– Simulation configuration, input data files, and output data files identified via XML
– Centralized SQL locator
– Availability via replication
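The combination of XML identification, a centralized SQL locator, and replication might look roughly like the sketch below. It is an assumption-laden illustration, not the GEMS implementation: the XML element names, the `replica` table, the helper functions, and the `example.org` host names are all invented for this example.

```python
import sqlite3
import xml.etree.ElementTree as ET


def describe_run(run_id, config, inputs, outputs):
    """Build an XML descriptor identifying a run's configuration and files."""
    root = ET.Element("simulation", id=run_id)
    cfg = ET.SubElement(root, "configuration")
    for key, value in sorted(config.items()):
        ET.SubElement(cfg, "param", name=key).text = str(value)
    for tag, names in (("input", inputs), ("output", outputs)):
        for name in names:
            ET.SubElement(root, tag, file=name)
    return ET.tostring(root, encoding="unicode")


# Centralized locator: records which storage hosts hold which file.
locator = sqlite3.connect(":memory:")
locator.execute(
    "CREATE TABLE replica (file TEXT, host TEXT, alive INTEGER, "
    "PRIMARY KEY (file, host))"
)


def register(file, host):
    locator.execute("INSERT OR REPLACE INTO replica VALUES (?, ?, 1)", (file, host))


def mark_down(host):
    """Flag every replica on a failed host as unavailable."""
    locator.execute("UPDATE replica SET alive = 0 WHERE host = ?", (host,))


def locate(file):
    """Return the hosts that can currently serve the file."""
    rows = locator.execute(
        "SELECT host FROM replica WHERE file = ? AND alive = 1 ORDER BY host",
        (file,),
    )
    return [host for (host,) in rows]


xml_doc = describe_run("run1", {"timestep_fs": 1.0}, ["protein.pdb"], ["traj.dcd"])
for host in ("siteA.example.org", "siteB.example.org"):
    register("traj.dcd", host)
mark_down("siteA.example.org")  # one replica lost; the other still serves the file
```

Only the locator is centralized in this sketch; the data files stay on the researchers' own storage hosts, which is one way to read "data and resource autonomy".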
Reliability and Leveraged Availability via Runtime Imaging
• Reliability of data storage is increased
• User can trade off availability versus storage volume
• Workspace data has 2-way redundancy by default
• Archival data has a 2-way redundancy of fewer snapshots, but saves the computational images
• For each computational run through the GEMS portal, a comprehensive runtime image is created from which the simulation can automatically be regenerated.
• Runtime images include executable version and location, library requirements, hardware requirements, input files, and configuration parameters
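A runtime image as listed above could be represented by a small immutable record. The sketch below is only an illustration under stated assumptions: the `RuntimeImage` class, its field names, the `--name=value` command-line convention, and the example path and file names are hypothetical, not the GEMS format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RuntimeImage:
    executable_version: str
    executable_location: str
    libraries: tuple       # library requirements
    hardware: str          # hardware requirements
    input_files: tuple
    parameters: tuple      # (name, value) configuration pairs

    def fingerprint(self):
        """Stable hash: identical images identify the same computation."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def command(self):
        """Rebuild the command line needed to regenerate the run."""
        args = [f"--{name}={value}" for name, value in self.parameters]
        return [self.executable_location, *args, *self.input_files]


image = RuntimeImage(
    executable_version="1.0",
    executable_location="/opt/md/bin/mdrun_demo",   # hypothetical path
    libraries=("libfftw3",),
    hardware="x86_64, 1 CPU",
    input_files=("protein.pdb", "protein.psf"),
    parameters=(("steps", 1000), ("timestep_fs", 1.0)),
)
rerun = replace(image)  # identical copy
changed = replace(image, parameters=(("steps", 2000), ("timestep_fs", 1.0)))
```

Storing the image instead of every output snapshot is what allows the trade-off between availability and storage volume: discarded snapshots can be recomputed from `command()`.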
Integration of Distributed Data Into New Simulations
• A grid distributed “make” based on a computational requirement over a set parameter sweep
– Example: optimize MD simulation protocol
• Before starting the sweep a query determines data points that are up to date and those that require computation (including regeneration)
– Example: keep current list of results of virtual screening as more computations are performed or targets and ligands added
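The "make"-like query described above can be sketched as follows. This is a simplified model, not the GEMS code: the in-memory `store` dictionary stands in for the distributed database, the integer `inputs_version` stands in for whatever freshness test is actually used, and the `compute` function is a placeholder for a real simulation.

```python
def plan_sweep(sweep, store, inputs_version):
    """Split a parameter sweep into up-to-date points and points to compute.

    store maps a parameter tuple to (inputs_version, result). An entry is
    up to date only if it was computed from the current inputs version.
    """
    up_to_date, to_compute = [], []
    for point in sweep:
        entry = store.get(point)
        if entry is not None and entry[0] == inputs_version:
            up_to_date.append(point)
        else:
            to_compute.append(point)   # missing or stale: (re)generate
    return up_to_date, to_compute


def run_sweep(sweep, store, inputs_version, compute):
    """Compute only what the query says is needed; return the task count."""
    _, todo = plan_sweep(sweep, store, inputs_version)
    for point in todo:
        store[point] = (inputs_version, compute(point))
    return len(todo)


# Sweep over (timestep in fs, cutoff in angstroms); values are illustrative.
sweep = [(dt, cutoff) for dt in (1.0, 2.0) for cutoff in (8.0, 10.0)]
store = {
    (1.0, 8.0): (2, 0.1),   # current
    (2.0, 8.0): (1, 0.3),   # stale: computed from older inputs
}
done, todo = plan_sweep(sweep, store, inputs_version=2)
```

Running `run_sweep` a second time with the same inputs version performs no work, which is the same incremental behavior that `make` gives for builds.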
Example: Validating Simulations
• Locate specific published simulation configurations for benchmarking
• Select pertinent input data files (pdb, psf, force fields, etc.) for direct use in a new simulation, for the purpose of comparison.
• Researcher B wants to vary certain parameters of Researcher A’s published simulation to test her new MD integrator
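The Researcher A / Researcher B scenario amounts to deriving a new configuration from a published one while changing only selected parameters. A minimal sketch, assuming configurations are simple key/value mappings; the parameter names and values below are invented for illustration.

```python
def derive_config(published, overrides):
    """Copy a published configuration, changing only the given parameters."""
    unknown = set(overrides) - set(published)
    if unknown:
        # Refuse silently added parameters so the comparison stays fair.
        raise KeyError(f"not in published configuration: {sorted(unknown)}")
    return {**published, **overrides}


def diff_config(a, b):
    """Report exactly which parameters differ between two configurations."""
    return {k: (a[k], b[k]) for k in a if a[k] != b[k]}


# Researcher A's published simulation (hypothetical values).
published = {
    "integrator": "verlet",
    "timestep_fs": 1.0,
    "steps": 1000,
    "structure": "protein.pdb",
}
# Researcher B swaps in her new integrator and a longer timestep.
candidate = derive_config(
    published, {"integrator": "my_new_integrator", "timestep_fs": 2.0}
)
changes = diff_config(published, candidate)
```

Keeping the input files and all other parameters identical is what makes the two runs directly comparable.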
Acknowledgments
• Collaborators in GIPSE and GEMS:
– Aaron Striegel
– Doug Thain
– Jeff Peng
• Students
– Paul Brenner
– Santanu Chatterjee
• Funding from NSF Career and Biocomplexity
• Klaus Schulten
• BioCoRE Team:
– Robert Brunner
– Michael Bach
– David Brandon
• BioCoRE funding from NIH