i don't care where my data and methods are: a web-service approach for distributed access to...

27
A Web-Service Approach for Distributed Access to Methods, Data and Models Rajarshi Guha Geoffrey Fox Kevin E. Gilbert Marlon Pierce David Wild Overview Pub3D Model Exchange A Web-Service Approach for Distributed Access to Methods, Data and Models (I Don’t Care Where My Data and Methods Are) Rajarshi Guha Geoffrey Fox Kevin E. Gilbert Marlon Pierce David Wild School of Informatics Indiana University 235 th ACS National Meeting 6 th March, 2008

Upload: rguha

Post on 14-Jun-2015

1.524 views

Category:

Education


0 download

TRANSCRIPT

Page 1: I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models

A Web-ServiceApproach forDistributedAccess to

Methods, Dataand Models

Rajarshi GuhaGeoffrey Fox

Kevin E. GilbertMarlon PierceDavid Wild

Overview

Pub3D

Model Exchange

A Web-Service Approach for DistributedAccess to Methods, Data and Models(I Don’t Care Where My Data and Methods Are)

Rajarshi Guha Geoffrey Fox Kevin E. GilbertMarlon Pierce David Wild

School of InformaticsIndiana University

235th ACS National Meeting6th March, 2008

Page 2: I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models

A Web-ServiceApproach forDistributedAccess to

Methods, Dataand Models

Rajarshi GuhaGeoffrey Fox

Kevin E. GilbertMarlon PierceDavid Wild

Overview

Pub3D

Model Exchange

Outline

Overview

Pub3D

Model Exchange

Page 3: I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models

A Web-ServiceApproach forDistributedAccess to

Methods, Dataand Models

Rajarshi GuhaGeoffrey Fox

Kevin E. GilbertMarlon PierceDavid Wild

Overview

Pub3D

Model Exchange

Local versus Distributed Resources

I Access arbitrary resources (methods, data, applications)

I All resources look like function calls

Page 4: I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models

A Web-ServiceApproach forDistributedAccess to

Methods, Dataand Models

Rajarshi GuhaGeoffrey Fox

Kevin E. GilbertMarlon PierceDavid Wild

Overview

Pub3D

Model Exchange

Combining Data & Functionality

I But what do we mean by “combining” all theseresources?

I Different levels of complexityI Keep track of new additions to a database via an RSS

feedI Use Yahoo Pipes to combine and manipulate output

easilyI Write full fledged programs in your language of choice

I Web services allow us to support all these activities in auniform manner

Page 5: I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models

A Web-ServiceApproach forDistributedAccess to

Methods, Dataand Models

Rajarshi GuhaGeoffrey Fox

Kevin E. GilbertMarlon PierceDavid Wild

Overview

Pub3D

Model Exchange

Web Services - What’s Available

Cheminformatics functionality

I Molecular descriptors

I Similarity (2D, 3D)

I Format conversion

I Depictions

I 3D coordinate generation

I Summarized on ChemBioGrid

Dong, X. et al, J. Chem. Inf. Model., 2007, 47, 1303–1307

http://www.chembiogrid.org/projects/proj_core.html

Page 6: I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models

A Web-ServiceApproach forDistributedAccess to

Methods, Dataand Models

Rajarshi GuhaGeoffrey Fox

Kevin E. GilbertMarlon PierceDavid Wild

Overview

Pub3D

Model Exchange

Web Services for Modeling

I Computational web services can be viewed as wrappersaround the actual program

I Since predictive models are a common feature incheminformatics we’d like to support them as well

I This leads to a number of requirementsI Ability to develop modelsI Store (deploy) modelsI Use the models via the web service infrastructure

I We provide a computational infrastructure based on Rwhich supports

I Feature selectionI Model developmentI Model deploymentI Arbitrary R code

Guha, R. J. Chem. Inf. Model., 2008, 48, 456–464

Page 7: I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models

A Web-ServiceApproach forDistributedAccess to

Methods, Dataand Models

Rajarshi GuhaGeoffrey Fox

Kevin E. GilbertMarlon PierceDavid Wild

Overview

Pub3D

Model Exchange

What’s Available?

I Regression (OLS, CNN, RF)

I Classification (LDA)

I Clustering (k-means)

I Feature selection (stepwise and exhaustive)I Automated model generation

I Load X, Y data, build linear and non-linear models withoptional LOO CV

I Deployed modelsI Random forests for 60 NCI DTP cell linesI CytotoxicityI Ames mutagenicity

Wang, H. et al., J. Chem. Inf. Model., 2007, 47, 2063–2076

Guha, R. and Schurer, S., J. Comp. Aid. Molec. Des, 2008, ASAP

Page 8: I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models

A Web-ServiceApproach forDistributedAccess to

Methods, Dataand Models

Rajarshi GuhaGeoffrey Fox

Kevin E. GilbertMarlon PierceDavid Wild

Overview

Pub3D

Model Exchange

Web Services - What’s Available

Data sources

I We maintain a number of databasesI PubChem mirror (mainly for local research)I Pub3D

I Directly accessible via queries in SQL

I We also encapsulate specific queries as web service calls

http://www.chembiogrid.org/projects/proj_db.html

Page 9: I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models

A Web-ServiceApproach forDistributedAccess to

Methods, Dataand Models

Rajarshi GuhaGeoffrey Fox

Kevin E. GilbertMarlon PierceDavid Wild

Overview

Pub3D

Model Exchange

REST Frontends to SOAP Services

I REST is a network architecture that avoids complexmessage formats

I No extra libraries requiredI Access services using URL’sI Much simpler interface compared to SOAP

I We’ve been putting REST interfaces onto a number ofour SOAP services

I Current REST services includeI Database (3D structure, PubChem mirror)I Molecular descriptorsI Depictions

Page 10: I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models

A Web-ServiceApproach forDistributedAccess to

Methods, Dataand Models

Rajarshi GuhaGeoffrey Fox

Kevin E. GilbertMarlon PierceDavid Wild

Overview

Pub3D

Model Exchange

REST 3D Structure Service

I To get a 3D structure for a PubChem CID

http://www.chembiogrid.org/cheminfo/rest/db/pub3d/2244

Page 11: I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models

A Web-ServiceApproach forDistributedAccess to

Methods, Dataand Models

Rajarshi GuhaGeoffrey Fox

Kevin E. GilbertMarlon PierceDavid Wild

Overview

Pub3D

Model Exchange

REST Depiction Service

I To get a 2D depiction for arbitrary SMILES

http://www.chembiogrid.org/cheminfo/rest/depict/C(=O)OCC

Page 12: I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models

A Web-ServiceApproach forDistributedAccess to

Methods, Dataand Models

Rajarshi GuhaGeoffrey Fox

Kevin E. GilbertMarlon PierceDavid Wild

Overview

Pub3D

Model Exchange

REST Descriptor Service

I To get descriptors for an arbitrary SMILES string

http://www.chembiogrid.org/cheminfo/rest/desc/

descriptors/CC(=O)OCCN

Page 13: I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models

A Web-ServiceApproach forDistributedAccess to

Methods, Dataand Models

Rajarshi GuhaGeoffrey Fox

Kevin E. GilbertMarlon PierceDavid Wild

Overview

Pub3D

Model Exchange

Pub3D

I A 3D structure database derived from PubChemI Current version contains a single 3D structure for 17M

compoundsI Structures obtained using MMFF94I Not the lowest energy conformer

I Structures can be retrieved in SD format by CID using aweb page or web service interfaces

I We also store distance moment shape descriptorsallowing us to perform shape similarity searches

I http://www.chembiogrid.org/cheminfo/p3d/

Ballester, P.J. and Graham Richards, W., J. Comp. Chem., 2007, 28, 1711–1723

Page 14: I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models

A Web-ServiceApproach forDistributedAccess to

Methods, Dataand Models

Rajarshi GuhaGeoffrey Fox

Kevin E. GilbertMarlon PierceDavid Wild

Overview

Pub3D

Model Exchange

Pub3D Performance

I Shape searches can be as fast as 5s, for reasonably largeresult sets

I Fast enough for us to explore the “density of space”around a given query compound

Query Times for 10000 Random CIDs (R = 0.4)

Time (sec)

Num

ber

of Q

uerie

s

0 20 40 60 80

050

010

0015

0020

0025

0030

0035

00

0.0 0.1 0.2 0.3 0.4 0.5 0.6

020

000

6000

010

0000

1400

00

Radius

Nea

rest

Nei

ghbo

r C

ount

● ● ● ● ●●

● ● ● ●●

1 0.91 0.83 0.77 0.71 0.67 0.62

Similarity

Levitra (110634)

Diazepam (3016)

Taxol (36314)

Didanosine (50599)

Calcascorbin (6247)

Guha, R. et al., J. Chem. Inf. Model., 2006, 46, 1713–1722

Page 15: I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models

A Web-ServiceApproach forDistributedAccess to

Methods, Dataand Models

Rajarshi GuhaGeoffrey Fox

Kevin E. GilbertMarlon PierceDavid Wild

Overview

Pub3D

Model Exchange

Pub3D & Clustering

I Indexing gives goodperformance

I As index size increases,performance degrades

I Could add more RAM

I Clustering the databaseallows us to scale tosignificantly larger collections

Page 16: I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models

A Web-ServiceApproach forDistributedAccess to

Methods, Dataand Models

Rajarshi GuhaGeoffrey Fox

Kevin E. GilbertMarlon PierceDavid Wild

Overview

Pub3D

Model Exchange

Pub3D & Clustering

I Indexing gives goodperformance

I As index size increases,performance degrades

I Could add more RAM

I Clustering the databaseallows us to scale tosignificantly larger collections

Page 17: I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models

A Web-ServiceApproach forDistributedAccess to

Methods, Dataand Models

Rajarshi GuhaGeoffrey Fox

Kevin E. GilbertMarlon PierceDavid Wild

Overview

Pub3D

Model Exchange

Pub3D and Conformers

I Single, arbitrary conformers aren’t too usefulI We have currently generated conformers for a subset of

PubChem (4 to 10 heavy atoms)I 3 Kcal energy windowI 243,892 compoundsI ≈ 2M conformers

I Clustering will be vital when conformers are consideredI Allows us to handle arbitrary numbers of conformersI May need to consider some sort of compressionI Could use cluster information to optimize queries

Page 18: I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models

A Web-ServiceApproach forDistributedAccess to

Methods, Dataand Models

Rajarshi GuhaGeoffrey Fox

Kevin E. GilbertMarlon PierceDavid Wild

Overview

Pub3D

Model Exchange

Model Exchange

I The literature contains many published modelsI No way to utilize them unless we manually rebuild them

I In many cases we will not have access to descriptors

I Difficult to search for models specificallyI We should be able to do the following

I Search for modelsI Exchange themI Execute them

I Is this a “format” issue?I To some extent yes

Page 19: I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models

A Web-ServiceApproach forDistributedAccess to

Methods, Dataand Models

Rajarshi GuhaGeoffrey Fox

Kevin E. GilbertMarlon PierceDavid Wild

Overview

Pub3D

Model Exchange

Model Exchange

I The literature contains many published modelsI No way to utilize them unless we manually rebuild them

I In many cases we will not have access to descriptors

I Difficult to search for models specificallyI We should be able to do the following

I Search for modelsI Exchange themI Execute them

I Is this a “format” issue?I To some extent yes

Page 20: I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models

A Web-ServiceApproach forDistributedAccess to

Methods, Dataand Models

Rajarshi GuhaGeoffrey Fox

Kevin E. GilbertMarlon PierceDavid Wild

Overview

Pub3D

Model Exchange

PMML for Model Exchange

I Predictive Modelling Markup Language

I A standard supported by the Data Modeling GroupI Allows you to serialize predictive models to XML

I Linear, logistic regressionI Tree modelsI Neural networks, Naıve Bayes modelsI Association modelsI Ensemble models (random forests, arbitrary ensembles)

I Supported by a number of platformsI IBM, Salford SystemsI SAS, SPSSI R

Page 21: I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models

A Web-ServiceApproach forDistributedAccess to

Methods, Dataand Models

Rajarshi GuhaGeoffrey Fox

Kevin E. GilbertMarlon PierceDavid Wild

Overview

Pub3D

Model Exchange

PMML for Model Exchange

I Since it’s XML, it’s extensible

I PMML allows us to specify the model itselfI But we need to add extra information to make a model

truly searchableI Provenance (who made it, when was it made, . . . )I Property (what is being modeled)I Requirements (what are the inputs, how to get them)

Page 22: I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models

A Web-ServiceApproach forDistributedAccess to

Methods, Dataand Models

Rajarshi GuhaGeoffrey Fox

Kevin E. GilbertMarlon PierceDavid Wild

Overview

Pub3D

Model Exchange

PMML for Model Exchange

Provenance

I Easily solved using Dublin Core

Property

I Could be addressed using keywords

I Could reuse pre-existing ontologies

Requirements

I This is tricky

I Need to identify what descriptors were used

I What software, version

I How to evaluate the descriptors (if possible)

Page 23: I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models

A Web-ServiceApproach forDistributedAccess to

Methods, Dataand Models

Rajarshi GuhaGeoffrey Fox

Kevin E. GilbertMarlon PierceDavid Wild

Overview

Pub3D

Model Exchange

PMML for Model Exchange

Provenance

I Easily solved using Dublin Core

Property

I Could be addressed using keywords

I Could reuse pre-existing ontologies

Requirements

I This is tricky

I Need to identify what descriptors were used

I What software, version

I How to evaluate the descriptors (if possible)

Page 24: I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models

A Web-ServiceApproach forDistributedAccess to

Methods, Dataand Models

Rajarshi GuhaGeoffrey Fox

Kevin E. GilbertMarlon PierceDavid Wild

Overview

Pub3D

Model Exchange

PMML for Model Exchange

Provenance

I Easily solved using Dublin Core

Property

I Could be addressed using keywords

I Could reuse pre-existing ontologies

Requirements

I This is tricky

I Need to identify what descriptors were used

I What software, version

I How to evaluate the descriptors (if possible)

Page 25: I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models

A Web-ServiceApproach forDistributedAccess to

Methods, Dataand Models

Rajarshi GuhaGeoffrey Fox

Kevin E. GilbertMarlon PierceDavid Wild

Overview

Pub3D

Model Exchange

Summary

I Pub3D is a shape searchable version of PubChem

I Conformers will have to be considered for it to be useful

I Are searches meaningful? Benchmarks required

I Web services provide one approach to model deployment

I We should be able to search for models explicitly

I PMML is one extensible solution the addresses modelexchange

I Automated model execution is more challenging

Page 26: I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models

A Web-ServiceApproach forDistributedAccess to

Methods, Dataand Models

Rajarshi GuhaGeoffrey Fox

Kevin E. GilbertMarlon PierceDavid Wild

Overview

Pub3D

Model Exchange

Summary

I Pub3D is a shape searchable version of PubChem

I Conformers will have to be considered for it to be useful

I Are searches meaningful? Benchmarks required

I Web services provide one approach to model deployment

I We should be able to search for models explicitly

I PMML is one extensible solution the addresses modelexchange

I Automated model execution is more challenging

Page 27: I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models

A Web-ServiceApproach forDistributedAccess to

Methods, Dataand Models

Rajarshi GuhaGeoffrey Fox

Kevin E. GilbertMarlon PierceDavid Wild

Overview

Pub3D

Model Exchange

Acknowledgments

I Kangseok Kim

I NIH