pushing chemical biology through the pipes

24
Pushing Chemical Biology Data Through the Pipes Architec(ng and Extending the BARD API Rajarshi Guha, John Braisted, Ajit Jadhav, DacTrung Nguyen, Tyler Peryea, Noel Southall September 9, 2014 Indianapolis, IN

Upload: rguha

Post on 10-May-2015

470 views

Category:

Technology


2 download

TRANSCRIPT

Page 1: Pushing Chemical Biology Through the Pipes

Pushing  Chemical  Biology  Data  Through  the  Pipes  

Architec(ng  and  Extending  the  BARD  API  

Rajarshi  Guha,  John  Braisted,  Ajit  Jadhav,  Dac-­‐Trung  Nguyen,  Tyler  

Peryea,  Noel  Southall    

September  9,  2014  Indianapolis,  IN  

   

Page 2: Pushing Chemical Biology Through the Pipes

Outline  

MoIvaIons  for  the  BARD  

InteracIng  with  the  BARD  

Extensibility  &  Community  

Page 3: Pushing Chemical Biology Through the Pipes

The  BioAssay  Research  Database  

•  Originated  from  the  NIH  Molecular  Libraries  Program  

•  MoIvated  to  make  the  bioassay  data  generated  by  the  MLP  more  accessible  and  amenable  to  exploraIon  and  hypothesis  generaIon  

•  Joint  effort  between  NCGC,  Broad,  UNM,  UMiami,  Vanderbilt,  Scripps,  Burnham  

Page 4: Pushing Chemical Biology Through the Pipes

Goals  of  the  BARD  

BARD’s  mission  is  to  enable  novice  and  expert  scien7sts  to  effec7vely  u7lize  MLP  data  to  

generate  new  hypotheses    

•  Developed  as  an  open-­‐source,    industrial-­‐strength  plaWorm  to  support  public  translaIonal  research  

•  Foster  new  methods  to  interpret  &  analyze  chemical  biology  data  

•  Develop  and  adopt  an  Assay  Data  Standard  •  Enable  co-­‐loca7on  of  data  and  methods  

 

Page 5: Pushing Chemical Biology Through the Pipes

Components  of  the  BARD  Pla=orm  

CAP, Data Dictionary, and Results Deposition Data model created & populated

CAP UI with View and basic editing

Dictionary defined as OWL using Protégé

Warehouse loaded with all PubChem AIDs and results

Warehouse loaded with GO terms, KEGG terms, and DrugBank annotations

Page 6: Pushing Chemical Biology Through the Pipes

Interac?ng  with  the  BARD  

Page 7: Pushing Chemical Biology Through the Pipes

Searching  the  BARD  

•  Full  text  search  via  Lucene/Solr  – Key  entry  point  for  new  users  – Search  code  runs  queries  in  parallel  – Fast  faceIng,  auto  suggest,  filters  

•  EnIIes  are  linked  manually  via  Solr  schema  – Allows  us  to  pick  up  enIIes  related  to  the  query  – Supply  matching  context  

•  Custom  NCATS  code  for  fast  structure  searches  

Page 8: Pushing Chemical Biology Through the Pipes

The  BARD  API  

•  A  RESTful  applicaIon  programming  interface  •  Access  to  individual  &  collecIons  of  enIIes  •  Documented  via  the  wiki  •  Versioned  •  Hit  a  URL,  get  a  JSON  response  

– Every  language  supports  parsing  JSON  documents  – Easy  to  inspect  via  the  browser  or  REST  clients  

h\p://bard.nih.gov/api/v18/  

Page 9: Pushing Chemical Biology Through the Pipes

API  Architecture  

•  Java,  read-­‐only,  deployed  on  Glassfish  cluster  •  Different  funcIonality  hosted  in  different  containers  – Maintenance,  security  – Stability  – Performance  

 

API

Text Search

Struct Search

Data Warehouse

Plugins

Page 10: Pushing Chemical Biology Through the Pipes

ETL Database Text Search Engine Structure Search Engine

Caching Layer

http://bard.nih.gov/api

Jersey Webapps deployed on HA

Application Server Cluster

Open  Source  as  Far  as  Possible  

Page 11: Pushing Chemical Biology Through the Pipes

API  Resources  

•  Covers  many  data  types  •  Each  resource  supports  a    variety  of  sub-­‐resources  – Usually  linked  to  other    resources  

•  Use  /_info  to  see  what    sub-­‐resources  are  available  

/assays/_info!

Page 12: Pushing Chemical Biology Through the Pipes

API  Resources  

En7ty   Count    /assays! 990  /biology! 1080  /cap! 1942  /compounds! 42,572,799  /documents! 10,499  /experiments! 1314  /exptdata! 32,754,242  /projects! 144  /substances! 113,751,456  

Page 13: Pushing Chemical Biology Through the Pipes

Data  Warning  

We’re  s7ll  in  the  process  of  data  cura7on  and  QC  so  while  you  can  access  all  BARD  data,  keep  in  mind  that  much  of  it  will  be  undergoing  review  and  cura7on  and  so  may  change  

Page 14: Pushing Chemical Biology Through the Pipes

Summary  Resources  

•  A  number  of  enIIes  have  a  /summary sub-­‐resource  (Compounds,  Projects)  

•  Aggregates  informaIon,  suitable  for  dashboards  

Page 15: Pushing Chemical Biology Through the Pipes

JSON  Responses  

•  All  responses  are    currently  JSON  

•  EnIIes  can  include    other  enIIes  (recursively)  

•  JSON  Schema  is  available    via    /_schema  

/assays/_schema!

Page 16: Pushing Chemical Biology Through the Pipes

Extending  the  API  

•  Concept  of  plugins  – Expands  the  resource  hierarchy  – Has  to  be  wri\en  in  Java  

•  What  can  a  plugin  accept?  – Anything  

•  What  can  a  plugin  provide?  – Anything  –  plain  text,  XML,  JSON,  HTML,  Flash  

•  Plugin  manifest  –  describe  available  resources,  argument  types    

Page 17: Pushing Chemical Biology Through the Pipes

Plugins  have  to    be  deployable    on  the  JVM  

Plugin  Development  Workflow  

Page 18: Pushing Chemical Biology Through the Pipes

What  Can  a  Plugin  Expect?  

•  Direct  access  to  the  database  via  JDBC  •  Faster  access  to  REST  API  via  co-­‐localizaIon  •  No  local  storage  in  the  BARD  warehouse  

– But  the  plugin  can  use  its  own  storage  (such  as  an  embedded  database)  

•  Plugins  have  access  to  system  JARs  (e.g.,  XOM,  JChem)  but  should  bundle  their  own  required  dependencies  

Page 19: Pushing Chemical Biology Through the Pipes

Plugin  Valida?on  

•  Run  a  series  of  checks  on  a  plugin  before  deployment  – Catches  manifest/resource  errors  – Doesn’t  check  for  correctness  (not  our  job)  – Examines  plugin  Java  class  – Examines  final  plugin  package  

•  Run  as  a  command  line  tool  or  from  your  code  •  Could  be  made  into  an  Ant/Maven  plugin  

Page 20: Pushing Chemical Biology Through the Pipes

What  Plugins  are  Available?  

•  The  plugin  registry  provides  a  list  of  deployed  plugins    – Path  to  the  plugin  – Version  – Availability  

•  Long  term  goal  is  to    have  a  plugin  store  

/plugins/registry/list!

Page 21: Pushing Chemical Biology Through the Pipes

Exemplar  Plugins  

•  Look  at  the  plugin  repository  on  Github  •  Ranges  from    

–  trivial  calls  to  external  service  –  Incorporate  external  tools/programs  

•  Current  set  of  plugins  highlight  plugin  development  

•  Plugins  let  us  move  non-­‐essenIal  funcIonality  out  of  the  core  API  

 

Page 22: Pushing Chemical Biology Through the Pipes

P.  Rydberg  et  al,  h\p://www.farma.ku.dk/smartcyp/  

BARD-­‐  SMARTCyp  

Page 23: Pushing Chemical Biology Through the Pipes

More  Than  Just  Data  

BARD is not just a data store – it’s a platform

•  Seamlessly interact with users’ preferred tools •  Allows the community to tailor it to their needs •  Serve as a meeting ground for experimental

and computational methods •  Enhance collaboration opportunities

Page 24: Pushing Chemical Biology Through the Pipes

Links  

@AskTheBARD  

hNp://bard.nih.gov  

API  Source  Code  Repository  

Plugin  Source  Code  Repository  

Development  Wiki