


In this study, we chose the Logical Data Structure (LDS) Entity-Relationship data modeling notation over XML, UML, or RDF to describe the metadata structure. The tools used were ERwin Data Modeler Version 4.1.4 and ER/Studio Version 7.1. Detailed information about the ERwin notation and its usefulness in genomic data modeling is provided by Chen and Carlis [4].

A Clinical Proteomics Data Model for Managing Metadata of Mass Spectrometry Experiments

Scott H. Harrison1, Sudipto Saha1, Peter Hussey2, Xiang Zhang3, Jake Chen1

1School of Informatics, Indiana University–Purdue University Indianapolis, Indianapolis, IN, USA
2LabKey Software, Inc., Seattle, WA, USA

3Bindley Bioscience Center, Discovery Park, Purdue University, West Lafayette, IN, USA

Contacts: Peter Hussey, [email protected], and Jake Chen, [email protected]

Abstract

Data Model Design

Implementation as a MySQL & PHP application

Discussion

Proteomics is generally defined as the study of protein structure, function, and interaction, and tandem MS (MS/MS) has been a tool for determining the overall presence or absence of proteins as well as their varying levels of expression. The study protocol of an MS/MS experiment involves extracting and filtering a sample (often in liquid chromatography devices such as LC columns) to isolate particular sets of proteins. The filtered, extracted proteins to be analyzed are typically fragmented with an enzyme such as trypsin or by non-enzymatic approaches. Further preparation of the sample involves dilution of the protein fragments to a level that is ready for injection into an MS device. Between laboratories, the identification of proteins can be difficult to cross-validate given the diversity of experimental approaches and measurements, and the different databases do not generally overlap in each of their identified sets of proteins. Zhang et al. [1] present a recent summary of bio-systems technologies used for cancer diagnostics. To date, over a dozen candidate biomarkers have been identified with 5 different types of MS methods (SELDI-TOF, ESI-MS, MALDI-TOF, LC-MS, and Q-TOF) to detect the presence of 6 different forms of cancer (ovarian, prostate, lung, breast, liver, and colon). For statistical identification, there are at least 5 different search software algorithms in common use, and the sensitivity and specificity of the different algorithms vary significantly. The design, data system, and set of performance specifications for individual MS platform products and search algorithms result in unique profiles of strengths and weaknesses that ultimately impact the de facto criteria used to statistically identify and quantify proteins across various experiments.
Pragmatic and relational approaches have in the past been employed successfully to model and implement metadata for bio-system studies for the purpose of collaborative cross-experimental effort and analysis. The Proteomics Standards Initiative (PSI) working group of HUPO aims to have the developing MIAPE standard become the required set of minimal data elements for a proteomics publication. Jones et al. have created a functional genomics experiment object model (FuGE) [2]. Standardized file formats have also emerged to gather and compare MS data and metadata [3]. While many designs for proteomics metadata have been proposed using different foundation technologies (UML, XML), less attention has been paid to implementation and usage issues. In this study, we present a data model for clinical proteomics that has efficacy within an existing proteomics application. We examine two implementations of this data model: the first a direct implementation in a standalone web application, and the second an adaptation of the model to work within the LabKey/CPAS experimental biology platform. The two implementations illuminate some of the appropriate criteria for choosing a particular metadata implementation.

Introduction

Mass spectrometry (MS) experimentation for clinical proteomics research presents a special problem of metadata management, requiring accuracy in design and long-term sustainability of implementation given the formalities, costs, and durations associated with clinical trials. The growth of data and the intensive needs for quantification and analysis leading to discovery and clinical application require significant investment in the storage and processing capacities of computer infrastructure. Yet the degree of institutional commitment required for managing complex and costly proteomics laboratory resources, combined with cross-disciplinary and cross-validating analyses of data, prioritizes the need for a team-based software system for archival, retrieval, and collaboration. The data modeling solution we describe is centered upon use cases experienced by 5 research teams and more than 10 academic research labs collaborating as part of the Clinical Proteomic Technology Assessment for Cancer (CPTAC) project. Dissemination of the solution has involved implementing the developed CPTAC data model within the metadata framework of the widely used Computational Proteomics Analysis System (CPAS), and establishing a portal for data entry at the National Institute of Standards and Technology (NIST). The implementation of our initiative was based on logical data structures geared for flexibility and efficiency in data entry, and is achieving architectural goals for ease of use and customization.

References

1) Zhang, X., Wei, D., Yap, Y., Li, L., Guo, S., and Chen, F. Mass Spectrometry Reviews, 26 (2007), 403-431.
2) Jones, A., et al. Bioinformatics, 20, 10 (Jul. 2004), 1583-1590.
3) Pedrioli, P.G., et al. Nature Biotechnology, 22, 11 (Nov. 2004), 1459-1466.
4) Chen, J.Y., and Carlis, J.V. Genomic data modeling. Information Systems, 28 (2003), 287-310.

The first implementation of the data model (Fig. 2) was created by developers at the National Institute of Standards and Technology (NIST). Data tables from the data model were implemented in MySQL and delivered as a PHP-based web application to serve as an interface for data and metadata entry from multiple institutions. The system allows team members to securely share files and uses a distributed file system, Tranche, for storing and sharing proteomics data.

The data model we designed was based on a study of the CPTAC community (http://proteomics.cancer.gov/), in which each experiment records as much information as possible about the sample preparation method, LC separation method, and MS experiment. The schema of the CPTAC data model is shown in Fig. 1. There are 19 relational data tables that serve as the entities representing the concepts and objectives for how the CPTAC data model represents an experimental design of clinical proteomics. The main concepts and objectives of the data model are captured by 5 entities: Experiment, Study Protocol, LC Separation Method, Sample Preparation Method, and MS Search Method. Each entity contains a set of attributes for the experimental design characterizing the overall mass spectrometry analysis. In use, each CPTAC team performs many experiments with different MS platforms and software processing approaches to identify the proteins in the sample. Identified proteins with high statistical thresholds of confidence are stored in the CPTAC Catalog Protein entity.
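The relationships among the five core entities can be sketched relationally. The following is a minimal illustration only, using Python's sqlite3 in place of MySQL: the five entity names come from the model above, but the column names and sample values are hypothetical stand-ins, not the actual 19-table CPTAC schema.

```python
import sqlite3

# In-memory sketch of the five core entities; columns are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SamplePreparationMethod (
    id INTEGER PRIMARY KEY,
    fragmentation_enzyme TEXT,   -- e.g. 'trypsin'
    final_dilution TEXT
);
CREATE TABLE LCSeparationMethod (
    id INTEGER PRIMARY KEY,
    column_type TEXT,
    gradient_minutes REAL
);
CREATE TABLE MSSearchMethod (
    id INTEGER PRIMARY KEY,
    search_engine TEXT,          -- one of the several algorithms in common use
    score_threshold REAL
);
CREATE TABLE StudyProtocol (
    id INTEGER PRIMARY KEY,
    name TEXT,
    sample_prep_id INTEGER REFERENCES SamplePreparationMethod(id),
    lc_separation_id INTEGER REFERENCES LCSeparationMethod(id),
    ms_search_id INTEGER REFERENCES MSSearchMethod(id)
);
CREATE TABLE Experiment (
    id INTEGER PRIMARY KEY,
    team TEXT,
    ms_platform TEXT,            -- e.g. 'MALDI-TOF', 'LC-MS'
    protocol_id INTEGER REFERENCES StudyProtocol(id)
);
""")

# A minimal round trip: register one protocol and one experiment against it.
conn.execute("INSERT INTO MSSearchMethod VALUES (1, 'SEQUEST', 0.95)")
conn.execute("INSERT INTO StudyProtocol (id, name, ms_search_id) "
             "VALUES (1, 'example protocol', 1)")
conn.execute("INSERT INTO Experiment (id, team, ms_platform, protocol_id) "
             "VALUES (1, 'example team', 'LC-MS', 1)")
row = conn.execute("""
    SELECT e.ms_platform, m.search_engine
    FROM Experiment e
    JOIN StudyProtocol p ON e.protocol_id = p.id
    JOIN MSSearchMethod m ON p.ms_search_id = m.id
""").fetchone()
print(row)  # ('LC-MS', 'SEQUEST')
```

The point of the normalization is that each method is described once and then referenced by many experiments, so cross-experiment queries (for example, all experiments sharing a search method) reduce to joins.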

The approach we pursued commenced with a data model that was subsequently implemented both in a newly enhanced version of a popular proteomics platform and in an elementary transactional framework in the form of a MySQL-based web application. A major challenge for long-term clinical experiments is to allow for the expansion of the information collected after a set of experiments has begun, or for necessary integrations of experimental metadata as new collaborations arise. We find it promising that updating customized sets of values leads to the auto-generation of dropdown lists in the CPAS interface, and that the approach of virtualized tables and attributes allows for future extension. It is in this way that our goals for flexibility, efficiency, and ease of use have been accomplished. Future effort is now geared toward enhancing the analytical performance of proteomics data processing by structuring data analyses with the available experimental metadata. Our study presents a stage of software development for clinical proteomics research in which the formalized structures of logic from a data model are being successfully deployed into collaboration-based, production-level systems.

Figure 1. The CPTAC data model

For the model implementation on CPAS, we used version 2.2 of LabKey Server (LabKey Software, www.labkey.org). LabKey was run on a Windows OS with a PostgreSQL database server and Apache Tomcat web server. The first step in implementing the data model on CPAS was to determine which objects in the data model already had analogues in the CPAS schema. It turned out that many of the proposed CPTAC objects were already available in the Experiment and MS2 schemas in CPAS, albeit with different names. For example, the core Experiment object in Figure 1 corresponds very closely with the ExperimentRun object in CPAS. Figure 3 shows a condensed view of how the structure of the CPTAC data model was implemented around the CPAS schema.

For the model implementation on CPAS, the new Assay and List features in version 2.2 of LabKey Server were used to implement the customized objects of the CPTAC metadata model. Lists are application-specific lookup tables of values, for example a set of valid physical MS machines and their types. Lists are designed to be easy for a researcher to define and to load from spreadsheet data. The resulting interface for an implemented MSInstrument list is shown in Figure 4. Assays are experiment runs whose metadata, for accuracy of recording by the experimentalist, needs to be captured at run time. The Assay designer allows a researcher to specify list-valued and single-valued attributes (Figure 5). At experiment run time, CPAS translates the Assay definition into a generated data-entry page for annotating a specific run. List-valued attributes become drop-down “pick list” controls on the web page that allow users to select from the fixed set of values. Also, descriptions, stored as attributes of the Assay design, appear when the user hovers the mouse cursor over a question mark symbol (Figure 6). Both the Assay and List features utilize virtualized tables, enabling CPAS to adapt to project-specific data models.
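The idea of generating form controls from an Assay definition can be sketched in a few lines. This is a hypothetical illustration of the pattern, not CPAS's actual implementation; the attribute names (MSInstrument, FinalDilution) mirror the figures, while the instrument values and the function are invented for the example.

```python
# A "List" plays the role of an application-specific lookup table;
# the values here are placeholders for whatever a team loads.
lists = {
    "MSInstrument": ["InstrumentA", "InstrumentB", "InstrumentC"],
}

# An "Assay design" declares which attributes are single-valued
# (free entry) and which are list-valued (constrained to a List).
assay_design = [
    {"name": "FinalDilution", "kind": "single"},
    {"name": "MSInstrument", "kind": "list"},
]

def generate_form(design, lookup_lists):
    """Translate an assay design into form-field specs: list-valued
    attributes become drop-down pick lists; single-valued attributes
    become plain input boxes."""
    fields = []
    for attr in design:
        if attr["kind"] == "list":
            fields.append({"name": attr["name"],
                           "control": "dropdown",
                           "options": lookup_lists[attr["name"]]})
        else:
            fields.append({"name": attr["name"], "control": "textbox"})
    return fields

form = generate_form(assay_design, lists)
```

Because the form is derived from the List contents at generation time, adding a value to a List automatically surfaces it in the drop-down, which is the extensibility property noted in the Discussion.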

Implementation on LabKey/CPAS

Figure 2. Implementation of the CPTAC data model as a web application using MySQL and PHP (from NIST).

Figure 3. The CPTAC data model as adapted to CPAS on LabKey Server. The tables with a white background are existing objects in the Experiment and MS2 schemas of CPAS. The objects and attributes with a gray background are specific to the CPTAC application and are in most cases modeled as list-valued and single-valued custom attributes of the ExperimentRun object.

Figure 4. Defining “Lists” for lookup tables. The first step in implementing a metadata model using the Assay feature in CPAS is to define and populate the set of lookup tables that hold specific choices for some metadata attributes.

Figure 5. The Assay designer in LabKey/CPAS. The Assay designer describes the metadata to be collected at run upload time. Data collected consists of both single-valued entries (such as Final Dilution) and list-valued entries (such as MSInstrument).

Figure 6. The data entry form generated from the Assay definition. When the experimentalist is ready to upload MS2 data, this page collects the metadata described in Figure 4. Pick lists, defaults, and help text all work to improve the efficiency and accuracy of the metadata captured.

Acknowledgements

This work was supported by a grant from the National Cancer Institute (U24CA126480-01), part of NCI’s Clinical Proteomic Technologies Initiative (http://proteomics.cancer.gov). This work was also supported by a Development Award Grant from the Canary Foundation. We thank Paul Rudnick from NIST for initial discussions and help to implement the draft model #1.