international collaboration in proteomics and informatics

Page 1: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

1

INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

Bibliotheca Alexandrina, 9 October, 2007

Gilbert S. Omenn, M.D., Ph.D.

Center for Computational Medicine & Biology

Chair, HUPO Plasma Proteome Project

University of Michigan, Ann Arbor, MI, USA

Page 2: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

2

It Is Such A Great Pleasure to Visit The Bibliotheca Alexandrina

One of the Wonders of the Modern World!

“The First Digital Library, from its Birth”

Facilitating International Collaboration in

Science and Technology

Page 3: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

3

Nearly-Complete Human Genome Sequence, 15-16 Feb 2001

Page 4: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

4

We Live in a New World of Life Sciences

New Biology---New Technology: a “parts list”

Genome Expression Microarrays

Comparative Genomics + CNV + miRNA

Proteomics and Metabolomics

Bioinformatics & Computational Biology

• Mechanism- & Evidence-Based Medicine: “What were you doing up to now?!”

• Predictive, personalized, preventive, participatory healthcare and community health services

Page 5: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

5

Key Components of the Vision of Biology As An Information Science

• An avalanche of genomic information: validated SNPs, haplotype blocks, candidate genes/alleles, proteins, & metabolites--associated with disease risk

• Powerful computational methods

• Effective linkages with better environmental and behavioral datasets for eco-genetic analyses

• Credible privacy and confidentiality protections

• Breakthrough tests, vaccines, drugs, behaviors, and regulatory actions to reduce health risks and cost-effectively treat patients globally.

Page 6: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

6

A Golden Age for the Public Health Sciences

Sequencing and analyzing the human genome is generating genetic information that must be linked with information about:

• Nutrition and metabolism

• Lifestyle behaviors

• Diseases and medications

• Microbial, chemical, physical exposures

Every discipline of public health sciences needed.

Page 7: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

7

Definitions

Genetics is the scientific study of genes and their roles in health and disease, physiology, and evolution.

Genomics is a modern subset of the broader field of genetics, made feasible by remarkable advances in molecular biology, biotechnology, and computational sciences, to examine the entire complement of genes and their actions.

Global analyses permit us and require us to go beyond the known “lamp-posts” of individual gene associations and effects.

Page 8: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

8

Proteins are the action molecules of the cell and the leading candidates for biomarkers—in tissues and in the blood. Proteins are coded for by genes. Understanding one protein can be a lifetime’s work!

Proteomics is the global analysis of proteins in cells or body fluids. Techniques for global analysis of proteins are advancing rapidly, especially for discovery of biomarkers for diagnosis, treatment, and prevention.

Metabolomics is the global analysis of metabolites.

Proteomics + metabolomics + epigenomics = “functional genomics”

Page 9: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

9

[Figure labels: Protein, DNA]

Page 10: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

10

Rationale for Proteomics

Proteins are much closer to the pathophysiologic changes and molecular targets for drugs than are mRNAs.

Changes in mRNAs are clues, but changes in corresponding proteins often are not highly correlated.

Advances in fractionation of complex tissue and plasma protein mixtures, in mass spectrometry, and in curated databases of proteins help address complexity, dynamic range, and uncertainty of protein identifications.

Page 11: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

11

A Vision For Proteomics

Multiple protein biomarkers discovered

Biomarkers combined on diagnostic chips

Detect organ location of cancers, for surgery or radiation

Detect mechanism of disease for chemotherapy, even if location unknown

Mechanistic, rather than “geographic” classification

Better efficacy/less toxicity for all types of patients

Page 12: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

12

Status of Proteomics Assays

• Many technology platforms of increasing sensitivity and resolution

• Patterns or specific proteins still just biomarker candidates —most lack independent confirmation and coefficient of variation, let alone “validation” with standard clinical chemistry parameters of sensitivity, specificity, and especially positive predictive value

• Approaches of clinical chemistry needed to guide further development of the field
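The clinical chemistry parameters named above can be made concrete with a small calculation. The following is a minimal Python sketch, not taken from the talk: the cohort size, prevalence, and marker performance are hypothetical numbers chosen only to show why positive predictive value deserves the "especially".

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Return (sensitivity, specificity, positive predictive value)."""
    sensitivity = tp / (tp + fn)   # fraction of diseased who test positive
    specificity = tn / (tn + fp)   # fraction of healthy who test negative
    ppv = tp / (tp + fp)           # fraction of positives who are diseased
    return sensitivity, specificity, ppv


# Hypothetical screen: 10,000 people, 1% prevalence,
# marker with 90% sensitivity and 95% specificity.
diseased, healthy = 100, 9900
tp = 90                 # 90% of the 100 diseased test positive
fn = diseased - tp      # 10 missed cases
tn = 9405               # 95% of the 9,900 healthy test negative
fp = healthy - tn       # 495 false positives

sens, spec, ppv = diagnostic_metrics(tp, fp, fn, tn)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} PPV={ppv:.3f}")
```

With these assumed inputs, only about 15% of positive results come from people who actually have the disease, even though sensitivity and specificity both look strong.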

Page 13: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

13

Barriers for Proteomic Cancer Biomarker Discovery in Plasma

Human cancers are very heterogeneous

Tumor proteins are in low abundance for early detection of cancers

Tumor proteins are greatly diluted upon release to ECF and blood

Plasma is an extraordinarily complex specimen dominated by high abundance proteins (50% by weight is albumin)

Knowledge of the plasma proteome is still limited

Page 14: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

14

Outline of Lecture

1. Review of the vision, strategy, and output of the HUPO Human Plasma Proteome Project Pilot Phase

2. Objectives for the New Phase of the Plasma Proteome Project

3. Example of the power of computational tools and collaborations (if time)

Page 15: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

15

HUPO

The international Human Proteome Organization (HUPO) was founded in 2001. Its aims are:

1. To advance the science of proteomics

2. To enhance training in proteomics

3. To build international initiatives by organ (liver, brain, kidney), biofluid (plasma, urine, CSF, saliva), and disease (cardiovascular, cancers), plus antibodies and data standards.

Page 16: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

16

Proteomics Interaction Map

Ruth McNally, sociologist

Page 17: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

17

Samir Hanash, founding President of HUPO

Gil Omenn, leader of HUPO PPP

Page 18: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

18

THE PLASMA PROTEOME

Advantages: The most available human specimen; the most comprehensive sample of tissue-derived proteins; the basis for a Disease Biomarkers Initiative tied to organ proteomes.

Specific Disadvantages:

Extreme complexity/enormous dynamic range

High risk of ex vivo modifications

Lack of highly standardized protocols

General Challenges: Inadequate appreciation of incomplete sampling by MS/MS; evolving annotations and unstable databases

Page 19: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

19

Long-Term Scientific Goals of the HUPO Human Plasma Proteome Project

1. Comprehensive analysis of plasma and serum protein constituents in people

2. Identification of biological sources of variation within individuals over time, with validation of biomarkers

Physiological: age, sex/menstrual cycle, exercise

Pathological: selected diseases/special cohorts

Pharmacological: common medications

3. Determination of the extent of variation across populations and within populations

Page 20: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

20

HUPO HUMAN PLASMA PROTEOME PROJECT (PPP)

HUPO PPP Participating Labs

Technology Vendors

Development & Validation of Biomarkers

Liver and Brain Proteome, Antibody, Protein Stds Projects

Reference Specimens

Technology Platforms--Separation and Identification

Serum vs Plasma

Omenn GS. The Human Proteome Organization Plasma Proteome Project Pilot Phase: Reference Specimens, Technology Platform Comparisons, and Standardized Data Submissions and Analyses. Proteomics 2004;4:1235-1240.

Scheme Showing Aims and Linkages of the HUPO Plasma Proteome Project, Pilot Phase

Page 21: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

21

OUTPUT FROM PPP Pilot Phase

Special Issue Aug 2005, Proteomics, “Exploring the Human Plasma Proteome”: 28 papers (collaborative analyses and annotations, plus lab-specific analyses), and Wiley book (2006)

Publicly-accessible datasets:

www.ebi.ac.uk/pride [EBI]

www.peptideatlas.org/repository [ISB]

www.bioinformatics.med.umich.edu/hupo/ppp

Additional papers are encouraged:

Nature Biotechnology 2006; 24:333-338 (States et al)

Genome Biology 2006; 7:R35 (Fermin et al)

Proteomics 2006; 6:5662-5673 (Omenn)

Numerous citations/comparisons of datasets

Page 22: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

22

Page 23: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

23

SERUM AND PLASMA REFERENCE SPECIMENS

1. BD: specially prepared male/female pooled samples, divided into EDTA-, Heparin-, and Citrate-anti-coagulated Plasma and Serum (250 µl x4 of each). BD clot activator. No protease inhibitors. Three separate ethnic pools prepared. Shipped frozen.

2. Chinese Academy of Medical Sciences: Sets of three plasmas + serum, similar to BD protocol.

3. National Institute for Biological Standards & Control, UK: citrate-anti-coagulated, freeze-dried plasma, from 25 donors, prepared for Intl Soc Thrombosis & Hemostasis, 1 ml aliquots/ampoules.

Page 24: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

24

Specifications for Data Submission

Each of 55 labs agreed (July, 2003 Workshop) to provide, and 31 labs did provide:

a) a detailed experimental protocol, to “push the limits” to detect low-abundance proteins

b) peptide sequences, rated as “high” or “lower” confidence, based on MS/MS criteria

c) protein IDs from IPI 2.21 (July 2003) and search engine parameters used to align peptide sequences with proteins in human database

Later, we obtained m/z peak lists and raw spectra (by DVD) for independent analyses.

Page 25: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

25

From Peptides to Genome Annotation

[Workflow figure. Labels: Sample Proteins; digestion; Peptides; LC-MS/MS; Mass Spectrum (m/z axis, 200 to 1200); extraction; database search; statistical filtering; BLAST protein database; Map to genome; PeptideAtlas Database; Genome Browser; visualization; SBEAMS.]

Spectrum     Peptide   Probability
Spectrum 1   LGEYGH    1.0
…            …         …
Spectrum N   EIQKKF    0.3

Peptide       …   Chrom   Start_Coord   End_Coord   …
PAp00007336   …   X       132217318     132217368   …
…             …   …       …             …           …
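The "statistical filtering" step in this workflow can be illustrated with a short Python sketch. It assumes, as in the Spectrum / Peptide / Probability table on this slide, that each spectrum carries a probability that its peptide assignment is correct. The 0.9 threshold, the `PSM` type, and the two middle rows are hypothetical; the error estimate (the sum of 1 minus probability over accepted assignments) is a common way to use such probabilities, not necessarily the exact procedure used for PeptideAtlas.

```python
from typing import NamedTuple


class PSM(NamedTuple):
    spectrum: str
    peptide: str
    probability: float  # probability that the assignment is correct


def filter_psms(psms, min_probability=0.9):
    """Keep assignments at or above the threshold and estimate their error rate."""
    accepted = [p for p in psms if p.probability >= min_probability]
    expected_wrong = sum(1.0 - p.probability for p in accepted)
    error_rate = expected_wrong / len(accepted) if accepted else 0.0
    return accepted, error_rate


psms = [
    PSM("Spectrum 1", "LGEYGH", 1.0),    # row shown on the slide
    PSM("Spectrum 2", "AAAKLV", 0.95),   # hypothetical
    PSM("Spectrum 3", "TTGSEK", 0.90),   # hypothetical
    PSM("Spectrum N", "EIQKKF", 0.3),    # row shown on the slide
]

accepted, error_rate = filter_psms(psms)
print([p.peptide for p in accepted], round(error_rate, 3))
```

Raising `min_probability` trades fewer accepted peptides for a lower estimated error rate; the choice of threshold is a decision rule that should be reported with the data.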

Page 26: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

26

Numbers of Proteins Identified (LC-MS/MS or FTICR-MS, 18 labs)

From 15,519 reported distinct protein IDs in IPI 2.21, we chose one representative/cluster:

(a) 9504 = 1 or more peptide matches

(b) 3020 = 2+ peptide matches (Core Dataset)

(c) 1274 = 3 or more peptide matches

(d) 889 = follow-up high-stringency analysis with adjustments for protein length and multiple (43,000) comparisons in IPI v2.21

(Nature Biotech 2006; 24:333-338)
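The tiers (a) to (c) above differ only in how many distinct peptides must support a protein. A minimal Python sketch of that counting rule follows; the protein identifiers and peptide sequences are placeholders, and the clustering of IPI entries and the high-stringency adjustments in (d) are not modeled here.

```python
from collections import defaultdict


def proteins_by_peptide_support(peptide_hits, min_peptides):
    """Return proteins supported by at least `min_peptides` distinct peptides."""
    peptides_per_protein = defaultdict(set)
    for protein_id, peptide in peptide_hits:
        peptides_per_protein[protein_id].add(peptide)  # repeats count once
    return {
        protein
        for protein, peptides in peptides_per_protein.items()
        if len(peptides) >= min_peptides
    }


# Hypothetical (protein, peptide) identifications.
hits = [
    ("PROT_A", "LGEYGH"), ("PROT_A", "EIQKKF"), ("PROT_A", "LGEYGH"),
    ("PROT_B", "AVTEQR"),
    ("PROT_C", "SSLEK"), ("PROT_C", "TTPLR"), ("PROT_C", "GGNVK"),
]

for k in (1, 2, 3):
    print(k, sorted(proteins_by_peptide_support(hits, k)))
```

Each stricter threshold yields a subset of the previous one, which is why the counts on the slide shrink from 9504 to 3020 to 1274.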

Page 27: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

27

GREATEST RESOLUTION AND SENSITIVITY

The most extensive high-confidence yield was from combined methods of immunoaffinity (“top-6”) depletion, 2 or 3-D high-resolution fractionation, and then ESI-MS/MS with ion-trap LTQ instrument.

LTQ gave several fold more IDs (1168) than did LCQ (271) in same hands (B1-serum vs B1-heparin) and obtained multiple peptides for many proteins which had just one hit with LCQ.

Page 28: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

28

SPECIFIC OBSERVATIONS: DEPLETION

• Many investigators depleted albumin and/or immunoglobulins

• Several were provided Agilent immunoaffinity column to remove “top-6” proteins

• Much higher numbers of identifications after depletion if sufficient fractionation

• Inadvertent removal of other proteins; “sponge” effect of albumin

• Assay both flow-through & bound fractions

Page 29: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

29

SPECIMEN VARIABLES

What evidence have we developed for choice of specimens for analysis?

Plasma preferred over serum—more consistent, less degradation

EDTA-plasma preferred over heparin (interferences) and citrate (dilution)

Clot activator? necessary only for serum

Minimize freeze/thaw cycles (archives)

Minimal evidence of platelet activation [4C]

Protease inhibitors desirable, but alter proteins

Page 30: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

30

INFLUENCE OF ABUNDANCE

Using quantitative immunoassays and microarrays (generally unknown epitopes), we have found very high rates of detection of the more abundant proteins, less in the mid-range, and occasional detection of very low abundance proteins, as expected.

High correlation (r=0.9) between # peptides and measured concentrations
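For readers who want to reproduce this kind of comparison on their own data, here is a minimal Python sketch of a Pearson correlation between peptide counts and measured concentrations. The values are hypothetical, not PPP data, and the log10 transform of concentration is an assumption made because plasma concentrations span many orders of magnitude; the slide does not say which scale was used.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical proteins: number of peptides observed and immunoassay value.
peptide_counts = [2, 5, 9, 20, 41]
conc_pg_per_ml = [3e2, 5e3, 2e5, 1e7, 4e9]

r = pearson_r(peptide_counts, [math.log10(c) for c in conc_pg_per_ml])
print(f"r = {r:.2f}")
```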

Page 31: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

31

Least Abundant Proteins Identified with Two Distinct Peptides

(concentrations in pg/ml; range 200 pg/ml to 20 ng/ml)

Protein                                 pg/ml
Alpha fetoprotein                       2.9E+02
TNF-R-8                                 3.3E+02
TNF-ligand-6                            1.5E+03
PDGF-R alpha                            4.6E+03
Leukemia inhibitory factor receptor     5.0E+03
MMP-2/gelatinase                        8.8E+03
EGFR                                    1.1E+04
TIMP-1                                  1.4E+04
IGFBP-2                                 1.5E+04
Activated leukocyte adhesion mol        1.6E+04
Selectin L [five labs; 10 peptides]     1.7E+04

Page 32: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

32

BIOLOGICAL INSIGHTS

The proteins identified can be annotated by many methods. We have searched multiple databases, including Gene Ontology, Novartis Atlas, Online Mendelian Inheritance in Man (OMIM), incomplete or unidentified sequences in the human genome, microbial genomes, InterPro protein domains, transmembrane domains, secretion signals.

See Proteomics 2005; 5:3226-3519; Wiley, 2006

Page 33: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

33

GENE ONTOLOGY SPECIFIC TERMS

Over-represented in PPP 3020 (vs whole genome): “extracellular”, “immune response”, “blood coagulation”, “lipid transport”, “complement activation”, “regulation of blood pressure”, as expected; also: cytoskeletal proteins, receptors and transporters.

Proteins from most cellular locations and molecular processes are recognized.

Under-represented: “perception of smell” (1 vs 25 exp); cation transporters, ribosomal proteins, G-protein coupled receptors, and nucleic acid binding proteins.
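To give a sense of how strong the "1 vs 25 expected" result is, the following Python sketch computes the chance of seeing so few proteins under a simple Poisson model. The Poisson assumption is ours, for illustration; the published analysis may have used a different test (for example a hypergeometric test against the whole genome).

```python
import math


def poisson_cdf(k, expected):
    """P(X <= k) for a Poisson-distributed count with the given expectation."""
    return sum(
        math.exp(-expected) * expected**i / math.factorial(i)
        for i in range(k + 1)
    )


# "perception of smell": 1 protein observed versus 25 expected.
observed, expected = 1, 25
p_under = poisson_cdf(observed, expected)
print(f"P(X <= {observed} | expected = {expected}) = {p_under:.2e}")
```

Under this model the probability is far below any conventional significance threshold, consistent with olfactory receptors being essentially absent from plasma.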

Page 34: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

34

InterPro Protein Domain Analysis

Compared with the whole human genome, the 3020 PPP proteins are:

Over-represented for EGF, intermediate filament protein, sushi, thrombospondin, complement C1q, and cysteine protease inhibitor.

Under-represented: Zinc finger (C2H2, B-box, RING), tyrosine protein phosphatase, tyrosine and serine/threonine protein kinases, helix-turn-helix motif, and IQ calmodulin binding region domains.

Page 35: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

35

TRANSMEMBRANE AND SECRETED PROTEIN FEATURES

1297 of 3020:
                     SwissProt Annotated   ProFun   Both
Transmembrane        230                   151      104
Secretion signal     373                   420      358

1723 of 3020:
                     ProFun Predicted
TM domain(s)         137
Secretion signal     255

Page 36: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

36

Cardiovascular-Related Proteins Biomarker Candidates in the PPP Database

Proteins characterized in eight groups:

Inflammation

Vascular

Signaling

Growth and differentiation

Cytoskeletal

Transcription factors

Channels

Receptors

Page 37: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

37

Comparison of Five Search Algorithms

Using PPP data, Kapp et al (Proteomics 2005) found Sequest and Spectrum Mill more sensitive and MASCOT, Sonar, and X!Tandem more specific for peptide identifications at specified false-positive rates.

Some investigators have reported using combinations of two or more search engines. Decision rules are necessary.
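One possible decision rule for combining search engines is simple agreement: accept a peptide assignment only when at least two engines report the same peptide for the same spectrum. The Python sketch below illustrates that rule only; the engine outputs are hypothetical, and a real rule would also weigh each engine's score and error model.

```python
from collections import Counter


def consensus_peptides(engine_results, min_engines=2):
    """Accept (spectrum, peptide) pairs reported by at least `min_engines` engines."""
    votes = Counter()
    for assignments in engine_results.values():
        for spectrum, peptide in assignments.items():
            votes[(spectrum, peptide)] += 1
    return {pair for pair, n in votes.items() if n >= min_engines}


# Hypothetical output: engine name -> {spectrum id: best-scoring peptide}.
engine_results = {
    "Sequest":  {"s1": "LGEYGH", "s2": "EIQKKF", "s3": "AVTEQR"},
    "MASCOT":   {"s1": "LGEYGH", "s2": "EIQKRF"},
    "X!Tandem": {"s1": "LGEYGH", "s3": "AVTEQR"},
}

consensus = consensus_peptides(engine_results)
print(sorted(consensus))
```

Here spectrum `s2` is rejected because the two engines that scored it disagree on the sequence.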

Page 38: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

38

Can We Overcome the Idiosyncrasies of Individual Instruments and Laboratories?

Several informatics investigators approached the human PPP with an offer to re-analyze the complete MS/MS datasets using their own software and criteria from the raw spectra (or peaklists).

These analyses eliminated the heterogeneity of search algorithms, search parameters, and idiosyncrasies of individual labs.

The results are hard to compare, given different extent of analysis. However, each can be compared with the Core Dataset.

Page 39: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

39

Independent Analyses from Raw Spectra (#IDs with 2+ peptides)

Core Dataset (18 datasets, 3020)

• PepMiner (Beer, 8 large datasets, 2895) [1051 in 3020 dataset, + 700 in the 9504]

• X!Tandem (Beavis/States, 18 datasets, 2678) [577 in the 3020; 218 in the 889]

• PeptideProphet/ProteinProphet (Deutsch, 7+ datasets, 960)[479 in 3020]

• Mascot/Digger (Kapp, Australia, 14 datasets, 513 with 1.4% error rate; ongoing analysis)

Page 40: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

40

What is Required and Feasible to Enhance the Statistical Robustness of Findings?

Many complex proteomics analyses are done once, without replicates required to estimate coefficient of variation or other standard parameters for clinical chemistry use.

“Five to ten independent repetitions of the experiments are a must” [Hamacher et al, Proteomics in Drug Discovery, 2006].

How should we determine how similar or different are samples A and B, or the results of methods X and Y? What decision rules apply?

We have a long way to go from discovery research to clinical applications.
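The coefficient of variation mentioned above needs replicate measurements. A minimal Python sketch follows, using six hypothetical replicate values for one protein (in arbitrary units), in line with the "five to ten independent repetitions" quoted on this slide.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


# Hypothetical replicate measurements of the same protein in the same sample.
replicates = [10.2, 9.8, 10.5, 9.9, 10.1, 10.4]

cv = coefficient_of_variation(replicates)
print(f"CoV = {cv:.1f}%")
```

A single run gives no standard deviation at all, which is why one-off analyses cannot say whether samples A and B differ by more than the method's own noise.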

Page 41: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

41

Comparison of 5 Published Reports on Plasma Proteins with HUPO PPP Datasets

Report     #IDs     #IPI   in 3020   in 9504
Anderson   1175     990    316       471
Shen       [1682]   1842   213       526
Chan       1444     1019   257       402
Zhou       210      148    51        88
Rose       405      287    142       159
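The "in 3020" and "in 9504" columns are overlap counts between a report's protein list and the PPP datasets. Once all identifiers are mapped to the same database version, the comparison reduces to set intersection, as in this Python sketch. The accessions are made-up placeholders, and the mapping step itself (often the hard part) is assumed to be done.

```python
def overlap_summary(report_ids, reference_sets):
    """Count how many of a report's protein IDs appear in each reference set."""
    report = set(report_ids)
    return {name: len(report & set(ref)) for name, ref in reference_sets.items()}


# Hypothetical accessions standing in for IPI identifiers.
core_3020 = {"IPI0001", "IPI0002", "IPI0003"}
all_9504 = core_3020 | {"IPI0004", "IPI0005"}
report = ["IPI0002", "IPI0003", "IPI0005", "IPI0099"]

summary = overlap_summary(report, {"in 3020": core_3020, "in 9504": all_9504})
print(summary)
```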

Page 42: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

42

Comparison of New Biofluid Proteome Findings with HUPO PPP-3020 Proteins

Proteome   # Proteins   IPI 2.21   PPP-3020
Urine      1543         910        293
Tears      491          313        117
Semen      923          560        180

Refs from Matthias Mann Lab, Genome Biology, 2007, different IPI versions.

Comparison, Omenn, Proteomics-Clinical Applications (2007).

Page 43: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

43

NEXT PHASE OF PPP (PPP-2)

1. Standard operating procedures (SOPs), including EDTA-plasma as standard specimen; replication and confirmation of results

2. Quantitation and subproteomes, using new methods and advanced instruments

3. Databases and robust bioinformatics

4. Clinical chem/disease-related studies

Page 44: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

44

PPP-2 Research & Technology Thrusts

Learned a lot from Pilot Phase—plasma is a very complex specimen; no single platform sufficient; analyses currently far from comprehensive, let alone reproducible; now have improved data quality and informatics resources.

PPP-2: use multiple methods; focus on biomarker discovery; build upon already-funded laboratories and repositories.

Page 45: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

45

Specific Technology Recommendations

N-Glycosite (proteotypic) peptide resource is a special subproteome likely to have high biomarker relevance.

Capture glycoproteins, digest with trypsin and PNGase F to yield N-linked glycopeptides. Choose one unique to each protein; a finite number; not all proteins. Use complementary lectin approach to characterize glycans.

Prepare isotope-labeled N-glyco-peptides for multiple uses as standards and to spike specimens.

Page 46: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

46

N-Glycosites

Glycoproteins are enriched on cell surface, in secreted proteome and in plasma

Glycoproteins tend to be stable

Only few glycosites per protein: reduction in sample complexity (excludes albumin)

Inherent validation of N-glycosite by fragment ion spectrum

N-glycosite subproteome is probably the one easiest to completely map

Page 47: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

47

Glycopeptide Isolation

[Workflow figure. Labels: Capture; Non-glycoproteins; Wash; Trypsin digestion; Non-glyco-peptides; Wash; PNGase F digestion; N-linked glycopeptides; Asn, Asp.]

Zhang H., Li X.-J., Martin D.B. & Aebersold R. (2003) Nat Biotech 21: 660-666

Page 48: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

48

Flow chart of process

Capture / Digestion

Normal & Disease Tissue Samples

'Glycopeptide' Fract.

Data Analysis

Targeted LC/MS/MS

LC-MS

LC/MS Maps

Plasma Samples

'Glycopeptide' Fract.

Data Analysis

MRM LC/MS/MS

Target peptides

Page 49: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

49

Reducing Complexity: Glycoprotein-Enriched Subproteomes

Methods                   Lab 2                Lab 11
Enrichment                hydrazide chem       lectin chrom’y
Peptide Fxn               SCX + RP             RP
Mass Spec                 qtof                 deca-xp
Search engine             Seq/ProteinProphet   Sequest
Protein IDs in B1-serum   222                  83

[51 in common]

Of total 254, 164 found among data from 11 other labs without glycoprotein enrichment.

Page 50: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

50

Technology Recommendations (cont’d)

Orbitrap and other advanced instruments with high mass accuracy and increased throughput

Multiple Reaction Monitoring (Q-Trap, triple quad): LOD <50 amol, 5 logs range, probably ng/ml range for GP.

Extensive fractionation and newer labeling methods.

Recruit several major labs; be open to volunteers.

Determine interest in reference specimen.

Make peptide standards available through PPP-2: post lists and make labeled compounds.

Page 51: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

51

Multiple Reaction Monitoring (MRM)

High selectivity ~ two levels of mass selection (increased S/N)

High sensitivity because of high duty cycle (Q1 and Q3 are static)

Only known peptides (candidates) are detected

[Schematic labels: Source; MS-1 (fixed; set precursor m/z; Peptide (M)); CID; MS-2 (fixed; set fragment m/z; Fragment (m)); time axis.]
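The "two levels of mass selection" can be mimicked in a few lines of Python: a signal counts only if the precursor m/z passes the first fixed filter and the fragment m/z passes the second. This is a conceptual sketch, not instrument software; the transition values, the 0.5 m/z tolerance, and the ion events are all hypothetical.

```python
from typing import NamedTuple


class Transition(NamedTuple):
    precursor_mz: float  # value set on the first mass filter (MS-1)
    fragment_mz: float   # value set on the second mass filter (MS-2)


def passes(transition, precursor_mz, fragment_mz, tol=0.5):
    """True when an ion pair falls inside both fixed m/z windows."""
    return (
        abs(precursor_mz - transition.precursor_mz) <= tol
        and abs(fragment_mz - transition.fragment_mz) <= tol
    )


def mrm_signal(transition, events, tol=0.5):
    """Sum the intensity of (precursor m/z, fragment m/z, intensity) events that pass."""
    return sum(i for p, f, i in events if passes(transition, p, f, tol))


# Hypothetical transition for one candidate peptide, and hypothetical ion events.
target = Transition(precursor_mz=523.3, fragment_mz=710.4)
events = [
    (523.3, 710.4, 1200.0),  # target precursor, target fragment
    (523.4, 655.2, 800.0),   # right precursor, wrong fragment
    (601.8, 710.4, 950.0),   # wrong precursor, same fragment m/z
    (523.2, 710.5, 300.0),   # target again, within tolerance
]

print(mrm_signal(target, events))
```

Only the first and last events contribute, which is the selectivity gain the slide describes: interfering ions must match at both stages to be counted, and peptides without a configured transition are never seen.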

Page 52: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

52

Technology Recommendations (cont’d)

Compare pooled samples from disease and control; high throughput not essential for discovery phase

Continue to build the catalog

Do longitudinal repeat measures on individuals to establish CoV: must reliably tell whether two samples are the same or different, including PTMs

Pay attention to precursor ions

Known interested labs: Aebersold, Paik, Smith, Speicher, Hancock, Mann; probably Chinese, Michigan, FHCRC, Japanese/glycomics.

Page 53: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

53

Issues for PPP Bioinformatics

What are imperatives for project design?

How can many more spectra be interpreted?

How can more confident protein IDs be generated?

How do we add value and benefit from EBI/PRIDE and ISB/PeptideAtlas repositories?

What is required to make the datasets more useful for other investigators?

Can quantitation, including of PTMs, be achieved with statistical robustness?

Page 54: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

54

A Robust Bioinformatics Architecture

Individual labs

Level I repository

PRIDE

PeptideAtlas

Dissemination

Genome annotation

Page 55: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

55

Repositories and Resources for Proteomics Informatics

PRIDE at EBI, repository for protein identifications (Martens)

PeptideAtlas, repository for raw data processed through Trans-Proteomic Pipeline at ISB (Deutsch), plus SpectraST barcodes from NIST

Tranche Distributed File System/DFS (Andrews, UM) at ProteomeCommons.org, National Resource for Proteomics and Pathways

CPAS, developed as part of Mouse Models of Human Cancers Consortium, at Fred Hutchinson (McIntosh)

GPMdb, developed by Beavis (Canada)

Page 56: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

56

Tranche Distributed (P2P) File System

Open, simple, cross-platform protocols

– e-Commerce-grade encryption makes it appropriate for scientific research (peer-review and traceability)

– Can easily grow to accommodate very large amounts of data and users

• Commodity hardware @ $0.37 per GB storage

~16 TB over 12 servers (30 additional TB ordered) and funding for additional 20TB

Documentation, tools, code, credits: http://www.proteomecommons.org/dev/dfs

Data sets: GPM, PNNL, Aurum, QqTOF vs QSTAR, sPRG ABRF 2006, HUPO PPP

– Links with PeptideAtlas, OPD, HPRD, TheGPM

Page 57: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

57

Can We Identify More High Confidence Peptides from the MS/MS spectra?

The spectra, not protein lists, are the raw data. <20% of spectra are confidently assigned to peptide sequences; the rest are typically discarded.

More high quality spectra can be mined (Nesvizhskii et al, MCP 2006).

Higher mass accuracy greatly enhances results (with some complications---Eric Deutsch).

Error estimates and thresholds should be routine for peptide IDs and protein matches. The Trans-Proteomic Pipeline (TPP) from ISB has been designed for this purpose.

Page 58: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

58

Mining Un-assigned High Quality Spectra

(Nesvizhskii)

Typical search: SEQUEST, IPI database

semi-constrained (tryptic on one end)

Met + 16

+/- 3 Da, average mass

Average numbers (LCQ/LTQ data):

10-15% of all spectra assigned peptide with high confidence

20-25% of all high quality spectra are not assigned

Page 59: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

59

Why Are Spectra Not Assigned?

Possible causes of failure to assign peptide:

• Imperfect scoring scheme

• Constrained search (PTM, not tryptic etc.)

• Incorrect mass/ charge state

• Low spectrum quality / contaminant ion

• Correct sequence may not be in the database searched (e.g., SNP)

• Novel sequence (splice variants, fusion peptides?)

Use MS/MS data for genome annotation

Page 60: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

60

Finding and Mining High Quality Unassigned Spectra (Nesvizhskii)

Page 61: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

61

Further Analyses at the Peptide Level

The PPP, GPM, and PeptideAtlas databases are rich with peptide-level findings, which can be analyzed for many questions---e.g., which peptides are most likely to be detected from among the predicted tryptic peptides of various proteins, and why? Can peptides be used directly to identify sequences of splice isoforms and SNPs? Can PTMs be identified more readily? Answers: Yes to all three questions.

Proteotypic peptides will be a major feature of Next Phase PPP.

Page 62: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

62

What Kinds of Biological Insights Emerge from Annotation?

The aim of proteomics analyses is not just to create lists of peptides and proteins, but to advance our understanding of complex biological processes in health and disease. Going forward, quantitation of proteins and their PTMs will be increasingly important---and feasible.

Page 63: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

63

High Throughput Proteomics and Systems Biology

condition 1

condition 2

condition 3

Integration of genomic, transcriptomic, proteomic, metabolomic data

Understanding and modeling cell behavior

Systems Biology

Page 64: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

64

SUMMARY

Enthusiasm for continuing and expanding Plasma Proteome Project, confirmed at Seoul, Korea, World Congress of Proteomics Oct 2007

Commitment to combine PPP with concept of Disease Biomarker Initiative

Interest in linking with and absorbing datasets from other Biofluid Proteomes (saliva, urine, CSF, organ-related proximal fluids)

Page 65: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

65

Biology as an Information Science: NIH Roadmap National Centers for Biomedical Computing

Informatics for Integrating Biology and the Bedside (i2b2), Isaac Kohane, PI

Center for Computational Biology (CCB), Arthur Toga, PI

Multiscale Analysis of Genomic and Cellular Networks (MAGNet), Andrea Califano, PI

National Alliance for Medical Imaging Computing (NA-MIC), Ron Kikinis, PI

The National Center For Biomedical Ontology (NCBO), Mark Musen, PI

Physics-Based Simulation of Biological Structures (SIMBIOS), Russ Altman, PI

National Center for Integrative Biomedical Informatics (NCIBI), Brian Athey, PI

Page 66: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

66

A Bioinformatics Approach to Discover Candidate Oncogenes

Few causal cancer genes have been discovered using gene expression microarrays

Oncogenic events are often heterogeneous

– ERBB2/HER2 amplification in 20% of breast CA

– Activating Ras mutations in 25% of melanomas

– E2A-PBX1 translocation in 5-10% of leukemias

Chromosomal aberrations that result in marked over-expression of an oncogene should be detectable in transcriptome data

Protein products then may be identified in tumor, biofluids, and plasma

Page 67: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

67

Page 68: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

68

COPA of microarray data revealed ETV1 and ERG as outlier genes across multiple prostate cancer gene expression data sets [Tomlins et al., Science 2005, 310: 644 -648]
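The core of COPA (Cancer Outlier Profile Analysis) is a robust rescaling of each gene followed by ranking on upper percentiles, so that a gene strongly over-expressed in only a subset of tumors stands out. The Python sketch below follows that general recipe (median centering, scaling by the median absolute deviation, then the 75th, 90th, and 95th percentiles). The expression values and the interpolation rule in `percentile` are illustrative assumptions, not the published implementation.

```python
import statistics


def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in 0..100) of pre-sorted values."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def copa_scores(expression, percentiles=(75, 90, 95)):
    """Median-center and MAD-scale one gene's values, then report upper percentiles."""
    med = statistics.median(expression)
    mad = statistics.median([abs(x - med) for x in expression])
    if mad == 0:
        raise ValueError("MAD is zero; gene has no spread to scale by")
    transformed = sorted((x - med) / mad for x in expression)
    return {q: percentile(transformed, q) for q in percentiles}


# Hypothetical log-expression values across 12 tumors.
outlier_gene = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, 0.0, -0.3, 0.1, 6.5, 7.1, 6.8]
uniform_gene = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, 0.0, -0.3, 0.1, 0.2, -0.1, 0.0]

print(copa_scores(outlier_gene))
print(copa_scores(uniform_gene))
```

Because the median and MAD ignore the minority of extreme samples, the outlier gene keeps a very large 90th-percentile score, whereas a mean and standard deviation would be inflated by those same samples and mask the effect.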

Page 69: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

69

COPA Unveils Androgen-Responsive TF Fusion Genes

Page 70: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

70

The Molecular Concept Map Project [Chinnaiyan, Rhodes]

Page 71: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

71

Page 72: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

72

Our Genetic Future

“Mapping the human genetic terrain may rank with the great expeditions of Lewis and Clark, Sir Edmund Hillary, and the Apollo Program.” --Francis Collins, Director, National Human Genome Research Institute, 1999

Next:

Understand gene and protein expression

Elucidate genetic, environmental, and behavioral interactions in health and disease

Engage scientists globally

Page 73: INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

73

Acknowledgements

HUPO PPP: Ruedi Aebersold and Young-Ki Paik, co-chairs; Eric Deutsch, Lennart Martens, Alexey Nesvizhskii, David States, bioinformatics; lab leaders and sponsors (see Proteomics 2005)

UM Proteomics Alliance for Cancer Research: Phil Andrews, David States, Alexey Nesvizhskii, George Michailidis, Mike Pisano, Arul Chinnaiyan, Dan Rhodes, Scott Tomlins, Arun Sreekumar, Adai Vellaichamy, Brian Haab

UM National Center for Integrative Biomedical Informatics: Brian Athey, David States, HV Jagadish, Jignesh Patel, Peter Woolf, Biaoyang Lin