
Pathway Bioinformatics---Database, Software, and Discovery

Y. Tom Tang, Ph.D.

Bioinformatics R & D

Hyseq Pharmaceuticals, Inc.

Sunnyvale, CA, USA

Outline of the Talk

1. Introduction to Pathway Bioinformatics

2. Overview of Pathmetrics Technology and Products

3. Data Representation and SLIPR Format

4. Pathway Comparison and Pathway Database Searches

5. Pathway Predictions and Beyond

A Broad Definition of Bioinformatics

• Informatics

Its carrier is a set of digital codes and a language.

In its manifestation in the space-time continuum, it has utility (e.g. to decrease entropy of an open system).

• Bioinformatics

The essence of life is information (i.e. from digital code to emerging properties of biosystems.)

Bioinformatics is the study of information content of life

Pathways

• A pathway can be defined as a modular unit of interacting molecules that fulfills a cellular function.

• It is usually represented by a 2-D diagram with characteristic symbols linking the protein and non-protein entities.

A circle indicates a protein or a non-protein biomolecule. A symbol in between indicates the nature of the molecule-molecule interaction.

An Example of a Pathway---EPO (erythropoietin) pathways

Pathway Database --Increasing Level of Complexity

• The genome

– 4 bases

– 3 billion bp total

– 3 billion bp/cell, identical

• The proteome

– 20 amino acids

– ~60K genes, ~200K proteins

– ~10K proteins/cell; different cells/conditions, different expressions

• The pathome

– ~200K reactions

– ~20K pathways

– ~1K pathways/cell; different cells/conditions, different expressions

Evolutionary Theory of Pathways --A New Field of Theoretical Studies

• The most important assumption for sequence informatics is evolution

• The evolution principle also applies to pathway informatics

– From simple to complex

– Duplication, diversification, and modular re-use

• Will provide a new view of fundamental questions, working toward a unified informatics theory of life

– What is life?

– How does new function arise?

– How does evolution work? (pathway is the bridge between digital signal and emerging properties)

– When does life begin (what is the initial set of pathways)?

Data Representation in KEGG

• Entity: a molecule or a gene

• Binary relation: a relation between two entities

• Network: a graph formed from a set of related entities

• Pathway: metabolic pathway or regulatory pathway

Drosophila melanogaster Genes

According to the KEGG metabolic and regulatory pathways

Pathway Search by [ EC | Cpd | Gene | Seq ]

[ 1st Level | 2nd Level | 3rd Level | Text Search ]

1. Carbohydrate Metabolism

2. Energy Metabolism

2.1 Oxidative phosphorylation [PATH:dme00190]

2.2 ATP Synthesis [PATH:dme00193]

2.4 Carbon fixation [PATH:dme00710]

2.5 Reductive carboxylate cycle (CO2 fixation) [PATH:dme00720]

2.6 Methane metabolism [PATH:dme00680]

2.7 Nitrogen metabolism [PATH:dme00910]

2.8 Sulfur metabolism [PATH:dme00920]

3. Lipid Metabolism

4. Nucleotide Metabolism

5. Amino Acid Metabolism

6. Metabolism of Other Amino Acids

7. Metabolism of Complex Carbohydrates

8. Metabolism of Complex Lipids

9. Metabolism of Cofactors and Vitamins


Introduction to GenMAPP

• Gene MicroArray Pathway Profiler by Bruce Conklin at Gladstone Institute, UCSF.

• GenMAPP is a free computer application designed to visualize gene expression data on maps representing biological pathways and groupings of genes.

• The main features underlying GenMAPP version 1.0 are:

– Draw pathways with easy to use graphics tools

– Multiple species gene databases

– Color genes on MAPP files based on user-imported gene expression data

Two Main Challenges in Post-genomic Age

• Data integration: integrate diverse biological information

– Scientific literature, existing body of knowledge about cellular systems

– Genomic sequences

– Protein sequences, motifs, and structures

– Expression data from microarray, dbEST, and RT-PCR

– Protein-protein interaction data from large-scale screening

• Functional discovery: assign functions to the 60K+ human genes

– Only 5% of known genes have an assigned function

– The function of the majority of discovered genes is unknown

– Without understanding function, no drug discovery can be done, whether for small molecules or for biopharmaceuticals

– Will be the focus of the next 20 years of life-science research

Pathmetrics provides solutions for

• Data integration

– Establish a standard for pathway curation and pathway database design

– Develop pathway databases using existing knowledge in the scientific literature

– Utilize dbEST, microarray, and other types of expression data

– Utilize genomic data such as promoter-region similarities

• Functional studies

– Assign proteins with unknown function to functional pathways

– Determine in which cells those pathways operate, and at what level

– Be much more efficient than large-scale random screening

– Discover the majority of pathways and protein functions

– Deliver many tissue-specific pathways for the pharmaceutical industry

Basic Concepts

• Node

– Protein, peptide, or non-protein biomolecules.

• Mode

– The nature of interaction between two nodes. Qualitative data.

• Pathway

– A linked list of interconnected nodes and modes. Represented in either 2-D or 1-D format.

• Pathway Network

– A network of cellular function and regulation involving interconnected pathways.

Curating Pathway Databases

• SLIPR standard for pathway curation

• Relational database design including diverse information about genes, proteins, expression, and tissues

• Input in graphical format, and graphical output display

SLIPR standard for pathway curation

• SLIPR stands for Semi-LInear Pathway Representation. Like FASTA, it is pronounced as a word: SlipR or Slipir.

• For linear comparison (homology) and display of the alignments, 2-D diagrams of pathways are converted to a 1-D format.

• We call the 2-D diagrams graph pathways, and the corresponding 1-D representation semi-linear pathways.

• One graph pathway may be transformed into multiple semi-linear pathways, but we prefer a one-to-one mapping between the 2-D graph and the SLIPR form. The generation of 2-D graph pathways and the corresponding 1-D SLIPR form from scientific literature is called pathway curation.

• Pathways are curated by trained scientists with expertise in the relevant pathways. In addition to generating the 2-D and 1-D formats, they also have to generate a pathway description file for each pathway they curate (pathway annotation), and a protein file that contains all the proteins in the pathway.

Mode Symbol Specifications

A mode is usually specified by two non-alphanumeric ASCII symbols.

• -> Direct interaction with direction. Used when there is a known direct interaction between two nodes (reverse orientation: <-).

• -| Direct inhibition with direction. Used when one node directly inhibits the next (reverse orientation: |-).

• -- Association or indirect action. Used for uncertain interaction, indirect interaction, or simple co-expression.

• == Parallel members. The members can all serve the same function. They are usually variants of the same gene, or members of the same family.

• <> Clear interaction, but no direction of information flow (note: no space and no letters inside). This can happen when more than two proteins form a large complex.

• ** Bifurcating members. Usually appears only at the beginning or end of a pathway. It can occur in the middle of a pathway only when the pathway bifurcates and immediately folds back, e.g. A->B->**C->**E->F.

• If a pathway starts to bifurcate in the middle or at the end, one can use **[path_name] to record this event, e.g. A->B->(xx)->C->D->**[New_path_1]->E->**[New_path_2].

• ( ) Symbol for non-protein nodes. All small molecules are enclosed in a set of parentheses. If the small molecule is uncertain, its name can be omitted. If it is known, its name is inserted in between, e.g. ->(Ca) or (cAMP), as in A1->(Ca)->A2->(Cytidine_Diphosphate_Choline).

• [ ] Symbol for another pathway. When linking to another pathway, its path_id is put inside the brackets, e.g. A1->[Ca_triggered_path1], A1->[Gs_pathway].

• An ID given without ( ) or [ ] is a protein node.

SLIPR Format for Pathway Entries

• The format is based on a common sequence format, FASTA. Nodes are linked by modes with no space between them. Bifurcating branches are specified later within the same entry with PATHsub_ID and content. E.g.:

>PW_ID PW_name PW_annotation Source Curator Date [Species]

Pr1->Pr2--(Ca)--Pr3==Pr4->**Pr5->**[PATHsub_XX]->Pr5->(Mg)<>ZZpr

[PATHsub_XX] AA1->AA2--(SM1)->AA3<>AA4<-AA5

– PW_ID: ID for the pathway

– PW_name: a name

– PW_annotation: a brief description of the pathway

– Source: where this pathway is taken from: article, KEGG, GenMAPP, etc.

– Curator: the person who inputs the pathway

– Date: date of curation
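As an illustration of how a SLIPR body line breaks down into nodes and modes, here is a minimal Python sketch. It is not Pathmetrics code: the function name `tokenize_slipr`, the token labels, and the set of characters allowed in a protein ID are assumptions made for the example. The mode symbols are the ones listed under Mode Symbol Specifications.

```python
import re

# One alternative per element type. Mode symbols come from the slide;
# the character class for protein IDs is an assumption.
TOKEN = re.compile(r"""
    (?P<mode>->|<-|-\||\|-|--|==|<>)   # two-character mode symbols
  | (?P<branch>\*\*)                   # bifurcation marker
  | (?P<small>\([^()]*\))              # non-protein node, e.g. (Ca) or ()
  | (?P<pathref>\[[^\[\]]*\])          # link to another pathway
  | (?P<protein>[A-Za-z0-9_.:]+)       # bare ID means a protein node
""", re.VERBOSE)


def tokenize_slipr(body):
    """Split one SLIPR pathway line into (kind, text) tokens."""
    tokens = []
    pos = 0
    while pos < len(body):
        match = TOKEN.match(body, pos)
        if match is None:
            raise ValueError(f"unexpected character {body[pos]!r} at {pos}")
        kind = match.lastgroup
        text = match.group()
        if kind in ("small", "pathref"):
            text = text[1:-1]  # drop the surrounding ( ) or [ ]
        tokens.append((kind, text))
        pos = match.end()
    return tokens


example = "Pr1->Pr2--(Ca)--Pr3==Pr4->**Pr5->**[PATHsub_XX]->Pr5->(Mg)<>ZZpr"
for kind, text in tokenize_slipr(example):
    print(f"{kind:8s} {text}")
```

Keeping the bifurcation marker as its own token type, rather than folding it into the preceding mode, leaves the decision of how to attach branches to a later parsing step.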

Pathway Database in Simplest Format

• A SLIPR format pathway file

• A FASTA format protein sequence file

• A FASTA format non-protein molecule file

• Flat file tools to do basic database manipulations:

– Index: generate an index file

– Retrieval: O(log N) access to any component

– Insertion: cat to the end, then re-index

– Deletion: delete, then re-index

– Updating: delete, cat to the end, then re-index
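The index and retrieval steps can be sketched as follows. This is an assumed design rather than the actual Pathmetrics tools: `build_index` and `fetch` are hypothetical names, the index is a sorted in-memory list of (entry ID, byte offset) pairs, and the binary search over that list is what gives the log N lookup.

```python
import bisect
import io


def build_index(handle):
    """Return a sorted list of (entry_id, byte_offset) for each '>' header."""
    index = []
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            entry_id = line[1:].split()[0].decode()
            index.append((entry_id, offset))
    index.sort()
    return index


def fetch(handle, index, entry_id):
    """Binary-search the index, then read one entry from the flat file."""
    i = bisect.bisect_left(index, (entry_id, -1))
    if i == len(index) or index[i][0] != entry_id:
        raise KeyError(entry_id)
    handle.seek(index[i][1])
    lines = [handle.readline()]          # the '>' header line
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):        # next entry begins
            break
        lines.append(line)
    return b"".join(lines).decode()


# Usage with an in-memory stand-in for a .pw file (contents are made up).
db = io.BytesIO(b">PW2 second\nA->B\n>PW1 first\nC--D\nE->F\n")
idx = build_index(db)
print(fetch(db, idx, "PW1"))
```

Insertion then amounts to appending an entry to the file and rebuilding (or extending) the index, as the slide describes.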

Relational Database Implementation--an example with only protein nodes

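Gene_Table: gene_id, chromosome, start, stop

Protein_Table: seq_id, cellular location, seq_txt, gene_id <fk>

Interaction_Table: protein A, protein B, info flow direction, pathway_id <fk>, literature_id

Pathway_Table: pathway_id, pathway_name, description, species, curator, entry_data

Protein_Motifs: motif_id, seq_id <fk>

Motif_Def_Table: motif_id, description, regular expression, HMM_matrix

Literature_Table: literature_id, author, journal, pub_date, PDF_file

Links between tables: gene_id (Gene_Table to Protein_Table); protein = seq_id (Interaction_Table to Protein_Table); pathway_id (Interaction_Table to Pathway_Table); literature_id (Interaction_Table to Literature_Table); seq_id (Protein_Table to Protein_Motifs); motif_id (Protein_Motifs to Motif_Def_Table)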
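To make the design concrete, here is a hedged sketch of part of this schema in SQLite, driven from Python's standard `sqlite3` module. The table and column names follow the diagram, but the column types, the underscore spellings (`protein_a`, `info_flow_direction`), the sample rows, and the helper `proteins_in_pathway` are assumptions. The motif tables are left out for brevity.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Gene_Table (
    gene_id TEXT PRIMARY KEY, chromosome TEXT, "start" INTEGER, "stop" INTEGER);
CREATE TABLE Protein_Table (
    seq_id TEXT PRIMARY KEY, cellular_location TEXT, seq_txt TEXT,
    gene_id TEXT REFERENCES Gene_Table(gene_id));
CREATE TABLE Pathway_Table (
    pathway_id TEXT PRIMARY KEY, pathway_name TEXT, description TEXT,
    species TEXT, curator TEXT, entry_data TEXT);
CREATE TABLE Literature_Table (
    literature_id TEXT PRIMARY KEY, author TEXT, journal TEXT,
    pub_date TEXT, PDF_file TEXT);
CREATE TABLE Interaction_Table (
    protein_a TEXT REFERENCES Protein_Table(seq_id),
    protein_b TEXT REFERENCES Protein_Table(seq_id),
    info_flow_direction TEXT,
    pathway_id TEXT REFERENCES Pathway_Table(pathway_id),
    literature_id TEXT REFERENCES Literature_Table(literature_id));
"""


def proteins_in_pathway(conn, pathway_id):
    """List every protein that takes part in an interaction of the pathway."""
    rows = conn.execute(
        "SELECT protein_a FROM Interaction_Table WHERE pathway_id = ? "
        "UNION SELECT protein_b FROM Interaction_Table WHERE pathway_id = ?",
        (pathway_id, pathway_id),
    )
    return sorted(row[0] for row in rows)


# Made-up rows, only to exercise the schema.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO Protein_Table (seq_id) VALUES (?)",
                 [("P1",), ("P2",), ("P3",)])
conn.execute("INSERT INTO Pathway_Table (pathway_id, pathway_name) "
             "VALUES ('PW1', 'demo')")
conn.executemany(
    "INSERT INTO Interaction_Table "
    "(protein_a, protein_b, info_flow_direction, pathway_id) VALUES (?, ?, ?, ?)",
    [("P1", "P2", "->", "PW1"), ("P2", "P3", "-|", "PW1")],
)
print(proteins_in_pathway(conn, "PW1"))
```

Storing each interaction as one row keeps the pathway graph queryable with ordinary joins, at the cost of having to reassemble the node order when exporting back to SLIPR.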

Pathway Search Engines

• Compare two pathways in the SLIPR standard using a dynamic programming algorithm

• Search a query pathway against a pathway database: advance BLAST-type searches to the pathway level

• Find orthologous, paralogous, and homologous pathways with alignments

• Like BLAST, there are different types of searches:

– Node only search

– Mode only search

– Node and mode search

• In node only searches, one can perform:

– Protein-node only

– Non-protein node only

– Protein-node and non-protein node

Alignment Scoring Matrices

• Comparing protein nodes

– identity mapping and orthologs (current status)

– percent_identity

– percent_positive (PAM/BLOSUM)

– structural similarity

• Comparing non-protein nodes

– identity mapping

– structural similarity

– evolutionary linkage and functional similarity

• Comparing modes

– identity mapping

– SCIM matrix (similarity coefficient of interacting modes): a matrix of positive and negative values between –1 and 1

Protein Comparison vs. Pathway Comparison

Type | # of Node | Node-comp | Mode

Protein | 20 | BLOSUM/PAM matrices | Peptide-bond

Pathway | 200K | Pct_identity, Pct_positive, Structural simil. | Identity_mapping, SCIM matrix, Peptide-bond (fused proteins)

Pathway Level Search Engine

• Query: A pathway (associated query.pw, query.aa file)

• DB: Pathways (associated DB.pw, DB.aa file)

• Search Types:

– Node only

• protein node only

• non-protein node only

• Any node

– Mode only

– Node and mode

PMsearch Documentation

• PMsearch is a pathway comparison program.

• After a user specifies a query pathway and a search database, PMsearch compares the query pathway with each entry in the pathway database.

• The query pathway is specified by two input files:

– A query.pw pathway file and a query.aa protein file

– The query.pw contains the pathway information, in FASTA format.

– The query.aa contains the involved proteins, in FASTA format.

• The pathway database is also composed of two files, db.pw and db.aa; unlike the query files, the database files contain more than one entry.

• Once a job is submitted, the search engine (pm_search) performs the job and reports all homologous pathways that score above a user-specified threshold.

• The user can also specify other parameters, which are given in the user manual.

Specifics for pathway alignment

• It is a higher level alignment, containing protein or structural alignment within.

• Each element in the pathway can represent a node (protein or non-protein), or a mode.

• The distance between nodes and modes, and between protein nodes and non-protein nodes, is infinite: elements of different types cannot be aligned.

In the simplest case, consider a pathway with only protein nodes. Given an alignment z, the score is given by

$$S_z(a, b) = \sum_{t=1}^{k} s\left(a_{i(t)}, b_{j(t)}\right) - \Delta\, n_{gap} - \delta\, l_{gap}$$

where s(x,y) is the similarity of protein x and protein y, n_gap is the number of gaps in z, l_gap is the total length of the gaps, Δ is a parameter called the “gap opening” penalty, and δ is a second parameter called the “gap extension” penalty. The sum runs over the k aligned node pairs of z.

PMsearch uses a dynamic programming algorithm to find the alignment with the highest score.

How Alignments Are Determined And Scored

For the alignment to get to (m,n), it must go through one of:

• (m-1, n-1) (a_m and b_n are a match),

• (m-1, n) (meaning (m,n) is in a gap in sequence 2),

• (m, n-1) (meaning (m,n) is in a gap in sequence 1).

Recursion:

For i = 1 to m
  For j = 1 to n
    H(i,j) = max { H(i-1,j-1) + s(i,j), Hh(i,j), Hv(i,j) }, where
    Hh(i,j) = max { Hh(i,j-1) - δ, H(i,j-1) - δ - Δ }
    Hv(i,j) = max { Hv(i-1,j) - δ, H(i-1,j) - δ - Δ }
  End
End
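The recursion above can be sketched in Python as follows. This is a minimal sketch and not PMsearch itself: the function name `align_score`, the global-alignment boundary conditions (the slide gives only the inner recursion), and the identity similarity used in the usage lines are all assumptions. Returning negative infinity from `s` for elements of different types would express the rule that nodes and modes cannot be aligned with each other.

```python
NEG_INF = float("-inf")


def align_score(a, b, s, gap_open, gap_extend):
    """Best global alignment score of sequences a and b with affine gaps.

    s(x, y) is the similarity of two elements; gap_open is the slide's
    capital Delta and gap_extend its small delta.
    """
    m, n = len(a), len(b)
    H = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Hh = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # gap runs along j
    Hv = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # gap runs along i

    # Assumed boundary: leading gaps are charged like any other gap.
    H[0][0] = 0.0
    for j in range(1, n + 1):
        Hh[0][j] = -gap_open - j * gap_extend
        H[0][j] = Hh[0][j]
    for i in range(1, m + 1):
        Hv[i][0] = -gap_open - i * gap_extend
        H[i][0] = Hv[i][0]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            Hh[i][j] = max(Hh[i][j - 1] - gap_extend,
                           H[i][j - 1] - gap_extend - gap_open)
            Hv[i][j] = max(Hv[i - 1][j] - gap_extend,
                           H[i - 1][j] - gap_extend - gap_open)
            H[i][j] = max(H[i - 1][j - 1] + s(a[i - 1], b[j - 1]),
                          Hh[i][j], Hv[i][j])
    return H[m][n]


def identity(x, y):
    """Identity mapping: +1 for the same node, -1 otherwise (assumed values)."""
    return 1.0 if x == y else -1.0


# Three matches minus one gap of length 1: 3 - (1.0 + 0.5).
print(align_score(["A", "B", "C", "D"], ["A", "B", "D"], identity, 1.0, 0.5))
```

Passing the similarity as a function keeps the recursion unchanged when the node comparison moves from identity mapping to percent identity or a SCIM-based mode score.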

Novel Pathway Prediction Engines

Predicting orthologous pathways across different organisms

• A known query pathway from some organism as query

• A protein database or genomic database for the organism of interest to search against

• Output is the ortholog pathway in the organism of interest

Predicting homologous pathways for an organism of interest

• A known query pathway from some organism as query

• A protein database or genomic database for the organism

• A protein-protein correlation matrix for protein expression

• Output is a collection of homologous pathways

Open Questions for Pathway Comparison

• Like extending points in Rn to functional space, we need to generalize theory for protein alignment to a higher level, where the component itself may have alignment.

• How to calculate p-value in this pathway space?

• How to design intelligent scores?

– How to generate a meaningful comparison matrix for non-protein nodes that goes beyond identity mapping?

• How to integrate multiple component types into the alignment theory?

HOMOLOGS, ORTHOLOGS, AND PARALOGS

Homologs: proteins with good alignment and similar function

Orthologs: proteins performing the same function in different species

Paralogs: homologous proteins in the same species

How to tell the unique ortholog

The ortholog should have a much higher similarity to the query protein than any other protein in its species, and usually higher than most of the paralogs.
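One way to turn this rule into code is to accept the best hit in the target species only if it clearly beats the runner-up. The sketch below is illustrative: the function name `pick_ortholog`, the `min_margin` threshold, and the sample similarity values are assumptions, not part of PMortholog.

```python
def pick_ortholog(similarities, min_margin=0.1):
    """Pick the unique ortholog candidate, or None if no hit stands out.

    similarities maps protein IDs in the target species to their
    similarity to the query protein (for example, fraction identity).
    """
    if not similarities:
        return None
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    best_id, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    # Require a clear gap between the best hit and the next one.
    return best_id if best - runner_up >= min_margin else None


print(pick_ortholog({"m1": 0.85, "m2": 0.40, "m3": 0.35}))  # clear winner
print(pick_ortholog({"m1": 0.50, "m2": 0.48}))              # ambiguous
```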

PMortholog Documentation

• PMortholog is a simple ortholog prediction program for pathways.

• Inputs:

(1) a pathway (query.pw and query.aa files)

(2) a protein database, e.g., SwissProt

• Reports all apparent orthologous pathways

• Most accurate for closely related organisms (e.g. human<->mouse)

• False matches can appear when organisms are too distant, or possibly, because of other paralogous pathways in the organism.

PMortholog sample output: hits

PM_ORTHOLOG 0.1, Pathmetrics, Inc. [Oct-20-2001] [Build linux-x86]
Reference: US Patent Pending. "Methods for Establishing Pathway Database and Perform Pathway Searches". Y. Yang, C. Piercy. February 20, 2001. Application number 60/269,711

Query pathway= hsa00625 (5 proteins)
Database: /u1/pub_db/sp_db/allspecies.aa 374855 proteins.

Summary of ortholog pathways:
Hit_nu species ......... score
---------------------------------------------------------------
1: Homo sapiens ......... 100.00
2: Mus musculus ......... 65.20
3: Rattus norvegicus ......... 65.20
4: Caenorhabditis elegans ......... 44.20
5: Drosophila melanogaster ......... 37.80
6: Arabidopsis thaliana ......... 37.00
7: ......... 31.80
8: Saccharomyces cerevisiae ......... 26.60
9: Sinorhizobium meliloti ......... 25.80
10: Mesorhizobium loti ......... 24.80
11: Agrobacterium tumefaciens ......... 24.80
12: Escherichia coli ......... 22.60
13: Pseudomonas aeruginosa ......... 22.40
14: Schizosaccharomyces pombe ......... 18.80
15: Bacillus subtilis ......... 15.00
16: Oryza sativa ......... 11.0

PMortholog sample output: alignments

>Hit 1: Ortholog pathway for: Homo sapiens. With score: 100.00
Query: hsa:51144 hsa:2052 hsa:2053 hsa:51004 hsa:9420
%_id: |1.00| |1.00| |1.00| |1.00| |1.00|
Sbjct: gi15082281 gi13097729 gi181395 gi4680659 gi13094303

>Hit 2: Ortholog pathway for: Mus musculus. With score: 65.20
Query: hsa:51144 hsa:2052 hsa:2053 hsa:51004 hsa:9420
%_id: |0.85| |0.88| |0.81| |0| |0.72|
Sbjct: gi3142702 gi12857870 gi12832382 ------ gi12850151

>Hit 3: Ortholog pathway for: Rattus norvegicus. With score: 65.20
Query: hsa:51144 hsa:2052 hsa:2053 hsa:51004 hsa:9420
%_id: |0.81| |0.88| |0.84| |0| |0.73|
Sbjct: gi4098957 gi207689 gi55930 ------ gi1226240

>Hit 4: Ortholog pathway for: Caenorhabditis elegans. With score: 44.20
Query: hsa:51144 hsa:2052 hsa:2053 hsa:51004 hsa:9420
%_id: |0.48| |0.56| |0.42| |0.44| |0.31|
Sbjct: gi726418 gi1465805 gi3876864 gi2088820 gi13775482
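The four hits shown are consistent with the pathway score being 100 times the mean per-node identity, with an unmatched node counted as 0. This is an inference from the sample output, not a documented PMortholog formula, and the function name `pathway_score` is made up for the check below.

```python
def pathway_score(pct_ids):
    """100 x mean of the per-node identities (unmatched nodes count as 0)."""
    return round(100.0 * sum(pct_ids) / len(pct_ids), 2)


# Per-node %_id rows copied from the sample alignments.
hits = {
    "Homo sapiens": [1.00, 1.00, 1.00, 1.00, 1.00],
    "Mus musculus": [0.85, 0.88, 0.81, 0.0, 0.72],
    "Rattus norvegicus": [0.81, 0.88, 0.84, 0.0, 0.73],
    "Caenorhabditis elegans": [0.48, 0.56, 0.42, 0.44, 0.31],
}
for species, ids in hits.items():
    print(f"{species}: {pathway_score(ids):.2f}")
```

For mouse, (0.85 + 0.88 + 0.81 + 0 + 0.72) / 5 = 0.652, which matches the reported 65.20. The missing ortholog for hsa:51004 is what pulls the mouse and rat scores down.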

Homolog Pathway Prediction Engines

• They are the crown jewels of Pathmetrics software tools

• Can predict many novel interactions

• Use diverse input data, including sequence data, expression data, and known interaction data

• Employ complex numerical algorithms such as dynamic programming and clustering

Example of Novel Pathway Prediction---predicting novel pathways homologous to the query pathway

Query pathway: Node1, Node2, Node3, Node4 (Mode1=1, Mode2=1, Mode3=1)

Node1 Hits: candidate1_1, candidate1_2, ..., candidate1_l

Node2 Hits: candidate2_1, candidate2_2, ..., candidate2_m

Node3 Hits: candidate3_1, candidate3_2, ..., candidate3_n

Node4 Hits: candidate4_1, candidate4_2, ..., candidate4_o
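A naive way to combine the per-node hit lists is to enumerate one candidate per node and rank the combinations. The sketch below does exactly that and is only an illustration: `predict_pathways`, the additive scoring, the association dictionary, and the toy values are assumptions. The actual engines are described as using dynamic programming and clustering, which this exhaustive enumeration does not attempt to reproduce.

```python
from itertools import product


def predict_pathways(candidates, assoc, top_k=3):
    """Rank candidate pathways built from one hit per query node.

    candidates: one list per query node of (protein_id, similarity) hits.
    assoc: maps frozenset({id1, id2}) to an association score in [-1, 1],
           e.g. from expression or promoter similarity (assumed input).
    """
    scored = []
    for combo in product(*candidates):
        ids = [pid for pid, _ in combo]
        if len(set(ids)) < len(ids):
            continue  # assumption: one protein cannot fill two nodes
        node_score = sum(sim for _, sim in combo)
        # Reward neighbouring candidates that are associated with each other.
        link_score = sum(assoc.get(frozenset(pair), 0.0)
                         for pair in zip(ids, ids[1:]))
        scored.append((node_score + link_score, ids))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Toy input: two query nodes with two hits each.
cands = [[("a1", 0.9), ("a2", 0.8)], [("b1", 0.7), ("b2", 0.6)]]
links = {frozenset({"a2", "b2"}): 0.9, frozenset({"a1", "b1"}): -0.5}
for score, ids in predict_pathways(cands, links):
    print(round(score, 2), ids)
```

In the toy input the pair with the best individual hits (a1, b1) loses to (a2, b2) because of the association term, which is the kind of re-ranking that expression or promoter data is meant to provide.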

Pathway Searches and Pathway Predictions

Query | Database | Mode | Output

Pathway | Pathway_db | SCIM* | Homologous pathways

Pathway | Protein_db | None | Orthologous pathways

Pathway | Protein_db | Promoter_simil. matrix | Predicted homologous pathways

Pathway | Protein_db | Expression_assoc. matrix | Predicted homologous pathways

* SCIM: Similarity coefficient of interacting modes

Gene Discovery vs. Pathway Discovery

[Figure: two parallel discovery pipelines]

1. Gene discovery (DNA sequences to Novel Genes): PE Biosys. sequencer; EST reads; EST_db; Assembly, seq. alignment (FASTA, Blast); Gene Editing; Life Seq FL (Incyte)

2. Pathway discovery (Protein-protein interactions to Novel Pathways): Microarray or dbEST; Expression data; Expression_db; Clustering, pathway prediction engines; Path Editing; MamPath (Pathmetrics)

Confirming Predicted Pathways

• We can confirm at expression level predicted pathways using RT-PCR

• It will extend content of and add tremendous value to our pathway databases

• It will strengthen our IP positions on many novel predicted pathways

• We can provide this service to customers for specific tissue types

• Protein-level confirmation of important pathways can also be carried out using standard protein-protein interaction assays.

• This pinpointed approach toward pathway discovery saves tremendously on cost compared to some of the competitors’ technology

Real-Time PCR: Accurate Measurement of Gene Expression

• Real-time PCR (RT-PCR) gives quantitative measurement of mRNA level inside cells

• High accuracy. Delivers much more reliable data than microarrays

• Can be very tissue-specific: can be performed at single-cell level

• Parallel operations allow ~1000 measurements per day per technician

• Quick turnaround time to meet any customer’s needs

Open Question for Pathway Prediction and Confirmation

• Theoretical questions about predictions:

– How can one assign p-values and scores to predictions based on protein-protein alignments and protein-protein co-expression data?

• Handling PCR confirmation data:

– Data set (an example):

Proteins | P1 | P2 | P3

Tissue_1 | 55 | 18 | 35

Tissue_2 | 505 | 220 | 300

Tissue_3 | 250 | 107 | 130

– How to assign a p-value to validate the prediction?
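A first step toward answering this is to measure how closely the expression levels of the predicted pathway members track each other across tissues. The sketch below computes pairwise Pearson correlations for the example data set. Using Pearson correlation is an assumption made for illustration and is not stated as the Pathmetrics method; the helper name `pearson` is likewise made up.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Expression per protein across Tissue_1..Tissue_3, from the example table.
expression = {
    "P1": [55, 505, 250],
    "P2": [18, 220, 107],
    "P3": [35, 300, 130],
}
for p, q in combinations(expression, 2):
    print(p, q, round(pearson(expression[p], expression[q]), 4))
```

All three pairs correlate strongly in this example. With only three tissues, however, a correlation has a single degree of freedom, so even a coefficient this high gives weak statistical support, and more tissues or conditions would be needed before a p-value becomes meaningful.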

Trends in Bioinformatics

Seq comparison: Today

Functional comparison: The Future

Pathway discovery: Bridge to the future
