pathway and network analysis for mrna and protein profiling …€¦ · pathway and network...

Pathway and network analysis for mRNA and protein profiling data

VU workshop, 2016

Bing Zhang, Ph.D.Professor of Molecular and Human Genetics

Lester & Sue Smith Breast CenterBaylor College of Medicine

[email protected]

Proteomeprofiling

Transcriptomeprofiling

DNA

RNA

Protein

PhenotypeNetworks

Transcription

Translation

mRNA decay

Protein degradation

Transcriptome

Proteome

Gene expression

VU workshop, 2016

Overall workflow of gene expression studies

Microarray

Biological question

Experimental design

Image analysis

Data Analysis

HypothesisExperimental validation

Reads mapping

RNA-Seq

Peptide/protein ID

Shotgun proteomics

Spectral counts;Intensities

Read countsSignal intensities

VU workshop, 2016

probe_set_id HNE0_1 HNE0_2 HNE0_3 HNE60_1 HNE60_2 HNE60_31007_s_at 8.6888 8.5025 8.5471 8.5412 8.5624 8.30731053_at 9.1558 9.1835 9.4294 9.2111 9.1204 9.2494117_at 7.0700 7.0034 6.9047 9.0414 8.6382 9.2663121_at 9.7174 9.7440 9.6120 9.7581 9.7422 9.73451255_g_at 4.2801 4.4669 4.2360 4.3700 4.4573 4.29791294_at 6.3556 6.2381 6.2053 6.4290 6.5074 6.27711316_at 6.5759 6.5330 6.4709 6.6636 6.6438 6.46881320_at 6.5497 6.5388 6.5410 6.6605 6.5987 6.72361405_i_at 4.3260 4.4640 4.1438 4.3462 4.3876 4.68491431_at 5.2191 5.2070 5.2657 5.2823 5.2522 5.18081438_at 7.0155 6.9359 6.9241 7.0248 7.0142 7.09711487_at 8.6361 8.4879 8.4498 8.4470 8.5311 8.42251494_f_at 7.3296 7.3901 7.0886 7.2648 7.6058 7.29491552256_a_at 10.6245 10.5235 10.6522 10.4205 10.2344 10.31441552257_a_at 10.3224 10.1749 10.1992 10.2464 10.2191 10.2405

Data matrixG

enes

Samples

Spectral counts;Intensities

Read countsSignal intensities

VU workshop, 2016

Differential expression and clustering

Differential expression

Clustering

log2(ratio)

-log1

0(p

valu

e)

Microarray

RNA-Seq

Proteomics

VU workshop, 2016

Differential expression: t-test

graph courtesy of www.socialresearchmethods.net

VU workshop, 2016

Data are not normally distributed

n Log transformation

n Mann-Whitney U test (Wilcoxon rank-sum test)q Based on ranks of observationsq Does not rely on data belonging to any particular distribution

VU workshop, 2016

Histogram of a

a

Freque

ncy

50 150 250

050

100150

Histogram of log(a)

log(a)

Freque

ncy

3.5 4.0 4.5 5.0 5.5

050

100150

200

Sample size is small

n Moderated t-test (limma)q Designed for experiments with small sample size

q The standard errors are moderated across genes, effectively borrowing information from the ensemble of genes to aid with inference about each individual gene.

VU workshop, 2016

Count data

n Negative binomial distributionq Cuffdiff

q DESeq

q edgeR

n Log-transformed countsq Voom/limma

VU workshop, 2016

Correction for multiple testing

n Why

q In an experiment with a 10,000-gene array in which the significance level p is set at 0.05, 10,000 x 0.05 = 500 genes would be inferred as significant even though none is differentially expressed

n How

q Control the family-wise error rate (FWER), the probability that there is a single type I error in the entire set (family) of hypotheses tested. e.g., Standard Bonferroni Correction

q Control the false discovery rate (FDR), the expected proportion of false positives among the number of rejected hypotheses. e.g., Benjamini and Hochberg correction.

VU workshop, 2016

Clustering (unsupervised analysis)

n Clustering algorithms are methods to divide a set of n objects(genes or samples) into g groups so that within group similarities arelarger than between group similarities

n Unsupervised techniques that do not require sample annotation inthe process

n Identify candidate subgroups in complex data. e.g. identification ofnovel sub-types in cancer, identification of co-expressed genes

Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 ……TNNC1 14.82 14.46 14.76 11.22 11.55 ……DKK4 10.71 10.37 11.23 19.74 19.73 ……ZNF185 15.20 14.96 15.07 12.57 12.37 ……CHST3 13.40 13.18 13.15 11.18 10.99 ……FABP3 15.87 15.80 15.85 13.16 12.99 ……MGST1 12.76 12.80 12.67 14.92 15.02 ……DEFA5 10.63 10.47 10.54 15.52 15.52 ……VIL1 11.47 11.69 11.87 13.94 14.01 ……AKAP12 18.26 18.10 18.50 15.60 15.69 ……HS3ST1 10.61 10.67 10.50 12.44 12.23 ………… …… …… …… …… …… ……

Gen

es

Samples

VU workshop, 2016

Hierarchical clustering

n Agglomerative hierarchical clusteringq Start out with all sample units in n

clusters of size 1.

q At each step of the algorithm, the pair of clusters with the shortest distance are combined into a single cluster.

q The algorithm stops when all sample units are combined into a single cluster of size n.

VU workshop, 2016

Visualization of hierarchical clustering results

n Dendrogramq Output of a hierarchical

clustering

q Tree structure with the genesor samples as the leaves

q The height of the join indicates the distance between the branches

n Heat mapq Graphical representation of

data where the values are represented as colors.

VU workshop, 2016

Non-5ht 5htrostral rostralcaudal caudalBrain regionNeuron type

Omics studies generate lists of interesting genes

92546_r_at

92545_f_at

96055_at

102105_f_at

102700_at

161361_s_at

92202_g_at

103548_at

100947_at

101869_s_at

102727_at

160708_at

…...

Differential expression

Clustering Lists of genes with potential biological interest

log2(ratio)

-log1

0(p

valu

e)

Microarray

RNA-Seq

Proteomics

VU workshop, 2016

Organizing genes based on pathways

n Gene 1q Pathway 1

q Pathway 2

n Gene 2q Pathway 1

q Pathway 3

n Gene 3q Pathway 2

q Pathway 3

n Gene 4q Pathway 3

n Pathway 1q Gene 1

q Gene 2

n Pathway 2q Gene 1

q Gene 3

n Pathway 3q Gene 2

q Gene 3

q Gene 4

VU workshop, 2016

Advantages of pathway analysis

n Better interpretationq From interesting genes to interesting biological themes

n Improved robustnessq Robust against noise in the data

n Improved sensitivityq Detecting minor but concordant changes in a pathway

VU workshop, 2016

n Databasesq BioCarta (http://www.biocarta.com/genes/index.asp)

q KEGG (http://www.genome.jp/kegg/pathway.html)

q MetaCyc (http://metacyc.org)

q Pathway commons (http://www.pathwaycommons.org)

q Reactome (http://www.reactome.org)

q STKE (http://stke.sciencemag.org/cm)

q Signaling Gateway (http://www.signaling-gateway.org)

q Wikipathways (http://www.wikipathways.org)

n Limitation

q Limited coverage

q Inconsistency among different databases

q Relationship between pathways is not defined

Pathway databases

VU workshop, 2016

n Structured, precisely defined, controlled vocabulary for describingthe roles of genes and gene products

n Three organizing principles: molecular function, biological process,and cellular componentq Dopamine receptor D2, the product of human gene DRD2

q molecular function: dopamine receptor activity

q biological process: synaptic transmission

q cellular component: plasma membrane

n Terms in GO are linked by several types of relationshipsq Is_a (e.g. plasma membrane is_a membrane)q Part_of (e.g. membrane is part_of cell)q Has partq Regulatesq Occurs in

Gene Ontology

VU workshop, 2016

Gene Ontology

VU workshop, 2016

n Two types of GO annotationsq Electronic annotation

q Manual annotation

n All annotations must:q be attributed to a source

q indicate what evidence was found to support the GO term-gene/protein association

n Types of evidence codes

q Experimental codes - IDA, IMP, IGI, IPI, IEP

q Computational codes - ISS, IEA, RCA, IGC

q Author statement - TAS, NAS

q Other codes - IC, ND

Annotating genes using GO terms

VU workshop, 2016

n Downloads (http://www.geneontology.org)q Ontologies

n http://www.geneontology.org/page/download-ontology

q Annotationsn http://www.geneontology.org/page/download-annotations

n Web-based access q AmiGO: http://www.godatabase.org

q QuickGO: http://www.ebi.ac.uk/QuickGO

Access GO

VU workshop, 2016

Homo sapiens Mus musculus#term #gene #term #gene

GO/BP 6502 15228 6227 15709GO/MF 3144 16389 2961 17287GO/CC 947 16765 882 16801

Coverage of GO annotations

VU workshop, 2016

Over-representation analysis: concept

Development (1842 genes) n Is the observed overlap significantly larger than the expected value?

581

581

compareObserve

Expect

98

65

1842

1842Differentially expressed genes (581 genes)

Hoxa5Hoxa11Ltbp3Sox4Foxc1Edn1Ror2GnagSmad3Wdr5Trp63Sox9Pax1AcdRai1Pitx1……

Sash1Cd24aAgtPsrc1Ctla2bAngptl4Depdc7Sorbs1Macrod1Enpp2Tmem176a……

VU workshop, 2016

Over-representation analysis: method

Significant genes Non-significant genes Total

genes in the group k j-k j

Other genes n-k m-n-j+k m-jTotal n m-n m

Hypergeometric test: given a total of m genes where j genes are in the functional group, if we pick n genes randomly, what is the probability of having k or more genes from the group?

Zhang et.al. Nucleic Acids Res. 33:W741, 2005

€

p =

m − jn − i

#

$ %

&

' ( ji#

$ % &

' (

mn#

$ % &

' ( i= k

min(n, j )

∑ n

Observed k

j

m

VU workshop, 2016

92546_r_at92545_f_at96055_at102105_f_at102700_at……

Over-representation analysis: WebGestalt

WebGestalt

8 organisms Human, Mouse, Rat, Dog, Fruitfly, Worm, Zebrafish, Yeast

Microarray Probe IDs •  Affymetrix •  Agilent •  Codelink •  Illumina

Gene IDs •  Gene Symbol •  GenBank •  Ensembl Gene •  RefSeq Gene •  UniGene •  Entrez Gene •  SGD •  MGI •  Flybase ID •  Wormbase ID •  ZFIN

Protein IDs •  UniProt •  IPI •  RefSeq Peptide •  Ensembl Peptide

196 ID types with mapping to Entrez Gene ID

59,278 functional categories with genes identified by Entrez Gene IDs

Gene Ontology •  Biological Process •  Molecular Function •  Cellular Component

Pathway •  KEGG •  Pathway Commons •  WikiPathways

Network module •  Transcription factor targets •  microRNA targets •  Protein interaction modules

Disease and Drug •  Disease association genes •  Drug association genes

Chromosomal location •  Cytogenetic bands

Genetic Variation IDs •  dbSNP

http://www.webgestalt.org

Jan. 1, 2015 – Dec. 31, 201563,932 visits from 27,409 visitors>300 citations

~200 ID types

~60K Functional categories

Analysis/visualization

Zhang et.al. Nucleic Acids Res. 33:W741, 2005Wang et al. Nucleic Acids Res. 41:W77, 2013

Gene list

Pathways/functional categories

Daily Unique Visitors (Jan 2015 – Dec 2015)

VU workshop, 2016

n Arbitrary thresholding

n Ignoring the order of genes in the significant gene list

Over-representation analysis: limitations

VU workshop, 2016

Gene Set Enrichment Analysis: concept

n Do genes in a gene set tend to locateat the top or bottom of the rankedgene list?

VU workshop, 2016

Gene Set Enrichment Analysis: method

Subramanian et.al. PNAS 102:15545, 2005

http://www.broad.mit.edu/gsea/

+1/k

-1/(n-k)

k: Number of genes in the gene set Sn: Number of all genes in the ranked gene list

VU workshop, 2016

n Organizing genes byq Pathways

q Gene Ontology

q Other functional categories

n Enrichment analysis methodsq Over-representation analysis

q Gene Set enrichment analysis

n Major limitationq Existing knowledge on pathways or gene functions is far from complete

q Connections between genes are not considered

Pathway-based analysis

VU workshop, 2016

Biological networks

Networks Nodes Edges

Physical interaction networks

Protein-protein interaction network

Proteins Physical interaction, undirected

Signaling network Proteins Modification,directed

Gene regulatory network

TFs/miRNAsTarget genes

Physical interaction,directed

Metabolic network Metabolites Metabolic reaction,directed

Functional association networks

Co-expression network

Genes/proteins

Co-expression,undirected

Genetic network Genes Genetic interaction,undirected

VU workshop, 2016

n Database of Interacting Proteins (DIP)http://dip.doe-mbi.ucla.edu/

n The Molecular INTeraction database (MINT)http://mint.bio.uniroma2.it/mint/

n The Biomolecular Object Network Databank (BOND)http://bond.unleashedinformatics.com/

n The General Repository for Interaction Datasets (BioGRID)http://www.thebiogrid.org/

n Human Protein Reference Database (HPRD)http://www.hprd.org

n Online Predicted Human Interaction Database (OPHID)http://ophid.utoronto.ca

n iRefhttp://wodaklab.org/iRefWeb

n The International Molecular Exchange Consortium (IMEX)http://www.imexconsortium.org

Protein-protein interaction databases

VU workshop, 2016

Properties of complex networks

Scale-free(hubs)

Hierarchical modularSmall world(six degree separation)

Human protein-protein interaction network9,198 proteins and 36,707 interactions

VU workshop, 2016

Network visualization

VU workshop, 2016

Network visualization

Cytoscape (http://www.cytoscape.org)

Gehlenborg et al. Nature Methods, 7:S56, 2010

Network visualization tools

Red Blue

Black

Node label color: protein type HNE targets Transcription factors Other proteins

Node size: prediction cutoff level

Node color: protein function

Loose cutoff level

Stringent cutoff level

Cell cycle RNA splicing

Both Neither

VU workshop, 2016

Scalability challenges

DNACNA

mutationmethylation

mRNAexpression

splicing

Proteinexpressionmodification

Network sizeData complexity

VU workshop, 2016

Barabasi and Oltvai, Nat Rev Genet, 2004

Making sense of the hairballs

Nodes ordered in one dimension on the basis of network structure

Data visualized as tracks

NetSAM is available in BioConductorShi et al. Nature Methods, 2013

VU workshop, 2016

NetGestalt: exploring multidimensional omics data over biological networks

n Genes ordered in the horizontal dimension

n Data visualized as tracks along the verticaldimension

q Composite continuous track (e.g. geneexpression matrices)

q Single continuous track (e.g. foldchanges, p values)

q Single binary track (e.g. significantgenes, GO annotations)

n Interactive data visualization and analysis

q Search, zoom, pan, filter

q Track comparison

q Statistical analysis

q Network analysis and visualization

q Pathway enrichment analysis

q Upload user networks and tracks

Barcode plot

Bar chart

Heat map

http://www.netgestalt.orgShi et al. Nature Methods, 2013

2D layout

1D layoutNetSAM

VU workshop, 2016

n Proteins that lie closer to one another in a protein interaction network are more likely to have similar function and involve in similar biological process.

n Network-based gene function prediction

n Network-based gene prioritization

Network distance vs functional similarity

Sharan et al. Mol Syst Biol, 3:88, 2007

VU workshop, 2016

Network-based analysis

n Direct neighbor-based analysis

n Network module-based analysis

n Network diffusion analysis

VU workshop, 2016

Direct neighbor-based analysis

n Network expansion: to expand a list of “seed” genes to include other relatedgenes in the network

q All neighbors: to identify all direct neighbors of these seed genes inthe network

q Enriched neighbors: to identify all non-seed genes significantlyenriched with seed neighbors

VU workshop, 2016

TP53

PRKAA2

COPS5

PRMT1

MECR

PCBD1

MED23

MED16

MED17

MED10

TRIM24NCOA2

NCOA6

TGS1

AR

HDAC3

SP1

NCOR2

SIRT1

HDAC4

SREBF2

UBE2I

MED7

MED21MED14

MED24

MED1

NR2C2

SREBF1

PROX1

NCOA1

NR1I3

PPARGC1A

NR0B2

NR2F1

VDR

CREBBP

FOXO1

SUB1

CTNNB1

HNF1AMAPK14

MAPK1

MAPK3

SNCG

SMAD4

SMAD2

STK16

EXT2HNF4A

RNASEH2B

NPPAC16orf80

PNRC1

ZNHIT3

CITED2

PPARGC1B

RAD50

XPO1

PABPC4

NRBF2

SFRP1

WNT4

WNT2

SFRP2

WNT5A

TGFB3

TGFB1

BMP2

BMP7

ACVR1B

HGFCOL1A1 BAMBITRIM28

FOXO3

ZNF703 WNT16 FOXC2 FOXS1

PIAS4

LEF1

TCF7L2

EP300

MSX1

TBX5

HNRNPAB DAB2IP

NKX2-1

ADAM15

CTNNB1

GREM1

WWTR1

PPP3R1

AKT1AXIN2

GSK3B

DVL1

SMAD7

SMAD4SMAD3

NF1

KDM6B

SMAD2NOTCH1

FOXF2

NOG

EFNA1

PARD3BBCL9L

EPB41L5

HIF1ASOX9

SNAI1

PITX2FOXC1

PPP2CAS100A4

TGFBR3

TGFBR2

TGFBR1

TGFBRAP1TGFB1I1

TGFB2

FMOD

VASN

ENG

All neighbors Enriched neighbors

Direct neighbor-based analysis

n Gene prioritization: to prioritize genes in a list of candidate genes

VU workshop, 2016

PCDHB5

DKK2CCBP2

CDH18

CDH9

HCN1

ACVR2AACVR1B

CHRDL1

TLL1

KLK2

OPCMLLPPR4

ROBO1

LRRN3

NCAM2

GGT1

NXF5

PCBP1

MYO1B

ATM

TP53

FBXW7

PRIM2

FLGELF3

UBE2NL

EIF4A2

DNAH5

GUCY1A3

FAT4

CSMD1ZC3H13 ZNF208

ARID1A

ARID2

ING1

TCF7L2

SOX9

CRTC1DACH2

ISL1

CASP14

SDK1

PTPRM

CDH2CDH8

CDH11

CTNNB1

KIAA1804 SLITRK1

TNFRSF10C

RYR2

ADAM29

BRAF

NRAS

KRAS

PIK3CA

PTEN

MAP2K4

FAM5C

APC

FAM123B

ANK2

EVC2

SMAD4

SMAD2

TTN

RIMS1

TRPS1TRPC4

LIFR

PCDHA10

AFF2

ERBB4

DOCK2

LRP1BCNTN1 LOXL2

EPHA3

NRXN1 GRIK3 GRIA1KCNA4 BAI3

DMD

DPP10

NALCNCHRM2 EDNRB

CHST9

IFT80

n Identify network modules from a networkq e.g., NetSAM

n Use network modules as gene sets for enrichment analysisq Over-representation analysis; GSEA

Network module-based analysis

VU workshop, 2016

TIMM50

KSR1

FKBP10

PAQR3

DNAJC27

PEX5L

PEBP4

KSR2

MAP2K1MAP2K2

OIP5

TERF2IP

RAF1

CPS1

NUDT14BRAF

ARAF

LRPAP1

RALGDS

RRAS

RGL2

SHOC2

RAP1GDS1

HRAS

VARS2

ITIH1

ARHGEF1

DGKZ

ADCY6

CHRNA7

KRAS

NRAS

RAP1B

EIF2AK1

PIK3CA

FTL

IGDCC3

DNAJB6 ALKBH1

RASGRP1

PPM1BKIAA1161

MOXD1VRK2

MRPL36KIFC3

PLEKHA7

CBLL1

ITGAE

PTPRM

CDH1

LIMA1

AJAP1

MYO7A

BOC

CDON

CA9

CDH24

CTNND1

CDH3

CTNNA1

CDH2

CDH4

PTPN14

CDH16

KANK1

CDH8

CDH17

CTNNB1

PKP4

DSC3

JUPPKD1

DSG3

DSC2DSG2

PKP3

DSC1PKP2

DSP

PKP1

DSG1

VRK3

KRT1KRT5

KRT14

TCHP

KRT6A

KRT18KRT8

KRT17

KRT72

LMNA

ALOX12B

ALOX12TRIB1

URB2NARF TOR1AIP1

MLIP

HSBP1

KRT16

CALB2

USP53CPNE7

CLN5

CLEC3B

FGF13 AKAP12

NEURL2 BCL9L

CDH11

SOX17 CTNNA3

ERMN

LDHB

SOX1

KIAA1109

FSCN2VEZT

FAM84B

Network diffusion analysis

n Simulating a random walker’s behavior on a network (with restart)

VU workshop, 2016

n Simulating a random walker’s behavior on a network (with restart)q Gene prioritization

q Network expansion

Network diffusion analysis

VU workshop, 2016

http://www.gene2net.org

n Biological networksq Physical interaction networks

q Functional association networks

n Properties of complex networksq Scale-free

q Small world

q Hierarchical modular

n Network visualizationq Node-link diagrams

q NetGestalt

n Network-based analysisq Direct neighbor-based analysis

q Network module-based analysis

q Network diffusion analysis

Summary

VU workshop, 2016

Hands-on session

n WebGestaltq Pathway analysis for gene lists

q http://www.webgestalt.org

n NetGestaltq Pathway and network analysis for gene lists and data matrices

q http://www.netgestalt.org

n Gene2Net

q network diffusion analysis for gene lists

q http://www.gene2net.org

VU workshop, 2016

pathway and network analysis for mrna and protein profiling …€¦ · pathway and network...

Documents