pathway and network analysis for mrna and protein profiling …€¦ · pathway and network...
TRANSCRIPT
Pathway and network analysis for mRNA and protein profiling data
VU workshop, 2016
Bing Zhang, Ph.D.Professor of Molecular and Human Genetics
Lester & Sue Smith Breast CenterBaylor College of Medicine
Proteomeprofiling
Transcriptomeprofiling
DNA
RNA
Protein
PhenotypeNetworks
Transcription
Translation
mRNA decay
Protein degradation
Transcriptome
Proteome
Gene expression
VU workshop, 2016
Overall workflow of gene expression studies
Microarray
Biological question
Experimental design
Image analysis
Data Analysis
HypothesisExperimental validation
Reads mapping
RNA-Seq
Peptide/protein ID
Shotgun proteomics
Spectral counts;Intensities
Read countsSignal intensities
VU workshop, 2016
probe_set_id HNE0_1 HNE0_2 HNE0_3 HNE60_1 HNE60_2 HNE60_31007_s_at 8.6888 8.5025 8.5471 8.5412 8.5624 8.30731053_at 9.1558 9.1835 9.4294 9.2111 9.1204 9.2494117_at 7.0700 7.0034 6.9047 9.0414 8.6382 9.2663121_at 9.7174 9.7440 9.6120 9.7581 9.7422 9.73451255_g_at 4.2801 4.4669 4.2360 4.3700 4.4573 4.29791294_at 6.3556 6.2381 6.2053 6.4290 6.5074 6.27711316_at 6.5759 6.5330 6.4709 6.6636 6.6438 6.46881320_at 6.5497 6.5388 6.5410 6.6605 6.5987 6.72361405_i_at 4.3260 4.4640 4.1438 4.3462 4.3876 4.68491431_at 5.2191 5.2070 5.2657 5.2823 5.2522 5.18081438_at 7.0155 6.9359 6.9241 7.0248 7.0142 7.09711487_at 8.6361 8.4879 8.4498 8.4470 8.5311 8.42251494_f_at 7.3296 7.3901 7.0886 7.2648 7.6058 7.29491552256_a_at 10.6245 10.5235 10.6522 10.4205 10.2344 10.31441552257_a_at 10.3224 10.1749 10.1992 10.2464 10.2191 10.2405
Data matrixG
enes
Samples
Spectral counts;Intensities
Read countsSignal intensities
VU workshop, 2016
Differential expression and clustering
Differential expression
Clustering
log2(ratio)
-log1
0(p
valu
e)
Microarray
RNA-Seq
Proteomics
VU workshop, 2016
Differential expression: t-test
graph courtesy of www.socialresearchmethods.net
VU workshop, 2016
Data are not normally distributed
n Log transformation
n Mann-Whitney U test (Wilcoxon rank-sum test)q Based on ranks of observationsq Does not rely on data belonging to any particular distribution
VU workshop, 2016
Histogram of a
a
Freque
ncy
50 150 250
050
100150
Histogram of log(a)
log(a)
Freque
ncy
3.5 4.0 4.5 5.0 5.5
050
100150
200
Sample size is small
n Moderated t-test (limma)q Designed for experiments with small sample size
q The standard errors are moderated across genes, effectively borrowing information from the ensemble of genes to aid with inference about each individual gene.
VU workshop, 2016
Count data
n Negative binomial distributionq Cuffdiff
q DESeq
q edgeR
n Log-transformed countsq Voom/limma
VU workshop, 2016
Correction for multiple testing
n Why
q In an experiment with a 10,000-gene array in which the significance level p is set at 0.05, 10,000 x 0.05 = 500 genes would be inferred as significant even though none is differentially expressed
n How
q Control the family-wise error rate (FWER), the probability that there is a single type I error in the entire set (family) of hypotheses tested. e.g., Standard Bonferroni Correction
q Control the false discovery rate (FDR), the expected proportion of false positives among the number of rejected hypotheses. e.g., Benjamini and Hochberg correction.
VU workshop, 2016
Clustering (unsupervised analysis)
n Clustering algorithms are methods to divide a set of n objects(genes or samples) into g groups so that within group similarities arelarger than between group similarities
n Unsupervised techniques that do not require sample annotation inthe process
n Identify candidate subgroups in complex data. e.g. identification ofnovel sub-types in cancer, identification of co-expressed genes
Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 ……TNNC1 14.82 14.46 14.76 11.22 11.55 ……DKK4 10.71 10.37 11.23 19.74 19.73 ……ZNF185 15.20 14.96 15.07 12.57 12.37 ……CHST3 13.40 13.18 13.15 11.18 10.99 ……FABP3 15.87 15.80 15.85 13.16 12.99 ……MGST1 12.76 12.80 12.67 14.92 15.02 ……DEFA5 10.63 10.47 10.54 15.52 15.52 ……VIL1 11.47 11.69 11.87 13.94 14.01 ……AKAP12 18.26 18.10 18.50 15.60 15.69 ……HS3ST1 10.61 10.67 10.50 12.44 12.23 ………… …… …… …… …… …… ……
Gen
es
Samples
VU workshop, 2016
Hierarchical clustering
n Agglomerative hierarchical clusteringq Start out with all sample units in n
clusters of size 1.
q At each step of the algorithm, the pair of clusters with the shortest distance are combined into a single cluster.
q The algorithm stops when all sample units are combined into a single cluster of size n.
VU workshop, 2016
Visualization of hierarchical clustering results
n Dendrogramq Output of a hierarchical
clustering
q Tree structure with the genesor samples as the leaves
q The height of the join indicates the distance between the branches
n Heat mapq Graphical representation of
data where the values are represented as colors.
VU workshop, 2016
Non-5ht 5htrostral rostralcaudal caudalBrain regionNeuron type
Omics studies generate lists of interesting genes
92546_r_at
92545_f_at
96055_at
102105_f_at
102700_at
161361_s_at
92202_g_at
103548_at
100947_at
101869_s_at
102727_at
160708_at
…...
Differential expression
Clustering Lists of genes with potential biological interest
log2(ratio)
-log1
0(p
valu
e)
Microarray
RNA-Seq
Proteomics
VU workshop, 2016
Organizing genes based on pathways
n Gene 1q Pathway 1
q Pathway 2
n Gene 2q Pathway 1
q Pathway 3
n Gene 3q Pathway 2
q Pathway 3
n Gene 4q Pathway 3
n Pathway 1q Gene 1
q Gene 2
n Pathway 2q Gene 1
q Gene 3
n Pathway 3q Gene 2
q Gene 3
q Gene 4
VU workshop, 2016
Advantages of pathway analysis
n Better interpretationq From interesting genes to interesting biological themes
n Improved robustnessq Robust against noise in the data
n Improved sensitivityq Detecting minor but concordant changes in a pathway
VU workshop, 2016
n Databasesq BioCarta (http://www.biocarta.com/genes/index.asp)
q KEGG (http://www.genome.jp/kegg/pathway.html)
q MetaCyc (http://metacyc.org)
q Pathway commons (http://www.pathwaycommons.org)
q Reactome (http://www.reactome.org)
q STKE (http://stke.sciencemag.org/cm)
q Signaling Gateway (http://www.signaling-gateway.org)
q Wikipathways (http://www.wikipathways.org)
n Limitation
q Limited coverage
q Inconsistency among different databases
q Relationship between pathways is not defined
Pathway databases
VU workshop, 2016
n Structured, precisely defined, controlled vocabulary for describingthe roles of genes and gene products
n Three organizing principles: molecular function, biological process,and cellular componentq Dopamine receptor D2, the product of human gene DRD2
q molecular function: dopamine receptor activity
q biological process: synaptic transmission
q cellular component: plasma membrane
n Terms in GO are linked by several types of relationshipsq Is_a (e.g. plasma membrane is_a membrane)q Part_of (e.g. membrane is part_of cell)q Has partq Regulatesq Occurs in
Gene Ontology
VU workshop, 2016
Gene Ontology
VU workshop, 2016
n Two types of GO annotationsq Electronic annotation
q Manual annotation
n All annotations must:q be attributed to a source
q indicate what evidence was found to support the GO term-gene/protein association
n Types of evidence codes
q Experimental codes - IDA, IMP, IGI, IPI, IEP
q Computational codes - ISS, IEA, RCA, IGC
q Author statement - TAS, NAS
q Other codes - IC, ND
Annotating genes using GO terms
VU workshop, 2016
n Downloads (http://www.geneontology.org)q Ontologies
n http://www.geneontology.org/page/download-ontology
q Annotationsn http://www.geneontology.org/page/download-annotations
n Web-based access q AmiGO: http://www.godatabase.org
q QuickGO: http://www.ebi.ac.uk/QuickGO
Access GO
VU workshop, 2016
Homo sapiens Mus musculus#term #gene #term #gene
GO/BP 6502 15228 6227 15709GO/MF 3144 16389 2961 17287GO/CC 947 16765 882 16801
Coverage of GO annotations
VU workshop, 2016
Over-representation analysis: concept
Development (1842 genes) n Is the observed overlap significantly larger than the expected value?
581
581
compareObserve
Expect
98
65
1842
1842Differentially expressed genes (581 genes)
Hoxa5Hoxa11Ltbp3Sox4Foxc1Edn1Ror2GnagSmad3Wdr5Trp63Sox9Pax1AcdRai1Pitx1……
Sash1Cd24aAgtPsrc1Ctla2bAngptl4Depdc7Sorbs1Macrod1Enpp2Tmem176a……
VU workshop, 2016
Over-representation analysis: method
Significant genes Non-significant genes Total
genes in the group k j-k j
Other genes n-k m-n-j+k m-jTotal n m-n m
Hypergeometric test: given a total of m genes where j genes are in the functional group, if we pick n genes randomly, what is the probability of having k or more genes from the group?
Zhang et.al. Nucleic Acids Res. 33:W741, 2005
€
p =
m − jn − i
#
$ %
&
' ( ji#
$ % &
' (
mn#
$ % &
' ( i= k
min(n, j )
∑ n
Observed k
j
m
VU workshop, 2016
92546_r_at92545_f_at96055_at102105_f_at102700_at……
Over-representation analysis: WebGestalt
WebGestalt
8 organisms Human, Mouse, Rat, Dog, Fruitfly, Worm, Zebrafish, Yeast
Microarray Probe IDs • Affymetrix • Agilent • Codelink • Illumina
Gene IDs • Gene Symbol • GenBank • Ensembl Gene • RefSeq Gene • UniGene • Entrez Gene • SGD • MGI • Flybase ID • Wormbase ID • ZFIN
Protein IDs • UniProt • IPI • RefSeq Peptide • Ensembl Peptide
196 ID types with mapping to Entrez Gene ID
59,278 functional categories with genes identified by Entrez Gene IDs
Gene Ontology • Biological Process • Molecular Function • Cellular Component
Pathway • KEGG • Pathway Commons • WikiPathways
Network module • Transcription factor targets • microRNA targets • Protein interaction modules
Disease and Drug • Disease association genes • Drug association genes
Chromosomal location • Cytogenetic bands
Genetic Variation IDs • dbSNP
http://www.webgestalt.org
Jan. 1, 2015 – Dec. 31, 201563,932 visits from 27,409 visitors>300 citations
~200 ID types
~60K Functional categories
Analysis/visualization
Zhang et.al. Nucleic Acids Res. 33:W741, 2005Wang et al. Nucleic Acids Res. 41:W77, 2013
Gene list
Pathways/functional categories
Daily Unique Visitors (Jan 2015 – Dec 2015)
VU workshop, 2016
n Arbitrary thresholding
n Ignoring the order of genes in the significant gene list
Over-representation analysis: limitations
VU workshop, 2016
Gene Set Enrichment Analysis: concept
n Do genes in a gene set tend to locateat the top or bottom of the rankedgene list?
VU workshop, 2016
Gene Set Enrichment Analysis: method
Subramanian et.al. PNAS 102:15545, 2005
http://www.broad.mit.edu/gsea/
+1/k
-1/(n-k)
k: Number of genes in the gene set Sn: Number of all genes in the ranked gene list
VU workshop, 2016
n Organizing genes byq Pathways
q Gene Ontology
q Other functional categories
n Enrichment analysis methodsq Over-representation analysis
q Gene Set enrichment analysis
n Major limitationq Existing knowledge on pathways or gene functions is far from complete
q Connections between genes are not considered
Pathway-based analysis
VU workshop, 2016
Biological networks
Networks Nodes Edges
Physical interaction networks
Protein-protein interaction network
Proteins Physical interaction, undirected
Signaling network Proteins Modification,directed
Gene regulatory network
TFs/miRNAsTarget genes
Physical interaction,directed
Metabolic network Metabolites Metabolic reaction,directed
Functional association networks
Co-expression network
Genes/proteins
Co-expression,undirected
Genetic network Genes Genetic interaction,undirected
VU workshop, 2016
n Database of Interacting Proteins (DIP)http://dip.doe-mbi.ucla.edu/
n The Molecular INTeraction database (MINT)http://mint.bio.uniroma2.it/mint/
n The Biomolecular Object Network Databank (BOND)http://bond.unleashedinformatics.com/
n The General Repository for Interaction Datasets (BioGRID)http://www.thebiogrid.org/
n Human Protein Reference Database (HPRD)http://www.hprd.org
n Online Predicted Human Interaction Database (OPHID)http://ophid.utoronto.ca
n iRefhttp://wodaklab.org/iRefWeb
n The International Molecular Exchange Consortium (IMEX)http://www.imexconsortium.org
Protein-protein interaction databases
VU workshop, 2016
Properties of complex networks
Scale-free(hubs)
Hierarchical modularSmall world(six degree separation)
Human protein-protein interaction network9,198 proteins and 36,707 interactions
VU workshop, 2016
Network visualization
VU workshop, 2016
Network visualization
Cytoscape (http://www.cytoscape.org)
Gehlenborg et al. Nature Methods, 7:S56, 2010
Network visualization tools
Red Blue
Black
Node label color: protein type HNE targets Transcription factors Other proteins
Node size: prediction cutoff level
Node color: protein function
Loose cutoff level
Stringent cutoff level
Cell cycle RNA splicing
Both Neither
VU workshop, 2016
Scalability challenges
DNACNA
mutationmethylation
mRNAexpression
splicing
Proteinexpressionmodification
Network sizeData complexity
VU workshop, 2016
Barabasi and Oltvai, Nat Rev Genet, 2004
Making sense of the hairballs
Nodes ordered in one dimension on the basis of network structure
Data visualized as tracks
NetSAM is available in BioConductorShi et al. Nature Methods, 2013
VU workshop, 2016
NetGestalt: exploring multidimensional omics data over biological networks
n Genes ordered in the horizontal dimension
n Data visualized as tracks along the verticaldimension
q Composite continuous track (e.g. geneexpression matrices)
q Single continuous track (e.g. foldchanges, p values)
q Single binary track (e.g. significantgenes, GO annotations)
n Interactive data visualization and analysis
q Search, zoom, pan, filter
q Track comparison
q Statistical analysis
q Network analysis and visualization
q Pathway enrichment analysis
q Upload user networks and tracks
Barcode plot
Bar chart
Heat map
http://www.netgestalt.orgShi et al. Nature Methods, 2013
2D layout
1D layoutNetSAM
VU workshop, 2016
n Proteins that lie closer to one another in a protein interaction network are more likely to have similar function and involve in similar biological process.
n Network-based gene function prediction
n Network-based gene prioritization
Network distance vs functional similarity
Sharan et al. Mol Syst Biol, 3:88, 2007
VU workshop, 2016
Network-based analysis
n Direct neighbor-based analysis
n Network module-based analysis
n Network diffusion analysis
VU workshop, 2016
Direct neighbor-based analysis
n Network expansion: to expand a list of “seed” genes to include other relatedgenes in the network
q All neighbors: to identify all direct neighbors of these seed genes inthe network
q Enriched neighbors: to identify all non-seed genes significantlyenriched with seed neighbors
VU workshop, 2016
TP53
PRKAA2
COPS5
PRMT1
MECR
PCBD1
MED23
MED16
MED17
MED10
TRIM24NCOA2
NCOA6
TGS1
AR
HDAC3
SP1
NCOR2
SIRT1
HDAC4
SREBF2
UBE2I
MED7
MED21MED14
MED24
MED1
NR2C2
SREBF1
PROX1
NCOA1
NR1I3
PPARGC1A
NR0B2
NR2F1
VDR
CREBBP
FOXO1
SUB1
CTNNB1
HNF1AMAPK14
MAPK1
MAPK3
SNCG
SMAD4
SMAD2
STK16
EXT2HNF4A
RNASEH2B
NPPAC16orf80
PNRC1
ZNHIT3
CITED2
PPARGC1B
RAD50
XPO1
PABPC4
NRBF2
SFRP1
WNT4
WNT2
SFRP2
WNT5A
TGFB3
TGFB1
BMP2
BMP7
ACVR1B
HGFCOL1A1 BAMBITRIM28
FOXO3
ZNF703 WNT16 FOXC2 FOXS1
PIAS4
LEF1
TCF7L2
EP300
MSX1
TBX5
HNRNPAB DAB2IP
NKX2-1
ADAM15
CTNNB1
GREM1
WWTR1
PPP3R1
AKT1AXIN2
GSK3B
DVL1
SMAD7
SMAD4SMAD3
NF1
KDM6B
SMAD2NOTCH1
FOXF2
NOG
EFNA1
PARD3BBCL9L
EPB41L5
HIF1ASOX9
SNAI1
PITX2FOXC1
PPP2CAS100A4
TGFBR3
TGFBR2
TGFBR1
TGFBRAP1TGFB1I1
TGFB2
FMOD
VASN
ENG
All neighbors Enriched neighbors
Direct neighbor-based analysis
n Gene prioritization: to prioritize genes in a list of candidate genes
VU workshop, 2016
PCDHB5
DKK2CCBP2
CDH18
CDH9
HCN1
ACVR2AACVR1B
CHRDL1
TLL1
KLK2
OPCMLLPPR4
ROBO1
LRRN3
NCAM2
GGT1
NXF5
PCBP1
MYO1B
ATM
TP53
FBXW7
PRIM2
FLGELF3
UBE2NL
EIF4A2
DNAH5
GUCY1A3
FAT4
CSMD1ZC3H13 ZNF208
ARID1A
ARID2
ING1
TCF7L2
SOX9
CRTC1DACH2
ISL1
CASP14
SDK1
PTPRM
CDH2CDH8
CDH11
CTNNB1
KIAA1804 SLITRK1
TNFRSF10C
RYR2
ADAM29
BRAF
NRAS
KRAS
PIK3CA
PTEN
MAP2K4
FAM5C
APC
FAM123B
ANK2
EVC2
SMAD4
SMAD2
TTN
RIMS1
TRPS1TRPC4
LIFR
PCDHA10
AFF2
ERBB4
DOCK2
LRP1BCNTN1 LOXL2
EPHA3
NRXN1 GRIK3 GRIA1KCNA4 BAI3
DMD
DPP10
NALCNCHRM2 EDNRB
CHST9
IFT80
n Identify network modules from a networkq e.g., NetSAM
n Use network modules as gene sets for enrichment analysisq Over-representation analysis; GSEA
Network module-based analysis
VU workshop, 2016
TIMM50
KSR1
FKBP10
PAQR3
DNAJC27
PEX5L
PEBP4
KSR2
MAP2K1MAP2K2
OIP5
TERF2IP
RAF1
CPS1
NUDT14BRAF
ARAF
LRPAP1
RALGDS
RRAS
RGL2
SHOC2
RAP1GDS1
HRAS
VARS2
ITIH1
ARHGEF1
DGKZ
ADCY6
CHRNA7
KRAS
NRAS
RAP1B
EIF2AK1
PIK3CA
FTL
IGDCC3
DNAJB6 ALKBH1
RASGRP1
PPM1BKIAA1161
MOXD1VRK2
MRPL36KIFC3
PLEKHA7
CBLL1
ITGAE
PTPRM
CDH1
LIMA1
AJAP1
MYO7A
BOC
CDON
CA9
CDH24
CTNND1
CDH3
CTNNA1
CDH2
CDH4
PTPN14
CDH16
KANK1
CDH8
CDH17
CTNNB1
PKP4
DSC3
JUPPKD1
DSG3
DSC2DSG2
PKP3
DSC1PKP2
DSP
PKP1
DSG1
VRK3
KRT1KRT5
KRT14
TCHP
KRT6A
KRT18KRT8
KRT17
KRT72
LMNA
ALOX12B
ALOX12TRIB1
URB2NARF TOR1AIP1
MLIP
HSBP1
KRT16
CALB2
USP53CPNE7
CLN5
CLEC3B
FGF13 AKAP12
NEURL2 BCL9L
CDH11
SOX17 CTNNA3
ERMN
LDHB
SOX1
KIAA1109
FSCN2VEZT
FAM84B
Network diffusion analysis
n Simulating a random walker’s behavior on a network (with restart)
VU workshop, 2016
n Simulating a random walker’s behavior on a network (with restart)q Gene prioritization
q Network expansion
Network diffusion analysis
VU workshop, 2016
http://www.gene2net.org
n Biological networksq Physical interaction networks
q Functional association networks
n Properties of complex networksq Scale-free
q Small world
q Hierarchical modular
n Network visualizationq Node-link diagrams
q NetGestalt
n Network-based analysisq Direct neighbor-based analysis
q Network module-based analysis
q Network diffusion analysis
Summary
VU workshop, 2016
Hands-on session
n WebGestaltq Pathway analysis for gene lists
q http://www.webgestalt.org
n NetGestaltq Pathway and network analysis for gene lists and data matrices
q http://www.netgestalt.org
n Gene2Net
q network diffusion analysis for gene lists
q http://www.gene2net.org
VU workshop, 2016