Characterizing Gene Functional Expression Profiles
Zoran Obradovic, Slobodan Vucetic, Hongbo Xie, Hao Sun, Pooja Hedge
Information Science and Technology Center, Temple University
Outline
1. Microarray Data Analysis Process
2. Functional Expression Profile Analysis
• Functional Expression Profile Ranking
• Functional Expression Profile Clustering
3. Functional Characterization of Plasmodium falciparum, Saccharomyces cerevisiae, Mus musculus and Homo sapiens
What is a DNA Microarray?
DNA microarray technology allows the expression of tens of thousands of genes to be measured at the same time
[Figure: scanning/signal detection, showing the Cy3 channel and the Cy5 channel; legend: equal expression, higher expression in Cy3, higher expression in Cy5. Source: "Analysis of Replicated Experiments", Gordon Smyth, Walter and Eliza Hall Institute]
Microarray Data Analysis Process
1. Designing gene expression experiments
2. Image processing and analysis
3. Preprocessing raw intensity data
4. Discovering differentially expressed genes
5. Advanced analysis
• Finding relevant pathways
• Discovering gene expression patterns
• Understanding gene functions
More information: www.ist.temple.edu/research/biocore.html
Designing Gene Expression Experiments
[Figure: comparative experiment designs: reference design, loop design and a saturated design. Source: http://discover.nci.nih.gov/microarrayAnalysis/Experimental.Design.jsp]
Image Processing and Analysis (figure is obtained using Imagene software)
Preprocessing Raw Intensity Data
[Figure: normalization of raw intensity data. Source: "Analysis of Replicated Experiments", Gordon Smyth, Walter and Eliza Hall Institute]
Discovering Differentially Expressed Genes
• Fold change (log ratio)
• Statistical methods
1) t-test
2) ANOVA
3) Non-parametric analysis: Wilcoxon rank-sum test
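As a rough illustration of the first two options, the sketch below computes a log2 fold change and a Welch t statistic for one gene. The replicate intensities and the function names `log2_fold_change` and `welch_t` are made up for this example, and the choice of the Welch (unequal variance) form of the t-test is an assumption; the slides do not say which variant was used.

```python
import math
from statistics import mean, variance

def log2_fold_change(treated, control):
    """log2 ratio of the mean intensities of two groups of replicates."""
    return math.log2(mean(treated) / mean(control))

def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

# Hypothetical replicate intensities for one gene under two conditions.
control = [100.0, 110.0, 90.0, 100.0]
treated = [400.0, 420.0, 380.0, 400.0]

lfc = log2_fold_change(treated, control)  # mean ratio is 4, so log2 ratio is 2
t = welch_t(treated, control)
print(f"log2 fold change = {lfc:.2f}, Welch t = {t:.2f}")
```

A large fold change with a small t statistic would point to noisy replicates, which is why the two criteria are usually read together.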
Advanced Analysis: Finding Relevant Pathways (figure is obtained using Ingenuity software)
Advanced Analysis: Discovering Gene Expression Patterns
Plasmodium falciparum intraerythrocytic developmental cycle
Genes are sorted by the time of their expression peak
(Bozdech Z et al., PLoS Biol. 2003 Oct;1(1))
Advanced Analysis: Identifying Unknown Gene Functions Based on Expression Profiles
[Figure: three expression profiles, each plotted over x-axis values 1 to 7: gene 1 expression profile with function A, gene 2 expression profile with function B, and an unknown sequence tag whose function is to be determined]
The unknown sequence tag has a high correlation with the gene 1 expression profile, so the sequence tag is assigned function A.
Is this assignment reliable?
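The assignment step described above (give the unknown sequence tag the function of the annotated gene whose profile it correlates with best) can be sketched as follows. The seven-point profiles are invented for illustration, and Pearson correlation is assumed as the similarity measure.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical annotated genes: name -> (function, expression profile).
annotated = {
    "gene 1": ("function A", [-0.2, 0.0, 0.3, 0.5, 0.3, 0.0, -0.2]),
    "gene 2": ("function B", [0.5, 0.4, 0.2, 0.0, 0.2, 0.4, 0.6]),
}
# Hypothetical profile of the unknown sequence tag.
unknown = [-0.3, -0.1, 0.2, 0.4, 0.3, -0.1, -0.3]

# Pick the annotated gene with the highest correlation to the unknown profile.
best_gene = max(annotated, key=lambda g: pearson(unknown, annotated[g][1]))
predicted_function = annotated[best_gene][0]
print(best_gene, predicted_function)
```

The sketch always returns some function, even when the best correlation is weak, which is exactly the reliability concern the slide raises.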
Standard Practice
Basic assumption: Expression profiles of functionally related genes are correlated
Objectives: Confirm a specific biological hypothesis; predict functional properties of less characterized genes; or uncover new/unexpected biological knowledge
Methodology: Cluster genes based on the similarity of their expression profiles, then perform functional analysis of the obtained clusters
Problems with old approaches
Genes with the same function do not necessarily have the same expression profiles
Clustering on the expression profiles of all genes can be unreliable
Our Approach: Analyzing Microarray Functional Expression Profiles (FEPs)
FEP: the average expression profile of all genes associated with a given GO term whose genes are highly correlated
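A minimal sketch of the averaging step, assuming expression profiles are stored per gene and GO annotations are stored as a set of term IDs per gene. The gene names, values and the gene-to-term mapping below are hypothetical; only the two GO IDs come from the slides.

```python
def functional_expression_profile(profiles, annotations, go_term):
    """Average, point by point, the profiles of all genes annotated with go_term."""
    members = [profiles[gene] for gene, terms in annotations.items()
               if go_term in terms and gene in profiles]
    if not members:
        raise ValueError(f"no profiled genes annotated with {go_term}")
    return [sum(values) / len(members) for values in zip(*members)]

# Hypothetical expression profiles (three time points per gene).
profiles = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [3.0, 2.0, 1.0],
    "geneC": [0.0, 0.0, 6.0],
}
# Hypothetical gene-to-GO-term annotations.
annotations = {
    "geneA": {"GO:0016311"},
    "geneB": {"GO:0016311", "GO:0004721"},
    "geneC": {"GO:0004721"},
}

fep = functional_expression_profile(profiles, annotations, "GO:0016311")
print(fep)
```

Note that geneA and geneB have opposite trends, so their average is flat; this is the case where an FEP is not meaningful and the term should be filtered out by the correlation test.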
Advanced Analysis: Identifying Unknown Gene Functions Based on Expression Profiles
[Figure: expression profiles (x-axis 0 to 50, y-axis -2.5 to 2) for GO:0016311 (dephosphorylation) and GO:0004721 (phosphoprotein phosphatase activity)]
Questions that we address:
• How to perform functional analysis in an objective manner
• How to estimate the biological significance of discoveries
Tools and Applications
Developed tools:
• (1) Explore which functions have conserved expression profiles (Tool 1: functional expression profile ranking package)
• (2) Explore which functions have similar expression profiles and test their functional similarity (Tool 2: functional expression profile clustering package)
Applications:
• Functional characterization of gene expression in the intraerythrocytic developmental cycle of Plasmodium falciparum and in the cell cycles of Saccharomyces cerevisiae, Mus musculus and Homo sapiens
Tools Architecture
[Diagram components: microarray raw data; data pre-processing; gene function annotation database; functional expression profile ranking; list of significantly correlated GO terms; functional expression profile clustering; clusters of functional expression profiles; gene function semantic distance mapping space; report]
Tool 1: Functional Expression Profile (FEP) Ranking Package
Objective:
• Identify genes with the same function that have correlated expression profiles
Task:
• Evaluate gene expression correlation within each FEP
Methodology
• Step 1: calculate the average pairwise correlation coefficient S among the n gene expression profiles for a given function term
• Step 2: randomly select n genes from the whole dataset and compute their average pairwise correlation coefficient S'
• Step 3: repeat Step 2 m times (m > 10,000) and compare the distribution of S' with the original S to estimate a p-value
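The three steps above can be sketched as a small permutation test. This is an illustration under stated assumptions, not the package's code: Pearson correlation is assumed, the p-value is taken as the fraction of random draws whose S' is at least S (with a +1 correction so it is never zero), the data are synthetic, and m is set to 1,000 here only to keep the example fast.

```python
import itertools
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def avg_pairwise_corr(profiles):
    """Step 1: average correlation over all pairs of profiles."""
    pairs = list(itertools.combinations(profiles, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

def fep_p_value(term_genes, dataset, m=1000, seed=0):
    """Steps 2 and 3: compare S with S' from m random gene sets of the same size."""
    rng = random.Random(seed)
    s = avg_pairwise_corr([dataset[g] for g in term_genes])
    genes = list(dataset)
    hits = 0
    for _ in range(m):
        sample = rng.sample(genes, len(term_genes))
        if avg_pairwise_corr([dataset[g] for g in sample]) >= s:
            hits += 1
    return s, (hits + 1) / (m + 1)

# Synthetic dataset: 40 background genes with random profiles, plus three
# co-expressed genes standing in for the genes of one GO term.
rng = random.Random(1)
timepoints = range(12)
dataset = {f"bg{i}": [rng.gauss(0, 1) for _ in timepoints] for i in range(40)}
for i in range(3):
    dataset[f"term{i}"] = [math.sin(t / 2) + rng.gauss(0, 0.05) for t in timepoints]

s, p = fep_p_value(["term0", "term1", "term2"], dataset, m=1000)
print(f"S = {s:.3f}, p-value = {p:.4f}")
```

With m random draws the smallest reportable p-value is about 1/m, which is one reason to use a large m as the slide suggests.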
Dataset 1: Plasmodium Falciparum Intraerythrocytic Developmental Cycle (Bozdech Z et al., (2003) PLoS Biol. Oct; 1(1))
Objective: Identification of P. falciparum genes whose RNA levels vary periodically within the asexual intraerythrocytic developmental cycle (IDC) transcriptome
Materials: 5080 ORFs, 3532 unique genes, 46 assays (sampled in time) using cDNAs
Methods: Permutation test with Fast Fourier Transform algorithm and correlations
Found: 60% of genes are transcriptionally active, and most genes are active only once during the IDC
Figure: Major morphological stages during the IDC and the transcriptional profiles of 2712 genes
Dataset 2: Saccharomyces Cerevisiae Cell Cycle (Spellman et al., (1998) Molecular Biology of the Cell 9, 3273-3297)
Objective: Identification of yeast genes whose RNA levels vary periodically within cell cycle process
Materials: 6178 ORFs, 4450 unique genes, 77 assays (sampled in time) using cDNAs
Methods: Periodicity and correlation algorithm
Found: Identified 800 genes that meet an objective minimum criterion for cell cycle regulation
Figure: The M/G1 clusters
Dataset 3: Homo Sapiens Cell Cycle (R. Cho et al. (2001) Nature, 27)
Objective: Identification of human genes whose RNA levels vary periodically within cell cycle process
Materials: 6800 ORFs, 5795 unique genes, 14 assays (sampled in time) using Affymetrix arrays
Methods: Fold change
Found: 700 genes that display transcriptional fluctuation with a periodicity consistent with that of the cell cycle
Figure: Clustering analysis of cell-cycle–regulated transcripts
Dataset 4: Mus Musculus Cell Cycle (Ishida, S et al (2001) Mol. Cell. Biol. 21, 4684-4699)
Objective: Analysis of gene regulation during the mammalian cell cycle
Materials: 6347 unique genes, 14 assays
Methods: Clustering
Found: Identified 7 distinct clusters of genes that exhibit unique patterns of expression
Figure: Patterns of gene expression following growth stimulation and during the mammalian cell cycle
Applying FEP Ranking Package: Cumulative Distributions of GO Term p-Values of Human, Yeast, Mouse and P.F.
[Figure: cumulative distributions of GO term p-values for mouse, human, yeast and P.F. x-axis: X (log(p)), from 10^-4 to 10^0; y-axis: fraction of GO terms with p-value less than X, from 0 to 1; p = 0.05 is marked]
Applying FEP Ranking Package: GO Terms with the Most Conserved FEP Among Multi-organisms
Applying FEP Ranking Package: Selection of GO Terms with Significantly Correlated Expression Patterns at Plasmodium Falciparum Developmental Cycle Data
[Figure: cumulative distribution of p-values for GO terms associated with at least two genes]
• 46% of all molecular function GO terms are significantly correlated
• 52% of all biological process GO terms are significantly correlated
[Figure: selected expression profiles (x-axis 0 to 50, y-axis -2.5 to 2): GO:0016311 (dephosphorylation) and GO:0007028 (cytoplasm organization and biosynthesis)]
Plasmodium Falciparum: Processes and Functions with the Highest/Lowest Correlation
Highest correlation
Functions:
• acid phosphatase activity
• calmodulin-dependent protein kinase I activity
• triose-phosphate isomerase activity
• guanylate kinase activity
• glutamate-cysteine ligase activity
Biological processes:
• zinc ion transport
• terpene metabolism
• protein processing
• DNA replication, synthesis of RNA primer
• cell invasion
Lowest correlation
Functions:
• translation regulator activity
• cell surface antigen activity, host-interacting
• peptide binding
• 5,10-methylenetetrahydrofolate-dependent methyltransferase activity
• cyclophilin-type peptidyl-prolyl cis-trans isomerase activity
Biological processes:
• terpene biosynthesis
• pigment biosynthesis
• tetrahydrobiopterin metabolism
• coenzyme and prosthetic group biosynthesis
• purine ribonucleoside biosynthesis
Plasmodium Falciparum: Findings by FEP Ranking Package
Of the 12 FEPs referenced by Bozdech et al., two have a p-value larger than 0.05.
• E.g., the average correlation coefficient among genes associated with the Ribonucleotide Synthesis function is only 0.258 (p-value = 0.11), which weakens the claim that it is related to the ring stage of the IDC.
No linear relationship was found between the number of genes associated with a given GO term and the average correlation coefficient among these genes
Ranking GO terms by p-value could be useful for rapidly identifying functions that are closely related to a specific developmental stage (of Plasmodium falciparum)
All Datasets: Findings by FEP Ranking Package
To some extent, genes with identical functions have similar expression profiles
However, a large fraction of functions do not follow this underlying hypothesis!
Higher-level organisms seem to have a lower fraction of significantly correlated expression profiles for identical functions.
Fractions of correlated FEPs:
• Saccharomyces cerevisiae: 59% (643/1,083)*
• Plasmodium falciparum: 48.4% (428/884)
• Homo sapiens: 16.4% (249/1,514)
• Mus musculus: 13.3% (182/1,366)
*fractions are for both processes and functions
Tool 2: FEP Clustering Package
Objective:
• Identify genes with similar functions and similar expression profiles
Tasks:
• Cluster the FEPs selected by the FEP ranking package
• Evaluate the found clusters for biological relevance by
  • identifying similar functions based on the GO term hierarchy tree structure
  • evaluating inter-cluster GO term distance
Methodology
• Randomly generate k sets, each containing the same number of GO terms as the corresponding cluster
• Calculate the total GO term distance within each generated set and sum the total distances of all sets to get the overall score S'
• Repeat the procedure 1000 times and compare the distribution of S' with the overall distance obtained through clustering
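The random-set procedure above can be sketched as follows. Two points are assumptions made for this illustration: the random sets are produced by shuffling the clustered GO terms into groups of the original cluster sizes, and the GO term distance is passed in as a function `dist`. The demo uses integers with absolute difference as a stand-in for GO terms and their tree distance.

```python
import itertools
import random

def total_within_distance(terms, dist):
    """Sum of pairwise distances among the terms of one set."""
    return sum(dist(a, b) for a, b in itertools.combinations(terms, 2))

def clustering_score(clusters, dist):
    """Overall score: within-set distances summed over all sets."""
    return sum(total_within_distance(c, dist) for c in clusters)

def random_scores(clusters, dist, repeats=1000, seed=0):
    """Scores S' of random groupings with the same set sizes as the clusters."""
    rng = random.Random(seed)
    all_terms = [t for c in clusters for t in c]
    sizes = [len(c) for c in clusters]
    scores = []
    for _ in range(repeats):
        shuffled = all_terms[:]
        rng.shuffle(shuffled)
        sets, start = [], 0
        for size in sizes:
            sets.append(shuffled[start:start + size])
            start += size
        scores.append(clustering_score(sets, dist))
    return scores

# Stand-in data: integers play the role of GO terms, |a - b| the role of
# the GO tree distance. The two clusters are tight by construction.
clusters = [[1, 2, 3], [101, 102, 103]]
dist = lambda a, b: abs(a - b)

observed = clustering_score(clusters, dist)
scores = random_scores(clusters, dist, repeats=1000)
print(observed, min(scores), max(scores))
```

A found clustering is convincing when its overall distance sits well below the bulk of the random scores.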
Structure of GO Term Tree (Example)
Level 1: GO:0008150 : biological process
Level 2: GO:0007275 : development; GO:0007582 : physiological process
Level 3: GO:0007389 : pattern specification; GO:0000003 : reproduction; GO:0008152 : metabolism
Level 4: GO:0009798 : axis specification
Level 5: GO:0009948 : anterior/posterior axis specification
Measuring Distance of GO Terms
The distance between GO terms X and Y is the ratio
$$\frac{\text{length of the minimal chain between } X \text{ and } Y \text{ in the GO tree}}{h^*(X, Y)}$$
where $h^*(X, Y)$ is the length of the maximal chain from the top to the bottom
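A small sketch of this distance on part of the example tree. Several things here are assumptions: each term is given a single parent (the real Gene Ontology is a directed acyclic graph), the parent links are inferred from the level layout of the example, the maximal chain is taken as the depth of the whole tree counted in edges, and the distance is taken as the ratio of the minimal chain to the maximal chain.

```python
# Assumed parent links for part of the example tree (child -> parent).
PARENT = {
    "GO:0008150": None,          # biological process (level 1)
    "GO:0007275": "GO:0008150",  # development (level 2)
    "GO:0007582": "GO:0008150",  # physiological process (level 2)
    "GO:0007389": "GO:0007275",  # pattern specification (level 3)
    "GO:0008152": "GO:0007582",  # metabolism (level 3)
    "GO:0009798": "GO:0007389",  # axis specification (level 4)
    "GO:0009948": "GO:0009798",  # anterior/posterior axis specification (level 5)
}

def path_to_root(term):
    """List of terms from `term` up to the root, inclusive."""
    path = [term]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path

def minimal_chain(x, y):
    """Edges on the shortest path between x and y via their lowest common ancestor."""
    px, py = path_to_root(x), path_to_root(y)
    steps_from_y = {term: i for i, term in enumerate(py)}
    for i, term in enumerate(px):
        if term in steps_from_y:
            return i + steps_from_y[term]
    raise ValueError("terms are not in the same tree")

def maximal_chain():
    """Edges on the longest root-to-leaf chain of the tree."""
    return max(len(path_to_root(term)) - 1 for term in PARENT)

def go_distance(x, y):
    return minimal_chain(x, y) / maximal_chain()

# axis specification vs metabolism: 5 edges apart, tree depth 4 edges.
print(go_distance("GO:0009798", "GO:0008152"))
# parent and child: 1 edge apart.
print(go_distance("GO:0009948", "GO:0009798"))
```

Because the path may climb one branch and descend another, the ratio can exceed 1 for terms on different branches of a deep tree.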
Determination of Number of Clusters
Measured by the z-score:
$$z\text{-score} = \frac{\left|\,\text{original sum of distances} - \text{mean}(\text{permuted clusters})\,\right|}{\text{standard deviation}(\text{permuted clusters})}$$
A larger z-score indicates a better grouping of functions within clusters.
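As a small worked example of this z-score, the sketch below measures how far the original sum of distances lies from the mean of the permuted clusterings, in units of their standard deviation. The numbers are invented, and two choices are assumptions: the absolute difference is used so that a larger value means better separation, and the sample standard deviation is used.

```python
from statistics import mean, stdev

def cluster_z_score(original, permuted):
    """Separation of the original score from the permuted scores, in standard deviations."""
    return abs(original - mean(permuted)) / stdev(permuted)

# Hypothetical sums of GO term distances for five permuted clusterings.
permuted = [10.0, 12.0, 14.0, 12.0, 12.0]
# Hypothetical sum of distances for the found clustering.
z = cluster_z_score(6.0, permuted)
print(f"z-score = {z:.3f}")
```

Computing this value for each candidate number of clusters gives the curve used to choose k.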
Number of Clusters vs Z-score: Results for Plasmodium Falciparum
[Figure: number of clusters (x-axis, 2 to 20) vs z-score, in two panels: Plasmodium falciparum biological processes and Plasmodium falciparum molecular functions. The z-scores span 7.5 to 12 in one panel and 24 to 32 in the other]
Applying FEP Clustering Package: Results on Plasmodium Falciparum Processes
[Figure: k-means clustering profiles of FEPs for 238 identified processes, in four panels numbered 1 to 4]
Cluster vs Stage of IDC
Cluster index   Number of FEPs   Corresponding stage
1               78               Trophozoite
2               80               Schizont
3               50               Ring
4               20               Schizont-Early Ring
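For readers unfamiliar with k-means, here is a minimal pure-Python sketch of clustering profiles by squared Euclidean distance. It is not the package's implementation: the initialization (random choice of k profiles), the fixed iteration count and the toy three-point profiles are all assumptions made for illustration, and the demo uses k = 2 rather than the four clusters reported above.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(profiles, k, iterations=50, seed=0):
    """Plain k-means: assign each profile to the nearest centroid, then update centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(profiles, k)
    labels = [0] * len(profiles)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in profiles]
        for c in range(k):
            members = [p for p, label in zip(profiles, labels) if label == c]
            if members:  # keep the previous centroid if a cluster becomes empty
                centroids[c] = [sum(v) / len(members) for v in zip(*members)]
    return labels, centroids

# Toy FEPs: three that peak early and three that peak late.
profiles = [
    [1.0, 0.1, 0.0], [0.9, 0.0, 0.1], [1.1, 0.0, 0.0],
    [0.0, 0.1, 1.0], [0.1, 0.0, 0.9], [0.0, 0.0, 1.1],
]
labels, centroids = kmeans(profiles, k=2)
print(labels)
```

In the setting above, each resulting cluster would then be matched to the IDC stage at which its centroid profile peaks.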
Applying FEP Clustering Package: Results on Plasmodium Falciparum Functions
[Figure: k-means clustering profiles of FEPs for 199 identified molecular functions, in four panels numbered 1 to 4 (x-axis 10 to 40)]
Cluster vs stage of IDC
Cluster   Number of FEPs   Corresponding stage
1         48               Trophozoite
2         63               Schizont
3         53               Ring
4         35               Schizont-Early Ring
GO Trees of Functions: 4 Clusters of Plasmodium Falciparum
Statistical Evaluation: Found vs. Random Clusters for P. Falciparum
[Figure: histograms of the total GO term distance of random clusters, with the found clusters marked, in two panels: molecular functions (x-axis 1.3 to 1.48, x 10^5) and biological processes (x-axis 1.05 to 1.35, x 10^5)]
• The distance from the found clusters to the random clusters is larger for biological processes.
• The random clusters for biological processes have smaller variance.
Statistical Evaluation: Clustering All GO Terms for P. Falciparum
Clustering all GO terms leads to a smaller z-score, which means the clusters are of lower quality
The right figure shows the P.F. functional clustering result: the z-score is 8.5, compared with 12 when clustering only the correlated GO terms
[Figure: clustering results, with the found clusters marked]
Statistical Evaluation: Found vs. Random Clusters for S. Cerevisiae and Homo Sapiens
[Figure: four histograms of the total GO term distance of random clusters (x-axes in units of 10^5, counts from 0 to 40), with the found clusters marked in each panel: yeast processes, yeast functions, human processes, human functions]
Remarks
Statistical significance of the identified clusters (separation between clusters and random groupings) is increased by
• Normalizing data (Plasmodium falciparum)
• Eliminating noise through singular value decomposition (SVD)
• Reducing data through Principal Component Analysis
Cluster distances (x 10^5):
                        Function clusters distance   Process clusters distance
Normalized              1.316                        1.089
Without normalization   1.368                        1.213
Conclusions
The proposed microarray tools help identify
• genes with the same function and correlated expression profiles
• genes with similar functions and similar expression profiles
Measuring GO-tree-based distance was useful for evaluating the biological relevance of clusters; however,
• many GO terms have only 1 associated gene
• many genes do not even have a GO term
• parenthood and sibling relationships in GO trees should be differentiated, with a smaller penalty for a sibling relationship than for parenthood
More robust clustering methods could be used
Thank You!
More information: www.ist.temple.edu/research/biocore.html
Contact: Zoran Obradovic, Director, IST Center, Temple University, 215 204-6265