characterizing gene functional expression profiles

Characterizing Gene Functional Expression Profiles Zoran Obradovic Slobodan Vucetic Hongbo Xie, Hao Sun, Pooja Hedge Information Science and Technology Center, Temple University


Post on 03-Jan-2016

Page 1: Characterizing Gene Functional Expression Profiles

Characterizing Gene Functional Expression Profiles

Zoran Obradovic, Slobodan Vucetic

Hongbo Xie, Hao Sun, Pooja Hedge

Information Science and Technology Center, Temple University

Page 2: Characterizing Gene Functional Expression Profiles

Outline

1. Microarray Data Analysis Process

2. Functional Expression Profile Analysis
• Functional Expression Profile Ranking
• Functional Expression Profile Clustering

3. Functional Characterization of Plasmodium falciparum, Saccharomyces cerevisiae, Mus musculus and Homo sapiens

Page 3: Characterizing Gene Functional Expression Profiles

What is a DNA Microarray?

DNA microarray technology makes it possible to measure the expression of tens of thousands of genes at a time

Analysis of Replicated Experiments (Gordon Smyth, Walter and Eliza Hall Institute)

Page 4: Characterizing Gene Functional Expression Profiles

Scanning/Signal Detection

[Figure: scanned array images for the Cy3 channel and the Cy5 channel; spots are labeled as equal expression, higher expression in Cy3, or higher expression in Cy5]

Page 5: Characterizing Gene Functional Expression Profiles

Microarray Data Analysis Process

1. Designing gene expression experiments
2. Image processing and analysis
3. Preprocessing raw intensity data
4. Discovering differentially expressed genes
5. Advanced analysis
• Finding relevant pathways
• Discovering gene expression patterns
• Understanding gene functions

More information: www.ist.temple.edu/research/biocore.html

Page 6: Characterizing Gene Functional Expression Profiles

Designing Gene Expression Experiments

[Figure: comparative experimental designs: a saturated design, a reference design, and a loop design]

http://discover.nci.nih.gov/microarrayAnalysis/Experimental.Design.jsp

Page 7: Characterizing Gene Functional Expression Profiles

Image Processing and Analysis (figure is obtained using Imagene software)

Page 8: Characterizing Gene Functional Expression Profiles

Preprocessing Raw Intensity Data

normalize

Analysis of Replicated Experiments (Gordon Smyth, Walter and Eliza Hall Institute)

Page 9: Characterizing Gene Functional Expression Profiles

Discovering Differentially Expressed Genes

• Fold change (log ratio)
• Statistical methods
1) T-test
2) ANOVA
3) Non-parametric analysis: Wilcoxon rank-sum test
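As a rough illustration of three of the criteria listed on this slide, the sketch below computes a log2 fold change, a Welch t statistic and a Wilcoxon rank-sum statistic for one gene. The intensity values and the function names are made up for the example, and the slide does not say which t-test variant was used, so the Welch form is an assumption.

```python
import math
from statistics import mean, variance

def log2_fold_change(treated, control):
    """Log ratio of mean intensities (assumes positive, already normalized values)."""
    return math.log2(mean(treated) / mean(control))

def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances); an assumed variant."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def rank_sum(a, b):
    """Wilcoxon rank-sum statistic: sum of the ranks of sample a in the pooled data.
    Tied values receive the average of the ranks they span."""
    pooled = sorted(a + b)

    def avg_rank(v):
        first = pooled.index(v) + 1          # 1-based rank of the first occurrence
        last = first + pooled.count(v) - 1   # rank of the last occurrence
        return (first + last) / 2

    return sum(avg_rank(v) for v in a)

# Hypothetical replicate intensities for a single gene
treated = [820.0, 790.0, 845.0, 810.0]
control = [400.0, 415.0, 390.0, 405.0]

lfc = log2_fold_change(treated, control)
t = welch_t(treated, control)
w = rank_sum(treated, control)
```

Turning the t and rank-sum statistics into p-values needs the corresponding reference distributions, which a statistics package would normally supply.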

Page 10: Characterizing Gene Functional Expression Profiles

Advanced Analysis: Finding Relevant Pathways (figure is obtained using Ingenuity software)

Page 11: Characterizing Gene Functional Expression Profiles

Advanced Analysis: Discovering Gene Expression Patterns

Plasmodium falciparum intraerythrocytic developmental cycle

Genes are sorted based on expression time peaks

(Bozdech Z et al., PLoS Biol. 2003 Oct;1(1))

Page 12: Characterizing Gene Functional Expression Profiles

Advanced Analysis: Identifying Unknown Gene Functions Based on Expression Profiles

[Figure: three expression profile plots over 7 time points: Gene 1 expression profile with function A; Gene 2 expression profile with function B; unknown sequence tag (function ?)]

The unknown sequence has high correlation with the gene 1 expression profile

Sequence tag has function A

Is this alignment reliable?

Standard practice:

Basic assumption: Expression profiles of functionally related genes are correlated

Objectives: Confirm a specific biological hypothesis; predict functional properties of less characterized genes; or uncover new/unexpected biological knowledge

Methodology: Clustering genes based on similarity of their expression profiles, followed by functional analysis of the obtained clusters
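The correlation-based function transfer described on this slide can be sketched as follows. The profiles, gene names and the nearest-profile rule are illustrative assumptions, not the authors' data or exact procedure.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical annotated genes: name -> (function label, expression profile)
known = {
    "gene1": ("function A", [0.1, 0.3, 0.5, 0.4, 0.2, 0.0, -0.2]),
    "gene2": ("function B", [0.5, 0.3, 0.1, 0.0, 0.1, 0.3, 0.6]),
}
# Hypothetical profile of the uncharacterized sequence tag
unknown = [0.0, 0.2, 0.45, 0.35, 0.15, -0.1, -0.3]

# Transfer the function of the most correlated annotated gene
best_gene = max(known, key=lambda g: pearson(unknown, known[g][1]))
predicted_function = known[best_gene][0]
```

The prediction is only as good as the basic assumption: a single highly correlated neighbor says nothing about how consistently genes with that function are co-expressed.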

Page 13: Characterizing Gene Functional Expression Profiles

Problems with old approaches

Genes with the same function do not necessarily have the same expression profile

Clustering the expression profiles of all genes could be unreliable

Page 14: Characterizing Gene Functional Expression Profiles

Our Approach: Analyzing Microarray Functional Expression Profiles (FEP)

FEPs: Compute the FEP as the average profile of all genes associated with a given highly correlated GO term

Advanced Analysis: Identifying Unknown Gene Functions Based on Expression Profiles

[Figure: expression profiles plotted over roughly 50 time points for GO:0016311 : Dephosphorylation and GO:0004721 : phosphoprotein phosphatase activity]
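A minimal sketch of the FEP definition given on this slide, the average profile of all genes annotated with a GO term. The gene names, values and the `go_annotations` mapping are hypothetical.

```python
def functional_expression_profile(profiles):
    """Average equal-length expression profiles time point by time point."""
    n = len(profiles)
    return [sum(values) / n for values in zip(*profiles)]

# Hypothetical expression data: gene -> profile over three time points
expression = {
    "geneA": [1.0, 2.0, 0.0],
    "geneB": [3.0, 0.0, -2.0],
    "geneC": [0.5, 0.5, 0.5],
}
# Hypothetical annotation: GO term -> genes associated with it
go_annotations = {"GO:0016311": ["geneA", "geneB"]}

fep = {
    term: functional_expression_profile([expression[g] for g in genes])
    for term, genes in go_annotations.items()
}
```

Averaging is only meaningful when the member genes are correlated, which is why the slide restricts FEPs to highly correlated GO terms.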

Page 15: Characterizing Gene Functional Expression Profiles

Questions that we address:

• How to perform functional analysis in an objective manner

• How to estimate the biological significance of discoveries

Page 16: Characterizing Gene Functional Expression Profiles

Tools and Applications

Developed tools to:

• (1) Explore which functions have conserved expression profiles (Tool 1: functional expression profile ranking package)

• (2) Explore which functions have similar expression profiles and test their functional similarity (Tool 2: functional expression profile clustering package)

Applications:

• Functional characterization of gene expression related to the intraerythrocytic developmental cycle of Plasmodium falciparum, and of gene expression in Saccharomyces cerevisiae, Mus musculus and Homo sapiens

Page 17: Characterizing Gene Functional Expression Profiles

Tools Architecture

[Diagram components: Microarray raw data; Data pre-processing; Gene function annotation database; Functional expression profile ranking; List of significantly correlated GO terms; Functional expression profile clustering; Gene Function Semantic Distance Mapping Space; Clusters of functional expression profiles; Report]

Page 18: Characterizing Gene Functional Expression Profiles

Tool 1: Functional Expression Profile (FEP) Ranking Package

Objective:
• Identify genes with the same function that have correlated expression profiles

Task:
• Evaluate gene expression correlation within each FEP

Methodology:
• Step 1: calculate the average pairwise correlation coefficient S among the n gene expression profiles for a given function term
• Step 2: randomly select n genes from the whole dataset and compute their average pairwise correlation coefficient S'
• Step 3: repeat Step 2 m times (m > 10,000) and compare the distribution of S' to the original S to evaluate the p-value
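The three steps above can be sketched as a small permutation test. This is an illustration under stated assumptions rather than the package itself: Pearson correlation is assumed as the correlation coefficient, the p-value is estimated as (hits + 1) / (m + 1), and the toy data, gene names and the small m are chosen only to keep the example fast.

```python
import math
import random
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def avg_pairwise_corr(profiles):
    """Step 1: average correlation over all pairs of profiles."""
    pairs = list(combinations(profiles, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

def fep_p_value(term_genes, expression, m=1000, seed=0):
    """Steps 2 and 3: compare S with S' from m random gene sets of the same size."""
    rng = random.Random(seed)
    s = avg_pairwise_corr([expression[g] for g in term_genes])
    all_genes = list(expression)
    hits = 0
    for _ in range(m):
        sample = rng.sample(all_genes, len(term_genes))  # n genes, no repeats
        if avg_pairwise_corr([expression[g] for g in sample]) >= s:
            hits += 1
    return s, (hits + 1) / (m + 1)  # assumed estimator; never returns exactly 0

# Toy dataset: 40 unrelated genes plus three genes sharing one underlying pattern
data_rng = random.Random(1)
expression = {f"g{i}": [data_rng.gauss(0, 1) for _ in range(12)] for i in range(40)}
base = [math.sin(t / 2) for t in range(12)]
for name in ("a", "b", "c"):
    expression[name] = [v + data_rng.gauss(0, 0.1) for v in base]

s, p = fep_p_value(["a", "b", "c"], expression, m=500)
```

With m > 10,000 as on the slide, the smallest p-value this estimator can report is about 1e-4.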

Page 19: Characterizing Gene Functional Expression Profiles

Dataset 1: Plasmodium Falciparum Intraerythrocytic Developmental Cycle (Bozdech Z et al., (2003) PLoS Biol. Oct; 1(1))

Objective: Identification of P. falciparum genes whose RNA levels vary periodically within the asexual intraerythrocytic developmental cycle (IDC) transcriptome

Materials: 5080 ORFs, 3532 unique genes, 46 assays (sampled in time) using cDNAs

Methods: Permutation test with Fast Fourier Transform algorithm and correlations

Found: 60% of genes are transcriptionally active, and most genes are active only once during the IDC

Figure: Major morphological stages during the IDC and 2712 genes' transcriptional profiles

Page 20: Characterizing Gene Functional Expression Profiles

Dataset 2: Saccharomyces Cerevisiae Cell Cycle (Spellman et al., (1998) Molecular Biology of the Cell 9, 3273-3297)

Objective: Identification of yeast genes whose RNA levels vary periodically within the cell cycle

Materials: 6178 ORFs, 4450 unique genes, 77 assays (sampled in time) using cDNAs

Methods: Periodicity and correlation algorithm

Found: 800 genes that meet an objective minimum criterion for cell cycle regulation

Figure: The M/G1 clusters

Page 21: Characterizing Gene Functional Expression Profiles

Dataset 3: Homo Sapiens Cell Cycle (R. Cho et al. (2001) Nature, 27)

Objective: Identification of human genes whose RNA levels vary periodically within the cell cycle

Materials: 6800 ORFs, 5795 unique genes, 14 assays (sampled in time) using Affymetrix arrays

Methods: Fold change

Found: 700 genes that display transcriptional fluctuation with a periodicity consistent with that of the cell cycle

Figure: Clustering analysis of cell-cycle-regulated transcripts

Page 22: Characterizing Gene Functional Expression Profiles

Dataset 4: Mus Musculus Cell Cycle (Ishida, S et al (2001) Mol. Cell. Biol. 21, 4684-4699)

Objective: Analysis of gene regulation during the mammalian cell cycle

Materials: 6347 unique genes, 14 assays

Methods: Clustering

Found: 7 distinct clusters of genes that exhibit unique patterns of expression

Figure: Patterns of gene expression following growth stimulation and during the mammalian cell cycle

Page 23: Characterizing Gene Functional Expression Profiles

Applying FEP Ranking Package: Cumulative Distributions of GO Term p-Values of Human, Yeast, Mouse and P.F.

[Figure: cumulative distributions of GO term p-values. x-axis: X (p-value, log scale from 10^-4 to 10^0); y-axis: fraction of GO terms with p-value less than X; curves: Mouse, Human, Yeast, P.F.; p = 0.05 is marked]
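The curve on this slide is an empirical cumulative distribution: for each threshold X, the fraction of GO terms whose p-value is below X. A minimal sketch, with invented p-values:

```python
def fraction_below(p_values, x):
    """Fraction of GO terms whose p-value is strictly less than x."""
    return sum(p < x for p in p_values) / len(p_values)

# Hypothetical p-values, one per GO term
p_values = [0.0004, 0.003, 0.02, 0.04, 0.2, 0.6, 0.9, 0.95]

# Thresholds spaced on a log scale, plus the p = 0.05 cutoff
thresholds = [1e-4, 1e-3, 1e-2, 0.05, 1.0]
curve = [(x, fraction_below(p_values, x)) for x in thresholds]
```

Reading the curve at X = 0.05 gives the fraction of significantly correlated GO terms for an organism.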

Page 24: Characterizing Gene Functional Expression Profiles

Applying FEP Ranking Package: GO Terms with the Most Conserved FEP Among Multi-organisms

Page 25: Characterizing Gene Functional Expression Profiles

Applying FEP Ranking Package: Selection of GO Terms with Significantly Correlated Expression Patterns in Plasmodium Falciparum Developmental Cycle Data

[Figure: cumulative distribution of p-values for GO terms associated with at least two genes]

[Figure: example expression profiles for GO:0016311 : Dephosphorylation and GO:0007028 : cytoplasm organization and biosynthesis]

Selected:
• 46% of all function GO terms are significantly correlated
• 52% of all process GO terms are significantly correlated

Page 26: Characterizing Gene Functional Expression Profiles

Plasmodium Falciparum: Processes and Functions with the Highest/Lowest Correlation

Highest correlation

Functions | Biological Processes
acid phosphatase activity | zinc ion transport
calmodulin-dependent protein kinase I activity | terpene metabol
triose-phosphate isomerase activity | protein processing
guanylate kinase activity | DNA replication, synthesis of RNA primer
glutamate-cysteine ligase activity | cell invasion

Lowest correlation

Functions | Biological Processes
translation regulator activity | terpene biosynthesis
cell surface antigen activity, host-interacting | pigment biosynthesis
peptide binding | tetrahydrobiopterin metabolism
5,10-methylenetetrahydrofolate-dependent methyltransferase activity | coenzyme and prosthetic group biosynthesis
cyclophilin-type peptidyl-prolyl cis-trans isomerase activity | purine ribonucleoside biosynthesis

Page 27: Characterizing Gene Functional Expression Profiles

Plasmodium Falciparum: Findings by FEP Ranking Package

Of the 12 FEPs referenced by Bozdech et al., two have a p-value larger than 0.05.

• E.g. the average correlation coefficient among genes associated with the Ribonucleotide Synthesis function is only 0.258 (p-value = 0.11), which weakens the claim that it is related to the Ring stage of the IDC.

No linear relationship was found between the number of genes associated with a given GO term and the average correlation coefficient among these genes

Ranking GO terms by p-value could be useful for rapid identification of functions that are closely related to a specific developmental stage (of Plasmodium falciparum)

Page 28: Characterizing Gene Functional Expression Profiles

All Datasets: Findings by FEP Ranking Package

To some extent genes with identical functions have similar expression profiles

However, a large fraction of functions do not follow the underlying hypothesis!

Higher-level organisms seem to have a lower fraction of significantly correlated expression profiles for identical functions.

Fractions of correlated FEPs*:
• Saccharomyces cerevisiae: 59% (643/1,083)
• Plasmodium falciparum: 48.4% (428/884)
• Homo sapiens: 16.4% (249/1,514)
• Mus musculus: 13.3% (182/1,366)

*fractions are for both processes and functions

Page 29: Characterizing Gene Functional Expression Profiles

Tool 2: FEP Clustering Package

Objective:
• Identify genes with similar functions and similar expression profiles

Tasks:
• Cluster FEPs selected by the FEP ranking package
• Evaluate found clusters for biological relevance by
  • identifying similar functions based on the GO term hierarchy tree structure
  • evaluating inter-cluster GO term distance

Methodology:
• Randomly generate k sets, each containing the same number of GO terms as the corresponding cluster
• Calculate the total GO term distance within each generated set and sum the total distances of all sets to get the overall score S'
• Repeat the procedure 1000 times and compare the distribution of S' to the overall distance obtained through clustering
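The random-set procedure above can be sketched as follows. Two points are assumptions of this sketch, since the slide does not specify them: the random sets are drawn by shuffling the clustered terms, so they are disjoint, and "total distance" is taken as the sum over all term pairs within a set. The integer "terms" and the absolute-difference distance are stand-ins for GO terms and a GO tree distance.

```python
import random
from itertools import combinations

def total_within_distance(groups, dist):
    """Sum of pairwise distances inside each group, added over all groups."""
    return sum(dist(a, b) for g in groups for a, b in combinations(g, 2))

def random_set_scores(clusters, dist, repeats=1000, seed=0):
    """Overall scores S' for random groupings with the same sizes as the clusters."""
    rng = random.Random(seed)
    terms = [t for c in clusters for t in c]
    sizes = [len(c) for c in clusters]
    scores = []
    for _ in range(repeats):
        shuffled = terms[:]
        rng.shuffle(shuffled)
        groups, start = [], 0
        for size in sizes:                      # cut the shuffle into k sets
            groups.append(shuffled[start:start + size])
            start += size
        scores.append(total_within_distance(groups, dist))
    return scores

# Toy example: integers as terms, absolute difference as distance
clusters = [[1, 2, 3], [10, 11, 12]]
dist = lambda a, b: abs(a - b)

observed = total_within_distance(clusters, dist)
scores = random_set_scores(clusters, dist, repeats=200)
```

If the found clusters group functionally similar terms, `observed` should sit well below the bulk of `scores`.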

Page 30: Characterizing Gene Functional Expression Profiles

Structure of GO Term Tree (Example)

Level 1: GO:0008150 : Biological Process

Level 2: GO:0007275 : development; GO:0007582 : physiological process

Level 3: GO:0007389 : pattern specification; GO:0000003 : reproduction; GO:0008152 : metabolism

Level 4: GO:0009798 : axis specification

Level 5: GO:0009948 : anterior/posterior axis specification

Measuring Distance of GO Terms

$|X, Y| \,/\, h^{*}(X, Y)$

$|X, Y|$ -- length of the minimal chain between the X and Y terms in the GO tree

$h^{*}(X, Y)$ -- length of the maximal chain from the top to the bottom
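A small sketch of a tree-based distance of this kind, using terms from the example tree. Several things here are assumptions: the parent links are inferred from the level layout of the slide, chain length is counted in edges, and the normalizer is taken to be the depth of the whole example tree rather than anything specific to the pair of terms.

```python
# Child -> parent links, inferred from the example tree (an assumption)
PARENT = {
    "GO:0007275": "GO:0008150",  # development -> biological process
    "GO:0007582": "GO:0008150",  # physiological process -> biological process
    "GO:0007389": "GO:0007275",  # pattern specification -> development
    "GO:0008152": "GO:0007582",  # metabolism -> physiological process
    "GO:0009798": "GO:0007389",  # axis specification -> pattern specification
    "GO:0009948": "GO:0009798",  # anterior/posterior axis spec. -> axis spec.
}

def path_to_root(term):
    """List of terms from `term` up to the root, inclusive."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def chain_length(x, y):
    """Number of edges on the shortest chain between x and y in the tree."""
    steps_from_y = {t: i for i, t in enumerate(path_to_root(y))}
    for steps_from_x, t in enumerate(path_to_root(x)):
        if t in steps_from_y:                   # first common ancestor
            return steps_from_x + steps_from_y[t]
    raise ValueError("terms do not share a root")

# Longest top-to-bottom chain in this example tree, in edges
MAX_CHAIN = max(len(path_to_root(t)) - 1 for t in PARENT)

def go_distance(x, y):
    """Minimal chain length divided by the maximal top-to-bottom chain length."""
    return chain_length(x, y) / MAX_CHAIN
```

For example, `go_distance("GO:0009948", "GO:0008152")` walks four edges up to the root and two edges down, so the chain length is 6 and the normalized distance is 1.5. The real Gene Ontology is a directed acyclic graph in which a term can have several parents, which this tree-only sketch ignores.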

Page 31: Characterizing Gene Functional Expression Profiles

Determination of Number of Clusters

Measured by z-score:

$$z\text{-score} = \frac{\text{original sum of distance} - \text{mean}(\text{permuted clusters})}{\text{standard deviation}(\text{permuted clusters})}$$

A larger z-score indicates a better grouping of functions within clusters.
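A z-score of this kind can be computed directly from the observed sum of distances and the sums obtained for permuted clusters. The numbers below are invented, and the use of the sample standard deviation is an assumption. Note that when the found clusters are tighter than the random ones, original minus mean is negative, so the positive z-scores reported in these slides presumably refer to its magnitude; that reading is also an assumption.

```python
from statistics import mean, stdev

def cluster_z_score(original_sum, permuted_sums):
    """(original - mean of permuted) / sample standard deviation of permuted."""
    return (original_sum - mean(permuted_sums)) / stdev(permuted_sums)

# Hypothetical values: found clusters vs. three permuted groupings
z = cluster_z_score(8.0, [10.0, 12.0, 14.0])
separation = abs(z)  # magnitude of the separation from random groupings
```

Repeating this for different numbers of clusters k and picking the k with the largest separation is one way to read the "Determination of Number of Clusters" slide.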

Page 32: Characterizing Gene Functional Expression Profiles

Number of Clusters vs Z-score: Results for Plasmodium Falciparum

[Figure: two panels plotting z-score (y-axis) against number of clusters (x-axis, 2 to 20): Plasmodium falciparum biological processes; Plasmodium falciparum molecular functions]

Page 33: Characterizing Gene Functional Expression Profiles

Applying FEP Clustering Package: Results on Plasmodium Falciparum Processes

Cluster vs stage of IDC

k-means clustering profiles of FEPs for 238 identified processes

Cluster index | Number of FEPs | Corresponding stage
1 | 78 | Trophozoite
2 | 80 | Schizont
3 | 50 | Ring
4 | 20 | Schizont-Early Ring

[Figure: FEP profiles for clusters 1 to 4]

Page 34: Characterizing Gene Functional Expression Profiles

Applying FEP Clustering Package: Results on Plasmodium Falciparum Functions

Cluster vs stage of IDC

k-means clustering profiles of FEPs for 199 identified molecular functions

Cluster | Number of FEPs | Corresponding stage
1 | 48 | Trophozoite
2 | 63 | Schizont
3 | 53 | Ring
4 | 35 | Schizont-Early Ring

[Figure: FEP profiles over the time course for clusters 1 to 4]

Page 35: Characterizing Gene Functional Expression Profiles

GO Trees of Functions: 4 Clusters of Plasmodium Falciparum

Page 36: Characterizing Gene Functional Expression Profiles

Statistical Evaluation: Found vs. Random Clusters for P. Falciparum

[Figure: histograms of distance for random clusters (x-axis scaled by 10^5), with the found clusters marked; panels: Molecular Functions, Biological Processes]

• Larger distance from found clusters to random clusters for biological processes
• Random clusters for biological processes have smaller variance

Page 37: Characterizing Gene Functional Expression Profiles

Statistical Evaluation: Clustering All GO Terms for P. Falciparum

Clustering all GO terms leads to a smaller z-score, which indicates lower-quality clusters

The figure on the right is the P.F. functional clustering result: the z-score is 8.5, compared to 12 when clustering correlated GO terms only

[Figure: histogram of distances for random clusters, with the found clusters marked]

Page 38: Characterizing Gene Functional Expression Profiles

Statistical Evaluation: Found vs. Random Clusters for S. Cerevisiae and Homo Sapiens

[Figure: histograms of distance for random clusters (x-axis scaled by 10^5), with the found clusters marked; panels: Yeast processes, Yeast functions, Human processes, Human functions]

Page 39: Characterizing Gene Functional Expression Profiles

Remarks

Statistical significance of identified clusters (separation between clusters and random groupings) is increased by

• Normalizing data (Plasmodium falciparum)

• Eliminating noise through singular value decomposition (SVD)

• Reducing data through Principal Component Analysis

(×10^5) | Function clusters distance | Process clusters distance
Normalized | 1.316 | 1.089
Without normalization | 1.368 | 1.213

Page 40: Characterizing Gene Functional Expression Profiles

Conclusions

The proposed microarray tools help identify
• genes with the same function and correlated expression profiles
• genes with similar functions that have similar expression profiles

Measuring GO-tree-based distance was useful for evaluating the biological relevance of clusters; however,
• many GO terms have only 1 associated gene
• many genes do not even have a GO term
• parenthood and sibling relationships in GO trees should be differentiated, with a smaller penalty for sibling relationships than for parenthood

More robust clustering methods could be used

Page 41: Characterizing Gene Functional Expression Profiles

Thank You!

More information: www.ist.temple.edu/research/biocore.html

Contact: Zoran Obradovic, Director, IST Center, Temple University, 215 204-6265