
Upload: denis-newton

Post on 16-Jan-2016

Page 1

Getting the story – biological model based on microarray data

• Once the differentially expressed genes are identified (sometimes hundreds of them), we need to figure out what it all means

• Because the function of most genes is still poorly understood, this is not easy

• It is further complicated by the fact that gene function is context-specific: it depends on the tissue, the developmental stage of the organism and multiple other factors

• "Functional clustering": grouping genes with respect to their known function (ontology)

• Establishing the statistical significance of the association between groups of genes identified in the analysis and "Functional clusters"

Page 2

Analyzing Microarray Data

Experimental Design

[Figure: experimental design diagram. Labels: Universal Control; Treated and Not Treated for each of C 1, C 2, C 3 and C 4]

Data Normalization – reducing technical variability

[Figure: scatter plot titled "expn=70 C4-3 Vs AB4"; vertical axis l2r (-3 to 2), horizontal axis l2a (4 to 16)]

Statistical Analysis (ANOVA):

• Identifying differentially expressed genes

• Factoring out variability sources

Data Mining

[Figure: two panels of expression level (-4 to 4) versus time (0 to 150), with cell-cycle phases G1, S, G2, M marked twice along the time axis. Panel titles: "CLUSTER 1 ; 39 N 17" and "CLUSTER 1 ; 39 N 38"]

Page 3

Data Integration and Interpretation

Page 4

Gene Ontology (GO)

http://www.geneontology.org/

The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases. The GO collaborators are developing three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner.

Page 5

Molecular Function

Biochemical activity or action of the gene product.

MF describes a capability that the gene product has, without reference to where or when this activity actually occurs.

Examples: enzyme, transporter, ligand

cytochrome c: electron transporter activity

Page 6

Biological process

A biological objective to which the gene product contributes.

A biological process is accomplished via one or more ordered assemblies of molecular functions.

There is generally some temporal aspect to the process and it will often involve the transformation of some physical thing.

Examples: cell growth and maintenance

cytochrome c: oxidative phosphorylation, induction of cell death

Page 7

Cellular Component

A component of a cell that is part of some larger object or structure.

Examples: chromosome, nucleus, ribosome

cytochrome c: mitochondrial matrix, mitochondrial inner membrane

Page 8

First step of making a story: Statistical significance of a particular "Functional cluster"

• Suppose we have analyzed a total of N genes, n of which turned out to be differentially expressed or co-expressed (experimentally identified; call them significant)

• Suppose that x out of the n significant genes and y out of the N total genes were classified into a specific "Functional group"

• Q1: Is this "Functional group" significantly correlated with our group of significant genes?

• Q2: Are significant genes overrepresented in this functional group when compared to their overall frequency among all analyzed genes?

• Q3: What is the chance of getting x or more significant genes if we randomly draw y out of N genes "out of a hat", assuming that each gene remaining in the hat has an equal chance of being drawn?

• H0: p(significant gene belonging to this category) = y/N

• Q3A: What is the p-value for rejecting this null hypothesis?
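The "out of a hat" question in Q3 can be approximated by literally repeating the draw many times. The following is a minimal simulation sketch, not part of the original lecture; the function name `simulate_hat_draw` and the toy values of N, n, y and x are made up for illustration.

```python
import random

def simulate_hat_draw(N, n, y, x, trials=20000, seed=0):
    """Estimate P(at least x significant genes among y genes drawn at random).

    Genes 0..n-1 play the role of the n significant genes; the remaining
    N-n genes are not significant. Each trial draws y genes without
    replacement, which is the "out of a hat" scheme of Q3.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        drawn = rng.sample(range(N), y)
        significant_drawn = sum(1 for g in drawn if g < n)
        if significant_drawn >= x:
            hits += 1
    return hits / trials

# Toy numbers, chosen only for illustration (not from the lecture data)
N, n, y, x = 200, 40, 20, 8
estimate = simulate_hat_draw(N, n, y, x)
print(f"significant genes expected in the group under H0: {n * y / N:.1f}")
print(f"estimated chance of {x} or more: {estimate:.4f}")
```

A simulation like this only approximates the probability; the exact value follows from counting outcomes, which is what Fisher's exact test does.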

Page 9

Strategy for finding "Statistically Significant" GO categories:

• Identify all categories that contain at least 5 genes from the microarray (about 1800 in our case)

• Perform a Fisher's exact test for each category to test for statistically significant over-representation of differentially expressed genes

• Adjust the individual Fisher's p-values for the fact that we are testing 1800 hypotheses by calculating FDRs

• Repeat this for different levels of the statistical significance used to select differentially expressed genes (FDR < 0.01, 0.05, 0.1, 0.2) and observe the statistical significance of the two most significant GO categories

Fisher's tests (http://eh3.uc.edu/teaching/cfg/2006/R/NickelFunctionalClusteringClean.R)
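The slide does not say how the FDRs are calculated. As a minimal sketch, assuming the Benjamini-Hochberg procedure (an assumption; the linked R script may use a different method), the adjustment across categories could look like this. The category names and p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (FDRs), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each FDR is capped by the ones above it
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-category Fisher p-values (made up for illustration)
category_p = {"cat_A": 0.001, "cat_B": 0.008, "cat_C": 0.039,
              "cat_D": 0.041, "cat_E": 0.60}
fdr = dict(zip(category_p, benjamini_hochberg(list(category_p.values()))))
for name, q in sorted(fdr.items(), key=lambda kv: kv[1]):
    print(f"{name}: FDR = {q:.4f}")
```

In this toy run cat_C and cat_D end up with the same adjusted value because of the capping step, which is one way two categories can share an identical FDR.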

Page 10

Statistically Significant GO Categories

Top 2 GO Categories for genes with FDR < 0.01

GO Term 1
FDR for the category = 0.0442416
GOID = GO:0006936
Term = muscle contraction
Definition = A process leading to shortening and/or development of tension in muscle tissue. Muscle contraction occurs by a sliding filament mechanism whereby actin filaments slide inward among the myosin filaments.
Ontology = BP

Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]    3   12
[2,]   33 9268

GO Term 2
FDR for the category = 0.1769315
GOID = GO:0006937
Term = regulation of muscle contraction
Definition = Any process that modulates the frequency, rate or extent of muscle contraction.
Ontology = BP

Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]    2   13
[2,]   11 9290
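To connect the muscle contraction matrix to the N, n, y and x notation, the sketch below unpacks the four cells. The output does not label its rows and columns, so the layout used here (rows: differentially expressed yes/no; columns: in the category yes/no) is an assumption, chosen because both matrices on this slide then share the same first-row total.

```python
# 2x2 table copied from the muscle contraction (GO:0006936) output
table = [[3, 12],
         [33, 9268]]

# ASSUMED layout (not stated in the output): rows = differentially expressed
# yes/no, columns = in the GO category yes/no.
x = table[0][0]                      # significant genes in the category
n = table[0][0] + table[0][1]        # all significant genes
y = table[0][0] + table[1][0]        # all genes in the category
N = sum(sum(row) for row in table)   # all genes analyzed

observed_fraction = x / n            # share of significant genes in the category
background_fraction = y / N          # share of all genes in the category (y/N under H0)
fold_enrichment = observed_fraction / background_fraction

print(f"N={N}, n={n}, y={y}, x={x}")
print(f"observed {observed_fraction:.3f} vs background {background_fraction:.4f}")
print(f"fold enrichment: {fold_enrichment:.1f}")
```

Comparing the observed fraction with the background fraction answers Q2 descriptively; whether the difference is statistically significant is a separate question handled by the test itself.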

Page 11

Statistically Significant GO Categories

Top 2 GO Categories for genes with FDR < 0.05

GO Term 1
FDR for the category = 0.006130206
GOID = GO:0005576
Term = extracellular region
Synonym = extracellular
Definition = The space external to the outermost structure of a cell. For cells without external protective or external encapsulating structures this refers to space outside of the plasma membrane. This term covers the host cell environment outside an intracellular parasite.
Ontology = CC

Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]  160  544
[2,] 1381 7231

GO Term 2
FDR for the category = 0.006130206
GOID = GO:0005615
Term = extracellular space
Synonym = intercellular space
Definition = That part of a multicellular organism outside the cells proper, usually taken to be outside the plasma membranes, and occupied by fluid.
Ontology = CC

Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]  149  555
[2,] 1266 7346

Page 12

Statistically Significant GO Categories

Top 2 GO Categories for genes with FDR < 0.1

GO Term 1
FDR for the category = 0.1196382
GOID = GO:0001568
Term = blood vessel development
Definition = Processes aimed at the progression of the blood vessel over time, from its formation to the mature structure. The blood vessel is the vasculature carrying blood.
Ontology = BP

Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]   25 1731
[2,]   40 7520

GO Term 2
FDR for the category = 0.1196382
GOID = GO:0048514
Term = blood vessel morphogenesis
Definition = Processes by which the anatomical structures of blood vessels are generated and organized. Morphogenesis pertains to the creation of form. The blood vessel is the vasculature carrying blood.
Ontology = BP

Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]   23 1733
[2,]   34 7526

Page 13

Statistically Significant GO Categories

Top 2 GO Categories for genes with FDR < 0.2

GO Term 1
FDR for the category = 0.1717101
GOID = GO:0001568
Term = blood vessel development
Definition = Processes aimed at the progression of the blood vessel over time, from its formation to the mature structure. The blood vessel is the vasculature carrying blood.
Ontology = BP

Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]   37 3193
[2,]   28 6058

GO Term 2
FDR for the category = 0.1717101
GOID = GO:0048514
Term = blood vessel morphogenesis
Definition = Processes by which the anatomical structures of blood vessels are generated and organized. Morphogenesis pertains to the creation of form. The blood vessel is the vasculature carrying blood.
Ontology = BP

Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]   33 3197
[2,]   24 6062

Page 14

Statistical significance of a particular "Functional cluster" - cont

[Figure: two rows of gene boxes g1 ... gN, labeled "Observed" and "Removing Functional Classification"; significant genes are drawn as red boxes and the y drawn genes have blue borders]

Q: By randomly drawing y boxes to color their border blue, what is the chance to draw x or more red ones?

Outcome (o_1, ..., o_T): A set of y genes selected from the list of N genes
Event of interest (E): Set of all outcomes for which the number of red boxes among the y boxes drawn is equal to x

Since drawing is random, all outcomes are equally probable:

\[ P(o_1) = P(o_2) = \dots = P(o_T) \]

\[ \sum_{t=1}^{T} P(o_t) = 1 \quad\Rightarrow\quad P(o_t) = \frac{1}{T} \quad \text{for all } t = 1, \dots, T \]

Page 15

Statistical significance of a particular "Functional cluster" - cont

Outcome (o_1, ..., o_T): A set of y genes selected from the list of N genes
Event of interest (E): Set of all outcomes for which the number of red boxes among the y boxes drawn is equal to x

\[ E = \{ o_{E_1}, \dots, o_{E_M} \} \]

\[ P(E) = \sum_{m=1}^{M} P(o_{E_m}) = \sum_{m=1}^{M} \frac{1}{T} = \frac{M}{T} \]

All we have to do is calculate M and T, where:

T = number of different sets of y genes that can be drawn out of the total of N genes

\[ T = \binom{N}{y} = \frac{N (N-1)(N-2) \cdots (N-y+1)}{1 \cdot 2 \cdots y} \]

The division by 1*2*...*y comes from the fact that the order in which we pick genes does not matter.

M = number of different ways to obtain x red boxes (significant genes) when drawing y boxes (genes) out of the total of N boxes (genes), n of which are red (significant)

\[ M = \binom{n}{x} \binom{N-n}{y-x} \]

First pick x red boxes. For each such set of x red boxes, pick a set of y-x non-red boxes.
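The counting argument for T and M can be checked by brute force on a small case. The sketch below is only an illustration, with toy sizes picked so that every possible draw can be listed; it counts outcomes directly and compares the counts with the binomial-coefficient formulas.

```python
from itertools import combinations
from math import comb

# Toy sizes, chosen only so that full enumeration is cheap
N, n, y, x = 8, 3, 4, 2
red = set(range(n))  # genes 0..n-1 are the red (significant) boxes

outcomes = list(combinations(range(N), y))  # every possible set of y genes
event = [o for o in outcomes if len(red & set(o)) == x]  # exactly x red boxes

T_counted, M_counted = len(outcomes), len(event)
T_formula = comb(N, y)
M_formula = comb(n, x) * comb(N - n, y - x)

print(f"T: counted {T_counted}, formula {T_formula}")
print(f"M: counted {M_counted}, formula {M_formula}")
print(f"P(E) = M/T = {M_counted / T_counted:.4f}")
```

Enumeration is feasible only for tiny N; for array-sized gene lists the formulas are the practical route.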

Page 16

Statistical significance of a particular "Functional cluster" - p-value

Fisher's exact test or the "hypergeometric" test

\[ p\text{-value} = p(x) + p(x+1) + \dots + p(\min(y, n)) = \sum_{r=x}^{\min(y, n)} \frac{\binom{n}{r} \binom{N-n}{y-r}}{\binom{N}{y}} \]

P-value: Probability of observing x or more significant genes under the null hypothesis
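The tail sum translates almost directly into code. This is a minimal sketch using exact integer arithmetic from the Python standard library; the function name `enrichment_p_value` is made up here, and the example numbers are read from the muscle contraction matrix (GO:0006936) under the assumption that its rows are significant yes/no and its columns are in-category yes/no.

```python
from math import comb

def enrichment_p_value(N, n, y, x):
    """P(X >= x) when y genes are drawn at random from N, n of them significant."""
    total = comb(N, y)   # T: number of ways to draw y genes out of N
    upper = min(y, n)    # cannot draw more significant genes than exist or than drawn
    tail = sum(comb(n, r) * comb(N - n, y - r) for r in range(x, upper + 1))
    return tail / total

# ASSUMED reading of the 2x2 matrix [[3, 12], [33, 9268]]:
# N = 9316 genes, n = 15 significant, y = 36 in the category, x = 3 in both
p = enrichment_p_value(N=9316, n=15, y=36, x=3)
print(f"one-sided p-value: {p:.3g}")
```

This is the unadjusted, one-sided p-value for a single category; it still has to be corrected for the number of categories tested.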

Page 17

381 genes that were differentially expressed after treating a cell line with three different carcinogens: Dex, E2 and Irradiation

[Figure: columns labeled Dex_Day1, Dex_Day2, Dex_Day3, E2_Day4, E2_Day7, E2_Day10, Irr_Day1, Irr_Day2, Irr_Day3]

Page 18

[Figure: panel labeled "Up"]

Page 19

Finding important functional groups for up-regulated genes

Using the "Ease" annotation tool http://david.niaid.nih.gov/david/

We obtained the following significant gene ontologies: Up_DexANDNE2ANDirr_381_GO.htm

Homework:
1) Download and install Ease
2) Select the top 20 most significantly up-regulated genes in our W-C dataset and identify significantly over-represented categories (using the three-way ANOVA analysis)
3) Repeat the analysis with 30, 40, 50 and 100 up-regulated and down-regulated genes
4) Prepare questions for the next class regarding problems you run into

Page 20

Regulating Transcription – a transcription factor itself does not need to be transcriptionally regulated

Page 21

Modeling Microarray Data

[Figure: twelve panels of expression level (-2 to 2) versus time (0 to 150), with cell-cycle phases G1, S, G2, M marked twice along the time axis. Panel titles: CLUSTER 1 ; N 14, CLUSTER 2 ; N 3, CLUSTER 3 ; N 86, CLUSTER 4 ; N 37, CLUSTER 5 ; N 12, CLUSTER 6 ; N 23, CLUSTER 7 ; N 19, CLUSTER 8 ; N 22, CLUSTER 9 ; N 31, CLUSTER 10 ; N 9, CLUSTER 11 ; N 17, CLUSTER 12 ; N 29]

Mathematical/Statistical Models

Computer Algorithms/Software