More Analysis of Gene Expression Data Brent D. Foy, Ph.D. Wright State University


Post on 22-Feb-2016


Page 1: More Analysis of Gene Expression Data

More Analysis of Gene Expression Data

Brent D. Foy, Ph.D.
Wright State University

Page 2: More Analysis of Gene Expression Data

Overview

• Types of Data Sets
• Data Analysis
  – Clustering
    • Hierarchical
    • Self-Organizing Maps
    • Principal Components Analysis
  – Statistical Hypothesis Testing (ANOVA)

Page 3: More Analysis of Gene Expression Data

Types of Data – 1D, 2 Conditions

• Many genes
• 2 conditions
• A few replicates per condition

Gene   Condition 1             Condition 2
       Rep 1   Rep 2   Rep 3   Rep 1   Rep 2   Rep 3
A      150     160     150     180     190     180
B      50      40      45      50      45      40
C      800     760     680     400     450     425

Page 4: More Analysis of Gene Expression Data

Types of Data – 1D, 2 Conditions (cont)

• Conditions can be control vs treated, different cell types, different time points, etc.

• Typical Question – Which genes’ expression levels change due to condition?
  – T-test, Mann-Whitney, Comparison Analysis
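As a rough illustration of the t-test option, the sketch below computes Student's two-sample t statistic for genes A, B, and C from the two-condition example table. It assumes equal variances in the two conditions and uses only the Python standard library, so it stops at the t statistic; turning it into a p-value needs a t distribution (df = 4 here), for which the two-sided 5% critical value is about 2.776. The function name `pooled_t` is made up for this example.

```python
import math
from statistics import mean, variance

# Replicate values from the two-condition example table (genes A, B, C).
data = {
    "A": ([150, 160, 150], [180, 190, 180]),
    "B": ([50, 40, 45], [50, 45, 40]),
    "C": ([800, 760, 680], [400, 450, 425]),
}

def pooled_t(x, y):
    """Student's two-sample t statistic (equal variances assumed)."""
    nx, ny = len(x), len(y)
    # Pooled sample variance across both conditions.
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(y) - mean(x)) / math.sqrt(sp2 * (1 / nx + 1 / ny))

t_stats = {gene: pooled_t(c1, c2) for gene, (c1, c2) in data.items()}
for gene, t in t_stats.items():
    # df = 3 + 3 - 2 = 4 for every gene in this table.
    print(f"gene {gene}: t = {t:.2f} (df = 4)")
```

A large absolute t (beyond roughly 2.776 for df = 4) suggests the gene's expression differs between the two conditions; a value near zero suggests it does not.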

Page 5: More Analysis of Gene Expression Data

Types of Data – 1D, Multiple Conditions

• Many genes
• Multiple conditions
• A few replicates per condition

Page 6: More Analysis of Gene Expression Data

Types of Data – 1D, Multiple Conditions (cont)

Gene   Condition 1             Condition 2             Condition 3             …
       Rep 1   Rep 2   Rep 3   Rep 1   Rep 2   Rep 3   Rep 1   Rep 2   Rep 3
A      150     160     150     180     190     180     150     155     135
B      50      40      45      50      45      40      80      90      105
C      800     760     680     400     450     425     200     220     400

Page 7: More Analysis of Gene Expression Data

Types of Data – 1D, Multiple Conditions (cont)

• Again, conditions can be treatments or chemicals, cell types, time points, etc.

• Typical question – Which genes’ expression levels change due to one or more conditions?
  – 1-way ANOVA, Kruskal-Wallis

Page 8: More Analysis of Gene Expression Data

Types of Data – 1D, Multiple Conditions (cont)

• Typical question – Which genes’ expression levels behave similarly for all the conditions?
  – Self-Organizing Maps, Hierarchical Clustering, Principal Components Analysis
• Typical question – Which conditions show similar expression levels among genes? (Toxicogenomic Fingerprint)
  – Hierarchical Clustering, Principal Components Analysis, (Self-Organizing Maps)

Page 9: More Analysis of Gene Expression Data

Types of Data – 2D, Multiple x Multiple Conditions

• Many genes
• 2 Factors, multiple conditions per factor

– For example, Factor 1 could be dose of a chemical, and Factor 2 could be time point after dosing

• Multiple replicates per condition

Page 10: More Analysis of Gene Expression Data

Types of Data – 2D, Multiple x Multiple Conditions (cont)

Gene   Dose 1                          Dose 2                          …
       Time 1          Time 2          Time 1          Time 2          …
       Rep 1   Rep 2   Rep 1   Rep 2   Rep 1   Rep 2   Rep 1   Rep 2
A      150     160     150     180     190     180     150     155
B      50      40      45      50      45      40      80      90
C      800     760     680     400     450     425     200     220

Page 11: More Analysis of Gene Expression Data

Types of Data – 2D, Multiple x Multiple Conditions (cont)

• Typical Question – Which genes’ expression levels change due to time? Due to dose? Due to an interaction between the two?
  – 2-way ANOVA

• Or, eliminate one of the dimensions and ask the same questions as before
  – At time 1, which doses show similar expression levels among genes?

Page 12: More Analysis of Gene Expression Data

Typical Applications of Clustering Algorithms

[Figure: scatter plot of chem1 to chem6, Gene A (x-axis) vs. Gene B (y-axis)]

Many samples/cell lines/chemicals, many genes

Number of axes can be very large here

[Figure: scatter plot of chem1 to chem6, Principal component 1 (x-axis) vs. Principal component 2 (y-axis)]

Many samples/cell lines/chemicals, principal components of genes

Page 13: More Analysis of Gene Expression Data

Typical Applications of Clustering Algorithms

Many genes, multiple time points. (Different letters represent different genes.)

[Figure: scatter plot of genes A to F, T1 (x-axis) vs. T2 (y-axis)]

Number of dimensions (time points) can be greater than 2

Many genes, multiple doses

[Figure: scatter plot of genes A to F, Dose 1 (x-axis) vs. Dose 2 (y-axis)]

Reasons to cluster genes of similar behavior together?

Page 14: More Analysis of Gene Expression Data

Hierarchical Clustering

• Focus on 1D, multiple conditions type of data

• Here, group cell types according to similar gene response

Page 15: More Analysis of Gene Expression Data

Hierarchical Clustering (cont)

Construct pairwise groupings of data elements based on similarity. Similarity is typically defined by the separation (distance) between data elements in n-dimensional space.

[Figure: tree diagram with leaves in the order Chem 2, Chem 3, Chem 1, Chem 6, Chem 4, Chem 5]

Generation    3   2   1   0
# clusters    6   3   2   1

[Figure: scatter plot of chem1 to chem6, Gene A (x-axis) vs. Gene B (y-axis)]

Page 16: More Analysis of Gene Expression Data

[Figure: scatter plot of genes A to F, Expression at T1 = 1 h (x-axis) vs. Expression at T2 = 4 h (y-axis)]

Hierarchical clustering chooses pairwise groupings based on the distances between pairs of points.

Once the two closest points are found, they are grouped together, and a new point is placed at the average location of the two original points.
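The merge rule described here can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any particular package: the gene coordinates are hypothetical (the figure does not give exact values), distance is assumed to be Euclidean, and the name `agglomerate` is made up for the example.

```python
import math
from itertools import combinations

# Hypothetical (expression at T1, expression at T2) values for six genes;
# the slide's figure does not give exact coordinates.
points = {
    "A": (1.0, 5.0), "F": (1.5, 5.2),
    "B": (3.0, 3.0), "C": (3.4, 2.7),
    "D": (5.0, 1.0), "E": (5.6, 1.6),
}

def agglomerate(points):
    """Repeatedly merge the two closest points, replacing them with a new
    point at their average location. Returns the merges in order."""
    active = dict(points)  # cluster label -> current location
    merges = []
    while len(active) > 1:
        # Find the closest pair by Euclidean distance.
        a, b = min(combinations(active, 2),
                   key=lambda pair: math.dist(active[pair[0]], active[pair[1]]))
        pa, pb = active.pop(a), active.pop(b)
        # The new point sits at the average of the two merged locations.
        active[f"({a},{b})"] = tuple((u + v) / 2 for u, v in zip(pa, pb))
        merges.append((a, b, round(math.dist(pa, pb), 3)))
    return merges

merges = agglomerate(points)
for a, b, d in merges:
    print(f"merge {a} + {b} at distance {d}")
```

The order of the merges, read from first to last, gives the tree from the leaves up to the single root cluster.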

Page 17: More Analysis of Gene Expression Data

Hierarchical clustering

Advantages
• Computationally efficient
• Produces tree-like structure

Disadvantage
• Clusters are not optimal. Once branches split, it’s permanent. There is no way to reevaluate whether it was the best division based on the whole data set.

Page 18: More Analysis of Gene Expression Data

Principal Component Analysis

- Each data point is a single condition
- Each axis is a linear combination of hundreds or thousands of gene expression levels

Page 19: More Analysis of Gene Expression Data

Principal Component Analysis

• Reduces the dimensionality of the data set
  – Thousands of genes are combined in a few linear combinations to make 2 or 3 Principal Components (PCs). This goes from thousands of axes, with each axis representing the expression level for a gene, to 2 or 3 axes.

• These few PCs may capture most of the variability of the original data set

• Hope is that the first few PCs extract or expose the cluster structure of the original data set
  – i.e., another clustering algorithm is still needed after PCA
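To make the "linear combination of genes" idea concrete, here is a small standard-library sketch that finds the first principal component by power iteration on the gene covariance matrix. The expression matrix is hypothetical (five conditions, three genes), and power iteration is just one simple way to get the leading eigenvector; statistics packages typically use a full eigen or singular value decomposition instead.

```python
import math

# Hypothetical expression matrix: rows are conditions, columns are genes.
X = [
    [2.0, 15.0, 1.0],
    [4.0, 31.0, 1.2],
    [5.0, 38.0, 0.9],
    [7.0, 55.0, 1.1],
    [9.0, 70.0, 1.0],
]

def first_principal_component(rows, iterations=200):
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the genes (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance along PC1.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    explained = eigenvalue / sum(cov[i][i] for i in range(d))
    # Each condition's coordinate on the new PC1 axis.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, explained, scores

pc1, explained, scores = first_principal_component(X)
print("PC1 weights per gene:", [round(x, 3) for x in pc1])
print("fraction of variance captured:", round(explained, 4))
print("PC1 score per condition:", [round(s, 2) for s in scores])
```

The weights are the coefficients of the linear combination of genes, and the scores are what would be plotted on the "Principal component 1" axis for each condition.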

Page 20: More Analysis of Gene Expression Data

Principal Component Analysis – A Simple Example

[Figure: scatter plot of Gene A expression (x-axis, 0 to 10) vs. Gene B expression (y-axis, 0 to 80), with PC1 labeled]

Page 21: More Analysis of Gene Expression Data

Self Organizing Maps

• Partition data into specified number of groupings.

• Iterative procedure, so seeks to produce optimal clusters.

• K-means clustering is a specific form of the self-organizing map

Page 22: More Analysis of Gene Expression Data

Self Organizing Maps - General Procedure

Consider n data points in d-dimensional space. In the hypothetical data set, there are 6 data points (gene expression levels) in 2-dimensional space (2 time points). Say you want k = 3 clusters.

1. Select k of your data points to each be the original center of a cluster

2. Place the next data point in the nearest cluster

3. Compute the new location of the cluster center

4. Repeat the previous 2 steps for each data point

5. After all data is placed in a cluster, use final cluster centers as starting point for another iteration beginning at step 2.
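The five steps above can be sketched as follows. This is a simplified illustration under several assumptions: the gene coordinates are made up, "nearest" means Euclidean distance, the center update in step 3 is a weighted average of the old center and the new point with a hypothetical `weight` of 0.5, and the loop stops when the cluster assignments no longer change between passes. As written, the procedure is closer to online k-means than to a full self-organizing map with a neighborhood grid.

```python
import math

# Hypothetical (T1, T2) expression values for six genes; not from the slides.
genes = {
    "A": (1.0, 5.0), "B": (3.0, 3.0), "C": (3.4, 2.7),
    "D": (5.0, 1.0), "E": (5.6, 1.6), "F": (1.5, 5.2),
}

def som_like_clustering(points, seeds, weight=0.5, max_passes=20):
    centers = [points[s] for s in seeds]          # step 1: k data points as centers
    previous = None
    clusters = []
    for _ in range(max_passes):                   # step 5: another pass from step 2
        clusters = [[] for _ in centers]
        for name, p in points.items():            # step 4: repeat for each point
            # Step 2: place the point in the nearest cluster.
            k = min(range(len(centers)),
                    key=lambda c: math.dist(p, centers[c]))
            clusters[k].append(name)
            # Step 3: move that cluster's center toward the new point.
            centers[k] = tuple((1 - weight) * c + weight * x
                               for c, x in zip(centers[k], p))
        if clusters == previous:                  # assignments have settled
            break
        previous = clusters
    return clusters, centers

clusters, centers = som_like_clustering(genes, seeds=["A", "B", "C"])
print(clusters)
```

Changing `weight`, the seed genes, or the order in which points are visited can change the result, which is why these choices matter in practice.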

Page 23: More Analysis of Gene Expression Data

Self Organizing Maps – Simple Example

[Figure: expression vs. Time for Gene A, Gene B, Gene C, Gene D, Gene E, and Gene F]

[Figure: scatter plot of genes A to F, Expression at T1 = 1 h (x-axis) vs. Expression at T2 = 4 h (y-axis)]

Page 24: More Analysis of Gene Expression Data

[Figure: three scatter plots of genes A to F, Expression at T1 = 1 h (x-axis) vs. Expression at T2 = 4 h (y-axis), captioned in turn:]

• Let Genes A, B, and C be initial cluster centers.
• Clusters after 1st pass
• Clusters after 2nd pass

Page 25: More Analysis of Gene Expression Data

Self Organizing Maps – Simple Example

[Figure: three panels of expression vs. Time, showing Gene A and Gene F; Gene D and Gene E; Gene B and Gene C]

Page 26: More Analysis of Gene Expression Data

Self Organizing Maps – Larger example

• X-axis is time after dose

• Y-axis is normalized gene expression level

• Group ~1000 genes into 24 categories

Page 27: More Analysis of Gene Expression Data

Self Organizing maps - details to consider

• Several methods exist for choosing initial data points for clusters.

• The initial number of clusters must be chosen.

• The method of recalculating the cluster center after adding a new data point can be varied, e.g., how much ‘weight’ is given to the new data point.

• Routines for merging and dividing clusters and detecting outliers can be added at each iteration.

Page 28: More Analysis of Gene Expression Data

Self Organizing maps

Advantages
• Able to come closer to ‘optimal’ clustering through iterations.
• Doesn’t force a tree-structure on data

Disadvantage
• Larger number of options for clustering means that details of process may be hidden.

Page 29: More Analysis of Gene Expression Data

Data Preprocessing

• Filter data
  – Remove genes with expression levels in the noise
  – Focus on a group of genes with a particular function

• Normalize data
  – Subtract a control condition
  – Scale so that a gene whose expression level changes from 5000 to 10000 looks the same as a gene whose expression level changes from 500 to 1000. One possibility is to scale all genes to mean of 0 and standard deviation of 1.
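The scaling option in the last bullet can be illustrated with a short sketch. The two genes below are hypothetical and chosen to match the 5000-to-10000 and 500-to-1000 example; the sample standard deviation is assumed, and `zscore` is a name made up for this example.

```python
from statistics import mean, stdev

# Two hypothetical genes with the same two-fold change at different magnitudes.
expression = {
    "high": [5000, 10000, 5000, 10000],
    "low": [500, 1000, 500, 1000],
}

def zscore(values):
    """Scale one gene's values to mean 0 and (sample) standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

scaled = {gene: zscore(values) for gene, values in expression.items()}
for gene, values in scaled.items():
    print(gene, [round(v, 3) for v in values])
```

After scaling, the two genes have identical profiles, so a distance-based clustering method would treat them as behaving the same way.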

Page 30: More Analysis of Gene Expression Data

Detecting Statistically Significant Changes

• Consider 1D, multiple conditions
• 1-way ANOVA
• Similar tests for 1D, 2 condition data:
  – Fold changes
  – Tests Steve described in previous talk (Mann-Whitney, Comparison Analysis)
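For the fold-change option, a minimal sketch is to divide the mean of condition 2 by the mean of condition 1 for each gene. The values are those of the two-condition example table earlier in the talk; reporting the log2 of the ratio is a common convention assumed here, not something the slides specify.

```python
import math
from statistics import mean

# Two-condition example values (3 replicates per condition) for genes A, B, C.
table = {
    "A": ([150, 160, 150], [180, 190, 180]),
    "B": ([50, 40, 45], [50, 45, 40]),
    "C": ([800, 760, 680], [400, 450, 425]),
}

# Fold change = mean expression in condition 2 / mean expression in condition 1.
fold = {g: mean(c2) / mean(c1) for g, (c1, c2) in table.items()}
for g, f in fold.items():
    print(f"gene {g}: fold change = {f:.2f}, log2 = {math.log2(f):+.2f}")
```

A fold change alone ignores the variability between replicates, which is what the statistical tests listed above take into account.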

Page 31: More Analysis of Gene Expression Data

1D, Multiple Condition Data

Gene   Dose 1                  Dose 2                  Dose 3                  …
       Rep 1   Rep 2   Rep 3   Rep 1   Rep 2   Rep 3   Rep 1   Rep 2   Rep 3
A      150     160     150     180     190     180     150     155     135
B      50      40      45      50      45      40      80      90      105
C      800     760     680     400     450     425     200     220     400

Page 32: More Analysis of Gene Expression Data

1-Way ANOVA

• Question being asked is whether the expression level for each gene (taken one at a time) changes significantly as a function of dose.

• More specifically, it compares the variability within replicates for a given dose to the variability caused by changing the dose.

• If the gene chip contains 1000 genes, then do 1000 ANOVAs.

• Consider “repeated measures ANOVA” if multiple measurements are made on the same animal.
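The comparison of within-replicate variability to between-dose variability can be sketched with the standard library. The code below computes the one-way ANOVA F statistic gene by gene for the three-dose example table; it does not compute a p-value, which needs an F distribution with (2, 6) degrees of freedom for this layout (the 5% critical value is about 5.14). The function name `one_way_anova` is made up for this example.

```python
from statistics import mean

# Values from the 1D, multiple-condition table: 3 doses x 3 replicates per gene.
table = {
    "A": [[150, 160, 150], [180, 190, 180], [150, 155, 135]],
    "B": [[50, 40, 45], [50, 45, 40], [80, 90, 105]],
    "C": [[800, 760, 680], [400, 450, 425], [200, 220, 400]],
}

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of replicate groups."""
    values = [v for g in groups for v in g]
    grand = mean(values)
    # Variability caused by changing the dose (between groups).
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Variability within replicates at a given dose.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# One ANOVA per gene, as on a chip with many genes.
results = {gene: one_way_anova(groups) for gene, groups in table.items()}
for gene, (f, df1, df2) in results.items():
    print(f"gene {gene}: F({df1}, {df2}) = {f:.2f}")
```

Running one test per gene is exactly the "1000 genes, 1000 ANOVAs" pattern: the same function is applied to each row of the table.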

Page 33: More Analysis of Gene Expression Data

ANOVA for Hepatocytes exposed to Hydrazine, time 0

Source     SS      df   MS     F      P
Columns    5566    2    2783   3.21   0.1798
Error      2602    3    867
Total      8168    5
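The entries of this table follow from the usual ANOVA relations: each mean square is a sum of squares divided by its degrees of freedom, and F is the ratio of the two mean squares. Checking the numbers in the table:

```latex
\begin{align*}
MS_{\text{columns}} &= \frac{SS_{\text{columns}}}{df_{\text{columns}}} = \frac{5566}{2} = 2783 \\
MS_{\text{error}}   &= \frac{SS_{\text{error}}}{df_{\text{error}}} = \frac{2602}{3} \approx 867 \\
F &= \frac{MS_{\text{columns}}}{MS_{\text{error}}} \approx \frac{2783}{867} \approx 3.21
\end{align*}
```

With P = 0.1798, this gene would not be called significant at the conventional 0.05 level.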

Page 34: More Analysis of Gene Expression Data

2-way ANOVA

• Apply to 2D, multiple x multiple condition data sets

• Consider 3 doses, 5 time points per dose, 2 replicates per condition

• Can reveal significant effect of time, significant effect of dose, or a significant interaction between the two

• A “2-way repeated measures ANOVA” also exists

Page 35: More Analysis of Gene Expression Data

2-way ANOVA for Hydrazine Data – Output for 1 gene

Source       SS       df   MS     F      P
Time         28724    4    7181   9.20   7.3e-4
Dose         1143     2    572    0.73   0.498
Time*dose    22940    8    2868   3.67   0.016
Error        10930    14   781
Total        64409    28

Page 36: More Analysis of Gene Expression Data

2-Way ANOVA – p-value Summary for 10 Genes

Page 37: More Analysis of Gene Expression Data

2-Way ANOVA – Dose effect

Red, 0 mM; Green, 50 mM; Blue, 75 mM

Page 38: More Analysis of Gene Expression Data

2-way ANOVA – Time x Dose effect

Page 39: More Analysis of Gene Expression Data

Software

• Free
  – Eisen’s software Cluster, Treeview
    • Hierarchical clustering, SOM
    • http://rana.lbl.gov/
  – Genecluster
    • SOM
    • http://www-genome.wi.mit.edu/cancer/software/software.html

Page 40: More Analysis of Gene Expression Data

Software (cont)

• Commercial, gene-specific
  – Genelinker Gold
    • PCA, clustering, SOM, statistics
    • http://microarray.genelinker.com/products.html#GeneLinkerGold
  – GeneSpring
    • PCA, clustering, SOM, statistics
    • http://www.sigenetics.com/cgi/SiG.cgi/Products/GeneSpring/index.smf
  – Rosetta
    • PCA, clustering, SOM, ANOVA
    • http://www.rosettabio.com/products/resolver/default.htm
  – Several others

Page 41: More Analysis of Gene Expression Data

Software (cont)

• Tools, not gene specific
  – Matlab
  – SPSS
  – SAS

• A useful web site, briefly summarizes many software packages, up-to-date
  – http://ihome.cuhk.edu.hk/~b400559/arraysoft.html

Page 42: More Analysis of Gene Expression Data

Collaborators

AFRL

Dr. John Frazier

Dr. Charles Wang

Dr. Victor Chan

AFOSR

Dr. Walt Kozumbo

AFIT

Dr. Dennis Quinn

Rebecca Olson

Tom Hopkins

2Lt Matt Campbell

WSU

Dr. Nick Reo

Dr. Steve Berberich

Dr. Tatiana Karpinets

Page 43: More Analysis of Gene Expression Data

Questions?