gene expression arrays

Upload: jennessa-cherry

Post on 31-Dec-2015

TRANSCRIPT

Page 1: Gene Expression Arrays

Gene Expression Arrays

EPP 245 Statistical Analysis of Laboratory Data

Page 2: Gene Expression Arrays

November 9, 2006 EPP 245 Statistical Analysis of Laboratory Data

Basic Design of Expression Arrays

• For each gene that is a target for the array, we have a known DNA sequence.

• mRNA is reverse transcribed to DNA, and if a complementary sequence is on the chip, the DNA will be more likely to stick

• The DNA is labeled with a dye that will fluoresce and generate a signal that increases monotonically with the amount of that species in the sample

Page 3: Gene Expression Arrays

Probe Sequence

[Figure: example DNA sequence with Exon and Intron regions marked]

TAAATCGATACGCATTAGTTCGACCTATCGAAGACCCAACACGGATTCGATACGTTAATATGACTACCTGCGCAACCCTAACGTCCATGTATCTAATACGATTTAGCTATGCGTAATCAAGCTGGATAGCTTCTGGGTTGTGCCTAAGCTATGCAATTATACTGATGGACGCGTTGGGATTGCAGGTACATAGATTATGC

• cDNA arrays use variable length probes derived from expressed sequence tags
  – Spotted and almost always used with two color methods
  – Can be used in species with an unsequenced genome

• Long oligoarrays use 60-70mers
  – Agilent two-color arrays
  – Spotted arrays from UC Davis or elsewhere
  – Usually use computationally derived probes but can use probes from sequenced ESTs

Page 4: Gene Expression Arrays

• Affymetrix GeneChips use multiple 25-mers
  – For each gene, one or more sets of 8-20 distinct probes
  – May overlap
  – May cover more than one exon

• Affymetrix chips also use mismatch (MM) probes that have the same sequence as perfect match (PM) probes except for the middle base, which is changed to inhibit binding.

• The MM probe is supposed to act as a control, but it often binds to another mRNA species instead, so many analysts do not use MM probes

Page 5: Gene Expression Arrays

Probe Design

• A good probe sequence should match the chosen gene or exon from a gene and should not match any other gene in the genome.

• Melting temperature depends on the GC content and should be similar on all probes on an array since the hybridization must be conducted at a single temperature.
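
As a rough illustration of the GC-content criterion, the sketch below scans a target sequence for 25-mers whose GC fraction falls in a narrow band. The function names, the band limits, and the use of the opening bases of the example sequence are assumptions made for illustration; real probe design relies on thermodynamic melting-temperature models and a genome-wide specificity search, neither of which is attempted here.

```python
def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def candidate_probes(seq, length=25):
    """Yield (start, probe, gc_fraction) for every window of the given length."""
    for start in range(len(seq) - length + 1):
        probe = seq[start:start + length]
        yield start, probe, gc_fraction(probe)


# Opening bases of the example sequence from the Probe Sequence slide.
target = "TAAATCGATACGCATTAGTTCGACCTATCGAAGACCCAACACGGATTCGATACG"

# Keep windows whose GC fraction lies in an arbitrary, hypothetical band so
# that their melting temperatures are likely to be similar.
similar = [(s, p) for s, p, gc in candidate_probes(target) if 0.44 <= gc <= 0.52]
```

A narrow GC band is only a proxy for similar melting temperature; it says nothing about whether a probe also matches some other gene.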

Page 6: Gene Expression Arrays

• The affinity of a given piece of DNA for the probe sequence can depend on many things, including secondary and tertiary structure as well as GC content.

• This means that the relationship between the concentration of the RNA species in the original sample and the brightness of the spot on the array can be very different for different probes for the same gene.

• Thus only comparisons of intensity within the same probe across arrays make sense.

Page 7: Gene Expression Arrays

Affymetrix GeneChips

• For each probe set, there are 8-20 perfect match (PM) probes which may overlap or not and which target the same gene

• There are also mismatch (MM) probes which are supposed to serve as a control, but do so rather badly

• Most of us ignore the MM probes

Page 8: Gene Expression Arrays

Expression Indices

• A key issue with Affymetrix chips is how to summarize the multiple data values on a chip for each probe set (aka gene).

• There have been a large number of suggested methods.

• Generally, the worst ones are those from Affy, by a long way; worse means less able to detect real differences

Page 9: Gene Expression Arrays

Usable Methods

• Li and Wong’s dCHIP and follow-on work are demonstrably better than MAS 4.0 and MAS 5.0, but not as good as RMA and GLA

• ArrayAssist can use dCHIP, RMA, gcRMA, and others.

• The GLA method (Durbin, Rocke, Zhou) can be imported into ArrayAssist.

Page 10: Gene Expression Arrays

Steps in Expression Index Construction

• Background correction is the process of adjusting the signals so that the zero point is similar on all parts of all arrays.

• We like to manage this so that zero signal after background correction corresponds approximately to zero amount of the mRNA species that is the target of the probe set.

Page 11: Gene Expression Arrays

• Data transformation is the process of changing the scale of the data so that it is more comparable from high to low.

• Common transformations are the logarithm and generalized logarithm

• Normalization is the process of adjusting for systematic differences from one array to another.

• Normalization may be done before or after transformation, and before or after probe set summarization.

Page 12: Gene Expression Arrays

• One may use only the perfect match (PM) probes, or may subtract or otherwise use the mismatch (MM) probes

• There are many ways to summarize 20 PM probes and 20 MM probes on 10 arrays (a total of 400 numbers, 200 of them PM) into 10 expression index numbers

Page 13: Gene Expression Arrays

The RMA Method

• Background correction that does not make 0 signal correspond to 0 amount

• Quantile normalization makes the overall distribution of intensity values across probes the same on each array

• Log2 transform

• Median polish summary of PM probes
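
The quantile normalization step can be sketched as follows. This is a minimal illustration, not the RMA implementation: the function name and the two small arrays are hypothetical, and ties are broken by position rather than averaged.

```python
def quantile_normalize(arrays):
    """Give every array the same distribution of values.

    arrays: list of equal-length lists, one list of probe intensities per array.
    """
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n_probes)
    ]
    normalized = []
    for a in arrays:
        order = sorted(range(n_probes), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n_probes
        for rank, i in enumerate(order):
            out[i] = reference[rank]  # replace each value by the reference quantile
        normalized.append(out)
    return normalized


# Hypothetical intensities for two arrays of three probes each.
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(arrays)
```

After this step each array holds exactly the same set of values; only the assignment of values to probes differs between arrays.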

Page 14: Gene Expression Arrays

Analysis by means

Original data (last column: row means; last row: column means)

    4.00    6.00    5.00  |   5.00
    8.00    9.00    7.00  |   8.00
   12.00   24.00   12.00  |  16.00
    8.00   13.00    8.00

After removing row means

   -1.00    1.00    0.00  |   0.00
    0.00    1.00   -1.00  |   0.00
   -4.00    8.00   -4.00  |   0.00
   -1.67    3.33   -1.67

After removing column means

    0.67   -2.33    1.67  |   0.00
    1.67   -2.33    0.67  |   0.00
   -2.33    4.67   -2.33  |   0.00
    0.00    0.00    0.00

• Remove Row Means

• Remove Column Means

• Rows and Columns have mean 0

• Influence of an outlier spreads

Page 15: Gene Expression Arrays

Median Polish

Original data (last column: row medians; last row: column medians)

    4.00    6.00    5.00  |   5.00
    8.00    9.00    7.00  |   8.00
   12.00   24.00   12.00  |  12.00
    8.00    9.00    7.00

After removing row medians

   -1.00    1.00    0.00  |   0.00
    0.00    1.00   -1.00  |   0.00
    0.00   12.00    0.00  |   0.00
    0.00    1.00    0.00

After removing column medians

   -1.00    0.00    0.00  |   0.00
    0.00    0.00   -1.00  |   0.00
    0.00   11.00    0.00  |   0.00
    0.00    0.00    0.00

• Remove Row Medians

• Remove Column Medians

• Rows and Columns may not have median 0

• Outliers contained

• May have to be iterated
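
The median polish on this slide can be reproduced with a short sketch. The function name and return layout are assumptions for illustration, and this simplified version tracks only row and column effects; it is not the code used by any RMA implementation.

```python
from statistics import median


def median_polish(table, n_iter=1):
    """Sweep row medians, then column medians, out of a table of numbers.

    Returns (residuals, row_effects, col_effects).
    """
    resid = [[float(x) for x in row] for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [x - m for x in resid[i]]
        for j in range(n_cols):
            m = median(row[j] for row in resid)
            col_eff[j] += m
            for row in resid:
                row[j] -= m
    return resid, row_eff, col_eff


# The 3 x 3 table from the slide, with the outlier 24 in the last row.
data = [[4, 6, 5], [8, 9, 7], [12, 24, 12]]
resid, row_eff, col_eff = median_polish(data)
```

One sweep leaves the outlier as a single residual of 11 instead of spreading it over its row and column; passing a larger `n_iter` repeats the sweeps when rows or columns do not yet have median 0.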

Page 16: Gene Expression Arrays

Example Probe Set

• Using the Affy HG U133 Plus 2.0 GeneChip with 54675 probe sets, from 604258 PM probes.

• Four chips derived from human IR exposed skin at 0, 1, 10, and 100 cGy

• Probe set number 10067/54675 has Affy ID 200618_at

• Gene is LASP1, LIM and SH3 protein 1, LIM protein subfamily, Src homology, actin binding.

Page 17: Gene Expression Arrays

Mean Summarization

                    Dose (cGy)
Probe             0      1     10    100    Mean
200618_at1      360    216    158    198   233.0
200618_at2      313    402    106    103   231.0
200618_at3      130    182     79     91   120.5
200618_at4      351    370    195    136   263.0
200618_at5      164    130     98    107   124.8
200618_at6      223    219    164    196   200.5
200618_at7      437    529    195    158   329.8
200618_at8      509    554    274    128   366.3
200618_at9      522    720    285    198   431.3
200618_at10     668    715    247    260   472.5
200618_at11     306    286    144    159   223.8
Mean          362.1  393.0  176.8  157.6

Page 18: Gene Expression Arrays

Mean Summarization of the Logs

                    Dose (cGy)
Probe             0      1     10    100    Mean
200618_at1     2.56   2.33   2.20   2.30    2.35
200618_at2     2.50   2.60   2.03   2.01    2.28
200618_at3     2.11   2.26   1.90   1.96    2.06
200618_at4     2.55   2.57   2.29   2.13    2.38
200618_at5     2.21   2.11   1.99   2.03    2.09
200618_at6     2.35   2.34   2.21   2.29    2.30
200618_at7     2.64   2.72   2.29   2.20    2.46
200618_at8     2.71   2.74   2.44   2.11    2.50
200618_at9     2.72   2.86   2.45   2.30    2.58
200618_at10    2.82   2.85   2.39   2.41    2.62
200618_at11    2.49   2.46   2.16   2.20    2.33
Mean           2.51   2.53   2.21   2.18
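
The two summaries for probe set 200618_at can be recomputed with a few lines. This is only a sketch: the variable and function names are illustrative, and the logs are taken to base 10, which is what the tabulated values correspond to.

```python
from math import log10

# PM intensities for probe set 200618_at.
# Rows are probes 1-11; columns are doses 0, 1, 10, and 100 cGy.
pm = [
    [360, 216, 158, 198],
    [313, 402, 106, 103],
    [130, 182, 79, 91],
    [351, 370, 195, 136],
    [164, 130, 98, 107],
    [223, 219, 164, 196],
    [437, 529, 195, 158],
    [509, 554, 274, 128],
    [522, 720, 285, 198],
    [668, 715, 247, 260],
    [306, 286, 144, 159],
]


def column_means(rows):
    """Mean of each column, i.e. one summary value per array."""
    return [sum(col) / len(rows) for col in zip(*rows)]


raw_index = column_means(pm)
log_index = column_means([[log10(x) for x in row] for row in pm])
```

Averaging the logs keeps the bright probes (such as probes 9 and 10) from dominating the index the way they do in the average of raw intensities.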

Page 19: Gene Expression Arrays

The GLA Method

• The Glog Average (GLA) method is simpler than the RMA method, though it can require estimation of a parameter

• Background correction is intended to make a measured value of zero correspond to a zero quantity in the sample

• Transformation uses the glog, which is approximately ln for large values

• Normalization via lowess

• Summary is a simple average of PM probes
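
A minimal sketch of the glog transformation, assuming the parameterization ln((y + sqrt(y^2 + lambda)) / 2), one common form chosen here so that the result approaches ln(y) for large y. The value of lambda used below is hypothetical; in practice it is the parameter that has to be estimated.

```python
import math


def glog(y, lam):
    """Generalized logarithm: ln((y + sqrt(y^2 + lam)) / 2).

    Unlike ln, it is defined at zero and for negative background-corrected
    values (when lam > 0), and it is close to ln(y) when y is large.
    """
    return math.log((y + math.sqrt(y * y + lam)) / 2.0)


lam = 100.0  # hypothetical value of the transformation parameter
large = glog(1.0e6, lam)  # nearly identical to math.log(1.0e6)
at_zero = glog(0.0, lam)  # finite: ln(sqrt(lam) / 2)
negative = glog(-5.0, lam)  # still defined after background correction
```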

Page 20: Gene Expression Arrays

Probe Sets not Genes

• It is unavoidable to refer to a probe set as measuring a “gene”, but nevertheless it can be deceptive

• The annotation of a probe set may be based on homology with a gene of possibly known function in a different organism

• Only a relatively few probe sets correspond to genes with known function and known structure in the organism being studied

Page 21: Gene Expression Arrays

Two-Color Arrays

• Two-color arrays are designed to account for variability in slides and spots by using two samples on each slide, each labeled with a different dye.

• If a spot is too large, for example, both signals will be too big, and the difference or ratio will eliminate that source of variability

Page 22: Gene Expression Arrays

Dyes

• The most common dye set is Cy3 (green) and Cy5 (red), which absorb most strongly at approximately 550 nm and 649 nm and emit at approximately 570 nm and 670 nm respectively (red light ~ 700 nm, green light ~ 550 nm)

• The dyes are excited with lasers at 532 nm (Cy3 green) and 635 nm (Cy5 red)

• The emissions are read via filters using a CCD device

Page 26: Gene Expression Arrays

File Format

• A slide scanned with Axon GenePix produces a file with extension .gpr that contains the results: http://www.axon.com/gn_GenePix_File_Formats.html

• This contains 29 rows of headers followed by 43 columns of data (in our example files)

• For full analysis one may also need a .gal file that describes the layout of the arrays
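
Reading such a file can be sketched as below. The function name is hypothetical, and the sketch assumes tab-separated text in which the header rows are followed by one line of quoted column names; whether the 29 header rows include that line depends on the file, so the count is a parameter.

```python
import csv


def read_gpr(handle, n_header_rows=29):
    """Return the spot records of a GenePix results file as a list of dicts.

    handle: an open text file (or any iterator over lines).
    n_header_rows: number of lines to skip before the column-name line.
    """
    for _ in range(n_header_rows):
        next(handle)  # discard one header line
    # The next line supplies the column names, e.g. "F635 Median".
    reader = csv.DictReader(handle, delimiter="\t", quotechar='"')
    return list(reader)


# Hypothetical usage:
# with open("slide1.gpr", newline="") as fh:
#     spots = read_gpr(fh)
#     red_foreground = [float(s["F635 Median"]) for s in spots]
```

All values come back as strings, so numeric columns need an explicit conversion as in the usage comment.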

Page 27: Gene Expression Arrays

"Block" "Column" "Row" "Name" "ID" "X" "Y" "Dia." "F635 Median" "F635 Mean" "F635 SD" "B635 Median" "B635 Mean" "B635 SD" "% > B635+1SD" "% > B635+2SD" "F635 % Sat."

Page 28: Gene Expression Arrays

"F532 Median" "F532 Mean" "F532 SD" "B532 Median" "B532 Mean" "B532 SD" "% > B532+1SD" "% > B532+2SD" "F532 % Sat."

Page 29: Gene Expression Arrays

"Ratio of Medians (635/532)" "Ratio of Means (635/532)" "Median of Ratios (635/532)" "Mean of Ratios (635/532)" "Ratios SD (635/532)" "Rgn Ratio (635/532)" "Rgn R² (635/532)" "F Pixels" "B Pixels" "Sum of Medians" "Sum of Means" "Log Ratio (635/532)" "F635 Median - B635" "F532 Median - B532" "F635 Mean - B635" "F532 Mean - B532" "Flags"

Page 30: Gene Expression Arrays

Analysis Choices

• Mean or median foreground intensity

• Background corrected or not

• Log transform (base 2, e, or 10) or glog transform

• Log is compatible only with no background correction

• Glog is best with background correction

Page 31: Gene Expression Arrays

Block 1

Column 1

Row 1

Name NM_006182

ID discoidin domain receptor family, member

X 2575

Y 2565

Dia. 85

DDR1

Page 32: Gene Expression Arrays

F635 Median 48

F635 Mean 54

F635 SD 23

B635 Median 34

B635 Mean 36

B635 SD 11

% > B635+1SD 52

% > B635+2SD 36

F635 % Sat. 0

F532 Median 109

F532 Mean 113

F532 SD 26

B532 Median 35

B532 Mean 36

B532 SD 7

% > B532+1SD 100

% > B532+2SD 100

F532 % Sat. 0
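
For the single spot listed above, two of the analysis choices can be sketched as follows: a log2 ratio of foreground medians with no background correction, and a glog difference after background correction. The glog parameterization and the value of `lam` are assumptions made for illustration.

```python
import math

# Median foreground (F) and background (B) values for the spot above.
spot = {"F635 Median": 48, "B635 Median": 34, "F532 Median": 109, "B532 Median": 35}


def glog(y, lam):
    """Generalized logarithm ln((y + sqrt(y^2 + lam)) / 2), an assumed form."""
    return math.log((y + math.sqrt(y * y + lam)) / 2.0)


# Choice 1: log2 ratio of red to green foreground, no background correction.
log_ratio = math.log2(spot["F635 Median"] / spot["F532 Median"])

# Choice 2: subtract the local background, then take the glog difference.
red = spot["F635 Median"] - spot["B635 Median"]
green = spot["F532 Median"] - spot["B532 Median"]
lam = 100.0  # hypothetical transformation parameter
glog_diff = glog(red, lam) - glog(green, lam)
```

Both measures are negative for this spot, meaning the green (532) channel is brighter than the red (635) channel.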

Page 33: Gene Expression Arrays

Issues with Two-Color Arrays

• Chips have different overall intensities, so normalization across chips is needed.

• The overall intensity on the red channel may be greater or less than on the green channel, so normalization across dyes is needed.

• The red/green difference can be different at different intensity levels

Page 34: Gene Expression Arrays

Array normalization

• Array normalization is meant to increase the precision of comparisons by adjusting for variations that cover entire arrays

• Without normalization, the analysis would be valid, but possibly less sensitive

• However, a poor normalization method will be worse than none at all.

Page 35: Gene Expression Arrays

Possible normalization methods

• We can equalize the mean or median intensity by adding or multiplying a correction term

• We can use different normalizations at different intensity levels (intensity-based normalization) for example by lowess or quantiles

• We can normalize for other things such as print tips

Page 36: Gene Expression Arrays

Example for Normalization

                Group 1             Group 2
           Array 1   Array 2   Array 3   Array 4
Gene 1        1100       900       425       550
Gene 2         110        95        85       110
Gene 3          80        65        55        80
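
The mean and median equalizations described under "Possible normalization methods" can be sketched for this example. The variable names are illustrative, and the sketch is a plain restatement of the arithmetic rather than a recommended pipeline.

```python
from statistics import mean, median

# Rows are genes 1-3; columns are arrays 1-4
# (arrays 1 and 2 are in group 1, arrays 3 and 4 in group 2).
expr = [
    [1100, 900, 425, 550],
    [110, 95, 85, 110],
    [80, 65, 55, 80],
]

arrays = list(zip(*expr))  # one tuple of expression values per array
all_values = [x for row in expr for x in row]
array_means = [mean(a) for a in arrays]
array_medians = [median(a) for a in arrays]
grand_mean = mean(all_values)
grand_median = median(all_values)

# Additive: shift every array so that its mean equals the grand mean.
additive = [[x - m + grand_mean for x, m in zip(row, array_means)] for row in expr]

# Multiplicative: rescale every array so that its mean (or median) matches.
mult_mean = [[x * grand_mean / m for x, m in zip(row, array_means)] for row in expr]
mult_median = [
    [x * grand_median / m for x, m in zip(row, array_medians)] for row in expr
]
```

The additive version produces negative values for genes 2 and 3 on array 1, while the median version maps gene 2 to the same value on every array because gene 2 is the median of each array.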

Page 37: Gene Expression Arrays

. list Array Group Gene Expression

     +---------------------------------+
     | Array   Group   Gene   Expres~n |
     |---------------------------------|
  1. |     1       1      1       1100 |
  2. |     2       1      1        900 |
  3. |     3       2      1        425 |
  4. |     4       2      1        550 |
  5. |     1       1      2        110 |
     |---------------------------------|
  6. |     2       1      2         95 |
  7. |     3       2      2         85 |
  8. |     4       2      2        110 |
  9. |     1       1      3         80 |
 10. |     2       1      3         65 |
     |---------------------------------|
 11. |     3       2      3         55 |
 12. |     4       2      3         80 |
     +---------------------------------+

Page 38: Gene Expression Arrays

. sort Gene

. by Gene: anova Expression Group

----------------------------------------------------------------------
-> Gene = 1

                 Number of obs =       4     R-squared     =  0.9042
                 Root MSE      = 117.925     Adj R-squared =  0.8564

      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |   262656.25     1   262656.25      18.89     0.0491
             |
       Group |   262656.25     1   262656.25      18.89     0.0491
             |
    Residual |     27812.5     2    13906.25
  -----------+----------------------------------------------------
       Total |   290468.75     3  96822.9167

Page 39: Gene Expression Arrays

-> Gene = 2

                 Number of obs =       4     R-squared     =  0.0556
                 Root MSE      = 14.5774     Adj R-squared = -0.4167

      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |          25     1          25       0.12     0.7643
             |
       Group |          25     1          25       0.12     0.7643
             |
    Residual |         425     2       212.5
  -----------+----------------------------------------------------
       Total |         450     3         150

----------------------------------------------------------------------
-> Gene = 3

                 Number of obs =       4     R-squared     =  0.0556
                 Root MSE      = 14.5774     Adj R-squared = -0.4167

      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |          25     1          25       0.12     0.7643
             |
       Group |          25     1          25       0.12     0.7643
             |
    Residual |         425     2       212.5
  -----------+----------------------------------------------------
       Total |         450     3         150

Page 40: Gene Expression Arrays

Additive Normalization by Means

                Group 1             Group 2
           Array 1   Array 2   Array 3   Array 4
Gene 1         975       851       541       608
Gene 2         -15        46       201       168
Gene 3         -45        16       171       138

Page 41: Gene Expression Arrays

. mean Expression

. ereturn list

scalars:
               e(df_r) =  11
             e(N_over) =  1
                  e(N) =  12
               e(k_eq) =  1
            e(k_eform) =  0

macros:
                e(cmd) : "mean"
              e(title) : "Mean estimation"
          e(estat_cmd) : "estat_vce_only"
            e(varlist) : "Expression"
            e(predict) : "_no_predict"
         e(properties) : "b V"

matrices:
                  e(b) :  1 x 1
                  e(V) :  1 x 1
                 e(_N) :  1 x 1
              e(error) :  1 x 1

functions:
             e(sample)

. matrix ExpMeanMat = e(b)

. matlist ExpMeanMat

             | Express~n
-------------+-----------
          y1 |  304.5833

. scalar ExpMean = ExpMeanMat[1,1]

. display ExpMean
304.58333

. anova Expression Array

. predict ArrayMean

. generate NormExp1 = Expression - ArrayMean + ExpMean

Page 42: Gene Expression Arrays

. list Array Group Gene Expression ArrayMean NormExp1

     +--------------------------------------------------------+
     | Array   Group   Gene   Expres~n   ArrayM~n    NormExp1 |
     |--------------------------------------------------------|
  1. |     1       1      1       1100        430    974.5833 |
  2. |     2       1      1        900   353.3333      851.25 |
  3. |     3       2      1        425   188.3333      541.25 |
  4. |     4       2      1        550   246.6667    607.9167 |
  5. |     1       1      2        110        430   -15.41667 |
     |--------------------------------------------------------|
  6. |     2       1      2         95   353.3333    46.24999 |
  7. |     3       2      2         85   188.3333      201.25 |
  8. |     4       2      2        110   246.6667    167.9167 |
  9. |     1       1      3         80        430   -45.41667 |
 10. |     2       1      3         65   353.3333    16.24999 |
     |--------------------------------------------------------|
 11. |     3       2      3         55   188.3333      171.25 |
 12. |     4       2      3         80   246.6667    137.9167 |
     +--------------------------------------------------------+

Page 43: Gene Expression Arrays

. by Gene: anova NormExp1 Group

----------------------------------------------------------------------
-> Gene = 1

                 Number of obs =       4     R-squared     =  0.9209
                 Root MSE      = 70.0991     Adj R-squared =  0.8814

      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |  114469.431     1  114469.431      23.30     0.0403
             |
       Group |  114469.431     1  114469.431      23.30     0.0403
             |
    Residual |  9827.77662     2  4913.88831
  -----------+----------------------------------------------------
       Total |  124297.207     3  41432.4024

Page 44: Gene Expression Arrays

-> Gene = 2

                 Number of obs =       4     R-squared     =  0.9209
                 Root MSE      = 35.0496     Adj R-squared =  0.8814

      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |  28617.3614     1  28617.3614      23.30     0.0403
             |
       Group |  28617.3614     1  28617.3614      23.30     0.0403
             |
    Residual |   2456.9441     2  1228.47205
  -----------+----------------------------------------------------
       Total |  31074.3055     3  10358.1018

----------------------------------------------------------------------
-> Gene = 3

                 Number of obs =       4     R-squared     =  0.9209
                 Root MSE      = 35.0496     Adj R-squared =  0.8814

      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |  28617.3612     1  28617.3612      23.30     0.0403
             |
       Group |  28617.3612     1  28617.3612      23.30     0.0403
             |
    Residual |  2456.94427     2  1228.47214
  -----------+----------------------------------------------------
       Total |  31074.3055     3  10358.1018

Page 45: Gene Expression Arrays

Multiplicative Normalization by Means

                Group 1             Group 2
           Array 1   Array 2   Array 3   Array 4
Gene 1         779       776       687       679
Gene 2          78        82       137       136
Gene 3          57        56        89        99

Page 46: Gene Expression Arrays

. generate NormExp2 = Expression*ExpMean/ArrayMean

. list Array Group Gene Expression ArrayMean NormExp2

     +--------------------------------------------------------+
     | Array   Group   Gene   Expres~n   ArrayM~n    NormExp2 |
     |--------------------------------------------------------|
  1. |     1       1      1       1100        430    779.1667 |
  2. |     2       1      1        900   353.3333    775.8254 |
  3. |     3       2      1        425   188.3333    687.3341 |
  4. |     4       2      1        550   246.6667    679.1385 |
  5. |     1       1      2        110        430    77.91666 |
     |--------------------------------------------------------|
  6. |     2       1      2         95   353.3333    81.89268 |
  7. |     3       2      2         85   188.3333    137.4668 |
  8. |     4       2      2        110   246.6667    135.8277 |
  9. |     1       1      3         80        430    56.66667 |
 10. |     2       1      3         65   353.3333    56.03184 |
     |--------------------------------------------------------|
 11. |     3       2      3         55   188.3333    88.94912 |
 12. |     4       2      3         80   246.6667    98.78378 |
     +--------------------------------------------------------+

Page 47: Gene Expression Arrays

. by Gene: anova NormExp2 Group

----------------------------------------------------------------------
-> Gene = 1

      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |  8884.90342     1  8884.90342     453.70     0.0022

----------------------------------------------------------------------
-> Gene = 2

      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |  3219.72043     1  3219.72043     696.33     0.0014

----------------------------------------------------------------------
-> Gene = 3

      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |  1407.54019     1  1407.54019      57.97     0.0168

Page 48: Gene Expression Arrays

Multiplicative Normalization by Medians

                Group 1             Group 2
           Array 1   Array 2   Array 3   Array 4
Gene 1        1025       971       512       512
Gene 2         102       102       102       102
Gene 3          75        70        66        75

Page 49: Gene Expression Arrays

. sort Array

. table Array, contents(p50 Expression)

-------------------------
    Array | med(Expres~n)
----------+--------------
        1 |           110
        2 |            95
        3 |            85
        4 |           110
-------------------------

. input ArrayMed

      ArrayMed
  1. 110
  2. 110
  3. 110
  4. 95
  5. 95
  6. 95
  7. 85
  8. 85
  9. 85
 10. 110
 11. 110
 12. 110

Page 50: Gene Expression Arrays

. summarize Expression, detail

                         Expression
-------------------------------------------------------------
      Percentiles      Smallest
 1%           55             55
 5%           55             65
10%           65             80       Obs                  12
25%           80             80       Sum of Wgt.          12

50%        102.5                      Mean           304.5833
                        Largest       Std. Dev.      363.1144
75%        487.5            425
90%          900            550       Variance       131852.1
95%         1100            900       Skewness       1.277954
99%         1100           1100       Kurtosis       3.132949

. generate NormExp3 = Expression*102.5/ArrayMed

Page 51: Gene Expression Arrays

-> Gene = 1

      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |  235735.794     1  235735.794     324.00     0.0031

----------------------------------------------------------------------
-> Gene = 2

      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |           0     1           0

----------------------------------------------------------------------
-> Gene = 3

      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |   3.6253006     1   3.6253006       0.17     0.7228

Page 52: Gene Expression Arrays

Intensity-based normalization

• Normalize by means, medians, etc., but do so only in groups of genes with similar expression levels.

• lowess is a procedure that produces a running estimate of the middle, like a robustified mean

• If we subtract the lowess of each array and add the average of the lowess’s, we get the lowess normalization

Page 56: Gene Expression Arrays

Fitting a model to genes

• We can fit a model to the data of each gene after the whole arrays have been background corrected, transformed, and normalized

• Each gene is then tested for differential expression

Page 58: Gene Expression Arrays

Multiplicity Adjustments

• If we test thousands of genes and pick all the ones which are significant at the 5% level, we will get hundreds of false positives.

• Multiplicity adjustments winnow this down so that the number of false positives is smaller

Page 59: Gene Expression Arrays

Types of Multiplicity Adjustments

• The Bonferroni correction aims to detect no significant genes at all if there are truly none, and guarantees that the chance that any will be detected is less than .05 under these conditions

• Generally, this is too conservative

• Less conservative versions include methods due to Holm, Hochberg, and Benjamini and Hochberg (FDR)
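
Three of these adjustments can be sketched as adjusted p-values. The function names and the five p-values are hypothetical, and the code is a minimal illustration rather than a replacement for a tested statistics package.

```python
def bonferroni(pvals):
    """Bonferroni: multiply every p-value by the number of tests."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def holm(pvals):
    """Holm step-down adjusted p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up adjusted p-values (FDR)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)  # largest first
    adjusted = [0.0] * m
    running_min = 1.0
    for k, i in enumerate(order):
        rank = m - k  # 1 = smallest p-value
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values from five gene-wise tests.
pvals = [0.001, 0.01, 0.03, 0.04, 0.5]
bonf, holm_adj, bh = bonferroni(pvals), holm(pvals), benjamini_hochberg(pvals)
```

A gene is declared significant when its adjusted p-value is below the chosen level; for the same data the Bonferroni values are never smaller than the Holm values, which in turn are never smaller than the Benjamini-Hochberg values.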
