microarray statistical validation and functional annotation


Upload: hailey-cook

Post on 26-Mar-2015


TRANSCRIPT

Page 1: Microarray statistical validation and functional annotation

Microarray statistical validation and functional annotation

Page 2: Microarray statistical validation and functional annotation

Microarrays

DNA microarray technology is a high-throughput method for gaining information on gene function.

Microarray technology is based on the availability of gene sequences arrayed on a solid surface, and it allows parallel expression analysis of thousands of genes.

Page 3: Microarray statistical validation and functional annotation

Microarrays

Microarrays can be a valuable tool:
– to define transcriptional signatures bound to a pathological condition
– to work out molecular mechanisms tightly bound to transcription

Since our current knowledge of gene function in higher eukaryotes is quite limited:
– Microarray analysis frequently does not provide a final answer to a biological problem, but it allows the discovery of new research paths that let us explore the problem from a different perspective.

Page 4: Microarray statistical validation and functional annotation

Microarrays

A gold-standard methodology to identify, with high sensitivity and precision, “biologically meaningful” differentially expressed genes is not yet available.
– Therefore, various approaches are under development to optimize the extraction of data linked to the “biology” of the problem under study.

Page 5: Microarray statistical validation and functional annotation

Microarrays

The principal steps of a microarray analysis are:
– Gene intensity measurements and data normalization.
– Statistical validation of differential expression.
– Functional data mining.

Page 6: Microarray statistical validation and functional annotation

Microarrays

Statistical validation usually requires the user to select statistical significance parameters.

For example:
– SAM (Significance Analysis of Microarrays) always requires the input of a “delta” value, which defines the threshold of false positives in the validated dataset.

If the stringency of the statistical validation is too high, biologically meaningful genes can be lost, making it more difficult to work out functional correlations between the differentially expressed genes.

If the stringency of the statistical validation is too loose, the increase in false positives creates background noise from which it is difficult to extract reliable functional correlations between the differentially expressed genes.
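
The stringency trade-off can be illustrated with a simplified permutation-based estimate of the false discovery rate at several cutoffs. This is a minimal sketch and not SAM's actual delta procedure: the function name `estimate_fdr`, the absolute-score cutoff, and the hand-made toy scores are all assumptions made for illustration.

```python
def estimate_fdr(observed, null_runs, threshold):
    """Return (n_called, estimated_fdr) for an absolute-score cutoff.

    observed  : per-gene test statistics from the real sample labelling
    null_runs : list of per-gene statistic lists, one per label permutation
    """
    n_called = sum(1 for d in observed if abs(d) >= threshold)
    # Average number of genes passing the cutoff when labels are permuted.
    n_false = sum(
        sum(1 for d in run if abs(d) >= threshold) for run in null_runs
    ) / len(null_runs)
    fdr = min(1.0, n_false / n_called) if n_called else 0.0
    return n_called, fdr


# Toy, hand-made scores (illustrative only, not real data).
observed = [4.1, -3.6, 2.9, 2.2, -1.8, 1.1, 0.7, -0.4]
null_runs = [
    [2.3, -1.9, 1.2, 0.9, -0.8, 0.5, 0.3, -0.1],
    [-2.6, 1.7, 1.4, -1.0, 0.6, 0.4, -0.2, 0.1],
]

for threshold in (3.0, 2.0, 1.0):
    n_called, fdr = estimate_fdr(observed, null_runs, threshold)
    print(f"threshold={threshold}: {n_called} genes, estimated FDR={fdr:.2f}")
```

Lowering the cutoff in this toy example returns more genes but a larger estimated share of false positives, which is the compromise the user has to settle when choosing the significance parameter.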

Page 7: Microarray statistical validation and functional annotation

Microarrays

Page 8: Microarray statistical validation and functional annotation

Microarrays

Page 9: Microarray statistical validation and functional annotation

Microarrays

Statistical validation requires the user to select statistical significance parameters.

For example:
– SAM (Significance Analysis of Microarrays) requires the definition of a “delta” value, which defines the threshold of false positives in the validated dataset.
– When Fisher’s test is used, defining a threshold value is even harder.

Page 10: Microarray statistical validation and functional annotation

Microarrays

Page 11: Microarray statistical validation and functional annotation

Microarrays

It is important to note that:
– Statistical validation does not always imply the selection of the most “biologically” meaningful dataset.

Therefore, we are trying to integrate “biologically” important parameters, such as Gene Ontology, into the statistical validation.

Page 12: Microarray statistical validation and functional annotation

Microarrays

Gene Ontology (GO) is a dynamic controlled vocabulary that can be applied to all organisms, even as knowledge of gene and protein roles in cells is accumulating and changing.

GO might help to link differentially expressed genes to specific functional classes.

Page 13: Microarray statistical validation and functional annotation

Microarrays

Molecular Function: the tasks performed by individual gene products; examples are transcription factor and DNA helicase.

Page 14: Microarray statistical validation and functional annotation

Microarrays

Biological Process: broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions.

Page 15: Microarray statistical validation and functional annotation

Microarrays

Cellular Component: subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex.

Page 16: Microarray statistical validation and functional annotation

Microarrays

It has recently been shown that:
– There is a strong instability in the size and overlap of the gene lists that result from varying gene selection methods.

(Hosack et al, Genome Biology 2003, 4:P4)

Page 17: Microarray statistical validation and functional annotation

Microarrays

(Hosack et al, Genome Biology 2003, 4:P4)

The percentage of genes overlapping in any two lists was highly variable, and ranged from 7% to 60%.
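
A pairwise overlap of this kind can be computed as in the following minimal sketch. The slides do not say how the overlap percentage was defined, so expressing it relative to the smaller of the two lists is an assumption, and the method names and gene IDs are placeholders rather than data from the cited study.

```python
from itertools import combinations


def percent_overlap(list_a, list_b):
    """Percentage of the smaller list that is shared with the other list."""
    set_a, set_b = set(list_a), set(list_b)
    smaller = min(len(set_a), len(set_b))
    if smaller == 0:
        return 0.0
    return 100.0 * len(set_a & set_b) / smaller


# Hypothetical gene lists from three selection methods (placeholder IDs).
gene_lists = {
    "method_1": ["g1", "g2", "g3", "g4", "g5"],
    "method_2": ["g4", "g5", "g6", "g7"],
    "method_3": ["g1", "g5", "g8", "g9", "g10", "g11"],
}

# Compare every pair of lists once.
for (name_a, a), (name_b, b) in combinations(gene_lists.items(), 2):
    print(f"{name_a} vs {name_b}: {percent_overlap(a, b):.0f}% overlap")
```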

Page 18: Microarray statistical validation and functional annotation

Microarrays

In spite of this striking variation:
– The top five biological themes linked to the data sets are the same.

This evidence suggests that the conversion of genes to themes allows the "biological result" of the experiment to be determined despite substantial differences in gene list content resulting from the use of various normalization, gene intensity and statistical selection methods.

(Hosack et al, Genome Biology 2003, 4:P4)

Page 19: Microarray statistical validation and functional annotation

Microarrays

(Hosack et al, Genome Biology 2003, 4:P4)

Page 20: Microarray statistical validation and functional annotation

Microarrays

Integrating GO in statistical validation:
– The GO classes are counted in the data set under statistical validation.
– SAM analyses are performed using various delta parameters.
– The GO classes present in the statistically validated subsets are counted.
– The presence of enrichment of GO classes in the SAM-validated sets is evaluated using a binomial test corrected for Type I errors.
A score for each GO class is generated by computing log2(p-value * % hits).
– The SAM subset showing the best compromise between the number of enriched GO classes and the number of hits for each class is selected for further studies.
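
The enrichment and scoring steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the slides do not name the Type I error correction (Bonferroni is used here), "% hits" is taken to be the percentage of the subset's genes falling in the class, each gene carries a single GO class, and all gene IDs and annotations are hypothetical.

```python
import math
from collections import Counter


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def score_go_classes(background, subset, alpha=0.05):
    """Score GO classes enriched in `subset` relative to `background`.

    Both arguments map gene ID -> GO class (one class per gene, a simplification).
    Returns {go_class: score} for classes whose corrected p-value is below alpha.
    """
    bg_counts = Counter(background.values())
    sub_counts = Counter(subset.values())
    n_bg, n_sub = len(background), len(subset)
    scores = {}
    for go_class, hits in sub_counts.items():
        expected_freq = bg_counts[go_class] / n_bg
        p_value = binomial_upper_tail(hits, n_sub, expected_freq)
        # Bonferroni correction over the classes tested (an assumption).
        p_adj = min(1.0, p_value * len(sub_counts))
        if p_adj < alpha:
            pct_hits = 100.0 * hits / n_sub  # assumed meaning of "% hits"
            # Floor the p-value so log2 never receives zero.
            scores[go_class] = math.log2(max(p_adj, 1e-300) * pct_hits)
    return scores


# Toy annotation of the data set under validation (hypothetical).
background = {f"g{i}": "immune response" for i in range(10)}
background.update({f"g{i}": "metabolism" for i in range(10, 55)})
background.update({f"g{i}": "transport" for i in range(55, 100)})

# Hypothetical subset validated by SAM at one delta value.
subset_ids = ["g0", "g1", "g2", "g3", "g4", "g10", "g11", "g12", "g55", "g56"]
subset = {gene: background[gene] for gene in subset_ids}

for go_class, score in score_go_classes(background, subset).items():
    print(f"{go_class}: score = {score:.2f}")
```

In the workflow described on the slide, this scoring would be repeated for the subset obtained at each delta value and the results compared. The slide does not state how the "best compromise" is quantified, so that selection step is left out of the sketch.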

Page 21: Microarray statistical validation and functional annotation

CONCORDANT MORPHOLOGIC AND GENE EXPRESSION DATA SHOW THAT A VACCINE FREEZES HER-2/neu PRENEOPLASTIC LESIONS

Atypical hyperplasia and in situ carcinomas (10 wks)

Lobular carcinoma (22 wks)

Cured mammary gland (22 wks)

(Quaglino et al submitted)

Page 22: Microarray statistical validation and functional annotation

Microarrays

[Figure: plot with axis label log2(p-value * %HITs)]

Page 23: Microarray statistical validation and functional annotation

Microarrays

We observed that:
– Simple statistical validation and statistical validation mediated by GO class analysis overlap strongly.

However, some interesting differentially expressed genes can only be detected using GO-mediated statistical validation.

Page 24: Microarray statistical validation and functional annotation

[Figure, panels a–e: scale from -3.0 to 3.0 (1:1); column labels refer to prg, nt and pb samples at weeks 10 and 22]

Ig-linked immune response: common to simple statistical analysis and GO-mediated statistical validation

Cell-linked immune response: specific to GO-mediated statistical validation

Page 25: Microarray statistical validation and functional annotation

[Flowchart, steps (a)–(n). Elements shown:]
– Starting dataset (SD)
– SAM program
– Subsets of SAM validated genes (SSVG)
– Consensus program
– Alignment matrices (AMs)
– Patser program
– Decision: is any AM over-represented in SSVG? (Yes / No; Discard)
– Decision: was SAM run with at least 3 different thresholds? (No)
– min(AM-specific p-value)
– Patser program
– Filtering by AM-specific p-value
– Selected SSVG

We also observed that the previously described approach can be used to improve data mining related to the transcriptional signature present in co-regulated genes.