T-statistics are widespread in assessing differential expression


Upload: akio

Post on 13-Jan-2016


Page 1: T-statistics is widespread in assessing differential expression

• T-statistics are widely used to assess differential expression.

• Unstable variance estimates that arise when the sample size is small can be corrected using:
– Error fudge factors (SAM)
– Bayesian methods (Limma)

Page 2: T-statistics is widespread in assessing differential expression

Limma

Linear model analysis of microarrays

Page 3: T-statistics is widespread in assessing differential expression

Bayesian regularized t-test (Baldi & Long 2001)

t = \frac{m_C - m_T}{\sqrt{\dfrac{\sigma_C^2}{n_C} + \dfrac{\sigma_T^2}{n_T}}}

The method tries to decouple the mean–variance dependency by modeling the variance of the expression of a gene as a function of the mean expression of the gene.

The empirical variance is modulated by \nu_0 ‘pseudo-observations’ associated with a background variance \sigma_0^2.

[Figure label: My gene]
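The pseudo-observation idea can be sketched in a few lines of Python. This is only an illustration: it assumes the regularized variance has the form (ν0·σ0² + (n−1)·s²) / (ν0 + n − 2) described by Baldi & Long, and the function names, the example values and the choice of ν0 and σ0² are hypothetical. In practice σ0² would be estimated from genes with a similar mean expression.

```python
import math
import statistics


def regularized_variance(values, nu0, sigma0_sq):
    """Shrink the empirical variance toward a background variance sigma0_sq.

    nu0 is the number of pseudo-observations (assumed formula, see text above).
    """
    n = len(values)
    s_sq = statistics.variance(values)  # unbiased empirical variance
    return (nu0 * sigma0_sq + (n - 1) * s_sq) / (nu0 + n - 2)


def regularized_t(control, treated, nu0, sigma0_sq_c, sigma0_sq_t):
    """t = (m_C - m_T) / sqrt(var_C / n_C + var_T / n_T), with regularized variances."""
    var_c = regularized_variance(control, nu0, sigma0_sq_c)
    var_t = regularized_variance(treated, nu0, sigma0_sq_t)
    diff = statistics.mean(control) - statistics.mean(treated)
    return diff / math.sqrt(var_c / len(control) + var_t / len(treated))


# Hypothetical log2 expression values for one gene, three replicates per group.
control = [7.9, 8.0, 8.1]
treated = [9.0, 9.1, 8.9]
t_reg = regularized_t(control, treated, nu0=10, sigma0_sq_c=0.05, sigma0_sq_t=0.05)
```

With only three replicates the empirical variance of this gene is very small; the ten pseudo-observations pull it toward the background value, so the statistic is less extreme than an unregularized t.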

Page 4: T-statistics is widespread in assessing differential expression

Bayesian regularized t-test

The main goal of this approach is to stabilize the variance estimates that arise when the sample size is small, making the t-test results more robust.

Page 5: T-statistics is widespread in assessing differential expression

Bayesian regularized t-test

The regularized t-test makes the presence of significant differential expression more evident.

Page 6: T-statistics is widespread in assessing differential expression

BH correction

• BH is the most used method for the correction of type I errors in microarray analysis.

• However, it has some limitations due to its underlying assumptions:
– The gene expressions are independent of each other.
– The raw distribution of p-values should be uniform in the non-significant range.
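As a reminder of what the correction does, here is a minimal Python sketch of the Benjamini-Hochberg step-up adjustment. The function name and the example p-values are hypothetical; this is not the code used by Limma or affylmGUI, only an illustration of the procedure.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw p-values
adjusted = bh_adjust(raw)
```

Each raw p-value is multiplied by m/rank, so a gene is only called significant if its p-value is small relative to its position in the ranked list.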

Page 7: T-statistics is widespread in assessing differential expression
Page 8: T-statistics is widespread in assessing differential expression

Applying the BH correction to these p-values will not produce any differentially expressed genes!

Page 9: T-statistics is widespread in assessing differential expression

Let’s identify differentially expressed probe sets by linear modelling

To use linear models, Compute Linear Model Fit reorganizes the targets description and the raw data on the basis of the number of factors under analysis.

Page 10: T-statistics is widespread in assessing differential expression

The next step is the definition of the contrasts, which represent the pairs of conditions to be compared for differential expression.

If more than two conditions are available, more contrasts can be evaluated.

Page 11: T-statistics is widespread in assessing differential expression

Contrast parameterization is saved with a specific name

REMEMBER: contrasts are defined between the different experimental groups (e.g. Treated, Control). Making Treated – Control means that the log2(expression) of the control samples is subtracted from that of the treated samples. The result is the log2(fold change).
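The Treated – Control arithmetic can be checked with a small Python sketch. The intensity values and the function name are hypothetical, and the sketch simply takes the difference of group means on the log2 scale, which is what the contrast estimates in the simplest two-group design.

```python
import math
import statistics


def log2_fold_change(treated, control):
    """Mean log2 expression of treated minus mean log2 expression of control."""
    log_t = [math.log2(x) for x in treated]
    log_c = [math.log2(x) for x in control]
    return statistics.mean(log_t) - statistics.mean(log_c)


# Hypothetical intensities for one probe set: treated is 4 times control.
treated = [400.0, 400.0, 400.0]
control = [100.0, 100.0, 100.0]
lfc = log2_fold_change(treated, control)  # about 2, i.e. a 4-fold increase
```

A positive value means higher expression in the treated samples; swapping the order of the contrast only changes the sign.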

Page 12: T-statistics is widespread in assessing differential expression

Before evaluating differential expression, the raw p-value distribution is checked.

Page 13: T-statistics is widespread in assessing differential expression
Page 14: T-statistics is widespread in assessing differential expression

If BH correction can be applied to correct type I errors, we can move to the selection of the subset of differentially expressed genes.

Page 15: T-statistics is widespread in assessing differential expression


Page 16: T-statistics is widespread in assessing differential expression

These results can be saved in a new topTable containing only the probe sets shown in red on the plots.

Page 17: T-statistics is widespread in assessing differential expression

TopTable structure

• AffyID
• Gene Symbol
• Gene Description
• Log2 FC
• Average intensity
• T statistic
• P-value
• Log-odds statistic

Page 18: T-statistics is widespread in assessing differential expression

Exercise 10 (30 minutes)

• Go to the folder estrogen.IGF1.
• Create, with Excel, a tab-delimited file named targets.txt:
– The targets file is made of three columns with the following header:
  • Name
  • FileName
  • Target
– In column Name place a brief name (e.g. c1, c2, etc.)
– In column FileName place the name of the corresponding .CEL file
– In column Target place the experimental condition (e.g. control, treatment, etc.)
• Create a target only for MCF7 and SKER3 with/without estrogen (E2) treatment.
• Calculate probe set summaries with RMA

See next page
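If you prefer not to use Excel, the same three-column layout can be produced with a short Python script. The sample names and .CEL file names below are hypothetical placeholders; only the header (Name, FileName, Target) comes from the exercise.

```python
import csv
import io

# Hypothetical sample names and .CEL file names; replace them with the real ones.
rows = [
    ("c1", "mcf7_ctrl_1.CEL", "mcf7.ctrl"),
    ("t1", "mcf7_e2_1.CEL", "mcf7.e2"),
    ("c2", "sker3_ctrl_1.CEL", "sker3.ctrl"),
    ("t2", "sker3_e2_1.CEL", "sker3.e2"),
]


def targets_text(rows):
    """Build the tab-delimited content of a targets file as a string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["Name", "FileName", "Target"])
    writer.writerows(rows)
    return buffer.getvalue()


text = targets_text(rows)
# To save it to disk: open("targets.txt", "w").write(text)
```

Whatever tool is used, the file must be plain text with tabs between columns and one row per .CEL file.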

Page 19: T-statistics is widespread in assessing differential expression

Exercise 10 (30 minutes)

• In this experiment we have a breast cancer tumor cell line (MCF7) and a tumor cell line derived from the central nervous system (SKER3).

• Question:
– Which probe sets are controlled by E2 in a tissue-independent manner?

See next page

Page 20: T-statistics is widespread in assessing differential expression

Exercise 10

• Calculate intensities with RMA

• Filter the data:
– IQR 0.25, intensity 25% >100

• Calculate the models for E2 versus untreated cells both in mcf7 and sker3.

• Contrasts:
mcf7.e2 – mcf7.ctrl
sker3.e2 – sker3.ctrl

See next page
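The filtering step can be pictured with the following Python sketch. It rests on assumptions that the slide does not spell out: that "IQR 0.25" means keeping probe sets whose inter-quartile range on the log2 scale exceeds 0.25, and that "intensity 25% >100" means at least 25% of the samples have a linear-scale intensity above 100. The function name and example values are hypothetical.

```python
import math
import statistics


def passes_filter(intensities, iqr_min=0.25, fraction=0.25, level=100.0):
    """Return True if a probe set passes both the IQR and the intensity filter."""
    log2_values = [math.log2(x) for x in intensities]
    q1, _median, q3 = statistics.quantiles(log2_values, n=4, method="inclusive")
    iqr_ok = (q3 - q1) > iqr_min
    # Fraction of samples whose linear-scale intensity exceeds the level.
    above = sum(1 for x in intensities if x > level)
    intensity_ok = above / len(intensities) >= fraction
    return iqr_ok and intensity_ok


# Hypothetical probe sets (linear-scale intensities across four arrays).
variable_and_expressed = [50.0, 60.0, 400.0, 800.0]
never_expressed = [50.0, 52.0, 51.0, 53.0]
expressed_but_flat = [200.0, 201.0, 202.0, 203.0]
```

The two criteria remove probe sets that are either never detected or that do not change across the arrays, which reduces the number of tests before the linear model is fitted.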

Page 21: T-statistics is widespread in assessing differential expression

Exercise 10

• Evaluate if the raw p-value distributions are suitable for BH correction.

• Question:
– Is the raw p-value distribution good enough to perform BH correction?
• YES NO

See next page

Page 22: T-statistics is widespread in assessing differential expression

Exercise 10

• Use the “Table of Genes Ranked in order of Differential Expression”.

• Plot differentially expressed genes with raw p-value ≤ 0.05 and an absolute fold change ≥ 1 for the two contrasts.

• Save the subsets of the topTables in ex10.mcf7.xls, ex10.sker3.xls

• Save the project as ex10.lma
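The selection rule used in this exercise can be written as a simple filter. The sketch below is in Python with a hypothetical in-memory topTable (the probe set names, column keys and values are made up), and it assumes the fold change threshold is applied to the log2 fold change.

```python
def select_probe_sets(top_table, p_max=0.05, lfc_min=1.0):
    """Keep rows with raw p-value <= p_max and |log2 FC| >= lfc_min."""
    return [
        row
        for row in top_table
        if row["p_value"] <= p_max and abs(row["log2_fc"]) >= lfc_min
    ]


# Hypothetical topTable rows.
top_table = [
    {"affy_id": "probe_1", "log2_fc": 1.8, "p_value": 0.001},
    {"affy_id": "probe_2", "log2_fc": -1.2, "p_value": 0.030},
    {"affy_id": "probe_3", "log2_fc": 0.4, "p_value": 0.002},
    {"affy_id": "probe_4", "log2_fc": 2.5, "p_value": 0.200},
]
selected = [row["affy_id"] for row in select_probe_sets(top_table)]
```

Using the absolute value keeps both up-regulated and down-regulated probe sets.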

Page 23: T-statistics is widespread in assessing differential expression

Differential expression probe set lists generated by affylmGUI or SAM can be compared using Venn diagrams.

A maximum of three files can be compared.
Attention: each file consists of a single column of probe set IDs without a header.
The comparison can be performed at the probe set or Entrez Gene (EG) level.
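What the Venn diagram computes can be sketched with Python sets. The list names and probe set IDs are hypothetical, and the limit of three lists mirrors the restriction stated on the slide rather than any limit of the approach.

```python
def venn_regions(lists):
    """Map each combination of list names to the IDs found in exactly those lists."""
    if not 1 <= len(lists) <= 3:
        raise ValueError("between one and three lists are expected")
    regions = {}
    all_ids = set().union(*lists.values())
    for item in all_ids:
        # Names of every list that contains this ID, in a stable order.
        key = tuple(sorted(name for name, ids in lists.items() if item in ids))
        regions.setdefault(key, set()).add(item)
    return regions


# Hypothetical probe set lists, one ID per line in the original files.
lists = {
    "mcf7": {"probe_a", "probe_b", "probe_c"},
    "sker3": {"probe_b", "probe_c", "probe_d"},
}
regions = venn_regions(lists)
common = regions[("mcf7", "sker3")]
```

Each region corresponds to one of the list subsets that the tool writes to the working directory.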

Page 24: T-statistics is widespread in assessing differential expression

The various list subsets will be saved in your working directory.

Page 25: T-statistics is widespread in assessing differential expression

Exercise 11 (15 minutes)

• Using "Venn Diagram between probe set lists", evaluate the level of overlap between the Entrez Genes differentially expressed upon E2 treatment in MCF7 and in SKER3.

• Filter the expression data by the genes in common between the two conditions and export the Normalized Expression Values (ex10.common.txt).

Page 26: T-statistics is widespread in assessing differential expression

Analysis pipeline

Quality control → Normalization → Filtering → Statistical analysis → Annotation → Biological knowledge extraction

Page 27: T-statistics is widespread in assessing differential expression

Annotation

• An important issue in microarray data analysis is the specific association of probe identifiers with genome annotated transcripts.

• A critical point in annotation is the way in which the association between probes and genes is produced.

Page 28: T-statistics is widespread in assessing differential expression

Annotation in Affymetrix

• NetAffx: Affymetrix annotation repository
• Bioconductor:
– uses a specific annotation library, AnnBuilder, to create annotation libraries starting from the association probe set identifier → GenBank accession number (i.e. the primary target for probe design).

• RESOURCERER (Tsai et al. 2001):
– the annotation tool at the TIGR center uses EST and gene sequences stored in the TGI databases (www.tigr.org/tdb/tgi.shtml).
– They provide an analysis of publicly available EST and gene sequence data for the identification of transcripts and their placement in a genomic context, and the identification of orthologs and paralogs wherever possible.

• Neither Bioconductor nor TIGR methods operate at the probe level, nor do they consider the limited reliability of some sets due to probe cross-hybridization or erroneous probe/transcript annotation.

• Ensembl:
– Annotation with the Ensembl tool is built by direct matching of Affymetrix probes over the Ensembl sequence database.
– Its weak point is that a match of only 50% of the probes of a specific set to an Ensembl gene is enough to define a true "probe set identifier"/"Ensembl gene identifier" association.

Page 29: T-statistics is widespread in assessing differential expression

Gene Ontology

Page 30: T-statistics is widespread in assessing differential expression

Ontologies

• An ontology is a specification of a conceptualization:
– a hierarchical mapping of concepts within a given frame of reference.

• An ontology is a restricted structured vocabulary of terms that represent domain knowledge.

• An ontology specifies a vocabulary that can be used to exchange queries and assertions.

• A commitment to the use of the ontology is an agreement to use the shared vocabulary in a consistent way.

Page 31: T-statistics is widespread in assessing differential expression

The Gene Ontology

• The goal of the Gene Ontology (GO) Consortium is to produce a controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing.
– http://www.geneontology.org/

• For genes and gene products, the Gene Ontology (GO) Consortium is an initiative designed to address the problem of defining a common set of terms and descriptions for basic biological functions.

• GO provides a restricted vocabulary as well as clear indications of the relationships between terms.

Page 32: T-statistics is widespread in assessing differential expression

The Gene Ontology

• The Gene Ontology (GO) consortium produces three independent ontologies for gene products.

• The three ontologies are:
– molecular function of a gene product, which is defined to be the biochemical activity or action of the gene product (MF 7220).
– biological process, interpreted as a biological objective to which the gene product contributes (BP 9529).
– cellular component, a component of a cell that is part of some larger object or structure (CC 1536).

Page 33: T-statistics is widespread in assessing differential expression

The Graph Structure of GO

• The GO ontologies are structured as directed acyclic graphs (DAGs) that represent a network in which each term may be a child of one or more parents.

• GO node is interchangeable with GO term.
• Child terms are more specific than their parents:
– The term “transmembrane receptor protein-tyrosine kinase” is a child of “transmembrane receptor” and “protein tyrosine kinase”.

Page 34: T-statistics is widespread in assessing differential expression

The Graph Structure of GO

• The relationship between a child and a parent can be characterized by the relations:
– is a
– has a (part of)

• “mitotic chromosome” is a child of “chromosome” and the relationship is an is a relation.

• “telomere” is a child of “chromosome” with the has a relation.
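The DAG structure can be illustrated with a small Python sketch that stores, for each child term, its parents and the type of relation, and then walks upward to collect all ancestors. The terms are the ones used as examples in the slides; labeling the two kinase edges as "is a" is an assumption, and the data structure and function name are hypothetical rather than part of any GO software.

```python
# child term -> list of (parent term, relation)
PARENTS = {
    "transmembrane receptor protein-tyrosine kinase": [
        ("transmembrane receptor", "is a"),   # relation type assumed
        ("protein tyrosine kinase", "is a"),  # relation type assumed
    ],
    "mitotic chromosome": [("chromosome", "is a")],
    "telomere": [("chromosome", "part of")],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following child -> parent edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent, _relation in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


kinase_parents = ancestors("transmembrane receptor protein-tyrosine kinase")
```

Because a term may have more than one parent, the traversal keeps a set of visited terms so that shared ancestors are reported only once.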

Page 35: T-statistics is widespread in assessing differential expression

GO structure

[Figure: graph of GO relationships for the term transcription factor (GO:0003700), with the top node marked.]

Page 36: T-statistics is widespread in assessing differential expression

Induced GO graph for a set of differentially expressed genes.

GO can be used to link differentially expressed genes to specific functional classes.

[Figure: the induced GO graph, with the top node marked, colored according to the unadjusted hypergeometric p-value at a threshold of 0.01.]

Page 37: T-statistics is widespread in assessing differential expression

Consider a population of genes representing a diverse set of GO terms, shown below as different colors.

Page 38: T-statistics is widespread in assessing differential expression

Many methods can be used to identify a set of differentially expressed genes.

Page 39: T-statistics is widespread in assessing differential expression

What are some of the predominant GO terms represented in the set of differentially expressed genes, and how should significance be assigned to a discovered GO term?

Page 40: T-statistics is widespread in assessing differential expression

Example:

Population size: 40 genes

Subset of differentially expressed genes: 12 genes

10 genes, shown in light blue, have a common GO term and 8 occur within the set of differentially expressed genes.

Page 41: T-statistics is widespread in assessing differential expression

Contingency Matrix

A 2x2 contingency matrix is typically used to capture the relationship between membership in the differentially expressed set and membership in a GO term.

Page 42: T-statistics is widespread in assessing differential expression

Contingency Matrix

                  Subset: in   Subset: out
GO term: in            8            2
GO term: out           4           26
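The four counts can be derived from three sets of gene identifiers. The Python sketch below reproduces the numbers of the example (40 genes, 12 differentially expressed, 10 in the GO term, 8 in both); the integer gene identifiers and the function name are hypothetical.

```python
def contingency(universe, subset, go_term):
    """Return (a, b, c, d) for the 2x2 matrix GO term (rows) x subset (columns)."""
    a = len(go_term & subset)             # in GO term, in subset
    b = len(go_term - subset)             # in GO term, out of subset
    c = len(subset - go_term)             # out of GO term, in subset
    d = len(universe - go_term - subset)  # out of both
    return a, b, c, d


universe = set(range(40))    # population of 40 genes
go_term = set(range(10))     # genes 0..9 share the GO term
subset = set(range(2, 14))   # genes 2..13 are differentially expressed
a, b, c, d = contingency(universe, subset, go_term)
```

The four counts always add up to the size of the population, which is a quick sanity check when building the matrix from real lists.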

Page 43: T-statistics is widespread in assessing differential expression

Hypergeometric Distribution

      a      b   |  a+b
      c      d   |  c+d
    ----------------------
     a+c    b+d  |  n = a+b+c+d

p = \frac{\binom{a+c}{a}\binom{b+d}{b}}{\binom{n}{a+b}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{n!\,a!\,b!\,c!\,d!}

The probability of any particular matrix occurring by random selection, given no association between the two variables, is given by the hypergeometric rule.

Page 44: T-statistics is widespread in assessing differential expression

Assigning Significance to the Findings

The HyperGeometric Test permits us to determine if there are non-random associations between the two variables, differential expression membership and membership to a particular Gene Ontology term.

                  Subset: in   Subset: out
GO term: in            8            2
GO term: out           4           26

( 2x2 contingency matrix )

p ≈ .0002
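The p-value for this matrix can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function names are hypothetical, and the one-sided test shown here (probability of observing 8 or more GO-term genes in the subset with the margins fixed) is one common way to assign significance, not necessarily the exact variant used by a given tool.

```python
from math import comb


def matrix_probability(a, b, c, d):
    """Hypergeometric probability of one particular 2x2 matrix with fixed margins."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def enrichment_p_value(a, b, c, d):
    """Probability of a or more GO-term genes in the subset under no association."""
    n = a + b + c + d
    go_total = a + b       # genes annotated with the GO term
    subset_total = a + c   # differentially expressed genes
    p = 0.0
    for k in range(a, min(go_total, subset_total) + 1):
        p += comb(go_total, k) * comb(n - go_total, subset_total - k) / comb(n, subset_total)
    return p


point = matrix_probability(8, 2, 4, 26)
p_value = enrichment_p_value(8, 2, 4, 26)
```

Both the probability of this exact matrix and the one-sided tail probability round to .0002, in agreement with the value reported on the slide.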

Page 45: T-statistics is widespread in assessing differential expression

EASE (Expression Analysis Systematic Explorer)

• EASE analysis identifies prevalent biological themes within gene clusters.

• The highest-ranking themes derived by a computational method can recapitulate manually derived themes in previously published microarray, proteomics and SAGE results, and provide evidence that these themes are stable to varying methods of gene selection.

Hosack et al. Genome Biol., 4:R70-R70.8, 2003.

Page 46: T-statistics is widespread in assessing differential expression
Page 47: T-statistics is widespread in assessing differential expression
Page 48: T-statistics is widespread in assessing differential expression

EASE Results

• Consider all of the Results

EASE reports all themes represented in a cluster, and although some themes may not meet statistical significance, it may still be important to note that particular biological roles or pathways are represented in the cluster.

• Independently Verify Roles

Once found, biological themes should be independently verified using annotation resources.

Page 49: T-statistics is widespread in assessing differential expression

GOstats package

• To perform an analysis using the Hypergeometric-based test, one needs to define a gene universe and a list of selected genes from the universe.

• To identify the set of expressed genes from a microarray experiment, R. Gentleman (GOstats developer) proposed that a non-specific filter be applied and that the genes that pass the filter be used to form the universe for any subsequent functional analyses.

Page 50: T-statistics is widespread in assessing differential expression

Bioconductor provides a library called GOstats, which allows the calculation of enriched GO classes within a set of differentially expressed probe sets.

Select the threshold of significance and the GO class of interest.

Select the list of affyIDs representing the differentially expressed probe sets.
REMEMBER: the file should contain only the affy IDs!

Page 51: T-statistics is widespread in assessing differential expression

If the names of the GO classes are too small in the plot, save it as a PDF and view it with Acrobat Reader, zooming in on the figure.

Page 52: T-statistics is widespread in assessing differential expression

The reason for this representation is the selection of the GO terms that contain smaller subsets.

Page 53: T-statistics is widespread in assessing differential expression

• GO identifier
• Description of GO term
• Significance
• N. of genes belonging to the GO term in the universe
• N. of genes in the differentially expressed set

Page 54: T-statistics is widespread in assessing differential expression

To learn more about the parents of a specific GO term, you can use the plotGO function.

Page 55: T-statistics is widespread in assessing differential expression

It is possible to identify the affy ids associated with a specific GO term.


Page 56: T-statistics is widespread in assessing differential expression
Page 57: T-statistics is widespread in assessing differential expression

Exercise 13 (20 minutes)

• Using the GOenrichment function, check whether there is any overlap between the BP GO classes found enriched (p-value threshold 0.01) using the sets of probe sets found differentially expressed upon E2 treatment in MCF7 or SKER3.

• Question:
– Which BP or MF GO terms are in common between the two sets of differentially expressed probe sets?

See next page

Page 58: T-statistics is widespread in assessing differential expression

Exercise 13 (10 minutes)

• Using plotGO, see which are the parents of the GO term(s) in common between the probe sets differentially expressed in MCF7 and those in SKER3 upon E2 treatment.

• Using the extractAffyids function, check how many of the probe sets found differentially expressed by limma are also present in the common GO terms.

• Question:
– Are the probe sets belonging to the common GO terms the same in the two differential expression analyses?

Page 59: T-statistics is widespread in assessing differential expression

Ingenuity

Page 60: T-statistics is widespread in assessing differential expression

Human curated databases

• Databases generated by human analysis of the published literature are very powerful, but they are mainly commercial and quite expensive.

• The Ingenuity Pathways Knowledge Base is the world's largest curated database of biological networks created from millions of individually modeled relationships between:
– proteins, genes, complexes, cells, tissues, drugs, diseases.

Page 61: T-statistics is widespread in assessing differential expression

Human curated databases

• Ingenuity has built a knowledge base of pathway interactions that is scientifically accurate, semantically consistent, contextually rich, broad in coverage, and up-to-date.

• Accuracy – Ingenuity Pathways encodes relationships that are manually curated by expert biologists and checked by a second curator in a strict quality control process.

• Semantic Consistency – To support consistent representation of pathway models, a proprietary ontology was built.
– The ontology contains 300,000 classes of biological objects spanning genes, proteins, cells and cell components, anatomy, etc.

Page 62: T-statistics is widespread in assessing differential expression

Ingenuity

• The Ingenuity Pathways Analysis uses the human curated Ingenuity database to identify relations between genes.

• The relations that can be identified are:
– Regulates
– Regulated by
– Binds

Because of the limited time, only part of the Ingenuity functionality will be described.

Page 63: T-statistics is widespread in assessing differential expression

If BH correction can be applied to correct type I errors, we can move on to selecting the subset of differentially expressed genes.
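The BH (Benjamini-Hochberg) step mentioned above can be sketched in a few lines. This is a minimal illustration in plain Python, not the routine used by limma or affylmGUI; the probe names, p-values and the 0.05 cut-off are made up for the example.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values, in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for five probe sets.
raw = {"probe_a": 0.001, "probe_b": 0.008, "probe_c": 0.039,
       "probe_d": 0.041, "probe_e": 0.60}
names = list(raw)
adj = benjamini_hochberg([raw[n] for n in names])

# Keep the probe sets whose adjusted p-value passes the (assumed) cut-off.
selected = [n for n, q in zip(names, adj) if q <= 0.05]
print(selected)
```

Note that probe_c and probe_d would pass an unadjusted 0.05 cut-off but are dropped after the adjustment.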

Page 64: T-statistics is widespread in assessing differential expression


Page 65: T-statistics is widespread in assessing differential expression

To create a template A, you can use a function implemented in affylmGUI.

Page 66: T-statistics is widespread in assessing differential expression
Page 67: T-statistics is widespread in assessing differential expression

The p-value used for subsetting discriminates the differentially expressed probe sets from the other probe sets, which are used for the Ingenuity functional class enrichment.

Page 68: T-statistics is widespread in assessing differential expression

Start an Ingenuity session at: https://analysis.ingenuity.com/pa/login/login.jsp

Page 69: T-statistics is widespread in assessing differential expression

Data can be uploaded to an existing project, or a new project can be created.

Page 70: T-statistics is widespread in assessing differential expression

Browse to the file and click on Next.

Click on Next.

Page 71: T-statistics is widespread in assessing differential expression

Select the type of values associated with the uploaded data.

Page 72: T-statistics is widespread in assessing differential expression

Expression Value 1 is a log ratio.
Both Expression Value 2 and 3 are p-values.

Click on

Page 73: T-statistics is widespread in assessing differential expression

To consider only the differentially expressed genes, Exp Val Type 2 is set to 0 and the Recalculate button is pressed.
60 genes will be used in this analysis to generate networks.

Click on

The analysis will take a while to complete.

Page 74: T-statistics is widespread in assessing differential expression

Result of the analysis will appear in the project manager

Double click

Page 75: T-statistics is widespread in assessing differential expression

Select a network and click on

Page 76: T-statistics is widespread in assessing differential expression
Page 77: T-statistics is widespread in assessing differential expression
Page 78: T-statistics is widespread in assessing differential expression


Page 79: T-statistics is widespread in assessing differential expression

Export data and use them to extract a specific subset of genes from a topTable

Page 80: T-statistics is widespread in assessing differential expression

Overlay expression values coming from a specific analysis

Page 81: T-statistics is widespread in assessing differential expression

Only 1 gene was found differentially expressed, and a few others belong to the microarray set.

Page 82: T-statistics is widespread in assessing differential expression

After deleting all the uninformative genes, it is possible to expand the information related to the genes present in the network.

Page 83: T-statistics is widespread in assessing differential expression
Page 84: T-statistics is widespread in assessing differential expression

Exercise 14

• Create a template A for the set of probe sets differentially expressed in MCF7 upon IGF1 treatment.

• Do the same for the differentially expressed probe sets found in SKER3 upon IGF1 treatment.

• Load the data in Ingenuity.
• Generate networks for the two data sets.
• Question:

– Is there any overlap between MCF7 and SKER3 differentially expressed genes at the level of Ingenuity networks?

• Yes……………… No………………………….

– Is there any cluster characterized by a heavily connected gene (hub gene)?

• Yes……………… No………………………….

Page 85: T-statistics is widespread in assessing differential expression

Exercise 15

• Continue the analysis on the previously loaded data.

• Select IGF1 or Estrogen Receptor and grow the connections with other genes.

• Overlay the expression of the differentially expressed genes found in MCF7 or in SKER3.

• Question:
– Is there any direct connection between differentially expressed genes and the estrogen receptor?
• Yes……………………….No……………………..

Page 86: T-statistics is widespread in assessing differential expression

Clustering

Page 87: T-statistics is widespread in assessing differential expression

Is an ideal clustering procedure available?

• No!
– Each clustering algorithm has its ideal data structure.

• Since we do not know the structure of the data:

• Various clustering methods have to be applied in order to identify the one that best fits the data under analysis.

N.B. Tmev 4.0 (www.tigr.org) was used for this presentation.

Page 88: T-statistics is widespread in assessing differential expression

Supervised versus unsupervised clustering

• Supervised clustering tries to find the best partition for data that belong to a known set of classes.

• Unsupervised clustering tries to define the number and the size of the classes into which the transcription profiles can be fitted.

Page 89: T-statistics is widespread in assessing differential expression

The Expression Matrix is a representation of data from multiple microarray experiments.

$$
X = \begin{pmatrix}
X_{11} & X_{12} & X_{13} & \dots & X_{1d} \\
X_{21} & X_{22} & X_{23} & \dots & X_{2d} \\
\vdots & \vdots & \vdots & & \vdots \\
X_{n1} & X_{n2} & X_{n3} & \dots & X_{nd}
\end{pmatrix}
$$

Rows: N probe sets. Columns: D experiments.

Each element is a log ratio.

Up modulation (+) is usually represented as RED and down modulation (-) as GREEN; 0 indicates no change.

Page 90: T-statistics is widespread in assessing differential expression

Large data sets can be loaded as tab-delimited files.

To load them you need:
1) A tab-delimited file with array names in the first row and probe set ids in the first column.
2) A target file containing the clinical information. The usual Target column of the target file should have these characteristics.

Page 91: T-statistics is widespread in assessing differential expression

This file can be generated by joining the columns of the clinical parameters with an underscore "_".

Join function in Excel
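The same join can be done outside Excel. The sketch below is only an illustration: the column names (Name, ER, Grade) and the sample rows are hypothetical, and the real clinical file will have its own headers.

```python
import csv
import io

# Hypothetical tab-delimited clinical table; real column names will differ.
clinical = "Name\tER\tGrade\nS1\tpos\t2\nS2\tneg\t3\n"


def build_targets(text, columns):
    """Join the chosen clinical columns with '_' to form one Target value per sample."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [(row["Name"], "_".join(row[c] for c in columns)) for row in reader]


targets = build_targets(clinical, ["ER", "Grade"])
for name, target in targets:
    print(name, target, sep="\t")
```

Each sample ends up with a single Target string such as pos_2, which keeps all the clinical parameters in one column.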

Page 92: T-statistics is widespread in assessing differential expression
Page 93: T-statistics is widespread in assessing differential expression
Page 94: T-statistics is widespread in assessing differential expression

Loading data as a tab-delimited file

Select tab-delimited files as the format description.

Export expression data as tab-delimited files.

Page 95: T-statistics is widespread in assessing differential expression

Select the first numerical value and load the data.

Page 96: T-statistics is widespread in assessing differential expression

Expression Vectors

• Gene Expression Vectors encapsulate the expression of a gene over a set of experimental conditions or sample types.

[Figure: example expression vector with values -0.8, 0.8, 1.5, 1.8, 0.5, -1.3, -0.4, 1.5 at positions 1 to 8; y axis log2(time_t/time_0), from -2 to 2]

Page 97: T-statistics is widespread in assessing differential expression

Data reformatting

• Clustering can be performed using a virtual array as reference:
– A virtual array can be calculated by averaging gene expression over the experimental conditions.

• Clustering can be performed by building virtual two-dye experiments:

$$\log_2 \frac{T_i}{C_j} \quad \text{or} \quad \log_2 \frac{T}{C}, \qquad \text{where } i = 1 \dots I,\ j = 1 \dots J$$

• Clustering can also be performed without the use of a common reference by:
– Gene centering
– Experiment centering

Page 98: T-statistics is widespread in assessing differential expression

Data reformatting

Gene centering:

$$Z_i = \frac{X_i - \mu_{\text{row}}}{\sigma_{\text{row}}}$$

Array centering:

$$Z_i = \frac{X_i - \mu_{\text{col}}}{\sigma_{\text{col}}}$$

Centering at the gene level removes the scaling differences!
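A small sketch of gene (row) centering, assuming it means subtracting the row mean and dividing by the row standard deviation. The function name and the toy values are illustrative only, and Tmev's own row adjustment may differ in detail (for instance in which standard deviation estimator it uses).

```python
from statistics import mean, stdev


def center_rows(matrix):
    """Z-score each row (gene): subtract the row mean, divide by the row SD."""
    result = []
    for row in matrix:
        mu = mean(row)
        sd = stdev(row)  # sample standard deviation (an assumption)
        if sd == 0:
            # A flat profile carries no shape information; return zeros.
            result.append([0.0 for _ in row])
        else:
            result.append([(x - mu) / sd for x in row])
    return result


# Two genes with the same profile shape but a tenfold difference in scale.
expression = [
    [1.0, 2.0, 3.0],
    [10.0, 20.0, 30.0],
]
centered = center_rows(expression)
print(centered)
```

After centering, the two rows are identical, which is the sense in which the scaling differences are removed.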

Page 99: T-statistics is widespread in assessing differential expression

Various data reformatting options are available.

We will mainly use gene/row adjustment.

Page 100: T-statistics is widespread in assessing differential expression

Distance and Similarity

• The ability to calculate a distance (or similarity, its inverse) between two expression vectors is fundamental to clustering algorithms.

• Distance between vectors is the basis upon which decisions are made when grouping similar patterns of expression.

• Selection of a distance metric defines the concept of distance.

Page 101: T-statistics is widespread in assessing differential expression

Distance is Defined by a Metric

x = (5,5), y = (9,8)

Euclidean distance: $d(x,y) = \sqrt{4^2 + 3^2} = 5$

Manhattan distance: $d(x,y) = 4 + 3 = 7$
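The two distances on this slide can be checked with a short sketch. The function names are chosen for the example only.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


x = (5, 5)
y = (9, 8)
print(euclidean(x, y))  # 5.0
print(manhattan(x, y))  # 7
```

The same pair of points is 5 apart under one metric and 7 apart under the other, so the choice of metric changes which expression vectors count as close.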

Page 102: T-statistics is widespread in assessing differential expression

Distance is Defined by a Metric

Distance metric: Euclidean | Pearson
D: 4.2 | 1.00
D: 1.4 | 0.90

[Figure: pairs of expression profiles; y axis log2(time_t/time_0), from -2 to 2]
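To see why the two metrics can disagree, the sketch below compares two made-up profiles that have the same shape but different amplitude. It does not reproduce the numbers on the slide; the profiles and function names are illustrative.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation coefficient between two profiles."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Same shape, three times the amplitude (hypothetical log ratios).
profile_a = [-1.0, 0.0, 1.0, 0.0]
profile_b = [-3.0, 0.0, 3.0, 0.0]

d_euclid = euclidean(profile_a, profile_b)
r = pearson(profile_a, profile_b)
print(round(d_euclid, 3), round(r, 3))
```

The Euclidean distance is clearly non-zero, while the Pearson correlation is 1: Euclidean distance reacts to amplitude, Pearson only to the shape of the profile.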

Page 103: T-statistics is widespread in assessing differential expression

Many distance metrics are available. If no selection is made, the default for each type of clustering approach will be used.

Page 104: T-statistics is widespread in assessing differential expression

Hierarchical Clustering (HCL)

• HCL is an agglomerative/divisive clustering method.

• The iterative process continues until all groups are connected in a hierarchical tree.

Page 105: T-statistics is widespread in assessing differential expression
Page 106: T-statistics is widespread in assessing differential expression
Page 107: T-statistics is widespread in assessing differential expression

Hierarchical Clustering (agglomerative)

g1 g2 g3 g4 g5 g6 g7 g8

g1 is most like g8

g1 g8 g2 g3 g4 g5 g6 g7

g4 is most like {g1, g8}

g1 g8 g4 g2 g3 g5 g6 g7

Page 108: T-statistics is widespread in assessing differential expression

Hierarchical Clustering

g1 g8 g4 g2 g3 g5 g6 g7

g5 is most like g7

g1 g8 g4 g2 g3 g5 g7 g6

{g5, g7} is most like {g1, g4, g8}

g1 g8 g4 g5 g7 g2 g3 g6

Page 109: T-statistics is widespread in assessing differential expression

Hierarchical Tree

g1 g8 g4 g5 g7 g2 g3 g6

Page 110: T-statistics is widespread in assessing differential expression

Hierarchical Clustering

• During construction of the hierarchy, decisions must be made to determine which clusters should be joined.

• The distance or similarity between clusters must be calculated. The rules that govern this calculation are linkage methods.

Page 111: T-statistics is widespread in assessing differential expression

Agglomerative Linkage Methods

• Linkage methods are rules or metrics that return a value that can be used to determine which elements (clusters) should be linked.

• Three linkage methods that are commonly used are:
– Single Linkage
– Average Linkage
– Complete Linkage

Page 112: T-statistics is widespread in assessing differential expression

Single Linkage

• Cluster-to-cluster distance is defined as the minimum distance between members of one cluster and members of another cluster.

• Single linkage tends to create 'elongated' clusters with individual genes chained onto clusters.

[Figure: single-linkage distance D_AB between clusters A and B]

Page 113: T-statistics is widespread in assessing differential expression

Average Linkage

• Cluster-to-cluster distance is defined as the average distance between all members of one cluster and all members of another cluster.

• Average linkage has a slight tendency to produce clusters of similar variance.

[Figure: average-linkage distance D_AB between clusters A and B]

Page 114: T-statistics is widespread in assessing differential expression

Complete Linkage

• Cluster-to-cluster distance is defined as the maximum distance between members of one cluster and members of another cluster.

• Complete linkage tends to create clusters of similar size and variability.

[Figure: complete-linkage distance D_AB between clusters A and B]
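The three linkage rules differ only in how the pairwise distances between two clusters are summarised. A minimal sketch, assuming Euclidean distance between members and using made-up two-dimensional points:

```python
import math
from itertools import product


def cluster_distance(cluster_a, cluster_b, method):
    """Distance D_AB between two clusters under a given linkage rule."""
    # All member-to-member distances between the two clusters.
    pairwise = [math.dist(a, b) for a, b in product(cluster_a, cluster_b)]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")


# Hypothetical clusters A and B laid out on a line.
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(4.0, 0.0), (6.0, 0.0)]

for method in ("single", "average", "complete"):
    print(method, cluster_distance(A, B, method))
```

For these points the pairwise distances are 4, 6, 3 and 5, so single linkage gives 3, average linkage 4.5 and complete linkage 6.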

Page 115: T-statistics is widespread in assessing differential expression

HCL

• A clustering result can be represented by many different graphical views.

[Figure: the same tree drawn with different arrangements of leaves 1 to 4]

Page 116: T-statistics is widespread in assessing differential expression

HCL

• HCL does not converge to a unique result, and each run represents one of the possible solutions.

• To obtain information on cluster stability, a resampling method should be applied:
– Bootstrapping:
• resampling with replacement
– Jackknifing:
• resampling without replacement

Page 117: T-statistics is widespread in assessing differential expression

To perform HCL, click on the HCL icon.

Page 118: T-statistics is widespread in assessing differential expression

To see the results, click on

Page 119: T-statistics is widespread in assessing differential expression

The visualization can be reformatted.

Page 120: T-statistics is widespread in assessing differential expression

Bootstrapping (ST)

Bootstrapping – resampling with replacement

Original expression matrix: rows Gene 1 to Gene 6, columns Exp 1 to Exp 6.

Various bootstrapped matrices (by experiments): rows Gene 1 to Gene 6; the columns are experiments drawn with replacement, so some experiments (e.g. Exp 4) appear more than once and others are left out.

Page 121: T-statistics is widespread in assessing differential expression

Jackknifing (ST)

Jackknifing – resampling without replacement

Original expression matrix: rows Gene 1 to Gene 6, columns Exp 1 to Exp 6.

Various jackknifed matrices (by experiments), rows Gene 1 to Gene 6:
– Exp 1, Exp 3, Exp 4, Exp 5, Exp 6 (Exp 2 left out)
– Exp 1, Exp 2, Exp 3, Exp 4, Exp 6 (Exp 5 left out)

1000 bootstraps, Euclidean distance, average clustering
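The two resampling schemes can be sketched on a toy matrix. This is an illustration of resampling by experiments (columns) only; it is not Tmev's implementation, and the function names and matrix values are invented for the example.

```python
import random


def bootstrap_columns(matrix, rng):
    """Draw as many columns as the original has, with replacement."""
    n_cols = len(matrix[0])
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return cols, [[row[c] for c in cols] for row in matrix]


def jackknife_columns(matrix, left_out):
    """Drop one column; the remaining columns each appear exactly once."""
    cols = [c for c in range(len(matrix[0])) if c != left_out]
    return cols, [[row[c] for c in cols] for row in matrix]


# Toy matrix: 6 genes (rows) by 6 experiments (columns).
matrix = [[gene * 10 + exp for exp in range(6)] for gene in range(6)]

rng = random.Random(0)  # fixed seed so the run is repeatable
boot_cols, boot = bootstrap_columns(matrix, rng)
jack_cols, jack = jackknife_columns(matrix, left_out=1)  # leave out Exp 2

print("bootstrap columns:", [c + 1 for c in boot_cols])
print("jackknife columns:", [c + 1 for c in jack_cols])
```

Clustering each resampled matrix and counting how often a node reappears gives a rough measure of how stable that node is.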

Page 122: T-statistics is widespread in assessing differential expression

To run HCL with resampling

Page 123: T-statistics is widespread in assessing differential expression

To see the results, click on

Page 124: T-statistics is widespread in assessing differential expression

A subset of genes can be selected by clicking on the node of interest.

By placing the mouse over the node and clicking the right mouse button, various information about the group of genes can be saved.

Page 125: T-statistics is widespread in assessing differential expression
Page 126: T-statistics is widespread in assessing differential expression

Principal component analysis

• Principal component analysis (PCA) involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components.

• The first principal component accounts for as much of the variability in the data as possible.

• Each succeeding component accounts for as much of the remaining variability as possible.

• The components can be thought of as axes in n-dimensional space, where n is the number of components. Each axis represents a different trend in the data.

Often the first three components account for most of the variability. Therefore, PCA can reasonably be represented in a 3D space.
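As a rough illustration of what "accounts for as much of the variability as possible" means, the sketch below finds the first principal component of a tiny made-up data set by power iteration on the covariance matrix. This is a teaching sketch in plain Python, not the method used by Tmev; real analyses would use a numerical library.

```python
import math


def first_principal_component(data, iterations=200):
    """Return (direction, fraction of variance explained) for the first PC."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    # Sample covariance matrix of the variables (columns).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeated multiplication converges to the top eigenvector.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    total = sum(cov[i][i] for i in range(d))
    return v, variance / total


# Two strongly correlated variables (hypothetical values, roughly y = 2x).
data = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
direction, explained = first_principal_component(data)
print([round(x, 3) for x in direction], round(explained, 4))
```

Because the two variables are almost perfectly correlated, a single component captures nearly all of the variability, and the second, orthogonal component is left with almost nothing.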

Page 127: T-statistics is widespread in assessing differential expression

The 2nd PC will be orthogonal to the 1st.

Page 128: T-statistics is widespread in assessing differential expression

[Figure: PCA plot showing Cluster 1 and Cluster 2; labels MCF7, SKER-3, E2, IGF]

The parameters that produce the partition in PCA depend on the correlated variables.

Page 129: T-statistics is widespread in assessing differential expression

Quaglino et al. J. Clin. Invest. 2004

The parameters that produce the partition in PCA depend on the correlated variables.

Page 130: T-statistics is widespread in assessing differential expression

We have already used PCA for quality control.

Page 131: T-statistics is widespread in assessing differential expression

To see the results, click on

Click the right mouse button over the 3D/2D view.

Page 132: T-statistics is widespread in assessing differential expression

Cluster Affinity Search Technique (CAST)

• CAST uses an iterative approach to segregate elements with 'high affinity' into a cluster.

• The process iterates through two phases:
– addition of high affinity elements to the cluster being created
– removal or clean-up of low affinity elements from the cluster being created

Page 133: T-statistics is widespread in assessing differential expression

Clustering Affinity Search Technique (CAST) – 1

Affinity = a measure of similarity between a gene and all the genes in a cluster.
Threshold affinity = user-specified criterion for retaining a gene in a cluster, defined as a percentage of the maximum affinity at that point.

1. Create a new empty cluster C1.

2. Set the initial affinity of all genes to zero.

3. Move the two most similar genes into the new cluster.

4. Update the affinities of all the genes (new affinity of a gene = its previous affinity + its similarity to the gene(s) newly added to cluster C1).

ADD GENES:

5. While there exists an unassigned gene whose affinity to cluster C1 exceeds the user-specified threshold affinity, pick the unassigned gene whose affinity is the highest and add it to cluster C1. Update the affinities of all the genes accordingly.

[Figure: empty cluster C1 and unassigned genes G1 to G15]

Page 134: T-statistics is widespread in assessing differential expression

CAST – 2

REMOVE GENES:

6. When there are no more unassigned high-affinity genes, check whether cluster C1 contains any elements whose affinity is lower than the current threshold. If so, remove the lowest-affinity gene from C1. Update the affinities of all genes by subtracting from each gene's affinity its similarity to the removed gene.

7. Repeat step 6 while C1 contains a low-affinity gene.

8. Repeat steps 5-7 as long as changes occur to cluster C1.

9. Form a new cluster with the genes that were not assigned to cluster C1, repeating steps 1-8.

10. Keep forming new clusters following steps 1-9, until all genes have been assigned to a cluster.

[Figure: current cluster C1 and unassigned genes G1 to G15]
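The add/remove loop above can be sketched as follows. This is a simplified illustration under stated assumptions, not Tmev's implementation: the threshold is taken as t times the current cluster size (a common CAST formulation, assumed here rather than taken from the slides), affinities are recomputed directly instead of being updated incrementally, a cap on rounds guards against oscillation, and the similarity matrix is invented.

```python
def cast(sim, t, max_rounds=100):
    """Cluster genes from a symmetric similarity matrix (diagonal = 1.0)."""
    unassigned = set(range(len(sim)))
    clusters = []
    while unassigned:
        if len(unassigned) == 1:
            clusters.append(sorted(unassigned))
            break
        # Seed the new cluster with the two most similar unassigned genes.
        _, i, j = max((sim[a][b], a, b)
                      for a in unassigned for b in unassigned if a < b)
        cluster = {i, j}

        def affinity(g):
            # Affinity of gene g = summed similarity to the current members.
            return sum(sim[g][c] for c in cluster)

        for _ in range(max_rounds):
            changed = False
            # ADD: take the highest-affinity outsider while it passes the threshold.
            while unassigned - cluster:
                best = max(unassigned - cluster, key=affinity)
                if affinity(best) < t * len(cluster):
                    break
                cluster.add(best)
                changed = True
            # REMOVE: drop the lowest-affinity member while it fails the threshold.
            while len(cluster) > 1:
                worst = min(cluster, key=affinity)
                if affinity(worst) >= t * len(cluster):
                    break
                cluster.remove(worst)
                changed = True
            if not changed:
                break
        clusters.append(sorted(cluster))
        unassigned -= cluster
    return clusters


# Hypothetical similarities: genes 0-2 resemble each other, as do genes 3-4.
sim = [
    [1.00, 0.90, 0.80, 0.10, 0.20],
    [0.90, 1.00, 0.85, 0.10, 0.10],
    [0.80, 0.85, 1.00, 0.20, 0.10],
    [0.10, 0.10, 0.20, 1.00, 0.95],
    [0.20, 0.10, 0.10, 0.95, 1.00],
]
clusters = cast(sim, t=0.5)
print(clusters)
```

Unlike HCL, the number of clusters is not chosen in advance; it follows from the threshold.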

Page 135: T-statistics is widespread in assessing differential expression

Click on

Parameter to be set

Page 136: T-statistics is widespread in assessing differential expression

SOMs

Page 137: T-statistics is widespread in assessing differential expression

Self-organizing maps (SOMs) – 1

1. Specify the number of nodes (clusters) desired, and also specify a 2-D geometry for the nodes, e.g., rectangular or hexagonal.

N = Nodes, G = Genes

[Figure: genes G1 to G29 and nodes N1 to N6]

Page 138: T-statistics is widespread in assessing differential expression

SOMs – 2

2. Choose a random gene, e.g., G9.

3. Move the nodes in the direction of G9. The node closest to G9 (N2) is moved the most, and the other nodes are moved by smaller, varying amounts. The further away a node is from N2, the less it is moved.

[Figure: genes G1 to G29 and nodes N1 to N6]

Page 139: T-statistics is widespread in assessing differential expression

SOM Neighborhood Options

• Bubble neighborhood (defined by a radius): some nodes move, alpha is constant.

• Gaussian neighborhood: all nodes move, alpha is scaled.

[Figure: genes G7 to G11 and nodes N1 to N6 under each neighborhood option]

Page 140: T-statistics is widespread in assessing differential expression

SOMs – 3

4. Steps 2 and 3 (i.e., choosing a random gene and moving the nodes towards it) are repeated many (usually several thousand) times. However, with each iteration, the amount that the nodes are allowed to move is decreased.

5. Finally, each node will "nestle" among a cluster of genes, and a gene will be considered to be in the cluster if its distance to the node in that cluster is less than its distance to any other node.

[Figure: genes G1 to G29 with nodes N1 to N6 in their final positions]
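Steps 1 to 5 can be sketched in plain Python. This is a minimal illustration under stated assumptions: a Gaussian neighborhood, a linearly decreasing learning rate alpha and neighborhood width, and nodes initialised on randomly chosen genes. The parameter names, defaults and toy data are invented and do not correspond to Tmev's settings.

```python
import math
import random


def train_som(genes, grid, iterations=2000, alpha0=0.5, sigma0=1.0, seed=0):
    """Move the nodes towards randomly chosen genes; return the node vectors."""
    rng = random.Random(seed)
    # Start each node on a randomly chosen gene (an arbitrary choice).
    nodes = [list(rng.choice(genes)) for _ in grid]
    for step in range(iterations):
        remaining = 1.0 - step / iterations
        alpha = alpha0 * remaining             # movement shrinks every iteration
        sigma = max(sigma0 * remaining, 0.01)  # neighborhood narrows as well
        gene = rng.choice(genes)
        # The winner is the node closest to the chosen gene in expression space.
        winner = min(range(len(nodes)), key=lambda k: math.dist(nodes[k], gene))
        for k, node in enumerate(nodes):
            # Nodes far from the winner on the 2-D grid are moved less.
            grid_dist = math.dist(grid[k], grid[winner])
            h = math.exp(-grid_dist ** 2 / (2 * sigma ** 2))
            for d in range(len(node)):
                node[d] += alpha * h * (gene[d] - node[d])
    return nodes


def assign(genes, nodes):
    """Put each gene in the cluster of its nearest node."""
    return [min(range(len(nodes)), key=lambda k: math.dist(nodes[k], g))
            for g in genes]


# Hypothetical expression vectors and a 1 x 2 rectangular grid of nodes.
genes = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
grid = [(0, 0), (0, 1)]
nodes = train_som(genes, grid)
labels = assign(genes, nodes)
print(labels)
```

Replacing the Gaussian weight h with 1 inside a fixed grid radius and 0 outside it would give the bubble neighborhood instead.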

Page 141: T-statistics is widespread in assessing differential expression

Click on

Page 142: T-statistics is widespread in assessing differential expression

Exercise 16 (result representation)

• This exercise is based on the breast cancer data set published by Chin in Cancer Cell, 2006 (hgu133A HT platform).
• Using the clinical data (E-TABM-158-clinical.data.txt) available in the large.data.set directory:
– Construct a target file, like the one used in the time course.
– Load the data in E-TABM-158-processed-data.txt using the created target file.
– Filter the data by IQR 0.5, and require that 25% of samples have a signal intensity over 100.
– Attach the hgu133a annotation library to the data set.
– Save the project as ex14.lma.
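The logic of the two filters can be sketched as follows. This is not the oneChannelGUI implementation: computing the IQR on log2 values, reading the thresholds as "at least", and all function and probe set names are assumptions made for illustration.

```python
import math
import statistics

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

def passes_filters(signal, min_iqr=0.5, min_signal=100.0, min_fraction=0.25):
    """Keep a probe set if it varies enough and is expressed in enough samples."""
    log_signal = [math.log2(v) for v in signal]      # assumed: IQR on the log2 scale
    varies = iqr(log_signal) >= min_iqr
    expressed = sum(v > min_signal for v in signal) / len(signal) >= min_fraction
    return varies and expressed

# Hypothetical linear-scale intensities: probe set name -> one value per sample.
data = {
    "flat_probe": [120.0, 125.0, 118.0, 130.0],      # expressed but nearly constant
    "low_probe": [10.0, 80.0, 20.0, 60.0],           # variable but never above 100
    "useful_probe": [50.0, 400.0, 90.0, 800.0],      # variable and expressed
}
kept = [name for name, signal in data.items() if passes_filters(signal)]
```

Only the third probe set passes both filters: the first fails the IQR filter and the second fails the intensity filter.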

Page 143: T-statistics is widespread in assessing differential expression

Exercise 16

– Filter the data on the basis of a list of EGs derived by Ingenuity related to cell signaling (use the advanced search at Ingenuity).
– Export the data as a tab-delimited file (ex14.filtered.txt), and save as ex14.filtered.lma.

• After row mean centering, perform:
– Hierarchical clustering, and select those gene clusters that group the samples into two main groups. Label those groups.
– Apply CAST or SOM and see how the HCL groups of genes are reorganized.

Page 144: T-statistics is widespread in assessing differential expression

Exercise 16 (30 minutes)

• After row mean centering, perform:
– Hierarchical clustering, and select those gene clusters that group the samples into two main groups. Label those groups.
– Apply CAST or SOM and see how the HCL groups of genes are reorganized.
– Subset and save the clusters you have identified.
– Combine them in Excel.
– Load them in TMEV and see how PCA divides the samples.
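Row mean centering, the first step of the exercise, subtracts from each gene (row) its own mean across samples, so that clustering compares the shape of the profiles rather than their absolute level. A minimal sketch, with a hypothetical function name and made-up values:

```python
def row_mean_center(matrix):
    """Subtract each row's mean from every value in that row."""
    centered = []
    for row in matrix:
        mean = sum(row) / len(row)
        centered.append([value - mean for value in row])
    return centered

# Hypothetical log2 expression values: rows are genes, columns are samples.
expr = [
    [8.0, 9.0, 10.0],
    [2.0, 2.0, 5.0],
]
centered = row_mean_center(expr)
```

After centering, every row has mean zero, so a highly expressed gene and a weakly expressed gene with the same pattern across samples look alike.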

Page 145: T-statistics is widespread in assessing differential expression

Classification

Page 146: T-statistics is widespread in assessing differential expression

Classification

• The task of diagnosing cancer on the basis of microarray data has been termed class prediction in the literature.

• The task is to classify and predict the diagnostic category of a sample on the basis of its gene expression profile.

Page 147: T-statistics is widespread in assessing differential expression

The example classification problem used in the PAM publication

• Data for small round blue cell tumors (SRBCT) of childhood (Khan et al. 2001), consisting of expression measurements on 2,308 genes, were obtained from glass-slide cDNA microarrays.

• The tumors are classified as:
– Burkitt lymphoma (BL),
– Ewing sarcoma (EWS),
– neuroblastoma (NB),
– rhabdomyosarcoma (RMS).

• A total of 63 training samples and 25 test samples were provided, although five of the latter were not SRBCTs.

Page 148: T-statistics is widespread in assessing differential expression

PAM

• PAM is a modification of the nearest-centroid method, called “nearest shrunken centroid.”
• PAM uses “de-noised” versions of the centroids as prototypes for each class.

Centroids (grey) and shrunken centroids (red) for the SRBCT dataset. The overall centroid has been subtracted from the centroid of each class.
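The idea of shrinking class centroids towards the overall centroid can be sketched in a few lines. This is a simplified illustration, not the pamr code: the function names are hypothetical, and the normalising factor m_k, the use of the median of the per-gene standard deviations as s_0, and equal class priors are assumptions of the sketch.

```python
import math
import statistics

def soft_threshold(d, delta):
    """Move d towards zero by delta; anything within delta of zero becomes 0."""
    return math.copysign(max(abs(d) - delta, 0.0), d)

def fit_shrunken_centroids(samples, labels, delta):
    """Return shrunken centroids, shrunken differences and per-gene spread."""
    n, p = len(samples), len(samples[0])
    classes = sorted(set(labels))
    members = {k: [x for x, y in zip(samples, labels) if y == k] for k in classes}
    overall = [sum(x[i] for x in samples) / n for i in range(p)]
    centroid = {k: [sum(x[i] for x in xs) / len(xs) for i in range(p)]
                for k, xs in members.items()}
    # Pooled within-class standard deviation s_i of each gene.
    s = [math.sqrt(sum((x[i] - centroid[y][i]) ** 2
                       for x, y in zip(samples, labels)) / (n - len(classes)))
         for i in range(p)]
    s0 = statistics.median(s)                       # assumed choice of s_0
    spread = [s_i + s0 for s_i in s]
    shrunk, diffs = {}, {}
    for k, xs in members.items():
        m_k = math.sqrt(1.0 / len(xs) - 1.0 / n)    # assumed normalising factor
        diffs[k] = [soft_threshold((centroid[k][i] - overall[i]) / (m_k * spread[i]), delta)
                    for i in range(p)]
        shrunk[k] = [overall[i] + m_k * spread[i] * diffs[k][i] for i in range(p)]
    return shrunk, diffs, spread

def classify(x, shrunk, spread):
    """Pick the class whose shrunken centroid is nearest (equal priors assumed)."""
    def score(k):
        return sum(((x[i] - shrunk[k][i]) / spread[i]) ** 2 for i in range(len(x)))
    return min(shrunk, key=score)

# Hypothetical training data: gene 0 separates the classes, gene 1 is mostly noise.
samples = [[1.0, 5.0], [1.2, 5.3], [0.8, 4.9], [3.0, 5.1], [3.2, 4.8], [2.8, 5.0]]
labels = ["A", "A", "A", "B", "B", "B"]
shrunk, diffs, spread = fit_shrunken_centroids(samples, labels, delta=1.0)
selected = [i for i in range(2) if any(diffs[k][i] != 0 for k in diffs)]
prediction = classify([1.1, 5.2], shrunk, spread)
```

A gene whose shrunken difference is zero in every class has the same centroid value in all classes, so it no longer influences the classification. In this toy example only gene 0 survives the shrinkage.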

Page 149: T-statistics is widespread in assessing differential expression

• SRBCT classification: training (tr, green), cross-validation (cv, red), and test (te, blue) errors are shown as a function of the threshold parameter Δ.

• The value 4.34 is chosen and yields a subset of 43 selected genes.

Page 150: T-statistics is widespread in assessing differential expression

• Shrunken differences d_ik for the 43 genes having at least one nonzero difference.
• The genes with nonzero components in each class are almost mutually exclusive.
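For reference, the shrunken differences are usually written as below. This is a sketch of the standard textbook formulation rather than a transcription of the slide: i indexes genes and k classes, \bar{x}_{ik} is the class centroid, \bar{x}_i the overall centroid, s_i the pooled within-class standard deviation of gene i, s_0 a positive constant, \Delta the threshold, and the form of the normalising factor m_k is an assumption.

```latex
d_{ik} = \frac{\bar{x}_{ik} - \bar{x}_i}{m_k\,(s_i + s_0)},
\qquad m_k = \sqrt{\frac{1}{n_k} - \frac{1}{n}}

d'_{ik} = \operatorname{sign}(d_{ik})\,\bigl(|d_{ik}| - \Delta\bigr)_{+}

\bar{x}'_{ik} = \bar{x}_i + m_k\,(s_i + s_0)\,d'_{ik}
```

A gene with d'_{ik} = 0 in every class k drops out of the classifier, which is how a given threshold leaves only a subset of genes.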

Page 151: T-statistics is widespread in assessing differential expression

PAM performance

• Misclassification rates for seven classifiers on six microarray datasets, based on 50 random partitions into learning sets (two-thirds of the data) and test sets (one-third of the data).
• The nearest shrunken centroid classifier (PAM), as well as the simple benchmarks NNR and DLDA, do surprisingly well and can almost keep up, except on the prostate data (the largest dataset in the analysis).
• The success of such methodologically simple tools is limited to gene expression datasets with small sample size.
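The evaluation scheme described above (repeated random partitions into two-thirds learning and one-third test data) can be sketched as follows. A plain nearest-centroid classifier stands in for the seven classifiers of the comparison, the partitions are not stratified by class, and all names and data are hypothetical.

```python
import math
import random

def nearest_centroid_fit(xs, ys):
    """Compute one centroid per class from the learning set."""
    centroids = {}
    for k in set(ys):
        rows = [x for x, y in zip(xs, ys) if y == k]
        centroids[k] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def nearest_centroid_predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda k: math.dist(x, centroids[k]))

def misclassification_rate(xs, ys, n_partitions=50, train_fraction=2 / 3, seed=0):
    """Average test error over repeated random learning/test partitions."""
    rng = random.Random(seed)
    n_train = round(len(xs) * train_fraction)
    rates = []
    for _ in range(n_partitions):
        order = list(range(len(xs)))
        rng.shuffle(order)
        train, test = order[:n_train], order[n_train:]
        model = nearest_centroid_fit([xs[i] for i in train], [ys[i] for i in train])
        errors = sum(nearest_centroid_predict(model, xs[i]) != ys[i] for i in test)
        rates.append(errors / len(test))
    return sum(rates) / len(rates)

# Hypothetical, well-separated two-class data with six samples per class.
xs = [[0.1 * i, 0.2 * i] for i in range(6)] + [[5 + 0.1 * i, 5 - 0.1 * i] for i in range(6)]
ys = ["neg"] * 6 + ["pos"] * 6
rate = misclassification_rate(xs, ys)
```

Averaging over many partitions gives a more stable error estimate than a single learning/test split, which matters when the number of samples is small.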

Page 152: T-statistics is widespread in assessing differential expression

Large data set

• The oneChannelGUI interface has some limits (RAM) in loading and handling large sets of .CEL files.
• This is especially true for a large time course experiment like our example.
• To overcome this problem, probe set average expression intensities are calculated by Expression Console.

Page 153: T-statistics is widespread in assessing differential expression
Page 154: T-statistics is widespread in assessing differential expression
Page 155: T-statistics is widespread in assessing differential expression

When loading a tab-delimited file, the Bioconductor annotation library is not automatically defined.

Annotation library information can be attached using:

Page 156: T-statistics is widespread in assessing differential expression

Reorganize clinical information

Load a large data set as a tab-delimited file.

Save in a file the description of the clinical parameters collapsed in the Target column of the targets file.

Page 157: T-statistics is widespread in assessing differential expression

Reorganize clinical information

Page 158: T-statistics is widespread in assessing differential expression

Run PAMR analysis

Page 159: T-statistics is widespread in assessing differential expression
Page 160: T-statistics is widespread in assessing differential expression
Page 161: T-statistics is widespread in assessing differential expression

If the selected probe sets are fewer than 50

Page 162: T-statistics is widespread in assessing differential expression
Page 163: T-statistics is widespread in assessing differential expression

Yes

Page 164: T-statistics is widespread in assessing differential expression

Nice separation between ER-positive and ER-negative samples can also be achieved on the test set.

Page 165: T-statistics is widespread in assessing differential expression

Exercise 17

• Load ex14.lma or ex14.filtered.lma.
• Attach the clinical parameters description.
• Divide the data into training and test sets on the basis of one of the non-continuous parameters (e.g., Yes/No; Pos/Neg).
• Use PAMR to define the minimal subset of genes, if any, discriminating between the two groups.
• Questions:
– Are both subsets able to generate a classifier?
– For which clinical parameters?

Page 166: T-statistics is widespread in assessing differential expression