
Functional genomics and gene expression data analysis

Joaquín Dopazo

Bioinformatics Unit, Centro Nacional de Investigaciones Oncológicas (CNIO), Spain.

http://bioinfo.cnio.es

The use of high-throughput methodologies allows us to query our systems in a new way but, at the same time, generates new challenges for data analysis and requires us to change our data management habits.

National Institute of Bioinformatics, Functional Genomics node

From genotype to phenotype. (only the genetic component)

>protein kinase

acctgttgatggcgacagggactgtatgctgatctatgctgatgcatgcatgctgactactgatgtgggggctattgacttgatgtctatc....

…are expressed and constitute the transcriptome...

… which accounts for the function, provided they are expressed at the proper time and place...

…in cooperation with other proteins (interactome) …

…forming complex interaction networks (metabolome)...

Genes in the DNA... …whose final effect can be different because of the variability.

Now: 23,531 (NCBI 34 assembly, 02/04). Recent estimates: 20,000 to 100,000.

50% of mRNAs do not code for proteins (mouse)

50% display alternative splicing

Each protein has an average of 8 interactions

A typical tissue expresses between 5,000 and 10,000 genes

More than 4 million SNPs have been mapped

25%-60% unknown

...and code for proteins (proteome) that...

>protein kinase

acctgttgatggcgacagggactgtatgctgatctatgctgatgcatgcatgctgactactgatgtgggggctattgacttgatgtctatc....

Pre-genomics scenario in the lab

Sequence

Molecular databases

Search results

Phylogenetic tree

alignment

Conserved region

Motif

Motif databases

Information

Secondary and tertiary protein structure

Bioinformatics tools for pre-genomic sequence data analysis

The aim:

Extracting as much information as possible from a single piece of data

Genome sequencing

2-hybrid systems

Mass spectrometry for protein complexes

Post-genomic vision

Expression arrays

Literature, databases

Who?

Where, when and how much?

What do we know?

In what way?

SNPs

And who else?

http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

genes

interactions

Post-genomic vision

Gene expression

Information

Polymorphisms

Information

Databases

The new tools:

Clustering

Feature selection

Data integration

Information mining

Gene expression profiling. The rationale, what we would like and related problems

Differences at the phenotype level are the visible consequence of differences at the molecular level which, in many cases, can be detected by measuring the levels of gene expression. The same holds for different experiments, treatments, etc.

• Classification of phenotypes / experiments (Can I distinguish among classes, values of variables, etc. using molecular gene expression data?)

• Selection of differentially expressed genes among the phenotypes / experiments (did I select the relevant genes, all the relevant genes and nothing but the relevant genes?)

• Biological roles the genes are playing in the cell (what general biological roles are really represented in the set of relevant genes?)

A note of caution:

Question Experiment test

Is gene A involved in process B?

Experiment (sometimes) test Question

Is there any gene (or set of genes) involved in any process?

Genome-wide technologies allow us to produce vast amounts of data. But... data is not knowledge. Misunderstanding of this has led to “new” (not necessarily good) ways of asking (scientific) questions

Cy5 Cy3

cDNA arrays Oligonucleotide arrays

Gene expression analysis using DNA microarrays

There are two dominant technologies, spotted arrays and oligo arrays, although new players are entering the arena

Transforming images into data

Test sample labeled red (Cy5)

Reference sample labeled green (Cy3)

Red: gene overexpressed in test sample

Green: gene underexpressed in test sample

Yellow: equally expressed

red/green - ratio of expression

Normalisation

Before (left) and after (right) normalization. A) BoxPlots, B) BoxPlots of subarrays and C) MA plots (ratio versus intensity)

(a) After normalization by average, (b) after print-tip lowess normalization, (c) after normalization taking into account spatial effects

There are many sources of error that can affect and seriously bias the interpretation of the results: differences in the efficiency of labelling, the hybridisation, local effects, etc.

Normalisation is a necessary step before proceeding with the analysis
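As a rough illustration of the red/green ratio and of why it is normalised, the sketch below computes the log-ratio M and mean log-intensity A used in MA plots, then centres M on its median. Global median centring is only a simple stand-in here, assumed for brevity; it is not the print-tip lowess or spatial normalisation mentioned above. The function names and the intensity values are made up for the example.

```python
import math
from statistics import median

def ma_values(red, green):
    """Return (M, A): M = log2(R/G) per spot, A = mean log2 intensity per spot."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a

def median_normalise(m):
    """Centre the log-ratios so that the median spot has M = 0."""
    centre = median(m)
    return [x - centre for x in m]

# Hypothetical spot intensities: the Cy5 channel is globally twice as bright,
# which would wrongly suggest that almost every gene is overexpressed.
red = [1200.0, 800.0, 100.0, 3000.0, 150.0]
green = [600.0, 400.0, 200.0, 1500.0, 75.0]

m, a = ma_values(red, green)
m_norm = median_normalise(m)  # only the third spot still departs from 0
```

After centring, the dye bias shared by all spots is gone and only the spot with a genuinely different ratio stands out.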

A

B

C

The data

Characteristics of the data:

• Many more variables (genes) than measurements (experiments / arrays)

• Low signal to noise ratio

• High redundancy and intra-gene correlations

• Most of the genes are not informative with respect to the trait we are studying (they account for unrelated physiological conditions, etc.)

• Many genes have no annotation!!

Genes(thousands)

Experimental conditions (from tens up to no more than a few hundred)

A B C

Expression profile of a gene across the experimental conditions

Expression profile of all the genes for an experimental condition (array)

Different classes of experimental conditions, e.g. Cancer types, tissues, drug treatments, time survival, etc.

...

Co-expressing genes... What do they have in common?

Different phenotypes...

Which genes are responsible?

Genes interacting in a network (A, B, C...)...

What does the network look like?

A

B C

DE

Molecular classification of samples

Multiple array experiments. Can we find groups of experiments with similar gene expression profiles?

Unsupervised

Supervised

Reverse engineering

Unsupervised clustering methods: useful for class discovery (we do not have any a priori knowledge on classes)

Non-hierarchical: K-means, PCA, SOM

Hierarchical: UPGMA, SOTA

Different levels of information

quick and robust

An unsupervised problem: clustering of genes.

•Gene clusters are unknown beforehand

•Distance function

•Cluster gene expression patterns based uniquely on their similarities.

•Results are subjected to further interpretation (if possible)
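The points above can be sketched with a small average-linkage (UPGMA) clustering of gene profiles. This is a minimal illustration, assuming a Pearson correlation distance (one common choice of distance function; the slides do not fix one). The gene names and expression values are invented, and profiles with zero variance are not handled.

```python
from math import sqrt

def pearson_distance(x, y):
    """1 - Pearson correlation: 0 for identical trends, 2 for opposite trends."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

def upgma(profiles, n_clusters):
    """Merge the two closest clusters (average linkage) until n_clusters remain."""
    clusters = [[name] for name in profiles]

    def linkage(c1, c2):
        # Average distance over all pairs of members from the two clusters.
        total = sum(pearson_distance(profiles[a], profiles[b])
                    for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# Hypothetical expression profiles across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.5, 4.0, 2.0],
}
clusters = upgma(profiles, n_clusters=2)
```

The two rising genes end up in one cluster and the two falling genes in the other; what the clusters mean biologically is left to later interpretation.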

Perou et al., PNAS 96 (1999)

Clustering of experiments: The rationale

Distinctive gene expression patterns in human mammary epithelial cells and breast cancers

Overview of the combined in vitro and breast tissue specimen cluster diagram. A scaled-down representation of the 1,247-gene cluster diagram. The black bars show the positions of the clusters discussed in the text: (A) proliferation-associated, (B) IFN-regulated, (C) B lymphocytes, and (D) stromal cells.

If enough genes have their expression levels altered in the different experiments, we might be able to find these classes by comparing gene expression profiles.

Clustering of experiments: The problems

Every gene, regardless of its relevance for the classification, has the same weight in the comparison. If the relevant genes are not in the overwhelming majority, this produces:

Noise

and/or

irrelevant trends

Supervised analysis. If we already have information on the classes, our question to the data should use it.

Class prediction based on gene expression profiles:

Problems:

How can classes A, B, C... be distinguished based on the corresponding profiles of gene expression?

How can a continuous phenotypic trait (resistance to drugs, survival, etc.) be predicted?

And

Which genes among the thousands analysed are relevant for the classification?

Genes(thousands)

Experimental conditions (from tens up to no more than a few hundred)

A B C

Predictor

Gene selection

Gene selection. We are interested in selecting those genes showing differential expression among the classes studied.

• Contingency table (Fisher's test)

For discrete data (presence/absence, etc).

• T-test

We could compare gene expression data between two types of patients.

• ANOVA

Analysis of variance. We compare the value of an interval-scale variable between two or more groups. The Pomelo tool

Gene selection and class discrimination

Genes differentially expressed among classes (t-test or ANOVA), with p-value < 0.05

10 cases, 10 controls

Sorry... the data was a collection of random numbers labelled for two classes

This is a multiple-testing statistical contrast.

Adjusted p-values must be used!
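The random-numbers warning can be reproduced with a short simulation. This sketch assumes 2,000 hypothetical genes, 10 cases and 10 controls drawn from the same distribution, and a pooled-variance t-test. Rather than computing p-values, it compares |t| with 2.101, the two-sided 5% critical value of Student's t with 18 degrees of freedom.

```python
import random
from math import sqrt
from statistics import mean, variance

def t_statistic(x, y):
    """Two-sample t statistic with pooled variance (equal variances assumed)."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled * (1 / nx + 1 / ny))

T_CRIT = 2.101   # two-sided 5% critical value, 10 + 10 - 2 = 18 d.f.
N_GENES = 2000   # hypothetical array size

rng = random.Random(0)
false_hits = 0
for _ in range(N_GENES):
    # Cases and controls come from the SAME distribution: no real difference.
    cases = [rng.gauss(0.0, 1.0) for _ in range(10)]
    controls = [rng.gauss(0.0, 1.0) for _ in range(10)]
    if abs(t_statistic(cases, controls)) > T_CRIT:
        false_hits += 1

# Around 5% of the genes (about 100 here) pass the unadjusted threshold
# although none of them is differentially expressed.
print(false_hits, "of", N_GENES, "random genes look 'significant'")
```

With thousands of genes tested at once, an unadjusted 0.05 threshold is expected to let roughly one gene in twenty through by chance alone.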

Gene selection between normal endometrium (NE) and endometrioid endometrial carcinomas (EEC)

Hierarchical clustering of 86 genes with different expression patterns between normal endometrium and endometrioid endometrial carcinoma (p<0.05), selected among the ~7000 genes in the CNIO oncochip

Moreno et al., Breast and Gynaecological Cancer Laboratory, Molecular Pathology Programme, CNIO

And, genes are not only related to discrete classes...

Pomelo: a tool for finding differentially expressed genes

• Among classes

• Survival

• Related to a continuous parameter

Of predictors and molecular signatures

1. Training (with internal and/or external CV): samples of known class (A, B) are used to build a model, or classifier

2. Classification / prediction: the model assigns an unknown sample to a class (A/B?)
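A minimal sketch of training with cross-validation, under stated assumptions: a nearest-centroid classifier, leave-one-out CV, and gene selection by difference of class means, repeated inside every fold so that the left-out sample never influences which genes are chosen. The classifier, the selection rule, all function names and the simulated data are illustrative choices, not the method of any tool named in these slides.

```python
import random

def centroid(rows):
    """Column-wise mean of a list of equally long profiles."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def select_genes(samples, labels, k):
    """Indices of the k genes with the largest difference between class means."""
    a = centroid([s for s, lab in zip(samples, labels) if lab == "A"])
    b = centroid([s for s, lab in zip(samples, labels) if lab == "B"])
    ranked = sorted(range(len(a)), key=lambda g: abs(a[g] - b[g]), reverse=True)
    return ranked[:k]

def predict(train, train_labels, sample, genes):
    """Assign the sample to the class whose centroid is closest on the chosen genes."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_labels)):
        rows = [[s[g] for g in genes]
                for s, lab in zip(train, train_labels) if lab == label]
        c = centroid(rows)
        dist = sum((sample[g] - ci) ** 2 for g, ci in zip(genes, c))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loocv_accuracy(samples, labels, k=5):
    """Leave-one-out CV with gene selection redone on each training fold."""
    hits = 0
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        genes = select_genes(train, train_labels, k)
        if predict(train, train_labels, samples[i], genes) == labels[i]:
            hits += 1
    return hits / len(samples)

# Simulated data: 50 genes, of which only the first 5 differ between classes.
rng = random.Random(1)
labels = ["A"] * 10 + ["B"] * 10
samples = [[rng.gauss(3.0 if (lab == "B" and g < 5) else 0.0, 1.0)
            for g in range(50)] for lab in labels]
accuracy = loocv_accuracy(samples, labels)
```

Selecting the genes once on all samples and cross-validating only the classifier would give an optimistic accuracy, which is why the selection step sits inside the loop.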

Predictor of clinical outcome in breast cancer

van’t Veer et al., Nature, 2002

Genes are arranged according to their correlation with the prognostic groups

Prognostic classifier with optimal accuracy

Information mining

How are they structured?

Clustering

What is this gene?

Links

My data...

?

What are these groups?

Information mining

Cell cycle...

DBs Information

Information mining applications.

1) Use of biological information as a validation criterion

Information mining of DNA array data. Allows quick assignment of function, biological role and subcellular location to groups of genes.

Used to understand why genes differ in their expression between two different conditions

Sources of information:

• Free text

• Curated terms (ontologies, etc.)

Gene Ontology Consortium

http://www.geneontology.org

• The objective of GO is to provide controlled vocabularies for the description of the molecular function, biological process and cellular component of gene products.

• These terms are to be used as attributes of gene products by collaborating databases, facilitating uniform queries across them.

• The controlled vocabularies of terms are structured to allow both attribution and querying to be at different levels of granularity.

FatiGO: GO-driven data analysis

The aim: to develop a statistical framework able to deal with multiple-testing questions

The Gene Ontology Consortium. 2000. Gene Ontology: tool for the unification of biology. Nature Genetics 25: 25-29

GO: source of information. A reduced number of curated terms

How does FatiGO work?

• Compares two sets of genes (query and reference)

• Has ontology information [Process, Function and Component] on different organisms

• Select level [2-5]. Important: annotations are upgraded to the level chosen. This increases the power of the test: there are fewer terms to be tested and more genes per term.

1. Input: genes in the query cluster and genes in the reference cluster

2. Remove genes repeated within the query cluster, within the reference cluster and between the clusters, giving a clean query cluster and a clean reference cluster

3. GO DB: search GO terms at the level and ontology selected

4. Distribution of GO terms in the query cluster and in the reference cluster

5. p-value, multiple test

Important: since we are performing as many tests as GO terms, multiple-testing adjustment must be used
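For a single GO term, the comparison of query and reference distributions can be sketched as a 2×2 contingency table. The example below assumes a two-sided Fisher's exact test; the slides do not state which test is applied, so this is an illustrative choice. The function name and the counts are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test on the table
           with term   without term
    query      a            b
    refer.     c            d
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x annotated genes in the query set.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical counts for one GO term: 12 of 40 query genes carry it,
# against 30 of 960 reference genes.
p_term = fisher_exact_two_sided(12, 28, 30, 930)

# Small table whose p-value can be checked by hand: 34/70.
p_small = fisher_exact_two_sided(3, 1, 1, 3)
```

One such p-value is produced per GO term, so the whole list then has to go through a multiple-testing adjustment before any term is reported.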

Number of genes with GO term at the level and ontology selected for each cluster

Unadjusted p-value

Step-down min p adjusted p-value

FDR (indep.) adjusted p-value

FDR (arbitrary depend.) adjusted p-value

Tables GO term – genes

Genes of old versions (Unigene)

Genes without result

Repeated genes

GO tree with different levels of information

FatiGO results

The application extracts biologically relevant terms (showing a significant differential distribution) for a set of genes

PTL LBC

Martinez et al., Human Genetics Laboratory. Molecular Pathology Programme, CNIO

Lymphomas from mature lymphocytes (LB) and precursor T-lymphocytes (PTL).

Genes differentially expressed, selected among the ~7000 genes in the CNIO oncochip

Genes differentially expressed between the two groups were mainly related to immune response (activated in mature lymphocytes)

Understanding why genes differ in their expression between two different phenotypes

Martinez et al., Human Genetics Laboratory. Molecular Pathology Programme, CNIO

Biological processes shown by the genes differentially expressed between PTL and LB

Looking for significant differences. Statistical approaches

Don’t worry, be happy: 2-fold increase/decrease, individual tests. Hundreds of differentially expressed genes

Panic: Bonferroni, FWER. Hardly any differentially expressed genes (or even none)

Looking for more heuristic and/or realistic ways of finding differentially expressed genes

Use of external information

1. Use of biological information as a validation criterion

2. Use of biological information as part of the algorithm

False Discovery Rate (FDR) controls the expected proportion of false rejections among the rejected hypotheses (differentially expressed genes), instead of the more conservative FWER, which controls the probability that one or more of the rejected hypotheses is true.
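To make the contrast concrete, here is a small sketch of two adjustments: Bonferroni, which controls the FWER, and the Benjamini-Hochberg step-up procedure, which controls the FDR under independence. The p-values are invented and the function names are illustrative.

```python
def bonferroni(pvals):
    """FWER control: multiply each p-value by the number of tests (capped at 1)."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def benjamini_hochberg(pvals):
    """FDR control: p * m / rank, made monotone from the largest p-value down."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical unadjusted p-values for four genes.
pvals = [0.01, 0.04, 0.03, 0.005]
fwer_adjusted = bonferroni(pvals)          # approx. [0.04, 0.16, 0.12, 0.02]
fdr_adjusted = benjamini_hochberg(pvals)   # approx. [0.02, 0.04, 0.04, 0.02]
```

At a 0.05 threshold all four genes survive the FDR adjustment but only two survive Bonferroni, which illustrates why FWER control can leave hardly any differentially expressed genes.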

Necessity of a tool and the appropriate statistical framework for the management of the information

Applications

2) Use of biological information as a threshold criterion

The problem:

We might be interested in understanding, e.g., which genes differ between tissues, diseases, etc.

Typically:

We examine each gene, selecting only those that show significant differences using an appropriate statistical model, and correcting for multiple testing.

The threshold, thus, is based on expression values in the absence of any other information. Conventional levels (e.g., Type I error rate of 0.05), based exclusively on statistical criteria, are used.

A B

B

A

Metabolism

Transport

...Reproduction

Use of biological information as a validation criterion

Use of biological information as a threshold criterion

Information-driven approach

We examine the GO terms associated with each gene and see, correcting for multiple testing, whether some of them are overrepresented

The threshold is based on levels (e.g., Type I error rate of 0.05) of the distribution of GO terms

A B

B

A

GO terms

Metabolism

Transport

...Reproduction

Present

Absent

The rationale: genes are differentially expressed for some biological reason

The procedure becomes more sensitive

Comparing genes differentially expressed between organs

testis kidney

Díaz-Uriarte et al., CAMDA 02

Other approaches that include information in the algorithm: GSEA

Figure 1: Schematic overview of GSEA. The goal of GSEA is to determine whether any a priori defined gene sets (step 1) are enriched at the top of a list of genes ordered on the basis of expression difference between two classes (for example, highly expressed in individuals with NGT versus those with DM2). Genes R1,...,RN are ordered on the basis of expression difference (step 2) using an appropriate difference measure (for example, SNR). To determine whether the members of a gene set S are enriched at the top of this list (step 3), a Kolmogorov-Smirnov (K-S) running sum statistic is computed: beginning with the top-ranking gene, the running sum increases when a gene annotated to be a member of gene set S is encountered and decreases otherwise. The ES for a single gene set is defined as the greatest positive deviation of the running sum across all N genes. When many members of S appear at the top of the list, ES is high. The ES is computed for every gene set using actual data, and the MES achieved is recorded (step 4). To determine whether one or more of the gene sets are enriched in one diagnostic class relative to the other (step 5), the entire procedure (steps 2–4) is repeated 1,000 times, using permuted diagnostic assignments and building a histogram of the maximum ES achieved by any pathway in a given permutation. The MES achieved using the actual data is then compared to this histogram (step 6, red arrow), providing us with a global P value for assessing whether any gene set is associated with the diagnostic categorization.
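The running sum of step 3 can be sketched in a few lines. The caption only says that the sum increases on members of S and decreases otherwise; the step sizes used below, sqrt((N-G)/G) for a member and -sqrt(G/(N-G)) otherwise (with G the number of members of S in the list), are an assumption chosen so that the sum returns to zero at the end of the list. The permutation steps 4 to 6 are not included, and the gene names are placeholders.

```python
from math import sqrt

def enrichment_score(ranked_genes, gene_set):
    """Greatest positive deviation of the running sum along the ranked list."""
    n = len(ranked_genes)
    g = sum(1 for gene in ranked_genes if gene in gene_set)
    if g == 0 or g == n:
        return 0.0  # the set is absent or covers everything: nothing to score
    hit = sqrt((n - g) / g)      # assumed increment for a member of S
    miss = -sqrt(g / (n - g))    # assumed decrement for a non-member
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += hit if gene in gene_set else miss
        best = max(best, running)
    return best

# Placeholder ranked list R1..R10, ordered by expression difference.
ranked = ["R%d" % i for i in range(1, 11)]

es_top = enrichment_score(ranked, {"R1", "R2", "R3"})      # set at the top
es_bottom = enrichment_score(ranked, {"R8", "R9", "R10"})  # set at the bottom
```

A set concentrated at the top of the list yields a high ES, while the same number of genes at the bottom yields an ES of essentially zero; significance would then come from repeating the ranking with permuted class labels.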

Mootha et al., Nat Genet. 2003 Jul;34(3):267-73

ISW applied to a dataset for which no differentially expressed genes could be found

IGT + Diabetic

Normal tolerance to glucose

Mootha et al., Nat Genet. 2003

17 NGT vs. 8 IGT + 18 DM2

No differentially expressed genes were found between the two conditions.

Gene Set Enrichment Analysis detects oxidative phosphorylation

ISW detects 5 pathways

arrangement

Pathways over- and underrepresented

Algorithms are used if they are available in programs.

GEPAS, a package for DNA array data analysis

• Array scanning, image processing

• Preprocessor + hub

• Supervised clustering: SVM

• Unsupervised clustering: Hierarchical, SOM, SOTA, SomTree

• Datamining: FatiGO, FatiWise

• Viewers: SOTATree, TreeView, SOMplot

• External tools: EP, HAPI

• Two-conditions comparison

• Gene selection: two classes, multiple classes, continuous variable, categorical variable, survival

• Normalization: DNMAD

• Predictor: tnasas

• In silico CGH

Bioinformatics Group, CNIO

http://bioinfo.cnio.es http://gepas.bioinfo.cnio.es

http://fatigo.bioinfo.cnio.es

From left to right: Lucía Conde, Joaquín Dopazo, Alvaro Mateos, Fátima Al-Shahrour, Víctor Calzado, Hernán Dopazo, Javier Herrero, Javier Santoyo, Ramón Díaz, Michal Karzinstky & Juanma Vaquerizas
