Building a Global Map of (Human) Gene Expression Misha Kapushesky European Bioinformatics Institute, EMBL St. Petersburg, Russia May, 2010


Building a Global Map of (Human) Gene Expression

Misha Kapushesky
European Bioinformatics Institute, EMBL

St. Petersburg, Russia
May, 2010


From one genome to many biological states

• While there is only one genome sequence, different genes are expressed in many different cell types and tissues, different developmental or disease states

• The size and structure of this “expression space” is still largely unknown

• Most individual experiments are looking at small regions

• We would like to build a map of the global human gene expression space


Mapping the human transcriptome

[Figure: map analogy with landmarks Everest, Lhasa and Kathmandu. Panels: Traditional research; A microarray experiment; The map we want to build]


How to build such a global map

• This space is huge: there are thousands of potentially different states (cell types, tissue types, developmental stages, disease states, systems under various treatments such as drugs, radiation, stress, …)

• It is not feasible to study them all in a single laboratory experiment (costs, rare samples, …)

• However, thousands of gene expression experiments are performed every year (microarrays, next-generation sequencing)

• Can we use the published data to build the global expression map?


ArrayExpress

• www.ebi.ac.uk/arrayexpress
• Data from over 280,000 assays and over 10,000 independent studies (microarrays, sequencing, …)
• Gene expression and other functional genomics assays
• Over 200 species
• Data collection and exchange from GEO


Can we integrate these data to answer questions that go beyond what was done in the individual studies?

• On a quantitative level, only data from the same microarray platform can be integrated


A global map of human gene expression

• Angela Gonzales (EBI)
• Misha Kapushesky (EBI)
• Janne Nikkila (Helsinki University of Technology)
• Helen Parkinson (EBI)
• Wolfgang Huber (EMBL)
• Esko Ukkonen (University of Helsinki)

Margus Lukk et al, Nature Biotechnology, 28, p322-324 (April, 2010)


• We collected over 9000 raw data files for the Affymetrix U133A platform from GEO and ArrayExpress

• We applied strict quality controls and removed the duplicates

• Data on 5372 samples remained, from 206 different studies generated in 163 different laboratories, grouped into 369 different biological ‘conditions’ (tissue types, diseases, various cell lines, etc.)

• The 369 conditions were grouped into larger ‘metagroups’

The most popular gene expression microarray platform: Affymetrix U133A


Different metagroupings (4 and 15):


After RMA normalisation we obtain:

[Figure: data matrix, ~18,000 genes by 5372 samples (369 different conditions)]


Principal Component Analysis – each dot is one of the 5372 samples

[Figure: scatter plot; axes: 1st and 2nd principal components]
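The principal component projection behind this plot can be illustrated with a minimal sketch. It assumes a plain power iteration on mean-centred data, which is a stand-in and not necessarily how the study computed its components; the sample values and the name `toy_samples` are hypothetical.

```python
import math
import random


def first_principal_component(samples, iterations=200, seed=0):
    """Return (direction, scores) for the leading principal component.

    `samples` is a list of equal-length expression vectors, one per sample.
    Power iteration is an illustrative choice, not the study's method.
    """
    n, p = len(samples), len(samples[0])
    # Centre each gene so the component captures variance rather than the mean.
    means = [sum(s[j] for s in samples) / n for j in range(p)]
    centred = [[s[j] - means[j] for j in range(p)] for s in samples]

    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(p)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        # Apply X^T X to v without forming the p x p covariance matrix.
        proj = [sum(row[j] * v[j] for j in range(p)) for row in centred]
        w = [sum(centred[i][j] * proj[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # no variance left to explain
        v = [x / norm for x in w]
    scores = [sum(row[j] * v[j] for j in range(p)) for row in centred]
    return v, scores


# Hypothetical toy data: two samples of one kind, two of another, three genes.
toy_samples = [
    [10.0, 1.0, 5.0],
    [9.0, 2.0, 5.0],
    [1.0, 10.0, 5.0],
    [2.0, 9.0, 5.0],
]
direction, scores = first_principal_component(toy_samples)
```

Each entry of `scores` is the coordinate of one sample along the first component, which is what a single dot's position on one axis of such a plot represents.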


[Figure: PCA plot; axes: 1st and 2nd principal components]


[Figure: PCA plot; axes: Hematopoietic axis and 2nd principal component]



[Figure: PCA plot; axes: Hematopoietic axis and Malignancy axis]


Hematopoietic and malignancy axes

Lukk et al, Nature Biotechnology, 28: 322


[Figure: PCA plot; axes: 1st, 2nd and 3rd principal components]


Coloured by tissues of origin

[Figure: PCA plot; axis label: 3rd PC]


Tissues of origin

[Figure: PCA plot; axis label: Neurological axis]


First 3 (5) principal components

1. Hematopoietic axis – blood, ‘solid tissues’, ‘incompletely differentiated cells and connective tissues’

2. Malignancy axis - Cell lines – cancer – normals and other diseases

3. Neurological axis – nervous system / the rest

4. RNA degradation

5. Samples seem to ‘cluster’ by the tissues of origin


Hierarchical clustering of 97 groups with at least 10 replicates each
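A minimal sketch of hierarchical clustering over group profiles follows. The slide does not state the distance measure or linkage, so 1 minus Pearson correlation and average linkage are assumptions here, and the group names and values in `group_means` are hypothetical.

```python
import math


def correlation_distance(a, b):
    """1 - Pearson correlation between two expression profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return 1.0 - cov / (sa * sb)  # undefined for constant profiles


def average_linkage(profiles):
    """Agglomerative clustering; returns merges as (members_a, members_b, distance)."""
    names = list(profiles)
    dist = {
        frozenset((a, b)): correlation_distance(profiles[a], profiles[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
    clusters = [(name,) for name in names]
    merges = []

    def linkage(c1, c2):
        # Mean of all pairwise distances between members of the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = [
            (linkage(c1, c2), c1, c2)
            for i, c1 in enumerate(clusters)
            for c2 in clusters[i + 1:]
        ]
        d, c1, c2 = min(pairs, key=lambda t: t[0])
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2, d))
    return merges


# Hypothetical mean profiles for four sample groups over four genes.
group_means = {
    "blood": [9.0, 1.0, 2.0, 8.0],
    "leukemia": [8.0, 2.0, 1.0, 9.0],
    "brain": [1.0, 9.0, 8.0, 2.0],
    "cerebellum": [2.0, 8.0, 9.0, 1.0],
}
merges = average_linkage(group_means)
```

The merge history can be read bottom-up as a dendrogram: early merges join the most similar groups, later merges join larger classes.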


Comparison of the 97 larger sample groups to the rest

Incompletely differentiated cell type and connective tissue group


Conclusions so far

• We have identified 6 major transcription profile classes in these data:

1. cell lines

2. incompletely differentiated cells and connective tissues

3. neoplasms

4. blood

5. brain

6. muscle

• Cell lines cluster together!


Gene expression across the 5372 samples

• The expression of most genes is relatively constant

• There are only 1034 probesets (mapping to fewer than 900 genes) whose normalised signal has a standard deviation > 2
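The variability filter can be sketched as follows. This is a minimal illustration that assumes the sample standard deviation of the normalised signal; the probeset identifiers and values are hypothetical.

```python
from statistics import stdev


def most_variable(expression, threshold=2.0):
    """Return probeset ids with standard deviation above `threshold`, most variable first."""
    sds = {pid: stdev(values) for pid, values in expression.items()}
    selected = [pid for pid, sd in sds.items() if sd > threshold]
    return sorted(selected, key=lambda pid: sds[pid], reverse=True)


# Hypothetical normalised signals for three probesets across four samples.
toy_expression = {
    "probe_a": [2.0, 10.0, 3.0, 11.0],
    "probe_b": [7.0, 7.2, 6.9, 7.1],
    "probe_c": [4.0, 8.0, 5.0, 9.0],
}
variable_probesets = most_variable(toy_expression)
```

Taking the first k entries of the returned list gives a "k most variable probesets" subset of the kind used for clustering.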


Clustering of 97 sample groups and 1000 most variable probesets (about 900 genes)

1. Immune response
2. Nervous system development
3. Lipid raft
4. Mitosis
5. Neurotransmitter uptake
6. Cytoskeletal protein binding
7. Extracellular matrix
8. Extracellular region
9. Extracellular matrix
10. Extracellular region
11. Mitosis
12. Defence response
13. Nervous system development
14. Actin cytoskeleton organisation and biogenesis
15. Protein carrier activity
16. No significant result
17. Antigen presentation, exogenous antigen
18. Trans-1,2-dihydrobenzene-1,2-diol dehydrogenase activity
19. S100 alpha binding

[Figure: clustering heat map with gene clusters labelled 1 to 19]


Clustering based on subsets of these genes produces similar results

• Clustering based on the 350 most variable probesets gives almost the same result

• Even clustering based on the 30 most variable probesets is very close


24 most variable genes

CALD1: Actin- and myosin-binding protein implicated in the regulation of actomyosin interactions in smooth muscle and nonmuscle cells
CDH1: Calcium-dependent cell-cell adhesion glycoprotein
COL1A1: Type I collagen, fibril-forming collagen (alpha 1 chain)
COL1A2: Type I collagen, fibril-forming collagen (alpha 2 chain)
COL3A1: Collagen type III occurs in most soft connective tissues along with type I collagen
COL6A3: Collagen VI acts as a cell-binding protein
CXCR4: Receptor for the C-X-C chemokine CXCL12/SDF-1, participates in signal transduction
DCN: May affect the rate of fibril formation
DKK3: Inhibitor of Wnt signaling pathway (Potential)
FN1: Involved in cell adhesion, cell motility, opsonization, wound healing, and maintenance of cell shape
HBA1: Involved in oxygen transport from the lung to the various peripheral tissues
HLA-DRA: One of the HLA class II alpha chain paralogues, plays a central role in the immune system
HLA-DRA1
HLA-DRB3: Plays a central role in the immune system by presenting peptides derived from extracellular proteins
GJA1: Cluster of closely packed pairs of transmembrane channels, the connexons
KRT15: Encodes a member of the keratin gene family
KRT18: Type I intermediate filament chain keratin
LUM: A member of the small leucine-rich proteoglycan (SLRP) family
LYZ: Encodes human lysozyme
PLS3: Actin-bundling protein found in intestinal microvilli, hair cell stereocilia, and fibroblast filopodia
S100AB: S-100 is a group of low molecular weight (10–12 kD) calcium-binding proteins highly conserved among vertebrates
SPARC: Appears to regulate cell growth through interactions with the extracellular matrix and cytokines
SPARCL1: Seems to be little known
TACSTD2: Tumor-associated calcium signal transducer 2


www.ebi.ac.uk/gxa/human/U133a


Can we go beyond the 6 major classes?


Hierarchical clustering of all 369 sample groups

Some finer groups:

Cancer:
• Sarcomas
• Carcinomas
• Neuroblastomas

Normal:
• Liver and gut


[Figure labels: Leukemia; Normal blood and blood non-neoplastic disease; Other blood neoplasm; Blood cell lines]


Identifying condition specific genes by supervised analysis

• Using linear models to find condition specific genes, multiple testing correction, differential expression cut-offs

• Example: the 174 leukemia-specific genes include most well-known markers (e.g., BCR, ETV6, FLT3, HOXA9, MYST3, PRDM2, RUNX1, and TAL1). Many are confirmed as associated with leukemia
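The slide mentions multiple testing correction without naming a method. As a minimal sketch, assuming the Benjamini-Hochberg procedure (an assumption, not confirmed by the source), per-gene p-values could be adjusted like this; the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes in one condition.
raw_p = [0.01, 0.04, 0.03, 0.5]
adjusted_p = benjamini_hochberg(raw_p)
# Genes passing an illustrative 10% false discovery rate cut-off.
significant = [i for i, p in enumerate(adjusted_p) if p < 0.1]
```

A differential expression cut-off, as named on the slide, would then be applied on top of the adjusted p-values.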


• Beyond the 6 major classes the ‘signal’ becomes weak

• The problem may be lab effects
  The large biological effects are stronger than the lab effects
  However, when we zoom into particular subclasses, the lab effects may take precedence


Mapping the human transcriptome

[Figure: map analogy with landmarks Everest, Lhasa and Kathmandu. Panels: Traditional research; A microarray experiment; The map we want to build; Our current view on global transcriptome]


97 groups – colours recycled

[Figure labels: Frontal cortex; Muscular dystrophy; Skeletal muscle; Brain; Heart and heart parts; Cerebellum; Caudate nucleus; Hippocampal tissue; Nervous system tumors; Mononuclear cells; AML]


Second approach

• Integrating data at the statistics level


Gene Expression Atlas

• Ele Holloway
• Ibrahim Emam
• Pavel Kurnosov
• Helen Parkinson
• Andrey Zorin
• Tony Burdett
• Gabriella Rustici
• Eleanor Williams
• Andrew Tikhonov


Global Differential Expression Analysis

• Selected ~10% of the data from ArrayExpress (including GEO imports), manually curated for quality and mapped to a custom-built ontology of experimental factors, EFO: http://www.ebi.ac.uk/efo

• Data on differential expression of genes in 1000+ studies, comprising ~30000 assays, in over 5000 conditions

• For each experiment, differentially expressed genes have been identified computationally via moderated t-tests and statistical meta-analysis

Meta-Analysis Approaches

• Vote counting: number of independent studies supporting an observation for a particular gene

• Effect size integration: compute effect size statistics in each study, assess relevant statistical model and compute combined z-score, for each gene/condition/study combination (extension of Choi et al, 2003)


Analysing each contributing dataset separately:

[Figure: expression matrix (genes by AML, CML and normal samples), converted by one-way ANOVA into per-gene calls]

          AML   CML   normal
gene 1     0     1     0
gene 2     1     1     0
gene 3     0     0     0
...
gene n     0     0     1
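The per-dataset test can be sketched with a plain one-way ANOVA F statistic for a single gene. This is only an illustration: the Atlas description elsewhere in the talk mentions moderated t-tests, and turning F into a p-value needs the F distribution, which is left out here. The expression values are hypothetical.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of observation groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical expression of one gene in AML, CML and normal samples.
aml = [8.0, 9.0, 10.0]
cml = [5.0, 6.0, 7.0]
normal = [2.0, 3.0, 4.0]
f_statistic = one_way_anova_f([aml, cml, normal])
```

A large F indicates that the gene's mean expression differs between at least two of the conditions in that dataset; per-condition 0/1 calls like those in the table would need follow-up comparisons.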


Combining the datasets

Experiments 1, 2, 3, …, m

          AML e1  AML e2  AML e3  CML e1  CML e2  CML e3  CML e4  normal
gene 1      0       0       0       1       1       0       1       0
gene 2      1       1       1       1       0       0       0       0
gene 3      0       0       0       0       0       0       0       0
...
gene n      0       1       1       0       0       0       0       1
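Vote counting over such a combined table can be sketched as below. The 0/1 calls for genes 1 to 3 are taken from the AML and CML columns of the table; the "normal" column is left out because the slide gives no experiment label for it. The function name `vote_count` and the data layout are illustrative choices.

```python
from collections import defaultdict


def vote_count(calls):
    """Count, per gene and condition, the experiments that called the gene differentially expressed."""
    votes = defaultdict(lambda: defaultdict(int))
    for (gene, condition, _experiment), is_de in calls.items():
        votes[gene][condition] += int(is_de)  # one vote per experiment
    return {gene: dict(per_condition) for gene, per_condition in votes.items()}


columns = [
    ("AML", "e1"), ("AML", "e2"), ("AML", "e3"),
    ("CML", "e1"), ("CML", "e2"), ("CML", "e3"), ("CML", "e4"),
]
rows = {
    "gene 1": [0, 0, 0, 1, 1, 0, 1],
    "gene 2": [1, 1, 1, 1, 0, 0, 0],
    "gene 3": [0, 0, 0, 0, 0, 0, 0],
}
calls = {
    (gene, condition, experiment): bool(value)
    for gene, values in rows.items()
    for (condition, experiment), value in zip(columns, values)
}
votes = vote_count(calls)
```

Here gene 1 is supported by three of four CML experiments and by none of the AML experiments, which is the kind of summary vote counting provides.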


Effect size-based meta-analysis

• For each gene in each experiment/condition we have:
  p-value for significance
  simultaneous t-statistics & confidence intervals
  d.e. (differential expression) label (“up” or “down”)

• However, we would also like to have:
  a measure of the strength of d.e.: an effect size
  the ability to combine d.e. findings statistically

• Effect size
  Standardized mean difference or similar (e.g., correlation coefficient)


Meta-analysis Procedure

• For each gene-experiment-condition combination
  Compute effect size from simultaneous d.e. t-statistics

• Combine effect sizes across multiple studies
  Using fixed-effects or random-effects models
  Obtain for each gene-condition combination:
  • Mean effect size estimate
  • Combined z-score
  • Overall p-value
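The procedure above can be sketched with textbook formulas. This is a minimal illustration under stated assumptions, not the Atlas implementation: the effect size is a standardised mean difference derived from a two-sample t-statistic with its usual large-sample variance, studies are combined by inverse-variance weighting, and the random-effects variant uses the DerSimonian-Laird estimate. The t-statistics and group sizes are hypothetical.

```python
import math


def effect_size_from_t(t, n1, n2):
    """Standardised mean difference d and its approximate variance from a t-statistic."""
    d = t * math.sqrt(1.0 / n1 + 1.0 / n2)
    var = (n1 + n2) / (n1 * n2) + d * d / (2.0 * (n1 + n2))
    return d, var


def combine(effects, random_effects=False):
    """Inverse-variance combination of (d, var) pairs.

    Returns (mean effect size, combined z-score, two-sided p-value).
    """
    ds = [d for d, _ in effects]
    vs = [v for _, v in effects]
    w = [1.0 / v for v in vs]
    mean = sum(wi * di for wi, di in zip(w, ds)) / sum(w)
    if random_effects:
        # DerSimonian-Laird estimate of the between-study variance tau^2.
        q = sum(wi * (di - mean) ** 2 for wi, di in zip(w, ds))
        c = sum(w) - sum(wi * wi for wi in w) / sum(w)
        tau2 = max(0.0, (q - (len(ds) - 1)) / c) if c > 0 else 0.0
        w = [1.0 / (v + tau2) for v in vs]
        mean = sum(wi * di for wi, di in zip(w, ds)) / sum(w)
    z = mean * math.sqrt(sum(w))  # mean divided by its standard error
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    return mean, z, p


# Hypothetical t-statistics and group sizes for one gene/condition in three studies.
studies = [
    effect_size_from_t(3.0, 10, 10),
    effect_size_from_t(2.5, 8, 12),
    effect_size_from_t(4.0, 15, 15),
]
mean_effect, z_score, p_value = combine(studies)
re_mean, re_z, re_p = combine(studies, random_effects=True)
```

When the studies agree closely, the estimated between-study variance is zero and the random-effects result coincides with the fixed-effects one; it widens the uncertainty only when the studies disagree more than their sampling error explains.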


Long tail of annotations…


Annotating data with ontologies

• Diverse nature of annotations on data

• Need to support complex queries which contain semantic information
  E.g. which genes are under-expressed in brain samples in human or mouse

• If we annotate with … do we get this data?

[Figure labels: cancer; adenocarcinoma]

James Malone


Decoupling knowledge from data

Atlas/AE

James Malone


Semantically-enriched Queries with EFO
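A semantically enriched query can be sketched as query expansion over a term hierarchy. The tiny hierarchy and the sample annotations below are hypothetical and do not reproduce EFO's actual structure; the function names are illustrative.

```python
def descendants(term, children):
    """Return `term` together with every term below it in the hierarchy."""
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children.get(current, ()))
    return found


def query(term, annotations, children):
    """Return samples annotated with `term` or with any of its descendants."""
    wanted = descendants(term, children)
    return sorted(s for s, label in annotations.items() if label in wanted)


# Hypothetical fragment of a disease hierarchy (parent -> children).
children = {
    "cancer": ["carcinoma", "sarcoma"],
    "carcinoma": ["adenocarcinoma"],
}
# Hypothetical sample annotations.
annotations = {
    "sample_1": "adenocarcinoma",
    "sample_2": "cancer",
    "sample_3": "normal",
}
cancer_samples = query("cancer", annotations, children)
exact_only = sorted(s for s, label in annotations.items() if label == "cancer")
```

Comparing `cancer_samples` with `exact_only` shows the point of the ontology: a plain string match for "cancer" misses the sample annotated with the more specific term.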


We can use the ontology structure

We can perform effect size meta-analysis on a hierarchy, if we follow several rules:


Increased statistical power


Condition-specificity through EFO


Condition-specific Gene Expression


[Screenshot labels: Query for genes; Query for conditions; species]

The ‘advanced query’ option allows building more complex queries

http://www.ebi.ac.uk/gxa


Query results for gene ASPM


ASPM is downregulated in the ‘normal’ condition in comparison to a disease in 9 studies out of 10

Upregulated in ‘Glioblastoma’ in 3 independent studies

Zoom into one of the ‘Glioblastoma’ studies. Each bar represents an expression level in a particular sample


‘wnt pathway’ genes in various cancers


Integrating both approaches

• The first approach gives the global view, but obscures the detail

• The second approach gives detail, but does not make it easy to integrate everything into one map

• Can we combine both approaches?


Other data

• RNA-seq data

• Proteomics data – Human Protein Atlas from KTH in Stockholm (collaboration with Mathias Uhlen)

• Time series – what states does a cell go through on its way from an ESC to a mature cell?


Two ways of integrating the data

• On a quantitative level – normalise all data together
  Advantages – results easier to interpret
  Disadvantages – lab effects

• On a statistics level – analyse each dataset separately first
  Advantages – fewer lab effects
  Disadvantages – combined data difficult to interpret (in each experiment each condition is compared to something else)

• How to combine the two approaches?


Acknowledgements

• Margus Lukk
• Misha Kapushesky
• Angela Gonzales
• Helen Parkinson
• Gabriella Rustici
• Ugis Sarkans
• Ele Holloway
• Roby Mani
• Mohammadreza Shojatalab
• Nikolay Kolesnikov
• Niran Abeygunawardena
• Anjan Sharma
• Miroslaw Dylag
• Ekaterina Pilicheva
• Ibrahim Emam
• Pavel Kurnosov
• Andrew Tikhonov
• Andrey Zorin
• Anna Farne
• Eleanor Williams
• Tony Burdett
• James Malone
• Holly Zheng
• Tomasz Adamusiak
• Susanna-Assunta Sansone
• Philippe Rocca-Serra
• Natalija Sklyar
• Marco Brandizi
• Chris Taylor
• Eamonn Maguire
• Maria Krestyaninova
• Mikhail Gostev
• Johan Rung
• Natalja Kurbatova
• Katherine Lawler
• Nils Gehlenborg
• Lynn French

Collaborators

• Audrey Kaufman (EBI)
• Wolfgang Huber (EBI)
• Sami Kaski (Helsinki)
• Morris Swertz (Groningen)
• …

Funding

• European Commission: FELICS, MolPAGE, ENGAGE, MuGEN, SLING, DIAMONDS, EMERALD
• NIH (NHGRI)
• EMBL