
Page 1: Dr Andrew Harrison Departments of Mathematical Sciences and Biological Sciences

Dr Andrew Harrison

Departments of Mathematical Sciences and Biological Sciences

University of Essex

[email protected]

Looking for signals in tens of thousands of GeneChips

There are >10^5 GeneChip experiments in the public domain, which cost ~$10^9 to produce. Extracting further information from this resource will be very cost-effective.

Page 2

Microarray informatics at Essex University
Departments of Mathematical Sciences and Biological Sciences

Faculty (degrees in …)
Dr Andrew Harrison: Physics
Professor Graham Upton: Statistics
Dr Berthold Lausen: Statistics
+ Dr Hugh Shanahan (Royal Holloway): Physics

PhD students
Farhat Memon: Computer Science
Anne Owen: Mathematics
Fajriyah Rohmatul: Statistics

Alumni
Dr Jose Arteaga-Salas: Statistics
Dr Renata Camargo: Computer Science
Dr Caroline Johnston: Molecular Biology and Bioinformatics
Dr William Langdon: Computer Science and Physics
Dr Joanna Rowsell: Mathematics
Dr Olivia Sanchez-Graillet: Computer Science and Bioinformatics
Dr Maria Stalteri: Inorganic Chemistry and Bioinformatics
+ 4 former MSc students

Current MSc and UG students
Aleksandra Iljina: Statistics and Data Analysis
Lina Hamadeh: Statistics and Data Analysis
Madalina Ghita: Mathematics

Page 3

There is a huge multiple-testing problem.

m = log2(Fold Change), a = log2(Average Intensity)

What can be learnt from comparing different experiments?

Perfect Match (PM)

Mismatch (MM)

The biggest uncertainty in GeneChip analysis is how to merge all the probe information for one gene.

Harrison, Johnston and Orengo, 2007, BMC Bioinformatics, 8:195
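
As a rough illustration of the two plot axes named on this slide (m and a), the sketch below computes both quantities for a single probe measured in two experiments. The function name and the intensities are hypothetical, and the average follows the slide's wording (log2 of the average intensity); many MA plots instead use the average of the two log2 intensities.

```python
import math

def ma_values(intensity_1, intensity_2):
    """Return (m, a) for one probe measured in two experiments.

    m = log2(fold change), a = log2(average intensity),
    following the axis labels given on the slide.
    """
    m = math.log2(intensity_2 / intensity_1)
    a = math.log2((intensity_1 + intensity_2) / 2.0)
    return m, a

# Hypothetical intensities, for illustration only: a four-fold increase.
m, a = ma_values(200.0, 800.0)
print(m, a)
```

With tens of thousands of probes per chip, each probe contributes one such (m, a) point, which is where the multiple-testing problem arises.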

Page 4

Some genes are represented by multiple probe-sets.

Probe-set A Probe-set B

If they are measuring the same thing the signals should be up and down regulated together.

Is that always true? No

Stalteri and Harrison, 2007, BMC Bioinformatics, 8:13
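
One simple way to check whether two probe-sets for the same gene move together is to count how often their log2 fold changes share a sign. This is only a sketch under that assumption, with made-up values, and is not necessarily the comparison used in the cited paper.

```python
def sign_agreement(m_a, m_b):
    """Fraction of comparisons in which both probe-sets change in the same direction."""
    pairs = list(zip(m_a, m_b))
    same_direction = sum(1 for a, b in pairs if a * b > 0)
    return same_direction / len(pairs)

# Hypothetical log2 fold changes for probe-set A and probe-set B
# across five comparisons; the third comparison disagrees in sign.
m_probe_set_a = [1.2, -0.8, 0.5, -1.5, 0.9]
m_probe_set_b = [1.0, -0.6, -0.4, -1.1, 0.7]

agreement = sign_agreement(m_probe_set_a, m_probe_set_b)
print(agreement)  # 0.8
```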

Page 5

Probes map to different exons. Alternative splicing may cause some exons to be upregulated and others to be downregulated.

Page 6

Genes come in pieces.

But exons do not. Multiple probes mapping to the same exon should measure the same thing.

Page 7

We are studying the correlations in expression across >6,000 GeneChips (HG-U133A), sampling RNA from many tissues and phenotypes.

Page 8

The correlations in intensities (log2) between probes in probeset 208772_at on the HG-U133A array.

The number in each square is the correlation ×10.

Blue = low correlation

Yellow = high correlation

Average intensity in GEO

The correlation calculated for PM probes 9 and 11, the data in the earlier scatter plot, is reported as 8 (0.76 multiplied by 10 and rounded).

Probe order along the gene
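
The encoding used in the heatmap (correlation multiplied by 10 and rounded) can be sketched as follows. The Pearson helper, the function names and the log2 intensities are assumptions for illustration; only the ×10 rounding rule and the 0.76 → 8 example come from the slide.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_grid(log_intensities):
    """Pairwise probe correlations, reported as round(10 * r)."""
    return [[round(10 * pearson(p, q)) for q in log_intensities]
            for p in log_intensities]

# Hypothetical log2 intensities: one list per probe, one value per GeneChip.
probes = [
    [7.2, 8.1, 6.5, 9.0, 7.8],
    [7.0, 8.4, 6.9, 8.7, 7.5],
    [6.1, 6.0, 6.3, 5.9, 6.2],
]
grid = correlation_grid(probes)
for row in grid:
    print(row)

# The slide's example: a correlation of 0.76 is reported as 8.
print(round(0.76 * 10))  # 8
```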

Page 9

This probeset shows no coherent correlations amongst its probes.

Page 10

Some probesets clearly have outliers.

Page 11

Probes 1-11 all map to the same exon.

This is a different probe-set mapping to the same exon – there seems to be one outlier.

Page 12

The outliers are correlated with each other!

Page 13

Page 14

Virtually all of the probes in the group have runs of guanines within their 25 bases.

TCCTGGACTGAGAAAGGGGGTTCCT

GAGACACACTGTACGTGGGGACCAC

GGTAGACTGGGGGTCATTTGCTTCC

There is little sequence similarity between the probes and they come from probe-sets picking up different biology, yet they are correlated!
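
A minimal sketch of how such runs could be detected: the helper below (a hypothetical name) returns the length of the longest contiguous run of G in a probe sequence and is applied to the three 25-mers shown on this slide.

```python
import re

def longest_g_run(probe):
    """Length of the longest contiguous run of G in a probe sequence."""
    runs = re.findall(r"G+", probe.upper())
    return max((len(run) for run in runs), default=0)

# The three probe sequences shown on the slide.
probes = [
    "TCCTGGACTGAGAAAGGGGGTTCCT",
    "GAGACACACTGTACGTGGGGACCAC",
    "GGTAGACTGGGGGTCATTTGCTTCC",
]
run_lengths = [longest_g_run(p) for p in probes]
print(run_lengths)  # [5, 4, 5]
```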

Page 15

Comparing probes with runs of Gs.

Number of contiguous Gs    Mean correlation
3                          0.14
4                          0.42
5                          0.49
6                          0.62
7                          0.75

We are only looking at a small fraction of the entire probe, yet it is dominating the effects across all experiments.

Page 16

All probes within a cell have the same sequence, so a run of guanines results in closely packed DNA with just the right properties to form G-quadruplexes.

Upton et al., 2008, BMC Genomics, 9:613

[Figure: aligned GGGG runs, labelled "G-quadruplexes"]

Page 17

How do we deal with known outliers such as G-quadruplexes?

What is the best way to calculate expression in the presence of outliers?
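
To make the second question concrete, the sketch below compares a mean with a median for one probe-set containing a single outlier probe. The values are invented, and the median is shown only as one familiar robust summary, not as the answer the slide is proposing.

```python
import statistics

# Hypothetical log2 intensities for the 11 probes of one probe-set;
# the sixth value mimics an outlier probe (for example, one with a run of Gs).
probe_log2 = [7.1, 7.3, 6.9, 7.0, 7.2, 11.5, 7.1, 7.0, 6.8, 7.2, 7.1]

mean_summary = statistics.mean(probe_log2)      # pulled upwards by the outlier
median_summary = statistics.median(probe_log2)  # unaffected by a single outlier

print(round(mean_summary, 2), median_summary)
```

An alternative, when outliers are known in advance, is to drop the affected probes before summarising the remaining ones.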

Page 18

G-stacks bias which genes are reported to be clustered together within published experiments.

Page 19

Page 20

Kerkhoven et al., 2008, PLoS ONE, 3(4):e1980

Probes containing GCCTCCC will hybridize to the primer spacer sequence that is attached to all aRNA prior to hybridization.
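
Flagging affected probes is a plain substring search. In the sketch below the motif is the one quoted on the slide, while the function name and both example probe sequences are made up for illustration.

```python
PRIMER_SPACER_MOTIF = "GCCTCCC"  # motif quoted on the slide

def contains_primer_spacer_motif(probe):
    """True if the probe sequence contains the GCCTCCC motif."""
    return PRIMER_SPACER_MOTIF in probe.upper()

# Hypothetical 25-base probe sequences, not taken from any real array.
example_probes = [
    "ATGCCTCCCTAGGATCCATGCAAGT",  # contains the motif
    "ATGACTGACTGACTGACTGACTGAC",  # does not contain it
]
flags = [contains_primer_spacer_motif(p) for p in example_probes]
print(flags)  # [True, False]
```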

Page 21

Log(magnitude) of averaged probe values

Colour coded by size. Note the perimeter of bright-dark pairs.

Cell (0,0) contains a probe which does not measure any biology

Page 22

Corner correlations (correlations with values in cell (0,0))

Numbers are correlations times 10 (red: greater than 0.8).

Negative correlations appear as blanks.

Filled circles indicate probes not listed in the CDF file.

Large circles indicate correlations greater than 0.8.

Page 23

Correlations with cell (0,0)

Being in the opposite corner has not reduced the correlations of the interior row and column

Page 24

What is in the sheep pens?

Entries are log(mean(Intensity))

Entries are correlation with cell (0,0)

Sheep!

Page 25

Many thousands of probes are correlated with each other simply because they are adjacent to bright probes.

We believe that the focus of the scanner may be responsible – regions adjacent to bright spots will gain the same fraction of light.

A comparison of many images at different levels of blurriness will appear to indicate that dark regions adjacent to bright regions are correlated in their intensities.
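
The argument can be illustrated with a toy simulation. It assumes that each image has its own blur level and that two dark cells next to a bright cell each gain the same fraction of the bright cell's light. All numbers (bright intensity, blur range, noise level) are invented, so this shows the mechanism only and is not a model of a real scanner.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)
BRIGHT = 10000.0   # assumed intensity of the bright cell
N_IMAGES = 500     # assumed number of scanned images

left, right, distant = [], [], []
for _ in range(N_IMAGES):
    # Assumed per-image blur: fraction of bright light leaking to each neighbour.
    blur = random.uniform(0.0, 0.05)
    # The two neighbours have independent dark signals plus the same leaked light.
    left.append(random.gauss(100.0, 10.0) + blur * BRIGHT)
    right.append(random.gauss(100.0, 10.0) + blur * BRIGHT)
    # A dark cell far from any bright cell receives no leaked light.
    distant.append(random.gauss(100.0, 10.0))

r_neighbours = pearson(left, right)
r_distant = pearson(left, distant)
print(r_neighbours, r_distant)
```

In this toy setting the two neighbours are strongly correlated even though their underlying signals are independent, while the distant cell is not.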

Page 26

A CEL file contains information about the ID of the scanner as well as the date on which the image was scanned – how does the impact of blur change over time for each scanner?

Upton and Harrison, 2010, Stat Appl Genet Mol Biol, 9(1), Article 37

Page 27

How best to transform a DAT image into a CEL file?

We are testing whether ideas from astronomy are applicable.

We are checking whether the temporal patterns in scanner performance for human and other organisms are related.

Page 28

Bioinformatix, Genomix, Mathematix, Physix, Statistix, Transcriptomix

are needed in order to extract reliable information from Affymetrix GeneChips

Thank you for your attention.