example data – maldi-tof peptide intensity vs m/z previous lecture: proteomics informatics


Upload: adrian-cobb

Post on 23-Dec-2015


Page 1: Example data – MALDI-TOF Peptide intensity vs m/z Previous Lecture: Proteomics Informatics

Example data – MALDI-TOF

[Figure: five MALDI-TOF spectra from ATP.txt (02/03/11), each plotted as intensity vs. m/z: full range m/z 1000 to 4500 (intensity 0 to 1800), and zoomed views at m/z 2280 to 2400 (0 to 700), m/z 1300 to 1460 (0 to 45), m/z 1444.0 to 1458.0 (0 to 35), and m/z 2378.0 to 2394.0 (0 to 700)]
Peptide intensity vs m/z

Previous Lecture: Proteomics Informatics

Page 2:

This Lecture: Gene Expression Analysis (I)

Page 3:

Learning Objectives

• Microarray experimental details
• Microarray data formats
• QC analysis and data exploration
• Normalization
• Differential expression
• Functional enrichment
• Databases

Page 4:

The Central Dogma of Molecular Biology

DNA is transcribed into RNA, which is then translated into protein.

[Diagram: DNA (replication) → transcription → RNA → translation → protein, with the label "Measured by Microarray"]

Page 5:

What is a Microarray

• A simple concept: Dot Blot + Northern
• Reverse the hybridization - put the probes on the filter and label the bulk RNA
• Make probes for lots of genes - a massively parallel experiment
• Make it tiny so you don’t need so much RNA from your experimental cells
• Make quantitative measurements

Page 6:

Microarrays are Popular

At NYU Med Center we are now collecting about 3 GB of microarray data per week (60 chips, 6-10 different experiments)

PubMed search "microarray" = 13,948 papers

2005 = 4406
2004 = 3509
2003 = 2421
2002 = 1557
2001 = 834
2000 = 294

[Bar chart: number of PubMed "microarray" papers per year, 2000 to 2005]

Page 7:

A Filter Array

Page 8:

DNA Chip Microarrays

• Put a large number (~100K) of cDNA sequences or synthetic DNA oligomers onto a glass slide (or other substrate) in known locations on a grid.
• Label an RNA sample and hybridize
• Measure amounts of RNA bound to each square in the grid
• Make comparisons
  – Cancerous vs. normal tissue
  – Treated vs. untreated
  – Time course
• Many applications in both basic and clinical research

Page 9:

cDNA Microarray Technologies

• Spot cloned cDNAs onto a glass microscope slide
  – usually PCR amplified segments of plasmids
• Label 2 RNA samples with 2 different colors of fluorescent dye - control vs. experimental
• Mix two labeled RNAs and hybridize to the chip
• Make two scans - one for each color
• Combine the images to calculate ratios of amounts of each RNA that bind to each spot

Page 10:

Spot your own Chip (plans available for free from Pat Brown’s website)

Robot spotter

Ordinary glass microscope slide

Page 11:
Page 12:

Combine scans for Red & Green

False color image is made from digitized fluorescence data, not by superimposing scanned images

Page 13:

cDNA Spotted Microarrays

Page 14:
Page 15:

Data Acquisition

• Scan the arrays
• Quantitate each spot
• Subtract background
• Normalize
• Export a table of fluorescent intensities for each gene in the array

Page 16:

Affymetrix “Gene chip” system

• Uses 25 base oligos synthesized in place on a chip (20 pairs of oligos for each gene)
• RNA labeled and scanned in a single “color”
  – one sample per chip
• Can have as many as 20,000 genes on a chip
• Arrays get smaller every year (more genes)
• Chips are expensive
• Proprietary system: “black box” software, can only use their chips

Page 17:

Affymetrix Gene Chip

Page 18:
Page 19:
Page 20:

Affymetrix Technology

Page 21:
Page 22:
Page 23:

Affymetrix Software

• Affymetrix System is totally automated
• Computes a single value for each gene from 40 probes (using surprisingly kludgy math)
• Highly reproducible (re-scan of same chip or hyb. of duplicate chips with same labeled sample gives very similar results)
• Incorporates false results due to image artefacts
  – dust, bubbles
  – pixel spillover from bright spot to neighboring dark spots

Page 24:

Affymetrix Pivot Table

Samples (left to right): normal, tumor, tumor, normal, normal, tumor

ID_REF VALUE ABS_CALL VALUE ABS_CALL VALUE ABS_CALL VALUE ABS_CALL VALUE ABS_CALL VALUE ABS_CALL
AFFX-BioB-5_at 210.6 P 234.6 P 362.5 P 389 P 305.6 P 330.5 P
AFFX-BioB-M_at 393 P 327.8 P 501.4 P 816.5 P 542 P 440.8 P
AFFX-BioB-3_at 264.9 P 164.6 P 244.7 P 379.7 P 261.3 P 303.7 P
AFFX-BioC-5_at 738.6 P 676.1 P 737.6 P 1191.2 P 917 P 767.9 P
AFFX-BioC-3_at 356.3 P 365.9 P 423.4 P 711.6 P 560.3 P 484.9 P
AFFX-BioDn-5_at 566.3 P 442.2 P 649.7 P 834.3 P 599.1 P 606.9 P
AFFX-BioDn-3_at 3911.8 P 3703.7 P 4680.9 P 6037.7 P 4653.7 P 4232 P
AFFX-CreX-5_at 6433.3 P 5980 P 7734.7 P 10591 P 8162.1 P 8428 P
AFFX-CreX-3_at 11917.8 P 9376.7 P 11509.3 P 16814.4 P 13861.8 P 13653.4 P
AFFX-DapX-5_at 12.2 A 44.3 M 31.2 A 37.7 P 33.3 A 12.8 A
AFFX-DapX-M_at 57.8 M 42.5 A 79 M 48.8 P 39.5 A 39.2 A
AFFX-DapX-3_at 29.8 A 6.2 A 23.4 A 28.4 A 3.2 A 7.6 A
AFFX-LysX-5_at 15.3 A 16.2 A 15.6 A 16.7 A 3.1 A 3.9 A
AFFX-LysX-M_at 33.2 A 12 A 17.7 A 37.3 A 49.2 A 9.1 A
AFFX-LysX-3_at 40.7 M 10.7 A 36.2 A 22.1 A 22.8 A 28.2 A
AFFX-PheX-5_at 7.8 A 3 A 7.6 A 5.6 A 5 A 6.4 A
AFFX-PheX-M_at 4.2 A 4.8 A 6.8 A 6.1 A 3.7 A 5.5 A
AFFX-PheX-3_at 54.2 A 39.6 A 19.4 A 16.1 A 44.7 A 31.2 A
AFFX-ThrX-5_at 8.2 A 11.2 A 13.2 A 9.5 A 8.5 A 7.5 A
AFFX-ThrX-M_at 38.1 A 30.6 A 37.6 A 7.2 A 26.9 A 36.3 A
AFFX-ThrX-3_at 15.2 A 5 A 15 A 8.3 A 36.8 A 11.5 A
AFFX-TrpnX-5_at 11.2 A 11.8 A 22.2 A 22.1 A 8.9 A 35.6 A
AFFX-TrpnX-M_at 9 A 8.1 A 9.1 A 8.7 A 8.1 A 12 A
AFFX-TrpnX-3_at 19.8 A 12.8 A 11.8 A 43.2 M 17.4 A 10 A
AFFX-HUMISGF3A/M97935_5_at 82.7 P 120.7 P 92.7 P 46.4 P 55.9 P 46.5 P
AFFX-HUMISGF3A/M97935_MA_at 397.6 P 416.7 P 244.8 A 181.4 A 197.5 A 192.3 A
AFFX-HUMISGF3A/M97935_MB_at 206.2 P 303 P 300.8 P 253.5 P 195.3 P 216 P
AFFX-HUMISGF3A/M97935_3_at 663.8 P 723.9 P 812.1 P 666.1 P 629.4 P 754.1 P
AFFX-HUMRGE/M10098_5_at 547.6 P 405.9 P 6894.7 P 3496.1 P 1958.5 P 5799.4 P
AFFX-HUMRGE/M10098_M_at 239.1 P 175.8 P 3675 P 1348.6 P 695.9 P 2428.2 P
AFFX-HUMRGE/M10098_3_at 1236.4 P 721.4 P 9076.1 P 7795.9 P 4237.1 P 7890 P
AFFX-HUMGAPDH/M33197_5_at 19508 P 19267.1 P 22892 P 26584 P 29666.6 P 25038.1 P
AFFX-HUMGAPDH/M33197_M_at 18996.6 P 20610.4 P 21573.7 P 29936 P 30106.6 P 22380.2 P
AFFX-HUMGAPDH/M33197_3_at 18016.4 P 17463.8 P 20921.3 P 26908.3 P 28382.2 P 21885 P
AFFX-HSAC07/X00351_5_at 23294.6 P 21783.7 P 18423.3 P 21858.9 P 23517.1 P 19450.3 P
AFFX-HSAC07/X00351_M_at 25373.1 P 24922.8 P 22384.2 P 25760.2 P 27718.5 P 21401.6 P
AFFX-HSAC07/X00351_3_at 20032.8 P 20251.1 P 20961.7 P 23494.6 P 23381.2 P 21173.3 P

Page 25:

Plot of raw data (PM probes)

Page 26:

Plot of log2 data (PM probes)

Page 27:

MA plot: log of fold change (M) vs log of Intensity (A)

[Figure: MA plots of Hypox1 vs. Hypox2, Hypox3, Norm1, Norm2, and Norm3]

For two arrays with intensities $I_A$ and $I_B$ for the same probe:

$M = \log_2(I_A / I_B)$
$A = \tfrac{1}{2}\log_2(I_A \cdot I_B) = \tfrac{1}{2}\left(\log_2 I_A + \log_2 I_B\right)$
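
The M and A definitions above can be sketched in a few lines. This is a minimal illustration, not the plotting code behind the slide; the function name `ma_values` and the intensity values are hypothetical.

```python
import math

def ma_values(sample_a, sample_b):
    """Return a list of (M, A) pairs for paired intensities of two arrays."""
    pairs = []
    for a, b in zip(sample_a, sample_b):
        m = math.log2(a / b)                        # log2 fold change
        avg = 0.5 * (math.log2(a) + math.log2(b))   # mean log2 intensity
        pairs.append((m, avg))
    return pairs

# Hypothetical raw intensities for three probes on two arrays.
array_a = [400.0, 1000.0, 64.0]
array_b = [100.0, 1000.0, 256.0]

for m, a in ma_values(array_a, array_b):
    print(f"M = {m:+.2f}, A = {a:.2f}")
```

A probe with equal intensity on both arrays lands on M = 0, so systematic curvature away from that line in an MA plot points to an intensity-dependent bias that normalization should remove.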

Page 28:

Goals of a Microarray Experiment

1. Find the genes that change expression between experimental and control samples

2. Classify samples based on a gene expression profile

3. Find patterns: Groups of biologically related genes that change expression together across samples/treatments

Page 29:

Basic Data Analysis

• Fold change (relative increase or decrease in intensity for each gene)
• Set cutoff filter for low values (background + noise)
• Cluster genes by similar changes - only really meaningful across multiple treatments or time points
• Cluster samples by similar gene expression profiles
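
The first two bullets (fold change plus a low-value cutoff) might look like this in practice. It is a sketch under assumptions: the cutoff of 50 intensity units, the gene names, and the function name `log2_fold_changes` are all hypothetical.

```python
import math

def log2_fold_changes(control, treated, min_intensity=50.0):
    """Map gene -> log2(treated / control), dropping genes that are
    below the intensity cutoff in both samples (background + noise)."""
    result = {}
    for gene, c in control.items():
        t = treated[gene]
        if max(c, t) < min_intensity:
            continue  # too close to background to give a reliable ratio
        result[gene] = math.log2(t / c)
    return result

# Hypothetical intensities for three genes.
control = {"geneA": 100.0, "geneB": 8.0, "geneC": 800.0}
treated = {"geneA": 400.0, "geneB": 3.0, "geneC": 200.0}

print(log2_fold_changes(control, treated))
```

Here geneB is filtered out even though its ratio looks large, because both of its values sit in the noise range.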

Page 30:

Streamlined Affy Analysis

[Workflow diagram]

Raw data (Affymetrix pivot table of VALUE and ABS_CALL columns for normal and tumor samples)
→ Normalize (RMA)
→ Filter
  • Present/Absent
  • Minimum value
  • Fold change
→ Significance
  • t-test
  • SAM
  • Rank Product
→ Classification
  • PAM
  • Machine learning
→ Clustering
→ Gene lists
→ Function (Gene Ontology)

Page 31:

Scatter plot of all genes in a simple comparison of two controls (A) and two treatments (B: high vs. low glucose), showing changes in expression greater than 2.2-fold and 3-fold.

Page 32:

Thomas Hudson, Montreal Genome Center

Page 33:

Normalization

• Can control for many of the experimental sources of variability (systematic, not random or gene specific)
• Bring each image to the same average brightness
• Can use simple math or fancy:
  – divide by the mean (whole chip or by sectors)
  – LOESS (locally weighted regression)
• No sure biological standards
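
The "simple math" option, dividing each chip by its own mean so that all chips share the same average brightness, can be sketched as follows. The function name `scale_to_common_mean`, the chip names, and the values are hypothetical; LOESS is not shown.

```python
from statistics import mean

def scale_to_common_mean(arrays, target=None):
    """Divide each array by its own mean, then multiply by a shared target
    so that every array ends up with the same average intensity."""
    if target is None:
        # Assumption: use the average of the per-array means as the target.
        target = mean(mean(values) for values in arrays.values())
    return {name: [v * target / mean(values) for v in values]
            for name, values in arrays.items()}

# Hypothetical chips: chip2 was scanned twice as bright as chip1.
arrays = {"chip1": [100.0, 200.0, 300.0], "chip2": [200.0, 400.0, 600.0]}
scaled = scale_to_common_mean(arrays)
print(scaled)
```

After scaling, the two hypothetical chips are identical, which is the intended outcome when the only difference between them is overall brightness.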

Page 34:

RMA

• Robust Multichip Average
• Bolstad, B.M., Irizarry, R.A., Astrand, M., and Speed, T.P. (2003), A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Bias and Variance. Bioinformatics 19(2):185-193

$\log(\mathrm{medpol}(PM_{ij} - BG)) = \mu_i + \alpha_j + e_{ij}$ for array $i$, probe $j$
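
The "medpol" in the model refers to median polish, which fits an array effect and a probe effect by repeatedly sweeping out row and column medians. Below is a sketch of Tukey's median polish for one probe set, assuming rows are arrays and columns are probes, and that the input is already background-corrected and log-transformed. Full RMA also includes background correction and quantile normalization, which are not shown. The function name and data are hypothetical.

```python
from statistics import median

def median_polish(matrix, max_iter=10, tol=1e-6):
    """Fit matrix[i][j] ~ overall + row_eff[i] + col_eff[j] + residual."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    resid = [row[:] for row in matrix]
    overall, row_eff, col_eff = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep the median of each row (array) out of the residuals.
        row_med = [median(r) for r in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Sweep the median of each column (probe) out of the residuals.
        col_med = [median(resid[i][j] for i in range(n_rows))
                   for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid

# Hypothetical log2 PM values: 3 arrays (rows) x 3 probes (columns).
# The value 15.0 is an outlier, e.g. a dust speck on one probe.
log_pm = [[7.0, 8.0, 15.0],
          [9.0, 10.0, 11.0],
          [8.0, 9.0, 10.0]]
overall, row_eff, col_eff, resid = median_polish(log_pm)
expression = [overall + r for r in row_eff]   # one summary value per array
print(expression)
```

Because medians are used instead of means, the single outlying probe ends up in the residuals rather than inflating the summary value of the first array.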

Page 35:

Are the Treatments Different?

• Analysis of microarray data has tended to focus on making lists of genes that are up or down regulated between treatments
• Before making these lists, ask the question: "Are the treatments different?"
• PCA/MDS or cluster the samples
• If the treatment is responsible for differences, then use statistical methods to find the genes most responsible
• If there are not significant overall differences, then lists of genes with large fold changes may only reflect random variability.

Page 36:

Statistics

• When you have variability in measurements, you need replication and statistics to find real differences
• It’s not just the genes with 2 fold increase, but those with a significant p-value across replicates
• Non-parametric (i.e. rank or permutation) or paired value statistics may be more appropriate (low number of samples, high standard deviation)
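
As an illustration of the permutation approach mentioned in the last bullet, here is a sketch of an exact two-sided permutation test on the difference in means for a single gene. The function name and the expression values are hypothetical.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in means."""
    pooled = group_a + group_b
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    hits = total = 0
    # Enumerate every way of relabelling n_a of the values as "group A".
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[k] for k in chosen]
        b = [pooled[k] for k in range(len(pooled)) if k not in chosen]
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            hits += 1
        total += 1
    return hits / total

# Hypothetical log2 expression of one gene, 3 replicates per condition.
control = [7.1, 7.4, 7.3]
treated = [9.0, 9.2, 8.8]
print(permutation_p_value(control, treated))   # 2 of 20 relabellings: 0.1
```

With 3 replicates per group there are only 20 possible relabellings, so even perfectly separated groups cannot reach a p-value below 0.1. This is one concrete reason why replication matters.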

Page 37:

Multiple Comparisons

• In a microarray experiment, each gene (each probe or probe set) is really a separate experiment
• Yet if you treat each gene as an independent comparison, you will always find some with significant differences
  – (the tails of a normal distribution)
• Different genes are NOT independent
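
A small simulation makes the size of the problem concrete. It assumes 20,000 genes, a per-gene threshold of p < 0.05, and no real differences anywhere. It also treats genes as independent, which the slide notes is not true for real data, so this only illustrates the expected count. All names and numbers are hypothetical.

```python
import random

def count_false_positives(n_genes=20000, alpha=0.05, seed=1):
    """Count genes called 'significant' when every null hypothesis is true."""
    rng = random.Random(seed)
    # Under the null hypothesis, a p-value is uniform on [0, 1].
    return sum(rng.random() < alpha for _ in range(n_genes))

n_genes, alpha = 20000, 0.05
print("expected false positives:", n_genes * alpha)
print("simulated false positives:", count_false_positives(n_genes, alpha))
```

Roughly a thousand genes pass the threshold by chance alone, which is why an uncorrected per-gene p-value cannot be used to build a gene list.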

Page 38:

False Discovery

• Statisticians call false positives a "type 1 error" or a "False Discovery"
• The number of false discoveries must be smaller than the number of real differences that you find - which in turn depends on the size of the differences and variability of the measured expression values
• You can’t know the true false discovery rate (FDR) for your data, but it can be estimated in a number of different ways.
• In biology we tend to be comfortable with an estimated FDR of 5-10%
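
One widely used way to control the FDR is the Benjamini-Hochberg procedure. The slide does not name a specific method, so this is offered only as one example of the "number of different ways"; the p-values below are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda k: p_values[k])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest, keeping the
    # adjusted values monotone.
    for rank in range(n, 0, -1):
        k = order[rank - 1]
        running_min = min(running_min, p_values[k] * n / rank)
        adjusted[k] = running_min
    return adjusted

# Hypothetical per-gene p-values.
p_values = [0.041, 0.001, 0.6, 0.039, 0.008]
q_values = benjamini_hochberg(p_values)
selected = [k for k, q in enumerate(q_values) if q <= 0.10]
print(q_values, selected)
```

Selecting genes with an adjusted value of at most 0.10 corresponds to accepting an estimated FDR of 10%, the upper end of the range quoted above.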

Page 39:

SAM: Significance Analysis of Microarrays

Tusher, Tibshirani and Chu (2001): Significance analysis of microarrays applied to the ionizing radiation response. PNAS 2001 98: 5116-5121 (Apr 24).

• R package, Excel plugin
• Free
• Permutation based
• Most published method of microarray data analysis

Page 40:

SAM - procedure overview

[Flowchart, boxes in order]
• Sample genes expression scale
• Define and calculate a statistic, d(i)
• Generate permuted samples
• Estimate attributes of d(i)’s distribution
• Identify potentially significant genes
• Estimate FDR
• Choose Δ

Page 41:

• Calculate “relative difference” – a value that incorporates the change in expression between conditions and the variation of measurements in each condition

• Calculate “expected relative difference” – derived from controls generated by permutations of data

• Plot against each other, set cutoff to identify deviating genes

• Calculate FDR for chosen cutoff from the control permutations

Page 42:

Relative Difference

$d(i) = \dfrac{\bar{x}_I(i) - \bar{x}_U(i)}{s(i) + s_0}$

$\bar{x}_I(i),\ \bar{x}_U(i)$: mean expression of gene $i$ in conditions I and U
$s(i)$: gene-specific scatter
$s_0$: constant to reduce variation of low expressed genes
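
The relative difference can be computed as below. The slide does not spell out s(i); this sketch assumes the pooled standard error of the difference in means, as defined in the SAM paper by Tusher et al. The function name, the data, and the choice s0 = 1.0 are hypothetical.

```python
import math
from statistics import mean

def relative_difference(cond_i, cond_u, s0=0.0):
    """SAM-style d(i) for one gene measured in conditions I and U."""
    n1, n2 = len(cond_i), len(cond_u)
    mean_i, mean_u = mean(cond_i), mean(cond_u)
    ss = (sum((x - mean_i) ** 2 for x in cond_i)
          + sum((x - mean_u) ** 2 for x in cond_u))
    # Gene-specific scatter: pooled standard error of the mean difference.
    s = math.sqrt((1 / n1 + 1 / n2) / (n1 + n2 - 2) * ss)
    return (mean_i - mean_u) / (s + s0)

high = ([10.0, 11.0, 12.0], [7.0, 8.0, 9.0])   # highly expressed gene
low = ([1.0, 1.1, 1.2], [0.7, 0.8, 0.9])       # same pattern, 10x smaller

for s0 in (0.0, 1.0):
    print(s0, relative_difference(*high, s0=s0), relative_difference(*low, s0=s0))
```

Without s0 the two genes get the same score, because the low-expressed gene has a tiny difference divided by a tiny scatter. Adding s0 pulls the score of the low-expressed gene down, which is the stabilizing effect described above.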

Page 43:

SAM Two-Class Unpaired

Permutation tests

i) For each gene, compute the d-value (similar to a t-statistic). This is the observed d-value, d(i), for that gene.

ii) Randomly shuffle the expression values between groups A and B. Compute the d-value for each randomized set.

iii) Take the average of the randomized d-values for each gene. This is the ‘expected relative difference’, dE(i), of that gene. The difference between d(i) and dE(i) is used to measure significance.

iv) Plot d(i) vs. dE(i).

v) Calculate the FDR from the average number of genes that exceed the Δ threshold in the permuted data, relative to the number of genes called significant in the real data.

Original grouping (Gene 1):
Group A: Exp 1, Exp 2, Exp 3
Group B: Exp 4, Exp 5, Exp 6

Randomized grouping (Gene 1):
Group A: Exp 1, Exp 4, Exp 5
Group B: Exp 2, Exp 3, Exp 6
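
The permutation steps can be sketched end to end on simulated data. This is a simplified illustration, not the SAM software: it matches observed and expected d-values by rank (as the SAM paper does), calls a gene when the two differ by more than Δ, and skips SAM's asymmetric cutoffs and its correction for the proportion of unchanged genes. All names, data, and parameter values are hypothetical.

```python
import math
import random
from statistics import mean

def d_value(group_a, group_b, s0):
    """Relative difference, positive when group B is higher than group A."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = mean(group_a), mean(group_b)
    ss = (sum((x - mean_a) ** 2 for x in group_a)
          + sum((x - mean_b) ** 2 for x in group_b))
    s = math.sqrt((1 / n_a + 1 / n_b) / (n_a + n_b - 2) * ss)
    return (mean_b - mean_a) / (s + s0)

def sorted_d(data, a_cols, b_cols, s0):
    """Sorted d-values of all genes for one assignment of arrays to groups."""
    return sorted(d_value([row[k] for k in a_cols],
                          [row[k] for k in b_cols], s0) for row in data)

def sam_sketch(data, n_a, delta, s0=0.1, n_perm=200, seed=0):
    """Return (genes called, mean false calls per permutation, FDR estimate)."""
    rng = random.Random(seed)
    cols = list(range(len(data[0])))
    observed = sorted_d(data, cols[:n_a], cols[n_a:], s0)
    permuted = []
    for _ in range(n_perm):
        shuffled = cols[:]
        rng.shuffle(shuffled)            # same relabelling for every gene
        permuted.append(sorted_d(data, shuffled[:n_a], shuffled[n_a:], s0))
    # Expected d for each rank: average of that rank over all permutations.
    expected = [mean(p[r] for p in permuted) for r in range(len(data))]
    called = sum(abs(o - e) > delta for o, e in zip(observed, expected))
    false_calls = mean(sum(abs(d - e) > delta for d, e in zip(p, expected))
                       for p in permuted)
    fdr = false_calls / called if called else 0.0
    return called, false_calls, fdr

# Hypothetical log2 data: 50 genes, 4 arrays in group A then 4 in group B.
data_rng = random.Random(42)
data = [[data_rng.gauss(8.0, 0.25) for _ in range(8)] for _ in range(50)]
for row in data[:5]:                     # the first 5 genes are truly up in B
    for k in range(4, 8):
        row[k] += 4.0

called, false_calls, fdr = sam_sketch(data, n_a=4, delta=1.0)
print(f"called: {called}, mean false calls: {false_calls:.2f}, FDR: {fdr:.2f}")
```

Raising Δ calls fewer genes and lowers the estimated FDR, while lowering it does the opposite. That trade-off is what the "Choose Δ" step in the procedure refers to.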

Page 44:

SAM Two-Class Unpaired

• Plot d(i) vs. dE(i)
• For most of the genes: $d(i) \approx d_E(i)$

[Figure: SAM plot of observed d(i) vs. expected dE(i), with the annotations below]

“Observed d = expected d” line

Significant positive genes (i.e., mean expression of group B > mean expression of group A)

Significant negative genes (i.e., mean expression of group A > mean expression of group B)

The more a gene deviates from the “observed = expected” line, the more likely it is to be significant. Moving along the x-axis in the positive or negative direction, the first gene whose observed d differs from the expected d by at least delta, and every gene beyond it, is considered significant.

Page 45:

Higher Level Microarray data analysis

• Clustering and pattern detection
• Data mining and visualization
• Controls and normalization of results
• Statistical validation
• Linkage between gene expression data and gene sequence/function/metabolic pathways databases
• Discovery of common sequences in co-regulated genes
• Meta-studies using data from multiple experiments

Page 46:

Types of Clustering

• Hierarchical
  – Link similar genes, build up to a tree of all
• Self Organizing Maps (SOM)
  – Split all genes into similar sub-groups
  – Finds its own groups (machine learning)
• Principal Component
  – every gene is a dimension (vector), find a single dimension that best represents the differences in the data
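
The hierarchical idea of linking the most similar genes first and building upward can be sketched with a small agglomerative procedure. This assumes Euclidean distance and average linkage, which are common choices but not stated on the slide, and it stops at a chosen number of clusters instead of returning the full tree. Gene names and profiles are hypothetical.

```python
import math
from itertools import combinations

def euclidean(p, q):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def average_linkage(profiles, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[name] for name in profiles]
    while len(clusters) > n_clusters:
        best = None
        for c1, c2 in combinations(range(len(clusters)), 2):
            # Average distance over all pairs of members in the two clusters.
            dist = sum(euclidean(profiles[a], profiles[b])
                       for a in clusters[c1] for b in clusters[c2])
            dist /= len(clusters[c1]) * len(clusters[c2])
            if best is None or dist < best[0]:
                best = (dist, c1, c2)
        _, c1, c2 = best
        clusters[c1] = clusters[c1] + clusters[c2]
        del clusters[c2]                 # safe because c1 < c2
    return clusters

# Hypothetical log2 fold changes at three time points.
profiles = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [1.1, 2.1, 2.9],
    "geneC": [-1.0, -2.0, -3.0],
    "geneD": [-0.9, -2.2, -3.1],
}
print(average_linkage(profiles, n_clusters=2))
```

The two rising genes and the two falling genes end up in separate clusters. Recording the order of the merges, instead of stopping at two clusters, would give the tree shown in typical clustering figures.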

Page 47:

Cluster by fold change

Page 48:

GeneSpring

Page 49:
Page 50:

SOM Clusters

Page 51:

Classification

How to sort samples into two classes based on gene expression data

Cancer vs. normal
Cancer sub-types (benign vs. malignant)
Responds well to drug vs. poor response (e.g. tamoxifen for breast cancer)

Page 52:

PAM: Prediction Analysis for Microarrays

Class Prediction and Survival Analysis for Genomic Expression Data Mining

Performs sample classification from gene expression data, via the "nearest shrunken centroid method" of Tibshirani, Hastie, Narasimhan and Chu (2002): "Diagnosis of multiple cancer types by shrunken centroids of gene expression" PNAS 2002 99:6567-6572 (May 14).
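
To give a feel for centroid-based classification, here is a sketch of a plain nearest-centroid classifier. It deliberately omits the shrinkage step that distinguishes PAM (pulling each class centroid toward the overall centroid so that uninformative genes drop out), so it is not the PAM algorithm itself. Sample values and names are hypothetical.

```python
from statistics import mean

def class_centroids(samples, labels):
    """Average expression profile (one value per gene) of each class."""
    centroids = {}
    for label in set(labels):
        members = [s for s, l in zip(samples, labels) if l == label]
        centroids[label] = [mean(col) for col in zip(*members)]
    return centroids

def predict(centroids, profile):
    """Assign a sample to the class whose centroid is closest."""
    def squared_distance(centroid):
        return sum((x - y) ** 2 for x, y in zip(profile, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

# Hypothetical training data: 3 genes measured in 4 samples.
train = [[5.0, 9.0, 2.0], [6.0, 8.0, 3.0], [9.0, 4.0, 2.0], [10.0, 5.0, 3.0]]
labels = ["normal", "normal", "tumor", "tumor"]

centroids = class_centroids(train, labels)
print(predict(centroids, [9.5, 5.0, 2.0]))   # closest to the tumor centroid
```

Note that the third gene is identical in both centroids and so contributes nothing to the decision. Shrinkage in PAM removes such genes from the classifier explicitly.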

Page 53:

BioConductor

All of these normalization, statistical, and clustering methods are available in a free software package called BioConductor, which runs in the R statistical environment

www.bioconductor.org

command line interface

> data(SpikeIn)
> pms <- pm(SpikeIn)
> mms <- mm(SpikeIn)
> par(mfrow = c(1, 2))
> concentrations <- matrix(as.numeric(sampleNames(SpikeIn)), 20,
+     12, byrow = TRUE)
> matplot(concentrations, pms, log = "xy", main = "PM", ylim = c(30,
+     20000))
> lines(concentrations[1, ], apply(pms, 2, mean), lwd = 3)
> matplot(concentrations, mms, log = "xy", main = "MM", ylim = c(30,
+     20000))
> lines(concentrations[1, ], apply(mms, 2, mean), lwd = 3)

Page 54:

Functional Genomics

Take a list of "interesting" genes and find their biological relationships

Gene lists may come from significance/classification analysis of microarrays, proteomics, or other high-throughput methods

Requires a reference set of "biological knowledge"

Page 55:

Gene Ontology

How to organize biological knowledge?

Biologists work on a variety of different research organisms: yeast, fruit fly, mouse, … human

The same gene can have very different functions (antennapedia)

and very different names (sonic hedgehog…)

Page 56:

GO

Biologists got together and developed a sensible system called Gene Ontology (GO)

3 hierarchical sets of terminology
• Biological Process
• Cellular Component (location within cell)
• Molecular Function

about 1000 categories of functions

Page 57:
Page 58:

List (and convert) gene identifiers from many genomic resources including NCBI, PIR and Uniprot/SwissProt as well as Illumina and Affymetrix gene IDs

Gene IDs matched to GO function annotations (for human)

Test for enrichment of GO categories (or KEGG pathways, disease associations, etc.) in list.

Groups significant categories into clusters

Page 59:

DAVID uses a modified Fisher's Exact test to get p-values for enrichment.

Basic idea: is the frequency of this category in this list greater than the frequency of the category in the genome?

DAVID enrichment score: EASE

A Hypothetical Example: In the human genome background (30,000 genes total), 40 genes are involved in the p53 signaling pathway. In a given gene list, 3 out of 300 genes belong to the p53 signaling pathway. We then ask whether 3/300 is more than expected by random chance compared to the human background of 40/30000.

Fisher Exact P-Value = 0.008. However, the EASE Score is more conservative: EASE Score = 0.06 (using 3-1 instead of 3). Since the EASE Score is > 0.01, this user gene list is no more associated with (enriched in) the p53 signaling pathway than expected by random chance.
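
The arithmetic behind this example can be checked with an upper-tail hypergeometric probability, which is the one-sided form of Fisher's exact test for this kind of question. The sketch below is an approximation of the idea, not DAVID's implementation: `ease_like` simply counts one hit fewer, following the "3-1 instead of 3" remark, and DAVID's exact 2x2 table may be built differently. The background size is a parameter; the quoted values of roughly 0.008 and 0.06 are reproduced approximately with a background of 30,000 genes, while a background of 20,000 gives a one-sided p-value of roughly 0.02.

```python
from math import comb

def enrichment_p(pop_size, pop_hits, list_size, list_hits):
    """Upper-tail hypergeometric probability P(X >= list_hits) when
    list_size genes are drawn at random from pop_size genes."""
    total = comb(pop_size, list_size)
    upper = min(pop_hits, list_size)
    tail = sum(comb(pop_hits, k) * comb(pop_size - pop_hits, list_size - k)
               for k in range(list_hits, upper + 1))
    return tail / total

def ease_like(pop_size, pop_hits, list_size, list_hits):
    """Conservative variant: score the list as if it had one hit fewer."""
    return enrichment_p(pop_size, pop_hits, list_size, max(list_hits - 1, 0))

# 40 pathway genes in the background; 3 of the 300 list genes are in the pathway.
p = enrichment_p(30000, 40, 300, 3)
ease = ease_like(30000, 40, 300, 3)
print(f"one-sided exact p = {p:.4f}, EASE-like score = {ease:.4f}")
```

Removing a single hit moves the score from clearly below 0.01 to well above it, which shows how fragile an enrichment call based on only three genes is.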

Page 60:

Microarray Databases

• Large experiments may have hundreds of individual array hybridizations

• Core lab at an institution or multiple investigators using one machine - data archive and validate across experiments

• Data-mining - look for similar patterns of gene expression across different experiments

Page 61:

Public Databases

• Gene Expression data is an essential aspect of annotating the genome
• Publication and data exchange for microarray experiments
• Data mining/Meta-studies
• Common data format - XML
• MIAME (Minimum Information About a Microarray Experiment)

Page 62:

Array Express at EMBL

Page 63:
Page 64:

GEO at the NCBI

Page 65:
Page 66:
Page 67:
Page 68:
Page 69:

Summary

• Microarray experimental details
• Microarray data formats
• QC analysis and data exploration
• Normalization
• Differential expression
• Functional enrichment
• Databases

Page 70:

Next Lecture: Next Generation Sequencing Informatics