critical dynamic processes in the pathogenesis of cancer ... · robert stengel october 24,...

Robert StengelOctober 24, 2007

! Background

! Small, round, blue-cell tumor example

! Application to colon cancer, metastases, andnormal tissue

Multiclass Analysis ofMicroarray Data

Lecture for MOL 457, Computational Aspects of Molecular Biology,

Princeton University

Critical Dynamic Processes in thePathogenesis of Cancer

Production of aSingle Protein

! Up-regulation of mRNA transcripts in tumor cells

– Causal input (tumor enhancing gene)

– Defensive response (tumor suppressor gene)

– Bystander effect

– Tissue effect

– Artifact

! Down-regulation of mRNA transcripts in tumor cells

– Present, but mutated (and, therefore, not detected)

– Eliminated or suppressed by tumor growth

– Tissue effect

– Artifact

Putative Paradigm for Microarray Analysis:mRNA Transcript Expression Level Infers

Cell Function & Carcinogenesis

Limitations ofMicroarray Analysis

! Human genome microarray probe sets describe wild-typenucleotide sequences– They do not see mutations

! Microarray data are “noisy”– Biological variation

– Measurement error

! mRNA expression level alone is not sufficient to determinefunction– Inhibition and amplification of biological processes are multi-gene/-

transcript/-protein effects

– Ancillary data are required to interpret significance of transcript levels

! Realistic goal is to classify samples by statistical analysis oftranscript expression levels

MicroarrayClassification Objectives

! Class comparison– Identify expression profiles (feature sets)

for predefined classes

! Class prediction– Develop algorithms that predict class

membership for a novel expression profile

! Class discovery– Identify significant new classes, sub-classes,

or features

Typical Pairs of Colon CancerMicroarray Expression Levels

(Alon, Notterman et al, 1999)

-100

-50

0

5 0

100

150

200

250

-80 -60 -40 -20 0 2 0 4 0 6 0 8 0

Tumor

Normal

H57136

M96839

! Samples not well differentiated in individualtranscript clusters (overlapping)

Classification of Data

! Data set characterized by two features

Clustering of Data

! How many clusters?

Discriminants of Data

! Where are the boundaries between sets?

The Data Set Revealed

The discriminant is the Delaware River

Towns and Crossroads ofPennsylvania and New Jersey

Another ClassificationExample

! Party for returning graduatealumni/ae, Reunions, 1999

! From this picture, who arethe:– Grad alums

– Current students

– Spouses

– Children

– Visitors from abroad

– Hosts

– Oldest alums

– Youngest alums

– Party crashers?

! What features are used toclassify?

... and What if theData are Distorted?

! Grad alums

! Current students

! Spouses

! Children

! Visitors from abroad

! Hosts

! Oldest alums

! Youngest alums

! Party crashers

Supervised and UnsupervisedLearning of Discriminants

! Learning depends on “closeness”of related features

! Previously unknown correlationsor features are detected

! Classification occurs after learningvia exogenous knowledge

! Same answer given for allquestions

! Learning depends on priordefinition and knowledge ofclass

! Complex correlation betweenfeatures revealed

! Classification is inherent inlearning

! Different answers given fordifferent questions

Supervised

Learning

Unsupervised

Learning

Effect ofFeature Set Dimension

-50

0

5 0

100

150

200

-20 -10 0 1 0 2 0 3 0 4 0

Tumor

Normal

D14657

M94363

! Best line or curve mayclassify with significanterror

! Best plane or surfaceclassifies with equal or lesserror

Characteristics ofClassification Features

! Additional features

– Orthogonal feature (low correlation) may addnew information to the set

– Co-expressed feature (high correlation) isredundant; averaging may reduce error

-50

0

5 0

100

150

200

-20 -10 0 1 0 2 0 3 0 4 0

Tumor

Normal

D14657

M94363

! Strong feature

– Individual feature provides good classification

– Minimal overlap of feature values in each class

– Significant difference in class mean values

– Low variance in class is desirable

Two-Sample Student t Testfor Transcript Selection

!

t =m

A"m

B( )

#A

2

nA

+#B

2

nB

! m = mean value of data set

! ! = standard deviation of data set

! n = number of points in data set

! t test applied to mean tumor/normal expressionlevels of individual transcripts

! |t| > 3, mA ! mB with "99.7% confidence[error p # 0.003 (Gaussian)] [n > 25]

! t compares mean values of two data sets– |t| is reduced by uncertainty in the data sets (!)

– |t| is increased by number of points in the data sets (n)

But All Data Sets AreNot Gaussian!

! Association between t and p-value is different for Gaussianand non-Gaussian distributions

! All finite data sets have means and standard deviationsthat describe the average and the spread

! Classification goal is to find transcripts whose expressionlevels are significantly different, not to compute p-value

! Large value of t virtually assures small p-value forcomparison of any unimodal distributions

– Actual p-value is of little concern if it is “small enough”

! Statistical significance can be confirmed by standardtechniques (e.g., Bonferroni correction, Benjamini-Hochberg procedure)

Example of Transcript-by-TranscriptTumor/Normal Classification by t Test

(data from Alon, Notterman et al, 1999)

!

t = mT"m

N( )#T

2

36+#N

2

22

! 1,151 probe sets* areover/under-expressed intumor/normal comparison, p# 0.003

! Genetically dissimilarsamples are apparent

“Cancer-positive probe sets”

“Cancer-negative probe sets”

* Each probe set represents

one mRNA transcript

Correlation Matrices

! Probe set correlation

! Sample correlation

x =

x =

+1 0 –1

Ensemble Mean Values

! Treat each probe set (row) as a redundant, corruptedmeasurement of the same tumor/normal indicator

! Compute column averages for each sample sub-group(i.e., sum each column and divide by n)

!

zij = kiy + "ij , i =1,m, j =1, n

!

ˆ z j =1

nzij

i=1

n

"

! Random variable sums approach Gaussiandistribution by central limit theorem as n -> $

Class Prediction and EvaluationUsing Ensemble Mean Values

(data from Alon, Notterman, 1999)

! Single-featureprediction

– Cancer-positivegenes: 2 errors

– Cancer-negativegenes: 1 error

! Two-featureprediction (cross-plot): 1 error(mislabeling?)

0

20

40

60

80

100

120

140

160

180

200

0 20 40 60 80 100 120 140 160 180

Average Value for Genes Up-Regulated in Tumors

Avera

ge V

alu

e f

or

Gen

es D

ow

n-R

eg

ula

ted

in

Tu

mo

rs

Tumors

Normals

Clustering of Sample Averages for PrimaryColon Cancer vs. Normal Mucosa[NCI Program Project Grant (PPG), 144-sample set]

0.00

100.00

200.00

300.00

400.00

500.00

600.00

700.00

800.00

0.00 200.00 400.00 600.00 800.00 1000.00 1200.00 1400.00 1600.00 1800.00 2000.00

Average Value of Up-Regulated Genes (Primary Colon Cancer)

Av

era

ge

Va

lue

of

Do

wn

-Re

gu

late

d G

en

es

(P

rim

ary

Co

lon

Ca

nc

er)

Primary Colon Cancer

Normal Mucosa

# Probe sets

Up: 1067

Down: 290

Constant: 19

! 144 samples,3,437 probesets analyzed

! 47 primarycolon cancer

! 22 normalmucosa

! AffymetrixHGU-133AGeneChip

! Alltranscripts“Present” inall samples

Clustering of Sample Averages forPrimary Polyp vs. Normal Mucosa

(PPG, 144-sample set)

0.00

100.00

200.00

300.00

400.00

500.00

600.00

700.00

800.00

0.00 500.00 1000.00 1500.00 2000.00 2500.00

Average Value of Up-Regulated Genes (Primary Polyp)

Av

era

ge

Va

lue

of

Do

wn

-Re

gu

late

d G

en

es

(P

rim

ary

Po

lyp

)

Primary Polyp

Normal Mucosa# Probe sets

Up: 1014

Down: 219

Constant: 40

! 21 primarypolyp

! 22 normalmucosa

Clustering of Sample Averages forPrimary Polyp vs. Primary Colon Cancer

(PPG, 144-sample set)

0.00

100.00

200.00

300.00

400.00

500.00

600.00

700.00

800.00

900.00

1000.00

0.00 500.00 1000.00 1500.00 2000.00 2500.00 3000.00 3500.00

Average Value of Up-Regulated Genes (Primary Polyp)

Avera

ge V

alu

e o

f D

ow

n-R

eg

ula

ted

Gen

es (

Pri

mary

Po

lyp

)

Primary Polyp


# Probe sets

Up: 273

Down: 126

Constant: 172

! 21 primarypolyp

! 47 primarycolon cancer

Identified Cancer Links inSelected Transcript Training Sets

(PPG, 144 samples)

! Bioalma linkages via GeneCards®

(Weizmann Institute of Science)

Number of Identified Cancer Genes in SetOver-Expressed (10) Under-Expressed (10) Both (20)


vs. Normal Mucosa 7 (70%) 5 (50%) 12 (60%)

Primary Polyp vs.

Norma Mucosa 7 (70%) 5 (50%) 12 (60%)


vs. Primary Polyp 7 (70%) 6 (60%) 13 (65%)

Total 21 (70%) 16 (53%) 37 (62%)

Small, Round Blue-Cell TumorClassification Example

http://www.medstudents.com.br/patoclin/courses/

sarcomas/dsrct/dsrct.htm

! Childhood cancers, including

– Ewing’s sarcoma (EWS)

– Burkitt’s Lymphoma (BL)

– Neuroblastoma (NB)

– Rhabdomyosarcoma (RMS)

! cDNA microarray analysis presented byJ. Khan, et al., 2001

» 96 transcripts chosen from 2,308 probes fortraining

» 63 EWS, BL, NB, and RMS training samples

» Classification with principal componentanalysis and a linear neural network

– Source of data for my analysis

Desmoplastic small,

round blue-cell tumors

Overview of PresentSRBCT Analysis

! Transcript selection by t test– 96 transcripts, 12 highest and lowest t values for each class

– Overlap with Khan set: 32 transcripts

! Ensemble averaging of highest and lowest t valuesfor each class

! Cross-plot of ensemble averages

! Classification by sigmoidal neural network

! Validation of neural network– Novel set simulation

– Leave-one-out simulation

Ranking by EWS t Values(Top and Bottom 12)

! 24 transcripts selected from 12 highest and lowest t values for EWS vs.remainder

! Repeated for BL vs. remainder, NB vs. remainder, and RMS vs. remainder

Sort by EWS t Value EWS BL NB RMSImage ID Transcript Description t Value t Value t Value t Value

770394 Fc fragment of IgG, receptor, transporter, alpha 12.04 -6.67 -6.17 -4.791435862 antigen identified by monoclonal antibodies 12E7, F21 and O13 9.09 -6.75 -5.01 -4.03377461 caveolin 1, caveolae protein, 22kD 8.82 -5.97 -4.91 -4.78814260 follicular lymphoma variant translocation 1 8.17 -4.31 -4.70 -5.48491565 Cbp/p300-interacting transactivator, with Glu/Asp-rich carboxy-terminal domain 7.60 -5.82 -2.62 -3.68841641 cyclin D1 (PRAD1: parathyroid adenomatosis 1) 6.84 -9.93 0.56 -4.30

1471841 ATPase, Na+/K+ transporting, alpha 1 polypeptide 6.65 -3.56 -2.72 -4.69866702 protein tyrosine phosphatase, non-receptor type 13 6.54 -4.99 -4.07 -4.84713922 glutathione S-transferase M1 6.17 -5.61 -5.16 -1.97308497 KIAA0467 protein 5.99 -6.69 -6.63 -1.11770868 NGFI-A binding protein 2 (ERG1 binding protein 2) 5.93 -6.74 -3.88 -1.21345232 lymphotoxin alpha (TNF superfamily, member 1) 5.61 -8.05 -2.49 -1.19

786084 chromobox homolog 1 (Drosophila HP1 beta) -5.04 -1.05 9.65 -0.62796258 sarcoglycan, alpha (50kD dystrophin-associated glycoprotein) -5.04 -3.31 -3.86 6.83431397 -5.04 2.64 2.19 0.64825411 N-acetylglucosamine receptor 1 (thyroid) -5.06 -1.45 5.79 0.76859359 quinone oxidoreductase homolog -5.23 -7.27 0.78 5.40

75254 cysteine and glycine-rich protein 2 (LIM domain only, smooth muscle) -5.30 -4.11 2.20 3.68448386 -5.38 -0.42 3.76 0.14

68950 cyclin E1 -5.80 0.03 -1.58 5.10774502 protein tyrosine phosphatase, non-receptor type 12 -5.80 -5.56 3.76 3.66842820 inducible poly(A)-binding protein -6.14 0.60 0.66 3.80214572 ESTs -6.39 -0.08 -0.22 4.56295985 ESTs -9.26 -0.13 3.24 2.95

Strong t-value distinctions

between classes

Comparison of PresentSRBCT Set with Khan Top 10

EWS BL NB RMS Most KhanStudent t Student t Student t Student t Significant Gene

Image ID Gene Description Value Value Value Value t Value Class

296448

insulin-like growth factor 2

(somatomedin A) -4.789 -5.226 -1.185 5.998 RMS RMS

207274

Human DNA for insulin-like

growth factor II (IGF-2); exon

7 and additional ORF -4.377 -5.424 -1.639 5.708 RMS RMS

841641

cyclin D1 (PRAD1:

parathyroid adenomatosis 1) 6.841 -9.932 0.565 -4.300 BL (-) EWS/NB365826 growth arrest-specific 1 3.551 -8.438 -6.995 1.583 BL (-) EWS/RMS486787 calponin 3, acidic -4.335 -6.354 2.446 2.605 BL (-) RMS/NB

770394

Fc fragment of IgG, receptor,

transporter, alpha 12.037 -6.673 -6.173 -4.792 EWS EWS244618 ESTs -4.174 -4.822 -3.484 5.986 RMS RMS

233721

insulin-like growth factor

binding protein 2 (36kD) 0.058 -7.487 -1.599 2.184 BL (-) Not BL43733 glycogenin 2 4.715 -4.576 -3.834 -3.524 EWS EWS

295985 ESTs -9.260 -0.133 3.237 2.948 EWS (-) Not EWS

! Red: both sets

! Black: Khan set only

Khan selection

distinguished by

strong, consistent

relation to t values

Clustering of SRBCTEnsemble Averages

EWS set BL set

NB set RMS set

Artificial NeuralNetworks

Neural NetworkTraining Set

!

Sample1 Sample 2 Sample 3 ... Sample 62 Sample 63

EWS EWS EWS ... RMS RMS

EWS(+)Average EWS(+)Average EWS(+)Average ... EWS(+)Average EWS(+)Average

EWS(")Average EWS(")Average EWS(")Average ... EWS(")Average EWS(")Average

BL(+)Average BL(+)Average BL(+)Average ... BL(+)Average BL(+)Average

BL(")Average BL(")Average BL(")Average ... BL(")Average BL(")Average

NB(+)Average NB(+)Average NB(+)Average ... NB(+)Average NB(+)Average

NB(")Average NB(")Average NB(")Average ... NB(")Average NB(")Average

RMS(+)Average RMS(+)Average RMS(+)Average ... RMS(+)Average RMS(+)Average

RMS(")Average RMS(")Average RMS(")Average ... RMS(")Average RMS(")Average

#

$

% % % % % % % % % % % % % % %

&

'

( ( ( ( ( ( ( ( ( ( ( ( ( ( (

! Each input row is an ensemble average for a transcript set,normalized in (–1,+1)

Identifier

Target Output

Transcript

Training

Set

SRBCT Neural NetworkTraining and Application

! Neural network– 8 ensemble-average inputs

– various # of sigmoidalneurons

– 4 linear neurons

– 4 outputs

! Training accuracy– Train on all 63 samples

– Test on all 63 samples

– 100% accuracy

Leave-One-Out Validation ofSRBCT Neural Network

! Remove a single sample

! Train on remaining samples (125 times)

! Evaluate class of the removed sample

! Repeat for each of 63 samples

! 6 sigmoids: 99.96% accuracy (3 errorsin 7,875 trials)

! 12 sigmoids: 99.99% accuracy (1 errorin 7,875 trials)

Novel-Set Validation ofSRBCT Neural Network

! Network always chooses one of four classes(i.e., “unknown” is not an option)

! Test on 25 novel samples (400 times each)– 5 EWS

– 5 BL

– 5 NB

– 5 RMS

– 5 samples of unknown class

! 99.96% accuracy on first 20 novel samples(3 errors in 8,000 trials)

! 0% accuracy on unknown classes

Observations of SRBCT Classificationusing Ensemble Averages

! t test identified strong features for classification in this dataset

! Neural networks easily classified the four data types

! Caveat: Small, round blue-cell tumors occur in differenttissue types

– Ewing’s sarcoma: Bone tissue

– Burkitt’s Lymphoma: Lymph nodes

– Neuroblastoma: Nerve tissue

– Rhabdomyosarcoma: Soft tissue

! Genetic variation may be linked to tissuedifferences as well as tumor genetics

260-Sample PPG ColonCancer Data Set

! Primary colon cancer (130)

! Primary polyp (31)

! Normal mucosa (24)

! Liver metastasis (33)

! Normal liver (12)

! Lung metastasis (20)

! Normal lung (5)

! Microadenoma (4)

! High grade dysplasia (1)

! Affymetrix HGU-133A GeneChip

! 22,283 probe sets

! 17,455 probe setsverified

! Transcripts are“Present” in 80%of samples

! 7,343 probe setsanalyzed

9-Class Colon CancerNeural Network

! Neural network– 18 ensemble-average inputs

– various # of sigmoidal neurons

– 9 linear neurons

– 9 outputs

– 12-48 transcripts in eachensemble average

Training and Validation Accuracy of9-Class Colon Cancer Neural Network

! 9-class PPG set is more heterogeneous thanthe 4-class SRBCT set

! Generalization is better with fewer neurons

Transcripts in Training Set Leave-One-OutEnsemble Neurons Trials Correct % Correct Correct % Correct

12 4 260 176 68% 151 58%12 9 260 188 72% 156 60%48 4 260 179 69% 173 67%48 9 260 194 75% 170 65%48 18 260 234 90% 172 66%

Classification of Primary Tumors,Polyps, and Normal Mucosa

! Primary colon cancer (130), primary polyp (31),normal mucosa (24)


6 3 370 323 87% 310 84%6 4 370 328 89% 314 85%6 6 370 328 89% 312 84%

12 1 370 296 80% 298 81%12 2 370 336 91% 320 86%12 3 370 336 91% 319 86%12 4 370 333 90% 319 86%12 6 370 334 90% 323 87%24 3 370 311 84% 311 84%24 6 370 330 89% 320 86%24 6 370 334 90% 318 86%

Primary Tumor vs. Polyp

! Positive vs. negative meta-transcripts

! Samplecorrelation

! Significant correlation between tumors and polyps

Classification of Primary Tumors,Metastases, and Normal Tissue

! Primary colon cancer (130), normal mucosa (24), liver metastasis(33), normal liver (12), lung metastasis (20), normal lung (5)

! Elimination of organ-specific effects on tumor comparisons


12 1 448 260 58% 275 61%12 3 448 364 81% 343 77%12 6 448 427 95% 367 82%12 9 448 427 95% 379 85%12 12 448 433 97% 377 84%

Classification of Primary ColonCancer According to Gender

! Male (61), female (44)


6 4 210 168 80% 164 78%12 2 210 178 85% 177 84%12 4 210 178 85% 176 84%12 8 210 178 85% 176 84%12 9 210 179 85% 176 84%24 2 210 184 88% 174 83%24 4 210 184 88% 174 83%

Classification of Primary ColonCancer According to Age at Diagnosis

! 50 annotated samples

! Implication that genetic distinction occurs in the 60s

Transcripts in Training Set Leave-One-OutEnsemble Neurons Trials Correct % Correct Correct % CorrectDiscriminant = 50 yr

24 2 200 184 92% 180 90%24 4 400 368 92% 354 89%48 2 400 366 92% 358 90%

Discriminant = 60 yr24 2 400 372 93% 362 91%48 2 400 360 90% 360 90%

Discriminant = 70 yr24 2 400 312 78% 305 76%24 2 400 336 84% 312 78%

Conclusions

! Simple tests for transcript selection– Mean (t test)

– Covariance (correlation matrices)

! Multiclass classification using neural networks

! Heterogeneity of colon cancer data

! Confounding effect of tissue-specific transcripts

! Supervised classification according to– Primary tumor, polyp, normal mucosa

– Primary tumor, metastasis, normal tissue

– Patient gender and age

Future Work on PPGColon Cancer Data

! Classification by outcome

! Analysis of complete data set

! Orthogonal statistics of the data set

! Alternative targets and data sets forclassification of data, e.g.,– Clinical data

– Pathology data

– Mutated DNA drawn from same sample set

! Identification of functional pathways

Acknowledgments

! Gunter Schemmann, Princeton University! Daniel Notterman, Princeton University! Francis Barany, Cornell Weill Medical School! Phillip Paty, Memorial Sloan Kettering Cancer Center! Eytan Domany, Weizmann Institute of Science! Jürg Ott, Rockefeller University! Arnold Levine, Institute for Advanced Study! Hao Liu, Bristol-Meyers-Squibb (formerly with UMDNJ)! Douglas Welsh, Princeton University! … and members of the Simons Center for Systems

Biology, Institute for Advanced Study

critical dynamic processes in the pathogenesis of cancer ... · robert stengel october 24,...

Documents