critical dynamic processes in the pathogenesis of cancer ... · robert stengel october 24,...
TRANSCRIPT
Robert StengelOctober 24, 2007
! Background
! Small, round, blue-cell tumor example
! Application to colon cancer, metastases, andnormal tissue
Multiclass Analysis ofMicroarray Data
Lecture for MOL 457, Computational Aspects of Molecular Biology,
Princeton University
Critical Dynamic Processes in thePathogenesis of Cancer
Production of aSingle Protein
! Up-regulation of mRNA transcripts in tumor cells
– Causal input (tumor enhancing gene)
– Defensive response (tumor suppressor gene)
– Bystander effect
– Tissue effect
– Artifact
! Down-regulation of mRNA transcripts in tumor cells
– Present, but mutated (and, therefore, not detected)
– Eliminated or suppressed by tumor growth
– Tissue effect
– Artifact
Putative Paradigm for Microarray Analysis:mRNA Transcript Expression Level Infers
Cell Function & Carcinogenesis
Limitations ofMicroarray Analysis
! Human genome microarray probe sets describe wild-typenucleotide sequences– They do not see mutations
! Microarray data are “noisy”– Biological variation
– Measurement error
! mRNA expression level alone is not sufficient to determinefunction– Inhibition and amplification of biological processes are multi-gene/-
transcript/-protein effects
– Ancillary data are required to interpret significance of transcript levels
! Realistic goal is to classify samples by statistical analysis oftranscript expression levels
MicroarrayClassification Objectives
! Class comparison– Identify expression profiles (feature sets)
for predefined classes
! Class prediction– Develop algorithms that predict class
membership for a novel expression profile
! Class discovery– Identify significant new classes, sub-classes,
or features
Typical Pairs of Colon CancerMicroarray Expression Levels
(Alon, Notterman et al, 1999)
-100
-50
0
5 0
100
150
200
250
-80 -60 -40 -20 0 2 0 4 0 6 0 8 0
Tumor
Normal
H57136
M96839
! Samples not well differentiated in individualtranscript clusters (overlapping)
Classification of Data
! Data set characterized by two features
Clustering of Data
! How many clusters?
Discriminants of Data
! Where are the boundaries between sets?
The Data Set Revealed
The discriminant is the Delaware River
Towns and Crossroads ofPennsylvania and New Jersey
Another ClassificationExample
! Party for returning graduatealumni/ae, Reunions, 1999
! From this picture, who arethe:– Grad alums
– Current students
– Spouses
– Children
– Visitors from abroad
– Hosts
– Oldest alums
– Youngest alums
– Party crashers?
! What features are used toclassify?
... and What if theData are Distorted?
! Grad alums
! Current students
! Spouses
! Children
! Visitors from abroad
! Hosts
! Oldest alums
! Youngest alums
! Party crashers
Supervised and UnsupervisedLearning of Discriminants
! Learning depends on “closeness”of related features
! Previously unknown correlationsor features are detected
! Classification occurs after learningvia exogenous knowledge
! Same answer given for allquestions
! Learning depends on priordefinition and knowledge ofclass
! Complex correlation betweenfeatures revealed
! Classification is inherent inlearning
! Different answers given fordifferent questions
Supervised
Learning
Unsupervised
Learning
Effect ofFeature Set Dimension
-50
0
5 0
100
150
200
-20 -10 0 1 0 2 0 3 0 4 0
Tumor
Normal
D14657
M94363
! Best line or curve mayclassify with significanterror
! Best plane or surfaceclassifies with equal or lesserror
Characteristics ofClassification Features
! Additional features
– Orthogonal feature (low correlation) may addnew information to the set
– Co-expressed feature (high correlation) isredundant; averaging may reduce error
-50
0
5 0
100
150
200
-20 -10 0 1 0 2 0 3 0 4 0
Tumor
Normal
D14657
M94363
! Strong feature
– Individual feature provides good classification
– Minimal overlap of feature values in each class
– Significant difference in class mean values
– Low variance in class is desirable
Two-Sample Student t Testfor Transcript Selection
!
t =m
A"m
B( )
#A
2
nA
+#B
2
nB
! m = mean value of data set
! ! = standard deviation of data set
! n = number of points in data set
! t test applied to mean tumor/normal expressionlevels of individual transcripts
! |t| > 3, mA ! mB with "99.7% confidence[error p # 0.003 (Gaussian)] [n > 25]
! t compares mean values of two data sets– |t| is reduced by uncertainty in the data sets (!)
– |t| is increased by number of points in the data sets (n)
But All Data Sets AreNot Gaussian!
! Association between t and p-value is different for Gaussianand non-Gaussian distributions
! All finite data sets have means and standard deviationsthat describe the average and the spread
! Classification goal is to find transcripts whose expressionlevels are significantly different, not to compute p-value
! Large value of t virtually assures small p-value forcomparison of any unimodal distributions
– Actual p-value is of little concern if it is “small enough”
! Statistical significance can be confirmed by standardtechniques (e.g., Bonferroni correction, Benjamini-Hochberg procedure)
Example of Transcript-by-TranscriptTumor/Normal Classification by t Test
(data from Alon, Notterman et al, 1999)
!
t = mT"m
N( )#T
2
36+#N
2
22
! 1,151 probe sets* areover/under-expressed intumor/normal comparison, p# 0.003
! Genetically dissimilarsamples are apparent
“Cancer-positive probe sets”
“Cancer-negative probe sets”
* Each probe set represents
one mRNA transcript
Correlation Matrices
! Probe set correlation
! Sample correlation
x =
x =
+1 0 –1
Ensemble Mean Values
! Treat each probe set (row) as a redundant, corruptedmeasurement of the same tumor/normal indicator
! Compute column averages for each sample sub-group(i.e., sum each column and divide by n)
!
zij = kiy + "ij , i =1,m, j =1, n
!
ˆ z j =1
nzij
i=1
n
"
! Random variable sums approach Gaussiandistribution by central limit theorem as n -> $
Class Prediction and EvaluationUsing Ensemble Mean Values
(data from Alon, Notterman, 1999)
! Single-featureprediction
– Cancer-positivegenes: 2 errors
– Cancer-negativegenes: 1 error
! Two-featureprediction (cross-plot): 1 error(mislabeling?)
0
20
40
60
80
100
120
140
160
180
200
0 20 40 60 80 100 120 140 160 180
Average Value for Genes Up-Regulated in Tumors
Avera
ge V
alu
e f
or
Gen
es D
ow
n-R
eg
ula
ted
in
Tu
mo
rs
Tumors
Normals
Clustering of Sample Averages for PrimaryColon Cancer vs. Normal Mucosa[NCI Program Project Grant (PPG), 144-sample set]
0.00
100.00
200.00
300.00
400.00
500.00
600.00
700.00
800.00
0.00 200.00 400.00 600.00 800.00 1000.00 1200.00 1400.00 1600.00 1800.00 2000.00
Average Value of Up-Regulated Genes (Primary Colon Cancer)
Av
era
ge
Va
lue
of
Do
wn
-Re
gu
late
d G
en
es
(P
rim
ary
Co
lon
Ca
nc
er)
Primary Colon Cancer
Normal Mucosa
# Probe sets
Up: 1067
Down: 290
Constant: 19
! 144 samples,3,437 probesets analyzed
! 47 primarycolon cancer
! 22 normalmucosa
! AffymetrixHGU-133AGeneChip
! Alltranscripts“Present” inall samples
Clustering of Sample Averages forPrimary Polyp vs. Normal Mucosa
(PPG, 144-sample set)
0.00
100.00
200.00
300.00
400.00
500.00
600.00
700.00
800.00
0.00 500.00 1000.00 1500.00 2000.00 2500.00
Average Value of Up-Regulated Genes (Primary Polyp)
Av
era
ge
Va
lue
of
Do
wn
-Re
gu
late
d G
en
es
(P
rim
ary
Po
lyp
)
Primary Polyp
Normal Mucosa# Probe sets
Up: 1014
Down: 219
Constant: 40
! 21 primarypolyp
! 22 normalmucosa
Clustering of Sample Averages forPrimary Polyp vs. Primary Colon Cancer
(PPG, 144-sample set)
0.00
100.00
200.00
300.00
400.00
500.00
600.00
700.00
800.00
900.00
1000.00
0.00 500.00 1000.00 1500.00 2000.00 2500.00 3000.00 3500.00
Average Value of Up-Regulated Genes (Primary Polyp)
Avera
ge V
alu
e o
f D
ow
n-R
eg
ula
ted
Gen
es (
Pri
mary
Po
lyp
)
Primary Polyp
Primary Colon Cancer
# Probe sets
Up: 273
Down: 126
Constant: 172
! 21 primarypolyp
! 47 primarycolon cancer
Identified Cancer Links inSelected Transcript Training Sets
(PPG, 144 samples)
! Bioalma linkages via GeneCards®
(Weizmann Institute of Science)
Number of Identified Cancer Genes in SetOver-Expressed (10) Under-Expressed (10) Both (20)
Primary Colon Cancer
vs. Normal Mucosa 7 (70%) 5 (50%) 12 (60%)
Primary Polyp vs.
Norma Mucosa 7 (70%) 5 (50%) 12 (60%)
Primary Colon Cancer
vs. Primary Polyp 7 (70%) 6 (60%) 13 (65%)
Total 21 (70%) 16 (53%) 37 (62%)
Small, Round Blue-Cell TumorClassification Example
http://www.medstudents.com.br/patoclin/courses/
sarcomas/dsrct/dsrct.htm
! Childhood cancers, including
– Ewing’s sarcoma (EWS)
– Burkitt’s Lymphoma (BL)
– Neuroblastoma (NB)
– Rhabdomyosarcoma (RMS)
! cDNA microarray analysis presented byJ. Khan, et al., 2001
» 96 transcripts chosen from 2,308 probes fortraining
» 63 EWS, BL, NB, and RMS training samples
» Classification with principal componentanalysis and a linear neural network
– Source of data for my analysis
Desmoplastic small,
round blue-cell tumors
Overview of PresentSRBCT Analysis
! Transcript selection by t test– 96 transcripts, 12 highest and lowest t values for each class
– Overlap with Khan set: 32 transcripts
! Ensemble averaging of highest and lowest t valuesfor each class
! Cross-plot of ensemble averages
! Classification by sigmoidal neural network
! Validation of neural network– Novel set simulation
– Leave-one-out simulation
Ranking by EWS t Values(Top and Bottom 12)
! 24 transcripts selected from 12 highest and lowest t values for EWS vs.remainder
! Repeated for BL vs. remainder, NB vs. remainder, and RMS vs. remainder
Sort by EWS t Value EWS BL NB RMSImage ID Transcript Description t Value t Value t Value t Value
770394 Fc fragment of IgG, receptor, transporter, alpha 12.04 -6.67 -6.17 -4.791435862 antigen identified by monoclonal antibodies 12E7, F21 and O13 9.09 -6.75 -5.01 -4.03377461 caveolin 1, caveolae protein, 22kD 8.82 -5.97 -4.91 -4.78814260 follicular lymphoma variant translocation 1 8.17 -4.31 -4.70 -5.48491565 Cbp/p300-interacting transactivator, with Glu/Asp-rich carboxy-terminal domain 7.60 -5.82 -2.62 -3.68841641 cyclin D1 (PRAD1: parathyroid adenomatosis 1) 6.84 -9.93 0.56 -4.30
1471841 ATPase, Na+/K+ transporting, alpha 1 polypeptide 6.65 -3.56 -2.72 -4.69866702 protein tyrosine phosphatase, non-receptor type 13 6.54 -4.99 -4.07 -4.84713922 glutathione S-transferase M1 6.17 -5.61 -5.16 -1.97308497 KIAA0467 protein 5.99 -6.69 -6.63 -1.11770868 NGFI-A binding protein 2 (ERG1 binding protein 2) 5.93 -6.74 -3.88 -1.21345232 lymphotoxin alpha (TNF superfamily, member 1) 5.61 -8.05 -2.49 -1.19
786084 chromobox homolog 1 (Drosophila HP1 beta) -5.04 -1.05 9.65 -0.62796258 sarcoglycan, alpha (50kD dystrophin-associated glycoprotein) -5.04 -3.31 -3.86 6.83431397 -5.04 2.64 2.19 0.64825411 N-acetylglucosamine receptor 1 (thyroid) -5.06 -1.45 5.79 0.76859359 quinone oxidoreductase homolog -5.23 -7.27 0.78 5.40
75254 cysteine and glycine-rich protein 2 (LIM domain only, smooth muscle) -5.30 -4.11 2.20 3.68448386 -5.38 -0.42 3.76 0.14
68950 cyclin E1 -5.80 0.03 -1.58 5.10774502 protein tyrosine phosphatase, non-receptor type 12 -5.80 -5.56 3.76 3.66842820 inducible poly(A)-binding protein -6.14 0.60 0.66 3.80214572 ESTs -6.39 -0.08 -0.22 4.56295985 ESTs -9.26 -0.13 3.24 2.95
Strong t-value distinctions
between classes
Comparison of PresentSRBCT Set with Khan Top 10
EWS BL NB RMS Most KhanStudent t Student t Student t Student t Significant Gene
Image ID Gene Description Value Value Value Value t Value Class
296448
insulin-like growth factor 2
(somatomedin A) -4.789 -5.226 -1.185 5.998 RMS RMS
207274
Human DNA for insulin-like
growth factor II (IGF-2); exon
7 and additional ORF -4.377 -5.424 -1.639 5.708 RMS RMS
841641
cyclin D1 (PRAD1:
parathyroid adenomatosis 1) 6.841 -9.932 0.565 -4.300 BL (-) EWS/NB365826 growth arrest-specific 1 3.551 -8.438 -6.995 1.583 BL (-) EWS/RMS486787 calponin 3, acidic -4.335 -6.354 2.446 2.605 BL (-) RMS/NB
770394
Fc fragment of IgG, receptor,
transporter, alpha 12.037 -6.673 -6.173 -4.792 EWS EWS244618 ESTs -4.174 -4.822 -3.484 5.986 RMS RMS
233721
insulin-like growth factor
binding protein 2 (36kD) 0.058 -7.487 -1.599 2.184 BL (-) Not BL43733 glycogenin 2 4.715 -4.576 -3.834 -3.524 EWS EWS
295985 ESTs -9.260 -0.133 3.237 2.948 EWS (-) Not EWS
! Red: both sets
! Black: Khan set only
Khan selection
distinguished by
strong, consistent
relation to t values
Clustering of SRBCTEnsemble Averages
EWS set BL set
NB set RMS set
Artificial NeuralNetworks
Neural NetworkTraining Set
!
Sample1 Sample 2 Sample 3 ... Sample 62 Sample 63
EWS EWS EWS ... RMS RMS
EWS(+)Average EWS(+)Average EWS(+)Average ... EWS(+)Average EWS(+)Average
EWS(")Average EWS(")Average EWS(")Average ... EWS(")Average EWS(")Average
BL(+)Average BL(+)Average BL(+)Average ... BL(+)Average BL(+)Average
BL(")Average BL(")Average BL(")Average ... BL(")Average BL(")Average
NB(+)Average NB(+)Average NB(+)Average ... NB(+)Average NB(+)Average
NB(")Average NB(")Average NB(")Average ... NB(")Average NB(")Average
RMS(+)Average RMS(+)Average RMS(+)Average ... RMS(+)Average RMS(+)Average
RMS(")Average RMS(")Average RMS(")Average ... RMS(")Average RMS(")Average
#
$
% % % % % % % % % % % % % % %
&
'
( ( ( ( ( ( ( ( ( ( ( ( ( ( (
! Each input row is an ensemble average for a transcript set,normalized in (–1,+1)
Identifier
Target Output
Transcript
Training
Set
SRBCT Neural NetworkTraining and Application
! Neural network– 8 ensemble-average inputs
– various # of sigmoidalneurons
– 4 linear neurons
– 4 outputs
! Training accuracy– Train on all 63 samples
– Test on all 63 samples
– 100% accuracy
Leave-One-Out Validation ofSRBCT Neural Network
! Remove a single sample
! Train on remaining samples (125 times)
! Evaluate class of the removed sample
! Repeat for each of 63 samples
! 6 sigmoids: 99.96% accuracy (3 errorsin 7,875 trials)
! 12 sigmoids: 99.99% accuracy (1 errorin 7,875 trials)
Novel-Set Validation ofSRBCT Neural Network
! Network always chooses one of four classes(i.e., “unknown” is not an option)
! Test on 25 novel samples (400 times each)– 5 EWS
– 5 BL
– 5 NB
– 5 RMS
– 5 samples of unknown class
! 99.96% accuracy on first 20 novel samples(3 errors in 8,000 trials)
! 0% accuracy on unknown classes
Observations of SRBCT Classificationusing Ensemble Averages
! t test identified strong features for classification in this dataset
! Neural networks easily classified the four data types
! Caveat: Small, round blue-cell tumors occur in differenttissue types
– Ewing’s sarcoma: Bone tissue
– Burkitt’s Lymphoma: Lymph nodes
– Neuroblastoma: Nerve tissue
– Rhabdomyosarcoma: Soft tissue
! Genetic variation may be linked to tissuedifferences as well as tumor genetics
260-Sample PPG ColonCancer Data Set
! Primary colon cancer (130)
! Primary polyp (31)
! Normal mucosa (24)
! Liver metastasis (33)
! Normal liver (12)
! Lung metastasis (20)
! Normal lung (5)
! Microadenoma (4)
! High grade dysplasia (1)
! Affymetrix HGU-133A GeneChip
! 22,283 probe sets
! 17,455 probe setsverified
! Transcripts are“Present” in 80%of samples
! 7,343 probe setsanalyzed
9-Class Colon CancerNeural Network
! Neural network– 18 ensemble-average inputs
– various # of sigmoidal neurons
– 9 linear neurons
– 9 outputs
– 12-48 transcripts in eachensemble average
Training and Validation Accuracy of9-Class Colon Cancer Neural Network
! 9-class PPG set is more heterogeneous thanthe 4-class SRBCT set
! Generalization is better with fewer neurons
Transcripts in Training Set Leave-One-OutEnsemble Neurons Trials Correct % Correct Correct % Correct
12 4 260 176 68% 151 58%12 9 260 188 72% 156 60%48 4 260 179 69% 173 67%48 9 260 194 75% 170 65%48 18 260 234 90% 172 66%
Classification of Primary Tumors,Polyps, and Normal Mucosa
! Primary colon cancer (130), primary polyp (31),normal mucosa (24)
Transcripts in Training Set Leave-One-OutEnsemble Neurons Trials Correct % Correct Correct % Correct
6 3 370 323 87% 310 84%6 4 370 328 89% 314 85%6 6 370 328 89% 312 84%
12 1 370 296 80% 298 81%12 2 370 336 91% 320 86%12 3 370 336 91% 319 86%12 4 370 333 90% 319 86%12 6 370 334 90% 323 87%24 3 370 311 84% 311 84%24 6 370 330 89% 320 86%24 6 370 334 90% 318 86%
Primary Tumor vs. Polyp
! Positive vs. negative meta-transcripts
! Samplecorrelation
! Significant correlation between tumors and polyps
Classification of Primary Tumors,Metastases, and Normal Tissue
! Primary colon cancer (130), normal mucosa (24), liver metastasis(33), normal liver (12), lung metastasis (20), normal lung (5)
! Elimination of organ-specific effects on tumor comparisons
Transcripts in Training Set Leave-One-OutEnsemble Neurons Trials Correct % Correct Correct % Correct
12 1 448 260 58% 275 61%12 3 448 364 81% 343 77%12 6 448 427 95% 367 82%12 9 448 427 95% 379 85%12 12 448 433 97% 377 84%
Classification of Primary ColonCancer According to Gender
! Male (61), female (44)
Transcripts in Training Set Leave-One-OutEnsemble Neurons Trials Correct % Correct Correct % Correct
6 4 210 168 80% 164 78%12 2 210 178 85% 177 84%12 4 210 178 85% 176 84%12 8 210 178 85% 176 84%12 9 210 179 85% 176 84%24 2 210 184 88% 174 83%24 4 210 184 88% 174 83%
Classification of Primary ColonCancer According to Age at Diagnosis
! 50 annotated samples
! Implication that genetic distinction occurs in the 60s
Transcripts in Training Set Leave-One-OutEnsemble Neurons Trials Correct % Correct Correct % CorrectDiscriminant = 50 yr
24 2 200 184 92% 180 90%24 4 400 368 92% 354 89%48 2 400 366 92% 358 90%
Discriminant = 60 yr24 2 400 372 93% 362 91%48 2 400 360 90% 360 90%
Discriminant = 70 yr24 2 400 312 78% 305 76%24 2 400 336 84% 312 78%
Conclusions
! Simple tests for transcript selection– Mean (t test)
– Covariance (correlation matrices)
! Multiclass classification using neural networks
! Heterogeneity of colon cancer data
! Confounding effect of tissue-specific transcripts
! Supervised classification according to– Primary tumor, polyp, normal mucosa
– Primary tumor, metastasis, normal tissue
– Patient gender and age
Future Work on PPGColon Cancer Data
! Classification by outcome
! Analysis of complete data set
! Orthogonal statistics of the data set
! Alternative targets and data sets forclassification of data, e.g.,– Clinical data
– Pathology data
– Mutated DNA drawn from same sample set
! Identification of functional pathways
Acknowledgments
! Gunter Schemmann, Princeton University! Daniel Notterman, Princeton University! Francis Barany, Cornell Weill Medical School! Phillip Paty, Memorial Sloan Kettering Cancer Center! Eytan Domany, Weizmann Institute of Science! Jürg Ott, Rockefeller University! Arnold Levine, Institute for Advanced Study! Hao Liu, Bristol-Meyers-Squibb (formerly with UMDNJ)! Douglas Welsh, Princeton University! … and members of the Simons Center for Systems
Biology, Institute for Advanced Study