1
Robust diagnosis of DLBCL from gene expression data from different laboratories
DIMACS - RUTCOR Workshop on
Boolean and Pseudo-Boolean Functions
in Memory of Peter L. Hammer
January 19-22, 2009
2
Peter L Hammer, Sorin Alexe, David E Axelrod
RUTGERS UNIV
Gustavo Stolovitzky
IBM TJ WATSON RESEARCH
Gyan Bhanot, Arnold J Levine
INSTITUTE FOR ADVANCED STUDY PRINCETON
David Weissmann
CANCER INSTITUTE OF NEW JERSEY
3
Overview
Motivation
Pattern-based ensemble classifiers
Case study: compare data from two labs for DLBCL vs FL diagnosis
Shipp et al. (2002) Nature Med.; 8(1), 68-74. (Whitehead Lab)
Stolovitzky G. (2005) In Deisboeck et al Complex Systems Science in BioMedicine (in press) (preprint: http://www.wkap.nl/prod/a/Stolovitzky.pdf). (Dalla-Favera Lab)
Alexe, Alexe, Axelrod, Hammer, Weissmann (2005) Artificial Intelligence in Medicine
Bhanot, Alexe, Stolovitzky, Levine (2005) Genome Informatics
4
Non-Hodgkin lymphomas
FL: low grade non-Hodgkin lymphoma / no cure if advanced stage
second most frequent subtype of nodal lymphoid malignancies
Incidence has risen from 2–3 to more than 5–7 per 100,000/year (’50–’00)
t(14;18) translocation: over-expression of anti-apoptotic bcl2
25-60% FL cases evolve to DLBCL
DLBCL: high grade non-Hodgkin lymphoma / high variability in response to treatment
most frequent subtype of NHL
< 2 years survival if untreated
Biomarkers: FL transformation to DLBCL
• p53/MDM2 (Moller et al., 1999)
• p16 (Pinyol, 1998)
• p38MAPK (Elenitoba-Johnson et al., 2003)
• c-myc (Lossos et al., 2002)
5
Gene arrays
Gene arrays are a way to study the variation of
mRNA levels between different types of cells.
This allows diagnosis and inference of pathways
that cause disease / early stage diagnosis
Identify molecular profiles of disease –
personalized medicine
6
Lymphoma datasets
Data:
WI (Shipp et al., 2002) Affy HuGeneFL
CU (Dalla-Favera Lab, Stolovitzky, 2005) Affy Hu95Av2
Samples:
WI: 58 DLBCL & 19 FL
CU: 14 DLBCL & 7 FL
Genes:
WI: 6817
CU: 12581
7
Diagnosis problem
Input
Training (biomedical) data: 2 classes: FL and DLBCL
m samples described by N >> m features
Output
Collection of robust biomarkers, models
Robust, accurate classifier, tested on out-of-sample data
8
[Figure: classification pipeline. Labels in the diagram:]
Input data
Data preprocessing: creating training and test data; normalization; noise estimation
Robust feature selection: filtering; support set selection
Biology-based feature selection: filtering; support set selection
Raw data (training)
Pattern data (training)
Principal Components
INDIVIDUAL CLASSIFIERS: Artificial Neural Networks; Support Vector Machines; Weighted Voting System (LAD); k-Nearest Neighbors; Decision Trees (C4.5); Logistic Regression
INTERMEDIATE CLASSIFIERS
Calibration
Classifier (Weighted Voting)
META-CLASSIFIER
Validation (test data)
9
Patterns (Logical Analysis of Data, Hammer 1988)
Positive Patterns
Negative Patterns
Model
- Exhaustive collections of patterns
- Pattern space
- Classification / attribute analysis / new class identification
10
Data Preprocessing
50% P calls, UL = 16000, LL = 20
2/1 stratify WI data to train/test; CU data: test
Normalize data to median 1000 per array
Generate 500 data sets using noise + k-fold stratified sampling + jackknife
Find genes with high correlation to phenotype using t-test or SNR
Keep genes that are in > 90% of datasets
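The preprocessing steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: UL/LL are read as upper and lower clipping limits, SNR is taken to be the usual signal-to-noise statistic (difference of class means over the sum of class standard deviations), and the resampling scheme (multiplicative Gaussian noise, one sample left out per class, `top_k` genes kept per resample) is a hypothetical stand-in for the "noise + k-fold stratified sampling + jackknife" procedure. The P-call filter is not shown because it needs the detection calls.

```python
import numpy as np

def preprocess(expr, lower=20.0, upper=16000.0, target_median=1000.0):
    """Clip a genes x arrays matrix to [lower, upper], then rescale each array to a common median."""
    clipped = np.clip(expr, lower, upper)
    medians = np.median(clipped, axis=0, keepdims=True)
    return clipped * (target_median / medians)

def signal_to_noise(expr, is_dlbcl):
    """Per-gene SNR: (mean_DLBCL - mean_FL) / (sd_DLBCL + sd_FL)."""
    pos, neg = expr[:, is_dlbcl], expr[:, ~is_dlbcl]
    spread = pos.std(axis=1, ddof=1) + neg.std(axis=1, ddof=1)
    return (pos.mean(axis=1) - neg.mean(axis=1)) / spread

def stable_genes(expr, is_dlbcl, n_resamples=500, top_k=50,
                 keep_frac=0.9, noise_sd=0.05, seed=0):
    """Indices of genes ranked in the top_k by |SNR| in more than keep_frac of resamples.

    Assumed resampling: drop one random sample per class and perturb the data
    with multiplicative noise; the slide does not specify these details.
    """
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(is_dlbcl)
    neg_idx = np.flatnonzero(~is_dlbcl)
    counts = np.zeros(expr.shape[0], dtype=int)
    for _ in range(n_resamples):
        keep = np.concatenate([
            rng.choice(pos_idx, pos_idx.size - 1, replace=False),
            rng.choice(neg_idx, neg_idx.size - 1, replace=False),
        ])
        noise = rng.normal(1.0, noise_sd, size=(expr.shape[0], keep.size))
        snr = signal_to_noise(expr[:, keep] * noise, is_dlbcl[keep])
        counts[np.argsort(-np.abs(snr))[:top_k]] += 1
    return np.flatnonzero(counts > keep_frac * n_resamples)
```

A t-test statistic could replace `signal_to_noise` without changing the rest of the sketch.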
11
Choosing support sets
Create quality patterns using small subsets of genes, validate using weighted voting with 10-fold cross-validation
Sort genes by their appearance in good patterns
Select top genes to cover each sample by at least 10 patterns
Alexe, Alexe, Hammer, Vizvari (2005)
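One way to read the last two steps is as a greedy cover: rank genes by how often they appear in good patterns, then add genes in that order until every sample is covered often enough. The sketch below is a simplified stand-in, not the procedure of Alexe, Alexe, Hammer, Vizvari (2005); the pattern representation (a set of genes plus the set of samples the pattern covers) and the function name `select_support_set` are hypothetical.

```python
from collections import Counter

def select_support_set(patterns, samples, min_cover=10):
    """Greedy support-set choice.

    patterns: list of (genes, covered) pairs, both sets (assumed representation).
    Returns genes in selection order, or None if the coverage target is unreachable.
    """
    # Rank genes by how many patterns they participate in.
    freq = Counter(g for genes, _ in patterns for g in genes)
    selected = []
    for gene, _ in freq.most_common():
        selected.append(gene)
        chosen = set(selected)
        # Only patterns built entirely from selected genes remain usable.
        cover = Counter(
            s for genes, covered in patterns if genes <= chosen for s in covered
        )
        if all(cover[s] >= min_cover for s in samples):
            return selected
    return None
```

With `min_cover=10` this mirrors the "at least 10 patterns per sample" target on the slide.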
12
The 30 genes that best distinguish FL from DLBCL
Columns (* marks a flagged column): Gene symbol | Shipp et al. | Genes@Work | t-test | p53 regulated | Biological function
SEPP1 * * * oxidative stress
TXNIP * * metastases suppressor
DNASE1L3 * * apoptosis
CDH11 * * * cell adhesion
LUCA15 * apoptosis
GPR18 * * * signaling pathway
CLU * * * apoptosis
LY9 * * cell adhesion
RHOH * * T-cell differentiation
ELF2 transcription
CCNG2 * cell cycle
CR2 complement activation
CDKN2D * cell cycle
PPP2R5C * signal transduction
G18 cell growth
LY86 * apoptosis
ARPC1B cell motility
MCM7 * * * * cell cycle
BCL2A1 * * * apoptosis
IMPDH2 * * GMP biosynthesis
RRP45 * immune response
STAT1 NF-kappaB cascade
DLG7 * * * cell-cell signaling
SLC1A5 * * transport
TUBB2 * * microtubule movement
PSMA6 protein catabolism
PSMC1 * * * spinocerebellar ataxia
LGALS3 * * * sugar binding
CLTA * * transport
PAGA * * cell proliferation
13
# | Gene index | Gene description | Accession # | Pearson correlation of genes in support set with DLBCL vs FL outcome | Frequency of participation in the definition of combinatorial biomarkers | Functional gene group # (*)
1 | 506 | DNA REPLICATION LICENSING FACTOR CDC47 HOMOLOG | D55716_at | 0.45 | 42.08 | 1
2 | 1612 | (clone GPCR W) G protein-linked receptor gene (GPCR) gene, 5' end of cds | L42324_at | -0.49 | 30.00 | 2
3 | 972 | Rad2 | HG4074-HT4344_at | 0.45 | 23.33 | 1
4 | 2137 | HIGH AFFINITY IMMUNOGLOBULIN GAMMA FC RECEPTOR I "A FORM" PRECURSOR | M63835_at | 0.43 | 23.33 | 2
5 | 605 | 5-aminoimidazole-4-carboxamide-1-beta-D-ribonucleotide transformylase/inosinicase | D82348_at | 0.53 | 22.50 | -
6 | 6815 | Tubulin, Beta 2 | HG1980-HT2023_at | 0.50 | 8.33 | 4
7 | 7102 | HLA-A MHC class I protein HLA-A (HLA-A28,-B40, -Cw3) | M94880_f_at | -0.43 | 8.33 | 2
8 | 2988 | RCH1 RAG (recombination activating gene) cohort 1 | U28386_at | 0.48 | 7.08 | 1
9 | 4028 | LDHA Lactate dehydrogenase A | X02152_at | 0.62 | 6.25 | 6
10 | 4292 | PKM2 Pyruvate kinase, muscle | X56494_at | 0.55 | 5.00 | 6
11 | 4485 | IDH2 Isocitrate dehydrogenase 2 (NADP+), mitochondrial | X69433_at | 0.47 | 5.00 | 6
12 | 1430 | Protein tyrosine phosphatase (CIP2) mRNA | L25876_at | 0.44 | 4.17 | 5
13 | 1988 | INSULIN-LIKE GROWTH FACTOR BINDING PROTEIN 3 PRECURSOR | M35878_at | -0.28 | 4.17 | 2
14 | 582 | KIAA0175 gene | D79997_at | 0.45 | 2.08 | -
15 | 1092 | GAMMA-INTERFERON-INDUCIBLE PROTEIN IP-30 PRECURSOR | J03909_at | 0.53 | 2.08 | 3
16 | 2929 | Mitochondrial serine hydroxymethyltransferase gene, nuclear encoded mitochondrion protein | U23143_at | 0.42 | 2.08 | -
17 | 3005 | Bcl-2 related (Bfl-1) mRNA | U29680_at | 0.44 | 2.08 | 5
18 | 4010 | PGK1 Phosphoglycerate kinase 1 | V00572_at | 0.36 | 2.08 | 6
19 | 2789 | CENPA Centromere protein A (17kD) | U14518_at | 0.51 | 0.00 | 5
20 | 6703 | Dents Disease candidate gene | X81836_s_at | 0.37 | 0.00 | -
Table 1. Selected non-minimal support set of 20 genes for distinguishing DLBCL from FL cases. (*) 1: DNA replication, recombination and repair, 2: cell surface proteins and receptors, 3: protein synthesis and degradation, 4: structural proteins, 5: cell cycle and apoptosis, 6: metabolism, -: other.
Genes identified by LAD (AIIM 2005) to distinguish DLBCL from FL
14
Examples of FL and DLBCL patterns
Gene symbols used in the patterns: GPR18, CLU, DLG7, MCM7
Pattern | Conditions on two of the genes | Prevalence (%), training set: Pos | Neg | Prevalence (%), test set: Pos | Neg
P1 | > -1.13, > -0.62 | 97 | 0 | 91 | 23
P2 | ≤ 0.91, > -0.77 | 95 | 0 | 79 | 31
N1 | > -0.26, ≤ -0.55 | 0 | 100 | 3 | 54
WI training data:
Each DLBCL case satisfies at least one of the patterns P1 and P2
Each FL case satisfies the pattern N1 (and none of the patterns P1 and P2)
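A pattern of this kind is a conjunction of cut-point conditions on a few genes, and its prevalence is the share of cases in a class that satisfy all of them. A minimal sketch follows. The cut points are taken from P1 and N1, but the gene keys `g1`, `g2`, `g3` are placeholders because the gene-to-condition assignment is not recoverable here, and the vote-difference rule in `classify` is an assumed simplification of LAD weighted voting.

```python
import operator

OPS = {">": operator.gt, "<=": operator.le}

def covers(pattern, sample):
    """True if the sample satisfies every (gene, op, cut point) condition."""
    return all(OPS[op](sample[gene], cut) for gene, op, cut in pattern)

def prevalence(pattern, samples):
    """Percentage of the given samples covered by the pattern."""
    return 100.0 * sum(covers(pattern, s) for s in samples) / len(samples)

def classify(sample, pos_patterns, neg_patterns):
    """Compare the fraction of positive (DLBCL) and negative (FL) patterns triggered."""
    pos = sum(covers(p, sample) for p in pos_patterns) / len(pos_patterns)
    neg = sum(covers(p, sample) for p in neg_patterns) / len(neg_patterns)
    if pos > neg:
        return "DLBCL"
    if neg > pos:
        return "FL"
    return "unclassified"

# Cut points from the table; gene names are hypothetical placeholders.
P1 = [("g1", ">", -1.13), ("g2", ">", -0.62)]
N1 = [("g1", ">", -0.26), ("g3", "<=", -0.55)]
```

Applying `prevalence` separately to the DLBCL and FL cases of a data set gives the Pos and Neg columns of the table.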
15
Pattern data
[Figure: pattern data for WI training data, WI test data and CU test data; labels: positive patterns, negative patterns, DLBCL, FL]
16
Meta-classifier performance
Classifier | Weight | Training: Sensitivity (%) | Specificity (%) | Error rate (%) | Test: Sensitivity (%) | Specificity (%) | Error rate (%)
Trained on raw data
ANN | 0.08 | 94.74 | 92.31 | 5.88 | 82.35 | 84.62 | 17.02
SVM | 0.08 | 97.37 | 92.31 | 3.92 | 97.06 | 76.92 | 8.51
kNN | 0.09 | 97.37 | 100.00 | 1.96 | 91.18 | 84.62 | 10.64
WV | 0.07 | 92.11 | 92.31 | 7.84 | 94.12 | 76.92 | 10.64
C4.5 | 0.06 | 94.74 | 84.62 | 7.84 | 94.12 | 69.23 | 12.77
LR | 0.07 | 97.37 | 84.62 | 5.88 | 94.12 | 69.23 | 12.77
Trained on pattern data
ANN | 0.10 | 100.00 | 100.00 | 0.00 | 97.06 | 76.92 | 8.51
SVM | 0.10 | 100.00 | 100.00 | 0.00 | 97.06 | 76.92 | 8.51
kNN | 0.10 | 100.00 | 100.00 | 0.00 | 100.00 | 69.23 | 8.51
WV | 0.10 | 100.00 | 100.00 | 0.00 | 97.06 | 76.92 | 8.51
C4.5 | 0.10 | 100.00 | 100.00 | 0.00 | 91.18 | 76.92 | 12.77
LR | 0.05 | 100.00 | 76.92 | 5.88 | 100.00 | 61.54 | 10.64
Meta-classifier | | 100.00 | 100.00 | 0.00 | 100.00 | 76.92 | 6.38
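The weights in the table sum to 1, which suggests a weighted vote over the twelve individual classifiers. A minimal sketch under that assumption: the weights are copied from the table, while the vote encoding (+1 for DLBCL, -1 for FL) and the sign rule in `meta_classify` are assumptions, since the slides do not spell out how the meta-classifier combines its inputs.

```python
# Weights copied from the performance table; keys are (training data, classifier).
WEIGHTS = {
    ("raw", "ANN"): 0.08, ("raw", "SVM"): 0.08, ("raw", "kNN"): 0.09,
    ("raw", "WV"): 0.07, ("raw", "C4.5"): 0.06, ("raw", "LR"): 0.07,
    ("pattern", "ANN"): 0.10, ("pattern", "SVM"): 0.10, ("pattern", "kNN"): 0.10,
    ("pattern", "WV"): 0.10, ("pattern", "C4.5"): 0.10, ("pattern", "LR"): 0.05,
}

def meta_classify(votes, weights=WEIGHTS):
    """Weighted vote: votes maps a classifier key to +1 (DLBCL) or -1 (FL).

    Assumed decision rule: the sign of the weighted sum of votes.
    """
    score = sum(weights[key] * vote for key, vote in votes.items())
    if score > 0:
        return "DLBCL"
    if score < 0:
        return "FL"
    return "unclassified"
```

Under these weights the classifiers trained on pattern data carry 0.55 of the total vote, so they outvote the raw-data classifiers when the two groups disagree unanimously.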
17
Error distribution: raw and pattern data
[Figure: error distribution (axis 0 to 50) on WI test data and CU test data for classifiers trained on raw data, classifiers trained on pattern data, and the meta-classifier]
19
p53 related genes identified by filtering procedure
Gene symbol
CCNB1 EPRS PMAIP1 E2F3
MCM7 GSK3B ACAA2 MDM4
BRCA1 COL6A1 E2F5* AMPD2
BCL2A1 HRAS POLA RBBP4
PPP2R4 SERPING1 HMGB2 CCNG2*
EIF2S2 CCNA2 PSMB5 HARS
COMT CCT6A ACTA2 CASP6
IARS PRKDC INSR RPS6KA1
MPI CAD SNRPA GRP58
ALAS1 TNFRSF1B G1P2 TP53
MRPL3 ZNF184* IMPDH1 SMAD2
NCF2 ALDOA MAP2K2 ATP5C1
AARS KARS TOP2A TIMP3
KIF11 MAD2L1 CXCL1 THBS2
CDK4 GOT1 BAG1 MYCBP
ATP1B1 CDC25B TOP1 DTR
CDC20 PSMA1 MAP4 TIMP3
PRIM1 KIAA0101 FDFT1 CBS
CDC2 PCNA MTA1 CDKN2D*
TOP2A TCF3 CDKN1A RELA
CDK2 CYC1 HLAE*
MYC UPP1 PLK1
CCNE1 TOPBP1 CDK7
FL to DLBCL progression
21
Examples of p53 responsive genes patterns
Gene symbols used in the patterns: MCM7, CCNB1, BCL2A1, CCNE1, KIAA0101, CDC2, CBS, E2F5
Pattern | Conditions | Prevalence (%), training set: Pos | Neg | Prevalence (%), test set: Pos | Neg
P1 | > -0.66, > -0.89 | 93 | 11 | 86 | 29
P2 | > -0.66, > -0.78 | 90 | 11 | 71 | 29
P3 | > -0.8, > -0.33 | 69 | 11 | 64 | 14
N1 | ≤ -0.66 | 3 | 74 | 14 | 71
N2 | ≤ -0.56, ≤ -0.18 | 3 | 68 | 21 | 57
N3 | ≤ -0.11, > 0.11 | 3 | 68 | 7 | 71
WI data:
Each DLBCL case satisfies one of the patterns P1, P2, P3
Each FL case satisfies one of the patterns N1, N2, N3
22
p53 combinatorial biomarker
77% FL & 21% DLBCL cases (3.7 fold): at most one gene over-expressed
79% DLBCL & 23% FL cases (3.4 fold): at least two genes over-expressed
[Figure: bar chart of % cases (DLBCL, FL) by # of over-expressed genes in DLBCL vs. FL (p53, PLK1, CDK2): <= 1 vs >= 2]
Each individual gene: over-expressed in about 40-70% DLBCL & 20-40% FL (specificity 50-60%, sensitivity 60-70%)
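The combinatorial biomarker is a counting rule over three genes, and the quoted fold values are ratios of the class percentages. A minimal sketch: the rule "at least two of (p53, PLK1, CDK2) over-expressed" comes from the slide, whereas the per-gene over-expression cutoffs are not given, so the `cutoffs` argument is a hypothetical input.

```python
GENES = ("p53", "PLK1", "CDK2")

def n_overexpressed(sample, cutoffs):
    """Number of biomarker genes whose expression exceeds its (assumed) cutoff."""
    return sum(sample[g] > cutoffs[g] for g in GENES)

def biomarker_call(sample, cutoffs, min_genes=2):
    """DLBCL if at least min_genes of the three genes are over-expressed, else FL."""
    return "DLBCL" if n_overexpressed(sample, cutoffs) >= min_genes else "FL"

def fold(pct_a, pct_b):
    """Ratio of two class percentages, rounded to one decimal."""
    return round(pct_a / pct_b, 1)

# Fold values quoted on the slide, recomputed from the percentages.
fold_at_most_one = fold(77, 21)    # FL vs DLBCL with at most one gene over-expressed
fold_at_least_two = fold(79, 23)   # DLBCL vs FL with at least two genes over-expressed
```

Counting over three genes is what lifts the separation above that of any single gene, each of which is over-expressed in only about 40-70% of DLBCL cases.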
23
What are these genes?
Plk1 (stpk13): polo-like kinase serine threonine protein kinase 13, M-phase specific
cell transformation, neoplastic; drives quiescent cells into mitosis
over-expressed in various human tumors
Takai et al., Oncogene, 2005: plk1 potential target for cancer therapy, new prognostic marker for cancer
Mito et al, Leuk Lymph, 2005: plk1 biomarker for DLBCL
Cdk2 (p33): cyclin-dependent kinase: G2/M transition of mitotic cell cycle, interacts with cyclins A, B3, D, E
P53 tumor suppressor gene (Levine 1982)
24
Conclusions
Pattern-based meta-classifier is robust against noise
Good prediction of FL DLBCL
Biology based analysis also possible
Yields useful biomarker
Should study biologically motivated sets of genes and build pathways