Source: cs229.stanford.edu/proj2016/poster/Abell-UsingGeneExpressionData…

Using Gene Expression Data to Predict Clinical Information in Seven Human Cancers

Nathan Abell

Dataset Overview

References and Acknowledgements

Future Directions

Department of Genetics

Stanford University School of Medicine

In this project, it quickly became clear that very low-dimensional sets of genes, forming coherent signatures, could represent the disease subtype in all studied tissues. Several other phenotypes, such as progesterone receptor status in breast cancer, were also easily predictable. Quantitative outcomes such as survival time or age of disease onset were much harder: predictions were rarely accurate to within years of their target. Variable reduction was clearly the crucial step; once it was done, many classification and regression algorithms performed similarly well (or poorly).

To proceed further, I would start by incorporating information about the selected genes, to see whether they are shared across tissues or private to a single tissue. I would also add more tissues, attempt to incorporate matched normal tissue, and include additional data types such as copy number variation.
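The shared-versus-private comparison could be sketched as follows. This is only an illustration: the gene names and the per-tissue selections are hypothetical placeholders, not the genes actually selected in this project, and `shared_and_private` is a helper name introduced here.

```python
from collections import Counter

def shared_and_private(selected_by_tissue):
    """Split selected genes into those chosen in several tissues and those chosen in one."""
    # Count, for every gene, the number of tissues whose selection contains it.
    counts = Counter(
        gene for genes in selected_by_tissue.values() for gene in set(genes)
    )
    shared = sorted(gene for gene, n in counts.items() if n > 1)
    private = {
        tissue: sorted(gene for gene in set(genes) if counts[gene] == 1)
        for tissue, genes in selected_by_tissue.items()
    }
    return shared, private

# Hypothetical selections (placeholder gene names, not real results).
toy_selection = {
    "Breast": ["G1", "G2", "G3"],
    "Kidney": ["G2", "G4"],
    "Lung": ["G3", "G5", "G2"],
}
shared, private = shared_and_private(toy_selection)
print(shared)   # genes selected in more than one tissue
print(private)  # genes selected in exactly one tissue, grouped by tissue
```

With real selections, the input would be the sets of genes with non-zero LASSO coefficients in each tissue.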

[1] RG Verhaak, KA Hoadley, E Purdom, et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell. 2010;17(1):98-110.
[2] KA Hoadley, C Yau, DM Wolf, et al. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell. 2014;158(4):929-44.
[3] https://cancergenome.nih.gov
[4] J Friedman, T Hastie, R Tibshirani. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software. 2010;33(1):1-22.
[5] https://CRAN.R-project.org/package=e1071
[6] WN Venables, BD Ripley. Modern Applied Statistics with S. Fourth Edition. Springer, New York. 2002. ISBN 0-387-95457-0
[7] A Liaw, M Wiener. Classification and Regression by randomForest. R News. 2002;2(3):18-22.

Background

Statistical Approach

Predicting Clinical Attributes

Clinical Outcomes

Fig. 1: Ten human tissues with the indicated number of samples in the Genomic Data Commons

Fig. 3: Distributions of clinical outcomes in breast and kidney tumors

Fig. 2: Representative Pearson correlation heatmap between lung cancers revealing the extent of gene expression correlation

Fig. 4: Visual overview of the procedure applied to each tissue separately

Fig. 6A: ROC plots for two example predictions: left, breast cancer progesterone receptor; right, bladder cancer stage (early vs late)

Feature Selection Across Tissues

Fig. 5B: LASSO regularization path, dashed lines showing estimates for optimal values of lambda by the misclassification rate

Fig. 5A: Principal component analysis before (above) and after (right) LASSO variable reduction for breast tumors colored by histology

Fig. 5C: Fitted LASSO parameters for disease sub-type in all tissues

Tissue Type   Sample Size
Bladder       414
Brain         667
Breast        1102
Kidney        891
Lung          1035
Prostate      495
Skin          103

[Fig. 4 flowchart labels: Normalization; Split 70/30; LASSO; model list (LASSO, PCR, SVR)]

[Fig. 5A panel: PCA scatter plot. x-axis: PC1 (16.9% explained var.); y-axis: PC2 (6.0% explained var.); groups: Infiltrating Ductal Carcinoma, Infiltrating Lobular Carcinoma, NA]

[Fig. 5B plot. x-axis: log(Lambda), ticks from -5 to -1; y-axis: Misclassification Error, ticks from 0.1 to 0.4; top axis: 155 140 138 131 125 109 93 84 72 61 52 43 34 29 25 21 19 12 6 1]

[Fig. 5A panel: PCA scatter plot. x-axis: PC1 (11.5% explained var.); y-axis: PC2 (4.7% explained var.); groups: Infiltrating Ductal Carcinoma, Infiltrating Lobular Carcinoma, NA]

Tissue     λ        CV Accuracy
Bladder    0.0672   0.9035
Brain      0.0091   0.9310
Breast     0.0235   0.9811
Kidney     0.0234   0.9503
Lung       0.0444   0.9628
Prostate   0.0796   0.9651
Skin       0.1090   0.8961

[Fig. 3 panels, labeled A to E in the figure; each plots counts:
histological type, kidney (histological_type): Kidney Clear Cell Renal Carcinoma; Kidney Papillary Renal Cell Carcinoma; Kidney Chromophobe
histological type, breast (histological_type): Infiltrating Carcinoma NOS; Infiltrating Ductal Carcinoma; Infiltrating Lobular Carcinoma; Medullary Carcinoma; Metaplastic Carcinoma; Mixed Histology (please specify); Mucinous Carcinoma; Other, specify; NA
stage, kidney (stage_event_pathologic_stage): Stage I; Stage II; Stage III; Stage IV
stage, breast (stage_event_pathologic_stage): Stage I; Stage IA; Stage IB; Stage II; Stage IIA; Stage IIB; Stage III; Stage IIIA; Stage IIIB; Stage IIIC; Stage IV; Stage X; NA
hemoglobin, kidney (hemoglobin_result): Elevated; Low; Normal
progesterone receptor, breast (breast_carcinoma_progesterone_receptor_status): Indeterminate; Negative; Positive; NA
age_at_initial_pathologic_diagnosis histograms for kidney and breast (panel titles read "stage: kidney" and "stage: breast")]

[Fig. 4 flowchart labels: model list (Logistic, LDA, SVM, RF); Validation]

All tissues responded similarly to the LASSO, with very robust performance for classifiers (particularly disease subtype; Fig. 5C). In a multinomial context, the LASSO generally helps separate the desired groups. However, this did not extend to quantitative responses, which failed to show the normal regularization path (Fig. 5B).
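For reference, the kind of objective that a LASSO regularization path traces out can be written in generic form. This is the standard textbook formulation, not an equation taken from the poster. The symbols are assumptions introduced here: N samples, p genes, log-likelihood ℓ (multinomial for subtype, Gaussian for a quantitative response), and coefficients β_j.

```latex
\hat{\beta}(\lambda)
  = \arg\min_{\beta_0,\,\beta}\;
    -\frac{1}{N}\,\ell(\beta_0, \beta)
    \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

As λ grows, more coefficients are set exactly to zero, which is what reduces the gene set. In the multinomial case each class has its own coefficient vector and the penalty sums over all of them.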

Despite their heterogeneity, known cancers share one key property: genetic and transcriptomic abnormalities. To study them, the Genomic Data Commons (GDC) has aggregated and standardized tens of thousands of experimental datasets from dozens of human cancers [1-3]. Here, I describe a pipeline for predicting specific clinical features (ranging from blood tests to pathological features and survival outcomes) from all available gene expression data for seven human cancers.

Each tissue consists of a set of samples (Fig. 1), each with ~60,000 expression measurements. Many of these measurements are highly correlated, for both biological and experimental reasons (Fig. 2). This presents an immediate problem, as many predictors are almost perfectly collinear. Reducing the large set of genes to a representative set of variables is a crucial first task.
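A minimal sketch of how near-collinear predictors can be flagged with the Pearson correlation. The gene names, expression values, and the 0.95 threshold are made up for illustration and are not taken from the project data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def collinear_pairs(expression, threshold=0.95):
    """Return gene pairs whose absolute correlation meets the threshold."""
    genes = sorted(expression)
    pairs = []
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            r = pearson(expression[g1], expression[g2])
            if abs(r) >= threshold:
                pairs.append((g1, g2, round(r, 3)))
    return pairs

# Toy data: three hypothetical genes measured in five samples.
toy_expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 4.0, 6.2, 7.9, 10.0],  # nearly a multiple of geneA
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],   # only weakly related to geneA
}
pairs = collinear_pairs(toy_expression)
print(pairs)
```

With ~60,000 measurements per sample, an all-pairs scan like this is far too slow in pure Python; it is meant only to show what "almost perfectly collinear" means numerically.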

Some samples in each tissue are annotated with clinical information, such as disease subtype or survival time. These include continuous, binomial, and multinomial response variables (Fig. 3), with some variability between datasets. I therefore focus on subsets of these attributes.

I began by normalizing each dataset for factors such as sequencing depth and variance. Then, I separated each tissue into training and validation sets (a 70/30 split), using only the training sets for model fitting. Using cross-validation, I obtained very small subsets of variables with non-zero LASSO coefficients for each tissue, and used them to train a variety of models (also by cross-validation within the training set), chosen according to whether the response was continuous or categorical. This was largely done using packages in R, including glmnet, MASS, e1071, and randomForest [4-7].
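The split-then-select steps can be sketched in a few lines. This is not the original R code: it swaps in a hand-written coordinate-descent LASSO for a continuous response in place of glmnet, fixes λ at 0.5 instead of choosing it by cross-validation, and runs on simulated data in which only two hypothetical genes matter. All function and variable names are introduced here for illustration.

```python
import random

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_fit(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for (1/2n)*||y - X b||^2 + lam*||b||_1.

    X is a list of rows with centered columns; y is a centered response.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b while b is all zeros
    col_ms = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_ms[j] == 0.0:
                continue
            # Covariance of column j with the residual that ignores column j.
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_ms[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_ms[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta

def split_70_30(n_samples, seed=0):
    """Shuffle sample indices and return (training, validation) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = round(0.7 * n_samples)
    return indices[:cut], indices[cut:]

def center(rows, means):
    """Subtract the given column means from every row."""
    return [[v - m for v, m in zip(row, means)] for row in rows]

# Simulated stand-in for an expression matrix: 100 samples x 20 hypothetical genes.
rng = random.Random(1)
n_samples, n_genes = 100, 20
X_all = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_samples)]
# Only genes 0 and 3 drive the simulated response.
y_all = [3.0 * row[0] - 2.0 * row[3] + rng.gauss(0.0, 0.1) for row in X_all]

train_idx, valid_idx = split_70_30(n_samples)
X_train = [X_all[i] for i in train_idx]
y_train = [y_all[i] for i in train_idx]

# Centering statistics come from the training set only.
col_means = [sum(row[j] for row in X_train) / len(X_train) for j in range(n_genes)]
y_mean = sum(y_train) / len(y_train)

beta = lasso_fit(center(X_train, col_means), [v - y_mean for v in y_train], lam=0.5)
selected = [j for j, b in enumerate(beta) if b != 0.0]

# Error of the sparse model on the held-out 30%.
X_valid = center([X_all[i] for i in valid_idx], col_means)
y_valid = [y_all[i] for i in valid_idx]
predictions = [y_mean + sum(b * v for b, v in zip(beta, row)) for row in X_valid]
valid_mse = sum((p - t) ** 2 for p, t in zip(predictions, y_valid)) / len(y_valid)
print("selected genes:", selected, "validation MSE:", round(valid_mse, 3))
```

The point of the design is that the validation samples never influence the centering or the selected variables; the same discipline applies when λ is tuned by cross-validation inside the training set.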

For all disease subtype attributes (Fig. 5C), accuracy in validation was over 0.9 for each tissue. So, I attempted other, more complex categorical responses; two are shown in Fig. 6A. Progesterone receptor status in breast cancer was very predictable from gene expression, while bladder tumor stage was much more difficult to predict. Classifier performance varied significantly across the fitted models, though it was always better than that of the regression-based predictions.
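As a reminder of how a ROC curve like those in Fig. 6A is built from classifier scores, here is a minimal sketch. The labels and scores are invented for illustration and do not correspond to any model in this project.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping a threshold over scores.

    labels: 1 for positive, 0 for negative; a higher score means more positive.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Invented example: 4 positives and 4 negatives with classifier scores.
toy_labels = [1, 1, 0, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
curve = roc_points(toy_labels, toy_scores)
print(round(auc(curve), 4))  # 0.75
```

Here 12 of the 16 positive-negative pairs are ranked correctly, which is why the area comes out to 0.75.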

[Fig. 6A, two ROC panels. x-axis: P(FP), 0.0 to 1.0; y-axis: P(TP), 0.0 to 1.0; legend in each panel: Multinomial LASSO; Linear Discriminant Analysis; Support Vector Machine, Gaussian Kernel; Random Forests]
