Effects of Environment, Genetics and Data Analysis in an Esophageal Cancer Genome-Wide Association Study
Alexander Statnikov
Discovery Systems Laboratory
Department of Biomedical Informatics
Vanderbilt University
10/3/2007
Project history

Joint project with Chun Li and Constantin Aliferis
Cancer Research 2005 paper by Hu et al.: “Genome-Wide Association Study in Esophageal Cancer Using GeneChip Mapping 10K Array”
Reported near-perfect classification of cancer patients and healthy controls on the basis of SNP data alone from a case-control GWA study.
This finding suggests that esophageal cancer is a solely genetic disease…
Initial idea of Chun Li
At DSL we had independently obtained the GWA dataset before Chun and Constantin initiated this project.
Background

SNPs make up >90% of all human genetic variation and have been extensively studied for functional relationships between phenotype and genotype.
Modern high-throughput genotyping technologies allow fast evaluation of SNPs on a genome-wide scale at a relatively low cost.
During the last 2 years, several studies have reported success in using SNP genotyping assays in GWA studies in cancer. Probably the strongest result is the one reported by Hu et al.
Claims of Hu et al.

“Using the generalized linear model (GLM) with adjustment for potential confounders and multiple comparisons, we identified 37 SNPs associated with disease.”
“When the 37 SNPs identified from the GLM recessive mode were used in a principal components analysis, the first principal component correctly predicted 46 of 50 cases and 47 of 50 controls.” […] “The permutation tests indicated that our PCA classification can be generalized.”
Study dataset & its preparation

Study dataset:
50 esophageal squamous cell carcinoma patients
50 healthy controls (matched by age, sex, place of residence)
10K Affymetrix SNP arrays with 11,555 SNPs
Additional variables: age, tobacco use, alcohol consumption, family history, consumption of pickled vegetables
Removed ~1.5k SNPs to minimize genotyping errors
Implemented recessive A encoding
Imputed missing genotypes
SNP selection: Original method of Hu et al. (denoted as GLM1)

Fit a GLM model using data for all 100 subjects:
Probability(Cancer) = 1 / (1 + exp(−f)), where f = a + b ∙ SNP + c ∙ family history + d ∙ alcohol consumption
Obtain deviances:
D1 - deviance of the above fitted model
D0 - deviance of the null model (without predictor variables)
From the χ² distribution, compute a p-value for the test statistic D0 − D1 with 3 degrees of freedom.
Perform Bonferroni correction at the 0.05 alpha level.
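Written out, the GLM1 procedure above is a likelihood-ratio test of the full model against the intercept-only model. The symbols x (SNP genotype code), h (family history) and z (alcohol consumption) are notation introduced here for illustration; a, b, c, d follow the slide, and the usual logistic link is assumed.

```latex
\[
P(\text{cancer}) = \frac{1}{1 + e^{-f}}, \qquad f = a + b\,x + c\,h + d\,z
\]
\[
D_0 - D_1 \;\sim\; \chi^2_{3} \quad \text{under } H_0 : b = c = d = 0
\]
```

Because the null hypothesis sets b, c and d to zero together, a small p-value only says that at least one of the three predictors is associated with disease.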
SNP selection: Unbiased GLM-based method (denoted as GLM2)

Fit a GLM model using data for all 100 subjects:
Probability(Cancer) = 1 / (1 + exp(−f)), where f = a + b ∙ SNP + c ∙ family history + d ∙ alcohol consumption
Obtain deviances:
D1 - deviance of the above fitted model
D0΄ - deviance of the model with family history and alcohol consumption only
From the χ² distribution, compute a p-value for the test statistic D0΄ − D1 with 1 degree of freedom.
Perform Bonferroni correction at the 0.05 alpha level.
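Recap of SNP selection methods

| Method | GLM1 (Hu et al.) | GLM2 (current study) |
| --- | --- | --- |
| Predictors in the D1 model | SNP, family history, alcohol consumption | SNP, family history, alcohol consumption |
| Predictors in the D0 model | Null (none) | family history, alcohol consumption |
| Degrees of freedom | 3 | 1 |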
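A numeric sketch of how the two tests in the recap above can disagree for the same SNP. The deviances for the covariate-only and full models, and the number of SNPs tested, are made-up values chosen for illustration; only the intercept-only deviance follows from the 50/50 case-control design. The chi-square tail is computed in closed form for 1 and 3 degrees of freedom so that the block needs only the standard library.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, for df = 1 or 3 only."""
    if x <= 0:
        return 1.0
    tail = math.erfc(math.sqrt(x / 2.0))
    if df == 1:
        return tail
    if df == 3:
        return tail + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)
    raise ValueError("only df = 1 and df = 3 are implemented in this sketch")

# Hypothetical deviances for one SNP that adds almost nothing beyond the covariates.
n_subjects = 100
d_null = -2.0 * n_subjects * math.log(0.5)   # intercept-only model, 50 cases / 50 controls
d_covariates = 105.0                         # family history + alcohol (assumed value)
d_full = 104.5                               # SNP + family history + alcohol (assumed value)

n_snps = 10_000                              # assumed number of SNPs tested
bonferroni_alpha = 0.05 / n_snps

p_glm1 = chi2_sf(d_null - d_full, df=3)        # tests SNP and covariates jointly
p_glm2 = chi2_sf(d_covariates - d_full, df=1)  # tests the SNP alone

print(f"GLM1 p = {p_glm1:.2e}, significant: {p_glm1 < bonferroni_alpha}")
print(f"GLM2 p = {p_glm2:.2f}, significant: {p_glm2 < bonferroni_alpha}")
```

With these assumed numbers the GLM1 statistic is driven almost entirely by the covariates, so the SNP passes the Bonferroni threshold under GLM1 and is nowhere near it under GLM2.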
Classification: Original method of Hu et al.

Perform principal component analysis (PCA) on the selected SNPs using all 100 subjects in the dataset.
Extract the first principal component (PC1).
Use the following rule to classify each of the same 100 subjects as used for the PCA:
If PC1 > 0, classify as control; otherwise classify as case.
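The rule above can be sketched in a few lines. This is an illustration, not the code of Hu et al.: PC1 is found by power iteration on the centered data, the toy genotype matrix is invented, and because the sign of a principal component is arbitrary, which side of zero corresponds to "control" has to be fixed by inspecting the data.

```python
import math
import random

def first_pc_scores(rows, n_iter=500, seed=0):
    """Project each row onto the first principal component (power iteration)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]          # random starting direction
    for _ in range(n_iter):
        scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]            # X v
        w = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(p)]  # X^T X v
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(p)) for c in centered]

def classify_by_pc1(rows):
    """Rule from the slide: PC1 > 0 -> control, otherwise case."""
    return ["control" if s > 0 else "case" for s in first_pc_scores(rows)]

# Invented 0/1 genotype codes: 6 subjects x 3 SNPs.
toy = [[1, 1, 1], [1, 1, 0], [1, 1, 1], [0, 0, 0], [0, 0, 1], [0, 0, 0]]
print(classify_by_pc1(toy))
```

Note that the same subjects are used to fit the component and to be classified, exactly as the slide describes, so the result is not an out-of-sample prediction.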
Evaluation of classification performance

Hu et al. used the proportion of correct classifications; their classifier was trained and tested on the same dataset.
We use the area under the ROC curve (AUC) as the performance metric and a repeated 10-fold cross-validation scheme.
[Figure: the SNP dataset (100 subjects) is split into 10 folds; the 10 fold-level AUCs (e.g. 0.9, 0.8, 0.9, 0.6, 0.9, 0.8, 0.7, 0.8, 0.9, 1.0) are averaged into one estimate per repetition (0.83), and the procedure is repeated with new splits (0.81, 0.79, …).]
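A minimal sketch of the two ingredients of this evaluation scheme: AUC computed as the fraction of correctly ordered (case, control) pairs, and fold-level AUCs averaged within each repetition. The fold values are the ones from the slide's illustration; the helper names and the stratified fold assignment are assumptions made here, not details given in the talk.

```python
import random

def auc(scores, labels):
    """Area under the ROC curve: fraction of (case, control) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def stratified_folds(labels, k, rng):
    """Assign each subject to one of k folds, keeping cases and controls balanced."""
    fold_of = [0] * len(labels)
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            fold_of[i] = pos % k
    return fold_of

def mean_of_fold_aucs(fold_aucs):
    return sum(fold_aucs) / len(fold_aucs)

# Fold-level AUCs from the slide's illustration, one list per repetition.
slide_runs = [
    [0.9, 0.8, 0.9, 0.6, 0.9, 0.8, 0.7, 0.8, 0.9, 1.0],
    [0.6, 0.9, 0.9, 0.6, 0.9, 0.5, 0.9, 0.9, 0.9, 1.0],
    [1.0, 0.8, 0.9, 0.7, 0.9, 0.8, 0.7, 0.8, 0.6, 0.7],
]
for run in slide_runs:
    print(round(mean_of_fold_aucs(run), 2))   # 0.83, 0.81, 0.79
```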
Reproducing findings of Hu et al.

Using the GLM1 method, Hu et al. reported 37 significant SNPs; we found 226!
Apparently, they used an extra filtering step that was not reported in the paper (personal communication with their PI).
Nevertheless, applying the PCA-based classifier (as in Hu et al.) to the GLM1-significant SNPs resulted in a 0.93 proportion of correct classifications and 0.98 AUC.
Major findings are reproduced using the methods of Hu et al.
Bias in SNP selection method GLM1 of Hu et al.

The p-value calculated in GLM1 does not reflect the significance of the SNP, but the significance of 3 variables combined (SNP, family history, and alcohol consumption).
Family history and alcohol consumption are strong risk factors, so the p-value is biased towards 0.
Bias in SNP selection method GLM1 by Hu et al.

The distribution of SNP p-values for method GLM1 is not uniform: most p-values are < 10⁻³.
In contrast, GLM2 reflects the significance of SNPs and does not suffer from the above bias:
Its distribution of SNP p-values is uniform.
It returns no SNPs significant at the Bonferroni-adjusted alpha level.
[Figure: distributions of SNP p-values for GLM1 and GLM2, with the Bonferroni-adjusted α-level marked.]
Empirical demonstration of bias in SNP selection method

Main idea: Create a null distribution where SNPs are completely unrelated to the response variable and see how frequently methods GLM1 and GLM2 find statistically significant SNPs.
1. Permute all subjects in the SNP data while leaving the response variable, family history of esophageal cancer, and alcohol consumption intact.
2. Apply GLM1 and GLM2 to the permuted SNP data.
Repeat 1,000 times.
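The permutation loop above can be sketched as follows. The `select` argument is a hypothetical stand-in for GLM1 or GLM2: any function that takes the SNP rows, the response and the covariates and returns the indices of SNPs it calls significant. Only the shuffling logic is meant to mirror the slide.

```python
import random

def permute_snp_rows(snp_rows, rng):
    """Return SNP rows in a random subject order; each subject's genotype row stays whole."""
    order = list(range(len(snp_rows)))
    rng.shuffle(order)
    return [snp_rows[i] for i in order]

def count_runs_with_hits(snp_rows, response, covariates, select, n_perm=1000, seed=0):
    """Count permutations in which `select` reports at least one significant SNP."""
    rng = random.Random(seed)
    runs_with_hits = 0
    for _ in range(n_perm):
        shuffled = permute_snp_rows(snp_rows, rng)
        # response and covariates are passed unchanged, so their mutual
        # association is preserved while any SNP-disease link is broken
        if len(select(shuffled, response, covariates)) > 0:
            runs_with_hits += 1
    return runs_with_hits
```

Shuffling whole rows keeps the correlation structure among SNPs intact, which is why the subjects, rather than individual genotypes, are permuted.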
Results of permutation experiments

GLM1 found significant SNPs in all 1,000 permutations! The number of significant SNPs found in a permuted dataset ranges from 185 to 1,938 (357 on average).
GLM2 found significant SNPs in only 48/1,000 permutations. The number of significant SNPs found in a permuted dataset ranges from 1 to 3.
GLM1 is biased, while GLM2 is not.
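As a rough sanity check on the GLM2 count: with m tests each run at level α/m, and assuming for illustration that the tests are independent with uniform null p-values (SNPs in linkage disequilibrium are not independent), the chance that a permuted dataset yields at least one significant SNP is

```latex
\[
1 - \left(1 - \frac{\alpha}{m}\right)^{m} \;\approx\; 1 - e^{-\alpha} \;=\; 1 - e^{-0.05} \;\approx\; 0.049
\]
```

so roughly 49 of 1,000 permutations would be expected to show a hit, in line with the observed 48.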
Bias in the classification performance estimate of Hu et al.

All data-analysis methods of Hu et al. use data for all subjects. Neither cross-validation nor independent-sample validation was performed.
We repeated their data analysis (GLM1+PCA) embedded in the repeated 10-fold cross-validation design. The resulting performance is only 0.68 AUC (versus 0.98 AUC).
0.30 AUC bias (overestimation) in the reported results
Empirical demonstration of performance estimation bias

Main idea: Create a null distribution where SNPs are completely unrelated to the response variable (i.e. AUC = 0.5), apply the GLM1+PCA methodology, and record the resulting performance estimates.
1. Permute all subjects in the SNP data while leaving the response variable, family history of esophageal cancer, and alcohol consumption intact.
2. Apply GLM1 to the permuted SNP data.
3. Build and apply classifier using PCA.
4. Estimate classification performance (AUC).
Repeat 1,000 times.
Results of permutation experiments

Classification performance of GLM1+PCA, both methods applied as in Hu et al. to all data (no cross-validation): 0.99 AUC
Classification performance of GLM1+PCA, GLM1 applied to all data and PCA applied by cross-validation (incomplete cross-validation): 0.98 AUC
Classification performance of GLM1+PCA, both applied by cross-validation: 0.50 AUC
0.48-0.49 AUC bias (overestimation) under the null
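The gap between incomplete and complete cross-validation can be reproduced qualitatively on pure noise. The sketch below is not the GLM1+PCA pipeline: a simple mean-difference filter and a weighted-sum score are swapped in so that the example stays short and dependency-free, and the data are simulated random 0/1 features. The exact AUCs will differ from the slide; only the direction of the effect is the point.

```python
import random

def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def mean_diffs(X, y, rows):
    """Per-feature difference in mean between cases and controls, using `rows` only."""
    cases = [i for i in rows if y[i] == 1]
    ctrls = [i for i in rows if y[i] == 0]
    return [
        sum(X[i][j] for i in cases) / len(cases) - sum(X[i][j] for i in ctrls) / len(ctrls)
        for j in range(len(X[0]))
    ]

def top_features(diffs, k):
    return sorted(range(len(diffs)), key=lambda j: -abs(diffs[j]))[:k]

def cv_auc(X, y, k_feat, n_folds, rng, select_inside):
    """Mean fold AUC; features are chosen inside each fold or once on all data."""
    n = len(y)
    fold_of = [0] * n
    for cls in (0, 1):                      # stratified fold assignment
        idx = [i for i in range(n) if y[i] == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            fold_of[i] = pos % n_folds
    all_rows = list(range(n))
    global_feats = top_features(mean_diffs(X, y, all_rows), k_feat)  # sees test subjects
    fold_aucs = []
    for f in range(n_folds):
        train = [i for i in all_rows if fold_of[i] != f]
        test = [i for i in all_rows if fold_of[i] == f]
        d = mean_diffs(X, y, train)         # weights always come from training data
        feats = top_features(d, k_feat) if select_inside else global_feats
        scores = [sum(d[j] * X[i][j] for j in feats) for i in test]
        fold_aucs.append(auc(scores, [y[i] for i in test]))
    return sum(fold_aucs) / n_folds

data_rng = random.Random(0)
y = [1] * 50 + [0] * 50
X = [[data_rng.randint(0, 1) for _ in range(1000)] for _ in y]   # pure-noise features

biased = cv_auc(X, y, 20, 10, random.Random(1), select_inside=False)
unbiased = cv_auc(X, y, 20, 10, random.Random(1), select_inside=True)
print(f"selection on all data:      AUC = {biased:.2f}")
print(f"selection inside each fold: AUC = {unbiased:.2f}")
```

The only difference between the two calls is whether the test subjects were visible when features were chosen, which is the distinction the slide draws between incomplete and complete cross-validation.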
Classification: Support Vector Machines (SVMs)

Supervised baseline technique for many types of high-throughput data (microarray, proteomics, etc.).
Trained and applied by cross-validation.
[Figure: two scatter plots of cases and controls against SNP 1 and SNP 2, with unlabeled samples marked “?”, linked by a “kernel” transformation.]
SNP selection for fitting SVMs: Recursive Feature Elimination

Among the best-performing techniques for the analysis of microarray gene expression data.
Applied only to a training set during cross-validation.
[Figure: RFE flow. 10,000 SNPs → SVM model and performance estimate → 5,000 SNPs important for classification are kept, 5,000 not important are discarded → SVM model and performance estimate → 2,500 SNPs kept, 2,500 discarded → …]
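The halving loop in the diagram can be sketched generically. Here `fit_weights` and `evaluate` are hypothetical callables: in the setting of the slide, `fit_weights` would train a linear SVM on the training portion of a cross-validation split and return one weight per remaining SNP, and `evaluate` would return that model's performance estimate. The toy weights at the bottom are invented so the block runs without an SVM library.

```python
def rfe_halving(n_features, fit_weights, evaluate, min_features=1):
    """Recursive feature elimination that drops the lower-ranked half each round.

    fit_weights(features) -> {feature: weight} from a model trained on those features
    evaluate(features)    -> performance estimate for that model
    Returns a list of (features, performance) pairs, one per round.
    """
    features = list(range(n_features))
    history = []
    while True:
        weights = fit_weights(features)
        history.append((list(features), evaluate(features)))
        keep = len(features) // 2
        if keep < min_features:
            break
        # rank by absolute weight and keep the more important half
        ranked = sorted(features, key=lambda j: -abs(weights[j]))
        features = sorted(ranked[:keep])
    return history

toy_weights = {j: float(j) for j in range(8)}   # made-up importances
history = rfe_halving(8, lambda feats: toy_weights, lambda feats: None)
print([len(feats) for feats, _ in history])     # [8, 4, 2, 1]
```

The feature subset finally used would be the one with the best recorded performance estimate, chosen using training data only.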
Classification results: repeated 10-fold cross-validation estimates

“+” denotes building of a classifier by an ensembling technique.
Feedback on our analysis from Hu et al.

1. Concerning bias in SNP selection:
“If we use p-values to rank the SNPs, the two methods [GLM1 and GLM2] will give the same order.”
Our comment:
Ranking of SNPs is irrelevant because the method of Hu et al. (GLM1), as described and used in their paper, is a method for selection (and not ranking) of SNPs.
Feedback on our analysis from Hu et al.

2. Concerning bias in estimation of classifier performance:
“It was not our purpose to develop a classifier in this initial pilot effort.”
“…we made these calculations as a frame of reference only.”
The authors presented results of their “cross-validation effort”. SNPs were selected by GLM1 on all 100 subjects, and the classifier was trained and tested by cross-validation (2/3 of the data used for training and 1/3 for testing). This cross-validation procedure was repeated 1,000 times with different splits into training and testing sets.
Feedback on our analysis from Hu et al.

The authors obtained the following histogram of classification performance estimates:
[Figure: histogram; x-axis: proportion of correct classifications.]
Our comment:
These results are expected because their SNP selection procedure uses both training and testing data. This is “incomplete cross-validation”, which has been shown to bias the performance estimate of the classifier.
Publications

Statnikov A, Li C, Aliferis CF (2007) “Effects of Environment, Genetics and Data Analysis Pitfalls in an Esophageal Cancer Genome-Wide Association Study.” PLoS ONE 2(9): e958.
Statnikov A, Li C, Aliferis CF (2007) “A statistical reappraisal of the findings of an esophageal cancer genome-wide association study.” Cancer Research (accepted).
Conclusions

Data-analysis pitfalls in Hu et al. led researchers to (1) identify SNPs that are not statistically significant and (2) derive biased estimates of classification performance.
Environmental factors and family history have a modest association with the disease, while SNPs do not appear to be associated.
It is crucially important to have sound statistical analysis in genome-wide association studies.
The amount of work involved in demonstrating errors (even obvious ones), correcting the analysis, communicating with the authors, and publishing the rebuttal is significantly greater than that of publishing the original paper!