CANB 7640 Final Project Presentation
Chronic Obstructive Pulmonary Disease (COPD) is an umbrella term used to describe progressive lung diseases including emphysema, chronic bronchitis, refractory (non-reversible) asthma, and some forms of bronchiectasis. This disease is characterized by increasing breathlessness. http://www.copdfoundation.org/
3rd leading cause of death in the US
• Dr. Farnoush Banaei-Kashani (PhD)
Assistant Professor
• Shahab Helmi
PhD Student
• Dr. Katerina Kechris (PhD)
Associate Professor
• Dr. Russell Bowler (MD, PhD)
Professor
• Sean Jacobson (MS)
Data Analyst
Predict how COPD progresses over time
Reverse engineering -> what are the causes? (Future work)
Mainly from http://www.copdgene.org/ (PRIVATE)
Metabolomics
Genetics
Genomics
Proteomics
Clinical
CT Scan
The dataset used in this project has 5000 samples and each sample has around 150 features.
SID NewGold 1 NewGold 2 …
Data preprocessor:
• Handling null values
• Data normalization
• Discretization
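The three preprocessing steps above can be sketched as follows. This is a minimal illustration only: the slides do not say which policies were used, so mean imputation, min-max scaling, and equal-width binning are assumptions, and the function names and the sample column are hypothetical.

```python
from statistics import mean

def impute_nulls(column):
    """Replace None with the mean of the observed values (assumed policy)."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]

def min_max_normalize(column):
    """Rescale values linearly to [0, 1]; a constant column maps to zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def discretize(column, bins=3):
    """Equal-width binning of values in [0, 1] into labels 0..bins-1."""
    return [min(int(v * bins), bins - 1) for v in column]

raw = [2.0, None, 4.0, 6.0]         # hypothetical feature column, one null
filled = impute_nulls(raw)          # [2.0, 4.0, 4.0, 6.0]
scaled = min_max_normalize(filled)  # [0.0, 0.5, 0.5, 1.0]
labels = discretize(scaled)         # [0, 1, 1, 2]
```

Normalizing before KNN keeps features with large numeric ranges from dominating the distance computation.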
Overlap Module
Predictor:
• KNN
• KNN + Decision Tree
• Naïve Bayes
KNN
KNN + Decision Tree 1
A B … D3
KNN + Decision Tree 2
A B … D3
A-A A-B … A-D3
$$\min\left(\frac{\sum_{1}^{k} \mathrm{distance}(\mathit{test}, \mathit{train})}{k}\right)$$
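As a rough illustration of the plain KNN predictor named above, here is a minimal sketch with majority voting. The slides do not state the distance metric or the tie-breaking rule, so Euclidean distance is an assumption, and the class labels and training points are hypothetical. It does not reproduce the KNN + Decision Tree variants.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, test_point, k):
    """train is a list of (feature_vector, label) pairs.

    Returns the most common label among the k training samples
    closest to test_point.
    """
    nearest = sorted(train, key=lambda row: euclidean(row[0], test_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical, already-normalized training data with two classes.
train = [
    ((0.0, 0.0), "A"),
    ((0.1, 0.2), "A"),
    ((1.0, 1.0), "B"),
    ((0.9, 1.1), "B"),
    ((1.2, 0.8), "B"),
]
prediction = knn_predict(train, (0.2, 0.1), k=3)
```

With k=3 the two nearby "A" samples outvote the single "B" neighbor, so `prediction` is "A".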
C# and LINQ
Microsoft SQL Server
Train-Test Ratio
90-10
80-20
70-30
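The three train-test ratios above could be produced with a shuffled split like the sketch below. Whether the original experiments shuffled, stratified, or split sequentially is not stated, so the seeded random shuffle is an assumption and `split` is a hypothetical helper.

```python
import random

def split(samples, train_fraction, seed=0):
    """Shuffle a copy of samples and cut it at train_fraction."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

samples = list(range(5000))  # stand-in IDs for the roughly 5000 samples
sizes = {}
for train_fraction in (0.9, 0.8, 0.7):
    train_set, test_set = split(samples, train_fraction)
    sizes[train_fraction] = (len(train_set), len(test_set))
```

For 5000 samples this gives 4500/500, 4000/1000, and 3500/1500 train/test sizes.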
Features
All 150
Numerical-only
Categorical-only
Genetic-only
Disease history-only
[Figure: 90-10 – KNN – All features. Accuracy (%) for k=1 to k=10; series: Accuracy 1, Accuracy 2; y-axis 0 to 90.]
[Figure: accuracy (%) by predictor for K=1 to K=10; NB = 54%. Data labels, rows in legend order:]
Predictor  K=1   K=2   K=3   K=4   K=5   K=6   K=7   K=8   K=9   K=10
KNN        61.4  65    65    65.5  65.5  67    67.8  67.8  67.5  67
KNN+DT1    63.3  66    66    67.7  67.8  68.4  69    69.2  69.5  69.6
KNN+DT2    64.6  65.5  66.7  67.6  68.7  69.1  69.6  69.9  69.9  69.8
[Figure: accuracy (%) by feature set for k=1 to k=10; series: All, Categorical, Numerical, Genetic, Disease; y-axis 0 to 80.]
Feature          Max Accuracy 1   Max Accuracy 2
Disease History  70.5%            85.25%
All              69.9%            84.65%
Categorical      69.1%            84.05%
Numerical        68.1%            83.95%
Genetic          64%              82.9%
Working with domain experts
Better feature selection (medical doctors)
Better data preprocessing (statisticians) + PCA
Testing all feature combinations!
Curse of dimensionality (2^150 combinations) -> smart algorithm
But may solve the mystery of COPD progression
More classification algorithms, such as SVMs, NNs, …
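The scale of the exhaustive feature-combination search mentioned above can be checked with a short calculation. With about 150 features, each feature is either in or out of a subset, giving 2 to the power 150 subsets. The evaluation rate below is a hypothetical figure chosen only to make the point concrete.

```python
n_features = 150
n_subsets = 2 ** n_features  # each feature is either included or excluded

# Hypothetical throughput: one million subset evaluations per second.
evals_per_second = 1_000_000
seconds_per_year = 60 * 60 * 24 * 365
years_needed = n_subsets / (evals_per_second * seconds_per_year)

print(f"{n_subsets:.3e} subsets, about {years_needed:.1e} years at that rate")
```

The count is on the order of 10^45, which is why brute force is ruled out and some search heuristic is needed instead.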