Clinical Data Classification of Alzheimer's Disease
![Page 1: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/1.jpg)
Alzheimer's Disease: Clinical Data Classification
By George Kalangi
Venkata Gopi
![Page 2: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/2.jpg)
Overview:
• Introduction
• Analysis of commonly used terms and explanation of data sets
• Overall Programming Process
• Generating a merged file with CDGLOBAL
• Generation of files for future status prediction
• Data Preprocessing
• Classification algorithms used on the data
• Analysis of the output data from WEKA
![Page 3: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/3.jpg)
Introduction
• What is Alzheimer's disease?
  – A brain disorder
  – The most common form of dementia
• Dementia is a term for the loss of memory and other intellectual abilities serious enough to interfere with daily life
• Clinical Dementia Rating (0, 0.5, 1, 2, 3)
  – Normal: 0
  – Questionable Dementia: 0.5
  – Mild to Severe Dementia: 1.0 to 3.0
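As a small illustration of the rating scale above, the mapping from a score to its category can be written as a lookup. This is a minimal sketch; the function name `cdr_label` is hypothetical and not part of the project.

```python
def cdr_label(cdglobal):
    """Map a Clinical Dementia Rating score to the category named on the slide."""
    if cdglobal == 0:
        return "Normal"
    if cdglobal == 0.5:
        return "Questionable Dementia"
    if cdglobal in (1, 2, 3):
        return "Mild to Severe Dementia"
    # Anything else (for example the -1 placeholder) is not a valid rating.
    raise ValueError(f"unexpected rating value: {cdglobal!r}")
```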
![Page 4: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/4.jpg)
Datasets (60 Files)
• 56 comma-separated files
• 1 file: Data Dictionary (explains the terms used)
• 1 file: Clinical Dementia Rating (has CDGLOBAL)
• Rest: assessments, data definitions
• Others, such as visits, use abbreviations
![Page 5: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/5.jpg)
Environment Setup
• Languages and databases used for the project are PHP, MySQL, Java, and PostgreSQL
• Tools used are WEKA (Waikato Environment for Knowledge Analysis), MySQL Workbench, and NetBeans
• Front end: PHP
• Back end: MySQL
![Page 6: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/6.jpg)
Overall Programming Process
• A selected dataset (e.g. FAQ) is given by the user.
• At the back end, MySQL queries are defined to create the required tables and insert the required data into the corresponding tables.
• After that, the required operations are performed on the tables.
• Final output files are stored in .csv format.
![Page 7: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/7.jpg)
Generating a merged file with CDGLOBAL (for current status)
• Inputs are the given dataset (e.g. adni_faq_2011-01-20.csv) and the adni_cdr_2011-01-20.csv file.
  – The RIDs and VISCODEs of the FAQ and CDR files are compared; where they match, the CDGLOBAL column of the CDR file is merged into the FAQ file.
• During the merge, rows whose CDGLOBAL is -1 are removed, and rows with VISCODEs f, nv, and uns1 are trimmed off.
• The result file is "Merged_dataset_file.csv".
![Page 8: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/8.jpg)
Query used for generating the merged file:
SELECT f.cID, f.RID, f.VISCODE, f.EXAMDATE, f.FAQSOURCE, f.FAQFINAN, f.FAQFORM, f.FAQSHOP, f.FAQGAME, f.FAQBEVG, f.FAQMEAL, f.FAQEVENT, f.FAQTV, f.FAQREM, f.FAQTRAVL, f.FAQTOTAL, cdr.cdglobal
FROM cdr, faq f
WHERE cdr.rid = f.rid AND cdr.VISCODE = f.VISCODE AND cdr.cdglobal NOT IN (-1);
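The merge on RID and VISCODE can be tried out on a few rows without a MySQL server. The sketch below uses Python's built-in `sqlite3` module instead of the project's PHP/MySQL setup; the reduced column set, the sample rows, and the `bl` visit code are assumptions made for illustration only.

```python
import sqlite3

def merge_faq_with_cdr(faq_rows, cdr_rows):
    """Attach CDGLOBAL from CDR rows to FAQ rows with the same RID and VISCODE."""
    conn = sqlite3.connect(":memory:")
    # Reduced, assumed schema: the real files carry many more columns.
    conn.execute("CREATE TABLE faq (RID INTEGER, VISCODE TEXT, FAQTOTAL INTEGER)")
    conn.execute("CREATE TABLE cdr (RID INTEGER, VISCODE TEXT, CDGLOBAL REAL)")
    conn.executemany("INSERT INTO faq VALUES (?, ?, ?)", faq_rows)
    conn.executemany("INSERT INTO cdr VALUES (?, ?, ?)", cdr_rows)
    merged = conn.execute(
        """
        SELECT f.RID, f.VISCODE, f.FAQTOTAL, c.CDGLOBAL
        FROM faq AS f
        JOIN cdr AS c ON c.RID = f.RID AND c.VISCODE = f.VISCODE
        WHERE c.CDGLOBAL <> -1                      -- drop the -1 placeholder
          AND f.VISCODE NOT IN ('f', 'nv', 'uns1')  -- drop the trimmed visit codes
        ORDER BY f.RID, f.VISCODE
        """
    ).fetchall()
    conn.close()
    return merged

# Hypothetical sample rows, not taken from the ADNI files.
faq_rows = [(1, "bl", 0), (1, "m06", 2), (2, "bl", 5), (3, "nv", 1)]
cdr_rows = [(1, "bl", 0.0), (1, "m06", 0.5), (2, "bl", -1), (3, "nv", 1.0)]
merged_rows = merge_faq_with_cdr(faq_rows, cdr_rows)
```

Only the two visits of RID 1 survive here: RID 2 is dropped because its CDGLOBAL is -1, and RID 3 because its visit code is one of the trimmed ones.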
![Page 9: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/9.jpg)
Generation of files for future status prediction
• The prediction dataset is generated by mapping the first visit to the 6-month class, the 6-month visit to the 12-month class, and so on.
• SQL queries are run on the merged file to separate the classes at 6-month intervals.
• The following files are generated:
  - File_dataset_m06.csv
  - File_dataset_m12.csv, and so on
![Page 10: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/10.jpg)
Query used for generating class files:
SELECT v.ID AS ID, v.RID AS RID, v.VISCODE, v.EXAMDATE, v.FAQSOURCE, v.FAQFINAN, v.FAQFORM, v.FAQSHOP, v.FAQGAME, v.FAQBEVG, v.FAQMEAL, v.FAQEVENT, v.FAQTV, v.FAQREM, v.FAQTRAVL, v.FAQTOTAL, m12.cdrglobal
FROM `table_adni_faq_2011-01-20_m06` v, `table_adni_faq_2011-01-20_m12` m12
WHERE v.rid = m12.rid
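The idea of pairing each visit with the class observed six months later can be sketched in plain Python. The visit-code sequence, the row layout `(RID, VISCODE, features, CDGLOBAL)`, and the sample values are assumptions for illustration; the project itself does this with SQL on per-visit tables.

```python
# Assumed ordering of visit codes, six months apart.
VISIT_ORDER = ["bl", "m06", "m12", "m18", "m24"]

def build_future_pairs(merged_rows):
    """Pair each visit's features with the CDGLOBAL seen at the next visit."""
    cdglobal_at = {(rid, vis): cd for rid, vis, _features, cd in merged_rows}
    pairs = []
    for rid, vis, features, _cd in merged_rows:
        if vis not in VISIT_ORDER:
            continue
        idx = VISIT_ORDER.index(vis)
        if idx + 1 >= len(VISIT_ORDER):
            continue  # no later visit defined for the last code
        next_vis = VISIT_ORDER[idx + 1]
        future = cdglobal_at.get((rid, next_vis))
        if future is not None:  # keep only subjects that have the later visit
            pairs.append((rid, vis, features, next_vis, future))
    return pairs

# Hypothetical merged rows: (RID, VISCODE, FAQTOTAL, CDGLOBAL).
sample = [(1, "bl", 0, 0.0), (1, "m06", 2, 0.5), (1, "m12", 4, 1.0), (2, "bl", 5, 0.5)]
future_pairs = build_future_pairs(sample)
```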
![Page 11: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/11.jpg)
Preprocessing
• After we get the required .csv files, we use WEKA to preprocess the data.
• Load the file into WEKA.
• Apply the filter "weka.filters.unsupervised.attribute.Remove" to trim off the unused fields.
• Apply "NumericToNominal" to convert all the values in the data to nominal before feeding them to a classifier algorithm.
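To make the effect of the two preprocessing steps concrete, here is a plain-Python analogue. It does not call WEKA; the function names and sample rows are hypothetical, and the sketch only mirrors what removing attributes and treating numbers as nominal labels means for the data.

```python
def remove_attributes(rows, unused):
    """Drop the named fields from every row (analogue of a Remove filter)."""
    return [{k: v for k, v in row.items() if k not in unused} for row in rows]

def numeric_to_nominal(rows):
    """Turn every value into a string label and list the labels seen per attribute."""
    nominal_rows = [{k: str(v) for k, v in row.items()} for row in rows]
    domains = {}
    for row in nominal_rows:
        for key, value in row.items():
            domains.setdefault(key, set()).add(value)
    return nominal_rows, {key: sorted(values) for key, values in domains.items()}

# Hypothetical rows; real files carry the full FAQ column set.
rows = [
    {"RID": 1, "EXAMDATE": "2011-01-20", "FAQTOTAL": 0, "CDGLOBAL": 0.0},
    {"RID": 2, "EXAMDATE": "2011-01-21", "FAQTOTAL": 5, "CDGLOBAL": 0.5},
]
trimmed = remove_attributes(rows, unused={"RID", "EXAMDATE"})
nominal_rows, domains = numeric_to_nominal(trimmed)
```

After this step the class column holds the labels "0.0" and "0.5" rather than numbers, which is the form a nominal-class learner expects.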
![Page 12: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/12.jpg)
Classification Algorithms Used
• The Classify panel enables the user to apply classification and regression algorithms (indiscriminately called classifiers in Weka) to the resulting dataset, to estimate the accuracy of the resulting predictive model.
• J48, which uses the C4.5 algorithm (a successor of ID3)
• Naïve Bayesian classification algorithm
![Page 13: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/13.jpg)
What is classification?
• Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Example: If the items in a house are not classified, we cannot arrange them. We classify the items by their usage, e.g. cooking items or decoration items, so that we can arrange them accordingly and use them efficiently and easily.
![Page 14: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/14.jpg)
Decision Tree Classification Task
[Figure: decision tree. Refund: Yes → NO; No → MarSt. MarSt: Married → NO; Single, Divorced → TaxInc. TaxInc: < 80K → NO; > 80K → YES. Applied to the test data: assign Cheat to "No".]
![Page 15: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/15.jpg)
Decision Tree
[Figure: decision tree with nodes Refund, MarSt, and TaxInc and leaves YES / NO, applied to the test data.]
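The Refund / MarSt / TaxInc tree in the figure amounts to a few nested tests. The sketch below writes it out as a function; the function name, the handling of an income of exactly 80K, and the test record values are assumptions, since the transcript only preserves the node and branch labels.

```python
def predict_cheat(refund, marital_status, taxable_income):
    """Walk the Refund -> MarSt -> TaxInc tree and return the Cheat label."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: decide on taxable income.
    # The figure only labels < 80K and > 80K; treating exactly 80K as "Yes" is an assumption.
    return "No" if taxable_income < 80_000 else "Yes"

# Assumed test record: no refund, married, 80K income.
test_label = predict_cheat("No", "Married", 80_000)
```

With these assumed values the record stops at the MarSt node and is assigned Cheat = "No", matching the outcome stated on the slide.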
![Page 16: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/16.jpg)
J48 uses the C4.5 Algorithm
• Decision trees represent a supervised approach to classification.
• Decision trees are a classic way to represent information from a machine learning algorithm, and offer a fast and powerful way to express structures in data.
• A decision tree is a simple structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.
• The basic algorithm recursively splits the data until each leaf is pure, meaning that the data has been categorized as close to perfectly as possible. This process ensures maximum accuracy on the training data.
• The latest public domain implementation of Quinlan's model is C4.5. The Weka classifier package has its own version of C4.5 known as J48.
![Page 17: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/17.jpg)
Why decision tree algorithm?
• Advantages:
  – Inexpensive to construct
  – Easy to interpret for small-sized trees
  – Accuracy is comparable to other classification techniques for many simple data sets
  – There could be more than one tree possible for the same data
• Disadvantages:
  – Underfitting: when the model is too simple, both training and test errors are large
![Page 18: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/18.jpg)
All about Cross-Validation
• We perform cross-validation when the amount of data is small and we need independent training and test sets from it.
• It is important that each class is represented in its actual proportions in the training and test sets; this is called stratification.
• An important cross-validation technique is stratified 10-fold cross-validation, where the instance set is divided into 10 folds.
• We run 10 iterations, each taking a different single fold for testing and the rest for training.
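The fold construction described above can be sketched as follows. This is a simplified, deterministic round-robin assignment without shuffling, which is an assumption of the sketch and not necessarily how WEKA builds its folds; the class labels in the example are hypothetical.

```python
from collections import defaultdict

def stratified_folds(labels, k=10):
    """Assign instance indices to k folds, dealing out each class in turn."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        for idx in by_class[label]:
            folds[position % k].append(idx)  # round-robin keeps class proportions even
            position += 1
    return folds

def cross_validation_splits(labels, k=10):
    """Yield (train_indices, test_indices) once per fold."""
    folds = stratified_folds(labels, k)
    for test_fold in range(k):
        test = folds[test_fold]
        train = [i for f, fold in enumerate(folds) if f != test_fold for i in fold]
        yield train, test

# Hypothetical label column: 60 "normal" and 40 "demented" instances.
labels = ["normal"] * 60 + ["demented"] * 40
folds = stratified_folds(labels, k=10)
splits = list(cross_validation_splits(labels, k=10))
```

Each of the 10 folds here holds 6 "normal" and 4 "demented" instances, so every test fold reflects the 60/40 class proportions of the full set.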
![Page 19: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/19.jpg)
Evaluation
• Metrics for Performance Evaluation
  – How to evaluate the performance of a model?
• Methods for Model Comparison
  – How to compare the relative performance among competing models?
![Page 20: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/20.jpg)
Metrics for Performance Evaluation: Confusion Matrix
• A confusion matrix contains information about actual and predicted classifications done by a classification system. Performance of systems is commonly evaluated using the data in the matrix. The following table shows the confusion matrix for a two-class classifier:
• We get the confusion matrix after supplying data to a classifier.
• Based on the confusion matrix, we can evaluate the model using measures such as precision, recall, F-measure, and accuracy.
![Page 21: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/21.jpg)
Example
• Suppose there is a sample of 27 animals: 8 cats, 6 dogs, and 13 rabbits.
• Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.
• We can see from the matrix that the system in question has trouble distinguishing between cats and dogs, but can make the distinction between rabbits and other types of animals pretty well.
• All correct guesses are located in the diagonal of the table, so it's easy to visually inspect the table for errors, as they will be represented by any non-zero values outside the diagonal.
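A confusion matrix for the animal example can be built by counting (actual, predicted) pairs. The matrix from the slide did not survive in this transcript, so the per-cell counts below are assumed values chosen only to match the stated totals of 8 cats, 6 dogs, and 13 rabbits.

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

classes = ["cat", "dog", "rabbit"]
actual = ["cat"] * 8 + ["dog"] * 6 + ["rabbit"] * 13
# Assumed predictions, listed in the same order as `actual`.
predicted = (
    ["cat"] * 5 + ["dog"] * 3                    # the 8 actual cats
    + ["cat"] * 2 + ["dog"] * 3 + ["rabbit"] * 1  # the 6 actual dogs
    + ["dog"] * 2 + ["rabbit"] * 11               # the 13 actual rabbits
)
matrix = confusion_matrix(actual, predicted, classes)
correct = sum(matrix[i][i] for i in range(len(classes)))  # diagonal = correct guesses
```

With these assumed counts the diagonal sums to 19 of 27, and most off-diagonal mass sits in the cat/dog cells, which is the pattern the bullets describe.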
![Page 22: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/22.jpg)
Limitation of accuracy
• Consider a 2-class problem
  – Number of Class 0 examples = 9990
  – Number of Class 1 examples = 10
• If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
  – Accuracy is misleading because the model does not detect any class 1 example.
  – Accuracy therefore has disadvantages as a performance estimate. For example, if there were 95 cats and only 5 dogs in the data set, the classifier could easily be biased into classifying all the samples as cats. The overall accuracy would be 95%, but in practice the classifier would have a 100% recognition rate for the cat class and a 0% recognition rate for the dog class, so you will probably want to look at some of the other numbers. ROC Area, the area under the ROC curve, is also taken as a preferred measure.
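The arithmetic of the 9990 / 10 example is easy to check. In this sketch class 1 is treated as the positive class, which is an assumption about labeling; the counts come straight from the slide.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def recall(tp, fn):
    """Share of actual positives that were found (0.0 if there are none)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# A model that predicts class 0 for everything, with class 1 as "positive":
tp, fn = 0, 10     # all 10 class 1 examples are missed
tn, fp = 9990, 0   # all 9990 class 0 examples are correct

all_zero_accuracy = accuracy(tp, tn, fp, fn)
all_zero_recall = recall(tp, fn)
```

Accuracy comes out at 0.999 while recall for class 1 is 0.0, which is why accuracy alone says little on heavily imbalanced data.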
![Page 23: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/23.jpg)
Metrics for Evaluation
• Accuracy: The accuracy (AC) is the proportion of the total number of predictions that were correct, i.e. what percentage of people were correctly classified. It is determined using the equation:
Accuracy = (# True Positives + # True Negatives) / N
where N = total # predictions. In terms of the confusion matrix counts:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
• Precision: Precision (P) is the proportion of the predicted positive cases that were correct. Of all the people classified as demented, what percentage are actually demented? It is calculated using the equation:
Precision = (# True Positives) / (# True Positives + # False Positives)
![Page 24: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/24.jpg)
Evaluation
• F-measure: the harmonic mean of precision and recall. It is calculated using the equation:
F-measure = 2 * (# True Positives) / (2 * # True Positives + # False Positives + # False Negatives)
• Recall: Recall is the ratio of the number of true positives to the sum of true positives and false negatives. It is calculated using the equation:
Recall = (# True Positives) / (# True Positives + # False Negatives)
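The four measures can be computed together from the confusion matrix counts. This is a minimal sketch; the function name and the example counts (5 true positives, 17 true negatives, 2 false positives, 3 false negatives) are made up for illustration.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F-measure from the four counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_denominator = 2 * tp + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        # Equivalent to the harmonic mean of precision and recall.
        "f_measure": 2 * tp / f_denominator if f_denominator else 0.0,
    }

# Hypothetical counts for one class treated as "positive".
metrics = evaluation_metrics(tp=5, tn=17, fp=2, fn=3)
```

Note that true negatives appear only in accuracy; precision, recall, and F-measure ignore them, which is what makes them useful when the negative class dominates.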
![Page 25: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/25.jpg)
Methods for Model Comparison: ROC (Receiver Operating Characteristic)
• Developed in the 1950s for signal detection theory to analyze noisy signals
  – Characterizes the trade-off between positive hits and false alarms
• The ROC curve plots the TP rate (on the y-axis) against the FP rate (on the x-axis)
![Page 26: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/26.jpg)
Using ROC for Model Comparison
• M1 is better for small FPR
• M2 is better for large FPR
• Area under the ROC curve: a rough guide for classifying the accuracy of a diagnostic test is the traditional academic point system:
  – .90-1 = excellent (A)
  – .80-.90 = good (B)
  – .70-.80 = fair (C)
  – .60-.70 = poor (D)
  – .50-.60 = fail (F)
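An ROC curve and its area can be computed from classifier scores by lowering the decision threshold one instance at a time. The sketch below assumes binary labels (1 = positive), distinct scores, and at least one example of each class; the scores are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, starting at (0, 0) and ending at (1, 1)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _score, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical scores (higher = more likely positive) and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
auc_value = area_under_curve(points)
```

For these six instances the area is 8/9, roughly 0.89, which falls in the "good (B)" band of the rough guide above.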
![Page 27: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/27.jpg)
Naïve Bayes
• It is a simple probabilistic classifier based on applying Bayes' theorem with independence assumptions. A naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature.
• For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.
• An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire covariance matrix.
• It is best suited for attributes that are independent. It is very simple and very fast.
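The independence assumption can be seen directly in code: each attribute contributes its own factor to the class score. The sketch below handles nominal attributes with add-one (Laplace) smoothing, which is a choice made here for illustration and not something the slides specify; the fruit data is hypothetical.

```python
from collections import Counter, defaultdict
import math

def train_naive_bayes(rows, labels):
    """Count classes and, per class and attribute position, the values seen."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    domains = defaultdict(set)           # attribute index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
            domains[i].add(value)
    return class_counts, value_counts, domains

def predict_naive_bayes(model, row):
    """Return the class with the highest log prior plus summed log likelihoods."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for i, value in enumerate(row):
            seen = value_counts[(label, i)][value]
            # Add-one smoothing (an assumption) avoids zero probabilities.
            score += math.log((seen + 1) / (count + len(domains[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: (color, shape) -> fruit.
rows = [("red", "round"), ("red", "round"), ("green", "round"),
        ("yellow", "long"), ("yellow", "long"), ("green", "long")]
labels = ["apple", "apple", "apple", "banana", "banana", "banana"]
model = train_naive_bayes(rows, labels)
prediction = predict_naive_bayes(model, ("red", "round"))
```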
![Page 28: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/28.jpg)
Challenges faced
• Initially all data files were processed using JDBC and MySQL; this proved cumbersome whenever another dataset was used. Hence PHP-based MySQL is used, which is generalized for all datasets.
• Tables initially had to be created before loading the data; later this was done with file-operation functions.
• Initially all the MySQL commands were run sequentially; this was later enhanced using PHP as the front end.
• Initially the J48 tree could not process the data because the values were numeric. This was later resolved by Discretization / NumericToNominal conversion of the CDGLOBAL column.
![Page 29: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/29.jpg)
Preprocess Output
![Page 30: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/30.jpg)
Result file for current status (J48)
![Page 31: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/31.jpg)
Current status (Naïve Bayes)
![Page 32: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/32.jpg)
Future status (J48)
![Page 33: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/33.jpg)
Future status (Naïve Bayes)
![Page 34: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/34.jpg)
MMSE (J48)
![Page 35: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/35.jpg)
References:
http://kent.dl.sourceforge.net/project/weka/documentation/3.6.x/WekaManual-3-6-2.pdf
http://www.dfki.de/~kipp/seminar_ws0607/reports/RossenDimov.pdf
http://stackoverflow.com/questions/2903933/how-to-interpret-weka-classification
http://www.slideshare.net/dataminingtools/weka-credibility-evaluating-whats-been-learned
![Page 36: Clinical Data Classification of alzheimer's disease](https://reader033.vdocuments.site/reader033/viewer/2022051412/549fddfbac79594b4c8b49b2/html5/thumbnails/36.jpg)
Thank you