In the Name of Allah, the Most Gracious, the Most Merciful
University of Gezira
Faculty of Mathematical and Computer Sciences
A Dissertation Submitted to the University of Gezira in Partial Fulfillment of the Requirements for the Award of the Degree of Master of Science in
Computer Sciences, entitled:
A Comparative Study of Factor Analysis and Principal Component Analysis
on Classification Performance Using Neural Networks
By: Abuzer Hussein Ibrahim Ahmed
Supervisor: Dr. Murtada Khalfallah Elbashir
Introduction
Research Problem
Research Objectives
Previous Studies
Methodology
Results
Conclusions
Recommendations
Presentation Contents
Data mining is the process of analyzing data from different perspectives and summarizing it into useful information that can be used to increase revenue. The objective of data mining is to identify valid, novel, and understandable correlations and patterns in existing data. Some of the major techniques of data mining are classification, association, and clustering. Data mining is an active research area, and classification is one of the main problems in the field. Before a dataset can be used for classification it needs preprocessing steps such as data cleaning, data transformation, and data reduction. The last step is especially important: datasets are usually represented in a high-dimensional space, and these spaces are often too large, so the size of the dataset must be reduced before a learning algorithm is applied. A common way to address this problem is to use dimensionality reduction techniques.
Introduction
It is difficult to extract knowledge from a large amount of data.
It is difficult to maintain the intrinsic information of high-dimensional data when it is transformed to a low-dimensional space for analysis.
It is difficult to visualize data in high dimensions.
Research problem
1. To explain the importance of using the Factor Analysis (FA) and Principal Component Analysis (PCA) algorithms with a neural network (NN).
2. To reduce dimensionality using the FA and PCA algorithms, obtaining new reduced features without affecting the original dimensions.
3. To compare FA and PCA in terms of performance measures.
Research Objectives
Dataset features
→ Dimensionality reduction algorithm: Principal Component Analysis (PCA) or Factor Analysis (FA), via the MATLAB drtoolbox
→ New features
→ Neural network
→ Calculate the performance measures: accuracy and Receiver Operating Characteristics (ROC)
Methodology
The drtoolbox is a MATLAB toolbox for dimensionality reduction. It can be obtained from http://lvdmaaten.github.io/drtoolbox and is free to use. It implements 34 techniques for dimensionality reduction, including Principal Component Analysis (PCA) and Factor Analysis (FA).
Dimensionality Reduction Toolbox (drtoolbox)
The Computation of Principal Component Analysis (PCA):
1. Calculate the covariance matrix from the input data: COV(X, Y) = Σ (Xi − X̄)(Yi − Ȳ) / N.
2. Compute the eigenvalues and eigenvectors of the covariance matrix.
3. Choose the components (eigenvectors) with the highest eigenvalues, keeping the predefined number of components.
4. Form the transition matrix from the chosen eigenvectors.
5. Finally, multiply the original features by the obtained transition matrix, which yields a lower-dimensional representation.
Principal component analysis (PCA)
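As a rough illustration (this is not the MATLAB drtoolbox implementation used in this work), the PCA steps above can be sketched in Python with NumPy:

```python
import numpy as np

def pca(X, n_components):
    """Reduce X (samples x features) to n_components, following the
    steps on the previous slide."""
    # 1. Centre the data and calculate the covariance matrix.
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    # 2. Compute eigenvalues and eigenvectors of the covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 3-4. Keep the eigenvectors with the largest eigenvalues as the
    #      transition matrix.
    order = np.argsort(eigvals)[::-1][:n_components]
    transition = eigvecs[:, order]
    # 5. Project the original features onto the transition matrix.
    return Xc @ transition

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # synthetic data for illustration
Z = pca(X, 2)
print(Z.shape)                  # (100, 2)
```

The first projected component has the largest variance, the second the next largest, matching the eigenvalue ordering in step 3.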
The Computation of Factor Analysis (FA):
1. Explore and choose relevant variables for the construction of the covariance matrix.
2. Extract initial factors (via principal components) from the covariance matrix, which is specified in terms of its eigenvalue-eigenvector pairs (λi, ei).
3. Choose the number of factors.
4. Choose an estimation method and estimate the model.
5. Estimate the factor loadings.
6. Rotate and interpret the factors.
Factor Analysis (FA)
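A minimal sketch of step 2 above (initial factor extraction via principal components of the correlation matrix) in NumPy; rotation and the other estimation methods are omitted, and this is not the drtoolbox routine used in the experiments:

```python
import numpy as np

def initial_factor_loadings(X, n_factors):
    """Initial factor loadings via the principal-component method:
    loading of variable j on factor i is sqrt(lambda_i) * e_ij,
    using the eigenvalue-eigenvector pairs (lambda_i, e_i)."""
    # Correlation matrix of the (implicitly standardised) variables.
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)
    # Keep the n_factors pairs with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:n_factors]
    loadings = eigvecs[:, order] * np.sqrt(eigvals[order])
    return loadings

# Synthetic data with one common factor, for illustration only.
rng = np.random.default_rng(1)
f = rng.normal(size=(200, 1))
X = f @ rng.normal(size=(1, 4)) + 0.3 * rng.normal(size=(200, 4))
L = initial_factor_loadings(X, 1)
print(L.shape)   # (4, 1): one loading per observed variable
```

Because the loadings come from a correlation matrix, each loading is a correlation between a variable and the factor and so lies in [−1, 1].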
A neural network is a statistical learning method for analyzing a dataset in which one or more independent variables determine an outcome; trained on such data, it can be used as a classifier.
Neural network (NN)
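As a toy illustration of such a classifier (a single-hidden-layer network trained by gradient descent on synthetic, linearly separable data; the thesis itself used MATLAB's neural network tooling):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data: label = 1 when x0 + x1 > 0.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

# One hidden layer of 8 units; sigmoid activations throughout.
W1 = rng.normal(scale=0.5, size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(2000):                      # plain batch gradient descent
    H = sigmoid(X @ W1 + b1)               # hidden activations
    P = sigmoid(H @ W2 + b2)               # predicted probability
    dZ2 = (P - y) / len(X)                 # cross-entropy output gradient
    dZ1 = (dZ2 @ W2.T) * H * (1 - H)       # backpropagated hidden gradient
    W2 -= H.T @ dZ2;  b2 -= dZ2.sum(axis=0)
    W1 -= X.T @ dZ1;  b1 -= dZ1.sum(axis=0)

acc = ((P > 0.5) == (y > 0.5)).mean()
print(f"training accuracy: {acc:.3f}")
```

On this separable toy problem the network fits the training data almost perfectly; real datasets, like the five used in this study, require held-out evaluation.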
Accuracy: the proportion of true results (both true positives and true negatives) in the population.
Where:
TP: true positives (predicted positive, actual positive)
TN: true negatives (predicted negative, actual negative)
FP: false positives (predicted positive, actual negative)
FN: false negatives (predicted negative, actual positive)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Performance Measures
Sensitivity (Recall): the proportion of actual positives which are predicted positive.
Sensitivity = TP / (TP + FN)
Specificity: the proportion of actual negatives which are predicted negative.
Specificity = TN / (TN + FP)
Precision: the proportion of predicted positives which are actual positive.
Precision = TP / (TP + FP)
F-Score: the harmonic mean of precision and recall; it tries to give a good combination of the other two metrics.
F-Score = 2 (Precision · Recall) / (Precision + Recall)
In a Receiver Operating Characteristics (ROC) curve, the true positive rate (sensitivity) is plotted as a function of the false positive rate, which can be calculated as (1 − specificity), for different cutoff points.
Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold.
Receiver Operating Characteristics (ROC) analyses
The area under the curve lies between 0 and 1 and is increasingly recognized as a better measure for evaluating algorithm performance than accuracy. A bigger AUC value implies a better ranking performance for a classifier.
Area under the curve (AUC)
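A sketch of how the AUC can be computed by sweeping the cutoff from high to low scores and applying the trapezoidal rule (assumes binary labels with 1 = positive and no tied scores; the thesis used MATLAB for its ROC analysis):

```python
import numpy as np

def roc_auc(scores, labels):
    """Area under the ROC curve via the trapezoidal rule."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores)            # sweep cutoffs high -> low
    labels = labels[order]
    # True/false positive rates at each successive cutoff.
    tpr = np.concatenate(([0.0], np.cumsum(labels) / labels.sum()))
    fpr = np.concatenate(([0.0], np.cumsum(1 - labels) / (1 - labels).sum()))
    # Trapezoidal integration of tpr over fpr.
    return np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0)

auc = roc_auc([0.9, 0.8, 0.7, 0.6, 0.4], [1, 1, 0, 1, 0])
print(auc)   # 5/6: 5 of the 6 positive/negative pairs are ranked correctly
```

The result agrees with the ranking interpretation of AUC: it is the probability that a randomly chosen positive is scored above a randomly chosen negative.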
The results obtained with the dimensionality reduction toolbox (drtoolbox) in MATLAB, using several performance measures on different datasets with the FA and PCA algorithms, reveal a number of points:
1. In all performance measures (accuracy, specificity, sensitivity, precision, F-score, ROC curves, and area under the curve), the FA algorithm performs better than the PCA algorithm.
2. The FA algorithm gives better results on all datasets than the PCA algorithm, despite differences in the number of instances, the number of attributes, and the attribute types.
3. Knowledge extraction in FA using NN is better than in PCA.
Results
Datasets                         | Accuracy | Specificity | Sensitivity | Precision | F-score
Climate Model Simulation Crashes | 0.9452   | 0.9552      | 0.8333      | 0.6250    | 0.7142
Heart disease                    | 0.9198   | 0.9250      | 0.5000      | 0.7692    | 0.6095
Musk (Version 1)                 | 0.8889   | 0.8965      | 0.8824      | 0.9091    | 0.8956
Pima Indians Diabetes            | 0.7619   | 0.6667      | 0.8000      | 0.8571    | 0.8276
Wine Quality                     | 0.8400   | 0.9091      | 0.7857      | 0.8967    | 0.8375
Results of the performance measures for the neural network with the Principal Component Analysis (PCA) algorithm
Datasets                         | Accuracy | Specificity | Sensitivity | Precision | F-score
Climate Model Simulation Crashes | 0.9589   | 0.9705      | 0.8000      | 0.6667    | 0.7273
Heart disease                    | 0.9441   | 0.9602      | 0.7273      | 0.5714    | 0.6395
Musk (Version 1)                 | 0.9367   | 0.8845      | 0.9723      | 0.9231    | 0.9471
Pima Indians Diabetes            | 0.8571   | 0.7857      | 0.8929      | 0.8928    | 0.8929
Wine Quality                     | 0.8800   | 0.9230      | 0.7857      | 0.9091    | 0.8429
Results of the performance measures for the neural network with the Factor Analysis (FA) algorithm
4. The ROC curves of FA and PCA on the different datasets show, for each dataset, that FA is better than PCA.
5. The value of the area under the curve for FA is bigger than the value for PCA, which indicates that FA is better than PCA.
6. Visualizing the data with FA is better than with PCA.
Results (cont.)
ROC curves:
1. ROC curve for the Climate Model Simulation Crashes dataset
2. ROC curve for the Heart Disease dataset
3. ROC curve for the Musk (Version 1) dataset
4. ROC curve for the Pima Indians Diabetes dataset
5. ROC curve for the Wine Quality dataset
Datasets                         | FA    | PCA
Climate Model Simulation Crashes | 0.866 | 0.808
Heart disease                    | 0.902 | 0.848
Musk (Version 1)                 | 0.795 | 0.689
Pima Indians Diabetes            | 0.955 | 0.881
Wine Quality                     | 0.819 | 0.809
7. The Area under the Curve (AUC)
Results (cont.)
8. The neural network gives good efficiency when the feature selection methods are used.
9. The neural network (NN) maintains the intrinsic information of high-dimensional data.
Results (cont.)
Results of the performance measures for the neural network with all variables
Datasets                         | Accuracy | Specificity | Sensitivity | Precision | F-score
Climate Model Simulation Crashes | 0.9315   | 0.9552      | 0.6667      | 0.5714    | 0.6153
Heart disease                    | 0.9142   | 0.9136      | 0.7042      | 0.6241    | 0.6617
Musk (Version 1)                 | 0.8412   | 0.8571      | 0.8286      | 0.8788    | 0.8523
Pima Indians Diabetes            | 0.7142   | 0.7307      | 0.7068      | 0.8541    | 0.7735
Wine Quality                     | 0.8000   | 0.9091      | 0.7142      | 0.9090    | 0.7991
The comparison shows that FA gives relatively good results in feature reduction and computational complexity.
The FA algorithm gives better results on all datasets, despite differences in the number of instances.
It can be stated that the neural network has proven to be a powerful classifier for high-dimensional datasets, and it also gives good efficiency when the feature selection methods are used.
We can say that the FA algorithm seems to be the best method for dealing with these datasets.
Conclusions
To obtain the best results, this research recommends:
Using more than one classification algorithm, such as logistic regression (LR), decision trees, and support vector machines (SVM), and taking many more datasets.
Increasing the number of datasets beyond five.
Using more than one mathematical model to obtain the best results and the best performance.
Recommendations