

COMPARISON OF MACHINE LEARNING ALGORITHMS FOR IDENTIFYING CANCER TYPES

Garima Saxena (1), Joseph Helsing (2), Omar Costilla Reyes (3), Rajeev K. Azad (1,4)

1. Department of Biological Sciences, University of North Texas, Denton, Texas
2. Department of Computer Science and Computer Engineering, University of North Texas

3. Department of Electrical Engineering, University of North Texas
4. Department of Mathematics, University of North Texas

Contact

Name of the presenter: Garima Saxena
Organization: University of North Texas
Email: [email protected]

References

1. Anaissi A, Kennedy PJ, Goyal M, Catchpoole DR. A balanced iterative random forest for gene selection from microarray data. BMC Bioinformatics. 2013 Aug 27;14:261. doi: 10.1186/1471-2105-14-261. PMID: 23981907; PMCID: PMC3766035.
2. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, Volume 11, Issue 1.

INTRODUCTION

Microarray technologies help visualize the expression of thousands of genes at a glance. This is exceedingly helpful in studying a disease like cancer, where the interplay of various genes results in varied types of tumors. The technology generates large amounts of data that are difficult to analyze effectively by hand. Machine learning algorithms, such as random forests, have been shown to effectively predict useful genes and types of cancer from microarray datasets (1).

In this study, we propose using additional machine learning algorithms, namely artificial neural networks and support vector machines, alongside random forests, to analyze gene expression datasets and quickly and accurately identify types of cancer. We compare these additional methods against random forests as used in the previous study (1). We also test how changing the parameters of these algorithms affects their performance.

Methods & Materials

• Each of the instances for each dataset were classified by the machine learning algorithms according to their gene expression levels.

• This was done using various parameter settings for the algorithms.

• The experiments were run multiple times using different random seeds and then averaged to account for the random nature of the algorithms.

• All of these experiments were performed using the WEKA software package(2).

Experiments
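The poster does not specify the evaluation protocol beyond the use of WEKA (2), multiple random seeds, and averaging. The sketch below uses the standard WEKA Java API with 10-fold cross-validation, a hypothetical ARFF file name, an assumed set of seeds, and random forests as the example classifier; none of these specifics are stated in the study and they are illustrative only.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SeedAveragedRun {
    public static void main(String[] args) throws Exception {
        // Hypothetical file name; each cancer dataset would be its own ARFF file.
        Instances data = new DataSource("leukemia.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // class label assumed to be the last attribute

        int[] seeds = {1, 2, 3, 4, 5}; // assumed seeds; the poster only says "different random seeds"
        double sum = 0.0;
        for (int seed : seeds) {
            RandomForest rf = new RandomForest(); // example classifier; the ANN and SVM runs follow the same pattern
            rf.setSeed(seed);                     // vary the classifier's internal randomness
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(rf, data, 10, new Random(seed)); // assumed 10-fold cross-validation
            sum += eval.pctCorrect();
        }
        System.out.printf("Average accuracy over %d seeds: %.2f%%%n", seeds.length, sum / seeds.length);
    }
}
```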

Discussions & Conclusions

• Random forests was the strongest overall classifier, having the highest average accuracy for 60% of the datasets.
• Artificial neural networks was the next strongest classifier, having the highest average accuracy for 40% of the datasets.
• Support vector machines, while not having the highest average accuracy, outperformed artificial neural networks on some instances.
• Certain algorithmic parameters, such as the number of trees, the number of hidden layers and the learning rate, and the kernel, can be fine-tuned to achieve higher accuracy for some datasets.

Methods & Materials

Machine Learning Algorithms

• Artificial Neural Networks (ANN): computer models designed to imitate the human brain for decision-making tasks. In this study, the ANNs were run with learning rates of 0.1, 0.5, and 0.9 and with 0, 1, or 2 hidden layers, with 0, 20, and 15 nodes per hidden layer, respectively. Additionally, the momentum value was kept at 1 to test the learning rates in isolation.

• Support Vector Machines (SVM): models which use supervised learning and decision functions, known as kernels, to separate data into discrete sets. In this study, each dataset was analyzed using the PolyKernel, Normalized PolyKernel, Puk, and RBFKernel.


• Random Forests (RF): an ensemble learning method for classifying data. It constructs a series of randomly generated decision trees during training, and the most frequent output class across the trees is taken as the classification. In this study, 14 random genes were selected for each run, and the number of trees was varied among 10, 50, 100, 150, 200, 250, and 300 for each dataset. (A sketch of how these ANN, SVM, and RF settings map onto WEKA classifier options follows this list.)
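For reference, the parameter settings listed above map onto WEKA classifier options roughly as sketched below. This assumes the WEKA 3.7-era Java API (where RandomForest exposes setNumTrees and setNumFeatures); the particular values chosen here are just one of the combinations described above, not a prescribed configuration from the study.

```java
import weka.classifiers.functions.MultilayerPerceptron;
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.RBFKernel;
import weka.classifiers.trees.RandomForest;

public class ClassifierSettings {
    public static void main(String[] args) {
        // ANN: one learning-rate / hidden-layer combination from those described above.
        MultilayerPerceptron ann = new MultilayerPerceptron();
        ann.setLearningRate(0.5);   // 0.1, 0.5, or 0.9
        ann.setMomentum(1.0);       // momentum fixed at 1 to isolate the learning rate
        ann.setHiddenLayers("20");  // one hidden layer of 20 nodes; "15,15" would give the two-layer setting

        // SVM: WEKA's SMO with one of the four kernels compared in the study.
        // Alternatives from the same weka.classifiers.functions.supportVector package:
        // PolyKernel, NormalizedPolyKernel, Puk.
        SMO svm = new SMO();
        svm.setKernel(new RBFKernel());

        // RF: 14 random genes (features) per run and a tree count varied among 10-300.
        RandomForest rf = new RandomForest();
        rf.setNumFeatures(14);
        rf.setNumTrees(100);
    }
}
```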

Datasets

The datasets we used were courtesy of (1). They represent multiple types of cancer, each with various classes:

• Adenocarcinoma

• Brain

• Breast2

• Breast3

• Colon

• Leukemia

• Lymphoma

• NCI

• Prostate

• SRBCT

Figure 1: Basic Working of Artificial Neural Networks

Figure 2: Basic Working of Support Vector Machines

Figure 3: Basic Working of Random Forests

Results

Graph 1: Comparison of the average accuracy of random forests, support vector machines, and artificial neural networks across all the cancer datasets.

Figure 4: Cancer Growth and its Proliferation

Graph 3: Comparison of average accuracy using different numbers of hidden layers and different learning rate values in artificial neural networks across all the cancer datasets.

Graph 2: Comparison of average accuracy using different numbers of trees in random forests across all the cancer datasets.

Graph 4: Comparison of average accuracy using different kernels in support vector machines across all the cancer datasets.