

COMPARISON OF MACHINE LEARNING ALGORITHMS FOR IDENTIFYING CANCER TYPES

Garima Saxena (1), Joseph Helsing (2), Omar Costilla Reyes (3), Rajeev K. Azad (1,4)

1. Department of Biological Sciences, University of North Texas, Denton, Texas
2. Department of Computer Science and Computer Engineering, University of North Texas

3. Department of Electrical Engineering, University of North Texas
4. Department of Mathematics, University of North Texas

Contact

Name of the presenter: Garima Saxena
Organization: University of North Texas
Email: [email protected]

References

1. Anaissi A, Kennedy PJ, Goyal M, Catchpoole DR. A balanced iterative random forest for gene selection from microarray data. BMC Bioinformatics. 2013 Aug 27;14:261. doi: 10.1186/1471-2105-14-261. PMID: 23981907; PMCID: PMC3766035.
2. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, Volume 11, Issue 1.

INTRODUCTION

Microarray technologies help visualize the expression of thousands of genes at a glance. This is exceedingly helpful in studying a disease like cancer, where the interplay of various genes results in varied types of tumors. The technology generates large amounts of data that are difficult to analyze effectively by hand. Machine learning algorithms, such as random forests, have been shown to effectively predict useful genes and types of cancer from microarray datasets (1).

In this study, we propose using additional machine learning algorithms, namely artificial neural networks and support vector machines, alongside random forests, to analyze gene expression datasets and quickly and accurately identify types of cancer. We compare these additional methods against random forests as used in the previous study (1). We also test how changing the parameters of these algorithms affects their performance.

Methods & Materials

• Each of the instances for each dataset were classified by the machine learning algorithms according to their gene expression levels.

• This was done using various parameter settings for the algorithms.

• The experiments were run multiple times using different random seeds and then averaged to account for the random nature of the algorithms.

• All of these experiments were performed using the WEKA software package(2).

Experiments
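The poster does not specify the evaluation protocol beyond the use of WEKA (2), multiple random seeds, and averaging. The sketch below uses the standard WEKA Java API with 10-fold cross-validation, a hypothetical ARFF file name, an assumed set of seeds, and random forests as the example classifier; none of these specifics are stated in the study and they are illustrative only.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SeedAveragedRun {
    public static void main(String[] args) throws Exception {
        // Hypothetical file name; each cancer dataset would be its own ARFF file.
        Instances data = new DataSource("leukemia.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // class label assumed to be the last attribute

        int[] seeds = {1, 2, 3, 4, 5}; // assumed seeds; the poster only says "different random seeds"
        double sum = 0.0;
        for (int seed : seeds) {
            RandomForest rf = new RandomForest(); // example classifier; the ANN and SVM runs follow the same pattern
            rf.setSeed(seed);                     // vary the classifier's internal randomness
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(rf, data, 10, new Random(seed)); // assumed 10-fold cross-validation
            sum += eval.pctCorrect();
        }
        System.out.printf("Average accuracy over %d seeds: %.2f%%%n", seeds.length, sum / seeds.length);
    }
}
```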

Discussions & Conclusions

• Random forests was the strongest overall classifier, having the highest average accuracy for 60% of the datasets.
• Artificial neural networks was the next strongest classifier, having the highest average accuracy for 40% of the datasets.
• Support vector machines, while not having the highest average accuracy, outperformed artificial neural networks on some instances.
• Certain algorithmic parameters, such as the number of trees, the number of hidden layers and the learning rate, and the kernel, can be fine-tuned to achieve higher accuracy for some datasets.

Methods & Materials

Machine Learning Algorithms

• Artificial Neural Networks (ANN): computer models designed to imitate the human brain for decision-making tasks. In this study, the ANNs were run with learning rates of 0.1, 0.5, and 0.9 and with 0, 1, or 2 hidden layers, with 0, 20, and 15 nodes per hidden layer, respectively. Additionally, the momentum value was kept at 1 to test the learning rates in isolation.

• Support Vector Machines (SVM): models which use supervised learning and decision functions, known as kernels, to separate data into discrete sets. In this study, each dataset was analyzed using the PolyKernel, Normalized PolyKernel, Puk, and RBFKernel.


• Random Forests (RF): an ensemble learning method for classifying data. It constructs a series of randomly generated decision trees during training, and the most frequent output class across the trees is taken as the classification. In this study, 14 random genes were selected for each run, and the number of trees was varied among 10, 50, 100, 150, 200, 250, and 300 for each dataset. (A sketch of how these ANN, SVM, and RF settings map onto WEKA classifier options follows this list.)
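For reference, the parameter settings listed above map onto WEKA classifier options roughly as sketched below. This assumes the WEKA 3.7-era Java API (where RandomForest exposes setNumTrees and setNumFeatures); the particular values chosen here are just one of the combinations described above, not a prescribed configuration from the study.

```java
import weka.classifiers.functions.MultilayerPerceptron;
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.RBFKernel;
import weka.classifiers.trees.RandomForest;

public class ClassifierSettings {
    public static void main(String[] args) {
        // ANN: one learning-rate / hidden-layer combination from those described above.
        MultilayerPerceptron ann = new MultilayerPerceptron();
        ann.setLearningRate(0.5);   // 0.1, 0.5, or 0.9
        ann.setMomentum(1.0);       // momentum fixed at 1 to isolate the learning rate
        ann.setHiddenLayers("20");  // one hidden layer of 20 nodes; "15,15" would give the two-layer setting

        // SVM: WEKA's SMO with one of the four kernels compared in the study.
        // Alternatives from the same weka.classifiers.functions.supportVector package:
        // PolyKernel, NormalizedPolyKernel, Puk.
        SMO svm = new SMO();
        svm.setKernel(new RBFKernel());

        // RF: 14 random genes (features) per run and a tree count varied among 10-300.
        RandomForest rf = new RandomForest();
        rf.setNumFeatures(14);
        rf.setNumTrees(100);
    }
}
```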

Datasets

The datasets we used were courtesy of (1). They represent multiple types of cancer, each with various classes:

• Adenocarcinoma

• Brain

• Breast2

• Breast3

• Colon

• Leukemia

• Lymphoma

• NCI

• Prostate

• SRBCT

Figure 1: Basic Working of Artificial Neural Networks

Figure 2: Basic Working of Support Vector Machines

Figure 3: Basic Working of Random Forests

Results

Graph 1: Comparison of the average accuracy of random forests, support vector machines, and artificial neural networks across all the cancer datasets.

Figure 4: Cancer Growth and its Proliferation

Graph 3: Comparison of average accuracy using different numbers of hidden layers and different learning rate values in artificial neural networks across all the cancer datasets.

Graph 2: Comparison of average accuracy using different numbers of trees in random forests across all the cancer datasets.

Graph 4: Comparison of average accuracy using different kernels in support vector machines across all the cancer datasets.