

Studies in Computational Intelligence

Volume 816

Series editor

Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
e-mail: [email protected]


The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence, quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output.

The books of this series are submitted for indexing to Web of Science, EI-Compendex, DBLP, SCOPUS, Google Scholar and SpringerLink.

More information about this series at http://www.springer.com/series/7092


Laith Mohammad Qasim Abualigah

Feature Selection and Enhanced Krill Herd Algorithm for Text Document Clustering



Laith Mohammad Qasim Abualigah
Universiti Sains Malaysia
Penang, Malaysia

ISSN 1860-949X  ISSN 1860-9503 (electronic)
Studies in Computational Intelligence
ISBN 978-3-030-10673-7  ISBN 978-3-030-10674-4 (eBook)
https://doi.org/10.1007/978-3-030-10674-4

Library of Congress Control Number: 2018965455

© Springer Nature Switzerland AG 2019
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.


List of Publications

Journal

1. Abualigah, L. M., Khader, A. T., Al-Betar, M. A., Alomari, O. A.: Text feature selection with a robust weight scheme and dynamic dimension reduction to text document clustering (2017). Expert Systems with Applications. Elsevier. (IF: 3.928).

2. Abualigah, L. M., Khader, A. T., Hanandeh, E. S., Gandomi, A. H.: A novel hybridization strategy for krill herd algorithm applied to clustering techniques (2017). Applied Soft Computing. Elsevier. (IF: 3.541).

3. Abualigah, L. M., Khader, A. T., Hanandeh, E. S.: A new feature selection method to improve the document clustering using particle swarm optimization algorithm (2017). Journal of Computational Science. Elsevier. (IF: 1.748).

4. Abualigah, L. M., Khader, A. T.: Unsupervised text feature selection technique based on hybrid particle swarm optimization algorithm with genetic operators for the text clustering (2017). Journal of Supercomputing. Springer. (IF: 1.326).

5. Bolaji, A. L. A., Al-Betar, M. A., Awadallah, M. A., Khader, A. T., Abualigah, L. M.: A comprehensive review: Krill Herd algorithm (KH) and its applications (2016). Applied Soft Computing. Elsevier. (IF: 3.541).

6. Abualigah, L. M., Khader, A. T., Al-Betar, M. A., Hanandeh, E. S., Alyasseri, Z. A.: A hybrid strategy for krill herd algorithm with harmony search algorithm to improve the data clustering (2017). Intelligent Decision Technologies. IOS Press. (Accepted).

7. Abualigah, L. M., Khader, A. T., Hanandeh, E. S., Gandomi, A. H.: A Hybrid Krill Herd Algorithm and K-mean Algorithm for Text Document Clustering Analysis. Engineering Applications of Artificial Intelligence. Elsevier. (Under 3rd revision). (IF: 2.894).

8. Abualigah, L. M., Khader, A. T., Hanandeh, E. S.: Multi-objective modified krill herd algorithm for intelligent text document clustering. Information Systems and Applications. Springer. (Under review). (IF: 1.530).

9. Abualigah, L. M., Khader, A. T., Hanandeh, E. S., Rehman, S. U., Shandilya, S. K.: β-Hill Climbing Technique for Improving the Text Document Clustering Problem. Current Medical Imaging Reviews. (Under review). (IF: 0.308).


Chapter

1. Abualigah, L. M., Khader, A. T., Hanandeh, E. S.: A novel weighting scheme applied to improve the text document clustering techniques. Book Series: Studies in Computational Intelligence, published by Springer. Book Title: Innovative Computing, Optimization and Its Applications. Springer.

2. Abualigah, L. M., Khader, A. T., Hanandeh, E. S.: Modified Krill Herd Algorithm for Global Numerical Optimization Problems. Book Title: Advances in Nature-inspired Computing and Applications. Springer. (Accepted).

Conference

1. Abualigah, L. M., Khader, A. T., Al-Betar, M. A., Awadallah, M. A.: A krill herd algorithm for efficient text documents clustering. In Computer Applications and Industrial Electronics (ISCAIE), 2016 IEEE Symposium on (pp. 67–72). IEEE.

2. Abualigah, L. M., Khader, A. T., Al-Betar, M. A.: Unsupervised feature selection technique based on genetic algorithm for improving the Text Clustering. In Computer Science and Information Technology (CSIT), 2016 7th International Conference on (pp. 1–6). IEEE.

3. Abualigah, L. M., Khader, A. T., Al-Betar, M. A.: Unsupervised feature selection technique based on harmony search algorithm for improving the Text Clustering. In Computer Science and Information Technology (CSIT), 2016 7th International Conference on (pp. 1–6). IEEE.

4. Abualigah, L. M., Khader, A. T., Al-Betar, M. A.: Multi-objectives-based text clustering technique using K-mean algorithm. In Computer Science and Information Technology (CSIT), 2016 7th International Conference on (pp. 1–6). IEEE.

5. Abualigah, L. M., Khader, A. T., Al-Betar, M. A., Hanandeh, E. S.: Unsupervised Text Feature Selection Technique Based on Particle Swarm Optimization Algorithm for Improving the Text Clustering. First EAI International Conference on Computer Science and Engineering (2017). EAI.

6. Abualigah, L. M., Khader, A. T., Al-Betar, M. A., Hanandeh, E. S.: A new hybridization strategy for krill herd algorithm and harmony search algorithm applied to improve the data clustering. First EAI International Conference on Computer Science and Engineering (2017). EAI.


7. Abualigah, L. M., Khader, A. T., Al-Betar, M. A., Alyasseri, Z. A., Alomari, O. A., Hanandeh, E. S.: Feature Selection with β-Hill Climbing Search for Text Clustering Application. Second Palestinian International Conference on Information and Communication Technology (2017). IEEE.

8. Abualigah, L. M., Sawaiez, A. M., Khader, A. T., Rashaideh, H., Al-Betar, M. A.: β-Hill Climbing Technique for the Text Document Clustering. New Trends in Information Technology (NTIT) (2017). IEEE.


Acknowledgements

Undertaking this Ph.D. thesis has been a life-changing experience for me, and I would not have achieved it without the guidance and support of many people. I must acknowledge the many contributions of individuals who extended a helping hand throughout this study.

I am thankful to Allah SWT for giving me the strength to finish this study. I am also grateful to my supervisor, Prof. Dr. Ahamad Tajudin Khader from the School of Computer Sciences at Universiti Sains Malaysia, for his wise counsel, helpful advice, continued support, and supervision throughout the duration of this study. I am also thankful to my co-supervisor, Dr. Mohammed Azmi Al-Betar from the Department of Information Technology, Al-Huson University College, Al-Balqa Applied University, for his assistance, and to Dr. Essam Said Hanandeh from the Department of Computer Information System, Zarqa University, for his assistance.

My family deserves special thanks. Words cannot express how grateful I am to my father, mother, and brothers for all of the sacrifices that they have made for me. Finally, I thank all of my friends who encouraged me throughout this study.


Contents

1 Introduction
  1.1 Background
  1.2 Motivation and Problem Statement
  1.3 Research Objectives
  1.4 Contributions
  1.5 Research Scope
  1.6 Research Methodology
  1.7 Thesis Structure
  References

2 Krill Herd Algorithm
  2.1 Introduction
  2.2 Krill Herd Algorithm
  2.3 Why the KHA Has Been Chosen for Solving the TDCP
  2.4 Krill Herd Algorithm: Procedures
    2.4.1 Mathematical Concept of Krill Herd Algorithm
    2.4.2 The Genetic Operators
  2.5 Conclusion
  References

3 Literature Review
  3.1 Introduction
  3.2 Background
  3.3 Text Document Clustering Applications
  3.4 Variants of the Weighting Schemes
  3.5 Similarity Measures
    3.5.1 Cosine Similarity Measure
    3.5.2 Euclidean Distance Measure
  3.6 Text Feature Selection Method
  3.7 Metaheuristic Algorithms for Text Feature Selection
    3.7.1 Genetic Algorithm for the Feature Selection
    3.7.2 Harmony Search for the Feature Selection
    3.7.3 Particle Swarm Optimization for the Feature Selection
  3.8 Dimension Reduction Method
  3.9 Partitional Text Document Clustering
    3.9.1 K-mean Text Clustering Algorithm
    3.9.2 K-medoid Text Clustering Algorithm
  3.10 Meta-heuristic Algorithms for Text Document Clustering Technique
    3.10.1 Genetic Algorithm
    3.10.2 Harmony Search Algorithm
    3.10.3 Particle Swarm Optimization Algorithm
    3.10.4 Cuckoo Search Algorithm
    3.10.5 Ant Colony Optimization Algorithm
    3.10.6 Artificial Bee Colony Optimization Algorithm
    3.10.7 Firefly Algorithm
  3.11 Hybrid Techniques for Text Document Clustering
  3.12 The Krill Herd Algorithm
    3.12.1 Modifications of Krill Herd Algorithm
    3.12.2 Hybridizations of Krill Herd Algorithm
    3.12.3 Multi-objective Krill Herd Algorithm
  3.13 Critical Analysis
  3.14 Summary
  References

4 Proposed Methodology
  4.1 Introduction
  4.2 Research Methodology Outline
  4.3 Text Pre-processing
    4.3.1 Tokenization
    4.3.2 Stop Words Removal
    4.3.3 Stemming
    4.3.4 Text Document Representation
  4.4 Term Weighting Scheme
    4.4.1 The Proposed Weighting Scheme
    4.4.2 Illustrative Example
  4.5 Text Feature Selection Problem
    4.5.1 Text Feature Selection Descriptions and Formulations
    4.5.2 Representation of Feature Selection Solution
    4.5.3 Fitness Function
    4.5.4 Metaheuristic Algorithms for Text Feature Selection Problem
  4.6 Proposed Detailed Dimension Reduction Technique
  4.7 Text Document Clustering
    4.7.1 Text Document Clustering Problem Descriptions and Formulations
    4.7.2 Solution Representation of Text Document Clustering Problem
    4.7.3 Fitness Function
  4.8 Developing Krill Herd-Based Algorithms
    4.8.1 Basic Krill Herd Algorithm for Text Document Clustering Problem
    4.8.2 Summary of Basic Krill Herd Algorithm
    4.8.3 Modified Krill Herd Algorithm for Text Document Clustering Problem
    4.8.4 Hybrid Krill Herd Algorithm for Text Document Clustering Problem
    4.8.5 Multi-objective Hybrid Krill Herd Algorithm for Text Document Clustering Problem
  4.9 Experiments and Results
    4.9.1 Comparative Evaluation and Analysis
    4.9.2 Measures for Evaluating the Quality of Final Solution (Clusters)
  4.10 Conclusion
  References

5 Experimental Results
  5.1 Introduction
    5.1.1 Benchmark Text Datasets
  5.2 Feature Selection Methods for Text Document Clustering Problem
    5.2.1 Experimental Design
    5.2.2 Results and Discussions
    5.2.3 Summary of Feature Selection Methods
  5.3 Basic Krill Herd Algorithm for Text Document Clustering Problem
    5.3.1 Experimental Design
    5.3.2 Experimental Results
    5.3.3 Parameter Setting
    5.3.4 Comparative Analysis
    5.3.5 Basic Krill Herd Algorithm Summary
  5.4 Modified KH Algorithm for Text Document Clustering Problem
    5.4.1 Experimental Design
    5.4.2 Results and Discussions
    5.4.3 Modified Krill Herd Algorithm Summary
  5.5 Hybrid KH Algorithm for Text Document Clustering Problem
    5.5.1 Experimental Setup
    5.5.2 Results and Discussions
    5.5.3 Hybrid Krill Herd Algorithm Summary
  5.6 Multi-objective Hybrid KH Algorithm for Text Document Clustering Problem
    5.6.1 Experimental Setup
    5.6.2 Results and Discussions
    5.6.3 Multi-objective Hybrid Krill Herd Algorithm Summary
  5.7 Comparing Results Among Proposed Methods
  5.8 Comparison with Previous Methods
  5.9 Conclusion
  References

6 Conclusion and Future Work
  6.1 Introduction
  6.2 Research Summary
  6.3 Contributions Against Objectives
  6.4 Future Research
  Reference


Abbreviations

ABC    Ant Colony Optimization
ASDC   Average Similarity of Documents Centroid
BCO    Bee Colony Optimization
BKHA   Basic Krill Herd Algorithm
BPSO   Binary Particle Swarm Optimization
CS     Cuckoo Search
DDF    Detailed Document Frequency
DDR    Detailed Dimension Reduction
DF     Document Frequency
DFTF   Document Frequency with Term Frequency
DR     Dimension Reduction
DTF    Detailed Term Frequency
FE     Feature Extraction
FF     Fitness Function
FS     Feature Selection
GA     Genetic Algorithm
HKHA   Hybrid Krill Herd Algorithm
HS     Harmony Search
IDF    Inverse Document Frequency
KH     Krill Herd
KHA    Krill Herd Algorithm
KHM    Krill Herd Memory
KI     Krill Individual
LFW    Length Feature Weight
MKHA   Modified Krill Herd Algorithm
NLP    Natural Language Processing
NP     Nondeterministic Polynomial time
PSO    Particle Swarm Optimization
TC     Text Clustering
TD     Text Document
TDCP   Text Document Clustering Problem
TF     Term Frequency
TFSP   Text Feature Selection Problem
VSM    Vector Space Model
WTDC   Web Text Documents Clustering


List of Figures

Fig. 1.1  Research methodology
Fig. 2.1  A flowchart of the basic krill herd algorithm (Bolaji et al. 2016)
Fig. 2.2  A schematic representing the sensing domain around a KI (Bolaji et al. 2016)
Fig. 3.1  An example of the clustering search engine results by Yippy
Fig. 4.1  The research methodology stages
Fig. 4.2  Sigmoid function used in the binary PSO algorithm
Fig. 4.3  Representation of the text clustering solution
Fig. 4.4  The flowchart of adapting the basic KH algorithm to the text document clustering problem
Fig. 4.5  The flowchart of the modified krill herd algorithm
Fig. 4.6  The flowchart of the hybrid krill herd algorithm
Fig. 4.7  The flowchart of the multi-objective hybrid krill herd algorithm
Fig. 5.1  The general design of the experiments in the first stage
Fig. 5.2  The general design of the experiments in the second stage
Fig. 5.3  A snapshot of the dataset file
Fig. 5.4  The experimental setup of the proposed methods in the first stage
Fig. 5.5  The number of features in each dataset using three feature selection algorithms and dimension reduction techniques
Fig. 5.6  Average computation time of the k-mean iteration (in seconds)
Fig. 5.7  The average similarity document centroid (ASDC) values of the basic text clustering algorithms for 20 runs plotted against 1000 iterations on seven text datasets
Fig. 5.8  The average similarity document centroid (ASDC) values of the text clustering using modified KH algorithms for 20 runs plotted against 1000 iterations on seven text datasets
Fig. 5.9  The average similarity document centroid (ASDC) values of the text clustering using hybrid KH algorithms for 20 runs plotted against 1000 iterations on seven text datasets
Fig. 5.10 The average similarity document centroid (ASDC) values of the text clustering using the multi-objective hybrid KH algorithm for 20 runs plotted against 1000 iterations on seven text datasets


List of Tables

Table 3.1   Overview of the feature selection algorithms
Table 3.2   Overview of the text clustering algorithms
Table 4.1   Term frequencies
Table 4.2   Term weights using TF-IDF
Table 4.3   Term weights using LFW
Table 4.4   The text feature selection problem and optimization terms in the genetic algorithm context
Table 4.5   The text feature selection problem and optimization terms in the harmony search algorithm context
Table 4.6   The text feature selection problem and optimization terms in the particle swarm optimization algorithm context
Table 4.7   The text document clustering and optimization terms in the krill herd solutions context
Table 4.8   The parameter values for the different variants of the text feature selection algorithms
Table 4.9   The best configuration parameters of the TD clustering algorithms
Table 5.1   Description of the document datasets used in this research
Table 5.2   Summary of the experimental methods using the k-mean clustering algorithm
Table 5.3   Comparing the performance of the k-mean text clustering algorithms in terms of the purity and entropy measures
Table 5.4   Comparing the performance of the k-mean text clustering algorithms in terms of the precision and recall measures
Table 5.5   Comparing the performance of the k-mean text clustering algorithms in terms of the accuracy and F-measure
Table 5.6   The number of best results obtained by the proposed methods in terms of the evaluation measures over all datasets
Table 5.7   The number of best results obtained by the feature selection algorithms (GA, HS, and PSO) in terms of the evaluation measures over all datasets
Table 5.8   Summary of the experiments using DDR to adjust the threshold value
Table 5.9   Dimension reduction ratio when the threshold = 25 using DDR
Table 5.10  The average ranking of the k-mean clustering algorithms based on the F-measure (a lower rank value indicates a better method)
Table 5.11  Convergence scenarios of the basic krill herd algorithm (BKHA)
Table 5.12  The results of the BKHA convergence scenarios (Scenarios 1 through 5)
Table 5.13  The results of the BKHA convergence scenarios (Scenarios 6 through 10)
Table 5.14  The results of the BKHA convergence scenarios (Scenarios 11 through 15)
Table 5.15  The results of the BKHA convergence scenarios (Scenarios 16 through 20)
Table 5.16  The versions of the proposed basic krill herd algorithms (BKHAs) for the text document clustering problem (TDCP)
Table 5.17  Comparing the performance of the text document clustering algorithms on the original and proposed datasets using accuracy measure values
Table 5.18  Comparing the performance of the text document clustering algorithms on the original and proposed datasets using purity measure values
Table 5.19  Comparing the performance of the text document clustering algorithms on the original and proposed datasets using entropy measure values
Table 5.20  Comparing the performance of the text document clustering algorithms on the original and proposed datasets using precision measure values
Table 5.21  Comparing the performance of the text document clustering algorithms on the original and proposed datasets using recall measure values
Table 5.22  Comparing the performance of the text document clustering algorithms on the original and proposed datasets using F-measure values
Table 5.23  The best results obtained by the basic clustering algorithms in terms of the evaluation measures over all datasets
Table 5.24  Versions of the proposed modified KH algorithm
Table 5.25  Comparing the performance of the text document clustering algorithms using the average accuracy measure values
Table 5.26  Comparing the performance of the text document clustering algorithms using the average purity measure values
Table 5.27  Comparing the performance of the text document clustering algorithms using the average entropy measure values
Table 5.28  Comparing the performance of the text document clustering algorithms using the average precision measure values
Table 5.29  Comparing the performance of the text document clustering algorithms using the average recall measure values
Table 5.30  Comparing the performance of the text document clustering algorithms using the average F-measure values and their ranking
Table 5.31  The average ranking of the modified clustering algorithms based on the F-measure (a lower rank value indicates a better method)
Table 5.32  The best results obtained by the modified clustering algorithms in terms of the evaluation measures over all datasets
Table 5.33  Versions of the proposed hybrid KH algorithm
Table 5.34  Comparing the performance of the hybrid text document clustering algorithms using the average accuracy measure values
Table 5.35  Comparing the performance of the hybrid text document clustering algorithms using the average purity measure values
Table 5.36  Comparing the performance of the hybrid text document clustering algorithms using the average entropy measure values
Table 5.37  Comparing the performance of the hybrid text document clustering algorithms using the average precision measure values
Table 5.38  Comparing the performance of the hybrid text document clustering algorithms using the average recall measure values
Table 5.39  Comparing the performance of the hybrid text document clustering algorithms using the average F-measure values
Table 5.40  The average ranking of the hybrid clustering algorithms based on the F-measure (a lower rank value indicates a better method)
Table 5.41  The best results obtained by the hybrid clustering algorithms in terms of the evaluation measures over all datasets
Table 5.42  Comparing the performance of the multi-objective hybrid text document clustering algorithms using the average accuracy measure values
Table 5.43  Comparing the performance of the multi-objective hybrid text document clustering algorithms using the average purity measure values
Table 5.44  Comparing the performance of the multi-objective hybrid text document clustering algorithms using the average entropy measure values
Table 5.45  Comparing the performance of the multi-objective hybrid text document clustering algorithms using the average precision measure values
Table 5.46  Comparing the performance of the multi-objective hybrid text document clustering algorithms using the average recall measure values
Table 5.47  Comparing the performance of the multi-objective text document clustering algorithms using the average F-measure values and their ranking
Table 5.48  The average ranking of the multi-objective clustering algorithm based on the F-measure (a lower rank value indicates a better method)
Table 5.49  The best results obtained by the multi-objective clustering algorithm in terms of the evaluation measures over all datasets
Table 5.50  Comparison results between the krill herd-based methods according to the F-measure evaluation criterion
Table 5.51  The average ranking of the krill-based algorithms based on the average F-measure (a lower rank value indicates a better method)
Table 5.52  Significance tests of the basic KH algorithms on the original datasets and the proposed datasets using the t-test with α < 0.05 (bold denotes that the result is significantly different)
Table 5.53  Significance tests of the basic KH algorithms and the modified KH algorithm using the t-test with α < 0.05 (bold denotes that the result is significantly different)
Table 5.54  Significance tests of the modified KH algorithms and the hybrid KH algorithm using the t-test with α < 0.05 (bold denotes that the result is significantly different)
Table 5.55  Significance tests of the hybrid KH algorithms and the multi-objective hybrid KH algorithm using the t-test with α < 0.05 (bold denotes that the result is significantly different)
Table 5.56  Key to the comparator methods
Table 5.57  Description of the text document datasets used by the comparative methods
Table 5.58  A comparison of the results obtained by MHKHA and the best published results


Abstrak

Text document clustering is a new trend in text mining in which documents are separated into several coherent clusters, where documents in the same cluster are similar. In this study, a new method for solving the text document clustering problem is carried out in two stages: (i) a feature selection method using the particle swarm optimization algorithm with a new weighting scheme and a detailed dimension reduction technique is proposed to obtain a new subset of more informative features in a low-dimensional space. This new subset is used to improve the performance of the text clustering algorithm in the subsequent stage and to reduce its computation time. The k-mean clustering algorithm is used to evaluate the effectiveness of the obtained subsets. (ii) Four krill herd algorithms, namely (a) the basic krill herd algorithm, (b) the modified krill herd algorithm, (c) the hybrid krill herd algorithm, and (d) the multi-objective hybrid krill herd algorithm, are proposed to solve the text clustering problem; these algorithms are incremental improvements of the preceding versions. For the evaluation process, seven benchmark text datasets with different characteristics and difficulties are used. The results show that the proposed methods and algorithms achieve the best results compared with the other methods reported in the literature.


Abstract

Text document (TD) clustering is a new trend in text mining in which the TDs are separated into several coherent clusters, where documents in the same cluster are similar. In this study, a new method for solving the TD clustering problem works in the following two stages: (i) a new feature selection method using the particle swarm optimization algorithm with a novel weighting scheme and a detailed dimension reduction technique is proposed to obtain a new subset of more informative features in a low-dimensional space. This new subset is used to improve the performance of the text clustering (TC) algorithm in the subsequent stage and to reduce its computation time. The k-mean clustering algorithm is used to evaluate the effectiveness of the obtained subsets. (ii) Four krill herd algorithms (KHAs), namely, (a) the basic KHA, (b) the modified KHA, (c) the hybrid KHA, and (d) the multi-objective hybrid KHA, are proposed to solve the TC problem; these algorithms are incremental improvements of the preceding versions. For the evaluation process, seven benchmark text datasets with different characterizations and complexities are used. The results show that the proposed methods and algorithms obtained the best results in comparison with the other methods published in the literature.
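
To make the two-stage method described above concrete, the Python sketch below pairs a minimal binary particle swarm optimizer over feature masks with a k-means-based fitness score. It is an illustration of the idea only, not the book's implementation: the toy corpus, the plain TF-IDF weighting (standing in for the proposed weighting scheme), the cosine-silhouette fitness, and all parameter settings (n_particles, n_iter, w, c1, c2) are assumptions, and the krill herd clustering stage is only indicated in a closing comment.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Toy corpus (illustrative): two rough topics, text clustering vs. football.
docs = [
    "krill herd algorithm for text document clustering",
    "particle swarm optimization selects informative text features",
    "k-means groups text documents into coherent clusters",
    "feature selection reduces the dimension of the document space",
    "football match results and league standings tonight",
    "the striker scored twice in the cup final",
    "injury update before the championship football game",
    "transfer news and rumours from the football league",
]
k = 2  # assumed number of clusters for this toy corpus

# TF-IDF representation; the book's proposed weighting scheme would replace this step.
X = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()
d = X.shape[1]

def fitness(mask):
    """Cluster on the selected features and score with a cosine silhouette."""
    sel = np.asarray(mask, dtype=bool)
    if sel.sum() < 2:
        return -1.0                      # penalise degenerate subsets
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[:, sel])
    if len(set(labels)) < 2:
        return -1.0
    return silhouette_score(X[:, sel], labels, metric="cosine")

# Stage (i): a minimal binary PSO over feature masks (all settings illustrative).
n_particles, n_iter, w, c1, c2 = 10, 20, 0.7, 1.5, 1.5
pos = (rng.random((n_particles, d)) < 0.5).astype(float)   # binary positions
vel = rng.normal(0.0, 0.1, size=(n_particles, d))          # real-valued velocities
pbest = pos.copy()
pbest_fit = np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(n_iter):
    r1, r2 = rng.random((n_particles, d)), rng.random((n_particles, d))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = (rng.random((n_particles, d)) < 1.0 / (1.0 + np.exp(-vel))).astype(float)
    fit = np.array([fitness(p) for p in pos])
    better = fit > pbest_fit
    pbest[better], pbest_fit[better] = pos[better], fit[better]
    gbest = pbest[pbest_fit.argmax()].copy()

print(f"selected {int(gbest.sum())} of {d} features, "
      f"cosine silhouette = {pbest_fit.max():.3f}")
# Stage (ii) would then cluster X[:, gbest.astype(bool)] with a krill-herd-based
# algorithm instead of k-means; that part is beyond this sketch.

The sigmoid transfer used to binarize the velocities mirrors the binary PSO variant referenced in the front matter (Fig. 4.2); swapping the silhouette score for the book's fitness function and the final k-means call for a krill herd variant would complete the pipeline.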
