Workbook Pattern Recognition
An Introduction for Engineers and Scientists

C. Rasche
May 26, 2018

This workbook provides rapid, practical access to the topic of pattern recognition. The emphasis lies on applying and exploring the statistical classification methods in Matlab or Python; the mathematical formulation is minimal. Plenty of code examples are given that allow you to immediately play with these methods and that can serve as a reference guide (even the author uses them as such). We start with the very simple and easily implementable k-Nearest-Neighbor classifier, followed by the popular and robust linear classifiers. We learn how to apply the Principal Component Analysis and how to properly fold the data. We then introduce clustering methods (K-Means and hierarchical algorithms), decision trees, ensemble classifiers and string matching methods. After having introduced those basic techniques we expand by introducing the modern classifiers, such as Support Vector Machines and Deep Neural Networks, and we explain when it is meaningful to employ them. Analogously, we expand on clustering and introduce modern clustering methods such as the density-based methods. During the entire discourse we explain how to deal with very large datasets.

Prerequisites: basic programming skills
Recommended: basic linear algebra, basic signal processing

Speed Links

Task                 Example Code        Check List
Classification       G.1                 18.5
Clustering           G.2                 20.3
Data Preparation     G.3
Folding Explicitly   G.6 (kNN Example)
Dist Matrix, NNS     A
Distance Measures    B
Reading              F

Contents

1 Introduction  9
  1.1 The Recognition Tasks  9
  1.2 Data Format, Formalism & Terminology  11
    1.2.1 Types of Feature Values  12
  1.3 Model (Algorithm) Selection, Model Evaluation  13
    1.3.1 Cross Validation  14
    1.3.2 User Interface for Classification (Matlab)  16
    1.3.3 Clustering  16
  1.4 Varia: Code, Software Packages, Training Data Sets  16

2 The Recognition Challenge  18
  2.1 The Modelling Challenge  18
  2.2 The Computational Challenge  22

3 Data Preparation (Loading, Inspection, Adjustment, Scaling)  24
  3.1 Visual Inspection, Group Statistics  24
  3.2 Special Entries (Not a Number, Infinity, etc.)  26
  3.3 Scaling  26
  3.4 Permute Training Set  27
  3.5 Load Your Data  27
  3.6 Recapitulation  28

4 Nearest-Centroid Classifier  29
  4.1 Nearest-Shrunken Centroid (NSC)  29
  4.2 Summary  30

5 k-Nearest Neighbor (kNN) Classifier  31
  5.1 Usage in Matlab  33
  5.2 Implementation  34
  5.3 Recapitulation  34

6 Linear Classifier; Linear and Quadratic Discriminant  35
  6.1 Covariance Matrix Σ  37
  6.2 Linear and Quadratic Discriminant Analysis  38
  6.3 Usage in Matlab  39
  6.4 Usage in Python  39
  6.5 Implementation Matrix Decomposition (Matlab)  40
  6.6 Recapitulation  40

7 Dimensionality Reduction  42
  7.1 Feature Transformation (Unsupervised Learning)  42
    7.1.1 Principal Component Analysis (PCA)  42
  7.2 Feature Selection  45
    7.2.1 Filter Methods  45
    7.2.2 Wrapper Methods  46

8 Evaluating and Improving Classifiers  48
  8.1 Types of Error Estimation, Variance  48
    8.1.1 Variance in Classifiers  48
    8.1.2 Resampling  49
  8.2 Binary Classifiers  50
    8.2.1 Confusion Matrix  51
    8.2.2 Measures and Response Manipulation  51
    8.2.3 ROC Analysis  52
  8.3 Three or More Classes  53
  8.4 More Tricks & Hints  54
    8.4.1 Class Imbalance Problem  54
    8.4.2 Learning Curve  54
    8.4.3 Improvement with Hard Negative Mining and Artificial Samples  55

9 Clustering - K-Means  56
  9.1 Usage in Matlab  58
  9.2 Usage in Python  58
  9.3 Determining k - Cluster Validity  59
  9.4 Recapitulation  59

10 Clustering - Hierarchical  61
  10.1 Pairwise Distances  61
  10.2 Linking (Agglomerative)  61
  10.3 Thresholding the Hierarchy - Cluster Validity  63
  10.4 Recapitulation  65

11 Decision Tree  66
  11.1 Usage in Matlab  69
  11.2 Recapitulation  69

12 Ensemble Classifiers  70
  12.1 Combining Classifier Outputs  70
    12.1.1 Voting and Other Combination Rules  70
    12.1.2 Learning the Combination  72
    12.1.3 Component Classifiers without Discriminant Functions  72
    12.1.4 Error-Correcting Output Codes  73
  12.2 Rationale  73
  12.3 Bagging, Random Forests  74
    12.3.1 Random Forest  74
  12.4 Stage-Wise, Boosting  75
  12.5 Recapitulation  75

13 Recognition of Sequences  76
  13.1 String Matching Distance  76
  13.2 Edit Distance  76

14 Density Estimation  78
  14.1 Non-Parametric Methods  78
    14.1.1 Histogramming  79
    14.1.2 Kernel Estimator (Parzen Windows)  79
  14.2 Parametric Methods  80
    14.2.1 Gaussian Mixture Models (GMM)  80
  14.3 Recapitulation  81

15 Support Vector Machines  82
  15.1 Usage in Matlab  82
  15.2 Recapitulation  83

16 Deep Neural Network (DNN)  84
  16.1 Traditional Neural Networks  84
    16.1.1 Usage in Matlab  85
  16.2 Convolutional Neural Network (CNN)  85
    16.2.1 Usage in Matlab  86
  16.3 Deep Belief Network (DBN)  86
    16.3.1 Usage in Matlab  87
  16.4 Recapitulation  87

17 Naive Bayes Classifier  88
  17.1 Usage in Matlab, Implementation  88
  17.2 Usage in Python  89
  17.3 Recapitulation  89

18 Classification: Rounding the Picture & Check List  90
  18.1 Bayesian Formulation  90
    18.1.1 Rephrasing Classifier Methods  90
  18.2 Estimating Classifier Complexity - Big O Notation  91
  18.3 Parametric (Generative) vs. Non-Parametric (Discriminative)  91
  18.4 Algorithm-Independent Issues  92
  18.5 Check List  93

19 Clustering III  94
  19.1 Partitioning Methods II  94
    19.1.1 Fuzzy C-Means (K-Means)  94
    19.1.2 K-Medoids  95
  19.2 Density-Based Clustering (DBSCAN)  95
    19.2.1 Variants  97
    19.2.2 Recapitulation  97
  19.3 Very Large Data Bases (VLDB)  97
    19.3.1 Hierarchical  98
    19.3.2 BIRCH  98
  19.4 High-Dimensional Data  99
    19.4.1 Dimensionality Reduction  100
    19.4.2 Subspace Clustering  100

20 Clustering: Rounding the Picture  102
  20.1 Summary of Algorithms  102
  20.2 Clustering Tendency  103
    20.2.1 Test for Spatial Randomness - Hopkins Test  103
  20.3 Check List  104

A Distance Matrix, Nearest Neighbor Search  105
  A.1 Distance Matrix  105
  A.2 Nearest Neighbor Search (NNS)  wiki Nearest neighbor search  107

B Distance and Similarity Measures  109
  B.1 Distance Measures  109
  B.2 Similarity Measures  110

C Gaussian Function  111

D Programming Hints  113
  D.1 Parallel Computing Toolbox in Matlab  113

E Matrix and Vector Multiplications  114
  E.1 Dot Product (Vector Multiplication)  114
  E.2 Matrix Multiplication  114

F Reading  116

G Code Examples  118
  G.1 The Classifiers in One Script  118
  G.2 The Clustering Algorithms in One Script  121
  G.3 Prepare Your Data  123
    G.3.1 Whitening Transform  125
    G.3.2 Loading and Converting Data  125
    G.3.3 Loading the MNIST dataset  126
  G.4 Utility Functions  127
    G.4.1 Calculating Memory Requirements  127
  G.5 Classification Example - Nearest-Centroid  128
    G.5.1 Simple Version  128
    G.5.2 Shrunken Version  129
  G.6 Classification Example - kNN  131
    G.6.1 kNN Analysis Systematic  133
  G.7 Estimating the Covariance Matrix  133
  G.8 Classification Example - Linear Classifier  134
  G.9 Principal Component Analysis  136
  G.10 Example Feature Selection: Ranking Features  138
  G.11 Function k-Fold Cross-Validation  139
  G.12 Example ROC  140
    G.12.1 ROC Function  141
  G.13 Example Feature Selection: Sequential Forward Selection  142
  G.14 Clustering Example - K-Means  143
    G.14.1 Cluster Information and Plotting  144
  G.15 Hierarchical Clustering  146
    G.15.1 Three Functions  147
  G.16 Classification Example - Decision Tree  150
  G.17 Classification Example - Ensemble Voting  151
  G.18 Classification Example - Random Forest  153
  G.19 Example Density Estimation  154
    G.19.1 Histogramming and Parzen Window  154
    G.19.2 N-Dimensional Histogramming  154
    G.19.3 Gaussian Mixture Model  155
  G.20 Classification Example - SVM  156
  G.21 Classification Example - Naive Bayes  157
  G.22 Clustering Example - Fuzzy C-Means  158
  G.23 Clustering Example - DBSCAN  160
  G.24 Clustering Example - Clustering Tendency  162

Preface

There are many wonderful textbooks on the subject of pattern recognition, but most can be assigned to two extreme categories: those starting with the mathematics first and those explaining how to use a software package. The mathematical textbooks often provide the theoretical background first, followed by some examples, while the practical tips appear rather spontaneous, erratic and scarce: that imbalance can deter the impatient scientist or engineer from approaching the subject thoroughly. In contrast, the software-oriented textbooks do not give you sufficient explanation on how you can manipulate the data for your own purpose. The following workbook aims in between: it provides a learning-by-doing approach with which you will be able to understand the basics and which will hopefully allow you to develop your own classifiers that can outperform the standard techniques.

Similar Books  There are two books that I consider similar to my workbook. One is the introductory book by James et al. (An Introduction to Statistical Learning; 2013). I can recommend that one for the best illustrative examples for classification and regression. However, the presentation of the mathematics still feels a bit elaborate; the examples for the programming language R show how to use the functions, but do not give code details or examples of algorithms. Thus, the book still represents the dichotomy I pointed out above. Despite that criticism, it is the one I used most recently to update my workbook.

The other book I consider similar is the user guide of Python's SciKit-Learn module. It also gives a short overview of methods, but somewhat scattered between different sections and equations. It tries to satisfy both the mathematician and the programmer. It gives plenty of code examples, but they show only how to apply the functions and do not explain the key steps in the code. The examples in this workbook are a bit simpler and slightly more explicit; and for the simple algorithms I give explicit examples.

My workbook provides overviews, for functions and for concepts. I dare to summarize the challenges of pattern recognition in a single section without any formalism (Section 2); it is admittedly dense, but I think it gives you a better idea of the differences. And I put the classifier and clustering algorithms on a single page for reference (Sections G.1 and G.2).

Motivation  There is a deeper motivation for providing such a workbook. I have met researchers and tech-preneurs who were not firm with some of the basics of pattern recognition: some did not understand fundamental differences in recognition; others frantically tried to apply the newest classification methods (such as Deep Neural Networks) without verifying whether the old ones deliver satisfying results. But the most recently developed classification methods do not always provide better results; and if they do, then often at the cost of a much larger effort, be that time, computational resources or less robustness. Figure 1 shows how classification accuracy is related to the complexity of the method: the improvement saturates. Thus, for an easily classifiable problem a more complex method will likely show better performance; but for a more difficult-to-classify problem, it will merely show an improvement, not a solution. It is therefore worthwhile to understand the basics well and apply those first - they will usually get you very far in very short time; and those quick results often allow you to make early decisions in your analysis that are sometimes necessary to continue your research in a specific direction. Only if you intend to optimize a task, or if you need to outperform your competition, do you move on to more modern, advanced classifiers such as Support Vector Machines (SVM) and Deep Neural Networks (DNN). The workbook attempts to teach you those basics as straightforwardly and soundly as possible, but also explains how to use the modern classifiers.

Limited Time or Resources  Sometimes your data is so large that you simply lack the time or resources to classify everything thoroughly and you need to concentrate on one part of it. But which part? Which sub-selection of your data would still be representative of your entire set? Again, you can obtain a good idea by applying the basic techniques first, which may deliver sub-par results only, but which allow you to identify the most representative part of your data. This works well because the advanced classifiers (SVM or DNN) do not carry out magic - they merely improve the results (Figure 1): it is unlikely that the advanced classifiers would identify a different part of your data as the best sub-selection.

[Figure 1: a schematic plot of accuracy versus complexity of the method (resources), with one curve for an easy task and one for a difficult task.]

Figure 1: Performance gain in pattern recognition: the more complex the method, the larger the required resources, and the better the recognition accuracy; this holds for most but not all recognition tasks. But that gain in accuracy typically saturates with increasing effort. Put differently, complex methods often solve tasks better than simple methods, but they will not solve a difficult task easily - they will only improve upon the simple methods. Conclusion: you can obtain a good impression of the nature of your data even with simple methods.

Big Data, Deep Neural Networks - On Modern Terms  New terms such as Big Data or Deep Neural Networks sometimes suggest breakthroughs, but progress in engineering or science occurs mostly step-wise; conceptual breakthroughs are rare. Younger experts generate new terms to promote their contributions to the field; some older experts tend to belittle such new terminology; this is however the natural cycle of progress and promotion - as it occurs anywhere else, be it in business, in the art scene or in the music industry.

Algorithms that deal with very large databases (VLDB) - or call it Big Data - were invented decades ago already, but with modern computers one can tackle larger data more easily. Similarly, neural networks with three or more layers have been envisioned since the 60s - and occasionally tested - but it is the computational power of modern computers that allows them to be used systematically and on a large scale. And that is certainly exciting per se and justifies the hype.

Related Fields  There are some fields related to Pattern Recognition that are hard to separate from it because they overlap so much in content. Two closely related fields are the following; their descriptions should be regarded as tendential, not as absolute.

Machine Learning corresponds to the field of Pattern Recognition but focuses a bit more on supervised and reinforcement learning. Some experts consider Pattern Recognition a part of the field of Machine Learning. wiki Machine Learning

Data Mining focuses in particular on unsupervised learning (clustering) of large datasets; often, the traditional techniques of the field of Pattern Recognition are modified such that they are capable of dealing with those large datasets. wiki Data Mining

What to use from the Workbook

Basics: The first 9 sections comprise the basics - the traditional techniques; they cover the principal classification and clustering techniques. With those you obtain results very quickly, and those results may already allow you to make decisions to move forward more rapidly.

Basics for 'Quantized' Data: It may be the case that your data has a certain 'quantized' character, that is, the typical distance measurements such as 'a minus b' are not optimal. In that case you may try Decision Trees (Section 11), where comparisons are based on relations such as 'a greater than b' or 'a smaller than or equal to b'. If you attempt clustering on data that possess a hierarchical character, you may try the hierarchical clustering methods (Section 10).

Basics - Refinement: By playing with certain tricks you may be able to improve the performance of the basic techniques (Section 12) - in particular if your data comes from different sources. With density methods you can analyze low-dimensional data in more detail (Section 14).

Optimization: Use modern classifiers to achieve higher classification performance (Sections 15 and 16). But be prepared to spend much more time to obtain your better results - and be prepared to 'upgrade' your computational resources. And be advised that in some cases the results are worse than with the basic techniques - no matter how well you tune the modern classifiers. For that reason it is recommended to always apply the basic techniques for comparison.

1 Introduction

Pattern recognition is a general term for a number of different tasks. Some tasks occur more frequently than others; some tasks are related to each other. We first explain those tasks (Section 1.1). To approach those tasks one presses the information into a specific data format; the data values themselves can be of different nature (Section 1.2). To solve those tasks, we create models, and we would like to know how well they perform when they are put to the test (Section 1.3). Eventually, we elaborate a bit on various issues, such as code, software packages, etc. (Section 1.4).

1.1 The Recognition Tasks wiki Pattern Recognition

There exist a number of tasks and some of them are related. We introduce the three most common ones first.

JWHT p128, 4.1

Classification  In this task we categorize data based on experience. Newly seen objects, events, observations, etc. are compared to previously accumulated information and put into drawers, into categories; we predict labels in this task. In order to be able to classify, we need to have learned those categories and we need to have developed a model that carries out the proper assignment as faithfully as possible: we collect data, label it, and then create a model for the classification process.

For example, if we wish to build a model that can read handwritten postal codes automatically, then we collect samples of handwritten digits of many different persons, then label the digits, and then build a model that discriminates the digits. When we apply the model to handwritten postal codes of other persons, that is to new samples, the model will predict the labels of those samples.

The process of training a model is also called learning or fitting. And because in this task the learning process takes place with class labels, with a 'teacher', it is also called supervised learning.

Clustering  Here we try to find categories that can serve as future experience. We try to make sense of data - we try to find meaningful patterns, groups, trends, structures, partitions, etc., that could correspond to potential classes, called clusters in this task; it is exploration, a search for new labels. Because there is no teacher in this type of classification problem, it is said that the clustering algorithms perform unsupervised learning.

Clustering is more varied in its applications than classification. There are roughly three main purposes for clustering:
- Finding structure: to gain insight into data, generate hypotheses, detect anomalies, and identify salient features.
- Natural classification: to identify the degree of similarity among forms or organisms (phylogenetic relationship).
- Compression: as a method for organizing the data and summarizing it through cluster prototypes.

We give two specific examples (from ThKo p598):

Hypothesis testing in business: cluster analysis is used for the verification of the validity of a specific hypothesis. Consider, for example, the following hypothesis: 'Big companies invest abroad.' One way to verify whether this is true is to apply cluster analysis to a large and representative set of companies. Suppose that each company is represented by its size, its activities abroad, and its ability to successfully complete projects on applied research. If, after applying cluster analysis, a cluster is formed that corresponds to companies that are large and have investments abroad (regardless of their ability to successfully complete projects on applied research), then the hypothesis is supported by the cluster analysis.

Prediction based on groups for medical diagnosis: cluster analysis is applied to a dataset concerning patients infected by the same disease. This results in a number of clusters of patients, according to their reaction to specific drugs. Then for a new patient, we identify the most appropriate cluster for the patient and, based on it, we decide on his or her medication.

Regression Analysis  Here we predict quantities based on observations. For instance, in weather forecasting one estimates tomorrow's temperature by relating observed climate parameters. Thus, in regression we do not predict by seeking labels as in classification; instead we try to relate variables to predict continuous values.

Regression analysis is carried out with methods very similar to classification and is also considered supervised learning. Most classification algorithms can be used for regression as well. We do not treat regression for reasons of brevity. If one has understood how to use the classification and clustering algorithms, then one should have no difficulty applying regression algorithms.

There exist other recognition challenges; they are sometimes considered tasks in their own right.

Dimensionality Reduction  This is the task of reducing the data variety to the most distinct aspects. Most textbooks regard this as an optimization of the classification and clustering tasks. We treat that topic in Section 7.

Reinforcement Learning  Alp p447  This is the task of making the system adapt over time to new challenges when there are no clear supervised signals. The task lacks class labels that would allow supervised learning, but there is a feedback that can be exploited to sense the right learning direction. We do not introduce this topic.

Neural Networks  In earlier years, the Neural Network methodology was sometimes considered its own pattern recognition challenge, but one can also assign the methodology to the domain of supervised classification. And with the success of Deep Neural Networks, the methodology has firmly gained its place in supervised classification. We treat it briefly in Section 16.

1.2 Data Format, Formalism & Terminology

When we gather data we make repeated observations, on objects, events, patterns, etc. Those observations represent samples that are later used for building our model. Those observations are typically quantified with the same number of properties. We thus accumulate an m × n matrix D, which is typically organized as follows:

[Figure 2: a sketch of an m × n data matrix, rows numbered 1..m (samples, observations), columns numbered 1..n (features, variables, predictors, components, dimensions), next to a group variable column holding the class label (group, class, category) of each sample.]

Figure 2: On the left: an m × n data matrix (or n × d or N × d), abbreviated D or X or other letters. On the right: a group variable - if available - which holds the class or group label for each sample data point.
Rows: numbered 1 through m; they represent samples or observations: images, objects, a set of measurements - or, expressed statistically: each row represents a data point in a multi-dimensional space. The number of samples m is sometimes also denoted as n or N in other texts.
Columns: numbered 1 through n; they are called features, variables, predictors, components or dimensions - depending on the context or preferred terminology: they can be the individual pixels of an image, the measurements of an object, etc.
Individual matrix entries are denoted as dij or xij for example, with index variables i and j counting 1 to m and n, respectively.
The group variable (on the right) is often abbreviated Y and holds three classes in this case. If such group information is present, we can use supervised learning algorithms to make better predictions.

Each row of D describes a data sample (or observation), that is a vector d, of which each dimension d(j) (or component) - sometimes also denoted as dj - represents the measurement of a different feature (or variable), j = 1..n. Thus, each sample is a point in n-dimensional space; in Figure 6, it is only a two-dimensional space and the data matrix would have only two columns. In practice, the dimensionality can range from two to several thousand dimensions. In computer vision for example, the image's individual pixels are often taken as dimensions, that is for a 200x300 pixel image we have 60'000 dimensions (or features, variables, ...). In Bioinformatics the dimensionality can also easily grow to several thousands, in particular in DNA microarray analysis; in Webmining as well.

If we know the class labels of our samples, then that information is organized in a group variable Y whose number of elements is the same as the number of samples (see the column in the right part of Figure 2).
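As a concrete illustration - a minimal sketch assuming Matlab's Statistics Toolbox - the classic Fisher iris data can be loaded and arranged in exactly this format (the variable names D and G are our choice):

load fisheriris               % provides meas (150x4 double) and species (150x1 cell array)
D = meas;                     % data matrix: m = 150 samples, n = 4 features
G = species;                  % group variable: one class label per sample
[m, n] = size(D)              % returns m = 150, n = 4
unique(G)                     % the three classes: 'setosa', 'versicolor', 'virginica'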

Different Number of Features  If observations are measured with different numbers of features, we still build a matrix but leave the entries that were not measured blank; for instance, we can use the placeholder NaN (not-a-number). Later we try to calculate with those values, i.e. using functions nansum, nanmean, etc. Or we impute the missing values, i.e. use knnimpute. Both variants are not optimal, but it may be better to keep partial data and attempt to model with it nevertheless, than to discard it and potentially lose information.
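As a minimal, hypothetical sketch of such NaN handling in Matlab (the toy matrix and the column-mean imputation are illustration only; knnimpute from the Bioinformatics Toolbox is the more refined option mentioned above):

D = [1.0 2.0; NaN 3.0; 2.5 NaN];      % toy data matrix with two missing measurements
colMeans = mean(D, 'omitnan');        % column means that ignore NaN (cf. nanmean)
Dimp = D;                             % simplest imputation: replace NaN by the column mean
for j = 1:size(D,2)
    Dimp(isnan(Dimp(:,j)), j) = colMeans(j);
end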

Other Notations  Notations of matrices and group variables may vary. The one above is the typical mathematical notation, but some books or software programs may also talk of an n × d matrix, with n the number of samples and d the dimensionality of the data; or N × d. Or other variable names are used. This variety is of course confusing sometimes, but the traditional mathematical notation using X and Y is not very informative and that is why it is popular to choose more informative variable names. In rare cases, the axes of the data matrix are flipped - rows are features and columns are samples.

In this workbook we do not pursue a consistent notation, as we took formulations from different textbooks and did not adapt the notation. We prefer the letter G for the grouping variable instead of the letter Y. However, the provided example code is fairly consistent and its notation will be introduced soon (Section 1.3).

1.2.1 Types of Feature Values  ThKo p599, 11.1.2; Kuncheva p4, 1.1.3; HKP p40, 2.1

Typically we associate the idea of a measurement value with a number that can be compared to other measurement values by taking their arithmetic difference. Features with such values are called quantitative and are often continuous. But there also exist feature values with other characteristics (Fig. 3) - or values may simply be missing. We attempt to summarize the prevalent types, although there exists no absolute agreement on their exact definitions, i.e. wiki Level of measurement.

[Figure 3: a tree of feature types - quantitative (numerical): continuous (e.g. length, pressure) or discrete (e.g. a game score, countable); qualitative (categorical): ordinal (e.g. education degree) or nominal (e.g. profession, make of a car).]

Figure 3: Classification of feature types. Most data have quantitative, continuous values and most algorithms are designed to compute with those. If the values are discrete, we can still use the regular algorithms, but changing some aspects or parameters can improve recognition performance. If we have qualitative values, then a decision tree is likely to perform better. [Figure after: Kuncheva, 2004, Fig. 1.2]

Quantitative (Real-Valued, Numeric): Familiar quantities are for instance mass, time, distance, heat and angular separation. Thus here we deal mostly with continuous values, but measurements can also be discrete, meaning the number of possible values is limited.

Qualitative (Nominal, Categorical): Examples are nationality, ethnicity, genre, style, etc. In that case, arithmetic differences do not really make sense, with the exception of qualities described by two possible values only; see Binary next.

Binary: Feature values take only two values, zero or one for instance, as in a computer; or 'false' and 'true' - Matlab allows one to specify such values. Binary features can be regarded as a categorical variable with two possible values only. Taking the arithmetic difference between two binary values is possible, but not always optimal: there exist specific difference measures for binary data that can improve your pattern recognition results.
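As a minimal sketch of such binary-specific measures (assuming Matlab's Statistics Toolbox and a small, made-up 0/1 matrix), pdist offers for instance the Hamming and Jaccard distances as alternatives to the plain Euclidean difference:

B = [1 0 1 1; 1 1 1 0; 0 0 1 1];      % three samples with four binary features
dEucl = pdist(B, 'euclidean');        % arithmetic-difference based distance
dHamm = pdist(B, 'hamming');          % fraction of features that differ
dJacc = pdist(B, 'jaccard');          % like Hamming, but ignores joint absences (0/0 pairs)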

Missing Data  It may be the case that there are missing data, meaning a sample may lack one or several component values. There can be various reasons for that: either the measurement was not possible or it is inadequate for the specific sample to have a component value. In that case, it is most appropriate to fill in NaN entries (not-a-number). Software packages can deal with NaN entries in general, but not necessarily all algorithms; for instance, some algorithms will eliminate all features (variables) that contain any NaN entries. This elimination of entire columns is a rather simple work-around and may cause the loss of precious information of non-missing values: it could be beneficial to deal with NaN somehow.

Heterogeneous Data  It is not unusual that the data consist of features of different types, e.g. a mix of continuous (quantitative) and binary variables. In a first step it suffices to scale your data and to test them without separating them. But we can optimize the performance of a classifier if we also test other difference (distance) measures. How we deal with such feature values will be mentioned throughout the workbook.
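A minimal sketch of such scaling in Matlab (assuming the Statistics Toolbox and a data matrix D): z-scoring shifts every column to mean 0 and standard deviation 1, so that no single feature dominates the distance computation merely because of its range.

Dscaled = zscore(D);                  % columnwise (D - mean(D)) ./ std(D)
% manual equivalent, useful if zscore is not available:
% Dscaled = (D - mean(D)) ./ std(D);  % requires implicit expansion (R2016b or later)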

1.3 Model (Algorithm) Selection, Model Evaluation wiki Model selection

To find the best model for our data, we simply have to try different algorithms - it is hard to predict which one is most suitable without some experimenting. Unless, of course, we tackle a task for which others have already accumulated a lot of experience. But even then, it is likely that you can find some model variant of a model type that achieves competitive results. This is particularly the case for complex models. The search for the best classifier model (algorithm) is called model selection.

Model selection is intricate because tuning and optimizing a model often requires some expertise, in particular for more complex models. Hence, if we do not properly tune a model, then we may not exploit its full potential and we might end up selecting a less optimal model that just happened to perform better due to a coincidental choice of optimal parameters.

Tuning a specific model (type) can be considered model evaluation. Model evaluation is time-consuming because a model often requires the adjustment of so-called hyper-parameters, typically one or two in number (wiki Hyperparameter optimization). There exist also other settings, such as the choice of distance metric or optimization routine. Those settings are sometimes considered hyper-parameters as well. A model can perform significantly differently for different settings. In Matlab, the optimization of hyper-parameters is done completely manually by writing for-loops. In Python there exists a module model_selection that contains functions that help find the optimal hyper-parameters for a selected model variant.
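As a minimal sketch of such a manual for-loop in Matlab (assuming a data matrix DAT and label vector Grp as used later in this workbook, and the kNN classifier fitcknn that Section 5 introduces; the candidate grid is made up for illustration):

kCandidates = [1 3 5 7 9 15];                 % hypothetical grid of hyper-parameter values
cvErr = zeros(size(kCandidates));
for i = 1:numel(kCandidates)
    Mdl      = fitcknn(DAT, Grp, 'NumNeighbors', kCandidates(i), 'KFold', 5);
    cvErr(i) = kfoldLoss(Mdl);                % cross-validated misclassification error
end
[~, iBest] = min(cvErr);                      % pick the hyper-parameter with the lowest error
kBest = kCandidates(iBest);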

Generalization Performance  An important aspect of model evaluation is the proper estimation of the so-called generalization performance. In the example of the automatic digit classification task, this would give us an estimate of how reliable the system is when it is applied to a large crowd of people, to novel data that were not seen before by the trained system. After we have trained a model, we can evaluate it on the very same data that were used for training. That gives us the training error, or sometimes one reports the so-called prediction accuracy for training, which is 1 (or 100 percent) minus the training error. Taking this training error as a measure of generalization performance is however misleading, because some classifiers tend to over-fit (over-learn) their training data and as a result they do not generalize well when confronted with new data. What we desire instead is to determine a testing error (or prediction accuracy) for novel, unseen data.

Proper estimation of this generalization performance can be done by splitting the data and the grouping variable (the labels) into two partitions (Fig. 4): one partition serves exclusively for training our model; the other partition is reserved for testing, for estimating the generalization performance of our model. If our data set is very large, with several thousand instances per class and their labels, then we split our dataset D (or X) into two equal partitions. We then train on one partition and predict on the other half. Or we might even choose two thirds of the data for training and the remaining third for testing. For smaller data sets this may not be optimal, because even two thirds of the data may not be sufficient to train properly.

[Figure 4: a sketch of the data matrix (features) and its group labels split into a training partition and a testing partition; the training partition feeds the model (algorithm), the testing partition yields the estimate of the generalization performance.]

Figure 4: Evaluating a classification model. We split the data set into two partitions. One partition is reserved for training the model and we can determine a training error, which however we should not rely on because some models tend to over-fit (over-learn) the data. The other partition serves to test the model and we determine a testing error (or prediction accuracy), which is our estimate of the generalization performance. If our data set is very large, then this could be sufficient already. If not, we need an optimized scheme, such as cross validation (coming up next).

For smaller data sets one therefore uses the procedure of cross-validation, introduced below. It is frequently used for model evaluation and model selection.
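Before moving to cross-validation, here is a minimal sketch of the two-partition split of Figure 4 in Matlab (assuming the Statistics Toolbox and a data matrix DAT with label vector Grp; the variable names are our choice): cvpartition reserves a random, stratified half of the samples for testing.

cvp  = cvpartition(Grp, 'HoldOut', 0.5);      % stratified 50/50 split
DTrn = DAT(training(cvp), :);   GrpTrn = Grp(training(cvp));   % training partition
DTst = DAT(test(cvp), :);       GrpTst = Grp(test(cvp));       % testing partition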

1.3.1 Cross Validation wiki Cross-validation (statistics)

Cross-validation is a procedure to estimate the generalization performance as reliably as possible in case our given data is of moderate size only. If we split the data into two halves (as in Fig. 4), then we can generate two prediction estimates: after we have generated one prediction estimate using each half, we simply swap the two halves and generate another prediction estimate. Then we average the two prediction values and that is our estimate of the generalization performance. This is also called two-fold cross-validation, where one fold stands for one half. This scheme is unsatisfactory, because we have not used all the data for training. We could split our data such that there is more training data, but then we remain with less testing data and with less reliable prediction estimates. The solution to that conundrum is to apply more folds.

Folding  The folding scheme divides the data set into n partitions, called folds. One of those folds is reserved for testing, the remaining folds are used for training (Fig. 5). Then we rotate through the folds n times and obtain n prediction estimates, which we then average to arrive at our mean prediction estimate. The most common number of folds is five, that is five-fold cross-validation: one fold is reserved for testing, the remaining four folds are used for training. We will explain in Section 8 why this five-fold cross-validation is the most popular one.

Our Notation  Throughout the book we use DL to denote the training set and DT the testing set (take note of the font style of D). The corresponding group labels are denoted as GL and GT, respectively. In our code examples, we use the variable name DAT to name the entire data set. DAT is then split into TREN and TEST - by means of folding - sometimes also named TRN and TST, respectively. Group labels are often named with the abbreviation Grp or Lb.

[Figure 5: a sketch of the data matrix (features) and group labels partitioned into five folds; folds 1-4 serve for training, fold 5 for testing.]

Figure 5: Evaluating a classification model using k-fold cross validation (strongly recommended for small data sets). The data matrix is partitioned into n folds, in this illustration five folds. Then n − 1 folds are used for training a model (4 folds in this case); the remaining fold is used for testing the model for its prediction accuracy (generalization performance). That would give us one prediction estimate. Then we rotate n − 1 times and obtain another n − 1 prediction estimates, which we then average and report as our prediction accuracy; that is also called n-fold cross-validation, in our illustration a 5-fold cross-validation. Software packages often use 10-fold cross validation by default.

X (DAT) is split into:            In Algorithms    In Matlab/Python Code
training matrix (i.e. 4 folds)    DL, GL           TREN or TRN; GrpTrn or Grp.Trn or LbTrn
testing matrix (i.e. 1 fold)      DT, GT           TEST or TST; GrpTst or Grp.Tst or LbTst

Matlab  Matlab provides consistent function names (as of version 2015a) to train and test classification models. Its preferred terminology is fitting and predicting. For instance, for a kNN classifier the two principal function names are

Mdl = fitcknn(TREN, GrpTrn); % learning a model (training)

GrpPred = predict(Mdl, TEST); % predicting new data (testing)

whereby GrpPred holds the estimated class labels. For a Support-Vector Machine, the fitting function is called fitcsvm; the predict function remains the same. To apply proper folding - as mentioned above - one can set certain options, to be introduced later. The impatient reader may already take a look at a summary of classifiers in Appendix G.1; in those examples, however, the predict function is not explicitly used; the examples show how to apply cross-validation directly with the fitting function fitcxxx.
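As a minimal sketch of such explicit folding (assuming DAT, Grp and the Statistics Toolbox; the explicit loop mirrors the style of Appendix G.6), a stratified 5-fold partition can be created with cvpartition and each fold evaluated in turn:

cvp     = cvpartition(Grp, 'KFold', 5);       % stratified 5-fold partition
errFold = zeros(cvp.NumTestSets, 1);
for f = 1:cvp.NumTestSets
    TRN = DAT(training(cvp,f), :);   GrpTrn = Grp(training(cvp,f));
    TST = DAT(test(cvp,f), :);       GrpTst = Grp(test(cvp,f));
    Mdl        = fitcknn(TRN, GrpTrn);        % fit (train) on the 4 training folds
    errFold(f) = loss(Mdl, TST, GrpTst);      % misclassification error on the test fold
end
errCV = mean(errFold);                        % estimate of the generalization error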

Matlab often uses the mathematical X/Y notation for variable names in function scripts, with X being the data - training or testing - and Y being the grouping variable.

Python - SciKit-Learn  In Python's package called SciKit-Learn the terminology and notation are mostly the same: there, one speaks of learning and predicting and the corresponding function names are fit and predict. Those functions are called from an estimator instance, which we call Clf in the following code snippet:

from sklearn.neighbors import KNeighborsClassifier # we need to import every function we use

Clf = KNeighborsClassifier() # creating the estimator instance (a kNN classifier in this case)

Clf.fit(TREN, GrpTrn) # learn

Clf.predict(TEST) # test

1.3.2 User Interface for Classification (Matlab)

Software packages are becoming increasingly convenient for evaluating a classification task. Matlab provides the script classificationLearner to test a range of classifiers. All you need to provide is the data matrix and the group vector. Then you can classify your data using different classifiers chosen from a menu. This is very useful for orientation: it can very quickly point toward the best-classifying model for your data. The downside is that the program is generic and that you cannot fine-tune your system. For that there exists an option that generates the code of your preferred classifier model; that code you can then manipulate toward your specific goals and needs. The workbook teaches how to manipulate that code.

1.3.3 Clustering

In clustering there is no evaluation phase of the model. Instead, we cluster and then analyze the characteristics of the output: we analyze the obtained cluster sizes, their center values, their variances, etc. For most clustering algorithms we need to specify a parameter that expresses the cluster count or cluster size we expect. And because we often do not know exactly what to expect, we often run the clustering algorithm for a range of parameter values and then decide, based on the output analysis, which cluster formation appears to be the most suitable.
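As a minimal sketch of such a parameter sweep in Matlab (assuming the Statistics Toolbox and a data matrix D; using K-Means, which Section 9 introduces, and the mean silhouette value as one possible output criterion):

kRange  = 2:6;                                         % candidate cluster counts
meanSil = zeros(size(kRange));
for i = 1:numel(kRange)
    idx        = kmeans(D, kRange(i), 'Replicates', 5);   % cluster label per sample
    meanSil(i) = mean(silhouette(D, idx));                % average cluster quality
end
[~, iBest] = max(meanSil);                             % pick the most suitable cluster count
kBest = kRange(iBest);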

Matlab  There is no particular naming scheme for the functions as in classification. We simply apply a function.

Python  SciKit-Learn's functions follow the naming of the classification functions. There is a function fit that runs the algorithm.

Appendix G.2 shows how to apply the clustering algorithms in their simplest form.

1.4 Varia: Code, Software Packages, Training Data Sets

Source for this Workbook  I have tried to compile the best pieces from each textbook and I provide exact citations including page numbers; see Appendix F for a listing of titles. My workbook distinguishes itself from the textbooks by specifying the use of some of the procedures more explicitly; and by providing code, which is optimized for use in a high-level language such as Matlab.

Mathematical Notation  The mathematical notation in this workbook is admittedly a bit messy, because I took equations from different textbooks. I did not make an effort to create a consistent notation, so that the reader can easily compare the equations to the original text. In the majority of textbooks a vector is denoted with a lower-case letter in bold face, e.g. x; a matrix is denoted with an upper-case letter in bold face, e.g. Σ. But there are deviations from this notational norm.

Code The code fragments I provide are written in Matlab and Python; in other languages most of the commands have the same or a similar name. The computations in Matlab are written in vectorized form, which is equivalent to matrix notation. This type of vector/matrix thinking is unusual at the beginning, but highly recommended for three reasons: 1) the computation time is shorter than with for-loops; 2) the code is more compact; 3) the code is less error-prone. However, some code fragments may contain unintended mistakes, as I copied/pasted them from my own Matlab scripts and occasionally made unverified modifications for instruction purposes. The same holds for my Python scripts. For Python I use the SciKit-Learn package (Pedregosa et al., 2011). It can also be useful to check Matlab's file exchange website for demos of various kinds:

http://www.mathworks.com/matlabcentral/fileexchange


Some Software Packages Here are some other languages and packages that should be capable of serving your needs. The Weka software is a pattern recognition package that lets you run the algorithms through a user interface; however, you may be stripped of some flexibility.

- MatLab: Unfortunately expensive and mostly available either in academia or industry.
- R: Free software package supposed to be a replacement for MatLab. wiki R (programming language)
- Python: High-level language similar to Matlab, but more explicit in coding; see in particular http://scikit-learn.org/stable/. wiki Python (programming language)
- Weka: Free software package written in Java. wiki Weka (machine learning)

Example Datasets Here are links to example datasets that have been used throughout machine learning history and that appear often in textbooks. Some of those datasets are rather small, but it can be convenient to get your classifier working on such a small dataset first and then to move on to real data. We provide some links here but we did not check whether they are still valid. In Python, many of those datasets come with the module sklearn.datasets, see SKL page 540, section 3.5.

- Iris flower dataset: consists of 150 samples, each one with 4 attributes (dimensions):
http://mlearn.ics.uci.edu/MLRepository.html
http://www.ics.uci.edu/~mlearn/MLRepository.html

It also exists multiple times in Matlab: once as part of the statistics toolbox, load fisheriris; and as part of the fuzzy logic toolbox, load iris.dat. Python has it as datasets.load_iris().

- Handwritten digits (MNIST database): consists of 70000 digits, each one 28 x 28 pixels:
http://yann.lecun.com/exdb/mnist/. In Python available as datasets.load_digits(), though I think that command loads only a subset of those data.

- Bishop also provides other collections, Bis p677:
http://research.microsoft.com/en-us/um/people/cmbishop/PRML/webdatasets/datasets.htm

Matlab offers quite a number of datasets, see http://www.mathworks.com/help/stats/_bq9uxn4.html.


2 The Recognition Challenge

Any model has advantages and disadvantages. A model often excels only for a specific configuration of data points; for other configurations it performs only reasonably; for some configurations it may even fail. In the next section we sketch the variety of models that exist and explain in which situations they perform well (Section 2.1). That section is going to be dense in ideas and concepts, perhaps too dense. But I think it serves better to have the concepts introduced at the beginning as a summary, rather than scattered across the entire workbook.

Then there is the 'work load' that a model is able to take: some are suitable to swallow large data, some are not; some learn fast, some slowly. Those aspects are introduced in Section 2.2.

2.1 The Modelling Challenge

We introduce classification models first. We begin with a trivial case. It consists of two point clouds, see Fig. 6a, squares and circles, clearly spaced apart. We are given a new (testing) point, the gray triangle, and we are asked to assign it to one of the two groups, to label it as either square or circle. There are two simple models that come to mind: determining a centroid (Fig. 6b) or finding a threshold (Fig. 6c).

Figure 6: A trivial case for recognition in two dimensions.
a. We are given two sets of points representing two classes, squares and circles, respectively. And we have a new point, a gray triangle, which appears to be closer to the circle class. How would we create a model that finds the appropriate class label for that new point (or any point)?
b. Nearest-Centroid Classification: we first calculate the mean point for each class (marked as 'x') and then determine to which one the new point lies closer.
c. Decision Stump: we set a threshold at an appropriate value of the x-axis.

Centroid: we calculate the mean point for all squares and the mean point for all circles. In this context, the mean point is called centroid - it is the center of a point cloud, a group, a class or a cluster. Then we determine to which centroid our new point is nearest, and that gives the predicted class label. This model is also referred to as nearest-centroid classification (Section 4).

Such a model is called a generative or parametric model, because we have described the two classes by their centroids; that is a minimalistic description, but it is a description nonetheless in comparison to the next model.

Threshold: we set a value on the x-axis that acts as a threshold, such that any new point whose x-coordinate is smaller than that threshold value is labeled square, and any testing point whose x-coordinate is larger than that threshold is labeled circle. This model corresponds to a decision tree, a small one with a single decision. Such a single-decision tree is also called a decision stump.


Such a model is called a discriminative or non-parametric model, because we merely define a border that separates the two classes. The threshold value is of course a parameter as well, but not one that describes the groups themselves.

Admittedly, it is somewhat of an exaggeration to talk of models so far; they rather represent decision processes. Nevertheless they express the essence of more complex models. The actual recognition challenge is to find the parameters of those models: of course we want an automated learning process. In the case of the centroid model that is not a real challenge, we merely compute an average. In the case of the threshold this is not as straightforward anymore: we need to test a range of values, as sketched below.
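A minimal numpy sketch of these two toy models, assuming two well-separated 1-D classes (the data and the candidate-threshold grid are made up purely for illustration):

import numpy as np

# toy 1-D data: class 0 (squares) around 1.0, class 1 (circles) around 3.0
x = np.concatenate([np.random.normal(1.0, 0.3, 20), np.random.normal(3.0, 0.3, 20)])
y = np.concatenate([np.zeros(20, int), np.ones(20, int)])

# centroid model: learning is just averaging per class
centroids = np.array([x[y == 0].mean(), x[y == 1].mean()])

# stump model: learning means scanning a range of candidate thresholds
cand = np.linspace(x.min(), x.max(), 100)
acc  = [np.mean((x > t).astype(int) == y) for t in cand]
thr  = cand[int(np.argmax(acc))]          # threshold with the best training accuracy
print(centroids, thr)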

Classes Overlap Now the classes touch each other or even overlap, Fig. 7a. This requires models that are more precise and more flexible. We can sketch again a generative and a discriminative approach.

Figure 7: A more challenging case: the two groups overlap (a.).
b. Function: we model the point clouds as ellipses with suitable orientations and radii ratios. Or even better, with Gaussian functions (Fig. 39), as many classifiers do (Section 17).
c. Straight Line Equation: we try a discrimination using a straight line of appropriate slope and offset, as is done by some linear classifiers (Section 6).

Function (generative): here we attempt to find a function that captures the point cloud, the distribution. For instance, we use an ellipse whose orientation and shape cover the point cloud as well as possible, see Fig. 7b. This is the approach taken by Naive Bayes classifiers, for instance (Section 17). One can take any function that captures the extent of the point cloud; the Gaussian function is very popular for that purpose (Appendix C). The use of functions represents 'real' generative modeling, as opposed to calculating only the centroid.

Straight Line (discriminative): because smaller/larger thresholding along the axis does not work anymore (as in the trivial case), we now use a straight line equation as threshold, see Fig. 7c. The line equation has two parameters, slope and offset. The Linear Discriminant Analysis in Section 6 moves in that direction.

The distinction between generative (parametric) and discriminative (non-parametric) models is not always clear-cut in more complex models. Some discriminative models use generative modeling to arrive at a good decision function, in which case the typification becomes rather nominal. We have introduced the terminology because it appears in most textbooks.


Classes Nested In the following case, classes are nested (Fig. 8a): one class encompasses the other. Again, the previous models are challenged by this new configuration. If we tried something with centroids, we would observe that they lie very near each other, which makes the generative models introduced so far unsuitable. Finding a straight line is not feasible either. Clearly, a novel direction is required. We sketch three directions.

Figure 8: A difficult case: the two classes are nested; one group is tucked into the other.
a. The class centroids are very near each other (not shown), making function modeling as discussed so far difficult; a straight line will not work either.
b. We transform the input space such that the original configuration appears linearly separable; the illustration does not reflect a true transformation model, but should exemplify what is meant by 'becoming linearly separable'.
c. The problem is solved with multiple decision lines: each one solves the classification task only moderately well, but their pooled decision can be quite reasonable (Section 12). Some models use decision stumps as illustrated in Fig. 6c to solve a task in that way.

1. Neighborhood Analysis: We determine all inter-point distances, which results in a rather exhaustive description. This quasi-full description can be exploited in various ways (neither case is depicted in the figure):

a. Neighboring Labels: we merely look at the class labels of the neighbors: for instance, because the triangle is surrounded by points of the class circle, we would label it as such. This is called k-Nearest-Neighbor classification (Section 5).

b. Topology: another approach is to investigate the full set of distances and to arrive at a description of the topology, which would enable us to characterize the half-circle as an elongated cluster and the inside point cloud as a 'density', for example.

Both approaches are computationally intensive because they calculate a lot of distance measurements; this computational aspect will be elaborated in the next section (Section 2.2).

2. Multiple Decisions: We pool the decisions of multiple simple classifiers (Fig. 8c). The individual decisions may be unreliable, but the pooled decision can be a good prediction. This is called ensemble classification, because we deal with an ensemble of classifiers (Section 12). Some ensemble classifiers use decision stumps as introduced before (Fig. 6c).


3. Transformation: Here we try to transform the axes such that classes appear 'linearly separable' (Fig. 8b). By transforming the axes we hope that the configuration of points changes in a way that lets us tackle the problem with one of the simpler models sketched earlier. This approach is computationally intensive as well and suitable for limited data sets only.

Clustering

Clustering is more challenging than classification from a modeling point of view. For any of the classification cases discussed above, the clustering task tries in principle to achieve the same goal, namely 'finding' the two groups. Yet in clustering we do not have class labels for guidance, and due to the lack of that information the clustering task becomes a very challenging 'search' task. Furthermore, we do not even know how many clusters there are. Let us take a complicated distribution as depicted in Figure 9. How many clusters do you recognize? Looking at the distribution coarsely (squinting your eyes), one would say two clusters; looking at the distribution on a finer scale (approaching the page), we could count three or more clusters. That is, we are faced with a 'scale' problem right from the beginning: we do not even know what an appropriate focus would be. For that reason most clustering algorithms require the specification of a parameter that expresses at what scale the clustering should take place: we need to give the algorithm some clue of what we roughly expect.

Figure 9: Illustrating the clustering problem in 2D. We are given a set of points and we attempt to find dense regions in the point cloud, which likely correspond to potential classes. Are there two, three or more classes in the point cloud? Intuitively one would like to 'smoothen' the distribution (Section 14) or to measure all point-to-point distances to obtain a detailed description of the point distribution (Section 10). Both approaches are however computationally very intensive for large data. For large datasets we therefore use 'simpler' procedures such as the K-Means algorithm (Section 9).

Considering now models for clustering, perhaps the first thought might be to look for densities, i.e. points that appear agglomerated with respect to their neighborhood. For instance, we calculate at each point some measure of density and then select as cluster centroids those points that appear most agglomerated. One can think of this density search in two ways:

Neighborhood Analysis: we measure all inter-point distances - as was suggested already for the classification task -, then select the points with highest density, and then proceed by connecting to other neighbors. This is the direction taken by the method of hierarchical clustering (Section 10). But as pointed out for classification already, this type of neighborhood analysis is computationally expensive and therefore suitable only for data sets of limited size.

Smoothing: an engineer with experience in signal detection might consider some type of convolution, where one runs a filtering function across the entire space with the purpose of smoothing the distribution. This would give us the opportunity to find centroids that do not necessarily coincide with an actual data point. This approach is often used to find the centroid of geographical locations. That smoothing is however even more expensive than the neighborhood search, and it is therefore only done for few dimensions, in which case we also talk of density estimation (Section 14).


Practically, one of the most efficient clustering algorithms is the so-called K-Means algorithm (Section 9), whose procedure is not immediately intuitive in comparison to all the other classification and clustering scenarios we have sketched so far. Yet one of its key steps contains a labeling process that is similar to the one carried out by the nearest-centroid classifier (Fig. 6b). And by mentioning that, we have come full circle in our mini-survey of principal models.

2.2 The Computational Challenge

The computational challenge starts when the data matrix becomes either long or wide (Fig. 10): when the number of samples grows and the data become large; or when the number of dimensions grows and the data become high-dimensional. From what size on data are considered large, very large or big is relative and shifts with the increasing availability of RAM. From what dimension on data are considered high-dimensional is also relative and depends in part on the data's denseness.

In both situations, distance measurements become problematic. For large data they simply become time consuming. Yet for some models it suffices to merely know who the immediate neighbors are; one does not require the distances to the remaining points. For this limited neighbor search, and under certain conditions, distance measurements can be calculated even for large data. This is also known as the challenge of nearest-neighbor search (Appendix A).

For high-dimensional data, distance measurements become ineffective due to the vastness of the high-dimensional space; this problem is usually considered part of the so-called curse of dimensionality.

Figure 10: The challenges for different data 'proportions'.
Left half: when the data is very large - the matrix is long in height -, the calculation costs increase and one starts to be restricted to simpler models; in the case of clustering one sometimes performs a compression first, a 'pre-clustering' to quantize the data.
Right half: when the data is high-dimensional - the matrix is wide -, we are faced with the curse of dimensionality. One then often attempts to find a lower-dimensional space: either a true dimensionality reduction (oblique black arrow toward south-west); or the identification of the most contributing dimensions (i.e. shrinking) and neglecting the less contributing features, denoted as 'quasi-lower dimensional'.

Curse of Dimensionality This term stands for the paradox that we face a 'coordination' problem when the number of dimensions increases. Normally, our classification accuracy improves with an increasing number of features, because with more dimensions our data is described in more detail. In practice, however, one observes that with increasing dimensionality the gain in accuracy starts saturating and can even drop. It seems as if increasing dimensionality becomes irrelevant at some point; one somehow cannot coordinate the information to fully profit from it.

One phenomenon of higher dimensionality is that points become lonelier in space: they become so distant from each other that differences between them become small; in high-dimensional space they are quasi equally spaced (Figure 2.6 in HTF p23, pdf42). From what dimensionality on the distance measurements become ineffective depends on the data's denseness: the more data points there are, the less lonely they are in space. From dimensionality 10 to 20 on, however, we will usually not have sufficient data to avoid point loneliness. From that dimensionality on, data are said to be high-dimensional; below that dimensionality, data are said to be low-dimensional.
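One can observe this distance concentration with a small simulation (a hedged sketch with made-up uniform data; the exact numbers are irrelevant, only the trend matters): the spread of the pairwise distances, relative to their mean, shrinks as the dimensionality grows.

import numpy as np
from scipy.spatial.distance import pdist

for d in (2, 10, 100, 1000):              # increasing dimensionality
    P = np.random.rand(200, d)            # 200 uniformly drawn points
    D = pdist(P)                          # all pairwise Euclidean distances
    print(d, D.std() / D.mean())          # relative spread shrinks with d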

Avoiding the Curse If the data is high-dimensional, then we can try to operate in a lower-dimensional space. There are many techniques and methods, often geared to the specific task at hand. There are two principal directions: one operates on the full dimensionality but tries to ignore the least-contributing dimensions; the second actually reduces the dimensionality by knocking out dimensions.

Shrinking: a method that is incorporated into a classification model. It tries to identify those dimensions that contribute most and neglects less-contributing dimensions either during fitting or during the decision process (Section 4.1).

Dimensionality Reduction: methods that find a lower-dimensional space before the data are fed to a classification or clustering algorithm (Section 7). Thus, the ensuing classification and clustering computations are not done on the full dimensionality, but on a 'slimmed' data matrix.

Large Sample Size If the data is large, then we will choose a simpler model, in particular a model whose fitting (learning) duration is not too long and that does not require memory-intensive calculations to classify novel data. In that case, linear classifiers are perhaps the best choice, as they do not rely on excessive calculations. In the case of clustering we may add a compression step before any 'regular' clustering (left-hand side of Fig. 10). That serves the goal of quantizing our data, with the risk of losing precision. But that compression step enables us to find some densities at all, which we could not achieve with the full, original set of data.


3 Data Preparation (Loading, Inspection, Adjustment, Scaling) HKP p39, Ch 2

ThKo p262, 5.2

Given a new data set, it is best to inspect it and to adjust it as appropriately as possible. The data may contain entries that are difficult for classification algorithms to deal with, such as missing entries in the form of blanks, Not-a-Number (NaN) values or other place-holders; there may exist 'unbalanced' features; there may be dimensions with zero values only, etc. Software programs will perform some of the adjustment automatically, but they tend to eliminate difficult samples or features entirely, thereby most likely lowering the recognition accuracy, as you throw away imperfect samples that could nevertheless carry useful information in their remaining attributes. This section helps to adjust your data to classify it optimally. Appendix G.3 has a script that applies the functions that we mention in the following.

Most data files can be opened with the Matlab command importdata. In Python you should consult the modules scipy.io or numpy/routines.io. Section 3.5 gives more details and also advice on how to organize your data if it consists of several files.

Ensure that your data has the typical orientation right from the beginning, namely observations-by-variables (see Figure 2), otherwise it can become confusing in later stages of the analysis. Transpose your matrix as follows:

DAT = DAT’; % if necessary flip the matrix to format observations-x-variables (Matlab)

DAT = DAT.transpose() # (Python)

Now you visually inspect your data; Section 3.1 introduces some standard techniques. Then you check for missing values or other special entries, discussed in Section 3.2. Finally, scaling your data helps improve (prediction) accuracy in many cases, so it can be worthwhile trying different schemes, see Section 3.3.

3.1 Visual Inspection, Group Statistics

Display Data as Image Because our data is a matrix, a two-dimensional array, it can be conveniently displayed as an image. In Matlab you use the function imagesc for that, in Python imshow. If your data is really large, then simply choose a subset of it - it still allows you to obtain a quick impression of your data. Taking more than several thousand samples and dimensions probably does not make much sense, because your screen resolution is limited, but you may try nevertheless. For very large data this is however not recommended, as your display may lack the necessary memory.

figure(1);clf;imagesc(DAT(1:1500,1:2000));colorbar;

from matplotlib.pyplot import imshow, figure, colorbar

figure(figsize=(4,40)); imshow(DAT); colorbar()

By this visual analysis, one can often observe certain data characteristics, for instance whether some of your features have a low range of values, whether they have only few non-zero values, or whether they are even binary; that is, it can give some idea about the types of features that are present (Section 1.2.1). Sometimes one can already recognize the classes, if the samples are organized class-wise.

Descriptive Statistics The next step would be to observe simple descriptive statistics, for example by calculating the mean, standard deviation, range, etc. The command boxplot plots some of those values (Fig. 11).
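A minimal Python sketch of this step (assuming DAT is the numpy data matrix loaded above):

import numpy as np
import matplotlib.pyplot as plt

Mu  = DAT.mean(axis=0)                    # per-feature mean
Sd  = DAT.std(axis=0)                     # per-feature standard deviation
Rng = DAT.max(axis=0) - DAT.min(axis=0)   # per-feature range

plt.boxplot(DAT)                          # one box per feature (column)
plt.show()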

Feature Value Distribution To look at the data in more detail we can then employ histograms (Fig. 12). We return to that in the section on density estimation (Section 14). It also merits checking whether the variables co-vary, but we elaborate on that later (Section 6.1).


Figure 11: Box plot for the four features (variables) of the famous Iris flower data set. From center to outside: the red marker denotes the median; the edges of the blue box delineate the 25th and 75th percentiles; the black markers denote the range without outliers; red plus markers denote points considered outliers. Matlab: boxplot. Python: matplotlib.pyplot.boxplot.

Note that the 'spread' of the data can be very different: the 2nd and 3rd features have very different ranges. That means we might need scaling, otherwise the petal length (3rd feature) might dominate classification.

Figure 12: Histograms for the four Iris features. Matlab: hist. Python: matplotlib.pyplot.hist.

Note that the distributions look very different. Some appear to be uni-modal (one peak) - the upper row; some appear to be bi-modal (two peaks) - the lower row; and they can even be multi-modal (lower right), though that could also be a distribution with no peaks at all.

Unique Values With the function unique you obtain a list of only those values that are actually used in your distribution of values. Perhaps one of your feature dimensions is a repetition of the same three values, e.g. 3, 4.5 and 5.7. That is useful to know in order to choose appropriate classifier parameters. We can also use this function to observe which classes are present in the group variable:

LbU = unique(Grp); % the class/group labels

nGrp = length(LbU); % number of groups

histc(Grp,LbU) % histogram of group members/class instances

from numpy import unique

GrpU, nPer = unique(GrpLb, return_counts=True) # class labels and number of members per class

nGrp = len(GrpU) # number of groups


3.2 Special Entries (Not a Number, Infinity, etc.)

It is not unusual that data contain missing entries indicated by NaN (not-a-number) or some other place holder. If you generated the data yourself, perhaps you 'accidentally' created an infinity entry (Inf). You need to address those special entries somehow, otherwise classification can become difficult and can even return results that make no sense. In Matlab and Python you find NaN values with, for instance, the function isnan; see the example in Appendix G.3.

Inf/NaN Those entries are generated when there is a division by zero, in which case Matlab will also display a warning; or when one operand is already of either type. Those two cases in more detail:

- Division by zero: Matlab returns a division-by-0 warning and creates an
  Inf entry, if the divisor (denominator) is 0 (e.g. 1/0);
  NaN entry, if both divisor and dividend (numerator) are 0 (i.e. 0/0).

- Any operation with a NaN or Inf entry remains or produces a NaN or Inf entry.

Because most classifiers use multiplication operations, entries with NaN or Inf values will propagate through the computations and therefore likely return useless results. In odd cases you obtain 100% correct classification, for instance if, in a task with two classes, one class contains NaN entries. Here are tricks to deal with those entries:

Avoid Inf by Division: To avoid the creation of infinity entries, one can add the smallest value possible to the divisor, e.g. type directly in Matlab

1/(0+eps)

eps is Matlab's floating-point relative accuracy (machine epsilon) - you add it to the variable that is acting as divisor. This trick produces a very large but finite value - instead of generating an Inf entry -, thus permitting further operations with the variable, as opposed to an Inf entry.

Avoid Inf by Scaling: Perhaps it is worthwhile to scale that feature by a function such as tanh, or another function that squashes the values to a small range; see also the next Section 3.3 on scaling.

Inf: If you cannot avoid Inf entries, then try your classification after replacing them with the largest floating-point or integer number, using the command realmax or intmax.

NaN: If your data already contains not-a-number entries, then you are forced to use special commands that can deal with them, e.g. nanmean, nanstd, nancov, etc. A typical algorithm - implemented in some software - will simply knock out the dimensions (variables) where NaN entries occur. That is a quick solution, but you also lose those variables completely, meaning you probably lower the prediction accuracy.
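As a hedged Python sketch of the two typical options (dropping versus imputing; DAT is assumed to be a numpy array with some NaN entries):

import numpy as np

Bnan = np.isnan(DAT)                          # boolean mask of missing entries

# option 1: drop every sample (row) that contains at least one NaN
DATdrop = DAT[~Bnan.any(axis=1), :]

# option 2: impute with the per-feature mean, ignoring the NaNs
ColMu  = np.nanmean(DAT, axis=0)              # column means computed over valid entries only
DATimp = np.where(Bnan, ColMu, DAT)           # fill the gaps with the column means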

Constant Variables If your data contains features whose values are all the same, then eliminate those variables immediately. Software implementations will often take care of that, but eliminating them beforehand is always more elegant.

3.3 Scaling ThKo p263, 5.2.2

Your data may have features (dimensions) whose values differ significantly in their range. For one dimension, the differences amongst values could be in the order of thousands; for another dimension, the differences could be smaller than some very small number. Since a classifier will try to compare the dimensions, it can therefore be beneficial to scale (or standardize) your data. For some functions (mostly in Matlab), there is an option that allows you to specify whether to standardize your data. But it is useful to know that there exist different possibilities to perform the scaling operation:

1. Standardization: Subtract the mean from the feature values and divide by the standard deviation, for each variable separately. The resulting scaled features will then have zero mean and unit variance. In Matlab we can use the command zscore; in Python this is carried out by sklearn.preprocessing.scale. The code examples in G.3 show how to use it.


2. Scaling to range: Limit the feature values to the range of [0, 1] or [-1, 1] by corresponding scaling. In particular, Deep Neural Networks prefer the data input in unit range [0,1]:

DAT = bsxfun(@plus, DAT, -min(DAT,[],1)); % set minimum to 0
DAT = bsxfun(@rdivide, DAT, max(DAT,[],1)); % now we scale to 1

3. Scaling by function: Scale the feature values by an exponential or tangent function (e.g. tanh).

4. Whitening transformation (DHS pp 34, pdf 54): this is a decorrelation method in which the samples are transformed using the (inverse square root of the) covariance matrix of the dataset. The method is called "whitening" because it transforms the input matrix to the form of white noise, which by definition is uncorrelated and has unit variance (see Section G.3.1 for details).

Note 1: To estimate the prediction accuracy of your classifier properly, you should determine the scaling parameters on the training set only and then scale the training and testing sets separately, e.g.

[TRN Mu Sig] = zscore(TRN); % scale training set; Mu/Sig are its mean and standard deviation
DF = bsxfun(@minus, TST, Mu); % subtract the training mean from the testing set
TST = bsxfun(@rdivide, DF, Sig); % divide by the training standard deviation
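A Python equivalent sketch (here using sklearn's StandardScaler, which is one possible choice and not necessarily the one used in the appendix; TRN and TST are assumed to be numpy arrays):

from sklearn.preprocessing import StandardScaler

Scl = StandardScaler().fit(TRN)   # mean and standard deviation estimated on the training set only
TRN = Scl.transform(TRN)          # scale training set
TST = Scl.transform(TST)          # scale testing set with the training parameters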

Note 2: Scaling may distort the relations between dimensions and hence the distances between samples. Therefore, scaling does not necessarily improve classification (or clustering). It may be useful to look at the distribution of the individual features (see Section 3.1 above) to see what type of scaling may be appropriate.

3.4 Permute Training Set

For some classifiers it is important to permute your training set. If your training set is organized group-wise or with any other regularity, then that can lead to wrong predictions. You therefore need to randomize the order of the training samples. Simply create a vector IxPerm with numbers ordered randomly and reorder your data and the group variable:

IxPerm = randperm(nSmp); % randomize order of training samples

DAT = DAT(IxPerm,:); % reorder training set

GrpLb = GrpLb(IxPerm); % reorder group variable

Variable nSmp is the number of training samples.
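The same step in Python (a sketch; DAT and GrpLb are assumed to be numpy arrays and nSmp the number of samples):

from numpy.random import permutation

IxPerm = permutation(nSmp)        # random order of the sample indices
DAT    = DAT[IxPerm, :]           # reorder training set
GrpLb  = GrpLb[IxPerm]            # reorder group variable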

3.5 Load Your Data

Before you attempt to process and classify your data, it can be useful to format your data in a separate script and to save that formatted data separately - if it is not too large. This is particularly recommended if you have a large number of files. There is of course no general scheme for carrying out this preparation, because each dataset is individual. In the following are some tips on how to organize your work and how to load the data:

Organize It is useful to create the following folders to organize your data and scripts:

- DatRaw: place your downloaded data into that folder.
- DatPrep: where the processed data will be saved.
- Classif: Matlab scripts for classification and data manipulation.

Open a script called PrepRawData to prepare your raw data. You will load the data, convert them and then save them. This is elaborated in the following paragraphs:

Loading Data Most files can be loaded with the command importdata. Should that function be insufficient, due to a lack of specificity for example, then one has to start looking at commands such as textscan, textread, etc.; see also the section entitled 'See Also' at the end of each help document. For images there exists the special command imread. If all fails, it is not a shame to ask a system administrator to explain to you how to read your data. Some formats can indeed be tricky.


Data Preparation Assign your data to a matrix called DAT of format [Samples x Dimensions] for rows and columns (see again Section 1.2). If you have many files, you may want to initialize the matrix beforehand in order to speed up the preparation step, e.g.

DAT = zeros(nSmp, nDim, ’single’);

Initialize a global structure variable, which contains the path names to those folders, e.g.

FOLD.DatDigits = ’C:/Classification/Data/Digits’;

With the command dir you can obtain a list of all images, e.g.

ImgNames = dir(FOLD.ImgSatImg)

and then access the filenames via ImgNames(1).name. The first two entries will contain '.' and '..', so start with ImgNames(3). Use fileparts to separate a path into its components.

If you need high computing precision, use the double type instead of single:

DAT = zeros(nSmp, nDim, ’double’);

Matlab does everything in double by default, but double also requires twice as much memory. There may be sufficient hard-disk space, but RAM is often the limiting memory (the more gigabytes the better). For most tasks, the datatype single is sufficient, however.

Grouping Variable Also prepare your group variable Grp, the class/category labels for each sample. If your labels are not numerical - that is, not simply values between 1 and the number of classes - it is recommended that you use the command grp2idx to convert your labels into that format; it will facilitate the classification procedure later. Although Matlab is relatively flexible with labels, for instance allowing string labels, it can become obscure to deal with them - unless one enjoys that combination of flexibility and elegance.

Appendix G.3.3 gives an example of how to load and convert. It is a function, which returns the data already partitioned into training and testing sets with the corresponding grouping variables called Lbl.

Saving Data - And Reload Later Saving the data is simply done with the Matlab command save. This will save the data in Matlab's own format, which is a type of compression format. When you reload the data in your classification script, you use the command load.

Appendix G.3.2 shows code fragments to understand how to program the individual steps.

3.6 Recapitulation

1. If you generate the data yourself, i.e. by a model, then avoid infinity values (or any division by 0) using for instance x/(0+eps). Squash your variables to unit range - it is most practical for classifiers.

2. Load your data with importdata, textscan, textread, etc.

3. Inspect your data visually with plotting commands such as imagesc and boxplot. Inform yourself about what entries the grouping variable contains - if there is one.

4. Analyze your data entries for variables containing only one value: eliminate them immediately.

5. If your data contains NaN entries, you should check how your classification algorithm handles them. In most cases, the algorithm will eliminate the corresponding columns.

6. Permute your training set, in particular for classification algorithms that learn sequentially, such as Neural Networks, or for clustering algorithms that analyze data points sequentially.

7. Consider scaling your data. It will improve performance in most cases - for some classifiers scaling is necessary. Try different scaling schemes - there will not be big differences in the results, but they could be significant nevertheless.


4 Nearest-Centroid Classifier HTF p649, pdf 668

SKL p205

The nearest-centroid classifier is such a simple classifier that it is not even explicitly mentioned in some textbooks. Its value was recognized when a variant of it was applied to high-dimensional data sets, in particular to data sets in gene analysis or in document classification. In those domains, the number of features can be much larger than the number of samples.

In our case it serves two additional purposes. One is that it can serve as a base-line classifier: the prediction accuracy will be rather low and we should obtain a better accuracy with most other classifiers. If we do not obtain a better accuracy with other classifiers, then we may not have applied them properly. Second, the classifier is good for instructional purposes: it helps to get acquainted with the specifics of the software packages; and the classifier constitutes a key step in the most successful clustering algorithm, the K-Means algorithm (Section 9).

As introduced in Section 2.1 already (Fig. 6b), the algorithm determines the centroid for each class. When a novel sample arrives, the distance between the test sample and all centroids is calculated, and the shortest one determines the class label.

Algorithm 1 Nearest-Centroid classification. DL=TRN (training samples), DT=TST (testing samples). G: vector with group labels (length = nTrainingSamples), c: number of classes (groups).

Initialization: scale data
Training: for each class ∈ DL, determine the mean vector (centroid) using G → C. C is our list of c centroids.
Testing: for a testing sample (∈ DT): calculate the distance to all centroids in C → D (c distance values)
Decision: determine the minimum distance in D: the corresponding class label is our predicted class label.

Implementation

The code example in Appendix G.5 shows how to program such a classifier. With the function crossvalind we partition our data into five folds (Section 1.3). One fold is used for testing; the remaining four folds are taken to be the training samples.

The calculation of the class means is done with a for-loop and is hopefully self-explanatory. If not, you need a programming course - or simply play with the code until it clicks. In the testing phase, we have chosen to use the function pdist2, which calculates the distances for all testing samples. One could also write a for-loop; do so if you need programming practice. DM is a distance matrix and holds all distances between testing samples and centroids. With the function min we find the minimum distance; the second output argument LbPred holds the group index. Finally we only need to compare predicted labels with true labels by LbPred==GrpTst.

The program carries out only one prediction estimate. If one wanted a more reliable prediction estimate, then one would rotate through the folds by writing a for-loop that wraps the training/testing procedure.

Python In Python the code is nearly identical. The counterpart of Matlab's function pdist2 is cdist (in scipy.spatial.distance). In Python we obtain the indices of the minimum values using the function argmin ('arg' for argument). Python also offers the class sklearn.neighbors.NearestCentroid to run the Nearest-Centroid classifier in its simple version; we show how to apply the function at the end of the code example, appended to the explicit version.
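A condensed sketch of both routes (assuming numpy arrays TRN, TST and label vectors GrpTrn, GrpTst, as in the appendix):

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.neighbors import NearestCentroid

# explicit version: one centroid per class, the nearest centroid wins
LbU = np.unique(GrpTrn)                                        # class labels
C   = np.vstack([TRN[GrpTrn == g].mean(axis=0) for g in LbU])  # class centroids
DM  = cdist(TST, C)                                            # distances: testing samples x centroids
LbPred = LbU[DM.argmin(axis=1)]                                # label of the nearest centroid
print(np.mean(LbPred == GrpTst))                               # fraction correctly classified

# package version
Clf = NearestCentroid().fit(TRN, GrpTrn)
print(Clf.score(TST, GrpTst))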

4.1 Nearest-Shrunken Centroid (NSC)

The Nearest-Shrunken Centroid classifier is an elaboration of the above classification procedure: it estimates the significance of individual features (predictors), and if their significance appears low, then they are given less weight during the decision process. This is however not done completely automatically. The user needs to specify a threshold parameter, called the shrink parameter, that helps to identify seemingly insignificant features. A parameter entered by the user is sometimes also called a hyper-parameter (Section 1.3).


Estimating the significance is done by observing the variance of individual features. The easiest way would be to apply the variance function to the entire list of feature values, i.e. var(DAT(:,1)) for the first feature. This is a form of significance estimation that is indeed used to identify potentially irrelevant features (Section 7.2). Here, however, the feature significance is estimated with respect to each class, i.e. var(DAT(GrpTrn==1,1)) for the first feature and the first class. If that significance is below the shrink threshold, then the feature will be given less 'attention' during the decision process. We do not elaborate on the details of this procedure: the crux is that the group centroids move a bit closer to each other with respect to the significant features, but not with respect to the insignificant features. The actual decision process remains the same as for the simple version introduced above.

Implementation We present a code example for Python only, Appendix G.5.2. It demonstrates how to use the function: we only need to set a threshold value. And it demonstrates, in an explicit version, the details of the shrinking procedure. The code contains an option to use a simple toy data set, which is useful for studying the details of the shrinking procedure.

Since we probably do not know the appropriate threshold value at the beginning of our task, we need to experiment a bit. The easiest way is to write a loop that tests a range of values, as sketched below - the Python documentation shows an example SKL p1113, 23.4.
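A minimal sketch of such a loop (the candidate threshold values here are made up; TRN, GrpTrn, TST, GrpTst as before):

from sklearn.neighbors import NearestCentroid

for thr in [None, 0.1, 0.2, 0.5, 1.0, 2.0]:        # None = no shrinkage, then increasing thresholds
    Clf = NearestCentroid(shrink_threshold=thr)
    Clf.fit(TRN, GrpTrn)
    print(thr, Clf.score(TST, GrpTst))             # keep the best-performing value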

Discussion As pointed out already, this variant has been particularly useful for high dimensionality because it effectively helps to ignore noisy, irrelevant features. In gene analysis, it can help to identify a few dozen relevant dimensions (genes) amongst thousands of dimensions.

The nearest-shrunken centroid classifier is "equivalent to a regularized version of the diagonal-covariance form of the Linear Discriminant Analysis", HTF p652, pdf 671. It will become clearer later in the workbook, in particular in Section 6, what that means.

4.2 Summary

Irrespective of the dimensionality of your data, this is a classifier that can be used as a base-line classifier, namely to obtain a prediction accuracy that can serve as a lower reference.

Using the shrunken variant, this classifier is suitable for very high-dimensional data and has been shown to be competitive on certain datasets, in particular in gene analysis and document analysis. That does not mean we can expect competitive results for any high-dimensional problem; it might well be that in those domains this type of feature significance estimation is an appropriate way to identify irrelevant features. Other high-dimensional datasets may not bear so many irrelevant features.

The technique of shrinkage has also been employed in other classifier algorithms. One could say that in general it helps mitigate the curse of dimensionality. In this particular variant, however, the user needs to specify a threshold value (hyper-parameter), and that requires some optimization.


5 k-Nearest Neighbor (kNN) Classifier

The k-nearest neighbor algorithm is another simple classifier. With regard to the modeling effort, it is even simpler than the nearest-centroid classifier of the previous section. The kNN classifier makes a judgment based on the labels present in its immediate neighborhood, based on its nearest samples. Look at the case in Figure 13: we wish to label the triangle as either belonging to the circle class or to the square class. For that purpose we look at its k nearest neighbors, let us say the three nearest neighbors, i.e. we set k = 3. This corresponds to observing the neighboring points within a disk whose center lies at the testing point (the triangle) and whose radius corresponds to the third-farthest neighbor; it is outlined here by a gray circle that is a bit larger. Because that neighborhood contains two points of the circle class and one point of the square class, we would label the testing point as belonging to class circle. If we looked at k = 7, the larger gray circle in the figure, then we would label it as belonging to class square, because those are in the majority. In short, we look at the most frequent class label among the first k nearest neighbors and that is already the classification decision, the prediction for the testing sample. This classifier is therefore rather unspectacular in an algorithmic sense; no real abstraction of the training samples is sought; one could say that no actual learning takes place; sometimes that is referred to as lazy learning.

We summarize the classifier by expressing this a bit more formally. Given a training set, we simply store all its samples as a reference. To classify a novel (testing) sample, we first measure the distance between that sample and all the stored training samples. Then we sort those distances. Then we choose a neighborhood size k. Finally, we observe the k nearest class labels of the sorted reference samples: the most frequent class label becomes the label of the testing sample.

Figure 13: The k-Nearest-Neighbor (kNN) classifier algorithm. We have two classes again: squares and circles. We label a new sample (the triangle) based on its closest samples. The smaller gray circle outlines a neighborhood with three nearest neighbors, k = 3: if the sample point is judged by that neighborhood size, then it is classified as circle. The larger circle outlines a neighborhood with 7 neighbors (k = 7); if we judge the sample by that neighborhood size, then it is classified as square. Algorithmically very simple in its decision making, in practice the classifier struggles with large data sets, because taking a lot of distance measurements is time-consuming.

Evaluation As we have seen in the example of Figure 13, different values of k can yield different class labels. Which value of k gives us the best classification accuracy is hard to predict for a given data set, and we therefore simply have to try it out by exploring a range of ks. It is a hyper-parameter that requires optimization. Choosing an even number for k is not useful, because we may face a tie. In practice, a k between 3 and 9 will most likely return optimal results.

A different distance metric might also return a different classification accuracy, e.g. choosing a Manhattan distance instead of the usual Euclidean distance (Appendix B.1). This is another hyper-parameter that one has to try out.

The differences in accuracies for different ks or metrics are most likely subtle. One therefore has to apply proper folding to ensure that we do indeed achieve significantly better accuracies (Section 1.3.1). A small exploration loop is sketched below.
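A hedged sketch of such an exploration in Python (assuming the full data matrix DAT and label vector GrpLb; the candidate values are arbitrary):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

for metric in ('euclidean', 'manhattan'):              # distance metric as hyper-parameter
    for k in (3, 5, 7, 9):                             # neighborhood size as hyper-parameter
        Clf = KNeighborsClassifier(n_neighbors=k, metric=metric)
        Acc = cross_val_score(Clf, DAT, GrpLb, cv=5)   # 5-fold cross-validation
        print(metric, k, Acc.mean())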

Advantages and Disadvantages What makes the kNN classifier attractive is that not much can go wrong due to its simplicity. Most other classifiers will generate a better classification accuracy, but the accuracy of the kNN classifier can serve as a lower performance reference. One also calls that a base-line classifier. If we do not achieve a better classification accuracy with more complex classifier algorithms, then we might not have applied those properly.

Another advantage of this classifier is that it does not require many training samples. Even with a few training samples, we can make a prediction with the kNN classifier without any concerns, whereas for other classifier models a prediction based on few samples is always a very vague result.

The disadvantage of the kNN classifier is that it is slow when the data are large. The reason is that taking a lot of distance measurements is time-consuming, as pointed out already in Section 2.2. If the dimensionality of your data is small, i.e. 10 dimensions or less, then there are good solutions to the problem and software packages will automatically choose such solutions. We introduce those solutions in a subsequent paragraph. For larger dimensionality, however, the kNN classifier can become infeasibly slow, in which case you need to choose a different classifier.

Algorithmic Procedure We now formalize the learning and classification procedure more explicitly. Given is a training set, a matrix TRN with corresponding group (class) labels in the vector GrpTrn; and a testing set, a matrix TST with the corresponding group variable GrpTst (see also Algorithm 2). To classify a sample from the testing set - one row vector of TST - we measure the distance to all samples in TRN and place the distance values into an array Dist. Then we sort the distances in Dist in increasing order - nearest neighbor first - and change the order of the group labels GrpTrn accordingly. Now we observe the k nearest neighbors in the re-arranged group variable - the actual distances are not of real interest anymore.

Algorithm 2 kNN classification. DL=TRN (training samples), DT=TST (testing samples). G: vector with group labels (length = nTrainingSamples).

Initialization: scale data
Training: 'store' training samples DL with class (group) labels G. (In fact, no actual training takes place here.)
Testing: for a testing sample (∈ DT): compute distances to all training samples → D, rank (order) D → Dr
Decision: observe the first k (ranked) distances in Dr (the k nearest neighbors): e.g. a majority vote over the most frequent class label of the k nearest neighbors determines the category label

Tree-Type Data Structures As mentioned already, for a large number of samples the distance measurements become time consuming. But one can save some of the measurements by exploiting 'relations'. For instance, if we have determined that the distance between some point A and some point B is large, and we have determined that the distance between A and some point C is small, then we know that the distance between B and C is also large. There are structures that build a tree-type storage, with which one can determine distances much more quickly. Those structures are akin to the decision trees we will introduce in Section 11 - similar to the flow diagrams we had in school. Those structures are however only useful if the dimensionality is low, i.e. around 9 to 16 dimensions, a number which depends a bit on the expert's viewpoint. And they are only useful if the data is not sparse. Sparse means that the data contain a lot of zeros; that is, sparsity refers to the few entries that have a non-zero value. If those two conditions are fulfilled, then software packages will automatically create a tree-type data structure that makes distance measurements faster. If those two conditions are not fulfilled, then the distance measurements are computed in full and that can take very long for large datasets. In Matlab this is called 'exhaustive search'.
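As a hedged illustration of such a structure, scikit-learn exposes a KD-tree directly (TRN and TST as before, assumed to be low-dimensional and dense):

from sklearn.neighbors import KDTree

Tree = KDTree(TRN)                    # build the tree once over the training samples
Dist, IxNN = Tree.query(TST, k=5)     # 5 nearest training samples for each testing sample
# IxNN indexes into TRN; the corresponding labels can then be taken from GrpTrn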

Variants The classification decision can also be carried out by giving weights to the neighbors: with that we can give preference to some neighbors over others. This is implemented as an option in the respective software commands. kNN classification can also be carried out with a fixed neighborhood: one specifies a distance, which then corresponds to a radius. In Matlab this is available with rangesearch, in Python as RadiusNeighborsClassifier.


5.1 Usage in Matlab

The code segments introduced next also exist as a copy-paste example in Appendix G.6; here we highlight certain lines. The simplest way to apply Matlab's function fitcknn is to feed it the entire data set DAT with the entire group label vector GrpLb:

Mdl = fitcknn(DAT, GrpLb, 'NumNeighbors',5, 'CrossVal','on'); % cross-validated model (10 folds by default)
pc = 1-kfoldLoss(Mdl); % percent correct = 1-error

In this case, the folding is done automatically, namely a 10-fold cross-validation: 9 folds are used for training, 1 fold is used for testing; then we rotate 9 more times, obtaining 10 classification estimates in total. With the function kfoldLoss the average error over the 10 rotations is calculated automatically. We can specify a different number of folds, i.e.

Mdl = fitcknn(DAT, GrpLb, ’NumNeighbors’,nNN, ’kfold’,nFld);

If we wish to control the folding ourselves, then we can do this using the function crossval:

Mdl = fitcknn(DAT, GrpLb, ’NumNeighbors’,nNN);

MdlF = crossval(Mdl, ’kfold’,nFld);

pcf = 1-kfoldLoss(MdlF);

If we desire to fold the data completely ourselves, then here is how we would use the functions for one fold:

Mdl = fitcknn(TRN, Grp.Trn, ’NumNeighbors’,nNN);

LbPred = predict(Mdl, TST);

Bhit = LbPred==Grp.Tst; % binary vector with hits equal 1

The function predict takes the model and estimates labels for the testing data TST. It outputs a one-dimensional variable that contains the predicted group labels, called LbPred here. Now you only need to compare them with your true (actual) group labels, LbPred==Grp.Tst, and evaluate that binary vector: 1's correspond to hits (correct classification) and 0's correspond to incorrect classifications.

Own Implementation - knnsearch If you wish to write your own implementation, then the function knnsearch comes in handy. You pass the training and testing sets as variables and the number k as parameter,

[IXNN Dist] = knnsearch(TRN, TST, ’k’, 5);

and you receive the ntst × k matrix IXNN which contains the indices of the nearest training samples (ntst = number of testing samples). Those indices you need to convert to the corresponding group labels, and then you determine the most frequent group (class) within the given neighborhood k.

Older Matlab versions had a function knnclassify, which one applied as follows:

LbPred = knnclassify(TST, TRN, Grp.Trn, 5);

This is not included anymore in the code example in the Appendix.

Large Data In Matlab, data with dimensionality up to 9 will automatically be converted into a tree-type structure. One can also enforce the use of tree-type structures for higher dimensionality, but then it is left to the user to interpret the results properly.


5.2 Implementation

Coding a kNN classifier is fairly easy. Here are some fragments to show how little it actually requires (see also ThKo p82). A complete example is given in Appendix G.6.

%% --- Knn classification

nCls = 2; % # of classes

nTrn = size(TRN,1); % # of training samples

nTst = size(TST,1); % # of testing samples

GNN = zeros(nTst,11); % we will store 11 nearest neighbors

for i = 1:nTst

iTst = repmat(TST(i,:), nTrn, 1); % replicate to same size [nTrn nDim]

Diff = TRN-iTst; % difference [nTrn nDim]

Dist = sum(abs(Diff),2); % Manhattan distance [nTrn 1]

[dst ix] = min(Dist); % min distance for 1-NN

[~, O] = sort(Dist,’ascend’); % increasing dist for k-NN

GNN(i,:) = Grp.Trn(O(1:11)); % closest 11 samples

end

%% --- Knn analysis quick (for 5 NN)

HNN = histc(GNN(:,1:5), 1:nCls, 2); % histogram for 5 NN

[Fq LbTst] = max(HNN, [], 2); % LbTst contains class assignment

Hit = LbTst==Grp.Tst;

fprintf(’Perc correct for 5NN %1.2f\n’, nnz(Hit)/nTst*100);

See also the programming hints in Section D for why we chose a for-loop in this case.

In Appendix G.6.1 we give an example of how to analyze a range of different ks.

5.3 Recapitulation

Recommendation Even though the kNN may not provide a good prediction accuracy (percent correctly classified), it can serve as a lower reference - as a base-line accuracy - when testing other classifiers: if we do not obtain a better prediction accuracy with more complex classifiers, then we should consider the possibility that we have not applied them properly.

Advantages
- With the kNN classifier we obtain a base-line accuracy with an easily implementable decision model.
- The kNN classifier works even when only few training samples are available, for instance n < 5 per class, a situation in which other classifiers can give only vague predictions.

Disadvantages The larger the data set, the longer the classification takes. In professional terminology, one says that the kNN classifier has complexity O(dn), with d the dimensionality (number of attributes) and n the number of samples. This is the so-called O-notation, which we will explain in a bit more detail later (Section 18.2). To alleviate the complexity problem, one can use tree-type data structures, but that works only for limited dimensionality.

Other The kNN classifier does not have an actual learning process, that is, no effort is made in abstracting or manipulating the data to derive a simple decision model. In fact, it implements a decision rule only and nothing more.


6 Linear Classifier; Linear and Quadratic Discriminant

A linear classifier separates the data points of different classes by a straight line, see Fig. 7 again. That line is also called a decision boundary. In case of a data set with two features, such as a visual display that bears an x- and a y-coordinate, one can think of the boundary as a straight line with slope m and an offset c:

y = mx + c.    (1)

To estimate the class label of a new (testing) sample, one would simply determine on which side of the boundary it lies. In the two-dimensional case, we enter its coordinate values into the line equation and determine whether it is larger or smaller than c. That would be the decision of a binary classifier, one that discriminates only between two classes.

For three features, there would be two values of m; in that case we attempt to find a plane in a three-dimensional space; for four or more features the number of m values grows correspondingly and we talk of hyperplanes. There remains only one bias parameter c. In the terminology of pattern recognition, we speak of weights instead of slope and offset, and both are lumped into a weight vector w, whose length corresponds to the number of features.

The learning procedure of a linear classifier tries to find a suitable weight vector such that the classes are well separated in a statistical sense. There exist many procedures to find those parameter values, each one with advantages and disadvantages.

We continue by first explaining the classification procedure in more detail, then we sketch the variety of learning procedures. We start with the example of a binary classification task - a classification of two classes only - and then elaborate on the multi-class procedure - a classification with three or more classes.

Binary Classification (2 classes): A binary classifier takes a sample (observation) as input, multiplies each element with the corresponding value of the weight vector, and then sums those products to a scalar value (Figure 14, left side). In mathematics, this sum of products is also called the dot product or scalar product or inner product. There are different notations for this product (see also Appendix E). With the sample being a vector x(i) of dimensionality d (i = 1, .., d), and a weight vector w of the same dimensionality, we can write:

g(x) = \sum_{i=1}^{d} x(i) w(i)    (2)

g is also called the discrimination function in the context of a classifier; g(x) is the resulting scalar value. This product is sometimes notated in more compact form as

\sum_i x(i) w(i) \equiv x \cdot w \equiv x^t w.    (3)

The resulting scalar value is called the score (or posterior) in this context. It expresses the degree of confidence for a class label. It then needs to be thresholded to decide to which class label the sample is supposed to belong. Depending on the exact implementation of the classifier model, the threshold may be 0 - if g ranges from negative to positive - or it may be 0.5, when g ranges from 0 to 1.
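As a minimal sketch of this decision rule - with a hypothetical sample x and weight vector w, both row vectors of length d, and a threshold of 0:

g = x * w'; % dot product (eqs. 2/3): the score
lbl = 1 + (g > 0); % threshold at 0: assigns class 1 or class 2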

Multiple-Class Classification: For a classification task with multiple classes, there is a weight vector wk (of length d) for each individual class k (Figure 14, right side). Those k weight vectors are concatenated and expressed as a k × d weight matrix W, whose size - explicitly expressed - is [number of classes x number of dimensions]. The classification procedure then consists of two steps: the first step is the computation of the discriminant value for each class; the second step is the selection of the maximum discriminant value to select the most appropriate class label.

The computation of the k discriminant values is expressed with a single line, whereby we add the variable k as subscript to g to express that we obtain a one-dimensional array of values:

g_k(x) = x^t W.    (4)


Figure 14: The linear classifier takes a sample as input, multiplies each value with its corresponding weight value(s), and sums those products to a value called the score (or posterior); the score expresses the degree of confidence for a class. Binary Classifier: the weights are an array of values. The score is thresholded to decide the class label. The weighting operation corresponds to the dot product (or scalar product or inner product). Multi-Class Classifier: for each class (three in this illustration) there exists a separate weight vector; together they form the weight matrix. The input vector (sample) is multiplied with each weight vector (as in the binary classifier) and its output is placed into a corresponding entry of the output vector. In the illustration we show the sum of products only for one class for reasons of clarity. The weighting operation here corresponds to the matrix product.

The notation appears not to have changed much in comparison to the dot product (eq. 3), but the use of a matrix W makes it now a matrix product, see again Appendix E for details. As explained already, the result here is an array of k posterior values.

To select the class label with the highest confidence, we simply apply the maximum operation to the array gk:

argmax_k g_k.    (5)

'arg' stands for argument, meaning the index of where in the array gk the maximum occurs. In Matlab this is obtained with the command max by specifying a second output argument:

[vl ix] = max(Post);

That is, variable ix holds the selected class index; variable vl holds the maximum posterior value.
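Putting eqs. 4 and 5 together, a minimal sketch with a hypothetical weight matrix W of size [nCls x d] and a single sample x of size [1 x d]:

Post = x * W'; % the k discriminant (posterior) values, eq. 4
[vl ix] = max(Post); % argmax: ix is the selected class label, eq. 5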

Learning: The challenge is of course to find the appropriate weight values W, which best separate the classes. Mathematically speaking - formulated for a two-class (binary) problem - we deal with a linear programming problem, because finding the discrimination functions gi(x) amounts to solving a set of linear inequalities: w^t x_i > 0. There exists a large number of methods to solve such inequalities, of which many belong to two important categories: gradient descent procedures and matrix decomposition methods. We do not elaborate on these methods, but merely explain how one specific type of matrix decomposition is implemented (Section 6.5).

Building and applying a linear classifier can be summarized as follows:


Algorithm 3 Linear classifier principle. k = 1, .., n_classes. W (k × d) = {w_k}, G vector with class labels.
Training: find the optimal weight matrix W for g_k(x) = x^t W, exploiting G (x ∈ D_L), using matrix decomposition for instance.
Testing: for a testing sample x determine g_k(x) = x^t W (x ∈ D_T).
Decision: choose the maximum of g_k: argmax_k g_k.

Variants: The above equations represent only a principle. There exist many variants, but all linear classifier models contain at their heart the matrix product between a sample and a weight matrix of corresponding dimensionality. They are called linear because the weights enter without any exponents. The Perceptron - the precursor of Neural Networks - carries out this operation; the Support Vector Machine does so partially, though it is non-linear in other aspects; some of the Naive Bayes classifiers do so too.

Here we continue with two variants in particular, the Linear Discriminant Analysis (LDA) and the Quadratic Discriminant Analysis (QDA). The QDA is, strictly speaking, not a linear classifier anymore, but can be regarded as a general form of it. Both use the covariance matrix to find an optimal decision boundary, which will be detailed in the next Section 6.1.

6.1 Covariance Matrix Σ wiki Covariance, Covariance matrix

The covariance matrix expresses to what degree the individual variables (or features or dimensions) depend on each other (Fig. 15). It is used in many machine learning algorithms.

We recall from our high school education that for a single variable, the variance measures how much the values are spread around their mean. Analogously, when we have two variables A and B, we observe how the corresponding elements co-vary with respect to the two corresponding means, Ā and B̄ respectively:

q_{A,B} = \frac{1}{N-1} \sum_{i=1}^{N} (A_i - \bar{A})(B_i - \bar{B}),    (6)

where N is the number of observations. If the individual differences co-vary, then the covariance value q_{A,B} is positive, otherwise it is negative. The divisor N − 1 is typical for an estimate of the covariance.

For reasons of practicality, one generates a full matrix Σ, which for two variables would look as follows

\Sigma = \begin{bmatrix} q_{A,A} & q_{A,B} \\ q_{B,A} & q_{B,B} \end{bmatrix},    (7)

where the entries along the first diagonal (upper left to lower right) correspond to the 'self-variance' of the variables, namely q_{A,A} and q_{B,B}; the other two entries have the same value, that is q_{B,A} = q_{A,B}. For three or more dimensions, one would create a d × d matrix (d = number of dimensions), with the values along the diagonal again being the self-variances; corresponding values above and below the diagonal are again the same. The covariance matrix is thus a square, symmetric matrix; its calculation can also be expressed in matrix notation.

Observe that Σ is denoted in boldface, as opposed to the summation sign Σ. Strictly, Σ is the symbol for the unbiased (theoretical) covariance matrix, which we typically do not know. That is why the estimated covariance matrix is denoted with a 'hat' in books, namely as Σ̂, spoken 'sigma hat'. In Matlab, the matrix Σ̂ can be calculated with the command cov, or, if the data contain not-a-number entries, with nancov.
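As a quick check, a minimal sketch that estimates the covariance matrix of the Iris measurements (the fisheriris data set shipped with the Statistics Toolbox) and verifies its symmetry:

load fisheriris % provides the 150 x 4 measurement matrix meas
SigmaHat = cov(meas); % 4 x 4 estimated covariance matrix
diag(SigmaHat)' % the 'self-variances' of the four features
max(max(abs(SigmaHat - SigmaHat'))) % symmetry check: (numerically) zero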

Small Sample Size Problem Calculating a covariance matrix is easy. Yet for the discriminant analysis with matrix decompositions, this matrix also needs to show certain properties, e.g. it needs to be positive semi-definite, meaning - loosely speaking - that none of its eigenvalues may be negative. We omit the details of those properties. Our concern is that this constraint is sometimes not fulfilled, in particular if the dataset has only few samples. And if the covariance matrix is generated for each individual class, as in case of the QDA, then the problem of obtaining an 'adequate' covariance matrix is aggravated.


Figure 15: The covariance matrix of the Iris flower data set (features: sepal length, sepal width, petal length, petal width). It expresses how features covary with each other. Along the diagonal are the variance values for the individual features; the one for the petal length is very large - cross-check with the box plot in Fig. 11. In the upper and lower half are the pair-wise covariance values; they can be negative. The upper and lower half have the same values; the matrix is symmetric.

This is known as the small sample size problem. If the appropriate covariance matrix cannot be generated, then newer Matlab versions will produce the following error:

Pooled-in covariance matrix is singular. Set DiscrimType property to ’pseudoLinear’ or ’diagLinear’ or increase Gamma.

Thus, we will add the option 'DiscrimType', 'pseudoLinear' to the function fitcdiscr, shown in more detail below. Older Matlab versions complain with the following error:

The pooled covariance matrix of TRAINING must be positive definite.

To work around this barrier, it is easiest to apply a dimensionality reduction using the PCA and then retry with lower dimensionality, which will be the topic of the upcoming Section 7.

6.2 Linear and Quadratic Discriminant Analysis

The Linear Discriminant Analysis uses the covariance matrix as shown for the data set in Fig. 15, namely for the entire data set, irrespective of what classes are present. Intuitively, one would think that this is unspecific, as we have not made use of potential distribution and variance differences between classes. The Quadratic Discriminant Analysis exploits that class information: it computes the covariance matrix for each individual class, and we end up with K times more parameters to tune (K the number of classes). It turns out that this increase in parameters does not always help - due to the curse of dimensionality again. The situations in which each discriminant analysis tends to perform better are summarized now:

Linear Discriminant Analysis (LDA): one covariance matrix for all (classes). In professional terminology one says: the LDA assumes that the observations within each class are drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix that is common to all K classes. Practically speaking, it has fewer parameters than the QDA, but shows a lower prediction variance, meaning it is less likely to vary around the actual target value. It is preferably used when the number of training observations is small. JWHT p149, 4.4.4

Quadratic Discriminant Analysis (QDA): each class has its own covariance matrix, resulting in K times more parameters to tune. This variant is recommended when the training set is very large, such that the covariance matrices can be easily estimated, or if the assumption of a common covariance matrix for the K classes does not make sense at all.


There exist three variants for both of these discriminant classifiers. One uses only the diagonal entries of the covariance matrix. The second, the pseudo variant, is preferably used in case of the small sample size problem. And we can apply the method of shrinkage, as we did already for the Nearest-Shrunken Centroid classifier, a process called regularization here.

Diagonal Variant: The diagonal variant uses only the diagonal entries of the covariance matrix and therefore operates with far less information; one now uses only the variance information of the individual features. The advantage is that it is easier to determine the weight matrix, in particular in case of the small sample size problem. The disadvantage is that our prediction accuracy will most likely drop.

Pseudo Variant: To calculate the optimal weight matrix, the modern versions of the discriminant analysis rely on the so-called inverse of the covariance matrix. That inverse is difficult to calculate if only few samples are present. One can however determine a so-called pseudoinverse, which can be considered an estimate of the actual inverse. This will return a lower classification performance, but we can at least predict in face of the small sample size problem.

Regularization (Shrinkage): this is again a technique for feature emphasis, very much like the method of shrinkage we mentioned for the Nearest-Shrunken Centroid classifier (Section 4.1). A regularized version of the diagonal variant of the LDA essentially corresponds to the Nearest-Shrunken Centroid classifier: same principle, but different names due to different viewpoints.

6.3 Usage in Matlab

In Matlab the discriminant analysis is implemented with the function script fitcdiscr. One can choose between the variants using the option 'DiscrimType'; for regularization we specify the parameter 'Gamma'. Here are some examples of how to use those functions:

MdCv = fitcdiscr(DAT, GrpLb, ’kfold’,nFld, ’DiscrimType’, ’linear’);

pc = 1-kfoldLoss(MdCv); % percent correct = 1-error

This is a formulation that is analogous to the use of the kNN classifier, compare with Section 5.1; see also the overview in Appendix G.1 again. Folding can be explicitly instructed as

Mdl = fitcdiscr(DAT, GrpLb);

MdCv = crossval(Mdl, ’kfold’,nFld);

pc = 1-kfoldLoss(MdCv);

which again is analogous to the kNN example. Appendix G.8 ensures that the reader has a working example.

If one intends to fold the data by oneself, then we apply the for-loop as exemplified in the kNN example; the generic folding loop is also illustrated in Section 8.1.

Classification errors/difficulties Should the data set be difficult to handle for a linear classifier, then we can try adding the option pseudolinear as follows:

Mdl = fitcdiscr(DAT, GrpLb, ’discrimType’, ’pseudolinear’);

This may return sub-optimal results, but at least the classification task can be solved.
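The other variants from Section 6.2 are requested in the same manner; a sketch with the documented option values ('Gamma' is a regularization amount between 0 and 1):

MdQuad = fitcdiscr(DAT, GrpLb, 'DiscrimType','quadratic'); % QDA: one covariance matrix per class
MdDiag = fitcdiscr(DAT, GrpLb, 'DiscrimType','diagLinear'); % diagonal variant of the LDA
MdReg = fitcdiscr(DAT, GrpLb, 'DiscrimType','linear', 'Gamma',0.5); % regularized (shrunken) LDA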

6.4 Usage in Python SKL p173

The SciKit-Learn package offers the module sklearn.discriminant_analysis to run the discriminant analysis, whereby the different variants are partly in separate classes: for instance, one class for the LDA, LinearDiscriminantAnalysis, and one class for the QDA, QuadraticDiscriminantAnalysis. There are options to specify shrinkage (regularization); this is however better studied in the documentation. An example of applying the LDA was given in the overview (Appendix G.1).


6.5 Implementation Matrix Decomposition (Matlab)

In the following we point out how a linear discriminant classifier can be implemented using matrix decompositions, whereby the code here is taken from the old Matlab function classify. I suspect the implementation has not changed substantially since. The code was slightly modified to make it compatible with our variable names (TREN, TEST, ...). There are two essential steps for training and one for testing (applying), enumerated as steps 1 to 3 now:

1. Learning: for each class the mean of its training samples is calculated and assigned to the matrix Gmeans - as it was done for the Nearest-Centroid classifier (Section 4):

Gmeans = NaN(nGrp, nDim);

for k = 1:nGrp

Gmeans(k,:) = mean(TREN(GrpTrn==k,:),1);

end

2. Learning: Now we estimate the covariance matrix using the orthogonal-triangular decomposition, which is carried out with the command qr: as argument we enter the training data minus the group means. The second line performs a scaling division:

[~,R] = qr(TREN - Gmeans(GrpTrn,:), 0);

R = R / sqrt(nObs - nGrp); % SigmaHat = R’*R

s = svd(R);

if any(s <= max(nObs,nDim) * eps(max(s)))

error(message(’stats:classify:BadLinearVar’));

end

logDetSigma = 2*sum(log(s)); % avoid over/underflow

Then another matrix decomposition follows, namely the singular value decomposition, see the command svd in the third line. After that it is verified that the covariance matrix is adequate. We do not further explain these decompositions for reasons of brevity. The result is a matrix R and a scalar logDetSigma, which are the equivalent of the weight vector w.

3. Applying: When we classify testing samples, the function generates a 'confidence' value for each class, which is also called the posterior; it is placed into the matrix D here (number of testing samples × number of classes). The operation is essentially the dot product (eq. 2):

for k = 1:nGrp

A = bsxfun(@minus,TEST, Gmeans(k,:)) / R;

D(:,k) = log(prior(k)) - .5*(sum(A .* A, 2) + logDetSigma);

end

Some of this will be clarified when we look at the Naive Bayes classifier in Section 17; for the moment we simply apply this procedure without understanding every detail.

With the posteriors in D we determine the class label by the argmax operation, [V Lb] = max(D,[],2); (eq. 5).

6.6 Recapitulation

A discriminant classifier is simple to apply, returns reasonable results and is fast in testing. The computation of the weights is done with the covariance matrix, whose space complexity is therefore O(d^2); for classification one needs to perform a matrix product only, whose space and time complexity is only O(d). For those reasons, the discriminant classifier is a very popular classifier.

Advantages There is no need to set any hyper-parameters in the simplest form of this classifier, as opposed to selecting a nearest-neighbor value for the kNN classifier, for instance. The efficiency of a linear discriminant classifier is unparalleled: to obtain a better prediction accuracy with a different classifier, the learning duration will increase substantially and the adjustment of more and more parameters will be required.


Disadvantages
- When the number of observations is small, it can be difficult to obtain an adequate covariance matrix. In that case we either work with the diagonal variant or the pseudo variant, or we apply the Principal Component Analysis (PCA) first - coming up in the next section - and then apply the linear classifier again. This little hurdle does not diminish the advantages of the linear classifier - the combination of the LDA with the PCA is one of the most used classification schemes.


7 Dimensionality Reduction

Dimensionality reduction is the process of compacting the data matrix without losing too much information (Fig. 10, right side). It serves to combat the curse of dimensionality and to find a reduced space, because the lower the dimensionality, the faster the classification or clustering algorithm will be, during both the fitting and the prediction procedure. The risk of dimensionality reduction is that some information is lost, but the gain in convenience usually outweighs the small loss.

We have already encountered processes that select or transform features in the classifier algorithms introduced previously (Nearest-Shrunken Centroid classifier, Linear Discriminant Analysis). They share some similarities with the processes we will introduce in this section, but in those classifier algorithms the data matrix was not actually reduced: during the classification process, one still operates with the original dimensionality. For the following methods, in contrast, one effectively reduces the data matrix in width; it is a true reduction in dimensionality. It can occur in two principally different ways:

1. Feature Selection (aka Feature Reduction, Variable (Subset) Selection): the selection of the (hopefully) best subset of features of the original set of features (Section 7.2). In other words, we knock out features that are potentially irrelevant or marginally significant. In its simplest form, one estimates the significance of individual variables by applying a simple statistical or information-theoretic measure that estimates the relevance of a variable; that is not so different from the method of shrinkage. Or we apply a classifier in a sequential search process to find out which features can be added or dropped. The features of the reduced subset are unchanged, as opposed to the subset of a transformation, sketched next.

2. Feature Transformation/Generation: the transformation of the original set of features to create a new, reduced set of features (Section 7.1). The methods for that are matrix manipulations - heavy mathematics. In such a transformation all features are combined to generate a new, reduced set of features. The new features of the reduced subset are different from any of the original features.

Sometimes one also speaks of feature transformation in case of the Linear Discriminant Analysis (Section 6). That transformation however exploits the presence of group labels to find an optimal, reduced set of features. It can therefore not be used with clustering methods, because we do not have any group information for a clustering task. The methods introduced here do not require group labels; for that reason they are sometimes classified as unsupervised learning methods.

This type of dimensionality reduction is also sometimes called feature extraction. It should not be confused with processes of creating features from individual observations. For instance, in computer vision one talks of feature extraction if one detects visual features in images - feature extraction in individual observations. In the process discussed below, an analysis across observations is carried out to create new variables.

7.1 Feature Transformation (Unsupervised Learning)

There exist several feature transformation methods. The most popular transformation is the principal component analysis (PCA), which is treated next (Section 7.1.1).

7.1.1 Principal Component Analysis (PCA) DHS p115, 568

Alp p113

ThKo p326
The principal component analysis (PCA) is also called the Karhunen-Loeve transform - it was invented multiple times. The PCA works by realigning the coordinate system to the distribution of the data. wiki Principal component analysis

Example: We have a two-dimensional dataset whose overall distribution appears like the shape of an ellipse, see Figure 16, left side. The ellipse's larger diameter lies at an angle of 45 degrees. This elliptical point cloud has two major 'directions', which are denoted as z1 and z2. The first one is the dominant one, the second one is aligned orthogonally to the first one. The PCA detects those principal axes and then places the axes of the original coordinate system - x1 and x2 - onto the new directions z1 and z2, illustrated


on the right side in Figure 16.

In other words, the PCA determines the 'directions' of greatest variance in the data, rotates the coordinate axes to those directions, and moves the origin of the coordinate axes onto the data's center.

Figure 16: Principal component analysis. Left: a set of points whose distribution happens to be elliptical and that can be represented by the two axes z1 and z2. The PCA procedure centers the samples and then rotates the coordinate axes to line up with the directions of highest variance. Right: z1 and z2 as the new axes. Assume now that the variance on z2 is very small - a very thin ellipse -, then we could decide to ignore that axis and we would have reduced the dimensionality from two to one.

There exist different procedures to perform the PCA. Here we sketch the one using the covariance matrix. It consists of five basic steps:

1. The mean and covariance of the data are determined. The mean results in a single vector µ (dimensionality d); the d × d covariance matrix Σ was introduced before (Section 6.1).

2. The eigenvectors e_i and eigenvalues λ_i are computed from the covariance matrix. For each dimension i there exists an eigenvector e and its corresponding eigenvalue λ. The eigenvectors represent the directions in the point distribution; the eigenvalues represent the 'significance' of those directions. We omit the details of how they are generated.

3. We now choose the k largest eigenvalues and their corresponding eigenvectors. There are different ways to choose k.

4. We build a d× k matrix A consisting of the k eigenvectors.

5. The original data x are multiplied with matrix A in order to arrive at the reduced data x_r, which then is of dimensionality [number of samples × k]. This multiplication occurs as explained in Appendix E.

Algorithm 4 summarizes the individual steps. To make accurate estimates with our classifiers later, the optimal k is determined only on the training set, and we thus apply the last step twice: once to the training set and once to the testing set.

Matlab Matlab provides the command pca (older Matlab versions use princomp), which carries out steps 1 and 2 of Algorithm 4. Here are the steps in Matlab code; the explanations are added below. A complete example is given in G.9.

[coeff,~,lat] = pca(DAT); % the princ-comp analysis in Matlab

nPco = round(min(size(DAT))*0.7); % reduced dimensionality

PCO = coeff(:,1:nPco); % select the 1st nPco eigenvectors

% --- Reduce Data:


Algorithm 4 The steps of the PCA (for the method using the covariance matrix); performed on D_L.
Parameters: k: number of principal components
Initialization: none in particular
Input: x_j(i): list of observations (D_L), j = 1, .., nObservations, i = 1, .., d (nDimensions)
1) Compute µ: d-dim mean vector, and Σ: d × d covariance matrix
2) Compute eigenvectors e_i and eigenvalues λ_i (i is the dimension index)
3) Select the k largest eigenvalues and their corresponding eigenvectors
4) Build the d × k matrix A with columns consisting of the k eigenvectors
5) Project the data x_j onto the k-dim subspace: x' = F_1(x) = A^t (x − µ)
Output: x'_j: list of transformed observations

DATRed = zeros(nObs,nPco); % init reduced data matrix

for i = 1:nObs,

DATRed(i,:) = DAT(i,:) * PCO; % transform each sample

end

DAT = DATRed; % replace ’old’ dataset with new, reduced dataset

clear DATRed; % clear to save on memory

- The variable DAT is the original data of size n × d [nObs = n, number of samples/observations]. The command pca returns a d × d matrix called coeff as well as a vector of latencies lat; we ignore the second output argument for the moment. The matrix coeff corresponds to the eigenvectors, the variable lat corresponds to the eigenvalues - if the eigendecomposition was used (see step 2 in Algorithm 4). The default for the command pca is however a singular value decomposition, another matrix decomposition method, which we encountered already for the linear classifier.

- We then choose the number k (= nPco in the code), which can be done based on the values in lat. For simplicity, we choose k here based on dimensionality, where the proportion value of 0.7 is only a suggestion, but one that should return reasonable results. Should there be fewer samples than dimensions - hopefully not -, then k needs to be smaller than the number of observations minus one; that is why we use the minimum function on the size output. More on the choice of k will follow later.

- Then we create the submatrix PCO of dimensionality d× k, corresponding to matrix A in Algorithm 4.

- Finally, we multiply each sample (x = DAT(i,:)) by this submatrix and obtain the data DATRed with lower dimensionality (size n × k). That is the data you would then use for classification, e.g. with fitcdiscr.

Choice of number of principal components (k) There does not exist a single recipe for choosing an optimal k. Here are some suggestions:
- One dataset: if one classifies only a single dataset, one could manually observe the 'variances' in variable lat (latencies). Some prefer the elbow method, meaning one takes the value where the spatial distance to the straight line connecting the curve endpoints in the plot is largest. More explicitly: first calculate the line equation connecting the largest and the smallest latency value; then measure the distance between each latency and that line; choose the latency with the maximal distance as the optimal k.

- Several datasets: observe the lat values for some sets and try to derive a reasonable rule, e.g. take the first k components until 95 or 99 percent of the total variance is used up. For that purpose simply normalize the values, LatNrm = lat/sum(lat);, whose sum should then equal one (see the sketch after this list).

- Try a range of values and observe where the prediction accuracy is maximal. If the task is classification, then we choose k depending on the accuracy; k should however be the same for all folds in order to predict properly. If the task is clustering, then we choose k depending on the criterion for cluster validity. This will be further explained in the upcoming sections.


- Restriction: a meaningful number of principal components has to be less than the number of samples minus one, hence the operation min(size(DAT)) in the above code example. This is also mentioned in the Matlab documentation as the 'degrees of freedom'.
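A minimal sketch of the cumulative-variance rule mentioned in the list, assuming the variable lat from the pca call above:

LatNrm = lat / sum(lat); % normalized eigenvalues (latencies), sums to one
nPco = find(cumsum(LatNrm) >= 0.95, 1); % smallest k that explains 95 percent of the variance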

Recapitulation

Advantage The principal component analysis works quasi without parameters. We merely need to choose the desired number k of components. This can be regarded as a hyper-parameter, but it is not one that is required to make the transformation work; it is one that selects from the output.

Disadvantage It is possible that the PCA eliminates some dimensions that could have been useful for discriminating some classes. But that loss in discrimination is typically small and thus negligible in comparison to the ease of use of the procedure. In the end, the PCA often allows us to obtain a better prediction accuracy than when using the original dimensionality.

7.2 Feature Selection wiki Feature selection, Alp p110

ThKo p261, ch5, pdf 274

SKL p262, 3.1.13

Feature selection approaches can be categorized into three groups. One group are filter methods, which operate mostly on single variables (Section 7.2.1). Another one are wrapper methods, which operate mostly on groups of variables (Section 7.2.2). And the third group are embedded methods, which are built into the classifier algorithm, such as the method of shrinkage that the Nearest-Shrunken Centroid classifier carries out (Section 4.1), or the decision tree, which automatically selects features by design. As mentioned already, embedded methods do not really reduce the dimensionality. Filter and wrapper methods in contrast attempt to completely eliminate irrelevant features. Common to all three approaches is that they typically exploit the grouping variable GrpLb; thus they are not suitable for use in clustering, since in clustering we lack any group information; an exception is the filter method observing feature variance (Section 7.2.1). In Python, the feature selection algorithms, filter and wrapper methods, are offered in the module sklearn.feature_selection. In Matlab they exist in various toolboxes.

7.2.1 Filter Methods

Filter methods typically observe single features only and calculate some statistic. Based on those statistics one derives a threshold and selects those features that appear more 'valuable'. Due to their relative simplicity, those methods are fast.

Feature Variance This is the simplest technique to eliminate potentially irrelevant features. We calculate the variance of the values of a feature, and if the variance is below a certain threshold, then we omit that feature. Put inversely, we select high-variance features. This is not without the risk of potential loss of information, because even a low-variance feature could carry valuable information for classification in conjunction with some other feature. If the variance is zero however, then the feature carries no information at all, as it has the same value throughout all observations; this elimination can be regarded as data preparation (Section 3). Some classifier algorithms eliminate zero-variance features automatically. In Python this technique is available as sklearn.feature_selection.VarianceThreshold; in Matlab you need to program it yourself. This technique can also be applied before a clustering algorithm, as it does not require any group labels.
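A minimal Matlab sketch of such a variance filter (the threshold thr is a hypothetical value to be tuned; 0 removes only constant features):

thr = 0; % hypothetical threshold
V = var(DAT, 0, 1); % variance of each feature (column)
DAT = DAT(:, V > thr); % keep only the features above the threshold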

Feature Importance with Group Variable For each feature we determine, with some statistical or other measure, whether the values for one class actually differ from the values of another class. The easiest way would be to compare the mean values for each class, but that is somewhat too simplistic, because the values for one class can substantially overlap with the values of another class and thus have very similar mean values, yet be different in shape nevertheless. One therefore resorts to statistical tests, such as the t-Test, or to information-theoretic measures, such as the Kullback-Leibler distance. One such measure will be


introduced in more detail in Section 8.2.3. Using such a measure one ranks the variables and selects the most significant ones.

Many of these measures compare only two distributions, meaning two classes. Thus, if we have three or more classes, then we need to write a loop that tests each class versus all others. Furthermore, some of the measures assume that the variables show a normal (Gaussian) distribution, which is rarely the case. It is therefore beneficial to observe several measures and choose the one that yields the highest performance.

Usage in Matlab The bioinformatics toolbox provides the function rankfeatures, which ranks the variables according to some selected criterion. The function works only for binary (two-class) problems. Thus, for more than two classes we need to write a loop and take the average, for instance.

Two Classes If Bg1 is a binary vector that identifies one of the two classes, then one would call the function as follows:

[O V] = rankfeatures(DAT’, Bg1, ’criterion’, ’ttest’);

Note that the data matrix DAT is passed in transposed (rotated) format - unlike for so many other pattern recognition functions requiring the data matrix. O is the order of the feature indices in decreasing order of significance; V holds the criterion values in the order corresponding to DAT. Appendix G.10 contains an example applying the function on artificial data.

Multiple Classes In this case, we test one class against all others - to create a binary evaluation - and we loop through all c classes (groups). We write a loop in which the binary vector (Bg1 above) identifies one class by 1s, for instance, and the remaining classes are all set to 0s. Then we again apply the function rankfeatures as above. We obtain the values V c times; we then average those values and rank the features again, as sketched below.
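A sketch of such a one-vs-rest loop, assuming numeric group labels GrpLb running from 1 to nCls:

nFeat = size(DAT,2);
Vsum = zeros(nFeat,1);
for c = 1:nCls
    Bg1 = (GrpLb==c); % class c versus all others
    [~, V] = rankfeatures(DAT', Bg1, 'Criterion','ttest');
    Vsum = Vsum + V; % accumulate the criterion values
end
[~, O] = sort(Vsum/nCls, 'descend'); % average over classes and re-rank the features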

If the function script rankfeatures is not available, then try the example in Appendix G.12.1: it calculates the so-called area-under-the-curve (AUC) value of two distributions (coming up in Section 8.2.3).

7.2.2 Wrapper Methods

Here we apply a classification algorithm to evaluate which features could be more significant than others. We can apply the classification algorithm to single features or to groups of features. To ensure that we estimate the significance as accurately as possible, it is best to apply cross-validation (Section 1.3.1).

To guarantee that we find the optimal subset of features, one would have to test all (binomial) combinations of features individually, which is also known as exhaustive search. Exhaustive search becomes however unfeasible for high dimensionality. Instead, suboptimal but more time-efficient search techniques are employed. The two most popular ones work by gradually increasing or decreasing the number of features, called sequential forward and sequential backward selection, respectively.

Sequential Forward Selection Here we start by testing each feature individually and then select the one with the lowest training error (or highest prediction accuracy). Let us say we start with 10 features; we select the best one, then 9 remain. In a second round, we test two features together, namely the selected one from the first round paired with each one of the 9 remaining ones. From those 9 tests, we select the best fit again, and that leaves us with 8 features remaining. In a third round we then test three features, etc. The process is halted when the training error does not decrease any further, or the prediction accuracy does not increase anymore. In Appendix G.13 we provide an example of this process.

The cost of this search process is calculated as follows: to decrease the dimensions from full dimensionality d to reduced dimensionality k, we need to train and test the system d + (d−1) + (d−2) + ... + (d−k) times, which is O(d^2). Thus, this is quadratic complexity in comparison to the typical filter method introduced above.


Sequential Backward Selection This process starts with the full dimensionality: with 10 features, 10 tests are initially performed, each on a 9-dimensional data set. Then one removes the feature whose removal yields the smallest training error. In a second round, 9 tests are carried out on an 8-dimensional data set, etc.

Usage in Matlab In Matlab, the function sequentialfs allows both forward and backward selection.
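A sketch of how sequentialfs could be called with a simple misclassification criterion (GrpLb is assumed to be a numeric label vector; classify serves here as the criterion classifier; the option 'direction','backward' switches to backward selection):

fun = @(XT,yT,Xt,yt) nnz(yt ~= classify(Xt, XT, yT)); % number of misclassified test samples
inmodel = sequentialfs(fun, DAT, GrpLb); % logical vector marking the selected features
DATRed = DAT(:, inmodel); % the reduced data matrix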


8 Evaluating and Improving Classifiers HTF p219, ch7, pdf238

We now elaborate on how to properly characterize and optimize the performance of a classifier. One important issue was already introduced, namely the proper estimation of the generalization performance of our classifier model by using the process of cross-validation; we elaborate on that issue in the following Section 8.1. Then we introduce performance measures that were developed in particular for binary classifiers, which often implement a detection task (Section 8.2): we will learn that there is often a trade-off and that sometimes we wish to bias the decision in favor of one side. Then we observe some more tricks that can be applied to analyze multi-class tasks (Section 8.3). And we close with more tricks on analyzing and improving classifier performance (Section 8.4).

8.1 Types of Error Estimation, Variance DHS p465

When we estimate the generalization performance, there are two types of measures to characterize that estimate: bias and variance. The two measures are analogous to the terms 'accuracy' and 'precision'. They are defined as follows (see also Figure 17):

Bias measures the accuracy or quality of the match, that is, the difference between the estimated and the true value. A high bias implies a poor match.

Variance measures the precision or specificity of the match. A high variance implies a weak match.

Figure 17: Two measures to characterize generalization performance (prediction accuracy, testing error, etc.). Individual estimates are shown as black dots on an axis - let's say from an 8-fold cross-validation. Our estimate is then the average of those 8 values; the variance is the range of those 8 values. The deviation of the estimate from the true value is called bias. Variance expresses how scattered the individual estimates are around the overall estimate. We would like both - variance and bias - to be small.

Bias and variance are affected by two issues. One issue concerns the type of resampling we use - the way we apply the method of cross-folding, for instance - discussed below (Section 8.1.2). The other issue is the classifier model itself: the algorithm can affect the variance in particular, explained next.

8.1.1 Variance in Classifiers

When we train a model repeatedly, we observe that the variance for some model types is larger than for others. There can be two reasons for high variance. One is a small number of samples, as in case of the LDA, which exhibits increased variance for small sample sizes (Section 6.2). The other reason is an unstable learning algorithm: the decision tree, which will be introduced in the next section, has a very high variance, and the same holds for the Perceptron learning algorithm (Section 16). On the other hand, variance is sometimes reduced by a learning algorithm. For instance, the methods for shrinkage or regularization (nearest-shrunken centroid classifier and discriminant analysis) reduce the variance of a classifier; those methods have the negative side-effect of a small increase in bias, but in that case the overall benefit of the improvement outweighs the downside. It would seem that high variance is a disadvantage, but it can in fact be exploited when building more complex classifiers. Classifiers with high variance are particularly useful when building ensemble classifiers (upcoming in Section 12).


Practically we pay attention to variance in the following ways:

Model Selection when Different Bias: When we select models, we need to make sure that the chosen model is indeed better by verifying that its prediction accuracy is significantly higher than that of the other models. Ideally, its prediction accuracy is better by the sum of half the standard deviation of each classifier of the pair of classifiers under investigation.

Model Selection when Equal Bias: If we tested two models and both show equal prediction accuracy but different variance, then we choose the one with the lower variance.

Building an Ensemble Classifier: when we train a model with the same base-learner, then that base-learner should preferably show a large variance (Section 12).

8.1.2 Resampling

Table 1 gives a summary of resampling variants and how they influence bias and variance. The first four variants should be clear from what we have treated so far. The last one, 'Bootstrap', will become clearer when we introduce ensemble classifiers (Section 12). The Matlab function crossvalind generates indices for most of these variants.

Table 1: Error estimation methods (from Jain et al. 2000). n: sample size, d: dimensionality. wiki Resampling (statistics)

Resubstitution - Property: all the available data is used for training as well as testing; training set = test set. Comments: optimistically biased estimate, especially when n/d is small.

Holdout - Property: half the data is used for training and the remaining data is used for testing; training and test sets are independent. Comments: pessimistically biased estimate; different partitionings will give different estimates.

Leave-one-out, Jackknife - Property: a classifier is designed using (n−1) samples and evaluated on the one remaining sample; this is repeated n times with different training sets of size (n−1). Comments: estimate is unbiased but it has a large variance; large computational requirement because n different classifiers have to be designed.

Rotation, n-fold cross validation - Property: a compromise between holdout and leave-one-out methods; divide the available samples into P disjoint subsets, 1 ≤ P ≤ n; use (P−1) subsets for training and the remaining subset for testing. Comments: estimate has lower bias than the holdout method and is cheaper to implement than the leave-one-out method. Ix = crossvalind('kfold',100,5);

Bootstrap - Property: generate many bootstrap sample sets of size n by sampling with replacement; several estimators of the error rate can be defined. Comments: bootstrap estimates can have lower variance than the leave-one-out method; computationally more demanding; useful for small n.

Matlab Here we show again how to explicitly fold data. We did so already in the classification example for the kNN algorithm (Appendix G.6), but it is shown here in isolation for clarity.

nFld = 5;

Fld = crossvalind(’kfold’, Grp, nFld);

for i = 1:nFld

Btst = Fld==i; % logic vector indicating testing samples

Btrn = ~Btst; % logic vector indicating training sample

GrpTren = Grp(Btrn); % grouping variable for training

GrpTest = Grp(Btst); % grouping variable for testing

LbPrd = classify(DAT(Btst,:), DAT(Btrn,:), GrpTren); % now classify

Bhit = LbPrd==GrpTest; % logic vector with hits

...further analysis...

end

An example of using this explicit folding was given in Appendix G.6 (the kNN example). Should the command crossvalind be missing, we can also write the function ourselves, see for instance Appendix G.11.


8.2 Binary Classifiers Alp p489

Binary classifiers are often developed for tasks where we need to spot specific information in a large clutter of general information, such as spam detection in incoming emails, cancer detection in medical images, face detection in street images, etc. Those tasks can be considered search tasks, and we are particularly interested in finding that signal with high confidence. For those tasks it is not overly informative to report only the prediction accuracy as we did so far, because if we detect our signal in a very 'liberal' way, then our system appears to do very well, but that can have its cost. For instance, if our spam filter labels every incoming email as spam, then we certainly do not miss a single true spam email - 100 percent of the spam is correctly detected - but no regular email would show up in our inbox either. Obviously, we need more sophisticated measures to estimate the success of a search task. Those measures stem from the domains of signal detection theory and information theory.

We use a slightly more sophisticated example to understand the logic of those measures: a doctor inspects an X-ray image and makes a decision whether the patient requires treatment or not. To make that decision the doctor needs to detect some specific pattern (signal) - for instance, the bones show an unusual texture. But because analyzing X-ray images is difficult and because there exists no absolute certainty, a decision can be one of four possible response outcomes. How those four response outcomes occur may be best understood by looking at the graph in Figure 18.

Figure 18: Discrimination of two 'overlapping' classes: probability/frequency P versus variable d (or feature, dimension). The left distribution represents the signal (spam email, cancer in a medical image, face in a street image); the right distribution represents the noise (regular emails; healthy tissue; street objects). A decision threshold θ is set and we obtain four response types: true positives (TP; hit), false positives (FP), false negatives (FN; miss), and true negatives (TN).

The graph depicts two overlapping density distributions: the one on the left represents the signal - the pattern that the doctor needs to detect; the one on the right represents the noise - or background or distracter - from which the doctor needs to discriminate the pattern. To make a decision, a threshold θ is applied and a side is chosen: if the left side of θ is chosen, one has considered the stimulus (or input) as the signal; if the right side is chosen, one has considered the stimulus as the distracter. For either choice, our prediction may be right or wrong. If the stimulus is considered the signal and it truly was the signal, then we talk of a 'hit' or 'true positive'; if it was not the signal, then we call it a 'false positive' or 'false alarm'. Analogously, if the stimulus is considered the distracter and it truly was the distracter, then we talk of a 'true negative' or 'correct rejection'; if it was not the distracter, then we label it a 'false negative' or 'miss'. Those four response types are summarized below:

TP - true positive (hit): left of θ, under signal. Example: sick people correctly diagnosed as sick.
TN - true negative (correct rejection): right of θ, under noise. Example: healthy people correctly identified as healthy.
FP - false positive (false alarm): left of θ, under noise. Example: healthy people incorrectly identified as sick.
FN - false negative (miss): right of θ, under signal. Example: sick people incorrectly identified as healthy.


Clearly, wherever the decision threshold θ is set, it results in a trade-off. If the doctor wants to avoid unnecessary treatment - of persons incorrectly identified as sick - then he (implicitly) chooses a threshold more to the left; thereby the doctor also produces some false negatives - sick people incorrectly classified as healthy. And vice versa: if the doctor attempts to ensure that all sick people are treated, then he chooses a threshold more to the right, but thereby also treats some healthy persons.

To quantify this trade-off there exist different measures. In a first step, the responses are arranged in a so-called confusion matrix (Section 8.2.1). From that matrix we take certain quantities with which we can calculate different trade-off measures, depending on the preferred viewpoint (Section 8.2.2). If we have the possibility to influence the system performance by changing some parameters, then we can even measure an ROC curve (Section 8.2.3).

8.2.1 Confusion Matrix wiki Confusion matrix

In the confusion matrix we organize the four response outcomes as 'predicted' versus 'actual' and so arrive at a 2 × 2 array. This response table is also known as a contingency table or cross tabulation.

                          Actual
                          Matches   Non-matches
Predicted  Matches        TP        FP            P'
           Non-matches    FN        TN            N'
                          P         N

The columns sum up to the actual number of positives (P) and negatives (N), while the rows sum up to the predicted number of positives (P') and negatives (N'):

P  = # positive actual instances = TP + FN
N  = # negative actual instances = FP + TN
P' = # positive classified instances = TP + FP
N' = # negative classified instances = FN + TN

Example: Assume you have made 100 observation decisions in your task, of which 22 times you predicted the presence of the signal and 78 times you predicted its absence, P' and N' respectively. Later you are informed that 18 of your 'presence' predictions were correct (= true positives), and that 76 of your 'absence' predictions were also correct (= true negatives). Then you can calculate the frequency of the two remaining response types, namely FP and FN:

                          Actual
                          Matches   Non-matches
Predicted  Matches        TP = 18   FP = 4        P' = 22
           Non-matches    FN = 2    TN = 76       N' = 78
                          P = 20    N = 80        100 total

8.2.2 Measures and Response Manipulation wiki Precision and recall

Given the quantities calculated above, we can then determine measures that quantify the classification performance more precisely than if we used only the hit rate and the above confusion matrix. Those quantities often come in pairs and are typical for certain domains, see Table 2. For three of those four domains, one quantity is the same, namely the hit-rate TP/P: it is called true-positive rate, recall and sensitivity in the respective domains.


Table 2: Performance measures for a binary task. Some measures have multiple names; each domain (see the rightmost column) prefers different terminology and definitions. In Matlab: classperf.

Name             Formula                                            Other Names                       Preferred Use
Error            (FP + FN) / (P + N)
Accuracy         (TP + TN) / (P + N) = 1 - Error
True Pos Rate    TP / P                                             Hit-Rate, Recall, Sensitivity     ROC curve
False Pos Rate   FP / N                                             Fall-Out Rate
Precision        TP / P'                                                                              Information Retrieval
Recall           TP / P                                             True Pos Rate, Sensitivity
Sensitivity      TP / P                                             True Pos Rate, Recall             Medicine
Specificity      TN / N = 1 - FP-rate
F1 score         2 · (Precision · Recall) / (Precision + Recall)

For the above example, the true-positive rate (TP-rate or TPR) is 0.90; the false positive rate (FP-rate or FPR) is 0.05; the accuracy is 0.94.
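
To make the arithmetic concrete, the following lines recompute these values - plus precision and the F1 score, which are not stated above - directly from the example confusion matrix (the variable names are ours):

TP = 18;  FP = 4;  FN = 2;  TN = 76;
P   = TP + FN;                       % actual positives (20)
N   = FP + TN;                       % actual negatives (80)
TPR = TP / P;                        % hit rate / recall / sensitivity = 0.90
FPR = FP / N;                        % fall-out rate                   = 0.05
Acc = (TP + TN) / (P + N);           % accuracy                        = 0.94
Prc = TP / (TP + FP);                % precision                       ~ 0.82
F1  = 2 * (Prc*TPR) / (Prc + TPR);   % F1 score                        ~ 0.86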

Response Manipulation There are cases where one may not be satisfied with the response rates of the system, because certain 'costs' may be associated with the response types. For instance, in the case of the doctor's decision, one may need to take into consideration the price of a treatment or its potential side effects. Thus, one would like to bias the system in favor of a specific response outcome. Example: Google Maps uses an algorithm to blur faces and car license plates in its street view, in order to avoid lawsuits by private persons who were accidentally photographed during the street view recording. How would you adjust the algorithm? Would you permit unblurred faces? Are false alarms costly?

Answer: You probably want to detect all faces to ensure that no lawsuit is filed, that is, you ideally want a perfect recall. Due to the trade-off this means that false alarms are going to be higher. False alarms are objects that appear similar to faces and car license plates. If such objects occasionally get blurred, then this does not really impair the overall benefit of the street view; hence false alarms have a low cost here.

This search for an optimal decision is easier if one knows the relationship between a pair of measures. This relationship is typically plotted as the so-called ROC curve, coming up next.

8.2.3 ROC Analysis wiki Receiver operating characteristic

The ROC analysis is a way of visualizing the response trade-off for a range of decision thresholds. In that analysis, the true-positive rate is plotted against the false-positive rate. Those two rates are calculated for different decision thresholds and if one connects the points one obtains the so-called receiver-operating characteristic (ROC) curve, see now Figure 19. It starts in the lower left corner and then increases, which would correspond to moving the decision threshold from left to right in Figure 18.


Figure 19: The ROC curve is generated by systematically manipulating the decision threshold and plotting the true positive rate against the false positive rate for each performance measurement. The curve typically lies above the diagonal. The diagonal represents chance level (gray dashed line). Ideally the curve would rise steeply - the dashed ROC represents a better classifier than the stippled ROC (further away from the diagonal is better). Sometimes the area under the curve (AUC) is used as a quantity - the higher the value, the better the performance. Matlab: perfcurve.


Assume that the black (solid) ROC curve reflects the performance of a loosely 'tuned' model. If one hopes to find a better model, then the curve should bend more toward the upper left, that is, it should rise faster toward one, such as in the dashed ROC curve. If the curve becomes flatter and closer to the stippled diagonal, then the model is worse. If the curve runs along the diagonal, then the decision making is pretty much random. If the curve runs below the diagonal, something is completely wrong - we may have swapped the classes by mistake.

Using the ROC curve, the classification performance is sometimes specified as the area underneath it (AUC); thus we report a scalar value between 0.5 (chance) and 1.0 (perfect).
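
A minimal sketch of how the curve and the AUC can be obtained with Matlab's perfcurve, assuming binary true labels Lb and the scores Sc of the positive class (e.g. one column of the score matrix returned by predict):

[Xfpr, Ytpr, ~, AUC] = perfcurve(Lb, Sc, 1);     % 1 marks the positive class
plot(Xfpr, Ytpr); xlabel('False Positive Rate'); ylabel('True Positive Rate');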

8.3 Three or More Classes

For classifiers discriminating three or more classes, the above 'binary' characterization - the response table and its measures - is not directly applicable. Often, one reports only the percentage of correct classification or the error, that is, the percentage of misclassification, as we did so far in our examples. Nevertheless, we can gain more insight about the classifier model by observing the confusion matrix, from which we can derive c binary evaluations.

Confusion Matrix for 2 or more Classes: The confusion matrix is of size c × c, where c is the number of classes. In that table, the axes for the actual and the predicted classes are often swapped - as opposed to the response table introduced in Section 8.2.1: the given (actual) classes are listed row-wise, the predicted responses are given column-wise. A classifier with good performance would return a confusion matrix where mostly the diagonal entries show high values, namely where actual and predicted class agree.

Example: Assume you have trained a classifier to distinguish between cats, dogs and rabbits. You test the classifier with 27 samples: 8 cats, 6 dogs, and 13 rabbits. You observe that your model makes the following confusions:

                    Predicted
                    Cat     Dog     Rabbit
Actual   Cat        5       3       0
         Dog        2       3       1
         Rabbit     0       2       11

In Matlab we can use the command confusionmat (stats toolbox), for which the first variable must be the actual group and the second variable must be the predicted group (grouphat):


CM = confusionmat(Grp.Tst, LbTst);

Or we can generate the confusion matrix as follows:

CM = accumarray([Grp.Tst LbTst],1,[nCat nCat]);

c Binary Analyses (One-Versus-All) Using the confusion matrix, we can now generate the binary measures - as introduced before - for each class c individually: for a selected class c, a discrimination between the selected class and all other categories is carried out. For a category under investigation, with index i, we can obtain three values:

hit count TP:                             the i-th diagonal entry.
false positive count FP (false alarms):   the sum of values along the i-th column minus the hit count TP.
false negative count FN (misses):         the sum of values along the i-th row minus the hit count TP.

Note that false alarms and misses appear swapped in comparison to the 2 × 2 confusion matrix of Section 8.2.1.

In the example above, the hit count for the dog class is three; its false positive count is five (3+2); its false negative count is three (2+1).

In Matlab we can obtain false alarms and misses from the confusion matrix with the following lines (nCls = no. of classes):

CMpur = CM; % create a copy, which will be our ’pure’ confusions

CMpur(diag(true(nCls,1))) = 0; % set all diagonal entries to 0 (hits knocked out)

Cfsa = sum(CMpur,1); % false alarms per category

Cmss = sum(CMpur,2); % misses per category
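
Continuing with these variables, the per-class precision and recall follow directly (a sketch; CM has actual classes in rows and predicted classes in columns):

Chit = diag(CM)';                 % hits (TP) per category          [1 nCls]
Prec = Chit ./ (Chit + Cfsa);     % precision per category: TP / (TP + FP)
Recl = Chit ./ (Chit + Cmss');    % recall    per category: TP / (TP + FN)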

8.4 More Tricks & Hints

We now list optimizations that one can perform in order to seek improvement of a classifier. The first one is a potential pitfall when we use classes with few instances (Section 8.4.1). The second one observes the classifier accuracy for different amounts of training (Section 8.4.2). The third optimization concerns enlarging the training set by manipulating or sub-selecting its samples (Section 8.4.3).

8.4.1 Class Imbalance Problem ThKo p237

In practice there may exist datasets in which one class has many more samples than another class, or some classes may have only a few samples due to their rarity, for instance. This is usually referred to as the class imbalance problem. Such situations occur in a number of applications such as text classification, diagnosis of rare medical conditions, and detection of oil spills in satellite imaging. Class imbalance may not be a problem if the task is easy to learn, that is, if the classes are well separable, or if a large training dataset is available. If not, one may consider trying to avoid possible harmful effects by 'rebalancing' the classes, either by over-sampling the small classes and/or by under-sampling the large classes.
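
A minimal sketch of such a rebalancing by over-sampling, assuming hypothetical index (column) vectors IxMin and IxMaj that point to the minority- and majority-class rows of the data matrix DAT with labels Grp:

nMaj   = numel(IxMaj);
IxOvr  = IxMin(randi(numel(IxMin), nMaj, 1));  % draw minority indices with replacement
DATbal = DAT([IxMaj; IxOvr], :);               % rebalanced data matrix
Grpbal = Grp([IxMaj; IxOvr]);                  % corresponding labels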

8.4.2 Learning Curve

It is common to test the classifier for different amounts of learning samples (e.g. 5, 10, 15, 20 training samples), and to plot the classification accuracy (and/or error) as a function of the number of training samples, a graph called the learning curve. An increase in sample size should typically lead to an increase in performance - at least initially; if performance only decreases, then something is wrong. For some classifiers - typically neural networks - the classification accuracy may start to decrease for very large amounts of training due to a phenomenon called overtraining (overfitting). In that case one would like to stop learning when the accuracy starts to decrease. To achieve that, one employs a validation set (see above Section ??): the performance on the validation set would increase initially and then saturate with learning duration; it would start to decrease when overtraining starts to occur, and that is the point when training should be stopped.
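
A rough sketch of how such a learning curve could be produced, assuming numeric labels and the hypothetical variables TRN/GrpTrn (training data and labels) and TST/GrpTst (test data and labels); fitcknn serves only as an example classifier:

Szs = [5 10 15 20];                             % training-set sizes to test
Acc = zeros(size(Szs));
for i = 1:numel(Szs)
    n      = Szs(i);                            % ideally a stratified random subset of size n
    Mdl    = fitcknn(TRN(1:n,:), GrpTrn(1:n));  % train on n samples
    Acc(i) = mean(predict(Mdl, TST) == GrpTst); % accuracy on the fixed test set
end
plot(Szs, Acc); xlabel('# training samples'); ylabel('accuracy');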

8.4.3 Improvement with Hard Negative Mining and Artificial Samples

Now we mention two tricks that may help to improve classifier performance. These are tricks that are typically used with other classifiers, but they can very well be tested with a linear classifier at little cost. Trying to tune a more complex classifier can be more difficult than trying out one of the following two tricks:

Hard Negative Mining: Here we focus on samples which trigger false alarms in our classifiers. More explicitly, if we train a classifier to categorize digits, then we observe and collect (mine) those digits that confuse one category, that is, those samples that trigger false alarms. For instance, we collect those digit samples that confuse class '1', which could be '7's or '4's; or for category '3' it could be '2's or '8's. Then we train the classifier again with those 'hard negatives' in particular.

Creating 'Artificial' Samples: This is a trick popular in computer vision. Because the collection and labeling of image classes is tedious work, one sometimes expands the training set artificially by generating (automatically) more training images, which are slightly scaled and distorted variants of the original samples. This can be tried in combination with adding noise to the samples. Of course, artificial samples are created from the training set TRN only - not from the testing set.
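
A minimal sketch of such an augmentation by adding noise, assuming the training samples are stored row-wise in TRN with labels GrpTrn (sigNse is a hypothetical noise amplitude); only the training set is expanded:

sigNse = 0.05;                              % hypothetical noise amplitude
TRNnse = TRN + sigNse * randn(size(TRN));   % noisy copies of the training samples
TRNaug = [TRN; TRNnse];                     % augmented training set
GrpAug = [GrpTrn; GrpTrn];                  % duplicate the labels accordingly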


9 Clustering - K-Means    wiki K-means clustering; JWHT p386, 10.3.1; DHS p526; ThKo p741; Alp p145

We now introduce the first clustering procedure. Should one have forgotten the purpose of clustering, then revisit Section 1.1, and perhaps also the section on computational challenges (Section 2.2).

The most common clustering procedure is the K-Means method. It is an iterative procedure where the cluster centers - also called centroids - gradually move towards the 'true densities' by repeated distance measurements (Fig. 20). To initiate the procedure, one specifies how many clusters k one expects in the data; then one chooses k points randomly - from the total of n samples - and those are taken as the initial centroids (b in Fig. 20). Then the procedure iterates through the following two principal steps:

1. Partitioning: Each data point is assigned to its nearest centroid (Fig. 20c). In other words, for each data point its 'membership' (or label) is determined. This assignment is done based on distance - and is analogous to Nearest-Centroid classification (Section 4). The assignment results in k partitions, which one can think of as temporary clusters.

2. Mean: With those partitions, the new centroids are calculated by taking the mean - hence the name K-Means (Fig. 20d). The new centroids will be in a slightly different location than the previous centroids, namely a bit closer toward the density.

With the new centroids obtained in the second step, we then repeat step one, and then we recompute the centroids in the second step again. By repeating these two steps, the procedure gradually moves the centroids towards the true density centers. To terminate this cycle, we need a stopping criterion, e.g. we quit when the new centroids hardly move anymore, which means that the cluster development has 'settled'. Algorithm 5 summarizes the procedure.

Algorithm 5 K-Means clustering method. Centroid = cluster center.
Parameters      k: number of clusters
Initialization  randomly select k samples as initial centroids
Input           x: list of vectors
Repeat
    1. Generate a new partition by assigning each pattern to its closest centroid
    2. Compute new centroids from the labels obtained in the previous step
    (3. Optional) Adjust the number of clusters by merging and splitting existing clusters,
       or by removing small or outlier clusters
Until stopping criterion fulfilled (e.g. the new centroids hardly move anymore)
Output          L: list of labels (a cluster label for each xi)
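
To make the two steps concrete, here is a minimal batch-update sketch of Algorithm 5 (Euclidean distance, and a fixed number of iterations instead of a proper stopping criterion); DAT is the data matrix with one observation per row, and k is an assumed number of clusters:

k = 3;                                          % assumed number of clusters
C = DAT(randperm(size(DAT,1), k), :);           % initialization: k random samples as centroids
for it = 1:20
    [~, L] = min(pdist2(DAT, C), [], 2);        % step 1: assign each point to its nearest centroid
    for j = 1:k
        if any(L == j)
            C(j,:) = mean(DAT(L == j, :), 1);   % step 2: recompute centroid j as the mean
        end
    end
end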

There are two modes with which step 1 can be carried out, called batch update and on-line update.

Batch Update: Here, the re-assignment of a class label for a point to its closest centroid (step 1) is done at once for all points simultaneously. This is somewhat coarse but fast, much faster than the on-line update.

On-line Update: the re-assignment of class labels occurs for each point individually, which is more time consuming than the batch update, but also more accurate. If you have very large datasets, you may want to avoid the individual update and choose only the batch update.

The shortcoming of the K-Means procedure is that, for complex data, it does not always generate the clusters 'properly'. There is no clustering method that universally does so, but the K-Means algorithm is a bit less robust in that respect than other methods. There are several reasons:
- Random initialization: the initial (random) choice of centroids can be crucial for successful clustering. Modern variants of the K-Means algorithm, such as K-Means++, have substantially improved on this downside.


Figure 20: The principle of the K-Means method exemplified on a 2D example.
a. A dataset. We assume somehow that there are two clusters present (which is obvious in 2D to a human, but is not obvious to an algorithm), hence we set k = 2.
b. Initialization: we initiate the procedure by choosing two data points randomly, the filled circles, which are going to be our initial centroids.
c. Partitioning (1st step of loop): we determine to which centroid each data point is nearest (nearest-centroid classification as in Section 4) and so generate two partitions: one partition is outlined by a gray line; the rest of the points are the other partition. This partitioning is one of the two principal steps of the iterative procedure.
d. Calculation of Mean (2nd step of loop): we calculate the new centroids by averaging the data points of the clusters, new centroids marked as 'x'; this is the second principal step of the iterative procedure. The gray arrow shows the movement from the old to the new centroid.
We repeat steps one and two, c and d respectively, and that will gradually move the centroids to the centers of the actual clusters. The procedure is stopped when the centroids hardly move anymore.


- By using a 'mean' calculation, there is a tendency to interpret the data as compact clusters; elongated clusters are not so well 'captured' by the procedure.

A simple solution to the downside of random initialization is to run the procedure repeatedly - always starting with different initial centroids - and then to select the partitioning for which the total sum of final distance values is smallest. This type of repetition is not explicitly stated in Algorithm 5; that is, we repeat the entire algorithm T times and choose that L (out of the T L's) whose total distance is smallest. Although this repeated application is no guarantee that the actual centroids are found, it has been shown to be fairly reliable in practice.

The great advantage of the K-Means procedure is that it works relatively fast, because most samples are inspected only occasionally, namely for the number of iterations until the actual centroid is found. In contrast, the hierarchical clustering procedure, treated in the next Section 10, is much more exhaustive, observing each sample n − 1 times.

Clustering can be carried out with different types of distance measures, or it can be done with similarity measures, see also Appendix B. The Hamming distance and the Manhattan distance are better suited to deal with binary or discrete-valued data.


9.1 Usage in Matlab

Implementing a primitive version of the K-Means algorithm is not so difficult, because it is a fairly straightforward algorithm. An example is given in Appendix G.14; it performs only the batch update for simplicity.

Matlab provides the function kmeans, for which one has to specify only the number of clusters k as the minimal parameter input:

Lb = kmeans(DAT, 5);

which carries out one repetition by default. But you can also specify more repetitions using the parameter 'replicates', see the example in G.14. The variable Lb contains a cluster label between one and five, and you then need to write a loop that identifies the indices for the individual clusters; see the function in Appendix G.14.1 for an example.

The function kmeans first performs a coarse clustering using the batch update, followed by a refinement with an online update, see the nested functions called batchUpdate and onlineUpdate. You can turn off the online update as follows:

Lb = kmeans(DAT, 5, ’onlinephase’, ’off’);

which is recommended when your sample size is several tens of thousands or larger - otherwise your machine may be completely occupied by clustering. You can always add the online phase later if you think your PC can deal with the size of the data.

Distance The default distance measure that kmeans uses is the squared Euclidean distance, which is essentially the Euclidean distance without the final square root: because taking the square root is not necessary for generating the partitions, it is omitted to save on computation.

Scaling By default the function kmeans uses the (squared) Euclidean measure, for which no scaling of the data takes place. Thus, it may be worth also trying clustering with scaling (Section 3.3). If you choose different distance measures - or in fact similarity measures - then the data are scaled accordingly, see Appendix B for how that is done.

Options There are a lot of options and parameters you can pass with the function statset. For instance, you can set a limit for the maximum number of iterations and you can instruct the function to display intermediate results:

Opt = statset(’MaxIter’,500,’Display’,’iter’);

Lb = kmeans(DAT,5, ’replicates’,3, ’onlinephase’,’off’, ’options’,Opt);

You then pass the structure Opt to the function kmeans. In the above example, we also instruct the kmeans function to perform three repetitions - called replicates here.

NaN The Matlab function kmeans eliminates any NaN entries, that is, it throws out any rows (observations) where NaN occurs. The elimination is done with the function statremovenan. If you want to keep the observations with NaN entries, then you have to write your own function.

9.2 Usage in Python

In SciKit-Learn, the K-Means algorithm is implemented in the module sklearn.cluster. It offers two variants as separate classes: the regular algorithm in KMeans and a fast variant for large data sets, MiniBatchKMeans, running batch updates as mentioned above. I have tested the latter for hundreds of thousands of vectors and it does really cluster very fast, much faster than Matlab's algorithm with the option 'onlinephase','off'.


9.3 Determining k - Cluster Validity wiki Determining the number of clusters in a data set

ThKo p880, 16.4.1
If we are clueless about how many clusters to expect in our dataset, then we need to test a range of ks. We then need some measure or method that tells us which k could be optimal; such a method quantifies somehow how 'suitable' the obtained clusters are. This quantification is also called cluster validity. There is a number of methods we can try; here are two popular ones:

Decrease of Within-Cluster Scatter: This is probably the simplest method. For each cluster we determine its so-called within-cluster scatter, which is a measure that expresses how scattered (or spread) a cluster is, for instance by determining the mean distance from the centroid to all its cluster members. This measure is used twice in the Matlab function kmeans: once to determine when clustering is completed, namely by observing when it does not decrease anymore; and once for choosing the run with the lowest scatter - if you have chosen to run the K-Means clustering algorithm multiple times with the option 'replicates'. To decide which k could be optimal, one plots the scatter value against varying k (Fig. 21), which results in a decreasing curve, and then applies the elbow method - as it was also suggested for choosing k for the PCA (Section 7.1.1).

Figure 21: Scatter as a function of the number of clusters k. The Iris data set tested with ks ranging from 2 to 10. The y-axis plots the mean scatter for the clusters. The curve is decreasing because cluster sizes become increasingly smaller for increasing k. One way to choose an appropriate k would be to use the elbow method: one draws a straight line between the first and last point and determines which point in between shows the largest distance to the straight line.


Silhouette Index: this method goes a step further than the previous measure and also determines how separated the clusters are from each other, that is, it also measures the between-cluster scatter. To determine this measure, one can take for instance the distances between a centroid and all other (non-member) points in the dataset. Thus, this measure is computationally significantly more expensive than the previous one. Matlab provides the function silhouette to determine a so-called silhouette index for each cluster (Fig. 22). One can make a selection based on the average of the k silhouette indices; or one can choose the k where all indices are above zero.
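
A sketch of how both criteria could be computed for a range of ks (kmeans and silhouette from the Statistics Toolbox; DAT is the data matrix):

Ks  = 2:10;
Sct = zeros(size(Ks));   Sil = zeros(size(Ks));
for i = 1:numel(Ks)
    [Lb, ~, SumD] = kmeans(DAT, Ks(i), 'replicates', 5);
    Sct(i) = mean(SumD);                  % mean within-cluster scatter (cf. Fig. 21)
    Sil(i) = mean(silhouette(DAT, Lb));   % mean silhouette value (cf. Fig. 22)
end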

9.4 Recapitulation

The K-Means clustering algorithm is simple to use and delivers results quickly. It is therefore recommended that you always use it to analyze your data - even if you think your data deserve a better analysis. The K-Means algorithm may give you only a 'coarse' result, which you can then attempt to improve with another clustering algorithm.

Complexity The overall complexity is fairly low, namely O(ndkT), where n is the sample size, d the dimensionality, k the number of clusters and T the number of replicates (DHS p527). Note that if you perform an excessive number of replicates on a small dataset, you may exceed the complexity of the hierarchical clustering algorithm (Section 10), in which case it may be more feasible to use that one.

Advantages K-Means works relatively fast due to its relatively low complexity. It is therefore suitable for very large datasets, unlike some other clustering methods. On the Matlab site 'File Exchange', a user provides an implementation that can deal with gigabytes of data; it requires, however, a C compiler.


Figure 22: Silhouette indices for k=4, k=5 and k=6 (left to right). For each cluster, the silhouette values are ordered in decreasing order (top-to-bottom). A large value signifies a data point that shows a strong affiliation with its cluster. For k=4 (left), some of the data points of the fourth cluster (bottom) are negative. For k=5 (center), only a single data point is negative (middle; cluster no. 3). For k=6, data points of two clusters show negative values (clusters no. 2 and 6). According to this, a clustering with k=5 might be most appropriate.



Disadvantages
- Specification of the number of clusters k: it is not obvious what a useful k would be - unless we know exactly how many clusters to expect. We have mentioned several methods to choose k automatically, but none of those procedures has proven to be effective for all patterns.
- It tends to favor hyper-ellipsoidal clusters - in particular if you use the within-cluster scatter for evaluation, e.g. for terminating or for validity.
- The algorithm does not guarantee optimal results due to the random initial selection.
- It cannot handle categorical data well, due to the mean function.
- It is sensitive to outliers, because each observation is forced to be part of a cluster.

How one can address some of the disadvantages is treated later in Section 19.


10 Clustering - Hierarchical    DHS p550; ThKo p653; HKP p456, 10.3; LRU p245, 7.2

Hierarchical clustering is a procedure that starts by calculating the pair-wise distances between all observations and then proceeds by linking up near pairs. In comparison, the K-Means method measures only a subset of those distances. Hierarchical clustering is therefore more thorough in its analysis and would, in principle, be the preferred clustering method; however, the exhaustive pair-wise distance measurements are computationally more intensive, and the method is therefore not easy to apply to large datasets. Hierarchical clustering consists of three principal steps:

1. Pairwise distances: The distances between all points are calculated, resulting in a list of pair-wise measurements.

2. Linking to a hierarchy: Using the list of distance values, a nested hierarchy of all n data points is generated, by starting with the nearest pairs and then gradually linking up the remaining pairs. The hierarchy can be represented by a tree, which here is called a dendrogram. This step is far less costly than the first one (the pairwise distance calculation), as one compares only a list of distances.

3. Thresholding the hierarchy: We cut the tree horizontally at some distance and the resulting 'branches' form the clusters.

10.1 Pairwise Distances

The calculation of those pairwise distances is the costliest step of the hierarchical clustering procedure, in terms of both space and time. For n vectors (observations), one needs to compute n(n − 1)/2 values, roughly n²/2. It is therefore said that hierarchical clustering is of quadratic complexity, O(n²). For example, for 20'000 vectors we would have to calculate almost 200 million distance measurements. For two dimensions that might be doable; for larger dimensionality this quickly becomes unfeasible, because the time to calculate each distance also grows with the dimensionality.

Matlab and Python offer the function pdist to calculate pair-wise distances. Watch out: if you send a large list of vectors to pdist, you also need to pay attention to memory demands; see Appendix A.1 for some more details on the intricacies of calculating pair-wise distances.

10.2 Linking (Agglomerative)

There exist two principal types of linking, namely agglomerative and divisive:

Agglomerative linking starts with the tightest pairs, the pairs with the smallest distance, and then gradually links to more distal pairs; agglomerative linking is the more popular one and we therefore consider only that one.

Divisive linking works the opposite way: it starts with all points in a single cluster and then tries to gradually break it up into smaller clusters. Divisive linking is computationally so demanding that it is rare in practice.

Agglomerative linking is also known as Bottom-Up clustering or Clumping. It starts by considering every data point as an individual cluster; in other words, for n data points one starts with n clusters. Because a cluster with a single member is also called a singleton cluster, this linking procedure can be said to start with n singleton clusters. Then one forms the hierarchy by successively merging data points into pairs whose distance is smallest, and one continues that type of linking until all points have been merged into a single cluster. The general algorithm is as follows, whereby the function g represents the link metric, a measure that takes the distance between a data point and a set of points.
The link metric g can be implemented in various ways; Figure 23 shows some examples. The nearest-neighbor link metric is also referred to as single linkage; the farthest-neighbor link metric is also referred to as complete linkage. The choice of link metric biases the clustering process such that it favors certain shapes of clusters. We elaborate on those two types below, by looking at the toy example in Figure 24, top left.


Algorithm 6 Generalized Agglomerative Scheme (GAS). g: link metric (a type of distance measure between a point and a set). C: a cluster with members x_i. R: set of clusters. From ThKo p654.
Parameters      cut threshold θ
Initialization  t = 0; choose R_t = {C_i = {x_i}, i = 1, .., N} as the initial clustering
Repeat
    t = t + 1
    Among all possible pairs of clusters (C_r, C_s) in R_{t-1} find the one, say (C_i, C_j), such that

        $g(C_i, C_j) = \min_{r,s} g(C_r, C_s)$                                        (8)

    Define C_q = C_i ∪ C_j and produce the new clustering R_t = (R_{t-1} − {C_i, C_j}) ∪ {C_q}
Until all vectors lie in a single cluster, resulting in the final set of clusters
Cut the hierarchy at level θ

Figure 23: Types of distance measurements between a single point (left, filled) and a set of points (right). In the context of clustering these are called link metrics.
a. Nearest Neighbor: First, the distances between the single point and the points in the set are determined; then, the minimum of those distances is chosen. During the process of linking, this is also referred to as single-linkage.
b. Farthest Neighbor: as in a., but now the maximum of those distances is taken. Also referred to as complete-linkage.
c. Average: first, for the set of points the average is computed (marked as 'x'); then the distance between the single point and the average point is taken.


Note that in Equation 8 above, the minimization (min) over all pairs is not part of the link metric g itself.

Single-Linkage: this is also known as Nearest-Neighbor linkage; it uses the minimum distance to compute the distances between samples and merging clusters, see Figure 23a. For that reason there appear only three unique distance values in the dendrogram of Figure 24, center left: the smallest distance value (equal to 0.15) reflects the minimum spacing between the points in the upper row - the five points at y-coordinate 0.8; the second smallest distance value (equal to 0.2) reflects the minimum spacing between the four lower points. The third distance reflects the distance between the upper row of points and the set of lower four points. In this example, we intentionally cut the dendrogram between the second and third distance and so arrive at two final clusters. In general, the single-linkage method tends to generate chain-like clusters, which is evident in the selection of the upper row of points as one cluster (Figure 24, lower left).

Complete-Linkage (aka Farthest Neighbor): uses the maximum distance to compute the distances between samples and clusters, see Figure 23b. The corresponding dendrogram looks a bit less intuitive now (Figure 24, center right). There are more unique distance values due to the choice of the largest distance between two sets of merging clusters. The method tends to generate compact clusters, which in this example is not so obvious, except that it did not acknowledge the presence of a line of points as the single-linkage method did.

There exist, of course, many other types of linkage metrics. For instance, the average metric takes the mean distance value of the clusters under consideration, see Figure 23c.

Usage in Matlab Matlab provides the function linkage to organize the hierarchy. With the second parameter we specify the desired linkage metric, for instance:

Dis = pdist(DAT); % pairwise distances

LnkSin = linkage(Dis, ’single’); % single linkage method

LnkCmp = linkage(Dis, ’complete’); % complete linkage method

The output variables LnkSin and LnkCmp are arrays containing the connections of the tree (N = number of observations). Each array is of size (N − 1) × 3: the first two columns contain the tree connections, pairs of samples or clusters that are closest; the third column contains the distances between those pairs of samples or clusters. The hierarchy can be displayed using the command dendrogram.
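
For instance, to inspect the single-linkage hierarchy visually (by default Matlab collapses the display to at most 30 leaf nodes):

dendrogram(LnkSin);   % plot the single-linkage hierarchy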

10.3 Thresholding the Hierarchy - Cluster Validity ThKo p690, 13.6

The link tree can be thresholded in different ways. The simplest is to cut it at a specific distance value, thus resulting in some number of clusters. If we have a specific number of clusters k in mind, then we cut the tree at the corresponding level that results in exactly k clusters. Or we wish to apply a criterion, similar to what we did for K-Means clustering (Section 9.3). Here are two criteria that are popular with hierarchical trees:
- Inconsistency coefficient: expresses how inconsistent a link is in the tree - the larger its value, the more inconsistent it is. For a given link, the inconsistency value is calculated from the height values of all those links that are on the same level.
- Cophenetic (correlation) coefficient: a single value that describes how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points. It has been used a lot in biostatistics. The closer the value is to one, the better the 'preservation'.

Usage in Matlab Matlab provides the functions inconsistent and cophenet to generate those measures, where Lnk is the output as generated by the command linkage (see the Section above) and Dis are the pairwise distances:

Inc = inconsistent(Lnk); % [nObs-1 4]

coph = cophenet(Lnk, Dis); % single value

Matlab offers the function cluster to cut the link hierarchy in those various ways. If you specify inconsistent as the criterion, then it will automatically call the function script inconsistent.

Lbl = cluster(Lnk, 'maxclust', 3);                               % three or fewer clusters

Lbl = cluster(Lnk, 'cutoff', 2.01, 'criterion', 'distance');     % cut at distance 2.01

Lbl = cluster(Lnk, 'cutoff', 1.25, 'criterion', 'inconsistent'); % cut by inconsistency coefficient

Matlab also offers the function script clusterdata, which performs the three steps all together: pairwise distances, linking and thresholding the hierarchy.



Figure 24: Understanding the difference between single-linkage and complete-linkage. The cut-off distance is set at varying values to observe the resulting clusters (t=0.201, see gray-dashed line in the dendrogram; t=0.19 and t=0.3 are not shown).
Top Graph: a dataset consisting of 9 points.
Left Column (Single Linkage): in the dendrogram two groups of distances are evident; one group shows slightly higher pair-wise distances than the other. If we cut at t=0.201, then that results in two clusters: a 'compact' 4-point cluster and a chain-like 5-point cluster. If we lower the threshold a bit to t=0.19, then this breaks up the compact 4-point cluster, but the 5-point chain remains.
Right Column (Complete Linkage): the dendrogram shows that distances are larger due to the use of farthest neighbors for linking. Thresholding at 0.201 results in 5 clusters that do not really represent any meaningful partitioning. Raising the threshold to 0.3 consolidates the clusters to one 4-point cluster and two other clusters; a chain is never formed.


10.4 Recapitulation

Application Hierarchical clustering is particularly suitable when hierarchies need to be described, e.g. a biological taxonomy. If the data set is not too large, and one is able to calculate the pair-wise distances in the desired time, then this is certainly a method to explore. If we suspect clusters of certain shapes, then we can choose a corresponding linkage method.

Advantages
- Hierarchical clustering offers easy interpretation of the results - as does the decision tree classifier, coming up in the next section (Section 11).
- In general, the exhaustive analysis allows finding cluster centers more reliably than the K-Means method does.

Disadvantages
- Complexity: the exhaustive (pair-wise) comparison of all samples (observations) implies quadratic complexity, O(n²). Hierarchical clustering is thus suitable for databases of limited size only. There exist various methods to reduce the complexity to some extent, but those can also be applied to many other clustering methods, which have a smaller complexity by nature.
- Linkage determines shape: the choice of linkage method determines what type of cluster shape we will obtain. That means that the linkage method imposes a certain structure on the data.
- Lack of a method or algorithm to determine an optimal cluster number: it is difficult to determine an optimal threshold (cutoff) level automatically - just as it is difficult to determine an optimal k in the K-Means procedure. This is a general challenge of any clustering algorithm and therefore not a disadvantage specific to this type of clustering method.

Usage in Matlab The three steps can be carried out as follows:

Dis = pdist(DAT); % pairwise distances

Lnk = linkage(Dis, ’single’); % single linkage method

Lbl = cluster(Lnk,’cutoff’, 1.25);

Or use the function clusterdata, i.e.

Lbl = clusterdata(DAT, ’maxclust’, 3);


11 Decision Tree    ThKo p215, 4.20, pdf 228; Kuncheva p6.8, 2.4; Alp p185, ch 9; JWHT p303, ch8; DHS p395

A decision tree is a multistage decision system, in which classes are sequentially rejected until we reach a finally accepted class. The decision process corresponds to a flow diagram as we know it from school.

A decision tree is particularly useful if the data are nominal, see again Section 1.2.1. Nominal data are discrete and without any natural notion of similarity or even ordering. One often uses lists of attributes to describe objects. A common approach is to specify the values of a fixed number of properties by a property d-tuple. For example, consider describing a piece of fruit by the four properties color, texture, taste and size. Then a particular piece of fruit might be described by the 4-tuple (red, shiny, sweet, small), which is a shorthand for color = red, texture = shiny, taste = sweet and size = small. Such data can be classified with decision trees.

Figure 25 shows an example for a 2D dataset. On the left, an (artificial) dataset is shown, which consists of 6 regions whose points belong to four different classes; classes 1 and 3 have two separate regions each. The vertical and horizontal lines represent decision boundaries that separate the classes. For example, the decision boundary at x1 = 1/4 separates part of class ω1 from all others. In order to assign a testing sample {x1, x2} to one of the four classes, one applies a number of decision boundaries - the decision tree is an efficient way to train and apply these decisions. The illustrated example may appear very simple, but it is in fact very difficult to solve for other classifiers, as in this example the data points of some classes lie in very different regions of the entire 2D space.

Figure 25: Left: a pattern divided into rectangular subspaces by a decision tree. Right: corresponding tree. Circles: decision nodes. Squares: leaf/terminal nodes. [Source: Theodoridis, Koutroumbas, 2008, Fig 4.27, 4.28]

On the right side of Figure 25 a decision tree is shown that separates all instances and assigns them to their corresponding classes. A decision tree is drawn from top to bottom and consists of three types of nodes that are connected by links or branches:

root node: the top node of the tree; it is a decision node, labeled t0 in this case.

decision node: tests a component of the multi-dimensional feature vector; drawn here as circles and labeled t_i, i = 0, .., n_decisions.

leaf (or terminal) node: assigns the instance to a class label; drawn here as squares and labeled ω_i, i = 1, .., n_classes.


The example decision tree consists of 5 decision nodes and 6 leaf nodes. Given a (testing) data point, e.g. x1 = 0.15, x2 = 0.5, the decision node t0 tests the first component, x1, by applying a threshold value of 1/4: if the value is below, the data point is assigned to class ω1; if not, the process continues with binary decisions of the general form xi > α (α = threshold value) until it has found a likely class label.

The example of Figure 25 is a binary decision tree and splits the space into rectangles with sides parallel to the axes; for higher dimensionality (3D or more) those would be called hyper-rectangles. Other types of trees are also possible, which split the space into convex polyhedral cells or into pieces of spheres. Note that it is possible to reach a decision without having tested all available feature components.

In practice, we often have data of higher dimensionality and we therefore need to develop binary decision trees automatically; that is, we need to learn somehow when which component xi is tested with what threshold value αi. The learning rule selects threshold values where the decision achieves higher class 'purity', meaning individual class frequencies should either increase or decrease with every split. At the beginning, the training set X is considered 'impure' because all class labels are present. With every following decision - and hence the splitting of the training set - the resulting split in class labels is supposed to become purer. The training procedure therefore has three key issues: impurity, stop splitting and the class assignment rule. Those issues are elaborated now:

1: Impurity

Every binary split of a node t generates two descendant nodes, denoted as tY and tN according to the 'Yes' or 'No' decision; node t is also referred to as the ancestor node (when viewing such a split). The dataset arriving at the ancestor node is split into subsets XtY, XtN, which in turn are fed to the descendant nodes, see Figure 26; the root node is associated with the entire training set X.

Figure 26: The (learning) datasets associated with the split at a decision node. The set of training vectors Xt arrives at the ancestor node and is split into sets XtN and XtY, which are fed to the descendant nodes. For each node a (class) impurity is calculated: I(t) at the ancestor node, I(tY) and I(tN) at the respective descendant nodes. The learning rule seeks to decrease the impurity at splits.


Now the crucial point: every split must generate subsets that are more 'class homogeneous' compared to the ancestor's subset Xt. This means that the training feature vectors in XtN and the ones in XtY show a higher preference for specific class(es), whereas the data in Xt are more equally distributed among the classes. Example for a 4-class task: assume that the vectors in subset Xt are distributed among the classes with equal probability (percentage). If one splits the node so that the points that belong to classes ω1 and ω2 form subset XtY, and the points from ω3 and ω4 form subset XtN, then the new subsets are more homogeneous compared to Xt, or 'purer' in the decision tree terminology.

The goal, therefore, is to define a measure that quantifies node impurity and to split the node so that the overall impurity of the descendant nodes is optimally decreased with respect to the ancestor node's impurity. Let P(ω_i|t) denote the probability that a vector in the subset X_t, associated with a node t, belongs to class ω_i, i = 1, 2, ..., M. A commonly used definition of node impurity, denoted as I(t), is the entropy of subset X_t:

    $I(t) = -\sum_{i=1}^{M} P(\omega_i|t) \log_2 P(\omega_i|t)$                        (9)

where log_2 is the logarithm with base 2 (see Shannon's Information Theory for more details). We have:


- Maximum impurity I(t) if all probabilities are equal to 1/M (highest impurity).
- Least impurity I(t) = 0 if all data belong to a single class, that is, if only one of the P(ω_i|t) = 1 and all the others are zero (recall that 0 log 0 = 0).

The decrease in impurity achieved by a split is the ancestor's impurity minus the size-weighted impurities of the two descendants,

    $\Delta I(t) = I(t) - \frac{N_{tY}}{N_t} I(t_Y) - \frac{N_{tN}}{N_t} I(t_N)$,

where N_t, N_tY and N_tN denote the number of vectors in X_t, X_tY and X_tN, respectively. When determining the threshold α at node t, we attempt to choose a value such that ΔI(t) is large.

Example: given is a 3-class discrimination task and a set X_t associated with node t containing N_t = 10 vectors: 4 of these belong to class ω_1, 4 to class ω_2, and 2 to class ω_3. Node splitting results in: subset X_tY with 3 vectors from ω_1 and 1 from ω_2; and subset X_tN with 1 vector from ω_1, 3 from ω_2, and 2 from ω_3. The goal is to compute the decrease in node impurity after splitting. We have that:

    $I(t) = -\frac{4}{10}\log_2\frac{4}{10} - \frac{4}{10}\log_2\frac{4}{10} - \frac{2}{10}\log_2\frac{2}{10} \approx 1.522$

    $I(t_Y) = -\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4} \approx 0.811$

    $I(t_N) = -\frac{1}{6}\log_2\frac{1}{6} - \frac{3}{6}\log_2\frac{3}{6} - \frac{2}{6}\log_2\frac{2}{6} \approx 1.459$

Hence, the decrease in impurity at this split is

    $\Delta I(t) = 1.522 - \frac{4}{10}(0.811) - \frac{6}{10}(1.459) \approx 0.322.$
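
The same numbers can be verified with a few lines of Matlab (the entropy function H is defined here only for this check):

H   = @(p) -sum(p .* log2(p));     % base-2 entropy of a probability vector
It  = H([4 4 2]/10);               % impurity at the ancestor node,    ~1.522
ItY = H([3 1]/4);                  % impurity at the 'Yes' descendant, ~0.811
ItN = H([1 3 2]/6);                % impurity at the 'No' descendant,  ~1.459
dI  = It - (4/10)*ItY - (6/10)*ItN % decrease in impurity,             ~0.322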

2: Stop Splitting

The natural question that now arises is when one decides to stop splitting a node and declares it a leaf of the tree. A possibility is to adopt a threshold T and stop splitting if the maximum value of ΔI(t), over all possible splits, is less than T. Other alternatives are to stop splitting either if the cardinality of the subset Xt is small enough or if Xt is pure, in the sense that all points in it belong to a single class.

3: Class Assignment Rule

Once a node is declared to be a leaf, it has to be given a class label. A commonly used rule is the majority rule, that is, the leaf is labeled as ω_j where

    $j = \arg\max_i P(\omega_i|t)$

In words, we assign a leaf t to the class to which the majority of the vectors in Xt belong.

Learning A critical factor in designing a decision tree is its size. The size of a tree must be large enough but not too large; otherwise it tends to learn the particular details of the training set and exhibits poor generalization performance. Experience has shown that the use of a threshold value for the impurity decrease as the stop-splitting rule does not lead to trees of the right size: many times it stops tree growing either too early or too late. The most commonly used approach is to grow a tree up to a large size first and then prune nodes according to a pruning criterion. A number of pruning criteria have been suggested in the literature. A commonly used criterion is to combine an estimate of the error probability with a complexity-measuring term, e.g., the number of terminal nodes.

It is not uncommon for a small change in the training dataset to result in a very different tree, meaning there is a high variance associated with tree induction. The reason for this lies in the hierarchical nature of the tree classifiers: an error that occurs in a higher node propagates through the entire subtree, that is, all the way down to the leaves below it. The variance can be reduced by using so-called random forests, which we will introduce in the section on ensemble classifiers (Section 12.3.1).


Algorithm 7 Growing a binary decision tree. From ThKo p219.
Parameters      stop-splitting threshold T
Initialization  begin with the root node, X_t = X
For each new node t
    For every feature x_k (k = 1, ..., l)
        For every value α_kn (n = 1, ..., N_tk)
            - generate X_tY and X_tN for: x_k(i) ≤ α_kn, i = 1, ..., N_t
            - compute ΔI(t|α_kn)
        End
        α_kn0 = argmax_α ΔI(t|α_kn)
    End
    [α_k0n0, x_k0] = argmax_k ΔI(t|α_kn0)    (best feature with its best threshold)
    If the stop-splitting rule is met
        declare node t as a leaf and designate it with a class label
    Else
        generate nodes t_Y, t_N with corresponding X_tY, X_tN for: x_k0 ≤ α_k0n0
    End
End
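
The inner loops of Algorithm 7 can be sketched for a single feature as follows (x is a column vector of feature values, y a vector of numeric class labels of the same length; the variable names are ours):

cls = unique(y);
H   = @(yy) -sum(arrayfun(@(c) max(mean(yy==c),eps)*log2(max(mean(yy==c),eps)), cls));
bestDI = -inf;  bestA = NaN;
for a = x'                                    % try every observed value as a threshold
    yY = y(x <= a);   yN = y(x > a);          % 'Yes' and 'No' subsets
    if isempty(yY) || isempty(yN), continue; end
    dI = H(y) - numel(yY)/numel(y)*H(yY) - numel(yN)/numel(y)*H(yN);
    if dI > bestDI, bestDI = dI; bestA = a; end   % keep the largest impurity decrease
end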

11.1 Usage in Matlab

In Matlab we use the function fitctree to evaluate data with a tree classifier:

MdCv = fitctree(DAT, GrpLb, ’kfold’,nFld);

pcTree = 1-kfoldLoss(MdCv);

Older Matlab versions use the function name classregtree.

We can visualize a tree using the function view:

view(MdCv.Trained{1},'Mode','graph');

Appendix G.16 gives a complete example; the usage of fitting and evaluation functions is analogous to the use of the kNN-classifier function or the linear-classifier function, see again the overview of classifiers in Appendix G.1 or see the explicit code for the kNN classifier in Appendix G.6.

11.2 Recapitulation

Application Decision tree classifiers are particularly useful when the input is non-metric, that is, when we have categorical variables. They also treat mixtures of numeric and categorical variables well.

Advantages Learning duration is short. Due to their structural simplicity, decision trees are easily interpretable. Even for Random Forests the learning duration is relatively short.

Disadvantages Learning is not robust: slight changes in the dataset can lead to the growth of very different trees. However, with Random Forests - introduced in the next section - those 'variances' are averaged out.


12 Ensemble Classifiers    Alp p419, ch 17; HKP p377, 8.6

An ensemble classifier is a classifier that makes multiple classification estimates using different classifiers and then combines those estimates to form a single decision. We introduced the idea with Fig. 8c. Constructing such a classifier makes sense in particular when we have features (variables) from different sources (Fig. 27a): for instance, we may have features from auditory and visual signals; then we train two separate classifiers and observe whether we outperform the accuracy of just a single classifier. Or we classify a multi-class problem with c classes in a class-wise manner, for instance by training c classifiers that each discriminate one class versus all other classes, also known as one-versus-all (OvA) or one-versus-rest (OvR) classification. In the first case, the data matrix is partitioned along the feature axis; in the second case it is partitioned along the sample axis (Fig. 27b). There are more schemes to partition the data matrix (along either axis) to arrive at an effective ensemble classifier, which we introduce later. Or, as a third and somewhat very simple approach (Fig. 27c), one may train the whole (unpartitioned) data set with different classifier algorithms - for instance, once with a kNN classifier, once with a linear classifier, once with a decision tree, etc. - and then combine those estimates. There is no guarantee that an ensemble classifier really works better than a single classifier, but the more cleverly the individual classifiers are tuned, the more likely it is that the ensemble predicts significantly more accurately than a single classifier.

In Matlab, many ensemble techniques are available with the function fitensemble, but there are also separate functions. In Python, they are available in the scikit-learn module sklearn.ensemble. We now introduce how to combine the outputs of individual classifiers (Section 12.1). Then we elaborate on those other ensemble learning schemes (Fig. 27d, e, f), for which we first explain their rationale (Section 12.2) and then explain how to use them.

12.1 Combining Classifier Outputs

A classifier outputs a class label or a graded measure; the graded measure is a class confidence, sometimes also called posterior probability or score. A label is returned by the kNN algorithm and the decision tree, for example. A graded measure is returned by a linear classifier or a neural network, for example. Graded measures can be combined straightforwardly, for instance by simple math operations, coming up next (Section 12.1.1). We can also try to learn the combination (Section 12.1.2). If we combine both labels and graded measures, then we first need to transform the labels into a graded measure; Section 12.1.3 gives some ideas. If we build one-versus-all classifier schemes, then it is best to combine them using the output label together with an error-correcting output code scheme (ECOC), see Section 12.1.4 for more.

Obtaining Graded Measures In our previous classification examples, we have only extracted labels using the function predict. To extract the graded measure, Matlab provides a second output argument called 'scores':

[~,Score] = predict(Mdl, TST); % called 'score' in Matlab [nSmp nCat]

In Python, the measure is called 'probabilities' and is obtained by a separate function, predict_proba, applied to the classifier object:

Prob = Mdl.predict_proba(TST) # returns probabilities [nSmp nCat]

12.1.1 Voting and Other Combination Rules

The simplest way to combine multiple classifier outputs is by summing their graded measures and then taking the maximum, as one does for a linear classifier for instance (eq. 5). That is also called voting, specifically simple voting, because every classifier output is treated with equal 'weight'. The code example in G.17 gives an explicit example in Matlab for case 'c' (multiple whole, Fig. 27). The graded measures are first concatenated to a three-dimensional array, then we sum the confidence values, and then we take the maximum to find the most appropriate class label. In the example, we also test the maximum operation for combining the classifiers.



Figure 27: Strategies to create effective ensembles of base classifiers. One partitions the data matrix, either vertically along features (cases a and e) or horizontally along observations (cases b, d, and f); or one uses the matrix repeatedly (as a whole) multiple times (case c). Vertical partitioning also helps counteracting the curse of dimensionality.
a. The features come from different sources, for instance from an auditory and from a visual signal: then we train one classifier only for the auditory input and one classifier only for the visual input, and then combine the estimates of those two base classifiers (also called base learners).
b. Class-wise classification, for instance a one-versus-all classifier scheme. This is typically done when a Support Vector Machine is applied to a multi-class task (Section 15), in which case the c classifiers are combined by an Error-Correcting-Output-Code (ECOC) scheme.
c. The data set is trained with different classifier algorithms, or with the same algorithm but with different hyper-parameters, or a combination of those variations.
d. Bagging is the random selection of observations and their training - short for bootstrap aggregating; the method of random forests is a specific case of that scheme.
e. Taking case a further: we select randomly from the feature space; useful if we are uncertain about our feature sources but suspect possible combinations to be of advantage. This type of ensemble creation is sometimes considered as belonging to the bagging strategy (case d).
f. Stage-wise learning, for instance training on difficult cases: generate a classifier for the entire set and then observe which points represent difficult cases - either wrongly classified samples or those classified with low confidence; then train a second classifier only on those difficult cases to arrive at improved discrimination. Boosting and cascading are two examples.


SC = cat(3,ScNB,ScDC,ScRF); % [nTst nCat nClassifiers]

ScSum = sum(SC,3); % simple voting

ScMax = max(SC,[],3); % max-combination rule

[~,PrdSum] = max(ScSum,[],2);

[~,PrdMax] = max(ScMax,[],2);

One can also apply the median, minimum or product operation. The median rule is more robust to outliers. The minimum and maximum rules are pessimistic and optimistic, respectively. With the product rule, each learner has veto power: regardless of the other ones, if one learner has an output of 0, the overall output is set to 0. Note that after applying those combination rules, the measures do not necessarily sum up to 1.
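In Python these rules are one-liners with numpy; the following is a minimal sketch, assuming three hypothetical score matrices Sc1, Sc2 and Sc3 of size [nTst nCat] (e.g. obtained with predict_proba):

import numpy as np
SC = np.stack([Sc1, Sc2, Sc3], axis=2)   # [nTst, nCat, nClassifiers]
ScMed = np.median(SC, axis=2)            # median rule
ScPrd = np.prod(SC, axis=2)              # product rule: a zero output acts as a veto
PrdMed = np.argmax(ScMed, axis=1)        # class index with the largest combined measure
PrdPrd = np.argmax(ScPrd, axis=1)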

Python Python offers functions for combining classifier outputs, and we specify by options what type of combination rule we prefer. An example of an all-in-one function was given in the overview G.17.
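One such all-in-one function in scikit-learn is VotingClassifier; a minimal sketch follows (with the option voting='soft' the predicted probabilities are summed, i.e. simple voting on graded measures; TRN, GrpTrn and TST denote training data, training labels and test data as in the other examples):

from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

Ens = VotingClassifier(estimators=[('nb', GaussianNB()),
                                   ('dc', DecisionTreeClassifier()),
                                   ('rf', RandomForestClassifier())],
                       voting='soft')            # 'hard' would vote on labels instead
Ens.fit(TRN, GrpTrn)
Prd = Ens.predict(TST)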

If any of those combination rules provides a significantly better prediction accuracy than the use of a single classifier, then we have already achieved our goal of building an effective ensemble classifier. If not, we can try to learn the combination, coming up now.

12.1.2 Learning the Combination

Instead of merely choosing a combination rule as we did above, we can also give weights to the different classifier outputs, called a weighted sum. In that case we talk of ensembles or linear opinion pools, which itself represents a linear classifier. If we learn the combination on the entire training set, then this is not optimal, but we can certainly try without any concern. The combination stage will perform better if it is learned on novel data. That means we need to split the training set into a subset for training the base classifiers only, and a validation subset for the combination stage, see also Section ?? again. Ultimately, this learning scheme is more elaborate and requires more training data.
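A minimal sketch of this scheme in Python (scikit-learn), assuming the usual TRN and GrpTrn: the base classifiers are trained on one part of the training set, and a linear classifier learns the combination from their graded outputs on the held-out validation part:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

Trn, Val, gTrn, gVal = train_test_split(TRN, GrpTrn, test_size=0.3)
base = [GaussianNB().fit(Trn, gTrn), DecisionTreeClassifier().fit(Trn, gTrn)]
PrbVal = np.hstack([m.predict_proba(Val) for m in base])   # base outputs on novel data
comb = LogisticRegression().fit(PrbVal, gVal)              # the learned linear opinion pool
PrdTst = comb.predict(np.hstack([m.predict_proba(TST) for m in base]))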

12.1.3 Component Classifiers without Discriminant Functions DHS p498, 9.7.2, pdf 576

If we create an ensemble classifier whose individual learners consist of different classifier types, e.g. one is a linear classifier and the other is a kNN classifier, then we need to adjust their outputs, in particular if they do not compute discriminant functions. In order to integrate the information from the different (component) classifiers we must convert their outputs into discriminant values. It is convenient to convert the classifier output g̃i to a range between 0 and 1, now gi, in order to match them to the posterior values of a (regular) discriminant classifier. The simplest heuristics to this end are the following:
Analog (e.g. NN): softmax transformation:

$g_i = \frac{e^{\tilde{g}_i}}{\sum_{j=1}^{c} e^{\tilde{g}_j}}$.   (10)

Rank order (e.g. kNN): If the output is a rank order list, we assume the discriminant function is linearly proportional to the rank order of the item on the list. The values for gi should then sum to 1, that is, scaling is required.

One-of-c (e.g. Decision Tree): If the output is a one-of-c representation, in which a single category is identified, we let gj = 1 for the j corresponding to the chosen category, and 0 otherwise.

The following table gives a simple illustration of these heuristics (example taken from Duda/Hart/Storck). Other scaling schemes are certainly possible too. As pointed out, one needs to look into the prediction functions given by the software packages to understand what graded measures can be obtained.
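A minimal numpy sketch of the three conversion heuristics (the function names are mine, chosen for illustration):

import numpy as np

def softmax(g):                       # analog outputs -> values in (0,1) that sum to 1
    e = np.exp(g - np.max(g))         # subtracting the maximum avoids numerical overflow
    return e / e.sum()

def rank_to_scores(ranks):            # rank-order list (1 = best) -> proportional, summing to 1
    s = len(ranks) + 1 - np.asarray(ranks, dtype=float)
    return s / s.sum()

def one_of_c(j, c):                   # chosen category j (0-based) among c classes -> one-hot
    g = np.zeros(c)
    g[j] = 1.0
    return g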


12.1.4 Error-Correcting Output Codes Alp p427, 17.5, pdf 467

This technique is used in particular when we attempt to break down a multi-class challenge with K classes into an ensemble of K one-versus-all classifiers, each one a binary classifier in which one class is tested versus the remaining classes. We can elaborate this scheme and also choose to train pairwise binary classifiers, in which every class is trained against every other class, also called one-versus-one classification; in that case, a total of K(K−1)/2 binary classifiers are trained.

The technique of error-correcting output codes looks at the predicted labels (not the graded measures). Doing so, one is faced with a binary table. In that table there may appear systematic errors, which in turn can be corrected by learning appropriate modifications.

In Matlab the method is implemented with fitcecoc. It appears to be the default method when one applies a Support-Vector Machine to a multi-class problem (Section 15).
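In Python, the corresponding schemes can be found in sklearn.multiclass; a minimal sketch (OutputCodeClassifier implements the output-code idea, the other two wrappers the plain one-versus-all and one-versus-one constructions):

from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier, OutputCodeClassifier
from sklearn.svm import SVC

OvA = OneVsRestClassifier(SVC()).fit(TRN, GrpTrn)               # K binary classifiers
OvO = OneVsOneClassifier(SVC()).fit(TRN, GrpTrn)                # K(K-1)/2 binary classifiers
Ecc = OutputCodeClassifier(SVC(), code_size=2).fit(TRN, GrpTrn) # error-correcting code scheme
Prd = Ecc.predict(TST)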

Of course, with such schemes the amount of computation increases correspondingly. When constructing such an ensemble classifier, one should pay attention to the class imbalance problem (Section 8.4.1). The software functions will typically take care of any imbalance.

12.2 Rationale

The previously introduced ensemble classifiers - the feature-wise, the class-wise and the multiple-whole schemes (Fig. 27a, b, and c) - are obvious ways of combining classifiers to achieve a potentially better prediction accuracy. But one can push those schemes further and systematically search for classifiers that individually perform only sub-optimally, but whose combination nevertheless exceeds the prediction accuracy of a single classifier. Sometimes it is sufficient to have an individual classifier performing merely above chance level. In those ensemble systems one often talks of base classifiers or base learners; one searches for those base classifiers systematically. There exist three search approaches:

Bagging (Fig. 27d): this technique can be seen as a relative of the class-wise technique, but here one selects randomly from the observations with replacement; it stands for bootstrap aggregating. More on that in Section 12.3. The popular Random Forests are a specific instance of that approach.

Random Subspace (Fig. 27e): this can be understood as taking the feature-wise approach further and is useful if we suspect that certain feature combinations may be of advantage. We do not introduce this approach further.

Stage-Wise (Fig. 27f): those classifiers typically focus on the difficult cases: after a first classification round one identifies those observations that were wrongly classified or classified with low confidence. In a second round we then focus only on those difficult cases. Section 12.4 explains more.

As hinted already, a base learner does not necessarily need to learn with high accuracy - in some circumstances it is in fact better if it performs just above chance level. What is of greater importance is that the base learners complement each other; they should be diverse in their representation.


12.3 Bagging, Random Forests

Bagging is a voting method whereby base learners are trained on different subsets of the training set (Fig. 27). The subsets are generated by bootstrap, that is, by drawing randomly a subset of samples from the training set with replacement; bagging stands for bootstrap aggregating, see again table 1 in Section 8.1. Because sampling is done with replacement, it is possible that some instances are drawn more than once and that certain instances are not drawn at all. One can use randsample to create indices for different subsets, e.g.

nSub = 20; % number of subsets

szSub = round(nTrnSamp*0.1); % 10 percent of all training samples

aMd = cell(nSub,1); % store models

for i = 1:nSub

Ixr = randsample(nTrnSamp, szSub); % indices for szSub randomly chosen samples

DATsub = DAT(Ixr,:); % subset of data

GrpSub = Grp(Ixr); % same subset of group variable

% ---- train a base classifier/learner on DATsub ----

aMd{i} = fitcxxxx(DATsub, GrpSub);

end

The base learners are typically combined using the voting method. To ensure that the base learners do not learn too perfectly, they are trained with an unstable algorithm, such as a decision tree, a single- or multi-layer perceptron, or a condensed NN. Unstable means that small changes in the training set cause a large difference in the generated learner, namely a high variance.
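In Python, scikit-learn's BaggingClassifier draws the bootstrap subsets and combines the base learners for us; a minimal sketch mirroring the loop above (20 subsets, 10 percent of the training samples each):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

Bag = BaggingClassifier(DecisionTreeClassifier(),  # an unstable base learner
                        n_estimators=20,           # number of subsets / base learners
                        max_samples=0.1,           # fraction of training samples per subset
                        bootstrap=True)            # drawing with replacement
Bag.fit(TRN, GrpTrn)
Prd = Bag.predict(TST)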

It is rather difficult to obtain a better prediction accuracy with bagging alone. A modification of this scheme has been very successful however, called the Random Forest classifier, which is introduced next.

12.3.1 Random Forest wiki Random forest JWHT p316, 8.2; HKP p382, 8.6.4; SKL p238

A random forest is an ensemble classifier consisting of multiple decision trees, hence its name. It is trained with the method of bagging as just introduced above. The larger the number of trees in such a forest, the better the prediction accuracy, but the longer also is the training duration.

Random forests do not require folding to arrive at a pooled prediction estimate - that is why the fold instruction is lacking in the example of the overview (Appendix G.1). For every decision tree that takes a subset of observations for learning, the remaining observations can be used for prediction with that tree. That remaining set of observations is also called 'out-of-bag'. At the end of the training procedure, the individual predictions are pooled, very much like one pools the predictions for a new (testing) sample. The error for the out-of-bag samples is called the out-of-bag error.

Usage in Matlab We use the command TreeBagger to learn the ensemble of trees, and then employ the command predict to classify the testing samples:

Forest = TreeBagger(100, TREN, Grp.Trn);

PredC = predict(Forest, TEST);

The output of the function predict is a list of strings (a cell array of character vectors) and we need to convert those back to numbers, with str2double for instance. A full example is given in Appendix G.18. We can also use the function CompactTreeBagger, which requires less memory and is useful if we carry out the training phase only once. The function TreeBagger allows continuous learning by appending new decision trees later on, but that also requires that one carries along the entire training set.

Usage in Python There exist two variants for random forests, RandomForestClassifier and ExtraTreesClassifier, both in the module sklearn.ensemble.
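A minimal sketch for the first variant; with oob_score=True the out-of-bag estimate mentioned above is computed during training:

from sklearn.ensemble import RandomForestClassifier

Forest = RandomForestClassifier(n_estimators=100, oob_score=True)
Forest.fit(TRN, GrpTrn)
print(1 - Forest.oob_score_)          # out-of-bag error (oob_score_ is the accuracy)
Prd = Forest.predict(TST)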


Applications There are two notable cases where random forests are particularly successful:
- document classification, as part of the field of text mining (text categorization);
- movement classification, as part of the domain of computer vision. Microsoft's Kinect device recognizes a user's movements by use of such random forests.

12.4 Stage-Wise, Boosting

Boosting is a method that focuses on samples that are difficult to discriminate, namely those samples that tend to be unusual for their own class. The boosting procedure starts by training on the entire set in a superficial way and then analyzes which samples were not properly classified. Those mis-classified examples are then selected and a second round of training is carried out. We would have accumulated two base learners so far, one for the first round and one for the second round. One can continue this cycle of selecting mis-classifications and training separately until all training samples are correctly classified, and we would have accumulated a sequence of base learners. When we evaluate a new sample, we apply that sequence and so generate a graded measure.

In the context of boosting, one also calls the base learners weak learners, because they often predict just above chance level.
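A minimal sketch of boosting in Python (scikit-learn's AdaBoostClassifier); shallow decision trees, so-called stumps, serve as the weak learners:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

Boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),  # weak learner: decision stump
                           n_estimators=50)                      # length of the sequence
Boost.fit(TRN, GrpTrn)
Prd = Boost.predict(TST)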

12.5 Recapitulation

Ensemble classifiers are particularly useful if your data are heterogeneous, for example when the data come from different sources, contain different data types, or have heterogeneous classes that essentially consist of sub-classes. Then it certainly makes sense to try an ensemble classifier that partitions the data feature-wise. It takes little effort to test this type of classifier.

Class-wise classification is a necessity if one uses the Support-Vector Machine, because it is a binary classifier; class-wise ensemble classifiers are best combined with error-correcting output codes (ECOC).

Then there exist ensemble classifiers that use multiple sub-optimal predictors. A particularly successful one is the Random Forest, an ensemble of decision trees. It has been successful in document classification, but also in motion recognition (Microsoft Kinect camera).

The downside of ensemble classifiers is that it can take some time to find the right combination of base learners. The upside is that one can achieve good prediction accuracies with a relatively simple method, in comparison to complex methods such as Deep Neural Networks.


13 Recognition of Sequences DHS p413, s 8.5, pdf 481

ThKo p487, s 8.2.2

Now we look at the classification of strings or sequences, which again cannot be compared with typical metric methods. It is another case of classification with nominal data - the first one we introduced with Decision Trees in Section 11. The following methods address patterns described by a variable-length string of nominal attributes, such as a sequence of base pairs in a segment of DNA, e.g. 'AGCTTCAGATTCCA', or the letters in a word or text. The methods are useful for dealing with sequences in general.

A particularly long string is denoted text. Any contiguous string that is part of x is called a sub-string, segment, or more frequently a factor of x. For example, 'GCT' is a factor of 'AGCTTC'. There is a large number of problems in computations on strings. The ones that are of greatest importance in pattern recognition are:
- String matching: Given x and text, test whether x is a factor of text, and if so, determine its position.
- Edit distance: Given two strings x and y, compute the minimum number of basic operations - character insertions, deletions and exchanges - needed to transform x into y.
- String matching with errors: Given x and text, find the locations in text where the 'cost' or 'distance' of x to any factor of text is minimal.
- String matching with the 'don't care' symbol: This is the same as basic string matching, but with a special symbol, ∅, the don't care symbol, which can match any other symbol.

We introduce only the first two.

13.1 String Matching Distance

The simplest detection method is to test each possible shift, which is also called 'naive string matching'. A more sophisticated method, the Boyer-Moore algorithm, uses the matching result at one position to predict better possible matches, thus not testing every position and accelerating the search.

Figure 28: The general string-matching problem is to find all shifts s for which the pattern x appears in text. Any such shift is called valid. In this case x = "bdac" is indeed a factor of text, and s = 5 is the only valid shift. [Source: Duda,Hart,Storck 2001, Fig 8.7]

Usage in Matlab The function strfind carries out this simple type of matching.
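In Python, the built-in string method find performs the same naive type of matching; a minimal sketch with a hypothetical text:

text = 'probacbdacbb'        # hypothetical text
x = 'bdac'                   # the pattern
s = text.find(x)             # index of the first valid shift, or -1 if x is not a factor of text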

13.2 Edit Distance

The edit distance between x and y describes how many fundamental operations are required to transform x into y. The fundamental operations are:

- Substitutions: a character in x is replaced by the corresponding character in y.

- Insertions: a character in y is inserted into x, thereby increasing the length of x by one character.

- Deletions: a character in x is deleted, thereby decreasing the length of x by one character.

Let C be an (m+1)×(n+1) matrix of integers associated with a cost or distance, and let δ(·, ·) denote a generalization of the Kronecker delta function, having value 1 if the two arguments (characters) match and 0 otherwise. The basic edit-distance algorithm (Algorithm 8) starts by setting C[0, 0] = 0 and initializing the left column and top row of C with the integer number of steps away from i = 0, j = 0. The core of this algorithm finds the minimum cost in each entry of C, column by column (Figure 29). Algorithm 8 is thus greedy in that each column of the distance or cost matrix is filled using merely the costs in the previous column.


Algorithm 8 Edit distance. From DHS p486.
Initialization: x, y, m ← length[x], n ← length[y]
Initialization: C[0, 0] = 0
Initialization: For i = 1..m, C[i, 0] = i, End
Initialization: For j = 1..n, C[0, j] = j, End
For i = 1..m
  For j = 1..n
    Ins = C[i−1, j] + 1                           % insertion cost
    Del = C[i, j−1] + 1                           % deletion cost
    Exc = C[i−1, j−1] + 1 − δ(x[i], y[j])         % (ex)change cost
    C[i, j] = min(Ins, Del, Exc)                  % the minimum of the 3 costs
  End
End
Return C[m, n]

As shown in Figure 29, x = "excused" can be transformed to y = "exhausted" through one substitution and two insertions. The table shows the steps of this transformation, along with the computed entries of the cost matrix C. For the case shown, where each fundamental operation has a cost of 1, the edit distance is given by the value of the cost matrix at the sink, i.e., C[7, 9] = 3.

Figure 29: The edit distance calculation for strings x and y can be illustrated in a table. Algorithm 8 begins at the source, i = 0, j = 0, and fills in the cost matrix C, column by column (shown in red), until the full edit distance is placed at the sink, C[i = m, j = n]. The edit distance between excused and exhausted is thus 3. [Source: Duda,Hart,Storck 2001, Fig 8.9]
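The following is a direct Python transcription of Algorithm 8 (a sketch; every fundamental operation has a cost of 1). Applied to the example above it returns 3:

import numpy as np

def edit_distance(x, y):
    m, n = len(x), len(y)
    C = np.zeros((m + 1, n + 1), dtype=int)
    C[:, 0] = np.arange(m + 1)                # left column: i steps away from the source
    C[0, :] = np.arange(n + 1)                # top row: j steps away from the source
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            ins = C[i - 1, j] + 1                                       # insertion cost
            dele = C[i, j - 1] + 1                                      # deletion cost
            exc = C[i - 1, j - 1] + (0 if x[i - 1] == y[j - 1] else 1)  # (ex)change cost
            C[i, j] = min(ins, dele, exc)
    return C[m, n]

print(edit_distance('excused', 'exhausted'))   # prints 3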

The algorithm has complexity O(mn) and is rather crude; optimized algorithms have only O(m+n) complexity. Linear programming techniques can also be used to find a global minimum, though this nearly always requires greater computational effort.

Note: as mentioned in the introduction, the pattern can consist of any (limited) set of ordered elements, and not just letters. Example: the edit distance is sometimes applied in computer vision, specifically in shape recognition, for which a shape is expressed as a sequence of classified segments.

Usage in Matlab: The Matlab toolbox 'Bioinformatics' provides a set of functions, e.g. localalign, nwalign, etc.


14 Density Estimation

Density estimation is the characterization of a low-dimensional data distribution, typically a one-dimensional distribution only, sometimes two-dimensional, rarely three-dimensional. Density estimation is similar to clustering in principle (Section 9) and sometimes it is even considered part of the topic of clustering. While in clustering one attempts to identify densities in higher-dimensional data, in density estimation the data are typically only one-dimensional, for instance an individual variable (feature) of a multi-dimensional dataset - a column of your data matrix. The data under investigation can also be a two-dimensional distribution, for instance the spatial locations of objects in a space. It can also be a three-dimensional distribution, but with increasing dimensionality density estimation quickly becomes computationally intensive, as we will learn later. In density estimation, one often seeks an adequate visual display to understand the distribution better, which in clustering is difficult to create; and one typically identifies the modes (maxima) of the distribution.

There are two principal types of density estimation methods: non-parametric and parametric methods, Sections 14.1 and 14.2, respectively. In non-parametric methods, the distribution is merely transformed and we typically specify a single parameter for this transformation. In parametric methods we are more explicit: we specify the number of expected densities, for instance - similar to specifying k for the K-Means algorithm.

14.1 Non-Parametric Methods Alp p165

The distribution is observed through different 'windows' or 'local neighborhoods', which are placed across the range of the data. There are two methods of 'windowing', see also Fig. 30. For the method of histogramming, the windows are called bins, and all we do is count the number of data points that lie within a bin. This count is then illustrated in a simple bar plot (Section 14.1.1). In the method of kernel estimation, the window is called a kernel or 'Parzen window' and we take some weighted average for the points within; kernel estimation results in a smooth distribution function, as opposed to the histogramming method (Section 14.1.2).


Figure 30: Density estimation. The data distribution is shown as black dots at y = −0.5; the data could be the values of one variable of the data matrix D, for instance. The (blue) bars represent an estimation by histogramming using a bin width equal to 1. The (green) dotted curve is an estimation using a kernel function that calculates differences between points and therefore works similar to a similarity measure.


14.1.1 Histogramming wiki Histogram

Histogramming is the simplest kind of density estimation - and the fastest one. When we generate a histogram, the data are assigned to bins whose borders are called edges; the spacing between two edges is called the bin width. In Fig. 30 a regular spacing was chosen with bin width equal to one (blue bars), but unequal spacing is possible too, in which case one specifies an array of edges. The estimate is zero if no instance falls within a bin.

Choosing the appropriate bin size can be tricky, in particular if one intends to find a description for the distribution. A too small bin width would not generalize sufficiently, and a too large bin width would generalize too much. Setting the appropriate range boundaries is not straightforward either. The exercises will clarify that.

Histogramming can be done in multiple dimensions too. A two-dimensional histogram is also called a bivariate histogram; for three or more dimensions one can talk of n-dimensional histograms. Two-dimensional histograms can also be displayed as bar histograms, namely as columns standing in a plane. For three or more dimensions, it becomes difficult to display the data and one needs to observe multi-dimensional data as two-dimensional subspaces, for instance, to obtain an idea about the data set.

Histogramming appears so convenient and effective that one is tempted to generate this density estimate for the entire dataset, that is for all variables of the data matrix. For a few dimensions this is possible, but the larger the dimensionality, the quicker we reach memory limits. For instance, for 7 dimensions with 10 bins each, the histogram already holds 10^7 cells; a few more bins or dimensions and the memory consumption reaches Gigabytes, even with the single data type. Therefore, for high dimensionality, histogramming is unfeasible.

Usage in Matlab With the command histogram you can plot a histogram, whereby it is possible to specify the number of bins or an array of edges. With the command histcounts you receive the actual histogram counts. Let's assume your data DAT have been scaled already, you intend to observe the first variable and you prefer to specify bin edges:

Edg = linspace(0,1,21); % bin edges from 0 to 1 in steps of 0.05

H = histcounts(DAT(:,1), Edg);

bar(Edg(1:end-1), H, ’histc’);

For a two-dimensional histogram you can use the function hist3. If you need n-dimensional histogramming, then try the function in Appendix G.19.2.
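The Python counterpart uses numpy.histogram; a minimal sketch, again assuming the data in DAT have been scaled to [0,1]:

import numpy as np
import matplotlib.pyplot as plt

Edg = np.linspace(0, 1, 21)                      # bin edges from 0 to 1 in steps of 0.05
H, _ = np.histogram(DAT[:, 0], bins=Edg)         # histogram counts for the first variable
plt.bar(Edg[:-1], H, width=0.05, align='edge')   # simple bar plot of the counts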

14.1.2 Kernel Estimator (Parzen Windows) wiki Kernel density estimation, Alp p167, ThKo p51

In the method of kernel estimation, the computation involves not only counting data points but also distance measurements between points within the window. The window size and the distance measurement are specified by a function called the kernel. Those kernels (windows) can be placed equally spaced throughout the data range at a set of specified points x, or we can place them exactly at the individual data points. The example in Fig. 30 uses equally spaced points with a spacing equal to one, to compare it with the histogram. At each such specified point x, the distribution xt (t = 1,..,N, N = number of data points) is observed by the kernel, whose behavior can be compared to a lens, metaphorically speaking - very much like in a similarity measure (Appendix B.2): points near the center are given more 'attention' than those in the periphery. In other words, at each specified point x the points of the distribution xt are weighted by the kernel function K (wiki Kernel smoother). The kernel function typically takes the difference between two points, one point being a selected x from the equally spaced set of points, the other a point from the distribution xt. That difference is then normalized by a parameter h called the bandwidth, which controls the width of the lens (the window size):

$K\left(\frac{|x - x^t|}{h}\right)$   (11)

There are many different kernel functions, see for example wiki Kernel (statistics)#Kernel functions in common use. The most popular kernel function for density estimation is the Gaussian function g(x; µ, σ), see equation 20


(in Appendix C). The location parameter µ corresponds to the variable x and the width σ corresponds to the bandwidth h.

Returning to our formulation for the density estimate: it is expressed as the function f(x), which consists of the sum of the weighted values for each xt obtained by the kernel function K with center x:

$f(x) = \frac{1}{Nh} \sum_{t=1}^{N} K\left(\frac{|x - x^t|}{h}\right)$,   (12)

whereby the divisor Nh normalizes the function. The code example in Appendix G.19.1 should clarify the method.

Kernel estimation can also be done in two or more dimensions. But the larger the dimensionality, the more computation is required. Because we need to calculate the distance between each set point x and all data points xt, the complexity quickly becomes unfeasible. It is common to perform kernel estimation for spatial coordinates, for instance - that is, two dimensions - with the purpose of determining the object locations exactly. But for three or more dimensions it is used rarely, and one would rather use a clustering algorithm instead.

Usage in Matlab Matlab offers the function ksdensity, which by default uses the Gaussian function as a kernel. One can specify a desired bandwidth

D = ksdensity(DAT(:,1), ’width’, 0.25);

but one can also omit it and the bandwidth will be estimated by some simple rule.
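In Python, a Gaussian kernel estimate can be obtained with scipy; a minimal sketch for the first variable of DAT (here the bandwidth is chosen automatically by a simple rule, e.g. Scott's rule):

import numpy as np
from scipy.stats import gaussian_kde

kde = gaussian_kde(DAT[:, 0])                    # fit the kernel estimator on one variable
x = np.linspace(DAT[:, 0].min(), DAT[:, 0].max(), 200)
D = kde(x)                                       # estimated density at the points x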

14.2 Parametric Methods Alp p61

Parametric means we express the distribution by parameters, that is, by an equation with more than one parameter, which in the context of density estimation is also called the probability density function (PDF). If we assume that our distribution has only one mode - also known as a uni-modal distribution - then the simplest parametric description would be to take its mean µ and standard deviation σ, also known as the first-order statistics of the distribution. If we then intend to determine the distance of new samples to this uni-modal distribution, then it would be convenient to employ the Gaussian function again (Equation 20). This is exactly what is done for the Naive Bayes classifier (Section 17).

If however we suspect two or more modes in our distribution, then we need an algorithm that finds µ and σ for each mode automatically. For example, in Fig. 30 we have observed with the kernel method that there may exist two major densities in the distribution. Of course, one could take the K-Means algorithm (Section 9), which in fact is very similar to the following approach, but the method presented here is subtler and somewhat more precise.

14.2.1 Gaussian Mixture Models (GMM)

Here we expect that the distribution consists of multiple modes, meaning we know that there are two or more sources giving rise to bi-modal or multi-modal distributions, respectively. The goal is then to locate the precise position of each source and to estimate the standard deviation it introduces into the data, which is again a case for the Gaussian function. We specify the number k of Gaussians we expect and then try to find the corresponding centers and standard deviations, µi and σi respectively (i = 1,..,k). One therefore calls this a Gaussian Mixture Model (GMM): the model simply adds the output of k Gaussian functions, whose means and standard deviations correspond to the locations of the modes and to the widths of the assumed underlying distributions, respectively.

The most popular algorithm to find the appropriate values for µi and σi is the so-called Expectation-Maximization (EM) algorithm. The algorithm gradually approaches the optimal values by a search procedure that is very akin to the K-Means algorithm (Algorithm 5), hence the relation of density estimation to clustering.


Usage in Matlab With the function gmdistribution.fit we can find a GMM (available in the statistics toolbox), for which we specify k as the minimum parameter:

Dgm = gmdistribution.fit(DAT,3);

Gm = pdf(Dgm, PtEv);

It returns a structure, called Dgm in our case, which we then pass to the function pdf, which generates the GMM at points PtEv (points of evaluation). We give a full example in Appendix G.19.3, without any further explanation.
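The Python counterpart is GaussianMixture in sklearn.mixture; a minimal sketch (PtEv again denotes the points of evaluation, here a two-dimensional array):

import numpy as np
from sklearn.mixture import GaussianMixture

Dgm = GaussianMixture(n_components=3).fit(DAT)   # means and covariances found with EM
Gm = np.exp(Dgm.score_samples(PtEv))             # score_samples returns log densities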

14.3 Recapitulation

Density estimation is used for analyzing variables or specific distributions of low dimensionality, typically one to three dimensions at most. Histograms are suitable for obtaining an idea of what type of distributions we deal with, for instance when observing the variables (dimensions) of a data matrix. Kernel estimation is suitable if we have particular data sets, spatial coordinates for instance.

For larger dimensionality, density estimation becomes unfeasible however. In the case of histogramming we quickly reach memory limitations; in the case of the kernel-density estimation method (parametric or non-parametric), the computational complexity is the limiting factor. For large dimensionality one rather resorts to clustering methods.


15 Support Vector Machines ThKo p119

A Support Vector Machine (SVM) is sometimes considered an elaboration of the linear classifier introduced previously. A SVM focuses on samples that are difficult to classify, somewhat akin to the hard negative mining technique mentioned before (Section 8.4.3); a SVM will 'dent' the decision boundary based on those difficult examples. For that reason, a SVM typically performs better than an 'ordinary' linear classifier, but it also requires more tuning: its learning duration is typically much longer; and it may only work if the classes are reasonably well separable. The following characteristics make a SVM distinct from an ordinary linear classifier:

1. Kernel function: The SVM uses such functions to project the data into a higher-dimensional space in which the data are hopefully better separable than in their original lower-dimensional space. Kernel functions are typically similarity measures, see also Appendix B.

2. Support Vectors: The SVM uses only a few sample vectors for generating the decision boundaries, and those are called support vectors. For a 'regular' linear classifier, there exist multiple reasonable decision boundaries that separate the classes of the training set. For instance, the optimal hyperplane in Figure 31 could actually have slightly different orientations. The SVM finds the hyperplane that also gives a good generalization performance, whereby the support vectors are exploited for what is called 'maximizing the margin' - the two bidirectional arrows delineate the margin.

An SVM is however only a binary classifier, taking only two classes as input. If we intend to solve a multi-class task with it, we need to construct K one-versus-other classifiers and then combine their outputs, a technique introduced in Section 12.1.4 already.

15.1 Usage in Matlab

The overview in Appendix G.1 already gave two examples, one showing how to use the all-in-one function. In older Matlab versions you may find the SVM under the bioinformatics toolbox: the function svmtrain trains the model and the function svmclassify applies the trained model to the test set.

Scaling By default the function fitcsvm will not scale the data, see again Section 3.3. If you intend to test your data with scaling, then you need to set the parameter 'Standardize' to true. In previous Matlab versions it was the other way around: scaling was the default.

Kernel Function By default, Matlab's function for the SVM uses a linear kernel (see also Appendix B). Try a different one if you are hunting for better prediction accuracy. In previous Matlab versions, the default used to be the dot product. You can also specify your own kernel; in G.20 we give an example. That kernel function is useful when your data express histograms.

Lack of Convergence It is not rare that the SVM learning algorithm does not converge to a solution with its default settings. The error may look something like 'No convergence achieved ...'. In that case we need to play a bit with certain parameters. Here are two tricks:

1. Lower the box constraint parameter. The default is equal one. Try something smaller, maybe 0.95.

2. Increase the parameter KKTTolerance. The default is equal 0.001. Set to 0.005 to hopefully improve.

Specify those parameters as follows:

Svm = fitcsvm(TRN, GrpTrn, ’boxconstraint’,0.95, ’tolkkt’, 0.005);
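For comparison, a minimal sketch in Python (scikit-learn), with explicit scaling; the regularization parameter C plays a role comparable to the box constraint:

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

scl = StandardScaler().fit(TRN)                  # scaling is not done automatically here either
Svm = SVC(kernel='rbf', C=1.0)                   # kernel='linear' for a linear kernel
Svm.fit(scl.transform(TRN), GrpTrn)
Prd = Svm.predict(scl.transform(TST))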


Figure 31: Training a support vector machine consists of finding the optimal hyperplane, that is, the one with the maximum distance from the nearest training patterns. The support vectors are those (nearest) patterns, a distance b from the hyperplane; in this illustration there are three support vectors, two black ones and one red one. [Source: Duda,Hart,Storck 2001, Fig 5.19]

15.2 Recapitulation

Application If a binary classification task needs to be optimized, it is definitely worth trying a SVM: chances are good that your prediction accuracy will increase, but occasionally it will not. It can also be worth trying to classify three or more classes with the one-versus-all scheme as mentioned in Section 12.1.4.

Advantages
- SVMs are probably the best binary classifiers for most binary tasks.
- SVMs have also excelled at very high dimensionality, when there are many more dimensions than samples, as in gene analysis.
- The SVM is robust in the sense that it does not require a feature transformation, as opposed to the linear discriminant classifier, which often requires that the data are made more compact using, for instance, the principal component analysis.

Disadvantages
- SVMs sometimes require parameter tuning, as opposed to linear classifiers, but probably less so than neural networks.
- The learning duration is somewhat long: it typically takes longer than a standard linear classifier (as in the Matlab function fitcdiscr), but it is still much quicker than a Deep Neural Network.
- A SVM is difficult to tune if classes are not reasonably separable - it may also simply not work as well as a linear classifier for a task that is hard to discriminate.


16 Deep Neural Network (DNN) wiki Deep learning

A Deep Neural Network (DNN) is an elaboration of a traditional (artificial) neural network - just as the Support Vector Machine is an elaboration of the 'traditional' linear classifier. In order to understand the idea of a DNN, we first sketch traditional neural networks in Section 16.1, which is also helpful for understanding why they are still being used in other classifier methodologies, for instance in ensemble classifiers (Section 12) or Support Vector Machines (Section 15).

The neural network methodology is more diverse than any other classifier methodology; in particular, there exist more learning algorithms than for any other classifier methodology. A network needs to be designed, and a key problem is to find the appropriate network architecture, sometimes also called topology. This search for the appropriate topology is usually done purely heuristically and is therefore somewhat time-consuming.

The performance of traditional networks was not better than the performance of other classifiers for a general classification task. For those reasons traditional neural networks were not always taken seriously by traditional machine learning scientists. The latest developments in neural network methodology have however produced networks that frequently classify datasets more accurately than any other classifiers - sometimes by a large gain. But the downside of finding the appropriate topology persists and has become even more challenging in some cases. In Section 16.2 we introduce so-called Convolutional Neural Networks (CNNs), which are now widely used for image classification. In Section 16.3 we introduce a type of Deep Belief Network (DBN), which is a very potent, general classifier.

16.1 Traditional Neural Networks

There are two principal, traditional neural networks: the Perceptron, which can be considered the precursor of all neural networks; and the Multi-Layer Perceptron, which consists of - as the name implies - two or more layers of Perceptrons.


Figure 32: Network topologies: circles represent neural units; straight lines represent connections between units, with each connection 'flow' being controlled by a weight value. The input layer is usually placed at the bottom, the output layer at the top; hence flow proceeds bottom-to-top for classification, and top-to-bottom for learning.
Left: the architecture of a (multi-class) linear classifier or Perceptron. The classification flow is considered feed-forward. In this specific case there are 4 input units and 3 output units; the unit count of the output layer often corresponds to the number of classes to be trained. This diagram is also used for depicting the architecture of a so-called Restricted Boltzmann Machine (RBM) in the Deep NN methodology; for the RBM the 'flow' is more flexible than for a Perceptron but is difficult to depict.
Right: the architecture of a three-layer network, e.g. a multi-layer Perceptron (MLP). For MLPs, the classification process occurs feed-forward, the training process occurs backward. This diagram is also used for a Deep Belief Network (DBN) if the layers are trained as RBMs; in that case only the classification process is considered feed-forward; the learning process is more complex.


Perceptron wiki Perceptron
The Perceptron is essentially a linear classifier as introduced in Section 6, but uses a different learning method, namely the Perceptron learning rule. That learning rule is not very robust and is by now considered obsolete, but its lack of robustness is in fact an advantage in ensemble classifiers (Section 12). A Perceptron - or any linear classifier - can be regarded as a two-layer network consisting of an input layer and an output layer (Fig. 32 left). Those two unit layers hold a layer of weights, which corresponds to the weight matrix as in equation 4. The Perceptron architecture in Fig. 32 is specifically a multi-class Perceptron; the equivalent linear classifier would be a so-called 'multi-class linear classifier'.

Multi-Layer Perceptron (MLP) wiki Multilayer perceptron
Multi-Layer Perceptrons are stacks of Perceptrons. Typically, the term MLP refers to three layers, namely an input layer, a hidden layer and an output layer (Fig. 32 right). But a MLP can in principle have four or more layers, that is, two or more hidden layers. The hidden layer is typically all-to-all connected (as in the figure): each unit receives the value from all input units, and it transmits the computed value to all the units in the next layer - either the output layer or another hidden layer. A hidden layer is typically understood as the feature layer: it 'recognizes' part of its input, which in later layers is then integrated to complete the class information.

An MLP is a so-called feed-forward network because for the classification of a testing sample the information flow propagates only forward, namely from the input layer to the hidden layer, to the next hidden layer - if present - and on to the output layer. (This is in contrast to so-called recurrent networks, where information flow occurs in loops.)

MLPs can be used as kernel functions in a Support Vector Machine (Section 15) - the Matlab function svmtrain even contains an option to use them as such. They can also be used in ensemble classifiers, very much like Perceptrons.

There are different learning rules to train a MLP - by far the most successful one is the so-called back-propagation algorithm. As the name suggests, it works backward through the layers to adjust the weights in the connection layers.

The back-propagation algorithm is often the final step in learning a DNN architecture.

16.1.1 Usage in Matlab

The Neural Network toolbox in Matlab provides a set of commands to simulate the traditional NNs, such as the Perceptron or the MLP. Those commands typically start with the letters net (for network). With the command network you initialize a network. We leave it at that without further explanation, because instead of tuning a traditional network it might be worth trying to directly tune a DNN.
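If one nevertheless wants to experiment with a traditional MLP quickly, scikit-learn offers one in Python; a minimal sketch with a single hidden layer of 20 units:

from sklearn.neural_network import MLPClassifier

Mlp = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500)   # one hidden layer, 20 units
Mlp.fit(TRN, GrpTrn)
Prd = Mlp.predict(TST)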

16.2 Convolutional Neural Network (CNN) wiki Convolutional neural network

A Convolutional Neural Network gradually builds an abstraction of its input by first detecting local features, followed by a gradual assembling of those local features toward more global features. The term convolution expresses a mathematical operation in which a varying signal - in our case the input - is systematically analyzed with another, fixed signal - in our case a set of weights favoring a specific part of the input. In a CNN, that fixed signal is some local feature and needs to be found during the learning process. CNNs are particularly popular in the domain of image classification, where a network can consist of up to ten hidden layers; that would be a twelve-layer network! CNNs can be regarded as elaborations of Multi-Layer Perceptrons (Section 16.1).

A CNN has two types of hidden layers: feature layers and pooling layers (Fig. 33). A feature layer observes the result of a convolution - the corresponding weight layer is sometimes also called a convolutional layer. A pooling layer receives a feature layer as input and merely sub-samples it. In the sequence of layers from input to output, the feature layers and the pooling layers alternate in order to arrive at a local-to-global integration. Thus, in determining the topology of a CNN, one needs to experiment with the neighborhood size for the convolutions, as well as with the sub-sampling step for the pooling layers.


Figure 33: A simple Convolutional Neural Network (CNN) for learning to classify images. This architecture has four layers: input, feature, pooling and output layer; classification flow occurs from left to right. The feature layer is sometimes also called a feature map: it is the result of a convolution-type scanning of the input: each unit observes a local neighborhood in the input image. The pooling layer merely carries out a sub-sampling of the feature layer. In a typical CNN, there are several alternations between feature and pooling layers. The learning process tries to find optimal convolutions that help to separate the image classes.

Unlike the hidden unit of a MLP, the hidden unit of a CNN does not receive input from all its predecessors anymore, but only from a certain neighborhood. For instance, if the input is a 30×30 pixel image, then the first hidden unit would observe only the 6×6 pixel neighborhood in the upper left corner of the image; the second hidden unit observes the neighborhood shifted by one pixel to the right of the first one, etc. The hidden units of a CNN cover the entire input - corresponding to a convolution. Each hidden unit would observe whether its neighborhood contains a specific feature that is common across instances, for example a straight bar, a dot, or any other geometrical structure.

The advantage of CNNs is that their classification accuracy is better - sometimes much better - than that of other approaches, which makes them the first choice when signals such as images need to be classified. But their downside is that the training duration is very long. And because it requires some time to find the appropriate topology, it can take weeks to tune such a network.

To speed up the learning process there exist two tricks. One is to use an NVIDIA graphics card with thousands of so-called CUDA cores. The other is to use a pretrained network: those consist of feature maps that have been trained previously on other datasets.

16.2.1 Usage in Matlab

Providing a code example would take a lot of space. We therefore merely point out that example code can be found on the 'file exchange' site of Mathworks' website, as well as on the wiki page for CNNs (https://en.wikipedia.org/wiki/Convolutional_neural_network).

16.3 Deep Belief Network (DBN) wiki Deep belief network

A Belief Network is a network that operates with so-called conditional dependencies. A conditional dependence expresses the relation of variables more explicitly than just combining them with a weighted sum. However, determining the full set of parameters for such a network is exceptionally difficult. Deep Belief Networks (DBNs) are specialized in approximating such networks of conditional dependencies in an efficient manner, that is, at least partially and in reasonable time. Popular implementations of such DBNs consist of layers of so-called Restricted Boltzmann Machines (RBMs).


The principal architecture of a RBM is the same as that of a linear classifier (or Perceptron; left in Fig. 32), but the architecture of a RBM contains an additional set of bias weights (not shown in the figure). Those additional weights make learning more difficult but also more capable - they help in resolving those conditional dependencies. The typical learning rule for a RBM is the so-called contrastive-divergence algorithm.

The choice of an appropriate topology for the entire DBN is relatively simple. With two hidden layers made of RBMs, one can already obtain fantastic results. A third layer rarely helps in improving classification accuracy. Learning in a DBN occurs in two phases. In the first phase, the RBM layers are trained individually, one at a time, in an unsupervised manner: the RBMs essentially perform a clustering process. Then, in the second phase, the entire network is fine-tuned with the back-propagation algorithm.

As with Convolutional NNs, a Deep Belief Network takes much time to train. However, the choice of the architecture is easier to determine, as one obtains good results with two layers already. The main advantage of this type of Deep Network is that it is fairly robust: it produces results even for difficult datasets, for which SVMs are difficult to apply; and it often provides a classification accuracy that is similar to or even better than that of SVMs. Thus, some scientists prefer DBNs over SVMs.

16.3.1 Usage in Matlab

Again, providing a code example would take too much space. An example can be found on Mathworks' website.

16.4 Recapitulation

A Deep Neural Network (DNN) is definitely worth trying out, as it likely provides a better classification accuracy than most other classifiers in many tasks. In the domain of image classification, Convolutional Neural Networks (CNNs) have proven to provide the best accuracy, but finding the appropriate architecture remains a heuristic endeavor. A Deep Belief Network (DBN) made of Restricted Boltzmann layers is a classifier as powerful as the SVM, perhaps even more powerful. It however takes a long time to train a Deep Network, and if one needs only qualitative results, it is probably more convenient to use a traditional classifier.

Advantages
- DNNs can tackle 'large-scale' problems which before their introduction could not really be approached. In that sense they have ushered in a new era.
- Once they have been tuned properly, DNNs are fairly robust. DNNs do not require any dimensionality reduction.

Disadvantages
- The search for the appropriate architecture is a heuristic issue, in particular for CNNs.
- DNNs sometimes require substantial parameter tuning; yet less so than SVMs, according to my experience.
- The learning duration can be terribly long. However, in the case of CNNs, learning can be substantially accelerated with the acquisition of proper hardware (an NVIDIA graphics card with CUDA cores).


17 Naive Bayes Classifier wiki Naive Bayes classifier

We now introduce a classifier whose complexity lies between that of the Nearest-Shrunken-Centroid classifier - mentioned in Section 4 - and that of a 'sophisticated' linear classifier, such as the Linear Discriminant Analysis introduced in Section 6. As pointed out with Fig. 7b already, the Naive Bayes classifier tries to model the point clouds of the classes with individual Gaussian distributions. It does not care about potential dependencies between features, although we know that they exist in most data sets. For that reason it is called 'naive'; one says it assumes independence of features. The naivety comes however with a substantial degree of robustness, because the Naive Bayes classifier sometimes handles both extremes of the data matrix fairly well (Fig. 10): it can tackle large sample sizes because it does not have an elaborate fitting procedure; and it has proven to perform fairly well for some high-dimensional data sets, such as in document classification.

The Naive Bayes classifier tries to find a suitable function for the distribution of feature values. This is not easy because the distributions appear very diverse, see again Fig. 12: for the sepal length and width they appear to be uni-modal distributions; for the petal length it is a bi-modal distribution; and for the petal width it looks messy - it could be multi-modal or a flat distribution. And those are histograms for all values from all classes; the distributions for individual classes can be even more diverse. Ideally, one would try to model them individually - Matlab allows that (!).

Different versions of the Naive Bayes classifier try to approximate the class distributions with varying degrees of effort. In the simplest case, we use merely a Gaussian function, meaning we hope for a uni-modal distribution. We can take the mean and standard deviation, and that would be very similar to the Nearest-Shrunken-Centroid classifier we quickly mentioned in Section 4; we omit the details, they are subtle. Or we can find the optimal mean and standard deviation by density estimation as done in Section 14.2. An even more complicated Naive Bayes classifier would try to place a multi-dimensional Gaussian function, sort of as in Fig. 7b. Some of those variants are suitable for high-dimensional problems and they will be pointed out in the 'Usage' section below.

To classify a new (testing) sample, we compute the distribution function value for each class, and the one that returns the largest function value is then our preferred class label.

Algorithm 9 Naive Bayes Classifier. k = 1,..,c (nclasses, K)
Training: ∀ c classes (∈ DL): mean µk, covariance Σk, determinant |Σk|, inverse Σk⁻¹, prior P(k) → gk as in Equation 22
Testing: 1) for a testing sample x ∈ DT determine g(x) ∀ c classes → gk
         2) multiply each gk with the class prior P(k): fk = gk · P(k)
Decision: choose the maximum of fk: argmaxk fk

If the classes occur with uneven frequencies, we need to determine the frequency of each class, also called the prior, and include it as pointed out in the above algorithm (see the training step); it is then applied during testing by multiplying it with gk (step no. 2).

The Naive Bayes classifier suffers from the same problems as mentioned for the Linear Discriminant (Section 6): it can be difficult to compute the covariance matrix, in particular for few training samples (small sample size problem).

17.1 Usage in Matlab, Implementation

The usage of the all-in-one function, fitcnb, was demonstrated in Appendix G.1 already. But implementing a Naive Bayes classifier is not so difficult, see Appendix G.21 for an example (see also ThKo p81). With det we calculate the determinant. The computation of the inverse is preferably done with the command inv, but if the inverse is difficult to compute, for instance due to small sample size, then one can estimate the inverse with pinv. If the inverse still cannot be computed, then we need to perform a dimensionality reduction


(Section 7).

We did not include the prior in this code fragment, which one can generate with

Prior = Hgrp./sum(Hgrp(:))

for instance, where Hgrp is the sample count for each class (histogram of group variable, see Appendix G.3).

In older Matlab versions, the Naive Bayes classifier was implemented by the command classify and the option 'diaglinear' (or 'diagquadratic').

Variants Matlab offers many variants, too many to list here. One needs to study the documentation pages carefully. One noteworthy point is that Matlab allows specifying different functions for different features (predictors). That is a flexibility I have not found in Python.

17.2 Usage in Python SKL p225, 3.1.9

Python offers three variants in sklearn.naive_bayes:

1. GaussianNB: uses the Gaussian function for fitting. Do not expect high prediction accuracy with this one.

2. MultinomialNB: suitable for vectors with word counts, i.e. in document classification; or even in computer vision?

3. BernoulliNB: suitable for vectors with word occurrences, i.e. document classification where we are more interested in the presence of words than in their counts.

Furthermore, these classifiers can handle large-scale classification problems via the method partial_fit.
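A minimal sketch of the three variants (CNT and OCC are hypothetical matrices of word counts and of binary word occurrences, respectively):

from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

Gnb = GaussianNB().fit(TRN, GrpTrn)        # one Gaussian per feature and class
Mnb = MultinomialNB().fit(CNT, GrpTrn)     # CNT: matrix of (word) counts
Bnb = BernoulliNB().fit(OCC, GrpTrn)       # OCC: matrix of binary occurrences
Prb = Gnb.predict_proba(TST)               # graded measures; partial_fit allows incremental fitting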

17.3 Recapitulation

The family of Naive Bayes classifiers is appealing due to the simplicity of their fitting procedure. If one fits with a Gaussian, then the prediction accuracy is probably not competitive. However, if one uses the multinomial version, then it competes well in document classification tasks; it appears to defy the curse of dimensionality in that case.

Matlab also offers fitting predictors individually with its function, a promising flexibility. Python, in contrast, offers functions that can deal with very large data sets by allowing to fit the data partially.

Despite its reasonable prediction accuracy, the use of the actual posterior values is (perplexingly) not recommended due to unreliability.


18 Classification: Rounding the Picture & Check List [DHS p84]

18.1 Bayesian Formulation

A typical textbook on pattern classification (with mathematical ambition) starts by introducing the Bayesian formalism and its application to the decision and classification problem. We introduce this formalism at the end, because now - after we have worked with the different classifiers - it can be understood more easily. Bayes' formalism expresses a decision problem in a probabilistic framework:

Bayes rule:   P(ωj|x) = p(x|ωj) · P(ωj) / p(x),   that is:   posterior = (likelihood × prior) / evidence   (13)

We first explain the terms in 'natural' language, as given on the right side of the above equation [DHS p22,23; Alp p50]:

Posterior: the probability for the presence of a specific category ωj given the sample x.

Likelihood: the value computed with the class-conditional density function. In the example of the Naive Bayes classifier (Section 17), it is the value of Equation 22.

Prior: the probability that the category is present in general, that is, the frequency of its occurrence. We already called this the prior in Algorithm 9 (Section 17.1).

Evidence: the marginal probability that an observation x is seen at all - regardless of whether it is a positive or negative example - and it ensures normalization. (This was not explicitly calculated so far.)

Expressed formally we now say: given a sample x, the probability P(ωj|x) that it belongs to class ωj is the class-conditional probability density p(x|ωj), multiplied by the probability with which the class appears, P(ωj), divided by the evidence p(x). We can formalize the evidence as follows:

p(x) = Σj=1..c p(x|ωj) P(ωj) = Σ (likelihood × prior) = normalizer ensuring Σj P(ωj|x) = 1   (14)
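To make Equations 13 and 14 concrete, here is a small numerical sketch (the class-conditional densities and priors are invented for illustration; two classes with one-dimensional Gaussian likelihoods):

import numpy as np
from scipy.stats import norm

x = 1.2                                  # a single test value
prior = np.array([0.7, 0.3])             # P(w1), P(w2)
like = np.array([norm(0, 1).pdf(x),      # p(x|w1)
                 norm(2, 1).pdf(x)])     # p(x|w2)
evidence = np.sum(like * prior)          # Equation 14
posterior = like * prior / evidence      # Equation 13
print(posterior, posterior.sum())        # the posteriors sum to 1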

18.1.1 Rephrasing Classifier Methods

Given the above Bayesian formulation, we can now rephrase the working principle of the three classifier types (Sections 5, 6, 17) as follows:

k-Nearest-Neighbor (Section 5): estimates the posterior values P(ωj|x) directly, without attempting to compute any density functions (likelihoods); in short, it is a non-parametric method, because no effort is made to find functions that approximate the density p(x|ωj). kNN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification.

Naive Bayes Classifier (Section 17): is essentially the simplest version of the Bayesian formulation, and that classifier makes the following two assumptions in particular:

1. It assumes that the features are independent and identically distributed (i.i.d.), in short statistically independent. This is the 'naive' part of the Naive Bayes rule. But often we do not know beforehand whether the dimensions are uncorrelated.

2. It assumes that the features are Gaussian distributed (µ ≡ E[x], Σ ≡ E[(x − µ)(x − µ)^T]).

These are two strong assumptions, because most data distributions are more complex. Despite that, the Naive Bayes classifier often returns acceptable performance.

Discriminative Model (Section 15): these are similar to the kNN approach in the sense that they do not require knowledge of the form of the underlying probability distributions. Some researchers argue that attempting to find the density function is a more complex problem than trying to develop discriminant functions directly.


18.2 Estimating Classifier Complexity - Big O Notation [wiki: Big O notation]

We already discussed some of the advantages and disadvantages of the different classifier types in terms of their complexity. This is typically expressed with the so-called Big O notation. In short, the notation classifies algorithms by how they respond to changes in input size, e.g. how a change affects the processing time or the working-space requirements. In our case, we investigate changes in n or d (of our n × d data matrix). The issue is too complex to elaborate here; we merely summarize what we have mentioned so far and what will be mentioned in later sections. For classifiers, we also make the distinction between the complexity during learning and that of classifying a testing sample:

Classifier               | Learning           | Classification
k-Nearest-Neighbor       | -                  | O(dn) [slow]
Linear                   | O(d^2)             | O(d)
Decision Tree            | O(d)               | O(d)
Support Vector Machine   | O(n^2) [slow]      | O(d)
Deep Neural Network      | O(d1 d2) [slowest] | O(d)

Clustering Algorithm     | Complexity
K-Means                  | O(ndkT)
Hierarchical             | O(n^2) [slow]

Table 3: Complexities of classification and clustering methods. d = number of dimensions, n = number of samples, k = number of clusters, T = number of repetitions.

18.3 Parametric (Generative) vs. Non-Parametric (Discriminative)

Along with the Bayesian framework comes also the distinction between parametric and non-parametric methods, as already implied above and as introduced in Section 2.1 (see also Section 14). Bishop uses the terms generative versus discriminative instead. Textbook chapters are often organized according to this distinction. See also [wiki: Linear classifier#Generative models vs. discriminative models].

The parametric, generative methods pursue the approximation of the density distributions p(x|ωj) by functions with a few essential parameters. The prototypical example is the Naive Bayes classifier (Section 17). It is the approach preferred by theoreticians.

Non-parametric methods in contrast find approximations without any explicit models (and hence parameters), such as the kNN and the Parzen window. Here we summarize the typical assignment of the methods to those categories:

Parametric (Generative)
- Naive Bayes
- Expectation-Maximization (Section 14.2.1)
- (Maximum-Likelihood Estimation)
...in short: multi-variate methods

Semi-parametric
- Clustering, i.e. K-Means
- Expectation-Maximization

Non-parametric (Discriminative)
- k-Nearest-Neighbor
- Support Vector Machines
- Decision Trees
- Neural Networks (NN & DNN)

The term 'semi-parametric' I found in Alpaydin's textbook. The Expectation-Maximization algorithm can be classified as parametric or non-parametric - depending on the exact viewpoint.


18.4 Algorithm-Independent Issues [DHS p453, ch 9, pdf 531]

The machine learning community has tended to regard the most recently developed classifier methodology as a breakthrough in the quest for a (supposedly) superior classification method. However, after decades of research it has become clear (to most researchers) that no classifier model is absolutely better than any other: each classifier has its advantages and disadvantages, and their underlying, individual theoretical motivations are all justified in principle. In order to find the best-performing classifier for a given problem, a practitioner essentially has to test them all. Here are two issues that frequently occur in debates on pattern recognition:

Curse of Dimensionality [LRU p244, 7.1.3; HTF p22, 2.5] Intuitively, one would think that the more dimensions (attributes) we have at our disposal, the easier it is to separate the classes with any classifier. However, one often finds that the inverse holds: with an increasing number of dimensions it becomes more challenging to find the appropriate separability, which is referred to as the curse of dimensionality. The reason is simple and perplexing nevertheless: the more dimensions there are, the larger the space becomes and the lonelier the points get. Points become so isolated in high-dimensional space that their neighboring distances grow very large, so large in fact that distance measures lose their meaning. We give a more specific example later, when introducing clustering in high-dimensional space (Section 19.4).
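A small numpy experiment (purely illustrative, uniform random data) makes this concentration of distances visible: as the dimensionality grows, the ratio between the farthest and the nearest neighbor of a point approaches 1.

import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.RandomState(0)
for d in (2, 10, 100, 1000):
    X = rng.rand(500, d)              # 500 uniformly random points
    D = cdist(X[:1], X[1:])[0]        # distances from the first point to all others
    print(d, D.max() / D.min())       # the ratio shrinks toward 1 as d grows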

No Free Lunch Theorem [DHS p454] The theorem essentially states that no classifier technique is superior to any other one. Virtually any powerful algorithm, whether it be kNN, artificial NN, unpruned decision trees, etc., can solve a problem decently if sufficient parameters are created for the problem at hand.


18.5 Check List

It is easy to forget some detail that can cost you a few percent of your performance - or even get you stuck. Here are the essentials in short form:

Preparation        | Command          | Comment
1. datatype        | single(DAT)      | Turn data into 'single' to save on RAM memory; 'double' if precision is required
2. NaN/Inf         | isnan, isinf     | Avoid Inf; how many columns with NaN are there?
3. feature type    |                  | numeric (real) or nominal (categorical)? or both?
4. permutation     | randperm         | Permute the training set, in particular for SVM and DNN

Classification     | Command          | Comment
5. normalize       | i.e. zscore      | In rare cases normalization is detrimental to performance (!)
6. group frequency | hist             | Are the groups (classes) equally distributed? If not, pay attention to class-imbalance issues.
7. fold            | crossvalind      | 5-fold cross validation is recommended
kNN                | fitcknn          | Simple but memory intensive
Linear             | fitcdiscr        | Simple and straightforward - may require PCA; if performance is not better than kNN, check points 2, 4, 5 and 7
Tree(s)            | fitctree, TreeBagger | Useful if variables are nominal (categorical)
SVM                | fitcsvm          | Ideal for binary classification, but may require tuning; if tuning appears impossible, then use a linear classifier instead; if prediction is low (i.e. ≤ 80%), check the prediction also with a linear classifier
Ensemble           | individual       | Worth testing if your data come from different sources

Frequent failures and their likely cause:
- The prediction accuracy is at chance level (e.g. 50% for binary classification, 33% for three classes, ...) or it is 100%:
  - Verify that the group labels are assigned properly to the samples.
  - Check for NaN and Inf again - see item 2 under Preparation.
- The prediction accuracy for training is suspiciously high: lack of permutation perhaps - try with permutation (see Section 3 or the code example in G.3).


19 Clustering III

In this section we will elaborate a bit on the two clustering methods we introduced before, the K-Means algorithm of Section 9 and the hierarchical method of Section 10. But we will also introduce three novel categories of clustering algorithms. We first give an overview of all five categories, in order to point out the differences [ThKo p629, 12.2; HKP p448, 10.1.3]:

Partitioning methods: those methods dynamically evolve spherical clusters; the K-Means algorithm presented in Section 9 was one example of this category. Here we introduce variants that can deal with nominal data and that are less sensitive to outliers (Section 19.1).

Hierarchical methods: there exist variants of the procedures presented in Section 10 that can deal with large datasets, meaning they try to beat the O(N^2) complexity. Those variants - i.e. CURE, ROCK, CHAMELEON, BIRCH - will be mentioned in the subsection dealing with large databases (Section 19.3).

Density-based methods: for those methods one specifies a minimum density value; the algorithm allows finding cluster shapes that are relatively arbitrary, as opposed to the shapes evolved with partitioning or hierarchical methods. Density-based algorithms have become increasingly popular in recent times, with DBSCAN probably being the most used algorithm; they will be introduced in Section 19.2.

High-dimensional approaches: the larger the number of dimensions, the higher the chances that traditional clustering methods are not able to find the actual clusters. Section 19.4 explains why that is, and it mentions the approaches that can deal with high dimensionality.

Methods for Very Large Databases (VLDB): if the data is of such large sample size (high N) that it does not fit into a computer's RAM anymore, then one typically applies sub-optimal but computationally fast methods for the sake of being able to find clusters at all. Those will be introduced in Section 19.3.

It may have become clear that the number of clustering algorithms is larger than the number of classification algorithms. We merely attempt to give an overview in order to guide the reader toward the techniques they need, but we do not have code for all algorithms.

19.1 Partitioning Methods II

The classical K-Means algorithm enforces a cluster membership for each data point. This is sometimes also called hard clustering, because it seeks clear-cut boundaries. But groups or classes in data are rarely well separated and boundaries are rarely clear-cut. It can therefore sometimes be of advantage to regard boundaries as 'fuzzy': an example is given in Section 19.1.1. Hard clustering is particularly problematic if there exist outliers in the data. In that case taking the so-called medoid is better than taking the mean: Section 19.1.2 introduces an example.

19.1.1 Fuzzy C-Means (K-Means) [ThKo p712, 14.3]

The clusters that we have sought so far did not intersect: a sample (observation) was assigned to only one cluster, meaning clusters were assumed to have clear boundaries. This condition may be too strict for certain datasets, in particular when some samples show characteristics of two or more groups - and not only one. In other words, it is possible that groups overlap: group boundaries can then be said to be 'fuzzy'. And this is exactly what fuzzy analysis deals with. There exists a variant of the K-Means method which performs a fuzzy K-Means clustering. For historical reasons, the parameter variable k is called c in that procedure, and it is therefore known as the Fuzzy C-Means algorithm.

A typical clustering algorithm returns a one-dimensional array with group labels (of length equal to the number of observations). The Fuzzy C-Means algorithm instead returns c arrays, where each array holds the proportion of membership for one group. Put differently, for each observation we have c values that express the degree to which the observation belongs to each cluster; those c values sum to 1.
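As a minimal, purely illustrative numpy sketch, the membership values of one observation with respect to c cluster centers can be computed from its distances to the centers with the standard FCM update rule (m is the so-called fuzzifier, commonly set to 2; all numbers below are made up):

import numpy as np

x = np.array([1.0, 2.0])                                  # one observation
centers = np.array([[0.0, 0.0], [1.0, 3.0], [4.0, 4.0]])  # c = 3 cluster centers
m = 2.0                                                   # fuzzifier (m > 1)
d = np.linalg.norm(centers - x, axis=1)                   # distances to the centers
u = 1.0 / np.sum((d[:, None] / d[None, :]) ** (2.0 / (m - 1)), axis=1)
print(u, u.sum())                                         # c membership values, summing to 1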


Usage in Matlab Matlab has a 'Fuzzy Logic' toolbox which contains an implementation of the Fuzzy C-Means algorithm, see the command fcm, which is applied in the same way as kmeans. There is also a more explicit implementation with functions such as initfcm and stepfcm: one can obtain an idea of how to use those by looking at the function script irisfcm.m. In Appendix G.22 we give a compact example.

19.1.2 K-Medoids [wiki: K-medoids; ThKo p745, 14.5.2; HKP p454, 10.2.2]

Because K-Means methods can be sensitive to outliers, it is sensible to also try medoids instead of means. The medoid is a representative point of the group; it is a point of the dataset and not a computed point (see [wiki: Medoid]). The Partitioning-Around-Medoids (PAM) algorithm is the most common implementation of a K-Medoids approach. Not only can the algorithm deal better with outliers, it is in general also suitable for nominal data.

Advantages
- K-Medoids can also deal with discrete data - in addition to continuous data - whereas K-Means is suitable only for continuous data.
- K-Medoids is less sensitive to outliers.

Disadvantages The method is unfortunately computationally more demanding than K-Means: it has complexity O(k(N − k)^2) and thus almost quadratic complexity.

Variants Because the K-Medoids algorithm has high complexity and therefore does not scale well to large datasets, there have been attempts to provide variants that also work for large datasets. Two such variants exist:

CLARA (Clustering LARge Applications): this method essentially corresponds to the PAM algorithm, but is applied on a randomly selected subset of the data. The algorithm is repeated several times in order to find all medoids.

CLARANS (Clustering Large Applications based upon RANdomized Search): this method is a modified version of CLARA that improves its random subselection.

Usage in Matlab Matlab offers the function kmedoids, which is applied like the function kmeans. By default it performs the PAM algorithm. It has an option for running CLARA.

19.2 Density-Based Clustering (DBSCAN) [ThKo p815, 15.9; HKP p471, 10.4]

In the section on Density Estimation we had introduced clustering methods that calculate a density value at each observation (Sections 14.1.2 and 14.2.1). Those algorithms are, however, computationally intensive, and the density-based method introduced now is simpler: it carries out a quick, initial search for densities, hence the name Density-Based Scan (DBSCAN).

The DBSCAN algorithm searches for regions that show higher density in comparison to their neighborhood by applying a user-specified threshold: no density value is calculated, but merely a relational decision is made, for instance whether there are sufficient neighbors within the vicinity (Fig. 34). Thus, for that type of clustering algorithm one specifies two parameters: the radius (size; range) of the neighborhood under investigation, called distance ε here, and the minimum number of points q that need to be present in that neighborhood.

Procedure The algorithm randomly selects a point, then calculates the distances to all other points and observes how many neighbors there are within distance ε, see the two examples in Fig. 34b. If there are fewer than q neighbors, then the point is considered 'noise' and the algorithm continues by selecting another point randomly. If there are ≥ q neighbors, then the point is considered 'core' and the algorithm now proceeds by analyzing only those neighboring points. By identifying a core point it is assumed that one has found a cluster and that by searching along its neighboring core points one can recover that cluster. This search along neighboring core points has the advantage that it does not impose any restrictions on the cluster shape: the shape can be arbitrary, and that aspect distinguishes the algorithm from the traditional partitioning and hierarchical algorithms. After a cluster has been 'collected' by such a search for neighboring core points, the algorithm returns to the random selection of individual points until it finds the next core point of a different cluster.

Figure 34: Some principles of the DBSCAN algorithm.
a. A dataset: a cluster of several points placed amidst some random points. The algorithm requires two parameters: a distance ε, which can be considered the radius of a circular neighborhood; and a minimum number of points q that are required to lie within that neighborhood.
b. Two example neighborhoods: neighborhood no. 1 - a gray-dashed circle with diameter equal to 2ε - contains no neighbors, and that selected point is considered a 'noise' point; the neighborhood labeled 2 contains several neighbors, and if their cardinality is greater than or equal to q, then the point is considered a 'core' point.
c. Two clusters of different density: the points on the right make up a second cluster of lower density. If one however applied the same ε as in b, then all those points would be considered noise points and only the cluster on the left would be identified.

Complexity The complexity is O(N^2) in principle, but the algorithm does not require storing the distance matrix. Thus it is slightly less demanding than the hierarchical methods, because only the temporal complexity is O(N^2), not the spatial one. However, for low-dimensional data the temporal complexity can be O(N log N) by exploiting tree-type data structures, e.g. the R-tree; thus for low-dimensional data the method is suitable for large datasets.

Weaknesses The DBSCAN algorithm has the downside that it can only recover clusters of approximately similar densities. If clusters show very different densities, as illustrated in Fig. 34c, it can be difficult to detect them with one set of parameter values.


As one may have anticipated, the challenge for this algorithm is to find appropriate values for the parameters ε and q. Different parameter values may lead to totally different results: one needs to search a two-dimensional space for the optimal values. One should select the values such that the algorithm detects the least dense clusters, that is, one needs to try out a range of values.

Usage in Matlab Appendix G.23 gives an example of the DBSCAN algorithm.
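For Python users, a comparable sketch with scikit-learn's DBSCAN implementation might look as follows (the two-moons toy data merely stands in for arbitrarily shaped clusters; eps corresponds to ε and min_samples to q):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)   # label -1 marks noise points
print(np.unique(labels))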

19.2.1 Variants

There exist variants of the DBSCAN algorithm that try to address its weaknesses:

- DBCLASD (Distribution-Based Clustering of LArge Spatial Databases) [ThKo p818, 15.9.2]: this version is supposed to be able to deal with clusters of varying density and it requires no parameter definition: it should therefore be able to deal with the two-cluster situation of Fig. 34c. Its runtime is however twice as long as that of the DBSCAN algorithm, but 60 times faster than the CLARANS algorithm (the K-Medoids variant in Section 19.1.2).

- DENCLUE (DENsity-based CLUstEring) [HKP p476, 10.4.3; ThKo p819, 15.9.3]: this version is less affected by data dimensionality, meaning it has also been used for high-dimensional data such as in Bioinformatics.

- OPTICS (Ordering Points To Identify the Clustering Structure) [HKP p473, 10.4.2]: this algorithm does not cluster the points but merely orders them. It has the same complexity as DBSCAN.

19.2.2 Recapitulation

We summarize the advantages and disadvantages of density-based methods in general:

Advantages
- The method has the ability to recover arbitrarily shaped clusters.
- It is able to handle outliers efficiently.
- If low-dimensional data is employed, then the complexity can be only O(N log N).
- It is particularly suitable for spatial data.

Disadvantages
- DBSCAN has two parameters, meaning we search a two-dimensional space for the optimal setting.
- DBSCAN can only detect clusters of similar density. The variant called DBCLASD is however capable of recovering clusters of varying densities.
- The complexity for multi-dimensional data is O(N^2). There exists however a variant that supposedly has only O(N) (search for HIERDENC).
- Clusters are not as easily interpretable as in other methods.
- The method is not well suited for high-dimensional data as it involves distance measurements. However, the variant called DENCLUE can deal with high-dimensional data.

19.3 Very Large Data Bases (VLDB)

There exist three principal approaches to achieve clustering of very large databases (VLDB) in reasonable time. Those approaches are not really anything novel from a conceptual viewpoint, but rather represent pragmatic combinations of different techniques:

- Incremental Mining: these are one-pass algorithms that iterate through the data points once. This method can also be called on-line clustering: it handles one data point at a time and then discards it. The algorithm DIGNET is an example. It uses the K-Means cluster representation without iterative optimization: centroids are instead pushed or pulled depending on whether they lose or win each next incoming point. The method strongly depends on the data ordering and can result in poor-quality clusters. However, it handles outliers, clusters can be dynamically created or discarded, and the training process is resumable. This makes it very appealing for dynamic VLDB. (If I am not mistaken, this method corresponds to the sequential algorithms in [ThKo p633, 12.2].)

- Data Squashing: this method first generates statistical summaries of the data, thus reducing the data - either its sample size or its dimensionality. Following this reduction step, one uses a conventional clustering algorithm. A famous example is the BIRCH algorithm (Section 19.3.2), which starts by creating 'local' clusters first. Another example are the so-called grid-based methods, which we will treat under the algorithms for high-dimensional data (Section 19.4).

- Reliable Sampling: a subset of the samples is selected before a conventional clustering algorithm is used. The selection of a representative subset is of course the challenging issue. The algorithm CLARA is an example of this category (Section 19.1.2), as is the algorithm CURE (see below).

In the following we discuss some variants of the hierarchical methods (Section 19.3.1) and then discuss the BIRCH algorithm in more detail (Section 19.3.2).

19.3.1 Hierarchical [ThKo p682, 13.5]

Hierarchical clustering is by its nature not really suitable for very large datasets, in particular due to the pairwise measurements between observations (Section 10.1). If one insists on using hierarchical clustering, then there exist a number of hierarchical algorithms that are geared toward dealing with large datasets. We merely summarize those:

Name | Comments
CURE (sampling) [LRU p262, 7.4] | Clustering Using REpresentatives: a cluster is represented by at least 2 points.
  - suitable for numerical features (particularly low-dimensional spatial data)
  + low sensitivity to outliers due to shrinking
  + can reveal clusters with non-spherical or elongated shapes, as well as clusters of wide variance in size
  - efficient implementation of the algorithm is possible using the heap and the k-d tree data structures
ROCK | RObust Clustering using linKs: uses links for merging clusters in place of distances
  - suitable for nominal (categorical) features
CHAMELEON [HKP p466, 10.3.4; ThKo p686] | capable of recovering clusters of various sizes, shapes and densities in 2D data
BIRCH (squashing) | local clustering using hierarchical linking, followed by conventional clustering (Section 19.3.2)

Those algorithms are typically implemented in a lower-level language (C or one of its derivatives), where there exists more flexibility than in high-level languages such as Matlab. Those implementations may place part of the interim results on the hard drive to cope with memory problems.

19.3.2 BIRCH [wiki: BIRCH; HKP p462, 10.3.3]

The BIRCH algorithm (Balanced Iterative Reducing and Clustering using Hierarchies) combines a variety of techniques and consists of two principal phases. The first one carries out a local clustering using hierarchical linking, and the resulting clusters are summarized as abstractions. The second phase then applies any type of conventional clustering method to the abstracted clusters.

Phase I: A so-called clustering-feature tree is generated. A clustering feature (CF) is an object that summarizes a local group of points. For instance, for nloc closely spaced observations xi, their centroid and their 'within-cluster' variance is determined. In order to build the CF-tree efficiently, not the actual centroid and variance are stored, but instead the linear sum LS = Σi=1..nloc xi and the squared sum SS = Σi=1..nloc xi^2, respectively. In addition, the parameter nloc is stored as well. The CF is thus an object consisting of one scalar and two vectors of length ndim:

CF = {nloc, LS, SS}   (15)

By storing only the two vectors LS and SS one saves memory, namely nloc − 2 vectors are omitted from storing. To merge two clusters, one simply adds the two parameter values nloc and the respective components of the vectors LS and SS:

CF1 + CF2 = ⟨nloc1 + nloc2, LS1 + LS2, SS1 + SS2⟩   (16)

Example: Suppose we have a two-dimensional clustering challenge and the values of two CFs are as follows: CF1 = {3, (9, 10), (29, 38)} and CF2 = {3, (35, 36), (417, 440)}. If we wish to join the two clusters, then we form the new CF as follows: CFs = {3+3, (9+35, 10+36), (29+417, 38+440)} = {6, (44, 46), (446, 478)}.

The generation of a CF-tree is controlled by two parameters, which implicitly control the height of the tree:
- Branching factor B: specifies the maximum number of children per non-leaf node.
- Threshold T: specifies the maximum diameter of the sub-clusters stored at the leaf nodes of the tree. The diameter can be computed from LS and SS as sqrt( (2n·SS − 2·LS^2) / (n(n−1)) ).

The CF-tree is built dynamically as objects are inserted. Thus, the method is also an example of incremental mining and not only of data squashing (see the introduction of this Section 19.3 again). An object is inserted into the closest leaf entry (sub-cluster) and the diameter is recomputed: if the new diameter exceeds the threshold T, then the leaf node and possibly other nodes are split. After the insertion of the new object, information about the object is passed toward the root of the tree. If the CF-tree grows beyond the RAM's size, then the parameter value T is increased and the CF-tree is rebuilt.
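A compact numpy sketch of this bookkeeping (the helper functions merge_cf and diameter are hypothetical, not from any library; LS^2 is interpreted as the squared Euclidean norm of LS, and the numbers reuse the example above):

import numpy as np

def merge_cf(cf1, cf2):
    # merge two clustering features (Equation 16)
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2, ls1 + ls2, ss1 + ss2)

def diameter(cf):
    # diameter of a CF's sub-cluster, computed from nloc, LS and SS
    n, ls, ss = cf
    return np.sqrt((2 * n * np.sum(ss) - 2 * np.sum(ls ** 2)) / (n * (n - 1)))

cf1 = (3, np.array([9.0, 10.0]), np.array([29.0, 38.0]))
cf2 = (3, np.array([35.0, 36.0]), np.array([417.0, 440.0]))
print(merge_cf(cf1, cf2))               # -> (6, [44, 46], [446, 478]), as in the text
print(diameter(merge_cf(cf1, cf2)))     # compare against the threshold T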

Phase II: A conventional clustering algorithm is applied to the leaf nodes of the CF-tree, which removes sparse clusters as outliers and groups dense clusters into larger ones.

Disadvantages:
- favors spherical cluster shapes due to the use of the diameter parameter
- the CF-tree can simply be an inappropriate summary for the final clustering result
- there are two parameters (B and T) to tune

Implementation: I am aware of the following implementation in R: http://www.inside-r.org/packages/cran/birch/docs/birch

19.4 High-Dimensional Data [ThKo p821, 15.10; HKP p508, 11.2]

The larger the dimensionality, the less significant distance measurements between samples become - they probably become useless. This is sometimes loosely called the curse of dimensionality. The exact dimensionality from which data can be considered high-dimensional ranges between 11 and 20, depending on the number of samples and on the expert's viewpoint. For dimensionality larger than those limits, the Euclidean space has grown so large that the distance between any two points becomes almost the same and thus practically indistinguishable. High-dimensional data are therefore better reduced in dimensionality if one intends to use metric distance measurements. One way to reduce the dimensionality is of course by use of the methods introduced in Section 7 already, reviewed in the following subsection (Section 19.4.1).

An alternative approach to clustering high-dimensional data is to directly look for the subspaces where the clusters reside. This idea appears not to make use of the full dimensionality and could imply a loss of information. But because for higher dimensionality the space is so 'vast', samples naturally occur only in subspaces. To illustrate this reasonable assumption, think of only a 10-dimensional space: if you assumed that there existed at least one point in each bin of a 10-bin histogram for the full dimensionality - namely a 10-dimensional 10-bin histogram -, then there would have to exist at least 10^10 samples. This is an unlikely scenario (so far), and that is why the search for subspaces is a meaningful approach (Section 19.4.2).

Figure 35: Cluster situations and their suitability for some algorithms.
a. Clusters C1 and C2 occur within two intervals of the y-axis, whereas the values along the x-axis are roughly equally distributed. In this case, both feature selection and subspace clustering would make sense.
b. Different clusters reside in different subspaces of the original feature space. This situation arises in particular for high-dimensional data and it is therefore a clear case for subspace clustering, whereas feature selection is far less suitable.

19.4.1 Dimensionality Reduction

The advantages and disadvantages of the dimensionality reduction techniques are analogous to the ones encountered for classification (Section 7):

Feature generation: the most common techniques are the PCA (principal component analysis) as well as the SVD (singular value decomposition). Those techniques are useful when a significant number of features contributes to the formation of clusters. The danger is that such generation may distort the clusters present in the original space, or that certain important dimensions are omitted.

Feature selection: this technique is useful when all clusters lie in the same subspace of the feature space. Figure 35a shows an example where dimension x can be eliminated without any loss: the clusters can be identified by histogramming along the y-dimension only.

19.4.2 Subspace Clustering [HKP p510, 11.2.2]

These are algorithms that search for clusters in any of the subspaces of the entire feature space. Because this is a combinatorial problem, the challenge is to develop algorithms that find the most relevant subspaces in reasonable time. There exist two principal techniques toward that goal. One focuses on generating histograms, which are called grids in this context, sketched in the paragraph 'grid-based' next. The other focuses on points, see the paragraph 'point-based' below.

Grid-based [HKP p479, 10.5]: this approach corresponds essentially to finding clusters in multi-dimensional histograms (Section 14.1.1). Because it is unfeasible to generate the multi-dimensional histogram for the entire space, the strategy is to work from low to high dimensionality. One starts by generating a one-dimensional histogram for each individual feature and by identifying 'densities' in those histograms. Those one-dimensional densities are then used to detect unions of densities in two dimensions, which in turn are used to detect densities in higher dimensionality. Figure 35b shows an example where this approach would be successful. We mention two implementations next, and switch now to the preferred terms grid, unit and edge size (instead of the terms histogram, bin and bin size, respectively).


CLIQUE (CLustering In QUEst) [ThKo p825]: a user specifies two parameters, an edge size ξ and a density threshold τ; density is determined as the unit count divided by the total number of samples. Firstly, a grid with edge size ξ is applied to each individual feature and those units with a density larger than τ are stored. For these dense units, one then determines their unions in multiple dimensions and finds those unions that lie adjacent. This leads to clusters that typically live in subspaces of much lower dimensionality than the full space.

Advantages
- insensitive to the order of the data
- does not impose any distribution or shape on the data
- scales linearly with sample size

Disadvantages
- scales exponentially with dimensionality
- parameters are not obvious to select
- accuracy of cluster shapes can be coarse, due to the use of grids only
- large overlap of clusters due to the use of unions
- risk of losing small but meaningful clusters after the pruning of subspaces based on their coverage

MAFIA (Merging of Adaptive Finite IntervAls): is a variant of CLIQUE where the edge size ξ is variable. It performs somewhat better for increasing sample size than the other grid-based algorithms.
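To make the grid idea concrete, here is a minimal numpy sketch of the first step of such a grid-based search, finding dense one-dimensional units (toy data; xi and tau stand for the edge size ξ and the density threshold τ named above, and the values are arbitrary):

import numpy as np

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(200, 3) * 0.1 + 1.0,   # a dense blob
               rng.rand(100, 3) * 3.0])         # plus scattered background points
xi, tau = 0.5, 0.05                             # edge size and density threshold
for j in range(X.shape[1]):                     # one grid per feature
    edges = np.arange(X[:, j].min(), X[:, j].max() + xi, xi)
    counts, _ = np.histogram(X[:, j], bins=edges)
    dense = np.where(counts / len(X) > tau)[0]
    print('feature', j, 'dense units:', dense)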

Point-based: in this approach one first selects potentially representative points and then starts to grow clusters. The resulting clusters are outlined more accurately than those obtained with grid-based methods.

PROCLUS [ThKo p832]: this algorithm borrows concepts from the K-Medoids algorithm (Section 19.1.2). The user specifies two parameters: a number of m clusters as well as an average dimensionality s.

ORCLUS: this algorithm is agglomerative in nature, as introduced for the hierarchical algorithms (Section 10). The user again specifies a number of m clusters and a maximal dimensionality s.


20 Clustering: Rounding the Picture

We round off the subject of clustering by first providing a summary of the most common clustering procedures (Section 20.1). Typically one assumes that the data do possess some type of clusters. But if we are uncertain whether the data actually contain clusters at all, then one should attempt to verify that: Section 20.2 discusses that problem. Section 20.3 provides a check list to avoid some common mistakes.

20.1 Summary of Algorithms

Partitioning            | Complexity   | Robustness | Shape        | #Prm     | Type    | Order
K-Means                 | O(Nkt)       |            | hyper-ellip. | 1        |         | dep
K-Medoids               | < O(N^2)     | outliers   |              | 1        | nominal | dep
CLARA                   | rand. sel.   |            |              | 1        |         |
Fuzzy C-Means           | O(Nkt)       | noise      |              | 1        |         | dep

Hierarchical
Single Linkage          | O(N^2)       |            | elongated    | 1        |         |
Complete Linkage        | O(N^2 log N) |            | compact      | 1        |         |

Density
DBSCAN                  | ≤ O(N^2)     | out./noise | arbitrary    | 2 (ε, q) |         |
OPTICS                  | O(N log N)   | out./noise | arbitrary    |          |         |

Very Large DB
BIRCH (hier.)           | O(N)         | outliers   | compact      | 2 (B, T) |         |

High-Dimensional
CLIQUE (grid-based)     | O(N)         |            | arbitrary    | 2 (ξ, τ) |         |
PROCLUS (point-based)   | O(N)         |            |              | 2 (m, s) |         |

Table 4: Summary of popular clustering algorithms.
Complexity: N = number of observations (data points); rand. sel. = random subselection.
Robustness: out. = outliers.
Shape: cluster shapes.
#Prm: number of parameters to tune.
Type: preferred data type.
Order: dep = result depends on the order of the observations.

To further understand the differences between the principal clustering techniques (partitioning, hierarchical and density) we look at the three example datasets in Figure 36. In a we have a ring of points that is approximately two points wide, sitting amongst some outliers (or noise). Such a cluster is best detected with a density-based method, as this method allows detecting arbitrary shapes; or with a single-linkage hierarchical method, in which case it would zig-zag through the cluster but form a ring nevertheless.

In b there is an S-shaped cluster placed in a random set of points. As the cluster forms a sequence of points, the single-linkage method appears to be the most appropriate choice - though the density-based method could also detect the cluster.

In c there are several clusters sitting in noise. A K-Means algorithm would probably perform well, but it would include the noise points (or outlier points) in the computation of the cluster centers, though one could of course also try K-Medoids or Fuzzy C-Means. Similarly, a complete-linkage hierarchical method would also find the cluster centers relatively well. A density-based method would probably find the cluster centers more accurately, as it excludes the noise points - but only if an appropriate density threshold is specified. The risk with the density-based method is that if the threshold is not specified precisely, it would fuse certain clusters; for instance, in the lower left of the image center there are two clusters that appear linked by a sequence of three to four points.


Figure 36: Three artificial two-dimensional data sets that should illustrate some of the abilities of the principal clustering methods. a. A ring of points: density-based appears most appropriate. b. A sequence of points: single-linkage appears optimal. c. A set of clusters: K-Means or density-based. (See text for more explanations.)

20.2 Clustering Tendency [ThKo p896, 16.6; HKP p484, 10.6.1]

What comes last now should in principle be the first step in a cluster analysis, namely a test that verifies whether the data contain true clusters at all. A clustering algorithm will always find some structure in the data, even if the data is a set of random points. Thus, one should make an attempt to find out whether the data tend to cluster at all, before one draws strong conclusions about the data. This early test is called clustering tendency and is typically carried out with statistical tests. And one should not only test for randomness, but also for regularity, hence there are three hypotheses to verify:

- Randomness hypothesis: the data are randomly distributed. This is typically the null hypothesis H0.
- Regularity hypothesis: the points are regularly spaced - they are not too close to each other.
- Clustering hypothesis: the data form clusters.

For two dimensions there are some tests, but for more dimensions there does not exist a generally convincing test: each one has its advantages and disadvantages. For low dimensionality one may simply display the data visually - which we recommend doing anyway; for higher dimensionality there exists the problem that we do not know the so-called sampling window exactly. Roughly speaking, the sampling window is the range of the data, but different exact definitions exist. The problem is: if the window is chosen too large, then the data itself is interpreted as a single cluster, and that would favor the clustering hypothesis.

Intuitively, one would like to analyze the distances between points: for example, one analyzes the distances between the data points themselves and observes their MST (minimal spanning tree); or one analyzes the distances between a set of randomly generated points and the set of data points; Section 20.2.1 gives an example of the latter. But, as mentioned before, there is no generally accepted test that can confirm any hypothesis with high certainty.

In conclusion, one is left with the same type of advice that exists for any statistical test: one needs to interpret the clustering results carefully and not rush to generalizations. More specifically, the presented clustering techniques are only tools to arrive at a careful interpretation of the data.

20.2.1 Test for Spatial Randomness - Hopkins Test [ThKo p901]

In this test for spatial randomness we measure nearest-neighbor (NN) distances, once for a subset of samples of the entire dataset, S ⊂ D, and once for a generated set of random points, R. For each sample of S we measure the NN distance di to the other samples in S, raise it to the power of the dimensionality l, (di)^l, and sum those values: d̂own = Σi=1..nSamples (di)^l. Analogously, for each sample of R we measure the NN distance di to the samples in S: d̂rs = Σi=1..nSamples (di)^l. Then the following measure is formed:

h = d̂rs / (d̂rs + d̂own)   (17)


If the pattern is a set of random points, then d̂own and d̂rs will be of about the same size and h will have a value of around 0.5. If the pattern is a set of regularly spaced points, then d̂own will be larger and the h-value will be smaller than 0.5. If the pattern contains clusters, then d̂own will be smaller, resulting in an h-value larger than 0.5.

This test is only reliable if the set of random points R lies exactly within the range of values of S. Otherwise, the h-values cannot be compared meaningfully. Appendix G.24 shows an example of how to apply this test on artificial data.
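For Python users, a minimal numpy sketch of the Hopkins statistic (Equation 17) could look as follows; the random points R are drawn uniformly within the bounding box of the data, as required above, and the helper function hopkins is hypothetical, not part of any library:

import numpy as np
from scipy.spatial.distance import cdist

def hopkins(D, n_samples=50, seed=0):
    rng = np.random.RandomState(seed)
    n, l = D.shape
    S = D[rng.choice(n, n_samples, replace=False)]            # subset of the data
    R = rng.uniform(D.min(0), D.max(0), size=(n_samples, l))  # random points in range
    d_own = np.sort(cdist(S, S), axis=1)[:, 1] ** l           # NN among the samples (skip self)
    d_rs = cdist(R, S).min(axis=1) ** l                       # NN from random points to the samples
    return d_rs.sum() / (d_rs.sum() + d_own.sum())

rng = np.random.RandomState(1)
X = rng.rand(500, 2)                                          # random data -> h near 0.5
C = np.vstack([rng.randn(250, 2) * 0.05,
               rng.randn(250, 2) * 0.05 + 1.0])               # clustered data -> h > 0.5
print(hopkins(X), hopkins(C))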

20.3 Check List

When starting a cluster analysis, the first two points to consider are the data size and the dimensionality:
- Data size: if the data fits into a computer's RAM and is not high-dimensional, then we can proceed with the check list given below; otherwise we need to approach the data using only algorithms suitable for large datasets, see Section 19.3.
- Dimensionality: if the data is high-dimensional - from 11 dimensions on - then one should consider the techniques introduced in Section 19.4.

Otherwise, one may proceed with the check list as given here:

Preparation        | Command          | Comment
1. datatype        | single(DAT)      | Turn data into 'single' to save on RAM memory; 'double' if precision is required
2. NaN/Inf         | isnan, isinf     | Avoid Inf; how many columns with NaN are there?
3. feature type    |                  | numeric (real) or nominal (categorical)? or both?
4. permutation     | randperm         | Permute the training set, in particular when the algorithm depends on order
5. tendency?       |                  | Are the data NOT random? (Section 20.2)

Clustering         | Command          | Comment
(no) normalization |                  | Try without first. Chances are reasonable it does not help.
K-Means            | kmeans           | Always try it - with different ks. It is quick and fairly powerful. For large datasets use:
                                        Opt = statset('MaxIter',20,'Display','iter');
                                        Lb = kmeans(DAT,5, 'replicates',1, 'onlinephase','off', 'options',Opt);
Hierarchical       | pdist, linkage, cluster | Watch your memory - it will take pairwise distances (O(N^2))

Table 5: Check list for clustering (for data that fits into a computer's RAM and is not high-dimensional).


A Distance Matrix, Nearest Neighbor Search

Intuitively, one would like to carry out the data analysis as thoroughly as possible and to measure all inter-point distances between the sample points, as is done for the hierarchical clustering algorithm for instance. The problem is that distance measurements are time-consuming and that calculating all of them is unfeasible for large data sets. If one needs them nevertheless, we explain how to calculate them using the software's function scripts (Section A.1).

Often it is sufficient to know only which points are the neighbors of a point, and its distal points can be ignored, such as in a kNN classifier or in some specific clustering algorithms (spectral clustering). In that case, and under certain conditions, there exist algorithms that allow determining the nearest neighbors in relatively short time even for large data sets, a process which is called Nearest Neighbor Search (Section A.2).

A.1 Distance Matrix

The distance matrix is a two-dimensional array whose entries hold all pair-wise distance values. There are two situations in which it is calculated: either for one list of vectors, namely by relating the sample vectors with themselves, in which case the result is a square matrix; or for two lists of vectors, to relate samples across lists.

One List To calculate pair-wise distances for one list, Matlab offers the functions pdist and squareform; they exist in Python under the same names. One can think of the list of pairwise distance values as filling one half of the distance matrix: Fig. 37 shows how the lower half is filled. To obtain a full matrix of distances, one simply copies the values into the corresponding positions of the other half.

Figure 37: The distance matrix. The values for the pairwise distances of a list of n vectors (data matrix D for instance) can be thought of as filling the lower half of a so-called n×n distance matrix (or the upper half of the matrix - not shown). The diagonal holds zero values if distances are used - or maximal values if similarities are used.

Another aspect of calculating the distance matrix is memory. For only 50'000 samples, those values already require ca. 4.7 Gigabytes of memory for holding all distance values in data-type single (50'000 · 49'999 / 2 ≈ 1.25 · 10^9 values, at 4 bytes each). For the full matrix, this doubles the memory requirement to 9.4 Gigabytes. Thus, watch your memory when trying to manipulate large lists. If you deal with large matrices, it is useful to calculate the memory required for the output (Appendix G.4), otherwise your PC may exhaust its computational resources and you need to restart your machine.

Two Lists To calculate the pair-wise distances for two lists we use pdist2. This function is convenient if we want to calculate the distances of observations to centroids, as is done in the nearest-centroid classifier. The following code showcases how to use those functions, and it also shows how to program the distance matrix yourself.


%% SSSSSSSSSSSSSSSSSSSSSSSSSS ONE SET SSSSSSSSSSSSSSSSSSSSSSSSSSSSS

DAT = rand(100,3); % artificial data set

%% ========= Matlab functions ============

DiPw = pdist(DAT); % pair-wise distances

DM = squareform(DiPw); % distance matrix [nSmp nSmp]

[DMO ORD] = sort(DM,2,'ascend'); % order distance matrix

%% --------- Plotting -------

figure(1); subplot(1,2,1); imagesc(DM);

subplot(1,2,2); imagesc(DMO);

%% SSSSSSSSSSSSSSSSSSSSSSSSSS TWO SETS SSSSSSSSSSSSSSSSSSSSSSSSSSSS

D1 = single(rand(10,3));

D2 = single(rand(5,3));

%% ========= Matlab functions ============

DM = pdist2(D1,D2); % [nSmp1 nSmp2]

[DMO2 ORD2] = sort(DM,2,'ascend'); % ordered along second for first

%% ========= Explicit ==========

[n1 nD1] = size(D1);

[n2 nD2] = size(D2);

DMO=zeros(n1,n2,'single'); ORD=zeros(n1,n2,'uint64');

for i = 1:n1,

Dis = sqrt( nansum( (repmat(D1(i,:),n2,1)-D2).^2, 2) );

[DisO O] = sort(Dis,'ascend');

DMO(i,:) = DisO;

ORD(i,:) = O;

end

%% --------- Verification --------

assert(isequal(ORD2,ORD));

In Python this can be programmed very similarly. The crucial functions are in the module scipy, for instance, but may also exist in the sklearn module.

from scipy.spatial.distance import pdist, squareform, cdist

from numpy import random, argsort, shape, array_equal, zeros, float32, uint64

from numpy import sort, sqrt, nansum, tile

from matplotlib.pyplot import figure, imshow, subplot

#%% SSSSSSSSSSSSSSSSSSSSSSSSSS ONE SET SSSSSSSSSSSSSSSSSSSSSSSSSSSSS

DAT = random.random((100,3)) # artificial data set

#%% ========= Python functions ============

DiPw = pdist(DAT) # pair-wise distances

DM = squareform(DiPw) # distance matrix [nSmp nSmp]

ORD = argsort(DM, axis=1) # order distance matrix

DMO = sort(DM,axis=1) # assigned to itself

#%% --------- Plotting -------

figure(figsize=(8,8))

subplot(1,2,1); imshow(DM)

subplot(1,2,2); imshow(DMO)

#%% SSSSSSSSSSSSSSSSSSSSSSSSSS TWO SETS SSSSSSSSSSSSSSSSSSSSSSSSSSSS

D1 = random.random((10,3))


D2 = random.random((5,3))

#%% ========= Python functions ============

DM = cdist(D1,D2) # [nSmp1 nSmp2]

ORD2 = argsort(DM,axis=1) # ordered along second for first

#%% ========= Explicit ==========

n1,nD1 = shape(D1)

n2,nD2 = shape(D2)

DMO=zeros((n1,n2),dtype=float32); ORD=zeros((n1,n2),dtype=uint64)

for i in range(0,n1):

Dis = sqrt( nansum( (tile(D1[i,:],(n2,1))-D2)**2, axis=1) )

DMO[i,:] = sort(Dis)

ORD[i,:] = argsort(Dis)

#%% --------- Verification --------

array_equal(ORD2,ORD)

A.2 Nearest Neighbor Search (NNS) [wiki: Nearest neighbor search]

Nearest Neighbor Search is the process of efficiently determining which points lie near to each other. If we search for neighbors by measuring the distance to all other points - as we did above for the distance matrix - then the search is called exhaustive or naive, because a comparison to all samples is made; or it is called linear, because the cost (complexity) grows linearly with n, the number of samples. Under certain conditions this search can be carried out much faster by use of techniques that resemble the processes of histogramming and clustering, as they try to locate 'densities' of points. Some of those techniques are precise and work only under very specific conditions; other techniques have less strict conditions, yet do not deliver the exact nearest neighbors, but rather approximate ones.

Space Partitioning: The k-d Tree One type of NNS optimization are the space-partitioning techniques, of which the k-d tree - short form for k-dimensional tree [wiki: K-d tree] - is a very popular one. As the name implies, the technique organizes the data in a tree structure by dividing the space into rectangular subspaces, similar to Decision Trees (Section 11). When a query point is then presented, one merely traverses the tree.

More explicitly, the k-d tree iteratively bisects the search space into two regions containing half of the points of the parent region. Queries are performed via traversal of the tree from the root to a leaf by evaluating the query point at each split. Depending on the distance specified in the query, neighboring branches that might contain hits may also need to be evaluated. For constant dimension, the average query-time complexity is O(log N) in the case of randomly distributed points; worst-case complexity analyses have also been performed.

The k-d tree technique works best if the data are low-dimensional and dense, which is more likely to be the case when the data set is large. For more than 10 dimensions (variables), the exhaustive search probably performs better.

Matlab provides an implementation of the k-d tree with the function knnsearch if you apply your testing data once. If you will apply your search repeatedly, then you prepare it with createns and then search with knnsearch by applying the model structure, Mns as we call it in the code fragment below. If you do not specify any particular options, then Matlab decides when to switch between exhaustive and optimized (tree) search. But you can also enforce a type of search by providing parameters, or you can use the functions ExhaustiveSearcher and KDTreeSearcher.

nDim = 3;

DAT = single(rand(10000,nDim));

TST = single(rand(500,nDim));


%% ========= Single Application =========

ORD = knnsearch(DAT,TST, 'k',2); % decides automatically

%% ========= Multiple Applications =========

Mns = createns(DAT); % create ’model’

ORDa = knnsearch(Mns,TST, 'k',2); % apply model

ORDb = knnsearch(Mns,TST*2, 'k',2); % apply again

%% ========= Exhaustive Search Explicit ==========

Mex = ExhaustiveSearcher(DAT); % prepare an exhaustive searcher

ORD2 = knnsearch(Mex,TST, 'k',2); % search

%% ========= k-d Tree Explicit ==========

Mkd = KDTreeSearcher(DAT); % prepare a kd-tree searcher

ORD3 = knnsearch(Mkd,TST, 'k',2); % search

%% --------- Verification --------

assert(isequal(ORD2,ORD(:,1:2)));

assert(isequal(ORD2,ORD3));

Python offers a larger variety of search techniques, and we present here only the k-d tree search. If one has managed to follow this far, then the SciKit-Learn documentation would be the next step [SKL p202].

from sklearn.neighbors import NearestNeighbors, KDTree

from numpy import random, array_equal

nDim = 3

DAT = random.random((10000,nDim))

TST = random.random((500,nDim))

#%% ========= Automatic Application =========

Md = NearestNeighbors(n_neighbors=2).fit(DAT)

DIS, ORD = Md.kneighbors(TST)

#%% ========= k-d Tree Explicit ==========

Mkdt = KDTree(DAT, leaf_size=30, metric='euclidean')

ORD2 = Mkdt.query(TST, k=2, return_distance=False)

#%% --------- Verification --------

array_equal(ORD,ORD2)


B Distance and Similarity Measures [ThKo p602, 11.2; HKP p65, 2.4]

The 'spacing' or 'separation' between two points can be expressed as a distance or as a similarity. The notion of distance is perhaps more intuitive at first, because we were taught the Euclidean distance in school. Similarity can roughly be explained as the inverse of a distance measure and will become equally intuitive after we have dealt with certain algorithms.

How exactly Matlab implements the following measures is detailed on the help page 'Classification Using Nearest Neighbors'; or see the function pdist.

B.1 Distance Measures [wiki: Distance; LRU p92, 3.5]

• Minkowski The most used distance measure, namely the Euclidean distance, is merely one of several useful distance measures. The Euclidean distance, as well as some other measures, can be expressed by a single formula, namely the Minkowski metric, which is also referred to as the Lk norm:

Lk(a, b) =( d∑i=1

|ai − bi|k)1/k

(18)

a and b are two vectors of dimensionality d. For the following values of k the distance is also known as: DHS p187

k    Norm      Name(s)                                       Matlab
1    L1 norm   Manhattan / city-block / taxi-cab distance    mandist
2    L2 norm   Euclidean distance                            dist
∞    L∞ norm   Chebyshev distance                            in pdist

The Manhattan distance has the benefit that it is faster to compute than the other metrics, as it measures only the sum of absolute differences - the power and root operations fall away. The Euclidean metric is relatively costly, because it squares and takes the square root; for that reason the root is sometimes omitted - the result is then called the squared Euclidean distance - if the actual (Euclidean) distance value is not needed.

In algorithms you often need to take the distance between one observation DAT(i,:) and all other observations in DAT. Here is how you would calculate those distances:

Dis = sum(abs(bsxfun(@minus,DAT,DAT(i,:))), 2); % city-block

Dis = sqrt(sum(bsxfun(@minus,DAT,DAT(i,:)).^2, 2)); % Euclidean

Dis = sum(bsxfun(@minus,DAT,DAT(i,:)).^2, 2); % squared Euclidean
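
The same distances, as well as the general Minkowski metric for an arbitrary k, can also be obtained with Matlab's pdist2 function; a brief sketch, reusing DAT and the index i from above:

Dis1 = pdist2(DAT, DAT(i,:), 'cityblock');      % Manhattan (L1)
Dis2 = pdist2(DAT, DAT(i,:), 'euclidean');      % Euclidean (L2)
Dis3 = pdist2(DAT, DAT(i,:), 'minkowski', 3);   % Minkowski metric with k=3
Dis4 = pdist2(DAT, DAT(i,:), 'chebychev');      % Chebyshev (L-infinity)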

• Mahalanobis This is another popular distance measure. It uses a covariance matrix S to arrive at a distance value; see also Section 6.1 for the covariance matrix and Appendix E for notation:

D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}    (19)

where x is a sample and µ is a mean vector; µ often represents the average over the samples for a class. The Mahalanobis measure is for example used in the Naive Bayes classifier (Section 17). In the special case where the covariance matrix is the identity matrix (only 1s along the diagonal, 0 elsewhere), the Mahalanobis distance reduces to the Euclidean distance (L2 norm above). In Matlab: mahal.
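
As a minimal sketch (assuming a data matrix DAT with more samples than variables, whose mean and covariance play the role of µ and S), the Mahalanobis distance can be computed explicitly and compared with mahal:

Mu  = mean(DAT,1);                       % mean over the samples (e.g. of one class)
S   = cov(DAT);                          % covariance matrix
Dif = bsxfun(@minus, DAT, Mu);           % differences to the mean
DM  = sqrt(sum((Dif / S) .* Dif, 2));    % explicit Mahalanobis distance of each sample
DM2 = sqrt(mahal(DAT, DAT));             % Matlab's mahal (returns squared distances)
% max(abs(DM-DM2)) should be close to zero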

• Hamming Is a distance that is suitable for discrete-valued data. It is defined as the number of elements where two vectors differ. I found different exact definitions; here is one implementation (normalized by the number of variables):

Dis = sum(bsxfun(@ne,DAT,DAT(i,:)), 2) / nDim; % nDim = no. of variables


Discrete-Valued Vectors Use the L1 distance (cityblock; Manhattan) or the just mentioned Hamming distance.

B.2 Similarity Measures wiki Similarity measure

Similarity measures are particularly used for clustering and for Support Vector Machines. In a similarity measure, the measure has the highest value when two vectors are identical - often defined as a value equal to one; the measure drops the more the two vectors differ from each other, often approaching zero for very distant vectors.

• Dot Product (Linear) exactly as in Equation 24 (Appendix E.1). It is typically applied to normalized data, in which case converting the similarity into a distance value is simply:

Dis = 1 - (DAT * DAT(i,:)'); % assumes data are normalized!

• Cosine Similarity Is the dot product of the two vectors divided by the product of their lengths. When this measure is applied, the data are typically normalized to have unit length for each observation, such that the divisor becomes one, in which case the similarity corresponds to the dot product. The normalization can be done as follows:

Dnorm = sqrt(sum(DAT.^2, 2)); % length of each observation vector

DAT = DAT ./ Dnorm(:,ones(1,nDim));

Taking the similarity values is done as for the dot product above.

• Radial-Basis Function (RBF) Is typically understood as the Gaussian function (Equation 20; Appendix C) but can be any other symmetric function that decreases with increasing difference in input.
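
A small sketch of the Gaussian RBF as a similarity measure, assuming a vector of Euclidean distances Dis as computed in B.1 and an assumed bandwidth parameter sgm:

sgm = 1;                             % assumed bandwidth parameter
Sim = exp(-Dis.^2 / (2*sgm^2));      % equals 1 for identical vectors, approaches 0 for distant ones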

• Pearson's Correlation Coefficient As in statistics: wiki Pearson product-moment correlation coefficient. Values range from -1 to 1. It is used in the Support Vector Machine for instance.

DAT = bsxfun(@minus, DAT, mean(DAT,2)); % difference to mean

Dnorm = sqrt(sum(DAT.^2, 2)); % length of each difference vector

DAT = DAT ./ Dnorm(:,ones(1,nDim)); % E [-1 1]

Taking the similarity values is done as for the dot product above.

• Jaccard/Tanimoto Index wiki Jaccard index. There exist different definitions. Here is an example:

Bnoz = bsxfun(@or,(DAT~=0),(DAT(i,:)~=0)); % pairs with no zeros

Bdff = bsxfun(@ne,DAT,DAT(i,:)); % pairs that are different

Dis = sum(Bdff & Bnoz, 2) ./ sum(Bnoz, 2);

Discrete-Valued Vectors Use the Jaccard/Tanimoto index.


C Gaussian Function wiki Gaussian function

The Gaussian function is a function whose shape looks like a bell. It is often used for approximating probability distributions and for similarity measures. It has two parameter values: its mean µ, which corresponds to the center of the distribution, and its standard deviation σ, which describes its width. The one-dimensional function is

g(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{1}{2} \left[ \frac{x - \mu}{\sigma} \right]^2 \right)    (20)

In Matlab we can call the function normpdf to calculate it; for instance, here we generate it for x ranging from -4 to 4, with center µ = 0 and a standard deviation σ = 1:

normpdf(-4:0.1:4, 0, 1)


Figure 38: Gaussian function for five different values of sigma (1,..,5) placed at x=0. The parameter sigma determines the width of the bell-shaped curve.
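
A small sketch (with an assumed plotting range) that generates such a family of curves with normpdf:

X = -10:0.1:10;
figure(1); clf; hold on;
for sg = 1:5
    plot(X, normpdf(X, 0, sg));      % Gaussian with mean 0 and standard deviation sg
end
legend('\sigma = 1','\sigma = 2','\sigma = 3','\sigma = 4','\sigma = 5');
xlabel('x'); ylabel('g(x)');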

In two (or more) dimensions: The Gaussian function also exists in two or more dimensions, in which case there are more µs and σs, namely a pair for each dimension. In two dimensions with axes x and y we can write

g(x, y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1 - \rho^2}} \exp\left( -\frac{1}{2(1 - \rho^2)} \left( \left[ \frac{x - \mu_x}{\sigma_x} \right]^2 + \left[ \frac{y - \mu_y}{\sigma_y} \right]^2 - \frac{2\rho (x - \mu_x)(y - \mu_y)}{\sigma_x \sigma_y} \right) \right)    (21)

where µx and µy are the coordinates for the location; σx and σy are the two widths of the Gaussian. In addition there is a parameter ρ that expresses the correlation between X and Y. The standard deviations are positive, σx > 0 and σy > 0, but ρ can also be negative.

The equation is typically written more compactly, namely in matrix notation. Toward that one forms a vector for the mean parameters and a matrix for the variance parameters:

\mu = \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix} \quad \text{and} \quad \Sigma = \begin{bmatrix} \sigma_x^2 & \rho \sigma_x \sigma_y \\ \rho \sigma_x \sigma_y & \sigma_y^2 \end{bmatrix}

Σ is the covariance matrix as introduced in Section 6.1. Now we can write the Gaussian as follows DHS p33

g(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu)^t \Sigma^{-1} (x - \mu) \right]    (22)

which is also the formula for the multivariate Gaussian function wiki Multivariate normal distribution, the Gaussian function for two or more dimensions. This formula is still very large and it is therefore often short-noted as

N(\mu, \Sigma).    (23)


whereby N stands for normal, which is another name for the Gaussian function.

There are three noteworthy terms in equation 22:
|Σ| is the so-called determinant of the covariance matrix
Σ^{-1} is the inverse of the covariance matrix
(x − µ)^t Σ^{-1} (x − µ) is the squared Mahalanobis distance, see above

The determinant and the inverse are algebraic operations that we leave to mathematicians. The Mahalanobis distance is obtained by matrix multiplications.
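
To make these terms concrete, here is a minimal sketch (with assumed values for µ, Σ and the sample x) that evaluates Equation 22 explicitly and compares the result with Matlab's mvnpdf (introduced just below):

Mu  = [0 0];                         % assumed mean vector
Sg  = [1 0.3; 0.3 2];                % assumed covariance matrix
x   = [0.5 -1];                      % an arbitrary sample point
d   = length(Mu);
Dif = (x - Mu)';                     % column vector of differences
gx  = 1/((2*pi)^(d/2) * sqrt(det(Sg))) * exp(-0.5 * (Dif' / Sg) * Dif);
gx2 = mvnpdf(x, Mu, Sg);             % Matlab's implementation
% gx and gx2 should agree up to numerical precision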

In Matlab the multivariate Gaussian function is implemented with the command mvnpdf. Here is a two-dimensional example of how to create it. It plots Fig. 39.

Rg = -3 : 0.1 : 3;

[X,Y] = meshgrid(Rg,Rg);

Sg = [1 1];

Z = mvnpdf([X(:) Y(:)],[0 0],Sg);

Z = reshape(Z, length(Rg), length(Rg));

figure(2); clf;

surf(Rg,Rg,Z); hold on;

caxis([min(Z(:))-.5*range(Z(:)),max(Z(:))]);

axis([Rg(1) Rg(end) Rg(1) Rg(end) 0 .4])

xlabel(’x1’); ylabel(’x2’); zlabel(’Probability Density’);


Figure 39: Two-dimensional Gaussian function for both sigmas equal 1, centered at [0,0].


D Programming Hints

Speed To write fast-running code in Matlab, one should exploit Matlab's matrix-manipulating commands in order to avoid the costly for loops; see for instance bsxfun, repmat or accumarray. Writing a kNN classifier can be conveniently done using the repmat command. However, when dealing with high dimensionality and a large number of samples, exploiting this command can in fact slow down computation, because the machine will spend a significant amount of time allocating the required memory for the large matrices. In that case the code runs faster if you balance for-loops with memory-allocating commands, i.e. maintain a single for loop and use repmat (or bsxfun) for the remaining operations. Unfortunately, it is difficult to anticipate the appropriate balance: one has to try out different combinations to arrive at the fastest implementation; a small timing sketch follows below.
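
The following timing sketch illustrates this trade-off; the data sizes are assumed values and should be adjusted to your machine, and the fully materialized version uses repelem (available from R2015a on):

nTrn = 2000; nTst = 100; nDim = 100;   % assumed sizes - adjust to your machine
DAT  = rand(nTrn, nDim);
TST  = rand(nTst, nDim);
% fully materialized version: replicates both matrices before subtracting
tic;
DIF = repelem(TST, nTrn, 1) - repmat(DAT, nTst, 1);   % large temporary matrix [nTst*nTrn, nDim]
D1  = reshape(sqrt(sum(DIF.^2, 2)), nTrn, nTst)';     % pairwise Euclidean distances [nTst nTrn]
toc;
% single loop, vectorized inside: much smaller temporaries
tic;
D2 = zeros(nTst, nTrn);
for i = 1:nTst
    Dif     = bsxfun(@minus, DAT, TST(i,:));
    D2(i,:) = sqrt(sum(Dif.^2, 2))';
end
toc;
% max(abs(D1(:)-D2(:))) should be close to zero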

Vector Multiplication In mathematical notation a vector is assumed to be a column vector (see also Appendix E). In Matlab, however, if you define a vector as a=[1 2 3], it is a row vector - in fact exactly as you write it. To conform with mathematical notation, either transpose the vector immediately by using the transpose sign ' (e.g. a=[1 2 3]';) or enter it with semi-colons (e.g. a=[1; 2; 3];); otherwise you are forced to change the place of the transpose sign later when applying the dot product (a*b' instead of a'*b), in which case it appears reversed with respect to the mathematical notation! Or simply use the command dot, for which the column/row orientation is irrelevant.

D.1 Parallel Computing Toolbox in Matlab

Should you be a lucky owner of the Parallel Computing Toolbox in Matlab, then you can use it even on your home PC or laptop, as nowadays home PCs have multiple cores, which permits parallel computing in principle. It is relatively simple to exploit the parallel computing features for for-loops that are suitable for parallel processing: simply open a pool of cores, carry out the loop using the parfor command and then close the pool again.

matlabpool local 2; % opening two cores (workers); in newer Matlab versions use parpool(2)

parfor i = 1:1000

A(i) = SomeFunction(Dat, i); % the data are manipulated in some function by counter i

end

matlabpool close; % in newer Matlab versions: delete(gcp('nocreate'))

The parfor loop cannot be used if the computations in the loop depend on previous results, for example in an iterative process where A(i) depends on A(i-1). It also only makes sense if the process that is supposed to be repeated in parallel is computationally intensive; otherwise the assignment of the individual steps to the corresponding cores (workers) may slow down the computation.


E Matrix and Vector Multiplications

There are several types of multiplications for vectors and matrices. We here summarize only the most frequently used ones. First we need to distinguish between the 'orientation' of vectors, namely between row and column vectors (see again Fig. 2):

Row vector: 'horizontal' sequence of numbers, e.g. A = [1 5 3 −2]. In Matlab entered as follows: A = [1 5 3 -2];, that is, as written in mathematical notation.

Column vector: 'vertical' sequence of numbers, e.g. B with the entries −1, 4 and 2 stacked on top of each other. Because such a notation is space consuming, one can also write the column vector as a row vector with an indication at the end of the brackets telling us that it is supposed to be a column vector. That indication is the transpose sign or the letter T: B = [−1 4 2]' or B = [−1 4 2]^T. In Matlab we can write B = [-1 4 2]'; - note the transpose sign. But we can also enter the values with semi-colons, e.g. B = [-1; 4; 2], and leave the transpose sign away.

If you have trouble remembering the two orientations, then think of a 'row of seats' (horizontal) and the 'columns of a temple' (vertical).

Note: In mathematical notation, a vector is assumed to be a column vector by default. It is thus recommended that vectors in Matlab are defined as column vectors immediately, such that multiplications in the code appear in accordance with the mathematical notation - otherwise it can become truly confusing.

E.1 Dot Product (Vector Multiplication) wiki Dot product

In this case, the orientation of vectors (row or column) does not matter. Given two vectors of equal length, A = [A_1, A_2, ..., A_n] and B = [B_1, B_2, ..., B_n], the dot product is defined as the summation of their element-wise products:

A \cdot B = \sum_{i=1}^{n} A_i B_i = A_1 B_1 + A_2 B_2 + ... + A_n B_n = A'B    (24)

where the left side (A · B) is the notation using the dot ·; the center uses the summation notation Σ; and the right side (A'B) is the matrix notation using the transpose. This is also known as the scalar product, because the result is a single number. In Matlab you can use the command dot to obtain the product, in which case the order and orientation of the vectors does not matter. The dot product can also be regarded as a special case of the matrix multiplication (coming up next), in which case the orientation of the vectors does matter.
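
As a minimal sketch, the three notations correspond to three equivalent ways of computing the product in Matlab:

a  = [1; 2; 3];          % column vectors
b  = [4; 5; 6];
s1 = a' * b;             % matrix notation with the transpose
s2 = dot(a, b);          % Matlab's dot command (orientation irrelevant)
s3 = sum(a .* b);        % element-wise product followed by summation
% s1, s2 and s3 all equal 32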

E.2 Matrix Multiplication wiki Matrix multiplication

An n×m matrix A consists of n rows and m columns. It is fairly intuitive that if you add or subtract a scalar value from a matrix, or multiply or divide a matrix by a scalar, this is done for each element of the matrix. It is also intuitive that if two matrices are of exactly the same size, then you can perform these operations on corresponding elements. In Matlab one uses .* and ./ to specify those element-wise operations - if the dot is forgotten, Matlab may silently compute something completely different.

It is less intuitive, however, how the operations are carried out when we multiply two matrices of different sizes with each other. Such a matrix product requires that the number of columns of the first matrix equals the number of rows of the second matrix: if A is an n×m matrix and B is an m×p matrix, then their matrix product AB is an n×p matrix, in which the m entries across the rows of A are multiplied with the m entries down the columns of B. Remember the expression nm mp → np to memorize that requirement. Let us look at the special cases when m equals 1, or when n and p are equal to 1:


Product of a Row and Column Vector:

• Row * Column (nm mp = 1m m1 → 11): this corresponds to the dot product as introduced above. In Matlab: A'*B, but only if A and B were both defined as column vectors.

• Column * Row (nm mp = n1 1p → np): creates an n × p matrix, where n and p correspond to the vector lengths. Here the elements are pairwise multiplied; no actual summation takes place.

The product of two matrices is then simply the application of the dot product in two loops, one iterating through the rows of the first matrix and the other iterating through the columns of the second matrix. Instead of formulating this more explicitly we give a code example, which includes several verification steps using the command assert:

clear;

a = [2 1 3 5]’; % column vector

b = [-1 2 0 3]’; % column vector

s = a’ * b % dot/scalar product

M = a * b’ % matrix product

s1 = dot(a,b); % dot product

s2 = dot(b,a); % order does not matter

assert(all(s==s1), ’something is wrong’);

assert(all(s1==s2), ’something is wrong’);

A = [a’; 4 7 8 -3];

B = [b [2 6 -2 5]’];

A*B % the matrix product

B*A % works too - even though we reversed the order! Why?

%% --- Add another column to A. Calculate product explicitly.

A = [A; [1 1 -1 7]];

[n m1] = size(A);

[m2 p] = size(B);

assert(m1==m2, ’Dimensionality not correct’);

Mx = nan(n,p);

for i = 1:n

a = A(i,:);

for k = 1:p

b = B(:,k);

Mx(i,k) = dot(a,b);

end

end

assert(all(all(Mx==(A*B))), ’not properly programmed’);

Notes
- The loop is given merely for the purpose of illustrating the product of matrices. Of course, one would prefer to simply write A*B in the code.
- Why did B*A work as well? [Answer: Because the size of A is equal to the size of B' (transpose).]
- Observe what error you obtain when you insert the product B*A at the very end of the code again.


F Reading

See the Section references below for publication details.

(Theodoridis and Koutroumbas, 2008): Contains a lot of practical tips - more than any other book that also aims at both theory and practice. Treats clustering very thoroughly - in more depth than any other textbook. Contains code examples for some of the algorithms.

(Han et al., 2012): The main topic is obviously data mining and therefore clustering; it is very practice-oriented. It treats the topic of data processing and preparation as well as outlier detection more elaborately than any other book, namely in separate chapters. Its treatment of clustering is shorter than in the book by Theodoridis and Koutroumbas, but explains some of the issues in a more straightforward manner - partially due to brevity. However, the book does not provide any code.

(Leskovec et al., 2014): Emphasizes large (big) data and also streaming data, and lists many applications. Treats the essential techniques of clustering, dimensionality reduction and classification relatively straightforwardly; however, no code is given.

(Alpaydin, 2010): An introductory book. Reviews some topics from a different perspective than the professional, theoretical books (see below). It can be regarded as complementary to this workbook, but also complementary to other textbooks.

(Witten et al., 2011): Probably the most practice-oriented machine learning book, but rather short on the motivation of the individual classifier types. It accompanies the 'WEKA' machine learning suite (see link above).

(Kuncheva, 2004): This book contains a simple and straightforward introduction to the essential classifier algorithms with some Matlab code examples. And of course it explains how to combine classifiers, probably better than any other book.

(Hastie et al., 2009): Advanced book focusing in particular on supervised learning (regression and classification). Perhaps the most cited of all modern books. Contains a separate section on the classification of high-dimensional data, which most books do not have.

(James et al., 2013): A shorter but visually appealing introductory book. With some code examples for the software package R. Appears to be the 'child' of the book by Hastie et al. (2009).

(Duda et al., 2001): The veteran. For an advanced readership. The book excels at relating the different classifier philosophies and emphasizes the similarities between classifiers and neural networks. Due to its age, it lacks in-depth treatment of recent advances such as combining classifiers and graph methods, for instance.

(Bishop, 2007): Another professional, theoretical book. Contains beautiful illustrations and some historic comments, but aims at a rather advanced readership (upper-level undergraduate and graduate students).

(Martinez et al., 2010): Lovely introductory book for clustering in Matlab - with code of course. Suitable for those who prefer a slower pace, but clustering is not treated in depth.

Wikipedia: Always good for looking up definitions and specific algorithms. But Wikipedia's 'variety' - originating from the contributions of different authors - is also its shortcoming: it is hard to comprehend the topic as a whole from the individual articles (websites). On some broader issues, moreover, some Wikipedia articles are not congruent with textbooks, so caution is advised. Hence, textbooks are still irreplaceable.

References

Alpaydin, E. (2010). Introduction to Machine Learning. MIT Press, Cambridge, MA, 2nd edition.

Berkhin, P. (2006). A survey of clustering data mining techniques. In Grouping Multidimensional Data, pages 25–71. Springer.

Bishop, C. (2007). Pattern Recognition and Machine Learning. Springer, New York.

Duda, R., Hart, P., and Stork, D. (2001). Pattern Classification. John Wiley and Sons Inc, 2nd edition.

Han, J., Kamber, M., and Pei, J. (2012). Data Mining: Concepts and Techniques. Elsevier.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning, volume 112. Springer.

Kuncheva, L. (2004). Combining Pattern Classifiers. John Wiley & Sons, Inc., Hoboken, NJ.

Leskovec, J., Rajaraman, A., and Ullman, J. (2014). Mining of Massive Datasets. Cambridge University Press, Cambridge, UK.

Martinez, W. L., Martinez, A., and Solka, J. (2010). Exploratory Data Analysis with MATLAB. CRC Press.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Theodoridis, S. and Koutroumbas, K. (2008). Pattern Recognition. Academic Press, 4th edition.

Witten, I., Frank, E., and Hall, M. (2011). Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 3rd edition.


G Code Examples

The code examples show how the functions are applied - they should work by copy/paste. They do not necessarily make sense as an analysis. We mostly use Fisher's famous Iris data set, consisting of only 150 samples and 4 dimensions. It is available within Matlab and within Python.

In Python I use the module 'scikit-learn'. The examples here were created with Python version 3.6; if your Python distribution (e.g. Anaconda) already ships with that module, there is no need to install it separately.

G.1 The Classifiers in One Script

The following two listings contain the classifier functions as applied in Matlab and Python. In these examples there is no calling of the individual functions fit and predict; the code demonstrates how to apply the 'wrapper' functions. Those functions can be instructed to perform the folding automatically by specifying the option 'kfold' together with an integer value for the desired number of folds.

clear

% --- Load the data and rename

load fisheriris % a famous data set with 150 samples and 3 classes

DAT = meas; % renaming data variable

GrpLb = species; % renaming group variable

GrpLb = grp2idx(GrpLb); % converting strings to integers

% --- Data Info:

[nSmp nFet] = size(DAT);

GrpU = unique(GrpLb);

nGrp = length(GrpU);

fprintf(’# Samples %d # Features %d # Classes %d\n’, nSmp, nFet, nGrp);

% --- Params

nFld = 5; % number of folds

%% ----- K-Nearest Neighbors -------

MdCv = fitcknn(DAT, GrpLb, ’kfold’,nFld);

pcKnn = 1-kfoldLoss(MdCv);

fprintf(’K-Nearest Neighbour %1.3f\n’, pcKnn*100);

%% ----- Linear Discriminant -------

MdCv = fitcdiscr(DAT, GrpLb, ’kfold’,nFld);

pcLD = 1-kfoldLoss(MdCv);

fprintf(’Linear Discriminant %1.3f\n’, pcLD*100);

%% ----- Naive Bayes -------

MdCv = fitcnb(DAT, GrpLb, ’kfold’,nFld);

pcNB = 1-kfoldLoss(MdCv);

fprintf(’Naive Bayes %1.3f\n’, pcNB*100);

%% ----- Decision Tree -------

MdCv = fitctree(DAT, GrpLb, ’kfold’,nFld);

pcTree = 1-kfoldLoss(MdCv);

fprintf(’Decision Tree %1.3f\n’, pcTree*100);

%% ----- Random Forest -------

MdCv = TreeBagger(100, DAT, GrpLb, ’OOBPred’,’on’);

pcRF = mean(1-oobError(MdCv));

fprintf(’Random Forest %1.3f\n’, pcRF*100);

%% ----- SVM + ErrorCorrectingOutputCode -------

MdCv = fitcecoc(DAT, GrpLb, ’kfold’,nFld);

pcSVM = 1-kfoldLoss(MdCv);

fprintf(’SVM + ECOC %1.3f\n’, pcSVM*100);

%% ----- SupportVectorMachine ------

nCat = 3;

IxF = crossvalind(’kfold’, GrpLb, nFld);

Pc = zeros(nFld,1);

for f = 1:nFld


Btst = IxF==f; % testing samples

Btrn = ~Btst;

nTst = nnz(Btst);

Grp.Tren = GrpLb(Btrn);

Grp.Test = GrpLb(Btst);

% ====== TRAIN MULTI-SVM =======

ASvm = cell(nCat,1);

for i = 1:nCat

Bown = Grp.Tren==i; % select class

ASvm{i} = fitcsvm(DAT(Btrn,:),Bown,’standardize’,false);

end

% ====== PREDICT =========

Post = zeros(nTst,nCat); % posteriors

for i = 1:nCat

[~,Scor] = predict(ASvm{i},DAT(Btst,:));

Post(:,i) = Scor(:,2); % 2nd column contains positive-class scores

end

[~,LbPred] = max(Post,[],2); % select highest post per sample

% ------ accuracy

pc = nnz(LbPred==Grp.Test)/nTst*100;

Pc(f) = pc;

end

fprintf(’SVM multi simple %1.3f\n’, mean(Pc));

%% ----- Ensemble AdaBoost ------

MdAda = fitensemble(DAT,GrpLb, ’AdaBoostM2’,100,’tree’,’kfold’,nFld);

pcAda = 1-kfoldLoss(MdAda,’Mode’,’Cumulative’);

fprintf(’Ensemble Boosting %1.3f\n’, pcAda(end)*100);

And now for Python. The SVM carries out the multi-class discrimination task directly; there is no need to program a multi-classifier scheme as we did in Matlab. The SKL documentation also has a one-script example comparing a number of classifiers, see page 718, section 8.4. Our example is simpler:

from sklearn import datasets, svm

from sklearn.model_selection import KFold, cross_val_score

from sklearn.neighbors import KNeighborsClassifier

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

from sklearn.naive_bayes import GaussianNB

from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import RandomForestClassifier

from sklearn.ensemble import AdaBoostClassifier

from numpy import shape, unique

# --- Load the data and rename

iris = datasets.load_iris()

DAT = iris.data

GrpLb = iris.target

# --- Data Info:

nSmp,nFet = shape(DAT)

GrpU = unique(GrpLb)

nGrp = len(GrpU)

print(’# Samples ’, nSmp, ’ # Features ’, nFet, ’ # Classes ’, nGrp)

# --- Init

k_fold = KFold(n_splits=5) # prepare folds

#%% ----- K-Nearest Neighbors -------

MdKnn = KNeighborsClassifier()

Pcs = cross_val_score(MdKnn, DAT, GrpLb, cv=k_fold)

print(’K-Nearest Neighbour ’, Pcs.mean()*100)

#%% ----- Linear Discriminant -------

MdDisc = LinearDiscriminantAnalysis()

Pcs = cross_val_score(MdDisc, DAT, GrpLb, cv=k_fold)

print(’Linear Discriminant ’, Pcs.mean()*100)


#%% ----- Naive Bayes -------

MdNB = GaussianNB()

Pcs = cross_val_score(MdNB, DAT, GrpLb, cv=k_fold)

print(’Naive Bayes ’, Pcs.mean()*100)

#%% ----- Decision Tree -------

MnTree = DecisionTreeClassifier()

Pcs = cross_val_score(MnTree, DAT, GrpLb, cv=k_fold)

print(’Decision Tree ’, Pcs.mean()*100)

#%% ----- Random Forest ------- p239

MnRF = RandomForestClassifier(n_estimators=25)

Pcs = cross_val_score(MnRF, DAT, GrpLb, cv=k_fold)

print(’Random Forest ’, Pcs.mean()*100)

#%% ----- SupportVectorMachine ------

MdSvm = svm.SVC()

Pcs = cross_val_score(MdSvm, DAT, GrpLb, cv=k_fold)

print(’SupportVectorMachine ’, Pcs.mean()*100)

#%% ----- AdaBoost Classifier ------

MdAbo = AdaBoostClassifier(n_estimators=100)

Pcs = cross_val_score(MdAbo, DAT, GrpLb, cv=k_fold)

print(’AdaBoost ’, Pcs.mean()*100)


G.2 The Clustering Algorithms in One Script

Matlab offers only a few basic functions, namely kmeans and clusterdata. Here we added two functions, f_DbScan and f_KmnsFuz, see the last two sections; those functions are provided in Appendices G.23 and G.22, respectively.

clear

% --- Load the data and rename

load fisheriris % a famous data set with 150 samples and 3 classes

DAT = meas; % renaming data variable

% --- Data Info:

[nSmp nFet] = size(DAT);

fprintf(’# Samples %d # Features %d \n’, nSmp, nFet);

%% ----- Kmeans -------

LbMean = kmeans(DAT,5);

ClusSz = histcounts(LbMean,5);

fprintf(’Cluster Sizes Kmeans %2d-%2d\n’, min(ClusSz), max(ClusSz));

%% ----- Kmedoids -------

LbMed = kmedoids(DAT,5);

ClusSz = histcounts(LbMed,5);

fprintf(’Cluster Sizes Kmedoids %2d-%2d\n’, min(ClusSz), max(ClusSz));

%% ----- Hierarchical -------

LbHier = clusterdata(DAT,1.1);

LbU = unique(LbHier);

nClus = length(LbU);

fprintf(’# hierarchical clusters %2d\n’, nClus);

%% ----- Density-Based SCAN -------

LbDBS = f_DbScan(DAT, 2);

LbU = unique(LbDBS);

nClus = length(LbU);

fprintf(’# dense clusters %2d\n’, nClus);

%% ----- Fuzzy Kmeans -------

LbFuz = f_KmnsFuz(DAT,5);

ClusSz = histcounts(LbFuz,5);

fprintf(’Cluster Sizes Fuzzy Kmeans%2d-%2d\n’, min(ClusSz), max(ClusSz));

Python contains a larger number of clustering implementations, but we show for the moment only the most common algorithms:

from sklearn import datasets

from sklearn.cluster import KMeans, MiniBatchKMeans, Birch, DBSCAN

from sklearn.cluster import AgglomerativeClustering

from numpy import shape, unique

# --- Load the data and rename

iris = datasets.load_iris()

DAT = iris.data

# --- Data Info:

nSmp,nFet = shape(DAT)

print(’# Samples ’, nSmp, ’ # Features ’, nFet)

nClus = 5 # we assume 5 clusters

#%% ----- Kmeans (Standard) ------- p748, 9.11

LbKstnd = KMeans(n_clusters=nClus).fit(DAT)

Lu,ClusSz = unique(LbKstnd.labels_,return_counts=1)

print(’Cluster Sizes Kmeans - standard ’, min(ClusSz), ’-’, max(ClusSz))

#%% ----- Kmeans (Large Data) -------


LbKbtch = MiniBatchKMeans(n_clusters=nClus).fit(DAT)

Lu,ClusSz = unique(LbKbtch.labels_,return_counts=1)

print(’Cluster Sizes Kmeans - large data’, min(ClusSz), ’-’, max(ClusSz))

#%% ----- Hierarchical ------- p734, 9.5

LbAgglo = AgglomerativeClustering(n_clusters=nClus, linkage=’ward’).fit(DAT)

Lu,ClusSz = unique(LbAgglo.labels_,return_counts=1)

print(’Hierarchical ’, min(ClusSz), ’-’, max(ClusSz))

#%% ----- DBSCAN ------ p753, s9.13

LbDBSCAN = DBSCAN(eps=0.8, min_samples=10).fit(DAT)

Lu,ClusSz = unique(LbDBSCAN.labels_,return_counts=1)

print(’DBSCAN ’, min(ClusSz), ’-’, max(ClusSz))

#%% ----- Birch ------

LbBirch = Birch(threshold=0.6, n_clusters=10).fit(DAT)

Lu,ClusSz = unique(LbBirch.labels_,return_counts=1)

print(’Birch ’, min(ClusSz), ’-’, max(ClusSz))


G.3 Prepare Your Data

Some preparatory steps to inspect and adjust your data. Again, it is meant as example code showing how to apply the functions; not every step necessarily makes sense for this particular data set.

clear % clear memory

close all % close all figures

load fisheriris % famous data set provided by Matlab

DAT = meas; % renaming data variable

GrpLb = species; % renaming group variable

GrpLb = grp2idx(GrpLb); % converting strings to integers

% --- Data info

[nSmp nFet] = size(DAT); % size of data

GrpU = unique(GrpLb); % the group/class/category labels

nGrp = length(GrpU); % # of groups/classes/cats

Hgrp = histcounts(GrpLb,nGrp); % sample count per class

fprintf(’# Samples %d # Features %d # Classes %d\n’, nSmp, nFet, nGrp);

tabulate(GrpLb); % group analysis in one function

%% -----Inspect visually

figure; imagesc(DAT); colorbar(); % display as image

figure; boxplot(DAT); % box plot

figure; hist(DAT(:,1)); % plot histogram of 1st feature

figure; bar(Hgrp); title(’Sample Count per Class’);

%% ----- Introduce some artificial irregularities

DAT(1:2,1) = NaN; % set first two values to NaN

DAT(5:15,2) = NaN;

DAT(100:150,4) = NaN;

DAT(3,1) = inf; % set one value to infinity

DAT = [DAT ones(nSmp,1)*2]; % add a column of 2s

%% ----- Check for NaN, Inf, zero-standard deviation

Cnan = sum(isnan(DAT),1); % NaN count per feature

Cinf = sum(isinf(DAT),1); % inf count per feature

PropNaN = Cnan / nSmp; % proportion NaN per feature

if any(PropNaN) % plot only if there are any NaNs

figure(); bar(PropNaN);

end

Bzstd = std(DAT,[],1) < eps; % features with 0 standard deviation

if any(Bzstd),

warning(’the following feature dimensions have constant values’);

find(Bzstd)

end

%% ----- Adjust Data ------

BnoNaN = not(logical(Cnan));

DATred = DAT(:,BnoNaN); % reduced data

DAT(isinf(DAT)) = realmax; % use max to replace inf

%% ----- Standardization -----

% watch out: turns any dimension with some NaNs to NaNs only

DATstd = zscore(DAT); % zero mean, unit standard deviation

std(DATstd,[],1) % display to verify

%% ----- Scale to Unit Range ------

DAT = bsxfun(@plus, DAT, -min(DAT,[],1)); % set minimum to 0

DAT = bsxfun(@rdivide, DAT, max(DAT,[],1)); % now we scale to 1

max(DAT,[],1) % display maxima to verify

min(DAT,[],1) % display minima to verify

%% ----- Permute Data --------

% necessary for some classifiers such as NeuralNetworks

IxPerm = randperm(nSmp); % randomize order of training samples

DAT = DAT(IxPerm,:); % reorder training set

GrpLb = GrpLb(IxPerm); % reorder group variable

%% ---- Create Folds -------


IxFld = crossvalind(’kfold’, GrpLb, 5);

from numpy import shape, unique, concatenate, ones

from numpy import nan, isnan, inf, isinf, arange, logical_not, random

from matplotlib.pyplot import bar, imshow, figure, colorbar, boxplot, hist

from sklearn import datasets

from sklearn.preprocessing import scale, minmax_scale

from sys import float_info

# --- Load the data and rename

iris = datasets.load_iris()

DAT = iris.data

GrpLb = iris.target

# --- Flip Data if necessary

#DAT = DAT.transpose()

# --- Data Info:

nSmp,nFet = shape(DAT)

GrpU = unique(GrpLb)

nGrp = len(GrpU)

Hgrp = hist(GrpLb, nGrp)

print(’# Samples ’, nSmp, ’ # Features ’, nFet, ’ # Classes ’, nGrp)

#%% -----Inspect visually

figure(figsize=(4,40)); imshow(DAT); colorbar()

figure(); boxplot(DAT)

figure(); hist(DAT[:,2])

#%% ----- Introduce some irregularities

DAT[0:2,0] = nan

DAT[4:14,1] = nan

DAT[100:150,3] = nan

DAT[2,0] = inf; # set one value to infinity

DAT = concatenate((DAT, ones((nSmp,1))*2), axis=1) # add a column of 2s

#%% ----- Check for NaN, Inf, zero-standard deviation

Cnan = isnan(DAT).sum(axis=0) # NaN count per feature

Cinf = isinf(DAT).sum(axis=0) # inf count per feature

PropNaN = Cnan / nSmp # proportion NaN per feature

if PropNaN.any(): # plot only if there are any NaNs

bar(arange(0,len(PropNaN)), PropNaN)

Bzstd = DAT.std(axis=0) < .000000001 # features with 0 standard deviation

if Bzstd.any():

print(’the following feature dimensions have constant values’)

print(Bzstd.nonzero())

#%% ----- Adjust Data ------

BnoNaN = logical_not(Cnan)

DATred = DAT[:,BnoNaN]

DATred2 = DAT.compress(BnoNaN,axis=1) # excluding columns

DAT[isinf(DAT)] = float_info.max # use max to replace inf

#%% ----- Standardization -----

# we go to reduced data because ’scaler’ cannot deal with NaN

DATstd = scale(DATred) # zero mean, unit standard deviation

print(DATstd.std(axis=0)) # display to verify

#%% ----- Scale to Unit Range ------

DATu = minmax_scale(DATred) # scaling to E [0 1]

print(DATu.max(axis=0)) # display maxima to verify

print(DATu.min(axis=0)) # display minima to verify

#%% ----- Permute Data --------

# necessary for some classifiers such as NeuralNetworks

IxPerm = random.permutation(nSmp); # randomize order of training samples

DAT = DAT[IxPerm,:] # reorder training set

GrpLb = GrpLb[IxPerm] # reorder group variable


#%% ---- Create Folds -------

#IxFld = crossvalind(’kfold’, GrpLb, 5);

G.3.1 Whitening Transform wiki Whitening transformation

Input: DAT, an n × d matrix
Output: DWit, the whitened data.

CovMx = cov(DAT); % covariance -> [nDim,nDim] matrix

[EPhi ELam] = eig(CovMx); % eigenvectors & -values [nDim,nDim]

Ddco = DAT * EPhi; % DECORRELATION

LamS = diag(diag(ELam).^(-0.5)); % diagonal matrix with 1/sqrt of the eigenvalues

DWit = Ddco * LamS; % EQUAL VARIANCE

% verify

COVdco = cov(Ddco); % covariance of decorrelated data (should be a diagonal matrix)

Df = diag(ELam)-diag(COVdco); % difference of diagonal elements

if sum(abs(Df))>0.1, error('odd: differences of diagonal elements very large!?'); end

See also http://courses.media.mit.edu/2010fall/mas622j/whiten.pdf

G.3.2 Loading and Converting Data

clear; close all;

addpath(’c:/Data/’); % adds a folder path to the variable path

%% ---- Import a Single File

DAT = importdata(’filename for data’);

Grp = importdata(’filename for class/group label’);

DAT = single(DAT); % if you do not need double precision

sfp = ’c:/DataMat/DatPrep’; % where data will be saved to

save(sfp,’DAT’,’Grp’); % will be saved in compact Matlab format

%% ---- Import Multiple Files

FOLD.DatRaw = ’c:/DataRaw/’; % folder with different data files

FOLD.DatSave = ’c:/DataMat/’; % folder where we save converted data

FilesAndDir = dir(FOLD.DatRaw); % includes (’.’ and ’..’)

FileNames = FilesAndDir(3:end); % omit first two dir (’.’ and ’..’)

nFileNames = length(FileNames);

DAT = zeros(nFileNames,nDim); % nDim: # of dimensions - if known already

Grp = zeros(nFileNames,1);

for i = 1:nFileNames

fp = [FOLD.DatRaw FileNames(i).name]; % full path (’.’ and ’..’ were already omitted above)

F = load(fp); % a feature vector

DAT(i,:) = F; % assign to DAT matrix

Grp(i) = label; % assign to group vector (’label’ is a placeholder: derive it e.g. from the file name)

end

%% --- Now Save

save(sfp, ’DAT’, ’Grp’);


G.3.3 Loading the MNIST dataset

Note that this function contains two subfunctions, ff LoadImg and ff ReadLab.

% Loads MNIST data and converts them from ubyte to single.

%

function [TREN LblTren TEST LblTest] = LoadMNIST()

filePath = ’C:\DatOrig\MNST\’;

Filenames = cell(4,1);

Filenames{1} = [filePath ’train-images.idx3-ubyte’];

Filenames{2} = [filePath ’train-labels.idx1-ubyte’];

Filenames{3} = [filePath ’t10k-images.idx3-ubyte’];

Filenames{4} = [filePath ’t10k-labels.idx1-ubyte’];

TREN = ff_LoadImg(Filenames{1});

LblTren = ff_ReadLab(Filenames{2});

TEST = ff_LoadImg(Filenames{3});

LblTest = ff_ReadLab(Filenames{4});

TREN = single(TREN)/255.0;

TEST = single(TEST)/255.0;

LblTren = single(LblTren);

LblTest = single(LblTest);

end % MAIN FUNCTION

%% ========== Load Digits

function IMGS = ff_LoadImg(imgFile)

fid = fopen(imgFile, ’rb’);

idf = fread(fid, 1, ’*int32’,0,’b’); % identifier

nImg = fread(fid, 1, ’*int32’,0,’b’);

nRow = fread(fid, 1, ’*int32’,0,’b’);

nCol = fread(fid, 1, ’*int32’,0,’b’);

IMGS = fread(fid, inf, ’*uint8’,0,’b’);

fclose( fid );

assert(idf==2051, ’%s is not MNIST image file.’, imgFile);

IMGS = reshape(IMGS, [nRow*nCol, nImg])’;

for i=1:nImg

Img = reshape(IMGS(i,:), [nRow nCol])’;

IMGS(i,:) = reshape(Img, [1 nRow*nCol]);

end

end % SUB FUNCTION

%% ========== Load Labels

function Lab = ff_ReadLab(labFile)

fid = fopen(labFile, ’rb’);

idf = fread(fid, 1, ’*int32’,0,’b’);

nLabs = fread(fid, 1, ’*int32’,0,’b’);

ind = fread(fid, inf, ’*uint8’,0,’b’);

fclose(fid);

assert(idf==2049, ’%s is not MNIST label file.’, labFile);

Lab = zeros(nLabs, 10);

ind = ind + 1;

for i=1:nLabs

Lab(i,ind(i)) = 1;

end

end % SUB FUNCTION


G.4 Utility Functions

G.4.1 Calculating Memory Requirements

% Calculate memory requirements for a data matrix with nEnt entries

%

% IN nEnt number of entries, typically nPoints * nDimensions

% OUT Gb number of GigaBytes

%

function Gb = f_GbSingle(nEnt)

Gb = nEnt*4/(1024^3);

fprintf(’%.3f Gb ’, Gb);

end
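
A quick usage example (the numbers are arbitrary):

f_GbSingle(1000000 * 500);   % 1 million samples with 500 features: ca. 1.86 Gb in single precision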


G.5 Classification Example - Nearest-Centroid Section 4

In all the following examples, only one prediction estimate is carried out. To apply cross-folding, one would write a loop that encompasses the code, as sketched below.
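
A minimal sketch of such a loop (the fold indices are generated with crossvalind as in the examples that follow; the loop body is a placeholder for the training and testing code of each example):

IxCV = crossvalind('KFold', GrpLb, 5);   % assign each sample to one of 5 folds
Pc   = zeros(5,1);
for f = 1:5
    Btst = IxCV==f;                      % samples of the current testing fold
    Btrn = ~Btst;                        % remaining samples for training
    % ... train on DAT(Btrn,:), test on DAT(Btst,:) as in the examples below ...
    % Pc(f) = percentage correct of this fold;
end
% mean(Pc) is then the cross-validated estimate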

G.5.1 Simple Version

clear

load fisheriris

DAT = meas; % renaming data variable

GrpLb = species; % renaming group variable

GrpLb = grp2idx(GrpLb); % converting strings to integers

nGrp = length(unique(GrpLb)); % number of groups

nDim = size(DAT,2); % number of dimensions

%% ========= Split Data into Training/Testing Set =============

IxCV = crossvalind(’KFold’,GrpLb,5);

Bfold1 = IxCV==1; % identify 1st fold

Bfrest = ~Bfold1; % identify 2nd-5th folds

TST = DAT(Bfold1,:); % one fold for testing

TRN = DAT(Bfrest,:); % four folds for training

GrpTst = GrpLb(Bfold1); % group labels for testing

GrpTrn = GrpLb(Bfrest); % group labels for training

%% ========= Training (Class Centroids) =============

CEN = zeros(nGrp,nDim);

for i = 1:nGrp

Bg = GrpTrn==i; % logical vector with ith class ON

CEN(i,:) = mean(TRN(Bg,:),1); % calculate centroid

end

%% ========= Testing (Classification) =============

DM = pdist2(TST, CEN); % distance matrix [nSmp nGrp]

[Di LbPred] = min(DM,[],2); % nearest centroid

Bhit = LbPred==GrpTst; % predicted class equal actual class?

pc = nnz(Bhit)/length(GrpTst);

fprintf(’Perc correct %1.2f \n’, pc*100);

In Python I have improvised the folding (due to the lack of a function like 'crossvalind'). If one intended to rotate through all folds, it would need to be a bit cleverer. Python offers the function NearestCentroid, which I show how to use in the last code section.

from sklearn import datasets

from scipy.spatial.distance import cdist

from sklearn.neighbors import NearestCentroid

from numpy import shape, unique, zeros, random, mean, argmin

# --- Load the data and rename

iris = datasets.load_iris()

DAT = iris.data

GrpLb = iris.target

# --- Data Info:

nSmp,nFet = shape(DAT)

GrpU = unique(GrpLb)

nGrp = len(GrpU)

print(’# Samples ’, nSmp, ’ # Features ’, nFet, ’ # Classes ’, nGrp)

#%% ========= Split Data into Training/Testing Set =============

IxPerm = random.permutation(nSmp)

szFld = round(nSmp/5)

IxFold1 = IxPerm[:szFld]

IxFrest = IxPerm[szFld:]


TST = DAT[IxFold1,:] # one fold for testing

TRN = DAT[IxFrest,:] # four folds for training

GrpTst = GrpLb[IxFold1] # group labels for testing

GrpTrn = GrpLb[IxFrest] # group labels for training

#%% ========= Training (Class Centroids) =============

CEN = zeros((nGrp,nFet))

for i in range(0,nGrp):

Bg = GrpTrn==i # logical vector with one class ON

CEN[i,:] = mean(TRN[Bg,:],axis=0)# calculate centroid

#%% ========= Testing (Classification) =============

DM = cdist(TST, CEN) # distance matrix [nSmp nGrp]

LbPred = argmin(DM,axis=1) # nearest centroid

Bhit = LbPred==GrpTst;

pc = shape(Bhit.nonzero())[1]/len(GrpTst)

print(’Perc correct’, pc*100, ’ explicit’)

#%% ========= Training/Testing with Function =============

MdNC = NearestCentroid()

MdNC.fit(TRN, GrpTrn)

LbPred = MdNC.predict(TST)

Bhit = LbPred==GrpTst;

pc = shape(Bhit.nonzero())[1]/len(GrpTst)

print(’Perc correct’, pc*100, ’ function’)

G.5.2 Shrunken Version

An explicit implementation of the shrunken version is demonstrated only in Python:

from sklearn import datasets

from scipy.spatial.distance import cdist

from sklearn.neighbors import NearestCentroid

from numpy import shape, unique, zeros, random, mean, array

from numpy import sign, sqrt, median, newaxis

# --- Load the data and rename

iris = datasets.load_iris()

DAT = iris.data

GrpLb = iris.target

# --- Data Info:

nSmp,nFet = shape(DAT)

GrpU = unique(GrpLb)

nGrp = len(GrpU)

print(’# Samples ’, nSmp, ’ # Features ’, nFet, ’ # Classes ’, nGrp)

#%% ========= Split Data into Training/Testing Set =============

IxPerm = random.permutation(nSmp)

szFld = round(nSmp/5)

IxFold1 = IxPerm[:szFld]

IxFrest = IxPerm[szFld:]

TST = DAT[IxFold1,:] # one fold for testing

TRN = DAT[IxFrest,:] # four folds for training

GrpTst = GrpLb[IxFold1] # group labels for testing

GrpTrn = GrpLb[IxFrest] # group labels for training

# ---- toy data: will overwrite above variables

if 1:

TRN = array([[-6, 1.02], [-5, 1], [-4, 0.98], [4.9, 1], [5, 1], [5.1, 1]])

TST = array([[0, 0],[1, 0]])

GrpTrn = array([0, 0, 0, 1, 1, 1])

GrpTst = array([0,1])

nGrp, nFet = 2,2

#%% ============== Function ==================

MdNC = NearestCentroid(shrink_threshold=0.3) # initialize model

MdNC.fit(TRN, GrpTrn) # train


LbPred = MdNC.predict(TST) # predict

Bhit = LbPred==GrpTst

pc = shape(Bhit.nonzero())[1]/len(GrpTst)

print(’Perc correct’, pc*100, ’ function’)

#%% ============== Explicit ==================

shrink_threshold = 0.3

nTrn = len(GrpTrn) # no. of total training samples

# ======= Class Centroids and Sizes =============

CEN = zeros((nGrp,nFet)); Gsz = zeros(nGrp)

for i in range(0,nGrp):

Bg = GrpTrn==i # logical vector with one class ON

CEN[i,:] = mean(TRN[Bg,:],axis=0) # calculate centroid

Gsz[i] = sum(Bg) # group sizes (# of instances per class)

# ====== Scale Factor ========

Vnc = (TRN - CEN[GrpTrn,:]) ** 2 # [nSmpTrn nFet]

Vnc = Vnc.sum(axis=0) # [1 nFet]

Std = sqrt(Vnc / (nTrn - nGrp)) # [1 nFet] std dev

Std += median(Std) # add the median to make the estimate robust against outliers

M = sqrt((1./Gsz) - (1./nTrn)) # [nGrp 1]

Mr = M.reshape(nGrp,1) # reshape for broadcasting

MS = Mr * Std # [nSmpTrn nFet] scale factor

# ====== Scaled Vectors ========

CenTrn = mean(TRN,axis=0) # total centroid (of training data)

DEV = (CEN - CenTrn) / MS # deviation: scaled vectors

# ====== Threshold ======

SIGNS = sign(DEV)

DEV = (abs(DEV) - shrink_threshold)

DEV[DEV<0] = 0 # soft thresholding

DEV *= SIGNS

## ====== Scale Back ========

CENshr = CenTrn[newaxis,:] + DEV * MS # data centroid + deviations

# ====== Testing ========

LbPred = cdist(TST,CENshr).argmin(axis=1)

Bhit = LbPred==GrpTst

pc = shape(Bhit.nonzero())[1]/len(GrpTst)

print(’Perc correct’, pc*100, ’ shrinkage explicit’)


G.6 Classification Example - kNN

The code serves two purposes: to show how to apply the kNN algorithm and how to cross-validate with an increasing degree of explicitness. The first section, labeled 'All-In-One Function', contains the classification and folding in a single line - as demonstrated already in G.1. The second section, labeled 'Folding More Explicit', calls a separate function crossval for folding. The third section, 'Folding Done Yourself', shows how to be very explicit with folding; we now make use of the functions fitxxx and predict for the individual folds. The remaining two sections explain how to move to a kNN implementation of one's own, the first one relying on knnsearch, the second (the very last one) showing how to program it completely yourself.

clear

load fisheriris % famous data set provided by Matlab

DAT = meas; % renaming data variable

GrpLb = species; % renaming group variable

GrpLb = grp2idx(GrpLb); % converting strings to integers

nFld = 5; % number of folds

nNN = 5; % number of nearest neighbors

%% ========= All-In-One-Function =============

Mdl = fitcknn(DAT, GrpLb, ’NumNeighbors’,nNN, ’kfold’,nFld);

pc = 1-kfoldLoss(Mdl); % percent correct = 1-error

fprintf(’Perc correct %1.2f (all-in-one)\n’, pc*100);

%% ========= Folding More Explicit =============

Mdl = fitcknn(DAT, GrpLb, ’NumNeighbors’,nNN);

MdlF = crossval(Mdl, ’kfold’,nFld);

pcf = 1-kfoldLoss(MdlF);

fprintf(’Perc correct %1.2f (folding more explicit)\n’, pcf*100);

%% ========= Folding Done Yourself =============

IxFlds = crossvalind(’kfold’, GrpLb, nFld);

Pc =[];

for i = 1:nFld

% --- Prepare fold

Btst = IxFlds==i; % logical vector identifying testing samples

Btrn = ~Btst; % logical vector identifying training samples

Grp.Tren = GrpLb(Btrn); % select group labels for training

Grp.Test = GrpLb(Btst); % select group labels for testing

nTst = length(Grp.Test);

% --- Test fold

Mdl = fitcknn(DAT(Btrn,:), Grp.Tren, ’NumNeighbors’,nNN);

LbPred = predict(Mdl, DAT(Btst,:));

Bhit = LbPred==Grp.Test; % binary vector with hits equal 1

pc = nnz(Bhit)/nTst*100;

fprintf(’Fold %d, pc %1.2f\n’, i, pc);

Pc(i) = pc;

end

fprintf(’Perc correct %1.2f (folding done yourself)\n’, mean(Pc));

%% ========= Using Matlab’s knnsearch =============

nCls = 3; % # of classes

Btst = IxFlds==2; % choosing fold 2 for testing

TRN = DAT(~Btst,:); % training data (remaining folds)

TST = DAT(Btst,:); % testing data (chosen fold)

Grp.Tren = GrpLb(~Btst); % group labels for training

Grp.Test = GrpLb(Btst); % group labels for testing

[IXNN Dist] = knnsearch(TRN, TST, ’k’,nNN);

GNN = Grp.Tren(IXNN); % indices to group labels

HNN = histc(GNN, 1:nCls, 2); % histogram for 5 NN

[Fq LbPred] = max(HNN, [], 2); % LbTst contains predicted classes


Bhit = LbPred==Grp.Test; % binary vector with hits equal 1

fprintf(’Perc correct %1.2f (knnsearch)\n’, nnz(Bhit)/nTst*100);

%% ========= Own Implementation =============

nTrn = size(TRN,1); % # of training samples

nTst = size(TST,1); % # of testing samples

GNN = zeros(nTst,11); % we will analyze 11 nearest neighbors

for i = 1:nTst

iTst = repmat(TST(i,:), nTrn, 1);% replicate to same size [nTrn nDim]

Diff = TRN-iTst; % difference [nTrn nDim]

Dist = sum(abs(Diff),2); % Manhattan distance [nTrn 1]

[~, O] = sort(Dist,’ascend’); % increasing dist for k-NN

GNN(i,:)= Grp.Tren(O(1:11)); % closest 11 samples

end

% --- Quick Knn analysis

HNN = histc(GNN(:,1:nNN), 1:nCls, 2); % histogram for 5 NN

[Fq LbPred] = max(HNN, [], 2); % LbTst contains predicted classes

Bhit = LbPred==Grp.Test; % binary vector with hits equal 1

fprintf(’Perc correct %1.2f (own)\n’, nnz(Bhit)/nTst*100);

from sklearn import datasets

from sklearn.model_selection import KFold, cross_val_score

from sklearn.neighbors import KNeighborsClassifier

from numpy import shape, unique, zeros

# --- Load the data and rename

iris = datasets.load_iris()

DAT = iris.data

GrpLb = iris.target

# --- Data Info:

nSmp,nFet = shape(DAT)

GrpU = unique(GrpLb)

nGrp = len(GrpU)

print(’# Samples ’, nSmp, ’ # Features ’, nFet, ’ # Classes ’, nGrp)

# --- Init

k_fold = KFold(n_splits=5) # prepare folds

nNN = 5 # number of nearest neighbors

#%% ========= All-In-One Function =============

MdKnn = KNeighborsClassifier(nNN)

Pcs = cross_val_score(MdKnn, DAT, GrpLb, cv=k_fold)

print(’Perc correct ’, Pcs.mean()*100, ’(all-in-one)’)

#%% ========= Folding Done Yourself =============

Pc = zeros((5,1))

i = 0

for IxTrn, IxTst in k_fold.split(DAT):

#print(’Train: %s | test: %s’ % (IxTrn, IxTst))

TRN = DAT[IxTrn,:]

TST = DAT[IxTst,:]

MdKnn.fit(TRN, GrpLb[IxTrn])

LbPred = MdKnn.predict(TST)

Bhit = LbPred==GrpLb[IxTst]

pc = Bhit.sum()/len(IxTst)*100

Pc[i] = pc

i = i+1

print(’Fold ’, i, ’ pc ’, pc)


print(’Perc correct ’, Pc.mean(), ’ folding done yourself’)

G.6.1 kNN Analysis Systematic

This should run as a continuation of the above Matlab example.

%% --- Systematic Knn analysis

kNN = [3:2:11];

nNN = length(kNN); % number of NN we are testing

Pc = zeros(nNN,1); % init array

c = 0; % counter

for k = kNN

HNN = histc(GNN(:,1:k), 1:nCls, 2); % histogram for the first k NN

[Fq LbPred] = max(HNN, [], 2); % LbTst contains class assignment

Bhit = LbPred==Grp.Test;

c = c + 1;

Pc(c) = nnz(Bhit)/nTst*100;

end

figure(3);clf;

plot(kNN, Pc, ’*-’); title(’Perc correct for different NN’);

xlabel(’k (# of NN)’);

ylabel(’Perc Correct’);

set(gca,’ylim’,[0 100]);

G.7 Estimating the Covariance Matrix

clear;

D = randn(10,3);

[nO nDim] = size(D); % # observations/dimensions

Mn = mean(D,1); % mean

Dc = bsxfun(@minus, D, Mn); % data - mean

Cv = (Dc’ * Dc) / (nO-1); % covariance

%% ---- Verification

Vnc = var(D,[],1); % variance per dimension

Vnc-Cv(diag(true(nDim,1)))’

Cv2 = cov(D);

Cv-Cv2


G.8 Classification Example - Linear Classifier

This is the same Matlab code as in the kNN example (Appendix G.6), except that the function fitcdiscr is used instead of fitcknn. The explicit folding is not shown anymore.

clear

load fisheriris

DAT = meas; % renaming data variable

GrpLb = species; % renaming group variable

GrpLb = grp2idx(GrpLb); % converting strings to integers

nFld = 5; % number of folds

%% ========= All-In-One Function =============

MdCv = fitcdiscr(DAT, GrpLb, ’kfold’,nFld);

pc = 1-kfoldLoss(MdCv); % percent correct = 1-error

fprintf(’Perc correct %1.2f (all-in-one)\n’, pc*100);

%% ========= Folding More Explicit =============

Mdl = fitcdiscr(DAT, GrpLb);

MdlF = crossval(Mdl, ’kfold’,nFld);

pcf = 1-kfoldLoss(MdlF);

fprintf(’Perc correct %1.2f (folding more explicit)\n’, pcf*100);

This is also the same Python code as in the kNN example (Appendix G.6), except that the function LinearDiscriminantAnalysis is imported instead of KNeighborsClassifier. In this example, however, we show one more time how to fold explicitly.

from sklearn import datasets

from sklearn.model_selection import KFold, cross_val_score

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

from numpy import shape, unique, zeros

# --- Load the data and rename

iris = datasets.load_iris()

DAT = iris.data

GrpLb = iris.target

# --- Data Info:

nSmp,nFet = shape(DAT)

GrpU = unique(GrpLb)

nGrp = len(GrpU)

print(’# Samples ’, nSmp, ’ # Features ’, nFet, ’ # Classes ’, nGrp)

# --- Init

k_fold = KFold(n_splits=5) # prepare folds

#%% ========= All-In-One Function =============

Mdl = LinearDiscriminantAnalysis()

Pcs = cross_val_score(Mdl, DAT, GrpLb, cv=k_fold)

print(’Perc correct ’, Pcs.mean()*100, ’(all-in-one)’)

#%% ========= Folding Done Yourself =============

Pc = zeros((5,1))

i = 0

for IxTrn, IxTst in k_fold.split(DAT):

#print(’Train: %s | test: %s’ % (IxTrn, IxTst))

TRN = DAT[IxTrn,:]

TST = DAT[IxTst,:]

Mdl.fit(TRN, GrpLb[IxTrn])

LbPred = Mdl.predict(TST)

Prob = Mdl.predict_proba(TST) # returns probabilities [nSmp nFet]


Bhit = LbPred==GrpLb[IxTst]

pc = Bhit.sum()/len(IxTst)*100

Pc[i] = pc

i = i+1

print(’Fold ’, i, ’ pc ’, pc)

print(’Perc correct ’, Pc.mean(), ’ folding done yourself’)


G.9 Principal Component Analysis

clear

load fisheriris

DAT = meas; % renaming data variable

GrpLb = species; % renaming group variable

GrpLb = grp2idx(GrpLb); % converting strings to integers

nGrp = length(unique(GrpLb)); % number of groups

[nSmp nDim] = size(DAT); % number of dimensions

%% ===== Select Principal Components ======

[coeff1 score1 lat1]= pca(DAT);

LatN = lat1/sum(lat1); % normalize latencies

Lsc = cumsum(LatN);

nPco = find(Lsc>0.95,1,’first’); % select those that explain 95% or more

PCO = coeff1(:,1:nPco); % select the 1st nPco eigenvectors

%% ------ Plot Latencies -----

figure(1);

plot(lat1);

set(gca,’xtick’,1:4);

ylabel(’Latencies’); xlabel(’Component No’);

%% ====== Apply Principal Components ======

RED = zeros(nSmp, nPco);

for i = 1:nSmp,

RED(i,:) = DAT(i,:) * PCO;

end

%% ----- Classify original and reduced data set --------

nFld = 5;

MdRaw = fitcnb(DAT, GrpLb, ’kfold’,nFld);

pc = 1-kfoldLoss(MdRaw); % percent correct = 1-error

fprintf(’Perc correct %1.2f (raw)\n’, pc*100);

MdRed = fitcnb(RED, GrpLb, ’kfold’,nFld);

pc = 1-kfoldLoss(MdRed); % percent correct = 1-error

fprintf(’Perc correct %1.2f (reduced)\n’, pc*100);

from sklearn.model_selection import KFold, cross_val_score

from matplotlib.pyplot import figure, plot, xlabel, ylabel

from sklearn.naive_bayes import GaussianNB

from sklearn import decomposition

from sklearn import datasets

#%% ------ Load Data -----

iris = datasets.load_iris()

DAT = iris.data

G = iris.target

#%% ------ Init -----

k_fold = KFold(n_splits=5) # prepare folds

#%% ====== Transform ========

Pca = decomposition.PCA(n_components=2)

Pca.fit(DAT)

RED = Pca.transform(DAT)

#%% ------ Plot Latencies -----

figure()

plot(Pca.explained_variance_)

ylabel(’Latencies’); xlabel(’Component No’);

#%% ----- Classify original and reduced data set --------

MdNB = GaussianNB()

Pcs = cross_val_score(MdNB, DAT, G, cv=k_fold)

print(’Original ’, Pcs.mean()*100)

Pcs = cross_val_score(MdNB, RED, G, cv=k_fold)


print(’Reduced ’, Pcs.mean()*100)
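
In scikit-learn the number of components can also be chosen by the fraction of variance to be explained, analogous to the 95% criterion in the Matlab script above. A minimal sketch, appended to the Python script above and reusing its variables DAT, G, MdNB and k_fold (PcaVar and REDV are illustrative names):

PcaVar = decomposition.PCA(n_components=0.95)   # keep as many components as needed for 95% variance
REDV = PcaVar.fit_transform(DAT)
print('# components for 95% of the variance:', PcaVar.n_components_)
Pcs = cross_val_score(MdNB, REDV, G, cv=k_fold)
print('Reduced (95% variance)', Pcs.mean()*100)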


G.10 Example Feature Selection: Ranking Features

% Create 10-dimensional data of which we make 6 significantly different.

% Function rankfeatures should identify those 6 significant ones.

clear; rng(’default’);

nP = 250; % # of points

nDim = 10;

% --- generate groups

Grp = round(rand(1,nP))+1;

Hg = hist(Grp,1:2);

% --- generate data

DAT = randn(nP,nDim); % normal (Gaussian) noise

Bg1 = Grp==1; % identify one group (logical vector)

Perm = randperm(nDim); % permutation

IxSig = Perm(1:6); % select 1st 6 and...

DAT(Bg1,IxSig) = DAT(Bg1,IxSig)+1; % ...make those significantly different

%% ===== Ranking ======

% Note: the function rankfeatures expects the data matrix DAT flipped!

[Otts Tts]= rankfeatures(DAT’, Bg1, ’criterion’, ’ttest’);

[Oent Ent]= rankfeatures(DAT’, Bg1, ’criterion’, ’entropy’);

[Oroc Roc]= rankfeatures(DAT’, Bg1, ’criterion’, ’roc’);

[Owcx Wcx]= rankfeatures(DAT’, Bg1, ’criterion’, ’wilcoxon’);

%% ----- Normalize & Plot

Tts = Tts / sum(Tts);

Ent = Ent / sum(Ent);

Roc = Roc / sum(Roc);

Wcx = Wcx / sum(Wcx);

figure(1);clf;

bar([Tts Ent Roc Wcx]);

legend(’t-Test’, ’Entropy’,’ROC’,’Wilcoxon’);

xlabel(’Dimension No.’);
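
rankfeatures belongs to Matlab's Bioinformatics Toolbox. In Python, a comparable univariate ranking could be obtained with scikit-learn's ANOVA F-test f_classif; the following is a minimal sketch that rebuilds a similar synthetic data set:

import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
nP, nDim = 250, 10
Grp = rng.integers(1, 3, nP)              # two groups, labels 1 and 2
DAT = rng.standard_normal((nP, nDim))     # normal (Gaussian) noise
IxSig = rng.permutation(nDim)[:6]         # select 6 dimensions and ...
DAT[np.ix_(Grp == 1, IxSig)] += 1         # ... make them significantly different

F, Pval = f_classif(DAT, Grp)             # ANOVA F-value per feature
Rank = np.argsort(F)[::-1]                # features ranked by F-value
print('Ranked features:', Rank)
print('p-values:', Pval[Rank])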


G.11 Function k-Fold Cross-Validation

% Generates labels for a k-fold cross-validation.

% taken from Matlab’s crossvalind.

% IN k # of folds

% Grp group variable [nSmp 1]. assumes E [1..k]

% OUT Fld vector of fold labels [nSmp 1], E [1..k]

% IxF list of indices into Fld

%

function [Fld IxF] = f_IxCrossVal(Grp, nFld)

% --- verify Group vector

Gu = unique(Grp);

nGrp = length(Gu);

assert(Gu(end)==nGrp,’Group variable not suitable: use 1,..,nGrp’);

%% ----- LOOP Groups

nSmp = length(Grp);

Fld = zeros(nSmp,1);

for g = 1:nGrp

IxGrp = find(Grp==g);

nMem = length(IxGrp); % # of members

PermMem = randperm(nMem); % permute them

IxMem = ceil(nFld*(1:nMem)/nMem); % fold indices

% and permute them to try to balance among all groups

PermFld = randperm(nFld); % permute the folds in order to balance

% randomly assign the id’s to the observations of this group

Fld(IxGrp(PermMem)) = PermFld(IxMem);

end

%% ----- List of indices

IxF = cell(nFld,1);

for i = 1:nFld

IxF{i} = find(Fld==i);

end

end % MAIN
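
The function above assigns stratified folds (each group is spread over all folds). In Python the same service is provided by scikit-learn's StratifiedKFold; a minimal sketch on dummy data:

import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
Grp = rng.integers(1, 4, 90)                     # group labels 1..3
DAT = rng.standard_normal((90, 4))               # dummy data

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
Fld = np.zeros(len(Grp), dtype=int)              # vector of fold labels
for f, (IxTrn, IxTst) in enumerate(skf.split(DAT, Grp), start=1):
    Fld[IxTst] = f                               # fold id for the test part
print(np.bincount(Fld)[1:])                      # fold sizes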


G.12 Example ROC

Calculates the ROC curve using a for-loop for instructional purposes (Appendix G.12.1 contains a function doing this using matrix commands only). Here, three overlapping distributions are analyzed, each one a bit more separated than the previous one; see the line Dat = [Sig-i; Bkg];, where i creates the separation.

clear all; close all;

% Generate data:

nSig = 20;

nBkg = 40;

Sig = randn(nSig,1); % signal

Bkg = randn(nBkg,1); % background

% Generate labels: 1=signal, 2=background

LbSig = ones(nSig,1);

LbBkg = ones(nBkg,1)*2;

Lb = [LbSig; LbBkg];

ntSmp = nSig+nBkg; % # of total samples

%% -------- 3 Different Degrees of Separation

Auc = [];

for i = 1:3

Dat = [Sig-i; Bkg]; % final data for this cycle

% ===== Moving threshold

[aTPR aFPR c] = deal([],[],0);

figure(1);subplot(2,2,1); cla; % clear axis

Th = unique(Dat)’; % generating thresholds

for t = Th

c = c+1; % increase counter

bLrg = Dat <= t; % decision

% ===== Evaluate

bHit = bLrg & Lb==1; % true pos / hits

bFaA = bLrg & Lb==2; % false pos / false alarms

aTPR(c) = nnz(bHit)/nSig;

aFPR(c) = nnz(bFaA)/nBkg;

% --- Plotting

figure(1);

subplot(2,2,1);

plot(Sig,zeros(nSig,1),’g.’); hold on;

plot(Bkg,zeros(nBkg,1),’r.’);

plot([t t], [0 0],’k*’);

subplot(2,2,2);

plot(aFPR,aTPR, ’b.-’); hold on;

set(gca,’xlim’, [0 1], ’ylim’, [0 1]);

% pause();

end

% Area under the curve:

aTPRmid = aTPR(1:end-1)+diff(aTPR)/2; % interpolated mid points

Auc(i) = 0.5 + abs(0.5 - aTPRmid * diff(aFPR’) );

end

%% ------ Area under the Curve Values

subplot(2,2,4);

bar(Auc);
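
For comparison, the same analysis can be carried out in Python with scikit-learn's roc_curve and roc_auc_score; a minimal sketch (the sign flip turns the "smaller is more signal-like" decision of the Matlab script into a score where larger means signal):

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
nSig, nBkg = 20, 40
Lb = np.r_[np.ones(nSig), np.zeros(nBkg)]          # 1=signal, 0=background

for i in (1, 2, 3):                                 # increasing separation
    Dat = np.r_[rng.standard_normal(nSig) - i, rng.standard_normal(nBkg)]
    Fpr, Tpr, Thr = roc_curve(Lb, -Dat)             # ROC curve points
    auc = roc_auc_score(Lb, -Dat)                   # area under the curve
    print('Separation', i, ' AUC %.3f' % auc)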


G.12.1 ROC Function

Calculates the ROC curve using matrix commands only.

% ROC curve for a signal and noise distribution and the corresponding value

% Area-under-the-Curve.

%

% IN Dsrb distribution with signal and noise (or one cat vs another)

% Bsig logical vector with points == 1 where the signal points are

% OUT C ROC curve [nPtsUnique 2]

% auc area under the curve

%

function [C auc] = f_RocCrv(Dsrb, Bsig)

if isrow(Bsig), Bsig = Bsig’; end % flip to make it column vector

nPsig = nnz(Bsig); % # of signal points

nPnos = length(Bsig)-nPsig; % # of noise points

[~,O] = sort(Dsrb); % sort the distribution

BsigO = Bsig(O); % re-order signal labels

Tpr = cumsum(BsigO) /nPsig; % true positive rate (hits)

Fpr = cumsum(~BsigO)/nPnos; % false positive rate (false alarms)

C = [Fpr Tpr]; % the ROC curve

% --- Area under the curve:

TprMid = Tpr(1:end-1)+diff(Tpr)/2; % interpolated mid points

auc = 0.5 + abs(0.5 - TprMid’ * diff(Fpr) ); % avoid below 0.5

end


G.13 Example Feature Selection: Sequential Forward Selection (Section 7.2.2)

clear

load fisheriris;

DAT = randn(150,10); % noise data

DAT(:,[1 3 5 7])= meas; % insert real data

Grp = grp2idx(species); % renaming group variable

[nSmp nFet] = size(DAT); % number of dimensions

%% ----- Init -----

CvPrt = cvpartition(Grp,’k’,10);

CostFun=@(TRN,Gtrn,TST,Gtst)(nnz(Gtst~=classify(TST,TRN,Gtrn,’quadratic’)));

Bsel = false(1,nFet); % none are selected at the beginning

HistErr = [];

tolFun = 1e-6; % termination tolerance on the relative error improvement

%% ================= LOOP FEATURES =================

for i = 1:nFet

%% ----- Select Subset ------

DIN = [DAT(:,Bsel), zeros(nSmp,1)];

IxRem = find(~Bsel);

nRem = length(IxRem);

%% ===== LOOP REMAINING =======

Err = zeros(1,nRem);

for k = 1:nRem

DIN(:,end) = DAT(:,IxRem(k));

% --- now the cross-validation: we obtain # of misses back

nMiss = crossval(CostFun,DIN,Grp,’partition’,CvPrt,’Mcreps’,1);

Err(k) = sum(nMiss)/ sum(CvPrt.TestSize); % error rate

end

[mxErr ixMin] = min(Err); % minimize the cost

%% ----- Apply Criterion --------

if ~isempty(HistErr)

oldErr = HistErr(end);

thr = oldErr - (abs(oldErr) + sqrt(eps)) * tolFun;

if mxErr > thr

break;

end

end

%% ----- Identify Selected --------

ixSel = IxRem(ixMin);

Bsel(ixSel) = true; % mark as selected feature

HistErr = [HistErr mxErr];

fprintf(’%d: feature %d, error %1.2f\n’, i, ixSel, mxErr);

end

fprintf(’Features selected: ’); fprintf(’%d ’, find(Bsel)); fprintf(’\n’);
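
In Python, recent scikit-learn versions (0.24 and later) offer a comparable wrapper, SequentialFeatureSelector. The following sketch fixes the number of selected features instead of using a tolerance criterion and employs quadratic discriminant analysis as in the Matlab script; take it as an approximation under those assumptions, not an exact translation:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)
iris = load_iris()
DAT = rng.standard_normal((150, 10))      # noise data
DAT[:, [0, 2, 4, 6]] = iris.data          # insert real data (0-based columns)
Grp = iris.target

Sfs = SequentialFeatureSelector(QuadraticDiscriminantAnalysis(),
                                n_features_to_select=4, direction='forward', cv=10)
Sfs.fit(DAT, Grp)
print('Features selected:', np.where(Sfs.get_support())[0])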


G.14 Clustering Example - K-Means

The overview in Appendix G.2 showed how to apply the software’s function; here is an explicit implementation of the principle.

clear;

%% ---- Artificial Dataset

nP = 20;

X = [randn(nP,2)+ones(nP,2); randn(nP,2)-ones(nP,2)];

nP = size(X,1);

nCls = 2;

%% === Using kmeans

[Lb CtrMb] = kmeans(X, nCls, ’dist’,’city’, ’rep’,5, ’disp’,’final’);

% ---- Cluster info

IXC = cell(nCls,1);

for i = 1:nCls

IXC{i} = find(Lb==i);

end

%% === Implementation of the Principle

IxCtr = randsample(nP,2);

Ctr = X(IxCtr,:); % initial centroids

D = zeros(size(X));

minErr = 0.1;

mxIter = 100;

for i = 1:mxIter

% === Distances Centroid to All

for c = 1:nCls

ctr = Ctr(c,:); % one centroid [1 nDim]

Df = bsxfun(@minus, ctr, X); % difference only

D(:,c) = sum(Df.^2,2); % square suffices (we don’t root)

end

% === Find Nearest

[v IxMin] = min(D,[],2);

% === Move Centroid (new location)

for c = 1:nCls

Ctr(c,:) = mean(X(IxMin==c,:));

end

end

% ---- Cluster info

IXC2 = cell(nCls,1);

for i = 1:nCls

IXC2{i} = find(IxMin==i);

end

%% ---- Plotting

figure(1); clf;

M = colormap;

Mr = M(randsample(64,64),:);

subplot(1,2,1); hold on;

for i = 1:nCls

plot(X(IXC{i},1),X(IXC{i},2),’.’, ’color’, Mr(i,:));

end

plot(CtrMb(:,1),CtrMb(:,2),’kx’);

plot(Ctr(:,1),Ctr(:,2),’ro’);

subplot(1,2,2); hold on;

for i = 1:nCls

plot(X(IXC2{i},1),X(IXC2{i},2),’.’, ’color’, Mr(i,:));

end

plot(CtrMb(:,1),CtrMb(:,2),’kx’);

plot(Ctr(:,1),Ctr(:,2),’ro’);
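
For the record, the all-in-one call in Python is scikit-learn's KMeans (note that it minimizes squared Euclidean distances; there is no city-block option); a minimal sketch:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
nP = 20
X = np.vstack([rng.standard_normal((nP, 2)) + 1,
               rng.standard_normal((nP, 2)) - 1])     # two shifted blobs

Km = KMeans(n_clusters=2, n_init=5, random_state=0).fit(X)
Lb = Km.labels_                       # cluster label per observation
Ctr = Km.cluster_centers_             # centroids [nCls nDim]
print(Ctr)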


G.14.1 Cluster Information and Plotting

Function to extract cluster information from the label array as returned by kmeans.

% Cluster info: centers, observation indices, member size.

% IN Cls vector with labels as produced by a clustering algorithm

% Pts observations (samples) [nObs nDim]

% minSize minimum cluster size

% strTyp info string

% OUT I .Cen centers

% .Ix indices to points

% .Sz cluster size

%

function I = f_ClsInfo(Cls, DAT, minSize, strTyp)

if nargin<3, minSize = 0; strTyp = ’’; end

if nargin<4, strTyp = ’’; end

nCls = max(Cls); % # of clusters (assuming E [1 nGrp])

nDim = size(DAT,2); % # of dimensions

H = hist(Cls, 1:nCls);

IxMinSz = find(H>=minSize);

I.n = length(IxMinSz); % # of cluster of interest

I.Cen = zeros(I.n,nDim,’single’); % centers [nCls nDim]

I.Ix = cell(I.n,1); % indices of observations

I.Sz = zeros(I.n,1,’single’); % member size (cluster cardinality)

for i = 1:I.n

bCls = Cls==IxMinSz(i); % identify the cluster indices

cen = mean(DAT(bCls,:),1); % center

I.Cen(i,:) = cen;

I.Ix{i} = single(find(bCls)); % actual obs indices

I.Sz(i) = nnz(bCls); % # of members in cluster

end

nP = size(DAT,1);

I.notUsed = nP-sum(I.Sz);

%% ---- Display

fprintf(’%2d Cls %9s Sz %1d-%2d #ObsNotUsed %d oo %d\n’, ...

I.n, strTyp, min(I.Sz), max(I.Sz), I.notUsed, nP);

end % MAIN

Plots clusters, if two-dimensional:

% Plotting 2D clusters.

% IN I struct with center and indices as generated by f_ClsInfo

% DAT observations (samples) [nObs 2]

function [] = p_ClsSimp(I, DAT)

%% ===== All Points in Black =====

plot(DAT(:,1),DAT(:,2),’k.’); hold on;

%% ----- Init Color for Clusters

colormap(’default’); % setting default colormap (avoiding grayscale)

CM = colormap; % obtain the colormap

nCol = size(CM,1);

% permute colormap (to avoid similar colors):

Perm = randperm(nCol); % permutation

CM = CM(Perm(1:2:end),:); % take only every 2nd one

%% ===== LOOP Clusters =====

for i = 1:I.n

Ix = I.Ix{i};

col = CM(i,:);

plot(DAT(Ix,1), DAT(Ix,2), ’o’, ’color’, col, ...

’markerfacecolor’, col); % cluster members

% --- plot center on top

cen = I.Cen(i,:);

plot(cen(1), cen(2), ’x’, ’markersize’, 10);

end


end % function


G.15 Hierarchical Clustering

The overview in Appendix G.2 showed how to apply the software’s function; here is a slightly more explicit use of the software’s functions, namely we now use the three functions pdist, linkage and cluster:

clear;

nP = 20;

rng(’default’);

%% All Random

PtsRnd = rand(nP,2); % all random

%% Arc & Square Grid

degirad = pi/180;

wd = 45*degirad;

nap = 10;

yyarc = cos(linspace(-wd,wd,nap))*(0.5)+0.4;

xxarc = linspace(.15,.85,nap);

nsp = 5;

yysqu = repmat(linspace(0.1,0.3,nsp),nsp,1); yysqu = yysqu(:);

xxsqu = repmat(linspace(0.3,0.7,nsp),1,nsp);

PtsPat = [xxarc’ yyarc’];

PtsPat = [PtsPat; [xxsqu’ yysqu]]; % append

%% Clustering Random

DisRnd = pdist(PtsRnd); % pairwise distances

LnkRnd = linkage(DisRnd, ’single’);

[Ln2Rnd NConRnd] = f_LnkTrans(LnkRnd);

ClsRnd = cluster(LnkRnd, ’cutoff’, 0.29, ’criterion’, ’distance’); % 1.14);

DisLnk = sort(LnkRnd(:,3), ’descend’);

DMrnd = squareform(DisRnd);

DMrnd(diag(true(nP,1))) = inf;

[DMrndO ORnd] = sort(DMrnd,2);

NNdi = DMrndO(:,1);

[mxNN1 ixNN1mx] = max(NNdi);

%% Clustering Pattern

DisPat = pdist(PtsPat);

LnkPat = linkage(DisPat, ’single’);

[Ln2Pat NConPat] = f_LnkTrans(LnkPat);

ClsPat = cluster(LnkPat, ’cutoff’, 1.15);

ClsPat = cluster(LnkPat, ’cutoff’, 0.11, ’criterion’, ’distance’);

%% General Stats

fprintf(’#Cls Rnd %d\n’, max(ClsRnd(:)));

fprintf(’#Cls Pat %d\n’, max(ClsPat(:)));

mxl = max([LnkRnd(:,3); LnkPat(:,3)])*1.05; % y-limit

%% Plotting

[rr cc] = deal(3,2);

figure(1); clf;

subplot(rr,cc,1);

scatter(PtsRnd(:,1), PtsRnd(:,2), 100, ClsRnd, ’filled’);

set(gca, ’xlim’, [0 1]);

set(gca, ’ylim’, [0 1]);

p_MST(PtsRnd, LnkRnd, ClsRnd);

title(’Random’, ’fontweight’, ’bold’, ’fontsize’, 12);

plot(PtsRnd(ixNN1mx,1), PtsRnd(ixNN1mx,2), ’k*’);

subplot(rr,cc,2);

scatter(PtsPat(:,1), PtsPat(:,2), 100, ClsPat, ’filled’);

set(gca, ’xlim’, [0 1]);

set(gca, ’ylim’, [0 1]);

p_MST(PtsPat, LnkPat, ClsPat);

title(’Pattern’, ’fontweight’, ’bold’,’fontsize’, 12);

subplot(rr,cc,3);

[HRnd TRng] = dendrogram(LnkRnd);

set(gca,’ylim’,[0 mxl], ’fontsize’, 7);

subplot(rr,cc,4);

[HPat TPat] = dendrogram(LnkPat, 40);

set(gca,’ylim’,[0 mxl], ’fontsize’, 7);

subplot(rr,cc,5);

p_MST2(PtsRnd, Ln2Rnd, ClsRnd, ’entirenum’);

%plot(DisLnk, ’.-’);

subplot(rr,cc,6);


p_MST2(PtsPat, Ln2Pat, ClsPat, ’entire’);
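
A comparable pipeline in Python is available through SciPy's scipy.cluster.hierarchy; a minimal sketch covering distance computation, single linkage, cutting by distance and the dendrogram (assuming SciPy and matplotlib are installed):

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
Pts = rng.random((20, 2))                        # random points

Dis = pdist(Pts)                                 # pairwise distances
Lnk = linkage(Dis, method='single')              # single-link clustering
Cls = fcluster(Lnk, t=0.29, criterion='distance')
print('# clusters:', Cls.max())

plt.figure()
dendrogram(Lnk)
plt.show()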

G.15.1 Three Functions

Three functions for the above script now follow. The first one rearranges the linkage output; the remaining two are plotting functions.

• Linkage transform:

%TRANSZ Translate output of LINKAGE into another format.

% This is a helper function used by DENDROGRAM and COPHENET.

% In LINKAGE, when a new cluster is formed from cluster i & j, it is

% easier for the latter computation to name the newly formed cluster

% min(i,j). However, this definition makes it hard to understand

% the linkage information. We choose to give the newly formed

% cluster a cluster index M+k, where M is the number of original

% observation, and k means that this new cluster is the kth cluster

% to be formed. This helper function converts the M+k indexing into

% min(i,j) indexing.

function [Z Ncon] = f_LnkTrans(Z)

nL = size(Z,1)+1; % # of leaves

for i = 1:(nL-1)

if Z(i,1) > nL, Z(i,1) = traceback(Z,Z(i,1)); end

if Z(i,2) > nL, Z(i,2) = traceback(Z,Z(i,2)); end

if Z(i,1) > Z(i,2),Z(i,1:2) = Z(i,[2 1]); end

end

Pairs = Z(:,1:2);

Ncon = histc(Pairs(:),1:nL); % # of connections/links

%%

function a = traceback(Z,b)

nL = size(Z,1)+1; % # of leaves

if Z(b-nL,1) > nL, a = traceback(Z,Z(b-nL,1));

else a = Z(b-nL,1); end

if Z(b-nL,2) > nL, c = traceback(Z,Z(b-nL,2));

else c = Z(b-nL,2); end

a = min(a,c);

• Plotting MST, version I:

% Plots minimum spanning tree (single-link clustering) for all points

% and for the individual clusters of Cls.

%

function [] = p_MST(Pts, Lnk, Cls, type)

if ~exist(’type’, ’var’), type = ’’; end

hold on;

nPtot = size(Pts,1);

nL = size(Lnk,1);

if nPtot~=(nL+1), error(’Lnk probably not correct: #Pts %d, #Lnk %d’, nPtot, nL); end

if nPtot==1, pp_Singleton(Pts); return; end

Dis = Lnk(:,3); % distances

Lnk = Lnk(:,1:2); % cluster indices (ix to points and intermed clusters)

maxDist = max(Dis);

Sim = 1.1-Dis./maxDist; % similarity for linewidth

if any(Sim<eps),

warning(’linewidth < 0: %1.5f’, min(Sim));

end

%% ============ ENTIRE MST


Cen = zeros(nL,2);

LnkVec = zeros(nL,2,2,’single’);

for i = 1:nL

Ixp = Lnk(i,:); % pair indices

bLef = Ixp<=nPtot; % leaves

if all(bLef), % both are leaves (points)

Xco = Pts(Ixp,1);

Yco = Pts(Ixp,2);

elseif sum(bLef)==1 % one is a leaf (point), the other a cluster

if bLef(1), ixp = Ixp(1); ixc = Ixp(2);

else ixp = Ixp(2); ixc = Ixp(1);

end

Xco = [Pts(ixp,1); Cen(ixc-nPtot,1)];

Yco = [Pts(ixp,2); Cen(ixc-nPtot,2)];

else % both are clusters

Xco = Cen(Ixp-nPtot,1);

Yco = Cen(Ixp-nPtot,2);

end

Cen(i,:) = mean([Xco Yco],1);

LnkVec(i,:,:) = [Xco Yco];

% --- prints entire tree if desired

if strcmp(type, ’entire’)

hp = plot(Xco, Yco, ’color’, ones(1,3)*0.5, ’linestyle’, ’-’);

set(hp, ’linewidth’, Sim(i)*4);

end

end

%% ============ CLUSTER MST

if iscell(Cls),nCls = length(Cls);

else nCls = max(Cls);

end

for i = 1:nCls

if iscell(Cls), IxG = Cls{i}; % pt ixs of cluster (group)

else IxG = find(Cls==i);

end

szG = length(IxG); % group size (#Pts)

if szG==1, pp_Singleton(Pts(IxG,:)); continue; end

Brg = [];

for k = 1:szG

bOcc = Lnk==IxG(k); % find leafs in tree

IxOcc = find(sum(bOcc,2)); % indices

IxL = Lnk(IxOcc,:);

IxB = sum(IxL,2)+nPtot;

Brg = [Brg; setdiff(IxL(:),IxG(k))];

for l = IxOcc

Xco = LnkVec(l,:,1);

Yco = LnkVec(l,:,2);

hp = plot(Xco, Yco, ’color’, ’k’);

set(hp, ’linewidth’, Sim(l)*4);

end

end

B = false(nL,2);

for k = 1:length(Brg)

if Brg(k)<=nPtot, continue; end

B(Lnk==Brg(k)) = true;

end

IxB = []; % find(B(:,1)&B(:,2));

for l = IxB’

Xco = LnkVec(l,:,1);

Yco = LnkVec(l,:,2);

hp = plot(Xco, Yco, ’color’, ’g’);

set(hp, ’linewidth’, Sim(l)*4);

end

% --- connect group’s center point to remaining points

PtsSel = Pts(IxG,:);

cen = mean(PtsSel,1);

for k = 1:szG


plot([PtsSel(k,1) cen(1)], [PtsSel(k,2) cen(2)], ’color’, ones(1,3)*0.7);

end

end

%% ------------ Singleton Point

function [] = pp_Singleton(Pts)

plot(Pts(1), Pts(2), ’ko’, ’markersize’, 5);

plot(Pts(1), Pts(2), ’ko’, ’markersize’, 10);

• Plotting MST, version II:

% Plots minimum spanning tree (single-link clustering) for all points

% and for the individual clusters of Cls.

% sa p_MST

function [] = p_MST2(Pts, Lnk, Cls, type)

if ~exist(’type’, ’var’), type = ’’; end

hold on;

nPtot = size(Pts,1);

nL = size(Lnk,1);

if nPtot~=(nL+1), error(’Lnk probably not correct: #Pts %d, #Lnk %d’, nPtot, nL); end

if nPtot==1, pp_Singleton(Pts); return; end

Dis = Lnk(:,3); % distances

Lnk = Lnk(:,1:2); % cluster indices (ix to points and intermed clusters)

maxDist = max(Dis);

Sim = 1.1-Dis./maxDist; % similarity for linewidth

%% ============ ENTIRE MST

Cen = zeros(nL,2);

%LnkVec = zeros(nL,2,2,’single’);

for i = 1:nL

Ixp = Lnk(i,:); % pair indices

Xco = Pts(Ixp,1);

Yco = Pts(Ixp,2);

Cen(i,:)= mean([Xco Yco],1);

%LnkVec(i,:,:) = [Xco Yco];

% --- prints entire tree if desired

if strfind(type, ’entire’)

hp = plot(Xco, Yco, ’color’, ones(1,3)*0.5, ’linestyle’, ’-’);

set(hp, ’linewidth’, Sim(i)*4);

end

end

%% ============

if strfind(type, ’num’)

for i = 1:nPtot

Pt = double(Pts(i,:));

text(Pt(1), Pt(2), num2str(i), ’fontsize’, 8);

end

end

return

%% ------------ Singleton Point

function [] = pp_Singleton(Pts)

plot(Pts(1), Pts(2), ’ko’, ’markersize’, 5);

plot(Pts(1), Pts(2), ’ko’, ’markersize’, 10);


G.16 Classification Example - Decision Tree

clear;

% --- Load the Data

load ionosphere

DAT = X; % renaming data variable

GrpLb = Y; % renaming group variable

GrpLb = grp2idx(GrpLb); % converting strings to integers

% --- Data Info:

[nSmp nFet] = size(DAT);

GrpU = unique(GrpLb);

nGrp = length(GrpU);

fprintf(’# Samples %d # Features %d # Classes %d\n’, nSmp, nFet, nGrp);

%% ===== All-In-One Function ======

TreeCV = fitctree(DAT, GrpLb, ’kfold’,5);

pcTree = 1-kfoldLoss(TreeCV);

fprintf(’Perc correct for Tree %1.4f\n’, pcTree*100);

view(TreeCV.Trained{1},’Mode’,’graph’);

%% ===== Folding Explicitly (2 folds only) ======

Btrn = false(nSmp,1);

IxTrn = randsample(nSmp,round(0.5*nSmp)); % random training indices

Btrn(IxTrn) = true; % logical vector with training=ON

Btst = Btrn==false; % logical vector with testing=ON

Tree = fitctree(DAT(Btrn,:), GrpLb(Btrn));

LbPred = predict(Tree,DAT(Btst,:));

Bhit = LbPred==GrpLb(Btst); % binary vector with hits equal 1

cTst = nnz(Btst);

fprintf(’Perc correct %1.3f\n’, nnz(Bhit)/cTst*100);

view(Tree,’Mode’,’graph’);

%% ===== Test Different Leave Sizes ======

SzLeaf = logspace(1,2,10);

nSz = numel(SzLeaf);

Pc = zeros(nSz,1);

for i = 1:nSz

Tree = fitctree(DAT,GrpLb, ’kfold’,5, ’MinLeaf’,SzLeaf(i));

Pc(i) = 1-kfoldLoss(Tree);

end

%% ----- Plotting -----

figure(3); clf; hold on;

plot(SzLeaf,Pc*100);

plot([1 SzLeaf(end)],ones(1,2)*pcTree*100);

xlabel(’Min Leaf Size’);

ylabel(’Perc Correct’);
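
The Python counterpart is scikit-learn's DecisionTreeClassifier, whose parameter min_samples_leaf plays the role of MinLeaf above. A minimal sketch on the iris data (the ionosphere set is not bundled with scikit-learn):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()
DAT, Grp = iris.data, iris.target

for szLeaf in (1, 2, 5, 10, 20, 50):                      # minimum leaf sizes
    Tree = DecisionTreeClassifier(min_samples_leaf=szLeaf, random_state=0)
    Pcs = cross_val_score(Tree, DAT, Grp, cv=5)
    print('Min leaf size %2d  perc correct %.2f' % (szLeaf, Pcs.mean() * 100))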


G.17 Classification Example - Ensemble Voting

The following Matlab example shows how to program a voting classifier explicitly; it is essentially only the maximum function that represents the crucial part. Everything else is as introduced before.

clear;

% --- Load the data and rename

load fisheriris % a famous data set with 150 samples and 3 classes

DAT = meas(:,1:3); % renaming data variable

GrpLb = species; % renaming group variable

GrpLb = grp2idx(GrpLb); % converting strings to integers

% --- Data Info:

[nSmp nFet] = size(DAT);

GrpU = unique(GrpLb);

nGrp = length(GrpU);

fprintf(’# Samples %d # Features %d # Classes %d\n’, nSmp, nFet, nGrp);

% --- Folds

nFld = 5;

IxFlds = crossvalind(’kfold’, GrpLb, nFld);

%% --------- LOOP FOLDS ----------------

Pc = zeros(nFld,5);

for i = 1:nFld

% --- Prepare fold

Btst = IxFlds==i; % logical vector identifying testing samples

Btrn = ~Btst; % logical vector identifying training samples

Grp.Tren = GrpLb(Btrn); % select group labels for training

Grp.Test = GrpLb(Btst); % select group labels for testing

nTst = length(Grp.Test);

% --- Training Individual

MdNB = fitcnb(DAT(Btrn,:), Grp.Tren);

MdDC = fitcdiscr(DAT(Btrn,:), Grp.Tren);

MdRF = TreeBagger(10, DAT(Btrn,:), Grp.Tren);

% --- Testing Individual for Comparison

[PrdNB ScNB] = predict(MdNB, DAT(Btst,:));

[PrdDC ScDC] = predict(MdDC, DAT(Btst,:));

[PrdRF ScRF] = predict(MdRF, DAT(Btst,:));

pcNB = nnz(PrdNB==Grp.Test)/nTst*100;

pcDC = nnz(PrdDC==Grp.Test)/nTst*100;

pcRF = nnz(cellfun(@str2num,PrdRF)==Grp.Test)/nTst*100;

Pc(i,1:3) = [pcNB pcDC pcRF];

% --- Testing Ensemble

SC = cat(3,ScNB,ScDC,ScRF); % [nTst nCat nClassifiers]

ScSum = sum(SC,3); % simple voting

ScMax = max(SC,[],3); % max-combination rule

[~,PrdSum] = max(ScSum,[],2);

[~,PrdMax] = max(ScMax,[],2);

Pc(i,4) = nnz(PrdSum==Grp.Test)/nTst*100;

Pc(i,5) = nnz(PrdMax==Grp.Test)/nTst*100;

end

fprintf(’Perc correct %1.2f\n’, mean(Pc,1));

The following Python code exemplifies how to use the all-in-one function.

from sklearn import datasets

from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LogisticRegression

from sklearn.naive_bayes import GaussianNB

from sklearn.ensemble import RandomForestClassifier, VotingClassifier

iris = datasets.load_iris()

DAT, Grp = iris.data[:, 1:3], iris.target

## ------- Individual Classifers

CfLR = LogisticRegression(random_state=1)

CfRF = RandomForestClassifier(random_state=1)

CfNB = GaussianNB()

## ------- Ensemble Classifier


CfEns = VotingClassifier(estimators=[(’lr’, CfLR),

(’rf’, CfRF),

(’gnb’, CfNB)], voting=’hard’)

## ------- Evaluation

for Clf, Lb in zip([CfLR, CfRF, CfNB, CfEns],

[’Logistic Regression’, ’Random Forest’, ’naive Bayes’, ’Ensemble’]):

Scrs = cross_val_score(Clf, DAT, Grp, cv=5, scoring=’accuracy’)

print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (Scrs.mean(), Scrs.std(), Lb))


G.18 Classification Example - Random Forest

clear;

% --- Load the Data

load ionosphere

DAT = X; % renaming data variable

GrpLb = Y; % renaming group variable

GrpLb = grp2idx(GrpLb); % converting strings to integers

% --- Data Info:

[nSmp nFet] = size(DAT);

GrpU = unique(GrpLb);

nGrp = length(GrpU);

fprintf(’# Samples %d # Features %d # Classes %d\n’, nSmp, nFet, nGrp);

%% ===== Single Tree (for comparison) ======

Tree = fitctree(DAT, GrpLb, ’kfold’,5);

pcTree = 1-kfoldLoss(Tree);

fprintf(’Perc correct for Tree %1.4f\n’, pcTree*100);

%% ===== Random Forest ========

NWk = [1 2 5 10 20 50]; % weak classifiers

nWkt = length(NWk);

Pc = zeros(nWkt,1);

figure(1); clf; hold on;

for k = 1:nWkt

nWk = NWk(k);

Forest = TreeBagger(nWk, DAT, GrpLb, ’OOBPred’,’on’);

Pcs = 1-oobError(Forest);

Pc(k) = mean(Pcs)*100;

figure(1); plot(Pcs); pause(.2);

end

%% ===== Folding Explicitly (2 folds only) ======

Btrn = false(nSmp,1);

IxTrn = randsample(nSmp,round(0.5*nSmp)); % random training indices

Btrn(IxTrn) = true; % logical vector with training=ON

Btst = Btrn==false; % logical vector with testing=ON

Forest = TreeBagger(20, DAT(Btrn,:), GrpLb(Btrn), ’OOBPred’,’on’);

LbPredStr = predict(Forest, DAT(Btst,:)); % string labels!!

% ---- Convert string labels to numeric labels

cTst = nnz(Btst);

LbPred = cellfun(@str2num, LbPredStr); % converts string labels to scalar

% ---- Calculate prediction

Bhit = LbPred==GrpLb(Btst); % logical vector with hits equal 1

fprintf(’Perc correct %1.3f\n’, nnz(Bhit)/cTst*100);

%% ----- Plot Results

figure(2);clf;

plot(NWk,Pc); hold on;

plot([1 NWk(end)],ones(1,2)*pcTree*100);

legend(’Forest’,’Single Tree’);

xlabel(’# of Weak Learners’);

ylabel(’Perc Correct’);
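
Analogously, scikit-learn's RandomForestClassifier can be swept over the number of weak learners (parameter n_estimators); a minimal sketch, again on the iris data:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()
DAT, Grp = iris.data, iris.target

for nWk in (1, 2, 5, 10, 20, 50):                  # number of weak learners (trees)
    Forest = RandomForestClassifier(n_estimators=nWk, random_state=0)
    Pcs = cross_val_score(Forest, DAT, Grp, cv=5)
    print('%2d trees  perc correct %.2f' % (nWk, Pcs.mean() * 100))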


G.19 Example Density Estimation

G.19.1 Histogramming and Parzen Window

The following script compares histogramming with density smoothing using Parzen windows:

clear; rng(’default’);

%% ----- An Artificial Data Set

X = [randn(30,1)*5; 10+rand(60,1)*8]; % synthetic data

nP = length(X); % number of data points

%% ===== Histogramming

Edg = linspace(-15,20,35); % edges for bins

H = histcounts(X, Edg); % histogramming

%% ===== Density Estimation

[Pz Ve] = ksdensity(X, Edg(1:end-1)+0.5); % parzen window

%% ----- Plotting

figure(1); clf;

plot(X, zeros(nP,1)-.5,’k.’,’markersize’,12); hold on;

bar(Edg(1:end-1), H, ’histc’);

plot(Ve, Pz*nP, ’g.-’,’markersize’,12);

legend(’Data’, ’Histogram’, ’Kernel’, ’location’, ’northwest’);

set(gca,’ylim’, [-.95 max(H(:))]);

set(gcf,’paperposition’,[0 0 9 4]);

%% ===== Density Estimation: own implementation

bandWth = 1;

PtEv = linspace(min(X),max(X),nP); % locations of evaluation (span the data range)

PzOwn = zeros(nP,1);

for i = 1:nP

PzOwn(i) = sum(pdf(’Normal’, X, PtEv(i), bandWth)) / (nP*bandWth);

end

PzMlb = ksdensity(X,PtEv,’width’,bandWth);% for comparison

% --- Plotting

figure(2);clf;

plot(PtEv, PzOwn, ’g.’); hold on;

plot(PtEv, PzMlb, ’b’);
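
In Python, a kernel density estimate in the spirit of ksdensity is available through SciPy's gaussian_kde (its bandwidth is chosen automatically by Scott's rule unless specified); a minimal sketch:

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
X = np.r_[rng.standard_normal(30) * 5, 10 + rng.random(60) * 8]   # synthetic data

Kde = gaussian_kde(X)                         # Parzen-type estimate, automatic bandwidth
PtEv = np.linspace(X.min(), X.max(), 100)     # locations of evaluation
Pz = Kde(PtEv)                                # density at the evaluation points
print(Pz[:5])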

G.19.2 N-Dimensional Histogramming

A simple function for histogramming in multiple dimensions:

% N-dim histogram.

% IN DAT data matrix [nPts nDim]

% nEdg # of edges (# bins = nEdg-1)

% OUT H n-dimensional histogram [nEdg nEdg ...nEdg]

% aEdg list of edges for each dimension

function [H aEdg] = histcn(DAT, nEdg)

[nP nDim] = size(DAT);

IX = zeros(nP,nDim,’single’);

aEdg = cell(nDim,1);

for i = 1:nDim

Dat = DAT(:,i);

Edg = linspace(min(Dat),max(Dat),nEdg);

[~,Ix] = histc(Dat, Edg);

IX(:,i) = Ix;

aEdg{i} = single(Edg);

end

H = single(accumarray(IX,1,ones(1,nDim,’single’)*nEdg));
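
In Python the same service is provided directly by NumPy's histogramdd; a minimal sketch:

import numpy as np

rng = np.random.default_rng(0)
DAT = rng.standard_normal((500, 3))           # data matrix [nPts nDim]
H, aEdg = np.histogramdd(DAT, bins=10)        # 10 bins per dimension
print(H.shape, len(aEdg))                     # n-dimensional histogram and edge list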


G.19.3 Gaussian Mixture Model

An example of how to apply the function gmdistribution:

clear

rng(’default’);

X = [randn(30,1)*5; 10+rand(60,1)*8]; % synthetic data

nP = length(X); % number of data points

EvPt = linspace(min(X),max(X),120);

Ogm = gmdistribution.fit(X,2); % we assume 2 peaks

Gm = pdf(Ogm,EvPt’); % create the estimate

[Pz Ve] = ksdensity(X, EvPt); % parzen window for comparison

%% ---- Plotting

figure(1); clf; hold on;

plot(Ve, Gm*nP,’.m’);

plot(Ve, Pz*nP, ’g’);

plot(X, zeros(nP,1)-.5,’.’);

legend(’GMM’, ’Parzen Win’, ’location’, ’northwest’);

set(gca,’ylim’, [-.8 max([Ve(:); max(Gm(:))])]);
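
The Python equivalent of gmdistribution is scikit-learn's GaussianMixture; a minimal sketch on comparable synthetic data (score_samples returns log-densities, hence the exponentiation):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.r_[rng.standard_normal(30) * 5, 10 + rng.random(60) * 8].reshape(-1, 1)

Gmm = GaussianMixture(n_components=2).fit(X)       # we assume 2 peaks
EvPt = np.linspace(X.min(), X.max(), 120).reshape(-1, 1)
Dens = np.exp(Gmm.score_samples(EvPt))             # pdf at the evaluation points
print('Means:', Gmm.means_.ravel())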


G.20 Classification Example - SVM

The overview in G.1 already gave two examples. Here we give an example of how to use a user-defined kernel function in Matlab. The kernel we have defined is the so-called histogram intersection kernel. It is useful for data that represent histograms.

tmpSVM = templateSVM(’KernelFunction’,’f_KrnHistItsSVM’, ...

’standardize’,false, ’ClassNames’,1:nGrp);

Mdl = fitcecoc(DAT, Grp, ’kfold’,5, ’learners’,tmpSVM);

pc = 1-kfoldLoss(Mdl);

You place the following function into a separate script. Note that you cannot change the variable names U and V: Matlab requires specific variable names when using a self-written kernel function; see the Matlab documentation for details.

% Histogram intersection kernel.

%

% Verify: M = f_KrnHistIts(round(rand(5,3)*10),round(rand(3,3)*10));

%

function M = f_KrnHistItsSVM(U, V)

[n1 nDim] = size(U);

n2 = size(V,1);

M = zeros(n1,n2,’single’);

for i = 1:n1

M(i,:) = sum(bsxfun(@min,U(i,:),V),2);

end

end % MAIN
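
In scikit-learn a user-defined kernel can be passed as a callable to SVC; the callable receives two data matrices and must return the corresponding Gram matrix. A minimal sketch of the histogram intersection kernel under that assumption (krn_hist_its is an illustrative helper name, and the iris features merely stand in for genuine histogram data):

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

def krn_hist_its(U, V):
    # histogram intersection kernel: sum of element-wise minima
    return np.array([[np.minimum(u, v).sum() for v in V] for u in U])

iris = load_iris()
DAT, Grp = iris.data, iris.target             # non-negative features, histogram-like

Mdl = SVC(kernel=krn_hist_its)
Pcs = cross_val_score(Mdl, DAT, Grp, cv=5)
print('Perc correct', Pcs.mean() * 100)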


G.21 Classification Example - Naive Bayes

The overview in G.1 already gave an example of how to apply the all-in-one function in Matlab and Python. Here we give an explicit example of how one could implement a Naive Bayes model in Matlab.

clear;

rng(’default’);

S1 = [2 1.5; 1.5 3]; % covariance for multi-variate normal distribution

MuCls1 = [0.3 0.5]; % two means (mus) for class 1

MuCls2 = [3.2 0.5]; % " " " " class 2

PC1 = mvnrnd(MuCls1, S1, 50); % training class 1

TEST = mvnrnd(MuCls1, S1, 30); % testing (class 1)

PC2 = mvnrnd(MuCls2, S1, 50); % training class 2

TREN = [PC1; PC2]; % training set

Grp = [ones(size(PC1,1),1); ones(size(PC2,1),1)*2]; % group variable

%% ========== NAIVE BAYES =============

[nCat nDim] = deal(2,2);

% ===== Build class information for TRAINING set:

AVG = zeros(nCat,nDim);

[COV COVInv] = deal(zeros(nCat,nDim,nDim));

CovDet = zeros(nCat,1);

for k = 1:nCat

TrnCat = TREN(Grp==k, :); % [nCatSamp nDim]

AVG(k,:) = mean(TrnCat); % [nCat nDim]

CovCat = cov(TrnCat); % [nDim nDim]

COV(k,:,:) = CovCat; % [nCat, nDim, nDim]

CovDet(k) = det(CovCat); % determinant

COVInv(k,:,:) = pinv(CovCat); % p inverse

end

% ===== Testing a (single) sample with index ix (from TESTING set):

Prob = zeros(nCat,1); % initialize probabilites

for k = 1:nCat

detCat = abs(CovDet(k)); % retrieve class determinant

CovInv = squeeze(COVInv(k,:,:)); % retrieve class inverse

fct = 1/( ( (2*pi)^(nDim/2) )*sqrt(detCat) +eps);

Df = AVG(k,:)-TEST(1,:); % diff between avg and sample

Mah = (Df * CovInv * Df’)/2; % Mahalanobis distance

Prob(k) = fct * exp(-Mah); % probability for this class

end

[mxc ixc] = max(Prob); % final decision (class winner)
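
A compact Python sketch of the same computation is possible with SciPy's multivariate_normal for the class-conditional densities; it follows the assumptions of the Matlab example above (shared covariance, two classes) and classifies a single test sample:

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
S1 = np.array([[2, 1.5], [1.5, 3]])                  # covariance
MuCls = np.array([[0.3, 0.5], [3.2, 0.5]])           # class means
TREN = np.vstack([rng.multivariate_normal(MuCls[0], S1, 50),
                  rng.multivariate_normal(MuCls[1], S1, 50)])
Grp = np.r_[np.ones(50), np.ones(50) * 2]            # group variable
test = rng.multivariate_normal(MuCls[0], S1)         # a single test sample (class 1)

Prob = np.zeros(2)
for k in (1, 2):
    TrnCat = TREN[Grp == k]                          # training samples of class k
    Dist = multivariate_normal(TrnCat.mean(axis=0), np.cov(TrnCat, rowvar=False))
    Prob[k - 1] = Dist.pdf(test)                     # class-conditional density
print('Winner class:', np.argmax(Prob) + 1)          # final decision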


G.22 Clustering Example - Fuzzy C-Means

% Fuzzy c-Means.

% IN DAT data matrix [nObs nDim]

% nCls k (no. of clusters)

% Opt options:

% .expU exponent for the matrix U (default: 2.0)

% .mxIter maximum number of iterations (default: 100)

% .minImprov minimum amount of improvement (default: 1e-5)

% .bDisp info display during iteration (default: true)

%

% OUT Cen cluster centers [nCls nDim]

% U membership grade matrix [nCls nObs]

% 0 = no; 1 = full membership.

% spd spread (objective function: here it is sum of distances)

%

function [Cen U spd] = f_KmnsFuz(DAT, nCls, Opt)

%% ---------- Options ------------

OptDef = struct(’expU’,2, ’mxIter’,100, ’minImprov’,1e-5, ’bDisp’,true);

if nargin==2,

Opt = OptDef;

else

% verifying exponent

assert(Opt.expU>=1, ’The exponent should be >= 1!’);

end

expo = Opt.expU; % exponent for U

mxIter = Opt.mxIter; % max iteration

minImpro = Opt.minImprov; % min improvement

bDisp = Opt.bDisp; % display progress

%% ---------- Init

[nObs nDim] = size(DAT);

Spd = zeros(mxIter,1); % array for objective function

DM = zeros(nCls, nObs, ’single’);

RepDim = ones(nDim,1,’single’);

RepObs = ones(nObs,1,’single’);

RepCls = ones(nCls,1,’single’);

% --- Init U: memberships must sum to 1 per observation, as required by fuzzy c-means

U = rand(nCls,nObs,’single’);

Usum = sum(U);

U = U./Usum(RepCls,:);

%% ========= LOOP ===========

for i = 1:mxIter,

Uexp = U.^expo;

Cen = Uexp*DAT./((RepDim*sum(Uexp,2)’)’); % new centers

% ===== Distance Matrix Cen-DAT ======

if nDim>1,

for k = 1:nCls

DM(k,:) = sqrt(sum((DAT-RepObs*Cen(k,:)).^2,2));

end

else % 1-D data

for k = 1:nCls

DM(k,:) = abs(Cen(k)-DAT)’;

end

end

% --- spread and new U

spd = sum(sum((DM.^2).*Uexp)); % objective function

Dpo = DM.^(-2/(expo-1)); % new U, suppose expo != 1

U = Dpo./(RepCls*sum(Dpo));

Spd(i) = spd;

if bDisp, fprintf(’%3d spread %4.1f\n’, i, spd); end


% --- break if hardly any improvement

if i>1,

if abs(spd-Spd(i-1)) < minImpro,

break;

end

end

end % for iteration

end % MAIN


G.23 Clustering Example - DBSCAN

% Fast density-based clustering with the DBSCAN algorithm.

% IN DAT data [nObs nDim]

% minPts min no. of points per cluster

% epsi max radius

% OUT Grp vector with cluster labels

% Typ vector with point type: -1=outlier; 0=border; 1=core

% epsi calculated radius if not specified as 3rd input argument

function [Grp Typ epsi] = f_DbScan(DAT, minPts, epsi)

[nObs nDim] = size(DAT);

% --- estimate eps if not present

if nargin<3 || isempty(epsi)

rp = prod(range(DAT,1));

dv = nObs*sqrt(pi.^nDim);

epsi = (rp*minPts*gamma(.5*nDim+1)/dv).^(1/nDim);

end

% --- Init

ObsNo = 1:nObs;

Typ = zeros(1,nObs,’single’); % core/border/outlier

Grp = zeros(1,nObs,’single’); % group variable

Usd = false(nObs,1); % used/visited

cGrp = 1; % group counter

%% ========== LOOP Observations =========

for i = 1:nObs

if Usd(i), continue; end

% neighborhood:

Dis = ff_Dist(i);

IxN = find(Dis<=epsi); % neighbor indices for radius = epsi

nN = length(IxN); % # neighbors

% ---- exactly one neighbor (itself)

if nN==1

Typ(i) = -1; % mark as outlier

Grp(i) = -1; % mark as outlier

Usd(i) = true; % mark as used (visited)

end

% ---- a few neighbors

if nN>1 && nN<minPts

Typ(i) = 0; % mark as border

Grp(i) = 0; % mark as border

end

% ---- sufficient neighbors

if nN>=minPts;

Typ(i) = 1; % core

Grp(IxN) = ones(nN,1)*cGrp;

% ====== LOOP NEIGHBORS ======

while ~isempty(IxN)

ix1 = IxN(1);

Usd(ix1)= true;

IxN(1) = [];

% neighborhood II:

Dis = ff_Dist(ix1);

IxN2 = find(Dis<=epsi);

nN2 = length(IxN2);

if nN2>1

Grp(IxN2) = cGrp;

obsNo = ObsNo(ix1);

if nN2 >= minPts;

Typ(obsNo) = 1; % core

else

Typ(obsNo) = 0; % border

end

% ---- Loop Neighbors II


for j = 1:nN2

ixN2 = IxN2(j);

if Usd(ixN2), continue; end

Usd(ixN2) = true;

Grp(ixN2) = cGrp;

IxN = [IxN ixN2];

end

end

end

cGrp = cGrp+1;

end

end % for nObs

IxG0 = find(Grp==0); % identify unused

Grp(IxG0) = -1; % mark as outliers

Typ(IxG0) = -1; % mark as outliers

% ------ Distance Sample/Observation to all Others

function Dis = ff_Dist(ix)

obs = DAT(ix,:);

Df = ones(nObs,1)*obs-DAT;

Dis = sqrt(sum(Df.^2, 2))’;

if nDim==1, Dis = abs(Df)’; end

end

end
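
For comparison, scikit-learn ships a DBSCAN implementation whose parameters eps and min_samples correspond to epsi and minPts above; a minimal sketch on two synthetic blobs:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((50, 2)) * 0.2,
               rng.standard_normal((50, 2)) * 0.2 + 2])   # two well-separated blobs

Db = DBSCAN(eps=0.3, min_samples=5).fit(X)
Grp = Db.labels_                     # cluster labels, -1 marks outliers
print('# clusters:', Grp.max() + 1, '  # outliers:', np.sum(Grp == -1))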


G.24 Clustering Example - Clustering Tendency

A function for measuring the presence of clusters in a dataset. The corresponding testing script is appended.

% Measuring cluster tendency with the Hopkins-Test for Randomness.

% sa ThKo p901, pdf908

%

% IN PTS observations [nPts nDim]

% OUT hop Hopkins measure:

% 0.5 random

% >0.5 clusters present

% <0.5 regularity present

%

function hop = f_RndHopkins(PTS)

[nPts nDim] = size(PTS);

%% ----- Samples

nSmp = round(nPts*0.1); % take a fraction of all samples

PRND = rand(nSmp,nDim); % generate random samples (X’)

Ornd = randperm(nPts); % generate random order

IxSmp = Ornd(1:nSmp);

PSMP = PTS(IxSmp,:); % samples from PTS (X1)

%% ----- Calculate NN distances

[~,DOW] = knnsearch(PSMP, PSMP, ’k’, 2); % delta_j

[~,Dsr] = knnsearch(PSMP, PRND); % d_j

%% ----- Hopkins

sDow = sum(DOW(:,2).^nDim); % power dimensionality and sum

sDsr = sum(Dsr.^nDim); % power dimensionality and sum

hop = sDsr / (sDow + sDsr); % Hopkins measure

fprintf(’Hopkins %1.2f (nSmp=%d)\n’, hop, nSmp);

end

And here is the corresponding testing script using artificial data:

clear;

nRep = 1000;

figure(1);clf;

%% ----- Regular Grid

[X Y] = meshgrid(0:30,0:30);

PTS = [X(:) Y(:)]./30; % regular grid

[nPts nDim] = size(PTS);

PTS = PTS(randperm(nPts),:);

for i = 1:nRep, Hop(i) = f_RndHopkins(PTS); end

ts = sprintf(’Hopkins avg %1.2f +- %1.2f\n’, mean(Hop), std(Hop));

subplot(2,2,1);plot(PTS(:,1),PTS(:,2),’.’);titleb(ts);

%% ----- Random

PTS = rand(1000,2); % random points

for i = 1:nRep, Hop(i) = f_RndHopkins(PTS); end

ts = sprintf(’Hopkins avg %1.2f +- %1.2f\n’, mean(Hop), std(Hop));

subplot(2,2,2);plot(PTS(:,1),PTS(:,2),’.’);titleb(ts);

%% ----- Cluster

PTS1 = rand(500,2); % random points

PTS2 = rand(500,2)/2+0.15; % random points dense

PTS = [PTS1; PTS2]; % cluster in lower left

[nPts nDim] = size(PTS);

PTS = PTS(randperm(nPts),:);

for i = 1:nRep, Hop(i) = f_RndHopkins(PTS); end

ts = sprintf(’Hopkins avg %1.2f +- %1.2f\n’, mean(Hop), std(Hop));

subplot(2,2,3);plot(PTS(:,1),PTS(:,2),’.’);titleb(ts);
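
A Python version of the Hopkins measure can be sketched with scikit-learn's NearestNeighbors, following the same steps as f_RndHopkins (hopkins is an illustrative function name; like the Matlab function, it assumes data roughly within the unit hypercube):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(PTS, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    nPts, nDim = PTS.shape
    nSmp = max(1, round(nPts * 0.1))                    # take a fraction of all samples
    PRND = rng.random((nSmp, nDim))                     # uniform random samples
    PSMP = PTS[rng.permutation(nPts)[:nSmp]]            # samples from PTS
    nn = NearestNeighbors(n_neighbors=2).fit(PSMP)
    Dow = nn.kneighbors(PSMP)[0][:, 1]                  # NN distance within the sample
    Dsr = nn.kneighbors(PRND, n_neighbors=1)[0][:, 0]   # NN distance from random points
    sDow, sDsr = np.sum(Dow ** nDim), np.sum(Dsr ** nDim)
    return sDsr / (sDow + sDsr)                         # Hopkins measure

print(hopkins(np.random.default_rng(0).random((1000, 2))))   # approx. 0.5 for random data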


About the Author

C. Rasche is a researcher in computer vision focusing on image classification, object detection, shape recognition, medical image analysis, etc. To solve those tasks, the study of pattern recognition and machine learning is inevitable. C. Rasche uses both traditional and modern classifiers and has often achieved very similar performances with either type; he thereby gained the experience that the development of simple, task-specific classifiers, in particular ensemble classifiers, is the quickest way to achieve a fairly high performance without needing to wait for the long learning process of modern classifiers.

https://www.researchgate.net/profile/Christoph_Rasche
