
A Brief Tour of Machine Learning

David Lindsay

What is Machine Learning?

• Very multidisciplinary field – statistics, mathematics, artificial intelligence, psychology, philosophy, cognitive science…

• In a nutshell – developing algorithms that learn from data

• Historically – flourished with advances in computing in the early 60s, with a resurgence in the late 90s

Main areas in Machine Learning

#1 Supervised learning – assumes a teacher exists to label/annotate data

#2 Unsupervised learning – no need for a teacher, try to learn relationships automatically

#3 Reinforcement learning – biologically plausible, try to learn from reward/punishment stimuli/feedback

Machine Learning Area #1 Supervised Learning

Supervised Learning

Learning with a teacher

Machine Learning Area #1 Supervised Learning

More about Supervised Learning

Perhaps the most well-studied area of machine learning – lots of nice theory adapted from statistics/mathematics.

Assume the existence of a training and test set

Main sub-areas of research are:
• Pattern recognition (discrete labels)
• Regression (continuous labels)
• Time series analysis (temporal dependence in data)

The i.i.d. assumption is commonly made.

Machine Learning Area #1 Supervised Learning

The formalisation of data
• How do we formally describe our data? Each example is an object together with a label.

The object is commonly represented as a feature vector that describes it:

x_i = (x_i^1, x_i^2, …, x_i^d)

The individual features can be real, discrete, symbolic… e.g. patient symptoms: temperature, sex, eye colour…

The label is the property of the object that we want to predict in the future using our training data – e.g. screening cancer labels could be Y = {normal, benign, malignant}

Machine Learning Area #1 Supervised Learning

The formalisation of data (continued)
• What is training and test data?

[Figure: a training set of labelled digit images (2, 7, 6, 1, 7) and new test images whose labels are either not known or withheld from the learner]

We learn from the training data, and try to predict new unseen test data. More formally, we have a set of n training examples (information pairs: object x + label y) drawn from some unknown probability distribution P(X,Y):

(x_1, y_1), (x_2, y_2), …, (x_n, y_n)   s.t.   (x_i, y_i) ~ P(X,Y)

Machine Learning Area #1 Supervised Learning

More about Pattern Recognition

Lots of algorithms/techniques – the main contenders:

1. Support Vector Machines (SVM)
2. Nearest Neighbours
3. Decision Trees
4. Neural Networks
5. Multivariate Statistics
6. Bayesian algorithms
7. Logic programming

Machine Learning Area #1 Supervised Learning

The mighty SVM algorithm
• Very popular technique – lots of followers, relatively new
• Very simple technique – related to the Perceptron, it is a linear classifier (separates data into half spaces).

[Figure: two classes of points separated by a linear decision boundary]

Concept – keep the classifier simple and don't over-fit the data, so that the classifier generalises well on new test data (Occam's razor)

Concept – if the data are not linearly separable, use a kernel Φ to map into another, higher-dimensional feature space where the data may be separable
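The linear-classifier idea can be sketched with the SVM's simpler relative, the Perceptron (a minimal illustration only: it finds *a* separating hyperplane, not the SVM's maximum-margin one, and the toy data points below are made up):

```python
# Minimal Perceptron: learns a hyperplane w.x + b = 0 separating two classes.
# An SVM additionally maximises the margin; this sketch only finds *a* separator.

def train_perceptron(points, labels, epochs=100, lr=0.1):
    d = len(points[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):              # y is +1 or -1
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:                   # misclassified: update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Two linearly separable toy clusters (hypothetical data)
X = [(1.0, 1.0), (1.5, 0.5), (-1.0, -1.0), (-0.5, -1.5)]
y = [1, 1, -1, -1]
```

On separable data like this the update rule provably converges to a separating half-space, which is exactly the "linear classifier" picture above.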

Machine Learning Area #1 Supervised Learning

Hot topics in SVM’s

• Kernel design – central to applying SVMs to data, e.g. when the objects are text documents and the features are words, incorporate domain knowledge about grammar.

• Applying the kernel technique to other learning algorithms, e.g. Neural Networks

Machine Learning Area #1 Supervised Learning

The trusty old Nearest Neighbour algorithm

• Born in the 60s – probably the simplest of all algorithms to understand.

• Decision rule – classify a new test example by finding the closest neighbouring example in the training set and predicting the same label as that closest neighbour.

• Lots of theory justifying its convergence properties.

• Very lazy technique, not very fast – has to search the training set for each test example.
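The decision rule above is simple enough to write down directly; a minimal sketch, assuming Euclidean distance and a made-up toy training set of patient measurements:

```python
import math

# 1-Nearest Neighbour: predict the label of the closest training example.
def nearest_neighbour(train, test_point):
    # train: list of (feature_vector, label) pairs
    best_label, best_dist = None, math.inf
    for x, label in train:
        dist = math.dist(x, test_point)   # Euclidean distance
        if dist < best_dist:
            best_dist, best_label = dist, label
    return best_label

# Hypothetical patients: (temperature, heart rate) -> diagnosis
train = [((37.0, 70), "normal"), ((39.5, 110), "fever"), ((36.8, 65), "normal")]
```

Note the "lazy" aspect: there is no training phase at all, and every prediction scans the whole training set.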

Machine Learning Area #1 Supervised Learning

Problems with Nearest Neighbours

• Examples are viewed in Euclidean space, so the method can be very sensitive to feature scaling.

• Finding computationally efficient ways to search for the Nearest Neighbour example.

Machine Learning Area #1 Supervised Learning

Decision Trees
• Many different varieties: C4.5, CART, ID3…

• Algorithms build classification rules using a tree of if-then statements.

• The tree is constructed using Minimum Description Length (MDL) principles (tries to make the tree as simple as possible)

IF temperature > 65 THEN patient has fever
   IF dehydrated = yes THEN patient has flu
   ELSE patient has pneumonia
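One plausible reading of the slide's fragmentary tree, written out as the if-then rules a learner such as C4.5 would induce from data rather than hand-code (the "no fever" fall-through branch is an assumption, not from the slide):

```python
# The toy fever tree as nested if-then rules. A real tree learner would
# choose the split attributes and thresholds automatically from training data.
def diagnose(temperature, dehydrated):
    if temperature > 65:              # feverish branch
        if dehydrated == "yes":
            return "flu"
        return "pneumonia"
    return "no fever"                 # branch not shown on the slide (assumed)
```

This directness is exactly the interpretability benefit discussed below: the hypothesis *is* a set of readable rules.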

Machine Learning Area #1 Supervised Learning

Benefits/Issues with Decision Trees

• Instability – minor changes to the training data can make huge changes to the decision tree

• The user can visualise/interpret the hypothesis directly, and can find interesting classification rules

• Problems with continuous real attributes – they must be discretised.

• Large AI following, and widely used in industry

Machine Learning Area #1 Supervised Learning

Mystical Neural Networks
• Very flexible, learning is a gradient descent process (back propagation)
• Training neural networks involves a lot of design choices:
– what network structure, how many hidden layers…
– how to encode the data (values must be in [0,1])
– use momentum to speed up convergence
– use weight decay to keep the network simple

Machine Learning Area #1 Supervised Learning

Training a neural network

[Figure: a network with an input layer (menopausal status, ultrasound score, CA125), a hidden layer, and an output layer producing a value between 0 and 1 via a sigmoid function]

The learnt hypothesis is represented by the weights that interconnect each neuron.

The aim in training the neural network is to find the weight vector w that minimises the error E(w) on the training set – a gradient descent problem.
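The weight-update idea can be sketched with a single sigmoid neuron trained by gradient descent on squared error (the toy AND-like data, learning rate and epoch count are all assumptions for illustration):

```python
import math

# One sigmoid neuron trained by gradient descent on squared error E(w):
# a minimal sketch of the update rule underlying back-propagation.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, lr=0.5, epochs=2000):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in data:
            out = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
            # gradient of E = (out - target)^2 / 2 through the sigmoid
            delta = (out - target) * out * (1 - out)
            w = [wi - lr * delta * xi for wi, xi in zip(w, x)]
            b -= lr * delta
    return w, b

def error(data, w, b):
    return sum((sigmoid(w[0]*x[0] + w[1]*x[1] + b) - t) ** 2 for x, t in data)

# Toy task: output 1 only when both inputs are high (AND-like)
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
```

A full network repeats this same delta computation layer by layer, which is what "back propagation" refers to.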

Machine Learning Area #1 Supervised Learning

Interesting applications

• Bioinformatics:
– genetic/protein code analysis
– microarray analysis
– gene regulatory pathways

• WWW:
– classifying text/html documents
– filtering images
– filtering emails

Machine Learning Area #1 Supervised Learning

Bayesian Algorithms

• Try to model interrelationships between variables probabilistically.

• Can model expert/domain knowledge directly into the classifier as prior belief in certain events.

• Use basic axioms of probability theory to extract probabilistic estimates

Machine Learning Area #1 Supervised Learning

Bayesian algorithms in practice

• Lots of different algorithms – Relevance Vector Machine (RVM), Naïve Bayes, Simple Bayes, Bayesian Belief Networks (BBN)…

• Has a large following – especially Microsoft Research

[Figure: a Bayesian network with nodes "Weather = sunny", "Temperature < 65" and "Humidity > 100" linked to "Play Tennis" / "Play Monopoly" – causal links between features can be modelled]
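The simplest of these algorithms, Naïve Bayes, estimates P(class) and P(feature | class) from observed training frequencies; a minimal sketch with made-up weather data echoing the diagram above:

```python
from collections import defaultdict

# Naive Bayes: P(class | features) is proportional to
# P(class) * product of P(feature_i | class), with all probabilities
# estimated from observed training-data frequencies.
def train_nb(rows, labels):
    class_counts = defaultdict(int)
    feat_counts = defaultdict(int)   # (class, position, value) -> count
    for row, c in zip(rows, labels):
        class_counts[c] += 1
        for i, v in enumerate(row):
            feat_counts[(c, i, v)] += 1
    return class_counts, feat_counts

def predict_nb(model, row):
    class_counts, feat_counts = model
    n = sum(class_counts.values())
    best, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n                               # prior P(class)
        for i, v in enumerate(row):
            # Laplace smoothing, assuming two values per feature
            score *= (feat_counts[(c, i, v)] + 1) / (cc + 2)
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical (weather, humidity) observations -> activity
rows = [("sunny", "low"), ("sunny", "low"), ("rainy", "high"), ("rainy", "low")]
labels = ["tennis", "tennis", "monopoly", "monopoly"]
```

The "naive" part is the independence assumption between features given the class; full Bayesian Belief Networks relax it by modelling the causal links explicitly.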

Machine Learning Area #1 Supervised Learning

Issues with Bayesian algorithms

• Tractability – to find solutions, need numerical approximations or computational shortcuts

• Can model causal relationships between variables

• Need lots of data to estimate probabilities using observed training data frequencies

Machine Learning Area #1 Supervised Learning

Very important side problems

• Feature Selection/Extraction – using Principal Component Analysis, Wavelets, Canonical Correlation, Factor Analysis, Independent Component Analysis

• Imputation – what to do with missing features?

• Visualisation – make the hypothesis human readable/interpretable

• Meta learning – how to add functionality to existing algorithms, or combine the predictions of many classifiers (Boosting, Bagging, Confidence and Probability Machines)

Machine Learning Area #1 Supervised Learning

Very important side problems (continued)

• How to incorporate domain knowledge into a learner

• Trade-off between complexity (accuracy on training) and generalisation (accuracy on test)

• Pre-processing of data: normalising, standardising, discretising.

• How to test – leave one out, cross validation, stratify, online, offline…
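The leave-one-out scheme from the last bullet can be sketched generically for any learner (the stand-in nearest-value classifier and toy data are assumptions for illustration):

```python
# Leave-one-out cross-validation: train on all-but-one example,
# test on the held-out one, and average over every choice of held-out example.
def leave_one_out_accuracy(examples, fit, predict):
    correct = 0
    for i in range(len(examples)):
        held_out = examples[i]
        train = examples[:i] + examples[i + 1:]
        model = fit(train)
        if predict(model, held_out[0]) == held_out[1]:
            correct += 1
    return correct / len(examples)

# Toy stand-in learner: predict the label of the nearest 1-D training point
def fit(train):
    return train

def predict(model, x):
    return min(model, key=lambda ex: abs(ex[0] - x))[1]

examples = [(0.0, "a"), (0.1, "a"), (1.0, "b"), (1.1, "b")]
```

Because every example gets a turn as test data, the estimate uses the data efficiently – at the cost of refitting the model n times.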

Machine Learning Area #2 Unsupervised Learning

Unsupervised Learning

Learning without a teacher

Machine Learning Area #2 Unsupervised Learning

An introduction to Unsupervised Learning

• No need for a teacher/supervisor

• Mainly clustering – trying to group objects into sensible clusters

• Novelty detection – finding strange examples in data

[Figure: clustering examples (left) and novelty detection (right)]

Machine Learning Area #2 Unsupervised Learning

Algorithms available

• For clustering: EM algorithm, K-Means, Self Organising Maps (SOM)

• For novelty detection: 1-Class SVM, support vector regression, Neural Networks
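K-Means, for example, can be sketched in a few lines for 1-D data (the toy points and starting centres are made up):

```python
# K-Means clustering: alternately assign each point to its nearest centre,
# then move each centre to the mean of the points assigned to it.
def k_means(points, centres, iterations=20):
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # empty clusters keep their old centre
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres

# Two obvious 1-D groups (hypothetical data); start the centres apart
points = [1.0, 1.2, 0.8, 10.0, 10.4, 9.6]
```

No labels are needed anywhere – the grouping emerges from the data alone, which is the defining property of unsupervised learning.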

Machine Learning Area #2 Unsupervised Learning

Issues and Applications

• Very useful for extracting information from data.
• Used in medicine to identify disease sub-types.
• Used to cluster web documents automatically.
• Used to identify customer target groups in business.
• Not much publicly available data to test algorithms with.

Machine Learning Area #3 Reinforcement Learning

Reinforcement Learning

Learning inspired by nature

Machine Learning Area #3 Reinforcement Learning

An introduction

• Most biologically plausible – feedback given through reward/punishment stimuli

• A field with a lot of theory, in need of real-life applications (other than playing Backgammon)

• But also encompasses the large field of Evolutionary Computing

• Applications are more open ended

• Getting closer to what the public consider AI.

Machine Learning Area #3 Reinforcement Learning

Traditional Reinforcement Learning

• Techniques use dynamic programming to search for the optimal strategy

• Algorithms search to maximise their reward.

• Q-Learning (Chris Watkins next door) is the most well-known technique.

• The only successful applications are to games and toy problems.

• A lack of real-life applications.

• Very few researchers in this field.
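Q-Learning's reward-maximising update can be sketched on a toy corridor world (the environment, learning rate and exploration scheme are all assumptions for illustration):

```python
import random

# Q-Learning on a tiny corridor: states 0..3, actions 0=left, 1=right.
# Reaching state 3 gives reward 1 and ends the episode.
# Update rule: Q(s,a) += lr * (r + gamma * max_a' Q(s',a') - Q(s,a))
def q_learn(episodes=500, lr=0.5, gamma=0.9, epsilon=0.2):
    random.seed(0)
    Q = [[0.0, 0.0] for _ in range(4)]
    for _ in range(episodes):
        s = 0
        while s != 3:
            if random.random() < epsilon:
                a = random.randint(0, 1)             # explore
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1    # exploit current estimate
            s2 = max(0, s - 1) if a == 0 else s + 1  # deterministic moves
            r = 1.0 if s2 == 3 else 0.0
            Q[s][a] += lr * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q
```

After learning, the greedy policy (pick the action with the larger Q-value in each state) walks straight to the rewarding state – the agent has learnt purely from the reward signal.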

Machine Learning Area #3 Reinforcement Learning

Evolutionary Computing

• Inspired by the process of biological evolution.

• Essentially an optimisation technique – the problem is encoded as a chromosome.

• We find new/better solutions to the problem by sexual reproduction (crossover) and mutation.
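A minimal sketch of the chromosome/reproduction/mutation loop on the classic "onemax" toy problem (the bit-string encoding, rates and population size are illustrative assumptions):

```python
import random

# Minimal genetic algorithm on "onemax": chromosome = list of bits,
# fitness = number of 1s, evolved by tournament selection,
# single-point crossover (sexual reproduction) and bit-flip mutation.
def evolve(length=20, pop_size=30, generations=60, mutation_rate=0.02):
    random.seed(1)
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]

    def fitness(c):
        return sum(c)

    def pick():                          # tournament selection of a parent
        a, b = random.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = pick(), pick()
            cut = random.randrange(1, length)            # crossover point
            child = p1[:cut] + p2[cut:]
            child = [bit ^ 1 if random.random() < mutation_rate else bit
                     for bit in child]                   # mutation
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)
```

Any problem that can be encoded as such a chromosome can be optimised this way – the encoding choice is the crucial (and ad hoc) step, as noted below.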

Machine Learning Area #3 Reinforcement Learning

Techniques available in Evolutionary Computing

• Lower level optimisers:
– Evolutionary Programming, Evolutionary Algorithms
– Genetic Programming, Genetic Algorithms
– Evolutionary Strategy
– Simulated Annealing

• Higher level optimisers:
– TABU search
– Multi-objective optimisation

[Figure: solutions plotted against Objective 1 and Objective 2 – a Pareto front of optimal solutions; which one should we pick?]

Machine Learning Area #3 Reinforcement Learning

Issues in Evolutionary Computing

• How to encode the problem is very important

• Setting mutation/crossover rates is very ad hoc

• Very computationally/memory intensive

• Not much theory can be developed – frowned upon by machine learning theorists