
Introduction to Pattern Recognition

03/03/2016

EE-002

Computational Learning & Pattern Recognition

Turgay IBRIKCI

Çukurova University

Electrical-Electronics Engineering Department


Where or how to find me?

Associate Prof. Dr. Turgay IBRIKCI

Room #305, Thursdays 9:30-12:00

(322) 338 6868 / 139

[email protected]


Course Outline

The course is divided into two parts: theory and practice.

1. Theory covers basic topics in pattern recognition theory and applications with computational learning.

2. Practice deals with the basics of MATLAB and the implementation of pattern recognition algorithms. We assume that you already know MATLAB or that you will learn it on your own.


Course Grading

Grading the class:

Project 40% (report and presentation; Week 14, 20 mins)

Final Exam 20% (Week 15; we decide together)

Homeworks 40% (at least 4 homeworks)

Full attendance 10% (required bonus)

In This Course

How should objects to be classified be represented?

What algorithms can be used for recognition (or matching)?

How should learning (training) be done?

Many of the topics concern statistical classification methods. They include generative methods such as those based on Bayes decision theory and the related techniques of parameter estimation and density estimation.

Apply the algorithms in MATLAB.

What is pattern recognition?

A pattern is an object, process or event that can be given a name.

A pattern class (or category) is a set of patterns sharing common attributes and usually originating from the same source.

During recognition (or classification) given objects are assigned to prescribed classes.

A classifier is a machine which performs classification.

“The assignment of a physical object or event to one of several prespecified categories” -- Duda & Hart


Examples of applications

• Optical Character Recognition (OCR)

• Biometrics

• Diagnostic systems

• Military applications

• Handwritten: sorting letters by postal code, input device for PDAs.

• Printed texts: reading machines for blind people, digitization of text documents.

• Face recognition, verification, retrieval.

• Fingerprint recognition.

• Speech recognition.

• Medical diagnosis: X-Ray, EKG analysis.

• Machine diagnostics, waste detection.

• Automated Target Recognition (ATR).

• Image segmentation and analysis (recognition from aerial or satellite photographs).

What are Patterns?

Laws of Physics & Chemistry generate patterns.

Patterns in Astronomy

Humans tend to see patterns everywhere.

Patterns in Biology

Applications: Biometrics, Computational Anatomy, Brain Mapping.

Patterns of Brain Activity

Relations between brain activity, emotion, cognition, and behaviour.

Variations of Patterns

Patterns vary with expression, lighting, occlusions.


Speech Patterns

Acoustic signals.


Goal of Pattern Recognition

Recognize Patterns. Make decisions about patterns.

Visual Example – is this person happy or sad?

Speech Example – did the speaker say “Yes” or “No”?

Physics Example – is this an atom or a molecule?


Approaches

Statistical PR: based on underlying statistical model of patterns and pattern classes.

Structural (or syntactic) PR: pattern classes are represented by means of formal structures such as grammars, automata, strings, etc.

Neural networks: classifier is represented as a network of cells modeling neurons of the human brain (connectionist approach).

Basic concepts

Feature vector x = (x1, …, xn)^T ∈ X

- A vector of observations (measurements).

- x is a point in the feature space X.

Hidden state y ∈ Y

- Cannot be directly measured.

- Patterns with equal hidden state belong to the same class.

Task

- To design a classifier (decision rule) q: X → Y which decides about a hidden state based on an observation.

Example

Task: jockey-hoopster recognition.

The feature space is X = R^2 with x = (x1, x2), where x1 = height and x2 = weight.

The set of hidden states is Y = {H, J}.

Training examples: {(x1, y1), …, (xℓ, yℓ)}.

Linear classifier:

q(x) = H if w·x + b ≥ 0, and q(x) = J if w·x + b < 0.

The decision boundary is the line w·x + b = 0.
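A minimal sketch of such a linear rule (in Python rather than the course's MATLAB; the weight vector, bias, and the two test points below are hypothetical illustrative values, not taken from the slides):

```python
# Linear classifier q(x) for the jockey-hoopster task: x = (height_cm, weight_kg).
# The weights w and bias b are hypothetical values chosen only for illustration.
def q(x, w=(1.0, 1.0), b=-250.0):
    """Return 'H' (hoopster) if w.x + b >= 0, else 'J' (jockey)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 'H' if score >= 0 else 'J'
```

With these values, a tall, heavy person (200 cm, 100 kg) falls on the H side of the boundary w·x + b = 0, while a short, light one (150 cm, 50 kg) falls on the J side.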

Example: Salmon versus Sea Bass

Generative methods attempt to model the full appearance of Salmon and Sea Bass.

Discriminative methods extract features sufficient to make the decision (e.g. length and brightness).

Fish Features. Length.

Salmon are usually shorter than Sea Bass.

Fish Features. Lightness.

Sea Bass are usually brighter than Salmon.


Components of PR system

Sensors and preprocessing → Feature extraction → Classifier → Class assignment

• Sensors and preprocessing.

• Feature extraction aims to create discriminative features good for classification.

• A classifier.

• A teacher provides information about the hidden state -- supervised learning.

• A learning algorithm sets up the PR system from training examples.

Pattern

Feature extraction

Task: to extract features which are good for classification.

Good features: • Objects from the same class have similar feature values.

• Objects from different classes have different values.

“Good” features “Bad” features

Feature extraction methods

Feature extraction: a mapping φ transforms the raw measurements (m1, …, mk) into a feature vector (x1, …, xn).

Feature selection: a subset of the measurements (m1, …, mk) is chosen directly as the features (x1, …, xn).

The problem can be expressed as optimization of the parameters θ of a feature extractor φ(θ).

Supervised methods: the objective function is a criterion of separability (discriminability) of labeled examples, e.g., linear discriminant analysis (LDA).

Unsupervised methods: a lower-dimensional representation which preserves important characteristics of the input data is sought, e.g., principal component analysis (PCA).
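For the unsupervised case, the first principal component of 2-D data can even be computed in closed form from the 2×2 covariance matrix. A minimal pure-Python sketch (illustrative only; a library routine would normally be used):

```python
import math

def first_pc_2d(points):
    """Return a unit vector along the direction of maximal variance of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the 2x2 covariance matrix [[a, b], [b, c]].
    a = sum((p[0] - mx) ** 2 for p in points) / n
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    c = sum((p[1] - my) ** 2 for p in points) / n
    # The leading eigenvector makes angle t with the x-axis, where tan(2t) = 2b / (a - c).
    t = 0.5 * math.atan2(2 * b, a - c)
    return (math.cos(t), math.sin(t))
```

For points lying on the line y = x, the recovered direction is (cos 45°, sin 45°), as expected.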

Classifier

A classifier partitions the feature space X into class-labeled regions X1, …, X|Y| such that

X = X1 ∪ X2 ∪ … ∪ X|Y| and Xi ∩ Xj = ∅ for i ≠ j.

Classification consists of determining to which region a feature vector x belongs. The borders between decision regions are called decision boundaries.

Representation of classifier

A classifier is typically represented as a set of discriminant functions

fi: X → R, i = 1, …, |Y|.

The classifier assigns a feature vector x to the i-th class if fi(x) > fj(x) for all j ≠ i; i.e., it evaluates all discriminant functions f1(x), …, f|Y|(x) and outputs the class identifier y with the maximal value.
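A minimal sketch of this representation (the two discriminant functions on a 1-D feature are hypothetical; any real-valued functions fi would work the same way):

```python
def classify(x, discriminants):
    """Assign x to the class index i whose discriminant f_i(x) is largest."""
    return max(range(len(discriminants)), key=lambda i: discriminants[i](x))

# Two hypothetical linear discriminant functions on a 1-D feature.
f = [lambda x: 2.0 - x,   # favors class 0 for small x
     lambda x: x - 1.0]   # favors class 1 for large x
```

Here classify(0.0, f) yields class 0 and classify(3.0, f) yields class 1; the decision boundary sits where the two discriminants are equal.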


An Introduction

Bayesian Decision Theory came long before Version Spaces, Decision Tree Learning and Neural Networks. It was studied in the field of Statistical Theory and, more specifically, in the field of Pattern Recognition.

Bayesian Decision Theory is at the basis of important learning schemes such as the Naïve Bayes Classifier, Learning Bayesian Belief Networks and the EM Algorithm.


Bayesian decision making

• Bayesian decision making is a fundamental statistical approach which allows one to design the optimal classifier if the complete statistical model is known.

Definition:

Observations x ∈ X

Hidden states y ∈ Y

Decisions d ∈ D

A loss function W: Y × D → R

A decision rule q: X → D

A joint probability p(x, y)

Task: to design a decision rule q which minimizes the Bayesian risk

R(q) = Σ_{y∈Y} Σ_{x∈X} p(x, y) W(q(x), y)


Bayes Theorem

Goal: to determine the most probable hypothesis, given the data D plus any initial knowledge about the prior probabilities of the various hypotheses in H.

Prior probability of h, P(h): it reflects any background knowledge we have about the chance that h is a correct hypothesis (before having observed the data).

Prior probability of D, P(D): it reflects the probability that training data D will be observed given no knowledge about which hypothesis h holds.

Conditional probability of observation D, P(D|h): it denotes the probability of observing data D given some world in which hypothesis h holds.


Bayes Theorem (Cont’d)

Posterior probability of h, P(h|D): it represents the probability that h holds given the observed training data D. It reflects our confidence that h holds after we have seen the training data D and it is the quantity that Machine Learning researchers are interested in.

Bayes Theorem allows us to compute P(h|D):

P(h|D) = P(D|h)P(h) / P(D)
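A quick numerical illustration (all probabilities here are hypothetical; P(D) is expanded by the law of total probability over h and ¬h):

```python
def posterior(p_d_given_h, p_h, p_d_given_not_h):
    """P(h|D) by Bayes theorem, with P(D) = P(D|h)P(h) + P(D|~h)P(~h)."""
    p_d = p_d_given_h * p_h + p_d_given_not_h * (1.0 - p_h)
    return p_d_given_h * p_h / p_d

p = posterior(0.9, 0.01, 0.1)  # = 0.009 / 0.108, roughly 0.083
```

Even with a likelihood P(D|h) = 0.9, the posterior stays near 0.083 because the prior P(h) = 0.01 is small.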

Bayesian Belief Networks

The Bayes Optimal Classifier is often too costly to apply.

The Naïve Bayes Classifier uses the conditional independence assumption to defray these costs. However, in many cases, such an assumption is overly restrictive.

Bayesian belief networks provide an intermediate approach which allows stating conditional independence assumptions that apply to subsets of the variables.


Representation in Bayesian Belief Networks

Example network over the variables Storm, BusTourGroup, Lightning, Campfire, Thunder, and ForestFire (the graph over these variables is shown in the original figure).

Each node is asserted to be conditionally independent of its non-descendants, given its immediate parents

Associated with each node is a conditional probability table, which specifies the conditional distribution for the variable given its immediate parents in the graph
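The factorization these tables encode, P(x1, …, xn) = Π_i P(xi | parents(xi)), can be sketched as follows (the two-node network and all CPT numbers below are hypothetical, not read off the figure):

```python
def joint(assignment, parents, cpt):
    """Joint probability of a full assignment: product of P(node | its parents)."""
    p = 1.0
    for var, val in assignment.items():
        parent_vals = tuple(assignment[pa] for pa in parents[var])
        p *= cpt[var][parent_vals][val]
    return p

# Hypothetical two-node network: Storm -> Lightning, with made-up CPT entries.
parents = {'Storm': (), 'Lightning': ('Storm',)}
cpt = {
    'Storm':     {(): {True: 0.1, False: 0.9}},
    'Lightning': {(True,):  {True: 0.6,  False: 0.4},
                  (False,): {True: 0.05, False: 0.95}},
}
```

For example, P(Storm, Lightning) = P(Storm) · P(Lightning | Storm) = 0.1 · 0.6 = 0.06 under these hypothetical tables.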


Inference in Bayesian Belief Networks

A Bayesian Network can be used to compute the probability distribution for any subset of network variables given the values or distributions for any subset of the remaining variables.

Unfortunately, exact inference of probabilities in general for an arbitrary Bayesian Network is known to be NP-hard.

In theory, approximate techniques (such as Monte Carlo methods) can also be NP-hard, though in practice many such methods have been shown to be useful.


Example of Bayesian task

Task: minimization of classification error.

The set of decisions D is the same as the set of hidden states Y.

The 0/1-loss function is used:

W(q(x), y) = 0 if q(x) = y, and W(q(x), y) = 1 if q(x) ≠ y.

The Bayesian risk R(q) then corresponds to the probability of misclassification.

The solution of the Bayesian task is

q* = argmin_q R(q), i.e. q*(x) = argmax_y p(y|x) = argmax_y p(x|y) p(y) / p(x).
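A minimal sketch of this rule for a discrete observation (the priors and class-conditional tables below are hypothetical; p(x) is a common factor over classes and can be dropped from the argmax):

```python
def bayes_classify(x, priors, likelihood):
    """q*(x) = argmax_y p(x|y) p(y)."""
    return max(priors, key=lambda y: priors[y] * likelihood[y].get(x, 0.0))

# Hypothetical two-class model over a discrete feature.
priors = {'A': 0.7, 'B': 0.3}
likelihood = {'A': {'short': 0.8, 'tall': 0.2},
              'B': {'short': 0.1, 'tall': 0.9}}
```

For observation 'short' the products are 0.56 vs 0.03, so class A wins; for 'tall' they are 0.14 vs 0.27, so class B wins.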

Limitations of Bayesian approach

• The statistical model p(x,y) is mostly not known; therefore learning must be employed to estimate p(x,y) from training examples {(x1,y1), …, (xℓ,yℓ)} -- plug-in Bayes.

• Non-Bayesian methods offer further task formulations:

• Only a partial statistical model is available:

• p(y) is not known or does not exist.

• p(x|y) is influenced by a non-random intervention.

• The loss function is not defined.

• Examples: Neyman-Pearson's task, Minimax task, etc.

Discriminative approaches

Given a class of classification rules q(x; θ) parametrized by θ, the task is to find the “best” parameter θ* based on a set of training examples {(x1,y1), …, (xℓ,yℓ)} -- supervised learning.

The task of learning: to recognize which classification rule is to be used.

How the learning is performed is determined by a selected inductive principle.

Learning Theory

Both Generative and Discriminative methods require training data to learn the models/features/decision rules.

Machine Learning concentrates on learning discrimination rules.

Key Issue: do we have enough training data to learn?

Empirical risk minimization principle

The true expected risk R(q) is approximated by the empirical risk

R_emp(q(x; θ)) = (1/ℓ) Σ_{i=1}^{ℓ} W(q(xi; θ), yi)

with respect to a given labeled training set {(x1,y1), …, (xℓ,yℓ)}.

Learning based on the empirical risk minimization principle is defined as

θ* = argmin_θ R_emp(q(x; θ))

Examples of algorithms: Perceptron, Back-propagation, etc.
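Under the 0/1 loss, the empirical risk is simply the training error rate. A minimal sketch (the threshold rule and the tiny training set are hypothetical):

```python
def empirical_risk(q, training_set):
    """Average 0/1 loss of classifier q over a labeled training set [(x, y), ...]."""
    return sum(1 for x, y in training_set if q(x) != y) / len(training_set)

# Hypothetical threshold rule on a 1-D feature and a tiny labeled set.
rule = lambda x: 1 if x >= 0.5 else 0
data = [(0.1, 0), (0.3, 0), (0.7, 1), (0.9, 0)]
```

Here the rule misclassifies one of the four examples, so R_emp = 0.25.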

Overfitting and underfitting

Problem: how rich a class of classifiers q(x; θ) to use.

(Figure: underfitting, good fit, overfitting.)

Problem of generalization: a small empirical risk R_emp does not imply a small true expected risk R.


Structural risk minimization principle

An upper bound on the expected risk of a classification rule q ∈ Q:

R(q) ≤ R_emp(q) + R_str(ℓ, h, log(1/δ))

where ℓ is the number of training examples, h is the VC-dimension of the class of functions Q, and 1−δ is the confidence of the upper bound.

SRM principle: from a given nested sequence of function classes Q1, Q2, …, Qm, such that h1 ≤ h2 ≤ … ≤ hm, select the rule q* which minimizes the upper bound on the expected risk.

Statistical learning theory -- Vapnik & Chervonenkis

Machine Learning is…

Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data.

Machine Learning is…

Machine learning is programming computers to optimize a performance criterion using example data or past experience. -- Ethem Alpaydin

The goal of machine learning is to develop methods that can automatically detect patterns in data, and then to use the uncovered patterns to predict future data or other outcomes of interest. -- Kevin P. Murphy

The field of pattern recognition is concerned with the automatic discovery of regularities in data through the use of computer algorithms and with the use of these regularities to take actions. -- Christopher M. Bishop

Machine Learning is…

Machine learning is about predicting the future based on the past.

-- Hal Daume III

Training: a model/predictor is learned from past training data. Testing: the model/predictor is then applied to future testing data.

Supervised learning

Supervised learning: given labeled examples



Supervised learning

Supervised learning: use the learned model/predictor to predict the label of a new example.

Supervised learning: classification

Supervised learning: given labeled examples (e.g. images labeled apple or banana).

Classification: the label comes from a finite set of labels.

Classification Example

Differentiate between low-risk and high-risk customers from their income and savings

Supervised learning: regression

Supervised learning: given labeled examples (e.g. labels -4.5, 10.1, 3.2, 4.3).

Regression: the label is real-valued.

Regression Example

Price of a used car

x : car attributes (e.g. mileage)

y : price

y = wx+w0
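The parameters w and w0 are typically obtained by a least-squares fit. A minimal pure-Python sketch of the closed-form solution for one attribute (the course's MATLAB would do this with built-in routines):

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w*x + w0; returns (w, w0)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept passes through the means.
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return w, my - w * mx
```

For data lying exactly on y = 2x + 1 the fit recovers w = 2 and w0 = 1.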


Regression Applications

• Economics/Finance: predict the value of a stock

• Epidemiology

• Car/plane navigation: angle of the steering wheel, acceleration, …

• Temporal trends: weather over time, …

Supervised learning: ranking

Supervised learning: given labeled examples (e.g. labels 1, 4, 2, 3).

Ranking: the label is a ranking.

Ranking example

Given a query and a set of web pages, rank them according to relevance.

Unsupervised learning

Unsupervised learning: given data, i.e. examples, but no labels.

Unsupervised learning applications

learn clusters/groups without any label

• customer segmentation (i.e. grouping)

• image compression

• bioinformatics: learn motifs

• …

Unsupervised learning

Input: training examples {x1, …, xℓ} without information about the hidden state.

Clustering: the goal is to find clusters of data sharing similar properties.

A broad class of unsupervised learning algorithms iterates two components:

• a (supervised) learning algorithm L: (X × Y)^ℓ → Θ, which estimates the parameters θ from the currently labeled examples;

• a classifier q: X × Θ → Y, which assigns labels {y1, …, yℓ} to the examples {x1, …, xℓ} given θ.


Example of unsupervised learning algorithm

k-means clustering:

Classifier: q(x) = argmin_{i=1,…,k} ||x − mi||

Goal: minimize Σ_{i=1}^{ℓ} ||xi − m_{q(xi)}||²

Learning algorithm: mj = (1/|Ij|) Σ_{i∈Ij} xi, where Ij = {i : q(xi) = j}.

The algorithm alternates between the classification step (assign each example to its nearest mean) and the learning step (recompute each mean m1, …, mk from its assigned examples).
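The alternation of classification and mean-update steps can be sketched in a few lines (1-D data for brevity; the initial means are k points drawn at random from the data):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """1-D k-means sketch: alternate q(x) = argmin_i |x - m_i| and mean updates."""
    rng = random.Random(seed)
    means = rng.sample(points, k)                      # initialize means with k data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:                               # classification step
            i = min(range(k), key=lambda j: (x - means[j]) ** 2)
            clusters[i].append(x)
        means = [sum(c) / len(c) if c else means[i]    # learning step: recompute means
                 for i, c in enumerate(clusters)]
    return means
```

On two well-separated groups of points the recovered means converge to the group averages after a few iterations.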

References

Books

Theodoridis, Koutroumbas: Pattern Recognition. 4th edition, 2004.

Duda, Hart: Pattern Classification and Scene Analysis. J. Wiley & Sons, New York, 1982. (2nd edition 2000).

Fukunaga: Introduction to Statistical Pattern Recognition. Academic Press, 1990.

Bishop: Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1997.

Schlesinger, Hlaváč: Ten lectures on statistical and structural pattern recognition. Kluwer Academic Publisher, 2002.

Journals

Journal of Pattern Recognition Society.

IEEE Transactions on Neural Networks.

Pattern Recognition and Machine Learning.

Slides: Vojtěch Franc