
1. Introduction

◮ Motivation and definition

◮ Data mining tasks

◮ Methods and algorithms

◮ Examples and applications


Motivation: why data mining?

◮ Data explosion problem

– Automated data collection tools lead to tremendous amounts of

data stored in databases.

– Processing capacity of computers grows rapidly: CPU, memory,...

– Rapidly growing gap between our ability to generate data, and our

ability to make use of it.

We are drowning in data, but starving for knowledge!


What is data mining?

◮ Data Mining is the process of discovering new patterns from large

data sets involving methods from statistics and artificial intelligence

but also database management. The term is a buzzword, and is

frequently misused to mean any form of large scale data or

information processing. (Wikipedia)

◮ The term Data Mining is a misnomer. Mining of gold from rocks or

sand is called gold mining, rather than rock mining or sand mining. A

more appropriate term is Knowledge Mining from Data.

◮ Analysis of data is a process of inspecting, cleaning, transforming,

and modeling data with the goal of highlighting useful information,

suggesting conclusions, and supporting decision making.


What data mining is not

◮ Data mining differs from traditional database queries

– the query might not be precisely stated

– the data accessed is usually a different version from that of the

operational database

– output of data mining is not a subset of the database.


Example: wage data

◮ The goal: relate wage to education, age, and the year in which the wage

was earned.

[Figure: scatterplots of Wage versus Age, Year, and Education Level]

◮ Predict wage on the basis of age, year, education.


Example: stock market data

◮ The goal: to predict increase or decrease in S&P 500 stock index on a

given day using the past 5 years percentage change in the index.

[Figure: boxplots of the percentage change in the S&P index one, two, and three days previous, grouped by Today's Direction (Down/Up)]

◮ Boxplots of the previous day's (2, 3 days) percentage change in index

for 648/602 days when it increased/decreased on the subsequent day.


Example: gene expression data

◮ Data: 6380 measurements of gene expression for 64 cancer cell lines.

◮ The goal: determine if there are groups (clusters) among the cell lines.

[Figure: cell lines plotted in the space of the first two principal components Z1, Z2 (two panels)]

◮ Data set in two–dimensional space (principal components): each

point is a cell line; on the right panel – for 14 types of cancer.


Steps of the data mining process

◮ Learning the application domain

◮ Creating the target data set – data selection

◮ Data cleaning and preprocessing (may take 60% of the effort).

◮ Data reduction and transformation

◮ Choosing functions of data mining

– summarization, classification, regression, association

◮ Data mining – search for patterns

◮ Pattern evaluation and knowledge representation


Example

A credit card company must determine whether to authorize credit card

purchases. Based on historical information about purchases, each

purchase is placed in one of 4 classes:

◮ authorize

◮ ask for further identification before authorization

◮ do not authorize

◮ do not authorize and contact police.

Data mining tasks: (1) determine how the data fit into the classes; (2)

apply the model to each new purchase.


1. Objectives of data mining

◮ Exploratory data analysis (data visualization)

Summary statistics, various plots, etc. Starting point of any data

mining process.

◮ Descriptive modeling – describe the data generating mechanism.

Examples include

* estimating probability distribution of the data

* finding groups in data (clustering)

* relationships between variables (dependency modeling: regression,

time series...).


2. Objectives of data mining

◮ Predictive modeling – make a prediction about values of data using

known results from the database.

* Regression

* Classification

* Time series models

A key distinction between predictive and descriptive modeling is that

prediction problems focus on a single variable or a set of variables,

while in description problems no single variable is central to the

model.


Example applications

◮ Fraud detection

– AT&T uses a data mining system to detect fraudulent

international calls

– The Financial Crimes Enforcement Network AI Systems (FAIS)

uses data mining technologies to identify possible money

laundering activity within large cash transactions.

◮ Risk management

– Risk management applications use data mining to determine

insurance rates, manage investment portfolios, and differentiate

between companies and/or individuals who are good and poor

credit risks.

– US West Communications uses data mining to determine customer

trends and needs based on characteristics such as family size,

median family age, and location.

◮ Text mining and Web analysis

– Personalize the products and pages displayed to a particular user

or set of users.


Specific applications

◮ Predict whether a patient, hospitalized due to a heart attack, will

have a second heart attack. The prediction can use demographic, diet, and

clinical measurements.

◮ Predict the price of a stock six months from now on the basis of

company performance measures and economic data.

◮ Identify the number in a handwritten ZIP code from a digitized

image.

◮ Identify the risk factors for prostate cancer based on clinical and

demographic variables.

◮ Distinguish pornographic and non–pornographic web pages.


I. Data mining tasks

◮ Regression

Regression is used to map a data item into a real valued prediction

variable. Regression involves the learning of the function that does

this mapping.

◮ Classification

Classification maps data into predefined groups (classes). It is often

referred to as supervised learning because the classes are determined

before examining the data.

◮ Time series analysis

Modeling variables evolving over time.


II. Data mining tasks

◮ Clustering

Clustering is similar to classification except that the groups are not

predefined, but are determined from the data.

◮ Association rules

Association rules (affinity analysis) refers to the data mining task of

uncovering relationships in data.

◮ Summarization

Summarization maps data into subsets with associated simple

descriptions. For example, U.S. News & World Report uses the average

SAT and ACT scores to compare US universities.


Data mining methods and algorithms

◮ Regression methods

Linear regression, kernels, splines, nearest–neighbors, neural

networks, regression trees,...

◮ Classification methods

Discriminant analysis, logistic regression, nearest–neighbors,

classification trees, ensemble methods,...

◮ Clustering methods

Hierarchical clustering, K–means, K–medoids, mixtures,...


2. Data types and distance measures


Data format

◮ Observations: for subject i ∈ {1, . . . , N} we observe k different

features (e.g., age, cholesterol level, weight, marital status, etc.), i.e.

we have a vector xi = (xi,1, . . . , xi,k).

◮ The database can be identified with an N × k matrix whose rows are the

subjects x_1, . . . , x_N and whose columns are the k features:

\begin{pmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,k} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,k} \\
\vdots  & \vdots  &        & \vdots  \\
x_{N,1} & x_{N,2} & \cdots & x_{N,k}
\end{pmatrix}


Distance and data types

◮ A distance measure d(x,y) between observations x and y should

satisfy the following axioms:

(i) d(x,y) ≥ 0 for all x, y, and d(x,y) = 0 iff x = y.

(ii) d(x,y) = d(y,x) (symmetry)

(iii) d(x,y) ≤ d(x, z) + d(z,y) (triangle inequality)

◮ Features can be

– Numerical – discrete or continuous (age, weight,...)

– Binary – encoded by 0− 1 (gender, success–failure,...)

– Categorical – extension of binary to more categories

– Ordinal – ordered categorical: {worst, bad, good, best}


1. Distance measures

◮ Numerical data

– Euclidean (L2) distance:

  d(x,y) = \sqrt{(x-y)^T (x-y)} = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}

– Manhattan (L1) distance: d(x,y) = \sum_{i=1}^{k} |x_i - y_i|

– Lp–distance: d(x,y) = \{ \sum_{i=1}^{k} |x_i - y_i|^p \}^{1/p}, p ∈ [1,∞].

– Mahalanobis distance: if Σ is the covariance matrix of the features,

  then d(x,y) = \sqrt{(x-y)^T \Sigma^{-1} (x-y)}.

(A short R sketch of these distances follows below.)
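As a small illustration (not from the original slides), the distances above can be
computed in R for two numeric feature vectors; the vectors x, y and the example
feature matrix Z below are made-up data.

# Sketch: numerical distance measures in base R
x <- c(2, 4, 6)
y <- c(1, 7, 3)

euclid <- sqrt(sum((x - y)^2))                 # L2 distance
manhat <- sum(abs(x - y))                      # L1 distance
lp     <- function(x, y, p) (sum(abs(x - y)^p))^(1/p)

# Mahalanobis distance needs a covariance matrix, estimated here from example data
Z <- matrix(rnorm(300), ncol = 3)              # hypothetical 100 x 3 feature matrix
S <- cov(Z)
mahal <- sqrt(mahalanobis(x, center = y, cov = S))   # mahalanobis() returns the squared distance

c(euclidean = euclid, manhattan = manhat, L3 = lp(x, y, 3), mahalanobis = mahal)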


2. Distance measures

◮ Categorical data

– Matching (Hamming) distance: d(x,y) = \sum_{i=1}^{k} I(x_i ≠ y_i)

– Weighted matching distance:

  d(x,y) = \sum_{i=1}^{k} w_i I(x_i ≠ y_i),  w_i > 0,  \sum_{i=1}^{k} w_i = 1.

◮ Ordinal data: an ordinal variable x_i with m_i levels is substituted by its

rank r(x_i) ∈ {1, . . . , m_i}. To normalize the rank to [0, 1] define

z(x_i) = (r(x_i) − 1)/(m_i − 1), and set

  d(x,y) = \sum_{i=1}^{k} |z(x_i) − z(y_i)|.

◮ Mixed data: use a mixture of normalized distances.


3. Data summary and visualization


Summary statistics

# The UScereal data frame has 65 rows and 11 columns.

# The data come from the 1993 ASA Statistical Graphics Exposition,

# and are taken from the mandatory F&DA food label.

# The data have been normalized here to a portion of one American cup.

>library(MASS)

>data(UScereal)

>summary(UScereal)

mfr calories protein fat sodium

G:22 Min. : 50.0 Min. : 0.7519 Min. :0.000 Min. : 0.0

K:21 1st Qu.:110.0 1st Qu.: 2.0000 1st Qu.:0.000 1st Qu.:180.0

N: 3 Median :134.3 Median : 3.0000 Median :1.000 Median :232.0

P: 9 Mean :149.4 Mean : 3.6837 Mean :1.423 Mean :237.8

Q: 5 3rd Qu.:179.1 3rd Qu.: 4.4776 3rd Qu.:2.000 3rd Qu.:290.0

R: 5 Max. :440.0 Max. :12.1212 Max. :9.091 Max. :787.9

fibre carbo sugars shelf

Min. : 0.000 Min. :10.53 Min. : 0.00 Min. :1.000

1st Qu.: 0.000 1st Qu.:15.00 1st Qu.: 4.00 1st Qu.:1.000

Median : 2.000 Median :18.67 Median :12.00 Median :2.000

Mean : 3.871 Mean :19.97 Mean :10.05 Mean :2.169


3rd Qu.: 4.478 3rd Qu.:22.39 3rd Qu.:14.00 3rd Qu.:3.000

Max. :30.303 Max. :68.00 Max. :20.90 Max. :3.000

potassium vitamins

Min. : 15.0 100% : 5

1st Qu.: 45.0 enriched:57

Median : 96.6 none : 3

Mean :159.1

3rd Qu.:220.0

Max. :969.7

># correlation matrix between some variables

>cor(UScereal[c("calories","protein","fat","fibre","sugars")])

calories protein fat fibre sugars

calories 1.0000000 0.7060105 0.5901757 0.3882179 0.4952942

protein 0.7060105 1.0000000 0.4112661 0.8096397 0.1848484

fat 0.5901757 0.4112661 1.0000000 0.2260715 0.4156740

fibre 0.3882179 0.8096397 0.2260715 1.0000000 0.1489158

sugars 0.4952942 0.1848484 0.4156740 0.1489158 1.0000000


1. Density visualization

Histogram

>hist(UScereal[,"protein"], main="UScereal data", xlab="protein")

[Histogram of protein for the UScereal data (x-axis: protein, y-axis: frequency)]

2. Density visualization

Kernel smoothing

>plot(density(UScereal[,"protein"],kernel="gaussian"), main="UScereal data",

+ xlab="protein")

[Kernel density estimate of protein for the UScereal data]

Boxplot

>mfr <- UScereal[,"mfr"]

>boxplot(UScereal[mfr=="K","protein"], UScereal[mfr=="G","protein"],

+ names=c("Kellogs", "General Mills"), xlab="Manufacturer", ylab="protein")

[Boxplots of protein by manufacturer (Kellogs, General Mills)]

Quantile plot

A normal QQ plot displays the pairs (z_{k/(n+1)}, x_{(k)}), where x_{(k)} is the kth order

statistic and z_q is the qth quantile of N(0, 1), i.e. Φ(z_q) = q, 0 < q < 1.

>qqnorm(UScereal$calories)

[Normal Q–Q plot of calories: theoretical quantiles versus sample quantiles]

Relations between two variables

Scatterplot

>plot(UScereal$fat, UScereal$calories, xlab="Fat", ylab="Calories")

[Scatterplot of Calories versus Fat]

Relations between more than two variables

Scatterplot matrix

>plot(UScereal[c("calories", "fat", "protein", "sugars","fibre", "sodium")])

[Scatterplot matrix of calories, fat, protein, sugars, fibre, and sodium]

Parallel plot

># parallel() is in the lattice package (renamed parallelplot() in recent versions)

>library(lattice)

>parallel(UScereal[, c("calories","protein", "fat", "fibre")])

[Parallel coordinates plot of calories, protein, fat, and fibre]

4. Association rules

(Market basket analysis)


Market basket analysis

◮ Association rules show the relationships between data items.

◮ Typical example

A grocery store keeps a record of weekly transactions. Each record

represents the items bought during one cash register

transaction. The objective of market basket analysis is to

determine the items likely to be purchased together by a

customer.


Example

◮ Items: {Beer, Bread, Jelly, Milk, PeanutButter}

Transaction Items

t1 Bread, Jelly, PeanutButter

t2 Bread, PeanutButter

t3 Bread, Milk, PeanutButter

t4 Beer, Bread

t5 Beer, Milk

◮ 100% of the time that PeanutButter is purchased, so is Bread.

◮ 33.3% of the time PeanutButter is purchased, Jelly is also

purchased.

◮ PeanutButter exists in 60% of the overall transactions.


Definitions

◮ Given:

◮ Given:

– a set of items I = {I_1, . . . , I_m}

– a database of transactions D = {t_1, . . . , t_n}, where t_i = {I_{i1}, . . . , I_{ik}} and I_{ij} ∈ I

◮ Association rule

Let X and Y be two disjoint subsets (itemsets) of I. We say that

Y is associated with X (and write X ⇒ Y) if the appearance of

X in a transaction "usually" implies that Y occurs in that

transaction too. We identify X ⇔ {X is purchased}.


Support and confidence

◮ Support s of an association rule X ⇒ Y is the percentage of

transactions in the database that contain both X and Y:

  s(X ⇒ Y) = P(X ∩ Y) = \frac{1}{n} \sum_{i=1}^{n} 1\{t_i ⊇ X ∪ Y\}.

(Here P(X ∩ Y) denotes the probability that a transaction contains all the items of X ∪ Y.)

◮ Confidence or strength α of an association rule X ⇒ Y is the ratio of

the number of transactions that contain both X and Y to the number of

transactions that contain X:

  α(X ⇒ Y) = P(Y | X) = \frac{P(X ∩ Y)}{P(X)} = \frac{\sum_{i=1}^{n} 1\{t_i ⊇ X ∪ Y\}}{\sum_{i=1}^{n} 1\{t_i ⊇ X\}}.

◮ Problem: find all rules with support ≥ s0 and confidence ≥ α0.
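As an illustration (not part of the original slides), support and confidence for the toy
transactions above can be computed in R from a 0/1 transaction matrix; the matrix
trans and the helper functions below are made up for this example.

# Sketch: the five toy transactions encoded as a logical item matrix
items <- c("Beer", "Bread", "Jelly", "Milk", "PeanutButter")
trans <- rbind(
  t1 = c(FALSE, TRUE,  TRUE,  FALSE, TRUE),
  t2 = c(FALSE, TRUE,  FALSE, FALSE, TRUE),
  t3 = c(FALSE, TRUE,  FALSE, TRUE,  TRUE),
  t4 = c(TRUE,  TRUE,  FALSE, FALSE, FALSE),
  t5 = c(TRUE,  FALSE, FALSE, TRUE,  FALSE))
colnames(trans) <- items

contains   <- function(itemset) apply(trans[, itemset, drop = FALSE], 1, all)
support    <- function(X, Y) mean(contains(c(X, Y)))
confidence <- function(X, Y) sum(contains(c(X, Y))) / sum(contains(X))

support("Bread", "PeanutButter")       # 0.6
confidence("Bread", "PeanutButter")    # 0.75
confidence("PeanutButter", "Bread")    # 1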


Support and confidence of some rules

X ⇒ Y                          s       α

Bread ⇒ PeanutButter          60%     75%

PeanutButter ⇒ Bread          60%     100%

Beer ⇒ Bread                  20%     50%

PeanutButter ⇒ Jelly          20%     33.3%

Jelly ⇒ PeanutButter          20%     100%

Jelly ⇒ Milk                   0%     0%


Other measures of rule quality

Rules with high support and confidence may be obvious or not interesting.

◮ Example: 100 baskets, purchases of tea and coffee

              coffee   coffee^c   Σ row

  tea            20        5        25

  tea^c          70        5        75

  Σ col          90       10       100

Rule tea ⇒ coffee: s = 0.2, α = (20/100)/(25/100) = 0.8 ⇒ a strong rule!

However, s(coffee) = 90/100 = 0.9; thus, there is a negative

"association" between buying tea and buying coffee.

◮ Additional measures of the rule quality are needed.


Lift and conviction

◮ Lift (interest)

  lift(X ⇒ Y) = \frac{P(X ∩ Y)}{P(X) P(Y)} = \frac{\frac{1}{n}\sum_{i=1}^{n} 1(t_i ⊇ X ∪ Y)}{\frac{1}{n}\sum_{i=1}^{n} 1(t_i ⊇ X) \cdot \frac{1}{n}\sum_{i=1}^{n} 1(t_i ⊇ Y)}

Rules with lift ≥ 1 are interesting. In the previous example

  lift(tea ⇒ coffee) = 0.2 / (0.25 × 0.9) = 0.89.

◮ Conviction

  conviction(X ⇒ Y) = \frac{P(X) P(Y^c)}{P(X ∩ Y^c)},

where Y^c is the event that Y is not purchased, and each probability is estimated by

the corresponding fraction of transactions.

Rules that always hold have conviction = ∞. In the previous example

  conviction(tea ⇒ coffee) = (25/100 · 10/100)/(5/100) = 0.5.


Lift and conviction of some rules

X ⇒ Y                        Lift    Conviction

Bread ⇒ PeanutButter          5/4       8/5

PeanutButter ⇒ Bread          5/4        ∞

Beer ⇒ Bread                  5/8       2/5

PeanutButter ⇒ Jelly          5/3       6/5

Jelly ⇒ PeanutButter          5/3        ∞

Jelly ⇒ Milk                   0        3/5


Mining rules from frequent itemsets

1. Find frequent itemsets (itemset whose number of occurrences is above

a threshold s0).

2. Generate rules from frequent itemsets.

Input: D – database, I – collection of all items,

L – collection of all frequent itemsets, s0, α0.

Output: R – association rules satisfying s0 and α0.

R = ∅;

for each ℓ ∈ L do

  for each x ⊂ ℓ such that x ≠ ∅ do

    if support(ℓ)/support(x) ≥ α0 then R = R ∪ {x ⇒ (ℓ − x)};
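A small R sketch of this loop (illustrative only; it assumes the logical transaction
matrix trans from the earlier example, and the itemset_support helper defined here is
hypothetical):

# Sketch: generate rules x => (l - x) from a list of frequent itemsets
itemset_support <- function(itemset)
  mean(apply(trans[, itemset, drop = FALSE], 1, all))

gen_rules <- function(frequent, conf_min = 0.5) {
  rules <- character(0)
  for (l in frequent) {
    if (length(l) < 2) next
    for (m in 1:(length(l) - 1)) {                 # all non-empty proper subsets x of l
      for (x in combn(l, m, simplify = FALSE)) {
        if (itemset_support(l) / itemset_support(x) >= conf_min)
          rules <- c(rules, paste(paste(x, collapse = ","), "=>",
                                  paste(setdiff(l, x), collapse = ",")))
      }
    }
  }
  rules
}

L <- list("Beer", "Bread", "Milk", "PeanutButter",
          c("Bread", "PeanutButter"))              # frequent itemsets for s0 = 30%
gen_rules(L, conf_min = 0.5)
# "Bread => PeanutButter"  "PeanutButter => Bread"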


Example

Assume s0 = 30% and α0 = 50%.

◮ Frequent itemsets L:

{{Beer}, {Bread}, {Milk}, {PeanutButter}, {Bread, PeanutButter}}

◮ For ℓ = {Bread, PeanutButter} we have two subsets:

support({Bread, PeanutButter}) / support({Bread}) = 60/80 = 0.75 > 0.5

support({Bread, PeanutButter}) / support({PeanutButter}) = 60/60 = 1 > 0.5

◮ Conclusion:

PeanutButter ⇒ Bread and Bread ⇒ PeanutButter are valid

association rules.


1. Finding frequent itemsets: apriori algorithm

◮ Frequent itemset property

Any subset of a frequent itemset must be frequent.

◮ Basic idea:

– Look at candidate itemsets of size i

– Keep the frequent itemsets of size i

– Generate candidate itemsets of size i + 1 by joining (taking unions

of) the frequent itemsets found in pass i.


2. Finding frequent itemsets: apriori algorithm

At the kth pass of the apriori algorithm we form a set Ck of candidate itemsets

of size k and a set Lk of frequent itemsets of size k.

1. Start with the set C1 of all singleton itemsets. Count the support of all items

in C1 and form the set L1 of all items from C1 with support ≥ s0.

2. Let C2 be the set of all pairs of items from L1. Count the support of all members

of C2 and form the set L2 of pairs with support ≥ s0.

3. Let C3 be the set of triples, every pair of which belongs to L2.

Calculate the support of each triple in C3 and form the set L3 of triples

with support ≥ s0.

4. Continue in the same way until no new frequent itemsets appear (see the

sketch below for how this is usually run in practice).
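A hedged sketch of running apriori with the arules package (the package is not
mentioned on the slides; the sketch assumes arules is installed and reuses the logical
transaction matrix trans from the earlier example):

# Sketch: apriori via the arules package
library(arules)

tr <- as(trans, "transactions")        # coerce the 0/1 item matrix to transactions
rules <- apriori(tr, parameter = list(support = 0.3, confidence = 0.5))
inspect(rules)                         # prints each rule with support, confidence, lift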


Example: apriori algorithm

s0 = 30%, α0 = 50%

Pass   Candidates                                 Frequent itemsets

1      {Beer},{Bread},{Jelly},                    {Beer},{Bread},
       {PeanutButter},{Milk}                      {Milk},{PeanutButter}

2      {Beer,Bread},{Beer,Milk},                  {Bread,PeanutButter}
       {Beer,PeanutButter},{Bread,Milk},
       {Bread,PeanutButter},
       {Milk,PeanutButter}


Summary

◮ Efficiently finding frequent itemsets

Finding frequent itemsets is costly. If there are m items, potentially

there may be 2^m − 1 frequent itemsets.

◮ Once all frequent itemsets are found, generating the association rules

is easy and straightforward.


Other applications

Applications of association rules are not limited to basket analysis.

◮ Finding related concepts: items=words, baskets=documents (web

pages, tweets,...). We look for sets of words appearing together in

many documents. Expect that {Brad, Angelina} appears with

surprising frequency.

◮ Plagiarism: items = documents, baskets = sentences; an item is in the

basket if the sentence appears in the document.

◮ Biomarkers: items are of two types – biomarkers (genes, proteins,...)

and diseases; each basket is the set of data about the patient (genome

and blood analysis, medical history...). A frequent itemset that

contains one disease and one or more biomarkers suggests a test for a

disease.


Example: DVD movies purchases

◮ Data:

> data<-read.table("DVDdata.txt",header=T)

> data

Braveheart Gladiator Green.Mile Harry.Potter1 Harry.Potter2 LOTR1 LOTR2

1 0 0 1 1 0 1 1

2 1 1 0 0 0 0 0

3 0 0 0 0 0 1 1

4 0 1 0 0 0 0 0

5 0 1 0 0 0 0 0

6 0 1 0 0 0 0 0

7 0 0 0 1 1 0 0

8 0 1 0 0 0 0 0

9 0 1 0 0 0 0 0

10 0 1 1 0 0 1 0


Patriot Sixth.Sense

1 0 1

2 1 0

3 0 0

4 1 1

5 1 1

6 1 1

7 0 0

8 1 0

9 1 1

10 0 1

>

◮ Preparations

> nobs<-dim(data)[1]

> n<-dim(data)[2]

> namesvec<-colnames(data)

> namesvec

[1] "Braveheart" "Gladiator" "Green.Mile" "Harry.Potter1"

[5] "Harry.Potter2" "LOTR1" "LOTR2" "Patriot"

[9] "Sixth.Sense"

>


> # thresholds for rules

> supthresh<-0.2

> conftresh<-0.5

> lifttresh<-2

>

> sup1<-array(0,n)

> sup2<-matrix(0,ncol=n,nrow=n,dimnames=list(namesvec,namesvec))

◮ Calculating the chance of appearance P (X) for each movie

> for (i in 1:n){

+ sup1[i]<-sum(data[,i])/nobs}

> sup1

[1] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6

◮ Calculating the chance of appearance P (X, Y ) for each pair of movies

> for (j in 1:n){

+ if(sup1[j]>=supthresh){

+ for (k in j:n){

+ if (sup1[k]>=supthresh){

+ sup2[j,k]<-data[,j]%*%data[,k]

+ sup2[k,j]<-sup2[j,k] } } } }

> sup2<-sup2/nobs

> sup2


Braveheart Gladiator Green.Mile Harry.Potter1 Harry.Potter2 LOTR1

Braveheart 0 0.0 0.0 0.0 0 0.0

Gladiator 0 0.7 0.1 0.0 0 0.1

Green.Mile 0 0.1 0.2 0.1 0 0.2

Harry.Potter1 0 0.0 0.1 0.2 0 0.1

Harry.Potter2 0 0.0 0.0 0.0 0 0.0

LOTR1 0 0.1 0.2 0.1 0 0.3

LOTR2 0 0.0 0.1 0.1 0 0.2

Patriot 0 0.6 0.0 0.0 0 0.0

Sixth.Sense 0 0.5 0.2 0.1 0 0.2

LOTR2 Patriot Sixth.Sense

Braveheart 0.0 0.0 0.0

Gladiator 0.0 0.6 0.5

Green.Mile 0.1 0.0 0.2

Harry.Potter1 0.1 0.0 0.1

Harry.Potter2 0.0 0.0 0.0

LOTR1 0.2 0.0 0.2

LOTR2 0.2 0.0 0.1

Patriot 0.0 0.6 0.4

Sixth.Sense 0.1 0.4 0.6

◮ Calculating the confidence matrix P (column|row)


> conf2<-sup2/c(sup1)

> conf2

Braveheart Gladiator Green.Mile Harry.Potter1 Harry.Potter2

Braveheart 0 0.0000000 0.0000000 0.0000000 0

Gladiator 0 1.0000000 0.1428571 0.0000000 0

Green.Mile 0 0.5000000 1.0000000 0.5000000 0

Harry.Potter1 0 0.0000000 0.5000000 1.0000000 0

Harry.Potter2 0 0.0000000 0.0000000 0.0000000 0

LOTR1 0 0.3333333 0.6666667 0.3333333 0

LOTR2 0 0.0000000 0.5000000 0.5000000 0

Patriot 0 1.0000000 0.0000000 0.0000000 0

Sixth.Sense 0 0.8333333 0.3333333 0.1666667 0

LOTR1 LOTR2 Patriot Sixth.Sense

Braveheart 0.0000000 0.0000000 0.0000000 0.0000000

Gladiator 0.1428571 0.0000000 0.8571429 0.7142857

Green.Mile 1.0000000 0.5000000 0.0000000 1.0000000

Harry.Potter1 0.5000000 0.5000000 0.0000000 0.5000000

Harry.Potter2 0.0000000 0.0000000 0.0000000 0.0000000

LOTR1 1.0000000 0.6666667 0.0000000 0.6666667

LOTR2 1.0000000 1.0000000 0.0000000 0.5000000

Patriot 0.0000000 0.0000000 1.0000000 0.6666667

Sixth.Sense 0.3333333 0.1666667 0.6666667 1.0000000


◮ Calculating the lift matrix

> tmp<-matrix(c(sup1),nrow=n,ncol=n,byrow=TRUE)

> tmp

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]

[1,] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6

[2,] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6

[3,] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6

[4,] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6

[5,] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6

[6,] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6

[7,] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6

[8,] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6

[9,] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6

>

> lift2<-conf2/tmp

> lift2

Braveheart Gladiator Green.Mile Harry.Potter1 Harry.Potter2

Braveheart 0 0.0000000 0.0000000 0.0000000 0

Gladiator 0 1.4285714 0.7142857 0.0000000 0

Green.Mile 0 0.7142857 5.0000000 2.5000000 0

Harry.Potter1 0 0.0000000 2.5000000 5.0000000 0


Harry.Potter2 0 0.0000000 0.0000000 0.0000000 0

LOTR1 0 0.4761905 3.3333333 1.6666667 0

LOTR2 0 0.0000000 2.5000000 2.5000000 0

Patriot 0 1.4285714 0.0000000 0.0000000 0

Sixth.Sense 0 1.1904762 1.6666667 0.8333333 0

LOTR1 LOTR2 Patriot Sixth.Sense

Braveheart 0.0000000 0.0000000 0.000000 0.0000000

Gladiator 0.4761905 0.0000000 1.428571 1.1904762

Green.Mile 3.3333333 2.5000000 0.000000 1.6666667

Harry.Potter1 1.6666667 2.5000000 0.000000 0.8333333

Harry.Potter2 0.0000000 0.0000000 0.000000 0.0000000

LOTR1 3.3333333 3.3333333 0.000000 1.1111111

LOTR2 3.3333333 5.0000000 0.000000 0.8333333

Patriot 0.0000000 0.0000000 1.666667 1.1111111

Sixth.Sense 1.1111111 0.8333333 1.111111 1.6666667

◮ Extracting and printing rules

> rulesmat<-(sup2>=supthresh)*(conf2>=conftresh)*(lift2>=lifttresh)


> rulesmat

Braveheart Gladiator Green.Mile Harry.Potter1 Harry.Potter2 LOTR1

Braveheart 0 0 0 0 0 0

Gladiator 0 0 0 0 0 0

Green.Mile 0 0 0 0 0 1

Harry.Potter1 0 0 0 0 0 0

Harry.Potter2 0 0 0 0 0 0

LOTR1 0 0 1 0 0 0

LOTR2 0 0 0 0 0 1

Patriot 0 0 0 0 0 0

Sixth.Sense 0 0 0 0 0 0

LOTR2 Patriot Sixth.Sense

Braveheart 0 0 0

Gladiator 0 0 0

Green.Mile 0 0 0

Harry.Potter1 0 0 0

Harry.Potter2 0 0 0

LOTR1 1 0 0

LOTR2 0 0 0

Patriot 0 0 0

Sixth.Sense 0 0 0


> diag(rulesmat)<-0

> rules<-NULL

> for (j in 1:n){

+ if (sum(rulesmat[j,])>0){

+ rules<-c(rules,paste(namesvec[j],"->",namesvec[rulesmat[j,]==1],sep=""))

+ }

+ }

> rules

[1] "Green.Mile->LOTR1" "LOTR1->Green.Mile" "LOTR1->LOTR2"

[4] "LOTR2->LOTR1"

◮ If we set supthresh<-0.1 then we find 12 rules

> rules

[1] "Green.Mile->Harry.Potter1" "Green.Mile->LOTR1"

[3] "Green.Mile->LOTR2" "Harry.Potter1->Green.Mile"

[5] "Harry.Potter1->Harry.Potter2" "Harry.Potter1->LOTR2"

[7] "Harry.Potter2->Harry.Potter1" "LOTR1->Green.Mile"

[9] "LOTR1->LOTR2" "LOTR2->Green.Mile"

[11] "LOTR2->Harry.Potter1" "LOTR2->LOTR1"


5. Predictive data mining: general issues

◮ Regression problem

◮ Classification problem

◮ Assessing goodness of a predictive model


Regression problem

◮ Regression problem: to model a response variable Y ∈ R as a

function of predictor variables X = (X_1, . . . , X_p) ∈ R^p. If Y is a

continuous variable then a plausible model is

  Y = f(X) + ε,  where ε is random noise with Eε = 0,

and f is the regression function, f(X) = E{Y | X}.

◮ Fitting the model: based on the data

  D_n = {(Y_i, X_i), i = 1, . . . , n},  X_i = (X_{i1}, . . . , X_{ip}),

the goal is to construct an estimate \hat{f}(·) = \hat{f}(·; D_n) of f(·).

◮ The model can be parametric or non–parametric.


Approaches to regression modeling

◮ Parametric approach: a parametric form for the unknown regression

function is assumed,

  f(x) = f(x, θ),  θ ∈ Θ ⊂ R^m.

E.g., f(x, θ) = θ^T x, where θ is an unknown vector (linear regression).

◮ Nonparametric approach:

no specific parametric form for f is assumed.


Classification problem

◮ Classification problem: the objective is to model a binary

(categorical) response variable Y as a function of predictor variables

X = (X1, . . . , Xp).

◮ Each subject belongs to one of the two populations Π_0 or Π_1. For the

i-th subject X_i is observed, together with the corresponding "population label"

Y_i ∈ {0, 1}. Based on the data

  D_n = {(Y_i, X_i), i = 1, . . . , n},  X_i = (X_{i1}, . . . , X_{ip}),

the goal is to construct a classifier (prediction rule) \hat{f}(·) = \hat{f}(·; D_n)

that for each x predicts the label Y.

◮ Approaches: parametric and non-parametric.


Assessing goodness of a predictive model

◮ A "naive" (resubstitution) approach: \hat{f}(X_i) should be close to Y_i for

each i. One can look, e.g., at the following performance indices:

  S_{L_2} = \frac{1}{n} \sum_{i=1}^{n} [Y_i − \hat{f}(X_i)]^2,

  S_{L_1} = \frac{1}{n} \sum_{i=1}^{n} |Y_i − \hat{f}(X_i)|.

If Y is a binary variable (as in the classification problem) then the

appropriate index is

  S_{0/1} = \frac{1}{n} \sum_{i=1}^{n} I\{\hat{f}(X_i) ≠ Y_i\}.

The smaller S_{L_2} (S_{L_1}, S_{0/1}) is, the better the fit.

Is that a good approach?


Training and validation data sets

◮ Overfitting: "fitting the noise".

The model is evaluated on the same data it was fitted to.

◮ Training and validation data sets:

Divide the data D_n into two subsamples of sizes n_1 and n_2,

respectively: a training sample D'_n and a validation sample D''_n.

Construct the estimate on the basis of D'_n only: \hat{f}(x) = \hat{f}(x; D'_n).

Assess goodness–of–fit on the basis of the validation sample D''_n, e.g.,

  S_{L_2} = \frac{1}{n_2} \sum_{i:(Y_i,X_i) ∈ D''_n} [Y_i − \hat{f}(X_i; D'_n)]^2.

◮ Drawback: D''_n is used only for validation purposes.


Leave–one–out cross–validation

◮ Cross-validation (leave–one–out): original data set

  D_n = {(X_1, Y_1), . . . , (X_n, Y_n)}.

The data set D_n^{(−i)} is D_n without the i-th observation:

  D_n^{(−i)} = {(X_1, Y_1), . . . , (X_{i−1}, Y_{i−1}), (X_{i+1}, Y_{i+1}), . . . , (X_n, Y_n)}.

◮ Cross–validation accuracy estimate:

  S_{L_2}^{CV} = \frac{1}{n} \sum_{i=1}^{n} [Y_i − \hat{f}(X_i; D_n^{(−i)})]^2.

◮ In general computationally expensive; cheap for linear regression:

  S_{L_2}^{CV} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{Y_i − \hat{Y}_i}{1 − h_i} \right)^2,

where h_i is the i-th diagonal element of the hat matrix (the leverage statistic).
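A short R illustration of the linear-regression shortcut (not from the slides; it uses
the built-in cars data as an arbitrary example, and hatvalues() returns the leverages h_i):

# Sketch: leave-one-out CV error for a linear model, two equivalent ways
fit <- lm(dist ~ speed, data = cars)

# shortcut via the leverages / hat matrix
cv_short <- mean(((cars$dist - fitted(fit)) / (1 - hatvalues(fit)))^2)

# brute force: refit n times, leaving one observation out each time
n <- nrow(cars)
cv_brute <- mean(sapply(1:n, function(i) {
  fit_i <- lm(dist ~ speed, data = cars[-i, ])
  (cars$dist[i] - predict(fit_i, newdata = cars[i, ]))^2
}))

c(shortcut = cv_short, brute_force = cv_brute)   # the two estimates agree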


k–fold cross-validation

◮ Divide the set of observations D_n = {(X_i, Y_i), i = 1, . . . , n} at random

into k groups D_n^{(1)}, . . . , D_n^{(k)}. Let D_n^{(−j)} be the dataset without the jth

group of observations, j = 1, . . . , k.

◮ Cross-validation accuracy estimate:

  S_{L_2}^{CV(k)} = \frac{1}{k} \sum_{j=1}^{k} MSE_{(j)},

  MSE_{(j)} = \frac{1}{n_j} \sum_{i ∈ D_n^{(j)}} [Y_i − \hat{f}(X_i; D_n^{(−j)})]^2,

where n_j is the number of observations in the jth group.

◮ Computationally faster than leave-one-out CV, but more biased. In

practice, 10–fold cross-validation is commonly used.
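Continuing the illustration above (again on the cars data, which is only an example),
a minimal k-fold cross-validation sketch:

# Sketch: 10-fold cross-validation for a linear model
set.seed(1)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(cars)))    # random group labels

mse_fold <- sapply(1:k, function(j) {
  train <- cars[folds != j, ]
  test  <- cars[folds == j, ]
  fit_j <- lm(dist ~ speed, data = train)
  mean((test$dist - predict(fit_j, newdata = test))^2)
})

mean(mse_fold)    # k-fold CV estimate of the prediction error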


Bootstrap methods

◮ Bootstrapping refers to a self-sustained process that proceeds without

external help. The term is sometimes attributed to Rudolf Erich

Raspe’s story “The Surprising Adventures of Baron Munchausen”,

where the main character pulls himself (and his horse) out of a

swamp by his hair (specifically, his pigtail).

◮ In statistics, the bootstrap is a method for assessing the accuracy of

statistical procedures by resampling from empirical distributions.

It was introduced by Bradley Efron in 1979.


Bootstrap idea

◮ Training set: Dn = {(X1, Y1), . . . , (Xn, Yn)}.

◮ The idea is to

– randomly draw, with replacement, B (bootstrap) datasets

  {D*_{n,b}, b = 1, . . . , B}

from D_n, each sample D*_{n,b} being of size n;

– refit the model on each of the bootstrap datasets and compute an

accuracy measure;

– average the accuracy measure over the B replications.


[Figure: the original data Z = {(Obs, X, Y)} and B bootstrap samples Z*1, Z*2, . . . , Z*B drawn with replacement from Z, each yielding an estimate α*1, α*2, . . . , α*B]

Bootstrap accuracy estimate

◮ The final bootstrap accuracy estimate:

  S_{L_2}^{B} = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{n} \sum_{i=1}^{n} [Y_i − \hat{f}(X_i; D*_{n,b})]^2.

◮ There are modifications like leave-one-out bootstrap accuracy

estimates...
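A hedged R sketch of this estimate for the same toy linear model (the cars data and
B = 200 are arbitrary choices made for the illustration):

# Sketch: bootstrap accuracy estimate for a linear model
set.seed(1)
B <- 200
n <- nrow(cars)

boot_mse <- replicate(B, {
  idx   <- sample(n, n, replace = TRUE)                  # one bootstrap dataset D*_{n,b}
  fit_b <- lm(dist ~ speed, data = cars[idx, ])
  mean((cars$dist - predict(fit_b, newdata = cars))^2)   # error on the original data
})

mean(boot_mse)    # S^B_{L2}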


6. Review of Linear Algebra


Matrix and vector

◮ Matrix: a rectangular array of numbers, e.g., A ∈ R^{n×p}:

  A = \begin{pmatrix}
  a_{11} & a_{12} & \cdots & a_{1p} \\
  a_{21} & a_{22} & \cdots & a_{2p} \\
  \vdots & \vdots &        & \vdots \\
  a_{n1} & a_{n2} & \cdots & a_{np}
  \end{pmatrix},
  \qquad A = \{a_{ij}\},  i = 1, . . . , n;  j = 1, . . . , p.

◮ Vector: a matrix containing one column, x = [x_1; · · · ; x_n] ∈ R^n.

◮ Think of a matrix as a linear operation on vectors: when an n × p

matrix A is applied to (multiplies) a vector x ∈ R^p, it returns a

vector y = Ax ∈ R^n.


1. Matrix multiplication

◮ If A ∈ R^{n×p} and B ∈ R^{p×m}, then C = AB ∈ R^{n×m} with

  c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj},  i = 1, . . . , n;  j = 1, . . . , m.

◮ Special cases

– inner product of vectors: if x ∈ R^n, y ∈ R^n then x^T y = \sum_{i=1}^{n} x_i y_i.

– matrix–vector multiplication: if A ∈ R^{n×p} and x ∈ R^p then

  Ax = [a_1, a_2, . . . , a_p] x = \sum_{i=1}^{p} a_i x_i.

The product Ax is a linear combination of the matrix columns {a_j}

with weights x_1, . . . , x_p.


2. Matrix multiplication

◮ Properties

– Associative: (AB)C = A(BC)

– Distributive: (A + B)C = AC + BC

– Non–commutative: in general AB ≠ BA

◮ Block multiplication: if A = [A_{ik}], B = [B_{kj}], where the A_{ik} and B_{kj}

are matrix blocks and the number of columns in A_{ik} equals the

number of rows in B_{kj}, then

  C = AB = [C_{ij}],  C_{ij} = \sum_{k} A_{ik} B_{kj}.


Special types of square matrices

◮ Diagonal: A = diag{a_{11}, a_{22}, . . . , a_{nn}}

◮ Identity: I = I_n = diag{1, . . . , 1} (n ones on the diagonal)

◮ Symmetric: A = A^T

◮ Orthogonal: A^T A = I_n = A A^T

◮ Idempotent (projection): A^2 = A · A = A.


Linear independence and rank

◮ Vectors x_1, . . . , x_n are linearly independent if there exist no

constants c_1, . . . , c_n, not all zero, such that

  c_1 x_1 + · · · + c_n x_n = 0.

◮ The rank of A ∈ R^{n×p} is the maximal number of linearly independent

columns (or, equivalently, rows). If rank(A) = min(n, p) then A is

said to be of full rank.

◮ Properties: rank(A) ≤ min(p, n), rank(A) = rank(A^T),

rank(AB) ≤ min{rank(A), rank(B)}, rank(A + B) ≤ rank(A) + rank(B).


Trace of matrix

◮ The trace of A ∈ R^{n×n} is the sum of the diagonal elements of A:

  tr(A) = \sum_{i=1}^{n} a_{ii}.

◮ Properties

– tr(A) = tr(AT )

– tr(A+B) = tr(A) + tr(B)

– tr(α · A) = α · tr(A) for all α ∈ R

– tr(AB) = tr(BA)

– if x ∈ Rn, y ∈ Rn then tr(xyT ) = xT y.


1. Determinant

◮ 2 × 2 matrix: the determinant is

  det \begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad − bc.

The absolute value of the determinant is the area of the parallelogram

formed by the vectors (a, b) and (c, d).

◮ In general, if τ = (τ_1, . . . , τ_n) is a permutation of {1, . . . , n}, then

  det(A) = \sum_{τ} (−1)^{|τ|} a_{1τ_1} a_{2τ_2} · · · a_{nτ_n},

where |τ| = 0 if τ is an even permutation of {1, . . . , n} (an even number of

transpositions) and |τ| = 1 otherwise.


2. Determinant

◮ Properties

– det(A) = det(A^T); det(αA) = α^n det(A) for all α ∈ R;

– determinant changes its sign if two columns are interchanged;

– determinant vanishes if and only if there is a linear dependence

between its columns;

– det(AB) = det(A)det(B).


Inverse matrix

◮ If A ∈ Rn×n and rank(A) = n then the inverse of A, denoted A−1 is

the matrix such that AA−1 = A−1A = In.

◮ Properties:

(A−1)−1 = A; (AB)−1 = B−1A−1; (A−1)T = (AT )−1.

◮ If det(A) = 0 then matrix A is singular (the inverse matrix does not

exist).


Range and null space (kernel) of a matrix

◮ Span: for x_i ∈ R^p, i = 1, . . . , n,

  span(x_1, . . . , x_n) = \{ \sum_{i=1}^{n} α_i x_i : α_i ∈ R, i = 1, . . . , n \}.

◮ Range: if A ∈ R^{n×p} then

  Range(A) = {Ax : x ∈ R^p}.

Range(A) is the span of the columns of A.

◮ Null space (kernel) of a matrix:

  Ker(A) = {x ∈ R^p : Ax = 0}.


Eigenvalues and eigenvectors

◮ Characteristic polynomial: let A ∈ R^{p×p}; then

  q(λ) = det(A − λI),  λ ∈ R,

is the characteristic polynomial of the matrix A. The roots λ_1, . . . , λ_p of this

polynomial are the eigenvalues of A: det(A − λ_j I) = 0, j = 1, . . . , p.

◮ A − λ_j I is a singular matrix; therefore there exists a non–zero vector

γ ∈ R^p such that Aγ = λ_j γ. This vector is called the eigenvector of A

corresponding to the eigenvalue λ_j. We can normalize eigenvectors so

that γ^T γ = 1.

◮ q(λ) = (−1)^p \prod_{j=1}^{p} (λ − λ_j) = det(A − λI); hence

det(A) = q(0) = \prod_{j=1}^{p} λ_j. In addition, tr(A) = \sum_{j=1}^{p} λ_j.


Symmetric matrices, spectral decomposition

◮ All eigenvalues of a symmetric matrix are real.

◮ Orthogonal matrix: if c_1, . . . , c_p are orthonormal vectors (a basis), i.e.,

c_i^T c_j = 0 for i ≠ j and c_i^T c_i = 1 for all i, then the matrix C = [c_1, c_2, . . . , c_p] is

orthogonal: C C^T = C^T C = I ⇒ C^{−1} = C^T.

◮ Spectral decomposition: any symmetric p × p matrix A can be

represented as

  A = Γ Λ Γ^T = \sum_{j=1}^{p} λ_j γ_j γ_j^T,

where Λ = diag{λ_1, . . . , λ_p}, the λ_j are the eigenvalues, Γ = [γ_1, . . . , γ_p], and

the γ_j are the corresponding orthonormal eigenvectors.

◮ If A is a non–singular symmetric matrix then A^n = Γ Λ^n Γ^T. In particular, if

λ_j ≥ 0 for all j, then \sqrt{A} = Γ Λ^{1/2} Γ^T.
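A small numerical check of the spectral decomposition in R (the matrix is an arbitrary
illustration):

# Sketch: spectral decomposition of a symmetric matrix with eigen()
A <- matrix(c(2, 1, 1, 3), nrow = 2)     # symmetric 2 x 2 matrix
e <- eigen(A)
Gamma  <- e$vectors                      # orthonormal eigenvectors (columns)
Lambda <- diag(e$values)                 # diagonal matrix of eigenvalues

all.equal(A, Gamma %*% Lambda %*% t(Gamma))          # A = Gamma Lambda Gamma^T
sqrtA <- Gamma %*% diag(sqrt(e$values)) %*% t(Gamma)
all.equal(A, sqrtA %*% sqrtA)                        # square root of A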


Eigenvalues characterization for symmetric matrices

◮ Let A be an n × n symmetric matrix with eigenvalues

  λ_{min} = λ_1 ≤ λ_2 ≤ · · · ≤ λ_{n−1} ≤ λ_n = λ_{max}.

Then

  λ_{min} x^T x ≤ x^T A x ≤ λ_{max} x^T x  for all x ∈ R^n,

  λ_{max} = \max_{x ≠ 0} \frac{x^T A x}{x^T x} = \max_{x^T x = 1} x^T A x,

  λ_{min} = \min_{x ≠ 0} \frac{x^T A x}{x^T x} = \min_{x^T x = 1} x^T A x.

In addition, if γ_1, . . . , γ_n are the eigenvectors corresponding to λ_1, . . . , λ_n,

then

  \max_{x ≠ 0, \; x ⊥ γ_n, . . . , γ_{n−k+1}} \frac{x^T A x}{x^T x} = λ_{n−k},  k = 1, . . . , n − 1.


Quadratic forms, projection matrices, etc.

◮ Quadratic form: Q(x) = x^T A x.

◮ For symmetric A: if Q(x) = x^T A x > 0 for all x ≠ 0 in R^p, then A is

positive definite. Equivalently, λ_j(A) > 0 for all j = 1, . . . , p.

◮ Projection (idempotent) matrix: P = P^2. A typical example is the hat

matrix in regression: if X ∈ R^{n×p} then

  P = X (X^T X)^{−1} X^T

is idempotent. It is the projection onto the column space (range) of

the matrix X.
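A quick numerical illustration in R (the design matrix is arbitrary and not from the
slides):

# Sketch: the hat matrix is idempotent and projects onto the column space of X
set.seed(1)
X <- cbind(1, matrix(rnorm(20), nrow = 10, ncol = 2))   # 10 x 3 design matrix
P <- X %*% solve(t(X) %*% X) %*% t(X)

all.equal(P %*% P, P)       # idempotent: P^2 = P
all.equal(P %*% X, X)       # columns of X are unchanged by the projection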


7. Multiple Linear Regression


Linear regression model

◮ Model: the response Y is modeled as a linear function of the predictors

X_1, . . . , X_p plus an error ε:

  Y = β_0 + β_1 X_1 + · · · + β_p X_p + ε.

The data are {Y_i, X_{i1}, . . . , X_{ip}, i = 1, . . . , n}. In matrix form,

  Y = X β + ε,

where Y is n × 1, X is n × (p + 1), β is (p + 1) × 1, ε is n × 1, and

  Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix},
  \quad X = \begin{pmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{np} \end{pmatrix},
  \quad ε = \begin{pmatrix} ε_1 \\ \vdots \\ ε_n \end{pmatrix}.


Model fitting

◮ Basic assumptions: zero-mean, uncorrelated errors:

  Eε = 0,  cov(ε) = E(ε ε^T) = σ^2 I_n,  with I_n the n × n identity matrix.

◮ Least squares estimator \hat{β}

The idea is to minimize the sum of squared errors, \min_β S(β), where

  S(β) = (Y − Xβ)^T (Y − Xβ) = Y^T Y − 2 β^T X^T Y + β^T X^T X β.

Differentiate S(β) with respect to β and set the gradient to zero:

  ∇_β S(β) = −2 X^T Y + 2 X^T X β = 0  ⇒  \hat{β} = (X^T X)^{−1} X^T Y,

provided that X^T X is non–singular.
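A brief R check of this formula (illustrative; it uses the built-in cars data, which is
not part of the slides):

# Sketch: least squares via the normal equations versus lm()
X <- cbind(1, cars$speed)                  # design matrix with an intercept column
Y <- cars$dist

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% Y
beta_hat
coef(lm(dist ~ speed, data = cars))        # same coefficients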


Predicted values and residuals

◮ Predicted (fitted) values:

  \hat{Y} = X \hat{β} = X (X^T X)^{−1} X^T Y = H Y,

where H is the n × n hat matrix; H = H^T and H = H^2, i.e. H is a projection matrix

that projects onto the column space of X.

◮ Residuals:

  e = Y − \hat{Y} = (I_n − H) Y,  with I_n the n × n identity matrix.


Sums of squares

◮ Residual sum of squares:

  SSE = \sum_{i=1}^{n} e_i^2 = e^T e = (Y − \hat{Y})^T (Y − \hat{Y}).

◮ Total sum of squares: for \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i,

  SST = \sum_{i=1}^{n} (Y_i − \bar{Y})^2 = (Y − 1_n \bar{Y})^T (Y − 1_n \bar{Y}),  1_n = (1, . . . , 1)^T.

It characterizes the variability of the response Y around its average.

◮ Regression sum of squares:

  SSreg = SST − SSE.

It characterizes the variability in the data explained by the regression model.


Sums of squares

◮ The R^2 value characterizes the proportion of variability in the data explained

by the regression model:

  R^2 = SSreg / SST.

The closer R^2 is to one, the more variability is explained. R^2 grows as more

predictor variables are added to the model.
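A short R illustration of these quantities (again on the cars data, used here only as
an example):

# Sketch: SSE, SST and R^2 by hand, compared with summary(lm(...))
fit <- lm(dist ~ speed, data = cars)
Y   <- cars$dist

SSE   <- sum(residuals(fit)^2)
SST   <- sum((Y - mean(Y))^2)
SSreg <- SST - SSE

c(R2_by_hand = SSreg / SST,
  R2_from_lm = summary(fit)$r.squared)   # the two values agree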


1. Inference in the linear regression model

◮ Basic assumption: ε ∼ N_n(0, σ^2 I_n). Write MSreg = SSreg/p and

MSE = SSE/(n − p − 1). Under this assumption, if there is no relationship

between X_1, . . . , X_p and Y, i.e., if β_1 = β_2 = · · · = β_p = 0, then

  SSreg/σ^2 ∼ χ^2(p)  and  SSE/σ^2 ∼ χ^2(n − p − 1),  independently.

◮ The F–test for the hypothesis H_0 : β_1 = · · · = β_p = 0 is based on the fact that

  F* = \frac{SSreg/p}{SSE/(n − p − 1)} ∼ F(p, n − p − 1).

H_0 is rejected when F* > F_{1−α}(p, n − p − 1).


2. Inference in the linear regression model

◮ Inference on individual coefficients: \hat{β} ∼ N_{p+1}(β, σ^2 (X^T X)^{−1}) and

  \frac{\hat{β}_k − β_k}{s.d.(\hat{β}_k)} ∼ t(n − p − 1),  k = 0, 1, . . . , p,

where s.d.(\hat{β}_k) is the square root of the corresponding diagonal element of

\hat{σ}^2 (X^T X)^{−1}, with \hat{σ}^2 = MSE. Hence H_0 : β_k = 0 is rejected if

  |t*| > t_{1−α/2}(n − p − 1),  t* = \frac{\hat{β}_k}{s.d.(\hat{β}_k)}.


1. LS regression diagnostics

◮ Non–linearity of the response–predictor relationship

[Figure: residual plots (residuals versus fitted values) for a linear fit and for a quadratic fit]

2. LS regression diagnostics

◮ Correlations of errors

Standard errors are computed under the assumption of independent errors.

If the errors are correlated, confidence intervals can

have lower coverage probability than expected.

◮ Tests for serial correlation: run test, sign changes tests, etc.

◮ Time series models


3. LS regression diagnostics

◮ Non–constant variance of errors

[Figure: residuals versus fitted values for the response Y and for the response log(Y)]

◮ Remedy: transformations


4. LS regression diagnostics

◮ Outliers: points where the response variable is unusually large (small)

given predictors. Outliers can be detected in residuals plots.

◮ High leverage points have unusual X values.

◮ Collinearity refers to a situation when two or more predictors are

closely related to each other. The matrix XTX is close to singular.


Example: ozone data

># airquality data set

>ozone.lm <- lm (Ozone~Solar.R+Wind+Temp, data=airquality)

>summary(ozone.lm)

Call:

lm(formula = Ozone ~ Solar.R + Wind + Temp, data = airquality)

Residuals:

Min 1Q Median 3Q Max

-40.485 -14.219 -3.551 10.097 95.619

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -64.34208 23.05472 -2.791 0.00623 **

Solar.R 0.05982 0.02319 2.580 0.01124 *

Wind -3.33359 0.65441 -5.094 1.52e-06 ***

Temp 1.65209 0.25353 6.516 2.42e-09 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.18 on 107 degrees of freedom

Multiple R-Squared: 0.6059, Adjusted R-squared: 0.5948

F-statistic: 54.83 on 3 and 107 DF, p-value: < 2.2e-16

96

1. Plots of residuals

◮ Plot of residuals vs fitted values

>plot(ozone.lm$fitted.values, ozone.lm$residuals, xlab="Fitted values",
+     ylab="Residuals")
>abline(0,0)

[Figure: residuals of the ozone model plotted against fitted values.]

97


2. Plots of residuals

◮ QQ–plot of residuals

>qqnorm(ozone.lm$residuals)

>qqline(ozone.lm$residuals)

[Figure: normal Q–Q plot of the residuals (sample quantiles against theoretical quantiles).]

98

1. Example: Boston housing data

◮ Response variable: median value of homes (medv)

◮ Predictor variables:

– crime rate (crim); % land zoned for lots (zn)

– % nonretail business (indus); 1/0 on Charles river (chas)

– nitrogen oxide concentration (nox); average number of rooms (rm)

– % built before 1940 (age); tax rate (tax)

– weighted distance to employment centers (dis)

– % lower-status population (lstat); % black (black)

– accessibility to radial highways (rad)

– pupil/teacher ratio (ptratio)

99


2. Example: Boston housing data

> library(MASS)

> Boston.lm<- lm(medv~., data=Boston)

> summary(Boston.lm)

Call:

lm(formula = medv ~ ., data = Boston)

Residuals:

Min 1Q Median 3Q Max

-15.594 -2.730 -0.518 1.777 26.199

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***

crim -1.080e-01 3.286e-02 -3.287 0.001087 **

zn 4.642e-02 1.373e-02 3.382 0.000778 ***

indus 2.056e-02 6.150e-02 0.334 0.738288

chas 2.687e+00 8.616e-01 3.118 0.001925 **

nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***

rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***

100


age 6.922e-04 1.321e-02 0.052 0.958229

dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***

rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***

tax -1.233e-02 3.760e-03 -3.280 0.001112 **

ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***

black 9.312e-03 2.686e-03 3.467 0.000573 ***

lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.745 on 492 degrees of freedom

Multiple R-Squared: 0.7406, Adjusted R-squared: 0.7338

F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16

◮ indus and age are not significant at the 0.05 level.

>fmBoston=as.formula("medv~crim+zn+chas+nox+rm+dis+rad+tax+ptratio+black+lstat")

> Boston1.lm <- lm(fmBoston, data=Boston)

> summary(Boston1.lm)

Call:

lm(formula = fmBoston, data = Boston)

Residuals:

101


Min 1Q Median 3Q Max

-15.5984 -2.7386 -0.5046 1.7273 26.2373

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 36.341145 5.067492 7.171 2.73e-12 ***

crim -0.108413 0.032779 -3.307 0.001010 **

zn 0.045845 0.013523 3.390 0.000754 ***

chas 2.718716 0.854240 3.183 0.001551 **

nox -17.376023 3.535243 -4.915 1.21e-06 ***

rm 3.801579 0.406316 9.356 < 2e-16 ***

dis -1.492711 0.185731 -8.037 6.84e-15 ***

rad 0.299608 0.063402 4.726 3.00e-06 ***

tax -0.011778 0.003372 -3.493 0.000521 ***

ptratio -0.946525 0.129066 -7.334 9.24e-13 ***

black 0.009291 0.002674 3.475 0.000557 ***

lstat -0.522553 0.047424 -11.019 < 2e-16 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.736 on 494 degrees of freedom

Multiple R-Squared: 0.7406, Adjusted R-squared: 0.7348

F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16

102


1. Boston housing data: residual plots

> plot(residuals(Boston1.lm))

> abline(0,0)

> qqnorm(residuals(Boston1.lm))

> qqline(residuals(Boston1.lm))

103

2. Boston housing data: residual plots

[Figure: residuals of Boston1.lm plotted against the observation index, and a normal Q–Q plot of the residuals.]

104


8. Linear Model Selection and Regularization

105

The need for model selection

◮ We have many different potential predictors. Why not base the model on all of them?

◮ Two sides of one coin: bias and variance

– A model with more predictors can describe the phenomenon better – less bias.

– When we estimate more parameters, the variance of the estimators grows – we "fit the noise": overfitting!

◮ A clever model selection strategy should resolve the bias–variance trade–off.

106

Subset selection and coefficient shrinkage

◮ Why is least squares not always satisfactory?

∗ Prediction accuracy: the LS estimates often have large variance (collinearity problems, a large number of predictors, etc.)

∗ Interpretation: with a large number of predictors, we often would like to determine a smaller subset that exhibits the strongest effects.

◮ Two approaches:

∗ subset selection (identify a subset of predictors having the strongest effect on the response variable)

∗ coefficient shrinkage; this has the effect of reducing the variance of the estimates (but increasing bias...)

107

1. Criteria for subset selection

◮ How to judge whether a subset of predictors is good? The R² index is useless here, as it increases whenever new variables are added to the model.

◮ Criteria for subset selection. The idea is to adjust, or penalize, the residual sum of squares SSE for the model complexity:

– Mallows' Cp: Cp = (1/n)[SSE + 2p σ̂²]

– AIC (Akaike information criterion): penalization of the likelihood function; for linear regression it is equivalent to Cp

– BIC (Bayesian information criterion): BIC = (1/n)[SSE + log(n) p σ̂²]

– Adjusted R² = 1 − [SSE/(n − p − 1)] / [SST/(n − 1)]
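As a small illustration (a sketch, not from the original slides), the criteria above can be obtained in R for any fitted lm object; here the ozone model from the earlier example is refitted.

fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)

AIC(fit)                       # Akaike information criterion
BIC(fit)                       # Bayesian information criterion
summary(fit)$adj.r.squared     # adjusted R^2

# Mallows' Cp in the form given above, Cp = (1/n)[SSE + 2 p sigma^2]; here
# sigma^2 is estimated from the same fit just to show the formula (in practice
# sigma^2 is taken from the full model).
n <- nrow(model.frame(fit)); p <- length(coef(fit)) - 1
sigma2 <- summary(fit)$sigma^2
(sum(residuals(fit)^2) + 2 * p * sigma2) / n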

108

2. Criteria for subset selection

◮ Typical use of the criteria

[Figure: Cp, BIC and adjusted R² plotted against the number of predictors.]

◮ Procedures: best subset selection, stepwise (forward, backward) selection.

109

Best subset selection

1. Let M0 be the model without predictors.

2. For k = 1, . . . , p:

– Fit all (p choose k) models that contain k predictors;

– Pick the best among these (p choose k) models, the one with largest R²; call it Mk.

3. Select between M0, . . . , Mp using Cp, AIC, BIC, etc.
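A sketch of best subset selection in R (it assumes the 'leaps' package, which is not used elsewhere in these notes):

library(leaps)
library(MASS)

Boston.bss <- regsubsets(medv ~ ., data = Boston, nvmax = 13)   # best M_1, ..., M_13
bss.sum <- summary(Boston.bss)

bss.sum$cp      # Cp of the best model of each size
bss.sum$bic     # BIC
bss.sum$adjr2   # adjusted R^2
which.min(bss.sum$bic)                      # model size chosen by BIC
coef(Boston.bss, which.min(bss.sum$bic))    # its coefficients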

110

1. Forward selection procedure

◮ Fact: assume that we have two models:

– the first contains p variables;

– the second contains the same p variables plus q additional variables.

We want to test H0: the q additional variables are not significant. If ε ∼ N(0, σ²), then under H0

{ [SSreg(p + q) − SSreg(p)]/q } / { SSE(p + q)/(n − p − q − 1) } ∼ F(q, n − p − q − 1).

◮ Idea of the forward selection procedure: at each step add the variable which maximizes the F-statistic (provided it is significant, i.e., greater than the corresponding (1 − α)-quantile).

◮ Instead of the F–test, at each step one can use AIC.

111

2. Forward selection procedure

1. Fit a simple regression model for each variable xk, k = 1, . . . , p, and compute

F_{xk} = [SSreg(xk)/1] / [SSE(xk)/(n − 2)],   k = 1, . . . , p.

Select the variable x_{k1} with k1 = argmax_k F_{xk}; if F_{x_{k1}} > F_{1−α}(1, n − 2), add x_{k1} to the model.

2. Fit models with predictors (xk, x_{k1}), and compute

F_{xk|x_{k1}} = { [SSreg(xk, x_{k1}) − SSreg(x_{k1})]/1 } / [SSE(xk, x_{k1})/(n − 3)],   k = 1, . . . , p, k ≠ k1.

Select k2 = argmax_k F_{xk|x_{k1}}, compare with F_{1−α}(1, n − 2) and, if significant, add x_{k2} to the model. Proceed...

112


Boston housing data

> library(MASS)

> maxfmla<-as.formula(paste("medv~", paste(names(Boston[,-14]), collapse="+")))

> maxfmla

medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +

tax + ptratio + black + lstat

> Boston.lm <-lm(medv~1, data=Boston)

> Boston.fwd<-step(Boston.lm,direction="forward", scope=list(upper=maxfmla),test="F")

Start: AIC=2246.51

medv ~ 1

Df Sum of Sq RSS AIC F value Pr(>F)

+ lstat 1 23243.9 19472 1851.0 601.618 < 2.2e-16 ***

+ rm 1 20654.4 22062 1914.2 471.847 < 2.2e-16 ***

+ ptratio 1 11014.3 31702 2097.6 175.106 < 2.2e-16 ***

+ indus 1 9995.2 32721 2113.6 153.955 < 2.2e-16 ***

+ tax 1 9377.3 33339 2123.1 141.761 < 2.2e-16 ***

+ nox 1 7800.1 34916 2146.5 112.591 < 2.2e-16 ***

+ crim 1 6440.8 36276 2165.8 89.486 < 2.2e-16 ***

+ rad 1 6221.1 36495 2168.9 85.914 < 2.2e-16 ***

113


+ age 1 6069.8 36647 2171.0 83.478 < 2.2e-16 ***

+ zn 1 5549.7 37167 2178.1 75.258 < 2.2e-16 ***

+ black 1 4749.9 37966 2188.9 63.054 1.318e-14 ***

+ dis 1 2668.2 40048 2215.9 33.580 1.207e-08 ***

+ chas 1 1312.1 41404 2232.7 15.972 7.391e-05 ***

<none> 42716 2246.5

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step: AIC=1851.01

medv ~ lstat

Df Sum of Sq RSS AIC F value Pr(>F)

+ rm 1 4033.1 15439 1735.6 131.3942 < 2.2e-16 ***

+ ptratio 1 2670.1 16802 1778.4 79.9340 < 2.2e-16 ***

+ chas 1 786.3 18686 1832.2 21.1665 5.336e-06 ***

+ dis 1 772.4 18700 1832.5 20.7764 6.488e-06 ***

+ age 1 304.3 19168 1845.0 7.9840 0.004907 **

+ tax 1 274.4 19198 1845.8 7.1896 0.007574 **

+ black 1 198.3 19274 1847.8 5.1764 0.023316 *

+ zn 1 160.3 19312 1848.8 4.1758 0.041527 *

+ crim 1 146.9 19325 1849.2 3.8246 0.051059 .

114


+ indus 1 98.7 19374 1850.4 2.5635 0.109981

<none> 19472 1851.0

+ rad 1 25.1 19447 1852.4 0.6491 0.420799

+ nox 1 4.8 19468 1852.9 0.1239 0.724966

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step: AIC=1735.58

medv ~ lstat + rm

Df Sum of Sq RSS AIC F value Pr(>F)

+ ptratio 1 1711.32 13728 1678.1 62.5791 1.645e-14 ***

+ chas 1 548.53 14891 1719.3 18.4921 2.051e-05 ***

+ black 1 512.31 14927 1720.5 17.2290 3.892e-05 ***

+ tax 1 425.16 15014 1723.5 14.2154 0.0001824 ***

+ dis 1 351.15 15088 1725.9 11.6832 0.0006819 ***

+ crim 1 311.42 15128 1727.3 10.3341 0.0013900 **

+ rad 1 180.45 15259 1731.6 5.9367 0.0151752 *

+ indus 1 61.09 15378 1735.6 1.9942 0.1585263

<none> 15439 1735.6

+ zn 1 56.56 15383 1735.7 1.8457 0.1748999

+ age 1 20.18 15419 1736.9 0.6571 0.4179577

115


+ nox 1 14.90 15424 1737.1 0.4849 0.4865454

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step: AIC=1678.13

medv ~ lstat + rm + ptratio

Df Sum of Sq RSS AIC F value Pr(>F)

+ dis 1 499.08 13229 1661.4 18.9009 1.668e-05 ***

+ black 1 389.68 13338 1665.6 14.6369 0.0001468 ***

+ chas 1 377.96 13350 1666.0 14.1841 0.0001854 ***

+ crim 1 122.52 13606 1675.6 4.5115 0.0341560 *

+ age 1 66.24 13662 1677.7 2.4291 0.1197340

<none> 13728 1678.1

+ tax 1 44.36 13684 1678.5 1.6242 0.2031029

+ nox 1 24.81 13703 1679.2 0.9072 0.3413103

+ zn 1 14.96 13713 1679.6 0.5467 0.4600162

+ rad 1 6.07 13722 1679.9 0.2218 0.6378931

+ indus 1 0.83 13727 1680.1 0.0301 0.8622688

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

116


..............................................................

Step: AIC=1596.1

medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn +

crim + rad

Df Sum of Sq RSS AIC F value Pr(>F)

+ tax 1 273.619 11081 1585.8 12.1978 0.0005214 ***

<none> 11355 1596.1

+ indus 1 33.894 11321 1596.6 1.4790 0.2245162

+ age 1 0.096 11355 1598.1 0.0042 0.9485270

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step: AIC=1585.76

medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn +

crim + rad + tax

Df Sum of Sq RSS AIC F value Pr(>F)

<none> 11081 1585.8

+ indus 1 2.51754 11079 1587.7 0.1120 0.7380

+ age 1 0.06271 11081 1587.8 0.0028 0.9579

117

Shrinkage methods

◮ The idea of regularization: in contrast to subset selection, the idea is to fit a model keeping all the coefficients, but to impose constraints on their size, for example by shrinking the coefficients toward zero.

◮ In general, regularization refers to a process of introducing additional

information in order to solve an ill-posed problem or to prevent

overfitting.

118

Ridge regression

◮ Ridge regression shrinks the regression coefficients by imposing a penalty on their size:

β̂_λ = argmin_β { Σ_{i=1}^n (Yi − β0 − Σ_{j=1}^p Xij βj)² + λ Σ_{j=1}^p βj² },

i.e. the criterion (Y − Xβ)ᵀ(Y − Xβ) + λβᵀβ is minimized, with solution

β̂_λ = (XᵀX + λI)⁻¹ XᵀY.

◮ The ridge parameter λ has to be chosen: larger λ results in smaller variance but bigger bias.

◮ β̂_λ is a linear estimator.

◮ Usually β̂_λ is computed for a range of λ's.
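A bare-bones sketch of the closed-form ridge solution above (not the slides' code; it will not match lm.ridge exactly because of slightly different scaling conventions):

library(MASS)
Z <- scale(as.matrix(Boston[, 1:13]))     # centred and rescaled predictors
y <- Boston$medv - mean(Boston$medv)      # centred response
lambda <- 5

beta.ridge <- solve(t(Z) %*% Z + lambda * diag(ncol(Z)), t(Z) %*% y)
drop(beta.ridge)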

119

2. Ridge regression

◮ The matrix X is usually centered and rescaled: if xj is the jth column of X (j = 1, . . . , p), then define Z = [z1, . . . , zp] by

zj = (xj − x̄j)/Sj,   Sj² = (1/n)(xj − x̄j)ᵀ(xj − x̄j).

Y is also centered. Then consider the model without intercept, Y = Zθ + ε.

◮ Ridge trace: graphs of the coefficient estimates as functions of λ.

◮ Generalized Cross-Validation (GCV): select the λ that minimizes

V(λ) = (1/n) Yᵀ[I − A(λ)]² Y / ( (1/n) tr[I − A(λ)] )²,   A(λ) = X(XᵀX + λI)⁻¹Xᵀ.

120


2. Ridge regression in R

> Boston.ridge<-lm.ridge(medv~., Boston, lambda=seq(0, 100, 0.1))

◮ Output values

* scales - scalings used on the X matrix.

* Inter - was intercept included?

* lambda - vector of lambda values

* ym - mean of y

* xm - column means of x matrix

* GCV - vector of GCV values

* kHKB - HKB estimate of the ridge constant.

* kLW - L-W estimate of the ridge constant.

> plot(Boston.ridge) # produces ridge trace

121

Ridge trace: Boston housing data

[Figure: ridge trace for the Boston housing data; the coefficient estimates t(x$coef) plotted against x$lambda for λ between 0 and 100.]

122


Ridge regression in R

> select(Boston.ridge)

modified HKB estimator is 4.594163

modified L-W estimator is 3.961575

smallest value of GCV at 4.3

>

> Boston.ridge.cv<-lm.ridge(medv~.,Boston,lambda=4.3)

> Boston.ridge.cv$coef

crim zn indus chas nox rm

-0.895001937 1.020966996 0.049465334 0.694878367 -1.943248437 2.707866705

age dis rad tax ptratio black

-0.005646034 -2.992453378 2.384190136 -1.819613735 -2.026897293 0.847413719

lstat

-3.689619529

123

LASSO

◮ The LASSO estimator:

β̂_lasso = argmin_β Σ_{i=1}^n (Yi − β0 − Σ_{j=1}^p Xij βj)²   subject to   Σ_{j=1}^p |βj| ≤ t.

◮ Comparison to ridge: Σ_{j=1}^p βj² is replaced by Σ_{j=1}^p |βj|.

◮ Properties:

∗ the estimator β̂_lasso is non–linear;

∗ small t causes some of the coefficients to be exactly zero; if t is larger than t0 = Σ_{j=1}^p |β̂j^LS|, then β̂_lasso = β̂_LS;

∗ a kind of continuous subset selection.

124


LASSO and ridge

125

LASSO, ridge and best subset selection

◮ Ridge regression:

β̂_ridge = argmin_β { Σ_{i=1}^n (Yi − β0 − Σ_{j=1}^p Xij βj)²  subject to  Σ_{j=1}^p βj² ≤ s }

◮ Lasso:

β̂_lasso = argmin_β { Σ_{i=1}^n (Yi − β0 − Σ_{j=1}^p Xij βj)²  subject to  Σ_{j=1}^p |βj| ≤ t }

◮ Best subset selection:

β̂_sparse = argmin_β { Σ_{i=1}^n (Yi − β0 − Σ_{j=1}^p Xij βj)²  subject to  Σ_{j=1}^p I{βj ≠ 0} ≤ s }

◮ LASSO and ridge are computationally feasible alternatives to best subset selection.

126

1. LASSO and ridge in a special case

Assume that n = p and X is the identity matrix.

◮ Least squares estimator: β̂_ls = argmin_β Σ_{i=1}^n (Yi − βi)², so β̂_ls,i = Yi, i = 1, . . . , n.

◮ Ridge regression: β̂_ridge = argmin_β { Σ_{i=1}^n (Yi − βi)² + λ Σ_{i=1}^n βi² }, so

β̂_ridge,i = Yi/(1 + λ),   i = 1, . . . , n.

◮ LASSO: β̂_lasso = argmin_β { Σ_{i=1}^n (Yi − βi)² + λ Σ_{i=1}^n |βi| }, so

β̂_lasso,i = Yi − λ/2 if Yi > λ/2;   Yi + λ/2 if Yi < −λ/2;   0 if |Yi| ≤ λ/2,   i = 1, . . . , n.
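A two-line sketch of the soft-thresholding operator above (hypothetical values, just to illustrate):

soft <- function(y, lambda) sign(y) * pmax(abs(y) - lambda / 2, 0)   # lasso when n = p, X = I

y <- c(-2, -0.3, 0.1, 0.8, 3)
soft(y, lambda = 1)    # lasso estimates: small |Y_i| are set exactly to zero
y / (1 + 1)            # ridge estimates for the same lambda: proportional shrinkage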

127

2. LASSO and ridge in a special case

◮ β̂_lasso,i = soft thresholding of Yi

[Figure: coefficient estimate plotted against yj for ridge and for lasso, each compared with the least squares estimate.]

128


LASSO in R

> library(lars)

> library(MASS)

> x<-as.matrix(Boston[,1:13])

> y<-as.vector(Boston[,14])

> Boston.lasso <- lars(x,y,type="lasso")

> summary(Boston.lasso)

LARS/LASSO

Call: lars(x = x, y = y, type = "lasso")

Df Rss Cp

0 1 42716 1392.997

1 2 36326 1111.195

2 3 21335 447.485

3 4 14960 166.356

4 5 14402 143.588

5 6 13667 112.931

6 7 13449 105.281

7 8 13117 92.515

8 9 12423 63.717

9 10 11950 44.700

129


10 11 11899 44.446

11 12 11730 38.934

12 13 11317 22.590

13 12 11086 10.341

14 13 11080 12.032

15 14 11079 14.000

◮ Print and plot of complete coefficient path

> print(Boston.lasso)

Call:

lars(x = x, y = y, type = "lasso")

R-squared: 0.741

Sequence of LASSO moves:

lstat rm ptratio black chas crim dis nox zn indus rad tax indus indus age

Var 13 6 11 12 4 1 8 5 2 3 9 10 -3 3 7

Step 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

> plot(Boston.lasso)

Returns a plot: coefficient values against s = t / Σ_{j=1}^p |β̂j^LS|, 0 ≤ s ≤ 1.

130

LASSO coefficients path

[Figure: LASSO coefficient paths for the Boston housing data; standardized coefficients plotted against |beta|/max|beta|, with the number of steps (0 to 15) marked along the top axis.]

131

Cross-validated choice of s

> cv.lars(x,y, K=10)

Returns the K-fold CV mean squared prediction error as a function of s.

[Figure: cross-validated prediction error plotted against the fraction s.]

132


Prediction and extraction of coefficients

◮ Extraction of LASSO coefficients for given s

> Boston.coef.03<-coef(Boston.lasso, s=0.3, mode="fraction")

> Boston.coef.03

crim zn indus chas nox rm age

0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 3.3707730 0.0000000

dis rad tax ptratio black lstat

0.0000000 0.0000000 0.0000000 -0.4299806 0.0000000 -0.4664715

◮ Prediction

> Boston.lasso.pr<-predict(Boston.lasso, x, s=0.3, mode="fraction", type="fit")

133


9. Logistic Regression

134

Example

◮ Age and coronary heart disease (CHD) status data: 100 subjects; response (Y) – absence or presence (0/1) of CHD, predictor (X) – age.

[Figure: scatter plot of CHD status (0/1) against age.]

135

Logistic regression model

◮ Linear regression is not appropriate:

E(Y|X = x) = P(Y = 1|X = x) = β0 + β1x

should lie in [0, 1] for all x.

◮ The idea is to model the relationship between p(x) = P(Y = 1|X = x) and x using the logistic response function

p(x) = e^{β0+β1x} / (1 + e^{β0+β1x})   ⇔   logit{p(x)} := log[ p(x) / (1 − p(x)) ] = β0 + β1x.

136

Logistic response function

[Figure: the logistic response function, an S-shaped curve increasing from 0 to 1.]

◮ Why logit? For fixed x the odds p(x)/(1 − p(x)) are naturally considered on a log scale: usually one has odds like '10 to 1', or '2 to 1'.

◮ This is a specific case of the generalized linear model (GLM) with logit link function: g(E(Y|X = x)) = β0 + β1x, where g(z) = log[z/(1 − z)], 0 < z < 1.

137

Interpretation of the logistic regression model

◮ If p(x) = 0.75, then the odds of getting CHD at age x are 3 to 1.

◮ If x = 0 then

log[ p(0) / (1 − p(0)) ] = β0   ⇔   p(0) / (1 − p(0)) = e^{β0}.

Thus e^{β0} can be interpreted as the baseline odds, especially if zero is within the range of the data for the predictor variable X.

◮ If we increase x by one unit, we multiply the odds by e^{β1}. If β1 > 0 then e^{β1} > 1 and the odds increase; if β1 < 0 the odds decrease.

138

1. Likelihood function

◮ Data and model: Dn = {(Yi, Xi), i = 1, . . . , n}, Yi ∈ {0, 1}, i.i.d.,

πi = π(Xi) = P(Yi = 1|Xi) = E(Yi|Xi) = e^{β0+β1Xi} / (1 + e^{β0+β1Xi}),   i = 1, . . . , n.

◮ Likelihood and log–likelihood (to be maximized w.r.t. β0 and β1):

L(β0, β1; Dn) = Π_{i=1}^n πi^{Yi} (1 − πi)^{1−Yi}
 = Π_{i=1}^n [ e^{β0+β1Xi} / (1 + e^{β0+β1Xi}) ]^{Yi} [ 1 / (1 + e^{β0+β1Xi}) ]^{1−Yi}
 = Π_{i=1}^n e^{(β0+β1Xi)Yi} / (1 + e^{β0+β1Xi}),

log L(β0, β1; Dn) = Σ_{i=1}^n Yi(β0 + β1Xi) − Σ_{i=1}^n log{1 + e^{β0+β1Xi}}.

139

2. Likelihood function

S1(β0, β1) = ∂ log L(β0, β1)/∂β0 = Σ_{i=1}^n Yi − Σ_{i=1}^n e^{β0+β1Xi} / (1 + e^{β0+β1Xi}),

S2(β0, β1) = ∂ log L(β0, β1)/∂β1 = Σ_{i=1}^n XiYi − Σ_{i=1}^n Xi e^{β0+β1Xi} / (1 + e^{β0+β1Xi}).

◮ Solve the system S1(β0, β1) = 0, S2(β0, β1) = 0 for β0, β1.

◮ No closed-form solution is available; the solution is found by an iterative procedure.

140

1. Fitting the model: Newton–Raphson algorithm

◮ Idea of the algorithm:

– Assume we want to solve the equation g(x) = 0.

– Let x* be the solution; if x is close to x*, then

0 = g(x*) ≈ g(x) + g′(x)(x* − x)   ⇒   x* ≈ x − g(x)/g′(x).

◮ Iterative procedure:

– Let xk be the current approximation to x* (at the kth stage); define the next approximation xk+1 by

xk+1 = xk − g(xk)/g′(xk),   k = 1, 2, . . .

– Stop when g(xk) is small, e.g., |g(xk)| ≤ ε.

141

2. Fitting the model: Newton–Raphson algorithm

1. Let β0^(j) and β1^(j) be the current parameter approximations after the jth step of the algorithm.

2. Let J(β0, β1) be minus the matrix of partial derivatives of the score:

J(β0, β1) = − [ ∂S1/∂β0  ∂S1/∂β1
                ∂S2/∂β0  ∂S2/∂β1 ].

3. Compute

( β0^(j+1), β1^(j+1) )ᵀ = ( β0^(j), β1^(j) )ᵀ + J⁻¹(β0^(j), β1^(j)) · ( S1(β0^(j), β1^(j)), S2(β0^(j), β1^(j)) )ᵀ.

4. Continue until a convergence criterion is met.
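A bare-bones sketch (not the slides' code) of this Newton–Raphson iteration for simple logistic regression; glm() does the same job via Fisher scoring, so this is only meant to make the update above concrete.

newton.logit <- function(x, y, beta = c(0, 0), maxit = 25, tol = 1e-8) {
  X <- cbind(1, x)                              # design matrix (intercept, x)
  for (j in 1:maxit) {
    p <- as.vector(1 / (1 + exp(-X %*% beta)))  # pi_i at the current beta
    S <- t(X) %*% (y - p)                       # score vector (S1, S2)
    J <- t(X) %*% (X * (p * (1 - p)))           # minus the derivative of the score
    step <- solve(J, S)
    beta <- beta + step                         # the update on this slide
    if (max(abs(step)) < tol) break             # convergence criterion
  }
  as.vector(beta)
}
# e.g. newton.logit(agchd$Age, agchd$CHD) for the CHD data used on the later
# slides should agree with coef(agchd.glm).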

142

Extension to multiple predictors

◮ Model: Xi = (1, Xi1, . . . , Xip), i = 1, . . . , n, β = (β0, β1, . . . , βp),

πi = π(Xi) = P(Yi = 1|Xi) = E(Yi|Xi) = exp{βᵀXi} / (1 + exp{βᵀXi}).

◮ Likelihood and log–likelihood:

L(β; Dn) = Π_{i=1}^n [π(Xi)]^{Yi} [1 − π(Xi)]^{1−Yi}
 = Π_{i=1}^n [ e^{βᵀXi} / (1 + e^{βᵀXi}) ]^{Yi} [ 1 / (1 + e^{βᵀXi}) ]^{1−Yi}
 = Π_{i=1}^n e^{βᵀXi Yi} / (1 + e^{βᵀXi}),

log L(β; Dn) = Σ_{i=1}^n βᵀXi Yi − Σ_{i=1}^n log{1 + e^{βᵀXi}}.

This should be maximized with respect to β.

143

1. Fitting the model and assessing the fit

◮ No closed-form solution is available; the solution is found by an iterative procedure.

◮ If β̂ is the ML estimate of β, then the fitted values are

Ŷi = π̂(Xi) = exp{β̂ᵀXi} / (1 + exp{β̂ᵀXi}).

◮ The deviance is twice the difference between the log–likelihoods evaluated at (a) the MLE π̂(Xi) and (b) π(Xi) = Yi:

G² = 2 Σ_{i=1}^n { Yi log[ Yi / π̂(Xi) ] + (1 − Yi) log[ (1 − Yi) / (1 − π̂(Xi)) ] } = Σ_{i=1}^n dev(Yi, π̂(Xi)).

◮ Deviance residuals: ri = sign{Yi − π̂(Xi)} √dev(Yi, π̂(Xi)).

144

2. Fitting the model and assessing the fit

◮ The degrees of freedom (df) associated with the deviance G² equal n − (p + 1); here p + 1 is the dimension of the vector β.

◮ Pearson's X² is an approximation to the deviance:

X² = Σ_{i=1}^n [Yi − π̂(Xi)]² / { π̂(Xi)[1 − π̂(Xi)] }.

◮ Comparing models: let x = (x1, x2), and consider testing

H0: log[ π(x)/(1 − π(x)) ] = βᵀx1   versus   H1: log[ π(x)/(1 − π(x)) ] = βᵀx1 + ηᵀx2.

Obtain the deviance G²_0 and df_0 under H0, and G²_1 with the corresponding df_1 under H1. Under the null: G²_0 − G²_1 ≈ χ²(df_0 − df_1).
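A sketch in R (using the Age–CHD data introduced on the next slide): the deviance comparison of two nested logistic models is carried out with anova(), or by hand from the deviances.

agchd <- read.table("Age-CHD.dat", header = TRUE)
agchd.null <- glm(CHD ~ 1,   data = agchd, family = binomial)   # model under H0
agchd.glm  <- glm(CHD ~ Age, data = agchd, family = binomial)   # model under H1

anova(agchd.null, agchd.glm, test = "Chisq")      # G0^2 - G1^2 compared with chi^2(df0 - df1)
pchisq(deviance(agchd.null) - deviance(agchd.glm),
       df = df.residual(agchd.null) - df.residual(agchd.glm),
       lower.tail = FALSE)                        # the same p-value by hand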

145


1. Example

> agchd <-read.table("Age-CHD.dat", header=T)

> agchd.glm<-glm(CHD~Age, data=agchd, family=binomial)

> summary(agchd.glm)

Call:

glm(formula = CHD ~ Age, family = binomial, data = agchd)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.9718 -0.8456 -0.4576 0.8253 2.2859

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -5.30945 1.13365 -4.683 2.82e-06 ***

Age 0.11092 0.02406 4.610 4.02e-06 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 136.66 on 99 degrees of freedom

Residual deviance: 107.35 on 98 degrees of freedom

AIC: 111.35

Number of Fisher Scoring iterations: 4

146

2. Example

◮ The fitted model:

log[ P(CHD|Age) / (1 − P(CHD|Age)) ] = −5.309 + 0.111 × Age.

As Age grows by one unit, the odds of having CHD are multiplied by e^{0.111} ≈ 1.117.

◮ The null deviance is the G² statistic for the null model (without the slope).

◮ The residual deviance is the G² statistic for the fitted model.

147

Classification using logistic regression

◮ With any value x of the predictor variable, the fitted logistic regression model associates the probability π̂(x).

◮ Classification rule: for some threshold τ ∈ (0, 1) (e.g., τ = 1/2), let

Ŷ(x) = 1 if π̂(x) > τ,   and   Ŷ(x) = 0 if π̂(x) ≤ τ.

By changing τ one can get a feel for the efficacy of the model.

148

Sensitivity and specificity

◮ Classification results can be represented in the form of a table:

              True 0   True 1
Predicted 0     a        b
Predicted 1     c        d

◮ Sensitivity is the proportion of correctly predicted 1's (true positives): Sensitivity = d / (b + d).

◮ Specificity is the proportion of correctly predicted 0's (true negatives): Specificity = a / (a + c).

149

ROC curve

◮ Receiver Operating Characteristic (ROC) curve: Sensitivity versus 1 − Specificity as the threshold τ varies from 0 to 1.

[Figure: ROC curve for the CHD data; the end point τ = 1 corresponds to (0, 0) and the end point τ = 0 to (1, 1).]

150

Interpretation of the ROC curve

◮ If τ = 1 then we never classify an observation as positive; here Sensitivity = 0 and Specificity = 1.

◮ If τ = 0 then everything is classified as positive; here Sensitivity = 1 and Specificity = 0.

◮ As τ varies between 0 and 1 there is a trade–off between sensitivity and specificity; one looks for a value of τ which gives "large" sensitivity and specificity.

◮ The closer the ROC curve comes to the 45-degree straight line, the less useful the model is.
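A short sketch (not the slides' code, and assuming the agchd.glm fit from the earlier CHD example) of how the ROC curve can be traced by sweeping the threshold τ:

p.hat <- fitted(agchd.glm)
taus  <- seq(0, 1, by = 0.01)
sens  <- sapply(taus, function(t) mean(p.hat[agchd$CHD == 1] >  t))   # true positive rate
spec  <- sapply(taus, function(t) mean(p.hat[agchd$CHD == 0] <= t))   # true negative rate

plot(1 - spec, sens, type = "l", xlab = "1 - Specificity", ylab = "Sensitivity",
     main = "ROC curve, CHD data")
abline(0, 1, lty = 2)   # the 45-degree line of a useless classifier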

151


3. Example

> tau<-0.5 # threshold

> agch1<-as.numeric(fitted(agchd.glm)>=tau)

> table(agchd$CHD, agch1)

agch1

0 1

0 45 12

1 14 29

> 29/(29+14)   # sensitivity
[1] 0.6744186
> 45/(45+12)   # specificity
[1] 0.7894737

> tau<-0.6

> agch1<-as.numeric(fitted(agchd.glm)>=tau)

> table(agchd$CHD, agch1)

agch1

0 1

0 50 7

1 18 25

> 25/(25+18)   # sensitivity
[1] 0.5813953
> 50/(50+7)    # specificity
[1] 0.877193

152


1. Another example: iris data

◮ The independent variables – Sepal.Length, Sepal.Width, Petal.Length, Petal.Width.

◮ Response variable - flower species: setosa, versicolor and

virginica.

◮ The ‘iris’ dataset consists of 150 observations, 50 from each species.

Consider a logistic regression on a single species type, versicolor.

> data(iris)

> tmpdata <- iris

> Versicolor <- as.numeric(tmpdata[,"Species"]=="versicolor")

> tmpdata[,"Species"] <- Versicolor

> fmla <- as.formula(paste("Species ~ ",paste(names(tmpdata)[1:4],

collapse="+")))

> ilr <- glm(fmla, data=tmpdata, family=binomial(logit))

> summary(ilr)

Call:

153


glm(formula = fmla, family = binomial(logit), data = tmpdata)

Deviance Residuals:

Min 1Q Median 3Q Max

-2.1281 -0.7668 -0.3818 0.7866 2.1202

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 7.3785 2.4993 2.952 0.003155 **

Sepal.Length -0.2454 0.6496 -0.378 0.705634

Sepal.Width -2.7966 0.7835 -3.569 0.000358 ***

Petal.Length 1.3136 0.6838 1.921 0.054713 .

Petal.Width -2.7783 1.1731 -2.368 0.017868 *

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 190.95 on 149 degrees of freedom

Residual deviance: 145.07 on 145 degrees of freedom

AIC: 155.07

Number of Fisher Scoring iterations: 5

154


2. Another example: Iris data

◮ Model without Sepal.Length

> ilr1<-glm(Species ~ Sepal.Width + Petal.Length + Petal.Width,

+ data=tmpdata, family=binomial(logit))

> summary(ilr1)

Call:

glm(formula = Species ~ Sepal.Width + Petal.Length + Petal.Width,

family = binomial(logit), data = tmpdata)

Deviance Residuals:

Min 1Q Median 3Q Max

-2.1262 -0.7731 -0.3984 0.8063 2.1562

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 6.9506 2.2261 3.122 0.00179 **

Sepal.Width -2.9565 0.6668 -4.434 9.26e-06 ***

Petal.Length 1.1252 0.4619 2.436 0.01484 *

Petal.Width -2.6148 1.0815 -2.418 0.01562 *

155


---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 190.95 on 149 degrees of freedom

Residual deviance: 145.21 on 146 degrees of freedom

AIC: 153.21

Number of Fisher Scoring iterations: 5

156


10. Review of multivariate normal distribution.

157

Standard multivariate normal distribution

◮ Definition: Let Z1, . . . , Zn be independent N(0, 1) random variables. The vector Z = (Z1, . . . , Zn) has the standard multivariate normal distribution:

f_Z(z) = f_{Z1,...,Zn}(z1, . . . , zn) = Π_{j=1}^n (1/√(2π)) e^{−zj²/2} = (1/√(2π))ⁿ exp{−½ zᵀz}.

We write Z ∼ Nn(0, I), where 0 is the expectation, EZ = 0, and I is the covariance matrix, cov(Z) = E[(Z − EZ)(Z − EZ)ᵀ] = EZZᵀ = I.

158

Multivariate normal distribution

◮ Transformation: Let A ∈ R^{n×n} and µ ∈ Rⁿ. Define the random vector Y = AZ + µ.

◮ Then Y ∼ Nn(µ, AAᵀ):

EY = E(AZ) + µ = A EZ + µ = µ,

cov(Y) = E(Y − EY)(Y − EY)ᵀ = A E(ZZᵀ) Aᵀ = AAᵀ.

◮ Definition: the distribution of a random vector Y ∈ Rⁿ is multivariate normal with expectation µ ∈ Rⁿ and covariance matrix Σ ∈ R^{n×n} if

f_Y(y) = [ (2π)^{n/2} |det(Σ)|^{1/2} ]⁻¹ exp{ −½ (y − µ)ᵀ Σ⁻¹ (y − µ) }.

Here Σ > 0 is the covariance matrix, and we write Y ∼ Nn(µ, Σ).

159

Properties

◮ If a ∈ Rⁿ and Y ∼ Nn(µ, Σ), then

X = aᵀY = Σ_{i=1}^n ai Yi ∼ N(aᵀµ, aᵀΣa).

◮ In general, if B ∈ R^{q×n} and Y ∼ Nn(µ, Σ), then X = BY ∼ Nq(Bµ, BΣBᵀ).

◮ Any sub–vector of a multivariate normal vector is multivariate normal.

◮ If the elements of a multivariate normal vector are uncorrelated, then they are independent.

◮ If X1, . . . , Xn are i.i.d. N(µ, σ²), then X̄n and s² are independent (in fact, the sample is normal if and only if X̄n and s² are independent).
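A small simulation sketch (illustrative values only) of the transformation property Y = AZ + µ with AAᵀ = Σ, here using the Cholesky factor of Σ:

mu    <- c(1, -2)
Sigma <- matrix(c(2, 0.8, 0.8, 1), 2, 2)
A     <- t(chol(Sigma))                  # lower-triangular A with A %*% t(A) = Sigma

Z <- matrix(rnorm(2 * 1000), nrow = 2)   # columns are standard normal vectors
Y <- A %*% Z + mu                        # each column is N_2(mu, Sigma)

rowMeans(Y)    # approximately mu
cov(t(Y))      # approximately Sigma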

160


11. Discriminant Analysis

161

Cushing syndrome data

Types of the disease: (a), (b) and (c).

Variables: concentrations of Tetrahydrocortisone and Pregnanetriol.

[Figure: scatter plot of Pregnanetriol against Tetrahydrocortisone (log scales), with points labelled a, b, c, and u for observations of unknown type.]

162

1. Classification problem

◮ Problem: we are given a pair of variables (X, Y):

– X ∈ X ⊆ Rᵖ is the vector of predictor variables (features) and belongs to one of K populations;

– Y is the label of the population.

◮ Assume

– πk is the prior probability that X belongs to the kth population:

πk = P(Y = k) > 0,   Σ_{k=1}^K πk = 1.

– If the observation X belongs to the kth population, then X ∼ pk(x).

◮ Problem: we observe X = x. How do we predict the label Y?

163

2. Classification problem

◮ Classification rule: any function g : X → {1, . . . , K}. It defines a partition of the feature space X into K disjoint sets:

X = ∪_{k=1}^K Ak,   Ak = {x : g(x) = k}.

◮ Accuracy of a classifier: L(g) = P{g(X) ≠ Y}.

◮ Optimal rule:

g* = argmin_{g : X → {1,...,K}} P{g(X) ≠ Y}.

164

Bayes rule

◮ Bayes formula:

p(k|x) = P(Y = k|X = x) = P(X = x|Y = k)P(Y = k) / Σ_{k=1}^K P(X = x|Y = k)P(Y = k) = πk pk(x) / Σ_{k=1}^K πk pk(x).

◮ Bayes classification rule:

g*(x) = argmax_{k=1,...,K} p(k|x) = argmax_{k=1,...,K} [πk pk(x)].

◮ Theorem: the Bayes rule g* minimizes the probability of error, i.e.,

L(g*) = P{g*(X) ≠ Y} ≤ L(g) = P{g(X) ≠ Y} for every g.

The Bayes error is

L(g*) = 1 − ∫_X max_{k=1,...,K} [πk pk(x)] dx.

165

Proof

First, note that L(g) = P{g(X) ≠ Y} = E[P{g(X) ≠ Y | X = x}], and

P{g(X) ≠ Y | X = x} = 1 − P{g(X) = Y | X = x}
 = 1 − Σ_{k=1}^K P{g(X) = k, Y = k | X = x}
 = 1 − Σ_{k=1}^K I{g(x) = k} P(Y = k | X = x)
 = 1 − Σ_{k=1}^K I{g(x) = k} p(k|x) ≥ 1 − max_k p(k|x).

The proof is completed by taking expectation and noting that, by definition,

P{g*(X) ≠ Y | X = x} = 1 − max_k p(k|x).

166

The Bayes rule for normal populations

◮ Assume K = 2 groups with prior probabilities πk, k = 1, 2, and that the distribution pk(x) of the kth population is Np(µk, Σk):

πk pk(x) = πk / [ (2π)^{p/2} |Σk|^{1/2} ] · exp{ −½ (x − µk)ᵀ Σk⁻¹ (x − µk) },   k = 1, 2.

◮ The Bayes rule decides 1 if π1 p1(x) ≥ π2 p2(x), and 2 otherwise. Equivalently, one can compare h1(x) with h2(x), where

hk(x) = log{πk pk(x)} = −½ Mk² − (p/2) log(2π) − ½ log|Σk| + log πk,

with Mk² = (x − µk)ᵀ Σk⁻¹ (x − µk) the squared Mahalanobis distance from x to µk.

167

Case I: equal covariance matrices

◮ If Σ1 = Σ2 = Σ, then the Bayes rule decides 1 when h1(x) − h2(x) ≥ 0, i.e., when

h1(x) − h2(x) = −½ M1² + log π1 + ½ M2² − log π2
 = (µ1 − µ2)ᵀ Σ⁻¹ ( x − (µ1 + µ2)/2 ) + log(π1/π2) ≥ 0.

This results in the linear decision surface

(µ1 − µ2)ᵀ Σ⁻¹ ( x − (µ1 + µ2)/2 ) = log(π2/π1).

168

Case II: non–equal covariance matrices

◮ If Σ1 ≠ Σ2, then the Bayes rule decides 1 if

−½ (x − µ1)ᵀ Σ1⁻¹ (x − µ1) − ½ log|Σ1| + log π1 ≥ −½ (x − µ2)ᵀ Σ2⁻¹ (x − µ2) − ½ log|Σ2| + log π2.

This results in a quadratic decision surface in Rᵖ.

◮ The Bayes rule cannot be implemented because neither pk(x) nor πk are known. The idea is to estimate the unknown parameters from the data...

169

1. Linear discriminant analysis

◮ Data: two samples of sizes n1 and n2 from normal populations,

X_{k1}, . . . , X_{k nk} ∼ Np(µk, Σ),   k = 1, 2.

◮ Estimates of the means µk:

µ̂k = X̄_{k·} = (1/nk) Σ_{j=1}^{nk} X_{kj}.

◮ Pooled estimator of Σ:

S_pooled = [ (n1 − 1)S1 + (n2 − 1)S2 ] / (n1 + n2 − 2),

Sk = (1/(nk − 1)) Σ_{j=1}^{nk} (X_{kj} − µ̂k)(X_{kj} − µ̂k)ᵀ.

170

2. Linear discriminant analysis

◮ Classification rule (LDA): decide the first group if

(µ̂1 − µ̂2)ᵀ S_pooled⁻¹ ( x − (µ̂1 + µ̂2)/2 ) ≥ log(π̂2/π̂1),

where π̂k = nk/n, k = 1, 2.

◮ Although LDA is the "plug–in Bayes classifier" for normal populations, it can be applied for any distribution of the data.

171

1. Another interpretation of the LDA

◮ The idea is to find a linear transformation of X such that the separation between the groups is maximal. Let β ∈ Rᵖ and Z = βᵀX.

– if X is from the first group, then µ_{1,Z} = EZ = EβᵀX = βᵀµ1;

– if X is from the second group, then µ_{2,Z} = EZ = EβᵀX = βᵀµ2;

– var(Z) does not depend on the group: σ_Z² = var(βᵀX) = βᵀΣβ.

◮ Choose β so that

(µ_{1,Z} − µ_{2,Z})² / σ_Z² = [βᵀ(µ1 − µ2)]² / (βᵀΣβ) → max.

The solution of this problem is β* = c Σ⁻¹(µ1 − µ2), for any c ≠ 0.

172

2. Another interpretation of the LDA

◮ Estimate of β*: β̂* = S_pooled⁻¹ (µ̂1 − µ̂2).

◮ Estimates of µ_{1,Z} = β*ᵀµ1 and µ_{2,Z} = β*ᵀµ2:

µ̂_{1,Z} = (µ̂1 − µ̂2)ᵀ S_pooled⁻¹ µ̂1,   µ̂_{2,Z} = (µ̂1 − µ̂2)ᵀ S_pooled⁻¹ µ̂2.

◮ LDA classification rule: for given x decide the first group if

(µ̂1 − µ̂2)ᵀ S_pooled⁻¹ x ≥ ½ (µ̂_{1,Z} + µ̂_{2,Z})   ⇔   (µ̂1 − µ̂2)ᵀ S_pooled⁻¹ x ≥ ½ (µ̂1 − µ̂2)ᵀ S_pooled⁻¹ (µ̂1 + µ̂2).
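A sketch (not the slides' code; it uses the crabs data introduced on the next slides) of computing β̂* = S_pooled⁻¹(µ̂1 − µ̂2) by hand; the result is proportional, up to sign and scale, to the LD1 coefficients returned by lda().

library(MASS)
X  <- log(crabs[, c("FL", "RW", "CL", "CW")])
g  <- crabs$sex
X1 <- X[g == "F", ]; X2 <- X[g == "M", ]
n1 <- nrow(X1); n2 <- nrow(X2)

mu1 <- colMeans(X1); mu2 <- colMeans(X2)
Sp  <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)   # pooled covariance
drop(solve(Sp, mu1 - mu2))                                         # discriminant direction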

173


1. Example: Leptograpsus Crabs data

Two color forms (blue and orange), 50 of each form of each sex.

◮ sp species - ”B” or ”O” for blue or orange

◮ sex

◮ index index 1:50 within each of the four groups

◮ FL frontal lobe size (mm)

◮ RW rear width (mm)

◮ CL carapace length (mm)

◮ CW carapace width (mm)

◮ BD body depth (mm)

174


2. Example: Leptograpsus Crabs data

> library(MASS)

> attach(crabs)

> lcrabs<-cbind(sp, sex, log(crabs[,4:8]))

> lcrabs.lda<-lda(sex~FL+RW+CL+CW, lcrabs) # Linear discriminant analysis

> lcrabs.lda

Call:

lda.formula(sex ~ FL + RW + CL + CW, data = lcrabs)

Prior probabilities of groups:

F M

0.5 0.5

Group means:

FL RW CL CW

F 2.708720 2.579503 3.421028 3.555941

M 2.730305 2.466848 3.464200 3.583751

Coefficients of linear discriminants:

LD1

175


FL -2.889616

RW -25.517644

CL 36.316854

CW -11.827981

LD1 is the vector S_pooled⁻¹(µ̂1 − µ̂2).

> lcrabs.pred<-predict(lcrabs.lda)

> table(lcrabs$sex, lcrabs.pred$class)

F M

F 97 3

M 3 97

> (3+3)/(97+97+3+3) # Resubstitution (naive) estimate

[1] 0.03

> # Cross-validation estimate

> lcrabs.cv.lda<-lda(sex~FL+RW+CL+CW, lcrabs, CV=T)

> table(lcrabs$sex, lcrabs.cv.lda$class)

F M

F 96 4

M 3 97

176

2. Example: Leptograpsus Crabs data

> plot(lcrabs.lda) # group histograms in the discriminant direction

[Figure: histograms of the discriminant scores for group F and group M.]

177

1. Quadratic discriminant analysis

◮ QDA can be viewed as the plug–in Bayes rule for normal populations with non–equal covariance matrices. Now Σ1 and Σ2 are substituted by

Sk = (1/(nk − 1)) Σ_{j=1}^{nk} (X_{kj} − µ̂k)(X_{kj} − µ̂k)ᵀ,   k = 1, 2.

◮ The QDA rule: decide the first group if

−½ (x − µ̂1)ᵀ S1⁻¹ (x − µ̂1) − ½ log|S1| + log π̂1 ≥ −½ (x − µ̂2)ᵀ S2⁻¹ (x − µ̂2) − ½ log|S2| + log π̂2.

178


2. Quadratic discriminant analysis

> # Quadratic discriminant analysis

>

> lcrabs.qda<-qda(sex~FL+RW+CL+CW, lcrabs)

> lcrabs.qda.pred<-predict(lcrabs.qda)

> table(lcrabs$sex, lcrabs.qda.pred$class)

F M

F 97 3

M 4 96

> lcrabs.cv.qda<-qda(sex~FL+RW+CL+CW, lcrabs, CV=T)

> table(lcrabs$sex, lcrabs.cv.qda$class)

F M

F 96 4

M 5 95

179


12. Classification: k–Nearest Neighbors

180

1. k–nearest neighbors classifier

◮ Data: (Xi, Yi), i = 1, . . . , n, i.i.d. random pairs, Yi ∈ {0, 1}, Xi ∈ Rᵈ.

◮ Let d(·, ·) be a distance measure, and for a given x ∈ Rᵈ consider the numbers di(x) = d(Xi, x). Let d_(i)(x) be the ith order statistic, i.e.,

d_(1)(x) ≤ d_(2)(x) ≤ · · · ≤ d_(n)(x).

◮ The set of k–nearest neighbors of x is

Ak(x) = {Xi : d(Xi, x) ≤ d_(k)(x)}.

◮ Classifier:

gn(x) = 1 if Σ_{i=1}^n w_{n,i}(x) I(Yi = 1) > Σ_{i=1}^n w_{n,i}(x) I(Yi = 0), and gn(x) = 0 otherwise,

where w_{n,i}(x) = 1/k if Xi ∈ Ak(x), and zero otherwise.

181

2. k–nearest neighbors classifier

◮ Choice of k: often k = 1 is chosen, which gives the 1-NN classifier. Large k results in more averaging; small k leads to more variability. Asymptotic theory suggests

– k → ∞ as n → ∞,

– k/n → 0 as n → ∞.

◮ Choice of the distance: the most common choice is Euclidean.

182

Spam data

Data: 4601 instances, 57 attributes.

◮ Most of the attributes indicate whether a particular word or character occurs frequently in the e-mail. In particular, 48 continuous real [0,100] attributes give the percentage of words in the e-mail that match a given WORD, i.e.

100 × (number of times the WORD appears in the e-mail) / (total number of words in the e-mail).

◮ A WORD is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.

◮ The run-length attributes (55–57) measure the length of sequences of consecutive capital letters.

183


k–nearest neighbors: spam data

> library(class)

> spam.d<-spam[, 1:57] # training set

> spam.cl <- spam[,58] # true classifications

> spam.1nn <- knn.cv(spam.d, spam.cl, k=1) # 1-NN with cross-validation

> table(spam.1nn, spam.cl)

spam.cl

spam.1nn 0 1

0 2398 390

1 390 1423

> (390+390)/(2398+1423+390+390) # cross-validation misclassification rate

[1] 0.1695284
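The curve on the next slide can be produced by a loop of this kind (a sketch, assuming the spam objects defined above):

ks  <- 1:20
err <- sapply(ks, function(k) mean(knn.cv(spam.d, spam.cl, k = k) != spam.cl))
plot(ks, err, type = "b", xlab = "K", ylab = "Misclassification rate",
     main = "Spam data: CV-misclassification rate of K-NN")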

184

Spam data: k-nn misclassification rate

[Figure: cross-validated misclassification rate of k-NN on the spam data plotted against K = 1, . . . , 20.]

185


Spam data: LDA versus k-nn

> library(MASS)

> spam.cv.lda<-lda(V58~.,data=spam, CV=T)

> spam.cv.lda

Call:

lda(V58 ~ ., data = spam)

Prior probabilities of groups:

0 1

0.6059552 0.3940448

Group means:

V1 V2 V3 V4 V5 V6 V7

0 0.0734792 0.2444656 0.2005811 0.0008859397 0.1810402 0.04454448 0.00938307

1 0.1523387 0.1646498 0.4037948 0.1646718147 0.5139548 0.17487590 0.27540541

V8 V9 V10 V11 V12 V13 V14

0 0.03841463 0.03804878 0.1671700 0.02171090 0.5363235 0.06166428 0.04240316

1 0.20814120 0.17006067 0.3505074 0.11843354 0.5499724 0.14354661 0.08357419

...........................................................................

186


> table(spam.cl, spam.cv.lda$class)

spam.cl 0 1

0 2652 136

1 390 1423

> (136+390)/(136+390+2652+1423)

[1] 0.1143230 # cross-validation misclassification rate

187


13. Classification: decision trees (CART)

188


1. Example

◮ In a hospital, when a heart attack patient is admitted, 19 variables

are measured during the first 24 hours. Among the variables: blood

pressure, age and 17 other ordered or binary variables summarizing

different symptoms.

◮ Based on the 24–hours data, the objective of the study is to identify

high risk patients (those who will not survive at least 30 days).

189

Page 191: 1. Introduction - University of Haifaidattner/Course2015sem2/DM-00.pdf · >plot(UScereal$fat, UScereal$calories, xlab="Fat", ylab="Calories") 0 2 4 6 8 100 200 300 400 Fat Calories

2. Example

[Figure: classification tree. Root question: "Is the minimum systolic blood pressure over the initial 24 hours > 91?"; further questions: "Is age > 62.5?" and "Is sinus tachycardia present?". Terminal nodes are labeled G = high risk, F = not high risk.]

190


1. Binary trees: basic notions

◮ Binary tree is constructed by repeated splits of subsets of X into two

descendant subsets:

– a single variable is found which ”best” splits the data into two

groups

– the data is separated and the process is applied to each sub–group

– stop when the subgroups reach minimal size, or no improvement

can be made.

◮ Terminology

– the root node = X ; a node = a subset of X (circles).

– terminal nodes = subsets which are not split (rectangular box);

each terminal node is designated by the class label.

191


2. Binary trees: basic notions

◮ Construction of a tree requires:

– The selection of splits

– The decisions when to declare a node terminal or to continue

splitting it

– The assignment of each terminal node to a class

192


Notation

◮ n – number of observations; K - number of classes

◮ N(t) – total number of observations at node t

◮ Nk(t) – number of observations from class k at node t

◮ p(k|t) – proportion of observations X at the node t belonging to kth

class, k = 1, . . . , K

p(k|t) = Nk(t)/N(t) = #{observations from class k at node t} / #{observations at node t},

p(t) = [p(1|t), . . . , p(K|t)].

◮ Y (t) – class assigned to the node t:

Y(t) = arg max_{k=1,...,K} p(k|t)

193


1. Impurity of the node

◮ Node is pure if it contains data only from one class.

◮ Impurity measure Imp(t) of node t for classification into K classes:

Imp(t) = φ(p(t)), p(t) = [p(1|t), . . . , p(K|t)]

where φ is a non–negative function of p(t) satisfying the following

conditions:

(a) φ has a unique maximum at (1/K, . . . , 1/K);

(b) φ achieves minimum at (1, 0, . . . , 0), (0, 1, . . . , 0), . . . , (0, . . . , 1);

(c) φ is a symmetric function of p1, . . . , pK .

◮ Imp(t) is largest when all classes are equally mixed, and smallest when

the node contains only one class.

194


2. Impurity of the node

◮ Examples of the impurity measure

Imp(t) = − Σ_{k=1}^K p(k|t) log{p(k|t)},   [entropy]

Imp(t) = 1 − Σ_{k=1}^K p²(k|t),   [Gini index]

when

p(k|t) = Nk(t)/N(t) = #{observations from class k at node t} / #{observations at node t}.

195


1. To split or not to split?

◮ Split S: node t is split into two "sons" tL and tR

– πL – proportion of observations at t going to tL

– πR – proportion of observations at t going to tR

[Diagram: node t is split into a left son tL, receiving a proportion πL of the observations, and a right son tR, receiving a proportion πR.]

196


2. To split or not to split?

◮ Goodness of split S is defined by the decrease in impurity measure

Φ(S, t) = ∆Imp(t) = Imp(t) − πL Imp(tL) − πR Imp(tR);

◮ Idea: choose the split S that maximizes Φ(S, t).

◮ Impurity of the tree T

Imp(T) = Σ_{t ∈ T̃} π(t) Imp(t),

where T̃ is the set of terminal nodes and π(t) is the proportion of the

whole population at node t.

197


Numerical example

[Diagram: node 0 (100 obs., class counts 60/40) is split into node 1 (70 obs., class counts 50/20) and node 2 (30 obs., class counts 10/20); π(1) = 0.7, π(2) = 0.3.]

Imp(t0) = −(60/100) log(60/100)− (40/100) log(40/100) = 0.673

Imp(t1) = −(50/70) log(50/70)− (20/70) log(20/70) = 0.598

Imp(t2) = −(20/30) log(20/30)− (10/30) log(10/30) = 0.637

∆Imp(t0) = 0.673 − 0.7 × 0.598 − 0.3 × 0.637 = 0.063
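A quick check of these numbers in R (entropy() is a small helper introduced here, not part of the original slides):

> entropy <- function(counts) {            # entropy impurity from class counts
+   p <- counts / sum(counts)
+   -sum(p * log(p))
+ }
> imp <- c(entropy(c(60, 40)), entropy(c(50, 20)), entropy(c(10, 20)))
> imp                                      # Imp(t0), Imp(t1), Imp(t2)
[1] 0.6730117 0.5982696 0.6365142
> imp[1] - 0.7*imp[2] - 0.3*imp[3]         # decrease in impurity
[1] 0.0632687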

198


Splitting rules

X = (X1, . . . , Xp) is the vector of features.

◮ Splits are determined by the standardized set of questions Q:

– Each split depends only on a single variable

– For ordered Xi the questions in Q are of the form

{Is Xi ≤ c}, c ∈ (−∞,∞).

– If Xi is categorical, taking values in B = {b1, . . . , bm}, then the

questions in Q are {Is Xi ∈ A}, where A is any subset of B.

◮ At each node CART looks at variables Xi one by one, finds the best

split for each Xi, and chooses the best of the best.

199


Stop–splitting and assignment rules

◮ Stop splitting if

– decrease in impurity measure of a node is less than a prespecified

threshold

– impurity of the whole tree is less than a threshold

– depth of tree is greater than some parameter

– ...

◮ CART does not employ a stopping rule; pruning is used instead.

◮ To a terminal node t assign the class

Y(t) = arg max_{k=1,...,K} p(k|t).

200


1. Example: Stage C prostate cancer data

◮ Data: on 146 stage C prostate cancer patients, 7 variables, pgstat –

response

– pgtime – time to progression

– pgstat – status at last follow-up (1=progressed, 0=censored)

– age – age at diagnosis

– eet – early endocrine therapy (1=no, 0=yes)

– ploidy – diploid/tetraploid/aneuploid DNA pattern

– grade – tumor grade (1–4)

– gleason – Gleason grade

201


2. Example: Stage C prostate cancer data

> library(rpart)

> stagec<-read.table("stagec.dat", header=T, sep=",")

> cfit<-rpart(pgstat~age+eet+grade+gleason+ploidy, data=stagec, method="class")

> print(cfit)

n= 146

node), split, n, loss, yval, (yprob)

* denotes terminal node

1) root 146 54 0 (0.6301370 0.3698630)

2) grade< 2.5 61 9 0 (0.8524590 0.1475410) *

3) grade>=2.5 85 40 1 (0.4705882 0.5294118)

6) ploidy< 1.5 29 11 0 (0.6206897 0.3793103)

12) gleason< 7.5 22 7 0 (0.6818182 0.3181818) *

13) gleason>=7.5 7 3 1 (0.4285714 0.5714286) *

7) ploidy>=1.5 56 22 1 (0.3928571 0.6071429) # child nodes of node t

14) age>=61.5 34 17 0 (0.5000000 0.5000000) # are numbered as

28) age< 64.5 12 4 0 (0.6666667 0.3333333) * # 2t (left) and 2t+1

29) age>=64.5 22 9 1 (0.4090909 0.5909091) * # (right)

15) age< 61.5 22 5 1 (0.2272727 0.7727273) *

202


3. Example: Stage C prostate cancer data (default tree)

> plot(cfit)

> text(cfit)

◮ Some default parameters

* minsplit=20: minimal number of observations in a node for which

split is computed

* minbucket=minsplit/3: minimal number of observations in a

terminal node

* cp=0.01: complexity parameter

203


[Figure: default rpart tree for the stage C data. Splits: grade < 2.5; ploidy < 1.5; gleason < 7.5; age >= 61.5; age < 64.5. Terminal nodes are labeled with the predicted class, 0 or 1.]

204


> summary(cfit)

Call:

rpart(formula = pgstat ~ age + eet + grade + gleason + ploidy,

data = stagec, method = "class")

...............................................................

Node number 1: 146 observations, complexity param=0.1111111

predicted class=0 expected loss=0.369863

class counts: 92 54

probabilities: 0.630 0.370

left son=2 (61 obs) right son=3 (85 obs)

Primary splits:

grade < 2.5 to the left, improve=10.35759000, (0 missing)

gleason < 5.5 to the left, improve= 8.39957400, (3 missing)

ploidy < 1.5 to the left, improve= 7.65653300, (0 missing)

age < 58.5 to the right, improve= 1.38812800, (0 missing)

eet < 1.5 to the right, improve= 0.07407407, (2 missing)

Surrogate splits:

gleason < 5.5 to the left, agree=0.863, adj=0.672, (0 split)

ploidy < 1.5 to the left, agree=0.644, adj=0.148, (0 split)

age < 66.5 to the right, agree=0.589, adj=0.016, (0 split)

205


Node number 2: 61 observations

predicted class=0 expected loss=0.147541

class counts: 52 9

probabilities: 0.852 0.148

Node number 3: 85 observations, complexity param=0.1111111

predicted class=1 expected loss=0.4705882

class counts: 40 45

probabilities: 0.471 0.529

left son=6 (29 obs) right son=7 (56 obs)

Primary splits:

ploidy < 1.5 to the left, improve=1.9834830, (0 missing)

age < 56.5 to the right, improve=1.6596080, (0 missing)

gleason < 8.5 to the left, improve=1.6386550, (0 missing)

eet < 1.5 to the right, improve=0.1086108, (1 missing)

Surrogate splits:

age < 72.5 to the right, agree=0.682, adj=0.069, (0 split)

gleason < 9.5 to the right, agree=0.682, adj=0.069, (0 split)

....................................................................

206


◮ grade 1 and 2 go to the left, grade 3 and 4 go to the right.

◮ The improvement is n times the change in the impurity index. The

largest improvement is for grade, 10.36. The actual values are not so

important; their relative size gives an indication of the utility of the variables.

◮ Once a splitting variable and split point have been decided, what is to

be done with observations missing the variable? CART defines

surrogate variables by re–applying the partitioning algorithm to

predict the two categories using other independent variables.

207


1. Cost–complexity pruning

◮ Misclassification rate of a node t

R(t) = Σ_{k ≠ Y(t)} p(k|t),   where Y(t) = arg max_{k=1,...,K} p(k|t) is the label assigned to node t.

◮ Let T̃ = {t1, . . . , tm} be the set of terminal nodes; the misclassification rate of T is

R(T) = Σ_{i=1}^m [N(ti)/n] R(ti) = (1/n) Σ_{t ∈ T̃} N(t) R(t).

◮ Let size(T) = #{t : t ∈ T̃} be the number of terminal nodes; for complexity parameter (CP) α > 0

Rα(T) = R(T) + α size(T) = Σ_{t ∈ T̃} [N(t)R(t)/n + α],

◮ α > 0 imposes a penalty for large trees.

208


2. Cost–complexity pruning

[Diagram: a tree with root t(0); its sons are t(1) and t(2), with terminal nodes t(3), t(4) under t(1) and t(5), t(6) under t(2); the sub-tree T(t(2)) rooted at t(2) is highlighted.]

◮ T (t) is the sub–tree rooted at t

◮ Error of sub-tree T (t2) and error of the node t2:

Rα(T (t2)) = R(T (t2)) + α size{T (t2)} = R(T (t2)) + 2α

Rα(t2) = R(t2) + α, t2 is treated as terminal.

209


3. Cost–complexity pruning

◮ Pruning is worthwhile if

Rα(t2) ≤ Rα(T(t2))  ⇔  g(t2, T) = [R(t2) − R(T(t2))] / [size{T(t2)} − 1] ≤ α.

Function g(t, T ) can be computed for any internal node of the tree.

◮ Weakest–link cutting algorithm:

1. Start with the full tree T1. For each non-terminal node t ∈ T1 compute g(t, T1), and find t1 = arg min_{t ∈ T1} g(t, T1). Set

α2 = g(t1, T1).

2. Define new tree T2 by pruning away the branch rooted at t1. Find

the weakest link in T2 and proceed as in 1.

◮ Result: a decreasing sequence of sub–trees with corresponding α’s.

The final selection is by cross-validation or by validation sample.

210


Example: Stage C prostate cancer data (cont.)

> printcp(cfit)

Classification tree:

rpart(formula = pgstat ~ age + eet + grade + gleason + ploidy,

data = stagec, method = "class")

Variables actually used in tree construction:

[1] age gleason grade ploidy

Root node error: 54/146 = 0.36986

n= 146

CP nsplit rel error xerror xstd

1 0.111111 0 1.00000 1.0000 0.10802

2 0.037037 2 0.77778 1.0741 0.10949

3 0.018519 4 0.70370 1.0556 0.10916

4 0.010000 5 0.68519 1.0556 0.10916

211


Complexity parameter (CP) table

◮ The CP table is printed from the smallest tree (0 splits) to the largest

(5 splits for cancer data)

◮ rel error – relative error on the training set (resubstitution), the

first node has an error of 1.

◮ xerror – the cross–validation estimate of the error

◮ xstd – the standard deviation of the risk

◮ 1-SE rule of thumb: all trees with

xerror ≤ minimal xerror + xstd

are equivalent. Choose the simplest one.
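One way to apply this rule programmatically to an rpart CP table (a sketch; cp.tab, i.min, i.1se and cfit.1se are illustrative names, and the rule is read as "within one xstd of the minimal xerror"):

> cp.tab <- cfit$cptable                        # columns CP, nsplit, rel error, xerror, xstd
> i.min  <- which.min(cp.tab[, "xerror"])       # tree with minimal CV error
> thresh <- cp.tab[i.min, "xerror"] + cp.tab[i.min, "xstd"]
> i.1se  <- min(which(cp.tab[, "xerror"] <= thresh))   # simplest tree within 1 SE
> cfit.1se <- prune(cfit, cp = cp.tab[i.1se, "CP"])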

212


Example: data on spam

◮ Data: 4601 instances, 57 attributes

– Most of the attributes indicate whether a particular word or

character was frequently occurring in the e-mail. The run-length

attributes (55-57) measure the length of sequences of consecutive

capital letters.

◮ Default tree building

> spam.tr<-rpart(V58~., data=spam, method="class") # default tree

> plot(spam.tr, compress=T, branch=.3)

> text(spam.tr)

213


1. Spam data: default tree

[Figure: default rpart tree for the spam data. Splits: V53 < 0.0555; V7 < 0.055; V52 < 0.378; V57 < 55.5; V16 < 0.845; V25 >= 0.4. Terminal nodes are labeled 0 (non-spam) or 1 (spam).]

214


2. Spam data: default tree

> printcp(spam.tr)

Classification tree:

rpart(formula = V58 ~ ., data = spam, method = "class")

Variables actually used in tree construction:

[1] V16 V25 V52 V53 V57 V7

Root node error: 1813/4601 = 0.39404

n= 4601

CP nsplit rel error xerror xstd

1 0.476558 0 1.00000 1.00000 0.018282

2 0.148924 1 0.52344 0.54716 0.015386

3 0.043023 2 0.37452 0.44457 0.014222

4 0.030888 4 0.28847 0.33867 0.012723

5 0.010480 5 0.25758 0.28847 0.011875

6 0.010000 6 0.24710 0.27799 0.011685 # classification error

Absolute cross–validation error: 0.27799×0.39404= 0.1095392

215


1. Spam data: unpruned tree

> sctrl <- rpart.control(minbucket=1, minsplit=2, cp=0)

> spam1.tr <- rpart(V58~., data=spam, method="class", control=sctrl)

> printcp(spam1.tr)

Classification tree:

rpart(formula = V58 ~ ., data = spam, method = "class", control = sctrl)

Variables actually used in tree construction:

[1] V1 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V2 V20 V21 V22 V23 V24 V25 V26

[20] V27 V28 V29 V3 V30 V33 V35 V36 V37 V39 V4 V40 V42 V43 V44 V45 V46 V48 V49

[39] V5 V50 V51 V52 V53 V54 V55 V56 V57 V6 V7 V8 V9

Root node error: 1813/4601 = 0.39404

n= 4601

CP nsplit rel error xerror xstd

1 0.47655819 0 1.0000000 1.00000 0.0182819

2 0.14892443 1 0.5234418 0.55268 0.0154419

3 0.04302261 2 0.3745174 0.46663 0.0144933

4 0.03088803 4 0.2884721 0.32598 0.0125182

5 0.01047987 5 0.2575841 0.29454 0.0119835

6 0.00827358 6 0.2471042 0.27248 0.0115825

7 0.00717044 7 0.2388307 0.26751 0.0114891

216


8 0.00529509 8 0.2316602 0.26089 0.0113626

9 0.00441258 14 0.1958081 0.23828 0.0109127

10 0.00358522 15 0.1913955 0.23497 0.0108445

11 0.00330943 19 0.1770546 0.23276 0.0107986

12 0.00275786 20 0.1737452 0.22945 0.0107293

13 0.00220629 24 0.1627137 0.22725 0.0106827

14 0.00193050 28 0.1538886 0.22173 0.0105648

15 0.00183857 31 0.1478213 0.22063 0.0105410

16 0.00165472 34 0.1423056 0.20684 0.0102366

17 0.00137893 42 0.1290678 0.19801 0.0100348

18 0.00110314 46 0.1235521 0.19470 0.0099576

19 0.00082736 62 0.1059018 0.18809 0.0098007 # minimal xerror

20 0.00070916 82 0.0871484 0.18919 0.0098271

21 0.00066189 90 0.0810811 0.18864 0.0098139

22 0.00055157 95 0.0777716 0.19250 0.0099057

23 0.00041368 167 0.0380585 0.19360 0.0099317

24 0.00036771 175 0.0347490 0.19746 0.0100220

25 0.00033094 181 0.0325427 0.19967 0.0100731

26 0.00029700 186 0.0308880 0.20132 0.0101111

27 0.00027579 205 0.0242692 0.21346 0.0103843

28 0.00018386 277 0.0044126 0.21511 0.0104208

29 0.00000000 286 0.0027579 0.21566 0.0104329

217


2. Spam data: unpruned tree

> plot(spam1.tr)

218


[Figure: plot of the unpruned spam classification tree; with 286 splits the individual nodes are not legible.]

219


1. Spam data: 1-SE rule pruned tree

> spam2.tr <- prune(spam1.tr, cp=0.00137893)

> printcp(spam2.tr)

Classification tree:

rpart(formula = V58 ~ ., data = spam, method = "class", control = sctrl)

Variables actually used in tree construction:

[1] V16 V17 V19 V21 V22 V24 V25 V27 V28 V37 V4 V46 V49 V5 V50 V52 V53 V55 V56

[20] V57 V6 V7 V8

Root node error: 1813/4601 = 0.39404

n= 4601

CP nsplit rel error xerror xstd

1 0.4765582 0 1.00000 1.00000 0.018282

2 0.1489244 1 0.52344 0.55268 0.015442

3 0.0430226 2 0.37452 0.46663 0.014493

4 0.0308880 4 0.28847 0.32598 0.012518

5 0.0104799 5 0.25758 0.29454 0.011984

220


6 0.0082736 6 0.24710 0.27248 0.011582

7 0.0071704 7 0.23883 0.26751 0.011489

8 0.0052951 8 0.23166 0.26089 0.011363

9 0.0044126 14 0.19581 0.23828 0.010913

10 0.0035852 15 0.19140 0.23497 0.010844

11 0.0033094 19 0.17705 0.23276 0.010799

12 0.0027579 20 0.17375 0.22945 0.010729

13 0.0022063 24 0.16271 0.22725 0.010683

14 0.0019305 28 0.15389 0.22173 0.010565

15 0.0018386 31 0.14782 0.22063 0.010541

16 0.0016547 34 0.14231 0.20684 0.010237

17 0.0013789 42 0.12907 0.19801 0.010035

◮ Absolute cross-validation error

0.19801× 1813/4601 = 0.07802386

221


2. Spam data: 1-SE rule pruned tree

> plot(spam2.tr)

> text(spam2.tr, cex=.5)

222


[Figure: the 1-SE pruned spam tree (42 splits). The splits involve, among others, V53, V7, V52, V16, V24, V22, V56, V5, V4, V28, V25, V8, V55, V49, V6, V50, V37, V57, V21, V17, V19, V46 and V27; terminal nodes are labeled 0 or 1.]

223


1. Spam data: another tree

> spam3.tr <- prune(spam1.tr, cp=0.0018)

> printcp(spam3.tr)

Classification tree:

rpart(formula = V58 ~ ., data = spam, method = "class", control = sctrl)

Variables actually used in tree construction:

[1] V16 V17 V19 V21 V22 V24 V25 V27 V37 V46 V49 V5 V52 V53 V55 V56 V57 V6 V7

[20] V8

Root node error: 1813/4601 = 0.39404

n= 4601

CP nsplit rel error xerror xstd

1 0.4765582 0 1.00000 1.00000 0.018282

2 0.1489244 1 0.52344 0.54661 0.015380

3 0.0430226 2 0.37452 0.43354 0.014081

4 0.0308880 4 0.28847 0.33370 0.012643

5 0.0104799 5 0.25758 0.28792 0.011866

224


6 0.0082736 6 0.24710 0.27689 0.011665

7 0.0071704 7 0.23883 0.26531 0.011447

8 0.0052951 8 0.23166 0.25648 0.011277

9 0.0044126 14 0.19581 0.23718 0.010890

10 0.0035852 15 0.19140 0.22835 0.010706

11 0.0033094 19 0.17705 0.22449 0.010624

12 0.0027579 20 0.17375 0.22614 0.010659

13 0.0022063 24 0.16271 0.22559 0.010648

14 0.0019305 28 0.15389 0.21732 0.010469

15 0.0018386 31 0.14782 0.21732 0.010469

16 0.0018000 34 0.14231 0.20684 0.010237

◮ Absolute cross-validation error

0.20684× 1813/4601 = 0.08150323

> plot(spam3.tr, branch=.4,uniform=T)

> text(spam3.tr, cex=.5)

225


[Figure: the spam tree pruned with cp = 0.0018 (34 splits); terminal nodes are labeled 0 or 1.]

226


14. Nonparametric smoothing: basic ideas,

kernel and local polynomial estimators

227


Density estimation problem

◮ Old Faithful Geyser data: for 272 eruptions, the eruption duration and the

waiting time between successive eruptions were recorded.

> summary(faithful)

eruptions waiting

Min. :1.600 Min. :43.0

1st Qu.:2.163 1st Qu.:58.0

Median :4.000 Median :76.0

Mean :3.488 Mean :70.9

3rd Qu.:4.454 3rd Qu.:82.0

Max. :5.100 Max. :96.0

◮ We want to estimate density of the waiting times between successive

eruptions.

228


Histogram

◮ Let X1, . . . , Xn iid∼ f. We want to estimate the density f.

◮ By definition, f(x) = lim_{h→0} (1/2h) P{x − h < X ≤ x + h}.

◮ Idea: fix h, estimate P{x − h < X ≤ x + h} by (1/n) Σ_{i=1}^n I{x − h < Xi ≤ x + h}, and let

f̂(x) = (1/2hn) Σ_{i=1}^n I{x − h < Xi ≤ x + h}.

◮ Histogram: consider bins (bj, bj+1], j = 0, 1, . . ., with bj+1 − bj = 2h for all j, and set

f̂(x) = (1/2hn) Σ_{i=1}^n I{bj < Xi ≤ bj+1},   x ∈ (bj, bj+1].
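A minimal sketch of the moving-window version of this estimator coded directly (fhat, h and grid are names introduced here; hist(), used on the next slide, draws the binned version):

> x <- faithful$waiting
> n <- length(x)
> h <- 5                                        # half-width of the window, chosen by eye
> fhat <- function(x0) sum(x > x0 - h & x <= x0 + h) / (2*h*n)
> grid <- seq(40, 100, length.out = 200)
> plot(grid, sapply(grid, fhat), type = "l", xlab = "waiting", ylab = "density")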

229


Histogram of the Old Faithful data

> hist(faithful$waiting)

[Figure: histogram of faithful$waiting; the waiting times range roughly from 40 to 100 minutes.]

230


1. How to choose binwidth h?

◮ Bias of f̂(x):

|E f̂(x) − f(x)| = | (1/2h) ∫_{x−h}^{x+h} f(t) dt − f(x) | ≤ sup_{t ∈ (x−h, x+h]} |f(t) − f(x)| ≤ 2Lh,

provided that |f(x) − f(x′)| ≤ L|x − x′| for all x, x′.

◮ Variance of f̂(x): because the I{x − h < Xi ≤ x + h} are iid Bernoulli

r.v. with parameter p = P{x − h < X1 ≤ x + h} = ∫_{x−h}^{x+h} f(t) dt,

var{f̂(x)} ≤ (1/2nh) · (1/2h) ∫_{x−h}^{x+h} f(t) dt ≤ M/(2nh),

provided that f(x) ≤ M for all x.

231


2. How to choose binwidth h?

◮ Mean squared error of f̂(x):

min_{h>0} {4L²h² + M(2nh)^{−1}}  ⇒  h* ≍ n^{−1/3}...

◮ The smoother the density, the larger h should be. For instance, if we

assume that |f′(x) − f′(x′)| ≤ L|x − x′| for all x, x′, then h* ≍ n^{−1/5}...

◮ The rule of thumb in R:

h = 1.144 · σ̂ · n^{−1/5}.
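A quick sketch evaluating this rule for the waiting times (the constant 1.144 is the one quoted above; R's own bandwidth defaults, such as bw.nrd0, use slightly different constants):

> x <- faithful$waiting
> 1.144 * sd(x) * length(x)^(-1/5)              # rule-of-thumb bandwidth, about 5 here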

232


Kernel estimators

◮ Another representation of the histogram estimator

f̂(x) = (1/nh) Σ_{i=1}^n K((Xi − x)/h),   K(t) = 1/2 for |t| ≤ 1 and 0 otherwise.

Function K is called the kernel, h is called the bandwidth.

◮ General kernel estimator: take a function K satisfying ∫_{−∞}^{∞} K(t) dt = 1

and define

f̂(x) = (1/hn) Σ_{i=1}^n K((Xi − x)/h).

◮ Commonly used kernels:

– rectangular K(t) = (1/2) I{|t| ≤ 1};

– triangular K(t) = (1 − |t|) I{|t| ≤ 1};

– Gaussian K(t) = (1/√(2π)) e^{−t²/2} (default in R).
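A minimal hand-rolled version of the estimator with a Gaussian kernel (kde, grid and the choice h = 4 are illustrative; density(), used on the next slide, is the standard implementation):

> kde <- function(x0, x, h) mean(dnorm((x - x0) / h)) / h   # (1/nh) sum_i K((Xi - x0)/h)
> x <- faithful$waiting
> grid <- seq(40, 100, length.out = 200)
> plot(grid, sapply(grid, kde, x = x, h = 4), type = "l",
+      xlab = "waiting", ylab = "density")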

233


1. Kernel density estimation in R

> layout(matrix(1:3, ncol = 3))

> plot(density(faithful$waiting))

> plot(density(faithful$waiting,bw=8))

> plot(density(faithful$waiting,bw=0.8))

234


2. Kernel density estimation in R

[Figure: three kernel density estimates of faithful$waiting — the default bandwidth (3.988), bw = 8, and bw = 0.8.]

235


Example: motorcycle data

◮ Measurements of head acceleration in a simulated motorcycle accident, used to test crash helmets.

> library(MASS)

> plot(mcycle)

236


[Figure: scatter plot of the mcycle data — head acceleration (accel) against time (times).]

237


Regression problem

◮ Model: data {(Xi, Yi), i = 1, . . . , n}; f is an unknown "smooth"

function, ε is an error term with Eε = 0:

Yi = f(Xi) + εi, i = 1, . . . , n  ⇔  f(x) = E(Y | X = x).

◮ Parametric models

* Simple linear regression f(x) = β0 + β1x

* Polynomial regression

f(x) = β0 + β1x + · · · + βp x^p

Reduced to multiple linear regression:

f(x) = βᵀx,   x = (1, x, x², . . . , x^p)ᵀ.

238


Polynomial regression

> attach(mcycle)

> fit3<-lm(accel~times+I(times^2)+I(times^3))

> fit5<-lm(accel~times+I(times^2)+I(times^3)+I(times^4)+I(times^5))

> fit7<-lm(accel~times+I(times^2)+I(times^3)+I(times^4)+I(times^5)+I(times^6)+

+ I(times^7))

> plot(times, accel)

> lines(times, fit3$fitted, lty=1)

> lines(times, fit5$fitted, lty=2)

> lines(times, fit7$fitted, lty=3)

> legend(40, -70, c("fit3", "fit5", "fit7"), lty=c(1,2,3))

239


[Figure: mcycle data with fitted polynomial regressions of degree 3 (fit3), 5 (fit5) and 7 (fit7) overlaid.]

240


Nonparametric regression: basic ideas

◮ Local modeling: parametric model in ”local” neighborhood.

* Local average

f̂h(x) = (1/#{Nh(x)}) Σ_{i ∈ Nh(x)} Yi,   Nh(x) = {i : |x − Xi| ≤ h}.

* Local linear (polynomial) regression

* k-NN estimator

◮ How to define the local neighborhood?

241


Kernel estimators

◮ Regression function

f(x) = E(Y | X = x) = ∫ y p(x, y) dy / ∫ p(x, y) dy = ∫ y p(x, y) dy / p(x)

◮ Histogram estimate of p(x)

p̂h(x) = (1/2nh) Σ_{i=1}^n I{Xi ∈ (bj, bj+1]} = nj/(2nh),   x ∈ (bj, bj+1].

◮ Estimator of f(x)

f̂h(x) = Σ_{i=1}^n Yi I{Xi ∈ [x − h, x + h]} / Σ_{i=1}^n I{Xi ∈ [x − h, x + h]} =: Σ_{i=1}^n w_{n,i}(x) Yi

242


Nadaraya–Watson kernel estimator

◮ More generally, consider a kernel K such that ∫ K(x) dx = 1 and define

f̂h(x) = Σ_{i=1}^n Yi K((x − Xi)/h) / Σ_{i=1}^n K((x − Xi)/h).

◮ Kernels

– box kernel: K1(x) = I_{[−1/2, 1/2]}(x);

– quadratic kernel: K2(x) = (3/4)(1 − x²) I_{[−1,1]}(x);

– Gaussian kernel: K3(x) = (1/√(2π)) exp{−x²/2}.

◮ Selection of the bandwidth h is important:

f̂h(Xi) → Yi as h → 0;   f̂h(x) → (1/n) Σ_{i=1}^n Yi as h → ∞.
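ksmooth() on the next slide implements this estimator; a minimal hand-rolled version with a Gaussian kernel, for intuition (nw, grid and h = 2 are illustrative choices):

> nw <- function(x0, x, y, h) {                 # Nadaraya-Watson estimate at x0
+   w <- dnorm((x0 - x) / h)
+   sum(w * y) / sum(w)
+ }
> library(MASS)                                 # mcycle data, as before
> grid <- seq(min(mcycle$times), max(mcycle$times), length.out = 200)
> plot(mcycle)
> lines(grid, sapply(grid, nw, x = mcycle$times, y = mcycle$accel, h = 2))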

243


1. Example: motorcycle data

> plot(times, accel)

> lines(ksmooth(times, accel, "normal", bandwidth=1), lty=1)

> lines(ksmooth(times, accel, "normal", bandwidth=2), lty=2)

> lines(ksmooth(times, accel, "normal", bandwidth=3), lty=3)

> legend(40, -100, legend=c("bandwidth=1", "bandwidth=2", "bandwidth=3"),

+ lty=c(1,2,3))

The kernels are scaled so that their quartiles (viewed as probability

densities) are at +/- 0.25*bandwidth, i.e. for the normal kernel

h × z_{0.75} = 0.25 × bandwidth.

244


2. Example: motorcycle data

[Figure: mcycle data with Nadaraya–Watson fits (ksmooth, normal kernel) for bandwidth = 1, 2 and 3.]

245


1. Local polynomial smoothing

◮ Local model at point x:

Yi = a(x) + b(x) Xi + εi,   Xi ∈ [x − h, x + h].

◮ Local linear regression estimator

Σ_{i=1}^n [Yi − a(x) − b(x) Xi]² I{|Xi − x|/h ≤ 1} → min over a(x), b(x);

f̂(x) = â(x) + b̂(x) x.

246


2. Local polynomial smoothing

◮ General local polynomial estimator:

f(z) ≈ Σ_{j=0}^p [f^(j)(x)/j!] (z − x)^j = Σ_{j=0}^p βj (z − x)^j

Σ_{i=1}^n [Yi − Σ_{j=0}^p βj (Xi − x)^j]² K(|Xi − x|/h) → min over β

f̂(x) = β̂0,   f̂^(j)(x) = j! β̂j,   j = 1, . . . , p.

◮ Both kernel and local polynomial estimators are linear smoothers:

f̂(x) = Σ_{i=1}^n w_{n,i}(x) Yi,   Σ_{i=1}^n w_{n,i}(x) = 1.
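A sketch of a local linear fit (p = 1) at a single point x0 via weighted least squares (locallin, grid and h = 2 are illustrative; loess, used below, automates this with tricube weights):

> locallin <- function(x0, x, y, h) {
+   w <- dnorm((x - x0) / h)                    # kernel weights around x0
+   fit <- lm(y ~ I(x - x0), weights = w)       # local linear model centered at x0
+   coef(fit)[1]                                # intercept beta0 estimates f(x0)
+ }
> grid <- seq(min(mcycle$times), max(mcycle$times), length.out = 200)  # mcycle from MASS, as before
> plot(mcycle)
> lines(grid, sapply(grid, locallin, x = mcycle$times, y = mcycle$accel, h = 2))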

247


1. Local polynomial smoothing (LOESS)

◮ Idea: fit locally a polynomial to the data

◮ LOESS smoothing: define the weights

wi(x) = [1 − |x − Xi|³ / τ³(x, α)]³_+,   i = 1, . . . , n.

Let r(x, β) be a polynomial of degree p with coefficients

β = (β0, . . . , βp). Define

β̂(x) = arg min_β Σ_{i=1}^n wi(x) [Yi − r(Xi, β)]²,   f̂(x) = r(x, β̂(x)).

248


2. Local polynomial smoothing (LOESS)

◮ Bandwidth τ(x, α) is chosen as follows:

– denote ∆i(x) = |x−Xi| and order these values

∆(1)(x) ≤ ∆(2)(x) ≤ · · · ≤ ∆(n)(x).

– if 0 < α ≤ 1 then τ(x, α) = ∆(q)(x) where q = [αn];

– if α > 1 then τ(x, α) = α∆(n)(x)

249


1. LOESS: motorcycle data

> attach(mcycle)

> mcycle.1 <- loess(accel~times, span=0.1)

> mcycle.2 <- loess(accel~times, span=0.5)

> mcycle.3 <- loess(accel~times, span=1)

> prtimes<- matrix((0:1000)*((max(times)-min(times))/1000)+min(times), ncol=1)

> praccel.1 <-predict(mcycle.1, prtimes)

> praccel.2 <-predict(mcycle.2, prtimes)

> praccel.3 <-predict(mcycle.3, prtimes)

> plot(mcycle, pch="+")

> lines(prtimes, praccel.1, lty=1)

> lines(prtimes, praccel.2, lty=2)

> lines(prtimes, praccel.3, lty=3)

> legend(40, -90, legend=c("span=0.1", "span=0.5", "span=1"), lty=c(1,2,3))

span = α controls the proportion of the points used in the local neighborhood; see the

definition of τ(x, α);

the degree of the local polynomial is 2 (the default).

250


2. LOESS: motorcycle data

[Figure: mcycle data (plotted with "+") with LOESS fits for span = 0.1, 0.5 and 1 overlaid.]

251


15. Multivariate nonparametric regression: Regression trees (CART)

252


Multivariate nonparametric regression

◮ Curse of dimensionality

Data are very sparse in high-dimensional space.

If we have 1000 points uniformly distributed on [0, 1]^d, then the

average number Nd of points falling in [0, 0.3]^d is as follows:

d 1 2 3 4 5

Nd 300 90 27 8.1 2.4
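The table is just the expected count 1000 · 0.3^d; a one-line check in R:

> d <- 1:5
> round(1000 * 0.3^d, 1)                        # expected number of points in [0, 0.3]^d
[1] 300.0  90.0  27.0   8.1   2.4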

◮ Remedy: nonparametric ”structural” models

– additive structure: f(X) = f1(X1) + . . .+ fp(Xp)

– single–index: f(X) = f0(θTX)

– projection pursuit: f(X) = Σ_{i=1}^k fi(θiᵀ X)

253


Regression trees: basic idea

◮ Data: (Xi, Yi), i = 1, . . . n, X ∈ X ⊂ Rp.

◮ Modeling assumption: there is a partition of X into M regions

D1, . . . , DM, and f is approximated by a constant on each region:

f(x) = Σ_{m=1}^M cm I(x ∈ Dm)

◮ Splitting rules: binary splitting.

Choose a variable Xm and split according to Xm ≤ tm or Xm > tm.

◮ How to partition the domain X ?

254


A regression tree: an example

[Figure: a binary tree with splits X1 < t1, X2 < t2, X1 < t3 and X2 < t4, and terminal regions D1, . . . , D5.]

255


Corresponding partition of the feature space

[Figure: the (X1, X2) plane partitioned into the rectangles D1, . . . , D5 by the cut points t1, . . . , t4.]

256


1. Growing the regression tree

◮ Predictor variable: X = (X1, . . . , Xp) ∈ Rp;

Data: (Xi, Yi), Xi = (Xi1, . . . , Xip), i = 1, . . . , n.

◮ Goodness of split S is defined by decrease in the total sum of squares:

Φ(S, t) = SS(t) − [SS(tL) + SS(tR)],

where SS(t) = Σ (Yi − Ȳ)² is the total sum of squares of the observations at

node t.

◮ Choose the split S that maximizes Φ(S, t).

257


2. Growing the regression tree

◮ Consider splitting variable Xj and split point s and define

D1(j, s) = {X : Xj ≤ s},   D2(j, s) = {X : Xj > s}.

Then we look for j and s that solve

min_{j,s} [ Σ_{i=1}^n (Yi − c1)² I{Xi ∈ D1(j, s)} + Σ_{i=1}^n (Yi − c2)² I{Xi ∈ D2(j, s)} ],

ck = ave{Yi : Xi ∈ Dk(j, s)} = Σ_{i=1}^n Yi I{Xi ∈ Dk(j, s)} / Σ_{i=1}^n I{Xi ∈ Dk(j, s)},   k = 1, 2.

◮ For each splitting variable, the determination of s is done very quickly

(there is only a finite number of distinct splits). The pair (j, s) is found by scanning

over all of the inputs.
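A sketch of this search for a single splitting variable (best.split is a name introduced here; rpart performs the full scan over all variables):

> best.split <- function(x, y) {                # best split point for one variable
+   cuts <- sort(unique(x))
+   cuts <- (cuts[-1] + cuts[-length(cuts)]) / 2         # midpoints between observed values
+   rss  <- sapply(cuts, function(s) {
+     left <- x <= s
+     sum((y[left] - mean(y[left]))^2) + sum((y[!left] - mean(y[!left]))^2)
+   })
+   c(cut = cuts[which.min(rss)], rss = min(rss))
+ }
> best.split(mcycle$times, mcycle$accel)        # e.g. the root split for the mcycle data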

258


Pruning the regression tree

◮ Let T̃ = {t1, . . . , tM} be the terminal nodes of the tree T, with regions

D1, . . . , DM and numbers of observations N1, . . . , NM.

◮ Notation: for each terminal node tm define

ĉm = (1/Nm) Σ_{i=1}^n Yi I{Xi ∈ Dm},   Q(tm) = (1/Nm) Σ_{i=1}^n (Yi − ĉm)² I{Xi ∈ Dm}

◮ Cost–complexity criterion: For complexity parameter (CP) α > 0 let

Rα(T) = Σ_{m=1}^M Nm Q(tm) + α size(T),   size(T) = #{t : t ∈ T̃}.

◮ Weakest–link cutting algorithm produces a decreasing sequence of

trees with corresponding CP’s. Cross–validation based selection from

this collection. [See transparencies for classification trees].

259


1. Example: regression tree for motorcycle data

> mcycle.tree<-rpart(accel~times, data=mcycle,method="anova")

> plot(mcycle.tree)

> text(mcycle.tree)

> plot(times, accel)

> lines(times, predict(mcycle.tree))

260


[Figure: regression tree for the mcycle data. Splits: times < 27.4; times >= 16.5; times < 24.4; times >= 19.5; times >= 15.1; times >= 35. Terminal-node predictions: −114.7, −86.31, −42.49, −39.12, −4.357, 3.291, 29.29.]

261


2. Example: regression tree for motorcycle data

[Figure: mcycle data with the piecewise-constant regression-tree fit overlaid.]

262


1. Example: Boston housing data, regression tree

> library(MASS)

> nobs <- dim(Boston)[1]

> trainx<-sample(1:nobs, 2*nobs/3, replace=F)

> testx<-(1:nobs)[-trainx]

> Boston.tree<-rpart(medv~., data=Boston[trainx,], method="anova")

> print(Boston.tree)

n= 337

node), split, n, deviance, yval

* denotes terminal node

1) root 337 26701.3200 22.51810

2) rm< 6.825 277 10100.7600 19.68123

4) lstat>=15 99 1586.7700 14.28990

8) crim>=7.036505 41 409.5088 11.53171 *

9) crim< 7.036505 58 644.8588 16.23966 *

5) lstat< 15 178 4035.9670 22.67978

10) rm< 6.543 145 2251.4780 21.62069

263


20) lstat>=9.66 80 573.8339 20.32125 *

21) lstat< 9.66 65 1376.3040 23.22000 *

11) rm>=6.543 33 907.2133 27.33333 *

3) rm>=6.825 60 4079.5770 35.61500

6) rm< 7.435 42 1248.8060 31.77143

12) lstat>=5.415 21 417.2467 28.66667 *

13) lstat< 5.415 21 426.6981 34.87619 *

7) rm>=7.435 18 762.5450 44.58333 *

#

> plot(Boston.tree)

> text(Boston.tree)

#

#

> Boston.pred <- predict(Boston.tree, Boston[testx,])

> sum((Boston.pred-Boston[testx,"medv"])^2)

[1] 4045.943 # prediction error

# on the test set

264


2. Example: Boston housing data, regression tree

[Figure: regression tree for the Boston data. Splits: rm < 6.825; lstat >= 15; crim >= 7.037; rm < 6.543; lstat >= 9.66; rm < 7.435; lstat >= 5.415. Terminal-node predictions: 11.53, 16.24, 20.32, 23.22, 27.33, 28.67, 34.88, 44.58.]

265


MARS as extension of CART

◮ CART decision trees are based on approximation of f by

f(x) = Σ_{m=1}^M cm Bm(x),   where the Bm(x) = I{x ∈ Dm} are the basis functions.

◮ Idea: replace I{x ∈ Dm} with a continuous function.

If the basis functions Bm(·) are given, the coefficients cm are

estimated by least squares.

266


Region representation

◮ Step function:

H[η] = 1 if η ≥ 0, and 0 otherwise.

Each region Dm is obtained by, say, Km splits; k-th split,

k = 1, . . . , Km, is performed on the variable xv(k,m) using the

threshold tkm. Therefore

Bm(x) = Π_{k=1}^{Km} H[s_{km} (x_{v(k,m)} − t_{km})],   s_{km} = ±1,

f(x) = Σ_{m=1}^M cm Π_{k=1}^{Km} H[s_{km} (x_{v(k,m)} − t_{km})].

◮ Minimize a lack–of-fit (LOF) measure w.r.t. cm, skm, v(k,m) and

tkm.

267


MARS basis functions

◮ The MARS algorithm uses, instead of the step functions H[xv − t] and H[−xv + t],

the hinge functions (xv − t)+ and (−xv + t)+.

[Figure: the two hinge functions (x − t)+ and (t − x)+ plotted against x, each zero on one side of the knot t.]

◮ The collection of basis functions:

C = {(xj − t)+, (t − xj)+ : t ∈ {X1j, X2j, . . . , Xnj}, j = 1, . . . , p}.
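A small sketch of a reflected pair of hinge functions for one variable and one knot (hinge, knot and the column names are illustrative; the mda package used on the following slides builds these internally):

> hinge <- function(u) pmax(u, 0)               # (u)_+
> x <- Boston$lstat                              # Boston from MASS, as above
> knot <- 6.07                                   # an observed value of lstat
> B <- cbind(right = hinge(x - knot), left = hinge(knot - x))  # the reflected pair
> head(B)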

268


MARS: Model building strategy

◮ Forward selection of basis functions: at each iteration we add a pair

of new basis functions. They are obtained by multiplication of the

previously chosen basis functions with [xj − t]+ and [t− xj ]+.

◮ First step: we consider adding to the model a function

β1 (xj − t)+ + β2 (t − xj)+,   t ∈ {X1j, X2j, . . . , Xnj}.

Suppose the best choice is β1 (x2 − X72)+ + β2 (X72 − x2)+. Then

the set of basis functions at this step is

C1 = {B0(x) = 1, B1(x) = (x2 − X72)+, B2(x) = (X72 − x2)+}

269


◮ Second step: consider including a pair of products

(xj − t)+Bm(x) and (t− xj)+Bm(x), Bm ∈ C1.

◮ Step M : CM = {Bm(x),m = 1, . . . , 2M + 1}. At step M + 1 the

algorithm adds the terms

c2M+2Bl(x) [xj − t]+ + c2M+3Bl(x) [t− xj ]+, Bl ∈ CM , j = 1, . . . , p,

where Bl and j produce maximal decrease in the training error. Stop

when the model contains preset maximum number of terms Mmax.

◮ Final selection: choose fM based on M basis functions that minimizes

LOF(f̂M) = [n / (n − M − 1)²] Σ_{i=1}^n [yi − f̂M(xi)]²

270


Recursive partitioning algorithm

B1(x) = 1
for M = 2 to Mmax do: lof* = ∞
  for m = 1 to M − 1 do:
    for v = 1 to p do:   (loop over the variables)
      for t ∈ {x_{vj} : Bm(xj) > 0} do:
        g = Σ_{i≠m} ci Bi(x) + cm Bm(x) H[xv − t] + cM Bm(x) H[−xv + t]
        lof = min_{c1,...,cM} LOF(g)
        if lof < lof* then lof* = lof; m* = m; v* = v; t* = t; endif
      endfor
    endfor
  endfor
  BM(x) = Bm*(x) H[−x_{v*} + t*]
  Bm*(x) = Bm*(x) H[x_{v*} − t*]
endfor; end of algorithm

271


1. Example: Boston housing data – MARS

# Number of basis functions<=10, no interactions

> library(mda)

> Boston1.mars<-mars(Boston[,1:13], Boston$medv, degree=1, nk=10)

# ij-th element equal to 1 if term i has a factor of the form x_j>c,

# equal to -1 if term i has a factor of the form x_j <= c,

# and to 0 if x_j is not in term i.

> Boston1.mars$factor

crim zn indus chas nox rm age dis rad tax ptratio black lstat

[1,] 0 0 0 0 0 0 0 0 0 0 0 0 0

[2,] 0 0 0 0 0 0 0 0 0 0 0 0 1

[3,] 0 0 0 0 0 0 0 0 0 0 0 0 -1

[4,] 0 0 0 0 0 1 0 0 0 0 0 0 0

[5,] 0 0 0 0 0 -1 0 0 0 0 0 0 0

[6,] 0 0 0 0 0 0 0 0 0 0 1 0 0

[7,] 0 0 0 0 0 0 0 0 0 0 -1 0 0

[8,] 0 0 0 0 1 0 0 0 0 0 0 0 0

[9,] 0 0 0 0 -1 0 0 0 0 0 0 0 0

> Boston1.mars$cuts

272


# ij-th element equal to the cut point c for variable j in term i.

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]

[1,] 0 0 0 0 0.000 0.000 0 0 0 0 0.0 0 0.00

[2,] 0 0 0 0 0.000 0.000 0 0 0 0 0.0 0 6.07

[3,] 0 0 0 0 0.000 0.000 0 0 0 0 0.0 0 6.07

[4,] 0 0 0 0 0.000 6.425 0 0 0 0 0.0 0 0.00

[5,] 0 0 0 0 0.000 6.425 0 0 0 0 0.0 0 0.00

[6,] 0 0 0 0 0.000 0.000 0 0 0 0 17.8 0 0.00

[7,] 0 0 0 0 0.000 0.000 0 0 0 0 17.8 0 0.00

[8,] 0 0 0 0 0.472 0.000 0 0 0 0 0.0 0 0.00

[9,] 0 0 0 0 0.472 0.000 0 0 0 0 0.0 0 0.00

> Boston1.mars$selected.terms

[1] 1 2 3 4 5 6 7 8 9

> Boston1.mars$coefficients

[,1]

[1,] 25.3093268

[2,] -0.5529556

[3,] 2.9450633

[4,] 8.0947256

[5,] 1.4712255

273


[6,] -0.6005808

[7,] 0.8962524

[8,] -12.0724248

[9,] -52.6998783

> Boston1.mars$gcv

[1] 17.73768

> plot(Boston1.mars$residuals)

> abline(0,0)

> qqnorm(Boston1.mars$residuals)

> qqline(Boston1.mars$residuals)

◮ Fitted model

f(x) = 25.30− 0.55× (lstat− 6.07)+ + 2.95× (6.07− lstat)+

+ 8.09× (rm− 6.43)+ + 1.47× (6.43− rm)+

−0.6× (ptratio − 17.8)+ + 0.90× (17.8− ptratio)+

−12.07× (nox− 0.47)+ − 52.7× (0.47− nox)+

274


2. Example: Boston housing data – MARS

[Figure: residuals of the fitted MARS model plotted against observation index (left) and a normal Q–Q plot of the residuals (right).]

275


3. Example: Boston housing data – MARS

# Number of basis function <= 40, degree=2

> Boston2.mars<-mars(Boston[,1:13], Boston$medv, degree=2, nk=40)

> Boston2.mars$selected.terms

[1] 1 2 4 5 6 7 8 9 11 13 14 16 19 20 23 25 26 27 28 29 31 32 34 36 38

[26] 39

> Boston2.mars$factor

crim zn indus chas nox rm age dis rad tax ptratio black lstat

[1,] 0 0 0 0 0 0 0 0 0 0 0 0 0

[2,] 0 0 0 0 0 0 0 0 0 0 0 0 1

[3,] 0 0 0 0 0 0 0 0 0 0 0 0 -1

[4,] 0 0 0 0 0 1 0 0 0 0 0 0 0

[5,] 0 0 0 0 0 -1 0 0 0 0 0 0 0

[6,] 0 0 0 0 0 1 0 0 0 0 1 0 0

[7,] 0 0 0 0 0 1 0 0 0 0 -1 0 0

[8,] 0 0 0 0 0 0 0 0 0 1 0 0 -1

[9,] 0 0 0 0 0 0 0 0 0 -1 0 0 -1

[10,] 0 0 0 0 1 0 0 0 0 0 0 0 1

[11,] 0 0 0 0 -1 0 0 0 0 0 0 0 1

276


[12,] 0 0 0 0 0 -1 0 1 0 0 0 0 0

[13,] 0 0 0 0 0 -1 0 -1 0 0 0 0 0

[14,] 1 0 0 0 0 0 0 0 0 0 0 0 0

[15,] -1 0 0 0 0 0 0 0 0 0 0 0 0

[16,] 1 0 0 1 0 0 0 0 0 0 0 0 0

[17,] 1 0 0 -1 0 0 0 0 0 0 0 0 0

[18,] -1 0 0 0 0 0 0 0 0 1 0 0 0

[19,] -1 0 0 0 0 0 0 0 0 -1 0 0 0

[20,] 0 0 0 0 0 0 0 0 0 0 0 0 1

[21,] 0 0 0 0 0 0 0 0 0 0 0 0 -1

[22,] 0 0 0 0 0 1 1 0 0 0 0 0 0

[23,] 0 0 0 0 0 1 -1 0 0 0 0 0 0

[24,] 0 0 0 0 0 0 0 1 0 0 0 0 0

[25,] 0 0 0 0 0 0 0 -1 0 0 0 0 0

[26,] 0 0 0 0 0 0 0 -1 0 0 0 1 0

[27,] 0 0 0 0 0 0 0 -1 0 0 0 -1 0

[28,] 0 0 0 0 0 0 0 0 0 0 0 1 1

[29,] 0 0 0 0 0 0 0 0 0 0 0 -1 1

[30,] 0 0 0 0 0 0 1 -1 0 0 0 0 0

[31,] 0 0 0 0 0 0 -1 -1 0 0 0 0 0

[32,] 0 0 0 0 1 1 0 0 0 0 0 0 0

[33,] 0 0 0 0 -1 1 0 0 0 0 0 0 0

277


[34,] 0 0 0 0 0 0 0 0 0 0 1 0 1

[35,] 0 0 0 0 0 0 0 0 0 0 -1 0 1

[36,] 0 0 0 0 1 0 0 -1 0 0 0 0 0

[37,] 0 0 0 0 -1 0 0 -1 0 0 0 0 0

[38,] 0 0 0 0 0 0 0 1 0 0 0 0 1

[39,] 0 0 0 0 0 0 0 -1 0 0 0 0 1

>

> Boston2.mars$cuts

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]

[1,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 0.00 0.00

[2,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 0.00 6.07

[3,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 0.00 6.07

[4,] 0.00000 0 0 0 0.000 6.425 0.0 0.0000 0 0 0.0 0.00 0.00

[5,] 0.00000 0 0 0 0.000 6.425 0.0 0.0000 0 0 0.0 0.00 0.00

[6,] 0.00000 0 0 0 0.000 6.425 0.0 0.0000 0 0 17.8 0.00 0.00

[7,] 0.00000 0 0 0 0.000 6.425 0.0 0.0000 0 0 17.8 0.00 0.00

[8,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 335 0.0 0.00 6.07

[9,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 335 0.0 0.00 6.07

[10,] 0.00000 0 0 0 0.718 0.000 0.0 0.0000 0 0 0.0 0.00 6.07

[11,] 0.00000 0 0 0 0.718 0.000 0.0 0.0000 0 0 0.0 0.00 6.07

[12,] 0.00000 0 0 0 0.000 6.425 0.0 1.8195 0 0 0.0 0.00 0.00

278


[13,] 0.00000 0 0 0 0.000 6.425 0.0 1.8195 0 0 0.0 0.00 0.00

[14,] 4.54192 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 0.00 0.00

[15,] 4.54192 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 0.00 0.00

[16,] 4.54192 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 0.00 0.00

[17,] 4.54192 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 0.00 0.00

[18,] 4.54192 0 0 0 0.000 0.000 0.0 0.0000 0 242 0.0 0.00 0.00

[19,] 4.54192 0 0 0 0.000 0.000 0.0 0.0000 0 242 0.0 0.00 0.00

[20,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 0.00 23.97

[21,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 0.00 23.97

[22,] 0.00000 0 0 0 0.000 6.425 84.7 0.0000 0 0 0.0 0.00 0.00

[23,] 0.00000 0 0 0 0.000 6.425 84.7 0.0000 0 0 0.0 0.00 0.00

[24,] 0.00000 0 0 0 0.000 0.000 0.0 4.7075 0 0 0.0 0.00 0.00

[25,] 0.00000 0 0 0 0.000 0.000 0.0 4.7075 0 0 0.0 0.00 0.00

[26,] 0.00000 0 0 0 0.000 0.000 0.0 4.7075 0 0 0.0 373.66 0.00

[27,] 0.00000 0 0 0 0.000 0.000 0.0 4.7075 0 0 0.0 373.66 0.00

[28,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 376.73 6.07

[29,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 376.73 6.07

[30,] 0.00000 0 0 0 0.000 0.000 77.8 4.7075 0 0 0.0 0.00 0.00

[31,] 0.00000 0 0 0 0.000 0.000 77.8 4.7075 0 0 0.0 0.00 0.00

[32,] 0.00000 0 0 0 0.624 6.425 0.0 0.0000 0 0 0.0 0.00 0.00

[33,] 0.00000 0 0 0 0.624 6.425 0.0 0.0000 0 0 0.0 0.00 0.00

[34,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 0 12.6 0.00 6.07

279


[35,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 0 12.6 0.00 6.07

[36,] 0.00000 0 0 0 0.718 0.000 0.0 4.7075 0 0 0.0 0.00 0.00

[37,] 0.00000 0 0 0 0.718 0.000 0.0 4.7075 0 0 0.0 0.00 0.00

[38,] 0.00000 0 0 0 0.000 0.000 0.0 2.9879 0 0 0.0 0.00 6.07

[39,] 0.00000 0 0 0 0.000 0.000 0.0 2.9879 0 0 0.0 0.00 6.07

>

> Boston2.mars$coefficients

[,1]

[1,] 23.054932863

[2,] -0.475918184

[3,] 9.219231990

[4,] -2.823958066

[5,] -1.785616880

[6,] 0.742716889

[7,] 0.018297135

[8,] 0.015258475

[9,] 1.687538405

[10,] 9.636024180

[11,] -0.062409499

[12,] 2.555488850

[13,] 0.013238063

280

Page 282: 1. Introduction - University of Haifaidattner/Course2015sem2/DM-00.pdf · >plot(UScereal$fat, UScereal$calories, xlab="Fat", ylab="Calories") 0 2 4 6 8 100 200 300 400 Fat Calories

[14,] 0.593931139

[15,] 0.047603018

[16,] 2.588471925

[17,] -0.107836555

[18,] -0.012484569

[19,] 0.017930583

[20,] 0.001807397

[21,] 0.045138047

[22,] -98.745400075

[23,] -0.058675469

[24,] -6.453215924

[25,] -0.071061935

[26,] -0.140013447

>

> Boston2.mars$gcv

[1] 9.199464

281


4. Example: Boston housing data – MARS

◮ Fitted model

f(x) = 23.05− 0.48× (lstat − 6.07)+ + 9.21× (6.07− lstat)+

− 2.82× (rm− 6.43)+ − 1.78× (6.43− rm)+

+0.74× (rm− 6.43)+ × (ptratio − 17.8)+

+ 0.002× (rm− 6.43)+ × (17.8− ptratio)+

+ · · ·

282


16. Dimensionality reduction: principal components analysis (PCA)

283


1. PCA: basic idea

◮ The idea is to describe/approximate variability (distribution) of

X = (X1, . . . , Xp) by a distribution in the space of smaller dimension.

◮ Approximation by a one–dimensional space: let

δᵀX = ∑_{i=1}^{p} δi Xi,   ‖δ‖² = ∑_{i=1}^{p} δi² = 1.

Which projection (normalized linear combination) is the “best
representer” of the vector distribution? Or, how to choose δ?

◮ First optimization problem

(Opt1)   max_{δ: ‖δ‖=1} var(δᵀX) = max_{δ: ‖δ‖=1} δᵀΣδ,   Σ = cov(X).


2. PCA: basic idea

◮ Solution to (Opt1) is the eigenvector γ1 of Σ corresponding to the
maximal eigenvalue λ1. The first principal component is Y1 = γ1ᵀX.

◮ Second optimization problem: max{δᵀΣδ : ‖δ‖ = 1, δᵀγ1 = 0}.

Solution is the eigenvector γ2 of Σ corresponding to the second largest
eigenvalue λ2. Second principal component: Y2 = γ2ᵀX, and so on...

◮ In general, if Σ = ΓΛΓᵀ, where Λ is the diagonal matrix of the
eigenvalues and Γ is the orthogonal matrix of the eigenvectors, then
the PCA transformation is

Y = Γᵀ(X − µ).


PCA: theory

◮ Theorem: Let X ∼ Np(µ, Σ), Σ = ΓΛΓᵀ, and let Y = Γᵀ(X − µ) be
the principal components. Then

(i) EYj = 0, var(Yj) = λj, j = 1, . . . , p; cov{Yi, Yj} = 0, ∀ i ≠ j;

(ii) var(Y1) ≥ var(Y2) ≥ · · · ≥ var(Yp);

(iii) ∑_{i=1}^{p} var(Yi) = tr(Σ),  ∏_{i=1}^{p} var(Yi) = det(Σ).

◮ Proportion of the variability explained by q components:

ψq = ∑_{i=1}^{q} λi / ∑_{i=1}^{p} λi = ∑_{i=1}^{q} var(Yi) / ∑_{i=1}^{p} var(Yi).


PCA: empirical version

◮ Idea: given Xi ∈ Rp, i = 1, . . . , n, estimate Γ and µ, apply the PCA
transformation to the Xi's, and keep the first q variables that explain the
variability “well”.

◮ Estimates:

µ̂ = (1/n) ∑_{i=1}^{n} Xi,   Σ̂ = (1/(n−1)) ∑_{i=1}^{n} (Xi − µ̂)(Xi − µ̂)ᵀ.

◮ Spectral decomposition and PCA transformation:

Σ̂ = Γ̂ Λ̂ Γ̂ᵀ,   Yi = Γ̂ᵀ(Xi − µ̂),  i = 1, . . . , n.

If the variables are given on different scales, the data can be standardized
before applying PCA:

X̃i = D^{−1/2}(Xi − µ̂),   D = diag{Σ̂}.
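A minimal sketch (not from the course slides) of this empirical PCA on a toy data matrix, checked against R's prcomp(); the toy data and object names are assumptions for illustration.

> # Empirical PCA "by hand": spectral decomposition of the sample covariance.
> set.seed(1)
> X <- matrix(rnorm(100 * 3), 100, 3) %*% matrix(c(2, 1, 0, 1, 1, 0, 0, 0, 0.2), 3, 3)
> mu <- colMeans(X)
> Sigma <- cov(X)                          # uses the 1/(n-1) denominator, as above
> eig <- eigen(Sigma)                      # Gamma = eig$vectors, Lambda = eig$values
> Y <- sweep(X, 2, mu) %*% eig$vectors     # Yi = Gamma^T (Xi - mu)
> round(eig$values / sum(eig$values), 3)   # proportion of variability per component
> max(abs(abs(Y) - abs(prcomp(X)$x)))      # ~0: agrees with prcomp() up to column signs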


1. Example: heptathlon data

◮ Data on 25 competitors: seven events and total score

>heptathlon

hurdles highjump shot run200m longjump javelin run800m

Joyner-Kersee (USA) 12.69 1.86 15.80 22.56 7.27 45.66 128.51

John (GDR) 12.85 1.80 16.23 23.65 6.71 42.56 126.12

Behmer (GDR) 13.20 1.83 14.20 23.10 6.68 44.54 124.20

Sablovskaite (URS) 13.61 1.80 15.23 23.92 6.25 42.78 132.24

...........................................................................

To recode all events in the same direction (“large” is “good”) we transform the running events:

>heptathlon$hurdles<-max(heptathlon$hurdles)-heptathlon$hurdles

>heptathlon$run200m<-max(heptathlon$run200m)-heptathlon$run200m

>heptathlon$run800m<-max(heptathlon$run800m)-heptathlon$run800m


2. Example: heptathlon data

◮ Correlations

>hept<-heptathlon[,-8] #without total score

>round(cor(hept),2)

hurdles highjump shot run200m longjump javelin run800m

hurdles 1.00 0.81 0.65 0.77 0.91 0.01 0.78

highjump 0.81 1.00 0.44 0.49 0.78 0.00 0.59

shot 0.65 0.44 1.00 0.68 0.74 0.27 0.42

run200m 0.77 0.49 0.68 1.00 0.82 0.33 0.62

longjump 0.91 0.78 0.74 0.82 1.00 0.07 0.70

javelin 0.01 0.00 0.27 0.33 0.07 1.00 -0.02

run800m 0.78 0.59 0.42 0.62 0.70 -0.02 1.00

javelin is weakly correlated with other variables...


Scatterplot of the data

>plot(hept)

[Scatterplot matrix of the seven heptathlon variables: hurdles, highjump, shot, run200m, longjump, javelin, run800m]


1. Principal components in R

> hept_pca<-prcomp(hept,scale=TRUE)

> print(hept_pca)

Standard deviations:

[1] 2.1119364 1.0928497 0.7218131 0.6761411 0.4952441 0.2701029 0.2213617

Rotation:

PC1 PC2 PC3 PC4 PC5 PC6

hurdles -0.4528710 0.15792058 -0.04514996 0.02653873 -0.09494792 -0.78334101

highjump -0.3771992 0.24807386 -0.36777902 0.67999172 0.01879888 0.09939981

shot -0.3630725 -0.28940743 0.67618919 0.12431725 0.51165201 -0.05085983

run200m -0.4078950 -0.26038545 0.08359211 -0.36106580 -0.64983404 0.02495639

longjump -0.4562318 0.05587394 0.13931653 0.11129249 -0.18429810 0.59020972

javelin -0.0754090 -0.84169212 -0.47156016 0.12079924 0.13510669 -0.02724076

run800m -0.3749594 0.22448984 -0.39585671 -0.60341130 0.50432116 0.15555520

PC7

hurdles 0.38024707

highjump -0.43393114

shot -0.21762491

run200m -0.45338483

longjump 0.61206388


javelin 0.17294667

run800m -0.09830963

> summary(hept_pca)

Importance of components:

PC1 PC2 PC3 PC4 PC5 PC6 PC7

Standard deviation 2.112 1.093 0.7218 0.6761 0.4952 0.2701 0.221

Proportion of Variance 0.637 0.171 0.0744 0.0653 0.0350 0.0104 0.007

Cumulative Proportion 0.637 0.808 0.8822 0.9475 0.9826 0.9930 1.000

> a1<-hept_pca$rotation[,1] # linear combination for the 1st principal

# component

> a1

hurdles highjump shot run200m longjump javelin run800m

-0.4528710 -0.3771992 -0.3630725 -0.4078950 -0.4562318 -0.0754090 -0.3749594

>


2. Principal components in R

◮ First principal component:

> predict(hept_pca)[,1] # or just hept_pca$x[,1]

Joyner-Kersee (USA) John (GDR) Behmer (GDR) Sablovskaite (URS)

-4.121447626 -2.882185935 -2.649633766 -1.343351210

Choubenkova (URS) Schulz (GDR) Fleming (AUS) Greiner (USA)

-1.359025696 -1.043847471 -1.100385639 -0.923173639

Lajbnerova (CZE) Bouraga (URS) Wijnsma (HOL) Dimitrova (BUL)

-0.530250689 -0.759819024 -0.556268302 -1.186453832

Scheider (SWI) Braun (FRG) Ruotsalainen (FIN) Yuping (CHN)

0.015461226 0.003774223 0.090747709 -0.137225440

Hagger (GB) Brown (USA) Mulliner (GB) Hautenauve (BEL)

0.171128651 0.519252646 1.125481833 1.085697646

Kytola (FIN) Geremias (BRA) Hui-Ing (TAI) Jeong-Mi (KOR)

1.447055499 2.014029620 2.880298635 2.970118607

Launa (PNG)

6.270021972

◮ The first two components account for 81% of the variance.


>plot(hept_pca)

[Barplot (screeplot) of the variances of the principal components of hept_pca]

> cor(heptathlon$score,hept_pca$x[,1])

[1] -0.9910978


>plot(heptathlon$score, hept_pca$x[,1])

[Scatterplot of hept_pca$x[,1] against heptathlon$score]


17. Clustering: model–based clustering, K–means, K–medoids


1. Clustering problem

◮ Clustering ⇔ unsupervised learning:

Grouping or segmenting objects (observations) into subsets or

”clusters”, such that those within each cluster are more similar

to each other than they are to the members of other groups.

◮ Example – Iris data

Given the measurements in centimeters of the four variables

sepal length/width and petal length/width

for 50 Iris flowers from each of the three species, the goal is to group the
observations in accordance with the species. The species are Iris setosa,
versicolor, and virginica.


2. Clustering problem

◮ Data: {X1, . . . , Xn}, Xi ∈ Rp

◮ Dissimilarity (proximity) measure: if d(·, ·) is a distance,

D = {dij}i,j=1,...,n, dij = d(Xi, Xj),

e.g., di,j = ‖Xi −Xj‖2.

◮ Clustering algorithm maps each observation Xi to one of the K

groups,

C : {X1, . . . , Xn} → {1, . . . , K}.


Parametric approach

◮ Formulation: Let X1, . . . , Xn be independent vectors from K
populations G1, . . . , GK, and

Xi ∼ f(x, θk) when Xi is sampled from Gk.

◮ Observation labels: For i = 1, . . . , n let γi = k if Xi is sampled from
Gk. The vector γ = (γ1, . . . , γn) is unknown.

◮ Clusters

Ck = ⋃_{i: γi = k} {Xi},  k = 1, . . . , K.

The goal is to find Ck, k = 1, . . . , K.


1. Maximum likelihood clustering

◮ Likelihood function to be maximized w.r.t. γ and θ = (θ1, . . . , θK):

L(γ; θ) = ∏_{i: γi=1} f(Xi, θ1) · ∏_{i: γi=2} f(Xi, θ2) · · · ∏_{i: γi=K} f(Xi, θK).

◮ Specific case: f(x, θk) = Np(µk, Σk), k = 1, . . . , K,

ln L(γ; θ) = const − (1/2) ∑_{k=1}^{K} ∑_{i: Xi∈Ck} (Xi − µk)ᵀ Σk⁻¹ (Xi − µk) − (1/2) ∑_{k=1}^{K} nk ln |Σk|,

where nk = ∑_{i=1}^{n} I{Xi ∈ Ck}, k = 1, . . . , K.

◮ When γ (the partition) is fixed and ln L(γ; θ) is maximized w.r.t. θ,

µk(γ) = (1/nk) ∑_{i: Xi∈Ck} Xi,   Σk(γ) = (1/nk) ∑_{i: Xi∈Ck} [Xi − µk(γ)][Xi − µk(γ)]ᵀ.


2. Maximum likelihood clustering

◮ Substituting µk(γ) and Σk(γ) we obtain

ln L(γ; θ) = const − (1/2) ∑_{k=1}^{K} nk ln |Σk(γ)|.

Thus the optimization problem is to minimize

∏_{k=1}^{K} |Σk(γ)|^{nk}

over all partitions of the set of observations into K groups.

◮ Computationally infeasible problem


How to choose K?

In many cases we don’t know how big K should be.

◮ Use some distance measure (“within–groups variance” and
“between–groups variance”)

◮ Use a visual plot if possible (can plot points and distances)

◮ If we can define a cluster “purity measure” (as in CART), then one
can optimize over this and penalize for complexity.


K–means clustering algorithm

◮ The basic idea

1. Given a preliminary partition {C1^old, . . . , CK^old} of the observations,
compute the group means

mk = (1/#(Ck)) ∑_{i: Xi∈Ck} Xi,  k = 1, . . . , K.

2. Associate each observation with the cluster whose mean is closest,

Ck^new = { Xi : ‖Xi − mk‖ ≤ min_{j=1,...,K, j≠k} ‖Xi − mj‖ },  k = 1, . . . , K.

3. Go to 1.

◮ Solves the problem: ∑_{i=1}^{n} min_{k=1,...,K} ‖Xi − mk‖² → min over m1, . . . , mK.
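A minimal R sketch (not from the course code) of these two alternating steps, for a generic numeric data matrix X and a matrix M of starting means (one per row); both names are placeholders. R's built-in kmeans() is used in the examples that follow.

> # K-means (Lloyd) sketch: alternate step 2 (assign to the nearest mean) and
> # step 1 (recompute the group means) until the assignment stops changing.
> # Empty clusters are not handled.
> my_kmeans <- function(X, M, max.iter = 100) {
+   X <- as.matrix(X); M <- as.matrix(M)
+   cl <- rep(0L, nrow(X))
+   for (it in 1:max.iter) {
+     D2 <- sapply(1:nrow(M), function(k) colSums((t(X) - M[k, ])^2))  # squared distances
+     new.cl <- max.col(-D2, ties.method = "first")                    # nearest mean
+     if (all(new.cl == cl)) break
+     cl <- new.cl
+     M <- do.call(rbind, lapply(1:nrow(M), function(k) colMeans(X[cl == k, , drop = FALSE])))
+   }
+   list(cluster = cl, centers = M)
+ }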


K–means: a simple numerical example

◮ Data: {2, 4, 10, 12, 3, 20, 30, 11, 25}, K = 2.

1. Let m1 = 2, m2 = 4 ⇒ C1 = {2, 3}, C2 = {4, 10, 12, 20, 30, 11, 25}

2. New centers m1 = 2.5, m2 = 16 ⇒

C1 = {2, 3, 4}, C2 = {10, 12, 20, 30, 11, 25}.

m1      m2      C1                        C2
3       18      {2, 3, 4, 10}             {12, 20, 30, 11, 25}
4.75    19.6    {2, 3, 4, 10, 11, 12}     {20, 30, 25}
7       25      {2, 3, 4, 10, 11, 12}     {20, 30, 25}

and the algorithm stops.
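The same steps can be checked with R's kmeans() (a check added here, not part of the original example); with the Lloyd algorithm and the same starting centers it should reach the same partition.

> x <- matrix(c(2, 4, 10, 12, 3, 20, 30, 11, 25))        # data as a one-column matrix
> kmeans(x, centers = matrix(c(2, 4)), algorithm = "Lloyd")$centers
> # expected final means: 7 and 25, i.e. C1 = {2,3,4,10,11,12}, C2 = {20,30,25}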


1. K-means: simulated example

# Data generation

> library(MASS)          # provides mvrnorm, used below

> m1=c(0,1)

> m2=c(4.5,0)

> S1<-cbind(c(2, 0.5), c(0.5, 3))

> S2<-cbind(c(2, -1.5), c(-1.5, 3))

> X1<-mvrnorm(100, m1, S1)

> X2<-mvrnorm(100, m2, S2)

> Y=rbind(X1, X2)

> plot(Y)

> points(X1, col="red")

> points(X2, col="blue")

#

# K-means algorithm

#

>Y.km<-kmeans(Y, 2, iter.max=20)

>Y.km

K-means clustering with 2 clusters of sizes 97, 103

Cluster means:


[,1] [,2]

1 4.55303155 -0.0987873

2 0.08029854 1.1484915

Clustering vector:

[1] 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2

[38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

[75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1

[112] 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

[149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1

[186] 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1

Within cluster sum of squares by cluster:

[1] 389.4226 426.2129

Available components:

[1] "cluster" "centers" "withinss" "size"


2. K–means: simulated example

[Scatterplot of the simulated data: Y[,2] against Y[,1], showing the two clusters]


1. K-means: old swiss banknote data


2. K-means: old swiss banknote data

◮ Data: 200 observations on the variables

– X1 = length of the bill

– X2 = height of the bill (right)

– X3 = height of the bill (left)

– X4 = distance of the inner frame to the lower border

– X5 = distance of the inner frame to the upper border

– X6 = length of the diagonal of the central picture

◮ The first 100 bills are genuine, the other 100 are counterfeit.


3. K-means: old swiss banknote data

> bank2<- read.table(file="bank2.dat")

> bank2.km<-kmeans(bank2, 2, 20)

> bank2.km

K-means clustering with 2 clusters of sizes 100, 100

Cluster means:

V1 V2 V3 V4 V5 V6

1 214.823 130.300 130.193 10.530 11.133 139.450

2 214.969 129.943 129.720 8.305 10.168 141.517

Clustering vector:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2


81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Within cluster sum of squares by cluster:

[1] 225.2233 142.8852

Available components:

[1] "cluster" "centers" "withinss" "size"


K–medoids clustering algorithm

◮ Extends K–means to non–Euclidean dissimilarity measures d(·, ·).

◮ Algorithm

1. Given an initial partition into clusters C = {C1, . . . , CK}, find in each
cluster the observation that minimizes the sum of distances to the
other observations,

i*k = arg min_{i: Xi∈Ck} ∑_{j: Xj∈Ck} d(Xi, Xj)  ⇒  mk = X_{i*k},  k = 1, . . . , K.

2. Associate each observation with a cluster according to its distance to
the medoids {m1, . . . , mK} found in step 1. Proceed until convergence.

◮ Objective function

min_{C, {ik}_{k=1}^{K}}  ∑_{k=1}^{K} ∑_{j: Xj∈Ck} d(Xik, Xj).


K–medoids example: countries dissimilarities

◮ Data set: the average dissimilarity scores matrix between 12 countries

(Belgium, Brazil, Chile, Cuba, Egypt, France, India, Israel, USA,

USSR, Yugoslavia and Zaire).

> library(cluster)

> x <-read.table("countries.data")

> val<-c("BEL","BRA", "CHI","CUB", "EGY", "FRA", "IND", "ISR", "USA", "USS",

+ "YUG", "ZAI")

> rownames(x) <- val

> colnames(x) <- val

> x


BEL BRA CHI CUB EGY FRA IND ISR USA USS YUG ZAI

BEL 0.00 5.58 7.00 7.08 4.83 2.17 6.42 3.42 2.50 6.08 5.25 4.75

BRA 5.58 0.00 6.50 7.00 5.08 5.75 5.00 5.50 4.92 6.67 6.83 3.00

CHI 7.00 6.50 0.00 3.83 8.17 6.67 5.58 6.42 6.25 4.25 4.50 6.08

CUB 7.08 7.00 3.83 0.00 5.83 6.92 6.00 6.42 7.33 2.67 3.75 6.67

EGY 4.83 5.08 8.17 5.83 0.00 4.92 4.67 5.00 4.50 6.00 5.75 5.00

FRA 2.17 5.75 6.67 6.92 4.92 0.00 6.42 3.92 2.25 6.17 5.42 5.58

IND 6.42 5.00 5.58 6.00 4.67 6.42 0.00 6.17 6.33 6.17 6.08 4.83

ISR 3.42 5.50 6.42 6.42 5.00 3.92 6.17 0.00 2.75 6.92 5.83 6.17

USA 2.50 4.92 6.25 7.33 4.50 2.25 6.33 2.75 0.00 6.17 6.67 5.67

USS 6.08 6.67 4.25 2.67 6.00 6.17 6.17 6.92 6.17 0.00 3.67 6.50

YUG 5.25 6.83 4.50 3.75 5.75 5.42 6.08 5.83 6.67 3.67 0.00 6.92

ZAI 4.75 3.00 6.08 6.67 5.00 5.58 4.83 6.17 5.67 6.50 6.92 0.00

> x.pam2<-pam(x,2, diss=T)

> summary(x.pam2)

Medoids:

ID

[1,] "9" "USA"

[2,] "4" "CUB"

Clustering vector:


BEL BRA CHI CUB EGY FRA IND ISR USA USS YUG ZAI

1 1 2 2 1 1 2 1 1 2 2 1

Objective function:

build swap

3.291667 3.236667

Numerical information per cluster:

size max_diss av_diss diameter separation

[1,] 7 5.67 3.227143 6.17 4.67

[2,] 5 6.00 3.250000 6.17 4.67

# diameter - maximal dissimilarity between observations in the cluster

# separation - minimal dissimilarity between an observation of the cluster

# and an observation of another cluster.

Isolated clusters:

L-clusters: character(0) #

L*-clusters: character(0) # diameter < separation

# L-cluster: for each observation i the maximal dissimilarity between

# i and any other observation of the cluster is smaller than the minimal

# dissimilarity between i and any observation of another cluster.


Silhouette plot information:

cluster neighbor sil_width

USA 1 2 0.42519084

BEL 1 2 0.39129752

FRA 1 2 0.35152954

ISR 1 2 0.29785894

BRA 1 2 0.22317708

EGY 1 2 0.19652641

ZAI 1 2 0.18897849

CUB 2 1 0.39814815

USS 2 1 0.34104696

CHI 2 1 0.32512211

YUG 2 1 0.26177642

IND 2 1 -0.04466159

Average silhouette width per cluster:

[1] 0.2963655 0.2562864

Average silhouette width of total data set:

[1] 0.2796659

Available components:

[1] "medoids" "id.med" "clustering" "objective" "isolation"

[6] "clusinfo" "silinfo" "diss" "call"


1. Silhouette plot

> plot(x.pam2)

[Silhouette plot of pam(x = x, k = 2, diss = T): n = 12, average silhouette width 0.28; cluster 1: n = 7, average width 0.30; cluster 2: n = 5, average width 0.26]


2. Silhouette plot

◮ For each observation Xi from cluster Ck define

* the in–cluster average dissimilarity

a(i) = ∑_{j: Xj∈Ck} d(Xi, Xj) / #{j ≠ i : Xj ∈ Ck},  Xi ∈ Ck;

* the between–clusters average dissimilarity

d(i, Cm) = ∑_{j: Xj∈Cm} d(Xi, Xj) / #{j : Xj ∈ Cm},  m ≠ k, m ∈ {1, . . . , K},

b(i) = min_{m≠k, m=1,...,K} d(i, Cm).

◮ Silhouette width:

s(i) = (b(i) − a(i)) / max{a(i), b(i)},  −1 ≤ s(i) ≤ 1.

s(i) = 0 if Xi is the only observation in its cluster. The silhouette plot
shows s(i) in decreasing order for each cluster.
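A minimal sketch (not from the course code) of these formulas, for a dissimilarity matrix d and a clustering vector cl (both assumed given); pam() returns the same quantities in its silinfo component.

> # Silhouette widths s(i) computed directly from a(i) and b(i) above.
> sil_width <- function(d, cl) {
+   d <- as.matrix(d)
+   sapply(seq_along(cl), function(i) {
+     same <- cl == cl[i]
+     if (sum(same) == 1) return(0)                        # only observation in its cluster
+     a <- sum(d[i, same]) / (sum(same) - 1)               # in-cluster average
+     b <- min(tapply(d[i, !same], cl[!same], mean))       # closest other cluster
+     (b - a) / max(a, b)
+   })
+ }
> # e.g. sil_width(x, x.pam2$clustering) reproduces the sil_width column of
> # x.pam2$silinfo$widths (up to the ordering of the observations).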


3. Silhouette plot

◮ Interpretation of silhouette width

– large s(i) (almost 1) – observation is very well clustered;

– small s(i) (around 0) – observation lies between two clusters;

– negative s(i) – observation is badly clustered.

◮ Silhouette coefficient SC = (1/n) ∑_{i=1}^{n} s(i): choose K with maximal SC.

– SC ≥ 0.7: strong cluster structure

– 0.5 ≤ SC ≤ 0.7: reasonable structure

– 0.25 ≤ SC ≤ 0.5: weak structure

– SC ≤ 0.25: no structure


Countries dissimilarities: 3-medoids

> x.pam3<-pam(x,3, diss=T)

> summary(x.pam3)

Medoids:

ID

[1,] "9" "USA"

[2,] "12" "ZAI"

[3,] "4" "CUB"

Clustering vector:

BEL BRA CHI CUB EGY FRA IND ISR USA USS YUG ZAI

1 2 3 3 1 1 2 1 1 3 3 2

Objective function:

build swap

2.583333 2.506667

Numerical information per cluster:

size max_diss av_diss diameter separation

[1,] 5 4.50 2.4000 5.0 4.67

[2,] 3 4.83 2.6100 5.0 4.67

[3,] 4 3.83 2.5625 4.5 5.25


Isolated clusters:

L-clusters: character(0)

L*-clusters: [1] 3

Silhouette plot information:

cluster neighbor sil_width

USA 1 2 0.46808511

FRA 1 2 0.43971831

BEL 1 2 0.42149254

ISR 1 2 0.36561099

EGY 1 2 0.02118644

ZAI 2 1 0.27953625

BRA 2 1 0.25456578

IND 2 3 0.17498951

CUB 3 2 0.47890188

USS 3 1 0.43682195

YUG 3 1 0.31304749

CHI 3 2 0.30726872

Average silhouette width per cluster:

[1] 0.3432187 0.2363638 0.3840100

Average silhouette width of total data set:


[1] 0.3301021

Available components:

[1] "medoids" "id.med" "clustering" "objective" "isolation"

[6] "clusinfo" "silinfo" "diss" "call"

>

> plot(x.pam3)


[Silhouette plot of pam(x = x, k = 3, diss = T): n = 12, average silhouette width 0.33; cluster 1: n = 5, average width 0.34; cluster 2: n = 3, average width 0.24; cluster 3: n = 4, average width 0.38]


18. Clustering: hierarchical methods


Hierarchical methods

◮ Two main types of hierarchical clustering

– Agglomerative:

* Start with the points as individual clusters

* At each step, merge the closest pair of clusters until only one

cluster (or K clusters) left

– Divisive:

* Start with one, all–inclusive cluster

* At each step, split a cluster until each cluster contains a point

(or there are K clusters).

◮ Algorithms use the dissimilarity/distance matrix and
merge or split one cluster at a time.


Dissimilarity measures between clusters

Dissimilarity between clusters R and Q

◮ Average dissimilarity

D(R, Q) = (1/(NR NQ)) ∑_{Xi∈R, Xj∈Q} d(Xi, Xj).

Best when clusters are ball–shaped and fairly well–separated.

◮ Nearest neighbor (single linkage)

D(R, Q) = min_{Xi∈R, Xj∈Q} d(Xi, Xj).

Can lead to a “chaining” effect and elongated clusters.

◮ Furthest neighbor (complete linkage)

D(R, Q) = max_{Xi∈R, Xj∈Q} d(Xi, Xj).

Tends to produce small, compact clusters.
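For reference (a tooling note, not from the slides), these three rules correspond to the method argument of R's hclust(), applied to a dist object d (assumed given); agnes() from the cluster package, used later in these notes, has an analogous method argument.

> # Single, complete and average linkage on the same dissimilarities d:
> hc.single   <- hclust(d, method = "single")
> hc.complete <- hclust(d, method = "complete")
> hc.average  <- hclust(d, method = "average")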


Agglomerative algorithm

Assume that we have a measure of dissimilarity between clusters.

1. Construct the finest partition (one observation in each cluster)

repeat:

2. Compute the dissimilarity matrix between clusters.

3. Find the two clusters with the smallest dissimilarity, and merge them

into one cluster.

4. Compute the dissimilarity between the new groups

until: only one cluster remains.


1. Example: clustering using single linkage

◮ Initial dissimilarity matrix (5 objects)

1 2 3 4 5

1 0

2 9 0

3 3 7 0

4 6 5 9 0

5 11 10 2 8 0

⇒ because min_{i,j} dij = d53 = 2,

merge 3 and 5 to cluster (35)

◮ Nearest neighbor distances:

d(35)1 = min{d31, d51} = min{3, 11} = 3

d(35)2 = min{d32, d52} = min{7, 10} = 7

d(35)4 = min{d34, d54} = min{9, 8} = 8.


2. Example: clustering using single linkage

◮ Dissimilarity matrix after merger of 3 and 5

(35) 1 2 4

(35) 0

1 3 0

2 7 9 0

4 8 6 5 0

⇒ d(35)1 = 3; merge (35) and 1 to (135)

◮ Distances

d(135)2 = min{d(35)2, d12} = min{7, 9} = 7

d(135)4 = min{d(35)4, d14} = min{8, 6} = 6.


3. Example: clustering using single linkage

◮ Dissimilarity matrix after merger of (35) and 1

(135) 2 4

(135) 0

2 7 0

4 6 5 0

⇒ d42 = 5; merge 4 and 2 to (42)

d(135)(24) = min{d(135)2, d(135)4} = min{7, 6} = 6.

◮ The final dissimilarity matrix

(135) (24)

(135) 0

(24) 6 0

⇒ (135) and (24) are merged on the level 6.
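The same merges can be reproduced in R (a check added here, not part of the original example) with hclust() and single linkage:

> D <- matrix(c( 0,  9,  3,  6, 11,
+                9,  0,  7,  5, 10,
+                3,  7,  0,  9,  2,
+                6,  5,  9,  0,  8,
+               11, 10,  2,  8,  0), nrow = 5, byrow = TRUE)
> hc <- hclust(as.dist(D), method = "single")
> hc$height     # expected merge levels: 2 3 5 6
> plot(hc)      # dendrogram as on the next slide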


4. Example: resulting dendrogram

[Single–linkage dendrogram with leaves in the order 1, 3, 5, 2, 4: observations 3 and 5 merge at height 2, observation 1 joins them at height 3, observations 2 and 4 merge at height 5, and the two groups merge at height 6]


Agglomerative coefficient

◮ Measures how much “clustering structure” exists in the data.

◮ Ck(j) is the first cluster that Xj is merged with, j = 1, . . . , n; R and Q are the
two last clusters that are merged at the final step of the algorithm. For
each observation Xj define

α(j) = D(Xj, Ck(j)) / D(R, Q),  j = 1, . . . , n.

◮ Agglomerative coefficient: AC = (1/n) ∑_{j=1}^{n} [1 − α(j)]

– large AC (close to 1): observations are merged with clusters close

to them in the beginning as compared to the final merger;

– small AC: evenly distributed data, poor clustering structure.


1. Example: Swiss Provinces Data (1888)

◮ Data: 47 observations on 5 measures of socio–economic indicators on

Swiss provinces:

– Agriculture: % of males involved in agriculture as occupation

– Examination: % ”draftees” receiving highest mark on army

examination

– Education: % education beyond primary school for ”draftees”

– Catholic: % catholic (as opposed to “protestant”)

– Infant.Mortality: % live births who live less than 1 year


2. Example: Swiss Provinces Data (1888)

> library(cluster)

> swiss.x<-swiss[,-1]

> sagg<-agnes(swiss.x)

> pltree(sagg)

> print(sagg$ac)

[1] 0.8774795

◮ By default agnes uses average dissimilarity (complete and single

linkage can be chosen).

◮ There are two main groups with one point V. De Geneve

well–separated from them.

◮ Fairly large AC = 0.878 suggests good clustering structure in the

data.


[Dendrogram of agnes(x = swiss.x) with average linkage (height on the vertical axis): the provinces fall into two main groups, with V. De Geneve joining them only at a large height]


Divisive clustering algorithm

◮ Splitting cluster C into A and B

1. Initialization: A = C, B = ∅.

2. For each observation Xj compute

– the average dissimilarity a(j) to all other observations in A;

– the average dissimilarity d(j, B) to all observations in B; d(j, ∅) = 0.

3. Select the observation Xk such that S(k) = a(k) − d(k, B) is maximal.

4. If S(k) ≥ 0, move Xk to cluster B: B = B ∪ {Xk}, A = A \ {Xk},
and go to 2. If S(k) < 0, or A contains only one observation, stop.

◮ Apply the same procedure to clusters A and B (at each step, split the
cluster with maximal diameter).
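A minimal sketch (not from the course code) of a single split, following steps 1–4 for a given dissimilarity matrix d; diana() from the cluster package, used below, is the full algorithm.

> # One divisive split (steps 1-4 above) of the observations 1:n, given a
> # dissimilarity matrix d; ties in step 3 are broken by the first maximum.
> split_once <- function(d) {
+   d <- as.matrix(d)
+   A <- seq_len(nrow(d)); B <- integer(0)
+   repeat {
+     if (length(A) == 1) break                                     # nothing left to move
+     a  <- sapply(A, function(j) mean(d[j, setdiff(A, j)]))        # step 2: within-A average
+     dB <- sapply(A, function(j) if (length(B)) mean(d[j, B]) else 0)
+     S  <- a - dB
+     k  <- which.max(S)                                            # step 3
+     if (S[k] < 0) break                                           # step 4: stop
+     B <- c(B, A[k]); A <- A[-k]                                   # step 4: move X_k to B
+   }
+   list(A = A, B = B)
+ }
> # On the 5x5 dissimilarity matrix of the next example this gives the split
> # {1,3,5} vs {2,4} (the A/B labels may be swapped, depending on tie-breaking).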


1. Example: divisive clustering algorithm

◮ Dissimilarity matrix (5 objects)

1 2 3 4 5

1 0

2 9 0

3 3 7 0

4 6 5 9 0

5 11 10 2 8 0

◮ Initial clusters:

A = {1, 2, 3, 4, 5}, B = ∅, diam(A) = d51 = 11.


2. Example: divisive clustering algorithm

◮ Step 1: average dissimilarities of the observations

a(1) = (9 + 3 + 6 + 11)/4 = 7.25,   a(2) = (9 + 7 + 5 + 10)/4 = 7.75,
a(3) = (3 + 7 + 9 + 2)/4 = 5.25,    a(4) = (6 + 5 + 9 + 8)/4 = 7,
a(5) = (11 + 10 + 2 + 8)/4 = 7.75
⇒ A = {1, 2, 3, 4}, B = {5}, diam(A) = 9.

◮ Step 2: computing the average dissimilarities and S(·)

a(1) = (9 + 3 + 6)/3 = 6,       S(1) = a(1) − d15 = 6 − 11 = −5
a(2) = (9 + 7 + 5)/3 = 7,       S(2) = a(2) − d25 = 7 − 10 = −3
a(3) = (3 + 7 + 9)/3 = 6.333,   S(3) = a(3) − d35 = 6.333 − 2 = 4.333
a(4) = (6 + 5 + 9)/3 = 6.667,   S(4) = a(4) − d45 = 6.667 − 8 = −1.333
⇒ A = {1, 2, 4}, B = {5, 3}, diam(A) = 9, diam(B) = 2.


3. Example: divisive clustering algorithm

◮ Step 3:

a(1) = (9 + 6)/2 = 7.5,   d(1, B) = (3 + 11)/2 = 7,    S(1) = 0.5
a(2) = (9 + 5)/2 = 7,     d(2, B) = (7 + 10)/2 = 8.5,  S(2) = −1.5
a(4) = (6 + 5)/2 = 5.5,   d(4, B) = (9 + 8)/2 = 8.5,   S(4) = −3
⇒ A = {2, 4}, B = {1, 3, 5}, diam(A) = 5, diam(B) = 11.

◮ Step 4:

a(2) = 5,   d(2, B) = (9 + 7 + 10)/3 = 8.667,   S(2) = −3.667
a(4) = 5,   d(4, B) = (6 + 9 + 8)/3 = 7.667,    S(4) = −2.667.

Since S(2) and S(4) are negative, the procedure stops after step 4 with
two clusters A = {2, 4} and B = {1, 3, 5}. Next we start to divide
cluster B (it has the larger diameter).


Divisive coefficient

◮ Cluster diameter

diam(C) = max_{Xi∈C, Xj∈C} d(Xi, Xj).

◮ For observation Xj, let δ(j) be the diameter of the last cluster to
which Xj belongs (before it is split off as a single observation),
divided by the diameter of the whole dataset.

◮ Divisive coefficient: DC = (1/n) ∑_{j=1}^{n} [1 − δ(j)]

– large DC (close to 1): on average observations are in small

compact clusters (relative to the size of the whole dataset) before

being split off; evidence of a good clustering structure.


Example: Swiss Provinces Data (1888)

> library(cluster)

> swiss.x<-swiss[,-1]

> sdiv<-diana(swiss.x)

> pltree(sdiv)

> print(sdiv$dc)

[1] 0.903375

◮ diana uses average dissimilarity and Euclidean distance.

◮ There are three well–separated clusters.

◮ Fairly large DC = 0.903 suggests good clustering structure in the

data.


[Dendrogram of diana(x = swiss.x) (height on the vertical axis): the provinces fall into three well–separated clusters]


Hierarchical procedures – comments

◮ The procedures may be sensitive to outliers (“noise points”).

◮ Try different methods, and within a given method, different ways of
assigning distances (complete, average, single linkage). Roughly
consistent outcomes indicate validity of the clustering structure.

◮ Stability of the hierarchical solution can be checked by applying the
algorithm to a slightly perturbed data set.


Pottery data

◮ Data: chemical composition of 45 specimens of Romano–British

pottery for 9 oxides. A kiln site at which the pottery is found is also

known (5 different sites).

◮ Question: whether the pots can be divided into distinct groups and

how these groups relate to the kiln site?

> pottery

Al2O3 Fe2O3 MgO CaO Na2O K2O TiO2 MnO BaO

1 18.8 9.52 2.00 0.79 0.40 3.20 1.01 0.077 0.015

2 16.9 7.33 1.65 0.84 0.40 3.05 0.99 0.067 0.018

...................................................

...................................................

44 14.8 2.74 0.67 0.03 0.05 2.15 1.34 0.003 0.015

45 19.1 1.64 0.60 0.10 0.03 1.75 1.04 0.007 0.018


K-means algorithm

>wss<-rep(0,10)

>wss[1]<-44*sum(apply(pottery,2,var))

>for (i in 2:10) wss[i]<-mean(kmeans(pottery,i)$withinss)

>plot(2:10,wss[2:10],type="b",xlab="Number of groups",ylab="Mean-within-group SS")

>kmeans(pottery,5)$clust

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26

5 5 5 5 5 5 5 5 3 3 3 3 3 5 5 3 5 5 5 5 5 1 1 1 2 1

27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45

2 2 2 2 1 1 1 2 2 4 4 4 4 4 4 4 4 4 4

> kmeans(pottery,4)$clust

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26

4 4 4 4 4 4 4 4 2 2 2 2 2 4 4 2 4 4 4 4 4 2 2 2 1 1

27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45

1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3


[Plot of the mean within–group sum of squares against the number of groups (2 to 10)]


K-medoids

> sil.w<-rep(0,10)

> for (i in 2:10) sil.w[i]<-pam(pottery,i,diss=F)$silinfo$avg.width

> sil.w[2:10]

[1] 0.5018253 0.6031487 0.5038343 0.4991434 0.5061460 0.4781251 0.4644091

[8] 0.4537507 0.4391603

> p3.med<-pam(pottery,3,diss=F)

> p3.med

Medoids:

ID Al2O3 Fe2O3 MgO CaO Na2O K2O TiO2 MnO BaO

2 2 16.9 7.33 1.65 0.84 0.40 3.05 0.99 0.067 0.018

32 32 12.4 6.13 5.69 0.22 0.54 4.65 0.70 0.159 0.015

38 38 18.0 1.50 0.67 0.01 0.06 2.11 0.92 0.001 0.016

Clustering vector:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2

27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45

2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3

> plot(p3.med)


[Left: clusplot of pam(x = pottery, k = 3, diss = F); the first two components explain 74.75% of the point variability. Right: silhouette plot, n = 45, average silhouette width 0.6; cluster 1: n = 21, width 0.62; cluster 2: n = 14, width 0.53; cluster 3: n = 10, width 0.67]


Agglomerative algorithm

> pot.ag<-agnes(pottery,diss=F)

> pot.ag

Call: agnes(x = pottery, diss = F)

Agglomerative coefficient: 0.8878186

Order of objects:

[1] 1 2 4 14 15 18 7 9 16 3 20 5 21 8 6 19 17 10 12 13 11 22 24 23 26

[26] 32 33 31 25 29 27 30 34 35 28 36 42 41 38 39 45 43 40 37 44

Height (summary):

Min. 1st Qu. Median Mean 3rd Qu. Max.

0.1273 0.5273 1.0020 1.4720 1.8430 7.4430

Available components:

[1] "order" "height" "ac" "merge" "diss" "call"

[7] "method" "order.lab" "data"

> pltree(pot.ag)


[Dendrogram of agnes(x = pottery, diss = F, method = "average")]


Divisive algorithm

> pot.dv<-diana(pottery,diss=F)

> pot.dv

......................................................................

Height:

[1] 3.5098771 0.1273460 0.8565378 0.3898923 0.5561591 1.3314567

[7] 2.6377172 0.4109903 0.6023363 0.2934962 0.4666101 1.2262288

[13] 0.2095328 0.5655882 6.3982699 0.7580923 1.8114795 0.7463518

[19] 0.3744663 3.1056614 10.4430689 0.5786156 1.1191769 3.8495285

[25] 1.0225287 1.9532691 5.4350004 0.6852452 1.3666854 2.9777214

[31] 1.8154022 1.1270909 0.3005478 3.5697204 11.7033446 0.3178836

[37] 0.6550580 0.8893852 0.4395009 1.5469845 2.5117392 4.2065322

[43] 6.1297881 1.0822223

Divisive coefficient:

[1] 0.9161985

> pltree(pot.dv)


[Dendrogram of diana(x = pottery, diss = F)]


19. Multidimensional scaling


MDS: problem and examples

◮ Problem: based on the dissimilarity (proximity) matrix between
objects, find a spatial representation of these objects in a
low–dimensional space.

◮ MDS deals with “fitting” the data in a low–dimensional space with
minimal distortion to the distances between the original points.

◮ Examples:

– Given a matrix of inter–city distances in USA, produce the map.

– Find a geometric representation for dissimilarities between cars.

– Given data on enrollment and graduation in 25 US universities,

produce a two–dimensional representation of the universities.


Metric multidimensional scaling

◮ Problem formulation: let X be an n × p data matrix, where Xi = (xi1, . . . , xip) is
the ith row (observation). We are given the matrix of Euclidean
distances between observations,

D = {dij}, i, j = 1, . . . , n,   d²ij = ∑_{k=1}^{p} (xik − xjk)² = ‖Xi − Xj‖².

Based on D, we want to reconstruct the data matrix X.

◮ No unique solution exists: the distances do not change if the data

points are rotated or reflected. Usually the restriction that the mean

vector of the configuration is zero is added.


1. How to reconstruct X from D?

◮ Inner product matrix: define the n × n matrix B = XXᵀ,

bij = ∑_{k=1}^{p} xik xjk,  i, j = 1, . . . , n.

Matrix B is related to matrix D:

d²ij = ∑_{k=1}^{p} (xik − xjk)² = ∑_{k=1}^{p} x²ik + ∑_{k=1}^{p} x²jk − 2 ∑_{k=1}^{p} xik xjk
     = bii + bjj − 2bij.   (1)

◮ The idea is to reconstruct matrix B first and then to factorize it.

◮ Centering: assume that all variables are centered, i.e.

X̄·k = (1/n) ∑_{i=1}^{n} xik = 0,  ∀ k = 1, . . . , p.

This implies that ∑_{i=1}^{n} bij = 0, ∀ j.


2. How to reconstruct X from D?

◮ Summing up (1) over i, over j, and over both i and j, we have

∑_{i=1}^{n} d²ij = tr(B) + n bjj,  ∀ j,

∑_{j=1}^{n} d²ij = tr(B) + n bii,  ∀ i,

∑_{i=1}^{n} ∑_{j=1}^{n} d²ij = 2n tr(B).

Solving this system for bii, bjj and substituting into (1) we get

bij = −(1/2) ( d²ij − (1/n) ∑_{j=1}^{n} d²ij − (1/n) ∑_{i=1}^{n} d²ij + (1/n²) ∑_{i=1}^{n} ∑_{j=1}^{n} d²ij ).   (2)


3. How to reconstruct X from D?

◮ The last step is the factorization of matrix B: the SVD of B is

B = V Λ Vᵀ,  Λ = diag{λ1, . . . , λn},  V orthogonal, V Vᵀ = I;

here, for definiteness, λ1 ≥ λ2 ≥ · · · ≥ λn. Then

X = V Λ^{1/2},  Λ^{1/2} = diag(√λ1, . . . , √λn).

◮ If the n × p matrix X is of full rank, i.e., rank(X) = p, then n − p
eigenvalues of B will be zero, i.e., Λ = diag{λ1, . . . , λp, 0, . . . , 0}.

◮ The best q–dimensional representation, q ≤ p, retains the first q
eigenvalues. The adequacy is judged by Sq = ∑_{i=1}^{q} λi / ∑_{i=1}^{p} λi.


MDS algorithm summary

1. For a given matrix of distances D, compute matrix B using (2).

2. Perform the singular value decomposition of B, B = V Λ Vᵀ; for
definiteness, let λ1 ≥ λ2 ≥ · · · ≥ λn.

3. Retain the q largest eigenvalues, q ≤ p, and set Λ1 = diag{λ1, . . . , λq, 0, . . . , 0}.

4. The new q–dimensional data matrix representation is Y = V Λ1^{1/2}.
The rows of matrix Y are called the principal coordinates of X in
q dimensions.

5. Judge the adequacy of the representation using the index Sq.
A reasonable fit corresponds to Sq ∼ 0.8.
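A minimal sketch (not from the course code) of steps 1–5 on a toy configuration, checked against R's cmdscale(); the toy data are an assumption for illustration.

> # Classical (metric) MDS "by hand": double-center the squared distances to get
> # B (equation (2) in matrix form), factorize it, and keep q = 2 coordinates.
> set.seed(2)
> X <- matrix(rnorm(10 * 3), 10, 3)                       # toy data, n = 10, p = 3
> D <- as.matrix(dist(X))
> J <- diag(10) - matrix(1/10, 10, 10)                    # centering matrix
> B <- -0.5 * J %*% D^2 %*% J
> e <- eigen(B)
> Y <- e$vectors[, 1:2] %*% diag(sqrt(e$values[1:2]))     # principal coordinates
> sum(e$values[1:2]) / sum(abs(e$values))                 # adequacy index S_q
> max(abs(abs(Y) - abs(cmdscale(D, k = 2))))              # ~0: matches cmdscale up to signs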


MDS algorithm – comments

◮ Although D is the matrix of Euclidean distances, the algorithm can
be applied to other distances. In this case B in (2) is symmetric but
not necessarily non–negative definite. Then assess the adequacy of the
solution using

∑_{i=1}^{q} |λi| / ∑_{i=1}^{p} |λi|,   or   ∑_{i=1}^{q} λi² / ∑_{i=1}^{p} λi².

◮ Other criteria:

– trace: choose the number of coordinates so that the sum of

positive eigenvalues is approximately the sum of all eigenvalues

– magnitude: accept only eigenvalues which substantially exceed the

largest negative eigenvalue.


Duality of principal coordinates and PCA

◮ PCA is based on the singular value decomposition of the sample
covariance matrix

Σ̂ = (1/(n−1)) ∑_{i=1}^{n} (Xi − µ̂)(Xi − µ̂)ᵀ,   µ̂ = (1/n) ∑_{i=1}^{n} Xi.

◮ If the data matrix X is centered, i.e., µ̂ = 0, then Σ̂ = XᵀX/(n − 1),
and if D is the matrix of Euclidean distances, then B given by (2)
coincides with XXᵀ. Since XXᵀ and XᵀX have the same nonzero
eigenvalues (with eigenvectors linked through the SVD of X), in this
specific case PCA and principal coordinates are equivalent: the
principal coordinates are the principal component scores.
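A quick numerical illustration of this equivalence (added here, not from the slides), on a toy matrix; the coordinates agree up to the signs of the columns.

> set.seed(3)
> X <- matrix(rnorm(20 * 4), 20, 4)                       # toy data
> pc  <- prcomp(X, center = TRUE, scale. = FALSE)$x[, 1:2]
> mds <- cmdscale(dist(X), k = 2)
> max(abs(abs(pc) - abs(mds)))                            # ~0: identical up to signs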


Some remarks

◮ Distance and similarity: MDS is applied to a matrix D = {dij} of
distances. A similarity matrix is a symmetric matrix C = {cij}
satisfying cij ≤ cii, ∀ i, j. One can transform a similarity matrix into a
distance matrix by setting dij = (cii − 2cij + cjj)^{1/2}, ∀ i, j
(a one–line R sketch is given after this slide).

◮ Given a distance matrix D, the object of MDS is to find a data
matrix X̂ in Rq with interpoint distances d̂ij “close” to dij, ∀ i, j.

◮ Among all projections of X onto q–dimensional subspaces of Rp, the
principal coordinates in Rq minimize the expression

φ = ∑_{i=1}^{n} ∑_{j=1}^{n} [d²ij − d̂²ij].
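A one–line sketch of the similarity–to–distance transformation above (not from the slides), for a similarity matrix C given as a numeric matrix:

> # d_ij = (c_ii - 2 c_ij + c_jj)^(1/2), built from the diagonal of C
> Dmat <- sqrt(outer(diag(C), diag(C), "+") - 2 * C)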


1. Example: air distance between US cities

◮ Air–distance data: (1) Atlanta, (2) Boston, (3) Cincinnati, (4) Columbus, (5) Dallas, (6) Indianapolis, (7) Little Rock, (8) Los Angeles, (9) Memphis, (10) St. Louis, (11) Spokane, (12) Tampa

> D

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]

[1,] 0 1068 461 549 805 508 505 2197 366 558 2467 467

[2,] 1068 0 867 769 1819 941 1494 3052 1355 1178 2747 1379

[3,] 461 867 0 107 943 108 618 2186 502 338 2067 928

[4,] 549 769 107 0 1050 172 725 2245 586 409 2131 985

[5,] 805 1819 943 1050 0 882 325 1403 464 645 1891 1077

[6,] 508 941 108 172 882 0 562 2080 436 234 1959 975

[7,] 505 1494 618 725 325 562 0 1701 137 353 1988 912

[8,] 2197 3052 2186 2245 1403 2080 1701 0 1831 1848 1227 2480

[9,] 366 1355 502 586 464 436 137 1831 0 294 2042 779

[10,] 558 1178 338 409 645 234 353 1848 294 0 1820 1016

[11,] 2467 2747 2067 2131 1891 1959 1988 1227 2042 1820 0 2821

[12,] 467 1379 928 985 1077 975 912 2480 779 1016 2821 0


2. Example: air distance between US cities

> names<-c("Atlanta", "Boston",

+ "Cincinnati","Columbus","Dallas","Indianapolis","Little Rock","Los Angeles",

+ "Memphis","St. Louis","Spokane","Tampa")

>

> air.d<-cmdscale(D,k=2,eig=TRUE)

> air.d$eig

[1] 8234381 2450757

> x<-air.d$points[,1]

> y<-air.d$points[,2]

> plot(x,y,xlab="Coordinate 1",ylab="Coordinate 2", xlim=range(x)*1.2, type="n")

> text(x,y,labels=names)


[Plot of the two–dimensional MDS configuration of the 12 cities: Coordinate 2 against Coordinate 1, with city names as labels]


1. Example: data on US universities

◮ Data: 6 variables on 25 US universities

� X1 = average SAT score for new freshmen;

� X2 = percentage of new freshmen in top 10% of high school class;

� X3 = percentage of applicants accepted;

� X4 = student–faculty ratio;

� X5 = estimated annual expenses;

� X6 = graduation rate (%).


2. Example: data on US universities

> X<-read.table("US-univ.txt")

> names<-as.character(X[,1])

> names

[1] "Harvard" "Princeton" "Yale" "Stanford"

[5] "MIT" "Duke" "Cal_Tech" "Dartmouth"

[9] "Brown" "Johns_Hopkins" "U_Chicago" "U_Penn"

[13] "Cornell" "Northwestern" "Columbia" "NotreDame"

[17] "U_Virginia" "Georgetown" "Carnegie_Mellon" "U_Michigan"

[21] "UC_Berkeley" "U_Wisconsin" "Penn_State" "Purdue"

[25] "Texas_A&M"

> X<-X[,-1]

> D<-dist(X,method="euclidean") # matrix of distances

> univ<-cmdscale(D,k=6,eig=TRUE)

> univ$eig

[1] 2.080000e+04 3.042786e+03 1.287520e+03 5.430015e+02 1.180444e+02

[6] 9.986056e-01


3. Example: data on US universities

> sum(univ$eig[1:2])/sum(univ$eig[1:6])

[1] 0.924413

> x<-univ$points[,1]

> y<-univ$points[,2]

> plot(x,y,xlab="Coordinate 1", ylab="Coordinate 2", xlim=range(x)*1.2,type="n")

> text(x,y,names)


[Plot of the two–dimensional MDS configuration of the 25 universities: Coordinate 2 against Coordinate 1, with university names as labels]


Non–metric multidimensional scaling

◮ Another look at the MDS problem: we are given a dissimilarity matrix
D = {dij} for points X in Rp. We want to find points X̂ in Rq such
that the corresponding distances D̂ = {d̂ij} match the matrix D as
closely as possible.

◮ In general, an exact match is not possible, so we require
monotonicity: if

d_{i1 j1} < d_{i2 j2} < · · · < d_{im jm},  m = n(n − 1)/2,

then it should hold that

d̂_{i1 j1} < d̂_{i2 j2} < · · · < d̂_{im jm}.


1. Shepard–Kruskal algorithm

(a) Given the dissimilarity matrix D, order the off–diagonal elements: for il < jl,

d_{i1 j1} ≤ · · · ≤ d_{im jm},  il < jl, l = 1, 2, . . . , m = n(n − 1)/2.

Say that numbers d*ij are monotonically related to dij (d* mon∼ d) if

dij < dkl ⇒ d*ij < d*kl,  ∀ i < j, k < l.

(b) Let X̂ (n × q) be a configuration in Rq with interpoint distances d̂ij.
Define

Stress(X̂) = min_{d*: d* mon∼ d} [ ∑_{i<j} (d*ij − d̂ij)² / ∑_{i<j} d̂²ij ].

The rank order of d* coincides with the rank order of d. Stress(X̂) is zero
if the rank order of d̂ coincides with the rank order of d.


2. Shepard–Kruskal algorithm

(c) For each dimension q, the configuration with the smallest stress is
called the best fitting configuration in q dimensions. Let

Sq = min_{X̂ (n×q)} Stress(X̂).

(d) Calculate S1, S2, . . . until the value becomes low. Rule of thumb:

Sq ≥ 20%: poor; Sq = 10%: fair; Sq ≤ 5%: good; Sq = 0%: perfect.

◮ Remark: the solution is obtained by a numerical procedure.


The Shepard–Kruskal algorithm: computation

◮ Computation

1. find a random configuration of points (e.g., sampling from a

normal distribution);

2. calculate the distances $\hat{d}_{ij}$ between the points;

3. find the optimal monotone transformation $d^*_{ij}$ of the dissimilarities $d_{ij}$;

4. minimize the stress between the optimally scaled data by finding a
new configuration of points; if the stress is small enough, stop;
otherwise go to step 2 (a short R sketch is given below).
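
◮ A minimal sketch (not from the slides), assuming the MASS package and the built-in USArrests data: isoMDS() implements the Shepard–Kruskal iteration described above.

> library(MASS)                      # isoMDS lives in MASS
> D <- dist(scale(USArrests))        # example dissimilarity matrix
> fit <- isoMDS(D, k=2)              # best fitting configuration in 2 dimensions
> fit$stress                         # Kruskal stress, in percent
> plot(fit$points, type="n", xlab="Coordinate 1", ylab="Coordinate 2")
> text(fit$points, labels=rownames(USArrests))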


Remarks

◮ The procedure uses the steepest descent method. There is no way to
distinguish between local and global minima.

◮ The Shepard–Kruskal solution is non-metric since it uses only the
rank orders.

◮ The method is invariant under rotation, translation, and uniform
expansion or contraction of the best fitting configuration.

◮ The method works for distances and dissimilarities (similarities).


1. Example: New Jersey voting data

◮ The data is a matrix containing the number of times 15 New Jersey
congressmen voted differently in the House of Representatives on
19 environmental bills (abstentions are not recorded).

> voting

Hunt(R) Sandman(R) Howard(D) Thompson(D) Freylinghuysen(R)

Hunt(R) 0 8 15 15 10 ...

Sandman(R) 8 0 17 12 13 ...

Howard(D) 15 17 0 9 16 ...

Thompson(D) 15 12 9 0 14 ...

Freylinghuysen(R) 10 13 16 14 0 ...

Forsythe(R) 9 13 12 12 8 ...

Widnall(R) 7 12 15 13 9 ...

Roe(D) 15 16 5 10 13 ...

Heltoski(D) 16 17 5 8 14 ...

Rodino(D) 14 15 6 8 12 ...

Minish(D) 15 16 5 8 12 ...

Rinaldo(R) 16 17 4 6 12 ...

Maraziti(R) 7 13 11 15 10 ...

Daniels(D) 11 12 10 10 11 ...

Patten(D) 13 16 7 7 11 ...


2. Example: New Jersey voting data

> library(MASS)   # isoMDS is provided by the MASS package
> voting.stress<-rep(0,6)

> for (i in 1:6) voting.stress[i]<-isoMDS(voting, k=i)$stress

> voting.stress

[1] 21.1967696 9.8790470 5.4522891 3.6672495 1.4064205 0.8899916

> voting.MDS<-isoMDS(voting,k=2)

initial value 15.268246

iter 5 value 10.264075

final value 9.879047

converged

> x<-voting.MDS$points[,1]

> y<-voting.MDS$points[,2]

> plot(x,y,xlab="Coordinate 1",ylab="Coordinate 2", xlim=range(x)*1.2, type="n")

> text(x,y,colnames(voting))


[Figure: two-dimensional non-metric MDS configuration of the 15 congressmen (Coordinate 1 vs. Coordinate 2), each point labeled by name and party.]


Example: voting data, metric MDS

> voting.metrMDS <- cmdscale(voting, k=6, eig=TRUE)

> voting.metrMDS$eig

[1] 497.76083 146.17622 102.91314 76.87756 55.11540 24.74374

> sum(voting.metrMDS$eig[1:2])/sum(voting.metrMDS$eig)

[1] 0.7126454

> x<-voting.metrMDS$points[,1]

> y<-voting.metrMDS$points[,2]

> plot(x,y,xlab="Coordinate 1",ylab="Coordinate 2", xlim=range(x)*1.2, type="n")

> text(x,y,colnames(voting))


[Figure: two-dimensional metric MDS (cmdscale) configuration of the 15 congressmen (Coordinate 1 vs. Coordinate 2), each point labeled by name and party.]


20. Neural networks


1. Neuron model

◮ Neurological origins: each elementary nerve cell (neuron) is connected

to many others, can be activated by inputs from elsewhere, and can

stimulate other neurons.

◮ Neuron model (perceptron):
$$y = \varphi\Big(\sum_{j=1}^{p} w_j x_j + w_0\Big), \qquad v = \sum_{j=1}^{p} w_j x_j + w_0,$$

– $x = (x_1, \ldots, x_p)$ are the inputs

– $y$ is the output

– $(w_1, \ldots, w_p)$ are the connection weights, $w_0$ is a bias term

– $\varphi$ is a (non-linear) function, called the activation function


2. Neuron model

[Diagram: a single neuron – the inputs $x_1, \ldots, x_p$ are multiplied by the weights $w_1, \ldots, w_p$, summed ($\Sigma$) together with the bias to give $v$, and passed through the activation function $\varphi(\cdot)$ to produce the output $y$.]


Activation function

◮ Monotone increasing, bounded function

◮ Examples (an R sketch follows the list):

* hard limiter: produces $\pm 1$ output,
$$\varphi_{HL}(v) = \mathrm{sgn}(v);$$

* sigmoidal (logistic): produces output in $(0, 1)$,
$$\varphi_{S}(v) = \frac{1}{1 + e^{-a v}};$$

* hyperbolic tangent: produces output between $-1$ and $1$,
$$\varphi_{HT}(v) = \tanh(v) = \frac{1 - e^{-2v}}{1 + e^{-2v}};$$

* linear: $\varphi_{L}(v) = a v$.
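
◮ A minimal sketch (not from the slides) of these activation functions in R, with $a = 1$ chosen purely for illustration:

> phi.HL <- function(v) sign(v)                  # hard limiter
> phi.S  <- function(v, a=1) 1/(1 + exp(-a*v))   # sigmoidal (logistic)
> phi.HT <- function(v) tanh(v)                  # hyperbolic tangent
> phi.L  <- function(v, a=1) a*v                 # linear
> v <- seq(-4, 4, length=200)
> matplot(v, cbind(phi.HL(v), phi.S(v), phi.HT(v)), type="l", lty=1,
+         ylab="activation", main="Activation functions")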


Single–unit perceptron

◮ Variables:

* x is a p–vector of features

* z is a prediction target (binary)

* y is the perceptron output used to predict z

$$y = \mathrm{sgn}\Big(\sum_{j=1}^{p} w_j x_j + w_0\Big) = \varphi_{HL}(w^T x) = \begin{cases} +1, & w^T x \ge 0, \\ -1, & w^T x < 0. \end{cases}$$

◮ Training data: $D = \{(x^{(t)}, z^{(t)}),\; t = 1, \ldots, n\}$ – $n$ "examples";

$x^{(t)} = (x^{(t)}_1, \ldots, x^{(t)}_p)$ is the $t$-th observation of the feature vector,

$z^{(t)}$ is the class indicator ($\pm 1$) – a binary variable.


Perceptron learning rule

1. Initialization: set $w = w^{(0)}$ – starting point.

2. At every step $t = 1, 2, \ldots$

– select an "example" $x^{(t)}$ from the training set

– compute the perceptron output with the current weights $w^{(t-1)}$:
$$y^{(t)} = \varphi_{HL}\big([w^{(t-1)}]^T x^{(t)}\big)$$

– update the weights:
$$w^{(t)} = w^{(t-1)} + \eta\,\big(z^{(t)} - y^{(t)}\big)\, x^{(t)}.$$

3. Cycle through all the observations in the training set.

◮ $\eta > 0$ is the learning rate parameter.

◮ $z^{(t)} - y^{(t)}$ is the error on the $t$-th example; if it is zero, the weights are
not updated. A usual stepsize is $\eta \approx 0.1$. (An R sketch of this rule is given below.)
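
◮ A minimal sketch (not from the slides) of the perceptron learning rule in R on simulated linearly separable data; the variable names are illustrative only:

> set.seed(1)
> n <- 100; p <- 2
> X <- matrix(rnorm(n*p), n, p)
> z <- ifelse(X[,1] + X[,2] > 0, 1, -1)             # linearly separable labels
> w <- rep(0, p + 1)                                # (w0, w1, ..., wp)
> eta <- 0.1
> for (epoch in 1:20) {
+   for (t in 1:n) {
+     xt <- c(1, X[t, ])                            # prepend 1 for the bias term
+     yt <- ifelse(sum(w * xt) >= 0, 1, -1)         # hard-limiter output
+     w  <- w + eta * (z[t] - yt) * xt              # update only when an error is made
+   }
+ }
> mean(ifelse(X %*% w[-1] + w[1] >= 0, 1, -1) == z) # training accuracy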


Perceptron convergence theorem

◮ Theorem

For any data set that is linearly separable, the learning rule is
guaranteed to find a separating hyperplane in a finite number
of steps.


Criterion cost function for the regression problem

◮ Data:
$$D = \big\{x^{(t)}, z^{(t)},\; t = 1, \ldots, n\big\},$$
$z^{(t)}$ is a continuous/discrete variable.

◮ Squared error cost function:
$$E(w) = \frac{1}{2} \sum_{t=1}^{n} \big[z^{(t)} - \varphi(w^T x^{(t)})\big]^2 = \frac{1}{2} \sum_{t=1}^{n} e_t^2(w), \qquad e_t(w) := z^{(t)} - \varphi\big(w^T x^{(t)}\big).$$

◮ Assumption: the activation function $\varphi$ is differentiable.


Gradient descent learning rule (batch version)

◮ Initialization: starting vector of weights $w^{(0)}$.

◮ For $t = 0, 1, 2, \ldots$ compute $w \leftarrow w - \eta_t \nabla_w E(w)$, i.e.
$$w^{(t+1)} = w^{(t)} - \eta_t \nabla_w E\big(w^{(t)}\big), \qquad \nabla_w E(w) = \{\partial E(w)/\partial w_j\}_{j=0,\ldots,p}.$$

◮ Cycle through all the observations in the training set:
$$\nabla_w E(w) = -\sum_{k=1}^{n} e_k(w)\, \nabla_w\big\{\varphi(w^T x^{(k)})\big\} = -\sum_{k=1}^{n} \varphi'(w^T x^{(k)})\, e_k(w)\, x^{(k)} \;\Rightarrow\;$$
$$w^{(t+1)} = w^{(t)} + \eta_t \sum_{k=1}^{n} \varphi'\big([w^{(t)}]^T x^{(k)}\big)\, e_k\big(w^{(t)}\big)\, x^{(k)}, \qquad \text{i.e.} \quad w \leftarrow w + \eta_t \sum_{k=1}^{n} \varphi'(w^T x^{(k)})\, e_k(w)\, x^{(k)}.$$

◮ $E(w)$ is non-convex; convergence is to local minima. (A short R sketch follows.)
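
◮ A minimal sketch (not from the slides) of the batch rule for a single unit with logistic activation on simulated data; the names and the fixed stepsize are illustrative:

> set.seed(2)
> n <- 200; p <- 2
> X <- cbind(1, matrix(rnorm(n*p), n, p))             # first column = bias input
> z <- 1/(1 + exp(-(X %*% c(0.5, 1, -2)))) + rnorm(n, sd=0.05)
> phi  <- function(v) 1/(1 + exp(-v))                 # logistic activation
> dphi <- function(v) phi(v)*(1 - phi(v))             # its derivative
> w <- rep(0, p + 1); eta <- 0.05
> for (it in 1:500) {
+   v <- as.vector(X %*% w)
+   e <- as.vector(z) - phi(v)                        # errors e_t(w)
+   w <- w + eta * as.vector(t(X) %*% (dphi(v)*e))    # batch gradient step
+ }
> w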


Gradient learning rule (sequential version)

◮ At step $t = 1, 2, \ldots$ the $t$-th example $x^{(t)}$ is selected and
$$w^{(t+1)} = w^{(t)} - \eta_t \nabla_w \Big[\tfrac{1}{2}\, e_t^2(w)\Big]\Big|_{w = w^{(t)}},$$
$$w^{(t+1)} = w^{(t)} + \eta_t\, \varphi'\big([w^{(t)}]^T x^{(t)}\big)\, e_t\big(w^{(t)}\big)\, x^{(t)} = w^{(t)} + \eta_t\, \delta_t\, x^{(t)}, \qquad \delta_t := \varphi'\big([w^{(t)}]^T x^{(t)}\big)\, e_t\big(w^{(t)}\big).$$

– Examples are selected sequentially; each example is selected many times.

– Easy implementation; processes examples in real time.

– Convergence to local minima.


1. Neural network

◮ Neural network – multilayer perceptron

◮ A feed-forward network is a network in which units can be numbered so
that all connections go from a vertex to one with a higher number. The
vertices are arranged in layers, with connections only to higher layers.

◮ The hidden layer contains $M$ neurons fed by the inputs $(x_1, \ldots, x_p)$.
The output of the $m$-th neuron in the hidden layer is
$$v_m = \varphi_m\Big(w_{m0} + \sum_{i=1}^{p} w_{mi} x_i\Big) = \varphi_m(w_m^T x),$$
where $w_m$ is the vector of weights of perceptron $m$.


2. Neural network

◮ The output layer (unit) is fed by the neuron outputs of the hidden layer:
$$y = f\Big(w_0 + \sum_{m=1}^{M} w_m v_m\Big) = f(w^T v),$$
where $f$ is the activation function of the output unit and $w$ is the vector
of weights of the output unit.

◮ The network with one hidden layer implements the function
$$y = f(w^T v) = f\Big(w_0 + \sum_{m=1}^{M} w_m\, \varphi_m(w_m^T x)\Big).$$
(An R sketch of this forward pass is given below.)
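
◮ A minimal sketch (not from the slides) of this forward pass in R for one hidden layer with logistic activations and a linear output unit; all weights here are illustrative random values:

> set.seed(3)
> p <- 4; M <- 3
> W.hidden <- matrix(rnorm(M*(p+1)), M, p+1)   # rows: (w_m0, w_m1, ..., w_mp)
> w.out    <- rnorm(M + 1)                     # (w_0, w_1, ..., w_M)
> phi <- function(v) 1/(1 + exp(-v))           # hidden-layer activation
> forward <- function(x) {
+   v <- phi(W.hidden %*% c(1, x))             # hidden-layer outputs v_1, ..., v_M
+   sum(w.out * c(1, v))                       # linear output unit f(w^T v)
+ }
> forward(rnorm(p))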


3. Neural network

[Diagram: a feed-forward network with an input layer $x_1, \ldots, x_p$, a hidden layer of $M$ units producing $v_1, \ldots, v_M$ (with weights $w_{11}, \ldots, w_{Mp}$), and a single output unit producing $y$.]


Specific cases

◮ Projection pursuit model: if $f(w^T v) = \mathbf{1}^T v + w_0$ and $v_m = \varphi_m(w_m^T x)$, then
$$y = w_0 + \sum_{m=1}^{M} \varphi_m(w_m^T x).$$

◮ Generalized additive model: if $f$ is as before and $M = p$, $w_{m0} = 0$,
$w_{mi} = I\{i = m\}$, then
$$y = w_0 + \sum_{m=1}^{p} \varphi_m(x_m).$$


Representation power of neural networks

Let $f_0 : [0, 1]^d \to \mathbb{R}$ be a continuous function. Assume that $\varphi$ is
not a polynomial; then for any $\epsilon > 0$ there exist constants $M$ and
$(w_{mi}, w_m)$ such that for
$$f(x) = \sum_{m=1}^{M} w_m\, \varphi\Big(\sum_{i=1}^{d} w_{mi} x_i + w_{m0}\Big)$$
one has
$$|f(x) - f_0(x)| \le \epsilon, \qquad \forall\, x \in [0, 1]^d.$$

◮ A neural network with one hidden layer can approximate any
continuous function (illustrated in R below).
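
◮ A minimal illustration (not from the slides), assuming the nnet package: a single-hidden-layer network fitted to a smooth one-dimensional function; the size and the number of iterations are illustrative:

> library(nnet)
> set.seed(4)
> x <- matrix(seq(0, 1, length=200))
> y <- sin(2*pi*x)
> fit <- nnet(x, y, size=10, linout=TRUE, maxit=2000, trace=FALSE)
> max(abs(predict(fit, x) - y))     # maximal approximation error on the grid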


Training

◮ Prediction error criterion:
$$E(W) = \frac{1}{2} \sum_{t=1}^{n} \big[y^{(t)}(W) - z^{(t)}\big]^2 = \sum_{t=1}^{n} e_t\big(y^{(t)}(W), z^{(t)}\big),$$
where $W$ denotes all the weights, and
$$y^{(t)} = y^{(t)}(W) = f\Big(w_0 + \sum_{m=1}^{M} w_m\, \varphi_m\big(w_m^T x^{(t)}\big)\Big).$$

◮ Backpropagation algorithm:
$$W_j \leftarrow W_j - \eta\, \frac{\partial E(W)}{\partial W_j} = W_j - \eta \sum_{t=1}^{n} \frac{\partial e_t(W)}{\partial W_j}.$$

◮ The algorithm uses the chain rule for differentiation and requires
differentiable activation functions.


Training neural networks

◮ Number of hidden units: usually varies in the range 5–100.

◮ Overfitting: networks with too many units will overfit. The remedy is
to regularize, e.g., to minimize the criterion
$$E(W) + \lambda \sum_{i} W_i^2, \qquad \lambda \sim 10^{-4} - 10^{-2}.$$
This leads to the so-called weight decay algorithm.

◮ Starting values are usually taken to be random values near zero. The model
starts out nearly linear and becomes non-linear as the weights grow.

◮ Stopping rule: different ad hoc rules, e.g., a maximal number of iterations.

◮ Multiple minima: $E(W)$ is non-convex and has many local minima.


Projection pursuit regression

◮ Model: let $\omega_m$, $m = 1, \ldots, M$, be unknown unit $p$-vectors, and let
$$f(x) = \sum_{m=1}^{M} f_m(\omega_m^T x), \qquad f_m(\cdot) \text{ unknown}.$$
$f_m$ varies only in the direction defined by the vector $\omega_m$.

◮ The "effective" dimensionality is 1, not $p$.

◮ The error function (to be minimized w.r.t. $f_m$ and $\omega_m$, $m = 1, \ldots, M$):
$$\sum_{i=1}^{n} \Big[Y_i - \sum_{m=1}^{M} f_m(\omega_m^T X_i)\Big]^2.$$


Projection pursuit regression: fitting the model

◮ If $\omega$ is given, set $v_i = \omega^T X_i$ and apply a one-dimensional smoother
to get an estimate of $g$.

◮ Given $g$ and a previous guess $\omega_{\mathrm{old}}$ for $\omega$, write
$$g(\omega^T X_i) \approx g(\omega_{\mathrm{old}}^T X_i) + g'(\omega_{\mathrm{old}}^T X_i)\,(\omega - \omega_{\mathrm{old}})^T X_i$$
and
$$\sum_{i=1}^{n} \big[Y_i - g(\omega^T X_i)\big]^2 \approx \sum_{i=1}^{n} \big[g'(\omega_{\mathrm{old}}^T X_i)\big]^2 \Big[\Big(\omega_{\mathrm{old}}^T X_i + \frac{Y_i - g(\omega_{\mathrm{old}}^T X_i)}{g'(\omega_{\mathrm{old}}^T X_i)}\Big) - \omega^T X_i\Big]^2.$$
Minimize the last expression w.r.t. $\omega$ to get $\omega_{\mathrm{new}}$.

◮ Continue until convergence. (A short R example follows.)
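
◮ A minimal sketch (not from the slides): projection pursuit regression is available in base R as ppr() in the stats package; the rock data and the choice of two terms are illustrative:

> fit <- ppr(log(perm) ~ area + peri + shape, data = rock, nterms = 2)
> summary(fit)
> plot(fit)     # the fitted ridge functions f_m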


1. Neural network

◮ Extension of the idea of the perceptron.

◮ A feed-forward network is a network in which vertices (units) can be
numbered so that all connections go from a vertex to one with a higher
number. The vertices are arranged in layers, with connections only to
higher layers.

◮ Each unit $j$ sums its inputs, adds a constant forming the total input
$x_j$, and applies a function $f_j$ to $x_j$ to give the output $y_j$.

◮ The links have weights $w_{ij}$, which multiply the signals travelling
along them by that factor.


Fitting (training) neural networks

◮ Starting values: usually taken to be random values near zero. The model
starts out nearly linear and becomes non-linear as the weights grow.

◮ Stopping rule: different ad hoc rules, e.g., a maximal number of iterations.

◮ Overfitting: networks with too many units will overfit. The remedy is
to regularize, e.g., to minimize the criterion
$$E(w) + \lambda \sum_{i,j} w_{ij}^2, \qquad \lambda \sim 10^{-4} - 10^{-2}.$$
This leads to the weight decay algorithm.

◮ Number of hidden units: varies in the range 3–100.

◮ Multiple minima: $E(w)$ is non-convex and has many local minima.


3. Neural network

◮ The activation functions $f_j$ are taken to be

– linear,

– logistic: $f(x) = \ell(x) = e^x/(1 + e^x)$,

– threshold: $f(x) = I(x > 0)$.

◮ Neural networks with linear output units and a single hidden layer can
approximate any continuous function (as the size of the hidden layer grows):
$$y_k = \alpha_k + \sum_{j \to k} w_{jk}\, f_j\Big(\alpha_j + \sum_{i \to j} w_{ij} x_i\Big).$$

◮ Projection pursuit regression:
$$f(x) = \alpha + \sum_{j} f_j\Big(\alpha_j + \sum_{i} \beta_{ji} x_i\Big) = \alpha + \sum_{j} f_j(\alpha_j + \beta_j^T x).$$


Example: network for the Boston housing data

◮ Structure: 13 predictors, 3 units in the hidden layer, linear output unit

[Diagram: the 13 inputs $X_1, \ldots, X_{13}$ feed a hidden layer of 3 units, which feed a single linear output unit producing $Y$.]

◮ Represents the function ($3 \times 14 + 4 = 46$ weights):
$$y_0 = \alpha + \sum_{j=15}^{17} w_{j0}\, \varphi_j\Big(\alpha_j + \sum_{i=1}^{13} w_{ij} x_i\Big).$$


Notation

◮ All units are numbered sequentially. Every unit $j$ has input $x_j$ and
output $y_j$:
$$y_j = f_j(x_j), \qquad x_j = \sum_{i \to j} w_{ij}\, y_i.$$
Non-existent links are characterized by $w_{ij} = 0$; $w_{ij} = 0$ unless $i < j$.

◮ The output vector $y^*$ of the network is modeled as $y^* = f(x^*; w)$,
where $x^*$ is the input vector and $w$ is the vector of weights.

◮ Data: $\{(x^*_m, t_m),\; m = 1, \ldots, n\}$ – the observed examples;
$t_m$ is the observation for $y^*_m = f(x^*_m; w)$.


Back–propagation algorithm

◮ Discrepancy function to be minimized w.r.t. $w$:
$$E(w) = \sum_{m=1}^{n} E_m(w) = \sum_{m=1}^{n} \big\| t_m - f(x^*_m; w) \big\|^2 = \sum_{m=1}^{n} \big\| t_m - y^*_m \big\|^2.$$

◮ Update rule (gradient descent): for $\eta > 0$ define
$$w_{ij} \leftarrow w_{ij} - \eta\, \frac{\partial E(w)}{\partial w_{ij}}, \qquad \frac{\partial E(w)}{\partial w_{ij}} = \sum_{m=1}^{n} \frac{\partial E_m(w)}{\partial w_{ij}}.$$

◮ Derivative: because $f(x^*_m; w)$ depends on $w_{ij}$ only via $x_j$,
$$\frac{\partial E_m(w)}{\partial w_{ij}} = \frac{\partial E_m}{\partial x_j}\, \frac{\partial x_j}{\partial w_{ij}} = y_i\, \underbrace{\frac{\partial E_m}{\partial x_j}}_{\delta_j} = y_i\, \delta_j, \qquad \delta_j = \frac{\partial E_m}{\partial x_j} = \frac{\partial E_m}{\partial y_j}\, \frac{\partial y_j}{\partial x_j} = f_j'(x_j)\, \frac{\partial E_m}{\partial y_j}.$$


1. Example: Boston housing data - neural network

> library(MASS)    # the Boston housing data
> library(nnet)    # nnet() fits single-hidden-layer networks
> nobs <- dim(Boston)[1]

> trainx<-sample(1:nobs, 2*nobs/3, replace=F)

> testx<-(1:nobs)[-trainx]

#

# size - number of units in the hidden layer

# decay - parameter for weight decay; default 0

# linout - switch for linear output units; default logistic output units

# starting values - uniformly distributed [-0.7, 0.7]

#

> Boston10.1.nn<-nnet(formula=medv~., data=Boston[trainx,], size=10, decay=1.0e-03,

+ linout=TRUE, maxit=500)

# weights: 151

initial value 211829.925033

final value 31162.391756

converged

> pred <-predict(Boston10.1.nn, Boston[testx,])

> sum((pred-Boston[testx,"medv"])^2)

[1] 11590.46


2. Example: Boston housing data - neural network

> # another run with the same parameters

> Boston10.2.nn<-nnet(formula=medv~., data=Boston[trainx,], size=10, decay=1.0e-03,

+ linout=TRUE, maxit=500)

# weights: 151

initial value 223291.589605

iter 10 value 24530.299991

iter 20 value 21653.959154

...........................

iter 490 value 3061.645220

iter 500 value 3043.807915

final value 3043.807915

stopped after 500 iterations

> pred <-predict(Boston10.2.nn, Boston[testx,]) # prediction error on

> sum((pred-Boston[testx,"medv"])^2) # the test set

[1] 2859.247 # CART error was 4045.943
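
◮ Note (not from the slides): nnet() starts from random initial weights, which is why two runs with identical parameters can give very different fits; fixing the seed before the call makes a run reproducible. The seed and the object name below are illustrative only.

> set.seed(123)   # makes the random starting weights reproducible
> Boston10.3.nn<-nnet(formula=medv~., data=Boston[trainx,], size=10, decay=1.0e-03,
+ linout=TRUE, maxit=500)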


3. Example: Boston housing data – neural network

[Figure: left panel – the residuals of the fitted network (Boston10.2.nn$residuals) plotted against the observation index; right panel – normal Q-Q plot of the residuals.]


4. Example: Is the model good?

> cbind(pred, Boston[testx,"medv"])

[,1] [,2]

4 31.464151 33.4

12 19.442958 18.9

15 19.847892 18.2

18 18.361424 17.5

.......................

397 14.315629 12.5

403 16.850060 12.1

404 12.685557 8.3

405 -1.021680 8.5

406 -1.686176 5.0

413 11.566423 17.9

414 15.509046 16.3

419 -16.737063 8.8

421 20.722749 16.7

423 23.060568 20.8
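
◮ A quick numerical check (not from the slides) to complement the table above: the test-set root mean squared error and a predicted-vs-observed plot.

> sqrt(mean((pred-Boston[testx,"medv"])^2))     # test RMSE
> plot(Boston[testx,"medv"], pred, xlab="Observed medv", ylab="Predicted medv")
> abline(0, 1)                                  # perfect-prediction line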


21. Support Vector Machines (SVM)


Hyperplane

◮ A hyperplane in $\mathbb{R}^p$ is an affine subspace of dimension $p - 1$: its
equation is
$$f(x) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p = 0.$$

◮ Consider a binary classification problem with data
$D_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, where $X_i = [x_{i,1}, \ldots, x_{i,p}] \in \mathbb{R}^p$ and
$Y_i \in \{-1, 1\}$.

◮ We say that $D_n$ admits linear separation if there exists a separating
hyperplane $f(x)$ such that
$$f(X_i) = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} > 0 \ \text{ if } Y_i = 1,$$
$$f(X_i) = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} < 0 \ \text{ if } Y_i = -1.$$
A separating hyperplane satisfies $Y_i(\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p}) > 0$.


Separating hyperplanes

[Figure: two panels showing a two-class data set in the $(X_1, X_2)$ plane together with separating hyperplanes.]

If f(x) is a separating hyperplane then a natural classifier is sign{f(x)}.


1. Maximal margin classifier

◮ If the data set admits linear separation, it is natural to choose the
separating hyperplane with maximal margin, i.e., the separating
hyperplane farthest from the observations.

◮ Optimization problem:
$$\max_{\beta_0, \ldots, \beta_p} M \quad \text{s.t.} \quad \sum_{j=1}^{p} \beta_j^2 = 1, \qquad Y_i(\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p}) \ge M, \quad \forall\, i = 1, \ldots, n.$$
$M$ is the margin of the hyperplane. The optimization problem is convex and
can be solved efficiently on a computer.


2. Maximal margin classifier

[Figure: the maximal margin hyperplane for a two-class data set in the $(X_1, X_2)$ plane.]

◮ Maximal margin classifier


Drawbacks of maximal margin classifier

◮ What can be done if there is no separating hyperplane?

◮ The maximal margin classifier is very sensitive to a single observation.

[Figure: two panels in the $(X_1, X_2)$ plane illustrating how adding a single observation can drastically change the maximal margin hyperplane.]


1. Support vector classifier

◮ The idea: allow for misclassification (impose a soft margin).

◮ Optimization problem:
$$\max_{\beta_0, \ldots, \beta_p,\, \epsilon_1, \ldots, \epsilon_n} M \quad \text{s.t.} \quad \sum_{j=1}^{p} \beta_j^2 = 1,$$
$$Y_i(\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p}) \ge M(1 - \epsilon_i), \quad \forall\, i = 1, \ldots, n,$$
$$\epsilon_i \ge 0, \qquad \sum_{i=1}^{n} \epsilon_i \le C,$$
where $\epsilon_1, \ldots, \epsilon_n$ are slack variables and $C \ge 0$ is a tuning parameter.


2. Support vector classifier

◮ The slack variable $\epsilon_i$ tells us where the $i$-th observation is located:

– $\epsilon_i = 0$: the $i$-th observation is on the correct side of the margin;

– $\epsilon_i > 0$: the $i$-th observation violates the margin;

– $\epsilon_i > 1$: the $i$-th observation is on the wrong side of the hyperplane.

◮ The tuning parameter $C$ sets a budget for margin violations:

– $C = 0$: no budget for margin violations (the maximal margin classifier);

– $C > 0$: no more than $C$ observations can be misclassified (can lie
on the wrong side of the hyperplane);

– $C$ is usually chosen by cross-validation. (An R sketch follows below.)
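
◮ A minimal sketch (not from the slides), assuming the e1071 package and simulated two-class data: svm() with a linear kernel fits a support vector classifier. Note that its cost argument penalizes margin violations, so it is parameterized differently from the budget $C$ above (large cost corresponds to a small budget); tune() carries out the cross-validation over cost.

> library(e1071)
> set.seed(5)
> x <- matrix(rnorm(40*2), ncol=2)
> y <- factor(c(rep(-1,20), rep(1,20)))
> x[y==1, ] <- x[y==1, ] + 1.5                  # shift one class so the two classes overlap slightly
> dat <- data.frame(x=x, y=y)
> fit <- svm(y ~ ., data=dat, kernel="linear", cost=10, scale=FALSE)
> summary(fit)
> tune.out <- tune(svm, y ~ ., data=dat, kernel="linear",
+                  ranges=list(cost=c(0.01, 0.1, 1, 10, 100)))
> summary(tune.out)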
