
1. Introduction

• Motivation and definition

• Data mining tasks

• Methods and algorithms

• Examples and applications

1

Motivation: why data mining?

◮ Data explosion problem

– Automated data collection tools lead to tremendous amounts of

data stored in databases.

– Processing capacity of computers grows rapidly: CPU, memory,...

– Rapidly growing gap between our ability to generate data, and our

ability to make use of it.

We are drowning in data, but starving for knowledge!

2

What is data mining?

◮ Data Mining is the process of discovering new patterns from large

data sets involving methods from statistics and artificial intelligence

but also database management. The term is a buzzword, and is

frequently misused to mean any form of large scale data or

information processing. (Wikipedia)

◮ The term Data Mining is a misnomer. Mining of gold from rocks or

sand is called gold mining, rather than rock mining or sand mining.

A more appropriate term is Knowledge Mining from Data.

◮ Analysis of data is a process of inspecting, cleaning, transforming,

and modeling data with the goal of highlighting useful information,

suggesting conclusions, and supporting decision making.

3

What data mining is not

◮ Data mining differs from traditional database queries

– the query might not be precisely stated

– the data accessed is usually a different version from that of the

operational database

– output of data mining is not a subset of the database.

4

Example: wage data

◮ The goal: relate wage to education, age, year when the wage is

earned.

[Figure: scatterplots of Wage versus Age, Year, and Education Level.]

◮ Predict wage on the basis of age, year, education.

5

Example: stock market data

◮ The goal: to predict increase or decrease in S&P 500 stock index on a

given day using the past 5 years percentage change in the index.

[Figure: boxplots of the percentage change in the S&P index one, two and three days previous, grouped by Today's Direction (Down/Up).]

◮ Boxplots of the previous day's (2, 3 days) percentage change in the index

for the 648/602 days when it increased/decreased on the subsequent day.

6

Example: gene expression data

◮ Data: 6380 measurements of gene expression for 64 cancer cell lines.

◮ The goal: determine if there are groups (clusters) among the cell lines.

[Figure: two panels plotting the cell lines against the first two principal components Z1 and Z2.]

◮ Data set in two–dimensional space (principal components): each

point is a cell line; on the right panel – for 14 types of cancer.

7

Steps of the data mining process

◮ Learning the application domain

◮ Creating the target data set – data selection

◮ Data cleaning and preprocessing (may take 60% of the effort).

◮ Data reduction and transformation

◮ Choosing functions of data mining

– summarization, classification, regression, association

◮ Data mining – search for patterns

◮ Pattern evaluation and knowledge representation

8

Example

A credit card company must determine whether to authorize credit card

purchases. Based on historical information about purchases, each

purchase is placed in one of 4 classes:

• authorize

• ask for further identification before authorization

• do not authorize

• do not authorize and contact police.

Data mining tasks: (1) determine how the data fit into the classes; (2)

apply the model for each new purchase.

9

1. Objectives of data mining

◮ Exploratory data analysis (data visualization)

Summary statistics, various plots, etc. Starting point of any data

mining process.

◮ Descriptive modeling – describe the data generating mechanism.

Examples include

* estimating probability distribution of the data

* finding groups in data (clustering)

* relationships between variables (dependency modeling: regression,

time series...).

10

2. Objectives of data mining

◮ Predictive modeling – make a prediction about values of data using

known results from the database.

* Regression

* Classification

* Time series models

A key distinction between predictive and descriptive modeling is that

prediction problems focus on a single variable or a set of variables,

while in description problems no single variable is central to the

model.

11

Example applications

◮ Fraud detection

– AT&T uses a data mining system to detect fraudulent

international calls

– The Financial Crimes Enforcement Network AI Systems (FAIS)

uses data mining technologies to identify possible money

laundering activity within large cash transactions.

◮ Risk management

– Risk management applications use data mining to determine

insurance rates, manage investment portfolios, and differentiate

between companies and/or individuals who are good and poor

credit risks.

12

– US West Communications uses data mining to determine customer

trends and needs based on characteristics such as family size,

median family age and location.

◮ Text mining and Web analysis

– Personalize the products and pages displayed to a particular user

or set of users.

13

Specific applications

◮ Predict whether a patient, hospitalized due to a heart attack, will

have a second heart attack. Prediction can use demographic, diet and

clinical measurements.

◮ Predict the price of a stock six months from now on the basis of

company performance measures and economic data.

◮ Identify the number in a handwritten ZIP code from a digitized

image.

◮ Identify the risk factors for prostate cancer based on clinical and

demographic variables.

◮ Distinguish pornographic and non–pornographic web pages.

14

I. Data mining tasks

◮ Regression

Regression is used to map a data item into a real valued prediction

variable. Regression involves the learning of the function that does

this mapping.

◮ Classification

Classification maps data into predefined groups (classes). It is often

referred to as supervised learning because the classes are determined

before examining the data.

◮ Time series analysis

Modeling variables evolving over time.

15

II. Data mining tasks

◮ Clustering

Clustering is similar to classification except that the groups are not

predefined, but are determined by the data.

◮ Association rules

Association rules (affinity analysis) refers to the data mining task of

uncovering relationships in data.

◮ Summarization

Summarization maps data into subsets with associated simple

descriptions. For example, U.S. News & World Report uses the average

SAT and ACT scores to compare US universities.

16

Data mining methods and algorithms

◮ Regression methods

Linear regression, kernels, splines, nearest–neighbors, neural

networks, regression trees,...

◮ Classification methods

Discriminant analysis, logistic regression, nearest–neighbors,

classification trees, ensemble methods,...

◮ Clustering methods

Hierarchical clustering, K–means, K–medoids, mixtures,...

17

2. Data types and distance measures

18

Data format

◮ Observations: for subject i ∈ {1, . . . , N} we observe k different

features (e.g., age, cholesterol level, weight, marital status, etc.), i.e.

we have a vector xi = (xi,1, . . . , xi,k).

◮ The database can be identified with an N × k matrix (rows = subjects, columns = features):

$$
X = \begin{pmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,k} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,k} \\
\vdots  & \vdots  & \ddots & \vdots \\
x_{N,1} & x_{N,2} & \cdots & x_{N,k}
\end{pmatrix},
$$

where row $i$ is the feature vector $x_i = (x_{i,1}, \ldots, x_{i,k})$.

19

Distance and data types

◮ A distance measure d(x,y) between observations x and y should

satisfy the following axioms:

(i) d(x,y) ≥ 0 for all x, y, and d(x,y) = 0 iff x = y.

(ii) d(x,y) = d(y,x) (symmetry)

(iii) d(x,y) ≤ d(x, z) + d(z,y) (triangle inequality)

◮ Features can be

– Numerical – discrete or continuous (age, weight,...)

– Binary – encoded by 0− 1 (gender, success–failure,...)

– Categorical – extension of binary to more categories

– Ordinal – ordered categorical: {worst, bad, good, best}

20

1. Distance measures

◮ Numerical data

– Euclidean (L2) distance

$$ d(x,y) = \sqrt{(x-y)^T (x-y)} = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2} $$

– Manhattan (L1) distance: $d(x,y) = \sum_{i=1}^{k} |x_i - y_i|$

– Lp–distance: $d(x,y) = \bigl\{\sum_{i=1}^{k} |x_i - y_i|^p\bigr\}^{1/p}$, $p \in [1,\infty]$.

– Mahalanobis distance: if Σ is the covariance matrix of the features, then

$$ d(x,y) = \sqrt{(x-y)^T \Sigma^{-1} (x-y)}. $$

21
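A minimal R sketch of these distance measures, using made-up vectors plus the UScereal data that appears later in these notes (variable names here are illustrative only):

library(MASS)                                # for the UScereal data used later

x <- c(2, 5, 1)
y <- c(4, 1, 0)

d_euclid    <- sqrt(sum((x - y)^2))          # L2 distance
d_manhattan <- sum(abs(x - y))               # L1 distance
d_lp <- function(x, y, p) sum(abs(x - y)^p)^(1/p)   # general Lp distance

# Mahalanobis distance needs a covariance matrix; here it is estimated
# from three numeric columns of UScereal, purely as an example.
Z     <- as.matrix(UScereal[, c("calories", "fat", "protein")])
Sigma <- cov(Z)
d_mahal <- sqrt(mahalanobis(Z[1, ], center = Z[2, ], cov = Sigma))

c(euclidean = d_euclid, manhattan = d_manhattan, L3 = d_lp(x, y, 3), mahalanobis = d_mahal)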

2. Distance measures

◮ Categorical data

– Matching (Hamming) distance: $d(x,y) = \sum_{i=1}^{k} I(x_i \neq y_i)$

– Weighted matching distance:

$$ d(x,y) = \sum_{i=1}^{k} w_i I(x_i \neq y_i), \qquad w_i > 0, \quad \sum_{i=1}^{k} w_i = 1. $$

◮ Ordinal data: an ordinal variable $x_i$ with $m_i$ levels is substituted by its

rank $r(x_i) \in \{1, \ldots, m_i\}$. To normalize the rank to $[0,1]$ define

$z(x_i) = \frac{r(x_i)-1}{m_i-1}$, and set

$$ d(x,y) = \sum_{i=1}^{k} |z(x_i) - z(y_i)|. $$

◮ Mixed data: use a mixture of normalized distances.

22

3. Data summary and visualization

23

Summary statistics

# The UScereal data frame has 65 rows and 11 columns.

# The data come from the 1993 ASA Statistical Graphics Exposition,

# and are taken from the mandatory F&DA food label.

# The data have been normalized here to a portion of one American cup.

>library(MASS)

>data(UScereal)

>summary(UScereal)

mfr calories protein fat sodium

G:22 Min. : 50.0 Min. : 0.7519 Min. :0.000 Min. : 0.0

K:21 1st Qu.:110.0 1st Qu.: 2.0000 1st Qu.:0.000 1st Qu.:180.0

N: 3 Median :134.3 Median : 3.0000 Median :1.000 Median :232.0

P: 9 Mean :149.4 Mean : 3.6837 Mean :1.423 Mean :237.8

Q: 5 3rd Qu.:179.1 3rd Qu.: 4.4776 3rd Qu.:2.000 3rd Qu.:290.0

R: 5 Max. :440.0 Max. :12.1212 Max. :9.091 Max. :787.9

fibre carbo sugars shelf

Min. : 0.000 Min. :10.53 Min. : 0.00 Min. :1.000

1st Qu.: 0.000 1st Qu.:15.00 1st Qu.: 4.00 1st Qu.:1.000

Median : 2.000 Median :18.67 Median :12.00 Median :2.000

Mean : 3.871 Mean :19.97 Mean :10.05 Mean :2.169

24

3rd Qu.: 4.478 3rd Qu.:22.39 3rd Qu.:14.00 3rd Qu.:3.000

Max. :30.303 Max. :68.00 Max. :20.90 Max. :3.000

potassium vitamins

Min. : 15.0 100% : 5

1st Qu.: 45.0 enriched:57

Median : 96.6 none : 3

Mean :159.1

3rd Qu.:220.0

Max. :969.7

># correlation matrix between some variables

>cor(UScereal[c("calories","protein","fat","fibre","sugars")])

calories protein fat fibre sugars

calories 1.0000000 0.7060105 0.5901757 0.3882179 0.4952942

protein 0.7060105 1.0000000 0.4112661 0.8096397 0.1848484

fat 0.5901757 0.4112661 1.0000000 0.2260715 0.4156740

fibre 0.3882179 0.8096397 0.2260715 1.0000000 0.1489158

sugars 0.4952942 0.1848484 0.4156740 0.1489158 1.0000000

25

1. Density visualization

Histogram

>hist(UScereal[,"protein"], main="UScereal data", xlab="protein")

[Figure: histogram of protein, titled "UScereal data"; x-axis: protein, y-axis: Frequency.]

26

2. Density visualization

Kernel smoothing

>plot(density(UScereal[,"protein"],kernel="gaussian"), main="UScereal data",

+ xlab="protein")

[Figure: Gaussian kernel density estimate of protein, titled "UScereal data"; x-axis: protein, y-axis: Density.]

27

Boxplot

>mfr <- UScereal[,"mfr"]

>boxplot(UScereal[mfr=="K","protein"], UScereal[mfr=="G","protein"],

+ names=c("Kellogs", "General Mills"), xlab="Manufacturer", ylab="protein")

[Figure: side-by-side boxplots of protein for Kellogs and General Mills cereals; x-axis: Manufacturer, y-axis: protein.]

28

Quantile plot

A QQ plot displays the pairs $(z_{k/(n+1)}, x_{(k)})$, where $z_q$ is the $q$th quantile of $N(0,1)$, i.e. $\Phi(z_q) = q$, $0 < q < 1$.

>qqnorm(UScereal$calories)

[Figure: Normal Q–Q plot of calories; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles.]

29

Relations between two variables

Scatterplot

>plot(UScereal$fat, UScereal$calories, xlab="Fat", ylab="Calories")

[Figure: scatterplot of Calories versus Fat.]

30

Relations between more than two variables

Scatterplot matrix

>plot(UScereal[c("calories", "fat", "protein", "sugars","fibre", "sodium")])

[Figure: scatterplot matrix of calories, fat, protein, sugars, fibre and sodium.]

31

Parallel plot

>parallel( UScereal[, c("calories","protein", "fat", "fibre")])

[Figure: parallel-coordinates plot of calories, protein, fat and fibre, each axis scaled from Min to Max.]

32

4. Association rules

(Market basket analysis)

33

Market basket analysis

◮ Association rules show the relationships between data items.

◮ Typical example

A grocery store keeps a record of weekly transactions. Each

represents the items bought during one cash register

transaction. The objective of the market basket analysis is to

determine the items likely to be purchased together by a

customer.

34

Example

◮ Items: {Beer, Bread, Jelly, Milk, PeanutButter}

Transaction Items

t1 Bread, Jelly, PeanutButter

t2 Bread, PeanutButter

t3 Bread, Milk, PeanutButter

t4 Beer, Bread

t5 Beer, Milk

◮ 100% of the time that PeanutButter is purchased, so is Bread.

◮ 33.3% of the time PeanutButter is purchased, Jelly is also

purchased.

◮ PeanutButter exists in 60% of the overall transactions.

35

Definitions

◮ Given:

• a set of items $I = \{I_1, \ldots, I_m\}$

• a database of transactions $D = \{t_1, \ldots, t_n\}$, where $t_i = \{I_{i_1}, \ldots, I_{i_k}\}$ and $I_{i_j} \in I$

◮ Association rule

Let X and Y be two disjoint subsets (itemsets) of I. We say that

Y is associated with X (and write X ⇒ Y) if the appearance of

X in a transaction "usually" implies that Y occurs in that

transaction too. We identify

X ⇔ {X is purchased}

36

Support and confidence

◮ The support s of an association rule X ⇒ Y is the fraction of

transactions in the database that contain both X and Y:

$$ s(X \Rightarrow Y) = P(X \cap Y) = \frac{1}{n} \sum_{i=1}^{n} 1\{t_i \supseteq (X \cap Y)\}. $$

◮ The confidence (or strength) α of an association rule X ⇒ Y is the ratio of

the number of transactions that contain both X and Y to the number of

transactions that contain X:

$$ \alpha(X \Rightarrow Y) = P(Y \mid X) = \frac{P(X \cap Y)}{P(X)} = \frac{\sum_{i=1}^{n} 1\{t_i \supseteq (X \cap Y)\}}{\sum_{i=1}^{n} 1\{t_i \supseteq X\}}. $$

◮ Problem: find all rules with support ≥ s0 and confidence ≥ α0.

37
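A minimal base-R sketch, assuming the five toy transactions above are encoded as a 0/1 matrix (the encoding and helper functions are illustrative, not part of the lecture):

# transactions t1..t5 over items Beer, Bread, Jelly, Milk, PeanutButter
trans <- rbind(
  t1 = c(Beer=0, Bread=1, Jelly=1, Milk=0, PeanutButter=1),
  t2 = c(0, 1, 0, 0, 1),
  t3 = c(0, 1, 0, 1, 1),
  t4 = c(1, 1, 0, 0, 0),
  t5 = c(1, 0, 0, 1, 0))

support <- function(items)                  # fraction of transactions containing all the items
  mean(apply(trans[, items, drop = FALSE] == 1, 1, all))

confidence <- function(X, Y)                # support(X and Y) / support(X)
  support(c(X, Y)) / support(X)

support(c("Bread", "PeanutButter"))         # 0.6
confidence("PeanutButter", "Bread")         # 1.0
confidence("PeanutButter", "Jelly")         # 0.333...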

Support and confidence of some rules

X ⇒ Y s α

Bread ⇒ PeanutButter 60% 75%

PeanutButter ⇒ Bread 60% 100%

Beer ⇒ Bread 20% 50%

PeanutButter ⇒ Jelly 20% 33.3%

Jelly ⇒ PeanutButter 20% 100%

Jelly ⇒ Milk 0% 0%

38

Other measures of rule quality

Rules with high support and confidence may be obvious or not interesting.

◮ Example: 100 baskets, purchases of tea and coffee

◮ Example: 100 baskets, purchases of tea and coffee

            coffee   coffee^c   Σ_row
  tea          20        5        25
  tea^c        70        5        75
  Σ_col        90       10       100

Rule tea ⇒ coffee: s = 0.2, α = (20/100)/(25/100) = 0.8 ⇒ a strong rule!

However, s(coffee) = 90/100 = 0.9; thus there is in fact a negative

"association" between buying tea and buying coffee.

◮ Additional measures of the rule quality are needed.

39

Lift and conviction

◮ Lift (interest)

$$ \mathrm{lift}(X \Rightarrow Y) = \frac{P(X \cap Y)}{P(X)P(Y)}
 = \frac{\frac{1}{n}\sum_{i=1}^{n} 1(t_i \supseteq X \cap Y)}{\frac{1}{n}\sum_{i=1}^{n} 1(t_i \supseteq X)\cdot \frac{1}{n}\sum_{i=1}^{n} 1(t_i \supseteq Y)} $$

Rules with lift ≥ 1 are interesting. In the previous example

$$ \mathrm{lift}(\mathrm{tea} \Rightarrow \mathrm{coffee}) = \frac{0.2}{0.25 \times 0.9} = 0.89. $$

◮ Conviction

$$ \mathrm{conviction}(X \Rightarrow Y) = \frac{P(X)P(Y^c)}{P(X \cap Y^c)}
 = \frac{\frac{1}{n}\sum_{i=1}^{n} 1\{t_i \supseteq X\}\cdot \frac{1}{n}\sum_{i=1}^{n} 1\{t_i \supseteq Y^c\}}{\frac{1}{n}\sum_{i=1}^{n} 1\{t_i \supseteq X \cap Y^c\}} $$

Rules that always hold have conviction = ∞. In the previous example

$$ \mathrm{conviction}(\mathrm{tea} \Rightarrow \mathrm{coffee}) = \frac{(25/100)\cdot(10/100)}{5/100} = 0.5. $$

40
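Continuing the toy sketch above (an assumption of this text, not the lecture's code), lift and conviction can be computed from the same support function:

lift <- function(X, Y) support(c(X, Y)) / (support(X) * support(Y))

conviction <- function(X, Y) {
  pYc    <- 1 - support(Y)                      # fraction of baskets without Y
  pXnotY <- support(X) - support(c(X, Y))       # baskets with X but not Y
  support(X) * pYc / pXnotY                     # Inf when the rule always holds
}

lift("Bread", "PeanutButter")        # 0.6 / (0.8 * 0.6) = 1.25
conviction("PeanutButter", "Bread")  # Inf: PeanutButter always comes with Bread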

Lift and conviction of some rules

X ⇒ Y                        Lift    Conviction

Bread ⇒ PeanutButter          5/4       8/5
PeanutButter ⇒ Bread          5/4        ∞
Beer ⇒ Bread                  5/8       2/5
PeanutButter ⇒ Jelly          5/3       6/5
Jelly ⇒ PeanutButter          5/3        ∞
Jelly ⇒ Milk                   0        3/5

41

Mining rules from frequent itemsets

1. Find frequent itemsets (itemset whose number of occurrences is above

a threshold s0).

2. Generate rules from frequent itemsets.

Input:  D – database, I – collection of all items,

        L – collection of all frequent itemsets, s0, α0.

Output: R – association rules satisfying s0 and α0.

R = ∅;
for each ℓ ∈ L do
  for each x ⊂ ℓ such that x ≠ ∅ do
    if support(ℓ)/support(x) ≥ α0 then R = R ∪ {x ⇒ (ℓ − x)};

42

Example

Assume s0 = 30% and α0 = 50%.

◮ Frequent itemset L

{{Beer},{Bread},{Milk},{PeanutButter},{Bread,PeanutButter}}

◮ For ℓ = {Bread, PeanutButter} we have two subsets:

support({Bread, PeanutButter}) / support({Bread}) = 60/80 = 0.75 > 0.5

support({Bread, PeanutButter}) / support({PeanutButter}) = 60/60 = 1 > 0.5

◮ Conclusion:

PeanutButter ⇒ Bread and Bread ⇒ PeanutButter are valid

association rules.

43

1. Finding frequent itemsets: apriori algorithm

◮ Frequent itemset property

Any subset of frequent itemset must be frequent

◮ Basic idea:

– Look at candidate sets of size i

– Choose frequent itemsets of the size i

– Generate candidate itemsets of size i+1 by joining (taking unions

of) the frequent itemsets found in earlier passes.

44

2. Finding frequent itemsets: apriori algorithm

At kth pass of the apriori algorithm we form a set of candidate itemsets

Ck of size k and a set of frequent itemsets Lk of size k.

1. Start with all singleton itemsets C1. Count support of all items in C1

and form the set L1 of all items from C1 with support ≥ s.

2. Let C2 be the set of all pairs from L1. Count support of all members

of C2 and form the set L2 of pairs with support ≥ s.

3. Let C3 be the set of triples, any two of which is a pair in L2.

Calculate the support of each triple in C3 and form the set of triples

L3 with support ≥ s.

4. Continue...

45
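In practice the apriori algorithm is available in the R package arules (not used in these notes); a minimal, hedged sketch on the toy transactions:

library(arules)

# encode the five toy baskets as a "transactions" object
baskets <- list(
  t1 = c("Bread", "Jelly", "PeanutButter"),
  t2 = c("Bread", "PeanutButter"),
  t3 = c("Bread", "Milk", "PeanutButter"),
  t4 = c("Beer", "Bread"),
  t5 = c("Beer", "Milk"))
trans <- as(baskets, "transactions")

rules <- apriori(trans, parameter = list(supp = 0.3, conf = 0.5))
inspect(rules)        # shows support, confidence and lift of each rule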

Example: apriori algorithm

s0 = 30%, α0 = 50%

Pass   Candidates                          Frequent itemsets

1      {Beer},{Bread},{Jelly},             {Beer},{Bread},
       {PeanutButter},{Milk}               {Milk},{PeanutButter}

2      {Beer,Bread},{Beer,Milk},           {Bread,PeanutButter}
       {Beer,PeanutButter},{Bread,Milk},
       {Bread,PeanutButter},
       {Milk,PeanutButter}

46

Summary

◮ Efficient finding of frequent itemsets

Finding frequent itemsets is costly. If there are m items, potentially

there may be 2m − 1 frequent itemsets.

◮ Once all frequent itemsets are found, generating the association rules

is easy and straightforward.

47

Other applications

Applications of association rules are not limited to basket analysis.

◮ Finding related concepts: items=words, baskets=documents (web

pages, tweets,...). We look for sets of words appearing together in

many documents. Expect that {Brad, Angelina} appears with

surprising frequency.

◮ Plagiarism: items=documents, baskets=sentences; an item is in the

basket if the sentence appears in the document.

◮ Biomarkers: items are of two types – biomarkers (genes, proteins,...)

and diseases; each basket is the set of data about the patient (genome

and blood analysis, medical history...). A frequent itemset that

contains one disease and one or more biomarkers suggests a test for a

disease.

48

Example: DVD movies purchases

◮ Data:

> data<-read.table("DVDdata.txt",header=T)

> data

Braveheart Gladiator Green.Mile Harry.Potter1 Harry.Potter2 LOTR1 LOTR2

1 0 0 1 1 0 1 1

2 1 1 0 0 0 0 0

3 0 0 0 0 0 1 1

4 0 1 0 0 0 0 0

5 0 1 0 0 0 0 0

6 0 1 0 0 0 0 0

7 0 0 0 1 1 0 0

8 0 1 0 0 0 0 0

9 0 1 0 0 0 0 0

10 0 1 1 0 0 1 0

49

Patriot Sixth.Sense

1 0 1

2 1 0

3 0 0

4 1 1

5 1 1

6 1 1

7 0 0

8 1 0

9 1 1

10 0 1

>

◮ Preparations

> nobs<-dim(data)[1]

> n<-dim(data)[2]

> namesvec<-colnames(data)

> namesvec

[1] "Braveheart" "Gladiator" "Green.Mile" "Harry.Potter1"

[5] "Harry.Potter2" "LOTR1" "LOTR2" "Patriot"

[9] "Sixth.Sense"

>

50

> # thresholds for rules

> supthresh<-0.2

> conftresh<-0.5

> lifttresh<-2

>

> sup1<-array(0,n)

> sup2<-matrix(0,ncol=n,nrow=n,dimnames=list(namesvec,namesvec))

◮ Calculating the chance of appearance P (X) for each movie

> for (i in 1:n){

+ sup1[i]<-sum(data[,i])/nobs}

> sup1

[1] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6

◮ Calculating the chance of appearance P (X, Y ) for each pair of movies

> for (j in 1:n){

+ if(sup1[j]>=supthresh){

+ for (k in j:n){

+ if (sup1[k]>=supthresh){

+ sup2[j,k]<-data[,j]%*%data[,k]

+ sup2[k,j]<-sup2[j,k] } } } }

> sup2<-sup2/nobs

> sup2

51

Braveheart Gladiator Green.Mile Harry.Potter1 Harry.Potter2 LOTR1

Braveheart 0 0.0 0.0 0.0 0 0.0

Gladiator 0 0.7 0.1 0.0 0 0.1

Green.Mile 0 0.1 0.2 0.1 0 0.2

Harry.Potter1 0 0.0 0.1 0.2 0 0.1

Harry.Potter2 0 0.0 0.0 0.0 0 0.0

LOTR1 0 0.1 0.2 0.1 0 0.3

LOTR2 0 0.0 0.1 0.1 0 0.2

Patriot 0 0.6 0.0 0.0 0 0.0

Sixth.Sense 0 0.5 0.2 0.1 0 0.2

LOTR2 Patriot Sixth.Sense

Braveheart 0.0 0.0 0.0

Gladiator 0.0 0.6 0.5

Green.Mile 0.1 0.0 0.2

Harry.Potter1 0.1 0.0 0.1

Harry.Potter2 0.0 0.0 0.0

LOTR1 0.2 0.0 0.2

LOTR2 0.2 0.0 0.1

Patriot 0.0 0.6 0.4

Sixth.Sense 0.1 0.4 0.6

◮ Calculating the confidence matrix P (column|row)

52

> conf2<-sup2/c(sup1)

> conf2

Braveheart Gladiator Green.Mile Harry.Potter1 Harry.Potter2

Braveheart 0 0.0000000 0.0000000 0.0000000 0

Gladiator 0 1.0000000 0.1428571 0.0000000 0

Green.Mile 0 0.5000000 1.0000000 0.5000000 0

Harry.Potter1 0 0.0000000 0.5000000 1.0000000 0

Harry.Potter2 0 0.0000000 0.0000000 0.0000000 0

LOTR1 0 0.3333333 0.6666667 0.3333333 0

LOTR2 0 0.0000000 0.5000000 0.5000000 0

Patriot 0 1.0000000 0.0000000 0.0000000 0

Sixth.Sense 0 0.8333333 0.3333333 0.1666667 0

LOTR1 LOTR2 Patriot Sixth.Sense

Braveheart 0.0000000 0.0000000 0.0000000 0.0000000

Gladiator 0.1428571 0.0000000 0.8571429 0.7142857

Green.Mile 1.0000000 0.5000000 0.0000000 1.0000000

Harry.Potter1 0.5000000 0.5000000 0.0000000 0.5000000

Harry.Potter2 0.0000000 0.0000000 0.0000000 0.0000000

LOTR1 1.0000000 0.6666667 0.0000000 0.6666667

LOTR2 1.0000000 1.0000000 0.0000000 0.5000000

Patriot 0.0000000 0.0000000 1.0000000 0.6666667

Sixth.Sense 0.3333333 0.1666667 0.6666667 1.0000000

53

◮ Calculating the lift matrix

> tmp<-matrix(c(sup1),nrow=n,ncol=n,byrow=TRUE)

> tmp

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]

[1,] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6

[2,] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6

[3,] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6

[4,] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6

[5,] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6

[6,] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6

[7,] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6

[8,] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6

[9,] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6

>

> lift2<-conf2/tmp

> lift2

Braveheart Gladiator Green.Mile Harry.Potter1 Harry.Potter2

Braveheart 0 0.0000000 0.0000000 0.0000000 0

Gladiator 0 1.4285714 0.7142857 0.0000000 0

Green.Mile 0 0.7142857 5.0000000 2.5000000 0

Harry.Potter1 0 0.0000000 2.5000000 5.0000000 0

54

Harry.Potter2 0 0.0000000 0.0000000 0.0000000 0

LOTR1 0 0.4761905 3.3333333 1.6666667 0

LOTR2 0 0.0000000 2.5000000 2.5000000 0

Patriot 0 1.4285714 0.0000000 0.0000000 0

Sixth.Sense 0 1.1904762 1.6666667 0.8333333 0

LOTR1 LOTR2 Patriot Sixth.Sense

Braveheart 0.0000000 0.0000000 0.000000 0.0000000

Gladiator 0.4761905 0.0000000 1.428571 1.1904762

Green.Mile 3.3333333 2.5000000 0.000000 1.6666667

Harry.Potter1 1.6666667 2.5000000 0.000000 0.8333333

Harry.Potter2 0.0000000 0.0000000 0.000000 0.0000000

LOTR1 3.3333333 3.3333333 0.000000 1.1111111

LOTR2 3.3333333 5.0000000 0.000000 0.8333333

Patriot 0.0000000 0.0000000 1.666667 1.1111111

Sixth.Sense 1.1111111 0.8333333 1.111111 1.6666667

◮ Extracting and printing rules

> rulesmat<-(sup2>=supthresh)*(conf2>=conftresh)*(lift2>=lifttresh)

55

> rulesmat

Braveheart Gladiator Green.Mile Harry.Potter1 Harry.Potter2 LOTR1

Braveheart 0 0 0 0 0 0

Gladiator 0 0 0 0 0 0

Green.Mile 0 0 0 0 0 1

Harry.Potter1 0 0 0 0 0 0

Harry.Potter2 0 0 0 0 0 0

LOTR1 0 0 1 0 0 0

LOTR2 0 0 0 0 0 1

Patriot 0 0 0 0 0 0

Sixth.Sense 0 0 0 0 0 0

LOTR2 Patriot Sixth.Sense

Braveheart 0 0 0

Gladiator 0 0 0

Green.Mile 0 0 0

Harry.Potter1 0 0 0

Harry.Potter2 0 0 0

LOTR1 1 0 0

LOTR2 0 0 0

Patriot 0 0 0

Sixth.Sense 0 0 0

56

> diag(rulesmat)<-0

> rules<-NULL

> for (j in 1:n){

+ if (sum(rulesmat[j,])>0){

+ rules<-c(rules,paste(namesvec[j],"->",namesvec[rulesmat[j,]==1],sep=""))

+ }

+ }

> rules

[1] "Green.Mile->LOTR1" "LOTR1->Green.Mile" "LOTR1->LOTR2"

[4] "LOTR2->LOTR1"

◮ If we set supthresh<-0.1 then we find 12 rules

> rules

[1] "Green.Mile->Harry.Potter1" "Green.Mile->LOTR1"

[3] "Green.Mile->LOTR2" "Harry.Potter1->Green.Mile"

[5] "Harry.Potter1->Harry.Potter2" "Harry.Potter1->LOTR2"

[7] "Harry.Potter2->Harry.Potter1" "LOTR1->Green.Mile"

[9] "LOTR1->LOTR2" "LOTR2->Green.Mile"

[11] "LOTR2->Harry.Potter1" "LOTR2->LOTR1"

57

5. Predictive data mining: general issues

• Regression problem

• Classification problem

• Assessing goodness of a predictive model

58

Regression problem

◮ Regression problem: to model a response variable Y ∈ R1 as a

function of predictor variables X = (X1, . . . , Xp) ∈ Rp. If Y is a

continuous variable then a plausible model is

Y = f(X) + ǫ, ǫ is a random noise, Eǫ = 0

f is the regression function, f(X) = E{Y |X}.

◮ Fitting the model: based on the data

Dn = {(Yi,Xi), i = 1, . . . , n}, Xi = (Xi1, . . . , Xip)

the goal is to construct an estimate f̂(·) = f̂(·;Dn) of f(·).

◮ Model can be parametric and non–parametric

59

Approaches to regression modeling

◮ Parametric approach: a parametric form for unknown regression

function is assumed

f(x) = f(x, θ), θ ∈ Θ ⊂ Rm.

E.g., f(x, θ) = θTx where θ is unknown vector (linear regression).

◮ Nonparametric approach:

no specific parametric form for f is assumed.

60

Classification problem

◮ Classification problem: the objective is to model a binary

(categorical) response variable Y as a function of predictor variables

X = (X1, . . . , Xp).

◮ Each subject belongs to one of the two populations Π0 or Π1. For the

i-th subject Xi is observed and the corresponding ”population label”

Yi ∈ {0, 1}. Based on the data

Dn = {(Yi,Xi), i = 1, . . . , n}, Xi = (X1i, . . . , Xpi)

the goal is to construct a classifier (prediction rule) f̂(·) = f̂(·;Dn)

that for each x predicts the label Y .

◮ Approaches: parametric and non-parametric.

61

Assessing goodness of a predictive model

◮ A "naive" (resubstitution) approach: f̂(Xi) should be close to Yi for

each i. One can look, e.g., at the following performance indices:

$$ S_{L_2} = \frac{1}{n} \sum_{i=1}^{n} \bigl[Y_i - \hat f(X_i)\bigr]^2, \qquad
   S_{L_1} = \frac{1}{n} \sum_{i=1}^{n} \bigl|Y_i - \hat f(X_i)\bigr|. $$

If Y is a binary variable (as in the classification problem) then the

appropriate index is

$$ S_{0/1} = \frac{1}{n} \sum_{i=1}^{n} I\bigl\{\hat f(X_i) \neq Y_i\bigr\}. $$

The smaller $S_{L_2}$ ($S_{L_1}$, $S_{0/1}$) is, the better the fit.

Is that a good approach?

62

Training and validation data sets

◮ Overfitting: “fitting the noise”.

The model is evaluated on the fitted data

◮ Training and validation data sets:

Divide the data $D_n$ into two subsamples of sizes $n_1$ and $n_2$,

respectively: a training sample $D'_n$ and a validation sample $D''_n$.

Construct the estimate on the basis of $D'_n$ only: $\hat f(x) = \hat f(x; D'_n)$.

Assess goodness of fit on the basis of the validation sample $D''_n$, e.g.,

$$ S_{L_2} = \frac{1}{n_2} \sum_{i:\,(Y_i, X_i) \in D''_n} \bigl[Y_i - \hat f(X_i; D'_n)\bigr]^2. $$

◮ Drawback: $D''_n$ is used only for validation purposes.

63
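A minimal illustration of a training/validation split in R, using the Boston data analysed later in these notes (the 70/30 split and the seed are arbitrary choices):

library(MASS)                                          # Boston data
set.seed(1)
n     <- nrow(Boston)
train <- sample(seq_len(n), size = round(0.7 * n))     # indices of the training sample

fit  <- lm(medv ~ ., data = Boston[train, ])           # fit on the training sample only
pred <- predict(fit, newdata = Boston[-train, ])       # predict on the validation sample

SL2  <- mean((Boston$medv[-train] - pred)^2)           # validation estimate of S_L2
SL2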

Leave–one–out cross–validation

◮ Cross-validation (leave–one–out): original data set

$D_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$.

The data set $D_n^{(-i)}$ without the i-th observation:

$$ D_n^{(-i)} = \{(X_1, Y_1), \ldots, (X_{i-1}, Y_{i-1}), (X_{i+1}, Y_{i+1}), \ldots, (X_n, Y_n)\}. $$

◮ Cross–validation accuracy estimate

$$ S_{L_2}^{CV} = \frac{1}{n} \sum_{i=1}^{n} \bigl[Y_i - \hat f(X_i; D_n^{(-i)})\bigr]^2. $$

◮ In general computationally expensive, but cheap for linear regression:

$$ S_{L_2}^{CV} = \frac{1}{n} \sum_{i=1}^{n} \Bigl(\frac{Y_i - \hat Y_i}{1 - h_i}\Bigr)^2, $$

where $h_i$ is the i-th diagonal element of the hat matrix (the leverage statistic).

64
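A short R sketch of the leave-one-out shortcut for linear regression (the ozone model is reused from a later slide purely as an example):

fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
h   <- hatvalues(fit)                    # leverages: diagonal of the hat matrix
res <- residuals(fit)
SL2_cv <- mean((res / (1 - h))^2)        # leave-one-out CV estimate without refitting
SL2_cv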

k–fold cross-validation

◮ Randomly divide the set of observations $D_n = \{(X_i, Y_i),\ i = 1, \ldots, n\}$

into k groups $D_n^{(j)}$, $j = 1, \ldots, k$. Let $D_n^{(-j)}$ be the dataset without the

j-th group of observations, $j = 1, \ldots, k$.

◮ Cross-validation accuracy estimate:

$$ S_{L_2}^{CV(k)} = \frac{1}{k} \sum_{j=1}^{k} MSE^{(j)}, \qquad
   MSE^{(j)} = \frac{1}{n_j} \sum_{i \in D_n^{(j)}} \bigl[Y_i - \hat f(X_i; D_n^{(-j)})\bigr]^2, $$

where $n_j$ is the number of observations in the j-th group.

◮ Computationally faster than leave-one-out CV, but more biased. In

practice 10–fold cross-validation is commonly used.

65
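A short k-fold CV sketch in R for the same linear model (the fold assignment and k = 10 are illustrative choices):

set.seed(1)
k     <- 10
dat   <- na.omit(airquality)                          # drop rows with missing values
folds <- sample(rep(1:k, length.out = nrow(dat)))     # random fold labels

mse <- sapply(1:k, function(j) {
  fit  <- lm(Ozone ~ Solar.R + Wind + Temp, data = dat[folds != j, ])
  pred <- predict(fit, newdata = dat[folds == j, ])
  mean((dat$Ozone[folds == j] - pred)^2)              # MSE on the held-out fold
})
mean(mse)                                             # k-fold CV accuracy estimate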

Bootstrap methods

◮ Bootstrapping refers to a self-sustained process that proceeds without

external help. The term is sometimes attributed to Rudolf Erich

Raspe’s story “The Surprising Adventures of Baron Munchausen”,

where the main character pulls himself (and his horse) out of a

swamp by his hair (specifically, his pigtail).

◮ In Statistics, the bootstrap is a method for assessing the accuracy of

statistical procedures by resampling from empirical distributions.

Introduced by Bradley Efron in 1979.

66

Bootstrap idea

◮ Training set: Dn = {(X1, Y1), . . . , (Xn, Yn)}.

◮ The idea is

– randomly draw, with replacement, B (bootstrap) datasets

{D∗n,b, b = 1, . . . , B}

from Dn, each sample D∗n,b being of size n;

– refit the model on each of the bootstrap datasets and compute an

accuracy measure;

– average the accuracy measures over the B replications.

67
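A minimal bootstrap sketch in R, again reusing the ozone model as an arbitrary example (B = 200 is an illustrative choice):

set.seed(1)
dat <- na.omit(airquality)
B   <- 200

boot_mse <- replicate(B, {
  idx <- sample(nrow(dat), replace = TRUE)            # bootstrap sample of the same size n
  fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = dat[idx, ])
  mean((dat$Ozone - predict(fit, newdata = dat))^2)   # accuracy measure on the original data
})
mean(boot_mse)                                        # bootstrap accuracy estimate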

[Figure: schematic of the bootstrap – the original data Z of n = 3 observations (Obs, X, Y) is resampled with replacement into bootstrap datasets Z*1, Z*2, ..., Z*B, each yielding an estimate α*1, α*2, ..., α*B.]

67-1

Bootstrap accuracy estimate

◮ The final bootstrap accuracy estimate:

$$ S_{L_2}^{B} = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{n} \sum_{i=1}^{n} \bigl[Y_i - \hat f(X_i; D^*_{n,b})\bigr]^2. $$

◮ There are modifications like leave-one-out bootstrap accuracy

estimates...

68

6. Review of Linear Algebra

69

Matrix and vector

◮ Matrix: a rectangular array of numbers, e.g., $A \in \mathbb{R}^{n \times p}$:

$$ A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1p} \\
a_{21} & a_{22} & \cdots & a_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{np}
\end{pmatrix}, \qquad A = \{a_{ij}\},\ i = 1, \ldots, n,\ j = 1, \ldots, p. $$

◮ Vector: a matrix containing one column, $x = [x_1; \cdots; x_n] \in \mathbb{R}^n$.

◮ Think of a matrix as a linear operation on vectors: when an $n \times p$

matrix A is applied to (multiplies) a vector $x \in \mathbb{R}^p$, it returns a

vector $y = Ax \in \mathbb{R}^n$.

70

1. Matrix multiplication

◮ If $A \in \mathbb{R}^{n \times p}$ and $B \in \mathbb{R}^{p \times m}$, $C = AB$, then $C \in \mathbb{R}^{n \times m}$ and

$$ c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}, \qquad i = 1, \ldots, n;\ j = 1, \ldots, m. $$

◮ Special cases

– inner product of vectors: if $x, y \in \mathbb{R}^n$ then $x^T y = \sum_{i=1}^{n} x_i y_i$.

– matrix–vector multiplication: if $A \in \mathbb{R}^{n \times p}$ and $x \in \mathbb{R}^p$ then

$$ Ax = [a_1, a_2, \ldots, a_p]\, x = \sum_{j=1}^{p} a_j x_j. $$

The product Ax is a linear combination of the matrix columns $\{a_j\}$

with weights $x_1, \ldots, x_p$.

71

2. Matrix multiplication

◮ Properties

– Associative: (AB)C = A(BC)

– Distributive: (A + B)C = AC + BC

– Non–commutative: in general AB ≠ BA

◮ Block multiplication: if A = [A_{ik}], B = [B_{kj}], where the A_{ik}'s and B_{kj}'s

are matrix blocks, and the number of columns in A_{ik} equals the

number of rows in B_{kj}, then

$$ C = AB = [C_{ij}], \qquad C_{ij} = \sum_{k} A_{ik} B_{kj}. $$

72

Special types of square matrices

◮ Diagonal: A = diag{a11, a22, . . . , ann}

◮ Identity: I = In = diag{1, . . . , 1} (n times)

◮ Symmetric: A = A^T

◮ Orthogonal: A^T A = In = A A^T

◮ Idempotent (projection): A² = A · A = A.

73

Linear independence and rank

◮ A set of vectors x1, . . . , xn is linearly independent if there exist no

constants c1, . . . , cn (not all zero) such that

c1x1 + · · · + cnxn = 0.

◮ The rank of A ∈ R^{n×p} is the maximal number of linearly independent

columns (or, equivalently, rows). If rank(A) = min(n, p) then it is

said that A is of full rank.

◮ Properties: rank(A) ≤ min(p, n), rank(A) = rank(A^T),

rank(AB) ≤ min{rank(A), rank(B)}, rank(A+B) ≤ rank(A) + rank(B).

74

Trace of matrix

◮ The trace of A ∈ R^{n×n} is the sum of the diagonal elements of A:

$$ \mathrm{tr}(A) = \sum_{i=1}^{n} a_{ii}. $$

◮ Properties

– tr(A) = tr(AT )

– tr(A+B) = tr(A) + tr(B)

– tr(α · A) = α · tr(A) for all α ∈ R

– tr(AB) = tr(BA)

– if x ∈ Rn, y ∈ Rn then tr(xyT ) = xT y.

75

1. Determinant

◮ 2 × 2 matrix: the determinant of the matrix is

$$ \det\!\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc. $$

The absolute value of the determinant is the area of the parallelogram

formed by the vectors (a, b) and (c, d).

◮ In general, if τ = (τ1, . . . , τn) is a permutation of {1, . . . , n} then

$$ \det(A) = \sum_{\tau} (-1)^{|\tau|}\, a_{1\tau_1} a_{2\tau_2} \cdots a_{n\tau_n}, $$

where |τ| = 0 if τ is a permutation with an even number of changes of

{1, . . . , n}, and |τ| = 1 otherwise.

76

2. Determinant

◮ Properties

– det(A) = det(A^T), det(αA) = α^n det(A), ∀α ∈ R;

– determinant changes its sign if two columns are interchanged;

– determinant vanishes if and only if there is a linear dependence

between its columns;

– det(AB) = det(A)det(B).

77

Inverse matrix

◮ If A ∈ Rn×n and rank(A) = n then the inverse of A, denoted A−1 is

the matrix such that AA−1 = A−1A = In.

◮ Properties:

(A−1)−1 = A; (AB)−1 = B−1A−1; (A−1)T = (AT )−1.

◮ If det(A) = 0 then matrix A is singular (the inverse matrix does not

exist).

78

Range and null space (kernel) of a matrix

◮ Span: for $x_i \in \mathbb{R}^p$, $i = 1, \ldots, n$,

$$ \mathrm{span}(x_1, \ldots, x_n) = \Bigl\{ \sum_{i=1}^{n} \alpha_i x_i : \alpha_i \in \mathbb{R},\ i = 1, \ldots, n \Bigr\}. $$

◮ Range: If A ∈ Rn×p then

Range(A) = {Ax : x ∈ Rp}.

Range(A) is the span of columns of A.

◮ Null space or kernel of a matrix:

Ker(A) = {x ∈ Rp : Ax = 0}.

79

Eigenvalues and eigenvectors

◮ Characteristic polynomial: Let A ∈ Rp×p; then

q(λ) = det(A− λI), λ ∈ R

is the characteristic polynomial of matrix A. Roots of this polynomial

λ1, . . . , λp are eigenvalues of matrix A: det(A− λjI) = 0, j = 1, . . . , p.

◮ A− λjI is a singular matrix; therefore there exists a non–zero vector

γ ∈ Rp such that Aγ = λjγ. This vector is called the eigenvector of A

corresponding to the eigenvalue λj . We can normalize eigenvectors so

that γT γ = 1.

◮ $q(\lambda) = (-1)^p \prod_{j=1}^{p} (\lambda - \lambda_j) = \det(A - \lambda I)$; hence

$\det(A) = q(0) = \prod_{j=1}^{p} \lambda_j$. In addition, $\mathrm{tr}(A) = \sum_{j=1}^{p} \lambda_j$.

80

Symmetric matrices, spectral decomposition

◮ All eigenvalues of a symmetric matrix are real.

◮ Orthogonal matrix: if $c_1, \ldots, c_p$ are orthonormal vectors (a basis), i.e.,

$c_i^T c_j = 0$ for $i \neq j$ and $c_i^T c_i = 1$ for all $i$, then the matrix $C = [c_1, c_2, \ldots, c_p]$ is

orthogonal: $CC^T = C^T C = I \Rightarrow C^{-1} = C^T$.

◮ Spectral decomposition: any symmetric p × p matrix A can be

represented as

$$ A = \Gamma \Lambda \Gamma^T = \sum_{j=1}^{p} \lambda_j \gamma_j \gamma_j^T, $$

where $\Lambda = \mathrm{diag}\{\lambda_1, \ldots, \lambda_p\}$, the $\lambda_j$'s are the eigenvalues, $\Gamma = [\gamma_1, \ldots, \gamma_p]$, and

the $\gamma_j$'s are the eigenvectors.

◮ If A is a non–singular symmetric matrix then $A^n = \Gamma \Lambda^n \Gamma^T$. In particular, if

$\lambda_j \geq 0$ for all $j$ then $\sqrt{A} = \Gamma \Lambda^{1/2} \Gamma^T$.

81
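A quick numerical check of the spectral decomposition in R (the matrix below is an arbitrary illustration):

A <- matrix(c(4, 1, 1, 3), nrow = 2)     # a symmetric 2x2 matrix
e <- eigen(A)                            # eigenvalues and eigenvectors
Gamma  <- e$vectors
Lambda <- diag(e$values)

Gamma %*% Lambda %*% t(Gamma)            # reconstructs A
c(sum(e$values), sum(diag(A)))           # trace equals the sum of the eigenvalues
c(prod(e$values), det(A))                # determinant equals their product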

Eigenvalues characterization for symmetric matrices

◮ Let A be an n × n symmetric matrix with eigenvalues

$$ \lambda_{\min} = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_{n-1} \leq \lambda_n = \lambda_{\max}. $$

Then

$$ \lambda_{\min}\, x^T x \leq x^T A x \leq \lambda_{\max}\, x^T x, \qquad \forall x \in \mathbb{R}^n, $$

$$ \lambda_{\max} = \max_{x \neq 0} \frac{x^T A x}{x^T x} = \max_{x^T x = 1} x^T A x, \qquad
   \lambda_{\min} = \min_{x \neq 0} \frac{x^T A x}{x^T x} = \min_{x^T x = 1} x^T A x. $$

In addition, if $\gamma_1, \ldots, \gamma_n$ are the eigenvectors corresponding to $\lambda_1, \ldots, \lambda_n$,

then

$$ \max_{x \neq 0,\ x \perp \gamma_n, \ldots, \gamma_{n-k+1}} \frac{x^T A x}{x^T x} = \lambda_{n-k}, \qquad k = 1, \ldots, n-1. $$

82

Quadratic forms, projection matrices, etc.

◮ Quadratic form: Q(x) = xTAx.

◮ For symmetric A: if Q(x) = xTAx > 0 for all x ≠ 0 in Rp then A is

positive definite. Equivalently, λj(A) > 0 for all j = 1, . . . , p.

◮ Projection (idempotent) matrix: P = P 2. Typical example is the hat

matrix in regression. If X ∈ Rn×p then

P = X(XTX)−1XT

is idempotent. It is projection on the column space (range) of

matrix X .

83

7. Multiple Linear Regression

84

Linear regression model

◮ Model: the response Y is modeled as a linear function of the predictors

X1, . . . , Xp plus an error ǫ:

$$ Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon. $$

The data are $\{Y_i, X_{i1}, \ldots, X_{ip},\ i = 1, \ldots, n\}$. Then

$$ \underbrace{Y}_{n \times 1} = \underbrace{X}_{n \times (p+1)}\ \underbrace{\beta}_{(p+1) \times 1} + \underbrace{\epsilon}_{n \times 1}, $$

where

$$ Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}, \quad
   X = \begin{pmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1p} \\
                       \vdots & \vdots & \vdots & & \vdots \\
                       1 & X_{n1} & X_{n2} & \cdots & X_{np} \end{pmatrix}, \quad
   \epsilon = \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}. $$

85

Model fitting

◮ Basic assumptions: zero-mean, uncorrelated errors:

$$ E\epsilon = 0, \qquad \mathrm{cov}(\epsilon) = E\epsilon\epsilon^T = \sigma^2 I_n, \quad I_n \ \text{the } n \times n \ \text{identity matrix}. $$

◮ Least squares estimator $\hat\beta$

The idea is to minimize the sum of squared errors, $\min_\beta S(\beta)$, where

$$ S(\beta) = (Y - X\beta)^T (Y - X\beta) = Y^T Y - 2\beta^T X^T Y + \beta^T X^T X \beta. $$

Differentiate S(β) w.r.t. β and set to zero:

$$ \nabla_\beta S(\beta) = -2X^T Y + 2X^T X\beta = 0 \ \Rightarrow\ \hat\beta = [X^T X]^{-1} X^T Y, $$

provided that $X^T X$ is non–singular.

86
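A short check in R that the closed-form least squares solution matches lm() (the ozone model is used purely as an example):

dat <- na.omit(airquality)
X   <- cbind(1, dat$Solar.R, dat$Wind, dat$Temp)       # design matrix with intercept column
Y   <- dat$Ozone

beta_hat <- solve(t(X) %*% X, t(X) %*% Y)              # beta = (X'X)^{-1} X'Y
beta_hat
coef(lm(Ozone ~ Solar.R + Wind + Temp, data = dat))    # the same numbers from lm()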

Predicted values and residuals

◮ Predicted (fitted) values

$$ \hat Y = X\hat\beta = X(X^T X)^{-1} X^T Y = \underbrace{H}_{n \times n} Y, $$

H is the hat matrix; $H = H^T$, $H = H^2$ (a projection matrix): it projects

onto the column space of X.

◮ Residuals

$$ e = Y - \hat Y = (I_n - H) Y, \qquad I_n \ \text{the } n \times n \ \text{identity matrix}. $$

87

Sums of squares

◮ Residual sum of squares:

$$ SSE = \sum_{i=1}^{n} e_i^2 = e^T e = (Y - \hat Y)^T (Y - \hat Y). $$

◮ Total sum of squares: for $\bar Y = \frac{1}{n}\sum_{i=1}^{n} Y_i$,

$$ SST = \sum_{i=1}^{n} (Y_i - \bar Y)^2 = (Y - 1_n \bar Y)^T (Y - 1_n \bar Y), \qquad 1_n = (1, \ldots, 1)^T. $$

It characterizes the variability of the response Y around its average.

◮ Regression sum of squares

$$ SS_{reg} = SST - SSE. $$

It characterizes the variability in the data explained by the regression model.

88

Sums of squares

◮ The R² value characterizes the proportion of variability in the data explained

by the regression model:

$$ R^2 = \frac{SS_{reg}}{SST}. $$

The closer R² is to one, the more variability is explained. R² grows as more

predictor variables are added to the model.

89

1. Inference in the linear regression model

◮ Basic assumption: $\epsilon \sim N_n(0, \sigma^2 I_n)$. Under this assumption, if there is

no relationship between X1, . . . , Xp and Y, i.e., if

$\beta_1 = \beta_2 = \cdots = \beta_p = 0$, then

$$ MS_{reg} = \frac{SS_{reg}}{p} \sim \chi^2(p), \qquad MSE = \frac{SSE}{n - p - 1} \sim \chi^2(n - p - 1). $$

◮ The F–test for the hypothesis $H_0: \beta_1 = \cdots = \beta_p = 0$ is based on the fact that

$$ F^* = \frac{SS_{reg}/p}{SSE/(n - p - 1)} \sim F(p,\ n - p - 1). $$

Then H0 is rejected when $F^* > F_{(1-\alpha)}(p,\ n - p - 1)$.

90

2. Inference in the linear regression model

◮ Inference on individual coefficients: $\hat\beta \sim N_{p+1}(\beta, \sigma^2 (X^T X)^{-1})$, so

$$ \frac{\hat\beta_k - \beta_k}{\mathrm{s.d.}(\hat\beta_k)} \sim t(n - p - 1), \qquad k = 0, 1, \ldots, p, $$

where $\mathrm{s.d.}(\hat\beta_k)$ is the square root of the corresponding diagonal element of

$\hat\sigma^2 (X^T X)^{-1}$, $\hat\sigma^2 = MSE$. Hence $H_0: \beta_k = 0$ is rejected if

$$ |t^*| > t_{(1-\alpha/2)}(n - p - 1), \qquad t^* = \frac{\hat\beta_k}{\mathrm{s.d.}(\hat\beta_k)}. $$

91

1. LS regression diagnostics

◮ Non–linearity of the response–predictor relationship

[Figure: "Residual Plot for Linear Fit" and "Residual Plot for Quadratic Fit" – residuals versus fitted values, with a few labeled observations.]

92

2. LS regression diagnostics

◮ Correlations of errors

Standard errors are computed under the assumption of independent errors.

If the errors are correlated, confidence intervals can

have lower coverage probability than expected.

◮ Tests for serial correlation: run test, sign changes tests, etc.

◮ Time series models

93

3. LS regression diagnostics

◮ Non–constant variance of errors

[Figure: residuals versus fitted values for the response Y (funnel-shaped spread) and for the response log(Y) (roughly constant spread).]

◮ Remedy: transformations

94

4. LS regression diagnostics

◮ Outliers: points where the response variable is unusually large (small)

given the predictors. Outliers can be detected in residual plots.

◮ High leverage points have unusual X values.

◮ Collinearity refers to a situation when two or more predictors are

closely related to each other. The matrix XTX is close to singular.

95

Example: ozone data

># airquality data set

>ozone.lm <- lm (Ozone~Solar.R+Wind+Temp, data=airquality)

>summary(ozone.lm)

Call:

lm(formula = Ozone ~ Solar.R + Wind + Temp, data = airquality)

Residuals:

Min 1Q Median 3Q Max

-40.485 -14.219 -3.551 10.097 95.619

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -64.34208 23.05472 -2.791 0.00623 **

Solar.R 0.05982 0.02319 2.580 0.01124 *

Wind -3.33359 0.65441 -5.094 1.52e-06 ***

Temp 1.65209 0.25353 6.516 2.42e-09 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.18 on 107 degrees of freedom

Multiple R-Squared: 0.6059, Adjusted R-squared: 0.5948

F-statistic: 54.83 on 3 and 107 DF, p-value: < 2.2e-16

96

1. Plots of residuals

◮ Plot of residuals vs fitted values

>plot(ozone.lm$fitted.values, ozone.lm$residuals, xlab="Fitted values",

+ ylab="Residuals")

>abline(0,0)

[Figure: residuals versus fitted values for the ozone model, with a horizontal reference line at zero.]

97

2. Plots of residuals

◮ QQ–plot of residuals

>qqnorm(ozone.lm$residuals)

>qqline(ozone.lm$residuals)

[Figure: Normal Q–Q plot of the ozone model residuals with the qqline reference line.]

98

1. Example: Boston housing data

◮ Response variable: median value of homes (medv)

◮ Predictor variables:

– crime rate (crim); % land zones for lots (zn)

– % nonretail business (indus); 1/0 on Charles river (chas)

– nitrogen oxide concentration (nox); average number of rooms (rm)

– % built before 1940 (age); tax rate (tax)

– weighted distance to employment centers (dis)

– % lower-status population (lstat); % black (black)

– accessibility to radial highways (rad)

– pupil/teacher ratio (ptratio)

99

2. Example: Boston housing data

> library(MASS)

> Boston.lm<- lm(medv~., data=Boston)

> summary(Boston.lm)

Call:

lm(formula = medv ~ ., data = Boston)

Residuals:

Min 1Q Median 3Q Max

-15.594 -2.730 -0.518 1.777 26.199

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***

crim -1.080e-01 3.286e-02 -3.287 0.001087 **

zn 4.642e-02 1.373e-02 3.382 0.000778 ***

indus 2.056e-02 6.150e-02 0.334 0.738288

chas 2.687e+00 8.616e-01 3.118 0.001925 **

nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***

rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***

100

age 6.922e-04 1.321e-02 0.052 0.958229

dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***

rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***

tax -1.233e-02 3.760e-03 -3.280 0.001112 **

ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***

black 9.312e-03 2.686e-03 3.467 0.000573 ***

lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.745 on 492 degrees of freedom

Multiple R-Squared: 0.7406, Adjusted R-squared: 0.7338

F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16

◮ indus and age are not significant at the 0.05 level.

>fmBoston=as.formula("medv~crim+zn+chas+nox+rm+dis+rad+tax+ptratio+black+lstat")

> Boston1.lm <- lm(fmBoston, data=Boston)

> summary(Boston1.lm)

Call:

lm(formula = fmBoston, data = Boston)

Residuals:

101

Min 1Q Median 3Q Max

-15.5984 -2.7386 -0.5046 1.7273 26.2373

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 36.341145 5.067492 7.171 2.73e-12 ***

crim -0.108413 0.032779 -3.307 0.001010 **

zn 0.045845 0.013523 3.390 0.000754 ***

chas 2.718716 0.854240 3.183 0.001551 **

nox -17.376023 3.535243 -4.915 1.21e-06 ***

rm 3.801579 0.406316 9.356 < 2e-16 ***

dis -1.492711 0.185731 -8.037 6.84e-15 ***

rad 0.299608 0.063402 4.726 3.00e-06 ***

tax -0.011778 0.003372 -3.493 0.000521 ***

ptratio -0.946525 0.129066 -7.334 9.24e-13 ***

black 0.009291 0.002674 3.475 0.000557 ***

lstat -0.522553 0.047424 -11.019 < 2e-16 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.736 on 494 degrees of freedom

Multiple R-Squared: 0.7406, Adjusted R-squared: 0.7348

F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16

102

1. Boston housing data: residual plots

> plot(residuals(Boston1.lm))

> abline(0,0)

> qqnorm(residuals(Boston1.lm))

> qqline(residuals(Boston1.lm))

103

2. Boston housing data: residual plots

[Figure: index plot of residuals(Boston1.lm) and the corresponding Normal Q–Q plot.]

104

8. Linear Model Selection and Regularization

105

The need for model selection

◮ We have many different potential predictors. Why not base the

model on all of them?

◮ Two sides of one coin: bias and variance

– The model with more predictors can describe the phenomenon

better – less bias.

– When we estimate more parameters, the variance of estimators

grows – we “fit the noise”, overfitting!

◮ A clever model selection strategy should resolve

the bias–variance trade–off.

106

Subset selection and coefficient shrinkage

◮ Why is least squares not always satisfactory?

∗ Prediction accuracy: the LS estimates often have large variance

(collinearity problems, a large number of predictors, etc.)

∗ Interpretation: with a large number of predictors, we often would

like to determine a smaller subset that exhibits the strongest effects.

◮ Two approaches:

∗ subset selection (identify a subset of predictors having strongest

effect on the response variable)

∗ coefficient shrinkage; this has an effect of reducing variance of

estimates (but increasing bias...)

107

1. Criteria for subset selection

◮ How to judge if a subset of predictors is good? The R2 index is

useless as it increases as new variables are added to the model.

◮ Criteria for subset selection. The idea is to adjust or penalize the

residual sum of squares SSE for the model complexity.

– Mallows' Cp: $C_p = \frac{1}{n}\,[SSE + 2p\hat\sigma^2]$

– AIC (Akaike information criterion): penalization of the likelihood

function; for linear regression it is equivalent to Cp

– BIC (Bayesian information criterion): $BIC = \frac{1}{n}\,[SSE + p\log(n)\hat\sigma^2]$

– Adjusted $R^2 = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)}$

108

2. Criteria for subset selection

◮ Typical use of the criteria

[Figure: Cp, BIC and Adjusted R² plotted against the Number of Predictors.]

◮ Procedures: best subset selection, stepwise (forward, backward)

selection.

109

Best subset selection

1. Let M0 be the model without predictors.

2. For k = 1, . . . , p:

• Fit all $\binom{p}{k}$ models that contain exactly k predictors;

• Pick the best among these $\binom{p}{k}$ models – the one with largest R² – and call it Mk.

3. Select among M0, . . . , Mp using Cp, AIC, BIC, etc.

110

1. Forward selection procedure

◮ Fact: assume that we have two models:

– the first contains p variables;

– the second contains the same p variables plus q more variables.

We want to test H0: the q additional variables are not significant. If $\epsilon \sim N(0, \sigma^2)$,

then under H0

$$ \frac{[SS_{reg}(p+q) - SS_{reg}(p)]/q}{SSE(p+q)/(n - p - q - 1)} \sim F(q,\ n - p - q - 1). $$

◮ Idea of the forward selection procedure: at each step add the variable

which maximizes the F-statistic (provided it is significant, i.e., greater

than the corresponding (1 − α)-quantile).

◮ Instead of the F–test, at each step one can use AIC.

111

2. Forward selection procedure

1. Fit a simple regression model for each variable $x_k$, $k = 1, \ldots, p$, and

compute

$$ F_{x_k} = \frac{SS_{reg}(x_k)/1}{SSE(x_k)/(n-2)}, \qquad k = 1, \ldots, p. $$

Select the variable $x_{k_1}$ with $k_1 = \arg\max_k F_{x_k}$; if $F_{x_{k_1}} > F_{(1-\alpha)}(1, n-2)$,

add $x_{k_1}$ to the model.

2. Fit the models with predictors $(x_k, x_{k_1})$ and compute

$$ F_{x_k \mid x_{k_1}} = \frac{[SS_{reg}(x_k, x_{k_1}) - SS_{reg}(x_{k_1})]/1}{SSE(x_k, x_{k_1})/(n-3)}, \qquad k = 1, \ldots, p,\ k \neq k_1. $$

Select $k_2 = \arg\max_k F_{x_k \mid x_{k_1}}$, compare with $F_{(1-\alpha)}(1, n-3)$, and if

significant add $x_{k_2}$ to the model. Proceed in the same way.

112

Boston housing data

> library(MASS)

> maxfmla<-as.formula(paste("medv~", paste(names(Boston[,-14]), collapse="+")))

> maxfmla

medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +

tax + ptratio + black + lstat

> Boston.lm <-lm(medv~1, data=Boston)

> Boston.fwd<-step(Boston.lm,direction="forward", scope=list(upper=maxfmla),test="F")

Start: AIC=2246.51

medv ~ 1

Df Sum of Sq RSS AIC F value Pr(>F)

+ lstat 1 23243.9 19472 1851.0 601.618 < 2.2e-16 ***

+ rm 1 20654.4 22062 1914.2 471.847 < 2.2e-16 ***

+ ptratio 1 11014.3 31702 2097.6 175.106 < 2.2e-16 ***

+ indus 1 9995.2 32721 2113.6 153.955 < 2.2e-16 ***

+ tax 1 9377.3 33339 2123.1 141.761 < 2.2e-16 ***

+ nox 1 7800.1 34916 2146.5 112.591 < 2.2e-16 ***

+ crim 1 6440.8 36276 2165.8 89.486 < 2.2e-16 ***

+ rad 1 6221.1 36495 2168.9 85.914 < 2.2e-16 ***

113

+ age 1 6069.8 36647 2171.0 83.478 < 2.2e-16 ***

+ zn 1 5549.7 37167 2178.1 75.258 < 2.2e-16 ***

+ black 1 4749.9 37966 2188.9 63.054 1.318e-14 ***

+ dis 1 2668.2 40048 2215.9 33.580 1.207e-08 ***

+ chas 1 1312.1 41404 2232.7 15.972 7.391e-05 ***

<none> 42716 2246.5

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step: AIC=1851.01

medv ~ lstat

Df Sum of Sq RSS AIC F value Pr(>F)

+ rm 1 4033.1 15439 1735.6 131.3942 < 2.2e-16 ***

+ ptratio 1 2670.1 16802 1778.4 79.9340 < 2.2e-16 ***

+ chas 1 786.3 18686 1832.2 21.1665 5.336e-06 ***

+ dis 1 772.4 18700 1832.5 20.7764 6.488e-06 ***

+ age 1 304.3 19168 1845.0 7.9840 0.004907 **

+ tax 1 274.4 19198 1845.8 7.1896 0.007574 **

+ black 1 198.3 19274 1847.8 5.1764 0.023316 *

+ zn 1 160.3 19312 1848.8 4.1758 0.041527 *

+ crim 1 146.9 19325 1849.2 3.8246 0.051059 .

114

+ indus 1 98.7 19374 1850.4 2.5635 0.109981

<none> 19472 1851.0

+ rad 1 25.1 19447 1852.4 0.6491 0.420799

+ nox 1 4.8 19468 1852.9 0.1239 0.724966

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step: AIC=1735.58

medv ~ lstat + rm

Df Sum of Sq RSS AIC F value Pr(>F)

+ ptratio 1 1711.32 13728 1678.1 62.5791 1.645e-14 ***

+ chas 1 548.53 14891 1719.3 18.4921 2.051e-05 ***

+ black 1 512.31 14927 1720.5 17.2290 3.892e-05 ***

+ tax 1 425.16 15014 1723.5 14.2154 0.0001824 ***

+ dis 1 351.15 15088 1725.9 11.6832 0.0006819 ***

+ crim 1 311.42 15128 1727.3 10.3341 0.0013900 **

+ rad 1 180.45 15259 1731.6 5.9367 0.0151752 *

+ indus 1 61.09 15378 1735.6 1.9942 0.1585263

<none> 15439 1735.6

+ zn 1 56.56 15383 1735.7 1.8457 0.1748999

+ age 1 20.18 15419 1736.9 0.6571 0.4179577

115

+ nox 1 14.90 15424 1737.1 0.4849 0.4865454

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step: AIC=1678.13

medv ~ lstat + rm + ptratio

Df Sum of Sq RSS AIC F value Pr(>F)

+ dis 1 499.08 13229 1661.4 18.9009 1.668e-05 ***

+ black 1 389.68 13338 1665.6 14.6369 0.0001468 ***

+ chas 1 377.96 13350 1666.0 14.1841 0.0001854 ***

+ crim 1 122.52 13606 1675.6 4.5115 0.0341560 *

+ age 1 66.24 13662 1677.7 2.4291 0.1197340

<none> 13728 1678.1

+ tax 1 44.36 13684 1678.5 1.6242 0.2031029

+ nox 1 24.81 13703 1679.2 0.9072 0.3413103

+ zn 1 14.96 13713 1679.6 0.5467 0.4600162

+ rad 1 6.07 13722 1679.9 0.2218 0.6378931

+ indus 1 0.83 13727 1680.1 0.0301 0.8622688

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

116

..............................................................

Step: AIC=1596.1

medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn +

crim + rad

Df Sum of Sq RSS AIC F value Pr(>F)

+ tax 1 273.619 11081 1585.8 12.1978 0.0005214 ***

<none> 11355 1596.1

+ indus 1 33.894 11321 1596.6 1.4790 0.2245162

+ age 1 0.096 11355 1598.1 0.0042 0.9485270

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step: AIC=1585.76

medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn +

crim + rad + tax

Df Sum of Sq RSS AIC F value Pr(>F)

<none> 11081 1585.8

+ indus 1 2.51754 11079 1587.7 0.1120 0.7380

+ age 1 0.06271 11081 1587.8 0.0028 0.9579

117

Shrinkage methods

◮ The idea of regularization: contrary to subset selection the idea is to

fit a model keeping all coefficients, but to impose constraints on the

size of the coefficients. For example, to shrink the coefficients to zero.

◮ In general, regularization refers to a process of introducing additional

information in order to solve an ill-posed problem or to prevent

overfitting.

118

Ridge regression

◮ Ridge regression shrinks the regression coefficients by imposing a penalty

on their size:

$$ \hat\beta_\lambda = \arg\min_\beta \Bigl\{ \sum_{i=1}^{n} \Bigl(Y_i - \beta_0 - \sum_{j=1}^{p} X_{ij}\beta_j\Bigr)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Bigr\}
 = \arg\min_\beta \bigl\{ (Y - X\beta)^T (Y - X\beta) + \lambda \beta^T \beta \bigr\}, $$

$$ \hat\beta_\lambda = (X^T X + \lambda I)^{-1} X^T Y. $$

◮ The ridge parameter λ must be chosen: a larger λ results in smaller

variance but bigger bias.

◮ $\hat\beta_\lambda$ is a linear estimator.

◮ Usually $\hat\beta_\lambda$ is computed for a range of λ values.

119
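A small sketch of the closed-form ridge solution in R on standardized Boston predictors (λ = 1 is an arbitrary choice; MASS::lm.ridge, used later in these notes, handles the scaling for you and uses a slightly different convention):

library(MASS)
X <- scale(as.matrix(Boston[, 1:13]))        # centered and rescaled predictors
Y <- Boston$medv - mean(Boston$medv)         # centered response
lambda <- 1

beta_ridge <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% Y)
head(beta_ridge)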

2. Ridge regression

◮ The matrix X is usually centered and rescaled: if $x_j$ is the j-th

(j = 1, . . . , p) column of X, define $Z = [z_1, \ldots, z_p]$ by

$$ z_j = \frac{1}{S_j}(x_j - \bar x_j), \qquad S_j^2 = \frac{1}{n}(x_j - \bar x_j)^T (x_j - \bar x_j). $$

Y is also centered. Then consider the model without an intercept:

$$ Y = Z\theta + \epsilon. $$

◮ Ridge trace: graphs of the coefficient estimates as functions of λ.

◮ Generalized Cross-Validation (GCV): select the λ that minimizes

$$ V(\lambda) = \frac{\frac{1}{n}\, Y^T [I - A(\lambda)]^2 Y}{\bigl(\frac{1}{n}\,\mathrm{tr}[I - A(\lambda)]\bigr)^2}, \qquad A(\lambda) = X(X^T X + \lambda I)^{-1} X^T. $$

120

2. Ridge regression in R

> Boston.ridge<-lm.ridge(medv~., Boston, lambda=seq(0, 100, 0.1))

◮ Output values

* scales - scalings used on the X matrix.

* Inter - was intercept included?

* lambda - vector of lambda values

* ym - mean of y

* xm - column means of x matrix

* GCV - vector of GCV values

* kHKB - HKB estimate of the ridge constant.

* kLW - L-W estimate of the ridge constant.

> plot(Boston.ridge) # produces ridge trace

121

Ridge trace: Boston housing data

[Figure: ridge trace for the Boston housing data – coefficient estimates t(x$coef) plotted against x$lambda from 0 to 100.]

122

Ridge regression in R

> select(Boston.ridge)

modified HKB estimator is 4.594163

modified L-W estimator is 3.961575

smallest value of GCV at 4.3

>

> Boston.ridge.cv<-lm.ridge(medv~.,Boston,lambda=4.3)

> Boston.ridge.cv$coef

crim zn indus chas nox rm

-0.895001937 1.020966996 0.049465334 0.694878367 -1.943248437 2.707866705

age dis rad tax ptratio black

-0.005646034 -2.992453378 2.384190136 -1.819613735 -2.026897293 0.847413719

lstat

-3.689619529

123

LASSO

◮ LASSO estimator

$$ \hat\beta^{lasso} = \arg\min_\beta \sum_{i=1}^{n} \Bigl(Y_i - \beta_0 - \sum_{j=1}^{p} X_{ij}\beta_j\Bigr)^2
   \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \leq t. $$

◮ Comparison to ridge: $\sum_{j=1}^{p} \beta_j^2$ is replaced by $\sum_{j=1}^{p} |\beta_j|$.

◮ Properties:

* the estimator $\hat\beta^{lasso}$ is non–linear;

* a small t causes some of the coefficients to be exactly zero; if t is

larger than $t_0 = \sum_{j=1}^{p} |\hat\beta_j^{LS}|$ then $\hat\beta^{lasso} = \hat\beta^{LS}$;

* a kind of continuous subset selection.

124

LASSO and ridge

125

LASSO, ridge and best subset selection

◮ Ridge regression

$$ \hat\beta^{ridge} = \arg\min_\beta \Bigl\{ \sum_{i=1}^{n} \Bigl(Y_i - \beta_0 - \sum_{j=1}^{p} X_{ij}\beta_j\Bigr)^2 \ \Bigm|\ \sum_{j=1}^{p} \beta_j^2 \leq s \Bigr\} $$

◮ Lasso

$$ \hat\beta^{lasso} = \arg\min_\beta \Bigl\{ \sum_{i=1}^{n} \Bigl(Y_i - \beta_0 - \sum_{j=1}^{p} X_{ij}\beta_j\Bigr)^2 \ \Bigm|\ \sum_{j=1}^{p} |\beta_j| \leq t \Bigr\} $$

◮ Best subset selection

$$ \hat\beta^{sparse} = \arg\min_\beta \Bigl\{ \sum_{i=1}^{n} \Bigl(Y_i - \beta_0 - \sum_{j=1}^{p} X_{ij}\beta_j\Bigr)^2 \ \Bigm|\ \sum_{j=1}^{p} I\{\beta_j \neq 0\} \leq s \Bigr\} $$

◮ LASSO and ridge are computationally feasible alternatives to best

subset selection.

126

1. LASSO and ridge in a special case

Assume that n = p and X is the identity matrix.

◮ Least squares estimator: $\hat\beta^{ls} = \arg\min_\beta \sum_{i=1}^{n} (Y_i - \beta_i)^2$, so

$$ \hat\beta_i^{ls} = Y_i, \qquad i = 1, \ldots, n. $$

◮ Ridge regression: $\hat\beta^{ridge} = \arg\min_\beta \bigl\{ \sum_{i=1}^{n} (Y_i - \beta_i)^2 + \lambda \sum_{i=1}^{n} \beta_i^2 \bigr\}$, so

$$ \hat\beta_i^{ridge} = \frac{Y_i}{1 + \lambda}, \qquad i = 1, \ldots, n. $$

◮ LASSO: $\hat\beta^{lasso} = \arg\min_\beta \bigl\{ \sum_{i=1}^{n} (Y_i - \beta_i)^2 + \lambda \sum_{i=1}^{n} |\beta_i| \bigr\}$, so

$$ \hat\beta_i^{lasso} = \begin{cases}
Y_i - \lambda/2, & Y_i > \lambda/2, \\
Y_i + \lambda/2, & Y_i < -\lambda/2, \\
0, & |Y_i| \leq \lambda/2,
\end{cases} \qquad i = 1, \ldots, n. $$

127
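The lasso solution in this special case is the soft-thresholding rule; a tiny R sketch (the function name is illustrative):

soft_threshold <- function(y, lambda)
  sign(y) * pmax(abs(y) - lambda / 2, 0)     # shrink towards 0, set small values exactly to 0

y <- c(-2, -0.3, 0.1, 0.8, 3)
rbind(ls    = y,
      ridge = y / (1 + 1),                   # ridge with lambda = 1
      lasso = soft_threshold(y, 1))          # lasso with lambda = 1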

2. LASSO and ridge in a special case

◮ $\hat\beta_i^{lasso} = \text{soft thresholding}(Y_i)$

[Figure: coefficient estimates as functions of y – ridge (proportional shrinkage) and lasso (soft thresholding) compared with the least squares line.]

128

LASSO in R

> library(lars)

> library(MASS)

> x<-as.matrix(Boston[,1:13])

> y<-as.vector(Boston[,14])

> Boston.lasso <- lars(x,y,type="lasso")

> summary(Boston.lasso)

LARS/LASSO

Call: lars(x = x, y = y, type = "lasso")

Df Rss Cp

0 1 42716 1392.997

1 2 36326 1111.195

2 3 21335 447.485

3 4 14960 166.356

4 5 14402 143.588

5 6 13667 112.931

6 7 13449 105.281

7 8 13117 92.515

8 9 12423 63.717

9 10 11950 44.700

129

10 11 11899 44.446

11 12 11730 38.934

12 13 11317 22.590

13 12 11086 10.341

14 13 11080 12.032

15 14 11079 14.000

◮ Print and plot of complete coefficient path

> print(Boston.lasso)

Call:

lars(x = x, y = y, type = "lasso")

R-squared: 0.741

Sequence of LASSO moves:

lstat rm ptratio black chas crim dis nox zn indus rad tax indus indus age

Var 13 6 11 12 4 1 8 5 2 3 9 10 -3 3 7

Step 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

> plot(Boston.lasso)

Returns a plot of the coefficient values against $s = t \big/ \sum_{j=1}^{p} |\hat\beta_j^{LS}|$, $0 \leq s \leq 1$.

130

LASSO coefficients path

[Figure: LASSO coefficient paths for the Boston data – standardized coefficients plotted against |beta|/max|beta|.]

131

Cross-validated choice of s

> cv.lars(x,y, K=10)

Returns the K-fold CV mean squared prediction error plotted against s.

[Figure: 10-fold cross-validated MSE as a function of the fraction s.]

132

Prediction and extraction of coefficients

◮ Extraction of LASSO coefficients for given s

> Boston.coef.03<-coef(Boston.lasso, s=0.3, mode="fraction")

> Boston.coef.03

crim zn indus chas nox rm age

0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 3.3707730 0.0000000

dis rad tax ptratio black lstat

0.0000000 0.0000000 0.0000000 -0.4299806 0.0000000 -0.4664715

◮ Prediction

> Boston.lasso.pr<-predict(Boston.lasso, x, s=0.3, mode="fraction", type="fit")

133

9. Logistic Regression

134

Example

◮ Age and coronary heart disease (CHD) status data for 100 subjects:

response (Y) – absence or presence (0/1) of CHD; predictor (X) –

age.

[Figure: scatterplot of CHD status (0/1) against AGE.]

135

Logistic regression model

◮ Linear regression is not appropriate:

$$ E(Y \mid X = x) = P(Y = 1 \mid X = x) = \beta_0 + \beta_1 x $$

should lie in [0, 1] for all x.

◮ The idea is to model the relationship between $p(x) = P(Y = 1 \mid X = x)$ and x

using the logistic response function

$$ p(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}
   \quad\Longleftrightarrow\quad
   \mathrm{logit}\{p(x)\} := \log\frac{p(x)}{1 - p(x)} = \beta_0 + \beta_1 x. $$

136

Logistic response function

[Figure: the logistic response function p(x) plotted against x.]

◮ Why logit? For fixed x the odds p(x)/(1 − p(x)) are naturally viewed on a log scale: one usually quotes odds like '10 to 1' or '2 to 1'.

◮ A special case of the generalized linear model (GLM) with logit link function: g(E(Y | X = x)) = β_0 + β_1 x, g(z) = log[z/(1 − z)], 0 < z < 1.

137

Interpretation of the logistic regression model

◮ If p(x) = 0.75 then the odds of getting CHD at age x are 3 to 1.

◮ If x = 0 then

   log[ p(0) / (1 − p(0)) ] = β_0  ⇔  p(0) / (1 − p(0)) = e^{β_0}.

Thus e^{β_0} can be interpreted as the baseline odds, especially if zero is within the range of the data for the predictor variable X.

◮ If we increase x by one unit, we multiply the odds by e^{β_1}. If β_1 > 0 then e^{β_1} > 1 and the odds increase; if β_1 < 0 then the odds decrease.

138

1. Likelihood function

◮ Data and model: D_n = {(Y_i, X_i), i = 1, . . . , n}, Y_i ∈ {0, 1}, i.i.d.,

   π_i = π(X_i) = P(Y_i = 1 | X_i) = E(Y_i | X_i) = e^{β_0 + β_1 X_i} / (1 + e^{β_0 + β_1 X_i}),  i = 1, . . . , n.

◮ Likelihood and log–likelihood (to be maximized w.r.t. β_0 and β_1):

   L(β_0, β_1; D_n) = ∏_{i=1}^n π_i^{Y_i} (1 − π_i)^{1−Y_i}
                    = ∏_{i=1}^n [e^{β_0+β_1 X_i} / (1 + e^{β_0+β_1 X_i})]^{Y_i} [1 / (1 + e^{β_0+β_1 X_i})]^{1−Y_i}
                    = ∏_{i=1}^n e^{(β_0+β_1 X_i) Y_i} / (1 + e^{β_0+β_1 X_i}),

   log L(β_0, β_1; D_n) = ∑_{i=1}^n Y_i (β_0 + β_1 X_i) − ∑_{i=1}^n log{1 + e^{β_0+β_1 X_i}}.

139

2. Likelihood function

S_1(β_0, β_1) = ∂ log L(β_0, β_1) / ∂β_0 = ∑_{i=1}^n Y_i − ∑_{i=1}^n e^{β_0+β_1 X_i} / (1 + e^{β_0+β_1 X_i}),

S_2(β_0, β_1) = ∂ log L(β_0, β_1) / ∂β_1 = ∑_{i=1}^n X_i Y_i − ∑_{i=1}^n X_i e^{β_0+β_1 X_i} / (1 + e^{β_0+β_1 X_i}).

◮ Solve the system S_1(β_0, β_1) = 0, S_2(β_0, β_1) = 0 for β_0, β_1.

◮ No closed form solution is available; the solution is found by an iterative procedure.

140

1. Fitting the model: Newton–Raphson algorithm

◮ Idea of the algorithm:

   – Assume we want to solve the equation g(x) = 0.
   – Let x* be the solution; if x is close to x* then

     0 = g(x*) ≈ g(x) + g′(x)(x* − x)  ⇒  x* ≈ x − g(x)/g′(x).

◮ Iterative procedure:

   – Let x_k be the current approximation to x* (at the kth stage); define the next approximation x_{k+1} by

     x_{k+1} = x_k − g(x_k)/g′(x_k),  k = 1, 2, . . .

   – Stop when g(x_k) is small, e.g., |g(x_k)| ≤ ε.

141

2. Fitting the model: Newton–Raphson algorithm

1. Let β_0^{(j)} and β_1^{(j)} be the current parameter approximations after the j-th step of the algorithm.

2. Let

   J(β_0, β_1) = − [ ∂S_1/∂β_0   ∂S_1/∂β_1
                     ∂S_2/∂β_0   ∂S_2/∂β_1 ].

3. Compute

   ( β_0^{(j+1)}, β_1^{(j+1)} )^T = ( β_0^{(j)}, β_1^{(j)} )^T + J^{−1}(β_0^{(j)}, β_1^{(j)}) ( S_1(β_0^{(j)}, β_1^{(j)}), S_2(β_0^{(j)}, β_1^{(j)}) )^T.

4. Continue until a convergence criterion is met.
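◮ A minimal R sketch of this Newton–Raphson iteration for simple logistic regression; logit.nr is an illustrative helper, and x, y are assumed numeric vectors with y ∈ {0, 1}.

logit.nr <- function(x, y, tol = 1e-8, maxit = 25) {
  X <- cbind(1, x)                          # design matrix with intercept
  beta <- c(0, 0)
  for (k in 1:maxit) {
    p <- 1 / (1 + exp(-drop(X %*% beta)))   # fitted probabilities
    S <- t(X) %*% (y - p)                   # score vector (S1, S2)
    J <- t(X) %*% (X * (p * (1 - p)))       # minus the Hessian of the log-likelihood
    beta <- beta + solve(J, S)              # Newton-Raphson update
    if (max(abs(S)) < tol) break
  }
  drop(beta)
}
# logit.nr(agchd$Age, agchd$CHD) should agree with coef(glm(CHD ~ Age, family = binomial))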

142

Extension to multiple predictors

◮ Model: X_i = (1, X_{i1}, . . . , X_{ip}), i = 1, . . . , n, β = (β_0, β_1, . . . , β_p),

   π_i = π(X_i) = P(Y_i = 1 | X_i) = E(Y_i | X_i) = exp{β^T X_i} / (1 + exp{β^T X_i}).

◮ Likelihood and log–likelihood:

   L(β; D_n) = ∏_{i=1}^n [π(X_i)]^{Y_i} [1 − π(X_i)]^{1−Y_i}
             = ∏_{i=1}^n [e^{β^T X_i} / (1 + e^{β^T X_i})]^{Y_i} [1 / (1 + e^{β^T X_i})]^{1−Y_i}
             = ∏_{i=1}^n e^{β^T X_i Y_i} / (1 + e^{β^T X_i}),

   log L(β; D_n) = ∑_{i=1}^n β^T X_i Y_i − ∑_{i=1}^n log{1 + e^{β^T X_i}}.

This should be maximized with respect to β.

143

1. Fitting the model and assessing the fit

◮ No closed form solution is available; the solution is found by an iterative procedure.

◮ If β̂ is the ML estimate of β then the fitted values are

   Ŷ_i = π̂(X_i) = exp{β̂^T X_i} / (1 + exp{β̂^T X_i}).

◮ Deviance is twice the difference between the log–likelihoods evaluated at (a) the MLE π̂(X_i), and (b) π̃(X_i) = Y_i:

   G² = 2 ∑_{i=1}^n { Y_i log[ Y_i / π̂(X_i) ] + (1 − Y_i) log[ (1 − Y_i) / (1 − π̂(X_i)) ] } =: ∑_{i=1}^n dev(Y_i, π̂(X_i)).

◮ Deviance residuals: r_i = sign{Y_i − π̂(X_i)} √dev(Y_i, π̂(X_i)).

144

2. Fitting the model and assessing the fit

◮ The degrees of freedom (df) associated with the deviance G² equal n − (p + 1); (p + 1) is the dimension of the vector β.

◮ Pearson's X² is an approximation to the deviance:

   X² = ∑_{i=1}^n [Y_i − π̂(X_i)]² / { π̂(X_i) [1 − π̂(X_i)] }.

◮ Comparing models: let x = (x_1, x_2), and consider testing

   H_0: log{ π(x) / (1 − π(x)) } = β^T x_1   versus   H_1: log{ π(x) / (1 − π(x)) } = β^T x_1 + η^T x_2.

Obtain the deviance G²_0 with df_0 under H_0, and G²_1 with the corresponding df_1 under H_1. Under the null: G²_0 − G²_1 ≈ χ²(df_0 − df_1).

145

1. Example

> agchd <-read.table("Age-CHD.dat", header=T)

> agchd.glm<-glm(CHD~Age, data=agchd, family=binomial)

> summary(agchd.glm)

Call:

glm(formula = CHD ~ Age, family = binomial, data = agchd)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.9718 -0.8456 -0.4576 0.8253 2.2859

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -5.30945 1.13365 -4.683 2.82e-06 ***

Age 0.11092 0.02406 4.610 4.02e-06 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 136.66 on 99 degrees of freedom

Residual deviance: 107.35 on 98 degrees of freedom

AIC: 111.35

Number of Fisher Scoring iterations: 4

146

2. Example

◮ The fitted model

   log{ P(CHD | Age) / [1 − P(CHD | Age)] } = −5.309 + 0.111 × Age

As Age grows by one unit, the odds of having CHD are multiplied by e^{0.111} ≈ 1.117.

◮ The null deviance is the G² statistic for the null model (without slope).

◮ The residual deviance is the G² statistic for the fitted model.

147

Classification using logistic regression

◮ With any value x of the predictor variable, the fitted logistic regression model associates the probability π̂(x).

◮ Classification rule: for some threshold τ ∈ (0, 1) (e.g., τ = 1/2) let

   Ŷ(x) = 1 if π̂(x) > τ,  and  Ŷ(x) = 0 if π̂(x) ≤ τ.

Varying τ gives some feel for the efficacy of the model.

148

Sensitivity and specificity

◮ Classification results can be represented in the form of the table

                  True 0   True 1
   Predicted 0      a        b
   Predicted 1      c        d

◮ Sensitivity is the proportion of correctly predicted 1's (true positives):

   Sensitivity = d / (b + d).

◮ Specificity is the proportion of correctly predicted 0's (true negatives):

   Specificity = a / (a + c).

149

ROC curve

◮ Receiver Operating Characteristic (ROC) curve: Sensitivity versus

1 - Specificity as the threshold τ varies from 0 to 1.

[Figure: ROC curve for the CHD data, Sensitivity against 1 − Specificity, as τ varies between its endpoints τ = 0 and τ = 1.]

150

Interpretation of ROC curve

◮ If τ = 1 then we never classify an observation as positive; here

Sensitivity = 0, Specificity = 1.

◮ If τ = 0 then everything will be classified as positive; here

Sensitivity = 1 and Specificity = 0.

◮ As τ varies between 0 and 1 there is a trade–off between

Sensitivity and Specificity; one looks for the value of τ which

gives ”large” sensitivity and specificity.

◮ The closer the ROC curve is to the 45-degree line, the less useful the model.

151

3. Example

> tau<-0.5 # threshold

> agch1<-as.numeric(fitted(agchd.glm)>=tau)

> table(agchd$CHD, agch1)

agch1

0 1

0 45 12

1 14 29

> 29/(29+14)       # sensitivity
[1] 0.6744186
> 45/(45+12)       # specificity
[1] 0.7894737

> tau<-0.6

> agch1<-as.numeric(fitted(agchd.glm)>=tau)

> table(agchd$CHD, agch1)

agch1

0 1

0 50 7

1 18 25

> 25/(25+18)       # sensitivity
[1] 0.5813953
> 50/(50+7)        # specificity
[1] 0.877193
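◮ A minimal sketch (using the agchd.glm fit above) that sweeps the threshold τ and traces the ROC curve by hand.

taus <- seq(0, 1, by = 0.01)
roc <- t(sapply(taus, function(tau) {
  pred <- as.numeric(fitted(agchd.glm) >= tau)
  c(sens = sum(pred == 1 & agchd$CHD == 1) / sum(agchd$CHD == 1),
    spec = sum(pred == 0 & agchd$CHD == 0) / sum(agchd$CHD == 0))
}))
plot(1 - roc[, "spec"], roc[, "sens"], type = "l",
     xlab = "1 - Specificity", ylab = "Sensitivity", main = "ROC curve, CHD data")
abline(0, 1, lty = 2)      # the 45-degree reference line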

152

1. Another example: iris data

◮ Independent variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width.

◮ Response variable: flower species (setosa, versicolor and virginica).

◮ The ‘iris’ dataset consists of 150 observations, 50 from each species.

Consider a logistic regression on a single species type, versicolor.

> data(iris)

> tmpdata <- iris

> Versicolor <- as.numeric(tmpdata[,"Species"]=="versicolor")

> tmpdata[,"Species"] <- Versicolor

> fmla <- as.formula(paste("Species ~ ",paste(names(tmpdata)[1:4],

collapse="+")))

> ilr <- glm(fmla, data=tmpdata, family=binomial(logit))

> summary(ilr)

Call:

153

glm(formula = fmla, family = binomial(logit), data = tmpdata)

Deviance Residuals:

Min 1Q Median 3Q Max

-2.1281 -0.7668 -0.3818 0.7866 2.1202

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 7.3785 2.4993 2.952 0.003155 **

Sepal.Length -0.2454 0.6496 -0.378 0.705634

Sepal.Width -2.7966 0.7835 -3.569 0.000358 ***

Petal.Length 1.3136 0.6838 1.921 0.054713 .

Petal.Width -2.7783 1.1731 -2.368 0.017868 *

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 190.95 on 149 degrees of freedom

Residual deviance: 145.07 on 145 degrees of freedom

AIC: 155.07

Number of Fisher Scoring iterations: 5

154

2. Another example: Iris data

◮ Model without Sepal.Length

> ilr1<-glm(Species ~ Sepal.Width + Petal.Length + Petal.Width,

+ data=tmpdata, family=binomial(logit))

> summary(ilr1)

Call:

glm(formula = Species ~ Sepal.Width + Petal.Length + Petal.Width,

family = binomial(logit), data = tmpdata)

Deviance Residuals:

Min 1Q Median 3Q Max

-2.1262 -0.7731 -0.3984 0.8063 2.1562

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 6.9506 2.2261 3.122 0.00179 **

Sepal.Width -2.9565 0.6668 -4.434 9.26e-06 ***

Petal.Length 1.1252 0.4619 2.436 0.01484 *

Petal.Width -2.6148 1.0815 -2.418 0.01562 *

155

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 190.95 on 149 degrees of freedom

Residual deviance: 145.21 on 146 degrees of freedom

AIC: 153.21

Number of Fisher Scoring iterations: 5

156

10. Review of multivariate normal distribution.

157

Standard multivariate normal distribution

◮ Definition:

Let Z_1, . . . , Z_n be independent N(0, 1) random variables. The vector Z = (Z_1, . . . , Z_n) has the standard multivariate normal distribution:

   f_Z(z) = f_{Z_1,...,Z_n}(z_1, . . . , z_n) = ∏_{j=1}^n (1/√(2π)) e^{−z_j²/2} = (1/√(2π))^n exp{−½ z^T z}.

We write Z ∼ N_n(0, I), where 0 is the expectation, EZ = 0, and I is the covariance matrix, cov(Z) = E[(Z − EZ)(Z − EZ)^T] = E ZZ^T = I.

158

Multivariate normal distribution

◮ Transformation: let A ∈ R^{n×n} and µ ∈ R^n. Define the random vector

   Y = AZ + µ.

◮ Y ∼ N_n(µ, A A^T):

   EY = E(AZ) + µ = A EZ + µ = µ,
   cov(Y) = E(Y − EY)(Y − EY)^T = A E(ZZ^T) A^T = A A^T.

◮ Definition: the distribution of a random vector Y ∈ R^n is multivariate normal with expectation µ ∈ R^n and covariance matrix Σ ∈ R^{n×n} if

   f_Y(y) = (2π)^{−n/2} |det(Σ)|^{−1/2} exp{ −½ (y − µ)^T Σ^{−1} (y − µ) }.

Σ > 0 is the covariance matrix; we write Y ∼ N_n(µ, Σ).

159

Properties

◮ If a ∈ R^n and Y ∼ N_n(µ, Σ) then

   X = a^T Y = ∑_{i=1}^n a_i Y_i ∼ N(a^T µ, a^T Σ a).

◮ In general, if B ∈ R^{q×n} and Y ∼ N_n(µ, Σ) then X = BY ∼ N_q(Bµ, B Σ B^T).

◮ Any sub–vector of a multivariate normal vector is multivariate normal.

◮ If the elements of a multivariate normal vector are uncorrelated then they are independent.

◮ If X_1, . . . , X_n iid∼ N(µ, σ²) then X̄_n and s² are independent (in fact, the sample is normal if and only if X̄_n and s² are independent).
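◮ A minimal R sketch of the transformation result: Y = AZ + µ with A A^T = Σ has the N_n(µ, Σ) distribution (here A is taken from the Cholesky factorization; the numbers are illustrative).

set.seed(1)
mu <- c(1, -2)
Sigma <- matrix(c(2, 0.8, 0.8, 1), 2, 2)
A <- t(chol(Sigma))                      # lower-triangular A with A %*% t(A) = Sigma
Z <- matrix(rnorm(2 * 5000), nrow = 2)   # columns are iid N_2(0, I) vectors
Y <- A %*% Z + mu                        # each column ~ N_2(mu, Sigma)
rowMeans(Y); cov(t(Y))                   # should be close to mu and Sigma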

160

11. Discriminant Analysis

161

Cushing syndrome data

Types of the disease: (a), (b) and (c).

Variables: concentrations of Tetrahydrocortisone and Pregnanetriol.

[Figure: Cushings syndrome data, Pregnanetriol (log scale) against Tetrahydrocortisone, with points labelled by disease type a, b, c (u = unknown).]

162

1. Classification problem

◮ Problem: we are given a pair of variables (X, Y )

– X ∈ X ⊆ Rp is the vector of predictor variables (features), belongs

to one of K populations;

– Y is the label of the population;

◮ Assume

   – π_k is the prior probability that X belongs to the kth population:

     π_k = P(Y = k) > 0,   ∑_{k=1}^K π_k = 1.

   – If observation X belongs to the kth population then X ∼ p_k(x).

◮ Problem: we observe X = x. How to predict the label Y ?

163

2. Classification problem

◮ Classification rule: any function g : X → {1, . . . , K}. It defines a partition of the feature space X into K disjoint sets:

   X = ∪_{k=1}^K A_k,   A_k = {x : g(x) = k}.

◮ Accuracy of a classifier: L(g) = P{g(X) ≠ Y}.

◮ Optimal rule:

   g* = argmin_{g : X → {1,...,K}} P{g(X) ≠ Y}.

164

Bayes rule

◮ Bayes formula

   p(k|x) = P(Y = k | X = x) = P(X = x | Y = k) P(Y = k) / ∑_{k=1}^K P(X = x | Y = k) P(Y = k)
          = π_k p_k(x) / ∑_{k=1}^K π_k p_k(x).

◮ Bayes classification rule:

   g*(x) = argmax_{k=1,...,K} p(k|x) = argmax_{k=1,...,K} [π_k p_k(x)].

◮ Theorem: the Bayes rule g* minimizes the probability of error, i.e.,

   L(g*) = P{g*(X) ≠ Y} ≤ L(g) = P{g(X) ≠ Y}  for all g.

Bayes error:

   L(g*) = 1 − ∫_X max_{k=1,...,K} [π_k p_k(x)] dx.

165

Proof

First, note that L(g) = P{g(X) ≠ Y} = E[ P{g(X) ≠ Y | X = x} ];

   P{g(X) ≠ Y | X = x} = 1 − P{g(X) = Y | X = x}
                       = 1 − ∑_{k=1}^K P{g(X) = k, Y = k | X = x}
                       = 1 − ∑_{k=1}^K I{g(x) = k} P(Y = k | X = x)
                       = 1 − ∑_{k=1}^K I{g(x) = k} p(k|x) ≥ 1 − max_k p(k|x).

The proof is completed by taking expectation and noting that, by definition,

   P{g*(X) ≠ Y | X = x} = 1 − max_k p(k|x).

166

The Bayes rule for normal populations

◮ Assume K = 2 groups with prior probabilities π_k, k = 1, 2, and that the distribution p_k(x) of the kth population is N_p(µ_k, Σ_k);

   π_k p_k(x) = π_k / [(2π)^{p/2} |Σ_k|^{1/2}] exp{ −½ (x − µ_k)^T Σ_k^{−1} (x − µ_k) },  k = 1, 2.

◮ The Bayes rule decides 1 if π_1 p_1(x) ≥ π_2 p_2(x), and 2 otherwise. Equivalently, one can compare h_1(x) with h_2(x), where

   h_k(x) = log{π_k p_k(x)}
          = −½ (x − µ_k)^T Σ_k^{−1} (x − µ_k)  [=: −½ M_k²]  − (p/2) log(2π) − ½ log|Σ_k| + log π_k.

167

Case I: equal covariance matrices

◮ If Σ_1 = Σ_2 = Σ then the Bayes rule decides 1 when h_1(x) − h_2(x) ≥ 0, i.e.,

   h_1(x) − h_2(x) = −½ M_1² + log π_1 + ½ M_2² − log π_2
                   = (µ_1 − µ_2)^T Σ^{−1} ( x − (µ_1 + µ_2)/2 ) + log(π_1/π_2) ≥ 0.

This results in the linear decision surface

   (µ_1 − µ_2)^T Σ^{−1} ( x − (µ_1 + µ_2)/2 ) = log(π_2/π_1).

168

Case II: non–equal covariance matrices

◮ If Σ_1 ≠ Σ_2 then the Bayes rule decides 1 if

   −½ (x − µ_1)^T Σ_1^{−1} (x − µ_1) − ½ log|Σ_1| + log π_1
   ≥ −½ (x − µ_2)^T Σ_2^{−1} (x − µ_2) − ½ log|Σ_2| + log π_2.

This results in a quadratic decision surface in R^p.

◮ The Bayes rule cannot be implemented because neither p_k(x) nor π_k are known. The idea is to estimate the unknown parameters from the data...

169

1. Linear discriminant analysis

◮ Data: two samples of sizes n_1 and n_2 from normal populations

   X_{k1}, . . . , X_{kn_k} ∼ N_p(µ_k, Σ),  k = 1, 2.

◮ Estimates of the means µ_k:

   µ̂_k = X̄_{k·} = (1/n_k) ∑_{j=1}^{n_k} X_{kj}.

◮ Pooled estimator of Σ:

   S_pooled = [ (n_1 − 1) S_1 + (n_2 − 1) S_2 ] / (n_1 + n_2 − 2),
   S_k = (1/(n_k − 1)) ∑_{j=1}^{n_k} (X_{kj} − µ̂_k)(X_{kj} − µ̂_k)^T.

170

2. Linear discriminant analysis

◮ Classification rule (LDA): decide the first group if

   (µ̂_1 − µ̂_2)^T S_pooled^{−1} ( x − (µ̂_1 + µ̂_2)/2 ) ≥ log(π̂_2/π̂_1),

where π̂_k = n_k/n, k = 1, 2.

◮ Although LDA is the "plug–in Bayes classifier" for normal populations, it can be applied to any distribution of the data.

171

1. Another interpretation of the LDA

◮ Idea: find a linear transformation of X such that the separation between the groups is maximal. Let β ∈ R^p and Z = β^T X.

   – if X is from the first group, then µ_{1,Z} = EZ = E β^T X = β^T µ_1;
   – if X is from the second group, then µ_{2,Z} = EZ = E β^T X = β^T µ_2;
   – var(Z) does not depend on the group: σ_Z² = var(β^T X) = β^T Σ β.

◮ Choose β so that

   (µ_{1,Z} − µ_{2,Z})² / σ_Z² = [β^T (µ_1 − µ_2)]² / (β^T Σ β) → max.

The solution of this problem is β* = c Σ^{−1}(µ_1 − µ_2), c ≠ 0.

172

2. Another interpretation of the LDA

◮ Estimate of β*:  β̂* = S_pooled^{−1}(µ̂_1 − µ̂_2).

◮ Estimates of µ_{1,Z} = β*^T µ_1 and µ_{2,Z} = β*^T µ_2:

   µ̂_{1,Z} = (µ̂_1 − µ̂_2)^T S_pooled^{−1} µ̂_1,   µ̂_{2,Z} = (µ̂_1 − µ̂_2)^T S_pooled^{−1} µ̂_2.

◮ LDA classification rule: for given x decide the first group if

   (µ̂_1 − µ̂_2)^T S_pooled^{−1} x ≥ ½ (µ̂_{1,Z} + µ̂_{2,Z})
   ⇔ (µ̂_1 − µ̂_2)^T S_pooled^{−1} x ≥ ½ (µ̂_1 − µ̂_2)^T S_pooled^{−1} (µ̂_1 + µ̂_2).

173

1. Example: Leptograpsus Crabs data

Two color forms (blue and orange), 50 of each form of each sex.

◮ sp species - ”B” or ”O” for blue or orange

◮ sex

◮ index index 1:50 within each of the four groups

◮ FL frontal lobe size (mm)

◮ RW rear width (mm)

◮ CL carapace length (mm)

◮ CW carapace width (mm)

◮ BD body depth (mm)

174

2. Example: Leptograpsus Crabs data

> library(MASS)

> attach(crabs)

> lcrabs<-cbind(sp, sex, log(crabs[,4:8]))

> lcrabs.lda<-lda(sex~FL+RW+CL+CW, lcrabs) # Linear discriminant analysis

> lcrabs.lda

Call:

lda.formula(sex ~ FL + RW + CL + CW, data = lcrabs)

Prior probabilities of groups:

F M

0.5 0.5

Group means:

FL RW CL CW

F 2.708720 2.579503 3.421028 3.555941

M 2.730305 2.466848 3.464200 3.583751

Coefficients of linear discriminants:

LD1

175

FL -2.889616

RW -25.517644

CL 36.316854

CW -11.827981

LD1 is the vector S_pooled^{−1}(µ̂_1 − µ̂_2) (up to scaling).

> lcrabs.pred<-predict(lcrabs.lda)

> table(lcrabs$sex, lcrabs.pred$class)

F M

F 97 3

M 3 97

> (3+3)/(97+97+3+3) # Resubstitution (naive) estimate

[1] 0.03

> # Cross-validation estimate

> lcrabs.cv.lda<-lda(sex~FL+RW+CL+CW, lcrabs, CV=T)

> table(lcrabs$sex, lcrabs.cv.lda$class)

F M

F 96 4

M 3 97
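◮ A minimal by-hand check of the discriminant direction β̂* = S_pooled^{−1}(µ̂_1 − µ̂_2) on the same crabs variables; the result should be proportional to the LD1 coefficients reported by lda().

library(MASS)
X <- log(crabs[, c("FL", "RW", "CL", "CW")])
g1 <- X[crabs$sex == "F", ]; g2 <- X[crabs$sex == "M", ]
mu1 <- colMeans(g1); mu2 <- colMeans(g2)
Sp <- ((nrow(g1) - 1) * cov(g1) + (nrow(g2) - 1) * cov(g2)) / (nrow(g1) + nrow(g2) - 2)
solve(Sp, mu1 - mu2)      # proportional to the LD1 coefficients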

176

2. Example: Leptograpsus Crabs data

> plot(lcrabs.lda)   # group histograms in the discriminant direction

[Figure: histograms of the LD1 scores for group F (top) and group M (bottom).]

177

1. Quadratic discriminant analysis

◮ QDA can be viewed as the plug–in Bayes rule for normal populations with non–equal covariance matrices. Now Σ_1 and Σ_2 are substituted by

   S_k = (1/(n_k − 1)) ∑_{j=1}^{n_k} (X_{kj} − µ̂_k)(X_{kj} − µ̂_k)^T,  k = 1, 2.

◮ The QDA rule: decide the first group if

   −½ (x − µ̂_1)^T S_1^{−1} (x − µ̂_1) − ½ log|S_1| + log π̂_1
   ≥ −½ (x − µ̂_2)^T S_2^{−1} (x − µ̂_2) − ½ log|S_2| + log π̂_2.

178

2. Quadratic discriminant analysis

> # Quadratic discriminant analysis

>

> lcrabs.qda<-qda(sex~FL+RW+CL+CW, lcrabs)

> lcrabs.qda.pred<-predict(lcrabs.qda)

> table(lcrabs$sex, lcrabs.qda.pred$class)

F M

F 97 3

M 4 96

> lcrabs.cv.qda<-qda(sex~FL+RW+CL+CW, lcrabs, CV=T)

> table(lcrabs$sex, lcrabs.cv.qda$class)

F M

F 96 4

M 5 95

179

12. Classification: k–Nearest Neighbors

180

1. k–nearest neighbors classifier

◮ Data: (X_i, Y_i), i = 1, . . . , n, i.i.d. random pairs, Y_i ∈ {0, 1}, X_i ∈ R^d.

◮ Let d(·, ·) be a distance measure, and for a given x ∈ R^d consider the numbers d_i(x) = d(X_i, x). Let d_(i)(x) be the ith order statistic, i.e.,

   d_(1)(x) ≤ d_(2)(x) ≤ · · · ≤ d_(n)(x).

◮ The set of k–nearest neighbors of x:

   A_k(x) = {X_i : d(X_i, x) ≤ d_(k)(x)}.

◮ Classifier:

   g_n(x) = 1 if ∑_{i=1}^n w_{n,i}(x) I(Y_i = 1) > ∑_{i=1}^n w_{n,i}(x) I(Y_i = 0), and 0 otherwise,

where w_{n,i}(x) = 1/k if X_i ∈ A_k(x), and zero otherwise.

181

2. k–nearest neighbors classifier

◮ Choice of k: often k = 1 is chosen, giving the 1-NN classifier. Large k results in more averaging; small k leads to more variability. Asymptotic theory suggests

   – k → ∞ as n → ∞,
   – k/n → 0 as n → ∞.

◮ Choice of the distance: the most common choice is Euclidean.

182

Spam data

Data: 4601 instances, 57 attributes

◮ Most of the attributes indicate whether a particular word or character occurs frequently in the e-mail. In particular, 48 continuous real [0, 100] attributes give the percentage of words in the e-mail that match WORD, i.e.

   100 × (number of times the WORD appears in the e-mail) / (total number of words in the e-mail).

◮ WORD is any string of alphanumeric characters bounded by

non-alphanumeric characters or end-of-string.

◮ The run-length attributes (55-57) measure the length of sequences of

consecutive capital letters.

183

k–nearest neighbors: spam data

> library(class)

> spam.d<-spam[, 1:57] # training set

> spam.cl <- spam[,58] # true classifications

> spam.1nn <- knn.cv(spam.d, spam.cl, k=1) # 1-NN with cross-validation

> table(spam.1nn, spam.cl)

spam.cl

spam.1nn 0 1

0 2398 390

1 390 1423

> (390+390)/(2398+1423+390+390) # cross-validation misclassification rate

[1] 0.1695284

184

Spam data: k-nn misclassification rate

[Figure: cross-validated misclassification rate of the K-NN classifier on the spam data against K (K = 1, . . . , 20; error roughly between 0.17 and 0.22).]
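◮ A minimal sketch (using spam.d and spam.cl as defined above) of the loop that produces such a curve.

library(class)
ks <- 1:20
err <- sapply(ks, function(k) {
  pred <- knn.cv(spam.d, spam.cl, k = k)    # leave-one-out CV predictions
  mean(pred != spam.cl)                     # misclassification rate for this k
})
plot(ks, err, type = "b", xlab = "K", ylab = "Misclassification rate")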

185

Spam data: LDA versus k-nn

> library(MASS)

> spam.cv.lda<-lda(V58~.,data=spam, CV=T)

> spam.cv.lda

Call:

lda(V58 ~ ., data = spam)

Prior probabilities of groups:

0 1

0.6059552 0.3940448

Group means:

V1 V2 V3 V4 V5 V6 V7

0 0.0734792 0.2444656 0.2005811 0.0008859397 0.1810402 0.04454448 0.00938307

1 0.1523387 0.1646498 0.4037948 0.1646718147 0.5139548 0.17487590 0.27540541

V8 V9 V10 V11 V12 V13 V14

0 0.03841463 0.03804878 0.1671700 0.02171090 0.5363235 0.06166428 0.04240316

1 0.20814120 0.17006067 0.3505074 0.11843354 0.5499724 0.14354661 0.08357419

...........................................................................

186

> table(spam.cl, spam.cv.lda$class)

spam.cl 0 1

0 2652 136

1 390 1423

> (136+390)/(136+390+2652+1423)

[1] 0.1143230 # cross-validation misclassification rate

187

13. Classification: decision trees (CART)

188

1. Example

◮ In a hospital, when a heart attack patient is admitted, 19 variables

are measured during the first 24 hours. Among the variables: blood

pressure, age and 17 other ordered or binary variables summarizing

different symptoms.

◮ Based on the 24–hours data, the objective of the study is to identify

high risk patients (those who will not survive at least 30 days).

189

2. Example

[Figure: binary decision tree built from the questions "Is the minimum systolic blood pressure over the initial 24 hours > 91?", "Is age > 62.5?" and "Is sinus tachycardia present?"; terminal nodes are labelled G = high risk, F = not high risk.]

190

1. Binary trees: basic notions

◮ Binary tree is constructed by repeated splits of subsets of X into two

descendant subsets:

– a single variable is found which ”best” splits the data into two

groups

– the data is separated and the process is applied to each sub–group

– stop when the subgroups reach minimal size, or no improvement

can be made.

◮ Terminology

– the root node = X ; a node = a subset of X (circles).

– terminal nodes = subsets which are not split (rectangular box);

each terminal node is designated by the class label.

191

2. Binary trees: basic notions

◮ Construction of a tree requires:

– The selection of splits

– The decisions when to declare a node terminal or to continue

splitting it

– The assignment of each terminal node to a class

192

Notation

◮ n – number of observations; K - number of classes

◮ N(t) – total number of observations at node t

◮ Nk(t) – number of observations from class k at node t

◮ p(k|t) – proportion of observations X at the node t belonging to kth

class, k = 1, . . . , K

   p(k|t) = N_k(t) / N(t) = #{observations from class k at node t} / #{observations at node t},

   p(t) = [p(1|t), . . . , p(K|t)].

◮ Y(t) – class assigned to the node t:

   Y(t) = argmax_{k=1,...,K} p(k|t)

193

1. Impurity of the node

◮ Node is pure if it contains data only from one class.

◮ Impurity measure Imp(t) of node t for classification into K classes:

Imp(t) = φ(p(t)), p(t) = [p(1|t), . . . , p(K|t)]

where φ is a non–negative function of p(t) satisfying the following

conditions:

(a) φ has a unique maximum at ( 1K , . . . ,

1K );

(b) φ achieves minimum at (1, 0, . . . , 0), (0, 1, . . . , 0), . . . , (0, . . . , 1);

(c) φ is a symmetric function of p1, . . . , pK .

◮ Imp(t) is largest when all classes are equaly mixed, and smallest when

node contains one class.

194

2. Impurity of the node

◮ Examples of the impurity measure:

   Imp(t) = − ∑_{k=1}^K p(k|t) log{p(k|t)}    [entropy]

   Imp(t) = 1 − ∑_{k=1}^K p²(k|t)             [Gini index]

where

   p(k|t) = N_k(t) / N(t) = #{observations from class k at node t} / #{observations at node t}.

195

1. To split or not to split?

◮ Split S: node t is split into two "sons" t_L and t_R

   – π_L – proportion of observations at t going to t_L
   – π_R – proportion of observations at t going to t_R

[Diagram: node t split into the left son t_L (proportion π_L) and the right son t_R (proportion π_R).]

196

2. To split or not to split?

◮ The goodness of a split S is defined by the decrease in the impurity measure

   Φ(S, t) = ∆Imp(t) = Imp(t) − π_L Imp(t_L) − π_R Imp(t_R).

◮ Idea: choose the split S that maximizes Φ(S, t).

◮ Impurity of the tree T:

   Imp(T) = ∑_{t ∈ T̃} π(t) Imp(t),

where T̃ is the set of terminal nodes and π(t) is the proportion of the whole population at node t.

197

Numerical example

Node 0 (root): 100 observations with class counts (60, 40). It is split into node 1 with 70 observations and class counts (50, 20) (π(1) = 0.7), and node 2 with 30 observations and class counts (10, 20) (π(2) = 0.3).

   Imp(t0) = −(60/100) log(60/100) − (40/100) log(40/100) = 0.673
   Imp(t1) = −(50/70) log(50/70) − (20/70) log(20/70) = 0.598
   Imp(t2) = −(20/30) log(20/30) − (10/30) log(10/30) = 0.637

   ∆Imp(t0) = 0.673 − 0.7 × 0.598 − 0.3 × 0.637 ≈ 0.063
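◮ A minimal R check of this numerical example (the entropy helper is redefined here so the snippet is self-contained).

entropy <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }
imp0 <- entropy(c(60, 40) / 100)
imp1 <- entropy(c(50, 20) / 70)
imp2 <- entropy(c(10, 20) / 30)
imp0 - 0.7 * imp1 - 0.3 * imp2     # decrease in impurity, about 0.063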

198

Splitting rules

X = (X1, . . . , Xp) is the vector of features.

◮ Splits are determined by a standardized set of questions Q:

   – Each split depends only on a single variable.
   – For ordered X_i the questions in Q are of the form {Is X_i ≤ c?}, c ∈ (−∞, ∞).
   – If X_i is categorical, taking values in B = {b_1, . . . , b_m}, then the questions in Q are {Is X_i ∈ A?}, where A is any subset of B.

◮ At each node CART looks at the variables X_i one by one, finds the best split for each X_i, and chooses the best of the best.

199

Stop–splitting and assignment rules

◮ Stop splitting if

– decrease in impurity measure of a node is less than a prespecified

threshold

– impurity of the whole tree is less than a threshold

– depth of tree is greater than some parameter

– ...

◮ CART does not employ a stopping rule; pruning is used instead.

◮ To a terminal node t assign the class

   Y(t) = argmax_{k=1,...,K} p(k|t).

200

1. Example: Stage C prostate cancer data

◮ Data: on 146 stage C prostate cancer patients, 7 variables, pgstat –

response

– pgtime – time to progression

– pgstat – status at last follow-up (1=progressed, 0=censored)

– age – age at diagnosis

– eet – early endocrine therapy (1=no, 0=yes)

– ploidy – diploid/tetraploid/aneuploid DNA pattern

– grade – tumor grade (1–4)

– gleason – Gleason grade

201

2. Example: Stage C prostate cancer data

> library(rpart)

> stagec<-read.table("stagec.dat", header=T, sep=",")

> cfit<-rpart(pgstat~age+eet+grade+gleason+ploidy, data=stagec, method="class")

> print(cfit)

n= 146

node), split, n, loss, yval, (yprob)

* denotes terminal node

1) root 146 54 0 (0.6301370 0.3698630)

2) grade< 2.5 61 9 0 (0.8524590 0.1475410) *

3) grade>=2.5 85 40 1 (0.4705882 0.5294118)

6) ploidy< 1.5 29 11 0 (0.6206897 0.3793103)

12) gleason< 7.5 22 7 0 (0.6818182 0.3181818) *

13) gleason>=7.5 7 3 1 (0.4285714 0.5714286) *

7) ploidy>=1.5 56 22 1 (0.3928571 0.6071429) # child nodes of node t

14) age>=61.5 34 17 0 (0.5000000 0.5000000) # are numbered as

28) age< 64.5 12 4 0 (0.6666667 0.3333333) * # 2t (left) and 2t+1

29) age>=64.5 22 9 1 (0.4090909 0.5909091) * # (right)

15) age< 61.5 22 5 1 (0.2272727 0.7727273) *

202

3. Example: Stage C prostate cancer data (default tree)

> plot(cfit)

> text(cfit)

◮ Some default parameters

* minsplit=20: minimal number of observations in a node for which

split is computed

* minbucket=minsplit/3: minimal number of observation in a

terminal node

* cp=0.01: complexity parameter

203

[Figure: the default rpart tree for the stage C data, with splits on grade, ploidy, gleason and age; terminal nodes labelled 0 or 1.]

204

> summary(cfit)

Call:

rpart(formula = pgstat ~ age + eet + grade + gleason + ploidy,

data = stagec, method = "class")

...............................................................

Node number 1: 146 observations, complexity param=0.1111111

predicted class=0 expected loss=0.369863

class counts: 92 54

probabilities: 0.630 0.370

left son=2 (61 obs) right son=3 (85 obs)

Primary splits:

grade < 2.5 to the left, improve=10.35759000, (0 missing)

gleason < 5.5 to the left, improve= 8.39957400, (3 missing)

ploidy < 1.5 to the left, improve= 7.65653300, (0 missing)

age < 58.5 to the right, improve= 1.38812800, (0 missing)

eet < 1.5 to the right, improve= 0.07407407, (2 missing)

Surrogate splits:

gleason < 5.5 to the left, agree=0.863, adj=0.672, (0 split)

ploidy < 1.5 to the left, agree=0.644, adj=0.148, (0 split)

age < 66.5 to the right, agree=0.589, adj=0.016, (0 split)

205

Node number 2: 61 observations

predicted class=0 expected loss=0.147541

class counts: 52 9

probabilities: 0.852 0.148

Node number 3: 85 observations, complexity param=0.1111111

predicted class=1 expected loss=0.4705882

class counts: 40 45

probabilities: 0.471 0.529

left son=6 (29 obs) right son=7 (56 obs)

Primary splits:

ploidy < 1.5 to the left, improve=1.9834830, (0 missing)

age < 56.5 to the right, improve=1.6596080, (0 missing)

gleason < 8.5 to the left, improve=1.6386550, (0 missing)

eet < 1.5 to the right, improve=0.1086108, (1 missing)

Surrogate splits:

age < 72.5 to the right, agree=0.682, adj=0.069, (0 split)

gleason < 9.5 to the right, agree=0.682, adj=0.069, (0 split)

....................................................................

206

◮ Grades 1 and 2 go to the left, grades 3 and 4 go to the right.

◮ The improvement is n times the change in the impurity index. The largest improvement is for grade, 10.36. The actual values are not so important; their relative sizes give an indication of the utility of the variables.

◮ Once a splitting variable and split point have been decided, what is to be done with observations missing that variable? CART defines surrogate variables by re–applying the partitioning algorithm to predict the two split categories using the other independent variables.

207

1. Cost–complexity pruning

◮ Misclassification rate of a node t:

   R(t) = ∑_{k ≠ Y(t)} p(k|t),   Y(t) = argmax_{k=1,...,K} p(k|t)  (the label assigned to node t).

◮ Let T̃ = {t_1, . . . , t_m} be the terminal nodes; misclassification rate of T:

   R(T) = ∑_{i=1}^m [N(t_i)/n] R(t_i) = (1/n) ∑_{t ∈ T̃} N(t) R(t).

◮ Let size(T) = #{terminal nodes of T}; for a complexity parameter (CP) α > 0

   R_α(T) = R(T) + α · size(T) = ∑_{t ∈ T̃} [R(t) + α].

◮ α > 0 imposes a penalty for large trees.

208

2. Cost–complexity pruning

[Diagram: a tree with root t(0), internal nodes t(1) and t(2), and terminal nodes t(3), . . . , t(6); T(t(2)) denotes the branch rooted at t(2).]

◮ T(t) is the sub–tree rooted at t.

◮ Error of the sub-tree T(t_2) and error of the node t_2:

   R_α(T(t_2)) = R(T(t_2)) + α · size{T(t_2)} = R(T(t_2)) + 2α,
   R_α(t_2) = R(t_2) + α,   where t_2 is treated as terminal.

209

3. Cost–complexity pruning

◮ Pruning is worthwhile if

   R_α(t_2) ≤ R_α(T(t_2))  ⇔  g(t_2, T) = [R(t_2) − R(T(t_2))] / [size{T(t_2)} − 1] ≤ α.

The function g(t, T) can be computed for any internal node of the tree.

◮ Weakest–link cutting algorithm:

   1. Start with the full tree T_1. For each non–terminal node t ∈ T_1 compute g(t, T_1), and find t̄_1 = argmin_{t ∈ T_1} g(t, T_1). Set α_2 = g(t̄_1, T_1).
   2. Define a new tree T_2 by pruning away the branch rooted at t̄_1. Find the weakest link in T_2 and proceed as in step 1.

◮ Result: a decreasing sequence of sub–trees with corresponding α's. The final selection is made by cross-validation or by a validation sample.

210

Example: Stage C prostate cancer data (cont.)

> printcp(cfit)

Classification tree:

rpart(formula = pgstat ~ age + eet + grade + gleason + ploidy,

data = stagec, method = "class")

Variables actually used in tree construction:

[1] age gleason grade ploidy

Root node error: 54/146 = 0.36986

n= 146

CP nsplit rel error xerror xstd

1 0.111111 0 1.00000 1.0000 0.10802

2 0.037037 2 0.77778 1.0741 0.10949

3 0.018519 4 0.70370 1.0556 0.10916

4 0.010000 5 0.68519 1.0556 0.10916

211

Complexity parameter (CP) table

◮ The CP table is printed from the smallest tree (0 splits) to the largest

(5 splits for cancer data)

◮ rel error – relative error on the training set (resubstitution), the

first node has an error of 1.

◮ xerror – the cross–validation estimate of the error

◮ xstd – the standard deviation of the risk

◮ 1-SE rule of thumb: all trees with

xerror ≤ minimal xerror + xstd

are equivalent. Choose the simplest one.

212

Example: data on spam

◮ Data: 4601 instances, 57 attributes

– Most of the attributes indicate whether a particular word or

character was frequently occuring in the e-mail. The run-length

attributes (55-57) measure the length of sequences of consecutive

capital letters.

◮ Default tree building

> spam.tr<-rpart(V58~., data=spam, method="class") # default tree

> plot(spam.tr, compress=T, branch=.3)

> text(spam.tr)

213

1. Spam data: default tree

[Figure: the default rpart tree for the spam data, splitting on V53, V7, V52, V57, V16 and V25; terminal nodes labelled 0 or 1.]

214

2. Spam data: default tree

> printcp(spam.tr)

Classification tree:

rpart(formula = V58 ~ ., data = spam, method = "class")

Variables actually used in tree construction:

[1] V16 V25 V52 V53 V57 V7

Root node error: 1813/4601 = 0.39404

n= 4601

CP nsplit rel error xerror xstd

1 0.476558 0 1.00000 1.00000 0.018282

2 0.148924 1 0.52344 0.54716 0.015386

3 0.043023 2 0.37452 0.44457 0.014222

4 0.030888 4 0.28847 0.33867 0.012723

5 0.010480 5 0.25758 0.28847 0.011875

6 0.010000 6 0.24710 0.27799 0.011685 # classification error

Absolute cross–validation error: 0.27799×0.39404= 0.1095392

215

1. Spam data: unpruned tree

> sctrl <- rpart.control(minbucket=1, minsplit=2, cp=0)

> spam1.tr <- rpart(V58~., data=spam, method="class", control=sctrl)

> printcp(spam1.tr)

Classification tree:

rpart(formula = V58 ~ ., data = spam, method = "class", control = sctrl)

Variables actually used in tree construction:

[1] V1 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V2 V20 V21 V22 V23 V24 V25 V26

[20] V27 V28 V29 V3 V30 V33 V35 V36 V37 V39 V4 V40 V42 V43 V44 V45 V46 V48 V49

[39] V5 V50 V51 V52 V53 V54 V55 V56 V57 V6 V7 V8 V9

Root node error: 1813/4601 = 0.39404

n= 4601

CP nsplit rel error xerror xstd

1 0.47655819 0 1.0000000 1.00000 0.0182819

2 0.14892443 1 0.5234418 0.55268 0.0154419

3 0.04302261 2 0.3745174 0.46663 0.0144933

4 0.03088803 4 0.2884721 0.32598 0.0125182

5 0.01047987 5 0.2575841 0.29454 0.0119835

6 0.00827358 6 0.2471042 0.27248 0.0115825

7 0.00717044 7 0.2388307 0.26751 0.0114891

216

8 0.00529509 8 0.2316602 0.26089 0.0113626

9 0.00441258 14 0.1958081 0.23828 0.0109127

10 0.00358522 15 0.1913955 0.23497 0.0108445

11 0.00330943 19 0.1770546 0.23276 0.0107986

12 0.00275786 20 0.1737452 0.22945 0.0107293

13 0.00220629 24 0.1627137 0.22725 0.0106827

14 0.00193050 28 0.1538886 0.22173 0.0105648

15 0.00183857 31 0.1478213 0.22063 0.0105410

16 0.00165472 34 0.1423056 0.20684 0.0102366

17 0.00137893 42 0.1290678 0.19801 0.0100348

18 0.00110314 46 0.1235521 0.19470 0.0099576

19 0.00082736 62 0.1059018 0.18809 0.0098007 # minimal xerror

20 0.00070916 82 0.0871484 0.18919 0.0098271

21 0.00066189 90 0.0810811 0.18864 0.0098139

22 0.00055157 95 0.0777716 0.19250 0.0099057

23 0.00041368 167 0.0380585 0.19360 0.0099317

24 0.00036771 175 0.0347490 0.19746 0.0100220

25 0.00033094 181 0.0325427 0.19967 0.0100731

26 0.00029700 186 0.0308880 0.20132 0.0101111

27 0.00027579 205 0.0242692 0.21346 0.0103843

28 0.00018386 277 0.0044126 0.21511 0.0104208

29 0.00000000 286 0.0027579 0.21566 0.0104329

217

2. Spam data: unpruned tree

> plot(spam1.tr)

218

[Figure: the full unpruned spam tree (too large for the node labels to be readable).]

219

1. Spam data: 1-SE rule pruned tree

> spam2.tr <- prune(spam1.tr, cp=0.00137893)

> printcp(spam2.tr)

Classification tree:

rpart(formula = V58 ~ ., data = spam, method = "class", control = sctrl)

Variables actually used in tree construction:

[1] V16 V17 V19 V21 V22 V24 V25 V27 V28 V37 V4 V46 V49 V5 V50 V52 V53 V55 V56

[20] V57 V6 V7 V8

Root node error: 1813/4601 = 0.39404

n= 4601

CP nsplit rel error xerror xstd

1 0.4765582 0 1.00000 1.00000 0.018282

2 0.1489244 1 0.52344 0.55268 0.015442

3 0.0430226 2 0.37452 0.46663 0.014493

4 0.0308880 4 0.28847 0.32598 0.012518

5 0.0104799 5 0.25758 0.29454 0.011984

220

6 0.0082736 6 0.24710 0.27248 0.011582

7 0.0071704 7 0.23883 0.26751 0.011489

8 0.0052951 8 0.23166 0.26089 0.011363

9 0.0044126 14 0.19581 0.23828 0.010913

10 0.0035852 15 0.19140 0.23497 0.010844

11 0.0033094 19 0.17705 0.23276 0.010799

12 0.0027579 20 0.17375 0.22945 0.010729

13 0.0022063 24 0.16271 0.22725 0.010683

14 0.0019305 28 0.15389 0.22173 0.010565

15 0.0018386 31 0.14782 0.22063 0.010541

16 0.0016547 34 0.14231 0.20684 0.010237

17 0.0013789 42 0.12907 0.19801 0.010035

◮ Absolute cross-validation error

0.19801× 1813/4601 = 0.07802386

221

2. Spam data: 1-SE rule pruned tree

> plot(spam2.tr)

> text(spam2.tr, cex=.5)

222

[Figure: the 1-SE pruned spam tree; the first splits are on V53 < 0.0555, V7 < 0.055, V52 < 0.378 and V57 < 55.5, with terminal nodes labelled 0 or 1.]

223

1. Spam data: another tree

> spam3.tr <- prune(spam1.tr, cp=0.0018)

> printcp(spam3.tr)

Classification tree:

rpart(formula = V58 ~ ., data = spam, method = "class", control = sctrl)

Variables actually used in tree construction:

[1] V16 V17 V19 V21 V22 V24 V25 V27 V37 V46 V49 V5 V52 V53 V55 V56 V57 V6 V7

[20] V8

Root node error: 1813/4601 = 0.39404

n= 4601

CP nsplit rel error xerror xstd

1 0.4765582 0 1.00000 1.00000 0.018282

2 0.1489244 1 0.52344 0.54661 0.015380

3 0.0430226 2 0.37452 0.43354 0.014081

4 0.0308880 4 0.28847 0.33370 0.012643

5 0.0104799 5 0.25758 0.28792 0.011866

224

6 0.0082736 6 0.24710 0.27689 0.011665

7 0.0071704 7 0.23883 0.26531 0.011447

8 0.0052951 8 0.23166 0.25648 0.011277

9 0.0044126 14 0.19581 0.23718 0.010890

10 0.0035852 15 0.19140 0.22835 0.010706

11 0.0033094 19 0.17705 0.22449 0.010624

12 0.0027579 20 0.17375 0.22614 0.010659

13 0.0022063 24 0.16271 0.22559 0.010648

14 0.0019305 28 0.15389 0.21732 0.010469

15 0.0018386 31 0.14782 0.21732 0.010469

16 0.0018000 34 0.14231 0.20684 0.010237

◮ Absolute cross-validation error

0.20684× 1813/4601 = 0.08150323

> plot(spam3.tr, branch=.4,uniform=T)

> text(spam3.tr, cex=.5)

225

[Figure: the spam tree pruned at cp = 0.0018; the first splits are again on V53, V7, V52 and V57, with terminal nodes labelled 0 or 1.]

226

14. Nonparametric smoothing: basic ideas,

kernel and local polynomial estimators

227

Density estimation problem

◮ Old Faithful Geyser data: for 272 eruptions, the eruption duration and the waiting time between successive eruptions were recorded.

> summary(faithful)

eruptions waiting

Min. :1.600 Min. :43.0

1st Qu.:2.163 1st Qu.:58.0

Median :4.000 Median :76.0

Mean :3.488 Mean :70.9

3rd Qu.:4.454 3rd Qu.:82.0

Max. :5.100 Max. :96.0

◮ We want to estimate density of the waiting times between successive

eruptions.

228

Histogram

◮ Let X_1, . . . , X_n iid∼ f. We want to estimate the density f.

◮ By definition, f(x) = lim_{h→0} (1/2h) P{x − h < X ≤ x + h}.

◮ Idea: fix h, estimate P{x − h < X ≤ x + h} by (1/n) ∑_{i=1}^n I{x − h < X_i ≤ x + h}, and let

   f̂(x) = (1/2hn) ∑_{i=1}^n I{x − h < X_i ≤ x + h}.

◮ Histogram: consider bins (b_j, b_{j+1}], j = 0, 1, . . ., with b_{j+1} − b_j = 2h for all j, and set

   f̂(x) = (1/2hn) ∑_{i=1}^n I{b_j < X_i ≤ b_{j+1}},   x ∈ (b_j, b_{j+1}].

229

Histogram of the Old Faithful data

> hist(faithful$waiting)

[Figure: histogram of faithful$waiting (frequency against waiting time, roughly 40 to 100 minutes).]

230

1. How to choose binwidth h?

◮ Bias of f̂(x):

   |E f̂(x) − f(x)| = | (1/2h) ∫_{x−h}^{x+h} f(t) dt − f(x) | ≤ sup_{t ∈ (x−h, x+h]} |f(t) − f(x)| ≤ 2Lh,

provided that |f(x) − f(x′)| ≤ L|x − x′| for all x, x′.

◮ Variance of f̂(x): because the I{x − h < X_i ≤ x + h} are i.i.d. Bernoulli r.v. with parameter p = P{x − h < X_1 ≤ x + h} = ∫_{x−h}^{x+h} f(t) dt,

   var{f̂(x)} ≤ (1/2nh) · (1/2h) ∫_{x−h}^{x+h} f(t) dt ≤ M/(2nh),

provided that f(x) ≤ M for all x.

231

2. How to choose binwidth h?

◮ Mean squared error of f̂(x):

   min_{h>0} { 4L²h² + M (2nh)^{−1} }  ⇒  h* ≍ n^{−1/3}...

◮ The smoother the density, the larger h should be. For instance, if we assume that |f′(x) − f′(x′)| ≤ L|x − x′| for all x, x′, then h* ≍ n^{−1/5}...

◮ The rule of thumb in R:

   h = 1.144 · σ̂ · n^{−1/5}.

232

Kernel estimators

◮ Another representation of the histogram estimator:

   f̂(x) = (1/nh) ∑_{i=1}^n K( (X_i − x)/h ),   K(t) = 1/2 if |t| ≤ 1, and 0 otherwise.

The function K is called the kernel; h is called the bandwidth.

◮ General kernel estimator: take a function K satisfying ∫_{−∞}^{∞} K(t) dt = 1 and define

   f̂(x) = (1/hn) ∑_{i=1}^n K( (X_i − x)/h ).

◮ Commonly used kernels:

   – rectangular: K(t) = (1/2) I{|t| ≤ 1};
   – triangular: K(t) = (1 − |t|) I{|t| ≤ 1};
   – Gaussian: K(t) = (1/√(2π)) e^{−t²/2} (the default in R).
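◮ A minimal by-hand R sketch of the Gaussian kernel estimator f̂(x) = (1/nh) ∑_i K((X_i − x)/h); density(), used on the next slide, performs the same computation more carefully.

X <- faithful$waiting
h <- 4
grid <- seq(min(X) - 3 * h, max(X) + 3 * h, length.out = 400)
f.hat <- sapply(grid, function(x) mean(dnorm((X - x) / h)) / h)   # (1/nh) * sum of kernel values
plot(grid, f.hat, type = "l", xlab = "waiting", ylab = "density")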

233

1. Kernel density estimation in R

> layout(matrix(1:3, ncol = 3))

> plot(density(faithful$waiting))

> plot(density(faithful$waiting,bw=8))

> plot(density(faithful$waiting,bw=0.8))

234

2. Kernel density estimation in R

[Figure: kernel density estimates of faithful$waiting with the default bandwidth (3.988), bw = 8 and bw = 0.8.]

235

Example: motorcycle data

◮ Measurements of head acceleration in a simulated motorcycle accident, used to test crash helmets.

> library(MASS)

> plot(mcycle)

236

[Figure: mcycle data, accel plotted against times.]

237

Regression problem

◮ Model: data {(X_i, Y_i), i = 1, . . . , n}, f an unknown "smooth" function, ε the error with Eε = 0:

   Y_i = f(X_i) + ε_i,  i = 1, . . . , n  ⇔  f(x) = E(Y | X = x).

◮ Parametric models

   * Simple linear regression: f(x) = β_0 + β_1 x
   * Polynomial regression: f(x) = β_0 + β_1 x + · · · + β_p x^p,
     reduced to multiple linear regression: f(x) = β^T x̃, x̃ = (1, x, x², . . . , x^p)^T.

238

Polynomial regression

> attach(mcycle)

> fit3<-lm(accel~times+I(times^2)+I(times^3))

> fit5<-lm(accel~times+I(times^2)+I(times^3)+I(times^4)+I(times^5))

> fit7<-lm(accel~times+I(times^2)+I(times^3)+I(times^4)+I(times^5)+I(times^6)+

+ I(times^7))

> plot(times, accel)

> lines(times, fit3$fitted, lty=1)

> lines(times, fit5$fitted, lty=2)

> lines(times, fit7$fitted, lty=3)

> legend(40, -70, c("fit3", "fit5", "fit7"), lty=c(1,2,3))

239

[Figure: mcycle data with polynomial fits of degree 3 (fit3), 5 (fit5) and 7 (fit7).]

240

Nonparametric regression: basic ideas

◮ Local modeling: a parametric model in a "local" neighborhood.

   * Local average:

     f̂_h(x) = (1/#{N_h(x)}) ∑_{i ∈ N_h(x)} Y_i,   N_h(x) = {i : |x − X_i| ≤ h}.

   * Local linear (polynomial) regression
   * k-NN estimator

◮ How to define the local neighborhood?

241

Kernel estimators

◮ Regression function:

   f(x) = E(Y | X = x) = ∫ y p(x, y) dy / ∫ p(x, y) dy = ∫ y p(x, y) dy / p(x).

◮ Histogram estimate of p(x):

   p̂_h(x) = (1/2nh) ∑_{i=1}^n I{X_i ∈ (b_j, b_{j+1}]} = n_j/(2nh),   x ∈ (b_j, b_{j+1}].

◮ Estimator of f(x):

   f̂_h(x) = ∑_{i=1}^n Y_i I{X_i ∈ [x − h, x + h]} / ∑_{i=1}^n I{X_i ∈ [x − h, x + h]} =: ∑_{i=1}^n w_{n,i}(x) Y_i.

242

Nadaraya–Watson kernel estimator

◮ More generally, consider a kernel K such that ∫ K(x) dx = 1 and define

   f̂_h(x) = ∑_{i=1}^n Y_i K( (x − X_i)/h ) / ∑_{i=1}^n K( (x − X_i)/h ).

◮ Kernels:

   – Box kernel: K_1(x) = I_{[−1/2, 1/2]}(x).
   – Quadratic kernel: K_2(x) = (3/4)(1 − x²) I_{[−1,1]}(x).
   – Gaussian: K_3(x) = (1/√(2π)) exp{−x²/2}.

◮ Selection of the bandwidth h is important:

   f̂_h(X_i) → Y_i as h → 0;   f̂_h(x) → (1/n) ∑_{i=1}^n Y_i as h → ∞.
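◮ A minimal by-hand R sketch of the Nadaraya–Watson estimator with a Gaussian kernel, applied to the mcycle data; nw is an illustrative helper name.

library(MASS)
nw <- function(x, X, Y, h) {
  w <- dnorm((x - X) / h)     # kernel weights
  sum(w * Y) / sum(w)         # weighted average of the responses
}
grid <- seq(min(mcycle$times), max(mcycle$times), length.out = 200)
fit <- sapply(grid, nw, X = mcycle$times, Y = mcycle$accel, h = 2)
plot(mcycle); lines(grid, fit)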

243

1. Example: motorcycle data

> plot(times, accel)

> lines(ksmooth(times, accel, "normal", bandwidth=1), lty=1)

> lines(ksmooth(times, accel, "normal", bandwidth=2), lty=2)

> lines(ksmooth(times, accel, "normal", bandwidth=3), lty=3)

> legend(40, -100, legend=c("bandwidth=1", "bandwidth=2", "bandwidth=3"),

+ lty=c(1,2,3))

The kernels are scaled so that their quartiles (viewed as probability densities) are at ±0.25·bandwidth, i.e. for the normal kernel h · z_{0.75} = 0.25 · bandwidth.

244

2. Example: motorcycle data

[Figure: mcycle data with ksmooth fits for bandwidth = 1, 2 and 3.]

245

1. Local polynomial smoothing

◮ Local model at a point x:

   Y_i = a(x) + b(x) X_i + ε_i,   X_i ∈ [x − h, x + h].

◮ Local linear regression estimator:

   ∑_{i=1}^n [Y_i − a(x) − b(x) X_i]² I{ |X_i − x|/h ≤ 1 } → min over a(x), b(x),

   f̂(x) = â(x) + b̂(x) x.

246

2. Local polynomial smoothing

◮ General local polynomial estimator:

   f(z) ≈ ∑_{j=0}^p [f^{(j)}(x)/j!] (z − x)^j = ∑_{j=0}^p β_j (z − x)^j,

   ∑_{i=1}^n [ Y_i − ∑_{j=0}^p β_j (X_i − x)^j ]² K( |X_i − x|/h ) → min over β,

   f̂(x) = β̂_0,   f̂^{(j)}(x) = j! β̂_j,  j = 1, . . . , p.

◮ Both kernel and local polynomial estimators are linear smoothers:

   f̂(x) = ∑_{i=1}^n w_{n,i}(x) Y_i,   ∑_{i=1}^n w_{n,i}(x) = 1.

247

1. Local polynomial smoothing (LOESS)

◮ Idea: fit a polynomial locally to the data.

◮ LOESS smoothing: define the weights

   w_i(x) = [ 1 − |x − X_i|³ / τ³(x, α) ]³_+,   i = 1, . . . , n.

Let r(x, β) be a polynomial of degree p with coefficients β = (β_0, . . . , β_p). Define

   β̂(x) = argmin_β ∑_{i=1}^n w_i(x) [Y_i − r(X_i, β)]²,   f̂(x) = r(x, β̂(x)).

248

2. Local polynomial smoothing (LOESS)

◮ The bandwidth τ(x, α) is chosen as follows:

   – denote ∆_i(x) = |x − X_i| and order these values: ∆_(1)(x) ≤ ∆_(2)(x) ≤ · · · ≤ ∆_(n)(x);
   – if 0 < α ≤ 1 then τ(x, α) = ∆_(q)(x), where q = [αn];
   – if α > 1 then τ(x, α) = α ∆_(n)(x).

249

1. LOESS: motorcycle data

> attach(mcycle)

> mcycle.1 <- loess(accel~times, span=0.1)

> mcycle.2 <- loess(accel~times, span=0.5)

> mcycle.3 <- loess(accel~times, span=1)

> prtimes<- matrix((0:1000)*((max(times)-min(times))/1000)+min(times), ncol=1)

> praccel.1 <-predict(mcycle.1, prtimes)

> praccel.2 <-predict(mcycle.2, prtimes)

> praccel.3 <-predict(mcycle.3, prtimes)

> plot(mcycle, pch="+")

> lines(prtimes, praccel.1, lty=1)

> lines(prtimes, praccel.2, lty=2)

> lines(prtimes, praccel.3, lty=3)

> legend(40, -90, legend=c("span=0.1", "span=0.5", "span=1"), lty=c(1,2,3))

span = α controls the proportion of points used in the local neighborhood (see the definition of τ(x, α)); the degree of the local polynomial is 2.

250

2. LOESS: motorcycle data

[Figure: mcycle data with LOESS fits for span = 0.1, 0.5 and 1.]

251

15. Multivariate nonparametric regression: regression trees (CART)

252

Multivariate nonparametric regression

◮ Curse of dimensionality

Data are very sparse in high-dimensional space. If we have 1000 uniformly distributed points on [0, 1]^d, then the average number N_d of points in [0, 0.3]^d is as follows:

   d     1    2    3    4    5
   N_d   300  90   27   8.1  2.4

◮ Remedy: nonparametric "structural" models

   – additive structure: f(X) = f_1(X_1) + . . . + f_p(X_p)
   – single–index: f(X) = f_0(θ^T X)
   – projection pursuit: f(X) = ∑_{i=1}^k f_i(θ_i^T X)

253

Regression trees: basic idea

◮ Data: (X_i, Y_i), i = 1, . . . , n, X ∈ X ⊂ R^p.

◮ Modeling assumption: there is a partition of X into M regions D_1, . . . , D_M, and f is approximated by a constant on each region:

   f(x) = ∑_{m=1}^M c_m I(x ∈ D_m).

◮ Splitting rules: binary splitting. Choose a variable X_m and split according to X_m ≤ t_m or X_m > t_m.

◮ How to partition the domain X?

254

A regression tree: an example

[Figure: a binary tree with splits X1 < t1, X2 < t2, X1 < t3, X2 < t4 and terminal regions D1, . . . , D5.]

255

Corresponding partition of the feature space

[Figure: the partition of the (X1, X2) plane into the rectangles D1, . . . , D5 induced by the split points t1, . . . , t4.]

256

1. Growing the regression tree

◮ Predictor variable: X = (X_1, . . . , X_p) ∈ R^p;
   data: (X_i, Y_i), X_i = (X_{i1}, . . . , X_{ip}), i = 1, . . . , n.

◮ The goodness of a split S is defined by the decrease in the total sum of squares:

   Φ(S, t) = SS(t) − [SS(t_L) + SS(t_R)],

where SS(t) is the total sum of squares ∑(Y_i − Ȳ)² of the observations at node t.

◮ Choose the split S that maximizes Φ(S, t).

257

2. Growing the regression tree

◮ Consider a splitting variable X_j and split point s, and define

   D_1(j, s) = {X : X_j ≤ s},   D_2(j, s) = {X : X_j > s}.

Then we look for j and s that solve

   min_{j,s} [ ∑_{i=1}^n (Y_i − c_1)² I{X_i ∈ D_1(j, s)} + ∑_{i=1}^n (Y_i − c_2)² I{X_i ∈ D_2(j, s)} ],

   c_k = ave{Y_i : X_i ∈ D_k(j, s)} = ∑_{i=1}^n Y_i I{X_i ∈ D_k(j, s)} / ∑_{i=1}^n I{X_i ∈ D_k(j, s)},  k = 1, 2.

◮ For each splitting variable the determination of s is done very quickly (there is a finite number of different splits). The pair (j, s) is found by scanning over all of the inputs.
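◮ A minimal R sketch (not rpart's actual implementation) of this exhaustive search for the best split (j, s) at a single node, scored by the decrease in the sum of squares; best.split is an illustrative helper name.

best.split <- function(X, y) {
  ss <- function(v) sum((v - mean(v))^2)          # total sum of squares
  best <- list(var = NA, cut = NA, gain = -Inf)
  for (j in seq_len(ncol(X))) {
    for (s in sort(unique(X[, j]))[-1]) {         # candidate cut points for variable j
      left <- X[, j] < s
      gain <- ss(y) - ss(y[left]) - ss(y[!left])  # decrease in sum of squares
      if (gain > best$gain) best <- list(var = colnames(X)[j], cut = s, gain = gain)
    }
  }
  best
}
# best.split(as.matrix(mcycle["times"]), mcycle$accel)   # example usage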

258

Pruning the regression tree

◮ Let T̃ = {t_1, . . . , t_M} be the terminal nodes of the tree T, with regions D_1, . . . , D_M and numbers of observations N_1, . . . , N_M.

◮ Notation: for each terminal node t_m define

   c_m = (1/N_m) ∑_{i=1}^n Y_i I{X_i ∈ D_m},   Q(t_m) = (1/N_m) ∑_{i=1}^n (Y_i − c_m)² I{X_i ∈ D_m}.

◮ Cost–complexity criterion: for a complexity parameter (CP) α > 0 let

   R_α(T) = ∑_{m=1}^M N_m Q(t_m) + α size(T),   size(T) = #{terminal nodes of T}.

◮ The weakest–link cutting algorithm produces a decreasing sequence of trees with corresponding CP's; cross–validation is used to select from this collection. [See the transparencies for classification trees.]

259

1. Example: regression tree for motorcycle data

> mcycle.tree<-rpart(accel~times, data=mcycle,method="anova")

> plot(mcycle.tree)

> text(mcycle.tree)

> plot(times, accel)

> lines(times, predict(mcycle.tree))

260

[Figure: regression tree for the mcycle data with splits on times; terminal-node means range from −114.7 to 29.29.]

261

2. Example: regression tree for motorcycle data

[Figure: mcycle data with the piecewise-constant regression-tree fit overlaid.]

262

1. Example: Boston housing data, regression tree

> library(MASS)

> nobs <- dim(Boston)[1]

> trainx<-sample(1:nobs, 2*nobs/3, replace=F)

> testx<-(1:nobs)[-trainx]

> Boston.tree<-rpart(medv~., data=Boston[trainx,], method="anova")

> print(Boston.tree)

n= 337

node), split, n, deviance, yval

* denotes terminal node

1) root 337 26701.3200 22.51810

2) rm< 6.825 277 10100.7600 19.68123

4) lstat>=15 99 1586.7700 14.28990

8) crim>=7.036505 41 409.5088 11.53171 *

9) crim< 7.036505 58 644.8588 16.23966 *

5) lstat< 15 178 4035.9670 22.67978

10) rm< 6.543 145 2251.4780 21.62069

263

20) lstat>=9.66 80 573.8339 20.32125 *

21) lstat< 9.66 65 1376.3040 23.22000 *

11) rm>=6.543 33 907.2133 27.33333 *

3) rm>=6.825 60 4079.5770 35.61500

6) rm< 7.435 42 1248.8060 31.77143

12) lstat>=5.415 21 417.2467 28.66667 *

13) lstat< 5.415 21 426.6981 34.87619 *

7) rm>=7.435 18 762.5450 44.58333 *

#

> plot(Boston.tree)

> text(Boston.tree)

#

#

> Boston.pred <- predict(Boston.tree, Boston[testx,])

> sum((Boston.pred-Boston[testx,"medv"])^2)

[1] 4045.943 # prediction error

# on the test set

264

2. Example: Boston housing data, regression tree

[Figure: regression tree for the Boston training data with splits on rm, lstat and crim; terminal-node means range from 11.53 to 44.58.]

265

MARS as extension of CART

◮ CART decision trees are based on the approximation of f by

   f(x) = ∑_{m=1}^M c_m B_m(x),   B_m(x) = I{x ∈ D_m}  (basis functions).

◮ Idea: replace I{x ∈ D_m} with a continuous function.

If the basis functions B_m(·) are given, the coefficients c_m are estimated by least squares.

266

Region representation

◮ Step function:

   H[η] = 1 if η ≥ 0, and 0 otherwise.

Each region D_m is obtained by, say, K_m splits; the k-th split, k = 1, . . . , K_m, is performed on the variable x_{v(k,m)} using the threshold t_{km}. Therefore

   B_m(x) = ∏_{k=1}^{K_m} H[ s_{km} (x_{v(k,m)} − t_{km}) ],   s_{km} = ±1,

   f(x) = ∑_{m=1}^M c_m ∏_{k=1}^{K_m} H[ s_{km} (x_{v(k,m)} − t_{km}) ].

◮ Minimize a lack–of–fit (LOF) measure w.r.t. c_m, s_{km}, v(k, m) and t_{km}.

267

MARS basis functions

◮ Instead of the step functions H[x_v − t] and H[−x_v + t], the MARS algorithm uses the hinge functions [x_v − t]_+ and [−x_v + t]_+.

[Figure: the hinge functions (x − t)_+ and (t − x)_+ plotted around the knot t.]

◮ The collection of basis functions:

   C = { (x_j − t)_+, (t − x_j)_+ : t ∈ {X_{1j}, X_{2j}, . . . , X_{nj}}, j = 1, . . . , p }.

268

MARS: Model building strategy

◮ Forward selection of basis functions: at each iteration we add a pair of new basis functions, obtained by multiplying a previously chosen basis function with [x_j − t]_+ and [t − x_j]_+.

◮ First step: we consider adding to the model a function

   β_1 (x_j − t)_+ + β_2 (t − x_j)_+,   t ∈ {X_{1j}, X_{2j}, . . . , X_{nj}}.

Suppose the best choice is β_1 (x_2 − X_{72})_+ + β_2 (X_{72} − x_2)_+. Then the set of basis functions at this step is

   C_1 = { B_0(x) = 1, B_1(x) = (x_2 − X_{72})_+, B_2(x) = (X_{72} − x_2)_+ }.

269

◮ Second step: consider including a pair of products

(xj − t)+Bm(x) and (t− xj)+Bm(x), Bm ∈ C1.

◮ Step M : CM = {Bm(x),m = 1, . . . , 2M + 1}. At step M + 1 the

algorithm adds the terms

c2M+2Bl(x) [xj − t]+ + c2M+3Bl(x) [t− xj ]+, Bl ∈ CM , j = 1, . . . , p,

where Bl and j produce maximal decrease in the training error. Stop

when the model contains preset maximum number of terms Mmax.

◮ Final selection: choose fM based on M basis functions that minimizes

LOF(fM ) =n

(n−M − 1)2

n∑

i=1

[yi − fM (xi)

]2

270

Recursive partitioning algorithm

B1(x) = 1
for M = 2 to Mmax do: lof* = ∞
  for m = 1 to M − 1 do:
    for v = 1 to p do:
      for t ∈ {x_vj : Bm(x_j) > 0} do:
        g = ∑_{i≠m} c_i B_i(x) + c_m B_m(x) H[x_v − t] + c_M B_m(x) H[−x_v + t]
        lof = min_{c1,...,cM} LOF(g)
        if lof < lof* then lof* = lof; m* = m; v* = v; t* = t; endif
      endfor
    endfor
  endfor
  B_M(x) = B_{m*}(x) H[−x_{v*} + t*]
  B_{m*}(x) = B_{m*}(x) H[x_{v*} − t*]
endfor; end of algorithm

271

1. Example: Boston housing data – MARS

# Number of basis functions<=10, no interactions

> library(mda)

> Boston1.mars<-mars(Boston[,1:13], Boston$medv, degree=1, nk=10)

# ij-th element equal to 1 if term i has a factor of the form x_j>c,

# equal to -1 if term i has a factor of the form x_j <= c,

# and to 0 if x_j is not in term i.

> Boston1.mars$factor

crim zn indus chas nox rm age dis rad tax ptratio black lstat

[1,] 0 0 0 0 0 0 0 0 0 0 0 0 0

[2,] 0 0 0 0 0 0 0 0 0 0 0 0 1

[3,] 0 0 0 0 0 0 0 0 0 0 0 0 -1

[4,] 0 0 0 0 0 1 0 0 0 0 0 0 0

[5,] 0 0 0 0 0 -1 0 0 0 0 0 0 0

[6,] 0 0 0 0 0 0 0 0 0 0 1 0 0

[7,] 0 0 0 0 0 0 0 0 0 0 -1 0 0

[8,] 0 0 0 0 1 0 0 0 0 0 0 0 0

[9,] 0 0 0 0 -1 0 0 0 0 0 0 0 0

> Boston1.mars$cuts

272

# ij-th element equal to the cut point c for variable j in term i.

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]

[1,] 0 0 0 0 0.000 0.000 0 0 0 0 0.0 0 0.00

[2,] 0 0 0 0 0.000 0.000 0 0 0 0 0.0 0 6.07

[3,] 0 0 0 0 0.000 0.000 0 0 0 0 0.0 0 6.07

[4,] 0 0 0 0 0.000 6.425 0 0 0 0 0.0 0 0.00

[5,] 0 0 0 0 0.000 6.425 0 0 0 0 0.0 0 0.00

[6,] 0 0 0 0 0.000 0.000 0 0 0 0 17.8 0 0.00

[7,] 0 0 0 0 0.000 0.000 0 0 0 0 17.8 0 0.00

[8,] 0 0 0 0 0.472 0.000 0 0 0 0 0.0 0 0.00

[9,] 0 0 0 0 0.472 0.000 0 0 0 0 0.0 0 0.00

> Boston1.mars$selected.terms

[1] 1 2 3 4 5 6 7 8 9

> Boston1.mars$coefficients

[,1]

[1,] 25.3093268

[2,] -0.5529556

[3,] 2.9450633

[4,] 8.0947256

[5,] 1.4712255

273

[6,] -0.6005808

[7,] 0.8962524

[8,] -12.0724248

[9,] -52.6998783

> Boston1.mars$gcv

[1] 17.73768

> plot(Boston1.mars$residuals)

> abline(0,0)

> qqnorm(Boston1.mars$residuals)

> qqline(Boston1.mars$residuals)

◮ Fitted model

   f̂(x) = 25.30 − 0.55 × (lstat − 6.07)_+ + 2.95 × (6.07 − lstat)_+
         + 8.09 × (rm − 6.43)_+ + 1.47 × (6.43 − rm)_+
         − 0.60 × (ptratio − 17.8)_+ + 0.90 × (17.8 − ptratio)_+
         − 12.07 × (nox − 0.47)_+ − 52.7 × (0.47 − nox)_+

274

2. Example: Boston housing data – MARS

[Figure: MARS residuals plotted against observation index (left) and a normal Q-Q plot of the residuals (right).]

275

3. Example: Boston housing data – MARS

# Number of basis function <= 40, degree=2

> Boston2.mars<-mars(Boston[,1:13], Boston$medv, degree=2, nk=40)

> Boston2.mars$selected.terms

[1] 1 2 4 5 6 7 8 9 11 13 14 16 19 20 23 25 26 27 28 29 31 32 34 36 38

[26] 39

> Boston2.mars$factor

crim zn indus chas nox rm age dis rad tax ptratio black lstat

[1,] 0 0 0 0 0 0 0 0 0 0 0 0 0

[2,] 0 0 0 0 0 0 0 0 0 0 0 0 1

[3,] 0 0 0 0 0 0 0 0 0 0 0 0 -1

[4,] 0 0 0 0 0 1 0 0 0 0 0 0 0

[5,] 0 0 0 0 0 -1 0 0 0 0 0 0 0

[6,] 0 0 0 0 0 1 0 0 0 0 1 0 0

[7,] 0 0 0 0 0 1 0 0 0 0 -1 0 0

[8,] 0 0 0 0 0 0 0 0 0 1 0 0 -1

[9,] 0 0 0 0 0 0 0 0 0 -1 0 0 -1

[10,] 0 0 0 0 1 0 0 0 0 0 0 0 1

[11,] 0 0 0 0 -1 0 0 0 0 0 0 0 1

276

[12,] 0 0 0 0 0 -1 0 1 0 0 0 0 0

[13,] 0 0 0 0 0 -1 0 -1 0 0 0 0 0

[14,] 1 0 0 0 0 0 0 0 0 0 0 0 0

[15,] -1 0 0 0 0 0 0 0 0 0 0 0 0

[16,] 1 0 0 1 0 0 0 0 0 0 0 0 0

[17,] 1 0 0 -1 0 0 0 0 0 0 0 0 0

[18,] -1 0 0 0 0 0 0 0 0 1 0 0 0

[19,] -1 0 0 0 0 0 0 0 0 -1 0 0 0

[20,] 0 0 0 0 0 0 0 0 0 0 0 0 1

[21,] 0 0 0 0 0 0 0 0 0 0 0 0 -1

[22,] 0 0 0 0 0 1 1 0 0 0 0 0 0

[23,] 0 0 0 0 0 1 -1 0 0 0 0 0 0

[24,] 0 0 0 0 0 0 0 1 0 0 0 0 0

[25,] 0 0 0 0 0 0 0 -1 0 0 0 0 0

[26,] 0 0 0 0 0 0 0 -1 0 0 0 1 0

[27,] 0 0 0 0 0 0 0 -1 0 0 0 -1 0

[28,] 0 0 0 0 0 0 0 0 0 0 0 1 1

[29,] 0 0 0 0 0 0 0 0 0 0 0 -1 1

[30,] 0 0 0 0 0 0 1 -1 0 0 0 0 0

[31,] 0 0 0 0 0 0 -1 -1 0 0 0 0 0

[32,] 0 0 0 0 1 1 0 0 0 0 0 0 0

[33,] 0 0 0 0 -1 1 0 0 0 0 0 0 0

277

[34,] 0 0 0 0 0 0 0 0 0 0 1 0 1

[35,] 0 0 0 0 0 0 0 0 0 0 -1 0 1

[36,] 0 0 0 0 1 0 0 -1 0 0 0 0 0

[37,] 0 0 0 0 -1 0 0 -1 0 0 0 0 0

[38,] 0 0 0 0 0 0 0 1 0 0 0 0 1

[39,] 0 0 0 0 0 0 0 -1 0 0 0 0 1

>

> Boston2.mars$cuts

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]

[1,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 0.00 0.00

[2,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 0.00 6.07

[3,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 0.00 6.07

[4,] 0.00000 0 0 0 0.000 6.425 0.0 0.0000 0 0 0.0 0.00 0.00

[5,] 0.00000 0 0 0 0.000 6.425 0.0 0.0000 0 0 0.0 0.00 0.00

[6,] 0.00000 0 0 0 0.000 6.425 0.0 0.0000 0 0 17.8 0.00 0.00

[7,] 0.00000 0 0 0 0.000 6.425 0.0 0.0000 0 0 17.8 0.00 0.00

[8,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 335 0.0 0.00 6.07

[9,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 335 0.0 0.00 6.07

[10,] 0.00000 0 0 0 0.718 0.000 0.0 0.0000 0 0 0.0 0.00 6.07

[11,] 0.00000 0 0 0 0.718 0.000 0.0 0.0000 0 0 0.0 0.00 6.07

[12,] 0.00000 0 0 0 0.000 6.425 0.0 1.8195 0 0 0.0 0.00 0.00

278

[13,] 0.00000 0 0 0 0.000 6.425 0.0 1.8195 0 0 0.0 0.00 0.00

[14,] 4.54192 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 0.00 0.00

[15,] 4.54192 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 0.00 0.00

[16,] 4.54192 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 0.00 0.00

[17,] 4.54192 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 0.00 0.00

[18,] 4.54192 0 0 0 0.000 0.000 0.0 0.0000 0 242 0.0 0.00 0.00

[19,] 4.54192 0 0 0 0.000 0.000 0.0 0.0000 0 242 0.0 0.00 0.00

[20,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 0.00 23.97

[21,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 0.00 23.97

[22,] 0.00000 0 0 0 0.000 6.425 84.7 0.0000 0 0 0.0 0.00 0.00

[23,] 0.00000 0 0 0 0.000 6.425 84.7 0.0000 0 0 0.0 0.00 0.00

[24,] 0.00000 0 0 0 0.000 0.000 0.0 4.7075 0 0 0.0 0.00 0.00

[25,] 0.00000 0 0 0 0.000 0.000 0.0 4.7075 0 0 0.0 0.00 0.00

[26,] 0.00000 0 0 0 0.000 0.000 0.0 4.7075 0 0 0.0 373.66 0.00

[27,] 0.00000 0 0 0 0.000 0.000 0.0 4.7075 0 0 0.0 373.66 0.00

[28,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 376.73 6.07

[29,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 376.73 6.07

[30,] 0.00000 0 0 0 0.000 0.000 77.8 4.7075 0 0 0.0 0.00 0.00

[31,] 0.00000 0 0 0 0.000 0.000 77.8 4.7075 0 0 0.0 0.00 0.00

[32,] 0.00000 0 0 0 0.624 6.425 0.0 0.0000 0 0 0.0 0.00 0.00

[33,] 0.00000 0 0 0 0.624 6.425 0.0 0.0000 0 0 0.0 0.00 0.00

[34,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 0 12.6 0.00 6.07

279

[35,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 0 12.6 0.00 6.07

[36,] 0.00000 0 0 0 0.718 0.000 0.0 4.7075 0 0 0.0 0.00 0.00

[37,] 0.00000 0 0 0 0.718 0.000 0.0 4.7075 0 0 0.0 0.00 0.00

[38,] 0.00000 0 0 0 0.000 0.000 0.0 2.9879 0 0 0.0 0.00 6.07

[39,] 0.00000 0 0 0 0.000 0.000 0.0 2.9879 0 0 0.0 0.00 6.07

>

> Boston2.mars$coefficients

[,1]

[1,] 23.054932863

[2,] -0.475918184

[3,] 9.219231990

[4,] -2.823958066

[5,] -1.785616880

[6,] 0.742716889

[7,] 0.018297135

[8,] 0.015258475

[9,] 1.687538405

[10,] 9.636024180

[11,] -0.062409499

[12,] 2.555488850

[13,] 0.013238063

280

[14,] 0.593931139

[15,] 0.047603018

[16,] 2.588471925

[17,] -0.107836555

[18,] -0.012484569

[19,] 0.017930583

[20,] 0.001807397

[21,] 0.045138047

[22,] -98.745400075

[23,] -0.058675469

[24,] -6.453215924

[25,] -0.071061935

[26,] -0.140013447

>

> Boston2.mars$gcv

[1] 9.199464

281

4. Example: Boston housing data – MARS

◮ Fitted model

f(x) = 23.05 − 0.48 × (lstat − 6.07)+ + 9.21 × (6.07 − lstat)+
     − 2.82 × (rm − 6.43)+ − 1.78 × (6.43 − rm)+
     + 0.74 × (rm − 6.43)+ × (ptratio − 17.8)+
     + 0.002 × (rm − 6.43)+ × (17.8 − ptratio)+

+ · · ·

282

16. Dimensionality reduction: principal components analysis (PCA)

283

1. PCA: basic idea

◮ The idea is to describe/approximate variability (distribution) of

X = (X1, . . . , Xp) by a distribution in the space of smaller dimension.

◮ Approximation by a one–dimensional space: let

δ^T X = ∑_{i=1}^p δ_i X_i,   ‖δ‖² = ∑_{i=1}^p δ_i² = 1.

Which projection (normalized linear combination) is the “best

representer” of the vector distribution? Or how to choose δ?

◮ First optimization problem

(Opt1)   max_{δ: ‖δ‖=1} var(δ^T X) = max_{δ: ‖δ‖=1} δ^T Σ δ,   Σ = cov(X).

284

2. PCA: basic idea

◮ Solution to (Opt1) is eigenvector γ1 of Σ corresponding to the

maximal eigenvalue λ1. The first principal component is

Y_1 = γ_1^T X.

◮ Second optimization problem: max{δ^T Σ δ : ‖δ‖ = 1, δ^T γ_1 = 0}.

The solution is the eigenvector γ_2 of Σ corresponding to the second largest eigenvalue λ_2. Second principal component: Y_2 = γ_2^T X, and so on...

◮ In general, if Σ = ΓΛΓ^T, where Λ is the diagonal matrix of eigenvalues and Γ is the orthogonal matrix of eigenvectors, then the PCA transformation is

Y = Γ^T (X − µ).

285

PCA: theory

◮ Theorem: Let X ∼ N_p(µ, Σ), Σ = ΓΛΓ^T, and let Y = Γ^T (X − µ) be the principal components. Then

(i) EY_j = 0, var(Y_j) = λ_j, j = 1, . . . , p;  cov{Y_i, Y_j} = 0, ∀i ≠ j;

(ii) var(Y_1) ≥ var(Y_2) ≥ · · · ≥ var(Y_p);

(iii) ∑_{i=1}^p var(Y_i) = tr(Σ),  ∏_{i=1}^p var(Y_i) = det(Σ).

◮ Proportion of the variability explained by q components:

ψ_q = ∑_{i=1}^q λ_i / ∑_{i=1}^p λ_i = ∑_{i=1}^q var(Y_i) / ∑_{i=1}^p var(Y_i).

286

PCA: empirical version

◮ Idea: given X_i ∈ R^p, i = 1, . . . , n, estimate Γ and µ, apply the PCA transformation to the X_i's, and keep the first q variables that explain the variability “well”.

◮ Estimates:

µ = (1/n) ∑_{i=1}^n X_i,   Σ = (1/(n−1)) ∑_{i=1}^n (X_i − µ)(X_i − µ)^T.

◮ Spectral decomposition and PCA transformation:

Σ = ΓΛΓ^T,   Y_i = Γ^T (X_i − µ),   i = 1, . . . , n.

If variables are given on different scales, before applying PCA the data can be standardized:

X_i = D^{−1/2}(X_i − µ),   D = diag{Σ}.
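◮ A small illustration in R (a sketch on simulated data; any numeric data matrix X can be used): the spectral decomposition of the estimated covariance matrix reproduces what prcomp computes.

> set.seed(1)
> X <- matrix(rnorm(300), ncol=3) %*% matrix(c(2,1,0, 0,1,0, 0,0,0.5), 3, 3)
> mu <- colMeans(X)
> Sigma <- cov(X)                       # (1/(n-1)) sum (Xi - mu)(Xi - mu)^T
> e <- eigen(Sigma)                     # Sigma = Gamma Lambda Gamma^T
> Y <- sweep(X, 2, mu) %*% e$vectors    # Yi = Gamma^T (Xi - mu)
> round(apply(Y, 2, var), 4)            # variances of the components ...
> round(e$values, 4)                    # ... equal the eigenvalues
> round(prcomp(X)$sdev^2, 4)            # prcomp gives the same values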

287

1. Example: heptathlon data

◮ Data on 25 competitors: seven events and total score

>heptathlon

hurdles highjump shot run200m longjump javelin run800m

Joyner-Kersee (USA) 12.69 1.86 15.80 22.56 7.27 45.66 128.51

John (GDR) 12.85 1.80 16.23 23.65 6.71 42.56 126.12

Behmer (GDR) 13.20 1.83 14.20 23.10 6.68 44.54 124.20

Sablovskaite (URS) 13.61 1.80 15.23 23.92 6.25 42.78 132.24

...........................................................................

To recode all events in the same direction (“large” is “good”) we transform the running events:

>heptathlon$hurdles<-max(heptathlon$hurdles)-heptathlon$hurdles

>heptathlon$run200m<-max(heptathlon$run200m)-heptathlon$run200m

>heptathlon$run800m<-max(heptathlon$run800m)-heptathlon$run800m

288

2. Example: heptathlon data

◮ Correlations

>hept<-heptathlon[,-8] #without total score

>round(cor(hept),2)

hurdles highjump shot run200m longjump javelin run800m

hurdles 1.00 0.81 0.65 0.77 0.91 0.01 0.78

highjump 0.81 1.00 0.44 0.49 0.78 0.00 0.59

shot 0.65 0.44 1.00 0.68 0.74 0.27 0.42

run200m 0.77 0.49 0.68 1.00 0.82 0.33 0.62

longjump 0.91 0.78 0.74 0.82 1.00 0.07 0.70

javelin 0.01 0.00 0.27 0.33 0.07 1.00 -0.02

run800m 0.78 0.59 0.42 0.62 0.70 -0.02 1.00

javelin is weakly correlated with other variables...

289

Scatterplot of the data

>plot(hept)

[Figure: scatterplot matrix of the seven heptathlon events (hurdles, highjump, shot, run200m, longjump, javelin, run800m)]

290

1. Principal components in R

> hept_pca<-prcomp(hept,scale=TRUE)

> print(hept_pca)

Standard deviations:

[1] 2.1119364 1.0928497 0.7218131 0.6761411 0.4952441 0.2701029 0.2213617

Rotation:

PC1 PC2 PC3 PC4 PC5 PC6

hurdles -0.4528710 0.15792058 -0.04514996 0.02653873 -0.09494792 -0.78334101

highjump -0.3771992 0.24807386 -0.36777902 0.67999172 0.01879888 0.09939981

shot -0.3630725 -0.28940743 0.67618919 0.12431725 0.51165201 -0.05085983

run200m -0.4078950 -0.26038545 0.08359211 -0.36106580 -0.64983404 0.02495639

longjump -0.4562318 0.05587394 0.13931653 0.11129249 -0.18429810 0.59020972

javelin -0.0754090 -0.84169212 -0.47156016 0.12079924 0.13510669 -0.02724076

run800m -0.3749594 0.22448984 -0.39585671 -0.60341130 0.50432116 0.15555520

PC7

hurdles 0.38024707

highjump -0.43393114

shot -0.21762491

run200m -0.45338483

longjump 0.61206388

291

javelin 0.17294667

run800m -0.09830963

> summary(hept_pca)

Importance of components:

PC1 PC2 PC3 PC4 PC5 PC6 PC7

Standard deviation 2.112 1.093 0.7218 0.6761 0.4952 0.2701 0.221

Proportion of Variance 0.637 0.171 0.0744 0.0653 0.0350 0.0104 0.007

Cumulative Proportion 0.637 0.808 0.8822 0.9475 0.9826 0.9930 1.000

> a1<-hept_pca$rotation[,1] # linear combination for the 1st principal

# component

> a1

hurdles highjump shot run200m longjump javelin run800m

-0.4528710 -0.3771992 -0.3630725 -0.4078950 -0.4562318 -0.0754090 -0.3749594

>

292

2. Principal components in R

◮ First principal component:

> predict(hept_pca)[,1] # or just hept_pca$x[,1]

Joyner-Kersee (USA) John (GDR) Behmer (GDR) Sablovskaite (URS)

-4.121447626 -2.882185935 -2.649633766 -1.343351210

Choubenkova (URS) Schulz (GDR) Fleming (AUS) Greiner (USA)

-1.359025696 -1.043847471 -1.100385639 -0.923173639

Lajbnerova (CZE) Bouraga (URS) Wijnsma (HOL) Dimitrova (BUL)

-0.530250689 -0.759819024 -0.556268302 -1.186453832

Scheider (SWI) Braun (FRG) Ruotsalainen (FIN) Yuping (CHN)

0.015461226 0.003774223 0.090747709 -0.137225440

Hagger (GB) Brown (USA) Mulliner (GB) Hautenauve (BEL)

0.171128651 0.519252646 1.125481833 1.085697646

Kytola (FIN) Geremias (BRA) Hui-Ing (TAI) Jeong-Mi (KOR)

1.447055499 2.014029620 2.880298635 2.970118607

Launa (PNG)

6.270021972

◮ The first two components account for 81% of the variance.

293

>plot(hept_pca)

[Figure: screeplot of the principal component variances produced by plot(hept_pca)]

> cor(heptathlon$score,hept_pca$x[,1])

[1] -0.9910978

294

>plot(heptathlon$score, hept_pca$x[,1])

[Figure: scatterplot of heptathlon$score against the first principal component hept_pca$x[,1]]

295

17. Clustering: model–based clustering, K–means, K–medoids

296

1. Clustering problem

◮ Clustering ⇔ unsupervised learning:

Grouping or segmenting objects (observations) into subsets or

”clusters”, such that those within each cluster are more similar

to each other than they are to the members of other groups.

◮ Example – Iris data

Given the measurements in centimeters of the four variables

sepal length/width and petal length/width

for 50 flowers from each of three Iris species, the goal is to group the observations in accordance with the species. The species are Iris setosa, versicolor, and virginica.

297

2. Clustering problem

◮ Data: {X1, . . . , Xn}, Xi ∈ Rp

◮ Dissimilarity (proximity) measure: if d(·, ·) is a distance,

D = {dij}i,j=1,...,n, dij = d(Xi, Xj),

e.g., di,j = ‖Xi −Xj‖2.

◮ Clustering algorithm maps each observation Xi to one of the K

groups,

C : {X1, . . . , Xn} → {1, . . . , K}.

298

Parametric approach

◮ Formulation: Let X_1, . . . , X_n be independent vectors from K populations G_1, . . . , G_K, and

X_i ∼ f(x, θ_k) when X_i is sampled from G_k.

◮ Observation labels: For i = 1, . . . , n let γi = k if Xi is sampled from

Gk. The vector γ = (γ1, . . . , γn) is unknown.

◮ Clusters

Ck =⋃

i:γi=k

{Xi}, k = 1, . . .K.

The goal is to find Ck, k = 1, . . . , K.

299

1. Maximum likelihood clustering

◮ Likelihood function to be maximized w.r.t. γ and θ = (θ_1, . . . , θ_K):

L(γ; θ) = ∏_{i: γ_i=1} f(X_i, θ_1) ∏_{i: γ_i=2} f(X_i, θ_2) · · · ∏_{i: γ_i=K} f(X_i, θ_K).

◮ Specific case: f(x, θ_k) = N_p(µ_k, Σ_k), k = 1, . . . , K

ln L(γ; θ) = const − (1/2) ∑_{k=1}^K ∑_{i: X_i∈C_k} (X_i − µ_k)^T Σ_k^{−1} (X_i − µ_k) − (1/2) ∑_{k=1}^K n_k ln |Σ_k|,

where n_k = ∑_{i=1}^n I{X_i ∈ C_k}, k = 1, . . . , K.

◮ When γ (partition) is fixed and ln L(γ; θ) is maximized w.r.t. θ,

µ_k(γ) = (1/n_k) ∑_{X_i∈C_k} X_i,   Σ_k(γ) = (1/n_k) ∑_{X_i∈C_k} [X_i − µ_k(γ)][X_i − µ_k(γ)]^T.

300

2. Maximum likelihood clustering

◮ Substituting µ_k(γ) and Σ_k(γ) we obtain

ln L(γ; θ) = constant − (1/2) ∑_{k=1}^K n_k ln |Σ_k(γ)|.

Thus the optimization problem is to minimize

∏_{k=1}^K |Σ_k(γ)|^{n_k}

over all partitions of the set of observations into K groups.

◮ Computationally infeasible problem

301

How to choose K?

In many cases we don’t know how big K should be.

◮ Use some distance measure (”within–groups variance” and

”between–groups variance”)

◮ Use a visual plot if possible (can plot points and distances)

◮ If we can define a cluster ”purity measure” (as in CART), then one

can optimize over this and penalize for complexity.

302

K–means clustering algorithm

◮ The basic idea

1. Given a preliminary partition {C_1^old, . . . , C_K^old} of the observations, compute the group means

m_k = (1/#(C_k)) ∑_{i: X_i∈C_k} X_i,   k = 1, . . . , K.

2. Associate each observation with the cluster whose mean is closest,

C_k^new = {X_i : ‖X_i − m_k‖ ≤ min_{j=1,...,K, j≠k} ‖X_i − m_j‖}.

3. Go to 1.

◮ Solves the problem: ∑_{i=1}^n min_{k=1,...,K} ‖X_i − m_k‖² → min over m_1, . . . , m_K.

303

K–means: a simple numerical example

◮ Data: {2, 4, 10, 12, 3, 20, 30, 11, 25}, K = 2.

1. Let m1 = 2, m2 = 4 ⇒ C1 = {2, 3}, C2 = {4, 10, 12, 20, 30, 11, 25}

2. New centers m1 = 2.5, m2 = 16 ⇒

C1 = {2, 3, 4}, C2 = {10, 12, 20, 30, 11, 25}.

m1      m2      C1                         C2
3       18      {2, 3, 4, 10}              {12, 20, 30, 11, 25}
4.75    19.6    {2, 3, 4, 10, 11, 12}      {20, 30, 25}
7       25      {2, 3, 4, 10, 11, 12}      {20, 30, 25}

and the algorithm stops.
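◮ The same toy example can be run with kmeans in R; with algorithm="Lloyd" the iterations follow the batch updates in the table above (starting centers as in step 1).

> x <- c(2, 4, 10, 12, 3, 20, 30, 11, 25)
> km <- kmeans(x, centers=c(2, 4), algorithm="Lloyd")
> km$centers           # final group means 7 and 25
> split(x, km$cluster) # the two clusters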

304

1. K-means: simulated example

# Data generation (mvrnorm requires the MASS package)
> library(MASS)

> m1=c(0,1)

> m2=c(4.5,0)

> S1<-cbind(c(2, 0.5), c(0.5, 3))

> S2<-cbind(c(2, -1.5), c(-1.5, 3))

> X1<-mvrnorm(100, m1, S1)

> X2<-mvrnorm(100, m2, S2)

> Y=rbind(X1, X2)

> plot(Y)

> points(X1, col="red")

> points(X2, col="blue")

#

# K-means algorithm

#

>Y.km<-kmeans(Y, 2, iter.max=20)

>Y.km

K-means clustering with 2 clusters of sizes 97, 103

Cluster means:

305

[,1] [,2]

1 4.55303155 -0.0987873

2 0.08029854 1.1484915

Clustering vector:

[1] 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2

[38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

[75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1

[112] 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

[149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1

[186] 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1

Within cluster sum of squares by cluster:

[1] 389.4226 426.2129

Available components:

[1] "cluster" "centers" "withinss" "size"

306

2. K–means: simulated example

[Figure: scatterplot of the simulated data, Y[,1] vs Y[,2], showing the two K–means clusters]

307

1. K-means: old swiss banknote data

308

2. K-means: old swiss banknote data

◮ Data: 200 observations on the variables

– X1 = length of the bill

– X2 = height of the bill (right)

– X3 = height of the bill (left)

– X4 = distance of the inner frame to the lower border

– X5 = distance of the inner frame to the upper border

– X6 = length of the diagonal of the central picture

◮ The first 100 banknotes are genuine, the other 100 are counterfeit

309

3. K-means: old swiss banknote data

> bank2<- read.table(file="bank2.dat")

> bank2.km<-kmeans(bank2, 2, 20)

> bank2.km

K-means clustering with 2 clusters of sizes 100, 100

Cluster means:

V1 V2 V3 V4 V5 V6

1 214.823 130.300 130.193 10.530 11.133 139.450

2 214.969 129.943 129.720 8.305 10.168 141.517

Clustering vector:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

310

81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Within cluster sum of squares by cluster:

[1] 225.2233 142.8852

Available components:

[1] "cluster" "centers" "withinss" "size"

311

K–medoids clustering algorithm

◮ Extends K–means to non–Euclidean dissimilarity measures d(·, ·).

◮ Algorithm

1. Given an initial partition into clusters C = {C_1, . . . , C_K}, find in each cluster the observation that minimizes the sum of distances to the other observations:

i_k* = arg min_{i: X_i∈C_k} ∑_{j: X_j∈C_k} d(X_i, X_j)  ⇒  m_k = X_{i_k*},   k = 1, . . . , K.

2. Associate each observation with the cluster of the closest of the found centroids (medoids) {m_1, . . . , m_K}. Proceed until convergence.

◮ Objective function

min over C and {i_k}_{k=1}^K of   ∑_{k=1}^K ∑_{j: X_j∈C_k} d(X_{i_k}, X_j).

312

K–medoids example: countries dissimilarities

◮ Data set: the average dissimilarity scores matrix between 12 countries

(Belgium, Brazil, Chile, Cuba, Egypt, France, India, Israel, USA,

USSR, Yugoslavia and Zaire).

> library(cluster)

> x <-read.table("countries.data")

> val<-c("BEL","BRA", "CHI","CUB", "EGY", "FRA", "IND", "ISR", "USA", "USS",

+ "YUG", "ZAI")

> rownames(x) <- val

> colnames(x) <- val

> x

313

BEL BRA CHI CUB EGY FRA IND ISR USA USS YUG ZAI

BEL 0.00 5.58 7.00 7.08 4.83 2.17 6.42 3.42 2.50 6.08 5.25 4.75

BRA 5.58 0.00 6.50 7.00 5.08 5.75 5.00 5.50 4.92 6.67 6.83 3.00

CHI 7.00 6.50 0.00 3.83 8.17 6.67 5.58 6.42 6.25 4.25 4.50 6.08

CUB 7.08 7.00 3.83 0.00 5.83 6.92 6.00 6.42 7.33 2.67 3.75 6.67

EGY 4.83 5.08 8.17 5.83 0.00 4.92 4.67 5.00 4.50 6.00 5.75 5.00

FRA 2.17 5.75 6.67 6.92 4.92 0.00 6.42 3.92 2.25 6.17 5.42 5.58

IND 6.42 5.00 5.58 6.00 4.67 6.42 0.00 6.17 6.33 6.17 6.08 4.83

ISR 3.42 5.50 6.42 6.42 5.00 3.92 6.17 0.00 2.75 6.92 5.83 6.17

USA 2.50 4.92 6.25 7.33 4.50 2.25 6.33 2.75 0.00 6.17 6.67 5.67

USS 6.08 6.67 4.25 2.67 6.00 6.17 6.17 6.92 6.17 0.00 3.67 6.50

YUG 5.25 6.83 4.50 3.75 5.75 5.42 6.08 5.83 6.67 3.67 0.00 6.92

ZAI 4.75 3.00 6.08 6.67 5.00 5.58 4.83 6.17 5.67 6.50 6.92 0.00

> x.pam2<-pam(x,2, diss=T)

> summary(x.pam2)

Medoids:

ID

[1,] "9" "USA"

[2,] "4" "CUB"

Clustering vector:

314

BEL BRA CHI CUB EGY FRA IND ISR USA USS YUG ZAI

1 1 2 2 1 1 2 1 1 2 2 1

Objective function:

build swap

3.291667 3.236667

Numerical information per cluster:

size max_diss av_diss diameter separation

[1,] 7 5.67 3.227143 6.17 4.67

[2,] 5 6.00 3.250000 6.17 4.67

# diameter - maximal dissimilarity between observations in the cluster

# separation - minimal dissimilarity between an observation of the cluster

# and an observation of another cluster.

Isolated clusters:

L-clusters: character(0) #

L*-clusters: character(0) # diameter < separation

# L-cluster: for each observation i the maximal dissimilarity between

# i and any other observation of the cluster is smaller than the minimal

# dissimilarity between i and any observation of another cluster.

315

Silhouette plot information:

cluster neighbor sil_width

USA 1 2 0.42519084

BEL 1 2 0.39129752

FRA 1 2 0.35152954

ISR 1 2 0.29785894

BRA 1 2 0.22317708

EGY 1 2 0.19652641

ZAI 1 2 0.18897849

CUB 2 1 0.39814815

USS 2 1 0.34104696

CHI 2 1 0.32512211

YUG 2 1 0.26177642

IND 2 1 -0.04466159

Average silhouette width per cluster:

[1] 0.2963655 0.2562864

Average silhouette width of total data set:

[1] 0.2796659

Available components:

[1] "medoids" "id.med" "clustering" "objective" "isolation"

[6] "clusinfo" "silinfo" "diss" "call"

316

1. Silhouette plot

> plot(x.pam2)

[Figure: silhouette plot of pam(x = x, k = 2, diss = T); average silhouette width 0.28; cluster 1: n = 7, average width 0.30; cluster 2: n = 5, average width 0.26]

317

2. Silhouette plot

◮ For each observation Xi from cluster Ck define

* in–cluster average dissimilarity

a(i) = ∑_{j: X_j∈C_k, j≠i} d(X_i, X_j) / #{j ≠ i : X_j ∈ C_k},   X_i ∈ C_k.

* between–clusters average dissimilarity

d(i, C_m) = ∑_{j: X_j∈C_m} d(X_i, X_j) / #{j : X_j ∈ C_m},   m ≠ k, m ∈ {1, . . . , K},

b(i) = min_{m≠k, m=1,...,K} d(i, C_m).

◮ Silhouette width:

s(i) = [b(i) − a(i)] / max{a(i), b(i)},   −1 ≤ s(i) ≤ 1.

s(i) = 0 if X_i is the only observation in its cluster. The silhouette plot shows s(i) in decreasing order within each cluster.
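◮ These quantities are computed by silhouette() in the cluster package; a minimal sketch for the countries example (x and x.pam2 as in the session above):

> library(cluster)
> sil <- silhouette(x.pam2$clustering, dmatrix=as.matrix(x))
> sil[, "sil_width"]        # the individual s(i)
> summary(sil)$avg.width    # overall average width, about 0.28 here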

318

3. Silhouette plot

◮ Interpretation of silhouette width

– large s(i) (almost 1) – observation is very well clustered;

– small s(i) (around 0) – observation lies between two clusters;

– negative s(i) – observation is badly clustered.

◮ Silhouette coefficient SC = (1/n) ∑_{i=1}^n s(i): choose K with maximal SC.

– SC ≥ 0.7: strong cluster structure

– 0.5 ≤ SC ≤ 0.7: reasonable structure

– 0.25 ≤ SC ≤ 0.5: weak structure

– SC ≤ 0.25: no structure

319

Countries dissimilarities: 3-medoids

> x.pam3<-pam(x,3, diss=T)

> summary(x.pam3)

Medoids:

ID

[1,] "9" "USA"

[2,] "12" "ZAI"

[3,] "4" "CUB"

Clustering vector:

BEL BRA CHI CUB EGY FRA IND ISR USA USS YUG ZAI

1 2 3 3 1 1 2 1 1 3 3 2

Objective function:

build swap

2.583333 2.506667

Numerical information per cluster:

size max_diss av_diss diameter separation

[1,] 5 4.50 2.4000 5.0 4.67

[2,] 3 4.83 2.6100 5.0 4.67

[3,] 4 3.83 2.5625 4.5 5.25

320

Isolated clusters:

L-clusters: character(0)

L*-clusters: [1] 3

Silhouette plot information:

cluster neighbor sil_width

USA 1 2 0.46808511

FRA 1 2 0.43971831

BEL 1 2 0.42149254

ISR 1 2 0.36561099

EGY 1 2 0.02118644

ZAI 2 1 0.27953625

BRA 2 1 0.25456578

IND 2 3 0.17498951

CUB 3 2 0.47890188

USS 3 1 0.43682195

YUG 3 1 0.31304749

CHI 3 2 0.30726872

Average silhouette width per cluster:

[1] 0.3432187 0.2363638 0.3840100

Average silhouette width of total data set:

321

[1] 0.3301021

Available components:

[1] "medoids" "id.med" "clustering" "objective" "isolation"

[6] "clusinfo" "silinfo" "diss" "call"

>

> plot(x.pam3)

322

[Figure: silhouette plot of pam(x = x, k = 3, diss = T); average silhouette width 0.33; cluster 1: n = 5, average width 0.34; cluster 2: n = 3, average width 0.24; cluster 3: n = 4, average width 0.38]

323

18. Clustering: hierarchical methods

324

Hierarchical methods

◮ Two main types of hierarchical clustering

– Agglomerative:

* Start with the points as individual clusters

* At each step, merge the closest pair of clusters until only one

cluster (or K clusters) left

– Divisive:

* Start with one, all–inclusive cluster

* At each step, split a cluster until each cluster contains a point

(or there are K clusters).

◮ Algorithms use a dissimilarity/distance matrix and merge or split one cluster at a time.

325

Dissimilarity measures between clusters

Dissimilarity between clusters R and Q

◮ Average dissimilarity

D(R, Q) = (1/(N_R N_Q)) ∑_{X_i∈R, X_j∈Q} d(X_i, X_j)

Best when clusters are ball–shaped, fairly well–separated.

◮ Nearest neighbor (single linkage)

D(R, Q) = min_{X_i∈R, X_j∈Q} d(X_i, X_j)

Can lead to a ”chaining” effect, elongated clusters.

◮ Furthest neighbor (complete linkage)

D(R, Q) = max_{X_i∈R, X_j∈Q} d(X_i, X_j)

Tends to produce small compact clusters.

326

Agglomerative algorithm

Assume that we have a measure of dissimilarity between clusters.

1. Construct the finest partition (one observation in each cluster)

repeat:

2. Compute the dissimilarity matrix between clusters.

3. Find the two clusters with the smallest dissimilarity, and merge them

into one cluster.

4. Compute the dissimilarity between the new groups

until: only one cluster remains.

327

1. Example: clustering using single linkage

◮ Initial dissimilarity matrix (5 objects)

1 2 3 4 5

1 0

2 9 0

3 3 7 0

4 6 5 9 0

5 11 10 2 8 0

⇒ because min_{i,j} d_ij = d_53 = 2,

merge 3 and 5 to cluster (35)

◮ Nearest neighbor distances:

d(35)1 = min{d31, d51} = min{3, 11} = 3

d(35)2 = min{d32, d52} = min{7, 10} = 7

d(35)4 = min{d34, d54} = min{9, 8} = 8.

328

2. Example: clustering using single linkage

◮ Dissimilarity matrix after merger of 3 and 5

(35) 1 2 4

(35) 0

1 3 0

2 7 9 0

4 8 6 5 0

⇒ d(35)1 = 3; merge (35) and 1 to (135)

◮ Distances

d(135)2 = min{d(35)2, d12} = min{7, 9} = 7

d(135)4 = min{d(35)4, d14} = min{8, 6} = 6.

329

3. Example: clustering using single linkage

◮ Dissimilarity matrix after merger of (35) and 1

(135) 2 4

(135) 0

2 7 0

4 6 5 0

⇒ d42 = 5; merge 4 and 2 to (42)

d(135)(24) = min{d(135)2, d(135)4} = min{7, 6} = 6.

◮ The final dissimilarity matrix

(135) (24)

(135) 0

(24) 6 0

⇒ (135) and (24) are merged on the level 6.

330

4. Example: resulting dendrogram

[Figure: single–linkage dendrogram; leaves in the order 1, 3, 5, 2, 4; merge heights 2 for (35), 3 for (135), 5 for (24), and 6 for the final merge]
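◮ The merges above can be reproduced with hclust; a small sketch with the 5 × 5 dissimilarity matrix entered by hand (merge heights 2, 3, 5, 6):

> d <- as.dist(matrix(c( 0,  9,  3,  6, 11,
+                        9,  0,  7,  5, 10,
+                        3,  7,  0,  9,  2,
+                        6,  5,  9,  0,  8,
+                       11, 10,  2,  8,  0), nrow=5, byrow=TRUE))
> hc <- hclust(d, method="single")
> hc$height    # 2 3 5 6
> plot(hc)     # dendrogram as in the figure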

331

Agglomerative coefficient

◮ Measures of how much ”clustering structure” exists in the data

◮ C_{k(j)} is the first cluster X_j is merged with, j = 1, . . . , n; R, Q are the two last clusters that are merged at the final step of the algorithm. For each observation X_j define

α(j) = D(X_j, C_{k(j)}) / D(R, Q),   j = 1, . . . , n.

◮ Agglomerative coefficient: AC = (1/n) ∑_{j=1}^n [1 − α(j)]

– large AC (close to 1): observations are merged with clusters close

to them in the beginning as compared to the final merger;

– small AC: evenly distributed data, poor clustering structure.

332

1. Example: Swiss Provinces Data (1888)

◮ Data: 47 observations on 5 measures of socio–economic indicators on

Swiss provinces:

– Agriculture: % of males involved in agriculture as occupation

– Examination: % ”draftees” receiving highest mark on army

examination

– Education: % education beyond primary school for ”draftees”

– Catholic: % catholic (as opposed to ”protestant”)

– Infant.Mortality: % live births who live less than 1 year

333

2. Example: Swiss Provinces Data (1888)

> library(cluster)

> swiss.x<-swiss[,-1]

> sagg<-agnes(swiss.x)

> pltree(sagg)

> print(sagg$ac)

[1] 0.8774795

◮ By default agnes uses average dissimilarity (complete and single

linkage can be chosen).

◮ There are two main groups with one point V. De Geneve

well–separated from them.

◮ Fairly large AC = 0.878 suggests good clustering structure in the

data.

334

[Figure: Dendrogram of agnes(x = swiss.x), agnes (*, "average"); height axis 0–80, province names as leaf labels]

335

Divisive clustering algorithm

◮ Splitting cluster C to A and B

1. Initialization: A = C, B = ∅

2. For each observation Xj compute

� average dissimilarity a(j) from all other observations in A

� average dissimilarity d(j, B) to all observations in B; d(j, ∅) = 0.

3. Select observation Xk such that S(k) = a(k)− d(k,B) is maximal.

4. If S(k) ≥ 0, add X_k to cluster B: B = B ∪ {X_k}, A = A\{X_k}, and go to 2. If S(k) < 0, or A contains only one observation, stop.

◮ Apply the same procedure to clusters A and B (at each step split

cluster with maximal diameter).

336

1. Example: divisive clustering algorithm

◮ Dissimilarity matrix (5 objects)

1 2 3 4 5

1 0

2 9 0

3 3 7 0

4 6 5 9 0

5 11 10 2 8 0

◮ Initial clusters:

A = {1, 2, 3, 4, 5}, B = ∅, diam(A) = d51 = 11.

337

2. Example: divisive clustering algorithm

◮ Step 1: average dissimilarities of observations

a(1) = (9 + 3 + 6 + 11)/4 = 7.25,   a(2) = (9 + 7 + 5 + 10)/4 = 7.75,

a(3) = (3 + 7 + 9 + 2)/4 = 5.25,   a(4) = (6 + 5 + 9 + 8)/4 = 7,

a(5) = (11 + 10 + 2 + 8)/4 = 7.75   ⇒   A = {1, 2, 3, 4}, B = {5}, diam(A) = 9.

◮ Step 2: computing average dissimilarities and S(·)

a(1) = (9 + 3 + 6)/3 = 6,       S(1) = a(1) − d_15 = 6 − 11 = −5
a(2) = (9 + 7 + 5)/3 = 7,       S(2) = a(2) − d_25 = 7 − 10 = −3
a(3) = (3 + 7 + 9)/3 = 6.333,   S(3) = a(3) − d_35 = 6.333 − 2 = 4.333
a(4) = (6 + 5 + 9)/3 = 6.667,   S(4) = a(4) − d_45 = 6.667 − 8 = −1.333

⇒ A = {1, 2, 4}, B = {5, 3}, diam(A) = 9, diam(B) = 2.

338

3. Example: divisive clustering algorithm

◮ Step 3:

a(1) = (9 + 6)/2 = 7.5,   d(1, B) = (3 + 11)/2 = 7,     S(1) = 0.5
a(2) = (9 + 5)/2 = 7,     d(2, B) = (7 + 10)/2 = 8.5,   S(2) = −1.5
a(4) = (6 + 5)/2 = 5.5,   d(4, B) = (9 + 8)/2 = 8.5,    S(4) = −3

⇒ A = {2, 4}, B = {1, 3, 5}, diam(A) = 5, diam(B) = 11.

◮ Step 4:

a(2) = 5,   d(2, B) = (9 + 7 + 10)/3 = 8.667,   S(2) = −3.667
a(4) = 5,   d(4, B) = (6 + 9 + 8)/3 = 7.667,    S(4) = −2.667

Since S(2) and S(4) are negative, the procedure stops after step 4 with two clusters A = {2, 4} and B = {1, 3, 5}. Next we start to divide cluster B (it has the larger diameter).

339

Divisive coefficient

◮ Cluster diameter

diam(C) = max_{X_i∈C, X_j∈C} d(X_i, X_j)

◮ For observation X_j, let δ(j) be the diameter of the last cluster to which X_j belongs (before it is split off as a single observation), divided by the diameter of the whole dataset.

◮ Divisive coefficient: DC = (1/n) ∑_{j=1}^n [1 − δ(j)]

– large DC (close to 1): on average observations are in small

compact clusters (relative to the size of the whole dataset) before

being split off; evidence of a good clustering structure.

340

Example: Swiss Provinces Data (1888)

> library(cluster)

> swiss.x<-swiss[,-1]

> sdiv<-diana(swiss.x)

> pltree(sdiv)

> print(sdiv$dc)

[1] 0.903375

◮ diana uses average dissimilarity and Euclidean distance.

◮ There are three well–separated clusters.

◮ Fairly large DC = 0.903 suggests good clustering structure in the

data.

341

[Figure: Dendrogram of diana(x = swiss.x); height axis 0–120, province names as leaf labels; three well–separated clusters]

342

Hierarchical procedures – comments

◮ The procedures may be sensitive to outliers, ”noise points”

◮ Try different methods and, within a given method, different ways of assigning distances (complete, average, single linkage). Roughly consistent outcomes indicate validity of the clustering structure.

◮ Stability of the hierarchical solution can be checked by applying the

algorithm to slightly perturbed data set.

343

Pottery data

◮ Data: chemical composition of 45 specimens of Romano–British

pottery for 9 oxides. A kiln site at which the pottery is found is also

known (5 different sites).

◮ Question: whether the pots can be divided into distinct groups and

how these groups relate to the kiln site?

> pottery

Al2O3 Fe2O3 MgO CaO Na2O K2O TiO2 MnO BaO

1 18.8 9.52 2.00 0.79 0.40 3.20 1.01 0.077 0.015

2 16.9 7.33 1.65 0.84 0.40 3.05 0.99 0.067 0.018

...................................................

...................................................

44 14.8 2.74 0.67 0.03 0.05 2.15 1.34 0.003 0.015

45 19.1 1.64 0.60 0.10 0.03 1.75 1.04 0.007 0.018

344

K-means algorithm

>wss<-rep(0,10)

>wss[1]<-44*sum(apply(pottery,2,var))

>for (i in 2:10) wss[i]<-mean(kmeans(pottery,i)$withinss)

>plot(2:10,wss[2:10],type="b",xlab="Number of groups",ylab="Mean-within-group SS")

>kmeans(pottery,5)$clust

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26

5 5 5 5 5 5 5 5 3 3 3 3 3 5 5 3 5 5 5 5 5 1 1 1 2 1

27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45

2 2 2 2 1 1 1 2 2 4 4 4 4 4 4 4 4 4 4

> kmeans(pottery,4)$clust

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26

4 4 4 4 4 4 4 4 2 2 2 2 2 4 4 2 4 4 4 4 4 2 2 2 1 1

27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45

1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3

345

[Figure: mean within–group SS plotted against the number of groups (2–10)]

346

K-medoids

> sil.w<-rep(0,10)

> for (i in 2:10) sil.w[i]<-pam(pottery,i,diss=F)$silinfo$avg.width

> sil.w[2:10]

[1] 0.5018253 0.6031487 0.5038343 0.4991434 0.5061460 0.4781251 0.4644091

[8] 0.4537507 0.4391603

> p3.med<-pam(pottery,3,diss=F)

> p3.med

Medoids:

ID Al2O3 Fe2O3 MgO CaO Na2O K2O TiO2 MnO BaO

2 2 16.9 7.33 1.65 0.84 0.40 3.05 0.99 0.067 0.018

32 32 12.4 6.13 5.69 0.22 0.54 4.65 0.70 0.159 0.015

38 38 18.0 1.50 0.67 0.01 0.06 2.11 0.92 0.001 0.016

Clustering vector:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2

27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45

2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3

> plot(p3.med)

347

[Figure: clusplot of pam(x = pottery, k = 3, diss = F) — the first two components explain 74.75% of the point variability — and the corresponding silhouette plot; average silhouette width 0.6; cluster 1: n = 21, average width 0.62; cluster 2: n = 14, average width 0.53; cluster 3: n = 10, average width 0.67]

348

Agglomerative algorithm

> pot.ag<-agnes(pottery,diss=F)

> pot.ag

Call: agnes(x = pottery, diss = F)

Agglomerative coefficient: 0.8878186

Order of objects:

[1] 1 2 4 14 15 18 7 9 16 3 20 5 21 8 6 19 17 10 12 13 11 22 24 23 26

[26] 32 33 31 25 29 27 30 34 35 28 36 42 41 38 39 45 43 40 37 44

Height (summary):

Min. 1st Qu. Median Mean 3rd Qu. Max.

0.1273 0.5273 1.0020 1.4720 1.8430 7.4430

Available components:

[1] "order" "height" "ac" "merge" "diss" "call"

[7] "method" "order.lab" "data"

> pltree(pot.ag)

349

[Figure: Dendrogram of agnes(x = pottery, diss = F, method = "average")]

350

Divisive algorithm

> pot.dv<-diana(pottery,diss=F)

> pot.dv

......................................................................

Height:

[1] 3.5098771 0.1273460 0.8565378 0.3898923 0.5561591 1.3314567

[7] 2.6377172 0.4109903 0.6023363 0.2934962 0.4666101 1.2262288

[13] 0.2095328 0.5655882 6.3982699 0.7580923 1.8114795 0.7463518

[19] 0.3744663 3.1056614 10.4430689 0.5786156 1.1191769 3.8495285

[25] 1.0225287 1.9532691 5.4350004 0.6852452 1.3666854 2.9777214

[31] 1.8154022 1.1270909 0.3005478 3.5697204 11.7033446 0.3178836

[37] 0.6550580 0.8893852 0.4395009 1.5469845 2.5117392 4.2065322

[43] 6.1297881 1.0822223

Divisive coefficient:

[1] 0.9161985

> pltree(pot.dv)

351

[Figure: Dendrogram of diana(x = pottery, diss = F)]

352

19. Multidimensional scaling

353

MDS: problem and examples

◮ Problem: based on the dissimilarity (proximity) matrix between objects, find a spatial representation of these objects in a low–dimensional space.

◮ MDS deals with “fitting” the data in a low–dimensional space with minimal distortion to the distances between the original points.

◮ Examples:

– Given a matrix of inter–city distances in USA, produce the map.

– Find a geometric representation for dissimilarities between cars.

– Given data on enrollment and graduation in 25 US universities,

produce a two–dimensional representation of the universities.

354

Metric multidimensional scaling

◮ Problem formulation: let X be an n × p data matrix, with X_i = (x_i1, . . . , x_ip) its ith row (observation). We are given the matrix of Euclidean distances between observations

D = {d_ij}_{i,j=1,...,n},   d_ij² = ∑_{k=1}^p (x_ik − x_jk)² = ‖X_i − X_j‖².

Based on D, we want to reconstruct the data matrix X .

◮ No unique solution exists: the distances do not change if the data

points are rotated or reflected. Usually the restriction that the mean

vector of the configuration is zero is added.

355

1. How to reconstruct X from D?

◮ Inner product matrix: define the n × n matrix B = XX^T,

b_ij = ∑_{k=1}^p x_ik x_jk,   i, j = 1, . . . , n.

Matrix B is related to matrix D:

d_ij² = ∑_{k=1}^p (x_ik − x_jk)² = ∑_{k=1}^p x_ik² + ∑_{k=1}^p x_jk² − 2 ∑_{k=1}^p x_ik x_jk = b_ii + b_jj − 2 b_ij.   (1)

◮ The idea is to reconstruct first matrix B and then to factorize it.

◮ Centering: assume that all variables are centered, i.e.

X_{·k} = (1/n) ∑_{i=1}^n x_ik = 0,   ∀k = 1, . . . , p.

This implies that ∑_{i=1}^n b_ij = 0, ∀j.

356

2. How to reconstruct X from D?

◮ Summing (1) over i, over j, and over both i and j we have

∑_{i=1}^n d_ij² = tr(B) + n b_jj, ∀j;   ∑_{j=1}^n d_ij² = tr(B) + n b_ii, ∀i;   ∑_{i=1}^n ∑_{j=1}^n d_ij² = 2n tr(B).

Solving this system for b_ii, b_jj and substituting into (1) we get

b_ij = −(1/2) ( d_ij² − (1/n) ∑_{j=1}^n d_ij² − (1/n) ∑_{i=1}^n d_ij² + (1/n²) ∑_{i=1}^n ∑_{j=1}^n d_ij² ).   (2)

357

3. How to reconstruct X from D?

◮ The last step is the factorization of matrix B: the SVD of B is

B = V ΛV^T,   Λ = diag{λ_1, . . . , λ_n},   V orthogonal, V V^T = I;

here for definiteness λ_1 ≥ λ_2 ≥ · · · ≥ λ_n. Then

X = V Λ^{1/2},   Λ^{1/2} = diag(√λ_1, . . . , √λ_n).

◮ If the n × p matrix X is of full rank, i.e., rank(X) = p, then n − p eigenvalues of B will be zero, i.e., Λ = diag{λ_1, . . . , λ_p, 0, . . . , 0}.

◮ The best q–dimensional representation, q ≤ p, retains the first q eigenvalues. The adequacy is judged by S_q = ∑_{i=1}^q λ_i / ∑_{i=1}^p λ_i.

358

MDS algorithm summary

1. For given matrix of distances D compute matrix B using (2).

2. Perform singular value decomposition of B, B = V ΛV T ; for

definiteness, let λ1 ≥ λ2 ≥ · · · ≥ λn.

3. Retain q largest eigenvalues, q ≤ p, set Λ1 = diag{λ1, . . . , λq, 0, . . . , 0}.

4. The new q–dimensional data matrix representation is Y = V Λ_1^{1/2}.

The rows of matrix Y are called the principal coordinates of X in

q–dimensions.

5. Judge the adequacy of the representation using index Sq.

A reasonable fit corresponds to Sq ∼ 0.8.
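◮ Steps 1–4 can be coded directly and checked against cmdscale; a sketch on simulated data (the double centering below is formula (2) in matrix form):

> set.seed(2)
> X <- matrix(rnorm(80), ncol=4)
> D <- as.matrix(dist(X))                  # Euclidean distances
> n <- nrow(D)
> J <- diag(n) - matrix(1/n, n, n)         # centering matrix
> B <- -0.5 * J %*% D^2 %*% J              # formula (2)
> e <- eigen(B)
> Y <- e$vectors[,1:2] %*% diag(sqrt(e$values[1:2]))  # principal coordinates, q = 2
> sum(e$values[1:2])/sum(pmax(e$values, 0))           # adequacy index S_q
> max(abs(abs(Y) - abs(cmdscale(D, k=2))))            # agrees with cmdscale up to sign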

359

MDS algorithm – comments

◮ Although D is defined as the matrix of Euclidean distances, the algorithm can be applied to other distances. In this case B in (2) is symmetric but not necessarily non–negative definite. Then assess the adequacy of the solution using

∑_{i=1}^q |λ_i| / ∑_{i=1}^p |λ_i|,   or   ∑_{i=1}^q λ_i² / ∑_{i=1}^p λ_i².

◮ Other criteria:

– trace: choose the number of coordinates so that the sum of

positive eigenvalues is approximately the sum of all eigenvalues

– magnitude: accept only eigenvalues which substantially exceed the

largest negative eigenvalue.

360

Duality of principal coordinates and PCA

◮ PCA is based on the singular value decomposition of the sample covariance matrix

Σ = (1/(n−1)) ∑_{i=1}^n (X_i − µ)(X_i − µ)^T,   µ = (1/n) ∑_{i=1}^n X_i.

◮ If the data matrix X is centered, i.e., µ = 0, then Σ is proportional to X^T X, which shares its nonzero eigenvalues with XX^T. Therefore, if D is the matrix of Euclidean distances, then B given by (2) coincides with XX^T. Thus, in this specific case PCA and principal coordinates are equivalent.

361

Some remarks

◮ Distance and dissimilarity: MDS is applied to the matrix D = {d_ij} of distances. A similarity matrix is a symmetric matrix C = {c_ij} satisfying c_ij ≤ c_ii, ∀i, j. One can transform a similarity matrix into a distance matrix by setting d_ij = (c_ii − 2c_ij + c_jj)^{1/2}, ∀i, j.

◮ Given a distance matrix D, the object of MDS is to find a data matrix X̂ in R^q with interpoint distances d̂_ij “close” to d_ij, ∀i, j.

◮ Among all projections of X on q–dimensional subspaces of R^p, the principal coordinates in R^q minimize the expression

φ = ∑_{i=1}^n ∑_{j=1}^n [d_ij² − d̂_ij²].

362

2. Example: air distance between US cities

◮ Air-distance data: (1) Atlanta, (2) Boston, (3) Cincinnati, (4) Columbus, (5) Dallas, (6) Indianapolis, (7) Little Rock, (8) Los Angeles, (9) Memphis, (10) St. Louis, (11) Spokane, (12) Tampa

> D

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]

[1,] 0 1068 461 549 805 508 505 2197 366 558 2467 467

[2,] 1068 0 867 769 1819 941 1494 3052 1355 1178 2747 1379

[3,] 461 867 0 107 943 108 618 2186 502 338 2067 928

[4,] 549 769 107 0 1050 172 725 2245 586 409 2131 985

[5,] 805 1819 943 1050 0 882 325 1403 464 645 1891 1077

[6,] 508 941 108 172 882 0 562 2080 436 234 1959 975

[7,] 505 1494 618 725 325 562 0 1701 137 353 1988 912

[8,] 2197 3052 2186 2245 1403 2080 1701 0 1831 1848 1227 2480

[9,] 366 1355 502 586 464 436 137 1831 0 294 2042 779

[10,] 558 1178 338 409 645 234 353 1848 294 0 1820 1016

[11,] 2467 2747 2067 2131 1891 1959 1988 1227 2042 1820 0 2821

[12,] 467 1379 928 985 1077 975 912 2480 779 1016 2821 0

363

2. Example: air distance between US cities

> names<-c("Atlanta", "Boston",

+ "Cincinnati","Columbus","Dallas","Indianapolis","Little Rock","Los Angeles",

+ "Memphis","St. Louis","Spokane","Tampa")

>

> air.d<-cmdscale(D,k=2,eig=TRUE)

> air.d$eig

[1] 8234381 2450757

> x<-air.d$points[,1]

> y<-air.d$points[,2]

> plot(x,y,xlab="Coordinate 1",ylab="Coordinate 2", xlim=range(x)*1.2, type="n")

> text(x,y,labels=names)

364

[Figure: two–dimensional MDS configuration of the 12 cities (Coordinate 1 vs Coordinate 2), with city names plotted at their principal coordinates]

365

1. Example: data on US universities

◮ Data: 6 variables on 25 US universities

� X1 = average SAT score for new freshmen;

� X2 = percentage of new freshmen in top 10% of high school class;

� X3 = percentage of applicants accepted;

� X4 = student–faculty ratio;

� X5 = estimated annual expences;

� X6 = graduation rate (%).

366

2. Example: data on US universities

> X<-read.table("US-univ.txt")

> names<-as.character(X[,1])

> names

[1] "Harvard" "Princeton" "Yale" "Stanford"

[5] "MIT" "Duke" "Cal_Tech" "Dartmouth"

[9] "Brown" "Johns_Hopkins" "U_Chicago" "U_Penn"

[13] "Cornell" "Northwestern" "Columbia" "NotreDame"

[17] "U_Virginia" "Georgetown" "Carnegie_Mellon" "U_Michigan"

[21] "UC_Berkeley" "U_Wisconsin" "Penn_State" "Purdue"

[25] "Texas_A&M"

> X<-X[,-1]

> D<-dist(X,method="euclidean") # matrix of distances

> univ<-cmdscale(D,k=6,eig=TRUE)

> univ$eig

[1] 2.080000e+04 3.042786e+03 1.287520e+03 5.430015e+02 1.180444e+02

[6] 9.986056e-01

367

3. Example: data on US universities

> sum(univ$eig[1:2])/sum(univ$eig[1:6])

[1] 0.924413

> x<-univ$points[,1]

> y<-univ$points[,2]

> plot(x,y,xlab="Coordinate 1", ylab="Coordinate 2", xlim=range(x)*1.2,type="n")

> text(x,y,names)

368

[Figure: two–dimensional MDS configuration of the 25 universities (Coordinate 1 vs Coordinate 2), with university names plotted at their principal coordinates]

369

Non–metric multidimensional scaling

◮ Another look at the MDS problem: we are given a dissimilarity matrix D = {d_ij} for points X in R^p. We want to find points X̂ in R^q such that the corresponding distances D̂ = {d̂_ij} match the matrix D as closely as possible.

◮ In general, an exact match is not possible, so we will require monotonicity: if

d_{i_1 j_1} < d_{i_2 j_2} < · · · < d_{i_m j_m},   m = n(n − 1)/2,

then it should hold that

d̂_{i_1 j_1} < d̂_{i_2 j_2} < · · · < d̂_{i_m j_m}.

370

1. Shepard–Kruskal algorithm

(a) Given the dissimilarity matrix D, order the off–diagonal elements: for i_l < j_l

d_{i_1 j_1} ≤ · · · ≤ d_{i_m j_m},   i_l < j_l, l = 1, 2, . . . , m = n(n − 1)/2.

Say that numbers d*_{ij} are monotonically related to d_{ij} (written d* mon∼ d) if

d_{ij} < d_{kl} ⇒ d*_{ij} < d*_{kl},   ∀i < j, k < l.

(b) Let X̂ (n × q) be a configuration in R^q with interpoint distances d̂_{ij}. Define

Stress(X̂) = min_{d*: d* mon∼ d} [ ∑_{i<j} (d*_{ij} − d̂_{ij})² / ∑_{i<j} d̂_{ij}² ].

The rank order of d* coincides with the rank order of d. Stress(X̂) is zero if the rank order of d̂ coincides with the rank order of d.

371

2. Shepard–Kruskal algorithm

(c) For each dimension q, the configuration with the smallest stress is called the best fitting configuration in q dimensions. Let

S_q = min_{X̂ (n×q)} Stress(X̂).

(d) Calculate S1, S2,... until the value becomes low. The rule of thumb:

Sq ≥ 20%–poor; Sq = 10%–fair; Sq ≤ 5%–good; Sq = 0%–perfect.

◮ Remark: the solution is obtained by a numerical procedure.

372

The Shepard–Kruskal algorithm: computation

◮ Computation

1. find a random configuration of points (e.g., sampling from a

normal distribution);

2. calculate distances between the points dij ;

3. find optimal monotone transformation d∗ij of the dissimilarities dij ;

4. minimize the stress between the optimally scaled data by finding

new configuration of points; if the stress is small enough stop, if

not go to the step 2.

373

Remarks

◮ The procedure uses steepest descent method. There is no way to

distinguish between local and global minima.

◮ The Shepard–Kruskal solution is non–metric since it uses only the

rank orders.

◮ The method is invariant under rotation, translation, and uniform expansion or contraction of the best fitting configuration.

◮ The method works for distances and dissimilarities (similarities).

374

1. Example: New Jersey voting data

◮ The data is the matrix containing the number of times 15 New Jersey congressmen voted differently in the House of Representatives on 19 environmental bills (abstentions are not recorded).

> voting

Hunt(R) Sandman(R) Howard(D) Thompson(D) Freylinghuysen(R)

Hunt(R) 0 8 15 15 10 ...

Sandman(R) 8 0 17 12 13 ...

Howard(D) 15 17 0 9 16 ...

Thompson(D) 15 12 9 0 14 ...

Freylinghuysen(R) 10 13 16 14 0 ...

Forsythe(R) 9 13 12 12 8 ...

Widnall(R) 7 12 15 13 9 ...

Roe(D) 15 16 5 10 13 ...

Heltoski(D) 16 17 5 8 14 ...

Rodino(D) 14 15 6 8 12 ...

Minish(D) 15 16 5 8 12 ...

Rinaldo(R) 16 17 4 6 12 ...

Maraziti(R) 7 13 11 15 10 ...

Daniels(D) 11 12 10 10 11 ...

Patten(D) 13 16 7 7 11 ...

375

2. Example: New Jersey voting data

> voting.stress<-rep(0,6)

> for (i in 1:6) voting.stress[i]<-isoMDS(voting, k=i)$stress

> voting.stress

[1] 21.1967696 9.8790470 5.4522891 3.6672495 1.4064205 0.8899916

> voting.MDS<-isoMDS(voting,k=2)

initial value 15.268246

iter 5 value 10.264075

final value 9.879047

converged

> x<-voting.MDS$points[,1]

> y<-voting.MDS$points[,2]

> plot(x,y,xlab="Coordinate 1",ylab="Coordinate 2", xlim=range(x)*1.2, type="n")

> text(x,y,colnames(voting))

376

[Figure: two–dimensional non–metric MDS configuration of the 15 congressmen, labeled by name and party (Coordinate 1 vs Coordinate 2)]

377

Example: voting data, metric MDS

> voting.metrMDS <- cmdscale(voting, k=6, eig=TRUE)

> voting.metrMDS$eig

[1] 497.76083 146.17622 102.91314 76.87756 55.11540 24.74374

> sum(voting.metrMDS$eig[1:2])/sum(voting.metrMDS$eig)

[1] 0.7126454

> x<-voting.metrMDS$points[,1]

> y<-voting.metrMDS$points[,2]

> plot(x,y,xlab="Coordinate 1",ylab="Coordinate 2", xlim=range(x)*1.2, type="n")

> text(x,y,colnames(voting))

378

[Figure: two–dimensional metric MDS (principal coordinates) configuration of the 15 congressmen (Coordinate 1 vs Coordinate 2)]

379

20. Neural networks

380

1. Neuron model

◮ Neurological origins: each elementary nerve cell (neuron) is connected

to many others, can be activated by inputs from elsewhere, and can

stimulate other neurons.

◮ Neuron model (perceptron):

y = ϕ( ∑_{j=1}^p w_j x_j + w_0 ),   v = ∑_{j=1}^p w_j x_j + w_0,

– x = (x1, . . . , xp) are inputs

– y is output

– (w1, . . . , wp) are connection weights, w0 is a bias term

– ϕ is a (non–linear) function, called the activation function

381

2. Neuron model

[Figure: perceptron diagram — inputs x_1, . . . , x_p with weights w_1, . . . , w_p feed a summation node Σ, whose result v passes through the activation ϕ(·) to give the output y]

382

Activation function

◮ Monotone increasing, bounded function

◮ Examples

* hard limiter: produces ±1 output,

ϕ_HL(v) = sgn(v).

* sigmoidal (logistic): produces output in (0, 1),

ϕ_S(v) = 1 / (1 + e^{−av}).

* hyperbolic tan: produces output between −1 and 1,

ϕ_HT(v) = tanh(av/2) = (1 − e^{−av}) / (1 + e^{−av}).

* linear: ϕ_L(v) = av.
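◮ A small sketch defining these functions in R (with a = 1) and plotting them on a common grid:

> phi.HL <- function(v) sign(v)                  # hard limiter
> phi.S  <- function(v, a=1) 1/(1 + exp(-a*v))   # logistic
> phi.HT <- function(v, a=1) tanh(a*v/2)         # hyperbolic tan form above
> v <- seq(-4, 4, length.out=200)
> matplot(v, cbind(phi.HL(v), phi.S(v), phi.HT(v)), type="l", lty=1,
+         xlab="v", ylab="activation")
> legend("topleft", c("hard limiter", "logistic", "hyperbolic tan"), col=1:3, lty=1)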

383

Single–unit perceptron

◮ Variables:

* x is a p–vector of features

* z is a prediction target (binary)

* y is the perceptron output used to predict z

y = sgn( ∑_{j=1}^p w_j x_j + w_0 ) = ϕ_HL(w^T x) = { +1 if w^T x ≥ 0;  −1 if w^T x < 0 }.

◮ Training data: D = {(x^{(t)}, z^{(t)}), t = 1, . . . , n} – n ”examples”;

x^{(t)} = (x_1^{(t)}, . . . , x_p^{(t)}) is the t-th observation of the feature vector,

z^{(t)} is the class indicator (±1) – a binary variable.

384

Perceptron learning rule

1. Initialization: set w = w(0) – starting point

2. At every step t = 1, 2, . . .

� select ”example” x(t) from the training set

� compute the perceptron output with current weights w(t−1)

y^{(t)} = ϕ_HL([w^{(t−1)}]^T x^{(t)})

� update the weights

w^{(t)} = w^{(t−1)} + η (z^{(t)} − y^{(t)}) x^{(t)}.

3. Cycle through all the observations in the training set.

◮ η > 0 is the learning rate parameter

◮ z(t) − y(t) is the error on t–th example. If it is zero, the weights are

not updated. Usual stepsize η ≈ 0.1.
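◮ A minimal R sketch of the rule on simulated, linearly separable data (names are illustrative; the first column of x is the constant input for w_0):

> set.seed(3)
> n <- 100
> x <- cbind(1, matrix(rnorm(2*n), ncol=2))       # bias input plus two features
> z <- ifelse(x[,2] + x[,3] > 0, 1, -1)           # true (separable) classes
> w <- rep(0, 3); eta <- 0.1
> for (epoch in 1:20)
+   for (t in 1:n) {
+     y <- ifelse(sum(w * x[t,]) >= 0, 1, -1)     # perceptron output
+     w <- w + eta * (z[t] - y) * x[t,]           # update only when y differs from z
+   }
> mean(ifelse(x %*% w >= 0, 1, -1) == z)          # training accuracy (should be 1)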

385

Perceptron convergence theorem

◮ Theorem

For any data set that is linearly separable, the learning rule is

guaranteed to find the separating hyperplane in a finite number

of steps.

386

Criterion cost function for regression problem

◮ Data:

D = {(x^{(t)}, z^{(t)}), t = 1, . . . , n},

z^{(t)} is a continuous/discrete variable.

◮ Squared error cost function:

E(w) = (1/2) ∑_{t=1}^n [z^{(t)} − ϕ(w^T x^{(t)})]² = (1/2) ∑_{t=1}^n e_t²(w),   e_t(w) := z^{(t)} − ϕ(w^T x^{(t)}).

◮ Assumption: activation function ϕ is differentiable.

387

Gradient descent learning rule (batch version)

◮ Initialization: starting vector of weights w(0)

◮ For t = 0, 1, 2, . . . compute w ← w − η_t ∇_w E(w), i.e.

w^{(t+1)} = w^{(t)} − η_t ∇_w E(w^{(t)}),   ∇_w E(w) = {∂E(w)/∂w_j}_{j=0,...,p}.

◮ Cycle through all the observations in the training set.

∇_w E(w) = − ∑_{k=1}^n e_k(w) ∇_w {ϕ(w^T x^{(k)})} = − ∑_{k=1}^n ϕ′(w^T x^{(k)}) e_k(w) x^{(k)}  ⇒

w^{(t+1)} = w^{(t)} + η_t ∑_{k=1}^n ϕ′([w^{(t)}]^T x^{(k)}) e_k(w^{(t)}) x^{(k)},   i.e.   w ← w + η_t ∑_{k=1}^n ϕ′(w^T x^{(k)}) e_k(w) x^{(k)}.

◮ E(w) is non–convex, convergence to local minima.

388

Gradient learning rule (sequential version)

◮ At step t = 1, 2, . . . the t-th example x(t) is selected and

w^{(t+1)} = w^{(t)} − η_t ∇_w [ (1/2) e_t²(w) ] |_{w = w^{(t)}}

= w^{(t)} + η_t ϕ′([w^{(t)}]^T x^{(t)}) e_t(w^{(t)}) x^{(t)} = w^{(t)} + η_t δ_t x^{(t)},

δ_t := ϕ′([w^{(t)}]^T x^{(t)}) e_t(w^{(t)}).

– Examples are selected sequentially, each example is selected many

times.

– Easy implementation; processes examples in real time.

– Convergence to local minima
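◮ A sketch of the sequential rule for a single logistic unit (ϕ = ϕ_S with a = 1, so ϕ′(v) = ϕ(v)(1 − ϕ(v)); targets in {0, 1}; names are illustrative):

> set.seed(4)
> n <- 200
> x <- cbind(1, matrix(rnorm(2*n), ncol=2))
> z <- ifelse(x[,2] - x[,3] > 0, 1, 0)
> phi <- function(v) 1/(1 + exp(-v))
> w <- rep(0, 3); eta <- 0.1
> for (epoch in 1:50)
+   for (t in sample(n)) {
+     v <- sum(w * x[t,])
+     delta <- phi(v) * (1 - phi(v)) * (z[t] - phi(v))   # phi'(v) * e_t(w)
+     w <- w + eta * delta * x[t,]
+   }
> mean((phi(x %*% w) > 0.5) == z)                        # training accuracy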

389

1. Neural network

◮ Neural network – multilayer perceptron

◮ Feed–forward network is a network in which units can be numbered so

that all connections go from a vertex to one with a higher number. The

vertices are arranged in layers, with connections only to higher layers.

◮ Hidden layer contains M neurons fed by the inputs (x1, . . . xp).

Output of the mth neuron in the hidden layer is

v_m = ϕ_m( w_{m0} + ∑_{i=1}^p w_{mi} x_i ) = ϕ_m(w_m^T x),

where w_m is the weight vector of perceptron m.

390

2. Neural network

◮ Output layer (unit) is fed by the neuron outputs in the hidden layer:

y = f( w_0 + ∑_{m=1}^M w_m v_m ) = f(w^T v),

where f is the activation function in the output unit, w is the vector

of weights of the output unit.

◮ The network with one hidden layer implements the function

y = f(w^T v) = f( w_0 + ∑_{m=1}^M w_m ϕ_m(w_m^T x) ).

391

3. Neural network

[Figure: network diagram — input layer x_1, . . . , x_p, hidden layer with units 1, . . . , M producing outputs v_1, . . . , v_M, and a single output unit giving y; weights w_11, . . . , w_Mp on the input–to–hidden connections]

392

Specific cases

◮ Projection pursuit model: if f(w^T v) = 1^T v + w_0 and v_m = ϕ_m(w_m^T x), then

y = w_0 + ∑_{m=1}^M ϕ_m(w_m^T x).

◮ Generalized additive model: if f is as before and M = p, w_{m0} = 0, w_{mi} = I{i = m}, then

y = w_0 + ∑_{m=1}^p ϕ_m(x_m).

393

Representation power of neural networks

Let f_0 : [0, 1]^p → R be a continuous function. Assume that ϕ is not a polynomial; then for any ǫ > 0 there exist constants M and (w_{mi}, w_m) such that for

f(x) = ∑_{m=1}^M w_m ϕ( ∑_{i=1}^p w_{mi} x_i + w_{m0} )

one has

|f(x) − f_0(x)| ≤ ǫ,   ∀x ∈ [0, 1]^p.

◮ Neural network with one hidden layer can approximate any

continuous function.

394

Training

◮ Prediction error criterion

E(W) = (1/2) ∑_{t=1}^n [y^{(t)}(W) − z^{(t)}]² = ∑_{t=1}^n e_t(y^{(t)}(W), z^{(t)}),

where W denotes all the weights, and

y^{(t)} = y^{(t)}(W) = f( w_0 + ∑_{m=1}^M w_m ϕ_m(w_m^T x^{(t)}) ).

◮ Backpropagation algorithm

W_j ← W_j − η ∂E(W)/∂W_j = W_j − η ∑_{t=1}^n ∂e_t(W)/∂W_j.

◮ The algorithm uses the chain rule for differentiation and requires

differentiable activation functions.

395

Training neural networks

◮ Number of hidden units: usually varies in the range 5− 100.

◮ Overfitting: networks with too many units will overfit. The remedy is

to regularize, e.g., to minimize the criterion

E(W) + λ ∑_i W_i²,   λ ∼ 10^{−4} – 10^{−2}.

This leads to the so–called weight decay algorithm.

◮ Starting values are usually taken to be random values near zero. The model starts out nearly linear and becomes non–linear as the weights grow.

◮ Stopping rule: different ad hoc rules, e.g., a maximal number of iterations.

◮ Multiple minima: E(W ) is non–convex, has many local minima.

396

Projection pursuit regression

◮ Model: let ωm, m = 1, . . . ,M be the unknown unit p–vectors, and let

f(x) = ∑_{m=1}^M f_m(ω_m^T x),   where the f_m(·) are unknown.

Each f_m varies only in the direction defined by the vector ω_m.

◮ The ”effective” dimensionality is 1, not p.

◮ The error function (to be minimized w.r.t. f_m and ω_m, m = 1, . . . , M):

∑_{i=1}^n [ Y_i − ∑_{m=1}^M f_m(ω_m^T X_i) ]².

397

Projection pursuit regression: fitting the model

◮ If ω is given, set v_i = ω^T X_i and apply a one–dimensional smoother to get an estimate of g.

◮ Given g and a previous guess ω_old for ω, write

g(ω^T X_i) ≈ g(ω_old^T X_i) + g′(ω_old^T X_i)(ω − ω_old)^T X_i

and

∑_{i=1}^n [ Y_i − g(ω^T X_i) ]² ≈ ∑_{i=1}^n [g′(ω_old^T X_i)]² [ ( ω_old^T X_i + (Y_i − g(ω_old^T X_i)) / g′(ω_old^T X_i) ) − ω^T X_i ]².

Minimize the last expression w.r.t. ω to get ωnew.

◮ Continue until convergence...
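◮ The ppr function in the stats package fits this model by the same kind of alternating scheme; a minimal sketch on the Boston housing data used earlier (e.g. MASS::Boston):

> library(MASS)
> Boston.ppr <- ppr(medv ~ ., data=Boston, nterms=2, max.terms=5)
> summary(Boston.ppr)   # fitted directions omega_m and goodness of fit
> plot(Boston.ppr)      # the estimated ridge functions f_m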

398

1. Neural network

◮ Extension of the idea of perceptron

◮ Feed–forward network is a network in which vertices (units) can be

numbered so that all connections go from a vertex to one with a higher

number. The vertices are arranged in layers, with connections only to

higher layers.

◮ Each unit j sums its inputs, adds a constant forming the total input

xj , and applies a function fj to xj to give output yj .

◮ The links have weights wij which multiply the signals travelling

among them by that factor.

399

Fitting (training) neural networks

◮ Starting values: usually taken to be random values near zero. The model starts out nearly linear and becomes non–linear as the weights grow.

◮ Stopping rule: different ad hoc rules, e.g., a maximal number of iterations.

◮ Overfitting: networks with too many units will overfit. The remedy is

to regularize, e.g., to minimize the criterion

E(w) + λ ∑_{i,j} w_ij²,   λ ∼ 10^{−4} – 10^{−2}.

Leads to the weight decay algorithm.

◮ Number of hidden units: varies in the range 3− 100.

◮ Multiple minima: E(w) is non–convex, has many local minima.

400

3. Neural network

◮ The activation functions fj are taken to be

– linear

– logistic f(x) = ℓ(x) = e^x/(1 + e^x)

– threshold f(x) = I(x > 0)

◮ Neural networks with linear output units and a single hidden layer can

approximate any continuous function (as size of hidden layer grows)

y_k = α_k + ∑_{j→k} w_jk f_j( α_j + ∑_{i→j} w_ij x_i )

◮ Projection pursuit regression

f(x) = α + ∑_j f_j( α_j + ∑_i β_ji x_i ) = α + ∑_j f_j(α_j + β_j^T x)

401

Example: network for the Boston housing data

◮ Structure: 13 predictors, 3 units in the hidden layer, linear output unit

[Figure: network diagram for the Boston data — the 13 inputs X1, . . . , X13 feed three hidden units, which feed a single linear output unit producing Y]

◮ Represents the function (3 × 14 + 4 = 46 weights):

y_0 = α + ∑_{j=15}^{17} w_{j0} ϕ_j( α_j + ∑_{i=1}^{13} w_ij x_i )

402

Notation

◮ All units are numbered sequentially. Every unit j has input xj and

output yj

y_j = f_j(x_j),   x_j = ∑_{i→j} w_ij y_i.

Non-existent links are characterized by wij = 0; wij = 0 unless i < j.

◮ the output vector y∗ of the network is modeled as y∗ = f(x∗;w),

where x∗ is the input vector and w is the vector of weights.

◮ Data: {(x∗m, tm),m = 1, . . . , n} – observed examples;

t_m is the observation for y*_m = f(x*_m; w).

403

Back–propagation algorithm

◮ Discrepancy function to be minimized w.r.t. w

E(w) = ∑_{m=1}^n E_m(w) = ∑_{m=1}^n ‖t_m − f(x*_m; w)‖² = ∑_{m=1}^n ‖t_m − y*_m‖².

◮ Update rule (gradient descent): for η > 0 define

w_ij ← w_ij − η ∂E(w)/∂w_ij,   ∂E(w)/∂w_ij = ∑_{m=1}^n ∂E_m(w)/∂w_ij.

◮ Derivative: because f(x∗m;w) depends on wij only via xj

∂E_m(w)/∂w_ij = (∂E_m/∂x_j)(∂x_j/∂w_ij) = y_i (∂E_m/∂x_j) = y_i δ_j,

δ_j = ∂E_m/∂x_j = (∂E_m/∂y_j)(∂y_j/∂x_j) = f′_j(x_j) (∂E_m/∂y_j).

404

1. Example: Boston housing data - neural network

> nobs <- dim(Boston)[1]

> trainx<-sample(1:nobs, 2*nobs/3, replace=F)

> testx<-(1:nobs)[-trainx]

#

# size - number of units in the hidden layer

# decay - parameter for weight decay; default 0

# linout - switch for linear output units; default logistic output units

# starting values - uniformly distributed [-0.7, 0.7]

#

> Boston10.1.nn<-nnet(formula=medv~., data=Boston[trainx,], size=10, decay=1.0e-03,

+ linout=TRUE, maxit=500)

# weights: 151

initial value 211829.925033

final value 31162.391756

converged

> pred <-predict(Boston10.1.nn, Boston[testx,])

> sum((pred-Boston[testx,"medv"])^2)

[1] 11590.46

405

2. Example: Boston housing data - neural network

> # another run with the same parameters

> Boston10.2.nn<-nnet(formula=medv~., data=Boston[trainx,], size=10, decay=1.0e-03,

+ linout=TRUE, maxit=500)

# weights: 151

initial value 223291.589605

iter 10 value 24530.299991

iter 20 value 21653.959154

...........................

iter 490 value 3061.645220

iter 500 value 3043.807915

final value 3043.807915

stopped after 500 iterations

> pred <-predict(Boston10.2.nn, Boston[testx,]) # prediction error on

> sum((pred-Boston[testx,"medv"])^2) # the test set

[1] 2859.247 # CART error was 4045.943

406

3. Example: Boston housing data – neural network

[Figure: residual plot (index vs Boston10.2.nn$residuals) and normal Q–Q plot of the network residuals]

407

4. Example: Is the model good?

> cbind(pred, Boston[testx,"medv"])

[,1] [,2]

4 31.464151 33.4

12 19.442958 18.9

15 19.847892 18.2

18 18.361424 17.5

.......................

397 14.315629 12.5

403 16.850060 12.1

404 12.685557 8.3

405 -1.021680 8.5

406 -1.686176 5.0

413 11.566423 17.9

414 15.509046 16.3

419 -16.737063 8.8

421 20.722749 16.7

423 23.060568 20.8

408

21. Support Vector Machines (SVM)

409

Hyperplane

◮ Hyperplane in Rp is an affine subspace of dimension p− 1: its

equation is

f(x) = β0 + β1x1 + . . .+ βpxp = 0.

◮ Consider binary classification problem with data

Dn = {(X1, Y1), . . . , (Xn, Yn)}, where Xi = [xi,1, . . . , xi,p] ∈ Rp and

Yi ∈ {−1, 1}.

◮ We say that Dn admits linear separation if there exists a separating

hyperplane f(x) such that

f(Xi) = β0 + β1xi,1 + · · ·+ βpxi,p > 0 if Yi = 1,

f(Xi) = β0 + β1xi,1 + · · ·+ βpxi,p < 0 if Yi = −1.

Separating hyperplane satisfies Yi(β0 + β1xi,1 + · · ·+ βpxi,p) > 0.

410

Separating hyperplanes

[Figure: two–class data in the (X1, X2) plane together with separating hyperplanes]

If f(x) is a separating hyperplane then a natural classifier is sign{f(x)}.

411

1. Maximal margin classifier

◮ If the data set admits linear separation, it is natural to choose the

separating hyperplane with maximal margin, i.e., the separating

hyperplane farthest from the observations.

◮ Optimization problem:

max_{β_0,...,β_p} M

s.t.   ∑_{j=1}^p β_j² = 1,

Y_i(β_0 + β_1 x_i,1 + · · · + β_p x_i,p) ≥ M,   ∀i = 1, . . . , n.

M is the margin of the hyperplane. The optimization problem is convex and can be solved efficiently on a computer.

412

2. Maximal margin classifier

[Figure: two–class data in the (X1, X2) plane with the maximal margin hyperplane and its margin]

◮ Maximal margin classifier

413

Drawbacks of maximal margin classifier

◮ What can be done if there is no separating hyperplane?

◮ The maximal margin classifier is very sensitive to a single observation.

[Figure: the maximal margin hyperplane before (left) and after (right) adding a single observation; the fitted hyperplane changes dramatically]

414

1. Support vector classifier

◮ The idea: allow for misclassification (impose soft margin).

◮ Optimization problem

max_{β_0,...,β_p, ǫ_1,...,ǫ_n} M

s.t.   ∑_{j=1}^p β_j² = 1,

Y_i(β_0 + β_1 x_i,1 + · · · + β_p x_i,p) ≥ M(1 − ǫ_i),   ∀i = 1, . . . , n,

ǫ_i ≥ 0,   ∑_{i=1}^n ǫ_i ≤ C,

where ǫ1, . . . , ǫn are slack variables, C ≥ 0 is a tuning parameter.

415

2. Support vector classifier

◮ Slack variable ǫi tells us where ith observation is located:

– ǫi = 0: ith observation on the correct side of the margin;

– ǫi > 0: ith observation violates the margin;

– ǫi > 1: ith observation on the wrong side of the hyperplane.

◮ Tuning parameter C establishes budget for margin violation

– C = 0: no budget for margin violation (maximal margin classifier);

– C > 0: no more than C observations can be misclassified (can lie

on the wrong side of the hyperplane);

– C is usually chosen by cross–valuidation.

416
