louis roussos sports data - istics.net

Post on 21-May-2022


Louis Roussos Sports Data

Rank the sports you most like to participate in, 1 = favorite, 7 = least favorite. There are n = 130 rank vectors.

    > sportsranks
    Baseball Football Basketball Tennis Cycling Swimming Jogging
           1        3          7      2       4        5       6
           1        3          2      5       4        7       6
           1        3          2      5       4        7       6
           4        7          3      1       5        6       2
    [...]
           3        2          1      4       7        5       6
           3        2          1      4       5        6       7
           5        7          6      4       1        3       2
           2        1          6      7       3        5       4

K-means in R

Set the number of clusters K with the centers argument. nstart is the number of times the algorithm is run, each time using a different random starting set of means.

    > kmeans(sportsranks,centers=2,nstart=10)
    K-means clustering with 2 clusters of sizes 62, 68

    Cluster means:
      Baseball Football Basketball   Tennis  Cycling Swimming  Jogging
    1 2.451613 2.596774   3.064516 4.112903 4.709677 5.209677 5.854839
    2 5.014706 5.838235   4.352941 3.632353 2.573529 2.470588 4.117647

    Clustering vector:
    1 1 1 2 1 2 2 2 2 2 2 1 2 1 1 2 2 1 1 1 2 1 1 2 2 1 1 2 1 2 2 2 1 1 1 1 2 1 1 2 2 2 1 2 1 2 1 1 1 1
    2 1 1 2 2 1 1 1 2 1 1 1 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 1 1 2 2 1 2 1 1 1 2 2 2 2 1 2 2 2 2 2 1 1 1 1
    2 2 1 1 1 1 2 2 2 1 2 2 1 2 2 2 1 2 1 2 2 2 2 1 2 1 1 1 2 1

    Within cluster sum of squares by cluster:
    [1] 1074.968 1288.176

    Available components:
    [1] "cluster"  "centers"  "withinss" "size"
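The call above can be tried without the survey file. The following is a minimal sketch on simulated rankings, not the sportsranks data: the helper sim_ranks and the two preference profiles are made up for illustration, so the sizes and means will not match the output shown.

```r
# Simulated stand-in for sportsranks (hypothetical data, not the survey).
set.seed(1)
sports <- c("Baseball", "Football", "Basketball", "Tennis",
            "Cycling", "Swimming", "Jogging")

# Each row is a permutation of 1..7: the ranks of a noisy preference score.
sim_ranks <- function(n, mu) t(replicate(n, rank(mu + rnorm(7))))
x <- rbind(sim_ranks(60, 1:7),                     # tends to prefer team sports
           sim_ranks(70, c(6, 7, 5, 4, 2, 1, 3)))  # tends to prefer swimming, cycling
colnames(x) <- sports

km <- kmeans(x, centers = 2, nstart = 10)
km$size                 # cluster sizes
round(km$centers, 2)    # cluster means, one row per cluster
table(km$cluster)       # same counts as km$size
sum(km$withinss)        # equals km$tot.withinss
```

Because the starting means are random, the cluster labels 1 and 2 can swap from one run to the next; only the partition itself is meaningful.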

Getting clusterings for K = 2, ..., 10

    kms <- vector('list',10)
    for(K in 2:10) {
      kms[[K]] <- kmeans(sportsranks,centers=K,nstart=10)
    }

    K = 1    BaseB  FootB  BsktB    Ten    Cyc   Swim    Jog
    Group 1   3.79   4.29   3.74   3.86   3.59   3.78   4.95

    K = 2    BaseB  FootB  BsktB    Ten    Cyc   Swim    Jog
    Group 1   5.01   5.84   4.35   3.63   2.57   2.47   4.12
    Group 2   2.45   2.60   3.06   4.11   4.71   5.21   5.85

    K = 3    BaseB  FootB  BsktB    Ten    Cyc   Swim    Jog
    Group 1   2.33   2.53   3.05   4.14   4.76   5.33   5.86
    Group 2   4.94   5.97   5.00   3.71   2.90   3.35   2.13
    Group 3   5.00   5.51   3.76   3.59   2.46   1.90   5.78

    K = 4    BaseB  FootB  BsktB    Ten    Cyc   Swim    Jog
    Group 1   5.10   5.47   3.75   3.60   2.40   1.90   5.78
    Group 2   2.30   2.10   2.65   5.17   4.75   5.35   5.67
    Group 3   2.40   3.75   3.90   1.85   4.85   5.20   6.05
    Group 4   4.97   6.00   5.07   3.80   2.80   3.23   2.13

K = 2: Group 1 likes swimming and cycling, while group 2 likes the team sports, baseball, football, and basketball. K = 3: Group 1 appears to be about the same as the team sports group from K = 2, while groups 2 and 3 both like swimming and cycling. The difference is that group 3 does not like jogging, while group 2 does. K = 4: The team-sports group has split into one that likes tennis (group 3), and one that doesn’t (group 2).

Plotting two clusters

The idea is to project the observations to the subspace (which is just a line) that goes through the two clusters’ mean vectors. The vector

$$z = \frac{\hat\mu_1 - \hat\mu_2}{\|\hat\mu_1 - \hat\mu_2\|}$$

is the unit vector pointing from $\hat\mu_2$ to $\hat\mu_1$. Then using z as an axis, the projections of the observations onto z have coordinates

$$w_i = x_i z', \quad i = 1, \ldots, N.$$
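As a sketch of the projection just described, z and the coordinates w_i can be computed directly from the kmeans centers. The data here are a simulated stand-in for sportsranks, and the helper sim_ranks is hypothetical.

```r
# Simulated stand-in for sportsranks (hypothetical data).
set.seed(2)
sim_ranks <- function(n, mu) t(replicate(n, rank(mu + rnorm(7))))
x <- rbind(sim_ranks(60, 1:7), sim_ranks(70, c(6, 7, 5, 4, 2, 1, 3)))

km  <- kmeans(x, centers = 2, nstart = 10)
mu1 <- km$centers[1, ]
mu2 <- km$centers[2, ]

z <- (mu1 - mu2) / sqrt(sum((mu1 - mu2)^2))  # unit vector pointing from mu2 to mu1
w <- as.vector(x %*% z)                      # w_i = x_i z'

hist(w, breaks = 20, xlab = "W", main = "K=2")
```

Since z points from the second mean to the first, observations in cluster 1 tend to land to the right of those in cluster 2 on the W axis.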

The histogram

[Figure: K=2. Histograms of the projections W, with W (−6 to 6) on the horizontal axis and Frequency (0 to 12) on the vertical axis; the seven sports (Baseball, Football, Basketball, Tennis, Cycling, Swimming, Jogging) are marked on the plot.]

Plot for K=3

If K = 3, then the three means lie in a plane, hence we would like to project the observations onto that plane. One approach is to use principal components on the means. With

$$Z = \begin{pmatrix} \hat\mu_1 \\ \hat\mu_2 \\ \hat\mu_3 \end{pmatrix},$$

we apply the spectral decomposition to the sample covariance matrix of Z:

$$\frac{1}{3} Z' H_3 Z = G L G', \qquad (1)$$

where G is orthogonal and L is diagonal. The diagonals of L here are 11.77, 4.07, and five zeros. We then rotate the data and the means using G,

$$W = XG \quad \text{and} \quad W^{(\text{means})} = ZG.$$

Only the first two columns in each matrix are relevant.
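The rotation can be sketched in a few lines of R. This assumes H_3 is the 3 × 3 centering matrix I_3 − (1/3)11', which the slides do not define here, and it uses simulated rankings (the helper sim_ranks is hypothetical) in place of sportsranks.

```r
# Simulated stand-in for sportsranks with three preference profiles (hypothetical).
set.seed(3)
sim_ranks <- function(n, mu) t(replicate(n, rank(mu + rnorm(7))))
x <- rbind(sim_ranks(45, 1:7),
           sim_ranks(45, c(6, 7, 5, 4, 2, 1, 3)),
           sim_ranks(40, c(5, 6, 7, 3, 2, 4, 1)))

km <- kmeans(x, centers = 3, nstart = 10)
Z  <- km$centers                         # 3 x 7 matrix of cluster means
H3 <- diag(3) - matrix(1/3, 3, 3)        # centering matrix (assumed meaning of H_3)

eg <- eigen(t(Z) %*% H3 %*% Z / 3, symmetric = TRUE)
G  <- eg$vectors                         # orthogonal matrix G
L  <- eg$values                          # diagonal of L, in decreasing order

W       <- x %*% G                       # rotated observations
W.means <- Z %*% G                       # rotated means

plot(W[, 1:2], col = km$cluster, xlab = "Var 1", ylab = "Var 2", main = "K=3")
text(W.means[, 1:2], labels = 1:3, cex = 2)
```

Three points span at most a plane, so only the first two eigenvalues are nonzero; the remaining five are zero up to rounding error, which is why only the first two columns of W matter.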

The Plot

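[Figure: K=3. Observations plotted on Var 1 (horizontal axis, −4 to 4) against Var 2 (vertical axis, −4 to 4), with the cluster means labeled 1, 2, 3 and the seven sports (Baseball, Football, Basketball, Tennis, Cycling, Swimming, Jogging) marked.]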

The sums of squares

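[Figure: SS (vertical axis, 1500 to 3500) plotted against K (horizontal axis, 2 to 10).]

$$SS_K = \mathrm{obj}(\hat\mu_1, \ldots, \hat\mu_K) = \sum_{k=1}^{K} \sum_{\{i \mid y_i = k\}} \|x_i - \hat\mu_k\|^2.$$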
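The formula can be checked by hand against what kmeans reports. A small sketch on simulated rankings (hypothetical data; the helper sim_ranks is made up for illustration):

```r
# Simulated stand-in for sportsranks (hypothetical data).
set.seed(4)
sim_ranks <- function(n, mu) t(replicate(n, rank(mu + rnorm(7))))
x <- rbind(sim_ranks(60, 1:7), sim_ranks(70, c(6, 7, 5, 4, 2, 1, 3)))

km <- kmeans(x, centers = 3, nstart = 10)

# SS_K by the formula: squared distance of each row to its own cluster mean.
resid      <- x - km$centers[km$cluster, ]
ss.by.hand <- sum(resid^2)

ss.by.hand
km$tot.withinss   # kmeans stores the same quantity
```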

The reduction of sums of squares

[Figure: 1-SS[k]/SS[k-1] (vertical axis, 0.05 to 0.30) plotted against K (horizontal axis, 2 to 10).]

$$1 - \frac{SS_K}{SS_{K-1}}$$
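A sketch of how the relative reductions might be computed, taking SS_1 to be the total sum of squares about the overall mean. The data are simulated (the helper sim_ranks is hypothetical), so the values will differ from those in the figure.

```r
# Simulated stand-in for sportsranks (hypothetical data).
set.seed(5)
sim_ranks <- function(n, mu) t(replicate(n, rank(mu + rnorm(7))))
x <- rbind(sim_ranks(60, 1:7), sim_ranks(70, c(6, 7, 5, 4, 2, 1, 3)))

ss    <- numeric(10)
ss[1] <- sum(scale(x, scale = FALSE)^2)   # K = 1: one cluster, the overall mean
for (K in 2:10) {
  ss[K] <- kmeans(x, centers = K, nstart = 10)$tot.withinss
}

reduction <- 1 - ss[2:10] / ss[1:9]       # 1 - SS_K / SS_{K-1} for K = 2, ..., 10
plot(2:10, reduction, type = "b", xlab = "K", ylab = "1-SS[k]/SS[k-1]")
```

A large value at some K means that moving from K − 1 to K clusters removed a large share of the remaining within-cluster variation.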

Silhouettes in R

The function silhouette.km finds the silhouettes for a given clustering, then sort.silhouette orders them, first by cluster number, then by value. To plot the silhouettes for K = 2, ..., 10:

    sil.ave <- NULL  # To collect silhouette's means for each K
    par(mfrow=c(3,3))
    for(K in 2:10) {
      sil <- silhouette.km(sportsranks,kms[[K]]$centers)
      sil.ave <- c(sil.ave,mean(sil))
      ssil <- sort.silhouette(sil,kms[[K]]$cluster)
      plot(ssil,type='h',xlab='Observations',ylab='Silhouettes')
      title(paste('K =',K))
    }

The sil.ave calculated above can then be used to obtain the plot of averages:

    plot(2:10,sil.ave,type='l',xlab='K',ylab='Average silhouette width')
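The functions silhouette.km and sort.silhouette are not part of base R, and their definitions are not shown here. For readers without them, the following sketch computes silhouettes from pairwise distances using the common definition s(i) = (b(i) − a(i)) / max(a(i), b(i)), which may differ from what silhouette.km computes. The function name silhouette.by.hand and the simulated data are illustrative assumptions.

```r
# Simulated stand-in for sportsranks (hypothetical data).
set.seed(6)
sim_ranks <- function(n, mu) t(replicate(n, rank(mu + rnorm(7))))
x <- rbind(sim_ranks(60, 1:7), sim_ranks(70, c(6, 7, 5, 4, 2, 1, 3)))

km <- kmeans(x, centers = 2, nstart = 10)
cl <- km$cluster
d  <- as.matrix(dist(x))                   # Euclidean distances between observations

# Hypothetical helper: distance-based silhouettes, one value per observation.
silhouette.by.hand <- function(d, cl) {
  ks <- sort(unique(cl))
  sapply(seq_along(cl), function(i) {
    own <- cl == cl[i]
    if (sum(own) == 1) return(0)           # convention for a singleton cluster
    a <- sum(d[i, own]) / (sum(own) - 1)   # mean distance to the rest of own cluster
    b <- min(sapply(setdiff(ks, cl[i]),    # mean distance to the nearest other cluster
                    function(k) mean(d[i, cl == k])))
    (b - a) / max(a, b)
  })
}

sil <- silhouette.by.hand(d, cl)
mean(sil)                                  # average silhouette width
plot(sil[order(cl, sil)], type = "h", xlab = "Observations", ylab = "Silhouettes")
```

Values near 1 indicate an observation well inside its cluster; values near 0 or below indicate one that sits between clusters.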

Plotting the silhouettes

[Figure: silhouette plots in four panels, with the observations (0 to 130) on the horizontal axis and the silhouettes (0.2 to 0.8) on the vertical axis. Panel averages: K = 2, Ave = 0.625; K = 3, Ave = 0.555; K = 4, Ave = 0.508; K = 5, Ave = 0.534.]

Plotting the silhouettes’ averages

[Figure: average silhouette width (vertical axis, 0.50 to 0.62) plotted against K (horizontal axis, 2 to 10).]

K = 2 seems like a good choice.

Model-based clustering – Car data

The data consist of size measurements on 111 automobiles. The variables include length, wheelbase, width, height, front and rear head room, front leg room, rear seating, front and rear shoulder room, and luggage area. The data are in the file cars. The variables have been normalized to have medians of 0 and median absolute deviations (MAD) of 1.4826 (the factor, about 1/0.6745, by which the MAD of a N(0, 1) is multiplied to recover its standard deviation of 1).
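One way such a normalization could be carried out is sketched below. It assumes that the 1.4826 is the value reported by R's mad(), whose default scaling constant is 1.4826, so that the unscaled MAD of each variable is 1; the slides do not say how the normalization was actually done. The matrix raw and the helper normalize are made up for illustration.

```r
# Hypothetical raw measurements: 111 rows, three made-up variables.
set.seed(7)
raw <- matrix(rnorm(111 * 3, mean = 50, sd = 5), ncol = 3,
              dimnames = list(NULL, c("Length", "Width", "Height")))

# Center each column at its median and divide by its unscaled MAD.
normalize <- function(v) (v - median(v)) / mad(v, constant = 1)
cars.sim  <- apply(raw, 2, normalize)

apply(cars.sim, 2, median)   # all 0
apply(cars.sim, 2, mad)      # all 1.4826, since mad() multiplies by 1.4826 by default
```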

R for model-based clustering

The R function we use is in the package mclust. The function is Mclust. The basic command is simple:

    mcars <- Mclust(cars)

There are many options for plotting in the package. To see a plot of the BIC’s, use

    plot(mcars,cars,what='BIC')

You have to click on the graphics window, or hit enter, to reveal the plot. Note that the BIC’s in this function are actually the −BIC’s, so we want to maximize them.
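The sign convention can be illustrated without mclust. The sketch below computes both versions for the simplest case, a single unconstrained multivariate normal fitted by maximum likelihood, assuming the usual definition BIC = −2 loglik + (number of parameters) log n. The data are simulated and the variable names are illustrative.

```r
# Simulated data: n observations on p variables (hypothetical).
set.seed(8)
n <- 111
p <- 3
x <- matrix(rnorm(n * p), n, p)

# Maximum likelihood fit of one multivariate normal (covariance uses divisor n).
Sigma.hat <- crossprod(scale(x, scale = FALSE)) / n
loglik    <- -n / 2 * (p * log(2 * pi) + log(det(Sigma.hat)) + p)
npar      <- p + p * (p + 1) / 2          # mean vector plus covariance matrix

bic.usual   <- -2 * loglik + npar * log(n)  # usual BIC: smaller is better
bic.negated <-  2 * loglik - npar * log(n)  # negated version: larger is better
c(usual = bic.usual, negated = bic.negated)
```

Choosing the model with the largest negated value is the same as choosing the one with the smallest usual BIC.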

Plotting the BIC’s

[Figure: BIC (vertical axis, −6000 to −4000) plotted against the number of components (horizontal axis, 2 to 8), with one curve for each model: EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV, VVV.]

K = 2, VVV is best.

What is VVV?

To find the name of the best model:

    > mcars
    best model: ellipsoidal, unconstrained with 2 components

That K = 2 is easy to see. The assumptions on the covariance matrices are “ellipsoidal,” which means they have no special structure, and “unconstrained,” which means they are not assumed equal for the two groups, Σ1 ≠ Σ2.

To plot variable 1 (length) versus variable 4 (height), use

    plot(mcars,cars,what='classification',dimens=c(1,4))

Plotting the clusters

[Figure: classification plots in four panels: Height against Length, FrtLegRoom against Width, Luggage against RearHd, and PC2 against PC1.]

The cars in group 2

                          Rear Head  Rear Seating  Rear Shoulder  Luggage
    Chevrolet Corvette         −4.0        −19.67         −28.00     −8.0
    Honda Civic CRX            −4.0        −19.67         −28.00     −8.0
    Mazda MX5 Miata            −4.0        −19.67         −28.00     −8.0
    Mazda RX7                  −4.0        −19.67         −28.00     −8.0
    Nissan 300ZX               −4.0        −19.67         −28.00     −8.0
    Chevrolet Astro             2.5          0.33          −1.75     −8.0
    Chevrolet Lumina APV        2.0          3.33           4.00     −8.0
    Dodge Caravan               2.5         −0.33          −6.25     −8.0
    Dodge Grand Caravan         2.0          2.33           3.25     −8.0
    Ford Aerostar               1.5          1.67           4.25     −8.0
    Mazda MPV                   3.5          0.00          −5.50     −8.0
    Mitsubishi Wagon            2.5        −19.00           2.50     −8.0
    Nissan Axxess               2.5          0.67           1.25     −8.5
    Nissan Van                  3.0        −19.00           2.25     −8.0
    Volkswagen Vanagon          7.0          6.33          −7.25     −8.0

Just group 1

Redo on just the group 1 automobiles:

    cars1 <- cars[mcars$classification==1,]
    mcars1 <- Mclust(cars1)
    mcars1
    best model: ellipsoidal multivariate normal with 1 components

The best is one big cluster.

The models in mclust

    Code   Description                                          Σ_k
    EII    spherical, equal volume                              σ^2 I_p
    VII    spherical, unequal volume                            σ_k^2 I_p
    EEI    diagonal, equal volume and shape                     Λ
    VEI    diagonal, varying volume, equal shape                c_k ∆
    EVI    diagonal, equal volume, varying shape                c ∆_k
    VVI    diagonal, varying volume and shape                   Λ_k
    EEE∗   ellipsoidal, equal volume, shape, and orientation    Σ
    EEV    ellipsoidal, equal volume and equal shape            Γ_k Λ Γ_k′
    VEV    ellipsoidal, equal shape                             c_k Γ_k ∆ Γ_k′
    VVV∗   ellipsoidal, varying volume, shape, and orientation  arbitrary

Here, Λ’s are diagonal matrices with positive diagonals, ∆’s are diagonal matrices with positive diagonals whose product is 1, Γ’s are orthogonal matrices, Σ’s are arbitrary nonnegative definite symmetric matrices, and c’s are positive scalars. A subscript k on an element means the groups can have different values for that element. No subscript means that element is the same for each group.
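To make the pieces of the table concrete, the sketch below builds one 2 × 2 covariance matrix of the form c Γ ∆ Γ′ and then recovers the scalar, the shape matrix, and the orientation from its spectral decomposition. The numerical values and variable names are made up for illustration.

```r
# Build Sigma = c * Gamma %*% Delta %*% t(Gamma) from chosen pieces.
vol   <- 4                                  # c: positive scalar
Delta <- diag(c(3, 1/3))                    # positive diagonals whose product is 1
theta <- pi / 6
Gamma <- matrix(c(cos(theta), sin(theta),
                  -sin(theta), cos(theta)), 2, 2)   # orthogonal (a rotation)
Sigma <- vol * Gamma %*% Delta %*% t(Gamma)

# Recover the pieces from Sigma.
eg        <- eigen(Sigma, symmetric = TRUE)
vol.hat   <- prod(eg$values)^(1 / 2)        # det(Sigma)^(1/p) with p = 2
Delta.hat <- diag(eg$values / vol.hat)      # shape: diagonals multiply to 1
Gamma.hat <- eg$vectors                     # orientation, columns unique up to sign

vol.hat
diag(Delta.hat)
```

In the table's terms, a model such as VEV lets c and Γ differ between groups while holding ∆ fixed, whereas VVV places no constraint on any of the three.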

Hierarchical clustering of the sports

    # Cluster the columns (sports); plot() draws the dendrogram
    plot(hclust(dist(t(sportsranks))))

[Figure: complete-linkage dendrogram of the seven sports, with Height (20 to 40) on the vertical axis. Leaves, left to right: Baseball, Football, Basketball, Jogging, Tennis, Cycling, Swimming.]

Hierarchical clustering of the individuals

    par(mfrow=c(2,1))
    dxs <- dist(sportsranks)  # Gets Euclidean distances
    lbl <- rep(' ',130)       # Prefer no labels for the individuals
    plot(hclust(dxs),xlab='Complete linkage',sub=' ',labels=lbl)
    plot(hclust(dxs,method='single'),xlab='Single linkage',sub=' ',labels=lbl)

[Figure: dendrograms of the 130 individuals in two panels: complete linkage (Height 0 to 8) and single linkage (Height 0 to 4).]
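A dendrogram can be turned into explicit groups with cutree, which is not used in the slides but is part of base R. The sketch below does this for both the sports and the individuals on simulated rankings (the helper sim_ranks and the data are hypothetical).

```r
# Simulated stand-in for sportsranks (hypothetical data).
set.seed(9)
sports <- c("Baseball", "Football", "Basketball", "Tennis",
            "Cycling", "Swimming", "Jogging")
sim_ranks <- function(n, mu) t(replicate(n, rank(mu + rnorm(7))))
x <- rbind(sim_ranks(60, 1:7), sim_ranks(70, c(6, 7, 5, 4, 2, 1, 3)))
colnames(x) <- sports

# Cluster the sports (columns); complete linkage is the hclust default.
hc.sports <- hclust(dist(t(x)))
plot(hc.sports, xlab = "Complete linkage", sub = "")
sport.groups <- cutree(hc.sports, k = 2)    # named vector: group label per sport
sport.groups

# Cluster the individuals (rows) with single linkage and cut into two groups.
hc.people     <- hclust(dist(x), method = "single")
people.groups <- cutree(hc.people, k = 2)
table(people.groups)
```

Single linkage tends to peel off one or a few observations at a time, so its two-group cut is often far more lopsided than the complete-linkage one.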
