A Study of Data Mining Methods and Examples of Applications
By
Sun Yiran
Supervisor:
Zhou Wang
ST4199 Honours Project in Statistics
Department of Statistics and Applied Probability
National University of Singapore
September 2, 2016
Contents
Summary 3
Acknowledgment 4
1 Introduction 5
1.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Data Mining in Statistical Learning . . . . . . . . . . . . . . . . . . . 6
2 Methodology 6
2.1 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Linear Regression Model with Penalty or Constraints . . . . . 6
2.1.2 Splines and Semi-Parametric Models . . . . . . . . . . . . . . 10
2.1.3 Kernel Regression Model . . . . . . . . . . . . . . . . . . . . . 15
2.2 Curse of Dimensionality . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.1 Theory of Curse of Dimensionality . . . . . . . . . . . . . . . 21
2.2.2 Single-Index Model . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.3 Projection Pursuit Regression Model . . . . . . . . . . . . . . 24
3 Data Description and Empirical Application 25
3.1 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.1 Colon Cancer Dataset . . . . . . . . . . . . . . . . . . . . . . 26
3.1.2 Wine Quality Dataset . . . . . . . . . . . . . . . . . . . . . . 26
3.1.3 Bike Sharing Dataset . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Examples of application . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.1 Feature Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.2 Results Interpretation . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Model Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4 Conclusion 42
References 45
Appendix 46
Summary
This thesis studies a collection of models proposed in Data Mining techniques
as well as their applications in real-life examples. In this project, I explored theories
of the models in Chapter 2 and conducted simulation studies based on those models
in Chapter 3.
The first two models I introduced are parametric models: Ridge and Lasso, which
outperform the general linear models when there is multicollinearity within the
dataset. In the application section, I applied them to the ‘Colon Cancer’ dataset, in which the number of predictors hugely exceeds the number of observations. Next,
since the assumption of linearity between features and response most often does not hold, I discussed semi-parametric approaches, such as Spline Regression, that can move beyond linearity. I employed it in analyzing the ‘Bike Sharing’ data, so that both linear and non-linear components are taken into account in model fitting. Finally I discussed non-parametric models: k-Nearest-Neighbors
and Kernel Regression, and applied them to the ‘Wine Quality’ dataset. All data
files are downloaded from UCI Machine Learning Repository.
My contributions are the design of the simulation studies and the R programs implemented to assess the efficiency of the models proposed above. The classification error for categorical cases and the mean squared error for regression examples are measured, and a detailed interpretation of the resulting estimates is also given. Finally, in Chapter 4, I draw a conclusion and compare the advantages and disadvantages that the models have over each other.
Acknowledgment
First and foremost, I would like to express my special appreciation to my thesis
adviser Prof. Zhou Wang. Without his valuable guidance, aspiring encouragement
and great support, this paper would have never been accomplished.
I am also thankful to Prof. Xia Yingcun for being both my academic adviser and
internship adviser in the last four years. His insightful suggestions and invaluable constructive criticism have helped me to learn and to make continuous progress.
Besides my advisers, I would also like to show gratitude to all the faculty members.
Their caring for students and dedication to academic work have helped to develop
our potential and prepare us for future challenges. I feel greatly honored to be a student in this faculty.
Last but not least, I want to thank my parents for their support to my university life.
Their love, attention, and spiritual support has been a constant source of strength
for me.
1 Introduction
This is a thesis introducing basic knowledge about Data Mining, which is a key
component of Statistical Learning. The underlying background and motivation is
provided in Chapter 1. In Section 2.1, three of the most generally used models in Data Mining are introduced, and in Section 2.2 two more models are introduced to alleviate the problems that may occur in high-dimensional data. Chapter 3 covers the application of the introduced models to real-life datasets, an interpretation of the results obtained, and a comparison among the models. Finally, a conclusion of the whole thesis is given in Chapter 4. Note that the main reference book is An Introduction to Statistical Learning (James, Witten, Hastie, & Tibshirani, 2013).
1.1 Background and Motivation
The concept of statistical learning was first introduced in the 1960s. For the first three decades, statistical learning was purely theoretical analysis of a given collection of sample data. With a deeper understanding of Statistics as an independent scientific discipline and a higher demand for statistical learning methods applied to real-life problems, scientists have been constructing new algorithms for estimating multi-dimensional functions. Today the amount of data is exploding at such a speed that people face new challenges in managing and processing datasets of huge size. This is the so-called “Big Data” age. Modern statistical learning offers a set of tools for understanding these complex datasets. Having insight into the data
is not only useful in traditional areas like biology and medicine, but also helps in
innovating business disciplines. For example, having realised the commercial opportunities in big data analysis, Google launched its first ‘big data’ analytics center in 2011. It provides multiple services, such as analysis of the purchasing patterns behind millions of visits to e-commerce websites, assessment of how effective a billion-dollar advertising campaign is, and so on. To learn more about that, the very first step is to know about Statistical Learning.
1.2 Data Mining in Statistical Learning
In this thesis, we focus on one important element of Statistical Learning, called Data
Mining (Hastie, Tibshirani, Friedman, & Franklin, 2005). It is a computational
process of analyzing large data sets and extracting valuable information for further
use. There are several common tasks of data mining, including anomaly detection,
association rule learning, clustering, classification, regression and summarization. In
this thesis, we will only discuss classification and regression. Note that one necessary feature of data mining is model validation, which is used to check whether the model we propose is valid. We will say more on that in the main part of the thesis.
2 Methodology
2.1 Models
2.1.1 Linear Regression Model with Penalty or Constraints
The first model considered in this paper is the widely used Linear Model, one of the earliest statistical models, developed in the pre-computer age. It provides an easy-to-interpret description of the relations between inputs and output, and for prediction purposes it sometimes yields better results than a nonlinear model. Hence even in today's computer age, data scientists still like
to use it to perform data analysis. There are commonly two tasks considered: regression and classification. In linear regression models, several problems may arise that call for one's attention. The most commonly seen are non-linearity, correlation or non-constant variance of the error terms, outliers, and multicollinearity (Farrar & Glauber, 1967). Among these we will focus on the multicollinearity phenomenon first and then on non-linearity.
In regression problems, the Linear Regression Model assumes that, under the condition of linearity between the input vector $X^T = (X_1, X_2, \ldots, X_p)$ and the real-valued output $Y$, the model has the form

$$Y = \beta_0 + \sum_{j=1}^{p} X_j \beta_j \qquad (1)$$

where the $\beta_j$'s are unknown coefficients.
One of the most well-known estimates of the coefficients is the Least Squares Estimation (LSE). It optimizes the model fit by minimizing the Residual Sum of Squares (RSS), $\sum_{i=1}^{n}(Y_i - X_i^T\beta)^2$. Given $n$ observations $(X_i, Y_i)$, the LSE of $\beta$ can be written as

$$\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} \qquad (2)$$

where $\mathbf{X}^T\mathbf{X} = \sum_{i=1}^{n} X_i X_i^T$. The main problem with this expression is that, for the LSE to exist, the matrix $\mathbf{X}^T\mathbf{X}$ must be non-singular. In reality it is quite possible that this condition is not met. One special but common case is called ‘multicollinearity’: the situation in which two or more predictors in a model are highly correlated. When multicollinearity is present, at least one eigenvalue of $\mathbf{X}^T\mathbf{X}$ is close to zero, so the matrix is close to singular. This phenomenon occurs frequently in high-dimensional data. In this section we introduce two penalty approaches to remedy the problem: one called Ridge and the other Lasso (McDonald, 2009).
Ridge was first suggested by Hoerl and Kennard in 1970. It is called a penalty approach because it penalizes the values of the coefficients $\beta$: we add a term to the original RSS of the linear model and estimate $\beta$ by minimizing

$$\sum_{i=1}^{n}(Y_i - X_i^T\beta)^2 + \lambda(\beta_1^2 + \cdots + \beta_p^2) \qquad (3)$$

or, in matrix form,

$$\|\mathbf{Y} - \mathbf{X}\beta\|^2 + \lambda\|\beta\|^2 \qquad (4)$$

Here $\lambda \ge 0$ is a shrinkage parameter. Solving the above minimization problem leads to

$$\hat{\beta}^{Ridge} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{Y} \qquad (5)$$

The added diagonal matrix guarantees invertibility. For each $\lambda$ chosen there is a corresponding solution $\hat{\beta}^{Ridge}$, so the $\lambda$'s trace out a path of solutions. We immediately see some properties of $\lambda$: (1) $\lambda$ controls the size of the coefficients; (2) $\lambda$ controls the amount of regularization; (3) as $\lambda \to 0$ we obtain the least squares estimates, and as $\lambda \to \infty$ the penalty grows without bound, so all coefficients approach zero.
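To make the closed form in equation (5) concrete, the sketch below computes the Ridge estimate directly with NumPy on simulated data. This is an illustrative sketch in Python (the thesis itself uses R), and the data, seed, and λ values are my own choices:

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate (X'X + lam*I)^{-1} X'y, as in equation (5)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
# Make columns 0 and 1 nearly collinear to mimic multicollinearity.
X[:, 1] = X[:, 0] + 1e-6 * rng.normal(size=n)
beta_true = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

b_ridge = ridge(X, y, lam=1.0)   # stable despite near-singular X'X
b_big = ridge(X, y, lam=1e6)     # huge penalty shrinks everything to ~0
print(np.round(b_ridge, 3), np.abs(b_big).max())
```

As λ grows, every coefficient is shrunk toward zero, matching property (3) above.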
An alternative approach to Ridge is called the Lasso, a shrinkage method similar to Ridge. The only difference is that the penalty term is the sum of the absolute values of the $\beta_j$'s rather than the sum of their squares. Hence it solves the problem of minimizing

$$\sum_{i=1}^{n}(Y_i - X_i^T\beta)^2 + \lambda\sum_{j=1}^{p}|\beta_j| \qquad (6)$$
We can also understand the penalty ideas above in a ‘constraint’ way. For example, consider the case $p = 2$:

Ridge: minimize $\sum_{i=1}^{n}(Y_i - \beta_1 x_{i1} - \beta_2 x_{i2})^2$ subject to $\beta_1^2 + \beta_2^2 \le t$

Lasso: minimize $\sum_{i=1}^{n}(Y_i - \beta_1 x_{i1} - \beta_2 x_{i2})^2$ subject to $|\beta_1| + |\beta_2| \le t \qquad (7)$

If we draw the contours of the least squares error function (red ellipses) and the constraint regions (solid blue areas), the figure looks like the following:
Figure 1: Blue regions are the constraint regions $|\beta_1| + |\beta_2| \le s$ and $\beta_1^2 + \beta_2^2 \le s$; red ellipses are contours of the RSS, for the Lasso (left) and Ridge (right) respectively
It is clear that if the red ellipses continue to expand, the coordinates of their first intersection with the blue constraint region are simply the solutions for $\beta_1$ and $\beta_2$. In this process of expansion, once the ellipses first reach the blue region, the touch point for the Lasso often lies at a corner where one coordinate, say $\beta_1$, is exactly zero, while for Ridge a zero coordinate is essentially impossible. This is an important advantage that the Lasso holds over Ridge: it allows some coefficients to be shrunk exactly to zero and thus performs variable selection.
We may want to compare Ridge and Lasso. Obviously the Lasso has a major advantage over Ridge in that it can do variable selection, consequently leading to easier-to-interpret results. However, which method performs better in prediction accuracy? In general, the Lasso performs better when a relatively small number of inputs have substantial non-zero coefficients. In contrast, one may expect Ridge to perform better when there is a large number of predictors with coefficients of roughly equal size. Techniques such as Akaike's (1973) Information Criterion (AIC) or Craven's (1979) Cross-Validation (CV) can be employed to determine which method fits the data better. We will say more on that in later chapters, and for convenience we will use the short forms from now on. Again, Lasso and Ridge outperform the LSE in that they produce a trade-off between variance and bias: in settings where the LSE has high variance, they reduce variance at the expense of bias and hence achieve more accurate prediction.
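The contrast between the two penalties can be sketched in the special case of an orthonormal design, where both estimators have well-known closed forms: Ridge scales each LSE coefficient by $1/(1+\lambda)$, while the Lasso soft-thresholds it at $\lambda/2$ and so sets small coefficients exactly to zero. A minimal Python illustration (the coefficient values are invented for demonstration):

```python
import numpy as np

def ridge_orth(b_lse, lam):
    # Ridge with an orthonormal design: uniform shrinkage by 1/(1 + lam);
    # coefficients are never exactly zero.
    return b_lse / (1.0 + lam)

def lasso_orth(b_lse, lam):
    # Lasso with an orthonormal design: soft-thresholding at lam/2, so
    # coefficients below the threshold become exactly zero (selection).
    return np.sign(b_lse) * np.maximum(np.abs(b_lse) - lam / 2.0, 0.0)

b_lse = np.array([3.0, 0.5, -0.2, 0.05])
print(ridge_orth(b_lse, lam=1.0))   # all four coefficients stay non-zero
print(lasso_orth(b_lse, lam=1.0))   # the three small coefficients are zeroed
```

This is exactly the geometric picture of Figure 1: the Lasso's corners produce exact zeros, Ridge's circle does not.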
2.1.2 Splines and Semi-Parametric Models
A linear regression method is relatively easy to implement and provides easy-to-interpret results. However, it suffers significant limitations in predictive power when the linearity assumption is poor. In real-life settings, the relationship between output and inputs, i.e. $f(X) = E[Y|X]$, is unlikely to be linear. Over several decades, considerable efforts have been made to establish algorithms for fitting curves to data. Inspired by the Taylor expansion, which approximates the underlying function as $m(x) \approx m(a) + m'(a)(x-a) + \frac{m''(a)}{2}(x-a)^2 + \cdots$ up to a polynomial of degree $d$, it is reasonable to think of replacing the linear model with a polynomial function to achieve curve fitting:

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \cdots + \beta_d x_i^d + \epsilon_i \qquad (8)$$
where the coefficients can be obtained by LSE, because this is just a linear regression model with predictor variables $x_i, x_i^2, \ldots, x_i^d$. The disadvantage of this ordinary polynomial function is that it imposes a global structure on $X$: when the degree is large, the approximation procedure becomes extremely complicated and inefficient. In other words, a large $d$ ($d > 3$) leads to over-flexibility and strange shapes of the plotted curve, particularly at the boundaries. Therefore, instead of regarding the function as a whole, we approximate it in a piecewise manner. For example, a piecewise cubic polynomial fits different cubic polynomials in different parts of the range of $X$. Mathematically, a standard cubic polynomial has the form $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \epsilon_i$. Approximating it in two parts, the polynomial becomes a piecewise cubic that takes the form
the polynomial becomes a cubic spline that takes the form
y
i
=
8
>
<
>
:
�01 + �11xi
+ �21x2i
+ �31x3i
+ ✏
i
, if xi
< c;
�02 + �12xi
+ �22x2i
+ �32x3i
+ ✏
i
, if xi
� c.
where the two polynomial functions di↵er in coe�cients. The first function is fitted
based on subset of observations xi
< c, and the other on x
i
� c.
The problem with the piecewise polynomial model lies in its inherent discontinuity. For example, I applied a piecewise polynomial regression to a subset of the ‘Wage’ data available in R. Though there are several variables in this dataset, I will only plot ‘wage’ versus ‘age’ for simplicity. Setting ‘age = 50’ as the break point, we obtain the following figure:
Figure 2: Cubic Polynomial Regression for Wage data with break point at age = 50
We immediately see a jump at the break point ‘age = 50’; therefore a piecewise polynomial is not a good choice for continuous regression. We want to impose further constraints so that the function is continuous even at the break points: both the first and second derivatives of the piecewise polynomials should be continuous. These two restrictions fulfill the continuity conditions. A regression spline is a good approach proposed to achieve these constraints (Wand, 2000).
A spline is a strategy that divides the domain of the underlying function into a sequence of sub-intervals and estimates the function on each interval using a polynomial regression function. Mathematically, a $k$th-order spline is a function that is a polynomial of degree $k$ on each of the intervals $(-\infty, t_1], [t_1, t_2], \ldots, [t_J, \infty)$ and is continuous at its knot points $t_1 < t_2 < \cdots < t_J$. The most popular spline is of order 3, called a cubic spline. To parameterize the set of cubic splines, two sorts of bases can be used: the truncated power basis and the B-spline basis. The truncated power basis is the most natural and is easy to interpret, but the advent of the B-spline basis improved computational speed as well as numerical accuracy. We will not cover the latter in this thesis.
A collection of truncated cubic power basis functions is: $1, x, x^2, x^3, (x - t_j)_+^3$, $(j = 1, \ldots, J)$. Employing the above regression ideas, the underlying function $Y = m(x) + \epsilon$ can be approximated as

$$Y = \sum_{k=1}^{J+4} \theta_k B_k(x) \qquad (9)$$

where we write $B_1(x) = 1$, $B_2(x) = x$, $B_3(x) = x^2$, $B_4(x) = x^3$, $B_{4+j}(x) = (x - t_j)_+^3$. It is easy to see that the function has continuous first and second derivatives, and thus remains smooth at the knots. The beauty of this approach is that the expression is in linear form once the bases are determined. Therefore we can use LSE to obtain an estimator for $\theta$; namely, let
$$\mathbf{Y} = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \quad \mathbf{X} = \begin{pmatrix} B_1(X_1) & B_2(X_1) & \cdots & B_{J+4}(X_1) \\ B_1(X_2) & B_2(X_2) & \cdots & B_{J+4}(X_2) \\ \vdots & \vdots & \ddots & \vdots \\ B_1(X_n) & B_2(X_n) & \cdots & B_{J+4}(X_n) \end{pmatrix}, \quad \theta = \begin{pmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_{J+4} \end{pmatrix} \qquad (10)$$
The best approximation of $\theta$ is thus $\hat{\theta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$. Note that the above expression requires us to determine the number of knots $J$, which controls the trade-off between the smoothness of the curve and the bias of the approximation.
To see this straightforwardly, I simulated a set of artificial data from the function $Y = \sin(\pi x) + 0.3\epsilon$, where $x \sim U(0, 1)$ and $\epsilon \sim N(0, 1)$. 100 observations were sampled, and the approximated functions based on cubic splines with numbers of knots $k = 10, 25, 50, 100$ are plotted separately in the figure below. One can see that at $k = 100$ the spline curve tries to pass through more data points but is less smooth. Theoretically, the optimal number of knots can be chosen by finding the smallest AIC or CV value.
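The construction in equations (9) and (10) can be sketched directly: build the design matrix of truncated power basis functions and solve for $\theta$ by least squares. The sketch below (in Python; the thesis uses R) mirrors the simulation described above, though the seed and knot placement are my own choices:

```python
import numpy as np

def truncated_power_basis(x, knots):
    """Design matrix with columns 1, x, x^2, x^3, (x - t_j)_+^3."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - t, 0.0, None) ** 3 for t in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(0, 1, n)
y = np.sin(np.pi * x) + 0.3 * rng.normal(size=n)

knots = np.linspace(0.1, 0.9, 9)           # J = 9 interior knots
B = truncated_power_basis(x, knots)        # n x (J + 4) matrix from (10)
theta, *_ = np.linalg.lstsq(B, y, rcond=None)   # LSE of theta

# Fitted cubic spline evaluated on a grid.
grid = np.linspace(0, 1, 50)
fit = truncated_power_basis(grid, knots) @ theta
print(B.shape, np.round(fit[:3], 3))
```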
Figure 3: Cubic splines with different numbers of knots: 10, 25, 50, 100
A cubic spline achieves continuity at the knots. However, it still has the shortcoming that the estimate may have high variance at the outer range of the predictors. In that case, a natural spline performs better by imposing an additional constraint: the function is required to be linear at the boundaries. Again taking the ‘Wage’ data as an example, calling the standard R functions ‘bs()’ and ‘ns()’ gives a cubic spline and a natural cubic spline respectively.
Figure 4: A cubic spline and a natural cubic spline fitted to a subset of the ‘Wage’ data, with knots at 25, 40, 60.
We see that the confidence interval of the cubic spline becomes wild when $X$ takes on large values; in contrast, the natural cubic spline provides more stable estimates there.
For the multivariate case, where the number of predictors $p > 2$, we cannot easily determine the functional form of the model: estimation efficiency decreases dramatically as $p$ increases. One attempt to overcome this drawback leads to semi-parametric models, which contain both parametric and non-parametric components. In this way semi-parametric models are more flexible than linear models in estimation, yet more efficient than non-parametric models in computational implementation. There is a wide variety of semi-parametric models, among which the most well-known example is the Generalized Additive Model (GAM) (Wood, 2006). Suppose we are interested in $m(x_1, \ldots, x_p) = E(Y \mid x_1, \ldots, x_p)$; the GAM approximates $m(\cdot)$ by modeling each variable separately and summing all their contributions:

$$Y = \beta_0 + g_1(x_1) + \cdots + g_p(x_p) + \epsilon \qquad (11)$$
In most cases we know which predictors enter the response linearly, say $x_1, \ldots, x_q$, and our interest focuses on the unknown smooth functions of the remaining predictors. We can thus write the function as the following Additive Partially Linear Model (APLM), which can be regarded as a special case of the GAM (Stone, 1985):

$$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_q x_q + g_{q+1}(x_{q+1}) + \cdots + g_p(x_p) + \epsilon \qquad (12)$$
A common way to estimate the nonlinear parts of the APLM, the $g_k(x_k)$, is to assume that they have the form of splines. This gives

$$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_q x_q + \sum_{j=2}^{J_{q+1}+4} \theta_{q+1,j} B_{q+1,j}(x_{q+1}) + \cdots + \sum_{j=2}^{J_p+4} \theta_{p,j} B_{p,j}(x_p) + \epsilon$$

Again, this equation is in linear form, so we can estimate the parameters by LSE and make statistical inference on them:

$$(\beta_0, \beta_1, \ldots, \beta_q, \theta_{q+1,2}, \ldots, \theta_{q+1,J_{q+1}+4}, \ldots, \theta_{p,2}, \ldots, \theta_{p,J_p+4})^T = \{\mathbf{X}^T\mathbf{X}\}^{-1}\mathbf{X}^T\mathbf{Y} \qquad (13)$$
The combination of linear and non-linear components in the additive partially linear model makes the approximation procedure more flexible and the results more accurate. It also allows us to examine the effect of each individual predictor variable separately.
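As a sketch of how the APLM reduces to a single least squares fit, the design matrix below combines a linear column for $x_1$ with a truncated-power spline basis for a second predictor; the data-generating model and knots are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)                   # enters the model linearly
x2 = rng.uniform(0, 1, n)                 # enters through a smooth function
y = 1.0 + 2.0 * x1 + np.sin(2 * np.pi * x2) + 0.1 * rng.normal(size=n)

# Design: intercept + linear term + truncated power basis for g(x2).
knots = np.linspace(0.2, 0.8, 4)
spline_cols = [x2, x2**2, x2**3] + [np.clip(x2 - t, 0, None) ** 3 for t in knots]
X = np.column_stack([np.ones(n), x1] + spline_cols)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
beta1_hat = coef[1]                        # estimate of the linear effect of x1
print(round(float(beta1_hat), 2))
```

One LSE solve recovers both the linear coefficient and the spline coefficients at once, which is exactly the convenience of equation (13).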
2.1.3 Kernel Regression Model
We have seen that semi-parametric models are more flexible than linear models in estimation. In this section we describe a non-parametric method that fits a different model at each query point $x_0$ and thus achieves even more flexibility. In this method no parameters are involved, and the estimate at a target point is based entirely on learning from the observations nearest to it. This non-parametric method is called k-Nearest-Neighbors (kNN), and it is the most intuitive technique recognized for statistical discrimination so far (Weinberger, Blitzer, & Saul, 2005).
The name kNN involves a notion of ‘near’, so how do we define ‘near’? Most often it is natural to use the empirical Euclidean distance in the feature space. After all features have been standardized to have mean zero and variance one, the distance takes the form

$$d(X_i, X_j) = \|X_i - X_j\| = \Big\{\sum_{s=1}^{p}(x_{is} - x_{js})^2\Big\}^{1/2} \quad \text{or} \quad \sum_{s=1}^{p}|x_{is} - x_{js}| \qquad (14)$$
Let us consider the univariate case, where $x$ is the only input variable and $(x_1, y_1), \ldots, (x_n, y_n)$ are the $n$ observations. Suppose the target point is $x = x_0$. Based on the Euclidean distance criterion, we collect the $k$ nearest neighbors of $x_0$ to form the set $D_x = \{x_i : x_i \text{ is one of the } k \text{ nearest neighbors of } x_0\}$ and estimate $m(x)$ as

$$\hat{y} = \frac{\sum_{x_i \in D_x} y_i}{\#\{x_i \in D_x\}} \equiv \frac{\sum_{i=1}^{n} I(x_i \in D_x)\,y_i}{\sum_{i=1}^{n} I(x_i \in D_x)} \qquad (15)$$
The only unknown component in this expression is $k$, which is also the size of $D_x$. The choice of $k$ is important in the sense that it governs the trade-off between variance and bias. If $k$ is too large, the estimator is affected by far-away observations that may provide irrelevant information; if it is too small, the estimator fluctuates with high variance. Analogous to the choice of the number of knots $J$ in the regression spline method, we can use CV to select the most appropriate $k$. I will implement that later in the examples of real-life applications.
Note that the above estimator assigns equal weight to all $k$ neighbors. However, one may argue that nearer neighbors are more relevant and should be assigned more importance in the decision. A more accurate, weighted estimate is therefore
$$\hat{y} = \frac{\sum_{i=1}^{k} w_i y_i}{\sum_{i=1}^{k} w_i} \qquad (16)$$
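Equations (15) and (16) can be sketched in a few lines of Python; the inverse-distance weights used here are one common choice for the $w_i$, not necessarily the one used later in the applications:

```python
import numpy as np

def knn_predict(x0, x, y, k, weighted=False):
    """kNN regression estimate at a single target point x0."""
    d = np.abs(x - x0)                 # univariate Euclidean distance
    idx = np.argsort(d)[:k]            # indices of the k nearest neighbors
    if not weighted:
        return y[idx].mean()           # equation (15): equal weights
    w = 1.0 / (d[idx] + 1e-12)         # nearer neighbors weigh more
    return np.sum(w * y[idx]) / np.sum(w)   # equation (16)

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # y = x exactly
print(knn_predict(1.6, x, y, k=2))         # → 1.5 (average of y at x=1, x=2)
```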
Although k-Nearest-Neighbors (kNN) is an essentially model-free approach that is efficient when the estimates are well positioned to capture the distribution of the real data, it has an inherent disadvantage of discreteness: the approximated function plotted by kNN appears discontinuous and unattractive. In addition, limitations such as poor run-time performance when the training set is large, sensitivity to redundant features, and the computational cost of computing all the relative distances may arise in real data problems. Therefore statisticians developed an improved method called Kernel Smoothing (Bichler & Kiss, 2004). Rather than choosing $k$ neighbors to compose the discrete set $D_x$, Kernel Smoothing uses an interval $D_x = [x - h, x + h]$ with some bandwidth $h > 0$ and defines the weight as

$$w_i = \frac{1}{h} K\Big(\frac{x_i - x}{h}\Big) \qquad (17)$$
where $K(\cdot)$ is a kernel function and $(x_i - x)/h$ is the relative distance between the neighbor $x_i$ and the target point $x$ within $h$. For convenience we write the above expression as $K_h(X_i - x)$. The estimator becomes

$$\hat{y} = \frac{\sum_{i=1}^{n} K_h(X_i - x)\,y_i}{\sum_{i=1}^{n} K_h(X_i - x)}$$

which is the so-called Nadaraya-Watson (N-W) estimator.
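A minimal sketch of the N-W estimator with a Gaussian kernel (the kernel choice and simulated data are my own):

```python
import numpy as np

def nw_estimate(x0, x, y, h):
    """Nadaraya-Watson estimate at x0 with a Gaussian kernel, bandwidth h."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)   # kernel weights; the 1/h factor
    return np.sum(w * y) / np.sum(w)         # cancels in the ratio

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(3 * np.pi * x) + 0.2 * rng.normal(size=200)

grid = np.linspace(0.05, 0.95, 10)
fit = np.array([nw_estimate(g, x, y, h=0.03) for g in grid])
print(np.round(fit, 2))
```

Unlike kNN, the weights decay smoothly with distance, so the fitted curve is continuous in $x_0$.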
Kernel Smoothing has many real-life applications, such as pattern recognition, image reconstruction and data visualization. Here is a toy example in pattern recognition: I created a decayed sign from a function, shown in the left panel below, and applied Kernel Smoothing to recover the sign, shown in the right panel.
Figure 5: Simulated decayed sign (left) and recovered sign (right) using Kernel
Method
One might ask: what is a kernel function? A kernel is in essence a symmetric probability density. The construction of a kernel density estimator is similar to that of a histogram. Recall the steps of constructing a histogram: break the whole range of values into a sequence of intervals, and count the number of observations falling into each interval. Let $m$ denote the number of observations that fall into the interval around $x$; for the histogram, the density estimate is $\hat{f}(x) = \frac{m}{n \times \text{interval length}}$. Likewise, for a kernel density estimator,
$$\hat{f}_h(x) = \frac{m}{n \times \text{interval length}} = \frac{1}{2nh}\,\#\{X_i \in [x-h, x+h]\} = \frac{1}{2nh}\sum_{i=1}^{n} I(|x - X_i| \le h) = \frac{1}{nh}\sum_{i=1}^{n} \frac{1}{2} I\Big(\Big|\frac{x - X_i}{h}\Big| \le 1\Big) = \frac{1}{nh}\sum_{i=1}^{n} K\Big(\frac{x - X_i}{h}\Big)$$

where the last step replaces the uniform weight $\frac{1}{2}I(|u| \le 1)$ with a general kernel $K$.
We can easily prove that it is a proper density function by showing that it integrates to 1:

$$\int \hat{f}_h(x)\,dx = \frac{1}{n}\sum_{i=1}^{n}\int \frac{1}{h} K\Big(\frac{X_i - x}{h}\Big)\,dx = \frac{1}{n}\sum_{i=1}^{n}\int K\Big(\frac{X_i - x}{h}\Big)\,d\Big(\frac{X_i - x}{h}\Big) = \frac{1}{n}\sum_{i=1}^{n}\int K(u)\,du = 1$$

using the substitution $u = (X_i - x)/h$, the symmetry of $K$, and the fact that $K$ integrates to 1.
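The identity can also be checked numerically: a Gaussian-kernel density estimate built from the formula above integrates to approximately 1. A sketch, with the sample and grid chosen for illustration:

```python
import numpy as np

def kde(grid, data, h):
    """Kernel density estimate f_h(x) = (1/nh) * sum_i K((x - X_i)/h)."""
    u = (grid[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian kernel
    return K.sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(5)
data = rng.normal(size=500)
grid = np.linspace(-6, 6, 2001)

f = kde(grid, data, h=0.3)
total = f.sum() * (grid[1] - grid[0])   # Riemann sum: should be close to 1
print(round(float(total), 4))
```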
To see the comparison visually, I applied the Histogram and Kernel approaches to samples from two given functions respectively, to estimate their density functions. The upper plots are for $Y = \sin(3\pi x) + 0.2\epsilon$ and the lower for $Y = \sin(2x) + 0.2\epsilon$.
Figure 6: Histogram and kernel plots of the functions $Y = \sin(3\pi x) + 0.2\epsilon$ (upper) and $Y = \sin(2x) + 0.2\epsilon$ (lower) respectively
We see that both methods give a rough sense of the density distribution, but the kernel estimate exhibits smoothness and continuity and is thus better at determining the shape of the density. Some popular kernel functions are given in the following table:

Table 1: Kernel Functions

Kernel        | Explicit Form
Gaussian      | $K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$, $u \in (-\infty, \infty)$
Uniform       | $K(u) = \frac{1}{2}$, $u \in [-1, 1]$
Tri-cube      | $K(u) = \frac{70}{81}(1 - |u|^3)^3$, $u \in [-1, 1]$
Epanechnikov  | $K(u) = \frac{3}{4}(1 - u^2)$, $u \in [-1, 1]$
In practice, there are some points that call for one's attention:
• Similar to k-Nearest-Neighbors, the width of the neighborhood, i.e. the bandwidth $\lambda$, needs to be determined. A natural bias-variance trade-off occurs as we change the bandwidth: a small $\lambda$ leads to smaller bias but larger variance, while a big $\lambda$ leads to over-smoothing and poor prediction. Note that the selection criteria for the regularization parameter of smoothing splines also apply here.
• Boundary issues arise. Fewer observations are present near the boundaries, which means inadequate information, so estimates at the boundaries may be inaccurate.
• When there are ties in the $x_i$, one should average the $y_i$ at tied values and add extra weight to these new observations at $x_i$.
Regarding the bandwidth issue, we should know how to determine $\lambda$. For example, $\lambda$ is the radius of the support region for the Epanechnikov or tri-cube kernel with metric width, while for the most popular Gaussian kernel $\lambda$ is the standard deviation (Sheather & Jones, 1991). Generally it can be determined by techniques such as Cross-Validation (CV).
The following figure shows kernel smoothing fits with two different bandwidths, applied to samples generated from $y = \sin(3\pi x) + 0.2\epsilon$, where the $\epsilon$'s are i.i.d. noise terms.
Figure 7: Kernel smoothing plots at bandwidth 0.01 (left) and 0.03 (right)
The left-hand plot looks less smooth than the right. Indeed, a smaller bandwidth leads to more fluctuation in the estimate and thus higher variance; however, as the curve tries to pass through more observations, the bias is also smaller.
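One standard way to choose the bandwidth is leave-one-out cross-validation: for each candidate $h$, predict each $y_i$ from all the other observations and pick the $h$ minimizing the CV error. A sketch of that selection (the candidate grid of bandwidths is my own choice):

```python
import numpy as np

def nw_loo_cv(x, y, h):
    """Leave-one-out CV error of the Gaussian N-W estimator at bandwidth h."""
    u = (x[:, None] - x[None, :]) / h
    W = np.exp(-0.5 * u**2)
    np.fill_diagonal(W, 0.0)           # leave each point out of its own fit
    pred = (W @ y) / W.sum(axis=1)
    return np.mean((y - pred) ** 2)

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(3 * np.pi * x) + 0.2 * rng.normal(size=200)

hs = np.array([0.005, 0.01, 0.03, 0.1, 0.3])
errors = np.array([nw_loo_cv(x, y, h) for h in hs])
h_best = hs[np.argmin(errors)]
print(h_best, np.round(errors, 3))
```

The CV curve is typically U-shaped in $h$, mirroring the bias-variance trade-off described above.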
Regarding the boundary issue, we can refer to the local regression idea and introduce a local linear kernel smoothing approach. The basic idea is that, again for $Y = m(X) + \epsilon$, if $X_i$ is close to $x$ we can consider the local linear approximation given by a first-order Taylor expansion, $m(X_i) \approx m(x) + m'(x)(X_i - x)$. Thus for $i = 1$ to $n$ there are $n$ such linear relations, and the problem converts to a weighted least squares problem with $\beta = (m(x), m'(x))^T$. However, we will not discuss that in detail.
Finally, let us consider the multivariate case $X = (x_1, \ldots, x_p)^T$ with $n$ observations $(X_1, Y_1), \ldots, (X_n, Y_n)$. The formulation of the N-W estimator remains the same as in the univariate case:

$$\hat{m}(x) = \frac{\sum_{i=1}^{n} K_h(X_i - x)\,Y_i}{\sum_{i=1}^{n} K_h(X_i - x)} \qquad (18)$$

The only difference is that here $K(\cdot)$ is a multivariate kernel function, $K_h(X_i - x) = \frac{1}{h^p} K\big(\frac{x_{i1} - x_1}{h}, \ldots, \frac{x_{ip} - x_p}{h}\big)$, where $h$ can still be chosen using the CV method. Both the univariate and multivariate cases can be simulated in R; I will show that in Chapter 3.
2.2 Curse of Dimensionality
2.2.1 Theory of Curse of Dimensionality
In this chapter we will see what the ‘Curse of Dimensionality’ is, why it happens, and what solutions can be taken to resolve it. First, what is the ‘Curse of Dimensionality’? It exists in almost all the regression models introduced above, particularly the non-parametric ones. For example, in the kNN approach, the k nearest neighbors of a particular data point $x_0$ may be far away when the number of predictors $p$ is large, leading to a poor fit. The situation for kernel smoothing is relatively difficult to explain; we can understand it through the following mathematical computations.
Recall that in the univariate case the N-W estimator is

$$\hat{m}(x) = \frac{\sum_{i=1}^{n} K_h(X_i - x)\,Y_i}{\sum_{i=1}^{n} K_h(X_i - x)}$$

Since $K(\cdot)$ is a symmetric probability density function, the performance of the estimator can be assessed by Taylor expansion:

$$\text{bias}(\hat{m}(x)) = E\,\hat{m}(x) - m(x) \approx \frac{\sum_{i=1}^{n} K_h(X_i - x)\{m(x) + m'(x)(X_i - x) + \frac{1}{2}m''(x)(X_i - x)^2\}}{\sum_{i=1}^{n} K_h(X_i - x)} - m(x) \approx c_2\Big\{\frac{1}{2}m''(x) + f^{-1}(x)\,m'(x)\,f'(x)\Big\}h^2 \qquad (19)$$

with $c_2 = \int u^2 K(u)\,du$, and
$$\text{Var}(\hat{m}(x)) = \text{Var}\bigg(\frac{\sum_{i=1}^{n} K_h(X_i - x)\,\epsilon_i}{\sum_{i=1}^{n} K_h(X_i - x)}\bigg) = \frac{\sum_{i=1}^{n}\{K_h(X_i - x)\}^2}{\{\sum_{i=1}^{n} K_h(X_i - x)\}^2}\,\sigma^2 \approx \frac{n\,E\{K_h(X_1 - x)\}^2}{(n f(x))^2}\,\sigma^2 = \frac{n\,E\{\frac{1}{h}K(\frac{X_1 - x}{h})\}^2}{(n f(x))^2}\,\sigma^2 \qquad (20)$$
If we do the variable substitution $v = (u - x)/h$, i.e. $u = x + hv$,

$$E\Big\{\frac{1}{h}K\Big(\frac{X_1 - x}{h}\Big)\Big\}^2 = \frac{1}{h^2}\,E\,K^2\Big(\frac{X_1 - x}{h}\Big) = \frac{1}{h^2}\int K^2\Big(\frac{u - x}{h}\Big) f(u)\,du = \frac{1}{h}\int K^2(v)\,f(x + hv)\,dv \to \frac{1}{h}\,f(x)\int K^2(v)\,dv$$
We can finally get

$$\text{Var}(\hat{m}(x)) \approx \frac{n\cdot\frac{1}{h}\,f(x)\int K^2(v)\,dv}{(n f(x))^2}\,\sigma^2 = \frac{d_0\,\sigma^2}{n h f(x)}, \quad \text{with } d_0 = \int K^2(v)\,dv \qquad (21)$$
Since the choice of $h$ should minimize

$$E[\hat{m}(x) - m(x)]^2 = \text{bias}^2 + \text{variance},$$

simplifying the equation, the optimal bandwidth is

$$h_{opt} = \bigg\{\frac{d_0\,\sigma^2}{4 f(x)\,c_2^2\,\big[\frac{1}{2}m''(x) + f^{-1}(x)\,m'(x)\,f'(x)\big]^2}\bigg\}^{1/5} n^{-1/5} \;\propto\; n^{-1/5} \qquad (22)$$
Analogously, the multivariate case has p > 1, so the bias and variance turn out to be $\mathrm{bias} \approx C_{K,m}(x)h^2$ and $\mathrm{var}(\hat{m}(x)) \approx \frac{D_K}{nh^p f(x)}$. Proceeding the same way, the optimal bandwidth obtained is proportional to $n^{-1/(p+4)}$. That means an increase in the dimension p leads to a dramatic decrease in the convergence rate, and thus a deterioration in estimation efficiency. This is the so-called 'Curse of Dimensionality'. There is a collection of approaches to resolve the problem. Here I would like to introduce two models that take account of it.
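Before turning to these models, the nearest-neighbour symptom of the curse is easy to check numerically. The sketch below (an illustrative Python snippet of my own, not from the thesis) measures how far the closest of n = 500 uniform points lies from the centre of the unit cube as p grows:

```python
import numpy as np

rng = np.random.default_rng(1)

def median_nn_distance(n, p, reps=50):
    """Median (over repetitions) of the distance from the cube centre to the
    nearest of n points drawn uniformly from [0, 1]^p."""
    mins = []
    for _ in range(reps):
        X = rng.uniform(0.0, 1.0, size=(n, p))
        d = np.sqrt(((X - 0.5)**2).sum(axis=1))  # distances to the centre
        mins.append(d.min())
    return float(np.median(mins))

d1 = median_nn_distance(500, 1)    # tiny: neighbours are genuinely local
d10 = median_nn_distance(500, 10)  # large: the 'nearest' point is far away
```

With the same sample size, the nearest neighbour in ten dimensions is orders of magnitude farther away than in one dimension, so any local smoother is forced to average over distant points.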
2.2.2 Single-Index Model
The first model is the Single-Index Model. Recall that the linear regression model is $E(Y|X) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p$. Brillinger (1983) considered imposing a link function on the linear regression model:
$$Y = \Phi(\beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p) + \epsilon \tag{23}$$
This model was termed by Stoker (1986) the 'Single-Index Model' when $\Phi(\cdot)$ is unknown. After standardizing the index, we can write it as
$$Y = \Phi(\alpha_0^T X) + \epsilon \tag{24}$$
where $\alpha_0 = (\beta_1, \ldots, \beta_p)^T$ is the direction, with $\|\alpha_0\|^2 = \beta_1^2 + \ldots + \beta_p^2 = 1$. Since $\Phi(\cdot)$ is a univariate function, its estimation does not suffer from the dimensionality problem.
2.2.3 Projection Pursuit Regression Model
The second model considered is Projection Pursuit Regression (PPR), which has the form
$$f(X) \approx f_1(\alpha_1^T x) + \ldots + f_m(\alpha_m^T x) \tag{25}$$
This is an additive model with the inputs $\alpha^T x$ as derived features of X. The additive terms are called ridge functions; they depend fully on the direction vector $\alpha$, with $\|\alpha\| = 1$. In other words, $\alpha^T x$ is the projection of X onto $\alpha$, and we only need to determine the $\alpha$ that optimizes the model fit. In practice, we start with one term and add new terms one by one if needed. Since m < p, the dimension is reduced as well. For a simple example, take 300 observations from the model $Y = 6x_1x_2x_3 + \epsilon$, where all inputs as well as the noise are independently and identically distributed as N(0, 1). Applying PPR, we obtain the estimation plots below:
[Figure: six panels — fitted values against xalpha1, residuals against xalpha2 and xalpha3 (top row), and the estimated ridge functions for terms 1–3 (bottom row).]
Figure 8: PPR plots with three ridge terms
3 Data Description and Empirical Application
3.1 Data Description
The science of learning from data is employed in many business and scientific communities. In a typical scenario, we would like to predict a quantitative or categorical outcome based on some set of features. Usually a training set of data is given, and we are required to build an appropriate model to predict the outcome for new observations. We will do so by applying the above models to simple examples of real-life data. In this thesis, I use three datasets of different types, and we would like to assess the efficiency of the models in terms of accuracy and computational cost.
3.1.1 Colon Cancer Dataset
The first dataset I analyzed is the Colon Cancer Dataset. It was originally used in biomedical research studying the gene expressions of colon tumors. In that research, the expression of 6,500 human genes in 40 tumor and 22 normal colon tissue samples was analyzed; the 2,000 genes with the highest minimal intensity across the 62 samples were then selected to compose the Colon Cancer Dataset. The dataset contains two parts, 'colon.x' and 'colon.y'. 'colon.x' is a 62 × 2000 matrix where each row is a tissue sample and each column is a gene expression. 'colon.y' contains 62 entries of either 0 or 1, where '0' represents a normal colon and '1' a tumor colon.
The study of human genes is one of the most meaningful but complicated topics in modern science. It is estimated that humans have around 100,000 genes, each with DNA that encodes a unique protein specialized for a function or a set of functions. Any one gene contains a sequence of millions of individual nucleotides arranged in a particular order (Mehmed, 2011). Traditional statistical methods are therefore inadequate for exploring this massive amount of information. Statistical scientists as well as genetic scientists are trying to find better strategies to interpret the genes, and data mining is one of the potential solutions. I will use this dataset to illustrate how Lasso and Ridge regression work in a real-life case where the number of predictors is much larger than the number of observations, namely p >> n.
3.1.2 Wine Quality Dataset
This dataset contains two samples for red and white variants of the Portuguese
‘Vinho Verde’ wine separately. There are 1599 instances of red wine and 4898
instances of white wine are involved in the grading. The output variable is wine
quality based on experts grading. For each instance, at least three experts tasted
26
Sun Yiran
and rated on the wine quality that ranges from 0 (very bad) to 10 (very excellent),
and the median of their scores given would be the final quality class. However
note that for white wine, the classes range from 3 to 8; while for red wine, the
classes range from 3 to 9. The input attributes are: fixed acidity, volatile acidity,
citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density,
pH, sulphates and alcohol. All the inputs are real numbers while the output is
categorical.
3.1.3 Bike Sharing Dataset
The third dataset is called Bike Sharing. Currently there are over 500 automatic
bike sharing systems in the world. These systems allow users to rent a bike at
one place and return to another place, making the rental process more convenient
than traditional ones. Since bike-sharing rental process is closely related to the
environmental settings, an analysis on the users’ data records may give us an insight
into their renting habits and thus improve the bike sharing system. The downloaded
folder contains two files containing bike sharing counts on hourly and daily basis
respectively. I will only use the daily-based data for convenience. There are 16
input fields within the dataset. I will abandon 8 fields in my research, for the reason
that some of them provide repetitive information of the others. For example, the field
‘holiday: (1:yes, 0:no)’ and ‘working day (if day is neither weekend nor holiday is 1,
otherwise is 0)’ and ‘weekday’ provide similar information. Therefore the response
variable is count of total bikes on a day and the eight independent variables after
being filtered are as the followings:
Table 2: List of predictors in Bike Sharing Dataset
1  month       1-12: January to December
2  holiday     1: yes, 0: no
3  weekday     0-6: Sunday to Saturday
4  weathersit  1: Clear, Few Clouds, Partly Cloudy
               2: Mist + Cloudy, Mist + Broken Clouds, Mist + Few Clouds
               3: Light Snow, Light Rain + Thunderstorm + Scattered Clouds,
                  Light Rain + Scattered Clouds
               4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
5  temp        Normalized temperature in Celsius, divided by 41
6  atemp       Normalized feeling temperature in Celsius, divided by 50
7  hum         Normalized humidity, divided by 100
8  windspeed   Normalized wind speed, divided by 67
3.2 Examples of application
3.2.1 Feature Scaling
I mentioned in the introduction that the four basic steps of a statistical analysis are data processing, model fitting, model checking, and result interpretation. We will follow these steps to see how the above models are applied in real-life circumstances and what we can infer and benefit from the results. Let us start off with data processing, the procedure that makes the data meaningful and reasonable for later analysis. A general requirement in data processing is feature scaling, or standardization.
Why is scaling necessary? Take Ridge for instance. Recall that the standard LSE is scale equivariant: the fitted term $x_j\hat\beta_j$ remains the same no matter how $x_j$ is scaled, because multiplying $x_j$ by some constant c simply scales the LSE by 1/c. Ridge estimates, however, can change with the scaling of the predictors. For example, a variable representing 'weight' can be measured in a variety of units, such as 'g', 'kg' or 'pounds', and because of the penalty sum in Ridge regression's formula (3), a difference in scale cannot simply be adjusted in the final estimate by some factor. It is therefore necessary to scale the predictors beforehand. We can use the formula
$$\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2}}$$
in computations, or simply call the function 'scale()' in R. In addition, scaling can eliminate the intercept and make the resulting coefficients easier to interpret.
Note that outlier detection is also a necessary part of data processing; we will see this later.
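In code, standardization is a one-liner per column. The thesis uses R's scale(); the sketch below is an equivalent Python version (the population divisor n is my choice here; R's scale() divides by n − 1):

```python
import numpy as np

def standardize(X):
    """Centre each column and divide by its standard deviation, so every
    predictor is unit-free with mean 0 and variance 1."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# 'weight' in kg vs the same weight in g: after scaling they are identical
X = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])
Z = standardize(X)
```

Because the two columns differ only by a unit factor, their standardized versions coincide, which is exactly why the Ridge penalty becomes unit-invariant after scaling.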
3.2.2 Results Interpretation
First let us analyze the Colon Cancer Dataset. Having noticed that the number of predictors is much larger than the number of observations (p >> n), we think of applying Ridge or Lasso regression. Since no separate testing data is provided with this dataset, a subset of the original data needs to be reserved for the model checking later. A common way to do this is to split the dataset randomly into a training set and a testing set. Here I used a splitting rate of 0.8, obtaining floor(62 × 0.8) = 49 observations for the training set and the remaining 13 for validation.
Recall that Lasso penalizes the coefficients by imposing a constraint on them: the $\beta$'s are chosen by minimizing
$$\sum_{i=1}^n \Big\{Y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big\}^2 + \lambda\sum_{j=1}^p |\beta_j|$$
or, equivalently,
$$\sum_{i=1}^n \Big\{Y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big\}^2 \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j| \le t.$$
This means that under a strict constraint, i.e. t → 0, most coefficients are shrunk to zero and only a few survive in the model; while if t → ∞, the constraint has no power and the estimated coefficients are just the LSE. In R programming, I called the function 'glmnet()', which fits a generalized linear model via penalized maximum likelihood.
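The effect of t (equivalently λ) is easiest to see in the orthonormal-design special case, where the Lasso solution is known in closed form: soft-thresholding of the least-squares coefficients. A small Python sketch (illustrative only; the thesis fits the real data with R's glmnet):

```python
import numpy as np

def lasso_orthonormal(z, lam):
    """Lasso solution when X'X = I: soft-threshold each least-squares
    coefficient z_j, i.e. sign(z_j) * max(|z_j| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([3.0, -0.5, 0.2, -2.0, 0.05])   # least-squares estimates
tight = lasso_orthonormal(z, lam=1.0)  # strong penalty: small coefficients vanish
loose = lasso_orthonormal(z, lam=0.0)  # lambda -> 0 (t -> infinity): back to LSE
```

With lam = 1.0 only the two large coefficients survive (shrunk towards zero), matching the variable-selection behaviour described above.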
The left panel below traces out the coefficient paths versus the L1 norm, namely t in the constraint expression above.
[Figure: left panel — coefficient paths plotted against the L1 norm; right panel — 5-fold CV mean-squared error plotted against log(Lambda).]
Figure 9: Coe�cients for the colon cancer example, as the constraint size t is varied.
The graph is interpreted as follows: at t = 0.5 (a vertical grey line), for example, 11 coefficients are non-zero, so the corresponding 11 inputs are regarded as influential on the output in the case of t = 0.5. This roughly shows the outstanding variable-selection property that Lasso is endowed with.
We mentioned in Chapter 2.1.1 that we can select λ based on the AIC or CV value. The basic idea of CV is to split the n samples into two sets: a training set of size (n − m) and a testing set of size m. The CV value is the average of the prediction errors over all possible partitions of size m, so the model with the smallest CV value makes the best prediction. K-fold CV is an improved variant that avoids the many iterations of ordinary CV: rather than leaving one observation out for checking, it splits the data into K approximately equal sets; (K − 1) sets are used for model fitting and the remaining set for validation, and the whole procedure is repeated K times.
By calling the function 'cv.glmnet()' and setting the number of folds to 5, we get the right panel of Figure 9, showing the 5-fold CV path at different λ's. The CV value achieves its minimum at λ = 0.07777501, and the final Lasso model obtained
is
Y =� 0.14676x4 + 0.03666x164 + 0.18835x175 + 0.26834x353 � 0.20746x377
+ 0.13086x576 � 0.02910x611 � 0.01285x654 � 0.04579x788 � 0.39661x792
� 0.02843x823 + 0.06770x1073 � 0.04461x1094 + 0.04027x1221 � 0.08913x1231
+ 0.04195x1256 + 0.11274x1346 + 0.01682x1360 + 0.12097x1400 + 0.00544x1473
+ 0.04121x1549 � 0.04568x1570 + 0.10489x1582 � 0.04668x1668 + 0.14415x1679
+ 0.30034x1772 � 0.11222x1843 � 0.00228x1873 � 0.22062x1924
We can find the corresponding names of the gene features in the 'names.html' file; these features are regarded as significant for colon cancer tumors.
Now comes the third step: verifying whether the fitted model is appropriate. We examine its performance in terms of prediction accuracy. Treating the testing set as new observations, we compute the classification error as the proportion of predictions that are not correctly classified. Having repeated the whole procedure ten times, the classification error turned out to be zero nine times. A more straightforward classification plot is shown below.
[Figure: true classes and classified values plotted against predicted values.]
Figure 10: Classification error plot for Lasso applied in Colon Cancer Dataset
The plot shows that the model fitted by Lasso is overall satisfying. Lasso's ability to deal with the situation where there are a huge number of features but few observations is hence demonstrated.
The second dataset is Wine Quality. We analyze the white wine and red wine data separately but in a similar way. Since the response variable 'quality' is categorical, we consider using the multilevel logistic regression model and the k-Nearest Neighbors classification method. We have learned in elementary statistics courses that for categorical data with K classes, we can model the probability as
$$P(Y = k|X) = \frac{\exp(a_k + \beta_k^T X)}{\exp(a_1 + \beta_1^T X) + \ldots + \exp(a_K + \beta_K^T X)} \tag{26}$$
Once the $a$'s and $\beta$'s are obtained, we predict that the sample falls into class k with probability $p_k$.
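The class probabilities in (26) are just a softmax of one linear score per class; a small Python sketch (illustrative, with made-up coefficients rather than fitted ones):

```python
import numpy as np

def multinomial_probs(x, a, B):
    """P(Y = k | X = x) = exp(a_k + B_k . x) / sum_l exp(a_l + B_l . x)."""
    scores = a + B @ x            # one linear predictor per class
    scores -= scores.max()        # stabilise the exponentials
    w = np.exp(scores)
    return w / w.sum()

a = np.array([0.0, 1.0, -1.0])                        # intercepts a_k
B = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # coefficient rows B_k
p = multinomial_probs(np.array([2.0, 0.5]), a, B)
pred_class = int(p.argmax())      # predict the most probable class
```

The probabilities always sum to one, and the prediction is simply the class with the largest linear score.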
For the white wine data, again taking the splitting rate to be 0.8, the training set comprises 3918 observations and the testing set 980 observations. The multilevel logistic regression model is fitted by calling the function 'glmnet()' with the family set to 'multinomial'. One needs to be careful that the function 'as.factor(response variable)' is necessary, as it changes numerical values to symbolic levels. The resulting classification error, i.e. the proportion of predictions in the testing set that do not equal the true value, turns out to be 0.459187. Note that the response variable in this dataset has 7 classes. When I omitted the first three classes 3, 4, and 5 and reduced the number of classes to 4, the same algorithm gave a classification error of 0.3190184. When I omitted one more class and was left with only 3 classes, the error reduced to 0.1462264, which gives a satisfying prediction. Hence we doubt whether the logistic regression model is efficient for data with a relatively large number of classes. We then applied the alternative data mining method, kNN: simply calling the function kknn(), the classification error obtained is 0.3785714. Similarly, for the red wine data, the classification errors for the multilevel logistic model and kNN are 0.425013 and 0.359375 respectively.
This is summarized in the table below.
Table 3: Classification Error for Wine Quality Dataset
Classification Error White Wine Red Wine
Logistic Model 0.459187 0.425013
kNN 0.3785714 0.359375
To see the performance visually, I also plotted the following figure based on a subset of 50 resulting predictions together with their true values.
[Figure: two panels — true versus classified values for 50 observations, white wine (left) and red wine (right).]
Figure 11: A subset of 50 classification errors plot for white wine (left) and red wine
(right) based on kNN method
Based on all these outputs, it is reasonable to conclude that kNN performs better than the multilevel logistic regression model when the number of classes is large. However, when the number of classes is at most 3, the logistic regression model is indeed efficient enough.
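For completeness, the kNN classification rule itself fits in a few lines. The thesis uses R's kknn(); the sketch below is an illustrative Python version on a toy two-class problem:

```python
import numpy as np

def knn_predict(Xtr, ytr, x0, k=5):
    """Classify x0 by majority vote among its k nearest training points
    (Euclidean distance)."""
    d = np.sqrt(((Xtr - x0)**2).sum(axis=1))
    nearest = ytr[np.argsort(d)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return int(labels[counts.argmax()])

# Two well-separated Gaussian classes
rng = np.random.default_rng(4)
Xtr = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                 rng.normal(5.0, 1.0, size=(50, 2))])
ytr = np.array([0] * 50 + [1] * 50)
pred = knn_predict(Xtr, ytr, np.array([5.0, 5.0]), k=5)
```

Note that no model is fitted at all; all the work happens at prediction time, which is why kNN is indifferent to the number of classes.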
Finally we come to the Bike-Sharing Dataset, which has 8 input variables and 730 observations. Again, the first step is to process the data. During data processing, I detected an outlier in row 668: the sample mean of the bike count 'cnt' is 4548, while the count in row 668 is only 22. Hence I removed it
before proceeding to model fitting. Also note that I took the logarithm of the response variable, as its measurement scale is too large compared with the input variables.
> xy[666:670,]
       mnth holiday weekday weathersit     temp    atemp      hum windspeed  cnt
[666,]   10       0       6          2 0.530000 0.515133 0.720000  0.235692 7852
[667,]   10       0       0          2 0.477500 0.467771 0.694583  0.398008 4459
[668,]   10       0       1          3 0.440000 0.439400 0.880000  0.358200   22
[669,]   10       0       2          2 0.318182 0.309909 0.825455  0.213009 1096
[670,]   10       0       3          2 0.357500 0.361100 0.666667  0.166667 5566
I first fitted a generalized linear model. The result showed an AIC of 570.6, which is not satisfying, so we may want to take non-linear fits into consideration. Recall that polynomial splines approximate univariate functions in a piecewise manner; however, we have mentioned that when the number of predictors is larger than 2, estimation efficiency suffers from the 'Curse of Dimensionality'. Semi-parametric models such as the Additive Partially Linear Model (APLM) are an alternative approach that accounts for this. The Bike-Sharing dataset has 8 inputs, so one might think of trying an APLM.
Having noticed that the first four inputs are categorical variables, we set them as the linear part and start off with the model Y = x1 + x2 + x3 + x4 + g(x5) + g(x6) + g(x7) + g(x8). Then we reduce the number of nonlinear components to 3, 2, and 1, so that all possible models are taken into consideration. Indeed there are 15 candidate models to be analyzed:
(1)Y = x1 + x2 + x3 + x4 + g(x5) + g(x6) + g(x7) + g(x8)
(2)Y = x1 + x2 + x3 + x4 + g(x5) + g(x6) + g(x7) + x8
(3)Y = x1 + x2 + x3 + x4 + g(x5) + g(x6) + x7 + g(x8)
(4)Y = x1 + x2 + x3 + x4 + g(x5) + x6 + g(x7) + g(x8)
(5)Y = x1 + x2 + x3 + x4 + x5 + g(x6) + g(x7) + g(x8)
(6)Y = x1 + x2 + x3 + x4 + g(x5) + g(x6) + x7 + x8
(7)Y = x1 + x2 + x3 + x4 + g(x5) + x6 + g(x7) + x8
(8)Y = x1 + x2 + x3 + x4 + g(x5) + x6 + x7 + g(x8)
(9)Y = x1 + x2 + x3 + x4 + x5 + g(x6) + g(x7) + x8
(10)Y = x1 + x2 + x3 + x4 + x5 + g(x6) + x7 + g(x8)
(11)Y = x1 + x2 + x3 + x4 + x5 + x6 + g(x7) + g(x8)
(12)Y = x1 + x2 + x3 + x4 + g(x5) + x6 + x7 + x8
(13)Y = x1 + x2 + x3 + x4 + x5 + g(x6) + x7 + x8
(14)Y = x1 + x2 + x3 + x4 + x5 + x6 + g(x7) + x8
(15)Y = x1 + x2 + x3 + x4 + x5 + x6 + x7 + g(x8)
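Rather than typing out the fifteen formulas by hand, they can be generated mechanically: choose any non-empty subset of {x5, ..., x8} to receive a smooth term s(·). A Python sketch (the ordering of the generated formulas need not match the numbering above):

```python
from itertools import combinations

linear = ["x1", "x2", "x3", "x4"]
candidates = ["x5", "x6", "x7", "x8"]   # inputs that may enter as smooth terms

formulas = []
for r in range(len(candidates), 0, -1):          # 4, 3, 2, then 1 smooth terms
    for smooth in combinations(candidates, r):
        terms = linear + [f"s({v})" if v in smooth else v for v in candidates]
        formulas.append("y ~ " + " + ".join(terms))

n_models = len(formulas)   # C(4,4) + C(4,3) + C(4,2) + C(4,1) = 15
```

The first generated formula is model (1) with all four smooth terms; each of the 15 strings can be passed to a GAM fitter in turn.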
Applying CV selection, the resulting CV values for these 15 models are, respectively, 78.77300, 78.37520, 80.17791, 78.77305, 85.92006, 85.74561, 78.37520, 85.10425, 79.72240, 87.97303, 109.61046, 112.96229, 87.83789, 109.61061 and 84.92717. We are thus left with models (1), (2), (4), (7) and (9). Comparing their AICs, we find that model (9) has a relatively larger AIC than the others, so it is excluded. Note, however, that all these AICs are smaller than that of the linear model, which means a non-linear fit is necessary. To determine the best among the remaining models (1), (2), (4) and (7), let us plot the spline terms of model (1) first.
[Figure: four smooth-term panels — s(x5, 5.43), s(x6, 1), s(x7, 6.8) and s(x8, 1.31).]
Figure 12: Spline terms of model (1): Y = x1+x2+x3+x4+g(x5)+g(x6)+g(x7)+
g(x8)
It is interesting to note that the plots of the 6th and 8th input terms are straight lines, which means they are indeed linearly related to the output. Therefore we choose model (7): Y = x1 + x2 + x3 + x4 + g(x5) + x6 + g(x7) + x8 as our final model. Fitting model (7) to the data, the spline plots for terms x5 and x7 are shown below. x5, which represents temperature, has a non-linear but upward-trending relation with the response variable; note the decrease at the right tail, meaning that too high a temperature discourages people from biking. Humidity has a first
increasing and then decreasing trend. We can interpret this as moderately humid weather encouraging people to go biking, but if the humidity is very heavy, as on rainy days, people would rather not go out.
[Figure: two smooth-term panels — s(x5, 5.59) and s(x7, 6.79).]
Figure 13: Spline terms of model (7): s(x5) for temperature and s(x7) for humidity
The R output of the summary of model (7) is given below. In addition to the parameters of the spline terms, it shows the coefficients of the six linear terms.
> summary(out7)
Formula:
y ~ x1 + x2 + x3 + x4 + s(x5) + x6 + s(x7) + x8

Parametric coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  8.62182    0.31833  27.084  < 2e-16 ***
x1           0.06831    0.01322   5.168 3.07e-07 ***
x2          -0.03319    0.01191  -2.786  0.00548 **
x3           0.01979    0.01197   1.653  0.09870 .
x4          -0.03796    0.01653  -2.297  0.02194 *
x6          -0.04864    0.10357  -0.470  0.63878
x8          -0.09960    0.01324  -7.523 1.63e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
        edf Ref.df     F p-value
s(x5) 5.585  6.785 37.97  <2e-16 ***
s(x7) 6.789  7.913 16.90  <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.668   Deviance explained = 67.6%
GCV = 0.1037   Scale est. = 0.10095   n = 730
Kernel regression is a potential alternative to the spline method. If we would like to learn more about the relation of humidity and temperature to the number of bikes rented, a contour plot based on kernel regression is a good choice.
[Figure: perspective plot of log count over Temperature and Humidity (left) and the corresponding contour plot (right).]
Figure 14: Contour plot of humidity and temperature
There is a maximum at around humidity = 0.4 and temperature = 0.7, meaning that if only humidity and temperature are taken into consideration, the number of bikes rented under this weather condition is the largest.
Similar to spline smoothing, kernel estimation also has problems with high-dimensional data. In Chapter 2.2, we introduced two approaches to circumvent this: Projection Pursuit Regression and the Single-Index Model. In this section, we will see how they work on the Bike-Sharing Dataset.
Recall that the single-index model reduces the dimension to one by imposing a link function $\Phi(\cdot)$ on the linear model: $Y = \Phi(\beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p) + \epsilon = \Phi(\alpha_0^T X) + \epsilon$. The estimate of $\alpha_0$ is obtained by least squares,
$$\hat\alpha_0 = \arg\min_{\alpha}\sum_{i=1}^n \{Y_i - \Phi_\alpha(\alpha^T X_i)\}^2,$$
which can be done by calling the function 'ppr()'. Part of the output is given below:
Projection direction vectors:
      mnth     holiday     weekday  weathersit        temp       atemp         hum   windspeed
0.17025222 -0.10651186  0.04121438 -0.22823288  0.67358137  0.55505242 -0.31724812 -0.20842051
So the fitted model is:
$$\text{cnt} = g(0.17025222\,\text{mnth} - 0.10651186\,\text{holiday} + 0.04121438\,\text{weekday} - 0.22823288\,\text{weathersit} + 0.67358137\,\text{temp} + 0.55505242\,\text{atemp} - 0.31724812\,\text{hum} - 0.20842051\,\text{windspeed})$$
Then we use a general regression fit to explore the relationship between $\alpha_0^T X$ and the response variable 'cnt', which is just the link function $\Phi(\cdot)$:
[Figure: fitted values plotted against xalpha, showing the estimated link function.]
Figure 15: The estimated link function �(.) of the Single-Index Model for Bike-
Sharing Data.
Based on the figures and output, we roughly see that almost all factors affect the number of bikes rented, but among them temperature may be the most important. Since the trend of the link function is roughly upward, positive (negative) coefficients of X remain positive (negative) after the link function is imposed. We may therefore conclude that humidity and windspeed are adverse variables, negatively related to the response, while a relatively higher temperature contributes to a higher number of bikes rented. This conclusion from the data matches common sense: people tend to go biking on bright, sunny days but avoid rainy and cold ones. Note that since weathersit is a categorical variable with classes from 1 (good weather) to 4 (extreme weather), its negative coefficient is also consistent with these results.
Since the single-index model uses only a univariate link function, we are not sure whether it is sufficient for fitting, so we may also try projection pursuit regression. Again for the bike-sharing data, we consider a model with two ridge terms:
$$Y = g_1(\alpha_1^T X) + g_2(\alpha_2^T X) + \epsilon$$
The first component $Y = g_1(\alpha_1^T X) + \epsilon$ is exactly the Single-Index Model. Suppose its estimate is $\hat g_1(\hat\alpha_1^T X_i)$. The residuals from the first fitted term are $r_{1,i} = Y_i - \hat g_1(\hat\alpha_1^T X_i)$, and we fit the second component as $r_{1,i} = g_2(\alpha_2^T X_i) + \eta_i$. PPR plots of the two ridge terms are shown below.
[Figure: four panels — fitted values against xalpha1, residuals against xalpha2 (top row), and the smooth terms s(xalpha1, 4.48) and s(xalpha2, 5.86) (bottom row).]
Figure 16: PPR Plot of two ridge terms
We see in the right panel that the residual plot against $\alpha_2$ shows no obvious trend and is roughly horizontal. Therefore we conclude that one term is sufficient, and we can simply use the Single-Index Model in this case.
The table below compares the component estimates from PPR and SIM. We see that the first term of PPR is essentially the same as the SIM estimate. Both reflect the influence of temperature on people's willingness to bike: the higher the temperature, the more likely people are to rent a bike.
Table 4: Comparison of Components Estimates between PPR and SIM
             PPR (term 1)  PPR (term 2)       SIM
mnth              0.17025      -0.05170   0.17025
holiday          -0.10651       0.04768  -0.10651
weekday           0.04121       0.09028   0.04124
weathersit       -0.22823      -0.21325  -0.22823
temp              0.67358      -0.82288   0.67358
atemp             0.55505       0.48210   0.55505
hum              -0.31725      -0.02250  -0.31725
windspeed        -0.20845      -0.17710  -0.20842
3.3 Model Comparison
So far we have discussed three main classes of models, as well as the two additional models that alleviate the 'Curse of Dimensionality', and I have applied them to three real-life cases. It is time to compare these models: when to use them, and what their advantages and disadvantages are relative to each other.
First come the parametric models, Ridge and Lasso. Both work well on a dataset with a large number of predictors. In the Colon Cancer Dataset there were originally 2000 predictors, of which 29 were finally selected as significant features by Lasso regression; the classification error of Lasso was zero in 9 out of 10 runs. This supports the claim in Chapter 2.1.1 that Lasso performs well when only a relatively small number of inputs have non-zero coefficients.
Next comes the semi-parametric model, splines. For the Bike Sharing Dataset, with more than 2 predictors, we applied spline regression, and its mean squared error is smaller than that of the linear regression model. Spline regression takes non-linear terms into consideration, leading to a better model fit and more accurate predictions.
Finally come the non-parametric models, kNN and kernel regression. Although the commonly used logistic regression model performs well for a dataset with a categorical response variable, its efficiency decreases dramatically when the number of classes is greater than 3. In contrast, the model-free kNN approach is not subject to that.
4 Conclusion
In this thesis, a collection of Data Mining models are discussed and compared.
Firstly, the parametric models Ridge and Lasso are used as alternatives to linear
models when the number of predictors is much greater than the number of obser-
vations (p >> n). Then in the case of non-linearity, semi-parametric models such
as Cubic Splines and Additive Partially Linear Models (APLM) can be applied to
fit curves to the data. Theoretically the semi-parametric models are more flexible
than parametric models in approximation, and meanwhile more e�cient than non-
parametric models in computational implementation. Finally for the non-parametric
models, I introduced k-Nearest-Neighbors (kNN) approach and Kernel Regression.
These two are instance-based learners and free of model construction, thus are very
flexible for model fitting.
Examples of applications were carried out in the empirical study. I applied all the models introduced to different types of datasets and illustrated the comparison of models in detail. From the results, we can draw a couple of
conclusions:
(1) Lasso performs well in extremely high-dimensional data, both in terms of com-
putation time and accuracy. This is particularly true when the high-dimensional
data has a relatively small number of significant predictors. Lasso is also endowed
with a unique ability of variable selection and is easy to implement.
(2) In the case of the Wine Quality dataset, which has a categorical response with seven classes, kNN outperforms multilevel logistic regression in classification error. The prediction accuracy of multilevel logistic regression improved once I reduced the number of classes. Therefore I concluded that kNN is the better choice when the number of classes is greater than 3.
(3) APLM, PPR and SIM are employed when the linearity of the model does not
hold. Among them, APLM is the most di�cult to implement as we need to specify
all possible combinations of the linear and nonlinear components. PPR includes
more terms than SIM and thus gives more accurate estimations. However, since
each term enters the model in a complicated way, it is not good for producing an
understandable model for the data.
Given all these results, we have learned more about Data Mining, which is a signif-
icant element of Statistical Learning, and got a rough sense of how its techniques
are applied to real-life cases at an elementary level.
References
Bichler, M., & Kiss, C. (2004). A comparison of logistic regression, k-nearest
neighbor, and decision tree induction for campaign management. AMCIS 2004
Proceedings , 230.
Farrar, D. E., & Glauber, R. R. (1967). Multicollinearity in regression analysis: the
problem revisited. The Review of Economic and Statistics , 92–107.
Hastie, T., Tibshirani, R., Friedman, J., & Franklin, J. (2005). The elements of
statistical learning: data mining, inference and prediction. The Mathematical
Intelligencer , 27 (2), 83–85.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to
statistical learning (Vol. 112). Springer.
McDonald, G. C. (2009). Ridge regression. Wiley Interdisciplinary Reviews: Com-
putational Statistics , 1 (1), 93–100.
Sheather, S. J., & Jones, M. C. (1991). A reliable data-based bandwidth selection
method for kernel density estimation. Journal of the Royal Statistical Society.
Series B (Methodological), 683–690.
Stone, C. J. (1985). Additive regression and other nonparametric models. The Annals of Statistics, 689–705.
Wand, M. P. (2000). A comparison of regression spline smoothing procedures.
Computational Statistics , 15 (4), 443–462.
Weinberger, K. Q., Blitzer, J., & Saul, L. K. (2005). Distance metric learning for
large margin nearest neighbor classification. In Advances in neural information
processing systems (pp. 1473–1480).
Wood, S. (2006). Generalized additive models: An introduction with R. CRC Press.
Appendix
In spite of the output of R programs within the main body of the thesis, the rest
are displayed as the following:
• Lasso Regression for Colon Cancer Dataset
############ Lasso Model Fitting ################
> BestLambda
[1] 0.0588373
> reg
Call: glmnet(x = x, y = y, lambda = BestLambda)

     Df   %Dev  Lambda
[1,] 34 0.9299 0.05884

############ Model Validation ###################
> errorLasso = mean((ytest - ypredict)^2)
> errorLasso
[1] 0.04403709

> ClassficationError = mean(ytest != yclassified)
> ClassficationError
[1] 0
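Output of the shape above could be produced along the following lines. This is a minimal sketch only, using simulated placeholder data rather than the colon cancer set, and it assumes `BestLambda` was chosen by `cv.glmnet`'s cross-validation:

```r
# Minimal lasso sketch (simulated placeholder data, not the colon cancer set):
# tune lambda by cross-validation with glmnet, then validate on held-out data.
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 20), 100, 20)    # placeholder predictor matrix
y <- rnorm(100)                          # placeholder response

cv <- cv.glmnet(x, y)                    # 10-fold CV over a grid of lambdas
BestLambda <- cv$lambda.min
reg <- glmnet(x, y, lambda = BestLambda) # refit at the selected lambda

xtest <- matrix(rnorm(50 * 20), 50, 20)  # placeholder held-out data
ytest <- rnorm(50)
ypredict <- predict(reg, newx = xtest)
errorLasso <- mean((ytest - ypredict)^2) # squared-error validation loss
```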
• kNN and Multivariate Logistic Regression Model for Wine Quality Dataset
############### kNN Model Validation #############
### White Wine
> errorknn = mean(ytest != predknn)
> errorknn
[1] 0.3857143
### Red Wine
> errorknn = mean(ytest != predknn)
> errorknn
[1] 0.3857143
############ Multilevel Logistic Regression Model Validation ########
### White Wine
> classificationError
[1] 0.4704082
### Red Wine
> classificationError
[1] 0.4693878
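A minimal sketch of how such a kNN validation error can be computed, assuming the `knn()` function from the `class` package; the built-in `iris` data and `k = 5` are stand-ins for illustration, not the wine datasets or the thesis's tuning choice:

```r
# Minimal kNN validation sketch on stand-in data (iris, not the wine sets)
library(class)

set.seed(1)
idx   <- sample(nrow(iris), 100)           # random train/test split
train <- iris[idx, ]
test  <- iris[-idx, ]

predknn  <- knn(train[, 1:4], test[, 1:4], cl = train$Species, k = 5)
errorknn <- mean(test$Species != predknn)  # misclassification rate
errorknn
```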
• Semi-parametric Models for Bike Sharing Dataset
################### Linear Model ##########################
> fit
Call:
lm(formula = ytrain ~ xtrain$mnth + xtrain$holiday + xtrain$weekday +
    xtrain$weathersit + xtrain$temp + xtrain$atemp + xtrain$hum +
    xtrain$windspeed)

Coefficients:
      (Intercept)        xtrain$mnth     xtrain$holiday     xtrain$weekday
          7.65392            0.11104           -0.03892            0.02836
xtrain$weathersit        xtrain$temp       xtrain$atemp         xtrain$hum
         -0.07552           -0.54235            0.86573           -0.08455
 xtrain$windspeed
         -0.03882

> mse
[1] 0.4903079

################# Additive Partially Linear Model ##########
### Model 7 is selected
> out7

Family: gaussian
Link function: identity

Formula:
ytrain ~ x1 + x2 + x3 + x4 + s(x5) + x6 + s(x7) + x8

Estimated degrees of freedom:
5.83 6.87  total = 19.71

GCV score: 0.1027573

################# Projection Pursuit Regression #############
#### The first term
Call:
ppr(x = x, y = y, nterms = 1)

Goodness of fit:
 1 terms
80.83421

#### The second term
> out2
Call:
ppr(x = x, y = residuals1, nterms = 1)

Goodness of fit:
 1 terms
72.13232

################### Single Index Model ###############
Family: gaussian
Link function: identity

Formula:
y ~ s(xalpha)

Estimated degrees of freedom:
4.67  total = 5.67

GCV score: 0.1141133
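The `gam` output above follows the conventions of the `mgcv` package. A minimal sketch of an additive partially linear fit on simulated data, with variable names and the data-generating model chosen purely for illustration:

```r
# Minimal mgcv sketch on simulated data: linear term x1, smooth terms x5, x7
library(mgcv)

set.seed(1)
n  <- 200
x1 <- rnorm(n)
x5 <- runif(n)
x7 <- runif(n)
y  <- x1 + sin(2 * pi * x5) + cos(2 * pi * x7) + rnorm(n, sd = 0.2)

fit <- gam(y ~ x1 + s(x5) + s(x7))  # additive partially linear model
summary(fit)                        # reports the EDF of each smooth term
fit$gcv.ubre                        # GCV score, as printed in the output above
```

For the single-index model, the same machinery applies with a single smooth of the projected covariate, i.e. `gam(y ~ s(xalpha))` where `xalpha` holds the index values.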