A Study of Data Mining Methods and Examples of Applications
By
Sun Yiran
Supervisor:
Zhou Wang
ST4199 Honours Project in Statistics
Department of Statistics and Applied Probability
National University of Singapore
September 2, 2016
Contents
Summary 3
Acknowledgment 4
1 Introduction 5
1.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Data Mining in Statistical Learning . . . . . . . . . . . . . . . . . . . 6
2 Methodology 6
2.1 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Linear Regression Model with Penalty or Constraints . . . . . 6
2.1.2 Splines and Semi-Parametric Models . . . . . . . . . . . . . . 10
2.1.3 Kernel Regression Model . . . . . . . . . . . . . . . . . . . . . 15
2.2 Curse of Dimensionality . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.1 Theory of Curse of Dimensionality . . . . . . . . . . . . . . . 21
2.2.2 Single-Index Model . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.3 Projection Pursuit Regression Model . . . . . . . . . . . . . . 24
3 Data Description and Empirical Application 25
3.1 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.1 Colon Cancer Dataset . . . . . . . . . . . . . . . . . . . . . . 26
3.1.2 Wine Quality Dataset . . . . . . . . . . . . . . . . . . . . . . 26
3.1.3 Bike Sharing Dataset . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Examples of application . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.1 Feature Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.2 Results Interpretation . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Model Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4 Conclusion 42
References 45
Appendix 46
Summary
This thesis studies a collection of models proposed in Data Mining techniques
as well as their applications in real-life examples. In this project, I explored theories
of the models in Chapter 2 and conducted simulation studies based on those models
in Chapter 3.
The first two models I introduced are parametric models: Ridge and Lasso, which
outperform the general linear models when there is multicollinearity within the
dataset. In the application section, I applied them to the ‘Colon Cancer’ dataset, in which the number of predictors hugely exceeds the number of observations. Next,
since the assumption of linearity between features and response most often does not hold, I discussed semi-parametric approaches, such as Spline Regression, that can move beyond linearity. I employed it in analyzing the ‘Bike Sharing’ data, so that both linear and non-linear components are taken into account in model fitting. Finally I discussed non-parametric models: k-Nearest-Neighbors
and Kernel Regression, and applied them to the ‘Wine Quality’ dataset. All data
files are downloaded from UCI Machine Learning Repository.
My contributions are the design of the simulation studies and the R programs implemented to assess the efficiency of the models proposed above. The classification error for categorical cases and the mean squared error for regression examples are measured, and a detailed interpretation of the resulting estimates is also given. Finally, in Chapter 4, I draw a conclusion and compare the advantages and disadvantages that the models have over each other.
Acknowledgment
First and foremost, I would like to express my special appreciation to my thesis
adviser Prof. Zhou Wang. Without his valuable guidance, aspiring encouragement
and great support, this paper would have never been accomplished.
I am also thankful to Prof. Xia Yingcun for being both my academic adviser and
internship adviser in the last four years. His insightful suggestions and invaluable constructive criticism have helped me to learn and to make continuous progress.
Besides my advisers, I would also like to show gratitude to all the faculty members.
Their caring for students and dedication to academic work have helped to develop
our potential and prepare us for future challenges. I feel greatly honored to be a student in this faculty.
Last but not least, I want to thank my parents for their support to my university life.
Their love, attention, and spiritual support has been a constant source of strength
for me.
1 Introduction
This is a thesis introducing basic knowledge about Data Mining, which is a key
component of Statistical Learning. The underlying background and motivation is
provided in Chapter 1. In Section 2.1, three of the most generally used models in Data Mining are introduced, and in Section 2.2 two more models are introduced to alleviate the problems that may occur in high-dimensional data. Chapter 3 covers the application of the introduced models to real-life datasets, an interpretation of the results obtained, and a comparison among the models. Finally, a conclusion of the whole thesis is given in Chapter 4. Note that the main reference book is An Introduction to Statistical Learning (James, Witten, Hastie, & Tibshirani, 2013).
1.1 Background and Motivation
The concept of statistical learning was first introduced in the 1960s. For the first three decades, statistical learning was purely theoretical analysis of a given collection of sample data. With a deeper understanding of Statistics as an independent scientific discipline and a higher demand for statistical learning methods applied to real-life problems, scientists have been constructing new algorithms for estimating multi-dimensional functions. Today the amount of data is exploding at such a speed that people face new challenges in managing and processing datasets of huge size. This is the so-called “Big Data” age. Modern statistical learning offers a set of tools for understanding these complex datasets. Having insight into the data
is not only useful in traditional areas like biology and medicine, but also helps in
innovating business disciplines. For example, having realised the commercial opportunities in big data analysis, Google launched its first ‘big data’ analytics center in 2011. It provides multiple services, such as analysis of the purchasing patterns behind millions of visits to e-commerce websites, assessment of how effective a billion-dollar advertising campaign is, and so on. To learn more about that, the very first step is to know about Statistical Learning.
1.2 Data Mining in Statistical Learning
In this thesis, we focus on one important element of Statistical Learning, called Data
Mining (Hastie, Tibshirani, Friedman, & Franklin, 2005). It is a computational
process of analyzing large data sets and extracting valuable information for further
use. There are several common tasks of data mining, including anomaly detection,
association rule learning, clustering, classification, regression and summarization. In
this thesis, we will only discuss classification and regression. Note that one necessary feature of data mining is model validation, which is used to check whether the model we propose is valid. We will say more on that in the main part of the thesis.
2 Methodology
2.1 Models
2.1.1 Linear Regression Model with Penalty or Constraints
The first model considered in this paper is the widely used Linear Model, one of the earliest statistical models, developed in the pre-computer age. It provides an easy-to-interpret description of the relations between inputs and output, and for prediction purposes it sometimes yields better results than a nonlinear model. Hence even in today's computer age, data scientists still like
to use it to perform data analysis. There are commonly two tasks considered: regression and classification. In linear regression models, several problems may arise that call for one's attention. The most commonly seen are non-linearity, correlation or non-constant variance of the error terms, outliers, and multicollinearity (Farrar & Glauber, 1967). Among these we will focus on the multicollinearity phenomenon first and then on non-linearity.
In regression problems, the Linear Regression Model assumes that, under the condition of linearity between the input vector $X^T = (X_1, X_2, \ldots, X_p)$ and the real-valued output $Y$, the model has the form

$$Y = \beta_0 + \sum_{j=1}^{p} X_j \beta_j \qquad (1)$$

where the $\beta_j$'s are unknown coefficients.
One of the most well-known estimates of the coefficients is the Least Squares Estimation (LSE). It optimizes the model fit by minimizing the Residual Sum of Squares (RSS), $\sum_{i=1}^{n}(Y_i - X_i^T\beta)^2$. Given $n$ observations $(X_i, Y_i)$, the LSE of $\beta$ can be written as

$$\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} \qquad (2)$$

where $\mathbf{X}^T\mathbf{X} = \sum_{i=1}^{n} X_i X_i^T$. The main problem with this expression is that, for the LSE to exist, the matrix $\mathbf{X}^T\mathbf{X}$ must be non-singular. In reality it is quite possible that this condition is not met. One special but common case is called ‘multicollinearity’: the situation in which two or more predictors in a model are highly correlated. When multicollinearity is present, at least one eigenvalue of $\mathbf{X}^T\mathbf{X}$ is close to zero, so the matrix is close to singular. This phenomenon occurs frequently in high-dimensional data. In this section we introduce two penalty approaches to remedy the problem: one called Ridge and the other Lasso (McDonald, 2009).
Ridge was first suggested by Hoerl and Kennard in 1970. It is called a penalty approach because it penalizes the values of the coefficients $\beta$: we add a term to the original RSS of the linear model and estimate $\beta$ by minimizing

$$\sum_{i=1}^{n}(Y_i - X_i^T\beta)^2 + \lambda(\beta_1^2 + \cdots + \beta_p^2) \qquad (3)$$

or, in matrix form,

$$\|\mathbf{Y} - \mathbf{X}\beta\|^2 + \lambda\|\beta\|^2 \qquad (4)$$

Here $\lambda \ge 0$ is a shrinkage parameter. Solving the above minimization problem leads to

$$\hat{\beta}^{Ridge} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{Y} \qquad (5)$$

The added diagonal matrix guarantees invertibility. For each $\lambda$ chosen there is a corresponding solution $\hat{\beta}^{Ridge}$, so the $\lambda$'s trace out a path of solutions. We immediately see some properties of $\lambda$: (1) $\lambda$ controls the size of the coefficients; (2) $\lambda$ controls the amount of regularization; (3) as $\lambda \to 0$ we obtain the least squares estimates, and as $\lambda \to \infty$ the penalty grows without bound, so all coefficients approach zero.
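To make the closed form in equation (5) concrete, the sketch below computes the Ridge estimate directly with NumPy on simulated data. This is an illustrative sketch in Python (the thesis itself uses R), and the data, seed, and λ values are my own choices:

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate (X'X + lam*I)^{-1} X'y, as in equation (5)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
# Make columns 0 and 1 nearly collinear to mimic multicollinearity.
X[:, 1] = X[:, 0] + 1e-6 * rng.normal(size=n)
beta_true = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

b_ridge = ridge(X, y, lam=1.0)   # stable despite near-singular X'X
b_big = ridge(X, y, lam=1e6)     # huge penalty shrinks everything to ~0
print(np.round(b_ridge, 3), np.abs(b_big).max())
```

As λ grows, every coefficient is shrunk toward zero, matching property (3) above.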
An alternative approach to Ridge is called the Lasso, a shrinkage method similar to Ridge. The only difference is that the penalty term is the sum of the absolute values of the $\beta_j$'s rather than the sum of their squares. Hence it solves the problem of minimizing

$$\sum_{i=1}^{n}(Y_i - X_i^T\beta)^2 + \lambda\sum_{j=1}^{p}|\beta_j| \qquad (6)$$
We can also understand the penalty ideas above in a ‘constraint’ way. For example, consider the case $p = 2$:

Ridge: minimize $\sum_{i=1}^{n}(Y_i - \beta_1 x_{i1} - \beta_2 x_{i2})^2$ subject to $\beta_1^2 + \beta_2^2 \le t$

Lasso: minimize $\sum_{i=1}^{n}(Y_i - \beta_1 x_{i1} - \beta_2 x_{i2})^2$ subject to $|\beta_1| + |\beta_2| \le t \qquad (7)$

If we draw the contours of the least squares error function (red ellipses) and the constraint regions (solid blue areas), the figure looks like the following:
Figure 1: Blue regions are the constraint regions $|\beta_1| + |\beta_2| \le s$ and $\beta_1^2 + \beta_2^2 \le s$; red ellipses are contours of the RSS, for the Lasso (left) and Ridge (right) respectively
It is clear that if the red ellipses continue to expand, the coordinates of their first intersection with the blue constraint region are simply the solutions for $\beta_1$ and $\beta_2$. In this process of expansion, once the ellipses first reach the blue region, the touch point for the Lasso often lies at a corner where one coordinate, say $\beta_1$, is exactly zero, while for Ridge a zero coordinate is essentially impossible. This is an important advantage that the Lasso holds over Ridge: it allows some coefficients to be shrunk exactly to zero and thus performs variable selection.
We may want to compare Ridge and Lasso. Obviously the Lasso has a major advantage over Ridge in that it can do variable selection, consequently leading to easier-to-interpret results. However, which method performs better in prediction accuracy? In general, the Lasso performs better when a relatively small number of inputs have substantial non-zero coefficients. In contrast, one may expect Ridge to perform better when there is a large number of predictors with coefficients of roughly equal size. Techniques such as Akaike's (1973) Information Criterion (AIC) or Craven's (1979) Cross-Validation (CV) can be employed to determine which method fits the data better. We will say more on that in later chapters, and for convenience we will use the short forms from now on. Again, Lasso and Ridge outperform the LSE in that they produce a trade-off between variance and bias: in settings where the LSE has high variance, they reduce variance at the expense of bias and hence achieve more accurate prediction.
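The contrast between the two penalties can be sketched in the special case of an orthonormal design, where both estimators have well-known closed forms: Ridge scales each LSE coefficient by $1/(1+\lambda)$, while the Lasso soft-thresholds it at $\lambda/2$ and so sets small coefficients exactly to zero. A minimal Python illustration (the coefficient values are invented for demonstration):

```python
import numpy as np

def ridge_orth(b_lse, lam):
    # Ridge with an orthonormal design: uniform shrinkage by 1/(1 + lam);
    # coefficients are never exactly zero.
    return b_lse / (1.0 + lam)

def lasso_orth(b_lse, lam):
    # Lasso with an orthonormal design: soft-thresholding at lam/2, so
    # coefficients below the threshold become exactly zero (selection).
    return np.sign(b_lse) * np.maximum(np.abs(b_lse) - lam / 2.0, 0.0)

b_lse = np.array([3.0, 0.5, -0.2, 0.05])
print(ridge_orth(b_lse, lam=1.0))   # all four coefficients stay non-zero
print(lasso_orth(b_lse, lam=1.0))   # the three small coefficients are zeroed
```

This is exactly the geometric picture of Figure 1: the Lasso's corners produce exact zeros, Ridge's circle does not.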
2.1.2 Splines and Semi-Parametric Models
A linear regression method is relatively easy to implement and provides easy-to-interpret results. However, it suffers significant limitations in predictive power when the linearity assumption is poor. In real-life settings, the relationship between output and inputs, i.e. $f(X) = E[Y|X]$, is unlikely to be linear. Over several decades, considerable efforts have been made to establish algorithms for fitting curves to data. Inspired by the Taylor expansion, which approximates the underlying function as $m(x) \approx m(a) + m'(a)(x-a) + \frac{m''(a)}{2}(x-a)^2 + \cdots$ up to a polynomial of degree $d$, it is reasonable to think of replacing the linear model with a polynomial function to achieve curve fitting:

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \cdots + \beta_d x_i^d + \epsilon_i \qquad (8)$$
where the coefficients can be obtained by LSE, because this is just a linear regression model with predictor variables $x_i, x_i^2, \ldots, x_i^d$. The disadvantage of this ordinary polynomial function is that it imposes a global structure on $X$: when the degree is large, the approximation procedure becomes extremely complicated and inefficient. In other words, a large $d$ ($d > 3$) leads to over-flexibility and strange shapes of the plotted curve, particularly at the boundaries. Therefore, instead of regarding the function as a whole, we approximate it in a piecewise manner. For example, a piecewise cubic polynomial fits different cubic polynomials in different parts of the range of $X$. Mathematically, a standard cubic polynomial has the form $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \epsilon_i$. Approximating it in two parts, the polynomial becomes a piecewise cubic that takes the form
the polynomial becomes a cubic spline that takes the form
y
i
=
8
>
<
>
:
�01 + �11xi
+ �21x2i
+ �31x3i
+ ✏
i
, if xi
< c;
�02 + �12xi
+ �22x2i
+ �32x3i
+ ✏
i
, if xi
� c.
where the two polynomial functions di↵er in coe�cients. The first function is fitted
based on subset of observations xi
< c, and the other on x
i
� c.
The problem with the piecewise polynomial model lies in its inherent discontinuity. For example, I applied a piecewise polynomial regression to a subset of the ‘Wage’ data available in R. Though there are several variables in this dataset, I will only plot ‘wage’ versus ‘age’ for simplicity. Setting ‘age = 50’ as the break point, we obtain the following figure:
Figure 2: Cubic Polynomial Regression for Wage data with break point at age = 50
We immediately see a jump at the break point ‘age = 50’; therefore a piecewise polynomial is not a good choice for continuous regression. We want to impose further constraints so that the function is continuous even at the break points: both the first and second derivatives of the piecewise polynomials should be continuous. These two restrictions fulfill the continuity conditions. A regression spline is a good approach proposed to achieve these constraints (Wand, 2000).
A spline is a strategy that divides the domain of the underlying function into a sequence of sub-intervals and estimates the function on each interval using a polynomial regression function. Mathematically, a $k$th-order spline is a function that is a polynomial of degree $k$ on each of the intervals $(-\infty, t_1], [t_1, t_2], \ldots, [t_J, \infty)$ and is continuous at its knot points $t_1 < t_2 < \cdots < t_J$. The most popular spline is of order 3, called a cubic spline. To parameterize the set of cubic splines, two sorts of bases can be used: the truncated power basis and the B-spline basis. The truncated power basis is the most natural and is easy to interpret, but the advent of the B-spline basis improved computational speed as well as numerical accuracy. We will not cover the latter in this thesis.
A collection of truncated cubic power basis functions is: $1, x, x^2, x^3, (x - t_j)_+^3$, $(j = 1, \ldots, J)$. Employing the above regression ideas, the underlying function $Y = m(x) + \epsilon$ can be approximated as

$$Y = \sum_{k=1}^{J+4} \theta_k B_k(x) \qquad (9)$$

where we write $B_1(x) = 1$, $B_2(x) = x$, $B_3(x) = x^2$, $B_4(x) = x^3$, $B_{4+j}(x) = (x - t_j)_+^3$. It is easy to see that the function has continuous first and second derivatives, and thus remains smooth at the knots. The beauty of this approach is that the expression is in linear form once the bases are determined. Therefore we can use LSE to obtain an estimator for $\theta$; namely, let
$$\mathbf{Y} = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \quad \mathbf{X} = \begin{pmatrix} B_1(X_1) & B_2(X_1) & \cdots & B_{J+4}(X_1) \\ B_1(X_2) & B_2(X_2) & \cdots & B_{J+4}(X_2) \\ \vdots & \vdots & \ddots & \vdots \\ B_1(X_n) & B_2(X_n) & \cdots & B_{J+4}(X_n) \end{pmatrix}, \quad \theta = \begin{pmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_{J+4} \end{pmatrix} \qquad (10)$$
The best approximation of $\theta$ is thus $\hat{\theta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$. Note that the above expression requires us to determine the number of knots $J$, which controls the trade-off between the smoothness of the curve and the bias of the approximation.
To see this straightforwardly, I simulated a set of artificial data from the function $Y = \sin(\pi x) + 0.3\epsilon$, where $x \sim U(0, 1)$ and $\epsilon \sim N(0, 1)$. 100 observations were sampled, and the approximated functions based on cubic splines with numbers of knots $k = 10, 25, 50, 100$ are plotted separately in the figure below. One can see that at $k = 100$ the spline curve tries to pass through more data points but is less smooth. Theoretically, the optimal number of knots can be chosen by finding the smallest AIC or CV value.
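The construction in equations (9) and (10) can be sketched directly: build the design matrix of truncated power basis functions and solve for $\theta$ by least squares. The sketch below (in Python; the thesis uses R) mirrors the simulation described above, though the seed and knot placement are my own choices:

```python
import numpy as np

def truncated_power_basis(x, knots):
    """Design matrix with columns 1, x, x^2, x^3, (x - t_j)_+^3."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - t, 0.0, None) ** 3 for t in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(0, 1, n)
y = np.sin(np.pi * x) + 0.3 * rng.normal(size=n)

knots = np.linspace(0.1, 0.9, 9)           # J = 9 interior knots
B = truncated_power_basis(x, knots)        # n x (J + 4) matrix from (10)
theta, *_ = np.linalg.lstsq(B, y, rcond=None)   # LSE of theta

# Fitted cubic spline evaluated on a grid.
grid = np.linspace(0, 1, 50)
fit = truncated_power_basis(grid, knots) @ theta
print(B.shape, np.round(fit[:3], 3))
```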
Figure 3: Cubic splines with different numbers of knots: 10, 25, 50, 100
A cubic spline achieves continuity at the knots. However, it still has the shortcoming that the estimate may have high variance at the outer range of the predictors. In that case, a natural spline performs better by imposing an additional constraint: the function is required to be linear at the boundaries. Again taking the ‘Wage’ data as an example, calling the standard R functions ‘bs()’ and ‘ns()’ gives a cubic spline and a natural cubic spline respectively.
Figure 4: A cubic spline and a natural cubic spline fitted to a subset of the ‘Wage’ data, with knots at 25, 40, 60.
We see that the confidence interval of the cubic spline becomes wild when $X$ takes on large values; in contrast, the natural cubic spline provides more stable estimates there.
For the multivariate case, where the number of predictors $p > 2$, we cannot easily determine the functional form of the model: estimation efficiency decreases dramatically as $p$ increases. One attempt to overcome this drawback leads to semi-parametric models, which contain both parametric and non-parametric components. In this way semi-parametric models are more flexible than linear models in estimation, yet more efficient than non-parametric models in computational implementation. There is a wide variety of semi-parametric models, among which the most well-known example is the Generalized Additive Model (GAM) (Wood, 2006). Suppose we are interested in $m(x_1, \ldots, x_p) = E(Y \mid x_1, \ldots, x_p)$; the GAM approximates $m(\cdot)$ by modeling each variable separately and summing all their contributions:

$$Y = \beta_0 + g_1(x_1) + \cdots + g_p(x_p) + \epsilon \qquad (11)$$
In most cases we know which predictors enter the response linearly, say $x_1, \ldots, x_q$, and our interest focuses on the unknown smooth functions of the remaining predictors. We can thus write the function as the following Additive Partially Linear Model (APLM), which can be regarded as a special case of the GAM (Stone, 1985):

$$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_q x_q + g_{q+1}(x_{q+1}) + \cdots + g_p(x_p) + \epsilon \qquad (12)$$
A common way to estimate the nonlinear parts of the APLM, the $g_k(x_k)$, is to assume that they have the form of splines. This gives

$$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_q x_q + \sum_{j=2}^{J_{q+1}+4} \theta_{q+1,j} B_{q+1,j}(x_{q+1}) + \cdots + \sum_{j=2}^{J_p+4} \theta_{p,j} B_{p,j}(x_p) + \epsilon$$

Again, this equation is in linear form, so we can estimate the parameters by LSE and make statistical inference on them:

$$(\beta_0, \beta_1, \ldots, \beta_q, \theta_{q+1,2}, \ldots, \theta_{q+1,J_{q+1}+4}, \ldots, \theta_{p,2}, \ldots, \theta_{p,J_p+4})^T = \{\mathbf{X}^T\mathbf{X}\}^{-1}\mathbf{X}^T\mathbf{Y} \qquad (13)$$
The combination of linear and non-linear components in the additive partially linear model makes the approximation procedure more flexible and the results more accurate. It also allows us to examine the effect of each individual predictor variable separately.
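As a sketch of how the APLM reduces to a single least squares fit, the design matrix below combines a linear column for $x_1$ with a truncated-power spline basis for a second predictor; the data-generating model and knots are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)                   # enters the model linearly
x2 = rng.uniform(0, 1, n)                 # enters through a smooth function
y = 1.0 + 2.0 * x1 + np.sin(2 * np.pi * x2) + 0.1 * rng.normal(size=n)

# Design: intercept + linear term + truncated power basis for g(x2).
knots = np.linspace(0.2, 0.8, 4)
spline_cols = [x2, x2**2, x2**3] + [np.clip(x2 - t, 0, None) ** 3 for t in knots]
X = np.column_stack([np.ones(n), x1] + spline_cols)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
beta1_hat = coef[1]                        # estimate of the linear effect of x1
print(round(float(beta1_hat), 2))
```

One LSE solve recovers both the linear coefficient and the spline coefficients at once, which is exactly the convenience of equation (13).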
2.1.3 Kernel Regression Model
We have seen that semi-parametric models are more flexible than linear models in estimation. In this section we describe a non-parametric method that fits a different model at each query point $x_0$ and thus achieves even more flexibility. In this method no parameters are involved, and the estimate at a target point is based entirely on learning from the observations nearest to it. This non-parametric method is called k-Nearest-Neighbors (kNN), and it is the most intuitive technique recognized for statistical discrimination so far (Weinberger, Blitzer, & Saul, 2005).
The name kNN involves a notion of ‘near’, so how do we define ‘near’? Most often it is natural to use the empirical Euclidean distance in the feature space. After all features have been standardized to have mean zero and variance one, the distance takes the form

$$d(X_i, X_j) = \|X_i - X_j\| = \Big\{\sum_{s=1}^{p}(x_{is} - x_{js})^2\Big\}^{1/2} \quad \text{or} \quad \sum_{s=1}^{p}|x_{is} - x_{js}| \qquad (14)$$
Let us consider the univariate case, where $x$ is the only input variable and $(x_1, y_1), \ldots, (x_n, y_n)$ are the $n$ observations. Suppose the target point is $x = x_0$. Based on the Euclidean distance criterion, we collect the $k$ nearest neighbors of $x_0$ to form the set $D_x = \{x_i : x_i \text{ is one of the } k \text{ nearest neighbors of } x_0\}$ and estimate $m(x)$ as

$$\hat{y} = \frac{\sum_{x_i \in D_x} y_i}{\#\{x_i \in D_x\}} \equiv \frac{\sum_{i=1}^{n} I(x_i \in D_x)\,y_i}{\sum_{i=1}^{n} I(x_i \in D_x)} \qquad (15)$$
The only unknown component in this expression is $k$, which is also the size of $D_x$. The choice of $k$ is important in the sense that it governs the trade-off between variance and bias. If $k$ is too large, the estimator is affected by far-away observations that may provide irrelevant information; if it is too small, the estimator fluctuates with high variance. Analogous to the choice of the number of knots $J$ in the regression spline method, we can use CV to select the most appropriate $k$. I will implement that later in the examples of real-life applications.
Note that the above estimator assigns equal weight to all $k$ neighbors. However, one may argue that nearer neighbors are more relevant and should be assigned more importance in the decision. A more accurate, weighted estimate is therefore
$$\hat{y} = \frac{\sum_{i=1}^{k} w_i y_i}{\sum_{i=1}^{k} w_i} \qquad (16)$$
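Equations (15) and (16) can be sketched in a few lines of Python; the inverse-distance weights used here are one common choice for the $w_i$, not necessarily the one used later in the applications:

```python
import numpy as np

def knn_predict(x0, x, y, k, weighted=False):
    """kNN regression estimate at a single target point x0."""
    d = np.abs(x - x0)                 # univariate Euclidean distance
    idx = np.argsort(d)[:k]            # indices of the k nearest neighbors
    if not weighted:
        return y[idx].mean()           # equation (15): equal weights
    w = 1.0 / (d[idx] + 1e-12)         # nearer neighbors weigh more
    return np.sum(w * y[idx]) / np.sum(w)   # equation (16)

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # y = x exactly
print(knn_predict(1.6, x, y, k=2))         # → 1.5 (average of y at x=1, x=2)
```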
Although k-Nearest-Neighbors (kNN) is an essentially model-free approach that is efficient when the estimates are well positioned to capture the distribution of the real data, it has an inherent disadvantage of discreteness: the approximated function plotted by kNN appears discontinuous and unattractive. In addition, limitations such as poor run-time performance when the training set is large, sensitivity to redundant features, and the computational cost of computing all the relative distances may arise in real data problems. Therefore statisticians developed an improved method called Kernel Smoothing (Bichler & Kiss, 2004). Rather than choosing $k$ neighbors to compose the discrete set $D_x$, Kernel Smoothing uses an interval $D_x = [x - h, x + h]$ with some bandwidth $h > 0$ and defines the weight as

$$w_i = \frac{1}{h} K\Big(\frac{x_i - x}{h}\Big) \qquad (17)$$
where $K(\cdot)$ is a kernel function and $(x_i - x)/h$ is the relative distance between the neighbor $x_i$ and the target point $x$ within $h$. For convenience we write the above expression as $K_h(X_i - x)$. The estimator becomes

$$\hat{y} = \frac{\sum_{i=1}^{n} K_h(X_i - x)\,y_i}{\sum_{i=1}^{n} K_h(X_i - x)}$$

which is the so-called Nadaraya-Watson (N-W) estimator.
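A minimal sketch of the N-W estimator with a Gaussian kernel (the kernel choice and simulated data are my own):

```python
import numpy as np

def nw_estimate(x0, x, y, h):
    """Nadaraya-Watson estimate at x0 with a Gaussian kernel, bandwidth h."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)   # kernel weights; the 1/h factor
    return np.sum(w * y) / np.sum(w)         # cancels in the ratio

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(3 * np.pi * x) + 0.2 * rng.normal(size=200)

grid = np.linspace(0.05, 0.95, 10)
fit = np.array([nw_estimate(g, x, y, h=0.03) for g in grid])
print(np.round(fit, 2))
```

Unlike kNN, the weights decay smoothly with distance, so the fitted curve is continuous in $x_0$.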
Kernel Smoothing has many real-life applications, such as pattern recognition, image reconstruction and data visualization. Here is a toy example in pattern recognition: I created a decayed sign from a function, shown in the left panel below, and applied Kernel Smoothing to recover the sign, shown in the right panel.
Figure 5: Simulated decayed sign (left) and recovered sign (right) using Kernel
Method
One might ask: what is a kernel function? A kernel is in essence a symmetric probability density. The construction of a kernel density estimator is similar to that of a histogram. Recall the steps of constructing a histogram: break the whole range of values into a sequence of intervals, and count the number of observations falling into each interval. Let $m$ denote the number of observations that fall into the interval around $x$; for the histogram, the density estimate is $\hat{f}(x) = \frac{m}{n \times \text{interval length}}$. Likewise, for a kernel density estimator,
$$\hat{f}_h(x) = \frac{m}{n \times \text{interval length}} = \frac{1}{2nh}\,\#\{X_i \in [x-h, x+h]\} = \frac{1}{2nh}\sum_{i=1}^{n} I(|x - X_i| \le h) = \frac{1}{nh}\sum_{i=1}^{n} \frac{1}{2} I\Big(\Big|\frac{x - X_i}{h}\Big| \le 1\Big) = \frac{1}{nh}\sum_{i=1}^{n} K\Big(\frac{x - X_i}{h}\Big)$$

where the last step replaces the uniform weight $\frac{1}{2}I(|u| \le 1)$ with a general kernel $K$.
We can easily prove that it is a proper density function by showing that it integrates to 1:

$$\int \hat{f}_h(x)\,dx = \frac{1}{n}\sum_{i=1}^{n}\int \frac{1}{h} K\Big(\frac{X_i - x}{h}\Big)\,dx = \frac{1}{n}\sum_{i=1}^{n}\int K\Big(\frac{X_i - x}{h}\Big)\,d\Big(\frac{X_i - x}{h}\Big) = \frac{1}{n}\sum_{i=1}^{n}\int K(u)\,du = 1$$

using the substitution $u = (X_i - x)/h$, the symmetry of $K$, and the fact that $K$ integrates to 1.
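The identity can also be checked numerically: a Gaussian-kernel density estimate built from the formula above integrates to approximately 1. A sketch, with the sample and grid chosen for illustration:

```python
import numpy as np

def kde(grid, data, h):
    """Kernel density estimate f_h(x) = (1/nh) * sum_i K((x - X_i)/h)."""
    u = (grid[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian kernel
    return K.sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(5)
data = rng.normal(size=500)
grid = np.linspace(-6, 6, 2001)

f = kde(grid, data, h=0.3)
total = f.sum() * (grid[1] - grid[0])   # Riemann sum: should be close to 1
print(round(float(total), 4))
```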
To see the comparison visually, I applied the Histogram and Kernel approaches to samples from two given functions respectively, to estimate their density functions. The upper plots are for $Y = \sin(3\pi x) + 0.2\epsilon$ and the lower for $Y = \sin(2x) + 0.2\epsilon$.
Figure 6: Histogram and kernel plots of the functions $Y = \sin(3\pi x) + 0.2\epsilon$ (upper) and $Y = \sin(2x) + 0.2\epsilon$ (lower) respectively
We see that both methods give a rough sense of the density distribution, but the kernel estimate exhibits smoothness and continuity and is thus better at determining the shape of the density. Some popular kernel functions are given in the following table:

Table 1: Kernel Functions

Kernel        | Explicit Form
Gaussian      | $K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$, $u \in (-\infty, \infty)$
Uniform       | $K(u) = \frac{1}{2}$, $u \in [-1, 1]$
Tri-cube      | $K(u) = \frac{70}{81}(1 - |u|^3)^3$, $u \in [-1, 1]$
Epanechnikov  | $K(u) = \frac{3}{4}(1 - u^2)$, $u \in [-1, 1]$
In practice, there are some points that call for one's attention:
• Similar to k-Nearest-Neighbors, the width of the neighborhood, i.e. the bandwidth $\lambda$, needs to be determined. A natural bias-variance trade-off occurs as we change the bandwidth: a small $\lambda$ leads to smaller bias but larger variance, while a big $\lambda$ leads to over-smoothing and poor prediction. Note that the selection criteria for the regularization parameter of smoothing splines also apply here.
• Boundary issues arise. Fewer observations are present near the boundaries, which means inadequate information, so estimates at the boundaries may be inaccurate.
• When there are ties in the $x_i$, one should average the $y_i$ at tied values and add extra weight to these new observations at $x_i$.
Regarding the bandwidth issue, we should know how to determine $\lambda$. For example, $\lambda$ is the radius of the support region for the Epanechnikov or tri-cube kernel with metric width, while for the most popular Gaussian kernel $\lambda$ is the standard deviation (Sheather & Jones, 1991). Generally it can be determined by techniques such as Cross-Validation (CV).
The following figure shows kernel smoothing fits with two different bandwidths, applied to samples generated from $y = \sin(3\pi x) + 0.2\epsilon$, where the $\epsilon$'s are i.i.d. noise terms.
Figure 7: Kernel smoothing plots at bandwidth 0.01 (left) and 0.03 (right)
The left-hand plot looks less smooth than the right. Indeed, a smaller bandwidth leads to more fluctuation in the estimate and thus higher variance; however, as the curve tries to pass through more observations, the bias is also smaller.
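One standard way to choose the bandwidth is leave-one-out cross-validation: for each candidate $h$, predict each $y_i$ from all the other observations and pick the $h$ minimizing the CV error. A sketch of that selection (the candidate grid of bandwidths is my own choice):

```python
import numpy as np

def nw_loo_cv(x, y, h):
    """Leave-one-out CV error of the Gaussian N-W estimator at bandwidth h."""
    u = (x[:, None] - x[None, :]) / h
    W = np.exp(-0.5 * u**2)
    np.fill_diagonal(W, 0.0)           # leave each point out of its own fit
    pred = (W @ y) / W.sum(axis=1)
    return np.mean((y - pred) ** 2)

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(3 * np.pi * x) + 0.2 * rng.normal(size=200)

hs = np.array([0.005, 0.01, 0.03, 0.1, 0.3])
errors = np.array([nw_loo_cv(x, y, h) for h in hs])
h_best = hs[np.argmin(errors)]
print(h_best, np.round(errors, 3))
```

The CV curve is typically U-shaped in $h$, mirroring the bias-variance trade-off described above.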
Regarding the boundary issue, we can refer to the local regression idea and introduce a local linear kernel smoothing approach. The basic idea is that, again for $Y = m(X) + \epsilon$, if $X_i$ is close to $x$ we can consider the local linear approximation given by a first-order Taylor expansion, $m(X_i) \approx m(x) + m'(x)(X_i - x)$. Thus for $i = 1$ to $n$ there are $n$ such linear relations, and the problem converts to a weighted least squares problem with $\beta = (m(x), m'(x))^T$. However, we will not discuss that in detail.
Finally, let us consider the multivariate case $X = (x_1, \ldots, x_p)^T$ with $n$ observations $(X_1, Y_1), \ldots, (X_n, Y_n)$. The formulation of the N-W estimator remains the same as in the univariate case:

$$\hat{m}(x) = \frac{\sum_{i=1}^{n} K_h(X_i - x)\,Y_i}{\sum_{i=1}^{n} K_h(X_i - x)} \qquad (18)$$

The only difference is that here $K(\cdot)$ is a multivariate kernel function, $K_h(X_i - x) = \frac{1}{h^p} K\big(\frac{x_{i1} - x_1}{h}, \ldots, \frac{x_{ip} - x_p}{h}\big)$, where $h$ can still be chosen using the CV method. Both the univariate and multivariate cases can be simulated in R; I will show that in Chapter 3.
2.2 Curse of Dimensionality
2.2.1 Theory of Curse of Dimensionality
In this chapter we will see what the ‘Curse of Dimensionality’ is, why it happens, and what solutions can be taken to resolve it. First, what is the ‘Curse of Dimensionality’? It exists in almost all the regression models introduced above, particularly the non-parametric ones. For example, in the kNN approach, the k nearest neighbors of a particular data point $x_0$ may be far away when the number of predictors $p$ is large, leading to a poor fit. The situation for kernel smoothing is relatively difficult to explain; we can understand it through the following mathematical computations.
Recall that in the univariate case the N-W estimator is

$$\hat{m}(x) = \frac{\sum_{i=1}^{n} K_h(X_i - x)\,Y_i}{\sum_{i=1}^{n} K_h(X_i - x)}$$

Since $K(\cdot)$ is a symmetric probability density function, the performance of the estimator can be assessed by Taylor expansion:

$$\text{bias}(\hat{m}(x)) = E\,\hat{m}(x) - m(x) \approx \frac{\sum_{i=1}^{n} K_h(X_i - x)\{m(x) + m'(x)(X_i - x) + \frac{1}{2}m''(x)(X_i - x)^2\}}{\sum_{i=1}^{n} K_h(X_i - x)} - m(x) \approx c_2\Big\{\frac{1}{2}m''(x) + f^{-1}(x)\,m'(x)\,f'(x)\Big\}h^2 \qquad (19)$$

with $c_2 = \int u^2 K(u)\,du$, and
$$\text{Var}(\hat{m}(x)) = \text{Var}\bigg(\frac{\sum_{i=1}^{n} K_h(X_i - x)\,\epsilon_i}{\sum_{i=1}^{n} K_h(X_i - x)}\bigg) = \frac{\sum_{i=1}^{n}\{K_h(X_i - x)\}^2}{\{\sum_{i=1}^{n} K_h(X_i - x)\}^2}\,\sigma^2 \approx \frac{n\,E\{K_h(X_1 - x)\}^2}{(n f(x))^2}\,\sigma^2 = \frac{n\,E\{\frac{1}{h}K(\frac{X_1 - x}{h})\}^2}{(n f(x))^2}\,\sigma^2 \qquad (20)$$
If we do the variable substitution $v = (u - x)/h$, i.e. $u = x + hv$,

$$E\Big\{\frac{1}{h}K\Big(\frac{X_1 - x}{h}\Big)\Big\}^2 = \frac{1}{h^2}\,E\,K^2\Big(\frac{X_1 - x}{h}\Big) = \frac{1}{h^2}\int K^2\Big(\frac{u - x}{h}\Big) f(u)\,du = \frac{1}{h}\int K^2(v)\,f(x + hv)\,dv \to \frac{1}{h}\,f(x)\int K^2(v)\,dv$$
We can finally get

$$\text{Var}(\hat{m}(x)) \approx \frac{n\cdot\frac{1}{h}\,f(x)\int K^2(v)\,dv}{(n f(x))^2}\,\sigma^2 = \frac{d_0\,\sigma^2}{n h f(x)}, \quad \text{with } d_0 = \int K^2(v)\,dv \qquad (21)$$
Since the choice of $h$ should minimize

$$E[\hat{m}(x) - m(x)]^2 = \text{bias}^2 + \text{variance},$$

simplifying the equation, the optimal bandwidth is

$$h_{opt} = \bigg\{\frac{d_0\,\sigma^2}{4 f(x)\,c_2^2\,\big[\frac{1}{2}m''(x) + f^{-1}(x)\,m'(x)\,f'(x)\big]^2}\bigg\}^{1/5} n^{-1/5} \;\propto\; n^{-1/5} \qquad (22)$$
Analogously, the multivariate case has p > 1, so the bias and variance turn out to be $\mathrm{bias} \approx C_{K,m}(x)h^2$ and $\mathrm{var}(\hat{m}(x)) \approx \frac{D_K}{nh^p f(x)}$. Proceeding the same way, the optimal bandwidth obtained is proportional to $n^{-1/(p+4)}$. That means an increase in the dimension p leads to a dramatic decrease in the convergence rate, and thus a deterioration in estimation efficiency. This is the so-called 'Curse of Dimensionality'. There is a collection of approaches to resolve the problem. Here I would like to introduce two models that take account of it.
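Before turning to these models, the nearest-neighbour symptom of the curse is easy to check numerically. The sketch below (an illustrative Python snippet of my own, not from the thesis) measures how far the closest of n = 500 uniform points lies from the centre of the unit cube as p grows:

```python
import numpy as np

rng = np.random.default_rng(1)

def median_nn_distance(n, p, reps=50):
    """Median (over repetitions) of the distance from the cube centre to the
    nearest of n points drawn uniformly from [0, 1]^p."""
    mins = []
    for _ in range(reps):
        X = rng.uniform(0.0, 1.0, size=(n, p))
        d = np.sqrt(((X - 0.5)**2).sum(axis=1))  # distances to the centre
        mins.append(d.min())
    return float(np.median(mins))

d1 = median_nn_distance(500, 1)    # tiny: neighbours are genuinely local
d10 = median_nn_distance(500, 10)  # large: the 'nearest' point is far away
```

With the same sample size, the nearest neighbour in ten dimensions is orders of magnitude farther away than in one dimension, so any local smoother is forced to average over distant points.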
2.2.2 Single-Index Model
The first model is the Single-Index Model. Recall that the linear regression model is $E(Y|X) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p$. Brillinger (1983) considered imposing a link function on the linear regression model:
$$Y = \Phi(\beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p) + \epsilon \tag{23}$$
This model was termed by Stoker (1986) the 'Single-Index Model' when $\Phi(\cdot)$ is unknown. After standardizing the index, we can write it as
$$Y = \Phi(\alpha_0^T X) + \epsilon \tag{24}$$
where $\alpha_0 = (\beta_1, \ldots, \beta_p)^T$ is the direction, with $\|\alpha_0\|^2 = \beta_1^2 + \ldots + \beta_p^2 = 1$. Since $\Phi(\cdot)$ is a univariate function, its estimation does not suffer from the dimensionality problem.
2.2.3 Projection Pursuit Regression Model
The second model considered is Projection Pursuit Regression (PPR), which has the form
$$f(X) \approx f_1(\alpha_1^T x) + \ldots + f_m(\alpha_m^T x) \tag{25}$$
This is an additive model with the inputs $\alpha^T x$ as derived features of X. The additive terms are called ridge functions; they depend fully on the direction vector $\alpha$, with $\|\alpha\| = 1$. In other words, $\alpha^T x$ is the projection of X onto $\alpha$, and we only need to determine the $\alpha$ that optimizes the model fit. In practice, we start with one term and add new terms one by one if needed. Since m < p, the dimension is reduced as well. For a simple example, take 300 observations from the model $Y = 6x_1x_2x_3 + \epsilon$, where all inputs as well as the noise are independently and identically distributed as N(0, 1). Applying PPR, we obtain the estimation plots below:
[Figure: six panels — fitted values against xalpha1, residuals against xalpha2 and xalpha3 (top row), and the estimated ridge functions for terms 1–3 (bottom row).]
Figure 8: PPR plots with three ridge terms
3 Data Description and Empirical Application
3.1 Data Description
The science of learning from data is employed in many business and scientific communities. In a typical scenario, we would like to predict a quantitative or categorical outcome based on some set of features. Usually a training set of data is given, and we are required to build an appropriate model to predict the outcome for new observations. We will do so by applying the above models to simple examples of real-life data. In this thesis, I use three datasets of different types, and we would like to assess the efficiency of the models in terms of accuracy and computational cost.
3.1.1 Colon Cancer Dataset
The first dataset I analyzed is the Colon Cancer Dataset. It was originally used in biomedical research studying the gene expressions of colon tumors. In that research, the expression of 6,500 human genes in 40 tumor and 22 normal colon tissue samples was analyzed; the 2,000 genes with the highest minimal intensity across the 62 samples were then selected to compose the Colon Cancer Dataset. The dataset contains two parts, 'colon.x' and 'colon.y'. 'colon.x' is a 62 × 2000 matrix where each row is a tissue sample and each column is a gene expression. 'colon.y' contains 62 entries of either 0 or 1, where '0' represents a normal colon and '1' a tumor colon.
The study of human genes is one of the most meaningful but complicated topics in modern science. It is estimated that humans have around 100,000 genes, each with DNA that encodes a unique protein specialized for a function or a set of functions. Any one gene contains a sequence of millions of individual nucleotides arranged in a particular order (Mehmed, 2011). Traditional statistical methods are therefore inadequate for exploring this massive amount of information. Statistical scientists as well as genetic scientists are trying to find better strategies to interpret the genes, and data mining is one of the potential solutions. I will use this dataset to illustrate how Lasso and Ridge regression work in a real-life case where the number of predictors is much larger than the number of observations, namely p >> n.
3.1.2 Wine Quality Dataset
This dataset contains two samples for red and white variants of the Portuguese
‘Vinho Verde’ wine separately. There are 1599 instances of red wine and 4898
instances of white wine are involved in the grading. The output variable is wine
quality based on experts grading. For each instance, at least three experts tasted
26
Sun Yiran
and rated on the wine quality that ranges from 0 (very bad) to 10 (very excellent),
and the median of their scores given would be the final quality class. However
note that for white wine, the classes range from 3 to 8; while for red wine, the
classes range from 3 to 9. The input attributes are: fixed acidity, volatile acidity,
citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density,
pH, sulphates and alcohol. All the inputs are real numbers while the output is
categorical.
3.1.3 Bike Sharing Dataset
The third dataset is called Bike Sharing. Currently there are over 500 automatic
bike sharing systems in the world. These systems allow users to rent a bike at
one place and return to another place, making the rental process more convenient
than traditional ones. Since bike-sharing rental process is closely related to the
environmental settings, an analysis on the users’ data records may give us an insight
into their renting habits and thus improve the bike sharing system. The downloaded
folder contains two files containing bike sharing counts on hourly and daily basis
respectively. I will only use the daily-based data for convenience. There are 16
input fields within the dataset. I will abandon 8 fields in my research, for the reason
that some of them provide repetitive information of the others. For example, the field
‘holiday: (1:yes, 0:no)’ and ‘working day (if day is neither weekend nor holiday is 1,
otherwise is 0)’ and ‘weekday’ provide similar information. Therefore the response
variable is count of total bikes on a day and the eight independent variables after
being filtered are as the followings:
Table 2: List of predictors in Bike Sharing Dataset
1  month       1-12: January to December
2  holiday     1: yes, 0: no
3  weekday     0-6: Sunday to Saturday
4  weathersit  1: Clear, Few Clouds, Partly Cloudy
               2: Mist + Cloudy, Mist + Broken Clouds, Mist + Few Clouds
               3: Light Snow, Light Rain + Thunderstorm + Scattered Clouds,
                  Light Rain + Scattered Clouds
               4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
5  temp        Normalized temperature in Celsius, divided by 41
6  atemp       Normalized feeling temperature in Celsius, divided by 50
7  hum         Normalized humidity, divided by 100
8  windspeed   Normalized wind speed, divided by 67
3.2 Examples of application
3.2.1 Feature Scaling
I mentioned in the introduction that the four basic steps of a statistical analysis are data processing, model fitting, model checking, and result interpretation. We will follow these steps to see how the above models are applied in real-life circumstances and what we can infer and benefit from the results. Let us start off with data processing, the procedure that makes the data meaningful and reasonable for later analysis. A general requirement in data processing is feature scaling, or standardization.
Why is scaling necessary? Take Ridge for instance. Recall that the standard LSE is scale equivariant: the fitted term $x_j\hat\beta_j$ remains the same no matter how $x_j$ is scaled, because multiplying $x_j$ by some constant c simply scales the LSE by 1/c. Ridge estimates, however, can change with the scaling of the predictors. For example, a variable representing 'weight' can be measured in a variety of units, such as 'g', 'kg' or 'pounds', and because of the penalty sum in Ridge regression's formula (3), a difference in scale cannot simply be adjusted in the final estimate by some factor. It is therefore necessary to scale the predictors beforehand. We can use the formula
$$\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2}}$$
in computations, or simply call the function 'scale()' in R. In addition, scaling can eliminate the intercept and make the resulting coefficients easier to interpret.
Note that outlier detection is also a necessary part of data processing; we will see this later.
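In code, standardization is a one-liner per column. The thesis uses R's scale(); the sketch below is an equivalent Python version (the population divisor n is my choice here; R's scale() divides by n − 1):

```python
import numpy as np

def standardize(X):
    """Centre each column and divide by its standard deviation, so every
    predictor is unit-free with mean 0 and variance 1."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# 'weight' in kg vs the same weight in g: after scaling they are identical
X = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])
Z = standardize(X)
```

Because the two columns differ only by a unit factor, their standardized versions coincide, which is exactly why the Ridge penalty becomes unit-invariant after scaling.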
3.2.2 Results Interpretation
First let us analyze the Colon Cancer Dataset. Having noticed that the number of predictors is much larger than the number of observations (p >> n), we think of applying Ridge or Lasso regression. Since no separate testing data is provided with this dataset, a subset of the original data needs to be reserved for the model checking later. A common way to do this is to split the dataset randomly into a training set and a testing set. Here I used a splitting rate of 0.8, obtaining floor(62 × 0.8) = 49 observations for the training set and the remaining 13 for validation.
Recall that Lasso penalizes the coefficients by imposing a constraint on them: the $\beta$'s are chosen by minimizing
$$\sum_{i=1}^n \Big\{Y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big\}^2 + \lambda\sum_{j=1}^p |\beta_j|$$
or, equivalently,
$$\sum_{i=1}^n \Big\{Y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big\}^2 \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j| \le t.$$
This means that under a strict constraint, i.e. t → 0, most coefficients are shrunk to zero and only a few survive in the model; while if t → ∞, the constraint has no power and the estimated coefficients are just the LSE. In R programming, I called the function 'glmnet()', which fits a generalized linear model via penalized maximum likelihood.
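The effect of t (equivalently λ) is easiest to see in the orthonormal-design special case, where the Lasso solution is known in closed form: soft-thresholding of the least-squares coefficients. A small Python sketch (illustrative only; the thesis fits the real data with R's glmnet):

```python
import numpy as np

def lasso_orthonormal(z, lam):
    """Lasso solution when X'X = I: soft-threshold each least-squares
    coefficient z_j, i.e. sign(z_j) * max(|z_j| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([3.0, -0.5, 0.2, -2.0, 0.05])   # least-squares estimates
tight = lasso_orthonormal(z, lam=1.0)  # strong penalty: small coefficients vanish
loose = lasso_orthonormal(z, lam=0.0)  # lambda -> 0 (t -> infinity): back to LSE
```

With lam = 1.0 only the two large coefficients survive (shrunk towards zero), matching the variable-selection behaviour described above.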
The left panel below traces out the coefficient paths versus the L1 norm, namely t in the constraint expression above.
[Figure: left panel — coefficient paths plotted against the L1 norm; right panel — 5-fold CV mean-squared error plotted against log(Lambda).]
Figure 9: Coe�cients for the colon cancer example, as the constraint size t is varied.
The graph is interpreted as follows: at t = 0.5 (a vertical grey line), for example, 11 coefficients are non-zero, so the corresponding 11 inputs are regarded as influential on the output in the case of t = 0.5. This roughly shows the outstanding variable-selection property that Lasso is endowed with.
We mentioned in Chapter 2.1.1 that we can select λ based on the AIC or CV value. The basic idea of CV is to split the n samples into two sets: a training set of size (n − m) and a testing set of size m. The CV value is the average of the prediction errors over all possible partitions of size m, so the model with the smallest CV value makes the best prediction. K-fold CV is an improved variant that avoids the many iterations of ordinary CV: rather than leaving one observation out for checking, it splits the data into K approximately equal sets; (K − 1) sets are used for model fitting and the remaining set for validation, and the whole procedure is repeated K times.
By calling the function 'cv.glmnet()' and setting the number of folds to 5, we get the right panel of Figure 9, showing the 5-fold CV path at different λ's. The CV value achieves its minimum at λ = 0.07777501, and the final Lasso model obtained
is
Y =� 0.14676x4 + 0.03666x164 + 0.18835x175 + 0.26834x353 � 0.20746x377
+ 0.13086x576 � 0.02910x611 � 0.01285x654 � 0.04579x788 � 0.39661x792
� 0.02843x823 + 0.06770x1073 � 0.04461x1094 + 0.04027x1221 � 0.08913x1231
+ 0.04195x1256 + 0.11274x1346 + 0.01682x1360 + 0.12097x1400 + 0.00544x1473
+ 0.04121x1549 � 0.04568x1570 + 0.10489x1582 � 0.04668x1668 + 0.14415x1679
+ 0.30034x1772 � 0.11222x1843 � 0.00228x1873 � 0.22062x1924
We can find the corresponding names of the gene features in the 'names.html' file; these features are regarded as significant for colon cancer tumors.
Now comes the third step: verifying whether the fitted model is appropriate. We examine its performance in terms of prediction accuracy. Treating the testing set as new observations, we compute the classification error as the proportion of predictions that are not correctly classified. Having repeated the whole procedure ten times, the classification error turned out to be zero nine times. A more straightforward classification plot is shown below.
[Figure: true classes and classified values plotted against predicted values.]
Figure 10: Classification error plot for Lasso applied in Colon Cancer Dataset
The plot shows that the model fitted by Lasso is overall satisfying. Lasso's ability to deal with the situation where there are a huge number of features but few observations is hence demonstrated.
The second dataset is Wine Quality. We analyze the white wine and red wine data separately but in a similar way. Since the response variable 'quality' is categorical, we consider using the multilevel logistic regression model and the k-Nearest Neighbors classification method. We have learned in elementary statistics courses that for categorical data with K classes, we can model the probability as
$$P(Y = k|X) = \frac{\exp(a_k + \beta_k^T X)}{\exp(a_1 + \beta_1^T X) + \ldots + \exp(a_K + \beta_K^T X)} \tag{26}$$
Once the $a$'s and $\beta$'s are obtained, we predict that the sample falls into class k with probability $p_k$.
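The class probabilities in (26) are just a softmax of one linear score per class; a small Python sketch (illustrative, with made-up coefficients rather than fitted ones):

```python
import numpy as np

def multinomial_probs(x, a, B):
    """P(Y = k | X = x) = exp(a_k + B_k . x) / sum_l exp(a_l + B_l . x)."""
    scores = a + B @ x            # one linear predictor per class
    scores -= scores.max()        # stabilise the exponentials
    w = np.exp(scores)
    return w / w.sum()

a = np.array([0.0, 1.0, -1.0])                        # intercepts a_k
B = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # coefficient rows B_k
p = multinomial_probs(np.array([2.0, 0.5]), a, B)
pred_class = int(p.argmax())      # predict the most probable class
```

The probabilities always sum to one, and the prediction is simply the class with the largest linear score.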
For the white wine data, again taking the splitting rate to be 0.8, the training set comprises 3918 observations and the testing set 980 observations. The multilevel logistic regression model is fitted by calling the function 'glmnet()' with the family set to 'multinomial'. One needs to be careful that the function 'as.factor(response variable)' is necessary, as it changes numerical values to symbolic levels. The resulting classification error, i.e. the proportion of predictions in the testing set that do not equal the true value, turns out to be 0.459187. Note that the response variable in this dataset has 7 classes. When I omitted the first three classes 3, 4, and 5 and reduced the number of classes to 4, the same algorithm gave a classification error of 0.3190184. When I omitted one more class and was left with only 3 classes, the error reduced to 0.1462264, which gives a satisfying prediction. Hence we doubt whether the logistic regression model is efficient for data with a relatively large number of classes. We then applied the alternative data mining method, kNN: simply calling the function kknn(), the classification error obtained is 0.3785714. Similarly, for the red wine data, the classification errors for the multilevel logistic model and kNN are 0.425013 and 0.359375 respectively.
This is summarized in the table below.
Table 3: Classification Error for Wine Quality Dataset
Classification Error White Wine Red Wine
Logistic Model 0.459187 0.425013
kNN 0.3785714 0.359375
To see the performance visually, I also plotted the following figure based on a subset of 50 resulting predictions together with their true values.
[Figure: two panels — true versus classified values for 50 observations, white wine (left) and red wine (right).]
Figure 11: A subset of 50 classification errors plot for white wine (left) and red wine
(right) based on kNN method
Based on all these outputs, it is reasonable to conclude that kNN performs better than the multilevel logistic regression model when the number of classes is large. However, when the number of classes is at most 3, the logistic regression model is indeed efficient enough.
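For completeness, the kNN classification rule itself fits in a few lines. The thesis uses R's kknn(); the sketch below is an illustrative Python version on a toy two-class problem:

```python
import numpy as np

def knn_predict(Xtr, ytr, x0, k=5):
    """Classify x0 by majority vote among its k nearest training points
    (Euclidean distance)."""
    d = np.sqrt(((Xtr - x0)**2).sum(axis=1))
    nearest = ytr[np.argsort(d)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return int(labels[counts.argmax()])

# Two well-separated Gaussian classes
rng = np.random.default_rng(4)
Xtr = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                 rng.normal(5.0, 1.0, size=(50, 2))])
ytr = np.array([0] * 50 + [1] * 50)
pred = knn_predict(Xtr, ytr, np.array([5.0, 5.0]), k=5)
```

Note that no model is fitted at all; all the work happens at prediction time, which is why kNN is indifferent to the number of classes.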
Finally we come to the Bike-Sharing Dataset, which has 8 input variables and 730 observations. Again, the first step is to process the data. During data processing, I detected an outlier in row 668: the sample mean of the bike count 'cnt' is 4548, while the count in row 668 is only 22. Hence I removed it
before proceeding to model fitting. Also note that I took the logarithm of the response variable, as its measurement scale is too large compared with the input variables.
> xy[666:670,]
       mnth holiday weekday weathersit     temp    atemp      hum windspeed  cnt
[666,]   10       0       6          2 0.530000 0.515133 0.720000  0.235692 7852
[667,]   10       0       0          2 0.477500 0.467771 0.694583  0.398008 4459
[668,]   10       0       1          3 0.440000 0.439400 0.880000  0.358200   22
[669,]   10       0       2          2 0.318182 0.309909 0.825455  0.213009 1096
[670,]   10       0       3          2 0.357500 0.361100 0.666667  0.166667 5566
I first fitted a generalized linear model. The result showed an AIC of 570.6, which is not satisfying, so we may want to take non-linear fits into consideration. Recall that polynomial splines approximate univariate functions in a piecewise manner; however, we have mentioned that when the number of predictors is larger than 2, estimation efficiency suffers from the 'Curse of Dimensionality'. Semi-parametric models such as the Additive Partially Linear Model (APLM) are an alternative approach that accounts for this. The Bike-Sharing dataset has 8 inputs, so one might think of trying an APLM.
Having noticed that the first four inputs are categorical variables, we set them as the linear part and start off with the model Y = x1 + x2 + x3 + x4 + g(x5) + g(x6) + g(x7) + g(x8). Then we reduce the number of nonlinear components to 3, 2, and 1, so that all possible models are taken into consideration. Indeed there are 15 candidate models to be analyzed:
(1)Y = x1 + x2 + x3 + x4 + g(x5) + g(x6) + g(x7) + g(x8)
(2)Y = x1 + x2 + x3 + x4 + g(x5) + g(x6) + g(x7) + x8
(3)Y = x1 + x2 + x3 + x4 + g(x5) + g(x6) + x7 + g(x8)
(4)Y = x1 + x2 + x3 + x4 + g(x5) + x6 + g(x7) + g(x8)
(5)Y = x1 + x2 + x3 + x4 + x5 + g(x6) + g(x7) + g(x8)
(6)Y = x1 + x2 + x3 + x4 + g(x5) + g(x6) + x7 + x8
(7)Y = x1 + x2 + x3 + x4 + g(x5) + x6 + g(x7) + x8
(8)Y = x1 + x2 + x3 + x4 + g(x5) + x6 + x7 + g(x8)
(9)Y = x1 + x2 + x3 + x4 + x5 + g(x6) + g(x7) + x8
(10)Y = x1 + x2 + x3 + x4 + x5 + g(x6) + x7 + g(x8)
(11)Y = x1 + x2 + x3 + x4 + x5 + x6 + g(x7) + g(x8)
(12)Y = x1 + x2 + x3 + x4 + g(x5) + x6 + x7 + x8
(13)Y = x1 + x2 + x3 + x4 + x5 + g(x6) + x7 + x8
(14)Y = x1 + x2 + x3 + x4 + x5 + x6 + g(x7) + x8
(15)Y = x1 + x2 + x3 + x4 + x5 + x6 + x7 + g(x8)
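Rather than typing out the fifteen formulas by hand, they can be generated mechanically: choose any non-empty subset of {x5, ..., x8} to receive a smooth term s(·). A Python sketch (the ordering of the generated formulas need not match the numbering above):

```python
from itertools import combinations

linear = ["x1", "x2", "x3", "x4"]
candidates = ["x5", "x6", "x7", "x8"]   # inputs that may enter as smooth terms

formulas = []
for r in range(len(candidates), 0, -1):          # 4, 3, 2, then 1 smooth terms
    for smooth in combinations(candidates, r):
        terms = linear + [f"s({v})" if v in smooth else v for v in candidates]
        formulas.append("y ~ " + " + ".join(terms))

n_models = len(formulas)   # C(4,4) + C(4,3) + C(4,2) + C(4,1) = 15
```

The first generated formula is model (1) with all four smooth terms; each of the 15 strings can be passed to a GAM fitter in turn.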
Applying CV selection, the resulting CV values for these 15 models are, respectively, 78.77300, 78.37520, 80.17791, 78.77305, 85.92006, 85.74561, 78.37520, 85.10425, 79.72240, 87.97303, 109.61046, 112.96229, 87.83789, 109.61061 and 84.92717. We are thus left with models (1), (2), (4), (7) and (9). Comparing their AICs, we find that model (9) has a relatively larger AIC than the others, so it is excluded. Note, however, that all these AICs are smaller than that of the linear model, which means a non-linear fit is necessary. To determine the best among the remaining models (1), (2), (4) and (7), let us plot the spline terms of model (1) first.
[Figure: four smooth-term panels — s(x5, 5.43), s(x6, 1), s(x7, 6.8) and s(x8, 1.31).]
Figure 12: Spline terms of model (1): Y = x1+x2+x3+x4+g(x5)+g(x6)+g(x7)+
g(x8)
It is interesting to note that the plots of the 6th and 8th input terms are straight lines, which means they are indeed linearly related to the output. Therefore we choose model (7): Y = x1 + x2 + x3 + x4 + g(x5) + x6 + g(x7) + x8 as our final model. Fitting model (7) to the data, the spline plots for terms x5 and x7 are shown below. x5, which represents temperature, has a non-linear but upward-trending relation with the response variable; note the decrease at the right tail, meaning that too high a temperature discourages people from biking. Humidity has a first
increasing and then decreasing trend. We can interpret this as moderately humid weather encouraging people to go biking, but if the humidity is very heavy, as on rainy days, people would rather not go out.
[Figure: two smooth-term panels — s(x5, 5.59) and s(x7, 6.79).]
Figure 13: Spline terms of model (7): s(x5) for temperature and s(x7) for humidity
The R output of the summary of model (7) is given below. In addition to the parameters of the spline terms, it shows the coefficients of the six linear terms.
> summary(out7)
Formula:
y ~ x1 + x2 + x3 + x4 + s(x5) + x6 + s(x7) + x8

Parametric coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  8.62182    0.31833  27.084  < 2e-16 ***
x1           0.06831    0.01322   5.168 3.07e-07 ***
x2          -0.03319    0.01191  -2.786  0.00548 **
x3           0.01979    0.01197   1.653  0.09870 .
x4          -0.03796    0.01653  -2.297  0.02194 *
x6          -0.04864    0.10357  -0.470  0.63878
x8          -0.09960    0.01324  -7.523 1.63e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
        edf Ref.df     F p-value
s(x5) 5.585  6.785 37.97  <2e-16 ***
s(x7) 6.789  7.913 16.90  <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.668   Deviance explained = 67.6%
GCV = 0.1037   Scale est. = 0.10095   n = 730
Kernel regression is a potential alternative to the spline method. If we would like to learn more about the relation of humidity and temperature to the number of bikes rented, a contour plot based on kernel regression is a good choice.
[Figure: perspective plot of log count over Temperature and Humidity (left) and the corresponding contour plot (right).]
Figure 14: Contour plot of humidity and temperature
There is a maximum at around humidity = 0.4 and temperature = 0.7, meaning that if only humidity and temperature are taken into consideration, the number of bikes rented under this weather condition is the largest.
Similar to spline smoothing, kernel estimation also has problems with high-dimensional data. In Chapter 2.2, we introduced two approaches to circumvent this: Projection Pursuit Regression and the Single-Index Model. In this section, we will see how they work on the Bike-Sharing Dataset.
Recall that the single-index model reduces the dimension to one by imposing a link function $\Phi(\cdot)$ on the linear model: $Y = \Phi(\beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p) + \epsilon = \Phi(\alpha_0^T X) + \epsilon$. The estimate of $\alpha_0$ is obtained by least squares,
$$\hat\alpha_0 = \arg\min_{\alpha}\sum_{i=1}^n \{Y_i - \Phi_\alpha(\alpha^T X_i)\}^2,$$
which can be done by calling the function 'ppr()'. Part of the output is given below:
Projection direction vectors:
      mnth     holiday     weekday  weathersit        temp       atemp         hum   windspeed
0.17025222 -0.10651186  0.04121438 -0.22823288  0.67358137  0.55505242 -0.31724812 -0.20842051
So the fitted model is:
$$\text{cnt} = g(0.17025222\,\text{mnth} - 0.10651186\,\text{holiday} + 0.04121438\,\text{weekday} - 0.22823288\,\text{weathersit} + 0.67358137\,\text{temp} + 0.55505242\,\text{atemp} - 0.31724812\,\text{hum} - 0.20842051\,\text{windspeed})$$
Then we use a general regression fit to explore the relationship between $\alpha_0^T X$ and the response variable 'cnt', which is just the link function $\Phi(\cdot)$:
[Figure: fitted values plotted against xalpha, showing the estimated link function.]
Figure 15: The estimated link function �(.) of the Single-Index Model for Bike-
Sharing Data.
Based on the figures and output, we roughly see that almost all factors affect the number of bikes rented, but among them temperature may be the most important. Since the trend of the link function is roughly upward, positive (negative) coefficients of X remain positive (negative) after the link function is imposed. We may therefore conclude that humidity and windspeed are adverse variables, negatively related to the response, while a relatively higher temperature contributes to a higher number of bikes rented. This conclusion from the data matches common sense: people tend to go biking on bright, sunny days but avoid rainy and cold ones. Note that since weathersit is a categorical variable with classes from 1 (good weather) to 4 (extreme weather), its negative coefficient is also consistent with these results.
Since the single-index model uses only a univariate link function, we are not sure whether it is sufficient for fitting, so we may also try projection pursuit regression. Again for the bike-sharing data, we consider a model with two ridge terms:
$$Y = g_1(\alpha_1^T X) + g_2(\alpha_2^T X) + \epsilon$$
The first component $Y = g_1(\alpha_1^T X) + \epsilon$ is exactly the Single-Index Model. Suppose its estimate is $\hat g_1(\hat\alpha_1^T X_i)$. The residuals from the first fitted term are $r_{1,i} = Y_i - \hat g_1(\hat\alpha_1^T X_i)$, and we fit the second component as $r_{1,i} = g_2(\alpha_2^T X_i) + \eta_i$. PPR plots of the two ridge terms are shown below.
[Figure: four panels — fitted values against xalpha1, residuals against xalpha2 (top row), and the smooth terms s(xalpha1, 4.48) and s(xalpha2, 5.86) (bottom row).]
Figure 16: PPR Plot of two ridge terms
We see in the right panel that the residual plot against $\alpha_2$ shows no obvious trend and is roughly horizontal. Therefore we conclude that one term is sufficient, and we can simply use the Single-Index Model in this case.
The table below compares the component estimates from PPR and SIM. We see that the first term of PPR is essentially the same as the SIM estimate. Both reflect the influence of temperature on people's willingness to bike: the higher the temperature, the more likely people are to rent a bike.
Table 4: Comparison of Components Estimates between PPR and SIM
             PPR (term 1)  PPR (term 2)       SIM
mnth              0.17025      -0.05170   0.17025
holiday          -0.10651       0.04768  -0.10651
weekday           0.04121       0.09028   0.04124
weathersit       -0.22823      -0.21325  -0.22823
temp              0.67358      -0.82288   0.67358
atemp             0.55505       0.48210   0.55505
hum              -0.31725      -0.02250  -0.31725
windspeed        -0.20845      -0.17710  -0.20842
3.3 Model Comparison
So far we have discussed three main classes of models, as well as the two additional models that alleviate the 'Curse of Dimensionality', and I have applied them to three real-life cases. It is time to compare these models: when to use them, and what their advantages and disadvantages are relative to each other.
First come the parametric models, Ridge and Lasso. Both work well on a dataset with a large number of predictors. In the Colon Cancer Dataset there were originally 2000 predictors, of which 29 were finally selected as significant features by Lasso regression; the classification error of Lasso was zero in 9 out of 10 runs. This supports the claim in Chapter 2.1.1 that Lasso performs well when only a relatively small number of inputs have non-zero coefficients.
Next comes the semi-parametric model, splines. For the Bike Sharing Dataset, with more than 2 predictors, we applied spline regression, and its mean squared error is smaller than that of the linear regression model. Spline regression takes non-linear terms into consideration, leading to a better model fit and more accurate predictions.
Finally come the non-parametric models, kNN and kernel regression. Although the commonly used logistic regression model performs well for a dataset with a categorical response variable, its efficiency decreases dramatically when the number of classes is greater than 3. In contrast, the model-free kNN approach is not subject to that.
4 Conclusion
In this thesis, a collection of Data Mining models are discussed and compared.
Firstly, the parametric models Ridge and Lasso are used as alternatives to linear
models when the number of predictors is much greater than the number of obser-
vations (p >> n). Then in the case of non-linearity, semi-parametric models such
as Cubic Splines and Additive Partially Linear Models (APLM) can be applied to
fit curves to the data. Theoretically the semi-parametric models are more flexible
than parametric models in approximation, and meanwhile more e�cient than non-
parametric models in computational implementation. Finally for the non-parametric
models, I introduced k-Nearest-Neighbors (kNN) approach and Kernel Regression.
These two are instance-based learners and free of model construction, thus are very
flexible for model fitting.
Examples of applications were carried out in the empirical study. I applied all the models introduced to different types of datasets and illustrated the comparison of models in detail. From the results, we can draw a couple of
conclusions:
(1) Lasso performs well in extremely high-dimensional data, both in terms of com-
putation time and accuracy. This is particularly true when the high-dimensional
data has a relatively small number of significant predictors. Lasso is also endowed
with a unique ability of variable selection and is easy to implement.
(2) In the case of the Wine Quality dataset, which has a categorical response with seven classes, kNN outperforms multilevel logistic regression in classification error. The prediction accuracy of multilevel logistic regression improved once I reduced the number of classes. Therefore I concluded that kNN is the better choice when the number of classes is greater than 3.
(3) APLM, PPR and SIM are employed when the linearity of the model does not
hold. Among them, APLM is the most di�cult to implement as we need to specify
all possible combinations of the linear and nonlinear components. PPR includes
more terms than SIM and thus gives more accurate estimations. However, since
each term enters the model in a complicated way, it is not good for producing an
understandable model for the data.
Given all these results, we have learned more about Data Mining, which is a signif-
icant element of Statistical Learning, and got a rough sense of how its techniques
are applied to real-life cases at an elementary level.
References
Bichler, M., & Kiss, C. (2004). A comparison of logistic regression, k-nearest
neighbor, and decision tree induction for campaign management. AMCIS 2004
Proceedings , 230.
Farrar, D. E., & Glauber, R. R. (1967). Multicollinearity in regression analysis: the
problem revisited. The Review of Economic and Statistics , 92–107.
Hastie, T., Tibshirani, R., Friedman, J., & Franklin, J. (2005). The elements of
statistical learning: data mining, inference and prediction. The Mathematical
Intelligencer , 27 (2), 83–85.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to
statistical learning (Vol. 112). Springer.
McDonald, G. C. (2009). Ridge regression. Wiley Interdisciplinary Reviews: Com-
putational Statistics , 1 (1), 93–100.
Sheather, S. J., & Jones, M. C. (1991). A reliable data-based bandwidth selection
method for kernel density estimation. Journal of the Royal Statistical Society.
Series B (Methodological), 683–690.
Stone, C. J. (1985). Additive regression and other nonparametric models. The Annals of Statistics, 689–705.
Wand, M. P. (2000). A comparison of regression spline smoothing procedures.
Computational Statistics , 15 (4), 443–462.
Weinberger, K. Q., Blitzer, J., & Saul, L. K. (2005). Distance metric learning for
large margin nearest neighbor classification. In Advances in neural information
processing systems (pp. 1473–1480).
Wood, S. (2006). Generalized additive models: An introduction with R. CRC Press.
Appendix
In spite of the output of R programs within the main body of the thesis, the rest
are displayed as the following:
• Lasso Regression for Colon Cancer Dataset
############ Lasso Model Fitting ################
> BestLambda
[1] 0.0588373
> reg
Call: glmnet(x = x, y = y, lambda = BestLambda)

     Df   %Dev  Lambda
[1,] 34 0.9299 0.05884

############ Model Validation ###################
> errorLasso = mean((ytest - ypredict)^2)
> errorLasso
[1] 0.04403709

> ClassficationError = mean(ytest != yclassified)
> ClassficationError
[1] 0
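Output of the shape above could be produced along the following lines. This is a minimal sketch only, using simulated placeholder data rather than the colon cancer set, and it assumes `BestLambda` was chosen by `cv.glmnet`'s cross-validation:

```r
# Minimal lasso sketch (simulated placeholder data, not the colon cancer set):
# tune lambda by cross-validation with glmnet, then validate on held-out data.
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 20), 100, 20)    # placeholder predictor matrix
y <- rnorm(100)                          # placeholder response

cv <- cv.glmnet(x, y)                    # 10-fold CV over a grid of lambdas
BestLambda <- cv$lambda.min
reg <- glmnet(x, y, lambda = BestLambda) # refit at the selected lambda

xtest <- matrix(rnorm(50 * 20), 50, 20)  # placeholder held-out data
ytest <- rnorm(50)
ypredict <- predict(reg, newx = xtest)
errorLasso <- mean((ytest - ypredict)^2) # squared-error validation loss
```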
• kNN and Multivariate Logistic Regression Model for Wine Quality Dataset
############### kNN Model Validation #############
### White Wine
> errorknn = mean(ytest != predknn)
> errorknn
[1] 0.3857143
### Red Wine
> errorknn = mean(ytest != predknn)
> errorknn
[1] 0.3857143
############ Multilevel Logistic Regression Model Validation ########
### White Wine
> classificationError
[1] 0.4704082
### Red Wine
> classificationError
[1] 0.4693878
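A minimal sketch of how such a kNN validation error can be computed, assuming the `knn()` function from the `class` package; the built-in `iris` data and `k = 5` are stand-ins for illustration, not the wine datasets or the thesis's tuning choice:

```r
# Minimal kNN validation sketch on stand-in data (iris, not the wine sets)
library(class)

set.seed(1)
idx   <- sample(nrow(iris), 100)           # random train/test split
train <- iris[idx, ]
test  <- iris[-idx, ]

predknn  <- knn(train[, 1:4], test[, 1:4], cl = train$Species, k = 5)
errorknn <- mean(test$Species != predknn)  # misclassification rate
errorknn
```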
• Semi-parametric Models for Bike Sharing Dataset
################### Linear Model ##########################
> fit
Call:
lm(formula = ytrain ~ xtrain$mnth + xtrain$holiday + xtrain$weekday +
    xtrain$weathersit + xtrain$temp + xtrain$atemp + xtrain$hum +
    xtrain$windspeed)

Coefficients:
      (Intercept)        xtrain$mnth     xtrain$holiday     xtrain$weekday
          7.65392            0.11104           -0.03892            0.02836
xtrain$weathersit        xtrain$temp       xtrain$atemp         xtrain$hum
         -0.07552           -0.54235            0.86573           -0.08455
 xtrain$windspeed
         -0.03882

> mse
[1] 0.4903079

################# Additive Partially Linear Model ##########
### Model 7 is selected
> out7

Family: gaussian
Link function: identity

Formula:
ytrain ~ x1 + x2 + x3 + x4 + s(x5) + x6 + s(x7) + x8

Estimated degrees of freedom:
5.83 6.87  total = 19.71

GCV score: 0.1027573

################# Projection Pursuit Regression #############
#### The first term
Call:
ppr(x = x, y = y, nterms = 1)

Goodness of fit:
 1 terms
80.83421

#### The second term
> out2
Call:
ppr(x = x, y = residuals1, nterms = 1)

Goodness of fit:
 1 terms
72.13232

################### Single Index Model ###############
Family: gaussian
Link function: identity

Formula:
y ~ s(xalpha)

Estimated degrees of freedom:
4.67  total = 5.67

GCV score: 0.1141133
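The `gam` output above follows the conventions of the `mgcv` package. A minimal sketch of an additive partially linear fit on simulated data, with variable names and the data-generating model chosen purely for illustration:

```r
# Minimal mgcv sketch on simulated data: linear term x1, smooth terms x5, x7
library(mgcv)

set.seed(1)
n  <- 200
x1 <- rnorm(n)
x5 <- runif(n)
x7 <- runif(n)
y  <- x1 + sin(2 * pi * x5) + cos(2 * pi * x7) + rnorm(n, sd = 0.2)

fit <- gam(y ~ x1 + s(x5) + s(x7))  # additive partially linear model
summary(fit)                        # reports the EDF of each smooth term
fit$gcv.ubre                        # GCV score, as printed in the output above
```

For the single-index model, the same machinery applies with a single smooth of the projected covariate, i.e. `gam(y ~ s(xalpha))` where `xalpha` holds the index values.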