8/3/2019 Deon Garrett et al- Comparison of Linear and Nonlinear Methods for EEG Signal Classification
http://slidepdf.com/reader/full/deon-garrett-et-al-comparison-of-linear-and-nonlinear-methods-for-eeg-signal 1/7
Comparison of Linear and Nonlinear Methods for EEG Signal Classification
Deon Garrett, David A. Peterson, Charles W. Anderson, Michael H. Thaut
Abstract— The reliable operation of brain-computer interfaces (BCIs) based on spontaneous electroencephalogram (EEG) signals requires accurate classification of multichannel EEG. The design of EEG representations and classifiers for BCI are open research questions whose difficulty stems from the need to extract complex spatial and temporal patterns from noisy multidimensional time series obtained from EEG measurements. It is possible that the amount of noise in EEG limits the power of nonlinear methods; linear methods may perform just as well as nonlinear methods. This article reports the results of a linear classifier (linear discriminant analysis) and two nonlinear classifiers (neural networks and support vector machines) applied to the classification of spontaneous, six-channel EEG. The nonlinear classifiers produce only slightly better classification results. An approach to feature selection based on genetic algorithms is also presented, with preliminary results.

Index Terms— EEG, electroencephalogram, pattern classification, neural networks, support vector machines, feature selection, genetic algorithms
I. INTRODUCTION
Recently, much research has been performed into alternative methods of communication between humans and computers. The standard keyboard/mouse model of computer use is not only unsuitable for many people with disabilities, but also somewhat clumsy for many tasks regardless of the capabilities of the user. Electroencephalogram (EEG) signals provide one possible means of human-computer interaction which requires very little in terms of physical abilities. By training the computer to recognize and classify EEG signals, users could manipulate the machine by merely thinking about what they want it to do within a limited set of choices.
Currently, most research into EEG classification uses such
machine learning stalwarts as Neural Networks (NNs).
In this article, we examine the application of support vector machines (SVMs) to the problem of EEG classification and compare the results to those obtained using neural networks and linear discriminant analysis. Section II provides an overview of SVM theory and practice, and the problem of multi-class classification is considered in Section III. Section IV discusses the
D. Garrett is a Ph.D. candidate in the Department of Computer Science, Colorado State University, Fort Collins, CO (e-mail: [email protected]).
D. Peterson is a Ph.D. candidate in the Department of Computer Science, Colorado State University, Fort Collins, CO (e-mail: [email protected]).
C. Anderson is with the Department of Computer Science, Colorado State University, Fort Collins, CO (e-mail: [email protected]).
M. Thaut is with the Department of Music, Theatre, and Dance and the Center for Biomedical Research, Colorado State University, Fort Collins, CO (e-mail: [email protected]).
D. Peterson, C. Anderson and M. Thaut are also with the Molecular, Cellular, and Integrative Neuroscience Program at Colorado State University.
acquisition of EEG signals. The results of this study are detailed in Section V. Section VII describes preliminary experiments using genetic algorithms to search for good subsets of features in an EEG classification problem. Section VIII summarizes the findings of this article and their implications.
II. SUPPORT VECTOR MACHINES FOR BINARY CLASSIFICATION
The support vector machine (SVM) is a classification method
rooted in statistical learning theory. The motivation behind
SVMs is to map the input into a high dimensional feature space,
in which the data might be linearly separable. In this regard, SVMs are very similar to other neural network based learning machines. The principal difference between these machines and SVMs is that the latter produce the optimal decision surface in the feature space.
Conventional neural networks can be difficult to build due to
the need to select an appropriate number of hidden units. The
network must contain enough hidden units to be able to approx-
imate the function in question to the desired accuracy. However,
if the network contains too many hidden units, it may simply
memorize the training data, causing very poor generalization.
The ability of the machine to learn features of the training data
is often referred to as learning capacity, and is formalized in a
concept called VC dimension.
Support Vector Machines are constructed by solving a
quadratic programming problem. In solving this problem, SVM
training algorithms simultaneously maximize the performance
of the machine while minimizing a term representing the VC
dimension of the learning machine. This minimization of the
capacity of the machine ensures that the system can not overfit
the training data, for a given set of parameters.
A. Linear Support Vector Machines
In this section, the training of a support vector machine is de-
scribed for the case of a binary classification problem for which
a linear decision surface exists that can perfectly classify the
training data. In later sections, the requirement of linear sepa-
rability will be relaxed.
The assumption of linear separability means that there ex-
ists some hyperplane which perfectly separates the data. This
hyperplane is a decision surface of the form
w · x + b = 0, (1)
where w is an adjustable weight vector, x is an input vector,
and b is a bias term. The assumption of separability means that
there exists some set of values w and b, such that the following
constraints hold for all input vectors, given that the classes are
labeled +1 and −1:
w · xi + b ≥ +1 ∀yi = +1 (2)
w · xi + b ≤ −1 ∀yi = −1 (3)
or
yi (w · xi + b) − 1 ≥ 0 ∀i. (4)
As previously stated, the support vector machine training al-
gorithm finds the optimal hyperplane for separation of the train-
ing data. Specifically, it finds the hyperplane which maximizes
the margin of separation of the classifier.
Consider the set of training examples which satisfy (2) ex-
actly. These examples are those which lie closest to the hy-
perplane on the positive side. Similarly, the training examples
satisfying (3) exactly lie closest to the hyperplane on the nega-
tive side. These particular training examples are called support
vectors. Note that requiring the existence of points exactly sat-
isfying the constraints is equivalent to simply rescaling w and
b by an appropriate amount.
The distance between these points and the hyperplane is given by 1/‖w‖. We define the margin of the hyperplane to be the distance between the positive examples nearest the hyperplane and the negative examples nearest the hyperplane, which is equal to 2/‖w‖. Therefore, we can maximize the margin of the classifier by minimizing ‖w‖, subject to the constraints of (4). Thus the problem of training the SVM can be stated as follows: find w and b such that the resulting hyperplane correctly classifies the training data and the Euclidean norm of the weight vector is minimized.
To solve the problem described above, it is typically reformulated as a Lagrangian optimization problem. In this reformulation, nonnegative Lagrange multipliers A = {α1, α2, ..., αn} are introduced, yielding the Lagrangian

L = (1/2)‖w‖² − Σ_{i=1}^{n} αi (yi (w · xi + b) − 1). (5)
We must minimize this Lagrangian with respect to w and b, and simultaneously maximize it with respect to the Lagrange multipliers αi. Differentiating with respect to w and b and applying the results to the Lagrangian yields two conditions of optimality,
w = Σ_{i=1}^{n} αi yi xi (6)

and

Σ_{i=1}^{n} αi yi = 0. (7)
There are two important consequences of these conditions:
the optimal weight vector wo is described in terms of the train-
ing data, and only those training examples whose correspond-
ing Lagrange multipliers are non-zero contribute to wo. From
the Karush-Kuhn-Tucker (KKT) conditions [12], [15], [10], [3],
it follows that the training patterns corresponding to the non-
zero multipliers are those that satisfy (4) exactly. To understand
why this is true, recall that we wish to maximize the Lagrangian
L with respect to A. Thus, assuming w and b are constant, the
second term of L must be minimized. If (yi(w·xi+b)−1) > 0,
then αi must be zero in order to maximize L. Therefore, only
the training points lying closest to the optimal hyperplane, the
support vectors, have any effect on its calculation.
Substituting the optimality conditions, (6) and (7), into (5)
yields the Wolfe dual [3] of the optimization problem: find mul-
tipliers αi such that
LD = Σ_{i=1}^{n} αi − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj (xi · xj) (8)
is maximized subject to the constraints:
αi ≥ 0 ∀i (9)
and

Σ_{i=1}^{n} αi yi = 0, (10)
yielding a decision function of the form,
f(x) = sign( Σ_{i=1}^{n} αi yi (x · xi) + b ). (11)
Note that while w is directly determined by the set of support
vectors, the bias term b is not. Once the weight vector is known,
the bias may be computed by substitution of any support vector
into (4) and solving as an equality constraint, although numeri-
cally, it is better to take an average over all support vectors.
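As a concrete check of (6), (11), and the bias computation, the short sketch below uses a hypothetical two-point training set whose dual solution can be found by hand (α1 = α2 = 0.5; both points are support vectors). It recovers w from (6), averages the KKT equality over the support vectors to obtain b, and applies the decision function (11). The data and values are illustrative only, not from the study.

```python
import numpy as np

# Toy separable problem: x1 = +1 (class +1), x2 = -1 (class -1).
# Solving the dual (8) by hand under the constraint sum(alpha_i y_i) = 0
# gives alpha_1 = alpha_2 = 0.5.
X = np.array([[1.0], [-1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])

# Optimality condition (6): w = sum_i alpha_i y_i x_i
w = (alpha * y) @ X

# Bias from the equality y_i (w . x_i + b) = 1, averaged over all
# support vectors for numerical stability, as suggested in the text.
sv = alpha > 0
b = np.mean(y[sv] - X[sv] @ w)

# Decision function (11): f(x) = sign(sum_i alpha_i y_i (x . x_i) + b)
def f(x):
    return np.sign((alpha * y) @ (X @ x) + b)

print(w, b)   # -> w = [1.], b = 0.0; margin 2/||w|| = 2
```

With w = [1] and b = 0, both training points satisfy (4) with equality, as the support-vector definition requires.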
B. Relaxing the Separability Restriction
The previous derivation assumed that the training data was
linearly separable. The constraints of (4) are too rigid for use
with non-linearly separable data; they force all training examples to lie outside the margin of the classifier. The key
idea in extending the Support Vector Machine to handle non-
separable data is to allow these constraints to be violated, but
only if accompanied by a penalty in the objective function.
We thus introduce another set of nonnegative slack variables,
Ξ = {ξ1, ξ2,...,ξn} into the constraints [7]. The new con-
straints are
w · xi + b ≥ +1 − ξi ∀yi = +1, (12)
w · xi + b ≤ −1 + ξi ∀yi = −1, (13)
and ξi ≥ 0 ∀i. (14)
An error thus occurs only when ξi > 1. Therefore, the sum

Σ_{i=1}^{n} ξi

effectively serves as an upper bound on the number of errors committed by the SVM. We modify the original goal of the optimization problem, minimize ‖w‖, by adding a term to penalize errors. The new optimization problem thus becomes: minimize

‖w‖ + C Σ_{i=1}^{n} ξi,
where C is a user-defined parameter which controls the degree
to which training errors can be tolerated.
Proceeding in a manner analogous to that above, the Wolfe
dual of the new Lagrangian is
LD = Σ_{i=1}^{n} αi − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj (xi · xj), (15)
which is identical to (8). As in the separable case, LD must be maximized subject to constraints on the Lagrange multipliers.
However, the addition of the ξi produces a subtle difference
in these constraints. Specifically, the constraint given in (9)
becomes the following:
0 ≤ αi ≤ C ∀i. (16)
The second constraint,
Σ_{i=1}^{n} αi yi = 0, (17)
remains the same as in the separable problem. Thus, bounding
the values of the Lagrange multipliers from above allows the
Support Vector Machine to construct decision boundaries for
training data which cannot be linearly separated.
C. Relaxing the Linearity Restriction
Thus far, it has been assumed that the SVM was to construct
a linear boundary between two classes represented by a set of
training data. Of course, most interesting problems cannot be
adequately classified by a linear machine. In order to general-
ize the SVM to non-linear decision functions, we introduce the
notion of a kernel function [1], [5].
The training data only appears in the optimization problem
(15) in the form of dot products between the input vector and the support vectors. If the input vectors are mapped into some
high dimensional space via some nonlinear mapping Φ(x), then
the optimization problem would consist of dot products in this
higher dimensional space, Φ(xi) · Φ(xj). Given a kernel function K(xi, xj) = Φ(xi) · Φ(xj), the optimization problem would be unchanged except that the dot product xi · xj would be replaced with the kernel function K(xi, xj). The actual mapping Φ(x) would not appear in the optimization problem and
would never need to be calculated, or even known.
Cover’s theorem on the separability of patterns [9] essentially
says that data cast nonlinearly into a high dimensional feature
space is more likely to be linearly separable there than in a
lower dimensional space. Even though the SVM still produces a linear decision function, the function is now linear in the fea-
ture space, rather than the input space. Because of the high
dimensionality of the feature space, we can expect the linear
decision function to perform well, in accordance with Cover’s
theorem. Viewed another way, because of the nonlinearity of
the mapping to feature space, the SVM is capable of produc-
ing arbitrary decision functions in input space, depending on
the kernel function. Thus the fact that the SVM constructs only
hyperplane boundaries is of little consequence.
The above discussion makes use of the kernel function
K (xi,xj), but does not specify how to choose a suitable kernel.
Mercer’s theorem [18], [8] provides the theoretical basis for the
determination of whether a given kernel function K is equal to
a dot product in some space, the requirement for admissibility
as an SVM kernel. A discussion of Mercer’s theorem is outside
the scope of this paper. Instead, we simply give two examples
of suitable kernel functions which will be used here:
• Polynomial Kernel
K(xi, xj) = (xi · xj + 1)^p (18)
• Radial Basis Function Kernel
K(xi, xj) = exp( −‖xi − xj‖² / (2σ²) ) (19)
III. MULTI-CLASS CLASSIFICATION
The best way to generalize SVMs to the multi-class case is an
ongoing research problem. One such method, proposed by Platt et al. [23], is based on the notion of Decision Directed Acyclic
Graphs (DDAGs). A given DDAG is evaluated much like a
binary decision tree, where each internal node implements a
decision between two of the k classes of the classification problem. At each node, one class is eliminated from consideration.
When the traversal of the graph reaches a terminal node, only
one class is left and the decision is made. The principal difference between the DDAG and the conventional decision tree is
that DDAGs are not constrained in the same manner as trees.
However, a DDAG does not take on arbitrary graph structures.
It is a specific form of graph which differs from a tree only in
how it handles duplication of decisions. In a decision tree, if
the same decision is required in multiple locations in the tree,
then each decision is represented through distinct but identical
nodes. A DDAG allows two nodes to share a child. Because an
algorithm using the DDAG has no need to backtrack through
the graph, the algorithm can treat the graph as though it is a
standard decision tree.
In the so-called DAGSVM algorithm, each decision node
uses a 1-v-1 SVM to determine which class to eliminate from
consideration. A separate classifier must be constructed to sep-
arate all pairs of classes. For the EEG classification task pre-
sented here, there are five classes, and therefore a total of ten
SVMs. Because each classifier deals only with approximately
40% of the available training data, assuming that each class
is represented nearly equally, each may be trained relatively
quickly. In addition, only four of the classifiers are used to
classify any given unknown input. Figure 1 shows a possible
DDAG for the EEG classification task.
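The traversal just described can be sketched as follows. The pairwise rule used here is a dummy stand-in for trained 1-v-1 SVMs, and the elimination order (first class vs. last class in the remaining list) is one common convention, assumed for illustration.

```python
# DDAG evaluation for a k-class problem: each 1-v-1 decision eliminates
# one class, so k - 1 decisions classify an input.

def ddag_classify(x, classes, pairwise_decide):
    """pairwise_decide(x, a, b) returns the winning class, a or b."""
    remaining = list(classes)
    comparisons = 0
    while len(remaining) > 1:
        a, b = remaining[0], remaining[-1]
        winner = pairwise_decide(x, a, b)
        if winner == a:
            remaining.pop()       # b eliminated from consideration
        else:
            remaining.pop(0)      # a eliminated from consideration
        comparisons += 1
    return remaining[0], comparisons

# Dummy pairwise rule standing in for trained 1-v-1 SVMs: always prefer
# the lexicographically smaller task label.
tasks = ["rest", "math", "letter", "rotate", "count"]
label, n = ddag_classify(None, tasks, lambda x, a, b: min(a, b))
print(label, n)   # 'count' after 4 comparisons for 5 classes
```

For the five-class EEG task this matches the text: ten pairwise classifiers exist in total, but only four are consulted for any given input.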
IV. EEG SIGNAL ACQUISITION
The data used in this study were from the work of Keirn and
Aunon [13], [14] and collected using the following procedure.
Subjects were placed in a dim, sound controlled room and elec-
trodes were placed at positions C3, C4, P3, P4, O1, and O2 as
defined by the 10-20 system of electrode placement [11] and
referenced to two electrically linked mastoids at A1 and A2.
The impedance of all electrodes was kept below five kilohms.
Data were recorded at a sampling rate of 250 Hz with a Lab
Master 12 bit A/D converter mounted in an IBM-AT computer.
Fig. 1. A Decision Directed Acyclic Graph (DDAG) for the EEG classification problem. Each node represents a 1-v-1 SVM trained to differentiate between the two classes compared by the node.
Before each recording session, the system was calibrated with a
known voltage. The electrodes were connected through a bank
of Grass 7P511 amplifiers with analog bandpass filters from
0.1–100 Hz. Eye blinks were detected by means of a sepa-
rate channel of data recorded from two electrodes placed above
and below the subject’s left eye. An eye blink was defined as a
change in magnitude greater than 100 µV within a 10 millisecond period.
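A simple thresholding detector in the spirit of this definition might look as follows. This is an illustrative sketch, not the original acquisition code; the 3-sample window is an assumption approximating 10 ms at the 250 Hz sampling rate.

```python
import numpy as np

FS = 250              # sampling rate in Hz (Section IV)
THRESHOLD_UV = 100.0  # blink threshold in microvolts
WINDOW = 3            # ~10 ms at 250 Hz (2.5 samples, rounded up)

def contains_blink(eog_uv):
    """Return True if the EOG channel changes by > 100 uV within ~10 ms."""
    x = np.asarray(eog_uv, dtype=float)
    for start in range(len(x) - WINDOW + 1):
        w = x[start:start + WINDOW]
        if w.max() - w.min() > THRESHOLD_UV:
            return True
    return False

flat = np.zeros(FS)                     # one second of quiet signal
blink = flat.copy()
blink[100:103] = [0.0, 150.0, 0.0]      # sharp 150 uV transient

print(contains_blink(flat), contains_blink(blink))  # False True
```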
With the recording instruments in place, the subjects were
asked to perform five separate mental tasks. These tasks were
chosen to invoke hemispheric brainwave asymmetry. The sub-
jects were asked to first relax as much as possible. This task
represents the baseline against which other tasks are to be com-
pared. The subjects were also asked to mentally compose a
letter to a friend, compute a non-trivial multiplication problem,
visualize a sequence of numbers being written on a blackboard,
and rotate a 3-dimensional solid. For each of these tasks, the
subjects were asked not to vocalize or gesture in any way. Data
were recorded for 10 seconds for each task, and each task was
repeated five times during each session. The data from each
channel was divided into half-second segments overlapping by
one quarter-second. After segments containing eye blinks were
discarded, the remaining data contained at most 39 segments.
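The segmentation step can be sketched as follows. A quarter-second shift is 62.5 samples at 250 Hz; a step of 62 samples is assumed here (the exact rounding is not stated in the text), which reproduces the 39-segment count for a blink-free 10-second recording.

```python
import numpy as np

FS = 250
WINDOW = FS // 2          # 125 samples = 0.5 s
STEP = 62                 # ~0.25 s shift (assumption: rounded down)

signal = np.random.randn(10 * FS)   # placeholder single-channel recording

starts = range(0, len(signal) - WINDOW + 1, STEP)
segments = np.stack([signal[s:s + WINDOW] for s in starts])

print(segments.shape)   # (39, 125): 39 half-second segments
```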
V. RESULTS
In testing the classification algorithms, five trials from one
subject were selected from one day of experiments. Each trial
consisted of the subject performing all five mental tasks. The
first classifier tested is linear discriminant analysis (LDA). The
second type of classifiers are feedforward neural networks, consisting of 36 input units, 20 hidden units, and five binary output
units. The activation function at each unit is the tanh function.
The networks were trained using backpropagation with a learn-
ing rate of 0.1 and no momentum term. Training was halted
after 2,000 iterations or when the generalization began to fail,
as determined by a small set of validation data chosen without
replacement from the training data. The third type of classifiers
is support vector machines (SVM) that were trained using ra-
dial basis function (RBF) kernels or polynomial kernels. The
RBF based classifiers were trained using 0.5, 1.0, and 2.0 as
standard deviations of the kernel functions. Polynomial kernels
of degrees two, three, five, and ten were trained to test the poly-
nomial machines. For all kernel functions, the regularization
parameter C was tested at values 1.0, 10.0, and 100.0.
The Support Vector Machines were trained and tested using
the DAGSVM algorithm described earlier. Each of the 1-v-1
SVMs was trained using Platt’s Sequential Minimal Optimiza-
tion algorithm [21], [22]. SMO reduces the quadratic program-
ming stage of training to a series of pairwise optimizations
among the Lagrange multipliers. By solving the optimization
problem two variables at a time, the optimization can be per-
formed analytically. Platt shows significant speedups result-
ing from the SMO algorithm as compared to using a traditional
quadratic programming routine.
The training data was selected from the full set of five trials
as follows. One trial was selected as test data. Of the four re-
maining trials, one was chosen to be a validation set, which was
used to determine when to halt training of the neural networks and which values of the kernel parameters and regularization
parameter to use for the SVM tests. Finally, the remaining three
trials were compiled into one set of training data. The experi-
ments were repeated for each of the 20 ways to partition the five
trials in this manner and the results of the 20 experiments were
averaged to produce the results shown in Table I. This choice
of training paradigm is based on earlier results [2].
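The 20 partitions arise from choosing one of the five trials as test data and one of the remaining four as validation data (5 × 4 = 20 ordered choices), with the other three trials forming the training set. A short enumeration:

```python
from itertools import permutations

trials = list(range(5))   # five trials from one day of experiments

partitions = []
for test, val in permutations(trials, 2):
    train = [t for t in trials if t not in (test, val)]
    partitions.append((train, val, test))

print(len(partitions))    # 20 partitions
print(partitions[0])      # ([2, 3, 4], 1, 0): train, validation, test
```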
The SVM results reported in Table I are those corresponding
to the choice of kernel function and regularization parameter,
C , which produced the best results. Specifically, the SVM used
for the comparisons was constructed with a Radial Basis Func-
tion (RBF) kernel using a standard deviation σ = 0.5 and a
regularization parameter C equal to 1.
LDA provides extremely fast evaluations of unknown inputs,
performed by distance calculations between a new sample and
the mean of training data samples in each class weighted by
their covariance matrices. Neural networks are also efficient af-
ter the training phase is complete. SVMs are similar to neural
networks, but generally require more computation due to the
comparatively large numbers of support vectors. The time re-
quired to compute class membership for an SVM is directly de-
pendent on the number of support vectors. The number of sup-
port vectors resulting from experiments reported here ranged
from 140 to 308.
Classifier   Rest   Math   Letter   Rotate   Count   Total   Avg. over 20 windows
LDA          47.3   45.1   51.1     38.8     44.5    44.8    66.0
NN           64.3   47.3   54.7     51.1     47.3    52.8    69.4
SVM          59.4   44.5   52.7     57.0     47.9    52.3    72.0

TABLE I
PERCENTAGE OF TEST DATA CORRECTLY CLASSIFIED, BROKEN DOWN BY TASK. THE SUPPORT VECTOR MACHINE IN THESE EXPERIMENTS USED THE SET OF PARAMETERS WHICH RESULTED IN THE HIGHEST RATE OF CORRECT CLASSIFICATION AMONG ALL SVMS TESTED.
VI. FUTURE WORK
All data used in this study was collected from a single subject
during the same day. A logical step to take from here is to test
the performance of the classifiers on data collected on later days
and to repeat these experiments on data collected from other
subjects.
In addition, there have been several other attempts at generalizing kernel-based learners to multi-class classification. Weston and Watkins [24] have extended the theory of SVMs directly into the multi-class domain. Import Vector Machines [28] seem to offer similar performance while using significantly fewer support vectors. Each of these methods provides a slightly different approach to the classification problem, and could offer performance improvements.
VII. FEATURE SELECTION WITH GENETIC ALGORITHMS
A. Introduction
Both invasive and non-invasive BCI systems produce a very
large amount of electrophysiological data. However, only a
relatively small percentage of the potentially informative fea-
tures of the data are utilized. High-resolution analysis of spa-
tial, temporal, and spectral aspects of the data, and allowing
for their interactions, leads to a very high dimensional feature
space. Leveraging a higher percentage of potential features in
the measured data requires more powerful signal analysis and
classification capabilities. We have developed an EEG analy-
sis system that integrates advanced concepts and tools from the
fields of machine learning and artificial intelligence to address
this challenge. One of our initial test applications of the system
is the “self-paced key typing” dataset from Blankertz et al. [4]
which includes 413 pre-key press epochs of EEG recorded from
one subject.
B. Method
The overall system is composed of two main parts: feature composition and feature selection (see Figure 2). Feature com-
position entails data preprocessing, feature derivation, and as-
sembling all of the features into a single large feature matrix. In
this specific experiment, we used a fixed set of six electrodes:
F3, F4, C3, C4, CP3, CP4, partitioned each trial into 500 ms
windows shifted by 100 ms over the entire epoch, zero-meaned
the signals, zero-padded them to length 1024, and computed
their power spectra at 1 Hz frequency resolution. We used
mean power over the standard EEG frequency bands of delta
(2-4 Hz), theta (4-8), alpha (8-13), beta1 (13-20), beta2 (20-
35), and gamma (35-46).
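The feature composition step can be sketched as follows. This is an illustrative reconstruction, not the original analysis code: one 500 ms window is zero-meaned, zero-padded to 1024 points, its power spectrum is computed, and mean power is taken over the six bands listed above (band edges from the text, sampling rate from Section IV).

```python
import numpy as np

FS = 250
NFFT = 1024
BANDS = {"delta": (2, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta1": (13, 20), "beta2": (20, 35), "gamma": (35, 46)}

def band_powers(window):
    """Mean power per EEG band for one 500 ms segment."""
    x = np.asarray(window, dtype=float)
    x = x - x.mean()                               # zero-mean the signal
    spectrum = np.abs(np.fft.rfft(x, n=NFFT)) ** 2  # zero-pads to 1024
    freqs = np.fft.rfftfreq(NFFT, d=1.0 / FS)
    return np.array([spectrum[(freqs >= lo) & (freqs < hi)].mean()
                     for lo, hi in BANDS.values()])

window = np.random.randn(125)       # one 500 ms segment at 250 Hz
features = band_powers(window)
print(features.shape)               # (6,): one feature per band
```

Repeating this per electrode and per 100 ms-shifted window, and stacking the results, yields the single large feature matrix described in the text.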
The feature selection part included a support vector machine
(SVM) for predicting (classifying) the pressed key laterality
and a genetic algorithm (GA) for searching the space of feature
subsets [25], [27]. We used the radial basis function kernel,
with gamma = 0.2. The SVM has several advantages over al-
ternative classifiers. Unlike most neural network classifiers, the
SVM is not susceptible to local optima. SVMs involve many
fewer parameters than neural networks, have built-in regular-
ization, are theoretically well-grounded, and, particularly im-
portant for ultimate real-time use in a BCI, are extremely fast.
The GA was implemented with a population of 20, 2-point
crossover probability of 0.66, and mutation rate of 0.008. Indi-
viduals in the population were binary strings, with 1 indicating
that a feature was included, 0 indicating that it was not. We used
a GA to search the space of feature subsets for two main rea-
sons. First, exhaustive exploration of search spaces with greater
than about 20 features is computationally intractable (i.e., 2^20 possible subsets). Second, unlike gradient-based search meth-
ods, the GA is inherently designed to avoid the pitfall of local
optima. We searched over the eleven time windows and six fre-
quency bands, while constantly including all six electrodes in
each case. Thus, the dimensionality of the searchable feature
space was 66 (11 x 6). Each time an individual (feature subset) in the GA population was evaluated, we trained and tested the
SVM using 10x10 fold cross validation and used the average
classification accuracy as the individual’s fitness measure.
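A minimal GA with these parameters might look as follows. The fitness function is a cheap stand-in for the SVM cross-validation accuracy, which would be far more expensive to evaluate; elitism is an assumption here, adopted to make the best fitness non-decreasing.

```python
import random

# Parameters from the text: population 20, 2-point crossover with
# probability 0.66, per-bit mutation rate 0.008, bit strings of length 66.
random.seed(0)
N_BITS, POP, P_CROSS, P_MUT = 66, 20, 0.66, 0.008
TARGET = [random.randint(0, 1) for _ in range(N_BITS)]  # dummy optimum

def fitness(bits):
    """Placeholder fitness: fraction of bits matching a hidden target."""
    return sum(b == t for b, t in zip(bits, TARGET)) / N_BITS

def crossover(a, b):
    if random.random() < P_CROSS:
        i, j = sorted(random.sample(range(N_BITS), 2))
        return a[:i] + b[i:j] + a[j:]       # 2-point crossover
    return a[:]

def mutate(bits):
    return [1 - b if random.random() < P_MUT else b for b in bits]

pop = [[random.randint(0, 1) for _ in range(N_BITS)] for _ in range(POP)]
best_history = []
for gen in range(30):
    pop.sort(key=fitness, reverse=True)
    best_history.append(fitness(pop[0]))
    elite = pop[0]                  # elitism keeps best fitness monotonic
    parents = pop[:POP // 2]
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(POP - 1)]
    pop = [elite] + children

print(best_history[0], best_history[-1])
```

Because the elite individual survives each generation unchanged, the best fitness can never decrease, mirroring the monotonic best-individual curve reported in the Results.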
C. Results
The GA evolves a population of feature subsets whose corre-
sponding fitness (classification accuracy) improves over itera-
tions of the GA (Figure 3). Note that although both the population average and the best individual fitness improve over successive generations, only the best individual fitness does so in a monotonic fashion. The best fitness obtained was a classification accuracy
of 76%. It was stable for over 50 generations of the GA. The standard deviation of the classification accuracy produced by
the SVM was typically about 6%.
Figure 4 shows the feature subset exhibiting the highest clas-
sification accuracy. The feature subset included features from
every time window and every frequency band. This suggests
that alternative methods that include only a few time windows
or frequencies may be missing features that could improve clas-
sification accuracy. Furthermore, all frequency bands were in-
cluded in the third time window, suggesting that early wide-
band activity may be a significant feature of the process for de-
ciding finger laterality.
Fig. 2. System architecture for mining the EEG feature space. The space of feature subsets is searched in a “wrapper” fashion, whereby the search is directed by the performance of the classifier, in this case a support vector machine.
D. Discussion
Although the best classification accuracy (76%) was considerably higher than chance, it was much lower than the approx-
imately 95% classification accuracy obtained by Blankertz et
al. [4]. One possible reason is that we used data from only a
small subset of the electrodes recorded (6 of 27) in order to re-
duce computation time by restraining the dimensionality of the
feature vector presented to the SVM.
Optimizing classification accuracy was not, however, our pri-
mary goal. Instead, we sought insight into the nature of the fea-
tures that would provide the best classification accuracy. The
feature selection method showed that a diverse subset of spec-
trotemporal features in the EEG contributed to the best classifi-
cation accuracy. However, most BCIs that use EEG frequency
information in imagined or real movement look at only alpha(mu) and beta bands over only one or a few time windows [20],
[19], [26]. Furthermore, the system is amenable to on-line ap-
plications. One could use the full system, including the GA,
to learn the best dissociating features for a given subject and
task, then use the trained SVM with the best dissociating fea-
tures in real-time. Thus, preliminary results from this research suggest that BCI performance could be improved by leverag-
ing advances in machine learning and artificial intelligence for
systematic exploration of the EEG feature space.
VIII. CONCLUSIONS
Support Vector Machines provide a powerful method for data classification. The SVM algorithm has a very solid foundation
in statistical learning theory, and is guaranteed to find the optimal
decision function for a set of training data, given a set of pa-
rameters determining the operation of the SVM. The empirical
evidence presented here shows that the algorithm performs very
well on one real problem.
Finally, we are currently working with alternative represen-
tations of the EEG data. Preliminary results indicate that apply-
ing a KL-Transform to the raw data produces a data set which is
much more susceptible to accurate classification by many types
of classifiers.
Fig. 3. Classification accuracy (population fitness) evolves over iterations (generations) of the genetic algorithm. The thin line is the average fitness of the population; the thick line is the fitness of the best individual in the population.
Fig. 4. Features selected for the best individual. Black indicates the feature was included in the subset, white indicates it was not. Time windows correspond to the number of 100 ms shifts from epoch onset, i.e., time window 1 is early in the epoch, and time window 11 ends 120 ms before the key press.
ACKNOWLEDGMENT
The SVMs used for the comparisons in Sections II–V were constructed using the SVM MATLAB toolbox developed by [6]. The SVM used in the feature selection experiments was implemented with the OSU SVM Classifier Matlab Toolbox [17]. The GA was implemented with the commercial FlexTool GA software [16].
REFERENCES
[1] A. Aizerman, E.M. Braverman, and L.I. Rozoner. Theoretical founda-tions of the potential function method in pattern recognition learning. In
Automation and Remote Control, 1964.[2] C. W. Anderson, S. V. Devulapalli, and E. A. Stolz. Determining mental
state from EEG signals using neural networks. Scientific Programming,4(3):171–183, 1995.
[3] D.P. Bertsekas. Nonlinear Programming. Athena Scientific, 1995.
[4] B. Blankertz, G. Curio, and K. R. Muller. Classifying single trial EEG: Towards brain computer interfacing. In Advances in Neural Information Processing Systems 14, 2002. To appear.
[5] B.E. Boser, I.M. Guyon, and V. Vapnik. A training algorithm for optimal
margin classifiers. In Fifth Annual Workshop on Computational LearningTheory, 1992.
[6] G.C. Cawley. MATLAB support vector machine toolboxhttp://theoval.sys.uea.ac.uk/˜gcc/svm/toolbox .University of East Anglia, School of Information Systems, Norwich,Norfolk, U.K. NR4 7TJ, 2000.
[7] C. Cortes and V. Vapnik. Support vector networks. Machine Learning,20, 1995.
[8] R. Courant and D. Hilbert. Methods of Mathematical Physics, volume Iand II. Wiley Interscience, 1970.
[9] T.M. Cover. Geometrical and statistical properties of systems of linear in-equalities with applications in pattern recognition. In IEEE Transactionson Electronic Computers, 1965.
[10] R. Fletcher. Practical Methods of Optimization. John Wiley and Sons,Inc., 2nd edition, 1987.
[11] H. Jasper. The ten twenty electrode system of the international feder-ation. Electroencephalography and Clinical Neurophysiology, 10:371–375, 1958.
[12] W. Karush. Minima of functions of several variables with inequalities asside constraints. Master’s thesis, University of Chicago, 1939.
[13] Z. A. Keirn. Alternative modes of communication between man and machine. Master's thesis, Purdue University, West Lafayette, IN, 1988.
[14] Z. A. Keirn and J. I. Aunon. A new mode of communication between manand his surroundings. IEEE Transactions on Biomedical Engineering,37(12):1209–1214, 1990.
[15] H.W. Kuhn and A.W. Tucker. Nonlinear programming. In Proceedings of the 2nd Berkeley Symposium on Mathematical Statistics and Probability, 1951.
[16] CynapSys LLC. Flexga. www.cynapsys.com, 2002.[17] J. Ma, Y. Zhao, and S. Ahalt. Osu svm classifier matlab toolbox.
http://eewww.eng.ohio-state.edu/˜maj/osu svm/, 2002.[18] J. Mercer. Functions of positive and negative type, and their connec-
tion with the theory of integral equations. In Transactions of the London
Philosophical Society, 1909.[19] G. Pfurtscheller, C. Neuper, C. Guger, W. Harkam, H. Ramoser,
A. Schlogl, B. Obermaier, and M. Pregenzer. Current trends in Graz brain-computer interface (BCI) research. IEEE Transactions on Rehabilitation
Engineering, 8(2):456–460, 2000.[20] J. A. Pineda, B. Z. Allison, and A. Vankov. The effects of self-movement,
observation, and imagination on mu rhythms and readiness potentials:Toward a brain-computer interface. IEEE Transactions on Rehabilitation
Engineering, 8(2):219–222, June 2000.[21] John C. Platt. Fast training of support vector machines using sequential
minimal optimization. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods – Support Vector Learning, pages 185–208.MIT Press, 1998.
[22] John C. Platt. Using analytic qp and sparseness to speed training of sup-port vector machines. In M. S. Kearns, S. A. Solla, and D. A. Cohn, edi-tors, Advances in Neural Information Processing Systems 11. MIT Press,1999.
[23] John C. Platt, Nello Cristianini, and John Shawe-Taylor. Large marginDAGs for multiclass classification. In S. A. Solla, T. K. Leen, and K.-R.Muller, editors, Advances in Neural Information Processing Systems 12,pages 547–553, 2000.
[24] J. Weston and C. Watkins. Multi-class support vector machines. Technical report, Royal Holloway University of London, 1998.
[25] D. Whitley, R. Beveridge, C. Guerra, and C. Graves. Messy genetic al-gorithms for subset feature selection. In T. Baeck, editor, Proc. Int. Conf.on Genetic Algorithms, Boston, MA, 1997. Morgan Kaufmann.
[26] J. R. Wolpaw, D. J. McFarland, and T. M. Vaughan. Brain-computerinterface research at the wadsworth center. IEEE Transactions on Reha-bilitation Engineering, 3(2):222–226, June 2000.
[27] Jihoon Yang and Vasant Honavar. Feature subset selection using a geneticalgorithm. In Huan Liu and Hiroshi Motoda, editors, Feature extraction,construction and selection : a data mining perspective, pages 117–136.Kluwer Academic, Boston, MA, 1998.
[28] Ji Zhu and Trevor Hastie. Kernel logistic regression and the import vectormachine. In NIPS2001, 2001.