Comparison of Linear and Nonlinear Methods for EEG Signal Classification

Deon Garrett, David A. Peterson, Charles W. Anderson, Michael H. Thaut

Abstract—The reliable operation of brain-computer interfaces (BCIs) based on spontaneous electroencephalogram (EEG) signals requires accurate classification of multichannel EEG. The design of EEG representations and classifiers for BCI are open research questions whose difficulty stems from the need to extract complex spatial and temporal patterns from noisy multidimensional time series obtained from EEG measurements. It is possible that the amount of noise in EEG limits the power of nonlinear methods; linear methods may perform just as well as nonlinear methods. This article reports the results of a linear classifier (linear discriminant analysis) and two nonlinear classifiers (neural networks and support vector machines) applied to the classification of spontaneous, six-channel EEG. The nonlinear classifiers produce only slightly better classification results. An approach to feature selection based on genetic algorithms is also presented with preliminary results.

Index Terms—EEG, electroencephalogram, pattern classification, neural networks, support vector machines, feature selection, genetic algorithms

I. INTRODUCTION

Recently, much research has been performed into alternative methods of communication between humans and computers. The standard keyboard/mouse model of computer use is not only unsuitable for many people with disabilities, but also somewhat clumsy for many tasks regardless of the capabilities of the user. Electroencephalogram (EEG) signals provide one possible means of human-computer interaction which requires very little in terms of physical abilities. By training the computer to recognize and classify EEG signals, users could manipulate the machine by merely thinking about what they want it to do within a limited set of choices.

Currently, most research into EEG classification uses such machine learning stalwarts as Neural Networks (NNs).

In this article, we examine the application of support vector machines (SVMs) to the problem of EEG classification and compare the results to those obtained using neural networks and linear discriminant analysis. Section II provides an overview of SVM theory and practice, and the problem of multi-class classification is considered in Section III. Section IV discusses the acquisition of EEG signals. The results of this study are detailed in Section V. Section VII describes preliminary experiments using genetic algorithms to search for good subsets of features in an EEG classification problem. Section VIII summarizes the findings of this article and their implications.

D. Garrett is a Ph.D. candidate in the Department of Computer Science, Colorado State University, Fort Collins, CO (e-mail: [email protected]).

D. Peterson is a Ph.D. candidate in the Department of Computer Science, Colorado State University, Fort Collins, CO (e-mail: [email protected]).

C. Anderson is with the Department of Computer Science, Colorado State University, Fort Collins, CO (e-mail: [email protected]).

M. Thaut is with the Department of Music, Theatre, and Dance and the Center for Biomedical Research, Colorado State University, Fort Collins, CO (e-mail: [email protected]).

D. Peterson, C. Anderson, and M. Thaut are also with the Molecular, Cellular, and Integrative Neuroscience Program at Colorado State University.

II. SUPPORT VECTOR MACHINES FOR BINARY CLASSIFICATION

The support vector machine (SVM) is a classification method rooted in statistical learning theory. The motivation behind SVMs is to map the input into a high dimensional feature space, in which the data might be linearly separable. In this regard, SVMs are very similar to other neural network based learning machines. The principal difference between these machines and SVMs is that the latter produce the optimal decision surface in the feature space.

Conventional neural networks can be difficult to build due to the need to select an appropriate number of hidden units. The network must contain enough hidden units to be able to approximate the function in question to the desired accuracy. However, if the network contains too many hidden units, it may simply memorize the training data, causing very poor generalization. The ability of the machine to learn features of the training data is often referred to as learning capacity, and is formalized in a concept called VC dimension.

Support Vector Machines are constructed by solving a quadratic programming problem. In solving this problem, SVM training algorithms simultaneously maximize the performance of the machine while minimizing a term representing the VC dimension of the learning machine. This minimization of the capacity of the machine ensures that the system cannot overfit the training data, for a given set of parameters.

 A. Linear Support Vector Machines

In this section, the training of a support vector machine is described for the case of a binary classification problem for which a linear decision surface exists that can perfectly classify the training data. In later sections, the requirement of linear separability will be relaxed.

The assumption of linear separability means that there exists some hyperplane which perfectly separates the data. This hyperplane is a decision surface of the form

w · x + b = 0,    (1)

where w is an adjustable weight vector, x is an input vector, and b is a bias term.


The assumption of separability means that there exists some set of values w and b such that the following constraints hold for all input vectors, given that the classes are labeled +1 and −1:

w · xi + b ≥ +1  ∀ yi = +1,    (2)
w · xi + b ≤ −1  ∀ yi = −1,    (3)

or

yi (w · xi + b) − 1 ≥ 0  ∀ i.    (4)

As previously stated, the support vector machine training algorithm finds the optimal hyperplane for separation of the training data. Specifically, it finds the hyperplane which maximizes the margin of separation of the classifier.

Consider the set of training examples which satisfy (2) exactly. These examples are those which lie closest to the hyperplane on the positive side. Similarly, the training examples satisfying (3) exactly lie closest to the hyperplane on the negative side. These particular training examples are called support vectors. Note that requiring the existence of points exactly satisfying the constraints is equivalent to simply rescaling w and b by an appropriate amount.

The distance between these points and the hyperplane is given by 1/‖w‖. We define the margin of the hyperplane to be the distance between the positive examples nearest the hyperplane and the negative examples nearest the hyperplane, which is equal to 2/‖w‖. Therefore, we can maximize the margin of the classifier by minimizing ‖w‖, subject to the constraints of (4). Thus the problem of training the SVM can be stated as follows: find w and b such that the resulting hyperplane correctly classifies the training data and the Euclidean norm of the weight vector is minimized.

To solve the problem described above, it is typically reformulated as a Lagrangian optimization problem. In this reformulation, nonnegative Lagrange multipliers A = {α1, α2, ..., αn} are introduced, yielding the Lagrangian

L = (1/2)‖w‖² − Σ_{i=1}^{n} αi (yi (w · xi + b) − 1).    (5)

We must minimize this Lagrangian with respect to w and b, and simultaneously maximize with respect to the Lagrange multipliers αi. Differentiating with respect to w and b and applying the results to the Lagrangian yields two conditions of optimality,

w = Σ_{i=1}^{n} αi yi xi    (6)

and

Σ_{i=1}^{n} αi yi = 0.    (7)

There are two important consequences of these conditions: the optimal weight vector wo is described in terms of the training data, and only those training examples whose corresponding Lagrange multipliers are non-zero contribute to wo. From the Karush-Kuhn-Tucker (KKT) conditions [12], [15], [10], [3], it follows that the training patterns corresponding to the non-zero multipliers are those that satisfy (4) exactly. To understand why this is true, recall that we wish to maximize the Lagrangian L with respect to A. Thus, assuming w and b are constant, the second term of L must be minimized. If (yi (w · xi + b) − 1) > 0, then αi must be zero in order to maximize L. Therefore, only the training points lying closest to the optimal hyperplane, the support vectors, have any effect on its calculation.

Substituting the optimality conditions, (6) and (7), into (5) yields the Wolfe dual [3] of the optimization problem: find multipliers αi such that

L_D = Σ_{i=1}^{n} αi − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj (xi · xj)    (8)

is maximized subject to the constraints

αi ≥ 0  ∀ i    (9)

and

Σ_{i=1}^{n} αi yi = 0,    (10)

yielding a decision function of the form

f(x) = sign( Σ_{i=1}^{n} αi yi (x · xi) + b ).    (11)

Note that while w is directly determined by the set of support vectors, the bias term b is not. Once the weight vector is known, the bias may be computed by substitution of any support vector into (4) and solving as an equality constraint, although numerically it is better to take an average over all support vectors.
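As a concrete illustration, the following sketch (Python/NumPy; the function and variable names are ours, not from the paper) shows how the weight vector of (6), the bias averaged over the support vectors, and the decision function of (11) could be computed once the multipliers αi have been obtained from the dual problem.

import numpy as np

def linear_svm_decision(X_train, y_train, alphas, X_new):
    # X_train: (n, d) training inputs; y_train: (n,) labels in {-1, +1};
    # alphas: (n,) Lagrange multipliers from the dual (zero for non-support vectors);
    # X_new: (m, d) points to classify.
    sv = alphas > 1e-8                                # support vectors: non-zero multipliers
    w = (alphas[sv] * y_train[sv]) @ X_train[sv]      # Eq. (6): w = sum_i alpha_i y_i x_i
    # Bias from Eq. (4) treated as an equality for each support vector,
    # averaged over all of them for numerical stability.
    b = np.mean(y_train[sv] - X_train[sv] @ w)
    return np.sign(X_new @ w + b)                     # Eq. (11)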

 B. Relaxing the Separability Restriction

The previous derivation assumed that the training data was linearly separable. The constraints of (4) are too rigid for use with non-linearly separable data; they force all training examples to lie outside the margin of the classifier. The key idea in extending the Support Vector Machine to handle non-separable data is to allow these constraints to be violated, but only if accompanied by a penalty in the objective function. We thus introduce another set of nonnegative slack variables, Ξ = {ξ1, ξ2, ..., ξn}, into the constraints [7]. The new constraints are

w · xi + b ≥ +1 − ξi  ∀ yi = +1,    (12)
w · xi + b ≤ −1 + ξi  ∀ yi = −1,    (13)
ξi ≥ 0  ∀ i.    (14)

An error thus occurs only when ξi > 1. Therefore, the sum

Σ_{i=1}^{n} ξi

effectively serves as an upper bound on the number of errors committed by the SVM. We modify the original goal of the optimization problem, minimize ‖w‖, by adding a term to penalize errors. The new optimization problem thus becomes: minimize

‖w‖ + C Σ_{i=1}^{n} ξi,


where C is a user-defined parameter which controls the degree to which training errors can be tolerated.

Proceeding in a manner analogous to that above, the Wolfe dual of the new Lagrangian is

L_D = Σ_{i=1}^{n} αi − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj (xi · xj),    (15)

which is identical to (8). As in the separable case, L_D must be maximized subject to constraints on the Lagrange multipliers. However, the addition of the ξi produces a subtle difference in these constraints. Specifically, the constraint given in (9) becomes the following:

0 ≤ αi ≤ C  ∀ i.    (16)

The second constraint,

Σ_{i=1}^{n} αi yi = 0,    (17)

remains the same as in the separable problem. Thus, bounding the values of the Lagrange multipliers from above allows the Support Vector Machine to construct decision boundaries for training data which cannot be linearly separated.

C. Relaxing the Linearity Restriction

Thus far, it has been assumed that the SVM was to construct a linear boundary between two classes represented by a set of training data. Of course, most interesting problems cannot be adequately classified by a linear machine. In order to generalize the SVM to non-linear decision functions, we introduce the notion of a kernel function [1], [5].

The training data only appears in the optimization problem (15) in the form of dot products between the input vector and the support vectors. If the input vectors are mapped into some high dimensional space via some nonlinear mapping Φ(x), then the optimization problem would consist of dot products in this higher dimensional space, Φ(xi) · Φ(xj). Given a kernel function K(xi, xj) = Φ(xi) · Φ(xj), the optimization problem would be unchanged except that the dot product xi · xj would be replaced with the kernel function K(xi, xj). The actual mapping Φ(x) would not appear in the optimization problem and would never need to be calculated, or even known.

Cover's theorem on the separability of patterns [9] essentially says that data cast nonlinearly into a high dimensional feature space is more likely to be linearly separable there than in a lower dimensional space. Even though the SVM still produces a linear decision function, the function is now linear in the feature space, rather than the input space. Because of the high dimensionality of the feature space, we can expect the linear decision function to perform well, in accordance with Cover's theorem. Viewed another way, because of the nonlinearity of the mapping to feature space, the SVM is capable of producing arbitrary decision functions in input space, depending on the kernel function. Thus the fact that the SVM constructs only hyperplane boundaries is of little consequence.

The above discussion makes use of the kernel function K(xi, xj), but does not specify how to choose a suitable kernel. Mercer's theorem [18], [8] provides the theoretical basis for determining whether a given kernel function K is equal to a dot product in some space, the requirement for admissibility as an SVM kernel. A discussion of Mercer's theorem is outside the scope of this paper. Instead, we simply give two examples of suitable kernel functions which will be used here:

• Polynomial kernel:

  K(xi, xj) = (xi · xj + 1)^p    (18)

• Radial basis function kernel:

  K(xi, xj) = exp(−‖xi − xj‖² / (2σ²))    (19)
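A direct transcription of these two kernels in Python/NumPy might look as follows; the default values of p and σ below are merely examples drawn from the parameter ranges tested in Section V, not recommendations.

import numpy as np

def polynomial_kernel(xi, xj, p=3):
    # Polynomial kernel of Eq. (18): K(xi, xj) = (xi . xj + 1)^p
    return (np.dot(xi, xj) + 1.0) ** p

def rbf_kernel(xi, xj, sigma=0.5):
    # Radial basis function kernel of Eq. (19):
    # K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))
    diff = np.asarray(xi) - np.asarray(xj)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))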

III. MULTI-CLASS CLASSIFICATION

The best way to generalize SVMs to the multi-class case is an ongoing research problem. One such method, proposed by Platt et al. [23], is based on the notion of Decision Directed Acyclic Graphs (DDAGs). A given DDAG is evaluated much like a binary decision tree, where each internal node implements a decision between two of the k classes of the classification problem. At each node, one class is eliminated from consideration. When the traversal of the graph reaches a terminal node, only one class is left and the decision is made. The principal difference between the DDAG and the conventional decision tree is that DDAGs are not constrained in the same manner as trees. However, a DDAG does not take on arbitrary graph structures. It is a specific form of graph which differs from a tree only in how it handles duplication of decisions. In a decision tree, if the same decision is required in multiple locations in the tree, then each decision is represented through distinct but identical nodes. A DDAG allows two nodes to share a child. Because an algorithm using the DDAG has no need to backtrack through the graph, the algorithm can treat the graph as though it is a standard decision tree.

In the so-called DAGSVM algorithm, each decision node uses a 1-v-1 SVM to determine which class to eliminate from consideration. A separate classifier must be constructed to separate all pairs of classes. For the EEG classification task presented here, there are five classes, and therefore a total of ten SVMs. Because each classifier deals only with approximately 40% of the available training data, assuming that each class is represented nearly equally, each may be trained relatively quickly. In addition, only four of the classifiers are used to classify any given unknown input. Figure 1 shows a possible DDAG for the EEG classification task.
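The evaluation procedure can be sketched as follows (Python; the data structure holding the 1-v-1 classifiers is hypothetical, and any trained pairwise decision functions could be plugged in).

def ddag_classify(x, classes, pairwise):
    # classes: list of the k class labels (here the five mental tasks).
    # pairwise: dict mapping frozenset({a, b}) to a 1-v-1 decision function
    #           f(x) that returns either a or b.
    # A k-class problem needs k*(k-1)/2 classifiers but evaluates only k-1 of them.
    remaining = list(classes)
    while len(remaining) > 1:
        a, b = remaining[0], remaining[-1]
        winner = pairwise[frozenset((a, b))](x)
        # Eliminate the losing class from further consideration.
        remaining.remove(a if winner == b else b)
    return remaining[0]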

IV. EEG SIGNAL ACQUISITION

The data used in this study were from the work of Keirn and Aunon [13], [14] and collected using the following procedure. Subjects were placed in a dim, sound-controlled room and electrodes were placed at positions C3, C4, P3, P4, O1, and O2 as defined by the 10-20 system of electrode placement [11] and referenced to two electrically linked mastoids at A1 and A2. The impedance of all electrodes was kept below five kilohms. Data were recorded at a sampling rate of 250 Hz with a Lab Master 12-bit A/D converter mounted in an IBM-AT computer.


Fig. 1. A Decision Directed Acyclic Graph (DDAG) for the EEG classification problem. Each node represents a 1-v-1 SVM trained to differentiate between the two classes compared by the node.

Before each recording session, the system was calibrated with a known voltage. The electrodes were connected through a bank of Grass 7P511 amplifiers with analog bandpass filters from 0.1–100 Hz. Eye blinks were detected by means of a separate channel of data recorded from two electrodes placed above and below the subject's left eye. An eye blink was defined as a change in magnitude greater than 100 µV within a 10 millisecond period.

With the recording instruments in place, the subjects were asked to perform five separate mental tasks. These tasks were chosen to invoke hemispheric brainwave asymmetry. The subjects were asked to first relax as much as possible. This task represents the baseline against which the other tasks are to be compared. The subjects were also asked to mentally compose a letter to a friend, compute a non-trivial multiplication problem, visualize a sequence of numbers being written on a blackboard, and rotate a 3-dimensional solid. For each of these tasks, the subjects were asked not to vocalize or gesture in any way. Data were recorded for 10 seconds for each task, and each task was repeated five times during each session. The data from each channel were divided into half-second segments overlapping by one quarter-second. After segments containing eye blinks were discarded, the remaining data contained at most 39 segments.
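For illustration, the segmentation step might be implemented as below (Python/NumPy sketch; the synthetic array simply stands in for one 10 s, six-channel trial, and the function name is ours, not from the original software).

import numpy as np

def segment_trial(eeg, fs=250, win_s=0.5, step_s=0.25):
    # eeg: (n_channels, n_samples) array, e.g. 6 x 2500 for a 10 s trial.
    # Returns an array of shape (n_segments, n_channels, win_samples);
    # a 10 s trial yields (10 - 0.5) / 0.25 + 1 = 39 segments.
    win = int(round(win_s * fs))
    segments = []
    start_s = 0.0
    while start_s + win_s <= eeg.shape[1] / fs + 1e-9:
        start = int(round(start_s * fs))
        segments.append(eeg[:, start:start + win])
        start_s += step_s
    return np.array(segments)

trial = np.random.randn(6, 2500)       # synthetic stand-in for one trial
print(segment_trial(trial).shape)      # (39, 6, 125)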

V. RESULTS

In testing the classification algorithms, five trials from one subject were selected from one day of experiments. Each trial consisted of the subject performing all five mental tasks. The first classifier tested is linear discriminant analysis (LDA). The second type of classifier is a feedforward neural network, consisting of 36 input units, 20 hidden units, and five binary output units. The activation function at each unit is the tanh function. The networks were trained using backpropagation with a learning rate of 0.1 and no momentum term. Training was halted after 2,000 iterations or when generalization began to fail, as determined by a small set of validation data chosen without replacement from the training data. The third type of classifier is the support vector machine (SVM), trained using radial basis function (RBF) kernels or polynomial kernels. The RBF-based classifiers were trained using 0.5, 1.0, and 2.0 as standard deviations of the kernel functions. Polynomial kernels of degrees two, three, five, and ten were trained to test the polynomial machines. For all kernel functions, the regularization parameter C was tested at values 1.0, 10.0, and 100.0.
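The original experiments used MATLAB toolboxes (see the Acknowledgment). As a rough modern stand-in, the three classifier families could be instantiated with scikit-learn as sketched below. Note that scikit-learn's multi-class SVC uses one-vs-one voting rather than the DAGSVM evaluation of Section III, and its MLP uses a softmax output layer rather than five tanh output units, so this is only an approximation of the setup described above.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Linear baseline.
lda = LinearDiscriminantAnalysis()

# Feedforward network roughly matching the text: one hidden layer of 20 tanh
# units, learning rate 0.1, no momentum, up to 2000 iterations, early stopping
# on a held-out validation split.
nn = MLPClassifier(hidden_layer_sizes=(20,), activation="tanh", solver="sgd",
                   learning_rate_init=0.1, momentum=0.0, max_iter=2000,
                   early_stopping=True)

# RBF-kernel SVM; scikit-learn's gamma corresponds to 1/(2*sigma^2), so
# sigma = 0.5 gives gamma = 2.0.  C is the regularization parameter.
svm = SVC(kernel="rbf", gamma=1.0 / (2 * 0.5 ** 2), C=1.0)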

The Support Vector Machines were trained and tested using the DAGSVM algorithm described earlier. Each of the 1-v-1 SVMs was trained using Platt's Sequential Minimal Optimization (SMO) algorithm [21], [22]. SMO reduces the quadratic programming stage of training to a series of pairwise optimizations among the Lagrange multipliers. By solving the optimization problem two variables at a time, the optimization can be performed analytically. Platt shows significant speedups resulting from the SMO algorithm as compared to using a traditional quadratic programming routine.

The training data was selected from the full set of five trials as follows. One trial was selected as test data. Of the four remaining trials, one was chosen to be a validation set, which was used to determine when to halt training of the neural networks and which values of the kernel parameters and regularization parameter to use for the SVM tests. Finally, the remaining three trials were compiled into one set of training data. The experiments were repeated for each of the 20 ways to partition the five trials in this manner, and the results of the 20 experiments were averaged to produce the results shown in Table I. This choice of training paradigm is based on earlier results [2].
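Enumerating the 20 train/validation/test partitions amounts to choosing an ordered pair of distinct trials for testing and validation, with the remaining three trials used for training; a small sketch:

from itertools import permutations

trials = [0, 1, 2, 3, 4]          # the five recorded trials

# Every ordered choice of a distinct (test, validation) pair; the remaining
# three trials form the training set.  5 * 4 = 20 partitions in total.
partitions = []
for test, val in permutations(trials, 2):
    train = [t for t in trials if t not in (test, val)]
    partitions.append((test, val, train))

print(len(partitions))            # 20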

The SVM results reported in Table I are those corresponding to the choice of kernel function and regularization parameter, C, which produced the best results. Specifically, the SVM used for the comparisons was constructed with a radial basis function (RBF) kernel using a standard deviation σ = 0.5 and a regularization parameter C equal to 1.

LDA provides extremely fast evaluations of unknown inputs, performed by distance calculations between a new sample and the mean of training data samples in each class weighted by their covariance matrices. Neural networks are also efficient after the training phase is complete. SVMs are similar to neural networks, but generally require more computation due to the comparatively large numbers of support vectors. The time required to compute class membership for an SVM is directly dependent on the number of support vectors. The number of support vectors resulting from the experiments reported here ranged from 140 to 308.


Classifier   Rest   Math   Letter   Rotate   Count   Total   Avg. over 20 windows
LDA          47.3   45.1   51.1     38.8     44.5    44.8    66.0
NN           64.3   47.3   54.7     51.1     47.3    52.8    69.4
SVM          59.4   44.5   52.7     57.0     47.9    52.3    72.0

TABLE I
Percentage of test data correctly classified, broken down by task. The support vector machine in these experiments used the set of parameters which resulted in the highest correct rate of classification among all SVMs tested.

VI. FUTURE WORK

All data used in this study were collected from a single subject during the same day. A logical next step is to test the performance of the classifiers on data collected on later days and to repeat these experiments on data collected from other subjects.

In addition, there have been several other attempts at generalizing kernel-based learners to multi-class classification. Weston and Watkins [24] have extended the theory of SVMs directly into the multi-class domain. Import Vector Machines [28] seem to offer similar performance while using significantly fewer support vectors. Each of these methods provides a slightly different approach to the classification problem, and could offer performance improvements.

VII. FEATURE SELECTION WITH GENETIC ALGORITHMS

  A. Introduction

Both invasive and non-invasive BCI systems produce a very large amount of electrophysiological data. However, only a relatively small percentage of the potentially informative features of the data are utilized. High-resolution analysis of spatial, temporal, and spectral aspects of the data, and allowing for their interactions, leads to a very high dimensional feature space. Leveraging a higher percentage of potential features in the measured data requires more powerful signal analysis and classification capabilities. We have developed an EEG analysis system that integrates advanced concepts and tools from the fields of machine learning and artificial intelligence to address this challenge. One of our initial test applications of the system is the "self-paced key typing" dataset from Blankertz et al. [4], which includes 413 pre-key-press epochs of EEG recorded from one subject.

  B. Method 

The overall system is composed of two main parts: feature composition and feature selection (see Figure 2). Feature composition entails data preprocessing, feature derivation, and assembling all of the features into a single large feature matrix. In this specific experiment, we used a fixed set of six electrodes: F3, F4, C3, C4, CP3, and CP4. We partitioned each trial into 500 ms windows shifted by 100 ms over the entire epoch, zero-meaned the signals, zero-padded them to length 1024, and computed their power spectra at 1 Hz frequency resolution. We used mean power over the standard EEG frequency bands of delta (2–4 Hz), theta (4–8 Hz), alpha (8–13 Hz), beta1 (13–20 Hz), beta2 (20–35 Hz), and gamma (35–46 Hz).
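A sketch of this feature-composition step is shown below (Python/NumPy). The sampling rate of the recordings is not stated here, so it is left as a parameter, and the function name is ours.

import numpy as np

# Standard EEG bands used in the text (Hz); the band edges follow the paper.
BANDS = {"delta": (2, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta1": (13, 20), "beta2": (20, 35), "gamma": (35, 46)}

def band_power_features(window, fs):
    # window: (n_channels, n_samples) EEG segment; fs: sampling rate in Hz.
    # Zero-mean each channel, zero-pad to length 1024, take the power spectrum,
    # then average power within each band, as described in the text.
    x = window - window.mean(axis=1, keepdims=True)
    spec = np.abs(np.fft.rfft(x, n=1024, axis=1)) ** 2
    freqs = np.fft.rfftfreq(1024, d=1.0 / fs)
    feats = []
    for lo, hi in BANDS.values():
        idx = (freqs >= lo) & (freqs < hi)
        feats.append(spec[:, idx].mean(axis=1))   # mean power per channel
    return np.concatenate(feats)                  # 6 bands x n_channels values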

The feature selection part included a support vector machine (SVM) for predicting (classifying) the pressed-key laterality and a genetic algorithm (GA) for searching the space of feature subsets [25], [27]. We used the radial basis function kernel, with gamma = 0.2. The SVM has several advantages over alternative classifiers. Unlike most neural network classifiers, the SVM is not susceptible to local optima. SVMs involve many fewer parameters than neural networks, have built-in regularization, are theoretically well-grounded, and, particularly important for ultimate real-time use in a BCI, are extremely fast.

The GA was implemented with a population of 20, a 2-point crossover probability of 0.66, and a mutation rate of 0.008. Individuals in the population were binary strings, with 1 indicating that a feature was included and 0 indicating that it was not. We used a GA to search the space of feature subsets for two main reasons. First, exhaustive exploration of search spaces with more than about 20 features is computationally intractable (i.e., 2^20 possible subsets). Second, unlike gradient-based search methods, the GA is inherently designed to avoid the pitfall of local optima. We searched over the eleven time windows and six frequency bands, while always including all six electrodes in each case. Thus, the dimensionality of the searchable feature space was 66 (11 × 6). Each time an individual (feature subset) in the GA population was evaluated, we trained and tested the SVM using 10×10-fold cross validation and used the average classification accuracy as the individual's fitness measure.
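The wrapper-style search can be sketched as follows (Python with scikit-learn standing in for the FlexTool GA and OSU SVM toolboxes actually used; a single 10-fold cross validation replaces the 10×10-fold procedure to keep the sketch short, and the GA operators are simplified relative to a production implementation).

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    # Cross-validated SVM accuracy for one binary feature mask.
    if not mask.any():
        return 0.0
    clf = SVC(kernel="rbf", gamma=0.2)              # gamma value from the text
    return cross_val_score(clf, X[:, mask], y, cv=10).mean()

def ga_feature_selection(X, y, pop_size=20, n_gen=50,
                         crossover_rate=0.66, mutation_rate=0.008):
    n_feat = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n_feat)).astype(bool)
    best_mask, best_fit = None, -1.0
    for _ in range(n_gen):
        fits = np.array([fitness(ind, X, y) for ind in pop])
        if fits.max() > best_fit:                   # keep the best subset seen so far
            best_fit, best_mask = fits.max(), pop[fits.argmax()].copy()
        probs = fits / fits.sum()                   # fitness-proportionate selection
        parents = pop[rng.choice(pop_size, size=pop_size, p=probs)]
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):         # 2-point crossover
            if rng.random() < crossover_rate:
                a, b = sorted(rng.choice(n_feat, size=2, replace=False))
                children[i, a:b] = parents[i + 1, a:b]
                children[i + 1, a:b] = parents[i, a:b]
        flips = rng.random(children.shape) < mutation_rate   # bit-flip mutation
        pop = children ^ flips
    return best_mask, best_fit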

C. Results

The GA evolves a population of feature subsets whose corresponding fitness (classification accuracy) improves over iterations of the GA (Figure 3). Note that while both the population-average and best-individual fitness improve over successive generations, only the best-individual fitness does so in a monotonic fashion. The best fitness obtained was a classification accuracy of 76%. It was stable for over 50 generations of the GA. The standard deviation of the classification accuracy produced by the SVM was typically about 6%.

Figure 4 shows the feature subset exhibiting the highest classification accuracy. The feature subset included features from every time window and every frequency band. This suggests that alternative methods that include only a few time windows or frequencies may be missing features that could improve classification accuracy. Furthermore, all frequency bands were included in the third time window, suggesting that early wide-band activity may be a significant feature of the process for deciding finger laterality.


Fig. 2. System architecture for mining the EEG feature space. The space of feature subsets is searched in a "wrapper" fashion, whereby the search is directed by the performance of the classifier, in this case a support vector machine.

  D. Discussion

Although the best classification accuracy (76%) was considerably higher than chance, it was much lower than the approximately 95% classification accuracy obtained by Blankertz et al. [4]. One possible reason is that we used data from only a small subset of the electrodes recorded (6 of 27) in order to reduce computation time by restraining the dimensionality of the feature vector presented to the SVM.

Optimizing classification accuracy was not, however, our primary goal. Instead, we sought insight into the nature of the features that would provide the best classification accuracy. The feature selection method showed that a diverse subset of spectrotemporal features in the EEG contributed to the best classification accuracy. However, most BCIs that use EEG frequency information in imagined or real movement look at only the alpha (mu) and beta bands over only one or a few time windows [20], [19], [26]. Furthermore, the system is amenable to on-line applications. One could use the full system, including the GA, to learn the best dissociating features for a given subject and task, then use the trained SVM with the best dissociating features in real time. Thus, preliminary results from this research suggest that BCI performance could be improved by leveraging advances in machine learning and artificial intelligence for systematic exploration of the EEG feature space.

VIII. CONCLUSIONS

Support Vector Machines provide a powerful method for data classification. The SVM algorithm has a very solid foundation in statistical learning theory, and it is guaranteed to find the optimal decision function for a set of training data, given a set of parameters determining the operation of the SVM. The empirical evidence presented here shows that the algorithm performs very well on one real problem.

Finally, we are currently working with alternative representations of the EEG data. Preliminary results indicate that applying a KL-transform to the raw data produces a data set which is much more amenable to accurate classification by many types of classifiers.

Fig. 3. Classification accuracy (population fitness) evolves over iterations (generations) of the genetic algorithm. The thin line is the average fitness of the population; the thick line is the fitness of the best individual in the population.

Fig. 4. Features selected for the best individual. Black indicates the feature was included in the subset; white indicates it was not. Time windows correspond to the number of 100 ms shifts from epoch onset, i.e., time window 1 is early in the epoch and time window 11 ends 120 ms before the key press.

ACKNOWLEDGMENT

The SVMs used in the classification experiments of Sections II–V were constructed using the SVM MATLAB toolbox developed by Cawley [6]. The SVM used in the feature selection experiments of Section VII was implemented with the OSU SVM Classifier Matlab Toolbox [17]. The GA was implemented with the commercial FlexTool GA software [16].

REFERENCES

[1] A. Aizerman, E. M. Braverman, and L. I. Rozoner. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 1964.
[2] C. W. Anderson, S. V. Devulapalli, and E. A. Stolz. Determining mental state from EEG signals using neural networks. Scientific Programming, 4(3):171–183, 1995.
[3] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 1995.
[4] B. Blankertz, G. Curio, and K. R. Muller. Classifying single trial EEG: Towards brain computer interfacing. In Advances in Neural Information Processing Systems 14, 2002. To appear.
[5] B. E. Boser, I. M. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Fifth Annual Workshop on Computational Learning Theory, 1992.
[6] G. C. Cawley. MATLAB support vector machine toolbox. http://theoval.sys.uea.ac.uk/~gcc/svm/toolbox. University of East Anglia, School of Information Systems, Norwich, Norfolk, U.K. NR4 7TJ, 2000.


[7] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20, 1995.
[8] R. Courant and D. Hilbert. Methods of Mathematical Physics, volumes I and II. Wiley Interscience, 1970.
[9] T. M. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, 1965.
[10] R. Fletcher. Practical Methods of Optimization. John Wiley and Sons, Inc., 2nd edition, 1987.
[11] H. Jasper. The ten twenty electrode system of the international federation. Electroencephalography and Clinical Neurophysiology, 10:371–375, 1958.
[12] W. Karush. Minima of functions of several variables with inequalities as side constraints. Master's thesis, University of Chicago, 1939.
[13] Z. A. Keirn. Alternative modes of communication between man and machine. Master's thesis, Purdue University, West Lafayette, IN, 1988.
[14] Z. A. Keirn and J. I. Aunon. A new mode of communication between man and his surroundings. IEEE Transactions on Biomedical Engineering, 37(12):1209–1214, 1990.
[15] H. W. Kuhn and A. W. Tucker. Nonlinear programming. In Proceedings of the 2nd Berkeley Symposium on Mathematical Statistics and Probability, 1951.
[16] CynapSys LLC. FlexGA. www.cynapsys.com, 2002.
[17] J. Ma, Y. Zhao, and S. Ahalt. OSU SVM Classifier Matlab Toolbox. http://eewww.eng.ohio-state.edu/~maj/osu_svm/, 2002.
[18] J. Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. Transactions of the London Philosophical Society, 1909.
[19] G. Pfurtscheller, C. Neuper, C. Guger, W. Harkam, H. Ramoser, A. Schlogl, B. Obermaier, and M. Pregenzer. Current trends in Graz brain-computer interface (BCI) research. IEEE Transactions on Rehabilitation Engineering, 8(2):456–460, 2000.
[20] J. A. Pineda, B. Z. Allison, and A. Vankov. The effects of self-movement, observation, and imagination on mu rhythms and readiness potentials: Toward a brain-computer interface. IEEE Transactions on Rehabilitation Engineering, 8(2):219–222, June 2000.
[21] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods – Support Vector Learning, pages 185–208. MIT Press, 1998.
[22] J. C. Platt. Using analytic QP and sparseness to speed training of support vector machines. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11. MIT Press, 1999.
[23] J. C. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin DAGs for multiclass classification. In S. A. Solla, T. K. Leen, and K.-R. Muller, editors, Advances in Neural Information Processing Systems 12, pages 547–553, 2000.
[24] J. Weston and C. Watkins. Multi-class support vector machines. Technical report, Royal Holloway, University of London, 1998.
[25] D. Whitley, R. Beveridge, C. Guerra, and C. Graves. Messy genetic algorithms for subset feature selection. In T. Baeck, editor, Proc. Int. Conf. on Genetic Algorithms, Boston, MA, 1997. Morgan Kaufmann.
[26] J. R. Wolpaw, D. J. McFarland, and T. M. Vaughan. Brain-computer interface research at the Wadsworth Center. IEEE Transactions on Rehabilitation Engineering, 3(2):222–226, June 2000.
[27] J. Yang and V. Honavar. Feature subset selection using a genetic algorithm. In H. Liu and H. Motoda, editors, Feature Extraction, Construction and Selection: A Data Mining Perspective, pages 117–136. Kluwer Academic, Boston, MA, 1998.
[28] J. Zhu and T. Hastie. Kernel logistic regression and the import vector machine. In NIPS 2001, 2001.