


Biometric Authentication via Keystroke Sound

Joseph Roth, Xiaoming Liu, Arun Ross
Department of Computer Science and Engineering
Michigan State University, East Lansing, MI 48824
{rothjos1,liuxm,rossarun}@msu.edu

Dimitris Metaxas
Department of Computer Science
Rutgers University, Piscataway, NJ 08854
[email protected]

Abstract

Unlike conventional "one shot" biometric authentication schemes, continuous authentication has a number of advantages, such as longer time for sensing, ability to rectify authentication decisions, and persistent verification of a user's identity, which are critical in applications demanding enhanced security. However, traditional modalities such as face, fingerprint and keystroke dynamics have various drawbacks in continuous authentication scenarios. In light of this, this paper proposes a novel non-intrusive and privacy-aware biometric modality that utilizes keystroke sound. Given the keystroke sound recorded by a low-cost microphone, our system extracts discriminative features and performs matching between a gallery and a probe sound stream. Motivated by the concept of digraphs used in modeling keystroke dynamics, we learn a virtual alphabet from keystroke sound segments, from which the digraph latency within pairs of virtual letters as well as other statistical features are used to generate match scores. The resultant multiple scores are indicative of the similarities between two sound streams, and are fused to make a final authentication decision. We collect a first-of-its-kind keystroke sound database of 45 subjects typing on a keyboard. Experiments on static text-based authentication demonstrate the potential as well as the limitations of this biometric modality.

1. Introduction

Biometric authentication is the process of verifying the identity of individuals by their physical traits, such as face and fingerprint, or behavioral traits, such as gait and keystroke dynamics. Most biometric authentication research has focused on one shot authentication, where a subject's identity is verified only once prior to granting access to a resource. However, one shot authentication has a number of drawbacks in many application scenarios. These include short sensing time, inability to rectify decisions, and enabled access for potentially unlimited periods of time [17]. In contrast, continuous authentication aims to continuously verify the identity of a subject over an extended period of time, thereby ensuring that the integrity of an application is not compromised. It can not only address the aforementioned problems in one shot authentication, but can also substantially enhance the security of systems, such as computer user authentication in security-sensitive facilities.

[Figure 1: a microphone attached to a PC records the keystroke sound, and the system reports "Subject_A is authenticated to be a legitimate user!"]
Figure 1. The concept of keystroke sound as a biometric: The sound of a user typing on the keyboard is captured by a simple microphone attached to the PC and is input to our proposed system, which authenticates the user by verifying if the characteristic of the acoustic signals is similar to that of the claimed identity.

Research on continuous authentication is relatively sparse, and researchers have explored a limited number of biometric modalities, such as face, fingerprint, keystroke dynamics, and mouse movement, for this purpose. However, each of these modalities has its own shortcomings. For example, face authentication demands uninterrupted sensing and monitoring of faces, an intrusive process from a user's perspective. Similarly, a fingerprint sensor embedded on the mouse requires user collaboration in terms of precision in holding the mouse [15]. Although it is nonintrusive, keystroke dynamics utilizes key-logging to record typed texts and thus poses a significant privacy risk in the case of surreptitious usage [13, 2].

In this paper, we consider another potential biometric modality for continuous authentication based on keystroke acoustics. Nowadays, the microphone has become a standard peripheral device that is embedded in PCs, monitors, and webcams. As shown in Figure 1, a microphone can be used to record the sound of a user typing on the keyboard, which is then input to the proposed system for feature extraction and matching. Compared to the other biometric modalities, it has a number of advantages. It utilizes a readily available sensor without requiring additional hardware. Unlike face or fingerprint, it is less intrusive and does not interfere with the normal computing operation of a user. Compared to keystroke dynamics, it protects the user's privacy by avoiding direct logging of any keyboard input. Keystroke acoustics does have the disadvantage of dealing with environmental noise, but proper audio filtering or a directed microphone should help in a noisy environment.

[Figure 2: flowchart with three stages. Training: all acoustic signals are clustered via K-means into a virtual alphabet; the top N digraphs, digraph and letter frequencies, and score distributions are computed. Enrollment: a gallery acoustic signal undergoes temporal segmentation & feature extraction, conversion to virtual letters, and extraction of the digraph statistic, histogram of digraphs, and intra-letter distance to form the biometric template. Authentication: the feature set of a probe acoustic signal is matched via score fusion against the score distributions to produce a Yes/No decision.]
Figure 2. Architecture of the biometric authentication via keystroke sound, where the boldface indicates the output of each stage.

Our technical approach to employing this novel biometric is largely motivated by prior work in keystroke dynamics. One of the most popular features used to model keystroke dynamics is digraph latency [9, 8, 3], which calculates the time difference between pressing the keys of two adjacent letters. It has been shown that the word-specific digraph is much more discriminative than the generic digraph, which is computed without regard to what letters were typed [14]. Assuming that the acoustic signal from a keystroke does not explicitly carry the information of what letter is typed, we propose a novel approach to employ the digraph feature by constructing a virtual alphabet. Given the acoustic signals from all training subjects, we first detect segments of keystrokes, whose mel-frequency cepstral coefficients (MFCC) [4] are fed into a K-means clustering routine. The resultant set of cluster centroids is considered as a virtual alphabet, which enables us to compute the most frequent digraphs (pairs of cluster centroids) and their latencies for each subject. Based upon the virtual alphabet, we also consider a number of other feature representation and scoring schemes. Eventually a score-level fusion is employed to determine whether a probe stream matches the gallery stream. In summary, this paper has three main contributions:

• We propose a novel keystroke sound-based biometric modality that offers non-intrusive, privacy-friendly, and potentially continuous authentication for computer users.

• We collect a first-of-its-kind sound database of users typing on a keyboard. The database and the experimental results are made publicly available so as to facilitate future research and comparison on this research topic.

• We propose a novel virtual alphabet-based approach to learn various score functions from acoustic signals, and a score-fusion approach to match two sound streams.

2. Prior Work

To the best of our knowledge, there is no prior work exploring keystroke sound for the purpose of biometric authentication. As a result, we focus our literature survey on various biometric modalities for continuous authentication, and on other applications of keystroke sound.

Face is one of the most popular modalities suggested for continuous user authentication, with the benefit of using existing cameras embedded in the monitor [15, 11]. However, continuously capturing face images creates an intrusive and unfavorable computing environment for the user. Similarly, fingerprint has been suggested for continuous authentication by embedding a fingerprint sensor on a specific area of the mouse [15]. This can be intrusive since it substantially constrains the way a user operates a mouse. Mouse dynamics has been used as a biometric modality due to the distinctive characteristics exhibited in its movement when operated by a user [12]. However, as indicated in a recent paper [18], more than 3 minutes of mouse movement on average is required to make a reliable authentication decision, which can be a bottleneck when the user does not continuously use the mouse for an extended period of time.

Keystroke dynamics utilizes the habitual patterns and rhythms a user exhibits while typing on a keyboard. Although it has a long history dating back to the use of the telegraph in the 19th century and Morse Code in World War II, most of the prior work still focuses on static text [7, 19, 2], i.e., all users type the same text. Only a few efforts have addressed the issue of free text (i.e., a user can type arbitrary text), which is necessary for continuous authentication [10, 16]. Nevertheless, one major drawback of keystroke dynamics is the fact that it poses a significant privacy risk to the user because all typed texts can be potentially recorded via key logging.

[Figure 3: waveform of a single keystroke over 0 to 0.05 s, with the key press and key release marked; axes are time (sec) vs. sample value.]
Figure 3. A raw acoustic signal containing one keystroke.

As far as recorded keystroke sound is concerned, there has been a series of works on acoustic cryptanalysis in the computer security community. The focus has been on designing advanced signal processing and machine learning algorithms to recover the typed letters from the keystroke sound [1, 20, 6]. One of the main assumptions is that each key/letter, when pressed, emits a slightly different acoustic signal, which motivates us to learn a virtual alphabet for our biometric authentication application. Assuming the success of this line of work in the future, we may leverage it to combine the best of both worlds: keystroke dynamics-based authentication using recovered keys, and enhanced discrimination due to the additional 1D acoustic signals over simple key-logging signals.

3. Our Algorithm

As a pattern matching scheme, our algorithm seeks to calculate a similarity score between a probe sound stream recorded during the authentication stage and a gallery sound stream, whose features have been pre-computed and saved as a biometric template during the enrollment stage. Both sound streams are recorded when a user is typing on a keyboard with minimal background noise, and they are described in detail in Section 4.1. In this section, we present our algorithm for calculating the similarity score between two sound streams.

Figure 2 provides an overview of the architecture of our proposed algorithm. There are three different stages in our algorithm: training, enrollment, and authentication. In each stage, given a raw acoustic signal, we first isolate the keystroke portion from the silent periods in the temporal domain, and MFCC features are extracted from each of the resultant keystroke segments. During the training stage, we perform a K-means clustering of the MFCC features of all keystroke segments, where the centroids of the clusters are considered as the virtual letters in a virtual alphabet. We also extract the digraph and letter frequencies, the top N digraphs, and compute score distributions for fusion. In the enrollment stage, we use the virtual alphabet to convert the MFCC features into virtual letters, from which three different feature sets, viz., digraph statistic, histogram of digraphs and intra-letter distance, are designed to collectively form a biometric template for the user. Finally, in the authentication stage, the score functions are computed for a probe signal, and score-level fusion is used to make the final authentication decision. In the following we will present each component of this architecture in detail.

[Figure 4: aggregate FFT power over 0 to 0.5 s, with keystroke detections where the curve exceeds a threshold; axes are time (sec) vs. power.]
Figure 4. Temporal segmentation by thresholding the FFT power.

3.1. Temporal Segmentation & Feature Extraction

For an acoustic typing signal, g(t), composed of keystrokes and silent periods, the first task before feature extraction is to identify portions in the stream where a keystroke occurs, since it is assumed that the keystroke carries the discriminative information about an individual typist. As shown in Figure 3, a keystroke is composed of a key press and a key release, where the former comprises the sound emitted when a finger touches the surface of a key as well as the sound due to the key being pressed down. We denote a keystroke as ki, with a start time of ti and a duration of L. In our algorithm, L is set to 40 ms, since it covers the full length of most keystrokes.

Motivated by [20], we conduct the temporal segmentation based on the observation that the energy of a keystroke is concentrated in the frequencies between 400 Hz and 12 kHz, while the background noise occupies other frequency ranges. We compute the 5-ms windowed Fast Fourier Transform (FFT) of an acoustic signal using a sliding window approach, where the magnitudes of the outputs in the range of 400 Hz to 12 kHz are summed to produce an aggregate curve of the FFT power. By setting a threshold on the curve, we identify the start of keystrokes, as shown in Figure 4. Ideally this temporal segmentation should detect all keystrokes, instead of the background noise. Our current simple threshold-based method may not achieve this reliably as yet, and in the future we will investigate more advanced methods such as adaptive thresholding or supervised learning approaches.

Once the start of a keystroke ti is determined, we convert the acoustic signal within a keystroke segment, g(ti, ..., ti + L), to a feature representation fi for future processing. The standard MFCC features are utilized due to their wide application in speech recognition and demonstrated effectiveness in key recovery [20]. Our MFCC implementation uses the same parameter setup as [20]. That is, we have 32 channels of the Mel-Scale Filter Bank and use 16 coefficients and 32 filters with 10-ms windows that are shifted by 2.5 ms. The resultant feature of a keystroke ki is a 256-dimensional vector fi.
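A minimal MFCC sketch in the spirit of this setup (mel filterbank followed by a DCT, over 10 ms windows shifted by 2.5 ms) is shown below. The windowing function, the 400 Hz to 12 kHz filter edges, and the lack of padding are our assumptions; note that with no padding a 40 ms segment yields 13 windows (208 dimensions) rather than the 16 windows implied by the paper's 256-dimensional figure:

```python
import numpy as np

def mel_filterbank(n_filters, win, sr, fmin=400.0, fmax=12000.0):
    """Triangular filters evenly spaced on the mel scale (edges are assumptions)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    hz = inv(np.linspace(mel(fmin), mel(fmax), n_filters + 2))
    bins = np.floor((win + 1) * hz / sr).astype(int)
    fb = np.zeros((n_filters, win // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)
    return fb

def keystroke_mfcc(segment, sr=48000, win=480, hop=120, n_filters=32, n_coeffs=16):
    """Concatenate 16 cepstral coefficients per 10 ms window (2.5 ms shift)."""
    fb = mel_filterbank(n_filters, win, sr)
    # DCT-II basis to decorrelate the log filterbank energies
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * k + 1) / (2 * n_filters))
    feats = []
    for start in range(0, len(segment) - win + 1, hop):
        frame = segment[start:start + win] * np.hanning(win)
        power = np.abs(np.fft.rfft(frame)) ** 2
        feats.append(dct @ np.log(fb @ power + 1e-10))
    return np.concatenate(feats)
```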

3.2. Virtual Alphabet via Clustering

Most of the prior work in keystroke dynamics uses digraph statistics, i.e., the mean and standard deviation of the delays between two individual keys or two groups of keys, or trigraphs, the delay across three keys, to model keystroke dynamics. It has also been shown that the word-specific digraph, which is computed on the same two keys, is much more discriminative than the generic digraph [14]. In keystroke dynamics, such word information is readily available since key logging records the letter associated with each keystroke. However, this is not the case in our keystroke acoustic signal. Hence, in order to enjoy the benefit of digraphs in our application, we strive to achieve two goals from the acoustic signal:

• Estimate the starting time of each keystroke;

• Infer or label the letter pressed at each keystroke.

While the first goal is addressed by the temporal segmentation process, the second goal requires knowledge of the pressed key in order to construct word-specific digraphs. However, as shown in acoustic cryptanalysis [6], precisely recognizing the typed key itself is an ongoing research topic. Hence, we take a different approach by aiming to associate each keystroke with a unique label, with the hope that different physical keys will correspond to different labels, and different typists will generate different labels when pressing the same key. We call each label a virtual letter, the collection of which is called a virtual alphabet. Learning the virtual alphabet is accomplished by applying K-means clustering to the MFCC features of all keystrokes in the training set.

Assuming the training set includes streams from J subjects, each with Ij keystrokes, the input to the K-means is {f_i^j}, where j ∈ [1, J] and i ∈ [1, Ij]. The K-means clustering partitions the input samples into K clusters, with centroids m_f(k). We set K to be 30, considering that the total number of common keys on a typical keyboard is around 30.

[Figure 5: two scatter plots of per-subject mean MFCC features projected onto the top-2 principal components; (a) First four virtual letters, (b) All 30 virtual letters.]
Figure 5. The mean MFCC features f̄k of each of 20 subjects within virtual letters, plotted along the top-2 principal components reduced from the original 256-dimensional space. The symbol represents a letter and the color in (a) indicates a subject (best viewed in color).

During all three stages, given an original acoustic signal represented as a collection of keystrokes K = {ki}, we compute the Euclidean distance between the MFCC feature of a keystroke fi and each of the cluster centroids. The centroid with the minimal distance assigns its index to the keystroke as its corresponding virtual letter, i.e., li = k. Thus, in our algorithm a virtual letter is simply a number between 1 and K. Finally, we can represent a keystroke as a triplet ki = {ti, fi, li}.
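The clustering and letter-assignment steps can be sketched as follows. The farthest-point initialization is our assumption (the paper only says K-means is applied), chosen here to keep the sketch deterministic:

```python
import numpy as np

def learn_virtual_alphabet(features, K=30, iters=50):
    """K-means over keystroke MFCC vectors; the centroids act as virtual letters."""
    # Farthest-point initialization (assumption; keeps the sketch deterministic)
    centroids = [features[0]]
    for _ in range(K - 1):
        d = np.min([np.linalg.norm(features - c, axis=1) for c in centroids], axis=0)
        centroids.append(features[d.argmax()])
    centroids = np.array(centroids, dtype=float)
    # Standard Lloyd iterations: assign, then recompute centroids
    for _ in range(iters):
        labels = to_virtual_letters(features, centroids)
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = features[labels == k].mean(axis=0)
    return centroids

def to_virtual_letters(features, centroids):
    """Assign each keystroke the index of its nearest centroid (li = k)."""
    d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)
```

In use, `learn_virtual_alphabet` runs once on the training streams, while `to_virtual_letters` is applied in all three stages to map each keystroke's MFCC vector to its virtual letter.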

3.3. Score Functions

Given the aforementioned feature representation scheme, we next investigate a set of score functions to compute the similarity measure between two sets of features, listed as follows.

Digraph statistic: As a representation of individual typing characteristics, the digraph statistic refers to the statistics of the latency, in terms of mean and standard deviation, within frequent pairs of letters. A pair of letters is called a digraph, with examples such as (t,h) and (h,e). In our algorithm, a virtual alphabet with K letters can generate K^2 digraphs, such as (2,5) and (7,1). During the training stage, we count the frequency of each digraph by passing through adjacent keystrokes, l_{i-1}^j and l_i^j, in the entire training set. Then we generate a list of the top N most frequent digraphs, each denoted as d_n = {k_{n1}, k_{n2}}. Given the {k_i} representation of an acoustic signal, for each of the N digraphs we compute the mean, m_n, and the standard deviation, σ_n, of the time difference variable Δt = t_i − t_{i-1} where l_i = k_{n2} and l_{i-1} = k_{n1}. Finally, the similarity score between two arbitrary-length signals, K and K′, is computed using the following equation:

S_1(\mathcal{K}, \mathcal{K}') = \sum_{n=1}^{N} w_n^1 \Bigg[ \sum_{\Delta t} \sqrt{\mathcal{N}(\Delta t; m_n, \sigma_n^2)\, \mathcal{N}(\Delta t; m_n', \sigma_n'^2)} \Bigg], \qquad (1)

which summarizes the overlapping region between two Gaussian distributions of the same digraph, weighted by w_n^1. The overlapping region is computed via the Bhattacharyya coefficient, and w_n^1 is the overall frequency of each digraph learned from the training set. We set N to 50 in our algorithm.

Input: A stream g(t), top digraphs d_n, cluster centroids m_f(k).
Output: A feature set F.
Locate keystrokes t = [t_1, ..., t_i, ...] at times of high energy using the FFT.
foreach keystroke time t_i do
    f_i = MFCC(g(t_i, ..., t_i + L))
    l_i = argmin_k ||m_f(k) − f_i||_2
foreach digraph d_n do
    T = {t_i − t_{i-1} : l_i = k_{n2} and l_{i-1} = k_{n1}, for all i ∈ [2, |t|]}
    m_n = mean(T), σ_n = std(T)
    Compute the histogram of digraphs h_n via Eqn. (2)
foreach letter k do
    Compute f̄_k via Eqn. (4)
return F = {m_n, σ_n, h_n, f̄_k}
Algorithm 1: Feature extraction algorithm.

Histogram of digraphs: In our virtual alphabet representation, different subjects may produce different virtual letters when pressing the same key. This implies that different subjects are likely to generate different digraphs when typing the same word. Hence, it is expected that we can use the frequencies of popular digraphs as a cue for authentication. Given this, we compute the histogram h = [h_1, h_2, ..., h_N]^T of the top N digraphs, which are the same as the ones in the digraph statistic, for each acoustic signal. That is,

h_n = \frac{\sum_i \delta(l_i = k_{n2})\, \delta(l_{i-1} = k_{n1})}{|\mathcal{K}| - 1}, \qquad (2)

where δ is the indicator function and the numerator is the number of occurrences of the digraph d_n = {k_{n1}, k_{n2}} within a sequence of length |K|. The similarity score between two signals is simply the inner product between the two histograms of digraphs,

S_2(\mathcal{K}, \mathcal{K}') = \mathbf{h}^T \mathbf{h}'. \qquad (3)

Input: A probe stream g′(t), biometric template F, top digraphs d_n, cluster centroids m_f(k), score distributions m_{s_v}, σ_{s_v}, threshold τ.
Output: An authentication decision d.
Compute the feature set F′ for probe g′(t) via Alg. 1.
Compute the digraph statistic score S_1 via Eqn. (1).
Compute the histogram of digraphs score S_2 via Eqn. (3).
Compute the intra-letter distance score S_3 via Eqn. (5).
Compute the normalized score S via Eqn. (6).
if S > τ then return d = genuine
else return d = impostor
Algorithm 2: Authentication algorithm.

Intra-letter distance: We use the virtual letter to represent similar keystrokes emerging from different keys pressed by different subjects. Hence, within a virtual letter, it is very likely that different subjects will have different distributions. Figure 5 provides evidence for this observation by showing the mean MFCC features of 20 training subjects within virtual letters. It can be seen that 1) there is distinct inter-letter separation among virtual letters; 2) within each virtual letter, there is intra-letter separation due to individuality. Hence, we would like to utilize this intra-letter separation in our score function. For an acoustic signal, we compute the mean of fi associated with each virtual letter, which results in K = 30 mean MFCC features, as follows:

\bar{f}_k = \frac{1}{|\{i : l_i = k\}|} \sum_{l_i = k} f_i. \qquad (4)

Given two acoustic signals K and K′, we use Equation (5) to compute the Euclidean distance between the corresponding means and sum them up using a weight w_k^3, which is the overall frequency of each virtual letter among all keystroke segments and is computed from the training set. The negative sign ensures that, on average, a genuine probe has a larger score than an impostor probe.

S_3(\mathcal{K}, \mathcal{K}') = -\sum_{k=1}^{K} w_k^3\, \|\bar{f}_k - \bar{f}'_k\|_2. \qquad (5)
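A sketch of Eqns. (4) and (5) follows; the dictionary-based handling of virtual letters that never occur in one of the two streams is our assumption, since the paper does not say how empty letters are treated:

```python
import numpy as np

def mean_letter_features(features, letters, K=30):
    """Eqn. (4): mean MFCC vector for each virtual letter that occurs."""
    return {k: features[letters == k].mean(axis=0)
            for k in range(K) if np.any(letters == k)}

def s3_score(means_a, means_b, letter_freq):
    """Eqn. (5): negative frequency-weighted Euclidean distance between
    per-letter mean MFCC features; 0 is the best possible score."""
    return -sum(letter_freq.get(k, 0.0) * np.linalg.norm(means_a[k] - means_b[k])
                for k in means_a.keys() & means_b.keys())
```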

So far we have constructed a new feature set for one acoustic signal, denoted as {m_n, σ_n, h_n, f̄_k} where n ∈ [1, N] and k ∈ [1, K]. We summarize the feature extraction algorithm in Algorithm 1. If the acoustic signal is from the gallery stream, we call the resultant feature set the biometric template of the gallery subject, which is computed during enrollment and stored for future matching with a probe stream.

Score fusion: Once the three scores are computed, we fuse them to generate one value to determine whether the authentication claim should be accepted or rejected. In our system we use a simple score-level fusion where the three normalized scores are added to create the overall similarity score between two streams,

S = \sum_{v=1}^{3} \frac{S_v - m_{s_v}}{\sigma_{s_v}}. \qquad (6)


[Figure 6: histogram of the number of subjects vs. years of QWERTY keyboard experience, from 0 to 30 years.]
Figure 6. Distribution of subjects' experience with keyboard.

The score normalization is conducted by using the mean m_{s_v} and standard deviation σ_{s_v} of the score distribution learned from the training data, such that the normalized scores follow a standard normal distribution. We summarize the algorithm during the authentication stage in Algorithm 2. In the future, more advanced fusion mechanisms can be utilized, especially by leveraging previous work in the biometrics fusion domain [5].
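The fusion and decision steps of Eqn. (6) and Algorithm 2 reduce to a few lines; the example threshold τ in the usage below is arbitrary:

```python
def fuse_and_decide(scores, score_means, score_stds, tau):
    """Eqn. (6): z-normalize each score with training statistics, sum the
    normalized scores, and threshold to accept or reject the claim."""
    s = sum((sv - m) / sd for sv, m, sd in zip(scores, score_means, score_stds))
    return ("genuine" if s > tau else "impostor"), s
```

For example, `fuse_and_decide([s1, s2, s3], train_means, train_stds, tau=0.0)` returns the decision together with the fused score S.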

4. Experiments

In this section, we present an overview of the database collected for this experiment, which can be used for other typing-based biometrics. We also present our specific experimental setup and the results of our static-text experiments.

4.1. Keystroke Sound Database

Since keystroke sound is a novel biometric modality without any prior database, we develop a capture protocol that ensures the collected database is not only useful for the current research problem, but also beneficial to other researchers on the general topic of typing-based biometrics.

When developing the protocol, our first goal was to study whether the keystroke sound should be based on static text or free text, i.e., does a subject have to type the same text to be authenticated? The same question was addressed in keystroke dynamics, where there is substantially more research effort on static text than free text [2]. To answer this question, we request each subject to perform typing in two sessions. In the first session, a subject types one paragraph four times (the first paragraph of "A Tale of Two Cities" by Charles Dickens, displayed on the monitor), with a 2–3 second break between trials. Multiple typing instances of the same paragraph enable the study of static text-based authentication. In the second session, a subject types a half-page letter with arbitrary content to his/her family. It can be seen from this session that most users make spontaneous pauses during typing, which mimics well the real-world typing scenario. Normally a subject spends around 5–8 minutes on each session, depending on typing speed.

A second consideration is the recording environment. The background environment plays an important role in the usability of the data. Low-frequency pitches from computers, heaters, and lights can create distractions from the sound of the actual typing. The placement of the sensor relative to the keyboard as well as in the room can also introduce differences due to echoes. All subjects in the study performed the typing with the same computer, keyboard, and microphone, in the same room, under reasonably similar conditions. Care was taken to use non-verbal communication during the experiments to prevent human speech from corrupting the audio stream. Nevertheless, standard workplace noises still exist in the background, e.g., doors opening, chairs rolling, and people walking.

Age                  10–19   20–29   30–39
Number of subjects      11      30       4

Table 1. Age distribution of subjects.

The third consideration is the hardware setup. We use a US standard QWERTY keyboard for data collection. Although there are many available options for microphones, we decide to utilize an inexpensive webcam with an embedded microphone, which is centered on the top of the monitor and pointed toward the keyboard. This setup allows us to capture both the video of hand movement and the audio recording of keyboard typing. Thus, a multi-modal (visual and acoustic) typing-based continuous authentication system can be developed in the future based on this database. The sound is captured at 48,000 Hz with a single channel.

Thus far our database contains 45 subjects with different years of experience in using the keyboard. The subjects are students and faculty at Michigan State University, whose typing experience and age are displayed in Figure 6 and Table 1. The database of keystroke sound is available at http://www.cse.msu.edu/~liuxm/typing, and is intended to facilitate future research, discussion, and performance comparison on this topic.

4.2. Results

We use the static text portion of the database for our experiments and leave the use of the free text for future work. Our algorithm requires a separate training dataset for the purpose of learning a virtual alphabet, top digraphs, and the statistics of score distributions. Hence, we randomly partition the database into 20 subjects for training and the remaining 25 subjects for testing our algorithm.

For each subject, we use the first typing trial as the gallery stream and portions of the fourth typing trial as the probe stream. We choose the first and last paragraphs to maximize the intra-subject differences for the static text. For the probe streams, we need to balance two concerns. Firstly, we want to use as many probes as possible to enhance the statistical significance of our experiments, which requires that we use a shorter partial paragraph. Secondly, we want to use longer probe streams to allow accurate calculation of features for a given subject. We decide to form 7 continuous probe streams for each user by using 70% of the fourth paragraph starting at the 0%, 5%, 10%, 15%, 20%, 25%, and 30% marks of the paragraph. This overlap of text streams allows us to balance both concerns, while also simulating a continuous environment where the algorithm works with an incoming data stream and the results of the current probe build on the results of the previous probes. The average probe length is 62 seconds. The total number of training probe streams is 140, with 3,500 training cases. The total number of testing probe streams is 175, with 4,375 testing cases.

[Figure 7: four histograms of score distributions (Digraph Statistic, Histogram of Digraphs, Intra-letter Distance, Fusion); x-axis: similarity (negative distance for the intra-letter panel); y-axis: % of probes; genuine vs. impostor curves.]
Figure 7. Distribution of the individual score functions, S1, S2 and S3, and the fused score S, for genuine and impostor probes.

Evaluation metrics. We use the standard biometric authentication metric, the ROC curve, to measure performance. The ROC curve has two axes: the False Positive Rate (FPR), the fraction of impostor pairs incorrectly deemed to be genuine, and the True Positive Rate (TPR), the fraction of genuine pairs correctly deemed to be genuine. We use the Equal Error Rate (EER) to succinctly summarize the ROC curve. We also use the probability distributions of the genuine and impostor scores to study the performance.

Feature performance comparison. Figure 7 illustrates the separability of the three score functions and the overall fused score on the testing data. Figure 8 displays the same information in the standard ROC format. We can make a number of observations. Firstly, the individual score function distributions all display overlap between the genuine and impostor probes. The task of identifying a single feature representation scheme to authenticate users via keystroke sounds proves challenging. Digraph statistics, histogram of digraphs, and intra-letter distance provide 31%, 32%, and 33% EER, respectively. Secondly, despite the overlap, there is still some separation between the genuine and impostor probes. We see Gaussian-like distributions for digraph statistics and intra-letter distance, which implies that the proposed score functions will remain useful in real-world applications with unseen data. The histogram of digraphs displays a bimodal Gaussian distribution for the genuine class, which indicates that it will behave on the extremes of either providing very useful information or limited discriminability. Furthermore, by using fusion, we create a new fused score, which produces better results and indicates that the individual score functions capture different aspects of the subject's typing behavior. The fused score has an EER of 25%.
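The EER reported above can be computed with a generic threshold sweep over similarity scores; the sketch below is a standard formulation, not the authors' exact evaluation code:

```python
import numpy as np

def compute_eer(genuine, impostor):
    """Approximate the Equal Error Rate by sweeping an acceptance threshold
    over all observed similarity scores. A pair is deemed genuine when its
    score is >= the threshold; the EER is where FPR and FNR cross."""
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    best_gap, eer = np.inf, 1.0
    for t in np.sort(np.concatenate([genuine, impostor])):
        fpr = np.mean(impostor >= t)   # impostors incorrectly accepted
        fnr = np.mean(genuine < t)     # genuine users incorrectly rejected
        if abs(fpr - fnr) < best_gap:
            best_gap, eer = abs(fpr - fnr), (fpr + fnr) / 2
    return eer
```

Perfectly separated score distributions give an EER of 0; full overlap drives it toward 0.5 (chance level), which frames the 25–33% EERs reported here.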

[Figure 8: ROC curves (TPR vs. FPR) for Digraph Statistic, Histogram of Digraphs, Intra-letter Distance, and Fusion.]
Figure 8. ROC curves of the individual score functions as well as the final fused score.
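The fusion of S1, S2, and S3 into the single score S is defined earlier in the paper; one common way such score-level fusion is realized, offered here only as a plausible sketch (the normalization statistics and weights are our assumptions), is z-normalization of each score using training statistics followed by a weighted sum:

```python
def fuse_scores(scores, train_means, train_stds, weights=None):
    """Z-normalize each per-feature match score with statistics learned on
    the training set, then combine with a weighted sum (weights default to 1)."""
    if weights is None:
        weights = [1.0] * len(scores)
    return sum(w * (s - m) / sd
               for w, s, m, sd in zip(weights, scores, train_means, train_stds))
```

Normalizing before summing keeps one score function (e.g., the large-magnitude intra-letter distances) from dominating the fused decision.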

Computation efficiency. Since a probe stream is a one-dimensional signal, our algorithm can operate in real time with very minimal CPU load, which is a very favorable property for continuous authentication. For a 60-second probe stream, it takes approximately 20 seconds to create the feature representation, with the majority of the time spent identifying the locations of keystrokes. Once the features have been extracted, it takes less than 0.1 seconds to compute the score functions against a biometric template. Future work includes designing an incremental score function. This is important since we would like to perform authentication in an online mode, as the sound stream is continuously received.
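The incremental score function is left as future work in the paper; one way it might look (our own sketch, not the authors' design) is to maintain running digraph-latency statistics that update in O(1) per new keystroke, e.g., via Welford's online algorithm:

```python
class RunningStats:
    """Incrementally maintain the mean and variance of digraph latencies
    (Welford's online algorithm), so score features can be refreshed in
    O(1) as each new keystroke arrives in the audio stream."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0   # sum of squared deviations from the running mean

    def update(self, latency):
        self.n += 1
        delta = latency - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (latency - self.mean)

    @property
    def variance(self):
        return self._m2 / self.n if self.n > 1 else 0.0
```

With one such accumulator per virtual-letter pair, the digraph statistics never need to be recomputed from scratch for each new probe window.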

5. Conclusions

In this paper, we presented keystroke sound as a potential biometric for use in continuous authentication systems. The proposed biometric is non-intrusive and privacy-friendly. It can be easily implemented with standard computer hardware and requires minimal computational overhead. To facilitate this research, we collected both acoustic and visual typing data from 45 individuals. We designed multiple approaches to compute match scores between a gallery and a probe keystroke acoustic stream. In particular, we proposed a fusion of digraph statistics, histogram of digraphs, and intra-letter distances to authenticate a user. We obtained an initial EER of 25% on a database of 45 subjects, which suggests that more sophisticated features have to be investigated in the future. While these are preliminary results, with accuracy far below well-researched biometric modalities such as face or fingerprint, we wish to emphasize that the intent of this feasibility study is to open up a new line of research on typing-based biometrics. We anticipate that other researchers will begin working on keystroke acoustics as a biometric modality, either by itself or in conjunction with other modalities, in continuous authentication applications.

Our future work on this topic will cover the following aspects. First, we will increase the number of subjects in our database to provide more training samples. We will continue to collect data from a larger number of users as well as acquire more samples from the current users to gain a better understanding of the inter- and intra-class variability of this biometric. Second, we will explore methods for better identifying keystrokes through supervised learning or context-sensitive thresholding. This will allow for more robust authentication in the presence of background noise typical of a normal work environment. Third, we will explore the changes that occur when comparing free text, by utilizing the second session of our database. Finally, we will investigate new feature representation schemes to address the intra-class variability of this biometric.

It is interesting to note a potential attack on this biometric system that involves recording a genuine user's typing sounds and then replaying the recording while the impostor is using the computer in a silent manner through the mouse. A means of foiling this attack is to log only the times of keystrokes and ensure alignment of the detected keystroke times from the audio with the actual keystroke times. Alternatively, positioning a webcam toward the keyboard to analyze the video of typing can foil this attack as well. It is also desirable to fully fuse this method with standard keystroke dynamics methods, where the audio and virtual digraphs add extra discriminating information to the physical digraphs.
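The keystroke-time alignment countermeasure might be sketched as below; the 50 ms tolerance and the function interface are illustrative assumptions, not values from the paper:

```python
def times_aligned(audio_times, logged_times, tol=0.05):
    """Anti-replay check: every logged key event must have a detected
    acoustic keystroke within `tol` seconds, and the counts must match.
    Replayed audio while the impostor only uses the mouse yields detected
    keystrokes with no matching logged events, so the check fails."""
    if len(audio_times) != len(logged_times):
        return False
    return all(abs(a - k) <= tol
               for a, k in zip(sorted(audio_times), sorted(logged_times)))
```

Note that this logs only event timestamps, never key identities, so the privacy advantage over key-logging is preserved.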

References

[1] D. Asonov and R. Agrawal. Keyboard acoustic emanations. In Proc. of IEEE Symposium on Security and Privacy, pages 3–11, May 2004.

[2] S. P. Banerjee and D. Woodard. Biometric authentication and identification using keystroke dynamics: A survey. J. of Pattern Recognition Research, 7(1):116–139, 2012.

[3] F. Bergadano, D. Gunetti, and C. Picardi. User authentication through keystroke dynamics. ACM Trans. Inf. Syst. Secur., 5(4):367–397, Nov. 2002.

[4] S. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoustics, Speech and Signal Processing, 28(4):357–366, Aug. 1980.

[5] A. Jain, A. Ross, and S. Prabhakar. An introduction to biometric recognition. IEEE Trans. CSVT, 14(1):4–20, Jan. 2004.

[6] A. Kelly. Cracking Passwords using Keyboard Acoustics and Language Modeling. Master's thesis, University of Edinburgh, 2010.

[7] K. S. Killourhy and R. A. Maxion. Comparing anomaly detectors for keystroke dynamics. In Proc. of 39th International Conference on Dependable Systems and Networks (DSN-2009), pages 125–134, 2009.

[8] F. Monrose, M. Reiter, and S. Wetzel. Password hardening based on keystroke dynamics. International Journal of Information Security, 1(2):69–83, 2002.

[9] F. Monrose and A. Rubin. Authentication via keystroke dynamics. In Proc. of the 4th ACM Conference on Computer and Communications Security, pages 48–56, 1997.

[10] T. Mustafic, S. Camtepe, and S. Albayrak. Continuous and non-intrusive identity verification in real-time environments based on free-text keystroke dynamics. In IJCB, 2011.

[11] K. Niinuma, U. Park, and A. K. Jain. Soft biometric traits for continuous user authentication. IEEE Trans. Information Forensics and Security, 5(4):771–780, Dec. 2010.

[12] K. Revett, H. Jahankhani, S. T. Magalhes, and H. M. D. Santos. A survey of user authentication based on mouse dynamics. In Global E-Security, volume 12 of Communications in Computer and Information Science, pages 210–219. Springer Berlin Heidelberg, 2008.

[13] D. Shanmugapriya and G. Padmavathi. A survey of biometric keystroke dynamics: Approaches, security and challenges. International Journal of Computer Science and Information Security, 5(1):115–119, 2009.

[14] T. Sim and R. Janakiraman. Are digraphs good for free-text keystroke dynamics? In CVPR, Workshop on Biometrics, 2007.

[15] T. Sim, S. Zhang, R. Janakiraman, and S. Kumar. Continuous verification using multimodal biometrics. IEEE T-PAMI, 29(4):687–700, Apr. 2007.

[16] C. C. Tappert, S.-H. Cha, M. Villani, and R. S. Zack. A keystroke biometric system for long-text input. Int. J. of Information Security and Privacy (IJISP), 4(1):32–60, 2010.

[17] M. Turk. Grand challenges in biometrics. Keynote at IEEE Int. Conf. on Biometrics: Theory, Applications and Systems (BTAS 10), Washington, DC, 2010.

[18] N. Zheng, A. Paloski, and H. Wang. An efficient user verification system via mouse movements. In Proc. of the 18th ACM Conference on Computer and Communications Security, pages 139–150, New York, NY, USA, 2011. ACM.

[19] Y. Zhong, Y. Deng, and A. K. Jain. Keystroke dynamics for user authentication. In CVPR, Workshop on Biometrics, 2012.

[20] L. Zhuang, F. Zhou, and J. D. Tygar. Keyboard acoustic emanations revisited. In Proc. of the 12th ACM Conference on Computer and Communications Security, pages 373–382, 2005.