audio captcha[report]


Upload: rohit-chaturvedi

Post on 26-Mar-2015


Page 1: Audio CAPTCHA[report]

Audio CAPTCHA: Existing solutions assessment and a new implementation for VoIP telephony

Chapter 1

INTRODUCTION

With the rapid worldwide growth of VoIP services, the spam issue in VoIP systems becomes

increasingly important, which is why major companies, like NEC and Microsoft, have

already developed mechanisms to tackle Spam over Internet Telephony (SPIT). A serious obstacle

when trying to prevent SPIT is identifying VoIP communications that originate from software

robots (“bots”). Alan Turing’s “Turing Test” paper discusses the special case of a human tester

who wishes to distinguish humans from computer programs. Recently, there has been

considerable interest in applying an alternate form of the Turing Test, the so-called Reverse Turing

Test. The term “Reverse Turing Test” indicates that the tester is not a human but a

machine. In the spam protection world this kind of computer-administered Reverse Turing Test is

also called CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans

Apart). The research interest in this subject has spurred a number of relevant proposals. Commercial

examples include major stakeholders in the field, such as Google and MSN, which require

CAPTCHA (visual or audio) in order to provide services to users. However, there exist computer

programs that can break the CAPTCHA proposed so far.

In this paper, an audio CAPTCHA was developed that is suitable for use in VoIP systems.

Specifically, we first present the background and related work and explain the main aspects of SPIT and

CAPTCHA. Then, we provide the basic requirements of a CAPTCHA, briefly explain why an audio

CAPTCHA is suitable for VoIP systems, and present an algorithm for selecting a suitable

CAPTCHA.

Dept of ISE, BTLIT Page 1

Chapter 2

BACKGROUND

SPIT constitutes an emerging type of threat in VoIP systems. It exhibits several similarities

to email spam. Both spammers and “spitters” use the Internet to target a group of users and

initiate bulk and unsolicited messages and calls. Compared to traditional telephony, IP telephony

provides a more effective channel, since messages are sent in bulk and at a low cost. Individuals can

use spam-bots to harvest VoIP addresses. Furthermore, since call-route tracing over IP is harder, the

potential for fraud is considerably greater.

A CAPTCHA is a method that is widely used to thwart automated spam attacks. The same

technique can be used to mitigate SPIT. Under this approach, each time a callee receives a call from an

unknown caller, an automated Reverse Turing Test would be triggered. The “spit-bot” needs to

solve this test in order to complete its attack. Integrating such a technique into a VoIP system raises

two main issues. First, the CAPTCHA module should be combined with other anti-SPIT controls, i.e.,

not every call should pass through the CAPTCHA challenge, since each CAPTCHA requires

considerable computational resources. A simultaneous triggering of several CAPTCHA challenges

can soon lead to denial of service. Challenges would also cause annoyance to users, if they had to

solve one CAPTCHA for every call they make. Second, a CAPTCHA needs to be friendly and easy

to solve (“pass”) for a human user.
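
As a rough illustration of the first issue, the sketch below shows one way a CAPTCHA module could sit behind other anti-SPIT controls so that only unknown callers are challenged and the number of simultaneous challenges stays bounded. The whitelist and blacklist, the `max_active` cap, and the "defer" outcome are assumptions made for this example; the report does not prescribe them.

```python
from dataclasses import dataclass, field


@dataclass
class CaptchaGate:
    """Decide whether an incoming call must solve an audio CAPTCHA."""
    whitelist: set = field(default_factory=set)  # callers the callee already trusts
    blacklist: set = field(default_factory=set)  # callers rejected by other controls
    max_active: int = 50                         # hypothetical cap on concurrent challenges
    active: int = 0                              # challenges currently in progress

    def decide(self, caller: str) -> str:
        if caller in self.blacklist:
            return "reject"
        if caller in self.whitelist:
            return "accept"        # known caller: no CAPTCHA, no annoyance
        if self.active >= self.max_active:
            return "defer"         # avoid exhausting resources (denial of service)
        self.active += 1
        return "challenge"         # unknown caller: trigger the Reverse Turing Test

    def finished(self) -> None:
        """Call when a challenge ends, whatever its outcome."""
        self.active = max(0, self.active - 1)
```

How a deferred call is handled (for example, sent to voicemail or retried later) is a policy decision that this sketch leaves open.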

2.1. CAPTCHA

A CAPTCHA is a test that most humans should be able to pass, but computer programs

should not. Such a test is often based on hard open AI problems, e.g., automatic recognition of

distorted text, or of human speech against a noisy background. Differing from the original Turing

Test, CAPTCHA challenges are automatically generated and graded by a computer. Since only

humans are able to return a sensible response, an automated Turing Test embedded in a protocol can

verify whether there is a human or a bot behind the challenged computer. Although the original

Turing Test was designed as a measure of progress for AI, CAPTCHA is rather a human-nature

authentication mechanism.

This paper is focused on audio CAPTCHA. These were initially created to enable people who

are visually impaired to register or make use of a service that requires solving a CAPTCHA. Today,

an audio CAPTCHA would be useful to defend against automated audio VoIP messages, as visual

CAPTCHA are hard to apply in VoIP systems, mainly due to the limitations of end-user devices. For

example, nowadays not many people have a home telephony device with a screen capable of

displaying a proper (high resolution) image CAPTCHA. If an adequate CAPTCHA is used, it should

be hard for a spit-bot to respond correctly and thus manage to initiate a call. Also, audio CAPTCHA

seems attractive, as text-based CAPTCHA has been shown to be breakable.

2.2. Related work

As the audio CAPTCHA technology is practically in its infancy, the relevant research work is

currently limited.

Bigham and Cavender demonstrated that existing audio CAPTCHA are clearly more difficult

and time-consuming to complete than visual CAPTCHA (Bigham and Cavender, 2009).

They compared the existing CAPTCHA implementations, but they did not reach

any conclusion on how their characteristics affect the user success rate. They developed and

evaluated an optimized interface for non-visual use, which can be added in-place to an existing audio

CAPTCHA. In their published CAPTCHA evaluation they mentioned that Facebook, Veoh, and

Craigslist use different CAPTCHA; today, all three of them use Recaptcha (Recaptcha Audio

CAPTCHA).

Tam et al. (2008a,b) described a number of security tests of audio CAPTCHA. The authors

used machine learning techniques, which are similar to the ones used for breaking visual CAPTCHA.

They analyzed three audio CAPTCHA taken from popular websites (Google (Google Audio

CAPTCHA), Recaptcha (Recaptcha Audio CAPTCHA), Digg (DIGG)). In some cases they reached

correct solutions with an accuracy of up to 71%. The main issue with this work is that they only

tested the audio CAPTCHA implementations and did not analyze the impact of audio

CAPTCHA characteristics on its performance.

Yan and El Ahmad (2008) worked on the usability issues that should be taken into

consideration when developing a CAPTCHA. Their work does not specifically focus on audio

CAPTCHA, with the exception of a few characteristics (i.e., character set). Their work was concluded

with a framework referring to CAPTCHA usability.

Bursztein and Bethard (2009) developed a prototype audio CAPTCHA decoder, called

decaptcha, which is able to successfully break 75% of the eBay audio CAPTCHA. They described

an automated process for downloading audio CAPTCHA, training the decaptcha bot and finally

solving the eBay CAPTCHA.

Finally, Markkola and Lindqvist (2008) proposed a number of “voice” CAPTCHA for

Internet telephony. However, they did not explain in detail how this could be integrated into an

Internet telephony infrastructure. Also, their work lacks experimentation results.

2.3. A new approach

In this paper, apart from classifying the audio CAPTCHA attributes and evaluating the current

audio CAPTCHA implementations, a new audio CAPTCHA for VoIP environments will be

developed. The proposed CAPTCHA must be easy for human users to solve, easy for a tester

machine to generate and grade, and hard for a software bot to solve. Its performance

will be validated in two ways: by user tests and by a bot configured to solve “difficult” audio

CAPTCHAs. The latter requirement implies that a specific kind of test should be developed; i.e., a

test that is easy to generate but intractable to pass without knowledge that is available to humans but

not to machines. Audio recognition fits in this category. For example, humans can easily identify

words spoken in a noisy environment, whereas this is usually hard for machines (Dusan and Rabiner, 2005; von

Ahn et al., 2008). Specification-wise, a CAPTCHA should ideally be 100% effective at identifying

software bots, but it has been shown (Chellapilla et al., 2005) that a CAPTCHA can be designed to

fight bots with a low failure rate (i.e., <0.1%). In general, a CAPTCHA is effective as long as the

cost of using a software robot remains higher than the cost of using a human, even when the

spammers use cheap labor to solve CAPTCHA (Trend Micro’s TrendLabs).

In order to develop a new audio CAPTCHA, we followed an iterative algorithm: (a) we

selected a set of attributes that are appropriate for audio CAPTCHA, (b) we developed a CAPTCHA

that is based on these attributes, and (c) we evaluated the CAPTCHA by calculating the success rates

of a bot and of a number of users, repeating these steps until the results were adequate (Fig. 1).

Fig. 1. A generic CAPTCHA development process.
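
The iterative process of steps (a) to (c) can be sketched as a simple loop. The function names, the `evaluate` callback, and the two thresholds are hypothetical; the report only states that the cycle is repeated until the user and bot success rates are adequate.

```python
def develop_captcha(candidates, evaluate, min_user_rate=0.9, max_bot_rate=0.05):
    """Try attribute sets in turn until one meets both targets.

    candidates: iterable of attribute sets (step a).
    evaluate:   callable that builds a CAPTCHA from an attribute set (step b)
                and returns (user_success_rate, bot_success_rate) (step c).
    The two thresholds are illustrative values, not figures from the report.
    """
    for attributes in candidates:
        user_rate, bot_rate = evaluate(attributes)
        if user_rate >= min_user_rate and bot_rate <= max_bot_rate:
            return attributes, user_rate, bot_rate
    return None  # no candidate was adequate; new attributes must be selected
```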

Chapter 3

CAPTCHA ATTRIBUTES

A high user success rate is a key factor in deciding whether a new CAPTCHA is effective or

not. This is particularly important in the case of an audio CAPTCHA, as it does not only refer to

VoIP callers, but also to visually impaired users of a VoIP service. Equally important is the bot

success rate, which should be kept to a minimum. Both factors depend on a number of attributes. The

main characteristic of these attributes is that they should all be adjusted in the production procedure

of the CAPTCHA. We classified these attributes into four categories: (a) vocabulary, (b) background

noise, (c) time, and (d) audio production.

3.1. Vocabulary attributes

Audio CAPTCHA designs vary, mainly due to the vocabulary used. Variations depend upon:

(a) the set of characters the audio CAPTCHA consists of, (b) the number of characters of a single

CAPTCHA, and (c) the local settings, e.g., the language that CAPTCHA characters belong to.

3.1.1. Adequate data field

A data field (called ‘‘alphabet’’) is used as a pool for selecting the characters to be included in

an audio CAPTCHA. In order to integrate an audio CAPTCHA into a VoIP system, we chose an

alphabet of ten one-digit numbers, i.e., {0, ..., 9}. Such a choice allows the use of the DTMF method

for answering the audio CAPTCHA. Other examples of audio CAPTCHA that use only digits are the

MSN and the Google ones. Moreover, some CAPTCHA include beep sounds in their vocabulary, so

as to inform the user that the audio CAPTCHA begins. On the other hand, a limited alphabet and

beep sounds may make an audio method quite vulnerable to attacks.
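
A minimal sketch of how a digit-only alphabet pairs with DTMF answers: the challenge is drawn from {0, ..., 9} and the keys pressed by the caller are compared with it. The function names are illustrative, and ignoring non-digit keys such as `#` and `*` is an assumption of this example, not a behavior described in the report.

```python
import secrets

ALPHABET = "0123456789"  # the ten one-digit numbers, all available on a telephone keypad


def new_challenge(length: int) -> str:
    """Draw `length` digits uniformly at random from the alphabet."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def grade(expected: str, dtmf_keys: str) -> bool:
    """Compare the DTMF keys pressed by the caller with the expected digits.

    Non-digit keys such as '#' or '*' are dropped before comparison
    (an assumption made for this sketch).
    """
    pressed = "".join(key for key in dtmf_keys if key in ALPHABET)
    return secrets.compare_digest(pressed, expected)
```

For example, `grade("4071", "4071#")` accepts the answer, while `grade("4071", "4017#")` rejects it.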

3.1.2. Spoken characters variation

In order to make the CAPTCHA solution even harder for a bot to solve, we introduce a

number of different human speakers for each digit of the alphabet. For example, if there are X

different speakers for each character, then there will be X different ways to pronounce each character.

This essentially means that each speaker makes a difference for a bot, but hardly for a human.

Another drawback for a CAPTCHA implementation is the use of a fixed number of

characters. A non-variable number of characters, in combination with a limited alphabet, can make a

CAPTCHA vulnerable to attack. For example, if only 3-digit CAPTCHA are used and a bot can

successfully recognize only 2 of the digits, then it can reach a success rate of ≥10% just by guessing

the remaining digit. On the other hand, if the number of digits of a CAPTCHA is not fixed and a bot

can successfully recognize only 2 of them, then the number of remaining digits is not known to the

bot.
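
The arithmetic behind this argument can be checked with a small toy model. It assumes that the bot guesses every unrecognized digit uniformly at random and, in the variable-length case, that the CAPTCHA length is uniform over a known range and the bot also guesses the length uniformly. These modelling choices are assumptions of this sketch, not results from the report.

```python
from fractions import Fraction


def guess_rate_fixed(total_digits, recognized, alphabet_size=10):
    """Chance of passing when every unrecognized digit is guessed at random."""
    unknown = total_digits - recognized
    return Fraction(1, alphabet_size ** unknown)


def guess_rate_variable(min_digits, max_digits, recognized, alphabet_size=10):
    """Toy model for a non-fixed length: the true length is uniform in
    [min_digits, max_digits] and the bot guesses the length uniformly too."""
    lengths = range(min_digits, max_digits + 1)
    p_len = Fraction(1, len(lengths))
    return sum(p_len * p_len * guess_rate_fixed(n, recognized, alphabet_size)
               for n in lengths)


print(guess_rate_fixed(3, 2))        # 1/10
print(guess_rate_variable(3, 5, 2))  # 37/3000
```

Under these assumptions, moving from a fixed 3-digit CAPTCHA to one with 3 to 5 digits lowers the guessing success from 10% to roughly 1.2%.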

3.1.3. Language requirements

Another important factor is the mother tongue of the users, as it plays a major role in

achieving a high human user success rate. This is particularly important in the case of audio methods,

where identifying spoken characters is hard if the mother tongue of the speaker and that of the

user differ. Therefore, the language should meet the

scope of the specific CAPTCHA implementation. As a good practice, only a few spoken characters

should be used. The CAPTCHA we developed can be adjusted for non-English users, as it is

created dynamically and different characters can be added easily.

3.2. Noise attributes

Noise is yet another important attribute of an audio CAPTCHA, as it can help to increase the

difficulty for an automated procedure to solve it.

3.2.1. Background noise

The background noise, which can be added during the production of a voice message, can

make CAPTCHA particularly resistant to attacks by automated bots. Application of background noise

requires a great variety of such noises to be available. These noises should be rotated in an erratic

manner. In our proposal, instead of developing a repository with noises we chose to proceed with a

dynamic production of them, while ensuring that they are distorted in a random manner. The way

various noises are produced should prevent their easy elimination by automated programs that use

learning techniques ( Tam et al., 2008a). In any case, the final version of the audio message, resulting

from the combined use of different distortion techniques and added noise, should be such that the

majority of users can easily recognize it. In the proposed CAPTCHA there was a real-time distortion,

applied in between the characters, as there appears to be no effective method for evaluating how

people understand digits with distortion.

3.2.2. Intermediate noise

Intermediate noise may prevent an automated program from correctly isolating spoken

characters from a voice message. The developer needs to select the scale at which the intermediate

noise will be applied, because intermediate noise can decrease not only the automated bot success rate

but also that of the user (Festa, 2003). Also, as this noise should have the same characteristics as the

background noise, it should be created dynamically.
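
A minimal sketch of dynamic noise production, assuming audio is held as a list of float samples in [-1.0, 1.0]. The uniform white noise, the amplitude bounds, and the function names are assumptions for illustration; the report does not describe the actual noise generator.

```python
import random


def make_noise(n_samples, rng, min_amp=0.05, max_amp=0.3):
    """Generate fresh white noise whose loudness is itself chosen at random,
    so that no fixed noise clip is ever reused."""
    amplitude = rng.uniform(min_amp, max_amp)
    return [amplitude * rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def mix(speech, noise):
    """Add noise to speech sample by sample, clipping to the valid range."""
    return [max(-1.0, min(1.0, s + n)) for s, n in zip(speech, noise)]


rng = random.Random()                       # seed it only when reproducibility is needed
background = make_noise(8000, rng)          # e.g. one second at 8 kHz, under the digits
gap_noise = make_noise(2000, rng)           # intermediate noise between two digits
```

The same generator can serve both background and intermediate noise, which keeps their characteristics alike, as the text requires.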

3.3. Time attributes

A set of variables should be defined during the production of an audio snapshot. The variables

refer to the length of the audio message, which depends on: (a) the number of characters spoken, (b)

the characters chosen, and (c) the time required for each character to be announced, which in turn

depends on the speaker of each character. Both the beginning and the end of each spoken character

should also be defined. This depends on the duration of each character, as well as on the duration of

the pause between spoken characters. If the above time parameters follow specific patterns, then the

resistance of the audio CAPTCHA to a bot will decrease significantly. In the proposed CAPTCHA

we aim at eliminating such time-related patterns.
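
One way to avoid time-related patterns is to randomize both the speaker and the pause before every character. The sketch below assumes a hypothetical `clip_seconds` mapping from speaker to per-digit clip duration and illustrative pause bounds; none of these values come from the report.

```python
import random


def build_timeline(digits, clip_seconds, rng, min_pause=0.2, max_pause=1.0):
    """Return a list of (digit, speaker, start_time_in_seconds) tuples.

    clip_seconds: {speaker: {digit: duration_in_seconds}} (hypothetical data).
    A random pause precedes every digit and a random speaker announces it,
    so neither the start times nor the total length follow a fixed pattern.
    """
    timeline, t = [], 0.0
    speakers = sorted(clip_seconds)
    for digit in digits:
        t += rng.uniform(min_pause, max_pause)   # randomized gap before the digit
        speaker = rng.choice(speakers)           # randomized speaker per digit
        timeline.append((digit, speaker, round(t, 3)))
        t += clip_seconds[speaker][digit]        # advance by this speaker's clip length
    return timeline
```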

3.4. Audio production attributes

In principle, an audio CAPTCHA production procedure should be automated. In practice, an

acceptable human interference could be allowed only for the adjustment of the various thresholds.

3.4.1. Automated production process

The automation of the CAPTCHA production process is a desirable, though hard to achieve,

property. The various elements that compose an audio CAPTCHA, such as the number of characters

of a message, the speaker of each character, the background sound, the timing and the distortion of

the message, make the process time-costly and demanding in terms of hardware resources. Our

choice is to produce audio CAPTCHA periodically, in order: (a) not to produce them in real-time, and

(b) not to produce identical snapshots for extended time periods.

3.4.2. Audio CAPTCHA reappearance

An audio CAPTCHA should reappear as rarely as possible. However, with short alphabets

every CAPTCHA is actually expected to reappear after a while. Due to the attributes of the voice

messages (e.g., technical distortion, added noise, language, speakers, etc.), as well as to the context of

the user (e.g., noisy environment, etc.), a voice message sometimes cannot be identified by the user

on the first attempt. Therefore, a second chance should be given. In this case, a different CAPTCHA

should be used.
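
The second-chance rule can be sketched as follows. The `ask` callback stands for whatever component plays a code to the caller and collects the answer; it is hypothetical, as are the default code length and the limit of two attempts.

```python
import secrets


def fresh_code(used, length=4):
    """Draw a digit code that has not been used earlier in this call."""
    while True:
        code = "".join(secrets.choice("0123456789") for _ in range(length))
        if code not in used:
            return code


def run_challenge(ask, max_attempts=2, length=4):
    """Give the caller up to `max_attempts` tries, each with a different CAPTCHA.

    ask: callable taking the code to announce and returning the digits
         the caller keyed in (hypothetical interface).
    """
    used = set()
    for _ in range(max_attempts):
        code = fresh_code(used, length)
        used.add(code)
        if ask(code) == code:
            return True
    return False
```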

3.4.3. Audio CAPTCHA reproduction

An audio CAPTCHA should be reproduced in a streaming way. The main reason for this is

that most of the bots need a training session before they are able to solve a CAPTCHA. Therefore, if

the audio reproduction process is not streaming, then the bot could easily download all audio

CAPTCHA that are needed for the training session.

Fig. 2 refers to all the attributes of an audio CAPTCHA.

Fig. 2 Audio CAPTCHA attributes.

Chapter 4

AUDIO CAPTCHA EVALUATION

In this section we evaluate some popular audio CAPTCHA using the above-mentioned

characteristics. First, we collected twelve (12) different audio CAPTCHA, not only from popular

websites (e.g., Google, Hotmail, Recaptcha), but also from other sources (Secure Image CAPTCHA).

For each of them we downloaded 100 examples (in .wav or .mp3 format), resulting in a total of 1200

audio files that were used for the evaluation.

Then, for each audio CAPTCHA we provided a short description of its functionality. We

summarized the results in a table that includes all these CAPTCHA, together with their attributes.

Two interesting points, regarding our analysis, are:

1. The user success rate was calculated by inviting 10 users to solve 5 CAPTCHA of each

implementation. All CAPTCHA were in English, which was the mother tongue of one (1) of the

participants (as a requirement, all users had to speak English). All users had a university degree.

Also, they all use a PC for more than 20 h/week.

2. The “automated creation” attribute was not filled in for the commercial CAPTCHA (Google,

MSN), as their relevant algorithms are not publicly available.
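
With 10 users solving 5 CAPTCHA each, every implementation is scored over 50 attempts. A small sketch of that computation follows; the nested-list layout and the sample outcomes are made up for illustration and are not measurements from the report.

```python
def success_rate(results):
    """results: one list per user, holding True/False for each attempt."""
    attempts = [ok for user in results for ok in user]
    return sum(attempts) / len(attempts)


# Hypothetical outcome: 9 users solve all 5 CAPTCHA, 1 user solves none.
sample = [[True] * 5 for _ in range(9)] + [[False] * 5]
print(success_rate(sample))  # 0.9
```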

4.1. Google

The Google Audio CAPTCHA uses a limited data field of ten digits (0, ..., 9), which does not seem

adequate for every situation; however, it is suitable for a VoIP system. The number of digits for each

audio CAPTCHA is not fixed, but it ranges from 5 to 10 digits. Moreover, this CAPTCHA is

available in multiple languages. This CAPTCHA uses background and intermediate noise. The noise

at the beginning is louder and then a different speaker is used for the announcement of each character.

In addition, the duration of a CAPTCHA ranges from 20 to 50 s (based on our Google Audio

sample). Google uses three beeps every time an audio CAPTCHA begins. These beeps make the

audio CAPTCHA vulnerable to attacks because it is much easier for a bot to know when a

CAPTCHA begins. Furthermore, Google Audio CAPTCHA is announced twice in every audio file,

therefore an attacker can process it twice and has multiple attempts to find the right answer. Finally,

the most important drawback is the user success rate, which is not adequately high.

4.2. MSN

The MSN Audio CAPTCHA uses a limited data field of ten (10) digits, with a fixed number

of spoken characters (10) in each one. The frequency of the spoken characters varies, since a number

of different speakers are used. That makes MSN Audio CAPTCHA vulnerable to attacks. Also, it is

available in multiple languages. MSN uses weak and constant background noise. The distance

between the words is, to a large extent, constant. Moreover, the duration of the CAPTCHA is not

always the same (e.g., one CAPTCHA lasts 7 s, another 16 s). There are no beeps at the

beginning of this audio CAPTCHA. The main advantage of MSN Audio CAPTCHA is that it is easy for a

user to understand. As a result, the user success rate is high.

4.3. Recaptcha

The Recaptcha Audio CAPTCHA uses a large data field that includes various phrases.

Therefore, the number of spoken words varies and it is available only in English. Recaptcha uses no

background noise. On the other hand, it uses distortion techniques and multiple speakers, with

different pronunciation and different pace. The user can hear the audio CAPTCHA twice in one audio

file (like Google). Recaptcha does not use beeps. The duration of this CAPTCHA is almost fixed.

Moreover, the user success rate is significantly low. Recaptcha Audio CAPTCHA meets most of the

requirements for an effective tool. Its main drawbacks are the vocabulary (includes more than digits),

as well as the user success rate, which is low. The latter happens because it does not seem easy for a user

to understand the words and their combination.

4.4. eBay

The eBay Audio CAPTCHA has a limited data field of ten (10) digits (0–9). The number of

spoken characters is always six (6). The CAPTCHA uses different speakers and it is available in

several languages, depending on the specific eBay sites (e.g., the digits in www.ebay.fr are

pronounced in French). Moreover, there is a different background noise for each digit, but there is no

intermediate noise. Finally, the duration of the CAPTCHA, as well as the speaker pace, are both

fixed. The main advantages of this implementation are the high user success rate, the lack of beeps at

the beginning or end of the CAPTCHA, and its streaming reproduction.

4.5. Secure Image CAPTCHA

Secure Image CAPTCHA uses an adequate data field of digits (0–9) and letters (A–Z). The

number of spoken characters is fixed and it is available only in English. On the other hand, this

CAPTCHA uses the same speaker all the time. Moreover, it uses simple background noise and there

is no intermediate one. Also, the CAPTCHA duration and the speaker pace are fixed. Secure Image

CAPTCHA is an open-source free PHP CAPTCHA script; therefore most of the attributes can be

fine-tuned. However, there is no functionality allowing the automated production of new CAPTCHA

instances. The main advantage of this implementation is the high user success rate.

4.6. Mp3Captcha

This CAPTCHA (Mp3Captcha) uses an adequate data field of digits (0–9) and letters (A–Z).

Also, it is available in multiple languages, which is very helpful for non-English users. Moreover, it

does not use beeps at the beginning or specific extra tokens that help the bot understand when the

characters of the CAPTCHA are announced. On the other hand, the speaker is only one, which makes

it easy for a computer-based audio recognition tool to correctly identify it. Additionally, there are no

background noise or distortion techniques. The duration of the CAPTCHA is fixed and the time for

solving the CAPTCHA is short. Furthermore, it uses a specific number of spoken characters and the

pace is fixed. Finally, the main advantage is that the user success rate is high.

4.7. Captchas.net

The Captchas.net audio CAPTCHA (Captchas.net) uses letters and digits. Also, this

implementation is friendly to non-English users, as it is available in the most popular languages.

When a character in the CAPTCHA is a letter, then a word is announced and the requested answer is

the first letter of this word. For example, if the announced word is “horse”, then the requested

character is “h”. The number of spoken characters is fixed; therefore the CAPTCHA is vulnerable to

attacks. The implementation uses distortion techniques and NATO pronunciation, but no background

noise. The speaker is always the same person. The pace and the duration of the CAPTCHA are fixed.

There are no beeps at the beginning and no extra tokens. The user success rate is high and the

duration for solving the CAPTCHA is short.

4.8. Bokehman

Bokehman’s (Bokehman Audio CAPTCHA) data field includes numbers (0–9), letters (A–Z),

and some extra tokens. These tokens are the words “capital” and “lower”, which the user hears

before the announcement of each character, so as to understand whether the following letter is

lowercase or uppercase. The use of extra tokens makes the CAPTCHA vulnerable, because a bot can

identify them easily and understand when to expect each character. Moreover, it is available only in

English. The implementation does not use background noise or distortion techniques. The spoken

characters are always four (4). Finally, it always uses the same speaker, the same pace, and the same

duration. The user success rate is high, but the implementation suffers drawbacks, due to the use of

mainly static characteristics.

4.9. Slashdot

Slashdot audio CAPTCHA (Slashdot) uses a strong data field that contains letters (A–Z) and

words. First the speaker says the whole word and then he/she spells it. This makes the CAPTCHA

solution easier for the users. Moreover, each word contains a different number of characters, which

makes the CAPTCHA even harder. Also, this implementation does not use extra tokens or beeps at

the beginning. On the other hand, it is available only in English, it does not use background noise, the

speaker is always the same and the duration of each CAPTCHA is almost fixed. Additionally, these

CAPTCHA reappear often. There is no available information about their production process. Finally,

we should mention that the user success rate is one of the highest (95%).

4.10. Authorize

Authorize audio CAPTCHA (Authorize) data field uses digits (0–9) and letters (A–Z). The

number of spoken characters is fixed. There is no use of beeps or extra tokens. On the other hand, it is

available only in English. Moreover, there is no background noise and no use of distortion

techniques, which makes the CAPTCHA vulnerable to attacks. Also, the speaker is always the same

and the duration is fixed. Finally, it is easy for a user to understand.

4.11. AOL

AOL audio CAPTCHA (AOL) data field uses letters (A–Z) and digits (0–9). The number of

spoken characters is fixed. There are two speakers. One says some characters and the other the rest.

The sequence is not specific but changes as one passes from one CAPTCHA to another. It is available

only in English. It uses voices for background noise, but no distortion techniques. The duration is

fixed. It does not use extra tokens. It uses three (3) beeps not only at the beginning, but also at the end

of the CAPTCHA. This makes the CAPTCHA vulnerable to attacks, as a bot can be programmed to

identify when the CAPTCHA starts and ends. Finally, this CAPTCHA implementation is easy for a

user to understand.

4.12. Digg

The last audio CAPTCHA is Digg (DIGG). It uses an adequate data field of digits (0–9) and

letters (A–Z). The number of spoken characters is fixed (i.e., 5). Moreover, it is available only in

English. Digg uses a constant background noise, which is louder at the end. It also uses a pause

before the announcement of each character. The speaker is the same and the duration of the

CAPTCHA is fixed. Digg’s developers suggested a way to defeat a bot; i.e., they randomly put a

sound in an audio CAPTCHA (the background noise for every character), without including any

character. However, this is not hard for a bot to identify (this sound is always the same) and just

ignore it. This implementation is easy for a user to understand.

Chapter 5

CAPTCHA BOTS

Given the user success rate of a CAPTCHA, one has to test it against automated

audio recognition tools. In this paper a state-of-the-art open-source speech recognition

tool (SPHINX) was used (Walker et al., 2004; SPHINX). In addition, a frequency and

energy peak detection bot, called devoicecaptcha (Defeating Audio (Voice)

CAPTCHA), was also utilized. The criteria for selecting those two bots were: (a) they

have a proven record for audio CAPTCHA solving, especially the devoicecaptcha bot

(Bursztein and Bethard, 2009), (b) they are widely used, and (c) both can easily be

adapted to a VoIP environment.
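
To give a feel for what energy peak detection involves, the sketch below splits a signal into fixed-size frames, computes the energy of each frame, and reports the sample ranges whose energy exceeds a threshold. It is a generic illustration and not the devoicecaptcha algorithm; the frame length and threshold are arbitrary assumptions.

```python
def frame_energy(samples, frame_len):
    """Sum of squared samples for each complete frame."""
    return [sum(x * x for x in samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def find_segments(samples, frame_len=160, threshold=1.0):
    """Return (start, end) sample indices of runs of high-energy frames,
    i.e. the candidate positions of spoken characters."""
    segments, start = [], None
    energies = frame_energy(samples, frame_len)
    for i, energy in enumerate(energies):
        if energy >= threshold and start is None:
            start = i                                    # a loud run begins
        elif energy < threshold and start is not None:
            segments.append((start * frame_len, i * frame_len))
            start = None                                 # the loud run ends
    if start is not None:                                # signal ended while loud
        segments.append((start * frame_len, len(energies) * frame_len))
    return segments
```

Constant pauses and beeps make this kind of segmentation easier, which is why the attributes in Chapter 3 aim to randomize timing and fill the gaps with noise.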

5.1. Automatic speech recognition bots

Automatic speech recognition (ASR), also known as computer speech

recognition, is the technology that makes it possible for a computer to identify the

components of human speech. The process begins with a spoken utterance being

captured by a microphone or an audio file and ends with the recognized words being

output by the system. In particular, the basic function is to convert the spoken word to

properly encoded data that can be recognized by a computer. The ultimate aim of this

technology is to identify, in real time and with a degree of success close to 100%,

words spoken by humans, regardless of the size of the vocabulary, noise levels, speaker

characteristics such as pronunciation, and the conditions of the channel through

which the human voice is transmitted.

On a practical level, ASR tools can achieve a high performance when used in

controlled conditions. These limitations are usually associated with the discharge of

adding additional information not related to the acoustic signal recognition. Thus, the

environments in which high degree of success is achieved are characterized usually by

the absence of any form of noise or distortion technique. Depending on the extent of

the various restrictions, there are different levels of performance. The closer the

conditions are to the optimal ones the higher the performance is. In order for an ASR to

work, it has to build a Speech Recognition Engine (SRE). An SRE requires two types

of files to recognize speech:

An acoustic model, which is created by taking audio recordings of speech and

‘compiling’ them into statistical representations of the sounds that make up

each word (this process is called ‘training’).

A language model, or grammar file, which is a file that contains the available

vocabulary and the probabilities of the words’ sequences.
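
To make the idea of a language model more concrete, the following is a minimal sketch of a digits-only bigram model that stores the vocabulary and the probability of each word following another. It is not the file format used by SPHINX; the vocabulary list, the add-one smoothing, and the two training utterances are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical digits-only vocabulary (names chosen for illustration).
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def train_bigram_model(sequences, vocabulary):
    """Estimate P(next word | previous word) with add-one smoothing."""
    counts = defaultdict(Counter)
    for sequence in sequences:
        for previous, following in zip(sequence, sequence[1:]):
            counts[previous][following] += 1
    model = {}
    for previous in vocabulary:
        # Add-one smoothing: every word gets one extra pseudo-count.
        total = sum(counts[previous].values()) + len(vocabulary)
        model[previous] = {
            word: (counts[previous][word] + 1) / total for word in vocabulary
        }
    return model


# Two made-up training utterances.
bigram_model = train_bigram_model(
    [["six", "nine", "two"], ["six", "nine", "four"]], DIGIT_WORDS
)
print(bigram_model["six"]["nine"])  # (2 + 1) / (2 + 10) = 0.25
```

With such a small vocabulary, every row of the model has only ten entries, which is one way to see why digit recognition needs far less data than large-vocabulary recognition.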

When the vocabulary is limited, an ASR system requires no training to recognize a small number

of words (e.g., the ten digits) as spoken by most speakers. Such systems are popular for

routing incoming phone calls to their destinations in large organizations. This is why

we used these tools without involving special training.

5.1.1. SPHINX

There are various ASR methods/tools to recognize the words spoken in an audio

file. Among the most known ones are the Hidden Markov Model Toolkit ( HTK) and

SPHINX (the latest version of Carnegie Mellon University’s repository of Sphinx

speech recognition systems was developed by CMU, SUN Microsystems and

Mitsubishi Electric Research Laboratories). We decided to use SPHINX because it is

open-source – thus easily configurable – and has a large community of developers who

use and maintain it. The HTK is developed at the Cambridge University Engineering

Department, but its source has not been made publicly available. The major advantage

of SPHINX is that it has pluggable language/grammar and acoustic models.

In our test environment, we used a language model called HUB4. It uses a large

vocabulary and a customized acoustic model, which expects not only digits. Other

language models like TIDIGITS were not used, because they cannot handle random

distortion, even though their vocabulary consists only of digits.

5.2. Frequency and energy peak detection bots

The second category for an automated audio recognition bot employs frequency

and energy peak detection methods. It can be used for solving audio CAPTCHA, for

the following reasons:

Such bots have been proven effective: demonstrative (though perhaps not thorough

enough) tests of such bots against popular audio CAPTCHA implementations

have been successful (Defeating Audio (Voice) CAPTCHA; Breaking Gmail’s

audio CAPTCHA) (e.g., SPIT prevention infrastructures, registrations for

visually impaired people, etc.).

Such bots are easy to implement: frequency and energy peak detection bots are

comparatively easy to implement using open-source software.

Such bots require limited time to solve a CAPTCHA: fast

CAPTCHA solving is required because most services leave a small time frame

for their users to solve the tests (5–15 s), especially when VoIP services are

considered. The CAPTCHA solving bot must analyze and reform the solution to

the desired form (SIP message, DTMF, etc.) within a limited time frame.

Such bots require a small amount of system resources: an automated SPAM attack is

chosen when its cost is lower than employing humans. Also, a ‘‘spitter’’ performs

multiple attacks simultaneously (e.g., the goal is to initiate SIP calls or messages in

parallel). Thus, a bot must be inexpensive in terms of system resources, which will

allow the spammer/spitter to run several instances of the bot at the same time.

Regarding time constraints, frequency and energy peak detection processes are less

demanding than approaches using different methods, such as Hidden Markov Models

(HMM) ( HTK).

There are certain drawbacks when using these bots. This is mainly due to the fact

that they require a training session. In this session a user identifies a number of

selected CAPTCHA. Then, he/she recognizes the announced characters and records

them in a file, from which the bot receives the data to solve the CAPTCHA. The set of

training audio CAPTCHA might be extensive, if the CAPTCHA data field (alphabet)

is long. However, in a VoIP system the available alphabet is relatively small as it

contains only digits (0–9), which increases the applicability of the mechanism.

5.2.1. The bot used

For the purpose of this paper we used the devoicecaptcha bot developed by

Vorm ( Defeating Audio (Voice) CAPTCHA). This bot uses frequency analysis and

energy peak detection, in order to segment and solve an audio CAPTCHA in real-time.

The bot works as follows: first it reads the audio file and skips as many starting bytes

as the user has predefined (to avoid the starting bells that some implementations have,

e.g., Google). Then, the samples are processed with a Hamming window defined by the

user. Each block is transformed into the frequency domain using the Discrete Fourier

Transform. The frequencies are put in a predefined number of bins (the bins are

not equally wide, the higher the frequency the larger the band). After that, the bot

looks at the highest frequency bin. Every block that has more energy in a window than

the pre-defined threshold energy is considered a peak (see Fig. 3). These peaks are

used to segment the audio file in the different spoken digits. Then the bot looks for a

number of windows around the peaks and prints all the frequency bins. This is the

profile of the digit. The profiles of the digits are then compared to the ones in the

training file. The closest match is chosen as a possible guess for each digit.
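
The segmentation steps described above can be sketched as follows. This is not the devoicecaptcha source code; it is a minimal illustration that assumes non-overlapping blocks, a naive DFT, hypothetical bin edges, and a synthetic test signal. All function names and parameter values are invented for the example.

```python
import cmath
import math


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft_magnitudes(block):
    """Magnitudes of the first half of a naive Discrete Fourier Transform."""
    n = len(block)
    return [
        abs(sum(block[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2)
    ]


def bin_energies(magnitudes, edges):
    """Sum the energy into bins; the edges may define bins of unequal width."""
    return [sum(m * m for m in magnitudes[lo:hi]) for lo, hi in zip(edges, edges[1:])]


def find_peak_blocks(samples, block_size, edges, threshold, skip=0):
    """Return indices of blocks whose highest-frequency bin exceeds the threshold."""
    window = hamming(block_size)
    peaks = []
    starts = range(skip, len(samples) - block_size + 1, block_size)
    for index, start in enumerate(starts):
        block = [s * w for s, w in zip(samples[start:start + block_size], window)]
        if bin_energies(dft_magnitudes(block), edges)[-1] > threshold:
            peaks.append(index)
    return peaks


def group_segments(peaks):
    """Merge consecutive peak blocks into (first, last) segments."""
    segments = []
    for p in peaks:
        if segments and p == segments[-1][1] + 1:
            segments[-1][1] = p
        else:
            segments.append([p, p])
    return [tuple(s) for s in segments]


# Synthetic signal: two tone bursts ("digits") separated by silence.
BLOCK = 32
EDGES = [0, 2, 4, 8, 16]  # hypothetical bins, wider at higher frequencies
tone = [math.sin(2 * math.pi * 12 * t / BLOCK) for t in range(BLOCK)]
silence = [0.0] * BLOCK
signal = silence * 2 + tone * 3 + silence * 2 + tone * 3 + silence * 2
peaks = find_peak_blocks(signal, BLOCK, EDGES, threshold=10.0)
print(group_segments(peaks))  # [(2, 4), (7, 9)]
```

Each resulting segment would then be turned into a profile of bin energies and compared against the training file.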

Fig. 3 – Audio analysis of the bot.

During the training session of the bot the user gives as input to the bot an audio

CAPTCHA. Then, for each profile of the digit that the bot chooses the user enters

which digit it actually was (this procedure can be automated if the user gives a name to

the audio files accordingly, e.g., if an audio CAPTCHA file includes digits 6, 9 and 2,

the file name can be ‘‘692.wav’’). The larger the number of audio CAPTCHA in the

training set is, the higher the bot’s success ratio would be.
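
The automated labelling and the closest-match step can be sketched as follows. The squared Euclidean distance and the two-value profiles are assumptions for illustration; the text only states that the closest match is chosen, not how closeness is measured.

```python
import os


def labels_from_filename(path):
    """Read the digit labels from a file name such as '692.wav'."""
    stem = os.path.splitext(os.path.basename(path))[0]
    if not (stem.isascii() and stem.isdigit()):
        raise ValueError("file name must consist of digits only: " + path)
    return [int(ch) for ch in stem]


def squared_distance(a, b):
    """Squared Euclidean distance between two profiles (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest_digit(profile, training):
    """Return the digit whose stored profile is closest to the given one."""
    return min(training, key=lambda item: squared_distance(profile, item[1]))[0]


# Made-up profiles for the three digits announced in '692.wav'.
training_profiles = [[1.0, 0.2], [0.1, 0.9], [0.5, 0.5]]
training = list(zip(labels_from_filename("captchas/692.wav"), training_profiles))
print(closest_digit([0.15, 0.8], training))  # 9
```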

Chapter 6

CAPTCHA APPLICABILITY FOR VOIP ENVIRONMENT

In this section we discuss which of the CAPTCHA in Section 4 could be candidates

for anti-SPIT purposes. The only requirement this CAPTCHA should have is that the

vocabulary should be limited to digits {0, ..., 9}, as the audio CAPTCHA will be used

for an SIP-based VoIP system, where DTMF signals need to be sent. Sending letters to

answer a CAPTCHA could be difficult for an average user. Not many users can write

3–4 letters with a phone keyboard (e.g., pressing multiple times a key to get the letter)

in a short time period. An implementation of this kind should not ignore or

underestimate the digital divide.

Based on the algorithm introduced in Section 2, the user success rate should be

high (>80%). The Google and Recaptcha CAPTCHA cannot meet this requirement.

Nearly the same user success rates were also presented by Bigham and Cavender

(2009). Moreover, the Recaptcha uses phrases (not digits).

DIGG, AOL, Slashdot and Authorize do include characters other than digits.

They are also not open-source; therefore, their data field cannot be altered. As

for the eBay audio CAPTCHA, it has already been ‘‘cracked’’.

The problem with MSN CAPTCHA is the number of digits each one includes.

In the user tests that we performed with normal phones, the user success rate

decreased significantly, from 80% to 25%, because it was not easy for a user to type

the digits and hear the CAPTCHA at the same time, or to remember all 10 digits and

type them after the CAPTCHA ends. MSN CAPTCHA can be of practical use only if

the telephone device has a microphone and a headphone separated from the telephone

keyboard.

The remaining CAPTCHA implementations ( Secure Image CAPTCHA;

Captchas.net; Bokehman Audio CAPTCHA; Mp3Captcha) could be, in principle, used

for anti-SPIT purposes. Even though their vocabulary contains letters, this can be

changed to only digits because they are open-source. However, in practice only the

Secure Image CAPTCHA and the Captchas.net can be taken into account, because

Bokehman and Mp3Captcha are very similar to the Captchas.net (i.e., no background

noise) and they are both more vulnerable to attacks (they use only one speaker ( Tam et

al., 2008a; Chan, 2003)).

6.1. Evaluation of selected audio CAPTCHA

At this stage, we have decided upon the two selected CAPTCHAs. The next step

was to evaluate them against the two bots presented in Section 5.

For the devoicecaptcha bot we had to create a training session, because it works

with a comparison to a training set. We took 50 audio files of each CAPTCHA as a

training set and tested it with the remaining 50 audio files. The result was a clear defeat

of the two CAPTCHA, as the bot had a 77% success rate for the Secure Image

CAPTCHA and an 81% success rate for Captchas.net. Both success rates are high

enough for the two CAPTCHA to be considered ineffective.

For the SPHINX test environment a small custom application was created, in

order to decode multiple wav files in batch form and send to output the corresponding

results. Even though the SPHINX success rate was not high, it was large enough (>8%)

for the two implementations to be considered not effective.
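
A success rate of this kind can be computed as in the sketch below. It assumes that a CAPTCHA counts as solved only when every digit of the bot's guess matches; the text does not state the exact scoring rule, and the sample answers are made up.

```python
def captcha_success_rate(truths, guesses):
    """Fraction of CAPTCHA whose guessed digit string matches exactly."""
    if len(truths) != len(guesses) or not truths:
        raise ValueError("need two non-empty lists of equal length")
    solved = sum(1 for truth, guess in zip(truths, guesses) if truth == guess)
    return solved / len(truths)


# Four made-up test CAPTCHA; the third guess has one wrong digit.
rate = captcha_success_rate(
    ["692", "415", "880", "137"],
    ["692", "415", "881", "137"],
)
print(rate)  # 0.75
```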

Both experiments were conducted on a Windows XP SP2 PC with a 2.1 GHz

Core2Duo processor and 2 GB of RAM. The results are depicted in Fig. 4,

which also includes the users’ success rates presented in Table 1.

To sum up, based on the aforementioned tests and the VoIP system requirements

(e.g., only digits in vocabulary), we concluded that there is practically no existing

audio CAPTCHA implementation that could be considered as efficient enough for a

VoIP system.

Fig. 4 Evaluation of audio CAPTCHA

Chapter 7

AUDIO CAPTCHA EXPERIMENTAL ENVIRONMENT/INTEGRATION

We now proceed to the development of a new audio CAPTCHA

implementation. A key question for the development of such a new CAPTCHA is

whether it is applicable to a VoIP system, in particular in an SIP-based environment.

This section describes our laboratory VoIP system, the development of the new audio

CAPTCHA, the applicability of the bot in the SIP-based VoIP system, and the results

of the user evaluation.

7.1. Experimental lab infrastructure and CAPTCHA integration

The test computing environment, which was used, is depicted in Fig. 5. It

consists of 2 SIP proxy servers. The SIP server application is scalable and reliable

open-source software called SIP Express Router (SER 2.0) ( SER server version 2.0).

It can act as an SIP registrar, proxy, or redirect server. Each of the SIP servers creates a

different VoIP domain. Both the bots’ host computer and the users belong to the first

domain. The callee, who is protected by the proposed audio CAPTCHA, belongs to the

second domain. The functionality of the second domain has been extended, in order to

be able to send/stream an audio CAPTCHA. Each time a call reaches the second

domain, the call is redirected to a media server, which reproduces the audio

CAPTCHA and validates the caller’s or bots’ answer.

Fig. 5 Laboratory infrastructure.

The media server is the SIP Express Media server (SEMS) ( SIP express media

server version 2.0), which is a reliable media and application server for SIP-based

VoIP services. In order for the caller (user or bot) to hear the audio CAPTCHA, a

media session should be established by exchanging SIP messages. The SIP message

number of the audio CAPTCHA is 182 and the subject (header field) is ‘‘CAPTCHA’’.
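
As a rough illustration, a 182 response carrying the ‘‘CAPTCHA’’ subject could be assembled as in the sketch below. Only the status code and the Subject value come from the text; ‘‘Queued’’ is the standard reason phrase for 182 in SIP, and all addresses, tags, and identifiers are hypothetical placeholders. The session description needed for the media session is omitted, so the body is empty here.

```python
def build_captcha_response(via, from_header, to_header, call_id, cseq):
    """Assemble the text of a SIP 182 response with Subject: CAPTCHA."""
    lines = [
        "SIP/2.0 182 Queued",
        "Via: " + via,
        "From: " + from_header,
        "To: " + to_header,
        "Call-ID: " + call_id,
        "CSeq: " + cseq,
        "Subject: CAPTCHA",
        "Content-Length: 0",
    ]
    # SIP uses CRLF line endings and an empty line before the body.
    return "\r\n".join(lines) + "\r\n\r\n"


# Hypothetical header values for illustration only.
response = build_captcha_response(
    via="SIP/2.0/UDP proxy.domain1.example;branch=z9hG4bK776asdhds",
    from_header="<sip:ua1@domain1.example>;tag=1928301774",
    to_header="<sip:callee@domain2.example>;tag=a6c85cf",
    call_id="a84b4c76e66710@ua1.domain1.example",
    cseq="1 INVITE",
)
print(response.splitlines()[0])  # SIP/2.0 182 Queued
```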

7.2. Bots’ applicability to SIP-based VoIP

In order to integrate the bots in an SIP-based VoIP system and examine their

applicability, the implementation procedure was decided to include three stages (the

procedure and the SIP messages exchanged between the participating entities, are

presented in Fig. 6).

Fig. 6 SIP message exchange for CAPTCHA.

Stage 0: it is controlled by the administrator of the callee’s domain (Domain2). When

the callee’s domain receives an SIP INVITE message, there are three possible distinct

outcomes:

(a) forward the message to the callee, (b) reject the message, and (c) send a CAPTCHA

to the caller (UA1). In the test environment we forward every INVITE message to the

media server, which sends a CAPTCHA to the caller.

Stage 1: an audio CAPTCHA is sent (in the form of a 182 message) to the caller

(UA1). In the proposed implementation, the caller is replaced by a bot. It must record

the audio CAPTCHA, reform it to an appropriate audio format (wav, 8000 Hz, 16 bit)

and identify the announced digits. The procedure depends mainly on the time needed

to reform the message. Moreover, the particular bot needs approximately 0.10 s to

identify a 3-digit CAPTCHA and 0.15 s to identify a 4-digit one.
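
The target format mentioned in Stage 1 (wav, 8000 Hz, 16 bit) can be produced and checked with the Python standard library, as in the sketch below. The single channel, the synthetic 440 Hz tone, and the function names are assumptions for illustration; the bot's actual conversion step is not described in the text.

```python
import io
import math
import struct
import wave


def write_pcm16_wav(samples, rate=8000):
    """Encode integer samples as a mono 16-bit PCM WAV file (in memory)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)   # mono is an assumption
        writer.setsampwidth(2)   # 2 bytes = 16 bit
        writer.setframerate(rate)
        writer.writeframes(struct.pack("<%dh" % len(samples), *samples))
    return buffer.getvalue()


def is_expected_format(wav_bytes):
    """Check for the format named in the text: 8000 Hz, 16 bit."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as reader:
        return reader.getframerate() == 8000 and reader.getsampwidth() == 2


# A 0.1 s synthetic tone standing in for a recorded CAPTCHA.
tone = [int(10000 * math.sin(2 * math.pi * 440 * t / 8000)) for t in range(800)]
wav_bytes = write_pcm16_wav(tone)
print(is_expected_format(wav_bytes))  # True
```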

Stage 2: when the bot has generated an answer, it forms an SIP message by using SIPp

( SIPP traffic generator for the SIP protocol), which includes the DTMF answer. This

answer is sent as a reply to the CAPTCHA. If the caller does not receive a 200 OK

message, a new CAPTCHA is sent and the bot starts to record again (Stage 1).

The above procedure should be completed within a specific time frame. The time

slot opens when the audio file is received by the caller and closes when the timeout of

the user’s input expires (defined by the CAPTCHA service provider (Fig. 7)). The

duration of the CAPTCHA playback does not affect the time frame, because the

waiting time for an answer starts when the playback is complete. If an answer arrives

before the timeout, then it is validated by the CAPTCHA service (and if it is correct the

call is established), otherwise the bot has another try. In our implementation, the bot is

given 6 s to respond to the CAPTCHA, whereas the maximum number of attempts is

set to three (3).
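
The time frame and retry rule can be sketched as follows, using the 6 s timeout and the three attempts given above. The function name and the result strings are hypothetical, and what happens after the third failed attempt is not stated in the text, so the sketch simply reports a failure.

```python
def run_challenge(attempts, timeout_s=6.0, max_attempts=3):
    """Evaluate a series of (expected, answer, delay_s) attempts.

    delay_s is the time between the end of the playback and the answer.
    """
    for expected, answer, delay_s in attempts[:max_attempts]:
        arrived_in_time = delay_s <= timeout_s
        if arrived_in_time and answer == expected:
            return "call established"
        # Wrong or late answer: a new CAPTCHA is sent, if attempts remain.
    return "challenge failed"


# First answer is wrong, second is correct but late, third succeeds.
outcome = run_challenge([
    ("692", "693", 1.0),
    ("415", "415", 7.0),
    ("880", "880", 0.2),
])
print(outcome)  # call established
```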

Fig. 7 – A CAPTCHA time frame.

Table 2 illustrates the time required by the various stages in the proposed

implementation. The selected bot can answer properly the CAPTCHA puzzle in much

less time than the time frame. Since the CAPTCHA is desired to be easy for users, we

suggest that the time frame, in which the caller should answer the CAPTCHA puzzle,

should not be less than 3 s. This is because many groups of users, such as minors or

the elderly, may not be able to respond promptly. Finally, we note that our bots’ host

computer can accomplish the two stages for 82 CAPTCHA simultaneously.

Table 2 Stage duration.

7.3. User applicability

Thirty-two users were invited to solve the CAPTCHA samples, most of

them aged between 20 and 30 years old. Most of them were university students (21 out

of 32). Six participants were older than 40 years old. All CAPTCHA were in English,

which was the mother tongue of one of the participants (there was a requirement for

every user to speak English). In order for the user to take the tests, all users’ PCs (in

Fig. 5 depicted as the caller) were equipped with soft-IP-phones (X-lite and Twinkle).

These phones were used to initiate a call, to listen to the CAPTCHA, and to send the

answer in a DTMF tone format.
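
For reference, the sketch below maps each key of a CAPTCHA answer to its standard DTMF frequency pair (low tone, high tone). The table holds the standard keypad frequencies; how X-lite, Twinkle, or SIPp actually transport the tones (in-band audio or out-of-band signalling) is not specified in the text, so the sketch covers only the mapping.

```python
# Standard DTMF keypad frequencies in Hz: (row tone, column tone).
DTMF_FREQUENCIES_HZ = {
    "1": (697, 1209), "2": (697, 1336), "3": (697, 1477),
    "4": (770, 1209), "5": (770, 1336), "6": (770, 1477),
    "7": (852, 1209), "8": (852, 1336), "9": (852, 1477),
    "*": (941, 1209), "0": (941, 1336), "#": (941, 1477),
}


def dtmf_sequence(answer):
    """Translate an answer such as '692' into DTMF frequency pairs."""
    return [DTMF_FREQUENCIES_HZ[key] for key in answer]


print(dtmf_sequence("692"))  # [(770, 1477), (852, 1477), (697, 1336)]
```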

Chapter 8

AUDIO CAPTCHA IMPLEMENTATION PROCESS

In this section, the details of the development of a new audio CAPTCHA will be

explained.

8.1. Selected attributes

In order to develop an effective new audio CAPTCHA, we decided upon the following

attributes:

Different announcers (speakers): the announcer (speaker) of each and every digit is

selected randomly among a given set of (more than one) speakers.

Random positioning of each digit in the CAPTCHA: the digits used by the

CAPTCHA are physically distributed randomly in the available space.

Background noise of each digit: background noise, randomly selected, is added to

each and every digit of the audio CAPTCHA. The audio noise files are segments (from

1 to 3 s) of randomly selected music files. They are not auto-generated by other

methods (e.g., creation of white noise). We tried to ensure that the noise will be least

annoying for the user to listen to. The background and intermediate noises were

automatically generated in-line with the requirements set forth by a statistical analysis.

The volume level of the noise is lower than the level of the digits, so that they remain

audible to the users.

Loud noise between digits: loud noise is introduced between the digits (the noise is

not very loud, in order to minimize the discomfort of the user).

Different duration and file size: each audio CAPTCHA file has different duration

and different size.

Vocabulary: the vocabulary was limited to digits {0, ..., 9}, because the audio

CAPTCHA was designed for an SIP-based VoIP system where DTMF signals need to

be sent.
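
The attributes listed above can be combined into a generation plan, as in the sketch below. It only decides what goes into a CAPTCHA and does not mix any audio. The 1 to 3 s background segments come from the text; the speaker and clip names, the offset range, and all gain values are hypothetical, chosen so that the background stays quieter than the digits.

```python
import random


def build_captcha_plan(n_digits, speakers, noise_clips, rng):
    """Return the digits and a list of audio events for one CAPTCHA."""
    digits = [rng.randrange(10) for _ in range(n_digits)]
    plan = []
    for index, digit in enumerate(digits):
        plan.append({
            "kind": "digit",
            "digit": digit,
            "speaker": rng.choice(speakers),             # different announcers
            "offset_s": rng.uniform(0.0, 0.5),           # random positioning
            "background_clip": rng.choice(noise_clips),  # music segment
            "background_len_s": rng.uniform(1.0, 3.0),
            "background_gain": rng.uniform(0.2, 0.5),    # quieter than the digit
        })
        if index < n_digits - 1:
            plan.append({
                "kind": "intermediate_noise",
                "clip": rng.choice(noise_clips),
                "gain": rng.uniform(0.6, 0.9),           # loud, but bounded
            })
    return digits, plan


rng = random.Random(2010)
speakers = ["speaker%d" % i for i in range(1, 8)]
noise_clips = ["music_a.wav", "music_b.wav", "music_c.wav"]
digits, plan = build_captcha_plan(4, speakers, noise_clips, rng)
print(len(digits), len(plan))  # 4 7
```

Because the offsets and segment lengths are drawn at random, every generated CAPTCHA ends up with a different duration, which matches the "different duration and file size" attribute.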

8.2. CAPTCHA development

The audio CAPTCHA development was carried out in five stages, in terms of

the number of attributes adopted. Each development stage was tested and evaluated

upon its efficiency according to the success rate of the bot and the success rate of

human users.

During Stage 1, the produced audio CAPTCHA was pronounced by one sole

announcer. It did not include additional features, such as background noise, or noise

between the digits. The first digit of every CAPTCHA started at exactly the same point as in the

other ones. The time difference between two consecutive digits was fixed. The

waveforms of the resulting 3- and 4-digit CAPTCHA appear in Fig. 8a and b. In such

a simple audio CAPTCHA, a bot can use a detection method (e.g., energy peak

detection) and easily segment and recognize the digits. An important factor in this

process is the number of audio CAPTCHA that was used during the training of the

devoicecaptcha bot. If a small number was used, then there is a high chance that not all

digits are given as an input to the training process; thus, the bot may have a low

success rate. That is the case with the 4-digit CAPTCHA ( Fig. 8b). The random

training sequence did not involve many instances of some digits (such as 8 and 9);

therefore, even though the bot recognized successfully a large number of CAPTCHA,

it failed to recognize others and resulted in a relatively low (69%) bot success rate.

Fig. 8 – a) Stage 1 (3 digits). b) Stage 1 (4 digits).

Fig. 9 – a) Stage 2 (3 digits). b) Stage 2 (4 digits).

The SPHINX software did not achieve such a high success rate; it reached a

success rate of 27%. The main reason for this is that there was no background or

intermediate noise within the CAPTCHA.

During Stage 2, the audio CAPTCHA was produced by using 7 different

announcers. Each digit was pronounced by a randomly selected announcer. Even

though this affected the success of the devoicecaptcha bot in the case of 3-digit

CAPTCHA, it did not do so in the case of the 4-digit ones. This mainly hinges upon

the training set. Moreover, for the same number of training CAPTCHA instances, 4-digit

ones offer more digits to the training procedure. For example, if 100 3-digit

CAPTCHA are used for training, 300 digits are recorded, whereas with the same

number of 4-digit CAPTCHA 400 digits are recorded.

The success rate of the SPHINX software decreased dramatically (to 0.9% for the 3-digit

CAPTCHA and 0.7% for the 4-digit CAPTCHA). This is because there was

considerable background noise, due to the microphone recording. Fig. 9a and b show

the waveforms of the produced digits.

In Stage 3 background noise was added to each digit. This way the success

rate of the devoicecaptcha bot was reduced to 30% for the 3-digit CAPTCHA and

55% for the 4-digit ones, but it still remained relatively high. Fig. 10a and b show the

waveforms of the produced digits with the background noise. The high success rate is

due to the ability of the frequency bot to cut off the low-energy sounds (i.e., the noise)

by considering only energy above a certain threshold. In that way, it can, in most cases,

isolate the noise behind each digit. The difference between the success rates for 3- and

4-digit CAPTCHA is due to the difference in the training sets. In this case, a training set of

50 audio CAPTCHA was used for the 3-digit ones and 150 for the 4-digit ones.

As a result, the available digits taking part in the training process were 150 and 600,

respectively.

Fig. 10 – a) Stage 3 (3 digits). b) Stage 3 (4 digits).

Fig. 11 – a) Stage 5 (3 digits). b) Stage 5 (4 digits).

The SPHINX software repeated the same low success rate, because the

background noise added further difficulty for solving the CAPTCHA.

In Stage 4 the volume of the background noise of each digit was raised.

Although the devoicecaptcha bot’s success rate fell noticeably (10–15% success), and

the SPHINX software was unable to solve any CAPTCHA correctly, the produced

audio CAPTCHA was too difficult to solve for the users, as the loud background noise

made it hard for the users to distinguish the digits spoken. For that reason, loud

background noise was not included in our final strategy.

In Stage 5 loud noise was introduced between every pair of digits

(intermediate noise) ( Fig. 11a and b). This resulted in the devoicecaptcha bot being

unable to segment the audio file correctly. This happened because there were more

energy peaks than the digits spoken. The loud intermediate noises were recognized as

additional digits, because they produce high energy peaks as well, when transformed

with the Discrete Fourier Transform. As a consequence, this bot could not be

trained, as it failed to successfully recognize any digits.

Fig. 12 – Demonstration of the devoicecaptcha bot failing to solve the CAPTCHA

The SPHINX software repeated the same low success rate. The main issue

remains that such speech recognition tools are effective only in ‘‘controlled’’

conditions, such as with only one speaker and without any noise (Section 3).

Stage 5 is illustrated in more detail in Fig. 12, where the CAPTCHA includes

intermediate noise between the digits. When the bot transforms such an audio into the

frequency domain, the energy peaks that can be found are both digits and noise. As a

result, the bot recognizes more digits than those which are actually included in the file.

One possible solution for the devoicecaptcha bot would be to raise or lower the

threshold of the energy. In that case ( Fig. 12), the bot would still fail. If the threshold

energy is very high, then the bot would not recognize some of the digits in the

CAPTCHA, while at the same time it would recognize some intermediate noise as

digits. On the other hand, if the threshold energy is lowered, then the bot would

recognize all digits, but at the same time all intermediate noises would also be

considered digits. Thus, the bot would assume that there were 12–15 digits in

the CAPTCHA.
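
The threshold dilemma can be reproduced with a toy example. In the sketch below each block carries a label and a peak energy; the energy values are made up so that the intermediate noises are as loud as some of the digits. No threshold then yields exactly the three digits.

```python
def detected_kinds(blocks, threshold):
    """Return the labels of all blocks whose energy exceeds the threshold."""
    return [kind for kind, energy in blocks if energy > threshold]


# Made-up peak energies: three digits with loud noise in between.
blocks = [("digit", 9), ("noise", 7), ("digit", 5), ("noise", 8), ("digit", 10)]
true_digits = [kind for kind, _ in blocks if kind == "digit"]

print(detected_kinds(blocks, 3))  # low threshold: every block is taken as a digit
print(detected_kinds(blocks, 7))  # high threshold: a digit is lost, a noise remains

# Try every integer threshold and keep those that isolate the digits exactly.
clean_thresholds = [
    t for t in range(0, 11) if detected_kinds(blocks, t) == true_digits
]
print(clean_thresholds)  # []
```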

8.3. CAPTCHA testing

Users’ and bots’ success rates are the main factors, which prove whether a

CAPTCHA is efficient or not. The corresponding success rates, as per the CAPTCHA

described in Section 8.2, appear in Fig. 13a–c. Each attribute added efficiency to the

CAPTCHA and directly affected the user and bot success rates. The CAPTCHA

developed in Stage 5 had an average user success rate of 87%, with an average bots’

success rate of less than 1%.

Fig. 13 – a) SPHINX success rates. b) Devoicecaptcha bot success rates. c) Users

success rates

8.4. CAPTCHA implementation

During the implementation of the proposed audio CAPTCHA, the audio files had the

following attributes:

a) They were produced automatically; therefore, they can be updated at random

time periods without human intervention. The overall process for creating a full

set of 3-digit CAPTCHA took 8 s, whereas creating a full set of 4-digit

CAPTCHA took 107 s. Thus, the reproduction of the whole set of CAPTCHA

does not cause significant overload to our VoIP system (the VoIP server was a

2.1 GHz Core2Duo, with 2 GB RAM).

b) All constituting parts of the audio CAPTCHA, such as the digits and the noise,

reside in different folders. Moreover, each time a set of CAPTCHA is produced, the

program selects randomly each digit from a different announcer, as well as a

random background noise.

c) The noise between the digits is selected randomly and has different volume and

energy.

d) The noise and the pronounced digits have random duration, which results in a

random duration of each audio CAPTCHA.

Chapter 9

DISCUSSION AND LIMITATIONS

The evaluation process of the current CAPTCHA implementations included the

positive and negative characteristics of each one. Moreover, the user success rate for

every CAPTCHA was presented but the bot success rate was introduced only for those

which are easily applicable to a VoIP infrastructure. Therefore, the remaining

CAPTCHA could be evaluated for their resistance against bots.

Additionally, the testing environment for the proposed VoIP CAPTCHA is a lab

environment; therefore, issues might arise when the proposed CAPTCHA is

integrated into the overall security infrastructure of a VoIP provider. However,

further experimentation clearly requires the co-operation of a major SIP-based VoIP

service provider, especially for business purposes, since the applicability of this

mechanism has been introduced and justified in this paper.

A limitation of the proposed CAPTCHA is that there could be no evaluation of

its effectiveness and its attributes by some additional audio/speech recognition tools, as

those introduced by Tam et al. (2008a).

Another possible limitation was due to the sample of the users used for

experimentation. The experiment procedure could consider different populations of

users and take into consideration the specific requirements of each set.

Chapter 10

CONCLUSIONS

CAPTCHA are expected to play a key role for preventing email spam and voice

spam (SPIT) in the near future. In order for them to be effective, they must be easy to

solve for the users, while at the same time very hard for bots to pass.

In this paper, we provided the reader with an overview of existing audio

CAPTCHA implementations, in order to identify their main characteristics. Based on

these characteristics, we identified that two of them may be able, in principle, to be

appropriate audio CAPTCHA for a VoIP system. After an evaluation process, which

included a test procedure by two speech recognition tools, we demonstrated that the

existing audio CAPTCHA implementations are clearly inadequate candidates for a

VoIP system.

As a result of the aforementioned facts, we proposed a new audio CAPTCHA

implementation. This CAPTCHA incorporates several attributes, such as different digit

announcers, background noise behind each digit, and noise between digits, all of them applied

in a random and automated way.

Then, we produced a number of audio CAPTCHAs, which are regularly

refreshed, with a limited chance of creating the same instance of an audio CAPTCHA

more than once, and reproducing in streaming mode. The production of the CAPTCHA

was done in five stages. Each time the CAPTCHA was tested not only by a number of

users, but also by two automated speech recognition tools (SPHINX and

devoicecaptcha bot). The bots managed to achieve a high success rate during the first

four stages (up to 98%), but that rate dropped dramatically at the last one (less than

2%). That was mainly due to the addition of intermediate noises, which made the bot

unable to segment properly the audio file, to be trained properly, and thus to solve the

CAPTCHA.

We also determined an appropriate level of background noise of each digit, in order

for the CAPTCHA to be solvable by users and difficult to break by bots. However,

such a low bot success rate could not have been achieved without the combination of

all the above mentioned attributes. Each attribute alone is not enough for making

CAPTCHA robust; it is the combination of the features that makes the CAPTCHA

resistant.

REFERENCES

[1]. von Ahn L, Blum M, Hopper N, Langford J. CAPTCHA: using hard AI

problems for security. In: Biham E, editor. Proceedings of the international conference

on the theory and applications of cryptographic techniques (EUROCRYPT ’03).

Poland: Springer; 2003. p. 294–311 (LNCS 2656).

[2]. von Ahn L, Blum M, Langford J. Telling humans and computers apart

automatically. Communications of the ACM 2004;47(2): 57–60.

[3]. von Ahn L, Maurer B, McMillen C, Abraham D, Blum M. reCAPTCHA:

human-based character recognition via web security measures. Science

2008;321(5895):1465–8.

[4]. Blum M, von Ahn L, Langford J, Hopper N. The CAPTCHA project, USA,

November 2000.

[5]. Bigham J, Cavender A. Evaluating existing audio CAPTCHAs and an interface

optimized for non-visual use. In: Proceedings of the ACM conference on human

factors in computing systems (CHI 2009), USA; 2009, p. 1829–38.

Breaking Gmail’s Audio CAPTCHA, http://blog.wintercore.com/?p=11 [retrieved

10.10.08].

[6]. Bursztein E, Bethard S. Decaptcha: breaking 75% of eBay audio CAPTCHAs.

In: Proceedings of the 3rd USENIX workshop on offensive technologies (WOOT

’09), Canada; 2009.

Bokehman Audio CAPTCHA, http://bokehman.com/captcha_verification.php

[retrieved 5.05.09].

[7]. Chellapilla K, Larson K, Simard P, Czerwinski M. Building segmentation based

human friendly human interaction proofs. In: Proceedings of the SIGCHI conference

on human factors in computing systems. ACM Press; 2005. p. 711–20.

[8]. Chew M, Baird H. Baffletext: a human interactive proof. In: Proceedings of

the 10th SPIE/IS&T document recognition and retrieval conference, USA; 2003,

p. 305–16.

[9]. Chan T-Y. Using a text-to-speech synthesizer to generate a Reverse Turing

Test. In: Proceedings of the 15th IEEE international conference on tools with

artificial intelligence (ICTAI’03); 2003, p. 226.

[10]. Dusan S, Rabiner L. On integrating insights from human speech perception

into automatic speech recognition. In: INTERSPEECH, Portugal; 2005, p. 1233–6.

Festa P. Spam-bot tests flunk the blind. CNET, News.com. Available at:

www.news.com/2100-1032-1022814.html; July 2, 2003.

[11]. Gibbs S, Breiteneder C, Tsichritzis D. Data modeling of time-based

media. In: Proceedings of the ACM SIGMOD international conference on

management of data, USA; 1994, p. 91–102.

Google Audio CAPTCHA, www.google.com/accounts/NewAccount [retrieved

26.03.09].

[12]. Graham-Rowe D. A sentinel to screen phone calls technology. MIT Review

2006 [accessed 08.11.2009].

[13]. Seminars For You, http://www.seminars4you.info