trade-off between security level and compression in voice communications

Analysis of the tradeoff between compression ratio and security level in real-time voice communication. By Abdallah Attie. Supervised by: Dr Ahmad Fadlallah, Dr Mohamad Raad. Defended on 09.07.2014 before a jury composed of: Dr. Wafaa Abou Diab, Dr. Bassem Bakhash, Dr. Ahmad Fadlallah.


Master's thesis report on the trade-off between the level of security and the efficiency of compression in voice over IP communication.


    Analysis of the tradeoff between compression ratio and security level in real-time voice communication

    By

    Abdallah Attie

    Supervised by:

    Dr Ahmad Fadlallah

    Dr Mohamad Raad

    Defended on 09.07.2014 before a jury composed of:

    Dr. Wafaa Abou Diab

    Dr. Bassem Bakhash

    Dr. Ahmad Fadlallah


    Abstract

    This project analyzes the tradeoff between security level and compression ratio in real-time voice communication. The problem it addresses is that combining variable-bitrate compression with length-preserving encryption makes the stream vulnerable to traffic analysis. The variation in packet sizes can leak information about the conversation, ranging from language identification to the detection of specific phrases and the reconstruction of phonemes. The solution is either to rely on constant-bitrate compression or to pad the transmitted frames to a multiple of 16, 32, or 64 bytes. Each padding scheme yields a security gain in the form of increased immunity to the traffic analysis systems described. The security level rises as the size of the encryption block increases.

    Our research analyzes the impact of these padding schemes on the bitrate of the VoIP stream. For this purpose we built a test bed that simulates compressing, encrypting, and sending/receiving speech over an RTP socket. The resulting bitrates are calculated with and without packetization overhead. In conclusion, the resulting data give a clear view of the tradeoff between three parameters: security level, bitrate, and quality.


    Contents

    Abstract
    List of Figures
    List of Tables
    List of References
    Chapter I Introduction
    Compression
    Types of Speech Coders
    Variable Bit-Rate Coding
    Speech Coding State of the Art
    Adaptive Multi Rate (AMR)
    Opus
    Speex
    Security
    Symmetric and Asymmetric Encryption
    Block and Stream Encryption
    Common Encryption Algorithms
    Report Structure
    Chapter II Literature Review and Problem Formulation
    Traffic Analysis of Encrypted Voice Stream
    Information leakage via variable bit-rate
    Example of traffic analysis
    Mitigation Techniques
    Chapter III Test-Bed
    Test-bed requirements
    Test-bed elements
    Speex Encoder
    AES Encryption
    RTP Sending/Receiving
    Dataset
    Test-bed overview
    Chapter IV Results and Conclusion
    Narrow Band Results
    Wide Band Results
    Statistical Analysis
    Conclusion and future recommendations


    List of Figures

    Figure I-1: Block Diagram of the Opus Encoder
    Figure II-1: Distribution of bit rates used to encode four phonemes with Speex
    Figure II-2: Overview of training and detection process
    Figure II-3: Robustness to padding
    Figure IV-1: NB padding overhead (without packetization)
    Figure IV-2: NB rate vs quality (without packetization)
    Figure IV-3: NB rate vs quality (with packetization)
    Figure IV-4: NB padding overhead (with packetization)
    Figure IV-5: WB overhead (without packetization)
    Figure IV-6: WB rate versus quality (without packetization)
    Figure IV-7: Wide Band overhead (with packetization)
    Figure IV-8: Wide Band rate versus quality (with packetization)
    Figure IV-9: Stream Cipher 95% Confidence Interval
    Figure IV-10: Stream and 128 bit padding Confidence Interval
    Figure IV-11: Stream and 512 bit padding Confidence Interval
    Figure IV-12: Stream and 256 bit Confidence Interval


    List of Tables

    Table I-1: Characteristics of Standardized Speech Coding Algorithms in Each of Four Broad Categories
    Table I-2: Comparison Between the 3 Speech Encoders
    Table III-1: Quality versus bitrate for Speex narrowband
    Table III-2: Quality versus bitrate for Speex wideband


    List of References

    [1] M. Arjona Ramírez and M. Minami, "Low bit rate speech coding," in Wiley Encyclopedia of Telecommunications, J. G. Proakis, Ed., New York: Wiley, 2003, vol. 3, pp. 1299-1308.

    [2] P. Kroon, "Evaluation of speech coders," in Speech Coding and Synthesis, W. Bastiaan Kleijn and K. K. Paliwal, Eds., Amsterdam: Elsevier Science, 1995, pp. 467-494.

    [3] M. Hasegawa-Johnson and A. Alwan, "Speech Coding: Fundamentals and Applications," in Wiley Encyclopedia of Telecommunications, J. G. Proakis, Ed., New York: Wiley, 2003, vol. 3, pp. 1256-1265.

    [4] Wiki.hydrogenaud.io, (2014). Variable Bitrate - Hydrogenaudio Knowledgebase. [online] Available at: http://wiki.hydrogenaud.io/index.php?title=VBR [Accessed 22 Jun. 2014].

    [5] E. Ekudden et al., "The Adaptive Multi-Rate Speech Coder," Ericsson Research.

    [6] Tools.ietf.org, (2014). RFC 6716 - Definition of the Opus Audio Codec. [online] Available at: http://tools.ietf.org/html/rfc6716#section-2.1.8 [Accessed 22 Jun. 2014].

    [7] Speex.org, (2014). Introduction to CELP Coding. [online] Available at: http://www.speex.org/docs/manual/speex-manual/node9.html [Accessed 24 Jun. 2014].

    [8] C. V. Wright, L. Ballard, F. Monrose, and G. M. Masson, "Language identification of encrypted VoIP traffic: Alejandra y Roberto or Alice and Bob?" in Proceedings of the USENIX Security Symposium, 2007.

    [9] C. V. Wright, L. Ballard, S. E. Coull, F. Monrose, and G. M. Masson, "Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations," in Proceedings of the IEEE Symposium on Security and Privacy, 2008.

    [10] Tools.ietf.org, (2014). RFC 6562 - Guidelines for the Use of Variable Bit Rate Audio with Secure RTP. [online] Available at: http://tools.ietf.org/html/rfc6562#section-2.1.8 [Accessed 30 Jun. 2014].


    Chapter I Introduction

    Security and performance are two issues that concern every network operator, and the concern is heightened for real-time voice communication. One reason is that performance directly affects the user experience in real-time applications. Furthermore, voice conversations are usually personal and therefore carry stricter privacy requirements than other applications (web browsing, for example). Enhancing performance requires compressing the network stream, while preserving privacy requires encrypting the exchanged data. Our project aims at finding the best way to combine the two operations (compression and encryption), since by nature they do not combine well: compression removes redundancy from the data, whereas encryption produces output with no exploitable redundancy, so compression has to be applied first and the size of its output remains visible after encryption. Our research examines the options available in both encryption and compression in order to find the best combination using existing tools.

    Compression

    Compressing the data exchanged over the network is one of the most important techniques for meeting performance requirements. Because the data is a voice signal, it has a structure that makes it compressible at high ratios with little or no distortion. Speech coding has therefore long been an active research area in which many approaches have been adopted, all with one goal: minimizing the required bandwidth while preserving an acceptable level of voice quality.

    Speech coding is an application of data compression on digital audio signals containing speech.

    Speech coding uses speech-specific parameter estimation using audio signal processing techniques to

    model the speech signal, combined with generic data compression algorithms to represent the

    resulting modeled parameters in a compact bit-stream. [1]

    The techniques employed in speech coding are similar to those used in audio data compression and audio coding, where knowledge of psychoacoustics is used to transmit only the data that is relevant to the human auditory system. For example, in voice-band speech coding, only information in the frequency band from 400 Hz to 3500 Hz is transmitted, yet the reconstructed signal is still adequate for intelligibility. Narrowband coding requires a sampling rate of 8 kHz. Wideband coding covers frequencies up to 7-8 kHz, which requires a sampling rate of 16 kHz.
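    To put the bandwidth stakes in numbers, the sketch below computes the raw bitrate of uncompressed speech at the two sampling rates mentioned above, and the compression ratio a coder would reach at a given target bitrate. The 16-bit sample depth and the 8 kbps target are assumptions chosen for illustration; neither comes from the report.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int = 16) -> float:
    """Raw bitrate of mono linear PCM in kbit/s (16-bit depth is an assumption)."""
    return sample_rate_hz * bits_per_sample / 1000.0


def compression_ratio(raw_kbps: float, coded_kbps: float) -> float:
    """How many times smaller the coded stream is than the raw stream."""
    return raw_kbps / coded_kbps


if __name__ == "__main__":
    target_kbps = 8.0  # hypothetical coder bitrate, for illustration only
    for label, rate_hz in (("narrowband", 8000), ("wideband", 16000)):
        raw = pcm_bitrate_kbps(rate_hz)
        ratio = compression_ratio(raw, target_kbps)
        print(f"{label}: raw {raw:.0f} kbps, ratio at {target_kbps:.0f} kbps = {ratio:.0f}:1")
```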

    Speech coding differs from other forms of audio coding in that speech is a much simpler signal than

    most other audio signals, and a lot more statistical information is available about the properties of

    speech. As a result, some auditory information which is relevant in audio coding can be unnecessary

    in the speech coding context. In speech coding, the most important criterion is preservation of

    intelligibility and "pleasantness" of speech, with a constrained amount of transmitted data. [2]

    Types of Speech Coders

    There are different types of speech encoders:

    Waveform coders attempt to code the exact shape of the speech signal waveform, without

    considering the nature of human speech production and speech perception. These coders are

    high-bit-rate coders (typically above 16 kbps).

    Linear prediction coders (LPCs), on the other hand, assume that the speech signal is the output of a linear time-invariant (LTI) model of speech production. The transfer function of that model is assumed to be all-pole (autoregressive model). The excitation function is a quasi-periodic signal constructed from discrete pulses (1-8 per pitch period), pseudorandom noise, or some combination of the two. If the excitation is generated only at the receiver, based on a transmitted pitch period and voicing information, then the system is designated as an LPC voice coder (vocoder). LPC vocoders that provide extra information about the spectral shape of the excitation have been adopted as coder standards between 2.0 and 4.8 kbps.

    LPC-based analysis-by-synthesis coders (LPC-AS), on the other hand, choose an excitation

    function by explicitly testing a large set of candidate excitations and choosing the best. LPC-

    AS coders are used in most standards between 4.8 and 16 kbps.

    Sub-band coders are frequency-domain coders that attempt to parameterize the speech signal

    in terms of spectral properties in different frequency bands. These coders are less widely used

    than LPC-based coders but have the advantage of being scalable and do not model the

    incoming signal as speech. Sub-band coders are widely used for high-quality audio coding.

    Table I-1 shows the four types of speech coders discussed. [3]

    Speech Coder Class   Rates (kbps)   Complexity   Standardized Applications
    Waveform coders      16-64          Low          Landline telephone
    Sub-band coders      12-256         Medium       Teleconferencing, audio
    LPC-AS               4.8-16         High         Digital cellular
    LPC vocoder          2.0-4.8        High         Satellite telephony, military

    Table I-1: Characteristics of Standardized Speech Coding Algorithms in Each of Four Broad Categories


    Variable Bit-Rate Coding

    Variable bitrate coding is an important technique in speech coding. It rests on the fact that not all segments of a speech signal need the same bitrate to be coded. In Variable Bitrate (VBR) coding, the user chooses the desired quality level and/or a range of allowable bitrates. The encoder then tries to maintain the selected quality throughout the stream by choosing the optimal amount of data to represent each audio frame. The main advantage is that the user can specify the quality level while conserving as much space as possible; the drawback is that the final file size is hard to predict.
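    The link between bitrate and payload size can be made explicit with a small calculation: a frame coded at a given bitrate occupies bitrate times frame duration bits. The sketch below assumes 20 ms frames, and the per-frame bitrates in the VBR sequence are hypothetical values picked for illustration. It shows that a VBR stream produces frames of varying size while a constant bitrate (CBR) stream does not, which is the variation a traffic observer can see.

```python
FRAME_MS = 20  # assumed frame duration in milliseconds


def frame_bytes(bitrate_kbps: float, frame_ms: int = FRAME_MS) -> int:
    """Whole bytes needed to carry one frame coded at the given bitrate."""
    bits = round(bitrate_kbps * frame_ms)  # kbit/s * ms = bits
    return -(-bits // 8)  # ceiling division


if __name__ == "__main__":
    # Hypothetical per-frame bitrates chosen by a VBR encoder.
    vbr_rates = [2.15, 8.0, 15.0, 8.0, 24.6]
    vbr_sizes = [frame_bytes(r) for r in vbr_rates]
    cbr_sizes = [frame_bytes(8.0) for _ in vbr_rates]
    print("VBR frame sizes (bytes):", vbr_sizes)
    print("CBR frame sizes (bytes):", cbr_sizes)
    print("VBR total:", sum(vbr_sizes), "CBR total:", sum(cbr_sizes))
```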

    Most modern encoders are able to perform VBR encoding, including (but not limited to) nearly all

    popular MP3, AAC, (Ogg) Vorbis, Musepack, and WMA encoders. [4]

    Speech Coding State of the Art

    The two most important applications of speech coding are mobile telephony and Voice over IP. Consequently, the standards for speech compression are organized and published by the telecommunication standards bodies responsible for mobile technology, such as the International Telecommunication Union (ITU) and 3GPP, and by the Internet Engineering Task Force (IETF).

    This section presents the most widely used encoders in the domain. Adaptive Multi-Rate (AMR) is an encoder standardized by 3GPP; it is used in GSM and WCDMA networks. Opus and Speex, on the other hand, are two sibling encoders developed by Xiph.org and adopted by the IETF.

    Adaptive Multi Rate (AMR)

    The Adaptive Multi-Rate speech coder is based on Algebraic CELP (ACELP) technology and is referred to as a Multi-Rate ACELP (MR-ACELP) coder. The coder can operate at 8 different bit-rates, called coder modes. The frame size is 20 ms, divided into 4 sub-frames of 5 ms, and a look-ahead of 5 ms is used. The 12.2 kbit/s mode is equivalent to the GSM EFR coder, while the 7.40 kbit/s mode is equivalent to the EFR coder for the IS-136 system.

    The AMR speech coder was developed to fulfill a challenging set of performance requirements for clean speech, speech in background noise, tandeming, and degraded channel conditions. The highest mode is the GSM EFR coder, which provides speech quality comparable to fixed-line quality. The lowest mode provides communication quality. The range of bit-rates and the high quality provide the flexibility to trade quality against capacity and to optimize quality under changing channel conditions. The quality was shown to be significantly higher than that of existing speech services in GSM. [5]

    Opus

    The Opus codec is developed by Xiph.org and standardized by the IETF for multimedia streaming and VoIP applications. Opus can handle a wide range of audio applications, including Voice over IP, videoconferencing, in-game chat, and even remote live music performances. It can scale from low bit-rate narrowband speech to very high quality stereo music. Supported features are [6]:

    Bit-rates from 6 kb/s to 510 kb/s

    Sampling rates from 8 kHz (narrowband) to 48 kHz (fullband)

    Frame sizes from 2.5 ms to 60 ms

    Support for both constant bit-rate (CBR) and variable bit-rate (VBR)

    Audio bandwidth from narrowband to full-band

    Support for speech and music

    Support for mono and stereo

    Support for up to 255 channels (multistream frames)

    Dynamically adjustable bitrate, audio bandwidth, and frame size

    Good loss robustness and packet loss concealment (PLC)

    Floating point and fixed-point implementation

    The Opus codec is a real-time interactive audio codec. It is composed of a layer based on Linear

    Prediction [LPC] and a layer based on the Modified Discrete Cosine Transform [MDCT]. The main

    idea behind using two layers is as follows: in speech, linear prediction techniques (such as Code-

    Excited Linear Prediction, or CELP) code low frequencies more efficiently than transform (e.g.,

    MDCT) domain techniques, while the situation is reversed for music and higher speech frequencies.

    Thus, a codec with both layers available can operate over a wider range than either one alone and can

    achieve better quality by combining them than by using either one individually. [6]

    The Opus encoder consists of two main blocks: the SILK encoder and the CELT encoder. However,

    unlike the decoder, a valid (though potentially suboptimal) Opus encoder is not required to support

    all modes and may thus only include a SILK encoder module or a CELT encoder module. The output

    bit-stream of the Opus encoding contains bits from the SILK and CELT encoders, though these are

    not separable due to the use of a range coder. A block diagram of the encoder is shown in Figure I-1. [6]

    The Opus encoder is standardized for VoIP applications by the IETF. The reference (RFC 6716) defines the encoder/decoder. Furthermore, the IETF has published specifications for the packet payload format of Opus frames.

    Speex

    Speex is the sibling of Opus. It is also developed by Xiph.org and takes an approach very similar to that of Opus, and the options offered by the two encoders are largely the same. However, in our research we are more interested in experimenting with Speex than with Opus. The reason is explained later in the report.

    Speex is based on CELP, which stands for Code Excited Linear Prediction. The CELP technique is

    based on three ideas:

    The use of a linear prediction (LP) model to model the vocal tract

    The use of (adaptive and fixed) codebook entries as input (excitation) of the LP model

    The search performed in closed-loop in a perceptually weighted domain
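    The three ideas above can be sketched in a few lines. The code below is a deliberately reduced illustration and not Speex's implementation: it uses a single fixed codebook, omits the adaptive codebook, the gains, and the perceptual weighting filter, and compares candidates by plain squared error. The function names, the one-tap predictor, and the tiny codebook are all hypothetical.

```python
from typing import List, Sequence, Tuple


def lp_synthesize(excitation: Sequence[float], lp_coeffs: Sequence[float]) -> List[float]:
    """All-pole synthesis filter: y[n] = e[n] + sum_k a[k] * y[n-1-k]."""
    out: List[float] = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(lp_coeffs):
            if n - 1 - k >= 0:
                acc += a * out[n - 1 - k]
        out.append(acc)
    return out


def closed_loop_search(target: Sequence[float],
                       codebook: Sequence[Sequence[float]],
                       lp_coeffs: Sequence[float]) -> Tuple[int, float]:
    """Synthesize every codebook entry and keep the one closest to the target.

    Plain squared error is used here; a real CELP coder would measure the
    error in a perceptually weighted domain.
    """
    best_index, best_error = -1, float("inf")
    for index, entry in enumerate(codebook):
        synth = lp_synthesize(entry, lp_coeffs)
        error = sum((t - s) ** 2 for t, s in zip(target, synth))
        if error < best_error:
            best_index, best_error = index, error
    return best_index, best_error


if __name__ == "__main__":
    lp_coeffs = [0.5]  # hypothetical one-tap vocal tract model
    codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 0.0, -1.0, 0.0]]
    target = lp_synthesize(codebook[2], lp_coeffs)  # pretend this is the input frame
    print(closed_loop_search(target, codebook, lp_coeffs))
```

    Only the index of the winning entry needs to be transmitted, because the decoder holds the same codebook and runs the same synthesis filter.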

    Speex is designed to compress voice at bitrates ranging from 2 to 44 kbps. Some of Speex's features include:

    Narrowband (8 kHz), wideband (16 kHz), and ultra-wideband (32 kHz) compression in the same bit-stream

    Intensity stereo encoding

    Packet loss concealment

    Variable bitrate operation (VBR)

    Voice Activity Detection (VAD)

    Discontinuous Transmission (DTX)

    Fixed-point port

    Acoustic echo canceller

    Noise suppression

    Figure I-1: Block Diagram of the Opus Encoder

    The following table shows a comparison between the 3 discussed coders.

    Codec              Rate (kHz)   Bitrate (kbps)    Delay frame+lookahead (ms)   Multirate   VBR   License
    Speex              8, 16, 32    2.15-24.6 (NB)    20+10 (NB)                   yes         yes   open-source/free software
                                    4-44.2 (WB)       20+14 (WB)
    Opus               8, 16, 22    6-510             2.5-60                       yes         yes   open-source/
    AMR-NB             8            4.75-12.2         20+5?                        yes               proprietary
    AMR-WB (G.722.2)   16           6.6-23.85         20+5?                        yes               proprietary

    Table I-2 Comparison Between the 3 Speech Encoders

    To sum up, the bibliographic work has led us to focus on the concept of variable bitrate (VBR), for reasons that are explained in Chapter II. Furthermore, the literature our research builds on works with the Speex encoder. Consequently, Speex is the encoder used in our test-bed.

    Security

    The other concern of our research is security. As stated at the beginning of the chapter, privacy is of great importance for real-time voice communication applications, whether in mobile telephony or voice over IP. In this section, we review the concept of encrypting voice data along with the state of the art in the field.

    Encryption is the process of converting readable plain text into unreadable cipher text in order to protect it against eavesdroppers. The cipher text must then be decrypted at the other end to be understood. As defined in RFC 2828 [Reference], a cryptographic system is "a set of cryptographic algorithms together with the key management processes that support use of the algorithms in some application context." This definition covers the whole mechanism that provides the necessary level of security, comprising network protocols and data encryption algorithms.

    The goals of any cryptographic system fall into 5 categories:

    Authentication: Before data is sent and received using the system, the identities of the sender and the receiver should be verified.

    Secrecy or Confidentiality: This is the function by which most people identify a secure system. It means that only authenticated parties are able to interpret the content of the message (data), and no one else.

    Integrity: The content of the communicated data is assured to be free from any type of modification between the end points (sender and receiver). A basic form of integrity check is the checksum carried in IPv4 packets.

    Non-Repudiation: Neither the sender nor the receiver can falsely deny having sent or received a certain message.

    Service Reliability and Availability: Secure systems are frequently attacked by intruders, which may affect their availability and the type of service they offer their users. Such systems should provide a way to guarantee their users the quality of service they expect.
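    The integrity goal above cites the checksum used in IPv4. As an illustration, the sketch below implements the 16-bit one's-complement Internet checksum in the style of RFC 1071. It is offered only to show what such a basic check looks like: a checksum of this kind catches accidental corruption, not deliberate modification by an attacker, who can simply recompute it.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum over big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


if __name__ == "__main__":
    payload = bytes.fromhex("0001f203f4f5f6f7")
    checksum = internet_checksum(payload)
    print(f"checksum = {checksum:#06x}")
    # A receiver recomputes over payload + checksum and expects zero.
    print(internet_checksum(payload + checksum.to_bytes(2, "big")) == 0)
    # Flipping one byte makes the verification fail.
    tampered = b"\xff" + payload[1:]
    print(internet_checksum(tampered + checksum.to_bytes(2, "big")) == 0)
```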

    The category of interest to us is confidentiality. Consequently, references to security throughout the report address the confidentiality goal of the implemented security system. Furthermore, the attack considered on the system is based on traffic analysis rather than conventional cryptanalysis. This idea is discussed in detail in the next chapter.

    Symmetric and Asymmetric Encryption

    Data encryption procedures fall into two main categories, depending on the type of keys used to encrypt and decrypt the secured data: symmetric and asymmetric encryption. In symmetric encryption, the sender and the receiver agree on a secret (shared) key, which they then use to encrypt and decrypt the messages they exchange. The main concern with symmetric encryption is how to share the secret key securely between the two peers. If the key becomes known for any reason, the whole system collapses. Asymmetric encryption, on the other hand, uses two keys: what Key1 encrypts, only Key2 can decrypt, and vice versa. It is also known as Public Key Cryptography (PKC), because each user holds two keys: a public key, which is known to everyone, and a private key, which is known only to the user.

    In this project, we experiment with symmetric encryption, because the state of the art in speech encryption is based on symmetric ciphers. The reason is that symmetric algorithms are in general less computationally demanding than asymmetric ones. This reduction in complexity is of great importance for real-time applications, which usually run on platforms with limited capabilities (mobile phones).


    Block and Stream Encryption

    One of the main ways of categorizing encryption techniques is by the form of the input data they operate on. The two types are block ciphers and stream ciphers.

    A stream cipher operates on a stream of data bit by bit. It consists of two major components: a key stream generator and a mixing function. The mixing function is usually just an XOR, while the key stream generator is the main unit of the stream cipher.
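    The two components can be illustrated with a toy example. The key stream generator below simply hashes the key, a nonce, and a running counter with SHA-256; this construction is an assumption made for illustration, is not a vetted cipher, and must not be used to protect real data. What it does show is the property that matters in this report: the cipher text has exactly the same length as the plain text.

```python
import hashlib


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy key stream generator (illustrative only, not a secure cipher)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_stream(data: bytes, key: bytes, nonce: bytes) -> bytes:
    """Mixing function: XOR the data with the key stream.

    Applying it twice with the same key and nonce restores the input.
    """
    ks = keystream(key, nonce, len(data))
    return bytes(d ^ k for d, k in zip(data, ks))


if __name__ == "__main__":
    key, nonce = b"hypothetical key", b"frame-0001"
    for frame in (b"short", b"a considerably longer voice frame"):
        cipher_text = xor_stream(frame, key, nonce)
        assert xor_stream(cipher_text, key, nonce) == frame
        print(len(frame), "->", len(cipher_text))
```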

    In a block cipher, data is encrypted and decrypted in blocks. In its simplest mode, the plain text is divided into blocks, which are then fed into the cipher to produce blocks of cipher text. ECB (Electronic Codebook mode) is the basic form of block cipher operation, in which each data block is encrypted directly to generate its corresponding cipher block.
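    Because a block cipher only accepts whole blocks, a frame whose length is not a multiple of the block size has to be padded first. The sketch below computes the padded length and the relative overhead for the 16, 32, and 64 byte multiples mentioned in the abstract. The frame sizes are hypothetical, and the sketch assumes that a frame already aligned to the block size receives no padding; how the receiver learns the amount of padding is left out.

```python
def padded_length(frame_len: int, block_bytes: int) -> int:
    """Smallest multiple of block_bytes that is >= frame_len."""
    return -(-frame_len // block_bytes) * block_bytes


def padding_overhead(frame_len: int, block_bytes: int) -> float:
    """Padding bytes added, as a fraction of the original frame length."""
    return (padded_length(frame_len, block_bytes) - frame_len) / frame_len


if __name__ == "__main__":
    frame_sizes = [6, 20, 38, 62]  # hypothetical VBR frame sizes in bytes
    for block in (16, 32, 64):
        padded = [padded_length(n, block) for n in frame_sizes]
        distinct = len(set(padded))
        print(f"block {block:2d}: padded sizes {padded}, distinct sizes {distinct}")
```

    With these example sizes, larger blocks map more of the original frame sizes onto the same padded size, at the price of more padding bytes per frame.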

    There are many variants of block cipher operation, in which different techniques are used to strengthen the security of the system. The most common modes are ECB (Electronic Codebook), CBC (Cipher Block Chaining), and OFB (Output Feedback). ECB encrypts each block independently, whereas CBC uses the cipher block from the previous step of encryption in the current one, which forms a chain-like encryption process. OFB operates on the plain text in a way similar to the stream cipher described above: the key stream block used in every step depends on the one from the previous step. There are other modes, such as CTR (Counter) and CFB (Cipher Feedback). CTR mode transforms a block cipher into a stream cipher. The idea is simple: the block cipher encrypts successive counter values to generate a key stream, which is mixed (typically XORed) with the plain text.

    The natural mode of operation for real-time voice communication is the stream cipher, because the transferred data takes the form of a stream. In this project, however, we explore the option of using a block cipher. The case for using block ciphers to encrypt voice data comes from trading off performance for security. We elaborate on this later in the report.

    Common Encryption Algorithms

    Here we discuss five of the best-known ciphers in the state of the art. Among these algorithms, AES and KASUMI are used to secure real-time voice communication. AES is standardized for voice over IP in the Secure Real-time Transport Protocol (SRTP), which is a profile of RTP. KASUMI, on the other hand, was standardized by 3GPP and is used in GSM and its successor systems.

    DES (Data Encryption Standard) was the first encryption standard to be recommended by NIST (National Institute of Standards and Technology). It is based on Lucifer, an algorithm proposed by IBM. DES became a standard in 1977. Since then, many attacks exploiting its weaknesses have been recorded, making it an insecure block cipher.

    3DES: The 3DES (Triple DES) encryption standard was proposed as an enhancement of DES. It uses the same encryption method as the original DES but applies it 3 times to increase the level of security. As a result, 3DES is known to be slower than other block ciphers.

    AES (Advanced Encryption Standard) is the encryption standard recommended by NIST to replace DES. The Rijndael algorithm (pronounced "Rain Doll") was selected in 2000 after a competition, launched in 1997, to choose the best encryption standard. No practical attack better than brute force, in which the attacker tries every possible key, is known against it. Both AES and DES are block ciphers.

    Blowfish: It is one of the most common public-domain encryption algorithms, designed by Bruce

    Schneier, one of the world's leading cryptologists and the president of Counterpane Systems, a

    consulting firm specializing in cryptography and computer security. Blowfish is a variable-length-key,

    64-bit block cipher. The Blowfish algorithm was first introduced in 1993. It can be optimized in

    hardware, though it is mostly used in software applications. Though it suffers from a weak-keys

    problem, no attack is known to be successful against it.

    KASUMI: It is a block cipher used in UMTS, GSM, and GPRS mobile communications systems. In

    UMTS, KASUMI is used in the confidentiality (f8) and integrity algorithms (f9) with names UEA1

    and UIA1, respectively. In GSM, KASUMI is used in the A5/3 key stream generator and in GPRS in

    the GEA3 key stream generator.

    KASUMI was designed for 3GPP to be used in UMTS security system by the Security Algorithms

    Group of Experts (SAGE), a part of the European standards body ETSI. SAGE agreed with 3GPP

    technical specification group (TSG) for system aspects of 3G security (SA3) to base the development

    on an existing algorithm that had already undergone some evaluation. They chose the cipher

    algorithm MISTY1 developed and patented by Mitsubishi Electric Corporation. The original

    algorithm was slightly modified for easier hardware implementation and to meet other requirements

    set for 3G mobile communications security.

    In January 2010, Orr Dunkelman, Nathan Keller and Adi Shamir released a paper showing that they

    could break KASUMI with a related-key attack and very modest computational resources. Interestingly,

    the attack is ineffective against MISTY1.

    Report Structure

    In the first chapter of this report, we presented the state of the art of both compression and

    encryption. We reviewed the encoding concepts along with the widely used encoders. We also

    reviewed security briefly. Cipher types and modes were presented with emphasis on their

    application to VoIP and mobile telephony.

    In the second chapter, we give a brief literature review stating the main problem the project tries

    to tackle: the bad combination of VBR compression and stream encryption. The papers describing

    the security vulnerabilities are reviewed briefly. The solution to the problem is discussed and the

    perspective adopted by the project is defined.

    Chapter III presents the test-bed that we created in order to measure bitrates. The test-bed

    consists of 3 main elements (or stages): encoding, encryption, and sending/receiving.

    The fourth and final chapter includes all the obtained results. These results are the bitrates

    obtained with different setups spanning the whole space of options in our field of interest. This

    chapter also includes the concluding statement along with recommendations for future work.

    Chapter II Literature Review and Problem Formulation

    The main problem tackled in this project can be stated briefly. The combination of variable

    bit-rate compression and length-preserving encryption (stream cipher) introduces a security

    weakness in the form of vulnerability to traffic analysis. The solution is to reduce information

    leakage by reducing the variation of bitrate in the transmitted stream. This is achieved by relying

    on constant bitrate (CBR) compression or by using padding. In brief, our project focuses on

    analyzing the cost of padding in terms of bandwidth. We aim to test padding and to reach a

    conclusion about its cost and, consequently, its feasibility. The outcome of the research project

    should answer the question of whether a trusted security level can be reached using existing tools.

    In this chapter, we present the weakness introduced by variable bit-rate compression and then

    discuss the perspective adopted in tackling this problem.

    Traffic Analysis of Encrypted Voice Stream

    In 2007, a paper was published under the title "Language Identification of Encrypted VoIP Traffic".

    It was followed in 2008 by another paper, "Spot me if you can: Uncovering spoken phrases in

    encrypted VoIP conversations". The most important paper in this context was published in 2011 under

    the title "Phonotactic Reconstruction of Encrypted VoIP Conversations: Hookt on fon-iks". The idea

    common to the three titles is the extraction of certain information (language, some phrases, phoneme

    reconstruction) from an encrypted VoIP stream. A key point is not revealed in the titles: such

    extraction relies on variable bit-rate compression.

    The Secure RTP (SRTP) framework [RFC3711] is a widely used framework for securing RTP

    sessions [RFC3550]. SRTP provides the ability to encrypt the payload of an RTP packet, and

    optionally add an authentication tag, while leaving the RTP header and any header extension in the

    clear. A range of encryption transforms can be used with SRTP, but none of the predefined encryption

    transforms use any padding; the RTP and SRTP payload sizes match exactly.

    When using SRTP with voice streams compressed using variable bit rate (VBR) codecs, the length

    of the compressed packets will depend on the characteristics of the speech signal. This variation in

    packet size will leak a small amount of information about the contents of the speech signal. This is

    potentially a security risk for some applications. For example, [spot-me] shows that known phrases

    in an encrypted call using the Speex codec in VBR mode can be recognized with high accuracy in

    certain circumstances, and [fon-iks] shows that approximate transcripts of encrypted VBR calls can

    be derived for some codecs without breaking the encryption. How significant these results are, and

    how they generalize to other codecs, is still an open question. RFC 6562 discusses ways in which

    such traffic analysis risks may be mitigated.

    Information leakage via variable bit-rate

    Generally speaking, the codec takes as input the audio stream from the user, which is typically

    sampled at either 8000 or 16000 samples per second (Hz). At some fixed interval, the codec takes the

    n most recent samples from the input, and compresses them into a packet for efficient transmission

    across the network. To achieve the low latency required for real-time performance, the length of the

    interval between packets is typically fixed between 10 and 50 ms, with 20 ms being the common case.

    Thus, with a 20 ms interval, a 16 kHz audio source gives n = 320 samples per packet, and an 8 kHz

    source gives 160 samples per packet.
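
    The arithmetic behind these figures can be sketched in a few lines of C. The function name

    samples_per_packet is our own, chosen for illustration.

```c
/* Number of audio samples carried by one packet, given the sampling rate
 * in Hz and the packet interval in milliseconds. */
int samples_per_packet(int sample_rate_hz, int interval_ms)
{
    return sample_rate_hz * interval_ms / 1000;
}
```

    For example, samples_per_packet(16000, 20) gives 320 and samples_per_packet(8000, 20) gives 160.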

    Many common voice codecs are based on a technique called code-excited linear prediction (CELP).

    For each packet, a CELP encoder simply performs a brute-force search over the entries in a codebook

    of audio vectors to output the one that most closely reproduces the original audio. The quality of the

    compressed sound is therefore determined by the number of entries in the codebook. The index of the

    best-fitting codebook entry, together with the linear predictive coefficients and the gain, make up the

    payload of a CELP packet. The larger codebooks used for higher-quality encodings require more bits

    to index, resulting in higher bit rates and therefore larger packets.
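
    To make the link between codebook size and packet size concrete, the sketch below computes how

    many bits are needed to index a codebook with a given number of entries. The function name

    index_bits is hypothetical, and real CELP codecs split the index over several sub-codebooks; this

    is only the basic counting argument.

```c
/* Smallest number of bits able to address `entries` distinct codebook
 * vectors, i.e. the ceiling of log2(entries). */
unsigned index_bits(unsigned long entries)
{
    unsigned bits = 0;
    if (entries <= 1)
        return 0;
    for (unsigned long v = entries - 1; v != 0; v >>= 1)
        bits++;
    return bits;
}
```

    Under this sketch, a 256-entry codebook costs 8 bits per index and a 1024-entry codebook costs

    10 bits, so a richer codebook directly increases the size of every packet.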

    In some CELP variants, such as QCELP, Speex's variable bit rate mode, or the approach advocated

    by Zhang et al., the encoder adaptively chooses the bit rate for each packet in order to achieve a good

    balance of audio quality and network bandwidth. This approach is appealing because the decrease in

    data volume may be substantial, with little or no loss in quality: in a two-way call, each participant

    is idle roughly 63% of the time. Unfortunately, this approach can also cause substantial leakage of

    information in encrypted VoIP calls because, in the standard specification for Secure RTP (SRTP),

    the cryptographic layer does not pad or otherwise alter the size of the original RTP payload.

    Intuitively, the sizes of CELP packets leak information because the choice of bit rate is largely based

    on the audio encoded in the packet's payload. For example, the variable bit-rate Speex codec encodes

    vowel sounds at higher bit rates than fricative sounds like "f" or "s". In phonetic models of speech,

    sounds are broken down into several different categories, including the aforementioned vowels and

    fricatives, as well as stops like "b" or "d", and affricates like "ch". Each of these canonical sounds

    is called a phoneme, and the pronunciation of each word in the language can then be given as a

    sequence of phonemes. While there is no consensus on the exact number of phonemes in spoken

    English, most in the speech community put the number between 40 and 60.

    In [9], to demonstrate the relationship between bit rate and phonemes, the authors encoded several

    recordings from the TIMIT corpus of phonetically rich English speech using Speex in wideband

    variable bit rate mode, and observed the bit rate used to encode each phoneme. The probabilities

    for 8 of the 21 possible bit rates are shown for a handful of phonemes in Figure II-1. As expected,

    the two vowel sounds, "aa" and "aw", are typically encoded at significantly higher bit rates than the

    fricative "f" or the consonant "k". Moreover, large differences in the frequencies of certain bit

    rates (namely, 16.6, 27.8, and 34.2 kbps) can be used to distinguish "aa" from "aw" and "f" from "k".

    Figure II-1: Distribution of bit rates used to encode four phonemes with Speex

    Figure II-2: Packets for "artificial"

    Figure II-3: Packets for "intelligence"

    In fact, it is these differences in bit rate for the phonemes that make recognizing words and phrases

    in encrypted traffic possible. To illustrate the patterns that occur in the stream of packet sizes when a

    certain word is spoken, the authors of [9] examined the sequences of packets generated by encoding

    several utterances of the words "artificial" and "intelligence" from the TIMIT corpus. They represent

    the packets for each word visually in Figures II-2 and II-3 as a data image: a grid with bit rate on

    the y-axis and position in the sequence on the x-axis. Starting with a plain white background, the

    cell at position (x,y) is darkened each time a packet encoded at bit rate y is observed at position x

    for the given word. In both graphs, we see several dark gray or black grid cells where the same

    packet size is consistently produced across different utterances of the word, and in fact, these dark

    spots are closely related to the phonemes in the two words. In Figure II-2, the bit rate in the 2nd

    to 5th packets (the "a" in "artificial") is usually quite high (35.8 kbps), as we would expect for a

    vowel sound. Then, in packets 12 to 14 and 20 to 22, we see much lower bit rates for the fricative

    "f" and the affricate "sh". Similar trends are visible in Figure II-3; for example, the "t" sound

    maps consistently to 24.6 kbps in both words.
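
    The construction of such a data image can be sketched as a grid of counters, where a larger count

    corresponds to a darker cell. The type data_image, the function names, and the cap of 64 packets

    per word are assumptions made for this illustration; they do not come from [9].

```c
#include <stddef.h>
#include <string.h>

#define N_RATES 21  /* bit-rate bins; the text mentions 21 possible bit rates */
#define MAX_POS 64  /* assumed upper bound on packets per word */

/* count[y][x] = number of utterances whose packet at position x used bit-rate bin y */
typedef struct {
    unsigned count[N_RATES][MAX_POS];
} data_image;

/* Start from a "plain white background": all counters at zero. */
void data_image_clear(data_image *img)
{
    memset(img, 0, sizeof *img);
}

/* Add one utterance. rate_idx[x] is the bit-rate bin of the packet at position x.
 * Out-of-range bins and positions beyond MAX_POS are ignored. */
void data_image_add(data_image *img, const int *rate_idx, size_t n_packets)
{
    for (size_t x = 0; x < n_packets && x < MAX_POS; x++) {
        int y = rate_idx[x];
        if (y >= 0 && y < N_RATES)
            img->count[y][x]++;
    }
}
```

    Cells whose counter approaches the number of utterances are the dark spots discussed above: the

    same packet size was produced at the same position almost every time the word was spoken.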

    Example of traffic analysis

    In the paper "Uncovering spoken phrases in encrypted VoIP conversations" [9], the method adopted

    for analyzing the encrypted VoIP stream can be summarized as follows:

    To identify a phrase without using any examples of the phrase or any of its constituent words, a

    concatenative synthesis technique is applied to generate a few hundred synthetic training sequences

    for the phrase. These sequences are used to train a profile HMM (hidden Markov model) for the

    phrase, which is then used to search for the phrase in streams of packets. An overview of the entire

    training and detection process is given in Figure II-4.

    Figure II-4: Overview of training and detection process

    Mitigation Techniques

    One way to prevent word spotting would be to pad packets to a common length, or at least to a coarser

    granularity. Another way is to refrain from using VBR and to use the CBR mode instead. However,

    this is not optimal, since it gives up the bandwidth savings of VBR entirely. Padding regains the

    lost security (to a certain extent, as we will see) while preserving some of the benefit of variable

    bit-rate encoding.

    In the paper [9], the traffic analysis system (search algorithm) was tested against padding. To explore

    the tradeoff between padding and search accuracy, the authors encrypted both their training and

    testing data sets with packets padded to multiples of 128, 256 or 512 bits and applied their approach.

    The results are presented in Figure II-5. The use of padding is quite encouraging as a mitigation

    technique, as it greatly reduced the overall accuracy of the search algorithm. When padding to

    multiples of 128 bits, the system achieves only 0.15 recall at 0.16 precision. Increasing padding so

    that packets are multiples of 256 bits gives a recall of 0.04 at 0.04 precision.
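
    The effect of padding on what an eavesdropper can observe can be sketched as follows. Padding to

    128, 256 or 512 bits means rounding each frame up to a multiple of 16, 32 or 64 bytes, which

    collapses many distinct frame sizes into a few. The function names are our own, and the example

    sizes in the note below are nominal values assumed for 20 ms narrowband Speex frames.

```c
#include <stddef.h>

/* Round len up to the next multiple of block (both in bytes). */
size_t padded_len(size_t len, size_t block)
{
    return (len + block - 1) / block * block;
}

/* Count how many distinct packet sizes remain visible after padding.
 * Quadratic scan, which is fine for the handful of sizes a codec produces.
 * Use block = 1 for "no padding". */
size_t distinct_padded_sizes(const size_t *sizes, size_t n, size_t block)
{
    size_t distinct = 0;
    for (size_t i = 0; i < n; i++) {
        size_t p = padded_len(sizes[i], block);
        int seen = 0;
        for (size_t j = 0; j < i; j++) {
            if (padded_len(sizes[j], block) == p) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            distinct++;
    }
    return distinct;
}
```

    With the assumed frame sizes 6, 10, 15, 20, 28, 38, 46 and 62 bytes, an observer sees 8 distinct

    lengths without padding, 4 with 16-byte padding, 2 with 32-byte padding, and a single length with

    64-byte padding. Fewer distinguishable lengths leave less material for the traffic analysis.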

    The debate around the announcement of security flaws in variable bit-rate encoding led to the

    publication of an RFC by the IETF. This document, "Guidelines for the Use of Variable Bit Rate Audio

    with Secure RTP" (RFC 6562), gives guidelines for dealing with variable bit-rate in the SRTP protocol.

    For scenarios where VBR is considered unsafe, a constant bit rate (CBR) codec SHOULD be

    negotiated and used instead, or the VBR codec SHOULD be operated in a CBR mode. However, if

    the codec does not support CBR, RTP padding SHOULD be used to reduce the information leak to

    an insignificant level. Packets may be padded to a constant size or to a small range of sizes

    ([spot-me] achieves good results by padding to the next multiple of 16 octets, but the amount of

    padding needed to hide the variation in packet size will depend on the codec and the sophistication

    of the attacker) or may be padded to a size that varies with time. The most secure and

    RECOMMENDED option is to pad all packets throughout the call to the same size.

    Figure II-5: Robustness to padding

    In the case where the size of the padded packets varies in time, the same concerns as for voice activity detection (VAD) apply.

    That is, the padding SHOULD NOT be reduced without waiting for a certain (random) time. The

    RECOMMENDED "hold time" is the same as the one for VAD.

    Note that SRTP encrypts the count of the number of octets of padding added to a packet, but not the

    bit in the RTP header that indicates that the packet has been padded. For this reason, it is

    RECOMMENDED to add at least one octet of padding to all packets in a media stream, so an attacker

    cannot tell which packets needed padding [10].
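
    The RTP padding mechanism referred to here can be sketched as follows. In RTP, the padding flag is

    the P bit of the first header octet, and the last padding octet holds the number of padding octets,

    including itself. The function rtp_add_padding is a hypothetical helper; as a simplifying assumption

    it pads the whole unencrypted packet, and it leaves out SRTP encryption and the authentication tag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RTP_HEADER_MIN  12     /* fixed RTP header size in octets */
#define RTP_PADDING_BIT 0x20u  /* P bit in the first octet of the RTP header */

/* Pad an RTP packet of len octets (buffer capacity cap) so that its length
 * becomes a multiple of block. At least one padding octet is always added,
 * so every packet carries the P bit. Returns the new length, or 0 if the
 * buffer is too small. */
size_t rtp_add_padding(uint8_t *pkt, size_t len, size_t cap, size_t block)
{
    assert(len >= RTP_HEADER_MIN);
    assert(block >= 1 && block <= 255);  /* the count must fit in one octet */

    size_t pad = block - (len % block);  /* 1..block, never 0 */
    if (len + pad > cap)
        return 0;

    memset(pkt + len, 0, pad);           /* padding octets */
    pkt[len + pad - 1] = (uint8_t)pad;   /* last octet = padding count */
    pkt[0] |= RTP_PADDING_BIT;           /* signal padding in the header */
    return len + pad;
}
```

    Because a full block is added when the length is already aligned, an observer cannot use the P bit

    to tell which packets were originally a multiple of the block size.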

    Chapter III Test-Bed

    In the previous chapter, we presented the security weakness caused by the combination of variable

    bit-rate encoding and length-preserving encryption. This weakness takes the form of vulnerability

    to traffic analysis. The performance of the traffic analysis system presented in the previous chapter

    degrades when padding is applied, and degrades further as the padding block size increases.

    Furthermore, because padding preserves security to a great extent, the IETF recommends in

    RFC 6562 either to use constant bit-rate encoding or to rely on padding, for example to a block

    length of 16 bytes.

    The discussion around the subject did not take into consideration the tradeoff between security and

    performance, which raises the question of the feasibility of padding. A key point to keep in mind

    is that variable bitrate encoding aims at lowering the needed bandwidth as much as possible.

    Consequently, the cost of padding in terms of bit-rate and needed bandwidth has to be calculated in

    order to understand the price we have to pay to achieve security while using variable bitrate.

    Answering the question about the feasibility of padding is the main goal of our research project.

    The answer might be that padding costs more than constant bitrate and that, consequently,

    padding is not the optimal solution for preserving security. In any case, we aim at obtaining a

    solid view of the cost paid for different security levels. The results of our test-bed should give a

    good understanding of the relation between security, quality, and performance.

    Quality is a parameter we include in our research as part of the tradeoff. The quality of the encoder

    is usually mapped to the bitrate it uses. Consequently, quality can be inserted into the tradeoff

    formulation as a price to pay for preserving both security and bitrate.

    In order to test properly and calculate the obtained bitrates, we need to create a system in

    which we implement compression, encryption, sending and receiving of a voice stream. The system

    should allow the manipulation of the parameters that we are interested in.

    Test-bed requirements

    The created system must be able to implement compression and encryption of a speech stream.

    Furthermore, the system should allow the manipulation of parameters for both compression and

    encryption. One more important requirement is the ability to send and receive the compressed and

    encrypted stream. Sending/receiving provides packetization of the stream in a realistic manner that

    can be related to a real application. The system should also be able to log the obtained bitrates for

    every setup.

    For compression, we should be able to choose the mode (narrowband or wideband). In addition,

    we should be able to choose the quality of compression. Quality is an important variable that is

    supported by many state-of-the-art algorithms. We emphasize the ability to choose quality since we

    are interested in including quality as a parameter in the tradeoff setup, as we will see later in the

    results section.

    In encryption, the main requirement is the ability to pad data to a multiple of 128, 256, or 512 bits.

    In addition, we need to adopt a cipher that is trusted in the state of the art. The cipher

    should have a low cost in terms of processing time, since the platforms are usually mobile phones with

    limited memory and processing power. One additional requirement is that the cipher be symmetric,

    since the protocols of interest implement symmetric encryption/decryption mechanisms.

    The requirements can be summarized and formulated in a compact format as the following:

    Compression:

    o Widely implemented encoder

    o Variable bit-rate compression

    o Variable quality setting

    Encryption:

    o Trusted low cost cipher

    o Padding to different sizes

    o Symmetric cipher

    Test-bed elements

    Based on the discussed requirements, the search for an encoder and a cipher is aimed at finding

    modules widely present in the state of the art. The test-bed is built in a Linux environment (Ubuntu

    distribution of GNU/Linux). The libraries used are all written in the C programming language;

    consequently, the test-bed was also written in C.

    For the encoder, we chose Speex. This encoder meets all the stated requirements. Furthermore, it

    is the encoder used in the three articles that describe the security vulnerability.

    Regarding encryption, the choice was obvious: the Advanced Encryption Standard. AES is standardized

    and adopted in SRTP, the main standard for security in voice over IP. However, SRTP specifications

    and implementations use AES in CTR mode (counter mode). This mode generates a key stream and

    mixes it with the data (using the XOR operation) in order to get the encrypted text. The length of the

    initial plain text is preserved. Consequently, this mode turns a block cipher into a length-preserving

    stream cipher regardless of the block size of the cipher. Other transforms specified by SRTP are

    AES in f8 mode and the null cipher.

    It is worth mentioning that the RFC giving guidelines for using variable bit-rate with SRTP

    recommends relying on higher levels in the hierarchy of the networking model to achieve

    padding. According to that document, padding is part of the compression stage or of the application

    layer in general. In our approach, however, we used a block cipher in the test bed. The choice of a

    block cipher does not affect the desired results in any way. Furthermore, making padding part of

    the encryption process is justified in terms of security requirements. Implementing padding in the

    compression stage or in another entity may introduce security vulnerabilities that are avoidable by

    using a block cipher. For example, when padding is done within the RTP payload, the number of

    padding bytes is carried in the encrypted part of the RTP packet, but the header flag indicating that

    padding is present is not encrypted.

    Speex Encoder

    In our test bed, we used the Speex encoder as the designated compression tool. We used the Speex

    library and relied on a detailed step-by-step construction of the encoder using the Speex API

    (Application Programming Interface). We made this choice because manipulating parameters and

    managing the encoder's output require such a construction rather than a prebuilt ready-to-use module.

    The libspeex library contains all the functions for encoding and decoding speech with the Speex codec.

    When linking on a UNIX system, we must add -lspeex -lm to the compiler command line.

    In order to encode speech using Speex, we first need to:

    #include <speex/speex.h>

    Then in the code, a Speex bit-packing struct must be declared, along with a Speex encoder state:

    SpeexBits bits;

    void *enc_state;

    The two are initialized by:

    speex_bits_init(&bits);

    enc_state = speex_encoder_init(&speex_nb_mode);

    For wideband coding, speex_nb_mode will be replaced by speex_wb_mode. In most cases, you will

    need to know the frame size used at the sampling rate you are using.

    The encoder is by default set to CBR mode. We set it to variable bit-rate mode by using:

    int vbr = 1;

    speex_encoder_ctl(enc_state, SPEEX_SET_VBR, &vbr);

    The variable vbr is an integer value (0 or 1). It is used to set VBR on (1) or off (0).

    There are many parameters that can be set for the Speex encoder, but the most useful one is the quality

    parameter that controls the quality vs. bit-rate tradeoff.

    This is set by:

    speex_encoder_ctl(enc_state,SPEEX_SET_VBR_QUALITY,&quality);

    Quality is a float value ranging from 0.0 to 10.0 (inclusive). The mapping between quality and

    bit-rate is described in the following two tables for both narrowband and wideband.

    Mode  Quality  Bit-rate (bps)  mflops  Quality/description

    0     -        250             0       No transmission (DTX)
    1     0        2,150           6       Vocoder (mostly for comfort noise)
    2     2        5,950           9       Very noticeable artifacts/noise, good intelligibility
    3     3-4      8,000           10      Artifacts/noise sometimes noticeable
    4     5-6      11,000          14      Artifacts usually noticeable only with headphones
    5     7-8      15,000          11      Need good headphones to tell the difference
    6     9        18,200          17.5    Hard to tell the difference even with good headphones
    7     10       24,600          14.5    Completely transparent for voice, good quality music
    8     1        3,950           10.5    Very noticeable artifacts/noise, good intelligibility
    9     -        -               -       reserved
    10    -        -               -       reserved
    11    -        -               -       reserved
    12    -        -               -       reserved
    13    -        -               -       Application-defined, interpreted by callback or skipped
    14    -        -               -       Speex in-band signaling
    15    -        -               -       Terminator code

    Table III-1 Quality versus bitrate for Speex narrowband

    Mode/Quality  Bit-rate (bps)  Quality/description

    0             3,950           Barely intelligible (mostly for comfort noise)
    1             5,750           Very noticeable artifacts/noise, poor intelligibility
    2             7,750           Very noticeable artifacts/noise, good intelligibility
    3             9,800           Artifacts/noise sometimes annoying
    4             12,800          Artifacts/noise usually noticeable
    5             16,800          Artifacts/noise sometimes noticeable
    6             20,600          Need good headphones to tell the difference
    7             23,800          Need good headphones to tell the difference
    8             27,800          Hard to tell the difference even with good headphones
    9             34,400          Hard to tell the difference even with good headphones
    10            42,400          Completely transparent for voice, good quality music

    Table III-2 Quality versus bitrate for Speex wideband
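
    To relate these bit-rates to packet sizes, the sketch below converts a nominal bit-rate into the

    number of bytes needed for one frame. The function name frame_bytes is our own, and the 20 ms

    frame duration used in the examples is an assumption of this sketch.

```c
/* Bytes needed to carry one frame at a nominal bit-rate, rounding the
 * bit count up to a whole number of bits and then to a whole byte. */
unsigned frame_bytes(unsigned bitrate_bps, unsigned frame_ms)
{
    unsigned long bits = ((unsigned long)bitrate_bps * frame_ms + 999) / 1000;
    return (unsigned)((bits + 7) / 8);
}
```

    Under that assumption, 8,000 bps corresponds to 20 bytes per frame, 24,600 bps to 62 bytes, and

    42,400 bps to 106 bytes. These per-frame sizes are the quantities that padding rounds up.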

    Once the initialization is done, for every input frame:

    speex_bits_reset(&bits);

    speex_encode_int(enc_state, input_frame, &bits);

    nbBytes = speex_bits_write(&bits, byte_ptr, MAX_NB_BYTES);

    Where input_frame is a (short *) pointing to the beginning of a speech frame, byte_ptr is a (char *)

    where the encoded frame will be written, MAX_NB_BYTES is the maximum number of bytes that

    can be written to byte_ptr without causing an overflow, and nbBytes is the number of bytes actually

    written to byte_ptr (the encoded size in bytes). Before calling speex_bits_write, it is possible to find

    the number of bytes that need to be written by calling speex_bits_nbytes(&bits), which returns a

    number of bytes.

    After you're done with the encoding, free all resources with:

    speex_bits_destroy(&bits);

    speex_encoder_destroy(enc_state);

    AES Encryption

    The choice of the AES cipher was justified in the previous section of the chapter. However, the

    algorithm has many implementations. Among these, a trusted and well-known library in the state of

    the art is OpenSSL.

    OpenSSL provides two primary libraries: libssl and libcrypto. The libcrypto library provides the

    fundamental cryptographic routines used by libssl. You can however use libcrypto without using

    libssl.

    For most uses, users should use the high level interface that is provided for performing cryptographic

    operations. This is known as the EVP interface (short for Envelope). This interface provides a suite

    of functions for performing encryption/decryption (both symmetric and asymmetric),

    signing/verifying, as well as generating hashes and MAC codes, across the full range of OpenSSL

    supported algorithms and modes. Working with the high level interface means that a lot of the

    complexity of performing cryptographic operations is hidden from view. A single consistent API is

    provided. In addition low level issues such as padding and encryption modes are all handled.

    The EVP functions provide a high level interface to OpenSSL cryptographic functions. They provide

    the following features:

    A single consistent interface regardless of the underlying algorithm or mode

    Support for an extensive range of algorithms

    Encryption/Decryption using both symmetric and asymmetric algorithms

    Sign/Verify

    Key derivation

    Secure Hash functions

    Message Authentication Codes

    Support for external crypto engines

    AES is available in libcrypto in different modes and with key sizes of 128, 192, and 256 bits.

    Note that these figures are key lengths: the block size of AES itself is fixed at 128 bits, so the

    library offers neither a 256-bit nor a 512-bit block. In the test bed, we used the algorithm in CBC

    mode for the 128 and 256 configurations, and to get the size of 512 bits, we relied on manual padding.

    Although the use of EVP as a high-level interface simplifies using the library to a great extent, using

    EVP in a complex test bed with multi-stage procedures may introduce complexity.

    To encrypt using EVP, first we have to:

    #include <openssl/evp.h>

    The encryption process starts with initializing the cipher. We have to create a context, an "opaque"

    structure that libcrypto uses to record the status of encrypt/decrypt operations:

    EVP_CIPHER_CTX e_ctx;

    Then we have to create a key and an IV (initialization vector) for the cipher. A SHA1 digest is used

    to hash the supplied key material (password) multiple times (rounds). More rounds are more secure

    but slower. Then, after setting the key and IV, we call:

    EVP_CIPHER_CTX_init(&e_ctx);

    EVP_EncryptInit_ex(&e_ctx, EVP_aes_256_cbc(), NULL, key, iv);

    This initiates AES encryption in CBC mode with a 256-bit key, as selected by the second parameter.

    To select a 128-bit key instead, we call:

    EVP_EncryptInit_ex(&e_ctx, EVP_aes_128_cbc(), NULL, key, iv);

    Encryption of the Speex frame then takes place in the following manner:

    EVP_EncryptInit_ex(&e_ctx, NULL, NULL, NULL, NULL);

    EVP_EncryptUpdate(&e_ctx, ciphertext, &c_len, plaintext, *len);

    EVP_EncryptFinal_ex(&e_ctx, ciphertext + c_len, &f_len);
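
    The size of the resulting ciphertext can be predicted without running the cipher. With padding

    enabled, which is the EVP default for block modes, the output is the plaintext length rounded up to

    the next multiple of the 16-byte AES block, with a full extra block when the input is already

    aligned. The helper names below are our own; the sketch only does the length arithmetic and does

    not call OpenSSL.

```c
#include <stddef.h>

#define AES_BLOCK_BYTES 16  /* the AES block size is 128 bits for every key length */

/* Ciphertext length in CBC mode with standard block padding:
 * at least one padding byte is always appended. */
size_t cbc_ciphertext_len(size_t plain_len)
{
    return (plain_len / AES_BLOCK_BYTES + 1) * AES_BLOCK_BYTES;
}

/* Ciphertext length in a length-preserving mode such as CTR. */
size_t ctr_ciphertext_len(size_t plain_len)
{
    return plain_len;
}
```

    For a 62-byte frame, CBC produces 64 bytes while CTR produces 62; for a 64-byte frame, CBC

    produces 80 bytes. The sum c_len + f_len after the calls above should match cbc_ciphertext_len.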

    Note: neither decompression nor decryption of the stream is implemented in the test bed. Although

    implementing decoding and decryption would add value and integrity to the results, the bitrates can

    be calculated without either of them.

    RTP Sending/Receiving

    The previous two stages of the test bed allow calculating the bitrate in the absence of

    packetization. To achieve realistic results, we implement sending and receiving of the stream in two

    separate threads. Then we calculate the bitrates of the received and dumped packets.

    The library used for RTP sending/receiving is oRTP, an implementation of the RTP protocol. A number

    of calls must be made to initialize the library. The first of these is RTPCreate(), which

    establishes a context. A context is an identifier used by the library to determine which RTP session a

    function call is to be associated with. An application can run many sessions at the same time, each

    created with a separate call to RTPCreate, resulting in a different context for each. Most library

    functions accept a context as the first argument. Once RTPCreate has been called to initialize the

    session, the addresses for the session must be set.

    rtperror RTPCreate(context *the_context);

    rtperror RTPOpenConnection(context cid);

    Sending packets is fairly straightforward. The RTPSend() function is used to tell the library to send

    an RTP packet. It requires the user to pass a pointer to a buffer, a length, a value for the marker field

    in the RTP header, an increment for the timestamp, and the context. The library will take the buffer,

    add the RTP header, perform any required operations, and send the packet. The library will

    automatically send RTCP packets. The initial timestamp and sequence number are chosen randomly.

    rtperror RTPSend(context cid, int32 tsinc, int8 marker, int16 pti, int8 *payload, int len);

    Receiving packets is a little more complex. To learn whether a packet is available for reading, a
    process can block, poll, or use any other mechanism. Since the library does not dictate this
    policy, it is up to the application to determine when data is available. We chose to poll every 20
    milliseconds for a received packet. To make this possible, the library gives access to its two
    receive sockets: one for RTP and one for RTCP. The functions RTPSessionGetRTPSocket and
    RTPSessionGetRTCPSocket retrieve them. Each takes as input the context and a pointer to a socket;
    when it returns 1, the socket has been filled in. We then check for the presence of a packet on
    these sockets using select().

    RTPSessionGetRTPSocket(context cid, socktype *value);

    rtperror RTPReceive(context cid, int socket, char *rtp_pkt_stream, int *len);
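
The polling policy described above can be sketched in a few lines. This is only an illustration in Python using the standard select and socket modules, not the RTP library's API: the function name poll_once and the 2048-byte buffer size are assumptions made for the example, and the two sockets stand in for the RTP and RTCP sockets that the library hands back.

```python
import select
import socket

POLL_INTERVAL_S = 0.020  # 20 ms, the polling period chosen in the text


def poll_once(rtp_sock, rtcp_sock, timeout=POLL_INTERVAL_S, bufsize=2048):
    """Wait up to `timeout` seconds and return a list of (kind, data) pairs.

    `kind` is "rtp" or "rtcp" depending on which socket became readable.
    An empty list means nothing arrived during this polling period.
    """
    readable, _, _ = select.select([rtp_sock, rtcp_sock], [], [], timeout)
    received = []
    for sock in readable:
        kind = "rtp" if sock is rtp_sock else "rtcp"
        data, _sender = sock.recvfrom(bufsize)
        received.append((kind, data))
    return received
```

A receiver thread would call poll_once in a loop and append every returned packet to the dump file. In the real test-bed the read itself is done by RTPReceive(), which also parses RTCP.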

    When a packet is present on either socket, the application should call the function RTPReceive().

    This function takes the context, the socket on which data is present, a pointer to a buffer, and a pointer

    to a length value. The length value should be initialized to the amount of room in the buffer. The

    library will read and process the RTP or RTCP packet. For RTCP, it will perform all statistics

    collection and parsing. The buffer will be filled in with the entire RTP/RTCP packet, including the

    header.

    We then save every received packet to a file so that the obtained bitrate can be calculated
    afterwards. The bitrate is the size of the saved data divided by the previously known duration of
    the sent speech.
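
As a minimal sketch of this calculation, assuming the dump has already been reduced to a list of packet sizes in bytes (the function name and argument layout are illustrative, not taken from the test-bed):

```python
def average_bitrate_bps(packet_sizes_bytes, duration_s):
    """Average bitrate in bits per second.

    packet_sizes_bytes: sizes of the dumped packets (or frames), in bytes.
    duration_s: known duration of the transmitted speech, in seconds.
    """
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    total_bits = 8 * sum(packet_sizes_bytes)
    return total_bits / duration_s
```

Passing the recorded frame sizes gives the bitrate without packetization, while passing the sizes of the full dumped packets gives the bitrate with packetization.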

    Dataset

    The choice of dataset was guided by the dataset used in the published articles on the subject,
    which relied on the TIMIT corpus, a database used for speech recognition. Since the TIMIT database
    is not open for public use, we chose to work with another speech recognition training database:
    the census database. The following description accompanies the dataset:

    The directory contains the alphanumeric database (aka "census" aka "an4") recorded at Carnegie

    Mellon University circa 1991. Subjects were asked to spell out personal information, such as name,

    address, telephone number, birthdates, etc. They were instructed to not use their actual numbers. In

    addition to these, subjects also spoke randomly generated sequences of words containing control

    words. The database used internally at CMU has 1018 training and 140 test utterances, whereas the

    database provided here has 948 training and 130 test utterances. All data are sampled at 16 kHz, 16-

    bit linear sampling. All recordings were made with a close talking microphone.

    In the dataset, we have two directories:

    an4_clstk

    The directory with training data has 74 sub-directories, one for each speaker. 21 of

    them are female, 53 are male. The total number of utterances is 948, and the average

    duration is about 3 seconds, totaling a little less than 50 minutes of speech.

    an4test_clstk

    The directory with test data has 10 sub-directories, one for each speaker. 3 of them

    are female, 7 are male. The total number of utterances is 130, totaling around 6

    minutes of speech.

    Test-bed overview

    The test-bed can be summarized by the block diagram in figure III-1.

    Testing starts with choosing a file from the dataset. The duration of the file is calculated by
    counting the number of samples it contains. The file is then compressed and encrypted. Compression
    must span the whole quality range (0 to 10), and encryption must be run in each of the 4 presented
    modes (stream and 3 block sizes). Next, the result is sent over an RTP socket to a local receiving
    socket opened by another thread. Received packets are dumped to an output file. The recorded frame
    sizes are used to calculate the bit-rate without the packetization overhead, while the size of the
    sent/received stream is used to calculate the bit-rate including the network overhead.

    Figure III-1: Test-bed block diagram. Blocks: choose file from dataset; calculate time; compress
    using Speex; set quality parameter; encrypt using AES in CBC mode; set the block size; record
    frame sizes; send/receive over RTP socket; dump frames; calculate bit-rate.
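
Two of the steps in this pipeline are simple arithmetic and can be sketched as follows. The sketch assumes the 16 kHz sampling rate stated for the dataset and assumes that a frame is padded up to the next multiple of the block size (16, 32 or 64 bytes for 128, 256 or 512 bit padding). The exact padding rule used in the test-bed is not stated, and some schemes add a full extra block to frames that are already aligned. Function names are illustrative.

```python
SAMPLE_RATE_HZ = 16_000  # the dataset is sampled at 16 kHz


def duration_seconds(num_samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Duration of a recording computed from its sample count."""
    return num_samples / sample_rate_hz


def padded_size(frame_bytes, block_bytes):
    """Smallest multiple of block_bytes that is >= frame_bytes.

    block_bytes <= 1 models the stream cipher, which adds no padding.
    """
    if block_bytes <= 1:
        return frame_bytes
    blocks = -(-frame_bytes // block_bytes)  # ceiling division
    return blocks * block_bytes
```

For instance, a 38-byte frame stays at 38 bytes with the stream cipher and grows to 48, 64 and 64 bytes with 16-, 32- and 64-byte blocks respectively.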

    Chapter IV Results and Conclusion

    Tests were run using the test-bed presented in the previous chapter. The resulting bitrates are
    divided into two categories: narrowband and wideband. For each category we present the bitrate
    obtained for the 4 encryption schemes (stream, 128, 256, and 512 bit padding), along with the
    overhead of the latter three schemes relative to the original stream bitrate.
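
The overhead figure used in this chapter can be read as the relative increase of the padded bitrate over the stream bitrate at the same quality setting. A minimal sketch under that assumption (the function name is illustrative):

```python
def overhead_percent(padded_rate, stream_rate):
    """Relative bitrate increase of a padded scheme over the stream cipher, in percent."""
    if stream_rate <= 0:
        raise ValueError("stream rate must be positive")
    return 100.0 * (padded_rate - stream_rate) / stream_rate
```

Both rates must be expressed in the same unit and measured at the same quality setting.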

    Narrow Band Results

    Figure IV-1: NB padding overhead (without packetization)

    Figure IV-2: NB rate vs quality (without packetization)

    As these results show, the overhead induced by padding in narrow band mode is large. Figures IV-1
    and IV-2 show the rate versus quality for the three levels of security as well as the overhead
    induced by padding. The 128 bit padding adds a moderate overhead. The 512 bit padding yields a
    constant bitrate throughout the whole range of quality, that is, every frame is padded to the same
    size whatever the quality setting. Consequently, using CBR at the highest quality may be a better
    solution than relying on 512 bit padding. For the other padding schemes (256 bit for example),
    however, the added overhead is manageable.

    These results allow tradeoffs such as the following. Consider the rates of stream encryption and
    of 256 bit padding. The average rate for stream encryption at quality 10 is the same as that for
    256 bit padding at quality 7. A tradeoff can therefore be made: padding to 256 bits and setting
    the quality to 7 yields a large security gain while keeping the same rate. The price to pay is
    quality.
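
This kind of tradeoff can be read off the measured averages mechanically: take the stream-cipher rate at the reference quality as a budget and look for the highest quality at which the padded scheme stays within it. The sketch below assumes the averages are available as a mapping from quality to bitrate. The function name is illustrative and the numbers in the usage comment are made up to show the call, not measurements from this work.

```python
def best_quality_within_budget(padded_rates, budget):
    """Highest quality whose padded bitrate does not exceed `budget`.

    padded_rates: dict mapping quality setting -> average bitrate of the padded scheme.
    budget: bitrate to stay within, e.g. the stream-cipher rate at quality 10.
    Returns None if no quality setting fits the budget.
    """
    feasible = [q for q, rate in padded_rates.items() if rate <= budget]
    return max(feasible) if feasible else None


# Usage with made-up rates in kbit/s:
# best_quality_within_budget({6: 12.0, 7: 14.0, 8: 17.0}, budget=14.0) returns 7
```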

    Figure IV-3: NB rate vs quality (with packetization)

    Figure IV-4: NB padding overhead (with packetization)

    Wide Band Results

    The same tests were run with Speex set to wide-band mode. The following figures show the obtained
    results (bit-rate versus quality, and overhead) for the 4 encryption schemes. The overhead is much
    lower than in narrow band. As we see in figure IV-5, the curves corresponding to the 4 encryption
    schemes differ less, and the added overhead is consequently smaller. For example, if we make the
    same tradeoff as in the previous section, the quality drops only to 9 instead of 7. To pad to 512
    bits and keep the same bit-rate, the quality drops to 8.

    Figure IV-5: WB overhead (without packetization)

    Figure IV-6: WB rate versus quality (without packetization)

    An important observation is that the overhead is considerably smaller when calculated with
    packetization, because the packet headers are counted in both the padded and the unpadded stream.
    For example, the overhead for 256 bit padding at quality 7 is around 30% when calculated without
    packetization, and less than 20% when calculated with packetization.
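
The effect can be illustrated with a small calculation. The sketch assumes a fixed per-packet header; 40 bytes would correspond to an IPv4 header (20 bytes), a UDP header (8 bytes) and an RTP header (12 bytes), although which headers the measurements actually include is not stated here. The frame sizes in the usage note are made up for the illustration.

```python
def overhead_percent_with_header(frame_bytes, padded_bytes, header_bytes):
    """Padding overhead in percent when every packet also carries a fixed header."""
    unpadded_packet = frame_bytes + header_bytes
    padded_packet = padded_bytes + header_bytes
    return 100.0 * (padded_packet - unpadded_packet) / unpadded_packet
```

With a hypothetical 50-byte frame padded to 64 bytes, the overhead is 28% with header_bytes=0 and about 15.6% with header_bytes=40: the same 14 extra bytes are divided by a larger packet.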

    Statistical Analysis

    The results shown in the previous sections represent only the average bitrate. To obtain a clearer
    perspective, we calculated a 95% confidence interval for each obtained bitrate. The confidence
    interval gives us more information about the resulting bitrate. The width of the confidence
    interval indicates how much the scheme still benefits from variable bitrate compression. However,
    a very wide confidence interval cannot be associated with a manageable overhead.

    Figure IV-7: Wide Band overhead (with packetization)

    Figure IV-8: Wide Band rate versus quality (with packetization)

    Another important point is that when the confidence intervals of 2 encryption schemes overlap, the
    2 schemes may be operating at the same rate. Overall, the statistical results give a better
    perspective for understanding the tradeoff to be made.

    The following figures present the confidence intervals for the 4 encryption schemes. The 3 block
    encryption schemes are compared to the stream cipher. Only wide band results are shown.
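
The method used to compute the intervals is not detailed in this section. As one possible reading, the sketch below uses the usual normal approximation (mean plus or minus 1.96 standard errors) over the per-file bitrates of one scheme at one quality setting, and adds a helper for the overlap check. Both function names are illustrative.

```python
import math
import statistics


def confidence_interval_95(samples):
    """Return (low, high) of a 95% confidence interval for the mean.

    Uses the normal approximation: mean +/- 1.96 * s / sqrt(n),
    where s is the sample standard deviation.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = statistics.mean(samples)
    half_width = 1.96 * statistics.stdev(samples) / math.sqrt(n)
    return mean - half_width, mean + half_width


def intervals_overlap(a, b):
    """True if the closed intervals a=(low, high) and b=(low, high) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]
```

For small sample counts a Student t quantile would be more appropriate than 1.96.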

    Figure IV-9: Stream Cipher 95% Confidence Interval

    Figure IV-10: Stream and 128 bit padding Confidence Interval

    Conclusion and future recommendations

    The answer to the main question posed in this report is yes: using padding for encrypted VoIP
    streams is feasible. The results clearly show that a three-dimensional tradeoff can be made to
    reach the desired solution. The parameters of the tradeoff are the level of security (represented
    by the padding block size), the bit-rate, and the quality. The latter two parameters trade off
    against each other, while the security parameter changes the scale of the bitrate range.

    Figure IV-11: Stream and 512 bit padding confidence interval

    Figure IV-12: Stream and 256 bit padding confidence interval

    A remark should be made about the importance of 256 bit padding. It greatly improves immunity to
    traffic analysis (chapter 2), while the overhead induced by this scheme remains manageable.

    In conclusion, the security issues presented in the literature can be solved without relying on
    new technology. State-of-the-art tools that are already implemented and standardized can be used
    with minor modifications to obtain a substantial security upgrade.

    As future work, we recommend applying this approach to video compression. Although nothing has
    been published yet about such an analysis for video, the concept of information leakage through
    varying packet sizes is worth studying for all types of network streams, especially for real-time
    applications.

    Another recommendation is to push towards standardizing such an approach as part of security
    standards. Although an RFC has been published with guidelines for using variable bit-rate encoding
    with SRTP, it suggests making padding part of the application layer and a responsibility of the
    developer. Such an approach may introduce security weaknesses that could be avoided if padding
    were part of the standard.