trade-off between security level and compression in voice communications
Master's thesis report about the trade-off between the level of security and the efficiency of compression in voice over IP communication.
Analysis of the tradeoff between
compression ratio and security level in real-time voice communication
By
Abdallah Attie
Supervised by:
Dr Ahmad Fadlallah
Dr Mohamad Raad
Defended on 09.07.2014 before the jury composed of:
Dr. Wafaa Abou Diab
Dr. Bassem Bakhash
Dr. Ahmad Fadlallah
Abstract
The project analyzes the tradeoff between security level and compression ratio in real-time
voice communication. The problem addressed is that combining variable bitrate compression with
length-preserving encryption induces vulnerability to traffic analysis. The variation
of packet sizes can leak information about the conversation, from language identification
to identifying certain phrases and reconstructing phonemes. The solution to this problem is to rely on
constant bitrate compression or to pad the sent frames to a multiple of 16, 32, or 64 bytes. Each
padding scheme has a security gain in the form of increased immunity to the described traffic
analysis systems. The security level rises with the size of the encryption block.
This research project analyzes the impact of those padding schemes on the
bitrate of the VoIP stream. For this purpose we created a test bed that simulates the compression,
encryption and sending/receiving of speech over an RTP socket. The resulting bitrates are calculated
with and without the overhead of packetization. In conclusion, the resulting data give a clear
view of the tradeoff between three parameters: security level, bitrate, and quality.
Contents
Abstract
List of Figures
List of Tables
List of References
Chapter I Introduction
Compression
Types of Speech Coders
Variable Bit-Rate Coding
Speech Coding State of the Art
    Adaptive Multi Rate (AMR)
    Opus
    Speex
Security
    Symmetric and Asymmetric Encryption
    Block and Stream Encryption
    Common Encryption Algorithms
Report Structure
Chapter II Literature Review and Problem Formulation
Traffic Analysis of Encrypted Voice Stream
    Information leakage via variable bit-rate
Example of traffic analysis
Mitigation Techniques
Chapter III Test-Bed
Test-bed requirements
Test-bed elements
    Speex Encoder
    AES Encryption
    RTP Sending/Receiving
    Dataset
Test-bed overview
Chapter IV Results and Conclusion
Narrow Band Results
Wide Band Results
Statistical Analysis
Conclusion and future recommendations
List of Figures
Figure I-1: Block Diagram of the Opus Encoder
Figure II-1: Distribution of bit rates used to encode four phonemes with Speex
Figure II-2: Overview of training and detection process
Figure II-3: Robustness to padding
Figure IV-1: NB padding overhead (without packetization)
Figure IV-2: NB rate vs quality (without packetization)
Figure IV-3: NB rate vs quality (with packetization)
Figure IV-4: NB padding overhead (with packetization)
Figure IV-5: WB overhead (without packetization)
Figure IV-6: WB rate versus quality (without packetization)
Figure IV-7: Wide Band overhead (with packetization)
Figure IV-8: Wide Band rate versus quality (with packetization)
Figure IV-9: Stream Cipher 95% Confidence Interval
Figure IV-10: Stream and 128 bit padding Confidence Interval
Figure IV-11: Stream and 512 bit padding confidence interval
Figure IV-12: Stream and 256 bit Confidence Interval
List of Tables
Table I-1: Characteristics of Standardized Speech Coding Algorithms in Each of Four Broad Categories
Table I-2: Comparison Between the 3 Speech Encoders
Table III-1: Quality versus bitrate for Speex narrowband
Table III-2: Quality versus bitrate for Speex wideband
List of References
[1] M. Arjona Ramírez and M. Minami, "Low bit rate speech coding," in Wiley Encyclopedia of Telecommunications, J. G. Proakis, Ed., New York: Wiley, 2003, vol. 3, pp. 1299-1308.
[2] P. Kroon, "Evaluation of speech coders," in Speech Coding and Synthesis, W. Bastiaan Kleijn and K. K. Paliwal, Ed., Amsterdam: Elsevier Science, 1995, pp. 467-494.
[3] M. Hasegawa-Johnson and A. Alwan, "Speech Coding: Fundamentals and Applications," in Wiley Encyclopedia of Telecommunications, J. G. Proakis, Ed., New York: Wiley, 2003, vol. 3, pp. 1256-1265.
[4] Wiki.hydrogenaud.io, (2014). Variable Bitrate - Hydrogenaudio Knowledgebase. [online] Available at: http://wiki.hydrogenaud.io/index.php?title=VBR [Accessed 22 Jun. 2014].
[5] E. Ekudden et al., "The Adaptive Multi-Rate Speech Coder," Ericsson Research.
[6] Tools.ietf.org, (2014). RFC 6716 - Definition of the Opus Audio Codec. [online] Available at: http://tools.ietf.org/html/rfc6716#section-2.1.8 [Accessed 22 Jun. 2014].
[7] Speex.org, (2014). Introduction to CELP Coding. [online] Available at: http://www.speex.org/docs/manual/speex-manual/node9.html [Accessed 24 Jun. 2014].
[8] C. V. Wright, L. Ballard, F. Monrose, and G. M. Masson, "Language identification of encrypted VoIP traffic: Alejandra y Roberto or Alice and Bob?" in Proceedings of the USENIX Security Symposium, 2007.
[9] C. V. Wright, L. Ballard, S. E. Coull, F. Monrose, and G. M. Masson, "Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations," in Proceedings of the IEEE Symposium on Security and Privacy, 2008.
[10] Tools.ietf.org, (2014). RFC 6562 - Guidelines for the Use of Variable Bit Rate Audio with Secure RTP. [online] Available at: http://tools.ietf.org/html/rfc6562#section-2.1.8 [Accessed 30 Jun. 2014].
Chapter I Introduction
Security and performance are two important issues that any network operator should be concerned
about. This concern is heightened when dealing with real-time voice communication. One
reason is that performance directly affects the user experience in such real-time
applications. Furthermore, the content of voice conversations is often personal, and consequently
has stricter privacy requirements than other applications (web browsing, for example).
Enhancing performance requires compression of the network stream, while preserving privacy as a
security aspect requires encryption of the exchanged data. Our project aims at finding the best
way of combining the two operations (compression and encryption), since they do not get along
by nature. That is because compression removes redundancy, so the size of its output depends on
the content, while encryption is meant to reveal nothing about that content. Our research looks
into the possibilities in the domains of both encryption and compression
in order to find the optimal combination using existing tools.
Compression
To meet performance requirements, one of the most important techniques is compressing the
data exchanged over the network. Because the data is a voice signal, it has a certain
structure that makes it compressible at high ratios with little or no distortion. Therefore, speech
coding has always been an active research area in which many approaches are adopted with different
perspectives and one goal: minimizing the needed bandwidth while preserving voice quality at
an acceptable level.
Speech coding is an application of data compression on digital audio signals containing speech.
Speech coding uses speech-specific parameter estimation using audio signal processing techniques to
model the speech signal, combined with generic data compression algorithms to represent the
resulting modeled parameters in a compact bit-stream. [1]
The techniques employed in speech coding are similar to those used in audio data compression and
audio coding where knowledge in psychoacoustics is used to transmit only data that is relevant to the
human auditory system. For example, in voice-band speech coding, only information in the frequency
band 400 Hz to 3500 Hz is transmitted but the reconstructed signal is still adequate for intelligibility.
A sampling rate of 8 kHz is needed for narrowband coding. Wideband coding, in turn, codes information
in the frequency band reaching 7-8 kHz, which requires a sampling rate of 16 kHz.
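The sampling rates quoted above follow from the Nyquist criterion: a signal has to be sampled at no less than twice its highest frequency. The following is a minimal sketch of that arithmetic; the function name and the 4 kHz and 8 kHz band limits are assumptions chosen for illustration, not values prescribed by a particular codec.

```python
def min_sampling_rate_hz(highest_frequency_hz: float) -> float:
    """Nyquist criterion: sample at least twice the highest frequency."""
    return 2.0 * highest_frequency_hz

# Assumed band limits for illustration only.
narrowband_rate = min_sampling_rate_hz(4000)  # narrowband speech
wideband_rate = min_sampling_rate_hz(8000)    # wideband speech

print(narrowband_rate, wideband_rate)
```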
Speech coding differs from other forms of audio coding in that speech is a much simpler signal than
most other audio signals, and a lot more statistical information is available about the properties of
speech. As a result, some auditory information which is relevant in audio coding can be unnecessary
in the speech coding context. In speech coding, the most important criterion is preservation of
intelligibility and "pleasantness" of speech, with a constrained amount of transmitted data. [2]
Types of Speech Coders
There are different types of speech encoders:
Waveform coders attempt to code the exact shape of the speech signal waveform, without
considering the nature of human speech production and speech perception. These coders are
high-bit-rate coders (typically above 16 kbps).
Linear prediction coders (LPCs), on the other hand, assume that the speech signal is the output
of a linear time-invariant (LTI) model of speech production. The transfer function of that
model is assumed to be all-pole (autoregressive model). The excitation function is a
quasi-periodic signal constructed from discrete pulses (1-8 per pitch period), pseudorandom noise,
or some combination of the two. If the excitation is generated only at the receiver, based on a
transmitted pitch period and voicing information, then the system is designated as an LPC
voice coder (vocoder). LPC vocoders that provide extra information about the spectral shape
of the excitation have been adopted as coder standards between 2.0 and 4.8 kbps.
LPC-based analysis-by-synthesis coders (LPC-AS), on the other hand, choose an excitation
function by explicitly testing a large set of candidate excitations and choosing the best. LPC-
AS coders are used in most standards between 4.8 and 16 kbps.
Sub-band coders are frequency-domain coders that attempt to parameterize the speech signal
in terms of spectral properties in different frequency bands. These coders are less widely used
than LPC-based coders but have the advantage of being scalable and do not model the
incoming signal as speech. Sub-band coders are widely used for high-quality audio coding.
Table I-1 shows the four discussed types of speech coders. [3]
Speech Coder Class    Rates (kbps)    Complexity    Standardized Applications
Waveform coders       16-64           Low           Landline telephone
Sub-band coders       12-256          Medium        Teleconferencing, audio
LPC-AS                4.8-16          High          Digital cellular
LPC vocoder           2.0-4.8         High          Satellite telephony, military
Table I-1: Characteristics of Standardized Speech Coding Algorithms in Each of Four Broad Categories
Variable Bit-Rate Coding
One of the important techniques in speech coding is using variable bitrate while coding the speech
signal. The main idea behind this technique is the fact that not all speech signals need the same bitrate
in coding. In Variable Bitrate (VBR) coding, the user chooses the desired quality level and/or a range
of allowable bitrates. Then the encoder tries to maintain the selected quality during the whole stream
by choosing the optimal amount of data to represent each frame of audio. The main advantage is that
the user is able to specify the quality level and conserve as much space as possible, but the
drawback is that the final file size is quite unpredictable.
Most modern encoders are able to perform VBR encoding, including (but not limited to) nearly all
popular MP3, AAC, (Ogg) Vorbis, Musepack, and WMA encoders. [4]
Speech Coding State of the Art
The two most important applications of speech coding are mobile telephony and Voice over IP.
Consequently, the standards for speech compression are published by telecommunication standards
bodies such as the International Telecommunication Union (ITU) and 3GPP, which drive development
in mobile technology, and by the Internet Engineering Task Force (IETF).
This section presents the most widely used encoders in the domain. Adaptive Multi-Rate (AMR) is
an encoder developed by ETSI and adopted by 3GPP. It is used in WCDMA networks. On the other hand,
Opus and Speex are two sibling encoders developed by Xiph.org and adopted by IETF.
Adaptive Multi Rate (AMR)
The Adaptive Multi-Rate speech coder is based on the Algebraic CELP (ACELP) technology and is
referred to as a Multi-Rate ACELP (MR-ACELP) coder. The coder is capable of operating at 8
different bit-rates denoted coder modes. The frame size is 20 milliseconds with 4 sub-frames of 5
milliseconds. A look-ahead of 5 ms is used. The 12.2 kbit/s mode is equivalent to the GSM EFR
coder, while the 7.40 kbit/s mode is equivalent to the EFR coder for the IS-136 system.
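Since every AMR frame covers 20 ms of speech, the bit-rate of a coder mode fixes the number of bits in each encoded frame. A short sketch of this arithmetic for the two modes named above follows; the helper name is an assumption made for illustration.

```python
FRAME_DURATION_MS = 20  # AMR frame size stated in the text

def bits_per_frame(bitrate_bps: int, frame_ms: int = FRAME_DURATION_MS) -> int:
    """Number of speech bits carried by one frame at the given bit-rate."""
    return bitrate_bps * frame_ms // 1000

print(bits_per_frame(12200))  # 12.2 kbit/s mode -> 244 bits per frame
print(bits_per_frame(7400))   # 7.40 kbit/s mode -> 148 bits per frame
```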
The AMR speech coder was developed to fulfill a challenging set of performance requirements for
clean speech, speech in background noise, tandeming and degraded channel conditions. The highest
mode is the GSM EFR coder, which provides speech quality comparable to fixed-line quality. The
lowest mode provides communication quality. The range of bit-rates and the high quality provide
flexibility to trade quality and capacity as well as to optimize quality under changing channel
conditions. The quality was shown to be significantly higher than for existing speech services in GSM.
[5]
Opus
The Opus codec is developed by Xiph.org and standardized for multimedia streaming and VoIP
applications by the IETF. Opus can handle a wide range of audio applications, including Voice over
IP, videoconferencing, in-game chat, and even remote live music performances. It can scale from low
bit-rate narrowband speech to very high quality stereo music. Supported features are [7]:
Bit-rates from 6 kb/s to 510 kb/s
Sampling rates from 8 kHz (narrowband) to 48 kHz (fullband)
Frame sizes from 2.5 ms to 60 ms
Support for both constant bit-rate (CBR) and variable bit-rate (VBR)
Audio bandwidth from narrowband to full-band
Support for speech and music
Support for mono and stereo
Support for up to 255 channels (multistream frames)
Dynamically adjustable bitrate, audio bandwidth, and frame size
Good loss robustness and packet loss concealment (PLC)
Floating point and fixed-point implementation
The Opus codec is a real-time interactive audio codec. It is composed of a layer based on Linear
Prediction [LPC] and a layer based on the Modified Discrete Cosine Transform [MDCT]. The main
idea behind using two layers is as follows: in speech, linear prediction techniques (such as Code-
Excited Linear Prediction, or CELP) code low frequencies more efficiently than transform (e.g.,
MDCT) domain techniques, while the situation is reversed for music and higher speech frequencies.
Thus, a codec with both layers available can operate over a wider range than either one alone and can
achieve better quality by combining them than by using either one individually. [6]
The Opus encoder consists of two main blocks: the SILK encoder and the CELT encoder. However,
unlike the decoder, a valid (though potentially suboptimal) Opus encoder is not required to support
all modes and may thus only include a SILK encoder module or a CELT encoder module. The output
bit-stream of the Opus encoding contains bits from the SILK and CELT encoders, though these are
not separable due to the use of a range coder. A block diagram of the encoder is illustrated below.
[6]
Figure I-1: Block Diagram of the Opus Encoder
The Opus encoder is standardized for VoIP applications by the IETF. The reference (RFC 6716) defines
the encoder/decoder. Furthermore, the IETF has published specifications for the packet payload format of
Opus frames.
Speex
The Speex encoder is the sibling of Opus. It is also developed by Xiph.org, and it has a very similar
approach to Opus. The options featured by the two encoders are similar to a great extent. However,
in our research we are more interested in experimenting with Speex than with Opus. The reason
behind this is explained later in the report.
Speex is based on CELP, which stands for Code Excited Linear Prediction. The CELP technique is
based on three ideas:
The use of a linear prediction (LP) model to model the vocal tract
The use of (adaptive and fixed) codebook entries as input (excitation) of the LP model
The search performed in closed-loop in a perceptually weighted domain
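The three ideas above can be sketched as a toy closed-loop (analysis-by-synthesis) search. This is a simplified illustration under stated assumptions and not the Speex implementation: it uses a single fixed codebook, no gains, no adaptive codebook, and a plain squared error instead of a perceptually weighted one. All names and numeric values are hypothetical.

```python
def synthesize(excitation, lp_coeffs):
    """All-pole LP synthesis: y[n] = x[n] + sum_k a[k] * y[n-1-k]."""
    output = []
    for n, x in enumerate(excitation):
        y = x
        for k, a in enumerate(lp_coeffs):
            if n - 1 - k >= 0:
                y += a * output[n - 1 - k]
        output.append(y)
    return output

def squared_error(a, b):
    """Plain (unweighted) squared error between two signals."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def search_codebook(target, codebook, lp_coeffs):
    """Closed-loop search: synthesize every candidate, keep the closest one."""
    return min(
        range(len(codebook)),
        key=lambda i: squared_error(target, synthesize(codebook[i], lp_coeffs)),
    )

# Hypothetical toy values: a one-tap LP model and three candidate excitations.
lp_coeffs = [0.5]
codebook = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, -1, 0]]
target = synthesize(codebook[2], lp_coeffs)  # speech segment to be matched

print(search_codebook(target, codebook, lp_coeffs))  # 2
```

Only the index of the winning codebook entry needs to be transmitted, which is where the compression comes from in this simplified picture.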
Speex is designed to compress voice at bitrates ranging from 2 to 44 kbps. Some of Speex's features
include:
Narrowband (8 kHz), wideband (16 kHz), and ultra-wideband (32 kHz) compression in the
same bit-stream
Intensity stereo encoding
Packet loss concealment
Variable bitrate operation (VBR)
Voice Activity Detection (VAD)
Discontinuous Transmission (DTX)
Fixed-point port
Acoustic echo canceller
Noise suppression
The following table shows a comparison between the 3 discussed coders.
Codec             Sampling rate (kHz)  Bitrate (kbps)   Delay: frame+lookahead (ms)  Multirate  VBR  License
Speex             8, 16, 32            2.15-24.6 (NB)   20+10 (NB)                   yes        yes  open-source/free software
                                       4-44.2 (WB)      20+14 (WB)
Opus              8, 16, 22            6-510            2.5-60                       yes        yes  open-source
AMR-NB            8                    4.75-12.2        20+5?                        yes             proprietary
AMR-WB (G.722.2)  16                   6.6-23.85        20+5?                        yes             proprietary
Table I-2 Comparison Between the 3 Speech Encoders
To sum up, the bibliographic work has led us to emphasize the concept of variable bitrate (VBR).
This is due to reasons that are explained in Chapter II. Furthermore, the literature that we are dealing
with in our research is based on the Speex encoder. Consequently, Speex will be our
designated encoder in the test-bed.
Security
The other concern in our research is security. As stated in the beginning of the chapter, privacy has
great importance for real-time voice communication applications, whether in mobile telephony or
voice over IP. In this section, we review the concept of encrypting voice data along with the state of
the art in the field.
Encryption is the process of converting plain text ("unhidden") to cipher text ("hidden") to secure it
against data thieves. The process has a second part, in which the cipher text is decrypted at the
other end to be understood. As defined in RFC 2828 [Reference], a cryptographic system is "a set of
cryptographic algorithms together with the key management processes that support use of the
algorithms in some application context." This definition covers the whole mechanism that provides
the necessary level of security, comprising network protocols and data encryption algorithms.
The goals of any cryptographic system fall into 5 categories:
Authentication: Before sending and receiving data using the system, the
identities of the receiver and sender should be verified.
Secrecy or Confidentiality: This function (feature) is usually how most people identify a
secure system. It means that only the authenticated people are able to interpret the message
(data) content and no one else.
Integrity: Integrity means that the content of the communicated data is assured to be free
from any type of modification between the end points (sender and receiver). The basic form
of integrity is the packet checksum in IPv4 packets.
Non-Repudiation: This function implies that neither the sender nor the receiver can falsely
deny that they have sent/received a certain message.
Service Reliability and Availability: Secure systems are often attacked by intruders,
which may affect their availability and the type of service offered to their users. Such systems should
provide a way to grant their users the quality of service they expect.
The category of our interest is confidentiality. Consequently, the reference to security throughout the
report is meant to address the confidentiality goal of the implemented security system. Furthermore,
the attack on the system is based on traffic analysis and not on conventional cryptanalysis. This idea
will be discussed in detail in the next chapter.
Symmetric and Asymmetric Encryption
Data encryption procedures are mainly categorized into two categories depending on the type of
security keys used to encrypt/decrypt the secured data. These two categories are: Asymmetric and
Symmetric encryption techniques. In symmetric encryption, the sender and the receiver agree on a
secret (shared) key. Then they use this secret key to encrypt and decrypt their exchanged messages.
The main concern behind symmetric encryption is how to share the secret key securely between the
two peers. If the key gets known for any reason, the whole system collapses. On the other hand,
Asymmetric encryption is where two keys are used. To explain more, what Key1 can encrypt only
Key2 can decrypt, and vice versa. It is also known as Public Key Cryptography (PKC), because users
tend to use two keys: public key, which is known to the public, and private key, which is only known
to the user.
In this project, we are interested in experimenting with symmetric encryption. This is because the
state of the art in the domain of speech encryption is based on symmetric ciphers. The reason
is that symmetric algorithms are in general less complex than asymmetric ones. The reduction in
complexity is of great importance to such real-time applications, which usually run on platforms with
limited capabilities (mobile phones).
Block and Stream Encryption
One of the main categorization methods for encryption techniques commonly used is based on the
form of the input data they operate on. The two types are Block Cipher and Stream Cipher.
A stream cipher operates on a stream of data bit by bit. It consists of two
major components: a key stream generator and a mixing function. The mixing function is usually just an
XOR function, while the key stream generator is the main unit of the stream cipher technique.
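The two components can be illustrated with a toy example. The key stream generator below, which hashes the key, a nonce and a running counter with SHA-256, is an assumption chosen for brevity; it is neither a vetted cipher nor the one used in this project. The key, nonce and frame values are hypothetical.

```python
import hashlib

def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy key stream generator: hash key, nonce and a running counter."""
    stream = b""
    counter = 0
    while len(stream) < length:
        stream += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return stream[:length]

def stream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Mixing function: XOR the data with the key stream (same call decrypts)."""
    return bytes(d ^ k for d, k in zip(data, keystream(key, nonce, len(data))))

key, nonce = b"demo-key", b"frame-0001"  # hypothetical values
frame = b"\x12" * 38                      # stand-in for one encoded speech frame
ciphertext = stream_xor(key, nonce, frame)

print(len(frame), len(ciphertext))                  # 38 38
print(stream_xor(key, nonce, ciphertext) == frame)  # True
```

The cipher text has exactly the length of the plain text. This length-preserving property is what exposes the frame sizes of a variable bit-rate encoder to an observer.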
In a block cipher, data is encrypted and decrypted in blocks. In its simplest mode, the
plain text is divided into blocks, which are then fed into the cipher system to produce blocks of cipher text.
ECB (Electronic Codebook Mode) is the basic form of block cipher operation, where data blocks are encrypted
directly to generate the corresponding cipher blocks.
There are many variants of block cipher operation, where different techniques are used to strengthen the
security of the system. The most common modes are ECB (Electronic Codebook Mode), CBC
(Cipher Block Chaining Mode), and OFB (Output Feedback Mode). Unlike ECB, the CBC mode
uses the cipher block from the previous step of encryption in the current one, which forms a chain-like
encryption process. OFB operates on plain text in a way similar to the stream cipher described
above, where the key stream block used in every step depends on the key stream block from the previous
step. There are other modes such as CTR (Counter) and CFB (Cipher Feedback). CTR mode is used to
transform a block cipher into a stream cipher. The idea is simple: the block cipher is used to generate a
key stream, which is mixed (mainly XORed) with the plain text.
The recommended choice for real-time voice communication is the stream
cipher. This is due to the nature of the transferred data, which is in the form of a stream. However, in this
project we explore the option of using a block cipher. The feasibility of using block ciphers for encryption of voice
data comes from the perspective of trading off performance for security. We will elaborate more on
that later in the course of the report.
Common Encryption Algorithms
Here we discuss 5 of the best-known ciphers in the state of the art. Among these algorithms,
AES and KASUMI are used to secure real-time voice communication. AES is standardized
for voice over IP in the Secure Real-time Transport Protocol (SRTP), which is a profile of RTP. On
the other hand, KASUMI was standardized by 3GPP for UMTS, GSM, and GPRS mobile communication
systems.
DES: The Data Encryption Standard was the first encryption standard to be recommended by NIST
(National Institute of Standards and Technology). It is based on the IBM proposed algorithm called
Lucifer. DES became a standard in 1977. Since that time, many attacks and methods have been recorded that
exploit the weaknesses of DES, which made it an insecure block cipher.
3DES: As an enhancement of DES, the 3DES (Triple DES) encryption standard was proposed. In this
standard the encryption method is similar to the one in the original DES but applied 3 times to increase
the encryption level. However, 3DES is known to be slower than other block cipher methods.
AES: The Advanced Encryption Standard is the encryption standard recommended by NIST to
replace DES. The Rijndael algorithm (pronounced "Rain Doll") was selected in 2000 after a competition,
launched in 1997, to select the best encryption standard. Brute force is the only effective attack known
against it, in which the attacker tries all possible key combinations to unlock the encryption. Both AES
and DES are block ciphers.
Blowfish: It is one of the most common public domain encryption algorithms, designed by Bruce
Schneier, one of the world's leading cryptologists and the president of Counterpane Systems, a
consulting firm specializing in cryptography and computer security. Blowfish is a variable-length key,
64-bit block cipher. The Blowfish algorithm was first introduced in 1993. This algorithm can be
optimized in hardware applications, though it is mostly used in software applications. Though it suffers
from a weak keys problem, no attack is known to be successful against it.
KASUMI: It is a block cipher used in UMTS, GSM, and GPRS mobile communications systems. In
UMTS, KASUMI is used in the confidentiality (f8) and integrity algorithms (f9) with names UEA1
and UIA1, respectively. In GSM, KASUMI is used in the A5/3 key stream generator and in GPRS in
the GEA3 key stream generator.
KASUMI was designed for 3GPP to be used in UMTS security system by the Security Algorithms
Group of Experts (SAGE), a part of the European standards body ETSI. SAGE agreed with 3GPP
technical specification group (TSG) for system aspects of 3G security (SA3) to base the development
on an existing algorithm that had already undergone some evaluation. They chose the cipher
algorithm MISTY1 developed and patented by Mitsubishi Electric Corporation. The original
algorithm was slightly modified for easier hardware implementation and to meet other requirements
set for 3G mobile communications security.
In January 2010, Orr Dunkelman, Nathan Keller and Adi Shamir released a paper showing that they
could break KASUMI with a related-key attack and very modest computational resources. Interestingly,
the attack is ineffective against MISTY1.
Report Structure
In the first chapter of this report, we presented the state of the art of both compression
and encryption. We reviewed the encoding concepts along with the widely used encoders. We also
reviewed security briefly. Cipher types and modes were presented with emphasis on the
applications of VoIP and mobile telephony.
In the second chapter, we give a brief literature review stating the main problem the project tries to
tackle: the bad combination of VBR and stream encryption. The papers describing the security
vulnerabilities are reviewed briefly. The solution to the problem is discussed and the perspective that
the project adopts is determined.
Chapter III presents the test-bed that we created in order to measure bitrates. The test-bed consists
of 3 main elements (or stages): encoding, encryption, and sending/receiving.
The fourth and final chapter includes all the obtained results. These results are the bitrates obtained
across different setups spanning the whole space of options found in our field of interest. This
chapter also includes the concluding statement along with future recommendations.
Chapter II Literature Review and Problem Formulation
The main problem to be tackled in this project can be presented and explained in a very simple and
brief manner. The combination of variable bit-rate compression and length-preserving
encryption (stream cipher) induces a security weakness in the form of vulnerability to traffic analysis.
The solution is to reduce information leakage by reducing the variation of bitrate in the transmitted
stream. This is achieved by relying on constant bitrate (CBR) or by using padding. In brief, our project
focuses on the analysis of the cost of padding in terms of bandwidth. We aim at performing
tests with padding and reaching a conclusion about the cost of padding and consequently its
feasibility. The proposition of the research project should answer the question about the possibility
of gaining a trusted security level using existing tools.
In this chapter, we exhibit the weakness caused by using variable bit-rate compression and then we
discuss the perspective adopted in tackling this problem.
Traffic Analysis of Encrypted Voice Stream
In 2007, a paper was published under the title "Language Identification of Encrypted VoIP Traffic".
It was followed by another paper, "Spot me if you can: Uncovering spoken phrases in encrypted
VoIP conversations". The most important paper in this context was published in 2011 under the title
"Phonotactic Reconstruction of Encrypted VoIP Conversations: Hookt on fon-iks". The common idea
inferred from the titles is the extraction of certain information (language, some phrases, phoneme
reconstruction) from an encrypted VoIP stream. A key point is not revealed in the titles: such extraction
relies on variable bit-rate compression.
The Secure RTP (SRTP) framework [RFC3711] is a widely used framework for securing RTP
sessions [RFC3550]. SRTP provides the ability to encrypt the payload of an RTP packet, and
optionally add an authentication tag, while leaving the RTP header and any header extension in the
clear. A range of encryption transforms can be used with SRTP, but none of the predefined encryption
transforms use any padding; the RTP and SRTP payload sizes match exactly.
When using SRTP with voice streams compressed using variable bit rate (VBR) codecs, the length
of the compressed packets will depend on the characteristics of the speech signal. This variation in
packet size will leak a small amount of information about the contents of the speech signal. This is
potentially a security risk for some applications. For example, [spot-me] shows that known phrases
in an encrypted call using the Speex codec in VBR mode can be recognized with high accuracy in
certain circumstances, and [fon-iks] shows that approximate transcripts of encrypted VBR calls can
be derived for some codecs without breaking the encryption. How significant these results are, and
how they generalize to other codecs, is still an open question. RFC 6562 discusses ways in which
such traffic analysis risks may be mitigated.
Information leakage via variable bit-rate
Generally speaking, the codec takes as input the audio stream from the user, which is typically
sampled at either 8000 or 16000 samples per second (Hz). At some fixed interval, the codec takes the
n most recent samples from the input, and compresses them into a packet for efficient transmission
across the network. To achieve the low latency required for real-time performance, the length of the
interval between packets is typically fixed between 10 and 50ms, with 20ms being the common case.
Thus for a 16 kHz audio source, we have n = 320 samples per packet, or 160 samples per packet for
the 8 kHz case.
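The relation between sampling rate, packet interval and samples per packet can be sketched as follows. This is a minimal illustration only; the function name samples_per_packet is an assumption made for this example and is not part of any codec API.

```c
#include <assert.h>

/* Number of samples carried by one packet: sampling rate (Hz) times the
 * packet interval (ms), divided by 1000 ms per second.
 * Hypothetical helper, written for illustration only. */
int samples_per_packet(int sample_rate_hz, int interval_ms)
{
    assert(sample_rate_hz > 0 && interval_ms > 0);
    return sample_rate_hz * interval_ms / 1000;
}
```

With a 20 ms interval, samples_per_packet(16000, 20) gives the 320 samples quoted above, and samples_per_packet(8000, 20) gives 160.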
Many common voice codecs are based on a technique called code-excited linear prediction (CELP).
For each packet, a CELP encoder simply performs a brute-force search over the entries in a codebook
of audio vectors to output the one that most closely reproduces the original audio. The quality of the
compressed sound is therefore determined by the number of entries in the codebook. The index of the
best-fitting codebook entry, together with the linear predictive coefficients and the gain, make up the
payload of a CELP packet. The larger code books used for higher-quality encodings require more bits
to index, resulting in higher bit rates and therefore larger packets.
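The brute-force codebook search described above can be sketched as follows. This is a simplified illustration and not the search used by any real CELP encoder: it compares raw vectors by squared error, whereas a real encoder passes each candidate through a synthesis filter and uses a perceptually weighted error. The names codebook_search and index_bits are assumptions made for this example.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Returns the index of the codebook entry closest to `target` in the
 * squared-error sense. The codebook is stored row by row: entry i
 * occupies codebook[i*dim] .. codebook[i*dim + dim - 1].
 * Simplification: no synthesis filter, no perceptual weighting. */
size_t codebook_search(const float *codebook, size_t entries, size_t dim,
                       const float *target)
{
    size_t best = 0;
    float best_err = FLT_MAX;

    assert(entries > 0 && dim > 0);
    for (size_t i = 0; i < entries; i++) {
        float err = 0.0f;
        for (size_t k = 0; k < dim; k++) {
            float d = codebook[i * dim + k] - target[k];
            err += d * d;
        }
        if (err < best_err) {
            best_err = err;
            best = i;
        }
    }
    return best;
}

/* Bits needed to index a codebook with `entries` entries,
 * i.e. the smallest b such that 2^b >= entries. */
unsigned index_bits(unsigned entries)
{
    unsigned bits = 0;

    assert(entries > 0);
    while ((1ULL << bits) < entries)
        bits++;
    return bits;
}
```

The second helper makes the link to packet size explicit: doubling the codebook adds one bit to every transmitted index, so a 128-entry codebook costs 7 bits per index and a 256-entry codebook costs 8.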
In some CELP variants, such as QCELP, Speex's variable bit rate mode, or the approach advocated
by Zhang et al., the encoder adaptively chooses the bit rate for each packet in order to achieve a good
balance of audio quality and network bandwidth. This approach is appealing because the decrease in
data volume may be substantial, with little or no loss in quality: in a two-way call, each participant is
idle roughly 63% of the time. Unfortunately, this approach can also
cause substantial leakage of information in encrypted VoIP calls because, in the standard specification
for Secure RTP (SRTP), the cryptographic layer does not pad or otherwise alter the size of the original
RTP payload.
Intuitively, the sizes of CELP packets leak information because the choice of bit rate is largely based
on the audio encoded in the packet's payload. For example, the variable bit-rate Speex codec encodes
vowel sounds at higher bit rates than fricative sounds like "f" or "s". In phonetic models of speech,
sounds are broken down into several different categories, including the aforementioned vowels and
fricatives, as well as stops like "b" or "d", and affricates like "ch". Each of these canonical sounds
is called a phoneme, and the pronunciation for each word in the language can then be given as a
sequence of phonemes. While there is no consensus on the exact number of phonemes in spoken
English, most in the speech community put the number between 40 and 60.
In [9], to demonstrate the relationship between bit rate and phonemes, the authors encoded several
recordings from the TIMIT corpus of phonetically-rich English speech using Speex in
wideband variable bit rate mode, and observed the bit rate used to encode each phoneme. The
probabilities for 8 of the 21 possible bit rates are shown for a handful of phonemes in
Figure II-1. As expected, the two vowel sounds, "aa" and "aw", are typically encoded at
significantly higher bit rates than the fricative "f" or the consonant "k". Moreover, large differences
in the frequencies of certain bit rates (namely, 16.6, 27.8, and 34.2 kbps) can be used to distinguish
"aa" from "aw" and "f" from "k".
Figure II-1: Distribution of bit rates used to encode four phonemes with Speex
Figure II-2: Packets for "artificial"
Figure II-3: Packets for "intelligence"
In fact, it is these differences in bit rate for the phonemes that make recognizing words and phrases
in encrypted traffic possible. To illustrate the patterns that occur in the stream of packet sizes when a
certain word is spoken, the authors of [9] examined the sequences of packets generated by encoding several
utterances of the words "artificial" and "intelligence" from the TIMIT corpus. They represent the
packets for each word visually in Figures II-2 and II-3 as a data image: a grid with bit rate on the y-axis
and position in the sequence on the x-axis. Starting with a plain white background, the cell
at position (x,y) is darkened each time a packet encoded at bit rate y is observed at position x for the given word.
In both graphs, we see several dark gray or black grid cells where the same packet size is consistently
produced across different utterances of the word, and in fact, these dark spots are closely related to
the phonemes in the two words. In Figure II-2, the bit rate in the 2nd - 5th packets (the "a" in "artificial")
is usually quite high (35.8 kbps), as we would expect for a vowel sound. Then, in packets 12 - 14 and
20 - 22, we see much lower bit rates for the fricative "f" and affricate "sh". Similar trends are visible
in Figure II-3; for example, the "t" sound maps consistently to 24.6 kbps in both words.
Example of traffic analysis
In the paper "Uncovering spoken phrases in encrypted VoIP conversations" [9], the method adopted
for analyzing the encrypted VoIP stream can be summarized as follows:
To identify a phrase without using any examples of the phrase or any of its constituent words, a
concatenative synthesis technique is applied to generate a few hundred synthetic training sequences
for the phrase. These sequences are used to train a profile HMM (hidden Markov model) for the
phrase, which is then used to search for the phrase in streams of packets. An overview of the entire
training and detection process is given in Figure II-4.
Figure II-4: Overview of training and detection process
Mitigation Techniques
One way to prevent word spotting would be to pad packets to a common length, or at least to a coarser
granularity. Another way is to refrain from using VBR and switch to CBR mode. However, this is not
optimal, since it gives up the bandwidth savings of VBR. Padding regains the lost security (to a
certain extent, as we will see) while preserving some benefit from variable bit-rate encoding.
In the paper [9], the traffic analysis system (search algorithm) was tested against padding. To explore
the tradeoff between padding and search accuracy, the authors padded the packets of both their training
and testing data sets to multiples of 128, 256 or 512 bits and applied their approach. The results are
presented in Figure II-5. The use of padding is quite encouraging as a mitigation technique, as it greatly
reduces the overall accuracy of the search algorithm. When padding to multiples of 128 bits, the system
achieves only 0.15 recall at 0.16 precision. Increasing padding so that packets are multiples of 256
bits gives a recall of 0.04 at 0.04 precision.
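Padding to a multiple of a block size amounts to rounding the packet length up, which can be sketched as follows. The function name padded_len is an assumption made for this illustration; lengths are in bytes, so the 128, 256 and 512-bit granularities correspond to blocks of 16, 32 and 64 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Rounds `len` up to the next multiple of `block` (both in bytes).
 * A length that is already a multiple of `block` is left unchanged.
 * Hypothetical helper, written for illustration only. */
size_t padded_len(size_t len, size_t block)
{
    assert(block > 0);
    return (len + block - 1) / block * block;
}
```

For example, frames of 20 and 28 bytes both become 32 bytes with a 16-byte block, so an observer can no longer tell them apart. The coarser the block, the more distinct frame sizes collapse onto the same padded size, which is why the search accuracy drops as the block grows.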
Figure II-5: Robustness to padding
The debate around the announcement of security flaws in variable bit-rate encoding has led to the
publication of an RFC by the IETF. The document, "Guidelines for the Use of Variable Bit Rate Audio
with Secure RTP" (RFC 6562), gives guidelines for dealing with variable bit-rate in SRTP:
For scenarios where VBR is considered unsafe, a constant bit rate (CBR) codec SHOULD be
negotiated and used instead, or the VBR codec SHOULD be operated in a CBR mode. However, if
the codec does not support CBR, RTP padding SHOULD be used to reduce the information leak to
an insignificant level. Packets may be padded to a constant size or to a small range of sizes ([spot-me]
achieves good results by padding to the next multiple of 16 octets, but the amount of padding
needed to hide the variation in packet size will depend on the codec and the sophistication of the
attacker) or may be padded to a size that varies with time. The most secure and RECOMMENDED
option is to pad all packets throughout the call to the same size.
In the case where the size of the padded packets varies in time, the same concerns as for VAD apply.
That is, the padding SHOULD NOT be reduced without waiting for a certain (random) time. The
RECOMMENDED "hold time" is the same as the one for VAD.
Note that SRTP encrypts the count of the number of octets of padding added to a packet, but not the
bit in the RTP header that indicates that the packet has been padded. For this reason, it is
RECOMMENDED to add at least one octet of padding to all packets in a media stream, so an attacker
cannot tell which packets needed padding.[10]
Chapter III Test-Bed
In the previous chapter, we exhibited the security weakness provoked by the combination of
variable bit-rate encoding and length-preserving encryption. This weakness is in the form of vulnerability
to traffic analysis. The performance of the traffic analysis system presented in the previous chapter
degrades as packets are padded to increasing block lengths.
Furthermore, since padding preserves security to a great extent, the IETF recommends
in RFC 6562 to either use constant bit-rate encoding or rely on padding, for example to
a 16-byte block length.
The discussion around the subject did not take into consideration the tradeoff between security and
performance. The question of the feasibility of padding remained to be asked. A key point to have in mind
is that variable bitrate encoding aims at lowering the needed bandwidth as much as possible.
Consequently, the cost of padding in terms of bit-rate and needed bandwidth has to be
calculated in order to have a good perspective on the price we have to pay in order to achieve
security while using variable bitrate.
Answering the question about the feasibility of padding is the main goal of the research project.
The answer might be that padding costs more than constant bitrate and, consequently, that
padding is not the optimal solution for preserving security. In any case, we aim at having a solid
perspective on the cost paid for different security levels. The results of our test-bed will hopefully
give a good understanding of the relation between security, quality, and performance.
Quality is a parameter we take in our research as part of the tradeoff formula. The quality of the encoder
is usually mapped to the bitrate used by it. Consequently, quality can be inserted into the tradeoff
formulation as a price to pay for preserving both security and bitrate.
In order to test properly and calculate the obtained bitrates, we need to create a system in
which we implement compression, encryption, sending and receiving of a voice stream. The system
should allow the manipulation of the parameters that we are interested in.
Test-bed requirements
The created system must be able to implement compression and encryption of a speech stream.
Furthermore, the system should allow the manipulation of parameters for both compression and
encryption. One more important requirement is the ability to send and receive the compressed and
encrypted stream. Sending/receiving subjects the stream to packetization in a realistic manner that
can be related to a real application. The system should also be able to log the obtained bitrates for every
setup.
For compression, we should be able to choose the mode (narrow-band, wideband). In addition to that,
we should be able to choose the quality of compression. The quality variable is an important variable
that is supported by many algorithms that form the state of the art. We emphasize the ability to choose
quality since we are interested in inserting quality as a parameter in the tradeoff setup as we can see
later in the results section.
In encryption, the main requirement is the ability to pad data to a multiple of 128, 256, 512 bits. Of
course, in addition to that, we need to adopt a cipher which is trusted in the state of the art. The cipher
should have a low cost in terms of processing time since the platforms are usually mobile phones with
limited memory and processing power. One additional requirement is being a symmetric cipher since
all protocols implement symmetric encryption/decryption mechanisms.
The requirements can be summarized in a compact format as follows:
Compression:
  o Widely implemented encoder
  o Variable bit-rate compression
  o Variable quality setting
Encryption:
  o Trusted low-cost cipher
  o Padding to different sizes
  o Symmetric cipher
Test-bed elements
Based on the discussed requirements, the search for an encoder and a cipher is aimed at finding
modules widely present in the state of the art. The test-bed is built in a Linux environment (Ubuntu
distribution of GNU/Linux). The libraries used are all written in the C programming language;
consequently, the test-bed was also written in C.
For the encoder, the choice fell on the Speex encoder. This encoder was chosen since it meets all the
stated requirements. Furthermore, this encoder was used as the designated encoder in the three
articles that state the security vulnerability.
Regarding encryption, the choice was obvious: the Advanced Encryption Standard (AES). AES is standardized
and adopted in SRTP, the main standard for security in voice over IP. However, the SRTP specification
and its implementations use AES in CTR (counter) mode. This mode generates a key stream and
mixes it with the data (using the XOR operation) in order to get the encrypted text. The length of the initial
plain text is preserved. Consequently, this mode turns a block cipher into a length-preserving
stream cipher regardless of the block size of the cipher. Other transforms specified by SRTP are f8
and the null cipher.
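The difference between a length-preserving mode and a padded block mode can be made concrete by comparing ciphertext lengths. The sketch below performs no encryption; it only computes lengths, assuming the 16-byte AES block and PKCS#7 padding (the padding applied by default by OpenSSL's EVP interface). The function names are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define AES_BLOCK_BYTES 16  /* AES block length: 128 bits */

/* Counter mode XORs the plaintext with a key stream, so the ciphertext
 * is exactly as long as the plaintext. */
size_t ctr_ciphertext_len(size_t plaintext_len)
{
    return plaintext_len;
}

/* CBC with PKCS#7 padding always appends 1..16 padding bytes, so the
 * ciphertext length is the next multiple of the block length that is
 * strictly greater than the plaintext length. */
size_t cbc_pkcs7_ciphertext_len(size_t plaintext_len)
{
    assert(plaintext_len <= (size_t)-1 - AES_BLOCK_BYTES);  /* no overflow */
    return (plaintext_len / AES_BLOCK_BYTES + 1) * AES_BLOCK_BYTES;
}
```

Frames of 20 and 28 bytes keep their distinct lengths in counter mode, while both produce 32 bytes of ciphertext in CBC mode. A 32-byte frame produces 48 bytes in CBC mode, since a full block of padding is added when the input is already aligned.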
It is worth mentioning that the RFC published about the guidelines for using variable bit-rate with
SRTP recommends relying on higher levels in the hierarchy of the networking model to achieve
padding. Per the published standard, padding is part of the compression or of the application layer in
general. However, in our approach we chose to use a block cipher in the test bed. The choice of a
block cipher does not affect the desired results in any way. Furthermore, the choice of making padding
part of the encryption process is justified in terms of security requirements. Implementing
padding in the compression stage or in another entity may induce security vulnerabilities that are
avoidable by using a block cipher. For example, when padding is done within the RTP payload, the
number of padding bytes is carried in the encrypted part of the RTP packet, but the flag specifying
padding in the RTP header is not encrypted.
Speex Encoder
In our test bed, we used the Speex encoder as the designated compression tool. We used the Speex library
and relied on a detailed step-by-step construction of the encoder using the Speex API (Application
Programming Interface). This choice was made because manipulating parameters and managing the encoder's
output requires such construction rather than using a prebuilt ready-to-use module.
The libspeex library contains all the functions for encoding and decoding speech with the Speex codec.
When linking on a UNIX system, we must add -lspeex -lm to the compiler command line.
In order to encode speech using Speex, we first need to:
#include <speex/speex.h>
Then in the code, a Speex bit-packing struct must be declared, along with a Speex encoder state:
SpeexBits bits;
void *enc_state;
The two are initialized by:
speex_bits_init(&bits);
enc_state = speex_encoder_init(&speex_nb_mode);
For wideband coding, speex_nb_mode will be replaced by speex_wb_mode. In most cases, you will
need to know the frame size used at the sampling rate you are using, which can be queried with the
SPEEX_GET_FRAME_SIZE request of speex_encoder_ctl().
The encoder is by default set to CBR mode. We set it to variable bit-rate mode by using:
speex_encoder_ctl(enc_state,SPEEX_SET_VBR,&vbr);
The variable vbr is an integer value (0 or 1). It is used to set VBR on (1) or off (0).
There are many parameters that can be set for the Speex encoder, but the most useful one is the quality
parameter that controls the quality vs. bit-rate tradeoff.
This is set by:
speex_encoder_ctl(enc_state,SPEEX_SET_VBR_QUALITY,&quality);
Quality is a float value ranging from 0.0 to 10.0 (inclusively). The mapping between quality and bit-
rate is described in the following 2 tables for both narrowband and wideband.
Mode  Quality  Bit-rate (bps)  mflops  Quality/description
0     -        250             0       No transmission (DTX)
1     0        2,150           6       Vocoder (mostly for comfort noise)
2     2        5,950           9       Very noticeable artifacts/noise, good intelligibility
3     3-4      8,000           10      Artifacts/noise sometimes noticeable
4     5-6      11,000          14      Artifacts usually noticeable only with headphones
5     7-8      15,000          11      Need good headphones to tell the difference
6     9        18,200          17.5    Hard to tell the difference even with good headphones
7     10       24,600          14.5    Completely transparent for voice, good quality music
8     1        3,950           10.5    Very noticeable artifacts/noise, good intelligibility
9     -        -               -       reserved
10    -        -               -       reserved
11    -        -               -       reserved
12    -        -               -       reserved
13    -        -               -       Application-defined, interpreted by callback or skipped
14    -        -               -       Speex in-band signaling
15    -        -               -       Terminator code
Table III-1 Quality versus bitrate for Speex narrowband
Mode/Quality  Bit-rate (bps)  Quality/description
0             3,950           Barely intelligible (mostly for comfort noise)
1             5,750           Very noticeable artifacts/noise, poor intelligibility
2             7,750           Very noticeable artifacts/noise, good intelligibility
3             9,800           Artifacts/noise sometimes annoying
4             12,800          Artifacts/noise usually noticeable
5             16,800          Artifacts/noise sometimes noticeable
6             20,600          Need good headphones to tell the difference
7             23,800          Need good headphones to tell the difference
8             27,800          Hard to tell the difference even with good headphones
9             34,400          Hard to tell the difference even with good headphones
10            42,400          Completely transparent for voice, good quality music
Table III-2 Quality versus bitrate for Speex wideband
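The bit-rates in the two tables translate into a nominal frame size once the frame duration is known. The sketch below assumes the 20 ms frame duration used by Speex and rounds up to a whole byte; the function name frame_bytes is an assumption made for this illustration. In VBR mode the actual size of each frame varies around these nominal values.

```c
#include <assert.h>

/* Nominal size in bytes of one encoded frame at a given bit-rate,
 * for a frame duration in milliseconds, rounded up to a whole byte:
 * bytes = ceil(bitrate_bps * frame_ms / 8000).
 * Hypothetical helper, written for illustration only. */
unsigned frame_bytes(unsigned bitrate_bps, unsigned frame_ms)
{
    assert(frame_ms > 0);

    unsigned long long bits_times_1000 =
        (unsigned long long)bitrate_bps * frame_ms;
    return (unsigned)((bits_times_1000 + 7999) / 8000);
}
```

For the narrowband modes at 8,000, 15,000 and 24,600 bps, this gives 20, 38 and 62 bytes per 20 ms frame. These are the sizes that the padding schemes then round up to multiples of 16, 32 or 64 bytes.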
Once the initialization is done, for every input frame:
speex_bits_reset(&bits);
speex_encode_int(enc_state, input_frame, &bits);
nbBytes = speex_bits_write(&bits, byte_ptr, MAX_NB_BYTES);
Where input_frame is a (short *) pointing to the beginning of a speech frame, byte_ptr is a (char *)
where the encoded frame will be written, MAX_NB_BYTES is the maximum number of bytes that
can be written to byte_ptr without causing an overflow, and nbBytes is the number of bytes actually
written to byte_ptr (the encoded size in bytes). Before calling speex_bits_write, it is possible to find
the number of bytes that need to be written by calling speex_bits_nbytes(&bits), which returns a
number of bytes.
After you're done with the encoding, free all resources with:
speex_bits_destroy(&bits);
speex_encoder_destroy(enc_state);
AES Encryption
The choice of the AES cipher is justified in the previous section of the chapter. However, the
algorithm has a high number of implementations. Among these, a trusted and well known library in
the state of the art is OpenSSL.
OpenSSL provides two primary libraries: libssl and libcrypto. The libcrypto library provides the
fundamental cryptographic routines used by libssl. You can however use libcrypto without using
libssl.
For most uses, users should use the high level interface that is provided for performing cryptographic
operations. This is known as the EVP interface (short for Envelope). This interface provides a suite
of functions for performing encryption/decryption (both symmetric and asymmetric),
signing/verifying, as well as generating hashes and MAC codes, across the full range of OpenSSL
supported algorithms and modes. Working with the high level interface means that a lot of the
complexity of performing cryptographic operations is hidden from view. A single consistent API is
provided. In addition low level issues such as padding and encryption modes are all handled.
The EVP functions provide a high level interface to OpenSSL cryptographic functions. They provide
the following features:
A single consistent interface regardless of the underlying algorithm or mode
Support for an extensive range of algorithms
Encryption/Decryption using both symmetric and asymmetric algorithms
Sign/Verify
Key derivation
Secure Hash functions
Message Authentication Codes
Support for external crypto engines
AES is available in libcrypto in different modes and in 128, 192, and 256-bit variants. Note that in
OpenSSL these numbers denote the key length; the block length of AES itself is always 128 bits.
The library does not provide a 512-bit variant. To deal with this issue, we used the algorithm in CBC
mode with the 128 and 256-bit variants, and to get the size of 512 bits we relied on manual padding.
Although the use of EVP as a high-level interface simplifies using the library to a great extent, using
EVP in a complex test bed with multi-stage procedures may introduce complexity.
To encrypt using EVP, first we have to:
#include <openssl/evp.h>
The encryption process starts with initializing the cipher. We have to create a context, the "opaque"
structure that libcrypto uses to record the status of encrypt/decrypt operations:
EVP_CIPHER_CTX *e_ctx = EVP_CIPHER_CTX_new();
Then we have to create a key and an IV (initialization vector) for the cipher. A SHA1 digest is used to hash
the supplied key material (password) multiple times (rounds). More rounds are more secure but
slower. Then, after setting the key and IV, we call:
EVP_CIPHER_CTX_init(e_ctx);
EVP_EncryptInit_ex(e_ctx, EVP_aes_256_cbc(), NULL, key, iv);
This initializes AES encryption in CBC mode with the 256-bit variant, as shown in the second parameter.
To initialize the 128-bit variant instead, we call:
EVP_EncryptInit_ex(e_ctx, EVP_aes_128_cbc(), NULL, key, iv);
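Encryption of the Speex frame then takes place in the following manner:
/* Reuse the cipher, key and IV already set on the context. */
EVP_EncryptInit_ex(e_ctx, NULL, NULL, NULL, NULL);
/* Encrypt *len bytes of plaintext; c_len receives the bytes written. */
EVP_EncryptUpdate(e_ctx, ciphertext, &c_len, plaintext, *len);
/* Encrypt the final padded block; f_len receives the bytes written. */
EVP_EncryptFinal_ex(e_ctx, ciphertext+c_len, &f_len);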
Note: both decompression and decryption of the stream are not implemented in the test bed. Although
implementing decoding and decryption would add value and integrity to the results, the results can
be calculated without the need for either decryption or decoding.
RTP Sending/Receiving
The previous 2 stages of the operations held in the test bed allow calculating the bitrate in the absence of
packetization. To achieve realistic results, we implement sending and receiving of the stream in two
separate threads. Then we calculate the bitrates of the received and dumped packets.
The library used for RTP sending/receiving is oRTP, an implementation of RTP. A number
of calls must be made to initialize the library. The first of these is RTPCreate(), which
establishes a context. A context is an identifier used by the library to determine which RTP session a
function call is to be associated with. An application can run many sessions at the same time, each
created with a separate call to RTPCreate, resulting in a different context for each. Most library
functions accept a context as the first argument. Once RTPCreate has been called to initialize the
session, the addresses for the session must be set.
rtperror RTPCreate(context *the_context);
rtperror RTPOpenConnection(context cid);
Sending packets is fairly straightforward. The RTPSend() function is used to tell the library to send
an RTP packet. It requires the user to pass a pointer to a buffer, a length, a value for the marker field
in the RTP header, an increment for the timestamp, and the context. The library will take the buffer,
add the RTP header, perform any required operations, and send the packet. The library will
automatically send RTCP packets. The initial timestamp and sequence number are chosen randomly.
rtperror RTPSend(context cid, int32 tsinc, int8 marker, int16 pti, int8 *payload, int len);
Receiving packets is a little more complex. In order to know if a packet is available for reading, a
process can block, it can poll, or use any other kind of mechanism. Since the library does not dictate
this policy, it is up to us to determine when data is available for reading. We choose polling every 20
milliseconds in order to check for a received packet. To do this, the library allows access to the
receive sockets. There are two: one for RTP, one for RTCP. The functions RTPSessionGetRTPSocket
and RTPSessionGetRTCPSocket are used to do this. They take as input the context and a pointer to
a socket. When they return 1, the socket has been filled in. We then check for the presence of an RTP
packet on these sockets using select().
RTPSessionGetRTPSocket(context cid, socktype *value);
rtperror RTPReceive(context cid, int socket,
char *rtp_pkt_stream, int *len);
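The polling step described above can be sketched with the POSIX select() call. This is a minimal illustration that works on any readable descriptor; the function names wait_readable and poll_and_read are assumptions made for this example, and read() stands in for the RTP library's own receive function, which is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Waits up to `timeout_ms` for data on descriptor `fd`.
 * Returns 1 if the descriptor is readable, 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    fd_set readfds;
    struct timeval tv;

    assert(fd >= 0 && fd < FD_SETSIZE && timeout_ms >= 0);
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    int ready = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (ready < 0)
        return -1;
    return (ready > 0 && FD_ISSET(fd, &readfds)) ? 1 : 0;
}

/* Polls `fd` once; if data is pending, reads up to `cap` bytes into `buf`.
 * Returns the number of bytes read, 0 on timeout, -1 on error.
 * Note that end-of-file also yields 0 in this simplified sketch. */
long poll_and_read(int fd, void *buf, size_t cap, int timeout_ms)
{
    int r = wait_readable(fd, timeout_ms);
    if (r <= 0)
        return r;
    return (long)read(fd, buf, cap);
}
```

Calling poll_and_read() with a 20 ms timeout in a loop reproduces the polling policy chosen for the test bed: the receiving thread wakes up once per packet interval and only reads when select() reports pending data.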
When a packet is present on either socket, the application should call the function RTPReceive().
This function takes the context, the socket on which data is present, a pointer to a buffer, and a pointer
to a length value. The length value should be initialized to the amount of room in the buffer. The
library will read and process the RTP or RTCP packet. For RTCP, it will perform all statistics
collection and parsing. The buffer will be filled in with the entire RTP/RTCP packet, including the
header.
We then save the whole received packet into a file for further calculation of the obtained bitrate. The
bitrate is calculated based on the previously known duration of the sent speech.
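The bit-rate calculation can be sketched as follows: the duration follows from the number of samples and the sampling rate, and the bit-rate is the number of bits sent divided by that duration. The function names are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Duration in seconds of a recording, from its sample count. */
double duration_seconds(size_t samples, unsigned sample_rate_hz)
{
    assert(sample_rate_hz > 0);
    return (double)samples / sample_rate_hz;
}

/* Average bit-rate in bits per second for `total_bytes` bytes
 * covering `seconds` of speech. */
double bitrate_bps(size_t total_bytes, double seconds)
{
    assert(seconds > 0.0);
    return 8.0 * (double)total_bytes / seconds;
}
```

The same function serves both measurements. Passing the sum of the encrypted frame sizes gives the bit-rate without packetization, while passing the sum of the dumped packet sizes, which include the RTP header (12 octets in its fixed form), gives the bit-rate with packetization. For example, 48,000 samples at 16 kHz last 3 seconds, and 3,000 bytes over 3 seconds correspond to 8,000 bps.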
Dataset
The choice of the data set was guided by the dataset used in the articles published about the subject.
They used the TIMIT corpus, a database used for speech recognition. Since the TIMIT database is
not open for public use, we chose to work with another speech recognition training database: the
census database. Here we state information about the designated dataset:
The directory contains the alphanumeric database (aka "census" aka "an4") recorded at Carnegie
Mellon University circa 1991. Subjects were asked to spell out personal information, such as name,
address, telephone number, birthdates, etc. They were instructed to not use their actual numbers. In
addition to these, subjects also spoke randomly generated sequences of words containing control
words. The database used internally at CMU has 1018 training and 140 test utterances, whereas the
database provided here has 948 training and 130 test utterances. All data are sampled at 16 kHz, 16-
bit linear sampling. All recordings were made with a close talking microphone.
In the dataset, we have two directories:
an4_clstk
The directory with training data has 74 sub-directories, one for each speaker. 21 of
them are female, 53 are male. The total number of utterances is 948, and the average
duration is about 3 seconds, totaling a little less than 50 minutes of speech.
an4test_clstk
The directory with test data has 10 sub-directories, one for each speaker. 3 of them
are female, 7 are male. The total number of utterances is 130, totaling around 6
minutes of speech.
Test-bed overview
The presented test-bed can be summarized in the block diagram in Figure III-1.
The process of testing starts with choosing a file from the dataset. The duration of the file is calculated
by counting the number of samples in the file. After that, the file is encoded and encrypted.
Compression must span the whole range of quality (0 to 10). Encryption must also take
place in the 4 presented modes (stream and 3 block sizes). Next, the file is sent over an RTP socket to
a local receiving socket initiated by another thread. Packets are dumped and saved in an output file.
The recorded size of frames is used to calculate the bit-rate without the packetization overhead. The size
of the sent/received stream is used to calculate the bit-rate along with the network overhead.
Figure III-1: Test-bed block diagram
Choose file from dataset
  o Calculate time
Compress using Speex
  o Set quality parameter
Encrypt using AES in CBC mode
  o Set the block size
  o Record frame sizes
Send/Receive over RTP socket
  o Dump frames
  o Calculate bit-rate
Chapter IV Results and Conclusion
Tests were held using the test-bed presented in the previous chapter. The resulting bitrates
are divided into two categories: narrowband and wideband. For each category, we present the bitrate
obtained for the 4 encryption schemes (stream, 128, 256, and 512-bit padding), along with the overhead
of the latter three schemes over the original stream bitrate.
Narrow Band Results
Figure IV-2: NB rate vs quality (without packetization)
Figure IV-1: NB padding overhead (without packetization)
As these results show, the overhead induced by padding in narrow band mode is large. Figures IV-1
and IV-2 show the rate versus quality for the three levels of security, as well as the overhead
induced by padding. The 128 bit padding adds a moderate overhead. With 512 bit padding, the bitrate is
constant over the whole range of quality; consequently, using CBR at the highest quality may be a
better solution than relying on this padding. For the other padding schemes (256 bit, for example),
the added overhead is manageable.
A tradeoff based on these results can be stated as follows. Consider the rates of stream encryption
and 256 bit padding: the average rate of stream encryption at quality 10 is the same as that of 256
bit padding at quality 7. Padding to 256 bits and setting the quality to 7 therefore gives a large
security gain while keeping the same rate. The price we pay is quality.
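The tradeoff just described amounts to a simple lookup: take the stream rate at the reference quality as a budget, then find the highest quality whose padded rate still fits. The sketch below illustrates this with placeholder numbers chosen only to mirror the shape of the example; they are not the measured rates, and the function name is hypothetical.

```python
def best_quality_within_budget(padded_rates, budget_bps):
    """Highest quality whose padded bit-rate fits the budget, or None."""
    feasible = [q for q, rate in padded_rates.items() if rate <= budget_bps]
    return max(feasible) if feasible else None


# Placeholder average rates in bits per second, indexed by quality.
# Illustrative values only, not results from the test-bed.
stream_rates = {8: 15000, 9: 18000, 10: 24000}
padded_256_rates = {6: 20000, 7: 24000, 8: 28000, 9: 32000, 10: 36000}

budget = stream_rates[10]
print(best_quality_within_budget(padded_256_rates, budget))
```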
Figure IV-3: NB rate vs quality (with packetization)
Figure IV-4: NB padding overhead (with packetization)
Wide Band Results
The same tests were run with Speex set to wide-band mode. The following figures show the obtained
results (bit-rate versus quality, and overhead) for the 4 encryption schemes adopted. The results are
considerably better than those obtained for narrow band. As we see in figure IV-5, the curves
corresponding to the 4 encryption schemes differ less, and consequently the added overhead is
smaller. For example, if we make the same tradeoff as in the previous section, the quality drops only
to 9 instead of 7. To pad to 512 bits and keep the same bit-rate, the quality drops to 8.
Figure IV-5: WB overhead (without packetization)
Figure IV-6: WB rate versus quality (without packetization)
An important observation is that the overhead is considerably smaller when calculated with
packetization, because the packet headers add the same number of bytes to every packet regardless of
padding. For example, the overhead for 256 bit padding at a quality of 7 is around 30% if calculated
without packetization, and less than 20% when calculated with packetization.
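The effect can be made explicit with a short derivation. The symbols and the numbers in the worked example are assumptions introduced for illustration (one frame per packet, and 40 bytes of IPv4, UDP and RTP headers); they are not measured values.

```latex
% S: average stream-encrypted frame size, P: average padded frame size,
% H: per-packet header size (all in bytes; symbols are assumptions of this sketch).
\[
  O_{\text{payload}} = \frac{P - S}{S},
  \qquad
  O_{\text{packet}} = \frac{(P + H) - (S + H)}{S + H} = \frac{P - S}{S + H}
  < O_{\text{payload}}
  \quad \text{for } P > S,\ H > 0 .
\]
% Hypothetical numbers: S = 60, P = 78, H = 40.
\[
  O_{\text{payload}} = \frac{78 - 60}{60} = 30\%,
  \qquad
  O_{\text{packet}} = \frac{78 - 60}{60 + 40} = 18\% .
\]
```

The smaller the frames are relative to the headers, the more the relative overhead of padding shrinks once packetization is taken into account.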
Figure IV-7: Wide Band overhead (with packetization)
Figure IV-8: Wide Band rate versus quality (with packetization)
Statistical Analysis
The results shown in the previous sections represent only the average bitrate. To have a clearer
perspective, we calculated a 95% confidence interval for each obtained bitrate. The confidence
interval gives us more information about the resulting bitrate. Its width tells us how much the
stream still benefits from variable bitrate compression. However, a very wide confidence interval
cannot be taken as a sign of a manageable overhead.
Another important point is that when the confidence intervals of 2 encryption schemes overlap, this
suggests that the 2 schemes can operate at the same rate. Overall, the statistical results give a
better perspective on the tradeoff to be made.
The following figures present the confidence intervals for the 4 encryption schemes. The 3 block
encryption schemes are compared to the stream cipher. Only wide band results are shown.
Figure IV-9: Stream Cipher 95% Confidence Interval
Figure IV-10: Stream and 128 bit padding Confidence Interval
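The confidence intervals and the overlap check discussed in this section can be sketched as follows. The text does not state how the intervals were computed, so this example assumes the usual normal approximation (mean plus or minus 1.96 standard errors); the function names and bit-rate samples are hypothetical.

```python
import math
import statistics


def confidence_interval_95(samples):
    """95% confidence interval for the mean, using the normal approximation."""
    mean = statistics.mean(samples)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    half_width = 1.96 * sem
    return mean - half_width, mean + half_width


def intervals_overlap(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical per-file bit-rates in kbit/s (not measured data).
stream = [22.0, 24.0, 26.0, 23.0, 25.0]
padded = [24.0, 26.0, 27.0, 25.0, 26.0]
ci_stream = confidence_interval_95(stream)
ci_padded = confidence_interval_95(padded)
print(ci_stream, ci_padded, intervals_overlap(ci_stream, ci_padded))
```

For small sample counts, a Student t quantile would be a more cautious choice than the fixed 1.96 factor.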
Conclusion and future recommendations
The answer to the main question asked in our problem statement is: yes, using padding for an
encrypted VoIP stream is feasible. The results clearly show that a 3 dimensional tradeoff can be made
to get the desired solution. The parameters of the tradeoff are the level of security (represented by
the padding block size), the bit-rate, and the quality. The latter two parameters trade off against
each other (a lower bit-rate comes at the price of lower quality), while the security parameter
changes the scale of the bitrate range.
Figure IV-12: Stream and 256 bit Confidence Interval
Figure IV-11: Stream and 512 bit padding confidence interval
A remark should be made about the importance of 256 bit padding. It greatly improves immunity to
traffic analysis (chapter 2), while the overhead induced by this encryption scheme remains
manageable.
In conclusion, the security issues presented in the literature can be solved without relying on new
technology. State-of-the-art tools that are already implemented and standardized can be used with
minor modification to gain a great security upgrade.
As future work, we recommend applying this approach to video compression. Although nothing has been
published yet about such analysis for video, the concept of information leakage through varying
packet size is worth studying for all types of network streams, especially for real-time
applications.
Another recommendation is to push towards standardizing such an approach as part of security
standards. Although an RFC has been published with guidelines for using variable bit-rate encoding
with SRTP, this standard suggests making padding part of the application layer and a responsibility
of the developer. Such an approach may induce security weaknesses that would be avoidable if padding
were part of the standard.