SUBMITTED TO IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 1
Secure Visual Object Based Coding for Privacy Protected Surveillance
Karl Martin* and Konstantinos N. Plataniotis
Abstract
This paper presents a scheme for secure coding of arbitrarily-shaped visual objects. The scheme can
be employed in a privacy protected surveillance system, whereby visual objects are encrypted so that the
content is only available to certain entities, such as persons of authority, possessing the correct decryption
key. This system may be deployed in sensitive areas requiring surveillance, but where personnel require
privacy for authorized activities within the surveillance area. The encryption can be tied with the identity
of human objects under surveillance so that unauthorized personnel are immediately apparent to human
or computer based monitoring systems. The secure visual object coder employs Shape and Texture Set
Partitioning in Hierarchical Trees (ST-SPIHT) along with partial encryption for efficient, secure storage
and transmission of visual object shape and textures. The encryption is performed in the compressed
domain and does not affect the rate-distortion performance of the coder. A separate parameter for each
domain and does not affect the rate-distortion performanceof the coder. A separate parameter for each
encrypted object controls the strength of the encryption versus required processing overhead.
Index Terms
Shape adaptive coding, security, encryption, surveillance, privacy, privacy protection, visual object
coding, shape and texture coding, wavelet based coding, set partitioning in hierarchical trees (SPIHT).
Corresponding Author: Karl Martin, Multimedia Laboratory, Room BA 4157, The Edward S. Rogers Sr. Department of ECE,
University of Toronto, 10 King’s College Road, Toronto, Ontario, M5S 3G4, Canada, phone: 1 (416) 978 6845, FAX: 1 (416)
978 4425, e-mail: [email protected]
K. Martin and K.N. Plataniotis are with The Edward S. Rogers Sr. Department of ECE, University of Toronto, Multimedia
Laboratory, Room BA 4157, 10 King’s College Road, Toronto, Ontario, M5S 3G4, Canada
*Partially supported by a grant from the Natural Sciences and Engineering Research Council of Canada (NSERC) under the
Network for Effective Collaboration Technologies through Advanced Research (NECTAR) project.
October 12, 2007 DRAFT
I. INTRODUCTION
Video surveillance of both public and private spaces is expanding at an ever-increasing rate. Consequently, individuals are increasingly concerned about the invasiveness of such ubiquitous surveillance and fear that their privacy is at risk. The demands of law enforcement agencies to prevent and prosecute criminal activity, and the need for private organizations to protect against unauthorized activities on their premises, are often seen to be in conflict with the privacy requirements of individuals.
In this paper, we propose a secure visual object based coder, Secure Shape and Texture Set Partitioning in Hierarchical Trees (SecST-SPIHT), in order to address this conflict. The SecST-SPIHT scheme codes the shape and texture of arbitrarily-shaped visual objects in the same fashion as ST-SPIHT [1], and partially encrypts the output bit-stream based on the classification and importance of the bits [2]. The scheme efficiently and effectively secures the entire shape and texture of the object and ensures that the data cannot be accessed without provision of the correct decryption key. At typical output bit-rates and choices of security parameter, the encryption operation is performed on less than 5% of the output code bits.
The SecST-SPIHT secure coder can be employed in surveillance systems where the capture of certain
visual objects may be considered privacy invasive (e.g., face and body images). The decryption key
required to decrypt and decode the visual object shape and texture may be managed such that only
the appropriate authorities are able to access the object data. Furthermore, the key may be tied to the
subject’s identity (e.g., through RFID based tokens), thus giving control of the private content to the
subject. The proposed, computationally simple selective encryption procedure makes the scheme suitable
for real-time applications where significant processing resources are already consumed by coding of
the video stream.
Previous works on the privacy protection of individuals in video surveillance have largely focused on
face and body tracking, but have generally resorted to scrambling, obscuring, or masking the visual data
to protect the identity of the subjects. In [3], the subject’s image is masked, revealing only a silhouette.
However, such a silhouette may not completely obscure the identity of the subject. Furthermore, the
system discards the texture data, making future investigation by authorities impossible. Similarly, in [4],
the focus is on removing appearance information while retaining structural information about the body
in order to assess behavior. However, the removed appearance information is discarded and cannot be
retrieved, making the solution inappropriate in law enforcement and forensic applications.
The approach in [5] is to “de-identify” face images so that facial recognition software cannot be used
to reliably identify the subject, but enough facial features remain so that the image could still be used
for detecting behavior. In this so-called k-Same approach, face images are clustered based on a distance
metric, and the images are replaced by a representative image generated by averaging of components based on
pixels or eigenvectors. This approach, however, does not obscure identifying information that is conveyed
by other parts of the body (e.g., via gait [6]), and again, the original data is discarded and cannot be
retrieved by authorized personnel. In [7], a region of interest (ROI) is defined for face data within a
frame, and the corresponding coefficients downshifted in order to be coded and protected in a separate
quality layer using Motion JPEG 2000. However, the wavelet domain separation of ROI content only
allows for rough separation of content in the spatial domain, thus disallowing the true object vs. background separation that is possible in object-based coding schemes.
The computer vision approach of [8] provides three policy-dependent options for hiding private data:
summarization; transformation (obscuration); and encryption. However, in the case of encrypted output,
traditional encryption is applied to the entire private data stream, which is computationally infeasible in
many digital video surveillance systems. The proposed scheme in [9] embeds the private information of
subjects as an encrypted watermark within the surveillance frames. However, the private data is limited
to rectangular regions of the image frame and, again, traditional encryption is applied to the data. In
[10], a reversible wavelet-domain scrambling is performed on ROI-defined private data, thus allowing
subsequent retrieval of the private data by authorized users. This approach, as in [7], does not allow
explicit spatial domain separation of the object of interest and the background, and the region-of-interest
shape is not secured. Furthermore, the scrambling is performed before compression, resulting in a modest
reduction in coding performance.
The schemes in [11] use efficient encryption or shuffling of variable-length codeword concatenations
to secure MPEG-4 video streams while maintaining format compliance. However, entire frames are
secured, and hence these schemes cannot be used to secure only private data in surveillance applications. Furthermore,
the intended target is entertainment applications, where some image details can be reconstructed through
error concealment techniques. In [12], MPEG-4 video objects are secured through selective encryption
of Object Descriptors (OD). This approach, however, offers very limited security since none of the actual
object content is encrypted.
The proposed SecST-SPIHT secure coder accepts arbitrary shape and texture input, and therefore may
be assisted by the subject detection and tracking systems proposed in other works. However, the efficient
encryption and coding of both the shape and texture information makes it uniquely appropriate for privacy
protection in real-time surveillance applications. The remainder of the paper is organized as follows. In
Section II, the SecST-SPIHT scheme is described in detail and security analysis is provided. In Section
III, experimental results are provided and analyzed for various object inputs and parameters. Finally, the
paper is concluded in Section IV.
II. SECURE SHAPE AND TEXTURE SPIHT CODING SCHEME
The Secure ST-SPIHT (SecST-SPIHT) coding and decoding system is shown in Fig. 1. It is based on the
Shape and Texture Set Partitioning in Hierarchical Trees (ST-SPIHT) scheme for coding arbitrarily-shaped
visual objects [1], with individual bits from the output bit-stream selectively encrypted using a stream
cipher. The selective encryption offers an efficient alternative to complete content encryption, which can
be computationally burdensome in full color image and video applications. The data-dependent decoding
algorithm makes the unencrypted portion of the bit-stream effectively impossible to locate or interpret.
Furthermore, the bits chosen for encryption represent the most significant components of the coded object,
ensuring complete confidentiality of the visual data from those without the correct decryption key. Since
encryption only occurs during the output stage, the shape and texture coding operate in exactly the
same fashion as ST-SPIHT, with identical rate-distortion performance and embedded/progressive output
properties [1]. The system describes secure coding of still visual objects but can easily be extended to
the frames of a video object sequence.
The input consists of two components: i) an M × N full color (texture) image x : Z² → Z³ representing a two-dimensional matrix of three-component RGB color samples x(i, j) = [x(i, j)1, x(i, j)2, x(i, j)3], with i = 0, 1, . . . , M − 1 and j = 0, 1, . . . , N − 1 denoting the spatial position of the pixel, and x(i, j)k denoting the component in the red (k = 1), green (k = 2), or blue (k = 3) color channel; and ii) an M × N binary (shape mask) image s : Z² → {0, 1} representing a two-dimensional matrix of binary values where s(i, j) = 1 denotes spatial positions ‘inside’ the object, and s(i, j) = 0 denotes spatial positions ‘outside’ the object. The object is preprocessed by first converting the texture to the YCbCr color space. Subsequently, texture positions outside the object are set to zero, such that x(i, j) = [0, 0, 0], ∀ (i, j) where s(i, j) = 0.
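As a concrete illustration, the preprocessing step above can be sketched as follows. The numpy-based helper and the BT.601-style YCbCr weights are our own assumptions for illustration; they are not prescribed by the coder itself.

```python
import numpy as np

def preprocess_object(x_rgb, s):
    """Convert an RGB texture to YCbCr and zero out samples outside the
    binary shape mask s (illustrative sketch; BT.601 weights assumed)."""
    r, g, b = x_rgb[..., 0], x_rgb[..., 1], x_rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b     # luma
    cb = 0.564 * (b - y)                      # blue-difference chroma
    cr = 0.713 * (r - y)                      # red-difference chroma
    ycbcr = np.stack([y, cb, cr], axis=-1)
    ycbcr[s == 0] = 0.0    # x(i, j) = [0, 0, 0] wherever s(i, j) = 0
    return ycbcr
```

For a uniform gray pixel the chroma channels are zero, so masking and conversion can be checked directly on small arrays.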
Each color channel of the texture is subsequently transformed using an in-place lifting shape-adaptive discrete wavelet transform (SA-DWT) with global subsampling [1], [13], creating the M × N vectorial field xT : Z² → Z³ of transform coefficients xT(i, j) = [xT(i, j)1, xT(i, j)2, xT(i, j)3]. This is a modification of the SA-DWT described in [14], allowing the spatial domain shape mask s to remain unmanipulated and coded directly.
A. SecST-SPIHT Coder and Decoder
The SecST-SPIHT coder, shown in Fig. 3, is identical to ST-SPIHT except that the output bit-stream is selectively encrypted using a stream cipher fE(b, kE), applied to individual bits b using the private key kE. The ST-SPIHT algorithm is employed to code the input shape and texture as well as to instruct the stream cipher which bits require encryption. The details of the ST-SPIHT coding algorithm will only be summarized here; full details and analysis can be found in [1].
1) ST-SPIHT: The texture coding in ST-SPIHT follows a natural extension of SPIHT with the spatial
orientation trees (SOT) defined as in [15], with the modification for color images proposed in [16]. The
SOTs are first formed using all coordinates inside the bounding box of size M × N; the binary shape mask s is used to describe which nodes are inside the object and which are outside.
We define G = {(i, j) | s(i, j) = 1} as the set of all coordinates inside the object, and Ḡ = {(i, j) | s(i, j) = 0} as the complementary set containing all coordinates outside the object; i.e., G ∪ Ḡ = {(i, j) | i = 0, 1, . . . , M − 1, j = 0, 1, . . . , N − 1} and |G| + |Ḡ| = MN. All the definitions from the standard SPIHT algorithm described in [15] remain in use with the addition of the color component index k. Briefly, the list of insignificant pixels (LIP), list of significant pixels (LSP), and list of insignificant sets (LIS) store different coefficient and tree root coordinates. A “type-A” entry in the LIS refers to D(i, j)k, all the descendants of (i, j)k; a “type-B” entry refers to L(i, j)k = D(i, j)k − O(i, j)k, where O(i, j)k are the direct offspring of location (i, j)k. H denotes the set of all luminance LL subband coefficient coordinates and Sn(·) refers to the significance test at bit-plane n, as defined in [15].
Unique to the ST-SPIHT algorithm are a series of three “α-test” functions. The “α pixel test” function, αp(·, ·), identifies whether a coordinate is inside or outside the shape and is defined as follows:

αp(i, j) =
  1, if (i, j) ∈ G
  0, otherwise.  (1)

The “α set-discard test” function, αSD(·), identifies sets of coefficients that are entirely outside the object:

αSD(T) =
  0, if T ⊆ Ḡ
  1, otherwise,  (2)

where T represents a given set of coefficients. And finally, the “α set-retain test” function, αSR(·), identifies sets of coefficients that are entirely inside the object:

αSR(T) =
  1, if T ⊆ G
  0, otherwise.  (3)
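A minimal sketch of the three α-tests, assuming the shape mask s is indexable as s[i][j] and a set T is a list of (i, j) coordinate pairs; the function names are ours:

```python
def alpha_p(s, i, j):
    """alpha pixel test (Eq. 1): 1 if (i, j) is inside the object."""
    return 1 if s[i][j] == 1 else 0

def alpha_sd(s, T):
    """alpha set-discard test (Eq. 2): 0 if every coordinate in T lies
    outside the object (the whole set can be discarded), else 1."""
    return 0 if all(s[i][j] == 0 for (i, j) in T) else 1

def alpha_sr(s, T):
    """alpha set-retain test (Eq. 3): 1 if every coordinate in T lies
    inside the object, else 0."""
    return 1 if all(s[i][j] == 1 for (i, j) in T) else 0
```

A set that straddles the object boundary yields αSD = 1 and αSR = 0, which is the case that triggers further subdivision in the coder.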
The ST-SPIHT coding routine requires the shape code level parameter, λ, to be input. This defines the quantization level at which the routine forces the coding of not-yet-coded shape mask pixels s(i, j).
This is done by applying the subroutine “Shape Code Set” (SCS) to the appropriate trees. The complete
algorithm codes the shape and texture information in parallel, producing an embedded bit-stream that
can be decoded to produce progressive shape and texture reconstruction. By lowering λ, the shape code becomes further dispersed in the output bit-stream, delaying the point at which the shape can be completely, losslessly decoded. At very low output bit-rates, lowering λ allows greater emphasis to be
placed on the texture, providing the trade-off of lossy shape reconstruction [1]. The decoder follows the
same data-dependent execution path as the coder based on interpretation of the output bit-stream.
2) Selective Encryption for ST-SPIHT: The SecST-SPIHT selective encryption algorithm is based on the scheme proposed in [2] for regular SPIHT. We denote the ST-SPIHT bit-stream as the ordered set of bits B. The bit-stream can be divided into the ordered subsets B = {Bnmax, Bnmax−1, Bnmax−2, . . .}, where Bn is the set of bits obtained during the coding iteration for bit-plane n (i.e., representing the value 2^n), and nmax is the highest bit-plane at which coding is initiated. Each Bn can be further subdivided into Bn = {Bn,LIP, Bn,LIS, Bn,LSP}, where Bn,LIP denotes the ordered set of bits obtained during the first phase of the sorting pass, where coefficients in the LIP are tested for significance; Bn,LIS denotes the ordered set of bits obtained during the second phase of the sorting pass, where entire trees are tested for significance; and Bn,LSP denotes the ordered set of bits obtained during the refinement pass.

Each set of bits Bn,LIP is composed of α-test bits (Bn,LIP−α), significance bits (Bn,LIP−sig) and sign bits (Bn,LIP−sgn). Similarly, each set of bits Bn,LIS is composed of significance bits (Bn,LIS−sig) and sign bits (Bn,LIS−sgn) for individual coefficients, significance bits for trees (Bn,LIS−Tsig), and α-test bits for both individual coefficients and trees (Bn,LIS−α). This decomposition of the bit-stream is shown in Fig. 2.
The SecST-SPIHT encryption scheme uses an encryption function fE(b, kE) to encrypt only the bits b ∈ {Bn,LIP−α, Bn,LIP−sig, Bn,LIS−α, Bn,LIS−sig}, for n = nmax, nmax − 1, . . . , nmax − K + 1. The key kE enforces the confidentiality of the data by preventing entities without the correct matching decryption key, kD, from correctly decrypting the data. The parameter K is controlled by the user at the time of encryption/encoding to determine the number of coding iterations to be encrypted. Increasing K results in more bits being encrypted and greater security, with the trade-off of greater computational overhead. The specific bits are selectively chosen since they represent the object shape information and the significance information of individual coefficients. The coefficient sign bits (Bn,LIP−sgn and Bn,LIS−sgn) remain unencrypted since their values do not affect the coder/decoder execution path. Similarly, the
significance bits relating to entire trees (Bn,LIS−Tsig) remain unencrypted since they do not affect specific
coefficient reconstruction values.
The encryption functionfE(b, kE) must be implemented using a stream cipher since the decoder (Fig.
4) must decode individual bits and instruct the decryption function fD(b, kD) whether each subsequent bit
requires decryption or not; the use of a block cipher would prevent the decoder from correctly determining
which bits in the output bit-stream are part of the cipher block. However, the system is flexible in that
any bit-level stream cipher may be used, employing either private keys or public-private key pairs.
The complete description of the SecST-SPIHT routine and the Secure SCS subroutine (SecSCS) follows. For ease of notation, we introduce the controlled encryption function fE(b, kE, n, K) defined as follows:

fE(b, kE, n, K) =
  fE(b, kE), if n > nmax − K
  b, otherwise.  (4)

Hence, the encryption function is only activated for the first K iterations of the coding algorithm, after which the input bits are passed through, unencrypted.
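The controlled encryption function of Eq. (4) can be sketched as follows; the keystream iterator stands in for the stream cipher output derived from kE, and the factory-style wrapper is our own framing, not part of the paper's specification:

```python
def make_controlled_encryptor(keystream, n_max, K):
    """Return f_E(b, n) per Eq. (4): XOR bit b with the next keystream
    bit while n > n_max - K; pass b through unchanged afterwards.
    `keystream` is any iterator of key bits (illustrative stand-in
    for the stream cipher keyed by k_E)."""
    def f_E(b, n):
        if n > n_max - K:
            return b ^ next(keystream)  # stream-cipher encryption
        return b                        # later iterations: plaintext
    return f_E
```

Note that keystream bits are consumed only for encrypted iterations, so coder and decoder stay synchronized on the cipher state.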
SecST-SPIHT Coder:
Input: xT, s, λ, K, kE
1. Initialization: Find the initial quantization level n = nmax = ⌊log2(max(i,j,k){|xT(i, j)k|})⌋; set LSP = ∅; set LIP = H; set LIS = {(i, j)k “type-A” | (i, j)k ∈ H, D(i, j)k ≠ ∅}.
2. Sorting pass:
2.1. For each (i, j)k ∈ LIP:
2.1.1. If αp(i, j) not coded yet then output fE(αp(i, j), kE, n, K);
2.1.2. If αp(i, j) = 1 then:
• Output fE(Sn(i, j)k, kE, n, K);
• If Sn(i, j)k = 1 then move (i, j)k to the LSP and output the sign of xT(i, j)k;
2.1.3. If αp(i, j) = 0 then remove (i, j)k from the LIP;
2.2. For each entry (i, j)k ∈ LIS:
[If “type-A” entry, T = D(i, j)k; if “type-B” entry, T = L(i, j)k]
2.2.1. If n ≥ λ and shape not completely coded, then:
• If αSD(T) not coded yet then output fE(αSD(T), kE, n, K);
• If αSD(T) = 0 then remove (i, j)k from the LIS and move on to the next entry in the LIS (go to Step 2.2);
• If αSD(T) = 1 then:
– If αSR(T) not coded yet then output fE(αSR(T), kE, n, K);
– If αSR(T) = 0 and n = λ then run SecSCS(T);
2.2.2. If shape completely coded and αSD(T) = 0 then remove (i, j)k from the LIS and move on to the next entry in the LIS (go to Step 2.2);
2.2.3. If “type-A” entry and αSD(T) = 1:
• Output Sn(D(i, j)k);
• If Sn(D(i, j)k) = 1 then:
– For each (p, q)r ∈ O(i, j)k:
∗ Output fE(Sn(p, q)r, kE, n, K);
∗ If Sn(p, q)r = 1 then add (p, q)r to the LSP and output the sign of xT(p, q)r;
∗ If Sn(p, q)r = 0 and αp(p, q) not coded yet, then output fE(αp(p, q), kE, n, K);
∗ If αp(p, q) = 1 then add (p, q)r to the LIP;
– If L(i, j)k ≠ ∅ then move (i, j)k to the end of the LIS as a “type-B” entry; else, remove (i, j)k from the LIS;
2.2.4. If “type-B” entry and αSD(T) = 1:
• Output Sn(L(i, j)k);
• If Sn(L(i, j)k) = 1 then:
– Add each (p, q)r ∈ O(i, j)k to the end of the LIS as a “type-A” entry;
– Remove (i, j)k from the LIS.
3. Refinement pass: For each (i, j)k ∈ LSP, except those found significant in the current sorting pass, output the nth most significant bit of |xT(i, j)k|;
4. Quantization-step update: Decrement n by 1 and go to Step 2.
Secure Shape Code Set (SecSCS) Subroutine:
Input: set T with root (i, j)k, n, kE, K
1. If (i, j)k is a “type-A” entry:
1.1. For each (p, q)r ∈ O(i, j)k:
1.1.1. If αp(p, q) not coded yet then output fE(αp(p, q), kE, n, K);
1.1.2. If D(p, q)r ≠ ∅ then:
• If αSD(D(p, q)r) not coded yet then output fE(αSD(D(p, q)r), kE, n, K);
• If αSD(D(p, q)r) = 0 then terminate processing of D(p, q)r;
• If αSD(D(p, q)r) = 1 then:
– If αSR(D(p, q)r) not coded yet then output fE(αSR(D(p, q)r), kE, n, K);
– If αSR(D(p, q)r) = 0 then go to Step 1 treating D(p, q)r as a new “type-A” input;
2. If (i, j)k is a “type-B” entry:
2.1. For each (p, q)r ∈ O(i, j)k, go to Step 1 treating D(p, q)r as a new “type-A” input;
The coding operation is typically terminated when a specified rate or distortion criterion is met. While
SecST-SPIHT allows for coding to be terminated before the shape has been losslessly coded, typical rate
criteria and values of λ will result in complete lossless coding of the shape. Also, the coder may be
instructed not to code the shape in situations where, for example, the shape is implicitly available via
the shape of another object which surrounds the object to be coded (e.g., a background object).
The SecST-SPIHT decoder follows exactly the same execution path as the coder and only requires basic initialization information (i.e., M, N, |G|, nmax, λ, the number of wavelet transform levels, and s if the shape was not coded) to interpret the output bit-stream. Provided with the correct decryption key, kD, the decoder decodes the bit-stream and instructs the decryption function fD(b, kD) as to whether each subsequent bit should be decrypted or passed through, unencrypted. Since the first bit is always in Bnmax,LIP−α (generated from the first iteration of Step 2.1.1), it must always be decrypted.
It should be noted that SecST-SPIHT is backward compatible such that, when the input shape s fills the entire M × N rectangular bounding box, the coding operation is identical to traditional SPIHT [15] and the selective encryption algorithm operates the same as in [2].
B. Security Analysis of SecST-SPIHT
The SecST-SPIHT selective encryption ensures the confidentiality of the coded visual object data in two ways: i) securing the most significant portion of the bit-stream using a secret cryptographic key kE and a stream cipher; and ii) making the unencrypted portion of the bit-stream impossible to decode, since its location and the state of the decoder cannot be determined without correct decryption and decoding of the encrypted portion.
As noted in the previous section, encryption is performed on the bits b ∈ {Bn,LIP−α, Bn,LIP−sig, Bn,LIS−α, Bn,LIS−sig}, for n = nmax, nmax − 1, . . . , nmax − K + 1. This represents a partial bit-plane and shape encryption performed on the visual object in the SA-DWT domain, with the choice of K determining how many bit-planes are encrypted. Specifically, with K = 1, only the most significant bit-plane is encrypted, for the coefficients |xT(i, j)k| ≥ 2^nmax; for K = 2, only the top two most significant bit-planes are encrypted, for the coefficients |xT(i, j)k| ≥ 2^(nmax−1), and so on. In other words, the top K bit-planes are encrypted for all coefficients that are found significant in the first K iterations of the coding algorithm. Additionally, the output of each α-test is encrypted, effectively encrypting the entire shape code during the first K iterations. If K > nmax − λ, then the complete, lossless shape code is encrypted. The choice of K should be made to ensure that the number of bits finally encrypted is sufficient to make it computationally infeasible to perform a brute-force, exhaustive search attack over all possible sequences.
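The relationship between K, nmax, and the coefficients whose top bit-planes fall in the encrypted iterations can be illustrated with a small helper (our own construction, for illustration only):

```python
import math

def significant_in_first_K(coeffs, K):
    """Return n_max and the coefficients whose magnitudes reach the
    significance threshold within the first K coding iterations,
    i.e. |c| >= 2**(n_max - K + 1). These are the coefficients whose
    top bit-planes are encrypted."""
    n_max = int(math.floor(math.log2(max(abs(c) for c in coeffs))))
    threshold = 2 ** (n_max - K + 1)
    return n_max, [c for c in coeffs if abs(c) >= threshold]
```

For example, with coefficient magnitudes {100, 40, 7, 90}, nmax = 6; K = 1 touches only those at or above 64, while K = 2 lowers the threshold to 32.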
As with SPIHT and ST-SPIHT, the SecST-SPIHT coder and decoder follow a data-dependent execution
path. This means that the correct interpretation of a given bit in the output bit-stream requires complete
knowledge of all previous significance test and α-test bits. The result is that an attacker cannot in fact locate the bits in the output bit-stream which are not encrypted. To demonstrate the difficulty encountered by a cryptanalyst attempting to determine which bits are unencrypted, we use bʲn,LIP to denote the jth bit in the set Bn,LIP, for j = 0, 1, 2, . . . , Nn,LIP − 1, where Nn,LIP is the total number of bits in Bn,LIP. According to the SecST-SPIHT coder definition, considering the initial coding iterations in which n ≥ λ (i.e., the shape is still being coded), it is known a priori that the first bit is an α-test bit:

b⁰n,LIP ∈ Bn,LIP−α.  (5)
However, classification of the second bit depends on the first bit:

b¹n,LIP ∈
  Bn,LIP−sig, if b⁰n,LIP = 1
  Bn,LIP−α, otherwise.  (6)

And, consequently, classification of the third bit depends on the first and second bits:

b²n,LIP ∈
  Bn,LIP−sig, if b⁰n,LIP = 0 and b¹n,LIP = 1
  Bn,LIP−sgn, if b⁰n,LIP = 1 and b¹n,LIP = 1
  Bn,LIP−α, otherwise.  (7)

This can be generalized as follows:

bʲn,LIP ∈
  Bn,LIP−sig, if bʲ⁻¹n,LIP ∈ Bn,LIP−α and bʲ⁻¹n,LIP = 1
  Bn,LIP−sgn, if bʲ⁻¹n,LIP ∈ Bn,LIP−sig and bʲ⁻¹n,LIP = 1
  Bn,LIP−α, otherwise,
for 1 ≤ j < Nn,LIP.  (8)
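The chained classification of Eq. (8) can be sketched as a small state machine; this makes explicit why bit j cannot be classified without knowing all of its predecessors (function and class names are ours):

```python
def classify_lip_bits(bits):
    """Classify each bit of B_{n,LIP} per Eqs. (5)-(8): the class of
    bit j depends on the class and value of bit j-1, so classification
    must proceed sequentially from bit 0."""
    classes = []
    prev_class, prev_bit = None, None
    for j, b in enumerate(bits):
        if j == 0:
            c = "alpha"                               # Eq. (5)
        elif prev_class == "alpha" and prev_bit == 1:
            c = "sig"                                 # significance bit
        elif prev_class == "sig" and prev_bit == 1:
            c = "sgn"                                 # sign bit
        else:
            c = "alpha"                               # next alpha-test bit
        classes.append(c)
        prev_class, prev_bit = c, b
    return classes
```

Flipping any single early bit changes the classification of everything after it, which is exactly the property that hides the unencrypted bits from an attacker.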
From (8), it is evident that the bits Bn,LIP can in fact be treated as the ordered set of coded transition instructions in a Markov chain. The classification of bʲ⁻¹n,LIP, indicating the (j−1)th state in the chain, must be known along with the value bʲ⁻¹n,LIP (the transition instruction) in order to determine the classification of bʲn,LIP (the jth state in the chain). Since the value of bʲn,LIP indicates only the transition and not the state itself, it is clear that all previous bits bˡn,LIP, 0 ≤ l < j, must be known in order to classify bʲn,LIP and determine whether it is unencrypted. Similar arguments can be made for Bn,LIS. Hence, without the correct decryption key, not only do the encrypted bits remain confidential, but the locations of the unencrypted bits cannot be determined and are thus also confidential.
In attempting to attack the encrypted portion of the bit-stream, the attacker may recreate the Markov chain and perform statistical analyses so that the original bits could be predicted with probability p > 0.5 from previous bits, thus aiding an exhaustive search attack. While recreating such an attack is beyond the scope of this paper, the efficiency of the coding algorithm [1], [15] implies that the entropy of each
bit H(b) ≈ 1 and thus p ≈ 0.5, regardless of the additional contextual information. However, if a more conservative estimate of H(b) < 1 is made, then K can simply be increased to increase the number of encrypted bits in order to ensure that an exhaustive search remains computationally infeasible.
Alternatively, an attacker may attempt to locate the unencrypted portion of the bit-stream Bu = {Bn | n ≤ nmax − K}, since it is known that all bits bʲu ∈ Bu are unencrypted and may reveal important image features if correctly decoded. If we denote the set of encrypted bits as Be = {Bn,LIP−α, Bn,LIP−sig, Bn,LIS−α, Bn,LIS−sig | nmax − K < n ≤ nmax}, and the total number of bits in the first K coding iterations (both encrypted and unencrypted) as NK, an attack on Bu may be attractive if H(Be) > H(NK). In other words, if determining the location of Bu (which starts at bit NK + 1 within the overall bit-stream B) is computationally simpler than an exhaustive search over the encrypted bits Be, the attacker may view this approach as offering greater probability of success in revealing image details. However, even with knowledge of Bu, the state of the LSP, LIP, and LIS lists and the shape decoding remain unknown without correct decryption and decoding of Be. This means that, while the initial bits in Bu may be correctly classified by the attacker, it cannot be determined which coordinates within the SA-DWT description of the object the coded bits correspond to. Ultimately, the attacker will not be able to determine any image details from Bu without correct decryption and decoding of Be.
In summary, the SecST-SPIHT secure coder achieves confidentiality by encrypting the most significant
portion of the bit-stream as well as obfuscating the unencrypted portion. Choice of the parameter K
provides control of the number of coding iterations which are encrypted. This allows flexibility to meet
the security requirements of the application at hand.
III. EXPERIMENTAL RESULTS
The SecST-SPIHT secure coder was tested with a variety of input visual objects, as shown in Figs. 5 to
9. The ‘surveillance1’, ‘surveillance2’, and ‘surveillance3’ objects were extracted from actual surveillance
video frames using motion-based segmentation, whereas ‘akiyo’ and ‘foreman’ are the standard MPEG
test objects. The coder accepts an arbitrary binary segmentation map so that any segmentation algorithm
can be employed, depending on the requirements of the application. All frames are in 8-bit per channel
RGB CIF format (352 × 288), with Table I showing the percentage of the frame that the object occupies.
The rate-distortion performance is identical to ST-SPIHT, which is examined in detail in [1], and will
not be covered here. In all test cases here, the SecST-SPIHT coder utilized the CDF 9/7 biorthogonal
wavelet filters [17] with a 4-level transform, and an output code bit-rate of 2.4 bits-per-object-pixel
(including the shape code, where applicable). Since the progressive/embedded output property of ST-SPIHT is maintained, the output code may be arbitrarily truncated to achieve a lower bit-rate with the sacrifice of greater texture distortion.1 If lossless coding of the texture is required, integer-to-integer wavelet filters [18] and color transforms can be utilized and the coder instructed to code all of the transform domain bit-planes [1]. For simulation purposes, a Vernam cipher was employed as the stream cipher [19], using a 128-bit randomly generated key. However, any bit-level stream cipher that is sufficiently secure for the application at hand can be utilized.
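A bit-level XOR (Vernam-style) stream cipher can be sketched as follows; the hash-counter keystream below is purely our own illustration and is not a cryptographically vetted construction, so a deployment should use an established stream cipher:

```python
import hashlib
import itertools

def keystream_bits(key):
    """Illustrative keystream generator: expand the key by hashing a
    running counter (an assumption for this sketch, not the cipher
    used in the paper's simulations)."""
    for counter in itertools.count():
        block = hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        for byte in block:
            for shift in range(8):
                yield (byte >> shift) & 1

def vernam(bits, key):
    """XOR each bit with the keystream; encryption and decryption
    are the same operation."""
    return [b ^ k for b, k in zip(bits, keystream_bits(key))]
```

Because XOR is its own inverse, applying the function twice with the same key recovers the plaintext bits, matching the symmetric encrypt/decrypt use in the coder and decoder.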
Figs. 10 to 14 show sample output using the test objects. In all cases, encryption is performed during the first two coding iterations (K = 2). In the cases where the shape is coded and encrypted with the object texture, the shape code is completed in the third iteration (λ = nmax − 2). Figs. 10 to 12 show the decrypted/decoded output ‘surveillance’ objects/frames when: (a)/(d) the correct decryption key is provided; (b)/(e) the incorrect decryption key is provided; and (c)/(f) the incorrect decryption key is provided, but the shape is available externally and only the texture is coded. In all cases where the incorrect key is provided, the textural content is completely obscured; no object details can be seen. For the case (b)/(e) where the shape is coded and encrypted with the texture, the shape is also completely obscured. In order to reconstruct the frame without revealing the object shape mask, the background is transmitted as a full frame, with the missing texture information behind the object filled in using prior frames.
Similarly, the decrypted/deoded test objects/frames ’akiyo’ and ’foreman’ are shown in Figs. 13 and 14,
respectively with: (a)/(d) the correct decryption key provided; (b) the incorrect decryption key provided;
and (c)/(e) the incorrect decryption key provided, but the shape is available externally and only the
texture is coded. In the cases where the shape is coded and encrypted with the object and the incorrect decryption key is provided (Figs. 13(b) and 14(b)), the full-frame background is not transmitted since the prior frames in the sequence do not offer enough information to in-fill the original object area.
Fig. 15 shows the fraction of the output code bits which are encrypted vs. the number of coding
iterations during which encryption is performed (K). The total number of output code bits corresponds
to a bit-rate of 2.4 bits-per-object-pixel (including the shape code, where applicable). Fig. 15(a) shows
the case where the shape is not coded; Figs. 15(b) to 15(d) show the cases where the shape code is
completed during the first, second, and third coding iteration (λ = nmax, nmax − 1, and nmax − 2),
respectively. In Fig. 15(a), the effect of varying K can clearly be seen, with the fraction of the output code being encrypted rising with K. The fraction remains small for all considered K = 1, · · · , 4, ranging
¹At most bit-rates and choices of λ, the shape will be coded losslessly.
from approximately 0.2% to 1.6%. In Figs. 15(b) to 15(d), a large jump in the portion of the bit-stream that is encrypted is observed once K is set high enough to ensure that the shape is completely encrypted (K = nmax − λ + 1). When K is raised above this point, the effect is more subtle since at low output bit-rates the shape code represents a significant portion of the bit-stream. With K > nmax − λ, the actual percentage of the output code that is encrypted is largely controlled by the portion which is the shape code. If the user wishes to keep the level of encryption to a minimum for the purpose of computational efficiency, λ should be set low enough to disperse the shape code further into the bit-stream, with K ≤ nmax − λ so that only the initial portion of the shape code is encrypted. In this case, λ should be chosen so that K can still be set high enough to encrypt enough bits to achieve the minimum desired level of security; for example, as in Figs. 10 to 14, setting K = 2 and λ = nmax − 2 (i.e., shape code completed in the third coding iteration). The drawback of this approach is that the shape cannot be completely, losslessly decoded until later in the output bit-stream, possibly resulting in lossy shape reconstruction in low bit-rate scenarios.
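The parameter selection described above can be sketched as a small calculation. The helper names below are illustrative, and the per-iteration bit counts are derived from the cumulative ’surveillance1’ column of Table II (777, 805, 4333, 4507), purely for demonstration; the total-bit figure is a hypothetical stand-in for a 2.4 bits-per-object-pixel budget.

```python
def encrypted_fraction(iter_bits, total_bits, K):
    """Fraction of the output code encrypted when the first K coding
    iterations pass their selected bits through the stream cipher."""
    return sum(iter_bits[:K]) / total_bits

def choose_K(iter_bits, min_encrypted_bits):
    """Smallest K whose cumulative encrypted-bit count meets a security
    floor (enough bits to defeat an exhaustive search over them)."""
    cum = 0
    for K, bits in enumerate(iter_bits, start=1):
        cum += bits
        if cum >= min_encrypted_bits:
            return K
    raise ValueError("even the maximum K cannot reach the requested floor")

# Per-iteration increments reconstructed from cumulative Table II values:
iter_bits = [777, 28, 3528, 174]
print(choose_K(iter_bits, 128))                    # K = 1 already encrypts 777 bits
print(encrypted_fraction(iter_bits, 100_000, 2))   # small fraction of the code
```

This mirrors the trade-off in the text: a small K keeps the encrypted fraction low, while the security floor bounds how small K may be for a given λ.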
Table II shows the number of bits encrypted for λ = nmax − 2 and different K. As in Fig. 15(d), there is a jump at the iteration at which the remaining shape code is generated and encrypted (K = 3). With this choice of λ, K = 2 can be chosen since the number of bits encrypted is large enough to prevent a brute-force, exhaustive search attack over the encrypted bits, but still represents minimal processing overhead, with less than 5% of the output bit-stream encrypted at a bit-rate of 2.4 bits-per-object-pixel.
It should be noted that the property of SecST-SPIHT to disperse the shape code within the texture code
is inherited from ST-SPIHT. With the execution path of the texture decoding dependent on the shape
code, the two portions of the code cannot be separated without correct decryption of all encrypted bits.
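The brute-force argument above can be made concrete with a small sketch. The function name is hypothetical; it simply captures the observation that an attacker can exhaustively guess either the cipher key or the encrypted bits directly, so the effective work factor is the smaller of the two search spaces.

```python
def effective_security_bits(encrypted_bits: int, key_bits: int = 128) -> int:
    """Effective exhaustive-search cost in bits: the attacker takes the
    cheaper of guessing the key (2^key_bits trials) or guessing the
    encrypted bits themselves (2^encrypted_bits trials)."""
    return min(encrypted_bits, key_bits)

# With K = 2 and λ = nmax − 2, Table II reports roughly 800 encrypted
# bits per object, so guessing the 128-bit key remains the cheaper attack.
print(effective_security_bits(805))   # 128
print(effective_security_bits(64))    # an under-encrypted stream: only 64
```

Under this model, K need only be large enough that the encrypted-bit count exceeds the key length; beyond that, additional encryption adds processing overhead without improving the exhaustive-search bound.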
IV. CONCLUSIONS
The SecST-SPIHT secure visual object coder was presented, offering an efficient solution for privacy
protection of subjects in digital video surveillance systems. Provided with segmented, arbitrarily-shaped
visual objects, SecST-SPIHT securely codes both the shape and texture, ensuring confidentiality through
the use of a private decryption key. In contrast to privacy protection systems that simply scramble or blur
the subject’s visual data, SecST-SPIHT allows complete recovery of the data if the correct decryption
key is provided. This is necessary in applications where, for example, the subject can be deemed to be
unauthorized, and the appropriate authorities must have access to the visual data. Additionally, the SecST-
SPIHT secure coder offers all the features of the ST-SPIHT visual object coder [1], namely efficient and
progressive/embedded parallel coding of the object shape and texture.
The parameter K offers the user control over a variable level of application-dependent security. In effect, increasing K increases the portion of the output bit-stream that is encrypted by performing encryption for a greater number of coding iterations. In practice, K can be chosen to ensure that the number of
encrypted bits is high enough to protect against a brute-force, exhaustive search attack over the encrypted
portion of the bit-stream. The remaining unencrypted portion of the bit-stream cannot be decoded since
the data-dependent execution of the decoder requires complete knowledge of the prior (encrypted) portion
of the bit-stream.
The provided secure coding algorithm operates on individual visual object input frames, but may be
extended for video sequences. Motion compensation may be employed to reduce the size of the shape
and texture coded for subsequent frames. Consequently, for a given K, the number of encrypted bits
for subsequent encrypted object frames would also be very low. However, confidentiality of those object
frames would not be compromised since correct decoding would require decryption of the previous
frames, thus extending the data dependent, partial encryption paradigm into the temporal dimension.
SecST-SPIHT is well suited as a privacy enhancing technology for surveillance-intensive environments.
However, the coder can be employed in any number of applications where the confidentiality and efficient
coding of arbitrarily-shaped visual objects is required.
REFERENCES
[1] K. Martin, R. Lukac, and K. N. Plataniotis, “SPIHT-based coding of the shape and texture of arbitrarily shaped visual objects,” IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 10, pp. 1196–1208, Oct. 2006.
[2] ——, “Efficient encryption of wavelet-based coded color images,” Pattern Recognition, vol. 38, no. 7, pp. 1111–1115, 2005.
[3] S. Tansuriyavong and S. Hanaki, “Privacy protection by concealing person in circumstantial video image,” in Proc. Workshop on Perceptive User Interfaces, vol. 4, 2001, pp. 1–4.
[4] D. Chen, Y. Chang, R. Yan, and J. Yang, “Tools for protecting the privacy of specific individuals in video,” EURASIP Jrnl. on Advances in Sig. Proc., vol. 2007, pp. 1–9, 2007.
[5] E. M. Newton, L. Sweeney, and B. Malin, “Preserving privacy by de-identifying face images,” IEEE Trans. Knowl. Data Eng., vol. 17, no. 2, pp. 232–243, Feb. 2005.
[6] H. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos, “A full-body layered deformable model for automatic model-based gait recognition,” EURASIP Jrnl. on Advances in Sig. Proc., Spec. Issue on Adv. Sig. Proc. and Patt. Recog. Methods for Biometrics, preprint 2008.
[7] I. Martinez-Ponte, X. Desurmont, J. Meesen, and J.-F. Delaigle, “Robust human face hiding ensuring privacy,” in Proc. Int. Workshop on Image Analysis for Multimedia Interactive Services, 2005.
[8] A. Senior, S. Pankanti, A. Hampapur, L. Brown, Y.-L. Tian, A. Ekin, J. Connell, C. F. Shu, and M. Lu, “Enabling video privacy through computer vision,” IEEE Security Privacy, vol. 3, no. 3, pp. 50–57, May–June 2005.
[9] W. Zhang, S. S. Cheung, and M. Chen, “Hiding privacy information in video surveillance system,” in Proc. IEEE Int. Conf. on Image Proc., vol. 3, 2005, pp. 868–871.
[10] F. Dufaux, M. Ouaret, Y. Abdeljaoued, A. Navarro, F. Vergnenegre, and T. Ebrahimi, “Privacy enabling technology for video surveillance,” in Image Processing for Military and Security Applications, S. S. Agaian and S. A. Jassim, Eds., Proc. SPIE 6250, 2006, pp. 1–12.
[11] J. Wen, M. Severa, W. Zeng, M. H. Luttrell, and W. Jin, “A format-compliant configurable encryption framework for access control of video,” IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 6, 2002.
[12] P.-C. Wang and T.-W. Hou, “An AV object oriented encryption algorithm for MPEG-4 streams,” in Proc. Int. Conf. on Multimedia and Expo, Jun. 2004, pp. 971–974.
[13] K. Martin, R. Lukac, and K. N. Plataniotis, “Binary shape mask representation for zerotree-based visual object coding,” in Proc. IEEE Canadian Conference on Electrical and Computer Engineering, May 2004, pp. 2197–2200.
[14] S. Li and W. Li, “Shape-adaptive discrete wavelet transforms for arbitrarily shaped visual object coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 10, pp. 725–743, Aug. 2000.
[15] A. Said and W. A. Pearlman, “A new fast and efficient image codec based on set partitioning in hierarchical trees,” IEEE Trans. Circuits Syst. Video Technol., vol. 6, pp. 243–250, Jun. 1996.
[16] A. A. Kassim and W. S. Lee, “Embedded color image coding using SPIHT with partially linked spatial orientation trees,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 2, pp. 203–206, Feb. 2003.
[17] M. Antonini, M. Barlaud, P. Mathieu, and I. Daubechies, “Image coding using wavelet transform,” IEEE Trans. Image Process., vol. 1, pp. 205–220, Apr. 1992.
[18] R. Calderbank, I. Daubechies, W. Sweldens, and B.-L. Yeo, “Wavelet transforms that map integers to integers,” Appl. Comput. Harmon. Anal., vol. 5, no. 3, pp. 322–369, 1998.
[19] A. J. Menezes, P. C. van Oorschot, and S. A. Vanstone, Handbook of Applied Cryptography. CRC Press, 1996.
[Block diagram: the shape mask s and object image (texture) xT are pre-processed and passed through the SA-DWT to the Secure ST-SPIHT coder (parameters λ, K; secret encryption key kE); the compressed/encrypted bit-stream traverses the channel/storage to the Secure ST-SPIHT decoder (secret decryption key kD), followed by the inverse SA-DWT and post-processing to yield the reconstructed shape mask and texture x̂T.]
Fig. 1. System level diagram of the SecST-SPIHT coding and decoding scheme.
[Bit-stream layout: subset Bn is composed of Bn,LIP, Bn,LIS, and Bn,LSP portions; the LIP portion interleaves Bn,LIP-α, Bn,LIP-sig, and Bn,LIP-sgn bits, while the LIS portion interleaves Bn,LIS-α, Bn,LIS-sig, Bn,LIS-Tsig, and Bn,LIS-sgn bits.]
Fig. 2. Composition of subset Bn of the ST-SPIHT bit-stream for n > λ.
[Block diagram: within the Secure ST-SPIHT coder, the ST-SPIHT coder (inputs: shape s and texture xT; parameters λ, K) and the stream cipher encryption function fE(b, kE) perform combined coding and encryption, with the locations of the bits Bn,LIP-α, Bn,LIP-sig, Bn,LIS-α, and Bn,LIS-sig determining which bits of the compressed/encrypted bit-stream are encrypted.]
Fig. 3. SecST-SPIHT Coder.
[Block diagram: within the Secure ST-SPIHT decoder, the compressed/encrypted bit-stream undergoes combined decryption and decoding, with the stream cipher decryption function fD(b, kD) applied at the locations of the bits Bn,LIP-α, Bn,LIP-sig, Bn,LIS-α, and Bn,LIS-sig, yielding the reconstructed shape ŝ and texture x̂T.]
Fig. 4. SecST-SPIHT Decoder.
TABLE I
PERCENTAGE OF FRAME OCCUPIED BY TEST OBJECTS.
Object Frame Percentage
’surveillance1’ 10.9%
’surveillance2’ 7.6%
’surveillance3’ 25.7%
’akiyo’ 37.2%
’foreman’ 29.4%
(a) original frame (b) segmentation map (c) segmented object
Fig. 5. ‘Surveillance1’ test object.
(a) original frame (b) segmentation map (c) segmented object
Fig. 6. ‘Surveillance2’ test object.
(a) original frame (b) segmentation map (c) segmented object
Fig. 7. ‘Surveillance3’ test object.
(a) original frame (b) segmentation map (c) segmented object
Fig. 8. ‘Akiyo’ test object.
(a) original frame (b) segmentation map (c) segmented object
Fig. 9. ‘Foreman’ test object.
TABLE II
THE NUMBER OF BITS ENCRYPTED FOR THE TEST OBJECTS USING DIFFERENT VALUES OF K AND λ = nmax − 2.
K
Test Object 1 2 3 4
Surveillance1 777 805 4333 4507
Surveillance2 783 819 3239 3428
Surveillance3 734 790 3494 4030
Akiyo 768 901 4086 4934
Foreman 762 874 5381 5763
(a) (b) (c)
(d) (e) (f)
Fig. 10. ‘Surveillance1’ test object/frame decoded and decrypted output (K = 2): (a)/(d) with correct key; (b)/(e) with incorrect
key; (c)/(f) with incorrect key and shape provided externally.
(a) (b) (c)
(d) (e) (f)
Fig. 11. ‘Surveillance2’ test object/frame decoded and decrypted output (K = 2): (a)/(d) with correct key; (b)/(e) with incorrect
key; (c)/(f) with incorrect key and shape provided externally.
(a) (b) (c)
(d) (e) (f)
Fig. 12. ‘Surveillance3’ test object/frame decoded and decrypted output (K = 2): (a)/(d) with correct key; (b)/(e) with incorrect
key; (c)/(f) with incorrect key and shape provided externally.
(a) (b) (c)
(d) (e)
Fig. 13. ‘Akiyo’ test object/frame decoded and decrypted output (K = 2): (a)/(d) with correct key; (b) with incorrect key;
(c)/(e) with incorrect key and shape provided externally.
(a) (b) (c)
(d) (e)
Fig. 14. ‘Foreman’ test object/frame decoded and decrypted output (K = 2): (a)/(d) with correct key; (b) with incorrect key;
(c)/(e) with incorrect key and shape provided externally.
[Four plots of the ratio of encrypted bits to total code bits vs. K = 1, . . . , 4, for the five test objects (surveillance1, surveillance2, surveillance3, akiyo, foreman).]
Fig. 15. The fraction of bits encrypted vs. the security level parameter K (number of encrypted coding iterations) for different λ (shape code levels): (a) shape not coded; (b) shape code completed in first iteration (λ = nmax); (c) shape code completed in second iteration (λ = nmax − 1); (d) shape code completed in third iteration (λ = nmax − 2). The total bits in the code correspond to a bit-rate of 2.4 bits-per-object-pixel.