SUBMITTED TO IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 1
Secure Visual Object Based Coding for Privacy Protected Surveillance
Karl Martin* and Konstantinos N. Plataniotis
Abstract
This paper presents a scheme for secure coding of arbitrarily-shaped visual objects. The scheme can
be employed in a privacy protected surveillance system, whereby visual objects are encrypted so that the
content is only available to certain entities, such as persons of authority, possessing the correct decryption
key. This system may be deployed in sensitive areas requiring surveillance, but where personnel require
privacy for authorized activities within the surveillance area. The encryption can be tied with the identity
of human objects under surveillance so that unauthorized personnel are immediately apparent to human
or computer based monitoring systems. The secure visual object coder employs Shape and Texture Set
Partitioning in Hierarchical Trees (ST-SPIHT) along with partial encryption for efficient, secure storage
and transmission of visual object shape and textures. The encryption is performed in the compressed
domain and does not affect the rate-distortion performance of the coder. A separate parameter for each
domain and does not affect the rate-distortion performanceof the coder. A separate parameter for each
encrypted object controls the strength of the encryption versus required processing overhead.
Index Terms
Shape adaptive coding, security, encryption, surveillance, privacy, privacy protection, visual object
coding, shape and texture coding, wavelet based coding, set partitioning in hierarchical trees (SPIHT).
Corresponding Author: Karl Martin, Multimedia Laboratory, Room BA 4157, The Edward S. Rogers Sr. Department of ECE,
University of Toronto, 10 King’s College Road, Toronto, Ontario, M5S 3G4, Canada, phone: 1 (416) 978 6845, FAX: 1 (416)
978 4425, e-mail: [email protected]
K. Martin and K.N. Plataniotis are with The Edward S. Rogers Sr. Department of ECE, University of Toronto, Multimedia
Laboratory, Room BA 4157, 10 King’s College Road, Toronto, Ontario, M5S 3G4, Canada
*Partially supported by a grant from the Natural Sciences and Engineering Research Council of Canada (NSERC) under the
Network for Effective Collaboration Technologies through Advanced Research (NECTAR) project.
October 12, 2007 DRAFT
I. INTRODUCTION
Video surveillance of both public and private spaces is expanding at an ever-increasing rate. Consequently, individuals are increasingly concerned about the invasiveness of such ubiquitous surveillance and fear that their privacy is at risk. The demands of law enforcement agencies to prevent and prosecute criminal activity, and the need for private organizations to protect against unauthorized activities on their premises, are often seen to be in conflict with the privacy requirements of individuals.
In this paper, we propose a secure visual object based coder, Secure Shape and Texture Set Partitioning in Hierarchical Trees (SecST-SPIHT), in order to address this conflict. The SecST-SPIHT scheme codes the shape and texture of arbitrarily-shaped visual objects in the same fashion as ST-SPIHT [1], and partially encrypts the output bit-stream based on the classification and importance of the bits [2]. The scheme efficiently and effectively secures the entire shape and texture of the object and ensures that the data cannot be accessed without provision of the correct decryption key. At typical output bit-rates and choices of security parameter, the encryption operation is performed on less than 5% of the output code bits.
The SecST-SPIHT secure coder can be employed in surveillance systems where the capture of certain
visual objects may be considered privacy invasive (e.g., face and body images). The decryption key
required to decrypt and decode the visual object shape and texture may be managed such that only
the appropriate authorities are able to access the object data. Furthermore, the key may be tied to the
subject’s identity (e.g., through RFID based tokens), thus giving control of the private content to the
subject. The proposed, computationally simple selective encryption procedure makes the scheme suitable
for real-time applications where significant processing resources are already consumed by coding of
the video stream.
Previous works on the privacy protection of individuals in video surveillance have largely focused on
face and body tracking, but have generally resorted to scrambling, obscuring, or masking the visual data
to protect the identity of the subjects. In [3], the subject’s image is masked, revealing only a silhouette.
However, such a silhouette may not completely obscure the identity of the subject. Furthermore, the
system discards the texture data, making future investigation by authorities impossible. Similarly, in [4],
the focus is on removing appearance information while retaining structural information about the body
in order to assess behavior. However, the removed appearance information is discarded and cannot be
retrieved, making the solution inappropriate in law enforcement and forensic applications.
The approach in [5] is to “de-identify” face images so that facial recognition software cannot be used
to reliably identify the subject, but enough facial features remain so that the image could still be used
for detecting behavior. In this so-called k-Same approach, face images are clustered based on a distance
metric, and the images are replaced by a representative image generated by averaging of components based on
pixels or eigenvectors. This approach, however, does not obscure identifying information that is conveyed
by other parts of the body (e.g., via gait [6]), and again, the original data is discarded and cannot be
retrieved by authorized personnel. In [7], a region of interest (ROI) is defined for face data within a
frame, and the corresponding coefficients downshifted in order to be coded and protected in a separate
quality layer using Motion JPEG 2000. However, the wavelet domain separation of ROI content only
allows for rough separation of content in the spatial domain, thus disallowing the true object vs. background separation that is possible in object-based coding schemes.
The computer vision approach of [8] provides three policy-dependent options for hiding private data:
summarization; transformation (obscuration); and encryption. However, in the case of encrypted output,
traditional encryption is applied to the entire private data stream, which is computationally infeasible in
many digital video surveillance systems. The proposed scheme in [9] embeds the private information of
subjects as an encrypted watermark within the surveillance frames. However, the private data is limited
to rectangular regions of the image frame and, again, traditional encryption is applied to the data. In
[10], a reversible wavelet-domain scrambling is performed on ROI-defined private data, thus allowing
subsequent retrieval of the private data by authorized users. This approach, as in [7], does not allow
explicit spatial domain separation of the object of interest and the background, and the region-of-interest
shape is not secured. Furthermore, the scrambling is performed before compression, resulting in a modest
reduction in coding performance.
The schemes in [11] use efficient encryption or shuffling of variable-length codeword concatenations
to secure MPEG-4 video streams while maintaining format compliance. However, entire frames are
secured, and hence these schemes cannot be used to secure only private data in surveillance applications. Furthermore,
the intended target is entertainment applications, where some image details can be reconstructed through
error concealment techniques. In [12], MPEG-4 video objects are secured through selective encryption
of Object Descriptors (OD). This approach, however, offers very limited security since none of the actual
object content is encrypted.
The proposed SecST-SPIHT secure coder accepts arbitrary shape and texture input, and therefore may
be assisted by the subject detection and tracking systems proposed in other works. However, the efficient
encryption and coding of both the shape and texture information makes it uniquely appropriate for privacy
protection in real-time surveillance applications. The remainder of the paper is organized as follows. In
Section II, the SecST-SPIHT scheme is described in detail and security analysis is provided. In Section
III, experimental results are provided and analyzed for various object inputs and parameters. Finally, the
paper is concluded in Section IV.
II. SECURE SHAPE AND TEXTURE SPIHT CODING SCHEME
The Secure ST-SPIHT (SecST-SPIHT) coding and decoding system is shown in Fig. 1. It is based on the
Shape and Texture Set Partitioning in Hierarchical Trees (ST-SPIHT) scheme for coding arbitrarily-shaped
visual objects [1], with individual bits from the output bit-stream selectively encrypted using a stream
cipher. The selective encryption offers an efficient alternative to complete content encryption, which can
be computationally burdensome in full color image and video applications. The data-dependent decoding
algorithm makes the unencrypted portion of the bit-stream effectively impossible to locate or interpret.
Furthermore, the bits chosen for encryption represent the most significant components of the coded object,
ensuring complete confidentiality of the visual data from those without the correct decryption key. Since
encryption only occurs during the output stage, the shape and texture coding operate in exactly the
same fashion as ST-SPIHT, with identical rate-distortion performance and embedded/progressive output
properties [1]. The system describes secure coding of still visual objects but can easily be extended to
the frames of a video object sequence.
The input consists of two components: i) an M × N full color (texture) image x : Z² → Z³ representing a two-dimensional matrix of three-component RGB color samples x(i, j) = [x(i, j)1, x(i, j)2, x(i, j)3], with i = 0, 1, . . . , M − 1 and j = 0, 1, . . . , N − 1 denoting the spatial position of the pixel, and x(i, j)k denoting the component in the red (k = 1), green (k = 2), or blue (k = 3) color channel; and ii) an M × N binary (shape mask) image s : Z² → {0, 1} representing a two-dimensional matrix of binary values where s(i, j) = 1 denotes spatial positions ‘inside’ the object, and s(i, j) = 0 denotes spatial positions ‘outside’ the object. The object is preprocessed by first converting the texture to the YCbCr color space. Subsequently, texture positions outside the object are set to zero, such that x(i, j) = [0, 0, 0], ∀ (i, j) where s(i, j) = 0.
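As a concrete illustration, the preprocessing step above can be sketched as follows. The numpy-based helper and the BT.601-style YCbCr weights are our own assumptions for illustration; they are not prescribed by the coder itself.

```python
import numpy as np

def preprocess_object(x_rgb, s):
    """Convert an RGB texture to YCbCr and zero out samples outside the
    binary shape mask s (illustrative sketch; BT.601 weights assumed)."""
    r, g, b = x_rgb[..., 0], x_rgb[..., 1], x_rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b     # luma
    cb = 0.564 * (b - y)                      # blue-difference chroma
    cr = 0.713 * (r - y)                      # red-difference chroma
    ycbcr = np.stack([y, cb, cr], axis=-1)
    ycbcr[s == 0] = 0.0    # x(i, j) = [0, 0, 0] wherever s(i, j) = 0
    return ycbcr
```

For a uniform gray pixel the chroma channels are zero, so masking and conversion can be checked directly on small arrays.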
Each color channel of the texture is subsequently transformed using an in-place lifting shape-adaptive discrete wavelet transform (SA-DWT) with global subsampling [1], [13], creating the M × N vectorial field xT : Z² → Z³ of transform coefficients xT(i, j) = [xT(i, j)1, xT(i, j)2, xT(i, j)3]. This is a modification of the SA-DWT described in [14], allowing the spatial domain shape mask s to remain unmanipulated and coded directly.
A. SecST-SPIHT Coder and Decoder
The SecST-SPIHT coder, shown in Fig. 3, is identical to ST-SPIHT except that the output bit-stream is selectively encrypted using a stream cipher fE(b, kE), applied to individual bits b using the private key kE. The ST-SPIHT algorithm is employed to code the input shape and texture as well as to instruct the stream cipher which bits require encryption. The details of the ST-SPIHT coding algorithm will only be summarized here; full details and analysis can be found in [1].
1) ST-SPIHT: The texture coding in ST-SPIHT follows a natural extension of SPIHT with the spatial
orientation trees (SOT) defined as in [15], with the modification for color images proposed in [16]. The
SOTs are first formed using all coordinates inside the bounding box of size M × N; the binary shape mask s is used to describe which nodes are inside the object and which are outside.
We define G = {(i, j) | s(i, j) = 1} as the set of all coordinates inside the object, and Ḡ = {(i, j) | s(i, j) = 0} as the complementary set containing all coordinates outside the object; i.e., G ∪ Ḡ = {(i, j) | i = 0, 1, . . . , M − 1, j = 0, 1, . . . , N − 1} and |G| + |Ḡ| = MN. All the definitions from the standard SPIHT algorithm described in [15] remain in use with the addition of the color component index k. Briefly, the list of insignificant pixels (LIP), list of significant pixels (LSP), and list of insignificant sets (LIS) store different coefficient and tree root coordinates. A “type-A” entry in the LIS refers to D(i, j)k, all the descendants of (i, j)k; a “type-B” entry refers to L(i, j)k = D(i, j)k − O(i, j)k, where O(i, j)k are the direct offspring of location (i, j)k. H denotes the set of all luminance LL subband coefficient coordinates and Sn(·) refers to the significance test at bit-plane n, as defined in [15].
Unique to the ST-SPIHT algorithm are a series of three “α-test” functions. The “α pixel test” function, αp(·, ·), identifies whether a coordinate is inside or outside the shape and is defined as follows:

αp(i, j) =
  1, if (i, j) ∈ G
  0, otherwise.  (1)

The “α set-discard test” function, αSD(·), identifies sets of coefficients that are entirely outside the object:

αSD(T) =
  0, if T ⊆ Ḡ
  1, otherwise,  (2)

where T represents a given set of coefficients. And finally, the “α set-retain test” function, αSR(·), identifies sets of coefficients that are entirely inside the object:

αSR(T) =
  1, if T ⊆ G
  0, otherwise.  (3)
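A minimal sketch of the three α-tests, assuming the shape mask s is indexable as s[i][j] and a set T is a list of (i, j) coordinate pairs; the function names are ours:

```python
def alpha_p(s, i, j):
    """alpha pixel test (Eq. 1): 1 if (i, j) is inside the object."""
    return 1 if s[i][j] == 1 else 0

def alpha_sd(s, T):
    """alpha set-discard test (Eq. 2): 0 if every coordinate in T lies
    outside the object (the whole set can be discarded), else 1."""
    return 0 if all(s[i][j] == 0 for (i, j) in T) else 1

def alpha_sr(s, T):
    """alpha set-retain test (Eq. 3): 1 if every coordinate in T lies
    inside the object, else 0."""
    return 1 if all(s[i][j] == 1 for (i, j) in T) else 0
```

A set that straddles the object boundary yields αSD = 1 and αSR = 0, which is the case that triggers further subdivision in the coder.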
The ST-SPIHT coding routine requires the shape code level parameter, λ, to be input. This defines the quantization level at which the routine forces the coding of not-yet-coded shape mask pixels s(i, j).
This is done by applying the subroutine “Shape Code Set” (SCS) to the appropriate trees. The complete
algorithm codes the shape and texture information in parallel, producing an embedded bit-stream that
can be decoded to produce progressive shape and texture reconstruction. By lowering λ, the shape code becomes further dispersed in the output bit-stream, delaying the point at which the shape can be completely, losslessly decoded. At very low output bit-rates, lowering λ allows greater emphasis to be
placed on the texture, providing the trade-off of lossy shape reconstruction [1]. The decoder follows the
same data-dependent execution path as the coder based on interpretation of the output bit-stream.
2) Selective Encryption for ST-SPIHT: The SecST-SPIHT selective encryption algorithm is based on the scheme proposed in [2] for regular SPIHT. We denote the ST-SPIHT bit-stream as the ordered set of bits B. The bit-stream can be divided into the ordered subsets B = {Bnmax, Bnmax−1, Bnmax−2, . . .}, where Bn is the set of bits obtained during the coding iteration for bit-plane n (i.e., representing the value 2^n), and nmax is the highest bit-plane at which coding is initiated. Each Bn can be further subdivided into Bn = {Bn,LIP, Bn,LIS, Bn,LSP}, where Bn,LIP denotes the ordered set of bits obtained during the first phase of the sorting pass, where coefficients in the LIP are tested for significance; Bn,LIS denotes the ordered set of bits obtained during the second phase of the sorting pass, where entire trees are tested for significance; and Bn,LSP denotes the ordered set of bits obtained during the refinement pass.

Each set of bits Bn,LIP is composed of α-test bits (Bn,LIP−α), significance bits (Bn,LIP−sig) and sign bits (Bn,LIP−sgn). Similarly, each set of bits Bn,LIS is composed of significance bits (Bn,LIS−sig) and sign bits (Bn,LIS−sgn) for individual coefficients, significance bits for trees (Bn,LIS−Tsig), and α-test bits for both individual coefficients and trees (Bn,LIS−α). This decomposition of the bit-stream is shown in Fig. 2.
The SecST-SPIHT encryption scheme uses an encryption function fE(b, kE) to encrypt only the bits b ∈ {Bn,LIP−α, Bn,LIP−sig, Bn,LIS−α, Bn,LIS−sig}, for n = nmax, nmax − 1, . . . , nmax − K + 1. The key kE enforces the confidentiality of the data by preventing entities without the correct matching decryption key, kD, from correctly decrypting the data. The parameter K is controlled by the user at the time of encryption/encoding to determine the number of coding iterations to be encrypted. Increasing K results in more bits being encrypted and greater security, with the trade-off of greater computational overhead. The specific bits are selectively chosen since they represent the object shape information and the significance information of individual coefficients. The coefficient sign bits (Bn,LIP−sgn and Bn,LIS−sgn) remain unencrypted since their values do not affect the coder/decoder execution path. Similarly, the
significance bits relating to entire trees (Bn,LIS−Tsig) remain unencrypted since they do not affect specific
coefficient reconstruction values.
The encryption functionfE(b, kE) must be implemented using a stream cipher since the decoder (Fig.
4) must decode individual bits and instruct the decryption function fD(b, kD) whether each subsequent bit
requires decryption or not; the use of a block cipher would prevent the decoder from correctly determining
which bits in the output bit-stream are part of the cipher block. However, the system is flexible in that
any bit-level stream cipher may be used, employing either private keys or public-private key pairs.
The complete description of the SecST-SPIHT routine and the Secure SCS subroutine (SecSCS) follows. For ease of notation, we introduce the controlled encryption function fE(b, kE, n, K) defined as follows:

fE(b, kE, n, K) =
  fE(b, kE), if n > nmax − K
  b, otherwise.  (4)

Hence, the encryption function is only activated for the first K iterations of the coding algorithm, after which the input bits are passed through, unencrypted.
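The controlled encryption function of Eq. (4) can be sketched as follows; the keystream iterator stands in for the stream cipher output derived from kE, and the factory-style wrapper is our own framing, not part of the paper's specification:

```python
def make_controlled_encryptor(keystream, n_max, K):
    """Return f_E(b, n) per Eq. (4): XOR bit b with the next keystream
    bit while n > n_max - K; pass b through unchanged afterwards.
    `keystream` is any iterator of key bits (illustrative stand-in
    for the stream cipher keyed by k_E)."""
    def f_E(b, n):
        if n > n_max - K:
            return b ^ next(keystream)  # stream-cipher encryption
        return b                        # later iterations: plaintext
    return f_E
```

Note that keystream bits are consumed only for encrypted iterations, so coder and decoder stay synchronized on the cipher state.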
SecST-SPIHT Coder:
Input: xT, s, λ, K, kE
1. Initialization: Find the initial quantization level n = nmax = ⌊log2(max(i,j,k){|xT(i, j)k|})⌋; set LSP = ∅; set LIP = H; set LIS = {(i, j)k “type-A” | (i, j)k ∈ H, D(i, j)k ≠ ∅}.
2. Sorting pass:
2.1. For each (i, j)k ∈ LIP:
2.1.1. If αp(i, j) not coded yet then output fE(αp(i, j), kE, n, K);
2.1.2. If αp(i, j) = 1 then:
• Output fE(Sn(i, j)k, kE, n, K);
• If Sn(i, j)k = 1 then move (i, j)k to the LSP and output the sign of xT(i, j)k;
2.1.3. If αp(i, j) = 0 then remove (i, j)k from the LIP;
2.2. For each entry (i, j)k ∈ LIS:
[If “type-A” entry, T = D(i, j)k; if “type-B” entry, T = L(i, j)k]
2.2.1. If n ≥ λ and shape not completely coded, then:
• If αSD(T) not coded yet then output fE(αSD(T), kE, n, K);
• If αSD(T) = 0 then remove (i, j)k from the LIS and move on to the next entry in the LIS (go to Step 2.2);
• If αSD(T) = 1 then:
– If αSR(T) not coded yet then output fE(αSR(T), kE, n, K);
– If αSR(T) = 0 and n = λ then run SecSCS(T);
2.2.2. If shape completely coded and αSD(T) = 0 then remove (i, j)k from the LIS and move on to the next entry in the LIS (go to Step 2.2);
2.2.3. If “type-A” entry and αSD(T) = 1:
• Output Sn(D(i, j)k);
• If Sn(D(i, j)k) = 1 then:
– For each (p, q)r ∈ O(i, j)k:
∗ Output fE(Sn(p, q)r, kE, n, K);
∗ If Sn(p, q)r = 1 then add (p, q)r to the LSP and output the sign of xT(p, q)r;
∗ If Sn(p, q)r = 0 and αp(p, q) not coded yet, then output fE(αp(p, q), kE, n, K);
∗ If αp(p, q) = 1 then add (p, q)r to the LIP;
– If L(i, j)k ≠ ∅ then move (i, j)k to the end of the LIS as a “type-B” entry; else, remove (i, j)k from the LIS;
2.2.4. If “type-B” entry and αSD(T) = 1:
• Output Sn(L(i, j)k);
• If Sn(L(i, j)k) = 1 then:
– Add each (p, q)r ∈ O(i, j)k to the end of the LIS as a “type-A” entry;
– Remove (i, j)k from the LIS.
3. Refinement pass: For each (i, j)k ∈ LSP, except those found significant in the current sorting pass, output the nth most significant bit of |xT(i, j)k|;
4. Quantization-step update: Decrement n by 1 and go to Step 2.
Secure Shape Code Set (SecSCS) Subroutine:
Input: set T with root (i, j)k, n, kE, K
1. If (i, j)k is a “type-A” entry:
1.1. For each (p, q)r ∈ O(i, j)k:
1.1.1. If αp(p, q) not coded yet then output fE(αp(p, q), kE, n, K);
1.1.2. If D(p, q)r ≠ ∅ then:
• If αSD(D(p, q)r) not coded yet then output fE(αSD(D(p, q)r), kE, n, K);
• If αSD(D(p, q)r) = 0 then terminate processing of D(p, q)r;
• If αSD(D(p, q)r) = 1 then:
– If αSR(D(p, q)r) not coded yet then output fE(αSR(D(p, q)r), kE, n, K);
– If αSR(D(p, q)r) = 0 then go to Step 1 treating D(p, q)r as a new “type-A” input;
2. If (i, j)k is a “type-B” entry:
2.1. For each (p, q)r ∈ O(i, j)k, go to Step 1 treating D(p, q)r as a new “type-A” input;
The coding operation is typically terminated when a specified rate or distortion criterion is met. While
SecST-SPIHT allows for coding to be terminated before the shape has been losslessly coded, typical rate
criteria and values of λ will result in complete lossless coding of the shape. Also, the coder may be
instructed not to code the shape in situations where, for example, the shape is implicitly available via
the shape of another object which surrounds the object to be coded (e.g., a background object).
The SecST-SPIHT decoder follows exactly the same execution path as the coder and only requires basic initialization information (i.e., M, N, |G|, nmax, λ, the number of wavelet transform levels, and s if the shape was not coded) to interpret the output bit-stream. Provided with the correct decryption key, kD, the decoder decodes the bit-stream and instructs the decryption function fD(b, kD) as to whether each subsequent bit should be decrypted or passed through, unencrypted. Since the first bit is always in Bnmax,LIP−α (generated from the first iteration of Step 2.1.1), it must always be decrypted.
It should be noted that SecST-SPIHT is backward compatible such that, when the input shape s fills the entire M × N rectangular bounding box, the coding operation is identical to traditional SPIHT [15] and the selective encryption algorithm operates the same as in [2].
B. Security Analysis of SecST-SPIHT
The SecST-SPIHT selective encryption ensures the confidentiality of the coded visual object data in two ways: i) securing the most significant portion of the bit-stream using a secret cryptographic key kE and a stream cipher; and ii) making the unencrypted portion of the bit-stream impossible to decode, since its location and the state of the decoder cannot be determined without correct decryption and decoding of the encrypted portion.
As noted in the previous section, encryption is performed on the bits b ∈ {Bn,LIP−α, Bn,LIP−sig, Bn,LIS−α, Bn,LIS−sig}, for n = nmax, nmax − 1, . . . , nmax − K + 1. This represents a partial bit-plane and shape encryption performed on the visual object in the SA-DWT domain, with the choice of K determining how many bit-planes are encrypted. Specifically, with K = 1, only the most significant bit-plane is encrypted, for the coefficients |xT(i, j)k| ≥ 2^nmax; for K = 2, only the top two most significant bit-planes are encrypted, for the coefficients |xT(i, j)k| ≥ 2^(nmax−1), and so on. In other words, the top K bit-planes are encrypted for all coefficients that are found significant in the first K iterations of the coding algorithm. Additionally, the output of each α-test is encrypted, effectively encrypting the entire shape code during the first K iterations. If K > nmax − λ, then the complete, lossless shape code is encrypted. The choice of K should be made to ensure that the number of bits finally encrypted is sufficient to make it computationally infeasible to perform a brute-force, exhaustive search attack over all possible sequences.
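The relationship between K, nmax, and the coefficients whose top bit-planes fall in the encrypted iterations can be illustrated with a small helper (our own construction, for illustration only):

```python
import math

def significant_in_first_K(coeffs, K):
    """Return n_max and the coefficients whose magnitudes reach the
    significance threshold within the first K coding iterations,
    i.e. |c| >= 2**(n_max - K + 1). These are the coefficients whose
    top bit-planes are encrypted."""
    n_max = int(math.floor(math.log2(max(abs(c) for c in coeffs))))
    threshold = 2 ** (n_max - K + 1)
    return n_max, [c for c in coeffs if abs(c) >= threshold]
```

For example, with coefficient magnitudes {100, 40, 7, 90}, nmax = 6; K = 1 touches only those at or above 64, while K = 2 lowers the threshold to 32.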
As with SPIHT and ST-SPIHT, the SecST-SPIHT coder and decoder follow a data-dependent execution
path. This means that the correct interpretation of a given bit in the output bit-stream requires complete
knowledge of all previous significance test and α-test bits. The result is that an attacker cannot in fact locate the bits in the output bit-stream which are not encrypted. To demonstrate the difficulty encountered by a cryptanalyst attempting to determine which bits are unencrypted, we use bʲn,LIP to denote the jth bit in the set Bn,LIP, for j = 0, 1, 2, . . . , Nn,LIP − 1, where Nn,LIP is the total number of bits in Bn,LIP. According to the SecST-SPIHT coder definition, considering the initial coding iterations in which n ≥ λ (i.e., the shape is still being coded), it is known a priori that the first bit is an α-test bit:

b⁰n,LIP ∈ Bn,LIP−α.  (5)
However, classification of the second bit depends on the first bit:

b¹n,LIP ∈
  Bn,LIP−sig, if b⁰n,LIP = 1
  Bn,LIP−α, otherwise.  (6)

And, consequently, classification of the third bit depends on the first and second bits:

b²n,LIP ∈
  Bn,LIP−sig, if b⁰n,LIP = 0 and b¹n,LIP = 1
  Bn,LIP−sgn, if b⁰n,LIP = 1 and b¹n,LIP = 1
  Bn,LIP−α, otherwise.  (7)

This can be generalized as follows:

bʲn,LIP ∈
  Bn,LIP−sig, if bʲ⁻¹n,LIP ∈ Bn,LIP−α and bʲ⁻¹n,LIP = 1
  Bn,LIP−sgn, if bʲ⁻¹n,LIP ∈ Bn,LIP−sig and bʲ⁻¹n,LIP = 1
  Bn,LIP−α, otherwise,
for 1 ≤ j < Nn,LIP.  (8)
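The chained classification of Eq. (8) can be sketched as a small state machine; this makes explicit why bit j cannot be classified without knowing all of its predecessors (function and class names are ours):

```python
def classify_lip_bits(bits):
    """Classify each bit of B_{n,LIP} per Eqs. (5)-(8): the class of
    bit j depends on the class and value of bit j-1, so classification
    must proceed sequentially from bit 0."""
    classes = []
    prev_class, prev_bit = None, None
    for j, b in enumerate(bits):
        if j == 0:
            c = "alpha"                               # Eq. (5)
        elif prev_class == "alpha" and prev_bit == 1:
            c = "sig"                                 # significance bit
        elif prev_class == "sig" and prev_bit == 1:
            c = "sgn"                                 # sign bit
        else:
            c = "alpha"                               # next alpha-test bit
        classes.append(c)
        prev_class, prev_bit = c, b
    return classes
```

Flipping any single early bit changes the classification of everything after it, which is exactly the property that hides the unencrypted bits from an attacker.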
From (8), it is evident that the bits Bn,LIP can in fact be treated as the ordered set of coded transition instructions in a Markov chain. The classification of bʲ⁻¹n,LIP, indicating the (j−1)th state in the chain, must be known along with the value bʲ⁻¹n,LIP (the transition instruction) in order to determine the classification of bʲn,LIP (the jth state in the chain). Since the value of bʲn,LIP indicates only the transition and not the state itself, it is clear that all previous bits bˡn,LIP, 0 ≤ l < j, must be known in order to classify bʲn,LIP and determine whether it is unencrypted. Similar arguments can be made for Bn,LIS. Hence, without the correct decryption key, not only do the encrypted bits remain confidential, but the locations of the unencrypted bits cannot be determined and are thus also confidential.
In attempting to attack the encrypted portion of the bit-stream, the attacker may recreate the Markov chain and perform statistical analyses so that the original bits could be predicted with probability p > 0.5 from previous bits, thus aiding an exhaustive search attack. While recreating such an attack is beyond the scope of this paper, the efficiency of the coding algorithm [1], [15] implies that the entropy of each
bit H(b) ≈ 1 and thus p ≈ 0.5, regardless of the additional contextual information. However, if a more conservative estimate of H(b) < 1 is made, then K can simply be increased to increase the number of encrypted bits in order to ensure that an exhaustive search remains computationally infeasible.
Alternatively, an attacker may attempt to locate the unencrypted portion of the bit-stream Bu = {Bn | n ≤ nmax − K}, since it is known that all bits bʲu ∈ Bu are unencrypted and may reveal important image features if correctly decoded. If we denote the set of encrypted bits as Be = {Bn,LIP−α, Bn,LIP−sig, Bn,LIS−α, Bn,LIS−sig | nmax − K < n ≤ nmax}, and the total number of bits in the first K coding iterations (both encrypted and unencrypted) as NK, an attack on Bu may be attractive if H(Be) > H(NK). In other words, if determining the location of Bu (which starts at bit NK + 1 within the overall bit-stream B) is computationally simpler than an exhaustive search over the encrypted bits Be, the attacker may view this approach as offering greater probability of success in revealing image details. However, even with knowledge of Bu, the state of the LSP, LIP, and LIS lists and the shape decoding remain unknown without correct decryption and decoding of Be. This means that, while the initial bits in Bu may be correctly classified by the attacker, it cannot be determined which coordinates within the SA-DWT description of the object the coded bits correspond to. Ultimately, the attacker will not be able to determine any image details from Bu without correct decryption and decoding of Be.
In summary, the SecST-SPIHT secure coder achieves confidentiality by encrypting the most significant
portion of the bit-stream as well as obfuscating the unencrypted portion. Choice of the parameter K
provides control of the number of coding iterations which are encrypted. This allows flexibility to meet
the security requirements of the application at hand.
III. EXPERIMENTAL RESULTS
The SecST-SPIHT secure coder was tested with a variety of input visual objects, as shown in Figs. 5 to
9. The ‘surveillance1’, ‘surveillance2’, and ‘surveillance3’ objects were extracted from actual surveillance
video frames using motion-based segmentation, whereas ‘akiyo’ and ‘foreman’ are the standard MPEG
test objects. The coder accepts an arbitrary binary segmentation map so that any segmentation algorithm
can be employed, depending on the requirements of the application. All frames are in 8-bit per channel
RGB CIF format (352 × 288), with Table I showing the percentage of the frame that the object occupies.
The rate-distortion performance is identical to ST-SPIHT, which is examined in detail in [1], and will
not be covered here. In all test cases here, the SecST-SPIHT coder utilized the CDF 9/7 biorthogonal
wavelet filters [17] with a 4-level transform, and an output code bit-rate of 2.4 bits-per-object-pixel
(including the shape code, where applicable). Since the progressive/embedded output property of ST-SPIHT is maintained, the output code may be arbitrarily truncated to achieve a lower bit-rate with the sacrifice of greater texture distortion.1 If lossless coding of the texture is required, integer-to-integer wavelet filters [18] and color transforms can be utilized and the coder instructed to code all of the transform domain bit-planes [1]. For simulation purposes, a Vernam cipher was employed as the stream cipher [19], using a 128-bit randomly generated key. However, any bit-level stream cipher that is sufficiently secure for the application at hand can be utilized.
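A bit-level XOR (Vernam-style) stream cipher can be sketched as follows; the hash-counter keystream below is purely our own illustration and is not a cryptographically vetted construction, so a deployment should use an established stream cipher:

```python
import hashlib
import itertools

def keystream_bits(key):
    """Illustrative keystream generator: expand the key by hashing a
    running counter (an assumption for this sketch, not the cipher
    used in the paper's simulations)."""
    for counter in itertools.count():
        block = hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        for byte in block:
            for shift in range(8):
                yield (byte >> shift) & 1

def vernam(bits, key):
    """XOR each bit with the keystream; encryption and decryption
    are the same operation."""
    return [b ^ k for b, k in zip(bits, keystream_bits(key))]
```

Because XOR is its own inverse, applying the function twice with the same key recovers the plaintext bits, matching the symmetric encrypt/decrypt use in the coder and decoder.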
Figs. 10 to 14 show sample output using the test objects. In all cases, encryption is performed during the first two coding iterations (K = 2). In the cases where the shape is coded and encrypted with the object texture, the shape code is completed in the third iteration (λ = nmax − 2). Figs. 10 to 12 show the decrypted/decoded output ‘surveillance’ objects/frames when: (a)/(d) the correct decryption key is provided; (b)/(e) the incorrect decryption key is provided; and (c)/(f) the incorrect decryption key is provided, but the shape is available externally and only the texture is coded. In all cases where the incorrect key is provided, the textural content is completely obscured; no object details can be seen. For the case (b)/(e) where the shape is coded and encrypted with the texture, the shape is also completely obscured. In order to reconstruct the frame without revealing the object shape mask, the background is transmitted as a full frame, with the missing texture information behind the object filled in using prior frames.
Similarly, the decrypted/deoded test objects/frames ’akiyo’ and ’foreman’ are shown in Figs. 13 and 14,
respectively with: (a)/(d) the correct decryption key provided; (b) the incorrect decryption key provided;
and (c)/(e) the incorrect decryption key provided, but the shape is available externally and only the
texture is coded. In the cases where the shape is coded and encrypted with the object and the incorrect decryption key is provided (Figs. 13(b) and 14(b)), the full-frame background is not transmitted since the prior frames in the sequence do not offer enough information to in-fill the original object area.
Fig. 15 shows the fraction of the output code bits which are encrypted vs. the number of coding
iterations during which encryption is performed (K). The total number of output code bits corresponds
to a bit-rate of 2.4 bits-per-object-pixel (including the shape code, where applicable). Fig. 15(a) shows
the case where the shape is not coded; Figs. 15(b) to 15(d) show the cases where the shape code is
completed during the first, second, and third coding iteration (λ = nmax, nmax − 1, and nmax − 2),
respectively. In Fig. 15(a), the effect of varying K can clearly be seen, with the fraction of the output code being encrypted rising with K. The fraction remains small for all considered K = 1, · · · , 4, ranging
¹At most bit-rates and choices of λ, the shape will be coded losslessly.
from approximately 0.2% to 1.6%. In Figs. 15(b) to 15(d), a large jump in the portion of the bit-stream that is encrypted is observed once K is set high enough to ensure that the shape is completely encrypted (K = nmax − λ + 1). When K is raised above this point, the effect is more subtle since at low output bit-rates the shape code represents a significant portion of the bit-stream. With K > nmax − λ, the actual percentage of the output code that is encrypted is largely controlled by the portion which is the shape code. If the user wishes to keep the level of encryption to a minimum for the purpose of computational efficiency, λ should be set low enough to disperse the shape code further into the bit-stream, with K ≤ nmax − λ so that only the initial portion of the shape code is encrypted. In this case, λ should be chosen so that K can still be set high enough to encrypt enough bits to achieve the minimum desired level of security; for example, as in Figs. 10 to 14, setting K = 2 and λ = nmax − 2 (i.e., shape code completed in the third coding iteration). The drawback of this approach is that the shape cannot be completely, losslessly decoded until later in the output bit-stream, possibly resulting in lossy shape reconstruction in low bit-rate scenarios.
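The parameter selection described above can be sketched as a small calculation. The helper names below are illustrative, and the per-iteration bit counts are derived from the cumulative ’surveillance1’ column of Table II (777, 805, 4333, 4507), purely for demonstration; the total-bit figure is a hypothetical stand-in for a 2.4 bits-per-object-pixel budget.

```python
def encrypted_fraction(iter_bits, total_bits, K):
    """Fraction of the output code encrypted when the first K coding
    iterations pass their selected bits through the stream cipher."""
    return sum(iter_bits[:K]) / total_bits

def choose_K(iter_bits, min_encrypted_bits):
    """Smallest K whose cumulative encrypted-bit count meets a security
    floor (enough bits to defeat an exhaustive search over them)."""
    cum = 0
    for K, bits in enumerate(iter_bits, start=1):
        cum += bits
        if cum >= min_encrypted_bits:
            return K
    raise ValueError("even the maximum K cannot reach the requested floor")

# Per-iteration increments reconstructed from cumulative Table II values:
iter_bits = [777, 28, 3528, 174]
print(choose_K(iter_bits, 128))                    # K = 1 already encrypts 777 bits
print(encrypted_fraction(iter_bits, 100_000, 2))   # small fraction of the code
```

This mirrors the trade-off in the text: a small K keeps the encrypted fraction low, while the security floor bounds how small K may be for a given λ.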
Table II shows the number of bits encrypted for λ = nmax − 2 and different K. As in Fig. 15(d), there is a jump at the iteration at which the remaining shape code is generated and encrypted (K = 3). With this choice of λ, K = 2 can be chosen since the number of bits encrypted is large enough to prevent a brute-force, exhaustive search attack over the encrypted bits, but still represents minimal processing overhead, with less than 5% of the output bit-stream encrypted at a bit-rate of 2.4 bits-per-object-pixel.
It should be noted that the property of SecST-SPIHT to disperse the shape code within the texture code
is inherited from ST-SPIHT. With the execution path of the texture decoding dependent on the shape
code, the two portions of the code cannot be separated without correct decryption of all encrypted bits.
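The brute-force argument above can be made concrete with a small sketch. The function name is hypothetical; it simply captures the observation that an attacker can exhaustively guess either the cipher key or the encrypted bits directly, so the effective work factor is the smaller of the two search spaces.

```python
def effective_security_bits(encrypted_bits: int, key_bits: int = 128) -> int:
    """Effective exhaustive-search cost in bits: the attacker takes the
    cheaper of guessing the key (2^key_bits trials) or guessing the
    encrypted bits themselves (2^encrypted_bits trials)."""
    return min(encrypted_bits, key_bits)

# With K = 2 and λ = nmax − 2, Table II reports roughly 800 encrypted
# bits per object, so guessing the 128-bit key remains the cheaper attack.
print(effective_security_bits(805))   # 128
print(effective_security_bits(64))    # an under-encrypted stream: only 64
```

Under this model, K need only be large enough that the encrypted-bit count exceeds the key length; beyond that, additional encryption adds processing overhead without improving the exhaustive-search bound.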
IV. CONCLUSIONS
The SecST-SPIHT secure visual object coder was presented, offering an efficient solution for privacy
protection of subjects in digital video surveillance systems. Provided with segmented, arbitrarily-shaped
visual objects, SecST-SPIHT securely codes both the shape and texture, ensuring confidentiality through
the use of a private decryption key. In contrast to privacy protection systems that simply scramble or blur
the subject’s visual data, SecST-SPIHT allows complete recovery of the data if the correct decryption
key is provided. This is necessary in applications where, for example, the subject can be deemed to be
unauthorized, and the appropriate authorities must have access to the visual data. Additionally, the SecST-
SPIHT secure coder offers all the features of the ST-SPIHT visual object coder [1], namely efficient and
progressive/embedded parallel coding of the object shape and texture.
The parameter K offers the user control over a variable level of application-dependent security. In effect, increasing K increases the portion of the output bit-stream that is encrypted by performing encryption for a greater number of coding iterations. In practice, K can be chosen to ensure that the number of
encrypted bits is high enough to protect against a brute-force, exhaustive search attack over the encrypted
portion of the bit-stream. The remaining unencrypted portion of the bit-stream cannot be decoded since
the data-dependent execution of the decoder requires complete knowledge of the prior (encrypted) portion
of the bit-stream.
The provided secure coding algorithm operates on individual visual object input frames, but may be
extended for video sequences. Motion compensation may be employed to reduce the size of the shape
and texture coded for subsequent frames. Consequently, for a given K, the number of encrypted bits
for subsequent encrypted object frames would also be very low. However, confidentiality of those object
frames would not be compromised since correct decoding would require decryption of the previous
frames, thus extending the data dependent, partial encryption paradigm into the temporal dimension.
SecST-SPIHT is well suited as a privacy enhancing technology for surveillance-intensive environments.
However, the coder can be employed in any number of applications where the confidentiality and efficient
coding of arbitrarily-shaped visual objects is required.
REFERENCES
[1] K. Martin, R. Lukac, and K. N. Plataniotis, “SPIHT-based coding of the shape and texture of arbitrarily shaped visual objects,” IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 10, pp. 1196–1208, Oct. 2006.
[2] ——, “Efficient encryption of wavelet-based coded color images,” Pattern Recognition, vol. 38, no. 7, pp. 1111–1115, 2005.
[3] S. Tansuriyavong and S. Hanaki, “Privacy protection by concealing person in circumstantial video image,” in Proc. Workshop on Perceptive User Interfaces, vol. 4, 2001, pp. 1–4.
[4] D. Chen, Y. Chang, R. Yan, and J. Yang, “Tools for protecting the privacy of specific individuals in video,” EURASIP Jrnl. on Advances in Sig. Proc., vol. 2007, pp. 1–9, 2007.
[5] E. M. Newton, L. Sweeney, and B. Malin, “Preserving privacy by de-identifying face images,” IEEE Trans. Knowl. Data Eng., vol. 17, no. 2, pp. 232–243, Feb. 2005.
[6] H. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos, “A full-body layered deformable model for automatic model-based gait recognition,” EURASIP Jrnl. on Advances in Sig. Proc., Spec. Issue on Adv. Sig. Proc. and Patt. Recog. Methods for Biometrics, preprint 2008.
[7] I. Martinez-Ponte, X. Desurmont, J. Meesen, and J.-F. Delaigle, “Robust human face hiding ensuring privacy,” in Proc. Int. Workshop on Image Analysis for Multimedia Interactive Services, 2005.
[8] A. Senior, S. Pankanti, A. Hampapur, L. Brown, Y.-L. Tian, A. Ekin, J. Connell, C. F. Shu, and M. Lu, “Enabling video privacy through computer vision,” IEEE Security Privacy, vol. 3, no. 3, pp. 50–57, May–June 2005.
[9] W. Zhang, S. S. Cheung, and M. Chen, “Hiding privacy information in video surveillance system,” in Proc. IEEE Int. Conf. on Image Proc., vol. 3, 2005, pp. 868–871.
[10] F. Dufaux, M. Ouaret, Y. Abdeljaoued, A. Navarro, F. Vergnenegre, and T. Ebrahimi, “Privacy enabling technology for video surveillance,” in Image Processing for Military and Security Applications, S. S. Agaian and S. A. Jassim, Eds., Proc. SPIE 6250, 2006, pp. 1–12.
[11] J. Wen, M. Severa, W. Zeng, M. H. Luttrell, and W. Jin, “A format-compliant configurable encryption framework for access control of video,” IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 6, 2002.
[12] P.-C. Wang and T.-W. Hou, “An AV object oriented encryption algorithm for MPEG-4 streams,” in Proc. Int. Conf. on Multimedia and Expo, Jun. 2004, pp. 971–974.
[13] K. Martin, R. Lukac, and K. N. Plataniotis, “Binary shape mask representation for zerotree-based visual object coding,” in Proc. IEEE Canadian Conference on Electrical and Computer Engineering, May 2004, pp. 2197–2200.
[14] S. Li and W. Li, “Shape-adaptive discrete wavelet transforms for arbitrarily shaped visual object coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 10, pp. 725–743, Aug. 2000.
[15] A. Said and W. A. Pearlman, “A new fast and efficient image codec based on set partitioning in hierarchical trees,” IEEE Trans. Circuits Syst. Video Technol., vol. 6, pp. 243–250, Jun. 1996.
[16] A. A. Kassim and W. S. Lee, “Embedded color image coding using SPIHT with partially linked spatial orientation trees,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 2, pp. 203–206, Feb. 2003.
[17] M. Antonini, M. Barlaud, P. Mathieu, and I. Daubechies, “Image coding using wavelet transform,” IEEE Trans. Image Process., vol. 1, pp. 205–220, Apr. 1992.
[18] R. Calderbank, I. Daubechies, W. Sweldens, and B.-L. Yeo, “Wavelet transforms that map integers to integers,” Appl. Comput. Harmon. Anal., vol. 5, no. 3, pp. 322–369, 1998.
[19] A. J. Menezes, P. C. van Oorschot, and S. A. Vanstone, Handbook of Applied Cryptography. CRC Press, 1996.
[Block diagram: the shape mask s and object image (texture) xT are pre-processed and passed through the SA-DWT to the Secure ST-SPIHT coder (parameters λ, K; secret encryption key kE); the compressed/encrypted bit-stream traverses the channel/storage to the Secure ST-SPIHT decoder (secret decryption key kD), followed by the inverse SA-DWT and post-processing to yield the reconstructed shape mask and texture x̂T.]
Fig. 1. System level diagram of the SecST-SPIHT coding and decoding scheme.
[Bit-stream layout: subset Bn is composed of Bn,LIP, Bn,LIS, and Bn,LSP portions; the LIP portion interleaves Bn,LIP-α, Bn,LIP-sig, and Bn,LIP-sgn bits, while the LIS portion interleaves Bn,LIS-α, Bn,LIS-sig, Bn,LIS-Tsig, and Bn,LIS-sgn bits.]
Fig. 2. Composition of subset Bn of the ST-SPIHT bit-stream for n > λ.
[Block diagram: within the Secure ST-SPIHT coder, the ST-SPIHT coder (inputs: shape s and texture xT; parameters λ, K) and the stream cipher encryption function fE(b, kE) perform combined coding and encryption, with the locations of the bits Bn,LIP-α, Bn,LIP-sig, Bn,LIS-α, and Bn,LIS-sig determining which bits of the compressed/encrypted bit-stream are encrypted.]
Fig. 3. SecST-SPIHT Coder.
[Block diagram: within the Secure ST-SPIHT decoder, the compressed/encrypted bit-stream undergoes combined decryption and decoding, with the stream cipher decryption function fD(b, kD) applied at the locations of the bits Bn,LIP-α, Bn,LIP-sig, Bn,LIS-α, and Bn,LIS-sig, yielding the reconstructed shape ŝ and texture x̂T.]
Fig. 4. SecST-SPIHT Decoder.
TABLE I
PERCENTAGE OF FRAME OCCUPIED BY TEST OBJECTS.
Object Frame Percentage
’surveillance1’ 10.9%
’surveillance2’ 7.6%
’surveillance3’ 25.7%
’akiyo’ 37.2%
’foreman’ 29.4%
(a) original frame (b) segmentation map (c) segmented object
Fig. 5. ‘Surveillance1’ test object.
(a) original frame (b) segmentation map (c) segmented object
Fig. 6. ‘Surveillance2’ test object.
(a) original frame (b) segmentation map (c) segmented object
Fig. 7. ‘Surveillance3’ test object.
(a) original frame (b) segmentation map (c) segmented object
Fig. 8. ‘Akiyo’ test object.
(a) original frame (b) segmentation map (c) segmented object
Fig. 9. ‘Foreman’ test object.
TABLE II
THE NUMBER OF BITS ENCRYPTED FOR THE TEST OBJECTS USING DIFFERENT VALUES OF K AND λ = nmax − 2.
K
Test Object 1 2 3 4
Surveillance1 777 805 4333 4507
Surveillance2 783 819 3239 3428
Surveillance3 734 790 3494 4030
Akiyo 768 901 4086 4934
Foreman 762 874 5381 5763
(a) (b) (c)
(d) (e) (f)
Fig. 10. ‘Surveillance1’ test object/frame decoded and decrypted output (K = 2): (a)/(d) with correct key; (b)/(e) with incorrect
key; (c)/(f) with incorrect key and shape provided externally.
(a) (b) (c)
(d) (e) (f)
Fig. 11. ‘Surveillance2’ test object/frame decoded and decrypted output (K = 2): (a)/(d) with correct key; (b)/(e) with incorrect
key; (c)/(f) with incorrect key and shape provided externally.
(a) (b) (c)
(d) (e) (f)
Fig. 12. ‘Surveillance3’ test object/frame decoded and decrypted output (K = 2): (a)/(d) with correct key; (b)/(e) with incorrect
key; (c)/(f) with incorrect key and shape provided externally.
(a) (b) (c)
(d) (e)
Fig. 13. ‘Akiyo’ test object/frame decoded and decrypted output (K = 2): (a)/(d) with correct key; (b) with incorrect key;
(c)/(e) with incorrect key and shape provided externally.
(a) (b) (c)
(d) (e)
Fig. 14. ‘Foreman’ test object/frame decoded and decrypted output (K = 2): (a)/(d) with correct key; (b) with incorrect key;
(c)/(e) with incorrect key and shape provided externally.
[Four plots of the ratio of encrypted bits to total code bits vs. K = 1, . . . , 4, for the five test objects (surveillance1, surveillance2, surveillance3, akiyo, foreman).]
Fig. 15. The fraction of bits encrypted vs. the security level parameter K (number of encrypted coding iterations) for different λ (shape code levels): (a) shape not coded; (b) shape code completed in first iteration (λ = nmax); (c) shape code completed in second iteration (λ = nmax − 1); (d) shape code completed in third iteration (λ = nmax − 2). The total bits in the code correspond to a bit-rate of 2.4 bits-per-object-pixel.