

(c) 2016 Crown Copyright. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected].

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIFS.2016.2636094, IEEE Transactions on Information Forensics and Security


Audio watermarking using spikegram and a two-dictionary approach

Yousof Erfani, Student Member, IEEE, Ramin Pichevar, Member, IEEE, and Jean Rouat, Senior Member, IEEE

Abstract—This paper introduces a new audio watermarking technique based on a perceptual kernel representation of audio signals (spikegram). The spikegram is a recent method for representing audio signals; it is combined with a dictionary of gammatones to construct a robust representation of sounds. In traditional phase embedding methods, the phase of the coefficients of a given signal in a specific domain (such as the Fourier domain) is modified. In the encoder of the proposed method (the two-dictionary approach), the signs and phases of the gammatones in the spikegram are chosen adaptively to maximize the strength of the decoder. Moreover, the watermark is embedded only into kernels with high amplitudes, from which all masked gammatones have already been removed. The efficiency of the proposed spikegram watermarking is shown via several experimental results. First, robustness of the proposed method is shown against 32 kbps MP3 compression with an embedding rate of 56.5 bps. Second, we show that the proposed method is robust against the unified speech and audio codec (24 kbps USAC, linear predictive and Fourier domain modes) with an average payload of 5-15 bps. Third, it is robust against simulated small real-room attacks with a payload of roughly 1 bps. Lastly, it is shown that the proposed method is robust against a variety of signal processing transforms while preserving quality.

Index Terms—Copyright protection, Watermarking, Spikegram, Gammatone filter bank, Sparse representation, Multimedia security

I. INTRODUCTION

An analysis by the Institute for Policy Innovation concludes that every year global music piracy causes $12.5 billion in economic losses, 71,060 U.S. jobs lost, a loss of $2.7 billion in workers' earnings, and a loss of $422 million in tax revenues ($291 million in personal income tax and $131 million in lost corporate income and production taxes). Most music piracy is due to the rapid growth and ease of use of current technologies for copying, sharing, manipulating and distributing musical data [1].

As one promising solution, audio watermarking has been proposed for post-delivery protection of audio data. Digital watermarking works by embedding a hidden, inaudible watermark stream into the host audio signal. Generally, when the embedded data is easily removed by manipulation, the watermarking is said to be fragile, which is suitable for authentication applications, whereas for copyright applications the watermark needs to be robust against manipulations [2]. Watermarking also has many other applications, such as copy

Copyright (c) 2016 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected]. All authors are with the NECOTIS group, Department of Electrical and Computer Engineering, Université de Sherbrooke, 2500 boul. de l'Université, Sherbrooke, QC (e-mail: {yousof.erfani, ramin.pichevar, jean.rouat}@usherbrooke.ca).

Manuscript received Month DD, YYYY; revised Month DD, YYYY.

control, broadcast monitoring and data annotation [3], [4], [5]. For audio watermarking, several approaches have recently been proposed in the literature. These include phase embedding techniques [6], cochlear delay [7], spatial masking and ambisonics [8], echo hiding [9], [10], [11], the patchwork algorithm [12], the wavelet transform [13], singular value decomposition [14] and FFT amplitude modification [15]. State-of-the-art methods introduce phase changes in the signal representation (e.g., in the phase of the Fourier representation) [6], [16], while we adopt a more original strategy, using two dictionaries of kernels and shifting the sinusoidal term of the gammatones [17], [18]. In this paper, the watermarking is of the multi-bit type [19] and could be used for data annotation.

Multiple dictionaries for sparse representation have already drawn the attention of researchers in signal processing [20], [21], [22], [23]. For example, in [20], a two-dictionary method is proposed for image inpainting, where one decomposed image serves as the cartoon and the other as the texture image. Also, a watermark detection algorithm was proposed by Son et al. [21] for image watermarking, where two dictionaries are learned for horizontally and vertically clustered dots in the halftone cells of images. In [23], the authors propose an audio denoising algorithm using sparse audio signal regression with a union of two dictionaries of modified discrete cosine transform (MDCT) bases. They use long-window MDCT bases to model the tonal parts and short-window MDCT bases to model the transient parts of the audio signals. In [53], two random dictionaries are used to improve the cryptographic security of spread spectrum (SS) image watermarking. In all the mentioned methods, the goal is to obtain an efficient representation of the signal. For audio watermarking, however, one goal is to manipulate the signal representation so as to adaptively find the spectro-temporal content of the signal for efficient transmission of the watermark bits.

In this paper, we propose an embedding and decoding method for audio watermarking which jointly uses two types of gammatone dictionaries (gammasines and gammacosines) and a spikegram of the audio signal. It is shown in [24] that, in comparison to block-based representations, the spikegram is time-shift invariant; in it, the signal is decomposed over a dictionary of gammatones. To generate the spikegram, we use Perceptual Matching Pursuit (PMP) [25]. PMP is a bio-inspired approach that generates a sparse representation and takes into account auditory masking at the output of a gammatone filter bank (the gammatone dictionary is obtained by duplicating the gammatone filter bank at different time samples).

The proposed method is blind, as the original signal is not


Figure 1. A 2D plane of gammatone kernels of a spikegram generated from PMP [25], [35] coefficients. Panel (A) shows an artificial audio signal; panel (B) shows the spikes in the spikegram (center frequency in Hz versus time, 0-175 ms), marking a masker kernel found by PMP and a masked kernel removed by PMP. The 2D plane is generated by repeating Nc = 4 gammatones at different channels (center frequencies) and at each time sample. A gammatone with a non-zero coefficient is called a spike.

required for decoding. Also, the only information that needs to be shared between the encoder and the decoder is the key, which is used as the initial state of a pseudo-noise (PN) sequence, and the type and parameters for dictionary generation. To evaluate the performance of the proposed method, extensive experiments are conducted with a variety of attacks.

Robustness against lossy perceptual codecs is a major requirement for robust audio watermarking, so we evaluated the robustness of the method against 32 kbps MP3 (although not used that often anymore, it is still a powerful attack which can be used as an evaluation tool). The proposed method is robust against 32 kbps MP3 compression with an average payload of 56.5 bps, while the state-of-the-art robust payload against this attack is lower than 50.3 bps [26]. In this paper, for the first time, we evaluate the robustness of the proposed method against USAC (Unified Speech and Audio Coding) [27], [28], [29]. USAC is a strong contemporary codec (high quality, low bit rate), with dual options for both audio and speech. USAC applies technologies such as spectral band replication, a CELP codec and LPC. Experiments show that the proposed method is robust against USAC in its two modes, the linear predictive domain (executed only for speech signals) and the frequency domain (executed only for audio signals), with an average payload of 5-15 bps. The proposed method is also robust against simulated small real-room attacks [30], [31] at a payload of roughly 1 bps. Lastly, robustness against signal processing transforms such as re-sampling, re-quantization and low-pass filtering is evaluated, and we observed that the quality of the signals can be preserved.

In this paper, the sampled version of any time-domain signal is considered as a column vector with boldface notation.

II. SPIKEGRAM KERNEL BASED REPRESENTATION

A. Definitions

With a sparse representation, a signal $x[n]$, $n = 1:N$ (or $\mathbf{x}$ in vector format) is decomposed over a dictionary $\Phi = \{g_i[n];\, n = 1:N,\, i = 1:M\}$ to render a sparse vector $\alpha = \{\alpha_i;\, i = 1:M\}$ which includes only a few non-zero coefficients, having the smallest reconstruction error for the host signal $\mathbf{x}$ [24], [25]. Hence,

$$x[n] \approx \sum_{i=1}^{M} \alpha_i\, g_i[n], \quad n = 1, 2, \ldots, N \quad (1)$$

where $\alpha_i$ is a sparse coefficient. A 2D time-channel plane is generated by duplicating a bank of $N_c$ gammatone filters (having respectively different center frequencies) on each time sample of the signal. All the gammatone kernels in this 2D plane form the columns of the dictionary $\Phi$ (hence, $M = N_c \times N$). Thus $g_i[n]$ is one basis of the dictionary, located at the point corresponding to channel $c_i \in \{1, \ldots, N_c\}$ and time sample $\tau_i \in \{1, 2, \ldots, N\}$ inside the 2D time-channel plane (Fig. 1). The spikegram is the 2D plot of the coefficients at different instants and channels (center frequencies). The number of non-zero coefficients $\alpha_i$ per signal length $N$ is defined as the density of the representation (note that sparsity = 1 − density).

To compute the sparse representation in (1), many solutions have been presented in the literature, including Iterative Thresholding [32], Orthogonal Matching Pursuit (OMP) [33], the Alternating Direction Method (ADM) [34] and Perceptual Matching Pursuit (PMP) [25]. Here, we use PMP for three reasons: PMP is not computationally expensive, it is a high-resolution representation for audio signals, and it generates auditory masking thresholds and removes the inaudible content under the masks [25].
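As a concrete illustration of the greedy scheme that PMP builds on, plain matching pursuit over a unit-energy dictionary can be sketched as below. This is a minimal sketch: it omits PMP's perceptually updated masking thresholds, and the dictionary is a generic stand-in rather than the paper's gammatone dictionary.

```python
import numpy as np

def matching_pursuit(x, D, n_iter=50, tol=1e-6):
    """Greedy matching pursuit: decompose x over the columns of
    dictionary D (each column assumed unit-energy). Returns the
    sparse coefficient vector alpha such that x ~= D @ alpha.
    PMP additionally discards atoms below a masking threshold."""
    residual = x.astype(float).copy()
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        proj = D.T @ residual           # correlate residual with every atom
        i = int(np.argmax(np.abs(proj)))  # best-matching atom
        if abs(proj[i]) < tol:
            break
        alpha[i] += proj[i]             # accumulate its coefficient
        residual -= proj[i] * D[:, i]   # subtract the atom's contribution
    return alpha
```

With an orthonormal dictionary this recovers the coefficients exactly; with the overcomplete gammatone dictionary of (1) it yields an approximation whose residual shrinks with each iteration.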

PMP is a recent approach which solves the problem in (1) for audio and speech using a gammatone dictionary [25], [35]. PMP is a greedy method and an improvement over Matching Pursuit [36]. PMP finds only the audible kernels, for which the sensation level is above an iteratively updated masking threshold, and neglects the rest. A kernel is considered masked if it is under the masking of (or close enough in time or channel to) another masker kernel with larger amplitude. The efficiency of PMP for signal representation is confirmed in [25] and [35]. The gammatone filter bank (used to generate the gammatone dictionary) is adapted to natural sounds [24] and is shown to be efficient for sparse representation [25]. A gammatone kernel equation [17] has a gamma part and a tone part as below

$$g[n] = a\, n^{m-1} e^{-2\pi l n} \cos[2\pi (f_c/f_s)\, n + \theta], \quad n = 1, \ldots, \infty \quad (2)$$

in which $n$ is the time index, and $m$ and $l$ are used for tuning the gamma part of the equation. $f_s$ is the sampling frequency, $\theta$ is the phase, and $f_c$ is the center frequency of the gammatone. The term $a$ is the normalization factor that sets the energy of each gammatone to one. The effective length of a gammatone is defined as the duration over which the envelope is greater than one percent of the gammatone's maximum value. In this paper, a 25-channel gammatone filter bank is used (Table I). The bandwidths and center frequencies are fixed and chosen to correspond to the 25 critical bands of hearing. The filters are implemented at the encoder and the decoder using (2). A gammatone is called a gammacosine when $\theta = 0$ and a gammasine when $\theta = \pi/2$. In Table I, center frequencies and effective lengths for some gammatones, versus their channel


Table I
THE EFFECTIVE LENGTHS AND CENTER FREQUENCIES FOR GAMMATONE KERNELS USED IN THIS WORK.

Channel number:          1     2     3     4    ...   8     9     10    11    12   ...   23     24     25
Effective length (ms):   55.2  39.1  31.0  25.9 ...   13.9  12.2  10.7  9.4   8.2  ...   1.4    1.0    0.8
Center frequency (Hz):   50    150   250   350  ...   840   1000  1170  1370  1600 ...   10500  13500  18775

Figure 2. A sample gammacosine (blue) and gammasine (red) for channel 8, with a center frequency of 840 Hz and an effective length of 13.9 ms. Gammasines and gammacosines are chosen in the watermark embedding process based on their correlation with the host signal and the input watermark bit. The sampling frequency is 44.1 kHz.

numbers are given. In Fig. 2, the channel-8 gammasine and gammacosine are plotted.
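A gammacosine/gammasine pair per (2) can be generated as follows. The gamma-part tuning used here (order m = 4, per-sample decay l = 2.65e-3, giving roughly the channel-8 effective length) is an illustrative assumption, not the paper's exact parameterization.

```python
import numpy as np

def gammatone(fc, fs, theta=0.0, m=4, l=2.65e-3, n_samples=2048):
    """Discrete gammatone per (2): a * n^(m-1) * exp(-2*pi*l*n)
    * cos(2*pi*(fc/fs)*n + theta), normalized to unit energy.
    m and l are assumed illustrative tuning values."""
    n = np.arange(1, n_samples + 1, dtype=float)
    g = n ** (m - 1) * np.exp(-2.0 * np.pi * l * n) \
        * np.cos(2.0 * np.pi * (fc / fs) * n + theta)
    return g / np.linalg.norm(g)  # the factor a sets the energy to one

fs = 44100.0
gc = gammatone(840.0, fs, theta=0.0)        # gammacosine (theta = 0)
gs = gammatone(840.0, fs, theta=np.pi / 2)  # gammasine  (theta = pi/2)
```

Both kernels have unit energy, and the two phase variants are nearly orthogonal, which is what lets the two-dictionary encoder treat them as distinct carriers.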

B. Good characteristics of spikegram for audio watermarking

1) Time-shift invariance: In most traditional watermarking techniques, the signal representation is block-based: the signal is divided into overlapping blocks and the watermark is inserted into each block. These conventional methods have two drawbacks. First, they might misrepresent the transients and periodicities in the signal. Second, in the block-based representation of nonstationary signals, small shifts in the time-domain signal might produce large changes in the representation, depending on the position of a particular acoustic event in each block [24]. The spikegram representation in (1) is time-shift invariant and is suitable for watermarking that is robust against the time-shifting de-synchronization attack.

2) Low host interference when using the spikegram: In (1), many gammatones either have zero coefficients or are masked, thanks to PMP. Therefore, compared to traditional transforms such as the STFT and wavelet transforms, the spikegram is expected to yield less host interference at the decoder (see Fig. 13 and Fig. 14 for experiments on how the error rate and the quality depend on the sparsity of the spikegrams).

3) Efficient embedding into robust coefficients: The watermark bits are inserted only into the large-amplitude coefficients obtained by PMP, from which all inaudible gammatones have been removed a priori.

III. TWO-DICTIONARY APPROACH

The watermark bit stream is symbolized by $\mathbf{b}$, which is an $M_2 \times 1$ vector ($M_2 < M$). The goal is to embed the watermark bit stream into the host signal. $\mathbf{K}$, a $P \times 1$ vector ($P < M_2$), is the key shared between the encoder and the decoder of the watermarking system. Also, the sparse representation of the host signal $\mathbf{x}$ on the gammacosine dictionary (i.e., the $\alpha_i$) is assumed to be known.

The proposed method relies on the fact that the change in signal quality should not be perceptible when changing the phase of specific gammatone kernels. Moreover, it is called a two-dictionary approach because a candidate kernel for watermark insertion is adaptively selected from either a gammacosine or a gammasine dictionary.

For inserting multiple bits, the host signal $x[n]$ ($\mathbf{x}$ in vector format) is first represented using (1). Then, $M_2$ gammatones $g_k[n]$ from the representation in (1) are selected (the selection of watermark kernels is detailed in section III-D). These gammatones form the watermark dictionary $D_1$ and carry the watermark bit stream $b_k$, $k = 1, 2, \ldots, M_2$. The other $M_1 = M - M_2$ kernels form the signal dictionary $D_2$. The signal and watermark dictionaries are disjoint subsets of the gammatone dictionary used for sparse representation in (1), thus $D_1 \cap D_2 = \emptyset$. Each watermark bit $b_k$ serves as the sign of a watermark kernel. Hence (1) becomes

$$y[n] = \sum_{i=1}^{M_1} \alpha_i\, g_i[n] + \sum_{k=1}^{M_2} b_k\, |\alpha_k|\, g_k[n] \quad (3)$$

where $y[n]$ is the watermarked signal. In (3), if the watermark and signal dictionaries use the same gammatone kernels, the watermarking becomes a one-dictionary method, in which the watermark bits are inserted as the signs of gammatone kernels. In the two-dictionary method, in addition to the manipulation of the sign of the gammatone kernels, their phase may also be shifted by as much as $\pi/2$, based on the strength of the decoder. Hence, for the two-dictionary approach, each watermark kernel is chosen adaptively from a union of two dictionaries: one dictionary of gammacosines and one dictionary of gammasines.
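The one-dictionary variant of (3), in which each bit simply replaces the sign of a selected spikegram coefficient while keeping its magnitude, can be sketched as below. The index choice here is arbitrary for illustration; the paper selects strong, unmasked coefficients (section III-D).

```python
import numpy as np

def embed_signs(alpha, watermark_idx, bits):
    """One-dictionary embedding per (3): each watermark bit
    b_k in {-1, +1} becomes the sign of the selected coefficient,
    whose magnitude |alpha_k| is preserved. Non-selected
    ("signal dictionary") coefficients are left untouched."""
    alpha_w = alpha.astype(float).copy()
    for k, b in zip(watermark_idx, bits):
        alpha_w[k] = b * abs(alpha[k])
    return alpha_w

def decode_signs(alpha_w, watermark_idx):
    """Ideal-channel decoding: read the bits back as coefficient signs."""
    return [1 if alpha_w[k] >= 0 else -1 for k in watermark_idx]
```

Resynthesizing with (1) from `alpha_w` then yields the watermarked signal $y[n]$; the two-dictionary method extends this by also allowing a $\pi/2$ phase shift of the carrying kernel.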

The $k^{th}$ watermark kernel in the watermark dictionary is found adaptively and symbolized by $f_k$, which is either a gammasine or a gammacosine. Thus, for the two-dictionary method, the embedding equation in (3) becomes

$$y[n] = \sum_{i=1}^{M_1} \alpha_i\, g_i[n] + \sum_{k=1}^{M_2} b_k\, |\alpha_k|\, f_k[n] \quad (4)$$

To decode the $p^{th}$ watermark bit, we compute the projection of the watermarked signal on the $p^{th}$ watermark kernel:

$$\langle \mathbf{y}, \mathbf{f}_p \rangle = \sum_{i=1,\, i \neq p}^{M_1} \alpha_i\, \langle \mathbf{g}_i, \mathbf{f}_p \rangle + b_p\, |\alpha_p| + \sum_{k=1,\, k \neq p}^{M_2} b_k\, |\alpha_k|\, \langle \mathbf{f}_k, \mathbf{f}_p \rangle \quad (5)$$

The number of samples used to compute the projection in (5) is equal to the gammatone effective length. The goal is to decode the watermark bit as the sign of the projection $\langle \mathbf{y}, \mathbf{f}_p \rangle$. We later show how to find the best watermark


Figure 3. Watermark insertion using the two-dictionary method. First, the spikegram of the host signal is found using PMP with a dictionary of 25-channel gammacosines, located at each time sample along the time axis. Then, for each processing window and each channel, and based on the embedding bit b, the gammacosine or gammasine (located at a blue circle) with the maximum strength factor (mc or ms) is chosen for the watermark insertion. In this work, the gammatone channels Ch's are selected in the ranges 1-4 and 9-19 (odd channels only) for the watermark insertion. Also, to get the same embedding strength for different embedding channels, the processing windows of different channels have the same length.

kernels so that the first two terms on the right side of (5) have the same sign as the watermark bit $b_p$. There are two sources of interference in (5). First, the last term on the right side of (5) is the interference that the decoder receives from the other watermark bit insertions. To remove this interference term, the watermark insertion is performed in a limited number of channels so that the watermark gammatones are uncorrelated. In fact, to design the watermark dictionary, we choose a subset of the full overcomplete dictionary in such a way that the watermark kernels are spectro-temporally far enough apart to be uncorrelated; thus the watermark bits can be decoded independently. Hence, in Fig. 3, for each channel and time sample, two neighboring watermark kernels must be separated by at least one effective length and at least one channel. With this assumption, the correlation between watermark gammatones is less than 0.02. The second source of interference is the first term on the right side of (5), which originates from the correlations between the watermark and signal gammatones, as shown in (7). We reduce this interference in the encoder of the system in the next section, by adaptively searching for and embedding into the strongest watermark gammatones in the spikegram.

Since the embedding of multiple watermark bits is performed independently, in the next section only single-bit watermarking using the two-dictionary method is explained.

A. The proposed informed embedder

Equation (1) is used to resynthesize the host signal $\mathbf{x}$ from the sparse coefficients and gammacosines.

Now, we want to embed one bit $b \in \{-1, 1\}$ from the watermark bit stream $\mathbf{b}$ by changing the sign and/or the phase of a gammacosine kernel $g_p$ (the $p^{th}$ kernel found by PMP, still to be determined later in this section) with amplitude $\alpha_p$ (to be determined), located at a given channel and processing window (each processing window is a time frame including several effective lengths of a gammatone, Fig. 3). To find an efficient watermark kernel $f_p$ which bears the

Figure 4. The proposed embedder for a given channel and processing window. The gammasine or gammacosine with the maximum strength factor is chosen as the watermark kernel, and its amplitude is set to its associated sparse coefficient in the spikegram. Finally, (6) is used to resynthesize the watermarked signal y (in vector format). ms and mc are the strength factors for the gammasine and gammacosine candidates, respectively.

greatest decoding performance for the watermark bit $b$, we write the 1-bit embedding equation as follows:

$$y[n] = \sum_{i=1,\, i \neq p}^{M} \alpha_i\, g_i[n] + b\, |\alpha_p|\, f_p[n] \quad (6)$$

where the watermark kernel $f_p$ for a given channel number can be a gammacosine ($gc$) or a gammasine ($gs$), which are the zero and $\pi/2$ phase-shifted versions of the original gammatone kernel $g_p$, respectively. The correlation between the watermarked signal $\mathbf{y}$ and the watermark kernel $\mathbf{f}_p$ is found as below:

$$\langle \mathbf{y}, \mathbf{f}_p \rangle = \sum_{i=1,\, i \neq p}^{M} \alpha_i\, \langle \mathbf{g}_i, \mathbf{f}_p \rangle + b\, |\alpha_p| \quad (7)$$

Hence, to design a simple correlation-based decoder, the sign of the correlation on the left side of (7) is taken as the decoded watermark bit. In this case, for correct detection of the watermark bit $b$, the interference term should not change the desired sign on the right-hand side of (7). Moreover, the gammatone dictionary is not orthogonal, so the first term on the right side of (7) may cause erroneous detection of $b$. For a strong decoder, the two terms on the right side of (7) should have the same sign and large values. We later show that, by finding an appropriate gammacosine or gammasine in the spikegram, the right side of (7) can be given the same sign as the watermark bit $b$. In this case, the modulus of the correlation in (7) is called the watermark strength factor $m_p$ for the bit $b$; a greater strength factor means a watermark bit that is stronger against attacks. In this case, (7) becomes

$$\langle \mathbf{y}, \mathbf{f}_p \rangle = b\, m_p \quad (8)$$

To obtain a strength factor with a large value (and with the same sign as the watermark bit), we search for the peak value of the projections in (7) when the gammatone candidate is a gammacosine or a gammasine. Thus, for a given channel, processing window and watermark bit $b$, the signal interference at the decoder is minimized using the informed encoder in (7). We use the following procedure to find the phase, position and amplitude of the watermark kernel $f_p$ (Fig. 4). For a given channel, we consider the watermark gammatone candidate $f_p$ (the $p^{th}$ gammatone kernel in the signal representation of (1)) to be a gammacosine $gc$ or a gammasine $gs$. Then, do the following steps:



Figure 5. The proposed one-bit decoder. The projections with maximum absolute value for gammacosines and gammasines are found as Pc[kc] and Ps[ks], at time samples kc and ks respectively. The watermark bit b is taken as the sign of the projection with the largest absolute value.

• Shift the watermark gammatone candidate f_p alongside all processing windows, at time shifts equal to multiples of the gammatones' effective length. For each shift, compute the correlation of the watermarked signal with the sliding watermark candidate kernel. Then find the absolute maximum of the correlation (the watermark strength factor) |⟨y, f_p⟩| using (7) (Fig. 3). The result is a strength factor m_c for a gammacosine located at time sample k_c with amplitude α_c, and another strength factor m_s for a gammasine kernel located at k_s with amplitude α_s. Thus m_c = |⟨y, gc[n − k_c]⟩| and m_s = |⟨y, gs[n − k_s]⟩|.

• Afterwards, the gammacosine or gammasine with the greater strength factor is chosen as the final watermark gammatone f_p, with the final watermark strength factor m_t = max(m_c, m_s). The respective time shift k_c or k_s, amplitude α_c or α_s, and phase are registered. The algorithm therefore finds the optimal watermark gammatone from two dictionaries: one dictionary of gammacosines and one dictionary of gammasines.

• Finally, the watermarked signal is synthesized using (6), with the amplitude of f_p set to b |α_p|. This is equivalent to finding the time-channel point in the spikegram (Fig. 3) that gives the optimal embedding position; g_p is then replaced with the watermark gammatone f_p.
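As an illustration, the two-dictionary search above reduces to picking, over all candidate time shifts, the kernel (cosine- or sine-phased) whose absolute correlation with the signal is largest. The sketch below is a minimal pure-Python rendition with toy stand-in kernels, not the paper's gammatone generator or its MATLAB implementation:

```python
import math

def correlate(x, kernel, shift):
    """Dot product of x with a kernel placed at the given time shift."""
    return sum(x[shift + n] * kernel[n] for n in range(len(kernel))
               if 0 <= shift + n < len(x))

def two_dictionary_search(y, gc, gs, shifts):
    """Pick the gammacosine or gammasine (and its shift) whose absolute
    correlation with y -- the watermark strength factor -- is largest."""
    best = None
    for name, kernel in (("gc", gc), ("gs", gs)):
        for k in shifts:
            m = abs(correlate(y, kernel, k))
            if best is None or m > best[0]:
                best = (m, name, k)
    return best  # (strength factor m_t, chosen dictionary, time sample)

# Toy example: a cosine-phased event hidden at sample 16 of a silent signal.
kernel_gc = [math.cos(2 * math.pi * n / 8) for n in range(8)]
kernel_gs = [math.sin(2 * math.pi * n / 8) for n in range(8)]
y = [0.0] * 32
for n in range(8):
    y[16 + n] += kernel_gc[n]
m_t, dico, k = two_dictionary_search(y, kernel_gc, kernel_gs, range(0, 32, 8))
```

Here the search correctly selects the gammacosine dictionary at shift 16, since the sine-phased kernel is orthogonal to the embedded event.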

B. The proposed decoder

At the decoder, since the watermark kernels associated with the watermark bits are uncorrelated, the decoding of one watermark bit does not interfere with the decoding of the other bits. Hence, the same search procedure used in the embedder to find the watermarked kernel candidate is applied. For a given channel and processing window, the decoding procedure is shown in Fig. 5.

For a given channel, suppose the watermark gammatone candidate f_p is a gammacosine gc or a gammasine gs. Then do the following steps:

• Shift the watermark kernels gc and gs alongside the processing window at time shifts (respectively k_c and k_s) equal to multiples of the gammatone's effective length. For each k (either k_c or k_s), compute the correlation of


Figure 6. The average watermark strength factor for the one-dictionary and two-dictionary methods versus the embedding channel range. Simulations are done on 100 signals (3 minutes each) of different music genres and English voices. In each embedding experiment, the watermark is inserted in a specific channel range with a 45 msec processing window. The maximum peak of a 20 dB AWGN is also plotted as a horizontal dashed black line. The 95% confidence intervals are plotted as vertical lines at the center of the average results for three consecutive channels. Input signals are normalized to unit l2 norm.

the watermarked signal with the sliding watermark kernel candidate.

• Then, find the absolute maxima of the correlations P_c[k] = |⟨y, gc[n − k]⟩| and P_s[k] = |⟨y, gs[n − k]⟩| (Fig. 5). The result is one absolute maximum correlation m_c = max(P_c[k]) for a gammacosine located at time sample k_c with amplitude α_c, and another absolute maximum correlation m_s = max(P_s[k]) for a gammasine kernel located at k_s with amplitude α_s.

• Finally, if m_c > m_s then b = sign(P_c[k_c]); otherwise b = sign(P_s[k_s]).

The resynchronization of the decoder and the encoder is detailed in section IV-3. In Fig. 6, the watermark strength factor is plotted versus the embedding channel range for the cases of one dictionary (using only gammacosine kernels) and two dictionaries (using both gammasine and gammacosine kernels). As expected, the strength factor for the two-dictionary method is always greater than that for the one-dictionary method. The maximum improvement occurs for the middle channels, between 5 and 16. Also, towards greater embedding channel ranges, the strength factor becomes smaller. For an insight into the robustness of the methods against 20 dB additive white Gaussian noise (AWGN), the average maximum peak of the noise signal is plotted as a horizontal line. As is seen, the range of embedding channels that is robust against 20 dB AWGN is wider for the two-dictionary method than for the one-dictionary method (for the one-dictionary method it lies below channel 16, while for the two-dictionary method it spans channels 1 to 24).
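The one-bit decision rule of Fig. 5 can be sketched in the same spirit as the encoder search (toy stand-in kernels again; `max_abs_projection` and `decode_bit` are illustrative names, not from the paper):

```python
import math

def max_abs_projection(y, kernel, shifts):
    """Return (signed projection, shift) with the largest absolute value."""
    best_p, best_k = 0.0, shifts[0]
    for k in shifts:
        p = sum(y[k + n] * kernel[n] for n in range(len(kernel))
                if 0 <= k + n < len(y))
        if abs(p) > abs(best_p):
            best_p, best_k = p, k
    return best_p, best_k

def decode_bit(y, gc, gs, shifts):
    """Fig. 5 rule: take the sign of whichever dictionary projects stronger."""
    pc, kc = max_abs_projection(y, gc, shifts)
    ps, ks = max_abs_projection(y, gs, shifts)
    return 1 if (pc if abs(pc) > abs(ps) else ps) > 0 else -1

gc = [math.cos(2 * math.pi * n / 8) for n in range(8)]
gs = [math.sin(2 * math.pi * n / 8) for n in range(8)]
y = [0.0] * 32
b = -1                              # embedded bit flips the kernel's sign
for n in range(8):
    y[8 + n] += b * gs[n]           # a gammasine carrying b at sample 8
decoded = decode_bit(y, gc, gs, list(range(0, 32, 8)))  # → -1
```

The gammasine projection dominates at shift 8 and its negative sign recovers the embedded bit.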

As illustrated in Fig. 3, for channel j, each processing window includes several effective lengths l_j of the gammatone. Thus, for each channel j, the algorithm searches for a watermark gammatone candidate among ⌈L_P / l_j⌉ gammatones (Fig. 3, vertical red lines in each processing window). Thanks to the content-based aspect of the approach, phase



Figure 7. The error probability of the decoder under additive white Gaussian noise at different signal-to-noise ratios (SNR = 10, 15, 20 and 25 dB). The mean and variance of the watermark strength factor are estimated from 100 signals, including music and English voices, 3 minutes each.

shifts and sign changes occur adaptively. Therefore, depending on the signal, a change in the sign and phase of the gammatone is not always necessary when generating the watermark. In that situation, some watermark gammatones are similar in shape and phase to the signal gammatones. This is one of the features of the proposed method that contributes to the quality of the watermarked signals.

Note that the spikegram removes the inaudible content of the signal and uses the gammatone filterbank, which is adapted more to natural sounds than to noise [24]. Therefore, high-value coefficients in the spikegram are robust against mild inaudible distortions and noise. These properties are essential for the proposed decoder when searching for the watermark gammatones with high-value coefficients.

C. Robustness of the proposed method against additive white Gaussian noise

For the robust watermarking evaluation, we suppose that additive white Gaussian noise (AWGN) z[n] is added to the watermarked signal y of (6). Therefore, (8) becomes

r = ⟨y, f_p⟩ = m_p b + ⟨z, f_p⟩    (9)

in which r is the correlation computed for the decoding of bit b, and z is the AWGN with zero mean and variance σ_n². As the gammatone kernel f_p has zero mean and unit variance, the mean and variance of ⟨z, f_p⟩ are 0 and σ_n², respectively. It is assumed that, for each channel number k, the strength factors m_p are samples from a white Gaussian random process with a specific mean μ_k and variance σ² (Fig. 6). Assuming decorrelation between the watermark strength factor for channel k and the noise z, the mean and variance of the correlation term r in (9) are m_r = m_p b and σ_r² = σ_n² + σ². Therefore, considering the decoded watermark bit b = sign(⟨y, f_p⟩), the error probability of the decoder for channel k is as below

p_k = Pr{b < 0 | b = 1} = (1/2) erfc( m_r / (σ_r √2) )
    = (1/2) erfc( μ_k / √(2(σ_n² + σ²)) )    (10)


Figure 8. The time-domain waveforms of the original signal (blue) and the watermarked minus the original signal (red). The original signal is a solo harpsichord piece sampled at 44.1 kHz. Each gammatone (a spike) in the difference signal (red) indicates the insertion of one watermark bit.

where erfc(·) is the complementary error function. In (10), the probability of error is low when each channel's strength factor has a larger mean and a smaller variance. In Fig. 7, the estimated error probability of the decoder in (10) is plotted versus the embedding channel number for different levels of signal-to-noise ratio (SNR). As is seen, for 20 dB SNR, the error probability of the first 20 channels stays below .04. Moreover, additive noise with lower SNR has a more destructive effect on channels with higher center frequencies.
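Equation (10) can be evaluated directly with the standard complementary error function; the mean and variance values below are illustrative placeholders rather than the statistics measured in the paper:

```python
import math

def decoder_error_probability(mu_k, sigma2, sigma2_n):
    """Eq. (10): p_k = 0.5 * erfc(mu_k / sqrt(2 * (sigma2_n + sigma2)))."""
    return 0.5 * math.erfc(mu_k / math.sqrt(2.0 * (sigma2_n + sigma2)))

# Illustrative numbers: a strength-factor mean well above the noise floor
# gives a small error probability; more noise (larger sigma2_n) raises it.
p_low_noise = decoder_error_probability(mu_k=0.02, sigma2=1e-5, sigma2_n=1e-5)
p_high_noise = decoder_error_probability(mu_k=0.02, sigma2=1e-5, sigma2_n=1e-4)
```

This reproduces the qualitative behavior of Fig. 7: the error probability grows as the noise variance increases relative to the strength-factor mean.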

D. Designing efficient dictionaries for high-quality robust watermarking

To design the watermark dictionary, we consider three conditions. First, based on our empirical results, we do not add watermarks in channels 5 to 8, because of the greater sensitivity of the ear in this channel range. Second, embedding into lower channels gives more robustness against the AWGN attack (Fig. 6). Lastly, the watermark gammatone kernels should be uncorrelated: they should be separated by at least one channel (along the channel axis) and one effective length (along the time axis). The final implementation uses channels 2, 4, 9, 11, 13, 15, 17 and 19 for watermark insertion.

IV. EXPERIMENTAL SETUP

Table II lists the simulation conditions.

1) Test signals: For the quality test, ABC/HR tests [37] are done on 10-second pieces of 6 types of audio signals: Pop, Jazz, Rock, Blues, Speech (in French) and Classical. Titles of the musical pieces are listed in Table III. For the robustness test, simulations are done on 100 audio signals, 3 minutes each (5 hours in total), including different music genres and English voices. Each signal is sampled at 44.1 kHz in 16-bit WAV format. A sample original signal and the watermarked-minus-original difference signal are plotted in Fig. 8.

2) Sparse signal representation using PMP and gammatone dictionary: In all experiments, the sparsity of the PMP representation is 0.5. A bank of 25 gammacosine kernels distributed from 20 Hz to 20 kHz is implemented to generate spikegrams according to the conditions given in Table II.

3) Resynchronization and watermark stream generation: The first 13 bits of the embedded bit stream in each second are devoted to the synchronization Barker sequence +1+1+1+1+1−1−1+1+1−1+1−1+1 [39]. This allows synchronization between the decoder and the encoder. The other bits in each second are


Table II
THE DEFAULT COMPUTER SIMULATION CONDITIONS

Quality test: audio signals of Table III, 10 seconds each
Quality measure: subjective difference grade (SDG) [37]
Robustness test: 100 audio signals, 3 minutes each; 50 speech signals [54], 50 music signals [55]
Signal characteristics: sampled at 44.1 kHz, quantized at 16 bits
Processing window: 45 msec
2D spikegram: 25-channel gammatone filter bank [17], repeated each time sample
Sparse representation: PMP on 10-second frames, with 50% sparsity
Synchronization code: 13-bit Barker sequence [39]
Robustness measure: bit error rate (BER)

Table III
EXCERPTS OF 10 SECONDS FROM THESE AUDIO FILES ARE CHOSEN FOR THE ABC/HR [37] LISTENING TESTS, WITH THE ODG RESULTS AFTER WATERMARKING (γ = .01)

Audio type | Title (author or group name)     | ODG
Pop        | Power of Love (Celine Dion)      | -.41
Classical  | Symphony No. 5 (Beethoven)       | -.85
Jazz       | We've Got (A Tribe Called Quest) | -.37
Blues      | Bent Rules (Kiosk)               | -.23
Rock       | Enter Sandman (Metallica)        | -.69
Speech     | French voice (a female)          | -.1

devoted to the watermark bit stream b. For robustness against cryptographic attacks [40], in each frame the watermark bit stream b is multiplied by a pseudo-noise (PN) sequence [41] p, in which p_i ∈ {−1, 1}. Thus each embedded bit in the watermark stream will be b_i p_i.

For synchronization between the encoder and the decoder,

we define a 1000 ms rectangular sliding window, multiply it with the watermarked signal at the decoder, and decode the watermark bits from all 22 successive processing windows in the sliding window. Note that each 1-second sliding window includes 22 processing windows. Thus, if the Barker code is decoded with more than 75% accuracy, the rest of the decoding is performed. If the synchronization Barker code is not acquired, the sliding window is shifted and the same procedure is repeated. Note that the signal itself is not shifted for synchronization, thanks to the time-shift invariance property of the spikegram representation [24]. Also, because of the kernel-based representation of the spikegram, the proposed method is robust against mild time scaling (Appendix A).

Moreover, using the mentioned synchronization approach, all 45 ms processing windows inside each 1000 ms sliding window are also synchronized. A corruption in one processing window might result in at most 0.045 s × 44100 ≈ 1984 corrupted samples. In this case, to resynchronize the decoder with the embedder, there might be a need to search for the Barker code over 1984 shifts of the sliding window. As the decoding is done in real time, the resynchronization procedure is not computationally expensive. For the resynchronization of critically time-rescaled watermarked signals, a search approach to find the best time-rescaling ratio could be applied, as in [6].
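The acquisition test above amounts to matching the first 13 decoded bits of a sliding window against the Barker sequence and accepting when more than 75% of them agree. A simplified sketch, operating on already-decoded bits rather than on spikegram projections:

```python
BARKER_13 = [+1, +1, +1, +1, +1, -1, -1, +1, +1, -1, +1, -1, +1]

def barker_acquired(decoded_bits, threshold=0.75):
    """Accept synchronization if enough of the first 13 decoded bits match."""
    matches = sum(1 for a, b in zip(decoded_bits[:13], BARKER_13) if a == b)
    return matches / 13.0 > threshold

# Perfect acquisition, then two chip errors (11/13 ≈ 0.85 still passes),
# then four errors (9/13 ≈ 0.69 fails -> shift the sliding window and retry).
clean = list(BARKER_13)
two_err = [-b if i < 2 else b for i, b in enumerate(BARKER_13)]
four_err = [-b if i < 4 else b for i, b in enumerate(BARKER_13)]
```

The 75% threshold thus tolerates up to three chip errors per window before forcing a window shift.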

Finally, for the extraction of the watermark bits, the decoder uses the key (K) to generate the PN sequence p. Then each p_i is multiplied with its received bit b_i p_i to recover the watermark bit b_i (since b_i p_i p_i = b_i). If the watermarked signal is shifted

Figure 9. Generation of the embedded watermark stream from the watermark bits and the synchronization code. A 13-bit synchronization Barker code is inserted into each second of the signal. The watermark bits in each frame are multiplied by a PN sequence.

one sample along the time axis, then the time required for resynchronization equals the decoding time of one input frame (1 second) minus the preprocessing time.

V. EXPERIMENTAL EVALUATION

As a preprocessing task for both the encoder and the decoder, a linear feedback shift register (LFSR) should be designed to generate the PN sequence. The key K includes log2(M2) bits as the initial state of the LFSR. It also comprises two decimal digits associated with the spikegram generation: the number of channels N_c and the time shift q (in this work, N_c = 25, q = 1), i.e., two 7-bit ASCII codes. Thus 14 bits are devoted to the spikegram parameters. In total, the key includes 14 + log2(M2) bits. The spikegram parameters and the initial state of the LFSR are part of the key, which increases the robustness of the proposed method against cryptographic attacks.

A. Quality

The embedding channels 2, 4, 9, 11, 13, 15, 17 and 19 are selected for watermark insertion. In the embedding channels from 9 to 19, each watermark bit is inserted through the three watermark kernels with the highest strength factors. However, for the first two channels (2 and 4), one watermark bit is inserted in each processing window. Therefore, the total number of watermark insertions in each processing window L_P (in seconds) is 2 + 6 × 3 = 20 bits, and there are 1/L_P processing windows per second. Hence, the total number of embedded bits per second is M2 = 20/L_P, while the number of distinct embedded watermark bits per second is 8/L_P.

Moreover, the total distortion depends on the quality of the PMP representation. As the PMP coefficients for the silent parts are zero, the proposed method does not insert a watermark into the silent parts of the signal. Thus the watermarking payload is calculated for the non-silent parts of the signal. As the PMP coefficients change from signal to signal, the robustness and the quality of the algorithm are also content dependent.

To assess the quality of the watermarked signals, ABC/HR listening tests were conducted based on ITU-R BS.1116 [37] on 10-second segments of the audio signals given in Table



Figure 10. SDG as a function of the embedding percentage factor γ for 4 minutes of each audio clip described in Table III: Pop (a), Classical (b), Jazz (c), Blues (d), Rock (e), Speech (f). The bottom ends of the bars indicate SDG means and the vertical red line segments represent the 95% confidence intervals surrounding them. γ = 0 indicates the original signals, and γ = .01 is also used to generate the watermarked signals for the robustness test.

III. Experiments are conducted for different embedding percentages γ (γ is the percentage of embeddings per one-second frame and equals M2/44100; note that γ is different from the payload). For the quality measurement, fifteen random subjects (ranging from experts to people with no experience in audio processing, aged between 20 and 50, male and female) participated in 5-scale ABC/HR tests, listening to the signals with beyerdynamic DT 250 headphones in a soundproof listening room.

In Fig. 10, the average subjective difference grade (SDG) [37] for several types of test signals is plotted with respect to the embedding percentage factor γ. The tips of the bars and the vertical red line segments on them indicate the mean SDG values and their associated 95% confidence intervals, respectively. The SDG is the difference between the average quality grade of the watermarked signal (given by the listeners) and the quality grade of the original signal (which is zero). The SDG is a quality difference grade between 0 and −4, and an SDG strictly smaller than −1 means low quality.

As is seen from Fig. 10, as the embedding percentage factor γ increases, the quality of the watermarked signals degrades and the confidence interval widens, except for the classical audio. For the classical audio, when γ increases from .05 to .1, the average SDG rated by the listeners improves. One reason is that the selected classical signal in our test contains no silence and has a noise-like spectrum, so adding a small amount of watermark noise may not be distinctly perceived by the listeners.

In all results of Fig. 10, when γ is not higher than .01, the

Table IV
PARAMETERS USED FOR THE ATTACK SIMULATIONS

Attack               | Condition                  | Set-up values
No attack            | -                          | -
Resampling           | Frequency (kHz)            | 22, 16
Requantization       | Bits                       | 8, 12
Low-pass filtering   | 2nd-order Butterworth      | Cut-off 0.5, 0.7 × 22.05 kHz
Additive white noise | SNR (dB)                   | 20, 25, 30
MP3 compression      | Bit rate (kbps)            | 64, 32
Random cropping      | Cropping per total length  | 0.125%, 0.250%
Amplitude scaling    | Scale ratio                | 0.5, 2
Pitch scaling        | Scale ratio                | 0.95, 1.05
Time scaling         | Scale ratio                | 0.95, 1.05

SDG is greater than −0.5 and the confidence intervals are smaller than 0.5 (vertical red lines) and cross the line SDG = 0. Thus, for the robustness test, γ is set to .01 to ensure high quality for the watermarked signals. A 10-second sample of each original signal type and its associated watermarked signal can be downloaded at: http://alum.sharif.ir/~yousof_erfani/. The signals in Table III are watermarked with γ = .01; their objective difference grade (ODG) results are computed using the open-source PEAQ [42] test and reported in Table III.

B. Payload and Robustness

The payload of the method is defined as the number of watermark bits embedded in each second of the host signal that are decoded accurately at the decoder. For the case of γ = .01, the number of watermark kernels is M2 = .01 × 44100 = 441. Hence, the processing window length is L_P = 20/441 ≈ 45 msec and the maximum attainable payload of the proposed method is 8/45 msec ≈ 177 bps. Fig. 11 shows how many embedded watermark bits are perfectly recoverable under different attacks.
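The payload bookkeeping can be checked numerically from the quantities defined above (plain arithmetic; the small discrepancy with the quoted 177 bps comes from rounding L_P to 45 msec):

```python
fs = 44100                   # sampling rate (Hz)
gamma = 0.01                 # embedding percentage factor
M2 = round(gamma * fs)       # watermark kernels per second -> 441
L_P = 20.0 / M2              # window length (s): 20 insertions per window, ~45 ms
windows_per_sec = 1.0 / L_P  # ~22 processing windows per second
distinct_bps = 8.0 / L_P     # distinct watermark bits per second, ~176-177 bps
```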

We use the same number of watermark kernels for both the higher and lower bit rates in the results reported in Fig. 11. Hence, for the lower bit rates, we use a larger repetition coding factor, and the watermark decoder at lower bit rates is therefore stronger than at higher embedding bit rates. The high quality of the watermarked signals is confirmed for a payload of 177 bps (γ = .01) in Fig. 10, and therefore also holds for the other bit rates in Fig. 11.

The bit error rate (BER) of the decoder is defined as the number of erroneously detected bits at the decoder divided by the total number of embedded bits. In Fig. 11, to test the robustness of the proposed method, the BER was computed for a variety of attacks including noise addition, MP3 compression, re-sampling, low-pass filtering and re-quantization. The parameter settings for each attack are given in Table IV.
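Stated precisely, the BER measure is just the mismatch fraction between the embedded and decoded bit streams:

```python
def bit_error_rate(embedded, decoded):
    """Fraction of watermark bits misdetected at the decoder."""
    errors = sum(1 for e, d in zip(embedded, decoded) if e != d)
    return errors / len(embedded)

# 2 mismatches out of 8 bits -> BER = 0.25
ber = bit_error_rate([1, -1, 1, 1, -1, -1, 1, -1],
                     [1, -1, -1, 1, -1, 1, 1, -1])
```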

The audio editing tools used in the experiments are CoolEdit 2.1 [43] (for re-sampling and re-quantization) and the WavePad [44] sound editor (for MP3 compression). The other attacks of Table IV are implemented in MATLAB [45]. In addition, for all attacks, frame synchronization is performed using the resynchronization approach described in section IV-3. In Fig. 11, the most powerful attacks are MP3 at 32 kbps (with an average robust payload of 56.5 bps) and MP3 at 64 kbps (with an average robust payload of 77 bps). The proposed



Figure 11. Robustness test results. 5 hours of different music genres and English speech (100 signals, 3 minutes each), with the conditions presented in Table II, are watermarked. The average BER versus the payload (bps) is plotted when the watermarked signals are exposed to the attacks (with the conditions listed in Table IV): (a) no attack, (b) additive noise, (c) re-quantization, (d) re-sampling, (e) MP3 compression, (f) pitch scaling, (g) amplitude scaling, (h) low-pass filtering, (i) time scaling, (j) random cropping. γ = .01; the sparsity of the signal coefficient vector is forced to 0.5. Note that the BER is calculated as the ratio of misdetected watermark bits to the total number of embedded bits. The robust payload for the no-attack and amplitude-scaling conditions is 177 bps.

method is robust against low-pass filtering, with a robust payload greater than 89 bps. The robust payload for all the other attacks is around 95 bps.

Moreover, random cropping is done by setting to zero 0.125 or 0.250 percent of the signal's samples, at random positions, in every 1-second frame. This random cropping corrupts roughly 55 or 110 samples per second. Note that with this definition of random cropping, the signal length is not changed. Hence, random cropping changes the spikegram coefficients obtained by PMP at the decoder. However, the decoder searches for high peaks in the spikegram, which are robust to mild modifications (very low-value coefficients, by contrast, are very prone to cropping). As the percentage of cropped samples increases, we expect more degradation of the high-value coefficients and hence a higher BER.


Figure 12. The average BER of the decoder under different attack strengths.

As an interesting observation, when 10 dB additive white Gaussian noise was added to the signal, we observed that more than 90 percent of the errors occur because of misdetection of the location and type of the correct gammatone (gammacosine versus gammasine) at the decoder. This is because many peak amplitudes in the projection search space can have close values, which may lead to misdetection of the true peak under attack. Under moderate attacks, it is less probable that the sign of the high-amplitude peaks is changed. In Fig. 12, the average BER of the decoder is plotted versus the attack strength level for important attacks including time rescaling (with rescaling factors between 1.05 and 1.13), cropping (with corruption between 0.064% and 0.320%) and LPF (low-pass filtering, 2nd-order Butterworth with cut-off frequency between 4 kHz and 22 kHz). The payload in these experiments is 100 bps. The experimental conditions are the same as in Table IV. As is seen, even for low-pass filtering with a cut-off frequency greater than 6 kHz, the BER remains smaller than 10%. For the cropping attack (0.250 percent of samples randomly set to zero), the BER is small. The strongest attack is time rescaling: with a factor of 1.10, the BER is still more than 10 percent.

Note that our approach does not embed the watermark into the 5 highest-frequency channels, which makes the resulting audio watermarking method robust against low-pass filtering.

Moreover, there is a trade-off between the quality of the signal in (1) and the BER of the decoder. In Fig. 13 and Fig. 14, respectively, the average ODG of the watermarked signals and the BER of the decoder are plotted versus the number of gammatone channels and the density of the coefficients (when the payload is 100 bps and there is no attack). Results are obtained for 5 hours of different music genres and English speech (100 signals, 3 minutes each). Increasing the number of gammatone channels and the density means using more coefficients in the sparse representation. Hence, sparsity imposes a trade-off between quality and the BER of the decoder. Using more coefficients in the sparse representation results in a higher average ODG in Fig. 13 for the watermarked signals and, at the same time, a higher average BER in Fig. 14.



Figure 13. Average objective difference grade versus the number of gammatone channels in the spikegram and the density (density = 1 − sparsity).


Figure 14. Average BER [%] of the decoder versus the number of gammatone channels in the spikegram and the density.

When the density increases (sparsity is reduced), the BER and ODG curves become closer. This is because, for a greater density, more PMP iterations are used. The new gammatones found in the last iterations have smaller coefficients and therefore less impact on the quality and on the BER of the decoder.

C. Robustness of the proposed method against USAC

In this section, the robustness of the proposed method is evaluated against a new-generation codec called unified speech and audio coding (USAC) [29]. USAC applies linear prediction in the time domain (LPD) along with residual coding for speech segments, and frequency-domain (FD) algorithms for music segments. It is also able to switch between the two modes dynamically in a signal-responsive manner. In Fig. 15, the BER results of the proposed decoder under the USAC attack are plotted. As is seen, for channels 2, 4 and 8, the BER is smaller than .03. As the processing window length for the USAC experiments is 200 msec, 5 bits per channel are embedded in each second. Thus the robust payload against USAC is between 5 bps and 15 bps.

D. Real-time watermark decoding using the proposed method

The computational complexity of the proposed scheme was analyzed on a personal computer with an Intel CPU at 2.5 GHz and 512 MB of DDR memory, using the MATLAB 7 compiler. The decoding procedure includes


Figure 15. The BER of the proposed decoder under unified speech and audio coding (USAC) [29] for different bit rates (24 kbps and 20 kbps) and different modes (linear prediction (LPD) and Fourier domain (FD)). The horizontal axis indicates the embedding channel. As is seen, only for embedding channels 2, 4 and 8 is the BER smaller than .02. The processing window length is 200 msec. Also, the BER for the LPD mode is slightly larger than for the FD mode. The experiments are done on 100 audio signals, including different music genres and English voices, 3 minutes each.

computing projections and finding the maximum value among several projections. Our experiments show that the time required for decoding one second of the watermarked signal is 780 msec. Also, the preprocessing time, which includes creating the gammacosine and gammasine kernels and the pseudo-noise, is around 2.3 seconds. This indicates that, after the initial preprocessing stage, the proposed method can be used for real-time decoding of the watermark bits.
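The projection-and-maximum decoding step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the gammatone formula, centre frequency, bandwidth and frame length are assumed values, and only one channel with a gammacosine/gammasine pair is shown.

```python
import numpy as np

def gammatone(fc, fs, n_samples, phase=0.0, order=4, b=100.0):
    # Illustrative 4th-order gammatone: t^(order-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t + phase)
    t = np.arange(n_samples) / fs
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t + phase)
    return g / np.linalg.norm(g)

def decode_bit(frame, kernels):
    # Compute all projections, keep the kernel with maximum |projection|;
    # the watermark bit is the sign of that projection.
    projections = [float(np.dot(frame, g)) for g in kernels]
    best = int(np.argmax(np.abs(projections)))
    return int(np.sign(projections[best])), best

fs, n = 16000, 512
# One channel, two phase candidates: gammacosine (phase 0) and gammasine (phase pi/2)
kernels = [gammatone(1000.0, fs, n, phase=p) for p in (0.0, np.pi / 2)]

# Toy watermarked frame: the gammacosine kernel embedded with a negative sign, plus noise
rng = np.random.default_rng(0)
frame = -0.8 * kernels[0] + 0.01 * rng.standard_normal(n)

bit, best = decode_bit(frame, kernels)
print(bit, best)  # -1 0
```

Since the cost per frame is a handful of inner products plus an argmax, the claimed real-time behaviour after kernel precomputation is plausible.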

E. Comparison to recent audio watermarking techniques

Table V compares the proposed method with several recent methods in terms of robustness against 32 kbps MP3 attacks. As is seen, the proposed method has a greater robust payload against 32 kbps MP3 compression than the mentioned recent methods. In the proposed method, PMP removes the coefficients associated with inaudible content of the signal, which lie under the masking thresholds, and the watermark bits are inserted into high-value coefficients. This helps provide more robustness against the MP3 attack, in which perceptual masking is also used. Note that the attack conditions in the caption of Table V are comparable to the conditions described in these references. Also, to the authors' knowledge, this is the first report on the robustness of an audio watermarking system against the next generation codec USAC. A BER smaller than 5% is achieved with an average payload between 5 and 15 bps.

F. Prior works based on perceptual representation of signals

There are several methods which might have similarities to the proposed approach. In [46], a speech watermarking method is proposed that uses pitch modifications and quantization index modulation (QIM) for watermark embedding and is robust against de-synchronization attacks. Although [46] is robust against low bit rate speech codecs such as the AMR


Table V. Comparison to recent methods. The average results for 5 hours of different music genres and English voices are compared to the average reported results. During the attacks, the watermarked signals are modified as follows: for the random cropping, the number of croppings per total length equals 0.125%. For re-sampling, the signals are re-sampled at 22.05 kHz. For re-quantization, the signals are re-quantized at 8 bits. For pitch and amplitude scaling, the pitch and the amplitude of the signals are scaled with the .95 and 0.5-2 scaling ratios, respectively. For LPF, signals are low-pass filtered with a cut-off frequency equal to 11.025 kHz. NM means "not mentioned".

Method          Payload (bps)  MP3 (kbps), BER  Cropping  AWGN SNR=20dB  Resampling  Requantization (8 bits)  Pitch Scaling  Amplitude Scaling  LPF
Bhat [50]       45.9           32, .00          .00       .00            .00         .00                      NM             NM                 .00
Khaldi [26]     50.3           32, 1.00         .00       .00            1.00        .00                      NM             NM                 .00
Yeo [51]        10             96, ~.20         NM        NM             .00         .00                      NM             .00                NM
Shaoquan [47]   172            96, ~.07         .00       <3.00          .00         .00                      NM             NM                 NM
Zhang [52]      43.07          64, .22          NM        8.64           .63         1.39                     NM             .04                .46
Nishimura [8]   100            64, .00          NM        ~1.0           .00         .00                      .00            .00                .00
Our Method      ~56.5          32, .00          .00       .00            .00         .00                      .00            .00                .00

codec, no payload results are given for audio signals. In [26], after empirical mode decomposition of the audio signals, the watermark embedding is done on the extrema of the last IMF (intrinsic mode function) using QIM. Table V confirms that our approach outperforms this method in terms of robustness against 32 kbps MP3 compression. In [47], the watermark is inserted into the wavelet coefficients using QIM. Also, in [48], spread spectrum (SS) is applied on MDCT coefficients along with psychoacoustic masking for single-bit watermarking. Long-duration audio frames are used along with cepstral filtering at the decoder.

There are several differences between our approach and the above-mentioned transform domain methods. First, we evaluate the efficiency of a new transform, called spikegram, for robust watermarking. We introduce a new framework for audio watermarking called the two-dictionary approach. The encoder and the decoder search in a correlation space to find the maximum projection (minimum signal interference). Second, the proposed approach is a phase embedding method on gammatone kernels with the use of masking. Gammatone kernels are the building blocks used to represent the audio signal. Third, the proposed method performs efficient embedding into non-masked, high-value coefficients, which makes it robust against attacks such as unified speech and audio coding (24 kbps USAC) [29] and 32 kbps MP3 compression. Also, thanks to the use of PMP, by removing many coefficients under the masks, the signal interference is further reduced at the decoder.
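One plausible reading of the two-dictionary search, as a toy sketch: for a given bit, the encoder picks the candidate kernel whose projection onto the host pushes the decoder correlation furthest in the bit's direction (minimum interference). The random stand-in kernels, embedding strength alpha and frame length are illustrative assumptions, not the paper's gammatone dictionaries.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two unit-norm stand-in kernels for one channel (in the paper these would be
# a gammacosine/gammasine pair taken from the spikegram).
g_cos = rng.standard_normal(256)
g_cos /= np.linalg.norm(g_cos)
g_sin = rng.standard_normal(256)
g_sin /= np.linalg.norm(g_sin)
kernels = [g_cos, g_sin]

def embed(host, bit, alpha=0.5):
    # Encoder-side search in the correlation space: among candidates carrying
    # `bit`, keep the one where the host's own projection interferes least
    # (i.e. helps rather than fights the decoder correlation).
    scores = [bit * np.dot(host, g) for g in kernels]
    k = int(np.argmax(scores))
    return host + alpha * bit * kernels[k]

def decode(signal):
    # Decoder mirrors the search: maximum-|projection| kernel, bit = sign.
    projs = [np.dot(signal, g) for g in kernels]
    k = int(np.argmax(np.abs(projs)))
    return int(np.sign(projs[k]))

host = 0.03 * rng.standard_normal(256)
print(decode(embed(host, +1)), decode(embed(host, -1)))  # 1 -1
```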

G. Robustness against analogue hole experiments

Here, the robustness of the proposed method against the analogue hole is evaluated in a preliminary experiment. In Fig. 16, the BER of the proposed method against a simulated real room is given, using the image source method for modeling the room impulse response (RIR) [49], [31]. We embed one watermark bit in each second of the host signal (1 bps payload). We use an open source MATLAB code [30], [31] to simulate the room impulse responses. A cascade of the RIR of a 4m × 4m × 4m room with a 20 dB additive white Gaussian noise is considered as the simulated room impulse response. Also, only one microphone and one loudspeaker are modeled. The experiments are done for three distances d between the loudspeaker and the microphone: d = 1, 2 and 3

Figure 16. The BER of the proposed method against a simulated analogue hole in combination with a 20 dB additive noise. The horizontal axis indicates the distance between the loudspeaker and the microphone (1, 2 and 3 meters).

meters. For watermark embedding, all the bits in each 1-second frame are generated using a pseudo-random number generator. A spread spectrum (SS) correlation decoder is used. Hence, the 1-second sliding window is shifted sample by sample until the correlation of the SS decoder is above 0.75. Then, the watermark bit is decoded as the sign of the SS correlation. Results are reported in Fig. 16. From Fig. 16, the decoder can be robust against the analogue hole, when d = 1 meter, with a BER lower than 5%, while for d = 2 or 3 meters, the BER increases sharply. The experiments are done on the 5 signals presented in Table III.
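The sliding-window SS decoding described above can be sketched as follows. This is a toy illustration: the sampling rate, PN sequence, noise level and embedding gain are assumed values, not those of the actual experiment.

```python
import numpy as np

rng = np.random.default_rng(2)
fs = 1000                                # toy sampling rate: a 1-second frame = 1000 samples
pn = rng.choice([-1.0, 1.0], size=fs)    # pseudo-random spreading (PN) sequence

def ss_decode(signal, pn, threshold=0.75):
    # Slide a 1-second window sample by sample; once the normalized
    # correlation with the PN sequence exceeds the threshold, the
    # watermark bit is decoded as the sign of that correlation.
    n = len(pn)
    pn_norm = np.linalg.norm(pn)
    for start in range(len(signal) - n + 1):
        frame = signal[start:start + n]
        c = np.dot(frame, pn) / (np.linalg.norm(frame) * pn_norm + 1e-12)
        if abs(c) > threshold:
            return int(np.sign(c)), start
    return None, None

# Toy watermarked stream: background noise, with a "-1" bit spread by the PN
# sequence starting at sample offset 137.
offset = 137
signal = 0.1 * rng.standard_normal(fs + 500)
signal[offset:offset + fs] += -1.0 * pn

bit, found = ss_decode(signal, pn)
print(bit, found)  # -1 137
```

The sample-by-sample shift is what provides resynchronization after the acoustic path delay.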

VI. CONCLUSION

A new technique based on a spikegram representation of the acoustical signal and on the use of two dictionaries was proposed. Gammatone kernels along with perceptual matching pursuit are used for the spikegram representation. To achieve the highest robustness, the encoder selects the best kernels, those that will provide the maximum strength factors at the decoder, and embeds the watermark bits into the phase of the found kernels. Results show better performance of the proposed method against 32 kbps MP3 compression, with a robust payload of 56.5 bps, compared to several recent techniques. Furthermore,


for the first time, we report robustness results against USAC (unified speech and audio coding), a new standard for speech and audio coding. It is observed that the BER is still smaller than 5% for a payload between 5 and 15 bps. The approach is versatile for a large range of applications thanks to the adaptive nature of the algorithm (adaptive perceptual masking and adaptive selection of the kernels) and to the combination with well-established algorithms coming from the watermarking community. It has fair performance when compared with the state of the art. Research in this area (spikegrams for watermarking) is still in its infancy and there is plenty of room for improvement in future work. Moreover, we showed that the approach can be used for real-time watermark decoding thanks to the use of a projection-correlation based decoder. In addition, the two-dictionary method could be investigated for image watermarking.

APPENDIX A
ROBUSTNESS AGAINST TIME RESCALING ATTACK

In the context of watermark decoding, we first compute the correlation between the watermarked signal and a sliding gammasine or gammacosine candidate for the given channel j and different time samples k = 1, 2, ..., N_P, where N_P is the number of time samples in the processing window. The decoded watermark bit is the sign of the peak correlation, i.e., g_opt = argmax over g_{j,k} of |<y[αn], g_{j,k}>| and b = sign(<y[αn], g_opt>), where

<y[αn], g_{j,k}[n]> = <y[n], g_{j,k}[n/α]>,  k = 1, 2, ..., N_P.  (11)

When α is close to one, the positions of the peaks in (11) do not change compared to the no-attack situation. This results in robustness against mild time rescaling for single-bit watermarking. In multibit watermarking, the watermark gammatone is inserted in odd channel numbers. Thus, when α is close to one, odd channel numbers have weak correlations. This means that time rescaling, with a small rescaling factor, does not affect the decoder of the multi-bit watermarking. Figure 12 reports the BER for α between 1.05 and 1.13.

ACKNOWLEDGMENT

This work was initially funded by CRC and then by NSERC and Université de Sherbrooke. Many thanks to Hossein Najaf-Zadeh for the PMP code, to Malcolm Slaney for the auditory modeling toolbox, to Frederick Lavoie and Roch Lefebvre for managing the listening tests, to Philippe Gournay for providing the USAC code, to Simon Brodeur and Sean Wood for optimizing the PMP code, and to the participants of the listening test. This research was enabled in part by support provided by Calcul Quebec (www.calculquebec.ca) and Compute Canada (www.computecanada.ca). Many thanks to the reviewers for their very constructive comments that improved the paper.

REFERENCES

[1] S.E. Siwek, "True Cost of Recorded Music Piracy to the U.S. Economy", published by IPI, Aug. 2007.
[2] I. Cox, M. Miller, J. Bloom, J. Fridrich and T. Kalker, "Digital Watermarking and Steganography", San Francisco, USA: Morgan Kaufmann Publishers Inc., 2nd ed., 2007.
[3] M. Steinebach and J. Dittmann, "Watermarking-based digital audio data authentication", Eurasip J. Appl. Signal Process., pp.1001-1015, 2003.
[4] A. Boho, G. Van Wallendael, A. Dooms, J. De Cock, et al., "End-To-End Security for Video Distribution", IEEE Signal Processing Magazine, vol.30, no.2, pp.97-107, 2013.
[5] S. Majumder, K.J. Devi, S.K. Sarkar, "Singular value decomposition and wavelet-based iris biometric watermarking", IET Biometrics, vol.2, no.1, pp.21-27, 2013.
[6] M. Arnold, X. Chen, P. Baum, U. Gries, and G. Dorr, "A phase-based audio watermarking system robust to acoustic path propagation", IEEE Trans. on IFS, vol.9, no.3, pp.411-425, 2014.
[7] M. Unoki, R. Miyauchi, "Robust, blindly-detectable, and semi-reversible technique of audio watermarking based on cochlear delay", IEICE Trans. on Inf. Syst., vol.E98-D, no.1, pp.38-48, 2015.
[8] R. Nishimura, "Audio watermarking using spatial masking and ambisonics", IEEE Trans. on ASLP, vol.20, no.9, pp.2461-2469, 2012.
[9] G. Hua, J. Goh, and V. L. L. Thing, "Time-spread echo-based audio watermarking with optimized imperceptibility and robustness", IEEE Trans. ASLP, vol.23, no.2, pp.227-239, 2015.
[10] G. Hua, J. Goh, and V. L. L. Thing, "Cepstral analysis for the application of echo-based audio watermark detection", IEEE Trans. on IFS, vol.10, no.9, pp.1850-1861, 2015.
[11] Y. Xiang, I. Natgunanathan, D. Peng, W. Zhou, S. Yu, "A dual-channel time-spread echo method for audio watermarking", IEEE Trans. IFS, vol.7, no.2, pp.383-392, 2012.
[12] Y. Xiang, I. Natgunanathan, S. Guo, W. Zhou, and S. Nahavandi, "Patchwork-based audio watermarking method robust to desynchronization attacks", IEEE Trans. ASLP, vol.22, no.9, pp.1413-1423, 2014.
[13] C. M. Pun and X. C. Yuan, "Robust segments detector for desynchronization resilient audio watermarking", IEEE Trans. ASLP, vol.21, no.11, pp.2412-2424, 2013.
[14] B. Lei, I. Y. Soon, and E. L. Tan, "Robust SVD-based audio watermarking scheme with differential evolution optimization", IEEE Trans. ASLP, vol.21, no.11, pp.2368-2377, 2013.
[15] D. Megas, J. Serra-Ruiz, M. Fallahpour, "Efficient self-synchronised blind audio watermarking system based on time domain and FFT amplitude modification", Signal Processing, vol.90, no.12, pp.3078-3092, 2010.
[16] N. M. Ngo, M. Unoki, "Robust and reliable audio watermarking based on phase coding", IEEE ICASSP, pp.345-349, 2015.
[17] R.D. Patterson, B.C.J. Moore, "Auditory filters and excitation patterns as representations of frequency resolution", Academic Press Ltd., Frequency Selectivity in Hearing, London, pp.123-177, 1987.
[18] M. Slaney, "An Efficient Implementation of the Patterson-Holdsworth Auditory Filter Bank", Apple Computer Technical Report 35, 1993.
[19] N. Nikolaidis, I. Pitas, "Benchmarking of Watermarking Algorithms", in Intelligent Watermarking Techniques, World Scientific Press, pp.315-347, 2004.
[20] S.M. Valiollahzadeh, M. Nazari, M. Babaie-Zadeh, C. Jutten, "A new approach in decomposition over multiple-overcomplete dictionaries with application to image inpainting", Machine Learning for Signal Processing, IEEE MLSP2009, pp.1-6, 2009.
[21] Ch. H. Son, H. Choo, "Watermark detection from clustered halftone dots via learned dictionary", Signal Processing, vol.102, pp.77-84, 2014.
[22] A. Adler, V. Emiya, M.G. Jafari, M. Elad, R. Gribonval, M.D. Plumbley, "Audio Inpainting", IEEE Trans. ASLP, vol.20, no.3, pp.922-932, 2012.
[23] C. Fevotte, L. Daudet, S.J. Godsill, B. Torresani, "Sparse Regression with Structured Priors: Application to Audio Denoising", IEEE ICASSP, pp.57-60, 2006.
[24] E. Smith, M. S. Lewicki, "Efficient Coding of Time-Relative Structure Using Spikes", Neural Computation, vol.17, no.1, pp.19-45, 2005.
[25] R. Pichevar, H. Najaf-Zadeh, L. Thibault, H. Lahdili, "Auditory-inspired sparse representation of audio signals", Speech Communication, vol.53, no.5, pp.643-657, 2011.
[26] K. Khaldi, A.O. Boudraa, "Audio Watermarking Via EMD", IEEE Trans. ASLP, vol.21, no.3, pp.675-680, 2013.
[27] S. Quackenbush, "MPEG Unified Speech and Audio Coding", IEEE MultiMedia, vol.20, no.2, pp.72-78, 2013.
[28] Y. Yamamoto, T. Chinen and M. Nishiguchi, "A new bandwidth extension technology for MPEG Unified Speech and Audio Coding", IEEE ICASSP, pp.523-527, 2013.
[29] M. Neuendorf, P. Gournay, M. Multrus, J. Lecomte, B. Bessette, R. Geiger, S. Bayer, G. Fuchs, J. Hilpert, N. Rettelbach, R. Salami, G. Schuller, R. Lefebvre, B. Grill, "Unified speech and audio coding scheme for high quality at low bit rates", IEEE ICASSP, pp.1-4, 2009.
[30] http://www.eric-lehmann.com/fast-ISM-code/
[31] E. Lehmann, A. Johansson, and S. Nordholm, "Reverberation-time prediction method for room impulse responses simulated with the image-source model", IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA'07), pp.159-162, New Paltz, USA, 2007.
[32] T. Blumensath, M. E. Davies, "Iterative hard thresholding for compressed sensing", Applied and Computational Harmonic Analysis, vol.27, no.3, pp.265-274, 2009.
[33] J. A. Tropp, A. C. Gilbert, "Signal recovery from random measurements via orthogonal matching pursuit", IEEE Trans. Information Theory, vol.53, no.12, pp.4655-4666, 2007.
[34] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, "Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers", Foundations and Trends in Machine Learning, vol.3, no.1, pp.1-122, 2011.
[35] H. Najaf-Zadeh, R. Pichevar, H. Lahdili, and L. Thibault, "Perceptual matching pursuit for audio coding", Audio Engineering Society Convention 124, vol.5, 2008.
[36] S. Mallat and Z. Zhang, "Matching pursuit with time-frequency dictionaries", IEEE Trans. on SP, vol.41, pp.3397-3415, 1993.
[37] ITU-R BS.1116-1, "Methods for the Subjective Assessment of Small Impairments in Audio Systems Including Multichannel Sound Systems", Geneva, Switzerland: Int. Telecomm. Union, 1997.
[38] R. Pichevar, H. Najaf-Zadeh, L. Thibault, "New Trends in Biologically-Inspired Audio Coding", Recent Advances in Signal Processing, In-Tech Pub., 2009.
[39] P. Borwein, J. M. Mossinghoff, "Barker sequences and flat polynomials", in James McKee, Chris Smyth (eds.), Number Theory and Polynomials, LMS Lecture Notes 352, Cambridge University Press, pp.71-88, 2008.
[40] S. Voloshynovskiy, S. Pereira, T. Pun, J. J. Eggers, J. K. Su, "Attacks on digital watermarks: Classification, estimation-based attacks and benchmarks", IEEE Communications Magazine, vol.39, no.8, pp.118-126, 2001.
[41] A. Klein, Stream Ciphers, Springer-Verlag, 2013.
[42] P. Kabal, "An Examination and Interpretation of ITU-R BS.1387: Perceptual Evaluation of Audio Quality", TSP Lab Technical Report, Dept. Electrical and Computer Engineering, McGill University, May 2002.
[43] http://www.adobe.com/special/products/audition/syntrillium.html
[44] http://www.nch.com.au/wavepad/
[45] http://www.mathworks.com/
[46] D. J. Coumou and G. Sharma, "Insertion, Deletion Codes With Feature-Based Embedding: A New Paradigm for Watermark Synchronization With Applications to Speech Watermarking", IEEE Trans. on IFS, vol.3, no.2, pp.153-165, 2008.
[47] Shaoquan Wu, Jiwu Huang, Daren Huang and Y. Q. Shi, "Efficiently self-synchronized audio watermarking for assured audio data transmission", IEEE Trans. on Broadcasting, vol.51, no.1, pp.69-76, March 2005. doi: 10.1109/TBC.2004.838265
[48] D. Kirovski and H. S. Malvar, "Spread-spectrum watermarking of audio signals", IEEE Trans. on SP, vol.51, no.4, pp.1020-1033, Apr. 2003.
[49] J. Allen and D. Berkley, "Image method for efficiently simulating small-room acoustics", Journal of the Acoustical Society of America, vol.65, no.4, pp.943-950, April 1979.
[50] V. Bhat, I. Sengupta, A. Das, "An adaptive audio watermarking based on the singular value decomposition in the wavelet domain", Elsevier Journal on DSP, vol.20, no.6, pp.1547-1558, 2010.
[51] I.K. Yeo, H.J. Kim, "Modified patchwork algorithm: A novel audio watermarking scheme", IEEE Trans. on SAP, vol.11, no.4, pp.381-386, 2003.
[52] P. Zhang, Sh.Z. Xu, H.Z. Yang, "Robust Audio Watermarking Based on Extended Improved Spread Spectrum with Perceptual Masking", International Journal of Fuzzy Systems, vol.14, no.2, pp.289-295, 2012.
[53] G. Hua, L. Zhao, and G. Bi, "A secure data hiding system based on over-complete dictionary partitioning", in Proc. IEEE International Conference on Advanced Intelligent Mechatronics (AIM 2016), in press, Banff, 2016.
[54] http://learningenglish.voanews.com/
[55] http://www.bensound.com/royalty-free-music

Yousof Erfani received his B.Sc. degree in electrical engineering (electronics) and his M.Sc. degree in electrical engineering (communication), both from Sharif University of Technology, Tehran, Iran, in 2002 and 2004, respectively, and the Ph.D. degree from Université de Sherbrooke, Sherbrooke, QC, Canada, in electrical and computer engineering (signal processing) in 2016. He has recently joined the auditory engineering lab of McMaster University, ON, Canada, as a postdoctoral fellow, where he works on improving auditory models. His research interests include machine learning, neural signal processing, bio-inspired sparse representation, deep learning, Bayesian inference, watermarking and auditory models.

Ramin Pichevar was born in 1974 in Paris, France. He received his Bachelor of Science degree in electrical engineering (electronics) in 1996 and his Master of Science in electrical engineering (telecommunication systems) in 1999, both in Tehran, Iran. He received his Ph.D. in electrical and computer engineering from Université de Sherbrooke, Québec, Canada in 2004. In 2001 and 2002 he performed two summer internships at Ohio State University (USA) and at the University of Grenoble (France), respectively. From 2004 to 2006, he was a postdoctoral research associate in the computational neuroscience and signal processing laboratory at the Université de Sherbrooke under an NSERC (Natural Sciences and Engineering Research Council of Canada) Idea to Innovation (I2I) grant. He was a research scientist at the Communications Research Center (CRC), Ottawa, Canada from 2006 to 2013. He is currently with Apple Inc. (California). His domains of interest are signal and speech processing, neural networks with emphasis on bio-inspired neurons, speech recognition, audio and speech coding, and sparse techniques.

Jean Rouat (S'82-M'88-SM'06) received the M.Sc. degree in physics from Université de Bretagne, Brest, France in 1981, an E.&E. M.Sc.A. degree in speech coding and speech recognition from Université de Sherbrooke, QC, Canada in 1984, and an E.&E. Ph.D. in cognitive and statistical speech recognition jointly with Université de Sherbrooke and McGill University, Montréal, QC, Canada in 1988. His post-doc was in psychoacoustics with the MRC, Applied Psychology Unit, Cambridge, UK and in electrophysiology with the Institute of Physiology, Lausanne, Switzerland. He is currently with Université de Sherbrooke, where he founded the Computational Neuroscience and Intelligent Signal Processing Research group (NECOTIS). He is also an Adjunct Professor with the Department of Biological Sciences, Université de Montréal, and a full member of the Centre for Interdisciplinary Research in Music Media and Technology (CIRMMT), housed at the Schulich School of Music at McGill University, Montréal, QC, Canada. His translational research links neuroscience and engineering for the creation of new technologies and applications through a better understanding and integration of multimodal representations (vision and audition). Information hiding in multimedia signals, the development of low-power-consumption neural processing unit (NPU) hardware for sustainable development, and interactions with artists for multimedia and musical creations are examples of transfers that he leads based on the knowledge he gains from neuroscience and his knowledge of the visual and auditory systems. He is leading machine-learning funded projects to develop sensory substitution and intelligent devices. He is also the coordinator of the interdisciplinary IGLU CHIST-ERA European consortium (IGLU: Interactive Grounded Language Understanding) for the development of an intelligent agent that learns through multimodal grounded interactions.