techniques and applications for audio 695410106 謝育任 695410121 劉威麟

Techniques and Applications for Audio

695410106 謝育任695410121 劉威麟

Outline

Audio Watermark

Audio Classification : Security Monitoring Using Microphone Arrays and Audio

Classification A Generic Audio Classification and Segmentation Approach for

Multimedia Indexing and Retrieval

Introduction

What’s watermark? A kind of technology for data hiding

Paper watermarks appears nearly 700 years ago The oldest one be found in 1292

The idea of digital image watermarking arose independently in 1990 Around 1993, coined the word “water mark”

Terminology

Steganography stands for techniques in general that allow secrete communication

Watermarking , as opposed to steganography, has the additional notion of robustness against attacks

Fingerprinting and labeling are terms that denote special applications of watermarking. Ex. Copyright

Bit-stream watermarking is sometimes used for data hiding or watermarking of compressed data

Requirement

A watermark shall convey as much information as possible

A watermark should in general be secret and should only be accessible by authorized parties

A watermark should stay in the host data regardless of whatever happens to the host data

A watermark should be imperceptible

Depend on media to be watermarked Blend V.S Non-blend Maybe required in a real time Low complexity-time

Basic Watermarking Principle

There are three main issues in the design of a watermarking system

Design of the watermark signal W to be added to the host signal. Typically, the watermark signal depends on a key K and watermark information I

possibly, it may also depend on the host data X into which it is embedded

Design of the embedding method itself that incorporates the watermark signal W into the host data X yielding watermarked data Y

Design of the corresponding extraction method that recovers the watermark information from the signal mixture using the key and with help of the original

or without the original

Embedding Technologies for Audio

Low-bit coding By replacing the least significant bit of each sampling point by a coded

binary string The major disadvantage of this method is its poor immunity to manipula

tion This method is useful only in close, digital-to-digital environment

Phase coding By substituting the phase of an initial audio segment with a reference p

hase that represents the data Procedure

Break the sound sequence s[i], (0 i I-1), into a series of N short segment, ≦ ≦sn[i] where (0 n N-1)≦ ≦

Apply a K-points discrete Fourier transform to n-th segment, Sn[i], where (k=1/N), and create a matrix of the phase, ψn(Wk), and magnitude, An(Wk) for (0 k K-1) ≦ ≦

Store the phase difference between each adjacent segment for (0 n N-1)≦ ≦

A binary set of data is represented as a ψdata = π/ 2 or –π/ 2 representing 0 or 1

Re-create phase matrixes for n > 0 by using the phase difference

Use the modified phase matrix and the original magnitude matrix to reconstruct the sound signal by applying the inverse DFT

• Spread spectrum coding

In the decoding stage, the following is assumed: The pseudorandom key is maximal The key stream for the encoding is known by the receiver.

Signal synchronization is done, and the start/stop point of the spread data are known

The following parameters are known by the receiver: chip rate, data rate, and carrier frequency

To keep the noise level low and inaudible, the spread code is attenuated to roughly 0.5 percent of the dynamic range of the host sound file

• Echo data hiding– The data are hidden by varying three parameters of the echo: initial

amplitude, decay rate, and offset

Example

Decoding:magnitude of the autocorrelation of the encoded signal’s cepstr

um:

Classification of attacks

“Simple attacks” (other possible names include “waveform attacks” and “noise attacks”) are conceptually simple attacks that attempt to impair the embedded watermark by manipulations of the whole watermarked data without an attempt to identify and isolate the watermark

“Detection-disabling attacks” (other possible names include “synchronization attacks”) are attacks that attempt to break the correlation and to make the recovery of the watermark impossible or infeasible for a watermark detector

Classification of attacks (continue)

“Ambiguity attacks” (other possible names include “deadlock attacks”, “inversion attacks”, “fake-watermark attacks”, and “fake-original attacks”) are attacks that attempt to confuse by producing fake original data or fake watermarked data

“Removal attacks” are attacks that attempt to analyze the watermarked data, estimate the watermark or the host data, separate the watermarked data into host data and watermark, and discard only the watermark

Watermark algorithm

LSB Working in time-domain and embedding the watermark in the

least significant bits The message is embedded many times into audio signal Parameters

Secrete key, error correction code, embedding message, etc

Microsoft Working in frequency domain and embedding watermark in

the frequency coefficients by using spread spectrum technique

Only one parameter: embedding message

VAWW ─ Viper Audio Water Wavelet Working in wavelet domain and embedding the watermark in selected c

oefficients Parameter :

Secrete key Threshold, which selects the coefficients for embedding. The default value is

40 Scale factor, which means the embedding strength. The default value is 0.2

Publimark Open source tool Parameter:

Embedded message Public/private key

Reference

Multimedia Watermarking TechniqueBy Hartung, F.; Kutter, M.; PROCEEDINGS OF THE IEEE, VOL. 87, NO. 7, JULY 1999

Techniques for data hidingBy W. Bender D. Gruhl N. Morimoto A. Lu ; IBM SYSTEMS JOURNAL, VOL 35, NOS 3&4, 1996

Transparency and Complexity Benchmarking of Audio Watermarking Algorithms IssusBy Andreas Lang, Jana Dittmann

Audio Classification

Audio Classification Security Multimedia Indexing and Retrieval Other

Introduction

The proposed system : Location Type of sound

SNR (signal to noise ratio) :

Reflection coefficient : A reflection coefficient describes either the amplitude or the intensity of a reflected wave relative to an incident wave.

Proposed Security monitoring instrument

Center Clipping

c(n) is the center clipped sample at time index n s(n) is the audio sample at time index n

PR algorithm

The PR algorithm divides the audio segment into frames, estimate the presence of the human pitch in each frame, and calculates a PR parameter.

PR = NP / NF NP : the numbers of frames that have human pitch NF : the total number of frames

Pitch Value

Human Pitch = {Pitch : 70Hz < Pitch < 280Hz}

arg (max{ ( ) : ( ) 0.4 })xx xxPitch R R RMSE

Proposed system

Non-speech Classification

Time Delay Neural Network (TDNN) is used to classify a nonspeech audio segment into an audio Type. (e.g., door opening, fan noise…etc)

MFCC (Mel-Filtered Cepstral Coefficient) :

△ MFCC (Delta Mel-Filtered Cepstral Coefficient) :

Simulation Enviroment

Simulation Environment :

Simulation Results

OR (overlap ratio) = 0.85 .

SD ( segment duration) = 400 MS

Introduction

A Generic Audio Classification and Segmentation Approach for Multimedia Indexing and Retrieval

Bi-model : Bit-Stream : a series of bits Generic mode : temporal and spectral information is extracted

from the PCM samples. Classification :

Speech Music Silence Fuzzy

Erroneous classification

Critical Errors : one pure class is misclassified into another pure class.

Semi-critical Errors : a fuzzy class type is misclassified as one of the pure class types.

Non-critical errors : a pure class is misclassified as a fuzzy class.

Classification and Segmentation framework

Spectral Template

Pulse Code Modulation : PCM is a common method of storing and transmitting uncompressed digital audio. PCM is also a very common format for AIFF and WAV files.

1 : positive voltage pulse

0 : absence of pulse

Spectral template : it formed from the input audio source, and it can be obtained from the MDCT coefficient of MP3 granules.

Power Spectrum : it obtained from the PCM samples.

About MP3

Layer3 encoding process starts by dividing the audio signal into frames, which corresponds to one or two granules.

Each granules has 576 PCM samples.

There are three windowing modes in Mpeg layer3 encoding scheme : Long, Short, Mixed

Bit-Stream Mode

MDCT (Modified discrete cosine transform) : 2 1

0

1 1cos[ ( )( )]

2 2 2

N

k nn

Nx x n k

N

Bit-Stream Mode

MDCT (w, f) w represents the window number f represents the line frequency index

Frame Features

Total Frame Energy (TFE) Calculation : to detect silent frame

Band Energy Ratio (BER) Calculation : to detect the ratio between of two spectral regions that are separated by a single cut-off frequency.

2( ( , ) )NoW NoF

j jw f

TFE SPEQ w f

2

0

2

( ( , ))( )

( ( , ))

c

c

c

Now f f

jw f

j cNow f f

jw f f f

SPEQ w fBER f

SPEQ w f

Frame Features

Fundamental Frequency Estimation : if the input audio is harmonic over a fundamental frequency, the real fundamental frequency (FF) value can be estimated from the spectral coefficient (SPEQ(w,f))

Subband Centroid Frequency Estimation : Subband Centroid (SC) is the first moment of the spectral distribution.

( ( , )* ( ))

( , )

NoW NoF

w fsc NoW NoF

w f

SPEQ w f FL ff

SPEQ w f

Initial Classification

Segment Features

Transition Rate (TR) : transition between consecutive frames. TR has a forced speech classification.

Fundamental Frequency Segment Feature : FF has a forced music classification

Subband Centroid Segment Feature : SC has two forced classification region, one for music and the other for speech content.

( )2

NoF i

iNoF TP

TR SNoF

Generic Decision Table

techniques and applications for audio 695410106 謝育任 695410121 劉威麟

Documents