Techniques and Applications for Audio 695410106 謝育任 695410121 劉威麟
Outline
Audio Watermark
Audio Classification:
Security Monitoring Using Microphone Arrays and Audio Classification
A Generic Audio Classification and Segmentation Approach for Multimedia Indexing and Retrieval
Introduction
What is a watermark? A kind of technology for data hiding
Paper watermarks appeared nearly 700 years ago; the oldest one found dates from 1292
The idea of digital image watermarking arose independently in 1990
Around 1993, the word “water mark” was coined
Terminology
Steganography stands for techniques in general that allow secret communication
Watermarking, as opposed to steganography, has the additional notion of robustness against attacks
Fingerprinting and labeling are terms that denote special applications of watermarking, e.g., copyright
Bit-stream watermarking is sometimes used for data hiding or watermarking of compressed data
Requirements
A watermark shall convey as much information as possible
A watermark should in general be secret and should only be accessible by authorized parties
A watermark should stay in the host data regardless of whatever happens to the host data
A watermark should be imperceptible
Further requirements depend on the media to be watermarked:
Blind vs. non-blind
Real-time operation may be required
Low time complexity
Basic Watermarking Principle
There are three main issues in the design of a watermarking system:
Design of the watermark signal W to be added to the host signal. Typically, the watermark signal depends on a key K and watermark information I; possibly, it may also depend on the host data X into which it is embedded
Design of the embedding method itself, which incorporates the watermark signal W into the host data X, yielding watermarked data Y
Design of the corresponding extraction method, which recovers the watermark information from the signal mixture using the key, either with the help of the original or without the original
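The three design issues can be illustrated with a minimal Python sketch. It is not the method of any cited paper: the function names, the ±1 keyed sequence, and the `strength` value are assumptions chosen for illustration. W is built from a key K and information bits I, embedding is a plain addition Y = X + strength · W, and extraction here uses the original X (the "with help of the original" case).

```python
import numpy as np

def keyed_sequence(key, n_bits, chips_per_bit):
    """Pseudorandom +/-1 sequence that depends only on the key K (illustrative)."""
    rng = np.random.default_rng(key)
    return rng.choice([-1.0, 1.0], size=(n_bits, chips_per_bit))

def make_watermark(key, info_bits, chips_per_bit):
    """Issue 1: build W from key K and information I."""
    pn = keyed_sequence(key, len(info_bits), chips_per_bit)
    symbols = 2.0 * np.asarray(info_bits, dtype=float) - 1.0  # 0/1 -> -1/+1
    return (symbols[:, None] * pn).ravel()

def embed(x, w, strength=0.01):
    """Issue 2: additive embedding, Y = X + strength * W."""
    y = x.copy()
    y[:len(w)] += strength * w
    return y

def extract_with_original(y, x, key, n_bits, chips_per_bit):
    """Issue 3: recover I from Y using the key and the original X."""
    pn = keyed_sequence(key, n_bits, chips_per_bit)
    residual = (y - x)[:n_bits * chips_per_bit].reshape(n_bits, chips_per_bit)
    return [int(v > 0) for v in (residual * pn).sum(axis=1)]
```

Extraction without the original would correlate Y itself with the keyed sequence, which only works when each bit is spread over enough samples to dominate the host signal.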
Embedding Technologies for Audio
Low-bit coding
By replacing the least significant bit of each sampling point by a coded binary string
The major disadvantage of this method is its poor immunity to manipulation
This method is useful only in closed, digital-to-digital environments
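Low-bit coding is simple enough to sketch in a few lines of Python. The example assumes 16-bit integer samples held in a NumPy array; the function names are illustrative. Each payload bit overwrites the least significant bit of one sample, so no sample changes by more than one quantization step.

```python
import numpy as np

def lsb_embed(samples, bits):
    """Overwrite the least significant bit of the first len(bits) int16 samples."""
    out = samples.copy()
    b = np.asarray(bits, dtype=np.int16)
    out[:len(b)] = (out[:len(b)] & ~np.int16(1)) | b  # clear the LSB, then set it to the bit
    return out

def lsb_extract(samples, n_bits):
    """Read the payload back from the least significant bits."""
    return (samples[:n_bits] & 1).tolist()
```

Any processing that alters sample values (resampling, lossy compression, added noise) changes the low bits, which is why the method has poor immunity to manipulation.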
Phase coding
By substituting the phase of an initial audio segment with a reference phase that represents the data
Procedure
Break the sound sequence s[i], (0 ≤ i ≤ I-1), into a series of N short segments, sn[i], where (0 ≤ n ≤ N-1)
Apply a K-point discrete Fourier transform to the n-th segment, sn[i], where (K = I/N), and create a matrix of the phase, ψn(ωk), and magnitude, An(ωk), for (0 ≤ k ≤ K-1)
Store the phase difference between each pair of adjacent segments for (0 ≤ n ≤ N-1)
A binary set of data is represented as ψdata = π/2 or -π/2, representing 0 or 1
Re-create the phase matrices for n > 0 by using the phase difference
Use the modified phase matrix and the original magnitude matrix to reconstruct the sound signal by applying the inverse DFT
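The procedure above can be sketched in Python with NumPy. This is a simplified illustration under stated assumptions: the signal is real-valued, so the one-sided transform `np.fft.rfft` stands in for the K-point DFT; the data are written into frequency bins 1 to len(bits) of the first segment; and the function names and segment length are hypothetical.

```python
import numpy as np

def phase_encode(signal, bits, seg_len):
    """Embed bits in the phase of the first segment and keep relative phases."""
    n_seg = len(signal) // seg_len
    segs = np.asarray(signal[:n_seg * seg_len], dtype=float).reshape(n_seg, seg_len)
    spec = np.fft.rfft(segs, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    dphi = np.diff(phase, axis=0)               # phase difference between adjacent segments
    new_phase = phase.copy()
    # +pi/2 encodes 0 and -pi/2 encodes 1, written into bins 1..len(bits) of segment 0
    new_phase[0, 1:len(bits) + 1] = np.where(np.asarray(bits) == 0, np.pi / 2, -np.pi / 2)
    for n in range(1, n_seg):                   # re-create the phases for n > 0
        new_phase[n] = new_phase[n - 1] + dphi[n - 1]
    out = np.fft.irfft(mag * np.exp(1j * new_phase), n=seg_len, axis=1)
    return out.reshape(-1)

def phase_decode(signal, n_bits, seg_len):
    """Read the sign of the phase in the first segment (assumes known segment start)."""
    phase = np.angle(np.fft.rfft(signal[:seg_len]))
    return [0 if p > 0 else 1 for p in phase[1:n_bits + 1]]
```

Decoding needs the segment length and the start of the first segment, and len(bits) must stay below seg_len / 2 so that only bins below the Nyquist bin are modified.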
• Spread spectrum coding
In the decoding stage, the following is assumed:
The pseudorandom key is maximal
The key stream for the encoding is known by the receiver
Signal synchronization is done, and the start/stop points of the spread data are known
The following parameters are known by the receiver: chip rate, data rate, and carrier frequency
To keep the noise level low and inaudible, the spread code is attenuated to roughly 0.5 percent of the dynamic range of the host sound file
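A minimal baseband sketch of spread spectrum coding in Python follows. It makes several simplifying assumptions that are not in the slides: no carrier modulation is applied, the chip sequence comes from a seeded NumPy generator rather than a maximal-length shift register, "dynamic range" is taken as the peak-to-peak range of the host, and the function names are illustrative. The receiver is assumed to know the key, the chip count per bit, and the start of the data, as listed above.

```python
import numpy as np

def ss_embed(host, bits, key, chips_per_bit, level=0.005):
    """Add a keyed +/-1 chip sequence, sign-flipped per data bit, at a low level."""
    rng = np.random.default_rng(key)
    chips = rng.choice([-1.0, 1.0], size=len(bits) * chips_per_bit)
    symbols = np.repeat(2.0 * np.asarray(bits, dtype=float) - 1.0, chips_per_bit)
    amp = level * (host.max() - host.min())   # roughly 0.5 percent of the dynamic range
    out = host.astype(float).copy()
    out[:len(chips)] += amp * symbols * chips
    return out

def ss_decode(signal, n_bits, key, chips_per_bit):
    """Correlate each bit interval with the same keyed chip sequence."""
    rng = np.random.default_rng(key)
    chips = rng.choice([-1.0, 1.0], size=n_bits * chips_per_bit)
    corr = (signal[:len(chips)] * chips).reshape(n_bits, chips_per_bit).sum(axis=1)
    return [int(c > 0) for c in corr]
```

Because the embedded code is so weak relative to the host, each bit has to be spread over many chips before the correlation rises reliably above the host's own contribution, which limits the data rate.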
• Echo data hiding
The data are hidden by varying three parameters of the echo: initial amplitude, decay rate, and offset
Example
Decoding: magnitude of the autocorrelation of the encoded signal’s cepstrum
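Echo hiding can be sketched in Python as follows. This is a simplified illustration, not the decoder named on the slide: only the echo offset is varied (amplitude is fixed and no decay is modelled), and the decoder compares the real cepstrum at the two candidate offsets instead of computing the autocorrelation of the cepstrum. The offsets `d0` and `d1`, the echo amplitude, and the function names are assumptions.

```python
import numpy as np

def echo_embed(host, bits, seg_len, d0=50, d1=100, amp=0.5):
    """Add an echo whose delay is d0 samples for a 0 bit and d1 samples for a 1 bit."""
    n = len(bits) * seg_len
    x = np.asarray(host[:n], dtype=float)

    def delayed(d):
        return np.concatenate([np.zeros(d), x[:-d]])

    mixer = np.repeat(np.asarray(bits, dtype=bool), seg_len)  # selects the echo per segment
    return x + amp * np.where(mixer, delayed(d1), delayed(d0))

def echo_decode(signal, n_bits, seg_len, d0=50, d1=100):
    """Pick the offset whose cepstral value is larger in each segment."""
    bits = []
    for i in range(n_bits):
        seg = signal[i * seg_len:(i + 1) * seg_len]
        spectrum = np.abs(np.fft.fft(seg)) + 1e-12            # avoid log(0)
        cep = np.fft.ifft(np.log(spectrum)).real              # real cepstrum
        bits.append(int(cep[d1] > cep[d0]))
    return bits
```

An echo at delay d adds a ripple of period 1/d to the log spectrum, so the cepstrum shows a peak at quefrency d; that peak is what the decoder looks for.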
Classification of attacks
“Simple attacks” (other possible names include “waveform attacks” and “noise attacks”) are conceptually simple attacks that attempt to impair the embedded watermark by manipulations of the whole watermarked data without an attempt to identify and isolate the watermark
“Detection-disabling attacks” (other possible names include “synchronization attacks”) are attacks that attempt to break the correlation and to make the recovery of the watermark impossible or infeasible for a watermark detector
Classification of attacks (continued)
“Ambiguity attacks” (other possible names include “deadlock attacks”, “inversion attacks”, “fake-watermark attacks”, and “fake-original attacks”) are attacks that attempt to confuse by producing fake original data or fake watermarked data
“Removal attacks” are attacks that attempt to analyze the watermarked data, estimate the watermark or the host data, separate the watermarked data into host data and watermark, and discard only the watermark
Watermark algorithm
LSB
Works in the time domain and embeds the watermark in the least significant bits
The message is embedded many times into the audio signal
Parameters: secret key, error correction code, embedding message, etc.
Microsoft
Works in the frequency domain and embeds the watermark in the frequency coefficients by using a spread spectrum technique
Only one parameter: embedding message
VAWW (Viper Audio Water Wavelet)
Works in the wavelet domain and embeds the watermark in selected coefficients
Parameters:
Secret key
Threshold, which selects the coefficients for embedding. The default value is 40
Scale factor, which sets the embedding strength. The default value is 0.2
Publimark
Open source tool
Parameters: embedded message, public/private key
Reference
Hartung, F.; Kutter, M. Multimedia Watermarking Techniques. Proceedings of the IEEE, Vol. 87, No. 7, July 1999
Bender, W.; Gruhl, D.; Morimoto, N.; Lu, A. Techniques for Data Hiding. IBM Systems Journal, Vol. 35, Nos. 3&4, 1996
Lang, A.; Dittmann, J. Transparency and Complexity Benchmarking of Audio Watermarking Algorithms Issues
Audio Classification
Audio Classification: Security, Multimedia Indexing and Retrieval, Other
Introduction
The proposed system : Location Type of sound
SNR (signal to noise ratio) :
Reflection coefficient : A reflection coefficient describes either the amplitude or the intensity of a reflected wave relative to an incident wave.
Proposed Security monitoring instrument
Center Clipping
c(n) is the center-clipped sample at time index n
s(n) is the audio sample at time index n
PR algorithm
The PR algorithm divides the audio segment into frames, estimates the presence of human pitch in each frame, and calculates a PR parameter.
PR = NP / NF
NP: the number of frames that have human pitch
NF: the total number of frames
Pitch Value
Human Pitch = {Pitch : 70Hz < Pitch < 280Hz}
$$\text{Pitch} = \arg\left(\max\{R_{xx}(\tau) : R_{xx}(\tau) \ge 0.4 \cdot \text{RMSE}\}\right)$$
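The center clipping and PR steps can be sketched in Python as below. Several details are assumptions rather than the paper's exact rules: the clipping level is set to 30 percent of the frame's peak, the pitch test compares the autocorrelation peak against 0.4 times the zero-lag autocorrelation (a stand-in for the RMSE-based threshold on the slide), frames do not overlap, and all function names are illustrative.

```python
import numpy as np

def center_clip(s, ratio=0.3):
    """Zero samples below the clipping level and shift the rest toward zero (assumed rule)."""
    cl = ratio * np.max(np.abs(s))
    return np.where(s > cl, s - cl, np.where(s < -cl, s + cl, 0.0))

def has_human_pitch(frame, fs, threshold=0.4):
    """True if the autocorrelation peaks strongly at a lag inside 70 Hz < pitch < 280 Hz."""
    c = center_clip(np.asarray(frame, dtype=float))
    r = np.correlate(c, c, mode="full")[len(c) - 1:]   # R_xx(tau) for tau >= 0
    if r[0] <= 0:
        return False
    lo = int(fs / 280) + 1            # smallest lag with pitch below 280 Hz
    hi = int(np.ceil(fs / 70)) - 1    # largest lag with pitch above 70 Hz
    lag = lo + int(np.argmax(r[lo:hi + 1]))
    return bool(r[lag] >= threshold * r[0])

def pitch_ratio(audio, fs, frame_len):
    """PR = NP / NF over non-overlapping frames."""
    frames = [audio[i:i + frame_len]
              for i in range(0, len(audio) - frame_len + 1, frame_len)]
    n_p = sum(has_human_pitch(f, fs) for f in frames)
    return n_p / len(frames)
```

A steady 150 Hz tone yields a PR of 1, while white noise yields a PR near 0, which is the contrast the speech/non-speech decision relies on. The frame length must be longer than the largest lag searched (fs / 70 samples).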
Proposed system
Non-speech Classification
A Time Delay Neural Network (TDNN) is used to classify a non-speech audio segment into an audio type (e.g., door opening, fan noise, etc.)
MFCC (Mel-Filtered Cepstral Coefficient) :
△ MFCC (Delta Mel-Filtered Cepstral Coefficient) :
Simulation Environment
Simulation Results
OR (overlap ratio) = 0.85
SD (segment duration) = 400 ms
Introduction
A Generic Audio Classification and Segmentation Approach for Multimedia Indexing and Retrieval
Bi-modal:
Bit-stream mode: a series of bits
Generic mode: temporal and spectral information is extracted from the PCM samples
Classification: Speech, Music, Silence, Fuzzy
Erroneous classification
Critical Errors : one pure class is misclassified into another pure class.
Semi-critical Errors : a fuzzy class type is misclassified as one of the pure class types.
Non-critical errors : a pure class is misclassified as a fuzzy class.
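The three error categories map directly onto a small lookup, sketched here in Python. The label strings and the choice to treat every non-fuzzy class (including silence) as "pure" are assumptions made for illustration.

```python
PURE_CLASSES = {"speech", "music", "silence"}   # assumption: every non-fuzzy class is pure

def error_severity(true_label, predicted_label):
    """Grade a classification result using the three error categories."""
    if true_label == predicted_label:
        return "correct"
    if true_label in PURE_CLASSES and predicted_label in PURE_CLASSES:
        return "critical"          # one pure class taken for another pure class
    if true_label == "fuzzy" and predicted_label in PURE_CLASSES:
        return "semi-critical"     # fuzzy content taken for a pure class
    if true_label in PURE_CLASSES and predicted_label == "fuzzy":
        return "non-critical"      # pure content taken for fuzzy
    raise ValueError("unknown label")
```

For instance, `error_severity("speech", "music")` is critical, while `error_severity("speech", "fuzzy")` is non-critical.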
Classification and Segmentation framework
Spectral Template
Pulse Code Modulation : PCM is a common method of storing and transmitting uncompressed digital audio. PCM is also a very common format for AIFF and WAV files.
1 : positive voltage pulse
0 : absence of pulse
Spectral template: it is formed from the input audio source, and it can be obtained from the MDCT coefficients of MP3 granules.
Power Spectrum: it is obtained from the PCM samples.
About MP3
The Layer 3 encoding process starts by dividing the audio signal into frames, each of which corresponds to one or two granules.
Each granule has 576 PCM samples.
There are three windowing modes in the MPEG Layer 3 encoding scheme: Long, Short, Mixed
Bit-Stream Mode
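MDCT (Modified discrete cosine transform):
$$X_k = \sum_{n=0}^{2N-1} x_n \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2} + \frac{N}{2}\right)\left(k + \frac{1}{2}\right)\right], \quad k = 0, \ldots, N-1$$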
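A direct evaluation of the MDCT sum can be sketched in Python. This is a plain O(N²) illustration of the definition, with no windowing and none of the fast algorithms a real MP3 codec uses; the function name is an assumption.

```python
import numpy as np

def mdct(x):
    """Map 2N input samples to N MDCT coefficients by direct summation."""
    x = np.asarray(x, dtype=float)
    if len(x) % 2:
        raise ValueError("input length must be even (2N samples)")
    N = len(x) // 2
    n = np.arange(2 * N)                # input sample index
    k = np.arange(N)[:, None]           # output coefficient index, as a column
    basis = np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
    return basis @ x
```

The transform halves the number of values: a block of 2N samples yields N coefficients, and consecutive blocks are overlapped by N samples so that the signal can still be reconstructed.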
Bit-Stream Mode
MDCT(w, f)
w represents the window number
f represents the line frequency index
Frame Features
Total Frame Energy (TFE) Calculation: used to detect silent frames
$$TFE_j = \sum_{w}^{NoW} \sum_{f}^{NoF} \left(SPEQ_j(w, f)\right)^2$$
Band Energy Ratio (BER) Calculation: the energy ratio between two spectral regions that are separated by a single cut-off frequency
$$BER_j(f_c) = \frac{\sum_{w}^{NoW} \sum_{f=0}^{f_{f_c}} \left(SPEQ_j(w, f)\right)^2}{\sum_{w}^{NoW} \sum_{f=f_{f_c}}^{NoF} \left(SPEQ_j(w, f)\right)^2}$$
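The two frame features can be sketched in Python, assuming the spectral template of one frame is stored as a NumPy array `speq` of shape (NoW, NoF), that is, windows by frequency lines. The function names and the convention that the cut-off line belongs to both bands are assumptions.

```python
import numpy as np

def total_frame_energy(speq):
    """TFE: sum of squared spectral values over all windows and frequency lines."""
    return float(np.sum(np.asarray(speq, dtype=float) ** 2))

def band_energy_ratio(speq, cutoff_line):
    """BER: energy up to the cut-off line divided by energy from the cut-off line upward."""
    speq = np.asarray(speq, dtype=float)
    low = np.sum(speq[:, :cutoff_line + 1] ** 2)    # lines 0 .. cutoff
    high = np.sum(speq[:, cutoff_line:] ** 2)       # lines cutoff .. NoF - 1
    return float(low / high)
```

A frame whose TFE falls below a chosen threshold can be marked silent; a BER well above 1 indicates that most of the energy sits below the cut-off frequency. The sketch does not guard against an upper band with zero energy.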
Frame Features
Fundamental Frequency Estimation : if the input audio is harmonic over a fundamental frequency, the real fundamental frequency (FF) value can be estimated from the spectral coefficient (SPEQ(w,f))
Subband Centroid Frequency Estimation : Subband Centroid (SC) is the first moment of the spectral distribution.
$$f_{SC} = \frac{\sum_{w}^{NoW} \sum_{f}^{NoF} \left(SPEQ(w, f) \times FL(f)\right)}{\sum_{w}^{NoW} \sum_{f}^{NoF} SPEQ(w, f)}$$
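The subband centroid is a weighted mean of frequencies, which a short Python sketch makes concrete. It assumes `speq` is a (NoW, NoF) array of non-negative spectral values and that `line_freqs` holds the frequency in Hz of each frequency line, playing the role of FL(f); both names are illustrative.

```python
import numpy as np

def subband_centroid(speq, line_freqs):
    """First moment of the spectral distribution, in the units of line_freqs."""
    speq = np.asarray(speq, dtype=float)
    line_freqs = np.asarray(line_freqs, dtype=float)
    return float(np.sum(speq * line_freqs) / np.sum(speq))
```

For example, spectral values 1, 1, 2 on lines at 100, 200 and 300 Hz give a centroid of (100 + 200 + 600) / 4 = 225 Hz.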
Initial Classification
Segment Features
Transition Rate (TR) : transition between consecutive frames. TR has a forced speech classification.
Fundamental Frequency Segment Feature : FF has a forced music classification
Subband Centroid Segment Feature: SC has two forced classification regions, one for music and the other for speech content.
$$TR(S) = \frac{NoF + \sum_{i}^{NoF} TP_i}{2\,NoF}$$
Step2
Generic Decision Table
Step3