7 dec'08comp30291 section 81 university of manchester school of computer science comp30291...

87
7 Dec'08 Comp30291 Section 8 1 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech, Music & Video

Upload: hannah-harris

Post on 26-Dec-2015

221 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 1

UNIVERSITY of MANCHESTER

School of Computer Science

Comp30291

Digital Media Processing 2008-09

Section 8:

Processing Speech, Music & Video

Page 2: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 2

8.1 Digitising speech

• Traditional telephone channels restrict speech to 300- 3400 Hz. • Considered not to incur serious loss of intelligibility. • Significant effect on naturalness of sound.

• Once band-limited, speech may be sampled at 8 kHz.

•ITU-T G711 standard for speech in POTS allocates 64000 b/s for 8 kHz sampling rate with 8 bits / sample.

•Exercise: Why are components below 300 Hz removed?

Page 3: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 3

8.1.1 International standards for speech coding:

• ITU committee CCITT until 1993 part of UNESCO.• Since 1993, CCITT has become part of ITU-T. • Within ITU-T is study group responsible for speech digitisation & coding standards. •Among other organisations defining standards for telecoms & telephony are:

“TCH-HS”: part of ETSI (GSM). “TIA” USA equivalent of ETSI.“RCR” Japanese equivalent of ETSI. “Inmarsat” & various committees within NATO.

• Standards exist for digitising “wide-band” speech (50 Hz to 7 kHz) e.g. ITU G722 .

Page 4: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 4

8.1.2. Uniform quantisation.

•Quantisation: each sample, x[n], of x(t) approximated by closest

available quantisation level. •Uniform quantisation: constant voltage difference between levels.

•With 8 bits, & input range V , have 256 levels with = V/128.•If x(t) between V, & samples are rounded, uniform quantisation

produces 2/ ][ 2/ where][][ ][ˆ nenenxnx

• Otherwise, overflow will occur & magnitude of error may >> /2. • Overflow is best avoided.

Page 5: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 5

•Samples e[n] ‘random’ within /2. •When quantised signal converted back to analogue, adds random error or “noise” signal to x(t).

•Noise heard as ‘white noise’ sound added to x(t).•Samples e[n] have uniform probability between /2. It follows that the mean square value of e[n] is:

123

11)(

22/

2/

32/

2/

22/

2/

2

e

deedeepe

Power of analogue quantisation noise in 0 Hz to fS/2.

Noise due to uniform quantisation error

Page 6: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 6

8.1.3. Signal-to-quantisation noise ratio (SQNR) • Measure how seriously signal is degraded by quantisation noise.

(dB.) decibelsin power noiseon quantisati

power signallog10 SQNR 10

• With uniform quantisation, quantisation-noise power is 2/12 • Independent of signal power. • Therefore, SQNR will depend on signal power.• If we amplify signal as much as possible without overflow, for sinusoidal waveforms with m-bit uniform quantiser:

•Approximately true for speech also.

SQNR 6m + 1.8 dB.

Page 7: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 7

•For telephone users with loud voices & quiet voices,

quantisation-noise will have same power, 2/12.• may be too large for quiet voices, OK for slightly louder ones,

& too small (risking overflow) for much louder voices.•Useful to know over what range of loudness will speech quality

be acceptable to users.

Variation of input levels

000

111

001

volts

OK too big for quiet voice

too small for loud voice

Page 8: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 8

8.1.4. Dynamic Range (Dy)

. .

) ( log 10 10 dB

SQNRacceptablegiveswhichpowerMin

overflownopowersignalpossibleMaxDy

Dy = max possible SQNR (dB) min acceptable SQNR (dB)

12/

log10

12/

log10

:onquantisati uniformFor

210210

powerMinpowersignalMaxDy

This final expression for Dy is well worth remembering, but it only works for uniform quantisation!

Page 9: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 9

Excercise

If SQNR must be at least 30dB to be acceptable, what is Dy assuming sinusoids & an 8-bit uniform quantiser? Solution: = Max possible - Min acceptable SQNR (dB) = (6m + 1.8) - 30 = 49.8 - 30 = 19.8 dB.Too small for telephony

Exercise: Repeat this calculation for 12-bit uniform quantisation.

Page 10: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 10

8.1.5. Instantaneous companding

• 8-bits per sample not sufficient for good speech encoding with uniform quantisation.

•Problem lies with setting a suitable quantisation step-size .

•One solution is to use instantaneous companding.

•Step-size adjusted according to amplitude of sample.

•For larger amplitudes, larger step-sizes used as illustrated next.

•‘Instantaneous’ because step-size changes from sample to sample.

Page 11: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 11

Non-uniform quantisation used for companding

x(t)

t 001 111

Page 12: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 12

Implementation of companding (in principle)

• Pass x(t) thro’ compressor to produce y(t).

• y(t) is quantised uniformly to give y’(t) which is transmitted or stored digitally.

• At receiver, y’(t) passed thro’ expander which reverses effect of compressor.

• Analog implementation uncommon but shows concept well.

Com-pressor

Uniform quantiser

Expanderx(t) y(t)

Transmitor store

y’(t) x’(t)

Page 13: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 13

Effect of compressor

• Increase smaller amplitudes of x(t) and reduce larger ones.

• When uniform quantiser is applied fixed appears: smaller in proportion to smaller amplitudes of x(t), larger in proportion to larger amplitudes.

• Effect is non-uniform quantisation as illustrated previously.

Page 14: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 14

Effect of ‘compressor’ on sinusoid

t

1

x(t) y(t)

1

-1 -1

t

• ‘Expander’ reverses this shape change.

Page 15: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 15

Common compressor is linear for |x(t)close to zero & logarithmic for larger values. A suitable formula for x(t) in range ±1 volt is:

A1 |x(t)| 1 : |x(t)|log

K

11sign(x(t))

A1 |x(t)| : K

x(t)

y(t)e

where K = 1+ loge (A) A is constant which determines cross-over between linear & log.

‘A-Law’ instantaneous companding

Page 16: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 16

Mapping from x(t) to y(t) by A-law companding

y(t)

x(t)

+1

-1

1/A

-1/A

-1

1

1/K

-1/K

A 13

(Too difficult to draw if A is any larger)

Page 17: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 17

•A-law companding as used in UK with A = 87.6 & K=5.47.• General formula becomes:

• Maps x(t) in range ±1 to y(t) in range ±1.1 % of domain of x(t) linearly mapped onto 20 % of range of y(t). • Remaining 99% of domain of x(t) logarithmically mapped onto 80% of range for y(t).

87.61 |x(t)| 1 : |x(t)|log183.01sign(x(t))

87.61 |x(t)| : 16x(t)

y(t)e

G711 standard ‘A-law’ companding with A=87.6

Page 18: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 18

A-law expander formula

1 |(t)y| 1/K : (t))eysign(

1/K |(t)y| : (t)/A yK (t)x ) 1 (t)|y | (K

1 |(t)y| 0.183 : (t))eysign(

0.183 |(t)y| : (t)/16y (t)x ) 1 (t)|y | ( 5.47

When A = 87.6,

Page 19: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 19

Graph of A-law expander formula

y(t)

x(t)+1

-1

1/A

-1/A

-1

11/K-1/K

A 3(difficult to draw when A=87.6)

Page 20: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 20

Effect of expander on ‘small samples’ • Samples of x(t) ‘small’ when in range ±1/A (quiet speech).• Compresser has multiplied these by 16 to produce y(t).• (Assume A = 87.6)• Without quantisation, passing y(t) thro’ expander would

produce original samples of x(t) exactly, by dividing by 16.• But any small changes to y(t) are also divided by 16.• Effectively reduces quantisation step by factor 16. • Reduces quantisation noise power (2/12) by factor 162 .• Multiplies SQNR for ‘small’ samples by 162

• Increases SQNR for quiet speech by 24 dB as:

10log10(162) = 10log10(28)

= 80log10(2) 800.3 = 24 dB

Page 21: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 21

Effect of expander on ‘large’ samples

35 dB

• Quantisation-step now increases in proportion to |x(t)| when |x(t)| increases further, above 1/A towards 1.

• Therefore, SQNR will remain at about 35 dB.

• It will not increase much further as |x(t)| increases above 1/A.

• Samples of x(t) are ‘large’ when amplitude 1/A.• When x(t) =1/A, SQNR at output of expander is:

72

2

10 2/1 and 6.87A with dB. 12/

2/)/16(log10

A

Page 22: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 22

Observations

• For |x(t)| > 1/A, expander causes quantisation step to increase in proportion to x(t).

• Quantisation noise gets louder as the signal gets louder.

• If all samples of x(t) ‘large’, SQNR would remain approximately the same, i.e. about 35 or 36 dB for G711 (see next slide)

• If 30 dB SQNR is acceptable, large signals are quantised satisfactorily.

• Small signals quantised satisfactorily for lower amplitudes.

• Largest amplitude unchanged & smallest reduced by factor 16.

• Dy increased by factor 16 i.e. 24 dB to 19.8+24 = 43.8 dB

• Same as 12-bit linear (6x12+1.8 – 30) but with only 8 bits.

• Quantisation error for A-law worse than for uniform when |x(t)| > 1/A

• Price to be paid for the increased dynamic range.

Page 23: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 23

48

36

24

12

VV/2V/4V/16 3V/4

A-law

Uniform

Amplitudeof sample

0

SQNR dB

Variation of SQNR with amplitude of sample

Page 24: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 24

Mu ()-law companding

•Similar companding technique adopted in USA.

lly_used255_genera (mu) μ )μ 1 (log

) /Vx(t)μ 1 (logx(t)signy(t)

e

e

•When |x(t)| < V / , y(t) ( / loge(1+) )x(t)/V since loge(1+x) x x2/2 + x2/3 - … when |x| < 1.

-law with = 255 is like A-law with A=255, though transition from small quantisation-steps for small x(t) to larger ones for larger values of x(t) is more gradual with -law.

Page 25: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 25

• Compression/expansion may be done digitally via ‘look-up’ table. • x[n] assumed to be 12-bit integer in range -2048 to 2047.• Digitise x(t) by 12-bit ADC, or use 16-bit ADC & truncate to 12 bits. • For each 12-bit integer, table has corresponding 8-bit value of x(t). • Expander table has a 12-bit word for each 8 bit input. • Each 8-bit G711 ‘A-law’ sample is a sort of ‘floating point number:

Implementing compressor & expander

S X2 X1 X0 M3 M2 M1 M0

x m

Value is (-1)S+1 . m . 2x-1 where the ‘mantissa m is: (1 M3 M2 M1 M0 1 0 0 0)2 if x>0 or (0 M3 M2 M1 M0 1 0 0 0)2 if x=0

Page 26: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 26

8.2.Further reducing bit-rate for digitised speech

•PCM encodes each speech sample independently & is capable of

encoding any wave-shape correctly sampled.

•This is ‘waveform encoding’: simple but needs high bit-rate.

•Speech waveforms have special properties that can be exploited

by ‘parametric coding techniques’ to achieve lower bit-rates.

Page 27: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 27

• General trends may be identified allowing one to estimate which sample value is likely to follow a given set of samples. • Makes part of information transmitted by PCM redundant • Speech has 'voiced' & 'unvoiced' parts i.e. 'vowels' & 'consonants'. • Predictability lies mostly in voiced speech as it has periodicity.• In voiced speech, a ‘characteristic waveform', like a decaying sinusoid, is repeated periodically (or approximately so).

t

Volts

Properties of speech waveforms

Page 28: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 28

A characteristic waveform for voiced speech

Volts

t

Page 29: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 29

Voiced speech (vowels)•Shapes of characteristic waveforms are, to some extent, predictable from first few samples. •Also, once one characteristic waveform has been received, the next one can be predicted. •Prediction not 100% accurate. •Sending a ‘prediction error’ is more efficient than sending the whole signal.

•‘Decaying sinusoid’ shape of each characteristic waveform is due to the way sound is 'coloured' by shape of mouth.•Similarity of repeated characteristic waveforms due to periodicity of sound produced by vocal cords.

Page 30: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 30

Unvoiced speech (consonants)

• Random or noise-like with little periodicity.

• Lower in amplitude than voiced telephone speech.

• Exact shape of its noise-like waveform not critical for

perception

•Almost any noise-like waveform will do as long as energy level

is correct, i.e. it is not too loud or too quiet.

• Unvoiced speech is easier to encode than voiced.

• Separate them at transmitter & encode them in separate ways.

Page 31: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 31

Characteristics of speech & perception exploited to reduce bit-rate

1. Higher amplitudes may be digitised less accurately.2. Adjacent samples usually close in value.3. Voiced characteristic waveforms repeat quasi-periodically 4. Predictability within characteristic waveforms.5. Unvoiced telephone speech quieter than voiced & exact wave- shape not critical for perception.6. Pauses of about 60 % duration per speaker.7. Ear insensitive to phase spectrum of telephone speech 8. Ear more sensitive in some frequency ranges than others.9. Audibility of low level frequency components 'masked' by

adjacent higher level components,

Page 32: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 32

Section of voiced & unvoiced telephone speech. Large amplitudes are voiced (speech bandlimited to 3.4kHz).

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 -1

-0.8

-0.6

-0.4

-0.2

0

0.2

0.4

0.6

0.8

1 x 10

4

Time (s)

Am

pli

tud

e

Page 33: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 33

Small section of female voiced speech

This was extracted from previous graph (sampled at 8 kHz).

0 500 1000 1500 2000 2500 3000 3500-8000

-6000

-4000

-2000

0

2000

4000

6000

8000

10000

sample number

Page 34: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 34

Smaller section of male voiced speech.

0 500 1000 1500-2

-1.5

-1

-0.5

0

0.5

1

1.5

2x 10

4

sample

Bandlimited to 300-3400 Hz & sampled at 8 kHz.

Page 35: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 35

Magnitude spectrum of male speech

Obtained by FFT analysing about 4 cycles as shown on next slide

0 0.5 1 1.5 2 2.5 3 3.5 46

7

8

9

10

11

12

13

14

kHz

dBW

Page 36: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 36

0 100 200 300 400 500 600 700-1.5

-1

-0.5

0

0.5

1

1.5

2x 10

4

Section of male voiced speech

Page 37: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 37

8.2.2. Differential coding

• Encode differences between samples.

• Where differences transmitted by PCM this is ‘differential PCM’.

Page 38: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 38

8.2.3. Simplified DPCM coder & decoder

Delay by 1 sample

Quantiser s[n] e[n]

Transmit or store samples

s[n-1]

e[n] = e[n] + q[n]

Coder

DecoderDelay by 1 sample

s[n-1]

Receive fromchannel or store

e[n] ][ns

Page 39: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 39

Simple DPCM receiver

][ns

PCM receiver

Delay by 1 sample

s[n-1]

Receive e[n]

Page 40: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 40

Why is speech is well suited to differential coding?

• Speech energy biased towards low frequency end of spectrum.

• Especially true for voiced, (not so much for unvoiced)

• Adjacent speech samples often quite close

• As energy of {e[n]} < energy of {s[n]} {e[n]} encoded using fewer bits/sample.

Page 41: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 41

Modify simple DPCM diagram in 2 ways

Receiver: introduce with =0.99.

Transmitter: derive as at the receiver. ^

Instead of s[n] - s[n-1] , transmit s[n] - s[n-1] .

][ns

Page 42: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 42

Modified DPCM encoder

--

z-1

z-1

Quantise

Copy of receiver

e[n]

e[n] s[n]

s[n]

Page 43: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 43

Practical DPCM encoder

Previous diagram simplifies to:

z-1

Quantise

e[n] s[n]

s[n]

e[n]

^

^

Page 44: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 44

32 kb/s ADPCM

• DPCM with adaptive quantiser (adapts its step-size according to signal e[n] )

• ITU standard for speech coding (G726) (also for 40, 24 & 16 kb/s)

Page 45: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 45

8.2.4. Linear prediction coding (LPC)

• Concept of differential coding described in terms of prediction.• Consider again a differential quantiser as shown below:

s[n]

s[n-1]

Quantiser e[n]

s[n] z -1 Prediction

Prediction error e[n]

Page 46: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 46

Prediction • Predict that s[n] will be identical to s[n-1]. • Prediction error (residual) will be {e[n]}. • If prediction good, {e[n]} small & fewer bits needed for it.• Better prediction from several previous samples:

+ + +

+ s[n]

s[n]

b1 b2 b3 bM

z-1 z-1 z-1

z-1

e[n]

Prediction

Page 47: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 47

Short term linear prediction (LP) filter (for analysis)

• FIR digital filter of order M with coeffs b1, b2, ..., bM . • Must be adaptive,• New set of coeffs for each block of 10 or 20 ms.• Code and transmit e[n] & coeffs b1, b2, ..., bM for each block.•Typically M=10.

Page 48: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 48

LPC decoder• Receiver reconstruct speech s[n] as follows:

+ + +

+ s[n]

b1 b2 b3

z-1 z-1 z-1

z-1

e[n]

Prediction

bM

• Prediction filter now used recursively for ‘synthesis’

• Puts back what was removed at transmitter

Page 49: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 49

Comments on LPC coder/encoder

•M between 6 &12.

•Error {e[n]}, has special properties which allow it to be coded

very efficiently.

•Study human speech production mechanism.

Page 50: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 50

Human speech production mechanism•For vowels, air forced thro’ vocal cords & causes them to vibrate.• Periodic build-up & release of pressure produces sound.• "Vocal tract excitation".• If high-pass filtered, would appear as a series of pulses:

Mouth

Nose

Air from lungs

Vocal cords

Vocal tract

Velum

T Time

High -pass filtered excitation Volts

Speech-likewaveform

Time

Page 51: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 51

Unvoiced speech (consonants)

• For unvoiced speech, vocal cords do not vibrate. • Turbulent air flow produces “hissing” sound. • Random or noise-like vocal tract excitation.

Page 52: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 52

Formants in speech • Excitation passes thro' vocal tract. • Flexible tube : begins at glottis & ends at lips. • Shape determined by tongue, jaw & lips.• "Nasal tract" also tube connected at "velum".

• Tubes act like a filter with resonances ("formants").• Resonances change as person speaks. • Resonance at f1 Hz emphasises power close to f1. • Amplitudes of some harmonics increased with respect to others.• Refer back to speech graphs (slide 39). We see this effect.

Page 53: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 53

Effect of prediction filter at transmitter

With correctly adapted coeffs: prediction (analysis) filter at transmitter removes resonances (formants).

Remaining ‘prediction error’ (or ‘residual’) signal {e[n]} becomes high-pass filtered excitation signal:

periodic series of pulses (voiced), or

spectrally white random signal (unvoiced).

Spectrum of {e[n]} should now be flat. No formants.

All harmonics with same amplitude

Page 54: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 54

Voiced speech {s[n]} & residual {e[n]} with spectra

e[n]

nP P P

s[n]

P P P

n

f

f

|S(..)|

|E(..)|

1/P

Page 55: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 55

Deriving short term prediction coeffs b1, b2, …, bM

• DSP procedure: “LP(C)” analysis.

• Takes block of say 160 samples of speech (20 ms).

• Coeffs calculated such that energy of {e[n]} minimised.

•In MATLAB given an array s of say 160 speech samples: b = lpc(s, 10);

• Produces an array containing 1, b1, b2, ..., bM with M=10

Page 56: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 56

‘LPC-10’

• Once b1 , b2 , ..., b M derived by LPC analysis, {e[n]} may be obtained by passing speech thro' analysis filter• In MATLAB: e = filter(b, 1, s);

• LPC-10 codes 20 ms speech frames in the form of:Ten LPC filter coeffs (about 40 bits say)A ‘stylised’ representation {e[n]} consisting of:

(i) Unvoiced/voiced decision (1 bit)(ii) Amplitude ( a single number: 8 bits say)(iii) A pitch-period (a single number: 8 bits say)

• Total, about 57 bits/frame, 50 frames/s i.e. 2850 bit/s

• LPC-10 coder at 2400 b/s is widely used in military comms.

Page 57: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 57

LPC-10 receiver

LPC ‘synthesis’ filter

Coefficients

Speech

{e[n]}

Noise generator

Impulsesequencegenerator

Pitch-period P

V/UV decision

Amplitude

Page 58: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 58

Demo in MATLAB

• A simple LPC-10 coder may be written and tested in MATLAB.

• We know how to derive the filter coeffs and the prediction error for a segment of speech.

• To make things really easy, forget about unvoiced speech and use a fixed frequency excitation signal consisting of a series of pulses at intervals of say 80 samples.

• Only the LP filter coeffs and the energy of {e[n]} per 160 sample frame need be transmitted or stored.

• Forget about quantising filter coeffs.

Page 59: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 59

8.2.5. Waveform coding and parametric coding.

•Waveform coding techniques such as PCM & DPCM try to

preserve exact shape of speech waveform as far as possible. •Simple to understand & implement, but cannot achieve very low

bit-rates. •Parametric techniques (e.g. LPC) do not aim to preserve exact

wave-shape, & represent features expected to be perceptually

significant by sets of parameters,

• i.e. bi coefficients & parameters of stylised error signal.

• Parametric coding more complicated to understand &

implement than waveform coding, but achieves lower bit-rates.

Page 60: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 60

8.2.6. Regular pulse excited LPC with LTP : (RPE-LTP)

•Original speech coding technique for GSM: “GSM 06.10”. •One of many LPC techniques for mobile telephony.•Relatively simple by today’s standards. (CELP)•Its bit-rate is 13 kb/s. •Produces 260 bits for each 20 ms block of speech. •Channel coding adds 196 bits to produce 456 bits per block & overall bit-rate of 22.8 kb/s. •LPC to calculate short term prediction coeffs & {e[n]}.•Does not adopt “stylised” description of {e[n]}. •Employs “long term predictn (LTP) filter”.•Cancels out periodicity due to vibration of vocal cords.

Page 61: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 61

Long term predictor (LTP)

Delay of P samples

e[n]

K

e[n] with periodicity removed

Most of the periodicity removed to leave a lower-amplitude noise-like residual.

Multiplier K is applied to compensate for signal getting slightly louder (K<1) or softer (K>1).

Page 62: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 62

Encoding the RPE-LTP residual

• LTP residual encoded “regular pulse excitation” coding.

• Divides each 160 sample block into four ‘40 sample’ sub-blocks.

• Each sub-block divided into 3 interleaved sets of locations :

• {1,4,7,..., 37,40}, {2,5,8,...38}, {3,6,9,12,...39} referred to as ‘tracks’.

• Find track with most energy & encode choice with 2-bits.

• The 13 samples are quantised with 3-bit per sample & a 6-bit overall scaling factor.

• At receiver, residual is decoded & applied to inverse LTP. • Resulting periodic signal e[n] then becomes input to short-term

inverse LPC filter.

Page 63: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 63

RPE-LTP receiver

Fig 8.17

e[n] with periodicity removed

Delay of P samples

e[n]

K

Short term LPC inverse filter

Speech

Coeffs

P

K

Page 64: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 64

8.2.7. CELP

• Based on similar principles to RPE-LRP.

• Has short & long term predictor to produce low level noise-like residual.

• Uses ‘analysis by synthesis’ to encode residual .

• Has 2 identical code-books, one at transmitter & one at receiver.

• Codebooks contain duplicate segments each 40 samples long.

• Each codebook has M-bit index where M may be 10.

• 1024 ‘40-sample’ sequences, indexed 0, 1, ..., M-1.

• Transmitter splits residual into four ‘40-sample’ sub-segments.

• Tries each code-book entry to find best one to use as excitation.

• Index of best entry transmitted.

• Four indices transmitted for each 160 sample block of residual.

Page 65: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 65

Analysis-by-synthesis• Transmitter has copy of receiver.

• For a 40 sample sub-segment, it applies each codebook entry to the long & short term synthesis filters to synthesise speech as it would be synthesised at receiver.

• It then compares shape of synthesised waveform with original speech.

• Similarity is measured by computing a cross-correlation function.

• Compares two sequences where the amplification of one is optimised to make them as close as possible.

• NB. k can be made negative.

• We are comparing their ‘shapes’ not their magnitudes .

• In CELP coders, the cross-correlation is weighted to take into account some characteristics of human sound perception, i.e. it is ‘perceptually weighted’.

• Codebooks may be populated with samples of pseudo-random noise.

• More recent versions of CELP use ‘algebraic’ code-book entries.

Page 66: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 66

A simplified view of CELPDerive LPC filter coeffs

Find fundamental (pitch) frequency

Construct excitation from codebook entry & long term predictor

LPC synthesis filter

Compute cross-correlation to measure similarity

Control codebook search to maximize similarity.

Apply perceptual weighting

Speech

Trans-mit

Transmitresults of search

Page 67: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 67

8.2.8. Algebraic CELP (G729) • Instead of being read from a code-book, each excitation is constructed by

positioning a suitable number of +ve or ve unit impulses within the sub-segment.

• Search made among a range of possible impulse positions.

• Suitability of each one evaluated by comparing synthesised with original speech.

• Each sub-segment divided into five interleaved sets of possible pulse locations : {1,6,11,16,21,26,31,36}, {2,7,12,17,22,27,32,37}, …, {5,10,15,20,25,30,35,40} referred to as ‘tracks’.

• Position one +ve or -ve unit impulse per track within 1st 3 tracks & 4th or 5th.

• Must choose best possible positions & polarities for these impulses.

• ‘Best’ means max cross-correlation between synthesized & original speech.

• Recording 4 unit-impulse positions with polarities & the choice of tracks requires 17 bits per sub-segment.

Page 68: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 68

8.3. Digitising music:

• Standards exist for the digitisation of ‘audio’ quality sound.• Compact disks: 2.83 megabits per second. (fs= 44.1kHz, 16 bits/sample, stereo, with FEC.)• Too high for many applications.• DSP compression takes advantage of human hearing. • Compression is “lossy”, i.e. not like “ZIP” . • MPEG concerned with hi-fi audio, video & their synchronisation.• DAB, MUSICAM use “MPEG-1 audio layer 2” (MP2). • 3 MPEG layers offer range of sampling rates (32, 44.1, 48 kHz) & bit-rates from 224 kb/s to 32 kb/s per channel. • Difference lies in complexity of DSP processing required .

Page 69: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 69

MPEG1-4 with MP3 etc

•Layer 1 of MPEG-1 (MP1) simplest & best suited to higher bit-rates: e.g. Philips digital compact cassette at 192 kb/s per channel. • Layer 2 has intermediate complexity & is suited to bit-rates around 128 kb/s: DAB uses MP2. • Layer 3 (MP3) is most complex but offers best audio quality. Can be used at bit-rates as low as 64 kb/s: Now used for distribution of musical recordings via Internet. • All 3 MPEG-1 audio layers simple enough to be implemented on a single DSP chip as real time encoder/decoder.•MPEG-2 supports enhanced & original (MPEG-1) versions of MP1-3.•MPEG-3 was started but abandoned. Don’t confuse MPEG-3 with MP3 (MPEG-1/2 audio level 3)•MPEG-4 ongoing & introduces ‘AAC’ & ‘MP4’.

Page 70: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 70

CD recordings take little account of the nature of the music and music perception.

Frequency masking & temporal masking can be exploited to reduce bit-rate required for recording music. This is ‘lossy’ rather than ‘loss-less’ compression.

Frequency masking: ‘A strong tonal audio signal at a given frequency will mask, i.e. render inaudible, quieter tones at nearby frequencies, above and below that of the strong tone, the closer the frequency the more effective the matching’.

Temporal masking: ‘A loud sound will mask i.e. render inaudible a quieter sound occurring shortly before or shortly after it. The time difference depends on the amplitude difference’.

Introduction to MP3

Page 71: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 71

Perceived loudness

• Human ear is not equally sensitive to sound at different frequencies.

• Most sensitive to sound between about 1 kHz and 4 kHz.

• ‘Equal loudness contour graphs’ reproduced below.

• For any contour, human listeners find sound equally loud even though actual sound level varies with frequency.

• The sound level is expressed in dB_SPA (see notes)

Page 72: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 72

Equal loudness countours

Page 73: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 73

MP3 sub-bands • MP3 is a frequency-domain encoding technique.

• Takes 50% overlapping frames of either 1152 or 384 samples.

• With fs = 44.1 kHz : 26.12 ms or 8.7 ms of music

• Sound split into frequency sub-bands.

• Splitting in 2 stages, firstly by bank of 32 FIR band-pass filters.

• Output from each filter down-sampled by a factor 32.

• Then further split by means of DCT• Obtain 1152 or 384 frequency domain samples.

Page 74: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 74

Frequency-domain coding

• Frequency-domain samples can be coded.

• Spectral energy may be found to be concentrated in certain frequency bands and may be very low in others.

• Assign more bits to some frequency-domain samples than others.

• Equal loudness contours useful here.

• Use ‘signal to masking’ (SMR) ratio at each frequency to determine number of bits allocated to the spectral sample.

• More significant advantages from ‘psycho-acoustics’.

Page 75: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 75

Frequency masking

• ‘A strong tonal audio signal at a given frequency will mask, i.e. render inaudible, quieter tones at nearby frequencies, above and below that of the strong tone, the closer the frequency the more effective the matching’.

• Characterised by a ‘spreading’ function.

• Triangular approximation available.• Example given for a strong tone (60 dB_SPL) at 1kHz.

Page 76: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 76

Example spreading function (triangular approx)

Represents a threshold of hearing for frequencies adjacent to the tone.

A tone below the spreading function will be masked & not heard.

Tones make masking threshold contour different from ‘in quiet’.

2000500 1000

f Hz

60

40

20

dB_SPL

2k

5001000

60

40

20

dB_SPL

-20

4k

-40

f Hz

Page 77: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 77

Triangular approximation to spreading function

• Derived from +25 and -10 dB per Bark triangular approximation (see WIKI)

• Assuming approximately 4 Barks per octave.

• Bark scale is alternative & perceptually based frequency scale

• Logarithmic for frequencies above 500 Hz.

• More accurate spreading functions are highly non-linear, not quite triangular, & vary with & amplitude & frequency of masker.

Page 78: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 78

Effect of 2 strong tones on masking contour

•Add spreading function for each tone as identified by FFT.

•Allocate bits according to SMR relative to this masking contour,•More efficient & economic coding scheme is possible.

2020 k

5k1k

f Hz

10k100

60dB_SPL

0

Effect of tones at 800 Hz & 4 kHz

masking contour in quiet

Page 79: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 79

Temporal masking

• ‘A loud sound will mask a quieter sound occurring shortly before or shortly after it. The time difference depends on the amplitude difference’.

• Effect of frequency masking continues for up to 200 ms after the strong tone has finished, and even before it starts.

• Frequency masking contour for a given frame should be calculated taking account previous & next frames

Page 80: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 80

Temporal characteristic of masking by a 20 ms tone at 60 dB_SPL.

0200 400

t (ms)

-50

60

dB_SPL

20

40

simultaneous Post-masking

Pre-

Page 81: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 81

Diagram of an MP3 coder

Music Transform to frequency domain

Derive psychoacoustic masking function

Devise quantsation scheme for sub-bands according to masking

Apply Huffman coding

MP3

Page 82: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 82

Huffman coding & more detail

• Quantisation scheme tries to make the 2/12 noise < masking threshold.

• Non-uniform quantisation is used.

• Further efficiency with Huffman coding (lossless)

• ‘Self terminating’ variable length codes for the quantisation levels.

• Quantised samples which occur more often given shorter wordlengths.

• MP3 decoder simpler than encoder

• Reverses quantisation to get frequency domain samples &

• Transforms back to time domain taking into account frames are 50% overlapped.

• Some more detail about MP3 compression and Huffman coding is given in references quoted in the Comp30192 web-site.

• Frequency masking demo: www.ece.uvic.ca/~aupward/p/demos.htm.

Page 83: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 83

8.4. Digitisation of Video

• Digital TV with 486 lines would require 720 pixels per line, each pixel requiring 5 bits per colour, i.e. 2 bytes per pixel. • At 30 frames/s, bit-rate 168 Mb/s or 21 Mbytes/s. • Normal CD-Rom would hold 30 s of TV video at this bit-rate. • For HDTV, requirement is about 933 Mb/s• For film quality, required bit-rate 2300Mb/s. • SVGA screen with 800x600 pixels requires 3 x 8 = 24 bits per pixel, & 288 Mb/s if refreshed at 25Hz with interlacing. • Need for video compression is clear.

Page 84: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 84

• MPEG-1 & 2 & FCC standard for HDTV use “2-D discrete cosine transform (DCT)” applied to 8 x 8 (or 10x10) pixel “tiles”. • Red, green & blue colour measurements for each pixel transformed to a “luminance” & 2 “chrominance” measurements.• The eye is more sensitive to differences in luminance than to variations in chrominance. • Three separate images dealt with separately. •For chrominance, average sets of 4 pixels to produce fewer pixels. • Apply DCT to each tile to obtain 8x8 (or 10x10) 2-D frequency domain samples starting with a sort of “dc value” which represents overall brightness of the “tile”. • Finer detail added by higher 2-D frequency samples. • Just as for 1-D, higher frequencies add finer detail to signal shape.

Page 85: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 85

• The 2-D frequency-domain samples for each tile now quantised according to perceptual importance. • Accurately quantise differences betw dc values of adjacent tiles. • Differences often quite small & need few bits. • Remaining DCT coeffs for each tile diminish in importance with increasing frequency.• Many are so small that they may be set to zero.• Runs of zeros easily digitised by recording length of run. • Further bit-rate savings achieved by Huffman coding.• Assigns longer codes to numbers which occur less often.• Shorter codes for commonly occurring numbers.

Page 86: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 86

•Technique above may be applied to a single video frame•Used to digitise still pictures according to “JPEG” standard. •MPEG-1 & 2 send JPEG encoded frames ( I-frames) about once or twice per second. •Between the I-frames MPEG-1 &-2 send

“P-frames” which encode differences between current frame & previous frame

“B-frames” which encode differences between current frame the previous & the next frame. • MPEG-1 originally for encoding reasonable quality video at about 1.2 Mb/s. • MPEG-2 originally for encoding broadcast quality video at bit- rates between 4 & 6 Mb/s.

Page 87: 7 Dec'08Comp30291 Section 81 UNIVERSITY of MANCHESTER School of Computer Science Comp30291 Digital Media Processing 2008-09 Section 8: Processing Speech,

7 Dec'08 Comp30291 Section 8 87

End of section 8.