Audio Compression Notes (Data Compression)


Chapter 2

AUDIO COMPRESSION

Digital Audio, Lossy sound compression, µ-law and A-law Companding, DPCM and ADPCM audio compression, MPEG audio standard, frequency domain coding, format of compressed data.

1. Introduction:

Two important features of audio compression are (1) it can be lossy and (2) it requires fast decoding. Text compression must be lossless, but images and audio can lose much data without a noticeable degradation of quality. Thus, there are both lossless and lossy audio compression algorithms. Often, audio is stored in compressed form and has to be decompressed in real time when the user wants to listen to it. This is why most audio compression methods are asymmetric. The encoder can be slow, but the decoder has to be fast. This is also why audio compression methods are not dictionary based. A dictionary-based compression method may have many advantages, but fast decoding is not one of them.

We can define sound as:

(a) An intuitive definition: Sound is the sensation detected by our ears and interpreted by our brain in a certain way.

(b) A scientific definition: Sound is a physical disturbance in a medium. It propagates in the medium as a pressure wave by the movement of atoms or molecules.

Like any other wave, sound has three important attributes: its speed, amplitude, and period. The speed of sound depends mostly on the medium it passes through, and on the temperature. The human ear is sensitive to a wide range of sound frequencies, normally from about 20 Hz to about 22,000 Hz, depending on a person's age and health. This is the range of audible frequencies. Some animals, most notably dogs and bats, can hear higher frequencies (ultrasound). Loudness is commonly measured in units of dB SPL (sound pressure level) instead of sound power. The definition is

Level (dB SPL) = 20 log10(p / p0),

where p is the sound pressure and p0 is a standard reference pressure of about 20 µPa.

2. Digital Audio:

Sound can be digitized and broken up into numbers. Digitizing sound is done by measuring the voltage at many points in time, translating each measurement into a number, and writing the numbers to a file. This process is called sampling. The sound wave is sampled, and the samples become the digitized sound. The device used for sampling is called an analog-to-digital converter (ADC).

Since the audio samples are numbers, they are easy to edit. However, the main use of an audio file is to play it back. This is done by converting the numeric samples back into voltages that are continuously fed into a speaker. The device that does that is called a digital-to-analog converter (DAC). Intuitively, it is clear that a high sampling rate would result in better sound reproduction, but also in many more samples and therefore bigger files. Thus, the main problem in audio sampling is how often to sample a given sound.

Figure 1: Sampling of a Sound Wave

Figure 1a shows the effect of a low sampling rate. The sound wave in the figure is sampled four times, and all four samples happen to be identical. When these samples are used to play back the sound, the result is silence. Figure 1b shows seven samples, and they seem to follow the original wave fairly closely. Unfortunately, when they are used to reproduce the sound, they produce the curve shown dashed. There simply are not enough samples to reconstruct the original sound wave. The solution to the sampling problem is to sample sound at a little over the Nyquist frequency, which is twice the maximum frequency contained in the sound.

    The sampling rate plays a different role in determining the quality of digital sound reproduction. One classic

    law in digital signal processing was published by Harry Nyquist. He determined that to accurately reproduce a

    signal of frequency f, the sampling rate has to be greater than 2*f. This is commonly called the Nyquist Rate. It

    is used in many practical situations. The range of human hearing, for instance, is between 16 Hz and 22,000 Hz.

    When sound is digitized at high quality (such as music recorded on a CD), it is sampled at the rate of 44,100 Hz.

    Anything lower than that results in distortions.

Thus, if a sound contains frequencies of up to 2 kHz, it should be sampled at a little more than 4 kHz. Such a sampling rate guarantees true reproduction of the sound. This is illustrated in Figure 1c, which shows 10 equally spaced samples taken over four periods. Notice that the samples do not have to be taken from the maxima or minima of the wave; they can come from any point.

The range of human hearing is typically from 16-20 Hz to 20,000-22,000 Hz, depending on the person and on age. When sound is digitized at high fidelity, it should therefore be sampled at a little over the Nyquist rate of 2 × 22,000 = 44,000 Hz. This is why high-quality digital sound is based on a 44,100 Hz sampling rate. Anything lower than this rate results in distortions, while higher sampling rates do not produce any improvement in the reconstruction (playback) of the sound. We can consider the sampling rate of 44,100 Hz a lowpass filter, since it effectively removes all the frequencies above 22,000 Hz.

The telephone system, originally designed for conversations, not for digital communications, samples sound at only 8 kHz. Thus, any frequency higher than 4,000 Hz gets distorted when sent over the phone, which is why it is hard to distinguish, on the phone, between the sounds of "f" and "s". The second problem in sound sampling is the sample size. Each sample becomes a number, but how large should this number be? In practice, samples are normally either 8 or 16 bits. Assuming that the highest voltage in a sound wave is 1 volt, an 8-bit sample can distinguish voltages as low as 1/256 ≈ 0.004 volt, or 4 millivolts (mv). A quiet sound, generating a wave lower than 4 mv, would be sampled as zero and played back as silence. In contrast, with a 16-bit sample it is possible to distinguish sounds as low as 1/65,536 ≈ 15 microvolts (µv). We can think of the sample size as a quantization of the original audio data.

Audio sampling is also called pulse code modulation (PCM). The term pulse modulation refers to techniques for converting a continuous wave to a stream of binary numbers (audio samples). Possible pulse modulation methods include pulse amplitude modulation (PAM), pulse position modulation (PPM), pulse width modulation (PWM), and pulse number modulation (PNM). In practice, however, PCM has proved the most effective form of converting sound waves to numbers. When stereo sound is digitized, the PCM encoder multiplexes the left and right sound samples. Thus, stereo sound sampled at 22,000 Hz with 16-bit samples generates 44,000 16-bit samples per second, for a total of 704,000 bits/sec, or 88,000 bytes/sec.
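To make the arithmetic concrete, here is a small Python sketch (not part of the original notes) that computes the raw PCM bit rate for a given sampling rate, sample size, and channel count, reproducing the stereo figures quoted above:

def pcm_bitrate(sampling_rate_hz, bits_per_sample, channels):
    """Raw PCM bit rate in bits per second."""
    return sampling_rate_hz * bits_per_sample * channels

# Stereo sound sampled at 22,000 Hz with 16-bit samples (the example above):
bps = pcm_bitrate(22_000, 16, 2)
print(bps, bps // 8)                 # 704000 bits/sec, 88000 bytes/sec

# CD-quality stereo at 44,100 Hz:
print(pcm_bitrate(44_100, 16, 2))    # 1411200 bits/sec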

2.1 Digital Audio and Laplace Distribution:

A large audio file with a long, complex piece of music tends to have all the possible values of audio samples. Consider the simple case of 8-bit audio samples, which have values in the interval [0, 255]. A large audio file, with millions of audio samples, will tend to have many audio samples concentrated around the center of this interval (around 128), fewer large samples (close to the maximum 255), and few small samples (although there may be many audio samples of 0, because many types of sound tend to have periods of silence). The distribution of the samples may have a maximum at its center and another spike at 0. Thus, the audio samples themselves do not normally have a simple distribution.

However, when we examine the differences of adjacent samples, we observe a completely different behavior. Consecutive audio samples tend to be correlated, which is why the differences of consecutive samples tend to be small numbers. Experiments with many types of sound indicate that the distribution of audio differences resembles the Laplace distribution. The differences of consecutive correlated values tend to have a narrow, peaked distribution, resembling the Laplace distribution. This is true for the differences of audio samples as well as for the differences of consecutive pixels of an image. A compression algorithm may take advantage of this fact and encode the differences with variable-size codes that have a Laplace distribution. A more sophisticated version may compute differences between actual values (audio samples or pixels) and their predicted values, and then encode the (Laplace distributed) differences. Two such methods are image MLP and FLAC.
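The effect is easy to observe. The following sketch (assuming NumPy, and using a synthetic correlated signal as a stand-in for real decoded audio) compares the spread of the raw samples with the spread of their first differences:

import numpy as np

# Synthetic stand-in for decoded audio: a slowly varying tone plus a little noise,
# so consecutive samples are correlated (as they are in real recordings).
n = np.arange(50_000)
samples = np.round(100 * np.sin(2 * np.pi * n / 400)
                   + np.random.normal(0, 2, n.size)).astype(int)

diffs = np.diff(samples)      # differences of adjacent samples

# The differences cluster in a narrow, peaked range around 0, so their spread
# is much smaller than that of the samples themselves.
print("sample std:    ", samples.std())
print("difference std:", diffs.std())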

2.2 The Human Auditory System

The frequency range of the human ear is from about 20 Hz to about 20,000 Hz, but the ear's sensitivity to sound is not uniform. It depends on the frequency. It should also be noted that the range of the human voice is much more limited. It is only from about 500 Hz to about 2 kHz. The existence of the hearing threshold suggests an approach to lossy audio compression: just delete any audio samples that are below the threshold. Since the threshold depends on the frequency, the encoder needs to know the frequency spectrum of the sound being compressed at any time. If a signal for frequency f is smaller than the hearing threshold at f, it (the signal) should be deleted. In addition to this, two more properties of the human hearing system are used in audio compression. They are frequency masking and temporal masking.

2.2.1 Spectral Masking or Frequency Masking:

Frequency masking (also known as auditory masking or spectral masking) occurs when a sound that we can normally hear (because it is loud enough) is masked by another sound with a nearby frequency. The thick arrow in Figure 2 represents a strong sound source at 8 kHz. This source raises the normal threshold in its vicinity (the dashed curve), with the result that the nearby sound represented by the arrow at "x", a sound that would normally be audible because it is above the threshold, is now masked, and is inaudible. A good lossy audio compression method should identify this case and delete the signals corresponding to sound "x", because it cannot be heard anyway. This is one way to lossily compress sound.

Figure 2: Spectral or frequency masking

The frequency masking (the width of the dashed curve of Figure 2) depends on the frequency. It varies from about 100 Hz for the lowest audible frequencies to more than 4 kHz for the highest. The range of audible frequencies can therefore be partitioned into a number of critical bands that indicate the declining sensitivity of the ear (rather, its declining resolving power) for higher frequencies. We can think of the critical bands as a measure similar to frequency. However, in contrast to frequency, which is absolute and has nothing to do with human hearing, the critical bands are determined according to the sound perception of the ear. Thus, they constitute a perceptually uniform measure of frequency. Table 1 lists 27 approximate critical bands.

Table 1: Twenty-Seven Approximate Critical Bands.

This also points the way to designing a practical lossy compression algorithm. The audio signal should first be transformed into its frequency domain, and the resulting values (the frequency spectrum) should be divided into subbands that resemble the critical bands as much as possible. Once this is done, the signals in each subband should be quantized such that the quantization noise (the difference between the original sound sample and its quantized value) is inaudible.

2.2.2 Temporal Masking

Temporal masking may occur when a strong sound A of frequency f is preceded or followed in time by a weaker sound B at a nearby (or the same) frequency. If the time interval between the sounds is short, sound B may not be audible. Figure 3 illustrates an example of temporal masking. The threshold of temporal masking due to a loud sound at time 0 goes down, first sharply, then slowly. A weaker sound of 30 dB will not be audible if it occurs 10 ms before or after the loud sound, but will be audible if the time interval between the sounds is 20 ms.

Figure 3: Threshold and Masking of Sound.

If the masked sound occurs prior to the masking tone, this is called premasking or backward masking, and if the sound being masked occurs after the masking tone this effect is called postmasking or forward masking. The forward masking remains in effect for a much longer time interval than the backward masking.

3. Lossy Sound Compression

It is possible to get better sound compression by developing lossy methods that take advantage of our perception of sound, and discard data to which the human ear is not sensitive. We briefly describe two approaches, silence compression and companding.

The principle of silence compression is to treat small samples as if they were silence (i.e., as samples of 0). This generates run lengths of zero, so silence compression is actually a variant of RLE, suitable for sound compression. This method uses the fact that some people have less sensitive hearing than others, and will tolerate the loss of sound that is so quiet they may not hear it anyway. Audio files containing long periods of low-volume sound will respond to silence compression better than other files with high-volume sound. This method requires a user-controlled parameter that specifies the largest sample that should be suppressed.
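A minimal sketch of the idea, assuming integer samples and a user-chosen threshold: samples at or below the threshold are treated as silence, and the resulting runs of zeros are run-length encoded:

def silence_compress(samples, threshold):
    """Zero out quiet samples, then run-length encode the zero runs."""
    out, zero_run = [], 0
    for s in samples:
        if abs(s) <= threshold:               # treat small samples as silence
            zero_run += 1
        else:
            if zero_run:
                out.append(("Z", zero_run))   # a run of zero_run silent samples
                zero_run = 0
            out.append(("S", s))              # an ordinary sample
    if zero_run:
        out.append(("Z", zero_run))
    return out

# Example: a quiet passage between two louder ones, threshold of 3.
print(silence_compress([120, 2, -1, 0, 3, -2, 95], threshold=3))
# [('S', 120), ('Z', 5), ('S', 95)]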

Companding (short for compressing/expanding) uses the fact that the ear requires more precise samples at low amplitudes (soft sounds), but is more forgiving at higher amplitudes. A typical ADC used in sound cards for personal computers converts voltages to numbers linearly. If an amplitude a is converted to the number n, then amplitude 2a will be converted to the number 2n. A compression method using companding examines every sample in the sound file, and employs a nonlinear formula to reduce the number of bits devoted to it. More sophisticated methods, such as µ-law and A-law, are commonly used.

4. µ-Law and A-Law Companding

The µ-law and A-law companding standards employ logarithm-based functions to encode audio samples for ISDN (integrated services digital network) digital telephony services, by means of nonlinear quantization. The ISDN hardware samples the voice signal from the telephone at 8 kHz and generates 14-bit samples (13 for A-law). The method of µ-law companding is used in North America and Japan, and A-law is used elsewhere.

Experiments indicate that the low amplitudes of speech signals contain more information than the high amplitudes. This is why nonlinear quantization makes sense. Imagine an audio signal sent on a telephone line and digitized to 14-bit samples. The louder the conversation, the higher the amplitude, and the bigger the value of the sample. Since high amplitudes are less important, they can be coarsely quantized. If the largest sample, which is 2^14 − 1 = 16,383, is quantized to 255 (the largest 8-bit number), then the compression factor is 14/8 = 1.75. When decoded, a code of 255 will become very different from the original 16,383. We say that because of the coarse quantization, large samples end up with high quantization noise. Smaller samples should be finely quantized, so they end up with low quantization noise. The µ-law encoder inputs 14-bit samples and outputs 8-bit codewords. The A-law encoder inputs 13-bit samples and also outputs 8-bit codewords. The telephone signals are sampled at 8 kHz (8,000 times per second), so the µ-law encoder receives 8,000 × 14 = 112,000 bits/sec. At a compression factor of 1.75, the encoder outputs 64,000 bits/sec.

4.1 µ-Law Encoder:

The µ-law encoder receives a 14-bit signed input sample x. Thus, the input is in the range [−8192, +8191]. The sample is normalized to the interval [−1, +1], and the encoder uses the logarithmic expression

F(x) = sgn(x) · ln(1 + µ|x|) / ln(1 + µ),

where

sgn(x) = +1 if x > 0, 0 if x = 0, −1 if x < 0

(and µ is a positive integer), to compute and output an 8-bit code in the same interval [−1, +1]. The output is then scaled to the range [−256, +255]. Figure 4 shows this output as a function of the input for the three µ values 25, 255, and 2555. It is clear that large values of µ cause coarser quantization for larger amplitudes. Such values allocate more bits to the smaller, more important, amplitudes. The G.711 standard recommends the use of µ = 255. The diagram shows only the nonnegative values of the input (i.e., from 0 to 8191). The negative side of the diagram has the same shape but with negative inputs and outputs.

Figure 4: The µ-Law for µ Values of 25, 255, and 2555.

The following simple examples illustrate the nonlinear nature of the µ-law. The two (normalized) input samples 0.15 and 0.16 are transformed by the µ-law to outputs 0.6618 and 0.6732. The difference between the outputs is 0.0114. On the other hand, the two input samples 0.95 and 0.96 (bigger inputs but with the same difference) are transformed to 0.9908 and 0.9927. The difference between these two outputs is 0.0019; much smaller. Bigger samples are decoded with more noise, and smaller samples are decoded with less noise. However, the signal-to-noise ratio is constant because both the µ-law and the SNR use logarithmic expressions.

P S2 S1 S0 Q3 Q2 Q1 Q0

Figure 5: G.711 µ-Law Codeword.

Logarithms are slow to compute, so the µ-law encoder performs much simpler calculations that produce an approximation. The output specified by the G.711 standard is an 8-bit codeword whose format is shown in Figure 5. Bit P in Figure 5 is the sign bit of the output (same as the sign bit of the 14-bit signed input sample). Bits S2, S1, and S0 are the segment code, and bits Q3 through Q0 are the quantization code. The encoder determines the segment code by (1) adding a bias of 33 to the absolute value of the input sample, (2) determining the bit position of the most significant 1-bit among bits 5 through 12 of the input, and (3) subtracting 5 from that position. The 4-bit quantization code is set to the four bits following the bit position determined in step 2. The encoder ignores the remaining bits of the input sample, and it inverts (1's complements) the codeword before it is output.

Example of µ-Law Codeword:

(a) Encoding: We use the input sample −656 as an example. The sample is negative, so bit P becomes 1. Adding 33 to the absolute value of the input yields 689 = 001010110001₂ (Figure 6).

                Q3 Q2 Q1 Q0
 0  0  0  1  0  1  0  1  1  0  0  0  1
12 11 10  9  8  7  6  5  4  3  2  1  0

Figure 6: Encoding Input Sample −656.

The most significant 1-bit in positions 5 through 12 is found at position 9. The segment code is thus 9 − 5 = 4. The quantization code is the four bits 0101 at positions 8-5, and the remaining five bits 10001 are ignored. The 8-bit codeword (which is later inverted) becomes

P S2 S1 S0 Q3 Q2 Q1 Q0
1  1  0  0  0  1  0  1

(b) Decoding: The µ-law decoder inputs an 8-bit codeword and inverts it. It then decodes it as follows:

1. Multiply the quantization code by 2 and add 33 (the bias) to the result.
2. Multiply the result by 2 raised to the power of the segment code.
3. Decrement the result by the bias.
4. Use bit P to determine the sign of the result.

Applying these steps to our example produces

1. The quantization code is 0101₂ = 5, so 5 × 2 + 33 = 43.
2. The segment code is 100₂ = 4, so 43 × 2⁴ = 688.
3. Decrement by the bias: 688 − 33 = 655.
4. Bit P is 1, so the final result is −655. Thus, the quantization error (the noise) is 1; very small.
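The encoding and decoding steps above translate directly into code. The following Python sketch is an approximation of the procedure described here (not a reference G.711 implementation); it reproduces the worked example, mapping −656 to the codeword 1 100 0101 and decoding it back to −655:

BIAS = 33

def mulaw_encode(sample: int) -> int:
    """Encode a 14-bit signed sample (-8192..8191) into an 8-bit codeword."""
    p = 1 if sample < 0 else 0              # sign bit P
    mag = abs(sample) + BIAS                # step 1: add the bias
    mag = min(mag, 0x1FFF)                  # keep the magnitude within 13 bits
    position = 12                           # step 2: most significant 1-bit in bits 5..12
    while position > 5 and not (mag & (1 << position)):
        position -= 1
    segment = position - 5                  # step 3: segment code (0..7)
    quant = (mag >> (position - 4)) & 0x0F  # the four bits below that position
    codeword = (p << 7) | (segment << 4) | quant
    return codeword ^ 0xFF                  # 1's complement before output

def mulaw_decode(codeword: int) -> int:
    """Decode an 8-bit codeword back to an approximate 14-bit sample."""
    codeword ^= 0xFF                        # undo the inversion
    p = (codeword >> 7) & 1
    segment = (codeword >> 4) & 0x07
    quant = codeword & 0x0F
    mag = ((quant * 2 + BIAS) << segment) - BIAS
    return -mag if p else mag

code = mulaw_encode(-656)
print(f"{code ^ 0xFF:08b}")   # 11000101 -> P=1, segment=100, quantization=0101
print(mulaw_decode(code))     # -655 (quantization error of 1)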

Figure 7 illustrates the nature of the µ-law midtread quantization. Zero is one of the valid output values, and the quantization steps are centered at the input value of 0. The steps are organized in eight segments of 16 steps each. The steps within each segment have the same width, but they double in width from one segment to the next.

Figure 7: µ-Law Midtread Quantization.

If we denote the segment number by i (where i = 0, 1, ..., 7) and the step within a segment by k (where k = 1, 2, ..., 16), then the middle of the tread of each step in Figure 7 (i.e., the points labeled x_j) is given by

x(16i + k) = T(i) + k × D(i),

where the constants T(i) and D(i) are the initial value and the step size for segment i, respectively. They are given by

i     0    1    2    3    4     5     6     7
T(i)  1   35  103  239  511  1055  2143  4319
D(i)  2    4    8   16   32    64   128   256

4.2 The A-Law Encoder:

The A-law encoder uses the similar expression

F(x) = sgn(x) · A|x| / (1 + ln A),             for 0 ≤ |x| < 1/A,
F(x) = sgn(x) · (1 + ln(A|x|)) / (1 + ln A),   for 1/A ≤ |x| ≤ 1.

The G.711 standard recommends the use of A = 87.6.

Figure 8: A-Law Midriser Quantization.

The operation of the A-law encoder is similar, except that the quantization (Figure 8) is of the midriser variety. The breakpoints x_j are given by the same equation,

x(16i + k) = T(i) + k × D(i),

but the initial value T(i) and the step size D(i) for segment i are different from those used by the µ-law encoder and are given by

i     0    1    2    3    4    5     6     7
T(i)  0   32   64  128  256  512  1024  2048
D(i)  2    2    4    8   16   32    64   128

The A-law encoder generates an 8-bit codeword with the same format as the µ-law encoder. It sets the P bit to the sign of the input sample. It then determines the segment code in the following steps:

1. Determine the bit position of the most significant 1-bit among the seven most significant bits of the input.
2. If such a 1-bit is found, the segment code becomes that position minus 4. Otherwise, the segment code becomes zero.

The 4-bit quantization code is set to the four bits following the bit position determined in step 1, or to half the input value if the segment code is zero. The encoder ignores the remaining bits of the input sample, and it inverts bit P and the even-numbered bits of the codeword before it is output.

The A-law decoder decodes an 8-bit codeword into a 13-bit audio sample as follows:

1. It inverts bit P and the even-numbered bits of the codeword.
2. If the segment code is nonzero, the decoder multiplies the quantization code by 2 and increments this by the bias (33). The result is then multiplied by 2 raised to the power of (segment code minus 1). If the segment code is 0, the decoder outputs twice the quantization code, plus 1.
3. Bit P is then used to determine the sign of the output.
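A minimal sketch of the three decoding steps just listed. The exact set of "even-numbered" bits is taken here as bits 0, 2, 4, and 6 of the codeword (plus bit P); that choice is an assumption made for illustration, not a statement about the G.711 bit numbering:

def alaw_decode(codeword: int) -> int:
    """Decode an 8-bit A-law codeword into a 13-bit audio sample."""
    # Step 1: invert bit P (the MSB) and the even-numbered bits
    # (assumed here to be bits 0, 2, 4, 6 -> mask 0b11010101).
    codeword ^= 0b11010101
    p = (codeword >> 7) & 1
    segment = (codeword >> 4) & 0x07
    quant = codeword & 0x0F
    # Step 2: reconstruct the magnitude.
    if segment:
        mag = (quant * 2 + 33) << (segment - 1)
    else:
        mag = quant * 2 + 1          # segment code 0: twice the code, plus 1
    # Step 3: bit P gives the sign.
    return -mag if p else mag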

5. ADPCM Audio Compression:

Adjacent audio samples tend to be similar in much the same way that neighboring pixels in an image tend to have similar colors. The simplest way to exploit this redundancy is to subtract adjacent samples and code the differences, which tend to be small integers. Any audio compression method based on this principle is called DPCM (differential pulse code modulation). Such methods, however, are inefficient, because they do not adapt themselves to the varying magnitudes of the audio stream. Better results are achieved by an adaptive version, and any such version is called ADPCM.

ADPCM: Short for Adaptive Differential Pulse Code Modulation, a form of pulse code modulation (PCM) that produces a digital signal with a lower bit rate than standard PCM. ADPCM produces a lower bit rate by recording only the difference between samples and adjusting the coding scale dynamically to accommodate large and small differences.

ADPCM employs linear prediction. It uses the previous sample (or several previous samples) to predict the current sample. It then computes the difference between the current sample and its prediction, and quantizes the difference. For each input sample X[n], the output C[n] of the encoder is simply a certain number of quantization levels. The decoder multiplies this number by the quantization step (and may add half the quantization step, for better precision) to obtain the reconstructed audio sample. The method is efficient because the quantization step is updated all the time, by both encoder and decoder, in response to the varying magnitudes of the input samples. It is also possible to adaptively modify the prediction algorithm. Various ADPCM methods differ in the way they predict the current audio sample and in the way they adapt to the input (by changing the quantization step size and/or the prediction method).

In addition to the quantized values, an ADPCM encoder can provide the decoder with side information. This information increases the size of the compressed stream, but this degradation is acceptable to the users, because it makes the compressed audio data more useful. Typical applications of side information are (1) to help the decoder recover from errors and (2) to signal an entry point into the compressed stream. An original audio stream may be recorded in compressed form on a medium such as a CD-ROM. If the user (listener) wants to listen to song 5, the decoder can use the side information to quickly find the start of that song.

Figure 9: (a) ADPCM Encoder and (b) Decoder.

Figure 9 shows the general organization of the ADPCM encoder and decoder. The adaptive quantizer receives the difference D[n] between the current input sample X[n] and the prediction Xp[n−1]. The quantizer computes and outputs the quantized code C[n] of X[n]. The same code is sent to the adaptive dequantizer (the same dequantizer used by the decoder), which produces the next dequantized difference value Dq[n]. This value is added to the previous predictor output Xp[n−1], and the sum Xp[n] is sent to the predictor to be used in the next step.

Better prediction would be obtained by feeding the actual input X[n] to the predictor. However, the decoder wouldn't be able to mimic that, since it does not have X[n]. We see that the basic ADPCM encoder is simple, and the decoder is even simpler. It inputs a code C[n], dequantizes it to a difference Dq[n], which is added to the preceding predictor output Xp[n−1] to form the next output Xp[n]. The next output is also fed into the predictor, to be used in the next step.
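To make the encoder/decoder symmetry concrete, here is a deliberately simplified ADPCM-style sketch. It does not follow any particular standard (such as IMA ADPCM or G.726); the one-sample predictor and the step-size update rule are assumptions chosen only to show how both sides stay in sync:

def adpcm_encode(samples):
    codes, predicted, step = [], 0, 16
    for x in samples:
        d = x - predicted                       # difference D[n]
        c = max(-8, min(7, round(d / step)))    # quantize to a 4-bit code C[n]
        codes.append(c)
        predicted += c * step                   # dequantize and update Xp[n]
        step = max(1, round(step * (1.5 if abs(c) >= 6 else 0.9)))
    return codes

def adpcm_decode(codes):
    samples, predicted, step = [], 0, 16
    for c in codes:
        predicted += c * step                   # same reconstruction as the encoder
        samples.append(predicted)
        step = max(1, round(step * (1.5 if abs(c) >= 6 else 0.9)))
    return samples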

6. Speech Compression:

Certain audio codecs are designed specifically to compress speech signals. Such signals are audio and are sampled like any other audio data, but because of the nature of human speech, they have properties that can be exploited for efficient compression.

6.1 Properties of Speech

We produce sound by forcing air from the lungs through the vocal cords into the vocal tract. The vocal cords can open and close, and the opening between them is called the glottis. The movements of the glottis and vocal tract give rise to different types of sound. The three main types are as follows:

1. Voiced sounds. These are the sounds we make when we talk. The vocal cords vibrate, which opens and closes the glottis, thereby sending pulses of air at varying pressures to the tract, where it is shaped into sound waves. The frequencies of the human voice, on the other hand, are much more restricted and are generally in the range of 500 Hz to about 2 kHz. This is equivalent to time periods of 2 ms to 20 ms. Thus, voiced sounds have long-term periodicity.

2. Unvoiced sounds. These are sounds that are emitted and can be heard, but are not parts of speech. Such a sound is the result of holding the glottis open and forcing air through a constriction in the vocal tract. When an unvoiced sound is sampled, the samples show little correlation and are random or close to random.

3. Plosive sounds. These result when the glottis closes, the lungs apply air pressure on it, and it suddenly opens, letting the air escape suddenly. The result is a popping sound.

6.2 Speech Codecs

There are three main types of speech codecs.

1. Waveform speech codecs: These produce good to excellent speech after compressing and decompressing it, but generate bit rates of 10-64 kbps.

2. Source codecs (also called vocoders): Vocoders generally produce poor to fair speech but can compress it to very low bit rates (down to 2 kbps).

3. Hybrid codecs: These codecs are combinations of the former two types and produce speech that varies from fair to good, with bit rates between 2 and 16 kbps.

Figure 10 illustrates the speech quality versus bit rate of these three types.

Figure 10: Speech Quality versus Bit rate for Speech Codecs.

6.3 Waveform Codecs

A waveform codec does not attempt to predict how the original sound was generated. It only tries to produce, after decompression, audio samples that are as close to the original ones as possible. Thus, such codecs are not designed specifically for speech coding and can perform equally well on all kinds of audio data. As Figure 10 illustrates, when such a codec is forced to compress sound to less than 16 kbps, the quality of the reconstructed sound drops significantly.

The simplest waveform encoder is pulse code modulation (PCM). This encoder simply quantizes each audio sample. Speech is typically sampled at only 8 kHz. If each sample is quantized to 12 bits, the resulting bit rate is 8,000 × 12 = 96 kbps and the reproduced speech sounds almost natural. Better results are obtained with a logarithmic quantizer, such as the µ-law and A-law companding methods. They quantize audio samples to varying numbers of bits and may compress speech to 8 bits per sample on average, thereby resulting in a bit rate of 64 kbps, with very good quality of the reconstructed speech.

A differential PCM speech encoder uses the fact that the audio samples of voiced speech are correlated. This type of encoder computes the difference between the current sample and its predecessor and quantizes the difference. An adaptive version (ADPCM) may compress speech at good quality down to a bit rate of 32 kbps.

Waveform coders may also operate in the frequency domain. The subband coding algorithm (SBC) transforms the audio samples to the frequency domain, partitions the resulting coefficients into several critical bands (or frequency subbands), and codes each subband separately with ADPCM or a similar quantization method. The SBC decoder decodes the frequency coefficients, recombines them, and performs the inverse transformation to (lossily) reconstruct audio samples. The advantage of SBC is that the ear is sensitive to certain frequencies and less sensitive to others. Subbands of frequencies to which the ear is less sensitive can therefore be coarsely quantized without loss of sound quality. This type of coder typically produces good reconstructed speech quality at bit rates of 16-32 kbps. They are, however, more complex to implement than PCM codecs and may also be slower.

The adaptive transform coding (ATC) speech compression algorithm transforms audio samples to the frequency domain with the discrete cosine transform (DCT). The audio file is divided into blocks of audio samples and the DCT is applied to each block, resulting in a number of frequency coefficients. Each coefficient is quantized according to the frequency to which it corresponds. Good quality reconstructed speech can be achieved at bit rates as low as 16 kbps.

6.4 Source Codecs

In general, a source encoder uses a mathematical model of the source of data. The model depends on certain parameters, and the encoder uses the input data to compute those parameters. Once the parameters are obtained, they are written (after being suitably encoded) on the compressed stream. The decoder inputs the parameters and employs the mathematical model to reconstruct the original data. If the original data is audio, the source coder is called a vocoder (from vocal coder).

6.4.1 Linear Predictive Coder (LPC):

Figure 11 shows a simplified model of speech production. Part (a) illustrates the process in a person, whereas part (b) shows the corresponding LPC mathematical model. In this model, the output is the sequence of speech samples s(n) coming out of the LPC filter (which corresponds to the vocal tract and lips). The input u(n) to the model (and to the filter) is either a train of pulses (when the sound is voiced speech) or white noise (when the sound is unvoiced speech).

Figure 11: Speech Production: (a) Real. (b) LPC Model.

The quantities u(n) are also termed innovation. The model illustrates how samples s(n) of speech can be generated by mixing innovations (a train of pulses and white noise). Thus, it represents mathematically the relation between speech samples and innovations. The task of the speech encoder is to input samples s(n) of actual speech, use the filter as a mathematical function to determine an equivalent sequence of innovations u(n), and output the innovations in compressed form. The correspondence between the model's parameters and the parts of real speech is as follows:

1. Parameter V (voiced) corresponds to the vibrations of the vocal cords. UV expresses the unvoiced sounds.

2. T is the period of the vocal cords' vibrations.

3. G (gain) corresponds to the loudness or the air volume sent from the lungs each second.

4. The innovations u(n) correspond to the air passing through the vocal tract.

5. The symbols × and + denote amplification and combination, respectively.

The main equation of the LPC model describes the output of the LPC filter as

s(n) = z − Σ_{i=1..10} a_i · s(n − i),

where z is the input to the filter [the value of one of the u(n)]. An equivalent equation describes the relation between the innovations u(n) on the one hand and the 10 coefficients a_i and the speech audio samples s(n) on the other hand. The relation is

u(n) = s(n) + Σ_{i=1..10} a_i · s(n − i).

This relation implies that each number u(n) input to the LPC filter is the sum of the current audio sample s(n) and a weighted sum of the 10 preceding samples. The LPC model can be written as the 13-tuple

A = (a_1, a_2, ..., a_10, G, V/UV, T),

where V/UV is a single bit specifying the source (voiced or unvoiced) of the input samples. The model assumes that A stays stable for about 20 ms, then gets updated by the audio samples of the next 20 ms. At a sampling rate of 8 kHz, there are 160 audio samples s(n) every 20 ms. The model computes the 13 quantities in A from these 160 samples, writes A (as 13 numbers) on the compressed stream, then repeats for the next 20 ms. The resulting compression factor is therefore 13 numbers for each set of 160 audio samples.

It's important to distinguish the operation of the encoder from the diagram of the LPC's mathematical model depicted in Figure 11b. The figure shows how a sequence of innovations u(n) generates speech samples s(n). The encoder, however, starts with the speech samples. It inputs a 20 ms sequence of speech samples s(n), computes an equivalent sequence of innovations, compresses them to 13 numbers, and outputs the numbers after further encoding them. This repeats every 20 ms.

LPC encoding (or analysis) starts with 160 sound samples and computes the 10 LPC parameters a_i by minimizing the energy of the innovation u(n). The energy is the function

E(a_1, ..., a_10) = Σ_n u²(n),

and its minimum is computed by differentiating it 10 times, with respect to each of its 10 parameters a_i. The autocorrelation function of the samples s(n) is given by

R(k) = Σ_n s(n) · s(n − k),

which is used to obtain the 10 LPC parameters a_i. The remaining three parameters, V/UV, G, and T, are determined from the 160 audio samples. If those samples exhibit periodicity, then T becomes that period and the 1-bit parameter V/UV is set to V. If the 160 samples do not feature any well-defined period, then T remains undefined and V/UV is set to UV. The value of G is determined by the largest sample.
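As an illustration of the analysis step, here is a small sketch (assuming NumPy) that computes the autocorrelation values for one 160-sample frame, solves the resulting normal equations for the 10 coefficients, and forms the prediction residual. It uses the usual minus-sign convention for the predictor, so the residual plays the role of the innovation u(n):

import numpy as np

def lpc_analyze(frame, order=10):
    s = np.asarray(frame, dtype=float)
    # autocorrelation R(k) = sum_n s(n) s(n-k)
    R = np.array([np.dot(s[k:], s[:len(s) - k]) for k in range(order + 1)])
    # normal equations: a Toeplitz system whose solution gives the predictor coefficients
    A = np.array([[R[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(A, R[1:order + 1])
    # residual u(n) = s(n) - sum_i a_i s(n-i), with missing history treated as zero
    u = s.copy()
    for i in range(1, order + 1):
        u[i:] -= a[i - 1] * s[:-i]
    return a, u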

LPC decoding (or synthesis) starts with a set of 13 LPC parameters and computes 160 audio samples as the output of the LPC filter by

s(n) = u(n) − Σ_{i=1..10} a_i · s(n − i),

where u(n) is the innovation (a train of pulses with period T for voiced speech, or white noise for unvoiced speech, scaled by the gain G). These samples are played at 8,000 samples per second and result in 20 ms of (voiced or unvoiced) reconstructed speech.

Advantages of LPC:

1. LPC provides a good model of the speech signal.

2. The way in which LPC is applied to the analysis of speech signals leads to a reasonable source-vocal tract separation.

3. LPC is an analytically tractable model. The model is mathematically precise and simple and straightforward to implement in either software or hardware.

6.5 Hybrid Codecs

This type of speech codec combines features from both waveform and source codecs. The most popular hybrid codecs are Analysis-by-Synthesis (AbS) time-domain algorithms. Like the LPC vocoder, these codecs model the vocal tract by a linear prediction filter, but use an excitation signal instead of the simple, two-state voiced-unvoiced model to supply the u(n) (innovation) input to the filter. Thus, an AbS encoder starts with a set of speech samples (a frame), encodes them similarly to LPC, decodes them, and subtracts the decoded samples from the original ones. The differences are sent through an error minimization process that outputs improved encoded samples. These samples are again decoded, subtracted from the original samples, and new differences computed. This is repeated until the differences satisfy a termination condition. The encoder then proceeds to the next set of speech samples (next frame).

6.5.1 Code Excited Linear Prediction (CELP):

One of the most important factors in generating natural-sounding speech is the excitation signal. As the human ear is especially sensitive to pitch errors, a great deal of effort has been devoted to the development of accurate pitch detection algorithms.

In CELP, instead of having a codebook of pulse patterns, we allow a variety of excitation signals. For each segment the encoder finds the excitation vector that generates synthesized speech that best matches the speech segment being encoded. This approach is closer in a strict sense to a waveform coding technique such as DPCM than to the analysis/synthesis schemes. The main components of the CELP coder include the LPC analysis, the excitation codebook, and the perceptual weighting filter. Besides CELP, the MP-LPC algorithm had another descendant that has become a standard. Instead of using excitation vectors in which the nonzero values are separated by an arbitrary number of zero values, they forced the nonzero values to occur at regularly spaced intervals. Furthermore, MP-LPC allowed the nonzero values to take on a number of different values. This scheme is called regular pulse excitation (RPE) coding. A variation of RPE, called regular pulse excitation with long-term prediction (RPE-LTP), was adopted as a standard for digital cellular telephony by the Group Speciale Mobile (GSM) subcommittee of the European Telecommunications Standards Institute at the rate of 13 kbps.

The vocal tract filter used by the CELP coder is given by

y_n = Σ_{i=1..10} b_i · y_{n−i} + β · y_{n−P} + ε_n,

where P is the pitch period and the term β · y_{n−P} is the contribution due to the pitch periodicity.

1. The input speech is sampled at 8,000 samples per second and divided into 30-millisecond frames containing 240 samples.

2. Each frame is divided into four subframes of length 7.5 milliseconds.

3. The coefficients for the 10th-order short-term filter are obtained using the autocorrelation method.

4. The pitch period P is calculated once every subframe. In order to reduce the computational load, the pitch value is assumed to lie between 20 and 147 every odd subframe.

5. In every even subframe, the pitch value is assumed to lie within 32 samples of the pitch value in the previous frame.

6. The algorithm uses two codebooks, a stochastic codebook and an adaptive codebook. An excitation sequence is generated for each subframe by adding one scaled element from the stochastic codebook and one scaled element from the adaptive codebook.

7. The stochastic codebook contains 512 entries. These entries are generated using a Gaussian random number generator, the output of which is quantized to −1, 0, or 1. The codebook entries are adjusted so that each entry differs from the preceding entry in only two places.

8. The adaptive codebook consists of the excitation vectors from the previous frame. Each time a new excitation vector is obtained, it is added to the codebook. In this manner, the codebook adapts to local statistics.

9. The coder has been shown to provide excellent reproductions in both quiet and noisy environments at rates of 4.8 kbps and above.

10. The quality of the reproduction of this coder at 4.8 kbps has been shown to be equivalent to a delta modulator operating at 32 kbps. The price for this quality is much higher complexity and a much longer coding delay.

6.5.2 CCITT G.728 CELP Speech Coding Standard:

By their nature, the speech coding schemes have some coding delay built into them. By coding delay, we mean the time between when a speech sample is encoded and when it is decoded if the encoder and decoder were connected back to back (i.e., there were no transmission delays). In the schemes we have studied, a segment of speech is first stored in a buffer. We do not start extracting the various parameters until a complete segment of speech is available to us. Once the segment is completely available, it is processed. If the processing is real time, this means another segment's worth of delay. Finally, once the parameters have been obtained, coded, and transmitted, the receiver has to wait until at least a significant part of the information is available before it can start decoding the first sample. Therefore, if a segment contains 20 milliseconds' worth of data, the coding delay would be approximately somewhere between 40 to 60 milliseconds.

For such applications, CCITT approved recommendation G.728, a CELP coder with a coder delay of 2 milliseconds operating at 16 kbps. As the input speech is sampled at 8,000 samples per second, this rate corresponds to an average rate of 2 bits per sample. The G.728 recommendation uses a segment size of five samples. With five samples and a rate of 2 bits per sample, we only have 10 bits available to us. Using only 10 bits, it would be impossible to encode the parameters of the vocal tract filter as well as the excitation vector. Therefore, the algorithm obtains the vocal tract filter parameters in a backward adaptive manner; that is, the vocal tract filter coefficients to be used to synthesize the current segment are obtained by analyzing the previous decoded segments. The G.728 algorithm uses a 50th-order vocal tract filter. The order of the filter is large enough to model the pitch of most female speakers. Not being able to use pitch information for male speakers does not cause much degradation. The vocal tract filter is updated every fourth frame, which is once every 20 samples or 2.5 milliseconds. The autocorrelation method is used to obtain the vocal tract parameters.

Figure 12: Encoder and decoder for the CCITT G.728 16 kbps CELP speech codec.

Ten bits would be able to index 1,024 excitation sequences. However, to examine 1,024 excitation sequences every 0.625 milliseconds is a rather large computational load. In order to reduce this load, the G.728 algorithm uses a product codebook where each excitation sequence is represented by a normalized sequence and a gain term. The final excitation sequence is a product of the normalized excitation sequence and the gain. Of the 10 bits, 3 bits are used to encode the gain using a predictive encoding scheme, while the remaining 7 bits form the index to a codebook containing 127 sequences.

Block diagrams of the encoder and decoder for the CCITT G.728 coder are shown in Figure 12. The low-delay CCITT G.728 CELP coder operating at 16 kbps provides reconstructed speech quality superior to the 32 kbps CCITT G.726 ADPCM algorithm. Various efforts are underway to reduce the bit rate for this algorithm without compromising too much on quality and delay.

6.5.3 Mixed Excitation Linear Prediction (MELP):

The mixed excitation linear prediction (MELP) coder was selected to be the new federal standard for speech coding at 2.4 kbps. It uses the same LPC filter to model the vocal tract. However, it uses a much more complex approach to the generation of the excitation signal. A block diagram of the decoder for the MELP system is shown in Figure 13. As evident from the figure, the excitation signal for the synthesis filter is no longer simply noise or a periodic pulse but a multiband mixed excitation. The mixed excitation contains both a filtered signal from a noise generator as well as a contribution that depends directly on the input signal.

The first step in constructing the excitation signal is pitch extraction. The MELP algorithm obtains the pitch period using a multistep approach. In the first step an integer pitch value P1 is obtained by

1. first filtering the input using a lowpass filter with a cutoff of 1 kHz,
2. computing the normalized autocorrelation for lags between 40 and 160.

The normalized autocorrelation r(τ) is defined as

r(τ) = Σ_n y(n) · y(n − τ) / sqrt( Σ_n y²(n) · Σ_n y²(n − τ) ).

The first estimate of the pitch P1 is obtained as the value of τ that maximizes the normalized autocorrelation function. This stage uses two values of P1, one from the current frame and one from the previous frame, as candidates. The normalized autocorrelation values are obtained for lags from five samples less to five samples more than the candidate P1 values. The lags that provide the maximum normalized autocorrelation value for each candidate are used for fractional pitch refinement.
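A small sketch (assuming NumPy) of this first pitch-estimation step. A crude moving average stands in for the 1 kHz lowpass filter; everything else follows the description above:

import numpy as np

def estimate_pitch(y, lags=range(40, 161)):
    y = np.convolve(y, np.ones(8) / 8, mode="same")   # stand-in lowpass filter

    def r(tau):
        # normalized autocorrelation at lag tau
        num = np.dot(y[tau:], y[:-tau])
        den = np.sqrt(np.dot(y[tau:], y[tau:]) * np.dot(y[:-tau], y[:-tau]))
        return num / den if den else 0.0

    # integer pitch estimate P1: the lag that maximizes r(tau)
    return max(lags, key=r)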

Figure 13: Block diagram of MELP decoder.

The final refinements of the pitch value are obtained using the linear prediction residuals. The residual sequence is generated by filtering the input speech signal with the filter obtained using the LPC analysis. For the purposes of pitch refinement the residual signal is filtered using a lowpass filter with a cutoff of 1 kHz. The normalized autocorrelation function is computed for this filtered residual signal for lags from five samples less to five samples more than the candidate P2 value, and a candidate value of P3 is obtained.

The input is also subjected to a multiband voicing analysis using five filters with passbands 0-500, 500-1000, 1000-2000, 2000-3000, and 3000-4000 Hz. The goal of the analysis is to obtain the voicing strengths Vbp_i for each band used in the shaping filters. If the value of Vbp1 is small, this indicates a lack of low-frequency structure, which in turn indicates an unvoiced or transition input. Thus, if Vbp1 > 0.6, the values of the other voicing strengths are quantized to 1 if their value is greater than 0.6, and to 0 otherwise. In this way signal energy in the different bands is turned on or off depending on the voicing strength.

In order to generate the pulse input, the algorithm measures the magnitude of the discrete Fourier transform coefficients corresponding to the first 10 harmonics of the pitch. The magnitudes of the harmonics are quantized using a vector quantizer with a codebook size of 256. The codebook is searched using a weighted Euclidean distance that emphasizes lower frequencies over higher frequencies.

At the decoder, using the magnitudes of the harmonics and information about the periodicity of the pulse train, the algorithm generates one excitation signal. Another signal is generated using a random number generator. Both are shaped by the multiband shaping filter before being combined. This mixture signal is then processed through an adaptive spectral enhancement filter, which is based on the LPC coefficients, to form the final excitation signal. Note that in order to preserve continuity from frame to frame, the parameters used for generating the excitation signal are adjusted based on their corresponding values in neighboring frames.

6.6 MPEG Audio Coding

The formal name of MPEG-1 is the international standard for moving picture video compression, IS 11172. It consists of five parts, of which part 3 [ISO/IEC 93] is the definition of the audio compression algorithm. The document describing MPEG-1 has normative and informative sections. A normative section is part of the standard specification. It is intended for implementers, it is written in a precise language, and it should be strictly followed in implementing the standard on actual computer platforms. An informative section, on the other hand, illustrates concepts that are discussed elsewhere, explains the reasons that led to certain choices and decisions, and contains background material. An example of a normative section is the tables of various parameters and of the Huffman codes used in MPEG audio. An example of an informative section is the algorithm used by MPEG audio to implement a psychoacoustic model. MPEG does not require any particular algorithm, and an MPEG encoder can use any method to implement the model. This informative section simply describes various alternatives.

The MPEG-1 and MPEG-2 (or in short, MPEG-1/2) audio standard specifies three compression methods called layers and designated I, II, and III. All three layers are part of the MPEG-1 standard. A movie compressed by MPEG-1 uses only one layer, and the layer number is specified in the compressed stream. Any of the layers can be used to compress an audio file without any video. An interesting aspect of the design of the standard is that the layers form a hierarchy in the sense that a layer III decoder can also decode audio files compressed by layers I or II.

The result of having three layers was an increasing popularity of layer III. The encoder is extremely complex, but it produces excellent compression, and this, combined with the fact that the decoder is much simpler, has produced in the late 1990s an explosion of what is popularly known as mp3 sound files. It is easy to legally and freely obtain a layer III decoder and much music that is already encoded in layer III. So far, this has been a big success of the audio part of the MPEG project.

The principle of MPEG audio compression is quantization. The values being quantized, however, are not the audio samples but numbers (called signals) taken from the frequency domain of the sound. The fact that the compression ratio (or equivalently, the bit rate) is known to the encoder means that the encoder knows at any time how many bits it can allocate to the quantized signals. Thus, the (adaptive) bit allocation algorithm is an important part of the encoder. This algorithm uses the known bit rate and the frequency spectrum of the most recent audio samples to determine the size of the quantized signals such that the quantization noise (the difference between an original signal and a quantized one) will be inaudible.

Figure 14: MPEG Audio: (a) Encoder and (b) Decoder.

The psychoacoustic models use the frequency of the sound that is being compressed, but the input stream consists of audio samples, not sound frequencies. The frequencies have to be computed from the samples. This is why the first step in MPEG audio encoding is a discrete Fourier transform, where a set of 512 consecutive audio samples is transformed to the frequency domain. Since the number of frequencies can be huge, they are grouped into 32 equal-width frequency subbands (layer III uses different numbers but the same principle). For each subband, a number is obtained that indicates the intensity of the sound at that subband's frequency range. These numbers (called signals) are then quantized. The coarseness of the quantization in each subband is determined by the masking threshold in the subband and by the number of bits still available to the encoder. The masking threshold is computed for each subband using a psychoacoustic model.
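As an illustration only, the following sketch (assuming NumPy) transforms 512 consecutive samples to the frequency domain with an FFT and reduces them to one intensity value ("signal") per each of 32 equal-width subbands. The real MPEG encoder uses a polyphase filter bank for this step, described in the next section:

import numpy as np

def subband_signals(samples):
    assert len(samples) == 512
    spectrum = np.abs(np.fft.rfft(samples))        # magnitudes of 257 frequency bins
    bands = np.array_split(spectrum[:256], 32)     # 32 equal-width subbands of 8 bins each
    return [float(np.sqrt(np.mean(b ** 2))) for b in bands]  # RMS intensity per subband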

MPEG uses two psychoacoustic models to implement frequency masking and temporal masking. Each model describes how loud sound masks other sounds that happen to be close to it in frequency or in time. The model partitions the frequency range into 24 critical bands and specifies how masking effects apply within each band. The masking effects depend, of course, on the frequencies and amplitudes of the tones. When the sound is decompressed and played, the user (listener) may select any playback amplitude, which is why the psychoacoustic model has to be designed for the worst case. The masking effects also depend on the nature of the source of the sound being compressed. The source may be tone-like or noise-like. The two psychoacoustic models employed by MPEG are based on experimental work done by researchers over many years.

The decoder must be fast, since it may have to decode the entire movie (video and audio) in real time, so it must be simple. As a result it does not use any psychoacoustic model or bit allocation algorithm. The compressed stream must therefore contain all the information that the decoder needs for dequantizing the signals. This information (the size of the quantized signals) must be written by the encoder on the compressed stream, and it constitutes overhead that should be subtracted from the number of remaining available bits.

Figure 14 is a block diagram of the main components of the MPEG audio encoder and decoder. The ancillary data is user-definable and would normally consist of information related to specific applications. This data is optional.

6.7 Frequency Domain Coding

The first step in encoding the audio samples is to transform them from the time domain to the frequency domain. This is done by a bank of polyphase filters that transform the samples into 32 equal-width frequency subbands. The filters were designed to provide fast operation combined with good time and frequency resolutions. As a result, their design involved three compromises.

1. The first compromise is the equal widths of the 32 frequency bands. This simplifies the filters but is in contrast to the behavior of the human auditory system, whose sensitivity is frequency dependent. When several critical bands are covered by a subband X, the bit allocation algorithm selects the critical band with the least noise masking and uses that critical band to compute the number of bits allocated to the quantized signals in subband X.

2. The second compromise involves the inverse filter bank, the one used by the decoder. The original time-to-frequency transformation involves loss of information (even before any quantization). The inverse filter bank therefore receives data that is slightly bad, and uses it to perform the inverse frequency-to-time transformation, resulting in more distortions. Therefore, the design of the two filter banks (for direct and inverse transformations) had to use compromises to minimize this loss of information.

3. The third compromise has to do with the individual filters. Adjacent filters should ideally pass different frequency ranges. In practice, they have considerable frequency overlap. Sound of a single, pure, frequency can therefore penetrate through two filters and produce signals (that are later quantized) in two of the 32 subbands instead of in just one subband.

The polyphase filter bank uses (in addition to other intermediate data structures) a buffer X with room for 512 input samples. The buffer is a FIFO queue and always contains the most recent 512 samples input. Figure 15 shows the five main steps of the polyphase filtering algorithm.

Figure 15: Polyphase Filter Bank.

6.8 MPEG Layer I Coding

    The Layer I coding scheme provides a 4:1 compression. In Layer I coding the time frequency

    mappingisaccomplishedusingabankof32subbandfilters.Theoutputofthesubbandfiltersis

    criticallysampled.Thatis,theoutputofeachfilterisdownsampledby32.Thesamplesaredivided

    intogroupsof12sampleseach.Twelvesamplesfromeachofthe32subbandfilters,oratotalof

    384 samples, make up one frame of the Layer I coder. Once the frequency components are

    obtained the algorithm examines each group of 12 samples to determine a scalefactor. The

    scalefactorisusedtomakesurethatthecoefficientsmakeuseoftheentirerangeofthequantizer.

    Thesubbandoutputisdividedbythescalefactorbeforebeinglinearlyquantized.Thereareatotal

    of63scalefactorsspecifiedintheMPEGstandard.Specificationofeachscalefactorrequires6bits.

  • Figure16:FramestructureforLayer1.

    Todetermine thenumberofbits tobeused forquantization, the codermakesuseof the

    psychoacousticmodel.TheinputstothemodelincludeFastFourierTransform(FFT)oftheaudio

    data as well as the signal itself. The model calculates the masking thresholds in each subband,

    which in turn determine the amount of quantization noise that can be tolerated and hence the

    quantization step size. As the quantizers all cover the same range, selection of the quantization

    stepsizeisthesameasselectionofthenumberofbitstobeusedforquantizingtheoutputofeach

    subband. InLayer I theencoderhasachoiceof14differentquantizers foreachband(plus the

    optionof assigning0bits).Thequantizersare allmidtreadquantizers ranging from3 levels to

    65,535levels.Eachsubbandgetsassignedavariablenumberofbits.However,thetotalnumberof

    bitsavailabletorepresentallthesubbandsamplesisfixed.Therefore,thebitallocationcanbean

    iterativeprocess.Theobjectiveistokeepthenoisetomaskratiomoreorlessconstantacrossthe

    subbands.

    Theoutputofthequantizationandbitallocationstepsarecombinedintoaframeasshown

    inFigure16.BecauseMPEGaudioisastreamingformat,eachframecarriesaheader,ratherthan

    havingasingleheaderfortheentireaudiosequence.

    1. Theheaderismadeupof32bits.

    2. Thefirst12bitscompriseasyncpatternconsistingofall1s.

    3. Thisisfollowedbya1bitversionID,

    4. A2bitlayerindicator,

    5. A 1bit CRC protection. The CRC protection bit is set to 0 if there is no CRC

    protectionandissettoa1ifthereisCRCprotection.

    6. If the layer and protection information is known, all 16 bits can be used for

    providingframesynchronization.

7. The next 4 bits make up the bit rate index, which specifies the bit rate in kbits/sec. There are 14 specified bit rates to choose from.
8. This is followed by 2 bits that indicate the sampling frequency. The sampling frequencies for MPEG-1 and MPEG-2 are different (one of the few differences between the audio coding standards for MPEG-1 and MPEG-2) and are shown in Table 2.
9. These bits are followed by a single padding bit. If the bit is 1, the frame contains an additional slot to adjust the bit rate to the sampling frequency. The next two bits indicate the mode. The possible modes are stereo, joint stereo, dual channel, and single channel. The stereo mode consists of two channels that are encoded separately but intended to be played together. The joint stereo mode consists of two channels that are encoded together.

Table 2: Allowable sampling frequencies in MPEG-1 and MPEG-2.
    MPEG-1: 32 kHz, 44.1 kHz, 48 kHz
    MPEG-2: 16 kHz, 22.05 kHz, 24 kHz

In MS joint stereo, the left and right channels are combined to form a mid and a side signal as follows: M = (L + R)/√2 and S = (L - R)/√2.

The dual channel mode consists of two channels that are encoded separately and are not intended to be played together, such as a translation channel. These are followed by two mode extension bits that are used in the joint stereo mode. The next bit is a copyright bit (1 if the material is copyrighted, 0 if it is not). The next bit is set to 1 for original media and 0 for a copy. The final two bits indicate the type of de-emphasis to be used. If CRC protection is used, the header is followed by a 16-bit CRC. This is followed by the bit allocations used by each subband, which is in turn followed by the set of 6-bit scalefactors. The scalefactor data is followed by the 384 quantized samples. A sketch of reading these header fields is given below.
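The following is a minimal sketch of unpacking the fields described above from the first four bytes of a frame. The layout also includes a 1-bit private field between the padding bit and the mode bits (part of the standard header, not discussed above), so the fields add up to 32 bits. Table lookups for the actual bit rate and sampling frequency values are omitted.

    def parse_mpeg_audio_header(header_bytes):
        """Unpack the 32-bit MPEG audio frame header (a sketch)."""
        h = int.from_bytes(header_bytes, "big")
        fields = {
            "sync":           (h >> 20) & 0xFFF,  # 12 bits, must be all 1s
            "version_id":     (h >> 19) & 0x1,    # 1 bit
            "layer":          (h >> 17) & 0x3,    # 2 bits
            "protection_bit": (h >> 16) & 0x1,    # 0 => 16-bit CRC follows header
            "bitrate_index":  (h >> 12) & 0xF,    # 4 bits
            "sampling_freq":  (h >> 10) & 0x3,    # 2 bits, see Table 2
            "padding_bit":    (h >> 9)  & 0x1,
            "private_bit":    (h >> 8)  & 0x1,
            "mode":           (h >> 6)  & 0x3,    # stereo/joint/dual/single
            "mode_extension": (h >> 4)  & 0x3,
            "copyright":      (h >> 3)  & 0x1,
            "original":       (h >> 2)  & 0x1,
            "emphasis":        h        & 0x3,
        }
        if fields["sync"] != 0xFFF:
            raise ValueError("not an MPEG audio frame header")
        return fields

    # Example: 0xFF 0xFB 0x90 0x00 is a typical MPEG-1 Layer III header.
    fields = parse_mpeg_audio_header(bytes([0xFF, 0xFB, 0x90, 0x00]))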

6.9 Layer II Coding

The Layer II coder provides a higher compression rate by making some relatively minor modifications to the Layer I coding scheme. These modifications include how the samples are grouped together, the representation of the scalefactors, and the quantization strategy. Where the Layer I coder puts 12 samples from each subband into a frame, the Layer II coder groups three sets of 12 samples from each subband into a frame. The total number of samples per frame increases from 384 samples to 1152 samples. This reduces the amount of overhead per sample. In Layer I coding a separate scalefactor is selected for each block of 12 samples. In Layer II coding the encoder tries to share a scalefactor among two or all three groups of samples from each subband filter. The only time separate scalefactors are used for each group of 12 samples is when not doing so would result in a significant increase in distortion. The particular choice used in a frame is signaled through the scalefactor selection information field in the bitstream.

The major difference between the Layer I and Layer II coding schemes is in the quantization step. In the Layer I coding scheme the output of each subband is quantized using one of 14 possibilities, the same 14 possibilities for each of the subbands. In Layer II coding the quantizers used for each of the subbands can be selected from a different set of quantizers depending on the sampling rate and the bit rate. For some sampling rate and bit rate combinations, many of the higher subbands are assigned 0 bits. That is, the information from those subbands is simply discarded. Where the quantizer selected has 3, 5, or 9 levels, the Layer II coding scheme uses one more enhancement. Notice that in the case of 3 levels we have to use 2 bits per sample, which would have allowed us to represent 4 levels. The situation is even worse in the case of 5 levels, where we are forced to use 3 bits, wasting three codewords, and in the case of 9 levels, where we have to use 4 bits, thus wasting 7 codewords.

To avoid this situation, the Layer II coder groups 3 samples into a granule. If each sample can take on 3 levels, a granule can take on 27 levels. This can be accommodated using 5 bits. If each sample had been encoded separately, we would have needed 6 bits. Similarly, if each sample can take on 9 values, a granule can take on 729 values. We can represent 729 values using 10 bits. If each sample in the granule had been encoded separately, we would have needed 12 bits. Using all these savings, the compression ratio in Layer II coding can be increased from 4:1 to 8:1 or 6:1.
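A small sketch of this grouping: the three samples are treated as digits of a base-3 (or base-5 or base-9) number, so a granule becomes a single index that can be coded in 5 (or 7 or 10) bits. The function names below are illustrative.

    def pack_granule(samples, n_levels):
        """Pack three quantized samples (each in 0..n_levels-1) into one index."""
        a, b, c = samples
        return (a * n_levels + b) * n_levels + c

    def unpack_granule(index, n_levels):
        """Invert pack_granule."""
        c = index % n_levels
        index //= n_levels
        b = index % n_levels
        a = index // n_levels
        return a, b, c

    # Example: three 3-level samples fit into one value in 0..26 (5 bits)
    # instead of 3 x 2 = 6 bits; three 9-level samples fit into 0..728 (10 bits).
    assert pack_granule((2, 1, 0), 3) == 21
    assert unpack_granule(21, 3) == (2, 1, 0)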

The frame structure for the Layer II coder can be seen in Figure 17. The only real difference between this frame structure and the frame structure of the Layer I coder is the scalefactor selection information field.

Figure 17: Frame structure for Layer 2.

6.10 Layer III Coding (mp3)

Layer III coding, which has become widely popular under the name mp3, is considerably more complex than the Layer I and Layer II coding schemes. One of the problems with the Layer I and Layer II coding schemes was that with the 32-band decomposition, the bandwidth of the subbands at lower frequencies is significantly larger than the critical bands. This makes it difficult to make an accurate judgment of the mask-to-signal ratio. If we get a high-amplitude tone within a subband and if the subband is narrow enough, we could assume that it masks other tones in the band. However, if the bandwidth of the subband is significantly higher than the critical bandwidth at that frequency, it becomes more difficult to determine whether other tones in the subband will be masked.

To satisfy the backward compatibility requirement, the spectral decomposition in the Layer III algorithm is performed in two stages. First the 32-band subband decomposition used in Layer I and Layer II is employed. The output of each subband is then transformed using a modified discrete cosine transform (MDCT) with a 50% overlap. The Layer III algorithm specifies two sizes for the MDCT, 6 or 18. This means that the output of each subband can be decomposed into 18 frequency coefficients or 6 frequency coefficients.
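As a sketch, the forward MDCT maps a block of 2N samples (N = 6 or 18 here) to N coefficients; consecutive blocks share N samples, which is the 50% overlap. The windowing applied by the standard before the transform is omitted below, so this is only illustrative.

    import numpy as np

    def mdct(block):
        """MDCT of a 2N-sample block, giving N frequency coefficients."""
        two_n = len(block)
        n = two_n // 2
        out = np.empty(n)
        for k in range(n):
            angles = np.pi / n * (np.arange(two_n) + 0.5 + n / 2) * (k + 0.5)
            out[k] = np.sum(block * np.cos(angles))
        return out

    # 18 coefficients from 36 consecutive samples of one subband's output.
    coeffs = mdct(np.random.randn(36))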

The reason for having two sizes for the MDCT is that when we transform a sequence into the frequency domain, we lose time resolution even as we gain frequency resolution. The larger the block size, the more we lose in terms of time resolution. The problem with this is that any quantization noise introduced into the frequency coefficients will get spread over the entire block size of the transform. Backward temporal masking occurs for only a short duration prior to the masking sound (approximately 20 msec). Therefore, if a sharp attack occurs near the end of a long block, the quantization noise spread over the earlier part of the block falls outside this masking window and becomes audible as a pre-echo; switching to the short MDCT limits how far the noise can spread.

For the long windows we end up with 18 frequencies per subband, resulting in a total of 576 frequencies. For the short windows we get 6 coefficients per subband for a total of 192 frequencies. The standard allows for a mixed block mode in which the two lowest subbands use long windows while the remaining subbands use short windows. Notice that while the number of frequencies may change depending on whether we are using long or short windows, the number of samples in a frame stays at 1152. That is 36 samples, or 3 groups of 12, from each of the 32 subband filters.

The coding and quantization of the output of the MDCT is conducted in an iterative fashion using two nested loops. There is an outer loop called the distortion control loop whose purpose is to ensure that the introduced quantization noise lies below the audibility threshold. The scalefactors are used to control the level of quantization noise. In Layer III, scalefactors are assigned to groups or bands of coefficients in which the bands are approximately the size of critical bands. There are 21 scalefactor bands for long blocks and 12 scalefactor bands for short blocks.

Figure 19: Frames in Layer III.

The inner loop is called the rate control loop. The goal of this loop is to make sure that a target bit rate is not exceeded. This is done by iterating between different quantizers and Huffman codes. The quantizers used in mp3 are companded non-uniform quantizers. The scaled MDCT coefficients are first quantized and organized into regions. Coefficients at the higher end of the frequency scale are likely to be quantized to zero. These consecutive zero outputs are treated as a single region and the run length is Huffman encoded. Below this region of zero coefficients, the encoder identifies the set of coefficients that are quantized to 0 or ±1. These coefficients are grouped into groups of four. This set of quadruplets is the second region of coefficients. Each quadruplet is encoded using a single Huffman codeword.

The remaining coefficients are divided into two or three subregions. Each subregion is assigned a Huffman code based on its statistical characteristics. If the result of using this variable-length coding exceeds the bit budget, the quantizer is adjusted to increase the quantization step size.
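A minimal sketch of this three-way partitioning is given below, using the common mp3 names big_values, count1, and rzero for the three regions and ignoring details such as region boundaries being aligned to scalefactor band edges.

    def partition_regions(q):
        """Split quantized MDCT coefficients (ordered low to high frequency)
        into the three regions described above (a sketch)."""
        n = len(q)
        # Trailing run of zeros ("rzero" region) -- run-length coded.
        end = n
        while end > 0 and q[end - 1] == 0:
            end -= 1
        # Below that, coefficients quantized to 0 or +/-1, taken in
        # quadruplets ("count1" region) -- one Huffman codeword each.
        start = end
        while start > 0 and abs(q[start - 1]) <= 1:
            start -= 1
        start += (end - start) % 4       # keep the region a multiple of 4
        big_values = q[:start]           # remaining low-frequency coefficients
        count1 = q[start:end]
        rzero = q[end:]
        return big_values, count1, rzero

    # Example
    q = [12, 7, -3, 2, 1, 0, -1, 1, 0, 0, 0, 0]
    big, count1, rzero = partition_regions(q)
    # big -> [12, 7, -3, 2]; count1 -> [1, 0, -1, 1]; rzero -> [0, 0, 0, 0]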

The process is repeated until the target rate is satisfied. The psychoacoustic model is then used to check whether the quantization noise in any band exceeds the allowed distortion. If it does, the scalefactor is adjusted to reduce the quantization noise. Once all scalefactors have been adjusted, control returns to the rate control loop. The iterations terminate either when the distortion and rate conditions are satisfied or when the scalefactors cannot be adjusted any further.
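The structure of the two loops can be sketched as follows. Here quantize_and_count_bits and band_distortion are hypothetical placeholders for the actual quantization, Huffman coding, and noise measurement, and MAX_SCF is an assumed bound on how far a scalefactor band can be amplified.

    MAX_SCF = 15   # assumed bound on scalefactor amplification

    def encode_granule(coeffs, target_bits, scf_bands, allowed_distortion,
                       quantize_and_count_bits, band_distortion):
        """Sketch of the nested distortion control and rate control loops."""
        scalefactors = [0] * len(scf_bands)
        while True:
            # Inner (rate control) loop: coarsen the quantizer until the
            # Huffman-coded representation fits in the bit budget.
            step = 1.0
            while True:
                bits, quantized = quantize_and_count_bits(coeffs, scalefactors, step)
                if bits <= target_bits:
                    break
                step *= 2 ** 0.25
            # Outer (distortion control) loop: amplify the bands whose
            # quantization noise exceeds the allowed distortion.
            noisy = [b for b in range(len(scf_bands))
                     if band_distortion(coeffs, quantized, scf_bands[b])
                        > allowed_distortion[b]]
            if not noisy or all(scalefactors[b] >= MAX_SCF for b in noisy):
                return scalefactors, step, quantized
            for b in noisy:
                scalefactors[b] = min(scalefactors[b] + 1, MAX_SCF)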

There will be frames in which the number of bits used by the Huffman coder is less than the amount allocated. These bits are saved in a conceptual bit reservoir. In practice, what this means is that the start of a block of data does not necessarily coincide with the header of the frame. Consider the three frames shown in Figure 19. In this example, the main data for the first frame (which includes scalefactor information and the Huffman coded data) does not occupy the entire frame. Therefore, the main data for the second frame starts before the second frame actually begins. The same is true for the remaining data. The main data can begin in the previous frame. However, the main data for a particular frame cannot spill over into the following frame. All this complexity allows for a very efficient encoding of audio inputs. The typical mp3 audio file has a compression ratio of about 10:1. In spite of this high level of compression, most people cannot tell the difference between the original and the compressed representation.
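As a rough sketch of the bit reservoir bookkeeping described above: bits left over by one frame allow the main data of a later frame to begin before that frame's header, which the format signals with a backward pointer (main_data_begin in mp3 terms). The accounting below is simplified, and the cap corresponds to the 9-bit pointer (511 bytes).

    def reservoir_positions(frame_bits, main_data_bits, max_reservoir=4088):
        """For each frame, return how many bits before its own header the
        main data begins (0 means it starts at the frame itself). A sketch."""
        reservoir = 0
        begins = []
        for used in main_data_bits:
            # Main data for this frame may start 'reservoir' bits earlier.
            begins.append(reservoir)
            # Unused bits from this frame are added to the reservoir.
            reservoir = min(reservoir + frame_bits - used, max_reservoir)
            reservoir = max(reservoir, 0)
        return begins

    # Example: the second frame's main data starts 200 bits before its header.
    print(reservoir_positions(1000, [800, 1100, 900]))   # [0, 200, 100]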