Audio Compression Notes (Data Compression)


Chapter 2

AUDIO COMPRESSION

Digital Audio, Lossy sound compression, µ-law and A-law Companding, DPCM and ADPCM audio compression, MPEG audio standard, frequency domain coding, format of compressed data.

1. Introduction:

Two important features of audio compression are (1) it can be lossy and (2) it requires fast decoding. Text compression must be lossless, but images and audio can lose much data without a noticeable degradation of quality. Thus, there are both lossless and lossy audio compression algorithms. Often, audio is stored in compressed form and has to be decompressed in real time when the user wants to listen to it. This is why most audio compression methods are asymmetric. The encoder can be slow, but the decoder has to be fast. This is also why audio compression methods are not dictionary based. A dictionary-based compression method may have many advantages, but fast decoding is not one of them.

We can define sound as:

(a) An intuitive definition: Sound is the sensation detected by our ears and interpreted by our brain in a certain way.

(b) A scientific definition: Sound is a physical disturbance in a medium. It propagates in the medium as a pressure wave by the movement of atoms or molecules.

Like any other wave, sound has three important attributes: its speed, amplitude, and period. The speed of sound depends mostly on the medium it passes through, and on the temperature. The human ear is sensitive to a wide range of sound frequencies, normally from about 20 Hz to about 22,000 Hz, depending on a person's age and health. This is the range of audible frequencies. Some animals, most notably dogs and bats, can hear higher frequencies (ultrasound). Loudness is commonly measured in units of dB SPL (sound pressure level) instead of sound power. The definition is

Level (dB SPL) = 20 log10(p / p0),

where p is the sound pressure and p0 is a standard reference pressure of about 20 µPa.

2. Digital Audio:

Sound can be digitized and broken up into numbers. Digitizing sound is done by measuring the voltage at many points in time, translating each measurement into a number, and writing the numbers to a file. This process is called sampling. The sound wave is sampled, and the samples become the digitized sound. The device used for sampling is called an analog-to-digital converter (ADC).

Since the audio samples are numbers, they are easy to edit. However, the main use of an audio file is to play it back. This is done by converting the numeric samples back into voltages that are continuously fed into a speaker. The device that does that is called a digital-to-analog converter (DAC). Intuitively, it is clear that a high sampling rate would result in better sound reproduction, but also in many more samples and therefore bigger files. Thus, the main problem in audio sampling is how often to sample a given sound.

Figure 1: Sampling of a Sound Wave

Figure 1a shows the effect of a low sampling rate. The sound wave in the figure is sampled four times, and all four samples happen to be identical. When these samples are used to play back the sound, the result is silence. Figure 1b shows seven samples, and they seem to follow the original wave fairly closely. Unfortunately, when they are used to reproduce the sound, they produce the curve shown dashed. There simply are not enough samples to reconstruct the original sound wave. The solution to the sampling problem is to sample sound at a little over the Nyquist frequency, which is twice the maximum frequency contained in the sound.

    The sampling rate plays a different role in determining the quality of digital sound reproduction. One classic

    law in digital signal processing was published by Harry Nyquist. He determined that to accurately reproduce a

    signal of frequency f, the sampling rate has to be greater than 2*f. This is commonly called the Nyquist Rate. It

    is used in many practical situations. The range of human hearing, for instance, is between 16 Hz and 22,000 Hz.

    When sound is digitized at high quality (such as music recorded on a CD), it is sampled at the rate of 44,100 Hz.

    Anything lower than that results in distortions.

Thus, if a sound contains frequencies of up to 2 kHz, it should be sampled at a little more than 4 kHz. Such a sampling rate guarantees true reproduction of the sound. This is illustrated in Figure 1c, which shows 10 equally spaced samples taken over four periods. Notice that the samples do not have to be taken from the maxima or minima of the wave; they can come from any point.

The range of human hearing is typically from 16-20 Hz to 20,000-22,000 Hz, depending on the person and on age. When sound is digitized at high fidelity, it should therefore be sampled at a little over the Nyquist rate of 2 × 22,000 = 44,000 Hz. This is why high-quality digital sound is based on a 44,100 Hz sampling rate. Anything lower than this rate results in distortions, while higher sampling rates do not produce any improvement in the reconstruction (playback) of the sound. We can consider the sampling rate of 44,100 Hz a lowpass filter, since it effectively removes all the frequencies above 22,000 Hz.

The telephone system, originally designed for conversations, not for digital communications, samples sound at only 8 kHz. Thus, any frequency higher than 4,000 Hz gets distorted when sent over the phone, which is why it is hard to distinguish, on the phone, between the sounds of "f" and "s". The second problem in sound sampling is the sample size. Each sample becomes a number, but how large should this number be? In practice, samples are normally either 8 or 16 bits. Assuming that the highest voltage in a sound wave is 1 volt, an 8-bit sample can distinguish voltages as low as 1/256 ≈ 0.004 volt, or 4 millivolts (mv). A quiet sound, generating a wave lower than 4 mv, would be sampled as zero and played back as silence. In contrast, with a 16-bit sample it is possible to distinguish sounds as low as 1/65,536 ≈ 15 microvolts (µv). We can think of the sample size as a quantization of the original audio data.

Audio sampling is also called pulse code modulation (PCM). The term pulse modulation refers to techniques for converting a continuous wave to a stream of binary numbers (audio samples). Possible pulse modulation methods include pulse amplitude modulation (PAM), pulse position modulation (PPM), pulse width modulation (PWM), and pulse number modulation (PNM). In practice, however, PCM has proved the most effective form of converting sound waves to numbers. When stereo sound is digitized, the PCM encoder multiplexes the left and right sound samples. Thus, stereo sound sampled at 22,000 Hz with 16-bit samples generates 44,000 16-bit samples per second, for a total of 704,000 bits/sec, or 88,000 bytes/sec.
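To make the arithmetic concrete, here is a small Python sketch (not part of the original notes) that computes the raw PCM bit rate for a given sampling rate, sample size, and channel count, reproducing the stereo figures quoted above:

def pcm_bitrate(sampling_rate_hz, bits_per_sample, channels):
    """Raw PCM bit rate in bits per second."""
    return sampling_rate_hz * bits_per_sample * channels

# Stereo sound sampled at 22,000 Hz with 16-bit samples (the example above):
bps = pcm_bitrate(22_000, 16, 2)
print(bps, bps // 8)                 # 704000 bits/sec, 88000 bytes/sec

# CD-quality stereo at 44,100 Hz:
print(pcm_bitrate(44_100, 16, 2))    # 1411200 bits/sec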

2.1 Digital Audio and Laplace Distribution:

A large audio file with a long, complex piece of music tends to have all the possible values of audio samples. Consider the simple case of 8-bit audio samples, which have values in the interval [0, 255]. A large audio file, with millions of audio samples, will tend to have many audio samples concentrated around the center of this interval (around 128), fewer large samples (close to the maximum 255), and few small samples (although there may be many audio samples of 0, because many types of sound tend to have periods of silence). The distribution of the samples may have a maximum at its center and another spike at 0. Thus, the audio samples themselves do not normally have a simple distribution.

However, when we examine the differences of adjacent samples, we observe a completely different behavior. Consecutive audio samples tend to be correlated, which is why the differences of consecutive samples tend to be small numbers. Experiments with many types of sound indicate that the distribution of audio differences resembles the Laplace distribution. The differences of consecutive correlated values tend to have a narrow, peaked distribution, resembling the Laplace distribution. This is true for the differences of audio samples as well as for the differences of consecutive pixels of an image. A compression algorithm may take advantage of this fact and encode the differences with variable-size codes that have a Laplace distribution. A more sophisticated version may compute differences between actual values (audio samples or pixels) and their predicted values, and then encode the (Laplace distributed) differences. Two such methods are image MLP and FLAC.
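The effect is easy to observe. The following sketch (assuming NumPy, and using a synthetic correlated signal as a stand-in for real decoded audio) compares the spread of the raw samples with the spread of their first differences:

import numpy as np

# Synthetic stand-in for decoded audio: a slowly varying tone plus a little noise,
# so consecutive samples are correlated (as they are in real recordings).
n = np.arange(50_000)
samples = np.round(100 * np.sin(2 * np.pi * n / 400)
                   + np.random.normal(0, 2, n.size)).astype(int)

diffs = np.diff(samples)      # differences of adjacent samples

# The differences cluster in a narrow, peaked range around 0, so their spread
# is much smaller than that of the samples themselves.
print("sample std:    ", samples.std())
print("difference std:", diffs.std())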

2.2 The Human Auditory System

The frequency range of the human ear is from about 20 Hz to about 20,000 Hz, but the ear's sensitivity to sound is not uniform. It depends on the frequency. It should also be noted that the range of the human voice is much more limited. It is only from about 500 Hz to about 2 kHz. The existence of the hearing threshold suggests an approach to lossy audio compression: just delete any audio samples that are below the threshold. Since the threshold depends on the frequency, the encoder needs to know the frequency spectrum of the sound being compressed at any time. If a signal for frequency f is smaller than the hearing threshold at f, it (the signal) should be deleted. In addition to this, two more properties of the human hearing system are used in audio compression. They are frequency masking and temporal masking.

2.2.1 Spectral Masking or Frequency Masking:

Frequency masking (also known as auditory masking or spectral masking) occurs when a sound that we can normally hear (because it is loud enough) is masked by another sound with a nearby frequency. The thick arrow in Figure 2 represents a strong sound source at 8 kHz. This source raises the normal threshold in its vicinity (the dashed curve), with the result that the nearby sound represented by the arrow at "x", a sound that would normally be audible because it is above the threshold, is now masked, and is inaudible. A good lossy audio compression method should identify this case and delete the signals corresponding to sound "x", because it cannot be heard anyway. This is one way to lossily compress sound.

Figure 2: Spectral or frequency masking

The frequency masking (the width of the dashed curve of Figure 2) depends on the frequency. It varies from about 100 Hz for the lowest audible frequencies to more than 4 kHz for the highest. The range of audible frequencies can therefore be partitioned into a number of critical bands that indicate the declining sensitivity of the ear (rather, its declining resolving power) for higher frequencies. We can think of the critical bands as a measure similar to frequency. However, in contrast to frequency, which is absolute and has nothing to do with human hearing, the critical bands are determined according to the sound perception of the ear. Thus, they constitute a perceptually uniform measure of frequency. Table 1 lists 27 approximate critical bands.

Table 1: Twenty-Seven Approximate Critical Bands.

This also points the way to designing a practical lossy compression algorithm. The audio signal should first be transformed into its frequency domain, and the resulting values (the frequency spectrum) should be divided into subbands that resemble the critical bands as much as possible. Once this is done, the signals in each subband should be quantized such that the quantization noise (the difference between the original sound sample and its quantized value) is inaudible.

2.2.2 Temporal Masking

Temporal masking may occur when a strong sound A of frequency f is preceded or followed in time by a weaker sound B at a nearby (or the same) frequency. If the time interval between the sounds is short, sound B may not be audible. Figure 3 illustrates an example of temporal masking. The threshold of temporal masking due to a loud sound at time 0 goes down, first sharply, then slowly. A weaker sound of 30 dB will not be audible if it occurs 10 ms before or after the loud sound, but will be audible if the time interval between the sounds is 20 ms.

Figure 3: Threshold and Masking of Sound.

If the masked sound occurs prior to the masking tone, this is called premasking or backward masking, and if the sound being masked occurs after the masking tone this effect is called postmasking or forward masking. The forward masking remains in effect for a much longer time interval than the backward masking.

3. Lossy Sound Compression

It is possible to get better sound compression by developing lossy methods that take advantage of our perception of sound, and discard data to which the human ear is not sensitive. We briefly describe two approaches, silence compression and companding.

The principle of silence compression is to treat small samples as if they were silence (i.e., as samples of 0). This generates run lengths of zero, so silence compression is actually a variant of RLE, suitable for sound compression. This method uses the fact that some people have less sensitive hearing than others, and will tolerate the loss of sound that is so quiet they may not hear it anyway. Audio files containing long periods of low-volume sound will respond to silence compression better than other files with high-volume sound. This method requires a user-controlled parameter that specifies the largest sample that should be suppressed.
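A minimal sketch of the idea, assuming integer samples and a user-chosen threshold: samples at or below the threshold are treated as silence, and the resulting runs of zeros are run-length encoded:

def silence_compress(samples, threshold):
    """Zero out quiet samples, then run-length encode the zero runs."""
    out, zero_run = [], 0
    for s in samples:
        if abs(s) <= threshold:               # treat small samples as silence
            zero_run += 1
        else:
            if zero_run:
                out.append(("Z", zero_run))   # a run of zero_run silent samples
                zero_run = 0
            out.append(("S", s))              # an ordinary sample
    if zero_run:
        out.append(("Z", zero_run))
    return out

# Example: a quiet passage between two louder ones, threshold of 3.
print(silence_compress([120, 2, -1, 0, 3, -2, 95], threshold=3))
# [('S', 120), ('Z', 5), ('S', 95)]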

Companding (short for compressing/expanding) uses the fact that the ear requires more precise samples at low amplitudes (soft sounds), but is more forgiving at higher amplitudes. A typical ADC used in sound cards for personal computers converts voltages to numbers linearly. If an amplitude a is converted to the number n, then amplitude 2a will be converted to the number 2n. A compression method using companding examines every sample in the sound file, and employs a nonlinear formula to reduce the number of bits devoted to it. More sophisticated methods, such as µ-law and A-law, are commonly used.

4. µ-Law and A-Law Companding

The µ-law and A-law companding standards employ logarithm-based functions to encode audio samples for ISDN (integrated services digital network) digital telephony services, by means of nonlinear quantization. The ISDN hardware samples the voice signal from the telephone at 8 kHz and generates 14-bit samples (13 for A-law). The method of µ-law companding is used in North America and Japan, and A-law is used elsewhere.

Experiments indicate that the low amplitudes of speech signals contain more information than the high amplitudes. This is why nonlinear quantization makes sense. Imagine an audio signal sent on a telephone line and digitized to 14-bit samples. The louder the conversation, the higher the amplitude, and the bigger the value of the sample. Since high amplitudes are less important, they can be coarsely quantized. If the largest sample, which is 2^14 − 1 = 16,383, is quantized to 255 (the largest 8-bit number), then the compression factor is 14/8 = 1.75. When decoded, a code of 255 will become very different from the original 16,383. We say that because of the coarse quantization, large samples end up with high quantization noise. Smaller samples should be finely quantized, so they end up with low quantization noise. The µ-law encoder inputs 14-bit samples and outputs 8-bit codewords. The A-law encoder inputs 13-bit samples and also outputs 8-bit codewords. The telephone signals are sampled at 8 kHz (8,000 times per second), so the µ-law encoder receives 8,000 × 14 = 112,000 bits/sec. At a compression factor of 1.75, the encoder outputs 64,000 bits/sec.

4.1 µ-Law Encoder:

The µ-law encoder receives a 14-bit signed input sample x. Thus, the input is in the range [−8192, +8191]. The sample is normalized to the interval [−1, +1], and the encoder uses the logarithmic expression

F(x) = sgn(x) · ln(1 + µ|x|) / ln(1 + µ),

where

sgn(x) = +1 if x > 0, 0 if x = 0, −1 if x < 0

(and µ is a positive integer), to compute and output an 8-bit code in the same interval [−1, +1]. The output is then scaled to the range [−256, +255]. Figure 4 shows this output as a function of the input for the three µ values 25, 255, and 2555. It is clear that large values of µ cause coarser quantization for larger amplitudes. Such values allocate more bits to the smaller, more important, amplitudes. The G.711 standard recommends the use of µ = 255. The diagram shows only the nonnegative values of the input (i.e., from 0 to 8191). The negative side of the diagram has the same shape but with negative inputs and outputs.

Figure 4: The µ-Law for µ Values of 25, 255, and 2555.

The following simple examples illustrate the nonlinear nature of the µ-law. The two (normalized) input samples 0.15 and 0.16 are transformed by the µ-law to outputs 0.6618 and 0.6732. The difference between the outputs is 0.0114. On the other hand, the two input samples 0.95 and 0.96 (bigger inputs but with the same difference) are transformed to 0.9908 and 0.9927. The difference between these two outputs is 0.0019; much smaller. Bigger samples are decoded with more noise, and smaller samples are decoded with less noise. However, the signal-to-noise ratio is constant because both the µ-law and the SNR use logarithmic expressions.

P S2 S1 S0 Q3 Q2 Q1 Q0

Figure 5: G.711 µ-Law Codeword.

Logarithms are slow to compute, so the µ-law encoder performs much simpler calculations that produce an approximation. The output specified by the G.711 standard is an 8-bit codeword whose format is shown in Figure 5. Bit P in Figure 5 is the sign bit of the output (same as the sign bit of the 14-bit signed input sample). Bits S2, S1, and S0 are the segment code, and bits Q3 through Q0 are the quantization code. The encoder determines the segment code by (1) adding a bias of 33 to the absolute value of the input sample, (2) determining the bit position of the most significant 1-bit among bits 5 through 12 of the input, and (3) subtracting 5 from that position. The 4-bit quantization code is set to the four bits following the bit position determined in step 2. The encoder ignores the remaining bits of the input sample, and it inverts (1's complements) the codeword before it is output.

Example of µ-Law Codeword:

(a) Encoding: We use the input sample −656 as an example. The sample is negative, so bit P becomes 1. Adding 33 to the absolute value of the input yields 689 = 001010110001₂ (Figure 6).

                Q3 Q2 Q1 Q0
 0  0  0  1  0  1  0  1  1  0  0  0  1
12 11 10  9  8  7  6  5  4  3  2  1  0

Figure 6: Encoding Input Sample −656.

The most significant 1-bit in positions 5 through 12 is found at position 9. The segment code is thus 9 − 5 = 4. The quantization code is the four bits 0101 at positions 8-5, and the remaining five bits 10001 are ignored. The 8-bit codeword (which is later inverted) becomes

P S2 S1 S0 Q3 Q2 Q1 Q0
1  1  0  0  0  1  0  1

(b) Decoding: The µ-law decoder inputs an 8-bit codeword and inverts it. It then decodes it as follows:

1. Multiply the quantization code by 2 and add 33 (the bias) to the result.
2. Multiply the result by 2 raised to the power of the segment code.
3. Decrement the result by the bias.
4. Use bit P to determine the sign of the result.

Applying these steps to our example produces

1. The quantization code is 0101₂ = 5, so 5 × 2 + 33 = 43.
2. The segment code is 100₂ = 4, so 43 × 2⁴ = 688.
3. Decrement by the bias: 688 − 33 = 655.
4. Bit P is 1, so the final result is −655. Thus, the quantization error (the noise) is 1; very small.
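The encoding and decoding steps above translate directly into code. The following Python sketch is an approximation of the procedure described here (not a reference G.711 implementation); it reproduces the worked example, mapping −656 to the codeword 1 100 0101 and decoding it back to −655:

BIAS = 33

def mulaw_encode(sample: int) -> int:
    """Encode a 14-bit signed sample (-8192..8191) into an 8-bit codeword."""
    p = 1 if sample < 0 else 0              # sign bit P
    mag = abs(sample) + BIAS                # step 1: add the bias
    mag = min(mag, 0x1FFF)                  # keep the magnitude within 13 bits
    position = 12                           # step 2: most significant 1-bit in bits 5..12
    while position > 5 and not (mag & (1 << position)):
        position -= 1
    segment = position - 5                  # step 3: segment code (0..7)
    quant = (mag >> (position - 4)) & 0x0F  # the four bits below that position
    codeword = (p << 7) | (segment << 4) | quant
    return codeword ^ 0xFF                  # 1's complement before output

def mulaw_decode(codeword: int) -> int:
    """Decode an 8-bit codeword back to an approximate 14-bit sample."""
    codeword ^= 0xFF                        # undo the inversion
    p = (codeword >> 7) & 1
    segment = (codeword >> 4) & 0x07
    quant = codeword & 0x0F
    mag = ((quant * 2 + BIAS) << segment) - BIAS
    return -mag if p else mag

code = mulaw_encode(-656)
print(f"{code ^ 0xFF:08b}")   # 11000101 -> P=1, segment=100, quantization=0101
print(mulaw_decode(code))     # -655 (quantization error of 1)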

Figure 7 illustrates the nature of the µ-law midtread quantization. Zero is one of the valid output values, and the quantization steps are centered at the input value of 0. The steps are organized in eight segments of 16 steps each. The steps within each segment have the same width, but they double in width from one segment to the next.

Figure 7: µ-Law Midtread Quantization.

If we denote the segment number by i (where i = 0, 1, ..., 7) and the step within a segment by k (where k = 1, 2, ..., 16), then the middle of the tread of each step in Figure 7 (i.e., the points labeled x_j) is given by

x(16i + k) = T(i) + k × D(i),

where the constants T(i) and D(i) are the initial value and the step size for segment i, respectively. They are given by

i     0    1    2    3    4     5     6     7
T(i)  1   35  103  239  511  1055  2143  4319
D(i)  2    4    8   16   32    64   128   256

4.2 The A-Law Encoder:

The A-law encoder uses the similar expression

F(x) = sgn(x) · A|x| / (1 + ln A),             for 0 ≤ |x| < 1/A,
F(x) = sgn(x) · (1 + ln(A|x|)) / (1 + ln A),   for 1/A ≤ |x| ≤ 1.

The G.711 standard recommends the use of A = 87.6.

Figure 8: A-Law Midriser Quantization.

The operation of the A-law encoder is similar, except that the quantization (Figure 8) is of the midriser variety. The breakpoints x_j are given by the same equation,

x(16i + k) = T(i) + k × D(i),

but the initial value T(i) and the step size D(i) for segment i are different from those used by the µ-law encoder and are given by

i     0    1    2    3    4    5     6     7
T(i)  0   32   64  128  256  512  1024  2048
D(i)  2    2    4    8   16   32    64   128

The A-law encoder generates an 8-bit codeword with the same format as the µ-law encoder. It sets the P bit to the sign of the input sample. It then determines the segment code in the following steps:

1. Determine the bit position of the most significant 1-bit among the seven most significant bits of the input.
2. If such a 1-bit is found, the segment code becomes that position minus 4. Otherwise, the segment code becomes zero.

The 4-bit quantization code is set to the four bits following the bit position determined in step 1, or to half the input value if the segment code is zero. The encoder ignores the remaining bits of the input sample, and it inverts bit P and the even-numbered bits of the codeword before it is output.

The A-law decoder decodes an 8-bit codeword into a 13-bit audio sample as follows:

1. It inverts bit P and the even-numbered bits of the codeword.
2. If the segment code is nonzero, the decoder multiplies the quantization code by 2 and increments this by the bias (33). The result is then multiplied by 2 raised to the power of (segment code minus 1). If the segment code is 0, the decoder outputs twice the quantization code, plus 1.
3. Bit P is then used to determine the sign of the output.
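A minimal sketch of the three decoding steps just listed. The exact set of "even-numbered" bits is taken here as bits 0, 2, 4, and 6 of the codeword (plus bit P); that choice is an assumption made for illustration, not a statement about the G.711 bit numbering:

def alaw_decode(codeword: int) -> int:
    """Decode an 8-bit A-law codeword into a 13-bit audio sample."""
    # Step 1: invert bit P (the MSB) and the even-numbered bits
    # (assumed here to be bits 0, 2, 4, 6 -> mask 0b11010101).
    codeword ^= 0b11010101
    p = (codeword >> 7) & 1
    segment = (codeword >> 4) & 0x07
    quant = codeword & 0x0F
    # Step 2: reconstruct the magnitude.
    if segment:
        mag = (quant * 2 + 33) << (segment - 1)
    else:
        mag = quant * 2 + 1          # segment code 0: twice the code, plus 1
    # Step 3: bit P gives the sign.
    return -mag if p else mag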

5. ADPCM Audio Compression:

Adjacent audio samples tend to be similar in much the same way that neighboring pixels in an image tend to have similar colors. The simplest way to exploit this redundancy is to subtract adjacent samples and code the differences, which tend to be small integers. Any audio compression method based on this principle is called DPCM (differential pulse code modulation). Such methods, however, are inefficient, because they do not adapt themselves to the varying magnitudes of the audio stream. Better results are achieved by an adaptive version, and any such version is called ADPCM.

ADPCM: Short for Adaptive Differential Pulse Code Modulation, a form of pulse code modulation (PCM) that produces a digital signal with a lower bit rate than standard PCM. ADPCM produces a lower bit rate by recording only the difference between samples and adjusting the coding scale dynamically to accommodate large and small differences.

ADPCM employs linear prediction. It uses the previous sample (or several previous samples) to predict the current sample. It then computes the difference between the current sample and its prediction, and quantizes the difference. For each input sample X[n], the output C[n] of the encoder is simply a certain number of quantization levels. The decoder multiplies this number by the quantization step (and may add half the quantization step, for better precision) to obtain the reconstructed audio sample. The method is efficient because the quantization step is updated all the time, by both encoder and decoder, in response to the varying magnitudes of the input samples. It is also possible to adaptively modify the prediction algorithm. Various ADPCM methods differ in the way they predict the current audio sample and in the way they adapt to the input (by changing the quantization step size and/or the prediction method).

In addition to the quantized values, an ADPCM encoder can provide the decoder with side information. This information increases the size of the compressed stream, but this degradation is acceptable to the users, because it makes the compressed audio data more useful. Typical applications of side information are (1) to help the decoder recover from errors and (2) to signal an entry point into the compressed stream. An original audio stream may be recorded in compressed form on a medium such as a CD-ROM. If the user (listener) wants to listen to song 5, the decoder can use the side information to quickly find the start of that song.

Figure 9: (a) ADPCM Encoder and (b) Decoder.

Figure 9 shows the general organization of the ADPCM encoder and decoder. The adaptive quantizer receives the difference D[n] between the current input sample X[n] and the prediction Xp[n−1]. The quantizer computes and outputs the quantized code C[n] of X[n]. The same code is sent to the adaptive dequantizer (the same dequantizer used by the decoder), which produces the next dequantized difference value Dq[n]. This value is added to the previous predictor output Xp[n−1], and the sum Xp[n] is sent to the predictor to be used in the next step.

Better prediction would be obtained by feeding the actual input X[n] to the predictor. However, the decoder wouldn't be able to mimic that, since it does not have X[n]. We see that the basic ADPCM encoder is simple, and the decoder is even simpler. It inputs a code C[n], dequantizes it to a difference Dq[n], which is added to the preceding predictor output Xp[n−1] to form the next output Xp[n]. The next output is also fed into the predictor, to be used in the next step.
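To make the encoder/decoder symmetry concrete, here is a deliberately simplified ADPCM-style sketch. It does not follow any particular standard (such as IMA ADPCM or G.726); the one-sample predictor and the step-size update rule are assumptions chosen only to show how both sides stay in sync:

def adpcm_encode(samples):
    codes, predicted, step = [], 0, 16
    for x in samples:
        d = x - predicted                       # difference D[n]
        c = max(-8, min(7, round(d / step)))    # quantize to a 4-bit code C[n]
        codes.append(c)
        predicted += c * step                   # dequantize and update Xp[n]
        step = max(1, round(step * (1.5 if abs(c) >= 6 else 0.9)))
    return codes

def adpcm_decode(codes):
    samples, predicted, step = [], 0, 16
    for c in codes:
        predicted += c * step                   # same reconstruction as the encoder
        samples.append(predicted)
        step = max(1, round(step * (1.5 if abs(c) >= 6 else 0.9)))
    return samples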

6. Speech Compression:

Certain audio codecs are designed specifically to compress speech signals. Such signals are audio and are sampled like any other audio data, but because of the nature of human speech, they have properties that can be exploited for efficient compression.

6.1 Properties of Speech

We produce sound by forcing air from the lungs through the vocal cords into the vocal tract. The vocal cords can open and close, and the opening between them is called the glottis. The movements of the glottis and vocal tract give rise to different types of sound. The three main types are as follows:

1. Voiced sounds. These are the sounds we make when we talk. The vocal cords vibrate, which opens and closes the glottis, thereby sending pulses of air at varying pressures to the tract, where it is shaped into sound waves. The frequencies of the human voice, on the other hand, are much more restricted and are generally in the range of 500 Hz to about 2 kHz. This is equivalent to time periods of 2 ms to 20 ms. Thus, voiced sounds have long-term periodicity.

2. Unvoiced sounds. These are sounds that are emitted and can be heard, but are not parts of speech. Such a sound is the result of holding the glottis open and forcing air through a constriction in the vocal tract. When an unvoiced sound is sampled, the samples show little correlation and are random or close to random.

3. Plosive sounds. These result when the glottis closes, the lungs apply air pressure on it, and it suddenly opens, letting the air escape suddenly. The result is a popping sound.

6.2 Speech Codecs

There are three main types of speech codecs.

1. Waveform speech codecs: These produce good to excellent speech after compressing and decompressing it, but generate bit rates of 10-64 kbps.

2. Source codecs (also called vocoders): Vocoders generally produce poor to fair speech but can compress it to very low bit rates (down to 2 kbps).

3. Hybrid codecs: These codecs are combinations of the former two types and produce speech that varies from fair to good, with bit rates between 2 and 16 kbps.

Figure 10 illustrates the speech quality versus bit rate of these three types.

Figure 10: Speech Quality versus Bit rate for Speech Codecs.

6.3 Waveform Codecs

A waveform codec does not attempt to predict how the original sound was generated. It only tries to produce, after decompression, audio samples that are as close to the original ones as possible. Thus, such codecs are not designed specifically for speech coding and can perform equally well on all kinds of audio data. As Figure 10 illustrates, when such a codec is forced to compress sound to less than 16 kbps, the quality of the reconstructed sound drops significantly.

The simplest waveform encoder is pulse code modulation (PCM). This encoder simply quantizes each audio sample. Speech is typically sampled at only 8 kHz. If each sample is quantized to 12 bits, the resulting bit rate is 8,000 × 12 = 96 kbps and the reproduced speech sounds almost natural. Better results are obtained with a logarithmic quantizer, such as the µ-law and A-law companding methods. They quantize audio samples to varying numbers of bits and may compress speech to 8 bits per sample on average, thereby resulting in a bit rate of 64 kbps, with very good quality of the reconstructed speech.

A differential PCM speech encoder uses the fact that the audio samples of voiced speech are correlated. This type of encoder computes the difference between the current sample and its predecessor and quantizes the difference. An adaptive version (ADPCM) may compress speech at good quality down to a bit rate of 32 kbps.

Waveform coders may also operate in the frequency domain. The subband coding algorithm (SBC) transforms the audio samples to the frequency domain, partitions the resulting coefficients into several critical bands (or frequency subbands), and codes each subband separately with ADPCM or a similar quantization method. The SBC decoder decodes the frequency coefficients, recombines them, and performs the inverse transformation to (lossily) reconstruct audio samples. The advantage of SBC is that the ear is sensitive to certain frequencies and less sensitive to others. Subbands of frequencies to which the ear is less sensitive can therefore be coarsely quantized without loss of sound quality. This type of coder typically produces good reconstructed speech quality at bit rates of 16-32 kbps. They are, however, more complex to implement than PCM codecs and may also be slower.

The adaptive transform coding (ATC) speech compression algorithm transforms audio samples to the frequency domain with the discrete cosine transform (DCT). The audio file is divided into blocks of audio samples and the DCT is applied to each block, resulting in a number of frequency coefficients. Each coefficient is quantized according to the frequency to which it corresponds. Good quality reconstructed speech can be achieved at bit rates as low as 16 kbps.

6.4 Source Codecs

In general, a source encoder uses a mathematical model of the source of data. The model depends on certain parameters, and the encoder uses the input data to compute those parameters. Once the parameters are obtained, they are written (after being suitably encoded) on the compressed stream. The decoder inputs the parameters and employs the mathematical model to reconstruct the original data. If the original data is audio, the source coder is called a vocoder (from vocal coder).

6.4.1 Linear Predictive Coder (LPC):

Figure 11 shows a simplified model of speech production. Part (a) illustrates the process in a person, whereas part (b) shows the corresponding LPC mathematical model. In this model, the output is the sequence of speech samples s(n) coming out of the LPC filter (which corresponds to the vocal tract and lips). The input u(n) to the model (and to the filter) is either a train of pulses (when the sound is voiced speech) or white noise (when the sound is unvoiced speech).

Figure 11: Speech Production: (a) Real. (b) LPC Model.

The quantities u(n) are also termed innovation. The model illustrates how samples s(n) of speech can be generated by mixing innovations (a train of pulses and white noise). Thus, it represents mathematically the relation between speech samples and innovations. The task of the speech encoder is to input samples s(n) of actual speech, use the filter as a mathematical function to determine an equivalent sequence of innovations u(n), and output the innovations in compressed form. The correspondence between the model's parameters and the parts of real speech is as follows:

1. Parameter V (voiced) corresponds to the vibrations of the vocal cords. UV expresses the unvoiced sounds.

2. T is the period of the vocal cords' vibrations.

3. G (gain) corresponds to the loudness or the air volume sent from the lungs each second.

4. The innovations u(n) correspond to the air passing through the vocal tract.

5. The symbols × and + denote amplification and combination, respectively.

The main equation of the LPC model describes the output of the LPC filter as

s(n) = z − Σ_{i=1..10} a_i · s(n − i),

where z is the input to the filter [the value of one of the u(n)]. An equivalent equation describes the relation between the innovations u(n) on the one hand and the 10 coefficients a_i and the speech audio samples s(n) on the other hand. The relation is

u(n) = s(n) + Σ_{i=1..10} a_i · s(n − i).

This relation implies that each number u(n) input to the LPC filter is the sum of the current audio sample s(n) and a weighted sum of the 10 preceding samples. The LPC model can be written as the 13-tuple

A = (a_1, a_2, ..., a_10, G, V/UV, T),

where V/UV is a single bit specifying the source (voiced or unvoiced) of the input samples. The model assumes that A stays stable for about 20 ms, then gets updated by the audio samples of the next 20 ms. At a sampling rate of 8 kHz, there are 160 audio samples s(n) every 20 ms. The model computes the 13 quantities in A from these 160 samples, writes A (as 13 numbers) on the compressed stream, then repeats for the next 20 ms. The resulting compression factor is therefore 13 numbers for each set of 160 audio samples.

It's important to distinguish the operation of the encoder from the diagram of the LPC's mathematical model depicted in Figure 11b. The figure shows how a sequence of innovations u(n) generates speech samples s(n). The encoder, however, starts with the speech samples. It inputs a 20 ms sequence of speech samples s(n), computes an equivalent sequence of innovations, compresses them to 13 numbers, and outputs the numbers after further encoding them. This repeats every 20 ms.

LPC encoding (or analysis) starts with 160 sound samples and computes the 10 LPC parameters a_i by minimizing the energy of the innovation u(n). The energy is the function

E(a_1, ..., a_10) = Σ_n u²(n),

and its minimum is computed by differentiating it 10 times, with respect to each of its 10 parameters a_i. The autocorrelation function of the samples s(n) is given by

R(k) = Σ_n s(n) · s(n − k),

which is used to obtain the 10 LPC parameters a_i. The remaining three parameters, V/UV, G, and T, are determined from the 160 audio samples. If those samples exhibit periodicity, then T becomes that period and the 1-bit parameter V/UV is set to V. If the 160 samples do not feature any well-defined period, then T remains undefined and V/UV is set to UV. The value of G is determined by the largest sample.
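As an illustration of the analysis step, here is a small sketch (assuming NumPy) that computes the autocorrelation values for one 160-sample frame, solves the resulting normal equations for the 10 coefficients, and forms the prediction residual. It uses the usual minus-sign convention for the predictor, so the residual plays the role of the innovation u(n):

import numpy as np

def lpc_analyze(frame, order=10):
    s = np.asarray(frame, dtype=float)
    # autocorrelation R(k) = sum_n s(n) s(n-k)
    R = np.array([np.dot(s[k:], s[:len(s) - k]) for k in range(order + 1)])
    # normal equations: a Toeplitz system whose solution gives the predictor coefficients
    A = np.array([[R[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(A, R[1:order + 1])
    # residual u(n) = s(n) - sum_i a_i s(n-i), with missing history treated as zero
    u = s.copy()
    for i in range(1, order + 1):
        u[i:] -= a[i - 1] * s[:-i]
    return a, u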

LPC decoding (or synthesis) starts with a set of 13 LPC parameters and computes 160 audio samples as the output of the LPC filter by

s(n) = u(n) − Σ_{i=1..10} a_i · s(n − i),

where u(n) is the innovation (a train of pulses with period T for voiced speech, or white noise for unvoiced speech, scaled by the gain G). These samples are played at 8,000 samples per second and result in 20 ms of (voiced or unvoiced) reconstructed speech.

Advantages of LPC:

1. LPC provides a good model of the speech signal.

2. The way in which LPC is applied to the analysis of speech signals leads to a reasonable source-vocal tract separation.

3. LPC is an analytically tractable model. The model is mathematically precise and simple and straightforward to implement in either software or hardware.

6.5 Hybrid Codecs

This type of speech codec combines features from both waveform and source codecs. The most popular hybrid codecs are Analysis-by-Synthesis (AbS) time-domain algorithms. Like the LPC vocoder, these codecs model the vocal tract by a linear prediction filter, but use an excitation signal instead of the simple, two-state voiced-unvoiced model to supply the u(n) (innovation) input to the filter. Thus, an AbS encoder starts with a set of speech samples (a frame), encodes them similarly to LPC, decodes them, and subtracts the decoded samples from the original ones. The differences are sent through an error minimization process that outputs improved encoded samples. These samples are again decoded, subtracted from the original samples, and new differences computed. This is repeated until the differences satisfy a termination condition. The encoder then proceeds to the next set of speech samples (next frame).

6.5.1 Code Excited Linear Prediction (CELP):

One of the most important factors in generating natural-sounding speech is the excitation signal. As the human ear is especially sensitive to pitch errors, a great deal of effort has been devoted to the development of accurate pitch detection algorithms.

In CELP, instead of having a codebook of pulse patterns, we allow a variety of excitation signals. For each segment the encoder finds the excitation vector that generates synthesized speech that best matches the speech segment being encoded. This approach is closer in a strict sense to a waveform coding technique such as DPCM than to the analysis/synthesis schemes. The main components of the CELP coder include the LPC analysis, the excitation codebook, and the perceptual weighting filter. Besides CELP, the MP-LPC algorithm had another descendant that has become a standard. Instead of using excitation vectors in which the nonzero values are separated by an arbitrary number of zero values, they forced the nonzero values to occur at regularly spaced intervals. Furthermore, MP-LPC allowed the nonzero values to take on a number of different values. This scheme is called regular pulse excitation (RPE) coding. A variation of RPE, called regular pulse excitation with long-term prediction (RPE-LTP), was adopted as a standard for digital cellular telephony by the Group Speciale Mobile (GSM) subcommittee of the European Telecommunications Standards Institute at the rate of 13 kbps.

The vocal tract filter used by the CELP coder is given by

y_n = Σ_{i=1..10} b_i · y_{n−i} + β · y_{n−P} + ε_n,

where P is the pitch period and the term β · y_{n−P} is the contribution due to the pitch periodicity.

1. The input speech is sampled at 8,000 samples per second and divided into 30-millisecond frames containing 240 samples.

2. Each frame is divided into four subframes of length 7.5 milliseconds.

3. The coefficients for the 10th-order short-term filter are obtained using the autocorrelation method.

4. The pitch period P is calculated once every subframe. In order to reduce the computational load, the pitch value is assumed to lie between 20 and 147 every odd subframe.

5. In every even subframe, the pitch value is assumed to lie within 32 samples of the pitch value in the previous frame.

6. The algorithm uses two codebooks, a stochastic codebook and an adaptive codebook. An excitation sequence is generated for each subframe by adding one scaled element from the stochastic codebook and one scaled element from the adaptive codebook.

7. The stochastic codebook contains 512 entries. These entries are generated using a Gaussian random number generator, the output of which is quantized to −1, 0, or 1. The codebook entries are adjusted so that each entry differs from the preceding entry in only two places.

8. The adaptive codebook consists of the excitation vectors from the previous frame. Each time a new excitation vector is obtained, it is added to the codebook. In this manner, the codebook adapts to local statistics.

9. The coder has been shown to provide excellent reproductions in both quiet and noisy environments at rates of 4.8 kbps and above.

10. The quality of the reproduction of this coder at 4.8 kbps has been shown to be equivalent to a delta modulator operating at 32 kbps. The price for this quality is much higher complexity and a much longer coding delay.

6.5.2 CCITT G.728 CELP Speech Coding Standard:

By their nature, the speech coding schemes have some coding delay built into them. By coding delay, we mean the time between when a speech sample is encoded and when it is decoded if the encoder and decoder were connected back to back (i.e., there were no transmission delays). In the schemes we have studied, a segment of speech is first stored in a buffer. We do not start extracting the various parameters until a complete segment of speech is available to us. Once the segment is completely available, it is processed. If the processing is real time, this means another segment's worth of delay. Finally, once the parameters have been obtained, coded, and transmitted, the receiver has to wait until at least a significant part of the information is available before it can start decoding the first sample. Therefore, if a segment contains 20 milliseconds' worth of data, the coding delay would be approximately somewhere between 40 to 60 milliseconds.

For such applications, CCITT approved recommendation G.728, a CELP coder with a coder delay of 2 milliseconds operating at 16 kbps. As the input speech is sampled at 8,000 samples per second, this rate corresponds to an average rate of 2 bits per sample. The G.728 recommendation uses a segment size of five samples. With five samples and a rate of 2 bits per sample, we only have 10 bits available to us. Using only 10 bits, it would be impossible to encode the parameters of the vocal tract filter as well as the excitation vector. Therefore, the algorithm obtains the vocal tract filter parameters in a backward adaptive manner; that is, the vocal tract filter coefficients to be used to synthesize the current segment are obtained by analyzing the previous decoded segments. The G.728 algorithm uses a 50th-order vocal tract filter. The order of the filter is large enough to model the pitch of most female speakers. Not being able to use pitch information for male speakers does not cause much degradation. The vocal tract filter is updated every fourth frame, which is once every 20 samples or 2.5 milliseconds. The autocorrelation method is used to obtain the vocal tract parameters.

Figure 12: Encoder and decoder for the CCITT G.728 16 kbps CELP speech codec.

Ten bits would be able to index 1,024 excitation sequences. However, to examine 1,024 excitation sequences every 0.625 milliseconds is a rather large computational load. In order to reduce this load, the G.728 algorithm uses a product codebook where each excitation sequence is represented by a normalized sequence and a gain term. The final excitation sequence is a product of the normalized excitation sequence and the gain. Of the 10 bits, 3 bits are used to encode the gain using a predictive encoding scheme, while the remaining 7 bits form the index to a codebook containing 127 sequences.

Block diagrams of the encoder and decoder for the CCITT G.728 coder are shown in Figure 12. The low-delay CCITT G.728 CELP coder operating at 16 kbps provides reconstructed speech quality superior to the 32 kbps CCITT G.726 ADPCM algorithm. Various efforts are underway to reduce the bit rate for this algorithm without compromising too much on quality and delay.

6.5.3 Mixed Excitation Linear Prediction (MELP):

The mixed excitation linear prediction (MELP) coder was selected to be the new federal standard for speech coding at 2.4 kbps. It uses the same LPC filter to model the vocal tract. However, it uses a much more complex approach to the generation of the excitation signal. A block diagram of the decoder for the MELP system is shown in Figure 13. As evident from the figure, the excitation signal for the synthesis filter is no longer simply noise or a periodic pulse but a multiband mixed excitation. The mixed excitation contains both a filtered signal from a noise generator as well as a contribution that depends directly on the input signal.

The first step in constructing the excitation signal is pitch extraction. The MELP algorithm obtains the pitch period using a multistep approach. In the first step an integer pitch value P1 is obtained by

1. first filtering the input using a lowpass filter with a cutoff of 1 kHz,
2. computing the normalized autocorrelation for lags between 40 and 160.

The normalized autocorrelation r(τ) is defined as

r(τ) = Σ_n y(n) · y(n − τ) / sqrt( Σ_n y²(n) · Σ_n y²(n − τ) ).

The first estimate of the pitch P1 is obtained as the value of τ that maximizes the normalized autocorrelation function. This stage uses two values of P1, one from the current frame and one from the previous frame, as candidates. The normalized autocorrelation values are obtained for lags from five samples less to five samples more than the candidate P1 values. The lags that provide the maximum normalized autocorrelation value for each candidate are used for fractional pitch refinement.
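A small sketch (assuming NumPy) of this first pitch-estimation step. A crude moving average stands in for the 1 kHz lowpass filter; everything else follows the description above:

import numpy as np

def estimate_pitch(y, lags=range(40, 161)):
    y = np.convolve(y, np.ones(8) / 8, mode="same")   # stand-in lowpass filter

    def r(tau):
        # normalized autocorrelation at lag tau
        num = np.dot(y[tau:], y[:-tau])
        den = np.sqrt(np.dot(y[tau:], y[tau:]) * np.dot(y[:-tau], y[:-tau]))
        return num / den if den else 0.0

    # integer pitch estimate P1: the lag that maximizes r(tau)
    return max(lags, key=r)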

Figure 13: Block diagram of MELP decoder.

The final refinements of the pitch value are obtained using the linear prediction residuals. The residual sequence is generated by filtering the input speech signal with the filter obtained using the LPC analysis. For the purposes of pitch refinement the residual signal is filtered using a lowpass filter with a cutoff of 1 kHz. The normalized autocorrelation function is computed for this filtered residual signal for lags from five samples less to five samples more than the candidate P2 value, and a candidate value of P3 is obtained.

The input is also subjected to a multiband voicing analysis using five filters with passbands 0-500, 500-1000, 1000-2000, 2000-3000, and 3000-4000 Hz. The goal of the analysis is to obtain the voicing strengths Vbp_i for each band used in the shaping filters. If the value of Vbp1 is small, this indicates a lack of low-frequency structure, which in turn indicates an unvoiced or transition input. Thus, if Vbp1 > 0.6, the values of the other voicing strengths are quantized to 1 if their value is greater than 0.6, and to 0 otherwise. In this way signal energy in the different bands is turned on or off depending on the voicing strength.

In order to generate the pulse input, the algorithm measures the magnitude of the discrete Fourier transform coefficients corresponding to the first 10 harmonics of the pitch. The magnitudes of the harmonics are quantized using a vector quantizer with a codebook size of 256. The codebook is searched using a weighted Euclidean distance that emphasizes lower frequencies over higher frequencies.

At the decoder, using the magnitudes of the harmonics and information about the periodicity of the pulse train, the algorithm generates one excitation signal. Another signal is generated using a random number generator. Both are shaped by the multiband shaping filter before being combined. This mixture signal is then processed through an adaptive spectral enhancement filter, which is based on the LPC coefficients, to form the final excitation signal. Note that in order to preserve continuity from frame to frame, the parameters used for generating the excitation signal are adjusted based on their corresponding values in neighboring frames.

6.6 MPEG Audio Coding

The formal name of MPEG-1 is the international standard for moving picture video compression, IS 11172. It consists of five parts, of which part 3 [ISO/IEC 93] is the definition of the audio compression algorithm. The document describing MPEG-1 has normative and informative sections. A normative section is part of the standard specification. It is intended for implementers, it is written in a precise language, and it should be strictly followed in implementing the standard on actual computer platforms. An informative section, on the other hand, illustrates concepts that are discussed elsewhere, explains the reasons that led to certain choices and decisions, and contains background material. An example of a normative section is the tables of various parameters and of the Huffman codes used in MPEG audio. An example of an informative section is the algorithm used by MPEG audio to implement a psychoacoustic model. MPEG does not require any particular algorithm, and an MPEG encoder can use any method to implement the model. This informative section simply describes various alternatives.

The MPEG-1 and MPEG-2 (or in short, MPEG-1/2) audio standard specifies three compression methods called layers and designated I, II, and III. All three layers are part of the MPEG-1 standard. A movie compressed by MPEG-1 uses only one layer, and the layer number is specified in the compressed stream. Any of the layers can be used to compress an audio file without any video. An interesting aspect of the design of the standard is that the layers form a hierarchy in the sense that a layer III decoder can also decode audio files compressed by layers I or II.

The result of having three layers was an increasing popularity of layer III. The encoder is extremely complex, but it produces excellent compression, and this, combined with the fact that the decoder is much simpler, has produced in the late 1990s an explosion of what is popularly known as mp3 sound files. It is easy to legally and freely obtain a layer III decoder and much music that is already encoded in layer III. So far, this has been a big success of the audio part of the MPEG project.

The principle of MPEG audio compression is quantization. The values being quantized, however, are not the audio samples but numbers (called signals) taken from the frequency domain of the sound. The fact that the compression ratio (or equivalently, the bit rate) is known to the encoder means that the encoder knows at any time how many bits it can allocate to the quantized signals. Thus, the (adaptive) bit allocation algorithm is an important part of the encoder. This algorithm uses the known bit rate and the frequency spectrum of the most recent audio samples to determine the size of the quantized signals such that the quantization noise (the difference between an original signal and a quantized one) will be inaudible.

Figure 14: MPEG Audio: (a) Encoder and (b) Decoder.

The psychoacoustic models use the frequency of the sound that is being compressed, but the input stream consists of audio samples, not sound frequencies. The frequencies have to be computed from the samples. This is why the first step in MPEG audio encoding is a discrete Fourier transform, where a set of 512 consecutive audio samples is transformed to the frequency domain. Since the number of frequencies can be huge, they are grouped into 32 equal-width frequency subbands (layer III uses different numbers but the same principle). For each subband, a number is obtained that indicates the intensity of the sound at that subband's frequency range. These numbers (called signals) are then quantized. The coarseness of the quantization in each subband is determined by the masking threshold in the subband and by the number of bits still available to the encoder. The masking threshold is computed for each subband using a psychoacoustic model.
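As an illustration only, the following sketch (assuming NumPy) transforms 512 consecutive samples to the frequency domain with an FFT and reduces them to one intensity value ("signal") per each of 32 equal-width subbands. The real MPEG encoder uses a polyphase filter bank for this step, described in the next section:

import numpy as np

def subband_signals(samples):
    assert len(samples) == 512
    spectrum = np.abs(np.fft.rfft(samples))        # magnitudes of 257 frequency bins
    bands = np.array_split(spectrum[:256], 32)     # 32 equal-width subbands of 8 bins each
    return [float(np.sqrt(np.mean(b ** 2))) for b in bands]  # RMS intensity per subband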

MPEG uses two psychoacoustic models to implement frequency masking and temporal masking. Each model describes how loud sound masks other sounds that happen to be close to it in frequency or in time. The model partitions the frequency range into 24 critical bands and specifies how masking effects apply within each band. The masking effects depend, of course, on the frequencies and amplitudes of the tones. When the sound is decompressed and played, the user (listener) may select any playback amplitude, which is why the psychoacoustic model has to be designed for the worst case. The masking effects also depend on the nature of the source of the sound being compressed. The source may be tone-like or noise-like. The two psychoacoustic models employed by MPEG are based on experimental work done by researchers over many years.

The decoder must be fast, since it may have to decode the entire movie (video and audio) in real time, so it must be simple. As a result it does not use any psychoacoustic model or bit allocation algorithm. The compressed stream must therefore contain all the information that the decoder needs for dequantizing the signals. This information (the size of the quantized signals) must be written by the encoder on the compressed stream, and it constitutes overhead that should be subtracted from the number of remaining available bits.

Figure 14 is a block diagram of the main components of the MPEG audio encoder and decoder. The ancillary data is user-definable and would normally consist of information related to specific applications. This data is optional.

6.7 Frequency Domain Coding

The first step in encoding the audio samples is to transform them from the time domain to the frequency domain. This is done by a bank of polyphase filters that transform the samples into 32 equal-width frequency subbands. The filters were designed to provide fast operation combined with good time and frequency resolutions. As a result, their design involved three compromises.

1. The first compromise is the equal widths of the 32 frequency bands. This simplifies the filters but is in contrast to the behavior of the human auditory system, whose sensitivity is frequency dependent. When several critical bands are covered by a subband X, the bit allocation algorithm selects the critical band with the least noise masking and uses that critical band to compute the number of bits allocated to the quantized signals in subband X.

2. The second compromise involves the inverse filter bank, the one used by the decoder. The original time-to-frequency transformation involves loss of information (even before any quantization). The inverse filter bank therefore receives data that is slightly bad, and uses it to perform the inverse frequency-to-time transformation, resulting in more distortions. Therefore, the design of the two filter banks (for direct and inverse transformations) had to use compromises to minimize this loss of information.

3. The third compromise has to do with the individual filters. Adjacent filters should ideally pass different frequency ranges. In practice, they have considerable frequency overlap. Sound of a single, pure, frequency can therefore penetrate through two filters and produce signals (that are later quantized) in two of the 32 subbands instead of in just one subband.

The polyphase filter bank uses (in addition to other intermediate data structures) a buffer X with room for 512 input samples. The buffer is a FIFO queue and always contains the most recent 512 samples input. Figure 15 shows the five main steps of the polyphase filtering algorithm.

Figure 15: Polyphase Filter Bank.

6.8 MPEG Layer I Coding

    The Layer I coding scheme provides a 4:1 compression. In Layer I coding the time frequency

    mappingisaccomplishedusingabankof32subbandfilters.Theoutputofthesubbandfiltersis

    criticallysampled.Thatis,theoutputofeachfilterisdownsampledby32.Thesamplesaredivided

    intogroupsof12sampleseach.Twelvesamplesfromeachofthe32subbandfilters,oratotalof

    384 samples, make up one frame of the Layer I coder. Once the frequency components are

    obtained the algorithm examines each group of 12 samples to determine a scalefactor. The

    scalefactorisusedtomakesurethatthecoefficientsmakeuseoftheentirerangeofthequantizer.

    Thesubbandoutputisdividedbythescalefactorbeforebeinglinearlyquantized.Thereareatotal

    of63scalefactorsspecifiedintheMPEGstandard.Specificationofeachscalefactorrequires6bits.

  • Figure16:FramestructureforLayer1.

    Todetermine thenumberofbits tobeused forquantization, the codermakesuseof the

    psychoacousticmodel.TheinputstothemodelincludeFastFourierTransform(FFT)oftheaudio

    data as well as the signal itself. The model calculates the masking thresholds in each subband,

    which in turn determine the amount of quantization noise that can be tolerated and hence the

    quantization step size. As the quantizers all cover the same range, selection of the quantization

    stepsizeisthesameasselectionofthenumberofbitstobeusedforquantizingtheoutputofeach

    subband. InLayer I theencoderhasachoiceof14differentquantizers foreachband(plus the

    optionof assigning0bits).Thequantizersare allmidtreadquantizers ranging from3 levels to

    65,535levels.Eachsubbandgetsassignedavariablenumberofbits.However,thetotalnumberof

    bitsavailabletorepresentallthesubbandsamplesisfixed.Therefore,thebitallocationcanbean

    iterativeprocess.Theobjectiveistokeepthenoisetomaskratiomoreorlessconstantacrossthe

    subbands.

    Theoutputofthequantizationandbitallocationstepsarecombinedintoaframeasshown

    inFigure16.BecauseMPEGaudioisastreamingformat,eachframecarriesaheader,ratherthan

    havingasingleheaderfortheentireaudiosequence.

    1. Theheaderismadeupof32bits.

    2. Thefirst12bitscompriseasyncpatternconsistingofall1s.

    3. Thisisfollowedbya1bitversionID,

    4. A2bitlayerindicator,

    5. A 1bit CRC protection. The CRC protection bit is set to 0 if there is no CRC

    protectionandissettoa1ifthereisCRCprotection.

    6. If the layer and protection information is known, all 16 bits can be used for

    providingframesynchronization.

7. The next 4 bits make up the bit rate index, which specifies the bit rate in kbits/sec. There are 14 specified bit rates to choose from.
8. This is followed by 2 bits that indicate the sampling frequency. The sampling frequencies for MPEG-1 and MPEG-2 are different (one of the few differences between the audio coding standards for MPEG-1 and MPEG-2) and are shown in Table 2.
9. These bits are followed by a single padding bit. If the bit is 1, the frame contains an additional slot to adjust the bit rate to the sampling frequency. The next two bits indicate the mode. The possible modes are stereo, joint stereo, dual channel, and single channel. The stereo mode consists of two channels that are encoded separately but intended to be played together. The joint stereo mode consists of two channels that are encoded together.

Table 2: Allowable sampling frequencies in MPEG-1 and MPEG-2.
    MPEG-1: 32 kHz, 44.1 kHz, 48 kHz
    MPEG-2: 16 kHz, 22.05 kHz, 24 kHz

In MS joint stereo, the left and right channels are combined to form a mid and a side signal as follows: M = (L + R)/√2 and S = (L - R)/√2.

The dual channel mode consists of two channels that are encoded separately and are not intended to be played together, such as a translation channel. These are followed by two mode extension bits that are used in the joint stereo mode. The next bit is a copyright bit (1 if the material is copyrighted, 0 if it is not). The next bit is set to 1 for original media and 0 for a copy. The final two bits indicate the type of de-emphasis to be used. If CRC protection is used, the header is followed by a 16-bit CRC. This is followed by the bit allocations used by each subband, which is in turn followed by the set of 6-bit scalefactors. The scalefactor data is followed by the 384 quantized samples. A sketch of reading these header fields is given below.
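The following is a minimal sketch of unpacking the fields described above from the first four bytes of a frame. The layout also includes a 1-bit private field between the padding bit and the mode bits (part of the standard header, not discussed above), so the fields add up to 32 bits. Table lookups for the actual bit rate and sampling frequency values are omitted.

    def parse_mpeg_audio_header(header_bytes):
        """Unpack the 32-bit MPEG audio frame header (a sketch)."""
        h = int.from_bytes(header_bytes, "big")
        fields = {
            "sync":           (h >> 20) & 0xFFF,  # 12 bits, must be all 1s
            "version_id":     (h >> 19) & 0x1,    # 1 bit
            "layer":          (h >> 17) & 0x3,    # 2 bits
            "protection_bit": (h >> 16) & 0x1,    # 0 => 16-bit CRC follows header
            "bitrate_index":  (h >> 12) & 0xF,    # 4 bits
            "sampling_freq":  (h >> 10) & 0x3,    # 2 bits, see Table 2
            "padding_bit":    (h >> 9)  & 0x1,
            "private_bit":    (h >> 8)  & 0x1,
            "mode":           (h >> 6)  & 0x3,    # stereo/joint/dual/single
            "mode_extension": (h >> 4)  & 0x3,
            "copyright":      (h >> 3)  & 0x1,
            "original":       (h >> 2)  & 0x1,
            "emphasis":        h        & 0x3,
        }
        if fields["sync"] != 0xFFF:
            raise ValueError("not an MPEG audio frame header")
        return fields

    # Example: 0xFF 0xFB 0x90 0x00 is a typical MPEG-1 Layer III header.
    fields = parse_mpeg_audio_header(bytes([0xFF, 0xFB, 0x90, 0x00]))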

6.9 Layer II Coding

The Layer II coder provides a higher compression rate by making some relatively minor modifications to the Layer I coding scheme. These modifications include how the samples are grouped together, the representation of the scalefactors, and the quantization strategy. Where the Layer I coder puts 12 samples from each subband into a frame, the Layer II coder groups three sets of 12 samples from each subband into a frame. The total number of samples per frame increases from 384 samples to 1152 samples. This reduces the amount of overhead per sample. In Layer I coding a separate scalefactor is selected for each block of 12 samples. In Layer II coding the encoder tries to share a scalefactor among two or all three groups of samples from each subband filter. The only time separate scalefactors are used for each group of 12 samples is when not doing so would result in a significant increase in distortion. The particular choice used in a frame is signaled through the scalefactor selection information field in the bitstream.

The major difference between the Layer I and Layer II coding schemes is in the quantization step. In the Layer I coding scheme the output of each subband is quantized using one of 14 possibilities, the same 14 possibilities for each of the subbands. In Layer II coding the quantizers used for each of the subbands can be selected from a different set of quantizers depending on the sampling rate and the bit rate. For some sampling rate and bit rate combinations, many of the higher subbands are assigned 0 bits. That is, the information from those subbands is simply discarded. Where the quantizer selected has 3, 5, or 9 levels, the Layer II coding scheme uses one more enhancement. Notice that in the case of 3 levels we have to use 2 bits per sample, which would have allowed us to represent 4 levels. The situation is even worse in the case of 5 levels, where we are forced to use 3 bits, wasting three codewords, and in the case of 9 levels, where we have to use 4 bits, thus wasting 7 codewords.

To avoid this situation, the Layer II coder groups 3 samples into a granule. If each sample can take on 3 levels, a granule can take on 27 levels. This can be accommodated using 5 bits. If each sample had been encoded separately, we would have needed 6 bits. Similarly, if each sample can take on 9 values, a granule can take on 729 values. We can represent 729 values using 10 bits. If each sample in the granule had been encoded separately, we would have needed 12 bits. Using all these savings, the compression ratio in Layer II coding can be increased from 4:1 to 8:1 or 6:1.
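A small sketch of this grouping: the three samples are treated as digits of a base-3 (or base-5 or base-9) number, so a granule becomes a single index that can be coded in 5 (or 7 or 10) bits. The function names below are illustrative.

    def pack_granule(samples, n_levels):
        """Pack three quantized samples (each in 0..n_levels-1) into one index."""
        a, b, c = samples
        return (a * n_levels + b) * n_levels + c

    def unpack_granule(index, n_levels):
        """Invert pack_granule."""
        c = index % n_levels
        index //= n_levels
        b = index % n_levels
        a = index // n_levels
        return a, b, c

    # Example: three 3-level samples fit into one value in 0..26 (5 bits)
    # instead of 3 x 2 = 6 bits; three 9-level samples fit into 0..728 (10 bits).
    assert pack_granule((2, 1, 0), 3) == 21
    assert unpack_granule(21, 3) == (2, 1, 0)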

The frame structure for the Layer II coder can be seen in Figure 17. The only real difference between this frame structure and the frame structure of the Layer I coder is the scalefactor selection information field.

Figure 17: Frame structure for Layer 2.

6.10 Layer III Coding (mp3)

Layer III coding, which has become widely popular under the name mp3, is considerably more complex than the Layer I and Layer II coding schemes. One of the problems with the Layer I and Layer II coding schemes was that with the 32-band decomposition, the bandwidth of the subbands at lower frequencies is significantly larger than the critical bands. This makes it difficult to make an accurate judgment of the mask-to-signal ratio. If we get a high-amplitude tone within a subband and if the subband is narrow enough, we could assume that it masks other tones in the band. However, if the bandwidth of the subband is significantly higher than the critical bandwidth at that frequency, it becomes more difficult to determine whether other tones in the subband will be masked.

To satisfy the backward compatibility requirement, the spectral decomposition in the Layer III algorithm is performed in two stages. First the 32-band subband decomposition used in Layer I and Layer II is employed. The output of each subband is then transformed using a modified discrete cosine transform (MDCT) with a 50% overlap. The Layer III algorithm specifies two sizes for the MDCT, 6 or 18. This means that the output of each subband can be decomposed into 18 frequency coefficients or 6 frequency coefficients.
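As a sketch, the forward MDCT maps a block of 2N samples (N = 6 or 18 here) to N coefficients; consecutive blocks share N samples, which is the 50% overlap. The windowing applied by the standard before the transform is omitted below, so this is only illustrative.

    import numpy as np

    def mdct(block):
        """MDCT of a 2N-sample block, giving N frequency coefficients."""
        two_n = len(block)
        n = two_n // 2
        out = np.empty(n)
        for k in range(n):
            angles = np.pi / n * (np.arange(two_n) + 0.5 + n / 2) * (k + 0.5)
            out[k] = np.sum(block * np.cos(angles))
        return out

    # 18 coefficients from 36 consecutive samples of one subband's output.
    coeffs = mdct(np.random.randn(36))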

The reason for having two sizes for the MDCT is that when we transform a sequence into the frequency domain, we lose time resolution even as we gain frequency resolution. The larger the block size, the more we lose in terms of time resolution. The problem with this is that any quantization noise introduced into the frequency coefficients will get spread over the entire block size of the transform. Backward temporal masking occurs for only a short duration prior to the masking sound (approximately 20 msec). Therefore, if a sharp attack occurs near the end of a long block, the quantization noise spread over the earlier part of the block falls outside this masking window and becomes audible as a pre-echo; switching to the short MDCT limits how far the noise can spread.

For the long windows we end up with 18 frequencies per subband, resulting in a total of 576 frequencies. For the short windows we get 6 coefficients per subband for a total of 192 frequencies. The standard allows for a mixed block mode in which the two lowest subbands use long windows while the remaining subbands use short windows. Notice that while the number of frequencies may change depending on whether we are using long or short windows, the number of samples in a frame stays at 1152. That is 36 samples, or 3 groups of 12, from each of the 32 subband filters.

The coding and quantization of the output of the MDCT is conducted in an iterative fashion using two nested loops. There is an outer loop called the distortion control loop whose purpose is to ensure that the introduced quantization noise lies below the audibility threshold. The scalefactors are used to control the level of quantization noise. In Layer III, scalefactors are assigned to groups or bands of coefficients in which the bands are approximately the size of critical bands. There are 21 scalefactor bands for long blocks and 12 scalefactor bands for short blocks.

Figure 19: Frames in Layer III.

The inner loop is called the rate control loop. The goal of this loop is to make sure that a target bit rate is not exceeded. This is done by iterating between different quantizers and Huffman codes. The quantizers used in mp3 are companded non-uniform quantizers. The scaled MDCT coefficients are first quantized and organized into regions. Coefficients at the higher end of the frequency scale are likely to be quantized to zero. These consecutive zero outputs are treated as a single region and the run length is Huffman encoded. Below this region of zero coefficients, the encoder identifies the set of coefficients that are quantized to 0 or ±1. These coefficients are grouped into groups of four. This set of quadruplets is the second region of coefficients. Each quadruplet is encoded using a single Huffman codeword.

The remaining coefficients are divided into two or three subregions. Each subregion is assigned a Huffman code based on its statistical characteristics. If the result of using this variable-length coding exceeds the bit budget, the quantizer is adjusted to increase the quantization step size.
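A minimal sketch of this three-way partitioning is given below, using the common mp3 names big_values, count1, and rzero for the three regions and ignoring details such as region boundaries being aligned to scalefactor band edges.

    def partition_regions(q):
        """Split quantized MDCT coefficients (ordered low to high frequency)
        into the three regions described above (a sketch)."""
        n = len(q)
        # Trailing run of zeros ("rzero" region) -- run-length coded.
        end = n
        while end > 0 and q[end - 1] == 0:
            end -= 1
        # Below that, coefficients quantized to 0 or +/-1, taken in
        # quadruplets ("count1" region) -- one Huffman codeword each.
        start = end
        while start > 0 and abs(q[start - 1]) <= 1:
            start -= 1
        start += (end - start) % 4       # keep the region a multiple of 4
        big_values = q[:start]           # remaining low-frequency coefficients
        count1 = q[start:end]
        rzero = q[end:]
        return big_values, count1, rzero

    # Example
    q = [12, 7, -3, 2, 1, 0, -1, 1, 0, 0, 0, 0]
    big, count1, rzero = partition_regions(q)
    # big -> [12, 7, -3, 2]; count1 -> [1, 0, -1, 1]; rzero -> [0, 0, 0, 0]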

The process is repeated until the target rate is satisfied. The psychoacoustic model is then used to check whether the quantization noise in any band exceeds the allowed distortion. If it does, the scalefactor is adjusted to reduce the quantization noise. Once all scalefactors have been adjusted, control returns to the rate control loop. The iterations terminate either when the distortion and rate conditions are satisfied or when the scalefactors cannot be adjusted any further.
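The structure of the two loops can be sketched as follows. Here quantize_and_count_bits and band_distortion are hypothetical placeholders for the actual quantization, Huffman coding, and noise measurement, and MAX_SCF is an assumed bound on how far a scalefactor band can be amplified.

    MAX_SCF = 15   # assumed bound on scalefactor amplification

    def encode_granule(coeffs, target_bits, scf_bands, allowed_distortion,
                       quantize_and_count_bits, band_distortion):
        """Sketch of the nested distortion control and rate control loops."""
        scalefactors = [0] * len(scf_bands)
        while True:
            # Inner (rate control) loop: coarsen the quantizer until the
            # Huffman-coded representation fits in the bit budget.
            step = 1.0
            while True:
                bits, quantized = quantize_and_count_bits(coeffs, scalefactors, step)
                if bits <= target_bits:
                    break
                step *= 2 ** 0.25
            # Outer (distortion control) loop: amplify the bands whose
            # quantization noise exceeds the allowed distortion.
            noisy = [b for b in range(len(scf_bands))
                     if band_distortion(coeffs, quantized, scf_bands[b])
                        > allowed_distortion[b]]
            if not noisy or all(scalefactors[b] >= MAX_SCF for b in noisy):
                return scalefactors, step, quantized
            for b in noisy:
                scalefactors[b] = min(scalefactors[b] + 1, MAX_SCF)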

There will be frames in which the number of bits used by the Huffman coder is less than the amount allocated. These bits are saved in a conceptual bit reservoir. In practice, what this means is that the start of a block of data does not necessarily coincide with the header of the frame. Consider the three frames shown in Figure 19. In this example, the main data for the first frame (which includes scalefactor information and the Huffman coded data) does not occupy the entire frame. Therefore, the main data for the second frame starts before the second frame actually begins. The same is true for the remaining data. The main data can begin in the previous frame. However, the main data for a particular frame cannot spill over into the following frame. All this complexity allows for a very efficient encoding of audio inputs. The typical mp3 audio file has a compression ratio of about 10:1. In spite of this high level of compression, most people cannot tell the difference between the original and the compressed representation.
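As a rough sketch of the bit reservoir bookkeeping described above: bits left over by one frame allow the main data of a later frame to begin before that frame's header, which the format signals with a backward pointer (main_data_begin in mp3 terms). The accounting below is simplified, and the cap corresponds to the 9-bit pointer (511 bytes).

    def reservoir_positions(frame_bits, main_data_bits, max_reservoir=4088):
        """For each frame, return how many bits before its own header the
        main data begins (0 means it starts at the frame itself). A sketch."""
        reservoir = 0
        begins = []
        for used in main_data_bits:
            # Main data for this frame may start 'reservoir' bits earlier.
            begins.append(reservoir)
            # Unused bits from this frame are added to the reservoir.
            reservoir = min(reservoir + frame_bits - used, max_reservoir)
            reservoir = max(reservoir, 0)
        return begins

    # Example: the second frame's main data starts 200 bits before its header.
    print(reservoir_positions(1000, [800, 1100, 900]))   # [0, 200, 100]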