A Guide to MPEG


8/14/2019 A guide to MPEG

Copyright 1997, Tektronix, Inc. All rights reserved.

A Guide to MPEG Fundamentals and Protocol Analysis (Including DVB and ATSC)


Contents

Section 1 Introduction to MPEG
  1.1 Convergence
  1.2 Why compression is needed
  1.3 Applications of compression
  1.4 Introduction to video compression
  1.5 Introduction to audio compression
  1.6 MPEG signals
  1.7 Need for monitoring and analysis
  1.8 Pitfalls of compression

Section 2 Compression in Video
  2.1 Spatial or temporal coding?
  2.2 Spatial coding
  2.3 Weighting
  2.4 Scanning
  2.5 Entropy coding
  2.6 A spatial coder
  2.7 Temporal coding
  2.8 Motion compensation
  2.9 Bidirectional coding
  2.10 I, P, and B pictures
  2.11 An MPEG compressor
  2.12 Preprocessing
  2.13 Profiles and levels
  2.14 Wavelets

Section 3 Audio Compression
  3.1 The hearing mechanism
  3.2 Subband coding
  3.3 MPEG Layer 1
  3.4 MPEG Layer 2
  3.5 Transform coding
  3.6 MPEG Layer 3
  3.7 AC-3

Section 4 Elementary Streams
  4.1 Video elementary stream syntax
  4.2 Audio elementary streams

Section 5 Packetized Elementary Streams (PES)
  5.1 PES packets
  5.2 Time stamps
  5.3 PTS/DTS

Section 6 Program Streams
  6.1 Recording vs. transmission
  6.2 Introduction to program streams

Section 7 Transport Streams
  7.1 The job of a transport stream
  7.2 Packets
  7.3 Program Clock Reference (PCR)
  7.4 Packet Identification (PID)
  7.5 Program Specific Information (PSI)

Section 8 Introduction to DVB/ATSC
  8.1 An overall view
  8.2 Remultiplexing
  8.3 Service Information (SI)
  8.4 Error correction
  8.5 Channel coding
  8.6 Inner coding
  8.7 Transmitting digits

Section 9 MPEG Testing
  9.1 Testing requirements
  9.2 Analyzing a Transport Stream
  9.3 Hierarchic view
  9.4 Interpreted view
  9.5 Syntax and CRC analysis
  9.6 Filtering
  9.7 Timing Analysis
  9.8 Elementary stream testing
  9.9 Sarnoff compliant bit streams
  9.10 Elementary stream analysis
  9.11 Creating a transport stream
  9.12 Jitter generation
  9.13 DVB tests

Glossary


SECTION 1
INTRODUCTION TO MPEG

MPEG is one of the most popular audio/video compression techniques because it is not just a single standard. Instead, it is a range of standards suitable for different applications but based on similar principles. MPEG is an acronym for the Moving Picture Experts Group, which was set up by the ISO (International Organization for Standardization) to work on compression.

MPEG can be described as the interaction of acronyms. As ETSI stated, "The CAT is a pointer to enable the IRD to find the EMMs associated with the CA system(s) that it uses." If you can understand that sentence, you don't need this book.

    1.1 Convergence

Digital techniques have made rapid progress in audio and video for a number of reasons. Digital information is more robust and can be coded to substantially eliminate error. This means that generation loss in recording and losses in transmission are eliminated. The Compact Disc was the first consumer product to demonstrate this.

While the CD has improved sound quality with respect to its vinyl predecessor, comparison of quality alone misses the point. The real point is that digital recording and transmission techniques allow content manipulation to a degree that is impossible with analog. Once audio or video is digitized, it becomes data. Such data cannot be distinguished from any other kind of data; therefore, digital video and audio become the province of computer technology.

The convergence of computers and audio/video is an inevitable consequence of the key inventions of computing and Pulse Code Modulation. Digital media can store any type of information, so it is easy to use a computer storage device for digital video. The nonlinear workstation was the first example of an application of convergent technology that did not have an analog forerunner. Another example, multimedia, mixed the storage of audio, video, graphics, text, and data on the same medium. Multimedia is impossible in the analog domain.

    1.2 Why compression is needed

The initial success of digital video was in post-production applications, where the high cost of digital video was offset by its limitless layering and effects capability. However, production-standard digital video generates over 200 megabits per second of data, and this bit rate requires extensive capacity for storage and wide bandwidth for transmission. Digital video could only be used in wider applications if the storage and bandwidth requirements could be eased; easing these requirements is the purpose of compression.

Compression is a way of expressing digital audio and video by using less data. Compression has the following advantages:

A smaller amount of storage is needed for a given amount of source material. With high-density recording, such as with tape, compression allows highly miniaturized equipment for consumer and Electronic News Gathering (ENG) use. The access time of tape improves with compression because less tape needs to be shuttled to skip over a given amount of program. With expensive storage media such as RAM, compression makes new applications affordable.

When working in real time, compression reduces the bandwidth needed. Additionally, compression allows faster-than-real-time transfer between media, for example, between tape and disk.

A compressed recording format can afford a lower recording density, and this can make the recorder less sensitive to environmental factors and maintenance.

    1.3 Applications of compression

Compression has a long association with television. Interlace is a simple form of compression, giving a 2:1 reduction in bandwidth. The use of color-difference signals instead of GBR is another form of compression. Because the eye is less sensitive to color detail, the color-difference signals need less bandwidth. When color broadcasting was introduced, the channel structure of monochrome had to be retained, and composite video was developed. Composite video systems, such as PAL, NTSC, and SECAM, are forms of compression because they use the same bandwidth for color as was used for monochrome.



Figure 1.1a shows that in traditional television systems, the GBR camera signal is converted to Y, Pr, Pb components for production and encoded into analog composite for transmission. Figure 1.1b shows the modern equivalent. The Y, Pr, Pb signals are digitized and carried as Y, Cr, Cb signals in SDI form through the production process prior to being encoded with MPEG for transmission. Clearly, MPEG can be considered by the broadcaster as a more efficient replacement for composite video. In addition, MPEG has greater flexibility because the bit rate required can be adjusted to suit the application. At lower bit rates and resolutions, MPEG can be used for video conferencing and video telephones.

DVB and ATSC (the European- and American-originated digital-television broadcasting standards) would not be viable without compression because the bandwidth required would be too great. Compression extends the playing time of DVD (digital video/versatile disc), allowing full-length movies on a standard-size compact disc. Compression also reduces the cost of Electronic News Gathering and other contributions to television production.

In tape recording, mild compression eases tolerances and adds reliability in Digital Betacam and Digital-S, whereas in SX, DVC, DVCPRO, and DVCAM, the goal is miniaturization. In magnetic disk drives, such as the Tektronix Profile storage system, which are used in file servers and networks (especially for news purposes), compression lowers storage cost. Compression also lowers bandwidth, which allows more users to access a given server. This characteristic is also important for VOD (Video On Demand) applications.

1.4 Introduction to video compression

In all real program material, there are two types of components of the signal: those that are novel and unpredictable and those that can be anticipated. The novel component is called entropy and is the true information in the signal. The remainder is called redundancy because it is not essential. Redundancy may be spatial, as it is in large plain areas of picture where adjacent pixels have almost the same value. Redundancy can also be temporal, as it is where similarities between successive pictures are used. All compression systems work by separating the entropy from the redundancy in the encoder. Only the entropy is recorded or transmitted, and the decoder computes the redundancy from the transmitted signal. Figure 1.2a shows this concept.
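As a toy illustration of separating redundancy from entropy (far simpler than anything MPEG actually does), difference coding followed by run-length coding collapses a plain area of picture into a handful of values; the decoder recomputes the redundant pixels by a running sum:

```python
import itertools

def delta_encode(pixels):
    """Send the first pixel, then only the differences (the 'novelty')."""
    return [pixels[0]] + [b - a for a, b in zip(pixels, pixels[1:])]

def run_length(deltas):
    """Collapse runs of identical values; plain areas shrink to almost nothing."""
    return [(value, len(list(group))) for value, group in itertools.groupby(deltas)]

# A plain area: adjacent pixels are almost identical, so redundancy is high.
row = [100] * 12 + [101] * 4
compact = run_length(delta_encode(row))
print(compact)  # [(100, 1), (0, 11), (1, 1), (0, 3)]

# The decoder recomputes the redundancy: a running sum restores every pixel.
restored = list(itertools.accumulate(
    v for v, count in compact for _ in range(count)))
assert restored == row
```

Sixteen pixels become four value/count pairs, and the reconstruction is exact because, in this toy scheme, only true redundancy was discarded.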

An ideal encoder would extract all the entropy, and only this would be transmitted to the decoder. An ideal decoder would then reproduce the original signal. In practice, this ideal cannot be reached. An ideal coder would be complex and cause a very long delay in order to use temporal redundancy. In certain applications, such as recording or broadcasting, some delay is acceptable, but in videoconferencing it is not. In some cases, a very complex coder would be too expensive. It follows that there is no one ideal compression system.

In practice, a range of coders is needed which have a range of processing delays and complexities. The power of MPEG is that it is not a single compression format, but a range of standardized coding tools that can be combined flexibly to suit a range of applications. The way in which coding has been performed is included in the compressed data so that the decoder can automatically handle whatever the coder decided to do.

MPEG coding is divided into several profiles that have different complexity, and each profile can be implemented at a different level depending on the resolution of the input picture. Section 2 considers profiles and levels in detail.

There are many different digital video formats, and each has a different bit rate. For example, a high-definition system might have six times the bit rate of a standard-definition system. Consequently, just knowing the bit rate out of the coder is not very useful. What matters is the compression factor, which is the ratio of the input bit rate to the compressed bit rate, for example 2:1, 5:1, and so on.
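For example, squeezing production-standard video of roughly 216 Mbit/s into a 6 Mbit/s broadcast channel (both figures illustrative) gives:

```python
def compression_factor(input_bps, compressed_bps):
    """Ratio of the input bit rate to the compressed bit rate."""
    return input_bps / compressed_bps

# Production-standard 4:2:2 digital video (over 200 Mbit/s) into a
# 6 Mbit/s broadcast channel:
factor = compression_factor(216e6, 6e6)
print(f"{factor:.0f}:1")  # 36:1
```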

Unfortunately, the number of variables involved makes it very difficult to determine a suitable compression factor. Figure 1.2a shows that for an ideal coder, if all of the entropy is sent, the quality is good. However, if the compression factor is increased in order to reduce the bit rate, not all of the entropy is sent and the quality falls. Note that in a compressed system, when quality loss occurs, it is steep (Figure 1.2b). If the available bit rate is inadequate, it is better to avoid this area by reducing the entropy of the input picture. This can be done by filtering. The loss of resolution caused by the filtering is subjectively more acceptable than compression artifacts.


Figure 1.1. a) In the traditional system, the camera's GBR signal is matrixed to Y, Pr, Pb for the production process and composite-encoded to analog composite out (PAL, NTSC, or SECAM). b) In the modern equivalent, the matrixed signal is digitized (ADC), carried as Y, Cr, Cb over SDI through the production process, and MPEG-coded to digital compressed out.


To identify the entropy perfectly, an ideal compressor would have to be extremely complex. A practical compressor may be less complex for economic reasons and must send more data to be sure of carrying all of the entropy. Figure 1.2b shows the relationship between coder complexity and performance. The higher the compression factor required, the more complex the encoder has to be.

The entropy in video signals varies. A recording of an announcer delivering the news has much redundancy and is easy to compress. In contrast, it is more difficult to compress a recording with leaves blowing in the wind or one of a football crowd that is constantly moving and therefore has less redundancy (more information or entropy). In either case, if all the entropy is not sent, there will be quality loss. Thus, we may choose between a constant bit-rate channel with variable quality or a constant-quality channel with variable bit rate. Telecommunications network operators tend to prefer a constant bit rate for practical purposes, but a buffer memory can be used to average out entropy variations if the resulting increase in delay is acceptable. In recording, a variable bit rate may be easier to handle, and DVD uses variable bit rate, speeding up the disc where difficult material exists.

Intra-coding (intra = within) is a technique that exploits spatial redundancy, or redundancy within the picture; inter-coding (inter = between) is a technique that exploits temporal redundancy. Intra-coding may be used alone, as in the JPEG standard for still pictures, or combined with inter-coding as in MPEG.

Intra-coding relies on two characteristics of typical images. First, not all spatial frequencies are simultaneously present, and second, the higher the spatial frequency, the lower the amplitude is likely to be. Intra-coding requires analysis of the spatial frequencies in an image. This analysis is the purpose of transforms such as wavelets and the DCT (discrete cosine transform). Transforms produce coefficients that describe the magnitude of each spatial frequency. Typically, many coefficients will be zero, or nearly zero, and these coefficients can be omitted, resulting in a reduction in bit rate.

Inter-coding relies on finding similarities between successive pictures. If a given picture is available at the decoder, the next picture can be created by sending only the picture differences. The picture differences will be increased when objects move, but this increase can be offset by using motion compensation, since a moving object does not generally change its appearance very much from one picture to the next. If the motion can be measured, a closer approximation to the current picture can be created by shifting part of the previous picture to a new location. The shifting process is controlled by a vector that is transmitted to the decoder. The vector transmission requires less data than sending the picture-difference data.
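A toy numpy sketch (not MPEG's actual block-matching search) of how a motion vector shrinks the difference picture; the picture size, object, and vector are invented for illustration:

```python
import numpy as np

def motion_compensate(previous, vector):
    """Shift the previous picture by a motion vector (dy, dx) to predict the current one."""
    dy, dx = vector
    return np.roll(np.roll(previous, dy, axis=0), dx, axis=1)

# A bright 'object' on a plain background moves 2 pixels right between pictures.
previous = np.zeros((8, 8), dtype=np.int16)
previous[3:5, 1:3] = 200
current = np.roll(previous, 2, axis=1)

# Without motion compensation the difference picture has large values...
plain_difference = current - previous
# ...but shifting the previous picture by the measured vector makes it tiny.
predicted = motion_compensate(previous, (0, 2))
residual = current - predicted

print(int(np.abs(plain_difference).sum()))  # 1600
print(int(np.abs(residual).sum()))          # 0
```

The vector (two numbers) plus a near-zero residual costs far less to send than the raw picture differences, which is the whole point of motion compensation.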

MPEG can handle both interlaced and non-interlaced images. An image at some point on the time axis is called a "picture," whether it is a field or a frame. Interlace is not ideal as a source for digital compression because it is in itself a compression technique. Temporal coding is made more complex because pixels in one field are in a different position from those in the next.

Motion compensation minimizes but does not eliminate the differences between successive pictures. The picture difference is itself a spatial image and can be compressed using transform-based intra-coding as previously described. Motion compensation simply reduces the amount of data in the difference image.

The efficiency of a temporal coder rises with the time span over which it can act. Figure 1.2c shows that if a high compression factor is required, a longer time span in the input must be considered, and thus a longer coding delay will be experienced. Clearly, temporally coded signals are difficult to edit because the content of a given output picture may be based on image data that was transmitted some time earlier. Production systems will have to limit the degree of temporal coding to allow editing, and this limitation will in turn limit the available compression factor.

Figure 1.2. a) An ideal coder sends only the entropy of the PCM video; a non-ideal coder has to send more, and a short-delay coder has to send even more. b), c) Quality varies with compression factor, coder complexity, and latency.


1.5 Introduction to audio compression

The bit rate of a PCM digital audio channel is only about one megabit per second, which is about 0.5% of 4:2:2 digital video. With mild video compression schemes, such as Digital Betacam, audio compression is unnecessary. But, as the video compression factor is raised, it becomes necessary to compress the audio as well.

Audio compression takes advantage of two facts. First, in typical audio signals, not all frequencies are simultaneously present. Second, because of the phenomenon of masking, human hearing cannot discern every detail of an audio signal. Audio compression splits the audio spectrum into bands by filtering or transforms, and includes less data when describing bands in which the level is low. Where masking prevents or reduces audibility of a particular band, even less data needs to be sent.

Audio compression is not as easy to achieve as video compression because of the acuity of hearing. Masking only works properly when the masking and the masked sounds coincide spatially. Spatial coincidence is always the case in mono recordings but not in stereo recordings, where low-level signals can still be heard if they are in a different part of the soundstage. Consequently, in stereo and surround-sound systems, a lower compression factor is allowable for a given quality. Another factor complicating audio compression is that delayed resonances in poor loudspeakers actually mask compression artifacts. Testing a compressor with poor speakers gives a false result, and signals that are apparently satisfactory may be disappointing when heard on good equipment.

1.6 MPEG signals

The output of a single MPEG audio or video coder is called an Elementary Stream. An Elementary Stream is an endless near real-time signal. For convenience, it can be broken into convenient-sized data blocks in a Packetized Elementary Stream (PES). These data blocks need header information to identify the start of the packets and must include time stamps because packetizing disrupts the time axis.

Figure 1.3 shows that one video PES and a number of audio PES can be combined to form a Program Stream, provided that all of the coders are locked to a common clock. Time stamps in each PES ensure lip-sync between the video and audio. Program Streams have variable-length packets with headers. They find use in data transfers to and from optical and hard disks, which are error free and in which files of arbitrary sizes are expected. DVD uses Program Streams.

For transmission and digital broadcasting, several programs and their associated PES can be multiplexed into a single Transport Stream. A Transport Stream differs from a Program Stream in that the PES packets are further subdivided into short fixed-size packets and in that multiple programs encoded with different clocks can be carried. This is possible because a transport stream has a program clock reference (PCR) mechanism which allows transmission of multiple clocks, one of which is selected and regenerated at the decoder. A Single Program Transport Stream (SPTS) is also possible, and this may be found between a coder and a multiplexer. Since a Transport Stream can genlock the decoder clock to the encoder clock, the SPTS is more common than the Program Stream.

A Transport Stream is more than just a multiplex of audio and video PES. In addition to the compressed audio, video, and data, a Transport Stream includes a great deal of metadata describing the bit stream. This includes the Program Association Table (PAT) that lists every program in the transport stream. Each entry in the PAT points to a Program Map Table (PMT) that lists the elementary streams making up each program. Some programs will be open, but some programs may be subject to conditional access (encryption), and this information is also carried in the metadata.

The Transport Stream consists of fixed-size data packets, each containing 188 bytes. Each packet carries a packet identifier code (PID). Packets in the same elementary stream all have the same PID, so that the decoder (or a demultiplexer) can select the elementary stream(s) it wants and reject the remainder. Packet-continuity counts ensure that every packet that is needed to decode a stream is received. An effective synchronization system is needed so that decoders can correctly identify the beginning of each packet and deserialize the bit stream into words.
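The fixed-size transport packet described earlier can be sketched as a minimal parser. The header layout (sync byte 0x47, 13-bit PID, 4-bit continuity count) follows the MPEG-2 Systems specification; the sample packet and PID value below are invented for illustration:

```python
SYNC_BYTE = 0x47
PACKET_SIZE = 188

def parse_ts_header(packet):
    """Decode the 4-byte header of an MPEG-2 transport packet."""
    if len(packet) != PACKET_SIZE or packet[0] != SYNC_BYTE:
        raise ValueError("not aligned on a 188-byte transport packet")
    pid = ((packet[1] & 0x1F) << 8) | packet[2]   # 13-bit packet identifier
    continuity = packet[3] & 0x0F                 # 4-bit packet-continuity count
    payload_start = bool(packet[1] & 0x40)        # a PES packet begins here?
    return pid, continuity, payload_start

# Build a minimal packet for an elementary stream on PID 0x0100,
# with the payload-start flag set and continuity count 5.
header = bytes([SYNC_BYTE, 0x41, 0x00, 0x15])
packet = header + bytes(184)                      # stuffing in place of payload
print(parse_ts_header(packet))                    # (256, 5, True)
```

A demultiplexer does essentially this for every 188-byte packet: lock onto the sync byte, read the PID to select or reject the packet, and check the continuity count to detect lost packets.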

Figure 1.3. Video and audio encoders produce elementary streams that are packetized into video and audio PES. Together with data, these feed either a Program Stream MUX (as used by DVD) or a Transport Stream MUX, which produces a Single Program Transport Stream.

    1.7 Need for monitoring and analysis

The MPEG transport stream is an extremely complex structure using interlinked tables and coded identifiers to separate the programs and the elementary streams within the programs. Within each elementary stream, there is a complex structure allowing a decoder to distinguish between, for example, vectors, coefficients, and quantization tables.

Failures can be divided into two broad categories. In the first category, the transport system correctly multiplexes and delivers information from an encoder to a decoder with no bit errors or added jitter, but the encoder or the decoder has a fault. In the second category, the encoder and decoder are fine, but the transport of data from one to the other is defective. It is very important to know whether the fault lies in the encoder, the transport, or the decoder if a prompt solution is to be found.

Synchronizing problems, such as loss or corruption of sync patterns, may prevent reception of the entire transport stream. Transport-stream protocol defects may prevent the decoder from finding all of the data for a program, perhaps delivering picture but not sound. Correct delivery of the data but with excessive jitter can cause decoder timing problems.

If a system using an MPEG transport stream fails, the fault could be in the encoder, the multiplexer, or the decoder. How can this fault be isolated? First, verify that the transport stream is compliant with the MPEG coding standards. If the stream is not compliant, a decoder can hardly be blamed for having difficulty. If it is compliant, the decoder may need attention.

Traditional video testing tools (the signal generator, the waveform monitor, and the vectorscope) are not appropriate for analyzing MPEG systems, except to ensure that the video signals entering and leaving an MPEG system are of suitable quality. Instead, a reliable source of valid MPEG test signals is essential for testing receiving equipment and decoders. With a suitable analyzer, the performance of encoders, transmission systems, multiplexers, and remultiplexers can be assessed with a high degree of confidence. As a long-standing supplier of high-grade test equipment to the video industry, Tektronix continues to provide test and measurement solutions as the technology evolves, giving the MPEG user the confidence that complex compressed systems are functioning correctly and allowing rapid diagnosis when they are not.

    1.8 Pitfalls of compression

MPEG compression is lossy in that what is decoded is not identical to the original. The entropy of the source varies, and when entropy is high, the compression system may leave visible artifacts when decoded. In temporal compression, redundancy between successive pictures is assumed. When this is not the case, the system fails. An example is video from a press conference where flashguns are firing. Individual pictures containing the flash are totally different from their neighbors, and coding artifacts become obvious.

Irregular motion or several independently moving objects on screen require a lot of vector bandwidth, and this requirement may only be met by reducing the picture-data bandwidth. Again, visible artifacts may occur whose level varies and depends on the motion. This problem often occurs in sports-coverage video.

Coarse quantizing results in luminance contouring and posterized color. These can be seen as blotchy shadows and blocking on large areas of plain color. Subjectively, compression artifacts are more annoying than the relatively constant impairments resulting from analog television transmission systems.

The only solution to these problems is to reduce the compression factor. Consequently, the compression user has to make a value judgment between the economy of a high compression factor and the level of artifacts.

In addition to extending the encoding and decoding delay, temporal coding also causes difficulty in editing. In fact, an MPEG bit stream cannot be arbitrarily edited at all. This restriction occurs because, in temporal coding, the decoding of one picture may require the contents of an earlier picture, and the contents may not be available following an edit. The fact that pictures may be sent out of sequence also complicates editing.

If suitable coding has been used, edits can take place only at splice points, which are relatively widely spaced. If arbitrary editing is required, the MPEG stream must undergo a read-modify-write process, which will result in generation loss.

The viewer is not interested in editing, but the production user will have to make another value judgment about the edit flexibility required. If greater flexibility is required, the temporal compression has to be reduced and a higher bit rate will be needed.


SECTION 2
COMPRESSION IN VIDEO

This section shows how video compression is based on the perception of the eye. Important enabling techniques, such as transforms and motion compensation, are considered as an introduction to the structure of an MPEG coder.

2.1 Spatial or temporal coding?

As was seen in Section 1, video compression can take advantage of both spatial and temporal redundancy. In MPEG, temporal redundancy is reduced first by using similarities between successive pictures. As much as possible of the current picture is created or "predicted" by using information from pictures already sent. When this technique is used, it is only necessary to send a difference picture, which eliminates the differences between the actual picture and the prediction. The difference picture is then subject to spatial compression. As a practical matter, it is easier to explain spatial compression prior to explaining temporal compression.

Spatial compression relies on similarities between adjacent pixels in plain areas of picture and on dominant spatial frequencies in areas of patterning. The JPEG system uses spatial compression only, since it is designed to transmit individual still pictures. However, JPEG may be used to code a succession of individual pictures for video. In the so-called "Motion JPEG" application, the compression factor will not be as good as if temporal coding were used, but the bit stream will be freely editable on a picture-by-picture basis.

2.2 Spatial coding

The first step in spatial coding is to perform an analysis of spatial frequency using a transform. A transform is simply a way of expressing a waveform in a different domain, in this case, the frequency domain. The output of a transform is a set of coefficients that describe how much of a given frequency is present. An inverse transform reproduces the original waveform. If the coefficients are handled with sufficient accuracy, the output of the inverse transform is identical to the original waveform.

The best-known transform is the Fourier transform. This transform finds each frequency in the input signal. It finds each frequency by multiplying the input waveform by a sample of a target frequency, called a basis function, and integrating the product. Figure 2.1 shows that when the input waveform does not contain the target frequency, the integral will be zero, but when it does, the integral will be a coefficient describing the amplitude of that component frequency.

The results will be as described if the frequency component is in phase with the basis function. However, if the frequency component is in quadrature with the basis function, the integral will still be zero. Therefore, it is necessary to perform two searches for each frequency, with the basis functions in quadrature with one another, so that every phase of the input will be detected.

The Fourier transform has the disadvantage of requiring coefficients for both sine and cosine components of each frequency. In the cosine transform, the input waveform is time-mirrored with itself prior to multiplication by the basis functions. Figure 2.2 shows that this mirroring cancels out all sine components and doubles all of the cosine components. The sine basis function is unnecessary, and only one coefficient is needed for each frequency.

The discrete cosine transform (DCT) is the sampled version of the cosine transform and is used extensively in two-dimensional form in MPEG. A block of 8 x 8 pixels is transformed to become a block of 8 x 8 coefficients. Since the transform requires multiplication by fractions, there is wordlength extension, resulting in coefficients that have longer wordlength than the pixel values. Typically, an 8-bit pixel block results in an 11-bit coefficient block.

    No Correlationif Frequency

    Different

    High Correlationif Frequency

    the Same

    Input

    BasisFunction

    Input

    BasisFunction

    Mirror

    Cosine ComponentCoherent Through Mirror

    Sine ComponentInverts at Mirror Cancels

    Figure 2.1.

    Figure 2.2.

  • 8/14/2019 A guide to MPEG

    9/489

    Figure 2.3.

    pixel block results in an 11-bitcoefficient block. Thus, a DCTdoes not result in any compres-sion; in fact it results in theopposite. However, the DCTconverts the source pixels into aform where compression is easier.
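The transform described above can be sketched in a few lines of Python. This is a brute-force, orthonormal 2-D DCT written for this guide as an illustration (the function name and scaling convention are ours, not from any MPEG reference implementation); a flat block shows how the transform concentrates all of the energy into a single coefficient without, by itself, compressing anything.

```python
import math

def dct_2d(block, n=8):
    """Brute-force orthonormal 2-D DCT-II of an n x n pixel block."""
    def norm(k):
        # DC term gets sqrt(1/n); AC terms get sqrt(2/n)
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
    coeffs = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for i in range(n):
                for j in range(n):
                    s += (block[i][j]
                          * math.cos((2 * i + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * j + 1) * v * math.pi / (2 * n)))
            coeffs[u][v] = norm(u) * norm(v) * s
    return coeffs

# A flat mid-grey block: all of the energy lands in the DC coefficient.
flat = [[128] * 8 for _ in range(8)]
coeffs = dct_2d(flat)
```

For this flat block the top-left (DC) coefficient is 8 x 128 = 1024 and every AC coefficient is zero. Note that 1024 needs more bits to represent than the 8-bit pixels did, which is the wordlength extension mentioned above.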

Figure 2.3 shows the results of an inverse transform of each of the individual coefficients of an 8 x 8 DCT. In the case of the luminance signal, the top-left coefficient is the average brightness, or DC component, of the whole block. Moving across the top row, horizontal spatial frequency increases. Moving down the left column, vertical spatial frequency increases. In real pictures, different vertical and horizontal spatial frequencies may occur simultaneously, and a coefficient at some point within the block will represent all possible horizontal and vertical combinations.

Figure 2.3 also shows the coefficients as one-dimensional horizontal waveforms. Combining these waveforms with various amplitudes and either polarity can reproduce any combination of 8 pixels. Thus, combining the 64 coefficients of the 2-D DCT will result in the original 8 x 8 pixel block. Clearly, for color pictures, the color difference samples will also need to be handled. Y, Cr, and Cb data are assembled into separate 8 x 8 arrays and are transformed individually.

In much real program material, many of the coefficients will have zero or near-zero values and, therefore, will not be transmitted. This fact results in significant compression that is virtually lossless. If a higher compression factor is needed, then the wordlength of the non-zero coefficients must be reduced. This reduction will reduce the accuracy of these coefficients and will introduce losses into the process. With care, the losses can be introduced in a way that is least visible to the viewer.

2.3 Weighting

Figure 2.4 shows that the human perception of noise in pictures is not uniform but is a function of spatial frequency. More noise can be tolerated at high spatial frequencies. Also, video noise is effectively masked by fine detail in the picture, whereas in plain areas it is highly visible. The reader will be aware that traditional noise measurements are always weighted so that the technical measurement relates to the subjective result.

Compression reduces the accuracy of coefficients and has a similar effect to using shorter-wordlength samples in PCM; that is, the noise level rises. In PCM, the result of shortening the wordlength is that the noise level rises equally at all frequencies. As the DCT splits the signal into different frequencies, it becomes possible to control the spectrum of the noise. Effectively, low-frequency coefficients are rendered more accurately than high-frequency coefficients by a process of weighting.

Figure 2.4. Human vision sensitivity to noise as a function of spatial frequency.

Figure 2.5 shows that, in the weighting process, the coefficients from the DCT are divided by constants that are a function of two-dimensional frequency. Low-frequency coefficients will be divided by small numbers, and high-frequency coefficients will be divided by large numbers. Following the division, the least-significant bit is discarded or truncated. This truncation is a form of requantizing. In the absence of weighting, this requantizing would have the effect of doubling the size of the quantizing step, but with weighting, it increases the step size according to the division factor.

As a result, coefficients representing low spatial frequencies are requantized with relatively small steps and suffer little increased noise. Coefficients representing higher spatial frequencies are requantized with large steps and suffer more noise. However, fewer steps means that fewer bits are needed to identify the step, and a compression is obtained.

In the decoder, a low-order zero will be added to return the weighted coefficients to their correct magnitude. They will then be multiplied by inverse weighting factors. Clearly, at high frequencies the multiplication factors will be larger, so the requantizing noise will be greater. Following inverse weighting, the coefficients will have their original DCT output values, plus requantizing error, which will be greater at high frequency than at low frequency.

As an alternative to truncation, weighted coefficients may be nonlinearly requantized so that the quantizing step size increases with the magnitude of the coefficient. This technique allows higher compression factors but worse levels of artifacts.

Clearly, the degree of compression obtained and, in turn, the output bit rate obtained, is a function of the severity of the requantizing process. Different bit rates will require different weighting tables. In MPEG, it is possible to use various different weighting tables, and the table in use can be transmitted to the decoder, so that correct decoding automatically occurs.

Figure 2.5. Weighting example: input DCT coefficients are divided first by the quantizing matrix value corresponding to each coefficient location and then by a single quantizing scale value for the complete 8 x 8 block (linear and non-linear quantizing scales are defined); the output coefficient values shown are for display only, not actual results.
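The divide, truncate, and inverse-weight cycle described above can be sketched as follows. The coefficient and weight values here are loosely modeled on the Figure 2.5 example but are illustrative only, and the function names are ours.

```python
def requantize(coeff, divisor):
    """Divide by the weighting constant and round: a form of requantizing."""
    return round(coeff / divisor)

def dequantize(level, divisor):
    """Decoder side: multiply back by the inverse weighting factor."""
    return level * divisor

# One row of coefficients, low to high frequency.  The divisors grow with
# frequency, as in a typical quantizing matrix (values illustrative).
coeffs  = [980, 362, 118, 45, 13, 6, 2, 1]
weights = [8, 16, 22, 26, 27, 29, 34, 38]

levels = [requantize(c, w) for c, w in zip(coeffs, weights)]
recon  = [dequantize(l, w) for l, w in zip(levels, weights)]
errors = [abs(c - r) for c, r in zip(coeffs, recon)]
```

Each reconstruction error is bounded by half the effective step size, so the requantizing noise grows with the weight, that is, with spatial frequency, exactly where the eye tolerates it best.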

2.4 Scanning

In typical program material, the significant DCT coefficients are generally found in the top-left corner of the matrix. After weighting, low-value coefficients might be truncated to zero. More efficient transmission can be obtained if all of the non-zero coefficients are sent first, followed by a code indicating that the remainder are all zero. Scanning is a technique that increases the probability of achieving this result, because it sends coefficients in descending order of magnitude probability. Figure 2.6a shows that in a non-interlaced system, the probability of a coefficient having a high value is highest in the top-left corner and lowest in the bottom-right corner. A 45-degree diagonal zig-zag scan is the best sequence to use here.

In Figure 2.6b, the scan for an interlaced source is shown. In an interlaced picture, an 8 x 8 DCT block from one field extends over twice the vertical screen area, so that for a given picture detail, vertical frequencies will appear to be twice as great as horizontal frequencies. Thus, the ideal scan for an interlaced picture will be on a diagonal that is twice as steep. Figure 2.6b shows that a given vertical spatial frequency is scanned before scanning the same horizontal spatial frequency.

Figure 2.6. a) Zigzag or classic scan (nominally for frames); b) alternate scan (nominally for fields).

2.5 Entropy coding

In real video, not all spatial frequencies are simultaneously present; therefore, the DCT coefficient matrix will have zero terms in it. Despite the use of scanning, zero coefficients will still appear between the significant values. Run length coding (RLC) allows these coefficients to be handled more efficiently. Where repeating values, such as a string of zeros, are present, run length coding simply transmits the number of zeros rather than each individual value.

The probability of occurrence of particular coefficient values in real video can be studied. In practice, some values occur very often; others occur less often. This statistical information can be used to achieve further compression using variable length coding (VLC). Frequently occurring values are converted to short code words, and infrequent values are converted to long code words. To aid deserialization, no code word can be the prefix of another.

2.6 A spatial coder

Figure 2.7 ties together all of the preceding spatial coding concepts. The input signal is assumed to be 4:2:2 SDI (Serial Digital Interface), which may have 8- or 10-bit wordlength. MPEG uses only 8-bit resolution; therefore, a rounding stage will be needed when the SDI signal contains 10-bit words. Most MPEG profiles operate with 4:2:0 sampling; therefore, a vertical low-pass filter/interpolation stage will be needed. Rounding and color subsampling introduce a small irreversible loss of information and a proportional reduction in bit rate. The raster-scanned input format will need to be stored so that it can be converted to 8 x 8 pixel blocks.

Figure 2.7. A spatial coder: full bit-rate 10-bit data are converted from 4:2:2 to 8-bit 4:2:0 (information lost, data reduced), transformed by the DCT (no loss, no data reduction), quantized under rate control (information lost, data reduced), and entropy coded (no loss, data reduced) into a buffer. Entropy coding reduces the number of bits for each coefficient; variable length coding uses short words for the most frequent values (like Morse code); run length coding sends a unique code word instead of strings of zeros.
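The zig-zag scan of section 2.4 and the run length coding of section 2.5 can be sketched together. The function names and coefficient values are illustrative; the scan generated is the classic frame scan of Figure 2.6a, and the (run, level) pairing with an end-of-block marker follows the MPEG-style scheme described above.

```python
def zigzag_order(n=8):
    """Classic zig-zag scan order (Figure 2.6a) for an n x n block.
    Positions are visited diagonal by diagonal, alternating direction."""
    return sorted(((u, v) for u in range(n) for v in range(n)),
                  key=lambda p: (p[0] + p[1],
                                 p[0] if (p[0] + p[1]) % 2 else -p[0]))

def run_length_code(values):
    """Emit (zero_run, level) pairs plus an end-of-block marker."""
    pairs, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    pairs.append('EOB')   # the remaining coefficients are all zero
    return pairs

# Coefficients as read out in zig-zag order (illustrative values):
scanned = [98, 12, 0, 0, 5, 0, 0, 0, 1] + [0] * 55
coded = run_length_code(scanned)
```

Here 64 values collapse to four (run, level) pairs and an EOB code; a variable length code table would then assign the shortest code words to the most probable pairs.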

The DCT stage transforms the picture information to the frequency domain. The DCT itself does not achieve any compression. Following the DCT, the coefficients are weighted and truncated, providing the first significant compression. The coefficients are then zig-zag scanned to increase the probability that the significant coefficients occur early in the scan. After the last non-zero coefficient, an EOB (end of block) code is generated.

Coefficient data are further compressed by run length and variable length coding. In a variable bit-rate system, the quantizing is fixed, but in a fixed bit-rate system, a buffer memory is used to absorb variations in coding difficulty. Highly detailed pictures will tend to fill the buffer, whereas plain pictures will allow it to empty. If the buffer is in danger of overflowing, the requantizing steps will have to be made larger, so that the compression factor is effectively raised.

In the decoder, the bit stream is deserialized and the entropy coding is reversed to reproduce the weighted coefficients. The inverse weighting is applied, and the coefficients are placed in the matrix according to the zig-zag scan to recreate the DCT matrix. Following an inverse transform, the 8 x 8 pixel block is recreated. To obtain a raster-scanned output, the blocks are stored in RAM, which is read a line at a time. To obtain a 4:2:2 output from 4:2:0 data, a vertical interpolation process will be needed, as shown in Figure 2.8.

The chroma samples in 4:2:0 are positioned halfway between luminance samples in the vertical axis so that they are evenly spaced when an interlaced source is used.

Figure 2.8. Positions of luminance samples (Y) and chrominance samples (Cb, Cr) in the 4:2:2 (Rec. 601), 4:1:1, and 4:2:0 sampling structures.

2.7 Temporal coding

Temporal redundancy can be exploited by inter-coding, or transmitting only the differences between pictures. Figure 2.9 shows that a one-picture delay combined with a subtractor can compute the picture differences. The picture difference is an image in its own right and can be further compressed by the spatial coder as was previously described. The decoder reverses the spatial coding and adds the difference picture to the previous picture to obtain the next picture.

Figure 2.9. A one-picture delay and a subtractor compute the difference between this picture and the previous picture.

There are some disadvantages to this simple system. First, as only differences are sent, it is impossible to begin decoding after the start of the transmission. This limitation makes it difficult for a decoder to provide pictures following a switch from one bit stream to another (as occurs when the viewer changes channels). Second, if any part of the difference data is incorrect, the error in the picture will propagate indefinitely.

The solution to these problems is to use a system that is not completely differential. Figure 2.10 shows that periodically complete pictures are sent. These are called intra-coded pictures (or I-pictures), and they are obtained by spatial compression only. If an error or a channel switch occurs, it will be possible to resume correct decoding at the next I-picture.

Figure 2.10. Periodic intra pictures provide restart points between runs of temporal difference pictures.
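The differential scheme of Figures 2.9 and 2.10 can be sketched as a toy encoder and decoder. The "pictures" here are just short lists of pixel values and the function names are ours; the point is that the decoder recovers each picture by adding the transmitted difference to its copy of the previous picture, and that an intra picture provides a restart point.

```python
def encode_stream(pictures):
    """Send the first picture intra ('I'), then only the difference
    between each picture and the one before it ('D')."""
    stream, prev = [], None
    for pic in pictures:
        if prev is None:
            stream.append(('I', list(pic)))                    # complete picture
        else:
            stream.append(('D', [c - p for c, p in zip(pic, prev)]))
        prev = pic
    return stream

def decode_stream(stream):
    """Reverse the process: add each difference to the previous picture."""
    decoded, prev = [], None
    for kind, data in stream:
        pic = list(data) if kind == 'I' else [p + d for p, d in zip(prev, data)]
        decoded.append(pic)
        prev = pic
    return decoded

# Three tiny "pictures" with very little change between them.
pictures = [[10, 20, 30, 40], [10, 21, 30, 39], [11, 21, 30, 39]]
recovered = decode_stream(encode_stream(pictures))
```

Because successive pictures are similar, the difference data are mostly zeros, which is exactly what the spatial coder then compresses well; but note that decoding can only start at an 'I' entry, and a corrupted 'D' entry would propagate.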

2.8 Motion compensation

Motion reduces the similarities between pictures and increases the data needed to create the difference picture. Motion compensation is used to increase the similarity. Figure 2.11 shows the principle. When an object moves across the TV screen, it may appear in a different place in each picture, but it does not change in appearance very much. The picture difference can be reduced by measuring the motion at the encoder. This motion is sent to the decoder as a vector. The decoder uses the vector to shift part of the previous picture to a more appropriate place in the new picture.

Figure 2.11. Motion compensation: 1. compute the motion vector; 2. shift data from picture N using the vector to make a predicted picture N+1; 3. compare the actual picture with the predicted picture; 4. send the vector and the prediction error.

One vector controls the shifting of an entire area of the picture that is known as a macroblock. The size of the macroblock is determined by the DCT coding and the color subsampling structure. Figure 2.12a shows that, with a 4:2:0 system, the vertical and horizontal spacing of color samples is exactly twice the spacing of luminance. A single 8 x 8 DCT block of color samples extends over the same area as four 8 x 8 luminance blocks; therefore, this is the minimum picture area which can be shifted by a vector. One 4:2:0 macroblock contains four luminance blocks, one Cr block, and one Cb block.

In the 4:2:2 profile, color is only subsampled in the horizontal axis. Figure 2.12b shows that in 4:2:2, a single 8 x 8 DCT block of color samples extends over two luminance blocks. A 4:2:2 macroblock contains four luminance blocks, two Cr blocks, and two Cb blocks.

Figure 2.12. a) 4:2:0 has one quarter as many chroma sampling points as Y; b) 4:2:2 has twice as much chroma data as 4:2:0.

The motion estimator works by comparing the luminance data from two successive pictures. A macroblock in the first picture is used as a reference. When the input is interlaced, pixels will be at different vertical locations in the two fields, and it will, therefore, be necessary to interpolate one field before it can be compared with the other. The correlation between the reference and the next picture is measured at all possible displacements with a resolution of half a pixel over the entire search range. When the greatest correlation is found, this correlation is assumed to represent the correct motion.

The motion vector has a vertical and a horizontal component. In typical program material, motion continues over a number of pictures. A greater compression factor is obtained if the vectors are transmitted differentially. Consequently, if an object moves at constant speed, the vectors do not change, and the vector difference is zero.

Motion vectors are associated with macroblocks, not with real objects in the image, and there will be occasions where part of the macroblock moves and part of it does not. In this case, it is impossible to compensate properly. If the motion of the moving part is compensated by transmitting a vector, the stationary part will be incorrectly shifted, and it will need difference data to be corrected. If no vector is sent, the stationary part will be correct, but difference data will be needed to correct the moving part. A practical compressor might attempt both strategies and select the one which requires the least difference data.
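A minimal sketch of the search just described, assuming an exhaustive (full-search) block match at integer-pixel resolution rather than the half-pixel resolution used in practice. The sum of absolute differences (SAD) stands in for correlation (lowest SAD means best match), the block and search sizes are scaled down for illustration, and the sign convention of the returned vector is ours.

```python
def sad(a, b):
    """Sum of absolute differences: lower means better correlation."""
    return sum(abs(x - y) for x, y in zip(a, b))

def best_vector(prev, curr, bx, by, size=4, search=3):
    """Find the displacement (dx, dy) in the previous picture that best
    predicts the block at (bx, by) in the current picture."""
    grab = lambda pic, x, y: [pic[y + r][x + c]
                              for r in range(size) for c in range(size)]
    target = grab(curr, bx, by)
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            score = sad(target, grab(prev, bx + dx, by + dy))
            if best is None or score < best[0]:
                best = (score, dx, dy)
    return best[1], best[2]

# A bright 4 x 4 object moves 2 pixels right and 1 down between pictures.
prev = [[0] * 12 for _ in range(12)]
curr = [[0] * 12 for _ in range(12)]
for y in range(4):
    for x in range(4):
        prev[3 + y][3 + x] = 200
        curr[4 + y][5 + x] = 200

vector = best_vector(prev, curr, 5, 4)   # points back to the source block
```

The estimator correctly reports that the block now at (5, 4) should be predicted from (3, 3) in the previous picture, that is, a displacement of (-2, -1).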

2.9 Bidirectional coding

When an object moves, it conceals the background at its leading edge and reveals the background at its trailing edge. The revealed background requires new data to be transmitted, because the area of background was previously concealed and no information can be obtained from a previous picture. A similar problem occurs if the camera pans: new areas come into view and nothing is known about them. MPEG helps to minimize this problem by using bidirectional coding, which allows information to be taken from pictures before and after the current picture. If a background is being revealed, it will be present in a later picture, and the information can be moved backwards in time to create part of an earlier picture.

Figure 2.13 shows the concept of bidirectional coding. On an individual macroblock basis, a bidirectionally coded picture can obtain motion-compensated data from an earlier or later picture, or even use an average of earlier and later data. Bidirectional coding significantly reduces the amount of difference data needed by improving the degree of prediction possible. MPEG does not specify how an encoder should be built, only what constitutes a compliant bit stream. However, an intelligent compressor could try all three coding strategies and select the one that results in the least data to be transmitted.

Figure 2.13. Bidirectional coding over time: an area revealed in picture N+1 is not present in picture N but is present in picture N+2, so it can be predicted from the later picture.

2.10 I, P, and B pictures

In MPEG, three different types of pictures are needed to support differential and bidirectional coding while minimizing error propagation:

I pictures are intra-coded pictures that need no additional information for decoding. They require a lot of data compared to other picture types, and therefore they are not transmitted any more frequently than necessary. They consist primarily of transform coefficients and have no vectors. I pictures allow the viewer to switch channels, and they arrest error propagation.

P pictures are forward predicted from an earlier picture, which could be an I picture or another P picture. P-picture data consist of vectors describing where, in the previous picture, each macroblock should be taken from, and of transform coefficients that describe the correction or difference data that must be added to that macroblock. P pictures require roughly half the data of an I picture.

B pictures are bidirectionally predicted from earlier or later I or P pictures. B-picture data consist of vectors describing where, in earlier or later pictures, data should be taken from, together with the transform coefficients that provide the correction. Because bidirectional prediction is so effective, the correction data are minimal, and this helps a B picture to typically require one quarter the data of an I picture.
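The data ratios quoted above (a P picture needing about half, and a B picture about one quarter, of the data of an I picture) make it easy to estimate the saving a GOP structure gives. The absolute figure of 400 kbit per I picture below is purely illustrative.

```python
def mean_bits_per_picture(gop, i_bits):
    """Average bits per picture for a GOP string, using the rough
    I : P : B cost ratio of 1 : 1/2 : 1/4 described in the text."""
    cost = {'I': i_bits, 'P': i_bits / 2, 'B': i_bits / 4}
    return sum(cost[p] for p in gop) / len(gop)

all_intra = mean_bits_per_picture('IIIIIIIIIIII', 400_000)
ibbp      = mean_bits_per_picture('IBBPBBPBBPBB', 400_000)
```

For a common 12-picture GOP, the IBBP structure averages 150,000 bits per picture against 400,000 for intra-only coding, which is the better-than-2:1 saving shown in Figure 2.15.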

Figure 2.14 introduces the concept of the GOP, or Group Of Pictures. The GOP begins with an I picture and then has P pictures spaced throughout. The remaining pictures are B pictures. The GOP is defined as ending at the last picture before the next I picture. The GOP length is flexible, but 12 or 15 pictures is a common value.

Clearly, if data for B pictures are to be taken from a future picture, that data must already be available at the decoder. Consequently, bidirectional coding requires that picture data are sent out of sequence and temporarily stored. Figure 2.14 also shows that the P-picture data are sent before the B-picture data. Note that the last B pictures in the GOP cannot be transmitted until after the I picture of the next GOP, since that data will be needed to bidirectionally decode them. In order to return pictures to their correct sequence, a temporal reference is included with each picture. As the picture rate is also embedded periodically in headers in the bit stream, an MPEG file may be displayed by, for example, a personal computer, in the correct order and time scale.

Figure 2.14. Rec. 601 video frames in display order and the resulting elementary stream in transmission order, with the temporal_reference value carried by each picture (0 3 1 2 6 4 5 ...).

Sending picture data out of sequence requires additional memory at the encoder and decoder and also causes delay. The number of bidirectionally coded pictures between intra- or forward-predicted pictures must be restricted to reduce cost and to minimize delay, if delay is an issue.

Figure 2.15 shows the tradeoff that must be made between compression factor and coding delay. For a given quality, sending only I pictures requires more than twice the bit rate of an IBBP sequence. Where the ability to edit is important, an IB sequence is a useful compromise.

Figure 2.15. Constant-quality curve: the bit rate (Mbit/s) required falls from I-only through IB to IBBP sequences.
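The out-of-sequence transmission described above can be sketched as a reordering rule: each anchor picture (I or P) is sent before the B pictures that are displayed ahead of it but predicted from it. This simplified sketch stays within one GOP (in a real stream, as noted above, the GOP's last B pictures wait for the next GOP's I picture).

```python
def transmission_order(gop):
    """Map a display-order GOP string (e.g. 'IBBPBBP') to the list of
    display indices in transmission order."""
    order, held_b = [], []
    for n, pic in enumerate(gop):
        if pic == 'B':
            held_b.append(n)      # hold B pictures back...
        else:
            order.append(n)       # ...until their future anchor is sent
            order.extend(held_b)
            held_b = []
    return order

sequence = transmission_order('IBBPBBP')
```

The result, [0, 3, 1, 2, 6, 4, 5], matches the temporal_reference pattern shown in Figure 2.14: each P picture travels ahead of the two B pictures that precede it in display order.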

2.11 An MPEG compressor

Figures 2.16a, b, and c show a typical bidirectional motion compensator structure. Preprocessed input video enters a series of frame stores that can be bypassed to change the picture order. The data then enter the subtractor and the motion estimator. To create an I picture (see Figure 2.16a), the end of the input delay is selected and the subtractor is turned off, so that the data pass straight through to be spatially coded. Subtractor output data also pass to a frame store that can hold several pictures. The I picture is held in the store.

Figure 2.16a. The bidirectional motion compensator creating I pictures (shaded areas are unused): reordering frame delays feed the subtractor and spatial coder, whose quantizing tables are governed by rate control; spatially decoded output fills the past- and future-picture stores used by the forward and backward predictors and the motion estimator.

To encode a P picture (see Figure 2.16b), the B pictures in the input buffer are bypassed, so that the future picture is selected. The motion estimator compares the I picture in the output store with the P picture in the input store to create forward motion vectors. The I picture is shifted by these vectors to make a predicted P picture. The predicted P picture is subtracted from the actual P picture to produce the prediction error, which is spatially coded and sent along with the vectors. The prediction error is also added to the predicted P picture to create a locally decoded P picture that also enters the output store.

Figure 2.16b. The bidirectional motion compensator creating P pictures (shaded areas are unused).

This means that the store contains exactly what the store in the decoder will contain, so that the results of all previous coding errors are present. These will automatically be reduced when the predicted picture is subtracted from the actual picture, because the difference data will be more accurate.

The output store then contains an I picture and a P picture. A B picture from the input buffer can now be selected. The motion compensator (see Figure 2.16c) will compare the B picture with the I picture that precedes it and the P picture that follows it to obtain bidirectional vectors. Forward and backward motion compensation is performed to produce two predicted B pictures. These are subtracted from the current B picture. On a macroblock-by-macroblock basis, the forward or backward data are selected according to which represent the smallest differences. The picture differences are then spatially coded and sent with the vectors.

Figure 2.16c. The bidirectional motion compensator creating B pictures (shaded area is unused).

When all of the intermediate B pictures are coded, the input memory will once more be bypassed to create a new P picture based on the previous P picture.

Figure 2.17 shows an MPEG coder. The motion compensator output is spatially coded, and the vectors are added in a multiplexer. Syntactical data are also added, which identify the type of picture (I, P, or B) and provide other information to help a decoder (see Section 4). The output data are buffered to allow temporary variations in bit rate. If the bit rate shows a long-term increase, the buffer will tend to fill up, and to prevent overflow, the quantization process will have to be made more severe. Equally, should the buffer show signs of underflow, the quantization will be relaxed to maintain the average bit rate.

Figure 2.17. An MPEG coder: the bidirectional coder's spatial data pass through entropy and run length coding, and a multiplexer combines them with differentially coded motion vectors and syntactical data into a buffer emptied by a demand clock; buffer fullness drives rate control through the quantizing tables.
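The buffer feedback just described can be sketched as a leaky bucket. The thresholds and the assumption that coded bits scale inversely with the quantizer scale are our illustrative model, not values from any standard: the buffer drains at the channel rate, the quantizer is made more severe near overflow, and relaxed near underflow.

```python
def rate_control(picture_bits, capacity=500_000, drain=150_000, q=8):
    """Leaky-bucket sketch of output-buffer rate control."""
    fullness, q_trace = 0, []
    for raw_bits in picture_bits:
        coded_bits = raw_bits // q            # coarser q -> fewer bits out
        fullness = max(0, fullness + coded_bits - drain)
        if fullness > 0.8 * capacity:
            q += 2                            # more severe quantization
        elif fullness < 0.2 * capacity and q > 2:
            q -= 1                            # relaxed quantization
        q_trace.append(q)
    return fullness, q_trace

# Twenty equally difficult pictures: the quantizer scale climbs until the
# coded bit rate falls below the channel rate, then relaxes again.
fullness, q_trace = rate_control([2_000_000] * 20)
```

The negative feedback settles the average output bit rate onto the channel rate, at the cost of varying picture quality, which is exactly the compromise a fixed bit-rate coder must make.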

  • 8/14/2019 A guide to MPEG

    19/48

    this behavior would result in aserious increase in vector data.In 60 Hz video, 3:2 pulldown isused to obtain 60 Hz from 24 Hzfilm. One frame is made intotwo fields, the next is made intothree fields, and so on.

    Consequently, one field in fiveis completely redundant. MPEGhandles film material best by

    discarding the third field in 3:2systems. A 24 Hz code in thetransmission alerts the decoderto recreate the 3:2 sequence byre-reading a field store. In 50and 60 Hz telecine, pairs offields are deinterlaced to createframes, and then motion ismeasured between frames. Thedecoder can recreate interlace

    by reading alternate lines in theframe store.

    A cut is a difficult event for a

    compressor to handle because itresults in an almost completeprediction failure, requiring alarge amount of correction data.If a coding delay can be tolerated,a coder may detect cuts inadvance and modify the GOPstructure dynamically, so thatthe cut is made to coincide withthe generation of an I picture. Inthis case, the cut is handledwith very little extra data. Thelast B pictures before the I framewill almost certainly need to

    use forward prediction. In someapplications that are not real-time, such as DVD mastering, acoder could take two passes atthe input video: one pass toidentify the difficult or highentropy areas and create a codingstrategy, and a second pass toactually compress the input video.

    If a high compression factor isrequired, the level of artifactscan increase, especially if inputquality is poor. In this case, itmay be better to reduce theentropy entering the coder usingprefiltering. The video signal issubject to two-dimensional,low-pass filtering, whichreduces the number of coeffi-

    cients needed and reduces thelevel of artifacts. The picturewill be less sharp, but lesssharpness is preferable to a highlevel of artifacts.

    In most MPEG-2 applications,4:2:0 sampling is used, whichrequires a chroma downsam-pling process if the source is4:2:2. In MPEG-1, the luminanceand chroma are further down-sampled to produce an inputpicture or SIF (Source InputFormat), that is only 352-pixelswide. This technique reducesthe entropy by a further factor.For very high compression, theQSIF (Quarter Source InputFormat) picture, which is176-pixels wide, is used.Downsampling is a process thatcombines a spatial low-pass filterwith an interpolator. Downsam-pling interlaced signals is prob-lematic because vertical detail isspread over two fields whichmay decorrelate due to motion.

    When the source material istelecine, the video signal hasdifferent characteristics thannormal video. In 50 Hz video,pairs of fields represent thesame film frame, and there is nomotion between them. Thus, themotion between fields alternates

    between zero and the motionbetween frames. Since motionvectors are sent differentially,

    2.12 Preprocessing

    A compressor attempts to eliminate redundancy within the picture and between pictures. Anything which reduces that redundancy is undesirable. Noise and film grain are particularly problematic because they generally occur over the entire picture. After the DCT process, noise results in more non-zero coefficients, which the coder cannot distinguish from genuine picture data. Heavier quantizing will be required to encode all of the coefficients, reducing picture quality. Noise also reduces similarities between successive pictures, increasing the difference data needed.

    Residual subcarrier in video decoded from composite video is a serious problem because it results in high spatial frequencies that are normally at a low level in component programs. Subcarrier also alternates from picture to picture, causing an increase in difference data. Naturally, any composite decoding artifact that is visible in the input to the MPEG coder is likely to be reproduced at the decoder.

    Any practice that causes unwanted motion is to be avoided. Unstable camera mountings, in addition to giving a shaky picture, increase picture differences and vector transmission requirements. This will also happen with telecine material if film weave or hop due to sprocket-hole damage is present. In general, video that is to be compressed must be of the highest quality possible. If high quality cannot be achieved, then noise reduction and other stabilization techniques will be desirable.


  • 8/14/2019 A guide to MPEG

    20/48

    Figure 2.18.

    A picture with a moderate signal-to-noise ratio results. If, however, that picture is locally decoded and subtracted pixel-by-pixel from the original, a quantizing-noise picture results. This picture can be compressed and transmitted as the helper signal. A simple decoder only decodes the main, noisy bit stream, but a more complex decoder can decode both bit streams and combine them to produce a low-noise picture. This is the principle of SNR scaleability.

    As an alternative, coding only the lower spatial frequencies in an HDTV picture can produce a main bit stream that an SDTV receiver can decode. If the lower-definition picture is locally decoded and subtracted from the original picture, a definition-enhancing picture results. This picture can be coded into a helper signal. A suitable decoder could combine the main and helper signals to recreate the HDTV picture. This is the principle of spatial scaleability.
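    The main-plus-helper split can be illustrated with a toy example that quantizes pixel values directly. A real coder would operate on DCT coefficients, and the step size of 16 used here is an arbitrary choice.

```python
import numpy as np

def encode_scaleable(picture, step=16):
    """SNR-scaleability sketch: a coarsely quantized main signal plus
    a 'helper' residual carrying the quantizing noise."""
    picture = np.asarray(picture, dtype=float)
    main = step * np.round(picture / step)   # noisy main picture
    helper = picture - main                  # quantizing-noise picture
    return main, helper

pic = np.array([[12.0, 200.0], [97.0, 55.0]])
main, helper = encode_scaleable(pic)
# A simple decoder uses `main` alone; a more complex decoder adds
# the helper signal to recover the original values exactly.
assert np.allclose(main + helper, pic)
```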

    The High profile supports both SNR and spatial scaleability as well as allowing the option of 4:2:2 sampling.

    The 4:2:2 profile has been developed for improved compatibility with digital production equipment. This profile allows 4:2:2 operation without requiring the additional complexity of the High profile. For example, an HP@ML decoder must support SNR scaleability, which is not a requirement for production. The 4:2:2 profile has the same freedom of GOP structure as other profiles, but in practice it is commonly used with short GOPs, making editing easier. 4:2:2 operation requires a higher bit rate than 4:2:0, and the use of short GOPs requires an even higher bit rate for a given quality.

    The Low level uses a low-resolution input having only 352 pixels per line. The majority of broadcast applications will require the MP@ML (Main Profile at Main Level) subset of MPEG, which supports SDTV (Standard Definition TV).

    The High-1440 level is a high-definition scheme that doubles the definition compared to the Main level. The High level not only doubles the resolution but maintains that resolution with the 16:9 format by increasing the number of horizontal samples from 1440 to 1920.

    In compression systems using spatial transforms and requantizing, it is possible to produce scaleable signals. A scaleable process is one in which the input results in a main signal and a "helper" signal. The main signal can be decoded alone to give a picture of a certain quality, but, if the information from the helper signal is added, some aspect of the quality can be improved.

    For example, a conventional MPEG coder, by heavily requantizing coefficients, encodes a picture with a moderate signal-to-noise ratio.

    2.13 Profiles and levels

    MPEG is applicable to a wide range of applications requiring different performance and complexity. Using all of the encoding tools defined in MPEG, there are millions of possible combinations. For practical purposes, the MPEG-2 standard is divided into profiles, and each profile is subdivided into levels (see Figure 2.18). A profile is basically a subset of the entire coding repertoire requiring a certain complexity. A level is a parameter such as the size of the picture or the bit rate used with that profile. In principle, there are 24 combinations, but not all of these have been defined. An MPEG decoder having a given profile and level must also be able to decode lower profiles and levels.
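    The defined combinations in Figure 2.18 can be captured in a small lookup table, together with the rule that a decoder must also handle lower profiles and levels. This is a sketch: the linear profile ordering below is a simplification, and the 4:2:2 profile is treated as a special case outside it.

```python
# Defined MPEG-2 profile/level combinations (data from Figure 2.18).
# (profile, level) -> (chroma format, max picture size, max bit rate in Mb/s)
DEFINED = {
    ("Simple", "Main"):       ("4:2:0", (720, 576),   15),
    ("Main", "Low"):          ("4:2:0", (352, 288),    4),
    ("Main", "Main"):         ("4:2:0", (720, 576),   15),
    ("Main", "High-1440"):    ("4:2:0", (1440, 1152), 60),
    ("Main", "High"):         ("4:2:0", (1920, 1152), 80),
    ("4:2:2", "Main"):        ("4:2:2", (720, 608),   50),
    ("SNR", "Low"):           ("4:2:0", (352, 288),    4),
    ("SNR", "Main"):          ("4:2:0", (720, 576),   15),
    ("Spatial", "High-1440"): ("4:2:0", (1440, 1152), 60),
    ("High", "Main"):         ("4:2:2", (720, 576),   20),
    ("High", "High-1440"):    ("4:2:2", (1440, 1152), 80),
    ("High", "High"):         ("4:2:2", (1920, 1152), 100),
}

PROFILES = ["Simple", "Main", "SNR", "Spatial", "High"]  # 4:2:2 sits outside
LEVELS = ["Low", "Main", "High-1440", "High"]

def can_decode(decoder, stream):
    """True if a decoder at one (profile, level) can handle a stream
    at another: it must also decode all lower profiles and levels."""
    return (PROFILES.index(stream[0]) <= PROFILES.index(decoder[0])
            and LEVELS.index(stream[1]) <= LEVELS.index(decoder[1]))

assert can_decode(("Main", "Main"), ("Simple", "Main"))
assert not can_decode(("Main", "Main"), ("Main", "High"))
```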

    The Simple profile does not support bidirectional coding, and so only I and P pictures will be output. This reduces the coding and decoding delay and allows simpler hardware. The Simple profile has only been defined at Main level (SP@ML).

    The Main profile is designed for a large proportion of uses.


    The defined profile/level combinations of Figure 2.18, arranged by level:

    HIGH level:
      Main profile: 4:2:0, 1920x1152, 80 Mb/s, I,P,B
      High profile: 4:2:0 or 4:2:2, 1920x1152, 100 Mb/s, I,P,B

    HIGH-1440 level:
      Main profile: 4:2:0, 1440x1152, 60 Mb/s, I,P,B
      Spatial profile: 4:2:0, 1440x1152, 60 Mb/s, I,P,B
      High profile: 4:2:0 or 4:2:2, 1440x1152, 80 Mb/s, I,P,B

    MAIN level:
      Simple profile: 4:2:0, 720x576, 15 Mb/s, I,P
      Main profile: 4:2:0, 720x576, 15 Mb/s, I,P,B
      4:2:2 profile: 4:2:2, 720x608, 50 Mb/s, I,P,B
      SNR profile: 4:2:0, 720x576, 15 Mb/s, I,P,B
      High profile: 4:2:0 or 4:2:2, 720x576, 20 Mb/s, I,P,B

    LOW level:
      Main profile: 4:2:0, 352x288, 4 Mb/s, I,P,B
      SNR profile: 4:2:0, 352x288, 4 Mb/s, I,P,B


    Figure 2.19.

    of pitch in steady tones.

    For video coding, wavelets have the advantage of producing resolution-scaleable signals with almost no extra effort. In moving video, the advantages of wavelets are offset by the difficulty of assigning motion vectors to a variable-size block, but in still-picture or I-picture coding this difficulty does not arise.

    Figure 2.19 contrasts the fixed block size of the DFT/DCT with the variable size of the wavelet.

    Wavelets are especially useful for audio coding because they automatically adapt to the conflicting requirements of the accurate location of transients in time and the accurate assessment

    2.14 Wavelets

    All transforms suffer from uncertainty because the more accurately the frequency domain is known, the less accurately the time domain is known (and vice versa). In most transforms, such as the DFT and DCT, the block length is fixed, so the time and frequency resolution is fixed.

    The frequency coefficients represent evenly spaced values on a linear scale. Unfortunately, because human senses are logarithmic, the even scale of the DFT and DCT gives inadequate frequency resolution at one end and excess resolution at the other.

    The wavelet transform is not affected by this problem because its frequency resolution is a fixed fraction of an octave and therefore has a logarithmic characteristic. This is done by changing the block length as a function of frequency. As frequency goes down, the block becomes longer. Thus, a characteristic of the wavelet transform is that the basis functions all contain the same number of cycles, and these cycles are simply scaled along the time axis to search for different frequencies.
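    A constant-cycle basis function of this kind can be sketched as a windowed sinusoid whose block length scales as 1/frequency. This is a Morlet-style illustration; the five-cycle count, 48 kHz sample rate, and Hanning window are arbitrary choices, not taken from any particular codec.

```python
import numpy as np

def wavelet_basis(frequency, cycles=5, rate=48000):
    """A wavelet-like basis function: a windowed sinusoid holding a
    constant number of cycles, so the analysis block shortens as
    frequency rises (illustrative sketch)."""
    n = round(rate * cycles / frequency)   # block length scales as 1/f
    t = np.arange(n) / rate
    return np.hanning(n) * np.sin(2 * np.pi * frequency * t)

low = wavelet_basis(100.0)     # long block at low frequency
high = wavelet_basis(1600.0)   # short block four octaves up
assert low.size == 16 * high.size  # block length halves per octave
```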


    The figure shows constant-size windows in the FFT versus a constant number of cycles in each wavelet-transform basis function.


    Figure 3.1.

    frequencies available determines the frequency range of human hearing, which in most people is from 20 Hz to about 15 kHz.

    Different frequencies in the input sound cause different areas of the membrane to vibrate. Each area has different nerve endings to allow pitch discrimination. The basilar membrane also has tiny muscles controlled by the nerves that together act as a kind of positive feedback system that improves the Q factor of the resonance.

    The resonant behavior of the basilar membrane is an exact parallel with the behavior of a transform analyzer. According to the uncertainty theory of transforms, the more accurately the frequency domain of a signal is known, the less accurately the time domain is known. Consequently, the more a transform is able to discriminate between two frequencies, the less able it is to discriminate between the times of two events. Human hearing has evolved with a certain compromise that balances time discrimination and frequency discrimination; in the balance, neither ability is perfect.

    The imperfect frequency discrimination results in the inability to separate closely spaced frequencies. This inability is known as auditory masking, defined as the reduced sensitivity to sound in the presence of another.

    The physical hearing mechanism consists of the outer, middle, and inner ears. The outer ear comprises the ear canal and the eardrum. The eardrum converts the incident sound into a vibration in much the same way as does a microphone diaphragm. The inner ear works by sensing vibrations transmitted through a fluid. The impedance of fluid is much higher than that of air, and the middle ear acts as an impedance-matching transformer that improves power transfer.

    Figure 3.1 shows that vibrations are transferred to the inner ear by the stirrup bone, which acts on the oval window. Vibrations in the fluid in the ear travel up the cochlea, a spiral cavity in the skull (shown unrolled in Figure 3.1 for clarity). The basilar membrane is stretched across the cochlea. This membrane varies in mass and stiffness along its length. At the end near the oval window, the membrane is stiff and light, so its resonant frequency is high. At the distant end, the membrane is heavy and soft and resonates at low frequency. The range of resonant

    SECTION 3 AUDIO COMPRESSION

    Lossy audio compression is based entirely on the characteristics of human hearing, which must be considered before any description of compression is possible. Surprisingly, human hearing, particularly in stereo, is actually more critically discriminating than human vision, and consequently audio compression should be undertaken with care. As with video compression, audio compression requires a number of different levels of complexity according to the required compression factor.

    3.1 The hearing mechanism

    Hearing comprises physical processes in the ear and nervous/mental processes that combine to give us an impression of sound. The impression we receive is not identical to the actual acoustic waveform present in the ear canal because some entropy is lost. Audio compression systems that lose only that part of the entropy that will be lost in the hearing mechanism will produce good results.


    The figure shows the outer ear, eardrum, middle ear, stirrup bone, inner ear, basilar membrane, and the cochlea (unrolled).


    Figure 3.2a.

    3.2 Subband coding

    Figure 3.4 shows a band-splitting compandor. The band-splitting filter is a set of narrow-band, linear-phase filters that overlap and all have the same bandwidth. The output in each band consists of samples representing a waveform. In each frequency band, the audio input is amplified up to maximum level prior to transmission. Afterwards, each level is returned to its correct value. Noise picked up in the transmission is reduced in each band. If the noise reduction is compared with the threshold of hearing, it can be seen that greater noise can be tolerated in some bands because of masking. Consequently, in each band after companding, it is possible to reduce the wordlength of samples. This technique achieves a compression because the noise introduced by the loss of resolution is masked.
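    The per-band companding and wordlength reduction can be sketched as follows. This is illustrative only: the 8-bit wordlength and the simple linear quantizer are assumptions, not the quantizer any standard specifies.

```python
import numpy as np

def compand_band(samples, bits=8):
    """Amplify one band to full scale, then requantize to a reduced
    wordlength.  A single scale factor per block records the gain."""
    samples = np.asarray(samples, dtype=float)
    scale = 1.0 / np.max(np.abs(samples))   # gain to reach maximum level
    levels = 2 ** (bits - 1) - 1
    quantized = np.round(samples * scale * levels).astype(int)
    return quantized, scale

def expand_band(quantized, scale, bits=8):
    """Decoder side: undo the requantizing, then the companding."""
    levels = 2 ** (bits - 1) - 1
    return quantized / (levels * scale)

band = np.array([0.10, -0.20, 0.05, 0.15])
q, s = compand_band(band)
# Only small rounding noise remains, which masking in that band hides.
assert np.max(np.abs(expand_band(q, s) - band)) < 1e-3
```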

    be present for at least about 1 millisecond before it becomes audible. Because of this slow response, masking can still take place even when the two signals involved are not simultaneous. Forward and backward masking occur when the masking sound continues to mask sounds at lower levels before and after the masking sound's actual duration. Figure 3.3 shows this concept.

    Masking raises the threshold of hearing, and compressors take advantage of this effect by raising the noise floor, which allows the audio waveform to be expressed with fewer bits. The noise floor can only be raised at frequencies at which there is effective masking. To maximize effective masking, it is necessary to split the audio spectrum into different frequency bands to allow the introduction of different amounts of companding and noise in each band.

    Figure 3.2a shows that the threshold of hearing is a function of frequency. The greatest sensitivity is, not surprisingly, in the speech range. In the presence of a single tone, the threshold is modified as in Figure 3.2b. Note that the threshold is raised for tones at higher frequency and, to some extent, at lower frequency.

    In the presence of a complex input spectrum, such as music, the threshold is raised at nearly all frequencies. One consequence of this behavior is that the hiss from an analog audio cassette is only audible during quiet passages in music. Companding makes use of this principle by amplifying low-level audio signals prior to recording or transmission and returning them to their correct level afterwards.

    The imperfect time discrimination of the ear is due to its resonant response. The Q factor is such that a given sound has to

    Figure 3.2a. The threshold of hearing in quiet: sound pressure versus frequency, 20 Hz to 20 kHz.

    Figure 3.2b. The masking threshold in the presence of a 1 kHz sine wave: sound pressure versus frequency, 20 Hz to 20 kHz.

    Figure 3.3. Pre-masking and post-masking: the raised threshold extends in time before and after the masking sound.

    Figure 3.4. One band of a band-splitting compandor: the input passes through a sub-band filter, a level detector derives a gain factor, and a multiplier produces the companded audio.


    Figure 3.5.

    be reversed at the decoder.

    The filter bank output is also analyzed to determine the spectrum of the input signal. This analysis drives a masking model that determines the degree of masking that can be expected in each band. The more masking available, the less accurate the samples in each band can be.

    The sample accuracy is reduced by requantizing to reduce wordlength. This reduction is also constant for every word in a band, but different bands can use different wordlengths. The wordlength needs to be transmitted as a bit allocation code for each band to allow the decoder to deserialize the bit stream properly.

    3.3 MPEG Layer 1

    Figure 3.6 shows an MPEG Layer 1 audio bit stream. Following the synchronizing pattern and the header, there are 32 bit allocation codes of four bits each. These codes describe the wordlength of samples in each subband. Next come the 32 scale factors used in the companding of each band. These scale factors determine the gain needed in the decoder to return the audio to the correct level. The scale factors are followed, in turn, by the audio data in each band.
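    Using the field sizes quoted above, the fixed overhead of a frame can be tallied as a sketch. The 16-bit CRC size is an assumption here, and scale factors are only sent for bands with a nonzero allocation, so this is a worst case.

```python
def layer1_frame_overhead_bits(crc=False):
    """Worst-case fixed overhead of an MPEG Layer 1 frame
    (illustrative tally of the field sizes quoted in the text)."""
    bits = 12          # synchronizing pattern
    bits += 20         # system information in the header
    if crc:
        bits += 16     # optional CRC (16 bits assumed)
    bits += 32 * 4     # 32 four-bit allocation codes
    bits += 32 * 6     # up to 32 six-bit scale factors
    return bits

assert layer1_frame_overhead_bits() == 352
```

    The remainder of the frame carries the requantized subband samples and any ancillary data, whose lengths depend on the allocation actually chosen.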

    Figure 3.7 shows the Layer 1 decoder. The synchronization pattern is detected by the timing generator, which deserializes the bit allocation and scale factor data. The bit allocation data then allows deserialization of the variable-length samples. The requantizing is reversed, and the companding is reversed by the scale factor data to put each band back to the correct level. These 32 separate bands are then combined in a combiner filter which produces the audio output.

    output of the filter there are 12 samples in each of 32 bands. Within each band, the level is amplified by multiplication to bring the level up to maximum. The gain required is constant for the duration of a block, and a single scale factor is transmitted with each block for each band in order to allow the process to

    Figure 3.5 shows a simple band-splitting coder as is used in MPEG Layer 1. The digital audio input is fed to a band-splitting filter that divides the spectrum of the signal into a number of bands. In MPEG, this number is 32. The time axis is divided into blocks of equal length. In MPEG Layer 1, this is 384 input samples, so in the

    Figure 3.5. Layer 1 coder: a 32-band bandsplitting filter feeds a scaler/quantizer and a dynamic bit and scale-factor allocator and coder, driven by the masking thresholds; the outputs are multiplexed into the bit stream.

    Figure 3.6. MPEG Layer 1 frame, carrying 384 PCM input samples (8 ms at 48 kHz): 12-bit sync, 20-bit system information, optional CRC, 32 four-bit allocation codes, 6-bit scale factors, twelve granules (GR0 to GR11) of subband samples, and ancillary data of unspecified length.

    Figure 3.7. Layer 1 decoder: sync detection and a demux recover the 32 bit-allocation codes, 32 scale factors, and the subband samples, which a 32-input combiner filter turns back into audio.


    Figure 3.8.

    performed by the psychoacoustic model. It has been found that pre-echo is associated with the entropy in the audio rising above the average value. To obtain the highest compression factor, nonuniform quantizing of the coefficients is used along with Huffman coding. This technique allocates the shortest wordlengths to the most common code values.

    3.7 AC-3

    The AC-3 audio coding technique is used with the ATSC system instead of one of the MPEG audio coding schemes. AC-3 is a transform-based system that obtains coding gain by requantizing frequency coefficients.

    The PCM input to an AC-3 coder is divided into overlapping windowed blocks, as shown in Figure 3.9. These blocks contain 512 samples each, but because of the 50 percent overlap, there is 100 percent redundancy. After the transform, there are 512 coefficients in each block, but because of the redundancy, these coefficients can be decimated to 256 coefficients using a technique called Time Domain Aliasing Cancellation (TDAC).
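    The overlapping block structure can be sketched as follows. A sine window is assumed here purely for illustration; AC-3's actual window shape is defined in the standard.

```python
import numpy as np

def overlapped_blocks(pcm, block=512):
    """Split PCM audio into 50%-overlapped, windowed 512-sample blocks
    (sketch of the AC-3 block structure).  Each input sample lands in
    two blocks, giving the 100% redundancy that TDAC later removes."""
    hop = block // 2
    window = np.sin(np.pi * (np.arange(block) + 0.5) / block)  # assumed shape
    starts = range(0, len(pcm) - block + 1, hop)
    return [window * pcm[s:s + block] for s in starts]

pcm = np.arange(2048.0)
blocks = overlapped_blocks(pcm)
assert len(blocks) == (2048 - 512) // 256 + 1
```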

    response cannot increase or decrease rapidly. Consequently, if an audio waveform is transformed into the frequency domain, the coefficients do not need to be sent very often. This principle is the basis of transform coding. For higher compression factors, the coefficients can be requantized, making them less accurate. This process produces noise, which will be placed at frequencies where the masking is the greatest. A by-product of a transform coder is that the input spectrum is accurately known, so a precise masking model can be created.

    3.6 MPEG Layer 3

    This complex level of coding is really only required when the highest compression factor is needed. It has a degree of commonality with Layer 2. A discrete cosine transform is used having 384 output coefficients per block. This output can be obtained by direct processing of the input samples, but in a multi-level coder, it is possible to use a hybrid transform incorporating the 32-band filtering of Layers 1 and 2 as a basis. If this is done, the 32 subbands from the QMF (Quadrature Mirror Filter) are each further processed by a 12-band MDCT (Modified Discrete Cosine Transform) to obtain 384 output coefficients.

    Two window sizes are used to avoid pre-echo on transients. The window switching is

    3.4 MPEG Layer 2

    Figure 3.8 shows that when the band-splitting filter is used to drive the masking model, the spectral analysis is not very accurate, since there are only 32 bands and the energy could be anywhere in the band. The noise floor cannot be raised very much because, in the worst case shown, the masking may not operate. A more accurate spectral analysis would allow a higher compression factor. In MPEG Layer 2, the spectral analysis is performed by a separate process. A 512-point FFT working directly from the input is used to drive the masking model instead. To resolve frequencies more accurately, the time span of the transform has to be increased, which is done by raising the block size to 1152 samples.

    While the block-companding scheme is the same as in Layer 1, not all of the scale factors are transmitted, since they contain a degree of redundancy on real program material. The scale factor of successive blocks in the same band differs by more than 2 dB less than 10 percent of the time, and advantage is taken of this characteristic by analyzing sets of three successive scale factors. On stationary programs, only one scale factor out of three is sent. As transient content increases in a given subband, two or three scale factors will be sent. A scale factor select code is also sent to allow the decoder to determine what has been sent in each subband. This technique effectively halves the scale factor bit rate.
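    The selection logic can be sketched as follows. This is illustrative only: the real standard encodes the choice through a table of scale-factor-select patterns rather than the direct 2 dB comparison used here.

```python
def scale_factors_to_send(sf, threshold_db=2.0):
    """Decide how many of three successive scale factors (in dB) to
    transmit for one subband (illustrative sketch of the Layer 2
    scale-factor-select idea)."""
    similar_01 = abs(sf[0] - sf[1]) <= threshold_db
    similar_12 = abs(sf[1] - sf[2]) <= threshold_db
    if similar_01 and similar_12:
        return [max(sf)]                    # stationary: one value covers all
    if similar_01:
        return [max(sf[0], sf[1]), sf[2]]   # two values
    if similar_12:
        return [sf[0], max(sf[1], sf[2])]   # two values
    return list(sf)                         # strong transient: send all three

assert len(scale_factors_to_send((40.0, 40.5, 41.0))) == 1
assert len(scale_factors_to_send((40.0, 52.0, 64.0))) == 3
```

    A select code transmitted alongside tells the decoder which of the cases applies in each subband.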

    3.5 Transform coding

    Layers 1 and 2 are based on band-splitting filters in which the signal is still represented as a waveform. However, Layer 3 adopts transform coding similar to that used in video coding. As was mentioned above, the ear performs a kind of frequency transform on the incident sound and, because of the Q factor of the basilar membrane, the

    Figure 3.8. A wide subband: is the masking like this, or like this?

    Figure 3.9. Overlapping windowed blocks of 512 samples.


    Figure 4.1.

    in absolute form. The remainder are transmitted differentially, and the decoder adds the difference to the previous exponent. Where the input audio has a smooth spectrum, the exponents in several frequency bands may be the same. Exponents can be grouped into sets of two or four with flags that describe what has been done.

    Sets of six blocks are assembled into an AC-3 sync frame. The first block of the frame always has full exponent data, but in cases of stationary signals, later blocks in the frame can use the same exponents.
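    The differential scheme can be sketched in a few lines. This illustrates the principle only; the real coder also groups differences and limits their range.

```python
def encode_exponents(exponents):
    """Send the first (lowest-frequency) exponent in absolute form
    and the rest as differences from their predecessor (sketch)."""
    diffs = [exponents[0]]
    for prev, cur in zip(exponents, exponents[1:]):
        diffs.append(cur - prev)
    return diffs

def decode_exponents(diffs):
    """The decoder adds each difference to the previous exponent."""
    exps = [diffs[0]]
    for d in diffs[1:]:
        exps.append(exps[-1] + d)
    return exps

exps = [6, 6, 7, 7, 7, 5]
assert decode_exponents(encode_exponents(exps)) == exps
```

    On a smooth spectrum the differences are mostly zero, which is what makes the grouping flags worthwhile.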

    analysis of the input to a finite accuracy on a logarithmic scale called the spectral envelope. This spectral analysis is the input to the masking model that determines the degree to which noise can be raised at each frequency.

    The masking model drives the requantizing process, which reduces the accuracy of each coefficient by rounding the mantissae. A significant proportion of the transmitted data consists of these mantissae.

    The exponents are also transmitted, but not directly, as there is further redundancy within them that can be exploited. Within a block, only the first (lowest-frequency) exponent is transmitted

    The input waveform is analyzed, and if there is a significant transient in the second half of the block, the waveform will be split into two to prevent pre-echo. In this case, the number of coefficients remains the same, but the frequency resolution will be halved and the temporal resolution will be doubled. A flag is set in the bit stream to indicate to the decoder that this has been done.

    The coefficients are output in floating-point notation as a mantissa and an exponent. The representation is the binary equivalent of scientific notation. Exponents are effectively scale factors. The set of exponents in a block produces a spectral


    decoders. The approach also allows the use of proprietary coding algorithms, which need not enter the public domain.

    4.1 Video elementary stream syntax

    Figure 4.1 shows the construction of the elementary video stream. The fundamental unit of picture information is the DCT block, which represents an 8 x 8 array of pixels that can be Y, Cr, or Cb. The DC coefficient is sent first and is represented more accurately than the other coefficients. Following the remaining coefficients, an end-of-block (EOB) code is sent.
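    The block serialization can be sketched as follows. This is a simplification: real elementary streams also apply zigzag scanning and run-length/Huffman coding to the AC coefficients.

```python
def serialize_block(coefficients, eob="EOB"):
    """Emit a DCT block's coefficients up to the last non-zero one,
    followed by an end-of-block code (illustrative sketch only)."""
    last = 0
    for i, c in enumerate(coefficients):
        if c != 0:
            last = i                       # remember last non-zero coefficient
    return list(coefficients[:last + 1]) + [eob]

# 64 coefficients, mostly zero after quantizing; the DC value comes first:
block = [45, 3, -2] + [0] * 61
assert serialize_block(block) == [45, 3, -2, "EOB"]
```

    The EOB code is what lets the decoder skip the long run of zero-valued high-frequency coefficients without transmitting them.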

    encoders. By standardizing the decoder, decoders can be made at low cost. In contrast, the encoder can be more complex and more expensive without a great cost penalty, but with the potential for better picture quality as complexity increases. When the encoder and the decoder are different in complexity, the coding system is said to be asymmetrical.

    The MPEG approach also allows for the possibility that quality will improve as coding algorithms are refined while still producing bit streams that can be understood by earlier

    SECTION 4 ELEMENTARY STREAMS

    An elementary stream is basically the raw output of an encoder and contains no more than is necessary for a decoder to approximate the original picture or audio. The syntax of the compressed signal is rigidly defined in MPEG so that decoders can be guaranteed to operate on it. The encoder is not defined except that it must somehow produce the right syntax.

    The advantage of this approach is that it suits the real world, in which there are likely to be many more decoders than

    The elementary stream header fields shown in Figure 4.1 include the sync pattern, picture size, aspect ratio, progressive/interlace flag, chroma type, global vectors, and motion vectors.