
Chapter 1

VIDEO FORMATION, PERCEPTION, AND REPRESENTATION

In this first chapter, we describe what a video signal is, how it is captured and perceived, how it is stored and transmitted, and which parameters determine the quality and bandwidth (which in turn determines the data rate) of a video signal. We first present the underlying physics for color perception and specification (Sec. 1.1). We then describe the principles and typical devices for video capture and display (Sec. 1.2). As will be seen, analog videos are captured, stored, and transmitted in a raster scan format, using either progressive or interlaced scans (Sec. 1.3). As an example, we review the analog color television (TV) system (Sec. 1.4), and give insights into how certain critical parameters, such as the frame rate and line rate, are chosen, what the spectral content of a color TV signal is, and how the different components of the signal can be multiplexed into a composite signal. Finally, Section 1.5 introduces the ITU-R BT.601 video format (formerly CCIR601), the digitized version of the analog color TV signal. We present some of the considerations that have gone into the selection of various digitization parameters. We also describe several other digital video formats, including high-definition TV (HDTV). The compression standards developed for different applications and their associated video formats are summarized.

The purpose of this chapter is to give the readers background knowledge about analog and digital video, and to provide insights into common video system design problems. As such, the presentation is intentionally made more qualitative than quantitative. In later chapters, we will come back to certain problems mentioned in this chapter and provide more rigorous descriptions and solutions.

1.1 Color Perception and Specification

A video signal is a sequence of two-dimensional (2D) images projected from a dynamic three-dimensional (3D) scene onto the image plane of a video camera. The color value at any point in a video frame records the emitted or reflected light at a particular 3D point in the observed scene. To understand what the color value means physically, we review in this section the basics of light physics and describe the attributes that characterize light and its color. We will also describe the principle of human color perception and different ways to specify a color signal.

    1.1.1 Light and Color

Light is an electromagnetic wave with wavelengths in the range of 380 to 780 nanometers (nm), to which the human eye is sensitive. The energy of light is measured by flux, with a unit of watt, which is the rate at which energy is emitted. The radiant intensity of a light, which is directly related to the brightness of the light we perceive, is defined as the flux radiated into a unit solid angle in a particular direction, measured in watts/solid-angle. A light source usually emits energy over a range of wavelengths, and its intensity can vary in both space and time. In this book, we use C(X, t, λ) to represent the radiant intensity distribution of a light, which specifies the light intensity at wavelength λ, spatial location X = (X, Y, Z), and time t.

The perceived color of a light depends on its spectral content (i.e., its wavelength composition). For example, a light that has its energy concentrated near 700 nm appears red. A light that has equal energy across the entire visible band appears white. In general, a light that has a very narrow bandwidth is referred to as a spectral color. On the other hand, a white light is said to be achromatic.

There are two types of light sources: the illuminating source, which emits an electromagnetic wave, and the reflecting source, which reflects an incident wave.^1 The illuminating light sources include the sun, light bulbs, television (TV) monitors, etc. The perceived color of an illuminating light source depends on the wavelength range in which it emits energy. Illuminating light follows an additive rule: the perceived color of several mixed illuminating light sources depends on the sum of the spectra of all the light sources. For example, combining red, green, and blue lights in the right proportions creates the white color.

The reflecting light sources are those that reflect an incident light (which could itself be a reflected light). When a light beam hits an object, the energy in a certain wavelength range is absorbed, while the rest is reflected. The color of a reflected light depends on the spectral content of the incident light and the wavelength range that is absorbed. A reflecting light source follows a subtractive rule: the perceived color of several mixed reflecting light sources depends on the remaining, unabsorbed wavelengths. The most notable reflecting light sources are color dyes and paints.

For example, if the incident light is white, a dye that absorbs the wavelengths near 700 nm (red) appears as cyan. In this sense, we say that cyan is the complement of red (or white minus red). Similarly, magenta and yellow are the complements of green and blue, respectively. Mixing cyan, magenta, and yellow dyes produces black, which absorbs the entire visible spectrum.

^1 The illuminating and reflecting light sources are also referred to as primary and secondary light sources, respectively. We do not use those terms, to avoid confusion with the primary colors associated with light. In other places, illuminating and reflecting lights are also called additive colors and subtractive colors, respectively.

[Figure 1.1. Solid line: frequency responses of the three types of cones on the human retina. The blue response curve is magnified by a factor of 20 in the figure. Dashed line: the luminous efficiency function. From [10, Fig. 1].]

    1.1.2 Human Perception of Color

The perception of light in human beings starts with the photoreceptors located in the retina (the surface at the rear of the eyeball). There are two types of receptors: cones, which function under bright light and can perceive color tone, and rods, which work under low ambient light and can only extract luminance information. The visual information from the retina is passed via optic nerve fibers to the brain area called the visual cortex, where visual processing and understanding are accomplished. There are three types of cones, which have overlapping pass-bands in the visible spectrum with peaks at the red (near 570 nm), green (near 535 nm), and blue (near 445 nm) wavelengths, respectively, as shown in Figure 1.1. The responses of these receptors to an incoming light distribution C(λ) can be described by

    C_i = \int C(\lambda)\, a_i(\lambda)\, d\lambda, \quad i = r, g, b,        (1.1.1)

where a_r(\lambda), a_g(\lambda), a_b(\lambda) are referred to as the frequency responses or relative absorption functions of the red, green, and blue cones. The combination of these three types of receptors enables a human being to perceive any color. This implies that the perceived color depends only on three numbers, C_r, C_g, C_b, rather than on the complete light spectrum C(\lambda). This is known as the tri-receptor theory of color vision, first discovered by Young [14].
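To make Eq. (1.1.1) concrete, the following Python sketch evaluates the three cone responses by numerical integration. The Gaussian absorption curves and the test spectrum are purely illustrative stand-ins (real cone sensitivities are the measured curves of Fig. 1.1), so only the qualitative outcome matters:

```python
import numpy as np

# Sketch of Eq. (1.1.1): cone responses as integrals of the light spectrum
# against absorption functions. The Gaussian curves below are hypothetical
# stand-ins for the measured sensitivities of [10, Fig. 1].
wavelengths = np.linspace(380.0, 780.0, 401)  # visible band, in nm
dlam = wavelengths[1] - wavelengths[0]

def gaussian(peak_nm, width_nm):
    return np.exp(-0.5 * ((wavelengths - peak_nm) / width_nm) ** 2)

a_r = gaussian(570.0, 50.0)         # "red" cone, peak near 570 nm
a_g = gaussian(535.0, 45.0)         # "green" cone, peak near 535 nm
a_b = 0.05 * gaussian(445.0, 30.0)  # "blue" cone (much weaker response)

# A hypothetical light spectrum C(lambda) with energy concentrated near
# 700 nm, which should appear red.
C = gaussian(700.0, 20.0)

# C_i = integral of C(lambda) a_i(lambda) dlambda, approximated as a sum.
C_r = np.sum(C * a_r) * dlam
C_g = np.sum(C * a_g) * dlam
C_b = np.sum(C * a_b) * dlam
print(C_r, C_g, C_b)  # C_r dominates, consistent with a red percept
```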


There are two attributes that describe the color sensation of a human being: luminance and chrominance. The term luminance refers to the perceived brightness of the light, which is proportional to the total energy in the visible band. The term chrominance describes the perceived color tone of a light, which depends on the wavelength composition of the light. Chrominance is in turn characterized by two attributes: hue and saturation. Hue specifies the color tone, which depends on the peak wavelength of the light, while saturation describes how pure the color is, which depends on the spread, or bandwidth, of the light spectrum. In this book, we use the word color to refer to both the luminance and chrominance attributes of a light, although it is customary to use the word color to refer only to the chrominance aspect of a light.

Experiments have shown that there exists a secondary processing stage in the human visual system (HVS), which converts the three color values obtained by the cones into one value that is proportional to the luminance and two other values that are responsible for the perception of chrominance. This is known as the opponent color model of the HVS [3, 9]. It has been found that the same amount of energy produces different sensations of brightness at different wavelengths, and this wavelength-dependent variation of the brightness sensation is characterized by a relative luminous efficiency function, a_y(\lambda), which is also shown (as the dashed line) in Fig. 1.1. It is essentially the sum of the frequency responses of all three types of cones. We can see that the green wavelengths contribute the most to the perceived brightness, the red wavelengths the second most, and the blue the least. The luminance (often denoted by Y) is related to the incoming light spectrum by

    Y = \int C(\lambda)\, a_y(\lambda)\, d\lambda.        (1.1.2)

In the above equations, we have neglected the time and space variables, since we are only concerned with the perceived color or luminance at a fixed spatial and temporal location. We have also neglected the scaling factor commonly associated with each equation, which depends on the desired units for describing the color intensities and luminance.

    1.1.3 The Trichromatic Theory of Color Mixture

A very important finding in color physics is that most colors can be produced by mixing three properly chosen primary colors. This is known as the trichromatic theory of color mixture, first demonstrated by Maxwell in 1855 [9, 13]. Let C_k, k = 1, 2, 3, represent the colors of three primary color sources, and C a given color. Then the theory essentially says

    C = \sum_{k=1,2,3} T_k C_k,        (1.1.3)

where the T_k's are the amounts of the three primary colors required to match color C. The T_k's are known as tristimulus values. In general, some of the T_k's can be negative. Assuming only T_1 is negative, this means that one cannot match color C by mixing C_1, C_2, C_3, but one can match color C + |T_1| C_1 with T_2 C_2 + T_3 C_3.

In practice, the primary colors should be chosen so that most natural colors can be reproduced using positive combinations of the primary colors. The most popular primary set for illuminating light sources contains the red, green, and blue colors, known as the RGB primary. The most common primary set for reflecting light sources contains cyan, magenta, and yellow, known as the CMY primary. In fact, the RGB and CMY primary sets are complements of each other, in that mixing two colors in one set will produce one color in the other set. For example, mixing red with green will yield yellow. This complementary relation is best illustrated by a color wheel, which can be found in many image processing books, e.g., [9, 4].

For a chosen primary set, one way to determine the tristimulus values of any color is to first determine the color matching functions, m_i(\lambda), for the primary colors C_i, i = 1, 2, 3. These functions describe the tristimulus values of a spectral color with wavelength \lambda, for various \lambda in the entire visible band, and can be determined by visual experiments under controlled viewing conditions. Then the tristimulus values for any color with a spectrum C(\lambda) can be obtained by [9]

    T_i = \int C(\lambda)\, m_i(\lambda)\, d\lambda, \quad i = 1, 2, 3.        (1.1.4)

To produce all visible colors with positive mixing, the matching functions associated with the primary colors must be positive.

The above theory forms the basis for color capture and display. To record the color of an incoming light, a camera needs to have three sensors whose frequency responses are similar to the color matching functions of a chosen primary set. This can be accomplished by optical or electronic filters with the desired frequency responses. Similarly, to display a color picture, the display device needs to emit three optical beams of the chosen primary colors with appropriate intensities, as specified by the tristimulus values. In practice, electron beams that strike phosphors of the red, green, and blue colors are used. All present display systems use an RGB primary, although the standard spectra specified for the primary colors may be slightly different. Likewise, a color printer can produce different colors by mixing three dyes of the chosen primary colors in appropriate proportions. Most color printers use the CMY primary. For a more vivid and wide-range color rendition, some color printers use four primaries, by adding black (K) to the CMY set. This is known as the CMYK primary, which can render the black color more truthfully.

1.1.4 Color Specification by Tristimulus Values

Tristimulus Values We have introduced the tristimulus representation of a color in Sec. 1.1.3, which specifies the proportions, i.e., the T_k's in Eq. (1.1.3), of the three primary colors needed to create the desired color. In order to make the color specification independent of the absolute energy of the primary colors, these values should be normalized so that T_k = 1, k = 1, 2, 3, for a reference white color (equal energy in all wavelengths) with unit energy. When we use an RGB primary, the tristimulus values are usually denoted by R, G, and B.

Chromaticity Values: The above tristimulus representation mixes the luminance and chrominance attributes of a color. To measure only the chrominance information (i.e., the hue and saturation) of a light, the chromaticity coordinates are defined as

    t_k = \frac{T_k}{T_1 + T_2 + T_3}, \quad k = 1, 2, 3.        (1.1.5)

Since t_1 + t_2 + t_3 = 1, two chromaticity values are sufficient to specify the chrominance of a color.
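As a minimal illustration of Eq. (1.1.5), the following sketch computes chromaticity coordinates from tristimulus values; the sample numbers are arbitrary:

```python
def chromaticity(T1, T2, T3):
    """Chromaticity coordinates of Eq. (1.1.5): t_k = T_k / (T1 + T2 + T3).

    Since t1 + t2 + t3 = 1, the pair (t1, t2) suffices to specify the
    chrominance; the absolute energy of the color is divided out.
    """
    s = T1 + T2 + T3
    return T1 / s, T2 / s, T3 / s

# Scaling a color's tristimulus values changes its luminance but not its
# chromaticity: both calls below return (0.4, 0.4, 0.2).
print(chromaticity(2.0, 2.0, 1.0))
print(chromaticity(4.0, 4.0, 2.0))
```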

Obviously, the color value of an imaged point depends on the primary colors used. To standardize color description and specification, several standard primary color systems have been specified. For example, the CIE,^2 an international body of color scientists, defined a CIE RGB primary system, which consists of colors at 700 (R_0), 546.1 (G_0), and 435.8 (B_0) nm.

^2 CIE stands for Commission Internationale de l'Eclairage or, in English, the International Commission on Illumination.

Color Coordinate Conversion One can convert the color values based on one set of primaries to the color values for another set of primaries. Conversion of the (R,G,B) coordinate to the (C,M,Y) coordinate is, for example, often required for printing color images stored in the (R,G,B) coordinate. Given the tristimulus representation of one primary set in terms of another primary, one can determine the conversion matrix between the two color coordinates. The principle of color conversion and the derivation of the conversion matrix between two sets of color primaries can be found in [9].

1.1.5 Color Specification by Luminance and Chrominance Attributes

The RGB primary commonly used for color display mixes the luminance and chrominance attributes of a light. In many applications, it is desirable to describe a color in terms of its luminance and chrominance content separately, to enable more efficient processing and transmission of color signals. Toward this goal, various three-component color coordinates have been developed, in which one component reflects the luminance and the other two collectively characterize hue and saturation. One such coordinate is the CIE XYZ primary, in which Y directly measures the luminance intensity. The (X, Y, Z) values in this coordinate are related to the (R, G, B) values in the CIE RGB coordinate by [9]

    \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} =
    \begin{bmatrix} 2.365 & -0.515 & 0.005 \\ -0.897 & 1.426 & -0.014 \\ -0.468 & 0.089 & 1.009 \end{bmatrix}
    \begin{bmatrix} R \\ G \\ B \end{bmatrix}.        (1.1.6)
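The conversion in Eq. (1.1.6) is a plain matrix-vector product, as the following sketch shows. The matrix entries are taken from the equation above; inverting the same matrix gives the reverse conversion:

```python
import numpy as np

# CIE RGB -> XYZ conversion of Eq. (1.1.6), written as a matrix product.
# Any other primary set would use its own conversion matrix, derived as
# described in Sec. 1.1.4.
M = np.array([[ 2.365, -0.515,  0.005],
              [-0.897,  1.426, -0.014],
              [-0.468,  0.089,  1.009]])

def rgb_to_xyz(rgb):
    return M @ np.asarray(rgb, dtype=float)

def xyz_to_rgb(xyz):
    # The reverse conversion simply uses the inverse matrix.
    return np.linalg.solve(M, np.asarray(xyz, dtype=float))

xyz = rgb_to_xyz([1.0, 0.0, 0.0])  # XYZ coordinates of the pure red primary
print(xyz, xyz_to_rgb(xyz))        # round-trip recovers (1, 0, 0)
```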


In addition to separating the luminance and chrominance information, another advantage of the CIE XYZ system is that almost all visible colors can be specified with non-negative tristimulus values, which is a very desirable feature. The problem is that the X, Y, Z colors so defined are not realizable by actual color stimuli. As such, the XYZ primary is not directly used for color production; rather, it is mainly introduced for defining other primaries and for the numerical specification of color. As will be seen later, the color coordinates used for the transmission of color TV signals, such as YIQ and YUV, are all derived from the XYZ coordinate.

There are other color representations in which the hue and saturation of a color are explicitly specified, in addition to the luminance. One example is the HSI coordinate, where H stands for hue, S for saturation, and I for intensity (equivalent to luminance).^3 Although this color coordinate clearly separates the different attributes of a light, it is nonlinearly related to the tristimulus values and is difficult to compute. The book by Gonzalez has a comprehensive coverage of various color coordinates and their conversions [4].

^3 The HSI coordinate is also known as HSV, where V stands for the "value" of the intensity.

    1.2 Video Capture and Display

    1.2.1 Principle of Color Video Imaging

Having explained what light is and how it is perceived and characterized, we are now in a position to understand the meaning of a video signal. In short, a video records the emitted and/or reflected light intensity, i.e., C(X, t, λ), from the objects in a scene that is observed by a viewing system (a human eye or a camera). In general, this intensity changes both in time and in space. Here, we assume that there are some illuminating light sources in the scene; otherwise, there will be no emitted or reflected light and the image will be totally dark. When observed by a camera, only those wavelengths to which the camera is sensitive are visible. Let the spectral absorption function of the camera be denoted by a_c(\lambda); then the light intensity distribution in the 3D world that is "visible" to the camera is

    \bar{\psi}(X, t) = \int_0^{\infty} C(X, t, \lambda)\, a_c(\lambda)\, d\lambda.        (1.2.1)

The image function captured by the camera at any time t is the projection of the light distribution in the 3D scene onto a 2D image plane. Let P(\cdot) represent the camera projection operator, so that the projected 2D position of the 3D point X is given by x = P(X). Furthermore, let P^{-1}(\cdot) denote the inverse projection operator, so that X = P^{-1}(x) specifies the 3D position associated with a 2D point x. Then the projected image is related to the 3D image by

    \psi(P(X), t) = \bar{\psi}(X, t) \quad \text{or} \quad \psi(x, t) = \bar{\psi}(P^{-1}(x), t).        (1.2.2)

The function \psi(x, t) is what is known as a video signal. We can see that it describes the radiant intensity at the 3D position X that is projected onto x in the image


plane at time t. In general, the video signal has a finite spatial and temporal range. The spatial range depends on the camera viewing area, while the temporal range depends on the duration over which the video is captured. A point in the image plane is called a pixel (meaning picture element) or simply pel.^4 For most camera systems, the projection operator P(\cdot) can be approximated by a perspective projection. This is discussed in more detail in Sec. 5.1.

^4 Strictly speaking, the notion of a pixel or pel is only defined in digital imagery, in which each image or frame in a video is represented by a finite 2D array of pixels.

If the camera absorption function is the same as the relative luminous efficiency function of the human being, i.e., a_c(\lambda) = a_y(\lambda), then a luminance image is formed. If the absorption function is non-zero over a narrow band, then a monochrome (or monotone) image is formed. To perceive all visible colors, according to the trichromatic color vision theory (see Sec. 1.1.2), three sensors are needed, each with a frequency response similar to the color matching function for a selected primary color. As described before, most color cameras use red, green, and blue sensors for color acquisition.

If the camera has only one luminance sensor, \psi(x, t) is a scalar function that represents the luminance of the projected light. In this book, we use the word gray-scale to refer to such a video. The term black-and-white will be used strictly to describe an image that has only two colors: black and white. On the other hand, if the camera has three separate sensors, each tuned to a chosen primary color, the signal is a vector function that contains three color values at every point. Instead of specifying these color values directly, one can use other color coordinates (each consisting of three values) to characterize the light, as explained in the previous section.

Note that for special purposes, one may use sensors that work in a frequency range that is invisible to the human being. For example, in X-ray imaging, the sensor is sensitive to the spectral range of the X-ray. Likewise, an infrared camera is sensitive to the infrared range, and can function under very low ambient light. These cameras can "see" things that cannot be perceived by the human eye. Yet another example is the range camera, in which the sensor emits a laser beam and measures the time it takes for the beam to reach an object and then be reflected back to the sensor. Because the round-trip time is proportional to the distance between the sensor and the object surface, the image intensity at any point in a range image describes the distance, or range, of its corresponding 3D point from the camera.

    1.2.2 Video Cameras

All the analog cameras of today capture a video in a frame-by-frame manner, with a certain time spacing between the frames. Some cameras (e.g., TV cameras and consumer video camcorders) acquire a frame by scanning consecutive lines with a certain line spacing. Similarly, all display devices present a video as a consecutive set of frames, and with TV monitors, the scan lines are played back sequentially as separate lines. Such capture and display mechanisms are designed to take advantage of the fact that the HVS cannot perceive very high frequency changes in time and space. This property of the HVS will be discussed more extensively in Sec. 2.4.

There are basically two types of video imagers: (1) tube-based imagers such as vidicons, plumbicons, or orthicons, and (2) solid-state sensors such as charge-coupled devices (CCDs). The lens of a camera focuses the image of a scene onto the photosensitive surface of the camera's imager, which converts optical signals into electrical signals. The photosensitive surface of a tube imager is typically scanned line by line (known as raster scan) with an electron beam or other electronic means, and the scanned lines in each frame are then converted into an electrical signal that represents variations of light intensity as variations in voltage. Different lines are therefore captured at slightly different times in a continuous manner. With progressive scan, the electron beam scans every line continuously, while with interlaced scan, the beam scans every other line in one half of the frame time (a field) and then scans the other half of the lines. We will discuss raster scan in more detail in Sec. 1.3. With a CCD camera, the photosensitive surface comprises a 2D array of sensors, each corresponding to one pixel, and the optical signal reaching each sensor is converted to an electronic signal. The sensor values captured in each frame time are first stored in a buffer and then read out sequentially, one line at a time, to form a raster signal. Unlike with tube-based cameras, all the read-out values in the same frame are captured at the same time. With an interlaced scan camera, alternate lines are read out in each field.

To capture color, there are usually three types of photosensitive surfaces or CCD sensors, each with a frequency response that is determined by the color matching function of the chosen primary color, as described previously in Sec. 1.1.3. To reduce cost, most consumer cameras use a single CCD chip for color imaging. This is accomplished by dividing the sensor area for each pixel into three or four sub-areas, each sensitive to a different primary color. The three captured color signals can either be converted to one luminance signal and two chrominance signals and sent out as a component color video, or be multiplexed into a composite signal. This subject is explained further in Sec. 1.2.4.

Many cameras today are CCD-based, because they can be made much smaller and lighter than tube-based cameras for the same spatial resolution. Advances in CCD technology have made it possible to capture a very high resolution image array on a very small chip. For example, 1/3-in CCDs with 380K pixels are commonly found in consumer camcorders, whereas a 2/3-in CCD with 2 million pixels has been developed for HDTV. Tube-based cameras are more bulky and costly, and are only used in special applications, such as those requiring very high resolution or high sensitivity under low ambient light. In addition to the circuitry for color imaging, most cameras also implement color coordinate conversion (from RGB to luminance and chrominance) and compositing of the luminance and chrominance signals. For digital output, analog-to-digital (A/D) conversion is also incorporated. Figure 1.2 shows the typical processing involved in a professional video camera. The camera provides outputs in both digital and analog form and, in the analog case, in both component and composite formats. To improve the image quality, digital processing is introduced within the camera. For an excellent exposition of video camera and display technologies, see [6].

[Figure 1.2. Schematic block diagram of a professional color video camera. From [6, Fig. 7(a)].]

    1.2.3 Video Display

To display a video, the most common device is the cathode ray tube (CRT). In a CRT monitor, an electron gun emits an electron beam across the screen line by line, exciting phosphors with intensities proportional to the intensity of the video signal at the corresponding locations. To display a color image, three beams are emitted by three separate guns, exciting red, green, and blue phosphors with the desired intensity combination at each location. To be more precise, each color pixel consists of three elements arranged in a small triangle, known as a triad.

The CRT can produce an image with a very large dynamic range, so that the displayed image can be very bright, sufficient for viewing in daylight or from a distance. However, the thickness of a CRT needs to be about the same as the width of the screen for the electrons to reach the sides of the screen. A large-screen monitor is thus too bulky, and unsuitable for applications requiring thin and portable devices. To circumvent this problem, various flat panel displays have been developed. One popular device is the liquid crystal display (LCD). The principal idea behind the LCD is to change the optical properties, and consequently the brightness/color, of the liquid crystal by an applied electric field. The electric field can be generated/adapted either by an array of transistors, as in LCDs using active-matrix thin-film transistors (TFTs), or by using plasma. The plasma technology eliminates the need for TFTs and makes large-screen LCDs possible. There are also new designs for flat CRTs. A more comprehensive description of video display technologies can be found in [6].

The above raster scan and display mechanisms apply only to TV cameras and displays. With movie cameras, the color pattern seen by the camera at any frame instant is completely recorded on the film. For display, consecutive recorded frames are played back using an analog optical projection system.

    1.2.4 Composite vs. Component Video

Ideally, a color video should be specified by three functions or signals, each describing one color component, in either a tristimulus color representation or a luminance-chrominance representation. A video in this format is known as component video. Mainly for historical reasons, various composite video formats also exist, wherein the three color signals are multiplexed into a single signal. These composite formats were invented when the color TV system was first developed and there was a need to transmit the color TV signal in such a way that a black-and-white TV set could extract the luminance component from it. The construction of a composite signal relies on the property that the chrominance signals have a significantly smaller bandwidth than the luminance component. By modulating each chrominance component to a frequency that is at the high end of the luminance component, and adding the resulting modulated chrominance signals and the original luminance signal together, one creates a composite signal that contains both the luminance and chrominance information. To display a composite video signal on a color monitor, a filter is used to separate the modulated chrominance signals from the luminance signal. The resulting luminance and chrominance components are then converted to red, green, and blue color components. With a gray-scale monitor, the luminance signal alone is extracted and displayed directly.

All present analog TV systems transmit color TV signals in a composite format. The composite format is also used for video storage on some analog tapes (such as the VHS tape). In addition to being compatible with a gray-scale signal, the composite format eliminates the need to synchronize the different color components when processing a color video. A composite signal also has a bandwidth that is significantly lower than the sum of the bandwidths of the three component signals, and it can therefore be transmitted or stored more efficiently. These benefits are, however, achieved at the expense of video quality: there often exist noticeable artifacts caused by cross-talk between the color and luminance components.

As a compromise between data rate and video quality, S-video was invented, which consists of two components: the luminance component and a single chrominance component that is the multiplex of the two original chrominance signals. Many advanced consumer-level video cameras and displays enable the recording/display of video in the S-video format. The component format is used only in professional video equipment.

    1.2.5 Gamma Correction

We have said that the video frames captured by a camera reflect the color values of the imaged scene. In reality, the output signals from most cameras are not linearly related to the actual color values; rather, they are related in a non-linear form:^5

    v_c = B_c^{\gamma_c},        (1.2.3)

where B_c represents the actual light brightness and v_c the camera output voltage. The value of \gamma_c ranges from 1.0 for most CCD cameras to 1.7 for a vidicon camera [7]. Similarly, most display devices also suffer from such a non-linear relation between the input voltage value v_d and the displayed color intensity B_d, i.e.,

    B_d = v_d^{\gamma_d}.        (1.2.4)

CRT displays typically have a \gamma_d of 2.2 to 2.5 [7]. In order to present true colors, one has to apply an inverse power function to the camera output. Similarly, before sending real image values for display, one needs to pre-compensate for the gamma effect of the display device. These processes are known as gamma correction.

^5 A more precise relation is B_c = c v_c^{\gamma_c} + B_0, where c is a gain factor and B_0 is the cut-off level of light intensity. If we assume that the output voltage value is properly shifted and scaled, then the presented equation is valid.

In TV broadcasting, ideally, at the TV broadcaster side, the RGB values captured by the TV cameras should first be corrected based on the camera gamma and then converted to the color coordinates used for transmission (YIQ for NTSC, and YUV for PAL and SECAM). At the receiver side, the received YIQ or YUV values should first be converted to RGB values, and then compensated for the monitor gamma. In reality, however, in order to reduce the processing to be done in the millions of receivers, the broadcast video signals are pre-gamma-corrected in the RGB domain. Let v_c represent the R, G, or B signal captured by the camera; the gamma-corrected signal for display, v_d, is obtained by

    v_d = v_c^{1/(\gamma_c \gamma_d)}.        (1.2.5)

In most TV systems, a value of \gamma_c \gamma_d = 2.2 is used. This is based on the assumption that a CCD camera with \gamma_c = 1 and a CRT display with \gamma_d = 2.2 are used [7]. These gamma-corrected values are converted to the YIQ or YUV values for transmission. The receiver simply applies a color coordinate conversion to obtain the RGB values for display. Notice that this process applies display gamma correction before the conversion to the YIQ/YUV domain, which is not strictly correct. But the distortion is insignificant and not noticeable by average viewers [7].
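The following sketch illustrates the pre-correction of Eq. (1.2.5) and the display non-linearity of Eq. (1.2.4), assuming the common case gamma_c = 1 and gamma_d = 2.2 mentioned above; values are assumed to be normalized to [0, 1]:

```python
def gamma_correct(v_c, gamma_c=1.0, gamma_d=2.2):
    """Pre-correction of Eq. (1.2.5): v_d = v_c ** (1 / (gamma_c * gamma_d)).

    v_c is a camera R, G, or B value normalized to [0, 1]. With the common
    assumption of a CCD camera (gamma_c = 1) and a CRT display
    (gamma_d = 2.2), this is simply v_c ** (1 / 2.2).
    """
    return v_c ** (1.0 / (gamma_c * gamma_d))

def display_brightness(v_d, gamma_d=2.2):
    # Display non-linearity of Eq. (1.2.4): B_d = v_d ** gamma_d.
    return v_d ** gamma_d

# A mid-gray camera value: the display's power law undoes the
# pre-correction, so the rendered brightness matches the captured value.
v = 0.5
print(display_brightness(gamma_correct(v)))  # ~0.5
```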

    1.3 Analog Video Raster

As already described, the analog TV systems of today use raster scan for video capture and display. As this is the most popular analog video format, in this section we describe the mechanism of raster scan in more detail, including both progressive and interlaced scan. As an example, we also explain the video formats used in various analog TV systems.



[Figure 1.3. Progressive (a) and interlaced (b) raster scan formats.]

    1.3.1 Progressive and Interlaced Scan

Progressive Scan In raster scan, a camera captures a video sequence by sampling it in both the temporal and vertical directions. The resulting signal is stored as a continuous one-dimensional (1D) waveform. As shown in Fig. 1.3(a), the electron or optical beam of an analog video camera continuously scans the imaged region from top to bottom and then back to the top. The resulting signal consists of a series of frames separated by a regular frame interval, Δt, and each frame consists of a consecutive set of horizontal scan lines, separated by a regular vertical spacing. Each scan line is actually slightly tilted downwards, and the bottom line is scanned about one frame interval later than the top line of the same frame. However, for analysis purposes, we often assume that all the lines in a frame are sampled at the same time and that each line is perfectly horizontal. The intensity values captured along contiguous scan lines over consecutive frames form a 1D analog waveform, known as a raster scan. With a color camera, three 1D rasters are converted into a composite signal, which is a color raster.

Interlaced Scan The raster scan format described above is more accurately known as progressive scan (also known as sequential or non-interlaced scan), in which the horizontal lines are scanned successively. In the interlaced scan, each frame is scanned in two fields, and each field contains half the number of lines in a frame. The time interval between two fields, i.e., the field interval, is half the frame interval, while the line spacing in a field is twice that desired for a frame. The scan lines in two successive fields are shifted by half the line spacing in each field. This is illustrated in Fig. 1.3(b). Following the terminology used in the MPEG standard, we call the field containing the first line and the following alternating lines in a frame the top field, and the field containing the second line and the following alternating lines the bottom field.^6

^6 A more conventional definition is to call the field that contains all the even lines the even field, and the field containing all the odd lines the odd field. This definition depends on whether the first line is numbered 0 or 1, and is therefore ambiguous.

In certain systems, the top field is sampled first, while in other systems, the bottom field is sampled first. It is important to remember that two adjacent lines in a frame are separated in time by the field interval. This fact leads to the infamous zig-zag artifacts in an interlaced video that contains fast-moving objects with vertical edges. The motivation for using the interlaced scan is to trade off vertical resolution for an enhanced temporal resolution, given the total number of lines that can be recorded within a given time. A more thorough comparison of the progressive and interlaced scans in terms of their sampling efficiency is given later in Sec. 3.3.2.

The interlaced scan introduced above should more precisely be called 2:1 interlace. In general, one can divide a frame into K ≥ 2 fields, each separated in time by Δt/K. This is known as K:1 interlace, and K is called the interlace order. In a digital video where each line is represented by discrete samples, the samples on the same line may also appear in different fields. For example, the samples in a frame may be divided into two fields using a checker-board pattern. The most general definition of the interlace order is the ratio of the number of samples in a frame to the number of samples in each field.

    1.3.2 Characterization of a Video Raster

A raster is described by two basic parameters: the frame rate (frames/second, fps, or Hz), denoted by f_{s,t}, and the line number (lines/frame or lines/picture-height), denoted by f_{s,y}. These two parameters define the temporal and vertical sampling rates of a raster scan. From these parameters, one can derive another important parameter, the line rate (lines/second), denoted by f_l = f_{s,t} f_{s,y}.^7 We can also derive the temporal sampling interval or frame interval, Δt = 1/f_{s,t}, the vertical sampling interval or line spacing, Δy = picture-height/f_{s,y}, and the line interval, T_l = 1/f_l = Δt/f_{s,y}, which is the time used to scan one line. Note that the line interval T_l includes the time for the sensor to move from the end of one line to the beginning of the next line, which is known as the horizontal retrace time or simply the horizontal retrace, denoted by T_h. The actual scanning time for a line is T'_l = T_l - T_h. Similarly, the frame interval Δt includes the time for the sensor to move from the end of the bottom line in a frame to the beginning of the top line of the next frame, which is called the vertical retrace time or simply the vertical retrace, denoted by T_v. The number of lines that are actually scanned in a frame time, known as the active lines, is f'_{s,y} = (Δt - T_v)/T_l = f_{s,y} - T_v/T_l. Normally, T_v is chosen to be an integer multiple of T_l.

^7 The frame rate and line rate are also known as the vertical sweep frequency and the horizontal sweep frequency, respectively.
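The following sketch collects the relations above into one routine. The NTSC numbers used in the example are the ones given later in Sec. 1.4.1 (a nominal 30 Hz frame rate, 525 lines, T_h = 10 μs, and T_v = 1333 μs per field), so the printed values can be checked against that discussion:

```python
def raster_parameters(frame_rate, line_number, T_h, T_v):
    """Derived raster-scan quantities from Sec. 1.3.2 (a sketch).

    T_h is the horizontal retrace time per line, and T_v the total vertical
    retrace time per frame, both in seconds.
    """
    f_l = frame_rate * line_number          # line rate f_l = f_{s,t} * f_{s,y}
    T_l = 1.0 / f_l                         # line interval
    T_l_active = T_l - T_h                  # actual scanning time per line
    active_lines = line_number - T_v / T_l  # f'_{s,y} = f_{s,y} - T_v / T_l
    return f_l, T_l, T_l_active, active_lines

# NTSC example: nominal 30 Hz frames, 525 lines, 10 us horizontal retrace,
# and a vertical retrace of 1333 us per field (two fields per frame).
f_l, T_l, T_l_active, active = raster_parameters(
    30.0, 525, T_h=10e-6, T_v=2 * 1333e-6)
print(f_l)               # 15750 lines/s
print(T_l * 1e6)         # ~63.5 us per line
print(T_l_active * 1e6)  # ~53.5 us of actual scanning per line
print(active)            # ~483 active lines per frame
```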

A typical waveform of an interlaced raster signal is shown in Fig. 1.4(a). Notice that portions of the signal during the horizontal and vertical retrace periods are held at a constant level below the level corresponding to black. These are called sync signals. The display devices start the retrace process upon detecting these sync signals.

[Figure 1.4. A typical interlaced raster scan: (a) waveform, showing the horizontal retrace within a field and the vertical retraces between successive fields, together with the blanking, black, and white levels; (b) spectrum.]

Figure 1.4(b) shows the spectrum of a typical raster signal. It can be seen that the spectrum contains peaks at the line rate f_l and its harmonics. This is because adjacent scan lines are very similar, so that the signal is nearly periodic with a period of T_l. The width of each harmonic lobe is determined by the maximum vertical frequency in a frame. The overall bandwidth of the signal is determined by the maximum horizontal spatial frequency.

The frame rate is one of the most important parameters determining the quality of a video raster. For example, the TV industry uses an interlaced scan with a frame rate of 25-30 Hz, with an effective temporal refresh rate of 50-60 Hz, while the movie industry uses a frame rate of 24 Hz.^8 On the other hand, in the computer industry, 72 Hz has become a de facto standard. The line number used in a raster scan is also a key factor affecting the video quality. Analog TVs use a line number of about 500-600, while computer displays use a much higher line number (e.g., the SVGA display has 1024 lines). These frame rates and line numbers are determined based on the visual temporal and spatial thresholds under different viewing environments, as described later in Sec. 2.4. Higher frame rates and line rates are necessary in computer applications to accommodate a significantly shorter viewing distance and higher-frequency content (line graphics and text) in the displayed material.

^8 To reduce the visibility of flicker, a rotating blade is used to create an illusion of 72 frames/second.

[Figure 1.5. Analog color TV systems: video production, transmission, and reception. At the transmitter, RGB is converted to YC1C2, and the luminance, chrominance, and audio signals are multiplexed and modulated; the receiver demodulates, demultiplexes, and converts YC1C2 back to RGB.]

The width-to-height ratio of a video frame is known as the image aspect ratio (IAR). For example, an IAR of 4:3 is used in standard-definition TV (SDTV) and computer displays, while a higher IAR is used in wide-screen movies (up to 2.2) and in HDTV (IAR = 16:9) for a more dramatic visual sensation.

    1.4 Analog Color Television Systems

In this section, we briefly describe the analog color TV systems, which are a good example of many of the concepts we have discussed so far. One major constraint in designing a color TV system is that it must be compatible with the preceding monochrome TV system. First, the overall bandwidth of a color TV signal has to fit within that allocated for a monochrome TV signal (6 MHz per channel in the U.S.). Second, all the color signals must be multiplexed into a single composite signal in such a way that a monochrome TV receiver can extract the luminance signal from it. The successful design of color TV systems satisfying the above constraints is one of the great technological innovations of the 20th century. Figure 1.5 illustrates the main processing steps involved in color TV signal production, transmission, and reception. We briefly review each of these steps in the following.

There are three different systems worldwide: the NTSC system, used in North America as well as some other parts of Asia, including Japan and Taiwan; the PAL system, used in most of Western Europe and Asia, including China, and the Middle East countries; and the SECAM system, used in the former Soviet Union, Eastern Europe, France, as well as parts of the Middle East. We will compare these systems in terms of their spatial and temporal resolution, their color coordinates, and their multiplexing mechanisms. The materials presented here are mainly from [9, 10]. More complete coverage of color TV systems can be found in [5, 1].

    1.4.1 Spatial and Temporal Resolutions

All three color TV systems use the 2:1 interlaced scan mechanism described in Sec. 1.3 for capturing as well as displaying video. The NTSC system uses a field rate of 59.94 Hz and a line number of 525 lines/frame. The PAL and SECAM systems both use a field rate of 50 Hz and a line number of 625 lines/frame. These frame rates were chosen so as not to interfere with the standard electric power systems in the respective countries. They also turned out to be a good choice in that they match the critical flicker fusion frequency of the human visual system, as described later in Sec. 2.4. All systems have an IAR of 4:3. The parameters of the NTSC, PAL, and SECAM video signals are summarized in Table 1.1. For NTSC, the line interval is T_l = 1/(30 × 525) s = 63.5 μs. But the horizontal retrace takes T_h = 10 μs, so that the actual time for scanning each line is T'_l = 53.5 μs. The vertical retrace between adjacent fields takes T_v = 1333 μs, which is equivalent to the time for 21 scan lines per field. Therefore, the number of active lines is 525 - 42 = 483 per frame. The actual vertical retrace takes only the time to scan 9 horizontal lines; the remaining time (12 scan lines) is for broadcasters wishing to transmit additional data in the TV signal (e.g., closed captions, teletext, etc.).^9

^9 The number of active lines cited in different references varies from 480 to 495. The number here is calculated from the vertical blanking interval cited in [5].

    1.4.2 Color Coordinate

The color coordinate systems used in the three systems are different. For video capture and display, all three systems use an RGB primary, but with slightly different definitions of the spectra of the individual primary colors. For transmission of the video signal, in order to reduce the bandwidth requirement and to be compatible with the black-and-white TV systems, a luminance/chrominance coordinate is employed. In the following, we describe the color coordinates used in these systems.

The color coordinates used in the NTSC, PAL, and SECAM systems are all derived from the YUV coordinate used in PAL, which in turn originates from the XYZ coordinate. Based on the relation between the RGB primary and the XYZ primary, one can determine the Y value from the RGB values, which forms the luminance component. The two chrominance values, U and V, are proportional to the color differences B-Y and R-Y, respectively, scaled to have the desired range.


Table 1.1. Parameters of Analog Color TV Systems

Parameter                         NTSC              PAL              SECAM
Field rate (Hz)                   59.94             50               50
Line number/frame                 525               625              625
Line rate (lines/s)               15,750            15,625           15,625
Image aspect ratio                4:3               4:3              4:3
Color coordinate                  YIQ               YUV              YDbDr
Luminance bandwidth (MHz)         4.2               5.0, 5.5         6.0
Chrominance bandwidth (MHz)       1.5 (I), 0.5 (Q)  1.3 (U,V)        1.0 (U,V)
Color subcarrier (MHz)            3.58              4.43             4.25 (Db), 4.41 (Dr)
Color modulation                  QAM               QAM              FM
Audio subcarrier (MHz)            4.5               5.5, 6.0         6.5
Composite signal bandwidth (MHz)  6.0               8.0, 8.5         8.0

Specifically, the YUV coordinate is related to the PAL RGB primary values by [9]

    \begin{bmatrix} Y \\ U \\ V \end{bmatrix} =
    \begin{bmatrix} 0.299 & 0.587 & 0.114 \\ -0.147 & -0.289 & 0.436 \\ 0.615 & -0.515 & -0.100 \end{bmatrix}
    \begin{bmatrix} \tilde{R} \\ \tilde{G} \\ \tilde{B} \end{bmatrix}        (1.4.1)

and

    \begin{bmatrix} \tilde{R} \\ \tilde{G} \\ \tilde{B} \end{bmatrix} =
    \begin{bmatrix} 1.000 & 0.000 & 1.140 \\ 1.000 & -0.395 & -0.581 \\ 1.000 & 2.032 & 0.001 \end{bmatrix}
    \begin{bmatrix} Y \\ U \\ V \end{bmatrix},        (1.4.2)

where \tilde{R}, \tilde{G}, \tilde{B} are normalized gamma-corrected values, so that (\tilde{R}, \tilde{G}, \tilde{B}) = (1, 1, 1) corresponds to the reference white color defined in the PAL/SECAM system.
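As a sketch, the conversion of Eq. (1.4.1) is again a matrix-vector product. Rather than retyping the rounded inverse of Eq. (1.4.2), the reverse direction here simply inverts the forward matrix numerically:

```python
import numpy as np

# RGB <-> YUV conversion of Eqs. (1.4.1)-(1.4.2), applied to normalized,
# gamma-corrected values. The forward matrix is from Eq. (1.4.1); inverting
# it numerically avoids rounding disagreements with the printed inverse.
M_YUV = np.array([[ 0.299,  0.587,  0.114],
                  [-0.147, -0.289,  0.436],
                  [ 0.615, -0.515, -0.100]])

def rgb_to_yuv(rgb):
    return M_YUV @ np.asarray(rgb, dtype=float)

def yuv_to_rgb(yuv):
    return np.linalg.solve(M_YUV, np.asarray(yuv, dtype=float))

print(rgb_to_yuv([1.0, 1.0, 1.0]))  # reference white: Y = 1, U = V = 0
print(rgb_to_yuv([0.0, 0.0, 1.0]))  # pure blue: large positive U
```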

The NTSC system uses the YIQ coordinate, where the I and Q components are rotated versions (by 33°) of the U and V components. This rotation serves to make I correspond to colors in the orange-to-cyan range, and Q to the green-to-purple range. Because the human eye is less sensitive to changes in the green-to-purple range than to those in the yellow-to-cyan range, the Q component can be transmitted with less bandwidth than the I component [10]. This point will be elaborated further in Sec. 1.4.3. The YIQ values are related to the NTSC RGB system by

    \begin{bmatrix} Y \\ I \\ Q \end{bmatrix} =
    \begin{bmatrix} 0.299 & 0.587 & 0.114 \\ 0.596 & -0.275 & -0.321 \\ 0.212 & -0.523 & 0.311 \end{bmatrix}
    \begin{bmatrix} \tilde{R} \\ \tilde{G} \\ \tilde{B} \end{bmatrix}        (1.4.3)

and

    \begin{bmatrix} \tilde{R} \\ \tilde{G} \\ \tilde{B} \end{bmatrix} =
    \begin{bmatrix} 1.0 & 0.956 & 0.620 \\ 1.0 & -0.272 & -0.647 \\ 1.0 & -1.108 & 1.70 \end{bmatrix}
    \begin{bmatrix} Y \\ I \\ Q \end{bmatrix}.        (1.4.4)

With the YIQ coordinate, \tan^{-1}(Q/I) approximates the hue, and \sqrt{I^2 + Q^2}/Y reflects the saturation. In an NTSC composite video, the I and Q components are multiplexed into one signal, so that the phase of the modulated signal is \tan^{-1}(Q/I), whereas its magnitude equals \sqrt{I^2 + Q^2}. Because transmission errors affect the magnitude more than the phase, the hue information is better retained than the saturation in a broadcast TV signal. This is desirable, as the human eye is more sensitive to the color hue. The names I and Q come from the fact that the I signal is In-phase with the color modulation frequency, whereas the Q signal is in Quadrature (i.e., 1/4 of the way around the circle, or 90 degrees out of phase) with the modulation frequency. The color multiplexing scheme is explained later in Sec. 1.4.4.

Note that because the RGB primary set and the reference white color used in the NTSC system are different from those in the PAL/SECAM system, the same set of RGB values corresponds to slightly different colors in these two systems.

The SECAM system uses the YDbDr coordinate, where the Db and Dr values are related to the U and V values by [7]

    D_b = 3.059U, \quad D_r = -2.169V.        (1.4.5)

    1.4.3 Signal Bandwidth

The bandwidth of a video raster can be estimated from its line rate. First of all, the maximum vertical frequency results when white and black lines alternate in a raster frame, and it is equal to f'_{s,y}/2 cycles/picture-height, where f'_{s,y} represents the number of active lines. The maximum frequency that can be rendered properly by a system is usually lower than this theoretical limit. The attenuation factor is known as the Kell factor, denoted by K, which depends on the camera and display aperture functions. Typical TV cameras have a Kell factor of K = 0.7. The maximum vertical frequency that can be accommodated is related to the Kell factor by

    f_{v,\max} = K f'_{s,y}/2 \ \text{(cycles/picture-height)}.        (1.4.6)

Assuming that the maximum horizontal frequency is identical to the vertical one for the same spatial distance, we have f_{h,\max} = f_{v,\max} \cdot \text{IAR} (cycles/picture-width). Because each line is scanned in T'_l seconds, the maximum frequency in the 1D raster signal is

    f_{\max} = f_{h,\max}/T'_l = \text{IAR} \cdot K f'_{s,y}/(2 T'_l) \ \text{Hz}.        (1.4.7)

For the NTSC video format, we have f'_{s,y} = 483 and T'_l = 53.5 μs. Consequently, the maximum frequency of the luminance component is 4.2 megacycles/second, or 4.2 MHz. Although the potential bandwidth of the chrominance signals could be just as high, it is usually significantly lower than that of the luminance signal. Furthermore, the HVS has been found to have a much lower threshold for observing changes in chrominance. Because of this, the two chrominance signals are typically bandlimited to much narrower bandwidths. As mentioned previously, the human eye is more sensitive to spatial variations in the orange-to-cyan color range, represented by the I component, than in the green-to-purple range, represented by the Q component. Therefore, the I component is bandlimited to 1.5 MHz, and the Q component to 0.5 MHz.^10 Table 1.1 lists the signal bandwidths of the different TV systems.

^10 In [9], the bandwidths of I and Q are cited as 1.3 and 0.6 MHz, respectively.
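The 4.2 MHz figure follows directly from Eq. (1.4.7); the short sketch below reproduces the arithmetic with the NTSC parameters just quoted:

```python
# Luminance bandwidth estimate of Eq. (1.4.7) for NTSC, as a quick check of
# the 4.2 MHz figure quoted above.
IAR = 4.0 / 3.0        # image aspect ratio
K = 0.7                # Kell factor of a typical TV camera
active_lines = 483     # f'_{s,y} for NTSC
T_l_active = 53.5e-6   # actual scanning time per line, in seconds

f_v_max = K * active_lines / 2.0  # cycles/picture-height, Eq. (1.4.6)
f_h_max = IAR * f_v_max           # cycles/picture-width
f_max = f_h_max / T_l_active      # Hz, Eq. (1.4.7)
print(f_max / 1e6)                # ~4.2 MHz
```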

    1.4.4 Multiplexing of Luminance, Chrominance, and Audio

    In order to make the color TV signal compatible with the monochrome TV system,

    all three analog TV systems use the composite video format, in which the three

    color components as well as the audio component are multiplexed into one signal.

Here, we briefly describe the mechanism used by NTSC. First, the two chrominance components I(t) and Q(t) are combined into a single signal C(t) using quadrature amplitude modulation (QAM). The color sub-carrier frequency f_c is chosen to be an odd multiple of half of the line rate, f_c = 455 f_l / 2 = 3.58 MHz. This is chosen to satisfy the following criteria: i) it should be high enough, where the luminance component has very low energy; ii) it should fall midway between two line-rate harmonics, since the luminance energy concentrates at those harmonics; and iii) it should be sufficiently far away from the audio sub-carrier, which is set at 4.5 MHz (286 f_l), the same as in monochrome TV. Figure 1.6(a) shows how the harmonic peaks of the luminance and chrominance signals interleave with each other. Finally, the audio signal is frequency modulated (FM) using an audio sub-carrier frequency of f_a = 4.5 MHz and added to the composite video signal, to form the final multiplexed signal.

Because the I component has a bandwidth of 1.5 MHz, the modulated chrominance signal has a maximum frequency of up to 5.08 MHz. To avoid interference with the audio signal, the chrominance signal is bandlimited in the upper sideband to 0.6 MHz. Notice that the lower sideband of the I signal will run into the upper part of the Y signal. For this reason, the I signal is sometimes bandlimited to 0.6 MHz on both sidebands. Finally, the entire composite signal, with a bandwidth of about 4.75 MHz, is modulated onto a picture carrier frequency, f_p, using vestigial sideband modulation (VSB), so that the lower sideband extends only to 1.25 MHz below f_p and the overall signal occupies 6 MHz. This process is the same as in the monochrome TV system. The picture carrier f_p depends on the broadcasting channel. Figure 1.6(b) illustrates the spectral composition of the NTSC composite signal. The signal bandwidths and modulation methods of the three color TV systems are summarized in Table 1.1.
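To make the QAM step concrete, here is a minimal Python sketch of forming the modulated chrominance signal C(t) and the composite baseband signal. The stand-in Y, I, and Q waveforms, the simulation sampling rate, and the cosine/sine phase assignment are our own illustrative assumptions; audio FM and VSB modulation are omitted.

    import numpy as np

    # Minimal sketch of NTSC-style chrominance QAM (audio and VSB omitted).
    fc = 3.58e6                      # color sub-carrier frequency, Hz
    fs = 40e6                        # simulation sampling rate, Hz
    t = np.arange(0, 1e-4, 1 / fs)  # 100 microseconds of signal

    # Stand-in baseband components (assumed already suitably bandlimited).
    Y = 0.5 + 0.2 * np.sin(2 * np.pi * 1e5 * t)   # luminance
    I = 0.10 * np.sin(2 * np.pi * 5e4 * t)        # in-phase chrominance
    Q = 0.05 * np.sin(2 * np.pi * 3e4 * t)        # quadrature chrominance

    # QAM: I rides on the cosine, Q on the sine of the same sub-carrier.
    C = I * np.cos(2 * np.pi * fc * t) + Q * np.sin(2 * np.pi * fc * t)

    composite = Y + C   # luminance plus modulated chrominance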

At a television receiver, the composite signal first has to be demodulated to the baseband, and then the audio and the three components of the video signal must be demultiplexed.

10. In [9], the bandwidths of I and Q are cited as 1.3 and 0.6 MHz, respectively.



Figure 1.6. Multiplexing of luminance, chrominance, and audio signals in the NTSC system. (a) The interleaving between luminance and chrominance harmonics; (b) the overall spectral composition of the NTSC composite signal.


Table 1.2. Analog Video Tape Formats

    Video Format   Tape Format   Horizontal Lines   Luminance Bandwidth   Applications
    composite      VHS, 8mm      240                3 MHz                 Consumer
    composite      Umatic SP     330                4 MHz                 Professional
    S-video        S-VHS, Hi8    400                5.0 MHz               High-quality consumer
    component      Betacam SP    480                4.5 MHz               Professional

To separate the video and audio signals, a low-pass filter can be used. This process is the same in a monochrome TV as in a color TV. To further separate the chrominance signal from the luminance signal, ideally a comb filter should be used, to take advantage of the interleaving of the harmonic frequencies in these two signals. Most high-end TV sets implement a digital comb filter with null frequencies at the harmonics corresponding to the chrominance component to accomplish this. Low-end TV sets, however, use a simple RC circuit to perform low-pass filtering with a cut-off frequency at 3 MHz, which retains the unwanted I component in the extracted luminance signal, and vice versa. This leads to cross-color and cross-luminance artifacts. Cross-color refers to the spurious colors created by high-frequency luminance content that is close to the color sub-carrier frequency. Cross-luminance refers to false high-frequency edge patterns caused by the modulated chrominance information. For a good illustration of the effects of different filters, see [2]. After extracting the chrominance signal, a corresponding color-demodulation method is used to separate the two chrominance components. Finally, the three color components are converted to the RGB coordinate for display.
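The comb-filter idea can be sketched in a few lines. Because f_c is an odd multiple of half the line rate, the chrominance reverses phase from one line to the next, while the luminance harmonics stay in phase; hence summing or differencing line-delayed samples separates the two. The sketch below is a crude one-line comb operating on a digitized composite signal with an integer number of samples per line (our own simplifying assumptions), not a production Y/C separator.

    import numpy as np

    def comb_separate(composite, samples_per_line):
        """Crude one-line comb filter for Y/C separation.

        Averaging a sample with the one a full line earlier reinforces
        luminance (in phase from line to line) and cancels chrominance
        (which flips phase from line to line); differencing does the
        opposite.
        """
        delayed = np.roll(composite, samples_per_line)
        luma = 0.5 * (composite + delayed)     # chrominance cancels
        chroma = 0.5 * (composite - delayed)   # luminance cancels
        return luma, chroma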

    1.4.5 Analog Video Recording

Along with the development of analog TV systems, various video tape recording (VTR) technologies have been developed, to allow professional video production (recording and editing) as well as consumer-level recording (home video) and playback (VCR). Table 1.2 summarizes common video tape formats.

    1.5 Digital Video

    1.5.1 Notations

    A digital video can be obtained either by sampling a raster scan, or directly using a

    digital video camera. Presently all digital cameras use CCD sensors. As with analog

    cameras, a digital camera samples the imaged scene as discrete frames. Each frame


comprises output values from a CCD array, which is by nature discrete both horizontally and vertically. A digital video is defined by the frame rate, f_{s,t}, the line number, f_{s,y}, as well as the number of samples per line, f_{s,x}. From these, one can find the temporal sampling interval or frame interval, Δt = 1/f_{s,t}, the vertical sampling interval, Δy = picture-height/f_{s,y}, and the horizontal sampling interval, Δx = picture-width/f_{s,x}. In this book, we will use ψ(m, n, k) to represent a digital video, where the integer indices m and n are the column and row indices, and k is the frame number. The actual spatial and temporal locations corresponding to the integer indices are x = mΔx, y = nΔy, and t = kΔt. For convenience, we use the notation ψ(x, y, t) to describe a video signal in a general context, which could be either analog or digital. We will use ψ(m, n, k) only when specifically addressing digital video.

In addition to the above parameters, another important parameter of a digital video is the number of bits used to represent a pixel value (luminance only, or three color values), to be denoted by N_b. Conventionally, the luminance or each of the three color values is specified with 8 bits, or 256 levels. Therefore, N_b = 8 for a monochrome video, while N_b = 24 for a color video. The data rate, R, of a digital video is determined by R = f_{s,t} f_{s,x} f_{s,y} N_b, with a unit of bits/second (bps). Usually it is measured in kilobits/second (Kbps) or megabits/second (Mbps). In general, the spatial and temporal sampling rates can be different for the luminance and chrominance components of a digital video. In this case, N_b should reflect the equivalent number of bits used for each pixel at the luminance sampling resolution. For example, if the horizontal and vertical sampling rates for each chrominance component are both half of those for the luminance, then there are two chrominance samples for every four Y samples. If each sample is represented with 8 bits, the equivalent number of bits per sample at the Y resolution is (4 × 8 + 2 × 8)/4 = 12 bits.
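This bookkeeping is easy to put into code. The helper below (function names are ours) computes the equivalent N_b for a given chrominance subsampling pattern and the resulting raw data rate R:

    def equivalent_bits_per_pixel(bits_per_sample, chroma_samples_per_4y):
        # Equivalent N_b at the luminance resolution: 4 Y samples plus the
        # given number of chrominance samples (Cb + Cr combined) per 4 Y.
        return (4 + chroma_samples_per_4y) * bits_per_sample / 4

    def raw_data_rate(frame_rate, lines, samples_per_line, nb):
        # R = f_{s,t} * f_{s,y} * f_{s,x} * N_b, in bits/second.
        return frame_rate * lines * samples_per_line * nb

    # Example: 2:1 chroma subsampling in both directions (2 chroma per 4 Y):
    nb = equivalent_bits_per_pixel(8, 2)   # 12 bits, as computed above
    print(raw_data_rate(30, 480, 720, nb) / 1e6, "Mbps")  # about 124 Mbps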

When displaying a digital video on a monitor, each pixel is rendered as a rectangular region with the constant color specified for that pixel. The ratio of the width to the height of this rectangular area is known as the pixel aspect ratio (PAR). It is related to the IAR of the display area and the image dimensions by

    PAR = IAR × f_{s,y} / f_{s,x}.    (1.5.1)

For proper display of a digitized video, one must specify either the PAR or the IAR, along with f_{s,x} and f_{s,y}. The display device should conform to the PAR specified for the video (or derived from the specified IAR); otherwise, object shapes will be distorted. For example, a person will appear fatter and shorter if the display PAR is larger than the PAR specified for the video. In the computer industry, a PAR of 1.0 is normally used. In the TV industry, on the other hand, non-square pixels are used for historical reasons. The rationale behind this is explained in Sec. 1.5.2.


    1.5.2 ITU-R BT.601 Digital Video

Spatial Resolution of the BT.601 Signal  In an attempt to standardize the digital formats used to represent the different analog TV video signals with a quality equivalent to broadcast TV, the International Telecommunications Union - Radio Sector (ITU-R) developed the BT.601 recommendation [8]. The standard specifies digital video formats with both 4:3 and 16:9 aspect ratios. Here, we only discuss the version with aspect ratio 4:3.^11

To convert a raster scan to a digital video, one just needs to sample the 1D waveform. If a total of f_{s,x} samples are taken per line, the equivalent sampling rate is f_s = f_{s,x} f_{s,y} f_{s,t} = f_{s,x} f_l samples/second. In the BT.601 standard, the sampling rate f_s is chosen to satisfy two constraints: i) the horizontal sampling resolution should match the vertical sampling resolution as closely as possible, i.e., Δx ≈ Δy; and ii) the same sampling rate should be used for the NTSC and PAL/SECAM systems, and it should be a multiple of the respective line rates of these systems. The first criterion calls for f_{s,x} ≈ IAR × f_{s,y}, or f_s ≈ IAR × f_{s,y}^2 × f_{s,t}, which leads to f_s ≈ 11 MHz and 13 MHz for the NTSC and PAL/SECAM systems, respectively. A number that is closest to both values and yet satisfies the second criterion is then chosen, which is

    f_s = 858 f_l (NTSC) = 864 f_l (PAL/SECAM) = 13.5 MHz.    (1.5.2)
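A quick numerical check of this choice (our own sketch, not part of the recommendation) confirms both constraints:

    # Verify the BT.601 sampling-rate choice of Eq. (1.5.2).
    f_l_ntsc = 30 * 525 * 1000 / 1001   # NTSC line rate, about 15734 Hz
    f_l_pal = 25 * 625                  # PAL/SECAM line rate, 15625 Hz

    # Criterion i): f_s ~ IAR * f_{s,y}^2 * f_{s,t}
    print((4 / 3) * 525**2 * (30 * 1000 / 1001) / 1e6)  # about 11.0 MHz
    print((4 / 3) * 625**2 * 25 / 1e6)                  # about 13.0 MHz

    # Criterion ii): 13.5 MHz is a multiple of both line rates.
    print(858 * f_l_ntsc / 1e6, 864 * f_l_pal / 1e6)    # both 13.5 MHz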

The numbers of pixels per line are f_{s,x} = 858 for NTSC and 864 for PAL/SECAM. These two formats are known as the 525/60 and 625/50 signals, respectively, and are illustrated in Fig. 1.7. The numbers of active lines are f'_{s,y} = 480 and 576 in the 525- and 625-line systems, respectively, but the numbers of active pixels per line are the same, both equal to f'_{s,x} = 720 pixels. The remaining samples are obtained during the horizontal and vertical retraces, and fall in the non-active area.

With the BT.601 signal, the pixel width-to-height ratio is not one, i.e., the physical area associated with a pixel is not a square. Specifically, PAR = Δx/Δy = IAR × f'_{s,y}/f'_{s,x} = 8/9 for 525/60 signals and 16/15 for 625/50 signals. To display a BT.601 signal, the display device must have the proper PAR; otherwise the image will be distorted. For example, when displayed on a computer screen, which has a PAR of 1, a 525/60 signal will appear stretched horizontally, while a 625/50 signal will appear stretched vertically. Ideally, one should resample the original signal so that f'_{s,x} = IAR × f'_{s,y}. For example, the 525/60 and 625/50 signals should be resampled to have 640 and 768 active pixels/line, respectively.
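These PAR values and square-pixel resampling targets can be reproduced with a few lines (variable names are ours):

    from fractions import Fraction

    def pixel_aspect_ratio(iar, active_lines, active_pels):
        # PAR = IAR * f'_{s,y} / f'_{s,x}
        return iar * Fraction(active_lines, active_pels)

    iar = Fraction(4, 3)
    print(pixel_aspect_ratio(iar, 480, 720))  # 8/9 for 525/60
    print(pixel_aspect_ratio(iar, 576, 720))  # 16/15 for 625/50

    # Square-pixel resampling targets: f'_{s,x} = IAR * f'_{s,y}
    print(iar * 480, iar * 576)  # 640 and 768 active pixels/line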

Color Coordinate and Chrominance Subsampling  Along with the image resolution, the BT.601 recommendation also defines a digital color coordinate, known as YCbCr. The Y, Cb, and Cr components are scaled and shifted versions of the analog Y, U, and V components, where the scaling and shifting operations are introduced so that the resulting components take values in the range of (0, 255). For a more detailed explanation of the design of this color coordinate, the readers are referred to [9].

11. The ITU-R was formerly known as the International Radio Consultative Committee (CCIR), and the 4:3 aspect ratio version of the BT.601 format was called the CCIR601 format.


Figure 1.7. BT.601 video formats. Left: the 525/60 format (60 fields/s), with 858 pels/line (720 active) and 525 lines (480 active). Right: the 625/50 format (50 fields/s), with 864 pels/line (720 active) and 625 lines (576 active).

Here we only present the transformation matrix for deriving this coordinate from the digital RGB coordinate. Assuming that the RGB values are in the range of (0-255), the YCbCr values are related to the RGB values by:

    [ Y  ]   [  0.257   0.504   0.098 ] [ R ]   [  16 ]
    [ Cb ] = [ -0.148  -0.291   0.439 ] [ G ] + [ 128 ]    (1.5.3)
    [ Cr ]   [  0.439  -0.368  -0.071 ] [ B ]   [ 128 ]

The inverse relation is:

    [ R ]   [ 1.164   0.0     1.596 ] [ Y  - 16  ]
    [ G ] = [ 1.164  -0.392  -0.813 ] [ Cb - 128 ]    (1.5.4)
    [ B ]   [ 1.164   2.017   0.0   ] [ Cr - 128 ]

In the above relations, R = 255R̃, G = 255G̃, and B = 255B̃ are the digital equivalents of the normalized RGB primaries R̃, G̃, and B̃, as defined in either the NTSC or the PAL/SECAM system. In the YCbCr coordinate, Y reflects the luminance and is scaled to have a range of (16-235); Cb and Cr are scaled versions of the color differences B − Y and R − Y, respectively. The scaling and shifting are designed so that they have a range of (16-240). The maximum value of Cr corresponds to red (Cr = 240, or R = 255, G = B = 0), whereas the minimum value yields cyan (Cr = 16, or R = 0, G = B = 255). Similarly, the maximum and minimum values of Cb correspond to blue (Cb = 240, or R = G = 0, B = 255) and yellow (Cb = 16, or R = G = 255, B = 0).
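A direct transcription of Eqs. (1.5.3) and (1.5.4) into code looks as follows. This is a minimal sketch assuming full-range 8-bit RGB inputs represented as floats; a real converter would also need rounding and clipping to the legal ranges.

    import numpy as np

    # Forward transform of Eq. (1.5.3): digital RGB to YCbCr.
    M = np.array([[ 0.257,  0.504,  0.098],
                  [-0.148, -0.291,  0.439],
                  [ 0.439, -0.368, -0.071]])
    offset = np.array([16.0, 128.0, 128.0])

    def rgb_to_ycbcr(rgb):
        return M @ rgb + offset

    def ycbcr_to_rgb(ycbcr):
        # Inverse transform of Eq. (1.5.4).
        Minv = np.array([[1.164,  0.0,    1.596],
                         [1.164, -0.392, -0.813],
                         [1.164,  2.017,  0.0]])
        return Minv @ (ycbcr - offset)

    print(rgb_to_ycbcr(np.array([255.0, 0.0, 0.0])))  # red: Cr near its maximum, 240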

The spatial sampling rate introduced previously refers to the luminance component, Y. For the chrominance components, Cb and Cr, usually only half of this sampling rate is used, i.e., f_{s,c} = f_s/2. This leads to half the number of pixels in each line, but the same number of lines per frame. This is known as the 4:2:2 format, implying that there are 2 Cb samples and 2 Cr samples for every 4 Y samples. To further reduce the required data rate, BT.601 also defined the 4:1:1 format, in which


Figure 1.8. BT.601 chrominance subsampling formats: 4:4:4 (no subsampling: 4 Cb and 4 Cr pixels for every 2x2 Y pixels); 4:2:2 (2:1 horizontal subsampling: 2 Cb and 2 Cr pixels for every 2x2 Y pixels); 4:1:1 (4:1 horizontal subsampling: 1 Cb and 1 Cr pixel for every 4x1 Y pixels); 4:2:0 (2:1 subsampling both horizontally and vertically: 1 Cb and 1 Cr pixel for every 2x2 Y pixels). Note that two adjacent lines in any one component belong to two different fields.

the chrominance components are subsampled along each line by a factor of 4, i.e., there are 1 Cb sample and 1 Cr sample for every 4 Y samples. This sampling method, however, yields very asymmetric resolutions in the horizontal and vertical directions. Another sampling format was therefore developed, which subsamples the Cb and Cr components by half in both the horizontal and vertical directions. In this format, there are also 1 Cb sample and 1 Cr sample for every 4 Y samples, but to avoid confusion with the previously defined 4:1:1 format, this format is designated 4:2:0. For applications requiring very high resolution, the 4:4:4 format is defined, which samples the chrominance components at exactly the same resolution as the luminance component. The relative positions of the luminance and chrominance samples for the different formats are shown in Fig. 1.8.^12

In Chap. 4, we will discuss solutions for converting videos with different spatial/temporal resolutions. The conversion between different color subsampling formats is considered in one of the exercise problems.
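As a concrete illustration of chrominance subsampling (our own sketch, not the exercise solution), the code below converts full-resolution chroma planes to 4:2:0 by averaging each 2x2 block, ignoring the interlaced-field subtlety noted in Fig. 1.8:

    import numpy as np

    def subsample_420(cb, cr):
        """Convert full-resolution (4:4:4) chroma planes to 4:2:0 by
        averaging each 2x2 block. Assumes even plane dimensions and
        progressive (non-interlaced) frames."""
        def down2x2(plane):
            return 0.25 * (plane[0::2, 0::2] + plane[1::2, 0::2] +
                           plane[0::2, 1::2] + plane[1::2, 1::2])
        return down2x2(cb), down2x2(cr)

    cb = np.random.rand(480, 720)
    cr = np.random.rand(480, 720)
    cb420, cr420 = subsample_420(cb, cr)
    print(cb420.shape)  # (240, 360): half resolution in both directions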

The raw data rate of a BT.601 signal depends on the color subsampling format. With the most common 4:2:2 format, there are two chrominance samples per two Y samples, each represented with 8 bits. Therefore, the equivalent bit rate for each Y sample is N_b = 16 bits, and the raw data rate is f_s N_b = 216 Mbps. The raw data rate corresponding to the active area is f_{s,t} f'_{s,y} f'_{s,x} N_b = 166 Mbps. With the 4:2:0 format, there are two chrominance samples per four Y samples, and the equivalent bit rate for each Y sample is N_b = 12 bits. Therefore, the raw data rate is 162 Mbps, with 124 Mbps in the active area. For the 4:4:4 format, the equivalent bit rate for each Y sample is N_b = 24 bits, and the raw data rate is 324 Mbps, with 249 Mbps in the active area. The resolutions and data rates of the different BT.601 signals are summarized in Table 1.3.

12. For the 4:2:0 format, the Cr and Cb samples may also be positioned in the center of the four corresponding Y samples, as shown in Fig. 13.14(a).

The BT.601 formats are used in high-quality digital video applications, with the 4:4:4 and 4:2:2 formats typically used for video production and editing, and 4:2:0 for video distribution, e.g., movies on digital video disks (DVD), video-on-demand (VOD), etc. The MPEG2^13 video compression standard was primarily developed for compression of BT.601 4:2:0 signals, although it can also handle videos in lower or higher resolutions. A typical 4:2:0 signal with a raw active data rate of 124 Mbps can be compressed down to about 4-8 Mbps. We will introduce the MPEG2 video coding algorithm in Sec. 13.5.

    1.5.3 Other Digital Video Formats and Applications

In addition to the BT.601 format, several other standard digital video formats have been defined. Table 1.3 summarizes these video formats, along with their main applications, compression methods, and compressed bit rates. The CIF (Common Intermediate Format) is specified by the International Telecommunications Union - Telecommunications Sector (ITU-T); it has about half the resolution of BT.601 4:2:0 in both the horizontal and vertical dimensions, and was developed for video conferencing applications. The QCIF format, a quarter of CIF, is used for video-phone type applications. Both are non-interlaced. The ITU-T H.261 coding standard was developed to compress videos in either format to p × 64 Kbps, with p = 1, 2, ..., 30, for transport over ISDN (integrated services digital network) lines, which only allow transmission rates in multiples of 64 Kbps. Typically, a CIF signal with a raw data rate of 37.3 Mbps can be compressed down to about 128 to 384 Kbps with reasonable quality, while a QCIF signal with a raw data rate of 9.3 Mbps can be compressed to 64-128 Kbps. A later standard, H.263, can achieve better quality than H.261 at the same bit rate. For example, it is possible to compress a QCIF picture to about 20 Kbps, while maintaining a quality similar to or better than H.261 at 64 Kbps. This enables video phone over a 28.8 Kbps modem line.

In parallel with the effort of the ITU-T, the ISO MPEG also defined a series of digital video formats. The SIF (Source Intermediate Format) is essentially a quarter size of the active area of the BT.601 signal, and is about the same as CIF. This format is targeted at video applications requiring medium quality, such as video games and CD movies. As with BT.601, there are two SIF formats: one with a frame rate of 30 Hz and 240 lines, and another with a frame rate of 25 Hz and 288 lines; both have 352 pixels/line. There is also a corresponding set of SIF-I formats, which are 2:1 interlaced. The MPEG-1 algorithm can compress a typical SIF video with a raw data rate of 30 Mbps to about 1.1 Mbps with a quality similar to that seen on a VHS VCR, which is lower than broadcast television. The rate of 1.1 Mbps enables the playback of digital movies on CD-ROMs, which have an access rate of 1.5 Mbps. Distribution of MPEG1 movies on video CDs (VCD) marked the entrance of digital video into the consumer market in the early 1990's.

13. MPEG stands for the Motion Picture Expert Group of the International Standard Organization, or ISO.


MPEG2-based DVDs, which appeared in the mid-90's, opened the era of high-quality digital video entertainment. MPEG2 technology is also the cornerstone of the next generation TV system, which will be fully digital, employing digital compression and transmission technology. Table 1.3 lists the details of the video formats discussed above, along with their main applications, compression methods, and compressed bit rates. More on compression standards will be presented in Chap. 13.

The BT.601 format is the standard picture format for digital TV (DTV). To further enhance the video quality, several HDTV formats have also been standardized by the Society of Motion Picture and Television Engineers (SMPTE), which are also listed in Table 1.3. A distinct feature of HDTV is its wider aspect ratio, 16:9 as opposed to 4:3 in SDTV. The picture resolution is doubled or tripled in both the horizontal and vertical dimensions. Furthermore, progressive scan is used to reduce interlacing artifacts. A high profile has been developed in the MPEG2 video compression standard for compressing HDTV video. Typically, it can reduce the data rate to about 20 Mbps while retaining the very high quality required. This video bit rate is chosen so that the combined bit stream with audio, when transmitted using digital modulation techniques, can still fit into a 6 MHz terrestrial channel, which is the assigned channel bandwidth for HDTV broadcast in the U.S.

    1.5.4 Digital Video Recording

To store video in digital formats, various digital video tape recorder (DVTR) formats have been developed, which differ in the video format handled and in the technology for error-correction coding and storage density. Table 1.4 lists some standard and proprietary tape formats. The D1-D5 formats store a video in its raw, uncompressed format, while the others pre-compress the video. Only a conservative amount of compression is employed, so as not to degrade the video quality beyond that acceptable for the intended application. A good review of digital VTRs can be found in [11]. An extensive coverage of the underlying physics of magnetic recording and the operation of DVTRs can be found in the book by Watkinson [12].

In addition to magnetic tape recorders, VCD and DVD are two video storage devices using optical disks. By incorporating MPEG1 and MPEG2 compression technologies, they can store SIF and BT.601 videos, respectively, with sufficient quality. At present, VCD and DVD are read-only, so that they are mainly used for distribution of pre-recorded video, as opposed to serving as tools for recording video by consumers.

Besides video recording systems using magnetic tapes, hard-disk based systems, such as TiVo and ReplayTV, are also on the horizon. These systems enable consumers to record up to 30 hours of live TV programs onto hard disks in MPEG-2 compressed formats, which can be viewed later with the usual VCR features such as fast forward, slow motion, etc. They also allow instant pause of a live program being watched, by storing the live video from the time of the pause onto the disk. As the price of hard disks continues to drop, hard-disk-based DVTRs may eventually overtake tape-based systems, which are slower and have less storage capacity.


Table 1.3. Digital Video Formats for Different Applications

    Video Format   Y Size        Color Sampling   Frame Rate    Raw Data Rate (Mbps)

    HDTV over air, cable, satellite; MPEG2 video; 20-45 Mbps:
    SMPTE 296M     1280x720      4:2:0            24P/30P/60P   265/332/664
    SMPTE 295M     1920x1080     4:2:0            24P/30P/60I   597/746/746

    Video production; MPEG2; 15-50 Mbps:
    BT.601         720x480/576   4:4:4            60I/50I       249
    BT.601         720x480/576   4:2:2            60I/50I       166

    High-quality video distribution (DVD, SDTV); MPEG2; 4-8 Mbps:
    BT.601         720x480/576   4:2:0            60I/50I       124

    Intermediate-quality video distribution (VCD, WWW); MPEG1; 1.5 Mbps:
    SIF            352x240/288   4:2:0            30P/25P       30

    Video conferencing over ISDN/Internet; H.261/H.263; 128-384 Kbps:
    CIF            352x288       4:2:0            30P           37

    Video telephony over wired/wireless modem; H.263; 20-64 Kbps:
    QCIF           176x144       4:2:0            30P           9.1


    1.5.5 Video Quality Measure

To conduct video processing, it is necessary to define an objective measure of the difference between an original video and the processed one. This is especially important, e.g., in video coding applications, where one must measure the distortion caused by compression. Ideally, such a measure should correlate well with the perceived difference between two video sequences. Finding such a measure, however, turns out to be an extremely difficult task. Although various quality measures have been proposed, those that correlate well with visual perception are quite complicated to compute. Most video processing systems of today are designed to minimize the mean square error (MSE) between two video sequences ψ_1 and ψ_2, which is defined as

    MSE = σ_e^2 = (1/N) Σ_k Σ_{m,n} [ψ_1(m,n,k) − ψ_2(m,n,k)]^2,    (1.5.5)


Table 1.4. Digital Video Tape Formats

    Tape Format       Video Format            Source Rate   Compressed Rate   Compression Method           Intended Application

    Uncompressed formats:
    SMPTE D1          BT.601 4:2:2            216 Mbps      N/A               N/A                          Professional
    SMPTE D2          BT.601 composite        114 Mbps      N/A               N/A                          Professional
    SMPTE D3          BT.601 composite        114 Mbps      N/A               N/A                          Professional/Consumer
    SMPTE D5          BT.601 4:2:2 (10 bit)   270 Mbps      N/A               N/A                          Professional

    Compressed formats:
    Digital Betacam   BT.601 4:2:2            166 Mbps      80 Mbps           Frame DCT                    Professional
    Betacam SX        BT.601 4:2:2            166 Mbps      18 Mbps           MPEG2 (I and B mode only)    Consumer
    DVCPRO50          BT.601 4:2:2            166 Mbps      50 Mbps           Frame/field DCT              Professional
    DVCPRO25 (DV)     BT.601 4:1:1            124 Mbps      25 Mbps           Frame/field DCT              Consumer

    where N is the total number of pixels in either sequence. For a color video, the

    MSE is computed separately for each color component.

Instead of the MSE, the peak signal-to-noise ratio (PSNR), in decibels (dB), is more often used as a quality measure in video coding. The PSNR is defined as

    PSNR = 10 log_10 ( ψ_max^2 / σ_e^2 ),    (1.5.6)

where ψ_max is the peak (maximum) intensity value of the video signal. For the most common 8-bit/color video, ψ_max = 255. Note that for a fixed peak value, the PSNR is completely determined by the MSE. The PSNR is more commonly used than the MSE because people tend to associate the quality of an image with a certain range of PSNR. As a rule of thumb, for the luminance component, a PSNR over 40 dB typically indicates an excellent image (i.e., very close to the original), 30 to 40 dB usually means a good image (i.e., the distortion is visible but acceptable), 20 to 30 dB is quite poor, and finally, a PSNR lower than 20 dB is unacceptable.


It is worth noting that to compute the PSNR between two sequences, it is incorrect to calculate the PSNR between every two corresponding frames and then take the average of the PSNR values obtained over individual frames. Rather, one should compute the MSE between corresponding frames, average the resulting MSE values over all frames, and finally convert the average MSE value to PSNR.
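The following sketch (function names are ours) implements Eqs. (1.5.5) and (1.5.6) with the correct averaging order just described:

    import numpy as np

    def mse(seq1, seq2):
        # Eq. (1.5.5): mean square error over all pixels and frames.
        # Sequences are arrays of shape (frames, rows, columns).
        return np.mean((seq1.astype(np.float64) - seq2.astype(np.float64)) ** 2)

    def psnr(seq1, seq2, peak=255.0):
        # Eq. (1.5.6): convert the sequence-level MSE to PSNR in dB.
        # The MSE is averaged over all frames before taking the log,
        # which is the correct order; averaging per-frame PSNRs is not.
        return 10 * np.log10(peak ** 2 / mse(seq1, seq2))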

A measure that is sometimes used in place of the MSE, mainly for reduced computation, is the mean absolute difference (MAD). This is defined as

    MAD = (1/N) Σ_k Σ_{m,n} |ψ_1(m,n,k) − ψ_2(m,n,k)|.    (1.5.7)

For example, for motion estimation, the MAD is usually used to find the best matching block in another frame for a given block in the current frame.

It is well known that the MSE or PSNR does not correlate very well with the visual difference between two images. But these measures have been used almost exclusively as objective distortion measures in image/video coding, motion compensated prediction, and image restoration, partly because of their mathematical tractability, and partly because of the lack of better alternatives. Designing objective distortion measures that are easy to compute and yet correlate well with visual distortion is still an open research issue. In this book, we will mostly use the MSE or PSNR as the distortion measure.

    1.6 Summary

Color Generation, Perception, and Specification (Sec. 1.1)

    The color of a light depends on its spectral content. Any color can be created

    by mixing three primary colors. The most common primary set includes red,

    green, and blue colors.

The human eye perceives color by having receptors (cones) in the retina that are tuned to red, green, and blue wavelengths. The color sensation can be described by three attributes: luminance (i.e., brightness), hue (color tone), and saturation (color purity). The human eye is most sensitive to luminance, then to hue, and finally to saturation.

A color can be specified by three numbers: either those corresponding to the contributions of the three primary colors (i.e., tristimulus values), or a