
Thesis for the degree of Licentiate of Philosophy

Video communication over the Internet

Mathias Johanson

Department of Computer Engineering

Chalmers University of Technology

Gothenburg, Sweden, 2002


Video communication over the Internet

Mathias Johanson

Copyright Mathias Johanson, 2002

Technical Report No. 8L

Department of Computer Engineering
Chalmers University of Technology
S-412 96 Gothenburg, Sweden
Phone: +46 (0) 31-772 10 00

Contact information:
Mathias Johanson
Framkom Research Corporation
Sallarängsbacken 2
S-431 37 Mölndal, Sweden
Phone: +46 (0) 31-67 55 43
Fax: +46 (0) 31-67 55 49
Email: [email protected]

Printed in Sweden
Chalmers Reproservice
Gothenburg, Sweden, 2002


Video communication over the Internet

Mathias Johanson
Department of Computer Engineering

Chalmers University of Technology

Abstract

The Internet was originally designed as a data communication network, primarily supporting asynchronous applications, like file transfer and electronic mail. Already at an early stage, however, the potential of using the Internet for synchronous interpersonal communication was realized. Since then numerous applications of real-time multimedia communication systems have been demonstrated.

One of the most compelling types of extraverbal communication to support over a network is that employing the visual modality. However, the difficulties in realizing large-scale video communication systems over the Internet must not be underestimated. The best-effort service model of the Internet, without guarantees on timely delivery of packets, poses significant challenges for the realization of robust, high-quality video communication systems. Nonetheless, the feasibility of packet video communication systems has been successfully demonstrated. A number of unsolved issues remain, however. Of primary concern is how to make video communication systems scalable to a large number of widely distributed users. Since video communication is potentially very broadband, sensitive to delay, jitter and packet loss, and often multipoint in nature, a scalable transmission architecture is certainly not easy to design. Another fundamental question is how to support video communication in highly heterogeneous and dynamic network and computing environments. Since the Internet is built upon network connections of widely different capacities and the computers connected to the Internet have vastly different characteristics, video applications must be adaptive to diverse and dynamic conditions.

This thesis contributes to the realization of a scalable and adaptive framework for video communication over the Internet by presenting novel algorithms and methods for multicast flow control and layered video coding. Moreover, the scope of Internet video is broadened by presenting a procedure for interconnecting multicast videoconferences with the World Wide Web. Finally, an enrichment of Internet video is proposed by introducing a system for stereoscopic video transmission.

Keywords: Teleconferencing, Internet video, layered multicast, flow control, congestion control, layered video coding, video gateways, stereoscopic video


List of included papers

This thesis consists of an introduction and the following papers:

Paper A Mathias Johanson, Delay-based flow control for layered multicast applications, Proceedings of the 12th International Packet Video Workshop, Pittsburgh, PA, April 2002.

Paper B Mathias Johanson, A scalable video compression algorithm for real-time Internet applications, Pending publication.

Paper C Mathias Johanson, An RTP to HTTP video gateway, Proceedings of the Tenth International World Wide Web Conference, Hong Kong, China, May 2001.

Paper D Mathias Johanson, Stereoscopic video communication over the Internet, Proceedings of the Second IEEE Workshop on Internet Applications, San José, CA, July 2001.


Table of contents

I Introduction 1

1 Background 3

2 Methodology and scope 4

3 Video on the Internet 4

3.1 Applications and requirements 5

3.1.1 Teleconferencing 5

3.1.2 Video telephony 6

3.1.3 Web cameras 6

3.1.4 Other applications 6

3.2 Protocols and standards 6

3.2.1 Transport protocols 7

3.2.2 Session management and control 10

3.3 Reflectors, transcoders and mixers 11

3.4 Multicast communication 12

3.4.1 Group management 12

3.4.2 Multicast routing 13

3.4.3 Multicast scope 14

3.4.4 The Mbone 14

3.5 Quality of service 14

3.5.1 Integrated services 15

3.5.2 Differentiated services 15

3.5.3 Traffic engineering and constraint based routing 16

3.6 Media encodings 16

3.6.1 Colorspace conversion and subsampling 17

3.6.2 Inter-frame coding 17

3.6.3 Transform coding 18

3.6.4 Quantization 18

3.6.5 Entropy coding 19

3.6.6 Video compression standards 19

3.7 Scalability and adaptivity 20

3.7.1 Layered multicast 20


3.7.2 Flow control and congestion avoidance 21

3.7.3 Scalable media encodings 22

4 Summary of included papers and their contributions 23

4.1 Paper A 23

4.2 Paper B 23

4.3 Paper C 23

4.4 Paper D 24

5 Future directions 24

II Paper A 27

III Paper B 45

IV Paper C 71

V Paper D 85


Acknowledgements

I would like to thank my supervisor Sven Tafvelin and Lars-Åke Johansson, without whose professional and personal support this work would not have been possible. Also, I would like to thank Framkom Research Corporation for funding this work.


Introduction


1 Background

Interpersonal communication systems are becoming increasingly pervasive in everyday life. Undoubtedly, universal access to sophisticated multimodal communication systems has a tremendous potential for enriching social interactions between individuals. Furthermore, high quality communication using rich media can enable new ways of collaborative work between teams of co-workers, irrespective of geographical location. This not only reduces the need to travel but also facilitates new ways of cooperative work, wherein the flow of information is more direct between the people concerned. More efficient information exchange makes it possible to cut lead times and increase productivity in distributed teamwork, while improving the working conditions for the people involved.

An important aspect of human interaction is visual communication. This has impelled the development of digital video communication systems, initially based on dedicated circuit switched telecommunication channels (e.g. ISDN lines). Subsequently, the tremendous success of the Internet in providing a global communication infrastructure for a wide variety of applications inspired the invention of packet video systems. The Internet protocols were originally designed for asynchronous data transfer applications, like file transfer, electronic mail and remote access to time sharing systems. At an early stage, however, the potential of using the Internet for synchronous interpersonal communication was explored, initially through text messaging systems and then using packet audio and video tools. Gradually, as link capacities and end system performance improved, the Internet evolved into a multiservice network infrastructure supporting many types of voice, video and data communication applications. This convergence of telecommunication services into a unified IP based network infrastructure presents a huge savings potential for network operators, since it eliminates the need to maintain several communication networks in parallel. Thus, the incentives for video communication over the Internet can be seen to be related both to the desire for richer means of interpersonal communication services and to the cost-effective realization of those services.

However, the connectionless, best-effort nature of the current Internet architecture poses severe technological challenges for designing time-critical synchronous communication applications. Since there are no guarantees on resource availability or timely delivery of datagrams, packet video applications must be resilient to packet loss and adaptive to variations in bandwidth and latency. Furthermore, since bandwidth in many parts of the Internet is a scarce resource and since uncompressed digital video signals are prohibitively broadband, sophisticated video compression algorithms are needed to efficiently utilize the network.

Furthermore, the requirements of real-time multimedia communication applications have inspired researchers to propose enhancements to the prevalent best-effort model of packet delivery. These efforts, collectively labeled Internet Quality of Service (QoS), are aimed at providing different service classes for different types of Internet traffic. Although more sophisticated QoS support from the network will substantially facilitate the realization of large-scale real-time communication services, it is currently unclear in what shape this functionality will be provided. In any case, it will for a long time yet be necessary for applications to rely on the current best-effort model.

This work contributes to the development of enabling technology for Internet based video communication systems and identifies requirements for novel applications thereof.

2 Methodology and scope

The research presented in this thesis is inherently multidisciplinary, touching the fields of computer networking, signal processing, algorithmics and application software architecture. The aim has been to study Internet video communication from a broad technological perspective, rather than focusing exclusively on some limited aspect thereof. The motivation for doing so is that the components of a video communication system are closely interrelated, each affecting the design of the system as a whole. (For instance, the design of a video coding algorithm is closely related to transmission architecture characteristics.)

The methodology employed in the work at hand is based on both experimental and theoretical approaches. The analysis of network protocols has primarily been carried out through simulations. Design and analysis of video compression algorithms has relied on both theoretical methods (e.g. complexity analysis) and experiments with prototype implementations. Throughout the work an emphasis on action research has been embraced, wherein prototype implementations of novel Internet video applications have been developed for the purpose of experimentation and analysis.

3 Video on the Internet

The earliest digital video transmission systems focused on circuit-switched transmission networks with fixed capacities. In circuit-switched networks calls are aggregated through time division multiplexing (TDM), allotting a constant bitrate share of the communication channel to each call. This requires that the video signal to be transmitted is encoded at a constant bitrate (CBR), conforming to the capacity of the TDM slot of the communication channel.

In contrast, packet-switched networks, like the Internet, aggregate traffic with variable bitrate (VBR) onto a single communication link using statistical multiplexing. Thus, by not requiring the video coding to be CBR, the aggregate utilization of the network can potentially be higher.

On the other hand, in circuit-switched networks the bandwidth of a connection is guaranteed for the duration of a session, whereas connectionless packet-switched networks are typically best-effort, requiring the applications to adapt to the amount of bandwidth available.

3.1 Applications and requirements

Internet video applications can be broadly categorized into two classes:

• live video applications and

• applications of stored video material.

The live video applications are concerned with synchronous, real-time transmission of live video signals in a person-to-person communication scenario, whereas the stored video applications are concerned with transmission and playback of pre-encoded material stored on disk or otherwise.

The simplest case of a stored video Internet application is transfer of a video file from a server for playback after the download finishes. To reduce start-up latencies, streaming video applications have been developed that maintain a playback buffer so that playback can be initiated before the transfer is complete. The streaming applications are typically based on a client/server model.

In contrast, the synchronous applications are based on a peer-to-peer model, and the communication is in many situations (but not always) symmetric in that video flows both ways, e.g. in a videophone application. An example of an asymmetrical live video application is a lecture that is broadcast to a group of students in a teleteaching scenario. In this type of setting, the students are typically able to use an audio or text chat backchannel for asking questions, etc.

Internet applications can also be classified as point-to-point or multipoint. In a point-to-point application, video is transmitted, bidirectionally or unidirectionally, between two endpoints. In a multipoint application video is broadcast from one or more senders to many receivers.

The work presented in this thesis is mostly oriented towards live Internet video applications, although much of the underlying technology can also be applied to streaming video applications. Some of the most important types of applications of live video on the Internet are described below.

3.1.1 Teleconferencing

Teleconferencing is a wide concept involving applications that enable groups of people to communicate in real time using a combination of audio, video and other modalities. The term videoconferencing is often assumed to include both audio and video; videoconferencing is most often symmetrical and can be multipoint or point-to-point.


Moreover, videoconferencing systems are often categorized as either desktop systems or room-based systems. Desktop videoconferencing systems are software applications running on general-purpose workstations, whereas room-based systems are self-contained units typically installed in conference rooms.

Teleconferencing applications, being highly interactive, impose hard requirements on end-to-end delays. Furthermore, the systems need to be scalable to large sessions with many participants. Audio and video quality requirements typically depend on the circumstances and are highly subjective.

3.1.2 Video telephony

Video telephony can be seen as a special case of videoconferencing, limited to a symmetrical point-to-point configuration with two participants. As with videoconferencing, audio is often assumed to be an integral part of a video telephony system. Video quality requirements can be assumed to be somewhat lower than in group conference settings, but again, this is highly subjective.

3.1.3 Web cameras

A web camera is a video camera attached to a computer that transmits a live video feed to a client web browser. A software component installed on a web server makes it possible to incorporate live video in WWW pages. This is commonly used for various types of remote awareness applications. Video quality is typically rather moderate and the delay requirements are relaxed.

3.1.4 Other applications

In addition to the above-mentioned applications, a large number of highly specialized applications of Internet video have been explored and implemented. For instance, telemedicine applications using video for remote consultations and diagnoses have been successfully demonstrated. Teleteaching is another application where videoconferencing in combination with other tools can be used for distance education in an asymmetrical, multipoint setting. Various forms of distributed computer supported cooperative work (CSCW) rely on video communication over the Internet in one form or another. Moreover, virtual reality systems augmented with video (sometimes referred to as augmented or mixed reality) have attracted a lot of attention from the research community.

3.2 Protocols and standards

The applications and technology covered by this work build heavily on the Internet Protocol (IP) architecture and service model. Indeed, for the remainder of this thesis an IP based communication infrastructure will be assumed.


Standards for video communication over IP networks have primarily emerged from two sources: the International Telecommunication Union (ITU-T) and the Internet Engineering Task Force (IETF). Although the standards developed by these respective authorities are partly overlapping, they represent two fundamentally different standardization approaches.

The ITU-T Recommendation H.323 [1] defines protocols and procedures for multimedia communication over packet networks and is a conglomerate of a number of standards, including H.245 for control, H.225.0 for packetization and connection establishment, H.261 and H.263 for video coding and a number of others for supplementary services. The H.323 series standards are based on adaptations of protocols developed for the traditional circuit-switched service model of telecommunication networks (e.g. H.320 for ISDN videoconferencing and Q.931 for signaling). A significant focus is kept on interoperability and compliance.

The IETF standards framework for Internet video is a more loosely coupled set of documents, each defining a specific protocol or procedure. Furthermore, the IETF standards are more lightweight with a pronounced aim of scalability. In contrast to the ITU-T standards, they do not define any algorithms for content coding, but include procedures and guidelines for packetization of media.

3.2.1 Transport protocols

Most Internet applications use the Transmission Control Protocol (TCP), which implements reliable, connection-oriented data delivery over the connectionless datagram service provided by IP. The TCP transport protocol achieves reliability by retransmission of lost packets using an acknowledgment scheme. TCP also provides a congestion avoidance algorithm that adapts the packet transmission pace based on the experienced loss rate. However, delay-sensitive applications, like packet audio and video tools, cannot use the TCP protocol due to its poor real-time properties. When dealing with real-time data, a packet arriving too late is just as bad as a lost packet. The retransmission scheme of TCP is hence not appropriate for real-time applications.

The Real-time Transport Protocol (RTP) is an IETF proposed standard providing end-to-end delivery over packet networks for data with real-time characteristics [2]. For Internet applications it is typically used on top of the User Datagram Protocol (UDP) taking advantage of its multiplexing and checksum functionality [3]. RTP does not provide QoS mechanisms, but rather relies on lower level protocols to do so. Furthermore, a slightly abbreviated version of the RTP protocol is included in the ITU-T standard document H.225.0, specifying packetization rules for H.323 videoconferences.

In contrast to the traditional programming model of data communication, where the transport protocol is implemented in a protocol stack in the operating system kernel, the RTP protocol functionality is integrated in the application. This concept is known as application level framing (ALF) and is motivated by the fact that multimedia application design can be significantly simplified, and overall performance enhanced, if application level semantics are reflected in the transport protocol [4].

RTP defines a packet header containing information that is of generic interest for many real-time applications, like timestamps and sequence numbers. In accordance with the ALF concept, the semantics of several of the RTP header fields are deferred from the RTP specification to application-specific RTP profile documents. Typically, each media encoding to be carried over RTP has an associated RTP profile document specifying packetization rules and defining the semantics of the application specific fields of the header. The RTP header is depicted in Figure 1.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |        sequence number        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           synchronization source (SSRC) identifier            |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|            contributing source (CSRC) identifiers             |
|                             ....                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 1 RTP header

The first twelve octets of the header are present in every RTP packet. The variable length list of CSRC identifiers can be inserted by intermediate systems known as RTP mixers that aggregate RTP packets into a single RTP stream. The fields of the RTP header are interpreted as follows:

• version (V): 2 bits

The first two bits of the header identify the version of RTP. The current version is two (2).

• padding (P): 1 bit

The padding bit, if set, indicates that the packet contains one or more padding octets at the end of the packet, in which case the last padding octet determines the length of the padding.

• extension (X): 1 bit

The extension bit, if set, indicates that the fixed header is followed by an extension header.

• CSRC count (CC): 4 bits

The CSRC count specifies the number of contributing source identifiers that are present after the fixed header.

• marker (M): 1 bit


The semantics of the marker bit is defined by the RTP profile of the media being carried in the packet. For instance, the marker bit can denote the start of a talkspurt for packet audio applications, or the start of a new video frame for video applications.

• payload type (PT): 7 bits

The payload type identifies the media type carried in the payload of the RTP packet. Static mappings of payload type values to media encodings are defined by RTP profile documents. Dynamic mappings established by application-specific signaling should be drawn from the range 96-127.

• sequence number: 16 bits

The initial value of the sequence number field should be a random number; it is subsequently increased by one for each RTP packet transmitted. The applications must be designed to handle sequence number wrap-around. The sequence numbers are used to detect out-of-order delivery of packets.

• timestamp: 32 bits

The timestamp field reflects the sampling instant of the first media octet of the RTP payload. The timestamps are derived from a monotonically increasing, linear clock with a resolution defined by the RTP profile of the media encoding being used.

• synchronization source identifier (SSRC): 32 bits

The synchronization source field contains a number identifying a media source. The value is chosen randomly by the application, and should be unique for a session. Thus, mechanisms for resolving SSRC identifier collisions (although such collisions are unlikely) must be implemented by the applications.

• contributing source identifier (CSRC) list: 0 to 15 items, 32 bits each

The CSRC list contains the synchronization source identifiers of contributing sources for the payload contained in the RTP packet. The CSRC list is typically appended by RTP mixers when aggregating multiple media sources into one media stream.
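
To make the layout concrete, the following Python sketch unpacks the twelve-octet fixed header and the optional CSRC list (handling of the extension header and padding octets is omitted):

```python
import struct

def parse_rtp_fixed_header(packet: bytes):
    """Parse the twelve-octet fixed RTP header plus any CSRC list.

    Illustrative sketch only; a complete implementation would also honour
    the extension header and padding octets signalled by the X and P bits.
    """
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed RTP header")

    first, second, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    version = first >> 6            # V: 2 bits, must be 2
    padding = (first >> 5) & 0x1    # P: 1 bit
    extension = (first >> 4) & 0x1  # X: 1 bit
    csrc_count = first & 0x0F       # CC: 4 bits
    marker = second >> 7            # M: 1 bit
    payload_type = second & 0x7F    # PT: 7 bits

    # CC 32-bit CSRC identifiers follow the fixed header (inserted by mixers).
    csrcs = struct.unpack("!%dI" % csrc_count,
                          packet[12:12 + 4 * csrc_count])

    return {
        "version": version, "padding": padding, "extension": extension,
        "marker": marker, "payload_type": payload_type,
        "sequence_number": seq, "timestamp": timestamp,
        "ssrc": ssrc, "csrc_list": list(csrcs),
    }
```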

The RTP specification includes a control protocol for RTP, called the RTP control protocol (RTCP). The primary function of RTCP is to provide quality feedback from the receivers of media streams. This is performed by periodic transmission of receiver report (RR) RTCP packets, containing status information like current interarrival jitter and cumulative number of packets lost. The receiver reports can be used by the originators of media streams to adapt the transmission rate based on the observed performance.

RTCP also provides a persistent transport-level source identifier referred to as the canonical name (CNAME) of the RTP stream. The canonical name is carried in an RTCP source description (SDES) packet. The CNAME SDES packet is used to associate a synchronization source identifier with a CNAME, so that a source can be uniquely identified even if the SSRC identifier changes (e.g. due to SSRC collision). The CNAME field is typically used to associate multiple media streams from the same source, for example when synchronizing audio and video streams.

Furthermore, RTCP can optionally be used to convey minimal session management information. For instance, the RTCP BYE packet can be transmitted by a session member before termination to indicate the end of its participation.

Since the RTCP protocol is based on periodic transmissions of session control information, the transmission interval must be scaled in proportion to the size of the session. Otherwise, an implosion of RTCP packets might overload the network for large sessions. The RTP specification provides guidelines for how this scaling should be implemented.
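
The essence of the scaling rule is that aggregate RTCP traffic is held to a small, fixed fraction of the session bandwidth (nominally five percent), so the per-member report interval grows with the number of participants. A simplified sketch, ignoring the sender/receiver bandwidth split and the interval randomization mandated by the specification:

```python
def rtcp_report_interval(members: int, session_bandwidth_bps: float,
                         avg_rtcp_packet_bits: float = 8 * 120,
                         rtcp_fraction: float = 0.05,
                         min_interval_s: float = 5.0) -> float:
    """Seconds between RTCP reports for one participant (simplified).

    Aggregate RTCP traffic is held to `rtcp_fraction` of the session
    bandwidth, so the per-member interval scales with group size.
    """
    rtcp_bandwidth = rtcp_fraction * session_bandwidth_bps
    interval = members * avg_rtcp_packet_bits / rtcp_bandwidth
    return max(interval, min_interval_s)

# Example: 1000 receivers in a 1 Mbit/s session report roughly every 19 s.
# print(rtcp_report_interval(1000, 1_000_000))
```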

For a complete description of the RTP and RTCP protocols, see the RTP specification [2].

3.2.2 Session management and control

As mentioned above, the RTCP protocol provides elementary session management and control functions. However, this is limited to rudimentary support for identification of session participants and is not concerned with synchronous signaling of session initiation and control.

The H.323 standards suite provides session initiation and control through H.245 and H.225.0. The H.323 approach to session control conforms to the traditional circuit-switched model, based on the Q.931 protocol for ISDN call setup signaling.

In contrast, the IETF has designed an application-level signaling protocol called the Session Initiation Protocol (SIP) [5], reusing many of the header fields, encoding rules, error codes and authentication mechanisms of HTTP. The SIP protocol can be used to initiate, modify and terminate synchronous communication sessions with two or more participants. Furthermore, SIP invitation messages, used for session set-up, contain session descriptions, based on the Session Description Protocol (SDP) [6], specifying the media encodings to be used for the session.

The SIP/SDP protocols for session initiation and control provide a clearer separation of session signaling and multimedia data exchange, compared to the H.323 protocols. This makes it possible to implement dedicated session management tools that can be used to launch any synchronous communication tool based on the SDP descriptions. For a comprehensive comparison of SIP and H.323, see [7].

For large multicast conferences where synchronous invitation of all prospective participants is not viable, a protocol called the Session Announcement Protocol (SAP) has been proposed [8]. With SAP, session announcement packets containing SDP descriptions are periodically transmitted to a well-known multicast address and port. Specialized session directory tools listen to session announcements, informing the user about active and upcoming sessions.

3.3 Reflectors, transcoders and mixers

A reflector (also known as a Multipoint Conferencing Unit, MCU) is an application-level agent that serves as a relay point for multimedia traffic, facilitating multipoint communication. In a multipoint communication scenario employing reflectors, media packets are addressed to the reflector, which forwards the packets to all participants of the session. This results in a more scalable packet delivery mechanism compared to the situation where each sender transmits a unique copy of every packet to all participants. Several reflectors can be combined into a hierarchy for large sessions. The reflectors can be configured either statically, or using a dynamic signaling protocol, e.g. SIP. In Figure 2 a multipoint communication scenario using two reflectors is illustrated, wherein host 1 transmits a packet to all other participating hosts.

Figure 2 Multipoint communication using reflectors (host 1's packet relayed via two reflectors to hosts 2-6)
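
The forwarding behavior of a reflector can be sketched in a few lines; the participant addresses below are hypothetical and would in practice be learned through a signaling protocol, and RTP/RTCP semantics, access control and transcoding are ignored:

```python
import socket

# Hypothetical static participant list; a real reflector would learn
# these through a signaling protocol such as SIP.
PARTICIPANTS = {("10.0.0.2", 5004), ("10.0.0.3", 5004), ("10.0.0.4", 5004)}

def run_reflector(listen_port: int = 5004) -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", listen_port))
    while True:
        packet, sender = sock.recvfrom(2048)
        # Relay the packet to every participant except the one it came from.
        for participant in PARTICIPANTS:
            if participant != sender:
                sock.sendto(packet, participant)
```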

A transcoder, or transcoding gateway, is a device that performs conversion between different media encodings in real time. Transcoders are typically implemented in reflectors to enable a set of participants in a multipoint session to receive media in different encodings, based on link capacities and other considerations. By using transcoders, multipoint multimedia conference sessions can be realized in heterogeneous network and computing environments. However, transcoding introduces a high computational complexity and increases latency.

A mixer is a device that aggregates multiple incoming media streams into one outgoing media stream by performing some synthesis of the media. The typical example is a multipoint audio conference, where multiple incoming audio streams are mixed together into one outgoing stream. The process is illustrated in Figure 3. Mixers make more efficient use of network bandwidth and relieve the end systems of the media mixing operation that might be required for presentation (for example when playing out audio). Albeit not as straightforward as audio mixing, video sources can also be mixed. For instance, four video sources can be combined into one quadruple view signal, or two signals can be combined in a picture-in-picture arrangement.

Figure 3 Mixer (three incoming source streams aggregated into one mixed stream)

3.4 Multicast communication

In multipoint communication situations an efficient mechanism is needed for delivery of data to many receivers. As discussed in section 3.3, reflectors can be used to achieve a more scalable multipoint transmission architecture. However, when using reflectors, packets are still duplicated on shared network segments, resulting in sub-optimal resource utilization. Moreover, reflectors are hard to configure and maintain.

In IP multicast a range of class D IP addresses (224.0.0.0 to 239.255.255.255) is reserved for group communication [10]. A packet sent to a multicast group address is delivered to all members of the group. Group membership is maintained dynamically through explicit signaling. A dedicated multicast routing protocol is needed to forward traffic to multicast group members without transmitting redundant packets. Multicast traffic is propagated to the receivers of a group along a multicast delivery tree rooted at the sender.

3.4.1 Group management

Group membership dynamics in IP multicast are handled by the Internet Group Management Protocol (IGMP) [11]. Similar to the Internet Control Message Protocol (ICMP) for error control, IGMP is an integral part of IP. IGMP defines four types of operations for maintaining multicast groups, namely

• general membership query,

• group-specific membership query,

• membership report and

• leave group.

An IGMP message format is defined for carrying queries and reports.


General membership queries are used to obtain information about active groups in a subnet, whereas group specific membership queries request information about whether a particular group contains members in the subnet. Membership queries are issued periodically by multicast routers to determine group membership status. Membership reports are sent by hosts when joining a new group and in response to membership queries. A leave group message is sent when a host's group membership is terminated.
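
On an end host, this IGMP signaling is triggered implicitly by the sockets API: joining a group through a socket option causes the IP stack to emit a membership report, and dropping the membership leads to a leave message (in IGMPv2 and later). A minimal sketch, using a hypothetical group address from the class D range:

```python
import socket
import struct

GROUP = "239.1.2.3"   # hypothetical administratively scoped group
PORT = 5004

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))

# Joining the group makes the host send an IGMP membership report,
# and the designated router adds the subnet to the delivery tree.
mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

data, src = sock.recvfrom(2048)   # receive multicast datagrams

# Leaving the group triggers an IGMP leave message (IGMPv2 and later).
sock.setsockopt(socket.IPPROTO_IP, socket.IP_DROP_MEMBERSHIP, mreq)
```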

3.4.2 Multicast routing

Multicast routers compute routing tables specifying forwarding paths for multicast traffic. For each multicast group having members on a subnet, the designated router maintains a routing table entry consisting of the multicast address, the network interface of the source and the interfaces where packets should be forwarded. The routers rely on soft state, so the routing table entries must be periodically refreshed by sending membership queries. If no local members remain for a group, the routing table entry is deleted.

A number of routing algorithms and protocols have been designed to compute the multicast routing tables. Basically, multicast routing algorithms can be classified as either data-driven or demand-driven. Data-driven algorithms, also known as broadcast-and-prune schemes, initially flood datagrams of a multicast group to all potential receivers using reverse-path forwarding (RPF). In the RPF scheme a multicast router forwards an incoming packet on all interfaces except the ingress interface, if it arrived on the interface constituting the shortest path to the source. Forwarding paths are then pruned (i.e. removed) bottom-up by downstream routers that have no group members on their connected subnets. If a host joins a multicast group in a subnet whose designated router has previously pruned the delivery tree for the group, the router grafts (i.e. reestablishes) the forwarding path. To be able to graft previously pruned paths, multicast routers must maintain state information for every pruned group. Periodically, routers will flood datagrams of active groups anew, to update the soft state in the downstream routers.
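
The reverse-path-forwarding test can be expressed compactly: a datagram is flooded only if it arrived on the interface that the unicast routing table would use to reach the datagram's source. A schematic sketch, where the unicast_route lookup is assumed to be available:

```python
def rpf_forward(packet_source, ingress_interface, interfaces, unicast_route):
    """Return the interfaces a multicast datagram should be flooded to,
    or an empty list if it fails the reverse-path-forwarding check.

    `unicast_route(addr)` is an assumed helper returning the interface
    constituting the shortest path back towards `addr`.
    """
    if ingress_interface != unicast_route(packet_source):
        return []                      # not on the reverse path: discard
    # Flood on all interfaces except the one the packet arrived on;
    # pruned branches would be excluded by additional per-group state.
    return [ifc for ifc in interfaces if ifc != ingress_interface]
```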

The reliance on flooding in data-driven algorithms limits the scalability of multicast routing. In response, several algorithms have been proposed that avoid flooding. These algorithms are known as demand-driven, since they refrain from forwarding datagrams into networks until specifically demanded. When a host joins a multicast group in a demand-driven routing configuration, the designated router on the host's subnet must signal this join event to other routers before multicast datagrams will be forwarded to the host. The question is how to know which multicast router to inform, i.e. which router is on the next higher level of the delivery tree from the source. In demand-driven multicast routing, the Internet is divided into administrative regions (domains), each with a dedicated core router (also known as a rendezvous point). Other routers in the region are either statically configured to know about the core router, or use a dynamic discovery protocol at boot time to find it. Once a host joins a multicast group, the designated router unicasts a join request to the core router. To be able to support multicasting between domains, an interdomain multicast routing protocol is needed.

Data-driven multicast routing algorithms are appropriate in dense network configurations where many hosts are clustered closely and bandwidth is abundant. Demand-driven algorithms are more suitable in sparse network configurations where bandwidth is scarce. Thus, data-driven multicast routing can be envisioned to be used in enterprise and campus networks, whereas demand-driven routing is more appropriate in wide-area networks. Moreover, data-driven routing is sender-oriented, suitable for applications where it is of importance that the first datagrams of a session are delivered to all participants of a group. Demand-driven algorithms, in contrast, are receiver-oriented, suitable for dynamic situations where the receivers join the group at different times. In multipoint videoconferencing sessions, participants typically join the conference at slightly different points in time and it is not crucial that the very first datagrams transmitted to the session are delivered to all hosts. Hence, the receiver-oriented paradigm might be preferable. Moreover, since video communication is inherently broadband, data-driven routing can waste a lot of bandwidth due to flooding.

3.4.3 Multicast scope

Two techniques exist for limiting the scope of IP multicast transmission. The first technique uses the datagram's time-to-live (TTL) field to limit the number of hops the datagram will be forwarded. Each router decreases the TTL field when forwarding a datagram. When the TTL value reaches zero the datagram is dropped.

The second technique, called administrative scoping, is based on reserving certain ranges of multicast addresses for limited propagation. The extent of administratively scoped multicast groups is explicitly configured by the organization operating the network.

3.4.4 The Mbone

The multicast backbone (Mbone) is an experimental virtual network implemented on top of the Internet, providing global multicast connectivity [12]. The Mbone consists of islands of native multicast routing domains interconnected over non-multicast routing domains through tunneling. IP-in-IP tunneling enables multicast datagrams to be encapsulated in unicast datagrams for unicast transmission to a destination network, where they are decapsulated and re-multicast.

3.5 Quality of service

The current Internet architecture provides only a single class of best-effort service. All packets are treated in the same way, with no guarantees on delivery or bounds on delay and jitter. For a wide deployment of performance-critical applications, like real-time multimedia conferencing, it has been argued that a more predictable service needs to be delivered from the network. This has resulted in the proposal of new service models for the Internet, most notably the integrated services model and the differentiated services model [13].

3.5.1 Integrated services

The integrated services model (Intserv) is based on end-to-end resource reservations. In this model a signaling protocol is used to set up a path between the communicating endpoints prior to data exchange. Along the path, resources are reserved at intermediate systems to be able to guarantee the quality of service requested by the application. The signaling protocol defined by the IETF for this purpose is called the resource reservation protocol (RSVP) [14]. If ample resources are available for the QoS requested, the RSVP reservation will succeed and the application can proceed to communicate using the guaranteed service. In case insufficient resources are available, the reservation will fail and no service will be given to the application.

Intserv requires admission control to be performed, to decide whether a reservation request should be granted. Furthermore, when a packet arrives at a router it must be classified and put in a packet scheduling queue corresponding to the QoS requested. To be able to perform these functions, each router along the reserved path must maintain state information about every communication session (also known as a microflow). Since the amount of state information increases proportionally to the number of microflows, the Intserv model places a huge storage and processing burden on core routers. This has raised doubts about whether an Intserv model is scalable enough for the global Internet. Moreover, incremental deployment is troublesome, making the migration to an Intserv Internet architecture hard to realize. These concerns have led to the emergence of another Internet QoS model, known as differentiated services.

3.5.2 Differentiated services

The differentiated services model (Diffserv) is designed to avoid per-flow state in core routers. Instead packet classification and admission control is performed at the edge of the network, where the traffic volumes are typically lower. Upon ingress to a Diffserv network, packets are classified and marked with an identifier using a dedicated field of the IP header termed the DS field. Internal routers of the Diffserv domain then treat the packets based on the content of the DS field, according to a well-defined per-hop behavior (PHB). By defining a number of service classes and their PHBs corresponding to different application requirements, different types of traffic can receive differentiated QoS. Diffserv can thus be seen as essentially a relative priority scheme.


When traffic enters a Diffserv network it is classified, policed and shaped according to a contract between the customer and the service provider called a service level agreement (SLA). Basically, the SLA specifies the service classes that are provided and the amount of traffic the customer is allowed in each class. SLAs can be either static or dynamic. Dynamic service contracts need a signaling protocol to request services on demand. For instance, RSVP can be used for dynamic SLA signaling.
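
From an application's point of view, requesting differentiated treatment amounts to setting the DS field of outgoing datagrams, subject to re-marking and policing at the network edge and to platform restrictions. A minimal sketch with an arbitrary example code point:

```python
import socket

EF_DSCP = 46                      # example code point (expedited forwarding)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# The DS field occupies the six most significant bits of the former
# IPv4 TOS octet, hence the shift by two.
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, EF_DSCP << 2)
sock.sendto(b"media payload", ("192.0.2.1", 5004))
```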

3.5.3 Traffic engineering and constraint based routing

Intserv and Diffserv provide different ways of dividing the bandwidth of a congested network between different applications. Congestion can be caused either by network resource shortage or by uneven load distribution. In the latter case congestion might be avoided by optimizing the routing of traffic. Since the current Internet routing protocols make routing decisions based only on the shortest path to the destination, traffic will typically be aggregated towards the core of a network, even if alternative routes of higher capacity exist. Traffic engineering is the process of distributing the load on a network in order to achieve optimal utilization of the available bandwidth. An important mechanism for automating traffic engineering is constraint based routing (CBR).

CBR extends shortest path routing algorithms to take resource availability and flow requirements into consideration when computing routes. Thus, a CBR algorithm might select an alternative path to the destination if it provides more bandwidth than the shortest path. This leads to a more effective utilization of network resources. However, constraint based routing increases the computational complexity of routers, increases routing table size and can potentially result in routing instability.

3.6 Media encodings

Digital media signals, in particular video, need to be compressed when transported over a network, to make efficient use of the bandwidth. For this purpose a plethora of compression algorithms have been designed targeted at different applications and requirements.

Compression algorithms can be characterized as lossless or lossy. A lossless compression algorithm allows perfect reconstruction of the original digital signal, whereas a lossy algorithm introduces controlled loss of information so that a sufficiently accurate approximation of the original signal can be reconstructed. Lossless compression algorithms are typically used for data compression where perfect reconstruction is critical. For video compression, lossless algorithms typically result in moderate compression efficiency, but are nevertheless used for certain applications where loss of information is unacceptable (e.g. medical imaging).


Most video compression algorithms are lossy, exploiting the properties of the human visual system to discard information that is of insignificant perceptual importance. As with lossless algorithms, redundancy in the original signal is also exploited to represent the information more efficiently. In essence, video compression algorithms are based on the following techniques:

• colorspace conversion and component subsampling,

• inter-frame coding,

• transform coding,

• quantization,

• entropy coding.

3.6.1 Colorspace conversion and subsampling

The first step of essentially all video compression algorithms is to convert the images from the RGB colorspace into a luminance/chrominance representation (YCrCb). By exploiting the fact that the human visual system is less sensitive to variations in chrominance, the chrominance components are subsampled (i.e. represented with fewer samples) to reduce the data rate. Typically the chrominance components are represented with one sample for every four luminance samples (resulting in so-called 4:1:1 component subsampling).
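
These two steps can be illustrated with a few lines of NumPy; the conversion coefficients below are the common ITU-R BT.601 values, and the chrominance planes are reduced to one sample per four luminance samples by 2-by-2 averaging (the exact sampling grid differs between the 4:1:1 and 4:2:0 formats):

```python
import numpy as np

def rgb_to_ycbcr_subsampled(rgb: np.ndarray):
    """rgb: H x W x 3 array of floats in [0, 255]; H and W assumed even.

    Returns a full-resolution luminance plane and chrominance planes with
    one sample per 2x2 luminance block (a quarter of the chroma samples).
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.169 * r - 0.331 * g + 0.500 * b + 128.0
    cr =  0.500 * r - 0.419 * g - 0.081 * b + 128.0

    def subsample(c):
        # Average each 2x2 block into a single chrominance sample.
        return (c[0::2, 0::2] + c[0::2, 1::2] +
                c[1::2, 0::2] + c[1::2, 1::2]) / 4.0

    return y, subsample(cb), subsample(cr)
```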

3.6.2 Inter-frame coding

Inter-frame coding exploits temporal correlations in a video signal to reduce redundancy. Coding a frame differentially from a previous frame as an error signal improves subsequent run-length and entropy coding techniques. This predictive coding (P-coding) is usually performed on smaller subblocks of the image, typically 16-by-16 pixels in size. To improve the efficiency of predictive coding a technique called motion compensation is often utilized. In a motion compensation scheme a block is coded predictively from a spatially translated block in a previous image. The differentially coded block together with a displacement vector, called a motion vector, are used by the decoder to recreate the block.

Optionally a scheme called conditional replenishment can be utilized together with block-based predictive coding. The idea is that only blocks whose error signal, when coded differentially from a previous frame, is larger than some threshold value will be transmitted. This implies that only the spatial regions of a video scene that change temporally will be transmitted, resulting in efficient bandwidth utilization for video sequences with fairly static content.
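
A conditional replenishment selector can be sketched as follows; the block size and threshold are arbitrary illustration values:

```python
import numpy as np

def blocks_to_replenish(frame: np.ndarray, prev: np.ndarray,
                        block: int = 16, threshold: float = 10.0):
    """Yield (row, col) of blocks whose mean absolute difference from the
    previously transmitted luminance frame exceeds `threshold`."""
    h, w = frame.shape
    for r in range(0, h - block + 1, block):
        for c in range(0, w - block + 1, block):
            diff = np.abs(frame[r:r + block, c:c + block].astype(float) -
                          prev[r:r + block, c:c + block].astype(float))
            if diff.mean() > threshold:
                yield r, c   # this block is encoded and transmitted
```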

Temporal prediction can be performed either from previous frames or from subsequent frames, provided that the temporally posterior frames have been sampled in advance. Temporal prediction in both directions is known as bidirectional prediction (B-coding). B-coding improves compression efficiency, but is of limited applicability for interactive applications with hard delay requirements.

Predictive coding introduces interframe dependencies that make the video coding sensitive to packet loss. This is of great concern for Internet video applications, since they are typically based on unreliable transport protocols. To reduce the adverse implications of packet loss for video decoding, intra-coded frames are interleaved at regular intervals, providing re-synchronization points for the decoder.

3.6.3 Transform coding

In transform coding an image is transformed from the spatial domain to the frequency domain and represented as a linear combination of some set of basis functions. Some of the most commonly used basis functions are the trigonometric functions, used by the Fourier transform and the cosine transform. The reason for transforming an image to the frequency domain is to obtain a more compact representation of the data. Since the human visual system is more sensitive to low-frequency content in an image, high-frequency information can be excluded or represented with less precision.

The discrete cosine transform (DCT) is the most widely used transform for image and video compression. For instance the JPEG, MPEG and H.261 compression algorithms are based on the DCT. Since the cosine function has infinite support and since the spatial correlation of image pixels is localized, the transform is applied to small blocks of the image (typically 8-by-8 pixels).
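
For reference, the two-dimensional DCT of an 8-by-8 block can be written directly from its definition; the sketch below favors clarity over speed, whereas production codecs use fast factorizations of the transform:

```python
import numpy as np

N = 8
# Orthonormal DCT-II basis matrix: C[k, n] = a(k) * cos((2n + 1) k pi / (2N))
_k = np.arange(N).reshape(-1, 1)
_n = np.arange(N).reshape(1, -1)
_C = np.sqrt(2.0 / N) * np.cos((2 * _n + 1) * _k * np.pi / (2 * N))
_C[0, :] = np.sqrt(1.0 / N)

def dct2d(block: np.ndarray) -> np.ndarray:
    """Forward 2-D DCT of an 8x8 block (rows, then columns)."""
    return _C @ block @ _C.T

def idct2d(coeffs: np.ndarray) -> np.ndarray:
    """Inverse 2-D DCT; the basis matrix is orthonormal, so its
    transpose is its inverse."""
    return _C.T @ coeffs @ _C
```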

Another, more recently discovered transform popular in image coding is the discrete wavelet transform (DWT). The wavelet transform is based on basis functions obtained by translation and dilation of a single wavelet mother function. The wavelet basis functions are localized in space and can consequently be applied to the whole image, contrary to the block-based approach of the DCT. This is beneficial at high compression ratios where block-based algorithms typically result in quantization defects known as blocking artifacts. Moreover, the DWT provides a native multiresolution representation that can be progressively decoded. This is highly beneficial when designing scalable encodings.

Transform coding is primarily used for intra-coding of video images. However, three-dimensional transform coding algorithms for video have been proposed that extend the two-dimensional spatial transform to the temporal dimension. Indeed, video compression algorithms based on the 3D DWT have been shown to obtain very high compression ratios, but the computational complexity is prohibitively high.

3.6.4 Quantization

Quantization is a lossy procedure wherein the precision of data samples is limited to a set of discrete values. The quantization function maps several of its input values to a single output value in an irreversible process. The quantization can be either uniform or non-uniform.

Uniform quantization limits the precision of samples uniformly over the input range. This can easily be implemented by dividing each input sample value by a quantization factor and then rounding off the result.

In non-uniform quantization the input samples are represented with different precision. Non-uniform quantization is typically implemented with a look-up table known as a quantization table.

By reducing the precision of sample values, quantization limits the number of different symbols that need to be encoded in the entropy coding step following the quantization.
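
For illustration, uniform quantization and the corresponding reconstruction performed by the decoder can be sketched as follows (the step size is arbitrary):

```python
import numpy as np

def quantize(coeffs: np.ndarray, step: float) -> np.ndarray:
    """Uniform quantization: many input values map to one integer level."""
    return np.round(coeffs / step).astype(int)

def dequantize(levels: np.ndarray, step: float) -> np.ndarray:
    """Reconstruction; the information discarded by rounding is lost."""
    return levels * step

# Example: with step 16, the coefficients 3, 7 and -5 all map to level 0.
```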

3.6.5 Entropy coding

Entropy coding is the process of assigning the shortest codewords to the most frequent symbols based on the probability distribution of the input data. Examples of entropy coding schemes are Huffman coding and arithmetic coding.

Entropy coding is most often preceded by run-length coding, which encodes a consecutive series of the same symbol value as a run-length count and a symbol codeword.
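
A generic run-length coder can be sketched in a few lines; actual codecs use specialized run/level representations, for example for zigzag-scanned transform coefficients:

```python
def run_length_encode(symbols):
    """Encode a sequence as (count, symbol) pairs."""
    runs = []
    for s in symbols:
        if runs and runs[-1][1] == s:
            runs[-1][0] += 1
        else:
            runs.append([1, s])
    return [(count, sym) for count, sym in runs]

# Example: [0, 0, 0, 0, 5, 0, 0, 3] -> [(4, 0), (1, 5), (2, 0), (1, 3)]
```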

3.6.6 Video compression standards

Standardization of video compression algorithms has been performed primarily by the Moving Pictures Expert Group (MPEG) of the International Standardization Organization (ISO) and by the Telecommunication standardization sector of the International Telecommunication Union (ITU-T). MPEG has developed a number of video compression standards targeted at different multimedia applications, while the ITU-T has mainly developed standards for teleconferencing applications.

MPEG-1 (ISO standard 11172) defines a video compression algorithm based on the DCT and motion compensation, targeted at multimedia applications with data rates up to about 1.5 Mbit/s.

MPEG-2 (ISO standard 13818) extends MPEG-1 with support for greater input format flexibility, higher data rates and better error resilience. The basic principles of MPEG-2 are the same as MPEG-1 (DCT and motion compensation) and MPEG-2 is backwards compatible with MPEG-1. MPEG-2 is also part of the ITU nomenclature as ITU-T Recommendation H.262.

MPEG-4 (ISO standard 14496) takes an object-oriented approach to video encoding. Visual scenes can be represented as a collection of objects, each with a specific encoding and compression format. Visual objects can be either synthetic or natural. Natural video objects are compressed using the DCT and motion compensation in basically the same manner as in MPEG-2.


ITU-T recommendations H.261 and H.263 are video compression standards targeted at teleconferencing applications at data rates up to 2 Mbit/s. Both are based on the DCT and motion compensation.

An excellent introduction to image and video compression standards is provided by Bhaskaran and Konstantinides [15].

3.7 Scalability and adaptivity

A salient feature of virtually all Internet protocols and standards is a strong focus on scalability. This is not surprising since the success of the Internet is dependent on its ability to support a large number of simultaneous users. Therefore, when designing a video communication system based on Internet technology, a fundamental concern must be the effects of scaling the system to many simultaneous users and large network topologies. This focus is further stressed by the fact that video communication is a very demanding application in terms of bandwidth and processing requirements.

The best-effort model of the current Internet, where all state information pertaining to an end-to-end communication session is kept at the endpoints, imposes a requirement on the applications, or the transport protocols used by the applications, to be adaptive to changing conditions. Moreover, heterogeneity in terms of connection capacity and end system capabilities calls for adaptive applications and protocols. An overview of adaptation strategies for Internet multimedia applications is provided by Wang and Schulzrinne [16].

3.7.1 Layered multicast

As discussed in section 3.4 above, IP multicast is a crucial component in making multipoint communication scalable to large networks and many users. Although many scalability problems with IP multicast persist, it is still a more scalable alternative than using reflectors. Furthermore, layered multicast is an approach to multipoint synchronous communication in heterogeneous network and computing environments, proposed as a more scalable alternative to using transcoding gateways.

The idea behind layered multicast is to utilize a layered encoding that transforms the media to be disseminated into a hierarchy of cumulative layers, each of which is transmitted to a unique IP multicast address. Each participant of the synchronous communication session can then independently decide on a suitable number of multicast groups to join, depending on the amount of bandwidth and CPU resources available. A flow control algorithm is needed at each receiver, determining the optimal number of groups to subscribe to, based on feedback from the network. Figure 4 shows an example of a layered multicast communication session.


Figure 4 Layered multicast scenario (a sender and high, medium and low quality receivers interconnected by multicast routers)
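
In a receiver-driven layered multicast session, adding or dropping quality corresponds to joining or leaving the multicast group of the next layer. A schematic sketch with hypothetical layer addresses; the decision logic that calls add_layer and drop_layer is the subject of the flow control algorithms discussed in the next section:

```python
import socket
import struct

# Hypothetical per-layer multicast groups; layer 0 is the base layer.
LAYER_GROUPS = ["239.1.1.1", "239.1.1.2", "239.1.1.3", "239.1.1.4"]
PORT = 5004

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))
subscribed = 0  # number of layers currently joined

def _mreq(group):
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))

def add_layer():
    """Join the multicast group of the next enhancement layer."""
    global subscribed
    if subscribed < len(LAYER_GROUPS):
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                        _mreq(LAYER_GROUPS[subscribed]))
        subscribed += 1

def drop_layer():
    """Leave the highest currently subscribed layer (never the base layer)."""
    global subscribed
    if subscribed > 1:
        subscribed -= 1
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_DROP_MEMBERSHIP,
                        _mreq(LAYER_GROUPS[subscribed]))
```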

3.7.2 Flow control and congestion avoidance

Flow control is the process of deciding the optimal transmission rate for a communication session. Congestion control is a form of flow control in which the objective of the transmission rate adjustment is to avoid or minimize congestion. Although flow control is a slightly more general term, the two are often used synonymously. Flow control is typically provided by transport protocols (e.g. TCP's congestion avoidance). For real-time multimedia applications, however, the transport protocol functionality, including flow control, is the responsibility of the application.

The crucial properties of Internet flow control algorithms are

• adaptability to dynamics in bandwidth availability,

• high utilization of network bandwidth,

• intra-protocol fairness, assuring fair bandwidth allocation among sessions using the same flow control algorithm,

• inter-protocol fairness, assuring fair bandwidth allocation among sessions using different flow control algorithms,

• fast convergence to an optimal operating point, and

• lightweight implementation characteristics.

Congestion control algorithms can be classified as reactive or proactive. A reactive algorithm detects congestion from the packet loss it causes and responds by adjusting the transmission rate. Proactive algorithms detect impending congestion and try to respond before packet loss is experienced. Thus, proactive mechanisms are beneficial in that the overall packet loss on a network can be reduced, but on the other hand reactive algorithms typically exhibit higher bandwidth utilization. Moreover, when reactive and proactive congestion control algorithms coexist on the same network, reactive algorithms are usually favored since they are more aggressive in terms of bandwidth allocation. For these reasons, congestion control on the Internet has hitherto been dominated by reactive algorithms.

Congestion control algorithms can also be categorized as either feedback-based or feedback-free. Feedback-based algorithms rely on feedback of status information from the receiver (or the group of receivers in the multicast situation). The rate is continually adjusted by the sender based on the status reports. Feedback-free schemes are typically used in multicast flow control, where the receivers are subject to heterogeneous bandwidth constraints. In a feedback-free multicast congestion control algorithm, each receiver individually controls the amount of data being received, without involving the sender. This requires a receiver-driven bandwidth allocation mechanism, e.g. layered multicast. Feedback-free schemes are more scalable to large multicast groups, since they circumvent the potential implosion of feedback information that feedback-based schemes must deal with. A comparison of feedback-based and feedback-free multicast congestion control algorithms is provided by Gorinsky and Vin [17]. Paper A in this thesis presents a novel proactive, feedback-free congestion control algorithm for layered multicast applications.
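
The proactive, feedback-free principle can be illustrated with a toy receiver-side detector that watches the trend of one-way delay estimates and signals when queuing appears to build up; this is only an illustration of the idea and not the algorithm of Paper A, and the window size and threshold are arbitrary:

```python
from collections import deque

class DelayTrendDetector:
    """Signal impending congestion from growing one-way delay estimates.

    Toy illustration of proactive, receiver-driven congestion detection.
    """
    def __init__(self, window: int = 50, threshold_s: float = 0.005):
        self.delays = deque(maxlen=window)
        self.threshold_s = threshold_s

    def on_packet(self, send_ts: float, recv_ts: float) -> bool:
        """Feed per-packet timestamps; return True if a layer should be dropped.

        A constant clock offset between sender and receiver cancels out,
        because only the change in delay over the window is considered.
        """
        self.delays.append(recv_ts - send_ts)
        if len(self.delays) < self.delays.maxlen:
            return False
        trend = self.delays[-1] - min(self.delays)
        return trend > self.threshold_s
```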

3.7.3 Scalable media encodings

In order to design scalable and adaptive multimedia communication systems, the media encodings need to be scalable in terms of bandwidth requirements and computational complexity. Most video compression algorithms in use today provide some means to trade off between quality and bandwidth, for instance by varying the frame rate, spatial resolution or the amount of quantization applied.

For multipoint conferencing with heterogeneous terminal capabilities, layered video encodings can be used in combination with a layered transmission architecture (e.g. layered multicast). Layered media encoding is also beneficial in point-to-point streaming of pre-encoded stored media, since the transmission rate can be adapted to the available bandwidth without re-encoding the media. Paper B in this thesis presents a highly scalable wavelet-based video compression algorithm.


4 Summary of included papers and their contributions

4.1 Paper A: Delay-based flow control for layered multicast applications

The first of the included papers presents a novel approach to flow control for layered multicast applications. Traditionally, packet loss has been used as a congestion signal for participants of a layered multicast session, indicating that the bandwidth must be lowered by leaving multicast groups. In contrast, the algorithm presented in this paper detects impending congestion from packet delay measurements performed by the receivers. An increasing delay, corresponding to increased queuing in router buffers, is responded to by leaving multicast groups. By predicting impending congestion before packet loss is experienced the overall packet loss rate is reduced compared to the traditional loss-based algorithms. This is of vital importance for loss-sensitive applications like real-time multimedia communication. Moreover, since the algorithm is feedback-free it is free from the scalability problems of feedback-based schemes. The performance of the algorithm in terms of resource utilization, intra- and inter-protocol fairness, overall loss rate and convergence time is explored through simulations.

4.2 Paper B: A scalable video compression algorithm for real-time Internet applications

In this paper a video compression algorithm targeted at real-time Internet applications is presented. The design of the algorithm is focused on achieving scalability in terms of computational complexity, bandwidth and quality while keeping the coding latency at a minimum. Wavelet transform coding in combination with a zerotree quantization scheme with temporal prediction and arithmetic coding are the building blocks of the algorithm. The performance of the algorithm in terms of compression efficiency is analyzed using a prototype implementation. The computational cost is estimated through complexity analysis. The compression performance of the algorithm is shown to be competitive with a popular non-layered video compression algorithm (MPEG-1). The scalability in terms of bandwidth is shown to be excellent, ranging from about 10 kbps to several Mbps. Trade-offs between quality and resource consumption are demonstrated to be possible in three different ways depending on receiver capabilities and preferences.

4.3 Paper C: An RTP to HTTP video gateway

The importance of the World Wide Web for the proliferation and penetration of the Internet is unquestionable. To take advantage of the prevalence of the WWW for video communication applications, an interconnection of the transport protocols for WWW and video traffic is proposed in this paper. HTTP, being the application level protocol used on top of TCP for WWW traffic, is poorly suited for video, but can nevertheless be used if the real-time requirements are relaxed. The motivation for doing so is that it facilitates the inclusion of live video in HTML pages for user-friendly display in a WWW-browser, in much the same way as web-cameras work. Moreover, it enables users located behind firewalls to easily participate in video communication sessions without requiring any re-configuration. In Paper C, the design and implementation of an application level gateway interconnecting the WWW with RTP-based multicast video applications is described. The paper also proposes a multicast flow control mechanism implemented by the transport protocol gateway. The gateway monitors the bandwidth of the TCP connections of its connected web-browser clients, and adjusts the multicast bandwidth accordingly. The transport protocol overhead is estimated for RTP and HTTP/TCP respectively, and is found to be approximately the same.

4.4 Paper D: Stereoscopic video transmission over the Internet

Stereopsis, the ability of the human visual system to perceive three-dimensional depth by means of binocular disparity, is a powerful sensory capability. Still, practically no visual communication systems to date support stereopsis. This paper explores the possibilities of stereoscopic video communication over the Internet by presenting the development of a novel stereoscopic video communication system. The paper contributes implementation and usage experiences to the Internet applications research community and analyzes the requirements for stereoscopic video communication systems. Furthermore, a transport protocol extension for identification and association of stereo video streams is presented along with guidelines for implementation. Finally, application domains expected to benefit from stereoscopic video communication systems are identified and discussed.

5 Future directions

The evolution of the Internet service model from a best-effort network for data exchange into a true multiservice network supporting voice, video and data will have significant implications for the design of high-quality multimedia communication systems. However, as numerous experiments within the research community have demonstrated, this development is not necessarily a sine qua non for the successful realization of large-scale Internet video communication systems. More likely, the prevalent situation, with adaptive applications that dynamically adjust to variations in network conditions, will remain valid. Gradual introduction of QoS support in certain regions of the Internet will make it possible for network operators to improve the service for customers of real-time communication applications, while maintaining the traditional end-to-end perspective on flow control and connection state maintenance.

A clear trend in computer and network architecture is towards mobile, wireless computing. Handheld computers are becoming increasingly powerful and wireless networks more widespread, enabling new types of communicative applications. The tremendous impact of cellular phones on ubiquitous interpersonal communication suggests a huge potential for more sophisticated mobile communication services. A convergence of technology between cellular phones and handheld computers, and between telephony and data networks, is thus clearly foreseeable. Wireless networking and mobile computing present significant challenges for video communication in terms of bandwidth limitations, limited processing power, constrained visualization and man-machine interface issues. To overcome these obstacles a continuing effort in the development of scalable media encodings and adaptive transmission architectures will be necessary, and many promising research issues will be encountered.

References

[1] ITU-T Recommendation H.323, "Packet based multimedia communication systems," International Telecommunication Union, Telecommunication Standardization Sector, Geneva, Switzerland, February 1998.

[2] H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson, "RTP: A transport protocol for real-time applications," IETF RFC 1889, January 1996.

[3] J. Postel, "User datagram protocol," IETF RFC 768, August 1980.

[4] D. Clark, D. Tennenhouse, "Architectural considerations for a new generation of protocols," Proceedings of SIGCOMM '90, pp. 200-208, Philadelphia, Pennsylvania, September 1990.

[5] M. Handley, H. Schulzrinne, E. Schooler, J. Rosenberg, "SIP: Session initiation protocol," IETF RFC 2543, March 1999.

[6] M. Handley, V. Jacobson, "SDP: Session description protocol," IETF RFC 2327, April 1998.

[7] H. Schulzrinne and J. Rosenberg, "A comparison of SIP and H.323 for Internet telephony," Proceedings of NOSSDAV '98, July 1998.

[8] M. Handley, C. Perkins, E. Whelan, "Session announcement protocol," IETF RFC 2974, October 2000.

[9] R. Wittmann, M. Zitterbart, "Multicast communication protocols and applications," Morgan Kaufmann Publishers, Academic Press, 2001.


[10] S. Deering, "Multicast routing in a datagram internetwork," PhD thesis, Stanford University, December 1991.

[11] W. Fenner, "Internet group management protocol, version 2," IETF RFC 2236, November 1997.

[12] H. Eriksson, "Mbone: The multicast backbone," Communications of the ACM 37(8), pp. 54-60, August 1994.

[13] X. Xiao, L. Ni, "Internet QoS: the big picture," IEEE Network Magazine, March 1999.

[14] R. Braden, L. Zhang, S. Berson, S. Herzog, S. Jamin, "Resource reservation protocol (RSVP)," IETF RFC 2205, September 1997.

[15] V. Bhaskaran, K. Konstantinides, "Image and video compression standards: Algorithms and architectures," second edition, Kluwer Academic Publishers, 1997.

[16] X. Wang, H. Schulzrinne, "Comparison of adaptive Internet multimedia applications," IEICE Transactions on Communication, Special issue on distributed processing for controlling telecommunications systems, vol. E82-B, no. 6, June 1999.

[17] S. Gorinsky, H. Vin, "The utility of feedback in layered multicast congestion control," Proceedings of the 11th International Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV 2001), June 2001.


Paper A

Delay-based flow control for layered multicast applications

Proceedings of the 12th International Packet Video Workshop, Pittsburgh, PA, April 2002


Delay-based flow control for layered multicast applications

Mathias Johanson

Framkom Research Corporation for Media and Communication Technology
Sallarängsbacken 2, S-431 37 Mölndal, Sweden

[email protected]

Abstract This paper presents an approach to flow control for real-time, loss-sensitive, layered multicast applications. The fundamentals of flow control for multicast applications are related and a novel delay-based flow control algorithm is introduced. The basic idea of the algorithm is to react to incipient congestion before packet loss occurs by monitoring variations in the one-way packet delay from sender to receivers. By using a hierarchical representation of the real-time data in combination with a layered multicast transmission model the flow control algorithm can be implemented entirely in the receivers. Furthermore, by constraining the bandwidth of the layers to a well-defined rate, the congestion control can be accomplished almost entirely without packet loss. This is particularly suitable for real-time multimedia conferencing applications that are inherently multipoint and loss-sensitive. The performance of the flow control algorithm in terms of link utilization, inter- and intra-protocol fairness, session scalability and loss probability is evaluated through extensive simulation.

1 Introduction

One of the reasons why the Internet has been so successful in supporting large numbers of simultaneous users is the ability of the network protocols to adapt to changing conditions. Specifically, the transport protocol used for most Internet traffic, TCP, includes a flow control algorithm that adapts the packet transmission pace of the sender so as not to congest the network [1]. The algorithm tries to experimentally find the optimal transmission rate by gradually increasing the rate until packet loss is experienced. However, delay sensitive applications like audio and video conferencing tools do not use TCP because of its poor real-time properties. Rather, these applications use the UDP and RTP protocols, leaving the flow control entirely to the application. In point-to-point configurations, flow control can be implemented by utilizing a rate adaptive coding algorithm, wherein feedback from the receiver is used to periodically adjust the media encoding parameters to match the available bandwidth [2]. For multipoint configurations where the receivers typically are subject to disparate bandwidth limitations a more sophisticated arrangement is needed. One approach is to use audio/video gateways that transcode the media to match the available bandwidth of each receiver. This has the drawback of requiring specialized network configurations and is inherently not very scalable. Another approach is to use a layered multicast
transmission scheme wherein a hierarchical representation of the data is transmitted to a set of multicast group addresses that can be subscribed to individually by the receivers. The number of groups subscribed to determines the bandwidth utilization for each receiver and consequently the quality of the decoded media. In order for multipoint real-time multimedia applications to be realized on a large scale, a flow control algorithm is needed that can adapt the bandwidth of the multicast flows to the network and host resources available for each independent receiver. Since real-time multimedia streams are sensitive to packet loss, a flow control algorithm that can detect congestion before packet loss occurs is desirable. For ease of deployment in existing network environments the flow control should ideally not be dependent on changes to network routers or switches.

2 Flow control algorithms for layered multicast

Flow control for layered multicast applications is implemented solely in the receivers. By joining and leaving multicast groups as the network load changes the receivers can dynamically adapt to the available bandwidth. The decision of when to join groups, leave groups or remain at the same level is the task of the flow control algorithm. Several approaches have been suggested:

A technique generally referred to as receiver-driven layered multicast (RLM) was proposed by McCanne, Jacobson and Vetterli [3]. In this scheme the receivers periodically perform what is known as a join experiment, wherein a receiver tentatively joins an additional multicast group and monitors packet loss to determine whether the additional bandwidth causes congestion. If so, the layer is dropped and the application concludes that the optimal subscription level is reached. If no packet loss is experienced, the application proceeds to subscribe to additional layers until the optimal number of layers is reached. To avoid the implosion of join experiments that would result if all receivers in a potentially large group performed their join attempts independently, the experiments are coordinated. This is done by having the member that is about to perform a join experiment multicast a message to all the other receivers, declaring its intention to perform an experiment for a certain layer. In this way all receivers can determine for themselves whether the experiment caused congestion or not, and may not need to perform an experiment of their own. This procedure is known as shared learning.
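A minimal sketch of the join-experiment idea (an illustration only, not the RLM implementation; the shared-learning coordination is omitted and the join_group, leave_group and loss_detected callbacks are hypothetical, caller-supplied functions):

import time

def join_experiment(level, join_group, leave_group, loss_detected, probe_time=2.0):
    """Tentatively join layer level+1; back off if the extra bandwidth causes packet loss."""
    join_group(level + 1)                    # tentative subscription to the next layer
    deadline = time.time() + probe_time
    while time.time() < deadline:
        if loss_detected():                  # congestion attributed to the experiment
            leave_group(level + 1)
            return level                     # stay at the previous subscription level
        time.sleep(0.1)
    return level + 1                         # no loss observed: keep the new layer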

Vicisano, Rizzo and Crowcroft elaborated on this scheme by introducing the concept of synchronization points [4]. In this model receivers are only allowed to perform join experiments immediately after receiving a synchronization packet from the sender. Synchronization packets are sent periodically as flagged packets in the encoded media stream. This proves to be more scalable than the shared learning algorithm of RLM.

The problem with these algorithms is that they use packet loss as a congestion detection signal. Since there is no corresponding signal when the network gets unloaded, the applications must repeatedly perform join experiments to probe for available bandwidth. Packet loss caused by the failed join experiments will negatively impact the quality of the received data, not only for the member performing the experiment, but for each member located behind the same bandwidth bottleneck. The problem is further aggravated by the fact that the pruning of the reverse data path to the sender after a multicast leave operation can take a substantial amount of time (up to a few seconds), which means that the congestion caused can be relatively long-lasting. To avoid the negative effects of failed join attempts, the experiments must not be performed too frequently. On the other hand, too infrequent experiments have serious implications for the rate of convergence to the optimal operating point and make the application less responsive to bandwidth fluctuations.

What is needed is a way of telling that the network is becoming congested before packet loss is experienced. At the onset of congestion, queues start to build up at network routers, leading to an increased end-to-end delay. Several congestion avoidance algorithms for TCP (most notably TCP Vegas [5]) have been proposed based on reacting to changes in the round-trip time (RTT) from the time a segment of data is sent until it is acknowledged by the receiver [5, 6, 7]. Wu et al. proposed a layered multicast transmission architecture called ThinStreams that, in the spirit of TCP Vegas, uses the difference between the expected throughput and the actual throughput as a means to detect congestion [8]. To calculate the expected throughput the ThinStreams algorithm requires a constant bitrate for each multicast layer. This paper suggests an approach to layered multicast congestion avoidance based on direct measurements of packet delay variations. Unlike the ThinStreams approach it does not require a constant bitrate for the layers and hence imposes fewer restrictions on the layered media encoding.

3 Delay-based layered multicast flow control

In order for the flow control algorithm to be able to respond to congestion before packet loss occurs, the variations in packet transmission delay can be used to detect congestion. An increasing delay indicates that router buffers are filling up and must be responded to by lowering the effective bandwidth. Similarly, a delay that has decreased below some threshold indicates that it might be possible to increase the bandwidth. The rate control is performed in the receivers by joining and leaving multicast groups as appropriate. To avoid packet loss the increase in bandwidth resulting from joining an additional group must be small enough for the network to buffer the excessive packets for the time it takes the receivers to detect the congestion and respond to it by leaving the group. This time is prolonged by the fact that the packet forwarding will proceed at multicast routers until the prune message of the leave operation is propagated back through the reverse multicast path. By carefully assigning an upper limit to the bandwidth of each layer (corresponding to a multicast group), packet loss as the result of joining an additional group can be avoided. To compute this bandwidth limit assume that
Q is the minimum queue size in use on the network and that L is the leave latency. Then the bandwidth limit B is

B ≤ Q / L.

If we conservatively assume Q to be 5 Kbytes and L to be 2 seconds we get a bandwidth limit of 20 kilobits per second (kbps). The organization of data into layers at the transmitter should thus be made with a granularity of approximately 20 kbps. For real-time multimedia data this granularity is probably sufficiently small since the improvement in perceived quality by a refinement signal in the magnitude of 20 kbps is likely to be rather moderate for both audio and video.
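As a rough back-of-the-envelope check (not part of the paper; the values are simply the example figures quoted above):

# Per-layer bandwidth budget B <= Q/L, using the example values from the text.
Q = 5 * 1024 * 8        # smallest router queue assumed to be in use, in bits (5 Kbytes)
L = 2.0                 # worst-case multicast leave latency, in seconds
B = Q / L               # per-layer bandwidth limit, in bits per second
print(f"per-layer limit: {B / 1000:.1f} kbps")   # roughly 20 kbps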

Given the above data organization and the layered multicast transmission architecture, what we now need is a way to monitor variations in packet delay. Recall that in TCP Vegas the round trip time is used to measure the variations in throughput. For multicast transmission, however, a round trip delay cannot be computed since the network path from sender to a receiver is not generally the same as the path from the receiver to the sender and thus cannot give a reliable measure of the buffering in the multicast data path. Nevertheless, the variations in transmission delay can be measured by a scheme involving timestamping the packets at the transmitter and clocking the arrival times of packets at the receivers.

3.1 Variable transmission delay estimation

The one-way transmission delay from a source to a receiver can be seen as consisting of two parts: the fixed propagation delay and the variable delay due to buffering. The variable delay that interests us can be determined in the following way.

Let the source put a timestamp in every packet that reflects the sending time of that packet. Then the variable delay for packet i, δ_i, is

δ_i = (r_i - r_0) - (t_i - t_0),

where r_i and t_i are the arrival and sending times of packet i respectively. Note that the delay calculations are performed only by the receiver and that the values of t_i are determined from the timestamp in packet i. For the algorithm to give a reliable estimation of the variable delay the first (reference) packet must be transmitted when the network is uncongested, that is δ_0 = 0. A reference packet with a non-zero variable delay will result in negative variable delays once the
network gets uncongested. This is an indication that the values of t_0 and r_0 must be reassigned (i.e. a new reference packet is chosen).

The RTP protocol that is used to fragment audio and video into UDP packets defines a packet header that includes a timestamp field, primarily intended to be used for things like playout scheduling and cross-media synchronization. The recommended clock frequency of the RTP timestamps is 90 kHz for video content and 8 kHz for audio [9]. The variations in transmission delay are typically in the order of 10 to 100 ms, so both clock frequencies are sufficiently high resolution for the delay estimation. (For example, a 10 Kbytes router buffer and a wire speed of 1 Mbps gives a maximum delay of 80 ms.)

To prevent measurement noise from impacting the join/leave decision algorithm, the packet delay estimation should be calculated as a running average over a number of measurements. That is, the delay estimation for the i-th packet, δ̂_i, is given by

δ̂_i = (1/N) Σ_{k=0}^{N-1} δ_{i-k},

where N is the number of delay measurements used to compute the average. In the simulations and the implementation presented in this paper, a value of N=20 was used.
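A minimal receiver-side sketch of this estimator (an illustration of the formulas above, not the paper's implementation; it assumes that the sender timestamps t_i and the local arrival times r_i are expressed in the same time units):

from collections import deque

class DelayEstimator:
    def __init__(self, n=20):
        self.ref = None                  # (t_0, r_0) taken from the reference packet
        self.samples = deque(maxlen=n)   # last N variable-delay measurements

    def update(self, t_i, r_i):
        """t_i: sender timestamp of the packet, r_i: local arrival time."""
        if self.ref is None:
            self.ref = (t_i, r_i)
        t0, r0 = self.ref
        delta = (r_i - r0) - (t_i - t0)
        if delta < 0:                    # negative delay: choose a new reference packet
            self.ref = (t_i, r_i)
            delta = 0.0
        self.samples.append(delta)
        return sum(self.samples) / len(self.samples)   # smoothed queuing-delay estimate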

Note that the algorithm relies heavily on the fact that the sender's and receiver's system clocks are isochronous (that is, that they tick at the same speed). This could potentially be a serious deployment problem, since workstation clocks are frequently badly tuned. Note also that the algorithm does not require the clocks to have the same conception of absolute time. The issue of clock synchronization is beyond the scope of this paper, but techniques exist to synchronize clocks (both in terms of absolute time and clock frequency) down to microsecond precision [10].

3.2 Fairness

In order for layered multicast applications to be successfully realized in existing network environments it is important that the flow control algorithm adjusts the rate of the traffic so that the application competes in a fair way for bandwidth with other applications. To this end one can distinguish three different fairness issues that can be considered crucial:

1. Fairness among members of the same layered multicast session

2. Fairness among different sessions of the same layered multicast application

3. Fairness to TCP


Fairness among members of the same session and among members of different layered multicast sessions can be realized by adjusting the threshold delay values used in the algorithm to decide whether to join multicast groups, leave multicast groups or remain at the same subscription level. By decreasing the leave threshold and the join threshold with increasing layer numbers, receivers at lower subscription levels will be more inclined to join new layers and less inclined to drop layers compared to receivers at higher subscription levels. This means that on a heavily loaded network connection with many competing sessions, the receivers subscribed to more layers will be more responsive to increased packet delays and hence will make leave decisions sooner than receivers at lower subscription levels. Similarly, at decreasing network load, receivers at lower subscription levels will join groups before receivers at higher levels. This will lead to a fair sharing of the available bandwidth between the members of a session and between the sessions, provided that all sessions use the same flow control algorithm.

Fairness to TCP's flow control is important since the bulk of network applications in use on the Internet are based on TCP. The throughput of a TCP session can be shown to be inversely proportional to the product of the round trip time (RTT) and the square root of the packet loss rate [14]. Since the throughput of the layered multicast flow control presented in this paper is independent of the packet loss rate and the round trip time, the concept of fairness to TCP is not well-defined. Furthermore, since the types of applications targeted by the multicast flow control are very different from the "typical" TCP application, the relative performance exhibited by competing TCP sessions is not an immediately appropriate point of comparison. For instance, two TCP sessions with different RTTs will allocate the bandwidth of a shared bottleneck unevenly. While this "unfairness" can be motivated in the TCP case, it does not make much sense for two participants of a multicast videoconference, sharing a bandwidth bottleneck, to receive the video at different rates depending on the distance to the sender. The important point to be stressed is that real-time multimedia data need to be rate-controlled in some way in order to coexist with TCP on congested links. This behavior is sometimes referred to as TCP-friendliness.
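For reference, the macroscopic throughput model from [14] referred to above can be written as T ≈ (MSS / RTT) · (C / √p), where T is the TCP throughput, MSS the maximum segment size, p the packet loss rate and C a constant on the order of one whose exact value depends on the loss pattern and acknowledgment strategy.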

3.3 The join/leave decision algorithm

The flow control algorithm implemented by each receiver of a layered multicast session uses the measured queuing delay, δ̂_i, as an indication of whether the layer subscription level should be increased or decreased. By considering not only the magnitude of the delay but also the rate of change, the algorithm can respond earlier to impending congestion. Since the algorithm responds to congestion by leaving a multicast group, the effect of lowered bandwidth is not manifested until the multicast delivery-tree is pruned back to the sender. Thus, in order to be able to respond in time, the algorithm needs to predict the congestion level at some
time ahead determined by the leave latency. If y(t) is the queuing delay at time t and L is the leave latency then the queuing delay at time t + L can be predicted by

y(t + L) = y(t) + Ly'(t).

Now, in order to prevent loss,

y + Ly' < M,

where M is the maximum queuing delay in the network. The value of M can be experimentally learned by initializing it to a conservatively small value and adjusting it whenever a larger delay is experienced. The leave latency can also be found experimentally by using the algorithm described in [8]. Alternatively, a preconfigured upper limit can be used. The algorithm continually computes y(t+L) and whenever the value is above a certain limit (the leave threshold) a layer is dropped. To decide when to join an additional layer the algorithm uses the value of y(t) directly, instead of the predicted y(t+L). This asymmetry is due to the fact that the join decisions should not be made in a way that keeps the network in a constantly congested state. Whenever the value of y(t) is below the join threshold an additional multicast group is subscribed to.
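As an illustration, the prediction step could be realized as follows (a sketch; the finite-difference estimate of y'(t) from two consecutive smoothed delay samples is an assumption of the sketch, since the paper does not prescribe a particular derivative estimator):

def predicted_delay(y_prev, y_now, dt, leave_latency):
    """Linear extrapolation y(t + L) = y(t) + L*y'(t) from two smoothed delay samples."""
    y_rate = (y_now - y_prev) / dt             # finite-difference estimate of y'(t)
    return y_now + leave_latency * y_rate      # value compared against the leave threshold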

In order to ensure inter- and intra-session fairness, as discussed in section 3.2, the join and leave thresholds should depend on the layer subscription level. The threshold functions are designed in a way that assures that all members of the same session sharing a bandwidth bottleneck eventually converge to roughly the same number of layers. Following the discussion in section 3.2 it is clear that both the join and leave thresholds should decrease with increasing layer subscription level. In the current implementation the join and leave threshold values are calculated using functions that decrease quadratically with the number of layers joined. The range of the join threshold function is from zero to 75 percent of the maximum delay whereas the range of the leave threshold is from 65 to 100 percent. The appropriateness of these functions and parameter values was determined experimentally from simulation results.

3.4 Scheduling the join/leave operations

Since the flow control algorithm is designed to detect and respond to congestion before packet loss occurs there is no need to synchronize the join operations from different receivers of the same session. The situation is different for algorithms that detect congestion from packet loss, since uncoordinated join attempts in this case will lead to constant congestion and packet loss. However, members of different sessions sharing the same bottleneck link can cause packet loss if they join new groups simultaneously. This is because the aggregate bandwidth change
can be larger than what the network can buffer if many receivers join layers at the same instant in time. To prevent this from happening the join operations performed by receivers from different sessions need to be decorrelated. Since a strict decorrelation is hard to realize without negatively affecting scalability and convergence time, a reasonable approximation can be achieved by scheduling the join operations using a pseudo-randomized timer. By having the timer interval increase as more layers are joined, the applications are allowed to converge relatively fast to a reasonable quality level.

Since the leave operation does not have the desired effect (of lowered congestion) until all members subscribed to the same layer leave, the leave operations for members of the same session at the same subscription level should ideally be synchronized. However, all members sharing a bandwidth bottleneck will experience the same variations in packet delay and therefore the leave operations will be reasonably synchronized automatically if only the leave decisions are scheduled frequently enough. Given that the effect of the leave is not manifested in a lowered packet delay until the multicast tree is pruned back to the sender, the receivers must defer their next leave decision for a time equal to the leave latency to avoid dropping more than one layer in response to the same congestion signal. This is easily implemented with a hold-down timer after a leave.

3.5 The delay-based layered multicast flow control algorithm

The algorithm outlined above can be described by the following pseudo-code segment.

y  = current queuing delay
y' = rate of change of y
M  = maximum delay
n  = number of layers joined
N  = maximum number of layers
t  = current time
L  = leave latency

join_threshold  := 0.75*M*(1 - sqrt(n/N))
leave_threshold := M*(0.65 + 0.35*(1 - sqrt(n/N)))

if ( y + L*y' > leave_threshold and t > leave_timeout )
    drop_layer(n)
    n := n-1
    leave_timeout := t + L

if ( y < join_threshold and t > join_timeout )
    add_layer(n+1)
    n := n+1
    join_timeout := t + (n/N + random(0, 0.5))*L


The procedures add_layer(n) and drop_layer(n) are assumed to implement the joining and leaving of multicast groups corresponding to layer n. The random(x, y) function is assumed to return a random value between x and y.

4 Simulation results

The behavior of the flow control algorithm described in section 3 has been simulated using the network simulator ns [11]. The topologies used for the simulations are depicted in Figure 1. Each simulation used a packet queue length of 20 packets and a dense multicast routing protocol. The transmission delays on the links were 10 ms unless otherwise noted.

Figure 1 Topologies used in simulations:
Topology 1: One sender and one receiver connected by a 128 Kbit/s link.
Topology 2: One sender and n receivers connected by links with different bandwidths.
Topology 3: One sender and 3n receivers connected at three different bandwidth levels.
Topology 4: n senders with one receiver each, sharing a B Kbit/s bottleneck link.

4.1 Link utilization and intra-session fairness

The first simulation was performed using the simplest possible topology; one layered multicast sender and one receiver connected with a point-to-point link (topology 1 in Figure 1). The aim of the simulation was to test the link utilization on a network connection with no intervening traffic. The sender transmits ten layers of approximately 20 kbps each, resulting in a total bandwidth requirement of 200 kbps. The link bandwidth is 128 kbps, so theoretically the receiver should be able to receive six layers (6*20 = 120 kbps) without congesting the network.


Figure 5 shows how the algorithm quickly joins seven layers before the network becomes congested. Then two layers are dropped in response to increased packet delay and throughout the simulation the receiver oscillates between the sixth and seventh layer. This shows that the algorithm indeed utilizes the available bandwidth as expected. The simulation was conducted without packet loss.

Figure 5 Number of multicast groups joined by the application

A slightly more complex situation is given by topology 2. Here n receivers are connected at different link speeds to a sender with the same characteristics as in the previous simulation. This topology was used to test the algorithm's ability to converge to different bandwidths in a heterogeneous network environment. Figure 6 shows the bandwidth allocation resulting from a simulation with three receivers (n=3) and a 256 kbps capacity of the shared link (B=256). The network paths to receivers R1, R2 and R3 had capacities of 256, 128 and 64 kbps respectively. The expected result is that R1 should be able to receive all ten layers of the transmission, whereas R2 and R3 should converge to six and three layers respectively. The results of this simulation indicate that different receivers of the same session can converge to different bandwidths. No packet loss was experienced on any of the links.

Figure 6 Bandwidth consumed by three members of the same session

A configuration with one sender and three sets of n receivers located behind bottleneck links is given by topology 3. The resultant bandwidth utilization when the sender transmits 20 layers of 20 kbps each and a value of n=5 is depicted in Figure 7. The receivers can be seen to converge to three distinct bandwidth levels: The five receivers of the uppermost cluster in topology 3 receive the full 400 kbps
(all 20 layers), whereas the rightmost five receivers are limited by the shared 256 kbps bottleneck and the lowermost five receivers are confined to 128 kbps. Again, the simulation was concluded without any packet loss.

Figure 7 Bandwidth consumed by the 15 receivers of topology 3

4.2 Inter-session fairness

To investigate the performance of multiple independent sessions sharing the same bandwidth bottleneck a large number of simulations were conducted using topology 4, with different values for the number of sessions, n, and the bottleneck bandwidth, B. Figure 8 shows the result of a configuration with two senders, S1 and S2, and one receiver for each session, R1 and R2, with a bottleneck bandwidth of 256 kbps. Both senders transmit ten 20 kbps layers resulting in an aggregate bandwidth requirement of 400 kbps for the shared link. Receiver R1 is started first and initially joins all ten layers resulting in an allocation of 200 kbps out of the available 256 kbps. Then, after approximately ten seconds, receiver R2 is started and the two receivers can be seen in Figure 8 to converge to approximately 128 kbps each. Thus, a fair sharing of the bottleneck bandwidth is achieved.

Similar results were obtained when simulating using topology 4 for many different values of n and B. The algorithm approximately allocated the bandwidth B/n to each session. Hence the algorithm can be seen to share network resources in a fair way among independent sessions.

Figure 8 Bandwidth allocated by two members of different sessions


4.3 Scalability

The primary motivation for delay-based flow control is that the packet loss rate can be reduced compared to loss-based algorithms. The simulations involving only one sender and many receivers can be performed entirely without packet loss resulting from congestion. This is not surprising since the flow control algorithm was designed to predict and react in time to pending congestion, provided that the layers of the encoded media are sufficiently narrowband. If more than one layered multicast session is active simultaneously, however, the increase in bandwidth resulting from two or more receivers of different sessions joining simultaneously can be higher than what the router queues can withstand. In order to investigate the scalability of the algorithm when the number of sessions grows, a number of simulations were conducted using topology 4 with increasing values of n. The bottleneck bandwidth, B, was scaled in proportion to the number of sessions for each simulation. Figure 9 illustrates the average and worst-case loss rate performance. The loss rates were computed in non-overlapping windows one second wide. As can be seen, the average loss rate is about 1%, whereas the worst-case loss rate is about 2% of the total bandwidth. In comparison, McCanne et al. report a short-term worst-case loss rate of about 10% for RLM and a long-term loss rate of about 1% [3]. Vicisano et al. report loss rates of about 7-8% for a simulation with 32 senders using their TCP-like congestion control scheme [4].

Figure 9 Loss rate when superpositioning independent sessions (average and worst-case loss rate in percent versus the number of sessions)

4.4 TCP friendliness

Figure 10 shows the bandwidth of a layered multicast session simulated using topology 1 with the addition of competing TCP traffic. The TCP traffic consisted of one FTP session and ten Telnet sessions. When the simulation is started the layered multicast session can be seen to allocate all the available bandwidth of the link. Then, after 10 seconds, the FTP file transfer and the Telnet sessions are started and last for approximately two minutes. It is clear that the multicast
session yields bandwidth in favor of the TCP sessions. After the file transfer has ended the multicast session regains the full bandwidth.

Figure 10 Bandwidth consumed by a layered multicast receiver in presence of TCP-traffic

When simulating this scenario with a 10 ms delay on the shared connection, as in Figure 10, the bandwidth is shared evenly between the multicast traffic and the TCP traffic. However, since the performance of TCP is dependent on the RTT, a less fair sharing will be obtained if the delay is increased. To compare the bandwidth allocation of TCP with that of our multicast flow control we perform the same simulation as above but with different values of the link delay. Then we calculate the fairness index, defined as the ratio of the bandwidth allocated by the multicast application to the bandwidth allocated by TCP. The result is plotted in Figure 11 for both droptail and RED routers.

Figure 11 TCP friendliness index (fairness index versus link delay in ms, for DropTail and RED routers)

The multicast flow control obviously gets more aggressive in terms of bandwidth allocation compared to TCP when the link delay increases. For a 200 ms link delay
the multicast session allocates almost three times as much bandwidth as the TCP sessions when using droptail routers. For RED routers the multicast session is favored even more in terms of bandwidth allocation. This is an expected finding since RED routers will drop packets before router buffers are filled leading to an earlier response from TCP's congestion avoidance, whereas the multicast flow control is unaffected.

5 Summary and conclusions

Large-scale deployment of multipoint real-time conferencing applications in heterogeneous network environments requires a sophisticated flow control protocol. The protocol must be scalable to a large number of users, efficient in terms of resource utilization, fair to other data streams, adaptive to changing network conditions, and relatively light-weight for ease of implementation. In this paper, an approach to flow control for layered multicast applications was presented that relies on packet delay measurements to detect and avoid congestion. The algorithm was shown by simulation to interoperate in a fair way, in terms of resource allocation, among members of the same session as well as between instances of different sessions. Furthermore, the overall packet loss rate was seen to be very moderate when superpositioning independent sessions. The behavior of the algorithm in the presence of TCP traffic was seen to be TCP-friendly for low delay links and increasingly favorable for the multicast traffic at higher link delays. Further work will be needed to study the behavior of the algorithm in more complex network topologies and with larger sessions.

References

[1] V. Jacobson, "Congestion avoidance and control," Proceedings of SIGCOMM '88, August 1988.

[2] J. C. Bolot, T. Turletti, "A rate control mechanism for packet video in the Internet," Proceedings of IEEE Infocom '94, June 1994.

[3] S. McCanne, V. Jacobson, M. Vetterli, "Receiver-driven layered multicast," Proceedings of ACM SIGCOMM '96, August 1996.

[4] L. Vicisano, L. Rizzo, J. Crowcroft, "TCP-like congestion control for layered multicast data transfer," Infocom '98, San Francisco, March 1998.

[5] L. Brakmo, S. O'Malley, L. Peterson, "TCP Vegas: New techniques for congestion detection and avoidance," Proceedings of ACM SIGCOMM '94, pp 24-35, May 1994.

[6] R. Jain, "A delay-based approach for congestion avoidance in interconnected heterogeneous computer networks," ACM Computer Communication Review, October 1989.


[7] Z. Wang, J. Crowcroft, "Eliminating periodic packet losses in 4.3 Tahoe BSD TCP Congestion Control Algorithm," ACM Computer Communication Review, April 1992.

[8] L. Wu, R. Sharma, B. Smith, "ThinStreams: An architecture for multicasting layered video," Proceedings of NOSSDAV'97, May 1997.

[9] H. Schulzrinne, "RTP profile for audio and video conferences with minimal control," RFC1890, January 1996.

[10] D. L. Mills, "Network time protocol (version 3) specification, implementation and analysis," RFC1305, March 1992.

[11] S. McCanne, S. Floyd, "The LBNL Network Simulator," Software on-line, http://www-nrg.ee.lbl.gov/ns

[12] M. Johanson, "Scalable video conferencing using subband transform coding and layered multicast transmission," Proceedings of ICSPAT'99, November 1999.

[13] xntpd, "The network time protocol daemon," Software on-line, http://www.ntp.org

[14] M. Mathis, J. Semke, J. Mahdavi, T. Ott, "The macroscopic behaviour of the TCP congestion avoidance algorithm," Computer Communications Review, vol. 27 no. 3, July 1997.

[15] T. Turletti, S.F. Parisis, and J. Bolot, "Experiments with a layered transmission scheme over the internet," IEEE INFOCOM'98, February 1998.

[16] T. Turletti and J. C. Bolot, "Issues with multicast distribution in heterogeneous packet networks," 6th International Workshop on Packet Video, September 1994.

[17] J. C. Bolot, T. Turletti, I. Wakeman, "Scalable feedback control for multicast video distribution in the internet," ACM SIGCOMM 1994, August 1994.


Paper B

A scalable video compression algorithm for real-time Internet applications

Pending publication


A scalable video compression algorithm for real-time Internet applications

Mathias Johanson

Framkom Research Corporation for Media and Communication Technology
Sallarängsbacken 2, S-431 37 Mölndal, Sweden

[email protected]

Abstract Ubiquitous use of real-time video communication on the Internet requires adaptive applications that can provide different levels of quality depending on the amount of resources available. For video coding this means that the algorithms must be designed to be scalable in terms of bandwidth, processing requirements and quality of the reconstructed signal. This paper presents a novel video compression and coding algorithm targeted at delay-sensitive applications in heterogeneous network and computing environments. The algorithm, based on the embedded zerotree wavelet algorithm for still image compression, generates a highly scalable layered bitstream that can be decoded at different qualities in terms of spatial resolution, frame rate and compression distortion. Furthermore, the algorithm is designed to require only a minimal coding delay, making it suitable for highly interactive communication applications like videoconferencing. The performance of the proposed algorithm is evaluated by comparison with a non-scalable codec and the penalty in compression efficiency that the scalability requirement imposes is analyzed. The codec is shown to produce a scalable bitstream ranging from about 10 kbps to 10 Mbps, while the computational complexity is kept at a level that makes software implementation on CPU-constrained equipment feasible.

1 Introduction

The evolution of the Internet has enabled a new class of synchronous multimedia communication applications with high demands on delay and bandwidth. Not only does this affect network and transport protocols, but it also has a profound impact on the design of media encoding and compression algorithms. For teleconferencing applications the coding delay must be kept at a minimum while a high compression performance is maintained to efficiently utilize the available bandwidth. For videoconferencing this is of particular importance due to the high bandwidth and complexity imposed by video transmission and processing. Furthermore, since the Internet is a highly heterogeneous environment, both in terms of link capacity and terminal equipment, video codecs need to be able to generate bitstreams that are highly scalable in terms of bandwidth and processing requirements. As the current Internet provides only a single class of service, without guarantees on bandwidth or loss rate, the applications need to be adaptive to variations in throughput and loss probability. The dissimilar requirements imposed by different applications and the heterogeneity problems have given birth
to a multitude of video compression algorithms with different target bitrates, complexities and qualities. Multipoint videoconferences, where the participants in the general case are subject to different bandwidth constraints, can be realized using transcoding gateways that re-code the video to different bandwidths. This is problematic since it introduces delay and complexity and limits scalability. Furthermore, in a network environment without strict quality of service guarantees, where the instantaneous load is not predictable, it is hard to identify where the transcoding gateways should be placed. Another approach is to encode the media using a hierarchical representation that can be progressively decoded and assign the layers of the encoded signal to a set of distinct multicast addresses [22, 23, 24, 25]. In this layered multicast transmission architecture each receiver individually chooses a quality suitable for the network and computing resources available, by joining an appropriate number of IP multicast groups. This is the target application for the video compression algorithm presented in this paper. While scalable encoding schemes based on the standard video compression algorithms have been designed (MPEG-2 scalable profile [12], H.263+ [11]), the scalability requirement has clearly been added as an afterthought, resulting in high complexity and suboptimal performance. The goal of the work presented here is to design a compression algorithm with the scalability property as one of the fundamental requirements.

2 Layered video compression algorithms

A layered video encoding is a representation that splits a digital video signal into a number of cumulative layers such that a progressively higher quality signal can be reconstructed the more layers are used in the decoding process. The layering can be performed in three ways, viz. spatial layering, temporal layering and layered quantization (also known as signal-to-noise-ratio scalability). In spatial layering the video can be reconstructed at successively higher spatial resolutions, while temporal layering implies that the frame rate of the video sequence can be progressively increased. In signal-to-noise-ratio (SNR) layering the quantization of the video images is refined with each added layer. While all three techniques result in a layered bitstream, the nature of the layering techniques is very different and they address different aspects of the heterogeneity problem. With spatial layering the resolution of the decoded video images can be chosen depending on the resolution of the display. Temporal layering provides different levels of frame update dynamics in the video, whereas SNR scalability varies the compression distortion of each individual frame. The type of layering that is most suitable depends on the application, on user preference and on the level of overall scalability desired. A good scalable video codec should ideally provide all three types of layering simultaneously so that each decoder can individually trade off between spatial resolution, temporal resolution and fidelity, given a certain resource limit. Thus the three layering techniques should be orthogonal so that they can be applied independently.


The key challenge when designing a layered video compression algorithm is to keep the compression efficiency high while providing a high level of scalability. Intuitively, a non-scalable codec should perform more efficiently than a scalable one at a given bandwidth or distortion level. This assumption was verified by Equitz and Cover, who proved that a progressive encoding can only be optimal if the source possesses certain Markovian properties [10]. Nevertheless, a number of layered video codecs have been proposed.

The scalable mode of H.263+ defines a layered codec that provides all three modes of layering discussed above. In H.263, as well as in the MPEG video coding standards, spatial redundancies within individual images are reduced by the discrete cosine transform applied to eight-by-eight pixel blocks. Predictive coding with motion compensation is performed in the pixel domain with reference to a past frame, or bi-directionally with reference to both past and future frames. Spatial scalability is achieved by subsampling each frame until the desired resolution of the base layer is reached. The low-resolution image thus obtained is compressed using predictive coding and DCT, whereupon the frame is decompressed and upsampled so that an error signal constituting the enhancement layer can be computed. The process is repeated for each enhancement layer. SNR scalability is achieved in basically the same manner, except that instead of resampling the frames the binsize of the quantizer is refined at each level. Temporal scalability is achieved by assigning bi-directionally predicted pictures to the refinement layers to increase the frame rate of the decoded video. The scalable modes of MPEG and H.263+ work in basically the same way. The basic problem is that the prohibitively high complexity introduced limits the total number of layers that are feasible. Also, the efficiency of the coding can be expected to be far worse compared to the baseline algorithm, although few experimental results have been published.

Another class of scalable video codecs are based on the discrete Wavelet transform (DWT). Since the DWT, when used to reduce spatial redundancies in image compression, is applied to the whole image as opposed to the block-based DCT, the algorithm provides a multiresolution image representation without the need for an explicit subsampling operation. The Wavelet coefficients can be progressively quantized in the same way as is performed in the block-based algorithm. Alternatively, Shapiro's Embedded Zerotree Wavelet (EZW) coding [4] or Said's and Pearlman's related algorithm based on set partitioning in hierarchical trees (SPIHT) [5] could be used to successively refine the coefficients. The key issue in Wavelet-based video codec design is how to exploit the temporal correlation between adjacent images to increase compression performance. One approach, pioneered by Taubman and Zakhor, is to extend the 2D DWT to three dimensions and apply the transform in the temporal dimension as well [1, 2, 3, 17, 19]. Apart from a dramatic increase in computational complexity this approach also has the drawback of requiring images to be buffered prior to transmission, for at least as many frames as there are iterations of the wavelet transform. This generates an unacceptable coding delay for delay-sensitive applications. Another approach is to perform predictive coding and motion compensation in the pixel
domain and then to compress the residual images using the DWT. This scheme is inherently incompatible with the scalable layering mode however, since a full resolution frame needs to be decoded before motion compensation can be performed. Another problem is that block-based motion compensation often results in blocking artifacts at high quantization ratios. This can be avoided by using overlapping block motion compensation [13, 18]. Yet another approach is to perform predictive coding and motion compensation in the transformed domain. The main obstacle with this type of coding lies in the fact that the Wavelet transform is translationally variant causing motion compensation to perform poorly [15, 16]. A remedy for this is to apply an antialiasing filter to the wavelet coefficients prior to motion estimation [7]. Needless to say, this increases the already high complexity associated with motion compensation.

The algorithm presented in this paper is targeted at multipoint video-conferencing in heterogeneous environments. With this application in mind the following assumptions have guided the design:

1. Coding delay is of paramount importance. The coder is therefore not allowed to buffer frames in order to process two or more frames as a unit prior to transmission. Consequently, only temporal prediction with respect to previous frames is permissible, not bi-directional prediction. Three-dimensional sub-band transform coding is not viable either.

2. The video content is assumed to be reasonably static, without camera pans and with only limited scene motion. Under this assumption the high complexity of motion compensation cannot be justified, and motion compensation is therefore omitted.

3. The encoding should support a hybrid of spatial, temporal and SNR scalability enabling each receiver to trade-off between resolution, frame rate and distortion.

4. The algorithm should be reasonably lightweight so that software-only implementation on general-purpose processors is feasible.

3 A new wavelet-based video coding with low delay

In an attempt to leverage off the excellent scalability and compression performance for still images provided by Shapiro's embedded zerotree wavelet coding (EZW), an extension to the EZW algorithm to also exploit temporal correlations between wavelet coefficients of previously processed frames has been developed. We call this novel algorithm EZWTP, for embedded zerotree wavelet coding with temporal prediction. In order to explain the algorithm let us first recapitulate Shapiro's classical EZW algorithm for still image compression.


3.1 Embedded zerotree wavelet coding for still images

The first step of EZW coding is to decompose the input image into an octave-band pyramid, using a 2D DWT. A two-level decomposition of an image is shown in Figure 1. The EZW algorithm produces an embedded bitstream (in the sense that it can be truncated at any point to achieve any desired bitrate) by ordering the coefficients of the subbands so that the most significant coefficients are coded first. Then the correlation between corresponding coefficients in subbands of the same orientation is exploited by introducing the concept of the zerotree data structure. The zerotree data structure, illustrated in Figure 2, is an association between coefficients in subbands of the same spatial orientation in a tree-like structure. Each coefficient, with the exception of the coefficients of the lowest frequency subband and the three highest-frequency subbands, is considered to be the parent of four coefficients (the children) at the next finer scale with the same spatial orientation. For a given parent, the set of child coefficients, the children's children and so on are called zerotree descendants. The highest frequency coefficients have no children and are thus never parents. The coefficients of the lowest frequency subband each have three child coefficients at the corresponding spatial positions in the horizontal, vertical and diagonal refinement subband of the same level, as indicated in Figure 2. The algorithm progresses iteratively, alternating between two passes: the dominant pass and the subordinate pass. In the dominant pass the coefficient values are identified as significant or insignificant with respect to a threshold value T_i that is decreased (typically halved) for each iteration. The coefficients c_{x,y} that are found to be significant at quantization level i, that is, |c_{x,y}| > T_i, are encoded with one of two symbols (POS or NEG) depending on the sign of the coefficient. A coefficient that has been found to be significant is set to zero to prevent it from being encoded as POS or NEG in subsequent dominant passes. Then the magnitude of the coefficient is inserted into a list called the subordinate list, used in the subordinate pass. The insignificant coefficients are considered to be zero at the current quantization level and are coded as either zerotree roots (ZTR) or isolated zeros (IZ). A coefficient is coded as a ZTR if all its descendants in the zerotree rooted at the coefficient are insignificant with respect to the current threshold. Otherwise, if any of its descendants are significant, the symbol is coded as an IZ. A coefficient that is the descendant of a previously coded zerotree root is set to zero and not encoded in this pass. The dominant pass processes all coefficients in a well-defined scanning order, subband by subband, from low-frequency to high-frequency subbands, encoding the coefficients that are not zerotree descendants with a codeword from a four-symbol alphabet. Since the highest frequency subbands do not have any zerotree roots, a ternary alphabet is used to encode those coefficients. After the dominant pass is completed the subordinate pass processes each entry of the subordinate list, refining the coefficient value to an additional bit of precision. This is done by using a binary alphabet to indicate whether the magnitude of a coefficient is higher or lower than half the current threshold value. In effect, this corresponds to the quantizer binsize being halved for each subordinate pass. The algorithm alternates between the
dominant and the subordinate pass, halving the threshold value for each iteration, until the desired precision is achieved or a bandwidth limit is met. The symbols that are output from the dominant pass (POS, NEG, ZTR and IZ) are entropy-coded using arithmetic coding. An adaptive arithmetic coder can be used to dynamically update the probabilistic model throughout the encoding process. In practice this is done by maintaining a histogram of symbol frequencies as described in [6]. To further improve the entropy coding, a number of histograms can be used and the selection of probabilistic model for each encoded symbol is conditioned on whether the coefficient's parent and left neighbor coefficient are significant or not. This results in four histograms for the dominant pass. In the subordinate pass a single histogram is used.
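
The alternation between the two passes can be sketched as follows. This is a simplified Python illustration made under several assumptions: the wavelet coefficients are integer-valued and supplied in coarse-to-fine scanning order as a dictionary, the zerotree parent-child relation is given by a precomputed children map, and the arithmetic coding of the emitted symbols is left out.

    def ezw_passes(coeff, children, num_passes):
        """coeff: {index: value} in coarse-to-fine scanning order;
        children: {index: [indices of the children]}.
        Returns the sequence of dominant-pass symbols and subordinate-pass
        refinement bits (entropy coding omitted)."""

        def descendants(i):
            stack, out = list(children.get(i, [])), []
            while stack:
                c = stack.pop()
                out.append(c)
                stack.extend(children.get(c, []))
            return out

        # Initial threshold: largest power of two not exceeding the largest magnitude.
        T = 2 ** max(max(abs(v) for v in coeff.values()).bit_length() - 1, 0)
        remaining = dict(coeff)        # coefficients not yet found significant
        subordinate = []               # magnitudes awaiting further refinement
        out = []

        for _ in range(num_passes):
            if T == 0:
                break
            covered = set()            # descendants of zerotree roots coded in this pass
            for i, v in remaining.items():                 # dominant pass
                if i in covered:
                    continue                               # descendant of a coded zerotree root
                if abs(v) >= T:
                    out.append('POS' if v > 0 else 'NEG')  # newly significant coefficient
                    subordinate.append(abs(v))
                elif all(abs(remaining.get(d, 0)) < T for d in descendants(i)):
                    out.append('ZTR')                      # zerotree root (leaves trivially qualify;
                    covered.update(descendants(i))         # the real coder uses a reduced alphabet there)
                else:
                    out.append('IZ')                       # isolated zero
            remaining = {i: v for i, v in remaining.items() if abs(v) < T}

            for m in subordinate:                          # subordinate pass: one more bit each
                out.append(int(m % T >= T // 2))           # refinement bit (meaningful while T >= 2)
            T //= 2                                        # halve the threshold for the next iteration
        return out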

Figure 1 Two-level dyadic wavelet decomposition of an image into the subbands LL2, HL2, LH2, HH2 (coarse level) and HL1, LH1, HH1 (fine level)

Figure 2 Parent-child relationship of subbands. A zerotree rooted at the LL2 subband is also shown.

3.2 The EZWTP algorithm

To extend the EZW algorithm to video coding without introducing substantial coding delays and prohibitively high complexity, a temporal prediction scheme
without motion compensation is devised. The temporal prediction uses only the previously coded frame as reference. For robustness to packet loss, intra-coding is employed at regular intervals so that the decoder can be resynchronized. Thus, two types of encoded images are present in the output video stream: intra-coded frames and predictive frames (I-frames and P-frames). The I-frames are coded using the traditional EZW algorithm. For the P-frames, two new symbols are introduced in the dominant pass: zerotree root with temporal prediction (ZTRTP) and isolated zero with temporal prediction (IZTP). A coefficient is coded as ZTRTP in the dominant pass if it cannot be coded as a ZTR, but the difference between the coefficient and the coefficient at the same spatial location in the previous frame is insignificant with respect to the current threshold, and so is the difference between each descendant of the coefficient and the corresponding descendant in the previous frame. Thus, a temporally predicted zerotree is an extension of the zerotree data structure to include coefficients of the same subbands in the previous frame. This relationship is illustrated in Figure 3. A coefficient that is insignificant, but is not a ZTR or ZTRTP (or a descendant), is coded as an isolated zero. A significant coefficient that is not a ZTRTP is coded as an IZTP if the difference between the coefficient and the corresponding coefficient in the previous frame at the current quantization level is insignificant. Note that when computing the difference between a coefficient's value and the value of the coefficient at the same spatial location in the previous frame, we must use the approximation of the coefficient value corresponding to the precision of the current pass of the algorithm. This is because in order to decode a coefficient's value at a precision corresponding to the i-th refinement level, the decoder should only be required to decode the previous frame's coefficients at refinement levels 1, 2, ..., i. Otherwise the SNR scalability criterion would be violated. Consequently, the coder and decoder must keep the coefficient values of each refinement level of a frame for reference when coding or decoding the next predictive frame. Although this results in a substantial memory requirement, it does not introduce any buffering delay, since a frame is still transmitted as soon as it is coded. Coefficients that are not coded as ZTRTP, ZTR, IZTP or IZ are coded as POS or NEG depending on the sign, as in the original EZW algorithm. Note however that when a coefficient is found to be significant and pushed onto the subordinate list, after previously having been coded with temporal prediction (ZTRTP, temporally predicted zerotree descendant or IZTP), it is the magnitude of the difference between the coefficient and the coefficient used for the temporal prediction that should be recorded. It must also be remembered that this magnitude value represents a differentially coded coefficient. In this way the same threshold value can be used for refinement of both coefficient magnitudes and coefficient difference magnitudes. Since the state of the algorithm implicitly encodes this information, no extra signaling is needed between the coder and decoder. Note also that the successive approximations of coefficient values that the decoder will reconstruct can be generated as intermediate results of the encoding process without extra cost.
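
The dominant-pass decision for a single P-frame coefficient can be summarized as in the sketch below. It is a hedged Python illustration of the rules just described; the two predicates for descendant significance are assumed to be supplied by the caller, and prev is the previous frame's coefficient reconstructed at the precision of the current pass.

    def classify_p_frame_coeff(c, prev, T, subtree_insignificant, subtree_diff_insignificant):
        """Return the dominant-pass symbol for one P-frame coefficient.
        subtree_insignificant(): True if all descendants are insignificant w.r.t. T.
        subtree_diff_insignificant(): True if all descendant differences against
        the previous frame are insignificant w.r.t. T."""
        insignificant = abs(c) < T
        diff_insignificant = abs(c - prev) < T

        if insignificant and subtree_insignificant():
            return 'ZTR'                      # ordinary zerotree root
        if diff_insignificant and subtree_diff_insignificant():
            return 'ZTRTP'                    # zerotree root with temporal prediction
        if insignificant:
            return 'IZ'                       # isolated zero
        if diff_insignificant:
            return 'IZTP'                     # isolated zero with temporal prediction
        return 'POS' if c > 0 else 'NEG'      # newly significant coefficient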

The subordinate pass works in the same way as in the original EZW algorithm apart from the fact that some of the magnitude values on the subordinate list now
represent the prediction error term of a coefficient relative to the corresponding coefficient in the previously decoded frame. The state of the decoder when the coefficient is added to the subordinate list determines whether it is a prediction error term or a coefficient magnitude value and this information is kept in the subordinate list.

Figure 3 Spatial and temporal relationships of the coefficients belonging to a temporally predicted zerotree rooted at subband LL2 of frame i (the tree extends over the subband pyramids of both frame i-1 and frame i)

The arithmetic coding of the symbols is performed using codewords from five different alphabets. For I-frames the three alphabets of the original EZW algorithm are used, viz. a four-symbol alphabet for all subbands except the highest frequency subbands of the dominant pass, a ternary alphabet for the highest frequency subbands and a binary alphabet for the subordinate pass. For P-frames a six-symbol alphabet is used for all subbands except the highest frequency ones (ZTR, IZ, ZTRTP, IZTP, POS, NEG), a four-symbol alphabet is used for the highest frequency subbands, where ZTRTP and ZTR cannot occur, and a binary alphabet for the subordinate pass. The conditioning of the statistical model used by the arithmetic coder in the dominant pass is performed with respect to whether the parent and left neighbor of a coefficient is significant, as in the original EZW algorithm, but also with respect to whether the corresponding coefficient in the previous frame is significant. This increases the performance of the arithmetic coder. Another difference compared to the original EZW algorithm is that with the addition of temporal information, there is now a way to condition the statistical model to be used for arithmetic coding of the symbols resulting from the subordinate pass. Since the coefficients at the same spatial location in adjacent frames exhibit a strong correlation, the probability is higher that the coefficient will be refined in the same direction as the corresponding coefficient in the previous frame. Thus the arithmetic coding of the symbols from the subordinate pass can be enhanced by temporal conditioning. The arithmetic coding can be based on either static, predefined, probability models or adaptive models based on histograms of symbol frequencies. However, since the decoder should be able to partially decode the encoded bitstream a fully adaptive arithmetic coding, where symbol probabilities are updated for every coded symbol, cannot be used. Thus, in order not to violate the scalability criteria and to be resilient to packet loss, the
intended application of multipoint video communication systems must use either a static model or introduce synchronization points in the media (e.g. at every I-frame), where the probability models are propagated from the coder to the decoder. The advantage of using an adaptive arithmetic coder is not so significant that it motivates the added complexity of maintaining symbol frequency histograms, so a static model is generally preferred.
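
The conditioning of the statistical model can be realized by selecting one of a small set of symbol-frequency histograms (or static probability tables) per coded symbol. The sketch below assumes one context bit per conditioning event, giving eight contexts for P-frames and four for I-frames; the exact number of contexts used in the implementation is not stated above, so this is only an illustration.

    def dominant_pass_context(parent_significant, left_significant,
                              temporal_significant=None):
        """Map the significance of the parent, the left neighbour and (for
        P-frames) the co-located coefficient in the previous frame to a context
        index used to select a probability model for the arithmetic coder."""
        ctx = int(parent_significant) | (int(left_significant) << 1)
        if temporal_significant is not None:          # P-frame: add temporal conditioning
            ctx |= int(temporal_significant) << 2
        return ctx                                    # 0-3 for I-frames, 0-7 for P-frames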

Figure 4 Schematic diagram of the EZWTP encoding process: the input image is transformed by the DWT; depending on whether an I-frame or a P-frame is being coded, the coefficients are zerotree-coded with EZW or with EZWTP (using the frame memory as reference); the resulting symbols are arithmetically coded to form the output

3.3 EZWTP codec design

The EZWTP encoder consists of the following four components:

1. colorspace conversion and component subsampling,

2. transform coding,

3. zerotree coding with built-in temporal prediction,

4. arithmetic coding.

The colorspace conversion transforms the input color video signal into a luminance signal (Y) and two color-difference chrominance signals (Cr and Cb). Since the human visual system is more sensitive to variations in luminosity than in hue, the chrominance signals are subsampled by two horizontally and vertically. The colorspace conversion and subsampling operations decorrelate the components and reduce the bandwidth to half that of the original signal. The encoding is then performed separately on each of the three components. The encoding process is illustrated schematically in Figure 4.
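
The colorspace conversion and subsampling stage can be sketched as below. The exact conversion coefficients used in the implementation are not specified above; the sketch assumes the common ITU-R BT.601 luma/chroma definition and simple decimation (no pre-filtering) for the 2x2 chrominance subsampling.

    import numpy as np

    def rgb_to_ycrcb_420(rgb):
        """Convert an RGB image (H x W x 3, even dimensions) to Y, Cr, Cb planes
        with the chrominance subsampled by two horizontally and vertically.
        The result holds n + n/4 + n/4 samples, i.e. half the original data."""
        r = rgb[..., 0].astype(np.float32)
        g = rgb[..., 1].astype(np.float32)
        b = rgb[..., 2].astype(np.float32)
        y  =  0.299 * r + 0.587 * g + 0.114 * b
        cr =  0.500 * r - 0.419 * g - 0.081 * b + 128.0
        cb = -0.169 * r - 0.331 * g + 0.500 * b + 128.0
        return y, cr[::2, ::2], cb[::2, ::2]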

3.3.1 Transform coding and spatial scalability

The wavelet transform decomposes the input images into subbands representing the frequency content of the image at different scales and orientations. The transform is implemented by applying a pair of band-splitting filters to the image. The filtering process is repeated on the lowest-frequency subband a finite number of steps, resulting in a pyramid of wavelet coefficients like the one depicted in Figure 1. For the implementation of the EZWTP codec presented in this paper the
filters designed by Antonini et al. [8] were chosen, since they have been found to give good performance for image coding [9]. The transform is iterated on the low-pass subband until the size is considered small enough; e.g. for CIF-size images (352x288), five iterations are done for the luminance component and four for the chrominance. Thus, for CIF video five spatial resolution levels are obtained, each of which (except the LL-band) contains three refinement signals for horizontal, vertical and diagonal detail. The spatial layering can be performed at the subband level, resulting in 3x5+1 = 16 spatial layers for CIF images. Such a fine granularity for spatial scalability is probably unnecessary for most applications, suggesting that the subbands should be coalesced into fewer layers.
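
The dyadic decomposition itself amounts to repeatedly splitting the low-pass band. The sketch below uses the two-tap Haar filters purely for brevity and readability; as stated above, the actual codec uses the biorthogonal filters of Antonini et al. Image dimensions are assumed to be divisible by 2^levels.

    import numpy as np

    def dyadic_dwt2(img, levels):
        """Return (LL, details) where details[k] = (HL, LH, HH) for level k,
        with k = 0 the finest level; five levels give the 16 subbands of the
        CIF luminance decomposition described above."""
        ll = img.astype(np.float32)
        details = []
        for _ in range(levels):
            lo = (ll[:, 0::2] + ll[:, 1::2]) / 2.0      # horizontal low-pass
            hi = (ll[:, 0::2] - ll[:, 1::2]) / 2.0      # horizontal high-pass
            ll_next = (lo[0::2, :] + lo[1::2, :]) / 2.0
            lh      = (lo[0::2, :] - lo[1::2, :]) / 2.0
            hl      = (hi[0::2, :] + hi[1::2, :]) / 2.0
            hh      = (hi[0::2, :] - hi[1::2, :]) / 2.0
            details.append((hl, lh, hh))
            ll = ll_next                                # iterate on the low-pass band
        return ll, details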

Figure 5 Inter-frame dependencies for intra-coded frames (I-frames), predicted frames (P-frames) and intra-predicted frames (IP-frames).

3.3.2 Temporal scalability

The temporal scalability requirement restricts the inter-frame dependencies that the predictive coding is allowed to establish. Since a P-coded frame cannot be decoded unless the I- or P-frame it is predicted from has been decoded, such inter-frame dependencies must be confined to the same layer or to temporally antecedent layers. P-frames are generally predicted from the immediately preceding frame, since the temporal correlation usually diminishes rather quickly. One approach is to employ a two-layer model wherein all I-frames are assigned to the base layer and all P-frames to a single refinement layer. To increase the number of temporal layers possible some (or all) P-frames can be predicted from the previous I-frame instead of from the immediately preceding frame. Figure 5 illustrates a temporal layering arrangement with three temporal layers, where the P-frames temporally equidistant from two I-frames are coded with reference to the previous I-frame and the intermediate P-frames are coded relative to the immediately preceding P- or I-frame.
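
One possible frame-to-layer assignment consistent with Figure 5 is sketched below, assuming an I-frame every gop frames (gop is an illustrative parameter): layer 0 carries the I-frames, layer 1 the P-frame midway between two I-frames (predicted from the preceding I-frame), and layer 2 the remaining P-frames (predicted from the immediately preceding frame).

    def temporal_layer(frame_index, gop=8):
        """Return (layer, frame_type, reference_index) for a three-layer
        temporal hierarchy; references never point to a higher layer, so a
        receiver subscribing to layers 0..k can always decode its frames."""
        pos = frame_index % gop
        if pos == 0:
            return 0, 'I', None                        # intra-coded frame
        if pos == gop // 2:
            return 1, 'IP', frame_index - pos          # "IP-frame": predicted from previous I-frame
        return 2, 'P', frame_index - 1                 # predicted from the preceding frame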


4 Performance

In this section a number of performance measurements are presented that evaluate the efficiency of the codec in terms of scalability, compression rate and reconstructed image quality. The compression efficiency for a given bandwidth limit is compared to that of a non-scalable codec in order to quantify the sacrifice in compression rate that the layering requirement imposes.

4.1 Inter-frame compression performance

In order to investigate how much compression efficiency is gained by the predictive coding introduced in the EZWTP algorithm a number of measurements were performed comparing the compression rate obtained for different ratios between I-frames and P-frames. In Figure 6 the compressed image size in bits-per-pixel is plotted for each of the 100 first images of the CIF akiyo video sequence for eight P-frames per I-frame. Figure 7 shows the same plot for I-frames only. The former I-frame/P-frame layout thus supports two temporal layers, whereas the latter supports any number of temporal layers (since there are no inter-frame dependencies). Each line in Figure 6 and Figure 7 represents a quantization level, resulting from the SNR scalable EZWTP coding. The compression performance for P-frames can be seen to be about twice the performance for I-frames, for each quantization level. Since the akiyo test sequence contains a typical "head and shoulders" scene it can be assumed to be fairly representative for the kind of video content the algorithm is targeted for. The low-motion nature of the video makes inter-frame coding without motion compensation perform reasonably well.

Figure 6 Size of each compressed image (in bits-per-pixel) for the CIF akiyo test sequence at 8 P-frames per I-frame (one curve per quantization level)


Figure 7 Size of each compressed image (in bits-per-pixel) for the CIF akiyo test sequence, with I-frames only (one curve per quantization level)

In Figure 8 the mean size of a compressed image of the akiyo sequence is plotted against the number of P-frames per I-frame. Again, each curve represents a quantization level. It can be seen that the inclusion of inter-frame coding is highly beneficial to the overall compression efficiency, and that a coding strategy with one I-frame every fourth to sixth frame can be adopted while maintaining good compression performance.

Figure 8 Mean compressed image size (in bits-per-pixel) depending on the number of P-frames for each I-frame (0-12 P-frames per I-frame; one curve per quantization level)


4.2 Overall compression efficiency

To analyze the penalty on compression efficiency for a given bandwidth that the scalability requirements impose and to compare the performance of the EZWTP algorithm to a popular, widely used codec, EZWTP was compared with a non-scalable MPEG-1 codec [21]. In order for the comparison to be as fair as possible, and to reflect the target application for the EZWTP codec, the MPEG codec was configured to use I- and P-frames only (no B-frames) and to have the same I-frame/P-frame ratio. The overall compression performance was quantified by computing the peak signal-to-noise ratio (PSNR) for a number of bandwidth levels (measured in bits-per-pixel). The PSNR was computed on the luminance component only, whereas the bandwidth refers to all three components. Due to the SNR scalability of the EZWTP codec all distortion levels could be decoded from a single encoded source, whereas for the non-scalable MPEG codec the encoding was done multiple times with different target bandwidths. In Figure 9 the compression performance over the first 100 frames of the CIF akiyo sequence is shown. It is clear that the MPEG codec outperforms the EZWTP codec by as much as 3 dB.

Figure 9 Compression efficiency (PSNR versus bits-per-pixel) of EZWTP compared to MPEG-1 for the CIF akiyo test sequence

Figure 10 shows the compression efficiency of EZWTP and MPEG-1 for the 100 first frames of the 4CIF susie test sequence. For this video source the EZWTP algorithm performs almost as well as the MPEG codec at low bitrates and even outperforms the MPEG codec at high bitrates.


Figure 10 Compression efficiency (PSNR versus bits-per-pixel) of EZWTP compared to MPEG-1 for the 4CIF susie test sequence

The reason why the EZWTP algorithm performs better relative to MPEG for susie than for akiyo is probably that the higher spatial resolution in the former case makes the superiority of the wavelet transform over the DCT for spatial decorrelation more significant. A contributing factor might be that although the inter-frame compression of EZWTP performs very well on the low-motion akiyo sequence (cf. Figure 6), the motion-compensated inter-frame coding of MPEG performs even better. The susie sequence contains slightly more motion, and since the performance of MPEG depends to a higher degree than that of EZWTP on the efficacy of the inter-frame compression, the reduced P-frame coding efficiency has a larger impact.

4.3 Scalability

The scalability of the encoding in terms of bandwidth and quality of the reconstructed signal is illustrated in Figure 11 and Figure 12, calculated over the first 100 frames of the CIF akiyo sequence. Only the effects of SNR- and spatial layering are considered. A five-level wavelet transform was used for the encoding, resulting in a total of 16 subbands (three refinement subbands per level plus the base layer). Each subband was assigned a unique spatial layer for the purpose of these measurements. The quantization of the wavelet coefficients was divided into 12 refinement layers. Thus, a hierarchical structure of 12*16=192 layers was created. For most applications such a fine granularity is probably not needed, indicating that some layers should be merged. Figure 11 shows the cumulative bandwidth in kilobits per second (kbps) as a function of the number of spatial layers and quantization layers. The bandwidth was computed at a frame rate of 25 frames per second. As can be seen, increasing the sample precision (adding quantization layers) has a bigger effect on bandwidth consumption compared to an increase in spatial resolution.


Figure 11 Bandwidth scalability of the encoding of the CIF akiyo sequence: cumulative bandwidth (kbps) as a function of the number of spatial layers (1-16) and quantization layers (1-12)

Figure 12 Scalability of the encoding of the CIF akiyo sequence in terms of PSNR of the decoded images, as a function of the number of spatial layers (1-16) and quantization layers (1-12)

The corresponding reconstructed image quality for each quantization and resolution level is shown in Figure 12. Here, image quality is quantified using the peak signal-to-noise ratio of the original and reconstructed images. When computing the PSNR for frames decoded at a lower spatial resolution than the original, the reconstructed image was upsampled to the original dimensions prior to computing the mean square error against the original image. It is important to note that this is a statistical measure of correlation between signals that does not take psychovisual effects into consideration and hence is a poor estimator of perceptual quality. For instance, there seems to be some anomaly resulting in lower quality when 14 subbands are used in the decoder, compared to using only 13. Upon visual inspection of the images, however, the higher resolution versions are subjectively preferable, although mathematically more distorted. The PSNR can be seen to depend approximately linearly on both quantization and resolution. The conclusions that can be drawn from these measurements are that finer quantization has a more profound effect on bandwidth consumption than increased spatial resolution, and that the reconstructed image quality (determined
by the PSNR metric) depends, in some sense, equally on both parameters. Thus, when trading off between resolution and quantization distortion, the former should possibly be prioritized. However, in real applications, other factors like the video content, the type of application and user preference are likely to be of significant importance for this decision.

Figure 13 A frame of the akiyo test sequence (images a-e) decoded at three different resolutions and five different distortion levels


Figure 13 displays one frame from the akiyo test sequence decoded at three different resolutions and five different distortion levels. For image a (CIF resolution) 16 spatial layers and 10 SNR layers were used in the decoding process. For b and c (QCIF resolution) 12 spatial layers were used, with 7 and 6 SNR layers respectively. For d and e (1/16 CIF resolution) 8 spatial layers were used, while the number of SNR layers was 7 and 5 respectively.

In Table 1 the PSNR of each image in Figure 13 is listed together with the frame rate that can be supported, given the bandwidth limit of a particular network access technology. The spatial resolutions are also included. These measurements indicate what performance can be expected from the video coding in some relevant situations.

image   resolution   PSNR (dB)   fps   target access technology
a       CIF          42.1        25    T1 (1.5 Mbps)
b       QCIF         32.2        24    4xISDN (256 kbps)
c       QCIF         28.6        22    2xISDN (128 kbps)
d       1/16 CIF     28.5        16    ISDN (64 kbps)
e       1/16 CIF     24.5        10    modem (33 kbps)

Table 1: Examples of image quality, spatial resolution and frame rate at bandwidths corresponding to different network access technologies

5 Processing requirements

One of the primary design goals of the EZWTP codec is that the computational complexity should be low enough for the algorithm to be implementable in software on general-purpose processors. Furthermore, the processing requirements should be scalable so that the coding and/or decoding complexity can be adjusted to the amount of CPU resources available for different types of terminal equipment. To analyze the complexity of the EZWTP algorithm we first note that the two major contributions to the overall complexity are the transform and quantization for the encoder and the inverse transform and the dequantization for the decoder. It is easy to see that the coding and decoding requirements are symmetric, since the inverse transform and the dequantization are simply the reverse processes of the forward transform and the quantization. Therefore we present the complexity analysis for the decoder only, since the scalability property of the algorithm is most clearly highlighted in the situation where one encoded stream is decoded at many different quality and complexity levels for a collection of heterogeneous decoders. The computational complexity is estimated as a function of the number of levels of the inverse transform and the number of iterations of the zerotree decoding that are performed. In this way we can analyze the scalability of
the processing requirement in relation to the spatial resolution and compression distortion of the reconstructed video.

One iteration of the wavelet transform is implemented by applying a low-pass and a high-pass filter to the pixel values of each image. For the next iteration the transform is applied to the low-frequency subband which has a resolution of a quarter of the original. Thus, for L levels of the transform the processing requirement is proportional to the number of multiplications performed, which is

\sum_{i=0}^{L-1} \frac{2fn}{4^{M-i-1}} , \qquad (1)

where f is the filter tap length, n is the number of pixels of the full image, M is the total number of transform levels executed by the encoder, and L≤M is the number of levels of the inverse transform effected by the decoder.

The zerotree decoding with temporal prediction is performed in two passes: the dominant pass and the subordinate pass. In the dominant pass the wavelet coefficients that were found to be significant in the corresponding dominant pass of the encoder are input and decoded. The ZTR, ZTRTP, IZ and IZTP symbols are also read and the coefficients affected are set to zero or to the value predicted from the previous image. The significant coefficients are added to the subordinate list for further processing. In the dominant pass each coefficient is updated once, so the processing power for each pass is proportional to the number of coefficients, resulting in a total number of

\frac{Pn}{4^{M-L}} \qquad (2)

coefficients to be processed, where P is the number of iterations of the EZWTP decoding algorithm (i.e. the number of quantization levels decoded).

In the subordinate pass each coefficient in the subordinate list is processed and refined to an extra bit of precision as determined by the symbols read from the input stream. The processing power for each iteration of the subordinate pass is hence proportional to the number of significant coefficients for that level. If no temporal prediction is performed (i.e. I-coding) the number of significant coefficients can be found empirically to be approximately doubled for each pass. With temporal prediction the number of significant coefficients is reduced, but we nevertheless assume a doubling quantity also for P-frames, appreciating that the complexity estimation will be somewhat pessimistic. This gives us a complexity for the subordinate pass that is proportional to
\sum_{j=0}^{P-1} \frac{n}{4^{M-L} \, 2^{Q-j-1}} , \qquad (3)

where Q is the total number of quantization levels computed by the encoder, and thus, P≤Q.

A linear combination of (1), (2) and (3) gives the total complexity, CEZWTP(L, P). That is, for some positive proportionality constants C1, C2, C3, where the filter length in (1) has been included in the C1 constant,

C_{EZWTP}(L,P) = C_1 \sum_{i=0}^{L-1} \frac{n}{4^{M-i-1}} + C_2 \frac{Pn}{4^{M-L}} + C_3 \sum_{j=0}^{P-1} \frac{n}{4^{M-L} \, 2^{Q-j-1}}

= n \left( C_1 4^{1-M}(4^L - 1) + C_2 P \, 4^{L-M} + C_3 2^{1-Q} 4^{L-M} (2^P - 1) \right) , \qquad (4)

where the constant factor arising from the geometric sum in the first term has likewise been absorbed into C_1.

As can be seen in (4), the complexity of the EZWTP decoding grows exponentially with respect to the number of transform levels (L). This is not surprising, since the number of pixels to process increases by a factor of four when the width and height of the images are doubled. With respect to the number of quantization levels (P), the EZWTP complexity increases by a linear term and an exponential term, accounting for the dominant and subordinate passes, respectively.
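
For reference, the complexity model of equations (1)-(4) is easy to evaluate numerically; the sketch below does so with placeholder proportionality constants (the real constants would be fitted to decoding-time measurements, as described below).

    def c_ezwtp(L, P, n, M, Q, C1=1.0, C2=1.0, C3=1.0):
        """Decoder complexity estimate: inverse transform (1), dominant
        pass (2) and subordinate pass (3) contributions."""
        idwt = C1 * sum(n / 4 ** (M - i - 1) for i in range(L))
        dominant = C2 * P * n / 4 ** (M - L)
        subordinate = C3 * sum(n / (4 ** (M - L) * 2 ** (Q - j - 1)) for j in range(P))
        return idwt + dominant + subordinate

    # Example: cost of full-quality CIF decoding relative to a reconstruction
    # using one transform level less and four fewer quantization levels.
    n = 352 * 288
    print(c_ezwtp(L=5, P=12, n=n, M=5, Q=12) / c_ezwtp(L=4, P=8, n=n, M=5, Q=12))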

To verify the theoretically deduced complexity estimation in (4) the execution time of the EZWTP implementation when decoding the akiyo test video sequence was measured for different values of L and P. The proportionality constants were empirically determined from decoding time measurements. In Figure 14 decoding time is plotted against the number of quantization levels while keeping the number of transform levels constant. As can be seen, the decoding time corresponds very well with the theoretically estimated curve. In Figure 15 the decoding time is plotted against the number of transform levels, while keeping the number of quantization levels constant. Again, a very good correspondence is found indicating that the complexity estimation in (4) is sound.


Figure 14 Decoding time (ms) as a function of the number of quantization (SNR) levels

Figure 15 Decoding time (ms) as a function of the number of inverse transform (IDWT) levels

From looking at the graphs in Figure 14 and Figure 15 it appears as if the number of IDWT levels chosen has a larger impact on decoding time than the number of quantization levels. Thus, when trading off between resolution and quantization distortion in the decoder, from a complexity standpoint a refinement of the quantization precision might be preferable compared to an increased resolution. The shape of the graph in Figure 14 suggests that the linear term of P in (4) is dominant over the exponential, for SNR levels below 10, resulting in an approximately constant increase in complexity, compared to the apparent exponential increase imposed by a higher resolution level. To verify this observation we differentiate the complexity function CEZWTP(L, P) with respect to L and P and form the quotient of the derivatives. The ratio thus obtained represents the relative impact on computational complexity of refining the spatial resolution versus the quantization precision.


\frac{\partial C_{EZWTP} / \partial L}{\partial C_{EZWTP} / \partial P} = \frac{\ln 4 \left( 4 C_1 + C_2 P + C_3 2^{1-Q} (2^P - 1) \right)}{C_2 + \ln 2 \cdot C_3 2^{P-Q+1}} \ge \frac{\ln 4 \left( 4 C_1 + C_2 P \right) + \ln 2 \cdot C_3 2^{P-Q+1}}{C_2 + \ln 2 \cdot C_3 2^{P-Q+1}} \ge 1 \qquad (5)

(the first inequality holds because \ln 4 \cdot C_3 2^{1-Q}(2^P - 1) \ge \ln 2 \cdot C_3 2^{P-Q+1} whenever P \ge 1), where the last inequality holds iff

8 C_1 + 2 C_2 P \ge \frac{C_2}{\ln 2} . \qquad (6)

Since P ≥ 1 > 1 / 2ln2 – 4 C1/C2, the relation in (6) is trivially true and thus the ratio in (5) is always greater than one, implying that the increase in computational complexity is always affected more by an increase in resolution compared to an increase in quantization precision, irrespective of L, P and the proportionality constants. This suggests that for computationally constrained devices, refined quantization might be preferred over increased resolution. Note that in this analysis we have calculated the change in computational cost associated with a change in resolution corresponding to three additional spatial subbands being used in the decoding process. That is, we do not consider the effect on complexity of adding the spatial subbands of a transform level independently. Since the improvement in reconstructed image quality is most profound when adding a spatial subband of the next resolution level (cf. Figure 12), the conclusions are still consistent.

6 Summary and conclusions

Real-time multipoint Internet videoconferencing applications require highly scalable video encoding and compression algorithms with minimal coding delays. This paper has presented a video compression algorithm that produces a layered bitstream that can be decoded at different quality levels depending on the amount of resources available to the decoder in terms of network bandwidth, computational capacity and visualization capabilities. The algorithm, called EZWTP, has been designed with the scalability and real-time properties as primary requirements, while trying to maintain high compression efficiency and low computational complexity. Computational complexity is kept low by excluding motion compensation. The motivation for doing so is that the target application (Internet videoconferencing) implies that reasonably low-motion video content can be assumed. The inter-frame compression of EZWTP was shown to give a substantial gain in compression performance for low-motion video scenes. In comparison to a popular non-layered codec (MPEG-1), the EZWTP codec was shown to exhibit competitive compression performance for high-resolution video, due to the superior spatial decorrelation properties of the wavelet transform compared to the discrete cosine transform. For lower resolution video, non-scalable codecs with motion compensation typically outperform the EZWTP algorithm.


The decoder can trade off between frame rate, spatial resolution and compression distortion based on local constraints and user preference. Complexity and performance analyses showed that for computationally constrained devices, refined quantization might be favored over increased spatial resolution, while the opposite trade-off could be advocated for bandwidth-constrained instances of the decoder. The temporal layering has a linear impact on both decoding time and bandwidth consumption.

Although the computational power of processors and the capacity of network infrastructure will continue to increase, heterogeneity will persist. It can thus be argued that scalability in terms of performance and resource consumption should be considered a more important feature of a video coding algorithm than sheer compression efficiency, when targeting applications like Internet videoconferencing. This sentiment has inspired the work presented in this paper.

References

[1] D. Taubman and A. Zakhor, "Multirate 3-D subband coding of video," IEEE Trans. Image Processing, vol. 3, no. 5, pp. 572-588, Sep. 1994.

[2] C. I. Podilchuck, N. S. Jayant and N. Farvardin, "Three-dimensional subband coding of video," IEEE Trans. Image Processing, vol. 2, no. 2, pp. 125-139, Feb. 1995.

[3] J. R. Ohm, "Three-dimensional subband coding with motion compensation," IEEE Trans. Image Processing, vol. 3, no. 5, pp. 559-571, Sep. 1994.

[4] J. M. Shapiro, "Embedded image coding using zerotrees of wavelet coefficients," IEEE Trans. Signal Processing, vol. 41, no. 12, pp. 3445-3462, Dec. 1993.

[5] A. Said and W. Pearlman, "A new, fast and efficient image codec based on set partitioning in hierarchical trees," IEEE Trans. Circuits and Syst. for Video Technol., vol. 6, no. 3, pp. 243-250, June 1996.

[6] I. H. Witten, R. Neal and J. G. Cleary, "Arithmetic coding for data compression," Communications of the ACM, vol. 30, pp. 520-540, June 1987.

[7] X. Yang, K. Ramchandran, "Scalable wavelet video coding using alias-reduced hierarchical motion compensation," IEEE Trans. Image Processing, vol. 9, no. 5, May 2000.

[8] M. Antonini, M. Barlaud, P. Mathieu and I. Daubechies, "Image coding using wavelet transform," IEEE Trans. Image Processing, vol. 1, no. 2, April 1992.

[9] J. Villasenor et al., "Wavelet filter evaluation for image compression," IEEE Trans. Image Processing, Aug. 1995.

[10] W. Equitz and T. Cover, "Successive refinement of information," IEEE Trans. Information Theory, vol. 37, pp. 269-275, Mar. 1991.


[11] G. Cote, B. Erol and F. Kossentini, "H.263+: Video coding at low bit rates," IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 7, pp. 849-866, Nov. 1998.

[12] MPEG-2, ISO/IEC 13818. "Generic coding of moving pictures and associated audio information," Nov. 1994.

[13] M. T. Orchard, G. J. Sullivan, "Overlapped block motion compensation: an estimation-theoretic approach," IEEE Trans. Image Processing, vol. 3, no. 5, pp. 693-699, Sept. 1994.

[14] M. Vetterli and J. Kovacevic, Wavelets and Subband Coding. Englewood Cliffs, NJ: Prentice-Hall, 1995.

[15] K. Tsunashima, J. B. Stampleman, and V. M. Bove, "A scalable motion-compensated subband image coder," IEEE Trans. Commun., vol. 42, pp.1894–1901, Apr. 1994.

[16] A. Nosratinia and M. Orchard, "Multi-resolution backward video coding," in IEEE Int. Conf. Image Processing, vol. 2, Washington DC, Oct. 1995, pp. 563–566.

[17] K. Metin Uz and M. Vetterli, "Interpolative multiresolution coding of advanced television with compatible subchannels," IEEE Trans. Circuits Syst. Video Technol., vol. 1, no. 1, pp. 86–99, Mar. 1991.

[18] M. Ohta and S. Nogaki, "Hybrid picture coding with wavelet transform and overlapped motion-compensated interframe prediction coding," IEEE Trans. Signal Processing, vol. 41, no. 12, pp. 3416-3424, Dec. 1993.

[19] Y. Chen and W. Pearlman, "Three-dimensional subband coding of video using the zero-tree method," in Proceedings of SPIE - Visual Communications and Image Processing, Orlando, Mar. 1996, pp. 1302 - 1312.

[20] S. McCanne, M. Vetterli, and V. Jacobson, "Low-complexity video coding for receiver-driven layered multicast," IEEE Journal on Selected Areas in Communications, vol. 16, no. 6, pp. 983-1001, Aug. 1997.

[21] A. C. Hung, "PVRG-MPEG CODEC 1.1," Portable Video Research Group (PVRG), Stanford University, June 14, 1993.

[22] N. Shacham, "Multicast routing of hierarchical data," in Proceedings of the International Conference on Computer Communications, Chicago, June 1992, pp. 1217-1221.

[23] T. Turletti and J. C. Bolot, "Issues with multicast video distribution in heterogeneous packet networks," in Proceedings of the Sixth International Workshop on Packet Video, Portland, Sept. 1994.

[24] S. McCanne, V. Jacobson, and M. Vetterli, "Receiver-driven layered multicast," in Proceedings of SIGCOMM ’96, Stanford, Aug. 1996.

[25] S. McCanne, "Scalable video coding and transmission for Internet multicast video," Ph.D. thesis, University of California, Berkeley, Dec. 1996.


Paper C

An RTP to HTTP video gateway

Proceedings of the Tenth World Wide Web Conference, Hong Kong, May 2001.


An RTP to HTTP video gateway

Mathias Johanson Framkom Research Corporation

Sallarängsbacken 2, S-431 37 Mölndal, Sweden [email protected]

Abstract Multicast audio and video conferences are today commonplace in certain parts of the Internet. The vast majority of Internet users, however, are not able to participate in these events because they either lack multicast network connectivity, are located behind firewalls, have insufficient network resources available or do not have access to the proper software tools. In many cases all of the above restrictions apply. This paper presents an effort to extend the scope of multicast video conferencing through the development of an Internet video gateway that interconnects multicast networks with the World Wide Web. The overall design of the gateway software is outlined and a novel algorithm for rate control of the multicast video flows is described. Some performance tests that show the efficacy of the system in terms of resource utilisation and scalability are presented.

1 Introduction

The explosive growth of the Internet has so far mostly been related to its success in supporting asynchronous applications like WWW-browsing and file transfers. Within the research community, however, the Internet has for many years been successfully utilised for supporting synchronous multimedia conference sessions, most notably within the Mbone initiative [1]. The Mbone is a virtual network implemented on top of the Internet that enables multicast packet delivery; a technology crucial for implementing scalable multipoint communication systems. Nevertheless, the vast majority of Internet hosts are not connected to multicast-enabled networks, so inter-operation with Mbone-type services needs some sort of gateway function or tunnelling mechanism that can forward IP multicast datagrams in a controlled manner over unicast network connections. Several software tools have been designed for this purpose, including mrouted [6] and mTunnel [3], but there are still other difficulties that need to be overcome to make audio and video conferencing ubiquitous on the Internet. One difficulty is that the bandwidth available on many dialup links is too low to sustain the potentially broadband traffic of audio and video sessions. A solution to this problem is to employ media transcoding gateways that convert the transmitted media to a lower bandwidth format suitable for transmission over low-bandwidth links. One such approach is presented in [2]. Yet another obstacle is the fact that many Internet hosts are located behind firewalls. In the general case firewalls do not allow UDP-based real-time traffic to pass through, and in many cases they also employ techniques like network address translation that complicate end-to-end real-time
communication. Moreover, the rather sophisticated applications required for real-time audiovisual communication might not be available on every computing platform and troublesome installation and configuration procedures will in any case restrain the applicability of the services in question.

This paper presents a novel software tool that has been developed to partially circumvent the aforementioned impediments to extend the range of synchronous multimedia communication.

2 Background and motivation

Synchronous collaboration tools like audio and video conferencing applications are becoming increasingly popular on the Internet. Simple synchronous communication tools like ICQ [4] and IRC [5] have rapidly reached a large number of users due to their applicability virtually anywhere on the Internet. This is due to the fact that they rely only on the core protocols of the Internet (TCP/IP) and require very little network resources to be useful. Sophisticated multimedia collaboration software, on the other hand, requires substantially more bandwidth and builds largely on protocols that are not supported everywhere on the Internet (IP multicast [7], RTP/RTCP [11], UDP [13]). Although these technologies are expected to reach an increasingly widespread deployment, there will always be heterogeneity in terms of network resources and services. In an effort to extend the scope of multicast video conferences we have developed RTP to HTTP gateway software that makes it possible for an Internet user to receive multicast video streams, albeit at potentially high latency and low frame-rate, with the only prerequisite being access to the WWW through a standard browser. Figure 1 shows an example configuration of a network that connects WWW-users to a multicast network.

Figure 1 Typical network configuration using gateways: a video transmitter and multicast-enabled receivers are connected to the multicast backbone, while WWW receivers reach the session through gateways (GW)


Note that the video gateway presented in this paper only enables users to receive video streams of multicast conference sessions. It does not provide any support for transmitting video to conference sessions.

2.1 Multicast conferencing tools

A suite of tools generally referred to as "the Mbone tools" have been used for some time on the global experimental multicast network known as the Mbone. The Mbone tools include real-time audio and video conferencing applications, shared whiteboards, text chat tools and more. These tools communicate using IP multicast group addresses and encapsulate real-time data in IP datagrams as specified by the Real-time Transport Protocol (RTP) [11], and the associated RTP-profiles for various media encodings. Basic session management and control as well as miscellaneous status report functions are handled by the Real Time Control Protocol, RTCP [11]. In addition, the Session Announcement Protocol (SAP) [14] and the Session Description Protocol (SDP) [15] are used to announce the lifetime of multicast sessions and describe what media format will be used for each session.

2.2 Video on the WWW

Except for experimental systems within the research community, the first large-scale use of live video on the WWW was so-called web-cameras. A web-camera is a device that is attached to a web-server that transmits live video images to a WWW-browser using HTTP. Although HTTP was originally designed for strictly asynchronous applications, extensions have been developed to enable web-servers to send continuous media streams to the client browser. This is known as "push"-technologies or HTTP streaming. Another class of applications that have emerged on the WWW are media on demand servers that transmit pre-recorded media clips to the client browser using HTTP-streaming or some other streaming protocol.

2.3 Packet video gateways

The concept of active media processing within multicast networks as a solution to the network heterogeneity problem was pioneered by Turletti and Bolot in [16] and by Pasquale et al. in [17]. Amir et al. elaborate on these ideas in [2] with the presentation of an application level video gateway that performs transcoding between JPEG and H.261 RTP streams. A classification of active networking applications is given in [18], wherein a distinction is made between transport gateways that bridge networks with different characteristics and applications services that perform active processing of the transmitted data, such as transcoding of video streams between different encodings. In [19] Ooi et al. present an architecture for a programmable media gateway that can be remotely configured to perform user-defined processing of media streams.


3 WebSmile: Overall architecture

WebSmile is a software component that is installed on an ordinary web-server that is connected to a multicast capable network. The software gives users access to multicast RTP video streams through the web-server using HTTP streaming.

3.1 Client side

Two different techniques are used to enable the client browser to display the video that is streamed over HTTP; an experimental MIME-extension [20] for displaying moving images and a Java applet. The MIME extension, known as multipart/x-mixed-replace, makes it possible to display sequences of JPEG or GIF images in an HTML page. Since it is not supported in all browsers this technique is complemented with a Java video player applet that is downloaded from the WebSmile server.

Figure 2 Conceptual model of the WebSmile server architecture: the WWW-browser exchanges HTTP requests and responses (HTML pages and HTTP video) with the WWW server, which invokes WebSmile through the Common Gateway Interface; WebSmile joins the multicast network, receiving RTP video and RTCP source descriptions and sending RTCP receiver reports

3.2 Server side

The WebSmile gateway is implemented as a server program executed on a web server through the common gateway interface (CGI) [21]. The program performs three separate functions depending on the parameters with which it is invoked:


• Monitor a multicast session and report back information about the video sources that are identified.

• Join a session and return an HTML-page with video displays.

• Start forwarding video over HTTP.

The first function is performed by joining the multicast address and port specified and listening to RTCP source description (SDES) advertisements. The members of the session are identified by a canonical name in the format user@host and optionally by more verbose information like a real name, address, phone number, etc. This information is reported back to the browser that originated the CGI-request as an HTML-form with a checkbutton associated with each identified session member. The user then indicates which video sources are to be monitored by checking the appropriate checkbuttons and posting the form back to the server. This invokes WebSmile in the second mode as described above to join the session and return the video display HTML page. This page contains a Java applet to display the video, in case the browser has been identified (through CGI environment variables) as non-capable of displaying multipart/x-mixed-replace content. The third mode of WebSmile is invoked when the references in the video HTML-page to the HTTP-streamed video are resolved. This is either an image hyperlink looking something like

<IMG SRC="http://server:port/cgi-bin/websmile?-s+1234+-a+224.2.2.2+-p+5566">

(where 1234 is the source id of the video to be monitored, 224.2.2.2 is the multicast address and 5566 is the UDP port number) or an applet connecting explicitly to the web server with the same CGI parameters. In both cases the video streamed over HTTP conforms to the multipart MIME specification with a content type of image/jpeg for each multipart entity.
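
The HTTP streaming itself is plain server push: a multipart/x-mixed-replace response in which every part is a complete JPEG image. A minimal sketch of the server side of this exchange is given below; the boundary string and function names are illustrative, and the sketch is not the actual WebSmile CGI program.

    import sys

    BOUNDARY = b"websmile-frame"            # illustrative boundary string

    def start_http_stream(out=sys.stdout.buffer):
        # CGI response header announcing a server-pushed multipart stream.
        out.write(b"Content-Type: multipart/x-mixed-replace; boundary=" + BOUNDARY + b"\r\n\r\n")

    def push_jpeg(jpeg_bytes, out=sys.stdout.buffer):
        # One multipart entity per video frame; each new part replaces the
        # previously displayed image in the browser.
        out.write(b"--" + BOUNDARY + b"\r\n")
        out.write(b"Content-Type: image/jpeg\r\n")
        out.write(b"Content-Length: " + str(len(jpeg_bytes)).encode() + b"\r\n\r\n")
        out.write(jpeg_bytes)
        out.write(b"\r\n")
        out.flush()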

3.3 Transcoding

In case the multicast video is not JPEG over RTP as specified by RFC2435 [12] the gateway needs to transcode the video into JPEG. Currently no transcoding support is implemented in WebSmile so only JPEG-compressed video will be forwarded. However, specialised transcoding gateways are available, including [2], that can be used in combination with WebSmile to support other formats.

4 Rate control

Since the bandwidth available for users connected through HTTP is in most situations expected to be less than the bandwidth used for the multicast sessions,
rate control must be applied to the video traffic forwarded by WebSmile. This is performed by adapting the frame rate of the outbound video to the available bandwidth of each HTTP connection. The WebSmile gateway accomplishes this by writing video image data on the TCP socket of each HTTP connection until the socket buffer is filled. Images arriving on the multicast network while a socket is blocked (due to a full buffer) will not be sent to the corresponding client. When the socket is unblocked, forwarding of images is resumed. This modus operandi is simple to implement and will result in each client receiving video at a frame rate determined by TCP's flow control.
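
One way to realize this behaviour is a one-slot frame buffer per client: the multicast receiver always overwrites the slot with the newest image, and the HTTP sender performs blocking writes, so frames arriving while a write is stalled are silently superseded. The sketch below is an interpretation of the mechanism described above, not the actual WebSmile code.

    import threading

    class LatestFrame:
        """One-slot buffer shared between the multicast receiver and an HTTP sender."""
        def __init__(self):
            self.cond = threading.Condition()
            self.frame, self.seq = None, 0

        def put(self, data):                       # called for every received multicast frame
            with self.cond:
                self.frame, self.seq = data, self.seq + 1
                self.cond.notify_all()

        def get_newer_than(self, last_seq):        # called by the HTTP sender
            with self.cond:
                while self.seq == last_seq:
                    self.cond.wait()
                return self.frame, self.seq

    def forward(latest, http_sock):
        last = 0
        while True:
            frame, last = latest.get_newer_than(last)   # frame: multipart-wrapped JPEG data
            http_sock.sendall(frame)   # blocks while the TCP buffer is full; frames
                                       # arriving in the meantime overwrite the slot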

Considering the fact that the frame rate sustainable over the HTTP connections might be substantially less than the frame rate of the video being multicast, it would be desirable if the gateway could control the multicast video flow being received so that it conforms to the target bandwidth of the rate-controlled video. Otherwise network resources will be wasted on the multicast data path between the sender and the gateway, since many of the video frames simply will be dropped by the gateway.

4.1 Layered multicast

An elegant solution to multicast flow control is to subdivide the data stream into a hierarchy of cumulative layers each of which is transmitted to a unique multicast address. Thus, each individual receiver can control the bandwidth of the data stream being received by subscribing to an appropriate number of multicast groups. The quality of the reconstructed data depends on how many layers are available in the decoding process. The flow control problem is thereby reduced to finding a way for the receivers to determine the optimal number of layers to subscribe to. Unfortunately, this is not so easy to do in the general case. Several approaches have been suggested [8, 9, 10]. In the present case however, given our assumption that the bandwidth bottlenecks are the HTTP connections rather than the multicast backbone, we can use information about the bandwidth constraints of the HTTP connections as input to the multicast flow control algorithm. Since HTTP is transported over TCP we can actually let the flow control algorithm of TCP drive the decision algorithm for subscribing to multicast layers. What we need is a way to measure the bandwidth that TCP allocates for the HTTP connections. We also need a layered representation of the video signals to be transmitted. The easiest way to achieve a layered video encoding is to distribute the individual video frames temporally over the group of layers. Thus, subscribing to an increased number of layers will result in a higher frame rate of the decoded video. The temporal layering is simplified if only intra-frame compression is used, as is the case with the JPEG encoding used in WebSmile.


4.2 The TCP-driven multicast flow control algorithm

Since one WebSmile gateway can support many HTTP-connected clients with video from the same session, the client with the fastest connection determines how many multicast layers must be subscribed to, in order to support the desired frame rate for each client. That is, if a gateway is serving n clients with TCP connections of bandwidth Bi, i=1..n, respectively, with video distributed uniformly across L distinct layers with an aggregate bandwidth of Btot then the number of multicast layers the gateway should subscribe to, LGW is given by

$$ L_{GW} = \frac{\max_{i=1..n}(B_i)}{B_{tot}} \cdot L \qquad (1) $$

Note that the value we get must be rounded up since only integral layers can be received. To determine the effective bandwidth of the HTTP connections WebSmile measures the time each socket write operation consumes and calculates the mean sending time for each transmitted image. Since a blocking socket interface is used, the sending time for an individual image will sometimes be very short (in case of an empty output socket buffer) and sometimes disproportionately long, but on average a good estimate of the actual throughput is obtained.

If the expression in (1) were to be used directly by WebSmile in the multicast flow control algorithm, the total bandwidth of the video stream (Btot) would have to be known. However, this parameter may change during the session, so it would be better if an equivalent expression not including Btot could be derived. Furthermore, since the parameter being measured is the average socket send time for an image, it would be easier if that parameter could be used directly instead of calculating the bandwidth.

Now, if we let t denote the average time to send an image on the HTTP socket connected to receiver k, where Bk=max(Bi), then the average frame size of the video, J, will be given by

$$ J = B_k \, t . $$

Observing that the average frame size can also be written as

$$ J = \frac{B_{tot}}{f} , $$


where f is the frame rate of the video, we note that the fraction in (1) can be written as

$$ \frac{\max_{i=1..n}(B_i)}{B_{tot}} = \frac{B_k}{B_{tot}} = \frac{J/t}{J \, f} = \frac{1}{t \, f} \qquad (2) $$

Substituting (2) in (1) gives the simple formula

$$ L_{GW} = \frac{L}{t \, f} \qquad (3) $$

where L and f are constants. Thus the optimal number of layers to subscribe to can be determined by measuring only the transmission time for the video frames, provided we have a priori knowledge of the number of layers used and the frame rate of the video. (Strictly speaking the frame rate could be experimentally learned by receiving one layer and multiplying the observed frame rate by the total number of layers, L.)

The algorithm continually monitors the average image transmission time to compute the optimal subscription level and thus dynamically adapts to bandwidth fluctuations on the HTTP connections in response to TCP's flow control.

Note that the parameter t in (3) was defined to be the average transmission time for an image on the TCP socket with the fastest connection. This implies that the gateway must keep track of which TCP connection has the lowest average sending time (highest throughput) at any time and use that value as input to the flow control algorithm. However, in the actual implementation of WebSmile, each HTTP-connected user is served by a separate process. Running the flow control algorithm independently in all processes using (3), with t being the average image sending time for the process' own TCP connection, will in effect lead to an allocation of multicast addresses where the set of addresses allocated by the process with the fastest TCP connection will be a superset of the sets of addresses allocated by the other processes. The total allocation of multicast addresses on the gateway is hence determined by the process with the fastest TCP connection. Thus, the desired behaviour is achieved without the processes having to synchronize their operation (or even be aware of the other processes' existence).
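
A minimal sketch of how such a per-process algorithm might look is given below, assuming L and f are known a priori and using the standard IP multicast membership socket options; the function names and the exponential smoothing of the measured send time are illustrative rather than taken from the WebSmile implementation.

import math
import socket
import struct
import time

L = 10      # total number of temporal layers
F = 25.0    # frame rate of the full video (frames per second)

def update_subscription(sock, groups, current, t_avg):
    # Equation (3): subscribe to ceil(L / (t_avg * F)) layers, clamped to [1, L].
    wanted = max(1, min(L, math.ceil(L / (t_avg * F))))
    for i, group in enumerate(groups):
        mreq = struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))
        if current <= i < wanted:
            sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        elif wanted <= i < current:
            sock.setsockopt(socket.IPPROTO_IP, socket.IP_DROP_MEMBERSHIP, mreq)
    return wanted

def timed_send(tcp_sock, image, t_avg, alpha=0.1):
    # Send one image on the HTTP socket and update the running average send time.
    start = time.monotonic()
    tcp_sock.sendall(image)                 # blocks while the output buffer is full
    return (1 - alpha) * t_avg + alpha * (time.monotonic() - start)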

Finally, note that the bandwidths Bi and Btot used in (1) in the derivation of (3) represent the actual throughput of image data, excluding transport protocol overhead. Thus, the difference in protocol overhead between HTTP/TCP and RTP/UDP does not impact the flow control algorithm, although it affects the overall bandwidth consumption. The transport protocol overheads are estimated in section 5.1.


5 Performance

To measure how well the flow control algorithm allocates bandwidth on the multicast network in relation to the throughput on the HTTP/TCP-connection a test environment was set up with the configuration shown in Figure 3.

Figure 3 Network configuration used for performance tests: a video transmitter and the gateway (GW) on the multicast network, with a video receiver attached to the gateway over a dialup connection

The line speed of the dialup connection was configurable so that different network access technologies could be emulated (in terms of bandwidth). The connection was configured at a number of different speeds ranging from 30 kbps to 2 Mbps and the resultant bandwidths allocated by WebSmile on the HTTP/TCP-connection and on the multicast connection were measured. The video was transmitted at 25 frames per second in 10 distinct temporal layers. The image resolution was 192 by 144 pixels, which after JPEG compression resulted in a total bitrate of about 650 kbps, or about 65 kbps per layer. In Figure 4 the multicast bandwidth is plotted against the HTTP/TCP bandwidth.

Figure 4 Multicast bandwidth allocation in relation to HTTP/TCP bandwidth (multicast bandwidth in kbps plotted against TCP bandwidth in kbps, both axes ranging from 0 to 600 kbps; a dotted line marks identical allocation)


It is clear that the bandwidth allocated on the multicast network depends linearly on the bandwidth available to the TCP connection, as expected. It can also be noted that on average a slightly higher bandwidth is allocated on the multicast network compared to the TCP bandwidth. (The dotted line in Figure 4 delineates an identical allocation of bandwidth.) This is due to the fact that bandwidth is allocated at a much coarser scale on the multicast network, the granularity being the bandwidth of one layer, compared to TCP's congestion window adjustments. On average the over-allocation of bandwidth on the multicast network is one half of the layer bitrate, which in the present case is about 30 kbps.

5.1 Transport protocol overhead

The bandwidth measurements presented in Figure 4 include the overhead imposed by the transport protocols. To investigate what influence the difference in protocol overhead between HTTP/TCP and RTP/UDP transport has on the bandwidth allocation we roughly estimate the overheads.

For the HTTP/TCP transport the overhead for each packet is 20 bytes for the IP header and 20 bytes for the TCP header. Furthermore, each image is encapsulated by an application-specific MIME multipart boundary identifier. Also, a content-type and a content-length MIME field are added for each image. The WebSmile implementation adds 65 bytes of MIME information for each image. The total overhead depends on the data segment size chosen by the TCP implementation, the fragmentation occurring on the end-to-end network connection and the average size of the images transmitted. Assuming a packet size of 576 octets including IP and TCP headers (the default packet size in TCP), no additional fragmentation and an average image size of 3.5 KB, a total overhead of 8.76% is obtained. Note that this estimate assumes that the TCP sender always has enough data in the output socket buffer to transmit a full-sized packet.
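
For reference, the HTTP/TCP estimate can be reproduced with a short calculation such as the following; the exact percentage depends on how the final, partially filled packet of each image is counted, so the result is close to, rather than identical with, the figure quoted above.

import math

IMAGE = 3.5 * 1024      # assumed average JPEG image size (bytes)
PACKET = 576            # packet size including IP and TCP headers (bytes)
MIME = 65               # MIME framing added per image by the gateway

def http_tcp_overhead():
    payload_per_packet = PACKET - 40                 # 20 byte IP + 20 byte TCP header
    packets = math.ceil((IMAGE + MIME) / payload_per_packet)
    overhead = packets * 40 + MIME
    return overhead / (IMAGE + overhead)             # roughly 0.088 with these numbers

print("HTTP/TCP overhead: %.2f%%" % (100 * http_tcp_overhead()))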

The RTP/UDP overhead consists of the 20 byte IP header, 8 bytes for the UDP header, 12 bytes for the RTP header and 8 bytes for the JPEG/RTP profile header, giving a total of 48 bytes per packet. The same packet size and fragmentation situation as in the TCP case gives an overhead of 8.33%. However, on average, the last datagram of an image will be only half of the maximum datagram size. With a 3.5 KB average image size this increases the overhead to 9.31%.

Note that in the estimates above the overhead of retransmissions in the TCP protocol is not included and neither is the overhead due to periodic RTCP packet transmissions in the RTP case. Nevertheless, this rough estimate indicates that the overhead is approximately the same for both transports and accounts for about 9 percent of the total bandwidth, for both HTTP/TCP and RTP/UDP.


6 Future work

The applet used for displaying live video in the client WWW-browser will be extended with functionality to playback audio as well, so that the system can be used as both an audio and video gateway. Furthermore, the integration of media transcoding support into the WebSmile system will be studied in more detail.

7 Summary

In this paper the development of a novel Internet video gateway has been presented. The system, known as WebSmile, enables Internet users that normally would be unable to participate in multicast video conferences to partake using only a standard web browser. The need for a system like this is motivated by the fact that many Internet users will continue to be unable to utilise many of the advanced technologies needed for multicast conferencing due to resource unavailability, security concerns and other shortcomings. The design and implementation of WebSmile as an application level gateway co-located with a WWW server was discussed in chapter 3.

In chapter 4 a novel TCP-driven flow control algorithm for layered multicast video was introduced. The algorithm implemented in the video gateway works by adapting the rate of the multicast video flows to the bandwidth allocated by the HTTP/TCP connections to the receiving clients. A layered video encoding transmitted to a set of multicast addresses was suggested to enable the receiver-oriented multicast flow control. The performance of the flow control algorithm was measured in a test network configuration, and the results show that the multicast bandwidth allocated by the gateway closely matches the TCP connection bandwidth. The transport protocol overheads for JPEG-video over HTTP and RTP respectively were estimated and found to be approximately the same.

References

[1] H. Eriksson, "Mbone: The multicast backbone," Communications of the ACM 37, 1994.

[2] E. Amir, S. McCanne, H. Zhang, "An application level video gateway," ACM Multimedia '95, November 1995.

[3] P. Parnes, K. Synnes, D. Schefström, "Lightweight application level multicast tunneling using mTunnel," Journal of Computer Communication, 1998.

[4] http://www.icq.com - "The ICQ Internet chat service."

[5] J. Oikarinen, D. Reed, "Internet Relay Chat (IRC) protocol," RFC1459, May 1993.


[6] B. Fenner. "The multicast router daemon - mrouted," software on-line, ftp://ftp.parc.xerox.com/pub/net-research/ipmulti.

[7] S. E. Deering, "Multicast routing in a datagram internetwork," PhD thesis, Stanford University, December 1991.

[8] S. McCanne, V. Jacobson, M. Vetterli, "Receiver-driven layered multicast," Proceedings of ACM SIGCOMM '96, October 1996.

[9] L. Vicisano, L. Rizzo, J. Crowcroft, "TCP-like congestion control for layered multicast data transfer," Proceedings of IEEE Infocom '98, San Francisco, CA, March 1998.

[10] L. Wu, R. Sharma, and B. Smith. "ThinStreams: An architecture for multicasting layered video," Proceedings of NOSSDAV'97, May 1997.

[11] H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson, "RTP: A transport protocol for real-time applications," RFC1889, January 1996.

[12] L. Berc, W. Fenner, R. Frederick, S. McCanne, P. Stewart, "RTP payload format for JPEG-compressed video," RFC2435, October 1998.

[13] J. Postel, "User Datagram Protocol (UDP)," RFC 768, August 1980.

[14] M. Handley, "SAP: Session Announcement Protocol," Internet draft, IETF Multiparty Multimedia Session Control Working Group, 1997.

[15] M. Handley, V. Jacobsen, "SDP: Session Description Protocol," RFC2327, April 1998.

[16] T. Turletti, J. Bolot, "Issues with multicast video distribution in heterogeneous packet networks," Proceedings of Packet Video Workshop, Portland Oregon, September 1994.

[17] J. Pasquale, G. Polyzos, E. Anderson, V. Kompella, "Filter propagation in dissemination trees: Trading off bandwidth and processing in continuous media networks," Proceedings of NOSSDAV'93, pp. 269-278, October 1993.

[18] D. Tennenhouse, J. Smith, W. Sincoskie, D. Wetherall, G. Minden, "A survey of active network research," IEEE Communications Magazine, pp. 80-86, January 1997.

[19] W. Ooi, R. van Renesse, B. Smith, "Design and implementation of programmable media gateways," Proceedings of NOSSDAV 2000, June 2000.

[20] N. Borenstein, N. Freed, "MIME (Multipurpose Internet Mail Extensions) Part one: Mechanisms for specifying and describing the format of Internet message bodies," RFC1521, September 1993.

[21] K. Coar, D. Robinson, "The WWW Common Gateway Interface version 1.1," Internet draft, June 1999.


Paper D

Stereoscopic video transmission over the Internet

Proceedings of the Second IEEE Workshop on Internet Applications, San José, CA, July 2001.


Stereoscopic video transmission over the Internet

Mathias Johanson
Framkom Research Corporation for Media and Communication Technology
Sallarängsbacken 2, S-431 37 Mölndal, Sweden
[email protected]

Abstract One of the most remarkable features of the human visual system is the ability to perceive three-dimensional depth. This phenomenon is primarily related to the fact that the binocular disparity causes two slightly different images to be projected on the retinas. The images are fused by the human brain into one three-dimensional view. Various stereoscopic display systems have been devised to present computer generated or otherwise properly produced images separately to the eyes resulting in the sensation of stereopsis. A stereoscopic visual communication system can be conceived by arranging two identical video cameras with an appropriate inter-ocular separation, encoding the video signals and transporting the resultant data over a network to one or more receivers where it is decoded and properly displayed. The requirements for realizing such a system based on Internet technology are discussed in this paper and in particular a transport protocol extension is proposed. The design and implementation of a prototype system is discussed and some experiences from using it are reported.

1 Introduction

Real-time multimedia communication over packet networks has recently received a lot of attention. Applications like videoconferencing and video-on-demand have impelled the development and standardization of network protocols and technologies for transport of real-time video over IP networks. Until recently the transmission of video over the Internet has been seriously constrained by technological limitations such as low bandwidth, high latency and high computational complexity. Consequently, most Internet video systems to date have been restricted to low-quality video, limiting its use for demanding applications. However, recent advances in networking technology and signal processing are rapidly eliminating these shortcomings. High quality video communication can therefore be envisioned to be commonplace on the Internet in the near future. To some extent it is already a reality. In some applications the quality and realism of the video content is of paramount importance. Such applications include remotely guided surgery, telerobotics and others. Striving for realism in video-mediated communication implies conveying the visual cues perceptible by the human visual system as closely as possible. An important characteristic of the human visual system is the ability to perceive depth resulting from the spatial disparity of the left and right eyes' viewpoints. Since most packet
video transmission systems developed so far are limited to monoscopic imagery the perception of depth resulting from stereopsis is lost. Although there are other 3D depth cues such as obscuration, kinetic depth, relative size and lighting that can be conveyed by a 2D projection, true stereoscopic vision is only possible with stereopsis [3]. In order to enable stereoscopic visual communication over packet networks two video sources (one for each eye's viewpoint) need to be transmitted and properly presented at the remote end using a 3D-visualisation system. This paper identifies the basic requirements for stereoscopic video transmission over the Internet and proposes a transport protocol extension for stereoscopic video. Furthermore, the development of a software tool to transmit stereoscopic video over the Internet is presented and some initial usage experiences are related.

2 Stereoscopic video fundamentals

The basis for stereoscopic perception is the binocular disparity of the human visual system that causes two slightly different images to be projected on the retinas of the eyes. The two different perspective images are fused in the visual cortex of the brain to compose a single three-dimensional view. This process can be simulated by having two cameras (still image or video) arranged with the same inter-ocular distance as the human eyes. On average the separation of the human eyes is about 65 mm, so placing the two cameras this distance apart with coplanar image sensors will model the human visual system with respect to the difference in perspective between the two viewpoints. When each camera's image is presented only to the corresponding eye of the viewer the two images will be fused into one, providing the cameras are identical. A number of display techniques have been developed to filter out and present the appropriate images to each eye.

2.1 Background

The observation that stereoscopic perception is related to the binocular disparity of the human visual system is not new. As early as about 300 B.C. Euclid had an understanding of the fact that each eye sees a slightly different image, but it wasn't until 1832, when Charles Wheatstone explained that the perception of depth is produced when the mind fuses the two images into one solid three-dimensional view, that the principles of stereoscopic vision were fully uncovered [2]. Wheatstone constructed a simple stereoscope from mirrors and drawn images that demonstrated that there is a unique depth sense, stereopsis, produced by retinal disparity.

2.2 Stereoscopic display systems

A stereoscopic display system is a device that arranges for the left and right viewpoint images to be displayed separately for the corresponding eye. Numerous techniques have been suggested to this end. One approach familiar to many people
is known as the anaglyphic method, wherein the left and right images are drawn in different colors, usually red and green (or blue). The spectator wears a pair of glasses with a red filter covering one eye and a green (or blue) filter covering the other eye. In this way the proper images are displayed for each eye. The main problem with the anaglyphic method is that it does not work very well with color images.

Another approach is to use a time-division-multiplexing scheme for displaying the images and synchronized liquid crystal shutter glasses, to restrict each eye to the proper view. In such a system the images of the left and right channel are displayed sequentially on a CRT monitor at a rate that is synchronized with the shutter glasses so that the left eye is uncovered when the left viewpoint image is rendered and vice versa. The synchronization signal commonly uses an infrared link so that many glasses can be in use simultaneously. A system of this kind is illustrated in Figure 1. A variation on this scheme is to use an active polarizing filter plate in front of the screen in conjunction with passive polarizing filter glasses [4]. The filter plate switches between the two distinct polarizations of the eyewear in sync with the rendering of the left and right viewpoint images. An advantage of this type of system is that the eyewear is cheaper and more convenient than the active shutter glasses.

Figure 1 Time-multiplexed stereoscopic display using active shutter glasses and an infrared emitter

A head-mounted display (HMD) is a piece of equipment worn on the viewer's head with a small liquid crystal display positioned in front of each eye [5]. The left and right viewpoint images are rendered separately on each screen. The main drawbacks are that HMDs tend to be uncomfortable to wear and that they can only be used by one person at a time.

2.3 Stereo image acquisition

To acquire the images to be viewed in a stereoscopic display system, two horizontally displaced cameras are used. There are also numerous applications
where the images are computer-generated from virtual 3D models, but we will limit our discussion to photographic (video) stereoscopy. The principles are the same whether still image or video cameras are being used, but henceforth we will assume video stereography. There are two types of camera set-up that can be used:

• parallel axes cameras,

• toed-in cameras.

In the parallel axes configuration the cameras are aligned so that the axes through the lenses of the cameras are parallel. The convergence of the images is achieved by shifting the image sensors of the cameras slightly, or by a horizontal image translation and clipping of the resulting images. In the toed-in arrangement the cameras are slightly rotated towards each other so that the lens axes intersect at the point of convergence. The two set-ups are depicted in Figure 2.

In both camera configurations the cameras should be vertically aligned and the interaxial separation should be about 65 mm in order to give realistic stereopsis depth cues. However, for specialized applications, like stereo microscopy or aerial mapping, dramatically different interaxial separations might be appropriate.

Figure 2 Camera configuration: a) parallel axes cameras b) toed-in cameras

An undesired effect of the toed-in configuration is that it causes vertical misalignment of corresponding left and right image points. This misalignment, or vertical parallax, is known to be a source of discomfort for the viewer [7]. The reason for this effect, sometimes referred to as keystone distortion, is that the image sensors of the cameras are located in different planes. Therefore the left and right viewpoints get slightly different perspective views of the scene. The problem is illustrated in Figure 3.


Figure 3 Keystone Distortion: a) original image b) left eye view c) right eye view

Image processing algorithms for elimination of keystone distortion have been proposed [13], but it is most easily avoided by using the parallel axes camera set-up. The parallel axes configuration gives no vertical parallax (providing the cameras are correctly vertically aligned), but requires a horizontal translation of the resulting images. Because of the translation the images are not perfectly superimposed. This requires clipping of the images so that only the common field of view is displayed. Depending on how much the images are translated the convergence plane can be positioned at different perceived depths. A configurable convergence plane can be useful in eliminating the convergence/accommodation problem discussed in the next section.

2.4 Problems with stereoscopic display

There are a number of well-known problems with stereoscopic imaging, some depending on technological shortcomings and some relating to the characteristics of the human visual system. These problems are often manifested as eye strain and discomfort for the viewer.

2.4.1 Accommodation/convergence breakdown

When viewing an object in the real world the eyes are focused (accommodated) on the object by changing the shape of the lenses to give a sharp image. The eyes are also converged on the object by rotation so that the two images seen by the eyes can be fused by the brain into one object. In the real-world viewing situation the convergence plane and the focal plane always coincide. Conversely, when looking at a stereoscopic display, the eyes accommodate on the plane of the screen but converge based on the parallax between the left and right viewpoint images. This breakdown of the habitual accommodation/convergence relationship is a well-documented cause of eye strain [8]. The level of discomfort is highly individual and can be reduced with practice. Nevertheless, to minimize the
negative effects of the accommodation/convergence problem, the convergence plane should be positioned so that it appears to be in the plane of the screen. This can be done by an appropriate horizontal image translation in case the parallel axes camera set-up is being used.

2.4.2 Interposition and parallax depth cue conflicts

If an object in a 3D view that has negative parallax, i.e. one perceived to be located in front of the screen, is obscured by the bounding box of the screen or the 3D window, the sensation of stereoscopic depth is seriously impaired. This is because of the conflict between the 3D depth cue resulting from the negative parallax and the cue of interposition between the object and the screen or window surround. The easiest way to avoid this problem is to arrange the convergence plane so that the foreground objects have zero parallax. Thus, no object in the scene will appear to be in front of the screen and the problem vanishes. For some applications, however, it is highly desirable to have objects poking out of the screen, in which case care must be taken so that they are not clipped by the bounding region.

2.4.3 Crosstalk

Crosstalk is an undesirable effect occurring when imperfections in the stereoscopic display system result in compromised view separation [13]. For example, in a liquid crystal shutter glasses system, the unwanted view can leak to the wrong eye through CRT phosphor afterglow or through shortcomings of the optical shutter in the eyewear.

3 Requirements for stereoscopic video transmission

There are a number of issues that need to be considered in order to develop a flexible framework for transmitting stereoscopic video over packet networks. Some basic requirements and some desirable features are identified and discussed below.

3.1 Interoperability with monoscopic packet video systems

The fragmentation of the stereoscopic video content into IP packets should follow the specification of the Real-time Transport Protocol (RTP) and the appropriate RTP profile document for the media encoding in question [19]. This assures that prevalent packet video systems will be able to de-multiplex the packet streams generated by stereoscopic systems. Furthermore, in case monoscopic systems are used in combination with stereoscopic systems, it is highly desirable that the former can decode at least one of the two streams of a stereoscopic packet video stream. To facilitate the reception of only one of the two video streams different UDP port numbers should be used for the two channels. This arrangement is
consistent with the port-based multiplexing scheme for independent media streams devised by the RTP specification.

3.2 Independence of video encoding

The transmission architecture should be general enough to be used with any video encoding and compression scheme. Since most popular video codecs in use are restricted to encoding of monoscopic video, it must be possible to associate two independently encoded (monoscopic) video streams and to identify which stream is the left-eye view and which is the right-eye view. In case a stereoscopic encoding that defines a specific channel multiplexing scheme is being used, the encapsulation should be defined by an RTP profile document. This will, however, compromise interoperability with monoscopic systems as discussed above.

3.3 Compression

In order to utilize network bandwidth as efficiently as possible, compression must be applied to the digital video signals. Numerous compression schemes have been devised for monoscopic video, and techniques and standards customized for stereoscopic video are also emerging. It is widely recognized that substantial compression performance can be gained by exploiting the strong correlation between the left and right video channels of a stereoscopic video pair [9]. Furthermore, the resemblance between motion disparity and perspective disparity makes it possible for stereoscopic video compression to benefit from predictive coding techniques developed for temporal inter-frame compression in algorithms like MPEG. The conventional approach for stereoscopic video compression is to encode one of the channels using a standard monoscopic video compression algorithm and to encode the second channel differentially from the first, exploiting inter-channel redundancies. In the MPEG-2 multiview profile the left channel is encoded as a normal MPEG-2 stream and the right channel is encoded with disparity compensation from the left channel and also with motion compensation within the right channel [12]. Thus, with a suitable multiplexing scheme the left channel is decodable by any system capable of MPEG-2 decoding even if the multiview profile is not supported.

3.4 Switching between stereo and mono

Many of the contemplated applications of stereoscopic video transmission can be envisioned to benefit from stereopsis only for limited periods of time, while the rest of the session is better served with monoscopic video. An example of such a case would be a distributed design session using a videoconferencing system to communicate, enabling stereoscopic views whenever an object needs to be inspected in 3D. To facilitate this type of usage it must be easy to switch between
stereoscopic and monoscopic rendering at the receiver and between stereoscopic and monoscopic transmission at the sender.

3.5 Source host

It should be possible for the two channels of a stereoscopic video stream to originate from different hosts. This is useful for at least two reasons: Many video boards used in workstations only allow one input video signal. In order to transmit two video signals one could use two co-located workstations to transmit one stream each. This places additional demands on the synchronization of the streams, as will be discussed later. Another motivation for the requirement is that some applications of stereoscopic imaging require a very large inter-ocular distance to be useful. Examples of such applications include aerial mapping and space telescopy. Clearly, these systems will require different source addresses for the left and right video channels' data streams.

3.6 Destination address

In case of multicast operation, it might be useful to assign a different multicast address to each of the video channels. In that way stereo-capable receivers subscribe to both multicast groups whereas monoscopic receivers subscribe to only one. Thus, network bandwidth is not wasted in transmitting both video streams to a receiver that can only decode one. For this reason the left and right viewpoint's video streams should be allowed to have different (multicast) destination addresses.

3.7 Synchronization

The receiver of stereoscopic video must be able to synchronize the left and right video streams for playback. This is particularly important since moving objects in a scene can be perceived as having false parallax values as a result of being spatially displaced in the two views due to poor synchronization. The RTP timestamps of the two video streams represent the sampling instants of the video images and can thus be used for synchronization, providing the RTP timestamps of the two streams are derived from the same clock. This is straightforward if the two streams originate from the same host. Otherwise the receiver must relate the RTP timestamps of the video streams to the NTP timestamps in the corresponding RTCP sender reports. Consequently, the NTP timestamps of the transmitting sources must be synchronized. This is the purpose of the NTP protocol [18].
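
As an illustration, the mapping from RTP timestamps to wall-clock time might look as follows; a 90 kHz RTP clock is assumed, as is common for video payload types, and 32-bit RTP timestamp wrap-around is ignored for brevity.

RTP_CLOCK_HZ = 90000      # assumed 90 kHz RTP clock

def rtp_to_wallclock(rtp_ts, sr_rtp_ts, sr_ntp_seconds):
    # Map an RTP timestamp to wall-clock seconds using the most recent
    # RTCP sender report (SR) of the same stream.
    return sr_ntp_seconds + (rtp_ts - sr_rtp_ts) / RTP_CLOCK_HZ

def stereo_skew(left_ts, left_sr, right_ts, right_sr):
    # Playback skew (seconds) between a left-view and a right-view frame;
    # each *_sr argument is a (sr_rtp_ts, sr_ntp_seconds) pair.
    return rtp_to_wallclock(left_ts, *left_sr) - rtp_to_wallclock(right_ts, *right_sr)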


3.8 Session announcement and initiation

SDP, the Session Description Protocol, is a protocol for describing the multimedia content of real-time conferencing sessions [21]. SDP is used by the Session Announcement Protocol (SAP) to announce multicast conferences and by the Session Initiation Protocol (SIP) to synchronously initiate sessions [22]. In an SDP message a media description entry defines a particular media to be used in the session (such as audio or video), its encoding, the transport port and the transport protocol. For stereoscopic video two media description entries should be included in the SDP message; one for each channel. The media descriptions should specify different UDP port numbers for the two channels. An SDP media description can contain one or more attribute lines. To signal that a media description contains a stereoscopic video channel an attribute of the type "a=X-stereovideo:<channel>" should be present, where <channel> is "left" for the left video channel and "right" for the right channel. An example SDP message for a stereoscopic video session is given in Figure 4.

v=0
o=mathias 2890844526 2890842807 IN IP4 192.36.136.15
s=Stereoscopic Video Test
c=IN IP4 224.2.2.2/127
t=2873397496 2873404696
m=video 25566 RTP/AVP 26
a=X-stereovideo:left
m=video 25568 RTP/AVP 26
a=X-stereovideo:right

Figure 4 Example of an SDP session description for a stereoscopic video session

4 Transport protocol extension for stereoscopic video

In accordance with the requirements discussed in chapter 3 stereoscopic video streams should be transported using the Real-time Transport Protocol (RTP) [19]. Furthermore, the left and right video channels should be carried as two distinct RTP streams so that individual demultiplexing and decoding is possible for systems that cannot display stereoscopic images. A transport protocol extension is needed to associate the two channels of a stereoscopic stream to each other and to identify which channel is the left and right viewpoint respectively. Although this information can be defined at session initiation time using SDP as described in section 3.8, another mechanism is necessary for sessions that are not announced by SAP or SIP. Furthermore, the transport protocol extension permits the bindings of RTP streams to viewpoints to change throughout a session.


RTP provides end-to-end network transport functionality suitable for real-time, delay sensitive data transmission. The protocol defines packetization and multiplexing rules for real-time data. It also defines a packet header containing, among other things, fields for sequence numbers and timestamps, usable for things like packet loss detection, playout scheduling and cross-media synchronization. The details of packetization for specific media encodings are defined separately in RTP profile documents. A closely related protocol, the RTP Control Protocol (RTCP) is used for monitoring the quality of service of RTP sessions and to convey information identifying the properties of RTP flows.

The RTCP protocol defines source description (SDES) packets for carrying information about associated RTP streams. SDES packets consist of a packet header followed by a number of source identification/description pairs. The source identification is a 32-bit synchronization source identifier (SSRC) that uniquely identifies an RTP stream. Each RTP stream carries the SSRC identifier in its RTP header. The source description is a list of SDES items. An SDES item is a variable length entity consisting of an 8-bit item type identifier, an 8-bit length field and a variable length source identification string. Currently defined SDES item types include CNAME, NAME, PHONE, EMAIL, LOC, TOOL, NOTE, APP and PRIV. Each SDES item describes an RTP stream by some attribute like a real name or a phone number. The private extension (PRIV) SDES item is intended for experimental or application-specific SDES extensions. In addition to the 16-bit SDES item header the PRIV item also includes an 8-bit prefix length field and a variable length prefix string containing an ASCII identification of the PRIV item subtype. Since PRIV items of unrecognized subtypes are required to be silently ignored, new source description items can be introduced without requiring packet type value registration. If wider use is justified after testing it is recommended that the PRIV item is redefined as a unique SDES item, without the prefix identification, and given an item type that is registered by the Internet Assigned Numbers Authority (IANA) [24]. Thus, SDES PRIV items are ideal as containers for information associating the channels of a stereo pair.

The format of the stereo SDES PRIV item we have used in the experimental system presented in this paper is shown in Figure 5.

Figure 5 RTCP SDES PRIV item layout: SDES ID=8 (8 bits), length (8 bits), prefix length (8 bits), prefix string "3D-video" (64 bits), SSRC of the other channel's RTP stream (32 bits), channel id (8 bits)


The prefix string field is an 8-octet ASCII string that identifies the PRIV packet as a stereoscopic video source description item. The string value "3D-video" is used for stereo PRIV items. The prefix length should consequently be set to 8. The SSRC field contains the 32-bit numeric synchronization source identifier of the other channel's RTP stream. (That is, for a stereo PRIV item identifying a left-eye RTP stream this field contains the SSRC of the corresponding right-eye RTP stream and vice versa.) The channel id identifies the RTP video stream as being the left (channel id 1) or right (channel id 2) viewpoint of a stereo video pair.
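
A sketch of how such an item could be constructed is shown below, assuming the item length field covers everything after the two-octet item header; the helper name is illustrative.

import struct

def stereo_priv_item(other_ssrc, channel_id):
    # Build the stereo SDES PRIV item of Figure 5: item type 8 (PRIV),
    # a length octet, a prefix length octet, the 8-octet prefix "3D-video",
    # the SSRC of the other channel's RTP stream and a channel id
    # (1 = left, 2 = right).
    prefix = b"3D-video"
    content = struct.pack("!B", len(prefix)) + prefix + struct.pack("!IB", other_ssrc, channel_id)
    return struct.pack("!BB", 8, len(content)) + content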

The stereo SDES PRIV item should be included in the SDES item list of the RTCP packets periodically transmitted to the destination address of the associated RTP video stream. This assures that late joining members (in case of a multicast session) can identify the source as a stereo video channel.

Note that it is sufficient for only one of the RTP video streams of a stereo pair to be identified as stereoscopic with SDES PRIV items, since this gives a complete association of the two streams. However, since RTCP packets are implicitly associated with an RTP stream by UDP port number (the port number of the RTCP stream being one higher than that of the RTP stream) it might be desirable to mutually identify the stereo pair on both RTCP streams. This is useful if the reception of a stereo stream is divided between separate processes for each channel, or indeed distributed over two hosts. The behaviour is undefined if the stereo SDES PRIV items of the two streams are inconsistent.

5 Implementation issues

In this chapter the design and implementation of a stereoscopic videoconferencing application is presented. Some experiences from initial usage of the system are also reported.

5.1 A stereoscopic videoconferencing tool

In order to study the requirements of stereoscopic video transmission over the Internet in practice an experimental teleconferencing tool called Smile! was modified to support stereoscopic video transmission and display. The Silicon Graphics O2 workstation running the IRIX operating system was chosen as the target platform because of its broad multimedia capabilities and native support for stereoscopic rendering. JPEG compression was chosen for the video streams due to the availability of dedicated hardware for compression/decompression.

The transmitting side of the system was realized using two workstations with video grabber and compression hardware and one camera connected to each workstation. The cameras were arranged in the parallel-axes configuration, as described in section 2.3, with an inter-axial separation of about 65 mm. To enable stereoscopic transmission the user selects whether the video is the left or right
viewpoint by checking the corresponding checkbox in a pulldown menu. A snapshot of the graphical user interface is shown in Figure 6.

Figure 6 User interface for viewpoint selection

Once a viewpoint is selected from the graphical user interface, the application starts transmitting RTCP SDES PRIV packets for the stereoscopic extension as specified in section 4. In order to do this, the application must know the synchronization source identifier (SSRC) of the other viewpoint's RTP stream. In this prototype implementation the SSRC identifiers of the RTP streams were user-configurable by command line parameters. Allocating SSRC identifiers in this way is not recommended, since it compromises identifier uniqueness, but was nevertheless chosen for simplicity. A better approach would be to generate the SSRC identifiers randomly, as exemplified in Appendix A6 of the RTP specification [19], and to use some application-specific setup protocol to exchange the identifiers between the sending peers.

On the receiving side stereoscopic video streams are identified when the RTCP source description packets including the stereoscopic video extension items arrive. Smile! maintains a list of contributing members of the conference session that have been identified by source description RTCP packets. The list is graphically presented to the user with an icon identifying the media type and a name describing the originator. When two video streams have been identified as left and right viewpoint of a stereoscopic video pair respectively, they are represented by only one item in the graphically displayed list with an icon indicating the stereoscopic nature of the video. This is depicted in Figure 7.

Time-multiplexed stereoscopic video is displayed in a window or full-screen using OpenGL's quad-buffer rendering and is viewed using CrystalEyes shutter glasses from StereoGraphics [20]. A checkbutton for switching between stereoscopic and monoscopic rendering is available from a pull-down menu. The horizontal image translation needed to converge the views is controlled by the left and right arrow-keys on the keyboard. Thus, the convergence plane can be interactively adjusted to different perceived depths, depending on viewing conditions and user preference.
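
The following sketch outlines quad-buffered drawing of one stereo pair. It assumes PyOpenGL and a quad-buffer capable visual, and is an illustration of the technique rather than the Smile! implementation (which was written for IRIX).

from OpenGL.GL import (glDrawBuffer, glWindowPos2i, glDrawPixels,
                       GL_BACK_LEFT, GL_BACK_RIGHT, GL_RGB, GL_UNSIGNED_BYTE)

def draw_view(buffer, rgb, width, height, x_offset):
    # Select the left or right back buffer and draw the decoded frame
    # at the given horizontal offset (window coordinates).
    glDrawBuffer(buffer)
    glWindowPos2i(x_offset, 0)
    glDrawPixels(width, height, GL_RGB, GL_UNSIGNED_BYTE, rgb)

def draw_stereo_frame(left_rgb, right_rgb, width, height, shift):
    # 'shift' is the horizontal image translation controlling the position
    # of the convergence plane; the window system performs the buffer swap.
    draw_view(GL_BACK_LEFT, left_rgb, width, height, shift)
    draw_view(GL_BACK_RIGHT, right_rgb, width, height, -shift)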


Figure 7 The session member list: one stereoscopic video transmitter is identified by the "3D" label on the camera icon. Two other sources are active: one transmitting audio and video, the other only audio.

5.2 Usage experiences

Some early trials with stereoscopic video transmissions over the Internet have been performed using the prototype system described above. Although much more work is needed to evaluate the technology in question, some initial observations concerning usability have been made.

5.2.1 Notes concerning video quality

What level of quality to choose for a stereoscopic video transmission depends on the type of application in question and the bandwidth available. Also, different video encodings are appropriate in different situations. However, some observations of a more general nature can be made:

The frame rate of the left and right video streams should be the same. How high a frame rate is needed is highly application-dependent, but substantially different frame rates in the two channels must be avoided. Simultaneously displaying two temporally displaced frames of a dynamic scene will result in inconsistent views for the left and right eyes, creating ghosting effects, false parallax effects and confusion.

The effect of choosing different spatial quality for the two channels is more complex. Initial experiments suggest that a high spatial quality in one of the channels and a substantially lower quality in the other results in a perceived quality that is some sort of average of the two [23]. How this effect relates to other factors, like eye-dominance, requires further psychovisual research.


5.2.2 Positioning the convergence plane

Due to problems with conflicting depth cues, as discussed in section 2.4.2, a positioning of the convergence plane that causes negative parallax values must be treated with care. This was found to be particularly true if a small screen was used. In the trials with the prototype system presented here the best results were achieved when positioning the convergence plane in the plane of the screen.

5.2.3 Target applications

In computer aided engineering (CAE) and design (CAD) stereoscopic visualization of 3D models is commonly used. In distributed CAE and CAD sessions sharing of virtual 3D models between geographically dispersed teams of engineers can be complemented with shared stereoscopic views of real product prototypes. The addition of stereopsis in this context means that early prototypes can be evaluated remotely with a high degree of realism.

Another field where stereoscopic visualization is well established is medical simulation. With stereoscopic video transmission and visualization, remote medical consultations and remote surgery can be performed with increased quality.

6 Summary and conclusions

Advances in network technology and signal processing have made high-quality video transmissions over the Internet feasible. Since Internet videoconferencing hitherto has been limited to monoscopic video, the sensation of stereopsis resulting from binocular disparity is lost. This paper has identified the basic requirements for stereoscopic video transmission over the Internet. Also, a transport protocol extension has been proposed that enables two video signals of a stereoscopic pair to be associated and identified as left and right viewpoint respectively. The packetization and multiplexing rules for stereoscopic video were defined in accordance with the RTP specification. An implementation of a stereoscopic videoconferencing system was presented along with some initial usage experiences.

To conclude, stereoscopic video transmission systems can be successfully realized over the Internet using the transport protocol extension and general guidelines presented in this paper. Experiences from implementing and using the prototype system substantiate this assertion. It is this author's belief that many future communication systems will support stereoscopic video transmission and benefit from the powerful visual cues of stereopsis.


References

[1] H. Wallach, D. H. O'Connell, "The kinetic depth effect," Journal of Experimental Psychology, 45, pp. 205-217, 1953.

[2] C. Wheatstone, "On some remarkable and hitherto unobserved phenomena of binocular vision," Philosophical Transactions of the Royal Society of London, 1838.

[3] D. L. MacAdam, "Stereoscopic perceptions of size, shape, distance and direction," SMPTE Journal, 1954.

[4] P. Bos, T. Haven, "Field-sequential stereoscopic viewing systems using passive glasses," Proceedings of the SID, vol. 30, No. 1, pp. 39-43, 1989.

[5] I. E. Sutherland, "The ultimate display," Proceedings of the IFIPS Conference, pp. 506-508, 1965.

[6] I. E. Sutherland, "Head-mounted three-dimensional display," Proceedings of the fall joint computer conference, pp. 757-764, 1968.

[7] J. Konrad, "Enhancement of viewer comfort in stereoscopic viewing: parallax adjustment," Proceedings of SPIE/IST symposium on electronic imaging, stereoscopic displays and virtual reality systems, pp. 179-190, 1999.

[8] J. S. McVeigh, M. W. Siegel, A. G. Jordan, "Algorithm for automated eye strain reduction in real stereoscopic images and sequences," Proceedings of the SPIE/IST conference, February 1996.

[9] M. G. Perkins, "Data compression of stereopairs," IEEE Transactions on communications, vol 40, pp. 684-696, April 1992.

[10] M. W. Siegel, P. Gunatilake, S. Sethuraman, A. G. Jordan, "Compression of stereo image pairs and streams," Stereoscopic displays and applications V, pp. 258-268, February 1994.

[11] A. Puri, R. V. Kollarits, B. G. Haskell, "Stereoscopic video compression using temporal scalability," Proceedings of SPIE visual communications and image processing, May 1995.

[12] A. Luthra, X. Chen, "MPEG-2 multiview profile for MPEG-2," Proceedings of SPIE/IS&T Multimedia hardware architectures, February 1997.

[13] B. Lacotte, "Elimination of keystone and crosstalk effects in stereoscopic video," Technical report 95-31, INRS Telecommunications, Quebec, December 1995.

[14] L. Lipton, "Binocular symmetries as criteria for the successful transmission of images," Processing and display of three-dimansional data II, SPIE vol. 507, 1984.

[15] T. Miyashita, T. Uchida, "Cause of fatigue and its improvements in stereoscopic displays," Proceedings of the SID, vol. 31 no. 3, pp. 249-154, 1990.


[16] T. Yamazaki, K. Kamijo, S. Fukuzumi, "Quantitative evaluation of visual fatigue encountered in viewing stereoscopic 3D displays: Near-point distance and visual evoked potential study," Proceedings of the SID, vol. 31 no. 3, pp. 245-247, 1990.

[17] S. Pastoor, "3D-television: A survey on recent resarch results on subjective requirements," Signal processing: Image communication, vol. 4 no 1, pp. 21-32, 1991.

[18] D. L. Mills, "Network time protocol (version 3) specification, implementation and analysis," RFC1305, March 1992.

[19] H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson, "RTP: A transport protocol for real-time applications," RFC1889, January 1996.

[20] L. Lipton, "CrystalEyes handbook," StereoGraphics Corporation, 1991.

[21] M. Handley, V. Jacobsen, "SDP: Session description protocol," RFC2327, April 1998.

[22] M. Handley, H, Schulzrinne, E. Schooler, J. Rosenberg, "SIP: Session initiation protocol," RFC2543, March 1999.

[23] L. B. Stelmach, W. J. Tam, "Stereoscopic image coding: Effect of disparate image quality in left- and right-eye views," Signal Processing: Image Communication 14, pp. 111-117, 1998.

[24] Internet Assigned Numbers Authority, http://www.iana.org/