
Adaptive Nonlinear Video Editing: Retargeting,

Replaying, Repainting and Reusing (R4)

Yu Huang

Multimedia Content Networking Lab Core-Network Research Department

Huawei Technologies (USA) Bridgewater, NJ 08807

TR-2009-1, April, 2009

Abstract

With increasing access to and sophisticated use of digital video contents, there is ever-growing interest in video editing tools allowing consumers to manipulate the captured video. As the marketplace for consumer video editing evolves, we believe that a new generation of tools will be developed to provide video manipulation capabilities ranging from simple touchups to advanced postproduction special effects like those seen in big-budget Hollywood films. A video editing interface for the media area should allow users to think in terms of not only frames and timelines, but also the higher-level components of a video such as motion, action, character, and story. Recently, more research on video editing tools has been carried out. Basically, these efforts focus on processing videos in an adaptive, nonlinear and content-driven way, and may be classified into four groups: 1) retargeting in the spatial domain, 2) replaying in the temporal domain, 3) repainting in the luminance/texture domain and 4) reusing in the hybrid domain. In this report, we present an overview of the major technologies employed in each group.

This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of Huawei Technologies Co., Ltd.; an acknowledgment of the authors and individual contributions to the work; and all applicable portions of the copyright notice. All rights reserved.

Adaptive Nonlinear Video Editing: Retargeting,

Replaying, Repainting and Reusing (R4)

Yu Huang Multimedia Content Networking Lab, Huawei Technologies USA 400 Somerset Corporate Blvd. Suite 602, Bridgewater, NJ 08807

E-mail: [email protected]

1. Introduction

Prior to digital photography there was relatively little editing of film prints per se, beyond the crops afforded by scissors and the occasional zoom provided by photo lab enlargements. Today, by contrast, even the most naïve user of digital photography commonly crops, resizes, adjusts the contrast of, and varies the brightness of their images. Moreover, photo editing tools, such as Adobe Photoshop [1], increasingly apply sophisticated image processing operations, such as layer segmentation, clone brushing, and unsharp masking. With increasing access to and sophisticated use of digital video camcorders, there is ever-growing interest in video editing tools allowing consumers to manipulate the captured video. As with digital photography, the first generation of video editing tools facilitates the simplest and most common editing tasks, specifically the cutting and pasting of video segments interspersed with transitions and titles, as well as a rich set of intra-frame processing capabilities to improve image contrast and color balance. As the marketplace for consumer video editing evolves, we believe that a new generation of tools will be developed to provide video manipulation capabilities ranging from simple touchups to advanced postproduction special effects like those seen in big-budget Hollywood films. There are commercially available digital video editing tools such as Adobe Premiere [2], Final Cut by Apple [3] and Avid Media Composer [6] allowing so-called non-linear editing, the pipeline of cutting, pasting, and trimming sequences of frames. Apple's QuickTime [4] framework and Microsoft's DirectShow [7] framework implement multimedia filter graphs for video and audio. Specifically, DirectShow treats video data as a stream that flows in buffers of entire frames of pixels from the graph's input to the graph's output. Present video interfaces demand that users think in terms of frames and timelines. Although frames and timelines are useful notions for many editing operations, they are less well suited for other types of interactions with video. In many cases, users are likely to be more interested in the higher-level components of a video such as motion, action, character, and story. Although the concept of object motion is not as high-level
as character and story, it is a mid-level aspect of video that we believe is a crucial building block toward those higher-level concepts. Existing video editing tools still do not satisfy users' requirements in the media area. Recently, more research on adaptive video editing tools has been carried out, in which machine learning, pattern recognition, image processing and computer vision techniques are applied to provide users with a convenient and flexible interface to edit, manipulate and interact with video content. These functionalities are adaptive and content-driven, and can be classified into four groups: 1) retargeting in the spatial domain, 2) replaying in the temporal domain, 3) repainting in the luminance/texture domain and 4) reusing in the hybrid domain.

2. Retargeting

Standard image scaling is not sufficient since it is oblivious to the image content and typically can be applied only uniformly. Cropping is limited since it can only remove pixels from the image periphery. More effective resizing can only be achieved by considering the image content and not only geometric constraints. Recently, there has been growing interest in image retargeting, which seeks to change the size of the image while keeping the important features intact; these features can be detected either top-down or bottom-up. Video retargeting is the process of transforming an existing video to fit the dimensions of an arbitrary display. A compelling retargeting aims at preserving the viewers' experience by maintaining the information content of important regions in the frame, whilst keeping their aspect ratio. In retargeting, top-down methods use tools such as face detectors [77] to detect important regions in the image, whereas bottom-up methods rely on visual saliency methods [37] to construct a visual saliency map of the image. Once the saliency map is constructed, cropping can be used to display the most important region of the image. Attention models, based on human spatiotemporal perception, have been used to detect Regions of Interest (ROIs) in images and video. The ROIs are then used to define "display paths" [79] to be used on devices in which the display size is smaller than the video (or image) size. The least important content of the video is cropped, leaving the important features at a larger scale, essentially creating a zoom-in-like effect [29]. Virtual camera motions or pseudo zoom-in/out effects are used to present the content in a visually pleasing manner. A similar system was proposed by [52], where both cropping and scaling are used together with virtual camera motion to mimic the process of adapting wide screen feature films and DVDs to standard TV resolution. Their system minimizes
information loss based on image saliency, object saliency and detected objects (e.g. faces). Cropping, however, discards considerable amounts of information and might be problematic, for instance, if important features are located at distant parts of the image or frame, which is common in wide or over-the-shoulder shots in videos. Figure 1 shows an example of retargeting a film frame from a widescreen DVD to 200x150, 160x120 and 120x120 pixels, respectively.

Figure 1. Video retargeting: automating pan and scan
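
To make the bottom-up cropping idea above concrete, the sketch below assumes a per-pixel saliency map has already been computed (by an attention model or a face detector, which are not shown) and simply slides a window of the target size to find the crop with the highest total saliency; smoothing the chosen window over time, as the systems above do, is omitted.

import numpy as np

def best_crop(saliency, target_h, target_w):
    """Return (top, left) of the target-size window with maximal total saliency.

    `saliency` is a 2-D array of per-pixel importance, assumed to be
    precomputed elsewhere (e.g. by an attention model or a face detector)."""
    # Integral image makes every window sum an O(1) lookup.
    integral = np.pad(saliency, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    h, w = saliency.shape
    best, best_pos = -np.inf, (0, 0)
    for top in range(h - target_h + 1):
        for left in range(w - target_w + 1):
            s = (integral[top + target_h, left + target_w]
                 - integral[top, left + target_w]
                 - integral[top + target_h, left]
                 + integral[top, left])
            if s > best:
                best, best_pos = s, (top, left)
    return best_pos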

An alternative approach is to segment the image into background and foreground layers, scale each one of them independently and then recombine them to produce the retargeted image. This was first proposed by [67] for non-photorealistic retargeting of images and later extended to video by [73]. While this is an appealing approach, it relies crucially on the quality of segmentation, a difficult and complicated task in itself. Figure 2 shows a source image and its retargeted versions for a PDA and a cell phone display, respectively. Figure 3 gives an overview of the method proposed in [73].

Figure 2. Automatic image retargeting

Figure 3. Active Window Oriented Dynamic Video Retargeting

Recently, [88] presented a system to retarget video that uses non-uniform global warping. They concentrate on defining an effective saliency map for videos that combines spatial edges, face detection and motion detection. Results are shown mainly for reducing video size. Figure 4 shows a comparison between a retargeted frame and a traditionally resized frame at half width for the MPEG video "Akiyo"; the system overview is also given in Figure 4.

Figure 4. Non-homogeneous Content-driven Video Retargeting

In [26], surveillance videos are cropped based on the trajectory of the most important parts of the video for display. Deselaers et al. presented a method to fully automatically fit videos in 16:9 format on 4:3 screens and vice versa [21], shown in Figure 5. Different low-level features and a loglinear model are employed to learn how to find the right area. The cropping sequence is optimized over time to create smooth transitions and thus leads to a good viewing experience.

Figure 5. Pan, Zoom, Scan - Automatic Video Cropping

Seam carving is an image retargeting algorithm [13]. This algorithm alters the dimensions of an image not by scaling or cropping, but rather by intelligently removing pixels from (or adding pixels to) the image that carry little importance.

Figure 6. Seam Carving for Content-Aware Image Resizing

The importance of a pixel is generally measured by its contrast when compared with its neighbor pixels, but other measures may be used. Additionally, it is possible to define (or auto-detect) areas of high importance (faces, buildings, etc.) in which pixels may not be deleted, and conversely, areas of zero importance which should be removed first. A seam is an optimal 8-connected path of pixels on a single image from top to bottom, or left to right, where optimality is defined by an image energy function, shown as red curves in Figure 6. By repeatedly carving out or inserting seams in one direction we can change the aspect ratio of an image. By applying these operators in both directions we can retarget the image to a new size. The selection and order of seams protect the content of the image, as defined by the energy function. In Figure 6, the energy function is illustrated in the middle along with the vertical and horizontal path maps; the resizing result of extending in one dimension and reducing in the other is shown on the top right, compared to standard scaling on the bottom right. In a similar manner, video should support retargeting capabilities as it is displayed on TVs, computers, cellular phones and numerous other devices. A naive extension of seam carving to video is to treat each video frame as an image and resize it independently. This creates jittery artifacts due to the lack of temporal coherency, and a global approach is required. Instead, Rubinstein et al. improved it by treating video as a 3D cube and extending seam carving from 1-D paths on 2-D images to 2-D manifolds in a 3-D volume [65]. In the spatio-temporal volume, a seam must be monotonic, including one and only one pixel in each row (or column), and connected. This extension defines 2D surfaces to be removed from the 3D video cube, as shown in Figure 7 (compared with standard scaling).

Figure 7. Improved Seam Carving for Video Retargeting
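
The following minimal sketch illustrates the single-image operator of [13] under simplifying assumptions: the energy is plain gradient magnitude, only one vertical seam is removed, and the dynamic program is the standard cumulative-cost recurrence with backtracking; the video extension of [65] (2-D manifolds found in the space-time volume) is not attempted here.

import numpy as np

def remove_vertical_seam(img):
    """Remove one minimal-energy vertical seam from a grayscale image (2-D array)."""
    # Energy: gradient magnitude (any other saliency measure could be substituted).
    gy, gx = np.gradient(img.astype(float))
    energy = np.abs(gx) + np.abs(gy)

    h, w = energy.shape
    cost = energy.copy()
    # Dynamic programming: cumulative minimal cost of an 8-connected path.
    for y in range(1, h):
        left = np.r_[np.inf, cost[y - 1, :-1]]
        up = cost[y - 1]
        right = np.r_[cost[y - 1, 1:], np.inf]
        cost[y] += np.minimum(np.minimum(left, up), right)

    # Backtrack the seam from the minimal entry in the last row.
    seam = np.empty(h, dtype=int)
    seam[-1] = int(np.argmin(cost[-1]))
    for y in range(h - 2, -1, -1):
        x = seam[y + 1]
        lo, hi = max(x - 1, 0), min(x + 2, w)
        seam[y] = lo + int(np.argmin(cost[y, lo:hi]))

    # Carve: drop the seam pixel in each row.
    mask = np.ones_like(img, dtype=bool)
    mask[np.arange(h), seam] = False
    return img[mask].reshape(h, w - 1)

Repeating this removal, or running it on the transposed image for horizontal seams, changes the aspect ratio as described above.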

3. Replaying

The replay or playback of videos at faster than their original speeds has been investigated by the multimedia summarization community. Video is divided into segments, and the more important and interesting segments are selected to form a shorter version, a video abstraction. With granularity from small to large, the segmentation results can be frames, shots, scenes, and events. A shot is a sequence of frames recorded in a single camera operation, and a scene is a collection of consecutive shots that have semantic similarity in object, person, space, and time. Video abstraction methods use these notions of video structure. There are two types of video abstraction, video summary and video skimming [47]. A video summary, also called a still abstract, is a set of salient images (key frames) selected or reconstructed from an original video sequence. Video skimming, also called a moving abstract, is a collection of image sequences along with the corresponding audio from an original video sequence. Video skimming is also called a preview of the original video, and can be classified into two sub-types: highlight and summary sequence. A highlight contains the most interesting and attractive parts of a video, while a summary sequence renders the impression of the content of an entire video. Among all types of video abstraction, the summary sequence conveys the highest semantic meaning of the content of an original video. A good survey of video abstraction can be found in [47]. However, some important work is worth emphasizing in this section. The "Informedia" project at CMU [70] looks for short, representative video segments that best tell the story of the video. Segments are chosen based on characteristics including scene changes, camera motion, and audio. The documentary films they target have distinct scene changes, unlike time-lapse sources. Figure 8 illustrates the concept of extracting the most representative information to create the skim. The most significant frames from a selected scene are chosen for browsing. A single frame is selected from the skim as an iconic representation.

Figure 8. Video Skimming through Image and Language Understanding
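
As a toy illustration of the shot segmentation step mentioned above, the sketch below declares a hard cut whenever the color-histogram difference between consecutive frames exceeds a threshold; the bin count and threshold are arbitrary choices, and the systems surveyed here use far richer cues (motion, audio, learned models).

import numpy as np

def shot_boundaries(frames, bins=16, threshold=0.4):
    """Detect hard cuts by thresholding histogram differences between frames.

    `frames` is an iterable of RGB frames (H x W x 3 uint8 arrays); the
    threshold is a guess and would normally be tuned or learned."""
    boundaries = []
    prev_hist = None
    for i, frame in enumerate(frames):
        hist, _ = np.histogramdd(frame.reshape(-1, 3),
                                 bins=(bins,) * 3, range=[(0, 256)] * 3)
        hist /= hist.sum()
        if prev_hist is not None and np.abs(hist - prev_hist).sum() / 2 > threshold:
            boundaries.append(i)  # frame i starts a new shot
        prev_hist = hist
    return boundaries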

Sundaram et al. applied audio segmentation, visual complexity/grammar and a utility model to generate audio-visual skims from computable scenes [72]. The complexity is estimated from the key frame of each shot. Shots are regarded as words, and the syntax provides the meaning of the shot sequence. The utility function, which models the comprehensibility of a shot as a continuous function of its duration and its complexity, is used to calculate the penalty function for skim generation via constrained optimization. Li developed a content-based movie analysis, indexing and skimming system, which includes an event detection module, a speaker identification module, and a movie skimming module for content browsing, as reported in his thesis [48]. In [22] (work from MERL), short video clips are identified as containing significant motion, played at real-time speeds and assembled into the final video. Doing so sets a lower bound on the video's duration; the system flowchart is illustrated in Figure 9. Hua et al. [35] search for video segments that contain motion between shot boundaries and combine them to match an audio source (AVE); the system overview is shown in Figure 10.

Figure 9. Video summarization using MPEG-7 descriptors

Figure 10. Automated home video editing

Gargi et al. at HPL [30] apply content analysis, with user input as an option, to determine segments of higher semantic importance in a video and vary the playback rate accordingly. Based on perceptual and computational attention modeling studies, Evangelopoulos et al. presented an audiovisual saliency-based movie summarization [28]; the system overview is shown in Figure 11. Audio saliency is captured by signal modulations and related multifrequency band features. Visual saliency is measured by means of a spatiotemporal attention model driven by various feature cues (intensity, color, motion). Audio and video curves are integrated into a single attention curve, in which events may be enhanced, suppressed or eliminated. The presence of salient events is signified on this audiovisual curve by geometrical features such as local extrema, sharp transition points and level sets.

Figure 11. Movie Summarization based on Audiovisual Saliency Detection
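
A rough sketch of the fusion step described above is given below: per-frame audio and visual saliency curves (assumed to be computed elsewhere) are normalized, combined with a fixed weight, smoothed, and scanned for local maxima as a crude stand-in for the geometric event features used in [28]; the weight, smoothing width and level are illustrative only.

import numpy as np

def attention_curve(audio_saliency, visual_saliency, w_audio=0.5, smooth=5):
    """Fuse per-frame audio and visual saliency arrays into one attention curve."""
    a = (audio_saliency - audio_saliency.min()) / (np.ptp(audio_saliency) + 1e-8)
    v = (visual_saliency - visual_saliency.min()) / (np.ptp(visual_saliency) + 1e-8)
    curve = w_audio * a + (1 - w_audio) * v
    kernel = np.ones(smooth) / smooth               # simple moving-average smoothing
    return np.convolve(curve, kernel, mode="same")

def salient_events(curve, min_value=0.6):
    """Indices of local maxima above a level, standing in for the extrema,
    transition points and level sets used in [28]."""
    is_peak = (curve[1:-1] > curve[:-2]) & (curve[1:-1] > curve[2:])
    return np.where(is_peak & (curve[1:-1] >= min_value))[0] + 1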

Video Collage is a video browsing system from MSR Asia which automatically constructs a compact and visually appealing synthesized collage from a video sequence [54]. It selects the salient ROIs and resizes them according to their saliencies, and seamlessly arranges these ROIs on a given canvas while preserving the temporal structure of the video content. Video Collage provides a user interface for browsing video content by 1D/2D static collage, 1D/2D dynamic collage or key frames. Figure 12 illustrates its user interface with an example, where area A shows the 2D collage, B the video sequence, C the 1D collage and D the selected key frames.

Figure 12. Video Collage

Time-lapse is an effective tool for visualizing motions and processes that evolve too slowly to be perceived in real time. Time-lapse videos are regularly used to capture natural phenomena and to provide artistic cinematic effects. Time-lapse related techniques are also frequently applied to other applications including summarization of films and time-compression of surveillance videos. Traditional time-lapse methods use uniform temporal sampling rates and short, fixed-length exposures to capture each frame. As is the case with any periodic sampling method, the sampling rate must be sufficiently high to capture the highest-frequency changes, otherwise aliasing will occur. Bennett and McMillan [15] simplify time-lapse capture by removing the need to specify a sampling rate and exposure time a priori, which is accomplished through non-uniform sampling of a video-rate input to select salient output frames and non-linear integration of multiple frames to simulate normalized long exposures. An example is given in Figure 13.

Figure 13. Computational Time-Lapse
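
The sketch below captures only the sampling idea of [15], not its exposure model: frames from a video-rate input are accumulated until the total inter-frame change exceeds a budget, at which point the buffered frames are averaged into one output frame as a crude stand-in for a normalized long exposure. The change measure and budget are assumptions.

import numpy as np

def adaptive_time_lapse(frames, change_budget=30.0):
    """Non-uniform time-lapse: emit a frame whenever accumulated change exceeds
    a budget, and average the skipped frames to mimic a long exposure.

    `frames` is an iterable of float grayscale frames; the budget (mean absolute
    difference accumulated per output frame) is an illustrative choice."""
    output, bucket, accumulated = [], [], 0.0
    prev = None
    for frame in frames:
        bucket.append(frame)
        if prev is not None:
            accumulated += float(np.mean(np.abs(frame - prev)))
        prev = frame
        if accumulated >= change_budget:
            output.append(np.mean(bucket, axis=0))  # "virtual long exposure"
            bucket, accumulated = [], 0.0
    if bucket:
        output.append(np.mean(bucket, axis=0))
    return output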

Typical approaches to browsing video utilize a linear scan metaphor, such as a slider, timeline, or fast-forward and rewind controls. However, using pre-computed object motion, the trajectories of objects in the scene can be employed as constraints for direct-manipulation navigation. Shneiderman defines a direct manipulation interface as one with “visible objects and actions of interest, with rapid, reversible, incremental actions and feedback” [68]. Scrubbing, a method of controlling the video frame time by mouse motion along a time line or slider, is often used for this fine level control, allowing a user to carefully position the video at a point where objects or people in the video are in certain positions of interest or moving in a particular way. Kimber et al. [42] introduced the notion of navigating a video by directly manipulating objects within a video frame. They also demonstrate tracking and navigation across multiple cameras. However, their method uses static surveillance cameras and relies on whole object tracking, precluding navigation on multiple points of a deforming object. Figure 14 shows a trail in video and floor plan view.

Figure 14. Trailblazing: Video playback control by direct object manipulation
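
The core of trajectory-based scrubbing can be sketched in a few lines: given an object trajectory from an offline tracker (assumed precomputed), a mouse drag is mapped to the frame whose tracked position is nearest the cursor within a temporal window. Flow-based dragging of object parts and multi-camera handoff, as in the systems discussed here, are beyond this sketch.

import numpy as np

def frame_for_drag(trajectory, mouse_xy, current_frame, window=30):
    """Pick the frame whose tracked object position is closest to the cursor.

    `trajectory` is an (N, 2) array of the object's (x, y) position per frame,
    assumed to come from an offline tracker; the search is restricted to a
    temporal window around the current frame so the playhead moves coherently."""
    n = len(trajectory)
    lo, hi = max(0, current_frame - window), min(n, current_frame + window + 1)
    d = np.linalg.norm(trajectory[lo:hi] - np.asarray(mouse_xy), axis=1)
    return lo + int(np.argmin(d))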

Contemporaneous works by Karrer et al. [41] and Dragicevic et al. [25] use flow-based preprocessing to enable real-time interaction. Their research demonstrates that direct-manipulation video browsing permits significant performance improvements for frame selection tasks. Figure 15 shows an example of user interaction in [41]. Somewhat differently from [42], they use optical flow so that even individual parts of objects (like the hand of a soccer player) can be dragged around for navigation. In [25], background stabilization is performed to make direct manipulation easier, as shown in Figure 16. When the user drags an object, background stabilization shifts the video frames and leaves a trail of previously displayed frames.

Figure 15. DRAGON: manipulation interface for in-scene video navigation

Figure 16. Video browsing by direct manipulation

Recently, a powerful video interaction tool was demonstrated by Adobe's research lab [32], which explores the use of tracked 2D object motion for direct manipulation of objects. The system also enables a re-animation mode for video sequences using the mouse as input for various applications, for example, directing the motion of an object such as a face by dragging it around the screen. This interface can be applied to "puppeteer" existing video, or to retime the playback of a video. The system first analyzes the video in a fully automatic preprocessing step that tracks the motion of image points across the video and segments those tracks into coherently moving groups. Figure 17 shows two examples, one for surveillance videos and the other for human face videos.

Figure 17. Video Object Annotation, Navigation, and Composition

4. Repainting

Pixel-based processing techniques to repaint or reproduce video have been discussed for decades, such as high dynamic range (HDR) imaging, color transfer and video stylization. Traditionally, the texture mapping technique can be used to reproduce the surface of a given well-modeled object/scene. However, such object models are not readily obtained from a priori unseen and un-calibrated footage. Recently, more video/image-based approaches have been proposed to accomplish this task.

4.1 High Dynamic Range (HDR) Video

The real world has a lot more brightness variation than can be captured by the sensors available in most cameras today. The radiance of a single scene may span four orders of magnitude from shadows to fully lit regions. Typical CCD or CMOS sensors only capture about 256-1024 levels (the non-linear allocation of levels in a gamma curve can improve this slightly). Techniques addressing the limited dynamic range problem for images and videos work on either the capture side or the rendering side. In video editing, HDR techniques are designed to reproduce new videos that depict the full visual dynamics of real-world scenes as faithfully as possible through tone reproduction (tone mapping). Tone mapping provides a method of scaling (mapping) luminance values in the real world to a displayable range, either spatially varying (local) or spatially uniform (global). HDR video requires a more elaborate process than HDR images because of the need for temporal consistency in tone mapping. Kang et al. present a system [40] to generate an HDR video from an image sequence of a dynamic scene captured while rapidly varying the exposure. It utilizes statistics from temporally neighboring frames to produce tone-mapped images that vary smoothly in time. To realize this, it requires accurately registering neighboring frames and selecting the most trustworthy pixels for radiance map computation before tone mapping. Figure 18 is an example of HDR video, where the top row is the input video with alternating short and long exposures and the bottom row is the tone-mapped result.

Figure 18. HDR Video
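
For illustration only, the sketch below applies a simple global Reinhard-style tone curve per frame and smooths the scene's log-average luminance over time, which is one crude way to obtain the temporal coherence that the HDR video work above requires; it is not the operator of [40] or [53], and the key value and smoothing factor are assumptions.

import numpy as np

def tone_map(luminance, prev_log_avg=None, key=0.18, alpha=0.9):
    """Global Reinhard-style tone mapping with a temporally smoothed log-average.

    `luminance` is an HDR luminance image; `alpha` controls how strongly the
    previous frame's log-average is carried over (illustrative values)."""
    log_avg = float(np.exp(np.mean(np.log(luminance + 1e-6))))
    if prev_log_avg is not None:                       # smooth across frames
        log_avg = alpha * prev_log_avg + (1 - alpha) * log_avg
    scaled = key / log_avg * luminance
    display = scaled / (1.0 + scaled)                  # compress to [0, 1)
    return display, log_avg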

Mantiuk et al. proposed a tone-mapping operator that produces the least distorted image, in terms of visible contrast distortions, given the characteristics of a particular display device [53]. The distortions are weighted using a human visual system (HVS) model, which accounts for all major effects, including luminance masking, spatial contrast sensitivity and contrast masking. Such a tone-mapping operator is naturally formulated as an optimization problem, where the error function is weighted by the HVS model and the constraints are dictated by the display limitations. It runs very efficiently if the error function is based on higher-order image statistics, so that the non-linear optimization problem is reduced to a medium-sized quadratic programming task. A straightforward extension ensures temporal coherence and
makes it suitable for video sequences as well. Figure 19 illustrates adaptive tone mapping for low and high ambient light.

Figure 19. Display Adaptive Tone Mapping

4.2 Video Color Transfer (Colorization)

Colorization is generally used for increasing the visual appeal of grayscale images and for perceptually enhancing various single-band medical/scientific images (pseudo color). The traditional approach is to segment an image into regions and manually or semi-automatically color it region by region. Obtaining high-quality colorization using traditional techniques is extremely time-consuming. An attempt to minimize human intervention in the colorization process and speed it up was presented by Welsh et al. [85], who used luminance statistics to colorize grayscale images (target) using reference color images (source). Rather than choosing RGB colors from a palette to color individual components, they transfer the entire color "mood" of the source to the target image by matching luminance and texture information between the images. There is an underlying assumption that differently colored regions give rise to distinct luminance values; the technique works properly only when this assumption holds, otherwise significant user intervention is required. Figure 20 shows an example of this approach.

Figure 20. Transferring color to grayscale images

Levin et al. [45] assume that in addition to the monochrome data, the user scribbles some colors (strokes) on the image as constraints. The color is added following the simple premise that neighboring pixels having similar intensities in the monochrome data should have similar colors in the chroma channels. This premise is formulated as a quadratic cost function, yielding an optimization problem. Thus, the indicated colors are automatically propagated in both space and time. Figure 21 shows a colorized frame (middle) compared with the original color frame (right).

Figure 21. Colorization using Optimization
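
A simplified sketch of the optimization in [45] is shown below: chrominance is propagated by minimizing a quadratic cost in which each pixel's chroma should equal an intensity-weighted average of its neighbors', with scribbled pixels pinned by a soft penalty. The 4-neighborhood, Gaussian affinity weights and soft (rather than hard) constraints are simplifications of the paper.

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def colorize_channel(gray, scribble, scribble_mask, sigma=0.05, lam=100.0):
    """Propagate scribbled chroma over a grayscale image, in the spirit of [45].

    `gray` is a float image in [0, 1]; `scribble` holds chroma values at the
    pixels marked True in `scribble_mask`. Minimizes sum_r (U(r) - sum_s w_rs
    U(s))^2 plus a penalty `lam` that pins scribbled pixels."""
    h, w = gray.shape
    n = h * w
    idx = np.arange(n).reshape(h, w)
    rows, cols, vals = [], [], []
    for dy, dx in ((0, 1), (0, -1), (1, 0), (-1, 0)):
        ys, xs = np.mgrid[0:h, 0:w]
        ny, nx = ys + dy, xs + dx
        ok = (ny >= 0) & (ny < h) & (nx >= 0) & (nx < w)
        r, s = idx[ok], idx[ny[ok], nx[ok]]
        wgt = np.exp(-((gray.ravel()[r] - gray.ravel()[s]) ** 2) / (2 * sigma ** 2))
        rows.append(r); cols.append(s); vals.append(wgt)
    rows, cols, vals = map(np.concatenate, (rows, cols, vals))
    W = sparse.csr_matrix((vals, (rows, cols)), shape=(n, n))
    W = sparse.diags(1.0 / np.maximum(W.sum(axis=1).A.ravel(), 1e-8)) @ W  # row-normalize
    L = sparse.identity(n) - W                   # (I - W) u ~ 0 away from scribbles
    C = sparse.diags(lam * scribble_mask.ravel().astype(float))
    u = spsolve((L.T @ L + C).tocsr(), C @ scribble.ravel())
    return u.reshape(h, w)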

Irony et al. proposed a new method for colorizing grayscale images by transferring color from a segmented example image [36], shown in Figure 22. Rather than relying on a series of independent pixel-level decisions [85], the colorizations exhibit a much higher degree of spatial consistency. This method requires considerably less manual effort than previous user-assisted colorization methods [45].

Figure 22. Colorization by Example

Yatziv and Sapiro presented a fast colorization approach in [89] permitting the user to interactively get the desired results promptly after providing a reduced set of chrominance scribbles. Their method obtains the high resolution colorization results based on the concepts of luminance-weighted chrominance blending and fast intrinsic distance computations. Figure 23 is an example of re-colorization.

Figure 23. Fast Colorization using Chrominance Blending

4.3 Video Stylization

Stylized rendering of video is an active area of research in non-photorealistic rendering (NPR). Cartoon animations are typically composed of large regions which are semantically meaningful and highly abstracted by artists. A region may simply be constantly colored as in most cel animation systems, or it may be rendered in some other consistent style. Hertzmann et al. [33] modified each successive frame of the video by first warping the previous frame to account for optical flow changes and then painting over areas of the new frame that differ significantly from its predecessor. An example is shown in Figure 24. This work was extended in [34] by guiding paint strokes with a general energy term consisting of both pixel color differences and optical flow.

Figure 24. Paint by relaxation

DeCarlo and Santella propose a stylization and abstraction approach for NPR in [20]. In their system, images are transformed into a style combining line drawing with large regions filled with constant color. For abstraction, the system uses eye-tracking data to determine where to remove extraneous details and to highlight important objects, as illustrated in Figure 25.

Figure 25. Stylization and abstraction of photographs

Wang et al. demonstrate a system for transforming video into abstracted cartoon animation [80]. In the system, the user simply outlines objects on key frames. A mean shift algorithm is then employed to create 3D semantic regions by interpolation between the key frames, while maintaining smooth trajectories along the time dimension. Two examples are given in Figure 26.

Figure 26. Video Tooning

Winnemöller et al. present an automatic, real-time video and image abstraction framework in [87] that abstracts imagery by modifying the contrast of visually important features, namely luminance and color opponency. The abstraction step is extensible and allows for artistic or data-driven control. Abstracted images can optionally be stylized using soft color quantization to create cartoon-like effects with good temporal coherence, as illustrated in Figure 27.

Figure 27. Real time video abstraction
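
In the spirit of [87], the sketch below abstracts a frame by iterated bilateral filtering followed by soft luminance quantization; the DoG edge overlay and contrast-dependent parameter control of the paper are omitted, and the filter settings are arbitrary. It relies on OpenCV's bilateral filter.

import cv2
import numpy as np

def abstract_frame(bgr, passes=3, levels=8):
    """Cartoon-style abstraction of one BGR frame: repeated bilateral filtering
    followed by soft luminance quantization (edge overlay omitted)."""
    smoothed = bgr
    for _ in range(passes):                  # iterated bilateral filter
        smoothed = cv2.bilateralFilter(smoothed, 9, 30, 7)
    lab = cv2.cvtColor(smoothed, cv2.COLOR_BGR2LAB).astype(np.float32)
    L = lab[..., 0]
    step = 255.0 / levels
    nearest = np.round(L / step) * step
    # Soft quantization: a tanh ramp around each bin improves temporal coherence.
    lab[..., 0] = nearest + (step / 2.0) * np.tanh((L - nearest) / (step / 4.0))
    return cv2.cvtColor(np.clip(lab, 0, 255).astype(np.uint8), cv2.COLOR_LAB2BGR)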

4.4 Video-based modeling and rendering

In computer vision, several systems have been developed to automatically recover a cloud of 3D scene points from a video sequence [57, 74]. However, these are vulnerable to ambiguities in the image data, degeneracies in camera motion, and a lack of discernible features on the model surface. These difficulties can be overcome by manual intervention in the modeling process. An interactive tool called "VideoTrace" is presented in [76], which aids the recovery of polyhedral surface models from video but is restricted to rigid scenes. An example of modeling a car is given in Figure 28.

Figure 28. VideoTrace: Rapid interactive scene modeling from video

Pavic et al. presented a semi-interactive system for advanced video processing and editing in [56]. The basic idea is to partially recover planar regions in object space and to exploit this minimal pseudo-3D information in order to make perspectively correct modifications. One critical operation is to add or remove objects by copying them from other video streams and distorting them perspectively according to some planar reference geometry. The necessary user interaction is entirely in 2D and easy to perform even for untrained users. The technique is based on feature tracking and homography matching. In complicated and ambiguous scenes, user interaction as simple as 2D brush strokes can be used to support the registration. An example is shown in Figure 29.

Figure 29. 2D Video Editing for 3D Effects

Triangulation of the sparse points from non-rigid structure may be expected to be at least as difficult as from rigid structure, which has proved surprisingly troublesome. Microsoft Research Lab introduces a technique which overcomes these difficulties to a large extent, generating a representation, called “Unwrap Mosaic” [62], which is in some ways equivalent to a deforming 3D surface model, but can be extracted directly from video. The primary goal is to recover the object’s texture map, rather than its 3D shape. Accompanying the recovered texture map will be a 2D-to-2D mapping describing the texture map’s projection to the images, and a sequence of binary masks modeling occlusion. A video will typically be represented by an assembly of several unwrap mosaics: one per object, and one for the background. Video editing can be performed on the mosaic itself and re-rendered without ever converting to a 3D representation. A demo of face editing is shown in Figure 30.

Figure 30. Unwrap Mosaic

5. Video Reusing

Video content can be processed in complicated and impressive ways. Users can rearrange the components of a video frame, such as regions, objects or motion, for repurposing, so that the video is changed in a hybrid spatio-temporal-luminance domain.

5.1 Video Texture/Synopsis

"Video Textures" [66] looks for transitions within a video that are least noticeable, in order to extend the playing time indefinitely. A video texture provides a continuous, infinitely varying stream of images. Video textures can be used in place of digital photos to infuse a static image with dynamic qualities and explicit action. Figure 31 shows a fish tank populated with artificial fish sprites. Alternatively, Kwatra et al. [43] generate video textures by copying patch regions from the input video to the output and stitching them together along optimal seams computed with graph cuts, generating a new (and typically larger) output.

Figure 31. Video Texture
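
The heart of [66] is a frame-to-frame distance table: the cost of jumping from frame i to frame j is the distance between frame i+1 and frame j, and low-cost pairs become candidate loop transitions. The sketch below computes that table and returns the best candidates, leaving out the paper's temporal filtering, future-cost propagation and probabilistic playback.

import numpy as np

def transition_candidates(frames, top_k=20):
    """Find low-cost jump transitions for a video texture.

    `frames` is an (N, H, W[, C]) array; the cost of jumping from frame i to
    frame j is the distance between frame i+1 and frame j, as in [66]."""
    F = frames.reshape(len(frames), -1).astype(np.float32)
    # Pairwise L2 distances between all frames.
    sq = (F ** 2).sum(axis=1)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * F @ F.T, 0))
    cost = D[1:, :]                      # cost[i, j] = jump from frame i to frame j
    i, j = np.unravel_index(np.argsort(cost, axis=None), cost.shape)
    keep = np.abs(i - j) > 1             # ignore trivial jumps to adjacent frames
    return list(zip(i[keep][:top_k], j[keep][:top_k]))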

“Panoramic Video Textures (PVT)” [11] extended the idea of video textures. PVT is a video that has been stitched into a single, wide field of view and that appears to play continuously. One frame of PVT for video “waterfall” is given in Figure 32. The key technique in creating a PVT is that although only a portion of the scene has been imaged at any given time, the output must simultaneously portray motion throughout the scene.

Figure 32. Panoramic Video Textures

An alternative approach [60], called dynamosaicing, is to sweep the space-time video volume with a time front surface and generate time slices in a new video sequence. This approach, termed Evolving Time Fronts, gives users the ability to manipulate time in dynamic video scenes. An example is given in Figure 33.

Figure 33. Dynamosaicing

Rav-Acha et al. [61] specifically address time-lapse videos, allowing events to occur simultaneously and/or out of chronological order, i.e. video synopsis, resulting in disjoint image regions combined using 3D Markov fields and arranged using simulated annealing. Another paper [58] proposes an “object-based” approach to webcam synopsis, where they segment the input video into objects and activities, rather than frames. Then they compose a short video synopsis, in response to user query. Two examples are given in Figure 34.

Figure 34. Dynamic video synopsis

5.2 Video Matting & Compositing

Extracting foreground objects from still images or video sequences plays an important role in many image and video editing applications, and it has thus been extensively studied for more than twenty years. Accurately separating a foreground object from the background involves determining both full and partial pixel coverage, also known as pulling a matte, or digital matting. Most matting approaches rely on user guidance and prior assumptions on image statistics to constrain the problem to obtain good estimates of the unknown variables. In parallel to image matting research, video matting, the process of pulling a matte from a video sequence of a dynamic foreground element against a natural background, has also received considerable attention. Video matting is a critical operation in commercial television and film production, giving the power to insert new elements seamlessly into a scene, or to transport an actor into a completely new environment in order to create novel visual artifacts. Extracting foreground objects from still images is a hard problem. Unfortunately, extracting dynamic objects from video sequences is an even more challenging task due to several aspects, such as large data size, temporal coherence, and fast motion vs. low temporal resolution. A number of techniques have been proposed in existing video matting approaches to alleviate the difficulties and leverage the advantages. To deal with large data size, most approaches adopt a two-step framework. In the first step, only binary segmentation is solved to generate a trimap for each frame. Given the trimaps, matting algorithms are then applied in the second step to refine the foreground boundary. Since only binary segmentation is considered in the first step, these approaches can give users rapid response through various user interfaces. Once accurate trimaps are generated, image matting algorithms can then be applied offline on each frame to generate the final fine mattes. To address the importance of temporal coherence, instead of creating trimaps on video frames independently, most approaches create trimaps in a temporally coherent way, by performing spatio-temporal optimizations. This also allows trimaps to be propagated from a limited number of user-defined key frames to the entire sequence, resulting in significantly reduced user input. When facing fast motions, most approaches rely on the user to provide dense input as guidance for properly tracking objects across frames. Including users in the loop ensures the systems are able to handle sequences at a variety of difficulty levels. A good survey on video matting can be found in [84]. In addition, compositing video is an operation required in the production of most modern motion pictures. Compositing is the process of seamlessly combining multiple image or video regions. Chuang et al. [18] proposed Bayesian matting, which formulates the problem in a
well-defined Bayesian framework and solves it using MAP estimation. Later Chuang et al. extended it to video [19] with optic flow estimation to temporally propagate the trimaps, leading to impressive video-sequence mattes which require a minimum of manual input. Figure 35 illustrates the matting process. Apostoloff and Fitzgibbon extended the Bayesian approach to take proper account of spatiotemporal information in [12]. This confers many of the advantages of the optic-flow based technique, but with a system that has less dependence on accurate trimaps. An example is given in Figure 36.

Figure 35. Video matting with trimap

Figure 36. Bayesian matting with priors

Recently, the LazySnapping [49] and GrabCut [64] systems have employed graph-cut optimization to achieve more coherent and higher quality foreground segmentation. In both systems users coarsely indicate foreground and background regions with a few strokes of the mouse, and the system determines the ideal boundary for segmenting the image. Figure 37 shows two examples from [64]. However, none of these approaches deal very well with large amounts of partial foreground coverage.

Figure 37. GrabCut
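
OpenCV ships an implementation of GrabCut, so the rectangle- and stroke-driven segmentation described above can be tried directly; the sketch below initializes from a rough bounding rectangle and returns a binary foreground mask (stroke-based refinement, as in LazySnapping, would edit the mask and re-run with the mask-initialization mode).

import cv2
import numpy as np

def grabcut_foreground(bgr, rect, iterations=5):
    """Binary foreground mask from a rough bounding rectangle via GrabCut.

    `rect` is (x, y, w, h) drawn loosely around the object."""
    mask = np.zeros(bgr.shape[:2], np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)   # internal GMM state
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(bgr, mask, rect, bgd_model, fgd_model,
                iterations, cv2.GC_INIT_WITH_RECT)
    fg = (mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)
    return fg.astype(np.uint8)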

The Poisson matting algorithm [71] assumes the foreground and background colors are smooth. Thus, the gradient of the matte matches the gradient of the image and can be estimated by solving Poisson equations. The iterative matting system [81] solves for a matte directly from a few user-specified scribbles instead of a carefully specified trimap. In [50], a system is presented for cutting a moving object out of a video clip and pasting it onto another video or a background image. To achieve this, a 3D graph-cut based segmentation is performed on the spatio-temporal video volume. The system then provides brush tools for the user to control the object boundary precisely and applies coherent matting to extract the alpha mattes and foreground colors of the object. An illustration is given in Figure 38. Another similar system with a painting-based user interface is shown in [82]. It extends previous min-cut based image segmentation techniques to the domain of video, where 2D alpha matting methods designed for images are extended to work with 3D video volumes. A matting and compositing example is provided in Figure 39.

Figure 38. Object Cut-and-Paste

Figure 39. Interactive cutout

The closed-form matting [46] approach assumes foreground and background colors can be fit with local linear models, which leads to a quadratic cost function that can be minimized globally. Recently, an interactive tool for extracting alpha mattes of foreground objects in real time was reported in [83], called Soft Scissors. In an online interactive setting, this real-time system estimates foreground color, thereby allowing both the matte and the final composite to be revealed instantly as the user roughly paints along the edge of the foreground object. In addition, it can dynamically adjust the width and boundary conditions of the scissoring paint brush to approximately capture the boundary of the foreground object that lies ahead on the scissors' path. Figure 40 shows the processing result for an image.

Figure 40. Soft scissors

Interactive digital photomontage combines different regions of a set of roughly aligned photos with similar content into a single composite [9]. Graph cuts are used to minimize the visibility of the seams between the combined regions, followed by Poisson image editing to reduce any remaining artifacts. Figure 41 illustrates how the photomontage is done.

Figure 41. Interactive digital photomontage
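
The Poisson image editing step mentioned above can be sketched as a sparse linear solve: inside the pasted region the composite keeps the source's Laplacian, while pixels on the region boundary are pinned to the target. The toy version below handles one channel and assumes the mask does not touch the image border.

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def poisson_blend(source, target, mask):
    """Seamlessly paste `source` into `target` inside `mask` (all 2-D, same size)
    by solving the discrete Poisson equation with the source Laplacian as the
    guidance field and target values as boundary conditions (one channel)."""
    ys, xs = np.nonzero(mask)
    idx = -np.ones(target.shape, dtype=int)
    idx[ys, xs] = np.arange(len(ys))
    n = len(ys)
    A = sparse.lil_matrix((n, n))
    b = np.zeros(n)
    for k, (y, x) in enumerate(zip(ys, xs)):
        A[k, k] = 4.0
        b[k] = 4.0 * source[y, x]
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            b[k] -= source[ny, nx]              # guidance: Laplacian of the source
            if mask[ny, nx]:
                A[k, idx[ny, nx]] = -1.0        # interior neighbour: unknown
            else:
                b[k] += target[ny, nx]          # boundary neighbour: Dirichlet value
    result = target.astype(float).copy()
    result[ys, xs] = spsolve(A.tocsr(), b)
    return result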

The process of tracking the boundary curve of an object through a video sequence is called rotoscoping. User-guided rotoscoping techniques allow users to trace curves every few frames and automatically interpolate between them. Agarwala et al. [10] have combined an optimization technique with user guidance to significantly reduce user effort. However, this technique requires that the video contain strong edges between foreground and background, and it has difficulty with fast-moving foreground objects. A demo of a few frames is given in Figure 42.

Figure 42. Rotoscoping and Animation

Wang et al. proposed a video editing tool (video shop) working in the gradient domain by modifying and/or mixing the spatio-temporal gradient fields of target videos to generate a new gradient field that is integrated by solving a 3D Poisson equation [78]. Two application scenarios are described: replacing the face of a person in a target image using the face of another person in a source video, and the compositing of two videos, as shown in Figure 43.

Figure 43. Video shop

In [39], a user-friendly system for seamless image composition, called "drag-and-drop pasting", is designed and implemented. To make Poisson image editing more practical and easy to use, an objective function that computes an optimized boundary condition is applied. A shortest closed-path algorithm is designed to search for the location of the boundary. Figure 44 demonstrates this process.

Figure 44. Drag-and-drop paste

Lalonde et al. built a system for inserting new objects into existing photographs by querying a vast image-based object library [44]. The user is only asked to do two simple things: 1) pick a 3D location in the scene to place a new object; 2) select an object to insert using a hierarchical menu. The problem of object insertion is posed as a data-driven, 3D-based, context-sensitive object retrieval task. Figure 45 shows the object insertion tool and one of the results.

(a) original

(b) object insertion

(c) GUI

Figure 45. Photo Clip Art

5.3 Video Completion (Repairing)

Video completion refers to the process of filling in missing pixels or replacing undesirable pixels in a video. It is of great importance to many applications such as video repairing, video editing and movie post-production. There are two main approaches to image completion. Image inpainting [16] methods use PDEs to repair minor damage to images, as shown in Figure 46. For small, non-textured regions, such methods achieve visually satisfactory results. However, the lack of generated texture in larger, more complex reconstructed areas is clearly visible.

Figure 46. Image Inpainting

Texture synthesis methods comprise the other approach. After selecting a target pixel whose neighborhood is partially inside the hole, a source fragment, with texture
matching the target’s known neighborhood, is sought elsewhere in the image. This source fragment is then merged into the neighborhood of the target pixel. Such methods are suited to filling large holes in images. The method in [24] uses these ideas together with hierarchical image approximation and adaptive neighborhood sizes, leading to impressive results, but at high computational cost. Several examples are given in Figure 47.

Figure 47. Fragment-based image completion
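
As a toy illustration of the texture-synthesis approach just described, the sketch below greedily fills hole pixels on the fill front by searching, on a coarse grid, for the known patch that best matches the target's known neighborhood and copying its center value; it has none of the hierarchy, fragment merging or adaptive neighborhoods of [24], and is far too slow for real use.

import numpy as np

def _fill_front(hole):
    """Hole pixels that have at least one known 4-neighbour."""
    known = ~hole
    neigh = np.zeros_like(hole)
    neigh[1:, :] |= known[:-1, :]
    neigh[:-1, :] |= known[1:, :]
    neigh[:, 1:] |= known[:, :-1]
    neigh[:, :-1] |= known[:, 1:]
    return hole & neigh

def fill_hole(img, hole, patch=9, stride=4):
    """Toy exemplar-based completion of a grayscale image.

    `hole` is a boolean mask of missing pixels; candidate source patches are
    taken from a sparse grid of fully known locations."""
    img, hole = img.astype(float), hole.copy()
    r = patch // 2
    h, w = img.shape
    centres = [(y, x) for y in range(r, h - r, stride) for x in range(r, w - r, stride)
               if not hole[y - r:y + r + 1, x - r:x + r + 1].any()]
    while hole.any():
        ys, xs = np.nonzero(_fill_front(hole))
        if len(ys) == 0:
            break                               # nothing left that touches known pixels
        for y, x in zip(ys, xs):
            if not (r <= y < h - r and r <= x < w - r):
                hole[y, x] = False              # border pixels are skipped in this toy version
                continue
            tgt = img[y - r:y + r + 1, x - r:x + r + 1]
            known = ~hole[y - r:y + r + 1, x - r:x + r + 1]
            best, best_val = np.inf, img[y, x]
            for cy, cx in centres:
                src = img[cy - r:cy + r + 1, cx - r:cx + r + 1]
                d = np.sum((src[known] - tgt[known]) ** 2)
                if d < best:
                    best, best_val = d, src[r, r]
            img[y, x] = best_val
            hole[y, x] = False
    return img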

Video completion is more challenging for two reasons. Firstly, the amount of data in video sequences is much larger, so texture synthesis methods cannot be directly applied to video completion: searching for a source fragment in the whole video dataset would be much too slow. Secondly, temporal consistency is a necessity; temporal artifacts are more objectionable than spatial aliasing in images, due to the eye's sensitivity to motion. Simply completing video sequences frame by frame using image completion methods leads to flickering, and is inappropriate. Bertalmío et al. [17] considered extending image inpainting techniques to video sequences using ideas from fluid dynamics. As before, such video inpainting is useful for filling small non-textured holes in video sequences, but is unsuitable for completing large space-time holes caused by the removal of macroscopic objects. Figure 48 provides two frames of the video "Foreman" after recovery.

Figure 48. Navier-stokes, image and video inpainting

Wexler et al. [86] treat video completion as a global optimization problem, enforcing global spatio-temporal consistency during video completion. They solve the problem iteratively: missing video portions are filled pixel by pixel. Multiple target fragments are considered at different locations for the unknown pixel; for each, the method seeks the most similar space-time source fragment elsewhere in the video. The fragments are merged according to similarity criteria to complete the unknown pixel. For speed, the method is performed at several scales using spatio-temporal pyramids, and nearest-neighbor search algorithms are used. Overall, however, this approach is slow, and the results appear blurred due to the fragment merging and smoothing operations. Figure 49 is a demo clip for umbrella removal.

Figure 49. Space-time video completion

Jia et al. implemented a system capable of synthesizing a large number of pixels that are missing in the video [38]. These missing pixels may correspond to the static background or cyclic motions of the captured scene. This system employs user-assisted video layer segmentation. Missing colors and illumination of the background are synthesized by applying image repairing. Finally, the occluded motions are inferred by spatio-temporal alignment of collected samples at multiple scales. Figure 50 is a demo clip for statue removal.

Figure 50. Video repairing: Inference of foreground and background

Zhang et al. [90] segment video sequences into different non-overlapping motion layers, each of which is completed separately. After removal of unwanted video objects in each layer, the method selects a reference frame in each layer and completes that frame. The solution is then propagated to other frames using the known motion parameters. This yields good results, but is limited to rigid bodies for which the transformation between frames can readily be determined: for example, their appearance may not vary over time by rotating in three dimensions. Figure 51 gives several frames of an example after object removal.

Figure 51. Motion layer based object removal in videos

Shiratori et al. proposed a new approach for video completion using motion field transfer [69]. Unlike prior methods, they fill in missing video parts by sampling spatio-temporal patches of local motion instead of directly sampling color. Once the local motion field has been computed within the missing parts of the video, color can then be propagated to produce a seamless, hole-free video. Figure 52 shows a video clip for removing a moving object.

Figure 52. Motion field transfer

5.4 Video Annotation

Video annotation is the task of associating graphical objects with moving objects on the screen. The telestrator [8], popularly known as a "John Madden-style whiteboard", was invented for drawing annotations on a TV screen using a light pen. This approach has also been adopted for individual sports instruction using systems like ASTAR [5] that aid coaches in reviewing videos for athletic performance analysis. Interactive TV is a popular application area of hyperlinked video with the convergence between broadcast and network communications. For example, the European GMF4iTV (Generic Media Framework for Interactive Television) project has developed such a system where active video objects are associated with metadata information [75] embedded in the program stream at production time, and can be selected by the user at run time to trigger the presentation of their associated metadata. An example of object annotation from a PDA is shown in Figure 53.

Figure 53. GMF4iTV

Another European project, PorTiVity (Portable Interactivity), is developing and experimenting with a complete end-to-end platform providing Rich Media Interactive TV services for portable and mobile devices [55], realizing direct interactivity with moving objects on handheld receivers connected to DVB-H (broadcast channel) and UMTS (unicast channel). Figure 54 provides an interactivity example with metadata display.

Figure 54. PorTiVity



Goldman et al. [31] constructed a system for visualizing short video clips in a single static image using the visual language of storyboards. These schematic storyboards are composed from multiple input frames and annotated with outlines, arrows, and text describing the motion in the scene. Their system renders a schematic storyboard layout from a small amount of user interaction and also lets the user scrub through time by direct manipulation of video objects. Figure 55 shows an example with arrows for camera motion.

Figure 55. Schematic Storyboard

In [32], an interactive technique for visually annotating independently moving objects in a video stream is proposed. Features in the video are automatically tracked and grouped in an off-line preprocess that enables later interactive manipulation and annotation. Examples of such annotations include speech and thought balloons, video graffiti, hyperlinks, and path arrows. Figure 56 shows an example of annotation with highlighted hyperlinks to web pages and word balloons.

Figure 56. Video Object Annotation, Navigation, and Composition
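
A minimal rendering loop for such tracked annotations might look as follows; the track format (object id to per-frame point) and the simple text overlay are assumptions for illustration and do not reproduce the feature grouping or balloon layout of [32].

    import cv2

    def render_annotations(frames, tracks, labels):
        # tracks: {object_id: {frame_index: (x, y)}} from an off-line
        # feature tracker; labels: {object_id: text to display}.
        out = []
        for i, frame in enumerate(frames):
            canvas = frame.copy()
            for obj_id, track in tracks.items():
                if i in track:
                    x, y = int(track[i][0]), int(track[i][1])
                    cv2.circle(canvas, (x, y), 4, (255, 255, 255), -1)
                    cv2.putText(canvas, labels[obj_id], (x + 8, y - 8),
                                cv2.FONT_HERSHEY_SIMPLEX, 0.6,
                                (255, 255, 255), 2)
            out.append(canvas)
        return out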


6. Conclusion

The state-of-the-art video editing technologies are content-driven, nonlinear and adaptive. Some cutting-edge research achievements come from the combination of video processing, computer vision, machine learning, human-computer interaction and multimedia technologies, such as object detection and tracking, motion analysis, image and video segmentation, image-based modeling and rendering, 3-D reconstruction, image blending in the gradient domain (Poisson equation) and so on. In the future, temporal coherence and spatial consistency will be further exploited to alleviate the artifacts of video editing methods. More natural interaction with the GUI may be developed, for example integrating hand gestures, facial expressions or body postures. For content analysis, multi-modal cues are required; audio energy, speech recognition and natural language understanding can all be employed to process the video.

7. References

1. http://www.adobe.com/PhotoshopCS4: Adobe Photoshop 7.0.

2. http://www.Adobe.com/PremiereProCS4: Adobe Premiere Pro.

3. http://www.apple.com/finalcutpro/: Apple Final Cut Pro 4.0.

4. http://www.apple.com/quicktime/, Apple QuickTime 6. Apple Computer, Inc.

5. http://www.astarls.com, ASTAR Learning Systems. 2006. [Available 21-January-2009].

6. http://www.avid.com, Avid Media Composer. Avid Technology, Inc.

7. http://www.microsoft.com/directx, Microsoft DirectShow (DirectX 9.0). Microsoft Co.

8. http://en.wikipedia.org/w/index.php?title=Telestrator&oldid=180785495, Telestrator. 2006. [Available 21-January-2009].

9. A. Agarwala, M. Dontcheva, M. Agrawala, S. Drucker, A. Colburn, B. Curless, D. Salesin, and M. Cohen, “Interactive digital photomontage,” ACM SIGGRAPH, pp. 294–302, 2004.

10. A. Agarwala, A. Hertzmann, D. H. Salesin, and S. M. Seitz. “Keyframe-based tracking for rotoscoping and animation”. SIGGRAPH’04, 23(3):584–591, 2004.

11. A. Agarwala, C. Zheng, C. Pal, M. Agrawala, M. Cohen, B. Curless, D. Salesin, R. Szeliski. “Panoramic Video Textures”. ACM SIGGRAPH 2005.

12. N. Apostoloff and A. Fitzgibbon, “Bayesian Video Matting Using Learnt Image Priors”, IEEE CVPR, 2004.

13. S. Avidan, A. Shamir, “Seam Carving for Content-Aware Image Resizing“, ACM Siggraph’07.

14. E. P. Bennett and L. McMillan, “Proscenium: A framework for spatiotemporal video editing,” ACM MM’03, pp. 2–8, November 2003.

15. E. P. Bennett and L. McMillan, “Computational time-lapse video”, ACM Siggraph 2007.

16. M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, "Image inpainting", ACM SIGGRAPH'00, pp. 417–424, 2000.

17. M. Bertalmio, A. L. Bertozzi, "Navier-Stokes, fluid dynamics, and image and video inpainting", IEEE CVPR'01, December 2001.

18. Y. Chuang, B. Curless, D. Salesin, and R. Szeliski, "A Bayesian Approach to Digital Matting", IEEE CVPR, 2001.

19. Y.-Y. Chuang, A. Agarwala, B. Curless, D. H. Salesin, and R. Szeliski, "Video matting of complex scenes", SIGGRAPH'02, pp. 243–248, 2002.

20. D. DeCarlo, A. Santella, "Stylization and abstraction of photographs", ACM SIGGRAPH'02, pp. 769–776, 2002.

21. T. Deselaers, P. Dreuw, H. Ney, "Pan, Zoom, Scan – Time-coherent, Trained Automatic Video Cropping", IEEE CVPR'08, 2008.

22. A. Divakaran, K. A. Peker, R. Radhakrishnan, Z. Xiong, R. Cabasson, "Video summarization using MPEG-7 motion activity and audio descriptors", Tech. Rep. TR-2003-34, MERL, 2003.

23. G. Doretto and S. Soatto, “Editable dynamic textures”, IEEE CVPR ’03, vol. 2, pp. 137–142, June 2003.

24. Drori, I., Cohen-Or, D., Yeshurun, H. “Fragment-based image completion”. ACM Trans. Graph. 22(3), 303–312 (2003)

25. P. Dragicevic, G. Ramos, J. Bibliowicz, D. Nowrouzezahrai,R. Balakrishnan, and K. Singh. “Video browsing by direct manipulation”. ACM CHI, pp237–246, 2008.

26. H. El-Alfy, D. Jacobs, L. Davis, "Multi-scale video cropping", ACM MM'07, Sept. 2007.

27. J. H. Elder and R. M. Goldberg, “Image editing in contour domain,” IEEE Trans. PAMI, vol. 23, no. 3, pp. 291–296, March 2001.

28. G. Evangelopoulos, K. Rapantzikos, A. Potamianos, P. Maragos, A. Zlatintsi, Y. Avrithis, “Movie Summarization based on Audiovisual Saliency Detection”, IEEE ICIP’08, pp2528-2531, Oct. 2008.

29. Fan, X., Xie, X., Zhou, H.-Q., Ma, W.-Y. “Looking into video frames on small displays”. ACM MULTIMEDIA’03, pp247–250. 2003.

30. U. Gargi, S. Banerjee, “Exploratory Video Search by Augmented Playback”, Tech. Report, HP Lab, HPL-2006-155, Oct., 2006.

31. D. B Goldman, B. Curless, S. M. Seitz, and D. Salesin. “Schematic storyboarding for video visualization and editing”. ACM SIGGRAPH, 25(3):862–871, 2006

32. D.Goldman, C. Gonterman, B. Curless, D. Salesin, S. M. Seitz, “Video Object Annotation, Navigation, and Composition”, ACM symposium on User interface software and technology (UIST), pp3-12, Oct. 2008.

33. Herzmann, A., Perlin, K. “Painterly rendering for video and interaction”. NPAR’00, pp7–12. 2000.

34. Herzmann, A. “Paint by relaxation”. Computer Graphics International 2001, pp47–54. 2001.

35. Hua, X.-S., Lu, L., Zhang, H.-J. “AVE: Automated home video editing”. ACM Multimedia, pp490–497. 2003.


36. R. Irony, D. Cohen-Or, and D. Lischinski, “Colorization by example”, Eurographics Symposium on Rendering, 2005.

37. L. Itti, C. Koch, E. Niebur, "A model of saliency-based visual attention for rapid scene analysis", IEEE PAMI, 20(11), pp. 1254–1259, 1999.

38. J. Jia, T. P. Wu, Y. W. Tai, and C. K. Tang, "Video repairing: Inference of foreground and background under severe occlusion", IEEE CVPR 2004.

39. J. Jia, J. Sun, C.-K. Tang, and H.-Y. Shum, “Drag-and-drop pasting,” ACM SIGGRAPH, 2006.

40. S. Kang, M. Uyttendaele, S. Winder, and R. Szeliski, “High dynamic range video,” SIGGRAPH’03, vol. 61, pp. 1–11, 2003.

41. T. Karrer, M. Weiss, E. Lee, and J. Borchers, "DRAGON: A direct manipulation interface for frame-accurate in-scene video navigation", ACM CHI, pp. 247–250, 2008.

42. D. Kimber, T. Dunnigan, A. Girgensohn, F. Shipman, T. Turner, and T. Yang. “Trailblazing: Video playback control by direct object manipulation “. IEEE ICME, pp1015–1018, 2007.

43. V. Kwatra, A. Schödl, I. Essa, G. Turk, and A. Bobick, "Graphcut textures: Image and video synthesis using graph cuts", SIGGRAPH'03, pp. 277–286, 2003.

44. J.-F. Lalonde, D. Hoiem, A. Efros, C. Rother, J. Winn, A. Criminisi, "Photo Clip Art", ACM SIGGRAPH'07, 26(3), August 2007.

45. A. Levin, D. Lischinski, and Y. Weiss, "Colorization using Optimization", ACM Transactions on Graphics, Aug 2004.

46. A. Levin, D. Lischinski, and Y. Weiss, “A closed form solution to natural image matting,” IEEE CVPR 2006.

47. Y. Li, T. Zhang, D. Tretter, “An Overview of Video Abstraction Techniques”, HP Laboratories Palo Alto, Tech. Report No. HPL-2001-191, July, 2001.

48. Y. Li and C.-C.J. Kuo, Video Content Analysis Using Multimodal Information: For Movie Content Extraction, Indexing and Representation. Norwell, MA: Kluwer, 2003.

49. Y. Li, J. Sun, C.-K. Tang, and H.-Y. Shum, “Lazy snapping,” ACM SIGGRAPH, pp. 303–308, 2004.

50. Y. Li, J. Sun, and H. Shum, “Video object cut and paste,” ACM SIGGRAPH, pp. 595–600, 2005.

51. Y. Li, S.-H. Lee, C.-H. Yeh, and C.-C.J. Kuo, “Techniques for movie content analysis and skimming,” IEEE Signal Processing Magazine, vol. 23, pp. 79–89, Mar 2006.

52. F. Liu, M. Gleicher, “Video retargeting: automating pan and scan”. ACM Multimedia’06, pp241-250, 2006.

53. R. Mantiuk, S. Daly, L. Kerofsky, “Display adaptive tone mapping”, ACM Siggraph2008.

54. T. Mei, B. Yang, S. Yang, X-S Hua, “Video collage: presenting a video sequence using a single image”, The Visual Computer, 25(1), pp39-51, 2009.

55. H. Neuschmied, R. Trichet, B. Merialdo, "Fast Annotation of Video Objects for Interactive TV", ACM MM 2007.

56. D. Pavic, V. Schoenefeld, L. Krecklau, M. Habbecke, L. Kobbelt, "2D video editing for 3D effects", VMV (Vision, Modeling and Visualization) 2008.

57. M. Pollefeys, L. V. Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops, R. Koch, "Visual modeling with a hand-held camera", IJCV, 59(3), pp. 207–232, 2004.

58. Y. Pritch, A. Rav-Acha, A. Gutman, and S. Peleg. “Webcam synopsis: Peeking around the world”. IEEE ICCV, pp1–8, 2007.

59. G. Ramos and R. Balakrishnan. “Fluid interaction techniques for the control and annotation of digital video”. UIST’03, pp105–114, 2003.

60. A. Rav-Acha, Y. Pritch, D. Lischinski, and S. Peleg. “Dynamosaicing: Video mosaics with non-chronological time”. IEEE CVPR, pages 58–65, 2005.

61. A. Rav-Acha, Y. Pritch, and S. Peleg. “Making a long video short: Dynamic video synopsis”. CVPR, pp435–441, 2006.

62. A. Rav-Acha, P. Kohli, C. Rother, A. Fitzgibbon, "Unwrap mosaics: a new representation for video editing", ACM Siggraph, 2008.

63. C. Rhemann, C. Rother, A. Rav-Acha, and T. Sharp, “High resolution matting via interactive trimap segmentation,” IEEE CVPR, 2008.

64. C. Rother, V. Kolmogorov, and A. Blake, “Grabcut - interactive foreground extraction using iterated graph cut,” ACM SIGGRAPH, pp. 309–314, 2004.

65. M Rubinstein, A. Shamir, S. Avidan, “Improved Seam Carving for Video Retargeting”, ACM Siggraph 2008.

66. A. Schödl, R. Szeliski, D. H. Salesin, and I. Essa, "Video textures", ACM SIGGRAPH'00, pp. 489–498, 2000.

67. Setlur, V., Takagi, S., Raskar, R., Gleicher, M., Gooch, B. “Automatic image retargeting”. Mobile and Ubiquitous Multimedia (MUM), ACM Press. 2005.

68. B. Shneiderman, “Direct manipulation: A step beyond programming languages”. IEEE Computer, 16(8), pp57–69. 1983

69. T. Shiratori, Y. Matsushita, S. Kang, X. Tang, "Video completion by motion field transfer", IEEE CVPR 2006.

70. Smith, M., Kanade, T., “Video Skimming and Characterization through the Combination of Image and Language Understanding Techniques”, IEEE CVPR'97, pp. 775-781, June, 1997.

71. J. Sun, J. Jia, C-K Tang, and H. Shum, “Poisson matting,” ACM SIGGRAPH, pp. 315-321, 2004.

72. H. Sundaram, L. Xie, S-F Chang. “A Utility Framework for the Automatic Generation of Audio-Visual Skims”. ACM Multimedia’02, December 2002.

73. C. Tao, J. Jia, H. Sun, “Active Window Oriented Dynamic Video Retargeting”, Int. Workshop on Dynamical Vision, ICCV 2007.

74. T. Thormählen, H. Broszio, "Voodoo Camera Tracker: A tool for the integration of virtual and real scenes", http://www.digilab.uni-hannover.de/docs/manual.html, 2008.

75. R. Trichet, B. Merialdo, "Fast Video Object Selection For Interactive TV", IEEE ICME 2006.


76. A. van den Hengel, A. Dick, T. Thormählen, B. Ward, P. H. S. Torr, "VideoTrace: Rapid interactive scene modeling from video", ACM SIGGRAPH 2007.

77. P. Viola, M. Jones, "Robust Real-time Object Detection", IEEE ICCV 2001.

78. H. Wang, N. Xu, R. Raskar, N. Ahuja, "Videoshop: a new framework for spatio-temporal video editing in gradient domain", Graphical Models, 69(1), pp. 57–70, January 2007.

79. J. Wang, M. Reinders, R. Lagendijk, J. Linderberg, M. Kankanhalli, “Video content presentation on tiny devices”. IEEE ICME, vol. 3, 1711–1714. 2004.

80. J. Wang, Y. Xu, H Shum, M. F. Cohen, “Video Tooning”, ACM Siggraph 2004.

81. J. Wang, P. Bhat, A. Colburn, M. Agrawala, and M. Cohen, “Interactive video cutout,” ACM SIGGRAPH, 2005.

82. J. Wang, M. Cohen, "An iterative optimization approach for unified image segmentation and matting", IEEE ICCV'05, pp. 936–943, 2005.

83. J. Wang, M. Agrawala, M. Cohen, “Soft Scissors: An Interactive Tool for Real time High Quality Matting”, ACM SIGGRAPH 2007.

84. J. Wang, M. Cohen. “Image and Video Matting: A Survey”. Foundations and Trends in Computer Graphics and Vision, Vol. 3, No.2, 2007.

85. T. Welsh, M. Ashikhmin and K. Mueller, “Transferring color to grayscale images”, ACM SIGGRAPH 2002.

86. Y. Wexler, E. Shechtman, and M. Irani, “Space-time video completion”, IEEE CVPR 2004.

87. H. Winnemöller, C. Olsen, B. Gooch, "Real-time video abstraction", ACM Siggraph 2006.

88. L. Wolf, M. Guttmann, D. Cohen-Or, “Non-homogeneous Content-driven Video Retargeting”, IEEE ICCV 2007.

89. L. Yatziv and G. Sapiro. “Fast Image and Video Colorization using Chrominance Blending”, IEEE T-IP, vol.15, no.5, pp. 1120- 1129, May 2006.

90. Zhang, Y., Xiao, J., Shah, M. “Motion layer based object removal in videos”. IEEE Workshop on Applications of Computer Vision, 2005.
