

Deep RNNs for Video Denoising

Xinyuan Chen, Li Song, and Xiaokang Yang

Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Cooperative Medianet Innovation Center, Shanghai, China

ABSTRACT

Video denoising can be described as the problem of mapping a fixed-length sequence of noisy frames to a clean frame. We propose a deep architecture based on Recurrent Neural Networks (RNNs) for video denoising. The model learns a patch-based end-to-end mapping between noisy and clean video sequences: it takes a corrupted video sequence as input and outputs the clean one. Our deep network, which we refer to as deep Recurrent Neural Networks (deep RNNs or DRNNs), stacks RNN layers so that each layer receives the hidden state of the previous layer as input. Experiments show that (i) the recurrent architecture extracts motion information along the temporal dimension and benefits video denoising, (ii) the deep architecture has enough capacity to express the mapping from corrupted input videos to clean output videos, and (iii) the model generalizes, learning different mappings for videos corrupted by different types of noise (e.g., Poisson-Gaussian noise). By training on large video databases, we are able to compete with some existing video denoising methods.

Keywords: Deep RNNs, Video Denoising, patch-based method, spatial-temporal processing

1. INTRODUCTION

Video denoising is a classical problem in computer vision and video processing, as well as a testbed for low-level video modeling algorithms. It has been widely studied in the literature, yet remains an active topic. A video denoising procedure takes a corrupted video Y = X + N as input, where X is the underlying clean video and N is additive noise, and outputs a video in which the noise has been reduced. By accommodating different noise distributions, the same model could be extended to other low-level video processing problems, such as deblurring, super-resolution, and inpainting.

Deep learning has made great progress in computer vision and pattern recognition (e.g., image classification with deep convolutional networks [1]), thanks to its enormous representational power and fast execution on Graphics Processing Units (GPUs). How to apply deep learning to time-sequence data and mine temporal information is a popular topic. RNNs are a superset of feedforward neural networks, with the ability to pass information across time steps. They were first widely used in language-related domains such as speech recognition [2] and image description [3]. In computer vision, Srivastava et al. [4] verified that RNNs can learn both motion information and appearance features from videos and successfully exploited such representations for recognition tasks, which motivates us to propose a novel patch-based deep RNN model for video sequence denoising. To the best of our knowledge, our method is the first end-to-end deep RNN for video denoising. The procedure takes a few consecutive noisy frames as input and outputs frames in which the noise has been reduced. Results on additive white Gaussian noise are competitive with the current state of the art, and the approach is equally valid for other types of noise.

In summary, the following reasons inspired us to propose deep RNNs for video denoising:

Further author information: (Send correspondence to Xinyuan Chen, Li Song)
Xinyuan Chen: E-mail: [email protected], Telephone: +86 15800767939
Li Song: E-mail: song [email protected], Telephone: +86 13816840561
Xiaokang Yang: E-mail: [email protected], Telephone: +86 13761192352


1. RNNs are able to extract motion information and appearance features from temporal sequences [4], which helps video denoising.

2. Deep architectures have large capacity for expressing the mapping from corrupted input videos to clean output videos.

3. The model generalizes to different types of noise, provided the noise can be simulated.

The contribution of this paper is a patch-based denoising algorithm learned on a large dataset with a deep recurrent neural network. Results on additive white Gaussian noise (AWGN) approach the current state of the art. The model is equally valid for other types of noise and for mixed noise.

2. RELATED WORK

In the last decades, great progress has been made in image denoising with various methods, e.g., sparse coding [5], conditional random fields [6], variational techniques [7, 8], and patch-based methods [9]. Among these, BM3D [10] is known as a state-of-the-art approach. It groups similar image patches based on the ℓ2-norm distance, and noise is then removed by wavelet shrinkage or Wiener filtering in a 3D transform domain.

Video denoising is different in that video sequences contain motion information and high temporal redundancy that can be exploited during noise removal. A common patch-based image denoising method can be applied to video denoising by searching for similar patches in different frames over time [11, 12]. Liu and Freeman [13] also use motion vectors and group patches across adjacent frames, but in a different manner: instead of comparing patches to the reference patch, patches in each frame are compared with the motion-compensated version of the reference patch. NL-means [14] is applied to this group of collected patches; the algorithm adaptively estimates the noise model of the sequence, which is an important issue for real applications. Gijiesh and Zhou [15] took whole adjacent frames into account by using motion estimation for global motion compensation; a wavelet transform is then applied to the stacked frames, including the current frame and past/future frames. Mairal et al. [16] learned multi-scale sparse representations for video restoration. VBM3D [12] evolves from BM3D by matching similar patches both within frames and across frames at different timesteps using predictive-search block matching. VBM4D [17], the state of the art for white noise removal in video, evolved from VBM3D by exploiting similarity between 3D spatio-temporal volumes instead of 2D patches; similar volumes are grouped by stacking them along an additional fourth dimension, producing a 4D structure. Collaborative filtering is realized by transforming each group with a decorrelating 4D separable transform, followed by shrinkage and inverse transformation. Low-rank matrix completion [18] is another practicable patch-based method: it groups similar patches in the temporal-spatial domain and minimizes the nuclear norm (the ℓ1-norm of all singular values) of the matrix under linear constraints, leading to state-of-the-art performance in mixed-noise denoising. Another way to utilize spatiotemporal information is to combine denoising with motion estimation [15]: the model simultaneously captures the local correlations between the wavelet coefficients of natural video sequences across both space and time, strengthened with a motion compensation process.

Since recurrent neural networks (RNNs) can model long-term contextual information in video sequences, many successes in RNN-based video processing have been reported, ranging from high-level recognition [19, 20] and middle-level tasks down to low-level tasks [21]. Recently, Huang et al. [21] made a breakthrough in multi-frame super-resolution by using a bidirectional recurrent convolutional network for efficient multi-frame SR, a remarkable application of RNN models to low-level video processing. To our knowledge, there is currently no RNN-based video denoising, so our work is the first to propose an RNN-based model for this task.

3. MODEL DESCRIPTION

3.1 Deep Recurrent Neural Networks

Recurrent neural networks are a powerful family of connectionist models that capture temporal dynamics via cycles in the network graph; information can thus flow across timesteps inside the network. Details about recurrent neural networks can be found in a review of RNNs [22]. A single RNN unit is shown in Figure 1(a).


Figure 1. Structure of recurrent neural networks. (a) A simple recurrent network; the delay nodes represent recurrence in time. (b) A deep recurrent neural network with 2 layers.

At time t, the hidden layer units h^(t) receive activations from the input x^(t) at the current timestep and the hidden state h^(t−1) at the previous timestep. The output y^(t) is computed from the hidden units h^(t) at time t:

h^(t) = σ(W_hx x^(t) + W_hh h^(t−1) + b_h)    (1)

y^(t) = σ(W_yh h^(t) + b_y)    (2)

The weight matrices W_hx, W_hh, W_yh and the vector-valued biases b_h, b_y parameterize the RNN; the activation function σ(·) (e.g., tanh or sigmoid) operates component-wise. In our model, all activation functions are hyperbolic tangents except for the output layer, which is linear. The key difference between an RNN and an ordinary neural network is that the recurrent hidden units are influenced not only by the current input but also by the previous hidden state. As a result, the hidden units can be seen as a container carrying information from earlier parts of the sequence.
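A minimal NumPy sketch of one recurrent step, assuming a 17×17 patch flattened to a 289-dimensional input and a hidden size of 1024 (illustrative only; the experiments were run with the Theano implementation described in Section 4.2):

    import numpy as np

    def rnn_step(x_t, h_prev, W_hx, W_hh, W_yh, b_h, b_y):
        """One recurrent step, following Eqs. (1)-(2)."""
        h_t = np.tanh(W_hx @ x_t + W_hh @ h_prev + b_h)   # Eq. (1), tanh hidden activation
        y_t = W_yh @ h_t + b_y                            # Eq. (2), linear output layer
        return h_t, y_t

    # Illustrative dimensions: 17x17 patch flattened to 289, hidden size 1024.
    n_in, n_hid = 17 * 17, 1024
    rng = np.random.RandomState(0)
    W_hx = rng.uniform(-1 / np.sqrt(n_in),  1 / np.sqrt(n_in),  (n_hid, n_in))
    W_hh = rng.uniform(-1 / np.sqrt(n_hid), 1 / np.sqrt(n_hid), (n_hid, n_hid))
    W_yh = rng.uniform(-1 / np.sqrt(n_hid), 1 / np.sqrt(n_hid), (n_in, n_hid))
    b_h, b_y = np.zeros(n_hid), np.zeros(n_in)

    h = np.zeros(n_hid)            # initial hidden state
    x = rng.rand(n_in)             # a flattened noisy patch
    h, y = rnn_step(x, h, W_hx, W_hh, W_yh, b_h, b_y)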

Deep recurrent neural networks (deep RNNs) extend RNNs by stacking one input layer, multiple hidden layers, and one output layer [23-25]. In fact, a deep RNN stacks layers in depth in the same way as a multilayer perceptron (MLP); if we remove the recurrence across timesteps, a deep RNN reduces to an MLP. A deep RNN with two hidden layers is shown in Figure 1(b). Generally speaking, deep RNNs have more hidden layers than a simple RNN. Hidden layer h_l^(t) (l ≥ 2) receives the layer below, h_{l−1}^(t), and its own previous hidden state, h_l^(t−1), as input:

h_l^(t) = σ(W_{h_l h_{l−1}} h_{l−1}^(t) + W_{h_l h_l} h_l^(t−1) + b_{h_l})    (3)
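A corresponding sketch of the stacked recurrence in Eq. (3), again in NumPy with illustrative layer sizes of our own choosing; each layer keeps its own hidden state, reading the layer below at the current timestep and its own state at the previous timestep:

    import numpy as np

    def deep_rnn_forward(xs, Ws_in, Ws_rec, bs):
        """Run a stack of recurrent layers over a sequence.
        xs: list of T input vectors; Ws_in[l], Ws_rec[l], bs[l]: parameters of layer l."""
        L = len(Ws_in)
        h = [np.zeros(W.shape[0]) for W in Ws_rec]       # h_l^(0) = 0 for every layer
        tops = []
        for x_t in xs:
            below = x_t
            for l in range(L):
                # Eq. (3): layer l sees the layer below (current step) and itself (previous step)
                h[l] = np.tanh(Ws_in[l] @ below + Ws_rec[l] @ h[l] + bs[l])
                below = h[l]
            tops.append(below)                           # top-layer hidden state at time t
        return tops

    # Two hidden layers of size 1024, 289-dimensional inputs (17x17 patches), 7 timesteps.
    rng = np.random.RandomState(0)
    sizes = [(1024, 289), (1024, 1024)]
    Ws_in  = [rng.randn(o, i) * 0.01 for o, i in sizes]
    Ws_rec = [rng.randn(o, o) * 0.01 for o, _ in sizes]
    bs     = [np.zeros(o) for o, _ in sizes]
    tops = deep_rnn_forward([rng.rand(289) for _ in range(7)], Ws_in, Ws_rec, bs)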

3.2 Deep RNNs for Video Denoising

In this section, we propose an end-to-end model that uses deep RNNs for video denoising. The model consists of two recurrent layers, as shown in Figure 2. The input to the model is a sequence of vectors (corrupted video cubes), and the target output is the groundtruth vector (the clean image patch); that is, to denoise a given frame we take several past and future adjacent frames into account.

Let χ = {x_t}_{t=1}^T be the image sequence with T frames. Each image x_t can be treated as the sum of its underlying clean image y_t and noise n_t:

x_t = y_t + n_t    (4)

The goal of video denoising is to recover Y = {y_t}_{t=1}^T by removing n_t from x_t.

We adopt a patch-based method in our model. In order to incorporate the temporal redundancy of video into the deep RNNs, we take video cubes as input. Here, a cube is a multi-dimensional array of data with a specific patch size and timestep, picked from the video; more precisely, it is stacked from several temporally consecutive patches (see the blue cube in Figure 2(a)).


Figure 2. Deep RNNs for the video denoising model. (a) The noisy video is extracted as small 3D cubes that are fed into a 2-layer recurrent neural network and finally mapped to one denoised patch. (b) The deep RNN video denoising model unfolded over time (example with 5 timesteps).

The deep RNNs aim to map such cubes from the noisy video into the clean image patch at the midmost timestep: y_{(T+1)/2} = F(χ; Θ), where Θ denotes the network parameters. The parameters are updated by the BPTT algorithm [26], minimizing the quadratic error between the mapped noisy patch and the clean patch:

L = ‖F(χ; Θ) − Y‖    (5)
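As a small illustration of this objective (the function drnn_predict stands in for the trained mapping F(χ; Θ); its name and signature are ours, not part of the original implementation):

    import numpy as np

    def quadratic_loss(drnn_predict, noisy_cube, clean_cube):
        """Eq. (5): the target is the clean patch at the midmost timestep, and the loss
        is the quadratic error between the network output and that patch."""
        T = noisy_cube.shape[0]                  # e.g. T = 7 frames per cube
        target = clean_cube[(T - 1) // 2]        # y_{(T+1)/2} in the paper's 1-based indexing
        estimate = drnn_predict(noisy_cube)      # F(chi; Theta): an estimated 17x17 patch
        return np.sum((estimate - target) ** 2)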

The intuitive processing of our deep RNN denoising model can be described as follows. The first recurrent hidden layer reads the input data in sequence and delivers a representation of the input video to the second layer. The second recurrent hidden layer then tries to extract a higher-level, more powerful representation. Finally, the output layer is asked to reconstruct the clean image from this representation. In order to do so, the representation must retain information about the appearance of the objects and the background as well as the motion contained in the video.

3.3 Applying Deep RNNs for Video Denoising

To denoise a video, we decompose the given noisy video into overlapping patches frame by frame and feed the corresponding cubes (a list of patches at the same position in each frame) into the trained model. In other words, inputs are obtained by sliding a 3D window of a specific spatial size and temporal step through the video volume. The denoised video is obtained by placing the denoised patches at the locations of their noisy counterparts and averaging the overlapping regions. When picking overlapping patches, the stride is set to 3, since the denoising performance is almost equally good as with a smaller stride.
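A sketch of this inference procedure, assuming a grayscale video stored as a (T, H, W) NumPy array and a predict_patch function wrapping the trained model (both names are illustrative):

    import numpy as np

    def denoise_video(noisy, predict_patch, patch=17, step=7, stride=3):
        """Slide a patch x patch x step window over the video, denoise each cube with
        the trained model, and average overlapping estimates of the central frame."""
        T, H, W = noisy.shape
        half = step // 2
        out = np.zeros_like(noisy, dtype=np.float64)
        weight = np.zeros_like(out)
        for t in range(half, T - half):                   # frames with a full temporal window;
            for i in range(0, H - patch + 1, stride):     # border frames are skipped in this sketch
                for j in range(0, W - patch + 1, stride):
                    cube = noisy[t - half:t + half + 1, i:i + patch, j:j + patch]
                    est = predict_patch(cube)             # denoised 17x17 central patch
                    out[t, i:i + patch, j:j + patch] += est
                    weight[t, i:i + patch, j:j + patch] += 1.0
        np.maximum(weight, 1e-8, out=weight)              # avoid division by zero at uncovered pixels
        return out / weight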

4. EXPERIMENTAL SETUP

We design experiments to accomplish the following objectives:

• Compare DRNNs with different model variants to illustrate that DRNNs exploit latent temporal-spatial information for video denoising.

• Compare with state-of-the-art video denoising benchmarks on removing additive white Gaussian noise.

• Measure the generality of DRNNs for other types of noise (e.g., Poisson-Gaussian noise).


Figure 3. Central frame of the original sequences used in our tests. From left to right: Salesm., Miss Am., Coastg., Foreman, Bus.

4.1 Dataset

Training set: The UCF-101 dataset [27] is a standard benchmark dataset for human action recognition. It has 13320 videos of variable length belonging to 101 human action categories, and each frame has size 160×320 pixels. We use it to train the deep RNNs for video denoising because it is large and consists of natural videos containing rich motion and spatial information.

We performed all our experiments on gray-scale video. We randomly pick 17×17 patches from 7 consecutive frames to generate 17×17×7 cubes as raw sample sequences, then manually add noise to these samples to generate training samples. The groundtruth of each training sample is the corresponding clean patch at the middle timestep.
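The sampling procedure can be sketched as follows, assuming a clean grayscale clip stored as a (T, H, W) array with pixel values in [0, 1] (the scaling of σ and the array layout are our assumptions):

    import numpy as np

    def sample_training_cube(clean_video, rng, patch=17, step=7, sigma=25.0 / 255.0):
        """Pick a random 17x17x7 clean cube, add Gaussian noise to form the input,
        and use the clean central patch as the target (pixel values assumed in [0, 1])."""
        T, H, W = clean_video.shape
        t = rng.randint(0, T - step + 1)
        i = rng.randint(0, H - patch + 1)
        j = rng.randint(0, W - patch + 1)
        clean_cube = clean_video[t:t + step, i:i + patch, j:j + patch]
        noisy_cube = clean_cube + rng.normal(0.0, sigma, clean_cube.shape)
        target = clean_cube[step // 2]      # groundtruth: clean patch at the middle timestep
        return noisy_cube, target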

Testing set: We use the same standard test sequences as in the VBM3D [12] and VBM4D [17] papers, which can be downloaded from the authors' website. Figure 3 displays the central frame of each sequence.

4.2 Implementation

The Python library Theano [28] is used to construct the deep RNN (DRNN) model. All models were trained on a single NVIDIA Titan X GPU; a two-layer deep RNN took about 10 days to converge. To make the training procedure more efficient, we apply some common tricks, listed below and condensed into a short code sketch after the list:

• Data normalization: The pixel values are transformed to have approximately mean zero and variance one.More precisely, we subtract 0.5 from each pixel and divide by 0.2, assuming pixel values between 0 and 1.

• Weight initialization: The network weights are initialized from a uniform distribution U[−1/√n, 1/√n], where n is the size of the previous layer [29].

• Learning strategy: The learning rate is initialized at 0.03 and ends at 0.0001, while the momentum is initialized at 0.9 and ends at 0.999; both are adjusted by the same step in every epoch. To avoid waiting for the maximum number of epochs when the validation error has stopped improving, we use an early stopping strategy: we stop training when the validation error has not improved over the latest 200 epochs, and save the model from the epoch where the validation error was lowest.
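A condensed sketch of these settings; the linear shape of the schedule and the helper names in the early-stopping comment (validate, save_model) are our assumptions rather than details given above:

    import numpy as np

    def normalize(patch):
        """Zero-mean, roughly unit-variance inputs, assuming pixel values in [0, 1]."""
        return (patch - 0.5) / 0.2

    def init_weights(rng, n_in, n_out):
        """Uniform initialization U[-1/sqrt(n), 1/sqrt(n)], n = size of the previous layer."""
        bound = 1.0 / np.sqrt(n_in)
        return rng.uniform(-bound, bound, (n_out, n_in))

    def schedule(epoch, max_epochs):
        """Learning rate 0.03 -> 0.0001 and momentum 0.9 -> 0.999, stepped every epoch
        (a linear schedule is assumed in this sketch)."""
        a = min(epoch / float(max_epochs), 1.0)
        lr = 0.03 + a * (0.0001 - 0.03)
        momentum = 0.9 + a * (0.999 - 0.9)
        return lr, momentum

    # Early stopping: keep the best model, stop after 200 epochs without improvement.
    best_err, best_epoch, patience = np.inf, 0, 200
    # for epoch in range(max_epochs):
    #     val_err = validate(...)                  # hypothetical validation routine
    #     if val_err < best_err:
    #         best_err, best_epoch = val_err, epoch
    #         save_model(...)                      # hypothetical checkpointing routine
    #     elif epoch - best_epoch >= patience:
    #         break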

5. RESULTS

5.1 Comparison of Different Model Variants

In this set of experiments, we compare different model variants to verify whether the recurrent units and the deep architecture play important roles in denoising. All models are trained on video cubes corrupted with Gaussian noise with σ = 25: n_t ∼ N(0, σ²I). The training set contains about 10 million cubes from 500 videos randomly picked from UCF-101, twenty percent of which is used as a validation set. Table 1 shows the denoising test results, using peak signal-to-noise ratio (PSNR) as the metric.
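For reference, PSNR can be computed as follows (the standard definition with a peak value of 255 for 8-bit frames; whether the averaging is done per frame or over the whole sequence is not specified here):

    import numpy as np

    def psnr(clean, denoised, peak=255.0):
        """Peak signal-to-noise ratio in dB between two images/videos of equal shape."""
        mse = np.mean((clean.astype(np.float64) - denoised.astype(np.float64)) ** 2)
        return 10.0 * np.log10(peak ** 2 / mse)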


Figure 4. Sketch of model variants. (a) Single-layer RNN for video denoising. (b) MLPs-7 for video denoising, without a recurrent layer.

Table 1. Denoising results in PSNR (dB) by variants of DRNNs on test videos corrupted by additive Gaussian noise with standard deviation σ = 25.

Video name      Salesm.   Miss Am.  Coastg.   Foreman   Bus       Average
Frame size      288×352   288×352   144×176   288×352   288×352
Frame number    50        150       300       300       150
MLPs [30]       28.68     32.87     27.71     29.63     27.34     29.25
MLPs-7          29.98     34.73     28.43     29.97     26.22     29.87
RNN-5           31.07     36.02     29.23     30.81     27.01     30.83
DRNNs-5         31.43     36.64     29.72     31.31     27.54     31.33
DRNNs-7         32.27     37.01     30.06     31.66     27.88     31.78

To compare the performance of a single-layer RNN and deep RNNs, we trained an RNN model with 17×17−1024−17×17 layers (shown in Figure 4(a)) and a deep RNN model with 17×17−1024−1024−17×17 layers, each with 5 timesteps, denoted RNN-5 and DRNNs-5. As we can see, DRNNs-5 outperforms RNN-5 on all test sequences, which means the deep architecture has better capacity for mapping noisy video to clean video.

To determine whether the recurrent units help to extract temporal information and analyze motion, we trained a 2-layer MLP with 7 timesteps pooled in the last layer (denoted MLPs-7, shown in Figure 4(b)) and a 2-layer deep RNN with 7 timesteps (DRNNs-7). We also use the MLPs method [30] as a comparison: the denoising MLPs model was obtained from the authors' website and applied to the corrupted video frame by frame, and it has 4 layers with 2047 hidden units in each layer. The stride is 3 for both DRNNs and MLPs when clipping overlapping patches. The results show that DRNNs-7's denoising PSNR is 1.91 dB higher than MLPs-7 and 2.53 dB higher than MLPs on average.

Note that the two-layer MLPs-X shares the same order of magnitude of parameters as the single-layer RNN-X (X denoting the number of timesteps), because a recurrent layer occupies the same space complexity as one fully connected layer. In Table 1, the two-layer MLPs-7 performs even worse than RNN-5 despite RNN-5's narrower temporal window, from which we conclude that the recurrent layer extracts temporal information and performs motion analysis for video denoising, which is superior to simply stacking a sequence of frames and feeding them into a network without temporal connections. DRNNs outperform MLPs on all test videos, which illustrates that temporal information improves denoising performance. The performance is slightly inferior to VBM3D.


Figure 5. From left to right and from top to bottom, visual comparison of the denoising performance of R-NL, VBM3D, VBM4D and DRNNs on the sequences Salesm., Miss Am., and Foreman corrupted by white Gaussian noise with standard deviation σ = 35.

By comparing DRNNs-5’s and DRNNs-7’s results, we find that by enlarging the timestep windows, noisecan be removed more thoroughly as the model has more reference frames. The performance might be improvedfurther if temporal windows become larger. But considering exponentially increasing time cost, we haven’t doneexperiment with wider temporal windows.

5.2 Comparison with State-of-the-art Performance

We compared the 7-timestep 17×17−1024−1024−17×17 deep RNN with state-of-the-art algorithms. The results of the R-NL algorithm [14] are also displayed for comparison. The Matlab implementations of VBM3D, VBM4D and R-NL were obtained from the authors' websites, and the default parameters were used in the tests when the authors did not provide results in their papers. All test video sequences are provided by VBM3D's author on his website. We compared the performance of R-NL, VBM3D, VBM4D and the proposed algorithm (DRNNs) both visually and numerically on several image sequences. Figure 3 displays the central frame of each sequence, and Table 2 shows the videos' sizes and frame numbers. We trained two networks to process noisy video corrupted by AWGN with σ = 25 and σ = 35, using the training strategy and tricks described in Section 4.

Table 2 reports denoising performance in terms of PSNR for the different methods. DRNNs reach results close to the VBM4D method, with less than 1 dB difference on average, and outperform the R-NL method. Figure 5 displays the visual results of the compared algorithms. The figures also display the difference between the denoised image and the noisy one, and between the denoised image and the original one; the difference with the noisy image shows the noise removed by each algorithm. R-NL oversmooths the image, removing too many details and textures. At first glance DRNNs perform similarly to VBM3D and VBM4D; their weakness is that some slight noise spots remain in the images, but details are better preserved, especially compared with R-NL.

5.3 DRNNs for Mix Noises

Videos are not always corrupted by additive white Gaussian noise. In some situations, a video may be corrupted by mixed noise, such as Gaussian plus Poisson noise. Most existing denoising methods are quite sensitive to the assumed noise model. DRNNs allow us to effectively learn a denoising model for any given type of noise, provided the noise can be simulated. In this set of experiments, we trained a DRNN model for removing mixed Poisson-Gaussian noise:

n_t = n_tg + n_tp    (6)


Table 2. Denoising performance of R-NL, VBM3D, VBM4D and DRNNs, reported as PSNR (in dB). The test videos are corrupted by AWGN with the standard deviation σ shown in the left column.

σ    Method         Salesm.   Miss Am.  Coastg.   Foreman   Bus       Average
     Frame size     288×352   288×352   144×176   288×352   288×352
     Frame number   50        150       300       300       150

25   R-NL [14]      31.81     36.91     30.32     31.53     27.48     31.61
     VBM3D [12]     32.79     37.10     30.62     32.19     28.48     32.24
     VBM4D [17]     32.66     37.24     30.81     32.61     29.10     32.48
     DRNNs          32.27     37.01     30.06     31.66     27.88     31.78

35   R-NL [14]      29.88     35.60     28.51     29.80     25.55     29.87
     VBM3D [12]     30.72     35.87     28.92     30.56     26.91     30.60
     VBM4D [17]     30.99     35.98     29.17     31.11     27.39     31.04
     DRNNs          30.51     35.88     28.43     30.09     26.02     30.19

where n_tg denotes Gaussian noise with zero mean and pixel-independent standard deviation σ = 25, and n_tp denotes Poisson noise (shot noise) with zero mean and variance λ = k·y_t. We use the formulation I = k·P(y_t/k) [31], and in our experiment we set k = 15. Figure 6(a) shows the noisy frame from the video 'salesman' in the middle of the sequence (the 25th frame).
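The mixed noise can be simulated as follows (a NumPy sketch assuming intensities on a 0-255 scale, with k = 15 and σ = 25 as above):

    import numpy as np

    def add_poisson_gaussian(clean_frame, rng, k=15.0, sigma=25.0):
        """Simulate Eq. (6): shot noise via I = k * Poisson(y/k), plus additive Gaussian noise."""
        shot = k * rng.poisson(np.maximum(clean_frame, 0.0) / k)    # signal-dependent Poisson part
        return shot + rng.normal(0.0, sigma, clean_frame.shape)     # additive Gaussian part

    # Example: corrupt one frame with the mixed noise model.
    rng = np.random.RandomState(0)
    noisy = add_poisson_gaussian(np.full((288, 352), 128.0), rng)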

As there’s no specific methods to remove video’s Poisson-Gaussian noise, we compare results to three existingvideo denoising methods, two of those are VBM3D method and R-NL mentioned above. Another is low rankmatrix completion18 method for video denoising. Figure 6 shows the visual results. As VBM3D is designedfor AWGN with known variance, we first perform VBM3D given σ = 25, and it turns out to be still noisy(PSNR=21.50dB). Then we feed the processed video as input to R-NL method to remove part of Poissonnoise(by setting the corresponding Poisson noise λ = 15yt. The result in Figure6(d) shows to be smooth toomuch (PSNR=26.97dB). We also set different value of variance in VBM3D to various from 25 to 50 and find thebest denoising results when the variance to be σ = 45 (PSNR=28.44dB). R-NL method and low rank method areshown in Figure6(f) and (g), it seems R-NL method (PSNR=27.79dB) remove too much texture and features ofthe image while low rank method (PSNR=24.37dB) still preserve some noisy black spot in the image. Overall,deep RNNs methods (PSNR=30.09dB) outperform other methods in both PSNR and visual result.

6. CONCLUSION

Deep recurrent neural networks can exploit the temporal-spatial information of video to remove noise, and the deep architecture has large capacity for expressing the mapping from corrupted input videos to clean output videos. The results of our method approach state-of-the-art video denoising performance. Our proposed method does not assume any specific statistical properties of the noise and robustly maps corrupted input video to clean output video.

ACKNOWLEDGMENTS

This work was supported by NSFC (61527804, 61521062, 61420106008), the 111 Project (B07022 and Sheitc No. 150633) and the Shanghai Key Laboratory of Digital Media Processing and Transmissions.

REFERENCES

[1] Krizhevsky, A., Sutskever, I., and Hinton, G. E., "Imagenet classification with deep convolutional neural networks," in [Advances in Neural Information Processing Systems], 1097-1105 (2012).
[2] Mikolov, T., Karafiat, M., Burget, L., Cernocky, J., and Khudanpur, S., "Recurrent neural network based language model," in [INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010], 1045-1048 (2010).
[3] Karpathy, A. and Li, F., "Deep visual-semantic alignments for generating image descriptions," in [IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015], 3128-3137, IEEE (2015).
[4] Srivastava, N., Mansimov, E., and Salakhutdinov, R., "Unsupervised learning of video representations using lstms," in [Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015], Bach, F. R. and Blei, D. M., eds., JMLR Proceedings 37, 843-852, JMLR.org (2015).
[5] Elad, M. and Aharon, M., "Image denoising via learned dictionaries and sparse representation," in [IEEE Computer Society Conference on Computer Vision and Pattern Recognition], 895-900 (2006).
[6] Tappen, M. F., Liu, C., Adelson, E. H., and Freeman, W. T., "Learning gaussian conditional random fields for low-level vision," in [2007 IEEE Conference on Computer Vision and Pattern Recognition], 1-8, IEEE (2007).
[7] Rudin, L. I., Osher, S., and Fatemi, E., "Nonlinear total variation based noise removal algorithms," Physica D: Nonlinear Phenomena 60(1-4), 259-268 (1992).
[8] Osher, S., Burger, M., Goldfarb, D., Xu, J., and Yin, W., "An iterative regularization method for total variation-based image restoration," in [Simul], 460-489 (2005).
[9] Buades, A., Coll, B., and Morel, J. M., "A non-local algorithm for image denoising," in [IEEE Computer Society Conference on Computer Vision and Pattern Recognition], 60-65 (2005).
[10] Dabov, K., Foi, A., Katkovnik, V., and Egiazarian, K., "Image denoising by sparse 3-d transform-domain collaborative filtering," Image Processing, IEEE Transactions on 16(8), 2080-2095 (2007).
[11] Mahmoudi, M. and Sapiro, G., "Fast image and video denoising via nonlocal means of similar neighborhoods," IEEE Signal Processing Letters 12(12), 839-842 (2010).
[12] Dabov, K., Foi, A., and Egiazarian, K., "Video denoising by sparse 3d transform-domain collaborative filtering," in [Signal Processing Conference, 2007 European], 2080-2095 (2007).
[13] Liu, C. and Freeman, W. T., "A high-quality video denoising algorithm based on reliable motion estimation," in [Computer Vision - ECCV 2010, 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part III], 706-719 (2010).
[14] Sutour, C., Deledalle, C.-A., and Aujol, J.-F., "Adaptive regularization of the nl-means: Application to image and video denoising," IEEE Transactions on Image Processing 23(8), 3506-3521 (2014).
[15] Varghese, G. and Wang, Z., "Video denoising based on a spatiotemporal gaussian scale mixture model," IEEE Transactions on Circuits and Systems for Video Technology 20(20), 1032-1040 (2010).
[16] Boulanger, J., Kervrann, C., and Bouthemy, P., "Space-time adaptation for patch-based image sequence restoration," Pattern Analysis and Machine Intelligence, IEEE Transactions on 29(6), 1096-1102 (2007).
[17] Maggioni, M., Boracchi, G., Foi, A., and Egiazarian, K. O., "Video denoising using separable 4d nonlocal spatiotemporal transforms," in [Image Processing: Algorithms and Systems IX, San Francisco, California, USA, January 24-25, 2011], 787003 (2011).
[18] Ji, H., Liu, C., Shen, Z., and Xu, Y., "Robust video denoising using low rank matrix completion," in [Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on], 1791-1798, IEEE (2010).
[19] Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., and Saenko, K., "Sequence to sequence - video to text," in [Proceedings of the IEEE International Conference on Computer Vision], 4534-4542 (2015).
[20] Torabi, A., Pal, C., Larochelle, H., and Courville, A., "Using descriptive video services to create a large data source for video annotation research," Computer Science (2015).
[21] Huang, Y., Wang, W., and Wang, L., "Bidirectional recurrent convolutional networks for multi-frame super-resolution," in [Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada], 235-243 (2015).
[22] Lipton, Z. C., "A critical review of recurrent neural networks for sequence learning," CoRR abs/1506.00019 (2015).
[23] Schmidhuber, J., "Learning complex, extended sequences using the principle of history compression," Neural Computation 4(2), 234-242 (1992).
[24] El Hihi, S. and Bengio, Y., "Hierarchical recurrent neural networks for long-term dependencies," Advances in Neural Information Processing Systems 8, 493-499 (1995).
[25] Hermans, M. and Schrauwen, B., "Training and analyzing deep recurrent neural networks," Advances in Neural Information Processing Systems (2013).
[26] Werbos, P. J., "Backpropagation through time: what it does and how to do it," Proceedings of the IEEE 78(10), 1550-1560 (1990).
[27] Soomro, K., Zamir, A. R., and Shah, M., "Ucf101: A dataset of 101 human actions classes from videos in the wild," arXiv preprint arXiv:1212.0402 (2012).
[28] Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y., "Theano: new features and speed improvements," Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop (2012).
[29] Glorot, X. and Bengio, Y., "Understanding the difficulty of training deep feedforward neural networks," in [Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010], 249-256 (2010).
[30] Burger, H. C., Schuler, C. J., and Harmeling, S., "Image denoising: Can plain neural networks compete with bm3d?," in [Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on], 2392-2399, IEEE (2012).
[31] Zhang, L., Vaddadi, S., Jin, H., and Nayar, S. K., "Multiple view image denoising," in [Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on], 1542-1549, IEEE (2009).