

Judging a Movie by its Poster using Deep Learning

Brett Kuprel [email protected]

Abstract

It is often the case that a human can determine the genre of a movie by looking at its movie poster. This task is not trivial for computers. A recent advance in machine learning called deep learning allows algorithms to learn important features from large datasets. Rather than analyzing an image pixel by pixel, for example, higher level features can be used for classification. In this project I attempted to train a neural network of stacked autoencoders to predict a movie's genre given an image of its movie poster. My hypothesis is that a good algorithm can correctly guess the genre based on the movie poster at least half the time.

1. Introduction

A "neuron" is a computational unit that takes as input x ∈ R^n and outputs an activation h(x) = f(w^T x + b), where f is a sigmoidal function. A neural network is a network of these neurons. See the example in figure 1 from the UFLDL tutorial (Ng et al., 2010).

1.1. Forward Propagation

A neural network performs a computation on an input x by forward propagation. Let a^(l) ∈ R^{n_l} be the vector of activations (i.e. outputs) of the n_l neurons in layer l, and W^(l) the matrix of weight vectors w for layer l. We have the following recursion:

a^(l+1) = f(W^(l) a^(l) + b^(l))    (1)

To determine the final hypothesis h given an input x, iteratively apply this recursion, starting with a^(0) = x.
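The recursion in equation (1) is only a few lines of NumPy. The following is an illustrative sketch; the layer sizes and random weights here are made up for the example, not the ones used in the project:

```python
import numpy as np

def sigmoid(z):
    # f in equation (1): elementwise logistic sigmoid
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, weights, biases):
    # Apply a^(l+1) = f(W^(l) a^(l) + b^(l)) starting from a^(0) = x
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a  # the final hypothesis h

# Toy 2-layer network: 4 inputs -> 3 hidden units -> 2 outputs
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 4)), rng.standard_normal((2, 3))]
biases = [np.zeros(3), np.zeros(2)]
h = forward_propagate(rng.standard_normal(4), weights, biases)
print(h.shape)  # (2,)
```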

1.2. Autoencoders

An autoencoder is a neural network that takes as input x ∈ [0, 1]^n, maps it to a latent representation y ∈ [0, 1]^{n′}, and finally outputs z ∈ [0, 1]^n, a reconstructed version of x. If the input is interpreted as a bit vector, the reconstruction error can be measured by the cross entropy

J(x, z) = −Σ (x log z + (1 − x) log(1 − z))    (2)

When n′ < n, the latent layer y can be thought of as a lossy compression of x. It does not generalize to arbitrary x, but this is usually okay since many datasets lie on lower dimensional manifolds. Natural images, for instance, are a very small subset of all possible images.

Figure 1. Top: a single neuron. Bottom: a neural network (specifically a feedforward network)
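A single-layer version of this autoencoder, together with the cross entropy of equation (2), can be sketched as follows. The weights here are random for illustration; training them is the subject of section 2.2:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencode(x, W1, b1, W2, b2):
    # Encode x in [0,1]^n to latent y in [0,1]^{n'}, then decode back
    # to a reconstruction z in [0,1]^n
    y = sigmoid(W1 @ x + b1)
    z = sigmoid(W2 @ y + b2)
    return y, z

def cross_entropy(x, z):
    # Equation (2): reconstruction error between input x and output z
    return -np.sum(x * np.log(z) + (1 - x) * np.log(1 - z))

rng = np.random.default_rng(0)
n, n_latent = 8, 3                      # n' < n: lossy compression
W1, b1 = rng.standard_normal((n_latent, n)), np.zeros(n_latent)
W2, b2 = rng.standard_normal((n, n_latent)), np.zeros(n)
x = rng.uniform(size=n)
y, z = autoencode(x, W1, b1, W2, b2)
print(cross_entropy(x, z))
```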

2. Methods

2.1. Model

Let each movie poster be a vector x^(i) ∈ R^n, where n is the number of pixels in the image. Each movie belongs to at most 3 genres. I express this as a boolean vector y^(i) ∈ {0, 1}^{|G|}, where G is the set of genres, and y^(i)_j = 1 if movie i belongs to genre j, 0 otherwise.
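This multi-hot label encoding is straightforward to construct. In this sketch, the genre list is a hypothetical subset of G (the project's full set had 20 genres, listed in table 1):

```python
import numpy as np

# Hypothetical subset of the genre set G for illustration
GENRES = ['Drama', 'Comedy', 'Action', 'Adventure', 'Crime']

def encode_genres(movie_genres):
    # Boolean vector y^(i) with y^(i)_j = 1 iff the movie has genre j
    y = np.zeros(len(GENRES))
    for g in movie_genres:
        y[GENRES.index(g)] = 1
    return y

y = encode_genres(['Drama', 'Crime'])   # a movie with 2 of its <= 3 genres
print(y)  # [1. 0. 0. 0. 1.]
```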


Figure 2. An autoencoder

The algorithm produces a single genre prediction ŷ ∈ G. Define the prediction as the argmax of the conditional probability distribution:

ŷ^(i) = argmax_j P(Y = j | x^(i))    (3)

and the CPT as the softmax of the final hypothesis layer of the network:

P(Y = j | x^(i)) = exp(h_j) / Σ_k exp(h_k)    (4)

where h ∈ R^{|G|} is found by forward propagation of x^(i) through the network. The goal is to minimize the prediction error rate. Define an error to occur when the predicted genre for some movie i is not in the set of genres that movie i belongs to:

% Error = (1 / |D|) Σ_{i∈D} (1 − y^(i)[ŷ^(i)])    (5)
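The prediction rule (3) and the error rate (5) can be sketched with toy numbers (two hypothetical movies and four genres, made up for the example):

```python
import numpy as np

def predict(h):
    # Equation (3): predicted genre is the argmax of the softmax CPT;
    # softmax is monotone, so the argmax of h itself suffices
    return int(np.argmax(h))

def error_rate(hypotheses, labels):
    # Equation (5): an error occurs when the predicted genre is not
    # among the movie's true genres (labels is the boolean matrix Y)
    errors = [1 - labels[i, predict(h)] for i, h in enumerate(hypotheses)]
    return float(np.mean(errors))

# Two toy movies, 4 genres
hypotheses = np.array([[0.1, 2.0, 0.3, 0.0],    # predicts genre 1
                       [1.5, 0.2, 0.1, 0.4]])   # predicts genre 0
labels = np.array([[0, 1, 1, 0],                # genre 1 is correct
                   [0, 0, 1, 0]])               # genre 0 is an error
print(error_rate(hypotheses, labels))  # 0.5
```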

This error rate function is not differentiable in the model parameters W and b because of the argmax expression used to find the predicted genre. To train the network, I instead minimize the negative log likelihood:

J(W, b) = −Σ log P(Y | X)[Y]    (6)

where the ith rows of X and Y are x^(i) and y^(i), and the sum is over all elements. Notice the CPT matrix is indexed by the boolean matrix Y. This is still a differentiable function of the parameters W and b.
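The softmax of equation (4) and the indexed negative log likelihood of equation (6) can be sketched directly; the hypothesis matrix H here is made up for illustration:

```python
import numpy as np

def softmax(h):
    # Equation (4): the conditional probability table P(Y = j | x)
    e = np.exp(h - h.max())            # shift for numerical stability
    return e / e.sum()

def negative_log_likelihood(H, Y):
    # Equation (6): sum -log P over the entries where the boolean
    # label matrix Y is 1, i.e. the CPT matrix indexed by Y
    P = np.array([softmax(h) for h in H])
    return -np.sum(Y * np.log(P))

H = np.array([[2.0, 0.5, 0.1],         # hypothesis rows, one per movie
              [0.2, 1.0, 0.3]])
Y = np.array([[1, 0, 0],               # boolean genre labels
              [0, 1, 1]])
print(negative_log_likelihood(H, Y))
```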

2.2. Learning Parameters

Backpropagation is the standard method used to train the weights in a neural network. It uses gradient descent to update the parameters:

W^(l)_ij −= α ∂J/∂W^(l)_ij ,    b^(l)_i −= α ∂J/∂b^(l)_i    (7)

These gradients are often messy to derive, and do not provide much insight into the problem at hand. I used a Python package called Theano that calculates the gradients and applies the updates to the model parameters W and b behind the scenes.
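The update in equation (7) amounts to the following minimal sketch. The gradient values here are placeholders for illustration; in the project, Theano derives them symbolically:

```python
import numpy as np

def sgd_step(W, b, dJ_dW, dJ_db, alpha=0.1):
    # Equation (7): in-place gradient descent update on one layer's
    # parameters, with learning rate alpha
    W -= alpha * dJ_dW
    b -= alpha * dJ_db
    return W, b

W = np.ones((2, 3))
b = np.zeros(2)
grad_W = np.full((2, 3), 0.5)   # placeholder gradients for illustration
grad_b = np.array([1.0, -1.0])
W, b = sgd_step(W, b, grad_W, grad_b)
print(W[0, 0], b[0])  # parameters moved opposite the gradient
```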

2.3. Stacking Autoencoders

Before applying backpropagation, it would be nice if W and b were initialized to something reasonable. A known problem with training neural networks is 'diffusion of gradients': when backpropagation is run from scratch, only the nodes close to the final layer are updated properly. A greedy method for initializing W and b is stacking autoencoders. The idea is simple: train an autoencoder on a set of data X, use the learned feature representation as the input to another autoencoder, and repeat until you have as many layers as you want. For this project I used 3 latent layers (aside from the initial data and hypothesis layers). This process results in a reasonable initialization of the weights W and b. It also allows unlabeled data to be used effectively for feature learning; of the movie poster images I had, fewer than 1/6 had genre labels.
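The greedy stacking loop can be sketched end to end. This is a simplified NumPy stand-in for the project's Theano training code: each autoencoder is trained by plain gradient descent on the cross entropy of equation (2), and the layer sizes here are toy values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_latent, steps=200, alpha=0.5, seed=0):
    # One-hidden-layer autoencoder trained on the cross entropy of
    # equation (2); a simplified stand-in for the Theano version
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W1 = rng.standard_normal((n_latent, n)) * 0.1
    W2 = rng.standard_normal((n, n_latent)) * 0.1
    b1, b2 = np.zeros(n_latent), np.zeros(n)
    for _ in range(steps):
        Y = sigmoid(X @ W1.T + b1)          # encode
        Z = sigmoid(Y @ W2.T + b2)          # decode
        dZ = Z - X                          # dJ/d(pre-activation of Z)
        dY = (dZ @ W2) * Y * (1 - Y)        # backpropagated to latent layer
        W2 -= alpha * dZ.T @ Y / len(X)
        b2 -= alpha * dZ.mean(axis=0)
        W1 -= alpha * dY.T @ X / len(X)
        b1 -= alpha * dY.mean(axis=0)
    return W1, b1

def stack_autoencoders(X, layer_sizes):
    # Greedy initialization: train an autoencoder, feed its latent
    # representation to the next autoencoder, repeat per layer
    params = []
    for n_latent in layer_sizes:
        W, b = train_autoencoder(X, n_latent)
        params.append((W, b))
        X = sigmoid(X @ W.T + b)
    return params

rng = np.random.default_rng(1)
X = rng.uniform(size=(50, 16))              # toy "images" in [0,1]^16
params = stack_autoencoders(X, [8, 4, 2])   # 3 latent layers, as in the project
print([W.shape for W, _ in params])
```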

2.4. Getting the Data

On IMDB there is a link that goes to a random popularmovie.

http://www.imdb.com/random/title

Using this link, I can obtain the genre, rating, and poster for N movies, as shown in algorithm 1. I used the BeautifulSoup package in Python to scrape the HTML.

Algorithm 1 Scrape IMDB
Input: desired number of movies N
Output: dictionary M where a movie key m points to genre g, rating r, and movie poster p

  M ← {}
  U ← http://www.imdb.com/random/title
  while |M| < N do
    m ← getMovieTitle(U)
    if m not in M then
      p ← getMoviePoster(U)
      g ← getMovieGenre(U)
      r ← getMovieRating(U)
      M[m] ← {'genre': g, 'rating': r, 'poster': p}
    end if
  end while

This algorithm seems to exhaust IMDB's random popular movie function at a little under 1,000 movies; at that point, it visits close to 100 pages before seeing a movie that hasn't been scraped yet. There is another website, called Movie Poster DB, that also has a random movie link and claims to host over 100 thousand movie posters. While these posters do not have ratings or genre labels, they can still be used for feature learning. I wrote a similar script for this website and was able to scrape 5,000 posters in just a few hours.
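The deduplication loop at the heart of algorithm 1 can be sketched offline by swapping the network calls for a stub. The tiny fake catalog and the stub function here are hypothetical stand-ins; the real script fetched and parsed each page with BeautifulSoup:

```python
import random

# Tiny fake catalog standing in for IMDB's random-title page, so the
# loop runs offline; titles, genres, and ratings are made up.
CATALOG = [('Alien', ('Horror', 8.5, 'alien.jpg')),
           ('Heat', ('Crime', 8.3, 'heat.jpg')),
           ('Up', ('Animation', 8.3, 'up.jpg'))]

def visit_random_title(rng):
    # Stub for following http://www.imdb.com/random/title
    return rng.choice(CATALOG)

def scrape(n_movies, seed=0):
    # Algorithm 1: keep visiting the random-title link, skipping
    # movies already in the dictionary M, until |M| = N
    rng = random.Random(seed)
    M = {}
    while len(M) < n_movies:
        title, (genre, rating, poster) = visit_random_title(rng)
        if title not in M:
            M[title] = {'genre': genre, 'rating': rating, 'poster': poster}
    return M

M = scrape(3)
print(sorted(M))  # ['Alien', 'Heat', 'Up']
```

As the text notes, once most titles have been seen, nearly every visit hits the `if title not in M` branch's skip path, which is why the real scraper slowed down near 1,000 movies.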

Figure 3. Movie posters from IMDB, standardized to 100 × 100 pixels and converted to grayscale

One frustration I ran into while scraping posters was that there was no standard image shape. To apply most machine learning algorithms, each data point should have the same set of features (i.e. the same image size). The PIL package in Python provides a function, Image.resize, which converts an image of any size to any other size. After playing around with different image sizes, I decided on 100 × 100. The change in aspect ratio doesn't affect the posters as much as I would have expected. I also had to decide what to do with color. It seems that color does not add enough information to warrant tripling the number of features (or reducing the number of pixels per image to 1/3), so I used the luminosity function gray = 0.299 red + 0.587 green + 0.114 blue to convert each image to grayscale. See figure 3 for the preprocessed IMDB dataset.

Table 1. Genre counts for movies in the IMDB dataset. One movie can belong to multiple genres.

Genre        Count
Drama        365
Comedy       247
Action       234
Adventure    178
Crime        170
Thriller     135
Sci-Fi       102
Fantasy      90
Romance      89
Mystery      79
Horror       54
Animation    53
Family       49
Biography    30
History      23
Documentary  18
War          16
Sport        11
Western      4
Musical      3
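The preprocessing step can be sketched in NumPy. The luminosity weights are the ones stated above; the nearest-neighbor resize is a simplified stand-in for PIL's Image.resize, just to illustrate the idea:

```python
import numpy as np

def to_grayscale(rgb):
    # Luminosity conversion used in the project:
    # gray = 0.299 red + 0.587 green + 0.114 blue
    return rgb @ np.array([0.299, 0.587, 0.114])

def resize_nearest(img, new_h, new_w):
    # Nearest-neighbor stand-in for PIL's Image.resize
    h, w = img.shape
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    return img[np.ix_(rows, cols)]

rng = np.random.default_rng(0)
poster = rng.uniform(size=(300, 200, 3))       # arbitrary-size RGB poster
x = resize_nearest(to_grayscale(poster), 100, 100)
print(x.shape)  # (100, 100)
```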

2.5. Preparing Data

Let M be the dictionary returned by algorithm 1. I decided to split the data into a training set D_train, a validation set D_valid, and a test set D_test with sizes 80%, 10%, and 10%.
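An 80/10/10 split can be sketched as a shuffled partition of the example indices. The split function here is an illustrative assumption, not the project's exact code:

```python
import numpy as np

def split_data(indices, seed=0):
    # Shuffle, then partition into 80% train, 10% validation, 10% test
    rng = np.random.default_rng(seed)
    idx = rng.permutation(indices)
    n = len(idx)
    n_train, n_valid = int(0.8 * n), int(0.1 * n)
    return (idx[:n_train],
            idx[n_train:n_train + n_valid],
            idx[n_train + n_valid:])

train, valid, test = split_data(np.arange(800))
print(len(train), len(valid), len(test))  # 640 80 80
```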

2.6. Implementing a Neural Net

I used a package for Python called Theano (Bergstra et al., 2010) designed for deep learning. It simplifies running algorithms on the GPU: the same code written for the CPU will work on the GPU (as long as floats are used). Among other things, it uses a lazy evaluation technique and performs symbolic differentiation. I found an example of using stacked autoencoders to classify the MNIST handwritten digits dataset, and used this as starter code to build my movie poster classifier.


Figure 4. Training speedup using GPU

3. Results

I scraped a total of 5,800 images, each 100 × 100 pixels grayscale. 800 of the images are shown in figure 3. 5,000 of the images have no genre labels; the remaining 800 have genre labels distributed as shown in table 1. Each movie has anywhere between 1 and 3 genre labels associated with it.

Training a 3-layer architecture (layers of 1000, 500, and 300 nodes) topped with a layer of multiclass logistic regression results in a validation set error rate of 47% and a test set error rate of 49.5%. This means that, given a movie's poster, the algorithm can correctly predict one of its genres out of 20 possible genres about 50% of the time. Note that if drama is guessed every time, the algorithm will predict the genre correctly 45.6% of the time. A plot of the negative log likelihood over the number of iterations through the training set is shown in figure 5.

Figure 5. Negative log likelihood
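The guess-drama-every-time baseline quoted above follows directly from table 1, since drama appears on 365 of the 800 labeled posters:

```python
# Drama count and labeled-poster total from table 1
drama, total = 365, 800
baseline_accuracy = drama / total
print(baseline_accuracy)  # 0.45625, i.e. the 45.6% quoted above
```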

The images that most highly activate the neurons in the 3rd layer are shown in figure 6. Notice that a few of them look like faces. The first and second layer features were less exciting, so I did not include them.

Also, my code is split across many files and I decidedto omit it from the report. Please email me if you wantany part or all of it.

Figure 6. Learned features in the 3rd hidden layer


4. Conclusion

It was difficult to implement a deep neural network for the first time in 1 week. That said, I think my neural network suffered from the "curse of dimensionality": my images were 100 × 100 pixels, for a total of 10,000 variables per training example, and I only had 5,000 training examples. A smarter method might be to cut each poster into patches and then classify using a voting scheme amongst the patches in the poster. Another method could be to simply use lower resolution movie posters.
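The patch idea suggested above can be sketched as a simple tiling of the 100 × 100 grayscale poster; the patch size here is an assumption chosen for the example:

```python
import numpy as np

def extract_patches(img, patch):
    # Cut a square grayscale poster into non-overlapping patch x patch
    # tiles; each tile would be one input to the classifier, with the
    # per-poster prediction decided by a vote over tiles
    h, w = img.shape
    return [img[i:i + patch, j:j + patch]
            for i in range(0, h - patch + 1, patch)
            for j in range(0, w - patch + 1, patch)]

patches = extract_patches(np.zeros((100, 100)), 25)
print(len(patches))  # 16 patches of 25 x 25 pixels
```

Each patch has 625 variables instead of 10,000, which is the dimensionality reduction the conclusion is after.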

References

Bergstra, James, Breuleux, Olivier, Bastien, Frederic, Lamblin, Pascal, Pascanu, Razvan, Desjardins, Guillaume, Turian, Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral presentation.

Ng, Andrew, Ngiam, Jiquan, Foo, Chuan Y., Mai, Yifan, and Suen, Caroline. UFLDL tutorial, 2010. URL http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial.