Image-to-Markup Generation with Coarse-to-Fine Attention · 2020-01-26
TRANSCRIPT
Image-to-Markup Generation with Coarse-to-Fine Attention
Presenter: Ceyer Wakilpoor
Yuntian Deng¹  Anssi Kanervisto²  Alexander M. Rush¹
¹Harvard University
²University of Eastern Finland
ICML, 2017
Yuntian Deng, Anssi Kanervisto, Alexander M. Rush (Harvard University, University of Eastern Finland) · Image-to-Markup Generation with Coarse-to-Fine Attention · ICML 2017 · 1 / 29
Outline
1 Introduction: Problem Statement
2 Model: Convolutional Network, Row Encoder, Decoder, Attention
3 Experiment Details
4 Results
Introduction
Optical Character Recognition (OCR) for mathematical expressions
Challenge: creating markup from the compiled image
The model has to produce markup that captures how characters are presented, not just which characters appear
Goal is a model that requires no domain knowledge, using a data-driven approach
Work builds on the attention-based encoder-decoder models used in machine translation and in image captioning
A multi-row recurrent encoder was added before the attention layer, which proved to increase performance
Problem Statement
Converting a rendered source image to markup that can render the image
The source x ∈ X is a grayscale image of height H and width W (x ∈ ℝ^{H×W})
The target y ∈ Y is a sequence of tokens y_1, y_2, …, y_C, where C is the length of the output and each y_c is a token from the markup language's vocabulary Σ
Effectively trying to learn to invert the compile function of the markup using supervised examples
Goal is for compile(y) ≈ x
Generate a hypothesis ŷ; x̂ = compile(ŷ) is the predicted compiled image
Evaluation is done between x̂ and x, i.e., the aim is to render an image similar to the original input
Model
Convolutional Neural Network (CNN) extracts image features
Each row is encoded using a Recurrent Neural Network (RNN)
When the paper mentions an RNN, it means a Long Short-Term Memory network (LSTM)
These encoded features are used by an RNN decoder with a visual attention layer, which implements a conditional language model over the vocabulary Σ
Convolutional Network
Visual features of the image are extracted with a multi-layer convolutional neural network with interleaved max-pooling layers
Based on model used for OCR by Shi et al.
Unlike some other OCR models, there is no fully-connected layer atthe end of the convolutional layers
Want to preserve spatial relationship of extracted features
The CNN takes an input in ℝ^{H×W} and produces a feature grid V of size C×H′×W′, where C denotes the number of channels and H′ and W′ are the dimensions reduced by pooling
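The effect of the interleaved pooling on the grid size can be sketched with toy arithmetic (the pool strides below are illustrative assumptions, not the paper's exact layer configuration):

```python
def pooled_grid(H, W, pools=((2, 2), (2, 2), (2, 1), (1, 2))):
    """Apply a sequence of (pool_h, pool_w) max-pool strides to an
    H x W input and return the reduced grid size H' x W'.
    The specific strides are illustrative, not the paper's."""
    for ph, pw in pools:
        H, W = H // ph, W // pw
    return H, W

# Under these example pools, a 64 x 256 formula image shrinks
# to an 8 x 32 feature grid.
print(pooled_grid(64, 256))
```

Because there is no fully-connected layer afterwards, each cell of this H′ × W′ grid still corresponds to a spatial region of the input image.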
Row Encoder
Unlike image captioning, in OCR there is significant sequential information (i.e., reading left-to-right)
Encode each row separately with an RNN
Most markup languages default to left-to-right, which an RNN will naturally pick up
Encoding each row allows the RNN to use surrounding horizontal information to improve the hidden representation
Generic RNN: h_t = RNN(h_{t−1}, v_t; θ)
The RNN takes in V and outputs Ṽ:
Run the RNN over all rows h ∈ {1, …, H′} and columns w ∈ {1, …, W′}: Ṽ_{h,w} = RNN(Ṽ_{h,w−1}, V_{h,w})
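The left-to-right recurrence Ṽ_{h,w} = RNN(Ṽ_{h,w−1}, V_{h,w}) can be sketched in a few lines, substituting a toy scalar update for the paper's LSTM (the decaying-sum "RNN" and all names here are illustrative only):

```python
def row_encode(V, rnn):
    """Run `rnn` (a function (prev_state, cell) -> state) left-to-right
    over every row of the feature grid V, mimicking
    V~[h][w] = RNN(V~[h][w-1], V[h][w])."""
    V_tilde = []
    for row in V:
        state = 0.0              # fresh hidden state for each row
        encoded_row = []
        for v in row:
            state = rnn(state, v)
            encoded_row.append(state)
        V_tilde.append(encoded_row)
    return V_tilde

# Toy "RNN": a decaying sum, so each state summarizes the cells to its left.
toy_rnn = lambda prev, v: 0.5 * prev + v
grid = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]
print(row_encode(grid, toy_rnn))
# [[1.0, 2.5, 4.25], [4.0, 7.0, 9.5]]
```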
Decoder
The decoder is trained as a conditional language model
Modeling the probability of the next token conditioned on the previous ones: p(y_{t+1} | y_1, …, y_t, V) = softmax(W_out o_t)
W_out is a learned linear transformation and o_t = tanh(W_c [h_t; c_t])
h_t = RNN(h_{t−1}, [y_{t−1}; o_{t−1}])
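One output step, o_t = tanh(W_c [h_t; c_t]) followed by softmax(W_out o_t), can be written directly (a minimal sketch with toy dimensions; the matrices and sizes are illustrative, not the paper's trained parameters):

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def decode_step(h_t, c_t, W_c, W_out):
    """One decoder output step: o_t = tanh(W_c [h_t; c_t]),
    p = softmax(W_out o_t). Matrices are plain lists of rows."""
    hc = h_t + c_t                      # concatenation [h_t; c_t]
    o_t = [math.tanh(sum(w * x for w, x in zip(row, hc))) for row in W_c]
    logits = [sum(w * x for w, x in zip(row, o_t)) for row in W_out]
    return softmax(logits)

# Toy sizes: hidden 2, context 2, output vocabulary 3.
p = decode_step([0.1, -0.2], [0.3, 0.0],
                W_c=[[1, 0, 0, 0], [0, 0, 1, 0]],
                W_out=[[1, 0], [0, 1], [1, 1]])
print([round(x, 3) for x in p])  # a distribution over 3 tokens
```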
Attention
General form of the context vector used to assist the decoder at time step t:
c_t = φ({V_{h,w}}, α_t)
General form of the scores e and the weight vector α:
e_t = a(h_t, {V_{h,w}})
α_t = softmax(e_t)
From empirical success, choose a: e_{it} = βᵀ tanh(W_h h_{t−1} + W_v v_i), and c_t = Σ_i α_{it} v_i
c_t and h_t are simply concatenated and used to predict the token y_t
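The scoring and weighted-sum steps above can be sketched as follows (a toy implementation with identity weight matrices; names like `attend` and the 2-d feature vectors are illustrative assumptions, not the paper's code):

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def attend(h_prev, V, beta, W_h, W_v):
    """Score each feature vector v_i against the decoder state
    (e_i = beta . tanh(W_h h + W_v v_i)), normalize with softmax,
    and return the weighted context c = sum_i alpha_i v_i."""
    def mv(W, x):  # matrix-vector product on nested lists
        return [sum(w * xi for w, xi in zip(row, x)) for row in W]
    Wh_h = mv(W_h, h_prev)
    e = []
    for v in V:
        z = [math.tanh(a + b) for a, b in zip(Wh_h, mv(W_v, v))]
        e.append(sum(bi * zi for bi, zi in zip(beta, z)))
    alpha = softmax(e)
    dim = len(V[0])
    c = [sum(a * v[d] for a, v in zip(alpha, V)) for d in range(dim)]
    return alpha, c

# Three 2-d feature vectors, identity weights for both projections.
alpha, c = attend([0.5, -0.5],
                  V=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                  beta=[1.0, 1.0],
                  W_h=[[1, 0], [0, 1]],
                  W_v=[[1, 0], [0, 1]])
print(len(alpha), len(c))  # one weight per feature vector, one context vector
```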
Model

Attention

Encoder Decoder with Attention

Example

Model Architecture
Experiment Details
Beam search is used at test time, since the decoder models the conditional language probability of the generated tokens
The primary experiment was on IM2LATEX-100k, a dataset of mathematical expressions written in LaTeX
The LaTeX vocabulary was tokenized into relatively specific tokens: modifier characters such as ^, or symbols such as \sigma
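The kind of tokenization described above can be sketched with a single regular expression (illustrative only; the paper's actual preprocessing is more involved and is not reproduced here):

```python
import re

# Split LaTeX into command tokens (\sigma), and every other
# non-space character (modifiers like ^ or _, braces, digits, letters).
TOKEN_RE = re.compile(r"\\[A-Za-z]+|\S")

def tokenize(latex):
    return TOKEN_RE.findall(latex)

print(tokenize(r"\sigma^{2}_{i}"))
# ['\\sigma', '^', '{', '2', '}', '_', '{', 'i', '}']
```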
Experiment Details
Created a duplicate model without the row encoder as a control, to test against image-captioning models; the control was called CNNEnc
Evaluation by comparing the input image with the rendered image of the output LaTeX
Initial learning rate of 0.1, halved once validation perplexity stops decreasing
Low validation perplexity means good generalization when comparing the training set to the validation set
Trained for 12 epochs; beam search uses a beam size of 5
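The decoding procedure can be sketched as a generic beam search over a step function that scores token continuations (the interface, the toy model, and names like `beam_search` are illustrative assumptions, not the paper's implementation):

```python
import math

def beam_search(step_fn, start, beam_size=5, max_len=10, eos="</s>"):
    """Keep the `beam_size` highest-scoring partial sequences at each
    step. `step_fn(seq)` returns a list of (token, log_prob) pairs."""
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in step_fn(seq):
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda b: b[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    return max(finished + beams, key=lambda b: b[1])

# Toy model: prefers "a" over "b", then emits end-of-sequence.
def toy_step(seq):
    if len(seq) >= 3:
        return [("</s>", 0.0)]
    return [("a", math.log(0.6)), ("b", math.log(0.4))]

best, score = beam_search(toy_step, "<s>", beam_size=2)
print(best)  # ['<s>', 'a', 'a', '</s>']
```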
Results
97.5% exact-match accuracy when decoding HTML images
Reimplemented image-to-caption work on LaTeX and achieved over 75% exact-match accuracy
Results
Implementation
Mostly written in Torch, with Python for preprocessing, utilizing Lua libraries
Inputs bucketed into similarly sized images
Citations
https://theneuralperspective.com/2016/11/20/recurrent-neural-network-rnn-part-4-attentional-interfaces/