
Page 1:

Image-to-Markup Generation with Coarse-to-Fine Attention

Presenter: Ceyer Wakilpoor

Yuntian Deng 1, Anssi Kanervisto 2, Alexander M. Rush 1

1 Harvard University
2 University of Eastern Finland

ICML, 2017

Page 2:

Outline

1 Introduction
  Problem Statement

2 Model
  Convolutional Network
  Row Encoder
  Decoder
  Attention

3 Experiment Details

4 Results

Page 4:

Introduction

Optical Character Recognition (OCR) for mathematical expressions

Challenge: recovering markup from a compiled image

The model must produce markup that captures how characters are presented, not just which characters appear

Goal is a data-driven model that requires no domain-specific knowledge

Work builds on the attention-based encoder-decoder models used in machine translation and image captioning

Adds a multi-row recurrent encoder before the attention layer, which is shown to improve performance

Page 6:

Problem Statement

Converting a rendered source image into markup that can render the image

The source x ∈ X is a grayscale image of height H and width W (x ∈ R^(H×W))

The target y ∈ Y is a sequence of tokens y_1, y_2, ..., y_C, where C is the length of the output and each y_i is a token from the markup language with vocabulary Σ

Effectively, the model tries to learn to invert the compile function of the markup using supervised examples

Goal is for compile(y) ≈ x

Generate a hypothesis ŷ; x̂ = compile(ŷ) is the predicted compiled image

Evaluation is done between x̂ and x, i.e. the output is evaluated on whether it renders an image similar to the original input

Page 8:

Model

Convolutional Neural Network (CNN) extracts image features

Each row is encoded using a Recurrent Neural Network (RNN)

When the paper says RNN, it means a Long Short-Term Memory network (LSTM)

These encoded features are used by an RNN decoder with a visual attention layer, which implements a conditional language model over the vocabulary Σ

Page 10:

Convolutional Network

Visual features of the image are extracted with a multi-layer convolutional neural network with interleaved max-pooling layers

Based on model used for OCR by Shi et al.

Unlike some other OCR models, there is no fully-connected layer atthe end of the convolutional layers

Want to preserve spatial relationship of extracted features

The CNN takes an input in R^(H×W) and produces a feature grid V of size C × H′ × W′, where C denotes the number of channels and H′ and W′ are the dimensions reduced by pooling
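As a shape-level illustration of how pooling shrinks the grid, a small sketch (the channel count and the number of 2×2 pooling stages here are assumed for illustration, not taken from the paper, which also uses some pools that shrink only one dimension):

```python
def feature_grid_shape(H, W, num_pools, channels=512):
    """Each 2x2 max-pool halves both spatial dimensions,
    so H' = H / 2^num_pools and W' = W / 2^num_pools."""
    return (channels, H >> num_pools, W >> num_pools)

# e.g. a 64x512 input image after three pooling stages:
print(feature_grid_shape(64, 512, num_pools=3))  # (512, 8, 64)
```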

Page 12:

Row Encoder

Unlike in image captioning, in OCR there is significant sequential information (i.e., reading left-to-right)

Encode each row separately with an RNN

Most markup languages read left-to-right by default, which an RNN will naturally pick up

Encoding each row lets the RNN use surrounding horizontal information to improve the hidden representation

Generic RNN: h_t = RNN(h_{t-1}, v_t; θ)

The row encoder takes V and outputs the encoded grid Ṽ:

Run the RNN over all rows h ∈ {1, ..., H′} and columns w ∈ {1, ..., W′}: Ṽ_{h,w} = RNN(Ṽ_{h,w-1}, V_{h,w})
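A minimal sketch of the row-encoding idea, with a plain tanh RNN cell standing in for the paper's LSTM (weights and sizes here are placeholders):

```python
import numpy as np

def encode_rows(V, Wx, Wh, b):
    """Run a simple tanh RNN left-to-right over each row of the
    feature grid V (shape H' x W' x C), producing an encoded grid
    of shape H' x W' x D, where D is the hidden size."""
    Hp, Wp, C = V.shape
    D = Wh.shape[0]
    Vt = np.zeros((Hp, Wp, D))
    for h in range(Hp):
        state = np.zeros(D)          # fresh hidden state for each row
        for w in range(Wp):          # sweep left-to-right over columns
            state = np.tanh(Wx @ V[h, w] + Wh @ state + b)
            Vt[h, w] = state
    return Vt
```

In the paper the initial hidden state of each row is itself learned, so the encoder can distinguish rows; a fixed zero state is used here only to keep the sketch short.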

Page 14:

Decoder

Decoder is trained as conditional language model

Modeling the probability of the next token conditioned on the previous ones: p(y_{t+1} | y_1, ..., y_t, Ṽ) = softmax(W_out o_t)

W_out is a learned linear transformation and o_t = tanh(W_c [h_t; c_t])

h_t = RNN(h_{t-1}, [y_{t-1}; o_{t-1}])
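The three equations above can be sketched as a single decoding step; this is a shape-level illustration with a tanh cell standing in for the LSTM, and all weights are random placeholders:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def decoder_step(h_prev, y_prev, o_prev, c_t, W_rnn, W_c, W_out):
    """One decoder step: update the hidden state from the previous token
    embedding y_prev and output vector o_prev, combine it with the
    attention context c_t, and map to a vocabulary distribution."""
    h_t = np.tanh(W_rnn @ np.concatenate([h_prev, y_prev, o_prev]))
    o_t = np.tanh(W_c @ np.concatenate([h_t, c_t]))
    p = softmax(W_out @ o_t)         # p(y_{t+1} | y_1..y_t, encoded image)
    return h_t, o_t, p
```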

Page 16:

Attention

General form of the context vector used to assist the decoder at time step t:

c_t = φ({Ṽ_{h,w}}, α_t)

General form of the scores e and the weight vector α:

e_t = a(h_t, {Ṽ_{h,w}})

α_t = softmax(e_t)

Based on its empirical success, a is chosen so that e_{it} = β^T tanh(W_h h_t + W_v Ṽ_i), and c_t = Σ_i α_{it} Ṽ_i

c_t and h_t are simply concatenated and used to predict the token y_{t+1}
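The scoring and weighting steps can be sketched directly (a minimal NumPy version of additive attention; dimensions and weights are placeholders):

```python
import numpy as np

def attend(h_t, V_states, W_h, W_v, beta):
    """Additive attention: score each encoder state v_i with
    e_i = beta^T tanh(W_h h_t + W_v v_i), normalize the scores with a
    softmax, and return the weighted-sum context vector and weights."""
    e = np.array([beta @ np.tanh(W_h @ h_t + W_v @ v) for v in V_states])
    e = e - e.max()                          # numerical stability
    alpha = np.exp(e) / np.exp(e).sum()      # attention weights, sum to 1
    c_t = alpha @ V_states                   # context: sum_i alpha_i * v_i
    return c_t, alpha
```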

Page 17:

Model (figure)

Page 18:

Attention (figure)

Page 19:

Encoder-Decoder with Attention (figure)

Page 20:

Example (figure)

Page 21:

Model Architecture (figure)

Page 23:

Experiment Details

Beam search is used at test time, since the decoder models the conditional language probability of the generated tokens

The primary experiment uses IM2LATEX-100K, a dataset of mathematical expressions written in LaTeX

The LaTeX vocabulary was tokenized into relatively specific tokens, such as modifier characters like ˆ or symbols like \sigma
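A generic beam-search sketch over any model that scores the next token given a prefix (the scoring callback and tokens below are toy stand-ins, not the paper's decoder):

```python
import math

def beam_search(step_logprobs, vocab, beam_size=5, max_len=10, eos="</s>"):
    """Keep the beam_size highest-scoring partial sequences at each step.
    step_logprobs(prefix) returns a dict token -> log-probability."""
    beams = [([], 0.0)]                      # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:       # finished hypotheses carry over
                candidates.append((seq, score))
                continue
            probs = step_logprobs(seq)
            for tok in vocab:                # expand with every token
                candidates.append((seq + [tok], score + probs[tok]))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
        if all(seq and seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]
```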

Page 24:

Experiment Details

A duplicate model without the row encoder was created as a control to compare against image-captioning models; the control is called CNNEnc

Evaluation compares the input image with the rendered image of the output LaTeX

Initial learning rate of 0.1, halved whenever validation perplexity does not decrease

Low validation perplexity indicates good generalization from the training set to the validation set

Trained for 12 epochs; beam search with a beam size of 5
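The learning-rate rule above can be sketched as follows (the perplexity values are made up for illustration):

```python
def halve_on_plateau(lr, val_perplexities):
    """Start from the initial learning rate and halve it whenever
    validation perplexity fails to decrease between epochs."""
    schedule = [lr]
    for prev, curr in zip(val_perplexities, val_perplexities[1:]):
        if curr >= prev:          # no improvement on the validation set
            lr *= 0.5
        schedule.append(lr)
    return schedule

print(halve_on_plateau(0.1, [12.0, 9.0, 9.5, 8.0]))  # [0.1, 0.1, 0.05, 0.05]
```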

Page 26:

Results

97.5% exact-match accuracy when decoding HTML images

A reimplementation of image-to-caption work applied to LaTeX achieved over 75% exact-match accuracy

Page 27:

Results (figure)

Page 28:

Implementation

Mostly written in Torch (using Lua libraries), with Python for preprocessing

Inputs were bucketed into batches of similarly sized images

Page 29:

Citations

https://theneuralperspective.com/2016/11/20/recurrent-neural-network-rnn-part-4-attentional-interfaces/
