
  • An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

    Shaojie Bai, J. Zico Kolter, and Vladlen Koltun

    Presented by Rachel Draelos

    December 7, 2018

  • Introduction

    Question: are CNNs or RNNs better for sequence modeling?

    Claims:

    A generic temporal convolutional network (TCN) architecture outperforms canonical RNN models across a variety of sequence modeling tasks.

    TCNs have longer memory than RNNs.

  • Background

    Inspired by prior work showing that certain convolutional architectures achieve good performance on sequence tasks:

    Audio synthesis: WaveNet (van den Oord et al. 2016)
    Word-level language modeling: gated conv nets (Dauphin et al. 2017)
    Machine translation: ByteNet (Kalchbrenner et al. 2016), ConvS2S (Gehring et al. 2017)

  • Generic Sequence Modeling Task

    Given an input sequence x_0, ..., x_T, the goal is to predict outputs y_0, ..., y_T at each time step.

    A sequence modeling network f : X^{T+1} → Y^{T+1} produces the mapping ŷ_0, ..., ŷ_T = f(x_0, ..., x_T).

    f must satisfy the causality constraint: ŷ_t depends only on x_0, ..., x_t and not on any future inputs x_{t+1}, ..., x_T.

    Sequence modeling goal: find a network f that minimizes the expected loss between the actual outputs and the predictions, L(y_0, ..., y_T, f(x_0, ..., x_T)).
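    The causality constraint can be checked empirically: perturbing future inputs must not change earlier predictions. A minimal PyTorch sketch of such a check (my illustration, not from the slides; it assumes a candidate model f that maps a (batch, T+1, d_in) tensor to a (batch, T+1, d_out) tensor):

    ```python
    import torch

    def check_causality(f, T=50, d_in=4, t=20, tol=1e-6):
        """Empirically verify that predictions at times <= t ignore inputs after t."""
        x = torch.randn(1, T + 1, d_in)
        x_perturbed = x.clone()
        x_perturbed[:, t + 1:, :] += torch.randn(1, T - t, d_in)  # change only the "future"
        with torch.no_grad():
            y1, y2 = f(x), f(x_perturbed)
        # predictions up to and including time t must be identical
        return torch.allclose(y1[:, : t + 1], y2[:, : t + 1], atol=tol)
    ```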

  • Temporal Convolutional Network (TCN)

    “Not a truly new architecture” - rather, a description of a family of architectures.

    Characteristics: 1-D fully convolutional network with:

    (1) Causal convolutions
    (2) Input length = output length
    (3) Dilated convolutions
    (4) Residual connections

  • (1) Causal convolutions & (2) Input length = output length

    Causal convolutions: no information leakage from the future to the past. An output at time t is the result of a convolution that uses only elements from time t and earlier in the previous layer.

    Input length = output length: each hidden layer is the same length as the input layer (via zero padding of length (kernel size − 1)).
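    One common way to get both properties in code is to left-pad the sequence by (kernel size − 1) × dilation and apply an unpadded 1-D convolution; the output then has the input's length and never sees future time steps. A minimal PyTorch sketch (an illustration of the idea, not the authors' implementation; the class name is my own):

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalConv1d(nn.Module):
        """1-D convolution whose output at time t uses only inputs at times <= t."""
        def __init__(self, in_channels, out_channels, kernel_size, dilation=1):
            super().__init__()
            self.left_pad = (kernel_size - 1) * dilation   # pad the past only
            self.conv = nn.Conv1d(in_channels, out_channels, kernel_size,
                                  dilation=dilation)

        def forward(self, x):                  # x: (batch, channels, time)
            x = F.pad(x, (self.left_pad, 0))   # zero padding on the left
            return self.conv(x)                # output length == input length

    # e.g. CausalConv1d(1, 16, kernel_size=3)(torch.randn(8, 1, 100)).shape -> (8, 16, 100)
    ```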

  • (3) Dilated convolutions

    Filter is expanded with gaps. Enables a much larger receptive field.

    For a 1-D sequence input x ∈ R^n and a filter f : {0, ..., k − 1} → R, the dilated convolution operation F on element s of the sequence is:

    F(s) = (x \ast_d f)(s) = \sum_{i=0}^{k-1} f(i) \cdot x_{s - d \cdot i}    (1)

    where d is the dilation factor, k is the filter size, and s − d · i accounts for the direction of the past.
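    Written out directly, the sum above is just a loop over the k filter taps with stride d into the past. A small NumPy sketch (my illustration; positions before the start of the sequence are treated as zero, i.e. zero padding):

    ```python
    import numpy as np

    def dilated_causal_conv(x, f, d):
        """F(s) = sum_{i=0}^{k-1} f[i] * x[s - d*i], evaluated at every position s."""
        n, k = len(x), len(f)
        out = np.zeros(n)
        for s in range(n):
            for i in range(k):
                j = s - d * i            # reach d*i steps into the past
                if j >= 0:               # out-of-range past is zero-padded
                    out[s] += f[i] * x[j]
        return out
    ```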

  • (3) Dilated convolutions

    Two ways to expand the receptive field: larger filter sizes k and larger dilation factors d.

    The effective history of a layer is (k − 1)d.

    TCN model: increase d exponentially with the depth of the network, i.e. d = O(2^i) at level i.

    Figure (1a): Dilated causal convolution with dilation factors d = 1, 2, 4 and filter size k = 3.
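    To make the receptive-field arithmetic concrete, a small sketch (my own illustration, assuming one dilated causal conv layer per level with d = 2^i; the residual block described next actually contains two such layers per level):

    ```python
    def tcn_receptive_field(kernel_size, num_levels):
        """Time steps visible to the top layer when level i uses dilation 2**i."""
        return 1 + sum((kernel_size - 1) * 2 ** i for i in range(num_levels))

    # k = 3 with dilations 1, 2, 4 (as in Figure 1a): 1 + 2*(1 + 2 + 4) = 15 steps
    print(tcn_receptive_field(kernel_size=3, num_levels=3))   # 15
    ```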

  • (4) Residual connections

    Figure: (1b)

    A TCN residual block includes two layers of dilated causal convolutions, ReLU nonlinearities, weight normalization, and spatial dropout (at each training step a whole channel is zeroed out).

    The TCN uses an additional 1×1 convolution to ensure that the element-wise addition of (residual block input ⊕ residual block output) receives tensors of the same shape.
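    Assembled from that description, a condensed PyTorch sketch of such a residual block (an illustration, not the authors' released code; the class name, dropout rate, and layout are my own choices):

    ```python
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.nn.utils import weight_norm

    class TCNResidualBlock(nn.Module):
        """Two weight-normalized dilated causal convs with ReLU and spatial dropout,
        plus a 1x1 conv on the skip path when the channel counts differ."""
        def __init__(self, in_ch, out_ch, kernel_size, dilation, dropout=0.2):
            super().__init__()
            self.pad = (kernel_size - 1) * dilation          # left padding for causality
            self.conv1 = weight_norm(nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation))
            self.conv2 = weight_norm(nn.Conv1d(out_ch, out_ch, kernel_size, dilation=dilation))
            self.dropout = nn.Dropout1d(dropout)             # spatial dropout: zeroes whole channels
            self.downsample = (nn.Conv1d(in_ch, out_ch, 1)   # 1x1 conv so shapes match for addition
                               if in_ch != out_ch else nn.Identity())

        def forward(self, x):                                # x: (batch, channels, time)
            out = self.dropout(F.relu(self.conv1(F.pad(x, (self.pad, 0)))))
            out = self.dropout(F.relu(self.conv2(F.pad(out, (self.pad, 0)))))
            return F.relu(out + self.downsample(x))          # residual addition
    ```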

  • Comparison

    TCN model size ≈ RNN model size.

    TCNs are relatively insensitive to hyperparameter changes as long as the effective history size (i.e. receptive field) is large enough.

    For RNNs, they used grid search to choose hyperparameters.

  • Tasks: Synthetic Data and Music

    Benchmark: The adding problem
    Input: Sequence of length T and depth 2. Dim 1: random values in [0, 1]. Dim 2: all zeros except for two elements marked by 1.
    Objective: Sum the two random values whose second dimension is marked by 1. (A data-generation sketch follows this table.)

    Benchmark: Sequential MNIST and P-MNIST
    Input: Each 28x28 MNIST image is presented to the model as a 784x1 sequence. In P-MNIST, the order of the sequence is permuted using a fixed random order.
    Objective: Digit classification.

    Benchmark: Copy memory
    Input: Sequence of length T+20. The first 10 values are chosen randomly among the digits 1-8; the middle values are 0; the last 11 values are ‘9’ (the first 9 is a delimiter).
    Objective: Generate an output of the same length that is zero everywhere except the last 10 values, where the model must repeat the 10 values it encountered at the start of the input.

    Benchmark: Polyphonic music (JSB Chorales & Nottingham)
    Input: Sequence where each element is an 88-bit binary code (for the 88 piano keys), with a 1 indicating a key that is pressed at a given time.
    Objective: Predict the next note.
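    As an illustration of the first benchmark above, a minimal sketch for generating adding-problem data (my construction from the description, not the authors' data code):

    ```python
    import numpy as np

    def adding_problem_batch(batch_size, T, rng=None):
        """x: (batch, T, 2) -- dim 1 random values in [0, 1], dim 2 marks two positions.
        y: (batch, 1)      -- sum of the two marked values."""
        rng = rng or np.random.default_rng()
        values = rng.uniform(0.0, 1.0, size=(batch_size, T))
        markers = np.zeros((batch_size, T))
        y = np.zeros((batch_size, 1))
        for b in range(batch_size):
            i, j = rng.choice(T, size=2, replace=False)   # two marked positions
            markers[b, [i, j]] = 1.0
            y[b, 0] = values[b, i] + values[b, j]
        return np.stack([values, markers], axis=-1), y
    ```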

  • Tasks: Language

    Benchmark: Penn Treebank
    Data: Words: 888K train, 70K valid, 79K test; vocabulary 10K. Chars: 5,059K train, 396K valid, 446K test; alphabet size 50.
    Objective: Predict the next word (word level) / predict the next character (character level).

    Benchmark: Wikitext-103
    Data: Words: 103M train, 218K valid, 246K test; vocabulary 268K.
    Objective: Predict the next word.

    Benchmark: LAMBADA
    Data: Train: full text of 2,662 novels. Test: 10K passages for which humans are good at predicting the last word only when given context. Input: ~4.6 context sentences plus 1 target sentence with the last word missing.
    Objective: Predict the last word of the target sentence.

    Benchmark: text8
    Data: Chars: 90M train, 5M valid, 5M test; alphabet size 27.
    Objective: Predict the next character.

  • Results

    [Results omitted: higher is better for some metrics, lower is better for others.]

  • Results

  • Results: Copy Memory Task

  • Results: TCN vs. SoTA

    Best results are highlighted in yellow (higher is better) or blue (lower is better). Note that the SoTA model may be larger than the TCN model (see the “Size” columns).

  • TCN Summary: Advantages/Disadvantages

    + Parallelism: a long input sequence can be processed as a whole rather than sequentially.
    + Control over receptive field size / memory size: e.g. via more layers, larger dilation factors, larger filter sizes.
    + Stable gradients: avoids exploding/vanishing gradients.
    + Lower memory requirement for training: TCNs use up to a multiplicative factor less memory than gated RNNs.
    − Higher memory requirement for evaluation/testing: TCNs need the raw sequence up to the effective history length, not just the current input x_t.
    − Parameter changes for transfer of domain: a TCN model may perform poorly if transferred from a domain where only little memory is needed (small k and d) to a domain where longer memory is needed (large k and d).
