
Page 1:

Multilabel Classification and Deep Learning

Zachary Chase Lipton

Critical Review of RNNs: http://arxiv.org/abs/1506.00019

Learning to Diagnose: http://arxiv.org/abs/1511.03677

Conditional Generative RNNs: http://arxiv.org/abs/1511.03683

Page 2:

Outline • Introduction to Multilabel Learning

• Evaluation

• Efficient Learning & Sparse Models

• Deep Learning for Multilabel Classification

• Classifying Multilabel Time Series with RNNs

Page 3:

Supervised Learning

• General problem: we desire a labeling function $f : \mathcal{X} \to \mathcal{Y}$

• ERM principle: choose the model $\hat{f} \in \mathcal{H}$ in the hypothesis class that minimizes loss on the training sample $S \in \{\mathcal{X} \times \mathcal{Y}\}^n$

• Most research assumes the simplest case: $\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = \{0, 1\}$

• The real world is much messier

Page 4:

Binary Classification

$y \in \{0, 1\}$

Page 5:

Multiclass Classification

$y \in \{c_1, c_2, \ldots, c_L\}$

Page 6:

Multilabel Classification

$y \subseteq \{c_1, c_2, \ldots, c_L\}$

Page 7:

Why Multilabel?

• Superset of both BC and MC: BC when $|\mathcal{L}| = 1$, MC when $y \in \mathcal{L}$

• Natural for many real problems: clinical diagnosis, predicting purchases, auto-tagging news articles, activity recognition, object detection

• Easy to formulate: take L tasks and slap them together

Page 8:

Naive Baseline

• Binary relevance: separately train $|\mathcal{L}|$ classifiers $f_l : \mathcal{X} \to \{0, 1\}$ (a minimal sketch follows below)

• Pros: simple to execute, easy to understand, a strong baseline

• Cons: computational cost scales as $|\mathcal{L}| \times$ that of a single classifier; leaves some information on the table (correlations between labels)
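A minimal sketch of the binary relevance baseline, assuming a dense feature matrix X and a binary indicator matrix Y with one column per label; scikit-learn logistic regression is an illustrative base classifier, not something prescribed by the slides:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_binary_relevance(X, Y):
    """Train one independent binary classifier per label (binary relevance)."""
    models = []
    for l in range(Y.shape[1]):                  # one model per label column
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X, Y[:, l])                      # assumes each label has both classes present
        models.append(clf)
    return models

def predict_binary_relevance(models, X):
    """Stack the per-label predictions into an (n_examples, n_labels) binary matrix."""
    return np.column_stack([m.predict(X) for m in models])
```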

Page 9:

Challenges

• Efficiency: develop classifiers that do not scale in time or space complexity with the number of labels

• Performance: make use of the extra labels to achieve better accuracy and generalization

• Evaluation: how do we evaluate a multilabel classifier’s performance across 10s, 100s, 1000s, or even 1M labels?

Page 10:

Outline • Introduction to Multilabel Learning

• Evaluation

• Efficient Learning & Sparse Models

• Deep Learning for Multilabel Classification

• Classifying Multilabel Time Series with RNNs

Page 11:

Why not accuracy?

• Often extreme class imbalance: when a blind classifier gets 99.99% accuracy, it can be optimal to be uninformative

• Varying base rates across labels. E.g., in the MeSH dataset, the label Human applies to 71% of articles while Platypus applies to fewer than .0001%

Page 12:

F1 Score

• Easy to calculate from the confusion matrix

• Harmonic mean of precision and recall:

$$F_1 = \frac{2 \cdot tp}{2 \cdot tp + fp + fn}, \qquad \text{precision} = \frac{tp}{tp + fp}, \qquad \text{recall} = \frac{tp}{tp + fn}$$
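A one-function sketch of the formula above, taking the confusion-matrix counts as plain integers:

```python
def f1_from_counts(tp, fp, fn):
    """F1 from confusion-matrix counts: the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```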

Page 13:

F1 given fixed base rate

Page 14:

Compared to Accuracy

Page 15:

Expected F1 for Uninformative Classifier

Page 16:

Multilabel Variations

Label 1 Label 2 Label 3 Label 4

Example 1 TP FP FN TN

Example 2 FP FP FN TP

Example 3 FN TP FN FP

… TN TP TP TN

• Micro: F1 calculated over all entries pooled together

Page 17:

Macro F1

• Macro: F1 calculated separately for each label and then averaged (both the micro and macro variants are sketched in code after the table below)

Label 1 Label 2 Label 3 Label 4

Example 1 TP FP FN TN

Example 2 FP FP FN TP

Example 3 FN TP FN FP

… TN TP TP TN
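A minimal sketch of both averaging schemes, assuming binary matrices Y_true and Y_pred of shape (n_examples, n_labels):

```python
import numpy as np

def micro_f1(Y_true, Y_pred):
    """Pool tp/fp/fn over every (example, label) entry, then compute a single F1."""
    tp = np.sum((Y_true == 1) & (Y_pred == 1))
    fp = np.sum((Y_true == 0) & (Y_pred == 1))
    fn = np.sum((Y_true == 1) & (Y_pred == 0))
    return 2 * tp / (2 * tp + fp + fn)

def macro_f1(Y_true, Y_pred):
    """Compute F1 per label (per column), then average the per-label scores."""
    scores = []
    for l in range(Y_true.shape[1]):
        tp = np.sum((Y_true[:, l] == 1) & (Y_pred[:, l] == 1))
        fp = np.sum((Y_true[:, l] == 0) & (Y_pred[:, l] == 1))
        fn = np.sum((Y_true[:, l] == 1) & (Y_pred[:, l] == 0))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))
```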

Page 18:

Characterizing the Optimal Threshold

• Threshold can be expressed in terms of the conditional probabilities of scores given labels

• When scores are calibrated probabilities, the optimal threshold is precisely half the F1 that it achieves (a brute-force check is sketched below)
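A brute-force check of the threshold-F1 relationship: sweep the observed scores as candidate thresholds and keep the one maximizing F1 (the setup with scores and labels arrays is an illustrative assumption):

```python
import numpy as np

def best_f1_threshold(scores, labels):
    """Return the score threshold that maximizes F1 on (scores, labels).

    If the scores are well-calibrated probabilities, the winning threshold
    should come out close to half of the F1 it achieves.
    """
    best_t, best_f1 = 0.5, 0.0
    for t in np.unique(scores):
        pred = (scores >= t).astype(int)
        tp = np.sum((labels == 1) & (pred == 1))
        fp = np.sum((labels == 0) & (pred == 1))
        fn = np.sum((labels == 1) & (pred == 0))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```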

Page 19:

Problems with F1

• Sensitive to thresholding strategy

• Hard to tell who has the best algorithms and who is smart about thresholding

• Micro-F1 biased towards common labels

• Macro-F1 biased against them

Page 20:

Some alternatives

• Any threshold indicates a cost sensitivity:

When you know the cost, specify it and use weighted accuracy

• AUC exhibits the same dynamic range for every label (a blind classifier gets 0.5 in expectation, a perfect one gets 1)

• Macro-averaged AUC may give a better sense of performance across all labels, but high AUC on rare labels can be misleading: a classifier can achieve an AUC of .99 and still produce useless results for information retrieval (a macro/micro AUC computation is sketched below)
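A minimal scikit-learn sketch of macro- and micro-averaged AUC on toy data (the matrices are illustrative; each label column must contain both classes for its AUC to be defined):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

Y_true = np.array([[1, 0, 1],
                   [0, 1, 1],
                   [1, 1, 0],
                   [0, 0, 0]])          # 4 examples, 3 labels
Y_score = np.array([[0.9, 0.2, 0.8],
                    [0.3, 0.7, 0.6],
                    [0.8, 0.6, 0.4],
                    [0.1, 0.3, 0.2]])   # real-valued scores, same shape

print(roc_auc_score(Y_true, Y_score, average="macro"))  # AUC per label, then averaged
print(roc_auc_score(Y_true, Y_score, average="micro"))  # AUC over all pooled entries
```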

Page 21:

Outline • Introduction to Multilabel Learning

• Evaluation

• Efficient Learning & Sparse Models

• Deep Learning for Multilabel Classification

• Classifying Multilabel Time Series with RNNs

Page 22:

The problem

• With many labels, binary relevance models can be huge and slow

• 10k labels × 1M features ≈ 80 GB of parameters (see the arithmetic below)

• We want compact models: fast to train and evaluate, cheap to store
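The 80 GB figure follows from one weight per feature-label pair, assuming 64-bit floats (the precision is an assumption, not stated on the slide):

$$10^4 \text{ labels} \times 10^6 \text{ features} = 10^{10} \text{ weights}, \qquad 10^{10} \times 8 \text{ bytes} = 8 \times 10^{10} \text{ bytes} = 80\ \text{GB}.$$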

Page 23:

Linear Regression

• The bulk of the computation is label-agnostic: the inverse $(X^\top X)^{-1}$ in the single-label solution $\theta = (X^\top X)^{-1} X^\top b$ is shared across labels, so all labels can be solved at once via $\Theta = (X^\top X)^{-1} X^\top B$ (a numpy sketch follows below)

• Can do this especially fast when we reduce the dimensionality of X via SVD

• Problem: unsupervised dimensionality reduction loses the signal of rare features, which messes up rare labels
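A minimal numpy sketch of the shared-computation idea; the small ridge term is an added assumption for numerical stability, not part of the slide's formula:

```python
import numpy as np

def multilabel_least_squares(X, B, reg=1e-6):
    """Solve Theta = (X^T X)^{-1} X^T B for all labels at once.

    X: (n_examples, n_features) design matrix.
    B: (n_examples, n_labels) target matrix, one column per label.
    The Gram matrix is formed and factorized once, then reused for every label.
    """
    G = X.T @ X + reg * np.eye(X.shape[1])    # label-agnostic part of the computation
    return np.linalg.solve(G, X.T @ B)        # (n_features, n_labels) weight matrix
```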

Page 24:

Sparsity

• For auto-tagging tasks, features are often high-dimensional, sparse

bag-of-words or n-grams

• Datasets for web-scale information retrieval tasks are large in the number of examples, thus SGD is the default optimization procedure

• Absent regularization, the gradient is sparse and training is fast

• Regularization destroys the sparsity of the gradient

• When the numbers of features and labels are large, dense stochastic updates are computationally infeasible (an unregularized sparse update is sketched below)
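To make the point concrete, here is an unregularized SGD step for a sparse example under logistic loss (the logistic-loss linear model is an illustrative choice): only the weights of nonzero features are touched.

```python
import numpy as np

def sparse_sgd_step(w, x_indices, x_values, y, lr=0.1):
    """One unregularized logistic-loss SGD step on a sparse example.

    x_indices and x_values encode the nonzero entries of the feature vector,
    so the update touches only those coordinates of w.
    """
    margin = sum(w[i] * v for i, v in zip(x_indices, x_values))
    p = 1.0 / (1.0 + np.exp(-margin))        # predicted probability of the positive class
    g = p - y                                 # derivative of the log loss w.r.t. the margin
    for i, v in zip(x_indices, x_values):
        w[i] -= lr * g * v                    # only nonzero features get updated
    return w
```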

Page 25:

Regularization

• Goals: achieve model sparsity, prevent overfitting

• $\ell_1$ regularization induces sparse models

• $\ell_2^2$ regularization is thought to achieve more accurate models in practice

• Elastic net balances the two

Page 26:

Balancing Regularization with Efficiency

• To regularize while maintaining efficiency, can use a lazy updating scheme, first described by Carpenter (2008)

• For each feature, remember the last time it was nonzero

• When a feature is nonzero again at some step t + k, perform a closed-form catch-up update (a simplified version is sketched below)

• We derive lazy updates for elastic net regularization on both standard SGD and FoBoS (Duchi & Singer)
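A simplified sketch of the lazy-update bookkeeping, illustrated with plain $\ell_2^2$ weight decay and a constant learning rate rather than the full elastic-net formulas on the next slide (which are the actual contribution): each weight records when it was last brought current, and the accumulated shrinkage is applied in closed form the next time its feature appears.

```python
import numpy as np

class LazyL2Decay:
    """Constant-time catch-up for L2 weight decay with a constant learning rate.

    Over k skipped steps, the decay-only update would multiply w[j] by
    (1 - lr * lam) exactly k times, so we apply that factor once, on demand.
    """
    def __init__(self, n_features, lr=0.1, lam=1e-4):
        self.w = np.zeros(n_features)
        self.last = np.zeros(n_features, dtype=int)   # step at which each weight was last current
        self.lr, self.lam = lr, lam

    def bring_current(self, j, t):
        """Catch weight j up to step t before using or updating it."""
        k = t - self.last[j]
        self.w[j] *= (1.0 - self.lr * self.lam) ** k  # closed-form catch-up
        self.last[j] = t
```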

Page 27:

Lazy Updates for Elastic Net

Theorem 1. To bring the weight $w_j$ current from time $j$ to time $k$ using SGD, the constant-time update is

$$w_j^{(k)} = \operatorname{sgn}\big(w_j^{(j)}\big)\left[\big|w_j^{(j)}\big| \frac{P(k-1)}{P(j-1)} - P(k-1)\cdot\big(B(k-1) - B(j-1)\big)\right]_+ \tag{1}$$

where $P(t) = (1 - \eta^{(t)}\lambda_2)\cdot P(t-1)$ with base case $P(-1) = 1$, and $B(t) = \sum_{\tau=0}^{t} \eta^{(\tau)}/P(\tau-1)$ with base case $B(-1) = 0$.

Theorem 2. A constant-time lazy update for FoBoS with elastic net regularization and a decreasing learning rate, bringing a weight current at time $k$ from time $j$, is

$$w_j^{(k)} = \operatorname{sgn}\big(w_j^{(j)}\big)\left[\big|w_j^{(j)}\big| \frac{\phi(k-1)}{\phi(j-1)} - \phi(k-1)\cdot\lambda_1\cdot\big(\beta(k-1) - \beta(j-1)\big)\right]_+ \tag{2}$$

where $\phi(t) = \phi(t-1)\cdot\frac{1}{1+\eta^{(t)}\lambda_2}$ with base case $\phi(-1) = 1$, and $\beta(t) = \beta(t-1) + \frac{\eta^{(t)}}{\phi(t-1)}$ with base case $\beta(-1) = 0$.

Page 28:

Empirical Validation

• On the two largest datasets in the Mulan repository of multilabel datasets, we can train to convergence on a laptop in just minutes

• rcv1: 490x speedup; bookmarks: 20x speedup


Page 29:

Outline • Introduction to Multilabel Learning

• Evaluation

• Efficient Learning & Sparse Models

• Deep Learning for Multilabel Classification

• Classifying Multilabel Time Series with RNNs

Page 30:

Performance

• Efficiency is nice, but we’d also like performance

• Neural networks can learn shared representations across labels

• This both regularizes each label’s model and exploits correlations between labels

• In the extreme multilabel setting, this may use significantly fewer parameters than independent logistic regressions

Page 31:

Neural Network

Page 32:

Training with Backpropagation

• Goal: calculate the derivative of the loss function with respect to each parameter (weight) in the model

• Update the weights by following the gradient
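Presumably the familiar stochastic gradient descent rule, with learning rate $\eta$ (one common form; the slide itself may write it differently):

$$w \leftarrow w - \eta \, \frac{\partial \mathcal{L}}{\partial w}$$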

Page 33:

Forward Pass

Page 34:

Backward Pass

Page 35:

Multilabel MLP
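A minimal PyTorch sketch of a multilabel MLP of this kind: a shared hidden representation feeding one sigmoid output per label, trained with binary cross-entropy. Layer sizes, optimizer, and names are illustrative assumptions, not taken from the slides.

```python
import torch
import torch.nn as nn

class MultilabelMLP(nn.Module):
    """Shared hidden layer feeding one logit per label."""
    def __init__(self, n_features, n_labels, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_labels),   # one logit per label
        )

    def forward(self, x):
        return self.net(x)                 # raw logits; the loss applies the sigmoid

model = MultilabelMLP(n_features=1000, n_labels=50)
criterion = nn.BCEWithLogitsLoss()         # sigmoid + binary cross-entropy, per label
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(32, 1000)                  # toy batch of features
y = torch.randint(0, 2, (32, 50)).float()  # toy binary label matrix
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```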

Page 36:

Outline • Introduction to Multilabel Learning

• Evaluation

• Efficient Learning & Sparse Models

• Deep Learning for Multilabel Classification

• Classifying Multilabel Time Series with RNNs

Page 37:

To Model Sequential Data: Recurrent Neural Networks

Page 38:

Recurrent Net (Unfolded)

Page 39:

LSTM Memory Cell (Hochreiter & Schmidhuber, 1997)

Page 40:

LSTM Forward Pass
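For reference, one standard formulation of the LSTM forward pass (the common variant with forget gates; the notation here may differ from the slide's figure):

$$
\begin{aligned}
i^{(t)} &= \sigma\big(W^{ix} x^{(t)} + W^{ih} h^{(t-1)} + b_i\big) \\
f^{(t)} &= \sigma\big(W^{fx} x^{(t)} + W^{fh} h^{(t-1)} + b_f\big) \\
o^{(t)} &= \sigma\big(W^{ox} x^{(t)} + W^{oh} h^{(t-1)} + b_o\big) \\
g^{(t)} &= \tanh\big(W^{gx} x^{(t)} + W^{gh} h^{(t-1)} + b_g\big) \\
c^{(t)} &= f^{(t)} \odot c^{(t-1)} + i^{(t)} \odot g^{(t)} \\
h^{(t)} &= o^{(t)} \odot \tanh\big(c^{(t)}\big)
\end{aligned}
$$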

Page 41:

LSTM (full network)

Page 42:

Unstructured Input

Page 43:

Modeling Problems

• Examples: 10,401 episodes

• Features: 13 time series (sensor data, lab tests)

• Complications: Irregular sampling, missing values, varying-length sequences

Page 44:

How to model sequences?

• Markov models

• Conditional Random Fields

• Problem: Cannot model long range dependencies

Page 45:

Simple Formulation

Page 46:

Target Replication
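As described in the Learning to Diagnose paper linked on the title slide, target replication attaches a copy of the sequence-level label to every time step of the RNN and trains against all of them; one way to write the combined loss, with a hyperparameter $\alpha$ weighting the replicated targets against the final-step loss, is:

$$\mathcal{L} = \alpha \cdot \frac{1}{T} \sum_{t=1}^{T} \ell\big(\hat{y}^{(t)}, y\big) + (1 - \alpha) \cdot \ell\big(\hat{y}^{(T)}, y\big)$$

At prediction time, only the output at the final step is used.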

Page 47:

Auxiliary Targets

Page 48:

Results

Page 49:

Outline • Introduction to Multilabel Learning

• Evaluation

• Efficient Learning & Sparse Models

• Deep Learning for Multilabel Classification

• Jointly Learning to Generate and Classify Beer Reviews

Page 50:

RNN Language Model

Page 51:

Past Supervised Approaches relied upon Encoder-Decoder Model

Page 52:

Bridging Long Time Intervals with Concatenated Inputs

Page 53:

Example

A.5 FRUIT/VEGETABLE BEER <STR>On tap at the brewpub. A nice dark red color with a nice head that left a lot of lace on the glass. Aroma is of raspberries and chocolate. Not much depth to speak of despite consisting of raspberries. The bourbon is pretty subtle as well. I really don’t know that I find a flavor this beer tastes like. I would prefer a little more carbonization to come through. It’s pretty drinkable, but I wouldn’t mind if this beer was available. <EOS>

Page 54:

Character-based Classification

Page 55:

“Love the Strong Hoppy Flavor”

Page 56:

Thanks!

Critical Review of RNNs: http://arxiv.org/abs/1506.00019
Learning to Diagnose: http://arxiv.org/abs/1511.03677
Conditional Generative RNNs: http://arxiv.org/abs/1511.03683

Contact: [email protected] zacklipton.com