Multi-label Classification
Jesse Read, https://users.ics.aalto.fi/jesse/
Department of Information and Computer Science, Helsinki, Finland
Summer School on Data Sciences for Big Data, September 5, 2015


Page 1: Multi-label Classification - GitHub Pages

Multi-label Classification

Jesse Read
https://users.ics.aalto.fi/jesse/

Department of Information and Computer Science, Helsinki, Finland

Summer School on Data Sciences for Big Data, September 5, 2015

Page 2

Multi-label Classification

Binary classification: Is this a picture of the sea?

∈ {yes, no}

Page 3

Multi-label Classification

Multi-class classification: What is this a picture of?

∈ {sea, sunset, trees, people, mountain, urban}

Page 4

Multi-label Classification

Multi-label classification: Which labels are relevant to this picture?

⊆ {sea, sunset, trees, people, mountain, urban}

i.e., multiple labels per instance instead of a single label!

Page 5

Multi-label Classification

        K = 2         K > 2
L = 1   binary        multi-class
L > 1   multi-label   multi-output†

† also known as multi-target, multi-dimensional.

Figure: For L target variables (labels), each taking one of K values.

multi-output can be cast to multi-label, just as multi-class can be cast to binary.

tagging / keyword assignment: the set of labels (L) is not predefined

Page 6

Increasing Interest

year         in text   in title
1996-2000         23          1
2001-2005        188         18
2006-2010       1470        164
2011-2015       4550        485

Table: Academic articles containing the phrase 'multi-label classification' (Google Scholar)

Page 7

Single-label vs. Multi-label

Table: Single-label, Y ∈ {0, 1}

X1   X2    X3   X4   X5   Y
1    0.1   3    1    0    0
0    0.9   1    0    1    1
0    0.0   1    1    0    0
1    0.8   2    0    1    1
1    0.0   2    0    1    0
0    0.0   3    1    1    ?

Table: Multi-label, Y ⊆ {λ1, . . . , λL}

X1   X2    X3   X4   X5   Y
1    0.1   3    1    0    {λ2, λ3}
0    0.9   1    0    1    {λ1}
0    0.0   1    1    0    {λ2}
1    0.8   2    0    1    {λ1, λ4}
1    0.0   2    0    1    {λ4}
0    0.0   3    1    1    ?

Page 8

Single-label vs. Multi-label

Table: Single-label, Y ∈ {0, 1}

X1   X2    X3   X4   X5   Y
1    0.1   3    1    0    0
0    0.9   1    0    1    1
0    0.0   1    1    0    0
1    0.8   2    0    1    1
1    0.0   2    0    1    0
0    0.0   3    1    1    ?

Table: Multi-label, [Y1, . . . , YL] ∈ {0, 1}^L

X1   X2    X3   X4   X5   Y1   Y2   Y3   Y4
1    0.1   3    1    0    0    1    1    0
0    0.9   1    0    1    1    0    0    0
0    0.0   1    1    0    0    1    0    0
1    0.8   2    0    1    1    0    0    1
1    0.0   2    0    1    0    0    0    1
0    0.0   3    1    1    ?    ?    ?    ?

Page 9

Outline

1 Introduction

2 Applications

3 Background

4 Problem Transformation

5 Algorithm Adaptation

6 Label Dependence

7 Multi-label Evaluation

8 Summary & Resources

Page 10

Text Categorization

For example, the news . . .

Novo Banco: Portugal bank sell-off hits snag

Portugal's central bank has missed its deadline to sell Novo Banco, a bank created after the collapse of the country's second-biggest lender.

Reuters collection: newswire stories categorized into 103 topic codes

Page 11

Text Categorization

For example, the IMDb dataset: textual movie plot summaries associated with genres (labels).

Page 12

Text Categorization

For example, the IMDb dataset: textual movie plot summaries associated with genres (labels). Feature columns are words (abandoned, accident, . . . , violent, wedding); label columns are genres (horror, romance, . . . , comedy, action):

i        X1  X2  . . .  X1000  X1001  Y1  Y2  . . .  Y27  Y28
1        1   0   . . .  0      1      0   1   . . .  0    0
2        0   1   . . .  1      0      1   0   . . .  0    0
3        0   0   . . .  0      1      0   1   . . .  0    0
4        1   1   . . .  0      1      1   0   . . .  0    1
5        1   1   . . .  0      1      0   1   . . .  0    1
...      ..  ..  . . .  ..     ..     ..  ..  . . .  ..   ..
120919   1   1   . . .  0      0      0   0   . . .  0    1

Page 13

Labelling E-mails

For example, the Enron e-mails, multi-labelled to 53 categories by the UC Berkeley Enron Email Analysis Project:

Company Business, Strategy, etc.
Purely Personal
Empty Message
Forwarded email(s)
. . .
company image – current
. . .
Jokes, humor (related to business)
. . .
Emotional tone: worry / anxiety
Emotional tone: sarcasm
. . .
Emotional tone: shame

Page 14

Labelling Images

Images are labelled to indicate

multiple concepts
multiple objects
multiple people

e.g., Scene data with concept labels ⊆ {beach, sunset, foliage, field, mountain, urban}

Page 15

Applications: Audio

Labelling music/tracks with genres, voices, concepts, etc.

e.g., Music dataset, audio tracks labelled with different moods, among:

{amazed-surprised, happy-pleased, relaxing-calm, quiet-still, sad-lonely, angry-aggressive}

Page 16

Medical

Medical Diagnosis

medical history, symptoms → diseases / ailments

e.g., Medical dataset,

clinical free-text reports by radiologists

label assignment out of 45 ICD-9-CM codes

Page 17

Bioinformatics

Genes are associated with biological functions.

E.g., the Yeast dataset: 2,417 genes, described by 103 attributes, labelled into 14 groups of the FunCat functional catalogue.

Page 18

Related Tasks

multi-output¹ classification: outputs are nominal

    yj ∈ {1, . . . , K},   y ∈ N^L

multi-output regression: outputs are real-valued

    yj ∈ R,   y ∈ R^L

X1   X2   X3   X4   X5   price   age   percent
x1   x2   x3   x4   x5   37.00   25    0.88
x1   x2   x3   x4   x5   22.88   22    0.22
x1   x2   x3   x4   x5   88.23   11    0.77

label ranking, i.e., preference learning

    λ3 ≻ λ1 ≻ λ4 ≻ . . . ≻ λ2

¹ aka multi-target, multi-dimensional

Page 19

Related Areas

multi-task learning: multiple tasks, shared representation; data may come from different sources, e.g., learn to recognise speech for different speakers, or classify text from different corpora

sequential learning: predict across time indices instead of across label indices

structured output prediction: assume a particular structure among outputs, e.g., pixels

Page 20

Advanced Applications

Figure: Image Segmentation: Foreground yj = 1

Page 21

Advanced Applications

Figure: Localization: yj = 1 if j-th tile occupied.

Page 22

Advanced Applications

Figure: Demand prediction: yj = 1 if high demand at j-th node.

Page 23

Outline

1 Introduction

2 Applications

3 Background

4 Problem Transformation

5 Algorithm Adaptation

6 Label Dependence

7 Multi-label Evaluation

8 Summary & Resources

Page 24

Single-label Classification

[Graphical model: a single output y depending on inputs x1, . . . , x5]

y = h(x)                          • classifier h
  = argmax_{y ∈ {0,1}} p(y|x)     • MAP estimate

Page 25

Example: Naive Bayes

[Graphical model: class y generating inputs x1, . . . , x5]

y = argmax_{y ∈ {0,1}} p(y) p(x|y)                  • generative: p(y|x) ∝ p(x|y) p(y)
  = argmax_{y ∈ {0,1}} p(y) ∏_{d=1}^{D} p(xd|y)     • Naive Bayes

Page 26

Example: Logistic Regression

[Graphical model: a single output y depending on inputs x1, . . . , x5]

y = argmax_{y ∈ {0,1}} p(y|x)     • MAP estimate

p(y = 1|x) = fw(x) = 1 / (1 + exp(−w⊤x))     • logistic regression

and find w to minimize the error E(w)

Page 27

Focus on the Labels

[Figure: graphical models moving from a single output y over x1, . . . , x5, to multiple outputs y1, . . . , y4, to plate notation over xd (d = 1, . . . , D), and finally to a compact form with a single input node x]

Page 28

Multi-label Classification

[Graphical model: outputs y1, . . . , y4, each depending only on x]

yj = hj(x) = argmax_{yj ∈ {0,1}} p(yj|x)     • for each index j = 1, . . . , L

and then,

y = h(x) = [y1, . . . , y4]
  = [ argmax_{y1 ∈ {0,1}} p(y1|x), · · · , argmax_{y4 ∈ {0,1}} p(y4|x) ]
  = [ f1(x), · · · , f4(x) ] = f(W⊤x)

This is the Binary Relevance method (BR).

Page 29

Outline

1 Introduction

2 Applications

3 Background

4 Problem Transformation

5 Algorithm Adaptation

6 Label Dependence

7 Multi-label Evaluation

8 Summary & Resources

Page 30

BR Transformation

1. Transform the dataset . . .

X      Y1  Y2  Y3  Y4
x(1)   0   1   1   0
x(2)   1   0   0   0
x(3)   0   1   0   0
x(4)   1   0   0   1
x(5)   0   0   0   1

. . . into L separate binary problems (one for each label):

X      Y1  |  X      Y2  |  X      Y3  |  X      Y4
x(1)   0   |  x(1)   1   |  x(1)   1   |  x(1)   0
x(2)   1   |  x(2)   0   |  x(2)   0   |  x(2)   0
x(3)   0   |  x(3)   1   |  x(3)   0   |  x(3)   0
x(4)   1   |  x(4)   0   |  x(4)   0   |  x(4)   1
x(5)   0   |  x(5)   0   |  x(5)   0   |  x(5)   1

2. . . . and train with any off-the-shelf binary base classifier.
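The transformation above is a few lines of code. A minimal sketch (the nearest-centroid base classifier and the toy data are illustrative stand-ins, not part of the slides' tooling):

```python
import numpy as np

class Centroid:
    """Tiny stand-in binary base classifier: predict the nearer class centroid."""
    def fit(self, X, y):
        self.c0 = X[y == 0].mean(axis=0)
        self.c1 = X[y == 1].mean(axis=0)
        return self

    def predict(self, X):
        d0 = ((X - self.c0) ** 2).sum(axis=1)
        d1 = ((X - self.c1) ** 2).sum(axis=1)
        return (d1 < d0).astype(int)

class BinaryRelevance:
    """One independent binary problem (and model) per label column."""
    def fit(self, X, Y):
        self.models = [Centroid().fit(X, Y[:, j]) for j in range(Y.shape[1])]
        return self

    def predict(self, X):
        return np.column_stack([m.predict(X) for m in self.models])

# The 5-instance, 4-label toy dataset from the slides
X = np.array([[1, 0.1, 3, 1, 0],
              [0, 0.9, 1, 0, 1],
              [0, 0.0, 1, 1, 0],
              [1, 0.8, 2, 0, 1],
              [1, 0.0, 2, 0, 1]], dtype=float)
Y = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 0, 1]])

y_hat = BinaryRelevance().fit(X, Y).predict(np.array([[0, 0.0, 3, 1, 1]], dtype=float))
print(y_hat)  # one 0/1 decision per label, made independently
```

Any binary classifier can be swapped in for `Centroid`; that pluggability is the whole point of the transformation.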

Page 31

Why Not Binary Relevance?

BR ignores label dependence, i.e., it assumes

p(y|x) ∝ p(x) ∏_{j=1}^{L} p(yj|x)

which may not always hold!

Example (Film Genre Classification)

p(y_romance|x) ≠ p(y_romance|x, y_horror)

Page 32

Why Not Binary Relevance?

BR ignores label dependence, i.e., it assumes

p(y|x) ∝ p(x) ∏_{j=1}^{L} p(yj|x)

which may not always hold!

Table: Average predictive performance (5-fold CV, EXACT MATCH)

          L     BR     MCC
Music     6     0.30   0.37
Scene     6     0.54   0.68
Yeast     14    0.14   0.23
Genbase   27    0.94   0.96
Medical   45    0.58   0.62
Enron     53    0.07   0.09
Reuters   101   0.29   0.37

Page 33

Classifier Chains

Modelling label dependence,

[Graphical model: a chain y1 → y2 → y3 → y4, all conditioned on x]

p(y|x) ∝ p(x) ∏_{j=1}^{L} p(yj|x, y1, . . . , y_{j−1})

and,

y = argmax_{y ∈ {0,1}^L} p(y|x)

Page 34

CC Transformation

Similar to BR: make L binary problems, but include previous predictions as feature attributes,

X     Y1   |  X     Y1   Y2   |  X     Y1  Y2   Y3   |  X     Y1  Y2  Y3   Y4
x(1)  0    |  x(1)  0    1    |  x(1)  0   1    1    |  x(1)  0   1   1    0
x(2)  1    |  x(2)  1    0    |  x(2)  1   0    0    |  x(2)  1   0   0    0
x(3)  0    |  x(3)  0    1    |  x(3)  0   1    0    |  x(3)  0   1   0    0
x(4)  1    |  x(4)  1    0    |  x(4)  1   0    0    |  x(4)  1   0   0    1
x(5)  0    |  x(5)  0    0    |  x(5)  0   0    0    |  x(5)  0   0   0    1

and, again, apply any classifier (not necessarily a probabilistic one)!

Page 35

Greedy CC

[Graphical model: a chain y1 → y2 → y3 → y4 over x]

L classifiers for L labels. For a test instance x, classify [22]:
1. y1 = h1(x)
2. y2 = h2(x, y1)
3. y3 = h3(x, y1, y2)
4. y4 = h4(x, y1, y2, y3)

and return

y = [y1, . . . , yL]
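The chain can be sketched by augmenting the feature space label by label. A minimal illustration (the nearest-centroid base classifier and the toy data are stand-ins; as in the original CC, training uses the true earlier labels while prediction feeds forward the model's own outputs):

```python
import numpy as np

class Centroid:
    """Tiny stand-in binary base classifier: predict the nearer class centroid."""
    def fit(self, X, y):
        self.c0, self.c1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
        return self

    def predict(self, X):
        return ((((X - self.c1) ** 2).sum(axis=1))
                < (((X - self.c0) ** 2).sum(axis=1))).astype(int)

class ClassifierChain:
    """h_j is trained on x augmented with the labels y_1 .. y_{j-1}."""
    def fit(self, X, Y):
        self.models, Z = [], X
        for j in range(Y.shape[1]):
            self.models.append(Centroid().fit(Z, Y[:, j]))
            Z = np.column_stack([Z, Y[:, j]])   # append the TRUE label for training
        return self

    def predict(self, X):
        Z, cols = X, []
        for m in self.models:                   # greedy: feed predictions forward
            yj = m.predict(Z)
            cols.append(yj)
            Z = np.column_stack([Z, yj])
        return np.column_stack(cols)

X = np.array([[1, 0.1, 3, 1, 0],
              [0, 0.9, 1, 0, 1],
              [0, 0.0, 1, 1, 0],
              [1, 0.8, 2, 0, 1],
              [1, 0.0, 2, 0, 1]], dtype=float)
Y = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 0, 1]])

Yhat = ClassifierChain().fit(X, Y).predict(X)
print(Yhat)
```

Note the asymmetry: training conditions on true labels, prediction on estimated ones, which is exactly why errors can propagate down the chain.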

Page 36

Example

[Figure: probability tree over the chain y1 → y2 → y3; the greedy path takes p(y1 = 1|x) = 0.6, then p(y2 = 0|x, y1) = 0.7, then p(y3 = 1|x, y1, y2) = 0.6]

y = h(x) = [?, ?, ?]

1. y1 = h1(x) = argmax_{y1} p(y1|x) = 1
2. y2 = h2(x, y1) = . . . = 0
3. y3 = h3(x, y1, y2) = . . . = 1

giving y = h(x) = [1, 0, 1]

Improves over BR; similar build time (if L < D);
able to use any off-the-shelf classifier for hj; parallelizable.
But, errors may be propagated down the chain.


Page 41

Bayes Optimal CC

Bayes-optimal Probabilistic CC [4] (PCC)

y = argmax_{y ∈ {0,1}^L} p(y|x)
  = argmax_{y ∈ {0,1}^L} { p(y1|x) ∏_{j=2}^{L} p(yj|x, y1, . . . , y_{j−1}) }     • chain rule

[Graphical model: a chain y1 → y2 → y3 → y4 over x]

Test all possible paths (2^L vectors y = [y1, . . . , yL] in total)
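Scoring all 2^L paths is straightforward to sketch. The conditional probability tables below are illustrative numbers chosen to be consistent with the worked example on the next slide (they are not part of any real dataset):

```python
from itertools import product

# Illustrative conditionals p(y_j = 1 | x, y_1..y_{j-1}); x is held fixed.
p1 = 0.6
p2 = {(0,): 0.8, (1,): 0.3}
p3 = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.6, (1, 1): 0.5}

def joint(y):
    """p(y|x) by the chain rule."""
    pr = p1 if y[0] else 1.0 - p1
    pr *= p2[y[:1]] if y[1] else 1.0 - p2[y[:1]]
    pr *= p3[y[:2]] if y[2] else 1.0 - p3[y[:2]]
    return pr

# Bayes-optimal prediction: score all 2^L = 8 paths and take the argmax.
y_star = max(product([0, 1], repeat=3), key=joint)
print(y_star, round(joint(y_star), 3))  # -> (0, 1, 0) 0.288
```

Note that greedy CC on these numbers would commit to y1 = 1 (probability 0.6) and end up at [1, 0, 1] with mass 0.252, while the exhaustive search finds the higher-mass path [0, 1, 0].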

Page 42

Bayes Optimal CC: Example

[Figure: full probability tree over y1, y2, y3, with the probability of each branch shown]

1. p(y = [0, 0, 0]) = 0.040
2. p(y = [0, 0, 1]) = 0.040
3. p(y = [0, 1, 0]) = 0.288
4. . . .
6. p(y = [1, 0, 1]) = 0.252
7. . . .
8. p(y = [1, 1, 1]) = 0.090

return argmax_y p(y|x)

Better accuracy than greedy CC, but computationally limited to L ≲ 15

Page 43

Monte-Carlo Search for CC

1. For t = 1, . . . , T iterations, sample yt ∼ p(y|x) along the chain [20]:
   1. y1 ∼ p(y1|x)     // y1 = 1 with probability p(y1|x)
   2. y2 ∼ p(y2|x, y1)
   3. . . .
   4. yL ∼ p(yL|x, y1, . . . , y_{L−1})

2. Predict

y = argmax_{yt ∈ {y1, . . . , yT}} p(yt|x)

[Graphical model: a chain y1 → y2 → y3 → y4 over x]

Page 44

Monte-Carlo Search for CC: Example

[Figure: probability tree with two sampled paths highlighted]

Sample T times . . .

p([1, 0, 1]) = 0.6 · 0.7 · 0.6 = 0.252
p([0, 1, 0]) = 0.4 · 0.8 · 0.9 = 0.288

return argmax_{yt} p(yt|x)

Page 45


Tractable, with similar accuracy to (Bayes Optimal) PCC

Can use other search algorithms, e.g., beam search [13]
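Ancestral sampling along the chain is a short loop. A sketch using illustrative conditional probabilities consistent with the example numbers above (0.6 · 0.7 · 0.6 and 0.4 · 0.8 · 0.9):

```python
import random

random.seed(7)

# Illustrative conditionals p(y_j = 1 | x, y_1..y_{j-1}); x is held fixed.
p1 = 0.6
p2 = {(0,): 0.8, (1,): 0.3}
p3 = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.6, (1, 1): 0.5}

def joint(y):
    """p(y|x) by the chain rule."""
    pr = p1 if y[0] else 1.0 - p1
    pr *= p2[y[:1]] if y[1] else 1.0 - p2[y[:1]]
    pr *= p3[y[:2]] if y[2] else 1.0 - p3[y[:2]]
    return pr

def sample():
    """Draw y ~ p(y|x) one label at a time down the chain."""
    y1 = int(random.random() < p1)
    y2 = int(random.random() < p2[(y1,)])
    y3 = int(random.random() < p3[(y1, y2)])
    return (y1, y2, y3)

samples = {sample() for _ in range(200)}   # T = 200 draws
y_hat = max(samples, key=joint)            # best path actually visited
print(y_hat, round(joint(y_hat), 3))
```

Because high-probability paths are sampled often, a moderate T almost always visits the mode, giving PCC-like accuracy at a fraction of the 2^L cost.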

Page 46

Does Label-order Matter?

Are these models equivalent?

[Graphical models: chain order y1 → y2 → y3 → y4 vs. y4 → y2 → y3 → y1, both over x]

p(y|x) = p(y1|x) p(y2|y1, x) = p(y2|x) p(y1|y2, x)

but we are estimating p from finite and noisy data (and possibly doing a greedy search); thus, for the estimates,

p(y1|x) p(y2|y1, x) ≠ p(y2|x) p(y1|y2, x)


Page 48

Searching the Chain Space

Can search the space of possible chain orderings [20] with, e.g., a Monte Carlo walk

[Graphical model: a chain y_s1 → y_s2 → y_s3 → y_s4 over x]

For u = 1, . . . , U:
1. propose su = [s1, . . . , sL] = permute([1, . . . , L])
2. build a model on sequence su
3. evaluate; accept if better (if J(su) > J(s_{u−1}))

Use h_sU as the final model.

Example (Scene data)

u     su = [s1, . . . , sL]   J(su)
0     [4, 2, 0, 1, 3, 5]      0.623
1     [4, 2, 0, 3, 1, 5]      0.628
2     [4, 2, 0, 3, 5, 1]      0.638
3     [4, 0, 2, 3, 5, 1]      0.647
5     [4, 0, 5, 2, 3, 1]      0.653
18    [5, 1, 4, 3, 2, 0]      0.654
23    [5, 4, 0, 1, 2, 3]      0.664
128   [3, 5, 1, 0, 2, 4]      0.668
176   [5, 3, 1, 0, 4, 2]      0.669
225   [5, 3, 1, 4, 0, 2]      0.670

J(s) := EXACTMATCH(Y, h_s(X)) (higher is better)
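The accept-if-better loop can be sketched with a stand-in payoff. In a real implementation J(s) would rebuild the chain on ordering s and measure EXACT MATCH on held-out data; here, purely for illustration, J just counts positions agreeing with a pretend-best ordering:

```python
import random

random.seed(0)
L, U = 6, 500
s_best_true = [5, 3, 1, 4, 0, 2]         # pretend this ordering scores highest

def J(s):
    """Stand-in payoff; a real J would train h_s and return EXACT MATCH."""
    return sum(a == b for a, b in zip(s, s_best_true)) / L

s = list(range(L))                       # start from the identity ordering
score = J(s)
for _ in range(U):
    cand = s[:]
    i, j = random.sample(range(L), 2)
    cand[i], cand[j] = cand[j], cand[i]  # propose: swap two chain positions
    if J(cand) > score:                  # accept only if strictly better
        s, score = cand, J(cand)
print(s, score)
```

The loop only ever moves uphill, which mirrors the monotonically increasing J(su) column in the Scene-data table above.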

Page 49

Searching the Chain Space

[Graphical model: a chain y_s1 → y_s2 → y_s3 → y_s4 over x]

The space is of combinatorial proportions, . . . but a little search can go a long way.

Many other options:
add temperature, to freeze su from left to right over time
use a population of chain sequences: su^(1), . . . , su^(M)
use beam search

Page 50

Chain Structure

We can formulate any structure,

yj = hj(x, pa(yj))

where pa(yj) = the parents of node j.

[Graphical model: a sparser dependency structure over y1, . . . , y4 and x]

If pa(yj) := {y1, . . . , y_{j−1}} we recover CC.

'partial' models are more efficient and interpretable

Page 51

Structured Classifier Chains

1. Measure some heuristic:
   marginal dependence [30]
   conditional dependence [31]
2. Find a structure
3. Plug in base classifiers and run some CC inference

[Graphical model: a tree-structured dependency over y1, . . . , y4 and x]

Related to Bayesian networks [1, 2]:

p(y, x) = ∏_{j=1}^{L} p(yj|pa(yj), x)


Page 53

Label Powerset (LP)

One multi-class problem (taking many values),

y = argmax_{y ∈ {0,1}^L} p(x) ∏_{j=1}^{L} p(yj|x, y1, . . . , y_{j−1})     • PCC
  = argmax_{y ∈ Y} p(y|x)                          • LP, where Y ⊂ {0, 1}^L
  ≡ argmax_{y ∈ {0, . . . , 2^L − 1}} p(y|x)       • a multi-class problem!

[Graphical model: a single joint node (y1, y2, y3, y4) over x]

Page 54

Label Powerset (LP)

One multi-class problem (taking many values), as above.

Each value is a label vector,

typically, the label-vector occurrences of the training set.

In practice, |Y| ≤ N, and |Y| ≪ 2^L

Page 55

Label Powerset Method (LP)

1. Transform the dataset . . .

X      Y1  Y2  Y3  Y4
x(1)   0   1   1   0
x(2)   1   0   0   0
x(3)   0   1   1   0
x(4)   1   0   0   1
x(5)   0   0   0   1

. . . into a multi-class problem, taking up to 2^L possible values:

X      Y ∈ 2^L
x(1)   0110
x(2)   1000
x(3)   0110
x(4)   1001
x(5)   0001

2. . . . and train any off-the-shelf multi-class classifier.
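The LP transform amounts to mapping each distinct labelset to one multi-class value. A minimal sketch on the toy labelsets above:

```python
import numpy as np

Y = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 0, 1]])

# Each distinct labelset observed in training becomes one class value.
labelsets = sorted({tuple(row) for row in Y})
to_class = {s: c for c, s in enumerate(labelsets)}
y_mc = np.array([to_class[tuple(row)] for row in Y])

print(len(labelsets))  # 4 distinct labelsets seen, versus 2^4 = 16 possible
print(y_mc)            # the multi-class target to hand to any classifier
```

Inverting the mapping after prediction recovers a full label vector, which is why plain LP can never predict a combination it has not seen in training.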

Page 56

Issues with LP

complexity: there is no greedy label-by-label option

imbalance: few examples per class label

overfitting: how can a labelset not seen in training be predicted?

Example

In the Enron dataset, 44% of labelsets are unique (a single training example or test instance). In the del.icio.us dataset, 98% are unique.

Page 57

RAkEL

X      Y ∈ 2^L
x(1)   0110
x(2)   1000
x(3)   0110
x(4)   1001
x(5)   0001

Ensembles of RAndom k-labEL subsets (RAkEL) [27]

Do LP on M label subsets ⊂ {1, . . . , L} of size k:

X      Y123 ∈ 2^k  |  X      Y124 ∈ 2^k  |  X      Y234 ∈ 2^k
x(1)   011         |  x(1)   010         |  x(1)   110
x(2)   100         |  x(2)   100         |  x(2)   000
x(3)   011         |  x(3)   010         |  x(3)   110
x(4)   100         |  x(4)   101         |  x(4)   001
x(5)   000         |  x(5)   001         |  x(5)   001
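The subset step is just slicing Y onto M random k-label subsets, each of which becomes its own (much smaller) LP problem. A sketch on the toy labelsets above (the random seed and subsets are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

Y = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 0, 1]])
L, k, M = Y.shape[1], 3, 3

# M random k-label subsets; each slice Y[:, s] is trained as a small LP model.
subsets = [np.sort(rng.choice(L, size=k, replace=False)) for _ in range(M)]
for s in subsets:
    print(s, [''.join(map(str, row)) for row in Y[:, s]])
```

Each sub-problem has at most 2^k classes instead of 2^L, and the M overlapping models are combined by voting per label.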

Page 58

Pruned Sets

X      Y ∈ 2^L
x(1)   0110
x(2)   1000
x(3)   0110
x(4)   1001
x(5)   0001

Ensembles of Pruned label Sets (EPS) [21]

Do LP on M pruned subsets (wrt class values)

Can flip bits to reduce the ratio of classes to examples:

X      Y ∈ 2^L  |  X      Y ∈ 2^L  |  X      Y ∈ 2^L
x(1)   0110     |  x(1)   0110     |  x(1)   0110
x(3)   0110     |  x(2)   1000     |  x(2)   1000
x(4)   0001     |  x(3)   0110     |  x(3)   0110
x(5)   0001     |  x(4)   0001     |  x(4)   1000
                |  x(4)   1000     |  x(5)   0001

Page 59

Ensemble-based Voting

Most problem-transformation methods are ensemble-based, e.g., ECC, EPS, RAkEL.

Ensemble Voting (each model hm votes only on the labels in its subset)

         y1     y2     y3     y4
h1(x)    1      1      1
h2(x)    0      1      0
h3(x)    1      0      0
h4(x)    1      0      0
score    0.75   0.25   0.75   0
y        1      0      1      0

[Figure: an ensemble of LP models over label subsets y123, y124, y134, y234, combined over x]

more predictive power (ensemble effect)

LP can predict novel label combinations

Page 60

Scaling Up

LSHTC4: Large Scale Hierarchical Text Classification

A Wikipedia-scale problem:

325,056 labels
2.4M examples

Even with only 1,000 features, we have to learn over 300M parameters with BR (linear models)

. . . plus 52,831M more with CC

. . . plus ensembles (×10, ×50?)

The LP transformation generates around 1.47M classes

Page 61

Scaling Up

Our approach [16, 23]:
1. Ignore the predefined hierarchy
2. work with subsets of the labelset (RAkEL)
3. prune them (pruned sets)
4. chain these sets together (classifier chains)
5. mix of base classifiers (centroid, decision trees, SVMs)
6. ensemble with sampled features and instances (random subspace)
7. randomization: splits, pruning, reintroduction, chain links, base classifier parameters
8. train models in parallel, weight according to score on hold-out sets (avoid overfitting!)

Page 62

Pairwise Multi-label Classification

X      Y1  Y2  Y3  Y4
x(1)   0   1   1   0
x(2)   1   0   0   0
x(3)   0   1   0   0
x(4)   1   0   0   1
x(5)   0   0   0   1

Create a pairwise transformation of up to L(L−1)/2 binary classifiers (all-vs-all), each problem smaller than in BR:

X      Y1v2  |  X      Y1v3  |  X      Y1v4
x(1)   0     |  x(1)   0     |  x(2)   1
x(2)   1     |  x(2)   1     |  x(5)   0
x(3)   0     |  x(4)   1     |
x(4)   1     |               |

X      Y2v3  |  X      Y2v4  |  X      Y3v4
x(3)   1     |  x(1)   1     |  x(1)   1
             |  x(3)   1     |  x(4)   0
             |  x(4)   0     |  x(5)   0
             |  x(5)   0     |

Ensemble voting, or calibrated label ranking [7]

Can also model four classes (related to LP)
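The pairwise transform keeps, for each label pair (j, k), only the instances where exactly one of the two labels is relevant. A sketch that reproduces the Y1v2, Y1v3, . . . datasets above:

```python
import numpy as np

Y = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 0, 1]])
L = Y.shape[1]

pairwise = {}
for j in range(L):
    for k in range(j + 1, L):
        mask = Y[:, j] != Y[:, k]     # keep rows where the two labels disagree
        # target = value of Y_j on those rows (1 means label j is the relevant one)
        pairwise[(j + 1, k + 1)] = (np.nonzero(mask)[0] + 1, Y[mask, j])

rows, target = pairwise[(1, 2)]
print("Y1v2: instances", rows, "targets", target)
```

Rows where both labels agree carry no preference information for that pair and are dropped, which is why each pairwise problem is smaller than a BR problem.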

Page 63

Hierarchy of MLC (HOMER)

[Figure: a label hierarchy; a root classifier h1 over x routes to meta-labels {1, 4}, y5, and {2, 3, 6}; child classifiers h2 and h3 then resolve the individual labels y1, y4 and y2, y3, y6]

1. Cluster labels (randomly, k-means) [28], or use a pre-defined hierarchy
2. Apply a problem transformation

Page 64

Multi-label Regularization

Regularization

y = b(h(x)), or y = b(h(x), x)

where

y = h(x) is an initial classification; and
b is some regularizer

Examples:

Meta BR: a second (meta) BR (b) takes as input the output from an initial BR (h) [9]

Error Correcting Output Codes: the bit vector y has been distorted by noise; attempt to correct it [6]

Subset matching: if y does not exist in the training set, match it to the closest one that does
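Subset matching is the simplest of these regularizers to sketch: replace a never-seen prediction by the nearest training labelset under Hamming distance (the labelsets below are the toy table used earlier; the unseen prediction is illustrative):

```python
import numpy as np

# Labelsets observed in training
seen = np.array([[0, 1, 1, 0],
                 [1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [1, 0, 0, 1],
                 [0, 0, 0, 1]])

y = np.array([1, 1, 0, 0])          # predicted combination, never seen in training

hamming = (seen != y).sum(axis=1)   # distance to each training labelset
y_reg = seen[hamming.argmin()]
print(y_reg)
```

Ties go to the first nearest row here; a fuller implementation might break ties by labelset frequency in the training data.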

Page 65

Problem Transformation: Summary

Two ways of viewing a multi-label problem of L labels:
1. L binary problems (BR)
2. a multi-class problem with 2^L classes (LP)
or a combination of these.

General method:
1. Transform the data into subproblems (binary or multi-class)
2. Apply some off-the-shelf base classifier
3. (Optional) Regularize
4. (Optional) Ensemble

Page 66

Outline

1 Introduction

2 Applications

3 Background

4 Problem Transformation

5 Algorithm Adaptation

6 Label Dependence

7 Multi-label Evaluation

8 Summary & Resources

Page 67

Algorithm Adaptation

1 Take your favourite (most suitable) classifier2 Modify it for multi-label classification

Advantage: a single model, usually very scalable

Disadvantage: predictive performance depends on theproblem domain

Page 68

k Nearest Neighbours (kNN)

Assign to x the majority class of the k 'nearest neighbours':

y = argmax_y Σ_{i ∈ Nk} y(i)

where Nk contains the training pairs with x(i) closest to x.

[Figure: scatter plot of training points c1, . . . , c6 in the (x1, x2) plane, with an unlabelled query point '?']

Page 69

Multi-label kNN

Assigns the most common labels of the k nearest neighbours:

p(y_j = 1|x) = (1/k) Σ_{i ∈ N_k} y_j^(i)

ŷ_j = I[p(y_j = 1|x) > 0.5]

[Figure: 2-D scatter plot of training points tagged with label vectors such as 000, 001, 010, 011, 101, with a query point ‘?’ to be classified.]

For example, [32]. Related to ensemble voting.
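The voting scheme on this slide is short enough to sketch directly; a minimal version with numpy, on hypothetical toy data (Euclidean distance and the 0.5 threshold are as on the slide):

```python
import numpy as np

def mlknn_predict(X_train, Y_train, x, k=3, t=0.5):
    """Multi-label kNN: average the label vectors of the k nearest
    neighbours and threshold each label at t."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]       # indices of N_k
    p = Y_train[nearest].mean(axis=0)     # p(y_j = 1 | x)
    return (p > t).astype(int)

# hypothetical toy data: two clusters with different labelsets
X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
print(mlknn_predict(X, Y, np.array([0., 0.5]), k=2))  # -> [1 0]
```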

Page 70

Decision Trees

[Figure: decision tree splitting on x1 > 0.3 vs ≤ 0.3, then x3 > −2.9 vs ≤ −2.9, then x2 = A vs = B, with a label vector y at each leaf.]

construct like C4.5 (multi-label entropy [3])

multiple labels at the leaves

predictive clustering trees [12] are highly competitive in a random forest/ensemble

Page 71

Conditional Random Fields

[Figure: factor graph over x, y1, y2 with factors φ1(x, y1), φ2(x, y2), φ3(y1, y2).]

p(y|x) = (1/Z(x)) Π_c φ_c(x, y)
       = (1/Z(x)) exp{ Σ_c w_c f_c(x, y) }

where, e.g., φ3(x, y) = φ3(y1, y2) ∝ p(y2|y1). Factors can be modelled with, e.g., a problem transformation.

Page 72

Conditional Random Fields

[Figure: the factor graph over x, y1, y2, shown alongside the training data.]

X    Y1 Y2 Y3 Y4
x(1)  0  1  1  0
x(2)  1  0  0  0
x(3)  0  1  0  0
x(4)  1  0  0  1
x(5)  0  0  0  1

where, e.g., φ3(x, y) = φ3(y1, y2) ∝ p(y2|y1). Factors can be modelled with, e.g., a problem transformation.

Page 73

Conditional Random Fields

[Figure: factor graph over x, y1, y2 with factors φ1, φ2, φ3.]

where, e.g., φ3(x, y) = φ3(y1, y2) ∝ p(y2|y1). Factors can be modelled with, e.g., a problem transformation, but the computational burden is shifted to inference, e.g.,

ŷ = argmax_{y ∈ {0,1}^L} f1(x, y1) f2(x, y2) f3(y2, y1)

Gibbs sampling [10] (like an undirected PCC)

Supported combinations [8] (i.e., Y in LP)

Page 74

Neural Network

[Figure: feed-forward network with inputs x1–x5, hidden nodes z1–z4, and output nodes y1–y3.]

Just include an output node for each label.

train with, e.g., gradient descent + error back-propagation
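The "one output node per label" idea can be sketched with scikit-learn, whose MLPClassifier accepts a binary indicator matrix Y directly and trains with back-propagation (the toy data and hyperparameters here are hypothetical):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0, 0, 0],   # label columns: [OR, AND, XOR]
              [1, 0, 1],
              [1, 0, 1],
              [1, 1, 0]])

# One sigmoid output node per label; trained by gradient descent
# with error back-propagation.
net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=5000, random_state=1)
net.fit(X, Y)
print(net.predict(X).shape)  # one prediction column per label
```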

Page 75

Other Algorithm Adaptations

Max-margin methods / SVMs [29]

Association rules [25]

Boosting [24]

Generative (Bayesian) [15]

Page 76

Outline

1 Introduction

2 Applications

3 Background

4 Problem Transformation

5 Algorithm Adaptation

6 Label Dependence

7 Multi-label Evaluation

8 Summary & Resources

Page 77

Label Dependence in MLC

Common approach: Present methods to
1 measure label dependence
2 find a structure that best represents this

and then apply classifiers, compare results to BR.

Page 78

Label Dependence in MLC

Common approach: Present methods to
1 measure label dependence
2 find a structure that best represents this

and then apply classifiers, compare results to BR.

Example

[Figure: a graph linking labels y1–y4 directly, versus overlapping subsets y124 and y234 over x.]

Which labels (nodes) to link together? (CC, PGMs)

Which subsets to form from the labelset? (RAkEL)

Page 79

Label Dependence in MLC

Common approach: Present methods to
1 measure label dependence
2 find a structure that best represents this

and then apply classifiers, compare results to BR.

Problem: Accuracy is often indistinguishable from that of random ensembles, or slow! (although the result may be more compact and/or interpretable)

Page 80

Marginal label dependence

Marginal dependence

When the joint is not the product of the marginals, i.e.,

p(y2) ≠ p(y2|y1)
p(y1)p(y2) ≠ p(y1, y2)

Estimate from co-occurrence frequencies in training data

Page 81

Marginal label dependence
Example

Figure: Music dataset, pairwise mutual information between the labels amazed, happy, relaxing, quiet, sad, angry.

Page 82

Marginal label dependence
Example

Figure: Scene dataset, pairwise mutual information between the labels beach, sunset, foliage, field, mountain, urban.

Page 83

Exploiting marginal dependence

A Toy Dataset

X1 X2 Y1 Y2 Y3
 0  0  0  0  0
 1  0  1  0  1
 0  1  1  0  1
 1  1  1  1  0

Measure marginal label dependence (i.e., do labels co-occur frequently, or does one exclude the other?).

Page 84

Exploiting marginal dependence

A Toy Dataset

X1 X2 Y1 Y2 Y3
 0  0  0  0  0
 1  0  1  0  1
 0  1  1  0  1
 1  1  1  1  0

Measure marginal label dependence (i.e., do labels co-occur frequently, or does one exclude the other?).

But all labels are interdependent! For example,

p(y2 = 1|y1 = 1) ≠ p(y2 = 1)
1/3 > 1/4

Could use a threshold, or statistical significance, . . .
But how does this relate to classification, p(yj|x)?
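The 1/3 vs. 1/4 comparison above comes straight from co-occurrence counts; a minimal numpy sketch on the toy dataset's label columns:

```python
import numpy as np

# Label columns (Y1, Y2, Y3) of the toy dataset from the slide
Y = np.array([[0, 0, 0],
              [1, 0, 1],
              [1, 0, 1],
              [1, 1, 0]])

p_y2 = Y[:, 1].mean()                      # p(y2 = 1) = 1/4
p_y2_given_y1 = Y[Y[:, 0] == 1, 1].mean()  # p(y2 = 1 | y1 = 1) = 1/3
print(p_y2, p_y2_given_y1)                 # unequal: marginal dependence
```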

Page 85

Conditional label dependence

Conditional dependence

. . . conditioned on input observation x.

p(y2|y1, x) ≠ p(y2|x)

[Figure: graphical model X → Y1, Y2 with an edge between Y1 and Y2.]

Have to build and measure models

Indication of conditional dependence if

the performance of LP/CC exceeds that of BR

errors among the binary models are correlated

Page 86

Conditional label dependence

Conditional independence

. . . conditioned on input observation x.

p(y2) ≠ p(y2|y1), but p(y2|x) = p(y2|y1, x)

[Figure: X → Y1, Y2 with a Y1–Y2 edge, versus X → Y1, Y2 without it.]

Have to build and measure models

Indication of conditional dependence if

the performance of LP/CC exceeds that of BR

errors among the binary models are correlated

Page 87

Exploiting conditional dependence

A Toy Dataset

X1 X2 Y1 Y2 Y3
 0  0  0  0  0
 1  0  1  0  1
 0  1  1  0  1
 1  1  1  1  0

Measure conditional label dependence (build models, measure the difference in error rate).

Page 88

Exploiting conditional dependence

A Toy Dataset

X1 X2 Y1 Y2 Y3
 0  0  0  0  0
 1  0  1  0  1
 0  1  1  0  1
 1  1  1  1  0

Measure conditional label dependence (build models, measure the difference in error rate).

But building models is expensive!

Which structure to construct?

Page 89

Exploiting conditional dependence

A Toy Dataset (Y1 = OR, Y2 = AND, Y3 = XOR)

X1 X2 Y1 Y2 Y3
 0  0  0  0  0
 1  0  1  0  1
 0  1  1  0  1
 1  1  1  1  0

Oracle: complete conditional independence!

Complete conditional independence,

p(Yj|Yk, X1, X2) = p(Yj|X1, X2),  ∀ j, k : 0 < j < k ≤ L

Then the binary relevance (BR) classifier should suffice?

Page 90

The LOGICAL Problem

[Figure: BR (left), CC (middle), LP (right): three structures over the labels OR, AND, XOR.]

Table: The LOGICAL problem, base classifier logistic regression.

Metric          BR    CC    LP
HAMMING SCORE   0.83  1.00  1.00
EXACT MATCH     0.50  1.00  1.00

Why didn’t BR work?
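A minimal sketch of the answer, using scikit-learn: a single linear model (BR) cannot fit the XOR label, but once the earlier labels are fed forward as extra features (the idea behind CC; here the true OR/AND values are used for simplicity, as in chain training), the XOR label becomes linearly separable:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
Y = np.array([[0, 0, 0], [1, 0, 1], [1, 0, 1], [1, 1, 0]])  # OR, AND, XOR

# BR: an independent linear model for the XOR label fails (not separable)
br_xor = LogisticRegression(C=1e4, max_iter=1000).fit(X, Y[:, 2])
print("BR on XOR:", br_xor.score(X, Y[:, 2]))   # < 1.0

# CC-style: the XOR model also sees the OR and AND labels as inputs
X_cc = np.hstack([X, Y[:, :2]])
cc_xor = LogisticRegression(C=1e4, max_iter=1000).fit(X_cc, Y[:, 2])
print("CC on XOR:", cc_xor.score(X_cc, Y[:, 2]))  # 1.0
```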

Page 91

XOR Solution

[Figure: three possible chain orderings over the labels OR, AND, XOR.]

Only one of these works (with greedy inference)!

The ground truth (oracle) is

p(yXOR|yAND, x) = p(yXOR|x)

but, recall: we have an estimation of this,

f(yXOR|yAND, x) ≠ f(yXOR|x)

(finite data, finite training time, limited class of model f, i.e., linear): dependence depends on the model!

Page 92

Solutions

1 Use a suitable structure
2 Use a suitable base classifier
3 Ensure that labels are conditionally independent.

Page 93

Solutions

1 Use a suitable structure (How to find it?)
2 Use a suitable base classifier (Which one is suitable?)
3 Ensure that labels are conditionally independent. (How to do that?)

Main limiting factor: computational complexity.

Page 94

The LOGICAL Problem

Figure: Binary Relevance (BR): decision boundaries in (x1, x2) for the models Y_OR, Y_AND, Y_XOR. A linear decision boundary (solid line, estimated with logistic regression) is not viable for the Y_XOR label.

Page 95

Solution via Structure

Figure: Solution via structure: with earlier labels fed in as inputs (Y_XOR modelled given Y_OR, x1, x2), a linear model is now applicable to Y_XOR.

Page 96

Solution via Structure

[Figure: chain structure x → OR → AND → XOR.]

Figure: Solution via structure: two labels have an augmented decision space.

Can also use undirected connections

directionality not an issue,

but implies greater computational burden (≈ LP)

. . . possibly shifted to inference (≈ PCC, CDN)

Page 97

Solution via Multi-class Decomposition

Figure: Label Powerset (LP): the observed label combinations Y_[OR,AND,XOR] ∈ {[0,0,0], [1,0,1], [1,1,0]} are solvable with a one-vs-one multi-class decomposition for any (e.g., linear) base classifier.

Page 98

Solution via Multi-class Decomposition

Figure: Label Powerset (LP): solvable with a one-vs-one multi-class decomposition for any (e.g., linear) base classifier. Also possible with the RAkEL subsets Y_{OR,XOR} and Y_{AND}.
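The LP decomposition is easy to sketch with scikit-learn: each distinct label vector becomes one meta-class, solved here with a one-vs-one linear decomposition as on the slide (data as in the LOGICAL problem):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier

X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
Y = np.array([[0, 0, 0], [1, 0, 1], [1, 0, 1], [1, 1, 0]])  # OR, AND, XOR

# map each label vector to a meta-class (Label Powerset)
combos, classes = np.unique(Y, axis=0, return_inverse=True)
lp = OneVsOneClassifier(LogisticRegression(C=1e4, max_iter=1000)).fit(X, classes)

Y_pred = combos[lp.predict(X)]     # map meta-classes back to label vectors
print((Y_pred == Y).all())         # exact match on every example
```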

Page 99

Solution via Conditional Independence

Figure: Solution via a non-linear (e.g., random RBF) transformation of the input to a new space z (creating independence): each of Y_OR, Y_AND, Y_XOR becomes linearly separable in (z1, z2).
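A minimal sketch of this idea: an RBF feature map (here, exact RBF kernel features rather than a random map, for determinism) makes every label, including XOR, fit by an independent linear model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import rbf_kernel

X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
Y = np.array([[0, 0, 0], [1, 0, 1], [1, 0, 1], [1, 1, 0]])  # OR, AND, XOR

Z = rbf_kernel(X, X, gamma=1.0)   # non-linear features, one per centre
scores = [LogisticRegression(C=1e4, max_iter=1000).fit(Z, Y[:, j])
          .score(Z, Y[:, j]) for j in range(3)]
print(scores)   # each label fit perfectly by an independent linear model
```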

Page 100

Solution via Suitable Base-classifier

[Figure: a decision tree splitting on x1 and x2, with label vectors [0, 0, 0], [0, 1, 1], [1, 1, 0] at the leaves.]

Figure: Solution via a non-linear classifier (e.g., Decision Tree). Leaves hold examples, where y = [yOR, yAND, yXOR].
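This solution can be sketched directly: scikit-learn's DecisionTreeClassifier accepts a multi-output Y, so one non-linear tree handles all three labels at once (data as in the LOGICAL problem):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
Y = np.array([[0, 0, 0], [1, 0, 1], [1, 0, 1], [1, 1, 0]])  # OR, AND, XOR

tree = DecisionTreeClassifier(random_state=0).fit(X, Y)
print((tree.predict(X) == Y).all())   # the tree fits the XOR label too
```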

Page 101

On Real World Problems . . .

Figure: Music dataset, kernel PCA (training examples projected to two dimensions, coloured by labels Y0–Y5).

Page 102

Latent Variables

p(y|x) = Σ_z p(x, y, z) / p(x) = (1/Z) Σ_z p(x, y, z)

[Figure: labels y1–y3 connected directly to inputs x1–x5, versus connected only through latent nodes z1–z4.]

Can view label dependencies as having marginalized out latent variables

Page 103

Inner Layer Methods

1 Use an inner layer: z = f(x), z ∈ R^H
2 Apply a classifier: ŷ = h(z)

[Figure: network with inputs x1–x5, inner layer z1–z4, outputs y1–y2.]

PCA, CCA [17]

Kernel PCA [29]

Mixture models [15]

Clustering [28]

Compressive Sensing [11]

Deep Learning [18]

Auto Encoders

Page 104

Another look: Problem Transformation

[Figure: labels Y1–Y3 predicted directly from X1–X3, versus a network in which the pairwise outputs Y12 and Y23 form an inner layer over X1–X5.]

Figure: Methods CC and RAkEL (among others) can be viewed as using an inner layer [18].

Page 105

What about marginal dependence?

Can be seen as a kind of constraint

used for regularization (recall: e.g., ECOC, subset matching)

Page 106

Label Dependence: Summary

Marginal dependence: for regularization

Conditional dependence
. . . depends on the model
. . . may be introduced

Should consider together:
base classifier
structure
inner layer

An open problem

Much existing research is relevant

Page 107

Outline

1 Introduction

2 Applications

3 Background

4 Problem Transformation

5 Algorithm Adaptation

6 Label Dependence

7 Multi-label Evaluation

8 Summary & Resources

Page 108

Multi-label Evaluation

In single-label classification, simply compare the true label y with the predicted label ŷ [or p(y|x)]. What about in multi-label classification?

Example

If the true label vector is y = [1, 0, 0, 0], then ŷ = ?

urban mountain beach foliage
  1      0       0      0
  1      1       0      0
  0      0       0      0
  0      1       1      1

compare bit-wise? too lenient?

compare vector-wise? too strict?

Page 109

Hamming Loss

Example

      y(i)       ŷ(i)
x(1)  [1 0 1 0]  [1 0 0 1]
x(2)  [0 1 0 1]  [0 1 0 1]
x(3)  [1 0 0 1]  [1 0 0 1]
x(4)  [0 1 1 0]  [0 1 0 0]
x(5)  [1 0 0 0]  [1 0 0 1]

HAMMING LOSS = (1/(N·L)) Σ_{i=1}^N Σ_{j=1}^L I[y_j^(i) ≠ ŷ_j^(i)] = 0.20

Page 110

0/1 Loss

Example

      y(i)       ŷ(i)
x(1)  [1 0 1 0]  [1 0 0 1]
x(2)  [0 1 0 1]  [0 1 0 1]
x(3)  [1 0 0 1]  [1 0 0 1]
x(4)  [0 1 1 0]  [0 1 0 0]
x(5)  [1 0 0 0]  [1 0 0 1]

0/1 LOSS = (1/N) Σ_{i=1}^N I(y^(i) ≠ ŷ^(i)) = 0.60
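Both metrics reduce to one line of numpy each; a minimal sketch on the values from the tables above:

```python
import numpy as np

# True and predicted label matrices from the slides (N = 5, L = 4)
Y_true = np.array([[1,0,1,0],[0,1,0,1],[1,0,0,1],[0,1,1,0],[1,0,0,0]])
Y_pred = np.array([[1,0,0,1],[0,1,0,1],[1,0,0,1],[0,1,0,0],[1,0,0,1]])

hamming = (Y_true != Y_pred).mean()               # bit-wise average: 0.20
zero_one = (Y_true != Y_pred).any(axis=1).mean()  # vector-wise: 0.60
print(hamming, zero_one)
```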

Page 111

Other Metrics

JACCARD INDEX – often called multi-label ACCURACY
RANK LOSS – average fraction of pairs not correctly ordered
ONE ERROR – if the top-ranked label is not in the set of true labels
COVERAGE – average “depth” to cover all true labels
LOG LOSS – i.e., cross entropy
PRECISION – predicted positive labels that are relevant
RECALL – relevant labels which were predicted
PRECISION vs. RECALL curves
F-MEASURE
  micro-averaged (‘global’ view)
  macro-averaged by label (ordinary averaging of a binary measure; changes in infrequent labels have a big impact)
  macro-averaged by example (one example at a time, average across examples)

For general evaluation, use multiple and contrasting evaluation measures!

Page 112

HAMMING LOSS vs. 0/1 LOSS

Hamming loss
evaluation by label (bit-wise), suitable for evaluating

ŷ_j = argmax_{y_j ∈ {0,1}} p(y_j|x)

i.e., BR
favours sparse labelling
does not benefit directly from modelling label dependence

0/1 loss
evaluation by example (vector-wise), suitable for evaluating

ŷ = argmax_{y ∈ {0,1}^L} p(y|x)

i.e., PCC, LP
does not favour sparse labelling
benefits from models of label dependence

Page 113

HAMMING LOSS vs. 0/1 LOSS

Example: 0/1 LOSS vs. HAMMING LOSS

      y(i)       ŷ(i)
x(1)  [1 0 1 0]  [1 0 0 1]
x(2)  [1 0 0 1]  [1 0 0 1]
x(3)  [0 1 1 0]  [0 1 0 0]
x(4)  [1 0 0 0]  [1 0 1 1]
x(5)  [0 1 0 1]  [0 1 0 1]

HAM. LOSS 0.3
0/1 LOSS 0.6

Usually cannot minimize both at the same time . . .

. . . unless: labels are independent of each other! [5]

Page 114

HAMMING LOSS vs. 0/1 LOSS

Example: 0/1 LOSS vs. HAMMING LOSS

      y(i)       ŷ(i)
x(1)  [1 0 1 0]  [1 0 1 1]
x(2)  [1 0 0 1]  [1 1 0 1]
x(3)  [0 1 1 0]  [0 1 1 0]
x(4)  [1 0 0 0]  [1 0 1 0]
x(5)  [0 1 0 1]  [0 1 0 1]

Optimize HAMMING LOSS . . .

HAM. LOSS 0.2
0/1 LOSS 0.8

. . . 0/1 LOSS goes up

Usually cannot minimize both at the same time . . .

. . . unless: labels are independent of each other! [5]

Page 115

HAMMING LOSS vs. 0/1 LOSS

Example: 0/1 LOSS vs. HAMMING LOSS

      y(i)       ŷ(i)
x(1)  [1 0 1 0]  [0 1 0 1]
x(2)  [1 0 0 1]  [1 0 0 1]
x(3)  [0 1 1 0]  [0 0 1 0]
x(4)  [1 0 0 0]  [0 1 1 1]
x(5)  [0 1 0 1]  [0 1 0 1]

Optimize 0/1 LOSS . . .

HAM. LOSS 0.4
0/1 LOSS 0.4

. . . HAMMING LOSS goes up

Usually cannot minimize both at the same time . . .

. . . unless: labels are independent of each other! [5]

Page 116

HAMMING LOSS vs. 0/1 LOSS

Example: 0/1 LOSS vs. HAMMING LOSS

      y(i)       ŷ(i)
x(1)  [1 0 1 0]  [0 1 0 1]
x(2)  [1 0 0 1]  [1 0 0 1]
x(3)  [0 1 1 0]  [0 0 1 0]
x(4)  [1 0 0 0]  [0 1 1 1]
x(5)  [0 1 0 1]  [0 1 0 1]

Usually cannot minimize both at the same time . . .

. . . unless: labels are independent of each other! [5]

Page 117

Threshold Selection

Methods often return a posterior probability, or ensemble votes, p(x). Use a threshold of 0.5?

ŷ_j = 1 if p_j(x) ≥ 0.5, 0 otherwise

Example with threshold of 0.5

x(i)  y(i)       p(x(i))             ŷ(i) := I[p(x(i)) ≥ 0.5]
x(1)  [1 0 1 0]  [0.9 0.0 0.4 0.6]   [1 0 0 1]
x(2)  [0 1 0 1]  [0.1 0.8 0.0 0.8]   [0 1 0 1]
x(3)  [1 0 0 1]  [0.8 0.0 0.1 0.7]   [1 0 0 1]
x(4)  [0 1 1 0]  [0.1 0.7 0.4 0.2]   [0 1 0 0]
x(5)  [1 0 0 0]  [1.0 0.0 0.0 1.0]   [1 0 0 1]

Page 118

Threshold Selection

Methods often return a posterior probability, or ensemble votes, p(x). Use a threshold of 0.5?

ŷ_j = 1 if p_j(x) ≥ 0.5, 0 otherwise

Example with threshold of 0.5

x(i)  y(i)       p(x(i))             ŷ(i) := I[p(x(i)) ≥ 0.5]
x(1)  [1 0 1 0]  [0.9 0.0 0.4 0.6]   [1 0 0 1]
x(2)  [0 1 0 1]  [0.1 0.8 0.0 0.8]   [0 1 0 1]
x(3)  [1 0 0 1]  [0.8 0.0 0.1 0.7]   [1 0 0 1]
x(4)  [0 1 1 0]  [0.1 0.7 0.4 0.2]   [0 1 0 0]
x(5)  [1 0 0 0]  [1.0 0.0 0.0 1.0]   [1 0 0 1]

. . . but would eliminate two errors with a threshold of 0.4!
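This claim can be checked by counting bit errors at each threshold on the table's posteriors; a minimal numpy sketch:

```python
import numpy as np

# Posterior estimates p(x(i)) and true labels y(i) from the slide
P = np.array([[0.9, 0.0, 0.4, 0.6], [0.1, 0.8, 0.0, 0.8],
              [0.8, 0.0, 0.1, 0.7], [0.1, 0.7, 0.4, 0.2],
              [1.0, 0.0, 0.0, 1.0]])
Y = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 0, 1],
              [0, 1, 1, 0], [1, 0, 0, 0]])

errors = lambda t: int(((P >= t).astype(int) != Y).sum())
print(errors(0.5), errors(0.4))   # 4 errors at t=0.5, 2 at t=0.4
```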

Page 119

Threshold Selection

Threshold calibration strategies:

Ad-hoc, e.g., t = 0.5

Internal validation, e.g., t ∈ {0.1, 0.2, . . . , 0.9}

PCut: such that the training data and test data have a similar average number of labels/example

Can be done efficiently.
Can also calibrate tj for each label individually.
Assumes the training set is similar to the test set (i.e., not ideal for data streams)

Can be viewed as another form of regularization

ŷ = b(h(x))

Page 120

Threshold Selection

Threshold calibration strategies:

Ad-hoc, e.g., t = 0.5

Internal validation, e.g., t ∈ {0.1, 0.2, . . . , 0.9}

PCut: such that the training data and test data have a similar average number of labels/example

Can be done efficiently.
Can also calibrate tj for each label individually.
Assumes the training set is similar to the test set (i.e., not ideal for data streams)

Can be viewed as another form of regularization

ŷ = b(h(x))

Page 121

Threshold Selection

Threshold calibration strategies:

Ad-hoc, e.g., t = 0.5
Internal validation, e.g., t ∈ {0.1, 0.2, . . . , 0.9}
PCut: such that the training data and test data have a similar average number of labels/example:

t = argmin_t | (1/N) Σ_{i,j} I(p_j^(i) > t)  [test data]  −  (1/N) Σ_{i,j} y_j^(i)  [train data] |

Can be done efficiently.
Can also calibrate tj for each label individually.
Assumes the training set is similar to the test set (i.e., not ideal for data streams)

Can be viewed as another form of regularization

ŷ = b(h(x))
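The PCut criterion above can be sketched as a grid search; a minimal version, with hypothetical toy posteriors and training labels (the function name and grid are assumptions):

```python
import numpy as np

def pcut_threshold(P_test, Y_train, grid=np.linspace(0.05, 0.95, 19)):
    """PCut sketch: pick t so the predicted label cardinality on the test
    posteriors matches the average label cardinality of the training set."""
    target = Y_train.sum(axis=1).mean()    # avg labels/example (train)
    costs = [abs((P_test > t).sum(axis=1).mean() - target) for t in grid]
    return grid[int(np.argmin(costs))]

# hypothetical posteriors and training labels
P_test = np.array([[0.9, 0.6, 0.2], [0.8, 0.3, 0.1]])
Y_train = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [1, 0, 0]])
print(pcut_threshold(P_test, Y_train))
```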

Page 122

Various Real-World Concerns

In data streams, label dependence (and therefore, appropriate structures/transformations/base classifiers)
may not be known in advance
must be learned incrementally
and adapted to change over time (concept drift)

New labels must be incorporated, old labels phased out

Labels may be missing from training data, but we don't know when they're missing (non-relevance ≠ missing)

Labelling is more intensive per example (affects both manual labelling and active learning)

Page 123

Outline

1 Introduction

2 Applications

3 Background

4 Problem Transformation

5 Algorithm Adaptation

6 Label Dependence

7 Multi-label Evaluation

8 Summary & Resources

Page 124

Summary

Multi-label classification

Can be approached via problem transformation oralgorithm adaptation

Label dependence and scalability are the main themes

An active area of research and a gateway to many relatedareas

Page 126

Software & Datasets

Mulan (Java)

Meka (Java)

Scikit-Learn (Python) offers some multi-label support

Clus (Java)

LAMDA (Matlab)

Datasets

http://mulan.sourceforge.net/datasets.html

http://meka.sourceforge.net/#datasets

Page 127

MEKA

A WEKA-based framework for multi-label classification and evaluation

support for data-stream, semi-supervised classification

http://meka.sourceforge.net

Page 128

A MEKA Classifier

package weka.classifiers.multilabel;

import weka.core.*;

public class DumbClassifier extends MultilabelClassifier {

    /**
     * BuildClassifier
     */
    public void buildClassifier(Instances D) throws Exception {
        // the first L attributes are the labels
        int L = D.classIndex();
    }

    /**
     * DistributionForInstance - return the distribution p(y[j]|x)
     */
    public double[] distributionForInstance(Instance x) throws Exception {
        int L = x.classIndex();
        // predict 0 for each label
        return new double[L];
    }
}

Page 129

References

Alessandro Antonucci, Giorgio Corani, Denis Mauá, and Sandra Gabaglio.
An ensemble of Bayesian networks for multilabel classification. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, IJCAI '13, pages 1220–1225. AAAI Press, 2013.

Hanen Borchani.
Multi-dimensional classification using Bayesian networks for stationary and evolving streaming data. PhD thesis, Departamento de Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid, 2013.

Amanda Clare and Ross D. King.
Knowledge discovery in multi-label phenotype data. Lecture Notes in Computer Science, 2168, 2001.

Krzysztof Dembczyński, Weiwei Cheng, and Eyke Hüllermeier.
Bayes optimal multilabel classification via probabilistic classifier chains. In ICML '10: 27th International Conference on Machine Learning, pages 279–286, Haifa, Israel, June 2010. Omnipress.

Krzysztof Dembczyński, Willem Waegeman, Weiwei Cheng, and Eyke Hüllermeier.
On label dependence and loss minimization in multi-label classification. Machine Learning, 88(1-2):5–45, July 2012.

Chun-Sung Ferng and Hsuan-Tien Lin.
Multi-label classification with error-correcting codes. In Proceedings of the 3rd Asian Conference on Machine Learning, ACML 2011, Taoyuan, Taiwan, November 13-15, 2011, pages 281–295, 2011.

Johannes Fürnkranz, Eyke Hüllermeier, Eneldo Loza Mencía, and Klaus Brinker.
Multilabel classification via calibrated label ranking. Machine Learning, 73(2):133–153, November 2008.

Nadia Ghamrawi and Andrew McCallum.
Collective multi-label classification.

Page 130

In CIKM ’05: 14th ACM international Conference on Information and Knowledge Management, pages195–200, New York, NY, USA, 2005. ACM Press.

Shantanu Godbole and Sunita Sarawagi.

Discriminative methods for multi-labeled classification.In PAKDD ’04: Eighth Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 22–30.Springer, 2004.

Yuhong Guo and Suicheng Gu.

Multi-label classification using conditional dependency networks.In IJCAI ’11: 24th International Conference on Artificial Intelligence, pages 1300–1305. IJCAI/AAAI, 2011.

Daniel Hsu, Sham M. Kakade, John Langford, and Tong Zhang.

Multi-label prediction via compressed sensing. In NIPS '09: Neural Information Processing Systems 2009, 2009.

Dragi Kocev, Celine Vens, Jan Struyf, and Sašo Džeroski.

Tree ensembles for predicting structured outputs. Pattern Recognition, 46(3):817–833, March 2013.

Abhishek Kumar, Shankar Vembu, Aditya Krishna Menon, and Charles Elkan.

Beam search algorithms for multilabel learning. Machine Learning, 92(1):65–89, 2013.

Gjorgji Madjarov, Dragi Kocev, Dejan Gjorgjevikj, and Sašo Džeroski.

An extensive experimental comparison of methods for multi-label learning. Pattern Recognition, 45(9):3084–3104, September 2012.

Andrew Kachites McCallum.

Multi-label text classification with a mixture model trained by EM. In AAAI '99 Workshop on Text Learning, 1999.

Antti Puurula, Jesse Read, and Albert Bifet.

Kaggle LSHTC4 winning solution. Technical report, 2014.

Piyush Rai and Hal Daumé III.

Multi-label prediction via sparse infinite CCA. In NIPS 2009: Advances in Neural Information Processing Systems 22, pages 1518–1526, 2009.

Jesse Read and Jaakko Hollmén.

A deep interpretation of classifier chains. In Advances in Intelligent Data Analysis XIII - 13th International Symposium, IDA 2014, pages 251–262, October 2014.

Jesse Read and Jaakko Hollmén.

Multi-label classification using labels as hidden nodes. arXiv:1503.09022 [stat.ML], 2015.

Jesse Read, Luca Martino, and David Luengo.

Efficient Monte Carlo methods for multi-dimensional learning with classifier chains. Pattern Recognition, 47(3):1535–1546, 2014.

Jesse Read, Bernhard Pfahringer, and Geoff Holmes.

Multi-label classification using ensembles of pruned sets. In ICDM 2008: Eighth IEEE International Conference on Data Mining, pages 995–1000. IEEE, 2008.

Jesse Read, Bernhard Pfahringer, Geoffrey Holmes, and Eibe Frank.

Classifier chains for multi-label classification. Machine Learning, 85(3):333–359, 2011.

Jesse Read, Antti Puurula, and Albert Bifet.

Multi-label classification with meta labels. In ICDM '14: IEEE International Conference on Data Mining, pages 941–946. IEEE, December 2014.

Robert E. Schapire and Yoram Singer.

BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, 2000.

F. A. Thabtah, P. Cowling, and Yonghong Peng.

MMAC: A new multi-class, multi-label associative classification approach. In ICDM '04: Fourth IEEE International Conference on Data Mining, pages 217–224, 2004.

Grigorios Tsoumakas and Ioannis Katakis.

Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3):1–13, 2007.

Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas.

Random k-labelsets for multi-label classification. IEEE Transactions on Knowledge and Data Engineering, 23(7):1079–1089, 2011.

Grigorios Tsoumakas, Ioannis Katakis, and Ioannis P. Vlahavas.

Effective and efficient multilabel classification in domains with large number of labels. In ECML/PKDD Workshop on Mining Multidimensional Data, 2008.

Jason Weston, Olivier Chapelle, André Elisseeff, Bernhard Schölkopf, and Vladimir Vapnik.

Kernel dependency estimation. In NIPS, pages 897–904, 2003.

Julio H. Zaragoza, Luis Enrique Sucar, Eduardo F. Morales, Concha Bielza, and Pedro Larrañaga.

Bayesian chain classifiers for multidimensional classification. In 22nd International Joint Conference on Artificial Intelligence (IJCAI '11), pages 2192–2197, 2011.

Min-Ling Zhang and Kun Zhang.

Multi-label learning by exploiting label dependency. In KDD '10: 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 999–1008. ACM, 2010.

Min-Ling Zhang and Zhi-Hua Zhou.

ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038–2048, 2007.

Min-Ling Zhang and Zhi-Hua Zhou.

A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 2013 (preprint).

Multi-label Classification

Jesse Read
https://users.ics.aalto.fi/jesse/

Department of Information and Computer Science
Helsinki, Finland

Summer School on Data Sciences for Big Data
September 5, 2015