Multi-label learning from batch and streaming data
Jesse Read
Telecom ParisTech
Ecole Polytechnique
Summer School on Mining Big and Complex Data
5 September 2016 — Ohrid, Macedonia
Introduction
x =
Binary classification
y ∈ {sunset, non sunset}
Introduction
x =
Multi-class classification
y ∈ {sunset, people, foliage, beach, urban}
Introduction
x =
Multi-label classification
y ⊆ {sunset, people, foliage, beach, urban}, i.e., y ∈ {0, 1}^5, e.g., y = [1, 0, 1, 0, 0]
i.e., multiple labels per instance instead of a single label.
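In vector form, the label set becomes a binary indicator over a fixed label list; a minimal sketch (the helper names `encode`/`decode` are ours):

```python
# Convert between the set representation and the binary-vector representation.
LABELS = ["sunset", "people", "foliage", "beach", "urban"]

def encode(label_set):
    """Label set -> indicator vector over LABELS."""
    return [1 if lab in label_set else 0 for lab in LABELS]

def decode(y):
    """Indicator vector -> label set."""
    return {lab for lab, v in zip(LABELS, y) if v == 1}

print(encode({"sunset", "foliage"}))  # [1, 0, 1, 0, 0]
```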
Introduction
|       | K = 2       | K > 2         |
|-------|-------------|---------------|
| L = 1 | binary      | multi-class   |
| L > 1 | multi-label | multi-output† |
† also known as multi-target, multi-dimensional.
Figure: For L target variables (labels), each taking one of K values.
multi-output can be cast to multi-label, just as multi-class can be cast to binary
set of labels (L) is predefined (contrast to tagging/keyword assignment)
Introduction
Table: Academic articles containing the phrase “multi-label classification” (Google Scholar, 2016)
| year      | in text | in title |
|-----------|---------|----------|
| 1996-2000 | 23      | 1        |
| 2001-2005 | 188     | 18       |
| 2006-2010 | 1470    | 164      |
| 2011-2015 | 5280    | 629      |
Single-label vs. Multi-label
Table: Single-label Y ∈ {0, 1}

| X1 | X2  | X3 | X4 | X5 | Y |
|----|-----|----|----|----|---|
| 1  | 0.1 | 3  | 1  | 0  | 0 |
| 0  | 0.9 | 1  | 0  | 1  | 1 |
| 0  | 0.0 | 1  | 1  | 0  | 0 |
| 1  | 0.8 | 2  | 0  | 1  | 1 |
| 1  | 0.0 | 2  | 0  | 1  | 0 |
| 0  | 0.0 | 3  | 1  | 1  | ? |
Table: Multi-label Y ⊆ {λ1, . . . , λL}

| X1 | X2  | X3 | X4 | X5 | Y        |
|----|-----|----|----|----|----------|
| 1  | 0.1 | 3  | 1  | 0  | {λ2, λ3} |
| 0  | 0.9 | 1  | 0  | 1  | {λ1}     |
| 0  | 0.0 | 1  | 1  | 0  | {λ2}     |
| 1  | 0.8 | 2  | 0  | 1  | {λ1, λ4} |
| 1  | 0.0 | 2  | 0  | 1  | {λ4}     |
| 0  | 0.0 | 3  | 1  | 1  | ?        |
Table: Multi-label [Y1, . . . , YL] ∈ {0, 1}^L

| X1 | X2  | X3 | X4 | X5 | Y1 | Y2 | Y3 | Y4 |
|----|-----|----|----|----|----|----|----|----|
| 1  | 0.1 | 3  | 1  | 0  | 0  | 1  | 1  | 0  |
| 0  | 0.9 | 1  | 0  | 1  | 1  | 0  | 0  | 0  |
| 0  | 0.0 | 1  | 1  | 0  | 0  | 1  | 0  | 0  |
| 1  | 0.8 | 2  | 0  | 1  | 1  | 0  | 0  | 1  |
| 1  | 0.0 | 2  | 0  | 1  | 0  | 0  | 0  | 1  |
| 0  | 0.0 | 3  | 1  | 1  | ?  | ?  | ?  | ?  |
Outline
1. Introduction
2. Applications
3. Methods
4. Label Dependence
5. Multi-label Classification in Data Streams
Text Categorization and Tag Recommendation

For example, the IMDb dataset: textual movie plot summaries associated with genres (labels). Also: Bookmarks, Bibtex, del.icio.us datasets.
Feature columns correspond to words (abandoned, accident, . . . , violent, wedding); label columns correspond to genres (horror, romance, . . . , comedy, action).

| i      | X1 | X2 | . . . | X1000 | X1001 | Y1 | Y2 | . . . | Y27 | Y28 |
|--------|----|----|-------|-------|-------|----|----|-------|-----|-----|
| 1      | 1  | 0  | . . . | 0     | 1     | 0  | 1  | . . . | 0   | 0   |
| 2      | 0  | 1  | . . . | 1     | 0     | 1  | 0  | . . . | 0   | 0   |
| 3      | 0  | 0  | . . . | 0     | 1     | 0  | 1  | . . . | 0   | 0   |
| 4      | 1  | 1  | . . . | 0     | 1     | 1  | 0  | . . . | 0   | 1   |
| 5      | 1  | 1  | . . . | 0     | 1     | 0  | 1  | . . . | 0   | 1   |
| . . .  |    |    |       |       |       |    |    |       |     |     |
| 120919 | 1  | 1  | . . . | 0     | 0     | 0  | 0  | . . . | 0   | 1   |
Labelling Images
Images are labelled to indicate
multiple concepts
multiple objects
multiple people
e.g., associating scenes with concepts ⊆ {beach, sunset, foliage, field, mountain, urban}
Labelling Audio
For example, labelling music with emotions, concepts, etc.
amazed-surprised
happy-pleased
relaxing-calm
quiet-still
sad-lonely
angry-aggressive
Related Tasks

multi-output¹ classification: outputs are nominal

| X1 | X2 | X3 | X4 | X5 | rank | gender | group |
|----|----|----|----|----|------|--------|-------|
| x1 | x2 | x3 | x4 | x5 | 1    | M      | 2     |
| x1 | x2 | x3 | x4 | x5 | 4    | F      | 2     |
| x1 | x2 | x3 | x4 | x5 | 2    | M      | 1     |
multi-output regression: outputs are real-valued

| X1 | X2 | X3 | X4 | X5 | price | age | percent |
|----|----|----|----|----|-------|-----|---------|
| x1 | x2 | x3 | x4 | x5 | 37.00 | 25  | 0.88    |
| x1 | x2 | x3 | x4 | x5 | 22.88 | 22  | 0.22    |
| x1 | x2 | x3 | x4 | x5 | 88.23 | 11  | 0.77    |
label ranking, i.e., preference learning
λ3 ≻ λ1 ≻ λ4 ≻ . . . ≻ λ2
¹ a.k.a. multi-target, multi-dimensional
Related Areas

multi-task learning: multiple tasks, shared representation

sequential learning: predict across time indices instead of across label indices; each label may have a different input

(Figure: outputs y1, . . . , y4, each with its own input x1, . . . , x4.)

structured output prediction: assume a particular structure among outputs, e.g., pixels, hierarchy
Streaming Multi-label Data
Many advanced applications must deal with data streams:
Data arrives continuously, potentially infinitely
Prediction must be made immediately
Expect concept drift
For example,
Demand prediction
Intrusion detection
Pollution detection
Demand Prediction

Outputs (labels) represent the demand at multiple points.

Figure: Stops in the greater Helsinki region. The Kutsuplus taxi service could be called to any of these.
We are interested in predicting, for each label [y1, . . . , yL],
p(yj = 1|x) • probability of demand at j-th node
Localization and Tracking

Outputs represent points in space which may contain an object (yj = 1) or not (yj = 0). Observations are given as x.

Figure: Modelled on a real-world scenario; a room with a single light source and a number of light sensors.
We are interested in predicting, for each label [y1, . . . , yL],
yj = 1 • if j-th tile occupied
Route/Destination Forecasting

Personal nodes of a traveller and predicted trajectory
L • number of geographic points of interest
x • observed data (e.g., GPS, sensor activity, time of day)
p(yj = 1|x) • probability an object is present at the j-th node
{(xi, yi)}, i = 1, . . . , N • training data
Outline
1. Introduction
2. Applications
3. Methods
4. Label Dependence
5. Multi-label Classification in Data Streams
Multi-label Classification
(Figure: BR graphical model: labels y1, . . . , y4, each predicted independently from x.)

yj = hj(x) = argmax_{yj ∈ {0,1}} p(yj | x) • for index j = 1, . . . , L

and then,

y = h(x) = [y1, . . . , y4]
= [argmax_{y1 ∈ {0,1}} p(y1 | x), · · · , argmax_{y4 ∈ {0,1}} p(y4 | x)]
= [f1(x), · · · , f4(x)] = f(W⊤x)
This is the Binary Relevance method (BR).
BR Transformation

1. Transform dataset . . .

| X    | Y1 | Y2 | Y3 | Y4 |
|------|----|----|----|----|
| x(1) | 0  | 1  | 1  | 0  |
| x(2) | 1  | 0  | 0  | 0  |
| x(3) | 0  | 1  | 0  | 0  |
| x(4) | 1  | 0  | 0  | 1  |
| x(5) | 0  | 0  | 0  | 1  |

. . . into L separate binary problems (one for each label):
| X    | Y1 |
|------|----|
| x(1) | 0  |
| x(2) | 1  |
| x(3) | 0  |
| x(4) | 1  |
| x(5) | 0  |

| X    | Y2 |
|------|----|
| x(1) | 1  |
| x(2) | 0  |
| x(3) | 1  |
| x(4) | 0  |
| x(5) | 0  |

| X    | Y3 |
|------|----|
| x(1) | 1  |
| x(2) | 0  |
| x(3) | 0  |
| x(4) | 0  |
| x(5) | 0  |

| X    | Y4 |
|------|----|
| x(1) | 0  |
| x(2) | 0  |
| x(3) | 0  |
| x(4) | 1  |
| x(5) | 1  |
2. and train with any off-the-shelf binary base classifier.
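A minimal sketch of the transformation and training step on a toy table like the one earlier in the talk; the 1-nearest-neighbour base classifier here is an illustrative stand-in for any off-the-shelf binary classifier:

```python
# Binary Relevance sketch: split one multi-label problem into L binary ones.

def euclid(a, b):
    # squared Euclidean distance between two feature vectors
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

class OneNN:
    """Toy 1-nearest-neighbour binary classifier (illustrative stand-in)."""
    def fit(self, X, y):
        self.X, self.y = X, y
        return self
    def predict(self, x):
        i = min(range(len(self.X)), key=lambda i: euclid(self.X[i], x))
        return self.y[i]

def br_train(X, Y):
    # one binary problem per label: (X, j-th column of Y)
    L = len(Y[0])
    return [OneNN().fit(X, [row[j] for row in Y]) for j in range(L)]

def br_predict(models, x):
    # each model predicts its own label independently
    return [h.predict(x) for h in models]

X = [[1, 0.1, 3, 1, 0], [0, 0.9, 1, 0, 1], [0, 0.0, 1, 1, 0],
     [1, 0.8, 2, 0, 1], [1, 0.0, 2, 0, 1]]
Y = [[0, 1, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 1]]

print(br_predict(br_train(X, Y), [0, 0.0, 3, 1, 1]))  # [0, 1, 1, 0]
```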
Why Not Binary Relevance?
BR ignores label dependence, i.e.,
p(y|x) = ∏_{j=1}^L p(yj | x)
which may not always hold!
Example (Film Genre Classification)
p(yromance | x) ≠ p(yromance | x, yhorror)
Table: Average predictive performance (5-fold CV, EXACT MATCH)

| Dataset | L   | BR   | MCC  |
|---------|-----|------|------|
| Music   | 6   | 0.30 | 0.37 |
| Scene   | 6   | 0.54 | 0.68 |
| Yeast   | 14  | 0.14 | 0.23 |
| Genbase | 27  | 0.94 | 0.96 |
| Medical | 45  | 0.58 | 0.62 |
| Enron   | 53  | 0.07 | 0.09 |
| Reuters | 101 | 0.29 | 0.37 |
Classifier Chains
Modelling label dependence,
(Figure: chain graphical model: y1 → y2 → y3 → y4, all conditioned on x.)

p(y|x) = p(y1 | x) ∏_{j=2}^L p(yj | x, y1, . . . , yj−1)

and,

y = argmax_{y ∈ {0,1}^L} p(y|x)
Bayes Optimal CC

Example

(Figure: probability tree over all 2^3 label paths, with edge probabilities p(yj | x, y1, . . . , yj−1).)

1. p(y = [0, 0, 0]) = 0.040
2. p(y = [0, 0, 1]) = 0.040
3. p(y = [0, 1, 0]) = 0.288
4. . . .
6. p(y = [1, 0, 1]) = 0.252
7. . . .
8. p(y = [1, 1, 1]) = 0.090

return argmax_y p(y|x)

Search space of 2^L paths is too much
CC Transformation

Similar to BR: make L binary problems, but include previous predictions as feature attributes,

| X    | Y1 |
|------|----|
| x(1) | 0  |
| x(2) | 1  |
| x(3) | 0  |
| x(4) | 1  |
| x(5) | 0  |

| X    | Y1 | Y2 |
|------|----|----|
| x(1) | 0  | 1  |
| x(2) | 1  | 0  |
| x(3) | 0  | 1  |
| x(4) | 1  | 0  |
| x(5) | 0  | 0  |

| X    | Y1 | Y2 | Y3 |
|------|----|----|----|
| x(1) | 0  | 1  | 1  |
| x(2) | 1  | 0  | 0  |
| x(3) | 0  | 1  | 0  |
| x(4) | 1  | 0  | 0  |
| x(5) | 0  | 0  | 0  |

| X    | Y1 | Y2 | Y3 | Y4 |
|------|----|----|----|----|
| x(1) | 0  | 1  | 1  | 0  |
| x(2) | 1  | 0  | 0  | 0  |
| x(3) | 0  | 1  | 0  | 0  |
| x(4) | 1  | 0  | 0  | 1  |
| x(5) | 0  | 0  | 0  | 1  |

and, again, apply any classifier (not necessarily a probabilistic one)!
Greedy CC

(Figure: chain graphical model: y1 → y2 → y3 → y4, all conditioned on x.)

L classifiers for L labels. For test instance x, classify

1. y1 = h1(x)
2. y2 = h2(x, y1)
3. y3 = h3(x, y1, y2)
4. y4 = h4(x, y1, y2, y3)

and return y = [y1, . . . , yL]
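A minimal sketch of greedy chain training and inference on the toy table from the BR transformation; the 1-nearest-neighbour base classifier and the data are illustrative, and any binary classifier can be plugged in:

```python
# Classifier Chains sketch: problem j gets the inputs plus labels y1..y_{j-1}.

def euclid(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

class OneNN:
    """Toy 1-nearest-neighbour binary classifier (illustrative stand-in)."""
    def fit(self, X, y):
        self.X, self.y = X, y
        return self
    def predict(self, x):
        i = min(range(len(self.X)), key=lambda i: euclid(self.X[i], x))
        return self.y[i]

def cc_train(X, Y):
    L = len(Y[0])
    models = []
    for j in range(L):
        # augment features with the j earlier labels (true values at train time)
        Xa = [x + y[:j] for x, y in zip(X, Y)]
        models.append(OneNN().fit(Xa, [y[j] for y in Y]))
    return models

def cc_predict(models, x):
    y = []
    for h in models:
        y.append(h.predict(x + y))  # condition on the earlier *predictions*
    return y

X = [[1, 0.1, 3, 1, 0], [0, 0.9, 1, 0, 1], [0, 0.0, 1, 1, 0],
     [1, 0.8, 2, 0, 1], [1, 0.0, 2, 0, 1]]
Y = [[0, 1, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 1]]

print(cc_predict(cc_train(X, Y), [0, 0.0, 3, 1, 1]))  # [0, 1, 1, 0]
```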
Example

(Figure: greedy inference follows the locally most probable branch of the probability tree at each step: p(y1 = 1 | x) = 0.6, then p(y2 = 0 | x, y1 = 1) = 0.7, then p(y3 = 1 | x, y1 = 1, y2 = 0) = 0.6.)

y = h(x) = [1, 0, 1]

1. y1 = h1(x) = argmax_{y1} p(y1 | x) = 1
2. y2 = h2(x, y1) = . . . = 0
3. y3 = h3(x, y1, y2) = . . . = 1

Improves over BR; similar build time (if L < D);
able to use any off-the-shelf classifier for hj; parallelizable
But, errors may be propagated down the chain
Monte-Carlo search for CC

Example

(Figure: probability tree; paths are sampled down the chain.)

Sample T times . . .

p([1, 0, 1]) = 0.6 · 0.7 · 0.6 = 0.252
p([0, 1, 0]) = 0.4 · 0.8 · 0.9 = 0.288

return argmax_{yt} p(yt | x)

Tractable, with similar accuracy to (Bayes Optimal) PCC
Can use other search algorithms, e.g., beam search
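A sketch of the sampling idea, with toy conditional distributions chosen to match the worked example above (probabilities for branches not shown on the slide are illustrative assumptions):

```python
import random

# Toy chain conditionals matching the example tree; unlisted branches are
# illustrative assumptions, not values from the slides.
def p1(x):         return 0.6                                  # p(y1=1 | x)
def p2(x, y1):     return 0.3 if y1 else 0.8                   # p(y2=1 | x, y1)
def p3(x, y1, y2): return 0.6 if (y1, y2) == (1, 0) else 0.1   # p(y3=1 | x, y1, y2)

def path_prob(x, y):
    # probability of a full label vector under the chain factorisation
    y1, y2, y3 = y
    p = p1(x) if y1 else 1 - p1(x)
    p *= p2(x, y1) if y2 else 1 - p2(x, y1)
    p *= p3(x, y1, y2) if y3 else 1 - p3(x, y1, y2)
    return p

def mcc_predict(x, T=200, seed=1):
    # sample T paths down the chain, keep the most probable one seen
    rng = random.Random(seed)
    best, best_p = None, -1.0
    for _ in range(T):
        y1 = int(rng.random() < p1(x))
        y2 = int(rng.random() < p2(x, y1))
        y3 = int(rng.random() < p3(x, y1, y2))
        p = path_prob(x, (y1, y2, y3))
        if p > best_p:
            best, best_p = (y1, y2, y3), p
    return best, best_p

print(mcc_predict(x=None))  # MAP path (0, 1, 0), probability ≈ 0.288
```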
Does Label-order Matter?

(Figure: two chains over the same labels, in different orders.)

In theory, models are equivalent, since

p(y|x) = p(y1 | x) p(y2 | y1, x) = p(y2 | x) p(y1 | y2, x)

but we are estimating p from finite and noisy data; thus

p(y1 | x) p(y2 | y1, x) ≠ p(y2 | x) p(y1 | y2, x)

and in the greedy case,

p(y2 | y1, x) ≈ p(y2 | ŷ1, x) = p(y2 | y1 = argmax_{y1} p(y1 | x), x)
The approximations cause high variance on account of error propagation. We can:

1. reduce variance with an ensemble of classifier chains
2. search the space of chain orders (a huge space, but a little search makes a difference)
Label Powerset Method (LP)

One multi-class problem (taking many values),

(Figure: a single multi-class target [y1, y2, y3, y4] predicted from x.)

y = argmax_{y ∈ Y} p(y|x) • where Y ⊆ {0, 1}^L

Each value is a label vector, 2^L in total, but typically only those occurring in the training set.
(in practice, |Y| ≤ size of the training set, and |Y| ≪ 2^L)
Label Powerset Method (LP)

1. Transform dataset . . .

| X    | Y1 | Y2 | Y3 | Y4 |
|------|----|----|----|----|
| x(1) | 0  | 1  | 1  | 0  |
| x(2) | 1  | 0  | 0  | 0  |
| x(3) | 0  | 1  | 1  | 0  |
| x(4) | 1  | 0  | 0  | 1  |
| x(5) | 0  | 0  | 0  | 1  |

. . . into a multi-class problem, taking up to 2^L possible values:

| X    | Y ∈ {0, 1}^L |
|------|--------------|
| x(1) | 0110         |
| x(2) | 1000         |
| x(3) | 0110         |
| x(4) | 1001         |
| x(5) | 0001         |

2. . . . and train any off-the-shelf multi-class classifier.
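A minimal sketch of the LP transformation on the toy table above; the 1-nearest-neighbour base classifier stands in for any off-the-shelf multi-class classifier:

```python
# Label Powerset sketch: each distinct label vector becomes one class value.

def euclid(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

class OneNN:
    """Toy 1-nearest-neighbour multi-class classifier (illustrative stand-in)."""
    def fit(self, X, y):
        self.X, self.y = X, y
        return self
    def predict(self, x):
        i = min(range(len(self.X)), key=lambda i: euclid(self.X[i], x))
        return self.y[i]

X = [[1, 0.1, 3, 1, 0], [0, 0.9, 1, 0, 1], [0, 0.0, 1, 1, 0],
     [1, 0.8, 2, 0, 1], [1, 0.0, 2, 0, 1]]
Y = [[0, 1, 1, 0], [1, 0, 0, 0], [0, 1, 1, 0], [1, 0, 0, 1], [0, 0, 0, 1]]

# 4 distinct classes appear in this data, far fewer than 2^4 = 16
clf = OneNN().fit(X, [tuple(y) for y in Y])
print(clf.predict([0, 0.0, 3, 1, 1]))  # (0, 1, 1, 0)
```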
Issues with LP
complexity (up to 2L combinations)
imbalance: few examples per class label
overfitting: how to predict new value?
Example
In the Enron dataset, 44% of labelsets are unique (to a single training example or test instance). In the del.icio.us dataset, 98% are unique.
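Under one plausible reading of "unique" (a distinct labelset occurring exactly once in the data), the statistic can be computed directly from label-vector counts; a small sketch:

```python
from collections import Counter

def fraction_unique_labelsets(Y):
    # share of distinct labelsets that occur exactly once
    counts = Counter(tuple(y) for y in Y)
    return sum(1 for c in counts.values() if c == 1) / len(counts)

Y = [[0, 1], [0, 1], [1, 0], [1, 1]]
print(fraction_unique_labelsets(Y))  # 2 of the 3 distinct labelsets occur once
```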
Meta Labels

Improving the label-powerset approach:

decomposition of the label set into M subsets of size k (k < L)
pruning, such that, e.g., Y1,2 ∈ {[0, 0], [0, 1], [1, 1]}
combine together with a random subspace method and a voting scheme

(Figure: meta-labels Y1,2 and Y2,3, each a pruned powerset over a subset of {Y1, Y2, Y3}, predicted from features X1, X2, X3.)
| Method                | Inference Complexity |
|-----------------------|----------------------|
| Label Powerset        | O(2^L · D)           |
| Pruned Sets           | O(P · D)             |
| Decomposition / RAkEL | O(M · 2^k · D)       |
| Meta Labels           | O(M · P · D′)        |

where P < 2^L and P < 2^k, D′ < D.
Summary of Methods

Two views of a multi-label problem of L labels:
1. L binary problems
2. a multi-class problem with up to 2^L classes

Problem Transformation:
1. Transform data into subproblems (binary or multi-class)
2. Apply some off-the-shelf base classifier

or, Algorithm Adaptation:
1. Take a suitable single-label classifier (kNN, neural networks, decision trees, . . . )
2. Adapt it (if necessary) for multi-label classification
Outline
1. Introduction
2. Applications
3. Methods
4. Label Dependence
5. Multi-label Classification in Data Streams
Label Dependence in MLC

Common approach: Present methods to

1. measure label dependence
2. find a structure that best represents this

and then apply classifiers, compare results to BR.

Example

(Figure: two structures over the same labels: particular labels linked in a graph, and label subsets y124, y234 formed as meta-labels, both conditioned on x.)

Link particular labels (nodes) together (CC-based methods)
Form particular label subsets (LP-based methods)
Problem: Measuring label dependence is expensive, and models built on it often do not improve over models built on random dependence!

Problem: For some metrics (such as Hamming loss / label accuracy), knowledge of label dependence is theoretically unnecessary!
Marginal label dependence

Marginal dependence

When the joint is not the product of the marginals, i.e.,

p(y2) ≠ p(y2 | y1)
p(y1) p(y2) ≠ p(y1, y2)

(Figure: dependence graph linking Y1 and Y2.)

Estimate from co-occurrence frequencies in training data
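For instance, the mutual information between two binary label columns (the quantity plotted on the following slides) can be estimated from co-occurrence frequencies; a minimal sketch:

```python
from math import log

def mutual_information(ya, yb):
    # empirical mutual information (in nats) between two binary label columns
    n = len(ya)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = sum(1 for s, t in zip(ya, yb) if (s, t) == (a, b)) / n
            p_a = sum(1 for s in ya if s == a) / n
            p_b = sum(1 for t in yb if t == b) / n
            if p_ab > 0:
                mi += p_ab * log(p_ab / (p_a * p_b))
    return mi

print(mutual_information([0, 1, 0, 1], [0, 1, 0, 1]))  # log(2): fully dependent
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0: independent
```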
Marginal label dependence

Example

Figure: Music dataset - Mutual Information between the emotion labels (amazed, happy, relaxing, quiet, sad, angry)
Marginal label dependence

Example

Figure: Scene dataset - Mutual Information between the scene labels (beach, sunset, foliage, field, mountain, urban)
Used for regularization/constraints:

1. y = h(x) makes a prediction
2. y′ = g(y) regularizes the prediction
Conditional label dependence

But at classification time, we condition on the input!

Conditional dependence

. . . conditioned on input observation x.

p(y2 | y1, x) ≠ p(y2 | x)

(Figure: graph with X feeding both Y1 and Y2, plus a direct link between Y1 and Y2.)

Have to build and measure models

Indication of conditional dependence if

the performance of LP/CC exceeds that of BR
errors among the binary models are correlated

But what does this mean?
Conditional independence

. . . conditioned on input observation x. For example,

p(y2) ≠ p(y2 | y1), but p(y2 | x) = p(y2 | y1, x)

(Figure: labels that are marginally dependent may be conditionally independent given X.)
The LOGICAL Problem

Example (The LOGICAL Toy Problem)

The three labels are Y1 = OR, Y2 = AND, Y3 = XOR:

| X1 | X2 | Y1 | Y2 | Y3 |
|----|----|----|----|----|
| 0  | 0  | 0  | 0  | 0  |
| 1  | 0  | 1  | 0  | 1  |
| 0  | 1  | 1  | 0  | 1  |
| 1  | 1  | 1  | 1  | 0  |

Each label is a logical operation (independent of the others!)
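The truth table above can be generated directly from the bitwise definitions of the three operations; a small sketch:

```python
# The LOGICAL toy problem: labels are OR, AND, XOR of two binary inputs.
data = [((x1, x2), (x1 | x2, x1 & x2, x1 ^ x2))
        for x1 in (0, 1) for x2 in (0, 1)]

for (x1, x2), (y_or, y_and, y_xor) in data:
    print(x1, x2, "->", y_or, y_and, y_xor)
```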
The LOGICAL Problem

Figure: BR (left), CC (middle), LP (right)

Table: The LOGICAL problem, base classifier logistic regression.

| Metric        | BR   | CC   | LP   |
|---------------|------|------|------|
| HAMMING SCORE | 0.83 | 1.00 | 1.00 |
| EXACT MATCH   | 0.50 | 1.00 | 1.00 |

Dependence is introduced by an inadequate model!
Dependence depends on the model.
The LOGICAL Problem

Figure: Binary Relevance (BR): linear decision boundary (solid line, estimated with logistic regression) not viable for the YXOR label. (Panels show the models for YOR, YAND, and YXOR in (x1, x2).)
![Page 55: Multi-label learning from batch and streaming data · Introduction K = 2 K >2 L = 1 binary multi-class L >1 multi-label multi-outputy y also known as multi-target, multi-dimensional](https://reader033.vdocuments.site/reader033/viewer/2022060516/6049646903f9ab5f234eb144/html5/thumbnails/55.jpg)
Solution via Structure
(Panels: decision regions for Y_OR|x1,x2 and Y_AND|x1,x2 as before; a third, 3-D panel shows Y_XOR|y_OR,x1,x2 over axes x1, x2, y_OR.)
Figure: Classifier chains (CC): linear model now applicable to YXOR
![Page 56: Multi-label learning from batch and streaming data · Introduction K = 2 K >2 L = 1 binary multi-class L >1 multi-label multi-outputy y also known as multi-target, multi-dimensional](https://reader033.vdocuments.site/reader033/viewer/2022060516/6049646903f9ab5f234eb144/html5/thumbnails/56.jpg)
Solution via Multi-classDecomposition
(Single panel: decision regions in (x1, x2) for the joint model Y_OR,AND,XOR|x1,x2, separating the classes Y[OR,AND,XOR] = [0,0,0], [1,0,1] and [1,1,0].)
Figure: Label Powerset (LP): solvable with one-vs-one multi-class decomposition for any (e.g., linear) base classifier.
![Page 57: Multi-label learning from batch and streaming data · Introduction K = 2 K >2 L = 1 binary multi-class L >1 multi-label multi-outputy y also known as multi-target, multi-dimensional](https://reader033.vdocuments.site/reader033/viewer/2022060516/6049646903f9ab5f234eb144/html5/thumbnails/57.jpg)
Solution via Con. Independence
(Three panels: decision regions in the transformed space (z1, z2), z = φ(x), for Y_OR|z1,z2, Y_AND|z1,z2 and Y_XOR|z1,z2; each label is now linearly separable.)
Figure: Solution via latent structure (e.g., random RBF) to new input space z; creating independence: p(yXOR|z, yOR, yAND) ≈ p(yXOR|z).
![Page 58: Multi-label learning from batch and streaming data · Introduction K = 2 K >2 L = 1 binary multi-class L >1 multi-label multi-outputy y also known as multi-target, multi-dimensional](https://reader033.vdocuments.site/reader033/viewer/2022060516/6049646903f9ab5f234eb144/html5/thumbnails/58.jpg)
Solution via Suitable Base-classifier

              x1 = 1?
            no /    \ yes
        x2 = 1?      x2 = 1?
      no /  \ yes   no /  \ yes
  [0,0,0]  [0,1,1] [0,1,1] [1,1,0]

Figure: Solution via non-linear classifier (e.g., Decision Tree). Leaves hold examples, where y = [yAND, yOR, yXOR]
![Page 59: Multi-label learning from batch and streaming data · Introduction K = 2 K >2 L = 1 binary multi-class L >1 multi-label multi-outputy y also known as multi-target, multi-dimensional](https://reader033.vdocuments.site/reader033/viewer/2022060516/6049646903f9ab5f234eb144/html5/thumbnails/59.jpg)
Detecting Dependence
Conditional label dependence and the choice of base model are inseparable.
y_j = h_j(x) + ε_j
y_k = h_k(x) + ε_k
(Two scatter plots of the error vector ε = [εOR, εXOR]: εXOR against εOR.)
Figure: Errors from logistic regression (left) and decision tree (right).
Only dependence is captured!
![Page 60: Multi-label learning from batch and streaming data · Introduction K = 2 K >2 L = 1 binary multi-class L >1 multi-label multi-outputy y also known as multi-target, multi-dimensional](https://reader033.vdocuments.site/reader033/viewer/2022060516/6049646903f9ab5f234eb144/html5/thumbnails/60.jpg)
A fresh look at Problem Transformation
(Diagram: inputs X1, X2, X3 feeding label nodes Y1, Y2, Y3, with a cascaded layer of pairwise nodes Y1,2 and Y2,3.)
Figure: Standard methods can be viewed as ('deep'/cascaded) basis functions on the label space.
![Page 61: Multi-label learning from batch and streaming data · Introduction K = 2 K >2 L = 1 binary multi-class L >1 multi-label multi-outputy y also known as multi-target, multi-dimensional](https://reader033.vdocuments.site/reader033/viewer/2022060516/6049646903f9ab5f234eb144/html5/thumbnails/61.jpg)
Label Dependence: Summary
Marginal dependence: for regularization
Conditional dependence
. . . depends on the model
. . . may be introduced
Should consider together:
  base classifier
  label structure
  inner-layer structure
An open problem
Much existing research is relevant (latent-variablemodels, neural networks, deep learning, . . . )
![Page 62: Multi-label learning from batch and streaming data · Introduction K = 2 K >2 L = 1 binary multi-class L >1 multi-label multi-outputy y also known as multi-target, multi-dimensional](https://reader033.vdocuments.site/reader033/viewer/2022060516/6049646903f9ab5f234eb144/html5/thumbnails/62.jpg)
Outline
1 Introduction
2 Applications
3 Methods
4 Label Dependence
5 Multi-label Classification in Data Streams
![Page 63: Multi-label learning from batch and streaming data · Introduction K = 2 K >2 L = 1 binary multi-class L >1 multi-label multi-outputy y also known as multi-target, multi-dimensional](https://reader033.vdocuments.site/reader033/viewer/2022060516/6049646903f9ab5f234eb144/html5/thumbnails/63.jpg)
Classification in Data Streams
Setting:
sequence is potentially infinite
high speed of arrival
stream is one-way
Implications
work in limited memory
adapt to concept drift
![Page 64: Multi-label learning from batch and streaming data · Introduction K = 2 K >2 L = 1 binary multi-class L >1 multi-label multi-outputy y also known as multi-target, multi-dimensional](https://reader033.vdocuments.site/reader033/viewer/2022060516/6049646903f9ab5f234eb144/html5/thumbnails/64.jpg)
Multi-label Streams Methods
1 Batch-incremental Ensemble
2 Problem transformation with an incremental base learner
3 Multi-label kNN
4 Multi-label incremental decision trees
5 Neural networks
![Page 65: Multi-label learning from batch and streaming data · Introduction K = 2 K >2 L = 1 binary multi-class L >1 multi-label multi-outputy y also known as multi-target, multi-dimensional](https://reader033.vdocuments.site/reader033/viewer/2022060516/6049646903f9ab5f234eb144/html5/thumbnails/65.jpg)
Batch-Incremental Ensemble
Build regular multi-label models on batches/windows of instances (typically in a [weighted] ensemble).
  [x;y]_1, [x;y]_2, [x;y]_3, [x;y]_4,   [x;y]_5, [x;y]_6, [x;y]_7, [x;y]_8,   [x;y]_9, [x;y]_10, ...
  \______________________________/      \______________________________/
              trains h1                             trains h2
A common approach in the literature, and can be surprisingly effective
Free choice of base classifier (e.g., C4.5, SVM)
What batch size to use?
  Too small = models insufficient
  Too large = slow to adapt
  Too many batches = too slow
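The scheme above can be sketched in a few lines of Python/numpy (all names are illustrative; the toy lookup-table base model stands in for any batch learner, e.g. C4.5 or an SVM):

```python
import numpy as np

class TableBR:
    """Toy batch base model: stores the mean label vector per seen x pattern."""
    def fit(self, X, Y):
        groups = {}
        for x, y in zip(X, Y):
            groups.setdefault(tuple(x), []).append(y)
        self.table = {k: np.mean(v, axis=0) for k, v in groups.items()}
        self.default = Y.mean(axis=0)          # fallback for unseen patterns
        return self

    def predict(self, x):
        return self.table.get(tuple(x), self.default)

class BatchIncrementalEnsemble:
    def __init__(self, batch_size=8, n_models=3):
        self.batch_size, self.n_models = batch_size, n_models
        self.X, self.Y, self.ensemble = [], [], []

    def observe(self, x, y):
        self.X.append(x); self.Y.append(y)
        if len(self.X) >= self.batch_size:     # window full: train a new model
            model = TableBR().fit(np.array(self.X), np.array(self.Y))
            self.ensemble.append(model)
            self.ensemble = self.ensemble[-self.n_models:]  # drop oldest model
            self.X, self.Y = [], []

    def predict(self, x):
        votes = np.mean([m.predict(x) for m in self.ensemble], axis=0)
        return (votes > 0.5).astype(int)       # per-label majority vote

# Feed a stream of LOGICAL instances (labels: OR, AND, XOR)
ens = BatchIncrementalEnsemble(batch_size=8, n_models=3)
patterns = [(0, 0), (1, 0), (0, 1), (1, 1)]
for t in range(32):
    x1, x2 = patterns[t % 4]
    ens.observe(np.array([x1, x2]), np.array([x1 | x2, x1 & x2, x1 ^ x2]))
pred = ens.predict(np.array([1, 1]))
```

Keeping only the last `n_models` models is what lets the ensemble forget old concepts; the batch size trades adaptation speed against model quality, exactly as in the bullets above.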
![Page 66: Multi-label learning from batch and streaming data · Introduction K = 2 K >2 L = 1 binary multi-class L >1 multi-label multi-outputy y also known as multi-target, multi-dimensional](https://reader033.vdocuments.site/reader033/viewer/2022060516/6049646903f9ab5f234eb144/html5/thumbnails/66.jpg)
Problem Transformation with Incremental Base Learner

Use an incremental learner (Naive Bayes, SGD, Hoeffding trees) with any problem transformation method (BR, LP, CC, . . . )
Simple implementation
Risk of overfitting (e.g., with classifier chains)
Concept drift may invalidate structure
Limited choice of base learner (must be incremental)
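A sketch of binary relevance over an incremental base learner (here a per-label logistic model updated by SGD; class and method names are illustrative):

```python
import numpy as np

class IncrementalBR:
    """One logistic model per label; one SGD step per arriving instance."""
    def __init__(self, n_features, n_labels, lr=0.5):
        self.W = np.zeros((n_features + 1, n_labels))   # +1 row for bias
        self.lr = lr

    def _phi(self, x):
        return np.append(x, 1.0)                        # append bias input

    def update(self, x, y):
        z = self._phi(x)
        p = 1.0 / (1.0 + np.exp(-(z @ self.W)))         # per-label sigmoid
        self.W -= self.lr * np.outer(z, p - y)          # SGD step on log-loss

    def predict(self, x):
        return (self._phi(x) @ self.W > 0).astype(int)

# Stream of OR/AND instances (both linearly separable, so SGD suffices)
model = IncrementalBR(n_features=2, n_labels=2)
patterns = [(0, 0), (1, 0), (0, 1), (1, 1)]
for t in range(400):
    x1, x2 = patterns[t % 4]
    model.update(np.array([x1, x2]), np.array([x1 | x2, x1 & x2]))
```

The same `update`/`predict` interface extends to CC (feed earlier label predictions into later models) or LP (one incremental multi-class model over label combinations).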
![Page 67: Multi-label learning from batch and streaming data · Introduction K = 2 K >2 L = 1 binary multi-class L >1 multi-label multi-outputy y also known as multi-target, multi-dimensional](https://reader033.vdocuments.site/reader033/viewer/2022060516/6049646903f9ab5f234eb144/html5/thumbnails/67.jpg)
Multi-label kNN
Maintain a dynamic buffer of instances; compare each test instance x to the k neighbouring instances,
\[
\hat{y}_j = \begin{cases} 1 & \text{if } \frac{1}{k} \sum_{i :\, x^{(i)} \in \mathrm{Ne}(x)} y_j^{(i)} > 0.5 \\ 0 & \text{otherwise} \end{cases}
\]
efficient wrt L
. . . but not wrt D
limited buffer size,
not suitable for all problems
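The rule above can be sketched as follows (names are illustrative); a `deque` gives the limited buffer, with the oldest instances falling out automatically:

```python
import numpy as np
from collections import deque

class StreamMLkNN:
    def __init__(self, k=3, buffer_size=100):
        self.k = k
        self.buffer = deque(maxlen=buffer_size)   # oldest instances drop out

    def observe(self, x, y):
        self.buffer.append((np.asarray(x), np.asarray(y)))

    def predict(self, x):
        X = np.array([b[0] for b in self.buffer])
        Y = np.array([b[1] for b in self.buffer])
        nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:self.k]
        return (Y[nearest].mean(axis=0) > 0.5).astype(int)  # per-label vote

knn = StreamMLkNN(k=3, buffer_size=12)
for _ in range(3):                                # fill buffer with LOGICAL data
    for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
        knn.observe([x1, x2], [x1 | x2, x1 & x2, x1 ^ x2])
pred = knn.predict(np.array([1, 1]))
```

The per-query cost grows with the feature dimension D (the distance computation), not with the number of labels L, matching the efficiency remarks above.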
![Page 68: Multi-label learning from batch and streaming data · Introduction K = 2 K >2 L = 1 binary multi-class L >1 multi-label multi-outputy y also known as multi-target, multi-dimensional](https://reader033.vdocuments.site/reader033/viewer/2022060516/6049646903f9ab5f234eb144/html5/thumbnails/68.jpg)
ML Incremental Decision Trees
A small sample can suffice to choose a splitting attribute (the Hoeffding bound gives guarantees)
As in a regular tree, but with modified splitting criteria, e.g.,
\[
H_{ML}(S) = -\sum_{j=1}^{L} \sum_{c \in \{0,1\}} P(y_j = c) \log_2 P(y_j = c)
\]
Examples with multiple labels collect at the leaves.
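The criterion above amounts to summing, over labels, the binary entropy of each label column; a direct numpy sketch (function name illustrative):

```python
import numpy as np

def multilabel_entropy(Y):
    """H_ML(S): sum over labels j of the binary entropy of P(y_j = 1) in Y."""
    H = 0.0
    for j in range(Y.shape[1]):
        p1 = Y[:, j].mean()
        for p in (p1, 1.0 - p1):
            if p > 0:                       # convention: 0 * log 0 = 0
                H -= p * np.log2(p)
    return H

# LOGICAL labels [OR, AND, XOR] for the four input patterns
Y = np.array([[0, 0, 0], [1, 0, 1], [1, 0, 1], [1, 1, 0]])
h_pure = multilabel_entropy(np.array([[1, 1], [1, 1]]))   # no uncertainty: 0
h_full = multilabel_entropy(Y)                            # mixed labels: > 0
```

A split is chosen to maximise the reduction of this quantity across the child nodes, just as with single-label information gain.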
              x1 = 1?
            no /    \ yes
  [0,0,0], [0,1,1]   x2 = 1?
                   no /  \ yes
               [0,1,1]  [1,1,0]
![Page 69: Multi-label learning from batch and streaming data · Introduction K = 2 K >2 L = 1 binary multi-class L >1 multi-label multi-outputy y also known as multi-target, multi-dimensional](https://reader033.vdocuments.site/reader033/viewer/2022060516/6049646903f9ab5f234eb144/html5/thumbnails/69.jpg)
ML Incremental Decision Trees
Fast, and usually competitive,
But the tree may grow very conservatively,
. . . and we may need to replace it (or part of it) when the concept changes.
              x1 = 1?
            no /    \ yes
  [0,0,0], [0,1,1]   x2 = 1?
                   no /  \ yes
               [0,1,1]  [1,1,0]
Place multi-label classifiers at the leaves of the tree
and wrap it in an ensemble.
![Page 70: Multi-label learning from batch and streaming data · Introduction K = 2 K >2 L = 1 binary multi-class L >1 multi-label multi-outputy y also known as multi-target, multi-dimensional](https://reader033.vdocuments.site/reader033/viewer/2022060516/6049646903f9ab5f234eb144/html5/thumbnails/70.jpg)
Neural networks
Each label is a node. Trained with SGD
(Diagram: network with inputs x1, . . . , x5 and output label nodes y1, . . . , y4.)
y = W^T x
g = ∇E(W)
W_{j,k}^{(t+1)} = W_{j,k}^{(t)} − λ g_{j,k}
Can be applied natively
One layer = BR; should use hidden layers to model label dependence / improve performance
Hyper-parameter tuning can be tedious
Relatively poor performance in empirical comparisons on standard data streams (improving now with recent advances in SGD, more common use of basis expansion)
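To make the role of a hidden layer concrete, here is a small numpy sketch (sizes, learning rate and variable names are all illustrative): full-batch gradient descent on the LOGICAL labels, with per-label sigmoid outputs and cross-entropy error E. Without the hidden layer this reduces to BR and the XOR output cannot be fit.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], float)
Y = np.array([[0, 0, 0], [1, 0, 1], [1, 0, 1], [1, 1, 0]], float)  # OR, AND, XOR

W1 = rng.normal(scale=0.5, size=(2, 8)); b1 = np.zeros(8)   # hidden layer
W2 = rng.normal(scale=0.5, size=(8, 3)); b2 = np.zeros(3)   # one node per label

def forward(X):
    H = np.tanh(X @ W1 + b1)                 # hidden representation
    P = 1 / (1 + np.exp(-(H @ W2 + b2)))     # per-label probabilities
    return H, P

def loss(P):                                  # cross-entropy error E
    return -np.mean(Y * np.log(P + 1e-12) + (1 - Y) * np.log(1 - P + 1e-12))

lam = 0.5
_, P = forward(X); loss_before = loss(P)
for _ in range(2000):
    H, P = forward(X)
    G2 = (P - Y) / len(X)                    # gradient at the outputs
    G1 = (G2 @ W2.T) * (1 - H ** 2)          # backpropagate through tanh
    W2 -= lam * H.T @ G2; b2 -= lam * G2.sum(axis=0)
    W1 -= lam * X.T @ G1; b1 -= lam * G1.sum(axis=0)
_, P = forward(X); loss_after = loss(P)
```

In a stream the same update would be applied one instance at a time (true SGD) rather than on the full batch.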
![Page 71: Multi-label learning from batch and streaming data · Introduction K = 2 K >2 L = 1 binary multi-class L >1 multi-label multi-outputy y also known as multi-target, multi-dimensional](https://reader033.vdocuments.site/reader033/viewer/2022060516/6049646903f9ab5f234eb144/html5/thumbnails/71.jpg)
Multi-label Data Streams: Issues
Overfitting
Class imbalance
Multi-dimensional concept drift
Labelled examples difficult to obtain (semi-supervised)
Dynamic label set
Time dependence
![Page 72: Multi-label learning from batch and streaming data · Introduction K = 2 K >2 L = 1 binary multi-class L >1 multi-label multi-outputy y also known as multi-target, multi-dimensional](https://reader033.vdocuments.site/reader033/viewer/2022060516/6049646903f9ab5f234eb144/html5/thumbnails/72.jpg)
Multi-label Concept Drift
Consider the relative frequencies of the j-th and k-th labels:
\[
\begin{bmatrix} p_j & p_{j,k} \\ \cdot & p_k \end{bmatrix}
\]
(if marginal independence, then p_{j,k} = p_j p_k).
Possible drift:
pj increases (j-th label relatively more frequent)
pj and pk both decrease (label cardinality decreasing)
pj,k changes relative to pjpk (change in marginal dependence relation between the labels)
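These quantities are easy to track on a window of recent label vectors; a sketch (function name illustrative):

```python
import numpy as np

def label_pair_stats(Y, j, k):
    """Relative frequencies of labels j and k on a window Y (rows = instances)."""
    pj = Y[:, j].mean()
    pk = Y[:, k].mean()
    pjk = (Y[:, j] * Y[:, k]).mean()    # joint frequency of both labels
    return pj, pk, pjk, pjk - pj * pk   # last value: marginal-dependence signal

# Window where the two labels always co-occur
Y = np.array([[1, 1], [0, 0], [1, 1], [0, 0]])
pj, pk, pjk, dep = label_pair_stats(Y, 0, 1)
```

A drop of `dep` toward zero over successive windows would signal the third kind of drift listed above: the labels becoming marginally independent.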
![Page 73: Multi-label learning from batch and streaming data · Introduction K = 2 K >2 L = 1 binary multi-class L >1 multi-label multi-outputy y also known as multi-target, multi-dimensional](https://reader033.vdocuments.site/reader033/viewer/2022060516/6049646903f9ab5f234eb144/html5/thumbnails/73.jpg)
Multi-label Concept Drift
And when conditioned on input x, we consider the relative frequencies/values of the j-th and k-th errors:
\[
\begin{bmatrix} ε_j & ε_{j,k} \\ \cdot & ε_k \end{bmatrix}
\]
(if conditional independence, then ε_{j,k} ≈ ε_j · ε_k).
Possible drift:
εj increases (more errors on j-th label)
εj and εk both increase (more errors)
εj,k changes relative to εj, εk (change in conditional dependence relation)
![Page 74: Multi-label learning from batch and streaming data · Introduction K = 2 K >2 L = 1 binary multi-class L >1 multi-label multi-outputy y also known as multi-target, multi-dimensional](https://reader033.vdocuments.site/reader033/viewer/2022060516/6049646903f9ab5f234eb144/html5/thumbnails/74.jpg)
Example
Recall the distribution of errors
(Scatter plot of the error vector ε = [εOR, εXOR]: εXOR against εOR.)
This shape may change over time – and structures may need to be adjusted to cope
![Page 75: Multi-label learning from batch and streaming data · Introduction K = 2 K >2 L = 1 binary multi-class L >1 multi-label multi-outputy y also known as multi-target, multi-dimensional](https://reader033.vdocuments.site/reader033/viewer/2022060516/6049646903f9ab5f234eb144/html5/thumbnails/75.jpg)
Dealing with Concept Drift
Possible approaches
Just ignore it – batch models must be replaced anyway; kNN and SGD adapt; in other cases can use weighted ensembles / fading factors
Monitor a predictive performance statistic with a change detector (e.g., window-based detection, ADWIN) and reset models
Monitor the distribution with a change detector (e.g., window-based, KL divergence) and reset/recalibrate models
(similar to single-labelled data, except with more complex measurement)
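The second approach can be sketched with a simple two-window test (a crude stand-in for ADWIN; class name and threshold are illustrative): flag drift when mean accuracy in the recent window falls well below that of a reference window.

```python
from collections import deque

class WindowDriftDetector:
    def __init__(self, window=30, delta=0.2):
        self.ref = deque(maxlen=window)      # reference window (filled first)
        self.recent = deque(maxlen=window)   # sliding recent window
        self.delta = delta

    def add(self, correct):
        """correct: 1 if the multi-label prediction was right, else 0."""
        if len(self.ref) < self.ref.maxlen:
            self.ref.append(correct)
        else:
            self.recent.append(correct)
        return self.drift()

    def drift(self):
        if len(self.recent) < self.recent.maxlen:
            return False
        mean = lambda d: sum(d) / len(d)
        return mean(self.ref) - mean(self.recent) > self.delta

det = WindowDriftDetector(window=30, delta=0.2)
flags = [det.add(v) for v in [1] * 30 + [0] * 30]   # accuracy collapses
```

The monitored statistic could equally be Hamming score or exact match; on drift, the models (and possibly the chain/label structure) are reset.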
![Page 76: Multi-label learning from batch and streaming data · Introduction K = 2 K >2 L = 1 binary multi-class L >1 multi-label multi-outputy y also known as multi-target, multi-dimensional](https://reader033.vdocuments.site/reader033/viewer/2022060516/6049646903f9ab5f234eb144/html5/thumbnails/76.jpg)
Dealing with Unlabelled Instances
Ignore instances with no label
Use active learning to get good labels
Use predicted labels (self-training)
Use an unsupervised process, for example clustering, latent-variable representations.
![Page 77: Multi-label learning from batch and streaming data · Introduction K = 2 K >2 L = 1 binary multi-class L >1 multi-label multi-outputy y also known as multi-target, multi-dimensional](https://reader033.vdocuments.site/reader033/viewer/2022060516/6049646903f9ab5f234eb144/html5/thumbnails/77.jpg)
Dealing with Unlabelled Instances

Use an unsupervised process, for example clustering, latent-variable representations.
1 z_t = g(x_t)
2 y_t = h(z_t)
3 update g with (x_t, z_t)
4 update h with (z_{t−1}, y_{t−1}) (if y_{t−1} is available)
(Diagram: x_t propagates to z_t (with z_{t−1} → z_t); z_t predicts y_t; (z_{t−1}, y_{t−1}) is used for the update.)
Can also be implemented as one single model
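A sketch of this propagate/update loop (the unsupervised map g is a fixed random projection here, standing in for a learned one; all names are illustrative):

```python
import numpy as np

class LatentStream:
    def __init__(self, d_in, d_z, n_labels, lr=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.P = rng.normal(size=(d_in, d_z))   # g: random projection (fixed)
        self.W = np.zeros((d_z, n_labels))      # h: per-label logistic weights
        self.lr = lr
        self.prev = None                         # holds (z_{t-1}, y_{t-1})

    def step(self, x, y=None):
        z = np.tanh(np.asarray(x, float) @ self.P)   # 1. z_t = g(x_t)
        yhat = (z @ self.W > 0).astype(int)          # 2. y_t = h(z_t)
        # 3. g is fixed here; a learned g would be updated with (x_t, z_t)
        if self.prev is not None and self.prev[1] is not None:
            zp, yp = self.prev                       # 4. update h with the
            p = 1 / (1 + np.exp(-(zp @ self.W)))     #    labelled (z, y) pair
            self.W -= self.lr * np.outer(zp, np.asarray(p) - np.asarray(yp))
        self.prev = (z, y)                           # y may be None (unlabelled)
        return yhat

m = LatentStream(d_in=2, d_z=8, n_labels=1)
for t in range(400):                                 # stream: label = OR(x1, x2)
    x1, x2 = [(0, 0), (1, 0), (0, 1), (1, 1)][t % 4]
    labelled = (t // 4) % 2 == 0                     # every other pass unlabelled
    m.step([x1, x2], [x1 | x2] if labelled else None)
pred_pos = m.step([1, 1])
pred_neg = m.step([0, 0])
```

Unlabelled instances still drive the representation forward (steps 1–3); only the supervised update (step 4) waits for a label, which is the point of the decomposition.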
![Page 78: Multi-label learning from batch and streaming data · Introduction K = 2 K >2 L = 1 binary multi-class L >1 multi-label multi-outputy y also known as multi-target, multi-dimensional](https://reader033.vdocuments.site/reader033/viewer/2022060516/6049646903f9ab5f234eb144/html5/thumbnails/78.jpg)
Summary
Multi-label classification is an active area of research,relevant to many real-world problems
Methods that deal appropriately with label dependence can achieve significant gains over a naive approach
Many multi-label problems come in the form of a data stream, incurring particular challenges
![Page 79: Multi-label learning from batch and streaming data · Introduction K = 2 K >2 L = 1 binary multi-class L >1 multi-label multi-outputy y also known as multi-target, multi-dimensional](https://reader033.vdocuments.site/reader033/viewer/2022060516/6049646903f9ab5f234eb144/html5/thumbnails/79.jpg)