Learning with Implicit/Explicit Structures


Page 1: Learning with Implicit/Explicit Structures


Learning with Implicit/Explicit Structures

James Kwok

Department of Computer Science and Engineering, Hong Kong University of Science and Technology

Hong Kong

Chinese Workshop on Machine Learning & Applications

Page 2: Learning with Implicit/Explicit Structures

Structure is Everywhere

Data have structure

Example (structured data)

DNA (strings), natural language (trees), molecules (graphs)

various kernels have been defined for these structured data

structure is about the input

Example (structured sparsity)

prefers certain sparsity patterns of the parameter vector, not just a small number of nonzero coefficients

Page 3: Learning with Implicit/Explicit Structures

Structure is Everywhere...

Outputs can also have structure

Example (hierarchical classification)

instance labels reside in a known tree- or DAG-structured hierarchy

structure is explicit (e.g., yahoo hierarchy)

Structure can be implicit

Example (multitask learning)

tasks may have an underlying clustering structure

may have to be discovered by the learning algorithm

Page 4: Learning with Implicit/Explicit Structures

Hierarchical Multilabel Classification

Page 5: Learning with Implicit/Explicit Structures

Multiclass vs Multilabel Classification

Example (multiclass classification)

an instance can have only one label

Example (multilabel classification)

an instance can have more than one label

image tagging

tags: elephant, jungle and africa

Page 6: Learning with Implicit/Explicit Structures

Multilabel Classification

Example

video tagging

text categorization

gene function analysis in bioinformatics

Are these labels independent?

NO! Labels often have structure!

Page 7: Learning with Implicit/Explicit Structures

Hierarchical Classification

Example (text classification)

labels may be organized in a known tree-structured hierarchy

Yahoo! taxonomy (as of 2004): a 16-level hierarchy

if an instance is associated with the label of a node, it is also associated with the node's parent

Page 8: Learning with Implicit/Explicit Structures

Hierarchical Classification...

More generally, label hierarchy is a directed acyclic graph (DAG)

Example (bioinformatics: Gene Ontology (GO))

Genome annotation

a node can have multiple parents

if a node is positive, all its parents must be positive

Consider this label hierarchy information in making predictions

Page 9: Learning with Implicit/Explicit Structures

Training

train estimators for p(y_i = 1 | y_pa(i) = 1, x) at each node i

many possible estimation methods

Example

at each node i: train a binary SVM using those training examples whose parent node pa(i) is labeled positive

convert the SVM output into a probability estimate using Platt's procedure
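A minimal sketch of this training step (our own illustration, not the talk's code): scikit-learn's SVC with probability=True fits a binary SVM and applies Platt scaling internally; the data layout (a 0/1 label matrix with one column per node, plus a parent array) is assumed.

```python
import numpy as np
from sklearn.svm import SVC

def train_node_estimators(X, Y, parent):
    """Train one probabilistic binary classifier per non-root node of the hierarchy.

    X: (n_samples, n_features) inputs
    Y: (n_samples, n_nodes) 0/1 label matrix, column 0 = root (always 1)
    parent: parent[i] = parent of node i (parent[0] = -1)
    Returns: dict node -> fitted SVC estimating p(y_i = 1 | y_pa(i) = 1, x).
    """
    estimators = {}
    for i in range(1, Y.shape[1]):
        mask = Y[:, parent[i]] == 1                    # examples whose parent is positive
        if mask.sum() == 0 or len(np.unique(Y[mask, i])) < 2:
            continue                                   # not enough data to fit this node
        clf = SVC(kernel="linear", probability=True)   # Platt scaling done internally
        clf.fit(X[mask], Y[mask, i])
        estimators[i] = clf
    return estimators

def node_probabilities(estimators, x):
    """p(y_i = 1 | y_pa(i) = 1, x) for every trained node, for a single input x."""
    x = np.asarray(x).reshape(1, -1)
    return {i: float(clf.predict_proba(x)[0, list(clf.classes_).index(1)])
            for i, clf in estimators.items()}
```

The returned per-node probabilities play the role of the h_i and p_ij estimates used in the prediction steps on the later slides.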

Page 10: Learning with Implicit/Explicit Structures

Potentially Large Number of Labels

Example

Flickr (as of 2010): > 20 million unique tags

humans can recognize 10,000-100,000 unique object classes

Yahoo! taxonomy (as of 2004): nearly 300,000 categories

How to reduce the number of learning problems?

Page 11: Learning with Implicit/Explicit Structures

Projection

1 project the long label vector to a low-dimensional vector, e.g., by PCA or a randomized matrix

2 learn a mapping from input to each projected dimension

3 predict and reconstruct the label vector in the original space

Advantages

number of learning problems = dimension in projected space

flexible: any regressor can be used in the learning step
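A minimal end-to-end sketch of this recipe, assuming a random Gaussian projection and ridge regression (any projection and any regressor would do); all function and variable names here are our own.

```python
import numpy as np

def compress_labels(Y, m, seed=0):
    """Project an n x L 0/1 label matrix to n x m with a random Gaussian matrix P."""
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((Y.shape[1], m)) / np.sqrt(m)
    return Y @ P, P

def fit_ridge(X, Z, lam=1.0):
    """One ridge regressor per projected dimension (all solved jointly in closed form)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Z)   # (d, m) weight matrix

def predict_topk(X, W, P, k):
    """Predict in the projected space, map back to label space, keep the k largest scores."""
    H = X @ W @ P.T                                    # (n, L) reconstructed label scores
    idx = np.argsort(-H, axis=1)[:, :k]
    Y_hat = np.zeros_like(H, dtype=int)
    np.put_along_axis(Y_hat, idx, 1, axis=1)
    return Y_hat, H
```

The last function already performs the simple unstructured prediction on the next slide: reconstruct the label scores h and keep the k largest entries.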

Page 12: Learning with Implicit/Explicit Structures

Prediction: Simple Case

Suppose it is known that the test sample has k labels and that the labels are unstructured. How do we obtain the prediction from the vector of estimates h?

Simply pick the k largest entries in h!

max_ψ ∑_i h_i ψ_i
s.t.  ψ_i ∈ {0, 1} ∀i (indicator)
      ∑_i ψ_i = k

ψ_i = 1: node i is selected; 0 otherwise

Page 13: Learning with Implicit/Explicit Structures

What if the Labels are Structured?

can no longer simply pick the k largest entries

Page 14: Learning with Implicit/Explicit Structures

Mandatory vs Non-mandatory Leaf Node Prediction

Mandatory leaf node prediction (MLNP)

labels must end at leaf nodes

Example

leaf nodes are the objects of interest (they have stronger semantic/biological meanings)

e.g., taxonomies of musical signals and genes

when the label hierarchy is learned from the data, the internal nodes are only artificial

Non-mandatory leaf node prediction (NMLNP)

labels can end at an internal node

Page 15: Learning with Implicit/Explicit Structures

NMLNP on Tree-structured Hierarchies

max_ψ ∑_{i∈T} h_i ψ_i
s.t.  ψ_i ∈ {0, 1} ∀i ∈ T
      ∑_{i∈T} ψ_i = k
      parent's ψ value must be ≥ that of the child (if a node is labeled positive, its parent must also be labeled positive)

Relax the binary constraint to 1 ≥ ψ_i ≥ 0

Efficient Algorithm

Condensing Sort and Select Algorithm (CSSA) [Baraniuk & Jones, TSP 1994]: originally used in signal processing

time complexity: O(N log N)
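A sketch of the condensing idea under the stated assumptions (tree rooted at a single node, one score h_i per node). The exact bookkeeping of the O(N log N) CSSA implementation may differ, and for simplicity the last supernode is taken whole rather than fractionally.

```python
import heapq

def cssa_tree(h, parent, k):
    """Condensing-sort-and-select sketch for NMLNP on a tree: choose k nodes
    maximizing sum_i h[i] * psi[i] such that a selected node's parent is selected.
    parent[i] = parent of node i, parent[root] = -1.  Supernodes are scored by the
    average h of their members; a supernode whose parent is not yet selected is
    condensed (merged) with the parent's supernode.  Illustrative only."""
    n = len(h)
    root = parent.index(-1)
    rep = list(range(n))                       # union-find representative of each node
    members = {i: [i] for i in range(n)}
    total = {i: float(h[i]) for i in range(n)}
    top = {i: i for i in range(n)}             # member closest to the root
    version = [0] * n

    def find(i):
        while rep[i] != i:
            rep[i] = rep[rep[i]]
            i = rep[i]
        return i

    selected = {root}
    heap = [(-float(h[i]), 0, i) for i in range(n) if i != root]
    heapq.heapify(heap)

    while len(selected) < k and heap:
        _neg_avg, ver, r = heapq.heappop(heap)
        if find(r) != r or ver != version[r] or top[r] in selected:
            continue                           # stale or already-selected entry
        if parent[top[r]] in selected:
            selected.update(members[r])        # whole supernode can be taken
        else:                                  # condense with the parent's supernode
            q = find(parent[top[r]])
            rep[r] = q
            members[q].extend(members[r])
            total[q] += total[r]
            version[q] += 1
            heapq.heappush(heap, (-total[q] / len(members[q]), version[q], q))
    return selected
```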

Page 16: Learning with Implicit/Explicit Structures

Example

Page 17: Learning with Implicit/Explicit Structures

Example...

Page 18: Learning with Implicit/Explicit Structures

NMLNP on DAG-Structured Hierarchies

Idea: merge the supernode with its unassigned parent that has the smallest supernode value

Page 19: Learning with Implicit/Explicit Structures

Example...

Page 20: Learning with Implicit/Explicit Structures

Example...

Page 21: Learning with Implicit/Explicit Structures

Mandatory Leaf Node Prediction

Non-mandatory leaf node prediction (NMLNP)

labels can end at an internal node

Mandatory leaf node prediction (MLNP)

labels must end at a leaf node

Existing MLNP methods are for hierarchical multiclass classification

train a classifier at each node; for a positive parent, recursively label its child having the largest prediction as positive
at each node, exactly one subtree is pursued

Extending this to hierarchical multilabel classification is not easy

at each node, how many and which subtrees to pursue?

Page 22: Learning with Implicit/Explicit Structures

Probabilistic Approach

Notations (consider tree hierarchy first)

training examples (x^(n), y^(n))

x^(n): input; y^(n) = [y_1^(n), . . . , y_|T|^(n)]' ∈ {0, 1}^|T|: multilabel denoting the memberships of x^(n) to each of the nodes

tree structure: y_i = 1 ⇒ y_pa(i) = 1 for any non-root node i

Assume: labels of any group of siblings i_1, i_2, . . . , i_m are conditionally independent, given the label of their parent and x:

p(y_{i_1}, y_{i_2}, . . . , y_{i_m} | y_{pa(i_1)}, x) = ∏_{j=1}^m p(y_{i_j} | y_{pa(i_1)}, x)

popularly used in Bayesian networks and hierarchical multilabel classification

Page 23: Learning with Implicit/Explicit Structures

Maximum a Posteriori MLNP

represent y by a set Ω ⊆ T: y_i = 1 if i ∈ Ω, and 0 otherwise

MAP MLNP: find the Ω* that (1) maximizes the posterior probability p(y_0, . . . , y_|T| | x) and (2) respects T

Ω* = arg max_Ω p(y_Ω = 1, y_Ωc = 0 | x)
s.t. y_0 = 1
     Ω contains no partial path (MLNP)
     the y_i's respect the label hierarchy

Note:

p(y_Ω = 1, y_Ωc = 0 | x) considers all the node labels in the hierarchy simultaneously; cf. existing (multiclass) MLNP methods, which consider the hierarchy information only locally at each node
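As an illustration of the objective being maximized, a small sketch (ours, with an assumed data layout) that evaluates log p(y_Ω = 1, y_Ωc = 0 | x) on a tree from the per-node conditional estimates, using the sibling conditional-independence assumption above and the convention that a negative parent forces its children to be negative.

```python
import math

def log_posterior_tree(Omega, p, parent):
    """log p(y_Omega = 1, y_Omega^c = 0 | x) for a tree hierarchy.
    p[i] = estimated p(y_i = 1 | y_pa(i) = 1, x); parent[i] = parent of node i,
    parent[root] = -1.  Omega must contain the root and be closed under parents."""
    logp = 0.0
    for i in range(len(parent)):
        if parent[i] == -1:
            continue                                  # root: y_0 = 1 by convention
        if parent[i] in Omega:                        # only these factors are informative
            logp += math.log(p[i] if i in Omega else 1.0 - p[i])
        # a negative parent makes the child negative with probability 1 (factor = 1)
    return logp
```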

Page 24: Learning with Implicit/Explicit Structures

Maximum a Posteriori MLNP on Label DAGs

Using the conditional independence simplification

p(y_1, . . . , y_|G| | x) = p(y_0 | x) ∏_{i∈G\{0}} p(y_i | y_Pa(i), x)

Pa(i): the set of (possibly multiple) parents of node i

direct maximization of p(y_1, . . . , y_|G| | x) is difficult

Assume

p(y_1, . . . , y_|G| | x) ∝ p(y_0 | x) ∏_{i∈G\{0}} ∏_{j∈Pa(i)} p(y_i | y_j, x)

composite likelihood (or pseudolikelihood): replace a difficult pdf by a set of marginals or conditionals that are easier to evaluate

here: pairwise conditional likelihood

Page 25: Learning with Implicit/Explicit Structures

MAP MLNP on Label DAGs...

for each node i, define

w_i = ∑_{l∈child(0)} log(1 − p_{l0})                                                  if i = 0
w_i = ∑_{j∈Pa(i)} (log p_{ij} − log(1 − p_{ij}))                                      if i is a leaf
w_i = ∑_{j∈Pa(i)} (log p_{ij} − log(1 − p_{ij})) + ∑_{l∈child(i)} log(1 − p_{li})     otherwise

p_{ij} ≡ p(y_i = 1 | y_j = 1, x) for j ∈ Pa(i)

If we knew that the prediction of x has k leaf labels:

max_ψ ∑_{i∈G} w_i ψ_i
s.t.  ∑_{leaf node i} ψ_i = k
      ψ_0 = 1, ψ_i ∈ {0, 1} ∀i ∈ G
      ∑_{j∈child(i)} ψ_j ≥ 1 for every internal node i with ψ_i = 1
      ψ_i ≤ ψ_j ∀j ∈ Pa(i), ∀i ∈ G\{0}

C(#leaf nodes, k) candidate solutions, so brute force is expensive
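A direct, hedged transcription of the weight formula above into code (the dictionaries p, Pa and children are an assumed representation; clipping the probabilities away from 0 and 1 is our own addition):

```python
import math

def node_weights(p, Pa, children, eps=1e-12):
    """w_i from the per-edge probabilities p[(i, j)] = p(y_i = 1 | y_j = 1, x), j in Pa[i].
    Pa and children are dicts: node -> list of parents / children (Pa[0] = [] for root 0).
    Probabilities are clipped away from 0 and 1 before taking logs."""
    def lg(v):
        return math.log(min(max(v, eps), 1.0 - eps))

    w = {}
    for i in Pa:
        if not Pa[i]:                                  # root node 0
            w[i] = sum(lg(1.0 - p[(l, i)]) for l in children[i])
        else:                                          # leaf or internal non-root node
            w[i] = sum(lg(p[(i, j)]) - lg(1.0 - p[(i, j)]) for j in Pa[i])
            if children[i]:                            # extra term for internal nodes
                w[i] += sum(lg(1.0 - p[(l, i)]) for l in children[i])
    return w
```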

Page 26: Learning with Implicit/Explicit Structures

Nested Approximation Property (NAP)

Definition (k-leaf-sparse)

A multilabel y is k-leaf-sparse if k of the leaf nodes are labeled one

Example

[Figure: example label trees showing a 1-leaf-sparse and a 2-leaf-sparse multilabel]

Page 27: Learning with Implicit/Explicit Structures

Nested Approximation Property (NAP)...

Definition (Nested Approximation Property (NAP))

For a pattern x, let its optimal k-leaf-sparse multilabel be Ω_k. The NAP is satisfied if

{i : i ∈ Ω_k} ⊂ {i : i ∈ Ω_k'} for all k < k'

NAP is often implicitly assumed in many hierarchical classification algorithms

e.g., classification based on threshold tuning: a higher threshold gives the 1-leaf-sparse prediction, a lower threshold the 2-leaf-sparse one

Page 28: Learning with Implicit/Explicit Structures

Solve

max_ψ ∑_{i∈G} w_i ψ_i
s.t.  ∑_{leaf node i} ψ_i = k
      ψ_0 = 1, ψ_i ∈ {0, 1} ∀i ∈ G
      ∑_{j∈child(i)} ψ_j ≥ 1 for every internal node i with ψ_i = 1
      ψ_i ≤ ψ_j ∀j ∈ Pa(i), ∀i ∈ G\{0}

Page 29: Learning with Implicit/Explicit Structures

Example: Find Optimal 1-leaf-sparse Solution

Page 30: Learning with Implicit/Explicit Structures

Example: Find Optimal 2-leaf-sparse Solution

each update takes O(|L| N log N) time

repeat this process until k leaf nodes are selected

total: O(k |L| N log N) time
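To make the incremental construction concrete, here is a simple greedy sketch in the spirit of these nested updates: each step adds the leaf whose not-yet-selected ancestors give the largest total weight, so the k-leaf solution contains the (k-1)-leaf one as NAP requires. It is only illustrative; the talk's O(k|L|N log N) procedure may differ in its exact update.

```python
def ancestors(node, Pa):
    """All ancestors of `node` in a DAG, with Pa[i] = list of parents (root has [])."""
    seen, stack = set(), list(Pa[node])
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(Pa[u])
    return seen

def greedy_mlnp(w, Pa, leaves, k):
    """Nested k-leaf-sparse selection: repeatedly add the leaf whose not-yet-selected
    ancestors (plus the leaf itself) have the largest total weight w, so that the
    j-leaf solution always contains the (j-1)-leaf one.  Returns the whole chain."""
    root = next(i for i in range(len(w)) if not Pa[i])
    selected = {root}
    chain = []
    remaining = set(leaves)
    for _ in range(min(k, len(remaining))):
        best = None
        for l in remaining:
            closure = ({l} | ancestors(l, Pa)) - selected
            gain = sum(w[i] for i in closure)
            if best is None or gain > best[1]:
                best = (l, gain, closure)
        leaf, _gain, closure = best
        selected |= closure
        remaining.discard(leaf)
        chain.append(set(selected))
    return chain
```

Because the whole nested chain Ω_1 ⊂ · · · ⊂ Ω_k is returned, the unknown-k case on the next slide can be handled by one run with k = |L|.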

Page 31: Learning with Implicit/Explicit Structures

Unknown Number of Labels

We assumed that we know the prediction of x has k leaf labels

What if k is not known?

Straightforward approach

run the algorithm with k = 1, . . . , |L| (|L|: number of leaf nodes)

find the Ω_k ∈ {Ω_1, . . . , Ω_|L|} that maximizes the posterior probability (composite likelihood for DAGs)

Recall that Ω_k ⊂ Ω_{k+1} under the NAP assumption

so we can simply set k = |L|, and Ω_i is immediately obtained as the Ω in iteration i

Page 32: Learning with Implicit/Explicit Structures

Experiments

12 functional genomics data sets, each with 2 sets of labels

tree-structured labels: FunCat hierarchy
DAG-structured labels: GO hierarchy

           max #classes   depth   #labels on average
FunCat     500            5       8.8
GO         4134           14      35

Page 33: Learning with Implicit/Explicit Structures

Evaluation

precision-recall (PR) curve

Prec = ∑_i TP_i / (∑_i TP_i + ∑_i FP_i),   Rec = ∑_i TP_i / (∑_i TP_i + ∑_i FN_i)

TP_i / FP_i / FN_i: number of true positives / false positives / false negatives for label i

AUPRC: the larger the better
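A small sketch of these micro-averaged measures and a threshold-sweep approximation of the AUPRC (numpy only; the thresholding scheme is our own simplification):

```python
import numpy as np

def micro_pr(Y_true, scores, threshold):
    """Micro-averaged precision/recall at one threshold, pooling TP/FP/FN over all labels."""
    Y_pred = (scores >= threshold).astype(int)
    tp = np.sum((Y_pred == 1) & (Y_true == 1))
    fp = np.sum((Y_pred == 1) & (Y_true == 0))
    fn = np.sum((Y_pred == 0) & (Y_true == 1))
    prec = tp / (tp + fp) if tp + fp > 0 else 1.0
    rec = tp / (tp + fn) if tp + fn > 0 else 0.0
    return prec, rec

def auprc(Y_true, scores):
    """Area under the micro-averaged PR curve, approximated by sweeping all score thresholds."""
    pts = [micro_pr(Y_true, scores, t) for t in np.unique(scores)]
    pts.sort(key=lambda pr: pr[1])                     # order points by recall
    rec = np.array([r for _, r in pts])
    prec = np.array([p for p, _ in pts])
    return float(np.trapz(prec, rec))                  # trapezoidal area
```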

Page 34: Learning with Implicit/Explicit Structures

NMLNP on Label Trees (FunCat)

CLUS-HMC (Vens et al., MLJ 2008): a decision-tree based method (state-of-the-art)

AUPRC values

data set    CSSA    CLUS-HMC
seq         0.226   0.218
pheno       0.167   0.166
struc       0.194   0.189
hom         0.257   0.254
cellcycle   0.196   0.180
church      0.179   0.178
derisi      0.194   0.183
eisen       0.220   0.212
gasch1      0.216   0.212
gasch2      0.218   0.203
spo         0.216   0.195
expr        0.228   0.218

CSSA outperforms CLUS-HMC on all data sets!

Page 35: Learning with Implicit/Explicit Structures

NMLNP on Label DAGs (GO)

AUPRC values

data set    CSSAG   CLUS-HMC
seq         0.478   0.469
pheno       0.426   0.425
struc       0.455   0.446
hom         0.493   0.481
cellcycle   0.454   0.443
church      0.442   0.436
derisi      0.442   0.440
eisen       0.479   0.454
gasch1      0.468   0.453
gasch2      0.454   0.449
spo         0.442   0.440
expr        0.473   0.453

CSSA outperforms CLUS-HMC on all data sets again!

Page 36: Learning with Implicit/Explicit Structures

MLNP on Label Trees

Compare with

HMC-LP [Cerri et al, IDA 2011]

the only existing algorithm that can perform MLNP on trees (but not on DAGs)

other NMLNP methods (converted for MLNP)

first use the MetaLabeler to predict the number of leaf labels (k) that the test pattern has; then pick the k leaf labels using the different NMLNP methods:

hierarchical SVM (H-SVM) [Cesa-Bianchi et al., JMLR 2006]
Bayesian classifier chain (BCC) [Zaragoza et al., IJCAI 2011]
CLUS-HMC

Page 37: Learning with Implicit/Explicit Structures

MLNP on Label Trees

(H-SVM, BCC and CLUS-HMC are used with the MetaLabeler)

data set            MAT       HMC-LP    H-SVM     BCC       CLUS-HMC
rcv1v2 subset1      0.81 [1]  0.05 [5]  0.80 [2]  0.60 [4]  0.62 [3]
rcv1v2 subset2      0.83 [1]  0.05 [5]  0.81 [2]  0.62 [3]  0.61 [4]
rcv1v2 subset3      0.82 [1]  0.05 [5]  0.81 [2]  0.41 [4]  0.58 [3]
rcv1v2 subset4      0.82 [1]  0.05 [5]  0.81 [2]  0.59 [3]  0.59 [3]
rcv1v2 subset5      0.81 [1]  0.05 [5]  0.80 [2]  0.57 [4]  0.61 [3]
imageclef07a        0.89 [1]  0.01 [5]  0.89 [1]  0.77 [3]  0.65 [4]
imageclef07d        0.86 [1]  0.19 [5]  0.86 [1]  0.82 [3]  0.65 [4]
delicious           0.47 [3]  0.08 [5]  0.50 [2]  0.34 [4]  0.53 [1]
enron               0.72 [1]  0.37 [5]  0.72 [1]  0.67 [4]  0.68 [3]
wipo                0.78 [1]  0.36 [5]  0.78 [1]  0.52 [4]  0.71 [3]
caltech-101         0.77 [2]  0.73 [3]  0.78 [1]  0.52 [4]  -
seq (funcat)        0.29 [1]  0.00 [5]  0.29 [1]  0.16 [4]  0.27 [3]
pheno (funcat)      0.24 [1]  0.02 [5]  0.20 [2]  0.09 [4]  0.18 [3]
struc (funcat)      0.23 [1]  0.00 [5]  0.23 [1]  0.03 [4]  0.20 [3]
hom (funcat)        0.31 [2]  0.01 [5]  0.33 [1]  0.04 [4]  0.23 [3]
cellcycle (funcat)  0.24 [2]  0.01 [5]  0.26 [1]  0.13 [4]  0.19 [3]
church (funcat)     0.19 [2]  0.01 [5]  0.19 [2]  0.06 [4]  0.21 [1]
derisi (funcat)     0.21 [2]  0.01 [5]  0.21 [2]  0.13 [4]  0.22 [1]
eisen (funcat)      0.32 [2]  0.01 [5]  0.35 [1]  0.20 [4]  0.29 [3]
gasch1 (funcat)     0.30 [2]  0.01 [5]  0.33 [1]  0.17 [4]  0.29 [3]
gasch2 (funcat)     0.28 [1]  0.01 [5]  0.28 [1]  0.11 [4]  0.24 [3]

Page 38: Learning with Implicit/Explicit Structures

MLNP on Label DAGs

(H-SVM, BCC and CLUS-HMC are used with the MetaLabeler)

data set         MAS       H-SVM     BCC       CLUS-HMC
seq (GO)         0.61 [1]  0.55 [3]  0.48 [4]  0.57 [2]
pheno (GO)       0.61 [1]  0.60 [2]  0.58 [3]  0.55 [4]
struc (GO)       0.53 [2]  0.47 [3]  0.45 [4]  0.60 [1]
hom (GO)         0.63 [1]  0.56 [3]  0.52 [4]  0.62 [2]
cellcycle (GO)   0.55 [1]  0.50 [3]  0.21 [5]  0.50 [3]
church (GO)      0.55 [1]  0.49 [3]  0.26 [4]  0.54 [2]
derisi (GO)      0.53 [1]  0.49 [2]  0.36 [4]  0.47 [3]
eisen (GO)       0.60 [1]  0.53 [2]  0.49 [4]  0.53 [2]
gasch1 (GO)      0.62 [1]  0.54 [2]  0.49 [3]  0.46 [4]
gasch2 (GO)      0.54 [1]  0.49 [3]  0.41 [4]  0.50 [2]
spo (GO)         0.50 [1]  0.47 [3]  0.48 [2]  0.46 [4]
expr (GO)        0.59 [1]  0.54 [4]  0.50 [5]  0.55 [3]

Page 39: Learning with Implicit/Explicit Structures

Validating the NAP Assumption

use brute-force search to find the best k-leaf-sparse prediction

check if it includes the best (k − 1)-leaf-sparse prediction

% test patterns satisfying NAP at different k:

[Figure: percentage of test instances satisfying NAP at different k, for pheno (FunCat), pheno (GO) and church (GO); the y-axis ranges from 90% to 100%]

NAP holds almost 100% of the time

Page 40: Learning with Implicit/Explicit Structures

Multitask Learning

Page 41: Learning with Implicit/Explicit Structures

Multitask Learning (MTL)

Example

rating of products: each customer is a task

handwritten digit recognition: each digit is a task

Learn all tasks simultaneously, instead of separately

different tasks share some information

better to learn them together, especially when there are insufficient training samples

Page 42: Learning with Implicit/Explicit Structures

Notations

T tasks

task t: training samples (x_1^(t), y_1^(t)), . . . , (x_{n_t}^(t), y_{n_t}^(t))

y^(t) = X^(t) w_t + e

X^(t) ≡ [x_1^(t), . . . , x_{n_t}^(t)]^T
y^(t) ≡ [y_1^(t), . . . , y_{n_t}^(t)]^T
w_t: weight vector of task t

Given a set of T tasks, how to learn them together?

how to estimate W = [w_1, . . . , w_T]?

Page 43: Learning with Implicit/Explicit Structures

Popular Approaches

Pooling

pool all the tasks together, and treat them as one single task

w1 = w2 = · · · = wT

Regularized MTL

all tasks are close to some model w0

w_t = w_0 (common) + v_t (task-specific)

Learning with outlier tasks: robust MTL

assume: there are some outlying wt ’s

Page 44: Learning with Implicit/Explicit Structures

Task Clustering Structure

Tasks have structure!

Learning with clustered tasks: clustered MTL

assume: wt ’s form several clusters

[Figure: task weight structures under clustered MTL, pooling/regularized MTL, and robust MTL]

how many clusters?

all features have the same task clustering structure

Is that a good assumption?

Page 45: Learning with Implicit/Explicit Structures

More Complicated Task Clustering Structure?

Example (movie recommendation)

different features may have different task clustering structures

Page 46: Learning with Implicit/Explicit Structures

Flexible Task Clustering Structure

decompose w_t as u_t + v_t

u_t: clustering center at each feature
v_t: variations specific to each task

U = [u_1, . . . , u_T], V = [v_1, . . . , v_T]

Flexible Task-Clustered MTL (FlexTClus)

min_{U,V} ∑_{t=1}^T ‖y^(t) − X^(t)(u_t + v_t)‖²   (square loss)
          + λ1 ‖U‖clus + λ2 ‖U‖²_F + λ3 ‖V‖²_F   (regularizers)

‖U‖clus = ∑_{d=1}^D ∑_{i<j} |U_di − U_dj|

for each feature: penalizes the pairwise differences between the parameters of tasks i and j, trying to group the parameters of tasks i and j together (cf. k-means clustering)
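A direct transcription of the objective into code, useful for checking a solver; the O(D T²) double loop for ‖U‖clus is written for clarity rather than speed, and the data layout is assumed:

```python
import numpy as np

def clus_norm(U):
    """||U||_clus = sum over features d of sum_{i<j} |U[d, i] - U[d, j]|."""
    D, T = U.shape
    total = 0.0
    for d in range(D):
        for i in range(T):
            for j in range(i + 1, T):
                total += abs(U[d, i] - U[d, j])
    return total

def flextclus_objective(U, V, Xs, ys, lam1, lam2, lam3):
    """Objective value as written above.  Xs[t]: (n_t, D) inputs of task t,
    ys[t]: (n_t,) targets; U, V are D x T with columns u_t, v_t."""
    loss = sum(np.sum((ys[t] - Xs[t] @ (U[:, t] + V[:, t])) ** 2) for t in range(len(Xs)))
    return loss + lam1 * clus_norm(U) + lam2 * np.sum(U ** 2) + lam3 * np.sum(V ** 2)
```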

Page 47: Learning with Implicit/Explicit Structures

Special Cases

min_{U,V} ∑_{t=1}^T ‖y^(t) − X^(t)(u_t + v_t)‖² + λ1‖U‖clus + λ2‖U‖²_F + λ3‖V‖²_F

Special cases

λ1 = λ2 = λ3 = 0: independent least squares regression on each task

λ1 = ∞: regularized MTL

λ1 = 0: independent ridge regression on each task

λ2 ≠ 0, λ3 = 0: independent least squares regression on each task

Page 48: Learning with Implicit/Explicit Structures

Properties

Convergence to ground truth

With high probability,

max_{d=1,...,D} max_{t=1,...,T} |W_dt − W*_dt| ≤ O(1/√n)

W*: ground truth
n: number of samples

U captures the clustering structure at feature level

With high probability, and for sufficiently large n,

(i, j in the same cluster) W*_di = W*_dj ↔ U_di = U_dj

(i, j in different clusters) |W*_di − W*_dj| ≥ ρ ↔ U_di ≠ U_dj

Page 49: Learning with Implicit/Explicit Structures

Optimization

min_{U,V} ∑_{t=1}^T ‖y^(t) − X^(t)(u_t + v_t)‖² + λ1‖U‖clus + λ2‖U‖²_F + λ3‖V‖²_F

How to solve this optimization problem?

Back to basics! Gradient descent

(1/m) ∑_{i=1}^m ℓ(w; x_i, y_i) + λ Ω(w)

LOOP:

1 find descent direction

2 choose stepsize

3 descent

Page 50: Learning with Implicit/Explicit Structures

Gradient Descent

Advantages

easy to implement

low per-iteration complexity, hence good scalability (big data)

Disadvantage

uses first-order (gradient) information

slow convergence rate, so it may require a large number of iterations

Page 51: Learning with Implicit/Explicit Structures

Accelerated Gradient Methods

First developed by Nesterov in 1983

for smooth optimization

min_β f(β)   (f is smooth in β)

Problem: ℓ and/or Ω may be nonsmooth

SVM: hinge loss (nonsmooth) + ‖w‖²_2 (smooth)

lasso: square loss (smooth) + ‖w‖_1 (nonsmooth)

Extension to composite optimization

objective has both smooth and nonsmooth components

min_β f(β) (smooth) + r(β) (nonsmooth)

has recently become popular in machine learning

Page 52: Learning with Implicit/Explicit Structures

Accelerated Gradient Descent

Gradient descent

1 find descent direction

2 choose stepsize

3 descent

FISTA [Beck & Teboulle, 2009]

Q(β, β_t) ≡ (β − β_t)^T ∇f(β_t) + (L/2) ‖β − β_t‖² + r(β)

L: Lipschitz constant of ∇f, i.e., ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖
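To illustrate the composite scheme, a generic FISTA sketch on the lasso (square loss + ℓ1), where the proximal step is plain soft-thresholding; this is the textbook recipe, not the FlexTClus solver shown on the later slides.

```python
import numpy as np

def fista_lasso(X, y, lam, n_iter=200):
    """FISTA on min_beta 0.5 * ||y - X beta||^2 + lam * ||beta||_1: gradient step on the
    smooth part, soft-thresholding as the proximal step for the nonsmooth part, plus
    Nesterov momentum (the standard FISTA tau update)."""
    n, d = X.shape
    L = np.linalg.norm(X, 2) ** 2            # Lipschitz constant of the smooth gradient
    beta = np.zeros(d)
    z = beta.copy()
    tau = 1.0
    for _ in range(n_iter):
        grad = X.T @ (X @ z - y)             # gradient of the smooth part at z
        w = z - grad / L                     # gradient (forward) step
        beta_new = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)   # proximal step
        tau_new = (1.0 + np.sqrt(1.0 + 4.0 * tau ** 2)) / 2.0
        z = beta_new + ((tau - 1.0) / tau_new) * (beta_new - beta)     # momentum step
        beta, tau = beta_new, tau_new
    return beta
```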

Page 53: Learning with Implicit/Explicit Structures

Convergence

Let F (β) = f (β) + r(β)

Fast convergence

F(β_t) − F(β*) ≤ 2L ‖β_0 − β*‖² / (t + 1)²   (F(β*): optimal objective value)

optimal convergence rate

to obtain an ε-optimal solution, i.e., F(β_t) − F(β*) ≤ ε, only O(1/√ε) iterations are needed

cf. gradient descent: convergence rate is O(1/t) for smooth objectives

Page 54: Learning with Implicit/Explicit Structures

Optimization via FISTA

min_{U,V} ∑_{t=1}^T ‖y^(t) − X^(t)(u_t + v_t)‖²   [= f(U, V)]
          + λ1‖U‖clus + λ2‖U‖²_F + λ3‖V‖²_F      [= r(U, V)]

1: Initialize U_1, V_1, τ_1 ← 1.
2: for k = 1, 2, . . . , N − 1 do
3:   Compute Ū = U_k − (1/L_k) ∂_U f(U_k, V_k),  V̄ = V_k − (1/L_k) ∂_V f(U_k, V_k).
4:   U_k ← arg min_U ‖U − Ū‖²_F + (2λ1/L_k) ‖U‖clus + (2λ2/L_k) ‖U‖²_F   (can be solved in O(T log T) time)
5:   V_k ← [ V̄_ij / (1 + 2λ3/L_k) ].
6:   τ_{k+1} ← (1 + √(1 + 4τ_k²)) / 2.
7:   [U_{k+1}; V_{k+1}] ← [U_k; V_k] + ((τ_k − 1)/τ_{k+1}) ([U_k; V_k] − [U_{k−1}; V_{k−1}]).
8: end for
9: Output U_N.

complexity: O((1/√ε)(TDn + DT log T))

Page 55: Learning with Implicit/Explicit Structures

Experiment: Synthetic Data Sets

(C1) all tasks are independent
(C2) all tasks are from the same cluster
(C3) same as C2, but with corrupted features
(C4) a main task cluster plus a few outlier tasks
(C5) tasks in overlapping groups
(C6) all but the last two features are from a common cluster

the clusters obtained are close to the ground truth

Page 56: Learning with Implicit/Explicit Structures

NMSE on Synthetic Data Sets

the proposed model outperforms existing methods when the task clusters are not well formed (C3, C5 and C6)

Page 57: Learning with Implicit/Explicit Structures

Examination Score Prediction

consists of examination records from 139 secondary schools in years 1985, 1986 and 1987

each task is to predict exam scores for students in one school

inputs: year of the exam, gender, school gender, school denomination, etc.

[Figures: NMSE with 10% of the samples for training and with 20% of the samples for training]

FlexTClus has very competitive NMSE

task clustering structure: only one underlying task cluster (consistent with previous studies)

Page 58: Learning with Implicit/Explicit Structures

Rating of Products

ratings of 201 students (tasks) on 20 different personal computers, each described by 13 attributes

root mean squared error

Page 59: Learning with Implicit/Explicit Structures

Task Clustering Structure

one main cluster for the first 12 attributes (related to performance & service)

lots of varied opinions on price

Page 60: Learning with Implicit/Explicit Structures

Handwritten Digit Recognition

10-class classification problem, treated as 10 1-vs-rest problems; PCA to reduce the dimensionality to 64

FlexTClus consistently has the lowest classification error
task clustering structures for the leading PCA features are very different from those of the trailing PCA features

Page 61: Learning with Implicit/Explicit Structures

Conclusion

Structure is everywhere

Hierarchical multilabel classification

(label) structure is known
considered both mandatory and non-mandatory leaf node predictions
algorithms are efficient
can be used on label hierarchies of both trees and DAGs

Multitask learning

(task clustering) structure is unknown
captures task structures at the feature level
can be solved efficiently by an accelerated proximal method
better accuracy; the obtained task structures agree with the known/plausible properties of the data

Page 62: Learning with Implicit/Explicit Structures