Learning with Implicit/Explicit Structures


Page 1: Learning with Implicit/Explicit Structures


Learning with Implicit/Explicit Structures

James Kwok

Department of Computer Science and Engineering, Hong Kong University of Science and Technology

Hong Kong

Chinese Workshop on Machine Learning & Applications

Page 2: Learning with Implicit/Explicit Structures

Structure is Everywhere

Data have structure

Example (structured data)

DNA (strings), natural language (trees), molecules (graphs)

various kernels have been defined for these structured data

structure is about the input

Example (structured sparsity)

prefers certain sparsity patterns of the parameter vector, not just a small number of nonzero coefficients

Page 3: Learning with Implicit/Explicit Structures

Structure is Everywhere...

Outputs can also have structure

Example (hierarchical classification)

instance labels reside in a known tree- or DAG-structured hierarchy

structure is explicit (e.g., yahoo hierarchy)

Structure can be implicit

Example (multitask learning)

tasks may have an underlying clustering structure

may have to be discovered by the learning algorithm

Page 4: Learning with Implicit/Explicit Structures

Hierarchical Multilabel Classification

Page 5: Learning with Implicit/Explicit Structures

Multiclass vs Multilabel Classification

Example (multiclass classification)

an instance can have only one label

Example (multilabel classification)

an instance can have more than one label

image tagging

tags: elephant, jungle and africa

Page 6: Learning with Implicit/Explicit Structures

Multilabel Classification

Example

video tagging

text categorization

gene function analysis in bioinformatics

Are these labels independent?

NO! Labels often have structure!

Page 7: Learning with Implicit/Explicit Structures

Hierarchical Classification

Example (text classification)

labels may be organized in a known tree-structured hierarchy

Yahoo! taxonomy (as of 2004): a 16-level hierarchy

if an instance is associated with the label of a node, it is also associated with the node's parent

Page 8: Learning with Implicit/Explicit Structures

Hierarchical Classification...

More generally, label hierarchy is a directed acyclic graph (DAG)

Example (bioinformatics: Gene Ontology (GO))

Genome annotation

a node can have multiple parents

if a node is positive, all its parents must be positive

Consider this label hierarchy information in making predictions

Page 9: Learning with Implicit/Explicit Structures

Training

train estimators for p(y_i = 1 | y_pa(i) = 1, x) at each node i

many possible estimation methods

Example

at each node i: train a binary SVM using those training examples whose parent node pa(i) is labeled positive

convert the SVM output into a probability estimate using Platt's procedure
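A minimal sketch of this training step (our own illustration, not the talk's code): scikit-learn's SVC with probability=True fits a binary SVM and applies Platt scaling internally; the data layout (a 0/1 label matrix with one column per node, plus a parent array) is assumed.

```python
import numpy as np
from sklearn.svm import SVC

def train_node_estimators(X, Y, parent):
    """Train one probabilistic binary classifier per non-root node of the hierarchy.

    X: (n_samples, n_features) inputs
    Y: (n_samples, n_nodes) 0/1 label matrix, column 0 = root (always 1)
    parent: parent[i] = parent of node i (parent[0] = -1)
    Returns: dict node -> fitted SVC estimating p(y_i = 1 | y_pa(i) = 1, x).
    """
    estimators = {}
    for i in range(1, Y.shape[1]):
        mask = Y[:, parent[i]] == 1                    # examples whose parent is positive
        if mask.sum() == 0 or len(np.unique(Y[mask, i])) < 2:
            continue                                   # not enough data to fit this node
        clf = SVC(kernel="linear", probability=True)   # Platt scaling done internally
        clf.fit(X[mask], Y[mask, i])
        estimators[i] = clf
    return estimators

def node_probabilities(estimators, x):
    """p(y_i = 1 | y_pa(i) = 1, x) for every trained node, for a single input x."""
    x = np.asarray(x).reshape(1, -1)
    return {i: float(clf.predict_proba(x)[0, list(clf.classes_).index(1)])
            for i, clf in estimators.items()}
```

The returned per-node probabilities play the role of the h_i and p_ij estimates used in the prediction steps on the later slides.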

Page 10: Learning with Implicit/Explicit Structures

Potentially Large Number of Labels

Example

Flickr (as of 2010): > 20 million unique tags

humans can recognize 10,000-100,000 unique object classes

Yahoo! taxonomy (as of 2004): nearly 300,000 categories

How to reduce the number of learning problems?

Page 11: Learning with Implicit/Explicit Structures

Projection

1 project the long label vector to a low-dimensional vector, e.g., by PCA or a randomized matrix

2 learn a mapping from input to each projected dimension

3 predict and reconstruct the label vector in the original space

Advantages

number of learning problems = dimension in projected space

flexible: any regressor can be used in the learning step
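A minimal end-to-end sketch of this recipe, assuming a random Gaussian projection and ridge regression (any projection and any regressor would do); all function and variable names here are our own.

```python
import numpy as np

def compress_labels(Y, m, seed=0):
    """Project an n x L 0/1 label matrix to n x m with a random Gaussian matrix P."""
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((Y.shape[1], m)) / np.sqrt(m)
    return Y @ P, P

def fit_ridge(X, Z, lam=1.0):
    """One ridge regressor per projected dimension (all solved jointly in closed form)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Z)   # (d, m) weight matrix

def predict_topk(X, W, P, k):
    """Predict in the projected space, map back to label space, keep the k largest scores."""
    H = X @ W @ P.T                                    # (n, L) reconstructed label scores
    idx = np.argsort(-H, axis=1)[:, :k]
    Y_hat = np.zeros_like(H, dtype=int)
    np.put_along_axis(Y_hat, idx, 1, axis=1)
    return Y_hat, H
```

The last function already performs the simple unstructured prediction on the next slide: reconstruct the label scores h and keep the k largest entries.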

Page 12: Learning with Implicit/Explicit Structures

Prediction: Simple Case

Suppose it is known that the test sample has k labels and that the labels are unstructured. How do we obtain the prediction from the vector of estimates h?

Simply pick the k largest entries in h!

max_ψ ∑_i h_i ψ_i
s.t.  ψ_i ∈ {0, 1} ∀i (indicator)
      ∑_i ψ_i = k

ψ_i = 1: node i is selected; 0 otherwise

Page 13: Learning with Implicit/Explicit Structures

What if the Labels are Structured?

can no longer simply pick the k largest entries

Page 14: Learning with Implicit/Explicit Structures

Mandatory vs Non-mandatory Leaf Node Prediction

Mandatory leaf node prediction (MLNP)

labels must end at leaf nodes

Example

leaf nodes are the objects of interest (they have stronger semantic/biological meanings)

e.g., taxonomies of musical signals and genes

when the label hierarchy is learned from the data, the internal nodes are only artificial

Non-mandatory leaf node prediction (NMLNP)

labels can end at an internal node

Page 15: Learning with Implicit/Explicit Structures

NMLNP on Tree-structured Hierarchies

max_ψ ∑_{i∈T} h_i ψ_i
s.t.  ψ_i ∈ {0, 1} ∀i ∈ T
      ∑_{i∈T} ψ_i = k
      parent's ψ value must be ≥ that of the child (if a node is labeled positive, its parent must also be labeled positive)

Relax the binary constraint to 1 ≥ ψ_i ≥ 0

Efficient Algorithm

Condensing Sort and Select Algorithm (CSSA) [Baraniuk & Jones, TSP 1994]: originally used in signal processing

time complexity: O(N log N)
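A sketch of the condensing idea under the stated assumptions (tree rooted at a single node, one score h_i per node). The exact bookkeeping of the O(N log N) CSSA implementation may differ, and for simplicity the last supernode is taken whole rather than fractionally.

```python
import heapq

def cssa_tree(h, parent, k):
    """Condensing-sort-and-select sketch for NMLNP on a tree: choose k nodes
    maximizing sum_i h[i] * psi[i] such that a selected node's parent is selected.
    parent[i] = parent of node i, parent[root] = -1.  Supernodes are scored by the
    average h of their members; a supernode whose parent is not yet selected is
    condensed (merged) with the parent's supernode.  Illustrative only."""
    n = len(h)
    root = parent.index(-1)
    rep = list(range(n))                       # union-find representative of each node
    members = {i: [i] for i in range(n)}
    total = {i: float(h[i]) for i in range(n)}
    top = {i: i for i in range(n)}             # member closest to the root
    version = [0] * n

    def find(i):
        while rep[i] != i:
            rep[i] = rep[rep[i]]
            i = rep[i]
        return i

    selected = {root}
    heap = [(-float(h[i]), 0, i) for i in range(n) if i != root]
    heapq.heapify(heap)

    while len(selected) < k and heap:
        _neg_avg, ver, r = heapq.heappop(heap)
        if find(r) != r or ver != version[r] or top[r] in selected:
            continue                           # stale or already-selected entry
        if parent[top[r]] in selected:
            selected.update(members[r])        # whole supernode can be taken
        else:                                  # condense with the parent's supernode
            q = find(parent[top[r]])
            rep[r] = q
            members[q].extend(members[r])
            total[q] += total[r]
            version[q] += 1
            heapq.heappush(heap, (-total[q] / len(members[q]), version[q], q))
    return selected
```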

Page 16: Learning with Implicit/Explicit Structures

Example

Page 17: Learning with Implicit/Explicit Structures

Example...

Page 18: Learning with Implicit/Explicit Structures

NMLNP on DAG-Structured Hierarchies

Idea: merge the supernode with its unassigned parent that has the smallest supernode value

Page 19: Learning with Implicit/Explicit Structures

Example...

Page 20: Learning with Implicit/Explicit Structures

Example...

Page 21: Learning with Implicit/Explicit Structures

Mandatory Leaf Node Prediction

Non-mandatory leaf node prediction (NMLNP)

labels can end at an internal node

Mandatory leaf node prediction (MLNP)

labels must end at a leaf node

Existing MLNP methods are for hierarchical multiclass classification

train a classifier at each node; for a positive parent, recursively label its child having the largest prediction as positive
at each node, exactly one subtree is pursued

Extending this to hierarchical multilabel classification is not easy

at each node, how many and which subtrees to pursue?

Page 22: Learning with Implicit/Explicit Structures

Probabilistic Approach

Notations (consider tree hierarchy first)

training examples (x^(n), y^(n))

x^(n): input; y^(n) = [y_1^(n), . . . , y_|T|^(n)]' ∈ {0, 1}^|T|: multilabel denoting the memberships of x^(n) to each of the nodes

tree structure: y_i = 1 ⇒ y_pa(i) = 1 for any non-root node i

Assume: labels of any group of siblings i_1, i_2, . . . , i_m are conditionally independent, given the label of their parent and x:

p(y_{i_1}, y_{i_2}, . . . , y_{i_m} | y_{pa(i_1)}, x) = ∏_{j=1}^m p(y_{i_j} | y_{pa(i_1)}, x)

popularly used in Bayesian networks and hierarchical multilabel classification

Page 23: Learning with Implicit/Explicit Structures

Maximum a Posteriori MLNP

represent y by a set Ω ⊆ T: y_i = 1 if i ∈ Ω, and 0 otherwise

MAP MLNP: find the Ω* that (1) maximizes the posterior probability p(y_0, . . . , y_|T| | x) and (2) respects T

Ω* = arg max_Ω p(y_Ω = 1, y_Ωc = 0 | x)
s.t. y_0 = 1
     Ω contains no partial path (MLNP)
     the y_i's respect the label hierarchy

Note:

p(y_Ω = 1, y_Ωc = 0 | x) considers all the node labels in the hierarchy simultaneously; cf. existing (multiclass) MLNP methods, which consider the hierarchy information only locally at each node
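As an illustration of the objective being maximized, a small sketch (ours, with an assumed data layout) that evaluates log p(y_Ω = 1, y_Ωc = 0 | x) on a tree from the per-node conditional estimates, using the sibling conditional-independence assumption above and the convention that a negative parent forces its children to be negative.

```python
import math

def log_posterior_tree(Omega, p, parent):
    """log p(y_Omega = 1, y_Omega^c = 0 | x) for a tree hierarchy.
    p[i] = estimated p(y_i = 1 | y_pa(i) = 1, x); parent[i] = parent of node i,
    parent[root] = -1.  Omega must contain the root and be closed under parents."""
    logp = 0.0
    for i in range(len(parent)):
        if parent[i] == -1:
            continue                                  # root: y_0 = 1 by convention
        if parent[i] in Omega:                        # only these factors are informative
            logp += math.log(p[i] if i in Omega else 1.0 - p[i])
        # a negative parent makes the child negative with probability 1 (factor = 1)
    return logp
```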

Page 24: Learning with Implicit/Explicit Structures

Maximum a Posteriori MLNP on Label DAGs

Using the conditional independence simplification

p(y_1, . . . , y_|G| | x) = p(y_0 | x) ∏_{i∈G\{0}} p(y_i | y_Pa(i), x)

Pa(i): the set of (possibly multiple) parents of node i

direct maximization of p(y_1, . . . , y_|G| | x) is difficult

Assume

p(y_1, . . . , y_|G| | x) ∝ p(y_0 | x) ∏_{i∈G\{0}} ∏_{j∈Pa(i)} p(y_i | y_j, x)

composite likelihood (or pseudolikelihood): replace a difficult pdf by a set of marginals or conditionals that are easier to evaluate

here: pairwise conditional likelihood

Page 25: Learning with Implicit/Explicit Structures

MAP MLNP on Label DAGs...

for each node i, define

w_i = ∑_{l∈child(0)} log(1 − p_{l0})                                                  if i = 0
w_i = ∑_{j∈Pa(i)} (log p_{ij} − log(1 − p_{ij}))                                      if i is a leaf
w_i = ∑_{j∈Pa(i)} (log p_{ij} − log(1 − p_{ij})) + ∑_{l∈child(i)} log(1 − p_{li})     otherwise

p_{ij} ≡ p(y_i = 1 | y_j = 1, x) for j ∈ Pa(i)

If we knew that the prediction of x has k leaf labels:

max_ψ ∑_{i∈G} w_i ψ_i
s.t.  ∑_{leaf node i} ψ_i = k
      ψ_0 = 1, ψ_i ∈ {0, 1} ∀i ∈ G
      ∑_{j∈child(i)} ψ_j ≥ 1 for every internal node i with ψ_i = 1
      ψ_i ≤ ψ_j ∀j ∈ Pa(i), ∀i ∈ G\{0}

C(#leaf nodes, k) candidate solutions, so brute force is expensive
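A direct, hedged transcription of the weight formula above into code (the dictionaries p, Pa and children are an assumed representation; clipping the probabilities away from 0 and 1 is our own addition):

```python
import math

def node_weights(p, Pa, children, eps=1e-12):
    """w_i from the per-edge probabilities p[(i, j)] = p(y_i = 1 | y_j = 1, x), j in Pa[i].
    Pa and children are dicts: node -> list of parents / children (Pa[0] = [] for root 0).
    Probabilities are clipped away from 0 and 1 before taking logs."""
    def lg(v):
        return math.log(min(max(v, eps), 1.0 - eps))

    w = {}
    for i in Pa:
        if not Pa[i]:                                  # root node 0
            w[i] = sum(lg(1.0 - p[(l, i)]) for l in children[i])
        else:                                          # leaf or internal non-root node
            w[i] = sum(lg(p[(i, j)]) - lg(1.0 - p[(i, j)]) for j in Pa[i])
            if children[i]:                            # extra term for internal nodes
                w[i] += sum(lg(1.0 - p[(l, i)]) for l in children[i])
    return w
```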

Page 26: Learning with Implicit/Explicit Structures

Nested Approximation Property (NAP)

Definition (k-leaf-sparse)

A multilabel y is k-leaf-sparse if k of the leaf nodes are labeled one

Example

[Figure: example label trees showing a 1-leaf-sparse and a 2-leaf-sparse multilabel]

Page 27: Learning with Implicit/Explicit Structures

Nested Approximation Property (NAP)...

Definition (Nested Approximation Property (NAP))

For a pattern x, let its optimal k-leaf-sparse multilabel be Ω_k. The NAP is satisfied if

{i : i ∈ Ω_k} ⊂ {i : i ∈ Ω_k'} for all k < k'

NAP is often implicitly assumed in many hierarchical classification algorithms

e.g., classification based on threshold tuning: a higher threshold gives the 1-leaf-sparse prediction, a lower threshold the 2-leaf-sparse one

Page 28: Learning with Implicit/Explicit Structures

Solve

max_ψ ∑_{i∈G} w_i ψ_i
s.t.  ∑_{leaf node i} ψ_i = k
      ψ_0 = 1, ψ_i ∈ {0, 1} ∀i ∈ G
      ∑_{j∈child(i)} ψ_j ≥ 1 for every internal node i with ψ_i = 1
      ψ_i ≤ ψ_j ∀j ∈ Pa(i), ∀i ∈ G\{0}

Page 29: Learning with Implicit/Explicit Structures

Example: Find Optimal 1-leaf-sparse Solution

Page 30: Learning with Implicit/Explicit Structures

Example: Find Optimal 2-leaf-sparse Solution

each update takes O(|L| N log N) time

repeat this process until k leaf nodes are selected

total: O(k |L| N log N) time
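To make the incremental construction concrete, here is a simple greedy sketch in the spirit of these nested updates: each step adds the leaf whose not-yet-selected ancestors give the largest total weight, so the k-leaf solution contains the (k-1)-leaf one as NAP requires. It is only illustrative; the talk's O(k|L|N log N) procedure may differ in its exact update.

```python
def ancestors(node, Pa):
    """All ancestors of `node` in a DAG, with Pa[i] = list of parents (root has [])."""
    seen, stack = set(), list(Pa[node])
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(Pa[u])
    return seen

def greedy_mlnp(w, Pa, leaves, k):
    """Nested k-leaf-sparse selection: repeatedly add the leaf whose not-yet-selected
    ancestors (plus the leaf itself) have the largest total weight w, so that the
    j-leaf solution always contains the (j-1)-leaf one.  Returns the whole chain."""
    root = next(i for i in range(len(w)) if not Pa[i])
    selected = {root}
    chain = []
    remaining = set(leaves)
    for _ in range(min(k, len(remaining))):
        best = None
        for l in remaining:
            closure = ({l} | ancestors(l, Pa)) - selected
            gain = sum(w[i] for i in closure)
            if best is None or gain > best[1]:
                best = (l, gain, closure)
        leaf, _gain, closure = best
        selected |= closure
        remaining.discard(leaf)
        chain.append(set(selected))
    return chain
```

Because the whole nested chain Ω_1 ⊂ · · · ⊂ Ω_k is returned, the unknown-k case on the next slide can be handled by one run with k = |L|.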

Page 31: Learning with Implicit/Explicit Structures

Unknown Number of Labels

We assumed that we know the prediction of x has k leaf labels

What if k is not known?

Straightforward approach

run the algorithm with k = 1, . . . , |L| (|L|: number of leaf nodes)

find the Ω_k ∈ {Ω_1, . . . , Ω_|L|} that maximizes the posterior probability (composite likelihood for DAGs)

Recall that Ω_k ⊂ Ω_{k+1} under the NAP assumption

so we can simply set k = |L|, and Ω_i is immediately obtained as the Ω in iteration i

Page 32: Learning with Implicit/Explicit Structures

Experiments

12 functional genomics data sets, each with 2 sets of labels

tree-structured labels: FunCat hierarchy
DAG-structured labels: GO hierarchy

           max #classes   depth   #labels on average
FunCat     500            5       8.8
GO         4134           14      35

Page 33: Learning with Implicit/Explicit Structures

Evaluation

precision-recall (PR) curve

Prec = ∑_i TP_i / (∑_i TP_i + ∑_i FP_i),   Rec = ∑_i TP_i / (∑_i TP_i + ∑_i FN_i)

TP_i / FP_i / FN_i: number of true positives / false positives / false negatives for label i

AUPRC: the larger the better
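A small sketch of these micro-averaged measures and a threshold-sweep approximation of the AUPRC (numpy only; the thresholding scheme is our own simplification):

```python
import numpy as np

def micro_pr(Y_true, scores, threshold):
    """Micro-averaged precision/recall at one threshold, pooling TP/FP/FN over all labels."""
    Y_pred = (scores >= threshold).astype(int)
    tp = np.sum((Y_pred == 1) & (Y_true == 1))
    fp = np.sum((Y_pred == 1) & (Y_true == 0))
    fn = np.sum((Y_pred == 0) & (Y_true == 1))
    prec = tp / (tp + fp) if tp + fp > 0 else 1.0
    rec = tp / (tp + fn) if tp + fn > 0 else 0.0
    return prec, rec

def auprc(Y_true, scores):
    """Area under the micro-averaged PR curve, approximated by sweeping all score thresholds."""
    pts = [micro_pr(Y_true, scores, t) for t in np.unique(scores)]
    pts.sort(key=lambda pr: pr[1])                     # order points by recall
    rec = np.array([r for _, r in pts])
    prec = np.array([p for p, _ in pts])
    return float(np.trapz(prec, rec))                  # trapezoidal area
```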

Page 34: Learning with Implicit/Explicit Structures

NMLNP on Label Trees (FunCat)

CLUS-HMC (Vens et al., MLJ 2008): a decision-tree based method (state-of-the-art)

AUPRC values

data set    CSSA    CLUS-HMC
seq         0.226   0.218
pheno       0.167   0.166
struc       0.194   0.189
hom         0.257   0.254
cellcycle   0.196   0.180
church      0.179   0.178
derisi      0.194   0.183
eisen       0.220   0.212
gasch1      0.216   0.212
gasch2      0.218   0.203
spo         0.216   0.195
expr        0.228   0.218

CSSA outperforms CLUS-HMC on all data sets!

Page 35: Learning with Implicit/Explicit Structures

NMLNP on Label DAGs (GO)

AUPRC values

data set    CSSAG   CLUS-HMC
seq         0.478   0.469
pheno       0.426   0.425
struc       0.455   0.446
hom         0.493   0.481
cellcycle   0.454   0.443
church      0.442   0.436
derisi      0.442   0.440
eisen       0.479   0.454
gasch1      0.468   0.453
gasch2      0.454   0.449
spo         0.442   0.440
expr        0.473   0.453

CSSA outperforms CLUS-HMC on all data sets again!

Page 36: Learning with Implicit/Explicit Structures

MLNP on Label Trees

Compare with

HMC-LP [Cerri et al, IDA 2011]

the only existing algorithm that can perform MLNP on trees (but not on DAGs)

other NMLNP methods (converted for MLNP)

first use the MetaLabeler to predict the number of leaf labels (k) that the test pattern has; then pick the k leaf labels using the different NMLNP methods:

hierarchical SVM (H-SVM) [Cesa-Bianchi et al., JMLR 2006]
Bayesian classifier chain (BCC) [Zaragoza et al., IJCAI 2011]
CLUS-HMC

Page 37: Learning with Implicit/Explicit Structures

MLNP on Label Trees

(H-SVM, BCC and CLUS-HMC are used with the MetaLabeler)

data set            MAT       HMC-LP    H-SVM     BCC       CLUS-HMC
rcv1v2 subset1      0.81 [1]  0.05 [5]  0.80 [2]  0.60 [4]  0.62 [3]
rcv1v2 subset2      0.83 [1]  0.05 [5]  0.81 [2]  0.62 [3]  0.61 [4]
rcv1v2 subset3      0.82 [1]  0.05 [5]  0.81 [2]  0.41 [4]  0.58 [3]
rcv1v2 subset4      0.82 [1]  0.05 [5]  0.81 [2]  0.59 [3]  0.59 [3]
rcv1v2 subset5      0.81 [1]  0.05 [5]  0.80 [2]  0.57 [4]  0.61 [3]
imageclef07a        0.89 [1]  0.01 [5]  0.89 [1]  0.77 [3]  0.65 [4]
imageclef07d        0.86 [1]  0.19 [5]  0.86 [1]  0.82 [3]  0.65 [4]
delicious           0.47 [3]  0.08 [5]  0.50 [2]  0.34 [4]  0.53 [1]
enron               0.72 [1]  0.37 [5]  0.72 [1]  0.67 [4]  0.68 [3]
wipo                0.78 [1]  0.36 [5]  0.78 [1]  0.52 [4]  0.71 [3]
caltech-101         0.77 [2]  0.73 [3]  0.78 [1]  0.52 [4]  -
seq (funcat)        0.29 [1]  0.00 [5]  0.29 [1]  0.16 [4]  0.27 [3]
pheno (funcat)      0.24 [1]  0.02 [5]  0.20 [2]  0.09 [4]  0.18 [3]
struc (funcat)      0.23 [1]  0.00 [5]  0.23 [1]  0.03 [4]  0.20 [3]
hom (funcat)        0.31 [2]  0.01 [5]  0.33 [1]  0.04 [4]  0.23 [3]
cellcycle (funcat)  0.24 [2]  0.01 [5]  0.26 [1]  0.13 [4]  0.19 [3]
church (funcat)     0.19 [2]  0.01 [5]  0.19 [2]  0.06 [4]  0.21 [1]
derisi (funcat)     0.21 [2]  0.01 [5]  0.21 [2]  0.13 [4]  0.22 [1]
eisen (funcat)      0.32 [2]  0.01 [5]  0.35 [1]  0.20 [4]  0.29 [3]
gasch1 (funcat)     0.30 [2]  0.01 [5]  0.33 [1]  0.17 [4]  0.29 [3]
gasch2 (funcat)     0.28 [1]  0.01 [5]  0.28 [1]  0.11 [4]  0.24 [3]

Page 38: Learning with Implicit/Explicit Structures

MLNP on Label DAGs

(H-SVM, BCC and CLUS-HMC are used with the MetaLabeler)

data set         MAS       H-SVM     BCC       CLUS-HMC
seq (GO)         0.61 [1]  0.55 [3]  0.48 [4]  0.57 [2]
pheno (GO)       0.61 [1]  0.60 [2]  0.58 [3]  0.55 [4]
struc (GO)       0.53 [2]  0.47 [3]  0.45 [4]  0.60 [1]
hom (GO)         0.63 [1]  0.56 [3]  0.52 [4]  0.62 [2]
cellcycle (GO)   0.55 [1]  0.50 [3]  0.21 [5]  0.50 [3]
church (GO)      0.55 [1]  0.49 [3]  0.26 [4]  0.54 [2]
derisi (GO)      0.53 [1]  0.49 [2]  0.36 [4]  0.47 [3]
eisen (GO)       0.60 [1]  0.53 [2]  0.49 [4]  0.53 [2]
gasch1 (GO)      0.62 [1]  0.54 [2]  0.49 [3]  0.46 [4]
gasch2 (GO)      0.54 [1]  0.49 [3]  0.41 [4]  0.50 [2]
spo (GO)         0.50 [1]  0.47 [3]  0.48 [2]  0.46 [4]
expr (GO)        0.59 [1]  0.54 [4]  0.50 [5]  0.55 [3]

Page 39: Learning with Implicit/Explicit Structures

Validating the NAP Assumption

use brute-force search to find the best k-leaf-sparse prediction

check if it includes the best (k − 1)-leaf-sparse prediction

% test patterns satisfying NAP at different k:

[Figure: percentage of test instances satisfying NAP at different k, for pheno (FunCat), pheno (GO) and church (GO); the y-axis ranges from 90% to 100%]

NAP holds almost 100% of the time

Page 40: Learning with Implicit/Explicit Structures

Multitask Learning

Page 41: Learning with Implicit/Explicit Structures

Multitask Learning (MTL)

Example

rating of products: each customer is a task

handwritten digit recognition: each digit is a task

Learn all tasks simultaneously, instead of separately

different tasks share some information

better to learn them together, especially when there are insufficient training samples

Page 42: Learning with Implicit/Explicit Structures

Notations

T tasks

task t: training samples (x_1^(t), y_1^(t)), . . . , (x_{n_t}^(t), y_{n_t}^(t))

y^(t) = X^(t) w_t + e

X^(t) ≡ [x_1^(t), . . . , x_{n_t}^(t)]^T
y^(t) ≡ [y_1^(t), . . . , y_{n_t}^(t)]^T
w_t: weight vector of task t

Given a set of T tasks, how to learn them together?

how to estimate W = [w_1, . . . , w_T]?

Page 43: Learning with Implicit/Explicit Structures

Popular Approaches

Pooling

pool all the tasks together, and treat them as one single task

w1 = w2 = · · · = wT

Regularized MTL

all tasks are close to some model w0

w_t = w_0 (common) + v_t (task-specific)

Learning with outlier tasks: robust MTL

assume: there are some outlying wt ’s

Page 44: Learning with Implicit/Explicit Structures

Task Clustering Structure

Tasks have structure!

Learning with clustered tasks: clustered MTL

assume: wt ’s form several clusters

[Figure: task weight structures under clustered MTL, pooling/regularized MTL, and robust MTL]

how many clusters?

all features have the same task clustering structure

Is that a good assumption?

Page 45: Learning with Implicit/Explicit Structures

More Complicated Task Clustering Structure?

Example (movie recommendation)

different features may have different task clustering structures

Page 46: Learning with Implicit/Explicit Structures

Flexible Task Clustering Structure

decompose w_t as u_t + v_t

u_t: clustering center at each feature
v_t: variations specific to each task

U = [u_1, . . . , u_T], V = [v_1, . . . , v_T]

Flexible Task-Clustered MTL (FlexTClus)

min_{U,V} ∑_{t=1}^T ‖y^(t) − X^(t)(u_t + v_t)‖²   (square loss)
          + λ1 ‖U‖clus + λ2 ‖U‖²_F + λ3 ‖V‖²_F   (regularizers)

‖U‖clus = ∑_{d=1}^D ∑_{i<j} |U_di − U_dj|

for each feature: penalizes the pairwise differences between the parameters of tasks i and j, trying to group the parameters of tasks i and j together (cf. k-means clustering)
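A direct transcription of the objective into code, useful for checking a solver; the O(D T²) double loop for ‖U‖clus is written for clarity rather than speed, and the data layout is assumed:

```python
import numpy as np

def clus_norm(U):
    """||U||_clus = sum over features d of sum_{i<j} |U[d, i] - U[d, j]|."""
    D, T = U.shape
    total = 0.0
    for d in range(D):
        for i in range(T):
            for j in range(i + 1, T):
                total += abs(U[d, i] - U[d, j])
    return total

def flextclus_objective(U, V, Xs, ys, lam1, lam2, lam3):
    """Objective value as written above.  Xs[t]: (n_t, D) inputs of task t,
    ys[t]: (n_t,) targets; U, V are D x T with columns u_t, v_t."""
    loss = sum(np.sum((ys[t] - Xs[t] @ (U[:, t] + V[:, t])) ** 2) for t in range(len(Xs)))
    return loss + lam1 * clus_norm(U) + lam2 * np.sum(U ** 2) + lam3 * np.sum(V ** 2)
```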

Page 47: Learning with Implicit/Explicit Structures

Special Cases

min_{U,V} ∑_{t=1}^T ‖y^(t) − X^(t)(u_t + v_t)‖² + λ1‖U‖clus + λ2‖U‖²_F + λ3‖V‖²_F

Special cases

λ1 = λ2 = λ3 = 0: independent least squares regression on each task

λ1 = ∞: regularized MTL

λ1 = 0: independent ridge regression on each task

λ2 ≠ 0, λ3 = 0: independent least squares regression on each task

Page 48: Learning with Implicit/Explicit Structures

Properties

Convergence to ground truth

With high probability,

max_{d=1,...,D} max_{t=1,...,T} |W_dt − W*_dt| ≤ O(1/√n)

W*: ground truth
n: number of samples

U captures the clustering structure at feature level

With high probability, and for sufficiently large n,

(i, j in the same cluster) W*_di = W*_dj ↔ U_di = U_dj

(i, j in different clusters) |W*_di − W*_dj| ≥ ρ ↔ U_di ≠ U_dj

Page 49: Learning with Implicit/Explicit Structures

Optimization

min_{U,V} ∑_{t=1}^T ‖y^(t) − X^(t)(u_t + v_t)‖² + λ1‖U‖clus + λ2‖U‖²_F + λ3‖V‖²_F

How to solve this optimization problem?

Back to basics! Gradient descent

(1/m) ∑_{i=1}^m ℓ(w; x_i, y_i) + λ Ω(w)

LOOP:

1 find descent direction

2 choose stepsize

3 descent

Page 50: Learning with Implicit/Explicit Structures

Gradient Descent

Advantages

easy to implement

low per-iteration complexity, hence good scalability (big data)

Disadvantage

uses first-order (gradient) information

slow convergence rate, so it may require a large number of iterations

Page 51: Learning with Implicit/Explicit Structures

Accelerated Gradient Methods

First developed by Nesterov in 1983

for smooth optimization

min_β f(β)   (f is smooth in β)

Problem: ℓ and/or Ω may be nonsmooth

SVM: hinge loss (nonsmooth) + ‖w‖²_2 (smooth)

lasso: square loss (smooth) + ‖w‖_1 (nonsmooth)

Extension to composite optimization

objective has both smooth and nonsmooth components

min_β f(β) (smooth) + r(β) (nonsmooth)

has recently become popular in machine learning

Page 52: Learning with Implicit/Explicit Structures

Accelerated Gradient Descent

Gradient descent

1 find descent direction

2 choose stepsize

3 descent

FISTA [Beck & Teboulle, 2009]

Q(β, β_t) ≡ (β − β_t)^T ∇f(β_t) + (L/2) ‖β − β_t‖² + r(β)

L: Lipschitz constant of ∇f, i.e., ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖
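To illustrate the composite scheme, a generic FISTA sketch on the lasso (square loss + ℓ1), where the proximal step is plain soft-thresholding; this is the textbook recipe, not the FlexTClus solver shown on the later slides.

```python
import numpy as np

def fista_lasso(X, y, lam, n_iter=200):
    """FISTA on min_beta 0.5 * ||y - X beta||^2 + lam * ||beta||_1: gradient step on the
    smooth part, soft-thresholding as the proximal step for the nonsmooth part, plus
    Nesterov momentum (the standard FISTA tau update)."""
    n, d = X.shape
    L = np.linalg.norm(X, 2) ** 2            # Lipschitz constant of the smooth gradient
    beta = np.zeros(d)
    z = beta.copy()
    tau = 1.0
    for _ in range(n_iter):
        grad = X.T @ (X @ z - y)             # gradient of the smooth part at z
        w = z - grad / L                     # gradient (forward) step
        beta_new = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)   # proximal step
        tau_new = (1.0 + np.sqrt(1.0 + 4.0 * tau ** 2)) / 2.0
        z = beta_new + ((tau - 1.0) / tau_new) * (beta_new - beta)     # momentum step
        beta, tau = beta_new, tau_new
    return beta
```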

Page 53: Learning with Implicit/Explicit Structures

Convergence

Let F (β) = f (β) + r(β)

Fast convergence

F(β_t) − F(β*) ≤ 2L ‖β_0 − β*‖² / (t + 1)²   (F(β*): optimal objective value)

optimal convergence rate

to obtain an ε-optimal solution, i.e., F(β_t) − F(β*) ≤ ε, only O(1/√ε) iterations are needed

cf. gradient descent: convergence rate is O(1/t) for smooth objectives

Page 54: Learning with Implicit/Explicit Structures

Optimization via FISTA

min_{U,V} ∑_{t=1}^T ‖y^(t) − X^(t)(u_t + v_t)‖²   [= f(U, V)]
          + λ1‖U‖clus + λ2‖U‖²_F + λ3‖V‖²_F      [= r(U, V)]

1: Initialize U_1, V_1, τ_1 ← 1.
2: for k = 1, 2, . . . , N − 1 do
3:   Compute Ū = U_k − (1/L_k) ∂_U f(U_k, V_k),  V̄ = V_k − (1/L_k) ∂_V f(U_k, V_k).
4:   U_k ← arg min_U ‖U − Ū‖²_F + (2λ1/L_k) ‖U‖clus + (2λ2/L_k) ‖U‖²_F   (can be solved in O(T log T) time)
5:   V_k ← [ V̄_ij / (1 + 2λ3/L_k) ].
6:   τ_{k+1} ← (1 + √(1 + 4τ_k²)) / 2.
7:   [U_{k+1}; V_{k+1}] ← [U_k; V_k] + ((τ_k − 1)/τ_{k+1}) ([U_k; V_k] − [U_{k−1}; V_{k−1}]).
8: end for
9: Output U_N.

complexity: O((1/√ε)(TDn + DT log T))

Page 55: Learning with Implicit/Explicit Structures

Experiment: Synthetic Data Sets

(C1) all tasks are independent
(C2) all tasks are from the same cluster
(C3) same as C2, but with corrupted features
(C4) a main task cluster plus a few outlier tasks
(C5) tasks in overlapping groups
(C6) all but the last two features are from a common cluster

the clusters obtained are close to the ground truth

Page 56: Learning with Implicit/Explicit Structures

NMSE on Synthetic Data Sets

the proposed model outperforms existing methods when the task clusters are not well formed (C3, C5 and C6)

Page 57: Learning with Implicit/Explicit Structures

Examination Score Prediction

consists of examination records from 139 secondary schools in years 1985, 1986 and 1987

each task is to predict exam scores for students in one school

inputs: year of the exam, gender, school gender, school denomination, etc.

[Figures: NMSE with 10% of the samples for training and with 20% of the samples for training]

FlexTClus has very competitive NMSE

task clustering structure: only one underlying task cluster (consistent with previous studies)

Page 58: Learning with Implicit/Explicit Structures

Rating of Products

ratings of 201 students (tasks) on 20 different personal computers, each described by 13 attributes

root mean squared error

Page 59: Learning with Implicit/Explicit Structures

Task Clustering Structure

one main cluster for the first 12 attributes (related to performance & service)

lots of varied opinions on price

Page 60: Learning with Implicit/Explicit Structures

Handwritten Digit Recognition

10-class classification problem, treated as 10 1-vs-rest problems; PCA to reduce the dimensionality to 64

FlexTClus consistently has the lowest classification error
task clustering structures for the leading PCA features are very different from those of the trailing PCA features

Page 61: Learning with Implicit/Explicit Structures

Conclusion

Structure is everywhere

Hierarchical multilabel classification

(label) structure is known
considered both mandatory and non-mandatory leaf node predictions
algorithms are efficient
can be used on label hierarchies of both trees and DAGs

Multitask learning

(task clustering) structure is unknown
captures task structures at the feature level
can be solved efficiently by an accelerated proximal method
better accuracy; the obtained task structures agree with the known/plausible properties of the data

Page 62: Learning with Implicit/Explicit Structures