

PPDSparse: A Parallel Primal and Dual Sparse Method to Extreme Classification

Ian E.H. Yen¹, Xiangru Huang², Wei Dai¹, Pradeep Ravikumar¹, Inderjit S. Dhillon² and Eric Xing¹

¹Carnegie Mellon University. ²University of Texas at Austin

KDD, 2017

Ian E.H. Yen, Xiangru Huang, Wei Dai, Pradeep Ravikumar, Inderjit S. Dhillon and Eric Xing (shortinst)PPDSparse: A Parallel Primal and Dual Sparse Method to Extreme ClassificationKDD, 2017 1 / 17


Outline

1 Problem Setting: Extreme Classification, Related Works

2 Algorithm: Separable Loss, Algorithm Diagram

3 Theory: Analysis of primal and dual sparsity

4 Experimental Results


Extreme Classification

Goal: Learn a function $h(x) : \mathbb{R}^D \to \mathbb{R}^K$ from $D$ input features to $K$ output scores that is consistent with labels $y \in \{0, 1\}^K$.

$K$ is large (e.g. $10^3 \sim 10^6$).

Average number of positive labels (per sample): $k_p = \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{k} y_{ik} \right)$.

Multiclass: $k_p = 1$; Multilabel: $k_p \ll K$.

Average number of positive samples (per class): $n_p = \frac{1}{K} \sum_{k=1}^{K} n_p^k = \frac{1}{K} \sum_{k=1}^{K} \left( \sum_{i} y_{ik} \right)$.

$n_p = N k_p / K \ll N$
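As a small illustration of these two averages, the sketch below computes $k_p$, the per-class counts $n_p^k$, and $n_p$ from a toy label set and checks the identity $n_p = N k_p / K$. The data and variable names are hypothetical and only meant to make the definitions concrete.

```python
# Hypothetical toy label set: N = 4 samples, K = 5 classes.
# Each row lists the positive classes of one sample (sparse form of y_i in {0,1}^K).
labels = [[0, 2], [1], [2, 4], [2]]
N, K = len(labels), 5

# k_p: average number of positive labels per sample.
k_p = sum(len(row) for row in labels) / N

# n_pk[k]: number of positive samples of class k; n_p is their average over classes.
n_pk = [0] * K
for row in labels:
    for k in row:
        n_pk[k] += 1
n_p = sum(n_pk) / K

# Both averages count the same nonzeros of Y, so n_p = N * k_p / K.
print(k_p, n_pk, n_p, N * k_p / K)
```

Both quantities count the nonzeros of the label matrix, once divided by $N$ and once by $K$, which is why the identity holds for any label set.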


Extreme Classification

We consider a linear classifier:

$h(x) := W^T x$ where $W \in \mathbb{R}^{D \times K}$.

Challenge: When $K$ is large, training even a simple linear model requires $O(NDK)$ cost.
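To make the cost concrete, here is a minimal sketch of scoring with $h(x) = W^T x$ when both $x$ and $W$ are stored sparsely, followed by the dense multiply-add count $N \cdot D \cdot K$. The storage layout, the function name `scores`, and the problem sizes are assumptions chosen for illustration, not taken from the slides.

```python
def scores(W, x, K):
    """Compute h(x) = W^T x for a sparse x given as {feature: value}.

    W maps a feature index j to its sparse row {class k: weight W[j][k]}.
    """
    z = [0.0] * K
    for j, xj in x.items():
        for k, wjk in W.get(j, {}).items():
            z[k] += wjk * xj
    return z


# Hypothetical toy model with D = 4 features and K = 3 classes.
W = {0: {0: 1.0, 2: -0.5}, 3: {1: 2.0}}
x = {0: 2.0, 3: 1.0}
z = scores(W, x, K=3)

# Dense training touches every (sample, feature, class) triple once per pass.
N_big, D_big, K_big = 10**6, 10**5, 10**5  # hypothetical sizes
dense_cost = N_big * D_big * K_big
print(z, dense_cost)
```

With sparse storage the work per sample scales with the nonzeros actually touched rather than with $D \cdot K$.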


Approaches

Approach 1: Structural, i.e. low-rank or tree-hierarchy. Good accuracy when the assumption holds; lower accuracy when the assumption does not hold.

Approach 2: Parallelized one-vs-all. Good accuracy, slow, parallelizable. Needs days on the largest dataset with 100 cores.

Approach 3: Primal-Dual Sparse. Good accuracy, fast, not parallelizable, memory issue $O(DK)$. Needs days on the largest dataset.

This paper: Parallel PD-Sparse. Good accuracy, fast, parallelizable. Needs only < 30 min on the largest dataset with 100 cores.


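We consider the classwise-separable hinge loss

$$L(z, y) := \sum_{k=1}^{K} \ell(z_k, y_k) = \sum_{k=1}^{K} \max(1 - y_k z_k, 0)$$

Minimizing a separable loss is equivalent to One-versus-all:

$$\min_{W \in \mathbb{R}^{D \times K}} \sum_{i=1}^{N} \sum_{k=1}^{K} \ell(w_k^T x_i, y_{ik}) = \sum_{k=1}^{K} \left( \sum_{i=1}^{N} \ell(w_k^T x_i, y_{ik}) \right)$$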
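The separability can be checked numerically: the joint objective over all (sample, class) pairs equals the sum of $K$ per-class binary hinge objectives, and the $k$-th term depends only on $w_k$. The sketch below assumes labels are encoded as $\pm 1$ inside the hinge (the slides write $y \in \{0,1\}^K$, so this encoding is an assumption), and the toy data are hypothetical.

```python
def hinge(z, y):
    """Binary hinge loss max(1 - y*z, 0) with y in {-1, +1}."""
    return max(1.0 - y * z, 0.0)


def dot(w, x):
    """Inner product of two sparse vectors stored as {index: value}."""
    return sum(wj * x.get(j, 0.0) for j, wj in w.items())


# Hypothetical toy data: N = 3 samples, K = 2 classes.
X = [{0: 1.0, 1: 2.0}, {1: -1.0}, {0: 0.5, 2: 1.0}]
Y = [[+1, -1], [-1, +1], [+1, +1]]      # Y[i][k]: sign of label k for sample i
W = [{0: 0.5, 2: -1.0}, {1: -0.5}]      # W[k]: sparse weight vector w_k
N, K = len(X), len(W)

# Joint objective: sum over samples, then classes.
joint = sum(hinge(dot(W[k], X[i]), Y[i][k]) for i in range(N) for k in range(K))

# One-versus-all view: one independent binary objective per class.
per_class = [sum(hinge(dot(W[k], X[i]), Y[i][k]) for i in range(N))
             for k in range(K)]

print(joint, per_class)
```

Because each entry of `per_class` involves a single $w_k$, minimizing the joint objective amounts to solving $K$ independent binary problems.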

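To obtain sparse iterates, we add an $\ell_1$-penalty on $W$ and add a bias per class $w_{0k}$. The dual problem of the $\ell_1$-$\ell_2$-regularized problem is:

$$\min_{\alpha_k \in \mathbb{R}^N} \; G(\alpha_k) := \frac{1}{2} \|w(\alpha_k)\|^2 - \sum_{i=1}^{N} \alpha_{ik} \quad \text{s.t.} \quad w(\alpha_k) = \mathrm{prox}_{\lambda}(\hat{X}^T \alpha_k), \quad 0 \le \alpha_{ik} \le 1. \tag{1}$$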
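A minimal sketch of evaluating (1) for one class follows. It assumes that $\mathrm{prox}_\lambda$ is the usual soft-thresholding operator of the $\ell_1$-penalty and that row $i$ of $\hat{X}$ is $y_{ik} x_i$; neither detail is spelled out on the slide, and the function names and data are hypothetical.

```python
def soft_threshold(v, lam):
    """Assumed prox of lam * ||.||_1: shrink entries toward zero, drop exact zeros."""
    out = {}
    for j, vj in v.items():
        if vj > lam:
            out[j] = vj - lam
        elif vj < -lam:
            out[j] = vj + lam
    return out


def dual_objective(alpha, X_hat, lam):
    """Return (G(alpha), w(alpha)) for sparse alpha {sample: value}."""
    v = {}  # v = X_hat^T alpha, built only from nonzero dual variables
    for i, a in alpha.items():
        for j, xij in X_hat[i].items():
            v[j] = v.get(j, 0.0) + a * xij
    w = soft_threshold(v, lam)
    g = 0.5 * sum(wj * wj for wj in w.values()) - sum(alpha.values())
    return g, w


# Hypothetical toy rows y_i * x_i and a dual vector with a single nonzero.
X_hat = [{0: 2.0, 1: 0.5}, {0: -1.0}]
G, w = dual_objective({0: 1.0}, X_hat, lam=1.0)
print(G, w)
```

Only samples with nonzero $\alpha_{ik}$ contribute to $w(\alpha_k)$, and soft-thresholding zeroes small coordinates, which is where the dual and primal sparsity in the method's name come from.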


Primal-Dual-Sparse Active-set Method

Cost per iteration:

$$O\Big( \underbrace{\mathrm{nnz}(w_k)\,\mathrm{nnz}(x^j)}_{\text{Search}} + \underbrace{\mathrm{nnz}(\alpha_k)\,\mathrm{nnz}(x_i)}_{\text{Update+Maintain}} \Big)$$

Apply Random Sparsification on (already sparse) $w_k$ before search.

Update $\alpha$ by Coordinate Descent within $A_k$.
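The search and update steps can be sketched for a single class as below. This is a simplified illustration under stated assumptions, not the authors' implementation: $\mathrm{prox}_\lambda$ is taken to be soft-thresholding, rows of $\hat{X}$ are $y_{ik} x_i$, the search is a naive scan over inactive samples rather than the sparsity-exploiting search on the slide, random sparsification and the bias term are omitted, and the step size $1/\|\hat{x}_i\|^2$ is an assumed curvature bound. All names are hypothetical.

```python
def shrink(t, lam):
    """Scalar soft-thresholding, assumed here to be the prox of lam * |t|."""
    if t > lam:
        return t - lam
    if t < -lam:
        return t + lam
    return 0.0


def gradient(x_hat_i, v, lam):
    """i-th coordinate of grad G: x_hat_i . prox_lam(v) - 1, in O(nnz(x_i))."""
    return sum(x * shrink(v.get(j, 0.0), lam) for j, x in x_hat_i.items()) - 1.0


def dual_objective(alpha, v, lam):
    """G(alpha) = 0.5 * ||prox_lam(v)||^2 - sum_i alpha_i, with v = X_hat^T alpha."""
    return 0.5 * sum(shrink(t, lam) ** 2 for t in v.values()) - sum(alpha.values())


def solve_class(X_hat, lam, iters=20):
    """Active-set coordinate descent sketch for one class.

    X_hat[i] is the sparse row y_ik * x_i stored as {feature: value}.
    Returns sparse alpha, the maintained v = X_hat^T alpha, and the objective
    value after each outer iteration.
    """
    q = [sum(x * x for x in row.values()) for row in X_hat]  # curvature bounds
    alpha, v, active, history = {}, {}, [], []
    for _ in range(iters):
        # Search: add the inactive sample with the most negative gradient.
        candidates = [(gradient(X_hat[i], v, lam), i) for i in range(len(X_hat))
                      if i not in active and q[i] > 0.0]
        if candidates:
            g_min, i_min = min(candidates)
            if g_min < 0.0:
                active.append(i_min)
        # Update: one coordinate-descent pass restricted to the active set.
        for i in active:
            old = alpha.get(i, 0.0)
            new = min(1.0, max(0.0, old - gradient(X_hat[i], v, lam) / q[i]))
            if new == old:
                continue
            for j, x in X_hat[i].items():  # maintain v incrementally
                v[j] = v.get(j, 0.0) + (new - old) * x
            if new > 0.0:
                alpha[i] = new
            else:
                alpha.pop(i, None)
        history.append(dual_objective(alpha, v, lam))
    return alpha, v, history


# Hypothetical toy problem: four rows y_i * x_i for one class.
X_hat = [{0: 2.0, 1: 1.0}, {0: -1.0, 2: 1.0}, {1: 1.5}, {2: -2.0, 3: 1.0}]
alpha, v, history = solve_class(X_hat, lam=0.1)
w = {j: shrink(t, 0.1) for j, t in v.items() if shrink(t, 0.1) != 0.0}
print(alpha, w, history[-1])
```

Each coordinate step is a projected gradient step with step size $1/\|\hat{x}_i\|^2$, so the dual objective does not increase from one outer iteration to the next, and only active samples are ever stored in `alpha`.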


Parallel & Primal-Dual Sparse Method

Due to the separable loss, the optimization can be embarrassingly parallelized with one-time communication.

The input $y_k$ and output $w_k$ of each sub-problem are sparse.

Can be implemented in a distributed, shared-memory, or two-level parallelization setting.

Space: $O(\mathrm{nnz}(X) + D)$.

Nearly linear speedup even with thousands of cores.
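The structure of this scheme can be sketched with Python's standard `concurrent.futures`: each class is an independent task that reads the shared data and returns a sparse $w_k$, and the results are gathered once at the end. The per-class solver `train_one_class` below is a hypothetical stand-in (a few epochs of hinge-loss subgradient steps), not the primal-dual sparse solver from the slides, and a thread pool is used only to show the structure; a real deployment would more likely use processes or a distributed runtime.

```python
from concurrent.futures import ThreadPoolExecutor


def train_one_class(k, X, labels, epochs=5, lr=0.1):
    """Hypothetical per-class solver: hinge-loss subgradient descent for class k.

    X is a list of sparse samples {feature: value}; labels[i] is the set of
    positive classes of sample i. Returns (k, sparse weight vector w_k).
    """
    w = {}
    for _ in range(epochs):
        for x, positives in zip(X, labels):
            y = 1.0 if k in positives else -1.0
            margin = y * sum(w.get(j, 0.0) * xj for j, xj in x.items())
            if margin < 1.0:  # hinge is active: step along y * x
                for j, xj in x.items():
                    w[j] = w.get(j, 0.0) + lr * y * xj
    return k, w


# Hypothetical toy data: 4 samples, 3 classes.
X = [{0: 1.0, 1: 0.5}, {1: 1.0, 2: 1.0}, {0: 0.5, 3: 1.0}, {3: 2.0}]
labels = [{0}, {1}, {0, 2}, {2}]
K = 3

# Each class is solved independently; the only communication is the final gather.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(lambda k: train_one_class(k, X, labels), range(K)))
W = dict(results)
print(W)
```

Since no task reads another task's output, running the classes in parallel gives the same weights as running them one after another.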


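Theory: Primal and Dual Sparsity

Key Insight: The number of positive samples for each class, $n_p = \frac{N k_p}{K}$, is small. The following results hold if class-wise biases $w_{0k}$ are added.

Step-1: bound $\|w\|_1$ and the optimal $\|\alpha^*\|_1$ in terms of $n_p$:

$$\|w_k\|_1 \le \frac{2 n_p^k}{\lambda}, \qquad \|\alpha_k^*\|_1 \le 4 n_p^k.$$

Step-2: bound $\mathrm{nnz}(w)$ and $\mathrm{nnz}(\alpha)$ in terms of $\|w\|_1$ and $\|\alpha^*\|_1$:

$$\mathrm{nnz}(\tilde{w}_k) \le \frac{\|w_k\|_1^2}{\delta^2}, \qquad \mathrm{nnz}(\alpha_k^t) \le t \le \frac{4 \|\alpha_k^*\|_1^2}{\epsilon}$$

where $\tilde{w}$ is the Random-Sparsified version of $w$ with $\delta$-approximation error in $\nabla G(\alpha)$, and $\epsilon$ is the desired precision of the solution.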
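Chaining the two steps gives sparsity bounds that depend only on the per-class positive count $n_p^k$ and the parameters $\lambda$, $\delta$, $\epsilon$. The sketch below simply plugs numbers into the four inequalities as stated; the function name and the parameter values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def sparsity_bounds(n_pk, lam, delta, eps):
    """Evaluate the Step-1 and Step-2 upper bounds for one class."""
    w_l1 = 2.0 * n_pk / lam                # Step-1: ||w_k||_1 <= 2 n_p^k / lambda
    alpha_l1 = 4.0 * n_pk                  # Step-1: ||alpha_k^*||_1 <= 4 n_p^k
    nnz_w = w_l1 ** 2 / delta ** 2         # Step-2: nnz(w~_k) <= ||w_k||_1^2 / delta^2
    nnz_alpha = 4.0 * alpha_l1 ** 2 / eps  # Step-2: nnz(alpha_k^t) <= t <= 4 ||alpha_k^*||_1^2 / eps
    return {"w_l1": w_l1, "alpha_l1": alpha_l1,
            "nnz_w": nnz_w, "nnz_alpha": nnz_alpha}


# Hypothetical values: 5 positive samples for this class.
bounds = sparsity_bounds(n_pk=5, lam=0.5, delta=10.0, eps=0.5)
print(bounds)
```

None of the four quantities involves $N$, $D$, or $K$ directly, so under these bounds the per-class work is governed by how many positives a class has rather than by the overall problem size.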


Multilabel Classification


Multiclass Classification


Thank you

Xiangru [email protected]
