Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?
DESCRIPTION
My presentation at the SSLNLP Workshop (NAACL 2009) on June 4th, 2009
TRANSCRIPT
Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?
Arkaitz Zubiaga, Víctor Fresno, Raquel Martínez
Universidad Nacional de Educación a Distancia
June 4, 2009
Index
1 Text Classification
2 Motivation
3 Support Vector Machines
4 Multiclass SVM
5 S3VM
6 Multiclass S3VM
7 Compared Approaches: Multiclass SVM vs Multiclass S3VM
8 Experiments
9 Results
10 Conclusions and Outlook
11 Thank you
A. Zubiaga, V. Fresno, R. Martınez (UNED) Unlabeled Data for Multiclass SVM June 4, 2009 2 / 31
Text Classification
What is it?
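We have a set of documents:
$D = \{d_1, \ldots, d_{|D|}\}$
With a set of predefined categories:
$C = \{c_1, \ldots, c_{|C|}\}$
Classification decides, for each pair, whether document $d_j$ belongs to category $c_i$:
$\langle d_j, c_i \rangle \in D \times C$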
Motivation
Several studies address plain text classification (news), but few address web page classification.
Typical web page classification task:
Semi-supervised: few labeled documents.
Multiclass: taxonomy with more than 2 categories.
(Joachims, 1999) showed the suitability of unlabeled data for binary tasks.
What about multiclass tasks?
(Chapelle et al., 2006) did it over image datasets, but never for text/web pages.
Support Vector Machines
SVM
It looks for a hyperplane to separate the classes
Margin maximization
Optimization function:
$$\min \; \frac{1}{2} \|\omega\|^2 + C \cdot \sum_{i=1}^{n} \xi_i^d$$
Subject to:
$$y_i (\omega \cdot x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$
It only handles binary and supervised problems by nature.
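To make the optimization function concrete, here is a minimal Python sketch that evaluates it for a fixed hyperplane. It is only an illustration: the function name `svm_objective`, the toy data, and the default exponent `d=1` are assumptions, and each slack is taken as the smallest value that satisfies the constraint, max(0, 1 - y_i(ω · x_i + b)).

```python
def svm_objective(w, b, X, y, C=1.0, d=1):
    """Evaluate the soft-margin objective for a given hyperplane (w, b).

    Each slack is the smallest xi_i satisfying
    y_i * (w . x_i + b) >= 1 - xi_i with xi_i >= 0.
    """
    slacks = []
    for x_i, y_i in zip(X, y):
        margin = y_i * (sum(w_k * x_k for w_k, x_k in zip(w, x_i)) + b)
        slacks.append(max(0.0, 1.0 - margin))
    norm_sq = sum(w_k * w_k for w_k in w)
    objective = 0.5 * norm_sq + C * sum(s ** d for s in slacks)
    return objective, slacks


# Toy data (hypothetical): two well-separated points and one on the wrong side.
X = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.0]]
y = [1, -1, -1]
obj, slacks = svm_objective(w=[1.0, -1.0], b=0.0, X=X, y=y, C=1.0)
print(obj, slacks)  # 2.5 [0.0, 0.0, 1.5]
```

Only the third point violates the margin, so it alone contributes to the penalty term weighted by C.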
Multiclass SVM
Approaches to multiclass SVM:
Direct.
Combining binary classifiers:
One-against-one.
One-against-all.
Usually applied to supervised tasks, but hardly ever to semi-supervised ones.
Multiclass SVM: Direct approach
The optimization function considers all the hyperplanes at the same time.
$$\min \; \frac{1}{2} \sum_{m=1}^{n} \|w_m\|^2 + C \sum_{i=1}^{l} \sum_{m \neq y_i} \xi_i^m$$
Subject to:
$$w_{y_i} \cdot x_i + b_{y_i} \geq w_m \cdot x_i + b_m + 2 - \xi_i^m, \quad \xi_i^m \geq 0$$
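As a rough illustration of the direct formulation, the sketch below evaluates the objective for a fixed set of hyperplanes, one per class. The names `direct_multiclass_objective`, `W`, and `B`, and the toy values, are assumptions made for this example; each slack is the smallest value that satisfies the constraint above.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def direct_multiclass_objective(W, B, X, y, C=1.0):
    """Evaluate the direct multiclass objective for fixed hyperplanes.

    W[m], B[m] are the weight vector and bias of class m; y[i] is the
    index of the true class of X[i].
    """
    reg = 0.5 * sum(dot(w, w) for w in W)
    slack_total = 0.0
    for x_i, y_i in zip(X, y):
        true_score = dot(W[y_i], x_i) + B[y_i]
        for m in range(len(W)):
            if m == y_i:
                continue
            other_score = dot(W[m], x_i) + B[m]
            # Smallest slack so that true_score >= other_score + 2 - slack.
            slack_total += max(0.0, other_score + 2.0 - true_score)
    return reg + C * slack_total


# Toy setup (hypothetical): 3 classes, 2 features, 2 labeled documents.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
B = [0.0, 0.0, 0.0]
X = [[3.0, 0.0], [1.0, 1.0]]
y = [0, 1]
value = direct_multiclass_objective(W, B, X, y, C=1.0)
print(value)  # 4.0
```

The first document beats every other class by at least 2 and adds no penalty; the second ties with class 0, so it contributes a slack of 2.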
Multiclass SVM: One-against-one
It creates $\frac{k \cdot (k-1)}{2}$ binary classifiers
$\mathrm{sign}(\omega_{ij}^T \cdot x + b_{ij})$ → add a vote for the winning class between $i$ and $j$
The class with the most votes is the output.
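The voting scheme can be sketched in a few lines of Python. This is an illustration under assumptions: the function name, the dictionary `pairwise` holding one hypothetical (ω, b) pair per class pair, and the convention that a non-negative score votes for the first class of the pair are all choices made for the example. Ties in the vote count are broken arbitrarily here.

```python
from collections import Counter
from itertools import combinations


def one_against_one_predict(x, classes, pairwise):
    """Combine k*(k-1)/2 binary decisions by majority voting.

    pairwise[(i, j)] = (w, b); a non-negative score votes for i,
    a negative score votes for j.
    """
    votes = Counter()
    for i, j in combinations(classes, 2):
        w, b = pairwise[(i, j)]
        score = sum(w_k * x_k for w_k, x_k in zip(w, x)) + b
        votes[i if score >= 0 else j] += 1
    return votes.most_common(1)[0][0], votes


# Hypothetical pairwise hyperplanes for k = 3 classes (3 classifiers).
classes = ["a", "b", "c"]
pairwise = {
    ("a", "b"): ([1.0, 0.0], 0.0),
    ("a", "c"): ([0.0, 1.0], 0.0),
    ("b", "c"): ([-1.0, 1.0], 0.0),
}
label, votes = one_against_one_predict([2.0, 1.0], classes, pairwise)
print(label, dict(votes))  # a {'a': 2, 'c': 1}
```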
Multiclass SVM: One-against-all
It creates k binary classifiers
$$C_i = \arg\max_{i=1,\ldots,k} (\omega_i \cdot x + b_i)$$
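The decision rule above picks the class whose classifier gives the highest score. A minimal Python sketch follows; the function name, the `models` dictionary, and the class names and weights are hypothetical values chosen for illustration.

```python
def one_against_all_predict(x, models):
    """Return the class whose hyperplane gives the largest score w_i . x + b_i."""
    scores = {
        label: sum(w_k * x_k for w_k, x_k in zip(w, x)) + b
        for label, (w, b) in models.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical one-against-all hyperplanes for k = 3 classes.
models = {
    "course": ([1.0, 0.0], -0.5),
    "faculty": ([0.0, 1.0], 0.0),
    "student": ([-1.0, -1.0], 1.0),
}
label, scores = one_against_all_predict([0.2, 0.9], models)
print(label)  # faculty
```

Unlike one-against-one, only k classifiers are evaluated, and their raw scores are compared directly rather than counted as votes.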
S3VM
Semi-supervised SVM (S3VM)
Unlabeled documents are considered during the learning phase.
The resulting optimization function is:
$$\min \; \frac{1}{2} \cdot \|\omega\|^2 + C \cdot \sum_{i=1}^{l} \xi_i^d + C^* \cdot \sum_{j=1}^{u} \xi_j^{*d}$$
The resulting problem is non-convex, so dedicated optimization algorithms are required.
Commonly used over binary taxonomies, but hardly ever with more classes.
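To show how the unlabeled term enters the S3VM optimization function, the sketch below evaluates it for a fixed hyperplane and a tentative labeling of the unlabeled documents. The function names, the toy data, and the values of `C` and `C_star` are assumptions; a real S3VM solver would also search over the hyperplane and the tentative labels, which this sketch does not do.

```python
def hinge_slacks(w, b, X, y):
    """Smallest slacks satisfying y_i * (w . x_i + b) >= 1 - xi_i, xi_i >= 0."""
    slacks = []
    for x_i, y_i in zip(X, y):
        margin = y_i * (sum(w_k * x_k for w_k, x_k in zip(w, x_i)) + b)
        slacks.append(max(0.0, 1.0 - margin))
    return slacks


def s3vm_objective(w, b, X_l, y_l, X_u, y_u_guess, C=1.0, C_star=0.5, d=1):
    """Evaluate the S3VM objective for fixed (w, b) and tentative labels."""
    labeled = sum(s ** d for s in hinge_slacks(w, b, X_l, y_l))
    unlabeled = sum(s ** d for s in hinge_slacks(w, b, X_u, y_u_guess))
    norm_sq = sum(w_k * w_k for w_k in w)
    return 0.5 * norm_sq + C * labeled + C_star * unlabeled


# Toy 1-D data (hypothetical): two labeled points, one unlabeled point.
X_l, y_l = [[2.0], [-2.0]], [1, -1]
X_u = [[0.5]]
as_positive = s3vm_objective([1.0], 0.0, X_l, y_l, X_u, [1])
as_negative = s3vm_objective([1.0], 0.0, X_l, y_l, X_u, [-1])
print(as_positive, as_negative)  # 0.75 1.25
```

Labeling the unlabeled point as positive gives the lower objective value, which is the labeling this hyperplane prefers.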
Multiclass S3VM
(Yajima and Kuo, 2006) present the following optimization function:
$$\min \left( \frac{1}{2} \sum_{i=1}^{h} \beta^{i\,T} K^{-1} \beta^{i} + C \sum_{j=1}^{l} \sum_{i \neq y_j} \max\left(0,\, 1 - (\beta_j^{y_j} - \beta_j^{i})\right)^2 \right)$$
where β represents the product of a vector and a kernel matrix defined by the authors.
(Chapelle et al., 2006): direct approach by means of the Continuation Method.
2 steps:
(Qi et al., 2004) use Fuzzy C-Means to predict new unlabeled documents.
(Xu and Schuurmans, 2005) rely on a clustering-based approach to label the unlabeled data.
Compared Approaches: Multiclass SVM vs Multiclass S3VM
Multiclass SVM vs Multiclass S3VM
2-steps-SVM/1-step-SVM: Multiclass SVM.
Does an intermediate step that adds newly labeled data improve the classifier's performance?
One-against-all-S3VM/One-against-all-SVM.
One-against-one-S3VM/One-against-one-SVM.
Does unlabeled data help improve the results of combined binary classifiers?
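The idea of an intermediate labeling step can be sketched as follows. This is an assumption-laden illustration, not the procedure used in the talk: a simple nearest-centroid classifier stands in for the multiclass SVM, and the function names and toy data are hypothetical. Step 1 trains on the labeled documents and labels the unlabeled ones; step 2 retrains on both.

```python
def train_centroids(X, y):
    """Stand-in classifier (NOT an SVM): one mean vector per class."""
    sums, counts = {}, {}
    for x_i, y_i in zip(X, y):
        acc = sums.setdefault(y_i, [0.0] * len(x_i))
        for k, v in enumerate(x_i):
            acc[k] += v
        counts[y_i] = counts.get(y_i, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}


def predict(model, x):
    """Assign x to the class with the closest centroid."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(model[c], x))
    return min(model, key=sq_dist)


def two_step_train(X_l, y_l, X_u):
    # Step 1: supervised model on labeled data, used to label the unlabeled data.
    first = train_centroids(X_l, y_l)
    y_u = [predict(first, x) for x in X_u]
    # Step 2: retrain on labeled plus newly labeled documents.
    return train_centroids(X_l + X_u, y_l + y_u), y_u


# Toy data (hypothetical).
X_l, y_l = [[0.0, 0.0], [4.0, 4.0]], ["a", "b"]
X_u = [[1.0, 1.0], [3.0, 3.0]]
model, y_u = two_step_train(X_l, y_l, X_u)
print(y_u, model["a"])  # ['a', 'b'] [0.5, 0.5]
```

A 1-step variant would simply stop after the first call to `train_centroids`, which is the comparison the first question above is about.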
Experiments
Experiments settings
Datasets:
BankSearch: 10,000 web documents / 10 categories (4,000 for the training set).
WebKB: 4,518 web documents / 6 categories (2,000 for the training set).
Yahoo! Science: 788 web documents / 6 categories (200 for the training set).
Numerous labeled/unlabeled sets.
9 executions for each.
Representation: TF-IDF.
Software:
SVM-light (http://svmlight.joachims.org)
SVM-multiclass
Evaluation by means of the accuracy (percent of correct predictions).
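For readers unfamiliar with the representation and the metric, here is a small Python sketch of both. The TF-IDF variant used (raw term count times log(N / document frequency)) is an assumption, since the slides do not say which weighting was applied; the function names and toy documents are also hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight each term by raw count * log(N / df); one common variant."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def accuracy(predicted, gold):
    """Percentage of predictions that match the gold labels."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


# Toy tokenized documents (hypothetical).
docs = [["svm", "web", "web"], ["svm", "kernel"]]
vectors = tf_idf(docs)
print(vectors[0]["svm"])  # 0.0, since "svm" occurs in every document
print(accuracy(["a", "b", "a", "c"], ["a", "b", "b", "c"]))  # 75.0
```

A term that appears in every document gets weight zero under this variant, so it does not help separate the categories.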
Results
Results: BankSearch
Results: WebKB
Results: Yahoo! Science
Results
Supervised multiclass approaches (2-steps-SVM & 1-step-SVM) outperform the rest.
Among binary combinations, one-against-all outperforms one-against-one.
Unlabeled data slightly helps for one-against-all.
1-step-SVM and 2-steps-SVM show similar results, except for WebKB, where the former wins.
It could be due to the homogeneous nature of the WebKB dataset.
Conclusions and Outlook
Conclusions
Comparison of multiclass SVM and S3VM approaches for web page classification.
Direct and combining approaches.
Direct approaches outperform the rest.
Unlabeled data did not provide considerable improvements, and even degraded performance in some cases.
Future Work
To add more multiclass S3VM approaches to the study.
To test with different SVM settings (kernel, parameters,...).
Thank you