Global Mutual Information Based Feature Selection By Quantum Annealing
Kotaro Tanahashi*, Shinichi Takayanagi*, Tomomitsu Motohashi*, Shu Tanaka✝
* Recruit Communications Co., Ltd.
✝ Waseda University, JST PRESTO
Apr. 11, 2018
(C) Recruit Communications Co., Ltd.
Introduction of Recruit
We provide various kinds of online services from job search to hotel reservations across the world.
Automobile
Education
Life & Local O2O
Travel
Beauty
Housing
Bridal & Baby
Human Resources
IT & Trends Media
Dining
www.flaticon.com
Introduction of Recruit
[Figure: Internet users and clients]
• We help users to find the best clients through our services.
• Data science plays an important role in the business.
Data Science at Recruit
Recruit has hosted two data mining competitions on Kaggle:
• Recruit Restaurant Visitor Forecasting (2018)
• Coupon Purchase Prediction (2015)
www.kaggle.com
Kaggle, KDD Cup: international data mining competitions
Engineers at Recruit (as of March 2018)
We are passionate about data science. Some of us came in 1st and 2nd place in KDD Cup 2015.
www.kdd.org/kdd-cup
Feature Selection: A Key Technique
“Beating Kaggle the easy way”
• A key technique to win data mining competitions
• Find the most relevant features
• Balance the bias-variance trade-off
[Figure: a data matrix with rows User 1 to User n and n feature columns, from which the relevant features are selected]
Benefits:
✔ Improve prediction
✔ Reduce computational cost
https://www.ke.tu-darmstadt.de/lehre/arbeiten/studien/2015/Dong_Ying.pdf
Feature selection is essential in predictive analysis
Types of Feature Selection (FS) Algorithms
Wrapper methods: iteratively evaluate a feature subset with a black-box learning algorithm
[Figure: Set of all features; Generate a subset; Learning Algorithm; Performance; Selecting the best subset]
Embedded methods: train a model and select features at the same time
[Figure: Set of all features; Generate a subset; Learning Algorithm + Performance; Selecting the best subset]
Filter methods: features are selected by some criterion such as Mutual Information
✔ Independent of learning algorithms
✔ Can be used as a pre-processing step
[Figure: Set of all features; Selecting the best subset; Learning Algorithm; Performance]
Filter methods are useful as a pre-processing step since they do not depend on the predictive model
What is Mutual Information (MI)?
Figures are retrieved from http://minepy.readthedocs.io/en/latest/python.html
Mutual Information I(X;Y) is a measure of the mutual dependence between two random variables X and Y
Shannon entropy
[Figure: three scatter plots with (a) Pearson r = 0.8, MI = 0.5; (b) Pearson r = 0.0, MI = 0.7; (c) Pearson r = 0.0, MI = 0.1]
✔ MI can capture non-linear relationships, unlike Pearson’s correlation coefficient
Mutual Information I(X;Y)
High: able to predict Y given X
Low: hard to predict Y given X
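The contrast between MI and Pearson's correlation can be illustrated with a small sketch. It is not taken from the slides: the function names `mutual_information` and `pearson`, the plug-in (count-based) MI estimator, and the toy data Y = X^2 are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in nats for two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) * log( p(x,y) / (p(x) p(y)) ), written with raw counts
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def pearson(xs, ys):
    """Pearson correlation coefficient of two numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy data (hypothetical): Y = X^2 on a symmetric grid.
# Y is fully determined by X, yet the linear correlation vanishes.
xs = [-2, -1, 0, 1, 2] * 20
ys = [x * x for x in xs]

print("Pearson r:", pearson(xs, ys))
print("MI (nats):", mutual_information(xs, ys))
```

Because Y is a deterministic function of X here, the estimated MI equals the entropy of Y, while Pearson's r is zero.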
Mutual Information based Feature Selection (MIFS)
MIFS: using Mutual Information as a criterion in filter methods
General formulation of MIFS
MIFS selects a feature subset of size k that maximizes the Mutual Information (MI) between the selected features and the target variable
Unfortunately, the exact calculation of this MI is intractable…
Heuristic MIFS Algorithms
[1] H. Peng et al., 2005 [2] J. R. Vergara & P. A. Estévez, 2015
Max Relevance method: iteratively select the most relevant feature (repeat k times)
Min Redundancy & Max Relevance method [1] (MRMR): iteratively select the most relevant and least redundant feature (repeat k times)
Some heuristic MIFS algorithms have been developed. However, these methods are greedy approximations [2].
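A minimal sketch of a greedy MRMR loop may make the "repeat k times" step concrete. The slide's exact scoring formula is not preserved in this transcript, so the sketch assumes the common difference form (relevance minus mean redundancy with the already selected features); the names `mi` and `mrmr` and the toy data are hypothetical.

```python
import math
from collections import Counter


def mi(xs, ys):
    """Plug-in estimate of I(X;Y) in nats for two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr(features, target, k):
    """Greedy MRMR over a list of discrete feature columns.

    Assumed score: I(x_i; y) - mean_{j in selected} I(x_i; x_j).
    """
    relevance = [mi(f, target) for f in features]
    # First pick: the single most relevant feature.
    selected = [max(range(len(features)), key=lambda i: relevance[i])]
    while len(selected) < k:

        def score(i):
            redundancy = sum(mi(features[i], features[j]) for j in selected)
            return relevance[i] - redundancy / len(selected)

        candidates = [i for i in range(len(features)) if i not in selected]
        selected.append(max(candidates, key=score))
    return selected


# Toy data (hypothetical): feature 1 duplicates feature 0,
# feature 2 carries independent information about the target.
a = [0, 0, 1, 1] * 5
b = [0, 1, 0, 1] * 5
target = [2 * ai + bi for ai, bi in zip(a, b)]
features = [a, list(a), b]

selected = mrmr(features, target, k=2)
print(selected)
```

On this toy data all three features are equally relevant on their own, so a pure Max Relevance ranking cannot tell the duplicate from the informative feature, whereas the redundancy term steers the second pick away from the duplicate. Each pick is still greedy: earlier choices are never revisited.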
Our Contributions
(1) We reformulate MIFS as a QUBO (Quadratic Unconstrained Binary Optimization) problem
MIFS optimization. How? QUBO formulation of MIFS
(2) We confirmed that optimization by D-Wave performs well on MIFS
[Figure: MI increase (%) w.r.t. Linear versus #features (5, 6, 7, 8, 10, 15, 20, 25, 30, 40); higher is better]
Image is retrieved from https://www.dwavesys.com/resources/media-resources
Reformulation of MIFS by QUBO (1)
Theorem 1.1: Chain rule for Conditional Mutual Information
Using Theorem 1.1, the following equation holds for all i ∈ S
Averaging the equation above over all i ∈ S leads to
Proof.
Expand the MI term
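The equations on this slide are not preserved in the transcript. As a hedged sketch of the steps it describes, write S for the selected subset, k = |S|, X_S for the selected features, and X_{S \setminus i} for the selected features other than X_i. This notation is assumed here, not taken from the slide. The chain rule for mutual information and its average over i ∈ S then read:

```latex
% Chain rule, valid for every i in S:
I(X_S; Y) = I(X_i; Y) + I(X_{S \setminus i}; Y \mid X_i)

% Averaging the identity over all i in S (k = |S|):
I(X_S; Y) = \frac{1}{k} \sum_{i \in S}
  \left[ I(X_i; Y) + I(X_{S \setminus i}; Y \mid X_i) \right]
```

Both lines are exact identities; no approximation has been made at this point.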
Reformulation of MIFS by QUBO (2)
Approximate under the assumption of Conditional Independence (CI)
Proof. If we assume the conditional independence,
we can obtain
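The approximated expression is not preserved in this transcript. The following is a hedged sketch of what the conditional independence assumption typically yields, using assumed notation: S is the selected subset, k = |S|, and X_{S \setminus i} denotes the selected features other than X_i. If the features in X_{S \setminus i} are conditionally independent given X_i, and also given (X_i, Y), the conditional MI term splits into pairwise terms, which can be substituted into the averaged chain-rule expansion of I(X_S; Y):

```latex
% Pairwise decomposition under the conditional independence assumption:
I(X_{S \setminus i}; Y \mid X_i) = \sum_{j \in S \setminus i} I(X_j; Y \mid X_i)

% Substituting into the averaged expansion:
I(X_S; Y) \approx \frac{1}{k} \sum_{i \in S}
  \Big[ I(X_i; Y) + \sum_{j \in S \setminus i} I(X_j; Y \mid X_i) \Big]
```

The right-hand side involves only single-feature and pairwise quantities, which is what makes a quadratic formulation possible.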
Reformulation of MIFS by QUBO (3)
Optimization of MIFS
QUBO formulation of MIFS: an MI term plus a penalty term for selecting only k features (α: penalty strength)
MIFS can be optimized by Ising annealing machines
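The slide's QUBO expression is not preserved in this transcript. The sketch below assumes a common form of the objective, maximize x^T Q x - α (Σ x_i - k)^2 over binary x, where the diagonal of Q holds single-feature scores and the off-diagonal holds pairwise scores. The matrix values, the function names, and the brute-force solver (a stand-in for an annealer, feasible only for tiny n) are all hypothetical.

```python
from itertools import product


def qubo_objective(Q, x, k, alpha):
    """x^T Q x - alpha * (sum(x) - k)^2 for a binary vector x."""
    n = len(x)
    quad = sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    return quad - alpha * (sum(x) - k) ** 2


def brute_force(Q, k, alpha):
    """Enumerate all 2^n binary vectors and return the best one."""
    n = len(Q)
    return max(
        product((0, 1), repeat=n),
        key=lambda x: qubo_objective(Q, x, k, alpha),
    )


# Hypothetical scores for 4 features (not measured MI values).
Q = [
    [0.50, 0.05, 0.30, 0.10],
    [0.05, 0.45, 0.02, 0.10],
    [0.30, 0.02, 0.40, 0.20],
    [0.10, 0.10, 0.20, 0.10],
]

# alpha is chosen larger than the sum of all entries of Q, so any
# violation of the cardinality constraint costs more than it can gain.
best = brute_force(Q, k=2, alpha=10.0)
print(best)
```

In this toy instance the two features with the largest diagonal scores (0 and 1) are not the best pair: the pairwise terms favor features 0 and 2, which is the kind of interaction a purely relevance-based ranking ignores.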
Interpretation of the Derived Formulation
Heuristic methods such as Max Relevance or MRMR are included in the derived formulation
Expand the derived formulation into three terms: Relevance, Redundancy, Complementarity
Increase: Relevance, Complementarity
Reduce: Redundancy
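The expanded expression is not preserved in this transcript. One standard identity that produces exactly these three kinds of terms is the decomposition of a pairwise conditional MI, shown here as a sketch in assumed notation (X_i, X_j for two features, Y for the target):

```latex
I(X_j; Y \mid X_i)
  = \underbrace{I(X_j; Y)}_{\text{relevance}}
  - \underbrace{I(X_i; X_j)}_{\text{redundancy}}
  + \underbrace{I(X_i; X_j \mid Y)}_{\text{complementarity}}
```

Keeping only the relevance term corresponds to a Max Relevance style criterion, and keeping relevance minus redundancy corresponds to an MRMR style criterion.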
Comparison of Optimization Methods
We compared several optimization methods for two types of formulations (BQP, QUBO)
Problem Formulation: Optimization Methods
Binary Quadratic Problem (BQP): Linear Relaxation [1] (Linear), Truncated Power [1,2] (TPower)
QUBO: Tabu Search by qbsolv [3], D-Wave 2000Q
[1] H. Venkateswara, et al., 2015 [2] X. T. Yuan & T. Zhang, 2013 [3] https://github.com/dwavesystems/qbsolv
Linear Relaxation Method (Linear)
[1] H. Venkateswara, et al., 2015
Linearize the quadratic term by introducing new variables
One of the optimal conditions is , which leads to
Since Q_ij ≥ 0, the solution of this problem is given by the k largest column sums of Q. This solution is tightly bounded [1]. The time complexity is O(nk).
The computation of Linear is fast and the solution is tightly bounded.
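The selection rule stated above (take the k features with the largest column sums of Q) can be sketched in a few lines. The function name `linear_select` and the matrix values are hypothetical, and the sketch computes the column sums directly from a dense matrix rather than reproducing the complexity quoted on the slide.

```python
def linear_select(Q, k):
    """Return the indices of the k largest column sums of Q, sorted."""
    n = len(Q)
    col_sums = [sum(Q[i][j] for i in range(n)) for j in range(n)]
    ranked = sorted(range(n), key=lambda j: col_sums[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical non-negative score matrix for 4 features.
Q = [
    [0.50, 0.05, 0.30, 0.10],
    [0.05, 0.45, 0.02, 0.10],
    [0.30, 0.02, 0.40, 0.20],
    [0.10, 0.10, 0.20, 0.10],
]

chosen = linear_select(Q, k=2)
print(chosen)
```

The column sums here are 0.95, 0.62, 0.92, and 0.50, so features 0 and 2 are kept.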
Truncated Power Method (TPower)
Finding the largest k-sparse eigenvector of Q is defined as
We select the i-th feature if x_i > 0
This is calculated by the following procedure [1] (repeat T times)
This method is confirmed to be the best-performing method for BQP problems with a non-negative matrix [2]. The time complexity of the algorithm is O(Tn^2).
TPower is known to be the state-of-the-art method for BQP problems
[1] X. T. Yuan & T. Zhang, 2013 [2] H. Venkateswara, et al., 2015
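The iteration itself is not preserved in this transcript. The sketch below follows the usual description of the truncated power method: multiply by Q, keep the k largest-magnitude entries, renormalize, and repeat T times. The uniform starting vector, the function names, and the matrix values are assumptions made for illustration.

```python
import math


def truncate(v, k):
    """Zero out all but the k largest-magnitude entries of v."""
    keep = set(sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k])
    return [v[i] if i in keep else 0.0 for i in range(len(v))]


def tpower(Q, k, iters=50):
    """Truncated power iteration; returns indices i with x_i > 0."""
    n = len(Q)
    x = [1.0 / math.sqrt(n)] * n  # uniform start (an assumption)
    for _ in range(iters):
        # Power step: y = Q x
        y = [sum(Q[i][j] * x[j] for j in range(n)) for i in range(n)]
        # Truncation step: keep the k largest entries, then normalize
        y = truncate(y, k)
        norm = math.sqrt(sum(t * t for t in y))
        x = [t / norm for t in y]
    return [i for i, t in enumerate(x) if t > 0]


# Hypothetical non-negative score matrix for 4 features.
Q = [
    [0.50, 0.05, 0.30, 0.10],
    [0.05, 0.45, 0.02, 0.10],
    [0.30, 0.02, 0.40, 0.20],
    [0.10, 0.10, 0.20, 0.10],
]

chosen = tpower(Q, k=2)
print(chosen)
```

Each iteration costs one dense matrix-vector product, which matches the O(Tn^2) complexity quoted above.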
Optimization by D-Wave Machine
We used the D-Wave machine with the following settings:
• Machine: D-Wave 2000Q
• Embedding: 64 bit full connection
• Annealing Time: 20 µs
• Annealing Repetitions: 10
[Figure: Full Connection Embedding for C(4,4,4)]
When the feature size n is larger than the hardware size h (= 64), we use Linear as a pre-processing step to narrow the candidate features down to h.
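The pre-processing step can be sketched as follows, assuming that "Linear" ranks features by the column sums of Q and that the reduced problem is simply the sub-matrix of Q restricted to the kept features. The function name `narrow_to_hardware` and the matrix values are hypothetical, and no D-Wave API is called here.

```python
def narrow_to_hardware(Q, h):
    """Keep the h features with the largest column sums of Q.

    Returns (kept_indices, sub_Q), where sub_Q is Q restricted to the
    kept rows and columns. If n <= h, Q is returned unchanged.
    """
    n = len(Q)
    if n <= h:
        return list(range(n)), Q
    col_sums = [sum(Q[i][j] for i in range(n)) for j in range(n)]
    ranked = sorted(range(n), key=lambda j: col_sums[j], reverse=True)
    keep = sorted(ranked[:h])
    sub_Q = [[Q[i][j] for j in keep] for i in keep]
    return keep, sub_Q


# Hypothetical 4-feature score matrix, reduced to a "hardware size" of 3.
Q = [
    [0.50, 0.05, 0.30, 0.10],
    [0.05, 0.45, 0.02, 0.10],
    [0.30, 0.02, 0.40, 0.20],
    [0.10, 0.10, 0.20, 0.10],
]

keep, sub_Q = narrow_to_hardware(Q, h=3)
print(keep)
print(len(sub_Q), len(sub_Q[0]))
```

The kept indices are needed afterwards to map a solution of the reduced problem back to the original feature numbering.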
Comparison of Mutual Information Score
We compared the MI scores of each optimization method on a public dataset. The increases relative to Linear are shown in the graph below.
Data Name: a1a #features: 122 #data points: 8000
[Figure: Mutual Information Score. MI increase (%) w.r.t. Linear versus #features (5, 6, 7, 8, 10, 15, 20, 25, 30, 40); higher is better]
D-Wave obtained the best MI scores of all the methods compared
Classification Accuracy
We calculated the classification accuracy for different #features. Accuracy is a good measure to evaluate the quality of a selected subset of features.
[Figure: Original features; Selected k features; Classification Accuracy]
Measure the classification accuracy with random forest classifiers
Classification Accuracy
We evaluated each method by classification accuracy for different #features.
Data Name: a1a #features: 122 #data points: 8000
[Figure: Classification Accuracy versus #features (5 to 40), accuracy axis from 0.70 to 0.78, for D-Wave, TPower, Tabu(qbsolv), and Linear; higher is better]
The accuracies of D-Wave are better when #features is small
Summary
• We derived the QUBO formulation of MIFS so that the problem can be embedded in Ising machines
• We used the D-Wave quantum annealing machine as a solver in MIFS
• The optimization method by D-Wave outperformed TPower, which is the state-of-the-art optimization method for BQP
• We are planning to use MIFS by D-Wave in Kaggle!
Thank you for listening
Runtime of Optimizations
Data Name: a1a #features: 122 #data points: 8000
Method Average Runtime
Linear 9.0 msec
TPower 26.1 msec
Tabu(qbsolv) 14.3 sec
D-Wave 9.0 msec (Linear)+ 100 μsec (annealing)
Comparison to MRMR, Max Rel.
[Figure: Accuracy versus #features (5 to 40), accuracy axis from 0.70 to 0.78, for D-Wave, MRMR, and Max Rel.]
Data Name: a1a #features: 122 #data points: 8000