apr. 11, 2018 global mutual information based feature ... · recruit restaurant visitor forecasting...

Global Mutual Information Based Feature Selection By Quantum Annealing

Kotaro Tanahashi*, Shinichi Takayanagi*, Tomomitsu Motohashi*, Shu Tanaka✝ * Recruit Communications Co.,Ltd. ✝ Waseda University, JST PRESTO

Apr. 11, 2018

(C) Recruit Communications Co., Ltd.

Introduction of Recruit

We provide various kinds of online services from job search to hotel reservations across the world.

Automobile

Education

Life & Local O2O

Travel

Beauty

Housing

Bridal & Baby

Human Resources

IT & Trends Media

Dining

www.flaticon.com


Introduction of Recruit

Internet Users Clients

• We help users to find the best clients through our services.
• Data science plays an important role in the business.


Data Science at Recruit

Recruit has hosted two data mining competitions on Kaggle

Kaggle, KDD Cup: international data mining competitions

Engineers at Recruit (as of March 2018)

We are passionate about data science. Some of us came in 1st and 2nd place in KDD Cup 2015

www.kdd.org/kdd-cup

Recruit Restaurant Visitor Forecasting (2018)

Coupon Purchase Prediction (2015)

www.kaggle.com


Feature Selection: A Key Technique

“Beating Kaggle the easy way”

• A key technique to win data mining competitions
• Find the most relevant features
• Balance the bias-variance trade-off

[Figure: data matrix with rows User 1 to User n and feature columns; the relevant features are highlighted]

Benefits

✔ Improve prediction
✔ Reduce computational cost

https://www.ke.tu-darmstadt.de/lehre/arbeiten/studien/2015/Dong_Ying.pdf

Feature selection is essential in predictive analysis


Types of Feature Selection (FS) Algorithms

Wrapper methods: iteratively evaluate feature subsets with a black-box learning algorithm

[Diagram: Set of all features → Selecting the best subset (Generate a subset ⇄ Learning Algorithm) → Performance]

Embedded methods: train a model and select features at the same time

[Diagram: Set of all features → Selecting the best subset (Generate a subset ⇄ Learning Algorithm + Performance)]

Filter methods: features are selected by a criterion such as Mutual Information

✔ Independent of learning algorithms
✔ Can be used as pre-processing

[Diagram: Set of all features → Selecting the best subset → Learning Algorithm → Performance]

Filter methods are useful as pre-processing since they do not depend on the predictive model

https://en.wikipedia.org/wiki/Feature_selection


What is Mutual Information (MI)?

Figures are retrieved from http://minepy.readthedocs.io/en/latest/python.html

Mutual Information I(X;Y) is a measure of the mutual dependence between two random variables X and Y

I(X;Y) = H(X) − H(X|Y), where H is the Shannon entropy

[Figure: example plot labeled Pearson r = 0.8, MI = 0.5]

✔ MI can capture non-linear relationships, unlike Pearson’s correlation coefficient

Mutual Information I(X;Y)

High: able to predict Y given X
Low: hard to predict Y given X
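The contrast between MI and Pearson's r can be checked on a toy example. The sketch below is an illustration only, not the estimator used in the talk: it assumes discrete samples and a simple plug-in (count-based) estimate, and the helper names mutual_information and pearson_r are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2(p(x,y) / (p(x) p(y))), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


def pearson_r(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Y = X^2 on a symmetric grid: fully determined by X, but not linearly.
xs = [-2, -1, 0, 1, 2] * 20
ys = [x * x for x in xs]

r = pearson_r(xs, ys)
mi = mutual_information(xs, ys)
print(f"Pearson r = {r:.3f}, MI = {mi:.3f} bits")
```

Here Y is a deterministic function of X, so the MI equals the entropy of Y, while the linear correlation vanishes by symmetry.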



General formulation of MIFS

Mutual Information based Feature Selection (MIFS)

MIFS: using Mutual Information as a criterion in filter methods

MIFS selects a feature subset of size k which maximizes the Mutual Information (MI) between the selected features and the target variable

Unfortunately, the exact calculation of this MI is intractable…
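The formula itself did not survive extraction. A common way to write the selection problem described above is sketched here; the notation (X_S for the selected features, n for the number of features) is an assumption and may differ from the slide.

```latex
S^{*} = \operatorname*{arg\,max}_{S \subseteq \{1, \dots, n\},\ |S| = k} I(X_S; Y)
```

The search space has \binom{n}{k} subsets, and each evaluation needs the joint distribution of k features, which is why exact optimization does not scale.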


Heuristic MIFS Algorithms

[1] H. Peng et al., 2005 [2] J. R. Vergara & P. A. Estévez, 2015

Max Relevance method: select the most relevant feature iteratively (repeat k times)

Min Redundancy & Max Relevance method[1] (MRMR): select the most relevant and least redundant feature iteratively (repeat k times)

Some heuristic MIFS algorithms have been developed. However, these methods are greedy approximations[2].
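The two greedy procedures can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the MI values are already computed, uses the difference form of the MRMR criterion (relevance minus mean redundancy with the features chosen so far), and all numbers are hypothetical.

```python
def max_relevance(relevance, k):
    """Pick the feature with the highest I(X_i;Y), repeated k times."""
    selected = []
    for _ in range(k):
        candidates = [i for i in range(len(relevance)) if i not in selected]
        selected.append(max(candidates, key=lambda i: relevance[i]))
    return selected


def mrmr(relevance, redundancy, k):
    """Greedy MRMR: relevance minus mean redundancy with the chosen features."""
    selected = []
    for _ in range(k):
        candidates = [j for j in range(len(relevance)) if j not in selected]

        def score(j):
            if not selected:
                return relevance[j]
            penalty = sum(redundancy[j][i] for i in selected) / len(selected)
            return relevance[j] - penalty

        selected.append(max(candidates, key=score))
    return selected


# Hypothetical values: relevance[i] = I(X_i;Y), redundancy[i][j] = I(X_i;X_j).
# Features 0 and 1 are both relevant but nearly redundant with each other.
relevance = [0.50, 0.48, 0.30, 0.10]
redundancy = [
    [0.00, 0.45, 0.05, 0.02],
    [0.45, 0.00, 0.05, 0.02],
    [0.05, 0.05, 0.00, 0.01],
    [0.02, 0.02, 0.01, 0.00],
]

print(max_relevance(relevance, 2))     # picks the redundant pair
print(mrmr(relevance, redundancy, 2))  # swaps feature 1 for feature 2
```

Both procedures fix one feature per step and never revisit it, which is the greedy behavior the slide refers to.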


Our Contributions

MIFS optimization

QUBO formulation of MIFS

[Figure: MI increase (%) w.r.t. Linear vs. #features (5, 6, 7, 8, 10, 15, 20, 25, 30, 40); higher is better]

(1) We reformulated MIFS as a QUBO

(2) We confirmed that optimization by D-Wave performs well for MIFS

Image retrieved from https://www.dwavesys.com/resources/media-resources

QUBO: Quadratic Unconstrained Binary Optimization



Reformulation of MIFS by QUBO (1)

Theorem 1.1: Chain rule for Conditional Mutual Information

Using Theorem 1.1, the following equation holds for all i ∈ S

Averaging the equation above over all i ∈ S leads to

Proof.

Expand the MI term



Approximate under the assumption of Conditional Independence (CI)

Proof. If we assume the conditional independence

We can obtain
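The equations on these slides were lost in extraction. The sketch below shows one way the steps named above could fit together. The first two lines follow from the standard chain rule for mutual information. The third line is an assumed form of the CI approximation, exact when the remaining features are conditionally independent given X_i and given (X_i, Y); it is not confirmed to be the authors' exact expression.

```latex
% Chain rule, for any i \in S:
I(X_S; Y) = I(X_i; Y) + I(X_{S \setminus i}; Y \mid X_i)

% Averaging over all i \in S, with |S| = k:
I(X_S; Y) = \frac{1}{k} \sum_{i \in S} \left[ I(X_i; Y) + I(X_{S \setminus i}; Y \mid X_i) \right]

% Assumed CI approximation:
I(X_{S \setminus i}; Y \mid X_i) \approx \sum_{j \in S \setminus i} I(X_j; Y \mid X_i)

% Resulting pairwise approximation:
I(X_S; Y) \approx \frac{1}{k} \sum_{i \in S} \Big[ I(X_i; Y) + \sum_{j \in S \setminus i} I(X_j; Y \mid X_i) \Big]
```

The final expression contains only single-feature and pairwise terms, which is what makes a quadratic binary formulation possible.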


Optimization of MIFS

QUBO formulation of MIFS

Terms of the objective: MI; penalty for selecting only k features (α: penalty strength)


MIFS can be optimized by Ising annealing machines
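A QUBO of this shape can be sketched for a tiny instance and solved by exhaustive search. The exact coefficients of the slide's equation are not recoverable, so the sketch assumes the objective "sum of I(X_i;Y) x_i plus sum over i ≠ j of I(X_j;Y|X_i) x_i x_j, minus α (Σ x_i − k)^2", negated for minimization. The function names and all numeric values are hypothetical.

```python
import itertools


def build_qubo(relevance, cond_mi, k, alpha):
    """Upper-triangular QUBO coefficients {(i, j): w}, to be minimized.

    Uses x_i^2 = x_i to expand alpha * (sum_i x_i - k)^2 into linear and
    pairwise terms; the constant alpha * k^2 is dropped.
    """
    n = len(relevance)
    qubo = {}
    for i in range(n):
        qubo[(i, i)] = -relevance[i] + alpha * (1 - 2 * k)
        for j in range(i + 1, n):
            qubo[(i, j)] = -(cond_mi[i][j] + cond_mi[j][i]) + 2 * alpha
    return qubo


def energy(qubo, x):
    """QUBO energy of a binary assignment x."""
    return sum(w * x[i] * x[j] for (i, j), w in qubo.items())


def brute_force(qubo, n):
    """Exhaustive minimization; only feasible for very small n."""
    return min(itertools.product((0, 1), repeat=n),
               key=lambda x: energy(qubo, x))


# Hypothetical values: relevance[i] = I(X_i;Y), cond_mi[i][j] = I(X_j;Y|X_i).
relevance = [0.50, 0.48, 0.30, 0.10]
cond_mi = [
    [0.00, 0.05, 0.28, 0.09],
    [0.07, 0.00, 0.27, 0.09],
    [0.48, 0.45, 0.00, 0.10],
    [0.49, 0.47, 0.30, 0.00],
]

qubo = build_qubo(relevance, cond_mi, k=2, alpha=2.0)
best = brute_force(qubo, n=4)
print(best)
```

An annealer or tabu search takes the place of brute_force on larger instances; α has to be large enough that violating the size constraint never pays off.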


Interpretation of the Derived Formulation

Heuristic methods such as Max Relevance or MRMR are included in the derived formulation

Expand the derived formulation

Terms of the expanded formulation: Relevance, Redundancy, Complementarity

Increase: Relevance, Complementarity
Reduce: Redundancy
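The three terms correspond to a standard identity, I(X_j;Y|X_i) = I(X_j;Y) − I(X_i;X_j) + I(X_i;X_j|Y), which can be checked numerically. The sketch below assumes this is the expansion meant on the slide (the equation itself was lost) and uses a hypothetical XOR distribution in which neither feature is relevant alone but the pair is fully informative.

```python
import math
from collections import defaultdict


def marginal(joint, axes):
    """Marginal distribution over the given axes of a joint table."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[a] for a in axes)] += p
    return out


def entropy(joint, axes):
    """Shannon entropy (bits) of the variables on the given axes."""
    return -sum(p * math.log2(p)
                for p in marginal(joint, axes).values() if p > 0)


def mi(joint, a, b):
    """I(A;B) = H(A) + H(B) - H(A,B)."""
    return entropy(joint, a) + entropy(joint, b) - entropy(joint, a + b)


def cmi(joint, a, b, c):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    return (entropy(joint, a + c) + entropy(joint, b + c)
            - entropy(joint, a + b + c) - entropy(joint, c))


# Outcomes are (x_i, x_j, y) with y = x_i XOR x_j and uniform inputs.
joint = {(0, 0, 0): 0.25, (0, 1, 1): 0.25, (1, 0, 1): 0.25, (1, 1, 0): 0.25}
XI, XJ, Y = (0,), (1,), (2,)

relevance = mi(joint, XJ, Y)              # I(X_j;Y)
redundancy = mi(joint, XI, XJ)            # I(X_i;X_j)
complementarity = cmi(joint, XI, XJ, Y)   # I(X_i;X_j|Y)
lhs = cmi(joint, XJ, Y, XI)               # I(X_j;Y|X_i)
rhs = relevance - redundancy + complementarity
print(lhs, rhs)
```

In this example all of the conditional MI comes from the complementarity term, which a relevance-only criterion would miss.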


Comparison of Optimization Methods

Problem Formulation: Optimization Methods

Binary Quadratic Problem (BQP): Linear Relaxation[1] (Linear), Truncated Power[1,2] (TPower)

QUBO: Tabu Search by qbsolv[3], D-Wave 2000Q

[1] H. Venkateswara, et al., 2015 [2] X. T. Yuan & T. Zhang, 2013 [3] https://github.com/dwavesystems/qbsolv

We compared several optimization methods for two types of formulations (BQP, QUBO)



Linear Relaxation Method (Linear)

[1] H. Venkateswara, et al., 2015

Linearize the quadratic term by introducing new variables

One of the optimal conditions is , which leads to

Since Qij ≧ 0, the solution of this problem is given by the k largest column sums of Q. This solution is tightly bounded[1]. Time complexity is O(nk).

The computation of Linear is fast and the solution is tightly bounded.
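The selection rule stated above (take the k largest column sums of Q) can be sketched in a few lines. The matrix values and the function name linear_select are hypothetical, and the full sort used here costs O(n log n) after the column sums rather than the O(nk) quoted on the slide.

```python
def linear_select(Q, k):
    """Return the indices of the k columns of Q with the largest sums."""
    n = len(Q)
    column_sums = [sum(Q[i][j] for i in range(n)) for j in range(n)]
    ranked = sorted(range(n), key=lambda j: column_sums[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical non-negative matrix: relevance on the diagonal,
# pairwise terms off the diagonal.
Q = [
    [0.50, 0.06, 0.38, 0.29],
    [0.06, 0.48, 0.36, 0.28],
    [0.38, 0.36, 0.30, 0.20],
    [0.29, 0.28, 0.20, 0.10],
]

print(linear_select(Q, 2))
```

Feature 2 is selected ahead of feature 1 even though its diagonal entry is smaller, because its pairwise terms raise its column sum.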


Truncated Power Method (TPower)

Finding the largest k-sparse eigenvector of Q is defined as

We select the i-th feature if xi > 0. This is calculated by the following procedure[1]

[1] X. T. Yuan & T. Zhang, 2013 [2] H. Venkateswara, et al., 2015

Repeat T times

This method is confirmed to be the best-performing method for BQP problems with a non-negative matrix[2]. The time complexity of the algorithm is O(Tn^2).

TPower is known to be the state-of-the-art method for BQP problems
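The iteration can be sketched as a power step followed by truncation to the k largest entries and renormalization, repeated T times. This is a minimal illustration under stated assumptions: the uniform starting vector, the fixed iteration count, and the example matrix are choices made here, not taken from the slides.

```python
import math


def truncated_power(Q, k, iterations=100):
    """Approximate the largest k-sparse eigenvector of Q; return its support."""
    n = len(Q)
    x = [1.0 / math.sqrt(n)] * n  # assumed dense, uniform starting vector
    for _ in range(iterations):
        # Power step: y = Q x
        y = [sum(Q[i][j] * x[j] for j in range(n)) for i in range(n)]
        # Truncation: keep the k entries with the largest magnitude
        keep = set(sorted(range(n), key=lambda i: abs(y[i]), reverse=True)[:k])
        y = [y[i] if i in keep else 0.0 for i in range(n)]
        # Renormalize to unit length
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return sorted(i for i in range(n) if x[i] > 0)


# Hypothetical non-negative matrix: relevance on the diagonal,
# pairwise terms off the diagonal.
Q = [
    [0.50, 0.06, 0.38, 0.29],
    [0.06, 0.48, 0.36, 0.28],
    [0.38, 0.36, 0.30, 0.20],
    [0.29, 0.28, 0.20, 0.10],
]

print(truncated_power(Q, 2))
```

Each iteration is one matrix-vector product, which matches the O(Tn^2) cost for a dense Q.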


Optimization by D-Wave Machine

We used the D-Wave machine with the following settings

• Machine: D-Wave 2000Q
• Embedding: 64 bit full connection
• Annealing Time: 20µs
• Annealing Repetitions: 10

[Figure: Full Connection Embedding for C(4,4,4)]

When the number of features n is larger than the hardware size h (=64), we use Linear as pre-processing to narrow the candidate features down to h.
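The narrowing step can be sketched as follows: keep the h features with the largest column sums, pass the reduced h × h matrix to the solver, and map the solver's indices back to the original features. The function name prefilter, the toy matrix, and the placeholder solver output are all hypothetical, and no D-Wave API is shown.

```python
def prefilter(Q, h):
    """Keep the h features with the largest column sums of Q.

    Returns the kept original indices and the reduced h x h matrix.
    """
    n = len(Q)
    if n <= h:
        return list(range(n)), [row[:] for row in Q]
    column_sums = [sum(Q[i][j] for i in range(n)) for j in range(n)]
    ranked = sorted(range(n), key=lambda j: column_sums[j], reverse=True)
    kept = sorted(ranked[:h])
    reduced = [[Q[i][j] for j in kept] for i in kept]
    return kept, reduced


# Toy example with n = 4 features and a pretend hardware size h = 3.
Q = [
    [0.10, 0.20, 0.28, 0.29],
    [0.20, 0.30, 0.36, 0.38],
    [0.28, 0.36, 0.48, 0.06],
    [0.29, 0.38, 0.06, 0.50],
]

kept, reduced = prefilter(Q, h=3)

# Placeholder for a solver result on the reduced problem (local indices).
local_solution = [0, 2]
selected = [kept[i] for i in local_solution]
print(kept, selected)
```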


Comparison of Mutual Information Score

Data Name: a1a #features: 122 #data points: 8000

[Figure: Mutual Information Score. MI increase (%) w.r.t. Linear vs. #features (5, 6, 7, 8, 10, 15, 20, 25, 30, 40); higher is better]

We compared the MI scores of each optimization method on a public dataset. The graph shows the increase relative to Linear.

D-Wave obtained the best MI scores among all methods


Classification Accuracy

We calculated the classification accuracy for different #features. Accuracy is a good measure to evaluate the quality of a selected subset of features.

[Diagram: Original features → Selected k features → classification accuracy measured by random forest classifiers]

[Figure: Accuracy (0.70 to 0.78) vs. #features (5 to 40) for D-Wave, TPower, Tabu(qbsolv), Linear; higher is better]

The accuracies of D-Wave are better when #features is small



We evaluated each method by classification accuracy for different #features.




Summary

• We derived the QUBO formulation of MIFS so that the problem can be embedded in Ising machines

• We used the D-Wave quantum annealing machine as a solver in MIFS

• The optimization method by D-Wave outperformed TPower, which is the state-of-the-art optimization method for BQP

• We are planning to use MIFS by D-Wave in Kaggle!


Thank you for listening


Runtime of Optimizations


Method          Average Runtime
Linear          9.0 msec
TPower          26.1 msec
Tabu(qbsolv)    14.3 sec
D-Wave          9.0 msec (Linear) + 100 μsec (annealing)


Comparison to MRMR, Max Rel.

[Figure: Accuracy (0.70 to 0.78) vs. #features (5 to 40) for D-Wave, MRMR, Max Rel.]

