
Hot Topics in Machine Learning (or how to win a Kaggle competition)

Benedikt Wilbertz

Trendiction S.A., Luxembourg

June 17, 2016

Benedikt Wilbertz (Trendiction) Hot Topics in Machine Learning June 17, 2016 1 / 41


Introduction

Setting for Supervised Learning

$\mathcal{Y}$: prediction space (labels for classification, $\mathbb{R}^K$ for regression)

$Y$: $\mathcal{Y}$-valued random variable to predict

$\mathcal{X}$: space of predictors (aka features), typically $\mathcal{X} = \mathbb{R}^d$

$X$: $\mathcal{X}$-valued random variable modeling the distribution of the predictors

$\mathcal{N} = (\mathcal{Y} \times \mathcal{X})^N$: space containing all training samples of size $N$

$\mathbf{N}$: random variable representing a training sample of size $N$, independent of $X$ and $Y$

All random variables $X$, $Y$, $\mathbf{N}$ are defined on a joint probability space $(\Omega, \mathcal{S}, \mathbb{P})$. Let $f_N$ be a model trained on a realization of the random variable $\mathbf{N}$ (that is, on a random sample of size $N$ from $\mathcal{Y} \times \mathcal{X}$). The optimal model in the least-squares sense is then given by

Supervised Learning Problem

$$\mathbb{E}\,(Y - f_N(X))^2 \to \min_{f_N \in \mathrm{models}(\mathbf{N})}$$


Introduction

Error decomposition

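In order to assess the performance of a prediction $f_N(X)$ that was trained on a random sample of $N$ observations from $\mathcal{Y} \times \mathcal{X}$, we fix a realization $x := X(\omega)$ of the predictors and derive for the mean squared error:

$$
\begin{aligned}
\mathrm{MSE}(x) &:= \mathbb{E}\big([Y - f_N(X)]^2 \mid X = x\big) = \dots \\
&= \underbrace{\mathbb{E}\big([Y - \mathbb{E}(Y \mid X = x)]^2 \mid X = x\big)}_{\sigma^2(Y \mid X = x)\ \text{irreducible error}} \\
&\quad + \underbrace{\big[\mathbb{E}(Y \mid X = x) - \mathbb{E}(f_N(X) \mid X = x)\big]^2}_{\text{model bias}} \\
&\quad + \underbrace{\mathbb{E}\big([f_N(X) - \mathbb{E}(f_N(X) \mid X = x)]^2 \mid X = x\big)}_{\sigma^2(f_N(X) \mid X = x)\ \text{model variance}}
\end{aligned}
$$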
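The decomposition can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original talk: the true regression function (`math.sin`), the noise level, the sample sizes and the straight-line model are all assumptions chosen for illustration. It repeatedly draws a training sample, fits the model, and compares the directly estimated MSE at a fixed point `x0` with the sum of the three terms.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y ~ a + b*x (closed form)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def decompose(x0, n_train=30, n_trials=2000, sigma=0.3, seed=0):
    """Estimate MSE, irreducible error, squared bias and variance at x0."""
    rng = random.Random(seed)
    truth = math.sin  # plays the role of E(Y | X = x); an assumption
    preds, sq_errors = [], []
    for _ in range(n_trials):
        # One realization of the training sample and the model fitted on it.
        xs = [rng.uniform(0.0, math.pi) for _ in range(n_train)]
        ys = [truth(x) + rng.gauss(0.0, sigma) for x in xs]
        a, b = fit_line(xs, ys)
        pred = a + b * x0
        # A fresh observation of Y at X = x0.
        y_new = truth(x0) + rng.gauss(0.0, sigma)
        preds.append(pred)
        sq_errors.append((y_new - pred) ** 2)
    mean_pred = sum(preds) / n_trials
    return {
        "mse": sum(sq_errors) / n_trials,
        "irreducible": sigma ** 2,
        "bias_sq": (truth(x0) - mean_pred) ** 2,
        "variance": sum((p - mean_pred) ** 2 for p in preds) / n_trials,
    }


result = decompose(x0=math.pi / 2)
```

A straight line cannot follow the sine curve, so in this setup the squared bias dominates the variance; a more flexible model would shift the balance the other way.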


Deep Learning

Neural Networks and Deep Learning

Convolutional Networks

Very popular in the late 80s and 90s:

– 7 layers – 60k parameters

Training

Stochastic Gradient Descent


Deep Learning

Neural Networks and Deep Learning

Renaissance of neural networks in 2012

Krizhevsky et al. ’12: ImageNet Classification with Deep Convolutional Neural Networks

trained on 1.2 million labeled images

highly optimized GPU code

achieved new state-of-the-art results for classification into 1000 object classes.


Deep Learning

Neural Networks and Deep Learning

GoogLeNet (Szegedy et al. ’14)

– 27 layers deep – 1.5 GFLOP per forward pass – 7M parameters


Deep Learning

Neural Networks and Deep Learning

Pushing deep learning to the limits

He et al ’16: Residual networks with 1k layers and 10M parameters

But what makes the difference to the 90s (apart from sample/parameter size)?

Data augmentation and bootstrapping

Dropout layers

ReLU activation

fast GPUs

Improvements on the training

Regularization

Nesterov/AdaGrad/AdaDelta/Adam variants for SGD
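Two of the ingredients listed above are simple enough to write down directly. The following is a minimal sketch in plain Python, assuming the common "inverted dropout" variant (survivors are rescaled at training time so that nothing changes at prediction time); the function names and the toy activations are hypothetical.

```python
import random


def relu(values):
    """ReLU activation: max(0, v) applied element-wise."""
    return [max(0.0, v) for v in values]


def dropout(values, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`,
    rescale the surviving units by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)
pre_activations = [-1.5, 0.2, 3.0, -0.1, 0.7]
hidden = relu(pre_activations)
train_out = dropout(hidden, rate=0.5, rng=rng)                  # random mask
eval_out = dropout(hidden, rate=0.5, rng=rng, training=False)   # identity
```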


Deep Learning

Neural Networks and Deep Learning

Software packages

Hardware

NVIDIA GTX 1080: 8.8 TFLOPS for 800 EUR

CUDA 8.0 and Pascal architecture: half-precision arithmetic (FP16) will double the computing power

Google’s TPUs (Tensor Processing Units): custom ASICs for certain operations in TensorFlow


Deep Learning

Neural Networks and Deep Learning

Applications

- Processing of 40M images/day (600 img/s) from social media for logo/brand recognition (only 0.5% contain a logo)


Gradient Boosting

Trees and Boosting

History

Random Forest (Breiman ’97)

Gradient Tree Boosting (Friedman ’99)

Gradient Tree Boosting + Regularization (XGBoost)

Basic idea of tree ensembles

Model:

$$\hat{y} = \sum_{k=1}^{K} f_k(x), \quad f_k \in \mathcal{F}$$

Tree: $f_k(x) = w_{q(x)}$, $w \in \mathbb{R}^T$, $q: \mathbb{R}^d \to \{1, 2, \dots, T\}$
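The notation above separates the tree structure q (which leaf does x fall into) from the leaf weights w. A minimal sketch of that split follows; the nested-tuple encoding of the structure, the 0-based leaf indices and the two toy trees are assumptions made for illustration.

```python
def leaf_index(structure, x):
    """q(x): walk the tree until a leaf index (an int) is reached.
    An internal node is a tuple (feature, threshold, left, right)."""
    node = structure
    while not isinstance(node, int):
        feature, threshold, left, right = node
        node = left if x[feature] < threshold else right
    return node


def tree_predict(structure, weights, x):
    """f_k(x) = w_{q(x)}: look up the weight of the leaf x falls into."""
    return weights[leaf_index(structure, x)]


def ensemble_predict(trees, x):
    """Prediction of the ensemble: the sum of all tree outputs."""
    return sum(tree_predict(s, w, x) for s, w in trees)


# Two hypothetical trees over x = [x_0, x_1]; leaves are indexed from 0 here.
tree_a = ((0, 15.0, 0, (1, 0.5, 1, 2)), [2.0, 0.1, -1.0])
tree_b = ((1, 0.5, 0, 1), [0.9, -0.9])
trees = [tree_a, tree_b]

prediction = ensemble_predict(trees, [12.0, 1.0])  # leaf 0 of a, leaf 1 of b
```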


Gradient Boosting

Trees and Boosting

Optimizing the tree structure

Objective:

$$\min \leftarrow \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

with regularization $\Omega(f_k) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$ and general loss function $l$.

Problem: Tree construction is a batch process, so we cannot apply an online method like SGD


Gradient Boosting

Trees and Boosting

Additive Training (Boosting)

Start from constant prediction, add a new function each time

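$$
\begin{aligned}
\hat{y}_i^{(0)} &= 0 \\
\hat{y}_i^{(1)} &= f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i) \\
\hat{y}_i^{(2)} &= f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i) \\
&\;\;\vdots \\
\hat{y}_i^{(t)} &= \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)
\end{aligned}
$$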
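The additive scheme can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the XGBoost algorithm: the loss is squared error (so each new function is fitted to the current residuals), each f_t is a one-split stump on a single feature, and there is no regularization or shrinkage. The data are made up.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump (squared loss) on 1-D inputs."""
    best = None
    order = sorted(set(xs))
    for lo, hi in zip(order, order[1:]):
        threshold = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        wl = sum(left) / len(left)      # leaf weight = mean residual
        wr = sum(right) / len(right)
        sse = (sum((r - wl) ** 2 for r in left)
               + sum((r - wr) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, wl, wr)
    _, threshold, wl, wr = best
    return lambda x: wl if x < threshold else wr


def boost(xs, ys, rounds):
    """Start from the constant prediction 0 and add one stump per round."""
    preds = [0.0] * len(xs)
    trees, history = [], []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        f_t = fit_stump(xs, residuals)
        trees.append(f_t)
        # Prediction of round t = prediction of round t-1 + f_t(x_i).
        preds = [p + f_t(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)))
    return trees, history


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
trees, history = boost(xs, ys, rounds=5)  # history: training SSE per round
```

The final prediction for a point x is `sum(f(x) for f in trees)`, which mirrors the sum over k in the formulas above.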


Gradient Boosting

Trees and Boosting

The prediction at round $t$ is $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$

Using a 2nd-order Taylor expansion of $l$, this can be applied to any smooth loss function like Euclidean loss (regression), softmax loss (classification), NDCG (ranking problems), etc.

Using the gradient $g_i$ and Hessian $h_i$ of $l$, an optimal tree is grown (stopping when the regularized Gini gain becomes negative), which optimizes in iteration $t$

$$
\begin{aligned}
\min \leftarrow{} & \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) \\
\approx{} & \sum_{i=1}^{n} \Big[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t) + \mathrm{const}
\end{aligned}
$$

(Explicit solution for leaf weights w. Splits are constructed in a greedy way)
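The slide does not spell out the explicit solution. The sketch below uses the closed forms from the standard XGBoost derivation, which are assumed to be what is meant here: with G and H the sums of g_i and h_i over the observations in a leaf, the optimal weight is w* = -G / (H + λ), and a split is worth taking only if its gain, reduced by the per-leaf penalty γ, is positive. Function names and data are hypothetical.

```python
def leaf_weight(g_sum, h_sum, lam):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -g_sum / (h_sum + lam)


def leaf_score(g_sum, h_sum, lam):
    """G^2 / (H + lambda): how much a leaf can reduce the objective."""
    return g_sum ** 2 / (h_sum + lam)


def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Objective reduction from splitting one leaf into two, minus gamma."""
    parent = leaf_score(g_left + g_right, h_left + h_right, lam)
    children = (leaf_score(g_left, h_left, lam)
                + leaf_score(g_right, h_right, lam))
    return 0.5 * (children - parent) - gamma


# Squared loss l = 1/2 (y - yhat)^2 gives g_i = yhat_i - y_i and h_i = 1.
ys = [1.0, 1.2, 3.0, 3.2]
preds = [0.0, 0.0, 0.0, 0.0]
g = [p - y for p, y in zip(preds, ys)]
h = [1.0] * len(ys)

# Without regularization the single-leaf weight is just the mean of y.
root_weight = leaf_weight(sum(g), sum(h), lam=0.0)

# Candidate split between the second and third observation.
gain_free = split_gain(sum(g[:2]), sum(h[:2]), sum(g[2:]), sum(h[2:]),
                       lam=1.0, gamma=0.0)
gain_penalized = split_gain(sum(g[:2]), sum(h[:2]), sum(g[2:]), sum(h[2:]),
                            lam=1.0, gamma=1.0)
```

With γ = 0 the candidate split has positive gain; with γ = 1 the same split is rejected, which is how the regularization stops tree growth.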


Gradient Boosting

Trees and Boosting

Software packages


Ensembles

Model ensembles

Stacking


Kaggle

Kaggle


Kaggle The Competition

The Task


Kaggle The Competition

Task / Rules

Timeframe: Oct 2015 – Feb 2016

Multiple-choice questions with 4 answers (NIR, i.e. no-information rate, 25%)

Training set: 2500 questions

Validation set: 8192 questions

Public Leaderboard based on 12.5%

Mandatory model submission one week before end

Final test set (12000 new questions) released one week before end

Private Leaderboard based only on these new questions

External data explicitly allowed

800 teams participated in stage I

170 in stage II


Kaggle The Competition

Challenges

What do you need to solve these problems?

External Data (Wikipedia, CK12, etc.)

NLP knowledge

Search infrastructure (Elasticsearch/Lucene)

Feature engineering and machine learning


Kaggle IBM Watson

Invitation for IBM Watson


Kaggle IBM Watson

Decline to Participate


Competition Questions

Examples

Question: A scientist claims he has found a cure for a skin disease. After publishing the results, the experiment was found to be biased. Why did publishing the results allow bias to be recognized within the experiment?

a) It allowed others to replicate the experiment.

b) It helped the scientist gain notoriety within his field.

c) It allowed the cure to be manufactured by the best company.

d) It helped other researchers find out more about the skin disease.

Our Answer: a) (509.6, 461.9, 427.0, 495.6)

Correct Answer: a)


Competition Questions

Examples

Question: What is the primary function of skin cells?

a) to deliver messages to the brain

b) to generate movement of muscles

c) to provide a physical barrier to the body

d) to produce carbohydrates for energy

Our Answer: c) (253.0, 261.9, 302.3, 277.0)

Correct Answer: c)


Competition Questions

Examples

Question: Which of the following would be most useful for calculating the density of a rock sample?

a) microscope and balance

b) graduated cylinder and balance

c) microscope and graduated cylinder

d) beaker and graduated cylinder

Our Answer: b) (267.5, 276.4, 271.3, 275.8)

Correct Answer: b)


Competition Our Approach

Information Retrieval (IR)

Idea: For each answer a)-d), create pairs of question + answer and score these 4 pairs in a search engine. The pair with the highest score wins.

Example

Put (This is a question) AND (this is an answer) into Google and rank by the number of hits.
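The idea can be sketched without a real search engine. In the toy version below, the "engine" is a hypothetical four-sentence corpus and the score of a question + answer pair is simply the largest token overlap with any document; both are stand-ins chosen for illustration, whereas a real setup would query an index such as Elasticsearch/Lucene and use its relevance score.

```python
import re


def tokens(text):
    """Lower-cased set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def pair_score(question, answer, corpus):
    """Toy relevance score: best token overlap between the combined
    question + answer query and any single document."""
    query = tokens(question) | tokens(answer)
    return max(len(query & tokens(doc)) for doc in corpus)


def best_answer(question, answers, corpus):
    """Score all question + answer pairs and return the winning label."""
    scores = {label: pair_score(question, text, corpus)
              for label, text in answers.items()}
    return max(scores, key=scores.get), scores


# Hypothetical mini corpus standing in for Wikipedia / CK12 text.
corpus = [
    "The skin forms a physical barrier that protects the body from infection.",
    "Neurons deliver messages to the brain.",
    "Plants produce carbohydrates for energy.",
    "Muscles generate movement of the body.",
]
question = "What is the primary function of skin cells?"
answers = {
    "a": "to deliver messages to the brain",
    "b": "to generate movement of muscles",
    "c": "to provide a physical barrier to the body",
    "d": "to produce carbohydrates for energy",
}
winner, scores = best_answer(question, answers, corpus)
```

Note how close the wrong answers come: a document that matches only the answer text still scores well, which is one reason plain overlap is replaced by weighted schemes such as TF/IDF or BM25.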


Competition Our Approach

TF/IDF

Term Frequency-Inverse Document Frequency (tf/idf) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

Definition

$$\mathrm{TFIDF}(t, d, D) := f_{t,d} \cdot \mathrm{IDF}(t, D),$$

where $f_{t,d}$ is the frequency of term $t$ in document $d$ and

$$\mathrm{IDF}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}$$

with $N$ being the total number of documents in the corpus.
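A direct transcription of the definition, as a minimal sketch: documents are assumed to be pre-tokenized lists of words, the raw count is used as the term frequency, and returning 0 for a term that occurs in no document is a convention chosen here (the formula itself is undefined in that case). The three documents are made up.

```python
import math
from collections import Counter


def idf(term, docs):
    """log(N / number of documents containing the term)."""
    n_containing = sum(1 for d in docs if term in d)
    if n_containing == 0:
        return 0.0  # convention for unseen terms; an assumption
    return math.log(len(docs) / n_containing)


def tfidf(term, doc, docs):
    """Raw count of the term in the document, times its IDF."""
    return Counter(doc)[term] * idf(term, docs)


docs = [
    ["skin", "cells", "protect", "the", "body"],
    ["the", "brain", "sends", "messages"],
    ["the", "skin", "is", "an", "organ"],
]
score_the = tfidf("the", docs[0], docs)      # occurs everywhere: IDF is 0
score_skin = tfidf("skin", docs[0], docs)    # occurs in 2 of 3 documents
score_brain = tfidf("brain", docs[1], docs)  # occurs in 1 of 3 documents
```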


Competition Our Approach

BM25

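Okapi BM25 (BM stands for Best Matching) is a ranking function used by search engines to rank matching documents according to their relevance to a given search query. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. Robertson, Karen Spärck Jones, and others.

Definition

$$\mathrm{BM25}(t, d, D) := \mathrm{IDF}(t, D) \cdot \frac{f_{t,d} \cdot (k_1 + 1)}{f_{t,d} + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)},$$

where

$$\mathrm{IDF}(t, D) = \log \frac{N - n(t) + 0.5}{n(t) + 0.5},$$

$|d|$ is the length of document $d$ in words, $\mathrm{avgdl}$ is the average document length in the corpus $D$, $n(t)$ is the number of documents containing $t$, and typically $k_1 \in [1.2, 2.0]$ and $b = 0.75$.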
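A minimal sketch of the per-term score, assuming pre-tokenized documents and that the length normalization uses the number of words in the scored document divided by the average document length. Summing the per-term scores over the query terms, as in `bm25_query`, is the usual way to score a whole query but is not stated on the slide. The toy corpus is made up.

```python
import math


def bm25_idf(term, docs):
    """log((N - n(t) + 0.5) / (n(t) + 0.5)); negative for terms that
    occur in more than half of the documents."""
    n_t = sum(1 for d in docs if term in d)
    return math.log((len(docs) - n_t + 0.5) / (n_t + 0.5))


def bm25_term(term, doc, docs, k1=1.2, b=0.75):
    """BM25 score of one term for one document."""
    f = doc.count(term)
    avgdl = sum(len(d) for d in docs) / len(docs)
    norm = 1.0 - b + b * len(doc) / avgdl   # document length normalization
    return bm25_idf(term, docs) * f * (k1 + 1.0) / (f + k1 * norm)


def bm25_query(query_terms, doc, docs, k1=1.2, b=0.75):
    """Score of a query: sum of the per-term scores (an assumption)."""
    return sum(bm25_term(t, doc, docs, k1, b) for t in query_terms)


docs = [
    ["skin", "barrier"],
    ["skin", "is", "the", "largest", "organ", "of", "the", "body"],
    ["brain", "messages"],
    ["muscles", "movement"],
    ["plants", "energy"],
]
short_doc_score = bm25_term("skin", docs[0], docs)
long_doc_score = bm25_term("skin", docs[1], docs)
```

Both documents contain "skin" once, but the shorter one scores higher: this is the effect of the parameter b.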


Competition Our Approach

Word embeddings

Word embeddings (Word2Vec, GloVe) are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words: the network is shown a word and must guess which words occurred in adjacent positions in an input text.

They build up a mapping $f_{emb}: W \to \mathbb{R}^d$, where $d$ typically has size 100 or 300.

One important feature of this mapping is that it maps semantically close words to nearby locations of the $d$-dimensional vector space.

This even allows doing some kind of arithmetic on words, e.g.

$$f_{emb}(\text{Berlin}) - f_{emb}(\text{Germany}) + f_{emb}(\text{Italy}) \approx f_{emb}(\text{Rome})$$

Problem

How to score question + answer? Sum/Average/Weighted by IDF?
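One of the options named above can be sketched as follows: average the word vectors of the question and of each answer (optionally weighted, for example by IDF) and compare the two averages by cosine similarity. The 3-dimensional vectors below are invented for illustration and do not come from a trained model; which aggregation the talk finally used is not stated.

```python
import math


def weighted_mean(words, emb, weights=None):
    """(Weighted) average of the vectors of all words known to `emb`."""
    dim = len(next(iter(emb.values())))
    total = [0.0] * dim
    norm = 0.0
    for word in words:
        if word not in emb:
            continue  # out-of-vocabulary words are skipped
        wt = 1.0 if weights is None else weights.get(word, 1.0)
        norm += wt
        for i, value in enumerate(emb[word]):
            total[i] += wt * value
    return [t / norm for t in total] if norm else total


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


# Hypothetical toy embeddings (real ones would have d = 100 or 300).
emb = {
    "skin": [1.0, 0.1, 0.0], "cells": [0.8, 0.2, 0.1],
    "barrier": [0.9, 0.0, 0.2], "body": [0.7, 0.3, 0.1],
    "brain": [0.0, 1.0, 0.1], "messages": [0.1, 0.9, 0.2],
}
question_vec = weighted_mean(["skin", "cells"], emb)
score_c = cosine(question_vec, weighted_mean(["barrier", "body"], emb))
score_a = cosine(question_vec, weighted_mean(["messages", "brain"], emb))
```

Passing a dictionary of IDF values as `weights` turns the plain average into the IDF-weighted variant.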


Competition Our Approach

PMI

Pointwise Mutual Information (PMI) is a measure of association used in information theory and statistics.

Definition

$$\mathrm{pmi}(x; y) := \log \frac{p(x, y)}{p(x)\,p(y)} = \log \frac{p(x \mid y)}{p(x)} = \log \frac{p(y \mid x)}{p(y)}.$$

Choosing $p(x, y)$ as the probability of the co-occurrence of words $x$ and $y$, we can use this measure to compare each single word in the question to all the words of the answer.

The average (or median) of all these scores is then taken as the overall score of a question-answer pair.
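A minimal sketch of this scoring, assuming probabilities are estimated from document-level co-occurrence (two words co-occur if they appear in the same document) and that word pairs never seen together are skipped rather than scored as minus infinity. Both choices, and the four toy documents, are assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def build_pmi(docs):
    """Return a function pmi(x, y) estimated from document co-occurrence."""
    n = len(docs)
    word_df, pair_df = Counter(), Counter()
    for doc in docs:
        vocab = sorted(set(doc))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))  # pairs (a, b) with a < b

    def pmi(x, y):
        joint = pair_df[(x, y) if x < y else (y, x)]
        if joint == 0:
            return None  # never seen together
        # log( p(x, y) / (p(x) p(y)) ) with p = document frequency / n
        return math.log(joint * n / (word_df[x] * word_df[y]))

    return pmi


def qa_score(question, answer, pmi):
    """Average PMI over all (question word, answer word) pairs."""
    values = [pmi(q, a) for q in question for a in answer]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else float("-inf")


docs = [
    ["skin", "barrier", "body"],
    ["skin", "cells", "barrier"],
    ["brain", "messages", "neurons"],
    ["skin", "brain", "body"],
]
pmi = build_pmi(docs)
score_c = qa_score(["skin"], ["barrier"], pmi)
score_a = qa_score(["skin"], ["brain"], pmi)
```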


Competition Our Approach

Feature Hashing

Embedding

Use a hashing algorithm of fixed length (say 4096) in order to encode words / sentences as fixed-length vectors.

Learning

(Motivated by T. Mikolov’s negative sampling in word2vec)

Using Quizlet’s flashcards we generated an extended dataset

N positive term-definition pairs

3N negative term-definition pairs (i.e. term paired with a random definition)

Train binary classifier using XGBoost with max.depth=10 and 1000s of rounds.
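The data preparation can be sketched as below. Several details are assumptions, since the slide leaves them open: CRC32 as the hash function, whitespace tokenization, sparse dictionaries as the vector format, and encoding a pair by placing the term in the first 4096 slots and the definition in the next 4096. The flashcards are invented. The resulting rows are what would then be handed to a gradient boosting classifier.

```python
import random
import zlib

DIM = 4096  # hash length from the slide


def hash_features(text, dim=DIM, offset=0):
    """Hashing trick: map each token to one of `dim` slots, count hits."""
    features = {}
    for token in text.lower().split():
        index = offset + zlib.crc32(token.encode("utf-8")) % dim
        features[index] = features.get(index, 0) + 1
    return features


def pair_features(term, definition, dim=DIM):
    """Term in slots [0, dim), definition in [dim, 2*dim); an assumption."""
    features = hash_features(term, dim)
    features.update(hash_features(definition, dim, offset=dim))
    return features


def build_dataset(cards, negatives_per_positive=3, seed=0):
    """One positive row per card plus random-definition negatives."""
    rng = random.Random(seed)
    rows = []
    for i, (term, definition) in enumerate(cards):
        rows.append((pair_features(term, definition), 1))
        for _ in range(negatives_per_positive):
            j = rng.randrange(len(cards))
            while j == i:  # a negative must use another card's definition
                j = rng.randrange(len(cards))
            rows.append((pair_features(term, cards[j][1]), 0))
    return rows


# Hypothetical flashcards standing in for the Quizlet data.
cards = [
    ("mitosis", "cell division producing two identical cells"),
    ("density", "mass per unit volume"),
    ("photosynthesis", "process plants use to turn light into sugar"),
    ("erosion", "wearing away of rock by wind or water"),
]
rows = build_dataset(cards)
```

N cards yield 4N rows, which matches the "3 + 1" ratio of negatives to positives described above.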


Competition Our Approach

Final Learning


Competition Our Approach

14 days to go. . .


Competition Last minute changes

Large Scale XGBoost

Pushing XGBoost to the limits. . .

50M quizlet cards

3 + 1 negative sampling yields 200M observations

Feature Hashing produces sparse matrix with 2147863398 entries

Result from XGBoost

long vectors not supported yet:../../src/include/Rinlinedfuns.h:137

Running with 150M samples was fine but needed a fast machine to finish before the competition deadline. . .
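A plausible reading of the error, stated here as an assumption rather than a diagnosis from the talk: ordinary R vectors are indexed by 32-bit signed integers, so anything with more than 2^31 - 1 elements needs "long vector" support. The quick check below shows that the 200M-row matrix is just over that limit, while a 150M-row matrix with the same average density would stay below it.

```python
INT32_MAX = 2**31 - 1                  # 2147483647
entries_200m = 2_147_863_398           # nonzero entries reported on the slide

overflow = entries_200m - INT32_MAX    # how far past the limit

# Assuming the same average number of nonzero entries per observation:
entries_per_row = entries_200m / 200_000_000
entries_150m = entries_per_row * 150_000_000
fits_150m = entries_150m < INT32_MAX
```

The full dataset exceeds the limit by only 379,751 entries, which is consistent with the 150M-sample run going through.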

Benedikt Wilbertz (Trendiction) Hot Topics in Machine Learning June 17, 2016 34 / 41

Page 56: Hot Topics in Machine Learning (or how to win a Kaggle ... · Hot Topics in Machine Learning (or how to win a Kaggle competition) Benedikt Wilbertz1 1Trendiction S.A., Luxembourg

Competition Last minute changes

Last Minute Learning


Page 57: Hot Topics in Machine Learning (or how to win a Kaggle ... · Hot Topics in Machine Learning (or how to win a Kaggle competition) Benedikt Wilbertz1 1Trendiction S.A., Luxembourg

Competition Last minute changes

Public Leaderboard / Model submission deadline


Page 58: Hot Topics in Machine Learning (or how to win a Kaggle ... · Hot Topics in Machine Learning (or how to win a Kaggle competition) Benedikt Wilbertz1 1Trendiction S.A., Luxembourg

Competition Competitors

Cardal


Page 59: Hot Topics in Machine Learning (or how to win a Kaggle ... · Hot Topics in Machine Learning (or how to win a Kaggle competition) Benedikt Wilbertz1 1Trendiction S.A., Luxembourg

Competition Competitors

Cardal’s approach

Data sources

Wikipedia, CK12, Quizlet, StudyStack, Saylor, Openstax, UtahOER, misc sources from AI2/Aristo

Processing

Hand-written parsers for all the sources (regex!!)

Uses 4 different stemmers

28 sets of features

Lucene Search/Scoring plus homebrewed search/score

Learning

Gradient boosting

Ensemble of 6 models, each uses its own feature mix

Lots of hand-tuned parameters


Page 62: Hot Topics in Machine Learning (or how to win a Kaggle ... · Hot Topics in Machine Learning (or how to win a Kaggle competition) Benedikt Wilbertz1 1Trendiction S.A., Luxembourg

Competition Results

Private Leaderboard


Page 63: Hot Topics in Machine Learning (or how to win a Kaggle ... · Hot Topics in Machine Learning (or how to win a Kaggle competition) Benedikt Wilbertz1 1Trendiction S.A., Luxembourg

Competition Aftermath

Private Leaderboard


Page 64: Hot Topics in Machine Learning (or how to win a Kaggle ... · Hot Topics in Machine Learning (or how to win a Kaggle competition) Benedikt Wilbertz1 1Trendiction S.A., Luxembourg

Competition Summary

Summary

THANK YOU!
