yelp data challenge

30
Yelp Data Challenge User Rating Prediction using machine and Deep learning. Amr Koura

Upload: amr-koura

Post on 08-Apr-2017

117 views

Category:

Engineering


0 download

TRANSCRIPT

Yelp Data Challenge User Rating Prediction using machine and Deep learning.

Amr Koura

Content

• Yelp Data Challenge• Problem Definition• Classical Machine Learning• Deep Learning• Docker Image• Conclusion

Yelp Data Challenge

Yelp dataset

Academic Dataset.Information about local business.Five Json files:1. business2. check-in3. review4. tip5. user

https://www.yelp.com/dataset_challenge/dataset

User Reviews

User Reviews

Jason Files contains user reviews:Example:

{"user_id": "Iu6AxdBYGR4A0wspR9BYHA", "review_id": "KPvLNJ21_4wbYNctrOwWdQ", "stars": 5, "date": "2014-02-13", "text": "Excellent food. Superb customer service. I miss the mario machines they used to have, but it's still a great place steeped in tradition.", "type": "review", "business_id": "5UmKMjUEUNdYWqANhGckJw"}

Problem Definition

“Given a user Review text , can we predict user Rating?”

Simpler Problem

Assuming That:

• bad rate (1-3)• good rate (4-5)we will predict whether the user like the

service (rate=4 or 5) or not(rate=1,2 or 3)

“Given a user Review text , can we predict whether user like service or not?”

Pre-Processing

Pre-processing

Extract the first 2000 review as dataset.

Features: user reviews text.Labels: if Rate >3: Positive Else: Negative

Approach

Classical Machine Learning

Algorithms

Deep learning Approach

Classical Machine Learning Algorithms

Classical Machine Learningcombination of several machine learning algorithms.

Logistic Regression. Naive Bayes. Stochastic gradient Descent. Support Vector Machine.

combine results using majority votes.

https://github.com/amrqura/yelpRatePrediction

Classical Machine LearningFeatures:

use words of the review text as features as the following:

1.Extract most common 3000 words in the training dataset.

In each statement examine if the frequent words exists or not.

Feature set : matrix of [2000*3000] 2000: number of statements. 3000: boolean values examine existence of frequent word.

https://github.com/amrqura/yelpRatePrediction

Classical Machine LearningTraining:

Train model in 80% statements and validate on 20%.

Run:

$ python3 TextClassification.py

https://github.com/amrqura/yelpRatePrediction

Classical Machine LearningResults(5- cross validation):

Naive Bayes accuracy percent: 66.5 MultinomialNB accuracy percent: 67.95 BernoulliNB accuracy percent: 65.9 Logistic regression accuracy percent: 67.15 SGD accuracy percent: 63.3 SVC accuracy percent: 54.0 LinearSVC accuracy percent: 63.75 NuSVC accuracy percent: 66.35 Voted accuracy percent: 67.35

https://github.com/amrqura/yelpRatePrediction

Deep learning Approach

Convolutional Neural Network for sentence classification

CNN Architecture

https://arxiv.org/pdf/1408.5882v2.pdf

CNN DatasetEach statement is represented by n*k matrix. n= number of words k= vector length.

we use word2vec to convert each word to vector.

https://github.com/amrqura/deepYelpRatePrediction

CNN Training

implementation using Tensorflow. number of filters=128 filter size=3,4,5. drop out=0.5

Apply max pooling.

output layer: two nodes with softmax function.

use cross-entropy error function. use L2 regularization.

https://github.com/amrqura/deepYelpRatePrediction

CNNTraining:

5 cross validation. Train model in 1600 statements and validate on 400.

Run:

$ python rating_prediction.py

Accuracy: 0.73549998

https://github.com/amrqura/deepYelpRatePrediction

Docker Image

Docker ImageDocker Image:

https://hub.docker.com/r/amrkoura/yelpchallenge/

Run: docker pull amrkoura/yelpchallange docker run -t -i amrkoura/yelpchallenge /bin/bash cd /src/

Run classical machine learning:

python3 TextClassification.py

Run Deep learning:

python rating_prediction.py

Conclusion

Conclusion2 implementations to predict the user rating from user review. • classical machine learning • Deep learning , convolutional neural network.

public Docker image:

https://hub.docker.com/r/amrkoura/yelpchallenge/

Thank you