yelp dataset challenge

Yelp Dataset Challenge ANWAR SHAIKH ASHWIN NIMHAN MANASHREE RAO SHRIJIT PILLAI TEJAS SHAH

Upload: shrijit-pillai

Post on 08-Jan-2017


Category:

Data & Analytics



TRANSCRIPT

Page 1: Yelp Dataset Challenge

Yelp Dataset Challenge

Anwar Shaikh, Ashwin Nimhan, Manashree Rao, Shrijit Pillai, Tejas Shah

Page 2: Yelp Dataset Challenge

Project Tasks

Task 1 Assign Categories to Businesses in the Yelp Data Set

Task 2 Recommend Food Items and/or Services in a Restaurant

Task 3 Determine Influential Factors in a City Affecting Restaurants

Page 3: Yelp Dataset Challenge

Task 1

...

Page 4: Yelp Dataset Challenge

Task 1 : Methodology

[Flow diagram]
• Training Set
• Business to Review Map, Business to Category Map
• Mapping Phase: Category to Review Mapping
• Lucene Index (shown twice)
• Tf-Idf: 1. Default 2. BM25 3. Dirichlet
• Testing Set
• Predicted Categories
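Default, BM25 and Dirichlet are similarity (scoring) functions available in Lucene. As a rough illustration of what BM25 scoring does, here is a minimal plain-Python sketch that ranks categories against a piece of review text. It is not Lucene and not the authors' code: the function name `bm25_rank`, the toy category documents, and the parameter values `k1=1.2`, `b=0.75` (commonly used defaults) are all assumptions made for the example.

```python
import math
from collections import Counter

def bm25_rank(query_tokens, docs, k1=1.2, b=0.75):
    """Rank documents (category name -> token list) against a tokenized query using BM25."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: in how many category documents each term occurs.
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    scores = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in set(query_tokens):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical category documents built from review text (toy data).
category_docs = {
    "Pizza": "pizza crust cheese slice pizza oven".split(),
    "Bars": "beer bar cocktail bartender beer tap".split(),
    "Coffee & Tea": "coffee espresso latte tea barista".split(),
}
ranking = bm25_rank("great pizza with extra cheese".split(), category_docs)
top_3 = ranking[:3]
```

Taking the first 3, 5 or 7 entries of such a ranking corresponds to the Top 3, Top 5 and Top 7 cut-offs used in the evaluation slides.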

Page 5: Yelp Dataset Challenge

Evaluation

BM25 Similarity

Metric           Top 3   Top 5   Top 7
Precision        0.54    0.38    0.33
Recall           0.55    0.66    0.72
F2-Measure       0.55    0.57    0.58
At least 1 TP    0.85    0.88    0.89
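The slides report an F2-Measure but do not spell out its formula. Assuming the standard F-beta definition with beta = 2 (recall weighted more heavily than precision), the BM25 Top 7 value can be reproduced from the reported precision and recall. The function name `f_beta` is ours.

```python
def f_beta(precision, recall, beta=2.0):
    """Standard F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# BM25, Top 7: precision 0.33 and recall 0.72 as reported on the slide.
f2_top7 = f_beta(0.33, 0.72)
print(round(f2_top7, 2))
```

With beta = 2 the expression reduces to 5PR / (4P + R).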

Page 6: Yelp Dataset Challenge

Evaluation

Default Similarity

Metric           Top 3   Top 5   Top 7
Precision        0.51    0.36    0.33
Recall           0.53    0.62    0.66
F2-Measure       0.53    0.54    0.55
At least 1 TP    0.84    0.85    0.87

Page 7: Yelp Dataset Challenge

Evaluation

LMDirichlet Similarity

Metric           Top 3   Top 5   Top 7
Precision        0.42    0.32    0.30
Recall           0.58    0.60    0.55
F2-Measure       0.53    0.51    0.47
At least 1 TP    0.81    0.84    0.86

Page 8: Yelp Dataset Challenge

Task 2: Recommend Restaurant Food Items or Services

...

Page 9: Yelp Dataset Challenge

Task 2 : Methodology

• Restaurant Reviews: Reviews of each restaurant are indexed one after another
• Stanford CoreNLP: Each review is passed into the tool, which forms an XML file for each input
• XML Parsing: Nouns and adjectives are extracted
• Feature Extraction: Nouns are used as features and adjectives as sentiments
• Feature Filtering: Features are filtered to obtain only restaurant-related ones using the Task 1 solution
• Feature Processing: Sentiments are mapped to features

Page 10: Yelp Dataset Challenge

Feature Extraction

Every token has an associated POS tag

Tokens with the POS tag “NN” are nouns and those with “JJ” are adjectives

Nouns are considered as features and adjectives as sentiments
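A minimal sketch of this extraction step, assuming the tokens have already been tagged (in the pipeline described here the tags would come from the Stanford CoreNLP XML output; the hand-written list below is only illustrative). Matching on the tag prefix, so that plural nouns (`NNS`) and comparative adjectives (`JJR`) are included as well, is our assumption and not something the slides state.

```python
def extract_features(tagged_tokens):
    """Split (word, POS tag) pairs into noun features and adjective sentiments."""
    nouns = [word.lower() for word, tag in tagged_tokens if tag.startswith("NN")]
    adjectives = [word.lower() for word, tag in tagged_tokens if tag.startswith("JJ")]
    return nouns, adjectives

# Hand-tagged example sentence (hypothetical input).
tagged = [
    ("The", "DT"), ("food", "NN"), ("was", "VBD"), ("great", "JJ"),
    ("but", "CC"), ("the", "DT"), ("service", "NN"), ("was", "VBD"), ("bad", "JJ"),
]
features, sentiments = extract_features(tagged)
```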

Page 11: Yelp Dataset Challenge

Feature Filtering

Noise present in features obtained from Feature Extraction Phase

Using Task 1 Solution, categories of input features are determined

Features whose categories are related to restaurants are considered for further processing

Before Feature Filtering
• cheese
• burger
• ones
• menu
• combinations
• idea
• commission

After Feature Filtering
• cheese
• burger
• menu

Page 12: Yelp Dataset Challenge

Feature Processing

• Problem: The relationship between noun and adjective was ambiguous for some sentences.

• Example: The food was great but the service was bad

• After parsing, does “bad” belong to food or service?

Page 13: Yelp Dataset Challenge

[Flowchart] New Review → Adjective: Positive or Negative? → Negative Word in 4-word distance? → Decision (Recommended or not Recommended)

Classification of reviews

1. For each sentence the noun is extracted through feature extraction

2. Corresponding adjective is identified as positive or negative

3. Negation is searched for within 4 word distance of adjective

4. Feature is classified as Recommended if number of positive sentiments associated with it is more than the number of negative sentiments

All the above steps are repeated for each review
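The steps above can be sketched as follows. The sentiment lexicons, the negation list, and the input format (a feature, the tokenized sentence, and the index of the adjective attached to it) are hypothetical. The slides do not say whether the 4-word window is checked before or after the adjective, so this sketch checks both sides.

```python
from collections import Counter

# Hypothetical lexicons, for illustration only.
POSITIVE = {"great", "good", "fresh", "nice", "decent"}
NEGATIVE = {"bad", "bland", "chewy"}
NEGATIONS = {"not", "no", "never", "n't"}

def adjective_polarity(tokens, adj_index, window=4):
    """Return +1, -1 or 0 for the adjective, flipping the sign if a negation is nearby."""
    adjective = tokens[adj_index].lower()
    if adjective in POSITIVE:
        polarity = 1
    elif adjective in NEGATIVE:
        polarity = -1
    else:
        return 0
    start = max(0, adj_index - window)
    end = min(len(tokens), adj_index + window + 1)
    nearby = tokens[start:adj_index] + tokens[adj_index + 1:end]
    if any(token.lower() in NEGATIONS for token in nearby):
        polarity = -polarity
    return polarity

def recommend(mentions):
    """mentions: iterable of (feature, sentence tokens, adjective index) across all reviews."""
    positive, negative = Counter(), Counter()
    for feature, tokens, adj_index in mentions:
        polarity = adjective_polarity(tokens, adj_index)
        if polarity > 0:
            positive[feature] += 1
        elif polarity < 0:
            negative[feature] += 1
    # Recommended only if positive sentiments outnumber negative ones.
    return {f: positive[f] > negative[f] for f in set(positive) | set(negative)}

mentions = [
    ("food", "the food was great".split(), 3),
    ("food", "good food overall".split(), 0),
    ("service", "the service was not good".split(), 4),
]
decisions = recommend(mentions)
```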

Page 14: Yelp Dataset Challenge

Sample Result

Predicted Feature   Predicted Feature Sentiments                Predicted as Recommended?   Actually Recommended?
sub                 next, decent                                Y                           Y
bread               flavorful, bland, fresh, great, nice        Y                           Y
peppercorn          nice                                        Y                           Y
stuff-it            chewy                                       Y                           N
sandwich            mayo/mustard/vinegar, east, good, unknown   Y                           Y
menu                decent                                      Y                           Y
bacon               real                                        Y                           Y
bite                huge                                        Y                           N
veggies             sorry                                       N                           Y

Page 15: Yelp Dataset Challenge

Evaluation

Set 1 - Recommended features are obtained from 60% of the reviews of a particular restaurant.

Set 2 - The remaining 40% of the reviews are used for testing.

If a recommended feature from Set 1 is also a recommended feature in Set 2, then it is a True Positive.
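A small sketch of how precision and recall could be computed under this definition. The slides define only the true positive, so treating Set 1 features missing from Set 2 as false positives, and Set 2 features missing from Set 1 as false negatives, is our assumption. The feature sets below are made up.

```python
def precision_recall(predicted, actual):
    """Precision and recall of predicted recommended features against the held-out set."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical recommended features from the 60% split and the 40% split.
set_1 = {"bread", "menu", "bacon", "bite"}
set_2 = {"bread", "menu", "veggies"}
precision, recall = precision_recall(set_1, set_2)
```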

Evaluation Metrics: Precision 0.53, Recall 0.67

Page 16: Yelp Dataset Challenge

Identifying Influential Topics

“Identify features from reviews which are relevant city wide and influence the user’s choice and restaurant’s popularity”

Phases
I. Business classification by city
II. Popular item word-count
III. NLP feature extraction
IV. Feature re-ranking model
V. Model fitness evaluation

Page 17: Yelp Dataset Challenge

Business Classification Phase I

Issue: Reviews specify neighborhood not city. (~150 !!!)

Solution:

1. Identify city based on geo-code through mapping service.
2. K-means clustering
   1. Data point features (Business Id, Latitude, Longitude)
   2. Dissimilarity metric (Euclidean distance)
   3. Cluster count: k (10)
   4. Centroid labeling
3. Data persistence and indexing
   1. Split reviews based on clustered business ids
   2. Save & index for next phase.
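A minimal sketch of the clustering step, using plain Lloyd's-algorithm k-means with Euclidean distance on (latitude, longitude). The business ids, the coordinates, and the choice of initial centroids are made up, and the example uses k = 3 rather than the k = 10 mentioned above to keep it short.

```python
import math

def kmeans(points, centroids, iterations=20):
    """Cluster (lat, lon) points around the given initial centroids (Lloyd's algorithm)."""
    centroids = list(centroids)
    assignments = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return assignments, centroids

# Hypothetical businesses: id -> (latitude, longitude).
businesses = {
    "b1": (43.07, -89.40), "b2": (43.08, -89.38),
    "b3": (33.45, -112.07), "b4": (33.50, -112.00),
    "b5": (36.17, -115.14), "b6": (36.10, -115.17),
}
ids = list(businesses)
points = [businesses[i] for i in ids]
assignments, centroids = kmeans(points, [points[0], points[2], points[4]])
clusters = dict(zip(ids, assignments))
```

Each centroid would then be labeled with a city name, and reviews split by the cluster of their business id.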

Page 18: Yelp Dataset Challenge
Page 19: Yelp Dataset Challenge

Word-count Phase II

Issue: How do we get the influential factors of a city?
Solution: Word count as first pass
Observation: Noise (adjectives, verbs, expressions)
Proposal: Include features derived through NLP
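A first-pass word count of this kind can be sketched in a few lines. The tokenization rule (lowercase, letters and apostrophes only) and the sample reviews are assumptions for the example. The output shows why generic words dominate such a list.

```python
import re
from collections import Counter

def word_counts(reviews):
    """Count lowercase word occurrences across all review texts."""
    counts = Counter()
    for text in reviews:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

# Hypothetical reviews for one city.
reviews = ["The food was good", "Good food, good place"]
counts = word_counts(reviews)
top_words = counts.most_common(2)
```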

Page 20: Yelp Dataset Challenge

NLP Features Phase III

Issue: Noise reduction and contextual awareness
Solution: Use NLP to identify features in the reviews
Observation: Subtle change in ordering of words
Proposal: Re-ranking the words using metrics from user and review.

• Restaurant Reviews by City: Reviews of all restaurants filtered by city are indexed.
• Stanford CoreNLP: Each review is passed into the tool, which forms an XML file for each input
• Feature Extraction: Nouns are extracted from the XML and are used as features

Page 21: Yelp Dataset Challenge

Mathematical Formula

$\sum_{R \in R_{1000}} \{\ldots\}$

Page 22: Yelp Dataset Challenge
Page 23: Yelp Dataset Challenge

Elite User

Who is Important?

Page 24: Yelp Dataset Challenge

Who is Important?
Elite User

What is Important?
Useful Review

Page 25: Yelp Dataset Challenge

Mathematical Formula

Features from NLP do take into account word count and context but do NOT consider user weight and review weight

[Flow diagram] Solr Index → Top 1K Relevant Reviews; Word list from NLP + Top 1K Relevant Reviews → Program with Mathematical Formula → Scored word

Page 26: Yelp Dataset Challenge

User attributes: Review Count = Urc, Average Stars, Votes = Uv, Friends, Elite = Ue, Yelping Since, Compliments, Fans = Uf

Mathematical Formula

$\{0.25\,U_e + 0.55\,\frac{U_v}{U_{rc}} + 0.20\,U_f\}$

Normalization of votes: $U_{v,norm} = \frac{U_v}{U_{rc}}$

User   Review Count   Votes
U1     10             1000
U2     1000           1000
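The U1/U2 comparison shows why votes are divided by review count: both users have 1000 votes, but U1 earned them over 10 reviews and U2 over 1000. A minimal sketch of the normalization and of the user weight with the 0.25 / 0.55 / 0.20 coefficients follows. The function names are ours, and the slides do not state how the Elite and Fans values are scaled, so the inputs passed to `user_weight` are purely illustrative.

```python
def normalized_votes(votes, review_count):
    """Votes per review (Uv / Urc); 0.0 for a user with no reviews."""
    return votes / review_count if review_count else 0.0

def user_weight(elite, votes, review_count, fans):
    """0.25*Ue + 0.55*(Uv/Urc) + 0.20*Uf, with the coefficients shown on the slide."""
    return 0.25 * elite + 0.55 * normalized_votes(votes, review_count) + 0.20 * fans

u1 = normalized_votes(1000, 10)     # U1: 1000 votes over 10 reviews
u2 = normalized_votes(1000, 1000)   # U2: 1000 votes over 1000 reviews
example_weight = user_weight(elite=1, votes=1000, review_count=1000, fans=0)
```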

Page 27: Yelp Dataset Challenge

Review attributes: User, Stars = Rs, Text, Date, Votes = Rv

Mathematical Formula

$\{(0.15\,R_v + 0.15\,R_s) + 0.7\,(0.25\,U_e + 0.55\,\frac{U_v}{U_{rc}} + 0.20\,U_f)\}$

Stars   Sentiment
1       Very Strong
2       Inclined -ve
3       Ambivalent
4       Inclined +ve
5       Very Strong

Page 28: Yelp Dataset Challenge

Mathematical Formula

$\{tf \cdot \log(1 - \frac{df}{D_{count}})\} \times \{(0.15\,R_v + 0.15\,R_s) + 0.7\,(0.25\,U_e + 0.55\,\frac{U_v}{U_{rc}} + 0.20\,U_f)\}$

Review relevance: Term Frequency = tf, Document Frequency = df, Document Count = Dcount

Review attributes: User, Stars = Rs, Text, Date, Votes = Rv

User attributes: Review Count = Urc, Average Stars, Votes = Uv, Friends, Elite = Ue, Yelping Since, Compliments, Fans = Uf
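A rough sketch of how the review and user weights combine into a word score. The first factor of the formula (the term-relevance part) is not recomputed here. It is passed in as a precomputed `relevance` number per review, which is an assumption of this sketch. The function names, the dictionary keys, and the sample values are hypothetical. Summing over reviews mirrors the summation over the top 1000 relevant reviews shown on the earlier formula slide.

```python
def review_weight(review_votes, review_stars, user_w):
    """(0.15*Rv + 0.15*Rs) + 0.7 * user weight, with the coefficients shown on the slide."""
    return (0.15 * review_votes + 0.15 * review_stars) + 0.7 * user_w

def word_score(reviews):
    """Sum of relevance * weight over the reviews retrieved for one word."""
    return sum(
        r["relevance"] * review_weight(r["votes"], r["stars"], r["user_weight"])
        for r in reviews
    )

# Hypothetical retrieved reviews for a single word.
retrieved = [
    {"relevance": 2.0, "votes": 1.0, "stars": 1.0, "user_weight": 1.0},
    {"relevance": 1.0, "votes": 0.0, "stars": 0.0, "user_weight": 1.0},
]
score = word_score(retrieved)
```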

Page 29: Yelp Dataset Challenge

Output

[Figure with numbered labels 1 to 17]

Page 30: Yelp Dataset Challenge

Madison

Rank   Wordcount List   NLP list - Unformatted   NLP list - Model
1      food             food                     pizza
2      place            beer                     cheese
3      like             cheese                   coffee
4      from             menu                     breakfast
5      service          curds                    burger
6      go               atmosphere               taco
7      time             burger                   sushi
8      madison          dane                     chocolate
9      been             drinks                   beer
10     cheese           beers                    sandwich
11     menu             restaurant               curds
12     bar              table                    ice
13     restaurant       coffee                   wine
14     ordered          pizza                    store
15     love             something                cream
16     order            sandwich                 lunch
17     chicken          dinner                   rolls
18     beer             lunch                    atmosphere
19     pizza            meal                     tea
20     sauce            sauce                    curries
21     night            burgers                  steak
22     people           drink                    noodle
23     make             bread                    spot
24     staff            server                   soup
25     made             chicken                  egg

Phoenix

Rank   Wordcount List   NLP list - Unformatted   NLP list - Model
1      food             food                     donut
2      good             pizza                    bagel
3      place            burger                   cupcake
4      great            menu                     gelato
5      like             restaurant               gyro
6      service          fries                    yogurt
7      time             atmosphere               buffet
8      go               chicken                  boba
9      back             patio                    pizza
10     from             breakfast                sushi
11     been             table                    coffee
12     love             lunch                    sub
13     ordered          dinner                   wing
14     chicken          meal                     crepe
15     nice             salad                    burger
16     order            cheese                   burrito
17     restaurant       potato                   taco
18     little           server                   cookie
19     menu             something                gluten
20     pizza            sauce                    breakfast
21     bar              drinks                   coffee-shop
22     delicious        rice                     hash-brown
23     friendly         burgers                  cake
24     first            beer                     vegan
25     pretty           spot                     teas

Las Vegas

Rank   Wordcount List   NLP list - Unformatted   NLP list - Model
1      food             food                     donuts
2      good             beer                     bagel
3      place            sushi                    crepe
4      like             restaurant               pizza
5      great            meal                     oyster
6      service          menu                     yogurt
7      from             atmosphere               shrimp
8      time             table                    burger
9      vegas            steak                    gelato
10     go               dinner                   sushi
11     back             server                   wings
12     ordered          salad                    sandwich
13     restaurant       tables                   pancake
14     nice             rib                      coffee
15     been             buffet                   burrito
16     order            dining                   curry
17     chicken          breakfast                buffet
18     little           waitress                 waffle
19     pretty           shrimp                   chocolate
20     love             something                cake
21     menu             beers                    breakfast
22     eat              dishes                   tea
23     delicious        dish                     cookies
24     first            restaurants              gluten
25     people           sauce                    pastrami

Page 31: Yelp Dataset Challenge

Evaluation Metric: NDCG

Predicted topics for Phoenix under categories: Bakery, Breakfast and Brunch

To capture the strongest sentiments about these topics, we analyzed the top 1000 features for businesses predicted under Bakery, Breakfast and Brunch for the specific city, in this case Phoenix.

Using these features as input for relevance score, we analyze the top 30 topics predicted by the model:

NDCG = 18.80190835 / 21.8978282 = 0.8586

Page 32: Yelp Dataset Challenge

Rank   NLP list - Output From Model   Relevance Score   Log        DCG = rel(i)/log i
1      donut                          3                 0          3
2      bagel                          3                 1          3
3      cupcake                        3                 1.584963   1.892789
4      gelato                         0                 2          0
5      gyro                           2                 2.321928   0.861353
6      yogurt                         2                 2.584963   0.773706
7      buffet                         0                 2.807355   0
8      boba                           1                 3          0.333333
9      pizza                          0                 3.169925   0
10     sushi                          0                 3.321928   0
11     coffee                         3                 3.459432   0.867194
12     sub                            2                 3.584963   0.557886
13     wing                           1                 3.70044    0.270238
14     crepe                          2                 3.807355   0.525299
15     burger                         2                 3.906891   0.511916
16     burrito                        2                 4          0.5
17     taco                           2                 4.087463   0.489301
18     cookie                         2                 4.169925   0.479625
19     gluten                         0                 4.247928   0
20     breakfast                      2                 4.321928   0.462756
21     coffee-shop                    2                 4.392317   0.45534
22     hash-brown                     1                 4.459432   0.224244
23     cake                           3                 4.523562   0.663194
24     vegan                          1                 4.584963   0.218104
25     teas                           2                 4.643856   0.430677
26     bruschetta                     1                 4.70044    0.212746
27     waffle                         3                 4.754888   0.63093
28     pancake                        3                 4.807355   0.624044
29     subway                         1                 4.857981   0.205847
30     latte                          3                 4.906891   0.611385

Rank   Relevance Score   Log          Ideal DCG (IDCG)
1      3                 0            3
2      3                 1            3
3      3                 1.5849625    1.89278926
4      3                 2            1.5
5      3                 2.32192809   1.29202967
6      3                 2.5849625    1.16055842
7      3                 2.80735492   1.06862156
8      3                 3            1
9      3                 3.169925     0.94639463
10     2                 3.32192809   0.60205999
11     2                 3.45943162   0.57812965
12     2                 3.5849625    0.55788589
13     2                 3.70043972   0.54047631
14     2                 3.80735492   0.52529907
15     2                 3.9068906    0.51191605
16     2                 4            0.5
17     2                 4.08746284   0.48930108
18     2                 4.169925     0.47962493
19     2                 4.24792751   0.47081783
20     2                 4.32192809   0.46275643
21     1                 4.39231742   0.22767025
22     1                 4.45943162   0.22424382
23     1                 4.52356196   0.22106473
24     1                 4.5849625    0.21810429
25     1                 4.64385619   0.21533828
26     1                 4.70043972   0.21274605
27     0                 4.7548875    0
28     0                 4.80735492   0
29     0                 4.857981     0
30     0                 4.9068906    0
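The NDCG figure can be reproduced from the relevance scores listed above. The Log column matches a base-2 logarithm of the rank, and rank 1 contributes its relevance directly. The sketch below follows that convention. The two relevance lists are copied from the predicted and ideal tables, and the function name `dcg` is ours.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: rel(1) + sum of rel(i) / log2(i) for i >= 2."""
    total = 0.0
    for rank, rel in enumerate(relevances, start=1):
        total += rel if rank == 1 else rel / math.log2(rank)
    return total

# Relevance scores of the 30 predicted topics, in model rank order.
predicted = [3, 3, 3, 0, 2, 2, 0, 1, 0, 0, 3, 2, 1, 2, 2,
             2, 2, 2, 0, 2, 2, 1, 3, 1, 2, 1, 3, 3, 1, 3]
# Ideal ordering as listed in the IDCG table.
ideal = [3] * 9 + [2] * 11 + [1] * 6 + [0] * 4

ndcg = dcg(predicted) / dcg(ideal)
print(round(dcg(predicted), 4), round(dcg(ideal), 4), round(ndcg, 4))
```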

Page 33: Yelp Dataset Challenge

Things to Note!

Based on results:

Identified categories: Breakfast and Brunch, Bakery

Keywords: donut, bagel, cupcake, gelato, gyro, yogurt, buffet, boba, pizza, sushi, coffee, sub, wing, crepe, burger, burrito, taco, cookie, gluten, breakfast, coffee-shop, hash-brown, cake

Page 34: Yelp Dataset Challenge


Page 35: Yelp Dataset Challenge

Thank You!