yelp dataset challenge

Yelp Dataset Challenge

Anwar Shaikh, Ashwin Nimhan, Manashree Rao, Shrijit Pillai, Tejas Shah

Project Tasks

Task 1 Assign Categories to Business in the Yelp Data Set

Task 2 Recommend Food Items and/or services in a Restaurant

Task 3 Determine Influential Factors in a City affecting Restaurants

Task 1

...

Task 1 : Methodology

[Pipeline diagram. Labels: Training Set; Testing Set; Business to Review Map; Business to Category Map; Mapping Phase: Category to Review Mapping; Lucene Index; Tf-Idf; similarities: 1. Default, 2. BM25, 3. Dirichlet; Predicted Categories]
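The diagram suggests that review text is indexed per category and that a test business's reviews are then scored against that index. A minimal sketch of BM25 ranking in plain Python follows, assuming that reading of the diagram. The toy category texts, the function name `bm25_rank` and the parameters `k1=1.2`, `b=0.75` are illustrative assumptions; the original work used Lucene, not this code.

```python
import math
from collections import Counter


def bm25_rank(query_tokens, docs, k1=1.2, b=0.75):
    """Rank the named token lists in `docs` against a query using BM25."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    # Highest score first; the top k names would be the predicted categories.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical "category to review" documents (made-up text).
category_reviews = {
    "Pizza": "great pizza thin crust and lots of cheese".split(),
    "Coffee & Tea": "strong coffee and a smooth latte".split(),
    "Bars": "cold beer and good cocktails at the bar".split(),
}
ranked = bm25_rank("crispy pizza crust".split(), category_reviews)
print(ranked[0])
```

Taking the first 3, 5 or 7 entries of `ranked` would correspond to the Top 3, Top 5 and Top 7 settings reported in the evaluation.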

Evaluation: BM25 Similarity

Metric          Top 3   Top 5   Top 7
Precision       0.54    0.38    0.33
Recall          0.55    0.66    0.72
F2-Measure      0.55    0.57    0.58
At least 1 TP   0.85    0.88    0.89
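The F2-Measure weights recall more heavily than precision. As a small sketch (the function name `f_beta` is ours, and the slides do not state which implementation was used), the standard F-beta formula applied to the BM25 Top 3 precision and recall gives a value close to the reported 0.55:

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta=2 gives the F2-measure, which favours recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# BM25, Top 3: precision 0.54 and recall 0.55 as reported on the slide.
print(f_beta(0.54, 0.55))
```

Because the slide values are rounded to two decimals, recomputing from them can differ from the reported F2 in the last digit.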

Evaluation: Default Similarity

Metric          Top 3   Top 5   Top 7
Precision       0.51    0.36    0.33
Recall          0.53    0.62    0.66
F2-Measure      0.53    0.54    0.55
At least 1 TP   0.84    0.85    0.87

Evaluation: LMDirichlet Similarity

Metric          Top 3   Top 5   Top 7
Precision       0.42    0.32    0.30
Recall          0.58    0.60    0.55
F2-Measure      0.53    0.51    0.47
At least 1 TP   0.81    0.84    0.86
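The four metrics in these charts can be computed per business from a ranked list of predicted categories. The sketch below assumes precision and recall are taken over the top k predictions and that "At least 1 TP" means at least one correct category among them; the function name and the example categories are hypothetical, and the reported figures would presumably be averages over all test businesses.

```python
def evaluate_top_k(predicted, actual, k):
    """Return (precision, recall, at_least_one_tp) for the top-k predictions."""
    top_k = predicted[:k]
    true_positives = len(set(top_k) & set(actual))
    precision = true_positives / len(top_k) if top_k else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall, true_positives >= 1


# Hypothetical ranked predictions and ground-truth categories for one business.
predicted = ["Restaurants", "Pizza", "Italian", "Bars", "Nightlife"]
actual = ["Pizza", "Restaurants", "Sandwiches"]

for k in (3, 5):
    print(k, evaluate_top_k(predicted, actual, k))
```

With a fixed set of true categories, a larger k can only keep or raise recall for a single business, while precision tends to fall as more guesses are admitted.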

Task 2: Recommend Restaurant Food Items or Services

...

Task 2 : Methodology

• Restaurant Reviews: Reviews of each restaurant are indexed one after another
• Stanford CoreNLP: Each review is passed into the tool, which forms an XML file for each input
• XML Parsing: Nouns and adjectives are extracted
• Feature Extraction: Nouns are used as features and adjectives as sentiments
• Feature Filtering: Features are filtered to obtain only restaurant-related ones using the Task 1 solution
• Feature Processing: Sentiments are mapped to features

Feature Extraction

Every token has an associated POS tag

Tokens whose POS tag is “NN” are nouns, and those tagged “JJ” are adjectives

Nouns are considered as features and adjectives as sentiments
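A minimal sketch of this step, assuming the CoreNLP XML output exposes one `token` element per word with `word` and `POS` children. The sample XML is a trimmed, hand-written fragment, the function name is ours, and matching on the tag prefix (so that NNS or JJR also count) is an assumption rather than something the slides state.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written stand-in for a CoreNLP XML file (assumption).
SAMPLE_XML = """
<root><document><sentences><sentence id="1"><tokens>
  <token id="1"><word>The</word><POS>DT</POS></token>
  <token id="2"><word>food</word><POS>NN</POS></token>
  <token id="3"><word>was</word><POS>VBD</POS></token>
  <token id="4"><word>great</word><POS>JJ</POS></token>
  <token id="5"><word>but</word><POS>CC</POS></token>
  <token id="6"><word>the</word><POS>DT</POS></token>
  <token id="7"><word>service</word><POS>NN</POS></token>
  <token id="8"><word>was</word><POS>VBD</POS></token>
  <token id="9"><word>bad</word><POS>JJ</POS></token>
</tokens></sentence></sentences></document></root>
"""


def extract_features_and_sentiments(xml_text):
    """Collect nouns as features and adjectives as sentiments."""
    root = ET.fromstring(xml_text)
    features, sentiments = [], []
    for token in root.iter("token"):
        word = token.findtext("word", default="").lower()
        pos = token.findtext("POS", default="")
        if pos.startswith("NN"):
            features.append(word)
        elif pos.startswith("JJ"):
            sentiments.append(word)
    return features, sentiments


print(extract_features_and_sentiments(SAMPLE_XML))
```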

Feature Filtering

Noise is present in the features obtained from the Feature Extraction phase

Using the Task 1 solution, categories of input features are determined

Features whose categories are related to restaurants are considered for further processing

Before Feature Filtering: cheese, burger, ones, menu, combinations, idea, commission

After Feature Filtering: cheese, burger, menu
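The filtering step can be sketched as a simple membership test. Everything in the lookup table below is hypothetical: it stands in for the Task 1 classifier, whose real predictions for these words are not given in the slides, and the set of "restaurant related" categories is likewise an assumption.

```python
# Assumed set of categories that count as restaurant related.
RESTAURANT_CATEGORIES = {"Restaurants", "Food"}

# Hypothetical stand-in for the Task 1 category predictor (made-up outputs).
HYPOTHETICAL_PREDICTIONS = {
    "cheese": ["Food", "Restaurants"],
    "burger": ["Burgers", "Restaurants"],
    "ones": [],
    "menu": ["Restaurants"],
    "combinations": [],
    "idea": ["Education"],
    "commission": ["Financial Services"],
}


def predict_categories(feature):
    """Placeholder for the Task 1 solution."""
    return HYPOTHETICAL_PREDICTIONS.get(feature, [])


def filter_features(features):
    """Keep features with at least one restaurant-related predicted category."""
    return [f for f in features
            if RESTAURANT_CATEGORIES & set(predict_categories(f))]


before = ["cheese", "burger", "ones", "menu", "combinations", "idea", "commission"]
print(filter_features(before))
```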

Feature Processing

• Problem: The relationship between noun and adjective was ambiguous for some sentences.

• Example: The food was great but the service was bad

• After parsing, does “bad” belong to food or service?

New Review → Adjective → Positive or Negative? → Negative Word in 4-word distance? → Decision (Recommended or not Recommended)

Classification of reviews

1. For each sentence, the noun is extracted through feature extraction
2. The corresponding adjective is identified as positive or negative
3. Negation is searched for within a 4-word distance of the adjective
4. A feature is classified as Recommended if the number of positive sentiments associated with it is greater than the number of negative sentiments

All the above steps are repeated for each review
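The steps above can be sketched as follows. The word lists are tiny hypothetical lexicons, the 4-word window is applied on both sides of the adjective (the slides do not say whether it is one-sided), and the pairing of each adjective with its noun is assumed to have been done already.

```python
# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"great", "good", "nice", "fresh", "flavorful", "decent"}
NEGATIVE = {"bad", "bland", "chewy"}
NEGATIONS = {"not", "no", "never", "n't"}


def adjective_polarity(tokens, index, window=4):
    """Return +1, -1 or 0 for the adjective at `index`, flipping on negation."""
    word = tokens[index].lower()
    if word in POSITIVE:
        polarity = 1
    elif word in NEGATIVE:
        polarity = -1
    else:
        return 0
    lo = max(0, index - window)
    hi = min(len(tokens), index + window + 1)
    # A negation word within `window` tokens reverses the sentiment.
    if any(t.lower() in NEGATIONS for t in tokens[lo:hi]):
        polarity = -polarity
    return polarity


def is_recommended(polarities):
    """Recommended when positive sentiments outnumber negative ones."""
    positives = sum(1 for p in polarities if p > 0)
    negatives = sum(1 for p in polarities if p < 0)
    return positives > negatives


tokens = "the bread was not very fresh".split()
print(adjective_polarity(tokens, 5))
print(is_recommended([1, 1, -1]))
```

A tie between positive and negative counts yields "not recommended" here, which follows the "more than" wording of step 4.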

Sample Result

Predicted Features   Predicted Feature Sentiments                 Predicted as Recommended Features?   Actual Recommended Features
sub                  next, decent                                 Y                                    Y
bread                flavorful, bland, fresh, great, nice         Y                                    Y
peppercorn           nice                                         Y                                    Y
stuff-it             chewy                                        Y                                    N
sandwich             mayo/mustard/vinegar, east, good, unknown    Y                                    Y
menu                 decent                                       Y                                    Y
bacon                real                                         Y                                    Y
bite                 huge                                         Y                                    N
veggies              sorry                                        N                                    Y

Evaluation

Set 1 - Recommended features are obtained from 60% of the reviews of a particular restaurant.

Set 2 - The remaining 40% of the reviews are used for testing.

If a recommended feature from Set 1 is also a recommended feature in Set 2, it is a True Positive.
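A short sketch of this evaluation with Python sets. It assumes precision is the share of Set 1 recommendations confirmed by Set 2 and recall is the share of Set 2 recommendations found in Set 1; the slides define only the true positive, so both denominators are our assumption, and the example feature sets are made up.

```python
def precision_recall(set1_recommended, set2_recommended):
    """Compare features recommended from the 60% split against the 40% split."""
    true_positives = set1_recommended & set2_recommended
    precision = (len(true_positives) / len(set1_recommended)
                 if set1_recommended else 0.0)
    recall = (len(true_positives) / len(set2_recommended)
              if set2_recommended else 0.0)
    return precision, recall


# Hypothetical recommended features for one restaurant.
set1 = {"bread", "menu", "bacon", "bite"}
set2 = {"bread", "menu", "veggies"}
print(precision_recall(set1, set2))
```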

Evaluation Metrics

[Bar chart of Precision and Recall; data labels 0.53 and 0.67, in that order]

Identifying Influential Topics

“Identify features from reviews which are relevant city-wide and influence the user’s choice and restaurant’s popularity”

Phases

I. Business classification by city
II. Popular item word-count
III. NLP feature extraction
IV. Feature re-ranking model
V. Model fitness evaluation

Business Classification Phase I

Issue: Reviews specify neighborhood, not city. (~150 !!!)

Solution:

1. Identify city based on geo-code through mapping service.
2. K-means clustering
   1. Data point features (Business Id, Latitude, Longitude)
   2. Dissimilarity metric (Euclidean distance)
   3. Cluster count: k (10)
   4. Centroid labeling
3. Data persistence and indexing
   1. Split reviews based on clustered business ids
   2. Save & index for next phase.
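The clustering step can be illustrated with a small pure-Python version of Lloyd's k-means on (latitude, longitude) pairs using Euclidean distance. The farthest-first initialisation, the function name, the approximate coordinates and the use of k=3 (rather than the 10 clusters mentioned above) are assumptions made to keep the sketch short and deterministic; the business id is treated as an identifier, not as a distance feature.

```python
import math


def kmeans(points, k, iterations=20):
    """Cluster 2-D points with Lloyd's algorithm and Euclidean distance."""
    # Farthest-first initialisation (assumption; keeps the demo deterministic).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(
            points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return labels, centroids


# Hypothetical businesses: (business id, latitude, longitude), approximate.
businesses = [
    ("b1", 43.07, -89.40), ("b2", 43.08, -89.38),    # around Madison
    ("b3", 33.45, -112.07), ("b4", 33.50, -112.00),  # around Phoenix
    ("b5", 36.17, -115.14), ("b6", 36.11, -115.17),  # around Las Vegas
]
points = [(lat, lon) for _, lat, lon in businesses]
labels, centroids = kmeans(points, k=3)
print(dict(zip((b[0] for b in businesses), labels)))
```

Each centroid could then be labeled with a city name, for example by reverse geocoding the centroid coordinates, which is one way to read the "Centroid labeling" item.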

Word-count Phase II

Issue: How do we get the influential factors of a city?
Solution: Word count as first pass
Observation: Noise (adjectives, verbs, expressions)
Proposal: Include features derived through NLP
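A first-pass word count of this kind can be sketched with `collections.Counter`. The tokenisation rule and the sample reviews are assumptions; the sketch also shows why the result is noisy, since adjectives and function words rank alongside food terms.

```python
import re
from collections import Counter


def top_words(reviews, n=25):
    """Count lower-cased word occurrences across all reviews of a city."""
    counts = Counter()
    for text in reviews:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(n)


# Made-up review snippets.
reviews = ["The food was great", "Great food, great place"]
print(top_words(reviews, n=2))
```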

NLP Features Phase III

Issue: Noise reduction and contextual awareness
Solution: Use NLP to identify features in the reviews
Observation: Subtle change in ordering of words
Proposal: Re-rank the words using metrics from the user and the review.

• Restaurant Reviews by City: Reviews of all restaurants filtered by city are indexed.
• Stanford CoreNLP: Each review is passed into the tool, which forms an XML file for each input
• Feature Extraction: Nouns are extracted from the XML and are used as features

Mathematical Formula

\[ \sum_{R \in R_{1000}} (\ldots) \]

Who is Important? Elite User

What is Important? Useful Review


Features from NLP do take into account word count and context but do NOT consider user weight and review weight

[Diagram labels: Word list from NLP; Solr Index; Top 1K Relevant Reviews; Program with Mathematical Formula; Scored word]

User fields: Review Count = Urc, Average Stars, Votes = Uv, Friends, Elite = Ue, Yelping Since, Compliments, Fans = Uf


Normalization of votes

\[ U_{v,\mathrm{norm}} = \frac{U_v}{U_{rc}} \]

User weight:

\[ 0.25\,U_e + 0.55\,\frac{U_v}{U_{rc}} + 0.20\,U_f \]

User   Review Count   Votes
U1     10             1000
U2     1000           1000

\[ \left(0.15\,R_v + 0.15\,R_s\right) + 0.7\left(0.25\,U_e + 0.55\,\frac{U_v}{U_{rc}} + 0.20\,U_f\right) \]

Review fields: User, Stars = Rs, Text, Date, Votes = Rv

Stars   Sentiment
1       Very Strong
2       Inclined -ve
3       Ambivalent
4       Inclined +ve
5       Very Strong
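The review and user weighting can be sketched as two small functions that follow the coefficients on the slide. The function and parameter names are ours, the scale of each input (for example whether Elite is 0/1 or a year count) is not specified in the slides, and returning 0 for a user with no reviews is an added guard.

```python
def user_weight(elite, votes, review_count, fans):
    """0.25*Ue + 0.55*(Uv/Urc) + 0.20*Uf, with votes normalised per review."""
    votes_per_review = votes / review_count if review_count else 0.0
    return 0.25 * elite + 0.55 * votes_per_review + 0.20 * fans


def review_weight(review_votes, review_stars, user_w):
    """(0.15*Rv + 0.15*Rs) + 0.7 * user weight."""
    return 0.15 * review_votes + 0.15 * review_stars + 0.7 * user_w


# The two users from the normalisation example: same votes, different review counts.
u1 = user_weight(elite=0, votes=1000, review_count=10, fans=0)
u2 = user_weight(elite=0, votes=1000, review_count=1000, fans=0)
print(u1, u2)
```

Dividing votes by review count is what separates U1 (100 votes per review) from U2 (1 vote per review), even though both have 1000 votes in total.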



\[ \left\{ tf \cdot \log\left(1 - \frac{df}{D_{count}}\right) \right\} \times \left\{ \left(0.15\,R_v + 0.15\,R_s\right) + 0.7\left(0.25\,U_e + 0.55\,\frac{U_v}{U_{rc}} + 0.20\,U_f\right) \right\} \]

Review Relevance: Term Frequency = tf, Document Frequency = df, Document Count = Dcount



Output


Madison

Rank   Wordcount List   NLP list - Unformatted   NLP list - Model
1      food             food                     pizza
2      place            beer                     cheese
3      like             cheese                   coffee
4      from             menu                     breakfast
5      service          curds                    burger
6      go               atmosphere               taco
7      time             burger                   sushi
8      madison          dane                     chocolate
9      been             drinks                   beer
10     cheese           beers                    sandwich
11     menu             restaurant               curds
12     bar              table                    ice
13     restaurant       coffee                   wine
14     ordered          pizza                    store
15     love             something                cream
16     order            sandwich                 lunch
17     chicken          dinner                   rolls
18     beer             lunch                    atmosphere
19     pizza            meal                     tea
20     sauce            sauce                    curries
21     night            burgers                  steak
22     people           drink                    noodle
23     make             bread                    spot
24     staff            server                   soup
25     made             chicken                  egg

Phoenix

Rank   Wordcount List   NLP list - Unformatted   NLP list - Model
1      food             food                     donut
2      good             pizza                    bagel
3      place            burger                   cupcake
4      great            menu                     gelato
5      like             restaurant               gyro
6      service          fries                    yogurt
7      time             atmosphere               buffet
8      go               chicken                  boba
9      back             patio                    pizza
10     from             breakfast                sushi
11     been             table                    coffee
12     love             lunch                    sub
13     ordered          dinner                   wing
14     chicken          meal                     crepe
15     nice             salad                    burger
16     order            cheese                   burrito
17     restaurant       potato                   taco
18     little           server                   cookie
19     menu             something                gluten
20     pizza            sauce                    breakfast
21     bar              drinks                   coffee-shop
22     delicious        rice                     hash-brown
23     friendly         burgers                  cake
24     first            beer                     vegan
25     pretty           spot                     teas

Las Vegas

Rank   Wordcount List   NLP list - Unformatted   NLP list - Model
1      food             food                     donuts
2      good             beer                     bagel
3      place            sushi                    crepe
4      like             restaurant               pizza
5      great            meal                     oyster
6      service          menu                     yogurt
7      from             atmosphere               shrimp
8      time             table                    burger
9      vegas            steak                    gelato
10     go               dinner                   sushi
11     back             server                   wings
12     ordered          salad                    sandwich
13     restaurant       tables                   pancake
14     nice             rib                      coffee
15     been             buffet                   burrito
16     order            dining                   curry
17     chicken          breakfast                buffet
18     little           waitress                 waffle
19     pretty           shrimp                   chocolate
20     love             something                cake
21     menu             beers                    breakfast
22     eat              dishes                   tea
23     delicious        dish                     cookies
24     first            restaurants              gluten
25     people           sauce                    pastrami

Evaluation Metric: NDCG

Predicted topics for Phoenix under categories: Bakery, Breakfast and Brunch

To capture the strongest sentiments about these topics, we analyzed the top 1000 features for businesses predicted under Bakery, Breakfast and Brunch for the specific city, in this case Phoenix.

Using these features as input for relevance score, we analyze the top 30 topics predicted by the model:

NDCG = 18.80190835 / 21.8978282 = 0.8586

Rank   NLP list - Output From Model   Relevance Score   log2(i)    DCG = rel(i)/log2(i)
1      donut                          3                 0          3
2      bagel                          3                 1          3
3      cupcake                        3                 1.584963   1.892789
4      gelato                         0                 2          0
5      gyro                           2                 2.321928   0.861353
6      yogurt                         2                 2.584963   0.773706
7      buffet                         0                 2.807355   0
8      boba                           1                 3          0.333333
9      pizza                          0                 3.169925   0
10     sushi                          0                 3.321928   0
11     coffee                         3                 3.459432   0.867194
12     sub                            2                 3.584963   0.557886
13     wing                           1                 3.70044    0.270238
14     crepe                          2                 3.807355   0.525299
15     burger                         2                 3.906891   0.511916
16     burrito                        2                 4          0.5
17     taco                           2                 4.087463   0.489301
18     cookie                         2                 4.169925   0.479625
19     gluten                         0                 4.247928   0
20     breakfast                      2                 4.321928   0.462756
21     coffee-shop                    2                 4.392317   0.45534
22     hash-brown                     1                 4.459432   0.224244
23     cake                           3                 4.523562   0.663194
24     vegan                          1                 4.584963   0.218104
25     teas                           2                 4.643856   0.430677
26     bruschetta                     1                 4.70044    0.212746
27     waffle                         3                 4.754888   0.63093
28     pancake                        3                 4.807355   0.624044
29     subway                         1                 4.857981   0.205847
30     latte                          3                 4.906891   0.611385

Rank   Relevance Score   log2(i)      Ideal DCG (IDCG)
1      3                 0            3
2      3                 1            3
3      3                 1.5849625    1.89278926
4      3                 2            1.5
5      3                 2.32192809   1.29202967
6      3                 2.5849625    1.16055842
7      3                 2.80735492   1.06862156
8      3                 3            1
9      3                 3.169925     0.94639463
10     2                 3.32192809   0.60205999
11     2                 3.45943162   0.57812965
12     2                 3.5849625    0.55788589
13     2                 3.70043972   0.54047631
14     2                 3.80735492   0.52529907
15     2                 3.9068906    0.51191605
16     2                 4            0.5
17     2                 4.08746284   0.48930108
18     2                 4.169925     0.47962493
19     2                 4.24792751   0.47081783
20     2                 4.32192809   0.46275643
21     1                 4.39231742   0.22767025
22     1                 4.45943162   0.22424382
23     1                 4.52356196   0.22106473
24     1                 4.5849625    0.21810429
25     1                 4.64385619   0.21533828
26     1                 4.70043972   0.21274605
27     0                 4.7548875    0
28     0                 4.80735492   0
29     0                 4.857981     0
30     0                 4.9068906    0
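The NDCG figure can be reproduced from the two relevance columns. The sketch below uses the DCG variant shown in the tables (the first item counts in full, later items are divided by log2 of their rank). The ideal list is copied from the IDCG table rather than obtained by sorting the predicted scores, since the two lists do not contain the same number of 3s; the function name is ours.

```python
import math


def dcg(relevances):
    """DCG with rel(1) taken as is and rel(i)/log2(i) for ranks i >= 2."""
    total = 0.0
    for i, rel in enumerate(relevances, start=1):
        total += rel if i == 1 else rel / math.log2(i)
    return total


# Relevance scores of the 30 predicted Phoenix topics, in ranked order.
predicted = [3, 3, 3, 0, 2, 2, 0, 1, 0, 0,
             3, 2, 1, 2, 2, 2, 2, 2, 0, 2,
             2, 1, 3, 1, 2, 1, 3, 3, 1, 3]
# Ideal ordering as listed in the IDCG table: nine 3s, eleven 2s, six 1s, four 0s.
ideal = [3] * 9 + [2] * 11 + [1] * 6 + [0] * 4

ndcg = dcg(predicted) / dcg(ideal)
print(round(dcg(predicted), 4), round(dcg(ideal), 4), round(ndcg, 4))
```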

Things to Note!

Based on results, identified categories: Breakfast and Brunch, Bakery

Keywords: donut, bagel, cupcake, gelato, gyro, yogurt, buffet, boba, pizza, sushi, coffee, sub, wing, crepe, burger, burrito, taco, cookie, gluten, breakfast, coffee-shop, hash-brown, cake

Thank You!
