yelp dataset challenge
TRANSCRIPT
Yelp Dataset Challenge
ANWAR SHAIKH, ASHWIN NIMHAN, MANASHREE RAO, SHRIJIT PILLAI, TEJAS SHAH
Project Tasks
Task 1: Assign Categories to Businesses in the Yelp Data Set
Task 2: Recommend Food Items and/or Services in a Restaurant
Task 3: Determine Influential Factors in a City Affecting Restaurants
Task 1
...
Task 1 : Methodology
[Diagram: category-assignment pipeline. Components shown:]
• Business-to-Review Map and Business-to-Category Map
• Mapping Phase: Category-to-Review Mapping
• Lucene Index
• Tf-Idf similarity options: 1. Default 2. BM25 3. Dirichlet
• Training Set and Testing Set
• Predicted Categories
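The idea of mapping categories to reviews and then ranking categories for an unseen business can be sketched without Lucene. This is a minimal pure-Python stand-in, not the project's implementation: the function names, the whitespace tokenizer, the BM25 parameters `k1` and `b`, and the toy data are all assumptions, and Lucene's own similarity classes will not give identical scores.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification of a real analyzer)."""
    return text.lower().split()


def build_category_docs(business_reviews, business_categories):
    """Map each category to the tokens of all training reviews filed under it."""
    docs = {}
    for business, categories in business_categories.items():
        tokens = tokenize(" ".join(business_reviews.get(business, [])))
        for category in categories:
            docs.setdefault(category, []).extend(tokens)
    return docs


def rank_categories(query_text, docs, k1=1.2, b=0.75):
    """Rank categories by a BM25 score of the query against each category document."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    scores = {}
    for category, tokens in docs.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in set(tokenize(query_text)):
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores[category] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical training data: business id -> reviews, business id -> categories.
train_reviews = {
    "b1": ["great pizza and pasta"],
    "b2": ["fresh sushi rolls"],
    "b3": ["cheap beer and loud music"],
}
train_categories = {"b1": ["Italian"], "b2": ["Japanese"], "b3": ["Bars"]}
category_docs = build_category_docs(train_reviews, train_categories)

# Reviews of a test business act as the query; the top-k categories are the prediction.
top3 = rank_categories("wood fired pizza and fresh pasta", category_docs)[:3]
```

Taking the first 3, 5 or 7 entries of the ranked list corresponds to the Top 3, Top 5 and Top 7 settings reported in the evaluation.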
Evaluation: BM25 Similarity
Metric           Top 3   Top 5   Top 7
Precision        0.54    0.38    0.33
Recall           0.55    0.66    0.72
F2-Measure       0.55    0.57    0.58
At least 1 TP    0.85    0.88    0.89
Evaluation: Default Similarity
Metric           Top 3   Top 5   Top 7
Precision        0.51    0.36    0.33
Recall           0.53    0.62    0.66
F2-Measure       0.53    0.54    0.55
At least 1 TP    0.84    0.85    0.87
Evaluation: LMDirichlet Similarity
Metric           Top 3   Top 5   Top 7
Precision        0.42    0.32    0.30
Recall           0.58    0.60    0.55
F2-Measure       0.53    0.51    0.47
At least 1 TP    0.81    0.84    0.86
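The four reported metrics can be computed per business as in the sketch below. The function names and the example categories are hypothetical, and the slides do not say how per-business values were averaged, so only the single-business calculation is shown. The F2-Measure is the F-beta score with beta = 2, which weights recall more heavily than precision.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta=2 gives the F2-Measure."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


def top_k_metrics(predicted, actual, k):
    """Precision, recall, F2 and an 'at least 1 TP' flag for the top-k predictions."""
    top = predicted[:k]
    true_positives = len(set(top) & set(actual))
    precision = true_positives / len(top) if top else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall, f_beta(precision, recall), true_positives > 0


# Hypothetical ranked prediction and ground truth for one business.
predicted = ["Restaurants", "Pizza", "Bars", "Italian", "Nightlife"]
actual = {"Restaurants", "Italian", "Pizza"}
precision, recall, f2, hit = top_k_metrics(predicted, actual, k=3)
```

Plugging the rounded BM25 Top 3 precision and recall (0.54 and 0.55) into `f_beta` gives about 0.55, in line with the F2-Measure reported for that setting.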
Task 2: Recommend Restaurant Food Items or Services
...
Task 2 : Methodology
• Restaurant Reviews: Reviews of each restaurant are indexed one after another
• Stanford CoreNLP: Each review is passed into the tool, which forms an XML file for each input
• XML Parsing: Nouns and adjectives are extracted
• Feature Extraction: Nouns are used as features and adjectives as sentiments
• Feature Filtering: Features are filtered to obtain only restaurant-related ones using the Task 1 solution
• Feature Processing: Sentiments are mapped to features
Feature Extraction
Every token has an associated POS tag
Tokens with POS tag “NN” are nouns and tokens with POS tag “JJ” are adjectives
Nouns are considered as features and adjectives as sentiments
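A minimal sketch of this extraction step follows. It assumes the CoreNLP XML places each token in a `<token>` element with `<word>` and `<POS>` children; the sample XML is hand-written for illustration. Matching on the tag prefix (so that plural nouns such as `NNS` also count) is an assumption, since the slides name only “NN” and “JJ”.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed CoreNLP XML layout.
SAMPLE_XML = """<root><document><sentences><sentence id="1"><tokens>
<token id="1"><word>The</word><POS>DT</POS></token>
<token id="2"><word>burger</word><POS>NN</POS></token>
<token id="3"><word>was</word><POS>VBD</POS></token>
<token id="4"><word>great</word><POS>JJ</POS></token>
</tokens></sentence></sentences></document></root>"""


def extract_features_and_sentiments(xml_text):
    """Return (nouns, adjectives) found in a CoreNLP-style XML string."""
    root = ET.fromstring(xml_text)
    nouns, adjectives = [], []
    for token in root.iter("token"):
        word = token.findtext("word")
        pos = token.findtext("POS") or ""
        if word is None:
            continue
        if pos.startswith("NN"):      # nouns are treated as features
            nouns.append(word.lower())
        elif pos.startswith("JJ"):    # adjectives are treated as sentiments
            adjectives.append(word.lower())
    return nouns, adjectives


features, sentiments = extract_features_and_sentiments(SAMPLE_XML)
```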
Feature Filtering
• Noise is present in the features obtained from the Feature Extraction phase
• Using the Task 1 solution, the categories of input features are determined
• Features whose categories are related to restaurants are considered for further processing
Before Feature Filtering: cheese, burger, ones, menu, combinations, idea, commission
After Feature Filtering: cheese, burger, menu
Feature Processing
• Problem: The relationship between noun and adjective was ambiguous for some sentences.
• Example: The food was great but the service was bad
• After parsing, does “bad” belong to food or to service?
[Flowchart: New Review → Adjective: Positive or Negative? → Negative Word in 4-word distance? → Decision (Recommended or Not Recommended)]
Classification of reviews
1. For each sentence, the noun is extracted through feature extraction
2. The corresponding adjective is identified as positive or negative
3. A negation is searched for within a 4-word distance of the adjective
4. A feature is classified as Recommended if the number of positive sentiments associated with it is greater than the number of negative sentiments
All the above steps are repeated for each review
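The classification steps above can be sketched as follows. Several details are assumptions rather than facts from the slides: the tiny `POSITIVE`, `NEGATIVE` and `NEGATIONS` word lists are placeholders for a real sentiment lexicon, the 4-word window is searched on both sides of the adjective, a negation found in the window flips the adjective's polarity, and each input already pairs a feature with the position of its adjective.

```python
from collections import defaultdict

# Placeholder lexicons (assumptions); a real system would use a sentiment lexicon.
POSITIVE = {"great", "good", "fresh", "nice", "flavorful"}
NEGATIVE = {"bad", "bland", "chewy"}
NEGATIONS = {"not", "no", "never"}


def adjective_polarity(tokens, adj_index, window=4):
    """Return +1, -1 or 0 for the adjective, flipped if a negation is within the window."""
    word = tokens[adj_index].lower()
    if word in POSITIVE:
        polarity = 1
    elif word in NEGATIVE:
        polarity = -1
    else:
        return 0
    start = max(0, adj_index - window)
    end = min(len(tokens), adj_index + window + 1)
    nearby = tokens[start:adj_index] + tokens[adj_index + 1:end]
    if any(token.lower() in NEGATIONS for token in nearby):
        polarity = -polarity
    return polarity


def recommend(pairs):
    """pairs: iterable of (feature, sentence_tokens, adjective_index) tuples."""
    counts = defaultdict(lambda: [0, 0])  # feature -> [positive, negative]
    for feature, tokens, adj_index in pairs:
        tally = counts[feature]
        polarity = adjective_polarity(tokens, adj_index)
        if polarity > 0:
            tally[0] += 1
        elif polarity < 0:
            tally[1] += 1
    return {feature: pos > neg for feature, (pos, neg) in counts.items()}


pairs = [
    ("bread", "the bread was fresh".split(), 3),
    ("bread", "great bread here".split(), 0),
    ("bread", "the bread was not good".split(), 4),
    ("service", "the service was bad".split(), 3),
]
decisions = recommend(pairs)
```

Here "bread" collects two positive and one negated (hence negative) sentiment and is marked Recommended, while "service" is not.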
Sample Result
Predicted Feature   Predicted Feature Sentiments                  Predicted as Recommended?   Actually Recommended?
sub                 next, decent                                  Y                           Y
bread               flavorful, bland, fresh, great, nice          Y                           Y
peppercorn          nice                                          Y                           Y
stuff-it            chewy                                         Y                           N
sandwich            mayo/mustard/vinegar, east, good, unknown     Y                           Y
menu                decent                                        Y                           Y
bacon               real                                          Y                           Y
bite                huge                                          Y                           N
veggies             sorry                                         N                           Y
Evaluation
Set 1 - Recommended features are obtained from 60% of the reviews of a particular restaurant.
Set 2 - The remaining 40% of the reviews are used for testing.
If a recommended feature from Set 1 is also present as a recommended feature in Set 2, then it is a True Positive.
Evaluation Metrics
[Bar chart] Precision: 0.53, Recall: 0.67
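A rough sketch of this evaluation for one restaurant is shown below. The slides define only the True Positive, so the denominators are assumptions: precision is taken relative to the features recommended from Set 1 and recall relative to those recommended from Set 2. Splitting the reviews in their stored order is also an assumption, and all names are hypothetical.

```python
def split_reviews(reviews, train_fraction=0.6):
    """Split a restaurant's reviews into Set 1 (first 60%) and Set 2 (remaining 40%)."""
    cut = int(len(reviews) * train_fraction)
    return reviews[:cut], reviews[cut:]


def overlap_metrics(set1_recommended, set2_recommended):
    """Precision and recall of Set 1 recommendations judged against Set 2."""
    true_positives = set1_recommended & set2_recommended
    precision = len(true_positives) / len(set1_recommended) if set1_recommended else 0.0
    recall = len(true_positives) / len(set2_recommended) if set2_recommended else 0.0
    return precision, recall


set1, set2 = split_reviews(["r1", "r2", "r3", "r4", "r5"])
precision, recall = overlap_metrics({"bread", "menu", "bacon"},
                                    {"bread", "menu", "veggies", "sub"})
```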
Identifying Influential Topics
“Identify features from reviews which are relevant city wide and influence the user’s choice and restaurant’s popularity”
Phases
I. Business classification by city
II. Popular item word-count
III. NLP feature extraction
IV. Feature re-ranking model
V. Model fitness evaluation
Business Classification Phase I
Issue: Reviews specify neighborhood, not city. (~150 !!!)
Solution:
1. Identify city based on geo-code through mapping service.
2. K-means clustering
   1. Data point features (Business Id, Latitude, Longitude)
   2. Dissimilarity metric (Euclidean distance)
   3. Cluster count: k (10)
   4. Centroid labeling
3. Data persistence and indexing
   1. Split reviews based on clustered business ids
   2. Save & index for next phase.
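The clustering step can be illustrated with a small pure-Python version of Lloyd's k-means over (latitude, longitude) pairs using Euclidean distance. This is a sketch under assumptions: the seeding, iteration cap and toy coordinates are hypothetical, the business id is treated as an identifier rather than a distance feature, and centroid labeling through a mapping service is left out. The slides use k = 10; the toy example uses k = 2.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Cluster 2-D points with Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Recompute each centroid as the mean of its members (keep it if empty).
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    labels = [min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points]
    return centroids, labels


# Hypothetical businesses: id -> (latitude, longitude), two per metro area.
businesses = {
    "biz_a": (33.4, -112.0),
    "biz_b": (33.5, -112.1),
    "biz_c": (36.1, -115.1),
    "biz_d": (36.2, -115.2),
}
ids = list(businesses)
centroids, labels = kmeans([businesses[i] for i in ids], k=2)
cluster_of = dict(zip(ids, labels))  # used to split reviews by clustered business id
```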
Word-count Phase II
Issue: How do we get the influential factors of a city?
Solution: Word count as first pass
Observation: Noise (adjectives, verbs, expressions)
Proposal: Include features derived through NLP
NLP Features Phase III
Issue: Noise reduction and contextual awareness
Solution: Use NLP to identify features in the reviews
Observation: Subtle change in ordering of words
Proposal: Re-rank the words using metrics from user and review.
• Restaurant Reviews by City: Reviews of all restaurants filtered by city are indexed.
• Stanford CoreNLP: Each review is passed into the tool, which forms an XML file for each input
• Feature Extraction: Nouns are extracted from the XML and are used as features
Mathematical Formula
\[ \sum_{R \in R_{1000}} \{\ldots\} \]
Who is Important? Elite User
What is Important? Useful Review
Mathematical Formula
Features from NLP do take into account word count and context, but do NOT consider user weight and review weight.
[Diagram: Word list from NLP → Solr Index → Top 1K Relevant Reviews → Program with Mathematical Formula → Scored word]
User attributes: Review Count = Urc, Average Stars, Votes = Uv, Friends, Elite = Ue, Yelping Since, Compliments, Fans = Uf
Mathematical Formula
\[ \left\{ 0.25 \cdot U_e + 0.55 \cdot \frac{U_v}{U_{rc}} + 0.20 \cdot U_f \right\} \]
Normalization of votes:
\[ U_{vnorm} = \frac{U_v}{U_{rc}} \]
User   Review Count   Votes
U1     10             1000
U2     1000           1000
Mathematical Formula
\[ \left\{ (0.15 \cdot R_v + 0.15 \cdot R_s) + 0.7 \cdot \left( 0.25 \cdot U_e + 0.55 \cdot \frac{U_v}{U_{rc}} + 0.20 \cdot U_f \right) \right\} \]
Review attributes: User, Stars = Rs, Text, Date, Votes = Rv
Stars   Sentiment
1       Very Strong
2       Inclined -ve
3       Ambivalent
4       Inclined +ve
5       Very Strong
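The weighting above can be written out directly, which also shows why votes are normalized by review count. In this sketch the function and parameter names are hypothetical, and every input is assumed to be already numeric (the slides do not state how Elite, Fans, review votes or stars are scaled). Returning 0 votes per review for a user with no reviews is likewise an assumption.

```python
def user_weight(elite, votes, review_count, fans):
    """0.25*Ue + 0.55*(Uv/Urc) + 0.20*Uf, with votes normalized by review count."""
    votes_per_review = votes / review_count if review_count else 0.0
    return 0.25 * elite + 0.55 * votes_per_review + 0.20 * fans


def review_weight(review_votes, review_stars, u_weight):
    """(0.15*Rv + 0.15*Rs) + 0.7 * user weight."""
    return (0.15 * review_votes + 0.15 * review_stars) + 0.7 * u_weight


# U1 and U2 both have 1000 votes, but U1 earned them with only 10 reviews.
u1 = user_weight(elite=0, votes=1000, review_count=10, fans=0)
u2 = user_weight(elite=0, votes=1000, review_count=1000, fans=0)
r1 = review_weight(review_votes=2, review_stars=5, u_weight=u1)
```

With raw vote counts the two users would be indistinguishable; after normalization U1 averages 100 votes per review against 1 for U2, so U1 receives the larger weight.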
Mathematical Formula
\[ \left\{ tf \cdot \log\left( 1 - \frac{df}{D_{count}} \right) \right\} \times \left\{ (0.15 \cdot R_v + 0.15 \cdot R_s) + 0.7 \cdot \left( 0.25 \cdot U_e + 0.55 \cdot \frac{U_v}{U_{rc}} + 0.20 \cdot U_f \right) \right\} \]
Review Relevance: Term Frequency = tf, Document Frequency = df, Document Count = Dcount
Output
Madison
Rank   Wordcount List   NLP list-Unformatted   NLP list-Model
1      food             food                   pizza
2      place            beer                   cheese
3      like             cheese                 coffee
4      from             menu                   breakfast
5      service          curds                  burger
6      go               atmosphere             taco
7      time             burger                 sushi
8      madison          dane                   chocolate
9      been             drinks                 beer
10     cheese           beers                  sandwich
11     menu             restaurant             curds
12     bar              table                  ice
13     restaurant       coffee                 wine
14     ordered          pizza                  store
15     love             something              cream
16     order            sandwich               lunch
17     chicken          dinner                 rolls
18     beer             lunch                  atmosphere
19     pizza            meal                   tea
20     sauce            sauce                  curries
21     night            burgers                steak
22     people           drink                  noodle
23     make             bread                  spot
24     staff            server                 soup
25     made             chicken                egg
Phoenix
Rank   Wordcount List   NLP list-Unformatted   NLP list-Model
1      food             food                   donut
2      good             pizza                  bagel
3      place            burger                 cupcake
4      great            menu                   gelato
5      like             restaurant             gyro
6      service          fries                  yogurt
7      time             atmosphere             buffet
8      go               chicken                boba
9      back             patio                  pizza
10     from             breakfast              sushi
11     been             table                  coffee
12     love             lunch                  sub
13     ordered          dinner                 wing
14     chicken          meal                   crepe
15     nice             salad                  burger
16     order            cheese                 burrito
17     restaurant       potato                 taco
18     little           server                 cookie
19     menu             something              gluten
20     pizza            sauce                  breakfast
21     bar              drinks                 coffee-shop
22     delicious        rice                   hash-brown
23     friendly         burgers                cake
24     first            beer                   vegan
25     pretty           spot                   teas

Las Vegas
Rank   Wordcount List   NLP list-Unformatted   NLP list-Model
1      food             food                   donuts
2      good             beer                   bagel
3      place            sushi                  crepe
4      like             restaurant             pizza
5      great            meal                   oyster
6      service          menu                   yogurt
7      from             atmosphere             shrimp
8      time             table                  burger
9      vegas            steak                  gelato
10     go               dinner                 sushi
11     back             server                 wings
12     ordered          salad                  sandwich
13     restaurant       tables                 pancake
14     nice             rib                    coffee
15     been             buffet                 burrito
16     order            dining                 curry
17     chicken          breakfast              buffet
18     little           waitress               waffle
19     pretty           shrimp                 chocolate
20     love             something              cake
21     menu             beers                  breakfast
22     eat              dishes                 tea
23     delicious        dish                   cookies
24     first            restaurants            gluten
25     people           sauce                  pastrami
Evaluation Metric: NDCG
Predicted topics for Phoenix under categories: Bakery, Breakfast and Brunch
To capture the strongest sentiments about these topics, we analyzed the top 1000 features for businesses predicted under Bakery, Breakfast and Brunch for the specific city, in this case Phoenix.
Using these features as input for relevance score, we analyze the top 30 topics predicted by the model:
NDCG = 18.80190835 / 21.8978282 = 0.8586
Rank   NLP list - Output From Model   Relevance Score   Log        DCG = rel(i)/log i
1      donut                          3                 0          3
2      bagel                          3                 1          3
3      cupcake                        3                 1.584963   1.892789
4      gelato                         0                 2          0
5      gyro                           2                 2.321928   0.861353
6      yogurt                         2                 2.584963   0.773706
7      buffet                         0                 2.807355   0
8      boba                           1                 3          0.333333
9      pizza                          0                 3.169925   0
10     sushi                          0                 3.321928   0
11     coffee                         3                 3.459432   0.867194
12     sub                            2                 3.584963   0.557886
13     wing                           1                 3.70044    0.270238
14     crepe                          2                 3.807355   0.525299
15     burger                         2                 3.906891   0.511916
16     burrito                        2                 4          0.5
17     taco                           2                 4.087463   0.489301
18     cookie                         2                 4.169925   0.479625
19     gluten                         0                 4.247928   0
20     breakfast                      2                 4.321928   0.462756
21     coffee-shop                    2                 4.392317   0.45534
22     hash-brown                     1                 4.459432   0.224244
23     cake                           3                 4.523562   0.663194
24     vegan                          1                 4.584963   0.218104
25     teas                           2                 4.643856   0.430677
26     bruschetta                     1                 4.70044    0.212746
27     waffle                         3                 4.754888   0.63093
28     pancake                        3                 4.807355   0.624044
29     subway                         1                 4.857981   0.205847
30     latte                          3                 4.906891   0.611385
Rank   Relevance Score   Log          Ideal DCG (IDCG)
1      3                 0            3
2      3                 1            3
3      3                 1.5849625    1.89278926
4      3                 2            1.5
5      3                 2.32192809   1.29202967
6      3                 2.5849625    1.16055842
7      3                 2.80735492   1.06862156
8      3                 3            1
9      3                 3.169925     0.94639463
10     2                 3.32192809   0.60205999
11     2                 3.45943162   0.57812965
12     2                 3.5849625    0.55788589
13     2                 3.70043972   0.54047631
14     2                 3.80735492   0.52529907
15     2                 3.9068906    0.51191605
16     2                 4            0.5
17     2                 4.08746284   0.48930108
18     2                 4.169925     0.47962493
19     2                 4.24792751   0.47081783
20     2                 4.32192809   0.46275643
21     1                 4.39231742   0.22767025
22     1                 4.45943162   0.22424382
23     1                 4.52356196   0.22106473
24     1                 4.5849625    0.21810429
25     1                 4.64385619   0.21533828
26     1                 4.70043972   0.21274605
27     0                 4.7548875    0
28     0                 4.80735492   0
29     0                 4.857981     0
30     0                 4.9068906    0
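The NDCG figure can be reproduced from the two relevance columns with a few lines of Python. This sketch follows the convention visible in the tables: the relevance at rank 1 is taken as-is and every later rank i is divided by log2(i). Both relevance lists are copied from the tables above; in particular the ideal ordering is taken from the IDCG table rather than derived by sorting the predicted list. The function name is hypothetical.

```python
import math


def dcg(relevances):
    """DCG with rel_1 undiscounted and rel_i / log2(i) for ranks i >= 2."""
    total = 0.0
    for rank, rel in enumerate(relevances, start=1):
        total += rel if rank == 1 else rel / math.log2(rank)
    return total


# Relevance scores of the 30 predicted Phoenix topics, in ranked order.
predicted = [3, 3, 3, 0, 2, 2, 0, 1, 0, 0,
             3, 2, 1, 2, 2, 2, 2, 2, 0, 2,
             2, 1, 3, 1, 2, 1, 3, 3, 1, 3]

# Relevance scores in the ideal ordering, as listed in the IDCG table.
ideal = [3] * 9 + [2] * 11 + [1] * 6 + [0] * 4

ndcg = dcg(predicted) / dcg(ideal)
```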
Things to Note!
Based on Results:
Identified categories: Breakfast and Brunch, Bakery
Keywords: donut, bagel, cupcake, gelato, gyro, yogurt, buffet, boba, pizza, sushi, coffee, sub, wing, crepe, burger, burrito, taco, cookie, gluten, breakfast, coffee-shop, hash-brown, cake
Thank You!