TRANSCRIPT
DATA MINING CAPSTONE
FINAL REPORT
I
ABSTRACT
This report summarizes the tasks accomplished for the Data Mining Capstone. The tasks are
based on Yelp review data, mainly for restaurants. Six tasks are accomplished.
The first task is to visualize customer review text for all restaurants. A word cloud of frequent
words is plotted, and topics are detected from the review text for all restaurants. In addition, a
topic comparison for two Chinese restaurants is provided and visualized.
The second task is to build a cuisine map based on the similarity between cuisines computed
from customer review text. The top fifty cuisines are found first to be included in the cuisine
map. Term Frequency (TF), Term Frequency-Inverse Document Frequency (TF-IDF), and
clustering methods, i.e. hierarchical clustering and k-means clustering, are used to build the
similarity matrix, heat map, and cuisine map.
The third task is to recognize dish names from the customer review text of a certain cuisine.
Chinese cuisine is chosen for this task. A labeled dish name list for Chinese cuisine is revised
manually. Then two algorithms, i.e. TopMine and SegPhrase, are used to mine a comprehensive
Chinese dish list based on the review text for Chinese restaurants and the labeled dish name list.
The fourth and fifth tasks are to detect popular dishes and recommend good restaurants for
certain dishes. Again, Chinese cuisine is chosen for these tasks. 700 dish names from Task 3 are
used as a pool of Chinese dishes. The top 100 most popular dishes and their corresponding
tastiness are found by mining the customer review text and review scores, i.e. stars. We also
recommend the top 100 most popular restaurants for two popular Chinese dishes, i.e. orange
chicken and fried rice.
The sixth task is to predict whether a set of restaurants will pass the public health inspection tests
given the corresponding Yelp text reviews along with some additional information such as the
locations and cuisines offered in these restaurants.
In addition, in this report we highlight the following: (1) the most useful data mining results
produced through these specific data mining tasks and potential people who might benefit from
such results; (2) the novel ideas/methods explored to carry out the tasks; (3) new knowledge
people can learn from the project activities, particularly through the experiments.
The report is organized as follows. Chapters 1 to 5 introduce the six tasks: Chapters 1 to 3
address Tasks 1 to 3, Chapter 4 addresses Tasks 4 and 5, and Chapter 5 addresses Task 6.
Chapter 6 introduces the useful results. Chapter 7 presents the novel methods used in carrying
out the tasks. Chapter 8 summarizes the contribution of new knowledge discovered throughout
the capstone.
TABLE OF CONTENTS
1 CHAPTER 1 SUMMARY OF TASK 1: EXPLORATION OF DATA SET
  1.1 Tools Used
  1.2 Major Packages
  1.3 Data Import
  1.4 Data Preprocessing
  1.5 Topic Model Fitting
  1.6 Comparison of Topics for Two Chinese Restaurants
  1.7 Discussion on the Topics for the Two Chinese Restaurants
    1.7.1 Similarity
    1.7.2 Difference
2 CHAPTER 2 SUMMARY OF TASK 2: CUISINE CLUSTERING AND MAP CONSTRUCTION
  2.1 Tools Used
  2.2 Major Packages
  2.3 Data Import
  2.4 Data Preprocessing
  2.5 Similarity Matrix without IDF
  2.6 Similarity Matrix with IDF
  2.7 Similarity Matrix with Clustering
    2.7.1 Hierarchical Clustering
    2.7.2 k-Means Clustering
  2.8 Conclusions
3 CHAPTER 3 SUMMARY OF TASK 3: DISH RECOGNITION
  3.1 Task 3.1: Manual Tagging
  3.2 Task 3.2: Mining Additional Dish Names
    3.2.1 Corpus Preparation
  3.3 Dish Name Identification Using TopMine
    3.3.1 Parameters
    3.3.2 Opinion about the Result
    3.3.3 Improvement
  3.4 Dish Name Identification Using SegPhrase
    3.4.1 Parameters
    3.4.2 Opinion about the Result
4 CHAPTER 4 SUMMARY OF TASKS 4 & 5: POPULAR DISHES AND RESTAURANT RECOMMENDATION
  4.1 Data Preparation
    4.1.1 Corpus
    4.1.2 Dish List
  4.2 Tools and Packages
    4.2.1 Attached Base Packages
    4.2.2 Other Attached Packages
  4.3 Task 4: Popular Dishes
    4.3.1 Popularity Analysis
    4.3.2 Sentiment Analysis
    4.3.3 Illustration
  4.4 Task 5: Popular Restaurants
    4.4.1 Popularity Analysis
    4.4.2 Sentiment Analysis
    4.4.3 Illustration
  4.5 Conclusions
5 CHAPTER 5 SUMMARY OF TASK 6: HYGIENE PREDICTION
  5.1 Tools Used
    5.1.1 Packages
  5.2 Text Preprocessing
  5.3 Training Method 1: Logistic Regression
    5.3.1 Text Representation Techniques
      5.3.1.1 Unigram
    5.3.2 Additional Features Used
    5.3.3 Learning Algorithm
    5.3.4 Results Analysis
  5.4 Training Method 2: Random Forest
    5.4.1 Text Representation Techniques
      5.4.1.1 Unigram
      5.4.1.2 Topic Model
    5.4.2 Additional Features Used
    5.4.3 Learning Algorithm
    5.4.4 Results Analysis
  5.5 Method Comparison
6 CHAPTER 6 USEFULNESS OF RESULTS
  6.1 Cuisine Maps
    6.1.1 Usefulness for Customers
    6.1.2 Usefulness for Restaurant Owners
  6.2 Dish Recognizer
  6.3 Popular Dishes Detection
  6.4 Restaurant Recommendation
  6.5 Hygiene Prediction
7 CHAPTER 7 NOVELTY OF EXPLORATION
  7.1 Hierarchical Clustering in Cuisine Map Development
  7.2 TopMine Output Used as the Input for SegPhrase in Dish Recognition
  7.3 Top Frequent Unigram Terms and Topic Model are Used in Hygiene Prediction
8 CHAPTER 8 CONTRIBUTION OF NEW KNOWLEDGE
  8.1 Some Advantages of Random Forest over Logistic Regression
    8.1.1 Random Forest is Less Prone to Overfitting than Logistic Regression
    8.1.2 Logistic Regression is not Good at Handling Missing Feature Values
9 CHAPTER 9 IMPROVEMENTS TO BE DONE
10 REFERENCES
List of Figures
Figure 1.1 Word cloud
Figure 1.2 The topics of the sampled Restaurant
Figure 1.3 The topics of the first Chinese Restaurant CR1
Figure 1.4 The topics of the second Chinese Restaurant CR2
Figure 2.1 Similarity matrix without IDF
Figure 2.2 Similarity matrix using IDF
Figure 2.3 Similarity matrix and hierarchical cluster
Figure 2.4 Similarity matrix and k-means cluster
Figure 4.1 Illustration for popular dish names
Figure 4.2 Illustration for popular restaurants for orange chicken
Figure 4.3 Illustration for popular restaurants for fried rice
List of Tables
TABLE 2.1 Cluster list
TABLE 5.1 Prediction Score obtained by Logistic Regression
TABLE 5.2 Prediction Score obtained by Random Forest & Unigram
TABLE 5.3 Prediction Score obtained by Random Forest & Topic Model
1 CHAPTER 1
SUMMARY OF TASK 1: EXPLORATION OF DATA SET
In this chapter, we explore the Yelp data set. In particular, we mine customers' reviews of
restaurants in order to find topics. We mine the topics with a Latent Dirichlet Allocation
(LDA) model and plot the topics in a circular tree for visualization. In addition, we mine and
compare the topics of two Chinese restaurants.
1.1 Tools Used
R version 3.1.3
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)
1.2 Major Packages
jsonlite_0.9.14
tm_0.6-2
topicmodels_0.2-2
igraph_1.0.1
1.3 Data Import
First, we read “yelp_academic_dataset_business.json” into variable “BUSINESS” and
“yelp_academic_dataset_review.json” into variable “REVIEW” using “jsonlite” package.
Second, we select all the restaurants from “BUSINESS” by finding the entries that have
“Restaurants” in column “categories”. We denote this selected data set as “RESTAURANTS”.
Third, we merge “RESTAURANTS” and “REVIEW” into “RESTAURANTS_REVIEW” by
“business_id” column. This also eliminates the entries in “REVIEW” that are not for restaurants.
Then, we randomly select 10,000 samples from "RESTAURANTS_REVIEW" and record them
in the data set "restaurants_review". A relatively small data set is sampled due to limited time
and computing capacity.
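The filter-and-merge step above can be sketched in base R. Since the real JSON files are large and read with the "jsonlite" package, the sketch below substitutes a tiny synthetic BUSINESS and REVIEW table (all values here are made up for illustration):

```r
# Illustrative sketch of the filter-and-merge step, using tiny synthetic
# data frames in place of the real tables read from the Yelp JSON files.
BUSINESS <- data.frame(
  business_id = c("b1", "b2", "b3"),
  categories  = c("Restaurants; Chinese", "Shopping", "Restaurants; Pizza"),
  stringsAsFactors = FALSE
)
REVIEW <- data.frame(
  business_id = c("b1", "b2", "b3"),
  text  = c("great dim sum", "nice shoes", "good pizza"),
  stars = c(5, 4, 3),
  stringsAsFactors = FALSE
)

# Keep only businesses whose "categories" field mentions "Restaurants".
RESTAURANTS <- BUSINESS[grepl("Restaurants", BUSINESS$categories), ]

# Merging by "business_id" also drops reviews of non-restaurants.
RESTAURANTS_REVIEW <- merge(RESTAURANTS, REVIEW, by = "business_id")
nrow(RESTAURANTS_REVIEW)  # 2: the review of "b2" is eliminated
```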
1.4 Data Preprocessing
We then convert the data into a corpus by using the "tm" package:
We build a corpus based on the "text" column of "restaurants_review".
We convert all the words into lower case.
We remove anything other than English letters or spaces.
We remove stop words.
We remove extra white space.
We make the text fit the paper width, i.e. each line has at most 60 characters.
We heuristically complete stemmed words.
We construct a document-term matrix.
We plot the word cloud to see what the major words are.
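The cleaning steps above can be sketched in base R (the report itself uses the "tm" package's tm_map transformations; the two reviews and the stop list below are invented for illustration):

```r
# A base-R sketch of the preprocessing steps; the report uses tm_map with
# content_transformer(tolower), removeWords, stripWhitespace, etc.
reviews <- c("The food was GREAT!!  Really great food.",
             "Service was slow, but the pizza is good.")

clean <- tolower(reviews)                    # lower case
clean <- gsub("[^a-z ]", " ", clean)         # keep only letters and spaces
stops <- c("the", "was", "but", "is", "a")   # tiny illustrative stop list
clean <- gsub(paste0("\\b(", paste(stops, collapse = "|"), ")\\b"), " ", clean)
clean <- gsub("\\s+", " ", trimws(clean))    # strip extra white space

# A minimal document-term matrix: rows are reviews, columns are terms.
terms <- sort(unique(unlist(strsplit(clean, " "))))
dtm <- t(sapply(strsplit(clean, " "),
                function(doc) sapply(terms, function(t) sum(doc == t))))
dtm[1, "great"]  # 2: "great" appears twice in the first review
```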
Figure 1.1 Word cloud
From the word cloud we can see the common words people use to describe a restaurant, e.g.
"good", "food", "great", "place", "just", "like", and "service". In addition, we can see food
names such as "salad", "pizza", "cheese", and "sushi". We can also see that "chicken" is a
very common food in the US. All in all, the word cloud reveals a lot of information that makes
sense.
1.5 Topic Model Fitting
We fit an LDA model to the document-term matrix using the "LDA" function in the
"topicmodels" package, setting the number of topics to 10. The plot is shown in Figure 1.2,
where ten words of each topic are presented. From Figure 1.2 we can tell that people mention
"food", "place", "service", "good", and "great" in many topics, which is to be expected. In
Figure 1.2, the words in topic i have i after them. For example, in topic 1, we have "food 1",
"great 1", and so on.
Figure 1.2 The topics of the sampled Restaurant
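The fitting call is roughly `lda <- LDA(dtm, k = 10); terms(lda, 10)`. What Figure 1.2 plots is the ten highest-probability words per topic; that selection step can be sketched in base R on a toy topic-word probability matrix (the real matrix comes from the fitted model, and the probabilities below are invented):

```r
# Toy topic-word probability matrix: 2 topics over 5 terms. In the report
# this comes from an LDA model fitted with the "topicmodels" package.
beta <- rbind(
  topic1 = c(food = 0.4, good = 0.3, pizza = 0.2, sushi = 0.05, fresh = 0.05),
  topic2 = c(food = 0.1, good = 0.2, pizza = 0.05, sushi = 0.4, fresh = 0.25)
)
# Top-2 terms per topic, analogous to the ten words per topic in Figure 1.2.
top_terms <- apply(beta, 1, function(p) names(sort(p, decreasing = TRUE))[1:2])
top_terms[, "topic1"]  # "food" "good"
```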
1.6 Comparison of Topics for Two Chinese Restaurants
We randomly select two Chinese restaurants: CR1 with "business_id"
"-3WVw1TNQbPBzaKCaQQ1AQ" and CR2 with "business_id" "-mz0Zr0Dw6ZASg7_ah1R8A".
We carry out the same procedure as above and obtain the LDA based topic plots as shown in
Figure 1.3 and Figure 1.4.
Figure 1.3 The topics of the first Chinese Restaurant CR1
Figure 1.4 The topics of the second Chinese Restaurant CR2
1.7 Discussion on the Topics for the Two Chinese Restaurants
1.7.1 Similarity
Both topic 2 for CR1 and topic 3 for CR2 contain "good", "dish", "order", "beef", and "place".
This is not surprising because beef is very common in the USA; it is very likely that good-tasting
dishes containing beef are often ordered in both restaurants.
The topics for both restaurants contain "China" and "Chinese", along with other common words
such as "food", "good", "place", "chicken", and "dish".
1.7.2 Difference
The major words of the topics depend heavily on the names and menus of the restaurants. It is
obvious that in the topics of restaurant CR1, people are talking about "chili" and "spiciness",
since the restaurant is called "China Chili" and probably serves a lot of spicy food. In
restaurant CR2, however, "fried", "egg", "roll", and "pork" appear often, because the second
restaurant is called "Sing High" and serves "Barbecued pork slices", "egg roll", and "fried
Won Ton".
2 CHAPTER 2
SUMMARY OF TASK 2: CUISINE CLUSTERING AND MAP
CONSTRUCTION
In this chapter, we mine the data set to construct cuisine maps to visually understand the
landscape of different types of cuisines and their similarities. The cuisine map can help users
understand what cuisines are available and their relations, which allows for the discovery of new
cuisines, thus facilitating exploration of unfamiliar cuisines. The cuisine map is built based on
the categories and customer reviews of restaurants in the Yelp data.
2.1 Tools Used
R version 3.1.3
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)
2.2 Major Packages:
reshape2_1.4.1 plyr_1.8.3 ggplot2_1.0.1 scales_0.2.5 HSAUR_1.3-7
cluster_2.0.3 corrplot_0.73 proxy_0.4-15 tm_0.6-2 NLP_0.1-8
jsonlite_0.9.16
2.3 Data Import
First, we read “yelp_academic_dataset_business.json” into variable “BUSINESS” and
“yelp_academic_dataset_review.json” into variable “REVIEW” using “jsonlite” package.
Second, we select all the restaurants from “BUSINESS” by finding the entries that have
“Restaurants” in column “categories”. We denote this selected data set as “RESTAURANTS”.
Third, we merge “RESTAURANTS” and “REVIEW” into “RESTAURANTS_REVIEW” by
“business_id” column. This also eliminates the entries in “REVIEW” that are not for restaurants.
2.4 Data Preprocessing
First, we search for the most popular cuisines by counting the frequency of cuisine names in
the "categories" column. We pick the top 50 to build the cuisine map:
[1] "American (New)" "American (Traditional)" "Nightlife" "Bars"
[5] "Mexican" "Italian" "Breakfast & Brunch" "Pizza"
[9] "Steakhouses" "Sandwiches" "Burgers" "Sushi Bars"
[13] "Japanese" "Chinese" "Seafood" "Buffets"
[17] "Fast Food" "Thai" "Asian Fusion" "Mediterranean"
[21] "French" "Cafes" "Sports Bars" "Barbeque"
[25] "Pubs" "Coffee & Tea" "Vietnamese" "Delis"
[29] "Vegetarian" "Lounges" "Greek" "Wine Bars"
[33] "Desserts" "Bakeries" "Gluten-Free" "Diners"
[37] "Indian" "Korean" "Salad" "Chicken Wings"
[41] "Hot Dogs" "Tapas Bars" "Arts & Entertainment" "Southern"
[45] "Tapas/Small Plates" "Middle Eastern" "Hawaiian" "Vegan"
[49] "Gastropubs" "Dim Sum"
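The counting step that produces this list can be sketched with base R's `table`; the category vector below is a tiny synthetic stand-in for the real "categories" column:

```r
# Sketch of the cuisine-counting step on synthetic category data; in the
# report the counts come from the "categories" column of RESTAURANTS.
categories <- c("Mexican", "Pizza", "Mexican", "Italian", "Pizza", "Mexican")
freq <- sort(table(categories), decreasing = TRUE)
head(names(freq), 2)  # the two most frequent cuisines: "Mexican" "Pizza"
```

The real pipeline takes `head(names(freq), 50)` instead of 2.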
Then, we eliminate the entries that are not in these 50 categories from
"RESTAURANTS_REVIEW", randomly sample 10,000 entries from
"RESTAURANTS_REVIEW", and record them in the data set "restaurants_review". A
relatively small data set is sampled due to limited time and computing capacity.
We then convert the data into a corpus by using the "tm" package:
We build a corpus based on the “text” column of “restaurants_review”.
We convert all the words into lower case.
We remove anything other than English letters or space.
We remove stop words.
We remove extra white space.
We make the text fit paper width, i.e. each line has at most 60 characters.
2.5 Similarity Matrix without IDF
First, we construct a document term matrix using the corpus we prepared in section 2.4 using
Term Frequency (TF). We do not apply Inverse Document Frequency (IDF) in constructing the
document term matrix.
Second, we calculate the similarity matrix based on the document term matrix by using “1 minus
cosine distance” and plot the similarity matrix in Figure 2.1. As can be seen from Figure 2.1,
the similarity value is between 0 and 1. The similarity of a cuisine to itself is 1 as expected. We
can observe many sets of cuisines that are very similar to each other, which is consistent with
common sense. To name a few:
American (New) , American (Traditional), Night Life, and Bars
Italian and Pizza
Delis and Sandwiches
Fast Food and Burgers
Cafes and Breakfast & Brunch
Japanese and Sushi Bars
Mediterranean, Greek, and Middle Eastern
Vegetarian and Gluten-free
Chinese, Asian Fusion, and Dim Sum
Figure 2.1 Similarity matrix without IDF
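The "1 minus cosine distance" similarity used for Figure 2.1 reduces to the cosine of the angle between two cuisines' term-frequency vectors. A sketch on invented toy TF vectors:

```r
# Cosine similarity between cuisines' term-frequency vectors; the report
# computes this as "1 minus cosine distance" over the document term matrix.
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

italian  <- c(pizza = 10, pasta = 8, sushi = 0)   # toy TF vectors
pizza    <- c(pizza = 12, pasta = 5, sushi = 1)
japanese <- c(pizza = 0,  pasta = 1, sushi = 9)

cosine_sim(italian, italian)  # 1: a cuisine is maximally similar to itself
cosine_sim(italian, pizza) > cosine_sim(italian, japanese)  # TRUE
```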
2.6 Similarity Matrix with IDF
The results presented in the previous section make a lot of sense. The similarity values between
similar cuisines are indeed higher than those between not-so-similar or very different cuisines.
However, the difference is not very significant. Therefore, we use IDF to enhance the contrast.
We prepare another document term matrix using TF-IDF and calculate the similarity matrix
with the same method (cosine distance). The similarity matrix is shown in Figure 2.2.
Figure 2.2 Similarity matrix using IDF
As can be seen from Figure 2.2, the similarity values between cuisines that are actually similar
to each other are significantly higher than the values between cuisines that have less in common.
For example, Dim Sum is a type of Chinese food. Based on Figure 2.1, it appears to have high
similarity to Japanese, Sushi Bars, and Seafood, and its similarity to Chinese is not significantly
higher than its similarity to those cuisines. According to Figure 2.2, however, the similarity of
Dim Sum to Japanese, Sushi Bars, and Seafood is much weaker, and its similarity to Chinese is
enhanced.
For another example, based on Figure 2.1, Greek seems to be very similar to American (New),
American (Traditional), Nightlife, and Bars, while its similarity to Mediterranean and Middle
Eastern does not look very significant. Based on Figure 2.2, the similarity among Greek,
Mediterranean, and Middle Eastern is much easier to see.
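The weighting behind Figure 2.2 can be sketched in base R (the report uses the "tm" package's `weightTfIdf`; the toy two-document matrix below is invented). The key effect is that a term occurring in every document gets IDF = 0, which is exactly what suppresses generic words such as "food":

```r
# Base-R sketch of TF-IDF weighting on a toy document term matrix.
dtm <- rbind(doc1 = c(food = 5, dimsum = 3, sushi = 0),
             doc2 = c(food = 4, dimsum = 0, sushi = 2))

idf   <- log2(nrow(dtm) / colSums(dtm > 0))   # IDF per term
tfidf <- dtm * rep(idf, each = nrow(dtm))      # scale each column by its IDF

tfidf[, "food"]  # 0 0: "food" occurs in all documents, so it is zeroed out
```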
2.7 Similarity Matrix with Clustering
We improved the similarity matrix by using TF-IDF in Section 2.6. However, related cuisines
are sometimes located far away from each other, and the cuisine map is not very handy to use.
For instance, Middle Eastern, Mediterranean, and Greek are far away from each other in the
cuisine maps shown in Figure 2.1 and Figure 2.2, even though they are quite similar; indeed, it
takes a lot of visual effort to find this relationship. Therefore, we carry out hierarchical
clustering and k-means clustering to facilitate the visualization of the relationships between
similar cuisines.
2.7.1 Hierarchical Clustering
We first try hierarchical clustering. A heat map is plotted in Figure 2.3 to show the similarity.
In Figure 2.3, the similarity relationships are very clear, since cuisines that are very similar to
each other are located close together and cuisines that are different are far apart. For example,
Middle Eastern, Mediterranean, and Greek are now in one cluster and next to each other. Many
interesting clusters are formed, such as Japanese and Sushi Bars, and Fast Food and Burgers.
Figure 2.3 Similarity matrix and hierarchical cluster
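The clustering step can be sketched with `stats::hclust` on a toy similarity matrix (the three cuisines and similarity values below are invented; distance is taken as 1 minus similarity, matching the cosine-based similarities used above):

```r
# Hierarchical clustering of cuisines from a toy similarity matrix.
sim <- matrix(c(1.0, 0.8, 0.1,
                0.8, 1.0, 0.2,
                0.1, 0.2, 1.0), nrow = 3,
              dimnames = list(c("Greek", "Mediterranean", "Burgers"),
                              c("Greek", "Mediterranean", "Burgers")))

hc <- hclust(as.dist(1 - sim))   # cluster on distance = 1 - similarity
clusters <- cutree(hc, k = 2)    # cut the dendrogram into 2 clusters

clusters["Greek"] == clusters["Mediterranean"]  # TRUE: they merge first
```

The heat map in Figure 2.3 is then the similarity matrix with rows and columns reordered by the dendrogram, which is what places similar cuisines next to each other.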
2.7.2 k-Means Clustering
We also carry out k-means clustering on our data set using the document term matrix based on
TF-IDF. The results are shown in Figure 2.4. We set k = 5. The five different clusters are
presented using different colors. The clusters are listed in TABLE 2.1. The clustering result
makes sense but is not as good as the result obtained by hierarchical clustering.
TABLE 2.1 Cluster list
Cluster 1: "Sushi Bars" "Japanese"
Cluster 2: "Mexican" "Breakfast & Brunch" "Steakhouses" "Sandwiches" "Burgers" "Chinese"
Cluster 3: "Seafood" "Buffets" "Fast Food" "Thai" "Asian Fusion" "Mediterranean" "French"
"Cafes" "Sports Bars" "Barbeque" "Pubs" "Coffee & Tea" "Vietnamese" "Delis" "Vegetarian"
"Lounges" "Greek" "Wine Bars" "Desserts" "Bakeries" "Gluten-Free" "Diners" "Indian"
"Korean" "Salad" "Chicken Wings" "Hot Dogs" "Tapas Bars" "Arts & Entertainment"
"Southern" "Tapas/Small Plates" "Middle Eastern" "Hawaiian" "Vegan" "Gastropubs" "Dim Sum"
Cluster 4: "Italian" "Pizza"
Cluster 5: "American (New)" "American (Traditional)" "Nightlife" "Bars"
Figure 2.4 Similarity matrix and k-means cluster
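The k-means step behind Figure 2.4 uses `stats::kmeans` with k = 5 on the TF-IDF matrix. A sketch with k = 2 on invented two-dimensional cuisine vectors (k-means is randomly initialized, so a seed and multiple starts are set for stability):

```r
# k-means clustering sketch; the report uses k = 5 on the TF-IDF matrix.
set.seed(42)
x <- rbind(Italian = c(5, 0), Pizza = c(6, 1),
           Japanese = c(0, 5), `Sushi Bars` = c(1, 6))
km <- kmeans(x, centers = 2, nstart = 10)

km$cluster["Italian"] == km$cluster["Pizza"]  # TRUE: they share a cluster
```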
2.8 Conclusions
In this chapter, we investigate the development of a cuisine map based on the categories and
customer reviews in the Yelp data. Both TF and TF-IDF are used to build the document term
matrix. Similarity matrices are obtained based on cosine distance and plotted in Figure 2.1 to
Figure 2.4. It is found that TF-IDF enhances the similarity between cuisines that are indeed
similar and weakens the similarity between cuisines that have less in common. We also carry
out hierarchical clustering and k-means clustering to make the cuisine map easier to use.
3 CHAPTER 3
SUMMARY OF TASK 3: DISH RECOGNITION
In this chapter, we investigate the mining of Chinese dish names from the Yelp review data on
Chinese restaurants. We subset the reviews on Chinese restaurants from the original data set
and identify dish names in Chinese cuisine using TopMine [1] and SegPhrase [2].
3.1 Task 3.1: Manual Tagging
First, we revise the label file for Chinese cuisine manually.
We remove false-positive phrases that are not dish names.
We change false-negative dish name phrases to positive labels.
Second, we add more annotated phrases in the same format by searching for menus from Chinese
restaurants.
3.2 Task 3.2: Mining Additional Dish Names
3.2.1 Corpus Preparation
We import the data into R.
We read “yelp_academic_dataset_business.json” into variable “BUSINESS” and
“yelp_academic_dataset_review.json” into variable “REVIEW” using “jsonlite” package.
We select all the restaurants from “BUSINESS” by finding the entries that have
“Restaurants” in column “categories”. We denote this selected data set as
“RESTAURANTS”.
We merge “RESTAURANTS” and “REVIEW” into “RESTAURANTS_REVIEW” by
“business_id” column. This also eliminates the entries in “REVIEW” that are not for
restaurants.
We subset “RESTAURANTS_REVIEW” by selecting the entries with “Chinese” in
column “categories” and save it into “RESTAURANTS_REVIEW_CHINESE”
We subset the “text” column in “RESTAURANTS_REVIEW_CHINESE” and save it
into .txt file with each line being one review.
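The last two steps can be sketched in base R. One detail matters for the one-review-per-line format: newlines inside a review text must be flattened first, or a single review would span several lines of the .txt file. The tiny data frame below is a synthetic stand-in:

```r
# Sketch: subset Chinese reviews and write one review per line.
RESTAURANTS_REVIEW <- data.frame(
  categories = c("Restaurants; Chinese", "Restaurants; Pizza"),
  text = c("Great dim sum.\nWill come back.", "Good crust."),
  stringsAsFactors = FALSE
)
chinese <- RESTAURANTS_REVIEW[grepl("Chinese", RESTAURANTS_REVIEW$categories), ]

lines <- gsub("[\r\n]+", " ", chinese$text)  # flatten internal newlines
out <- tempfile(fileext = ".txt")
writeLines(lines, out)
length(readLines(out))  # 1: one Chinese review, one line in the file
```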
3.3 Dish Name Identification Using TopMine
3.3.1 Parameters
We keep the default values of the parameters, except that we set "maxPattern" to 6, since we
believe a dish name is likely to contain 1 to 6 words.
3.3.2 Opinion about the Result
We run the TopMine package and obtain more than 10k phrases. Some of the most frequent ones
appear to be dish names, such as
“dim sum” 2849
“fried rice” 2511
“egg rolls” 1777
“orange chicken” 1599
These are indeed typical Chinese dishes found in the US. If you have never been to a Chinese
restaurant, you may want to go and try these dishes, because apparently they are very popular in
the US. My personal favorite is "dim sum", which originally comes from Canton and Hong Kong.
However, there are still many frequent phrases that are not dish names, for example "Chinese
food" (with frequency 2853) and "Chinese restaurant" (with frequency 2108). This is because
the phrase mining algorithm, TopMine, is not specifically designed for dish name mining.
These non-dish phrases really are used frequently in the reviews and are therefore genuine
frequent phrases, which means the algorithm itself works very well.
3.3.3 Improvement
Given the limitations of the tools we have, we re-prepare our corpus so that frequent phrases
other than dish names are removed beforehand. Therefore, we remove the word "Chinese" and
the following words from the original corpus to improve the result of phrase mining:
“good”, “food”, “service”, “great”, “one”, “like”, “love”, “pretty”, “place”, “menu”, “ordered”,
“order”, “best”, “try”, “nice”, “well”, “didnt”, “dont”, “ive”, “eat”, “back”, “also”, “got”,
“always”, “come”, “people”, “get”, “will”, “can”, “really”, “just”, “time”, “little”, “us”, “meal”,
“diner”, “first”, “table”, “definitely”.
We remove these words because we found they appear quite often in the corpus, as shown in
the word cloud in Figure 1.1, but are very unlikely to appear in a Chinese dish name. After this
procedure, the results are much better: most of the top frequent phrases are now dish names.
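The word-removal step can be sketched in base R (the report's pipeline would do this with tm's `removeWords`; the short review string and the reduced word list below are illustrative only):

```r
# Base-R sketch of the corpus re-preparation: strip the listed generic
# words before re-running TopMine.
drop_words <- c("chinese", "good", "food", "place", "really")
review <- "really good chinese food the orange chicken at this place"

pattern <- paste0("\\b(", paste(drop_words, collapse = "|"), ")\\b")
cleaned <- gsub("\\s+", " ", trimws(gsub(pattern, "", review)))
cleaned  # "the orange chicken at this"
```

Note the `\b` word boundaries: without them, removing "chinese" naively could also mangle longer words that contain it.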
3.4 Dish Name Identification Using SegPhrase
3.4.1 Parameters
We prepare a label file and set the algorithm parameter AUTO_LABEL=1. The first part of the
label file is the label we revised manually in Task 3.1. The second part comes from the result of
TopMine: we select the 2,000 most frequent phrases in the TopMine result and replace the
frequency with label 1. We then manually revise the labels by removing false positives.
3.4.2 Opinion about the Result
Using the label file and the algorithm package, we obtain a very good dish name list. Below are
the top phrases in the list.
“orange chicken” “hot and sour soup” “cashew chicken” “sea bass” “hot pot”
“kung pao chicken” “brown rice” “shaved ice” “white rice” “char siu” “chow mein”
“won ton” “steamed rice” “fried rice” “bok choy” “sweet and sour pork”
4 CHAPTER 4
SUMMARY OF TASKS 4 & 5: POPULAR DISHES AND
RESTAURANT RECOMMENDATION
In this chapter, we detect popular dishes of a specific cuisine (Chinese cuisine) and popular
restaurants for specific dishes ("orange chicken" and "fried rice"). Popularity is measured by
the frequency with which a dish appears in reviews. We also carry out some sentiment analysis
based on the stars each dish or restaurant receives in reviews.
4.1 Data Preparation
4.1.1 Corpus
We prepare the corpus as follows.
1. Read “yelp_academic_dataset_business.json” into variable “BUSINESS” and
“yelp_academic_dataset_review.json” into variable “REVIEW” using the “jsonlite” package.
2. Select all restaurants from “BUSINESS” by finding the entries that have “Restaurants” in
the “categories” column. We denote this selected data set “RESTAURANTS”.
3. Merge “RESTAURANTS” and “REVIEW” into “RESTAURANTS_REVIEW” by the
“business_id” column. This also eliminates the entries in “REVIEW” that are not for
restaurants.
4. Subset “RESTAURANTS_REVIEW” by selecting the entries with “Chinese” in the
“categories” column and save the result as “CHINESE_REVIEW”.
5. Take the “text” column of “CHINESE_REVIEW” as the corpus.
6. Convert the corpus to ASCII encoding.
7. Strip extra whitespace from the corpus.
8. Remove punctuation marks from the corpus.
9. Remove numbers from the corpus.
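As an illustration, the preparation steps above could be sketched in Python as follows (the report itself uses the “jsonlite”, “dplyr”, and “tm” packages in R). The function names are our own, and the field handling assumes the Yelp academic dataset layout described above.

```python
import json
import re
import string

def load_jsonl(path):
    """Read one JSON object per line, as in the Yelp academic dataset files."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def clean(text):
    """ASCII-only, no punctuation or digits, collapsed whitespace."""
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.translate(str.maketrans("", "", string.punctuation + string.digits))
    return re.sub(r"\s+", " ", text).strip()

def chinese_review_corpus(business_path, review_path):
    business = load_jsonl(business_path)
    reviews = load_jsonl(review_path)
    # "categories" may be a list or a comma-separated string depending on the
    # dataset version; the membership test below covers both cases.
    chinese_ids = {b["business_id"] for b in business
                   if "Restaurants" in b.get("categories", [])
                   and "Chinese" in b.get("categories", [])}
    return [clean(r["text"]) for r in reviews if r["business_id"] in chinese_ids]
```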
4.1.2 Dish List
We used the top 500 dish names from the dish mining results obtained in Task 3. We read the txt
file (each line is a dish name) into R using the function “readLines”.
4.2 Tools and Packages
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
4.2.1 Attached Base Packages:
stats, graphics, grDevices, utils, datasets, methods, base
4.2.2 Other Attached Packages:
dplyr_0.4.3, tm_0.6-2, NLP_0.1-8, ggplot2_1.0.1, jsonlite_0.9.16, qdap_2.2.2,
RColorBrewer_1.1-2, qdapTools_1.3.1, qdapRegex_0.5.0, qdapDictionaries_1.0.6
4.3 Task 4: Popular Dishes
In this section, we detect the top 100 most popular dishes in Chinese cuisine.
4.3.1 Popularity Analysis
The popularity of a dish is defined as the number of customer reviews in which the dish appears.
If a dish name appears more than once in the same review, it is counted only once.
We obtain a data frame with m rows and n + 3 columns, where m is the number of reviews
and n is the number of dishes. Each row represents an individual review. In each row, the first
n columns are indicators for the n dishes: if a dish name appears in the review, the value in the
corresponding column is 1, otherwise it is 0. The (n + 1)st column is the stars
corresponding to the review, the (n + 2)nd column is the name of the corresponding restaurant,
and the (n + 3)rd column is the overall star rating of the restaurant.
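A minimal Python sketch of the review-by-dish data frame described above (the report builds it in R; names and sample values here are illustrative):

```python
# One row per review: n dish indicators followed by the 3 extra columns.
def build_frame(reviews, dishes):
    """reviews: dicts with 'text', 'stars', 'name', 'overall_stars'."""
    rows = []
    for r in reviews:
        text = r["text"].lower()
        # presence (0/1), so repeated mentions in one review count only once
        indicators = [1 if dish in text else 0 for dish in dishes]
        rows.append(indicators + [r["stars"], r["name"], r["overall_stars"]])
    return rows

dishes = ["orange chicken", "fried rice"]
reviews = [{"text": "Best orange chicken ever, orange chicken again!",
            "stars": 5, "name": "Panda Express", "overall_stars": 3.5}]
rows = build_frame(reviews, dishes)
# rows[0] -> [1, 0, 5, "Panda Express", 3.5]
```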
4.3.2 Sentiment Analysis
A frequently ordered (mentioned) dish is not necessarily tasty. We use the stars given by
reviewers in their reviews as an indicator of the tastiness of the dishes mentioned in those
reviews. For example, if a reviewer mentions “fried rice” and “orange chicken” in a review and
gives five stars for the experience, then “fried rice” and “orange chicken” each earn a tastiness
of 5 from that review. We sum the stars each dish earns across all reviews as its overall
tastiness, and then normalize the overall tastiness into the range 1 to 5.
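The tastiness computation can be sketched as below. Min-max rescaling into [1, 5] is an assumption, since the report does not state the exact normalization; the star totals are toy values.

```python
# Sum of review stars per dish, rescaled into [1, 5] (assumed min-max scheme).
def tastiness(star_totals):
    """star_totals: dish -> summed stars. Returns dish -> score in [1, 5]."""
    lo, hi = min(star_totals.values()), max(star_totals.values())
    span = (hi - lo) or 1  # avoid division by zero if all totals are equal
    return {d: 1 + 4 * (t - lo) / span for d, t in star_totals.items()}

scores = tastiness({"orange chicken": 900, "fried rice": 500, "egg roll": 100})
# scores -> {"orange chicken": 5.0, "fried rice": 3.0, "egg roll": 1.0}
```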
4.3.3 Illustration
The results are presented in Figure 4.1, where the x-axis is the top 100 popular dish names and
y-axis is the corresponding frequency-based popularity. We used color to show the tastiness of
the dishes. There is a strong correlation: tastier dishes tend to be ordered (mentioned) more
often, which makes sense in practice.
Figure 4.1 Illustration for popular dish names
4.4 Task 5: Popular Restaurants
In this part, we mine the popular restaurants for a specific dish. Without loss of generality, we
use “orange chicken” and “fried rice” as two examples because they are two of the most popular
dishes in Chinese cuisine, as shown in Figure 4.1. Note that other dish names can be used for
this task, since the method and code we use to obtain the results in this section are general.
4.4.1 Popularity Analysis
We group the data frame obtained in task 4 by restaurant and calculate the total count of dishes
for each restaurant. We use the total count as popularity of the restaurant with respect to a dish.
For example, for restaurant “Panda Express”, the total count of “orange chicken” is 145 while
the total count of “fried rice” is 87. For another example, for restaurant “Chino Bandido”, the
total count of “orange chicken” is 36 while the total count of “fried rice” is 406. As can be
seen, “Panda Express” is more popular for its “orange chicken” whereas “Chino Bandido” is more
popular for its “fried rice”.
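The grouping step can be sketched as follows; the rows and counts are toy values, not the figures quoted above.

```python
from collections import defaultdict

# Group review rows by restaurant and sum one dish's indicator column.
def restaurant_popularity(rows, dish_index):
    """rows: review-by-dish rows where row[-2] is the restaurant name.
    Returns restaurant -> total number of reviews mentioning the dish."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[-2]] += row[dish_index]
    return dict(totals)

rows = [[1, 0, 5, "A", 4.0],   # dish 0 mentioned at restaurant A
        [1, 1, 4, "A", 4.0],
        [0, 1, 3, "B", 3.0]]
pop = restaurant_popularity(rows, 0)
# pop -> {"A": 2, "B": 0}
```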
4.4.2 Sentiment Analysis
A restaurant may serve a lot of “orange chicken” or “fried rice”, but that could be due to the
local population or its low prices. We want to know whether customers are happy after having
its “orange chicken” or “fried rice”, that is, how tasty the “orange chicken” and “fried
rice” are. We use the overall star rating of the restaurant as the measurement.
4.4.3 Illustration
The results are presented in Figure 4.2 and Figure 4.3. The x-axis represents the top 100
restaurants that serve the dishes “orange chicken” and “fried rice” and the y-axis represents the
popularity of the corresponding restaurants. We used color to show the tastiness of the dishes.
Figure 4.2 Illustration for popular restaurants for orange chicken
Figure 4.3 Illustration for popular restaurants for fried rice
4.5 Conclusions
We believe that the figures provided above can be a good guide for people who want to try
Chinese food. They can find the most popular dishes in Figure 4.1 and find which restaurants
serve the best “orange chicken” and “fried rice” in Figure 4.2 and Figure 4.3.
5 CHAPTER 5
SUMMARY OF TASK 6: HYGIENE PREDICTION
In this chapter, we predict whether a set of restaurants will pass the public health inspection tests
given the corresponding Yelp text reviews along with some additional information such as the
locations and cuisines offered in these restaurants.
Two text representation techniques are used: Unigram and Topic Model.
Two learning algorithms are used: Logistic Regression and Random Forest.
Additional features are used such as “Categories”, “Stars”, and “Zipcode”.
5.1 Tools Used
R version 3.1.3
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)
5.1.1 Packages
topicmodels_0.2-2 qdap_2.2.2 qdapTools_1.3.1 qdapRegex_0.5.0
qdapDictionaries_1.0.6 tm_0.6-2 NLP_0.1-8 quanteda_0.7.2-1
randomForest_4.6-10 caret_6.0-52
5.2 Text Preprocessing
We preprocess the review text as follows. Package “tm” is used.
Convert the text into ASCII encoding.
Strip extra whitespace from the text.
Remove punctuation marks from the text.
Remove numbers from the text.
5.3 Training Method 1: Logistic Regression
5.3.1 Text Representation Techniques
5.3.1.1 Unigram
First, we obtain word frequencies from the reviews in the training data and select the top N
frequent words. Here we set N = 301 and N = 1451.
Second, we use the counts of frequent words in the review of each restaurant as its
corresponding text-based features.
Package “qdap” is used.
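A rough Python equivalent of this unigram representation (the report uses the “qdap” package in R; the texts and N below are toy values):

```python
from collections import Counter

def top_words(train_texts, n):
    """Return the n most frequent words across the training reviews."""
    freq = Counter(w for t in train_texts for w in t.lower().split())
    return [w for w, _ in freq.most_common(n)]

def unigram_features(text, vocab):
    """Counts of each vocabulary word in one restaurant's review text."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

vocab = top_words(["good food good service", "bad food food"], n=2)
# vocab -> ["food", "good"]
feats = unigram_features("food was good good", vocab)
# feats -> [1, 2]
```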
5.3.2 Additional Features Used
Stars, Zipcode, and Categories
5.3.3 Learning Algorithm
Logistic Regression
5.3.4 Results Analysis
The results are presented in TABLE 5.1 where “Score” is the score given by Coursera grader.
TABLE 5.1 Prediction Score obtained by Logistic Regression
           # of unigram features   Additional Features              Score
Scheme 1   301                     Stars and Zipcode                0.55778821485
Scheme 2   301                     Stars, Zipcode, and Categories   0.53725827475
Scheme 3   1451                    Stars and Zipcode                0.509439655385
From TABLE 5.1 we can observe the following.
(1) The score is lower when additional feature “Categories” is used. This is probably because
some categories in testing data set do not appear in training data set.
(2) When more unigram features (frequent words) are used, the score is lower. This is probably
because of overfitting.
5.4 Training Method 2: Random Forest
5.4.1 Text Representation Techniques
5.4.1.1 Unigram
First, we obtain word frequencies from the reviews in the training data and select the top N
frequent words. Here we set N = 841 and N = 1451.
Second, we use the counts of frequent words in the review of each restaurant as its
corresponding text-based features.
Package “qdap” is used.
5.4.1.2 Topic Model
First, we mine 10, 50, and 100 topics from the training data.
Second, we count the words that belong to the topics in a restaurant’s review and use the
counts as text-based features.
Package “topicmodels” is used.
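Step two can be sketched as below. The topic word lists here are invented for illustration; the report mines them with the “topicmodels” package in R.

```python
# Given mined topics (each as its list of top words), count how many topic
# words occur in a restaurant's review text: one count per topic as a feature.
def topic_features(text, topics):
    """topics: list of word lists, one per topic. Returns one count per topic."""
    words = text.lower().split()
    return [sum(w in topic_words for w in words) for topic_words in topics]

topics = [["dirty", "smell", "sticky"], ["fresh", "clean", "friendly"]]
feats = topic_features("The floor was sticky and the smell was bad", topics)
# feats -> [2, 0]
```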
5.4.2 Additional Features Used
Stars, Zipcode, and Categories
5.4.3 Learning Algorithm
Random Forest
Packages “caret” and “randomForest” are used.
5.4.4 Results Analysis
We use two text representation techniques and different numbers of features. The results are
shown in TABLE 5.2 and TABLE 5.3, respectively.
In TABLE 5.2, we observe the following.
(1) Results are improved by using the additional feature “Categories”.
(2) More unigram features improve the result. It seems that a large number of unigram features
does not cause overfitting in these two cases. More tests were not carried out because more
features would result in unbearable training time.
TABLE 5.2 Prediction Score obtained by Random Forest & Unigram
           # of unigram features   Additional Features              Score
Scheme 1   841                     Stars, Zipcode, and Categories   0.56127128414
Scheme 3   1451                    Stars, Zipcode, and Categories   0.561925058925
Scheme 2   1451                    Stars and Zipcode                0.559788032673
In TABLE 5.3, we observe that more topics do not necessarily yield better results; overfitting
occurs when the number of topics becomes large.
TABLE 5.3 Prediction Score obtained by Random Forest & Topic Model
           # of topics   Additional Features              Score
Scheme 1   10            Stars, Zipcode, and Categories   0.520164615311
Scheme 2   50            Stars, Zipcode, and Categories   0.552275390162
Scheme 3   100           Stars, Zipcode, and Categories   0.540265423658
5.5 Method Comparison
From the results we can tell that logistic regression tends to overfit with small numbers of
features, whereas random forest is less prone to overfitting. Overall, random forest provides
slightly better results than logistic regression, but it takes much more computing time.
Comparing the results in TABLE 5.2 and TABLE 5.3, we observe that the topic model on
average performs similarly to the unigram representation, although its best result does not
outperform the unigram's. The reason could be as follows. On one hand, the topic model
reduces the dimension of the features and enhances the important ones; on the other hand,
some information useful for prediction may be lost during the dimension reduction.
6 CHAPTER 6
USEFULNESS OF RESULTS
In this chapter, we introduce the useful results obtained through the data mining capstone.
6.1 Cuisine Maps
In Chapter 2, we build several cuisine maps which show the similarity between 50 different
cuisines.
6.1.1 Usefulness for Customers
These maps can be very useful for customers who want to explore new cuisines. For instance,
according to the cuisine map in Figure 2.2, “Mediterranean”, “Greek”, and “Middle Eastern” are
three very similar cuisines. People who like one of them may want to try the other two if they use
the cuisine map.
6.1.2 Usefulness for Restaurant Owners
These maps can also benefit restaurant owners who want to extend their businesses. They can
choose to open their new restaurants next to or far away from certain restaurants. For example,
an owner of a cafe may want to open a new cafe next to a restaurant that specifically provides
breakfast and brunch since they are very similar according to the cuisine map and people will
love to grab a cup of coffee before or after breakfast.
6.2 Dish Recognizer
We recognized dishes in task 3, as introduced in Chapter 3. This is useful for businesspeople
who want to open restaurants: it is very helpful to know what dishes are served in a certain
cuisine before opening a restaurant of that cuisine.
6.3 Popular Dishes Detection
We detect top 100 popular Chinese dishes with corresponding tastiness in task 4 as introduced in
Chapter 4. This is extremely useful for people who like Chinese food or want to try it: they can
find the most popular and tasty dishes and avoid ordering dishes that are not well received.
In addition, this result is also very useful to owners of Chinese restaurants and businessmen who
want to start Chinese restaurants. For them, providing more popular food is more likely to bring
more customers and hence more profit.
6.4 Restaurant Recommendation
We recommend top 100 restaurants that serve orange chicken and fried rice in task 5 as
presented in Chapter 4. This is also quite useful for customers who want to try these two special
dishes.
6.5 Hygiene Prediction
This result helps customers select clean restaurants and avoid restaurants that are not good at
maintaining hygiene.
7 CHAPTER 7
NOVELTY OF EXPLORATION
7.1 Hierarchical Clustering in Cuisine Map Development
When we build the cuisine map with clustering, hierarchical clustering is used, as shown
in Figure 2.3. The hierarchical relations between cuisines are shown together with the similarity
matrix. This helps users find clusters based on their own needs: instead of fixing the
number of clusters beforehand, we allow users to choose how many clusters they want, or
simply to find cuisines that are connected by the hierarchical links.
7.2 TopMine Output Used as the Input for SegPhrase in Dish Recognition
In recognizing dish names for Chinese cuisine, we use the output of TopMine as the input for
SegPhrase so that SegPhrase has a more comprehensive labeled dish list. The first part of the
labeled list is the one we revised manually in task 3.1; the second part comes from the result
of TopMine. This method turns out to be very effective, resulting in a 12 out of 10 score
according to the grader.
7.3 Top Frequent Unigram Terms and Topic Model are Used in Hygiene Prediction
In training hygiene prediction models, we use two text representation techniques: unigram and
topic model. For unigram, we detect the top N popular terms first instead of using all the terms
in corpus, and then use the counts of the N words in the reviews as features. For topic model,
we first mine topics from customer reviews, and then use the word counts in the topics as
features.
The two methods are very effective according to the grader: an F1 score of 0.56 is obtained
using the top term counts as features, and an F1 score of 0.55 using the topic model.
8 CHAPTER 8
CONTRIBUTION OF NEW KNOWLEDGE
8.1 Some Advantages of Random Forest over Logistic Regression
In carrying out task 6, we train both logistic regression model and random forest with the same
number of features and compare the results. Here are some advantages of random forest over
logistic regression found during the experiment.
8.1.1 Random Forest is Less Prone to Overfitting than Logistic Regression
We found that random forest provides better results as more features are included, without
showing overfitting, though we did not carry out experiments with more than 1500 features.
In contrast, logistic regression shows signs of overfitting when fewer than 1500 features
are used. This shows that random forest is less prone to overfitting than logistic regression.
8.1.2 Logistic Regression is not Good at Handling Missing Feature Value
When we use logistic regression as the prediction algorithm and restaurant categories as a
feature, warnings occur because some restaurant categories appear in the testing data but not
in the training data, which worsens the prediction results. Random forest, on the other hand,
seems able to cope with this situation and even provides better predictions when restaurant
categories are used as a feature.
9 CHAPTER 9
IMPROVEMENT TO BE DONE
Several things can be done to improve this project. First, web-based tools can be developed for
interactive illustration of the results. Second, an updating algorithm can be developed to
refresh the results efficiently when more data become available, instead of carrying out the
data mining from scratch. Third, a location-based restaurant and dish recommendation system
can be developed, which would be more helpful for customers in specific places.
10 REFERENCES
[1] El-Kishky, Ahmed, et al. "Scalable topical phrase mining from text corpora." Proceedings of
the VLDB Endowment, 8.3 (2014): 305-316.
[2] Jialu Liu*, Jingbo Shang*, Chi Wang, Xiang Ren and Jiawei Han, "Mining Quality Phrases
from Massive Text Corpora,” Proc. of 2015 ACM SIGMOD Int. Conf. on Management of Data
(SIGMOD'15), Melbourne, Australia, May 2015. (* equally contributed)