TRANSCRIPT
DATA MINING CAPSTONE
FINAL REPORT
I
ABSTRACT
This report summarizes the tasks accomplished for the Data Mining Capstone. The tasks are
based on Yelp review data, mainly for restaurants. Six tasks are accomplished.
The first task is to visualize customer review text for all restaurants. A word cloud of frequent
words is plotted, and topics are detected from the review text for all restaurants. In addition, a
topic comparison for two Chinese restaurants is provided and visualized.
The second task is to build a cuisine map based on the similarity between cuisines computed
from customer review text. The top fifty cuisines are found first to be included in the cuisine
map. Term Frequency (TF), Term Frequency-Inverse Document Frequency (TF-IDF), and
clustering methods, i.e. hierarchical clustering and k-means clustering, are used to build the
similarity matrix, heat map, and cuisine map.
The third task is to recognize dish names from the customer review text of a certain cuisine.
Chinese cuisine is chosen for this task. A labeled dish name list for Chinese cuisine is revised
manually. Then two algorithms, i.e. TopMine and SegPhrase, are used to mine a comprehensive
Chinese dish list based on the review text for Chinese restaurants and the labeled dish name list.
The fourth and fifth tasks are to detect popular dishes and recommend good restaurants for
certain dishes. Again, Chinese cuisine is chosen for these tasks. 700 dish names from Task 3 are
used as a pool of Chinese dishes. The top 100 most popular dishes and their corresponding
tastiness are found by mining the customer review text and review scores, i.e. stars. We also
recommend the top 100 most popular restaurants for two popular Chinese dishes, i.e. orange
chicken and fried rice.
The sixth task is to predict whether a set of restaurants will pass the public health inspection tests
given the corresponding Yelp text reviews along with some additional information such as the
locations and cuisines offered in these restaurants.
In addition, in this report we highlight the following: (1) the most useful data mining results
produced through these specific data mining tasks and potential people who might benefit from
such results; (2) the novel ideas/methods explored to carry out the tasks; (3) new knowledge
people can learn from the project activities, particularly through the experiments.
The report is organized as follows. Chapters 1 to 5 introduce the six tasks: Chapters 1 to 3
address Tasks 1 to 3, Chapter 4 addresses Tasks 4 and 5, and Chapter 5 addresses Task 6.
Chapter 6 introduces the useful results. Chapter 7 presents the novel methods used in carrying
out the tasks. Chapter 8 summarizes the contribution of new knowledge discovered throughout
the capstone.
TABLE OF CONTENTS
1 CHAPTER 1 SUMMARY OF TASK 1: EXPLORATION OF DATA SET
  1.1 Tools Used
  1.2 Major Packages
  1.3 Data Import
  1.4 Data Preprocessing
  1.5 Topic Model Fitting
  1.6 Comparison of Topics for Two Chinese Restaurants
  1.7 Discussion on the Topics for the Two Chinese Restaurants
    1.7.1 Similarity
    1.7.2 Difference
2 CHAPTER 2 SUMMARY OF TASK 2: CUISINE CLUSTERING AND MAP CONSTRUCTION
  2.1 Tools Used
  2.2 Major Packages
  2.3 Data Import
  2.4 Data Preprocessing
  2.5 Similarity Matrix without IDF
  2.6 Similarity Matrix with IDF
  2.7 Similarity Matrix with Clustering
    2.7.1 Hierarchical Clustering
    2.7.2 k-Means Clustering
  2.8 Conclusions
3 CHAPTER 3 SUMMARY OF TASK 3: DISH RECOGNITION
  3.1 Task 3.1: Manual Tagging
  3.2 Task 3.2: Mining Additional Dish Names
    3.2.1 Corpus Preparation
  3.3 Dish Name Identification Using TopMine
    3.3.1 Parameters
    3.3.2 Opinion about the Result
    3.3.3 Improvement
  3.4 Dish Name Identification Using SegPhrase
    3.4.1 Parameters
    3.4.2 Opinion about the Result
4 CHAPTER 4 SUMMARY OF TASKS 4 & 5: POPULAR DISHES AND RESTAURANT RECOMMENDATION
  4.1 Data Preparation
    4.1.1 Corpus
    4.1.2 Dish List
  4.2 Tools and Packages
    4.2.1 Attached Base Packages
    4.2.2 Other Attached Packages
  4.3 Task 4: Popular Dishes
    4.3.1 Popularity Analysis
    4.3.2 Sentiment Analysis
    4.3.3 Illustration
  4.4 Task 5: Popular Restaurants
    4.4.1 Popularity Analysis
    4.4.2 Sentiment Analysis
    4.4.3 Illustration
  4.5 Conclusions
5 CHAPTER 5 SUMMARY OF TASK 6: HYGIENE PREDICTION
  5.1 Tools Used
    5.1.1 Packages
  5.2 Text Preprocessing
  5.3 Training Method 1: Logistic Regression
    5.3.1 Text Representation Techniques
      5.3.1.1 Unigram
    5.3.2 Additional Features Used
    5.3.3 Learning Algorithm
    5.3.4 Results Analysis
  5.4 Training Method 2: Random Forest
    5.4.1 Text Representation Techniques
      5.4.1.1 Unigram
      5.4.1.2 Topic Model
    5.4.2 Additional Features Used
    5.4.3 Learning Algorithm
    5.4.4 Results Analysis
  5.5 Method Comparison
6 CHAPTER 6 USEFULNESS OF RESULTS
  6.1 Cuisine Maps
    6.1.1 Usefulness for Customers
    6.1.2 Usefulness for Restaurant Owners
  6.2 Dish Recognizer
  6.3 Popular Dishes Detection
  6.4 Restaurant Recommendation
  6.5 Hygiene Prediction
7 CHAPTER 7 NOVELTY OF EXPLORATION
  7.1 Hierarchical Clustering in Cuisine Map Development
  7.2 TopMine Output Used as the Input for SegPhrase in Dish Recognition
  7.3 Top Frequent Unigram Terms and Topic Model are Used in Hygiene Prediction
8 CHAPTER 8 CONTRIBUTION OF NEW KNOWLEDGE
  8.1 Some Advantages of Random Forest over Logistic Regression
    8.1.1 Random Forest is Less Prone to Overfitting than Logistic Regression
    8.1.2 Logistic Regression is not Good at Handling Missing Feature Values
9 CHAPTER 9 IMPROVEMENTS TO BE DONE
10 REFERENCES
List of Figures
Figure 1.1 Word cloud
Figure 1.2 The topics of the sampled Restaurant
Figure 1.3 The topics of the first Chinese Restaurant CR1
Figure 1.4 The topics of the second Chinese Restaurant CR2
Figure 2.1 Similarity matrix without IDF
Figure 2.2 Similarity matrix using IDF
Figure 2.3 Similarity matrix and hierarchical cluster
Figure 2.4 Similarity matrix and k-means cluster
Figure 4.1 Illustration for popular dish names
Figure 4.2 Illustration for popular restaurants for orange chicken
Figure 4.3 Illustration for popular restaurants for fried rice
List of Tables
TABLE 2.1 Cluster list
TABLE 5.1 Prediction Score obtained by Logistic Regression
TABLE 5.2 Prediction Score obtained by Random Forest & Unigram
TABLE 5.3 Prediction Score obtained by Random Forest & Topic Model
1 CHAPTER 1
SUMMARY OF TASK 1: EXPLORATION OF DATA SET
In this chapter, we explore the Yelp data set. In particular, we mine customers' reviews of
restaurants in order to find topics. We mine the topics with a Latent Dirichlet Allocation
(LDA) model and plot the topics in a circular tree for visualization. In addition, we mine and
compare the topics of two Chinese restaurants.
1.1 Tools Used
R version 3.1.3
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)
1.2 Major Packages
jsonlite_0.9.14
tm_0.6-2
topicmodels_0.2-2
igraph_1.0.1
1.3 Data Import
First, we read “yelp_academic_dataset_business.json” into variable “BUSINESS” and
“yelp_academic_dataset_review.json” into variable “REVIEW” using “jsonlite” package.
Second, we select all the restaurants from “BUSINESS” by finding the entries that have
“Restaurants” in column “categories”. We denote this selected data set as “RESTAURANTS”.
Third, we merge “RESTAURANTS” and “REVIEW” into “RESTAURANTS_REVIEW” by
“business_id” column. This also eliminates the entries in “REVIEW” that are not for restaurants.
Then, we randomly select 10,000 samples from "RESTAURANTS_REVIEW" and record them
in the data set "restaurants_review". A relatively small data set is sampled due to limited time
and computing capacity.
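The filter-and-merge step above can be sketched in base R. Since the real JSON files are large and read with the "jsonlite" package, the sketch below substitutes a tiny synthetic BUSINESS and REVIEW table (all values here are made up for illustration):

```r
# Illustrative sketch of the filter-and-merge step, using tiny synthetic
# data frames in place of the real tables read from the Yelp JSON files.
BUSINESS <- data.frame(
  business_id = c("b1", "b2", "b3"),
  categories  = c("Restaurants; Chinese", "Shopping", "Restaurants; Pizza"),
  stringsAsFactors = FALSE
)
REVIEW <- data.frame(
  business_id = c("b1", "b2", "b3"),
  text  = c("great dim sum", "nice shoes", "good pizza"),
  stars = c(5, 4, 3),
  stringsAsFactors = FALSE
)

# Keep only businesses whose "categories" field mentions "Restaurants".
RESTAURANTS <- BUSINESS[grepl("Restaurants", BUSINESS$categories), ]

# Merging by "business_id" also drops reviews of non-restaurants.
RESTAURANTS_REVIEW <- merge(RESTAURANTS, REVIEW, by = "business_id")
nrow(RESTAURANTS_REVIEW)  # 2: the review of "b2" is eliminated
```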
1.4 Data Preprocessing
We then convert the data into a corpus by using the "tm" package:
We build a corpus based on the "text" column of "restaurants_review".
We convert all the words into lower case.
We remove anything other than English letters or spaces.
We remove stop words.
We remove extra white space.
We make the text fit the paper width, i.e. each line has at most 60 characters.
We heuristically complete stemmed words.
We construct a document-term matrix.
We plot the word cloud to see what the major words are.
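The cleaning steps above can be sketched in base R (the report itself uses the "tm" package's tm_map transformations; the two reviews and the stop list below are invented for illustration):

```r
# A base-R sketch of the preprocessing steps; the report uses tm_map with
# content_transformer(tolower), removeWords, stripWhitespace, etc.
reviews <- c("The food was GREAT!!  Really great food.",
             "Service was slow, but the pizza is good.")

clean <- tolower(reviews)                    # lower case
clean <- gsub("[^a-z ]", " ", clean)         # keep only letters and spaces
stops <- c("the", "was", "but", "is", "a")   # tiny illustrative stop list
clean <- gsub(paste0("\\b(", paste(stops, collapse = "|"), ")\\b"), " ", clean)
clean <- gsub("\\s+", " ", trimws(clean))    # strip extra white space

# A minimal document-term matrix: rows are reviews, columns are terms.
terms <- sort(unique(unlist(strsplit(clean, " "))))
dtm <- t(sapply(strsplit(clean, " "),
                function(doc) sapply(terms, function(t) sum(doc == t))))
dtm[1, "great"]  # 2: "great" appears twice in the first review
```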
Figure 1.1 Word cloud
From the word cloud we can see the common words people use to describe a restaurant, e.g.
"good", "food", "great", "place", "just", "like", and "service". In addition, we can see food
names such as "salad", "pizza", "cheese", and "sushi". We can also see that "chicken" is a
very common food in the US. All in all, the word cloud reveals a lot of information that makes
sense.
1.5 Topic Model Fitting
We fit an LDA model to the document-term matrix using the "LDA" function in the
"topicmodels" package, setting the number of topics to 10. The plot is shown in Figure 1.2,
where ten words of each topic are presented. From Figure 1.2 we can tell that people mention
"food", "place", "service", "good", and "great" in many topics, which is to be expected. In
Figure 1.2, the words in topic i have i after them. For example, in topic 1, we have "food 1",
"great 1", and so on.
Figure 1.2 The topics of the sampled Restaurant
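The fitting call is roughly `lda <- LDA(dtm, k = 10); terms(lda, 10)`. What Figure 1.2 plots is the ten highest-probability words per topic; that selection step can be sketched in base R on a toy topic-word probability matrix (the real matrix comes from the fitted model, and the probabilities below are invented):

```r
# Toy topic-word probability matrix: 2 topics over 5 terms. In the report
# this comes from an LDA model fitted with the "topicmodels" package.
beta <- rbind(
  topic1 = c(food = 0.4, good = 0.3, pizza = 0.2, sushi = 0.05, fresh = 0.05),
  topic2 = c(food = 0.1, good = 0.2, pizza = 0.05, sushi = 0.4, fresh = 0.25)
)
# Top-2 terms per topic, analogous to the ten words per topic in Figure 1.2.
top_terms <- apply(beta, 1, function(p) names(sort(p, decreasing = TRUE))[1:2])
top_terms[, "topic1"]  # "food" "good"
```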
1.6 Comparison of Topics for Two Chinese Restaurants
We randomly select two Chinese restaurants: CR1 with "business_id"
"-3WVw1TNQbPBzaKCaQQ1AQ" and CR2 with "business_id" "-mz0Zr0Dw6ZASg7_ah1R8A".
We carry out the same procedure as above and obtain the LDA based topic plots as shown in
Figure 1.3 and Figure 1.4.
Figure 1.3 The topics of the first Chinese Restaurant CR1
Figure 1.4 The topics of the second Chinese Restaurant CR2
1.7 Discussion on the Topics for the Two Chinese Restaurants
1.7.1 Similarity
Both topic 2 for CR1 and topic 3 for CR2 contain "good", "dish", "order", "beef", and "place".
This is not surprising because beef is very common in the USA; it is very likely that good-tasting
dishes containing beef are often ordered in both restaurants.
The topics for both restaurants contain "China" and "Chinese", along with other common words
such as "food", "good", "place", "chicken", and "dish".
1.7.2 Difference
The major words of the topics depend heavily on the names and menus of the restaurants. It is
obvious that in the topics of restaurant CR1, people are talking about "chili" and "spiciness",
since the restaurant is called "China Chili" and probably serves a lot of spicy food. In
restaurant CR2, however, "fried", "egg", "roll", and "pork" appear often, because the second
restaurant is called "Sing High" and serves "Barbecued pork slices", "egg roll", and "fried
Won Ton".
2 CHAPTER 2
SUMMARY OF TASK 2: CUISINE CLUSTERING AND MAP
CONSTRUCTION
In this chapter, we mine the data set to construct cuisine maps to visually understand the
landscape of different types of cuisines and their similarities. The cuisine map can help users
understand what cuisines are available and their relations, which allows for the discovery of new
cuisines, thus facilitating exploration of unfamiliar cuisines. The cuisine map is built based on
the categories and customer reviews of restaurants in the Yelp data.
2.1 Tools Used
R version 3.1.3
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)
2.2 Major Packages:
reshape2_1.4.1 plyr_1.8.3 ggplot2_1.0.1 scales_0.2.5 HSAUR_1.3-7
cluster_2.0.3 corrplot_0.73 proxy_0.4-15 tm_0.6-2 NLP_0.1-8
jsonlite_0.9.16
2.3 Data Import
First, we read “yelp_academic_dataset_business.json” into variable “BUSINESS” and
“yelp_academic_dataset_review.json” into variable “REVIEW” using “jsonlite” package.
Second, we select all the restaurants from “BUSINESS” by finding the entries that have
“Restaurants” in column “categories”. We denote this selected data set as “RESTAURANTS”.
Third, we merge “RESTAURANTS” and “REVIEW” into “RESTAURANTS_REVIEW” by
“business_id” column. This also eliminates the entries in “REVIEW” that are not for restaurants.
2.4 Data Preprocessing
First, we search for the most popular cuisines by counting the frequency of cuisine names in
the "categories" column. We pick the top 50 to build the cuisine map:
[1] "American (New)" "American (Traditional)" "Nightlife" "Bars"
[5] "Mexican" "Italian" "Breakfast & Brunch" "Pizza"
[9] "Steakhouses" "Sandwiches" "Burgers" "Sushi Bars"
[13] "Japanese" "Chinese" "Seafood" "Buffets"
[17] "Fast Food" "Thai" "Asian Fusion" "Mediterranean"
[21] "French" "Cafes" "Sports Bars" "Barbeque"
[25] "Pubs" "Coffee & Tea" "Vietnamese" "Delis"
[29] "Vegetarian" "Lounges" "Greek" "Wine Bars"
[33] "Desserts" "Bakeries" "Gluten-Free" "Diners"
[37] "Indian" "Korean" "Salad" "Chicken Wings"
[41] "Hot Dogs" "Tapas Bars" "Arts & Entertainment" "Southern"
[45] "Tapas/Small Plates" "Middle Eastern" "Hawaiian" "Vegan"
[49] "Gastropubs" "Dim Sum"
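The counting step that produces this list can be sketched with base R's `table`; the category vector below is a tiny synthetic stand-in for the real "categories" column:

```r
# Sketch of the cuisine-counting step on synthetic category data; in the
# report the counts come from the "categories" column of RESTAURANTS.
categories <- c("Mexican", "Pizza", "Mexican", "Italian", "Pizza", "Mexican")
freq <- sort(table(categories), decreasing = TRUE)
head(names(freq), 2)  # the two most frequent cuisines: "Mexican" "Pizza"
```

The real pipeline takes `head(names(freq), 50)` instead of 2.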
Then, we eliminate the entries that are not in these 50 categories from
"RESTAURANTS_REVIEW", randomly sample 10,000 entries from
"RESTAURANTS_REVIEW", and record them in the data set "restaurants_review". A
relatively small data set is sampled due to limited time and computing capacity.
We then convert the data into a corpus by using the "tm" package:
We build a corpus based on the “text” column of “restaurants_review”.
We convert all the words into lower case.
We remove anything other than English letters or space.
We remove stop words.
We remove extra white space.
We make the text fit paper width, i.e. each line has at most 60 characters.
2.5 Similarity Matrix without IDF
First, we construct a document term matrix using the corpus we prepared in section 2.4 using
Term Frequency (TF). We do not apply Inverse Document Frequency (IDF) in constructing the
document term matrix.
Second, we calculate the similarity matrix based on the document term matrix by using “1 minus
cosine distance” and plot the similarity matrix in Figure 2.1. As can be seen from Figure 2.1,
the similarity value is between 0 and 1. The similarity of a cuisine to itself is 1 as expected. We
can observe many sets of cuisines that are very similar to each other, which is consistent with
common sense. To name a few:
American (New) , American (Traditional), Night Life, and Bars
Italian and Pizza
Delis and Sandwiches
Fast Food and Burgers
Cafes and Breakfast & Brunch
Japanese and Sushi Bars
Mediterranean, Greek, and Middle Eastern
Vegetarian and Gluten-free
Chinese, Asian Fusion, and Dim Sum
Figure 2.1 Similarity matrix without IDF
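The "1 minus cosine distance" similarity used for Figure 2.1 reduces to the cosine of the angle between two cuisines' term-frequency vectors. A sketch on invented toy TF vectors:

```r
# Cosine similarity between cuisines' term-frequency vectors; the report
# computes this as "1 minus cosine distance" over the document term matrix.
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

italian  <- c(pizza = 10, pasta = 8, sushi = 0)   # toy TF vectors
pizza    <- c(pizza = 12, pasta = 5, sushi = 1)
japanese <- c(pizza = 0,  pasta = 1, sushi = 9)

cosine_sim(italian, italian)  # 1: a cuisine is maximally similar to itself
cosine_sim(italian, pizza) > cosine_sim(italian, japanese)  # TRUE
```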
2.6 Similarity Matrix with IDF
The results presented in the previous section make a lot of sense. The similarity values between
similar cuisines are indeed higher than those between not-so-similar or very different cuisines.
However, the difference is not very significant. Therefore, we use IDF to enhance the contrast.
We prepare another document term matrix using TF-IDF and calculate the similarity matrix
with the same method (cosine distance). The similarity matrix is shown in Figure 2.2.
Figure 2.2 Similarity matrix using IDF
As can be seen from Figure 2.2, the similarity values between cuisines that are actually similar
to each other are significantly higher than the values between cuisines that have less in common.
For example, Dim Sum is a type of Chinese food. Based on Figure 2.1, it appears to have high
similarity to Japanese, Sushi Bars, and Seafood, and its similarity to Chinese is not significantly
higher than its similarity to those cuisines. According to Figure 2.2, however, the similarity of
Dim Sum to Japanese, Sushi Bars, and Seafood is much weaker, and its similarity to Chinese is
enhanced.
For another example, based on Figure 2.1, Greek seems to be very similar to American (New),
American (Traditional), Nightlife, and Bars, while its similarity to Mediterranean and Middle
Eastern does not look very significant. Based on Figure 2.2, the similarity among Greek,
Mediterranean, and Middle Eastern is much easier to see.
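The weighting behind Figure 2.2 can be sketched in base R (the report uses the "tm" package's `weightTfIdf`; the toy two-document matrix below is invented). The key effect is that a term occurring in every document gets IDF = 0, which is exactly what suppresses generic words such as "food":

```r
# Base-R sketch of TF-IDF weighting on a toy document term matrix.
dtm <- rbind(doc1 = c(food = 5, dimsum = 3, sushi = 0),
             doc2 = c(food = 4, dimsum = 0, sushi = 2))

idf   <- log2(nrow(dtm) / colSums(dtm > 0))   # IDF per term
tfidf <- dtm * rep(idf, each = nrow(dtm))      # scale each column by its IDF

tfidf[, "food"]  # 0 0: "food" occurs in all documents, so it is zeroed out
```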
2.7 Similarity Matrix with Clustering
We improved the similarity matrix by using TF-IDF in Section 2.6. However, related cuisines
are sometimes located far away from each other, and the cuisine map is not very handy to use.
For instance, Middle Eastern, Mediterranean, and Greek are far away from each other in the
cuisine maps shown in Figure 2.1 and Figure 2.2, even though they are quite similar; indeed, it
takes a lot of visual effort to find this relationship. Therefore, we carry out hierarchical
clustering and k-means clustering to facilitate the visualization of the relationships between
similar cuisines.
2.7.1 Hierarchical Clustering
We first try hierarchical clustering. A heat map is plotted in Figure 2.3 to show the similarity.
In Figure 2.3, the similarity relationships are very clear, since cuisines that are very similar to
each other are located close together and cuisines that are different are far apart. For example,
Middle Eastern, Mediterranean, and Greek are now in one cluster and next to each other. Many
interesting clusters are formed, such as Japanese and Sushi Bars, and Fast Food and Burgers.
Figure 2.3 Similarity matrix and hierarchical cluster
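The clustering step can be sketched with `stats::hclust` on a toy similarity matrix (the three cuisines and similarity values below are invented; distance is taken as 1 minus similarity, matching the cosine-based similarities used above):

```r
# Hierarchical clustering of cuisines from a toy similarity matrix.
sim <- matrix(c(1.0, 0.8, 0.1,
                0.8, 1.0, 0.2,
                0.1, 0.2, 1.0), nrow = 3,
              dimnames = list(c("Greek", "Mediterranean", "Burgers"),
                              c("Greek", "Mediterranean", "Burgers")))

hc <- hclust(as.dist(1 - sim))   # cluster on distance = 1 - similarity
clusters <- cutree(hc, k = 2)    # cut the dendrogram into 2 clusters

clusters["Greek"] == clusters["Mediterranean"]  # TRUE: they merge first
```

The heat map in Figure 2.3 is then the similarity matrix with rows and columns reordered by the dendrogram, which is what places similar cuisines next to each other.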
2.7.2 k-Means Clustering
We also carry out k-means clustering on our data set using the document term matrix based on
TF-IDF. The results are shown in Figure 2.4. We set k = 5. The five different clusters are
presented using different colors. The clusters are listed in TABLE 2.1. The clustering result
makes sense but is not as good as the result obtained by hierarchical clustering.
TABLE 2.1 Cluster list
Cluster 1: "Sushi Bars" "Japanese"
Cluster 2: "Mexican" "Breakfast & Brunch" "Steakhouses" "Sandwiches" "Burgers" "Chinese"
Cluster 3: "Seafood" "Buffets" "Fast Food" "Thai" "Asian Fusion" "Mediterranean" "French"
"Cafes" "Sports Bars" "Barbeque" "Pubs" "Coffee & Tea" "Vietnamese" "Delis" "Vegetarian"
"Lounges" "Greek" "Wine Bars" "Desserts" "Bakeries" "Gluten-Free" "Diners" "Indian"
"Korean" "Salad" "Chicken Wings" "Hot Dogs" "Tapas Bars" "Arts & Entertainment"
"Southern" "Tapas/Small Plates" "Middle Eastern" "Hawaiian" "Vegan" "Gastropubs" "Dim Sum"
Cluster 4: "Italian" "Pizza"
Cluster 5: "American (New)" "American (Traditional)" "Nightlife" "Bars"
Figure 2.4 Similarity matrix and k-means cluster
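The k-means step behind Figure 2.4 uses `stats::kmeans` with k = 5 on the TF-IDF matrix. A sketch with k = 2 on invented two-dimensional cuisine vectors (k-means is randomly initialized, so a seed and multiple starts are set for stability):

```r
# k-means clustering sketch; the report uses k = 5 on the TF-IDF matrix.
set.seed(42)
x <- rbind(Italian = c(5, 0), Pizza = c(6, 1),
           Japanese = c(0, 5), `Sushi Bars` = c(1, 6))
km <- kmeans(x, centers = 2, nstart = 10)

km$cluster["Italian"] == km$cluster["Pizza"]  # TRUE: they share a cluster
```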
2.8 Conclusions
In this chapter, we investigate the development of a cuisine map based on the categories and
customer reviews in the Yelp data. Both TF and TF-IDF are used to build the document term
matrix. Similarity matrices are obtained based on cosine distance and plotted in Figure 2.1 to
Figure 2.4. It is found that TF-IDF enhances the similarity between cuisines that are indeed
similar and weakens the similarity between cuisines that have less in common. We also carry
out hierarchical clustering and k-means clustering to make the cuisine map easier to use.
3 CHAPTER 3
SUMMARY OF TASK 3: DISH RECOGNITION
In this chapter, we investigate the mining of Chinese dish names from the Yelp review data on
Chinese restaurants. We subset the reviews on Chinese restaurants from the original data set
and identify dish names in Chinese cuisine using TopMine [1] and SegPhrase [2].
3.1 Task 3.1: Manual Tagging
First, we revise the label file for Chinese cuisine manually.
We remove false-positive phrases that are not dish names.
We change false-negative dish name phrases to positive labels.
Second, we add more annotated phrases in the same format by searching for menus from Chinese
restaurants.
3.2 Task 3.2: Mining Additional Dish Names
3.2.1 Corpus Preparation
We import the data into R.
We read “yelp_academic_dataset_business.json” into variable “BUSINESS” and
“yelp_academic_dataset_review.json” into variable “REVIEW” using “jsonlite” package.
We select all the restaurants from “BUSINESS” by finding the entries that have
“Restaurants” in column “categories”. We denote this selected data set as
“RESTAURANTS”.
We merge “RESTAURANTS” and “REVIEW” into “RESTAURANTS_REVIEW” by
“business_id” column. This also eliminates the entries in “REVIEW” that are not for
restaurants.
We subset “RESTAURANTS_REVIEW” by selecting the entries with “Chinese” in
column “categories” and save it into “RESTAURANTS_REVIEW_CHINESE”
We subset the “text” column in “RESTAURANTS_REVIEW_CHINESE” and save it
into .txt file with each line being one review.
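The last two steps can be sketched in base R. One detail matters for the one-review-per-line format: newlines inside a review text must be flattened first, or a single review would span several lines of the .txt file. The tiny data frame below is a synthetic stand-in:

```r
# Sketch: subset Chinese reviews and write one review per line.
RESTAURANTS_REVIEW <- data.frame(
  categories = c("Restaurants; Chinese", "Restaurants; Pizza"),
  text = c("Great dim sum.\nWill come back.", "Good crust."),
  stringsAsFactors = FALSE
)
chinese <- RESTAURANTS_REVIEW[grepl("Chinese", RESTAURANTS_REVIEW$categories), ]

lines <- gsub("[\r\n]+", " ", chinese$text)  # flatten internal newlines
out <- tempfile(fileext = ".txt")
writeLines(lines, out)
length(readLines(out))  # 1: one Chinese review, one line in the file
```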
3.3 Dish Name Identification Using TopMine
3.3.1 Parameters
We keep the default values of the parameters, except that we set "maxPattern" to 6, since we
believe a dish name is likely to contain 1 to 6 words.
3.3.2 Opinion about the Result
We run the TopMine package and obtain more than 10k phrases. Some of the most frequent ones
appear to be dish names, such as
“dim sum” 2849
“fried rice” 2511
“egg rolls” 1777
“orange chicken” 1599
These are indeed typical Chinese dishes found in the US. If you have never been to a Chinese
restaurant, you may want to go and try these dishes, because apparently they are very popular in
the US. My personal favorite is "dim sum", which originally comes from Canton and Hong Kong.
However, there are still many frequent phrases that are not dish names, for example "Chinese
food" (with frequency 2853) and "Chinese restaurant" (with frequency 2108). This is because
the phrase mining algorithm, TopMine, is not specifically designed for dish name mining.
These non-dish phrases really are used frequently in the reviews and are therefore genuine
frequent phrases, which means the algorithm itself works very well.
3.3.3 Improvement
Given the limitations of the tools we have, we re-prepare our corpus so that frequent phrases
other than dish names are removed beforehand. Therefore, we remove the word "Chinese" and
the following words from the original corpus to improve the result of phrase mining:
“good”, “food”, “service”, “great”, “one”, “like”, “love”, “pretty”, “place”, “menu”, “ordered”,
“order”, “best”, “try”, “nice”, “well”, “didnt”, “dont”, “ive”, “eat”, “back”, “also”, “got”,
“always”, “come”, “people”, “get”, “will”, “can”, “really”, “just”, “time”, “little”, “us”, “meal”,
“diner”, “first”, “table”, “definitely”.
We remove these words because we found they appear quite often in the corpus, as shown in
the word cloud in Figure 1.1, but are very unlikely to appear in a Chinese dish name. After this
procedure, the results are much better: most of the top frequent phrases are now dish names.
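The word-removal step can be sketched in base R (the report's pipeline would do this with tm's `removeWords`; the short review string and the reduced word list below are illustrative only):

```r
# Base-R sketch of the corpus re-preparation: strip the listed generic
# words before re-running TopMine.
drop_words <- c("chinese", "good", "food", "place", "really")
review <- "really good chinese food the orange chicken at this place"

pattern <- paste0("\\b(", paste(drop_words, collapse = "|"), ")\\b")
cleaned <- gsub("\\s+", " ", trimws(gsub(pattern, "", review)))
cleaned  # "the orange chicken at this"
```

Note the `\b` word boundaries: without them, removing "chinese" naively could also mangle longer words that contain it.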
3.4 Dish Name Identification Using SegPhrase
3.4.1 Parameters
We prepare a label file and set the algorithm parameter AUTO_LABEL=1. The first part of the
label file is the label we revised manually in Task 3.1. The second part comes from the result of
TopMine: we select the 2,000 most frequent phrases in the TopMine result and replace the
frequency with label 1. We then manually revise the labels by removing false positives.
3.4.2 Opinion about the Result
Using the label file and the algorithm package, we obtain a very good dish name list. Below are
the top phrases in the list.
“orange chicken” “hot and sour soup” “cashew chicken” “sea bass” “hot pot”
“kung pao chicken” “brown rice” “shaved ice” “white rice” “char siu” “chow mein”
“won ton” “steamed rice” “fried rice” “bok choy” “sweet and sour pork”
4 CHAPTER 4
SUMMARY OF TASKS 4 & 5: POPULAR DISHES AND
RESTAURANT RECOMMENDATION
In this chapter, we detect popular dishes of a specific cuisine (Chinese cuisine) and popular
restaurants for specific dishes ("orange chicken" and "fried rice"). Popularity is measured by
the frequency with which a dish appears in reviews. We also carry out some sentiment analysis
based on the stars each dish or restaurant receives in reviews.
4.1 Data Preparation
4.1.1 Corpus
We prepare the corpus as follows.
1. Read “yelp_academic_dataset_business.json” into variable “BUSINESS” and
“yelp_academic_dataset_review.json” into variable “REVIEW” using the “jsonlite” package.
2. Select all restaurants from “BUSINESS” by finding the entries that have “Restaurants” in
the “categories” column. We denote this selected data set “RESTAURANTS”.
3. Merge “RESTAURANTS” and “REVIEW” into “RESTAURANTS_REVIEW” by the
“business_id” column. This also eliminates the entries in “REVIEW” that are not for
restaurants.
4. Subset “RESTAURANTS_REVIEW” by selecting the entries with “Chinese” in the
“categories” column and save the result as “CHINESE_REVIEW”.
5. Take the “text” column of “CHINESE_REVIEW” as the corpus.
6. Convert the corpus to ASCII encoding.
7. Strip extra whitespace from the corpus.
8. Remove punctuation marks from the corpus.
9. Remove numbers from the corpus.
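As an illustration, the preparation steps above could be sketched in Python as follows (the report itself uses the “jsonlite”, “dplyr”, and “tm” packages in R). The function names are our own, and the field handling assumes the Yelp academic dataset layout described above.

```python
import json
import re
import string

def load_jsonl(path):
    """Read one JSON object per line, as in the Yelp academic dataset files."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def clean(text):
    """ASCII-only, no punctuation or digits, collapsed whitespace."""
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.translate(str.maketrans("", "", string.punctuation + string.digits))
    return re.sub(r"\s+", " ", text).strip()

def chinese_review_corpus(business_path, review_path):
    business = load_jsonl(business_path)
    reviews = load_jsonl(review_path)
    # "categories" may be a list or a comma-separated string depending on the
    # dataset version; the membership test below covers both cases.
    chinese_ids = {b["business_id"] for b in business
                   if "Restaurants" in b.get("categories", [])
                   and "Chinese" in b.get("categories", [])}
    return [clean(r["text"]) for r in reviews if r["business_id"] in chinese_ids]
```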
4.1.2 Dish List
We used the top 500 dish names from the dish mining results obtained in Task 3. We read the txt
file (each line is a dish name) into R using the function “readLines”.
4.2 Tools and Packages
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
4.2.1 Attached Base Packages:
stats, graphics, grDevices, utils, datasets, methods, base
4.2.2 Other Attached Packages:
dplyr_0.4.3, tm_0.6-2, NLP_0.1-8, ggplot2_1.0.1, jsonlite_0.9.16, qdap_2.2.2,
RColorBrewer_1.1-2, qdapTools_1.3.1, qdapRegex_0.5.0, qdapDictionaries_1.0.6
4.3 Task 4: Popular Dishes
In this section, we detect the top 100 most popular dishes in Chinese cuisine.
4.3.1 Popularity Analysis
The popularity of a dish is defined as the number of customer reviews in which the dish appears.
If a dish name appears more than once in the same review, it is counted only once.
We obtain a data frame with m rows and n + 3 columns, where m is the number of reviews
and n is the number of dishes. Each row represents an individual review. In each row, the first
n columns are indicators for the n dishes: if a dish name appears in the review, the value in the
corresponding column is 1, otherwise it is 0. The (n + 1)st column is the stars
corresponding to the review, the (n + 2)nd column is the name of the corresponding restaurant,
and the (n + 3)rd column is the overall star rating of the restaurant.
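A minimal Python sketch of the review-by-dish data frame described above (the report builds it in R; names and sample values here are illustrative):

```python
# One row per review: n dish indicators followed by the 3 extra columns.
def build_frame(reviews, dishes):
    """reviews: dicts with 'text', 'stars', 'name', 'overall_stars'."""
    rows = []
    for r in reviews:
        text = r["text"].lower()
        # presence (0/1), so repeated mentions in one review count only once
        indicators = [1 if dish in text else 0 for dish in dishes]
        rows.append(indicators + [r["stars"], r["name"], r["overall_stars"]])
    return rows

dishes = ["orange chicken", "fried rice"]
reviews = [{"text": "Best orange chicken ever, orange chicken again!",
            "stars": 5, "name": "Panda Express", "overall_stars": 3.5}]
rows = build_frame(reviews, dishes)
# rows[0] -> [1, 0, 5, "Panda Express", 3.5]
```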
4.3.2 Sentiment Analysis
A frequently ordered (mentioned) dish is not necessarily tasty. We use the stars given by
reviewers in their reviews as an indicator of the tastiness of the dishes mentioned in those
reviews. For example, if a reviewer mentions “fried rice” and “orange chicken” in a review and
gives five stars for the experience, then “fried rice” and “orange chicken” each earn a tastiness
of 5 from that review. We sum the stars each dish earns across all reviews as its overall
tastiness, and then normalize the overall tastiness into the range 1 to 5.
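The tastiness computation can be sketched as below. Min-max rescaling into [1, 5] is an assumption, since the report does not state the exact normalization; the star totals are toy values.

```python
# Sum of review stars per dish, rescaled into [1, 5] (assumed min-max scheme).
def tastiness(star_totals):
    """star_totals: dish -> summed stars. Returns dish -> score in [1, 5]."""
    lo, hi = min(star_totals.values()), max(star_totals.values())
    span = (hi - lo) or 1  # avoid division by zero if all totals are equal
    return {d: 1 + 4 * (t - lo) / span for d, t in star_totals.items()}

scores = tastiness({"orange chicken": 900, "fried rice": 500, "egg roll": 100})
# scores -> {"orange chicken": 5.0, "fried rice": 3.0, "egg roll": 1.0}
```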
4.3.3 Illustration
The results are presented in Figure 4.1, where the x-axis is the top 100 popular dish names and
y-axis is the corresponding frequency-based popularity. We used color to show the tastiness of
the dishes. There is a strong correlation: tastier dishes tend to be ordered (mentioned) more
often, which makes sense in practice.
Figure 4.1 Illustration for popular dish names
4.4 Task 5: Popular Restaurants
In this part, we mine the popular restaurants for a specific dish. Without loss of generality, we
use “orange chicken” and “fried rice” as two examples because they are two of the most popular
dishes in Chinese cuisine, as shown in Figure 4.1. Note that other dish names can be used for
this task, since the method and code we use to obtain the results in this section are general.
4.4.1 Popularity Analysis
We group the data frame obtained in task 4 by restaurant and calculate the total count of dishes
for each restaurant. We use the total count as popularity of the restaurant with respect to a dish.
For example, for restaurant “Panda Express”, the total count of “orange chicken” is 145 while
the total count of “fried rice” is 87. For another example, for restaurant “Chino Bandido”, the
total count of “orange chicken” is 36 while the total count of “fried rice” is 406. As can be
seen, “Panda Express” is more popular for its “orange chicken” whereas “Chino Bandido” is more
popular for its “fried rice”.
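The grouping step can be sketched as follows; the rows and counts are toy values, not the figures quoted above.

```python
from collections import defaultdict

# Group review rows by restaurant and sum one dish's indicator column.
def restaurant_popularity(rows, dish_index):
    """rows: review-by-dish rows where row[-2] is the restaurant name.
    Returns restaurant -> total number of reviews mentioning the dish."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[-2]] += row[dish_index]
    return dict(totals)

rows = [[1, 0, 5, "A", 4.0],   # dish 0 mentioned at restaurant A
        [1, 1, 4, "A", 4.0],
        [0, 1, 3, "B", 3.0]]
pop = restaurant_popularity(rows, 0)
# pop -> {"A": 2, "B": 0}
```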
4.4.2 Sentiment Analysis
A restaurant may serve a lot of “orange chicken” or “fried rice”, but that could be due to the
local population or its low prices. We want to know whether customers are happy after having
its “orange chicken” or “fried rice”, that is, how tasty the “orange chicken” and “fried
rice” are. We use the overall star rating of the restaurant as the measurement.
4.4.3 Illustration
The results are presented in Figure 4.2 and Figure 4.3. The x-axis represents the top 100
restaurants that serve the dishes “orange chicken” and “fried rice” and the y-axis represents the
popularity of the corresponding restaurants. We used color to show the tastiness of the dishes.
Figure 4.2 Illustration for popular restaurants for orange chicken
Figure 4.3 Illustration for popular restaurants for fried rice
4.5 Conclusions
We believe that the figures provided above can be a good guide for people who want to try
Chinese food. They can find the most popular dishes in Figure 4.1 and find which restaurants
serve the best “orange chicken” and “fried rice” in Figure 4.2 and Figure 4.3.
5 CHAPTER 5
SUMMARY OF TASK 6: HYGIENE PREDICTION
In this chapter, we predict whether a set of restaurants will pass the public health inspection tests
given the corresponding Yelp text reviews along with some additional information such as the
locations and cuisines offered in these restaurants.
Two text representation techniques are used: Unigram and Topic Model.
Two learning algorithms are used: Logistic Regression and Random Forest.
Additional features are used such as “Categories”, “Stars”, and “Zipcode”.
5.1 Tools Used
R version 3.1.3
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)
5.1.1 Packages
topicmodels_0.2-2 qdap_2.2.2 qdapTools_1.3.1 qdapRegex_0.5.0
qdapDictionaries_1.0.6 tm_0.6-2 NLP_0.1-8 quanteda_0.7.2-1
randomForest_4.6-10 caret_6.0-52
5.2 Text Preprocessing
We preprocess the review text as follows. Package “tm” is used.
Convert the text into ASCII encoding.
Strip extra whitespace from the text.
Remove punctuation marks from the text.
Remove numbers from the text.
5.3 Training Method 1: Logistic Regression
5.3.1 Text Representation Techniques
5.3.1.1 Unigram
First, we obtain word frequencies from the reviews in the training data and select the top N
frequent words. Here we set N = 301 and N = 1451.
Second, we use the counts of frequent words in the review of each restaurant as its
corresponding text-based features.
Package “qdap” is used.
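A rough Python equivalent of this unigram representation (the report uses the “qdap” package in R; the texts and N below are toy values):

```python
from collections import Counter

def top_words(train_texts, n):
    """Return the n most frequent words across the training reviews."""
    freq = Counter(w for t in train_texts for w in t.lower().split())
    return [w for w, _ in freq.most_common(n)]

def unigram_features(text, vocab):
    """Counts of each vocabulary word in one restaurant's review text."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

vocab = top_words(["good food good service", "bad food food"], n=2)
# vocab -> ["food", "good"]
feats = unigram_features("food was good good", vocab)
# feats -> [1, 2]
```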
5.3.2 Additional Features Used
Stars, Zipcode, and Categories
5.3.3 Learning Algorithm
Logistic Regression
5.3.4 Results Analysis
The results are presented in TABLE 5.1 where “Score” is the score given by Coursera grader.
TABLE 5.1 Prediction Score obtained by Logistic Regression
           # of unigram features   Additional Features              Score
Scheme 1   301                     Stars and Zipcode                0.55778821485
Scheme 2   301                     Stars, Zipcode, and Categories   0.53725827475
Scheme 3   1451                    Stars and Zipcode                0.509439655385
From TABLE 5.1 we can observe the following.
(1) The score is lower when additional feature “Categories” is used. This is probably because
some categories in testing data set do not appear in training data set.
(2) When more unigram features (frequent words) are used, the score is lower. This is probably
because of overfitting.
5.4 Training Method 2: Random Forest
5.4.1 Text Representation Techniques
5.4.1.1 Unigram
First, we obtain word frequencies from the reviews in the training data and select the top N
frequent words. Here we set N = 841 and N = 1451.
Second, we use the counts of frequent words in the review of each restaurant as its
corresponding text-based features.
Package “qdap” is used.
5.4.1.2 Topic Model
First, we mine 10, 50, and 100 topics from the training data.
Second, we count the words that belong to the topics in a restaurant’s review and use the
counts as text-based features.
Package “topicmodels” is used.
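Step two can be sketched as below. The topic word lists here are invented for illustration; the report mines them with the “topicmodels” package in R.

```python
# Given mined topics (each as its list of top words), count how many topic
# words occur in a restaurant's review text: one count per topic as a feature.
def topic_features(text, topics):
    """topics: list of word lists, one per topic. Returns one count per topic."""
    words = text.lower().split()
    return [sum(w in topic_words for w in words) for topic_words in topics]

topics = [["dirty", "smell", "sticky"], ["fresh", "clean", "friendly"]]
feats = topic_features("The floor was sticky and the smell was bad", topics)
# feats -> [2, 0]
```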
5.4.2 Additional Features Used
Stars, Zipcode, and Categories
5.4.3 Learning Algorithm
Random Forest
Packages “caret” and “randomForest” are used.
5.4.4 Results Analysis
We use two text representation techniques and different numbers of features. The results are
shown in TABLE 5.2 and TABLE 5.3, respectively.
In TABLE 5.2, we observe the following.
(1) Results are improved by using the additional feature “Categories”.
(2) More unigram features improve the result. It seems that a large number of unigram features
does not cause overfitting in these two cases. More tests were not carried out because more
features would result in unbearable training time.
TABLE 5.2 Prediction Score obtained by Random Forest & Unigram
           # of unigram features   Additional Features              Score
Scheme 1   841                     Stars, Zipcode, and Categories   0.56127128414
Scheme 3   1451                    Stars, Zipcode, and Categories   0.561925058925
Scheme 2   1451                    Stars and Zipcode                0.559788032673
In TABLE 5.3, we observe that more topics do not necessarily yield better results; overfitting
occurs when the number of topics becomes large.
TABLE 5.3 Prediction Score obtained by Random Forest & Topic Model
           # of topics   Additional Features              Score
Scheme 1   10            Stars, Zipcode, and Categories   0.520164615311
Scheme 2   50            Stars, Zipcode, and Categories   0.552275390162
Scheme 3   100           Stars, Zipcode, and Categories   0.540265423658
5.5 Method Comparison
From the results we can tell that logistic regression tends to overfit with small numbers of
features, whereas random forest is less prone to overfitting. Overall, random forest provides
slightly better results than logistic regression, but it takes much more computing time.
Comparing the results in TABLE 5.2 and TABLE 5.3, we observe that the topic model on
average performs similarly to the unigram representation, although its best result does not
outperform the unigram's. The reason could be as follows. On one hand, the topic model
reduces the dimension of the features and enhances the important ones; on the other hand,
some information useful for prediction may be lost during the dimension reduction.
6 CHAPTER 6
USEFULNESS OF RESULTS
In this chapter, we introduce the useful results obtained through the data mining capstone.
6.1 Cuisine Maps
In Chapter 2, we build several cuisine maps which show the similarity between 50 different
cuisines.
6.1.1 Usefulness for Customers
These maps can be very useful for customers who want to explore new cuisines. For instance,
according to the cuisine map in Figure 2.2, “Mediterranean”, “Greek”, and “Middle Eastern” are
three very similar cuisines. People who like one of them may want to try the other two if they use
the cuisine map.
6.1.2 Usefulness for Restaurant Owners
These maps can also benefit restaurant owners who want to extend their businesses. They can
choose to open their new restaurants next to or far away from certain restaurants. For example,
an owner of a cafe may want to open a new cafe next to a restaurant that specifically provides
breakfast and brunch since they are very similar according to the cuisine map and people will
love to grab a cup of coffee before or after breakfast.
6.2 Dish Recognizer
We recognized dishes in task 3, as introduced in Chapter 3. This is useful for businesspeople
who want to open restaurants: it is very helpful to know what dishes are served in a certain
cuisine before opening a restaurant of that cuisine.
6.3 Popular Dishes Detection
We detect top 100 popular Chinese dishes with corresponding tastiness in task 4 as introduced in
Chapter 4. This is extremely useful for people who like Chinese food or want to try it: they can
find the most popular and tasty dishes and avoid ordering dishes that are not well received.
In addition, this result is also very useful to owners of Chinese restaurants and businessmen who
want to start Chinese restaurants. For them, providing more popular food is more likely to bring
more customers and hence more profit.
6.4 Restaurant Recommendation
We recommend top 100 restaurants that serve orange chicken and fried rice in task 5 as
presented in Chapter 4. This is also quite useful for customers who want to try these two special
dishes.
6.5 Hygiene Prediction
This result helps customers select clean restaurants and avoid restaurants that are not good at
maintaining hygiene.
7 CHAPTER 7
NOVELTY OF EXPLORATION
7.1 Hierarchical Clustering in Cuisine Map Development
When we build the cuisine map with clustering, hierarchical clustering is used, as shown
in Figure 2.3. The hierarchical relations between cuisines are shown together with the similarity
matrix. This helps users find clusters based on their own needs: instead of fixing the
number of clusters beforehand, we allow users to choose how many clusters they want, or
simply to find cuisines that are connected by the hierarchical links.
7.2 TopMine Output Used as the Input for SegPhrase in Dish Recognition
In recognizing dish names for Chinese cuisine, we use the output of TopMine as the input for
SegPhrase so that SegPhrase has a more comprehensive labeled dish list. The first part of the
labeled list is the one we revised manually in task 3.1; the second part comes from the result
of TopMine. This method turns out to be very effective, resulting in a 12 out of 10 score
according to the grader.
7.3 Top Frequent Unigram Terms and Topic Model are Used in Hygiene Prediction
In training hygiene prediction models, we use two text representation techniques: unigram and
topic model. For unigram, we detect the top N popular terms first instead of using all the terms
in corpus, and then use the counts of the N words in the reviews as features. For topic model,
we first mine topics from customer reviews, and then use the word counts in the topics as
features.
The two methods are very effective according to the grader: an F1 score of 0.56 is obtained
using the top term counts as features, and an F1 score of 0.55 using the topic model.
8 CHAPTER 8
CONTRIBUTION OF NEW KNOWLEDGE
8.1 Some Advantages of Random Forest over Logistic Regression
In carrying out task 6, we train both logistic regression model and random forest with the same
number of features and compare the results. Here are some advantages of random forest over
logistic regression found during the experiment.
8.1.1 Random Forest is Less Prone to Overfitting than Logistic Regression
We found that random forest provides better results as more features are included, without
showing overfitting, though we did not carry out experiments with more than 1500 features.
In contrast, logistic regression shows signs of overfitting when fewer than 1500 features
are used. This shows that random forest is less prone to overfitting than logistic regression.
8.1.2 Logistic Regression is not Good at Handling Missing Feature Value
When we use logistic regression as the prediction algorithm and restaurant categories as a
feature, warnings occur because some restaurant categories appear in the testing data but not
in the training data, which worsens the prediction results. Random forest, on the other hand,
seems able to cope with this situation and even provides better predictions when restaurant
categories are used as a feature.
9 CHAPTER 9
IMPROVEMENT TO BE DONE
Several things can be done to improve this project. First, web-based tools can be developed for
interactive illustration of the results. Second, an updating algorithm can be developed to
refresh the results efficiently when more data become available, instead of carrying out the
data mining from scratch. Third, a location-based restaurant and dish recommendation system
can be developed, which would be more helpful for customers in specific places.
10 REFERENCES
[1] El-Kishky, Ahmed, et al. "Scalable topical phrase mining from text corpora." Proceedings of
the VLDB Endowment, 8.3 (2014): 305-316.
[2] Jialu Liu*, Jingbo Shang*, Chi Wang, Xiang Ren and Jiawei Han, "Mining Quality Phrases
from Massive Text Corpora,” Proc. of 2015 ACM SIGMOD Int. Conf. on Management of Data
(SIGMOD'15), Melbourne, Australia, May 2015. (* equally contributed)