lets eat presentation_final_20160521
Let’s Eat!
Brad Binder, Lesley Chapman, Jon Froiland, David Lee
Introduction
History:
Since 1979 there have been services that review
and rank restaurants (Zagat)
Today:
According to Nielsen, Americans have on
average 41 apps on their smartphones, many of
which provide a recommendation service
Introduction
A variety of restaurant recommendation apps
have been created
Features include finding restaurants, making reservations,
and showing healthy options
A Restaurant Recommender would aim to help
users save money and time, and could help cure
buyer’s remorse
Problem Summary
We need a tool that resolves the challenge of
finding a restaurant in your area based upon
specific cuisine and menu item criteria
entered by the user
Hypothesis
Hypothesis: The Restaurant Recommender will recommend a
better-matched restaurant than selecting a restaurant
based on chance alone
H0 (null hypothesis): A user will find a restaurant that they like
based on chance alone
HA (alternative hypothesis): The Restaurant Recommender app
will provide a better restaurant suggestion to the user compared
to chance alone
Data Ingestion
• WORM (write once, read many) storage
–Stored HTML menu pages in one location where they could be read many times
• Parsed HTML with BeautifulSoup
–Built out a list of “Restaurant” objects
• GET requests to WMATA API to pull metro station data
–JSON data parsed with pandas read_json() function
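The ingestion steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the HTML markup, the class names (`name`, `cuisine`, `menu`, `item`, `price`), the `Restaurant` fields, and the station payload layout are all assumptions made for the example, and the coordinates are illustrative.

```python
import json
from dataclasses import dataclass, field
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical menu page; the real pages' markup is not shown in the slides.
SAMPLE_HTML = """
<div class="restaurant">
  <h1 class="name">Sample Diner</h1>
  <span class="cuisine">American</span>
  <ul class="menu">
    <li><span class="item">Crab Cake</span><span class="price">$12.50</span></li>
    <li><span class="item">House Salad</span><span class="price">$7.00</span></li>
  </ul>
</div>
"""

# Hypothetical station payload standing in for a saved API response.
SAMPLE_STATIONS = json.dumps({"Stations": [
    {"Name": "Metro Center", "Lat": 38.8983, "Lon": -77.0281, "LineCode1": "RD"},
    {"Name": "Dupont Circle", "Lat": 38.9096, "Lon": -77.0434, "LineCode1": "RD"},
]})


@dataclass
class Restaurant:
    name: str
    cuisine: str
    menu: list = field(default_factory=list)  # list of (item, price) tuples


def parse_restaurant(html):
    """Build one Restaurant object from a stored HTML menu page."""
    soup = BeautifulSoup(html, "html.parser")
    menu = [
        (li.find("span", class_="item").get_text(strip=True),
         li.find("span", class_="price").get_text(strip=True))
        for li in soup.find("ul", class_="menu").find_all("li")
    ]
    return Restaurant(
        name=soup.find("h1", class_="name").get_text(strip=True),
        cuisine=soup.find("span", class_="cuisine").get_text(strip=True),
        menu=menu,
    )


def parse_stations(raw_json):
    """Load the list of station records into a DataFrame with read_json()."""
    records = json.loads(raw_json)["Stations"]
    return pd.read_json(StringIO(json.dumps(records)))


restaurants = [parse_restaurant(SAMPLE_HTML)]
stations = parse_stations(SAMPLE_STATIONS)
```

Because the pages are stored once and only read afterwards, the parsing step can be rerun freely while the `Restaurant` structure evolves.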
Ingestion Wrangling Analysis Modeling Visualization
Wrangling and Munging
• Majority of time spent wrangling the data and building restaurants
–Removing duplicate and incomplete records
–Standardizing inconsistent fields (e.g. price)
–Aggregating and grouping
–Data types
• Merged restaurant and WMATA data using Euclidean distance
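One plausible reading of the Euclidean-distance merge is that each restaurant is paired with its closest metro station. The sketch below assumes that interpretation; the names and coordinates are illustrative, and `nearest_station` is a hypothetical helper. Note that Euclidean distance on raw latitude and longitude is only an approximation of true ground distance.

```python
import numpy as np

# Illustrative (latitude, longitude) pairs only.
station_names = ["Metro Center", "Dupont Circle"]
station_xy = np.array([[38.8983, -77.0281], [38.9096, -77.0434]])

restaurant_names = ["Sample Diner", "Example Bistro"]
restaurant_xy = np.array([[38.8990, -77.0290], [38.9100, -77.0440]])


def nearest_station(points, stations):
    """Return (index, distance) of the closest station for each point."""
    # Broadcasting yields an (n_points, n_stations, 2) array of differences.
    diffs = points[:, None, :] - stations[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    idx = dists.argmin(axis=1)
    return idx, dists[np.arange(len(points)), idx]


idx, dist = nearest_station(restaurant_xy, station_xy)
merged = [
    {"restaurant": r, "station": station_names[i], "distance_deg": float(d)}
    for r, i, d in zip(restaurant_names, idx, dist)
]
```

The distance here is in degrees, so it is useful for ranking stations but not as a walking distance.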
Data Overview
964 Total Restaurants
115,517 Total Menu Items
• Restaurant data includes:
–Name
–Location (address, latitude, longitude)
–Type of cuisine
–Menu (item, price, description)
• WMATA data includes:
–Station name
–Location (latitude, longitude)
–Metro Line
Analysis
10 cities
964 Restaurants
115,517 Menu Items
Washington, D.C.
Feature Selection
• Four feature extraction pipelines using sklearn
–Chunking
–Cuisine Type
• TfidfVectorizer
–Extract keywords and assign significance score
– Tokenize and chunk parts of speech using nltk
• LabelBinarizer
–Convert cuisine types to binary features
• FeatureUnion
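A rough sketch of how these pieces might fit together is shown below. It is not the project's pipeline: `FieldSelector` and `CuisineBinarizer` are hypothetical helper classes written for this example, the restaurant records are made up, and the default TfidfVectorizer tokenizer stands in for the nltk-based tokenizing and chunking. The small wrapper around LabelBinarizer exists because LabelBinarizer is designed for labels and its `fit_transform` does not accept the `(X, y)` call that a pipeline makes.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import LabelBinarizer


class FieldSelector(TransformerMixin, BaseEstimator):
    """Pull one field out of each restaurant dict (hypothetical helper)."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [row[self.key] for row in X]


class CuisineBinarizer(TransformerMixin, BaseEstimator):
    """Adapter so LabelBinarizer can be used as a feature transformer."""

    def fit(self, X, y=None):
        self.binarizer_ = LabelBinarizer().fit(X)
        return self

    def transform(self, X):
        return self.binarizer_.transform(X)


features = FeatureUnion([
    # Keyword features: TF-IDF weights over the menu text.
    ("menu_text", Pipeline([
        ("select", FieldSelector("menu")),
        ("tfidf", TfidfVectorizer(stop_words="english")),
    ])),
    # Cuisine features: one binary column per cuisine type.
    ("cuisine", Pipeline([
        ("select", FieldSelector("cuisine")),
        ("binarize", CuisineBinarizer()),
    ])),
])

# Made-up records for illustration.
restaurants = [
    {"menu": "crab cake with lemon aioli", "cuisine": "Seafood"},
    {"menu": "margherita pizza with basil", "cuisine": "Italian"},
    {"menu": "spicy tuna roll", "cuisine": "Japanese"},
]
X = features.fit_transform(restaurants)  # sparse matrix: TF-IDF + cuisine columns
```

FeatureUnion concatenates the outputs column-wise, so dropping the cuisine branch later only requires removing one entry from the list.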
Modeling and Prediction
• Transformation pipelines and transformed feature vectors pickled
• KMeans models fitted using training restaurant data, then pickled
• User inputs entered via Flask are stored as a training instance
• Relevant pipeline and model loaded to transform and predict
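The fit, pickle, load, and predict cycle described above might look roughly like this. The data is synthetic (three well-separated blobs standing in for transformed restaurant vectors), and `n_clusters=3` is chosen for the toy data rather than taken from the project.

```python
import pickle

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)

# Synthetic stand-in for the transformed restaurant feature vectors.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_train = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in centers])

# Fit once on the training restaurants, then serialize the fitted model.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)
blob = pickle.dumps(model)  # on disk this would be pickle.dump(model, file)

# Later (e.g. inside a web request): load the model and assign a cluster.
loaded = pickle.loads(blob)
user_vector = np.array([[4.8, 5.1]])  # already-transformed user input
cluster = int(loaded.predict(user_vector)[0])

# Training restaurants that share the user's cluster.
members = np.where(loaded.labels_ == cluster)[0]
```

Pickling the transformation pipeline alongside the model matters because the user input must be transformed with exactly the vocabulary and encodings learned at training time.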
K=15
Reporting and Visualization
• Restaurant recommendations are determined by similarity within a matched cluster
–“Similarity” is calculated by minimizing the pairwise Euclidean distance (computed with sklearn) between the test data and the training instances in the feature space
• Predictions are exported into an interactive Tableau visualization
–Allows the user flexibility in making a selection through filtering and visual indicators
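Ranking within the matched cluster can be sketched as below, using sklearn's `euclidean_distances`. The `recommend` function, the feature vectors, and the restaurant names are hypothetical, and the cluster labels are given by hand rather than produced by a fitted model.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances


def recommend(query, X_train, labels, cluster, names, top_n=3):
    """Rank the restaurants in one cluster by distance to the query vector."""
    members = np.where(labels == cluster)[0]
    dists = euclidean_distances(query.reshape(1, -1), X_train[members])[0]
    order = np.argsort(dists)[:top_n]  # smallest distance first
    return [(names[members[i]], float(dists[i])) for i in order]


# Toy feature space: three restaurants in cluster 0, one in cluster 1.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [10.0, 10.0]])
labels = np.array([0, 0, 0, 1])
names = ["A", "B", "C", "D"]

recs = recommend(np.array([0.9, 0.1]), X_train, labels, cluster=0, names=names)
```

Restricting the search to the matched cluster keeps the comparison small; the ranked list is what would then be exported for visualization.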
Demo
Results
• Some predictions are good, others not so good
–Some clusters still contain a “hodge podge”
• Removing the “cuisine type” feature helped to eliminate what we saw as overfitting
• Different k values saw better results in some cases, worse in others
• Additional features (price, ratings, metro) would require more clusters and MORE DATA
Conclusions
• More data over a “better” model
• Might improve results using transformations like Singular Value Decomposition (SVD) or Latent Dirichlet Allocation (LDA)
–Better model analysis
• With more data, improve our tokenizer
–Incorporate stemming, improve chunking
• Incorporating user feedback into prediction model (ex: Flask interface)
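As an illustration of the SVD idea, the sketch below applies sklearn's `TruncatedSVD` to TF-IDF features, which is one common way to do it. The menu strings are made up and `n_components=2` is arbitrary; this was not part of the project as presented.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Made-up menu text, one string per restaurant.
menus = [
    "crab cake lemon aioli grilled salmon",
    "margherita pizza basil mozzarella",
    "spicy tuna roll salmon nigiri",
    "pepperoni pizza garlic bread",
]

# TF-IDF produces a wide sparse matrix; TruncatedSVD projects it
# onto a small number of dense components before clustering.
lsa = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=2, random_state=0),
)
X_reduced = lsa.fit_transform(menus)
```

The reduced matrix could be passed to the clustering step in place of the raw TF-IDF vectors.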
Additional Opportunities
• “Waiter-caller” function that would allow users to log in, use the restaurant map search function, click on a restaurant, and be matched up with menu items based on keyword matches, as opposed to reading through an entire menu to find relevant items.
–Required more knowledge and implementation of JavaScript, CSS, and Jinja in the Flask environment.
• Sentiment analyzer was developed but not integrated. It would allow users to go to a restaurant and input a review. The review would then be analyzed, giving back a recommended score (1-5) to the user.
–Similar requirements
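The keyword-matching idea behind the “Waiter-caller” function could be sketched as follows. This is a guess at the simplest possible version, not the unbuilt feature itself: `tokenize`, `match_menu_items`, and the menu records are hypothetical, and matching is plain word overlap with no stemming.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into a set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def match_menu_items(menu, query):
    """Return menu item names sharing at least one keyword with the query."""
    wanted = tokenize(query)
    scored = []
    for item in menu:
        overlap = wanted & tokenize(item["item"] + " " + item["description"])
        if overlap:
            scored.append((len(overlap), item["item"]))
    # Most overlapping keywords first; ties broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


# Made-up menu for illustration.
menu = [
    {"item": "Crab Cake", "description": "lump crab, lemon aioli"},
    {"item": "House Salad", "description": "greens, lemon vinaigrette"},
    {"item": "Cheeseburger", "description": "beef patty, cheddar"},
]
matches = match_menu_items(menu, "crab with lemon")
```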