extract, load, visualize become a data ninja
![Page 1: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/1.jpg)
A talk delivered at IIM Lucknow, 24/01/2016
EXTRACT. LOAD. VISUALIZE
Becoming a data ninja
![Page 2: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/2.jpg)
![Page 3: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/3.jpg)
![Page 4: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/4.jpg)
• INTRODUCTION
• WEB CRAWLING – THE DESIGN
• DIY – WEB CRAWLING
• TEXT ANALYTICS – THE DESIGN
• CASE STUDY
• DIY – TEXT ANALYTICS
• SOME UNSOLICITED ADVICE? – BUILDING YOUR FIRM
WE WILL TALK OF….
![Page 5: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/5.jpg)
Who are we ?
We are a data science company, founded in 2009, with a special interest in making the world an intelligent place to live in.
We identify data and bring it to light, making it visible, cohesive, comparable and easy to understand so that it really does support YOU in making the right decisions.
Who am I ?
I am a Practice Lead at JSM for Natural Language Processing & Machine Learning. I have architected multiple solutions in the area of text analytics for multiple industries like finance, healthcare, food & beverages & hospitality.
![Page 6: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/6.jpg)
AREAS WE WORK ON
PHARMA
Sales Pitch Analysis
RETAIL
Predictive + IoT
FINANCE
Competitive Intelligence
F&B
Customer Insights
MR
Scoping and Product Evaluation
SaaS
NLP, ML, Text
![Page 7: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/7.jpg)
WHAT DO CLIENTS WANT
![Page 8: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/8.jpg)
TOOLS WE PLAY WITH
OPEN SOURCE
Inexpensive
DATABASES
Fast & Scalable
INSIGHTS
Python, R
TECHNIQUES
Latest yet tested
VISUALIZATIONS
D3, GCharts, Tableau
MANAGEMENT
Basecamp
![Page 9: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/9.jpg)
Low Cost Data Collection
+
Comprehensive Analytics
![Page 10: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/10.jpg)
PART 1
Scraping data from the web
![Page 11: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/11.jpg)
HTML PAGES
![Page 12: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/12.jpg)
• HTML pages are like textbooks – content, titles, subtitles, paragraphs and so on
• JavaScript adds interactivity to HTML pages
HTML PAGES
![Page 13: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/13.jpg)
A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index.
The major search engines on the Web all have such a program, which is also known as a "spider" or a "bot."
OVERVIEW – WEB CRAWLING
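To make the definition concrete, here is a minimal sketch of the visit, read, follow-links loop. It is not how any particular search engine works. The `crawl` function, the `LinkExtractor` class and the in-memory `FAKE_SITE` are all illustrative assumptions; the fake site stands in for real HTTP requests so the example runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl: visit a page, read it, queue its links.

    `fetch` is any function that takes a URL and returns HTML text,
    or None when the page is unavailable.
    """
    seen = {start_url}
    queue = deque([start_url])
    index = {}  # url -> list of absolute outgoing links

    while queue and len(index) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        links = [urljoin(url, href) for href in parser.links]
        index[url] = links
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_SITE = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": "<p>No links here</p>",
}

pages = crawl("http://example.com/", FAKE_SITE.get)
```

Swapping `FAKE_SITE.get` for a function that downloads real pages would turn this into a working (if naive) crawler; the `max_pages` cap keeps it from wandering indefinitely.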
![Page 14: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/14.jpg)
BUT, YOU ARE MANAGERS.
DO YOU NEED THIS ?
![Page 15: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/15.jpg)
YES !
You don’t need to write code, promise.*
* T&Cs apply
![Page 16: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/16.jpg)
Tool 1
Import.io
![Page 17: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/17.jpg)
• BEST of the lot
• Gives great flexibility – just click and extract
• Most of the sites are compatible
• Easy CSV/Google Docs Export
• Provides APIs for regular data updates
• Low training time
INTRODUCTION
![Page 18: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/18.jpg)
USE CASES
You want to monitor feedback
http://www.consumercomplaints.in/snapdeal-com-b100038
![Page 20: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/20.jpg)
USE CASES
You want to create pricing strategy
http://www.shopclues.com/mobiles/unboxed-mobiles.html
![Page 21: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/21.jpg)
Tool 2
webscraper.io
![Page 22: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/22.jpg)
• Works where Import.io fails
• A bit buggy, but does a good job of offering flexible data extraction choices
• Most of the sites are compatible
• Easy CSV Export
• NO APIs for regular data updates
• Moderate learning curve
INTRODUCTION
![Page 23: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/23.jpg)
LET’S GET OUR HANDS DIRTY
![Page 24: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/24.jpg)
• Don’t scrape too fast or you will get banned
• Respect robots.txt
• Extract only what you need
• Don’t overload their servers
• Don’t take data that’s not yours – only data in the public domain
PRECAUTIONS
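For those who do end up writing code, the first two precautions can be sketched with Python's standard `urllib.robotparser`. The robots.txt content, the bot name `my-research-bot` and the helper names below are assumptions made for illustration; a real robots.txt would be read from the target site.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at <site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "my-research-bot"  # hypothetical bot name


def allowed(url):
    """True if robots.txt permits our bot to fetch this URL."""
    return robots.can_fetch(USER_AGENT, url)


def polite_delay(default_seconds=1.0):
    """Seconds to wait between requests: the site's Crawl-delay if set."""
    delay = robots.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default_seconds


def fetch_all(urls, fetch, sleep=time.sleep):
    """Fetch only permitted URLs, pausing after each request."""
    results = {}
    for url in urls:
        if not allowed(url):
            continue  # respect robots.txt
        results[url] = fetch(url)
        sleep(polite_delay())  # don't scrape too fast
    return results
```

Passing `fetch` and `sleep` in as parameters keeps the sketch independent of any particular HTTP library and easy to try out without touching a real server.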
![Page 25: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/25.jpg)
TEXT ANALYTICS
Mine gold from mountain heaps
![Page 26: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/26.jpg)
As a brand owner with significant investments in social media, the usual questions you might have in mind…
• Is the brand exuding the attributes I intended it to?
• Is my internet presence helping me ?
• Can I measure my ROI for the money I spent?
• What are the measurable metrics for effective social media management?
• When can I exploit emerging trends for my brand?
• How can I understand my customers better?
WHAT WOULD YOU WANT?
![Page 27: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/27.jpg)
WHAT WOULD YOU WANT TO TARGET?
TRANSACTIONAL CONVERSATIONS
Users talk about current events, share cat videos and engage in trivial gossip
INFORMATIONAL CONVERSATIONS
Users engage with the brand to air appreciation or complaints.
![Page 28: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/28.jpg)
DATA SOURCES
Social Media
Comments
Brand Website
Blogs
Customer Emails
Customer Surveys
News Websites
![Page 29: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/29.jpg)
For a brand, say X, there would be thousands of conversations on blogs, websites and social media revolving around X.
Several hundred blog posts containing the keyword “X” are published. The company needs to know:
- How many of these posts are relevant and actually expressing opinions about X?
- How many relevant posts are negative, and how many are positive?
- What particular aspects and features of X are being praised or criticized?
- For all of the above, what is the trend for the past few time periods?
- How is my brand perceived across demographics?
- How is my brand faring against my competitors?
QUESTIONS TO ANSWER
![Page 30: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/30.jpg)
BRIEF TERMINOLOGY
MACHINE LEARNING
You build an algorithm, the machine learns patterns, the machine predicts; rinse & repeat.
TEXT ANALYTICS
Analyze unstructured text, assign structure, and load it into a BI tool or program to visualize.
![Page 31: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/31.jpg)
CASE STUDY
HOW WE HELPED A RESTAURANT SERVE THEIR CUSTOMERS BETTER
![Page 32: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/32.jpg)
PROBLEM STATEMENT
The client had thousands of customer reviews which they wanted to analyse - to understand customer feedback and identify improvement opportunities.
The broad questions we focused on:
What did they say about the restaurant?
Keywords & topics of discussion across the comments
What elements of the restaurant would they want improved? – service, staff behaviour, ambience etc.
When did the customer visit the store?
How is client’s traffic distributed over time?
Ticket sizes across multiple customer dimensions – age, gender, ratings, location, time of visit etc.
Overall customer sentiments & views about UCH
PRIMARY FOCUS AREAS SECONDARY FOCUS AREAS
![Page 33: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/33.jpg)
FOCUS AREAS
TOPICS KEYWORDS SENTIMENT POINT OF SALES
![Page 34: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/34.jpg)
APPROACH
PRE-PROCESSING > PARSING & ANALYSIS > OUTPUT
Extract data and validate
Corpus from social media
Tokenise and remove stop words
Initiate ML models, NER, parsers & topic algorithms
Initiate detection rules for topics, keywords, gender, sentiment and multi-word concept detection
Part of Speech (POS) Tagger
Final Output
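A toy version of the pre-processing and rule-based detection steps can be sketched in plain Python. This is not the engine described in the talk: it uses no ML models, NER or POS tagging, and the stop-word list, sentiment words and topic keywords are tiny made-up lists chosen only for illustration.

```python
import re
from collections import Counter

# Tiny illustrative word lists (assumptions, not a real lexicon).
STOP_WORDS = {"the", "was", "and", "but", "a", "is", "very", "were", "to"}
POSITIVE = {"great", "friendly", "delicious", "good"}
NEGATIVE = {"slow", "rude", "cold", "bad"}
TOPICS = {
    "service": {"service", "waiter", "staff"},
    "food": {"food", "pizza", "dessert"},
}


def tokenise(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens):
    """Drop common words that carry little meaning."""
    return [t for t in tokens if t not in STOP_WORDS]


def analyse(review):
    """Turn one free-text review into a structured record."""
    tokens = remove_stop_words(tokenise(review))
    # Sentiment rule: count of positive words minus count of negative words.
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    sentiment = "Positive" if score > 0 else "Negative" if score < 0 else "Neutral"
    # Topic rule: a topic is present if any of its keywords appears.
    topics = sorted(name for name, words in TOPICS.items() if words & set(tokens))
    return {"tokens": tokens, "topics": topics, "sentiment": sentiment}


reviews = [
    "The food was delicious and the staff were friendly",
    "Service was very slow and the waiter was rude",
]
records = [analyse(r) for r in reviews]
keyword_counts = Counter(t for rec in records for t in rec["tokens"])
```

Each review becomes a record with tokens, topics and a sentiment label, which is the kind of structure a reporting tool can then aggregate.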
![Page 35: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/35.jpg)
| Author | Value | Type | Sentiment |
| --- | --- | --- | --- |
| Duncan Riley | Samsung Galaxy S6 | Entity | Positive |
| Duncan Riley | Apple | Entity | Negative |
| Duncan Riley | LoopPay | Entity | Neutral |
| Duncan Riley | mobile payments | Keyword | Positive |
| Duncan Riley | point of sales | Keyword | Positive |
Structured data extracted from free-flowing text is easy for existing reporting and business intelligence software to use. Insights from the final reports can now be used for decision-making by the PR firm and their client.
Actual blog post parsed through our SmartText Engine
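As a sketch of the "load" step, the structured rows shown above can be written out as CSV, a format most reporting and BI tools import. The rows are copied from the slide; the function name `to_csv` and the summary counter are illustrative assumptions.

```python
import csv
import io
from collections import Counter

FIELDS = ["Author", "Value", "Type", "Sentiment"]

# Rows copied from the example output on this slide.
rows = [
    dict(zip(FIELDS, r))
    for r in [
        ("Duncan Riley", "Samsung Galaxy S6", "Entity", "Positive"),
        ("Duncan Riley", "Apple", "Entity", "Negative"),
        ("Duncan Riley", "LoopPay", "Entity", "Neutral"),
        ("Duncan Riley", "mobile payments", "Keyword", "Positive"),
        ("Duncan Riley", "point of sales", "Keyword", "Positive"),
    ]
]


def to_csv(records):
    """Serialise structured records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv(rows)

# A simple aggregate a dashboard might show: sentiment counts per type.
sentiment_by_type = Counter((r["Type"], r["Sentiment"]) for r in rows)
```

Writing `csv_text` to a file gives something that can be opened directly in Excel or connected to a visualization tool.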
![Page 37: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/37.jpg)
RATINGS – HOW MUCH IS BETTER ?
Increasing scope of differentiating operational improvements
Decreasing scope of customer loyalty
![Page 38: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/38.jpg)
OUR PLATFORM
7.7 Mn reviews | 92.6 K restaurants | 62.6 K user profiles | 16 topics
As on 28th October, 2015
![Page 39: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/39.jpg)
LUNCHBOX - SCREENSHOTS & MOCKUPS
![Page 40: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/40.jpg)
Lunchbox – Search Screen
![Page 41: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/41.jpg)
Lunchbox – ViewMe
![Page 42: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/42.jpg)
Lunchbox – ViewMe
![Page 43: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/43.jpg)
Lunchbox – RankMe
![Page 44: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/44.jpg)
Lunchbox – MarketMe
![Page 45: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/45.jpg)
BUT HEY ! THIS NEEDS ME TO WRITE CODE.
YOU LIAR !
![Page 46: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/46.jpg)
SCRAPE > VOSVIEWER > GEPHI > TABLEAU
![Page 47: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/47.jpg)
Scrape the web !
![Page 48: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/48.jpg)
Excel
![Page 49: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/49.jpg)
= ISNUMBER(SEARCH($N$1,H2))
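The formula flags whether the text in H2 contains the keyword held in N1, ignoring case. For readers who outgrow the spreadsheet, a rough Python equivalent is sketched below. The function name and sample comments are assumptions, and Excel's `SEARCH` wildcards (`?` and `*`) are deliberately not handled.

```python
def contains_keyword(keyword, text):
    """Case-insensitive check for whether text contains keyword.

    Mirrors =ISNUMBER(SEARCH(keyword, text)) for plain keywords only;
    Excel's SEARCH wildcards (? and *) are not supported here.
    """
    return keyword.lower() in text.lower()


keyword = "service"  # plays the role of cell $N$1
comments = [         # plays the role of column H
    "Service was slow",
    "Loved the dessert",
    "Great SERVICE overall",
]
flags = [contains_keyword(keyword, c) for c in comments]
```

Filling the formula down column H corresponds to the list comprehension on the last line: one TRUE/FALSE flag per comment.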
![Page 50: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/50.jpg)
VOSViewer
![Page 51: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/51.jpg)
Gephi
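For anyone who prefers to prepare the network data in code rather than with VOSviewer, a keyword co-occurrence edge list can be sketched as below. This is a substitute technique, not what VOSviewer does internally. The sample keyword sets are made up, and the `Source`, `Target`, `Weight` column headers are the ones I assume Gephi's spreadsheet import expects for an edge table.

```python
import csv
import io
from collections import Counter
from itertools import combinations

# Hypothetical keyword sets, one per scraped review.
review_keywords = [
    {"service", "slow", "waiter"},
    {"service", "slow"},
    {"food", "service"},
]


def cooccurrence_edges(keyword_sets):
    """Count how often each pair of keywords appears in the same review."""
    counts = Counter()
    for keywords in keyword_sets:
        # Sorting makes (a, b) and (b, a) the same undirected edge.
        for a, b in combinations(sorted(keywords), 2):
            counts[(a, b)] += 1
    return counts


def edges_to_csv(counts):
    """Write the edge counts as a Source,Target,Weight table."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["Source", "Target", "Weight"])
    for (a, b), weight in sorted(counts.items()):
        writer.writerow([a, b, weight])
    return buffer.getvalue()


edges = cooccurrence_edges(review_keywords)
edge_csv = edges_to_csv(edges)
```

Heavier edges mark keywords that customers tend to mention together, which is what the network view is meant to surface.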
![Page 52: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/52.jpg)
Tableau
![Page 53: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/53.jpg)
LET’S GET OUR HANDS DIRTY, AGAIN !
![Page 54: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/54.jpg)
SOME UNSOLICITED ADVICE ?
![Page 55: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/55.jpg)
![Page 56: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/56.jpg)
• Purse strings WILL be tightened
• Fewer ‘unicorn-level’ valuations
• No investment without revenue or clients
• MVP with traction – a must have
• Acquire or be acquired – or be an acquisition target
• Needs > Wants – solve a problem, but scope it out
• Bootstrapped startups will win, not the valuation-hungry
PREDICTIONS
![Page 57: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/57.jpg)
Don't sell what you do - disguise it
Ex. Create marketing strategy, not analytics/Tableau
Nugget 1
![Page 58: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/58.jpg)
A product that promotes laziness or transfers laziness from seller to buyer will sell
Nugget 2
![Page 59: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/59.jpg)
Is it something that Google/FB provides or can do? Even if it's partial, the danger is real.
Nugget 3
![Page 60: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/60.jpg)
The product should be IP-able, scalable and monetizable; at least 2 of the 3 must be met
Nugget 4
![Page 61: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/61.jpg)
Read Read Read
TechCrunch, Recode, Crunchbase, YourStory, NextBigWhat, VCCircle
Nugget 5
![Page 62: EXTRACT, LOAD, VISUALIZE Become a Data Ninja](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a5b1fd1a28ab1a628b6615/html5/thumbnails/62.jpg)
QUESTIONS ?