How to scrape content from the web for a location-based mobile app
Scraping content from the web for a location-based mobile app
Nguyen Hong Diep, founder, magik.vn
Summary
1. Web Scraping
– Definitions
– Added value
– Analysis of a sample case
2. Scrapy Framework
– Overview
– Architecture
– A simple Scrapy program
3. Build an automatic scraping system for location-based apps
– Extract LatLng from address
– Extract phone number
– Real-time update, running continuously 24/7
– Prevent duplicate data
– Deploy without a dedicated server or VPS
Web crawler
Internet bot that systematically browses the World Wide Web,
typically for web indexing.
Sources: wikipedia.org
Scrape
Crawl websites and extract structured data from pages.
Sources: wikipedia.org
Added Value?
giamua.com – “groupon”
baomoi.com
Added Value?
same user experience, but
more content than
oizoioi.vn – price comparison for electronics
Added Value?
Create new knowledge from many pieces of information
Wisdom
Knowledge
Information
Data
DIKW Hierarchy
Nha Tro Tot
Added Value?
The smartphone revolution: a new platform
needs a new user experience
Source: www.widexconnect.ca
And more
Sources : Laban.vn
Analysis of a sample case
Need:
(1) Collect [homes for sale] records from the web
(2) from many websites in Vietnam
(3) as soon as they are posted
(4) continuously, 24/7
Step 1: List the sources
Step 2: Build a general database
Step 3: Ctrl+C, Ctrl+V
• For every site:
– Find the link to the page that lists the latest records.
– For every record:
• Check whether it is a new record.
• Copy and paste its fields into a new record in my DB.
Step 3: Let’s Scrapy
Scrapy Framework
• Overview
• Architecture
• XPath
• Make a simple Scrapy program
• Scrapy is a fast high-level screen scraping and web crawling framework.
• Open-source, 100% Python => Portable
Scrapy’s GitHub info
• From 2008
• Stats
Architecture
Source: http://doc.scrapy.org/en/0.12/topics/architecture.html
XPath
Navigate through elements and attributes
in an XML document.
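As a small illustration of navigating elements and attributes, the sketch below uses Python's standard-library `xml.etree.ElementTree`, which supports only a limited subset of XPath. The XML sample and its tag and attribute names are invented for this example and do not come from any real site.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing markup, invented for this example.
SAMPLE = """
<listings>
  <home id="1" city="Hanoi"><title>House A</title><price>1200</price></home>
  <home id="2" city="Da Nang"><title>House B</title><price>950</price></home>
</listings>
"""

root = ET.fromstring(SAMPLE.strip())

# Navigate through elements: every <title> under a <home>.
titles = [t.text for t in root.findall("./home/title")]

# Filter on an attribute value, then read another attribute.
hanoi_ids = [h.get("id") for h in root.findall("./home[@city='Hanoi']")]

print(titles)
print(hanoi_ids)
```

The same path-style expressions carry over to scraping HTML pages, where the element names are tags such as `div` or `a`.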
Simple Scrapy Program
• (1) Pick a website – http://www.mininova.org/today
• (2) Define the data you want to scrape
Simple Scrapy Program (cont.)
• (3) Write a Spider to extract the data
Simple Scrapy Program (cont.)
(4) Run the spider to extract the data
(5) Review scraped data
Build an automatic scraping system for location-based apps
• Extract LatLng from address
• Extract phone number
• Real-time update, running continuously 24/7
• Prevent duplicate data
• Deploy without a dedicated server or VPS
Extract LatLng from address
• Use the Google Geocoding API
• https://maps.googleapis.com/maps/api/geocode/json?address=xxx&sensor=true_or_false&key=API_KEY
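A minimal sketch of this step, assuming the JSON response carries the coordinates under `results[0].geometry.location`. The helper names `build_geocode_url` and `extract_latlng` are hypothetical, and the sample response is trimmed down with made-up coordinates. The `sensor` parameter is included only because it appears in the URL pattern above. No network request is made here.

```python
import json
from urllib.parse import urlencode

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def build_geocode_url(address, api_key, sensor="false"):
    """Build the request URL following the pattern shown on the slide."""
    query = urlencode({"address": address, "sensor": sensor, "key": api_key})
    return f"{GEOCODE_URL}?{query}"

def extract_latlng(payload):
    """Return (lat, lng) from a decoded geocoding response, or None."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]

# Trimmed-down response with made-up coordinates, for illustration only.
sample = json.loads(
    '{"status": "OK", "results": '
    '[{"geometry": {"location": {"lat": 21.0, "lng": 105.8}}}]}'
)

print(build_geocode_url("1 Trang Tien, Hanoi", "API_KEY"))
print(extract_latlng(sample))
```

Returning `None` for a failed lookup lets the scraper store the record without coordinates and retry the address later.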
Extract Phone Number
• Use the Python port of libphonenumber.
• Sample
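The sketch below is a standard-library stand-in, not libphonenumber: it uses a plain regular expression, whereas libphonenumber validates numbers against per-country metadata. It assumes Vietnamese-style numbers that start with `+84` or `0`, followed by 9 to 10 digits with optional spaces, dots, or hyphens. The function name `extract_phones` and the numbers in the sample text are made up.

```python
import re

# Assumed shape: "+84" or a leading "0", then 9 to 10 more digits,
# optionally separated by spaces, dots, or hyphens.
PHONE_RE = re.compile(r"(?<!\d)(?:\+84|0)(?:[\s.\-]?\d){9,10}(?!\d)")

def extract_phones(text):
    """Return candidate phone numbers normalised to +84XXXXXXXXX form."""
    phones = []
    for match in PHONE_RE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if digits.startswith("84"):
            national = digits[2:]   # matched with the +84 prefix
        else:
            national = digits[1:]   # drop the leading 0
        phones.append("+84" + national)
    return phones

print(extract_phones("Lien he: 0912 345 678 hoac +84 4 3825 1234"))
```

A pattern like this only finds candidates; it cannot tell whether a number is actually valid, which is the reason to prefer a library with real numbering-plan data.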
“Real-time” updates, running continuously 24/7.
• Task Scheduler (Windows)
• Cron jobs (Linux)
Prevent duplicate data
• Write a middleware that ignores items that already exist: IgnoreExistsMiddleware
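The slides give only the middleware's name, so the following is a sketch of the idea in plain Python rather than the actual implementation, and it does not use Scrapy. The class `IgnoreExistsFilter`, the `DuplicateItem` exception, the helper `keep_new`, and the choice of `url` as the identifying field are all assumptions.

```python
import hashlib

class DuplicateItem(Exception):
    """Raised when an item has been seen before (local stand-in exception)."""

class IgnoreExistsFilter:
    """Remembers a fingerprint per item and rejects repeats."""

    def __init__(self, key_fields=("url",)):
        self.key_fields = key_fields
        self.seen = set()

    def fingerprint(self, item):
        # Hash only the fields that identify a record, not the whole item.
        raw = "|".join(
            str(item.get(f, "")).strip().lower() for f in self.key_fields
        )
        return hashlib.sha1(raw.encode("utf-8")).hexdigest()

    def process_item(self, item):
        fp = self.fingerprint(item)
        if fp in self.seen:
            raise DuplicateItem(fp)
        self.seen.add(fp)
        return item

def keep_new(items, flt):
    """Return only the items the filter has not seen yet."""
    fresh = []
    for item in items:
        try:
            fresh.append(flt.process_item(item))
        except DuplicateItem:
            pass  # already stored, skip it
    return fresh

flt = IgnoreExistsFilter()
scraped = [{"url": "http://a/1"}, {"url": "http://a/2"}, {"url": "HTTP://A/1 "}]
print(len(keep_new(scraped, flt)))
```

For a scraper that runs repeatedly, the set of seen fingerprints would need to be loaded from the database at start-up, since an in-memory set is lost between runs.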
Without a dedicated server or VPS
• Problem: my server side runs on cPanel web hosting, so Scrapy can’t be deployed there.
• Solution:
– Build a web service for syncing new records:
• /get_head_revision
• /sync
– Scrapy runs on my PC, then syncs with the server.
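The slides name the two endpoints but not their behaviour, so this sketch assumes one plausible protocol: every local record carries an increasing integer `revision`, `/get_head_revision` returns the highest revision the server already holds, and `/sync` accepts a batch of newer records. `FakeServer` is an in-memory stand-in for the hosted service, and `push_new_records` is a hypothetical client helper.

```python
class FakeServer:
    """In-memory stand-in for the hosted web service (assumed behaviour)."""

    def __init__(self):
        self.records = {}

    def get_head_revision(self):
        # Stands in for a request to /get_head_revision.
        return max(self.records, default=0)

    def sync(self, batch):
        # Stands in for a request to /sync.
        for record in batch:
            self.records[record["revision"]] = record
        return len(batch)

def push_new_records(local_records, server):
    """Send every local record newer than the server's head revision."""
    head = server.get_head_revision()
    pending = sorted(
        (r for r in local_records if r["revision"] > head),
        key=lambda r: r["revision"],
    )
    if pending:
        server.sync(pending)
    return [r["revision"] for r in pending]

server = FakeServer()
local = [{"revision": 1, "title": "House A"}, {"revision": 2, "title": "House B"}]
print(push_new_records(local, server))

local.append({"revision": 3, "title": "House C"})
print(push_new_records(local, server))
```

Asking for the head revision first keeps each upload small: only records scraped since the last successful sync leave the PC.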