How to scrape content from the web for a location-based mobile app

Scraping content from web for location-based mobile app.

Upload: diep-bao-den

Post on 13-May-2015

Category: Technology


TRANSCRIPT

Page 1: How to scrape content from the web for a location-based mobile app

Scraping content from web for location-based mobile app.

Page 2

Nguyen Hong Diep, founder, magik.vn

Page 3

Summary

1. Web Scraping
– Definitions
– Added value
– Analysis of a sample case

2. Scrapy Framework
– Overview
– Architecture
– A simple Scrapy program

3. Build an automated scraping system for location-based apps
– Extract LatLng from address
– Extract phone number
– Real-time updates, running continuously 24/7
– Prevent duplicate data
– Deploy without a dedicated server or VPS

Page 4

Web crawler

An Internet bot that systematically browses the World Wide Web, typically for web indexing.

Source: wikipedia.org

Page 5

Scrape

Crawl websites and extract structured data from pages.

Source: wikipedia.org

Page 6

Added Value?

Page 7

giamua.com – “groupon”

Page 8

baomoi.com

Page 9

Added Value?

Same user experience, but more content than

Page 10

oizoioi.vn: price comparison for electronics

Page 11

Added Value?

Create new knowledge from many pieces of information.

DIKW Hierarchy (top to bottom): Wisdom, Knowledge, Information, Data

Page 12

Nha Tro Tot

Page 13
Page 14

Added Value?

The smartphone revolution: a new platform needs a new user experience.

Source: www.widexconnect.ca

Page 15

And more

Source: Laban.vn

Page 16

Analysis of a sample case

Need:

(1) Collect [homes for sale] records from the web
(2) from many websites in Vietnam
(3) as soon as they are posted
(4) continuously, 24/7

Page 17

Step 1: List the sources

Page 18

Step 2: Build a general database

Page 19

Step 3: Ctrl+C, Ctrl+V

• For every site:
  – Find the link to the page that lists the latest records.
  – For every record:
    • Check whether the record is new.
    • Copy and paste its fields into a new record in my DB.
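The manual procedure above amounts to a simple loop, which can be sketched in Python. This is only an illustration: the field names, the `latest_records` key, and the use of the record URL as a unique key are all assumptions, and the in-memory `sites` list stands in for pages that would really be opened in a browser.

```python
# Hypothetical field names for a [home for sale] record (not from the slides).
FIELDS = ("title", "address", "price", "phone")


def collect_new_records(sites, database):
    """Copy records that are not yet in `database` (a dict keyed by record URL)."""
    added = []
    for site in sites:
        # Stand-in for opening the site's "latest records" listing page.
        for record in site["latest_records"]:
            if record["url"] in database:
                continue  # not a new record, skip it
            # "Copy & paste" the known fields into a new DB record.
            database[record["url"]] = {f: record.get(f) for f in FIELDS}
            added.append(record["url"])
    return added


sites = [
    {
        "name": "site-a.example",
        "latest_records": [
            {"url": "http://site-a.example/1", "title": "House A", "address": "Hanoi"},
        ],
    },
]
database = {}
first_run = collect_new_records(sites, database)
second_run = collect_new_records(sites, database)
```

Running the loop twice over the same listing adds the record only once, which is the "check if new record" step in the list above.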

Page 20

Step 3: Ctrl+C, Ctrl+V

Page 21

Step 3: Let’s Scrapy

Page 22

Scrapy Framework

• Overview
• Architecture
• XPath
• Make a simple Scrapy program

Page 23

• Scrapy is a fast high-level screen scraping and web crawling framework.

• Open-source, 100% Python => Portable

Page 24

Scrapy’s GitHub info

• Since 2008

• Stats

Page 25

Architecture

Source: http://doc.scrapy.org/en/0.12/topics/architecture.html

Page 26

XPath

Navigate through elements and attributes in an XML document.
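As a small illustration of navigating elements and attributes, the sketch below uses the limited XPath subset in Python's standard `xml.etree.ElementTree` module. The sample document and its element names are made up for this example; Scrapy's own selectors accept full XPath 1.0 expressions, so real spiders can use richer queries than shown here.

```python
import xml.etree.ElementTree as ET

# Made-up sample document, used only for illustration.
DOC = """
<listings>
  <home id="1" city="Hanoi"><title>House A</title></home>
  <home id="2" city="Hue"><title>House B</title></home>
</listings>
""".strip()

root = ET.fromstring(DOC)

# Navigate through elements: every <title> under a <home>.
titles = [element.text for element in root.findall("./home/title")]

# Navigate by attribute: the id of every <home> whose city is Hanoi.
hanoi_ids = [element.get("id") for element in root.findall("./home[@city='Hanoi']")]
```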

Page 27

Simple Scrapy Program

• (1) Pick a website – http://www.mininova.org/today

• (2) Define the data you want to scrape

Page 28

Simple Scrapy Program (cont.)

• (3) Write a Spider to extract the data

Page 29
Page 30

Simple Scrapy Program (cont.)

(4) Run the spider to extract the data

(5) Review scraped data

Page 31

Build an automated scraping system for location-based apps

• Extract LatLng from address
• Extract phone number
• Real-time updates, running continuously 24/7
• Prevent duplicate data
• Deploy without a dedicated server or VPS

Page 32

Extract LatLng from address

• Use the Google Geocoding API:
• https://maps.googleapis.com/maps/api/geocode/json?address=xxx&sensor=true_or_false&key=API_KEY
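A minimal sketch of calling this endpoint from Python with only the standard library follows. The function names are illustrative, not part of any API. The `sensor` parameter is left out because Google no longer requires it, and the parser assumes the documented JSON layout in which the coordinates sit under `results[0].geometry.location`.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address, api_key):
    """Build the request URL; urlencode escapes spaces and non-ASCII text."""
    return GEOCODE_URL + "?" + urlencode({"address": address, "key": api_key})


def parse_latlng(payload):
    """Return (lat, lng) from a decoded geocoding response, or None on failure."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


def geocode(address, api_key):
    """Perform the HTTP request (needs network access and a valid API key)."""
    with urlopen(build_geocode_url(address, api_key), timeout=10) as response:
        return parse_latlng(json.load(response))
```

Keeping URL building and response parsing separate from the network call makes both easy to check offline, and lets a scraper cache results per address to stay within API quotas.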

Page 33

Extract LatLng from address (cont.)

Page 34

Extract LatLng from address (cont.)

Page 35

Extract Phone Number

• Use the Python port of libphonenumber.

• Sample
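To show the idea without third-party dependencies, here is a deliberately simplified stand-in that uses a regular expression instead of libphonenumber. It is not the library's API and is far less accurate: the pattern, the accepted lengths, and the `+84` normalization are assumptions made for Vietnamese numbers in free text, whereas libphonenumber validates against real numbering-plan metadata.

```python
import re

# Assumed shape: "+84" or a leading "0", then 8 to 10 more digits that may be
# separated by spaces, dots, or hyphens. Not a full numbering-plan check.
CANDIDATE = re.compile(r"(?<!\d)(?:\+84|0)(?:[ .\-]?\d){8,10}(?!\d)")


def extract_phone_numbers(text):
    """Return unique candidate numbers normalized to a +84... form."""
    found = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        # Drop the country code "84" or the national prefix "0".
        national = digits[2:] if digits.startswith("84") else digits[1:]
        normalized = "+84" + national
        if normalized not in found:
            found.append(normalized)
    return found


numbers = extract_phone_numbers("Lien he: 0912 345 678 hoac +84 912.345.678")
```

Normalizing every match to one canonical form means the same number written in different styles is stored only once.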

Page 36

“Real time” update and continuous 24/7.

• Task Scheduler (Windows)

• Cron jobs (Linux)

Page 37

Prevent duplicate data

• Make a middleware that ignores items that already exist: IgnoreExistsMiddleware
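One way such a middleware could work is sketched below in plain Python. It follows the shape of Scrapy's spider middleware hook `process_spider_output(response, result, spider)`, but the details are assumptions: items are taken to be plain dicts, the `url` field is a hypothetical unique key, and the set of already stored keys is kept in memory rather than read from the real database.

```python
class IgnoreExistsMiddleware:
    """Drop scraped items whose key has already been seen or stored."""

    def __init__(self, seen_keys=None):
        # In a real system this would be loaded from the database (assumption).
        self.seen_keys = set(seen_keys or ())

    def process_spider_output(self, response, result, spider):
        for entry in result:
            # Only dict items are filtered; follow-up requests pass through.
            if isinstance(entry, dict):
                key = entry.get("url")  # hypothetical unique key
                if key in self.seen_keys:
                    continue  # already exists, ignore it
                self.seen_keys.add(key)
            yield entry


middleware = IgnoreExistsMiddleware(seen_keys={"http://site-a.example/1"})
scraped = [
    {"url": "http://site-a.example/1"},
    {"url": "http://site-a.example/2"},
    {"url": "http://site-a.example/2"},
]
kept = list(middleware.process_spider_output(None, scraped, None))
```

In a Scrapy project, a class like this would be enabled through the `SPIDER_MIDDLEWARES` setting. An item pipeline that raises `DropItem` is a common alternative for the same job.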

Page 38

Without a dedicated server or VPS

• Problem: my server side runs on cPanel web hosting, so Scrapy can’t be deployed there.

• Solution:
  – Build web services for syncing new records:
    • /get_head_revision
    • /sync
  – Scrapy runs on my PC, then syncs with the server.
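The slides do not describe how the two endpoints behave, so the following is only a guess at one workable design: the server reports the highest revision number it already holds, and the PC uploads every locally scraped record with a higher revision. The `revision` field and both callables are hypothetical; in practice the callables would wrap HTTP requests to `/get_head_revision` and `/sync`.

```python
def pending_records(local_records, head_revision):
    """Select the records the server has not seen yet (assumed rule)."""
    return [r for r in local_records if r["revision"] > head_revision]


def sync_with_server(local_records, get_head_revision, post_sync):
    """Upload new records; returns how many records were sent."""
    head = get_head_revision()   # would call /get_head_revision
    batch = pending_records(local_records, head)
    if batch:
        post_sync(batch)         # would POST the batch to /sync
    return len(batch)


# Usage with in-memory stand-ins for the two web service calls.
local = [{"revision": n, "title": "House %d" % n} for n in (1, 2, 3)]
uploaded = []
sent = sync_with_server(local, lambda: 1, uploaded.extend)
```

Passing the transport in as callables keeps the sync logic testable on the PC without touching the hosted server.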