web api, rest api and web scraping

40
Web and REST API + Scraping Marco Brambilla, Riccardo Volonterio @marcobrambi @manudellavalle Web Science #WebSciCo

Upload: marco-brambilla

Post on 13-Jan-2017

478 views

Category:

Technology


2 download

TRANSCRIPT

Page 1: Web API, REST API and Web Scraping

Web and REST API + ScrapingMarco Brambilla, Riccardo Volonterio

@marcobrambi@manudellavalleWeb Science #WebSciCo

Page 2: Web API, REST API and Web Scraping

The Data Analysis Pipeline

Page 3: Web API, REST API and Web Scraping

The 5 3 Ws of (Social) API & Scraping

1) What2) Why3) Who4) When? Now.5) Where? Everywhere.6) How7) Use cases

a. UrbanScopeb. La Scala

Page 4: Web API, REST API and Web Scraping

What

“An API (Application Program Interface) is a set of routines, protocols and tools for building software applications”

–Wikipedia

A WebAPI = HTTP based API Allows a programmatic access to data and platform

And what if no API is available?

Page 5: Web API, REST API and Web Scraping

What

“An API (Application Program Interface) is a set of routines, protocols and tools for building software applications”

–Wikipedia

A WebAPI = HTTP based API Allows a programmatic access to data and platform

And what if no API is available? Scraping!

Page 6: Web API, REST API and Web Scraping

Why

• Separation between model and presentation• Regulate access to the data

• Traceable accounts• Enable access to only a portion of the available data• By Time• By Location

• Avoid direct access to the platform website• Request throttling

• Avoid server congestion• Avoid DOS/DDOS

• Provide paid access to full (or higher volume) data

Page 7: Web API, REST API and Web Scraping

Who

Morethan14.000APIsavailable(seeprogrammableweb.com)

Page 8: Web API, REST API and Web Scraping

How

• URLs• API types

• RESTful APIs• Authentication• Scraping

Page 9: Web API, REST API and Web Scraping

How - URL

URL format (RFC 2396):scheme:[//[user:password@]host[:port]][/]path[?query][#fragment]

E.G.:[email protected]:nodejs/node.gitmongodb://root:pass@localhost:27017/TestDB?options?replicaSet=testhttp://example.com

abc://user:[email protected]:123/path/data?key=value&k2=v2#fragid1

scheme credentials domain port path querystring fragment

Page 10: Web API, REST API and Web Scraping

How – HTTP request

When the scheme is HTTP(s), we can make HTTP requests as:VERB URL

The VERB (or method) can be one of:• GET• POST• DELETE• OPTIONS• and more: http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html

Page 11: Web API, REST API and Web Scraping

How - Endpoints

Each API exposes a set of HTTP(s) endpoints (URLs) to call in order to get data or perform action on the platform.

Example (retrieve data):GET https://graph.facebook.com/searchGET https://api.twitter.com/1.1/search/tweets.jsonGET https://api.instagram.com/v1/tags/nofilter/media/recent

Example (platform actions):POST https://api.instagram.com/v1/media/{media-id}/commentsPOST https://api.twitter.com/1.1/statuses/update.jsonPOST https://graph.facebook.com/me/feed

Legend:PROTOCOLDOMAINAPIVERSIONENDPOINTHTTPMETHOD

Page 12: Web API, REST API and Web Scraping

How - Parameters

Most of the endpoints can be “tweaked” via one or more parameters. Parameters can be added to an URL by appending to the endpoint some (URL encoded) strings.

Example (Twitter):…/search/tweets.json?q=%23expo2015milano&lang=it&result_type=recent&count=100&token=1235abcd

The request in plain English is: “Twitter, I am 1235abcd, can you please give me the 100 most recent tweets, in italian, by the user @expo2015milano?”

Page 13: Web API, REST API and Web Scraping

How - REST

REST“Representational state transfer (REST) is a style of software architecture.”

- Roy T. Fielding

Client-Server: a uniform interface separates clients from servers.Stateless: the client-server communication is constrained by no client contextbeing stored on the server.Cacheable: clients and intermediaries can cache responses.Layered system: clients cannot tell whether are connected directly to theserver.Uniform interface: simplifies and decouples the architecture.

Page 14: Web API, REST API and Web Scraping

How - APIs

WebAPI vs RESTful API

WebAPI: An API that uses the HTTP protocol. Non standard/unpredictable URL format.

POST http://example.com/getUserData.do?user=1234

RESTful API: Resource based WebAPI. Standard/predictable URL format.GET http://example.com/user/1234

Page 15: Web API, REST API and Web Scraping

How – RESTful API

The RESTful API uses the available HTTP verbs to perform CRUD operations based on the “context”:• Collection: A set of items (e.g.: /users)• Item: A specific item in a collection (e.g.: /users/{id})

VERB Collection Item

POST Create anewitem. Notused

GET Get listofelements. Gettheselected item.

PUT Notused Updatetheselected item.

DELETE Notused Delete theselected item.

Page 16: Web API, REST API and Web Scraping

How - Authentication

Almost all the APIs require a kind of user authentication. The user mustregister to the developer of the provider to obtain the keys to access the API.Most of the main API providers use the OAuth protocol to authenticate auser.

The user registers to the developer portal to obtain a key and a secret.Platform authenticates the user via key/secret pair and supply a token thatcan be used to perform HTTP(S) requests to the desired API endpoint.

Page 17: Web API, REST API and Web Scraping

Challenges

• Pagination• Timelines

• Parallelization• Multiple accounts• API gaps

Page 18: Web API, REST API and Web Scraping

Challenges - Pagination

Like SQL queries, most APIs support data pagination to split huge chunks of data into smaller set of data.

Smaller chunks of data are easier to create, transfer (avoid long response time), cache and require less server computation time.

Page 19: Web API, REST API and Web Scraping

Challenges - Timeline

Most of the Social Networks leverage on the concept of timelines instead ofstandard pagination to iterate through results.

An ideal pagination will work like the following (Twitter example):

Page 20: Web API, REST API and Web Scraping

Challenges – Timeline 2

The problem is that dataare continuously added tothe top of the Twittertimeline.

So the next requestrequest can lead toretrieving the alreadyprocessed tweets.

Page 21: Web API, REST API and Web Scraping

Challenges – Timeline 3

The solution is to use thecursoring technique.

Instead of reading fromthe top of the timeline weread the data relative tothe already processedids.

Page 22: Web API, REST API and Web Scraping

Challenges - Parallelization

When possible make parallel requests to gather more data in less time.

Example:Get all the media from Instagram about @potus and #obama.

We can make two parallel requests (one for each tag) and paginate the results instead of waiting for the first tag to finish.

Some time this is not possible due to the requests interdependency like in pagination; each request depends on the result of previous one.

Page 23: Web API, REST API and Web Scraping

Challenges – Multiple accounts

Handling multiple accounts can increase the system throughput but increases the complexity of the system.

Strategies to handle multiple accounts:• Request based

• Round robin• Account pool

• Account based• Requests stack

Page 24: Web API, REST API and Web Scraping

Challenges – Multi accounts – Round robin

Theaccountthroughwhichtherequestismadeischosenusingtheround-robinstrategy.

Page 25: Web API, REST API and Web Scraping

Challenges – Multi accounts – Account pool

The account through which the request is made is chosen sequentially until it’s rate limitis reached, then we use the next available one.

Page 26: Web API, REST API and Web Scraping

Challenges – Multi accounts – Requests Stack

This strategy overturn the roles. Now the accounts, inparallel, get the next request to do from the pool. Once anaccount completes a request the cycle restarts until the poolis empty.

Page 27: Web API, REST API and Web Scraping

Challenges – Fill Gaps

Sometimes the API does not give us all the data we want, so we need to fill those gaps.

On quantity or time (Twitter) or on quality/content (Foursquare).

Example: Twitter time limited data. We need to access data older then 10 days.

Example:Foursquare check-ins timeline not available.Facebook like timestamps not available.

Buildyourowntimeline.

Page 28: Web API, REST API and Web Scraping

Use case – UrbanScope (Crawler)

“An API Crawler is a software that methodically interacts with a WebAPI to download data or to take some actions at predefined time intervals.”

http://urbanscope.polimi.it/

Page 29: Web API, REST API and Web Scraping

Use case – UrbanScope (Twitter)

Goal: Get all the geolocated tweets in Milan and analyze the languages to detect anomalies.

Problem: Twitter does not provide a simple endpoint “give me all tweets in Milan”.

Solution: Create a grid of points (50m distance) over Milan and ask Twitter for each point.

Problem: 70.000 points, with 1 account is 4 days.

Solution: Use more keys!

Problem: Handle multiple keys.

Solution: Cry… Then handle multiple keys.

Page 30: Web API, REST API and Web Scraping

Use case - UrbanScope - Explore Tweets

Page 31: Web API, REST API and Web Scraping

Use case – UrbanScope (Foursquare)

Goal: See how the check-ins progress in time to see where people go and visualize the pulse of Milan.

Problem: Foursquare does not provide the timeline of the check-ins.

Solution: Build our own timeline by collecting the number of check-ins every 4 hours for 90.000 venues.

Problem: 90.000 api requests to do in 4 hours with an API time limit of 5000 requests per hour.

Solution: See twitter… with more tears and more keys (24)

Problem: Foursquare ban accounts that seems to crawl data automatically -.-

Page 32: Web API, REST API and Web Scraping

Use case – UrbanScope – City magnets

Page 33: Web API, REST API and Web Scraping

Use case – UrbanScope- Top Venues

Page 34: Web API, REST API and Web Scraping

0

200

400

600

800

1000

1200

1400

Use case – Teatro Alla Scala (Milan)

Page 35: Web API, REST API and Web Scraping

Scraping

“Web scraping means collecting information from websites by extracting them directly from the HTML source code.”

Page 36: Web API, REST API and Web Scraping

Use case – La Scala (Scraper)

Goal: Steal Get data from twitter for 1 year ago.

Problem: API allow access to 10 days in the past.

Solution: The Twitter homepage allows to search without a time limit. Crete a web scraper that collect data from the HTML.

Problem: The page needs to scroll to load more data.

Solution: Use a (headless) fake browser to simulate the user interaction.

Problem: You need to “start a browser” to get data, can be memory greedy.

Solution: No. Maybe.

Page 37: Web API, REST API and Web Scraping

Use case – La Scala (Scraper)

Once we have all the data we can extract the required data using the following pipeline:

Gathertweets ExtractSentiment Trainclassifiers

Automaticclassification Extractusers

CalculatePopularityscore

Page 38: Web API, REST API and Web Scraping

Scraping strategies

Pre-packaged headless browsers– PhantomJS for JavaScript– Study and replicate/simulate navigation

programmatively

Manual implementation of navigation– Replicate JS (AJAX) invocations – Capture the data from the page– Marshall the data

Page 39: Web API, REST API and Web Scraping

Guidelines and example

• Scrape the minimum possible• Turn to API invocations as much as possible

• Example: Twitter Scraping• Grab the tweet IDs• Get the detailed data about tweets via API

• https://github.com/Volox/TwitterScraper

Page 40: Web API, REST API and Web Scraping

Marco Brambilla, @marcobrambi, [email protected] Della Valle, @manudellavalle, [email protected]

http://datascience.deib.polimi.it/course