
Scrapinghub Documentation

Scrapinghub

Oct 09, 2019


Contents

1 Scrapy Cloud API
  1.1 Getting started
  1.2 API endpoints
  1.3 Pagination
  1.4 Result formats
  1.5 Headers
  1.6 Meta parameters

2 Scrapy Cloud Write Entrypoint
  2.1 Scrapy Cloud Write Entrypoint

3 Crawlera API
  3.1 Crawlera API

4 Crawlera Stats API
  4.1 Crawlera Stats API

5 AutoExtract API
  5.1 AutoExtract API

6 Unified Schema
  6.1 Unified Schema


Note: This is the documentation of Scrapinghub APIs for Scrapy Cloud and Crawlera. For help guides and other articles please check our Help Center.

See Scrapy Cloud API.


CHAPTER 1

Scrapy Cloud API

Note: Check also the Help Center for general guides and articles.

Scrapy Cloud provides an HTTP API for interacting with your spiders, jobs and scraped data.

1.1 Getting started

1.1.1 Authentication

You’ll need to authenticate using your API key.

There are two ways to authenticate:

HTTP Basic:

$ curl -u APIKEY: https://storage.scrapinghub.com/foo

URL Parameter:

$ curl https://storage.scrapinghub.com/foo?apikey=APIKEY

1.1.2 Example

Running a spider is simple:

$ curl -u APIKEY: https://app.scrapinghub.com/api/run.json -d project=PROJECT -d spider=SPIDER

Where APIKEY is your API key, PROJECT is the spider's project ID, and SPIDER is the name of the spider you want to run.


It’s possible to override Scrapy settings for a job:

$ curl -u APIKEY: https://app.scrapinghub.com/api/run.json -d project=PROJECT -d spider=SPIDER \
    -d job_settings='{"LOG_LEVEL": "DEBUG"}'

job_settings should be valid JSON and will be merged with the project and spider settings defined for the given spider.
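For reference, the same call can be made from Python. The snippet below is a minimal sketch using the third-party requests library (not the python-scrapinghub client); the APIKEY, PROJECT and SPIDER values are placeholders you would replace with your own.

import requests

APIKEY = "your-api-key"   # placeholder: your Scrapinghub API key
PROJECT = 123             # placeholder: your project ID
SPIDER = "somespider"     # placeholder: your spider name

# Schedule a job, overriding a Scrapy setting for this run only.
response = requests.post(
    "https://app.scrapinghub.com/api/run.json",
    auth=(APIKEY, ""),    # HTTP Basic auth: API key as username, empty password
    data={
        "project": PROJECT,
        "spider": SPIDER,
        "job_settings": '{"LOG_LEVEL": "DEBUG"}',
    },
)
print(response.json())    # e.g. {"status": "ok", "jobid": "123/1/1"}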

1.2 API endpoints

1.2.1 app.scrapinghub.com

Jobs API

The jobs API makes it easy to work with your spider’s jobs and lets you schedule, stop, update and delete them.

Note: Most of the features provided by the API are also available through the python-scrapinghub client library.

run.json

Schedules a job for a given spider.

Parameter     Description                                                              Required
project       Project ID.                                                              Yes
spider        Spider name.                                                             Yes
add_tag       Add specified tag to job.                                                No
priority      Job priority. Supported values: 0 (lowest) to 4 (highest). Default: 2.   No
job_settings  Job settings represented as a JSON object.                               No
units         Amount of units to run job. Supported values: 1 to 6.                    No

Note: Any other parameter will be treated as a spider argument.

Method  Description                     Supported parameters
POST    Schedule the specified spider.  project, job, spider, add_tag, priority, job_settings

Example:

$ curl -u APIKEY: https://app.scrapinghub.com/api/run.json -d project=123 -d spider=somespider \
    -d units=2 -d add_tag=sometag -d spiderarg1=example \
    -d job_settings='{ "setting1": "value1", "setting2": "value2" }'
{"status": "ok", "jobid": "123/1/1"}

jobs/list.{json,jl}

Retrieve job information for a given project, spider, or specific job.

Parameter  Description                           Required
project    Project ID.                           Yes
job        Job ID.                               No
spider     Spider name.                          No
state      Return jobs with specified state.     No
has_tag    Return jobs with specified tag.       No
lacks_tag  Return jobs that lack specified tag.  No

Supported state values: pending, running, finished, deleted.

Method  Description                Supported parameters
GET     Retrieve job information.  project, job, spider, state, has_tag, lacks_tag

Examples:

# Retrieve the latest 3 finished jobs
$ curl -u APIKEY: "https://app.scrapinghub.com/api/jobs/list.json?project=123&spider=somespider&state=finished&count=3"
{
    "status": "ok",
    "count": 3,
    "total": 3,
    "jobs": [{
        "responses_received": 1,
        "items_scraped": 2,
        "close_reason": "finished",
        "logs": 29,
        "tags": [],
        "spider": "somespider",
        "updated_time": "2015-11-09T15:21:06",
        "priority": 2,
        "state": "finished",
        "version": "1447064100",
        "spider_type": "manual",
        "started_time": "2015-11-09T15:20:25",
        "id": "123/45/14544",
        "errors_count": 0,
        "elapsed": 138399
    },{
        "responses_received": 1,
        "items_scraped": 2,
        "close_reason": "finished",
        "logs": 29,
        "tags": [
            "consumed"
        ],
        "spider": "somespider",
        "updated_time": "2015-11-09T14:21:02",
        "priority": 2,
        "state": "finished",
        "version": "1447064100",
        "spider_type": "manual",
        "started_time": "2015-11-09T14:20:25",
        "id": "123/45/14543",
        "errors_count": 0,
        "elapsed": 3433762
    },{
        "responses_received": 1,
        "items_scraped": 2,
        "close_reason": "finished",
        "logs": 29,
        "tags": [
            "consumed"
        ],
        "spider": "somespider",
        "updated_time": "2015-11-09T13:21:08",
        "priority": 2,
        "state": "finished",
        "version": "1447064100",
        "spider_type": "manual",
        "started_time": "2015-11-09T13:20:31",
        "id": "123/45/14542",
        "errors_count": 0,
        "elapsed": 7034158
    }]
}

# Retrieve all running jobs
$ curl -u APIKEY: "https://app.scrapinghub.com/api/jobs/list.json?project=123&state=running"
{
    "status": "ok",
    "count": 2,
    "total": 2,
    "jobs": [{
        "responses_received": 483,
        "items_scraped": 22,
        "logs": 20,
        "tags": [],
        "spider": "somespider",
        "elapsed": 17442,
        "priority": 2,
        "state": "running",
        "version": "1447064100",
        "spider_type": "manual",
        "started_time": "2015-11-09T15:25:07",
        "id": "123/45/13140",
        "errors_count": 0,
        "updated_time": "2015-11-09T15:26:43"
    },{
        "responses_received": 207,
        "items_scraped": 207,
        "logs": 468,
        "tags": [],
        "spider": "someotherspider",
        "elapsed": 4085,
        "priority": 3,
        "state": "running",
        "version": "1447064100",
        "spider_type": "manual",
        "started_time": "2015-11-09T13:00:46",
        "id": "123/67/11952",
        "errors_count": 0,
        "updated_time": "2015-11-09T15:26:57"
    }]
}

# Retrieve all jobs that lack the tag ``consumed``
$ curl -u APIKEY: "https://app.scrapinghub.com/api/jobs/list.json?project=123&lacks_tag=consumed"
{
    "status": "ok",
    "count": 3,
    "total": 3,
    "jobs": [{
        "responses_received": 208,
        "items_scraped": 208,
        "logs": 471,
        "tags": ["sometag"],
        "spider": "somespider",
        "elapsed": 1010,
        "priority": 3,
        "state": "running",
        "version": "1447064100",
        "spider_type": "manual",
        "started_time": "2015-11-09T13:00:46",
        "id": "123/45/11952",
        "errors_count": 0,
        "updated_time": "2015-11-09T15:28:27"
    },{
        "responses_received": 619,
        "items_scraped": 22,
        "close_reason": "finished",
        "logs": 29,
        "tags": ["sometag"],
        "spider": "someotherspider",
        "updated_time": "2015-11-09T15:27:20",
        "priority": 2,
        "state": "finished",
        "version": "1447064100",
        "spider_type": "manual",
        "started_time": "2015-11-09T15:25:07",
        "id": "123/67/13140",
        "errors_count": 0,
        "elapsed": 67409
    },{
        "responses_received": 3,
        "items_scraped": 20,
        "close_reason": "finished",
        "logs": 58,
        "tags": ["sometag", "someothertag"],
        "spider": "yetanotherspider",
        "updated_time": "2015-11-09T15:25:28",
        "priority": 2,
        "state": "finished",
        "version": "1447064100",
        "spider_type": "manual",
        "started_time": "2015-11-09T15:25:07",
        "id": "123/89/1627",
        "errors_count": 0,
        "elapsed": 179211
    }]
}

jobs/update.json

Updates information about jobs.

Parameter   Description                     Required
project     Project ID.                     Yes
job         Job ID.                         Yes
add_tag     Add specified tag to job.       No
remove_tag  Remove specified tag from job.  No

Method  Description              Supported parameters
POST    Update job information.  project, job, add_tag, remove_tag

Example:

$ curl -u APIKEY: https://app.scrapinghub.com/api/jobs/update.json -d project=123 -d job=123/1/2 -d add_tag=consumed

jobs/delete.json

Deletes one or more jobs.

Parameter  Description  Required
project    Project ID.  Yes
job        Job ID.      Yes

Method  Description     Supported parameters
POST    Delete job(s).  project, job

Example:

$ curl -u APIKEY: https://app.scrapinghub.com/api/jobs/delete.json -d project=123 -d job=123/1/2 -d job=123/1/3

jobs/stop.json

Stops one running job.

Parameter  Description  Required
project    Project ID.  Yes
job        Job ID.      Yes

Method  Description  Supported parameters
POST    Stop job.    project, job

Example:

$ curl -u APIKEY: https://app.scrapinghub.com/api/jobs/stop.json -d project=123 -d job=123/1/1

Comments API

The comments API lets you add comments directly to scraped data, which can later be viewed on the items page.

Comment object

Field     Description
id        Comment ID.
created   Created date.
archived  Archived date.
author    Comment author.
avatar    User gravatar URL.
text      Comment text.
editable  If set to true, comment can be edited.

comments/:comment_id

Edits or archives a comment.

Parameter   Description    Required
comment_id  Comment ID.    Yes
text        Comment text.  PUT

Method  Description           Supported Parameters
PUT     Update comment text.  comment_id, text
DELETE  Delete comment.       comment_id

PUT example:

$ curl -X PUT -u APIKEY: --data 'text=my+new+text' "https://app.scrapinghub.com/api/comments/12"

DELETE example:

$ curl -X DELETE -u APIKEY: "https://app.scrapinghub.com/api/comments/12"

comments/:project_id/:spider_id/:job_id

Retrieves all comments for a job indexed by item or item/field.

Example:

$ curl -u APIKEY: "https://app.scrapinghub.com/api/comments/14/13/12"{

"0": [comment, comment, ...],"0/title": [comment, comment, ...],"12/url": [comment, comment, ...],

}

Where comment is a comment object as defined above.

comments/:project_id/stats

Retrieves the number of items with unarchived comments for each job of the project.

Example:

$ curl -u APIKEY: "https://app.scrapinghub.com/api/comments/51/stats"{

"51/422/2": 1,"51/414/2": 1,"51/421/2": 1,"51/423/2": 4,"51/413/3": 3,"51/418/2": 1

}

comments/:project_id/:spider_id/:job_id/:item_no[/:field]

Retrieves, updates or archives comments.

Parameter  Description    Required
text       Comment text.  POST

Method  Description                                          Supported parameters
GET     Retrieve comments for an item or field.
POST    Update the specified comments with the given text.   text
DELETE  Archive the specified comment.

GET examples:

$ curl -u APIKEY: "https://app.scrapinghub.com/api/comments/14/13/12/11"
$ curl -u APIKEY: "https://app.scrapinghub.com/api/comments/14/13/12/11/logo"

POST examples:

$ curl -X POST --data 'text=some+text' -u APIKEY: "https://app.scrapinghub.com/api/comments/14/13/12/11"
$ curl -X POST --data 'text=some+text' -u APIKEY: "https://app.scrapinghub.com/api/comments/14/13/12/11/logo"

DELETE examples:

$ curl -X DELETE -u APIKEY: "https://app.scrapinghub.com/api/comments/14/13/12/11"
$ curl -X DELETE -u APIKEY: "https://app.scrapinghub.com/api/comments/14/13/12/11/logo"

1.2.2 storage.scrapinghub.com

JobQ API

The JobQ API allows you to retrieve finished jobs from the queue.

Note: Most of the features provided by the API are also available through the python-scrapinghub client library.

jobq/:project_id/count

Count the jobs for the specified project.

Parameter  Description                                                   Required
spider     Filter results by spider name.                                No
state      Filter results by state (pending/running/finished/deleted).   No
startts    UNIX timestamp at which to begin results, in milliseconds.    No
endts      UNIX timestamp at which to end results, in milliseconds.      No
has_tag    Filter results by existing tags.                              No
lacks_tag  Filter results by missing tags.                               No

Hint: It's possible to repeat has_tag and lacks_tag multiple times. In this case has_tag works as an OR operation, while lacks_tag works as an AND operation.

HTTP (assuming only two jobs, where the first is tagged tagA and the second tagB):

$ curl -u APIKEY: "https://storage.scrapinghub.com/jobq/53/count"
2
$ curl -u APIKEY: "https://storage.scrapinghub.com/jobq/53/count?has_tag=tagA&has_tag=tagB"
2
$ curl -u APIKEY: "https://storage.scrapinghub.com/jobq/53/count?lacks_tag=tagA&lacks_tag=tagB"
0

Method  Description                            Supported parameters
GET     Count jobs for the specified project.  spider, state, startts, endts, has_tag, lacks_tag

Examples

Count jobs for a given project

HTTP:

$ curl -u APIKEY: https://storage.scrapinghub.com/jobq/53/count
32110

jobq/:project_id/list

Lists the jobs for the specified project, from most recent to oldest.

Field  Description
ts     The time at which the job was added to the queue.

Parameter  Description                                                     Required
spider     Filter results by spider name.                                  No
state      Filter results by state (pending, running, finished, deleted).  No
startts    UNIX timestamp at which to begin results, in milliseconds.      No
endts      UNIX timestamp at which to end results, in milliseconds.        No
count      Limit results to a given number of jobs.                        No
start      Skip the first N jobs in the results.                           No
stop       The job key at which to stop showing results.                   No
key        Get job data for a given set of job keys.                       No
has_tag    Filter results by existing tags.                                No
lacks_tag  Filter results by missing tags.                                 No

Method  Description                           Supported parameters
GET     List jobs for the specified project.  startts, endts, stop

Examples

List jobs for a given project

HTTP:

$ curl -u APIKEY: https://storage.scrapinghub.com/jobq/53/list
{"key":"53/7/81","ts":1397762393489}
{"key":"53/7/80","ts":1395111612849}
{"key":"53/7/78","ts":1393972804722}
{"key":"53/7/77","ts":1393972734215}

List jobs finished between two timestamps

If you pass the startts and endts parameters, the API will return only the jobs finished between them.

HTTP:

$ curl -u APIKEY: "https://storage.scrapinghub.com/jobq/53/list?startts=1359774955431&endts=1359774955440"
{"key":"53/6/7","ts":1359774955439}
{"key":"53/3/3","ts":1359774955437}
{"key":"53/9/1","ts":1359774955431}

Retrieve jobs finished after some job

JobQ returns the list of jobs with the most recently finished first. We recommend associating the key of the most recently finished job with the downloaded data. When you want to update your data later on, you can list the jobs and stop at the previously downloaded job, using the stop parameter.

Using HTTP:

$ curl -u APIKEY: "https://storage.scrapinghub.com/jobq/53/list?stop=53/7/81"
{"key":"53/7/83","ts":1403610146780}
{"key":"53/7/82","ts":1397827910849}

Job metadata API

The Job metadata API allows you to get metadata for the given jobs.

Note: Most of the features provided by the API are also available through the python-scrapinghub client library.

jobs/:project_id/:spider_id/:job_id[/:field_name]

Retrieve job data or specific meta field.

Examples

Get metadata for the job

HTTP:

$ curl -u APIKEY: https://storage.scrapinghub.com/jobs/1/2/3

{"close_reason": "finished","completed_by": "jobrunner","deploy_id": 1,"finished_time": 1566311833872,

(continues on next page)

1.2. API endpoints 13

Page 18: Scrapinghub Documentation - Read the Docs...Note: Most of the features provided by the API are also available through the python-scrapinghub client library. run.json Schedules a job

Scrapinghub Documentation

(continued from previous page)

"pending_time": 1566311800654,"priority": 2,"project": 1,"running_time": 1566311801163,"scheduled_by": "testuser","scrapystats": {

"downloader/request_bytes": 594,"downloader/request_count": 2,"downloader/request_method_count/GET": 2,"downloader/response_bytes": 1866,"downloader/response_count": 2,"downloader/response_status_count/200": 1,"downloader/response_status_count/404": 1,"elapsed_time_seconds": 3.211014,"finish_reason": "finished","finish_time": 1566311822568.0,"item_scraped_count": 1,"log_count/DEBUG": 3,"log_count/INFO": 11,"log_count/WARNING": 1,"memusage/max": 72433664,"memusage/startup": 72433664,"response_received_count": 2,"robotstxt/request_count": 1,"robotstxt/response_count": 1,"robotstxt/response_status_count/404": 1,"scheduler/dequeued": 1,"scheduler/dequeued/disk": 1,"scheduler/enqueued": 1,"scheduler/enqueued/disk": 1,"start_time": 1566311819357.0

},"spider": "testspider","spider_args": {"arg1": "val1", "arg2": "val2"},"spider_type": "manual","started_by": "jobrunner","state": "finished","tags": [

"tag1","tag2"

],"units": 2,"version": "6d32f52-master"

}

Warning: Please consider the example response with caution. Some of the fields appear only on specific con-ditions: for example, after finishing/deleting or restoring a job. Some other fields highly depend on the givenspider/job configuration. There also might be some additional fields for internal use only which can be changed atany given moment without prior notice.

Get specific metadata field for the job

HTTP:


$ curl -u APIKEY: https://storage.scrapinghub.com/jobs/1/2/3/tags

["tag1","tag2"

]
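For completeness, the same two requests could be issued from Python with the third-party requests library (job 1/2/3 is the placeholder job used above):

import requests

APIKEY = "your-api-key"    # placeholder: your API key

# Full metadata document for job 1/2/3.
meta = requests.get("https://storage.scrapinghub.com/jobs/1/2/3", auth=(APIKEY, "")).json()
print(meta.get("state"), meta.get("close_reason"))

# A single metadata field, e.g. the tags list.
tags = requests.get("https://storage.scrapinghub.com/jobs/1/2/3/tags", auth=(APIKEY, "")).json()
print(tags)    # e.g. ["tag1", "tag2"]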

Items API

Note: Even though these APIs support writing, they are most often used for reading. The crawlers running on Scrapinghub cloud are the ones that write to these endpoints. However, both operations are documented here for completeness.

The Items API lets you interact with the items stored in the hubstorage backend for your projects. For example, you can download all the items for the job '53/34/7' through:

$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/34/7

Note: Most of the features provided by the API are also available through the python-scrapinghub client library.

Item object

Field            Description
_type            The item definition.
_template        The template matched against. Portia only.
_cached_page_id  Cached page ID. Used to identify the scraped page in storage.

Scraped fields will be top level alongside the internal fields listed above.

items/:project_id[/:spider_id][/:job_id][/:item_no][/:field_name]

Retrieve or insert items for a project, spider, or job, where item_no is the index of the item.

Parameter  Description                                                        Required
format     Results format. See Result formats.                                No
meta       Meta keys to show.                                                 No
nodata     If set, no data will be returned other than specified meta keys.   No

Note: Pagination and meta parameters are supported, see Pagination and Meta parameters.

Header         Description
Content-Range  Can be used to specify a start index when inserting items.

Method  Description                                           Supported parameters
GET     Retrieve items for a given project, spider, or job.   format, meta, nodata
POST    Insert items for a given job.                         N/A

Note: Please always use the pagination parameters (start, startafter and count) to limit the amount of items in the response, to prevent timeouts and other performance issues. See the pagination examples below for more details.

Examples

Retrieve all items from a given job

HTTP:

$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/34/7

Retrieve the first item from a given job

HTTP:

$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/34/7/0

Retrieve values from a single field

HTTP:

$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/34/7/fieldname

Retrieve all items from a given spider

HTTP:

$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/34

Retrieve all items from a given project

HTTP:

$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/

[Pagination] Retrieve first N items from a given job

HTTP:

$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/34/7?count=10

[Pagination] Retrieve N items from a given job starting from the given item

HTTP:

$ curl -u APIKEY: "https://storage.scrapinghub.com/items/53/34/7?count=10&start=53/34/7/20"

[Pagination] Retrieve N items from a given job starting from the item after the given one

HTTP:

$ curl -u APIKEY: "https://storage.scrapinghub.com/items/53/34/7?count=10&startafter=53/34/7/19"

[Pagination] Retrieve a few items from a given job by their IDs

HTTP:

$ curl -u APIKEY: "https://storage.scrapinghub.com/items/53/34/7?index=5&index=6"
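Putting the pagination parameters together, here is a sketch (third-party requests library; job 53/34/7 and a page size of 100 are illustrative) that downloads a job's items page by page using startafter, as the note above recommends:

import json
import requests

APIKEY = "your-api-key"     # placeholder: your API key
JOB = "53/34/7"
PAGE_SIZE = 100

items = []
startafter = None
while True:
    params = {"count": PAGE_SIZE, "meta": "_key"}
    if startafter:
        params["startafter"] = startafter
    resp = requests.get(
        f"https://storage.scrapinghub.com/items/{JOB}",
        auth=(APIKEY, ""),
        params=params,
        headers={"Accept": "application/x-jsonlines"},
    )
    page = [json.loads(line) for line in resp.text.splitlines() if line]
    if not page:
        break
    items.extend(page)
    startafter = page[-1]["_key"]   # full-form key, e.g. "53/34/7/99"
print(len(items), "items downloaded")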

Get meta field from items

To get only metadata from items, pass the nodata=1 parameter along with the meta field that you want to get.

HTTP:

$ curl -u APIKEY: "https://storage.scrapinghub.com/items/53/1/7?meta=_key&nodata=1"{"_key":"53/1/7/0"}{"_key":"53/1/7/1"}{"_key":"53/1/7/2"}

Get items in a specific format

Check the available formats in the Result formats section at the API Overview.

JSON:

$ curl -u APIKEY: "https://storage.scrapinghub.com/items/53/34/7?meta=_key&nodata=1 -→˓H \"Accept: application/json\""[{"_key":"28144/1/1/0"},{"_key":"28144/1/1/1"},{"_key":"28144/1/1/2"}, ...]

JSON Lines:

$ curl -u APIKEY: "https://storage.scrapinghub.com/items/53/34/7?meta=_key&nodata=1 -→˓H \"Accept: application/x-jsonlines\""{"_key":"28144/1/1/0"}{"_key":"28144/1/1/1"}{"_key":"28144/1/1/2"}...

Add items to a job via POST

Add the items stored in the file items.jl (JSON lines format) to the job 53/34/7:

$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/34/7 -X POST -T items.jl

Use the Content-Range header to specify a start index:

$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/34/7 -X POST -T items.jl -H "content-range: items 500-/*"

The API will only return 200 if the data was successfully stored. There's no limit on the amount of data you can send, but an HTTP 413 response will be returned if any single item is over 1 MB.
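A rough Python equivalent of the upload, assuming an items.jl file in the current directory and using the third-party requests library:

import requests

APIKEY = "your-api-key"    # placeholder: your API key

with open("items.jl", "rb") as f:
    resp = requests.post(
        "https://storage.scrapinghub.com/items/53/34/7",
        auth=(APIKEY, ""),
        data=f,                                       # streamed JSON Lines body
        headers={"Content-Range": "items 500-/*"},    # optional: start index, as above
    )
print(resp.status_code)    # 200 on success; 413 if a single item exceeds 1 MB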

items/:project_id/:spider_id/:job_id/stats

Retrieve the item stats for a given job.

Field                Description
counts[field]        The number of times the field was scraped.
totals.input_bytes   The total size of all items in bytes.
totals.input_values  The total number of items.

Parameter  Description                        Required
all        Include hidden fields in results.  No

Method  Description                                 Supported parameters
GET     Retrieve item stats for the specified job.  all

Example

Get the stats from a given job

HTTP:

$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/34/7/stats
{"counts":{"field1":9350,"field2":514},"totals":{"input_bytes":14390294,"input_values":10000}}

Logs API

The logs API lets you work with logs from your crawls.

Log object

Field    Description                                       Required
message  Log message.                                      Yes
level    Integer log level as defined in the table below.  Yes
time     UNIX timestamp of the message, in milliseconds.   No

Log levels

Value  Log level
10     DEBUG
20     INFO
30     WARNING
40     ERROR
50     CRITICAL

logs/:project_id/:spider_id/:job_id

Retrieve or upload logs for a given job.

Parameter  Description                          Required
format     Results format. See Result formats.  No

Note: Pagination and meta parameters are supported, see Pagination and Meta parameters.

Method  Description     Supported parameters
GET     Retrieve logs.  format
POST    Upload logs.

Retrieving logs

HTTP:

$ curl -u APIKEY: https://storage.scrapinghub.com/logs/1111111/1/1/
{"time":1444822757227,"level":20,"message":"Log opened."}
{"time":1444822757229,"level":20,"message":"[scrapy.log] Scrapy 1.0.3.post6+g2d688cd started"}

Submitting logs

HTTP:

$ curl -u APIKEY: https://storage.scrapinghub.com/logs/53/34/7 -X POST -T log.jl

Requests API

The requests API allows you to work with request and response data from your crawls.

Note: Most of the features provided by the API are also available through the python-scrapinghub client library.

Request object

Field     Description                               Required
time      Request start timestamp in milliseconds.  Yes
method    HTTP method. Default: GET.                Yes
url       Request URL.                              Yes
status    HTTP response code.                       Yes
duration  Request duration in milliseconds.         Yes
rs        Response size in bytes.                   Yes
parent    The index of the parent request.          No
fp        Request fingerprint.                      No

Note: Seed requests from start URLs will have no parent field.

requests/:project_id[/:spider_id][/:job_id][/:request_no]

Retrieve or insert request data for a project, spider or job, where request_no is the index of the request.

Parameter  Description                                                        Required
format     Results format. See Result formats.                                No
meta       Meta keys to show.                                                 No
nodata     If set, no data will be returned other than specified meta keys.   No

Note: Pagination and meta parameters are supported, see Pagination and Meta parameters.

requests/:project_id/:spider_id/:job_id

Examples

Get the requests from a given job

HTTP:

$ curl -u APIKEY: https://storage.scrapinghub.com/requests/53/34/7
{"parent":0,"duration":12,"status":200,"method":"GET","rs":1024,"url":"http://scrapy.org/","time":1351521736957}

Adding requests

HTTP:

$ curl -u APIKEY: https://storage.scrapinghub.com/requests/53/34/7 -X POST -T requests.jl

requests/:project_id/:spider_id/:job_id/stats

Retrieve request stats for a given job.

Field                Description
counts[field]        The number of times the field occurs.
totals.input_bytes   The total size of all requests in bytes.
totals.input_values  The total number of requests.

Example

HTTP:

$ curl -u APIKEY: https://storage.scrapinghub.com/requests/53/34/7/stats
{"counts":{"url":21,"parent":19,"status":21,"method":21,"rs":21,"duration":21,"fp":21},"totals":{"input_bytes":2397,"input_values":21}}

Activity API

Scrapinghub keeps track of certain project events such as when spiders are run or new spiders are deployed. This activity log can be accessed in the dashboard by clicking on Activity in the left sidebar, or programmatically through the API described below.

activity/:project_id

Retrieve messages for a specified project. Results are returned in reverse order.

Parameter  Description                           Required
count      Maximum number of results to return.  No

Method  Description                                      Supported parameters
GET     Returns the messages for the specified project.  count
POST    Creates a message.

GET example:

$ curl -u APIKEY: https://storage.scrapinghub.com/activity/1111111/?count=2
{"event":"job:completed","job":"1111111/3/4","user":"jobrunner"}
{"event":"job:cancelled","job":"1111111/3/4","user":"example"}

POST example:

$ curl -u APIKEY: -d '{"foo": 2}' https://storage.scrapinghub.com/activity/1111111/
{"foo":4}
{"foo":3}

activity/projects

Retrieve messages for multiple projects.

Results are returned in reverse order.

Parameter  Description                                             Required
count      Maximum number of results to return.                    No
p          Project ID. Multiple values supported.                  No
pcount     Maximum number of results to return per project.        No
meta       Meta parameter to add to results. See Meta parameters.  No

Method  Description                                       Supported parameters
GET     Returns the messages for the specified projects.  count, p, pcount, meta

GET example:

# Retrieve a single result for projects 1111111 and 2222222, using the ``meta`` parameter to include the project ID in the results.
$ curl -u APIKEY: "https://storage.scrapinghub.com/activity/projects/?pcount=1&meta=_project&p=1111111&p=2222222"
{"_project": 2222222, "bar": 1}
{"_project": 1111111, "foo": 4}

Collections API

Scrapinghub's Collections are key-value stores for arbitrarily large numbers of records. They are especially useful to store information produced and/or used by multiple scraping jobs.

Note: The frontier API is best suited to storing queues of URLs to be processed by scraping jobs.

Quickstart

A collection is identified by a project ID, a type and a name. A record can be any JSON dictionary. Records are identified by a _key field.

In the following examples we use project ID 78 and the regular storage type s for the collection named my_collection.

Note: Avoid using multiple collections with the same name and different types, like /s/my_collection and /cs/my_collection. During operations on an entire collection, like renaming or deleting, Hubstorage will treat homonyms as a single entity and rename or delete both.

Create/Update a record:

$ curl -u $APIKEY: -X POST -d '{"_key": "foo", "value": "bar"}' \
    https://storage.scrapinghub.com/collections/78/s/my_collection

Access a record:

$ curl -u $APIKEY: -X GET \
    https://storage.scrapinghub.com/collections/78/s/my_collection/foo

Delete a record:

$ curl -u $APIKEY: -X DELETE \
    https://storage.scrapinghub.com/collections/78/s/my_collection/foo

List records:

$ curl -u $APIKEY: -X GET \
    https://storage.scrapinghub.com/collections/78/s/my_collection

Create/Update multiple records:

The JSON Lines format is used by default (JSON objects separated by a newline):

$ curl -u $APIKEY: -X POST -d '{"_key": "foo", "value": "bar"}\n{"_key": "goo", "value": "baz"}' \
    https://storage.scrapinghub.com/collections/78/s/my_collection
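The same quickstart operations can be performed from Python. A minimal sketch with the third-party requests library, using project 78 and the collection my_collection as above:

import json
import requests

APIKEY = "your-api-key"    # placeholder: your API key
BASE = "https://storage.scrapinghub.com/collections/78/s/my_collection"
auth = (APIKEY, "")

# Create/update a record.
requests.post(BASE, auth=auth, data=json.dumps({"_key": "foo", "value": "bar"}))

# Read it back.
print(requests.get(f"{BASE}/foo", auth=auth).json())    # {"value": "bar"}

# Write several records at once (JSON Lines body).
records = [{"_key": "foo", "value": "bar"}, {"_key": "goo", "value": "baz"}]
requests.post(BASE, auth=auth, data="\n".join(json.dumps(r) for r in records))

# Delete a record.
requests.delete(f"{BASE}/foo", auth=auth)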

Details

The following collection types are available:

Type  Full name               Hubstorage method            Description
s     store                   new_store                    Basic set store
cs    cached store            new_cached_store             Items expire after a month
vs    versioned store         new_versioned_store          Up to 3 copies of each item will be retained
vcs   versioned cached store  new_versioned_cached_store   Multiple copies are retained, and each one expires after a month

Note: Avoid using multiple collections with the same name and different types, like /s/my_collection and /cs/my_collection. During operations on an entire collection, like renaming or deleting, Hubstorage will treat homonyms as a single entity and rename or delete both.

Records are json objects, with the following constraints:

• Their serialized size can’t be larger than 1 MB;

• Javascript’s inf values are not supported;

• Floating-point numbers can’t be larger than 2^64 - 1.

API

collections/:project_id/list

List all collections.

$ curl -u APIKEY: https://storage.scrapinghub.com/collections/78/list
{"type":"s","name":"my_collection"}
{"type":"s","name":"my_collection_2"}
{"type":"cs","name":"my_other_collection"}

collections/:project_id/:type/:collection

Read, write or remove items in a collection.

Parameter    Description                                                  Required
key          Read items with specified key. Multiple values supported.    No
prefix       Read items with specified key prefix.                        No
prefixcount  Maximum number of values to return per prefix.               No
startts      UNIX timestamp at which to begin results, in milliseconds.   No
endts        UNIX timestamp at which to end results, in milliseconds.     No

Method  Description                                  Supported parameters
GET     Read items from the specified collection.    key, prefix, prefixcount, startts, endts
POST    Write items to the specified collection.
DELETE  Delete items from the specified collection.  key, prefix, prefixcount, startts, endts

Note: Pagination and meta parameters are supported, see Pagination and Meta parameters.

GET examples:

$ curl -u APIKEY: "https://storage.scrapinghub.com/collections/78/s/my_collection?→˓key=foo1&key=foo2"{"value":"bar1"}{"value":"bar2"}$ curl -u APIKEY: https://storage.scrapinghub.com/collections/78/s/my_collection?→˓prefix=f{"value":"bar"}$ curl -u APIKEY: "https://storage.scrapinghub.com/collections/78/s/my_collection?→˓startts=1402699941000&endts=1403039369570"{"value":"bar"}

Prefix filters, unlike other filters, use indexes and should be used when possible. You can use the prefixcountparameter to limit the number of values returned for each prefix.

A common pattern is to download changes within a certain time period. You can use the startts and endtsparameters to select records within a certain time window.

The current timestamp can be retrieved like so:

$ curl https://storage.scrapinghub.com/system/ts1403039369570

Note: Timestamp filters may perform poorly when selecting a small number of records from a large collection.
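A sketch of that change-polling pattern in Python (third-party requests library; the saved last_poll timestamp is hypothetical):

import requests

APIKEY = "your-api-key"    # placeholder: your API key
COLLECTION = "https://storage.scrapinghub.com/collections/78/s/my_collection"

# Current server timestamp, in milliseconds.
now = int(requests.get("https://storage.scrapinghub.com/system/ts").text)

last_poll = now - 60 * 60 * 1000    # hypothetical: timestamp saved after the previous poll

# Records written between the previous poll and now.
resp = requests.get(
    COLLECTION,
    auth=(APIKEY, ""),
    params={"startts": last_poll, "endts": now},
)
for line in resp.text.splitlines():
    if line:
        print(line)    # e.g. {"value":"bar"}
# Persist `now` somewhere so the next poll can use it as startts.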

collections/:project_id/:type/:collection/count

Count the number of items in a collection.

$ curl -u APIKEY: https://storage.scrapinghub.com/collections/78/s/my_collection/count
{"count":972,"scanned":972}

If the collection is large, the result may contain a nextstart field that is used for pagination, see Pagination.

collections/:project_id/:type/:collection/:item

Read, write or delete an individual item.

Method  Description
GET     Read the item with the given key.
POST    Write the item with the given key.
DELETE  Delete the item with the given key.

$ curl -u $APIKEY: https://storage.scrapinghub.com/collections/78/s/my_collection/foo
{"value":"bar"}

collections/:project_id/:type/:collection/:item/value

Read an individual item value.

$ curl -u APIKEY: https://storage.scrapinghub.com/collections/78/s/my_collection/foo/value
bar

collections/:project_id/:type/:collection/deleted

POST with a list of item keys to delete them.

Note: This endpoint is designed to delete a large number of non-consecutive items. To delete consecutive items, prefer the faster DELETE-based endpoints.

$ curl -u $APIKEY: -X POST -d '"foo"' -d '"bar"' \
    https://storage.scrapinghub.com/collections/78/s/my_collection/deleted

collections/:project_id/delete?name=:collection

Delete an entire collection immediately.

$ curl -u APIKEY: -X POST "https://storage.scrapinghub.com/collections/78/delete?name=my_collection"

collections/:project_id/rename?name=:collection&new_name=:new_name

Rename a collection and move all its items immediately.

$ curl -u APIKEY: -X POST "https://storage.scrapinghub.com/collections/78/rename?name=my_collection&new_name=my_collection_renamed"

Frontier API

The Hub Crawl Frontier (HCF) stores pages visited and outstanding requests to make. It can be thought of as a persistent shared storage for a crawl scheduler.

Web pages are identified by a fingerprint. This can be the URL of the page, but crawlers may use any other string (e.g. a hash of POST parameters, if it processes POST requests), so there is no requirement for the fingerprint to be a valid URL.

A project can have many frontiers and each frontier is broken down into slots. A separate priority queue is maintained per slot. This means that requests from each slot can be prioritized separately and crawled at different rates and at different times.

Arbitrary data can be stored in both the crawl queue and with the set of fingerprints.

A typical example would be to use the URL as a fingerprint and the hostname as a slot. The crawler should ensure that each host is only crawled from one process at any given time so that politeness can be maintained.

Note: Most of the features provided by the API are also available through the python-scrapinghub client library.

Batch object

Field     Description
id        Batch ID.
requests  An array of request objects.

Request object

Field  Description                                                            Required
fp     Request fingerprint.                                                   Yes
qdata  Data to be stored along with the fingerprint in the request queue.     No
fdata  Data to be stored along with the fingerprint in the fingerprint set.   No
p      Priority: lower priority numbers are returned first. Defaults to 0.    No

/hcf/:project_id/:frontier/s/:slot

Field     Description
newcount  The number of new requests that have been added.

Method  Description                                Supported parameters
POST    Enqueues a request in the specified slot.  fp, qdata, fdata, p
DELETE  Deletes the specified slot.

POST examples

Add a request to the frontier

HTTP:

$ curl -u API_KEY: -d '{"fp":"/some/path.html"}' \
    https://storage.scrapinghub.com/hcf/78/test/s/example.com

{"newcount":1}

Add requests with additional parameters

By using the same priority as request depth, the website can be traversed in breadth-first order from the starting URL.

HTTP:

$ curl -u API_KEY: -d $'{"fp":"/"}\n{"fp":"page1.html", "p": 1, "qdata": {"depth": 1}}→˓' \

https://storage.scrapinghub.com/hcf/78/test/s/example.com{"newcount":2}

DELETE example

The example below deletes the slot example.com from the frontier.

HTTP:

$ curl -u API_KEY: -X DELETE https://storage.scrapinghub.com/hcf/78/test/s/example.com/

/hcf/:project_id/:frontier/s/:slot/q

Retrieve requests for a given slot.

Parameter  Description                                  Required
mincount   The minimum number of requests to retrieve.  No

HTTP:

$ curl -u API_KEY: https://storage.scrapinghub.com/hcf/78/test/s/example.com/q
{"id":"00013967d8af7b0001","requests":[["/",null]]}
{"id":"01013967d8af7e0001","requests":[["page1.html",{"depth":1}]]}

/hcf/:project_id/:frontier/s/:slot/q/deleted

Delete a batch of requests.

Once a batch has been processed, clients should indicate that the batch is completed so that it will be removed and no longer returned when new batches are requested.

This can be achieved by posting the IDs of the completed batches:

$ curl -u API_KEY: -d '"00013967d8af7b0001"' \
    https://storage.scrapinghub.com/hcf/78/test/s/example.com/q/deleted

You can specify the IDs as arrays or single values. As with the previous examples, multiple lines of input are accepted.
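The read-process-delete cycle described above could be scripted as follows (third-party requests library; frontier test and slot example.com as in the examples; the crawling itself is left as a placeholder):

import json
import requests

APIKEY = "your-api-key"    # placeholder: your API key
SLOT = "https://storage.scrapinghub.com/hcf/78/test/s/example.com"
auth = (APIKEY, "")

# Fetch a set of batches from the slot's queue.
resp = requests.get(f"{SLOT}/q", auth=auth, params={"mincount": 100})
batches = [json.loads(line) for line in resp.text.splitlines() if line]

for batch in batches:
    for fp, qdata in batch["requests"]:
        pass          # placeholder: crawl the page identified by fingerprint fp here

# Acknowledge the processed batches so they are not returned again.
ids = "\n".join(json.dumps(b["id"]) for b in batches)
requests.post(f"{SLOT}/q/deleted", auth=auth, data=ids)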


/hcf/:project_id/:frontier/s/:slot/f

Retrieve fingerprints for a given slot.

Example

HTTP:

$ curl -u API_KEY: https://storage.scrapinghub.com/hcf/78/test/s/example.com/f
{"fp":"/"}
{"fp":"page1.html"}

Results are ordered lexicographically by fingerprint value.

/hcf/:project_id/list

Lists the frontiers for a given project.

Example

HTTP:

$ curl -u API_KEY: https://storage.scrapinghub.com/hcf/78/list
["test"]

/hcf/:project_id/:frontier/list

Lists the slots for a given frontier.

Example

HTTP:

$ curl -u API_KEY: https://storage.scrapinghub.com/hcf/78/test/list
["example.com"]

1.2.3 Python client

You can use the python-scrapinghub library to interact with the Scrapy Cloud API. Check its documentation for installation instructions and usage examples.

1.3 Pagination

You can paginate the results for the majority of the APIs using a number of parameters. The pagination parameters differ depending on the target host for a given endpoint.

1.3.1 app.scrapinghub.com

Parameter  Description
count      Number of results per page.
offset     Offset to retrieve specific records.

1.3.2 storage.scrapinghub.com

Parameter   Description
count       Number of results per page.
index       Offset to retrieve specific records. Multiple values supported.
start       Skip results before the given one. See the note about format below.
startafter  Return results after the given one. See the note about format below.

Note: The parameter naming inconsistency exists for historical reasons and will be fixed in coming platform updates.

Note: While the index parameter is just a short <entity_id> (e.g. index=4), the start and startafter parameters should have the full form <project_id>/<spider_id>/<job_id>/<entity_id> (e.g. start=1/2/3/4, startafter=1/2/3/3).

1.4 Result formats

There are two ways to specify the format of results: using the Accept header, or using the format parameter.

The Accept header supports the following values:

• application/x-jsonlines

• application/json

• application/xml

• text/plain

• text/csv

The format parameter supports the following values:

• json

• jl

• xml

• csv

• text

XML-RPC data types are used for XML output.

1.4.1 CSV parameters

Parameter        Description                                                               Required
fields           Comma delimited list of fields to include, in order from left to right.  Yes
include_headers  When set to '1' or 'Y', show header names in first row.                  No
sep              Separator character.                                                     No
quote            Quote character.                                                         No
escape           Escape character.                                                        No
lineend          Line end string.                                                         No

When using CSV, you will need to specify the fields parameter to indicate the required fields and their order. Example:

$ curl -u APIKEY: "https://storage.scrapinghub.com/items/53/34/7?format=csv&fields=id,name&include_headers=1"

1.5 Headers

gzip compression is supported. A client can signal that it handles gzip responses by using the accept-encoding: gzip request header. A content-encoding: gzip header must be present in the response to signal the gzip content encoding.

You can use the saveas request parameter to specify a filename for browser downloads. For example, specifying ?saveas=foo.json will cause a header of Content-Disposition: Attachment; filename=foo.json to be returned.

1.6 Meta parameters

You can use the meta parameter to return metadata for the record in addition to its core data.

The following values are available:

Parameter  Description
_key       The item key in the format :project_id/:spider_id/:job_id/:item_no.
_ts        Timestamp in milliseconds for when the item was added.

Example:

$ curl "https://storage.scrapinghub.com/items/53/34/7?meta=_key&meta=_ts"
{"_key":"1111111/1/1/0","_ts":1342078473363, ... }

Note: If the data contains fields with the same name as the requested fields, they will both appear in the result.


CHAPTER 2

Scrapy Cloud Write Entrypoint

See Scrapy Cloud Write Entrypoint.

2.1 Scrapy Cloud Write Entrypoint

Note: This is the documentation of a low-level protocol that most Scrapy Cloud users don't need to deal with. For more high-level documentation and user guides check the Help Center.

Scrapy Cloud Write Entrypoint is a write-only interface to Scrapy Cloud storage. Its main purpose is to make it easy to write crawlers and scripts compatible with Scrapy Cloud in different programming languages using custom Docker images.

Jobs in Scrapy Cloud run inside Docker containers. When a job container is started, a named pipe is created at the location stored in the SHUB_FIFO_PATH environment variable. To interface with Scrapy Cloud storage, your crawler has to open this named pipe and write messages to it, following the simple text-based protocol described below.

2.1.1 Protocol

Each message is a line of ASCII characters terminated by a newline character. A message consists of the following parts:

• a 3-character command (one of “ITM”, “LOG”, “REQ”, “STA”, or “FIN”),

• followed by a space character,

• then followed by a payload as a JSON object,

• and a final newline character \n.

This is what an example log message looks like:

LOG {"time": 1485269941065, "level": 20, "message": "Some log message"}

This example and all the following examples omit the trailing newline character because it's a non-printable character. This is how you would write the above example message in Python:

pipe.write('LOG {"time": 1485269941065, "level": 20, "message": "Some log message"}\n')
pipe.flush()

Newline characters are used as message separators. So, make sure that the serialized JSON object payload doesn't contain newline characters between key/value pairs and that newline characters inside strings, for both keys and values, are properly escaped, i.e. an actual \ (reverse solidus, backslash) followed by n. Here's an example of two consecutive log messages which carry multiline messages in the payload:

LOG {"time": 1485269941065, "level": 20, "message": "First multiline message. Line 1\nLine 2"}
LOG {"time": 1485269941066, "level": 30, "message": "Second multiline message. Line 1\nLine 2"}

In Python this will look like this:

pipe.write('LOG {"time": 1485269941065, "level": 20, "message": "First multiline message. Line 1\\nLine 2"}\n')
pipe.write('LOG {"time": 1485269941066, "level": 30, "message": "Second multiline message. Line 1\\nLine 2"}\n')
pipe.flush()

Unicode characters in the JSON object MUST be escaped using the standard JSON \u four-hex-digits syntax, e.g. the item {"ключ": "значение"} should look like this:

ITM {"\u043a\u043b\u044e\u0447": "\u0437\u043d\u0430\u0447\u0435\u043d\u0438\u0435"}

The total size of the message MUST not exceed 1 MiB. For messages that exceed this size an error will be logged instead.

ITM command

The ITM command writes a single item into Scrapy Cloud storage. The ITM payload has no predefined schema.

Example:

ITM {"key": "value"}

To support very simple scripts, the Scrapy Cloud Write Entrypoint allows sending plain JSON objects as items, i.e. without the 3-character command and space prefix. The following two messages are valid and equivalent:

ITM {"key": "value"}

{"key": "value"}

LOG command

The LOG command writes a single log message into Scrapy Cloud storage. The schema for the LOG payload is described in Log object.

Example:

LOG {"level": 20, "message": "Some log message"}

REQ command

The REQ command writes a single request into Scrapy Cloud storage. The schema for the REQ payload is described in Request object.

Example:

REQ {"url": "http://example.com", "method": "GET", "status": 200, "rs": 10, "duration": 20}

STA command

STA stands for stats and is used to populate the job stats page and to create graphs on the job details page.

Field  Description                                      Required
time   UNIX timestamp of the message, in milliseconds.  No
stats  JSON object with arbitrary keys and values.      Yes

If the following keys are present in the STA payload, their values will be used to populate the Scheduled Requests graph on the job details page:

• scheduler/enqueued

• scheduler/dequeued

The key names above were picked for compatibility with Scrapy stats.

Example:

STA {"time": 1485269941065, "stats": {"key": 0, "key2": 20.5, "scheduler/enqueued":→˓20, "scheduler/dequeued": 15}}

FIN command

The FIN command is used to set the outcome of a crawler execution, once it’s finished.

Field    Description                                                 Required
outcome  String with custom outcome message, limited to 255 chars.  Yes

Example:

FIN {"outcome": "finished"}

2.1.2 Printing to stdout and stderr

The output printed by a job in Scrapy Cloud is automatically converted into log messages. Lines printed to stdout are converted into INFO level log messages. Lines printed to stderr are converted into ERROR level log messages. For example, if the script prints Hello, world to stdout, the resulting LOG command will look like this:

LOG {"time": 1485269941065, "level": 20, "message": "Hello, world"}

There’s very basic support for multiline standard output – if some output consists of multiple lines where first linestarts with a non-space character and subsequent lines start with a space character, it would be considered as a singlelog entry. For example, the following traceback in stderr:

Traceback (most recent call last):File "<stdin>", line 1, in <module>

NameError: name 'e' is not defined

will produce the following log messages:

LOG {"time": 1485269941065, "level": 40, "message": "Traceback (most recent call→˓last):\n File \"<stdin>\", line 1, in <module>"}LOG {"time": 1485269941066, "level": 40, "message": "NameError: name 'e' is not→˓defined"}

Resulting log messages are subject to 1 MiB limit – this means that output longer than 1023 KiB is likely to causeerrors.

Warning: Even though you can write log messages by printing them to stdout and stderr, we recommend you touse the named pipe and LOG message instead. Due to the way data is sent between processes, it is not possible tomaintain the order of the messages coming from different sources (named pipe, stdout, stderr). Exclusive usagedof the named pipe will both give the best performance and guarantee that messages are received in exactly the sameorder they were sent.

2.1.3 How to build a compatible crawler

Scripts or non-Scrapy spiders have to be deployed as custom Docker images.

Each spider needs to follow the pattern:

1. Get the path to the named pipe mentioned earlier from SHUB_FIFO_PATH environment variable.

2. Open the named pipe for writing. E.g. in Python you do it like this:

import os

path = os.environ['SHUB_FIFO_PATH']
pipe = open(path, 'w')

3. Write messages to the pipe. If you want to send a message instantly, you have to flush the stream, otherwise it may remain in the file buffer inside the crawler process. However, this is not always required, as the buffer will be flushed once enough data is written or when the file object is closed (this depends on the programming language you use):

# write item
pipe.write('ITM {"a": "b"}\n')
pipe.flush()
# ...
# write request
pipe.write('REQ {"time": 1484337369817, "url": "http://example.com", "method": "GET", "status": 200, "rs": 10, "duration": 20}\n')
pipe.flush()
# ...
# write log entry
pipe.write('LOG {"time": 1484337369817, "level": 20, "message": "Some log message"}\n')
pipe.flush()
# ...
# write stats
pipe.write('STA {"time": 1485269941065, "stats": {"key": 0, "key2": 20.5}}\n')
pipe.flush()
# ...
# set outcome
pipe.write('FIN {"outcome": "finished"}\n')
pipe.flush()

4. Close the named pipe when the crawl is finished:

pipe.close()
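Putting the four steps together, a minimal script speaking this protocol might look like the sketch below (the item and log payloads are placeholders; json.dumps conveniently escapes newlines and non-ASCII characters as the protocol requires):

import json
import os

path = os.environ['SHUB_FIFO_PATH']        # named pipe created by Scrapy Cloud
pipe = open(path, 'w')

def send(command, payload):
    # One message per line: 3-character command, space, JSON payload, newline.
    pipe.write('%s %s\n' % (command, json.dumps(payload)))
    pipe.flush()

send('LOG', {'level': 20, 'message': 'crawler started'})
send('ITM', {'url': 'http://example.com', 'title': 'Example'})   # placeholder item
send('FIN', {'outcome': 'finished'})
pipe.close()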

Note: scrapinghub-entrypoint-scrapy uses Scrapy Cloud Write Entrypoint, check the code if you need an example.


CHAPTER 3

Crawlera API

See Crawlera API.

3.1 Crawlera API

Note: Check also the Help Center for general guides and articles.

3.1.1 Proxy API

Crawlera works with a standard HTTP web proxy API, where you only need an API key for authentication. This is the standard way to perform a request via Crawlera:

curl -vx proxy.crawlera.com:8010 -U <API key>: http://httpbin.org/ip
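The same request can be issued from Python using the requests library. This is only a sketch of the standard HTTP proxy pattern, with the API key embedded as the proxy username:

import requests

API_KEY = '<API key>'
proxies = {
    'http': 'http://{}:@proxy.crawlera.com:8010'.format(API_KEY),
    'https': 'http://{}:@proxy.crawlera.com:8010'.format(API_KEY),
}

# Perform a request through the Crawlera proxy.
response = requests.get('http://httpbin.org/ip', proxies=proxies)
print(response.text)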

3.1.2 Errors

When an error occurs, Crawlera sends a response containing an X-Crawlera-Error header and an error message in the body.

Note: These errors are internal to Crawlera and are subject to change at any time, so they should not be relied on and should only be used for debugging.


X-Crawlera-Error          Response Code   Error Message
bad_session_id            400             Bad session ID
user_session_limit        400             Session limit exceeded
bad_proxy_auth            407             Incorrect authentication data
too_many_conns            429             Parallel connections limit has been reached *
header_auth               470             Unauthorized header
                          500             Unexpected error
nxdomain                  502             Error looking up domain
ehostunreach              502             Host is unreachable
econnrefused              502             Connection refused
econnreset                502             Connection reset by peer
socket_closed_remotely    502             The socket has been closed remotely
client_conn_closed        503             Connection closed by client
noslaves                  503             No available proxies
banned                    503             Proxy has been banned
serverbusy                503             Server busy: too many outstanding requests
timeout                   504             Connection timed out
msgtimeout                504             Message passing timeout
domain_forbidden          523             The domain is forbidden for crawling
bad_header                540             Bad header value
data_error                541             Response size is too big

* Crawlera limits the number of concurrent connections based on your Crawlera plan. See the Crawlera pricing table for more information on plans.

* Crawlera limits the size of responses. If you attempt to download a file larger than 500MB, Crawlera will return an error.

3.1.3 Sessions

Sessions

Sessions allow reusing the same slave for every request. Sessions expire 30 minutes after their last use, and Crawlera limits the number of concurrent sessions to 100 for C10 plans and 5000 for all other plans.

Sessions are managed using the X-Crawlera-Session header. To create a new session send:

X-Crawlera-Session: create

Crawlera will respond with the session ID in the same header:

X-Crawlera-Session: <session ID>

From then onward, subsequent requests can be made through the same slave by sending the session ID in the request header:

X-Crawlera-Session: <session ID>

Another way to create sessions is using the /sessions endpoint:

curl -u <API key>: proxy.crawlera.com:8010/sessions -X POST

This will also return a session ID which you can pass to future requests with the X-Crawlera-Session header like before. This is helpful when you can’t get the next request using X-Crawlera-Session.
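As a sketch of the header-based flow in Python (reusing the illustrative proxy setup from the earlier example), you could create a session on the first request and reuse the returned ID afterwards:

import requests

API_KEY = '<API key>'
proxies = {
    'http': 'http://{}:@proxy.crawlera.com:8010'.format(API_KEY),
    'https': 'http://{}:@proxy.crawlera.com:8010'.format(API_KEY),
}

# First request: ask Crawlera to create a new session.
first = requests.get('http://httpbin.org/ip', proxies=proxies,
                     headers={'X-Crawlera-Session': 'create'})
session_id = first.headers['X-Crawlera-Session']

# Subsequent requests: send the returned session ID to stay on the same slave.
second = requests.get('http://httpbin.org/ip', proxies=proxies,
                      headers={'X-Crawlera-Session': session_id})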


If an incorrect session ID is sent, Crawlera responds with a bad_session_id error.

List sessions

Issue a GET request to the sessions endpoint to list your sessions. The endpoint returns a JSON document in which each key is a session ID and the associated value is a slave.

Example:

curl -u <API key>: proxy.crawlera.com:8010/sessions
{"1836172": "<SLAVE1>", "1691272": "<SLAVE2>"}

Delete a session

Issue a DELETE request to the sessions endpoint in order to delete a session.

Example:

curl -u <API key>: proxy.crawlera.com:8010/sessions/1836172 -X DELETE

Session Request Limits

There is a default delay of 12 seconds between each request using the same IP. These delays can differ for more popular domains. If the requests per second limit is exceeded, further requests will be delayed for up to 15 minutes. Each request made after exceeding the limit will increase the request delay. If the request delay reaches the soft limit (120 seconds), then each subsequent request will contain an X-Crawlera-Next-Request-In header with the calculated delay as the value.
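The sketch below shows one way a client might honour that header. The units of the returned value are not stated in this section, so treating it as milliseconds is an assumption; the proxies dictionary is the same illustrative setup used in the earlier Python examples.

import time

import requests

API_KEY = '<API key>'
proxies = {
    'http': 'http://{}:@proxy.crawlera.com:8010'.format(API_KEY),
    'https': 'http://{}:@proxy.crawlera.com:8010'.format(API_KEY),
}


def fetch_politely(url):
    # Fetch through Crawlera and respect the suggested delay, if one is returned.
    response = requests.get(url, proxies=proxies)
    delay = response.headers.get('X-Crawlera-Next-Request-In')
    if delay is not None:
        time.sleep(float(delay) / 1000.0)  # assumption: value is in milliseconds
    return response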

3.1.4 Request Headers

Crawlera supports multiple HTTP headers to control its behaviour.

Not all headers are available in every plan. Here is a chart of the headers available in each plan (C10, C50, etc):

Header                    C10   C50   C100   C200   Enterprise
X-Crawlera-UA                   X     X      X      X
X-Crawlera-Profile              X     X      X      X
X-Crawlera-No-Bancheck          X     X      X      X
X-Crawlera-Cookies        X     X     X      X      X
X-Crawlera-Timeout        X     X     X      X      X
X-Crawlera-Session        X     X     X      X      X
X-Crawlera-JobId          X     X     X      X      X
X-Crawlera-Max-Retries    X     X     X      X      X

X-Crawlera-UA

Only available on C50, C100, C200 and Enterprise plans.

Deprecated. Use X-Crawlera-Profile instead.

This header controls Crawlera User-Agent behaviour. The supported values are:


• pass - pass the User-Agent as it comes on the client request

• desktop - use a random desktop browser User-Agent

• mobile - use a random mobile browser User-Agent

If X-Crawlera-UA isn’t specified, it will default to desktop. If an unsupported value is passed in the X-Crawlera-UA header, Crawlera replies with a 540 Bad Header Value.

More User-Agent types will be supported in the future (chrome, firefox) and added to the list above.

X-Crawlera-Profile

Only available on C50, C100, C200 and Enterprise plans.

This is a replacement for the X-Crawlera-UA header with slightly different behaviour: X-Crawlera-UA only sets the User-Agent header, but X-Crawlera-Profile applies a set of headers actually used by the browser. For example, all modern browsers set the Accept-Language and Accept-Encoding headers. Also, some browsers set the DNT and Upgrade-Insecure-Requests headers.

We provide correct default values for the headers sent by the mimicked browser. If you want to use your own header, please use the complementary header X-Crawlera-Profile-Pass. The value of X-Crawlera-Profile-Pass is the name of the header you need to use. In that case, Crawlera won’t override your value. You can put several header names there, delimited by commas.

Example

You want to use your own specific browser locale (de_DE) instead of the default en_US. In that case, you need to put Accept-Language as the value of X-Crawlera-Profile-Pass and provide de_DE as the value of Accept-Language.

X-Crawlera-Profile: desktop
X-Crawlera-Profile-Pass: Accept-Language
Accept-Language: de_DE

This header’s intent is to replace the legacy X-Crawlera-UA, so if you pass both X-Crawlera-UA and X-Crawlera-Profile, the latter supersedes X-Crawlera-UA.

Example:

X-Crawlera-UA: desktop
X-Crawlera-Profile: pass

Crawlera won’t respect X-Crawlera-UA setting here because X-Crawlera-Profile is set.

Supported values for this header are:

• pass - do not use any browser profile; use the User-Agent provided by the client

• desktop - use a random desktop browser profile, ignoring the client User-Agent header

• mobile - use a random mobile browser profile, ignoring the client User-Agent header

By default, no profile is used and Crawlera falls back to processing the X-Crawlera-UA header. If an unsupported value is passed in the X-Crawlera-Profile header, Crawlera replies with a 540 Bad Header Value.

X-Crawlera-No-Bancheck

Only available on C50, C100, C200 and Enterprise plans.


This header instructs Crawlera not to check responses against its ban rules and to pass any received response to the client. The presence of this header (with any value) is taken as a flag to disable ban checks.

Example:

X-Crawlera-No-Bancheck: 1

X-Crawlera-Cookies

This header allows you to disable the internal cookie tracking performed by Crawlera.

Example:

X-Crawlera-Cookies: disable

X-Crawlera-Session

This header instructs Crawlera to use sessions which will tie requests to a particular slave until it gets banned.

Example:

X-Crawlera-Session: create

When the create value is passed, Crawlera creates a new session, and its ID will be returned in the response header with the same name. All subsequent requests should use that returned session ID to prevent random slave switching between requests. Crawlera sessions currently have a maximum lifetime of 30 minutes. See Sessions for information on the maximum number of sessions.

X-Crawlera-JobId

This header sets the job ID for the request (useful for tracking requests in the Crawlera logs).

Example:

X-Crawlera-JobId: 999

X-Crawlera-Max-Retries

Note: This header has no effect when using X-Crawlera-Session header.

This header limits the number of retries performed by Crawlera.

Example:

X-Crawlera-Max-Retries: 1

Passing 1 in the header instructs Crawlera to do up to 1 retry. The default number of retries is 5 (which is also the allowed maximum value, the minimum being 0).


X-Crawlera-Timeout

This header sets Crawlera’s timeout in milliseconds for receiving a response from the target website. The timeout must be specified in milliseconds and be between 30,000 and 180,000; values higher or lower than these limits are rounded to the nearest maximum or minimum value.

Example:

X-Crawlera-Timeout: 40000

The example above sets the response timeout to 40,000 milliseconds. In the case of a streaming response, each chunk has 40,000 milliseconds to be received. If no response is received after 40,000 milliseconds, a 504 response will be returned. If not specified, the timeout defaults to 30,000 milliseconds.

[Deprecated] X-Crawlera-Use-Https

Previously, performing HTTPS requests required the http variant of the URL plus the X-Crawlera-Use-Https header with value 1, as in the following example:

curl -x proxy.crawlera.com:8010 -U <API key>: http://twitter.com -H x-crawlera-use-https:1

Now you can directly use the https url and remove the X-Crawlera-Use-Https header, like this:

curl -x proxy.crawlera.com:8010 -U <API key>: https://twitter.com

If you don’t use curl for Crawlera, you can check the rest of the documentation and update your scripts in order to continue using Crawlera without issues. Also, some programming languages will ask for the certificate file crawlera-ca.crt. You can install the certificate on your system or set it explicitly in the script.
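For example, with Python's requests library you might point the verify parameter at the certificate file explicitly; the local path to crawlera-ca.crt used below is just an assumption:

import requests

API_KEY = '<API key>'
proxies = {
    'http': 'http://{}:@proxy.crawlera.com:8010'.format(API_KEY),
    'https': 'http://{}:@proxy.crawlera.com:8010'.format(API_KEY),
}

# verify points at a locally installed copy of crawlera-ca.crt (illustrative path).
response = requests.get('https://twitter.com', proxies=proxies,
                        verify='/path/to/crawlera-ca.crt')
print(response.status_code)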

3.1.5 Response Headers

X-Crawlera-Next-Request-In

This header is returned when the response delay reaches the soft limit (120 seconds) and contains the calculated delay value. If the user ignores this header, the hard limit (1000 seconds) may be reached, after which Crawlera will return HTTP status code 503 with the delay value in the Retry-After header.

X-Crawlera-Debug

This header activates tracking of additional debug values, which are returned in the response headers. At the moment only the request-time and ua values are supported; a comma should be used as a separator. For example, to start tracking request time, send:

X-Crawlera-Debug: request-time

or, to track both request time and User-Agent send:

X-Crawlera-Debug: request-time,ua

The request-time option forces Crawlera to output in a response header the request time (in seconds) of the last request retry (i.e. the time between Crawlera sending the request to a slave and Crawlera receiving the response headers from that slave):


X-Crawlera-Debug-Request-Time: 1.112218

The ua option allows you to obtain information about the actual User-Agent which was applied to the last request (useful for finding reasons behind redirects from a target website, for instance):

X-Crawlera-Debug-UA: Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN) AppleWebKit/533+ (KHTML, like Gecko)
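A short Python sketch of requesting both debug values and reading them back from the response headers (same illustrative proxy setup as in the earlier examples):

import requests

API_KEY = '<API key>'
proxies = {
    'http': 'http://{}:@proxy.crawlera.com:8010'.format(API_KEY),
    'https': 'http://{}:@proxy.crawlera.com:8010'.format(API_KEY),
}

# Ask Crawlera to track both debug values for this request.
response = requests.get('http://httpbin.org/ip', proxies=proxies,
                        headers={'X-Crawlera-Debug': 'request-time,ua'})
print(response.headers.get('X-Crawlera-Debug-Request-Time'))
print(response.headers.get('X-Crawlera-Debug-UA'))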

X-Crawlera-Error

This header is returned when an error condition is met, stating a particular Crawlera error behind HTTP status codes (4xx or 5xx). The error message is sent in the response body.

Example:

X-Crawlera-Error: user_session_limit

Note: Returned errors are internal to Crawlera and are subject to change at any time, so should not be relied on.

3.1.6 Using Crawlera with Scrapy Cloud

To employ Crawlera in Scrapy Cloud projects, the Crawlera addon is used. Go to Settings > Addons > Crawlera to activate it.

Settings

CRAWLERA_URL                 proxy URL (default: http://proxy.crawlera.com:8010)
CRAWLERA_ENABLED             tick the checkbox to enable Crawlera
CRAWLERA_APIKEY              Crawlera API key
CRAWLERA_MAXBANS             number of bans to ignore before closing the spider (default: 20)
CRAWLERA_DOWNLOAD_TIMEOUT    timeout for requests (default: 190)

3.1.7 Using Crawlera with headless browsers

See our articles in the Knowledge Base:

• Using Crawlera Headless Proxy

• Using Crawlera with Splash

• Using Crawlera with Selenium and Polipo

• Using Crawlera with PhantomJS

• Using Crawlera with Puppeteer


3.1.8 Using Crawlera from different languages

Check out our Knowledge Base for examples of using Crawlera with different programming languages:

• Python

• PHP

• Ruby

• Node.js

• Java

• C#

3.1.9 Fetch API

Warning: The Fetch API is deprecated and will be removed soon. Use the standard proxy API instead.

Crawlera’s fetch API lets you request URLs as an alternative to Crawlera’s proxy interface.

Fields

Note: Field values should always be encoded.

Field     Required   Description                               Example
url       yes        URL to fetch                              http://www.food.com/
headers   no         Headers to send in the outgoing request   header1:value1;header2:value2

Basic example:

curl -u <API key>: http://proxy.crawlera.com:8010/fetch?url=https://twitter.com

Headers example:

curl -u <API key>: 'http://proxy.crawlera.com:8010/fetch?url=http%3A//www.food.com&headers=Header1%3AVal1%3BHeader2%3AVal2'

Working with HTTPS

See Crawlera with HTTPS in our Knowledge Base

Working with Cookies

See Crawlera and Cookies in our Knowledge Base


CHAPTER 4

Crawlera Stats API

See Crawlera Stats API.

4.1 Crawlera Stats API

Use the stats HTTP API to access Crawlera usage data.

4.1.1 Authentication

This API uses HTTP Basic authentication. You’ll need to use your API key.

4.1.2 API endpoints

Root URL: crawlera-stats.scrapinghub.com

/stats

Crawlera usage stats.

Stats object

Field         Description
time_gte      Start of interval. ISO 8601 formatted date
clean         Number of successful responses
failed        Number of unsuccessful responses
concurrency   80th percentile of concurrent connections
total_time    80th percentile of response time (milliseconds)
traffic       Total traffic (bytes)


Parameters

Field        Description                                                                   Required
start_date   ISO 8601 formatted date. Defaults to 7 days ago from now                      No
end_date     ISO 8601 formatted date. Defaults to UTC now                                  No
groupby      How to group results. Defaults to no grouping. Available values: max, hour,
             day, month, year. "max" means group by the most granular datetime precision
             possible (5 min)                                                              No
users        Only fetch data for this set of users (comma separated)                       No
limit        Number of desired items per page. Defaults to 500                             No
after        Token for requesting next items on timeline                                   No

Examples

Last 7 days traffic:

$ curl -u APIKEY: 'https://crawlera-stats.scrapinghub.com/stats/'
{
  "limit": 500,
  "after": "",
  "results": [{
    "time_gte": "2018-12-12T11:05:00+00:00",
    "failed": 112275,
    "traffic": 125085476006,
    "concurrency": 2,
    "total_time": 1930,
    "clean": 3758963
  }]
}

Last 7 days traffic max resolution:

$ curl -u APIKEY: 'https://crawlera-stats.scrapinghub.com/stats/?groupby=max'
{
  "limit": 500,
  "after": "MHgxLjcwNWYzMmMwMDAwMDBwKzMw",
  "results": [{
    "clean": 175,
    "total_time": 2032,
    "time_gte": "2018-12-17T15:30:00+00:00",
    "failed": 16,
    "concurrency": 2,
    "traffic": 3554065
  },
  ....
  {
    "clean": 166,
    "total_time": 2036,
    "time_gte": "2018-12-17T16:15:00+00:00",
    "failed": 4,
    "concurrency": 1,
    "traffic": 11257159


  }]
}

Consume next page:

$ curl -u APIKEY: 'https://crawlera-stats.scrapinghub.com/stats/?groupby=max&after=MHgxLjcwNWYzMmMwMDAwMDBwKzMw'
{
  "limit": 500,
  "after": "MHgxLjcwNWYzNzcwMDAwMDBwKzMw",
  "results": [....]
}

One day traffic per hour:

$ curl -u APIKEY: 'https://crawlera-stats.scrapinghub.com/stats/?start_date=2019-01-01T00%3A00&end_date=2019-01-01T23%3A59&groupby=hour'
{
  "limit": 500,
  "after": "",
  "results": [....]
}
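The same data can be consumed from Python; the sketch below pages through results by passing the returned after token back as a query parameter (APIKEY is a placeholder, and the printed fields are just an example):

import requests

APIKEY = '<API key>'
BASE_URL = 'https://crawlera-stats.scrapinghub.com/stats/'

params = {'groupby': 'day'}
while True:
    data = requests.get(BASE_URL, auth=(APIKEY, ''), params=params).json()
    for entry in data['results']:
        print(entry['time_gte'], entry['clean'], entry['failed'], entry['traffic'])
    if not data['results'] or not data.get('after'):
        break
    params['after'] = data['after']  # token for the next page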


CHAPTER 5

AutoExtract API

See AutoExtract API.

5.1 AutoExtract API

The AutoExtract API is a service for automatically extracting information from web content. You provide the URLs that you are interested in, and what type of content you expect to find there (product or article). The service will then fetch the content, and apply a number of techniques behind the scenes to extract as much information as possible. Finally, the extracted information is returned to you in structured form.

5.1.1 Before you Begin

You will need to obtain an API key before you can start using the AutoExtract API. You should receive one when you complete the signup process. If you haven’t received one, you can contact the AutoExtract support team directly at [email protected].

Note: In all of the examples below, you will need to replace the string ‘[api key]’ with your unique key.

5.1.2 Basic Usage

Currently, the API has a single endpoint: https://autoextract.scrapinghub.com/v1/extract. A request is composed of one or more queries. Each query contains a URL to extract from, and a page type that indicates what the extraction result should be (product or article). Requests and responses are transmitted in JSON format over HTTPS. Authentication is performed using HTTP Basic Authentication, where your API key is the username and the password is empty.

curl --verbose \
    --user '[api key]':'' \
    --header 'Content-Type: application/json' \


    --data '[{"url": "https://blog.scrapinghub.com/gopro-study", "pageType": "article"}]' \
    https://autoextract.scrapinghub.com/v1/extract

Or, in Python

import requests

response = requests.post(
    'https://autoextract.scrapinghub.com/v1/extract',
    auth=('[api key]', ''),
    json=[{'url': 'https://blog.scrapinghub.com/gopro-study', 'pageType': 'article'}])
print(response.json())

Requests

Requests are comprised of a JSON array of queries. Each query is a map containing the following fields:

Name             Required   Type      Description
url              Yes        String    URL of web page to extract from. Must be a valid http:// or https:// URL.
pageType         Yes        String    Type of extraction to perform. Must be article or product.
meta             No         String    User UTF-8 string, which will be passed through the extraction pipeline and
                                      returned in the query result. Max size 4 Kb.
articleBodyRaw   No         Boolean   Whether or not to include article HTML in article extractions. True by default.
                                      Setting this to false can reduce response size significantly if HTML is not
                                      required.
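For instance, a batch of queries that tags each one with meta and turns off articleBodyRaw could be built like this from Python (the URLs are illustrative):

import requests

queries = [
    {'url': 'https://example.com/article-1', 'pageType': 'article',
     'meta': 'query1', 'articleBodyRaw': False},
    {'url': 'https://example.com/article-2', 'pageType': 'article',
     'meta': 'query2', 'articleBodyRaw': False},
]

response = requests.post(
    'https://autoextract.scrapinghub.com/v1/extract',
    auth=('[api key]', ''),
    json=queries)
print(response.json())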

Responses

API responses are wrapped in a JSON array (this is to facilitate query batching; see below). A query response for a single article extraction looks like this (some large fields are truncated):

[{"query": {

"id": "1564747029122-9e02a1868d70b7a1","domain": "scrapinghub.com","userQuery": {"url": "https://blog.scrapinghub.com/2018/06/19/a-sneak-peek-inside-what-

→˓hedge-funds-think-of-alternative-financial-data","pageType": "article"

}},"article": {

"articleBody": "Unbeknownst to many..","articleBodyHtml": "<article>Unbeknownst to many..","articleBodyRaw": "<span id=...","headline": "A Sneak Peek Inside What Hedge Funds Think of Alternative

→˓Financial Data","inLanguage": "en","datePublished": "2018-06-19T00:00:00","datePublishedRaw": "June 19, 2018",


"author": "Ian Kerins","authorsList": [

"Ian Kerins"],"mainImage": "https://blog.scrapinghub.com/hubfs/conference-1038x576.jpg

→˓#keepProtocol","images": [

"https://blog.scrapinghub.com/hubfs/conference-1038x576.jpg"],"description": "A Sneak Peek Inside What Hedge Funds Think of Alternative

→˓Financial Data","url": "https://blog.scrapinghub.com/2018/06/19/a-sneak-peek-inside-what-hedge-

→˓funds-think-of-alternative-financial-data","probability": 0.7369686365127563

}}

]

5.1.3 Output fields

Query

All API responses include the original query along with some additional information such as the query ID:

# Enriched query
print(response.json()[0]['query'])

Product Extraction

If you requested a product extraction, and the extraction succeeds, then the product field will be available in the query result:

import requests

response = requests.post(
    'https://autoextract.scrapinghub.com/v1/extract',
    auth=('[api key]', ''),
    json=[{'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
           'pageType': 'product'}])
print(response.json()[0]['product'])

The following fields are available for products:


name (String)
  The name of the product.

offers (List of dictionaries with price, currency, regularPrice and availability string fields)
  Offers of the product. All fields are optional, but one of price or availability is present. The price field is a string with a valid number (dot is the decimal separator). currency is the currency as given on the web site, without extra normalization (for example both "$" and "USD" are possible currencies); it is present only if price is also present. regularPrice is the price before the discount or any special offer; it is present only when the price is different from regularPrice. availability is product availability, currently either "InStock" or "OutOfStock". "InStock" includes the following cases: in-stock, limited availability, pre-sale (the item is available for ordering and delivery before general availability), pre-order (the item is available for pre-order, but will be delivered when generally available), in-store-only (the item is available only at physical locations). "OutOfStock" includes the following cases: out-of-stock, discontinued and sold-out.

sku (String)
  Stock Keeping Unit identifier for the product, assigned by the seller.

mpn (String)
  Manufacturer part number identifier for the product. It is issued by the manufacturer and is the same across different websites for a product.

gtin (List of dictionaries with type and value string fields)
  Standardized GTIN product identifier which is unique for a product across different sellers. It includes the following types: isbn10, isbn13, issn, ean13, upc, ismn, gtin8, gtin14. gtin14 corresponds to former names EAN/UCC-14, SCC-14, DUN-14, UPC Case Code, UPC Shipping Container Code. ean13 also includes the jan (Japanese article number). E.g. [{'type': 'isbn13', 'value': '9781933624341'}]

brand (String)
  Brand or manufacturer of the product.

breadcrumbs (List of dictionaries with name and link optional string fields)
  A list of breadcrumbs (a specific navigation element) with optional name and URL.

mainImage (String)
  A URL or data URL value of the main image of the product.

images (List of strings)
  A list of URL or data URL values of all images of the product (may include the main image).

description (String)
  Description of the product.

aggregateRating (Dictionary with ratingValue and bestRating float fields and a reviewCount int field)
  ratingValue is the average rating value. bestRating is the best possible rating value. reviewCount is the number of reviews or ratings for the product. All fields are optional, but one of reviewCount or ratingValue is present.

additionalProperty (List of dictionaries with name and value fields)
  A list of product properties or characteristics; the name field contains the property name, and the value field contains the property value.

probability (Float)
  Probability that the requested page is a single product page.

url (String)
  URL of the page where this product was extracted.


All fields are optional, except for url and probability. Fields without a valid value (null or empty array) are excluded from extraction results.

Below is an example response with all product fields present:

[{"product": {

"name": "Product name","offers": [

{"price": "42","currency": "USD","availability": "InStock"

}],"sku": "product sku","mpn": "product mpn","gtin": [{"type": "ean13","value": "978-3-16-148410-0"

}],"brand": "product brand","breadcrumbs": [

{"name": "Level 1","link": "http://example.com"

}],"mainImage": "http://example.com/image.png","images": [

"http://example.com/image.png"],"description": "product description","aggregateRating": {

"ratingValue": 4.5,"bestRating": 5.0,"reviewCount": 31

},"additionalProperty": [

{"name": "property 1","value": "value of property 1"

}],"probability": 0.95,"url": "https://example.com/product"

},"query": {"id": "1564747029122-9e02a1868d70b7a2","domain": "example.com","userQuery": {"pageTypeHint": "product","url": "https://example.com/product"

}}


  }
]

Article Extraction

If you requested an article extraction, and the extraction succeeds, then the article field will be available in the query result:

import requests

response = requests.post(
    'https://autoextract.scrapinghub.com/v1/extract',
    auth=('[api key]', ''),
    json=[{'url': 'https://blog.scrapinghub.com/2016/08/17/introducing-scrapy-cloud-with-python-3-support',
           'pageType': 'article'}])

print(response.json()[0]['article'])

The following fields are available for articles:

headline (String)
  Article headline or title.

datePublished (String)
  Date, ISO-formatted with 'T' separator, may contain a timezone.

datePublishedRaw (String)
  Same date but before parsing, as it appeared on the site.

author (String)
  Author (or authors) of the article.

authorsList (List of strings)
  All authors of the article split into separate strings, for example the author value might be "Alice and Bob" and the authorsList value ["Alice", "Bob"], while for a single author the author value might be "Alice Johnes" and the authorsList value ["Alice Johnes"].

inLanguage (String)
  Language of the article, as an ISO 639-1 language code.

breadcrumbs (List of dictionaries with name and link optional string fields)
  A list of breadcrumbs (a specific navigation element) with optional name and URL.

mainImage (String)
  A URL or data URL value of the main image of the article.

images (List of strings)
  A list of URL or data URL values of all images of the article (may include the main image).

description (String)
  A short summary of the article, human-provided if available, or auto-generated.

articleBody (String)
  Text of the article, including sub-headings and image captions, with newline separators.

articleBodyHtml (String)
  Simplified HTML of the article, including sub-headings, image captions and embedded content (videos, tweets, etc). See the "Format of articleBodyHtml Field" section below.

articleBodyRaw (String)
  HTML of the article body as seen in the source page.

videoUrls (List of strings)
  A list of URLs of all videos inside the article body.

audioUrls (List of strings)
  A list of URLs of all audios inside the article body.

probability (Float)
  Probability that this is a single article page.

url (String)
  URL of the page where this article was extracted.

All fields are optional, except for url and probability. The articleBodyRaw field will only be returned if you pass "articleBodyRaw": true as a query parameter. Fields without a valid value (null or empty array) are excluded from extraction results.

Below is an example response with all article fields present:

[{"article": {

"headline": "Article headline","datePublished": "2019-06-19T00:00:00","datePublishedRaw": "June 19, 2018","author": "Article author","authorsList": [

"Article author"],"inLanguage": "en","breadcrumbs": [

{"name": "Level 1","link": "http://example.com"

}],"mainImage": "http://example.com/image.png","images": ["http://example.com/image.png"

],"description": "Article summary","articleBody": "Article body ...","articleBodyHtml": "<article>Simplified HTML of article body ...","articleBodyRaw": "<div>Raw HTML of article body ...","videoUrls": [

"https://example.com/video"],"audioUrls": [

"https://example.com/audio"],"probability": 0.95,"url": "https://example.com/article"

},"query": {"id": "1564747029122-9e02a1868d70b7a3","domain": "example.com","userQuery": {"pageTypeHint": "article","url": "https://example/article"

}}

}]

Format of articleBodyHtml Field

The articleBodyHtml field in article extractions contains a normalized and simplified HTML version of the article body. It is easy to create your own CSS styles over this HTML so that the final look-and-feel is integrated with the rest of your app.

The normalized HTML also allows for automated HTML processing which is consistent across websites. For example:

• To get all images with their captions you can run the //figure XPath and then ./img and ./figcaption

• h tags are normalized, making the article hierarchy easy to determine


• Tables and lists can be extracted cleanly

• Links are absolute

• Only semantic HTML tags are returned - no generic divs/spans are included

The supported tags and attributes are normalized as follows:

Sectioning
  Normalization: All content is enclosed in a root article tag. Headings are normalized so that they always start with h2.
  Supported elements: article (root only), h2, h3, h4, h5, h6, aside

Text
  Normalization: Paragraphs are enclosed with the p tag. Tables, lists, definition lists and block quotes are supported.
  Supported elements: p, table, tbody, thead, tfoot, th, tr, td, ul, ol, li, dl, dt, dd, blockquote

Inline text
  Normalization: The b tag is translated to strong. The i tag is translated to em.
  Supported elements: a, br, strong, em, s, sup, sub, del, ins, u, cite

Preformatted text
  Normalization: None
  Supported elements: pre, code

Multimedia elements
  Normalization: Multimedia elements are generally enclosed within figure. Captions for these elements are included within the figcaption tag when available. If multimedia elements appear in the text as inline elements within paragraphs they are kept as is (without enclosing them in a figure element).
  Supported elements: figure, figcaption, img, video, audio, iframe, embed, object, source

Supported attributes
  Tag attributes not in the supported list are filtered from the output.
  Supported attributes: data-*, alt, cite, colspan, datetime, dir, href, label, rowspan, src, srcset, sizes, start, title, type, value, vspace

Example response:

<article>

<p>The range of use cases for web data extraction is rapidly increasing and with it the necessary investment. Plus the number of websites continues to grow rapidly and is expected to exceed 2 billion by 2020.</p>

<p>Presented by <a href="https://scrapinghub.com/">Scrapinghub</a>, the first Web Data Extraction Summit will be held in Dublin, Ireland on 17th September 2019. This is the first-ever event dedicated to web data and extraction and will be graced by over 100 CEOs, Founders, Data Scientists and Engineers.</p>

<figure><iframe src="https://play.vidyard.com/7hJbbWtiNgipRiYHhTCDf6?v=4.2.13&amp;viral_sharing=0&amp;embed_button=0&amp;hide_playlist=1&amp;color=FFFFFF&amp;playlist_color=FFFFFF&amp;play_button_color=2A2A2A&amp;gdpr_enabled=1&amp;type=inline&amp;new_player_ui=1&amp;vydata%5Butk%5D=d057931dfb8520abe024ef4b2f68d0ad&amp;vydata%5Bportal_id%5D=4367560&amp;vydata%5Bcontent_type%5D=blog-post&amp;vydata%5Bcanonical_url%5D=https%3A%2F%2Fblog.scrapinghub.com%2Fthe-first-web-data-extraction-summit&amp;vydata%5Bpage_id%5D=12510333185&amp;vydata%5Bcontent_page_id%5D=12510333185&amp;vydata%5Blegacy_page_id%5D=12510333185&amp;vydata%5Bcontent_folder_id%5D=null&amp;vydata%5Bcontent_group_id%5D=5623735666&amp;vydata%5Bab_test_id%5D=null&amp;vydata%5Blanguage_code%5D=null&amp;disable_popouts=1" title="Video"></iframe></figure>


<p>With a promising line-up of talks and discussions accompanied by interesting conversations and networking sessions with fellow data enthusiasts, followed by food and drinks at the magnificent Guinness Storehouse, there are no reasons to miss this event. What’s more, we are also giving out free swag! You will get your own Extract Summit T-shirts on the day!</p>

<figure><img src="https://blog.scrapinghub.com/hubfs/Extract-Summit-Emails-images-tee-aug2019-v1.gif" alt="Extract-Summit-Emails-images-tee-aug2019-v1"></figure>

</article>

5.1.4 Errors

Errors fall into two broad categories: request-level and query-level. Request-level errors occur when the HTTP API server can’t process the input that it receives. Query-level errors occur when a specific query cannot be processed. You can detect these by checking the error field in query results.

Request-level

Examples include:

• Authentication failure

• Malformed request JSON

• Too many queries in request

• Request payload size too large

If a request-level error occurs, the API server will return a 4xx or 5xx response code. If possible, a JSON response body with content type application/problem+json will be returned that describes the error in accordance with RFC-7807 - Problem Details for HTTP APIs.

import requests

# Send a request with 101 queries
response = requests.post(
    'https://autoextract.scrapinghub.com/v1/extract',
    auth=('[api key]', ''),
    json=[{'url': 'http://www.example.com', 'pageType': 'product'}] * 101)

print(response.status_code == requests.codes.ok)  # False
print(response.status_code)                       # 413
print(response.headers['content-type'])           # application/problem+json
print(response.json()['title'])                   # Limit of 100 queries per request exceeded
print(response.json()['type'])                    # http://errors.xod.scrapinghub.com/queries-limit-reached

In the above example of the queries-limit problem (identified by the URI in type), the reason for the 413 is indicated in the title. The type field should be used to check the error type, as this will not change in subsequent versions. There could be more specific fields, depending on the error, providing additional details, e.g. the delay before retrying next time. Such responses can be easily parsed and used for programmatic error handling.
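A sketch of such programmatic handling, dispatching on the type URI (the URIs and the retry policy shown are only an example; the full list of types appears in the reference below):

import requests

RATE_LIMIT_TYPES = {
    'http://errors.xod.scrapinghub.com/rate-limit-exceeded.html',
    'http://errors.xod.scrapinghub.com/user-rate-limit-exceeded.html',
}

response = requests.post(
    'https://autoextract.scrapinghub.com/v1/extract',
    auth=('[api key]', ''),
    json=[{'url': 'http://www.example.com', 'pageType': 'product'}])

if not response.ok and response.headers.get('content-type') == 'application/problem+json':
    problem = response.json()
    if problem['type'] in RATE_LIMIT_TYPES:
        print('Rate limited, back off and retry later:', problem['title'])
    else:
        print('Request-level error:', problem['type'], '-', problem['title'])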


If it is not possible to return a JSON description of the error, then no content type header will be set for the response and the response body will be empty.

Query-level

If the error field is present in an extraction result, then an error has occurred and the extraction result will not be available.

import requests

response = requests.post(
    'https://autoextract.scrapinghub.com/v1/extract',
    auth=('[api key]', ''),
    json=[{'url': 'http://www.example.com/this-page-does-not-exist', 'pageType': 'article'}])

print('error' in response.json()[0])  # True
print(response.json()[0]['error'])    # Downloader error: http404

Reference

Request-level

Type                                                                    Short description
http://errors.xod.scrapinghub.com/queries-limit-reached.html            Limit of 100 queries per request exceeded
http://errors.xod.scrapinghub.com/malformed-json.html                   Could not parse request JSON
http://errors.xod.scrapinghub.com/rate-limit-exceeded.html              System-wide rate limit exceeded
http://errors.xod.scrapinghub.com/user-rate-limit-exceeded.html         User rate limit exceeded
http://errors.xod.scrapinghub.com/account-disabled.html                 Account has been disabled - contact support
http://errors.xod.scrapinghub.com/unrecognized-content-type.html        Unsupported request content type: should be application/json
http://errors.xod.scrapinghub.com/empty-request.html                    Empty request body - should be JSON document
http://errors.xod.scrapinghub.com/malformed-request.html                Unparseable request
http://errors.xod.scrapinghub.com/http-pipelining-not-supported.html    Attempt to send a second HTTP request over a TCP connection
http://errors.xod.scrapinghub.com/unknown-uri.html                      Invalid API endpoint
http://errors.xod.scrapinghub.com/method-not-allowed.html               Invalid HTTP method (only POST is supported)


Query-level

error contains                                 Description
query timed out                                10 minute time out for query reached
malformed url                                  Requested URL cannot be parsed
non-HTTP schemas are not allowed               Only http and https schemas are allowed
Domain ... is occupied, please retry in        Per-domain rate limiting was applied. It is recommended to retry
... seconds                                    after the specified interval.
Downloader error: No response (network301)     Cannot honor the request because the protocol is not known
Downloader error: No response (network5)       Remote server closed connection before transfer was finished
Downloader error: No visible elements          There are no visible elements in downloaded content
Downloader error: http304                      Remote server returned HTTP status code 304 (not modified)
Downloader error: http404                      Remote server returned HTTP status code 404 (not found)
Downloader error: http500                      Remote server returned HTTP status code 500 (internal server error)
Proxy error: ssl_tunnel_error                  SSL proxy tunneling error
Proxy error: banned                            Crawlera made several retries, but was unable to avoid banning.
                                               This flags antiban measures in action, but doesn't mean the proxy
                                               pool is exhausted. Retry is recommended.
Proxy error: domain_forbidden                  Domain is forbidden on Crawlera side
Proxy error: internal_error                    Internal proxy error
Proxy error: nxdomain                          Crawlera wasn't able to resolve domain through DNS

Other, more rare, errors are also possible.
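As a sketch, a client could keep the successful extractions and collect queries whose documented error is retryable; the string matching on error text below is heuristic, not an official error code check:

import requests

queries = [{'url': 'http://www.example.com', 'pageType': 'product'}]

response = requests.post(
    'https://autoextract.scrapinghub.com/v1/extract',
    auth=('[api key]', ''),
    json=queries)

succeeded, to_retry = [], []
for result in response.json():
    url = result['query']['userQuery']['url']
    error = result.get('error')
    if error is None:
        succeeded.append(result)
    elif 'is occupied' in error or error.startswith('Proxy error: banned'):
        # Both conditions are documented above as worth retrying.
        to_retry.append(url)
    else:
        print('Giving up on', url, '-', error)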

5.1.5 Restrictions and Failure Modes

• A maximum of 100 queries may be submitted in a single request. The total size of the request body cannot exceed 128KB.

• There is a global timeout of 10 minutes for queries. Queries can time out for a number of reasons, such as difficulties during content download. If a query in a batched request times out, the API will return the results of the extractions that did succeed along with errors for those that timed out. We therefore recommend that you set the HTTP timeout for API requests to over 10 minutes (a minimal example follows this list).
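A minimal sketch of that client-side timeout with the requests library; the 660-second value is simply an example comfortably above the 10-minute query timeout:

import requests

response = requests.post(
    'https://autoextract.scrapinghub.com/v1/extract',
    auth=('[api key]', ''),
    json=[{'url': 'http://www.example.com', 'pageType': 'product'}],
    timeout=660)  # seconds; set above the 10-minute global query timeout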

5.1.6 Batching Queries

Multiple queries can be submitted in a single API request, resulting in an equivalent number of query results.

Note: When using batch requests, each query is accounted towards usage limits separately. For example, sending a batch request with 10 queries will incur the same cost as sending 10 requests with 1 query each.

import requests

response = requests.post(
    'https://autoextract.scrapinghub.com/v1/extract',
    auth=('[api key]', ''),
    json=[{'url': 'https://blog.scrapinghub.com/2016/08/17/introducing-scrapy-cloud-with-python-3-support', 'pageType': 'article'},
          {'url': 'https://blog.scrapinghub.com/spidermon-scrapy-spider-monitoring', 'pageType': 'article'},
          {'url': 'https://blog.scrapinghub.com/gopro-study', 'pageType': 'article'}])

for query_result in response.json():
    print(query_result['article']['headline'])

Note that query results are not necessarily returned in the same order as the original queries. If you need an easy way to associate the results with the queries that generated them, you can pass an additional meta field in the query. The value that you pass will appear as the query/userQuery/meta field in the corresponding query result. For example, you can create a dictionary keyed on the meta field to match queries with their corresponding results:

import requests

queries = [
    {'meta': 'query1', 'url': 'https://blog.scrapinghub.com/2016/08/17/introducing-scrapy-cloud-with-python-3-support', 'pageType': 'article'},
    {'meta': 'query2', 'url': 'https://blog.scrapinghub.com/spidermon-scrapy-spider-monitoring', 'pageType': 'article'},
    {'meta': 'query3', 'url': 'https://blog.scrapinghub.com/gopro-study', 'pageType': 'article'}]

response = requests.post(
    'https://autoextract.scrapinghub.com/v1/extract',
    auth=('[api key]', ''),
    json=queries)

query_results = {result['query']['userQuery']['meta']: result for result in response.json()}

for query in queries:
    query_result = query_results[query['meta']]
    print(query_result['article']['headline'])


CHAPTER 6

Unified Schema

See Unified Schema.

6.1 Unified Schema

The Unified Schema project aims to provide a standard definition for the different types of data such as products, articles, reviews, jobs, etc. extracted across websites.

Note: All fields in AutoExtract have the exact same definition in the Unified Schema. We aim to maintain backward compatibility while adding new fields. We also try our best to adhere to schema.org, only diverging when there is a reasonable benefit in doing so.

6.1.1 Product Schema

The following fields are available for products:


aggregateRating (Dictionary with ratingValue, bestRating and reviewCount Number fields)
  The overall rating, based on a collection of reviews or ratings.
  Example: {'ratingValue': 4.0, 'bestRating': 5.0, 'reviewCount': 23}

additionalProperty (List of dictionaries with name and value String fields)
  This name-value pair field holds information pertaining to product-specific features that have no matching property in the Product schema.
  Example: [{"name": "batteries", "value": "1 Lithium ion batteries required. (included)"}, {"name": "Item model number", "value": "SM-A105G/DS"}]

brand (String)
  The brand associated with the product. No brand is returned if it cannot be extracted.
  Example: {"brand": "Samsung"}

breadCrumbs (List of dictionaries with name and link String fields)
  A list of breadcrumbs with optional name and URL.
  Example: [{"name": "Cell Phones & Accessories", "link": "https://mjz.com/cell-phones-accessories"}...]

description (String)
  A description of the product.

gtin (List of dictionaries with type and value String fields)
  Standardized GTIN product identifier which is unique for a product across different sellers. It includes the following types: isbn10, isbn13, issn, ean13, upc, ismn, gtin8, gtin14. gtin14 corresponds to former names EAN/UCC-14, SCC-14, DUN-14, UPC Case Code, UPC Shipping Container Code. ean13 also includes the jan (Japanese article number).
  Example: [{'type': 'isbn13', 'value': '9781933624341'}]

images (List of Strings)
  A list of URL or data URL values of all images of the product (may include the main image).

mainImage (String)
  A URL or data URL value of the main image of the product.

mpn (String)
  The Manufacturer Part Number (MPN) of the product. The product would have the same MPN across different e-commerce websites.

name (String)
  The name of the product.

offers (List of dictionaries with availability, currency, listPrice, price, eligibleQuantity, seller, shippingInfo, availableAtOrFrom, areaServed and itemCondition fields)
  This field contains rich information pertaining to all the buying options offered on a product. Detailed information regarding all the properties returned in this field is available in the offers section.
  Example:
    [{"availability": "InStock",
      "price": "129.99",
      "currency": "$",
      "itemCondition": {"type": "used", "description": "Used - Very Good"},
      "seller": {"name": "Merch Store",
                 "url": "https://mzi.com/dr/amg/seller=A8K32FFKI51FKN",
                 "identifier": "A8K32FFKI51FKN",
                 "aggregateRating": {"reviewCount": 479, "bestRating": 5},
                 "shippingInfo": {"minDays": "15", "maxDays": "30", "description": "Arrives between September 3-18."}}}]

ratingHistogram (List of dictionaries with ratingValue String, ratingCount Number and ratingPercentage Number fields)
  This field provides the detailed distribution of ratings across the entire rating scale.
  Example: [{"ratingValue": "5", "ratingPercentage": 61}, {"ratingValue": "4", "ratingPercentage": 12}, {"ratingValue": "3", "ratingPercentage": 6}, {"ratingValue": "2", "ratingPercentage": 5}, {"ratingValue": "1", "ratingPercentage": 16}]

releaseDate
  Date on which the product was released or listed on the website in ISO 8601 date format.
  Example: {"releaseDate": "2016-12-18"}

relatedProducts (List of dictionaries with relationshipName String and products List fields)
  This field captures all products that are recommended by the website while browsing the product of interest. Related products can thus be used to gauge customer buying behaviour, sponsored products as well as best sellers in the same category. The relationshipName field describes the relationship, while the products field contains a list of items that have the same product schema, thus extracting all available fields as defined in this table.

variants (List of Product items)
  This field returns a list of variants of the product. Each variant has the same schema as the Product schema defined in this table.

sku (String)
  The Stock Keeping Unit (SKU), i.e. a merchant-specific identifier for the product.
  Example: {"sku": "A123DK9823"}

width (String)
  The width of the product.

height (String)
  The height of the product.

depth (String)
  The depth of the product.

weight (String)
  The weight of the product.

volume (String)
  The volume of the product.

url (String, required)
  The URL of the product.


offers

The offers field contains several sub-fields, explained below, that can be leveraged to get deep insights into the various product offerings, associated seller information, as well as inventory.

eligibleQuantity

This field gives details about bulk purchase offers available for the product.

Field         Format   Description
maxValue      Number   Maximum value allowed
minValue      Number   Minimum value required
value         Number   Exact value required
unitText      String   Unit of measurement
description   String   Free text from where this range was extracted

Let’s take the following example to examine the aforementioned fields

{'offers': [
    {'price': '11,98', 'currency': '$'},
    {'price': '10,78', 'currency': '$', 'eligibleQuantity': {'min_value': '48', 'description': 'Buy 44 or more $9.33'}}]
}

availableAtOrFrom

The place(s) from which the offer can be obtained (e.g. store locations). It could contain a string, i.e.: online_only


Field             Format   Description
postalCode        String   Postal code of the address
streetAddress     String   The street address. For example, 1600 Amphitheatre Pkwy.
addressCountry    String   The country. For example, USA. You can also provide the two-letter ISO 3166-1 alpha-2 country code. https://en.wikipedia.org/wiki/ISO_3166-1
addressLocality   String   The locality in which the street address is, and which is in the region. For example, Mountain View.
addressRegion     String   The region in which the locality is, and which is in the country. For example, California.

areaServed

The geographic area where a service or offered item is provided. The fields and their definitions are the same as for availableAtOrFrom.

shippingInfo

Field           Format                    Description
currency        String                    Currency associated to the price
price           String                    Cost of shipping
minDays         Number                    Minimum number of days estimated for the delivery
maxDays         Number                    Maximum number of days estimated for the delivery
averageDays     Number                    Average days for a delivery
description     String                    Any associated text describing the shipping info
originAddress   String or postalAddress   Location of the warehouse where the item is shipped from

seller

This field provides the seller details including rating.

Field             Format       Description
name              String       Name of the seller
url               String       URL for the seller's page
identifier        String       Unique identifier assigned to the seller on the website
aggregateRating   Dictionary   The seller's rating. Same as aggregateRating in the product schema.

itemCondition

A predefined value and a textual description of the condition of the product included

Field         Format   Description
type          String   A predefined value of the condition of the product included in the offer. Takes on one of the following enumerated values: ['NewCondition', 'DamagedCondition', 'RefurbishedCondition', 'UsedCondition']
description   String   A textual description of the condition of the product included in the offer


6.1.2 Article Schema

The following fields are available for articles:

headline (String)
  Article headline or title.

datePublished (String)
  Date, ISO-formatted with 'T' separator, may contain a timezone.

datePublishedRaw (String)
  Same date but before parsing, as it appeared on the site.

author (String)
  Author (or authors) of the article.

authorsList (List of strings)
  All authors of the article split into separate strings, for example the author value might be "Alice and Bob" and the authorsList value ["Alice", "Bob"], while for a single author the author value might be "Alice Johnes" and the authorsList value ["Alice Johnes"].

inLanguage (String)
  Language of the article, as an ISO 639-1 language code.

breadcrumbs (List of dictionaries with name and link optional string fields)
  A list of breadcrumbs (a specific navigation element) with optional name and URL.

mainImage (String)
  A URL or data URL value of the main image of the article.

images (List of strings)
  A list of URL or data URL values of all images of the article (may include the main image).

description (String)
  A short summary of the article, human-provided if available, or auto-generated.

articleBody (String)
  Text of the article, including sub-headings and image captions, with newline separators.

articleBodyRaw (String)
  HTML of the article body.

videoUrls (List of strings)
  A list of URLs of all videos inside the article body.

audioUrls (List of strings)
  A list of URLs of all audios inside the article body.

probability (Float)
  Probability that this is a single article page.

url (String)
  URL of the page where this article was extracted.
