Paul Bradshaw Leanpub.com/scrapingforjournalists*
Scraping for stories
Why scraping
How to spot opportunities for scraping
Tools and traits: what can be scraped, and how
Why and how
Automating the repetitive gathering of data, e.g.
Multiple tables in one page
Webpage tables
Multiple spreadsheets
Multiple PDFs
What is scraping?
Why is a government website carrying fake jobs?
Aron Pilhofer, News Rewired
https://www.youtube.com/watch?v=Efr-VEkwWoM
http://www.mirror.co.uk/news/uk-news/singer-best-vocal-range-uk-4323076
http://www.private-eye.co.uk/registry
Focus.
New entries - or disappearing ones
http://helpmeinvestigate.com/olympics/olympic-torch-relay-youth-target-missed-by-over-1000-places/
https://moveplanner.zoopla.co.uk/terms-and-conditions
http://blogs.ft.com/ftdata/2014/06/11/interactive-explore-the-statistical-identity-of-every-team-at-the-world-cup/?
http://www.nature.com/news/scientific-publishing-the-inside-track-1.15424
What makes a website suitable for scraping?
http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs
URL parameters
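A minimal sketch of reading and changing URL parameters, using the jobs search URL shown above. The replacement value "internships" is a hypothetical illustration and is not known to be a value the site accepts.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

url = "http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs"

# Everything after the "?" is the query string: name=value pairs joined by "&".
params = parse_qs(urlsplit(url).query)


def with_params(url, **new_params):
    """Return the URL with the given parameters set or overwritten."""
    parts = urlsplit(url)
    # parse_qs returns lists of values; keep the first value of each parameter.
    query = {name: values[0] for name, values in parse_qs(parts.query).items()}
    query.update(new_params)
    return urlunsplit(parts._replace(query=urlencode(query)))


print(params)
# "internships" is a made-up value, used only to show the mechanics.
print(with_params(url, search="internships"))
```

If changing a parameter by hand in the browser changes what the page shows, the same change can usually be generated in a loop.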
https://stores.sainsburys.co.uk/
HTML <table>
HTML tag(s)
JSON file
XML
TXT
CSV or XLS
PDF
PDF which needs OCR
Document challenges
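The simplest case in the list above, an HTML <table>, could be read roughly like this. It is a sketch using only Python's standard library; the sample table and the store names in it are invented for illustration, and a real page would normally be fetched first.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of each table cell, grouped into rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Invented sample data, standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Store</th><th>Town</th></tr>
  <tr><td>Example Local</td><td>Birmingham</td></tr>
  <tr><td>Example Superstore</td><td>Leeds</td></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE_HTML)
for row in parser.rows:
    print(row)
```

The formats further down the list (PDFs, and especially PDFs needing OCR) generally need extra tools and more cleaning.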
1 page, changes
>1 pages, ‘next’ links
Pages linked from 1 index
>1 pages, URL pattern
Search results, URL pattern
Search results, uses cookie
Search results, login needed
Hosting challenges
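Where results are spread over several pages that follow a URL pattern, the list of addresses can be generated rather than collected by hand. The sketch below assumes a hypothetical site (example.com) and a hypothetical parameter called "page"; the real pattern has to be worked out by clicking through the site and watching the address bar.

```python
import time
from urllib.request import urlopen

# Hypothetical pattern: {page} marks the part of the URL that changes.
PATTERN = "https://example.com/results?page={page}"


def page_urls(pattern, first, last):
    """Build one URL per page number, from first to last inclusive."""
    return [pattern.format(page=number) for number in range(first, last + 1)]


def fetch_all(urls, delay_seconds=2.0):
    """Download each URL in turn, pausing between requests to go easy on the server."""
    pages = {}
    for url in urls:
        with urlopen(url) as response:
            pages[url] = response.read()
        time.sleep(delay_seconds)
    return pages


urls = page_urls(PATTERN, 1, 3)
print(urls)
# fetch_all(urls) would download the pages; it is not called here
# because the pattern above is only a placeholder.
```

Search results that depend on a cookie or a login are harder, because the URL alone is no longer enough to reproduce the page.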
Patterns
Look for structure in a webpage: how are elements distinguished?
Think code and text
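One way to picture the two kinds of pattern is the sketch below. The HTML snippet, the class name "title" and the "Salary:" label are all invented for illustration; the point is only that some items can be picked out by the code around them (a class) and others by the text that precedes them (a label).

```python
import re
from html.parser import HTMLParser

# Invented sample markup, standing in for a real listings page.
SAMPLE_HTML = """
<div class="job"><h2 class="title">Researcher</h2><p>Salary: 25,000</p></div>
<div class="job"><h2 class="title">Caseworker</h2><p>Salary: Unpaid</p></div>
"""


class ClassTextParser(HTMLParser):
    """Collect the text of every element that carries a given class.

    Simplification: assumes the wanted elements contain plain text only,
    with no further tags nested inside them.
    """

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.found = []
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted_class in classes:
            self._capturing = True
            self.found.append("")

    def handle_data(self, data):
        if self._capturing:
            self.found[-1] += data

    def handle_endtag(self, tag):
        self._capturing = False


# Pattern in the code: every job title sits in an element with class "title".
title_parser = ClassTextParser("title")
title_parser.feed(SAMPLE_HTML)
titles = title_parser.found

# Pattern in the text: every salary follows the label "Salary:".
salaries = [s.strip() for s in re.findall(r"Salary:\s*([^<]+)", SAMPLE_HTML)]

print(list(zip(titles, salaries)))
```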
Chrome: right-click > Inspect
Inspector: right-click > Copy…
A tip about URLs
This bit isn’t needed.
This bit is just for SEO.
You can put anything there. Literally.
Do it now:
Identify an online source of information you might scrape
Think beyond tables: what about series of pages? Documents?
https://onlinejournalismblog.com/2013/09/18/ethics-in-data-journalism-mass-data-gathering-scraping-foi-and-deception/
Robots.txt http://www.tcij.org/robots.txt
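A robots.txt file can be checked in code before scraping. The sketch below uses Python's standard urllib.robotparser on an invented sample file so that it runs without a network connection; the rules, the "my-scraper" user agent name and the example.com addresses are all assumptions for illustration, not the contents of any real site's file.

```python
from urllib.robotparser import RobotFileParser

# Invented rules, in the format robots.txt files use.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(SAMPLE_ROBOTS.splitlines())

# For a live site, set_url() with the robots.txt address followed by read()
# would load the real file instead of the sample above.

blocked = robots.can_fetch("my-scraper", "https://example.com/private/report.html")
allowed = robots.can_fetch("my-scraper", "https://example.com/news/")

print("private page allowed?", blocked)
print("news page allowed?", allowed)
```

What the file permits is only part of the picture; the ethical questions in the links above still apply.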
Treat like any source:
Build in TGTBT (too good to be true) checks
Seek second sources
Seek right of reply/confirmation
Data is just a lead
http://www.storybench.org/to-scrape-or-not-to-scrape-the-technical-and-ethical-challenges-of-collecting-data-off-the-web/
https://www.mediawiki.org/wiki/API:Main_page
Does it have an API?
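Where a site offers an API, a structured request often replaces scraping altogether. As a rough sketch based on the MediaWiki API linked above, the code below builds a search request and reads titles out of a response. The endpoint shown is English Wikipedia's, chosen here as an assumption; the sample response is hand-written in the general shape the search module returns, not real output.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint: English Wikipedia's MediaWiki API.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def search_url(term, limit=5):
    """Build a MediaWiki search request that asks for JSON."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,
        "srlimit": limit,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def result_titles(response_text):
    """Pull the page titles out of a search response."""
    data = json.loads(response_text)
    return [hit["title"] for hit in data["query"]["search"]]


# Hand-written sample response, trimmed to the fields used here.
SAMPLE_RESPONSE = '{"query": {"search": [{"title": "Web scraping"}, {"title": "Data scraping"}]}}'

print(search_url("web scraping"))
print(result_titles(SAMPLE_RESPONSE))
```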
Gives you a long-term insight into the issue
Allows you to spot things being removed and added
Scheduled scrapes
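Spotting what has been added or removed comes down to comparing one scrape with the next. A minimal sketch, assuming each scrape produces a list of entries (the names below are invented); running the scraper on a schedule is a separate step, handled by whatever scheduler is available.

```python
def compare_snapshots(previous, current):
    """Report which entries appeared and which disappeared between two scrapes."""
    previous, current = set(previous), set(current)
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
    }


# Invented example data: the same list scraped on two different days.
monday = ["Company A", "Company B", "Company C"]
tuesday = ["Company A", "Company C", "Company D"]

changes = compare_snapshots(monday, tuesday)
print(changes)
```

Each disappearance or new entry is a lead to check, not a finding in itself.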
Thank you.