Crawling and Scraping tutorial at the Digital Methods Summer School 2013
![Page 1: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/1.jpg)
Crawling and Scraping
The Issuecrawler and the Lippmannian device.
Michael Stevenson
![Page 2: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/2.jpg)
Issuecrawler. What does it do?
![Page 3: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/3.jpg)
CRAWL STARTING POINTS
Sites: A, B, C
![Page 4: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/4.jpg)
CRAWL STARTING POINTS
Sites: A, B, C
CRAWL DEPTH ONE: follow all starting points' outlinks
Sites: A, B, C, D
![Page 5: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/5.jpg)
CRAWL STARTING POINTS
Sites: A, B, C
CRAWL DEPTH ONE: follow all starting points' outlinks
Sites: A, B, C, D
CRAWL DEPTH TWO: follow all outlinks from the pages found in the previous depth
Sites: A, B, C, D, E, F, G, H
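The depth-by-depth crawl on these slides can be sketched as a breadth-first traversal. This is a minimal illustration, not the Issuecrawler's actual code: the `OUTLINKS` table is a hypothetical link graph chosen so that the discovered sites match the slides (the slides do not list the individual links), and `crawl` is a name introduced here.

```python
from typing import Dict, List, Set

# Hypothetical outlinks, for illustration only; the slides do not show the edges.
OUTLINKS: Dict[str, List[str]] = {
    "A": ["B", "D"],
    "B": [],
    "C": ["B", "D"],
    "D": ["E", "F", "G", "H"],
}


def crawl(starting_points: List[str], depth: int) -> List[Set[str]]:
    """Return the set of known sites after each depth (index 0 = starting points)."""
    known: Set[str] = set(starting_points)
    frontier: Set[str] = set(starting_points)
    history: List[Set[str]] = [set(known)]
    for _ in range(depth):
        found: Set[str] = set()
        for site in frontier:
            # Follow every outlink of the sites found in the previous depth.
            found.update(OUTLINKS.get(site, []))
        frontier = found - known  # only newly discovered sites are followed next
        known |= frontier
        history.append(set(known))
    return history


for level, sites in enumerate(crawl(["A", "B", "C"], depth=2)):
    print(level, sorted(sites))
```

With this assumed graph, depth one adds site D and depth two adds sites E to H, mirroring the three diagrams.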
![Page 6: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/6.jpg)
ANALYSIS: SNOWBALL
Retain all links and sites discovered during the crawl.
Sites: A, B, C, D, E, F, G, H
![Page 7: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/7.jpg)
ANALYSIS: INTER-ACTOR
Retain only links between the starting points.
Sites: A, B, C
![Page 8: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/8.jpg)
ANALYSIS: CO-LINK
Retain sites that receive links from at least two other sites.
Sites: B, D
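The three analysis modes differ only in which part of the crawl result they keep. The sketch below applies the slide definitions to a hypothetical list of links; the `LINKS` data and the function names are assumptions made for illustration, and the real Issuecrawler may apply these rules differently (for example, per crawl iteration).

```python
from typing import Dict, List, Set, Tuple

Link = Tuple[str, str]  # (linking site, linked site)

# Hypothetical crawl result, for illustration only.
LINKS: List[Link] = [
    ("A", "B"), ("A", "D"), ("C", "B"), ("C", "D"),
    ("D", "E"), ("D", "F"), ("D", "G"), ("D", "H"),
]
STARTING_POINTS: Set[str] = {"A", "B", "C"}


def snowball(links: List[Link]) -> Set[str]:
    """Retain every site discovered during the crawl."""
    return {site for link in links for site in link}


def inter_actor(links: List[Link], starting_points: Set[str]) -> List[Link]:
    """Retain only links whose two ends are both starting points."""
    return [(s, t) for s, t in links if s in starting_points and t in starting_points]


def co_link(links: List[Link], minimum: int = 2) -> Set[str]:
    """Retain sites that receive links from at least `minimum` other sites."""
    inlinkers: Dict[str, Set[str]] = {}
    for source, target in links:
        if source != target:  # a site linking to itself does not count
            inlinkers.setdefault(target, set()).add(source)
    return {site for site, sources in inlinkers.items() if len(sources) >= minimum}


print(sorted(snowball(LINKS)))
print(inter_actor(LINKS, STARTING_POINTS))
print(sorted(co_link(LINKS)))
```

On this assumed graph, snowball keeps sites A to H, inter-actor keeps the links among A, B and C, and co-link keeps B and D, which is the same outcome the slides show.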
![Page 9: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/9.jpg)
Climate change blogs network
Starting points: blogroll from RealClimate.org
![Page 10: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/10.jpg)
Climate change blogs network
Results: mix of blogs, social media, traditional media and governmental and non-governmental organizations.
![Page 11: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/11.jpg)
Climate change science network
Starting points: “science links” from RealClimate.org
Results: mix of governmental, non-governmental, educational and media organizations
![Page 12: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/12.jpg)
OK... We have the issue networks, but what can we say about their content?
![Page 13: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/13.jpg)
Lippmannian device.
(aka the Google Scraper)
![Page 14: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/14.jpg)
What does it do?
1. Explore a source’s partisanship or commitment.
2. Show the issue agenda of an organization or movement.
Source cloud: partisanship or commitment. Which sources mention the expert’s name?
Issue cloud: issue agenda. Which issues are on the agenda of an organization or movement?
![Page 15: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/15.jpg)
Lippmannian device. “Source cloud”
Showing the partisanship or commitment of sources to one name
Craig Venter's presence in the Synthetic Biology issue space, March 2008. Top sources on "synthetic biology" according to a Google query, with number of mentions of Venter per source, ordered.
![Page 16: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/16.jpg)
Lippmannian device. “Source cloud”
Method for showing the partisanship or commitment of sources to names
1. Gather source list (e.g. through Issuecrawler)
2. Query source list for one or more experts
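The two steps above can be sketched as building one exact-phrase query per source and name. This is only an illustration: it assumes the source list is queried with Google's `site:` operator, and `build_queries` is a hypothetical helper, not part of the Lippmannian device. No request is sent; the sketch only prepares the query strings.

```python
from typing import List, Tuple


def build_queries(sources: List[str], names: List[str]) -> List[Tuple[str, str, str]]:
    """Return (source, name, query) triples, one per source and name."""
    queries = []
    for source in sources:
        for name in names:
            # Quotation marks request an exact match for the full name.
            queries.append((source, name, f'"{name}" site:{source}'))
    return queries


# Example inputs taken from elsewhere in these slides.
sources = ["realclimate.org", "sourcewatch.org"]
names = ["Frederick Seitz"]

for source, name, query in build_queries(sources, names):
    print(query)
```

The number of results returned for each query would then serve as the mention count for that source.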
![Page 17: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/17.jpg)
Lippmannian device. “Source cloud”
Showing the partisanship or commitment of sources to names
Climate Change Skeptics: Who recognizes them?
(Digital Methods Initiative, 2007)
https://wiki.digitalmethods.net/Dmi/ClimateChangeSkeptics
![Page 18: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/18.jpg)
Lippmannian device. “Making an Issue cloud”
An organization’s issue agenda (or commitment)
Public Knowledge, a digital rights NGO, has issues. Which are they most committed to?
![Page 19: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/19.jpg)
![Page 20: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/20.jpg)
![Page 21: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/21.jpg)
![Page 22: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/22.jpg)
Lippmannian device. “Issue cloud”
Showing the issue commitments of the NGO, Public Knowledge
Public Knowledge's issue commitment. Lower six issues on Public Knowledge's issue list, ranked according to number of mentions of issues on publicknowledge.org, 2 October 2009.
![Page 23: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/23.jpg)
![Page 24: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/24.jpg)
Lippmannian device. “Making an Issue cloud”
Greenpeace issues, http://www.greenpeace.org/international/campaigns.
Stop climate change
Protect ancient forests
Defending our Oceans
Say no to genetic engineering
Eliminate toxic chemicals
Demand Peace and Disarmament
End the nuclear age
Encourage sustainable trade
Keep most significant issue language.
"climate change"
"ancient forests"
oceans
"genetic engineering"
"toxic chemicals"
disarmament
"nuclear power"
"sustainable trade"
![Page 25: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/25.jpg)
![Page 26: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/26.jpg)
Lippmannian device. “Issue cloud”
Greenpeace’s issue agenda (distribution of commitment)
Greenpeace's issue commitment. Greenpeace's campaign issue list, ranked according to number of mentions of issues on greenpeace.org, 11 October 2009.
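Counting mentions of each issue on greenpeace.org starts from the shortened issue terms. The sketch below shows one way to prepare such queries, assuming a `site:` restriction is used; `as_query_term` and `site_queries` are hypothetical helpers. Reducing a campaign title to its "most significant issue language" remains a manual, editorial step, so the terms are typed in by hand here.

```python
from typing import List

# Issue terms as kept on the slide, entered manually.
ISSUE_TERMS: List[str] = [
    "climate change", "ancient forests", "oceans", "genetic engineering",
    "toxic chemicals", "disarmament", "nuclear power", "sustainable trade",
]


def as_query_term(term: str) -> str:
    """Quote multi-word terms so they are matched as exact phrases."""
    term = term.strip()
    return f'"{term}"' if " " in term else term


def site_queries(terms: List[str], site: str) -> List[str]:
    """Build one site-restricted query per issue term."""
    return [f"{as_query_term(term)} site:{site}" for term in terms]


for query in site_queries(ISSUE_TERMS, "greenpeace.org"):
    print(query)
```

Single-word terms such as oceans stay unquoted, as on the slide, because there is no phrase to hold together.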
![Page 27: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/27.jpg)
Lippmannian device. “Making an Issue cloud”
Multiple sources, multiple issues
What is the agenda of the global human rights network?
Which issues are at the top and at the bottom of the agenda?
What is the current level of commitment to a particular issue?
![Page 28: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/28.jpg)
Lippmannian device. “Making an Issue cloud”
Multiple sources, multiple issues
This is more complicated, but still doable
(Govcom.org, University of Pittsburgh, UMass Amherst, ongoing)
![Page 29: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/29.jpg)
Lippmannian device. “Making an Issue cloud”
Take three good lists of human rights organizations (global south, global north, UN’s)
![Page 30: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/30.jpg)
Lippmannian device. “Making an Issue cloud”
Make a list of all issues listed on all Websites
![Page 31: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/31.jpg)
![Page 32: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/32.jpg)
![Page 33: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/33.jpg)
Lippmannian device. “Issue cloud”
Showing the issue commitments of global human rights network
Global human rights issue agenda. Global human rights actors' issues, ranked according to the estimated number of Google mentions on a set of global human rights actors' websites, 31 March 2009.
![Page 34: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/34.jpg)
Lippmannian device. “Issue cloud”
Showing the issue commitments of global human rights network
Global human rights issue agenda, bottom. Global human rights actors' issues, ranked according to the estimated number of Google mentions on a set of global human rights actors' websites, 31 March 2009.
![Page 35: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/35.jpg)
Lippmannian device.
Partisanship check. Which side of the controversy is an actor on?
Use the source cloud
![Page 36: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/36.jpg)
Lippmannian device.
1. Check an organization’s issue agenda. What are its current commitments?
2. Check a national or global movement’s issue agenda. What are its current commitments?
Use the issue cloud
![Page 37: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/37.jpg)
Questions.
![Page 38: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/38.jpg)
Exercise: Sourcing Climate Change Skeptics.
![Page 39: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/39.jpg)
Climate Change Sceptics on the Web (Frederick Seitz)
Research Question: To what extent are climate change 'skeptics' present in the climate change spaces on the Web?
Findings: There is distance between the skeptics and the top of the search engine returns.
Source: google.com
Query: “Frederick Seitz”
Method: Search for query “Frederick Seitz” in top 100. Organized in order.
Tools: Google Scraper and Tag Cloud Generator
Date: 30 July 2007
Product of the Digital Methods Initiative, dmi.mediastudies.nl. Analysis by Bram Nijhof, Richard Rogers and Laura van der Vlies. Design: Anne Helmond.
License: CC BY-NC-SA
campaigncc.org (1)
climateark.org (4)
marshall.org (8)
realclimate.org (35)
sourcewatch.org (21)
abc.net.au (0)
acfonline.org.au (0)
bbc.co.uk (0)
bom.gov.au (0)
cbc.ca (0)
ciel.org (0)
climatechallenge.gov.uk (0)
climatechange.ca.gov (0)
climatechange.com.au (0)
climatechangecentral.com (0)
climatechangecollege.org (0)
climatecrisis.net (0)
climatescience.gov (0)
dar.csiro.au (0)
davidsuzuki.org (0)
defra.gov.uk (0)
dfat.gov.au (0)
ec.gc.ca (0)
ecn.ac.uk (0)
ecokids.ca (0)
ecy.wa.gov (0)
eea.europa.eu (0)
eldis.org (0)
energy.gov (0)
envirolink.org (0)
epa.gov (0)
exploratorium.edu (0)
faqs.org (0)
foe.co.uk (0)
ft.com (0)
g8.gov.uk (0)
gcrio.org (0)
greenpeace.org (0)
grida.no (0)
guardian.co.uk (0)
iea.org (0)
iisd.org (0)
ipcc.ch (0)
iucn.org (0)
ltscotland.org.uk (0)
metoffice.gov.uk (0)
mfe.govt.nz (0)
mofa.go.jp (0)
nature.com (0)
nature.org (0)
ncdc.noaa.gov (0)
open2.net (0)
panda.org (0)
pewclimate.org (0)
royalsoc.ac.uk (0)
scidev.net (0)
scienceagogo.com (0)
state.gov (0)
theglobeandmail.com (0)
ucar.edu (0)
un.org (0)
unep.org (0)
who.int (0)
whoi.edu (0)
worldwildlife.org (0)
CLIMATE CHANGE SCEPTICS
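A source cloud like this one can be approximated by ordering the sources by mention count and scaling the text size with the count. The sketch below uses the non-zero counts from the list above plus two of the zero-count sources; the linear font scaling and the `source_cloud` function are assumptions for illustration, not the Tag Cloud Generator's actual algorithm.

```python
from typing import Dict, List, Tuple

# Mention counts for "Frederick Seitz", copied from the list above (subset).
MENTIONS: Dict[str, int] = {
    "campaigncc.org": 1,
    "climateark.org": 4,
    "marshall.org": 8,
    "realclimate.org": 35,
    "sourcewatch.org": 21,
    "bbc.co.uk": 0,
    "ipcc.ch": 0,
}


def source_cloud(mentions: Dict[str, int], min_pt: int = 10,
                 max_pt: int = 36) -> List[Tuple[str, int, int]]:
    """Return (source, count, font size) ordered by count, largest first."""
    if not mentions:
        return []
    top = max(mentions.values()) or 1  # avoid dividing by zero when all counts are 0
    ordered = sorted(mentions.items(), key=lambda item: (-item[1], item[0]))
    return [
        (site, count, round(min_pt + (max_pt - min_pt) * count / top))
        for site, count in ordered
    ]


for site, count, size in source_cloud(MENTIONS):
    print(f"{site} ({count}) -> {size}pt")
```

Sources with zero mentions stay in the cloud at the minimum size, which keeps the absence of the skeptic's name visible.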
![Page 40: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/40.jpg)
Research Question: Which climate change issue actors mention the skeptics, and what kinds of actors are more likely to mention them?
Method: Comparative query. Query the skeptics in three source sets (‘top’ sources, climate change blogs and climate change science network), outputting a source cloud for each.
![Page 41: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/41.jpg)
Source Sets:
(1) Top ten Google returns for “climate change” (mix of media as well as governmental organizations)
![Page 42: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/42.jpg)
Source Sets:
(2) Climate change blogs network (IssueCrawler results - mix of blogs, social media, traditional media and governmental and non-governmental organizations)
![Page 43: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/43.jpg)
Source Sets:
(3) Climate change science network (IssueCrawler results - governmental, non-governmental, educational and media organizations)
![Page 44: Crawling and Scraping tutorial at the Digital Methods Summer School 2013](https://reader034.vdocuments.site/reader034/viewer/2022042614/5580bb59d8b42ac6088b4ffd/html5/thumbnails/44.jpg)
Steps:
- Install the DMI toolbar, and open the Lippmannian device (aka Google Scraper - see tools.digitalmethods.net).
- Acquire source sets and skeptics list.
- Enter source sets and skeptics’ names. Query the source sets separately, and remember to use quotation marks (“”) to get exact returns.
- Wait, and fill in CAPTCHAs if necessary. Also use this moment to discuss hypotheses.
- Explore the output, and present findings.