internet archives and social science research - yeungnam university
DESCRIPTION
Talk given at Yeungnam University on April 6, 2014
TRANSCRIPT
BIG DATA AND SOCIAL SCIENCE THEORY
Leveraging Large Scale Data to Discover New Patterns in Society
Monday, April 7, 2014
Cybermotions @ Korea, Yeungnam University
Matthew Weber
Rutgers University, School of Communication & Information
Opportunity: The Internet Archive contains the largest single record of the history of the World Wide Web from 1995 to the present—a wealth of untapped research data.
Challenge: There is a significant lack of research-ready databases and tools available to the scholarly community.
© Internet Archive 2013
Opportunity: The ArchiveHub project aims to support the creation and dissemination of general guidelines & tools for conducting theoretically and methodologically rigorous longitudinal research using archival Web data.
Dataset | Research Potential | Dates | Captures | Unique URLs
Hurricane Katrina | Online networks and organizational resilience (Chewning, Lai and Doerfel, 2012; Perry, Taylor and Doerfel, 2003) in the wake of disasters; information dissemination | 2003 – 2012 | 1,694,236 | 663,740
Superstorm Sandy | (as above) | 2003 – 2012 | 41,703,112 | 20,013,455
US Senate | Study the growth of political activity in online environments (Adamic & Glance, 2005; Bruns, 2007; Chang & Park, 2012); polarization & media discourse | 109th – 112th Congresses | 26,965,770 | 8,674,397
US House | (as above) | 109th – 112th Congresses | 51,840,777 | 12,410,014
Occupy Wall Street | Previous research on NGOs in the online environment (Bach & Stark, 2004; Shumate, 2003, 2012; Shumate, Fulk, & Monge, 2005); use of hyperlink data to study the formation and role of alliances between SMOs | 2010 – 2012 | 247,928,272 | 113,259,655
US Media | Previous studies of news media organizations (Greer & Mensing, 2006; Weber, 2012; Weber & Monge, In Press); focus on evolutionary patterns | 2008 – 2012 | 1,315,132,555 | 539,184,823
http://archivehub.rutgers.edu
Tracing the Emergence of Organizational Forms
Environment: Organizations compete for scarce resources; during rapid periods of disruption, new entrants seek “protected” niches (Weber & Monge, 2014)
Population: In digital spaces, online connections provide communicative representations of information flows (Weber & Monge, 2012)
Formation of ties (e.g. hyperlinks) can positively impact long-term likelihood of organization survival (Weber, 2012)
Organization: Organizations adapt internally, reconfiguring team structures and developing new routines for knowledge sharing (Ellison, Gibbs & Weber, In Press; Weber & Kim, Under Review)
Big Data… Big Theory?
• Networks are central to social movements in that links between nodes can be influential in collective action
• Examples of nodes include participants, organizations, media and communications technologies
• Social networks and social movements (Diani, 2003)
• The interactions between actors, and between actors and hashtags, collectively represent a networked form of organization
• Network form of organization (Powell, 1990)
H1: Over time, dyadic communication will become prevalent in an emerging networked organization.
H2: As a social movement develops as an emerging network form of organization, the organizational structure will be increasingly clustered.
Data
• Triangulation of data insulates against false readings from large-scale data (see Lazer, Kennedy, King and Vespignani, 2014)
• Internet Archive:
– 14 websites; 4,504 hyperlink dyads over a 2-month period.
• Lexis Nexis:
– Search conducted to assess U.S. newspaper coverage of OWS from the early stages of the movement in September 2011 through Sept. 2012
– Search OWS keywords, e.g. “Occupy Wall Street,” “Occupy Oakland”
• Twitter
– Gnip PowerTrack
• Search by keywords; captures a larger volume of Twitter data than other options
– Sample includes October 17, 2011, through January 5, 2012. Initial study focused on the critical two-month period from November 1 through December 31, 2011
– 750,816 tweets across the two-month period.
OWS News Coverage
OWS on the Web
• 335 seed organizations based on records from #OccupyResearch
• Data extracted for 2011 & 2012, based on “both matching”
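The slide does not define “both matching.” One plausible reading, offered here only as an assumption, is that a hyperlink is kept when both its source and its target belong to the seed list. A minimal sketch of that filter follows; the seed hosts, the link records, and the helper names `host` and `both_matching` are hypothetical and are not taken from the ArchiveHub tooling.

```python
from urllib.parse import urlparse


def host(url):
    """Return the lowercase host of a URL, without a leading 'www.'."""
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc


def both_matching(links, seeds):
    """Keep (source, target) pairs whose hosts are both in the seed set.

    Assumption: links inside a single site are dropped, since they say
    nothing about ties between organizations.
    """
    seed_hosts = {s.lower() for s in seeds}
    kept = []
    for src, dst in links:
        s, t = host(src), host(dst)
        if s in seed_hosts and t in seed_hosts and s != t:
            kept.append((s, t))
    return kept


# Hypothetical seed list and link records, for illustration only.
seeds = ["occupy-a.example", "occupy-b.example"]
links = [
    ("http://www.occupy-a.example/news", "http://occupy-b.example/"),
    ("http://occupy-a.example/", "http://unrelated.example/page"),
    ("http://occupy-b.example/x", "http://occupy-b.example/y"),
]
dyads = both_matching(links, seeds)
```

Only the first record survives: the second points outside the seed list and the third stays within one site.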
[Figure: Captures per Month (axis in millions)]
Maximal Cores (k Coreness)
[Figure: panels labeled Aug. 2011 and Jan. 2012]
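For readers unfamiliar with k-coreness: a node's coreness is the largest k such that the node belongs to a subgraph in which every node has at least k neighbors, and the maximal core is the set of nodes with the highest coreness. The sketch below uses the standard peeling procedure on a hypothetical edge list. It treats hyperlinks as undirected, which is an assumption; the slides do not say how direction was handled or which software was used.

```python
from collections import defaultdict


def core_numbers(edges):
    """Return {node: coreness} for an undirected graph given as an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    degree = {n: len(nbrs) for n, nbrs in adj.items()}
    core = {}
    remaining = set(adj)
    k = 0
    while remaining:
        # Peel the node with the smallest remaining degree.
        n = min(remaining, key=degree.get)
        k = max(k, degree[n])
        core[n] = k
        remaining.remove(n)
        for m in adj[n]:
            if m in remaining:
                degree[m] -= 1
    return core


# Hypothetical network: a triangle (a, b, c) plus one peripheral site d.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("a", "d")]
core = core_numbers(edges)
top = max(core.values())
maximal_core = sorted(n for n, c in core.items() if c == top)
```

Here the triangle forms the maximal core with coreness 2, while the peripheral site has coreness 1.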
[Figure: Edges and Vertices]
[Figure: Density]
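As a reminder of what density measures: for a directed network without self-loops it is the number of observed ties divided by the number of possible ties, E / (V × (V − 1)). The sketch below applies that formula to a hypothetical edge list. It counts each distinct source-to-target pair once regardless of how many hyperlinks connect the pair, and it only sees nodes that appear in at least one tie; both choices are assumptions, not a description of how the figures in the talk were computed.

```python
def directed_density(edges):
    """Density of a directed graph without self-loops: E / (V * (V - 1))."""
    # Collapse repeated hyperlinks between the same pair into one tie.
    unique_edges = {(u, v) for u, v in edges if u != v}
    nodes = {n for edge in unique_edges for n in edge}
    v = len(nodes)
    if v < 2:
        return 0.0
    return len(unique_edges) / (v * (v - 1))


# Hypothetical ties among three sites; the repeated a -> b link counts once.
edges = [("a", "b"), ("a", "b"), ("b", "a"), ("a", "c")]
density = directed_density(edges)
```

With three sites there are six possible directed ties, and three distinct ties are present.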
[Figure: Clusters]
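The slides do not state how clusters were identified. As one simple illustration only, the sketch below counts weakly connected components, that is, groups of sites linked to one another when hyperlink direction is ignored. Community detection methods would give different counts, and nothing here should be read as the method behind the figure; the edge list and function name are hypothetical.

```python
from collections import defaultdict, deque


def count_components(edges, nodes=()):
    """Count connected components, ignoring edge direction.

    `nodes` optionally lists sites with no ties so they count as
    components of their own.
    """
    adj = defaultdict(set)
    for n in nodes:
        adj[n]  # register isolated nodes
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen = set()
    count = 0
    for start in list(adj):
        if start in seen:
            continue
        count += 1
        seen.add(start)
        queue = deque([start])
        while queue:  # breadth-first search over one component
            n = queue.popleft()
            for m in adj[n]:
                if m not in seen:
                    seen.add(m)
                    queue.append(m)
    return count


# Hypothetical ties: a chain a-b-c, a pair d-e, and an isolated site f.
edges = [("a", "b"), ("b", "c"), ("d", "e")]
n_clusters = count_components(edges, nodes=["f"])
```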
Implications
• Big Data:
– Guiding data collection with theoretically grounded questions avoids the “needle-in-the-haystack” problem
– Leverage advances in computing with existing theories to develop robust studies of social science phenomena
• Big Theory:
– Expanding prior theories on networked organizational forms and form emergence (evolutionary)
– Building toward a macro theory of organizational form emergence based on resource availability and networks
• Want data?
– Email me! [email protected]
– ArchiveHub: http://archivehub.rutgers.edu
• Collaborators
– Kris Carpenter & Vinay Goel, Internet Archive
– David Lazer, Northeastern University
Research supported by NSF Award #1244727 and the NetSCI Lab @ Rutgers