Crawling Big Data in a New Frontier for Socioeconomic Research: Testing with Social Tagging
JUAN DIEGO BORRERO, [email protected]
ESTRELLA GUALDA, [email protected]
University of Huelva
Seminários CIEO - Universidade do Algarve
Faro, 31 October, 2012
Table of Contents
• 1. Introduction
• 2. Theoretical perspective
– Web 2.0 and Collaborative tagging
– Tagging and Folksonomy
– The collective knowledge inherent in social tags
– Tagging and Social networks
– Social Web and its impact on Information Retrieval (IR) and Recommender Systems (RS)
• 3. Methodology
– 3.1. Data Collection procedure
– 3.2. Analysis procedure. SNA
• 4. Results
– 4.1. Centralization: Authority
– 4.2. Node Tags: Users producing Tags
• 5. Discussion
– 5.1. Centrality and Power
– 5.2. Central Tags: Users producing Tags
• 6. Conclusions and future research
1. Introduction: What puzzles?
1. The era of Big Data and Social Media has begun!
E.g., Twitter, Facebook, Tumblr, Delicious, YouTube, Flickr, Wikipedia…
2. Will it transform how we study human communication and social relations?
3. Will it alter what ‘research’ means?
Some or all of the above?
1. Introduction: What puzzles?
1. Big Data is notable not because of its size, but because of its relationality to other data. Big Data is fundamentally networked. Its value comes from the patterns that can be derived by making connections between pieces of data, about an individual, about individuals in relation to others, about groups of people, or simply about the structure of information itself.
2. Big Data is important because it refers to an analytic phenomenon playing out in academia.
3. Big Data is important because of its popular salience.
1. Introduction: Tagging
• New technologies have made it possible for a wide range of people to produce, share, interact with, and organize data.
• People can classify the huge amount of information at their disposal in the form of tags.
1. Introduction: Tagging in Delicious
Tags are keywords, freely chosen by users or suggested by Delicious, that are used to annotate various types of digital content.
Source: www.delicious.com
1. Introduction: Social Tagging Systems
Many users add metadata in the form of tags
Resulting collective tag structure
Source: http://www.idonato.com/2009/05/27/fun-with-tag-clouds/
Source: http://blog.hubspot.com/blog/tabid/6307/bid/7372/9-Reasons-Why-Your-Social-Media-Strategy-Isn-t-Working.aspx/
Source: http://bvdt.tuxic.nl/index.php/the-wisdom-of-the-crowds-in-the-audiovisual-archive-domain/
1. Introduction: Delicious
Delicious is a free social bookmarking website for storing, sharing and discovering web bookmarks
Source: www.delicious.com
1. Introduction: Our Assumption
• Big Data offers the humanistic disciplines a new way to work quantitatively, and it appears to offer a more objective method of analysis.
• In reality, however, working with Big Data is still subjective.
• It is therefore crucial to begin asking questions about the analytic assumptions, methodological frameworks, and underlying biases embedded in the Big Data phenomenon.
1. Introduction: Our Objectives
1. To propose a methodology for using big data from Web 2.0 in social research,
2. To apply it to automatically extract data from the Delicious social bookmarking website, and
3. To show the type of results that this kind of analysis can offer to social scientists.
4. We focus our study on the ‘globalization of agriculture’ community, and pay special attention to social network analysis (SNA).
2. Theoretical perspective: Web 2.0… and collaborative tagging
Web 2.0 is the business revolution in the computer industry caused by the move to the Internet as platform, and an attempt to understand the rules for success on that new platform (O’Reilly, 2007).
Collaborative (or social) tagging is the Web 2.0 activity of annotating digital resources with keywords, called tags (Golder and Huberman, 2006; Trant, 2009).
Source: http://www.laurenwood.org/anyway/2007/11/web-20-buzzwords/
2. Theoretical perspective: … collaborative tagging
A collaborative tagging system is mainly composed of three interconnected components: users, tags, and resources (Smith, 2008).
Resources: webpages, photos, videos…
2. Theoretical perspective: … collaborative tagging and folksonomy
Social tagging systems aggregate the tags of all users and describe the resources in a so-called folksonomy (Vander Wal, 2004).
Problems:
• Synonyms: global warming = climate change
• Term variations: globalization = globalisation, poor = poors
2. Theoretical perspective: … folksonomy and collective knowledge
Bottom-up process…
…the tags of many different users are aggregated and the resulting collective tag structure, such as a tag cloud, depicts the collective knowledge of Web users (Cress et al., 2012).
Source: http://blog.cimmyt.org/?p=6052
2. Theoretical perspective: Tagging and social networks
Bipartite networks are a particular class of networks whose nodes are divided into two sets, e.g. users and tags.
An opinion network (Maslov and Zhang, 2001; Blattner et al., 2007) is a network in which users connect to the objects that they gather.
The structure of social tagging websites can be viewed as a network of three different node types: the users U, the resources R (websites, i.e. URLs) and the tags T that the users deploy to tag those websites.
Figure 1. A Bipartite Network made of three users U=(u,u’,u’’), three tags T=(t,t’,t’’) and two kinds of links: between users RU (straight lines), and between users and tags RT (dashed lines)
Source: Authors
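The user-tag structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node names follow Figure 1, the user-tag links are invented, and the projection onto users (two users are linked when they share a tag) is one common way to derive a user-user network, not necessarily the procedure followed by the authors.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical user-tag links (the dashed lines of Figure 1).
user_tag_links = [
    ("u", "t"), ("u", "t'"),
    ("u'", "t'"), ("u'", "t''"),
    ("u''", "t''"),
]

def project_onto_users(links):
    """Return {(user_a, user_b): number of tags both users have used}."""
    users_by_tag = defaultdict(set)
    for user, tag in links:
        users_by_tag[tag].add(user)
    weights = defaultdict(int)
    for users in users_by_tag.values():
        # Every pair of users sharing this tag gets one more unit of weight.
        for a, b in combinations(sorted(users), 2):
            weights[(a, b)] += 1
    return dict(weights)

user_user = project_onto_users(user_tag_links)
print(user_user)
```

The weight on each user-user tie counts the tags the two users have in common, so heavier ties indicate more similar tagging vocabularies.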
2. Theoretical perspective: Social web and its impact on Information Retrieval (IR) and Recommender Systems (RS)
1. Social IR, i.e. IR that uses folksonomies, creates algorithms for folksonomies in order to identify which information is relevant and to identify communities according to their needs. From this point of view, this paper aims to present a methodology for retrieving big data from the Web 2.0 environment.
2. We introduce social tagging as a basis for recommendations built on the ternary relation between users, resources, and tags, in order to discover latent patterns linked to the activity of collaborative tagging, which could be fundamental for providing effective recommendations to different actors.
3. Methodology
• Data set from: Delicious (www.delicious.com).
• Delicious = social bookmarking system with:
– Content created, annotated and viewed by its users.
– Non-hierarchical classification system: users can tag each of their bookmarks on the Delicious website, which provides knowledge about the bookmarked URL.
– Collective nature: users can view bookmarks added or annotated by other users, and organize existing tags into groups (tag bundles).
3.1. Data Collection procedure
Collected annotations made in Social Bookmarking Services. Each annotation has at least four parts:
• 1. Link to the resource (website…)
• 2. One or more tags
• 3. User who makes the annotation
• 4. Moment/time when the annotation is made
• This article focuses more on the co-occurrence of users, resources and tags (user, resource, tag).
Dataset collected: U = {u1, u2, …, uK}, R = {r1, r2, …, rM}, and T = {t1, t2, …, tN}
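A minimal Python sketch of how such annotations could be stored, and of how the sets U, R and T follow from them. The records, field names and URLs are hypothetical and are not taken from the Delicious data set.

```python
from typing import NamedTuple

class Tagging(NamedTuple):
    """One annotation: who tagged which resource with which tag, and when."""
    user: str
    resource: str
    tag: str
    timestamp: str  # ISO date string; the format is an assumption

# Hypothetical sample records (not real Delicious data).
taggings = [
    Tagging("user_a", "http://example.org/farming", "food", "2011-04-22"),
    Tagging("user_a", "http://example.org/farming", "trade", "2011-04-22"),
    Tagging("user_b", "http://example.org/farming", "food", "2011-05-02"),
    Tagging("user_b", "http://example.com/markets", "economics", "2011-05-03"),
]

# The three sets of the data set, with their sizes K, M and N.
U = sorted({t.user for t in taggings})
R = sorted({t.resource for t in taggings})
T = sorted({t.tag for t in taggings})
K, M, N = len(U), len(R), len(T)
print(K, M, N)
```

Keeping one record per (user, resource, tag) co-occurrence means the number of records is the number of taggings, while K, M and N count distinct users, resources and tags.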
3.1. Process to retrieve the data
(A) Starting point. Identify the search attributes, using an authoritative source as a baseline to find keywords connected to the idea of ‘globalization of agriculture’
– Wikipedia definition of “critics of globalization” (popular, high reputation)
– Other starting points (future)
– Main concepts selected manually (researcher expertise) from the website homepages, tag clouds or topics.
– Identified the 5 seed keywords (globalization + agriculture, food, organic, and GMO)
– Other concepts rejected
(B) Web crawling was carried out with a Perl program, gathering the sample of users, URLs and tags
– For globalization+agriculture; globalization+food; globalization+organic; globalization+GMO
– Between 22 April 2011 and 21 May 2011 (one complete month)
– Results: 10,220 taggings that involved 851 users on 1,077 URLs and 1,720 tags.
(C) A Haskell program reduced the amount of data by cutting the URLs and using keywords, including the identification of synonyms, the elimination of words with capital letters and of derivatives such as plural forms.
(D) Dataset for analysis
Figure 2. Data Collection Procedure
Source: Authors
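Step (C) can be illustrated with a short Python sketch (the authors used Haskell; Python is used here only for illustration). It assumes that "cutting the URLs" means keeping only the scheme and host, and it treats capital letters by lowercasing, which is one possible reading of the step. The lookup table is hypothetical; its entries reuse the synonym and term variations mentioned earlier in the slides.

```python
from urllib.parse import urlsplit

# Hypothetical lookup table mapping variants and synonyms to one canonical tag.
CANONICAL = {
    "globalisation": "globalization",  # spelling variation
    "poors": "poor",                   # plural form
    "climatechange": "globalwarming",  # synonym; direction chosen arbitrarily
}

def cut_url(url: str) -> str:
    """Keep only scheme and host, so every page of a site maps to that site."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"

def normalize_tag(tag: str) -> str:
    """Lowercase a tag and replace known variants by their canonical form."""
    cleaned = tag.strip().lower()
    return CANONICAL.get(cleaned, cleaned)

print(cut_url("http://www.nytimes.com/2011/04/22/world/example.html"))
print(normalize_tag("Globalisation"))
```

An explicit lookup table is used instead of a suffix rule because naive plural stripping would also alter tags such as "economics" or "politics".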
Example: final dataset
526 URLs, 1,700 tags, 851 users
Source: Authors
Table 1. Keywords Used in the topic “Globalization of agriculture”
Search attributes used | Number of resulting tags (I+II) | Most frequent tags / main tags
Globalization (I) + agriculture (II) | 1,116 | Food (268), economics (176), environment (145), politics (85), trade (81), sustainability (70)
Globalization (I) + food (II) | 1,682 | Economy (180), economics (171), environment (122), sustainability (78), politics (60)
Globalization (I) + organic (II) | 22 | Business (3), fair-trade (3)
Globalization (I) + GMO (II) | 54 | Food (13), agriculture (12)
Source: Authors
3.2. Analysis procedure: SNA (network analysis)
• Node centrality: identification of the nodes that are more “central” than others. Network-level property = idea of the node’s social power based on how well it “connects” to the network.
• Degree of a node = number of direct connections an individual has with others in the group. Highest degree = exerts influence (or authority).
• In-degree = number of incoming ties, which reflects the popularity of a website. As a result, the prominent, well-connected members (those with a high degree of centrality) are usually the opinion leaders.
• Out-degree = number of outgoing ties, which determines whether a particular user is an active or passive participant within the network.
• Software: Pajek (large data series). A user of the Delicious bookmarking system is simply using Delicious; the analysis looks for the latent structures and the power that emerges from the network…
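As a rough illustration of these two measures, the following Python sketch counts in-degree and out-degree on a directed user-to-URL network. The edge list and names are hypothetical; Pajek computes these measures directly, so the sketch is only meant to make the definitions concrete.

```python
from collections import Counter

# Hypothetical directed ties: (user, URL) means the user bookmarked the URL.
edges = [
    ("ana", "http://news.example/"),
    ("ana", "http://farm.example/"),
    ("ben", "http://news.example/"),
    ("cai", "http://news.example/"),
]

out_degree = Counter(user for user, _ in edges)  # ties sent by each user
in_degree = Counter(url for _, url in edges)     # ties received by each URL

# The node with the highest count on each side of the network.
most_popular_url, _ = in_degree.most_common(1)[0]
most_active_user, _ = out_degree.most_common(1)[0]
print(most_popular_url, most_active_user)
```

In this user-to-URL network only URLs receive ties and only users send them, so in-degree ranks websites by popularity and out-degree ranks users by activity.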
Figure 3. Hyperlink Network Energy Kamada-Kawai Map. Bipartite Network user-URL
Source: Authors by Pajek
Results: 4.1. Centralization (Authority)
Centralization: user-URL
URL’s in-degree: sum of total inbound links
User’s out-degree: sum of total outbound links
Network highly centralized within a few nodes:
Only 10 URLs from 526 (1.90%) account for 32.19% of the links to URLs. These 10 URLs got 3,290 inbound links from a total of 10,219.
Only 10 users from 851 (1.18%) account for 14.05% of the links to URLs. These 10 users produced 1,436 outbound links from a total of 10,219.
10 most central websites: nine of them were media-based (online newspapers such as The New York Times, BBC, The Guardian, Washington Post, Financial Times, Reason, The Nation, Spiegel and The Economist) (Table 2).
Identification of users with a greater degree of centrality: the user Mritiunjoy plays a very important role in the network. Mritiunjoy joined Delicious on 12 March 2007 and to date has 10,020 links and is following 38 users. Mritiunjoy Mohanty is a professor at the Indian Institute of Management Calcutta, India, and his research interests are the political economy of growth and development.
Table 2. Top Authoritative Sites in the hyperlink network
Rank | Indegree | URL | Outdegree | User
1 | 1203 | http://www.nytimes.com/ | 433 | /mritiunjoy
2 | 674 | http://news.bbc.co.uk/ | 195 | /laura208
3 | 365 | http://www.guardian.co.uk/ | 127 | /rd108
4 | 186 | http://www.washingtonpost.com/ | 112 | /amaah
5 | 158 | http://www.ft.com/ | 111 | /thepouncer
6 | 154 | http://www.reason.com/ | 100 | /anilius
7 | 147 | http://www.thenation.com/ | 100 | /emmarlyb
8 | 137 | http://www.spiegel.de/ | 87 | /adorngeography
9 | 136 | http://www.foodfirst.org/ | 86 | /pagolnari
10 | 130 | http://www.economist.com/ | 85 | /freemanlc
Source: Authors
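The concentration of links in a few nodes can be checked with a few lines of Python. The in-degree and out-degree values are those of Table 2 and the total of 10,219 links is the one reported in the results; the function name is an assumption made for this sketch.

```python
# Values copied from Table 2.
top_url_indegree = [1203, 674, 365, 186, 158, 154, 147, 137, 136, 130]
top_user_outdegree = [433, 195, 127, 112, 111, 100, 100, 87, 86, 85]
TOTAL_LINKS = 10_219  # total number of links reported for the network

def share_of_links(degrees, total):
    """Percentage of all links that the given nodes account for."""
    return 100 * sum(degrees) / total

url_share = share_of_links(top_url_indegree, TOTAL_LINKS)
user_share = share_of_links(top_user_outdegree, TOTAL_LINKS)
print(f"Top 10 URLs: {sum(top_url_indegree)} inbound links, {url_share:.2f}%")
print(f"Top 10 users: {sum(top_user_outdegree)} outbound links, {user_share:.2f}%")
```

The same function applies to any cut-off (top 5, top 50), which makes it easy to see how quickly the share of links flattens out beyond the most central nodes.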
Figure 4. user-user Unipartite Network Energy Kamada-Kawai Map
Degree Cut-off = 1. Size: Degree
Source: Authors by Pajek
Figure 5. user-user Unipartite Network Energy Kamada-Kawai Map
Degree Cut-off = 30. Nodes = 211. Size: Betweenness
Source: Authors by Pajek
Figure 6. user-user Unipartite Network Energy Kamada-Kawai Map
Degree Cut-off = 30. Nodes = 211. Size: Closeness
Source: Authors by Pajek
Figure 7. user-user Unipartite Network Energy Kamada-Kawai Map
Degree Cut-off = 30. Nodes = 211. Size: Degree
Source: Authors by Pajek
Figure 8. Hyperlink Network. 851 users arranged in rank order by number of outbound links and 1,077 URLs arranged in rank order by number of inbound links
Why are a few users and websites better connected than the majority?
Source: Authors
Value of identified nodes (websites) due to:
• The links that they receive (their instrumental nature)
• The profile of these organizations (newspapers that channel large quantities of resources, i.e. information) (quality of the links) = central URLs with authority.
Results: 4.2. Node Tags: Users producing Tags
• Collective tag structure (excluding the search keywords globalization, agriculture, food, organic, and GMO) produced with Wordle.
• The sizes of the terms in the tag clouds are proportional to their weights; the clouds show the 25 highest-weighted tags.
• Tag clouds: identifying the topical groupings in a tag network
– Identification of topics around globalization of agriculture
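How term sizes follow from weights can be sketched as below. The three real counts are taken from the tag frequency table in these slides and the excluded search keywords follow the text; the count given to the excluded keyword, the linear scaling and the maximum font size are assumptions, since the layout rule applied by Wordle is not described here.

```python
# Tag counts: the first three are from the slides' frequency table.
tag_counts = {
    "economics": 350,
    "environment": 274,
    "sustainability": 153,
    "globalization": 999,  # hypothetical count for an excluded search keyword
}

SEARCH_KEYWORDS = {"globalization", "agriculture", "food", "organic", "gmo"}

def cloud_sizes(counts, top_n=25, max_font=48):
    """Font size proportional to weight, for the top_n tags after exclusions."""
    kept = {tag: n for tag, n in counts.items() if tag not in SEARCH_KEYWORDS}
    top = sorted(kept.items(), key=lambda item: item[1], reverse=True)[:top_n]
    if not top:
        return {}
    heaviest = top[0][1]
    return {tag: max_font * n / heaviest for tag, n in top}

sizes = cloud_sizes(tag_counts)
for tag, size in sizes.items():
    print(f"{tag}: {size:.1f}")
```

Excluding the search keywords before ranking matters: they appear in nearly every tagging by construction and would otherwise dominate the cloud.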
Figure 9. Tag Cloud for the Agriculture Globalization Network Identified in the Delicious Data Set
The resulting main key topics were economics and the environment: the main keywords used by Delicious users to describe or characterise the topic ‘globalization of agriculture’.
Source: Authors by Wordle
50 most frequent tags (tags used more than 20 times)
Tag | Count | Tag | Count | Tag | Count
Economics | 350 | World | 47 | BBC | 30
Environment | 274 | Global | 46 | Future | 30
Sustainability | 153 | Capitalism | 45 | Geography | 30
Politics | 152 | Green | 43 | Water | 30
Economy | 144 | Research | 42 | Nutrition | 29
Trade | 131 | Crisis | 41 | Government | 27
Business | 99 | International | 41 | Wto | 27
Poverty | 97 | Oil | 38 | Agribusiness | 26
Culture | 84 | Prices | 37 | Ecology | 25
Farming | 84 | Activism | 35 | Europe | 25
Africa | 83 | News | 35 | Globalwarming | 23
Health | 78 | Science | 35 | Reference | 22
Development | 76 | Hunger | 34 | Technology | 22
Energy | 76 | Usa | 34 | Biofuel | 21
India | 65 | Inflation | 32 | Corporations | 21
China | 59 | History | 31 | Farmers | 21
Policy | 55 | Local | 31 | |
Discussion: 5.1. Centrality and Power
In this Delicious network on the globalization of agriculture, The New York Times far surpasses the other URLs (1,203 inbound links, followed by the BBC website with 674).
The most cited, recommended or considered websites on a topic occupy a central place and play an important role in the dissemination of news, events, trending topics, ideology, culture, etc.
Identification of key collective actors (represented here through URLs) allows a better comprehension of leadership, influence processes, and power-related structures.
For social practitioners, it is a good way to identify key informants in a community through whom useful and important information can be disseminated.
Very unequal distribution of power among the URLs cited by users on the topic of globalization of agriculture.
– Important accumulation of inlinks.
ADVANTAGES OF THIS TYPE OF KNOWLEDGE FOR RESEARCHING AND INTERVENING
Discussion: 5.1. Centrality and Power
• Focus on users: identification of key actors that disseminate and share URLs, such as the previously cited Mritiunjoy
– Determine where the key elements that structure the network emerge from.
• Why is a given actor so important in the network of globalization of agriculture?
– Key actors in this type of network could configure and reconfigure the evolution of the network over time, and structure and even manipulate the type of interchange of resources in Delicious or in similar bookmarking sites.
• Is it by chance? Do the most prominent actors on a website like Delicious correspond to a profile of very active and participative people? Do they usually work (or have a hobby) in this area, and is this why they accumulate and tag so many URLs in Delicious?
– Further steps of the research.
5.2. Central Tags: Users producing Tags
• Tags suggested by the website + new tags added in a creative way
• ‘Tag cloud’: visual approach to the language used by users
• From a total of 1,700 tags, two words were the main ones.
• Each user could label a URL with an unlimited number of tags (average 12 tags per user, max 433 and min 2).
• The most frequently used tags were ‘economics’ (350 citations out of 1,700 tags, 20.6%) and ‘environment’ (273, 16%).
• Other very frequent tags were sustainability (153), politics (152), economy (144), trade (131), business (99), poverty (97), culture (84), farming (84), africa (83), health (78), and development (76); in relative terms, these 13 tags represent one out of four labelled tags around the topic (25.9%).
Questions:
• Reasons for the prominence of the first two tags around the globalization of agriculture.
• Are some of the 1,700 tags found used interchangeably?
– Why is the word ‘economics’ used sometimes and ‘economy’ at other times?
– Are they used in the same way when classifying the URLs?
Conclusions: achieved goals
• Presenting this methodology for using big data from Web 2.0 in socioeconomic research, and illustrating it with a social bookmarking site (Delicious), is:
• A first step towards the development of empirical techniques capable of automatically differentiating groups of individuals with common interests, and individuals who occupy a more central position.
• A first stone in the difficult process of understanding and discovering patterns in the processes that characterize users tagging URLs for collaborative reasons.
• Utility: discovering latent patterns = providing effective recommendations to different actors.
• Understanding the community of more than a thousand links.
• Retrieval and analysis of information: complex, but easier when working in interdisciplinary teams
Other topics for Researching: Future
• Improvements are necessary in retrieval methods and in the implementation of Information Retrieval and Recommender Systems techniques
• Influence of first tags on the following ones. Role of innovation and creativity in tagging
• Evolution and usage of language around an issue over time.
• Ideological and terminological approaches in the national/international arena
• Use of some tags in classifying URLs and the distinction among users in the way they use some words/tags
– Distinction between scientists and other professionals or users?
– Identify users with the same tagging patterns, or URLs that were similarly labelled: study structural equivalences
• Other possible studies based on retrieving the pages and carrying out content analysis
• Why are some labels present/absent?
• Are there “traditions”/“fashions” in tagging in the Web 2.0?
• Comparing results from Delicious with those from other social bookmarking sites
• Go in-depth about users (if possible)
• And other explorations, other starting points, other bookmarking sites, other indicators, complementary to those used in this illustration
Possible Applications
• Producing and manipulating public opinion (when recommending and describing websites) and markets
– If we know the interests of users belonging to a network, we could also make recommendations
• Recommender Systems: the change to a ternary relation between users, resources, and tags is more complex to manage.
• Important for researchers interested in formulating strategies for intervention and mobilisation; practitioners and companies could also make use of this.
• The discovery of the central elements in a network (users and URLs), together with the tags used by users, could be key to designing future strategies for disseminating messages and communicating more successfully, for instance by using important keywords to attract more attention.
• Implementation of Information Retrieval and Recommender Systems techniques in social commerce and social media contexts.
• Applications in advertising, mobilising, etc.
• Security, Social Studies, Market studies, consumers
• Time: longitudinal analysis
• Etcetera