
Detecting Toxic Content using Open Source Social Media:

An Actor-Centric Approach

The SecDev Group, 2014

Notice:

This paper summarizes research conducted by The SecDev Group as part of a Public Safety Canada, Kanishka-funded project looking at social media analytics and the prevention of violent extremism. Citation of this document is allowed, provided appropriate acknowledgement is given.

Table of Contents

1.0 Introduction
1.1 What is Kanishka?
1.2 About SecDev Kanishka Research
1.3 This Report
1.4 Why Twitter?
2.0 Seed Account Identification
2.1 Seed Candidate Validation Criteria
2.2 Data Collection
2.3 Analysis of Candidate Accounts
2.4 Seed Candidate Verification
3.0 Seed Network Construction
3.1 Social Network Modelling Using Twitter Interactions
3.2 Data Collection Using Snowball Sampling
3.3 Seed Network Construction
3.4 Seed Community Detection
3.5 Validation of Seed Community Membership
4.0 Toxic Content Analysis
4.1 What is Toxic Content?
4.2 Geospatial Analysis of Toxic Content Consumption
5.0 Conclusions
5.1 Summary of Research Findings
5.2 Discussion of Methods and Techniques
5.3 Recommendations for Future Research


1.0 Introduction

1.1 What is Kanishka?

The Kanishka Project is a multi-year initiative funded by the Government of Canada to support terrorism-focused research. Unveiled on June 23, 2011, the project is named after the Air India Flight 182 plane that was bombed on June 23, 1985, killing 329 people, most of them Canadians.

The initiative invests in research to increase understanding of the recruitment methods and tactics of terrorists, to help produce more effective policies, tools and resources for law enforcement and people on the front lines. Although the project's primary focus is on research, it also supports other activities necessary to build knowledge and create a network of researchers and students that spans multiple disciplines and research organizations.

The overarching goal of the Kanishka Project is to improve Canada's ability to counter terrorism and violent extremism at home and abroad. This report provides an account of one of the case studies funded by a grant provided to The SecDev Group under the Kanishka Project.

1.2 About SecDev Kanishka Research

Over the past year, The SecDev Group engaged in a set of practical experiments exploring techniques and methods for detecting violent extremist content and communities at risk of radicalization online.

Our approach was inspired by the public health approach to violence reduction developed by the World Health Organization. We started with four basic assumptions:

• Violent extremist groups are active and savvy users of social media spaces;

• While pathways to radicalization and violence are highly idiosyncratic, socialization plays an important role; therefore tracking and analyzing online ties and toxic content has potential utility.1

• Open-source social media (OSSM) analytics has the potential to generate information that could prove useful in improving public safety through the prevention of violent extremism;

• Methods and techniques are in their infancy. Our work is exploratory. A main purpose is to raise questions and identify areas for further research.

Our open source research explored different techniques for identifying online networks that encourage violence, as well as toxic content and its audiences. We also did some initial exploration of audience geo-location, as we thought this could provide potentially useful information for local preventative strategies.

1 Ragheb, Abdo. 2014. Review of Social Science Literature on Radicalization to Assess Operational Utility for Open Source Social Media Research in the Interests of Prevention of Violent Extremism. The SecDev Group


1.3 This Report

This report provides a summary account of one of several case study experiments conducted by The SecDev Group under a research grant from the Kanishka Project. The principal focus of this case study was to investigate the “actor-centric” approach to surfacing an online network promoting violent extremism (in this case, foreign fighters in Syria who embrace ISIL objectives), with a view to then exploring potential hallmark content that could be of interest to Prevention of Violent Extremism (PVE) practitioners concerned about the foreign fighter phenomenon. The case study featured a series of experiments, each building on the results of the previous one:

• The first stage of the study sought to identify a Twitter account belonging to an individual who could be verified to be a foreign national, currently engaged in armed combat in the Syrian Civil War (see Section 2).

• The Twitter account of the individual verified to be a Syrian foreign fighter (FF) was then used to construct a “seed network,” for the purposes of identifying a community of Twitter users who are influenced by, or share, this individual’s extremist ideology (see Section 3).

• Once the “seed community” consisting of individuals most closely aligned with the views and interests of the original “seed account” was identified, the corpus of its social media interactions was evaluated for presence and prominence of toxic content (see Section 4).

The final section of this report provides an overview of the research findings and a discussion of the methods and techniques employed in the course of this case study, and proposes directions for future research on this topic (see Section 5).

It is important to note that the primary goal of this study was to examine the methods and techniques for detection of networks of radicalized individuals and toxic content on open social media platforms. As such, examples and analysis of social media use by Syrian foreign fighters presented in this report are to be viewed as a vehicle for the demonstration of said methods and techniques.

1.4 Why Twitter?

For practical purposes, data collection for this case study was limited to Twitter, a popular social media platform. The main reasons for this are presented below.

• Unlike users of other social media platforms, Twitter users generally intend for their tweets (i.e. posts) to be accessible by the public, thus minimizing the potential for violating the user’s privacy;

• Twitter provides a free and open API2, making it possible to automate collection of significant volumes of data for offline analysis;

• Twitter’s API provides access to structured data, greatly simplifying analysis as compared to data collected by scraping websites, or obtained via other unstructured sources;

• Unlike some social media platforms (e.g. Facebook), Twitter encourages users who do not know each other to interact and share content;

• The Middle East has some of the most engaged Twitter users around the globe;3

2 API stands for application programming interface, i.e. a means for direct computer-to-computer interaction.


• When it comes to groups that promote violent “jihad” ideologies, such as those inspired by al-Qaeda, more and more are taking to public online spaces to promote their cause and reach new recruits;

• Foreign fighters in Syria’s civil war are heavy users of social media. Many were active on social media prior to taking up arms; once in country, some re-engage actively with followers from their home country to promote ISIL, answer questions and offer encouragement.4

3 In the Middle East, Twitter Rules (http://www.emarketer.com/Article/Middle-East-Twitter-Rules/1009737)

4 See, for example, the companion piece of SecDev Research: Abdo, Ragheb. 2014. Assessment of a Syrian Foreign Fighter’s Twitter Trajectory (The SecDev Group, unpublished manuscript).


2.0 Seed Account Identification

The objective of the first experiment was two-fold:

1. Identify an active social media user account that may belong to a foreign national currently participating in the Syrian Civil War.

2. Using the corpus of this individual’s social media activity, and any additional content embedded or linked to therein, verify that this person is indeed a Syrian foreign fighter (FF).

2.1 Seed Candidate Validation Criteria

To conclusively verify that a given Twitter account belongs to a person who is in fact a Syrian FF, the following validation criteria were devised:

I. Evidence of interaction with other Twitter users, typical of an average Twitter user (as opposed to a broadcast account used solely for purposes of content distribution), so as to allow for social network modelling in later experiments.

II. Sufficient evidence to conclusively place the account holder within their country of origin (e.g. references to local events, geo-spatial meta-data, images or video placing the account holder in a specific location).

III. A period of inactivity associated with travel from the account holder’s country of origin to Syria.

IV. Sufficient evidence to conclude that the account holder has travelled to Syria for the purposes of engaging in armed combat (e.g. references to local events, geo-spatial meta-data, images or video placing the account holder in a specific location, engaged in activities of interest).

2.2 Data Collection

In order to evaluate candidate Twitter accounts against the Syrian FF criteria, the tweets associated with said accounts were downloaded using an open source Python script.5

It is worth noting that use of Twitter’s public API is subject to limits on both the frequency of access and the volume of data that can be obtained.6 In the case of data collection against a specific account, Twitter limits access to the latest 3,200 tweets, meaning that in some cases it may not be possible to obtain a complete record of a given account’s activity.

It is also important to recognize that some account holders regularly delete their tweets. This may present a challenge since the use of Twitter’s public API is governed by the Terms of Service agreement, which requires one to delete any tweets they have downloaded if they become aware that the author has requested those tweets to be deleted.7

5 The script used to perform data collection can be found at https://gist.github.com/yanofsky/5436496

6 Why the 3200 tweet user timeline limit and will it ever change? (https://dev.twitter.com/discussions/276)

7 Working with Timelines - Handling deleted tweets (https://dev.twitter.com/discussions/10035)


2.3 Analysis of Candidate Accounts

The search for a Twitter account belonging to a Syrian FF began with analysis of open sources, focussing primarily on print media. The logic behind this approach was to minimize potential privacy concerns by relying on information that is already part of the public record. Our initial search resulted in a list of names of six known Canadian foreign fighters. None of the persons on this list were found to be active Twitter users, and all were disqualified from further analysis.

To proceed, an external source8 was used to obtain a list of 15 additional candidate Twitter accounts. Once the complete records of the additional candidate accounts were downloaded, they were manually assessed against the Syrian FF validation criteria.

Analysis of the fourth account (@AbuDujanaBrtany) showed repeated references to another Twitter account (@RadicalIslamist). Given the account’s name and the context in which it appeared, the decision was made to download the tweets associated with @RadicalIslamist. Manual review of the data showed several tweets supportive of the Islamic State of Iraq and al-Sham (ISIS) and a re-tweet (equivalent to a forwarded email) from an account called @Hamidur1988, who purported to be a member of ISIS (see Figure 1).

Figure 1 - Tweets linking the @Hamidur1988 account to the Ask.fm service and self-identifying as involved with ISIS

Further investigation of the interaction depicted by Figure 1 led to a post on Ask.fm9 answering a question about how the account holder felt when he first arrived in the “conflict zone.” The assumed conflict zone was Syria, given that the account holder also claimed to be a member of ISIS.

The usernames of both the Twitter and the Ask.fm accounts contained the words “Al Britani” (Arabic for “the Briton”), indicating that the user may have a connection to the United Kingdom. In addition, all of the posts reviewed were written in English. Given the substantial evidence for @Hamidur1988 being a Syrian FF, the account’s data was downloaded for further analysis.

8 The external source in question was Mubin Shaikh who, acting in his capacity as an advisor to the Kanishka Project, provided us with fifteen Twitter accounts of individuals known to be affiliated with the Syrian FF community.

9 “Ask.fm is a Latvia-based social networking website where users can ask other users questions, with the option of anonymity.” (Source: http://en.wikipedia.org/wiki/Ask.fm)


2.4 Seed Candidate Verification

I - Interaction with Other Users

At the time of collection, the @Hamidur1988 account’s record of activity contained 1,907 tweets, spanning nearly one and a half years. Over this period, @Hamidur1988 averaged roughly 3.6 tweets a day. Manual inspection of the account’s overall activity showed a sufficient level of interaction with other Twitter users.10

II - Country of Origin

Analysis of @Hamidur1988’s tweets yielded a tweet geo-tagged as originating from Portsmouth, United Kingdom. In all, a total of four tweets referenced Portsmouth, including one from a Twitter user who asked @Hamidur1988 to inform others in Portsmouth of a prayer session at Portsmouth Central Masjid. Taken together, these references indicated with considerable confidence that the individual behind the @Hamidur1988 account was indeed a member of the Portsmouth community.

III - Period of Travel

The stipulated period of travel was evident in the noticeable absence of content posted during November of 2013. The exception was a status update made on November 17: “In the west we have everything but we never content. Here we have nothing but it feels like we have everything. Subhanallah.”11 The tweet implies that @Hamidur1988 was no longer in the West.
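A period of inactivity of the kind required by criterion III can also be surfaced programmatically before manual review. The following is a minimal sketch, assuming tweet timestamps have already been parsed into `datetime` objects; the function name `longest_gap` and the sample timestamps are hypothetical and are not data from the study.

```python
from datetime import datetime


def longest_gap(timestamps):
    """Return (start, end, duration) of the longest silence between consecutive posts."""
    ordered = sorted(timestamps)
    if len(ordered) < 2:
        return None
    # Pair each post with the one that follows it and keep the widest interval.
    start, end = max(zip(ordered, ordered[1:]), key=lambda pair: pair[1] - pair[0])
    return start, end, end - start


# Illustrative, made-up timestamps (not taken from the account under study).
posts = [
    datetime(2013, 10, 28, 9, 0),
    datetime(2013, 10, 30, 18, 30),
    datetime(2013, 11, 17, 12, 0),
    datetime(2013, 12, 2, 8, 15),
]
start, end, gap = longest_gap(posts)
print(f"Longest silence: {start:%Y-%m-%d} to {end:%Y-%m-%d} ({gap.days} days)")
```

A gap flagged this way is only a pointer for the analyst; deleted tweets or the 3,200-tweet collection limit can produce similar silences.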

IV - Presence in Syria and Engagement in Armed Combat

Following the lack of activity during his travel to Syria, @Hamidur1988’s account showed an increase in posts on ISIS. These posts primarily featured Ask.fm questions on @Hamidur1988’s participation in fighting with ISIS in the “Sham.”

From August 2013 until the suspected time of travel to Syria, @Hamidur1988’s communication was most frequent with another Twitter user, @jamanwtf. Further inspection of the @jamanwtf account revealed that he was a “face-to-face” friend of @Hamidur1988, named Iftikhar Jaman, who had purportedly been a Syrian FF. Open sources were used to corroborate that Jaman was indeed in Syria, and that he had in fact been killed, explaining the sudden cessation of activity on his Twitter account on November 29, 2013.12

Conclusion

In light of the findings presented above, it was possible to conclude with a high degree of confidence that the @Hamidur1988 Twitter account did indeed belong to a Syrian FF.

10 For more information please see SecDev’s companion study, Abdo, Ragheb. 2014. Analysis of a Syrian Foreign Fighter’s Twitter Feed (The SecDev Group, unpublished manuscript).

11 The original tweet can be accessed at https://twitter.com/Hamidur1988/status/403480384652206080

12 British 'celebrity jihadi' and chef dies in Syria (Source: http://www.telegraph.co.uk/news/worldnews/middleeast/syria/10524179/British-celebrity-jihadi-and-chef-dies-in-Syria.html)


3.0 Seed Network Construction

In this section we present the results of the second experiment conducted as part of this case study. Once we identified an account belonging to a verified Syrian FF, we proceeded to construct a social network graph using @Hamidur1988’s account as the seed. The intent behind this experiment was to identify a closely knit group of like-minded individuals who, like @Hamidur1988, engaged in production and dissemination of content that promotes violent extremist action.

3.1 Social Network Modelling Using Twitter Interactions

The success of social media platforms like Facebook and Twitter is in large part due to the manner in which they allow their users to recreate the social behaviour humans naturally engage in offline. Much like the social behaviour we exhibit offline, online interactions include:

• Formal relationships, manifested by “following” someone’s Twitter account and having other Twitter users “follow” your own account;

• Social conversations, such as tweets that reply to another tweet, or mention a specific Twitter account; and

• Information sharing via the re-tweet function, to facilitate the propagation of the message across the network.

It is important to remember that the social media activities visible to the public typically represent a small fraction of the sum total of any one person’s social interactions. However, given that extremist views are by definition outside of the mainstream, social media platforms enable individuals who may otherwise find themselves isolated, to find others who share their extremist perspective, unhampered by physical geography.

With these assumptions in mind we set out to construct a social network graph based on the public Twitter activity of @Hamidur1988.

3.2 Data Collection Using Snowball Sampling

Snowball sampling is a common approach to collecting data on members of a hard-to-reach population. The essence of this non-probability sampling technique13 is to use existing study subjects to recruit additional subjects from among their acquaintances.

Given that the sample members are not recruited using a sample frame, the technique is subject to a number of potential biases. The chief among these biases is the extent to which a given member is known to others within the population of interest, which has a direct impact on the likelihood of an individual being included in the sample.14

13 Non-probability sampling restricts the research findings from being generalized to the whole population.

14 A good starting point for additional information on snowball sampling can be found at https://www.fort.usgs.gov/LandsatSurvey/SnowballSampling

Despite its limitations, snowball sampling presented the most readily operationalized method of data collection for the purposes of this study. To collect the data required for the construction of a social network graph, snowball sampling was operationalized in the following manner:

• All of the tweets collected from the @Hamidur1988 account were processed to identify other Twitter users, who were either mentioned or were the target of a reply.

• Once identified, each Twitter user was ranked according to the frequency of their presence within the corpus of @Hamidur1988’s Twitter activity.

• The data for the top 30 users was then collected in the manner discussed in Section 2.2.

• The process of user identification, ranking, and data collection was then repeated one more time.
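The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script used in the study: `fetch_tweets` is a hypothetical callback standing in for whatever collection routine returns an account's tweet texts, mentions are detected with a simple `@name` pattern, and the function names and toy data are invented for the example.

```python
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")


def top_contacts(tweets, owner, k):
    """Rank accounts mentioned or replied to in `tweets` by frequency; return the top k."""
    counts = Counter()
    for text in tweets:
        for name in MENTION.findall(text):
            name = name.lower()
            if name != owner:
                counts[name] += 1
    return [name for name, _ in counts.most_common(k)]


def snowball(seed, fetch_tweets, k=30, hops=2):
    """Collect the seed's tweets, then those of the top-k contacts, for `hops` rounds."""
    seed = seed.lower()
    collected = {seed: fetch_tweets(seed)}
    frontier = [seed]
    for _ in range(hops):
        next_frontier = []
        for account in frontier:
            for contact in top_contacts(collected[account], account, k):
                if contact not in collected:  # skip accounts already sampled
                    collected[contact] = fetch_tweets(contact)
                    next_frontier.append(contact)
        frontier = next_frontier
    return collected


# Toy data standing in for collected timelines (hypothetical accounts).
toy = {
    "seed": ["@a hi", "@a @b hello", "@c yo", "@a again"],
    "a": ["@seed hi", "@d news", "@d more"],
    "b": ["@e one"],
    "d": ["@f deep"],
}
sample = snowball("seed", lambda name: toy.get(name, []), k=2, hops=2)
print(sorted(sample))
```

With `k=2` the toy run samples the seed, its two most frequent contacts, and their most frequent contacts in turn; accounts that only appear at a third remove are never collected.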

Once collection was complete, the final two-hop, top-30 snowball sample comprised data from 1,058 Twitter accounts, for a total of 2,760,309 unique tweets.15

3.3 Seed Network Construction

Social network analysis is a process by which individuals and their interactions are captured in the form of a graph. The properties of this graph are then examined for insights which may not be easily deduced from other analytical approaches such as frequency-based and content analyses.16

All social networks consist of two basic components: nodes and edges (i.e. a connection between two nodes). For the purposes of this study, a node represents an individual Twitter user, while an edge is used to represent interaction between two users.

Tweets captured by the snowball sample described in the previous section were processed for the purposes of constructing a social network graph. The process involved identification and extraction of interactions, according to their source (i.e. the author of the tweet) and target (a Twitter user who was either mentioned in the tweet or was being replied to).

After extracting each source-to-target interaction, the data was imported into Gephi, a popular open source software package for visualizing and analyzing large network graphs.17 The resultant social network graph contained a total of 128,796 nodes and 217,621 edges. Of the 1,058 accounts in the sample, only 513 were labelled as sources (i.e. contained interactions with other users).
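The source-to-target extraction can be illustrated with a short sketch. It assumes tweets are available as plain text grouped by author and that targets can be recognized by an `@name` pattern; the function names and sample data are hypothetical. The CSV uses `Source`, `Target` and `Weight` column headers, which Gephi's spreadsheet import recognizes for edge lists.

```python
import csv
import io
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")


def extract_edges(tweets_by_author):
    """Map each (source, target) pair to the number of interactions observed."""
    edges = Counter()
    for source, tweets in tweets_by_author.items():
        for text in tweets:
            for target in MENTION.findall(text):
                edges[(source.lower(), target.lower())] += 1
    return edges


def write_edge_csv(edges, handle):
    """Write a weighted edge list with Source/Target/Weight headers."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])


# Hypothetical sample: two collected accounts, one account that is only mentioned.
sample_tweets = {
    "alice": ["@bob see this", "RT @carol: news", "@bob again"],
    "bob": ["@alice thanks"],
}
edges = extract_edges(sample_tweets)
nodes = {name for edge in edges for name in edge}
sources = {source for source, _ in edges}

buffer = io.StringIO()
write_edge_csv(edges, buffer)
print(buffer.getvalue())
```

Here `carol` becomes a node without ever being a source, which is the same effect that makes the node count in the study far larger than the number of collected accounts.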

15 In the course of this experiment, a number of variations on the snowball sampling were examined. Among these were using only 10 of the most frequently occurring users for each account, as well as allowing for collection against 3rd degree accounts (i.e. friends of friends of friends). The top-30, two-hop approach was determined to be optimal in that it collected enough data to enable social network construction in a reasonable amount of time, while avoiding difficulties associated with construction and analysis of exceedingly large graphs.

16 Bartlett J. & Miller C. The State of the Art: A Literature Review of Social Media Intelligence Capabilities for Counter-Terrorism (p. 35), November 2013.

17 For more information on Gephi and its capabilities, please visit https://gephi.org

3.4 Seed Community Detection

Once the social network graph was constructed, the next step was to use the properties of this graph to identify other accounts belonging to the Syrian FF Twitter community. One means of achieving this objective is to use the graph property known as modularity. Modularity is a measure of the underlying structure of a network or graph, indicating the degree of division of nodes within the graph into clusters or communities.18

Once modularity was applied to the @Hamidur1988 seed network, a number of clusters were identified. The cluster with the largest number of source nodes19 included @Hamidur1988 himself, as well as 75 other source accounts, and consisted of a total of 4,268 nodes and 10,103 edges.20

Manual inspection found the frequently shared content within this network to be consistent with the material of interest to the Syrian FF community,21 including a high volume of what can be called toxic content.22
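To make the modularity measure concrete, the sketch below scores a given division of a small graph using the Newman-Girvan definition, Q = Σ_c [ L_c / m − (d_c / 2m)² ], where m is the number of edges, L_c the number of edges inside community c, and d_c the total degree of its nodes. It is an illustration only: it treats edges as undirected and unweighted, it evaluates a partition rather than searching for one, and the graph and names are invented for the example.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman-Girvan modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = defaultdict(int)  # edges with both endpoints in the community
    degree = defaultdict(int)    # sum of node degrees per community
    for u, v in edges:
        degree[community_of[u]] += 1
        degree[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two triangles joined by a single bridge edge (hypothetical toy graph).
toy_edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
single = {node: 0 for node in split}

q_split = modularity(toy_edges, split)
q_single = modularity(toy_edges, single)
print(round(q_split, 4), round(q_single, 4))
```

Dividing the toy graph along the bridge scores higher than treating it as one cluster, which is the signal a community detection routine tries to maximize.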

3.5 Validation of Seed Community Membership

Determining with a high degree of certainty that a Twitter account belongs to a Syrian FF requires an approach similar to that described in Section 2.2 of this report. Thus, the method of identification of Syrian FF Twitter accounts using modularity clustering is expected to be limited in its accuracy. Potential alternatives to the time- and resource-intensive manual verification are discussed in Section 5 of this report.

However, other approaches to verification, such as comparison of findings to those of other researchers and practitioners working on the same topic, can provide an alternative means of validation. Such an opportunity was presented in April of 2014, when the International Centre for the Study of Radicalization and Political Violence (ICSR) published a paper titled “#Greenbirds: Measuring Importance and Influence in Syrian Foreign Fighter Networks.”23

The #Greenbirds study was the first of a number of forthcoming research reports based on a database of verified Syrian FFs who maintain an active presence on Twitter. When compared to the accounts captured by the snowball sample used for this study, all ten (100%) of the #Greenbirds accounts mentioned in the paper were present in the seed network, with 80% also present in the largest seed community.

18 Newman, M. E. J. and Girvan, M. "Finding and evaluating community structure in networks." Phys. Rev. E 69, no. 2 (2004): 026113.

19 Source nodes, as opposed to target nodes, represent Twitter accounts which contained tweets that interacted with other Twitter users (i.e. have outgoing edges).

20 In the course of experimentation, data collection and modularity clustering were attempted a total of six times. The findings presented herein describe the results obtained during the sixth and final attempt. Some of the failures encountered during earlier attempts can be attributed to technical difficulties related to the development of the script used to perform the collection. Please contact the SecDev Group for more information.

21 Supra note 10

22 Please see Section 4.0 for the definition of what constitutes “toxic content” for the purposes of this study.

23 The full text of the study can be accessed at http://icsr.info/wp-content/uploads/2014/04/ICSR-Report-Greenbirds-Measuring-Importance-and-Infleunce-in-Syrian-Foreign-Fighter-Networks.pdf


4.0 Toxic Content Analysis

Underlying this case study is the premise that members of radicalized communities (such as Syrian FFs) both consume and disseminate videos, pictures, and articles that seek to validate and legitimize their beliefs and activities. To the extent that such content can be a means of recruitment and influence, developing a method for identification and monitoring of such content, and understanding the patterns of its consumption, can offer valuable insights to PVE researchers and practitioners.

In the final phase of this case study we set out to explore the viability of using the Syrian FF online community to identify widely circulated toxic content, and whether or not it would be possible to estimate the geography of its distribution.

4.1 What is Toxic Content?

To define what constitutes “toxic content” for the purposes of this case study, we relied on the findings of the social science literature review24 and the longitudinal, hand-coded content analysis of the @Hamidur1988 Twitter feed,25 which is another component of SecDev’s Kanishka research.

Based on the outcomes of our research, three “toxicity” criteria were used to assess content identified via communications sampled from the Syrian FF Twitter community. For a piece of content (i.e. article, imagery, or video) to be considered toxic it needs to conclusively address the following themes:

a. Alienation of Muslims from their home countries;

b. Promotion of grievances between Muslims and non-Muslims; and

c. Calls for violence.

In addition, for the purposes of this case study the content had to be in English, so as to be deemed accessible to non-‐Arabic speakers residing in Western countries.

4.2 Geospatial Analysis of Toxic Content Consumption

This section describes the methodology that was developed to identify instances of widely circulated toxic content, and the geographic distribution of individuals engaged in its consumption and dissemination.

Indexing and Ranking

The first step of this process involved indexing and ranking all of the content contained within the Syrian FF online community sample. To ensure that each piece of content (represented by a URL) was properly counted, a Python script was written to expand any shortened URLs into their full form.26
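The study's own script is not reproduced in this report, so the following is only a sketch of how URL expansion and counting might be approached. It assumes that following HTTP redirects with the standard library is sufficient to reach the full-form URL; the function names are hypothetical, and the example swaps in a fake resolver so that it runs without network access.

```python
from collections import Counter
from urllib.request import urlopen


def resolve_redirects(url, timeout=10):
    """Follow HTTP redirects and return the final URL (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.geturl()


def count_expanded(urls, resolve=resolve_redirects):
    """Count shares per expanded URL, resolving each distinct short link only once."""
    cache = {}
    counts = Counter()
    for url in urls:
        if url not in cache:
            try:
                cache[url] = resolve(url)
            except OSError:
                # Assumption: an unreachable link is counted under the URL as posted.
                cache[url] = url
        counts[cache[url]] += 1
    return counts


# Fake resolver with made-up links, standing in for live redirect lookups.
fake_redirects = {
    "http://t.co/a": "http://example.org/video",
    "http://bit.ly/b": "http://example.org/video",
}
shared = ["http://t.co/a", "http://bit.ly/b", "http://t.co/a", "http://example.org/other"]
counts = count_expanded(shared, resolve=lambda u: fake_redirects.get(u, u))
print(counts.most_common())
```

Two different short links that point at the same resource are tallied together, which is the reason expansion has to happen before ranking.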

24 Supra note 1

25 Supra note 10

26 Links to external content posted on Twitter are often shortened using Twitter’s own http://t.co facility, or via third party services such as http://bit.ly. Expanding these shortened URLs is important because while the full form URL to a given piece of content will always be the same, there can be multiple shortened URLs pointing to the same resource.


Once expanded, the URLs were assessed against two criteria: frequency of sharing within the Syrian FF online community, and frequency of sharing across Twitter at large.27 To assist in ranking the overall prominence of a given URL, a balanced metric was constructed to provide a single score using both factors.28
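The balanced metric defined in note 28, B = sqrt(C² + (log10 R)²), combines the community share count C with the logarithm of the re-tweet count R. A minimal sketch of the scoring and ranking follows; the handling of R below 1 is an assumption made here to avoid an undefined logarithm, and the URLs and counts are invented for illustration.

```python
import math


def balanced_prominence(community_shares, retweets):
    """B = sqrt(C^2 + (log10 R)^2) for community shares C and re-tweets R."""
    # Assumption: fewer than one re-tweet contributes zero to the score.
    log_r = math.log10(retweets) if retweets >= 1 else 0.0
    return math.hypot(community_shares, log_r)


# Made-up (url, C, R) rows, not figures from the study.
candidates = [
    ("http://example.org/a", 3, 10_000),
    ("http://example.org/b", 5, 10),
    ("http://example.org/c", 1, 0),
]
ranked = sorted(candidates, key=lambda row: balanced_prominence(row[1], row[2]), reverse=True)
for url, c, r in ranked:
    print(url, round(balanced_prominence(c, r), 3))
```

Taking the logarithm of R keeps very large Twitter-wide counts from swamping the much smaller community counts, so a URL shared heavily inside the community can still outrank one that is merely popular at large.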

Starting with the most prominent URL, each piece of content was manually assessed against the “toxicity” criteria defined in Section 4.1 of this report. Four of the high-ranking URLs were found to lead to content that was no longer available. In addition, two more URLs were found to be linked to unrelated material (appeals for locating missing persons). Finally, the URL with the eighth-highest rank, an ISIS video titled “Establishment of the Islamic State Part 8 - Shaykh Abu Yahya Al-Libi”, was found to satisfy the toxic content criteria (see Figure 2).29

Capturing Historic Twitter Activity Involving an Expanded URL

The URL of the YouTube video selected for this experiment was then used to identify Twitter users who had taken part in its distribution, and as such could be assumed to constitute an engaged audience actively consuming or promoting such content.

27 The frequency of a given URL’s appearance across Twitter at large can be estimated using the “re-tweets” metric provided by Twitter. This metric ought to be treated with caution, as botnets and other forms of automated posting can artificially inflate the prominence of a tweet, making it appear more popular than it actually is.

28 Given that the Syrian FF community is much smaller than the overall number of users on Twitter, the number of re-tweets (R) was typically far larger than the number of times a piece of content was shared within the community (C). As such, the balanced prominence score (B) was derived as follows: $B = \sqrt{C^2 + (\log_{10} R)^2}$

29 The video can be viewed at http://youtu.be/qtPHw0lh-VA

Figure 2 - A screenshot from an ISIS YouTube video selected for collection


At this point we were faced with two data collection issues:

• The first had to do with the common usage of shortened URLs. Given that the expanded URL we intended to search for was obtained after the data was collected from Twitter, it could not be used as a reliable search criterion.30

• The second issue involved the limitations of the Twitter public API. Since we intended to collect all tweets that included the URL and were posted within a certain date range, our requirement fell well outside the capabilities made available by the public API.31

To address these problems, a decision was made to perform data collection for this phase of the study using DataSift, a commercial social media aggregation service.32 By using DataSift to conduct the search we were able to overcome both of the issues mentioned above. First, DataSift provides a number of augmentations to the raw data provided by the social media platforms themselves, including the ability to capture and search on expanded URLs. Second, DataSift offers access to the entire Twitter archive, going back more than 3 years.
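Why expanded URLs matter for matching can be shown with a short sketch. It assumes the expanded URLs are already available for each tweet (as an aggregation service might supply them) and reduces the common YouTube URL forms to a video ID before comparing. The function names, the list of recognised hosts, and the sample links are hypothetical; this is not DataSift's API or the study's actual tooling.

```python
from typing import Iterable, Optional
from urllib.parse import parse_qs, urlparse

# Assumption: these hosts cover the long-form YouTube links of interest.
YOUTUBE_HOSTS = ("youtube.com", "www.youtube.com", "m.youtube.com")


def youtube_video_id(url: str) -> Optional[str]:
    """Return the video ID from a youtu.be or youtube.com/watch URL, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        return parsed.path.lstrip("/") or None
    if host in YOUTUBE_HOSTS and parsed.path == "/watch":
        ids = parse_qs(parsed.query).get("v")
        return ids[0] if ids else None
    return None


def shares_target(expanded_urls: Iterable[str], target_url: str) -> bool:
    """True if any expanded URL points at the same video as target_url."""
    target_id = youtube_video_id(target_url)
    if target_id is None:
        return False
    return any(youtube_video_id(u) == target_id for u in expanded_urls)


# Hypothetical links: an unexpanded short link carries no video ID at all.
print(youtube_video_id("http://t.co/xyz"))  # None
print(shares_target(["https://www.youtube.com/watch?v=abc123"], "http://youtu.be/abc123"))
```

A shortened link such as the hypothetical `http://t.co/xyz` yields no match until it has been expanded, which is the limitation described in the first bullet above.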

Data Collection

The search window for data extraction was set to run from one day before the targeted YouTube video was posted to 14 days after. Once completed, the DataSift search yielded a total of 1668 tweets.
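The search window described above can be expressed as a small helper. The upload time used below is a hypothetical placeholder, since the report does not state when the video was posted, and the function names are illustrative only.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

Window = Tuple[datetime, datetime]


def search_window(posted_at: datetime, days_before: int = 1, days_after: int = 14) -> Window:
    """Window from `days_before` days before posting to `days_after` days after."""
    return posted_at - timedelta(days=days_before), posted_at + timedelta(days=days_after)


def in_window(timestamp: datetime, window: Window) -> bool:
    """True if the tweet timestamp falls inside the window (inclusive)."""
    start, end = window
    return start <= timestamp <= end


# Hypothetical upload time, for illustration only.
posted = datetime(2014, 1, 15, 12, 0, tzinfo=timezone.utc)
window = search_window(posted)
print(window[0].isoformat(), "to", window[1].isoformat())
```

Starting the window a day before the upload leaves a margin for timestamp discrepancies between platforms; that reading of the design choice is an assumption, as the report does not explain it.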

Of the tweets that were captured by the search, 753 were found to have been posted by the same Twitter account (@ابو ذر الجزراوي), most likely operated by someone using software for automated content promotion. The remaining 915 tweets appeared to originate from Twitter accounts that did not engage in mass broadcasts.33
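A simple way to flag this kind of concentration is to count tweets per account and set aside any account that contributes a disproportionate share. The sketch below uses synthetic author names that reproduce only the reported totals (753 and 915). The 25% threshold and the assumption that every remaining tweet came from a distinct account are illustrative choices, not details from the study.

```python
from collections import Counter
from typing import Dict, List, Tuple


def split_by_dominance(authors: List[str], max_share: float = 0.25) -> Tuple[Dict[str, int], int]:
    """Return (accounts exceeding max_share of all tweets, count of remaining tweets)."""
    counts = Counter(authors)
    total = len(authors)
    dominant = {a: n for a, n in counts.items() if total and n / total > max_share}
    remaining = total - sum(dominant.values())
    return dominant, remaining


# Synthetic data matching the reported totals; account names are hypothetical.
authors = ["bulk_account"] * 753 + [f"user_{i}" for i in range(915)]
dominant, remaining = split_by_dominance(authors)
print(dominant, remaining)
```

As footnote 33 cautions, a volume threshold of this kind would not catch a botnet that spreads its posts across many accounts.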

Analysis of Geospatial Metadata

The geospatial data embedded in the social media interactions was found to be of low quality, insufficient for practical applications. Namely, while some of the Twitter users who had forwarded the toxic content URL did have geo-tagging enabled, no geo-tags were applied to any of the re-tweets. Similarly, some of the tweets contained user-specified information concerning their country of origin, but given the low incidence and reliability of that data it too was deemed insufficient.

In all, while the method for identification of toxic content and monitoring of its consumption presented above showed some promise, a number of issues were encountered that prevented it from being an unqualified success. For further discussion on this please see the Conclusions section of this report.

30 While searching for the full URL may have resulted in collection of some instances of its dissemination, none of the shortened URLs would be picked up by the search.

31 Twitter’s public API provides limited access to approximately 1% of the entire Twitter activity, going back approximately 5 days.

32 More information on DataSift and the services it provides can be found at http://www.datasift.com

33 However, it is still possible that some of the tweets collected were posted or re-tweeted via a more sophisticated botnet.


5.0 Conclusions

In this section we provide an overview of the research findings, discuss the methods and techniques employed in the course of this case study, and propose directions for future research on this topic.

5.1 Summary of Research Findings

Based on the outcomes of the experiments presented in this report, we can confidently state the following:

• By applying specific validation criteria, it is possible to conclude with a high degree of confidence that an individual behind a social media persona is a member of a radicalized community of violent extremists.

• Using a verified source as a starting point, snowball sampling and modularity clustering provide an effective means of enumerating an online community of violent extremists and individuals who share their beliefs and aspirations.

• Monitoring the interactions of an online community whose members represent an extremist ideology can be an effective means of identifying toxic content and analysing its distribution.

5.2 Discussion of Methods and Techniques

Seed Account Identification and Verification

Although we ultimately relied on information provided by an external source, the Twitter account that was selected for verification could well have been located using open sources.34 What is perhaps more striking is the degree of confidence that can be achieved in verifying an individual’s place of residence, and other activities such as international travel, through the use of public social media activity.

Whether the application of such research methods would be deemed acceptable in countries with strict privacy regimes, such as Canada or Germany, remains an unanswered question.

Seed Network Construction and Community Enumeration

The application of snowball sampling and modularity clustering was demonstrated to be an effective alternative to a purely manual construction of a social network. We were further encouraged by the overlap in membership of the Syrian FF community identified by this study and the ICSR Syrian FF database.
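For readers unfamiliar with the two techniques, a generic sketch follows. It is not the study's implementation: the toy graph, account names, and function names are hypothetical. The first function collects accounts wave by wave outward from a seed. The second computes Newman's modularity score Q for a given partition, which is the quantity that modularity clustering algorithms try to maximise; the sketch evaluates a partition and does not search for one.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def snowball_sample(seed: Hashable, neighbours: Dict[Hashable, Iterable[Hashable]],
                    max_waves: int = 2) -> Dict[Hashable, int]:
    """Breadth-first snowball sample; maps each reached account to its wave number."""
    wave_of = {seed: 0}
    queue = deque([seed])
    while queue:
        account = queue.popleft()
        if wave_of[account] == max_waves:
            continue  # do not expand beyond the final wave
        for other in neighbours.get(account, ()):
            if other not in wave_of:
                wave_of[other] = wave_of[account] + 1
                queue.append(other)
    return wave_of


def modularity(edges: List[Edge], community_of: Dict[Hashable, int]) -> float:
    """Q = sum over communities of (internal_edges / m) - (degree_sum / 2m)^2."""
    m = len(edges)
    if m == 0:
        return 0.0
    internal: Dict[int, int] = {}
    degree_sum: Dict[int, int] = {}
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree_sum.items())


# Toy undirected graph: two triangles joined by a single edge (hypothetical accounts).
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
neighbours: Dict[Hashable, List[Hashable]] = {}
for u, v in edges:
    neighbours.setdefault(u, []).append(v)
    neighbours.setdefault(v, []).append(u)

sample = snowball_sample("a", neighbours, max_waves=2)
partition = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
q = modularity(edges, partition)
print(sample, round(q, 4))
```

In this toy graph, two waves from seed `a` reach `d` but not `e` or `f`, which shows how the number of waves bounds the sample. Grouping each triangle into its own community scores higher than treating the whole graph as one community, whose Q is zero.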

To decrease the likelihood of false positives, the community enumeration method employed by our study would benefit from the addition of a verification stage. This could be done either manually or by employing a version of the verification criteria, operationalized for a machine learning classifier.35

34 Supra note 11, as an example of a lead that would have connected the investigation to @Hamidur1988.

35 The SecDev Group’s in-depth analysis of @Hamidur1988’s Twitter stream could potentially be used to facilitate such a system. Supra note 10.


Toxic Content Identification and Monitoring

Although we were unsuccessful in obtaining reliable geospatial metrics of the distribution of individuals consuming and disseminating toxic content, it is clear that a number of steps could be taken in order to improve the technique.

Using multiple sources of toxic content would likely increase the number of sampled accounts, potentially increasing the chances of collecting sufficient geospatial metadata. Similarly, collecting additional tweets from each sampled account would increase the chances of obtaining geospatial metadata or mentions of geographic landmarks.

Finally, if the information of interest is ultimately aggregate in nature, data collection can be conducted in a manner better able to avoid collection of potentially sensitive private details. One example would be the DataSift demographics enhancement,36 which provides enhanced geospatial and demographic information while anonymizing the rest of the metadata to improve privacy protection.

5.3 Recommendations for Future Research

All of the methods and techniques employed in the course of this case study would benefit from external validation. More specifically, it would be instructive to apply the same sequence of seed account identification and validation, followed by seed network construction, in the context of a different group of violent extremists.

It is also worth noting that this study involved a considerable amount of manual content analysis. This suggests that applications of machine learning to this area could significantly enhance our technological capacity for detection of weak signals of radicalization.

A working hypothesis is that community-based or community-facing PVE practitioners would be very interested in having fine-grained and real-time data-feeds on trending toxic content within foreign fighter online networks, and may also be interested in general geo-located information on content consumers.37 Knowledge of trending toxic content, such as a specific VE video, provides a data-point for engagement with community members, to raise awareness, promote dialogue and thereby stimulate community protective factors. We recommend this as a discussion worth having.

Finally, development of privacy and ethics protocols for conducting social media analytics for PVE should be considered. For one, it would provide the necessary guidance to researchers and practitioners. But perhaps more importantly, it would contribute to the development of normative standards concerning acceptable use of open source social media.38


36 A Twitter augmentation data stream provided by Demographics Pro (http://www.demographicspro.com/).

37 For example, 10 Twitter users in your geo-located catchment area are actively consuming this content.

38 For a more substantive discussion concerning the issues of legality, privacy, and ethics of conducting open source social media research for PVE see SecDev’s Kanishka Research Summary Report.
