[ieee 2013 national conference on communications (ncc) - new delhi, india (2013.2.15-2013.2.17)]...



Tracking On-line Radicalization Using Investigative Data Mining

Pooja Wadhwa, M.P.S Bhatia

Computer Engineering Department, Netaji Subhas Institute of Technology

Azad Hind Fauj Marg, Sector 3, Dwarka, New Delhi

[email protected], [email protected]

Abstract—The increasing complexity and emergence of Web 2.0 applications have paved the way for threats arising from the use of social networks by cyber extremists (radical groups). Radicalization (also called cyber extremism and cyber hate propaganda) is a growing concern for society and of great pertinence to governments and law enforcement agencies across the world. Further, the dynamism of these groups adds another level of complexity to the domain, since the membership of a group may change over time, and this has motivated many researchers to work in this field. This proposal presents an investigative data mining approach for detecting the dynamic behavior of these radical groups in online social networks through textual analysis of the messages posted by their members, along with the application of techniques used in social network analysis. Some preliminary results obtained through a partial implementation of the approach are also discussed.

Keywords- Cyber extremism; Dynamic Social Networks; Data Mining; Networks; Social Network Analysis

I. INTRODUCTION

The rise and development of the Internet and the World Wide Web have provided a global network for sharing information and collaborating in trusting relationships. With their ease of accessibility, they have proliferated in our lives to the extent that users can access and share information anywhere, at any time. The rapid development of information technology and Web 2.0 has provided a platform for the evolution of terrorist and extremist organizations from a traditional pyramidal structure to a technology-enabled networked structure. Cyber extremism and cyber hate propaganda (also called online radicalization) have emerged as prominent threats to society, governments and law enforcement agencies. The Internet today offers these groups a virtual base from which they can plan their activities and carry out reporting, exchange of ideas, fundraising and recruitment [1]. There is therefore another facet to the use of the Internet which is ‘dark’ or ‘ugly’ and certainly undesirable for society. With the advent of Web 2.0, many social media have emerged which offer an easy means of exchanging ideas, overcoming the barrier of physical proximity. Web forums have emerged as a major medium for terrorists to promote violence and distribute propaganda materials [2]. In addition, blogs have provided a propaganda platform for extremist or terrorist groups to promote their ideologies [3].

Recently, these radical and extremist groups have shifted to social networking sites such as Facebook, Twitter and YouTube, where they post videos, recruit new members and spread their propaganda. This has fuelled the motivation to analyze the rich information available on these sites. Since these platforms are not monitored by any government or agency, they are more vulnerable to use by such elements: they allow members to connect with their groups and leaders online at any time and to post messages and ideologies. Another important consideration when analyzing these groups is that pure link analysis may lead to inconsistent results, as the importance of a member in a group cannot be understood merely by analyzing its connections to other members at a given time. A member might post messages relating to different agendas at different instances of time. Content analysis of the messages may therefore reveal the presence of a hidden group related to a specific topic.

This paper explores the above two aspects by presenting an overview of the existing literature on the subject and an approach for the dynamic detection of radical groups active in social networking sites, using data mining and social network analysis techniques.

II. RELATED WORK

The problem of analyzing extremism in social networking sites can be considered as an intersection of cyber security and topic-based Social Network Analysis. Hence our literature survey covers relevant papers in the areas of social network analysis and data mining, along with an exhaustive survey of the security domain.

A. Data Mining and its applications to Security

The various forms of knowledge discovery in the field of counterterrorism have been classified in [5] into four main groups:-

(i) Prediction: Aims to deduce the outcome or meaning of a particular situation by collecting and observing its properties.

(ii) Clustering: It refers to putting together groups of objects or situations, whose members resemble each other.

978-1-4673-5952-8/13/$31.00 ©2013 IEEE

(iii) Understanding connections: Aims to understand how objects, processes and especially people are connected.

(iv) Understanding the world of others: In the crime domain this includes sentiment analysis or determining which criminal groups are in connection to cooperating groups by exchanging messages, tools and information.

Each form of knowledge discovery can be applied to the following cases, as per [5]:-

TABLE I. FORMS OF KNOWLEDGE DISCOVERY

| | Prediction | Clustering | Understanding Connections | Understanding world of others |
|---|---|---|---|---|
| Crime | Prediction of unsolved crimes | Finding similar crimes | Detecting inter-related crimes | Detecting common attributes of crimes |
| Criminal | Prediction of next attack of criminal | Detecting similar criminals | Detecting friendship links of criminals | Detecting relationship of criminals |
| Criminal Network | Prediction of a missing member in criminal network | Detecting cliques and subgroups in criminal networks | Positional analysis of criminal networks members | Detecting similar groups, finding emerging networks |

In the case of counterterrorism, a new form of data mining known as Investigative Data Mining (IDM) has emerged [6]. It is defined as “The technique which models data to predict behavior, assesses risk, determine associations and help in neutralizing the terrorist network” [7]. Investigative Data Mining is based on the methods of traditional data mining, used in combination with modern methods that originate from research in algorithms and artificial intelligence. An example is the discovery of interesting links between people (social networks) and other entities.

B. Survey of the Analysis of the Dark Web

According to [4], there are three broad methods of activity by non-state actors: “activism”, “hacktivism” and “cyberterrorism”. Activism refers to the normal, non-disruptive use of the Internet in support of an agenda or cause. Hacktivism is the union of hacking and activism, and incorporates the use of hacking practices with the intent of disrupting normal operations but not causing serious damage. Cyberterrorism, on the other hand, is the convergence of cyberspace and terrorist activity. According to [8][9], the majority of terrorist groups focus their activity in the area of activism: publicity and propaganda spreading, fundraising, recruitment, networking and mobilization. As the Internet reaches across the globe and serves as a platform for connecting people and sharing information, it is also being used intensively by terrorist groups as a medium for communication, propaganda spreading, online hate and recruitment [4]. In recent years there have been many studies analyzing the presence of hate, extremist and activist groups on the web [9][10][8] (in this paper, we use the term activists to refer to groups or people engaged in spreading hate or extremism in respect of national security). It has been found in [9] that Middle Eastern Islamic groups are the most active exploiters of the Internet. An interesting study on understanding the structure of criminal and terrorist networks was carried out in [11]. Sentiment and affect analysis of Dark Web forums has been provided in [12], where the author applied machine learning approaches to address the multilingual content present on web forums. Many studies have analyzed extremist groups present on the web, i.e. forums, blogs and websites [10][8][13][14].

It has been found that the modern terrorist network is no longer hierarchical [4], and may be viewed as a dynamic association of nodes and hubs which tend to relocate with time. Understanding the relationships between such elements requires a focus on the key nodes in communication over time. Data related to the activity of these groups is dynamic [15] and requires time-based collection, analysis and response, but due to the limited scalability of human skills and abilities, a gap is bound to exist. Another aspect highlighted in [15] is the difficulty of collecting data from dark networks, as terrorists survive to the extent that their actions are hidden.

C. Social Network Analysis

Social Network Analysis (SNA) [16] is a set of powerful techniques to identify social roles, important groups and hidden structures in organizations and groups.

According to [17], the realm of Social Network Analysis can be classified into understanding social dynamics, discovering organizational structure, community identification and visualization. The task of understanding how a network is formed by actors and their behavior is called Social Dynamics. Social Dynamics refers to the behavior of groups that results from the interactions of individual group members, as well as to the study of the relationship between individual interactions and group-level behaviors. If the aim is to understand the organization of a social network by identifying key nodes and their impact on the overall network, the relevant area is Discovering (or understanding) Organizational Structure. Community Identification refers to discovering the groups that emerge due to specific response or activity patterns among users in a network, and has emerged as an active and interesting area of research. Several studies have investigated the community structure of real and online social networks [18][19].

The techniques mentioned in Table I involve the use of data mining approaches and may serve as a first step of refinement, over which IDM techniques such as SNA (Social Network Analysis) can be applied to reveal hidden structures. In counterterrorism, SNA techniques can play an important role in the knowledge discovery process [20][21][22].

III. RESEARCH GAPS

There has been a steady rise in extremism on the Internet across many social networking sites [1]. The survey of literature has highlighted the following research gaps:-

• It is difficult to gather and analyze large amounts of data, as the data may contain multimedia and multilingual content. Further, the data might be composed of a large number of structured and unstructured files with a large number of hyperlinks. Integration of data from multiple sources poses another challenge [15].

• There is an absence of significant literature on the dynamic tracking of extremist groups in social networking sites. It has been highlighted that the problem of handling dynamic data is challenging.

• It has also been pointed out that data about Dark Networks is difficult to collect [15].

• To the best of our knowledge, none of the existing literature makes a major contribution to the study of the evolution and dynamics of extremist groups in social networking sites.

• The use of an investigative data mining approach [6] for identifying dynamic nodes of extremist groups in social networking sites is a novel idea which needs to be explored.

IV. THE APPROACH

The research gaps mentioned above indicate the scope for research in this domain. Our research will address the problem of dynamic tracking of radicalization using investigative data mining techniques, incorporating Web mining for information extraction, traditional data mining for data preprocessing, filtering and clustering, and Social Network Analysis for community evolution, tracking and community analysis. At present the approach will focus only on the textual content of the messages, as we feel that the presence of specific words provides an indication of the context; hence we will ignore hyperlinks posted to pictures and other websites. Our approach takes into account that pure link analysis may lead to inconsistent results, as the importance of a member in a group cannot be understood merely by analyzing its connections to other members at a given time. A member might post messages relating to different agendas at different instances of time, and content analysis of the messages may reveal the presence of a hidden group related to a specific topic. A member may also post different messages in more than one community at the same instance of time. Considering all these facts, we have incorporated data mining as a preprocessing step before applying social network analysis techniques. The broad approach is shown in Figure 1 and will be explained in detail subsequently. The detailed approach is shown in Figure 2.

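Figure 1. Broad Approach (stages: Web mining techniques to capture data; Data mining to filter, cluster data; Dynamic Social Network Analysis)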

A. Web Mining techniques to capture data

The approach will focus on dynamic analysis of the community structure of a large On-line Social Network (OSN). In our case Twitter has been chosen for the case study. Twitter supports hashtag-based conversation. Hashtags are used to mark individual messages as relevant to a particular group, and to mark individual messages as belonging to a particular topic or "channel". Hence, we would first sample data for a duration of about a month corresponding to a specific hashtag such as jihad or Al-Qaeda. This will give us the first level of filtering in mining messages corresponding to a specific topic. In order to fetch data, a customized crawler will be written in Python to fetch tweets corresponding to specific hashtags. The captured messages/tweets will be filtered for additional details such as sender information, receiver information, date, time, language, whether the message is a reply_to message, and text. Thus, each tweet will have a format comprising the above-mentioned fields along with the text. The crawler will fetch tweets for a month, and we will store the tweets in the form of an Excel database from which further processing can be done.
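The tweet format and the first-level hashtag filter described above could be sketched as follows. This is only an illustration under stated assumptions: the `TweetRecord` class, its field names and the helper functions are hypothetical, the sample tweets are invented, and CSV is used as a plain-text stand-in for the Excel store. No crawler or Twitter API call is shown.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Iterable, Optional


@dataclass
class TweetRecord:
    """One captured tweet; field names are illustrative, not the authors' schema."""
    sender: str
    receiver: Optional[str]  # None when the tweet is not addressed to anyone
    date: str                # e.g. "2013-01-15"
    time: str                # e.g. "14:05:00"
    language: str            # e.g. "en"
    is_reply: bool           # whether this is a reply_to message
    text: str


def matches_hashtag(record: TweetRecord, hashtag: str) -> bool:
    """First-level filter: keep tweets whose text contains the given hashtag."""
    wanted = "#" + hashtag.lower().lstrip("#")
    words = record.text.lower().split()
    return any(word.rstrip(".,!?:;") == wanted for word in words)


def write_records(records: Iterable[TweetRecord], handle) -> None:
    """Store records as CSV rows, one column per field."""
    names = [f.name for f in fields(TweetRecord)]
    writer = csv.DictWriter(handle, fieldnames=names)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))


# Invented sample data, used only to exercise the functions.
sample = [
    TweetRecord("user_a", "user_b", "2013-01-15", "14:05:00", "en", True,
                "Reply about #jihad today"),
    TweetRecord("user_c", None, "2013-01-15", "14:20:00", "en", False,
                "Nothing relevant here"),
]
kept = [r for r in sample if matches_hashtag(r, "jihad")]
buffer = io.StringIO()
write_records(kept, buffer)
```

Keeping the record as a flat set of named fields means the later pre-processing and graph-building steps can read the same rows without knowing how they were captured.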

B. Data mining for pre-processing message text

After the messages have been captured, message files will be sampled at an interval of 60 minutes for pre-processing. First, the messages will be filtered for the English language, since messages in other languages would require additional language processing capabilities which are at present not available to us. Any hyperlinks present in the text will also be removed so as to reduce the processing overhead. In order to apply text mining to a message, the message will be tokenized, where a token corresponds to a word. These tokens will be further processed to remove stop words such as of, the, a, an, for and about, which we feel would add unnecessary processing overhead. After the stop words are removed, stemming will be performed, where all words derived from a root word will be replaced by that root word. For example, ‘jihadist’ and ‘jihadology’ can be considered to originate from the root word ‘jihad’ and hence can be replaced by it. After pre-processing, frequency counts can be used to identify the top ten 1-grams and 2-grams from the messages. These will correspond to the most frequently occurring subtopics of discussion within a topic. From the two lists of high-frequency 1-grams and 2-grams we will consider only the top five n-grams, which will ensure that these are the ones most talked about by the users. This will provide us with the top categories of subtopics based on frequency count. However, we might need to intervene manually to find out whether the subtopics identified are themselves related by users or by theme; that is, a subtopic identified might be related to specific users, or the subtopics might be related by context. Manual analysis will provide further insight into this. After the subtopics are identified, the messages/tweets will be clustered by subtopic. At this stage we have tweets corresponding to specific topics along with the additional details of sender, receiver, date, time and text, and these subtopics will serve as a precursor for the number of communities ‘k’ which may be active within a given topic (a hashtag in this case) at the sampled instance of time. These communities can also be referred to as hidden, as they might not be visible without the application of data mining.
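The pre-processing steps of this subsection (hyperlink removal, tokenization, stop-word removal, stemming and n-gram frequency counting) can be illustrated with a small standard-library sketch. It is not the authors' script: the stop-word list is a tiny sample, the suffix-stripping `stem` function is a naive stand-in for a real stemmer, and all names and example messages are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Tiny illustrative stop-word list; a real run would use a fuller list.
STOP_WORDS = {"of", "the", "a", "an", "for", "about", "is", "in", "on", "and", "to"}
# Hypothetical suffixes, chosen only to demonstrate reduction to a root word.
SUFFIXES = ("ology", "ists", "ist", "s")

URL_PATTERN = re.compile(r"https?://\S+")
TOKEN_PATTERN = re.compile(r"[a-z][a-z'-]*")


def stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> List[str]:
    """Remove hyperlinks, tokenize into words, drop stop words, then stem."""
    without_links = URL_PATTERN.sub(" ", text.lower())
    tokens = TOKEN_PATTERN.findall(without_links)
    return [stem(token) for token in tokens if token not in STOP_WORDS]


def top_ngrams(messages: Iterable[str], n: int, k: int) -> List[Tuple[Tuple[str, ...], int]]:
    """Return the k most frequent n-grams; n-grams never span two messages."""
    counts: Counter = Counter()
    for message in messages:
        tokens = preprocess(message)
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)


# Invented messages, used only to exercise the functions.
messages = [
    "The jihadist and jihadology forums http://example.com/x",
    "jihadists post about forums",
]
top_unigrams = top_ngrams(messages, n=1, k=2)
top_bigrams = top_ngrams(messages, n=2, k=5)
```

Because stop words are removed before counting, a 2-gram here is a pair of words that are adjacent after filtering, which follows the order of steps given in the text.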

Figure 2. Approach for Tracking Radicalization in On-line Social Networks

C. Social Network Analysis to discover evolution and tracking of communities

As observed, data related to the activity of extremist groups is dynamic [17] and requires time-based collection, analysis and response, but due to the limited scalability of human skills and abilities, a gap is bound to exist.

Once the number of communities ‘k’ has been decided, we will model the behavior over time of the nodes, which correspond to the members of groups. Dynamic social network analysis and visualization will therefore serve as a crucial step in understanding the complex process of evolution of members in these groups. After the messages are grouped by subtopic, the information will be mapped to a graph which will have a directed edge from node ‘A’ to node ‘B’ if there is any communication between the two within the sampled time interval (which in our case is 60 minutes). The sampled time interval can be chosen as desired by observing the message burst. This will result in a directed graph containing ‘k’ communities. We can then apply a dynamic community algorithm such as evolutionary k-means clustering to observe the evolution of a community. In addition, we may identify central nodes and bridge nodes each time a community evolves. This will result in the identification of the most active key members and of the members who act as carriers of information from one community to another. In addition, by keeping track of the number of active nodes in a group, we may determine whether a topic is more influential than other topics at any instance of time. One can also determine whether a topic has been the most influential in the past. In this respect we propose a ‘topic-entity mapping scheme’, in which there will be a star network containing the topic as the central hub, with the nodes being the entities/users who post messages on the topic. Thus, at each sampled instance of time, new nodes will be added to the star network, and the visualization will show the active nodes in the community in bright shades and older nodes in light shades. The ‘topic-entity map’ will thus provide an insight into the history of a topic. This is shown in Figure 3.
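The graph construction described in this subsection could look roughly like the sketch below. Several points are assumptions rather than part of the proposal: edges are taken to run from sender to receiver, the 60-minute window is half-open, and total degree (in-degree plus out-degree) is used as a simple stand-in for whichever centrality measure is finally chosen. Evolutionary k-means clustering is not implemented here.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, Set, Tuple

# (sender, receiver, timestamp); a hypothetical minimal message record.
Message = Tuple[str, str, datetime]


def build_window_graph(
    messages: Iterable[Message],
    start: datetime,
    window: timedelta = timedelta(minutes=60),
) -> Dict[str, Set[str]]:
    """Adjacency sets with an edge sender -> receiver for messages in [start, start + window)."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    end = start + window
    for sender, receiver, stamp in messages:
        if start <= stamp < end:
            graph[sender].add(receiver)
            # Register receivers too, so nodes without outgoing edges still appear.
            graph.setdefault(receiver, set())
    return dict(graph)


def degree_counts(graph: Dict[str, Set[str]]) -> Dict[str, int]:
    """Total degree (out-degree plus in-degree) of every node."""
    degree = {node: len(targets) for node, targets in graph.items()}
    for targets in graph.values():
        for target in targets:
            degree[target] += 1
    return degree


# Invented messages, used only to exercise the functions.
t0 = datetime(2013, 1, 15, 14, 0)
messages = [
    ("A", "B", t0 + timedelta(minutes=5)),
    ("A", "C", t0 + timedelta(minutes=30)),
    ("B", "A", t0 + timedelta(minutes=45)),
    ("D", "A", t0 + timedelta(minutes=70)),  # falls in the next window
]
graph = build_window_graph(messages, t0)
degree = degree_counts(graph)
most_active = max(degree, key=degree.get)
```

Rebuilding the graph for each successive window gives a sequence of snapshots over which community evolution can then be tracked.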

Figure 3. Topic-Entity Map

In Fig. 3, we can see that at a sample time instance, say t, the nodes labeled ‘C’, ‘B’, ‘E’ and ‘F’ are active with respect to the subtopic, and in the past the nodes labeled ‘A’, ‘D’ and ‘F’ have also talked about it (shown by dotted lines). Thus the subtopic has influence on 6 distinct nodes.
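A minimal sketch of the topic-entity map, reproducing the node sets from the example above, might look like this. The class name, its methods and the rule that influence is counted as the union of active and past nodes are assumptions made for illustration; the subtopic label is a placeholder.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set


@dataclass
class TopicEntityMap:
    """Star network: one topic hub plus the users who have posted on it."""
    topic: str
    active: Set[str] = field(default_factory=set)  # drawn in bright shades
    past: Set[str] = field(default_factory=set)    # drawn in light shades

    def advance(self, users_now: Iterable[str]) -> None:
        """Move to the next sample instant: current actives become history."""
        self.past |= self.active
        self.active = set(users_now)

    def influence(self) -> int:
        """Number of distinct users ever attached to the topic hub."""
        return len(self.active | self.past)


# Node labels taken from the Fig. 3 discussion; the topic name is a placeholder.
topic_map = TopicEntityMap("subtopic")
topic_map.advance(["A", "D", "F"])       # earlier sample instant
topic_map.advance(["C", "B", "E", "F"])  # sample instant t
```

Node ‘F’ appears in both sets because it posted earlier and is active again at t, which is why the union contains six nodes rather than seven.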

In order to model the evolution of new communities, each run of the data mining step above will result either in the same communities or in a change to the existing communities (observed as a new n-gram on the basis of frequency count), which will be modeled by our social network analysis approach. Keeping in view the delicate nature of extremism, we will sample data for an entire day at an interval of 60 minutes, which might be increased or decreased on the basis of the data burst. The duration has been chosen keeping in mind that the subtopics will not change frequently and are likely to remain active over a period of time, so that the analysis can be done. The approach can be extended to a few days once the behavior of these groups is understood through careful analysis.
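One simple way to observe whether a new sampling interval produced the same communities or changed ones is to compare the sets of top subtopics from two consecutive intervals. The sketch below assumes subtopics are represented as plain strings; the function name and the labels are hypothetical.

```python
from typing import Dict, Iterable, Set


def compare_windows(previous: Iterable[str], current: Iterable[str]) -> Dict[str, Set[str]]:
    """Classify subtopics as persisting, emerging or fading between two windows."""
    prev, curr = set(previous), set(current)
    return {
        "persisting": prev & curr,  # same communities as before
        "emerging": curr - prev,    # new n-grams, candidate new communities
        "fading": prev - curr,      # subtopics no longer in the top list
    }


# Placeholder subtopic labels for two consecutive 60-minute windows.
change = compare_windows(["topic_a", "topic_b"], ["topic_b", "topic_c"])
```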

V. PRELIMINARY EXPERIMENT

We conducted a preliminary experiment by capturing data from Twitter for one day using our customized crawler written in Python. Tweets were captured for the hashtag “Al-Qaeda” and were initially filtered for the English language. For our analysis, we analyzed tweets on an hourly basis to identify categories.
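The hourly slotting used for the analysis can be sketched as a grouping of timestamped tweets by the hour in which they were posted. The function name and the sample data are illustrative assumptions, not the script used in the experiment.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def bucket_by_hour(tweets: Iterable[Tuple[datetime, str]]) -> Dict[datetime, List[str]]:
    """Group (timestamp, text) pairs by hour slot, returned in chronological order."""
    buckets: Dict[datetime, List[str]] = defaultdict(list)
    for stamp, text in tweets:
        slot = stamp.replace(minute=0, second=0, microsecond=0)
        buckets[slot].append(text)
    return dict(sorted(buckets.items()))


# Invented tweets, used only to exercise the function.
tweets = [
    (datetime(2013, 1, 15, 10, 59), "second"),
    (datetime(2013, 1, 15, 9, 15), "first"),
    (datetime(2013, 1, 15, 10, 5), "third"),
]
hourly = bucket_by_hour(tweets)
```

Each slot can then be passed to the n-gram counting step to obtain the top categories for that hour.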

The data was preprocessed via our script written in Python, incorporating the NLTK library. The captured data was tokenized, stripped of stop words and stemmed. In addition, we had to create a wastelist comprising some noisy words which we considered irrelevant to the domain of security. The wastelist was used for final filtering. After the data was filtered, we identified the top 1-grams, 2-grams and 3-grams based on frequency count.

A. Observations

It was observed that the results for 3-grams were not rich in information or frequency, as the topmost 3-grams had a frequency count of 1; hence 3-grams were discarded for our research. We considered only 1-grams and 2-grams for topic identification. Since the aim of the research is to find the top categories of topics within predefined hashtags, over which we expect communities to evolve over a period of time, the results of the hourly slots with the top 1-grams and 2-grams were analyzed for an entire day to determine the final categories manually. It was also observed that manual intervention was an important step, as at some particular times during data analysis we were not able to find any relevant topics, and these slots were discarded. Some of the top 1-grams and 2-grams identified on an hourly basis are shown in Figure 4, where the values in grey depict the final categories.

Figure 4. Hourly Tweet Analysis

VI. CONCLUSION AND FUTURE WORK

Our preliminary experiment is strongly relevant to the overall approach, as it brings to light some interesting findings and also offers some future directions for our work. Firstly, we were able to detect the top categories of topics from 1-grams and 2-grams, and were able to conclude that n-grams with n>=3 did not result in relevant topics. We also feel that a fully automated approach will not be able to detect topics successfully without incorporating manual intervention, as tweets are extremely dynamic in nature and at a given instance of time we may encounter no important topics relevant to the domain. Further, we feel that the frequent n-gram approach can be used for finding relevant keywords from hashtag-based conversation, but the approach may need some modification when used for topic identification from a large corpus where messages may span a few lines, e.g. blogs. In that case, topic modeling approaches may be tried. Further, we will extend the experiment over a period of days to identify a fixed window over which we can fix category detection. We will fully implement the approach discussed in this paper as part of our future work.

REFERENCES

[1] Declan McCullagh, “White House: need to monitor online ‘extremism’”. http://news.cnet.com/8301-31921_3-20087677-281/white-house-need-to-monitor-online-extremism/, August 3, 2011.

[2] Yulei Zhang, Shuo Zeng, Li Fan, Yan Dang, Catherine A. Larson, Hsinchun Chen, Dark Web Forums Portal: Searching and Analysing Jihadist Forums. IEEE, 2009.

[3] Michael Chau, Jennifer Xu, Mining Communities and their relationships in blogs: A study of online hate groups. International Journal of Human-Computer Studies, 65 (2007) 57-70, Elsevier.

[4] A Framework for understanding Terrorist Use of the Internet, Canadian Centre for Intelligence and Security Studies. ITAC, Volume 2006-2.

[5] Fatih Ozgul, Claus Atzenbeck, Ahmet Celik, Zeki Erdem, Incorporating data sources and methodologies for crime data mining, 2011, IEEE.

[6] Muhammad Akram Shaikh, Wang Jiaxin, Investigative Data Mining: Identifying Key Nodes in Terrorist Networks, 2006, IEEE.

[7] Nasrullah Memon, Abdul Rasool Qureshi, Investigative data mining and its Application in Counterterrorism. Proceedings of the 5th WSEAS International Conference on Applied Informatics and Communications, September 15-17, 2005.

[8] Phyllis B. Gerstenfeld, Diana R. Grant, Chau-Pu Chiang, Hate Online: A Content Analysis of Extremist Internet Sites. Analysis of Social Issues and Public Policy, Vol. 3, No. 1, 2003, pp. 29-44.

[9] Jialun Qin, Yilu Zhou, H. Chen, A multi-region empirical study on the presence of global extremist organizations, Information Systems Frontiers, 2010, Springer.

[10] Yilu Zhou, Edna Reid, Jialun Qin, H. Chen, Guanpi Lai, U.S. Domestic Extremist Groups on the Web: Link and Content Analysis, University of Arizona.

[11] Jennifer Xu, H. Chen, The Topology of Dark Networks, Communications of the ACM, October 2008.

[12] Hsinchun Chen, Sentiment and Affect analysis of Dark Web Forums: Measuring Radicalization on the Internet, 2008, IEEE.

[13] Yulei Zhang, Shuo Zeng, Li Fan, Yan Dang, Catherine A. Larson, H. Chen, Dark Web Forums Portal: Searching and Analysing Jihadist Forums, 2009, IEEE.

[14] Michael Chau, Jennifer Xu, Mining communities and their relationships in blogs: A study of online hate groups. International Journal of Human-Computer Studies, 2006, Springer.

[15] Nancy C. Roberts, Tracking and disrupting dark networks: Challenges of data collection and analysis, Information Systems Frontiers, 2011, Springer.

[16] S. Wasserman & K. Faust, Social Network Analysis: methods and applications. 1994, Cambridge University Press, Cambridge.

[17] Pooja Wadhwa, M.P.S Bhatia, Social Network Analysis: Trends, Techniques and Future Prospects, 2012, ARTCOM, in press.

[18] S. Fortunato, Community detection in graphs, Physics Reports, vol. 486, pp. 75-174, 2010.

[19] Yihan Guan, Yiliang Jin, Community Structure and Information Propagation in a Twitter Network, Stanford Project Report, December 8, 2010.

[20] Rabeah Al-Zaidy, Benjamin C.M. Fung, Amr M. Youssef, Towards Discovering Criminal Communities from Textual Data, 2011, SAC’11, ACM.

[21] Sun Duo-Yong, Guo Shu-Quan, Zhang Hai, Li Ben-Xian, Study of Covert networks of Terroristic Organisations Based on Text Analysis, 2011, IEEE.

[22] Sun Duo-Yong, Guo Shu-Quan, Zhang Hai, Li Ben-Xian, Study on Covert networks of terrorists based on Interactive Relationship Hypothesis, 2011, IEEE.