Detecting Toxic Content using Open Source Social Media:
An Actor-Centric Approach

The SecDev Group, 2014

Notice:

This paper summarizes research conducted by The SecDev Group, as part of a Public Safety Canada, Kanishka-funded project looking at social media analytics and the prevention of violent extremism. Citation of this document is allowed, provided appropriate acknowledgement is given.

Table of Contents

1.0 Introduction
1.1 What is Kanishka?
1.2 About SecDev Kanishka Research
1.3 This Report
1.4 Why Twitter?
2.0 Seed Account Identification
2.1 Seed Candidate Validation Criteria
2.2 Data Collection
2.3 Analysis of Candidate Accounts
2.4 Seed Candidate Verification
3.0 Seed Network Construction
3.1 Social Network Modelling Using Twitter Interactions
3.2 Data Collection Using Snowball Sampling
3.3 Seed Network Construction
3.4 Seed Community Detection
3.5 Validation of Seed Community Membership
4.0 Toxic Content Analysis
4.1 What is Toxic Content?
4.2 Geospatial Analysis of Toxic Content Consumption
5.0 Conclusions
5.1 Summary of Research Findings
5.2 Discussion of Methods and Techniques
5.2 Recommendations for Future Research

1.0 Introduction

1.1 What is Kanishka?

The Kanishka Project is a multi-year initiative funded by the Government of Canada to support terrorism-focused research. Unveiled on June 23, 2011, the project is named after the Air India Flight 182 plane that was bombed on June 23, 1985, killing 329 people, most of them Canadians.

The initiative invests in research to increase understanding of the recruitment methods and tactics of terrorists, to help produce more effective policies, tools and resources for law enforcement and people on the front lines. Although the project's primary focus is on research, it also supports other activities necessary to build knowledge and create a network of researchers and students that spans multiple disciplines and research organizations.

The overarching goal of the Kanishka Project is to improve Canada's ability to counter terrorism and violent extremism at home and abroad. This report provides an account of one of the case studies funded by a grant provided to The SecDev Group under the Kanishka Project.

1.2 About SecDev Kanishka Research

Over the past year The SecDev Group engaged in a set of practical experiments exploring techniques and methods for detecting violent extremist content and communities at risk of radicalization online.

Our approach was inspired by the public health approach to violence reduction developed by the World Health Organization. We started with four basic assumptions:

• Violent extremist groups are active and savvy users of social media spaces;

• While pathways to radicalization and violence are highly idiosyncratic, socialization plays an important role; therefore tracking and analyzing on-line ties and toxic content has potential utility.[1]

• Open-source social media (OSSM) analytics has the potential to generate information that could prove useful to improving public safety through the prevention of violent extremism;

• Methods and techniques are in their infancy. Our work is exploratory. A main purpose is to raise questions and identify areas for further research.

Our open source research explored different techniques for identifying online networks that encourage violence, as well as toxic content and its audiences. We also did some initial exploration of audience geo-location, as we thought this could provide potentially useful information for local preventative strategies.

[1] Ragheb, Abdo. 2014. Review of Social Science Literature on Radicalization to Assess Operational Utility for Open Source Social Media Research in the Interests of Prevention of Violent Extremism. The SecDev Group

1.3 This Report

This report provides a summary account of one of the several case study experiments conducted by The SecDev Group under a research grant from the Kanishka Project. The principal focus of this case study was to investigate the “actor-centric” approach to surfacing an online network promoting violent extremism – in this case foreign fighters in Syria who embrace ISIL objectives – with a view to then exploring potential hallmark content that could be of interest to Prevention of Violent Extremism (PVE) practitioners concerned about the foreign fighter phenomenon. The case study featured a series of experiments, each building on the results of the previous one:

• The first stage of the study sought to identify a Twitter account belonging to an individual who could be verified to be a foreign national, currently engaged in armed combat in the Syrian Civil War (see Section 2).

• The Twitter account of the individual verified to be a Syrian foreign fighter (FF) was then used to construct a “seed network,” for the purposes of identifying a community of Twitter users who are influenced by, or share, this individual’s extremist ideology (see Section 3).

• Once the “seed community” consisting of individuals most closely aligned with the views and interests of the original “seed account” was identified, the corpus of its social media interactions was evaluated for presence and prominence of toxic content (see Section 4).

The final section of this report provides an overview of the research findings, discusses the methods and techniques employed in the course of this case study, and proposes directions for future research on this topic (see Section 5).

It is important to note that the primary goal of this study was to examine the methods and techniques for detection of networks of radicalized individuals and toxic content on open social media platforms. As such, examples and analysis of social media use by Syrian foreign fighters presented in this report are to be viewed as a vehicle for the demonstration of said methods and techniques.

1.4 Why Twitter?

For practical purposes, data collection for this case study was limited to Twitter, a popular social media platform. The main reasons for this are presented below.

• Unlike users of other social media platforms, Twitter users intend for their tweets (i.e. posts) to be accessible by the public, thus minimizing the potential for violating the user’s privacy;

• Twitter provides a free and open API[2], making it possible to automate collection of significant volumes of data for offline analysis;

• Twitter’s API provides access to structured data, greatly simplifying analysis as compared to data collected by scraping websites, or obtained via other unstructured sources;

• Unlike some social media platforms (e.g. Facebook), Twitter encourages users who do not know each other to interact and share content;

• The Middle East has some of the most engaged Twitter users around the globe;[3]

• When it comes to groups that promote violent “jihad” ideologies, such as those inspired by al-Qaeda, more and more are taking to public online spaces to promote their cause and reach new recruits;

• Foreign fighters in Syria’s civil war are heavy users of social media. Many were active in social media prior to taking up arms; once in country, some re-engage actively with followers from their home country to promote ISIL, answer questions and offer encouragement.[4]

[2] API stands for application programming interface, i.e. a means for direct computer-to-computer interaction
[3] In the Middle East, Twitter Rules (http://www.emarketer.com/Article/Middle-East-Twitter-Rules/1009737)
[4] See, for example, the companion piece of SecDev Research; Abdo, Ragheb. 2014. Assessment of a Syrian Foreign Fighter’s Twitter Trajectory (The SecDev Group, unpublished manuscript).

2.0 Seed Account Identification

The objective of the first experiment was two-fold:

1. Identify an active social media user account that may belong to a foreign national currently participating in the Syrian Civil War.

2. Using the corpus of this individual’s social media activity, and any additional content embedded or linked to therein, verify that this person is indeed a Syrian foreign fighter (FF).

2.1 Seed Candidate Validation Criteria

To conclusively verify that a given Twitter account belongs to a person who is in fact a Syrian FF, the following validation criteria were devised:

I. Evidence of interaction with other Twitter users, typical of an average Twitter user (as opposed to a broadcast account used solely for purposes of content distribution), so as to allow for social network modeling in later experiments.

II. Sufficient evidence to conclusively place the account holder within their country of origin (e.g. references to local events, geo-spatial meta-data, images or video placing the account holder in a specific location).

III. A period of inactivity associated with travel from the account holder’s country of origin to Syria.

IV. Sufficient evidence to conclude that the account holder has travelled to Syria for the purposes of engaging in armed combat (e.g. references to local events, geo-spatial meta-data, images or video placing the account holder in a specific location, engaged in activities of interest).

2.2 Data Collection

In order to evaluate candidate Twitter accounts against the Syrian FF validation criteria, the tweets associated with said accounts were downloaded using an open source Python script.[5]

It is worth noting that use of Twitter’s public API is subject to limits both in terms of frequency of access as well as the volume of data that can be obtained.[6] In the case of data collection against a specific account, Twitter limits access to the latest 3,200 tweets, meaning that in some cases it may not be possible to obtain a complete record of a given account’s activity.
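The practical effect of the per-account ceiling can be sketched as a paging loop. The Python below is illustrative only and is not the script referenced above: `fetch_page` is a hypothetical callable standing in for whatever client retrieves one page of an account's timeline, and the idea of paging backwards by tweet id is an assumption made for the example. Only the 3,200-tweet ceiling is taken from the text.

```python
from typing import Callable, Dict, List, Optional

TIMELINE_CAP = 3200  # most recent tweets retrievable per account, per the text above


def collect_timeline(
    fetch_page: Callable[[Optional[int]], List[Dict]],
    cap: int = TIMELINE_CAP,
) -> List[Dict]:
    """Page backwards through one account's timeline until it ends or the cap is hit.

    fetch_page(max_id) is a hypothetical function returning tweets (dicts with an
    integer "id") no newer than max_id, newest first; None means "start at the latest".
    """
    collected: List[Dict] = []
    max_id: Optional[int] = None
    while len(collected) < cap:
        page = fetch_page(max_id)
        if not page:
            break  # nothing older is available
        collected.extend(page)
        # Next request asks for tweets strictly older than the oldest one seen so far.
        max_id = min(tweet["id"] for tweet in page) - 1
    return collected[:cap]
```

An account with more tweets than the cap therefore yields only its most recent activity, which is why a complete record cannot always be obtained.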

It is also important to recognize that some account holders regularly delete their tweets. This may present a challenge since the use of Twitter’s public API is governed by the Terms of Service agreement, which requires one to delete any tweets they have downloaded if they become aware that the author has requested those tweets to be deleted.[7]

[5] The script used to perform data collection can be found at https://gist.github.com/yanofsky/5436496
[6] Why the 3200 tweet user timeline limit and will it ever change? (https://dev.twitter.com/discussions/276)
[7] Working with Timelines - Handling deleted tweets (https://dev.twitter.com/discussions/10035)

2.3 Analysis of Candidate Accounts

The search for a Twitter account belonging to a Syrian FF began with analysis of open sources, focussing primarily on print media. The logic behind this approach was to minimize potential privacy concerns by relying on the use of information that is already part of the public record. Our initial search resulted in a list of names of six known Canadian foreign fighters. None of the persons identified in this list of candidates were found to be active Twitter users, so all were disqualified from further analysis.

To proceed, an external source[8] was used to obtain a list of 15 additional candidate Twitter accounts. Once the complete records of the additional candidate accounts were downloaded, they were subjected to manual analysis with intent to satisfy the Syrian FF validation criteria.

Analysis of the fourth account (@AbuDujanaBrtany) showed repeated references to another Twitter account (@RadicalIslamist). Given that account’s name and the context in which it appeared, the decision was made to download the tweets associated with @RadicalIslamist. Manual review of the data showed several tweets in support of the Islamic State of Iraq and al-Sham (ISIS) and a re-tweet (equivalent to a forwarded email) from an account called @Hamidur1988, whose holder purported to be a member of ISIS (see Figure 1).

Figure 1 - Tweets linking the @Hamidur1988 account to the Ask.fm service and self-identifying as involved with ISIS

Further investigation of the interaction depicted in Figure 1 led to a post on Ask.fm[9] answering a question about how the account holder felt when he first arrived in the “conflict zone.” The conflict zone was assumed to be Syria, given that the account holder also claimed to be a member of ISIS.

The usernames of both the Twitter and the Ask.fm accounts contained the words “Al Britani” (Arabic for “the Briton”), indicating that the user may have a connection to the United Kingdom. In addition, all of the posts reviewed were written in English. Given the substantial evidence that @Hamidur1988 was a Syrian FF, the account’s data was downloaded for further analysis.

[8] The external source in question was Mubin Shaikh who, acting in his capacity as an advisor to the Kanishka Project, provided us with fifteen Twitter accounts of individuals known to be affiliated with the Syrian FF community.
[9] “Ask.fm is a Latvia-based social networking website where users can ask other users questions, with the option of anonymity.” (Source: http://en.wikipedia.org/wiki/Ask.fm)

2.4 Seed Candidate Verification

I - Interaction with Other Users

At the time of collection, the @Hamidur1988 account’s record of activity contained 1,907 tweets, spanning nearly one and a half years. Over this period, @Hamidur1988 averaged approximately 3.6 tweets a day. Manual inspection of the account’s overall activity showed a sufficient level of interaction with other Twitter users.[10]
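As a quick check on these figures, the daily average is the tweet count divided by the number of days covered. The sketch below assumes a span of 530 days, a value chosen purely for illustration because it is consistent with "nearly one and a half years"; the exact dates of the first and last tweets are not given in this report.

```python
def tweets_per_day(tweet_count: int, span_days: int) -> float:
    """Average number of tweets per day over the observed span."""
    if span_days <= 0:
        raise ValueError("span_days must be positive")
    return tweet_count / span_days


# 1,907 tweets comes from the text; 530 days is an assumed span (about 1.45 years).
rate = tweets_per_day(1907, 530)
print(round(rate, 1))  # prints 3.6
```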

II - Country of Origin

Analysis of @Hamidur1988’s tweets yielded a tweet geo-tagged as originating from Portsmouth, United Kingdom. In all, four tweets referenced Portsmouth, including one from a Twitter user who asked @Hamidur1988 to inform others in Portsmouth of a prayer session at Portsmouth Central Masjid. Taken together, these references indicated with considerable confidence that the individual behind the @Hamidur1988 account was indeed a member of the Portsmouth community.

III - Period of Travel

The stipulated period of travel was evident in the noticeable absence of content posted during November of 2013. The exception was a status update made on November 17: “In the west we have everything but we never content. Here we have nothing but it feels like we have everything. Subhanallah.”[11] The tweet implies that @Hamidur1988 was no longer in the West.
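A gap of this kind can also be located programmatically rather than by eye. The following sketch assumes tweet timestamps are already available as Python `datetime` objects and finds the longest interval between consecutive tweets. The function name and the sample dates are hypothetical and are not drawn from the account's data.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def longest_gap(timestamps: List[datetime]) -> Tuple[datetime, datetime, timedelta]:
    """Return (start, end, duration) of the longest pause between consecutive tweets."""
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    ordered = sorted(timestamps)
    pairs = zip(ordered, ordered[1:])
    start, end = max(pairs, key=lambda pair: pair[1] - pair[0])
    return start, end, end - start


# Hypothetical sample: regular activity, one isolated post, then activity resumes.
sample = [
    datetime(2013, 10, 28), datetime(2013, 10, 30),
    datetime(2013, 11, 17),
    datetime(2013, 12, 2), datetime(2013, 12, 3),
]
start, end, duration = longest_gap(sample)
print(start.date(), end.date(), duration.days)
```

A long pause alone does not establish travel; in the study it was one criterion among four and was read together with the content of the tweets on either side of the gap.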

IV - Presence in Syria and Engagement in Armed Combat

Following the lack of activity during his travel to Syria, @Hamidur1988’s account showed an increase in posts on ISIS. These posts primarily featured Ask.fm questions on @Hamidur1988’s participation in fighting with ISIS in the “Sham.”

From August 2013, until the suspected time of travel to Syria, @Hamidur1988’s communication was most frequent with another Twitter user, @jamanwtf. Further inspection of the @jamanwtf account revealed that he was a “face-to-face” friend of @Hamidur1988, named Iftikhar Jaman, who had purportedly been a Syrian FF. Open sources were used to corroborate that Jaman was indeed in Syria, and that he had in fact been killed, explaining the sudden cessation of activity on his Twitter account on November 29, 2013.[12]

Conclusion

In light of the findings presented above, it was possible to conclude with a high degree of confidence that the @Hamidur1988 Twitter account did indeed belong to a Syrian FF.

[10] For more information please see SecDev’s companion study, Abdo, Ragheb. 2014. Analysis of a Syrian Foreign Fighter’s Twitter Feed (The SecDev Group, unpublished manuscript).
[11] The original tweet can be accessed at https://twitter.com/Hamidur1988/status/403480384652206080
[12] British 'celebrity jihadi' and chef dies in Syria (Source: http://www.telegraph.co.uk/news/worldnews/middleeast/syria/10524179/British-celebrity-jihadi-and-chef-dies-in-Syria.html)

3.0 Seed Network Construction

In this section we present the results of the second experiment conducted as part of this case study. Once we identified an account belonging to a verified Syrian FF, we proceeded to construct a social network graph using @Hamidur1988’s account as the seed. The intent behind this experiment was to identify a closely knit group of like-minded individuals who, like @Hamidur1988, engaged in production and dissemination of content that promotes violent extremist action.

3.1 Social Network Modelling Using Twitter Interactions

The success of social media platforms like Facebook and Twitter is in large part due to the manner in which they allow their users to recreate social behaviour humans naturally engage in offline. Much like the social behaviour we exhibit offline, online interactions include:

• Formal relationships, manifested by “following” someone’s Twitter account and having other Twitter users “follow” your own account;

• Social conversations, such as tweets that reply to another tweet, or mention a specific Twitter account; and

• Information sharing via the re-tweet function, to facilitate the propagation of the message across the network.

It is important to remember that the social media activities visible to the public typically represent a small fraction of the sum total of any one person’s social interactions. However, given that extremist views are by definition outside of the mainstream, social media platforms enable individuals who may otherwise find themselves isolated to find others who share their extremist perspective, unhampered by physical geography.

With these assumptions in mind we set out to construct a social network graph based on the public Twitter activity of @Hamidur1988.

3.2 Data Collection Using Snowball Sampling

Snowball sampling is a common approach to collecting data on members of a hard-to-reach population. The essence of this non-probability sampling technique[13] is to use existing study subjects to recruit additional subjects from among their acquaintances.

Given that the sample members are not recruited using a sample frame, the technique is subject to a number of potential biases. Chief among these is the extent to which a given member is known to others within the population of interest, which has a direct impact on the likelihood of an individual being included in the sample.[14]

Despite its limitations, snowball sampling was the most readily operationalized method of data collection for the purposes of this study. To collect the data required for the construction of a social network graph, snowball sampling was operationalized in the following manner:

• All of the tweets collected from the @Hamidur1988 account were processed to identify other Twitter users, who were either mentioned or were the target of a reply.

• Once identified, each Twitter user was ranked according to the frequency of their presence within the corpus of @Hamidur1988’s Twitter activity.

• The data for the top 30 users was then collected in the manner discussed in Section 2.2.

• The process of user identification, ranking, and data collection was then repeated one more time.

[13] Non-probability sampling restricts the research findings from being generalized to the whole population.
[14] A good starting point for additional information on snowball sampling can be found at https://www.fort.usgs.gov/LandsatSurvey/SnowballSampling

Once collection was completed, the final two-hop, top-30 snowball sample contained data from 1,058 Twitter accounts, for a total of 2,760,309 unique tweets.[15]
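The sampling steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used in the study: `fetch_tweets` is a hypothetical collector that returns tweet texts for a username, and mentions and reply targets are extracted with a simple regular expression on the text rather than from the structured fields that API data would provide.

```python
import re
from collections import Counter
from typing import Callable, Dict, Iterable, List, Set

MENTION = re.compile(r"@(\w+)")


def top_contacts(tweets: Iterable[str], owner: str, k: int) -> List[str]:
    """Rank accounts mentioned or replied to in `tweets` by frequency; return the top k."""
    counts = Counter(
        name.lower()
        for text in tweets
        for name in MENTION.findall(text)
        if name.lower() != owner.lower()
    )
    return [name for name, _ in counts.most_common(k)]


def snowball(seed: str, fetch_tweets: Callable[[str], List[str]],
             k: int = 30, hops: int = 2) -> Dict[str, List[str]]:
    """Collect the seed's tweets, then those of each account's top-k contacts, for `hops` rounds.

    fetch_tweets(username) is a hypothetical collector returning a list of tweet texts.
    """
    corpus: Dict[str, List[str]] = {}
    frontier: Set[str] = {seed.lower()}
    for _ in range(hops + 1):  # round 0 collects the seed account itself
        next_frontier: Set[str] = set()
        for user in frontier:
            if user in corpus:
                continue
            corpus[user] = fetch_tweets(user)
            next_frontier.update(top_contacts(corpus[user], user, k))
        frontier = next_frontier - set(corpus)  # never collect an account twice
    return corpus
```

Because accounts enter the sample only through the contacts of accounts already collected, well-connected users are more likely to be included, which is the bias noted earlier in this section.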

3.3 Seed Network Construction

Social network analysis is a process by which individuals and their interactions are captured in the form of a graph. The properties of this graph are then examined for insights which may not be easily deduced from other analytical approaches such as frequency-based and content analyses.[16]

All social networks consist of two basic components: nodes and edges (i.e. a connection between two nodes). For the purposes of this study, a node represents an individual Twitter user, while an edge is used to represent interaction between two users.

Tweets captured by the snowball sample, as described in the previous section, were processed for the purposes of constructing a social network graph. The process involved identification and extraction of interactions, according to their source (i.e. the author of the tweet) and target (a Twitter user who was either mentioned in the tweet or was being replied to).

After extracting each source-to-target interaction, the data was imported into Gephi, a popular open source software package for visualizing and analyzing large network graphs.[17] The resultant social network graph contained a total of 128,796 nodes and 217,621 edges. Of the 1,058 accounts in the sample, only 513 were labelled as sources (i.e. contained interactions with other users).
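The extraction of source-to-target interactions can be illustrated with a short sketch. It assumes the collected data is held as a mapping from each author to a list of tweet texts, uses a regular expression as a stand-in for structured mention and reply fields, and writes a Source, Target, Weight edge table, a column layout that Gephi's spreadsheet import is understood to accept. The function names are hypothetical and this is not the processing code used in the study.

```python
import csv
import io
import re
from collections import Counter
from typing import Dict, List

MENTION = re.compile(r"@(\w+)")


def extract_edges(corpus: Dict[str, List[str]]) -> Counter:
    """Count directed (source, target) interactions: author -> mentioned or replied-to user."""
    edges: Counter = Counter()
    for author, tweets in corpus.items():
        source = author.lower()
        for text in tweets:
            # A user named several times in one tweet counts as a single interaction.
            for target in {name.lower() for name in MENTION.findall(text)}:
                if target != source:
                    edges[(source, target)] += 1
    return edges


def edges_to_csv(edges: Counter) -> str:
    """Serialize the edges as a Source,Target,Weight table."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])
    return buffer.getvalue()
```

The set of source nodes is then `{source for source, _ in edges}`. Targets that were mentioned but never collected appear only as nodes with incoming edges, which is why the graph has far more nodes than the sample has accounts.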

[15] In the course of this experiment, a number of variations on the snowball sampling were examined. Among these were using only 10 of the most frequently occurring users for each account, as well as allowing for collection against 3rd degree accounts (i.e. friends of friends of friends). The top 30, two-hop approach was determined to be optimal in that it collected enough data to enable social network construction, in a reasonable amount of time, while avoiding difficulties associated with construction and analysis of exceedingly large graphs.
[16] Bartlett J. & Miller C. The State of the Art: A Literature Review of Social Media Intelligence Capabilities for Counter-Terrorism (p. 35), November 2013.
[17] For more information on Gephi and its capabilities, please visit https://gephi.org

3.4 Seed Community Detection

Once the social network graph was constructed, the next step was to use the properties of this graph to identify other accounts belonging to the Syrian FF Twitter community. One means of achieving this objective is to use the graph property known as modularity. Modularity is a measure of the underlying structure of a network or graph, indicating the degree of division of nodes within the graph into clusters or communities.[18]
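To make the measure concrete, the following sketch computes a modularity score Q for a given partition of an undirected, unweighted graph, using Q = sum over communities c of [ L_c / m - (d_c / 2m)^2 ], where m is the number of edges, L_c the number of edges inside community c, and d_c the summed degree of its nodes. It only scores a partition and does not search for the best one, and it is not the implementation used by Gephi. Treating the graph as undirected, the toy graph, and the function name are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def modularity(edges: Iterable[Tuple[Hashable, Hashable]],
               community: Dict[Hashable, int]) -> float:
    """Modularity Q of a partition of an undirected, unweighted graph without self-loops."""
    edge_list = list(edges)
    m = len(edge_list)
    if m == 0:
        raise ValueError("graph has no edges")
    internal = defaultdict(int)  # edges with both endpoints in the same community
    degree = defaultdict(int)    # summed degree of each community's nodes
    for u, v in edge_list:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph: two triangles joined by a single bridge edge (c-d).
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
lumped = {node: 0 for node in split}
print(modularity(toy_edges, split) > modularity(toy_edges, lumped))  # prints True
```

Splitting the toy graph into its two triangles scores higher than treating it as one community, which is the sense in which a high modularity indicates a meaningful division into clusters.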

Once modularity was applied to the @Hamidur1988 seed network, a number of clusters were identified. The cluster with the largest number of source nodes[19] included @Hamidur1988 himself, as well as 75 other source accounts, and consisted of a total of 4,268 nodes and 10,103 edges.[20]

Manual inspection found that the frequently shared content within this network was consistent with the material of interest to the Syrian FF community[21], including a high volume of what can be called toxic content.[22]

3.5 Validation of Seed Community Membership

Determining with a high degree of certainty that a Twitter account belongs to a Syrian FF requires an approach similar to that described in Section 2.4 of this report. Thus, the method of identification of Syrian FF Twitter accounts using modularity clustering is expected to be limited in its accuracy. Potential alternatives to the time- and resource-intensive manual verification are discussed in Section 5 of this report.

However, other approaches to verification, such as comparison of findings to those of other researchers and practitioners working on the same topic, can provide an alternative means of validation. Such an opportunity was presented in April of 2014, when the International Centre for the Study of Radicalization and Political Violence (ICSR) published a paper titled “#Greenbirds: Measuring Importance and Influence in Syrian Foreign Fighter Networks.”[23]

The #Greenbirds study was the first of a number of forthcoming research reports based on a database of verified Syrian FFs who maintain an active presence on Twitter. When the accounts mentioned in the paper were compared with those captured by the snowball sample used for this study, all ten (100%) of the #Greenbirds accounts were present in the seed network, and eight (80%) were also present in the largest seed community.
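The comparison amounts to a set-overlap calculation, sketched below. The account handles are placeholders and are not the accounts named in either study; only the proportions (ten of ten, eight of ten) mirror the figures reported above.

```python
from typing import Iterable, Set


def coverage(reference: Iterable[str], detected: Iterable[str]) -> float:
    """Fraction of reference accounts that also appear in the detected set (case-insensitive)."""
    ref: Set[str] = {name.lower() for name in reference}
    if not ref:
        raise ValueError("reference set is empty")
    found = {name.lower() for name in detected}
    return len(ref & found) / len(ref)


# Placeholder handles only; these are not the accounts named in either study.
reference = [f"ref_account_{i}" for i in range(10)]
seed_network = set(reference) | {"other_1", "other_2"}
seed_community = set(reference[:8]) | {"other_1"}
print(coverage(reference, seed_network), coverage(reference, seed_community))
```

Coverage of this kind shows how many independently verified accounts the method recovered. It says nothing about how many of the remaining accounts in the community are not foreign fighters, so it complements rather than replaces manual verification.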

18 Newman, M. E. J. and Girvan, M. "Finding and evaluating community structure in networks." Phys. Rev. E 69, no. 2 (2004): 026113.
19 Source nodes, as opposed to target nodes, represent Twitter accounts which contained tweets that interacted with other Twitter users (i.e. have outgoing edges).
20 In the course of experimentation, data collection and modularity clustering were attempted a total of six times. The findings presented herein describe the results obtained during the sixth and final attempt. Some of the failures encountered during earlier attempts can be attributed to technical difficulties related to the development of the script used to perform the collection. Please contact the SecDev Group for more information.
21 Supra note 10
22 Please see Section 4.0 for the definition of what constitutes “toxic content” for the purposes of this study.
23 The full text of the study can be accessed at http://icsr.info/wp-content/uploads/2014/04/ICSR-Report-Greenbirds-Measuring-Importance-and-Infleunce-in-Syrian-Foreign-Fighter-Networks.pdf

 


4.0 Toxic Content Analysis

Underlying this case study is the premise that members of radicalized communities (such as Syrian FFs) both consume and disseminate videos, pictures, and articles that seek to validate and legitimize their beliefs and activities. To the extent that such content can be a means of recruitment and influence, developing a method for identification and monitoring of such content, and understanding the patterns of its consumption, can offer valuable insights to PVE researchers and practitioners.

In the final phase of this case study we set out to explore the viability of using the Syrian FF online community to identify widely circulated toxic content, and whether it would be possible to estimate the geography of its distribution.

4.1 What is Toxic Content?

To define what constitutes “toxic content” for the purposes of this case study, we relied on the findings of the social science literature review24 and the longitudinal, hand-coded content analysis of the @Hamidur1988 Twitter feed,25 which is another component of SecDev’s Kanishka research.

Based  on  the  outcomes  of  our  research,  three  “toxicity”  criteria  were  used  to  assess  content  identified  via  communications  sampled  from  the  Syrian  FF  Twitter  community.    For  a  piece  of  content  (i.e.  article,  imagery,  or  video)  to  be  considered  toxic  it  needs  to  conclusively  address  the  following  themes:  

a. Alienation of Muslims from their home countries;
b. Promotion of grievances between Muslims and non-Muslims; and
c. Calls for violence.

In addition, for the purposes of this case study the content had to be in English, so as to be deemed accessible to non-Arabic speakers residing in Western countries.

4.2 Geospatial Analysis of Toxic Content Consumption

This section describes the methodology that was developed to identify instances of widely circulated toxic content, and the geographic distribution of individuals engaged in its consumption and dissemination.

Indexing and Ranking

The first step of this process involved indexing and ranking all of the content contained within the Syrian FF online community sample. To ensure that each piece of content (represented by a URL) was properly counted, a Python script was written to expand any shortened URLs into their full form.26
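The script itself is not reproduced in this report. A minimal sketch of the idea, assuming that a shortened link can be resolved by following its HTTP redirects, might look as follows. The names `resolve_via_http` and `count_expanded_urls` are illustrative and are not taken from the original script.

```python
import urllib.request
from collections import Counter


def resolve_via_http(url, timeout=10):
    """Follow redirects with a HEAD request and return the final URL."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.geturl()


def count_expanded_urls(urls, resolve=resolve_via_http):
    """Count how often each piece of content was shared, keyed by expanded URL."""
    cache = {}          # shortened URL -> expanded URL, so each link is resolved once
    counts = Counter()
    for url in urls:
        if url not in cache:
            try:
                cache[url] = resolve(url)
            except OSError:
                # Dead or unreachable link (assumption: count it as posted).
                cache[url] = url
        counts[cache[url]] += 1
    return counts
```

Passing the resolver as a parameter allows the counting logic to be exercised offline with a lookup table in place of live requests. Some shortening services do not answer HEAD requests, in which case a GET-based resolver could be substituted.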

24 Supra note 2
25 Supra note 10
26 Links to external content posted on Twitter are often shortened using Twitter’s own http://t.co facility, or via third-party services such as http://bit.ly. Expanding these shortened URLs is important because while the full form URL to a given piece of content will always be the same, there can be multiple shortened URLs pointing to the same resource.

 


Once  expanded,  the  URLs  were  assessed  against  two  criteria:  frequency  of  sharing  within  the  Syrian  FF  online  community,  and  frequency  of  sharing  across  Twitter  at  large.27    To  assist  in  ranking  the  overall  prominence  of  a  given  URL,  a  balanced  metric  was  constructed  to  provide  a  single  score  using  both  factors.28  
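The balanced prominence score defined in footnote 28 can be computed directly. The sketch below follows that formula. The treatment of URLs with fewer than one re-tweet, for which the logarithm is undefined, is an assumption added here and is not specified in the report. The function names are illustrative.

```python
import math


def balanced_prominence(community_shares, retweets):
    """B = sqrt(C^2 + (log10 R)^2), with C community shares and R re-tweets."""
    # Assumption: with fewer than one re-tweet the Twitter-wide term contributes 0.
    log_r = math.log10(retweets) if retweets >= 1 else 0.0
    return math.sqrt(community_shares ** 2 + log_r ** 2)


def rank_urls(stats):
    """Sort URLs from most to least prominent.

    stats -- dict mapping url -> (community_shares, retweets)
    """
    return sorted(stats, key=lambda url: balanced_prominence(*stats[url]), reverse=True)
```

Taking the logarithm of R compresses the much larger Twitter-wide count so that it does not swamp the within-community count. For example, C = 3 and R = 10,000 give a score of about 5.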

Starting with the most prominent URL, each piece of content was manually assessed against the “toxicity” criteria defined in Section 4.1 of this report. Four of the high-ranking URLs were found to lead to content that was no longer available. In addition, two more URLs were found to be linked to unrelated material (appeals for locating missing persons). Finally, the URL with the eighth-highest rank, an ISIS video titled “Establishment of the Islamic State Part 8 - Shaykh Abu Yahya Al-Libi”, was found to satisfy the toxic content criteria (see Figure 2).29

Capturing Historic Twitter Activity Involving an Expanded URL

The URL of the YouTube video selected for this experiment was then used to identify Twitter users who had taken part in its distribution, and as such could be assumed to constitute an engaged audience actively consuming or promoting such content.

27 The frequency of a given URL’s appearance across Twitter at large can be estimated using the “re-tweets” metric provided by Twitter. Usage of this metric ought to be treated with caution, as botnets and other forms of automated posting can artificially inflate the prominence of a tweet, making it appear to be more popular than it actually is.
28 Given the fact that the Syrian FF community is much smaller than the overall number of users on Twitter, the number of re-tweets (R) was typically far larger than the number of times a piece of content was shared within the community (C). As such, the balanced prominence score (B) was derived as follows: $B = \sqrt{C^{2} + (\log_{10} R)^{2}}$
29 The video can be viewed at http://youtu.be/qtPHw0lh-VA

Figure 2 - A screenshot from an ISIS YouTube video selected for collection

 


At  this  point  we  were  faced  with  two  data  collection  issues:  

• The  first  had  to  do  with  the  common  usage  of  shortened  URLs.    Given  that  the  expanded  URL  we  intended  to  search  for  was  obtained  after  the  data  was  collected  from  Twitter,  it  could  not  be  used  as  a  reliable  search  criterion.30  

• The second issue involved the limitations of the Twitter public API. Since we intended to collect all tweets that included the URL posted within a certain date range, our requirement fell well outside of the capabilities made available by the public API.31

To address these problems, a decision was made to perform data collection for this phase of the study using DataSift, a commercial social media aggregation service.32 By using DataSift to conduct the search we were able to overcome both of the issues mentioned above. First, DataSift provides a number of augmentations to the raw data provided by the social media platforms themselves, including the ability to capture and search on expanded URLs. Second, DataSift offers access to the entire Twitter archive, going back more than 3 years.

Data Collection

The search window for data extraction was set to run from one day before the targeted YouTube video was posted to 14 days after. Once completed, the DataSift search yielded a total of 1,668 tweets.

Of the tweets that were captured by the search, 753 were found to have been posted by the same Twitter account (@ابو ذر الجزراوي), most likely operated by someone using software for automated content promotion. The remaining 915 tweets appeared to originate from Twitter accounts that did not engage in mass broadcasts.33
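Separating a mass-broadcasting account from the rest of the sample can be approximated by counting tweets per author. The sketch below is illustrative only. The 10% share threshold and the `author` field name are assumptions introduced here, and the report does not state that a fixed threshold was used.

```python
from collections import Counter


def split_bulk_posters(tweets, max_share=0.1):
    """Return (bulk_authors, remaining_tweets).

    tweets    -- list of dicts, each with an "author" key (hypothetical field name)
    max_share -- assumed cut-off: an author posting more than this share of all
                 collected tweets is treated as a mass broadcaster
    """
    per_author = Counter(t["author"] for t in tweets)
    total = len(tweets)
    bulk = {a for a, n in per_author.items() if total and n / total > max_share}
    remaining = [t for t in tweets if t["author"] not in bulk]
    return bulk, remaining


# Synthetic data mirroring the counts reported above: 753 tweets from one
# account and 915 tweets from accounts that each posted once.
sample = [{"author": "bulk_account"}] * 753
sample += [{"author": f"user_{i}"} for i in range(915)]
bulk_authors, remaining_tweets = split_bulk_posters(sample)
```

As footnote 33 cautions, a per-account count of this kind would not detect a botnet that spreads its activity across many accounts.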

Analysis of Geospatial Metadata

The geospatial data embedded in the social media interactions was found to be of low quality and insufficient for practical applications. Namely, while some of the Twitter users who had forwarded the toxic content URL did have geo-tagging enabled, no geo-tags were applied to any of the re-tweets. Similarly, some of the tweets contained user-specified information concerning their authors’ country of origin, but given the low incidence and reliability of that data it too was deemed insufficient.
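A simple coverage count makes this kind of assessment repeatable. The sketch below assumes each tweet is a dictionary with a `coordinates` entry and a nested `user` entry holding a free-text `location`, loosely modelled on Twitter's tweet JSON. The records delivered by DataSift may be structured differently, so the field names should be treated as placeholders.

```python
def geo_coverage(tweets):
    """Summarise how many tweets carry usable location metadata."""
    geo_tagged = 0
    profile_location = 0
    for tweet in tweets:
        # Explicit geo-tag attached to the tweet itself.
        if tweet.get("coordinates"):
            geo_tagged += 1
        # Free-text, user-supplied location; present but not necessarily reliable.
        user = tweet.get("user") or {}
        if (user.get("location") or "").strip():
            profile_location += 1
    return {
        "total": len(tweets),
        "geo_tagged": geo_tagged,
        "profile_location": profile_location,
    }
```

The two counts are reported separately because a geo-tag and a self-declared profile location differ considerably in reliability.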

In all, while the method for identification of toxic content and monitoring of its consumption presented above showed some promise, a number of issues were encountered that prevented an unqualified successful demonstration. For further discussion, please see the Conclusions section of this report.

30 While searching for the full URL may have resulted in collection of some instances of its dissemination, none of the shortened URLs would be picked up by the search.
31 Twitter’s public API provides limited access to approximately 1% of the entire Twitter activity, going back approximately 5 days.
32 More information on DataSift and the services it provides can be found at http://www.datasift.com
33 However, it is still possible that some of the tweets collected were posted or re-tweeted via a more sophisticated botnet.

 


5.0 Conclusions

In this section we provide an overview of the research findings, discuss the methods and techniques employed in the course of this case study, and propose directions for future research on this topic.

5.1 Summary of Research Findings

Based on the outcomes of the experiments presented in this report we can confidently state the following:

• By  applying  specific  validation  criteria,  it  is  possible  to  conclude  with  a  high  degree  of  confidence  that  an  individual  behind  a  social  media  persona  is  a  member  of  a  radicalized  community  of  violent  extremists.  

• Using  a  verified  source  as  a  starting  point,  snowball  sampling  and  modularity  clustering  provide  an  effective  means  of  enumerating  an  online  community  of  violent  extremists  and  individuals  who  share  their  beliefs  and  aspirations.  

• Monitoring  the  interactions  of  an  online  community  whose  members  represent  an  extremist  ideology  can  be  an  effective  means  of  identifying  toxic  content  and  analysing  its  distribution.  

5.2   Discussion  of  Methods  and  Techniques  

Seed Account Identification and Verification

Although we ultimately relied on information provided by an external source, the Twitter account that was selected for verification could well have been located using open sources.34 What is perhaps more striking is the degree of confidence that can be achieved in verifying an individual’s place of residence, and other activities such as international travel, through the use of public social media activity.

Whether  application  of  such  research  methods  in  countries  with  strict  privacy  regimes,  such  as  Canada  or  Germany,  would  be  deemed  acceptable,  remains  an  unanswered  question.  

Seed Network Construction and Community Enumeration

The application of snowball sampling and modularity clustering was demonstrated to be an effective alternative to a purely manual construction of a social network. We were further encouraged by the overlap in membership of the Syrian FF community identified by this study and the ICSR Syrian FF database.

To  decrease  the  likelihood  of  false  positives,  the  community  enumeration  method  employed  by  our  study  would  benefit  from  the  addition  of  a  verification  stage.    This  could  be  done  either  manually  or  by  employing  a  version  of  the  verification  criteria,  operationalized  for  a  machine  learning  classifier.35    

34 Supra note 11, as an example of a lead that would have connected the investigation to @Hamidur1988
35 The SecDev Group’s in-depth analysis of @Hamidur1988’s Twitter stream could potentially be used to facilitate such a system. Supra note 10.

 


Toxic Content Identification and Monitoring

Although we were unsuccessful in obtaining reliable geospatial metrics of the distribution of individuals consuming and disseminating toxic content, it is clear that a number of steps could be taken in order to improve the technique.

Using  multiple  sources  of  toxic  content  would  likely  increase  the  number  of  sampled  accounts,  potentially  increasing  the  chances  of  collecting  sufficient  geospatial  metadata.    Similarly,  collecting  additional  tweets  from  each  sampled  account  would  increase  the  chances  of  obtaining  geospatial  metadata  or  mentions  of  geographic  landmarks.  

Finally, if the information of interest is ultimately aggregate in nature, data collection can be conducted in a manner better able to avoid collection of potentially sensitive private details. One example would be the DataSift demographics enhancement,36 which provides enhanced geospatial and demographic information while anonymizing the rest of the metadata to improve privacy protection.

5.3 Recommendations for Future Research

All of the methods and techniques employed in the course of this case study would benefit from external validation. More specifically, it would be instructive to apply the same sequence of seed account identification and validation, followed by seed network construction, in the context of a different group of violent extremists.

It  is  also  worth  noting  that  this  study  involved  a  considerable  amount  of  manual  content  analysis.  This  suggests  that  applications  of  machine  learning  to  this  area  could  significantly  enhance  our  technological  capacity  for  detection  of  weak  signals  of  radicalization.  

A working hypothesis is that community-based or community-facing PVE practitioners would be very interested in having fine-grained and real-time data-feeds on trending toxic content within foreign fighter online networks, and may also be interested in general geo-located information on content consumers.37 Knowledge of trending, toxic content – like a specific VE video – provides a data-point for engagement with community members, to raise awareness, promote dialogue and thereby stimulate community protective factors. We recommend this as a discussion worth having.

Finally,  development  of  privacy  and  ethics  protocols  for  conducting  social  media  analytics  for  PVE  should  be  considered.    For  one,  it  would  provide  the  necessary  guidance  to  researchers  and  practitioners.    But  perhaps  more  importantly,  it  would  contribute  to  the  development  of  normative  standards  concerning  acceptable  use  of  open  source  social  media.38  

 


36 A Twitter augmentation data stream provided by Demographics Pro (http://www.demographicspro.com/)
37 For example, 10 Twitter users in your geo-located catchment area are actively consuming this content.
38 For a more substantive discussion concerning the issues of legality, privacy, and ethics of conducting open source social media research for PVE see SecDev’s Kanishka Research Summary Report.