violence det ijcnlp13-slideshare

A Weakly Supervised Bayesian Model for Violence Detection in Social Media

Elizabeth Cano*, Yulan He*, Kang Liu+, Jun Zhao+

*School of Engineering and Applied Science, Aston University, UK
+Institute of Automation, Chinese Academy of Sciences, China


2

Outline

o Introduction

o Research Challenges

o Violence Detection Model

o Deriving word priors

o Experiments


3

Introduction


4

Introduction


5

Introduction

Objectives

Identification of suspicious tweets

Violence-related Topic detection

Extraction of violent and criminal events appearing in social media



6

Introduction: Violence-related content analysis

Violence-related content
• Characterised by the use of terms expressing aggression and attitudes towards violence

Violence-related content analysis
• Identifying violence polarity in a piece of text (violence-related or non-violence-related)
• Involves the detection of particular types of sentiments, not necessarily negative (e.g. anger, shame, excitement)


7

Introduction: Characterising violence-related tweets

Challenges

• Restricted number of characters

• Irregular and ill-formed words

• Wide variety of language

• Evolving jargon (e.g. slang and teenage lingo)

• Event-dependent vocabulary characterising violence-related content

• Volatile jargon relevant to particular events. While sentiment and affect lexicons rarely change over time, words relevant to violence tend to be event-dependent.

E.g. “fire” and “flame” are negative during the UK riots of 2011, but appear to be positive during the London Olympics 2012.

E.g. “#Jan25” is violence-related during the Egyptian revolution.


8

Related Work: Violence-related classification in Social Media

Topic classification of short texts
• Standard supervised machine learning methods [Milne-et-al 2008][Gabrilovich-et-al 2006][Munoz-et-al 2011][Meij-et-al 2012]
• Alleviate micropost sparsity by making use of external knowledge sources (e.g. DBpedia) [Michelson-et-al 2010][Cano-et-al 2013]

Weakly supervised approaches
• JST model [Lin&He 2009][Lin&He2012]
• Partially-Labeled LDA (PLDA) [Ramage et al., 2011]


9

Rely on supervised classification techniques or do not cater for the violence detection challenges.

Do not discover topics associated with a document category.



10



Since violence-related events tend to occur during short to medium life-spans, methods relying only on labeled data can rapidly become outdated.


11

Violence-related classification in Social Media: Challenges

How to characterise violence-polarity?

How to build a model to discriminate across documents to identify violence-related content?

How to provide overall information to understand the type of violence-related events?


12

Violence Detection Model (VDM): Problem Formulation and Proposed Method


13

Accessing Topics via Word Distributions

o Novel Bayesian modelling approach for:
• Identifying violent content in social media
• No need for labelled data
• Inspired by previous work on sentiment analysis, in particular the JST model [Lin&He 2009][Lin&He2012]

o Use of knowledge sources (e.g. DBpedia)
• Priors derivation strategies


14



15


Each Tweet can involve multiple topics

Topics


16


Each tweet also involves words with different violence polarities

Violence Polarity

Casting these intuitions into a generative probabilistic process [Blei-et-al 2003]

- Each document is a random mixture of corpus-wide topics
- Each word is drawn from one of those topics


17


[Figure: violence polarity (violence-related vs. non-violence-related) assigned to the text and to the document.]


18

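Violence Detection Model (VDM)

[Figure: plate diagram of the VDM over D documents with N_d words each. Labelled elements: word, word topic, violence label (vioLabel), violence probability, violence label/topic probability, violence label/topic language model.]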


19


• Choose ω ∼ Beta(ε), φ^0 ∼ Dir(β^0), φ ∼ Dir(β).

• For each category (violent or non-violent) c

For each topic z under the document category c

o Choose θ_{cz} ∼ Dir(α)

• For each doc m

Choose π_m ∼ Dir(γ)

For each word w_{m,n} in doc m

o Choose x_{m,n} ∼ Mult(ω);

o If x_{m,n} = 0, choose a word w_{m,n} ∼ Mult(φ^0);

o If x_{m,n} = 1, choose a tweet category label c_{m,n} ∼ Mult(π_m), choose a topic z_{m,n} ∼ Mult(θ_{c_{m,n}}), choose a word w_{m,n} ∼ Mult(φ_{c_{m,n}, z_{m,n}}).
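To make the generative story concrete, here is a minimal Python sketch that samples a toy corpus from it. It is an illustration under assumptions, not the authors' implementation: the function name `sample_vdm_corpus`, the symmetric hyperparameter values, the reading of ω as the probability that x = 1, one topic distribution θ per category, and one word distribution φ per (category, topic) pair are all choices made here for the example.

```python
import numpy as np


def sample_vdm_corpus(doc_lengths, vocab_size, n_topics, n_categories=2,
                      alpha=1.0, beta=0.5, beta0=0.5, gamma=1.0,
                      epsilon=(1.0, 1.0), seed=0):
    """Sample (word, category, topic) triples following the slide's process.

    All hyperparameter defaults are illustrative assumptions.
    Background words are returned with category = topic = -1.
    """
    rng = np.random.default_rng(seed)

    # Switch probability: assumed here to be P(x = 1), drawn from Beta(epsilon).
    omega = rng.beta(*epsilon)
    # Background word distribution phi^0 ~ Dir(beta^0).
    phi0 = rng.dirichlet(np.full(vocab_size, beta0))
    # Category/topic word distributions, shape (C, T, V), each ~ Dir(beta).
    phi = rng.dirichlet(np.full(vocab_size, beta),
                        size=(n_categories, n_topics))
    # Topic distribution per category, shared by all documents, ~ Dir(alpha).
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_categories)

    corpus = []
    for length in doc_lengths:
        # Per-document category distribution pi_m ~ Dir(gamma).
        pi = rng.dirichlet(np.full(n_categories, gamma))
        doc = []
        for _ in range(length):
            if rng.random() >= omega:
                # x = 0: the word comes from the background model.
                c, z = -1, -1
                w = rng.choice(vocab_size, p=phi0)
            else:
                # x = 1: pick a category, then a topic, then a word.
                c = rng.choice(n_categories, p=pi)
                z = rng.choice(n_topics, p=theta[c])
                w = rng.choice(vocab_size, p=phi[c, z])
            doc.append((int(w), int(c), int(z)))
        corpus.append(doc)
    return corpus
```

Calling `sample_vdm_corpus([5, 8], vocab_size=20, n_topics=3)` would return two documents of 5 and 8 sampled tokens. Inference in a model of this kind runs the process in reverse, estimating the latent labels from observed words; that step is not sketched here.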


20


• Single document category-topic distribution shared across all the documents.

• Assumes words are generated either from a category-specific topic distribution or from a general background model.


21

Deriving Word Priors


22

Violence Lexicon

• Violence Lexicon Preparation
• DBpedia articles from violent related topics
• Twitter Data for Jan-Dec 2010 (10% Twitter Firehose)

Violence-related: fight, war, protest, riots, conflict, bomb, trouble, fear

Non-Violence-related: twilight, sandwich, award, moon, record, common, excited, great


23

Deriving Priors Using DBpedia Categories

• Structured Semantic Web representation of data derived from Wikipedia
• Maintained by thousands of editors
• Evolves and adapts as knowledge changes [Syed et al, 2008]
• Covers a broad range of topics
• Characterises topics with a large number of resources

             DBpedia*       Yago2          Freebase
Resources    2.35 million   447 million    3.6 million
Classes      359            562,312        1,450
Properties   1,820          253,213,842    7,000


24

Deriving Priors Using DBpedia Categories

[Figure: DBpedia categories related to violence, including Violence, Terrorism, War, Revolutionary Terror, Military Operations, Guerrilla Warfare, …]


25

Obtaining Priors from Tweets

1 million Tweets annotated with OpenCalais derived topics including:

• Business & Finance
• Disaster & Accident
• Education
• Entertainment & Culture
• Environment
• Health & Medical
• Hospitality & Recreation
• Labor
• Law & Crime
• Politics
• Religion & Belief
• Social Issues
• Sports
• Technology & Internet
• War & Conflict (8,338 tweets)


26

Datasets for Priors

                      Tweets (TW)   DBpedia (DB)   DBpedia chunked (DCH)
Violent-related       10,432        4,082          32,174
Non violent-related   11,411        11,411         11,411

• Use OpenCalais to annotate tweets
• Extracted tweets labelled as “War & Conflict” and considered them as violence-related annotations
• OpenCalais has a low F-measure of 38% when evaluated on our manually annotated test set

• DBpedia abstracts have longer sentences than tweets
• Generated tweet-size documents by chunking the abstracts into 9 or fewer words
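The chunking step can be sketched in a few lines of Python. The slide only states the 9-word limit, so whitespace tokenisation, consecutive non-overlapping chunks, and the helper name `chunk_abstract` are assumptions made for this illustration.

```python
def chunk_abstract(text, max_words=9):
    """Split an abstract into consecutive chunks of at most `max_words` words.

    Assumption: words are separated by whitespace and chunks do not overlap.
    """
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]
```

Each chunk can then be treated as one tweet-size document when deriving priors, which would be consistent with the DCH column holding many more violent-related documents than the DB column.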


27

• Corpus Word Entropy captures the dispersion of the usage of word w in the corpus SD

• Class Word Entropy characterises the usage of a word in a particular document class

• Relative Word Entropy provides information on the relative importance of that word to a given document class

Relative Word Entropy
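The formulas for these three quantities did not survive in the slide text. The Python sketch below shows one possible formalisation, purely as an assumption and not necessarily the authors' definition: corpus word entropy is taken as the entropy of a word's occurrences across all documents, class word entropy as the same entropy restricted to the documents of one class, and relative word entropy as their ratio. The names `entropy` and `word_entropies` are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (natural log) of the distribution given by raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def word_entropies(docs, labels, word):
    """Return (corpus entropy, class entropies, relative entropies) for `word`.

    Assumed definitions, for illustration only:
      corpus entropy   = entropy of the word's counts over all documents
      class entropy    = entropy of the word's counts over one class's documents
      relative entropy = class entropy / corpus entropy
    """
    per_doc = [Counter(doc)[word] for doc in docs]
    corpus_h = entropy(per_doc)

    class_h = {}
    for label in set(labels):
        in_class = [n for n, l in zip(per_doc, labels) if l == label]
        class_h[label] = entropy(in_class)

    relative = {
        label: (h / corpus_h if corpus_h > 0 else 0.0)
        for label, h in class_h.items()
    }
    return corpus_h, class_h, relative
```

Under these assumed definitions, a word spread evenly over many documents of one class and rare in the other gets a high relative value for the first class, which is the kind of signal a word prior needs.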


DBpedia-Chunked Priors

DBpedia-derived Priors Tweets-derived Priors

Violent NotViolent Violent NotViolent Violent NotViolent

group customer group gop rebel ey

alleg win power lov destro nnw

armour diff suffer back sectar vot

resid good soc good anti soc

cult sen palest twees mortat aid

separat eat knif interest amnest job

influ surve rebel right drug good

democr afford campaign answer fighter congrat

28

Word Priors Obtained using RWE


29

Experiments


30

Datasets for Experiments

                       Training set   Testing set
Violence-related       10,581         759
Non violence-related                  1,000

• TREC Microblog 2011 corpus
• Comprises over 16 million tweets sampled over a two week period (January 23rd to February 8th, 2011)
• Includes 49 different events
• Violence-related ones such as Egyptian revolution, and Moscow airport bombing
• Non-violence related such as the Super Bowl seating fiasco


31

• Learned from labelled features

• Word priors are used as labelled feature constraints

• Train MaxEnt classifier with Generalized Expectation (GE) [Druck et al., 2008] or Posterior Regularization (PR) [Ganchev et al., 2010]

• Joint Sentiment-Topic (JST) model [Lin&He 2009][Lin&He2012]

• Set the number of sentiment classes to 2 (violent or non-violent)

• Partially-Labeled LDA (PLDA) [Ramage et al., 2011]

• Assume that some document labels are observed and model per-label latent topics

• Supervised information is incorporated at the document level rather than at the word level

• The training set is labelled as violent or non-violent using OpenCalais

Baselines


32

• ME-GE and ME-PR perform poorly

• Best result obtained using VDM with word priors derived from TW using RWE

• Source data for deriving word priors

• DB does not improve over TW

• DCH boosts F-measure in JST and is close to TW for VDM

• RWE consistently outperforms IG for both JST and VDM

Violence Classification Results


33

Varying Number of Topics


34

Violence-related topics Non violence-related topics

Topic Coherence Evaluation


35

Topic 1 Topic 2 Topic 3 Topic 4

egypt middle internet crash

tahrir east egypt kill

cair give phone moscow

strees power block bomb

police idea word airport

protester government service tweets

square spread government injure

arm uprise shut arrest

report fall facebook dead

Topic 1: Protest in Tahrir Square

Topic 2: Middle East uprise

Topic 3: Government shut down Facebook

Topic 4: Moscow Airport bombing

Example Violence-Related Topics


36

Questions?

Yulan He, Kang Liu, Jun Zhao, Elizabeth Cano

Slides available at http://www.slideshare.net/ampaeli
