violence det ijcnlp13-slideshare

A Weakly Supervised Bayesian Model for Violence Detection in Social Media
Elizabeth Cano*, Yulan He*, Kang Liu+, Jun Zhao+
*School of Engineering and Applied Science, Aston University, UK
+Institute of Automation, Chinese Academy of Sciences, China

Upload: amparo-elizabeth-cano

Post on 18-Dec-2014


Category:

Technology



DESCRIPTION

Presentation for the paper entitled: "A Weakly Supervised Bayesian Model for Violence Detection in Social Media" presented at the IJCNLP 2013

TRANSCRIPT

Page 1: Violence det ijcnlp13-slideshare

A Weakly Supervised Bayesian Model for Violence Detection in Social Media

Elizabeth Cano*, Yulan He*, Kang Liu+, Jun Zhao+

*School of Engineering and Applied Science, Aston University, UK
+Institute of Automation, Chinese Academy of Sciences, China

Page 2: Violence det ijcnlp13-slideshare

Outline

o Introduction

o Research Challenges

o Violence Detection Model

o Deriving word priors

o Experiments

Page 3: Violence det ijcnlp13-slideshare


Introduction

Page 4: Violence det ijcnlp13-slideshare


Introduction

Page 5: Violence det ijcnlp13-slideshare

Introduction: Objectives

• Identification of suspicious tweets

• Violence-related topic detection

• Extraction of violent and criminal events appearing in social media

Page 6: Violence det ijcnlp13-slideshare

Introduction: Violence-related content analysis

Violence-related content
• Characterised by the use of terms expressing aggression and attitudes towards violence

Violence-related content analysis
• Identifying the violence polarity of a piece of text (violence-related or non-violence-related)
• Involves the detection of particular types of sentiment that are not necessarily negative (e.g. anger, shame, excitement)

Page 7: Violence det ijcnlp13-slideshare

Introduction: Characterising violence-related tweets

Challenges

• Restricted number of characters

• Irregular and ill-formed words

• Wide variety of language

• Evolving jargon (e.g. slang and teenage lingo)

• Event-dependent vocabulary characterising violence-related content
  o Volatile jargon relevant to particular events. While sentiment and affect lexicons rarely change over time, words relevant to violence tend to be event-dependent.
  o E.g. “fire” and “flame” are negative during the UK riots of 2011, but appear to be positive during the London Olympics 2012.
  o E.g. “#Jan25” is violence-related during the Egyptian revolution.

Page 8: Violence det ijcnlp13-slideshare

Related Work: Violence-related classification in Social Media

Topic classification of short texts
• Standard supervised machine learning methods [Milne-et-al 2008][Gabrilovich-et-al 2006][Munoz-et-al 2011][Meij-et-al 2012]
• Alleviate micropost sparsity by making use of external knowledge sources (e.g. DBpedia) [Michelson-et-al 2010][Cano-et-al 2013]

Weakly supervised approaches
• JST model [Lin&He 2009][Lin&He2012]
• Partially-Labeled LDA (PLDA) [Ramage et al., 2011]

Page 9: Violence det ijcnlp13-slideshare

Related Work: Violence-related classification in Social Media

• These approaches rely on supervised classification techniques or do not cater for the violence detection challenges.

• They do not discover topics with an associated document category.

Page 10: Violence det ijcnlp13-slideshare

Related Work: Violence-related classification in Social Media

Topic classification of short texts
• Standard supervised machine learning methods [Milne-et-al 2008][Gabrilovich-et-al 2006][Munoz-et-al 2011][Meij-et-al 2012]
• Alleviate micropost sparsity by making use of external knowledge sources (e.g. DBpedia) [Michelson-et-al 2010][Cano-et-al 2013]

• These approaches rely on supervised classification techniques or do not cater for the violence detection challenges.

• They do not discover topics with an associated document category.

Since violence-related events tend to have short to medium life-spans, methods relying only on labeled data can rapidly become outdated.

Page 11: Violence det ijcnlp13-slideshare

Violence-related classification in Social Media: Challenges

• How to characterise violence polarity?

• How to build a model to discriminate across documents to identify violence-related content?

• How to provide overall information to understand the type of violence-related events?

Page 12: Violence det ijcnlp13-slideshare

Violence Detection Model (VDM): Problem Formulation and Proposed Method

Page 13: Violence det ijcnlp13-slideshare

Accessing Topics via Word Distributions

o Novel Bayesian modelling approach for:
  • Identifying violent content in social media
  • No need for labelled data
  • Inspired by previous work on sentiment analysis, in particular the JST model [Lin&He 2009][Lin&He2012]

o Use of knowledge sources (e.g. DBpedia)
  • Prior derivation strategies

Page 14: Violence det ijcnlp13-slideshare


Accessing Topics via Word Distributions

Page 15: Violence det ijcnlp13-slideshare


Accessing Topics via Word Distributions

Each Tweet can involve multiple topics

Topics

Page 16: Violence det ijcnlp13-slideshare

Accessing Topics via Word Distributions

Each tweet also involves words with different violence polarity

Violence Polarity

Casting these intuitions into a generative probabilistic process [Blei-et-al 2003]:

- Each document is a random mixture of corpus-wide topics
- Each word is drawn from one of those topics
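
The two intuitions above (a document as a mixture of corpus-wide topics, each word drawn from one topic) can be sketched in a few lines of Python. This is only an illustration in the spirit of the cited topic-model formulation: the toy topics, the symmetric Dirichlet concentration `alpha` and the function names are assumptions, not taken from the slides.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw a symmetric Dirichlet sample by normalising Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, length, rng, alpha=1.0):
    """Generate one document: pick a topic mixture, then one topic per word."""
    theta = sample_dirichlet(alpha, len(topics), rng)  # document-level topic mixture
    words = []
    for _ in range(length):
        # Each word first picks one of the corpus-wide topics ...
        z = rng.choices(range(len(topics)), weights=theta, k=1)[0]
        # ... and is then drawn from that topic's word distribution.
        vocab = list(topics[z])
        weights = [topics[z][w] for w in vocab]
        words.append(rng.choices(vocab, weights=weights, k=1)[0])
    return theta, words

# Hypothetical corpus-wide topics (word -> probability), for illustration only.
TOY_TOPICS = [
    {"riot": 0.5, "police": 0.3, "fire": 0.2},
    {"olympics": 0.5, "medal": 0.3, "flame": 0.2},
]

theta, words = generate_document(TOY_TOPICS, length=6, rng=random.Random(0))
```

Here `theta` is the per-document topic mixture and `words` is the generated bag of words; inference in a real topic model runs this story in reverse, from observed words to topics.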

Page 17: Violence det ijcnlp13-slideshare

Accessing Topics via Word Distributions

[Figure: diagram relating Text, Violence polarity and Document; polarity values are violence-related and non-violence-related]

Page 18: Violence det ijcnlp13-slideshare

Violence Detection Model (VDM)

[Figure: plate diagram of the VDM with plates D and Nd; node labels: word, word topic, violence label (vioLabel), violence probability, violence label/topic language model, violence label/topic probability]

Page 19: Violence det ijcnlp13-slideshare

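Violence Detection Model (VDM)

• Choose $\omega \sim \mathrm{Beta}(\varepsilon)$, $\varphi^{0} \sim \mathrm{Dir}(\beta^{0})$, $\varphi \sim \mathrm{Dir}(\beta)$.

• For each category (violent or non-violent) $c$
  o For each topic $z$ under the document category $c$
    - Choose $\theta_{cz} \sim \mathrm{Dir}(\alpha)$

• For each doc $m$
  o Choose $\pi_{m} \sim \mathrm{Dir}(\gamma)$
  o For each word position $n$ in doc $m$
    - Choose $x_{m,n} \sim \mathrm{Mult}(\omega)$
    - If $x_{m,n} = 0$, choose a word $w_{m,n} \sim \mathrm{Mult}(\varphi^{0})$
    - If $x_{m,n} = 1$, choose a tweet category label $c_{m,n} \sim \mathrm{Mult}(\pi_{m})$, choose a topic $z_{m,n} \sim \mathrm{Mult}(\theta_{c_{m,n}})$, and choose a word $w_{m,n} \sim \mathrm{Mult}(\varphi_{c_{m,n}, z_{m,n}})$.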
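
A minimal Python sketch of the generative story on this slide, using only the standard library. Several choices are assumptions not stated on the slide: ω is read as the probability that x = 1, ε is treated as a pair of Beta parameters, θ is indexed by category only (as in the Mult(θ_c) draw), all Dirichlet priors are symmetric, and the hyperparameter values, vocabulary size and function names are made up for illustration.

```python
import random

def dirichlet(concentration, size, rng):
    """Symmetric Dirichlet sample obtained by normalising Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def categorical(probs, rng):
    """Draw an index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(n_docs=3, doc_len=8, vocab_size=20, n_topics=2, seed=0,
                    alpha=0.5, beta=0.5, beta0=0.5, gamma=1.0, eps=(1.0, 1.0)):
    rng = random.Random(seed)
    n_cats = 2  # violent / non-violent
    omega = rng.betavariate(*eps)             # assumed: P(x = 1)
    phi0 = dirichlet(beta0, vocab_size, rng)  # background word distribution
    # One topic distribution per category, shared across all documents.
    theta = [dirichlet(alpha, n_topics, rng) for _ in range(n_cats)]
    # One word distribution per (category, topic) pair.
    phi = [[dirichlet(beta, vocab_size, rng) for _ in range(n_topics)]
           for _ in range(n_cats)]
    corpus = []
    for _ in range(n_docs):
        pi = dirichlet(gamma, n_cats, rng)    # per-document category proportions
        doc = []
        for _ in range(doc_len):
            x = 1 if rng.random() < omega else 0
            if x == 0:
                # Word comes from the general background model.
                w, c, z = categorical(phi0, rng), None, None
            else:
                c = categorical(pi, rng)        # category label for this word
                z = categorical(theta[c], rng)  # topic under that category
                w = categorical(phi[c][z], rng) # word from the category/topic model
            doc.append((w, x, c, z))
        corpus.append(doc)
    return corpus

corpus = generate_corpus()
```

Each generated token is a tuple `(word_id, x, category, topic)`, which makes the switch between the background model and the category-specific topics explicit.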

Page 20: Violence det ijcnlp13-slideshare


Violence Detection Model (VDM)

• Single document category-topic distribution shared across all the documents.

• Assumes words are generated either from a category-specific topic distribution or from a general background model.

Page 21: Violence det ijcnlp13-slideshare


Deriving Word Priors

Page 22: Violence det ijcnlp13-slideshare

Violence Lexicon

• Violence Lexicon Preparation
• DBpedia articles from violence-related topics
• Twitter Data for Jan-Dec 2010 (10% Twitter Firehose)

Non-violence-related: twilight, sandwich, award, moon, record, common, excited, great

Violence-related: fight, war, protest, riots, conflict, bomb, trouble, fear

Page 23: Violence det ijcnlp13-slideshare

Deriving Priors: Using DBpedia Categories

• Structured Semantic Web representation of data derived from Wikipedia
  o Maintained by thousands of editors
  o Evolves and adapts as knowledge changes [Syed et al, 2008]

• Covers a broad range of topics

• Characterises topics with a large number of resources

             DBpedia*        Yago2          Freebase
Resources    2.35 million    447 million    3.6 million
Classes      359             562,312        1,450
Properties   1,820           253,213,842    7,000

Page 24: Violence det ijcnlp13-slideshare

Deriving Priors: Using DBpedia Categories

Violence

Terrorism

War

Revolutionary Terror

Military Operations

….

Guerrilla Warfare

….

Page 25: Violence det ijcnlp13-slideshare

Obtaining Priors from Tweets

1 million tweets annotated with OpenCalais-derived topics, including:

• Business & Finance
• Disaster & Accident
• Education
• Entertainment & Culture
• Environment
• Health & Medical
• Hospitality & Recreation
• Labor
• Law & Crime
• Politics
• Religion & Belief
• Social Issues
• Sports
• Technology & Internet
• War & Conflict (8,338 tweets)

Page 26: Violence det ijcnlp13-slideshare

Datasets for Priors

                      Tweets (TW)   DBpedia (DB)   DBpedia chunked (DCH)
Violent-related       10,432        4,082          32,174
Non violent-related   11,411        11,411         11,411

• Use OpenCalais to annotate tweets
• Extracted tweets labelled as “War & Conflict” and considered them as violence-related annotations
• OpenCalais has a low F-measure of 38% when evaluated on our manually annotated test set

• DBpedia abstracts have longer sentences than tweets
• Generated tweet-size documents by chunking the abstracts into 9 or fewer words
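
The chunking step for DBpedia abstracts can be illustrated with a short Python sketch. The slides do not say how chunk boundaries were chosen, so this assumes whitespace tokenisation and consecutive, non-overlapping windows; the function name and the sample abstract are hypothetical.

```python
def chunk_text(text, max_words=9):
    """Split a text into consecutive chunks of at most `max_words` words."""
    words = text.split()  # assumed: simple whitespace tokenisation
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]

# Hypothetical abstract, used only to show the shape of the output.
abstract = (
    "Guerrilla warfare is a form of irregular warfare in which small groups "
    "of combatants use military tactics including ambushes and raids"
)
chunks = chunk_text(abstract)
```

Each element of `chunks` is then treated as one tweet-sized document when deriving word priors.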

Page 27: Violence det ijcnlp13-slideshare


• Corpus Word Entropy captures the dispersion of the usage of word w in the corpus SD

• Class Word Entropy characterises the usage of a word in a particular document class

• Relative Word Entropy provides information on the relative importance of that word to a given document class

Relative Word Entropy
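
As a rough illustration of the entropy idea behind these measures, the sketch below computes the Shannon entropy of how a word's occurrences are spread over a set of documents. This is an assumed reading of "dispersion of usage", not the formula from the paper: the slide text does not give the definitions, and how the corpus-level and class-level values are combined into the relative score is not shown here. The function name and toy documents are hypothetical.

```python
import math
from collections import Counter

def usage_entropy(word, documents):
    """Shannon entropy (in bits) of a word's occurrences across documents."""
    counts = [Counter(doc.split())[word] for doc in documents]
    total = sum(counts)
    if total == 0:
        return 0.0  # the word never occurs, so there is no dispersion to measure
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Toy labelled documents, for illustration only.
docs = [
    ("riot police fire", "violent"),
    ("riot riot bomb", "violent"),
    ("olympics flame medal", "non-violent"),
]

# Corpus-level dispersion uses every document ...
corpus_value = usage_entropy("riot", [text for text, _ in docs])
# ... while a class-level value restricts the same computation to one class.
class_value = usage_entropy("riot", [text for text, label in docs if label == "violent"])
```

A word concentrated in a single document has entropy 0, while a word spread evenly over many documents has high entropy.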

Page 28: Violence det ijcnlp13-slideshare


DBpedia-Chunked Priors

DBpedia-derived Priors Tweets-derived Priors

Violent NotViolent Violent NotViolent Violent NotViolent

group customer group gop rebel ey

alleg win power lov destro nnw

armour diff suffer back sectar vot

resid good soc good anti soc

cult sen palest twees mortat aid

separat eat knif interest amnest job

influ surve rebel right drug good

democr afford campaign answer fighter congrat


Word Priors Obtained using RWE

Page 29: Violence det ijcnlp13-slideshare


Experiments

Page 30: Violence det ijcnlp13-slideshare

Datasets for Experiments

                       Training set   Testing set
Violence-related       10,581         759
Non violence-related                  1,000

• TREC Microblog 2011 corpus
  o Comprises over 16 million tweets sampled over a two-week period (January 23rd to February 8th, 2011)
  o Includes 49 different events
    - violence-related ones such as the Egyptian revolution and the Moscow airport bombing
    - non-violence-related ones such as the Super Bowl seating fiasco

Page 31: Violence det ijcnlp13-slideshare

Baselines

• Learned from labelled features
  o Word priors are used as labelled feature constraints
  o Train MaxEnt classifier with Generalized Expectation (GE) [Druck et al., 2008] or Posterior Regularization (PR) [Ganchev et al., 2010]

• Joint Sentiment-Topic (JST) model [Lin&He 2009][Lin&He2012]
  o Set the number of sentiment classes to 2 (violent or non-violent)

• Partially-Labeled LDA (PLDA) [Ramage et al., 2011]
  o Assume that some document labels are observed and model per-label latent topics
  o Supervised information is incorporated at the document level rather than at the word level
  o The training set is labelled as violent or non-violent using OpenCalais

Page 32: Violence det ijcnlp13-slideshare

Violence Classification Results

• ME-GE and ME-PR perform poorly

• Best result obtained using VDM with word priors derived from TW using RWE

• Source data for deriving word priors
  o DB does not improve over TW
  o DCH boosts F-measure in JST and is close to TW for VDM

• RWE consistently outperforms IG for both JST and VDM
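
For readers unfamiliar with the F-measure used to compare these systems, here is a small Python sketch of precision, recall and balanced F1 for a single positive class. The slides do not state which variant was reported (per-class or averaged), so treating "violent" as the positive class is an assumption, as are the function name and the label strings.

```python
def precision_recall_f1(gold, predicted, positive="violent"):
    """Precision, recall and balanced F1 for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy labels, for illustration only.
gold = ["violent", "violent", "other", "other"]
pred = ["violent", "other", "violent", "other"]
scores = precision_recall_f1(gold, pred)
```

F1 is the harmonic mean of precision and recall, so a classifier has to do well on both to score highly.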

Page 33: Violence det ijcnlp13-slideshare


Varying Number of Topics

Page 34: Violence det ijcnlp13-slideshare


Violence-related topics Non violence-related topics

Topic Coherence Evaluation

Page 35: Violence det ijcnlp13-slideshare

Example Violence-Related Topics

Topic 1      Topic 2      Topic 3      Topic 4
egypt        middle       internet     crash
tahrir       east         egypt        kill
cair         give         phone        moscow
strees       power        block        bomb
police       idea         word         airport
protester    government   service      tweets
square       spread       government   injure
arm          uprise       shut         arrest
report       fall         facebook     dead

Topic 1: Protest in Tahrir Square
Topic 2: Middle East uprise
Topic 3: Government shut down Facebook
Topic 4: Moscow Airport bombing

Page 36: Violence det ijcnlp13-slideshare

Questions?

Yulan He, Kang Liu, Jun Zhao, Elizabeth Cano

Slides available at http://www.slideshare.net/ampaeli