
Detecting Cyberbullying with Morphosemantic Patterns

Michal Ptaszynski (1), Fumito Masui (1), Yoko Nakajima (2), Yasutomo Kimura (3), Rafal Rzepka (4) and Kenji Araki (4)

(1) Kitami Institute of Technology
(2) Kushiro National College of Technology
(3) Otaru University of Commerce
(4) Hokkaido University

Kitami Institute of Technology

Outline

1. Cyberbullying as a social problem

2. Previous research

3. Proposed method

4. Experiments

5. Future work

Introduction

Cyberbullying

- Slandering and humiliating people on the Internet.

- A recently noticed social problem.

INTERNET PATROL

• Internet monitoring by PTA.
• Requests site admins to remove harmful entries.
• High cost in time and fatigue for net-patrol members.

HELP by ICT

Previous Research

2009 2010 2011 2012 2013 2014 2015

Affect analysis of cyberbullying data

SO-PMI-IR / phrases

SVM / optimization

Michal Ptaszynski, P. Dybala, T. Matsuba, F. Masui, R. Rzepka, K. Araki, and Y. Momouchi. 2010. In the Service of Online Order: Tackling Cyber-Bullying with Machine Learning and Affect Analysis. International Journal of Computational Linguistics Research, Vol. 1, Issue 3, pp. 135-154, 2010.

T. Matsuba, F. Masui, A. Kawai, N. Isu. 2011. Study on the polarity classification model for the purpose of detecting harmful information on informal school sites (in Japanese), In Proceedings of NLP2011, pp. 388-391.

Michal Ptaszynski, P. Dybala, T. Matsuba, F. Masui, R. Rzepka and K. Araki. 2010. Machine Learning and Affect Analysis Against Cyber-Bullying. In Proceedings of AISB’10, 29th March – 1st April 2010.

Category Relevance Optimization

T. Nitta, F. Masui, M. Ptaszynski, Y. Kimura, R. Rzepka, K. Araki. 2013. Detecting Cyberbullying Entries on Informal School Websites Based on Category Relevance Maximization. In Proceedings of IJCNLP 2013, pp. 579-586.

2013 PATENT

Patent No. 2013-245813. Inventors: Fumito Masui, Michal Ptaszynski, Nitta Taisei.

Patent name: An Apparatus and Method for Detection of Harmful Entries on Internet

Language Combinatorics/ Preprocessing

M. Ptaszynski, F. Masui, Y. Kimura, R. Rzepka, K. Araki. 2015. Extracting Patterns of Harmful Expressions for Cyberbullying Detection, 7th Language & Technology Conference (LTC'15), 2015.11.27-29.

Language Combinatorics

Michal Ptaszynski, F. Masui, Y. Kimura, R. Rzepka, K. Araki. 2015. Brute Force Works Best Against Bullying, IJCAI 2015 Workshop on Intelligent Personalization (IP 2015), Buenos Aires, 2015.07.25-31.

Automatic acquisition of harmful words

S. Hatakeyama, F. Masui, M. Ptaszynski, K. Yamamoto. 2015. Improving Performance of Cyberbullying Detection Method with Double Filtered Point-wise Mutual Information. ACM Symposium on Cloud Computing 2015 (SoCC'15), August 2015.

Feature sophistication (simple → sophisticated)

words → bag-of-words → phrases → word patterns → syntactic patterns → semantic patterns

Proposed Method

Morphosemantics

Morphology + Semantics = Morpho-semantics

• Morphology: noun, verb, adjective, etc.
• Semantics: actor, action, object, patient, etc.

Effective for languages with strongly related morphology and semantics (e.g. Japanese).

Previously used for:
• analysis of Indonesian suffixes in WordNet [*1]
• analysis of Croatian lexis [*2]

[*1] Christiane Fellbaum, Anne Osherson, and Peter E. Clark. 2009. Putting semantics into WordNet’s “morphosemantic” links.
[*2] Ida Raffaelli. 2013. The model of morphosemantic patterns in the description of lexical architecture.

Morphological analysis

“John killed Mary.”

“noun verb(past) noun”

MeCab: standard morphological analysis tool for Japanese [*3]

[*3] http://mecab.sourceforge.net

Semantic role labelling

“John killed Mary.”

“actor action patient”

ASA (Argument Structure Analyzer): thesaurus-based predicate-argument structure analyzer for Japanese [*4]

[*4] http://cl.it.okayama-u.ac.jp/study/project/asa/asa-scala

Example of morphosemantic structure (MS)

Japanese : ニホンウナギが絶滅危惧種に指定され、完全養殖によるウナギの量産に期待が高まっている。

Transcription: Nihonunagi ga zetsumetsu kigushu ni shitei sare, kanzen yoshoku ni yoru unagi no ryousan ni kitai ga takamatte iru.

English : As the Japanese eel has been designated an endangered species, expectations are growing for mass production of eel through full aquaculture.

MS : [Object] [Agent] [State change] [Action] [Noun] [State change] [Object] [State change]
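A minimal sketch of how such a structure could be assembled from the two analyses, using the English toy sentence from the earlier slides. It assumes (based only on the example above, where [Noun] appears among the semantic labels) that a semantic role is used where one is available and the part of speech is used otherwise. The `Token` type, the `to_morphosemantic` helper and the hand-written annotations are hypothetical stand-ins for the output of MeCab and ASA, not their actual APIs.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    surface: str          # word as it appears in the sentence
    pos: str              # part of speech (hand-annotated here)
    role: Optional[str]   # semantic role, or None if no role was assigned


def to_morphosemantic(tokens: List[Token]) -> List[str]:
    """Prefer the semantic role; fall back to the part of speech (assumption)."""
    return [f"[{t.role}]" if t.role else f"[{t.pos}]" for t in tokens]


# Hand-annotated toy input for "John killed Mary."
sentence = [
    Token("John", "Noun", "Actor"),
    Token("killed", "Verb(past)", "Action"),
    Token("Mary", "Noun", "Patient"),
    Token(".", "Symbol", None),
]

ms = to_morphosemantic(sentence)
print(" ".join(ms))
```

The resulting label sequence, rather than the surface words, is what would be passed on to pattern extraction.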

Pattern Extraction

Sentence patterns = ordered non-repeated combinations of sentence elements.

For 1 ≤ k ≤ n, there are C(n, k) = n! / (k! (n − k)!) possible k-element patterns, and 2^n − 1 patterns in total.

Extract patterns from all sentences and calculate their occurrences.

Michal Ptaszynski, R. Rzepka, K. Araki and Y. Momouchi. 2011. Language combinatorics: A sentence pattern extraction architecture based on combinatorial explosion. Int. J. of Computational Linguistics (IJCL), Vol. 2, Issue 1, pp. 24-36.

SPEC – Sentence Pattern Extraction arChitecture

Pattern Extraction

Example: What a nice day !

5-element pattern: What a nice day ! (1)

4-el. patterns (5): What a nice * ! ; What a nice day ; What a * day ! ; ...

3-el. patterns (10): a nice * ! ; What a nice ; What a * ! ; ...

2-el. patterns (10): What a ; What * ! ; nice * ! ; ...

1-el. patterns (5): What ; a ; nice ; ...
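The example can be reproduced with a short sketch. It is not the SPEC implementation; it only assumes, as the example suggests, that a pattern is an ordered subset of the sentence elements and that a gap between non-adjacent elements is marked with "*". The function name `extract_patterns` is hypothetical.

```python
from itertools import combinations
from typing import Dict, List


def extract_patterns(tokens: List[str]) -> Dict[int, List[str]]:
    """Map each pattern length k to all ordered k-element patterns."""
    n = len(tokens)
    patterns: Dict[int, List[str]] = {}
    for k in range(1, n + 1):
        found = []
        # combinations() keeps the original order and never repeats an element
        for idx in combinations(range(n), k):
            parts = [tokens[idx[0]]]
            for prev, cur in zip(idx, idx[1:]):
                if cur - prev > 1:      # skipped elements become a wildcard
                    parts.append("*")
                parts.append(tokens[cur])
            found.append(" ".join(parts))
        patterns[k] = found
    return patterns


patterns = extract_patterns("What a nice day !".split())
counts = {k: len(v) for k, v in patterns.items()}
print(counts)
print(patterns[4])
```

For a five-element sentence this gives 5, 10, 10, 5 and 1 patterns for k = 1 to 5, matching the counts on the slide. Because the total grows as 2^n − 1, long sentences produce very large pattern lists.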

Pattern Extraction

Normalized pattern weight

Score for one sentence
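The formulas for these two quantities are not preserved in this text. The sketch below is therefore only an illustration under stated assumptions: the weight of a pattern is taken to be its share of occurrences on the harmful side, rescaled to the range −1 to +1, and the score of a sentence is taken to be the sum of the weights of the patterns found in it. The function names, the example patterns and the occurrence counts are all hypothetical.

```python
from typing import Dict, Iterable


def pattern_weight(occ_harmful: int, occ_nonharmful: int) -> float:
    """Assumed normalization: +1 = only in harmful data, -1 = only in non-harmful data."""
    total = occ_harmful + occ_nonharmful
    return (occ_harmful / total - 0.5) * 2


def sentence_score(found: Iterable[str], weights: Dict[str, float]) -> float:
    """Assumed score: sum of weights of known patterns; unknown patterns add 0."""
    return sum(weights.get(p, 0.0) for p in found)


def classify(score: float, threshold: float = 0.0) -> str:
    return "harmful" if score > threshold else "non-harmful"


# Hypothetical occurrence counts: (harmful side, non-harmful side)
occurrences = {"A * B": (9, 1), "C": (1, 3), "D": (2, 2)}
weights = {p: pattern_weight(h, n) for p, (h, n) in occurrences.items()}

score = sentence_score(["A * B", "C", "unseen"], weights)
print(weights, score, classify(score))
```

Under this assumption a pattern that occurs equally often on both sides (such as "D") gets weight 0 and contributes nothing to the score.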

Classify new input with the pattern list

Dataset

• Actual data collected by Internet Patrol (annotated by experts)
• From unofficial school forums (BBS)
• Provided by the Human Rights Center in Japan (Mie Prefecture)
• According to the definition by the Japanese Ministry of Education (MEXT)
• 1,490 harmful and 1,508 non-harmful entries

Experiment setup

Preprocessing
1. Morphology
2. Semantics
3. Morphosemantics

Pattern List Modification
1. All patterns
2. Zero-patterns deleted
3. Ambiguous patterns deleted

Weight Calculation Modifications
1. Normalized
2. Award length
3. Award length and occurrence

All patterns vs. only n-grams

Automatic threshold optimization

10-fold Cross Validation

One experiment = 420 runs
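How the automatic threshold optimization works is not spelled out here. One plausible reading, sketched below as an assumption, is a sweep over candidate thresholds on the sentence score that keeps the threshold giving the best F-score. The scores and labels are made-up toy values, and `prf` and `best_threshold` are hypothetical helper names.

```python
from typing import List, Tuple


def prf(scores: List[float], labels: List[bool], threshold: float) -> Tuple[float, float, float]:
    """Precision, recall and F1 when every score above the threshold is called harmful."""
    predicted = [s > threshold for s in scores]
    tp = sum(1 for p, l in zip(predicted, labels) if p and l)
    fp = sum(1 for p, l in zip(predicted, labels) if p and not l)
    fn = sum(1 for p, l in zip(predicted, labels) if not p and l)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_threshold(scores: List[float], labels: List[bool]) -> float:
    """Try every observed score as a threshold and keep the one with the highest F1."""
    return max(sorted(set(scores)), key=lambda t: prf(scores, labels, t)[2])


# Toy data: sentence scores and gold labels (True = harmful)
scores = [-0.9, -0.4, -0.1, 0.2, 0.3, 0.8]
labels = [False, False, True, False, True, True]

threshold = best_threshold(scores, labels)
precision, recall, f1 = prf(scores, labels, threshold)
print(threshold, precision, recall, f1)
```

The same sweep could be run with precision or accuracy as the selection criterion, which corresponds to the "Optimized for" rows of the results table.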

Results

Optimized for ↓ | POS                 | Semantic roles      | Morphosemantics
                | Pr   Re   F1   Acc  | Pr   Re   F1   Acc  | Pr   Re   F1   Acc
F-score         | 0.53 0.95 0.68 0.55 | 0.63 0.74 0.68 0.67 | 0.61 0.76 0.68 0.64
Precision       | 0.93 0.03 0.06 0.51 | 0.93 0.06 0.11 0.54 | 0.85 0.10 0.18 0.55
Accuracy        | 0.58 0.78 0.66 0.61 | 0.80 0.49 0.61 0.69 | 0.62 0.72 0.67 0.65
BEP             | 0.61                | 0.67                | 0.64
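As a quick cross-check of the figures above, F1 can be recomputed from the reported precision and recall with the standard formula F1 = 2·Pr·Re / (Pr + Re). The values below are copied from the table; since they are rounded to two decimals, agreement is only expected to within about 0.01. The dictionary layout and the name `reported` are just for this sketch.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (optimized for, preprocessing) -> (Pr, Re, reported F1), copied from the table
reported = {
    ("F-score", "POS"): (0.53, 0.95, 0.68),
    ("F-score", "Semantic roles"): (0.63, 0.74, 0.68),
    ("F-score", "Morphosemantics"): (0.61, 0.76, 0.68),
    ("Precision", "POS"): (0.93, 0.03, 0.06),
    ("Precision", "Semantic roles"): (0.93, 0.06, 0.11),
    ("Precision", "Morphosemantics"): (0.85, 0.10, 0.18),
    ("Accuracy", "POS"): (0.58, 0.78, 0.66),
    ("Accuracy", "Semantic roles"): (0.80, 0.49, 0.61),
    ("Accuracy", "Morphosemantics"): (0.62, 0.72, 0.67),
}

recomputed = {key: f1_score(pr, re) for key, (pr, re, _) in reported.items()}
max_diff = max(abs(recomputed[key] - f1) for key, (_, _, f1) in reported.items())

for key, value in recomputed.items():
    print(key, round(value, 3), "reported:", reported[key][2])
```

The check also makes the trade-off visible: optimizing for precision pushes recall, and with it F1, close to zero for all three kinds of preprocessing.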

Results

Best F-score: similar for all

Best Precision:

Best Accuracy:

Best BEP:

1. Only semantics
2. Morphosemantics
3. POS

Results

Statistical significance

• Difference with POS: always significant
• Difference between Only semantics and Morphosemantics: almost never significant

Semantics alone: usually more effective than the full morphosemantic structure

Uses less information: also more efficient

But the advantage over Morphosemantics could be a coincidence: more data and further experiments are needed

Results: Comparison with state-of-the-art

Proposed method:
• More efficient (user does almost nothing)
• Applicable to other languages
• Can point out non-harmful elements

Conclusions

• Presented research on cyberbullying detection.
• Proposed novel method.
• Automatic extraction of sophisticated morphosemantic patterns.
• Used patterns in classification of cyberbullying.
• Tested on actual data obtained by Internet patrol.
• Outperformed previous methods.
• Requires minimal human effort.

Future work

• Apply different preprocessing and classifiers for further improvement.
• Test on new data.
• Obtain new data by applying in practice.
• Verify the actual amount of CB information on the Internet and reevaluate in more realistic conditions.

Thank you for your kind attention!

Michal Ptaszynski

Kitami Institute of Technology
