Using emojis as universal sentence representation for social media data Alexis DUTOT - 22/05/2019 PARIS NLP S3 MEETUP #5


Page 1: data representation for social media Using emojis as ... · 1 Primrose Street, London EC2A 2EX contact-uk@linkfluence.com DÜSSELDORF Erkrather Straße 234b, 40233 Düsseldorf kontakt@linkfluence.com

Using emojis as universal sentence representation for social media data

Alexis DUTOT - 22/05/2019

PARIS NLP S3 MEETUP #5


● Introduction

● DeepMoji

● Internal challenges

● Our approach: Unimoji

● Conclusion & perspectives


Introduction

Linkfluence

- Social Media Intelligence company

- Activities: software & market research

- 2 products: Radarly and Search

- 250+ employees across 6 offices


Our day-to-day work

Research

- Read papers

- Technology watch

- Prototype new features

- Train models

- Science popularization

Production

- Implement new features that fit in the production pipeline (near real-time inference)

- Build batch computations for AI features not computed in real time

- Enhance the processing pipeline


Our day-to-day work


Production environment

- Research playground

- Machine learning & NLP toolkits

- Programming languages


Our pipeline

Language detection

NER extraction

Categorization

Opinion mining

Location & user inference

Stats:

● ~ 1200 documents per second

● > 60 languages

● > 10 platforms (social media & web)

● > 65 models in the pipeline


Opinion mining

- Sentiment analysis: document-level sentiment analysis with 4 classes: positive, negative, neutral and mixed

- Emotion detection: document-level multi-emotion detection with 7 classes: anger, disgust, fear, joy, love, sadness and surprise

● Initial goal: enhance the sentiment analysis algorithm that was in the production pipeline

● Challenges:

○ Social media posts are noisy user-generated content: spelling mistakes, grammatical errors, contractions, abbreviations, specific terms, ...

○ Very few annotated corpora are available, with few examples per corpus

○ The majority of these corpora are in English and “domain-specific”

Sentiment analysis for social media data is limited by the scarcity of manually annotated data

How can we overcome this lack of manually annotated data?

Use distant supervision methods to make models learn useful text representations (like emotional content) before modeling these tasks directly:

● Use specific hashtags (#good, #bad, #angry, #fml) to automatically label a high volume of data (Mohammad, 2012)

● Use predefined sets of positive and negative emoticons or emojis for automatic data labelling (Deriu et al., 2016, Tang et al., 2014) → Our previous sentiment analysis model

● Pre-train a model to predict emojis given a document, so that it learns a rich emotional text representation, then fine-tune it on a specific opinion mining task: DeepMoji (Felbo et al., 2017)
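As a rough illustration of the second strategy (labelling with predefined emoji sets), here is a minimal Python sketch. The two emoji sets and the `distant_label` helper are hypothetical and chosen only for illustration. They are not the sets used by Deriu et al., Tang et al. or Linkfluence.

```python
from typing import Optional

# Hypothetical emoji sets, for illustration only.
POSITIVE_EMOJIS = {"😁", "😂", "👍", "❤"}
NEGATIVE_EMOJIS = {"😡", "😢", "😔", "👎"}


def distant_label(text: str) -> Optional[str]:
    """Return a noisy sentiment label derived from the emojis in `text`.

    Returns None when the text contains no emoji from either set, or when
    it mixes both sets (ambiguous, so it is skipped rather than guessed).
    """
    has_pos = any(ch in POSITIVE_EMOJIS for ch in text)
    has_neg = any(ch in NEGATIVE_EMOJIS for ch in text)
    if has_pos == has_neg:
        return None
    return "positive" if has_pos else "negative"


posts = ["This was soooo FUN !!! 😁😁", "Worst service ever 😡", "See you at 5pm"]
# Keep only the posts that received a label.
labelled = [(p, distant_label(p)) for p in posts if distant_label(p) is not None]
```

Posts without a usable emoji are simply dropped, which is why this kind of labelling needs a very large raw corpus to start from.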


DeepMoji


The power of emoji

DeepMoji: leverage the power of emoji to accurately encode the emotional content of texts.

Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm (Felbo et al., 2017)

https://deepmoji.mit.edu/

This was soooo FUN !!! 😁😁 [this, was, soo, fun, !!] POSITIVE

→ Build a training set of 1.2B tweets with emojis as noisy labels

This was soooo FUN !!! 😁😁 [this, was, soo, fun, !!] 😁

→ Pre-train a model to predict an emoji probability distribution given a text

→ Fine-tune this model on a specific opinion mining task (sentiment analysis, emotion detection & sarcasm detection)
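The first step, turning a raw tweet into a (tokens, emoji) training pair, could look roughly like the sketch below. It is not DeepMoji's actual preprocessing: the `TARGET_EMOJIS` list, the `make_example` helper, the normalisation rules and the choice of the first emoji as the label are all simplifying assumptions.

```python
import re

# Hypothetical target emojis; the real model predicts a larger emoji set.
TARGET_EMOJIS = ["😁", "😂", "❤", "😢", "😡"]


def make_example(tweet: str):
    """Split a tweet into (tokens, emoji_label); return None if no target emoji.

    The first target emoji found becomes the noisy label, and every target
    emoji is removed from the text so the model cannot simply copy it.
    """
    found = [ch for ch in tweet if ch in TARGET_EMOJIS]
    if not found:
        return None
    text = "".join(ch for ch in tweet if ch not in TARGET_EMOJIS)
    # Very rough normalisation: lowercase, cap character runs at two repeats.
    text = re.sub(r"(.)\1{2,}", r"\1\1", text.lower())
    # Tokens are either runs of word characters or runs of punctuation.
    tokens = re.findall(r"\w+|[^\w\s]+", text)
    return tokens, found[0]


example = make_example("This was soooo FUN !!! 😁😁")
```

On the slide's example sentence this yields the tokens `this, was, soo, fun, !!` with 😁 as the label.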

The model

2-layer BiLSTM with attention

[Figure panels: Pre-training, Transfer learning]

Fine-tuning is done using the chain-thaw approach: sequentially fine-tune one layer at a time
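To make the chain-thaw idea concrete, the sketch below only computes the order in which layers would be unfrozen; it does no training. The stage order (new output layer first, then each layer on its own, then the whole model) follows the description in Felbo et al. (2017) as far as we understand it. The layer names and the `chain_thaw_schedule` helper are hypothetical.

```python
from typing import List


def chain_thaw_schedule(pretrained_layers: List[str], new_layer: str) -> List[List[str]]:
    """Return, for each fine-tuning stage, the layers that are trainable.

    Stage order assumed here: the new task layer alone, then each layer
    on its own from first to last, then all layers together.
    """
    all_layers = pretrained_layers + [new_layer]
    stages = [[new_layer]]                      # 1. new output layer only
    stages += [[name] for name in all_layers]   # 2. one layer at a time
    stages.append(all_layers)                   # 3. whole model
    return stages


# Hypothetical layer names for a DeepMoji-like model.
layers = ["embedding", "bilstm_1", "bilstm_2", "attention"]
for stage, trainable in enumerate(chain_thaw_schedule(layers, "softmax"), start=1):
    print(stage, trainable)
```

In a real framework, each stage would set the listed layers as trainable, freeze the others, and train until the validation loss stops improving.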

Advantages of DeepMoji

● SoTA on 3 opinion mining tasks (before BERT’s arrival)

● Really good fit for our use-case: opinion mining on social media posts

● Simple and easy-to-read code written in Keras to perform tests and reproduce results


Internal Challenges


Challenges reminder

1. Multilingual problem: perform opinion mining in many (>60) languages on every social media platform. DeepMoji requires manually annotated data for each target task and for each language.

2. Computational problem: handle at least 1200 documents per second without making the hardware costs skyrocket. We assume that a Bi-LSTM would not be an option.

Limitations & resources

Research environment

- Hardware: 4 GTX 1080 Ti

- Frameworks: Keras + TensorFlow. TensorFlow offers an “almost” stable Java API (ONNX or DeepLearning4J not mature yet)

Production environment

- CPU-only production instances

- Current processing pipeline on Apache Storm (JVM) does not handle batching

Not ideal for deep learning models

Our idea

Emojis are universal across languages and are increasingly used on social media platforms.

[Diagram: each document goes through a language-specific Doc2Emoji model, then through a single Emoji2Sentiment model trained on English annotated corpora]

- “Deep Learning is awesome !” → Doc2Emoji trained on English → 👍 0.35, 😔 0.002 → Emoji2Sentiment → POSITIVE

- “J’adore mon nouvel iPhone” (I love my new iPhone) → Doc2Emoji trained on French → ❤ 0.68, 😢 0.001 → Emoji2Sentiment → POSITIVE

- “Detesto el fin de la Casa de Papel…” (I hate the ending of La Casa de Papel…) → Doc2Emoji trained on Spanish → 😡 0.36, 😂 0.005 → Emoji2Sentiment → NEGATIVE
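The routing idea can be sketched in a few lines of Python. Everything here is a toy stand-in: the per-language `DOC2EMOJI` functions just return the probabilities shown on the slide, and `emoji2sentiment` is a hand-written polarity rule rather than a trained model. Only the structure (one Doc2Emoji per language, one shared Emoji2Sentiment) reflects the talk.

```python
from typing import Callable, Dict

EmojiDist = Dict[str, float]

# Toy stand-ins for the per-language Doc2Emoji models (hypothetical outputs,
# copied from the probabilities shown on the slide).
DOC2EMOJI: Dict[str, Callable[[str], EmojiDist]] = {
    "en": lambda text: {"👍": 0.35, "😔": 0.002},
    "fr": lambda text: {"❤": 0.68, "😢": 0.001},
    "es": lambda text: {"😡": 0.36, "😂": 0.005},
}

# Toy stand-in for Emoji2Sentiment: a fixed polarity per emoji.
POLARITY = {"👍": 1.0, "❤": 1.0, "😂": 1.0, "😔": -1.0, "😢": -1.0, "😡": -1.0}


def emoji2sentiment(dist: EmojiDist) -> str:
    """Language-independent step: it only ever sees emoji probabilities."""
    score = sum(prob * POLARITY.get(emoji, 0.0) for emoji, prob in dist.items())
    return "POSITIVE" if score > 0 else "NEGATIVE"


def predict_sentiment(text: str, lang: str) -> str:
    """Language-specific Doc2Emoji, then one shared Emoji2Sentiment."""
    return emoji2sentiment(DOC2EMOJI[lang](text))
```

Because the second stage never sees the text itself, annotated sentiment data is only needed in the language used to train it (English in the talk).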

Proof of concept

● Validating the approach: use pre-trained DeepMoji (predicts an emoji probability distribution) + an MLP (predicts sentiment from the distribution). Small loss of accuracy compared to fine-tuned methods (2-5 points) → acceptable

● Reproduce DeepMoji pre-training on our own English data

● Issues:

1. Training: 12 days per epoch (too long)

2. Inference time in production: 50 ms/input (too slow)
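A back-of-the-envelope calculation shows why 50 ms per input is a problem at roughly 1200 documents per second. The sketch assumes each worker handles one document at a time with no batching (consistent with the Storm limitation mentioned earlier). The `workers_needed` helper is hypothetical.

```python
import math


def workers_needed(docs_per_second: int, latency_ms: int) -> int:
    """Parallel workers required if each one processes a single document at a time."""
    # Each worker is busy for latency_ms per document, so the load in
    # "worker-seconds per second" is docs_per_second * latency_ms / 1000.
    return math.ceil(docs_per_second * latency_ms / 1000)


print(workers_needed(1200, 50))  # 60
```

Under these assumptions, about 60 CPU workers would be busy full-time on this single model, in a pipeline that already runs dozens of others.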


Tackling the computational problem

At this point:

● Impossible to use an RNN architecture in production

● Need an alternative...

1. Can we replace the DeepMoji architecture with a computationally cheaper one while preserving a good emotional context representation?

2. Can this emotional context representation using emojis be used to perform multilingual opinion mining tasks?


Our approach: Unimoji


Doc2Emoji

Several CNN architectures were tried

The final architecture is a combination of:

● Attentive convolutions (Yin, 2017)

● The 2-layer CNN architecture used by the SwissCheese team, winners of Task 4-A of SemEval-2016 (Deriu et al., 2016)

[Figures: Light attentive convolution layer; Doc2Emoji architecture that we used]
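For readers unfamiliar with convolutional text classifiers, the sketch below shows the general shape of such a model in plain Python: one convolution over token embeddings, ReLU, max-over-time pooling, and a softmax over emojis. It is deliberately much simpler than the architecture described in the talk: it has a single layer and no attention. All sizes and the random weights are arbitrary assumptions.

```python
import math
import random

random.seed(0)
EMB_DIM, N_FILTERS, WIDTH, N_EMOJIS = 4, 3, 2, 5


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Randomly initialised parameters; a trained model would learn these.
filters = [rand_matrix(WIDTH, EMB_DIM) for _ in range(N_FILTERS)]
dense = rand_matrix(N_EMOJIS, N_FILTERS)


def doc2emoji(embedded_tokens):
    """embedded_tokens: list of at least WIDTH vectors of size EMB_DIM."""
    pooled = []
    for f in filters:
        # Slide the filter over every window of WIDTH consecutive tokens.
        activations = [
            max(0.0, sum(f[i][d] * embedded_tokens[t + i][d]
                         for i in range(WIDTH) for d in range(EMB_DIM)))
            for t in range(len(embedded_tokens) - WIDTH + 1)
        ]
        pooled.append(max(activations))          # max-over-time pooling
    logits = [sum(w * p for w, p in zip(row, pooled)) for row in dense]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]             # softmax over emojis


tokens = rand_matrix(6, EMB_DIM)                 # stand-in for 6 embedded tokens
probs = doc2emoji(tokens)
```

Unlike a recurrent layer, every window here can be computed independently, which is the main reason convolutions tend to be cheaper at inference time on CPUs.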

Doc2Emoji

Statistics:

● Dataset: 512M tweets

● Training: 44 h/epoch (vs 12 days/epoch for DeepMoji)

● Prediction in production: 5 ms/input (vs 50 ms/input for DeepMoji)

[Table: Top-1 and top-5 emoji prediction accuracies]

Our architecture performed almost as well as DeepMoji

Is this representation accurate enough to solve opinion mining tasks?
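Top-1 and top-5 accuracy count a prediction as correct when the true emoji is among the 1 or 5 most probable ones. A minimal sketch of that metric is shown below. The `top_k_accuracy` helper and the three example distributions are made up for illustration and are unrelated to the talk's actual results.

```python
from typing import Dict, List


def top_k_accuracy(predictions: List[Dict[str, float]], gold: List[str], k: int) -> float:
    """Fraction of examples whose gold emoji is among the k most probable."""
    hits = 0
    for dist, label in zip(predictions, gold):
        top_k = sorted(dist, key=dist.get, reverse=True)[:k]
        hits += label in top_k
    return hits / len(gold)


# Made-up predictions for three documents.
preds = [
    {"😁": 0.6, "❤": 0.3, "😢": 0.1},
    {"😁": 0.2, "❤": 0.5, "😢": 0.3},
    {"😁": 0.5, "❤": 0.4, "😢": 0.1},
]
gold = ["😁", "😢", "❤"]

top1 = top_k_accuracy(preds, gold, k=1)  # only the first document is a hit
top2 = top_k_accuracy(preds, gold, k=2)  # all three documents are hits
```

Top-k with k > 1 is a natural fit for emoji prediction, since several emojis are often equally plausible for the same text.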

Emoji2Task

1. Can we replace the DeepMoji architecture with a computationally cheaper one while preserving a good emotional context representation?

Architecture: 2-layer neural network

Comparing the quality of the learnt sentence representations: benchmark against the DeepMoji approach
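A 2-layer network on top of the emoji distribution is small enough to write out by hand. The sketch below shows only the forward pass, with random untrained weights. The layer sizes, the ReLU activation and the three class names are assumptions, not details given in the talk.

```python
import math
import random

random.seed(1)
N_EMOJIS, HIDDEN = 6, 8
CLASSES = ["negative", "neutral", "positive"]


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Untrained parameters, for illustration only.
W1, b1 = rand_matrix(HIDDEN, N_EMOJIS), [0.0] * HIDDEN
W2, b2 = rand_matrix(len(CLASSES), HIDDEN), [0.0] * len(CLASSES)


def dense(inputs, weights, biases):
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def emoji2task(emoji_probs):
    """Map an emoji probability distribution to class probabilities."""
    hidden = [max(0.0, h) for h in dense(emoji_probs, W1, b1)]  # layer 1 + ReLU
    return softmax(dense(hidden, W2, b2))                       # layer 2 + softmax


class_probs = emoji2task([0.5, 0.2, 0.1, 0.1, 0.05, 0.05])
```

The input is only as wide as the emoji vocabulary, so this second stage adds very little to the inference cost.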

Emoji2Task

2. Can this emotional context representation using emojis be used to perform multilingual opinion mining tasks?

Train 3 new Doc2Emoji models: French, German, Simplified Chinese

Experiments: Sentiment analysis & Emotion detection

Multilingual sentiment analysis

Training: SemEval 2016 Task 4-A dataset (3 classes: negative, positive, neutral)

Evaluation: internally annotated data in English, German, French & Chinese

Results: (vs previous algorithm)

● English accuracy improvement: ~ +10% (90% acc)

● French accuracy improvement: ~ +7% (87% acc)

● German accuracy improvement: ~ +6% (81% acc)

● Chinese accuracy change: ~ -30% (40% acc)

The multilingual approach improved the results for all languages except Chinese

→ Emoji context in Chinese ≠ emoji context in English

Multilingual emotion detection

[Figure: example predictions labelled Love & sadness, Anger & disgust, Surprise]

Training: SemEval 2018 Task 1-Ec dataset

We kept only 7 emotions: anger, disgust, fear, joy, love, sadness and surprise (multilabel classification)

Evaluation: internally annotated data in English, German and French

Results:

● English accuracy: 85%

● French accuracy: 80%

● German accuracy: 77%

Results → good enough to validate our approach
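In multilabel classification each emotion is decided independently, so a document can receive several emotions or none. A minimal decoding sketch follows. The per-class sigmoid, the 0.5 threshold and the example logits are assumptions made for illustration, not details from the talk.

```python
import math

EMOTIONS = ["anger", "disgust", "fear", "joy", "love", "sadness", "surprise"]


def decode_emotions(logits, threshold=0.5):
    """Apply an independent sigmoid per emotion and keep those above threshold."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [name for name, p in zip(EMOTIONS, probs) if p >= threshold]


# Made-up logits: only "love" and "sadness" score above zero.
print(decode_emotions([-3.0, -2.5, -4.0, -1.0, 2.0, 1.5, -2.0]))  # ['love', 'sadness']
```

This differs from the sentiment task, where a softmax forces exactly one class per document.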

Validating our approach

1. Can we replace the DeepMoji architecture with a computationally cheaper one while preserving a good emotional context representation?

2. Can this emotional context representation using emojis be used to perform multilingual opinion mining tasks? *

* If the emotional context in which emojis are used is not too different from that of the language on which Emoji2Task was trained.


Conclusion & Perspectives


So far...

● Integrated our Unimoji model for sentiment analysis and emotion detection for 6 languages: French, English, Spanish, Portuguese, German and Italian

● For the Simplified Chinese model, the Doc2Emoji model was fine-tuned on a Chinese sentiment analysis dataset (improving accuracy by ~20%)

● Plan to add more languages to the model...

Key takeaways

● 10x faster Doc2Emoji architecture based on CNNs, with a small accuracy loss

● Unimoji = modular architecture: the Doc2Emoji and Emoji2Task architectures can each be swapped for any model

● 2 opinion mining tasks trained using the same English emoji probability distribution as emotional representation:

○ Sentiment analysis (improving our inference accuracy)

○ Emotion detection (new feature!)

Key takeaways

● Doc2Emoji can be fine-tuned for any language if a reliable manually annotated dataset is available

● Such models have limitations: different emotional contexts for emojis, different emoji distributions between two languages, ...

What’s next?

● Add more languages

● Continue to explore limitations

● Don’t focus only on emojis

→ Explore cross-lingual models (LASER, XLM)

● New opinion tasks

→ Sarcasm detection, hate detection, optimism/pessimism, ...

Thank you!

Questions?

LONDON: 1 Primrose Street, London EC2A 2EX, contact-uk@linkfluence.com

DÜSSELDORF: Erkrather Straße 234b, 40233 Düsseldorf, kontakt@linkfluence.com

SHANGHAI: 上海昌平路68号510-512室 近西苏州路, Rm 512, 68 Changping Road, [email protected]

SINGAPORE: Capital Tower #12-01, 168 Robinson Road, 068912 [email protected]

PARIS: 5, rue Choron, 75009 [email protected]

SAN FRANCISCO: 575 Market Street #11, San Francisco CA [email protected]