Low computational cost algorithms for photo clustering and mail signature detection in the cloud


Upload: xavier-giro

Post on 03-Jul-2015


Page 1: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Daniel Manchón
Co-directors: Xavi Giró (UPC), Omar Pera (Pixable)

Page 2: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Outline

• Motivation
• Tasks summary
• Pixable internship
• GPI research assistant
• Photo clustering
• Mail signature detection
• Conclusions

For Photo clustering and Mail signature detection:

• Introduction
• Requirements
• Design
• Results

Page 3: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Motivation: Photo clustering


Page 4: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Motivation: Mail signature detection


Page 5: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Motivation: Cloud computing


Page 6: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Outline (current section: Pixable internship)

Page 7: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Pixable internship

- Social photos aggregation
- Photo ranking
- Editorial content
- Contacts feeds
- Owned by Singtel

- Photo storage
- Synchronization across multiple devices
- Support for RAW

- CallerID application
- Multiple contact source support
- Contact backup and synchronization
- SPAM detection

Page 8: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Photofeed tasks

• Instagram source (in production)
• Referrals and invitations method
• New Relic integration
• Photo clustering and summarization
• Photo download service (in production)

Page 9: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Contactive tasks

• Mail scraping monitoring
• Signature detection
• Identity analysis improvement
• Tooling (in production)

Page 10: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Outline (current section: GPI research assistant)

Page 11: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

GPI research assistant

• MediaEval 2013 (published paper)
• ICMR SEWM (published paper)
• Pyxel software framework
• MediaEval 2014

GPI: Image and Video Processing Group
ICMR: multimedia retrieval conference

Page 12: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Outline (current section: Photo clustering, Introduction)

Page 13: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Photo Clustering: Intro

Event detection

State of the art: PhotoTOC [Platt et al., PacRim 2003]

Page 14: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Outline (current section: Photo clustering, Requirements)

Page 15: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Photo Clustering: Requirements

Photofeed constraints

• User data stored in the Amazon cloud and MongoDB
• Low computational cost
• Easily configurable through a REST API

MediaEval requirements

• Event generation
• Visual and metadata information available
• F1 and NMI as evaluation metrics
• Annotated dataset of 400k photos

Page 16: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Outline (current section: Photo clustering, Design)

Page 17: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Design

Hi, I’m John. Hi, I’m Emily.

(a) Temporal sorting by each user independently
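
Step (a) can be sketched in a few lines. This is a minimal illustration, not the code used at Pixable: the dictionary keys `username` and `taken` and the function name `sort_by_user` are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime


def sort_by_user(photos):
    """Group photos by user and sort each user's photos by capture time."""
    per_user = defaultdict(list)
    for photo in photos:
        per_user[photo["username"]].append(photo)
    for user_photos in per_user.values():
        user_photos.sort(key=lambda p: p["taken"])
    return dict(per_user)


# Hypothetical input: two users, photos given in arbitrary order.
photos = [
    {"username": "emily", "taken": datetime(2010, 12, 13, 3, 11, 10)},
    {"username": "john", "taken": datetime(2010, 9, 10, 2, 10, 12)},
    {"username": "emily", "taken": datetime(2010, 12, 13, 2, 11, 10)},
]
timelines = sort_by_user(photos)
```

Each user's timeline is handled on its own, so the later steps never compare timestamps across different users at this stage.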


Page 18: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Design

(b) Temporal-based oversegmentation in mini-clusters

PhotoTOC [Platt et al, PacRim 2003]
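
A rough sketch of time-based oversegmentation follows. PhotoTOC itself adapts the threshold to the local distribution of time gaps; the fixed `max_gap` used here is a simplifying assumption, and the function name `oversegment` is hypothetical.

```python
from datetime import datetime, timedelta


def oversegment(timestamps, max_gap=timedelta(minutes=30)):
    """Split time-sorted timestamps into mini-clusters.

    A new mini-cluster starts whenever the gap to the previous
    photo is larger than max_gap (an assumed, fixed threshold).
    """
    clusters = []
    for ts in timestamps:
        if clusters and ts - clusters[-1][-1] <= max_gap:
            clusters[-1].append(ts)
        else:
            clusters.append([ts])
    return clusters


day = datetime(2010, 12, 13)
stamps = [day.replace(hour=10, minute=0), day.replace(hour=10, minute=10),
          day.replace(hour=12, minute=0), day.replace(hour=12, minute=5)]
mini_clusters = oversegment(stamps)
```

A small threshold deliberately produces too many clusters; the merging step (c) is then responsible for joining those that belong to the same event.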


Page 19: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Design

(b) Temporal-based oversegmentation in mini-clusters, each modeled by its mean values

Username = John; T.taken = 2010-09-10 02:10:12; GPS = (42.1, -10); tags = live, stage, deerhunter

Username = emily; T.taken = 2010-12-13 02:11:10; GPS = (43, -8.40); tags = live, deerhunter

Username = emily; T.taken = 2010-12-13 03:11:10; GPS = (no data); tags = live, stones

Username = emily; T.taken = 2010-12-14 23:11:10; GPS = (43.2, -8.2); tags = sound, test
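
One way to model a mini-cluster by mean values is sketched below, using two of emily's photos from the slide. The slide only mentions mean values; skipping photos without GPS and taking the union of tags are assumptions of this sketch, as are the names `summarize` and `EPOCH`.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def summarize(photos):
    """Summarize a non-empty mini-cluster by averaged metadata."""
    seconds = [(p["taken"] - EPOCH).total_seconds() for p in photos]
    # Photos without GPS are ignored when averaging coordinates (assumption).
    located = [p["gps"] for p in photos if p["gps"] is not None]
    gps = None
    if located:
        gps = (sum(lat for lat, _ in located) / len(located),
               sum(lon for _, lon in located) / len(located))
    # Tags cannot be averaged; the union is used instead (assumption).
    tags = set().union(*(p["tags"] for p in photos))
    mean_time = EPOCH + timedelta(seconds=sum(seconds) / len(seconds))
    return {"taken": mean_time, "gps": gps, "tags": tags}


cluster = [
    {"taken": datetime(2010, 12, 13, 2, 11, 10), "gps": (43, -8.40),
     "tags": ["live", "deerhunter"]},
    {"taken": datetime(2010, 12, 13, 3, 11, 10), "gps": None,
     "tags": ["live", "stones"]},
]
summary = summarize(cluster)
```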

Page 20: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Design

(c) Sequential merging of mini-clusters

Δt

avg(·) avg(·) avg(·) avg(·)

Page 21: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Design

(c) Sequential merging of mini-clusters
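
The sequential merging can be sketched as a single pass over the mini-clusters in temporal order. This is an illustration under assumptions: only a time distance between cluster averages is used, while the backup slides describe several weighted modalities, and `merge_sequential`, `mean_time_gap` and the threshold value are hypothetical.

```python
def merge_sequential(mini_clusters, distance, threshold):
    """Merge each mini-cluster into the previous one when they are close.

    mini_clusters: list of lists, already in temporal order.
    distance: function of two clusters returning a number.
    """
    merged = [list(mini_clusters[0])] if mini_clusters else []
    for cluster in mini_clusters[1:]:
        if distance(merged[-1], cluster) < threshold:
            merged[-1].extend(cluster)
        else:
            merged.append(list(cluster))
    return merged


def mean_time_gap(a, b):
    """Absolute difference between the average timestamps of two clusters."""
    return abs(sum(b) / len(b) - sum(a) / len(a))


# Timestamps in seconds, purely illustrative.
events = merge_sequential([[0, 10], [20, 30], [500, 510], [530]],
                          mean_time_gap, threshold=100)
```

Because each mini-cluster is compared only with the cluster built so far, the pass is linear in the number of mini-clusters, which fits the low-computation requirement.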


Page 22: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Outline (current section: Photo clustering, Results)

Page 23: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

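Results

For a photo x, the slide compares two sets of photos, e_x and c_x (presumably the ground-truth event and the predicted cluster that contain x):

\[ R_x = \frac{|c_x \cap e_x|}{|e_x|} \]

The per-photo values P_x and R_x lead to Precision (P) and Recall (R), which are combined into F1:

\[ F_1 = \frac{2PR}{P + R} \]

UPC: 3rd place out of 12 teams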
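
The per-photo scores can be computed as in the sketch below. Only the recall formula appears on the slide; the precision P_x = |c_x ∩ e_x| / |c_x| and the averaging over photos are assumptions of this example, and the official MediaEval scoring may differ in detail.

```python
def clustering_scores(cluster_of, event_of):
    """Return (P, R, F1) averaged over photos.

    cluster_of / event_of: dict mapping photo id -> predicted cluster
    label / ground-truth event label.
    """
    photos = list(event_of)
    precisions, recalls = [], []
    for x in photos:
        c_x = {y for y in photos if cluster_of[y] == cluster_of[x]}
        e_x = {y for y in photos if event_of[y] == event_of[x]}
        overlap = len(c_x & e_x)  # always >= 1, since x is in both sets
        precisions.append(overlap / len(c_x))  # assumed definition of P_x
        recalls.append(overlap / len(e_x))     # R_x as on the slide
    p = sum(precisions) / len(photos)
    r = sum(recalls) / len(photos)
    return p, r, 2 * p * r / (p + r)


# One true event split into two predicted clusters.
p, r, f1 = clustering_scores({"a": 1, "b": 1, "c": 2},
                             {"a": "x", "b": "x", "c": "x"})
```

Splitting a single event keeps precision at 1 but lowers recall, which is why oversegmentation alone scores poorly and the merging step matters.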

Page 24: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Outline (current section: Mail signature detection, Introduction)

Page 25: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Mail signature detection: Intro

Key topics

• Email information extraction
• SPAM detection
• Low computation

State of the art

Learning to extract signature and reply lines from email [Vitor R. Carvalho and William W. Cohen, 2004]

Page 26: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Outline (current section: Mail signature detection, Requirements)

Page 27: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Mail signature detection: Requirements

• Mail scraping service improvement
• Pre-process the input to reduce the execution time
• Adapt the mail scraping service to the Contactive product

User mailbox → mail scraping service (filter only signatures, less information) → MongoDB entries

Example MongoDB entry:

id 89012
name John Doe
email [email protected]
Id 7788455367_e
phone 789675463

Page 28: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Outline (current section: Mail signature detection, Design)

Page 29: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Design

(a) Split off the last K lines of the mail and retrieve their annotations

Last K lines → ground truth annotations

Slide background, excerpt from [Carvalho and Cohen, 2004]:

2. Problem Definition and Corpus

A signature block is the set of lines, usually in the end of a message, that contain information about the sender, such as personal name, affiliation, postal address, web address, email address, telephone number, etc. Quotes from famous persons and creative ASCII drawings are often present in this block also. An example of a signature block can be seen in last six lines of the email message pictured in Figure 1 (marked with the line label <sig>). Figure 1 also contains six lines of text that were quoted from a preceding message (marked with the line label <reply>). In this paper we will call such lines reply lines.

<other> From: [email protected]
<other> To: Vitor Carvalho <[email protected]>
<other> Subject: Re: Did you try to compile javadoc recently?
<other> Date: 25 Mar 2004 12:05:51 -0500
<other>
<other> Try cvs update -dP, this removes files & directories that have been
<other> deleted from cvs.
<other> - W
<other>
<reply> On Wed, 2004-03-24 at 19:58, Vitor Carvalho wrote:
<reply> > I’ve just checked-out the baseline m3 code and
<reply> > "Ant dist" is working fine, but "ant javadoc" is not.
<reply> > Thanks
<reply> > Vitor
<other>
<sig> ------------------------------------------------------------------
<sig> William W. Cohen “Would you drive a mime
<sig> [email protected] nuts if you played a
<sig> http://www.wcohen.com blank audio tape
<sig> Associate Research Professor full blast?”
<sig> CALD, Carnegie-Mellon University - S. Wright

Figure 1 - Excerpt from a labeled email message

Below we first consider the task of detecting signature blocks, that is, classifying messages as to whether or not they contain a signature block. We next consider signature line extraction. This is the task of classifying lines within a message as to whether or not they belong to a signature block. In our experiments, we perform signature line extraction only on messages which are known to contain a signature block.

To obtain a corpus of messages for signature block detection, we began with messages from the 20 Newsgroups dataset (Lang, 1995). We began by separating the messages into two groups P and N, using the following heuristic. We first looked for pairs of messages from the same sender and whose last T lines were identical. If T was larger than or equal to 6, then one of the messages from this sender (randomly chosen) was placed in group P (which contains messages likely to have a signature block). If T was less than or equal to 1, a sample message from this sender was placed in group N. These groups were supplemented with messages from our personal inboxes (to provide a sample of more recent emails) and manually checked for correctness. This resulted in a final set of 617 messages (all from different senders) containing a signature block, and a set of 586 messages not having a signature block.

For the extraction experiments, the 617-message dataset was manually annotated for signature lines. It was also annotated for reply lines (as in Figure 1). As noted above, the identification of reply lines can be helpful in tasks such as email threading, and certain types of content-based message classification; and as we will demonstrate below, our signature line extraction techniques can also be successfully applied to identifying reply lines. The final dataset has 33,013 lines. Of these, 3,321 lines are in signature blocks, and 5,587 are reply lines.
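
Step (a) might look like the sketch below for a message annotated with line labels in the style of Figure 1. The regular expression, the function names `parse_labeled` and `last_k`, and the value of K are assumptions for illustration only.

```python
import re

# One annotated line: a label in angle brackets, then the line text.
LABELED_LINE = re.compile(r"^<(other|reply|sig)> ?(.*)$")


def parse_labeled(message):
    """Return (label, text) pairs for every annotated line."""
    pairs = []
    for line in message.splitlines():
        match = LABELED_LINE.match(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


def last_k(pairs, k):
    """Keep only the last k annotated lines (all of them if fewer)."""
    return pairs[-k:] if k > 0 else []


message = "\n".join([
    "<other> - W",
    "<other>",
    "<reply> > Thanks",
    "<sig> http://www.wcohen.com",
    "<sig> CALD, Carnegie-Mellon University",
])
tail = last_k(parse_labeled(message), 2)
labels = [label for label, _ in tail]
lines = [text for _, text in tail]
```

Restricting the input to the last K lines is what keeps the later feature extraction and classification cheap, since signatures usually sit at the end of a message.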

Page 30: Low computational cost algorithms for photo clustering and mail signature detection in the cloud


Design

(b) Feature extraction

Lines → N feature patterns
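
A feature extractor of this kind can be sketched with regular expressions, one binary feature per pattern. The five patterns below are illustrative assumptions, not the N patterns used in the project, and `FEATURE_PATTERNS` and `extract_features` are hypothetical names.

```python
import re

# Illustrative patterns only; the real feature set is not listed on the slide.
FEATURE_PATTERNS = [
    ("email", re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")),
    ("url", re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)),
    ("phone", re.compile(r"\+?\d[\d\s().-]{6,}\d")),
    ("separator", re.compile(r"^\s*[-_=*]{2,}\s*$")),
    ("quoted", re.compile(r"^\s*>")),
]


def extract_features(line):
    """Map one mail line to a binary vector, one entry per pattern."""
    return [1 if pattern.search(line) else 0
            for _, pattern in FEATURE_PATTERNS]


feature_matrix = [extract_features(line) for line in [
    "jane@example.org",
    "http://www.wcohen.com",
    "Phone: +1 412 555 0100",
    "------------------",
    "> Thanks",
]]
```

Stacking the vectors of the K selected lines gives the K×N feature matrix that the training step consumes.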

Page 31: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Design

(c) SVM training and model generation


Feature matrix [K×N] + ground truth vector [K] → SVM training → Model
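
To make the training step concrete, here is a tiny linear SVM trained by sub-gradient descent on the hinge loss. It is a stand-in written for illustration: the slides do not say which SVM implementation or kernel was used, the hyperparameters are arbitrary, and a three-class setup (other, reply, signature) would need something like one-vs-rest on top of this binary version.

```python
def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Train a binary linear SVM.

    X: K feature vectors of length N; y: K labels in {-1, +1}.
    Returns the model as (weights, bias).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # L2 regularization: shrink the weights at every step.
            w = [wj * (1 - lr * lam) for wj in w]
            if margin < 1:
                # Hinge loss is active: move towards classifying xi correctly.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(model, x):
    """Return +1 (signature) or -1 (not signature) for one feature vector."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy data: the first feature alone separates the two classes.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, -1, -1]
model = train_linear_svm(X, y)
predictions = [predict(model, x) for x in X]
```

At prediction time a linear model costs one dot product per line, which is in keeping with the low-computation goal of the project.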

Page 32: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Design

(c) SVM training and model generation

Lines → pre-process → Features → Model → Classes

Classes:
● Other
● Reply
● Signature

Page 33: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Outline (current section: Mail signature detection, Results)

Page 34: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

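Results

With annotated dataset:

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \qquad \mathrm{Recall} = \frac{TP}{TP + FN} \]

\[ F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]

Without annotated dataset: manual evaluation on Contactive user base mailboxes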
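
The three metrics can be computed from line labels as follows. This is a minimal sketch; the label strings and the choice of `sig` as the positive class are assumptions for the example.

```python
def precision_recall_f1(true_labels, predicted_labels, positive="sig"):
    """Compute precision, recall and F1 for one class from per-line labels."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


truth = ["sig", "sig", "sig", "other", "other", "reply"]
predicted = ["sig", "sig", "other", "sig", "other", "reply"]
precision, recall, f1 = precision_recall_f1(truth, predicted)
```

In this toy case there are two true positives, one false positive and one false negative, so all three values equal 2/3.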

Page 35: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Outline (current section: Conclusions)

Page 36: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Conclusions

• Academic
  • Papers: MediaEval 2013 and ICMR SEWM published, MediaEval 2014 in preparation
  • UPC Pyxel framework foundations
• Industrial
  • Contributions to Pixable in production servers:
    • Instagram integration
    • Photofeed Downloader
  • Mail signature detection: successful proof of concept
• Work in the USA

Page 37: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Thank you very much!

Q&A

Page 38: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

BACKUP SLIDES


Page 39: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Design

(c) Sequential merging of mini-clusters

Weighted modalities:
● creation (or upload) time
● geolocation
● textual labels
● same user

Page 40: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Design

(c) Sequential merging of mini-clusters

Distance d per modality:
● Geolocation (d = haversine)
● Time stamp (d = L1)
● Text labels (d = Jaccard)
● Same user (d = boolean)
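
The four distances can be written with the standard library alone. The sketch below makes some assumptions: distances are returned in kilometres and seconds, two empty tag sets are given distance 0, and the boolean distance is encoded as 0.0 or 1.0. All function names are hypothetical.

```python
import math
from datetime import datetime


def haversine_km(a, b, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, h)))


def time_l1(t1, t2):
    """L1 distance between two timestamps, in seconds."""
    return abs((t2 - t1).total_seconds())


def jaccard(tags_a, tags_b):
    """Jaccard distance between two tag collections."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0  # assumption: two untagged photos are not penalized
    return 1.0 - len(a & b) / len(a | b)


def same_user(user_a, user_b):
    """Boolean distance: 0.0 for the same user, 1.0 otherwise."""
    return 0.0 if user_a == user_b else 1.0


d_geo = haversine_km((42.1, -10), (43, -8.40))
d_time = time_l1(datetime(2010, 12, 13, 2, 11, 10),
                 datetime(2010, 12, 13, 3, 11, 10))
d_tags = jaccard(["live", "stage", "deerhunter"], ["live", "deerhunter"])
d_user = same_user("John", "emily")
```

All four are cheap closed-form computations on metadata, so no image decoding is needed to compare two photos.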

Page 41: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Design

(c) Sequential merging of mini-clusters

Page 42: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Design

(c) Sequential merging of mini-clusters

Mean and standard deviation learned on pairs of photos within the same training event.
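
Learning these statistics could look like the sketch below, shown for a single modality. The slide does not say how the statistics are applied afterwards; the z-score in `normalize` and the use of the population standard deviation are assumptions, as are the function names.

```python
from itertools import combinations
from statistics import mean, pstdev


def learn_stats(events, distance):
    """Mean and std of a distance over all photo pairs inside each event.

    events: dict mapping event id -> list of photo values for one modality.
    """
    dists = [distance(a, b)
             for items in events.values()
             for a, b in combinations(items, 2)]
    return mean(dists), pstdev(dists)


def normalize(d, mu, sigma):
    """Express a distance in standard deviations from the learned mean."""
    return (d - mu) / sigma


# Toy training events with timestamps in minutes.
training = {"e1": [0, 10, 20], "e2": [100, 104]}
mu, sigma = learn_stats(training, lambda a, b: abs(a - b))
z = normalize(11, mu, sigma)
```

Normalizing each modality this way would put time, geolocation and tag distances on a comparable scale before they are weighted and combined.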

Page 43: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Design

(c) Sequential merging of mini-clusters

Phi function

Page 44: Low computational cost algorithms for photo clustering and mail signature detection in the cloud

Design

(c) Sequential merging of mini-clusters

Decision threshold