Low computational cost algorithms for photo clustering and mail signature detection in the cloud
Daniel Manchón
Co-directors: Xavi Giró (UPC), Omar Pera (Pixable)
1
Outline
• Motivation
• Tasks summary
• Pixable internship
• GPI research assistant
• Photo clustering
• Mail signature detection
• Conclusions
• Introduction
• Requirements
• Design
• Results
2
Motivation: Photo clustering
3
Motivation: Mail signature detection
4
Motivation: Cloud computing
5
Pixable internship
- Social photos aggregation
- Photo ranking
- Editorial content
- Contacts feeds
- Owned by Singtel

- Photo storage
- Synchronization across multiple devices
- Support for RAW

- CallerID application
- Multiple contact source support
- Contact backup and synchronization
- SPAM detection
7
Photofeed tasks
• Instagram source (in production)
• Referrals and invitations method
• New Relic integration
• Photo clustering and summarization
• Photo download service (in production)
8
Contactive tasks
• Mail scraping monitoring
• Signature detection
• Identity analysis improvement
• Tooling (in production)
9
GPI research assistant
• MediaEval 2013 (published paper)
• ICMR SEWM (published paper)
• Pyxel software framework
• MediaEval 2014
11
Multimedia retrieval conference
GPI: Image and Video Processing Group
Photo Clustering: Intro
Event detection
State of the art: PhotoTOC [Platt et al., PACRIM 2003]
13
Photo Clustering: Requirements
MediaEval requirements
• Event generation
• Visual and metadata information available
• F1 and NMI as evaluation metrics
• 400k annotated photo dataset
Photofeed constraints
• User data stored in the Amazon cloud and MongoDB
• Low computational cost
• Easily configurable through a REST API
15
Design
Hi, I’m John. Hi, I’m Emily.
(a) Temporal sorting of each user's photos independently
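Step (a) can be illustrated with a minimal sketch. The dictionary keys `username` and `taken` are hypothetical names chosen for this example; the slides do not show the actual data model.

```python
from collections import defaultdict
from datetime import datetime


def sort_photos_per_user(photos):
    """Group photos by user and sort each user's photos by capture time."""
    by_user = defaultdict(list)
    for photo in photos:
        by_user[photo["username"]].append(photo)
    for user_photos in by_user.values():
        user_photos.sort(key=lambda p: p["taken"])
    return dict(by_user)


photos = [
    {"username": "emily", "taken": datetime(2010, 12, 13, 3, 11, 10)},
    {"username": "john", "taken": datetime(2010, 9, 10, 2, 10, 12)},
    {"username": "emily", "taken": datetime(2010, 12, 13, 2, 11, 10)},
]
timelines = sort_photos_per_user(photos)
```

Each user's timeline can then be segmented on its own, which keeps the work per user small.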
17
Design
(b) Temporal-based oversegmentation into mini-clusters
PhotoTOC [Platt et al., PACRIM 2003]
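A rough sketch of a time-gap split in the spirit of PhotoTOC: a new mini-cluster starts where the logarithm of a gap is much larger than the local average of log-gaps. The function name, the `log(1 + gap)` smoothing and the default values of `k` and `d` are assumptions for illustration and would need tuning; this is not the exact implementation used in the project.

```python
import math


def split_by_time_gaps(timestamps, k=math.log(17), d=10):
    """Split sorted timestamps (in seconds) into mini-clusters at large gaps.

    A boundary is placed before photo i+1 when the log of the gap between
    photos i and i+1 exceeds the mean log-gap in a window of +/- d gaps
    by at least k.
    """
    if not timestamps:
        return []
    # log(1 + gap) avoids log(0) when two photos share the same second
    log_gaps = [math.log(1 + (b - a)) for a, b in zip(timestamps, timestamps[1:])]
    clusters = [[timestamps[0]]]
    for i, log_gap in enumerate(log_gaps):
        window = log_gaps[max(0, i - d): i + d + 1]
        local_mean = sum(window) / len(window)
        if log_gap >= k + local_mean:
            clusters.append([])
        clusters[-1].append(timestamps[i + 1])
    return clusters


# Photos one minute apart, then a pause of more than a day
mini_clusters = split_by_time_gaps([0, 60, 120, 180, 100000, 100060, 100120])
```

A lower `k` produces more, smaller mini-clusters, which suits an oversegmentation step that is followed by merging.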
18
Design
(b) Temporal-based oversegmentation into mini-clusters, each modelled by its mean values
19
Username = John | Time taken = 2010-09-10 02:10:12 | GPS = (42.1, -10) | Tags = live, stage, deerhunter
Username = emily | Time taken = 2010-12-13 02:11:10 | GPS = (43, -8.40) | Tags = live, deerhunter
Username = emily | Time taken = 2010-12-13 03:11:10 | GPS = (no data) | Tags = live, stones
Username = emily | Time taken = 2010-12-14 23:11:10 | GPS = (43.2, -8.2) | Tags = sound, test
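One way to model a mini-cluster by mean values is sketched below, using two of the example photos above. Averaging the time stamps and the available GPS positions follows the slide; taking the union of the tags is an assumption, since the slide does not say how tags are combined. The key names are hypothetical.

```python
from datetime import datetime, timedelta


def summarize_mini_cluster(photos):
    """Summarize a non-empty mini-cluster by mean time, mean GPS and tag union."""
    base = min(p["taken"] for p in photos)
    offsets = [p["taken"] - base for p in photos]
    mean_time = base + sum(offsets, timedelta()) / len(offsets)

    # Photos without geolocation are skipped when averaging positions
    coords = [p["gps"] for p in photos if p["gps"] is not None]
    mean_gps = None
    if coords:
        mean_gps = (
            sum(lat for lat, _ in coords) / len(coords),
            sum(lon for _, lon in coords) / len(coords),
        )

    tags = set().union(*(p["tags"] for p in photos))  # assumption: union of tags
    return {"username": photos[0]["username"], "taken": mean_time,
            "gps": mean_gps, "tags": tags}


mini_cluster = [
    {"username": "emily", "taken": datetime(2010, 12, 13, 2, 11, 10),
     "gps": (43.0, -8.40), "tags": {"live", "deerhunter"}},
    {"username": "emily", "taken": datetime(2010, 12, 13, 3, 11, 10),
     "gps": None, "tags": {"live", "stones"}},
]
summary = summarize_mini_cluster(mini_cluster)
```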
Design
(c) Sequential merging of mini-clusters
[Figure: mini-clusters on a timeline, each summarized by avg(·), with the time gap Δt between neighbours]
20
Results
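Precision ($P$) and Recall ($R$), where $c_x$ is the cluster and $e_x$ the event of item $x$:
$P_x = \frac{|c_x \cap e_x|}{|c_x|}$
$R_x = \frac{|c_x \cap e_x|}{|e_x|}$
F1:
$F_1 = \frac{2PR}{P + R}$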
UPC: 3rd place out of 12 teams!
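The precision, recall and F1 measures on this slide can be sketched as follows. The sketch assumes that P and R are the averages of the per-photo values over all photos; the slide does not state the averaging, so that part is an assumption, as are the function and variable names.

```python
from collections import defaultdict


def clustering_f1(clusters, events):
    """Compute P, R and F1 from two dicts mapping photo id -> label.

    `clusters` holds the predicted cluster of each photo and `events`
    the ground-truth event of the same photos.
    """
    cluster_members = defaultdict(set)
    event_members = defaultdict(set)
    for photo, label in clusters.items():
        cluster_members[label].add(photo)
    for photo, label in events.items():
        event_members[label].add(photo)

    precisions, recalls = [], []
    for photo in clusters:
        c_x = cluster_members[clusters[photo]]
        e_x = event_members[events[photo]]
        overlap = len(c_x & e_x)
        precisions.append(overlap / len(c_x))
        recalls.append(overlap / len(e_x))

    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    return p, r, 2 * p * r / (p + r)


events = {"a": "E1", "b": "E1", "c": "E1", "d": "E2"}
clusters = {"a": "C1", "b": "C1", "c": "C2", "d": "C2"}
p, r, f1 = clustering_f1(clusters, events)
```

In this toy case one photo of event E1 ends up in the wrong cluster, which lowers both precision and recall.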
23
Mail signature detection: Intro
Key topics
• Email information extraction
• SPAM detection
• Low computation
State of the art
• "Learning to extract signature and reply lines from email" [Vitor R. Carvalho and William W. Cohen, 2004]
25
Mail signature detection: Requirements
• Improve the mail scraping service
• Pre-process the input to reduce the execution time
• Adapt the mail scraping service to the Contactive product
[Diagram: User mailbox → filter only signatures (less information) → Mail scraping service → MongoDB entries]
Example MongoDB entry:
id 89012
name John Doe
email [email protected]
Id 7788455367_e
phone 789675463
27
Design
2. Problem Definition and Corpus
A signature block is the set of lines, usually in the end of a message, that contain information about the sender, such as personal name, affiliation, postal address, web address, email address, telephone number, etc. Quotes from famous persons and creative ASCII drawings are often present in this block also. An example of a signature block can be seen in last six lines of the email message pictured in Figure 1 (marked with the line label <sig>). Figure 1 also contains six lines of text that were quoted from a preceding message (marked with the line label <reply>). In this paper we will call such lines reply lines.
<other> From: [email protected]
<other> To: Vitor Carvalho <[email protected]>
<other> Subject: Re: Did you try to compile javadoc recently?
<other> Date: 25 Mar 2004 12:05:51 -0500
<other>
<other> Try cvs update –dP, this removes files & directories that have been
<other> deleted from cvs.
<other> - W
<other>
<reply> On Wed, 2004-03-24 at 19:58, Vitor Carvalho wrote:
<reply> > I’ve just checked-out the baseline m3 code and
<reply> > "Ant dist" is working fine, but "ant javadoc" is not.
<reply> > Thanks
<reply> > Vitor
<other>
<sig> ------------------------------------------------------------------
<sig> William W. Cohen “Would you drive a mime
<sig> [email protected] nuts if you played a
<sig> http://www.wcohen.com blank audio tape
<sig> Associate Research Professor full blast?”
<sig> CALD, Carnegie-Mellon University - S. Wright
Figure 1 - Excerpt from a labeled email message
Below we first consider the task of detecting signature blocks, that is, classifying messages as to whether or not they contain a signature block. We next consider signature line extraction. This is the task of classifying lines within a message as to whether or not they belong to a signature block. In our experiments, we perform signature line extraction only on messages which are known to contain a signature block.
To obtain a corpus of messages for signature block detection, we began with messages from the 20 Newsgroups dataset (Lang, 1995). We began by separating the messages into two groups P and N, using the following heuristic. We first looked for pairs of messages from the same sender and whose last T lines were identical. If T was larger than or equal to 6, then one of the messages from this sender (randomly chosen) was placed in group P (which contains messages likely to have a signature block). If T was less than or equal to 1, a sample message from this sender was placed in group N. These groups were supplemented with messages from our personal inboxes (to provide a sample of more recent emails) and manually checked for correctness. This resulted in a final set of 617 messages (all from different senders) containing a signature block, and a set of 586 messages not having a signature block.
For the extraction experiments, the 617-message dataset was manually annotated for signature lines. It was also annotated for reply lines (as in Figure 1). As noted above, the identification of reply lines can be helpful in tasks such as email threading, and certain types of content-based message classification; and as we will demonstrate below, our signature line extraction techniques can also be successfully applied to identifying reply lines. The final dataset has 33,013 lines. Of these, 3,321 lines are in signature blocks, and 5,587 are reply lines.
(a) Split off the last K lines of the mail and retrieve their annotations
Last K lines
Ground truth annotations
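Step (a) might look like the sketch below, assuming the annotated corpus stores one line per row with a leading `<other>`, `<reply>` or `<sig>` label as in the Figure 1 excerpt. The function name and the storage format are assumptions for illustration.

```python
import re

# Label prefix as used in the Figure 1 excerpt, followed by the line text
LABEL_RE = re.compile(r"^<(other|reply|sig)>\s?(.*)$")


def last_k_labeled_lines(labeled_message, k):
    """Return (lines, labels) for the last k annotated lines of a message."""
    pairs = []
    for raw in labeled_message.splitlines():
        match = LABEL_RE.match(raw)
        if match:
            pairs.append((match.group(2), match.group(1)))
    tail = pairs[-k:] if k > 0 else []
    lines = [text for text, _ in tail]
    labels = [label for _, label in tail]
    return lines, labels


message = "\n".join([
    "<other> Try cvs update",
    "<other> - W",
    "<other>",
    "<sig> ----",
    "<sig> William W. Cohen",
    "<sig> http://www.wcohen.com",
])
lines, labels = last_k_labeled_lines(message, 4)
```

Restricting the classifier to the last K lines keeps the amount of text to process per mail bounded, which matches the low-computation requirement.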
29
Design
(b) Feature extraction
Lines
N feature patterns
30
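The feature extraction step (b) can be sketched as a list of N regular-expression patterns, each producing one binary feature per line. The five patterns below are hypothetical examples; the slides do not list the patterns that were actually used.

```python
import re

# Hypothetical feature patterns (N = 5); each yields one binary feature
FEATURE_PATTERNS = [
    ("email", re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")),
    ("url", re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)),
    ("phone", re.compile(r"\+?\d[\d\s().-]{6,}\d")),
    ("separator", re.compile(r"^\s*[-_=*]{2,}\s*$")),
    ("quoted_reply", re.compile(r"^\s*>")),
]


def extract_features(line):
    """Return one 0/1 value per pattern for a single mail line."""
    return [1 if pattern.search(line) else 0 for _, pattern in FEATURE_PATTERNS]


def feature_matrix(lines):
    """Stack the feature vectors of K lines into a K x N matrix (list of lists)."""
    return [extract_features(line) for line in lines]


matrix = feature_matrix([
    "http://www.wcohen.com",
    "------------------",
    "> Thanks",
    "Tel: +1 412 555 0100",
])
```

Simple patterns like these are cheap to evaluate, at the cost of occasional false hits (for example, a date can look like a phone number).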
Design
(c) SVM training and model generation
31
Feature matrix [K×N] + ground-truth vector [K] → SVM training → Model
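To make the training step concrete, here is a dependency-free sketch of a linear SVM trained by stochastic sub-gradient descent on the L2-regularised hinge loss. It is binary (signature versus everything else); handling the three classes of the slides, for instance one-vs-rest, is left out. The function names, the hyper-parameters and the toy feature matrix are assumptions, and the project may well have used an existing SVM library instead.

```python
def train_linear_svm(X, y, epochs=300, lr=0.125, reg=0.001):
    """Train a linear SVM; X is a K x N list of lists, y holds +1/-1 labels."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # L2 regularisation: shrink the weights a little at every step
            w = [wj * (1.0 - lr * reg) for wj in w]
            if margin < 1.0:
                # Hinge loss is active: move the hyperplane towards the sample
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(model, x):
    """Return +1 (signature line) or -1 (any other line)."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy K x N matrix with columns: email, url, phone, separator, quoted reply
X = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
]
y = [1, 1, 1, 1, -1, -1]
model = train_linear_svm(X, y)
```

The trained model is just a weight vector and a bias, so classifying a line costs one dot product.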
Design
(c) SVM training and model generation
Lines → pre-process → Features → Model → Classes (● Other ● Reply ● Signature)
32
Results
$\text{Precision} = \frac{TP}{TP + FP}$
$\text{Recall} = \frac{TP}{TP + FN}$
$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
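These three measures can be computed from per-line labels as in the sketch below. Treating `sig` as the positive class and the function name are assumptions made for the example.

```python
def precision_recall_f1(true_labels, predicted_labels, positive="sig"):
    """Compute precision, recall and F1 for one class from two label lists."""
    tp = fp = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == positive and truth == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif truth == positive:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


truth = ["other", "reply", "sig", "sig", "sig", "other"]
predicted = ["other", "sig", "sig", "sig", "other", "other"]
precision, recall, f1 = precision_recall_f1(truth, predicted)
```

Here two signature lines are found, one is missed and one reply line is wrongly flagged, so TP = 2, FP = 1 and FN = 1.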
34
With annotated dataset
Without annotated dataset: manual evaluation on Contactive user base mailboxes
Conclusions
• Academic
  • Papers: MediaEval 2013 and ICMR SEWM; MediaEval 2014 in preparation.
  • UPC Pyxel framework foundations
• Industrial
  • Contributions to Pixable production servers:
    • Instagram integration
    • Photofeed Downloader
  • Mail signature detection: successful proof of concept.
  • Work in the USA!
36
Thank you very much!
Q&A
37
BACKUP SLIDES
38
Design
39
(c) Sequential merging of mini-clusters
Weighted modalities
● Creation (or upload) time
● Geolocation
● Textual labels
● Same user
Design
40
(c) Sequential merging of mini-clusters
● Time stamp (d = L1)
● Geolocation (d = haversine)
● Text labels (d = Jaccard)
● Same user (d = boolean)
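The four per-modality distances named on this slide can be sketched as below. The function names, the Earth radius constant, the kilometre unit and the convention that two empty tag sets have distance 0 are assumptions for illustration.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant


def time_distance(t1, t2):
    """L1 distance between two time stamps, in seconds."""
    return abs((t1 - t2).total_seconds())


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def jaccard_distance(tags_a, tags_b):
    """1 - |A ∩ B| / |A ∪ B|; two empty tag sets count as identical."""
    union = tags_a | tags_b
    if not union:
        return 0.0
    return 1.0 - len(tags_a & tags_b) / len(union)


def user_distance(user_a, user_b):
    """Boolean distance: 0 for the same user, 1 otherwise."""
    return 0.0 if user_a == user_b else 1.0


d_time = time_distance(datetime(2010, 12, 13, 2, 11, 10),
                       datetime(2010, 12, 13, 3, 11, 10))
d_tags = jaccard_distance({"live", "stage", "deerhunter"}, {"live", "deerhunter"})
d_user = user_distance("John", "emily")
```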
Design
41
(c) Sequential merging of mini-clusters
Design
42
(c) Sequential merging of mini-clusters
Mean and standard deviation learned on pairs of photos within the same training event.
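Learning these statistics could look like the following sketch, run once per modality. The data layout (a dict from event id to its items) and the use of the population standard deviation are assumptions.

```python
from itertools import combinations
from statistics import mean, pstdev


def learn_distance_stats(events, distance):
    """Mean and std. deviation of a distance over all same-event pairs.

    `events` maps an event id to the list of its items; `distance` is a
    function taking two items.
    """
    values = [
        distance(a, b)
        for members in events.values()
        for a, b in combinations(members, 2)
    ]
    return mean(values), pstdev(values)


# Toy training set: time stamps (in minutes) grouped by annotated event
training_events = {"e1": [0, 10, 20], "e2": [100, 130]}
mu, sigma = learn_distance_stats(training_events, lambda a, b: abs(a - b))
```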
Design
43
(c) Sequential merging of mini-clusters
phi function
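The slide does not define the phi function. The sketch below assumes it is the cumulative distribution function of the standard normal distribution, applied to a distance after standardising it with the learned mean and standard deviation, so that every modality maps to a value between 0 and 1.

```python
import math


def phi(z):
    """Standard normal CDF (assumed meaning of the slide's phi function)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normalized_distance(d, mean, std):
    """Map a raw distance to [0, 1] using the learned mean and std. deviation."""
    return phi((d - mean) / std)


typical = normalized_distance(17.5, 17.5, 8.0)   # distance equal to the mean
far = normalized_distance(100.0, 17.5, 8.0)      # much larger than usual
```

A distance equal to the training mean maps to 0.5, while unusually large distances approach 1.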
Design
44
(c) Sequential merging of mini-clusters
Decision threshold
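How weighted modalities and a decision threshold might drive the sequential merging is sketched below. Fusing by weighted average and comparing each mini-cluster only with the one just before it are assumptions; the slides do not spell out either detail, and all names and values are illustrative.

```python
def fused_distance(distances, weights):
    """Weighted average of normalized per-modality distances in [0, 1]."""
    total_weight = sum(weights[name] for name in distances)
    return sum(weights[name] * distances[name] for name in distances) / total_weight


def merge_sequentially(mini_clusters, distance, threshold):
    """Merge each mini-cluster with its predecessor when they are close enough."""
    if not mini_clusters:
        return []
    groups = [[mini_clusters[0]]]
    for current in mini_clusters[1:]:
        previous = groups[-1][-1]
        if distance(previous, current) < threshold:
            groups[-1].append(current)
        else:
            groups.append([current])
    return groups


score = fused_distance({"time": 0.2, "geo": 0.6}, {"time": 3.0, "geo": 1.0})

# Toy mini-clusters described only by their mean time in hours
events = merge_sequentially(
    [0, 1, 2, 30, 31],
    distance=lambda a, b: min(1.0, abs(a - b) / 24),
    threshold=0.5,
)
```

Raising the threshold merges more aggressively and yields fewer, larger events.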