1
Modeling Intention in Email
Vitor R. Carvalho
Ph.D. Thesis Defense, July 22nd, 2008
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Thesis Committee: William W. Cohen (chair), Tom M. Mitchell, Robert E. Kraut, Lise Getoor (Univ. Maryland)
2
Outline
1. Motivation
2. Email Acts
3. Email Leaks
4. Recommending Email Recipients
5. Learning Robust Rank Models
6. User Study
3
Why Email
The most successful e-communication application. A great tool for collaboration, especially across time zones. Very cheap, fast, and convenient. Multiple uses: task manager, contact manager, document archive, to-do list, etc.
Increasingly popular: the Clinton administration left 32 million emails to the National Archives; the Bush administration is expected to leave more than 100 million in 2009.
Visible impact: office workers in the U.S. spend at least 25% of the day on email, not counting handheld use.
[Shipley & Schwalbe, 2007]
4
Hard to Manage
People get overwhelmed: costly interruptions and serious impacts on work productivity. It is increasingly difficult to manage requests, negotiate shared tasks, and keep track of different commitments.
People make horrible mistakes: sending messages to the wrong persons, or forgetting to address intended recipients. "Oops, did I just hit reply-to-all?"
[Dabbish & Kraut, CSCW-2006] [Bellotti et al., HCI-2005]
5
Thesis
► We present evidence that email management can potentially be improved by the effective use of machine learning techniques to model different aspects of user intention.
6
Outline
1. Motivation
2. Email Acts ◄
3. Preventing Email Information Leaks
4. Recommending Email Recipients
5. Learning Robust Rank Models
6. User Study
7
Classifying Email into Acts [Cohen, Carvalho & Mitchell, EMNLP-04]
Verbs:
- Commissive: Deliver, Commit
- Directive: Request, Propose, Amend
Nouns:
- Activity: Ongoing Event, Meeting, Other
- Delivery: Opinion, Data
An act is described as a verb-noun pair (e.g., propose meeting, request information); not all pairs make sense.
A single email message may contain multiple acts.
The taxonomy tries to describe commonly observed behaviors, rather than all possible speech acts in English.
It also includes non-linguistic usage of email (e.g., delivery of files).
8
Data & Features
Data: Carnegie Mellon MBA students competition
- A semester-long project for CMU MBA students. Total of 277 students, divided into 50 teams (4 to 6 students per team). Rich in task negotiation.
- 1700+ messages (from 5 teams) were manually labeled. One of the teams was double-labeled; inter-annotator agreement ranges from 0.72 to 0.83 (Kappa) for the most frequent acts.
Features: n-grams (1-gram through 5-gram)
Pre-processing:
- Remove signature files and quoted lines (in-reply-to) [Jangada package]
- Entity normalization and substitution patterns: "Sunday"..."Monday" → [day]; [number]:[number] → [hour]; "me", "her", "him", "us", or "them" → [me]; "after", "before", or "during" → [time]; etc.
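The substitution step above can be sketched as ordered regular-expression rewrites; the pattern lists here are illustrative, not the exact ones used in the thesis.

```python
import re

# Sketch of entity normalization; the exact pattern lists are assumptions.
PATTERNS = [
    (r"\b(sunday|monday|tuesday|wednesday|thursday|friday|saturday)\b", "[day]"),
    (r"\b\d{1,2}:\d{2}\b", "[hour]"),
    (r"\b(me|her|him|us|them)\b", "[me]"),
    (r"\b(after|before|during)\b", "[time]"),
]

def normalize(text: str) -> str:
    text = text.lower()
    for pattern, token in PATTERNS:
        text = re.sub(pattern, token, text)
    return text

print(normalize("Meet me after 10:30 on Monday"))
# → meet [me] [time] [hour] on [day]
```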
9
Classification Performance
5-fold cross-validation over 1716 emails, SVM with linear kernel
[Carvalho & Cohen, HLT-ACTS-06] [Cohen, Carvalho & Mitchell, EMNLP-04]
10
Predicting Acts from Surrounding Acts
[Diagram: an example email thread sequence, with acts such as Request, Propose, Commit, and Deliver linked across messages]
There is a strong correlation between the previous and the next message's acts.
Both context and content have predictive value for email act classification.
[Carvalho & Cohen, SIGIR-05]
Collective classification problem → Dependency Network
11
Collective Classification with Dependency Networks (DN)
In DNs, the full joint probability distribution is approximated with a set of conditional distributions that can be learned independently. The conditional probabilities are calculated for each node given its Markov blanket.
Pr(X) ≈ ∏_i Pr(X_i | Blanket(X_i))
[Diagram: parent message, current message, and child message, each with Request / Commit / Deliver / ... nodes]
[Heckerman et al., JMLR-00] [Neville & Jensen, JMLR-07]
Inference: temperature-driven Gibbs sampling
[Carvalho & Cohen, SIGIR-05]
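A toy sketch of temperature-driven Gibbs sampling over a thread of messages. The hand-made conditional table stands in for the learned per-node classifiers, and the act set, table values, and annealing schedule are all illustrative assumptions.

```python
import math
import random

# Toy acts; the real taxonomy is richer.
ACTS = ["Request", "Commit", "Deliver"]

def conditional(neighbors):
    # Hypothetical conditional scores for a node given its Markov blanket:
    # a Request nearby makes Commit likely, a Commit makes Deliver likely.
    scores = {a: 1.0 for a in ACTS}
    if "Request" in neighbors:
        scores["Commit"] += 2.0
    if "Commit" in neighbors:
        scores["Deliver"] += 2.0
    return scores

def gibbs(thread_len=4, iters=200, seed=0):
    rng = random.Random(seed)
    labels = [rng.choice(ACTS) for _ in range(thread_len)]
    for t in range(iters):
        temp = max(0.1, 1.0 - t / iters)   # anneal toward greedy sampling
        for i in range(thread_len):
            nbrs = [labels[j] for j in (i - 1, i + 1) if 0 <= j < thread_len]
            scores = conditional(nbrs)
            # sharpen the conditional as temperature drops: s^(1/temp)
            weights = [math.exp(math.log(scores[a]) / temp) for a in ACTS]
            r, acc = rng.random() * sum(weights), 0.0
            for a, w in zip(ACTS, weights):
                acc += w
                if r <= acc:
                    labels[i] = a
                    break
    return labels

print(gibbs())
```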
12
Act by Act Comparative Results
Kappa values (%) with and without collective classification, averaged over four team test sets in the leave-one-team-out experiment:

Act         Baseline   Collective
Commissive  37.66      42.55
Commit      30.74      32.77
Meeting     47.81      52.42
Directive   58.27      58.37
Request     47.25      49.55
Propose     36.84      40.72
Deliver     42.01      38.69
Data        44.98      43.44

Modest improvements over the baseline, and only on acts related to negotiation: Request, Commit, Propose, Meeting, Commissive, etc.
13
Key Ideas Summary
- Introduced a new taxonomy of acts tailored to email communication, with good levels of inter-annotator agreement
- Showed that classification of email acts can be automated
- Proposed a collective classification algorithm for threaded messages
Related work: Speech Act Theory [Austin, 1962; Searle, 1969]; the Coordinator system [Winograd, 1987]; Dialog Acts for speech recognition, machine translation, and other dialog-based systems [Stolcke et al., 2000], [Levin et al., 2003], etc.
Related applications: focus messages in threads/discussions [Feng et al., 2006], action-item discovery [Bennett & Carbonell, 2005], task-focused email summary [Corston-Oliver et al., 2004], predicting social roles [Leusky, 2004], etc.
14
Applications of Email Acts
- Iterative learning of email tasks and email acts [Kushmerick & Khousainov, IJCAI-05]
- Predicting social roles and group leadership [Leusky, SIGIR-04] [Carvalho et al., CEAS-07]
- Detecting focus on threaded discussions [Feng et al., HLT/NAACL-06]
- Semantically enhanced email [Scerri et al., DEXA-07]
- Email act taxonomy refinements [Lampert et al., AAAI-2008 EMAIL]
15
Outline
1. Motivation
2. Email Acts
3. Preventing Email Information Leaks ◄
4. Recommending Email Recipients ◄
5. Learning Robust Rank Models
6. User Study
16
17
18
http://www.sophos.com/
19
Preventing Email Info Leaks
Email leak: an email accidentally sent to the wrong person, caused by:
1. Similar first or last names, aliases, etc.
2. Aggressive auto-completion of email addresses
3. Typos
4. Keyboard settings
Disastrous consequences: expensive lawsuits, brand reputation damage, negotiation setbacks, etc.
[Carvalho & Cohen, SDM-07]
20
Preventing Email Info Leaks: Method
1. Create simulated/artificial email leak recipients.
2. Build a model for (message, recipients): train a classifier on real data to detect synthetically created outliers (added to the true recipient list). Features: textual (subject, body) and network features (frequencies, co-occurrences, etc.).
3. Detect potential outliers and warn the user, based on confidence.
[Carvalho & Cohen, SDM-07]
21
Simulating Email Leaks
Else: Randomly select an address book entry
321
Marina.wang @enron.comGenerate a random email address NOT in Address Book
Several options: Frequent typos, same/similar last names, identical/similar first
names, aggressive auto-completion of addresses, etc.
We adopted the 3g-address criteria: On each trial, one of the msg recipients is randomly chosen
and an outlier is generated according to:
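A minimal sketch of this simulation. The mixing probability `p`, the trigram-overlap test, and the toy address book are assumptions here, not the exact 3g-address procedure.

```python
import random

def trigrams(s):
    # character 3-grams of an address
    return {s[i:i + 3] for i in range(len(s) - 2)}

def simulate_leak(recipients, address_book, p=0.8, rng=random):
    # pick one true recipient at random
    target = rng.choice(recipients)
    # candidate outliers sharing a character trigram with the target
    similar = [a for a in address_book
               if a not in recipients and trigrams(a) & trigrams(target)]
    if similar and rng.random() < p:
        return rng.choice(similar)      # similar-name / typo-style confusion
    others = [a for a in address_book if a not in recipients]
    return rng.choice(others)           # else: random address-book entry

book = ["marina.wang@enron.com", "maria.wong@enron.com", "kitchen-l@enron.com"]
print(simulate_leak(["marina.wang@enron.com"], book, rng=random.Random(1)))
```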
22
Data and Baselines
Enron email dataset, with a realistic setting: for each user, the ~10% most recent sent messages were used as the test set. Some basic preprocessing.
Baseline methods based on textual similarity (common baselines in IR):
- Rocchio/TFIDF Centroid [1971]: create a "TF-IDF centroid" for each user in the address book; for testing, rank according to cosine similarity between the test message and each centroid.
- KNN-30 [Yang & Chute, 1994]: given a test message, get the 30 most similar messages in the training set, then rank according to the "sum of similarities" of a given user over the 30-message set.
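On toy data, the Rocchio/TFIDF-centroid baseline can be sketched as below; the +1 smoothing and the example messages are assumptions.

```python
import math
from collections import Counter

def vectorize(doc, df, n):
    # TF-IDF weights with add-one smoothing (an assumption for tiny corpora)
    tf = Counter(doc.split())
    return {t: tf[t] * math.log((n + 1) / (df.get(t, 0) + 1)) for t in tf}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# hypothetical messages previously sent to each address-book entry
sent = {"alice": ["budget report draft", "budget meeting notes"],
        "bob":   ["server outage logs", "server patch schedule"]}
docs = [d for msgs in sent.values() for d in msgs]
df = Counter(t for d in docs for t in set(d.split()))
n = len(docs)

def centroid(msgs):
    c = Counter()
    for d in msgs:
        c.update(vectorize(d, df, n))
    return {t: w / len(msgs) for t, w in c.items()}

centroids = {u: centroid(m) for u, m in sent.items()}
query = vectorize("quarterly budget summary", df, n)
ranking = sorted(centroids, key=lambda u: cosine(query, centroids[u]),
                 reverse=True)
print(ranking)  # → ['alice', 'bob']
```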
23
Using Non-Textual Features
1. Frequency features: number of received, sent, and sent+received messages (from this user).
2. Co-occurrence features: number of times a user co-occurred with all other recipients.
3. Auto features: for each recipient R, find Rm (the address with the maximum score from the 3g-address list of R), then use score(R) - score(Rm) as a feature.
These non-textual features are combined with the text-based score (KNN-30 or TFIDF) using perceptron-based reranking, trained on simulated leaks.
24
Email Leak Results [Carvalho & Cohen, SDM-07]

Accuracy (Prec@1) on leak prediction:

Method             Prec@1
Random             0.406
TfIdf              0.558
Knn30              0.560
Knn30+Frequency    0.804
Knn30+Cooccur      0.748
Knn30+Auto         0.718
Knn30+All_Above    0.814
25
Finding Real Leaks in Enron
How can we find them? Grep for "mistake", "sorry", or "accident" (e.g., "Sorry. Sent this to you by mistake.", "I accidentally sent you this reminder"). Note: the leak must be from one of the Enron users.
Found 2 valid cases:
1. Message germany-c/sent/930: the message has 20 recipients; the leak is alex.perkins@
2. Message kitchen-l/sent items/497: it has 44 recipients; the leak is rita.wynne@
Prediction results: the proposed algorithm was able to find these two leaks.
26
Another Email Addressing Problem
Sometimes people just forget an intended recipient
27
Forgetting an intended recipient
Particularly in large organizations, it is not uncommon to forget to CC an important collaborator: a manager, a colleague, a contractor, an intern, etc.
This is more frequent than expected (from the Enron collection): at least 9.27% of the users have forgotten to add a desired email recipient, and at least 20.52% of the users were not included as recipients (even though they were intended recipients) in at least one received message.
The cost of such errors in task management can be high: communication delays, missed deadlines, wasted opportunities, costly misunderstandings, task delays.
[Carvalho & Cohen, ECIR-2008]
28
Data and Features
Easy to obtain labeled data.
Two ranking problems: predicting TO+CC+BCC, and predicting CC+BCC.
Features & methods:
- Textual: Rocchio (TFIDF) and KNN
- Non-textual: Frequency, Recency, and Co-occurrence
  - Number of messages received and/or sent (from/to this user)
  - How often a particular user was addressed in the last 100 messages
  - Number of times a user co-occurred with all other recipients ("co-occur" means two recipients were addressed in the same message in the training set)
29
Email Recipient Recommendation
[Bar chart: MAP on the TOCCBCC and CCBCC tasks for Frequency, Recency, TFIDF, KNN, and Perceptron]
36 Enron users, 44,000+ queries (avg. ~1267 queries/user). Best models reach MRR ~ 0.5.
[Carvalho & Cohen, ECIR-08]
30
Rank Aggregation
Many "data fusion" methods, of 2 types:
- Normalized scores: CombSUM, CombMNZ, etc.
- Unnormalized scores: BordaCount, Reciprocal Rank Sum, etc.
Reciprocal Rank: the sum of the inverse of the rank of the document in each ranking:
RR(d_i) = Σ_{q ∈ Rankings} 1 / rank_q(d_i)
[Aslam & Montague, 2001]; [Ogilvie & Callan, 2003]; [Macdonald & Ounis, 2006]
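A minimal sketch of reciprocal-rank aggregation; the example rankings are hypothetical.

```python
# Each candidate's fused score is the sum, over the base rankings, of
# 1 / rank, with ranks counted from 1.
def reciprocal_rank_fusion(rankings):
    scores = {}
    for ranking in rankings:
        for pos, cand in enumerate(ranking, start=1):
            scores[cand] = scores.get(cand, 0.0) + 1.0 / pos
    return sorted(scores, key=scores.get, reverse=True)

# fusing, say, a frequency-based and a recency-based recipient ranking
fused = reciprocal_rank_fusion([["ann", "bob", "carl"],
                                ["bob", "carl", "ann"]])
print(fused)  # → ['bob', 'ann', 'carl']
```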
31
Rank Aggregation Results
[Bar chart: MAP on the TOCCBCC and CCBCC tasks, comparing Frequency, Recency, TFIDF, KNN, Perceptron, and Fusion (MRR)]
32
Intelligent Email Auto-completion
TOCCBCC
CCBCC
[Carvalho & Cohen, ECIR-08]
33
Related Work
Email leak:
- [Boufaden et al., 2005]: proposed a privacy enforcement system to monitor specific privacy breaches (student names, student grades, IDs)
- [Lieberman and Miller, 2007]: prevent leaks based on faces
Recipient recommendation:
- [Pal & McCallum, 2006], [Dredze et al., 2008]: the CC prediction problem; recipient prediction based on summary keywords
Expert search in email:
- [Dom et al., 2003], [Campbell et al., 2003], [Balog & de Rijke, 2006], [Balog et al., 2006], [Soboroff, Craswell, de Vries (TREC Enterprise 2005-07)]
34
Outline
1. Motivation
2. Email Acts
3. Preventing Email Information Leaks
4. Recommending Email Recipients
5. Learning Robust Ranking Models ◄
6. User Study
35
Can we learn a better ranking function?
Learning to Rank: machine learning to improve ranking. Many recently proposed methods:
- RankSVM [Joachims, KDD-02]
- RankBoost [Freund et al., 2003]
- Committee of Perceptrons [Elsas, Carvalho & Carbonell, WSDM-08]
Meta-learning method: learn robust ranking models in the pairwise-based framework.
36
Pairwise-based Ranking
Given a ranking for query q over documents d_1, d_2, ..., d_T, the goal is to induce a ranking function f(d) such that:
f(d_i) > f(d_j) ⟺ d_i ≻ d_j
We assume a linear function f:
f(d_i) = ⟨w, d_i⟩ = w_1 x_{i1} + w_2 x_{i2} + ... + w_m x_{im}
which gives the pairwise constraints:
d_i ≻ d_j ⟹ ⟨w, d_i - d_j⟩ > 0
Each paired instance is a difference of feature vectors, e.g. (x_{i1} - x_{j1}, ..., x_{im} - x_{jm}).
Caveat: O(n) mislabels produce O(n^2) mislabeled pairs.
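The pairwise construction can be sketched as follows. One mislabeled document corrupts every pair it participates in, which is why O(n) label noise yields O(n^2) noisy pairs.

```python
# Turn per-query relevance labels into pairwise difference instances
# x_i - x_j (relevant minus non-relevant), as in the pairwise framework.
def pairwise_instances(docs):
    # docs: list of (feature_vector, is_relevant)
    pairs = []
    for xi, ri in docs:
        for xj, rj in docs:
            if ri and not rj:
                pairs.append([a - b for a, b in zip(xi, xj)])
    return pairs

docs = [([1.0, 0.0], True), ([0.0, 1.0], False), ([0.5, 0.5], False)]
print(len(pairwise_instances(docs)))  # → 2 (1 relevant x 2 non-relevant)
```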
37
Effect of Pairwise Outliers
[Figure: RankSVM on SEAL-1]
38
Effect of Pairwise Outliers
[Plot: the RankSVM (hinge) loss function over the pairwise score]
39
Effect of Pairwise Outliers
[Plot: a sigmoid-shaped loss function over the pairwise score, compared to the RankSVM loss]
Robust to outliers, but not convex.
40
Ranking Models: 2 Stages
1. Base ranker: a base ranking model, e.g., RankSVM, Perceptron, ListNet, etc.
2. Sigmoid Rank (final model): minimize (a very close approximation of) the empirical error, i.e., the number of misranks. Non-convex, but robust to outliers (label noise).
41
Learning
SigmoidRank loss, minimized with gradient descent:
min_w L_SigmoidRank(w) = Σ_{(d_R, d_NR) ∈ P} sigmoid[1 - ⟨w, (d_R - d_NR)⟩]
where sigmoid(x) = 1 / (1 + e^{-x}).
Gradient descent update:
w^{(k+1)} = w^{(k)} - η ∇_w L_SigmoidRank(w^{(k)})
∇_w L_SigmoidRank(w^{(k)}) = -Σ_{(d_R, d_NR) ∈ P} sigmoid(z) [1 - sigmoid(z)] (d_R - d_NR),  where z = 1 - ⟨w^{(k)}, (d_R - d_NR)⟩
43
Email Recipient Results [Carvalho, Elsas, Cohen and Carbonell, SIGIR 2008 LR4IR]
[Bar chart: MAP on the TOCCBCC task for Frequency, Recency, TFIDF, KNN, Percep, MRR (Fusion), RankSVM, RankSVM+Sigmoid, and Percep+Sigmoid]
44
Email Recipient Results [Carvalho, Elsas, Cohen and Carbonell, SIGIR 2008 LR4IR]
[Bar chart: MAP on the CCBCC task for Frequency, Recency, TFIDF, KNN, Percep, MRR (Fusion), RankSVM, RankSVM+Sigmoid, and Percep+Sigmoid]
45
Set Expansion (SEAL) Results [Wang & Cohen, ICDM-2007] [Carvalho et al., SIGIR 2008 LR4IR]
[Bar chart: MAP on SEAL-1, SEAL-2, and SEAL-3 for Percep, Percep+Sigmoid, RankSVM, and RankSVM+Sigmoid]
[18 features, ~120/60 train/test splits, ~half relevant]
46
Letor Results [Carvalho et al., SIGIR 2008 LR4IR]
[Bar chart: MAP on Ohsumed, Trec3, and Trec4 for Percep, Percep+Sigmoid, RankSVM, and RankSVM+Sigmoid]
[#queries/#features: (106/25) (50/44) (75/44)]
47
Related Work
Classification with non-convex loss functions, trading off outlier robustness, accuracy, scalability, etc.: [Perez-Cruz et al., 2003], [Xu et al., 2006], [Zhan & Shen, 2005], [Collobert et al., 2006], [Liu et al., 2005], [Yang and Hu, 2008]
Ranking with other non-convex loss functions: FRank [Tsai et al., 2007], a fidelity-based loss function optimized in the boosting framework; query normalization may be interfering with its performance gains, and it is not a general second-stage (meta) learner.
48
Outline
1. Motivation
2. Email Acts
3. Preventing Email Information Leaks
4. Recommending Email Recipients
5. Learning Robust Ranking Models
6. User Study ◄
49
User Study
Choosing an email system:
- Gmail, Yahoo! Mail, etc.: widely adopted, but interface/compatibility issues
- Develop a new client: perfect control, but longer development and low adoption
- Mozilla Thunderbird: open-source community, an easy mechanism to install extensions, millions of users
50
User Study: Cut Once
Cut Once: a Mozilla Thunderbird extension for leak detection and recipient recommendation.
A few issues: poor documentation and limited customization of the interface; JavaScript is slow, which imposed computational restrictions. We disregard rare words and rare users, and implement two lightweight ranking methods: (1) TFIDF and (2) MRR (fusing Frequency, Recency, and TFIDF).
[Balasubramanyan, Carvalho and Cohen, AAAI-2008 EMAIL]
51
Cut Once Screenshots
Main Window after installation
52
Cut Once Screenshots
53
Cut Once Screenshots
Logged Cut Once usage:
- Time, confidence, and position in rank of clicked recommendations
- Baseline ranking method
54
User Study Description
4-week-long study; most subjects from Pittsburgh. After 1 week, qualified users were invited to continue, and 20% of the compensation was paid. After 4 weeks, users were fully compensated (plus a final questionnaire).
26 subjects finished the study: 4 female and 22 male, median age 28.5, mostly students (a few sysadmins, 1 professor, 2 staff members). Total of 2315 sent messages; average of 113 address book entries.
Subjects were randomly assigned to two different ranking methods: TFIDF and MRR.
55
Recipient Suggestions
17 subjects used the functionality (in 5.28% of their sent messages), with an average of 1 accepted suggestion per 24.37 sent messages.
56
Comparison of Ranking Methods
MRR was better than TFIDF, but the difference is not statistically significant:
- Clicked rank, average: 3.14 versus 3.69
- Rank quality: 3.51 versus 3.43
Rough estimate: reaching significance would require a factor of 5.5 more data, i.e., 5.5 × 4 = 22 weeks of user study, or 5.5 × 26 = 143 subjects for 4 weeks.
57
Results: Leak Detection
18 subjects used leak deletion (in 2.75% of their sent messages). The most frequently reported use was to "clean up" the addressee list: removing unwanted people (inserted by Reply-all), or removing themselves (automatically added).
5 real leaks were reported, from 4 different users. These users did not use Cut Once to remove the leaks: they clicked on the "cancel" button and removed the leaks manually, either because they were uncomfortable or unfamiliar with the interface, or under pressure because of the 10-second timer.
58
Results: Leak Detection
5 leaks → 4 users:
- Network admin: two users with similar user IDs
- System admin: wrong auto-completion in 2 or 3 situations
- Undergrad: two acquaintances with similar names
- Grad student: a reply-all case
Correlations: 2 users used TFIDF, 2 used MRR; no significant correlation with address book size or number of sent messages; a correlation with "non-student" occupations (95% confidence).
Estimate: 1 leak every 463 sent messages. Assuming a binomial distribution with p = 5/2315, then about 1066 messages are required to send at least one leak (with 90% confidence).
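The 90%-confidence figure can be checked directly: with per-message leak probability p, the smallest n with 1 - (1 - p)^n ≥ 0.9 is ceil(ln(0.1) / ln(1 - p)).

```python
import math

# Sanity check of the estimate above: with per-message leak probability
# p = 5/2315, how many messages until at least one leak, at 90% confidence?
p = 5 / 2315
n = math.ceil(math.log(1 - 0.9) / math.log(1 - p))
print(n)  # close to the ~1066 messages quoted on the slide
```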
59
Final Questionnaire (Higher = Better)
60
Frequent complaints:
- Training and optimization
- Interface
61
Final Questionnaire
Compose-then-address instead of address-then-compose behavior
62
Conclusions
Email acts: a taxonomy of intentions in email communication; categorization can be automated.
Addressed two types of email addressing mistakes, both framed as supervised learning problems: email leaks (accidentally adding non-intended recipients), with new methods for leak detection; and recipient recommendation (forgetting intended recipients), with several proposed models, including combinations of base methods.
Proposed a new general-purpose ranking algorithm, robust to outliers, that outperformed state-of-the-art rankers on the recipient recommendation task (and other ranking tasks).
User study using a Mozilla Thunderbird extension: caught 5 real leaks, showed reasonably good prediction quality, and showed clear potential for adoption by a large number of email users.
63
Proof: non-existence of a better advisor
Given the finite set of good advisors An
64
Proof: non-existence of a better advisor
Q.E.D.
Assuming
Given the finite set of advisors An
65
Thank you.