1
Modeling Intention in Email
Vitor R. Carvalho
Ph.D. Thesis Defense, July 22nd, 2008
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Thesis Committee: William W. Cohen (chair), Tom M. Mitchell, Robert E. Kraut, Lise Getoor (Univ. Maryland)
2
Outline
1. Motivation
2. Email Acts
3. Email Leaks
4. Recommending Email Recipients
5. Learning Robust Rank Models
6. User Study
3
Why Email
The most successful e-communication application. A great tool for collaboration, especially across time zones. Very cheap, fast, and convenient. Multiple uses: task manager, contact manager, document archive, to-do list, etc.
Increasingly popular: the Clinton administration left 32 million emails to the National Archives; the Bush administration is expected to leave more than 100 million in 2009.
Visible impact: office workers in the U.S. spend at least 25% of the day on email, not counting handheld use.
[Shipley & Schwalbe, 2007]
4
Hard to Manage
People get overwhelmed: costly interruptions and serious impacts on work productivity. It is increasingly difficult to manage requests, negotiate shared tasks, and keep track of different commitments.
People make horrible mistakes: sending messages to the wrong persons, or forgetting to address intended recipients. "Oops, did I just hit reply-to-all?"
[Dabbish & Kraut, CSCW-2006] [Bellotti et al., HCI-2005]
5
Thesis
► We present evidence that email management can potentially be improved by the effective use of machine learning techniques to model different aspects of user intention.
6
Outline
1. Motivation
2. Email Acts ◄
3. Preventing Email Information Leaks
4. Recommending Email Recipients
5. Learning Robust Rank Models
6. User Study
7
Classifying Email into Acts [Cohen, Carvalho & Mitchell, EMNLP-04]
Verbs:
- Commissive: Deliver, Commit
- Directive: Request, Propose, Amend
Nouns:
- Activity: Ongoing Event, Meeting, Other
- Delivery: Opinion, Data
An act is described as a verb-noun pair (e.g., propose meeting, request information); not all pairs make sense.
A single email message may contain multiple acts.
The taxonomy tries to describe commonly observed behaviors, rather than all possible speech acts in English.
It also includes non-linguistic usage of email (e.g., delivery of files).
8
Data & Features
Data: Carnegie Mellon MBA students competition
- A semester-long project for CMU MBA students. Total of 277 students, divided into 50 teams (4 to 6 students per team). Rich in task negotiation.
- 1700+ messages (from 5 teams) were manually labeled. One of the teams was double-labeled; inter-annotator agreement ranges from 0.72 to 0.83 (Kappa) for the most frequent acts.
Features: n-grams (1-gram through 5-gram)
Pre-processing:
- Remove signature files and quoted lines (in-reply-to) [Jangada package]
- Entity normalization and substitution patterns: "Sunday"..."Monday" → [day]; [number]:[number] → [hour]; "me", "her", "him", "us", or "them" → [me]; "after", "before", or "during" → [time]; etc.
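The substitution step above can be sketched as ordered regular-expression rewrites; the pattern lists here are illustrative, not the exact ones used in the thesis.

```python
import re

# Sketch of entity normalization; the exact pattern lists are assumptions.
PATTERNS = [
    (r"\b(sunday|monday|tuesday|wednesday|thursday|friday|saturday)\b", "[day]"),
    (r"\b\d{1,2}:\d{2}\b", "[hour]"),
    (r"\b(me|her|him|us|them)\b", "[me]"),
    (r"\b(after|before|during)\b", "[time]"),
]

def normalize(text: str) -> str:
    text = text.lower()
    for pattern, token in PATTERNS:
        text = re.sub(pattern, token, text)
    return text

print(normalize("Meet me after 10:30 on Monday"))
# → meet [me] [time] [hour] on [day]
```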
9
Classification Performance
5-fold cross-validation over 1716 emails, SVM with linear kernel
[Carvalho & Cohen, HLT-ACTS-06] [Cohen, Carvalho & Mitchell, EMNLP-04]
10
Predicting Acts from Surrounding Acts
[Diagram: an example email thread sequence, with acts such as Request, Propose, Commit, and Deliver linked across messages]
There is a strong correlation between the previous and the next message's acts.
Both context and content have predictive value for email act classification.
[Carvalho & Cohen, SIGIR-05]
Collective classification problem → Dependency Network
11
Collective Classification with Dependency Networks (DN)
In DNs, the full joint probability distribution is approximated with a set of conditional distributions that can be learned independently. The conditional probabilities are calculated for each node given its Markov blanket.
Pr(X) ≈ ∏_i Pr(X_i | Blanket(X_i))
[Diagram: parent message, current message, and child message, each with Request / Commit / Deliver / ... nodes]
[Heckerman et al., JMLR-00] [Neville & Jensen, JMLR-07]
Inference: temperature-driven Gibbs sampling
[Carvalho & Cohen, SIGIR-05]
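A toy sketch of temperature-driven Gibbs sampling over a thread of messages. The hand-made conditional table stands in for the learned per-node classifiers, and the act set, table values, and annealing schedule are all illustrative assumptions.

```python
import math
import random

# Toy acts; the real taxonomy is richer.
ACTS = ["Request", "Commit", "Deliver"]

def conditional(neighbors):
    # Hypothetical conditional scores for a node given its Markov blanket:
    # a Request nearby makes Commit likely, a Commit makes Deliver likely.
    scores = {a: 1.0 for a in ACTS}
    if "Request" in neighbors:
        scores["Commit"] += 2.0
    if "Commit" in neighbors:
        scores["Deliver"] += 2.0
    return scores

def gibbs(thread_len=4, iters=200, seed=0):
    rng = random.Random(seed)
    labels = [rng.choice(ACTS) for _ in range(thread_len)]
    for t in range(iters):
        temp = max(0.1, 1.0 - t / iters)   # anneal toward greedy sampling
        for i in range(thread_len):
            nbrs = [labels[j] for j in (i - 1, i + 1) if 0 <= j < thread_len]
            scores = conditional(nbrs)
            # sharpen the conditional as temperature drops: s^(1/temp)
            weights = [math.exp(math.log(scores[a]) / temp) for a in ACTS]
            r, acc = rng.random() * sum(weights), 0.0
            for a, w in zip(ACTS, weights):
                acc += w
                if r <= acc:
                    labels[i] = a
                    break
    return labels

print(gibbs())
```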
12
Act by Act Comparative Results
Kappa values (%) with and without collective classification, averaged over four team test sets in the leave-one-team-out experiment:

Act         Baseline   Collective
Commissive  37.66      42.55
Commit      30.74      32.77
Meeting     47.81      52.42
Directive   58.27      58.37
Request     47.25      49.55
Propose     36.84      40.72
Deliver     42.01      38.69
Data        44.98      43.44

Modest improvements over the baseline, and only on acts related to negotiation: Request, Commit, Propose, Meeting, Commissive, etc.
13
Key Ideas Summary
- Introduced a new taxonomy of acts tailored to email communication, with good levels of inter-annotator agreement
- Showed that classification of email acts can be automated
- Proposed a collective classification algorithm for threaded messages
Related work: Speech Act Theory [Austin, 1962; Searle, 1969]; the Coordinator system [Winograd, 1987]; Dialog Acts for speech recognition, machine translation, and other dialog-based systems [Stolcke et al., 2000], [Levin et al., 2003], etc.
Related applications: focus messages in threads/discussions [Feng et al., 2006], action-item discovery [Bennett & Carbonell, 2005], task-focused email summary [Corston-Oliver et al., 2004], predicting social roles [Leusky, 2004], etc.
14
Applications of Email Acts
- Iterative learning of email tasks and email acts [Kushmerick & Khousainov, IJCAI-05]
- Predicting social roles and group leadership [Leusky, SIGIR-04] [Carvalho et al., CEAS-07]
- Detecting focus on threaded discussions [Feng et al., HLT/NAACL-06]
- Semantically enhanced email [Scerri et al., DEXA-07]
- Email act taxonomy refinements [Lampert et al., AAAI-2008 EMAIL]
15
Outline
1. Motivation
2. Email Acts
3. Preventing Email Information Leaks ◄
4. Recommending Email Recipients ◄
5. Learning Robust Rank Models
6. User Study
16
17
18
http://www.sophos.com/
19
Preventing Email Info Leaks
Email leak: an email accidentally sent to the wrong person, caused by:
1. Similar first or last names, aliases, etc.
2. Aggressive auto-completion of email addresses
3. Typos
4. Keyboard settings
Disastrous consequences: expensive lawsuits, brand reputation damage, negotiation setbacks, etc.
[Carvalho & Cohen, SDM-07]
20
Preventing Email Info Leaks: Method
1. Create simulated/artificial email leak recipients.
2. Build a model for (message, recipients): train a classifier on real data to detect synthetically created outliers (added to the true recipient list). Features: textual (subject, body) and network features (frequencies, co-occurrences, etc.).
3. Detect potential outliers and warn the user, based on confidence.
[Carvalho & Cohen, SDM-07]
21
Simulating Email Leaks
Else: Randomly select an address book entry
321
Marina.wang @enron.comGenerate a random email address NOT in Address Book
Several options: Frequent typos, same/similar last names, identical/similar first
names, aggressive auto-completion of addresses, etc.
We adopted the 3g-address criteria: On each trial, one of the msg recipients is randomly chosen
and an outlier is generated according to:
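A minimal sketch of this simulation. The mixing probability `p`, the trigram-overlap test, and the toy address book are assumptions here, not the exact 3g-address procedure.

```python
import random

def trigrams(s):
    # character 3-grams of an address
    return {s[i:i + 3] for i in range(len(s) - 2)}

def simulate_leak(recipients, address_book, p=0.8, rng=random):
    # pick one true recipient at random
    target = rng.choice(recipients)
    # candidate outliers sharing a character trigram with the target
    similar = [a for a in address_book
               if a not in recipients and trigrams(a) & trigrams(target)]
    if similar and rng.random() < p:
        return rng.choice(similar)      # similar-name / typo-style confusion
    others = [a for a in address_book if a not in recipients]
    return rng.choice(others)           # else: random address-book entry

book = ["marina.wang@enron.com", "maria.wong@enron.com", "kitchen-l@enron.com"]
print(simulate_leak(["marina.wang@enron.com"], book, rng=random.Random(1)))
```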
22
Data and Baselines
Enron email dataset, with a realistic setting: for each user, the ~10% most recent sent messages were used as the test set. Some basic preprocessing.
Baseline methods based on textual similarity (common baselines in IR):
- Rocchio/TFIDF Centroid [1971]: create a "TF-IDF centroid" for each user in the address book; for testing, rank according to cosine similarity between the test message and each centroid.
- KNN-30 [Yang & Chute, 1994]: given a test message, get the 30 most similar messages in the training set, then rank according to the "sum of similarities" of a given user over the 30-message set.
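On toy data, the Rocchio/TFIDF-centroid baseline can be sketched as below; the +1 smoothing and the example messages are assumptions.

```python
import math
from collections import Counter

def vectorize(doc, df, n):
    # TF-IDF weights with add-one smoothing (an assumption for tiny corpora)
    tf = Counter(doc.split())
    return {t: tf[t] * math.log((n + 1) / (df.get(t, 0) + 1)) for t in tf}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# hypothetical messages previously sent to each address-book entry
sent = {"alice": ["budget report draft", "budget meeting notes"],
        "bob":   ["server outage logs", "server patch schedule"]}
docs = [d for msgs in sent.values() for d in msgs]
df = Counter(t for d in docs for t in set(d.split()))
n = len(docs)

def centroid(msgs):
    c = Counter()
    for d in msgs:
        c.update(vectorize(d, df, n))
    return {t: w / len(msgs) for t, w in c.items()}

centroids = {u: centroid(m) for u, m in sent.items()}
query = vectorize("quarterly budget summary", df, n)
ranking = sorted(centroids, key=lambda u: cosine(query, centroids[u]),
                 reverse=True)
print(ranking)  # → ['alice', 'bob']
```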
23
Using Non-Textual Features
1. Frequency features: number of received, sent, and sent+received messages (from this user).
2. Co-occurrence features: number of times a user co-occurred with all other recipients.
3. Auto features: for each recipient R, find Rm (the address with the maximum score from the 3g-address list of R), then use score(R) - score(Rm) as a feature.
These non-textual features are combined with the text-based score (KNN-30 or TFIDF) using perceptron-based reranking, trained on simulated leaks.
24
Email Leak Results [Carvalho & Cohen, SDM-07]

Accuracy (Prec@1) on leak prediction:

Method             Prec@1
Random             0.406
TfIdf              0.558
Knn30              0.560
Knn30+Frequency    0.804
Knn30+Cooccur      0.748
Knn30+Auto         0.718
Knn30+All_Above    0.814
25
Finding Real Leaks in Enron
How can we find them? Grep for "mistake", "sorry", or "accident" (e.g., "Sorry. Sent this to you by mistake.", "I accidentally sent you this reminder"). Note: the leak must be from one of the Enron users.
Found 2 valid cases:
1. Message germany-c/sent/930: the message has 20 recipients; the leak is alex.perkins@
2. Message kitchen-l/sent items/497: it has 44 recipients; the leak is rita.wynne@
Prediction results: the proposed algorithm was able to find these two leaks.
26
Another Email Addressing Problem
Sometimes people just forget an intended recipient
27
Forgetting an intended recipient
Particularly in large organizations, it is not uncommon to forget to CC an important collaborator: a manager, a colleague, a contractor, an intern, etc.
This is more frequent than expected (from the Enron collection): at least 9.27% of the users have forgotten to add a desired email recipient, and at least 20.52% of the users were not included as recipients (even though they were intended recipients) in at least one received message.
The cost of such errors in task management can be high: communication delays, missed deadlines, wasted opportunities, costly misunderstandings, task delays.
[Carvalho & Cohen, ECIR-2008]
28
Data and Features
Easy to obtain labeled data.
Two ranking problems: predicting TO+CC+BCC, and predicting CC+BCC.
Features & methods:
- Textual: Rocchio (TFIDF) and KNN
- Non-textual: Frequency, Recency, and Co-occurrence
  - Number of messages received and/or sent (from/to this user)
  - How often a particular user was addressed in the last 100 messages
  - Number of times a user co-occurred with all other recipients ("co-occur" means two recipients were addressed in the same message in the training set)
29
Email Recipient Recommendation
[Bar chart: MAP on the TOCCBCC and CCBCC tasks for Frequency, Recency, TFIDF, KNN, and Perceptron]
36 Enron users, 44,000+ queries (avg. ~1267 queries/user). Best models reach MRR ~ 0.5.
[Carvalho & Cohen, ECIR-08]
30
Rank Aggregation
Many "data fusion" methods, of 2 types:
- Normalized scores: CombSUM, CombMNZ, etc.
- Unnormalized scores: BordaCount, Reciprocal Rank Sum, etc.
Reciprocal Rank: the sum of the inverse of the rank of the document in each ranking:
RR(d_i) = Σ_{q ∈ Rankings} 1 / rank_q(d_i)
[Aslam & Montague, 2001]; [Ogilvie & Callan, 2003]; [Macdonald & Ounis, 2006]
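A minimal sketch of reciprocal-rank aggregation; the example rankings are hypothetical.

```python
# Each candidate's fused score is the sum, over the base rankings, of
# 1 / rank, with ranks counted from 1.
def reciprocal_rank_fusion(rankings):
    scores = {}
    for ranking in rankings:
        for pos, cand in enumerate(ranking, start=1):
            scores[cand] = scores.get(cand, 0.0) + 1.0 / pos
    return sorted(scores, key=scores.get, reverse=True)

# fusing, say, a frequency-based and a recency-based recipient ranking
fused = reciprocal_rank_fusion([["ann", "bob", "carl"],
                                ["bob", "carl", "ann"]])
print(fused)  # → ['bob', 'ann', 'carl']
```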
31
Rank Aggregation Results
[Bar chart: MAP on the TOCCBCC and CCBCC tasks, comparing Frequency, Recency, TFIDF, KNN, Perceptron, and Fusion (MRR)]
32
Intelligent Email Auto-completion
TOCCBCC
CCBCC
[Carvalho & Cohen, ECIR-08]
33
Related Work
Email leak:
- [Boufaden et al., 2005]: proposed a privacy enforcement system to monitor specific privacy breaches (student names, student grades, IDs)
- [Lieberman and Miller, 2007]: prevent leaks based on faces
Recipient recommendation:
- [Pal & McCallum, 2006], [Dredze et al., 2008]: the CC prediction problem; recipient prediction based on summary keywords
Expert search in email:
- [Dom et al., 2003], [Campbell et al., 2003], [Balog & de Rijke, 2006], [Balog et al., 2006], [Soboroff, Craswell, de Vries (TREC Enterprise 2005-07)]
34
Outline
1. Motivation
2. Email Acts
3. Preventing Email Information Leaks
4. Recommending Email Recipients
5. Learning Robust Ranking Models ◄
6. User Study
35
Can we learn a better ranking function?
Learning to Rank: machine learning to improve ranking. Many recently proposed methods:
- RankSVM [Joachims, KDD-02]
- RankBoost [Freund et al., 2003]
- Committee of Perceptrons [Elsas, Carvalho & Carbonell, WSDM-08]
Meta-learning method: learn robust ranking models in the pairwise-based framework.
36
Pairwise-based Ranking
Given a ranking for query q over documents d_1, d_2, ..., d_T, the goal is to induce a ranking function f(d) such that:
f(d_i) > f(d_j) ⟺ d_i ≻ d_j
We assume a linear function f:
f(d_i) = ⟨w, d_i⟩ = w_1 x_{i1} + w_2 x_{i2} + ... + w_m x_{im}
which gives the pairwise constraints:
d_i ≻ d_j ⟹ ⟨w, d_i - d_j⟩ > 0
Each paired instance is a difference of feature vectors, e.g. (x_{i1} - x_{j1}, ..., x_{im} - x_{jm}).
Caveat: O(n) mislabels produce O(n^2) mislabeled pairs.
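The pairwise construction can be sketched as follows. One mislabeled document corrupts every pair it participates in, which is why O(n) label noise yields O(n^2) noisy pairs.

```python
# Turn per-query relevance labels into pairwise difference instances
# x_i - x_j (relevant minus non-relevant), as in the pairwise framework.
def pairwise_instances(docs):
    # docs: list of (feature_vector, is_relevant)
    pairs = []
    for xi, ri in docs:
        for xj, rj in docs:
            if ri and not rj:
                pairs.append([a - b for a, b in zip(xi, xj)])
    return pairs

docs = [([1.0, 0.0], True), ([0.0, 1.0], False), ([0.5, 0.5], False)]
print(len(pairwise_instances(docs)))  # → 2 (1 relevant x 2 non-relevant)
```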
37
Effect of Pairwise Outliers
[Figure: RankSVM on SEAL-1]
38
Effect of Pairwise Outliers
[Plot: the RankSVM (hinge) loss function over the pairwise score]
39
Effect of Pairwise Outliers
[Plot: a sigmoid-shaped loss function over the pairwise score, compared to the RankSVM loss]
Robust to outliers, but not convex.
40
Ranking Models: 2 Stages
1. Base ranker: a base ranking model, e.g., RankSVM, Perceptron, ListNet, etc.
2. Sigmoid Rank (final model): minimize (a very close approximation of) the empirical error, i.e., the number of misranks. Non-convex, but robust to outliers (label noise).
41
Learning
SigmoidRank loss, minimized with gradient descent:
min_w L_SigmoidRank(w) = Σ_{(d_R, d_NR) ∈ P} sigmoid[1 - ⟨w, (d_R - d_NR)⟩]
where sigmoid(x) = 1 / (1 + e^{-x}).
Gradient descent update:
w^{(k+1)} = w^{(k)} - η ∇_w L_SigmoidRank(w^{(k)})
∇_w L_SigmoidRank(w^{(k)}) = -Σ_{(d_R, d_NR) ∈ P} sigmoid(z) [1 - sigmoid(z)] (d_R - d_NR),  where z = 1 - ⟨w^{(k)}, (d_R - d_NR)⟩
43
Email Recipient Results [Carvalho, Elsas, Cohen and Carbonell, SIGIR 2008 LR4IR]
[Bar chart: MAP on the TOCCBCC task for Frequency, Recency, TFIDF, KNN, Percep, MRR (Fusion), RankSVM, RankSVM+Sigmoid, and Percep+Sigmoid]
44
Email Recipient Results [Carvalho, Elsas, Cohen and Carbonell, SIGIR 2008 LR4IR]
[Bar chart: MAP on the CCBCC task for Frequency, Recency, TFIDF, KNN, Percep, MRR (Fusion), RankSVM, RankSVM+Sigmoid, and Percep+Sigmoid]
45
Set Expansion (SEAL) Results [Wang & Cohen, ICDM-2007] [Carvalho et al., SIGIR 2008 LR4IR]
[Bar chart: MAP on SEAL-1, SEAL-2, and SEAL-3 for Percep, Percep+Sigmoid, RankSVM, and RankSVM+Sigmoid]
[18 features, ~120/60 train/test splits, ~half relevant]
46
Letor Results [Carvalho et al., SIGIR 2008 LR4IR]
[Bar chart: MAP on Ohsumed, Trec3, and Trec4 for Percep, Percep+Sigmoid, RankSVM, and RankSVM+Sigmoid]
[#queries/#features: (106/25) (50/44) (75/44)]
47
Related Work
Classification with non-convex loss functions, trading off outlier robustness, accuracy, scalability, etc.: [Perez-Cruz et al., 2003], [Xu et al., 2006], [Zhan & Shen, 2005], [Collobert et al., 2006], [Liu et al., 2005], [Yang and Hu, 2008]
Ranking with other non-convex loss functions: FRank [Tsai et al., 2007], a fidelity-based loss function optimized in the boosting framework; query normalization may be interfering with its performance gains, and it is not a general second-stage (meta) learner.
48
Outline
1. Motivation
2. Email Acts
3. Preventing Email Information Leaks
4. Recommending Email Recipients
5. Learning Robust Ranking Models
6. User Study ◄
49
User Study
Choosing an email system:
- Gmail, Yahoo! Mail, etc.: widely adopted, but interface/compatibility issues
- Develop a new client: perfect control, but longer development and low adoption
- Mozilla Thunderbird: open-source community, an easy mechanism to install extensions, millions of users
50
User Study: Cut Once
Cut Once: a Mozilla Thunderbird extension for leak detection and recipient recommendation.
A few issues: poor documentation and limited customization of the interface; JavaScript is slow, which imposed computational restrictions. We disregard rare words and rare users, and implement two lightweight ranking methods: (1) TFIDF and (2) MRR (fusing Frequency, Recency, and TFIDF).
[Balasubramanyan, Carvalho and Cohen, AAAI-2008 EMAIL]
51
Cut Once Screenshots
Main Window after installation
52
Cut Once Screenshots
53
Cut Once Screenshots
Logged Cut Once usage:
- Time, confidence, and position in rank of clicked recommendations
- Baseline ranking method
54
User Study Description
4-week-long study; most subjects from Pittsburgh. After 1 week, qualified users were invited to continue, and 20% of the compensation was paid. After 4 weeks, users were fully compensated (plus a final questionnaire).
26 subjects finished the study: 4 female and 22 male, median age 28.5, mostly students (a few sysadmins, 1 professor, 2 staff members). Total of 2315 sent messages; average of 113 address book entries.
Subjects were randomly assigned to two different ranking methods: TFIDF and MRR.
55
Recipient Suggestions
17 subjects used the functionality (in 5.28% of their sent messages), with an average of 1 accepted suggestion per 24.37 sent messages.
56
Comparison of Ranking Methods
MRR was better than TFIDF, but the difference is not statistically significant:
- Clicked rank, average: 3.14 versus 3.69
- Rank quality: 3.51 versus 3.43
Rough estimate: reaching significance would require a factor of 5.5 more data, i.e., 5.5 × 4 = 22 weeks of user study, or 5.5 × 26 = 143 subjects for 4 weeks.
57
Results: Leak Detection
18 subjects used leak deletion (in 2.75% of their sent messages). The most frequently reported use was to "clean up" the addressee list: removing unwanted people (inserted by Reply-all), or removing themselves (automatically added).
5 real leaks were reported, from 4 different users. These users did not use Cut Once to remove the leaks: they clicked on the "cancel" button and removed the leaks manually, either because they were uncomfortable or unfamiliar with the interface, or under pressure because of the 10-second timer.
58
Results: Leak Detection
5 leaks → 4 users:
- Network admin: two users with similar user IDs
- System admin: wrong auto-completion in 2 or 3 situations
- Undergrad: two acquaintances with similar names
- Grad student: a reply-all case
Correlations: 2 users used TFIDF, 2 used MRR; no significant correlation with address book size or number of sent messages; a correlation with "non-student" occupations (95% confidence).
Estimate: 1 leak every 463 sent messages. Assuming a binomial distribution with p = 5/2315, then about 1066 messages are required to send at least one leak (with 90% confidence).
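The 90%-confidence figure can be checked directly: with per-message leak probability p, the smallest n with 1 - (1 - p)^n ≥ 0.9 is ceil(ln(0.1) / ln(1 - p)).

```python
import math

# Sanity check of the estimate above: with per-message leak probability
# p = 5/2315, how many messages until at least one leak, at 90% confidence?
p = 5 / 2315
n = math.ceil(math.log(1 - 0.9) / math.log(1 - p))
print(n)  # close to the ~1066 messages quoted on the slide
```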
59
Final Questionnaire (Higher = Better)
60
Frequent complaints:
- Training and optimization
- Interface
61
Final Questionnaire
Compose-then-address instead of address-then-compose behavior
62
Conclusions
Email acts: a taxonomy of intentions in email communication; categorization can be automated.
Addressed two types of email addressing mistakes, both framed as supervised learning problems: email leaks (accidentally adding non-intended recipients), with new methods for leak detection; and recipient recommendation (forgetting intended recipients), with several proposed models, including combinations of base methods.
Proposed a new general-purpose ranking algorithm, robust to outliers, that outperformed state-of-the-art rankers on the recipient recommendation task (and other ranking tasks).
User study using a Mozilla Thunderbird extension: caught 5 real leaks, showed reasonably good prediction quality, and showed clear potential for adoption by a large number of email users.
63
Proof: non-existence of a better advisor
Given the finite set of good advisors An
64
Proof: non-existence of a better advisor
Q.E.D.
Assuming
Given the finite set of advisors An
65
Thank you.