Implicit Queries for Email
Vitor R. Carvalho
(Joint work with Joshua Goodman, at Microsoft Research)
Search + Email
- Email is the number 1 activity on the internet
  - Fast, easy and cheap
- Search is number 2
  - Highly lucrative (billion-dollar market: targeted ads)
- Why not put them together?
  - Make users happy
  - Make more money
Implicit Queries for Email
- Find good search keywords in email messages
  - 1 click (or less) for users to do a search
- Lots of possible user interfaces
  - Add hyperlinks to words in message
  - List keywords in a sidebar
  - Perform search automatically; show results (Gmail)
- Closely related to finding keywords for advertising
Main Contributions
1) Extract Keyphrases: similar to Information Extraction; several features
2) Rank/Display: Maxent probability estimates
3) Select/Filter: restrict to MSN Query Logs (7.5 million entries)
Email Dataset
- 20 Hotmail volunteers (not MS employees)
- Spam, “subs” and “wanted” folders
- 6 human annotators labeled 1143 msgs according to the following instructions:

  “These are mail messages from real Hotmail users. Imagine that you were the recipient of each message. If your email program were to automatically perform a query to a search engine like MSN Search or Google for you, what words or phrases would you want the engine to search for?

  In some messages, there may be no words worth searching for. In others, there may be several. When possible, the words or phrases should actually occur in the messages you annotate.”
TF-IDF baseline
- Extract all possible keyphrases from email (up to 5 tokens)
- Rank keyphrases by TF-IDF scores
  - TF = term frequency: number of times each keyphrase occurs in the email message
  - IDF = 1/DF, where DF = number of documents in the corpus in which the keyphrase occurs
- Top1: percentage of “ranked-1st keyphrases” that were labeled as relevant
- Top10: number of keyphrases in the top-10 rank that were labeled as relevant, normalized by the total number of relevant keyphrases (no message had more than 10 relevant keyphrases)
| Keyphrases | TF-IDF |
| --- | --- |
| Port Angeles | 0.450 |
| Lake Crescent | 0.120 |
| Atlanta | 0.090 |
| Mt. Baker | 0.045 |
| … | … |

| | Top-1 (%) | Top-10 (%) |
| --- | --- | --- |
| TF-IDF | 4.87 | 9.86 |
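
The baseline above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: tokens are split on whitespace, phrases that never appear in the corpus get DF = 1, and the function names and three-message corpus are made up for the example.

```python
from collections import Counter


def candidate_phrases(text, max_len=5):
    """Return every contiguous run of 1..max_len whitespace-separated tokens."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(tokens) - n + 1)
    ]


def document_frequency(corpus, max_len=5):
    """Count, for each phrase, how many documents contain it at least once."""
    df = Counter()
    for doc in corpus:
        df.update(set(candidate_phrases(doc, max_len)))
    return df


def rank_by_tfidf(message, df, max_len=5):
    """Score each candidate phrase of `message` by TF * (1 / DF), best first."""
    tf = Counter(candidate_phrases(message, max_len))
    # Assumption: a phrase unseen in the corpus is treated as DF = 1.
    scores = {p: count / max(df.get(p, 0), 1) for p, count in tf.items()}
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy corpus standing in for the email collection.
corpus = [
    "we hiked near port angeles and then port angeles again",
    "we flew home to atlanta",
    "we stayed home",
]
df = document_frequency(corpus)
ranking = rank_by_tfidf(corpus[0], df)
```

In this toy corpus "port angeles" occurs twice in the first message and in no other message, so it scores 2 × 1/1, while "we" occurs once in the message but in all three documents and scores 1/3.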
First Improvement: Constrain Results to Query Log File
- Query log file: top 7.5 million queries to MSN Search
- Only return keyphrases from an email if they occur in the query log file
  - Faster: only process keyphrases in message that occur in the query log file
  - Creates some errors
  - Removes some errors, such as “occur in the”
  - Works better!
| | Top-1 (%) | Top-10 (%) |
| --- | --- | --- |
| TF-IDF | 4.87 | 9.86 |
| TF-IDF with query log restriction | 10.86 | 30.56 |
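
A minimal sketch of the query log restriction, assuming the log is loaded into an in-memory set of lowercased query strings. The function name and the three-entry log are hypothetical stand-ins for the 7.5 million MSN Search queries.

```python
def restrict_to_query_log(candidates, query_log):
    """Keep only candidate phrases that appear verbatim in the query log."""
    return [phrase for phrase in candidates if phrase.lower() in query_log]


# Hypothetical stand-in for the real query log.
query_log = {"port angeles", "lake crescent", "atlanta"}

candidates = ["Port Angeles", "occur in the", "Atlanta", "near port"]
kept = restrict_to_query_log(candidates, query_log)
```

A set lookup costs roughly constant time per candidate, which is one way the restriction can make processing faster: fragments such as "occur in the" are dropped before any scoring happens.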
Adding More Features
1) Query Log Frequency: frequency and log(frequency) of keyphrase
2) Capitalization: word capitalized before/after, # capitalized initials in phrase, # capitalized letters in phrase, etc.
3) Phrase Length: number of characters and number of tokens
4) TF + IDF based features: TF, IDF, from Body and from Subject
5) Punctuation and Alphanumeric: punctuation before/after, has no alpha, has numbers only, etc.
6) Email Specific: has FW: in subject, has RE: in subject
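
To make the feature groups concrete, here is a rough sketch that computes a subset of them for one candidate phrase. It is an assumption-laden illustration: the feature names, the log(1 + f) smoothing, the substring-based TF counts, and the dictionaries passed in are all hypothetical, and the context features (capitalization or punctuation before/after the phrase) are left out.

```python
import math


def phrase_features(phrase, subject, body, query_log_counts, doc_freq):
    """Compute a few illustrative features for one candidate keyphrase."""
    tokens = phrase.split()
    lower = phrase.lower()
    q_freq = query_log_counts.get(lower, 0)
    return {
        # 1) Query log frequency (assumption: log(1 + f) so that f = 0 is defined)
        "query_freq": q_freq,
        "log_query_freq": math.log(1 + q_freq),
        # 2) Capitalization inside the phrase
        "num_cap_initials": sum(t[0].isupper() for t in tokens),
        "num_cap_letters": sum(c.isupper() for c in phrase),
        # 3) Phrase length
        "num_chars": len(phrase),
        "num_tokens": len(tokens),
        # 4) TF and IDF (simplification: case-insensitive substring counts)
        "tf_body": body.lower().count(lower),
        "tf_subject": subject.lower().count(lower),
        "idf": 1.0 / max(doc_freq.get(lower, 0), 1),
        # 5) Alphanumeric checks
        "has_no_alpha": not any(c.isalpha() for c in phrase),
        "numbers_only": phrase.replace(" ", "").isdigit(),
        # 6) Email-specific flags
        "subject_has_fw": "FW:" in subject.upper(),
        "subject_has_re": "RE:" in subject.upper(),
    }


features = phrase_features(
    phrase="Port Angeles",
    subject="RE: Trip to Port Angeles",
    body="Port Angeles is near Lake Crescent. See you in Port Angeles!",
    query_log_counts={"port angeles": 120},  # hypothetical log counts
    doc_freq={"port angeles": 2},            # hypothetical document frequencies
)
```

Each candidate phrase would yield one such feature dictionary, which can then be turned into the numeric vector consumed by the learner.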
Maximum Entropy Learner (a.k.a. Logistic Regression)
- Computes $P(y \mid \bar{x}) = \dfrac{\exp(\bar{x} \cdot \bar{w})}{1 + \exp(\bar{x} \cdot \bar{w})}$
  - $y$ is 1 if keyphrase is relevant
  - $\bar{x}$ is the feature vector (previous slide features)
- Weight vector $\bar{w}$ learned using a type of Generalized Iterative Scaling algorithm (SCGIS)
- Rank and cutoff based on probability estimate
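
The probability on this slide is the logistic function applied to the dot product of features and weights. The sketch below only evaluates it for given weights; training with SCGIS is not shown, and the function name and example vectors are hypothetical.

```python
import math


def relevance_probability(x, w):
    """Return exp(x.w) / (1 + exp(x.w)) for equal-length feature and weight lists."""
    z = sum(xi * wi for xi, wi in zip(x, w))
    # Use an algebraically equivalent form that never exponentiates
    # a large positive number, so extreme scores do not overflow.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical feature vector and weights whose dot product is exactly 0.
p = relevance_probability([1.0, 2.0], [0.5, -0.25])
```

A dot product of 0 gives a probability of 0.5; larger positive scores push the estimate toward 1 and larger negative scores toward 0.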
Rank and cutoff based on probability
| Keyphrases | Probability |
| --- | --- |
| Port Angeles | 0.121 |
| Lake Crescent | 0.105 |
| Olympic National Park | 0.034 |
| Atlanta | 0.031 |
| Mt. Baker | 0.022 |
| Hurricane Ridge | 0.012 |
| Marymere Fall | 0.009 |
| Beaches on the west coast | 0.004 |

Cutoff = 10%
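
Ranking and cutoff can be sketched as a sort followed by a threshold test, using the probabilities from this slide. The function name is hypothetical, and treating a probability exactly equal to the cutoff as displayed is an assumption.

```python
def select_keyphrases(scored, cutoff=0.10):
    """Sort phrases by probability (highest first) and keep those at or above the cutoff."""
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    return [(phrase, prob) for phrase, prob in ranked if prob >= cutoff]


# Probabilities taken from the example on this slide.
scored = {
    "Atlanta": 0.031,
    "Port Angeles": 0.121,
    "Mt. Baker": 0.022,
    "Lake Crescent": 0.105,
    "Olympic National Park": 0.034,
    "Hurricane Ridge": 0.012,
    "Marymere Fall": 0.009,
    "Beaches on the west coast": 0.004,
}
shown = select_keyphrases(scored)
```

With a 10% cutoff only "Port Angeles" and "Lake Crescent" would be displayed; every other candidate falls below the threshold.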
Performance Analysis

| Features | Top-1 | Top-10 |
| --- | --- | --- |
| TF-IDF (one single feature and no query log restriction) | 4.87 | 9.86 |
| TF-IDF (one single feature) | 10.86 | 30.56 |
| Baseline → 2 features: TF and IDF | 11.33 | 32.03 |
| Baseline + Query Frequency | 23.13* | 41.82* |
| Baseline + Phrase Length | 12.81 | 33.25 |
| Baseline + Capitalization | 21.43* | 44.71* |
| Baseline + Punctuation | 13.47 | 33.02 |
| Baseline + Email Specific | 11.34 | 32.03 |
| Baseline + Alphanumeric | 11.66 | 32.65 |
| Baseline + All Features | 33.55* | 55.26* |

10-fold cross-validation on the 1143 email messages
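
The Top-1 and Top-10 scores reported here follow the definitions given with the TF-IDF baseline. A minimal sketch of how they might be computed is shown below; pooling counts over all messages (rather than averaging per message) is an assumption, and the function names and two-message example are hypothetical.

```python
def top1_score(rankings, relevant):
    """Percentage of messages whose first-ranked keyphrase was labeled relevant."""
    hits = sum(
        1 for ranked, gold in zip(rankings, relevant)
        if ranked and ranked[0] in gold
    )
    return 100.0 * hits / len(rankings)


def top10_score(rankings, relevant):
    """Relevant keyphrases found in the top 10, as a percentage of all relevant keyphrases."""
    found = sum(
        len(set(ranked[:10]) & gold) for ranked, gold in zip(rankings, relevant)
    )
    total = sum(len(gold) for gold in relevant)
    return 100.0 * found / total if total else 0.0


# Hypothetical system output (best first) and annotator labels for two messages.
rankings = [
    ["port angeles", "atlanta", "the trip"],
    ["hello", "mt. baker"],
]
relevant = [
    {"port angeles", "lake crescent"},
    {"mt. baker"},
]
top1 = top1_score(rankings, relevant)
top10 = top10_score(rankings, relevant)
```

In this example the first message has a relevant phrase in first place and the second does not, so Top-1 is 50%; two of the three labeled phrases appear in the top-10 lists, so Top-10 is about 66.7%.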
Performance Analysis
[Figure: “Implicit Feedback Performance” chart; vertical axis from 0 to 60; series: Top 1 Score and Top 10 Score]
Using Other Learning Algorithms
Opportunities for Future Work
1. Relax the Query Log restriction
2. Use real advertisement data
3. Use feedback from users (users can be annoyed, etc.)
4. Use personalization (age, gender, place, etc.)
Conclusions
- Implicit Query task → finding good search keywords
- Use of large query log from MSN Search
- Maxent to combine features and output probabilities: ranking and display cutoff
- Most meaningful features are associated with query frequency and capitalization
- Results several times better than baseline TF-IDF (top 1 and top 10 scores)
Thank you