The Enron and W3C Collections


Page 1: The Enron and W3C Collections

The Enron and W3C Collections

Tamer Elsayed and Douglas W. Oard

ICAIL 2007, DESI Workshop, June 4th, 2007

University of Maryland

Page 2: The Enron and W3C Collections

Variants of Email Search

Searcher: Participant, Non-participant

Collection:
  Personal: My own emails, Shneiderman's, Postel's
  Organization: Help desks, White House, Enron
  Public: Online communities, Usenet news, W3C

Page 3: The Enron and W3C Collections

The (Extended) Enron Collection

Rich multimodal data
  Emails
  Phone calls
  Databases

Page 4: The Enron and W3C Collections

The (Extended) Enron Collection

"Public" version of Enron collection (CMU)
  150 sets of rescued Outlook email folders
  517,431 emails, 52% duplicates, 133,581 unique addresses
  Subset annotated with genre, speech act, mentioned calls, ...

Extended Enron email collection (Aspen Systems)
  Attachments, additional email (later release, redaction)

Phone calls from/to Enron traders (Snohomish PUD)
  Transcribed subset from 52 DVDs of recorded audio
  Recovered from scanned transcripts using OCR
  93 annotated with date, time, participants, mentioned names, mentioned emails, mentioned meetings, ...

Relational databases (Aspen Systems)

Page 6: The Enron and W3C Collections


Phone Call Transcripts

Message-ID: <24-20010126-19435570-20020114-R>

Message-Type: PhoneCall

Date: Fri, 26 Jan 2001 19:43:55 -0600 (CST)

From: [email protected]

To: [email protected]

Parties: [email protected], [email protected]

Subject: Snohornish deal, Houston Chronicle Article, Bonuses e-mail, Houston Chronicle Article, Deal, email to Jane King

Subject-TimePos: 145, 313, 713, 775, 920, 1018

InCallNames: Christian, Ken Lay, Greg, Chris Foster, Stewie, Stewie, Mike, Mike, Laverado, Mike, Kim, Shari, Greg, Forney, Stewie, Jane King, Shari

InCallNames-TimePos: 42, 81, 90, 95, 96, 143, 146, 190, 262, 266, 522, 580, 780, 1007, 1018, 1038, 1067

Keywords: CDWR, email, email

Keywords-TimePos: 55, 689, 1038

X-From: Stack, Shari <>

X-To: Wolfe, Greg <>

X-Parties: Stack, Shari <>, Wolfe, Greg <>

X-AudioFile: 24-20010126-19435570-20020114-R.wav

X-TranscriptFile: 24-20010126-19435570-20020114-R.txt

SHARI STACK: Hey.

GREG WOLFE: All right, let me get my fax machine workin'. Uh - [laughs]

SHARI: [laughs] She's like, it was so easy, I could make you a lot of money [laughs]. She's like, he said it so desperate. She goes I hate to laugh at people, but - [laughs]

GREG: Did you, um, did you, ah, ah tell her about the, ah, that voice mail?

SHARI: Yeah, I said - I said Greg [inaudible] he's got the - they got a mob connection [langhs] - his friend threw away the business card after the meeting.[both laughing]

SHARI: But, my God - my God, and so anyway, have you talked to Chnstian about this 'cause Christian apparently talked to him twice today.

GREG: Oh, he sent a - Christian sent an e-mail shortly after, you know, that, and said we're not doin' business with this guy.

SHARI: [laughs]

GREG: Ah, so I still don't understand why this guy's trying to get in the middle of us and CDWR and I guess -

SHARI: [laughs]
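In the phone-call record above, each mention field (Subject, InCallNames, Keywords) is followed by a parallel "-TimePos" field holding one position per item. A minimal Python sketch of how such a header might be read is given below. The function names parse_header and mentions_with_positions are hypothetical, the unit of the TimePos values is not stated on the slide, and the comma split assumes that items contain no commas (so it would not suit X-Parties).

```python
def parse_header(lines):
    """Collect 'Name: value' header lines into a dict, splitting on the first colon."""
    fields = {}
    for line in lines:
        if ":" not in line:
            continue  # skip lines that are not header fields
        name, value = line.split(":", 1)
        fields[name.strip()] = value.strip()
    return fields


def mentions_with_positions(fields, name):
    """Pair each comma-separated item of `name` with its `name`-TimePos value."""
    items = [s.strip() for s in fields[name].split(",")]
    positions = [int(s) for s in fields[name + "-TimePos"].split(",")]
    if len(items) != len(positions):
        raise ValueError(f"{name}: {len(items)} items but {len(positions)} positions")
    return list(zip(items, positions))


# Three header lines copied from the record above.
header = [
    "Message-Type: PhoneCall",
    "Keywords: CDWR, email, email",
    "Keywords-TimePos: 55, 689, 1038",
]
fields = parse_header(header)
keyword_mentions = mentions_with_positions(fields, "Keywords")
print(keyword_mentions)
```

The length check is there because a mismatch between a mention list and its position list would otherwise be silently truncated by zip.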

Page 7: The Enron and W3C Collections

Typical Enron Email

Message Header

Message-ID: <1494.1584620.JavaMail.evans@thyme>
Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
From: [email protected]
To: [email protected]
Subject: RE: Shhhh.... it's a SURPRISE !
X-From: Sager, Elizabeth </O=ENRON/OU=NA/CN=RECIPIENTS/CN=ESAGER>
X-To: '[email protected]@ENRON'

Message Body

Salutation

Hi Shari

Main Body

Hope all is well.
Count me in for the group present.
See ya next week if not earlier

Signature Block

Liza

Elizabeth Sager
713-853-6349

Quoted Text

Quoted Header

-----Original Message-----
From: [email protected]@ENRON
Sent: Monday, July 30, 2001 2:24 PM
To: Sager, Elizabeth; Murphy, Harlan; [email protected]; [email protected]
Cc: [email protected]
Subject: Shhhh.... it's a SURPRISE !

Quoted Main Body

Please call me (713) 207-5233

Thanks!

Quoted Signature

Shari
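The split between new text and quoted text in the email above can be approximated by cutting at the Outlook-style "-----Original Message-----" separator. The Python sketch below illustrates that one heuristic only. It is not the segmentation method used by the authors, the function name split_reply_and_quote is hypothetical, and the sample body is abridged from the slide.

```python
QUOTE_MARKER = "-----Original Message-----"


def split_reply_and_quote(body):
    """Return (new_text, quoted_text); quoted_text is '' when no marker is present."""
    index = body.find(QUOTE_MARKER)
    if index == -1:
        return body.strip(), ""
    new_text = body[:index].strip()
    quoted_text = body[index + len(QUOTE_MARKER):].strip()
    return new_text, quoted_text


# Abridged version of the message body shown on the slide.
body = """Hi Shari

Hope all is well.
Count me in for the group present.

Liza

-----Original Message-----
Sent: Monday, July 30, 2001 2:24 PM
Subject: Shhhh.... it's a SURPRISE !

Thanks!

Shari
"""
new_text, quoted_text = split_reply_and_quote(body)
print(new_text)
print("---")
print(quoted_text)
```

Separating salutation, main body, and signature within each part would need further rules or a trained classifier, which this sketch does not attempt.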

Page 8: The Enron and W3C Collections

Research Problems (Enron)

Threading
Email Classification
Social Network Analysis
Mention Resolution

Page 9: The Enron and W3C Collections

Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann <[email protected]>
To: Suzanne Adams <[email protected]>
Subject: Re: GE Conference Call has be rescheduled

Did Sheila want Scott to participate? Looks like the call will be too late for him.

Who is that “Sheila”?

Page 10: The Enron and W3C Collections

Rich Evidence about Identity

[Figure: graph linking the addresses, names, and nicknames observed for one person, with labels such as "susan m scott", "susan scott", "scott susan", "susan", "sue", "sscott", and "sscott5"]

66,715 models; 82,084 addr-name; 3,151 addr-nickname; 19,708 addr-addr

Page 11: The Enron and W3C Collections

Test Collection of Mention Resolution

Test Collection   Emails     Identities   Queries   Candidates (Min. / Avg. / Max.)
Sager             1,628      627          51        1 / 4 / 11
Shapiro           974        855          49        1 / 8 / 21
Enron-subset      54,018     27,340       78        1 / 152 / 489
Enron-all         248,451    123,783      78        3 / 518 / 1785

Page 12: The Enron and W3C Collections

Evaluation

Task
  Named mention -> ranked list of people

Measures
  Mean Reciprocal Rank
  Success @ K
  Success @ 1
  Confidence-based scoring
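For reference, Mean Reciprocal Rank averages 1/rank of the correct person over all queries (counting 0 when the correct person is not in the list), and Success @ K is the fraction of queries whose correct person appears in the top K. The Python sketch below computes both under those standard definitions. The identity labels are invented for illustration and do not come from the test collections, and confidence-based scoring is not covered.

```python
def reciprocal_rank(ranked, correct):
    """Return 1/rank of `correct` in `ranked` (1-based), or 0.0 if it is absent."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate == correct:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_list, correct_answer) pairs."""
    return sum(reciprocal_rank(ranked, correct) for ranked, correct in runs) / len(runs)


def success_at_k(runs, k):
    """Fraction of queries whose correct answer appears in the top k results."""
    hits = sum(1 for ranked, correct in runs if correct in ranked[:k])
    return hits / len(runs)


# Hypothetical resolver output: one ranked candidate list per mention query.
runs = [
    (["sheila_a", "sheila_b", "sheila_c"], "sheila_a"),   # correct at rank 1
    (["scott_x", "scott_y"], "scott_y"),                  # correct at rank 2
    (["kim_p", "kim_q", "kim_r", "kim_s"], "kim_s"),      # correct at rank 4
    (["greg_m"], "greg_n"),                               # correct answer missing
]
mrr = mean_reciprocal_rank(runs)
s_at_1 = success_at_k(runs, 1)
s_at_2 = success_at_k(runs, 2)
print(mrr, s_at_1, s_at_2)
```

Success @ 1 is the special case K = 1, which asks whether the top-ranked person is the right one.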

Page 13: The Enron and W3C Collections

Limitations (Mention Resolution)

Small number of queries
Only resolved by Enron employees
  Much easier
  Most participants are outsiders
Measures focus only on accuracy

Page 14: The Enron and W3C Collections

Identity-Content Interplay

Search for People
Search for Content
Social Context
Topical Context

Page 15: The Enron and W3C Collections

W3C Collection

Set of mailing lists
  Public, not private
  Topically oriented

~175,000 emails

Introduced at TREC 2005
  50 topics (x 2 years)
  Relevance judgments available for ad-hoc retrieval

Page 16: The Enron and W3C Collections

Research Problems (W3C)

Expert Finding
  Topic -> ranked list of experts

Known-item Retrieval
  Query -> ranked list of emails

Discussion Search (i.e., ad-hoc retrieval)
  Pro/con retrieval
  Query -> ranked list of emails

Page 17: The Enron and W3C Collections

Topic Type Analysis

Find categories amenable to pro/con classification (TREC 2005 Enterprise Track)

[Bar chart: Number of Topics in Categories; one bar per category, axis from 0 to 30 topics]

A: Comparisons, usefulness, relationships
B: Methods, tips, solutions
C: Discuss an issue
D: Problems, impacts
E: Definitions, functionality
F: Reasons, design rationales

Page 18: The Enron and W3C Collections

Limitations (Pro/Con Retrieval)

Not private/personal communication
Mailing lists: receivers are hidden
Topical categories are unbalanced
Developed by researchers, NOT users

Page 19: The Enron and W3C Collections

Related Projects

Others working with CMU’s Enron emails
  Berkeley, CMU, U Mass, SIAM Workshop

University of Southern California ISI/ICT
  eArchivarius, Postel collection (Anton Leuski)

Georgia Tech Research Institute
  PERPOS, Presidential records (Bill Underwood)

Page 20: The Enron and W3C Collections

Conclusion

Two email test collections
  Public
  Hundreds of thousands of emails
  Annotated emails and transcripts
  Tasks and ground truth

Need for “real” user needs

Development of evaluation measures for utility

Page 21: The Enron and W3C Collections


For More Information

Joint Institute for Knowledge Discovery http://www.umiacs.umd.edu/jikd

Page 22: The Enron and W3C Collections


Running System