Semex: A Platform for Personal Information Management and Integration Xin (Luna) Dong University of Washington June 24, 2005


Page 1:

Semex: A Platform for Personal Information Management and Integration

Xin (Luna) Dong, University of Washington

June 24, 2005

Page 2:

Is Your Personal Information a Mine or a Mess?

[Figure labels: Intranet, Internet]


Page 4:

Questions Hard to Answer

Where are my SEMEX papers and presentation slides (maybe in an attachment)?

Page 5:

Index Data from Different Sources

E.g., Google, MSN desktop search

[Figure labels: Intranet, Internet]

Page 6:

Questions Hard to Answer

Where are my SEMEX papers and presentation slides (maybe in an attachment)?
Who is working on SEMEX?
What are the emails sent by my PKU alumni?
What are the phone numbers and emails of my coauthors?

Page 7:

Organize Data in a Semantically Meaningful Way

[Figure labels: Intranet, Internet]

Page 8:

Questions Hard to Answer

Where are my SEMEX papers and presentation slides (maybe in an attachment)?
Who is working on SEMEX?
What are the emails sent by my PKU alumni?
What are the phone numbers and emails of my coauthors?
Which of the SIGMOD’05 authors do I know?

Page 9:

Integrate Organizational and Public Data with Personal Data

[Figure labels: Intranet, Internet]

Page 10:

[Figure: association labels: OriginatedFrom, PublishedIn, ConfHomePage, ExperimentOf, ArticleAbout, BudgetOf, CourseGradeIn, AddressOf, Cites, CoAuthor, FrequentEmailer, HomePage, Sender, EarlyVersion, Recipient, AttachedTo, PresentationFor, ComeFrom]

Page 11:

SEMEX (SEMantic EXplorer) – I. Provide a Logical View of Data

[Figure: domain-model diagram. Classes: Person, Message, Document, Web Page, Presentation, Paper, Event. Associations: Cites, Cached, Softcopy, Sender, Recipients, Organizer, Participants, Author, Homepage. Data sources: HTML, Mail & calendar, Papers, Files, Presentations]

Page 12:

SEMEX (SEMantic EXplorer) – II. On-the-fly Data Integration

[Figure: domain-model diagram. Classes: Person, Message, Document, Web Page, Presentation, Paper, Event. Associations: Cites, Cached, Softcopy, Sender, Recipients, Organizer, Participants, Author, Homepage]

Page 13:

How to Find Alon’s Papers on My Desktop?

Page 14:

How to Find Alon’s Papers on My Desktop? – Google Search Results

[Screenshot text: Search: Alon Halevy; “Send me the semex demo slides again?”]

Page 15:

How to Find Alon’s Papers on My Desktop? – Google Search Results

[Screenshot text: Search: Alon Halevy; “Ignore previous request, I found them”]

Page 16:

How to Find Alon’s Papers on My Desktop? – Google Search Results

Page 17:

Semex Goal

Build a Personal Information Management (PIM) system prototype that provides a logical view of personal information
  Build the logical view automatically
    Extract object instances and associations
    Remove instance duplications
  Leverage the logical view for on-the-fly data integration
  Exploit the logical view for information search and browsing to improve people’s productivity
  Be resilient to the evolution of the logical view

Page 18:

An Ideal PIM is a Magic Wand


Page 20:

Outline

Problem definition and project goals
Technical issues:
  System architecture and instance extraction [CIDR’05]
  Reference reconciliation [Sigmod’05]
  On-the-fly data integration
  Association search and browsing
  Domain model personalization and evolution [WebDB’05]
Interleaved with Semex demo [Best demo in Sigmod’05]
Overarching PIM Themes

Page 21:

System Architecture

[Figure: architecture diagram. Modules: Data Collection Module, Data Analysis Module, Domain Management Module. Components: Extractors, Integrator, Reference Reconciliater, Indexer, Index, Association DB (Objects, Associations), Searcher, Browser, Analyzer, Domain Model, Domain Manager. Data sources: Word, PPT, PDF, Latex, Email, Webpage, Excel, DB]


Page 23:

Reference Reconciliation in Semex

[Figure: references to the same person, labeled Names and Emails: Xin (Luna) Dong, xin dong, xinluna dong, luna, dongxin, x. dong, Lab-#dong xin, dong xin luna]

Page 24:

Semex Without Reference Reconciliation

Search results for luna: 23 persons

luna dong: SenderOfEmails (3043), RecipientOfEmails (2445), MentionedIn (94)

Page 25:

Semex Without Reference Reconciliation

Search results for luna: 23 persons

Xin (Luna) Dong: AuthorOfArticles (49), MentionedIn (20)

Page 26:

Semex Without Reference Reconciliation

A Platform for Personal Information Management and Integration

Page 27:

Semex Without Reference Reconciliation

9 persons: dong xin, xin dong

Page 28:

Semex NEEDS Reference Reconciliation

Page 29:

Reference Reconciliation

A very active area of research in databases, data mining, and AI (surveyed in [Cohen et al., 2003])
Traditional approaches assume matching tuples from a single table
  Based on pair-wise comparisons
Harder in our context

Page 30:

Challenges

Article:
  a1=(“Bounds on the Sample Complexity of Bayesian Learning”, “703-746”, {p1,p2,p3}, c1)
  a2=(“Bounds on the sample complexity of bayesian learning”, “703-746”, {p4,p5,p6}, c2)
Venue:
  c1=(“Computational learning theory”, “1992”, “Austin, Texas”)
  c2=(“COLT”, “1992”, null)
Person:
  p1=(“David Haussler”, null)
  p2=(“Michael Kearns”, null)
  p3=(“Robert Schapire”, null)
  p4=(“Haussler, D.”, null)
  p5=(“Kearns, M. J.”, null)
  p6=(“Schapire, R.”, null)

Page 31:

Challenges

Article:
  a1=(“Bounds on the Sample Complexity of Bayesian Learning”, “703-746”, {p1,p2,p3}, c1)
  a2=(“Bounds on the sample complexity of bayesian learning”, “703-746”, {p4,p5,p6}, c2)
Venue:
  c1=(“Computational learning theory”, “1992”, “Austin, Texas”)
  c2=(“COLT”, “1992”, null)
Person:
  p1=(“David Haussler”, null)
  p2=(“Michael Kearns”, null)
  p3=(“Robert Schapire”, null)
  p4=(“Haussler, D.”, null)
  p5=(“Kearns, M. J.”, null)
  p6=(“Schapire, R.”, null)
  p7=(“Robert Schapire”, “[email protected]”)
  p8=(null, “[email protected]”)
  p9=(“mike”, “[email protected]”)

1. Multiple Classes
2. Limited Information
3. Multi-value Attributes

Page 32:

Intuition

Complex information spaces can be considered as networks of instances and associations between the instances

Key: exploit the network, specifically, the clues hidden in the associations

Page 33:

I. Exploiting Richer Evidence

Cross-attribute similarity – Name & email
  p5=(“Stonebraker, M.”, null)
  p8=(null, “[email protected]”)
Context Information I – Contact list
  p5=(“Stonebraker, M.”, null, {p4, p6})
  p8=(null, “[email protected]”, {p7})
  p6=p7
Context Information II – Authored articles
  p2=(“Michael Stonebraker”, null)
  p5=(“Stonebraker, M.”, null)
  p2 and p5 authored the same article
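The two kinds of evidence on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Semex implementation: the helper names, the three-letter token threshold, and the address `stonebraker@example.org` are made up for the example (the real address is elided in the transcript).

```python
import re

def name_tokens(name):
    """Lowercased alphabetic tokens, e.g. 'Stonebraker, M.' -> ['stonebraker', 'm']."""
    return re.findall(r"[a-z]+", name.lower())

def name_email_compatible(name, email):
    """Cross-attribute evidence: a name token of 3+ letters appears in the email's local part."""
    local = email.split("@", 1)[0].lower()
    return any(len(tok) >= 3 and tok in local for tok in name_tokens(name))

def shared_contacts(contacts_a, contacts_b, same_person):
    """Context evidence: count contact pairs already known to denote the same person."""
    return sum(
        1
        for a in contacts_a
        for b in contacts_b
        if a == b or frozenset((a, b)) in same_person
    )

# Hypothetical stand-ins for p5 and p8 from the slide.
p5 = {"name": "Stonebraker, M.", "email": None, "contacts": {"p4", "p6"}}
p8 = {"name": None, "email": "stonebraker@example.org", "contacts": {"p7"}}
same_person = {frozenset(("p6", "p7"))}  # the earlier decision p6 = p7

print(name_email_compatible(p5["name"], p8["email"]))
print(shared_contacts(p5["contacts"], p8["contacts"], same_person))
```

Neither signal compares a name with a name or an email with an email, which is why plain attribute-wise matching cannot use them.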

Page 34:

Considering Only Attribute-wise Similarities Cannot Merge Persons Well

Person references: 24076. Real-world persons (gold standard): 1750.

[Figure: bar chart; x-axis: Evidence (1 to 4); y-axis: #(Person Partitions), ticks from 1750 to 3350; data label: 3159; annotation: 1409]

Page 35:

Considering Richer Evidence Improves the Recall

Person references: 24076. Real-world persons: 1750.

[Figure: bar chart; x-axis: Evidence; y-axis: #(Person Partitions), ticks from 1750 to 3350]

Evidence      #(Person Partitions)
Attr-wise     3159
Name&Email    2169
Article       2169
Contact       2096

Other chart annotations: 1409, 346

Page 36:

II. Propagate Information between Reconciliation Decisions

Article:
  a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
  a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)
Venue:
  c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
  c2=(“ACM SIGMOD”, “1978”, null)
Person:
  p1=(“Robert S. Epstein”, null)
  p2=(“Michael Stonebraker”, null)
  p3=(“Eugene Wong”, null)
  p4=(“Epstein, R.S.”, null)
  p5=(“Stonebraker, M.”, null)
  p6=(“Wong, E.”, null)
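One way to picture propagation is a union-find structure in which merging two articles triggers merges of their venues and of compatible authors. The sketch below is an assumption-laden illustration, not the algorithm from the SIGMOD’05 paper: the matching rules (case-insensitive title plus equal pages, last name plus first initial) and all function names are hypothetical.

```python
class Partitions:
    """Minimal union-find over reference ids."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

    def same(self, a, b):
        return self.find(a) == self.find(b)


def last_and_initial(name):
    """'Michael Stonebraker' or 'Stonebraker, M.' -> ('stonebraker', 'm')."""
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
    else:
        first, last = name.rsplit(" ", 1)
    return last.lower(), first[0].lower()


def reconcile(articles, persons):
    parts = Partitions()
    ids = sorted(articles)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            title_a, pages_a, authors_a, venue_a = articles[a]
            title_b, pages_b, authors_b, venue_b = articles[b]
            if title_a.lower() == title_b.lower() and pages_a == pages_b:
                parts.union(a, b)
                # Propagation: the same article implies the same venue,
                # even though the venue names themselves do not match.
                parts.union(venue_a, venue_b)
                # Propagation: authors of the same article with compatible
                # names are merged.
                for pa in authors_a:
                    for pb in authors_b:
                        if last_and_initial(persons[pa]) == last_and_initial(persons[pb]):
                            parts.union(pa, pb)
    return parts


articles = {
    "a1": ("Distributed Query Processing", "169-180", ["p1", "p2", "p3"], "c1"),
    "a2": ("Distributed query processing", "169-180", ["p4", "p5", "p6"], "c2"),
}
persons = {
    "p1": "Robert S. Epstein", "p2": "Michael Stonebraker", "p3": "Eugene Wong",
    "p4": "Epstein, R.S.", "p5": "Stonebraker, M.", "p6": "Wong, E.",
}
parts = reconcile(articles, persons)
print(parts.same("c1", "c2"), parts.same("p2", "p5"))
```

In this toy run the venue pair c1/c2 is merged only because the articles were merged first; comparing “ACM Conference on Management of Data” with “ACM SIGMOD” directly would not have matched.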

Page 37:

Propagating Information between Reconciliation Decisions Further Improves Recall

Person references: 24076. Real-world persons: 1750.

[Figure: bar chart; x-axis: Evidence; y-axis: #(Person Partitions), ticks from 1750 to 3350; series: Traditional, Propagation]

Evidence      Traditional   Propagation
Attr-wise     3159          3159
Name&Email    2169          2146
Article       2169          2135
Contact       2096          2022

Page 38:

III. Reference Enrichment

p2=(“Michael Stonebraker”, null, {p1,p3})
p8=(null, “[email protected]”, {p7})
p9=(“mike”, “[email protected]”, null)

p8-9=(“mike”, “[email protected]”, {p7})
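Enrichment can be read as: once two references are judged to be the same person, pool their attribute values so that later comparisons see more evidence. A minimal Python sketch follows, assuming references are stored as sets of values; the field names and the address `stonebraker@example.org` are placeholders, since the real addresses are elided in the transcript.

```python
def same_email(ref_a, ref_b):
    """Two references sharing an email address are taken to be the same person."""
    return bool(ref_a["emails"] & ref_b["emails"])

def enrich(ref_a, ref_b):
    """Merge two reconciled references, keeping every known value."""
    return {
        "names": ref_a["names"] | ref_b["names"],
        "emails": ref_a["emails"] | ref_b["emails"],
        "contacts": ref_a["contacts"] | ref_b["contacts"],
    }

# Hypothetical stand-ins for p8 and p9 from the slide.
p8 = {"names": set(), "emails": {"stonebraker@example.org"}, "contacts": {"p7"}}
p9 = {"names": {"mike"}, "emails": {"stonebraker@example.org"}, "contacts": set()}

if same_email(p8, p9):
    p8_9 = enrich(p8, p9)
    print(p8_9)
```

The merged reference carries both a name and a contact list, so it can be compared with p2 on evidence that neither p8 nor p9 offered alone.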

Page 39:

Reference Enrichment Improves Recall More than Information Propagation

Person references: 24076. Real-world persons: 1750.

[Figure: bar chart; x-axis: Evidence (Attr-wise, Name&Email, Article, Contact); y-axis: #(Person Partitions), ticks from 1750 to 3350; legend: Traditional, Enrichment, Propagation; data labels per evidence type: 3159, 2169, 2169, 2096 and 3169, 2036, 2036, 1910]

Page 40:

Applying Both Information Propagation and Reference Enrichment Gets the Highest Recall

Person references: 24076. Real-world persons: 1750.

[Figure: bar chart; x-axis: Evidence (Attr-wise, Name&Email, Article, Contact); y-axis: #(Person Partitions), ticks from 1750 to 3350; legend: Traditional, Enrichment, Propagation, Full; data labels per evidence type: 3159, 2169, 2169, 2096 and 3169, 2002, 1990, 1873; other annotations: 1409, 125, 346]


Page 42:

Importing External Data Sources

[Figure: domain-model diagram. Classes: Person, Message, Document, Web Page, Presentation, Paper, Event. Associations: Cites, Cached, Softcopy, Sender, Recipients, Organizer, Participants, Author, Homepage]

Page 43:

Intuition: Explore associations in schema mapping

Traditional approaches: proceed in two steps
  Step 1. Schema matching (surveyed in [Rahm&Bernstein, 2001])
    Generate term matching candidates
    E.g., “paperTitle” in table Author matches “title” in table Article
  Step 2. Query discovery [Miller et al., 2000]
    Take term matching as input, generate mapping expressions (typically queries)
    E.g.,
      SELECT Article.title as paperTitle, Person.name as author
      FROM Article, Person
      WHERE Article.author = Person.id

Page 44:

Intuition: Explore associations in schema mapping

Traditional approaches: proceed in two steps
  Step 1. Schema matching (surveyed in [Rahm&Bernstein, 2001])
    Generate term matching candidates
    E.g., “paperTitle” in table Author matches “title” in table Article
  Step 2. Query discovery [Miller et al., 2000]
    Take term matching as input, generate mapping expressions (typically queries)
    E.g.,
      SELECT Article.title as paperTitle, Person.name as author
      FROM Article, Person
      WHERE Article.author = Person.id
  User’s input is needed to fill in the gap between Step 1 output and Step 2 input
Our approach: check association violations to filter inappropriate matching candidates

Page 45:

Integration Example

Person(name, email)
Book(title, year)
Article(title, page)
Conference(name, year)

Webpage-item(title, author, conf, year)

[Figure: edge labels: publishedIn, authoredBy, authoredBy]

Page 46:

Integration Example

Person(name, email)
Book(title, year)
Article(title, page)
Conference(name, year)

Webpage-item(title, author, conf, year)

[Figure: two panels over the same schemas; edge labels: authoredBy (first panel); publishedIn, authoredBy (second panel)]


Page 48:

Explore the association network – 1. Find the relationship between two instances

Example: How did I know this person?
Solution: Lineage
  Find an association chain between two object instances
  Shortest chain?
  “Earliest” chain OR “Latest” chain
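For the “shortest chain” option, a breadth-first search over the association network is one natural choice. The sketch below assumes associations are stored as undirected (instance, label, instance) triples; the instance ids and labels are made-up examples, and the slide does not say which search Semex actually uses.

```python
from collections import deque

def shortest_chain(associations, start, goal):
    """Breadth-first search; returns the instances on a shortest chain, or None."""
    graph = {}
    for a, label, b in associations:
        graph.setdefault(a, []).append((label, b))
        graph.setdefault(b, []).append((label, a))
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            chain = []
            while node is not None:
                chain.append(node)
                node = previous[node]
            return chain[::-1]
        for _label, neighbor in graph.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None

# Hypothetical association triples.
associations = [
    ("me", "Recipient", "email-1"),
    ("email-1", "Sender", "Alon"),
    ("Alon", "CoAuthor", "Jayant"),
    ("me", "Author", "paper-1"),
    ("paper-1", "Author", "Jayant"),
]
print(shortest_chain(associations, "me", "Jayant"))
```

An “earliest” or “latest” chain would need timestamps on the associations and a different search criterion; that variant is not shown here.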

Page 49:

Explore the association network – 2. Find all instances related to a given keyword

Example: Who is working on “Schema Matching”?
Solution:
  Naive approach: index object instances on attribute values
  A list of papers on schema matching
  A list of emails on schema matching
  A list of persons working on schema matching
  A list of conferences for schema-matching papers
  A list of institutes that conduct schema-matching research
  Our approach: index objects on the attributes of associated objects
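Indexing an object on the attributes of its associated objects can be sketched as an inverted index in which each word also points to the neighbors of the object that contains it. The object ids, the single `text` attribute, and the one-hop limit below are assumptions made for illustration only.

```python
from collections import defaultdict

def build_association_index(objects, associations):
    """Map each word to the objects that contain it or are associated with an object that does."""
    index = defaultdict(set)

    def add(words_of, target):
        for word in objects[words_of]["text"].lower().split():
            index[word].add(target)

    for obj_id in objects:
        add(obj_id, obj_id)        # the naive part: own attribute values
    for a, _label, b in associations:
        add(a, b)                  # b becomes findable through a's words
        add(b, a)                  # and a through b's words
    return index

def search(index, phrase, objects, cls):
    """Objects of class `cls` reachable from every word of the phrase."""
    hits = None
    for word in phrase.lower().split():
        found = index.get(word, set())
        hits = found if hits is None else hits & found
    return sorted(o for o in (hits or set()) if objects[o]["class"] == cls)

# Hypothetical objects and associations.
objects = {
    "paper1": {"class": "Paper", "text": "Generic Schema Matching"},
    "alice": {"class": "Person", "text": "Alice"},
    "bob": {"class": "Person", "text": "Bob"},
}
associations = [("paper1", "Author", "alice")]

index = build_association_index(objects, associations)
print(search(index, "schema matching", objects, "Person"))
```

With only the naive index the person query would return nothing, because no Person object has “schema” or “matching” among its own attribute values.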

Page 50:

Explore the association network – 3. Rank returned instances in a keyword search

Example: What are important papers on “schema matching”?
Solution:
  Naive approach: rank by TF/IDF metric
  Our approach: rank by
    Significance score: PageRank measure
    Relevance score: TF/IDF metric
    Usage score: last visit time and modification time
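The slide names three scores but not how they are combined. The sketch below assumes a simple weighted sum and computes the significance score with a basic power-iteration PageRank; the weights, the damping factor, and the precomputed relevance and usage values are all illustrative assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict: node -> list of nodes it links to."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = links.get(n) or nodes  # dangling nodes spread rank evenly
            share = damping * rank[n] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank

def rank_results(candidates, weights=(1.0, 1.0, 1.0)):
    """candidates: name -> (significance, relevance, usage); highest combined score first."""
    w_sig, w_rel, w_use = weights
    scored = {
        name: w_sig * sig + w_rel * rel + w_use * use
        for name, (sig, rel, use) in candidates.items()
    }
    return sorted(scored, key=scored.get, reverse=True)

# Hypothetical citation links: papers a and b both cite paper c.
significance = pagerank({"a": ["c"], "b": ["c"], "c": []})
candidates = {
    "a": (significance["a"], 0.4, 0.1),  # relevance and usage values are made up
    "c": (significance["c"], 0.4, 0.1),
}
print(rank_results(candidates))
```

In practice the three scores live on different scales, so some normalization would be needed before summing; that step is left out of the sketch.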

Page 51:

Explore the association network – 4. Fuzzy Queries

Queries we pose today: something we can describe
  Find me something with (related to) keyword X
  Find me the co-authors of Person Y
Fuzzy queries:
  Q: What do I want to know?
  A: In this webpage, 5 papers are written by your friends
  Q: What significant things have happened today?
  A: The President wrote an email to you!!


Page 53:

The Domain Model

The logical view is described with a domain model
Semex provides very basic classes and associations as a default domain model
Users can personalize the domain model

[Figure: domain-model diagram. Classes: Person, Message, Document, Web Page, Presentation, Paper, Event. Associations: Cached, Softcopy, Sender, Recipients, Organizer, Participants, Author, Homepage, cite]

Page 54:

Problems in Domain Model Personalization

Problem: hard to precisely model a domain
  At a certain point we are not able to give a precise domain model
    Not enough knowledge of the domain
    Inherent evolution of a domain
    Non-existence of a precise model
  Overly detailed models may be a burden to users
    Modeling every detail of the information on one’s desktop is often overwhelming
    We may want to leave part of the domain unstructured
  Extract descriptions at different levels of granularity
    Address vs. street, city, state, zip

Page 55:

Malleable Schemas

Key idea: capture the important aspects of the domain model without committing to a strict schema

[Figure labels: Clean Schema, Structured data sources, Unstructured data sources, Malleable Schema]

Page 56:

Malleable Schema

Introduce “text” into schemas
  Phrases as element names
    E.g., “InitialPlanningPhaseParticipant”
  Regular expressions as element names
    E.g., “*Phone”, “State|Province”
  Chains as element names
    E.g., “name/firstName”
Introduce imprecision into queries
  SELECT S.~name, S.~phone
  FROM Student as S, ~Project as P
  WHERE (S ~initialParticipant P) AND (P.name = “Semex”)
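The element-name patterns above can be mimicked in Python to show what they would match. This sketch assumes that “*” is a wildcard and “|” separates alternatives, and it uses the standard-library `fnmatch` module rather than anything from Semex; the function names and the sample record are hypothetical, and the `~` query operators are not modeled.

```python
from fnmatch import fnmatchcase

def element_matches(pattern, element_name):
    """True if the name matches any '|'-separated alternative; '*' is a wildcard."""
    return any(fnmatchcase(element_name, alt) for alt in pattern.split("|"))

def resolve_chain(record, chain):
    """Follow a chain such as 'name/firstName' through nested dictionaries."""
    value = record
    for step in chain.split("/"):
        value = value[step]
    return value

# Hypothetical record with nested and loosely named elements.
student = {
    "name": {"firstName": "Luna", "lastName": "Dong"},
    "homePhone": "555-0100",
    "Province": "n/a",
}

phones = [key for key in student if element_matches("*Phone", key)]
regions = [key for key in student if element_matches("State|Province", key)]
print(phones, regions, resolve_chain(student, "name/firstName"))
```

A schema written this way does not have to commit to a single element name up front: `homePhone`, `officePhone`, or `Phone` would all satisfy the same pattern.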


Page 58:

Overarching PIM Themes

It is PERSONAL data!
  How to build a system supporting users in their own habitat?
  How to create an ‘AHA!’ browsing experience and increase user’s productivity?
There can be any kind of INFORMATION
  How to combine structured and unstructured data?
We are pursuing life-long data MANAGEMENT
  What is the right granularity for modeling personal data?
  How to manage data and schema that evolve over time?

Page 59:

Related Work

Personal Information Management Systems
  Indexing
    Stuff I’ve Seen (MSN Desktop Search) [Dumais et al., 2003]
    Google Desktop Search [2004]
  Richer relationships
    MyLifeBits [Gemmell et al., 2002]
    Placeless Documents [Dourish et al., 2000]
    LifeStreams [Freeman and Gelernter, 1996]
  Objects and associations
    Haystack [Karger et al., 2005]

Page 60:

Summary

60 years have passed since the personal Memex was envisioned
  It’s time to get serious
  Great challenges for data management
Deliverables of the project
  An approach to automatically build a database of objects and associations from personal data
  An algorithm for on-the-fly integration
  Algorithms for data analysis for association search and browsing
  The concept of malleable schema as a modeling tool
  A PIM system incorporating the above

Page 61:

Association Network for Semex

[Figure: association network. Nodes: Project: Semex; Person: Luna; Person: Alon; Person: Jayant; Person: Michelle; Person: Yuhan; CIDR. Edge labels: participant, projectLeader, advisor, co-worker, Advice-giver, ArticleAbout, publishedIn]