Searching Data with Substance and Style

Amélie Marian, Rutgers University
http://www.cs.rutgers.edu/~amelie

Upload: amelie-marian

Post on 30-Jun-2015


TRANSCRIPT

Page 1: Searching data with substance and style

Searching Data with Substance and Style
Amélie Marian
Rutgers University

http://www.cs.rutgers.edu/~amelie

Page 2: Searching data with substance and style

Amélie Marian - Rutgers University

Semi-structured Data Processing

•Large amount of data online and in personal devices
▫Structure (style)
▫Text content (substance)
▫Different sources (soul)
▫Finding the data we need can be difficult

Page 3: Searching data with substance and style

Semi-structured Data Processing at Rutgers SPIDR Lab

•Personal Information Search
▫Semi-structured data
▫Need for high-quality search tools

•Structuring of User Web Posts
▫Large amount of user-generated data untapped
▫Text has inherent structure
▫Use of text to guide search and analyze data

•Data Corroboration
▫Conflicting sources of data
▫Need to identify true facts

Page 4: Searching data with substance and style

Unified Structure and Content Search for Personal Information Management Systems

Joint work with:

Wei Wang, Christopher Peery, Thu Nguyen

Computer Science, Rutgers University

Page 5: Searching data with substance and style

Personal Information Search

Information that can be used for personal information search
• Content (keywords)
• Metadata (file size, modification time, etc.)
• Structure
▫ Directory (external)
▫ File structure (internal): XML, LaTeX tags, Picture tags, etc.
▫ Partially known

Web vs. personal data: search for relevant documents vs. search for specific documents

Page 6: Searching data with substance and style


PIMS Project Description

• Data and query models that unify content and structure

• Scoring framework to rank unified search results

• Query processing algorithms and index structures to score and rank answers efficiently

• Evaluation of the quality and efficiency of the unified scoring

NSF CAREER Award July 2009-2014

EDBT’08, ICDE’08 (demo), DEB’09, EDBT’11, TKDE (accepted)

Page 7: Searching data with substance and style

Separate Structure and Content

Directory: //Home

Keywords: Halloween, witch

Target file: Halloween party pictures taken at home where someone wears a witch costume

[Figure: directory tree and file contents, separated by the file boundary]

Page 8: Searching data with substance and style

Current Search Tools

Current search tools (e.g., web, desktop, GDS) mostly rely on ranking and filtering.
▫ Ranking: content keywords
▫ Filtering: additional conditions (e.g., metadata, structure)

Example: find a jpg file saved in directory /Desktop/Pictures/Home that contains the words “Halloween witch”

This approach is often insufficient.
▫ Filtering forces a binary decision. Gif files and files under directory /Archive/Pictures/Home are not returned.
▫ Structure and content are strictly separated. Files under directory /Pictures/Halloween are not returned.

Page 9: Searching data with substance and style

Unified Approach

Goal: Unify structure and content
▫ Develop a unified view of directory and file structure
▫ Allow a single query to contain both structure and content components and to be answered at once
▫ Return results even if queries are incomplete or contain mistakes

Approach:
▫ Define a unified data model by ignoring file boundaries
▫ Define a unified query model
▫ Define relaxations to approximate unified queries
▫ Define a relevance score for unified queries

Page 10: Searching data with substance and style

Unified Structure and Content

Target file: Halloween party pictures taken at home where someone wears a witch costume

[Figure: unified data tree running from the root through the Home directory, across the file boundary, down to the content terms “Halloween” and “witch”]

Unified query: //Home[.//“Halloween” and .//“witch”]
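To make the unified query concrete, here is a minimal sketch that treats directories, files, and content terms as nodes of one tree and evaluates a query of the form //Home[.//“Halloween” and .//“witch”]. The `Node` class, the `match_unified` helper, and the sample tree are hypothetical names and data introduced for illustration; the sketch does not distinguish structure nodes from content nodes, which is a simplification of the model described on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of the unified tree: a directory, a file, or a content term."""
    label: str
    children: list = field(default_factory=list)

    def descendants(self):
        # Depth-first walk that crosses file boundaries like any other edge.
        for child in self.children:
            yield child
            yield from child.descendants()


def match_unified(root, anchor, terms):
    """Return nodes labelled `anchor` that have every term among their descendants."""
    wanted = {t.lower() for t in terms}
    hits = []
    for node in [root, *root.descendants()]:
        if node.label.lower() != anchor.lower():
            continue
        below = {d.label.lower() for d in node.descendants()}
        if wanted <= below:
            hits.append(node)
    return hits


# Hypothetical data: only the first Home directory holds both terms.
picture = Node("IMG_1391.gif", [Node("Halloween"), Node("witch")])
notes = Node("notes.txt", [Node("Halloween")])
root = Node("root", [
    Node("Desktop", [Node("Pictures", [Node("Home", [picture])])]),
    Node("Archive", [Node("Home", [notes])]),
])

hits = match_unified(root, "Home", ["Halloween", "witch"])
```

Because the walk ignores where a file starts, the same query would also match if “Halloween” were a directory name rather than a tag inside the file.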

Page 11: Searching data with substance and style

From Query to Answers

[Figure: query processing pipeline. The user's query goes through relaxation, which produces a DAG of relaxed queries; matching produces the matches/answers; scoring produces the ranked answers (TA algorithm).]

Page 12: Searching data with substance and style

Query Relaxations

Target: IMG_1391.gif

• Edge Generalization: missing terms
▫ /Desktop/Home → /Desktop//Home
• Path Extension: only remember prefix
▫ /Desktop/Pictures → /Desktop/Pictures//*
• Node Generalization: misremember structure/content
▫ //Home//Halloween → //Home//{Halloween}
• Node Inversion: misremember order
▫ /Desktop//Home//{Halloween} → /Desktop//(Home//{Halloween})
• Node Deletion: extraneous terms
▫ /Desktop/Backup/Pictures//Home → /Desktop//Pictures//Home
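Three of the relaxations listed on this slide can be sketched as rewrites over a list of (axis, label) steps. This is only an illustration: the function names and the step representation are assumptions, paths are assumed to start with / or //, and node generalization and node inversion are left out because they need the richer {…} and (…) syntax.

```python
def parse(path):
    """Split '/a//b' into [('/', 'a'), ('//', 'b')]; path must start with / or //."""
    steps, i = [], 0
    while i < len(path):
        axis = "//" if path.startswith("//", i) else "/"
        i += len(axis)
        j = path.find("/", i)
        j = len(path) if j == -1 else j
        steps.append((axis, path[i:j]))
        i = j
    return steps


def render(steps):
    """Turn a list of (axis, label) steps back into a path string."""
    return "".join(axis + label for axis, label in steps)


def edge_generalization(steps, i):
    """Replace the edge leading into step i by an ancestor-descendant edge."""
    return steps[:i] + [("//", steps[i][1])] + steps[i + 1:]


def path_extension(steps):
    """Accept anything below the remembered prefix."""
    return steps + [("//", "*")]


def node_deletion(steps, i):
    """Drop step i; the step after it, if any, is reattached with '//'."""
    rest = steps[i + 1:]
    if rest:
        rest = [("//", rest[0][1])] + rest[1:]
    return steps[:i] + rest


relaxed = render(node_deletion(parse("/Desktop/Backup/Pictures//Home"), 1))
```

Each function returns a new step list, so relaxations can be chained to enumerate progressively looser queries.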

Page 13: Searching data with substance and style

DAG Representation

IDF score
▫ Function of how many files match the query
▫ DAG stores IDF scoring information

[Figure: relaxation DAG for the query /p/h, where p = Pictures and h = Home. The exact match /p/h is at the top; its relaxations include //p/h, /p//h, /(p/h), //p//h, //(p/h), /p/h//*, //p/h//*, //p//*, and //h//*, down to //* (match all).]

Page 14: Searching data with substance and style

Query Evaluation

• Top-k query processing
▫ Branch-and-bound approach
• Lazy evaluation of the relaxed DAG structure
▫ DAG is query dependent and has to be generated at runtime
▫ We developed two algorithms to speed up query evaluation:
  DAGJump skips unnecessary parts of the DAG (sorted accesses)
  RandomDAG zooms in on the relevant part of the DAG (random accesses)
• Matching of answers using dedicated data structures
▫ We extended PathStack (Bruno et al., ICDE’02) to support permutations (NIPathstack)
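The deck mentions top-k processing with sorted and random accesses and a TA algorithm for ranking. As background, here is a generic sketch of a threshold-style top-k algorithm over several score lists. It is not DAGJump or RandomDAG; the function name, the sum aggregate, and the treatment of missing scores as 0 are assumptions, and scores are assumed to be non-negative.

```python
import heapq


def threshold_algorithm(lists, k):
    """Top-k objects by summed score; `lists` is a list of {object: score} dicts."""
    # Sorted access: each list is read in decreasing score order.
    ordered = [sorted(l.items(), key=lambda kv: -kv[1]) for l in lists]
    seen = {}
    for depth in range(max(len(o) for o in ordered)):
        last = []
        for o in ordered:
            if depth >= len(o):
                last.append(0.0)  # exhausted list: unseen objects score 0 here
                continue
            obj, score = o[depth]
            last.append(score)
            if obj not in seen:
                # Random access: fetch the object's score in every list.
                seen[obj] = sum(l.get(obj, 0.0) for l in lists)
        # No unseen object can score higher than the sum of the last scores read.
        threshold = sum(last)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])


# Hypothetical scores from two relaxed-query lists.
result = threshold_algorithm(
    [{"a": 0.9, "b": 0.8, "c": 0.1}, {"a": 0.2, "b": 0.7, "c": 0.6}], k=1
)
```

In this example the loop stops after reading two entries per list, because the best seen score already reaches the threshold.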

Page 15: Searching data with substance and style

Traditional Content TF∙IDF Scoring

• Consider files as “bag of terms”
• TF (Term Frequency)
▫ A file that mentions a query term more often is more relevant
▫ TF could be normalized by file length
• IDF (Inverse Document Frequency)
▫ Terms that appear in too many files have little differentiation power in determining relevance
• TF∙IDF Scoring
▫ Aggregate TF and IDF scores across all query terms

$$score(q, d) = \sum_{t \in q} tf_{t,d} \cdot idf_t$$
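As a minimal sketch of the bag-of-terms scoring described on this slide, the following sums TF times IDF over the query terms. The slide does not fix the exact variants, so whitespace tokenization, length-normalized TF, and IDF = log(N / df) are assumptions, as are the function name and the sample files.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, docs):
    """Score each document as the sum over query terms of tf * idf."""
    n = len(docs)
    bags = {name: Counter(text.lower().split()) for name, text in docs.items()}
    # Document frequency: number of files that contain the term at all.
    df = {t: sum(1 for bag in bags.values() if t in bag) for t in query_terms}
    idf = {t: math.log(n / df[t]) if df[t] else 0.0 for t in query_terms}
    scores = {}
    for name, bag in bags.items():
        length = sum(bag.values()) or 1  # normalize TF by file length
        scores[name] = sum((bag[t] / length) * idf[t] for t in query_terms)
    return scores


# Hypothetical three-file collection.
scores = tf_idf_scores(
    ["halloween", "witch"],
    {"a": "halloween witch party", "b": "halloween pumpkin", "c": "tax report"},
)
```

File "a" ranks first because it contains the rarer term "witch" in addition to "halloween".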

Page 16: Searching data with substance and style

Unified IDF Score

For a unified data tree T, a path query PQ, and a file F, we define:

• IDF Score

$$score_{idf}(PQ) = \frac{\log \frac{N}{|matches(T, PQ)|}}{\log N}$$

where N is the total number of files, and matches(T, PQ) is the set of files that match PQ in T.
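A small worked sketch of this score, assuming the formula reads log(N / |matches(T, PQ)|) / log N. The function name and the sample counts are illustrative; the sketch requires at least one match and more than one file so that both logarithms are defined and the denominator is non-zero.

```python
import math


def idf_score(num_matches, total_files):
    """Normalized IDF of a path query: 1.0 for a single match, 0.0 if every file matches."""
    if total_files < 2 or not 1 <= num_matches <= total_files:
        raise ValueError("need 1 <= num_matches <= total_files and total_files >= 2")
    return math.log(total_files / num_matches) / math.log(total_files)


# Hypothetical collection of 100 files.
exact = idf_score(1, 100)      # very selective query
loose = idf_score(10, 100)     # relaxed query matching more files
match_all = idf_score(100, 100)
```

Relaxing a query can only add matches, so under this formula a relaxed query never scores higher than the query it was derived from.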

Page 17: Searching data with substance and style

TF Score

Path query: //a//{b}

[Figure: file F with structure nodes a, b, c, d under the root / and a content node “b e f b f”]

▫ match_struct = 1, nodes_struct = 4: normalized structure TF = 1/4 = 0.25
▫ match_content = 2, nodes_content = 5: normalized content TF = 2/5 = 0.4
▫ TF Score = ∑ f(x) = f(0.25) + f(0.4)

f(x) is a logarithmic function of x with an exponent 1/n, n = 2, 3, …; n affects the relative impact of TF on unified scores.

[Plot: f(x) against x, both axes from 0 to 1]

Page 18: Searching data with substance and style

Unified Score

Aggregate IDF and TF scores across all relaxed queries

Relaxed query         idf    tf     tf*idf
/a/b (exact match)    1.0    0.15   0.15
//a/b                 0.8    0.25   0.2
/a//b                 0.8    0.1    0.08
...                   ...    ...    ...

Unified Score = sum of tf*idf over all relaxed queries = 0.875
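The aggregation on this slide can be written as a sum of idf times tf over the relaxed queries a file matches. The sketch below uses only the three (idf, tf) pairs legible on the slide; the remaining relaxed queries are elided there, so the partial sum is lower than the slide's total of 0.875. The function name is an assumption.

```python
def unified_score(relaxed_scores):
    """Sum idf * tf over (idf, tf) pairs, one pair per matching relaxed query."""
    return sum(idf * tf for idf, tf in relaxed_scores)


# (idf, tf) for /a/b, //a/b and /a//b as read from the slide.
shown = [(1.0, 0.15), (0.8, 0.25), (0.8, 0.1)]
partial = unified_score(shown)  # 0.15 + 0.2 + 0.08
```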

Page 19: Searching data with substance and style

Experimental Setup

• Platform
▫ PC with a 64-bit hyper-threaded 2.8GHz Intel Xeon processor, 2GB memory, a 10K RPM 70GB SCSI disk, Linux 2.6.16 kernel, Sun Java 1.5.0 JVM
• Data Set
▫ Files and directories from the environment of a graduate student (15GB)
▫ 95,172 files (document 59%, email 34%) in 7,788 directories. Average directory depth is 6.3, with the longest being 12.
▫ 57M nodes in the unified data tree, with 49M (86%) leaf content nodes

Page 20: Searching data with substance and style

Relevance Comparison

•Use Lucene as a comparison basis
•Content-only
▫Use the standard Lucene content indexing and search
•Content:Dir
▫Create two Lucene indexes: content terms, and terms from the directory pathnames (treated as a small file)
•Content+Dir
▫Augment content index with directory path terms

Page 21: Searching data with substance and style

Case Study

▫ Search for a witch costume picture taken at home on Halloween

Target: IMG_1391.gif (tagged with “witch” and “Halloween”)

Query Type | Query Condition | Comment | Rank
U | //home[.//”witch” and .//”halloween”] | Accurate condition | 1
U | //halloween/witch/”home” | Structure / content switched | 1
C | {witch, halloween} | Accurate condition | 20
C:D | {witch, halloween} : {home} | Accurate condition | 1
C:D | {witch, home} : {halloween} | Structure / content switched | 245-252

Page 22: Searching data with substance and style

CDFs (Impact of Inaccuracies)

[Figure: four CDF plots of the percentage of queries (0% to 100%) against rank (axis ticks 1, 10, 100) for U, C:D, and C+D. Panels: 50% error, 1 swap; 50% error, 2 swap; 100% error, 1 swap; 100% error, 2 swap.]

Page 23: Searching data with substance and style

Query Processing Performance

[Figure: CDF of the percentage of queries (0% to 100%) against query processing time (0 to 10 sec) for U and C:D.]

Page 24: Searching data with substance and style

Personal Information Search Contributions

• A multi-dimensional search framework that supports fuzzy query conditions
• Scoring techniques for fuzzy query conditions against a unified view of structure and content
▫ Improves search accuracy over content-based methods by leveraging both structure and content information as well as relationships between the terms
▫ Shows improvements over existing techniques (GDS, TopX)
• Efficient index structures and optimizations to process multi-dimensional and unified queries
▫ Significantly reduced the overall query processing time
• Future work directions: User studies, Twig matching, Result granularity, Context

Page 25: Searching data with substance and style

Structuring and Searching Web Content

Joint work with:

Gayatree Ganu, Computer Science, Rutgers University
Noémie Elhadad, Biomedical Informatics, Columbia University

User Review Structure Analysis Project (URSA)
Patient Emotion and stRucture SEarch USer interface (PERSEUS)

Page 26: Searching data with substance and style

URSA: User Review Structure Analysis Project Description

•Aim:
▫Better understanding of user reviews
▫Better search and access of user reviews

•Tasks:
▫Structure Identification and Analysis
▫Text and Structure Search
▫Similarity Search in Social Networks

Google Research Award – April 2008

WebDB’09

Page 27: Searching data with substance and style

Online Reviewing Systems: Citysearch

Data in Reviews
•Structured metadata
•Textual review body
▫Sentiment information
▫Information on product-specific features

Users are inconvenienced because:
• Large number of reviews available
• Hard to find relevant reviews
• Vague or undefined information needs

Page 28: Searching data with substance and style

Data Description

•Restaurant reviews extracted from Citysearch, New York (http://newyork.citysearch.com)
•The corpus contains:
▫5531 restaurants
- associated structured information (name, location, cuisine type)
- a set of reviews
▫52264 reviews, of which 1359 are editorial reviews
- structured information (star rating, username, date)
- unstructured text (title, body, pros, cons)
▫32284 distinct users
- distinct username information
•Dataset accessible at http://www.research.rutgers.edu/~gganu/datasets/

Page 29: Searching data with substance and style

Structure Identification

•Classification of review sentences with topic and sentiment information

Sentence Topics: Food, Service, Price, Ambience, Anecdotes, Miscellaneous

Sentence Sentiment: Positive, Negative, Neutral, Conflict

Page 30: Searching data with substance and style

Text-Based Recommendation System: Evaluation Setting

•For evaluation, we separated three non-overlapping test sets of about 260 reviews:
▫Test A and B: Users who have reviewed at least two restaurants (so that training set has at least one review)
▫Test C: Users with at least 5 reviews
•For measuring accuracy of prediction we use the Root Mean Square Error (RMSE)
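For reference, RMSE is the square root of the mean squared difference between predicted and actual ratings. A minimal sketch follows; the function name and the sample ratings are illustrative, not values from the evaluation.

```python
import math


def rmse(predicted, actual):
    """Root Mean Square Error between two equally long rating sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical predictions against star ratings.
error = rmse([3.0, 4.0], [3.0, 2.0])  # one exact hit, one miss by 2 stars
```

Squaring penalizes large misses more than small ones, so a lower RMSE means fewer badly wrong predictions.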

Page 31: Searching data with substance and style

Text-Based Recommendation System: Steps

•Text-derived rating score
▫Regression-based rating
•Goals
1. Predicting the metadata star rating
2. Predicting the text-derived score
• Only predicts the score, not the content of the reviews
• Lower standard deviations: lower RMSE
•Prediction Strategies
▫Average-based prediction
▫Personalized prediction

Page 32: Searching data with substance and style

Regression-based Text Rating

• Use text of reviews to generate a rating
• Different categories and sentiment should have different importance in the rating

Method
• We use multivariate quadratic regression
• Each normalized sentence type [(category, sentiment)] is a variable in the regression
• Dependent variable is metadata star-rating
• Used training sets to learn the weights for each sentence type; weights are used in computing text-based score

Page 33: Searching data with substance and style

Regression-based Text Rating

• Regression Constant: 3.68

• Regression Weights (First order variables)

                Positive   Negative   Neutral   Conflict
Food              2.62      -2.65      -0.08     -0.69
Price             0.39      -2.12      -1.27      0.93
Service           0.85      -4.25      -1.83      0.36
Ambience          0.75      -0.27       0.16      0.21
Anecdotes         0.95      -1.75       0.06     -0.19
Miscellaneous     1.30      -2.62      -0.30      0.36

Food, Negative Price, and Negative Service appear to be most important

• Regression Weights (Second order variables)

                Positive   Negative   Neutral   Conflict
Food             -1.99       2.05      -0.14      0.67
Price            -0.27       2.04       2.17     -1.01
Service          -0.52       3.15       1.76      0.34
Ambience         -0.44       0.81      -0.28     -0.61
Anecdotes        -0.40       2.03      -0.03     -0.20
Miscellaneous    -0.65       2.38       0.5      -0.10
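To show how the constant and the two weight tables could combine into a text-derived rating, here is a sketch under two stated assumptions: the first table holds the first-order weights and the second table the second-order weights, and the model has no cross terms, so rating = constant + Σ a·x + Σ b·x², where x is the share of a review's sentences of a given (topic, sentiment) type. The function and variable names are illustrative.

```python
SENTIMENTS = ["Positive", "Negative", "Neutral", "Conflict"]
CONSTANT = 3.68

# Weights copied from the slide, one row per topic, columns in SENTIMENTS order.
FIRST_ORDER = {
    "Food": [2.62, -2.65, -0.08, -0.69],
    "Price": [0.39, -2.12, -1.27, 0.93],
    "Service": [0.85, -4.25, -1.83, 0.36],
    "Ambience": [0.75, -0.27, 0.16, 0.21],
    "Anecdotes": [0.95, -1.75, 0.06, -0.19],
    "Miscellaneous": [1.30, -2.62, -0.30, 0.36],
}
SECOND_ORDER = {
    "Food": [-1.99, 2.05, -0.14, 0.67],
    "Price": [-0.27, 2.04, 2.17, -1.01],
    "Service": [-0.52, 3.15, 1.76, 0.34],
    "Ambience": [-0.44, 0.81, -0.28, -0.61],
    "Anecdotes": [-0.40, 2.03, -0.03, -0.20],
    "Miscellaneous": [-0.65, 2.38, 0.5, -0.10],
}


def text_rating(proportions):
    """proportions maps (topic, sentiment) to the share of the review's sentences."""
    score = CONSTANT
    for (topic, sentiment), x in proportions.items():
        k = SENTIMENTS.index(sentiment)
        # Assumed model: linear plus squared term per sentence type, no cross terms.
        score += FIRST_ORDER[topic][k] * x + SECOND_ORDER[topic][k] * x * x
    return score


all_food_positive = text_rating({("Food", "Positive"): 1.0})
all_service_negative = text_rating({("Service", "Negative"): 1.0})
```

Under these assumptions a review made up entirely of positive food sentences scores 3.68 + 2.62 - 1.99, and one made up entirely of negative service sentences scores 3.68 - 4.25 + 3.15.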

Page 34: Searching data with substance and style

Regression-Based Text Rating

Restaurant Average-based Prediction
• Prediction using average rating given to a restaurant by all users (we also tried user-average and combined)
• RMSE Errors:

Predicting Star Ratings              TEST A   TEST B   TEST C
Using Star Rating                    1.127    1.267    1.126
Using Sentiment-based text rating    1.126    1.224    1.046

Predicting Sentiment Text Rating     TEST A   TEST B   TEST C
Using Star Rating                    0.703    0.718    0.758
Using Sentiment-based text rating    0.545    0.557    0.514

Baseline Case

Predicting using text does better than the popularly used star rating

Page 35: Searching data with substance and style

Clustering-based strategies for recommendations

•KNN based on a clustering over star ratings
▫Little improvement over baseline
▫Does not take into account the textual information
▫Sparse data
▫Cold start problem
▫Hard clustering not appropriate

•Soft clustering
▫Partitions objects into clusters
▫Each user has a membership probability to each cluster

Page 36: Searching data with substance and style

Information Bottleneck Method

•Foundations in Rate Distortion Theory
•Allows choosing tradeoff between
▫Compression (number of clusters T)
▫Quality estimated through the average distortion between cluster points and cluster centroid (β parameter)
•Shown to work well with sparse datasets

N. Slonim, SIGIR 2002

Page 37: Searching data with substance and style


Leveraging text content for personalized predictions

•Use the sentence types (categories, sentiments) within the reviews as features

•Users clustered based on the type of information in their reviews

•Predictions are made using membership probabilities of clusters to find neighbors

Page 38: Searching data with substance and style

Example: Clustering using iIB algorithm

Input matrix to the iIB algorithm (before normalization). FP = Food Positive, FN = Food Negative, PP = Price Positive, PN = Price Negative.

         Restaurant1              Restaurant2              Restaurant3
         FP    FN    PP    PN     FP    FN    PP    PN     FP    FN    PP    PN
User1    0.6   0.2   0.2   -      -     -     -     -      -     -     -     -
User2    0.3   0.6   0.1   -      0.9   -     0.1   -      0.6   0.1   0.2   0.1
User3    0.7   0.1   0.15  0.05   -     -     -     -      0.2   0.8   -     -
User4    0.9   0.05  0.05  -      0.3   0.4   0.2   0.1    -     -     -     -
User5    -     -     -     -      -     -     -     -      -     0.7   0.3   -

User ratings (??? marks the rating to predict):

         Restaurant1   Restaurant2   Restaurant3
User1    4             -             -
User2    2             5             4
User3    4             ???           3
User4    5             2             -
User5    -             -             1

Page 39: Searching data with substance and style

Example: Soft-clustering Prediction

•For each cluster we compute the cluster contribution for the test restaurant
• Weighted average of ratings given to the restaurant
• Contribution(c2,r2)=4.793, Contribution(c3,r2)=3.487
•We compute the final prediction based on the cluster contributions for the test restaurant and the test user’s membership probabilities
• Final prediction for User3 on Restaurant2 (marked *) = 4.042

User rating (star or text)

         Restaurant1   Restaurant2   Restaurant3
User1    4             -             -
User2    2             5             4
User3    4             *             3
User4    5             2             -
User5    -             -             1

Cluster Membership Probabilities

         Cluster1   Cluster2   Cluster3
User1    0.040      0.057      0.903
User2    0.396      0.202      0.402
User3    0.380      0.502      0.118
User4    0.576      0.015      0.409
User5    0.006      0.990      0.004
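The numbers on this slide can be reproduced with a short sketch, assuming that a cluster's contribution is the membership-weighted average of the ratings given to the test restaurant and that the final prediction weights each contribution by the test user's membership probability. The function names are illustrative; the data is copied from the two tables on the slide.

```python
# Ratings given to Restaurant2, the test restaurant.
ratings_r2 = {"User2": 5, "User4": 2}

# Cluster membership probabilities (Cluster1, Cluster2, Cluster3).
membership = {
    "User1": [0.040, 0.057, 0.903],
    "User2": [0.396, 0.202, 0.402],
    "User3": [0.380, 0.502, 0.118],
    "User4": [0.576, 0.015, 0.409],
    "User5": [0.006, 0.990, 0.004],
}


def cluster_contribution(cluster, ratings, membership):
    """Membership-weighted average of the ratings given to the test restaurant."""
    weighted = sum(membership[u][cluster] * r for u, r in ratings.items())
    weight = sum(membership[u][cluster] for u in ratings)
    return weighted / weight


def predict(user, ratings, membership):
    """Combine cluster contributions using the user's membership probabilities."""
    return sum(
        p * cluster_contribution(c, ratings, membership)
        for c, p in enumerate(membership[user])
    )


c2 = cluster_contribution(1, ratings_r2, membership)
c3 = cluster_contribution(2, ratings_r2, membership)
prediction = predict("User3", ratings_r2, membership)
```

Rounded to three decimals, `c2`, `c3`, and `prediction` agree with the slide's 4.793, 3.487, and 4.042.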

Page 40: Searching data with substance and style

iIB Algorithm

• Experimented with different values of β and T, used β=20, T=100.
• RMSE errors and percentage improvement over baseline:

Predicting Star Ratings              TEST A          TEST B          TEST C
Using Star Rating                    1.103 (2.13%)   1.242 (1.74%)   1.106 (1.78%)
Using Sentiment-based text rating    1.113 (1.15%)   1.211 (1.06%)   1.046 (0%)

Predicting Sentiment Text Rating     TEST A          TEST B          TEST C
Using Star Rating                    0.692 (1.56%)   0.704 (1.95%)   0.742 (2.11%)
Using Sentiment-based text rating    0.544 (0.18%)   0.549 (1.44%)   0.514 (0%)

• Using text features for clustering always improves the traditional goal of predicting star ratings
• Even small improvements in RMSE are useful (Netflix, precision in top-k)

Page 41: Searching data with substance and style

URSA: Qualitative Predictions

•Predict sentiment towards each topic
•Cluster users along each dimension separately
•Use threshold to classify sentiment (actual and predicted)

[Figure: prediction accuracy (0% to 100%) for positive ambience, for actual-sentiment thresholds θact from A-0 to A-1 and predicted-sentiment thresholds from P-0 to P-1, in steps of 0.1.]

Page 42: Searching data with substance and style

PERSEUS Project Description

Patient Emotion and StRucture SEarch USer Interface
▫Large amount of patient-produced data
• Difficult to search and understand
• Patients need help finding information
• Health professionals could learn from the data
▫Analyze and search patient forums, mailing lists and blogs
• Topical information
• Specific language
• Time sensitive
• Emotionally charged

Google Research Award – April 2010
NSF CDI Type I – October 2010-2013

Page 43: Searching data with substance and style

PERSEUS Project Description

▫Automatically add structure to free-text
• Use of context information
• “hair loss”: side effect or symptom?
• Approximate structure

▫Use structure to guide search
• Need for high recall, but good precision
• Find users with similar experiences
• Various results granularities
• Thread vs. sentence
• Context dependent
• Needs to take approximation into account

Page 44: Searching data with substance and style

Structuring and Searching Web Content Contributions

• Leveraged automatically generated structure to improve predictions
▫ Around 2% RMSE improvements
▫ Used inferred structure to group users using soft clustering techniques
• Qualitative predictions
▫ High accuracy
• Future directions
▫ Extension to healthcare domains
▫ Use of inferred structure to guide search
▫ Use user clusters in search
▫ Adapt to various result granularities
▫ Take classification inaccuracies into account

Page 45: Searching data with substance and style

Web Data Corroboration

Joint work with:
Minji Wu, Computer Science, Rutgers University

Collaborators:
Serge Abiteboul, Alban Galland (INRIA)
Pierre Senellart (Telecom ParisTech)
Magda Procopiuc, Divesh Srivastava (AT&T Research Labs)
Laure Berti-Equille (IRD)

Page 46: Searching data with substance and style

Minji Wu - Rutgers University

Motivations

•Information on web sources is unreliable
▫Erroneous
▫Misleading
▫Biased
▫Outdated

•Users need to check web sites to confirm the information
▫Data corroboration

Page 47: Searching data with substance and style

Example: What is the gas mileage of my Honda Civic?

Query: “honda civic 2007 gas mileage” on MSN Search

• Is the top hit, the honda.com site, unbiased?

• Is the autoweb.com web site trustworthy?

• Are all these values referring to the correct year of the model?

Users may check several web sites to get an answer

Page 48: Searching data with substance and style

Example: Identifying good business listings

•NYC restaurant information from 6 sources
▫Yellowpages
▫Menupages
▫Yelp
▫Foursquare
▫OpenTable
▫Mechanical Turk (check streetview)

Which listings are correct?

Page 49: Searching data with substance and style

Data Corroboration Project Description

Trustworthy sources report true facts
True facts come from trustworthy sources

• Sources have different
▫ Coverage
▫ Domain
▫ Dependencies
▫ Overlap

Conflict resolution with maximum coverage

Microsoft Live Labs Search Award – May 2006

WebDB’07, WSDM’10, IS’11, DEB’11

Page 50: Searching data with substance and style

Top-k Join: Project Description

Integrate and aggregate information from several sources

CleanDB’06, PVLDB’10

Example scored tuples:
(“minji”, “vldb10”, 0.2)
(“minji”, “SIN”, 0.1)
(“SIN”, “vldb10”, 0.9)
(“amélie”, “vldb10”, 0.5)
(“minji”, “amélie”, 1.0)
(“amélie”, “SIN”, 0.3)

Page 51: Searching data with substance and style

Data Corroboration Contributions

• Probabilistic model for corroboration
▫ Fact uncertainty
▫ Source trustworthiness
▫ Source coverage
▫ Conflict between sources
• Fixpoint techniques to compute truth values of facts and source quality estimates
• Top-k query algorithms for computing corroborated answers
• Open Issues:
▫ Functional dependencies
▫ Time
▫ Social network
▫ Uncertain data
▫ Source dependence
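The mutual dependence between fact truth and source trust lends itself to a fixpoint iteration. The sketch below is a generic illustration of that idea and not the probabilistic model from the cited papers: it assumes truth is a trust-weighted vote, trust is a source's average agreement with the current truth estimates, and both start at 0.5. The function name and the sample votes are hypothetical.

```python
def corroborate(votes, rounds=100, eps=1e-9):
    """votes maps source -> {fact: True/False}; returns (truth, trust) estimates."""
    trust = {s: 0.5 for s in votes}
    facts = {f for claims in votes.values() for f in claims}
    truth = {f: 0.5 for f in facts}
    for _ in range(rounds):
        # Truth of a fact: trust-weighted share of sources asserting it.
        new_truth = {}
        for f in facts:
            voters = [s for s in votes if f in votes[s]]
            total = sum(trust[s] for s in voters)
            yes = sum(trust[s] for s in voters if votes[s][f])
            new_truth[f] = yes / total if total else 0.5
        # Trust of a source: how much its votes agree with the current truth.
        new_trust = {}
        for s, claims in votes.items():
            agree = [new_truth[f] if v else 1 - new_truth[f] for f, v in claims.items()]
            new_trust[s] = sum(agree) / len(agree) if agree else 0.5
        delta = max((abs(new_truth[f] - truth[f]) for f in facts), default=0.0)
        truth, trust = new_truth, new_trust
        if delta < eps:  # fixpoint reached
            break
    return truth, trust


# Hypothetical votes: source C disagrees with A and B on fact f1.
truth, trust = corroborate({
    "A": {"f1": True, "f2": True},
    "B": {"f1": True, "f2": True},
    "C": {"f1": False, "f2": True},
})
```

In this example the dissenting source ends up with lower trust than the two agreeing sources, which in turn raises the estimated truth of f1 above the plain two-thirds majority.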

Page 52: Searching data with substance and style

Conclusions

•New challenges in web data management
▫Semi-structured data
• PIMS
• User reviews
▫Multiple sources of data
• Conflicting information
• Low quality data providers (Web 2.0)

•SPIDR lab at Rutgers focuses on helping users identify useful data in the wealth of information available

Page 53: Searching data with substance and style


Thank you!