PTAG: Large Scale Automatic Generation of Personalized Annotation TAGs for the Web

Paul Alexandru Chirita, Stefania Costache, Siegfried Handschuh, Wolfgang Nejdl

1* L3S Research Center
2* National University of Ireland

Proceedings of the 16th International Conference on World Wide Web, 2007
Session: Semantic Web and Web 2.0




Outline

- Abstract
- Introduction
- Previous Work
- Automatic Personalized Web Annotations
- Experimental Results
- Conclusions
- Future Work
- Comments

Abstract

The success of the Semantic Web depends on the availability of Web pages annotated with metadata

The paper proposes P-TAG, a method which automatically generates personalized tags for Web pages

For a given Web page, it produces keywords relevant both to the page's textual content and to the data residing on the surfer’s Desktop

Empirical evaluations with several algorithms pursuing this approach showed very promising results


Introduction (1/3)

The Semantic Web: a vision of a future Web of machine-understandable documents and data

Annotations are the main instrument; they enrich content with metadata in order to ease its automatic processing
  - The problem of traditional manual or semi-automatic annotation
  - Alternative method: tagging


Introduction (2/3)

Why automatic tagging?
  - The number of Web pages is growing very fast
  - Recommendation

Why personalization?
  - Automatically generated tags have the drawback of presenting only a generic view


Introduction (3/3)

Problems of user profiles
  - These profiles are laborious to create and need constant maintenance in order to reflect the changing interests of the user

The personal Desktop usually contains a very rich document corpus of personal information
  - It can and should be exploited for user personalization


Previous work (1/2)

- Generating annotations for the Web

  Brooks and Montanez [4]
  - Analyzed the effectiveness of tags for classifying blog entries and found that manual tags are less effective content descriptors than automated ones

  Cimiano et al. [10, 11]
  - Proposed PANKOW (Pattern-based Annotation through Knowledge on the Web)
  - It employs an unsupervised, pattern-oriented approach to categorize an instance with respect to a given ontology
  - C-PANKOW: an enhanced version of PANKOW
  - It requires an input ontology and outputs instances of the ontological concepts
  - The annotation is always directly rooted in the text of the Web page

Previous work (2/2)

- Generating annotations for the Web (cont’d)

  Dill et al. [14]
  - Present a platform for large-scale text analytics and automatic semantic tagging
  - The system spots known terms in a Web page and relates them to existing instances of a given ontology

- Text Mining for Keyword Extraction
- Text Mining for Keyword Association

Automatic personalized web annotations (1/4)

Three approaches to generate personalized web page annotations
  - Document Oriented Extraction
  - Keyword Oriented Extraction
  - Hybrid Extraction

Automatic personalized web annotations (2/4)

Document Oriented Extraction

Automatic personalized web annotations (3/4)

Keyword Oriented Extraction

Automatic personalized web annotations (4/4)

Hybrid Extraction

Experimental

Experimental Setup

Document set of the personal Desktop
  - E-mails, Web cache documents, all files (user selected paths)

For the annotation, the input Web pages were categorized by size
  - Small (below 4KB)
  - Medium (between 4KB and 32KB)
  - Large (more than 32KB)

A total of 96 Web pages were used as input to be annotated
  - Over 2000 resulting annotations
  - Each proposed keyword was rated 0 (not relevant) or 1 (relevant)

The quality of the produced annotations was measured using precision
  - The precision at level K (P@K)
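
As a minimal sketch of the metric, assuming P@K is the usual fraction of relevant items among the first K proposed keywords, the 0/1 ratings described above can be scored as follows. The function name `precision_at_k` and the sample ratings are illustrative, not taken from the paper.

```python
def precision_at_k(ratings, k):
    """Fraction of the first k proposed keywords that were rated relevant.

    ratings: list of 0/1 ratings, ordered as the system proposed the keywords.
    If fewer than k keywords were proposed, the missing ones count as misses.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(ratings[:k]) / k


# Hypothetical ratings for one annotated page (1 = relevant, 0 = not relevant).
ratings = [1, 0, 1, 1]
p_at_1 = precision_at_k(ratings, 1)
p_at_4 = precision_at_k(ratings, 4)
```

With these sample ratings, P@1 is 1.0 and P@4 is 0.75.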

Experimental Results (1/5)

Document Oriented Extraction

Small web pages

Medium web pages

Large web pages

Experimental Results (2/5)

Keyword Oriented Extraction

Small web pages Medium web pages Large web pages

Experimental Results (3/5)

Hybrid Extraction

Small web pages Medium web pages Large web pages

Experimental Results (4/5)

Precision at the first three output annotations for the best methods of each category

Experimental Results (5/5)

Examples of annotations

Applications

- Personalized Web Search
- Web Recommendations for Desktop Tasks
- Ontology Learning

Conclusions

Our technique overcomes the burden of manual tagging

The system does not require any manual definition of interest profiles

The system proposes a more diverse range of tags which are closer to the personal viewpoint of the user

The results produced provide high user satisfaction

Future Work

A shared server approach that supports social tagging
  - Diversity
  - Keywords are generated from millions of sources
  - Scalability
  - High utility for web search, analytics and advertising
  - Instant update

Comments

With regard to automatic tag generation, the existing tools are good enough to implement the system

Tag recommendation is a good incentive for users to give tags

Automatic tags are aids; for social networks on the Web, a user’s tags represent an understanding of “what the person is”

Finding Similar Documents

Cosine Similarity Based on TFxIDF

The weight of term t in document i is calculated from

$$w_{t,i} = tf_{t,i} \cdot idf_t$$

The cosine similarity between the vectors of two documents i and j is

$$\mathrm{Sim}(\vec{v}_i, \vec{v}_j) = \frac{\sum_{t} w_{t,i} \, w_{t,j}}{\sqrt{\sum_{t} w_{t,i}^2} \, \sqrt{\sum_{t} w_{t,j}^2}}$$

where the sums run over all terms of the two documents, and $w_{t,i}$, $w_{t,j}$ are the weights of term t for the two documents
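
The TFxIDF weighting and cosine similarity can be sketched in a few lines. This is an illustration only: it assumes raw term counts for tf and idf = log(N / df), which is one common variant and not necessarily the one used in the paper, and the names `tfidf_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumption: tf is the raw count and idf = log(N / df).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / df[term]) for term in df}
    return [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: a Web page and two Desktop documents.
docs = [
    ["semantic", "web", "tags"],
    ["semantic", "web", "tags"],
    ["desktop", "email"],
]
vectors = tfidf_vectors(docs)
same = cosine(vectors[0], vectors[1])       # identical term distributions
disjoint = cosine(vectors[0], vectors[2])   # no shared terms
```

Documents with no terms in common score 0, and documents with the same term distribution score 1 up to floating-point rounding.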

Extracting Keywords from Documents

Keyword extraction algorithms usually take a text document as input and then return a list of keywords

Each keyword has an associated value representing the confidence

Extracting Keywords from Documents

For keyword extraction, they use the following methods

Term Frequency

Lexical Compounds

Sentence Selection

Document Frequency

Term Frequency

Term frequency is combined with the position of the term's first appearance in the document. This is necessary especially for longer documents, because more informative terms tend to appear towards the beginning

[Formula: term score; labeled quantities: number of terms in the document, position of the first appearance of the term]
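
The scoring formula itself is not reproduced in the text. As a sketch under stated assumptions, the code below combines the two labeled quantities in one plausible way: score = (1/2 + 1/2 · (nrWords − pos) / nrWords) · log(1 + TF), where nrWords is the number of terms in the document, pos is the position of the term's first appearance, and TF is its frequency. The exact combination and the name `term_scores` are assumptions, not confirmed by the slides.

```python
import math


def term_scores(tokens):
    """Score each term by frequency, boosted when it first appears early.

    Assumed form: (0.5 + 0.5 * (n - pos) / n) * log(1 + tf),
    with n = number of terms, pos = 0-based index of first appearance.
    """
    n = len(tokens)
    first_pos = {}
    tf = {}
    for pos, term in enumerate(tokens):
        first_pos.setdefault(term, pos)   # remember only the first position
        tf[term] = tf.get(term, 0) + 1
    return {
        term: (0.5 + 0.5 * (n - first_pos[term]) / n) * math.log(1 + tf[term])
        for term in tf
    }


# Hypothetical tokenized document.
scores = term_scores(["tag", "web", "page", "tag"])
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this form the positional factor stays between 0.5 and 1, so a late first appearance dampens a term's score without removing it.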

Lexical Compounds

Noun analysis is the simplest approach for lexical compounds

Step 1: part-of-speech tagging of the document

Step 2: finding occurrences of the pattern { adjective? , noun+ }, where ? means zero or one and + means one or more

Step 3: ordering the patterns by frequency
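
The three steps can be sketched as follows, assuming Step 1 has already been done by some part-of-speech tagger and that its output is a list of (word, tag) pairs. The tag names "ADJ" and "NOUN" and the function name `lexical_compounds` are illustrative assumptions. Each tag is mapped to a single character so that the pattern { adjective? , noun+ } becomes the regular expression `A?N+`.

```python
import re
from collections import Counter


def lexical_compounds(tagged):
    """tagged: list of (word, tag) pairs from a POS tagger (Step 1).

    Returns (compound, frequency) pairs, most frequent first (Step 3).
    """
    # Encode each token as one character: A = adjective, N = noun, x = other.
    codes = "".join({"ADJ": "A", "NOUN": "N"}.get(tag, "x") for _, tag in tagged)
    counts = Counter()
    # Step 2: zero or one adjective followed by one or more nouns.
    for match in re.finditer(r"A?N+", codes):
        words = tagged[match.start():match.end()]
        counts[" ".join(word.lower() for word, _ in words)] += 1
    return counts.most_common()


# Hypothetical pre-tagged sentence.
tagged = [
    ("semantic", "ADJ"), ("web", "NOUN"), ("needs", "VERB"),
    ("annotations", "NOUN"), ("and", "CONJ"),
    ("semantic", "ADJ"), ("web", "NOUN"),
]
compounds = lexical_compounds(tagged)
```

Because character positions in the encoded string line up with token positions, each match span can be used directly to slice the original token list.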

Sentence Selection

This technique builds upon sentence oriented document summarization

Ranking the document sentences according to their salience score [26]

[Formula: sentence salience score; labeled quantities:]
  - Number of significant words in the sentence (* see Significant word)
  - Total number of words in the sentence
  - Position score
  - Optional parameter
  - Number of query terms present in a sentence
  - Number of terms in a query

Sentence Selection

Significant word

[Formula: definition of a significant word; labeled quantity: number of sentences in the document]
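
The salience score of [26] is not spelled out in the text. As a sketch, the code below assumes the labeled quantities combine as SW²/TW + PS + TQ²/NQ, where SW is the number of significant words in the sentence, TW the total number of words in the sentence, PS the position score, and the optional term uses TQ (query terms present in the sentence) and NQ (terms in the query). That combination is an assumption. The rule for deciding which words are significant and the computation of the position score are not recoverable here, so both are passed in by the caller. The name `sentence_score` is hypothetical.

```python
def sentence_score(words, significant, position_score=0.0, query=None):
    """Assumed salience score: SW^2 / TW + PS [+ TQ^2 / NQ].

    words: tokens of the sentence.
    significant: set of words considered significant (supplied by the caller).
    position_score: PS, supplied by the caller.
    query: optional list of query terms; enables the optional last term.
    """
    tw = len(words)
    if tw == 0:
        return 0.0
    sw = sum(1 for word in words if word in significant)
    score = sw * sw / tw + position_score
    if query:
        query_terms = set(query)
        tq = len(query_terms & set(words))   # query terms present in the sentence
        score += tq * tq / len(query_terms)
    return score


# Hypothetical sentence, significant-word set and query.
sentence = ["personal", "tags", "for", "web", "pages"]
base = sentence_score(sentence, {"tags", "web"})
with_query = sentence_score(sentence, {"tags", "web"}, query=["web", "search"])
```

Squaring SW and TQ before dividing rewards sentences that are dense in significant or query words rather than merely long.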

Finding Similar Keywords

To find related keywords, they use the following methods

Term Co-occurrence Statistics

Thesaurus Based Extraction

Term Co-occurrence Statistics

[Figure: labeled input: extracted keywords from the Web page]

Similarity Coefficients

Cosine similarity

Mutual Information

Likelihood Ratio
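
The slides name the three coefficients without giving their formulas. A minimal sketch, assuming all three are computed from document frequencies (df_x and df_y documents contain terms x and y, df_xy contain both, n documents in total): cosine as df_xy / sqrt(df_x · df_y), mutual information in its pointwise form, and the likelihood ratio as the standard G² statistic over the 2x2 co-occurrence table. These specific forms and all function names are assumptions, not taken from the paper.

```python
import math


def cosine_coefficient(df_x, df_y, df_xy):
    """Co-occurrence cosine: df_xy / sqrt(df_x * df_y)."""
    return df_xy / math.sqrt(df_x * df_y)


def mutual_information(df_x, df_y, df_xy, n):
    """Pointwise mutual information: log(n * df_xy / (df_x * df_y))."""
    if df_xy == 0:
        return float("-inf")
    return math.log(n * df_xy / (df_x * df_y))


def likelihood_ratio(df_x, df_y, df_xy, n):
    """G^2 log-likelihood ratio statistic over the 2x2 contingency table."""
    observed = [
        [df_xy, df_x - df_xy],                      # documents containing x
        [df_y - df_xy, n - df_x - df_y + df_xy],    # documents without x
    ]
    row_totals = [df_x, n - df_x]
    col_totals = [df_y, n - df_y]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:   # empty cells contribute nothing
                g2 += o * math.log(o * n / (row_totals[i] * col_totals[j]))
    return 2.0 * g2


# Hypothetical counts: two terms that co-occur exactly as often as chance predicts.
mi = mutual_information(20, 50, 10, 100)
g2 = likelihood_ratio(20, 50, 10, 100)
```

With these counts the two terms are statistically independent, so both mutual information and the likelihood ratio come out at zero (up to rounding), while terms that always appear together have a cosine coefficient of 1.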

Thesaurus Based Extraction