© 2001. Chris Staff. Department of Computer Science and AI, University of Malta.

CSA402 Lecture 17

The Architecture of a Generic AHS: Personal WebWatcher

References

Mladenic, D. (1996), Personal WebWatcher: design and implementation. Available on-line at http://www.cs.cmu.edu/afs/cs/project/theo-4/text-learning/www/pww/papers/PWW/pwwTR.ps.Z

Mladenic, D. (1999), Machine learning used by Personal WebWatcher. Available on-line at http://www.cs.cmu.edu/afs/cs/project/theo-4/text-learning/www/pww/papers/PWW/pwwACAI99.ps.gz

Additional information about Personal WebWatcher can be found at http://www.cs.cmu.edu/afs/cs/project/theo-4/text-learning/www/pww/index.html

Overview

• We have seen the goals and objectives of Adaptive Hypertext Systems

• We have seen how research into ITS has influenced ITS-based AHS architecture

• We have seen how to represent user interests through User Modeling

• We have seen how Information Retrieval can be used to search for relevant documents based on a user query

• Today, we will see how Personal WebWatcher recommends documents to a user, based on an analysis of the documents that a user browses

Personal WebWatcher

• Personal WebWatcher was inspired by WebWatcher

• WebWatcher analyses user interactions in a Web site to offer an improved service to future users of that site

• WebWatcher is not a personal assistant

• Personal WebWatcher (PWW) is a personal search assistant

• PWW observes users of the WWW and suggests pages that they may be interested in

• PWW learns the individual interests of its users from the Web pages that the users visit

• The learned user model is then used to suggest new HTML pages to the user

Architecture of Personal WebWatcher

(From Mladenic, D. 1998 Personal WebWatcher: design and implementation)

• PWW consists of two main parts:

  • a Web proxy server
  • a learner

• The proxy saves URLs of visited documents to disk

• The learner uses them to generate a model of user interests

The Proxy Server

• Consists of three main parts:

  • the proxy itself (together with a fetcher that retrieves a document)
  • the adviser
  • the classifier

• When a user requests a Web page, the proxy fetches it, adding advice if the page is in HTML format

• Advice is added by the adviser, which extracts hyperlinks from the document and sends them to the classifier

• The classifier compares the linked-to documents to the user model, and, if there is a close enough match between a user interest and the target document, it will recommend the link to the user
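
The adviser and classifier steps above can be sketched roughly as follows. This is only an illustration, not PWW's implementation: the names `LinkExtractor`, `classify` and `advise`, the dictionary-of-term-weights user model, and the threshold are all assumptions, and the anchor text is used as a cheap stand-in for the linked-to document.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Adviser step (hypothetical): collect (href, anchor text) pairs."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._text).strip()))
            self._href = None


def classify(anchor_text, user_model, threshold):
    """Classifier step (assumed form): sum the user-model weights of the
    words in the anchor text and compare the total against a threshold."""
    score = sum(user_model.get(w, 0.0) for w in anchor_text.lower().split())
    return score >= threshold


def advise(html, user_model, threshold=1.0):
    """Return the hrefs that would be recommended to the user."""
    extractor = LinkExtractor()
    extractor.feed(html)
    extractor.close()
    return [href for href, text in extractor.links
            if classify(text, user_model, threshold)]


page = ('<p><a href="/ml">machine learning notes</a> and '
        '<a href="/sport">football scores</a></p>')
model = {"learning": 0.8, "machine": 0.6, "football": -0.5}
recommended = advise(page, model)
```

In a real proxy the recommended links would be marked up in the HTML before the page is returned to the browser; here they are simply returned as a list.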

The Learner

• The learner has two modes of operation

  • LEARNER: in which it learns a new user model from scratch
  • UPDATER: in which it updates an existing user model

• In LEARNER mode the learner must decide how to define the domain

• In UPDATER mode, the domain is already defined, and so the learner knows which terms to use to represent a document in vector space.

• In either mode, the learner processes the document that the user visited *and* the documents lying one link away from the visited document

• It processes visited documents as positive examples of the user's interest, and non-visited documents as negative examples

• The learner operates "off-line"
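
The labelling of training examples described above might look like the following minimal sketch. The function name `build_training_set` and the input shapes (a set of visited URLs and a dictionary mapping each URL to its outgoing links) are assumptions made for illustration.

```python
def build_training_set(visited, outlinks):
    """Label documents for the learner (hypothetical helper).

    visited:  set of URLs the user actually requested
    outlinks: dict mapping a URL to the URLs it links to
    Returns a dict mapping URL -> "pos" or "neg".
    """
    # Every visited document is a positive example.
    examples = {url: "pos" for url in visited}
    # Documents one link away that were never visited are negative examples.
    for url in visited:
        for target in outlinks.get(url, []):
            examples.setdefault(target, "neg")
    return examples


visited = {"a", "b"}
outlinks = {"a": ["b", "c"], "b": ["d"]}
examples = build_training_set(visited, outlinks)
```

Using `setdefault` means a linked-to document that the user did visit keeps its positive label rather than being overwritten as negative.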

The Learner model

(From Mladenic, D. 1998 Personal WebWatcher: design and implementation)

The Model of User Interests

• The model is used to predict if some document is a positive example (relevant) or a negative example (non-relevant) of a user interest

• The model is applied to documents linked to from a user-requested document

• The hyperlinks in the requested document are marked up accordingly

• However, because it is time-consuming to process the linked-to documents in real-time, an estimation is made based on the hyperlink in the requested document...

$\mathrm{User}_{HL} : \mathrm{Hyperlink} \to \{pos, neg\}$

Is the hyperlink a positive or negative example of the user interests?

• The hyperlink must be marked up in some way to attempt to capture information related to the destination document: PWW uses an extended representation of a hyperlink

• Apart from the hyperlink's anchor text, the extended representation includes underlined words, words in a window around the link source (e.g., the sentence), and words in headings above the hyperlink.

• Because the learner learns off-line, documents that the user has visited and unvisited hyperlinks from those documents can be processed to compare the predicted results to the actual user behaviour

$\mathrm{User}_{DOC} : \mathrm{Document} \to \{pos, neg\}$

Using the model generated from documents to predict "interestingness" of hyperlinks

• PWW can also predict document content based on a given hyperlink

$\mathrm{Document}_{HL} : \mathrm{Hyperlink} \to \mathrm{document}$
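
One way to picture the extended hyperlink representation is the sketch below. It is an assumption-laden illustration rather than PWW's code: `LinkContextParser`, `extended_representation` and the fixed-size word window are hypothetical, the "window" is counted in words instead of sentences, and underlined words are left out for brevity.

```python
import re
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def words(text):
    """Lower-case alphanumeric tokens of a text fragment."""
    return re.findall(r"[a-z0-9]+", text.lower())


class LinkContextParser(HTMLParser):
    """Record, for each hyperlink, its anchor words, its position in the
    word stream, and the words of the most recent heading above it."""

    def __init__(self):
        super().__init__()
        self.tokens = []      # every word of the page, in document order
        self.heading = []     # words of the most recent heading
        self.links = []
        self._in_heading = False
        self._open = None

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._in_heading = True
            self.heading = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = {"href": href, "anchor": [],
                              "headings": list(self.heading),
                              "start": len(self.tokens)}

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self._in_heading = False
        elif tag == "a" and self._open is not None:
            self._open["end"] = len(self.tokens)
            self.links.append(self._open)
            self._open = None

    def handle_data(self, data):
        found = words(data)
        self.tokens.extend(found)
        if self._in_heading:
            self.heading.extend(found)
        if self._open is not None:
            self._open["anchor"].extend(found)


def extended_representation(html, window=3):
    """Map each href to its anchor words, surrounding words and headings."""
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    result = {}
    for link in parser.links:
        before = parser.tokens[max(0, link["start"] - window):link["start"]]
        after = parser.tokens[link["end"]:link["end"] + window]
        result[link["href"]] = {"anchor": link["anchor"],
                                "window": before + after,
                                "headings": link["headings"]}
    return result


html = ("<h2>Machine Learning</h2><p>Read the "
        "<a href='/pww'>Personal WebWatcher report</a> "
        "for details on text learning.</p>")
reps = extended_representation(html, window=2)
```

The three word lists could then be merged into a single bag of words and classified in place of the (unfetched) destination document.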

Representing documents in PWW

• PWW has two overall objectives:

  • To learn about user interests from full-text representations of documents
  • To recommend links to the user based on predictions of the interestingness of documents at the destination of hyperlinks

• In IR, documents are typically represented as TFIDF-vectors

• PWW operates largely on HTML documents, which have a structure

• Users may also want to add terms (which do not appear in the HTML document) to the document representation (e.g., WebWatcher)

• PWW uses a TFIDF-vector together with success feedback (did the user follow a recommended link?)

• If the advice was poor, additional information is used to change the advice to reflect what the user actually did.

• Term weight in the TFIDF vector is derived using mutual information between term frequency and class value (using Quinlan's Decision Trees)

• Mutual information assigns higher weights to terms which are better at distinguishing between interesting and uninteresting documents (based on observation)
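
To make the idea of mutual-information weighting concrete, here is a small sketch. It simplifies the setting in one important way that should be read as an assumption: a term is treated as present or absent in a document, rather than using its frequency. The function name `mutual_information` and the toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(docs, labels, term):
    """Mutual information (in bits) between the presence of `term`
    and the class label, estimated from document counts.

    docs:   list of sets of words
    labels: list of class labels, same length as docs
    """
    n = len(docs)
    joint = Counter((term in doc, label) for doc, label in zip(docs, labels))
    term_counts = Counter()
    class_counts = Counter()
    for (present, label), count in joint.items():
        term_counts[present] += count
        class_counts[label] += count
    mi = 0.0
    for (present, label), count in joint.items():
        p_joint = count / n
        p_term = term_counts[present] / n
        p_class = class_counts[label] / n
        mi += p_joint * math.log2(p_joint / (p_term * p_class))
    return mi


docs = [{"learning", "web"}, {"learning", "agent"},
        {"football", "web"}, {"football", "news"}]
labels = ["pos", "pos", "neg", "neg"]
mi_learning = mutual_information(docs, labels, "learning")
mi_web = mutual_information(docs, labels, "web")
```

In this toy collection "learning" separates the two classes perfectly and so receives a high weight, while "web" occurs equally often in both classes and receives a weight of zero.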

The Learning Algorithm

• PWW compares a Naive (simple) Bayesian classifier on frequency vectors with the k-Nearest Neighbour approach to generate a model of user interests

• The Bayesian Classifier on boolean vectors assumes term independence

• It defines the probability of class c for some document doc, given the terms w that doc contains, as proportional to

$P(c) \prod_{w \in doc} P(w \mid c)$

• Yang's variation of k-Nearest Neighbour defines the relevance of class c given document doc as

$rel(c \mid doc) = \sum_{i=1}^{k} similarity(doc, D_i) \, P(c \mid D_i)$

• PWW experiments show that k-Nearest Neighbour gave slightly better results than the Naive Bayesian Classifier, but that the overall difference between the two learning algorithms was not significant (Mladenic, 1999)
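
The two classifiers can be compared on a toy example with the sketch below. Several choices here are assumptions, not details taken from PWW: the Naive Bayes score uses the standard form P(c) times the product of P(w|c) with add-one smoothing, similarity is the cosine of raw term-count vectors, D_1..D_k are taken to be the k most similar training documents, and P(c|D_i) is taken as 1 when D_i is labelled c and 0 otherwise. All function names are hypothetical.

```python
import math
from collections import Counter


def train_naive_bayes(docs, labels):
    """Count documents per class and term occurrences per class."""
    class_counts = Counter(labels)
    term_counts = {c: Counter() for c in class_counts}
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)
    vocab = {w for doc in docs for w in doc}
    return class_counts, term_counts, vocab


def naive_bayes_scores(doc, model):
    """Return log(P(c) * prod_w P(w|c)) for each class c."""
    class_counts, term_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for c in class_counts:
        total_terms = sum(term_counts[c].values())
        score = math.log(class_counts[c] / total_docs)
        for w in doc:
            if w in vocab:  # ignore words never seen in training
                score += math.log((term_counts[c][w] + 1)
                                  / (total_terms + len(vocab)))
        scores[c] = score
    return scores


def cosine(a, b):
    """Cosine similarity of two term-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def knn_relevance(doc, docs, labels, target, k=3):
    """rel(target|doc): sum of similarities to the k nearest training
    documents that carry the label `target`."""
    query = Counter(doc)
    nearest = sorted(((cosine(query, Counter(d)), label)
                      for d, label in zip(docs, labels)),
                     key=lambda pair: pair[0], reverse=True)[:k]
    return sum(sim for sim, label in nearest if label == target)


docs = [["machine", "learning", "text"], ["learning", "agent", "web"],
        ["football", "score", "news"], ["football", "league", "table"]]
labels = ["pos", "pos", "neg", "neg"]
query = ["text", "learning", "agent"]

nb = naive_bayes_scores(query, train_naive_bayes(docs, labels))
rel_pos = knn_relevance(query, docs, labels, "pos")
rel_neg = knn_relevance(query, docs, labels, "neg")
```

Both methods rank the query document as more likely to be interesting than uninteresting on this data; the log form of the Naive Bayes score is used only to avoid numerical underflow on longer documents.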