© 2001. Chris Staff. Department of Computer Science and AI, University of Malta.
CSA402: Lecture 17 1
CSA402
Lecture 17
The Architecture of a
Generic AHS:
Personal WebWatcher
References
Mladenic, D. (1996), Personal WebWatcher: design and implementation. Available on-line athttp://www.cs.cmu.edu/afs/cs/project/theo-4/text-learning/www/pww/papers/PWW/pwwTR.ps.Z
Mladenic, D. (1999), Machine learning used by Personal WebWatcher. Available on-line athttp://www.cs.cmu.edu/afs/cs/project/theo-4/text-learning/www/pww/papers/PWW/pwwACAI99.ps.gz
Additional information about Personal WebWatcher can be found athttp://www.cs.cmu.edu/afs/cs/project/theo-4/text-learning/www/pww/index.html
© 2001. Chris Staff. Department of Computer Science and AI, University of Malta.
CSA402: Lecture 17 2
Overview
• We have seen the goals and objectivesof Adaptive Hypertext Systems
• We have seen how research into ITShas influenced ITS-based AHSarchitecture
• We have seen how to represent userinterests through User Modeling
• We have seen how InformationRetrieval can be used to search forrelevant documents based on a userquery
• Today, we will see how PersonalWebWatcher recommends documentsto a user, based on an analysis of thedocuments that a user browses
© 2001. Chris Staff. Department of Computer Science and AI, University of Malta.
CSA402: Lecture 17 3
Personal WebWatcher
• Personal WebWatcher was inspired byWebWatcher
• WebWatcher analyses user interactionsin a Web site to offer an improvedservice to future users of that site
• WebWatcher is not a personal assistant
• Personal WebWatcher (PWW) is apersonal search assistant
• PWW observes users of the WWW andsuggests pages that they may beinterested in
• PWW learns the individual interests ofits users from the Web pages that theusers visit
• The learned user model is then used tosuggest new HTML pages to the user
© 2001. Chris Staff. Department of Computer Science and AI, University of Malta.
CSA402: Lecture 17 4
Architecture of Personal WebWatcher
(From Mladenic, D. 1998 Personal WebWatcher: design and implementation)
• PWW consists of two main parts:
• a Web proxy server• a learner
• The proxy saves URLs of visiteddocuments to disk
• The learner uses them to generate amodel of user interests
© 2001. Chris Staff. Department of Computer Science and AI, University of Malta.
CSA402: Lecture 17 5
The Proxy Server
• Consists of three main parts:
• the proxy itself (together with afetcher that retrieves a document)
• the adviser• the classifier
• When a user requests a Web page, theproxy fetches, adding advice if it is inHTML format
• Advice is added by the adviser, whichextracts hyperlinks from the documentand sends them to the classifier
• The classifier compares the linked-todocuments to the user model, and, ifthere is a close enough match betweena user interest and the target document,it will recommend the link to the user
© 2001. Chris Staff. Department of Computer Science and AI, University of Malta.
CSA402: Lecture 17 6
The Learner
• The learner has two modes ofoperation
• LEARNER: in which it learnes anew user model from scratch
• UPDATER: in which it updates anexisting user model
• In LEARNER mode the learner mustdecide how to define the domain
• In UPDATER mode, the domain isalready defined, and so knows whichterms to use to represent a document invector space.
• In either mode, the learner processesthe document that the user visited*and* the documents lying one linkaway from the visited document
• It processes visited documents aspositive examples of the user's interest,and non-visited documents as negativeexamples
• The learner operates "off-line"
© 2001. Chris Staff. Department of Computer Science and AI, University of Malta.
CSA402: Lecture 17 7
The Learner model
(From Mladenic, D. 1998 Personal WebWatcher: design and implementation)
© 2001. Chris Staff. Department of Computer Science and AI, University of Malta.
CSA402: Lecture 17 8
The Model of User Interests
• The model is used to predict if somedocument is a positive example(relevant) or a negative example (non-relevant) to a user interest
• The model is applied to documentslinked to from a user-requesteddocument
• The hyperlinks in the requesteddocument are marked up accordingly
• However, because it is time-consumingto process the linked-to documents inreal-time, an estimation is made basedon the hyperlink in the requesteddocument...
UserHL →{pos, neg}
Is the hyperlink a positive or negative exampleof the user interests?
© 2001. Chris Staff. Department of Computer Science and AI, University of Malta.
CSA402: Lecture 17 9
• The hyperlink must be marked up insome way to attempt to captureinformation related to the destinationdocument - PWW uses an extendedrepresentation of a hyperlink
• Apart from the hyperlink's anchor text,the extended representation includesunderlined words, words in a windowaround the link source (e.g, thesentence), and words in headings abovethe hyperlink.
• Because the learner learns off-line,documents that the user has visited andunvisited hyperlinks from thosedocuments can be processed tocompare the predicted results to theactual user behaviour
UserDOC : Document →{pos, neg}Using the model generated from documents to
predict "interestingness" of hyperlinks
• PWW can also predict documentcontent based on a given hyperlink
DocumentHL : Hyperlink →document
© 2001. Chris Staff. Department of Computer Science and AI, University of Malta.
CSA402: Lecture 17 10
Representing documents in PWW
• PWW has two overall objectives:
• To learn about user interests fromfull-text representations ofdocuments
• To recommend links to user basedon predictions of the interestingnessof documents at the destination ofhyperlinks
• In IR, documents are typicallyrepresented as TFIDF-vectors
• PWW operates largely on HTMLdocuments, which have a structure
• Users may also want to add terms(which do not appear in the HTMLdocument) to the documentrepresentation (e.g., WebWatcher)
• PWW uses TFIDF-vector together withsuccess feedback (did the user follow arecommended link?)
© 2001. Chris Staff. Department of Computer Science and AI, University of Malta.
CSA402: Lecture 17 11
• If the advice was poor, then useadditional information to changeadvice to reflect what the user actuallydid.
• Term weight in the TFIDF vector isderived using mutual informationbetween term frequency and classvalue (using Quinlan's Decision Trees)
• Mutual information assigns higherweights to terms which are better atdistinguishing between interesting anduninteresting documents (based onobservation)
The Learning Algorithm
• PWW compares a Naive (simple)Bayesian classifier on frequencyvectors with the k-Nearest Neighbourapproach to generate a model of userinterests
• The Bayesian Classifier on booleanvectors assumes term independence
© 2001. Chris Staff. Department of Computer Science and AI, University of Malta.
CSA402: Lecture 17 12
• It defines the probability of class c forsome document doc that contains termw as proportional to
P(c)ΠwP(c|w)
• Yang's variation of k-NearestNeighbour defines the relevance ofclass c given document doc as
rel(c|doc) = ik=∑ 1 similarity(doc,Di) P(c|Di)
• PWW experiments show that k-NearestNeighbour gave slightly better resultsthan the Naive Bayesian Classifier, butthat the overall difference between thetwo learning algorithms was notsignificant (Mladenic, 1999)