Web Usage Mining and Using Ontology for Capturing Web Usage Semantic

İsmail Hakkı Toroslu, Middle East Technical University, Department of Computer Engineering, Ankara, Turkey

Upload: distinguished-lecturer-series-leon-the-mathematician

Post on 05-Dec-2014


DESCRIPTION

Professor Ismail Toroslu gave a lecture on "Web Usage Mining and Using Ontology for Capturing Web Usage Semantic" in the Distinguished Lecturer Series - Leon The Mathematician. More Information available at: http://dls.csd.auth.gr

TRANSCRIPT

Page 1: Web Usage Mining and Using Ontology for Capturing Web Usage Semantic

İsmail Hakkı Toroslu

Middle East Technical University, Department of Computer Engineering

Ankara, Turkey

Page 2: PART I

04/10/23

A New Approach for Reactive Web Usage Data Processing

Page 3: OUTLINE

• Web Mining

• Previous Session Reconstruction Heuristics

• Smart-SRA

• Agent Simulator

• Experimental Results

• Conclusion

Page 4: Web Mining

• Data Mining: Discover and retrieve useful and interesting patterns from a large dataset.

• Web mining: The dataset is the huge volume of web data.

• Dimensions:

– Web content mining
– Web structure mining
– Web usage mining

Page 5: Web Mining

Web Usage Mining (WUM)

Application of data mining techniques to web log data in order to discover user access patterns.

Example User Web Access Log

IP Address      Request Time               Method  URL     Protocol  Return Code  Bytes Transmitted
144.123.121.23  [25/Apr/2005:03:04:41–05]  GET     A.html  HTTP/1.0  200          3290
144.123.121.23  [25/Apr/2005:03:04:43–05]  GET     B.html  HTTP/1.0  200          2050
144.123.121.23  [25/Apr/2005:03:04:48–05]  GET     C.html  HTTP/1.0  200          4130
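As a minimal sketch of the first pre-processing step, the log lines above can be parsed into fields and grouped per IP address. The regular expression assumes the simplified seven-field layout shown on the slide; `parse_line` and `requests_by_ip` are illustrative names, not part of the lecture.

```python
import re
from collections import defaultdict

# Pattern for the simplified layout on the slide (an assumption: real server
# logs usually carry additional fields such as identity and user name).
LOG_LINE = re.compile(
    r"(?P<ip>\S+) \[(?P<time>[^\]]+)\] (?P<method>\S+) (?P<url>\S+) "
    r"(?P<protocol>\S+) (?P<status>\d{3}) (?P<bytes>\d+)"
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_LINE.fullmatch(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["bytes"] = int(record["bytes"])
    return record

def requests_by_ip(lines):
    """Group the URLs of successful (status 200) requests by IP address."""
    grouped = defaultdict(list)
    for line in lines:
        record = parse_line(line)
        if record is not None and record["status"] == 200:
            grouped[record["ip"]].append(record["url"])
    return dict(grouped)

SAMPLE_LOG = [
    "144.123.121.23 [25/Apr/2005:03:04:41–05] GET A.html HTTP/1.0 200 3290",
    "144.123.121.23 [25/Apr/2005:03:04:43–05] GET B.html HTTP/1.0 200 2050",
    "144.123.121.23 [25/Apr/2005:03:04:48–05] GET C.html HTTP/1.0 200 4130",
]
print(requests_by_ip(SAMPLE_LOG))
```

The request time is kept as a raw string here; a real pipeline would convert it to a timestamp before applying any time-based heuristic.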

Page 6: Web Mining

Phases of Web Usage Mining

Raw server log → Pre-Processing (Session Reconstruction Heuristics) → User session file → Pattern Discovery (Apriori, GSP, SPADE) → Rules and Patterns → Pattern Analysis → Interesting Knowledge → Applications

Page 7: Previous Session Reconstruction Heuristics

Session Reconstruction

• Sessions are reconstructed by using heuristics that select and group requests belonging to the same user session

• Types:

– Reactive: requests are processed after they have been handled by the web server
– Proactive: processing occurs during the user's interactive browsing of the web site

Page 8: Previous Session Reconstruction Heuristics

• Time-oriented heuristics

• Navigation-oriented heuristic

New Reactive Session Reconstruction Technique: Smart-SRA

Combines these heuristics with "site topology" information in order to increase the accuracy of the reconstructed sessions

Page 9: Previous Session Reconstruction Heuristics

Example Web Topology Graph

[Figure: web topology graph over the pages P1, P13, P20, P23, P34, P49]

Example Web Page Request Sequence

Page       P1  P20  P13  P49  P34  P23
Timestamp  0   6    15   29   32   47

Page 10: Previous Session Reconstruction Heuristics

Time-Oriented Heuristics - 1

Total session time: the duration of a discovered session is limited by a threshold

Page       P1  P20  P13  P49  P34  P23
Timestamp  0   6    15   29   32   47

Discovered Sessions (30 mins):

1. [P1, P20, P13, P49]

2. [P34, P23]

Page 11: Previous Session Reconstruction Heuristics

Time-Oriented Heuristics - 2

Page-stay time: the time spent on any page is limited by a threshold

Page       P1  P20  P13  P49  P34  P23
Timestamp  0   6    15   29   32   47

Discovered Sessions (10 mins):

1. [P1, P20, P13]

2. [P49, P34]

3. [P23]
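The two time-oriented heuristics can be sketched as follows, using the example request sequence and the 30-minute and 10-minute thresholds from the slides. The function names are illustrative, and treating a gap exactly equal to the threshold as still inside the session is an assumption.

```python
def split_by_total_time(requests, max_duration=30):
    """Heuristic 1: a session may span at most max_duration from its first request."""
    sessions = []
    for page, timestamp in requests:
        # sessions[-1][0][1] is the timestamp of the first request of the open session
        if sessions and timestamp - sessions[-1][0][1] <= max_duration:
            sessions[-1].append((page, timestamp))
        else:
            sessions.append([(page, timestamp)])
    return [[page for page, _ in session] for session in sessions]

def split_by_page_stay(requests, max_stay=10):
    """Heuristic 2: the gap between two consecutive requests may be at most max_stay."""
    sessions = []
    previous_timestamp = None
    for page, timestamp in requests:
        if sessions and timestamp - previous_timestamp <= max_stay:
            sessions[-1].append(page)
        else:
            sessions.append([page])
        previous_timestamp = timestamp
    return sessions

# Example request sequence from the slides (timestamps in minutes)
REQUESTS = [("P1", 0), ("P20", 6), ("P13", 15), ("P49", 29), ("P34", 32), ("P23", 47)]

print(split_by_total_time(REQUESTS))  # [['P1', 'P20', 'P13', 'P49'], ['P34', 'P23']]
print(split_by_page_stay(REQUESTS))   # [['P1', 'P20', 'P13'], ['P49', 'P34'], ['P23']]
```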

Page 12: Previous Session Reconstruction Heuristics

Navigation-Oriented Heuristic

Adding page WP_{N+1} to a session [WP_1, WP_2, …, WP_N]

• If WP_N has a hyperlink to WP_{N+1}:

[WP_1, WP_2, …, WP_N, WP_{N+1}]

• If WP_N does not have a hyperlink to WP_{N+1} and WP_{Kmax} is the nearest page having a hyperlink to WP_{N+1}, add backward browser moves:

[WP_1, WP_2, …, WP_N, WP_{N-1}, WP_{N-2}, ..., WP_{Kmax}, WP_{N+1}]

Page 13: Previous Session Reconstruction Heuristics

Navigation-Oriented Heuristic

Current Session                     Condition                                New Page
[ ]                                                                          P1
[P1]                                Link[P1, P20] = 1                        P20
[P1, P20]                           Link[P20, P13] = 0, Link[P1, P13] = 1    P13
[P1, P20, P1, P13]                  Link[P13, P49] = 1                       P49
[P1, P20, P1, P13, P49]             Link[P49, P34] = 0, Link[P13, P34] = 1   P34
[P1, P20, P1, P13, P49, P13, P34]   Link[P34, P23] = 1                       P23

Final session: [P1, P20, P1, P13, P49, P13, P34, P23]
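The navigation-oriented heuristic can be sketched as below. The link set contains only the hyperlinks that the trace states as present (Link = 1). Starting a new session when no earlier page links to the request is an assumption, since the slides do not cover that case; `navigation_sessions` is an illustrative name.

```python
def navigation_sessions(pages, links):
    """Rebuild sessions, inserting backward browser moves where needed."""
    sessions = []
    current = []
    for page in pages:
        if not current:
            current = [page]
        elif (current[-1], page) in links:
            current.append(page)
        else:
            # Walk backward to the nearest earlier page that links to `page`.
            for k in range(len(current) - 2, -1, -1):
                if (current[k], page) in links:
                    # Backward moves: WP_{N-1}, WP_{N-2}, ..., WP_{Kmax}
                    back_moves = current[k:len(current) - 1][::-1]
                    current.extend(back_moves)
                    current.append(page)
                    break
            else:
                # Assumption: with no referrer at all, open a new session.
                sessions.append(current)
                current = [page]
    if current:
        sessions.append(current)
    return sessions

# Hyperlinks stated as present in the trace on the slide
LINKS = {("P1", "P20"), ("P1", "P13"), ("P13", "P49"), ("P13", "P34"), ("P34", "P23")}

print(navigation_sessions(["P1", "P20", "P13", "P49", "P34", "P23"], LINKS))
# [['P1', 'P20', 'P1', 'P13', 'P49', 'P13', 'P34', 'P23']]
```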

Page 14: Smart-SRA

Contains Two Phases:

• Phase 1: Shorter request sequences are constructed by using overall session duration time and page-stay time criteria

Satisfies the overall session duration time limit

• Phase 2: Candidate sessions are partitioned into maximal sub-sessions such that:

– between each consecutive page pair in a session there is a hyperlink from the previous page to the next page

– the page-stay time criterion is also satisfied

Adds referrer constraints of the topology rule while eliminating the need for inserting backward browser movements.

Page 15: Smart-SRA

Steps of Phase 2

Process a candidate session from left to right by repeating the following steps until the candidate session is empty:

1. Determine the web pages without any referrer (on their left) and remove them from the candidate session

2. For each one of these pages
• For each previously constructed session
– If there is a hyperlink from the last page of the session to the web page, then append the web page to the session (if the page-stay time constraint is satisfied)

3. Remove non-maximal sessions

Page 16: Smart-SRA

Example Candidate Session

Page       P1  P20  P13  P49  P34  P23
Timestamp  0   6    9    12   14   15

Example Web Topology

[Figure: web topology graph over the pages P1, P13, P20, P23, P34, P49]

Page 17: Smart-SRA

Iteration 1
Candidate Session: [P1, P20, P13, P49, P34, P23]
New Session Set (before): (empty)
Temp Page Set: {P1}
Temp Session Set: [P1]
New Session Set (after): [P1]

Iteration 2
Candidate Session: [P20, P13, P49, P34, P23]
New Session Set (before): [P1]
Temp Page Set: {P20, P13}
Temp Session Set: [P1, P20], [P1, P13]
New Session Set (after): [P1, P20], [P1, P13]

Iteration 3
Candidate Session: [P49, P34, P23]
New Session Set (before): [P1, P20], [P1, P13]
Temp Page Set: {P49, P34}
Temp Session Set: [P1, P13, P34], [P1, P13, P49]
New Session Set (after): [P1, P13, P34], [P1, P13, P49], [P1, P20]

Iteration 4
Candidate Session: [P23]
New Session Set (before): [P1, P13, P34], [P1, P13, P49], [P1, P20]
Temp Page Set: {P23}
Temp Session Set: [P1, P13, P34, P23], [P1, P13, P49, P23], [P1, P20, P23]
New Session Set (after): [P1, P13, P34, P23], [P1, P13, P49, P23], [P1, P20, P23]
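The Phase 2 steps can be sketched as follows. This is a simplified reading of the slides rather than the authors' implementation. The link set is inferred from the worked examples (the five links stated in the navigation-oriented trace plus P49→P23 and P20→P23, which the final sessions imply), and the 10-minute page-stay limit is taken from the earlier heuristic. Pages that can extend no existing session are dropped after the first iteration, which is an assumption.

```python
def smart_sra_phase2(candidate, links, max_stay=10):
    """Split one candidate session [(page, timestamp), ...] into maximal sessions."""
    remaining = list(candidate)
    sessions = []  # each session is a list of (page, timestamp)
    while remaining:
        # Step 1: pages with no referrer to their left in the candidate session.
        start_idx = [
            i for i, (page, _) in enumerate(remaining)
            if not any((earlier, page) in links for earlier, _ in remaining[:i])
        ]
        start = [remaining[i] for i in start_idx]
        remaining = [r for i, r in enumerate(remaining) if i not in start_idx]
        if not sessions:
            sessions = [[request] for request in start]
            continue
        # Step 2: append each such page to every session it can follow.
        extended, used = [], set()
        for page, timestamp in start:
            for idx, session in enumerate(sessions):
                last_page, last_timestamp = session[-1]
                if (last_page, page) in links and timestamp - last_timestamp <= max_stay:
                    extended.append(session + [(page, timestamp)])
                    used.add(idx)
        # Step 3: sessions that were extended are no longer maximal.
        sessions = extended + [s for i, s in enumerate(sessions) if i not in used]
    return [[page for page, _ in session] for session in sessions]

# Links inferred from the worked examples (an assumption, see above)
LINKS = {("P1", "P20"), ("P1", "P13"), ("P13", "P49"), ("P13", "P34"),
         ("P34", "P23"), ("P49", "P23"), ("P20", "P23")}
CANDIDATE = [("P1", 0), ("P20", 6), ("P13", 9), ("P49", 12), ("P34", 14), ("P23", 15)]

for session in smart_sra_phase2(CANDIDATE, LINKS):
    print(session)
```

The loop always terminates because the first page of the remaining candidate has nothing to its left and is therefore removed in every iteration.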

Page 18: Agent Simulator

• Models the behavior of web users and generates web user navigation and the log data kept by the web server

• Used to compare the performance of alternative session reconstruction heuristics

• Uses 4 primitive behaviors for simulating complex navigation of a web user.

Page 19: Agent Simulator

User-Behavior I

Web user can start a new session with any one of the possible entry pages of the web site

[Figure: topology graph over P1, P13, P20, P23, P34, P49 with two sessions marked, S1 (Session I) and S2 (Session II), and requests numbered 1 and 2. Legend: start page; new request from server]

Page 20: Agent Simulator

User-Behavior II

Web user can select a new page having a link from the most recently accessed page

[Figure: topology graph over P1, P13, P20, P23, P34, P49 with requests numbered 1 and 2. Legend: start page; from cache; new request from server]

Page 21: Agent Simulator

User-Behavior III

Web user can select as the next page one having a link from any one of the previously browsed pages

[Figure: topology graph over P1, P13, P20, P23, P34, P49 with requests numbered 1 to 5. Legend: start page; from cache; new request from server]

Page 22: Agent Simulator

User-Behavior IV

Web user can terminate the session

[Figure: topology graph over P1, P13, P20, P23, P34, P49 with requests numbered 1 to 6. Legend: start page; from cache; new request from server]

Page 23: Agent Simulator

Parameters for simulating behavior of web user

• Session Termination Probability (STP)
• Link from Previous pages Probability (LPP)
• New Initial page Probability (NIP)
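A rough sketch of how the four behaviors and the three probabilities could drive one simulated agent is given below. The slides do not specify how the probabilities are combined or how a real session is recorded when the user returns to an earlier page, so the order of the checks, the branching of a new real session in Behavior III, and the `max_steps` guard are all assumptions. `simulate_agent` is an illustrative name.

```python
import random

def simulate_agent(links, entry_pages, stp, lpp, nip, rng, max_steps=1000):
    """Return (real_sessions, server_log) for one simulated user."""
    outgoing = {}
    for source, target in sorted(links):
        outgoing.setdefault(source, []).append(target)

    first = rng.choice(entry_pages)          # Behavior I: start at an entry page
    finished, current, server_log = [], [first], [first]

    for _ in range(max_steps):
        if rng.random() < stp:               # Behavior IV: terminate
            break
        if rng.random() < nip:               # Behavior I: new session from an entry page
            finished.append(current)
            page = rng.choice(entry_pages)
            current = [page]
            server_log.append(page)
            continue
        if len(current) > 1 and rng.random() < lpp:
            index = rng.randrange(len(current) - 1)   # Behavior III: earlier page
        else:
            index = len(current) - 1                  # Behavior II: most recent page
        targets = outgoing.get(current[index], [])
        if not targets:                      # dead end: nothing left to click
            break
        page = rng.choice(targets)
        if index < len(current) - 1:
            # Back moves are served from the browser cache, so only the new page
            # reaches the server log. Assumption: the real session branches here.
            finished.append(current)
            current = current[:index + 1] + [page]
        else:
            current.append(page)
        server_log.append(page)
    finished.append(current)
    return finished, server_log

# Topology inferred from the worked examples in Part I (an assumption)
LINKS = {("P1", "P20"), ("P1", "P13"), ("P13", "P49"), ("P13", "P34"),
         ("P34", "P23"), ("P49", "P23"), ("P20", "P23")}

sessions, log = simulate_agent(LINKS, ["P1"], stp=0.05, lpp=0.3, nip=0.3,
                               rng=random.Random(7))
print(len(sessions), "real sessions,", len(log), "logged requests")
```

Because the simulator knows the real sessions, they can be compared with the sessions that each heuristic reconstructs from `server_log` alone.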

Page 24: Experimental Results

Heuristics Tested

• Time-oriented heuristic (heur1) (total time < 30 min)

• Time-oriented heuristic (heur2) (page stay < 10 min)

• Navigation-oriented heuristic (heur3)

• Smart-SRA heuristic (heur4)

Page 25: Experimental Results

Accuracy

Reconstructed session H captures a real session R if R occurs as a subsequence of H (R ⊏ H). As the examples show, the pages of R must appear in H in order and without interruption.

R = [P1, P3, P5]

H = [P9, P1, P3, P5, P8] => R ⊏ H

H = [P1, P9, P3, P5, P8] => R ⋢ H
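A minimal sketch of the capture test, following the two examples above (the second fails because P9 interrupts R inside H). The helper `captured_share`, which reports the share of real sessions captured by at least one reconstructed session, is a hypothetical summary measure and not necessarily the exact accuracy formula used in the experiments.

```python
def captures(reconstructed, real):
    """True if `real` occurs inside `reconstructed` in order and without interruption."""
    n = len(real)
    return any(
        reconstructed[i:i + n] == real
        for i in range(len(reconstructed) - n + 1)
    )

def captured_share(real_sessions, reconstructed_sessions):
    """Hypothetical measure: fraction of real sessions captured by some reconstruction."""
    if not real_sessions:
        return 0.0
    hits = sum(
        any(captures(h, r) for h in reconstructed_sessions)
        for r in real_sessions
    )
    return hits / len(real_sessions)

R = ["P1", "P3", "P5"]
print(captures(["P9", "P1", "P3", "P5", "P8"], R))  # True
print(captures(["P1", "P9", "P3", "P5", "P8"], R))  # False
```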

Page 26: Experimental Results

Parameters for generating user sessions and web topology

Parameter                                  Value
Number of web pages (nodes) in topology    300
Average outdegree                          15
Average page stay time                     2.2 min
Deviation of page stay time                0.5 min
Number of agents                           10000

Parameter   Fixed value   Range
STP         5%            1%-20%
LPP         30%           0%-90%
NIP         30%           0%-90%

Page 27: Experimental Results

Accuracy vs. STP

[Figure: Real Accuracy vs. STP. x-axis: STP, ticks 1 to 19; y-axis: ticks 0 to 50; series: heur1, heur2, heur3, heur4]

Page 28: Experimental Results

Accuracy vs. LPP

[Figure: Real Accuracy vs. LPP. x-axis: LPP, ticks 0 to 90; y-axis: Real Accuracy %, ticks 0 to 50; series: heur1, heur2, heur3, heur4]

Page 29: Experimental Results

Accuracy vs. NIP

[Figure: Real Accuracy vs. NIP. x-axis: NIP, ticks 0 to 90; y-axis: Real Accuracy %, ticks 0 to 35; series: heur1, heur2, heur3, heur4]

Page 30: Conclusion

• New session reconstruction heuristic: Smart-SRA
– Does not allow sequences with unrelated consecutive requests (no hyperlink from the previous one to the next one)
  • No artificial browser (back) request insertion in order to prevent unrelated consecutive requests
– Only maximal sessions

• Agent simulator
• Accuracy measure
• Experimental results show Smart-SRA outperforms previous heuristics

Page 31: PART II

Semantically Enriched Event Based Model for Web Usage Mining

Page 32: OUTLINE

• Introduction
• Related Work
• Semantic Event Based Sessions
• Formal Definition of Semantic Events
• Algorithms for Mining Semantic Event Patterns
• Experimental Results
• Conclusion

Page 33: Introduction

• Traditional WUM is based on pageviews, but the user interaction model is changing

• Users do not care about pageviews; they use a web site to achieve high-level goals such as

– Finding and viewing a video
– Buying tickets
– Searching for the nearest Italian restaurant
– Listening to a song, etc.

Page 34: Introduction

• We should analyze usage data as a series of “events”

1. Search Mediterranean Restaurants
2. Search Italian Restaurants
3. View the reviews for Restaurant A
4. View the reviews for Restaurant B
5. Click the web site link of Restaurant A

• Incorporating semantic knowledge in the process is the logical choice

– A method should be devised to capture user behavior
– Usage data should be mapped to semantic space
– An algorithm should be developed to exploit semantic relations

Page 35: Introduction

• In this work we propose methods for:

– tracking and logging domain-level events
– injecting semantics into events
– semantic ordering of events
– an algorithm for computing sequences of frequent events

• The proposed system was tested with 2 web sites

– Music Streaming Site
– Mobile Network Operator’s Site

Page 36: Semantic Event Based Sessions

• Events are conceptual actions that the user performs to achieve a certain effect

• Events are used to capture business actions that are defined in the site’s domain

• The site admin is responsible for defining and tracking events

• Events are tracked via a JavaScript client

Page 37: Semantic Event Based Sessions

• Example events:

– Play a video event
– Add to shopping cart event
– Add friend action

• Sometimes we may be interested in properties of events, such as

– “query” property of a “search event”
– “category” property of a “view video event”

Page 38: Semantic Event Based Sessions

• Every event is defined as an object.

• Objects can have properties which relate an object

– with another object or
– with a datatype value

• The relations between objects are captured in a tree

– Each individual and property is a node
– Object property nodes have an object as a parent and a child

Page 39:

[Figure: a sample event from a hypothetical video viewing site]

Page 40: Events as Semantic Objects

• Events can be used to capture all relevant actions of the user including plain pageviews

• With a mapping of events to the web site’s ontology we can define ‘semantic events’

• Events are mapped to semantic space by using the class and property names in the ontology

• As a result of this mapping, the data to be mined is
– An ontology containing the terminological part
– Logs containing semantic objects

Page 41: Definitions

• A Session is an ordered set of atom-trees that correspond to events for a single user in a certain browsing activity

• An Atom-tree is a tree of connected atoms. The atom-tree represents a domain event in the web site's ontology

• An Atom is either an individual of a class, a datatype property assertion, or an object property assertion

Page 42: Definitions

• A Pattern is an ordered set of atom-trees

• A session S supports the pattern Q iff Q is a subsequence of S, where the isMoreGeneralThan relation is used instead of equality in determining the subsequence relation
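The support definition can be illustrated with a deliberately simplified model in which each atom-tree is reduced to a single class name in a small taxonomy. The taxonomy in `PARENT` and the class names are hypothetical, and `is_more_general_than` treats a class as more general than itself, which is an assumption about the relation.

```python
# Hypothetical taxonomy: child class -> parent class (None marks the root)
PARENT = {
    "Event": None,
    "Search": "Event",
    "ViewReview": "Event",
    "SearchItalian": "Search",
    "SearchMediterranean": "Search",
}

def is_more_general_than(general, specific):
    """True if `general` equals `specific` or is one of its ancestors."""
    label = specific
    while label is not None:
        if label == general:
            return True
        label = PARENT[label]
    return False

def supports(session, pattern):
    """True if `pattern` is a subsequence of `session` under generalization."""
    position = 0
    for event in session:
        if position < len(pattern) and is_more_general_than(pattern[position], event):
            position += 1
    return position == len(pattern)

def support(sessions, pattern):
    """Fraction of sessions that support the pattern."""
    return sum(supports(s, pattern) for s in sessions) / len(sessions)

session = ["SearchMediterranean", "SearchItalian", "ViewReview"]
print(supports(session, ["Search", "ViewReview"]))   # True
print(supports(session, ["ViewReview", "Search"]))   # False
```

Unlike the capture test used for accuracy in Part I, this subsequence relation allows gaps between the matched events.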

Page 43:

Page 44: Algorithm

• For a given set of sessions, the problem is to find the set of patterns with support greater than the threshold value, minSupport

• Two-phase Apriori-like algorithm

– First phase finds frequent atom-trees (patterns containing a single atom-tree)

– Second phase searches for frequent atom-tree patterns

Page 45: Phase I: Find Frequent Atomtrees

• Apriori property:

If atom-tree a1 is more general than atom-tree a2, then the support of a1 is greater than or equal to the support of a2

• getMostGeneralForms generates the set of trees that are
– more general than the given atom-tree
– not less general than any other atom-tree generated

• For level-wise search, a one-step refinement operator is defined over the set of individual atoms, object property assertion atoms and datatype property assertion atoms.

Page 46: Phase I: Find Frequent Atomtrees

• Given an atom, the one-step refinement operator refines it into another atom by refining to a subclass, refining to a sub-property, or refining a child of the node.

• A one-step refinement over the set of atom-trees returns a set of atom-trees by either
– Refining a single node
– Adding the most general form of a node

• One-step refinement takes two atom-trees and returns atom-trees that refine the second towards the first, making it more similar to the first

Page 47: Phase I: Find Frequent Atomtrees

INPUT : Session data containing semantic events
OUTPUT: List of frequent atom-trees

generate the initial candidate set using getMostGeneralForms of all events

iterate until no candidates can be generated {
    compare candidate set with the data set:
    for each atom-tree in the data set {
        increment the frequency of each atom-tree in the candidate set that is more general than the atom-tree
    }
    filter out the candidates that are less frequent than minSupport
    generate the next candidate set using the oneStepRefinement operator on the atom-trees of the current candidate set
}
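The level-wise search of Phase I can be sketched on a much simpler structure than real atom-trees. Here every event is a single class name, the most general form is the taxonomy root, and one-step refinement moves to the direct subclasses. The taxonomy, the names, and counting support per event occurrence with an absolute `min_support` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical taxonomy: child class -> parent class (None marks the root)
PARENT = {
    "Event": None,
    "Search": "Event",
    "PlaySong": "Event",
    "SearchItalian": "Search",
    "SearchMediterranean": "Search",
}

def ancestors_or_self(label):
    """The label followed by its ancestors up to the root."""
    chain = []
    while label is not None:
        chain.append(label)
        label = PARENT[label]
    return chain

def frequent_labels(sessions, min_support):
    """Level-wise search from the most general class toward its refinements."""
    children = {}
    for child, parent in PARENT.items():
        children.setdefault(parent, []).append(child)

    events = [event for session in sessions for event in session]
    candidates = set(children.get(None, []))   # most general forms
    frequent = {}
    while candidates:
        counts = Counter()
        for event in events:
            for label in ancestors_or_self(event):
                if label in candidates:         # candidate is more general than event
                    counts[label] += 1
        survivors = {c for c in candidates if counts[c] >= min_support}
        frequent.update({c: counts[c] for c in survivors})
        # One-step refinement: only frequent candidates are refined further.
        candidates = {child for c in survivors for child in children.get(c, [])}
    return frequent

SESSIONS = [
    ["SearchItalian", "PlaySong"],
    ["SearchItalian", "SearchMediterranean"],
    ["PlaySong"],
]
print(frequent_labels(SESSIONS, min_support=2))
```

Refining only the surviving candidates is what the Apriori property permits: a refinement can never be more frequent than the more general form it came from.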

Page 48: Phase II: Find Frequent Sequences

• Similar to GSP

• The taxonomy introduced by the isMoreGeneralThan relation is used

• Data set is converted:
– each frequent atom-tree is mapped to an integer hash
– each atom-tree in a session is replaced by a set of hashes of the atom-tree and its ancestors

• Subsequence relation is modified to respect set inclusion

Page 49: Phase II: Find Frequent Sequences

INPUT : session data containing semantic events and frequent atom-trees from phase one

OUTPUT: list of frequent atom-tree patterns

convert data set and frequent atom-trees to hashes

while the candidate set is not empty {
    generate candidate set from previous frequent patterns
    count candidates
    select frequent candidates
}

reconvert patterns
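A simplified sketch of Phase II follows, again with class names standing in for atom-trees. Each frequent label gets an integer id, each event becomes the set of ids of itself and its frequent ancestors, and a pattern matches when each of its ids is contained in the corresponding event set. Candidate generation here simply appends every frequent id to every frequent pattern, which is cruder than the join step of GSP. The taxonomy and all names are hypothetical.

```python
# Hypothetical taxonomy: child class -> parent class (None marks the root)
PARENT = {"Event": None, "Search": "Event", "PlaySong": "Event",
          "SearchItalian": "Search"}

def encode(sessions, frequent):
    """Map frequent labels to ids and events to sets of ids (self plus ancestors)."""
    ids = {label: i for i, label in enumerate(sorted(frequent))}
    encoded = []
    for session in sessions:
        row = []
        for label in session:
            item = set()
            while label is not None:
                if label in ids:
                    item.add(ids[label])
                label = PARENT[label]
            row.append(frozenset(item))
        encoded.append(row)
    return ids, encoded

def supports(row, pattern):
    """Subsequence test where 'equal' is replaced by set membership."""
    position = 0
    for item in row:
        if position < len(pattern) and pattern[position] in item:
            position += 1
    return position == len(pattern)

def frequent_sequences(sessions, frequent, min_support):
    if min_support < 1:
        raise ValueError("min_support must be at least 1")
    ids, encoded = encode(sessions, frequent)
    names = {i: label for label, i in ids.items()}
    level = [(i,) for i in ids.values()]
    result = {}
    while level:
        counted = {p: sum(supports(row, p) for row in encoded) for p in level}
        kept = [p for p in level if counted[p] >= min_support]
        for p in kept:
            result[tuple(names[i] for i in p)] = counted[p]   # reconvert
        level = [p + (i,) for p in kept for i in ids.values()]
    return result

SESSIONS = [["SearchItalian", "PlaySong"], ["Search", "PlaySong"], ["PlaySong"]]
print(frequent_sequences(SESSIONS, {"Search", "PlaySong"}, min_support=2))
```

Replacing each event by a set of ids moves the taxonomy lookups into the one-time conversion, so the counting loop only needs membership tests.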

Page 50: Experiments

• Two sites are tested:

– A music streaming site
  • Single-page, AJAX-based music listening site
  • 280K events in 75K sessions
  • Events are tracked via a JavaScript client

– A mobile network operator’s site
  • Content-heavy, mostly static, high-traffic web site
  • 1M pageviews in 175K sessions
  • Events are extracted from access logs

Page 51: Music Streaming Site - Events

Page 52: Music Streaming Site - Patterns

• 38.9% of the sessions: user made a search
• 9.3% of the sessions: user removed a song from her playlist
• 95.5% of the sessions: user made an action about a song
• 27.5% of the sessions: user added a song to playlist

• 139 frequent patterns are found
– A frequent pattern of length 2 describes a search performed after playing a particular song
– A frequent pattern of length 6 describes sequential removal of songs from the playlist (due to the lack of a ‘clear playlist’ button in the interface)

Page 53: Mobile Network Operator Site - Events

• Two days of logs
• More than 1 million pageviews occurred in 175K sessions.
• Semi-automatically generated ontology
• A total of 503 classes
• 7-level hierarchy

Page 54: Mobile Network Operator Site - Ontology

Page 55: Mobile Network Operator Site - Patterns

• 10% of the sessions: at least one search action
• 38% of the sessions: a page not categorized in the ontology is visited
• 71% of the sessions: user visits the home page (interesting)

• Some other subjectively interesting patterns
– users’ browsing behaviors between subclasses of the content class
– users visited the home page and then jumped to some specific content
– users searched and moved on to a specific category

Page 56: Conclusions

The proposed system:
• Is more generic than some of the previous semantic web usage mining attempts
• Captures the usage model more correctly
• Is intuitive and sound
• Uses most of the ontology constructs
• Is applicable to real web sites with varying domains
• Is parallelizable and suitable for MapReduce