Web Usage Mining and Using Ontology for Capturing Web Usage Semantic
Professor Ismail Toroslu gave a lecture on "Web Usage Mining and Using Ontology for Capturing Web Usage Semantic" in the Distinguished Lecturer Series - Leon The Mathematician. More information is available at: http://dls.csd.auth.gr
İsmail Hakkı Toroslu
Middle East Technical University
Department of Computer Engineering
Ankara, Turkey
Web Usage Mining and Using Ontology for Capturing Web Usage Semantic
04/10/23
PART I
A New Approach for Reactive Web Usage Data Processing
OUTLINE
• Web Mining
• Previous Session Reconstruction Heuristics
• Smart-SRA
• Agent Simulator
• Experimental Results
• Conclusion
Web Mining
• Data Mining: Discover and retrieve useful and interesting patterns from a large dataset.
• Web mining: The dataset is the huge body of web data.
• Dimensions:
– Web content mining
– Web structure mining
– Web usage mining
Web Mining
Web Usage Mining (WUM)
Application of data mining techniques to web log data in order to discover user access patterns.
Example User Web Access Log
IP Address      Request Time               Method  URL     Protocol  Return Code  Bytes Transmitted
144.123.121.23  [25/Apr/2005:03:04:41–05]  GET     A.html  HTTP/1.0  200          3290
144.123.121.23  [25/Apr/2005:03:04:43–05]  GET     B.html  HTTP/1.0  200          2050
144.123.121.23  [25/Apr/2005:03:04:48–05]  GET     C.html  HTTP/1.0  200          4130
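A minimal sketch of how access log lines like the ones in the example log could be parsed before session reconstruction. The field layout follows the example log; the regular expression, the `LogEntry` type and the `parse_log_line` name are illustrative assumptions, not part of the lecture.

```python
import re
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    ip: str
    timestamp: str
    method: str
    url: str
    protocol: str
    status: int
    size: int


# Assumed layout: IP [time] METHOD URL PROTOCOL STATUS BYTES (as in the example log).
LOG_PATTERN = re.compile(
    r"(?P<ip>\S+)\s+\[(?P<ts>[^\]]+)\]\s+(?P<method>\S+)\s+(?P<url>\S+)\s+"
    r"(?P<protocol>\S+)\s+(?P<status>\d{3})\s+(?P<size>\d+)"
)


def parse_log_line(line: str) -> Optional[LogEntry]:
    """Return a LogEntry for a well-formed line, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    return LogEntry(
        ip=match["ip"],
        timestamp=match["ts"],
        method=match["method"],
        url=match["url"],
        protocol=match["protocol"],
        status=int(match["status"]),
        size=int(match["size"]),
    )
```

Grouping the parsed entries by IP address would give the per-user request sequences that the session reconstruction heuristics take as input.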
Web Mining
Phases of Web Usage Mining
[Figure: raw server log file → Pre-Processing (session reconstruction heuristics) → user session file → Pattern Discovery (Apriori, GSP, SPADE) → rules and patterns → Pattern Analysis → interesting knowledge → applications]
Previous Session Reconstruction Heuristics
Session Reconstruction
• Sessions are reconstructed by using heuristics that select and group requests belonging to the same user session
• Types:
– Reactive: requests are processed after they are handled by the web server
– Proactive: processing occurs during the user's interactive browsing of the web site
Previous Session Reconstruction Heuristics
• Time-oriented heuristics
• Navigation-oriented heuristic
New Reactive Session Reconstruction Technique: Smart-SRA
Combines these heuristics with "site topology" information in order to increase the accuracy of the reconstructed sessions
Previous Session Reconstruction Heuristics
Example Web Topology Graph
[Figure: topology graph over pages P1, P13, P20, P23, P34 and P49]
Example Web Page Request Sequence
Page       P1  P20  P13  P49  P34  P23
Timestamp  0   6    15   29   32   47
Previous Session Reconstruction Heuristics
Time-Oriented Heuristic 1
Total session time: the duration of a discovered session is limited by a threshold
Page       P1  P20  P13  P49  P34  P23
Timestamp  0   6    15   29   32   47
Discovered Sessions (30 mins):
1. [P1, P20, P13, P49]
2. [P34, P23]
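A minimal sketch of the total-session-time heuristic, assuming requests arrive as (page, timestamp) pairs sorted by time and that the timestamps are in minutes. The function name `split_by_total_time` is hypothetical.

```python
def split_by_total_time(requests, max_session_time):
    """Start a new session when the time since the session's first request exceeds the threshold."""
    sessions = []
    current = []
    start_time = None
    for page, timestamp in requests:
        if current and timestamp - start_time > max_session_time:
            sessions.append(current)
            current = []
        if not current:
            start_time = timestamp  # first request of a new session
        current.append(page)
    if current:
        sessions.append(current)
    return sessions


requests = [("P1", 0), ("P20", 6), ("P13", 15), ("P49", 29), ("P34", 32), ("P23", 47)]
print(split_by_total_time(requests, 30))
# [['P1', 'P20', 'P13', 'P49'], ['P34', 'P23']]
```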
Previous Session Reconstruction Heuristics
Time-Oriented Heuristic 2
Page-stay time: the time spent on any page is limited by a threshold
Page       P1  P20  P13  P49  P34  P23
Timestamp  0   6    15   29   32   47
Discovered Sessions (10 mins):
1. [P1, P20, P13]
2. [P49, P34]
3. [P23]
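A minimal sketch of the page-stay-time heuristic under the same assumptions (time-sorted (page, timestamp) pairs, timestamps in minutes). The function name `split_by_page_stay` is hypothetical.

```python
def split_by_page_stay(requests, max_page_stay):
    """Start a new session when the gap between consecutive requests exceeds the threshold."""
    sessions = []
    current = []
    previous_time = None
    for page, timestamp in requests:
        if current and timestamp - previous_time > max_page_stay:
            sessions.append(current)
            current = []
        current.append(page)
        previous_time = timestamp
    if current:
        sessions.append(current)
    return sessions


requests = [("P1", 0), ("P20", 6), ("P13", 15), ("P49", 29), ("P34", 32), ("P23", 47)]
print(split_by_page_stay(requests, 10))
# [['P1', 'P20', 'P13'], ['P49', 'P34'], ['P23']]
```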
Previous Session Reconstruction Heuristics
Navigation-Oriented Heuristic
Adding page WP_{N+1} to a session [WP_1, WP_2, …, WP_N]
• If WP_N has a hyperlink to WP_{N+1}:
[WP_1, WP_2, …, WP_N, WP_{N+1}]
• If WP_N does not have a hyperlink to WP_{N+1}, and WP_{Kmax} is the nearest earlier page having a hyperlink to WP_{N+1}, add backward browser moves:
[WP_1, WP_2, …, WP_N, WP_{N-1}, WP_{N-2}, ..., WP_{Kmax}, WP_{N+1}]
Previous Session Reconstruction Heuristics
Navigation-Oriented Heuristic
Current Session                         Condition                                New Page
[ ]                                     -                                        P1
[P1]                                    Link[P1, P20] = 1                        P20
[P1, P20]                               Link[P20, P13] = 0, Link[P1, P13] = 1    P13
[P1, P20, P1, P13]                      Link[P13, P49] = 1                       P49
[P1, P20, P1, P13, P49]                 Link[P49, P34] = 0, Link[P13, P34] = 1   P34
[P1, P20, P1, P13, P49, P13, P34]       Link[P34, P23] = 1                       P23
[P1, P20, P1, P13, P49, P13, P34, P23]
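A minimal sketch of the navigation-oriented heuristic. The edge set below is inferred from the worked examples in these slides, since the topology figure itself is not preserved. Starting a new session when no earlier page links to the new page is an assumption, and `navigation_oriented` is a hypothetical name.

```python
def navigation_oriented(pages, links):
    """Append each page; insert backward browser moves when the last page has no link to it."""
    sessions = []
    current = []
    for page in pages:
        # Index of the nearest page (searching backwards) that links to the new page.
        k = None
        for i in range(len(current) - 1, -1, -1):
            if (current[i], page) in links:
                k = i
                break
        if k is None:
            # Assumption: no referrer in the session (or empty session) starts a new session.
            if current:
                sessions.append(current)
            current = [page]
        else:
            # Backward moves revisit WP_{N-1}, ..., WP_{Kmax}; empty when k is the last page.
            backward_moves = current[k:len(current) - 1][::-1]
            current = current + backward_moves + [page]
    if current:
        sessions.append(current)
    return sessions


# Edges inferred from the worked examples (assumption).
LINKS = {("P1", "P20"), ("P1", "P13"), ("P13", "P49"), ("P13", "P34"),
         ("P34", "P23"), ("P49", "P23"), ("P20", "P23")}

print(navigation_oriented(["P1", "P20", "P13", "P49", "P34", "P23"], LINKS))
# [['P1', 'P20', 'P1', 'P13', 'P49', 'P13', 'P34', 'P23']]
```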
Smart-SRA
Contains two phases:
• Phase 1: Shorter request sequences (candidate sessions) are constructed by using the overall session duration time and page-stay time criteria
Satisfies the overall session duration time limit
• Phase 2: Candidate sessions are partitioned into maximal sub-sessions such that:
– between each consecutive page pair in a session there is a hyperlink from the previous page to the next page
– the page-stay time criterion is also satisfied
Adds the referrer constraints of the topology rule while eliminating the need for inserting backward browser movements.
Smart-SRA
Steps of Phase 2
Process a candidate session from left to right by repeating the following steps until the candidate session is empty:
1. Determine the web pages without any referrer (on their left) and remove them from the candidate session
2. For each one of these pages
• For each previously constructed session
– If there is a hyperlink from the last page of the session to the web page, then append the web page to the session (if the page-stay time constraint is satisfied)
3. Remove non-maximal sessions
Smart-SRA
Example Candidate Session
Page       P1  P20  P13  P49  P34  P23
Timestamp  0   6    9    12   14   15
Example Web Topology
[Figure: topology graph over pages P1, P13, P20, P23, P34 and P49]
Smart-SRA
Iteration 1
Candidate Session: [P1, P20, P13, P49, P34, P23]
New Session Set (before): (empty)
Temp Page Set: {P1}
Temp Session Set: [P1]
New Session Set (after): [P1]

Iteration 2
Candidate Session: [P20, P13, P49, P34, P23]
New Session Set (before): [P1]
Temp Page Set: {P20, P13}
Temp Session Set: [P1, P20], [P1, P13]
New Session Set (after): [P1, P20], [P1, P13]

Iteration 3
Candidate Session: [P49, P34, P23]
New Session Set (before): [P1, P20], [P1, P13]
Temp Page Set: {P49, P34}
Temp Session Set: [P1, P13, P34], [P1, P13, P49]
New Session Set (after): [P1, P13, P34], [P1, P13, P49], [P1, P20]

Iteration 4
Candidate Session: [P23]
New Session Set (before): [P1, P13, P34], [P1, P13, P49], [P1, P20]
Temp Page Set: {P23}
Temp Session Set: [P1, P13, P34, P23], [P1, P13, P49, P23], [P1, P20, P23]
New Session Set (after): [P1, P13, P34, P23], [P1, P13, P49, P23], [P1, P20, P23]
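A minimal sketch of Phase 2 as described in the steps and the worked iterations, not the authors' implementation. The edge set is inferred from the worked examples, the page-stay threshold of 10 is taken from the earlier heuristic, and `smart_sra_phase2` is a hypothetical name. Letting a page that cannot extend any session start its own session is an assumption; the slides only show this for the first iteration.

```python
def smart_sra_phase2(candidate, links, max_page_stay):
    """Partition a candidate session of (page, timestamp) pairs into maximal linked sub-sessions."""
    remaining = list(candidate)
    sessions = []  # each session is a list of (page, timestamp)
    while remaining:
        # Step 1: pages with no referrer to their left leave the candidate session.
        start_pages, rest = [], []
        for i, (page, ts) in enumerate(remaining):
            has_referrer = any((prev, page) in links for prev, _ in remaining[:i])
            (rest if has_referrer else start_pages).append((page, ts))
        remaining = rest
        # Step 2: try to append each such page to every previously built session.
        extended = set()
        new_sessions = []
        for page, ts in start_pages:
            placed = False
            for idx, session in enumerate(sessions):
                last_page, last_ts = session[-1]
                if (last_page, page) in links and ts - last_ts <= max_page_stay:
                    new_sessions.append(session + [(page, ts)])
                    extended.add(idx)
                    placed = True
            if not placed:
                new_sessions.append([(page, ts)])  # assumption: start a new session
        # Step 3: sessions that were extended are no longer maximal.
        sessions = [s for i, s in enumerate(sessions) if i not in extended] + new_sessions
    return [[page for page, _ in s] for s in sessions]


# Edges inferred from the worked examples (assumption).
LINKS = {("P1", "P20"), ("P1", "P13"), ("P13", "P49"), ("P13", "P34"),
         ("P34", "P23"), ("P49", "P23"), ("P20", "P23")}
CANDIDATE = [("P1", 0), ("P20", 6), ("P13", 9), ("P49", 12), ("P34", 14), ("P23", 15)]

for session in smart_sra_phase2(CANDIDATE, LINKS, 10):
    print(session)
```

On the example candidate session this yields the same three maximal sessions as iteration 4, possibly in a different order.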
Agent Simulator
• Models the behavior of web users and generates web user navigation and the log data kept by the web server
• Used to compare the performance of alternative session reconstruction heuristics
• Uses 4 primitive behaviors for simulating the complex navigation of a web user
Agent Simulator
User-Behavior I
A web user can start a new session with any one of the possible entry pages of the web site
[Figure: topology graph showing two sessions, S1 (Session I) and S2 (Session II); legend: start page, new request from server]
Agent Simulator
User-Behavior II
A web user can select a new page having a link from the most recently accessed page
[Figure: topology graph with navigation steps 1-2; legend: start page, from cache, new request from server]
Agent Simulator
User-Behavior III
A web user can select as the next page a page having a link from any one of the previously browsed pages
[Figure: topology graph with navigation steps 1-5; legend: start page, from cache, new request from server]
Agent Simulator
User-Behavior IV
A web user can terminate the session
[Figure: topology graph with navigation steps 1-6; legend: start page, from cache, new request from server]
Agent Simulator
Parameters for simulating the behavior of a web user
• Session Termination Probability (STP)
• Link from Previous pages Probability (LPP)
• New Initial page Probability (NIP)
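A rough sketch of how the four primitive behaviors and the three probabilities might drive one simulated agent. The order in which the probabilities are tested, the way a backward move opens a new real session, and all names (`simulate_agent`, `max_requests`) are assumptions for illustration; the slides do not specify them.

```python
import random


def simulate_agent(links, entry_pages, stp, lpp, nip, rng, max_requests=1000):
    """Return (real_sessions, server_log) for one agent.

    links maps a page to the list of pages it links to. Backward moves are
    served from the browser cache, so only new requests reach the server log.
    """
    sessions = []
    current = [rng.choice(entry_pages)]  # Behavior I: start at an entry page
    log = [current[0]]
    while len(log) < max_requests:
        if rng.random() < stp:  # Behavior IV: terminate
            break
        if rng.random() < nip:  # Behavior I: new session from an entry page
            sessions.append(current)
            current = [rng.choice(entry_pages)]
            log.append(current[0])
            continue
        if len(current) > 1 and rng.random() < lpp:
            source = rng.randrange(len(current) - 1)  # Behavior III: earlier page
        else:
            source = len(current) - 1  # Behavior II: most recent page
        targets = links.get(current[source], [])
        if not targets:  # dead end: nothing to follow
            break
        next_page = rng.choice(targets)
        if source < len(current) - 1:
            # Assumption: going back closes the current path and branches a new one.
            sessions.append(current)
            current = current[:source + 1] + [next_page]
        else:
            current.append(next_page)
        log.append(next_page)
    sessions.append(current)
    return sessions, log


# Edges inferred from the worked examples (assumption).
TOPOLOGY = {"P1": ["P20", "P13"], "P13": ["P49", "P34"], "P34": ["P23"],
            "P49": ["P23"], "P20": ["P23"]}
real_sessions, server_log = simulate_agent(
    TOPOLOGY, ["P1"], stp=0.05, lpp=0.3, nip=0.3, rng=random.Random(0))
```

The real sessions serve as ground truth, and the server log is what a reconstruction heuristic would see.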
Experimental Results
Heuristics Tested
• Time-oriented heuristic (heur1) (total time < 30 min)
• Time-oriented heuristic (heur2) (page stay < 10 min)
• Navigation-oriented heuristic (heur3)
• Smart-SRA heuristic (heur4)
Experimental Results
Accuracy
A reconstructed session H captures a real session R if R occurs as a subsequence of H (R ⊏ H), with the pages of R appearing consecutively in H
R = [P1, P3, P5]
H = [P9, P1, P3, P5, P8] => R ⊏ H
H = [P1, P9, P3, P5, P8] => R ⋢ H
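A minimal sketch of the capture relation, treating R as a run of consecutive pages in H as the two examples suggest. The `real_accuracy` ratio (captured real sessions divided by all real sessions) is an assumed reading of the accuracy measure, and both function names are hypothetical.

```python
def captures(reconstructed, real):
    """True if `real` occurs as a run of consecutive pages inside `reconstructed`."""
    n = len(real)
    if n == 0:
        return False
    return any(reconstructed[i:i + n] == real
               for i in range(len(reconstructed) - n + 1))


def real_accuracy(real_sessions, reconstructed_sessions):
    """Assumed measure: fraction of real sessions captured by some reconstructed session."""
    if not real_sessions:
        return 0.0
    captured = sum(
        1 for real in real_sessions
        if any(captures(h, real) for h in reconstructed_sessions)
    )
    return captured / len(real_sessions)


R = ["P1", "P3", "P5"]
print(captures(["P9", "P1", "P3", "P5", "P8"], R))  # True
print(captures(["P1", "P9", "P3", "P5", "P8"], R))  # False
```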
Experimental Results
Parameters for generating user sessions and web topology
Number of web pages (nodes) in topology   300
Average outdegree                         15
Average page-stay time                    2.2 min
Deviation of page-stay time               0.5 min
Number of agents                          10000
Parameter   Fixed value   Range
STP         5%            1%-20%
LPP         30%           0%-90%
NIP         30%           0%-90%
Experimental Results
Accuracy vs. STP
[Figure: Real Accuracy vs STP; x-axis: STP, y-axis: real accuracy; series: heur1, heur2, heur3, heur4]
Experimental Results
Accuracy vs. LPP
[Figure: Real Accuracy vs LPP; x-axis: LPP, y-axis: real accuracy (%); series: heur1, heur2, heur3, heur4]
Experimental Results
Accuracy vs. NIP
[Figure: Real Accuracy vs NIP; x-axis: NIP, y-axis: real accuracy (%); series: heur1, heur2, heur3, heur4]
Conclusion
• New session reconstruction heuristic: Smart-SRA
– Does not allow sequences with unrelated consecutive requests (no hyperlink from the previous one to the next one)
  • No insertion of artificial browser (back) requests in order to prevent unrelated consecutive requests
– Only maximal sessions
• Agent simulator
• Accuracy measure
• Experimental results show Smart-SRA outperforms previous heuristics
PART II
Semantically Enriched Event Based Model for Web Usage Mining
OUTLINE
• Introduction
• Related Work
• Semantic Event Based Sessions
• Formal Definition of Semantic Events
• Algorithms for Mining Semantic Event Patterns
• Experimental Results
• Conclusion
Introduction
• Traditional WUM is based on pageviews, but the user interaction model is changing
• Users do not care about pageviews; they use a web site to achieve high-level goals such as
– Finding and viewing a video
– Buying tickets
– Searching for the nearest Italian restaurant
– Listening to a song, etc.
Introduction
• We should analyze usage data as a series of “events”
1. Search Mediterranean Restaurants
2. Search Italian Restaurants
3. View the reviews for Restaurant A
4. View the reviews for Restaurant B
5. Click the web site link of Restaurant A
• Incorporating semantic knowledge in the process is the logical choice
– A method should be devised to capture user behavior
– Usage data should be mapped to semantic space
– An algorithm should be developed to exploit semantic relations
Introduction
• In this work we propose:
– methods for tracking and logging domain-level events
– a method for injecting semantics into events
– a semantic ordering of events
– an algorithm for computing sequences of frequent events
• The proposed system was tested with 2 web sites
– Music Streaming Site
– Mobile Network Operator’s Site
Semantic Event Based Sessions
• Events are conceptual actions that the user performs to achieve a certain effect
• Events are used to capture business actions that are defined in the site’s domain
• The site admin is responsible for defining and tracking events
• Events are tracked via a JavaScript client
Semantic Event Based Sessions
• Example events:
– Play a video event
– Add to shopping cart event
– Add friend action
• Sometimes we may be interested in properties of events, such as
– “query” property of a “search event”
– “category” property of a “view video event”
Semantic Event Based Sessions
• Every event is defined as an object.
• Objects can have properties which relate an object
– with another object or
– with a datatype value
• The relations between objects are captured in a tree
– Each individual and property is a node
– Object property nodes have an object as a parent and as a child
[Figure: A sample event from a hypothetical video viewing site]
Events as Semantic Objects
• Events can be used to capture all relevant actions of the user including plain pageviews
• With a mapping of events to the web site’s ontology we can define ‘semantic events’
• Events are mapped to semantic space by using the class and property names in the ontology
• As a result of this mapping, the data to be mined is
– An ontology containing the terminological part
– Logs containing semantic objects
Definitions
• A Session is an ordered set of atom-trees that correspond to the events of a single user in a certain browsing activity
• An Atom-tree is a tree of connected atoms. The atom-tree represents a domain event in the web site's ontology
• An Atom is either an individual of a class, a datatype property assertion, or an object property assertion
Definitions
• A Pattern is an ordered set of atom-trees
• A session S supports the pattern Q iff Q is a subsequence of S, where the isMoreGeneralThan relation is used instead of equality in determining the subsequence relation
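A simplified sketch of the support relation. Each atom-tree is reduced here to a single class name in a small taxonomy, which is a strong simplification of the definitions above. The class names, the `PARENT` map and the function names are hypothetical, and the generality test is taken to include equality.

```python
# Hypothetical taxonomy: child class -> parent class.
PARENT = {
    "SearchItalian": "Search",
    "SearchMediterranean": "Search",
    "Search": "Event",
    "ViewReview": "Event",
}


def is_more_general_than(general, specific, parent=PARENT):
    """True if `general` equals `specific` or is one of its ancestors."""
    node = specific
    while node is not None:
        if node == general:
            return True
        node = parent.get(node)
    return False


def supports(session, pattern, parent=PARENT):
    """True if `pattern` is a subsequence of `session` under is_more_general_than."""
    position = 0
    for event in session:
        if position < len(pattern) and is_more_general_than(pattern[position], event, parent):
            position += 1
    return position == len(pattern)


session = ["SearchMediterranean", "SearchItalian", "ViewReview", "ViewReview"]
print(supports(session, ["Search", "ViewReview"]))      # True
print(supports(session, ["ViewReview", "Search"]))      # False
```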
Algorithm
• For a given set of sessions, the problem is to find the set of patterns with support greater than the threshold value, minSupport
• Two-phase Apriori-like algorithm
– The first phase finds frequent atom-trees (patterns containing a single atom-tree)
– The second phase searches for frequent atom-tree patterns
Phase I: Find Frequent Atom-trees
• Apriori property: if atom-tree a1 is more general than atom-tree a2, then the support of a1 is greater than or equal to the support of a2
• getMostGeneralForms generates the set of trees that are
– more general than the given atom-tree
– not less general than any other atom-tree generated
• For the level-wise search, a one-step refinement operator is defined over the set of individual atoms, object property assertion atoms and datatype property assertion atoms.
Phase I: Find Frequent Atom-trees
• Given an atom, the one-step refinement operator refines another atom by refining to a subclass, refining to a sub-property, or refining a child of the node.
• A one-step refinement over the set of atom-trees returns a set of atom-trees by either
– Refining a single node
– Adding the most general form of a node
• One-step refinement takes two atom-trees and returns refinements of the second atom-tree that are more similar to the first
Phase I: Find Frequent Atom-trees
INPUT : Session data containing semantic events
OUTPUT: List of frequent atom-trees

generate the initial candidate set using getMostGeneralForm of all events

iterate until no candidates can be generated {
    compare candidate set with the data set
    for each atom-tree in the data set {
        increment the frequency of each atom-tree in candidate set that is more general than the atom-tree
    }
    filter the candidates that are less frequent than minSupport
    generate next candidate set using oneStepRefinement operator on the current candidate set atom-trees
}
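A heavily simplified sketch of the level-wise search in Phase I. Atom-trees are reduced to single class names, the most general form of an event is taken to be the root of its class, and one-step refinement is taken to be "move to a direct subclass". These reductions, the taxonomy and the name `frequent_classes` are all assumptions; the actual operators work on full atom-trees with properties.

```python
def frequent_classes(sessions, parent, min_support):
    """Level-wise search from the most general classes down to frequent subclasses."""
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)

    def ancestors_or_self(cls):
        result = set()
        while cls is not None:
            result.add(cls)
            cls = parent.get(cls)
        return result

    # For each session, every class that generalizes at least one of its events.
    covered = [set().union(*(ancestors_or_self(e) for e in s)) if s else set()
               for s in sessions]

    def support(cls):
        return sum(cls in c for c in covered) / len(sessions)

    def root(cls):
        while cls in parent:
            cls = parent[cls]
        return cls

    # Initial candidates: most general form of every event (here, its root class).
    candidates = sorted({root(e) for s in sessions for e in s})
    frequent = {}
    while candidates:
        next_candidates = []
        for cls in candidates:
            value = support(cls)
            if value >= min_support:
                frequent[cls] = value
                # Apriori pruning: only frequent candidates are refined further.
                next_candidates.extend(children.get(cls, []))
        candidates = next_candidates
    return frequent


# Hypothetical taxonomy and sessions.
PARENT = {"SearchItalian": "Search", "SearchMediterranean": "Search",
          "Search": "Event", "ViewReview": "Event"}
SESSIONS = [["SearchItalian", "ViewReview"], ["SearchMediterranean"],
            ["ViewReview"], ["SearchItalian"]]
print(frequent_classes(SESSIONS, PARENT, 0.5))
```

Because a more general class is supported by every session that supports one of its subclasses, an infrequent class never needs to be refined.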
Phase II: Find Frequent Sequences
• Similar to GSP
• The taxonomy introduced by the isMoreGeneralThan relation is used
• The data set is converted:
– each frequent atom-tree is mapped to an integer hash
– each atom-tree in a session is replaced by a set of hashes of the atom-tree and its ancestors
• The subsequence relation is modified to respect set inclusion
Phase II: Find Frequent Sequences
INPUT : session data containing semantic events and frequent atom-trees from phase one
OUTPUT: list of frequent atom-tree patterns (sequences)

convert data set and frequent atom-trees to hashes

while the candidate set is not empty {
    generate candidate set from previous frequent patterns
    count candidates
    select frequent candidates
}

reconvert patterns
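A minimal sketch of the Phase II loop, assuming the data set has already been converted so that each session is a list of integer sets (the hash of an atom-tree together with the hashes of its ancestors). The GSP-style join used for candidate generation and the names `contains` and `frequent_sequences` are illustrative assumptions.

```python
def contains(session, pattern):
    """Subsequence test respecting set inclusion: pattern[i] must be in a later set each time."""
    position = 0
    for hash_set in session:
        if position < len(pattern) and pattern[position] in hash_set:
            position += 1
    return position == len(pattern)


def frequent_sequences(sessions, frequent_hashes, min_support):
    """Return {pattern tuple: support} for all frequent patterns, found level by level."""
    if min_support <= 0:
        raise ValueError("min_support must be positive so that the search terminates")

    def support(pattern):
        return sum(contains(s, pattern) for s in sessions) / len(sessions)

    level = [(h,) for h in sorted(frequent_hashes) if support((h,)) >= min_support]
    result = {}
    while level:
        for pattern in level:
            result[pattern] = support(pattern)
        # Join step: extend p with the last item of q when p minus its first item
        # equals q minus its last item.
        candidates = {p + (q[-1],) for p in level for q in level if p[1:] == q[:-1]}
        level = sorted(c for c in candidates if support(c) >= min_support)
    return result


# Hypothetical converted data: three sessions over the hashes 1, 2 and 3.
SESSIONS = [[{1, 2}, {3}], [{1}, {3}], [{2}]]
print(sorted(frequent_sequences(SESSIONS, {1, 2, 3}, 0.6)))
# [(1,), (1, 3), (2,), (3,)]
```

Mapping the resulting integer patterns back to atom-trees corresponds to the final "reconvert patterns" step.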
Experiments
• Two sites are tested:
– A music streaming site
  • Single-page, AJAX-based music listening site
  • 280K events in 75K sessions
  • Events are tracked via a JavaScript client
– A mobile network operator’s site
  • Content-heavy, mostly static, high-traffic web site
  • 1M pageviews in 175K sessions
  • Events are extracted from access logs
Music Streaming Site - Events
Music Streaming Site - Patterns
• 38.9% of the sessions: user made a search
• 9.3% of the sessions: user removed a song from her playlist
• 95.5% of the sessions: user made an action about a song
• 27.5% of the sessions: user added a song to playlist
• 139 frequent patterns are found
– A frequent pattern of length 2 describes a search performed after playing a particular song
– A frequent pattern of length 6 describes sequential removal of songs from the playlist (due to the lack of a ‘clear playlist’ button in the interface)
Mobile Network Operator Site - Events
• Two days of logs
• More than 1 million pageviews occurred in 175K sessions
• Semi-automatically generated ontology
• A total of 503 classes
• 7-level hierarchy
Mobile Network Operator Site - Ontology
Mobile Network Operator Site - Patterns
• 10% of the sessions: at least one search action
• 38% of the sessions: a page not categorized in the ontology is visited
• 71% of the sessions: user visits the home page (interesting)
• Some other subjectively interesting patterns
– users’ browsing behaviors between subclasses of the content class
– users visited the home page, then jumped to some specific content
– users searched and moved on to a specific category
Conclusions
The proposed system
• Is more generic than some of the previous semantic web usage mining attempts
• Captures the usage model more correctly
• Is intuitive and sound
• Uses most of the ontology constructs
• Is applicable to real web sites with varying domains
• Is parallelizable and suitable for MapReduce