Modeling the Internet and the Web:
Modeling and Understanding Human Behavior on the Web
School of Information and Computer Science, University of California, Irvine
Introduction
• It is useful to study human digital behavior; for example, search engine data can be used for
  – Exploration: e.g. how many queries are issued per session?
  – Modeling: e.g. is there any time-of-day dependence?
  – Prediction: e.g. which pages are relevant to a query?
• This helps us
  – Understand the social implications of Web usage
  – Design better tools for information access
  – Improve applications in networking, e-commerce, etc.
Web data and measurement issues
Background:
• It is important to understand how the data was collected
• Web data is collected automatically via software logging tools
  – Advantage: no manual supervision is required
  – Disadvantage: the data can be skewed (e.g. by the presence of robot traffic)
• It is therefore important to identify robots (also known as crawlers or spiders)
A time-series plot of Web requests
[Figure: number of page requests per hour as a function of time, from the www.ics.uci.edu Web server logs during the first week of April 2002.]
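As a rough illustration (not from the original slides), hourly counts like those plotted above could be computed from parsed log timestamps; a minimal Python sketch, assuming the timestamps have already been extracted from the log:

```python
from collections import Counter
from datetime import datetime

def hourly_counts(timestamps):
    """Count page requests per hour by truncating each timestamp
    to the start of its hour."""
    return Counter(ts.replace(minute=0, second=0, microsecond=0)
                   for ts in timestamps)

# Example: two requests in one hour, one in the next
ts = [datetime(2002, 4, 1, 9, 5), datetime(2002, 4, 1, 9, 40),
      datetime(2002, 4, 1, 10, 2)]
for hour, n in sorted(hourly_counts(ts).items()):
    print(hour, n)  # 2002-04-01 09:00:00 2, then 2002-04-01 10:00:00 1
```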
Robot / human identification
• Robot requests can be identified by classifying page requests using a variety of heuristics (a sketch of the first heuristic follows below)
  – e.g. some robots identify themselves in the server logs via the User-Agent field; well-behaved robots also request the robots.txt file
  – Robots tend to explore an entire website in breadth-first fashion
  – Humans tend to access Web pages in depth-first fashion
• Tan and Kumar (2002) discuss further techniques
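A minimal sketch of the self-identification heuristic, assuming all we have is the User-Agent string from the log; the marker list is illustrative, not exhaustive:

```python
# Illustrative substrings; real robot identification uses many more
# signals (see Tan and Kumar, 2002).
KNOWN_BOT_MARKERS = ("bot", "crawler", "spider", "slurp")

def looks_like_robot(user_agent: str) -> bool:
    """Flag a request as robot traffic if its User-Agent string
    contains a known robot marker."""
    ua = user_agent.lower()
    return any(marker in ua for marker in KNOWN_BOT_MARKERS)

print(looks_like_robot("Googlebot/2.1 (+http://www.google.com/bot.html)"))  # True
print(looks_like_robot("Mozilla/4.0 (compatible; MSIE 6.0)"))               # False
```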
• Robot traffic consists of two components
  – Periodic spikes (which can overload a server)
    • Requests by “bad” robots
  – A lower-level, constant stream of requests
    • Requests by “good” robots
• Human traffic shows
  – A daily pattern: most traffic on weekdays (Monday to Friday)
  – An hourly pattern: a peak around midday and low traffic from midnight to early morning
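One way to check a log for this human signature, as a sketch: aggregate requests by day of week and hour of day (again assuming timestamps already parsed into datetime objects):

```python
from collections import Counter
from datetime import datetime

def weekday_hour_profile(timestamps):
    """Aggregate requests by (weekday, hour-of-day). Human traffic
    tends to peak on weekdays around midday; robot streams tend to
    look flat (good robots) or spiky (bad robots) instead."""
    return Counter((ts.weekday(), ts.hour) for ts in timestamps)

profile = weekday_hour_profile([datetime(2002, 4, 1, 12, 30),
                                datetime(2002, 4, 1, 12, 45),
                                datetime(2002, 4, 7, 3, 10)])
print(profile)  # Counter({(0, 12): 2, (6, 3): 1}) -- Monday noon vs Sunday 3am
```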
Server-side data
Data logging at Web servers
• The Web server sends requested pages to the requesting browser
• It can be configured to archive these requests in a log file, recording
  – The URL of the page requested
  – The time and date of the request
  – The IP address of the requester
  – The requesting browser information (user agent)
  – The status of the request
  – The referrer page URL, if applicable
• Server-side log files
  – Provide a wealth of information
  – Require considerable care in interpretation (a parsing sketch follows below)
• More information can be found in Cooley et al. (1999), Mena (1999), and Shahabi et al. (2001)
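As an illustration, a single line in the widely used “combined” log format carries exactly the fields listed above; a simplified parsing sketch (the regex and sample line are illustrative, and real log formats vary):

```python
import re

# Simplified pattern for one "combined" log format line:
# IP, timestamp, request, status, size, referrer, user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('128.195.1.1 - - [01/Apr/2002:09:05:12 -0800] '
        '"GET /index.html HTTP/1.0" 200 2326 '
        '"http://www.ics.uci.edu/" "Mozilla/4.0"')

m = LOG_PATTERN.match(line)
if m:
    print(m.group("ip"), m.group("status"), m.group("request"))
```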
Page requests, caching, and proxy servers
• In theory, the requesting browser requests a page from a Web server and the request is processed
• In practice, the picture is complicated by
  – Other users
  – Browser caching
  – Proxy-server caching
[Figure: a graphical summary of how page requests from an individual user can be masked at various stages between the user’s local computer and the Web server.]
• Web server logs are therefore not ideal as a complete and faithful representation of individual page views
• There are heuristics that try to infer the true actions of the user
  – Path completion (Cooley et al. 1999)
    • e.g. if a link B → F is known to exist but no link C → F does, then the logged session ABCF can be interpreted as ABCBF: the user pressed ‘back’ at C, and the revisit to B was served from the cache and so never reached the server log
  – See Anderson et al. (2001) for more heuristics
• In the general case, it is hard to know what the user actually viewed
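A toy sketch of the path-completion idea under these assumptions: given the site’s link graph, insert the implied back-steps (served from the browser cache, hence missing from the log) so that every transition follows a known hyperlink. The link-graph encoding and the backtracking rule are illustrative simplifications:

```python
def complete_path(session, links):
    """Insert implied backtracking steps so every transition in the
    session follows a known hyperlink. Toy rule: when no link exists
    from the current page, assume the user hit 'back' to the most
    recent page that does link to the next request."""
    completed = [session[0]]
    for nxt in session[1:]:
        if nxt not in links.get(completed[-1], set()):
            # Backtrack through earlier pages (revisits served from cache)
            for prev in reversed(completed[:-1]):
                completed.append(prev)
                if nxt in links.get(prev, set()):
                    break
        completed.append(nxt)
    return completed

# Example from the slide: B links to F, C does not
links = {"A": {"B"}, "B": {"C", "F"}, "C": set()}
print(complete_path(list("ABCF"), links))  # ['A', 'B', 'C', 'B', 'F']
```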
Identifying individual users from Web server logs
• It is useful to associate specific page requests with specific individual users
• The IP address is the most frequently used identifier
• Disadvantages
  – One IP address can be shared by several users
  – IP addresses are often dynamically allocated
• It is better to use cookies (a sketch follows below)
  – The information in the cookie can be accessed by the Web server to identify an individual user over time
  – Actions by the same user during different sessions can be linked together
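A minimal sketch of cookie-based identification, assuming a hypothetical cookie named uid: the server issues a random ID on the first visit and reads it back on later requests, which is what lets sessions by the same user be linked:

```python
from http.cookies import SimpleCookie
import uuid

def get_or_assign_user_id(cookie_header: str) -> str:
    """Return the user's existing ID from the Cookie header, or mint
    a fresh one (which the server would send back via Set-Cookie)."""
    cookie = SimpleCookie(cookie_header or "")
    if "uid" in cookie:
        return cookie["uid"].value   # returning visitor
    return uuid.uuid4().hex          # new visitor: issue a random ID

print(get_or_assign_user_id(""))            # fresh random ID
print(get_or_assign_user_id("uid=abc123"))  # 'abc123'
```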
• Commercial websites use cookies extensively
• About 90% of users have cookies permanently enabled in their browsers
• However …
  – There are privacy issues: implicit user cooperation is needed
  – Cookies can be deleted or disabled
• Another option is to enforce user registration
  – High reliability
  – Can discourage potential visitors
Client-side data
• Advantages of collecting data at the client side:
  – Direct recording of page requests (eliminates the ‘masking’ due to caching)
  – Recording of all browser-related actions by a user (including visits to multiple websites)
  – More reliable identification of individual users (e.g. by login ID for multiple users on a single computer)
• This is the preferred mode of data collection for studies of navigation behavior on the Web
• Companies such as comScore and Nielsen use client-side software to track home computer users
• Zhu, Greiner and Häubl (2003) used client-side data
• Statistics such as ‘time per session’ and ‘page-view duration’ are more reliable in client-side data
• Some limitations remain
  – Even statistics such as ‘page-view duration’ cannot be totally reliable, e.g. the user might leave to fetch coffee (see the duration sketch below)
  – Explicit user cooperation is needed
  – Data are typically recorded on home computers, so they may not give a complete picture of Web browsing behavior
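A common mitigation, sketched here with an assumed 30-minute idle cutoff (the cutoff value is an illustrative choice, not from the slides): treat gaps between consecutive client-side events as page-view durations and discard implausibly long gaps:

```python
from datetime import datetime, timedelta

def page_view_durations(events, idle_cap=timedelta(minutes=30)):
    """Estimate page-view durations as gaps between consecutive
    client-side events. Gaps longer than idle_cap are discarded,
    since the user may have walked away (e.g. to fetch coffee)."""
    durations = []
    for a, b in zip(events, events[1:]):
        gap = b - a
        if gap <= idle_cap:
            durations.append(gap)
    return durations

events = [datetime(2002, 4, 1, 9, 0), datetime(2002, 4, 1, 9, 3),
          datetime(2002, 4, 1, 10, 30)]  # 87-minute gap: dropped
print(page_view_durations(events))       # [datetime.timedelta(seconds=180)], i.e. 3 minutes
```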
Handling massive Web server logs
• Web server logs can be very large
  – A small university department website can receive a million requests per month
  – Amazon or Google can receive tens of millions of requests each day
• Such logs exceed main-memory capacities and are stored on disk
• The time cost of data access places significant constraints on the types of analysis possible
• In practice this means (see the streaming sketch below)
  – Analyzing a subset of the data
  – Filtering out events and fields of no direct interest
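A sketch of the streaming approach: read the log line by line so memory use stays constant regardless of file size, filtering as you go. The file name and the filtering predicate are illustrative placeholders:

```python
def filter_log(path, keep):
    """Stream a large log file line by line (never loading it all
    into memory), yielding only the lines of direct interest."""
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if keep(line):
                yield line

# Illustrative predicate: successful requests, robot traffic excluded
human_gets = filter_log("access.log",  # hypothetical file name
                        keep=lambda l: '" 200 ' in l and "bot" not in l.lower())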
Empirical client-side studies of browsing behavior
• Data for client-side studies are collected at the client over a period of time
  – Reliable page-revisitation patterns can be gathered
  – Explicit user permission is required
  – Such studies are typically conducted at universities
  – The number of individuals is small
  – This can introduce bias because of the nature of the population being studied
  – Caution must be exercised when generalizing the observations
• Nevertheless, they provide good data for studying human behavior
Early studies from 1995 to 1997
• The earliest studies of client-side data are Catledge and Pitkow (1995) and Tauscher and Greenberg (1997)
• In both studies, data were collected by logging Web browser commands
• The populations consisted of faculty, staff, and students
• Both studies found that
  – Clicking on hypertext anchors was the most common action
  – Using the ‘back’ button was the second most common action
The Cockburn and McKenzie study from 2002
• The earlier studies are relatively old, and the Web has changed dramatically since then
• Cockburn and McKenzie (2002) provide a more up-to-date analysis
  – They analyzed the daily history.dat files produced by the Netscape browser for 17 users over about 4 months
  – The population studied consisted of faculty, staff, and graduate students
• The study found revisitation rates (~0.81) higher than those reported in the 1994 and 1995 studies
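Revisitation rate is commonly computed as the fraction of page visits that return to an already-seen URL, i.e. R = 1 − (distinct pages / total visits); a short sketch:

```python
def revisitation_rate(visits):
    """Fraction of page visits that are revisits to a previously
    seen URL: R = 1 - (distinct pages / total visits)."""
    return 1 - len(set(visits)) / len(visits)

print(revisitation_rate(["a", "b", "a", "c", "a", "b"]))  # 0.5
```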
Video-based analysis of Web usage
• Byrne et al. (1999) analyzed videotaped recordings of eight different users over periods of 15 minutes to 1 hour
• Audio descriptions by the users were combined with video recordings of their screens for analysis
• The study found that
  – Users spent a considerable amount of time scrolling Web pages
  – Users spent a considerable amount of time waiting for pages to load (~15% of the time)
Probabilistic models of browsing behavior
• It is useful to build models that describe the browsing behavior of users
• Such models can generate insight into how we use the Web
• They provide a mechanism for making predictions
• They can help in pre-fetching and personalization
Markov models for page prediction
• For simplicity, consider an order-dependent, time-independent finite-state Markov chain with M states
• Let s be a sequence of observed states of length L, e.g. s = ABBCAABBCCBBAA with the three states A, B, and C
• This provides a simple generative model for producing sequential data (see the estimation sketch below)
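A minimal sketch of estimating such a chain from data: count consecutive state pairs in the observed sequence and normalize each row, using the example sequence from the slide:

```python
from collections import defaultdict

def transition_matrix(s):
    """Estimate first-order Markov transition probabilities
    P(next | current) by counting consecutive state pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for cur, nxt in zip(s, s[1:]):
        counts[cur][nxt] += 1
    return {cur: {nxt: c / sum(row.values()) for nxt, c in row.items()}
            for cur, row in counts.items()}

T = transition_matrix("ABBCAABBCCBBAA")
print(T["B"])  # {'B': 0.5, 'C': 0.333..., 'A': 0.166...}
```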
Predicting page requests with Markov models
• Many flavors of Markov models have been proposed for next-page and future-page prediction
• These are useful in pre-fetching, caching, and personalization of Web pages
• For a typical website, the number of pages is large
  – Clustering pages is useful in this case (a prediction sketch follows below)
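A sketch of next-page prediction with an estimated chain: predict the most probable transition out of the current state. The hard-coded matrix below is the one estimated from the example sequence above; for a large site, the states would typically be clusters of pages rather than individual URLs, to keep the matrix tractable:

```python
def predict_next(T, current):
    """Predict the next page as the most probable next state under
    a first-order Markov model; None if the state is unseen."""
    return max(T[current], key=T[current].get) if current in T else None

# Transition probabilities estimated from s = ABBCAABBCCBBAA
T = {"A": {"B": 0.5, "A": 0.5},
     "B": {"B": 0.5, "C": 1/3, "A": 1/6},
     "C": {"A": 1/3, "C": 1/3, "B": 1/3}}
print(predict_next(T, "B"))  # 'B' (probability 0.5)
```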