Modeling the Internet and the Web:
Modeling and Understanding Human Behavior on the Web
School of Information and Computer Science, University of California, Irvine
Introduction
• It is useful to study human digital behavior; for example, search engine data can be used for
  – Exploration: e.g. how many queries are issued per session?
  – Modeling: e.g. is there any time-of-day dependence?
  – Prediction: e.g. which pages are relevant to a query?
• This helps us
  – Understand the social implications of Web usage
  – Design better tools for information access
  – Improve applications in networking, e-commerce, etc.
Web data and measurement issues
Background:
• It is important to understand how the data was collected
• Web data is collected automatically via software logging tools
  – Advantage: no manual supervision is required
  – Disadvantage: the data can be skewed (e.g. by the presence of robot traffic)
• It is therefore important to identify robots (also known as crawlers or spiders)
A time-series plot of Web requests
[Figure: number of page requests per hour as a function of time, from the www.ics.uci.edu Web server logs during the first week of April 2002.]
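As a rough illustration (not from the original slides), hourly counts like those plotted above could be computed from parsed log timestamps; a minimal Python sketch, assuming the timestamps have already been extracted from the log:

```python
from collections import Counter
from datetime import datetime

def hourly_counts(timestamps):
    """Count page requests per hour by truncating each timestamp
    to the start of its hour."""
    return Counter(ts.replace(minute=0, second=0, microsecond=0)
                   for ts in timestamps)

# Example: two requests in one hour, one in the next
ts = [datetime(2002, 4, 1, 9, 5), datetime(2002, 4, 1, 9, 40),
      datetime(2002, 4, 1, 10, 2)]
for hour, n in sorted(hourly_counts(ts).items()):
    print(hour, n)  # 2002-04-01 09:00:00 2, then 2002-04-01 10:00:00 1
```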
Robot / human identification
• Robot requests can be identified by classifying page requests using a variety of heuristics (a sketch of the first heuristic follows below)
  – e.g. some robots identify themselves in the server logs via the User-Agent field; well-behaved robots also request the robots.txt file
  – Robots tend to explore an entire website in breadth-first fashion
  – Humans tend to access Web pages in depth-first fashion
• Tan and Kumar (2002) discuss further techniques
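A minimal sketch of the self-identification heuristic, assuming all we have is the User-Agent string from the log; the marker list is illustrative, not exhaustive:

```python
# Illustrative substrings; real robot identification uses many more
# signals (see Tan and Kumar, 2002).
KNOWN_BOT_MARKERS = ("bot", "crawler", "spider", "slurp")

def looks_like_robot(user_agent: str) -> bool:
    """Flag a request as robot traffic if its User-Agent string
    contains a known robot marker."""
    ua = user_agent.lower()
    return any(marker in ua for marker in KNOWN_BOT_MARKERS)

print(looks_like_robot("Googlebot/2.1 (+http://www.google.com/bot.html)"))  # True
print(looks_like_robot("Mozilla/4.0 (compatible; MSIE 6.0)"))               # False
```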
• Robot traffic consists of two components
  – Periodic spikes (which can overload a server)
    • Requests by “bad” robots
  – A lower-level, constant stream of requests
    • Requests by “good” robots
• Human traffic shows
  – A daily pattern: most traffic on weekdays (Monday to Friday)
  – An hourly pattern: a peak around midday and low traffic from midnight to early morning
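One way to check a log for this human signature, as a sketch: aggregate requests by day of week and hour of day (again assuming timestamps already parsed into datetime objects):

```python
from collections import Counter
from datetime import datetime

def weekday_hour_profile(timestamps):
    """Aggregate requests by (weekday, hour-of-day). Human traffic
    tends to peak on weekdays around midday; robot streams tend to
    look flat (good robots) or spiky (bad robots) instead."""
    return Counter((ts.weekday(), ts.hour) for ts in timestamps)

profile = weekday_hour_profile([datetime(2002, 4, 1, 12, 30),
                                datetime(2002, 4, 1, 12, 45),
                                datetime(2002, 4, 7, 3, 10)])
print(profile)  # Counter({(0, 12): 2, (6, 3): 1}) -- Monday noon vs Sunday 3am
```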
Server-side data
Data logging at Web servers
• The Web server sends requested pages to the requesting browser
• It can be configured to archive these requests in a log file, recording
  – The URL of the page requested
  – The time and date of the request
  – The IP address of the requester
  – The requesting browser information (user agent)
  – The status of the request
  – The referrer page URL, if applicable
• Server-side log files
  – Provide a wealth of information
  – Require considerable care in interpretation (a parsing sketch follows below)
• More information can be found in Cooley et al. (1999), Mena (1999), and Shahabi et al. (2001)
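As an illustration, a single line in the widely used “combined” log format carries exactly the fields listed above; a simplified parsing sketch (the regex and sample line are illustrative, and real log formats vary):

```python
import re

# Simplified pattern for one "combined" log format line:
# IP, timestamp, request, status, size, referrer, user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('128.195.1.1 - - [01/Apr/2002:09:05:12 -0800] '
        '"GET /index.html HTTP/1.0" 200 2326 '
        '"http://www.ics.uci.edu/" "Mozilla/4.0"')

m = LOG_PATTERN.match(line)
if m:
    print(m.group("ip"), m.group("status"), m.group("request"))
```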
Page requests, caching, and proxy servers
• In theory, the requesting browser requests a page from a Web server and the request is processed
• In practice, the picture is complicated by
  – Other users
  – Browser caching
  – Proxy-server caching
[Figure: a graphical summary of how page requests from an individual user can be masked at various stages between the user’s local computer and the Web server.]
• Web server logs are therefore not ideal as a complete and faithful representation of individual page views
• There are heuristics that try to infer the true actions of the user
  – Path completion (Cooley et al. 1999)
    • e.g. if a link B → F is known to exist but no link C → F does, then the logged session ABCF can be interpreted as ABCBF: the user pressed ‘back’ at C, and the revisit to B was served from the cache and so never reached the server log
  – See Anderson et al. (2001) for more heuristics
• In the general case, it is hard to know what the user actually viewed
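A toy sketch of the path-completion idea under these assumptions: given the site’s link graph, insert the implied back-steps (served from the browser cache, hence missing from the log) so that every transition follows a known hyperlink. The link-graph encoding and the backtracking rule are illustrative simplifications:

```python
def complete_path(session, links):
    """Insert implied backtracking steps so every transition in the
    session follows a known hyperlink. Toy rule: when no link exists
    from the current page, assume the user hit 'back' to the most
    recent page that does link to the next request."""
    completed = [session[0]]
    for nxt in session[1:]:
        if nxt not in links.get(completed[-1], set()):
            # Backtrack through earlier pages (revisits served from cache)
            for prev in reversed(completed[:-1]):
                completed.append(prev)
                if nxt in links.get(prev, set()):
                    break
        completed.append(nxt)
    return completed

# Example from the slide: B links to F, C does not
links = {"A": {"B"}, "B": {"C", "F"}, "C": set()}
print(complete_path(list("ABCF"), links))  # ['A', 'B', 'C', 'B', 'F']
```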
Identifying individual users from Web server logs
• It is useful to associate specific page requests with specific individual users
• The IP address is the most frequently used identifier
• Disadvantages
  – One IP address can be shared by several users
  – IP addresses are often dynamically allocated
• It is better to use cookies (a sketch follows below)
  – The information in the cookie can be accessed by the Web server to identify an individual user over time
  – Actions by the same user during different sessions can be linked together
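A minimal sketch of cookie-based identification, assuming a hypothetical cookie named uid: the server issues a random ID on the first visit and reads it back on later requests, which is what lets sessions by the same user be linked:

```python
from http.cookies import SimpleCookie
import uuid

def get_or_assign_user_id(cookie_header: str) -> str:
    """Return the user's existing ID from the Cookie header, or mint
    a fresh one (which the server would send back via Set-Cookie)."""
    cookie = SimpleCookie(cookie_header or "")
    if "uid" in cookie:
        return cookie["uid"].value   # returning visitor
    return uuid.uuid4().hex          # new visitor: issue a random ID

print(get_or_assign_user_id(""))            # fresh random ID
print(get_or_assign_user_id("uid=abc123"))  # 'abc123'
```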
• Commercial websites use cookies extensively
• About 90% of users have cookies permanently enabled in their browsers
• However …
  – There are privacy issues: implicit user cooperation is needed
  – Cookies can be deleted or disabled
• Another option is to enforce user registration
  – High reliability
  – Can discourage potential visitors
Client-side data
• Advantages of collecting data at the client side:
  – Direct recording of page requests (eliminates the ‘masking’ due to caching)
  – Recording of all browser-related actions by a user (including visits to multiple websites)
  – More reliable identification of individual users (e.g. by login ID for multiple users on a single computer)
• This is the preferred mode of data collection for studies of navigation behavior on the Web
• Companies such as comScore and Nielsen use client-side software to track home computer users
• Zhu, Greiner and Häubl (2003) used client-side data
• Statistics such as ‘time per session’ and ‘page-view duration’ are more reliable in client-side data
• Some limitations remain
  – Even statistics such as ‘page-view duration’ cannot be totally reliable, e.g. the user might leave to fetch coffee (see the duration sketch below)
  – Explicit user cooperation is needed
  – Data are typically recorded on home computers, so they may not give a complete picture of Web browsing behavior
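A common mitigation, sketched here with an assumed 30-minute idle cutoff (the cutoff value is an illustrative choice, not from the slides): treat gaps between consecutive client-side events as page-view durations and discard implausibly long gaps:

```python
from datetime import datetime, timedelta

def page_view_durations(events, idle_cap=timedelta(minutes=30)):
    """Estimate page-view durations as gaps between consecutive
    client-side events. Gaps longer than idle_cap are discarded,
    since the user may have walked away (e.g. to fetch coffee)."""
    durations = []
    for a, b in zip(events, events[1:]):
        gap = b - a
        if gap <= idle_cap:
            durations.append(gap)
    return durations

events = [datetime(2002, 4, 1, 9, 0), datetime(2002, 4, 1, 9, 3),
          datetime(2002, 4, 1, 10, 30)]  # 87-minute gap: dropped
print(page_view_durations(events))       # [datetime.timedelta(seconds=180)], i.e. 3 minutes
```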
Handling massive Web server logs
• Web server logs can be very large
  – A small university department website can receive a million requests per month
  – Amazon or Google can receive tens of millions of requests each day
• Such logs exceed main-memory capacities and are stored on disk
• The time cost of data access places significant constraints on the types of analysis possible
• In practice this means (see the streaming sketch below)
  – Analyzing a subset of the data
  – Filtering out events and fields of no direct interest
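A sketch of the streaming approach: read the log line by line so memory use stays constant regardless of file size, filtering as you go. The file name and the filtering predicate are illustrative placeholders:

```python
def filter_log(path, keep):
    """Stream a large log file line by line (never loading it all
    into memory), yielding only the lines of direct interest."""
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if keep(line):
                yield line

# Illustrative predicate: successful requests, robot traffic excluded
human_gets = filter_log("access.log",  # hypothetical file name
                        keep=lambda l: '" 200 ' in l and "bot" not in l.lower())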
Empirical client-side studies of browsing behavior
• Data for client-side studies are collected at the client over a period of time
  – Reliable page-revisitation patterns can be gathered
  – Explicit user permission is required
  – Such studies are typically conducted at universities
  – The number of individuals is small
  – This can introduce bias because of the nature of the population being studied
  – Caution must be exercised when generalizing the observations
• Nevertheless, they provide good data for studying human behavior
Early studies from 1995 to 1997
• The earliest studies of client-side data are Catledge and Pitkow (1995) and Tauscher and Greenberg (1997)
• In both studies, data were collected by logging Web browser commands
• The populations consisted of faculty, staff, and students
• Both studies found that
  – Clicking on hypertext anchors was the most common action
  – Using the ‘back’ button was the second most common action
The Cockburn and McKenzie study from 2002
• The earlier studies are relatively old, and the Web has changed dramatically since then
• Cockburn and McKenzie (2002) provide a more up-to-date analysis
  – They analyzed the daily history.dat files produced by the Netscape browser for 17 users over about 4 months
  – The population studied consisted of faculty, staff, and graduate students
• The study found revisitation rates (~0.81) higher than those reported in the 1994 and 1995 studies
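Revisitation rate is commonly computed as the fraction of page visits that return to an already-seen URL, i.e. R = 1 − (distinct pages / total visits); a short sketch:

```python
def revisitation_rate(visits):
    """Fraction of page visits that are revisits to a previously
    seen URL: R = 1 - (distinct pages / total visits)."""
    return 1 - len(set(visits)) / len(visits)

print(revisitation_rate(["a", "b", "a", "c", "a", "b"]))  # 0.5
```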
Video-based analysis of Web usage
• Byrne et al. (1999) analyzed videotaped recordings of eight different users over periods of 15 minutes to 1 hour
• Audio descriptions by the users were combined with video recordings of their screens for analysis
• The study found that
  – Users spent a considerable amount of time scrolling Web pages
  – Users spent a considerable amount of time waiting for pages to load (~15% of the time)
Probabilistic models of browsing behavior
• It is useful to build models that describe the browsing behavior of users
• Such models can generate insight into how we use the Web
• They provide a mechanism for making predictions
• They can help in pre-fetching and personalization
Markov models for page prediction
• For simplicity, consider an order-dependent, time-independent finite-state Markov chain with M states
• Let s be a sequence of observed states of length L, e.g. s = ABBCAABBCCBBAA with the three states A, B, and C
• This provides a simple generative model for producing sequential data (see the estimation sketch below)
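A minimal sketch of estimating such a chain from data: count consecutive state pairs in the observed sequence and normalize each row, using the example sequence from the slide:

```python
from collections import defaultdict

def transition_matrix(s):
    """Estimate first-order Markov transition probabilities
    P(next | current) by counting consecutive state pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for cur, nxt in zip(s, s[1:]):
        counts[cur][nxt] += 1
    return {cur: {nxt: c / sum(row.values()) for nxt, c in row.items()}
            for cur, row in counts.items()}

T = transition_matrix("ABBCAABBCCBBAA")
print(T["B"])  # {'B': 0.5, 'C': 0.333..., 'A': 0.166...}
```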
Predicting page requests with Markov models
• Many flavors of Markov models have been proposed for next-page and future-page prediction
• These are useful in pre-fetching, caching, and personalization of Web pages
• For a typical website, the number of pages is large
  – Clustering pages is useful in this case (a prediction sketch follows below)
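A sketch of next-page prediction with an estimated chain: predict the most probable transition out of the current state. The hard-coded matrix below is the one estimated from the example sequence above; for a large site, the states would typically be clusters of pages rather than individual URLs, to keep the matrix tractable:

```python
def predict_next(T, current):
    """Predict the next page as the most probable next state under
    a first-order Markov model; None if the state is unseen."""
    return max(T[current], key=T[current].get) if current in T else None

# Transition probabilities estimated from s = ABBCAABBCCBBAA
T = {"A": {"B": 0.5, "A": 0.5},
     "B": {"B": 0.5, "C": 1/3, "A": 1/6},
     "C": {"A": 1/3, "C": 1/3, "B": 1/3}}
print(predict_next(T, "B"))  # 'B' (probability 0.5)
```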