
Modeling the Internet and the Web:
Modeling and Understanding Human Behavior on the Web

School of Information and Computer Science, University of California, Irvine


Introduction

• Useful to study human digital behavior; e.g. search engine data can be used for
  – Exploration, e.g. how many queries are issued per session?
  – Modeling, e.g. is there a time-of-day dependence?
  – Prediction, e.g. which pages are relevant to a query?
• Helps to
  – Understand the social implications of Web usage
  – Design better tools for information access
  – Improve systems in networking, e-commerce, etc.


Web data and measurement issues

Background:
• Important to understand how the data is collected
• Web data is collected automatically via software logging tools
  – Advantage: no manual supervision required
  – Disadvantage: data can be skewed (e.g. due to the presence of robot traffic)
• Important to identify robots (also known as crawlers, spiders)


A time-series plot of Web requests

[Figure: Number of page requests per hour as a function of time, from page requests in the www.ics.uci.edu Web server logs during the first week of April 2002.]


Robot / human identification

• Robot requests can be identified by classifying page requests using a variety of heuristics
  – e.g. some robots self-identify in the server logs (via the user-agent field, or by requesting the robots.txt file)
  – Robots tend to explore an entire website in a breadth-first fashion
  – Humans tend to access web pages in a depth-first fashion
• Tan and Kumar (2002) discuss more techniques


Robot / human identification

• Robot traffic consists of two components
  – Periodic spikes (which can overload a server): requests by “bad” robots
  – A lower-level constant stream of requests: requests by “good” robots
• Human traffic has
  – A day-of-week pattern: heavier traffic Monday to Friday
  – A time-of-day pattern: peak around midday, low traffic from midnight to early morning
A sketch combining the identification heuristics above follows below.
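To make these heuristics concrete, here is a minimal sketch of a session classifier that combines them, assuming each session is a list of request records with 'agent' and 'path' fields. The agent-string list, the breadth-first ratio, and the thresholds are illustrative assumptions, not values from the slides or from Tan and Kumar (2002).

```python
# A sketch of heuristic robot detection -- field names and thresholds
# are illustrative assumptions, not a published algorithm.

KNOWN_BOT_AGENTS = ("googlebot", "slurp", "crawler", "spider", "bot")

def looks_like_robot(session):
    """Classify one session (a list of request dicts) as robot (True) or human."""
    agents = {req["agent"].lower() for req in session}
    paths = [req["path"] for req in session]

    # Heuristic 1: many robots self-identify in the user-agent field.
    if any(marker in agent for agent in agents for marker in KNOWN_BOT_AGENTS):
        return True

    # Heuristic 2: a request for robots.txt is a strong robot signal.
    if any(p.endswith("robots.txt") for p in paths):
        return True

    # Heuristic 3: breadth-first exploration -- robots touch many distinct
    # top-level directories, humans go deep into a few (assumed threshold).
    top_dirs = {p.split("/")[1] for p in paths if p.count("/") >= 2}
    if len(paths) >= 10 and len(top_dirs) >= 0.8 * len(paths):
        return True

    return False
```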


Server-side data

Data logging at Web servers
• The Web server sends requested pages to the requesting browser
• It can be configured to archive these requests in a log file, recording
  – URL of the page requested
  – Time and date of the request
  – IP address of the requester
  – Requesting browser information (the user agent)


Data logging at Web servers

  – Status of the request
  – Referrer page URL, if applicable
• Server-side log files
  – provide a wealth of information (a parsing sketch follows below)
  – require considerable care in interpretation
• More information in Cooley et al. (1999), Mena (1999) and Shahabi et al. (2001)
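As a concrete companion to this field list, here is a sketch of parsing one line of Apache's standard "combined" log format, which records exactly the fields above (IP, timestamp, requested URL, status, referrer, user agent); the example log line is made up.

```python
import re

# One line in Apache's "combined" log format covers the fields above:
# IP, date/time, request (URL), status, bytes, referrer, and user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

# Example (a made-up line in the standard format):
line = ('128.195.1.1 - - [01/Apr/2002:10:15:32 -0800] '
        '"GET /index.html HTTP/1.0" 200 2326 '
        '"http://www.ics.uci.edu/" "Mozilla/4.0"')
print(parse_log_line(line)["url"])  # -> /index.html
```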


Page requests, caching, and proxy servers

• In theory, the requesting browser requests a page from a Web server and the request is processed
• In practice, the picture is complicated by
  – Other users
  – Browser caching
  – Proxy-server caching


Page requests, caching, and proxy servers

[Figure: A graphical summary of how page requests from an individual user can be masked at various stages between the user’s local computer and the Web server.]


Page requests, caching, and proxy servers

• Web server logs are therefore not a complete and faithful representation of individual page views
• There are heuristics to try to infer the true actions of the user:
  – Path completion (Cooley et al. 1999; sketched below)
    • e.g. if a link exists from B to F but not from C to F, then the observed session ABCF can be interpreted as ABCBF (the user went back from C to B, and the cached revisit to B was never logged)
  – See Anderson et al. (2001) for more heuristics
• In the general case, it is hard to know exactly what the user viewed
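A minimal sketch of the path-completion idea, assuming the site's link structure is available as an adjacency map; the LINKS map and function name are illustrative.

```python
# Path completion -- a sketch of the heuristic, assuming we know the
# site's link graph. If a request has no link from the previous page,
# backtrack through the session until a page that does link to it is
# found, re-inserting the cached (unlogged) backward steps.

LINKS = {  # hypothetical site structure: page -> set of linked pages
    "A": {"B"},
    "B": {"C", "F"},
    "C": {"D"},
}

def complete_path(session, links=LINKS):
    completed = []
    for page in session:
        if completed and page not in links.get(completed[-1], set()):
            # Backtrack: re-add earlier pages until one links to `page`.
            for earlier in reversed(completed[:-1]):
                completed.append(earlier)
                if page in links.get(earlier, set()):
                    break
        completed.append(page)
    return completed

print(complete_path(["A", "B", "C", "F"]))  # -> ['A', 'B', 'C', 'B', 'F']
```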


Identifying individual users from Web server logs

• Useful to associate specific page requests with specific individual users
• The IP address is most frequently used (see the sessionization sketch below)
• Disadvantages
  – One IP address can belong to several users
  – IP addresses are often dynamically allocated
• Better to use cookies
  – Information in the cookie can be accessed by the Web server to identify an individual user over time
  – Actions by the same user during different sessions can be linked together
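A sketch of a common sessionization heuristic built on these ideas: approximate a user by the (IP address, user agent) pair, and split that user's requests into sessions after an inactivity gap. The 30-minute timeout is a conventional choice, not a value from the slides; a cookie-based user ID would slot in where the (ip, agent) key is used.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # 30-minute inactivity gap (a common convention)

def sessionize(requests):
    """Group requests into per-user sessions.

    `requests` is a list of (timestamp_seconds, ip, agent, url) tuples,
    assumed sorted by timestamp. A user is approximated by (ip, agent);
    a cookie-based user ID would be more reliable where available.
    """
    sessions = defaultdict(list)   # (ip, agent) -> list of sessions
    last_seen = {}                 # (ip, agent) -> last timestamp
    for ts, ip, agent, url in requests:
        user = (ip, agent)
        if user not in last_seen or ts - last_seen[user] > SESSION_TIMEOUT:
            sessions[user].append([])      # start a new session
        sessions[user][-1].append(url)
        last_seen[user] = ts
    return sessions
```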


Identifying individual users from Web server logs

• Commercial websites use cookies extensively
• About 90% of users have cookies permanently enabled in their browsers
• However …
  – There are privacy issues, so implicit user cooperation is needed
  – Cookies can be deleted or disabled
• Another option is to enforce user registration
  – High reliability
  – Can discourage potential visitors


Client-side data

• Advantages of collecting data at the client side:
  – Direct recording of page requests (eliminates ‘masking’ due to caching)
  – Recording of all browser-related actions by a user (including visits to multiple websites)
  – More reliable identification of individual users (e.g. by login ID for multiple users on a single computer)
• Preferred mode of data collection for studies of navigation behavior on the Web
• Companies like comScore and Nielsen use client-side software to track home computer users
• Zhu, Greiner and Häubl (2003) used client-side data


Client-side data

• Statistics like ‘time per session’ and ‘page-view duration’ are more reliable in client-side data
• Some limitations
  – Even statistics like ‘page-view duration’ are not totally reliable, e.g. the user might leave to fetch coffee
  – Explicit user cooperation is needed
  – Typically recorded on home computers, so they may not reflect a complete picture of Web browsing behavior


Handling massive Web server logs

• Web server logs can be very large
  – A small university department website gets a million requests per month
  – Amazon and Google can get tens of millions of requests each day
• Logs exceed main-memory capacity and are stored on disk
• The time cost of data access places significant constraints on the types of analysis possible
• In practice
  – Analyze a subset of the data
  – Filter out events and fields of no direct interest (a streaming sketch follows below)
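A sketch of the filtering idea: stream the log one line at a time, so memory use stays constant regardless of log size. The filter conditions are illustrative, and parse_log_line is the hypothetical parser sketched earlier.

```python
def filter_log(path):
    """Stream a large log file, yielding only records of interest.

    Reads one line at a time, so memory use stays constant no matter
    how large the log is. The conditions here are illustrative:
    keep only successful (status 200) requests for HTML pages, which
    drops images, stylesheets, and failed requests.
    """
    with open(path) as f:
        for line in f:
            rec = parse_log_line(line)  # parser sketched earlier
            if rec is None:
                continue
            if rec["status"] == "200" and rec["url"].endswith((".html", "/")):
                # Keep only the fields needed for the analysis.
                yield rec["ip"], rec["time"], rec["url"]

# Usage: iterate lazily, e.g. count requests without loading the file.
# n = sum(1 for _ in filter_log("access_log"))
```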


Empirical client-side studies of browsing behavior

• Data for client-side studies are collected at the client side over a period of time
  – Reliable page-revisitation patterns can be gathered
  – Explicit user permission is required
  – Studies are typically conducted at universities
  – The number of individuals is small
  – This can introduce bias because of the nature of the population being studied
  – Caution must be exercised when generalizing observations
• Nevertheless, such studies provide good data for studying human behavior


Early studies from 1995 to 1997

• The earliest studies on client-side data are Catledge and Pitkow (1995) and Tauscher and Greenberg (1997)
• In both studies, data was collected by logging Web browser commands
• The populations consisted of faculty, staff and students
• Both studies found that
  – clicking on hypertext anchors was the most common action
  – using the ‘back’ button was the second most common action


The Cockburn and McKenzie study from 2002

• Previous studies are relatively old, and the Web has changed dramatically in the intervening years
• Cockburn and McKenzie (2002) provide a more up-to-date analysis
  – Analyzed the daily history.dat files produced by the Netscape browser for 17 users over about 4 months
  – The population studied consisted of faculty, staff and graduate students
• The study found revisitation rates (~0.81) higher than in the earlier 1994 and 1995 studies


Video-based analysis of Web usage

• Byrne et al. (1999) analyzed video-taped recordings of eight different users, in sessions lasting from 15 minutes to 1 hour
• Audio descriptions by the users were combined with the video recordings of their screens for analysis
• The study found that
  – users spent a considerable amount of time scrolling Web pages
  – users spent a considerable amount of time waiting for pages to load (~15% of time)


Probabilistic models of browsing behavior

• Useful to build models that describe the browsing behavior of users
• Can generate insight into how we use the Web
• Provide a mechanism for making predictions
• Can help in pre-fetching and personalization


Markov models for page prediction

• For simplicity, consider a first-order (order-dependent), time-independent finite-state Markov chain with M states
• Let s be a sequence of observed states of length L, e.g. s = ABBCAABBCCBBAA with three states A, B and C
• Transition probabilities T_ij = P(s_t = j | s_{t-1} = i) can be estimated from observed transition counts
• This provides a simple generative model for producing sequential data (an estimation sketch follows below)
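A sketch of fitting such a chain by maximum likelihood: count the transitions observed in the example sequence from this slide and normalize each row. The dict-of-dicts representation is an illustrative choice.

```python
from collections import defaultdict

def fit_markov_chain(s):
    """Estimate first-order transition probabilities from one sequence.

    Maximum-likelihood estimate: T[i][j] = n_ij / sum_k n_ik, where
    n_ij counts observed transitions from state i to state j.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for prev, cur in zip(s, s[1:]):
        counts[prev][cur] += 1
    T = {}
    for i, row in counts.items():
        total = sum(row.values())
        T[i] = {j: n / total for j, n in row.items()}
    return T

# The example sequence from the slide, with three states A, B and C.
T = fit_markov_chain("ABBCAABBCCBBAA")
print(T["B"])  # P(B|B)=1/2, P(C|B)=1/3, P(A|B)=1/6
```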


Predicting page requests with Markov models

• Many flavors of Markov models have been proposed for next-page and future-page prediction
• Useful in pre-fetching, caching and personalization of Web pages
• For a typical website, the number of pages (states) is large
  – Clustering pages into a smaller number of states is useful in this case (a prediction sketch follows below)
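Building on the chain fitted above, a sketch of next-page prediction: rank candidate next states by their transition probability from the current page. The page_to_cluster mapping mentioned in the comments is hypothetical, illustrating how clustering would shrink the state space.

```python
def predict_next(T, current, k=1):
    """Return the k most probable next states from `current` under T."""
    row = T.get(current, {})
    return sorted(row, key=row.get, reverse=True)[:k]

# With the chain fitted above: the most likely page after B is B itself.
print(predict_next(T, "B"))       # -> ['B']
print(predict_next(T, "B", k=2))  # -> ['B', 'C']

# With many pages, one would first map each URL to a cluster label
# (page_to_cluster is hypothetical) and fit the chain over clusters:
# clustered = [page_to_cluster[url] for url in session]
# T = fit_markov_chain(clustered)
```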