
Upload: data-science-london

Post on 25-Dec-2014


Page 1: Survival Analysis of Web Users

“Survival” Analysis of Web Users


Dell Zhang

DCSIS, Birkbeck, University of London


Outline

• What Is It

• Why Is It Useful

• Case Study

– The Departure Dynamics of Wikipedia Editors

What Is It

Time-To-Event Data

• Survival Analysis is a branch of statistics which deals with the modelling of time-to-event data

– The outcome variable of interest is time until an event occurs.

• death, disease, failure

• recovery, marriage

– It is called reliability theory/analysis in engineering, and duration analysis/modelling in economics or sociology.

How to build a probabilistic model of Y?

How to build a probabilistic model of Y given X?

Censoring

• A key problem in survival analysis

– It occurs when we have some information about individual survival time, but we don’t know the survival time exactly.
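A common way to hold censored observations is to store, for each individual, the time observed so far together with a flag saying whether the event was actually seen. The sketch below is only an illustration of that idea: the `Observation` type, its field names, and the toy numbers are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One right-censored record (hypothetical layout, for illustration)."""
    duration: float       # time observed so far, e.g. in weeks
    event_observed: bool  # True: event seen at `duration`; False: censored


# Made-up toy data: three observed events and two censored individuals.
sample = [
    Observation(6, True),
    Observation(9, False),   # still "alive" when observation stopped
    Observation(13, True),
    Observation(20, False),  # lost to follow-up
    Observation(23, True),
]

n_censored = sum(not o.event_observed for o in sample)

# Naive summary A: discard the censored records entirely.
observed = [o.duration for o in sample if o.event_observed]
mean_discard = sum(observed) / len(observed)

# Naive summary B: treat censored durations as if they were event times.
mean_pretend = sum(o.duration for o in sample) / len(sample)

print(n_censored, mean_discard, mean_pretend)
```

Both naive summaries ignore the fact that a censored individual survived at least `duration`, which is why survival analysis uses estimators that consume the flag explicitly.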

Options:

1) Wait for those patients to die?

2) Discard the censored data?

3) Use the censored data as if they were not censored?

4) ……


Goals

• Survival Analysis attempts to answer questions such as

– What is the fraction of a population which will survive past a certain time? Of those that survive, at what rate will they die?

– Can multiple causes of death be taken into account?

– How do particular circumstances or characteristics increase or decrease the odds of survival?


• Censoring of data

• Comparing groups

– (1 treatment vs. 2 placebo)

• Confounding or Interaction factors

– Log WBC


Why Is It Useful

for Online Marketing etc.


The Data Are There

• Events meaningful to online marketing

– Time to Clicking the Ad

– Informational: Time to Finding the Wanted Info

– Transactional: Time to Buying the Product

– Social: Time to Joining/Leaving the Community

– ……

Time Matters!

Evidence-Based Marketing

• Let’s work as (real) doctors

– Users = Patients

– Advertisement (Marketing) = Treatment

Survival Analysis brings the time dimension back to the centre stage.


Predict whether a new question asked on Stack Overflow will be closed when


Case Study

The Departure Dynamics of Wikipedia Editors


About 90,000 regularly active volunteer editors around the world

Departure Dynamics

• Who are likely to “die”?

• How soon will they “die”?

• Why do they “die”?

“live” = stay in the editors’ community = keep editing

“die” = leave the editors’ community = stop editing (for 5 months)
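The definition above can be turned into a labelling rule. The following is a minimal sketch, not the author's code: the function name is hypothetical, "5 months" is approximated as 150 days, and an editor who is not yet "dead" at the end of observation is treated as right-censored.

```python
from datetime import datetime, timedelta

# Assumption: "5 months" is approximated as 150 days; the slides do not say
# how the inactivity window was measured.
INACTIVITY_WINDOW = timedelta(days=150)


def lifetime_and_status(edit_times, observation_end):
    """Return (observed_lifetime_in_days, died) for one editor.

    The observed lifetime runs from the first to the last edit. `died` is
    True when the last edit is followed by at least INACTIVITY_WINDOW without
    edits before `observation_end`; otherwise the editor counts as still
    "alive", i.e. the lifetime is right-censored.
    """
    first, last = min(edit_times), max(edit_times)
    died = observation_end - last >= INACTIVITY_WINDOW
    return (last - first).days, died


end = datetime(2010, 9, 1)
departed = lifetime_and_status([datetime(2010, 1, 1), datetime(2010, 2, 15)], end)
active = lifetime_and_status([datetime(2010, 1, 1), datetime(2010, 8, 1)], end)
print(departed, active)
```

The first editor has been silent for more than 150 days and is labelled "dead"; the second edited a month before the cut-off and is only known to have survived at least that long.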


Who are likely to “die”?

(WikiChallenge)

[Figure: timeline marked with the dates 2001-01-01, 2001-06-01, 2010-04-01, 2010-09-01 and 2011-02-01.]

Behavioural Dynamics Features

[Figure: Exponential Steps; axis in months.]

Web Search (SIGIR-2009), Social Tagging (WWW-2009), Language Modelling (ICTIR-2009)


© 2008-2012 ~maniraptora

Gradient Boosted Trees (GBT)

• The success of GBT in our task is probably attributable to

– its ability to capture the complex nonlinear relationship between the target variable and the features,

– its insensitivity to different feature value ranges as well as outliers, and

– its resistance to overfitting via regularisation mechanisms such as shrinkage and subsampling (Friedman 1999a; 1999b).

• GBT vs RF


Final Result

• The 2nd best valid algorithm in the WikiChallenge

– RMSLE = 0.862582: 41.7% improvement over WMF’s in-house solution

– Much simpler model than the top performing system: 21 behavioural dynamics features vs. 206 features

– WMF is now implementing this algorithm permanently and looks forward to using it in the production environment.
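For readers unfamiliar with the metric, RMSLE (root mean squared logarithmic error) can be computed as sketched below. This assumes the usual definition based on log(1 + x); the function name and the toy inputs are illustrative and are not taken from the competition.

```python
import math


def rmsle(predicted, actual):
    """Root mean squared logarithmic error, assuming the common definition
    sqrt(mean((log(1 + p) - log(1 + a)) ** 2)) over non-negative counts."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for p, a in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared) / len(squared))


perfect = rmsle([0, 3, 10], [0, 3, 10])  # identical predictions
one_off = rmsle([math.e - 1], [0])       # log(1 + p) differs from log(1 + a) by 1
print(perfect, one_off)
```

Because errors are measured on a log scale, predicting 10 edits for an editor who made 20 is penalised about as much as predicting 100 for one who made 200, which suits heavily skewed edit counts.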


How soon will they “die”?

[Figure: The evolution of Wikipedia editors' community (birth & death). Labels: 110,000 random samples; January 2001.]

[Figure: The evolution of Wikipedia editors' community (active editors). Labels: 110,000 random samples; January 2001.]

Survival Function


What is the fraction of a population which will survive past a certain time?
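In the usual textbook notation, with T the random time until the event (the symbols T, F and S are assumptions here, since the slide's own formula did not survive in the transcript), the survival function answers this question directly:

```latex
S(t) \;=\; P(T > t) \;=\; 1 - F(t), \qquad S(0) = 1, \qquad \lim_{t \to \infty} S(t) = 0
```

Here F(t) = P(T ≤ t) is the cumulative distribution function of the lifetime, so S(t) is non-increasing in t.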

[Figure: The histogram of Wikipedia editors' lifetime, with customary editors and occasional editors marked.]

Kaplan-Meier Estimator
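The slide's formula is not preserved in the transcript. As a reminder, the Kaplan-Meier estimate multiplies, over the distinct event times up to t, the factor (1 − d_i / n_i), where d_i is the number of events at that time and n_i the number still at risk. A minimal pure-Python sketch follows; the function name and the toy data are hypothetical.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct event time.

    durations: observed times; events: True if the event was observed at
    that time, False if the record is right-censored there.
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    leaving = Counter(durations)  # everyone leaves the risk set at their time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Individuals censored at t still count as at risk at t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Made-up toy data: events at 6, 13 and 23; censoring at 9 and 20.
curve = kaplan_meier([6, 9, 13, 20, 23], [True, False, True, False, True])
print(curve)
```

On this toy data the estimate drops to 0.8 at t = 6, to about 0.533 at t = 13, and to 0 at t = 23; the censored records at 9 and 20 shrink the risk set without causing a drop.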

[Figure: The empirical survival function.]

[Figures: Probability plots for the Normal, Extreme Value, Rayleigh, Exponential, Lognormal and Weibull distributions.]

[Figure: The survival function.]

Weibull distribution
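For reference, the standard two-parameter Weibull model is written below with scale λ > 0 and shape k > 0. These symbols are assumptions; the fitted parameter values from the talk are not in the transcript.

```latex
S(t) = \exp\!\left[-\left(\frac{t}{\lambda}\right)^{k}\right], \qquad
h(t) = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1}, \qquad
t_{\mathrm{median}} = \lambda\,(\ln 2)^{1/k}

\ln\bigl(-\ln S(t)\bigr) = k \ln t - k \ln \lambda
```

The last identity is why a Weibull probability plot is a straight line when the model fits: ln(−ln S(t)) is linear in ln t with slope k. The hazard decreases over time when k < 1, is constant when k = 1 (the exponential case), and increases when k > 1.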


Expected Future Lifetime


median lifetime: 53 days
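The slide's formula is missing from the transcript. The standard definitions, with S the survival function and t_0 the age already reached (notation assumed here), are:

```latex
E\bigl[\,T - t_0 \mid T > t_0\,\bigr] \;=\; \frac{1}{S(t_0)} \int_{t_0}^{\infty} S(t)\,dt,
\qquad
S\bigl(t_{\mathrm{median}}\bigr) = \tfrac{1}{2}
```

The first expression is the expected remaining lifetime of an individual who has already survived to t_0; the second defines the median lifetime as the time by which half of the population has "died".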


Hazard Function

The instantaneous potential per unit time for the event to occur, given that the individual has survived up to time t.

Of those that survive, at what rate will they die?
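In symbols, using the standard definition (f is the lifetime density and S the survival function; the notation is assumed, since the slide's formula is not in the transcript):

```latex
h(t) \;=\; \lim_{\Delta t \to 0} \frac{P\bigl(t \le T < t + \Delta t \mid T \ge t\bigr)}{\Delta t}
\;=\; \frac{f(t)}{S(t)} \;=\; -\frac{d}{dt} \ln S(t),
\qquad
S(t) = \exp\!\left(-\int_{0}^{t} h(u)\,du\right)
```

The hazard is a rate rather than a probability, so it can exceed 1; the second identity shows that the hazard function and the survival function each determine the other.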


Bathtub Curve

http://en.wikipedia.org/wiki/Bathtub_curve

[Figures: The hazard function.]

Conclusions

• For customary Wikipedia editors,

– the survival function can be well described by a Weibull distribution (with a median lifetime of about 53 days);

– there are two critical phases (0-2 weeks and 8-20 weeks) when the hazard rate of becoming inactive increases;

– more active editors tend to stay active in editing for a longer time.


Why do they “die”?


Covariates

Last Edit


Cox Proportional Hazards Model
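The model's formula is not preserved in the transcript. In its standard form, with covariates X = (X_1, …, X_p) and coefficients β_j (notation assumed), it reads:

```latex
h(t \mid X) \;=\; h_0(t)\,\exp\!\left(\sum_{j=1}^{p} \beta_j X_j\right)
```

Here h_0(t) is the baseline hazard shared by all individuals. The covariates scale it by a factor that does not depend on t, which is the "proportional hazards" assumption.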


Semi-Parametric

• The Cox model is popular largely because it is semi-parametric

– The baseline hazard is unspecified

– Robust: it will closely approximate the correct parametric model

– Using a minimum of assumptions


Cox PH vs. Logistic


Maximum Likelihood Estimation
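The slide body is missing from the transcript. For the Cox model, estimation is usually done by maximising the partial likelihood, shown here in its standard form assuming no tied event times (δ_i is the event indicator and R(t_i) the set of individuals still at risk just before t_i; the notation is assumed):

```latex
L(\beta) \;=\; \prod_{i\,:\,\delta_i = 1}
\frac{\exp\!\left(\beta^{\top} X_i\right)}
     {\sum_{j \in R(t_i)} \exp\!\left(\beta^{\top} X_j\right)}
```

The baseline hazard h_0(t) cancels in every factor, so β can be estimated without specifying it. Censored individuals contribute only through the risk sets in the denominators.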


Cox Proportional Hazards Model

Covariate              β        se      z         p
X1: namespace==Main    -0.1095  0.0172  -6.3664   0.1935e-9
X2: log(1+cur_size)    -0.0688  0.0036  -19.2474  0.0000e-9

Hazard Ratio
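In a Cox model the hazard ratio for a one-unit increase in a covariate is exp(β). The sketch below applies that to the coefficients reported in the Cox model table of this talk; the function name is hypothetical, and the Wald-type 95% interval exp(β ± 1.96·se) is a common convention assumed here, not something stated in the slides.

```python
import math

# (beta, se) pairs copied from the Cox model table in this talk.
coefficients = {
    "namespace==Main": (-0.1095, 0.0172),
    "log(1+cur_size)": (-0.0688, 0.0036),
}


def hazard_ratio(beta, se, z_crit=1.96):
    """Return (HR, lower, upper), using a Wald-type interval
    exp(beta +/- z_crit * se); z_crit=1.96 gives roughly 95% coverage."""
    return (
        math.exp(beta),
        math.exp(beta - z_crit * se),
        math.exp(beta + z_crit * se),
    )


ratios = {name: hazard_ratio(b, se) for name, (b, se) in coefficients.items()}
for name, (hr, lower, upper) in ratios.items():
    print(f"{name}: HR={hr:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

Both hazard ratios come out below 1 (about 0.90 and 0.93). Under the model, and holding the other covariate fixed, each covariate is therefore associated with a lower hazard of "death".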


Adjusted Survival Curves


Next Step

Cartoon: Ron Hipschman; Data: David Hand

Lightning Does Strike Twice!

• Roy Sullivan, a former park ranger from Virginia

– He was struck by lightning 7 times

• 1942 (lost big-toe nail)

• 1969 (lost eyebrows)

• 1970 (left shoulder seared)

• 1972 (hair set on fire)

• 1973 (hair set on fire & legs seared)

• 1976 (ankle injured)

• 1977 (chest & stomach burned)

– He committed suicide in September 1983.


A Lot More To Do

• Multiple Occurrences of “Death”

– Recurrent Event Survival Analysis (e.g., based on Counting Process)

• Multiple Types of “Death”

– Competing Risks Survival Analysis


Software Tools

• R

– The ‘survival’ package

• Matlab

– The ‘statistics’ toolbox

• Python

– The ‘statsmodels’ module?


References

• David G. Kleinbaum and Mitchel Klein. Survival Analysis: A Self-Learning Text. Springer, 3rd edition, 2011. http://goo.gl/wFtta

• John Wallace. How Big Data is Changing Retail Marketing Analytics. Webinar, Apr 2005. http://goo.gl/OlMmi

• Dell Zhang, Karl Prior, and Mark Levene. How Long Do Wikipedia Editors Keep Active? In Proceedings of the 8th International Symposium on Wikis and Open Collaboration (WikiSym), Linz, Austria, Aug 2012. http://goo.gl/On3qr

• Dell Zhang. Wikipedia Edit Number Prediction based on Temporal Dynamics. The Computing Research Repository (CoRR) abs/1110.5051. Oct 2011. http://goo.gl/s2Dex


?
