
Mining Event or State Sequences

Mining Event or State Sequences: A Social Science Perspective

Gilbert Ritschard

Department of Econometrics, University of Geneva
http://mephisto.unige.ch

IIS 2008, Zakopane, Poland, June 16-18

13/7/2008gr 1/86


Example of scientific life course

My talk is about life courses. An example, to help you understand what a social scientist does at IIS:

date        event
1970-1979   Studies in econometrics
1980-1992   Mathematical Economics
1985-...    Work with social scientists (family studies); interest in statistics for social sciences
1990-1995   Interest in Neural Networks
2000-...    KDD and data mining (clustering, supervised learning)
2003-...    Work with historians, demographers, psychologists (longitudinal data)
2005-...    KDD and data mining approaches for analysing life course data


Outline

1 Sequence Analysis in Social Sciences

2 Survival Trees

3 Visualizing and clustering sequence data

4 Mining Frequent Episodes

1 Sequence Analysis in Social Sciences

Motivation

Individual life course paradigm.
• Following macro quantities (e.g. #divorces, fertility rate, mean education level, ...) over time insufficient for understanding social behavior.
• Need to follow individual life courses.

Data availability
• Large panel surveys in many countries (SHP, CHER, SILC, GGP, ...).
• Biographical retrospective surveys (FFS, ...).
• Statistical matching of censuses, population registers and other administrative data.


Need for suitable methods for discovering interesting knowledge from these individual longitudinal data.

Social scientists use
• essentially survival analysis (Event History Analysis);
• more rarely sequential data analysis (Optimal Matching, Markov Chain Models).

Could social scientists benefit from data-mining approaches?
• Which methods?
• Are there specific issues with those methods for social scientists?


Motivation: KD in Social sciences

In KDD and data mining, focus is on prediction and classification.
• Aim: reduce prediction and classification errors.

In social science, aim is understanding/explaining (social) behaviors.
• Hence focus is on process rather than output.


What kind of data

What kind of data are we dealing with?
• Mainly categorical longitudinal data describing life courses.
• An ontology of longitudinal data (Aristotelean tree).


Alternative views of Individual Longitudinal Data

Table: Time stamped events, record for Sandra

ending secondary school in 1970
first job in 1971
marriage in 1973

Table: State sequence view, Sandra

year             1969     1970       1971       1972       1973
civil status     single   single     single     single     married
education level  primary  secondary  secondary  secondary  secondary
job              no       no         first      first      first
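The passage from the time stamped view to the state sequence view can be sketched as follows. This is only an illustration: the function name `state_sequences`, the tuple layout of the events and the initial states (read off the 1969 column of Sandra's table) are assumptions, and an event is assumed to change the state from its own year onward, as in the table.

```python
def state_sequences(events, initial, years):
    """Turn time stamped events into one state sequence per variable.

    events  : list of (year, variable, new_state) tuples (hypothetical layout)
    initial : dict variable -> state before any event occurs
    years   : iterable of observation years, in increasing order
    """
    current = dict(initial)
    table = {variable: [] for variable in initial}
    pending = sorted(events)  # chronological order
    k = 0
    for year in years:
        # Apply every event that has occurred up to and including this year.
        while k < len(pending) and pending[k][0] <= year:
            _, variable, new_state = pending[k]
            current[variable] = new_state
            k += 1
        for variable in table:
            table[variable].append(current[variable])
    return table


# Sandra's record from the slide.
sandra_events = [
    (1970, "education level", "secondary"),
    (1971, "job", "first"),
    (1973, "civil status", "married"),
]
sandra_initial = {"civil status": "single", "education level": "primary", "job": "no"}
sandra = state_sequences(sandra_events, sandra_initial, range(1969, 1974))
```

Each entry of `sandra` is one row of the state sequence table; the event view is more compact, while the state view gives one value per year and variable.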


Issues with life course data

Incomplete sequences
• Censored and truncated data: cases falling out of observation before experiencing an event of interest.
• Sequences of varying length.

Time varying predictors.
• Example: when analysing time to divorce, presence of children is a time varying predictor.

Data collected by clusters
• Example: household panel surveys.
• Multi-level analysis to account for unobserved shared characteristics of members of a same cluster.


Multi-level: Simple linear regression example

[Figure: scatter plot of Children (vertical axis, 0 to 9) against Education (horizontal axis, 1 to 15), with the fitted lines y = 3.2 + 0.2 x, y = 6.2 - 0.8 x, y = 15.6 - 0.8 x and y = 12.5 - 0.8 x.]


Methods for Longitudinal Data

Classical statistical approaches
Survival Approaches

Survival or Event history analysis (Blossfeld and Rohwer, 2002)
• Focuses on one event.
• Concerned with duration until event occurs or with hazard of experiencing event.

Survival curves: distribution of duration until event occurs

$S(t) = p(T \ge t)$.

Hazard models: regression-like models for $S(t, x)$ or for the hazard $h(t) = p(T = t \mid T \ge t)$

$h(t, x) = g\bigl(t, \beta_0 + \beta_1 x_1 + \beta_2 x_2(t) + \cdots\bigr)$.
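The two definitions above, S(t) = p(T ≥ t) and h(t) = p(T = t | T ≥ t), can be estimated from censored durations along the following lines. This is a minimal sketch for discrete (integer) time, not the estimator used for the slides; the function name, the toy durations and the convention that a case censored at t still counts as at risk at t are all assumptions.

```python
def hazard_and_survival(durations, observed):
    """Discrete-time estimates of h(t) and S(t) from possibly censored data.

    durations : integer durations (time of event or of censoring)
    observed  : True if the event was observed, False if censored
    Assumption: a case censored at t is still counted as at risk at t.
    """
    hazard, survival = {}, {}
    s = 1.0
    for t in range(1, max(durations) + 1):
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, o in zip(durations, observed) if d == t and o)
        survival[t] = s                 # S(t) = p(T >= t)
        hazard[t] = events / at_risk    # h(t) = p(T = t | T >= t)
        s *= 1.0 - hazard[t]            # S(t + 1) = S(t) * (1 - h(t))
    return hazard, survival


# Hypothetical toy data: five cases, two of them censored.
toy_durations = [2, 3, 3, 5, 5]
toy_observed = [True, True, False, True, False]
toy_hazard, toy_survival = hazard_and_survival(toy_durations, toy_observed)
```

The product form S(t + 1) = S(t)(1 − h(t)) is what links the survival curve view and the hazard view of the same data.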


Survival curves (Switzerland, SHP 2002 biographical survey)

[Figure: survival curves for women; survival probability (0 to 1) against age (0 to 80 years) for the events Leaving home, Marriage, 1st Childbirth, Parents' death, Last child left, Divorce and Widowing.]


Analysis of sequences

Frequencies of given subsequences
• Essentially event sequences.
• Subsequences considered as categories ⇒ methods for categorical data apply (frequencies, cross tables, log-linear models, logistic regression, ...).

Markov chain models
• State sequences.
• Focuses on transition rates between states.
• Does the rate also depend on previous states?
• How many previous states are significant?

Optimal Matching (Abbott and Forrest, 1986)
• State sequences.
• Edit distance (Levenshtein, 1966; Needleman and Wunsch, 1970) between pairs of sequences.
• Clustering of sequences.
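For the Markov chain item above, first-order transition rates between states can be estimated by simple counting. The sketch below is illustrative only: the function name and the two toy sequences are assumptions, and it does not address the higher-order questions raised on the slide (dependence on earlier states).

```python
from collections import Counter, defaultdict


def transition_rates(sequences):
    """Estimate first-order transition rates p(next state | current state)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return {
        current: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for current, c in counts.items()
    }


# Hypothetical state sequences (S = single, M = married, D = divorced).
toy_sequences = [["S", "S", "M", "M"], ["S", "M", "M", "D"]]
toy_rates = transition_rates(toy_sequences)
```

Here `toy_rates["S"]["M"]` is the share of observed transitions out of state S that lead to state M.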


Typology of methods for life course data

              Issues
Questions     duration/hazard                          state/event sequencing
descriptive   • Survival curves: parametric            • Optimal matching clustering
                (Weibull, Gompertz, ...) and           • Frequencies of given patterns
                non parametric (Kaplan-Meier,          • Discovering typical episodes
                Nelson-Aalen) estimators.
causality     • Hazard regression models (Cox, ...)    • Markov models
              • Survival trees                         • Mobility trees
                                                       • Association rules among episodes

2 Survival Trees

The biographical SHP dataset

SHP biographical retrospective survey
http://www.swisspanel.ch

SHP retrospective survey: 2001 (860 cases) and 2002 (4700 cases).
• We consider only data collected in 2002.
• Data completed with variables from 2002 wave (language).

Characteristics of retained data for divorce (individuals who got married at least once)

                          men     women   Total
Total                     1414    1656    3070
1st marriage dissolution  231     308     539
                          16.3%   18.6%   17.6%


Distribution by birth cohort

[Figure: histogram of birth year (1910 to 1960); vertical axis: frequency (0 to 500).]


Marriage duration until divorce
Survival curves

[Figure: survival curves of marriage duration (0 to 40 years; survival probability 0.5 to 1) by birth cohort (1942 and before, 1943-1952, 1953 and after), with one panel for women and one for men.]


Marriage duration until divorce
Hazard model

Discrete time model (logistic regression on person-year data).
exp(B) gives the odds ratio, i.e. the multiplicative change in the odds h/(1 − h) when the covariate increases by 1 unit.

                    exp(B)         Sig.
birthyr             1.0088         0.002
university          1.22           0.043
child               0.73           0.000
language  unknwn    1.47           0.000
          French    1.26           0.007
          German    1              ref
          Italian   0.89           0.537
Constant            0.0000000004   0.000
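Two steps behind this table can be made concrete: the person-year layout on which the logistic regression is run, and the way an odds ratio exp(B) acts on a hazard. The sketch below is an assumption-laden illustration; the function names, the record layout and the baseline hazard of 0.02 are hypothetical and are not taken from the SHP data.

```python
def person_years(marriage_year, end_year, divorced):
    """Expand one marriage into one row per year at risk.

    end_year is the year of divorce if divorced is True, otherwise the
    last observed year (censoring). The 'divorce' column is the binary
    response of the discrete time logistic regression.
    """
    return [
        {
            "year": year,
            "duration": year - marriage_year,
            "divorce": int(divorced and year == end_year),
        }
        for year in range(marriage_year, end_year + 1)
    ]


def apply_odds_ratio(hazard, odds_ratio):
    """Multiply the odds h/(1 - h) by odds_ratio and convert back to a hazard."""
    odds = hazard / (1.0 - hazard) * odds_ratio
    return odds / (1.0 + odds)


rows = person_years(1990, 1993, divorced=True)

# Hypothetical baseline yearly hazard of 0.02, combined with the odds ratio
# 0.73 reported for 'child' in the table above.
hazard_with_child = apply_odds_ratio(0.02, 0.73)
```

For small hazards the odds and the hazard are close, so an odds ratio of 0.73 lowers the yearly hazard by roughly a quarter.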


Survival trees: Principle

Target is survival curve or some other survival characteristic.
Aim: partition data set into groups that
• differ as much as possible (max between class variability).
  Example: Segal (1988) maximizes difference in KM survival curves by selecting split with smallest p-value of Tarone-Ware Chi-square statistics

  $$TW = \frac{\sum_i w_i \left(d_{i1} - E(D_i)\right)}{\left(\sum_i w_i^2 \,\mathrm{var}(D_i)\right)^{1/2}}$$

• are as homogeneous as possible (min within class variability).
  Example: Leblanc and Crowley (1992) maximize gain in deviance (-log-likelihood) of relative risk estimates.
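To make the Tarone-Ware criterion concrete, the statistic for one candidate binary split can be computed as sketched below. This is an illustration under stated assumptions, not Segal's implementation: the expectation and variance of D_i are taken to be the usual hypergeometric ones, the default weight w_i is the square root of the number at risk, and the function name and argument layout are hypothetical.

```python
from math import sqrt


def tarone_ware(times, events, group, weight=sqrt):
    """Tarone-Ware type statistic comparing group 1 with group 0.

    times  : durations until event or censoring
    events : 1 if the event was observed, 0 if censored
    group  : 1 or 0, the two sides of the candidate split
    weight : function of the number at risk (assumed default: square root)
    """
    numerator = 0.0
    variance = 0.0
    for t in sorted({u for u, e in zip(times, events) if e}):
        n = sum(1 for u in times if u >= t)                        # at risk
        n1 = sum(1 for u, g in zip(times, group) if u >= t and g == 1)
        d = sum(1 for u, e in zip(times, events) if u == t and e)  # events at t
        d1 = sum(1 for u, e, g in zip(times, events, group)
                 if u == t and e and g == 1)
        w = weight(n)
        numerator += w * (d1 - d * n1 / n)      # d_i1 - E(D_i)
        if n > 1:
            variance += w ** 2 * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return numerator / sqrt(variance) if variance > 0 else 0.0
```

A tree-growing procedure in this spirit would evaluate the statistic for every admissible split and retain the one with the largest absolute value (smallest p-value); setting `weight=lambda n: 1.0` gives the unweighted log-rank form.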


Example

Divorce, Switzerland, Differences in KM Survival Curves I

[Figure: content not recoverable from the transcript.]


Divorce, Switzerland, Differences in KM Survival Curves II

[Figure: KM survival curves (survival probability 0.5 to 1.0, duration 0 to 40) for the following groups:]
• Cohort <= 1940 & Non French Speaking & University
• Cohort <= 1940 & Non French Speaking & < University
• Cohort <= 1940 & French Speaking
• Cohort > 1940 & No Child & University
• Cohort > 1940 & No Child & < University
• Cohort > 1940 & Child & German or Italian Speaking
• Cohort > 1940 & Child & French or Unknown Speaking

Divorce, Switzerland, Relative risk

[Figure: content not recoverable from the transcript.]


Hazard model with interaction

Adding interaction effects detected with the tree approach significantly improves the fit (sig ∆χ2 = 0.004).

                      exp(B)   Sig.
born after 1940       1.78     0.000
university            1.22     0.049
child                 0.94     0.619
language  unknwn      1.50     0.000
          French      1.12     0.282
          German      1        ref
          Italian     0.92     0.677
b_before_40*French    1.46     0.028
b_after_40*child      0.68     0.010
Constant              0.008    0.000


Social Science Issues

Issues with survival trees in social sciences

1 Dealing with time varying predictors
  • Segal (1992) discusses a few possibilities, none being really satisfactory.
  • Huang et al. (1998) propose a piecewise constant approach suitable for discrete variables and a limited number of changes.
  • Room for development ...

2 Multi-level analysis
  • How can we account for multi-level effects in survival trees, and more generally in trees?
  • Conjecture: should be possible to include unobserved shared effect in deviance-based splitting criteria.

3 Visualizing and clustering sequence data

Life trajectories

Sequence analysis

Survival approaches not useful in a unitary (holistic) perspective of the whole life course.
Sequence analysis of whole collection of life events better suited for such holistic approach (Billari, 2005).

Rendering sequences
• Colorize your life courses

Results from the analysis of the retrospective Swiss Household Panel (SHP) survey.
• Focus on visualization of life course data.


Evolution tendencies in familial life course trajectories

Sequence analysis techniques make it possible to test hypotheses about the evolution of these familial life trajectories (Elzinga and Liefbroer, 2007):

• De-standardization: some states and events of familial life are shared by decreasing proportions of the population, occur at more dispersed ages, and their durations are also more scattered.
• De-institutionalization: social and temporal organization of life courses becomes less driven by normative, legal or institutional rules.
• Differentiation: the number of distinct steps lived by an individual increases.


Example: the BioFam sequential data set

Presentation of the “BioFam” data

Data from the retrospective survey conducted in 2002 by the Swiss Household Panel (SHP) (with support of Federal Statistical Office, Swiss National Fund for Scientific Research, University of Neuchatel).
• Retrospective survey: 5560 individuals.
• Retained familial life events: Leaving Home, First childbirth, First marriage and First divorce.
• Age 15 to 45 → 2601 remaining individuals, born between 1909 and 1957.


Distribution by birth cohort

[Figure: histogram of birth year (1910 to 1960); vertical axis: frequency (0 to 500).]


Creating state sequences

Example of time stamped data:

individual  LHome  marriage  childbirth  divorce
1           1989   1990      1992        NA


Deriving the states

Need one state for each combination of events:

state  LHome   marriage  childbirth  divorce
0      no      no        no          no
1      yes     no        no          no
2      no      yes       yes/no      no
3      yes     yes       no          no
4      no      no        yes         no
5      yes     no        yes         no
6      yes     yes       yes         no
7      yes/no  yes       yes/no      yes
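The state table above can be read as a small decision rule. The sketch below encodes it and applies it to the time stamped record of individual 1; the function names, the dictionary layout and the 1988 to 1993 observation window are assumptions made for the illustration (the slides work with ages 15 to 45, but the birth year of individual 1 is not given).

```python
def biofam_state(left_home, married, child, divorced):
    """Map the four event indicators to the state codes 0..7 of the table."""
    if divorced:
        return 7                       # yes/no, yes, yes/no, yes
    if married:
        if not left_home:
            return 2                   # no, yes, yes/no, no
        return 6 if child else 3
    if child:
        return 5 if left_home else 4
    return 1 if left_home else 0


def yearly_states(record, years):
    """State in each year; an event counts from the year it occurs onward."""
    def happened(event, year):
        when = record.get(event)
        return when is not None and when <= year

    return [
        biofam_state(
            happened("LHome", y),
            happened("marriage", y),
            happened("childbirth", y),
            happened("divorce", y),
        )
        for y in years
    ]


# Individual 1 from the time stamped example (divorce is NA).
individual_1 = {"LHome": 1989, "marriage": 1990, "childbirth": 1992, "divorce": None}
states_1 = yearly_states(individual_1, range(1988, 1994))
```

With this window the record unfolds as: at home, left home, married without child (two years), married with child (two years).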


Characteristics of sequences

Definition

Entropy: measure of uncertainty regarding sequence predictability.
• $p_i$: proportion of cases (or time points) in state $i$.
• Shannon: $h(p) = \sum_i -p_i \log_2(p_i)$
• Other types of entropy: Quadratic (Gini), Daroczy, ...

Two ways of using entropies.
• Entropy of the state at each time (age) point: entropy increases with diversity of states observed at each time point (age).
• Entropy of each individual sequence: entropy increases with diversity of states during the observed life course and varies with the time spent in each state.
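The two uses of the Shannon entropy can be sketched with the same small function, applied either to the states observed at one time point or to the states of one individual sequence. The function name and the three toy sequences are assumptions; the codes reuse the 0 to 7 state coding of the BioFam example.

```python
from collections import Counter
from math import log2


def shannon_entropy(states):
    """Shannon entropy h(p) = sum_i -p_i log2(p_i) of a collection of states."""
    counts = Counter(states)
    total = len(states)
    return sum(-(n / total) * log2(n / total) for n in counts.values())


# Hypothetical sequences: one row per individual, one column per age.
toy_sequences = [
    [0, 0, 1, 3],
    [0, 1, 1, 3],
    [0, 0, 0, 0],
]

# Entropy of the state distribution at each time (age) point: column-wise.
entropy_by_time = [shannon_entropy(column) for column in zip(*toy_sequences)]

# Entropy of each individual sequence: row-wise.
entropy_by_sequence = [shannon_entropy(seq) for seq in toy_sequences]
```

The first list describes how diverse the population is at each age; the second describes how diverse each individual life course is.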


Entropy of the state at each time (age) point

[Figure: entropy of the biofam state distribution by age; horizontal axis: age (a15 to a29), vertical axis: entropy (0.2 to 0.8).]


Entropy: Minimum/maximum

[Figure: sequences with minimum, median and maximum entropy (sequences 1-15, sorted by entropy); horizontal axis: time (A15 to A45). Legend of states (LHome/marriage/childbirth/divorce): N/N/N/N, Y/N/N/N, N/Y/*/N, Y/Y/N/N, N/N/Y/N, Y/N/Y/N, Y/Y/Y/N, */*/*/Y.]


Entropy - histogram

[Figure: histogram of the entropy of the sequences in the biofam data set; horizontal axis: entropy (0.0 to 1.5), vertical axis: frequency (0 to 500).]


Hypothesis

The evolution of familial life trajectories gives rise to an increase in the entropy of individual sequences, because they become less predictable and more diversified.


Entropy by birth cohorts

[Figure: distribution of sequence entropy (0.0 to 1.5) by birth cohort (1909-18, 1919-28, 1929-38, 1939-48, 1949-58).]


Entropy by sex

[Figure: distribution of sequence entropy (0.0 to 1.5) by sex (men, women).]


Definition

Turbulence (Elzinga and Liefbroer, 2007): somewhat similar to entropy.
Turbulence accounts for state sequencing (which is not the case of the entropy).
Turbulence accounts for the following two elements:

• number of subsequences:
  x = S,U,M,MC (16 subsequences) is more turbulent than y = S,U,S,C (15 subsequences);
• variance of duration in each state:
  S/10 U/2 M/132 is less turbulent than S/48 U/48 M/48.


Turbulence - Minimum/maximum

[Figure: sequences with minimum, median and maximum turbulence (sequences 1-15, sorted by turbulence); horizontal axis: time (A15 to A45). Legend of states (LHome/marriage/childbirth/divorce): N/N/N/N, Y/N/N/N, N/Y/*/N, Y/Y/N/N, N/N/Y/N, Y/N/Y/N, Y/Y/Y/N, */*/*/Y.]


Turbulence - histogram

[Figure: histogram of the turbulence of the sequences in the biofam data set; horizontal axis: turbulence (2 to 10), vertical axis: frequency (0 to 600).]


Turbulence by cohorts

[Figure: distribution of sequence turbulence (2 to 10) by birth cohort (1909-18, 1919-28, 1929-38, 1939-48, 1949-58).]


Distances between sequences: Clustering

Clustering, Multidimensional scaling and more

Once you are able to compute pairwise distances between sequences you can, among others:
• Cluster sequences.
• Make scatter plot representations of sets of sequences using multidimensional scaling.


Distances between sequences

Edit distance (known as Optimal matching in social sciences) (Levenshtein, 1966; Needleman and Wunsch, 1970; Abbott and Forrest, 1986)
• d(x, y): total cost of the insert, deletion and substitution changes required to transform sequence x into y.
• Different solutions depending on indel and substitution costs.

Other metrics proposed by Elzinga (2008)
• LCP: longest common prefix (also longest common postfix).
• LCS: longest common subsequence (same as OM with indel cost = 1 and substitution cost = 2).
• NMS: number of matching subsequences.
• ...

Elzinga (2008) proposes a nice formalization of these metrics.
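The optimal matching distance and its link with the longest common subsequence can be sketched with the classical dynamic programming recursion. This is a simplified illustration: the function names are assumptions, and a single constant substitution cost is used where applications may use a full substitution cost matrix.

```python
def om_distance(x, y, indel=1.0, sub=2.0):
    """Minimal total cost of indels and substitutions turning x into y."""
    m, n = len(x), len(y)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * indel
    for j in range(1, n + 1):
        d[0][j] = j * indel
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0.0 if x[i - 1] == y[j - 1] else sub
            d[i][j] = min(
                d[i - 1][j] + indel,      # delete x[i-1]
                d[i][j - 1] + indel,      # insert y[j-1]
                d[i - 1][j - 1] + cost,   # match or substitute
            )
    return d[m][n]


def lcs_length(x, y):
    """Length of the longest common subsequence of x and y."""
    m, n = len(x), len(y)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]


x = ["S", "U", "M", "MC"]
y = ["S", "U", "S", "C"]
d_om = om_distance(x, y, indel=1.0, sub=2.0)
d_lcs = len(x) + len(y) - 2 * lcs_length(x, y)
```

With indel cost 1 and substitution cost 2, a substitution never beats a deletion followed by an insertion, which is why `d_om` coincides with the LCS-based distance `d_lcs`. Computing `om_distance` for every pair of sequences yields the distance matrix needed for clustering or multidimensional scaling.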


Dendrogram, OM1 versus OM3
different indel costs (1 vs 3)

[Figure: dendrograms of the sequences obtained with indel cost 1 (OM1) and indel cost 3 (OM3).]

1354

1588

1608

1638

2021

2090

2300

2329

2410

2441 36

155

476

610

4713

5714

1414

1517

9218

5220

5122

3122

4422

9623

6823

8724

9425

28 46

1290

390

1454

2278 27

814

1117

2325

00 402

1054

1472

2263

2264

124

558

1142

2089 42

410

6212

5118

83 422

1550

2514

2515 60

822

9316

3623

2824

80 567

799

2138

1151

1353

346

1701 49

613

08 692

1699

604

895

1910

1253

2430

2199 66

821

6637

737

893

117

1017

1120

1016

11 855

2490

508

1064

624

2172

1462 99

322

0893

620

0823

2319

1921

3319

39 1815

2820

54 2199

521

3615

5523

6415

38 3222

48 263

1240

2189

1831 32

612

41 5725

04 8420

66 998

1007

270

987

2170

2037

1470

1948

2505

296

2222

1381

2112

477

1772

1841

2517 85 34

120

3135

214

3223

4517

5620

83 9317

2917

6910

0119

74 25

1579

2548

2559

1228

1959

2436 33

210

3814

9622

51 626

1061

1877

1322

1924

2507 51

1474 59

718

4222

1324

4525

13 476

1774

2369

177

1133

2279 60

720

5335

357

910

1657

810

18 827

994 66

1601

613

1605 75

420

0422

56 436

1010 49

211

8012

5525

6715

9519

03 200

338

2280 30

396

821

5020

114

2719

7324

89 997

1439

1367

1665

1996

1668

2527 39

689

797

614

4921

11 906

2287

1876 27

1002

1594 4

323

1912

4787

113

99 903

1009

2020 14

125

37 743

1446

2418 18

219

9717

3423

7525

424

2718

3910

0619

23 7312

9523

5221

9153

214

2414

3110

719

8916

81 612

1500

1896 40

679

110

8612

2319

6020

4421

97 106

1863

2192 250

339

2351

1227

1942

2290 16

659

528 17

023

9914

6510

4615

46 294

475

391

2332

1531

1574

1918

2471

127

1871

628

2481

2076 87

814

0819

94 700

1030

2061 52

213

7122

10 661

1487

2470 4

112

0238

113

4034

022

00 651

1330

1455 10

316

7024

4213

3123

5014

1323

57 798

1352

2512

1428 4

671

2338 80 57

511

7411

7516

1216

4822

6825

38 538

1311

2235 54

015

9054

423

8825

80 412

2073 86

419

7911

0317

614

8017

6224

7323

1424

626

444

812

8515

9823

1824

26 689

1520

1393

1968

741

865

1800

2059 22

596

114

7120

78 286

2303

1743

631

2575

1494 647

2079

2230

2341 95

824

1222

9724

86 882

2310

1098

1865

1657

2028

2029 5 59

210

48 512

515

1628 39

917

7919

35 418

2432 68

810

1315

4016

1311

7211

7322

6024

05 47

950

1327

2354

618

2195 67

413

7222

14 6725

3455

776

156

412

2015

3214

9515

8320

85 64 854

2247

2460 53

913

2510

4529

312

8013

2414

40 611

1757

582

1901

2070

2506 61

414

7714

7819

88 731

1334

1342

2524

2084

292

697

812

867

1258 35

178

388

423

0711

2918

4024

0611

0112

6718

5623

08 525

620

866

885

1036

1497

2194

2216 63

392

819

0211

2218

0120

6363

465

465

516

2224

34 805

1621

2123

1277

1396

2464

2511

818

58 2232

078

270

313

0711

8612

9318

3721

8224

74 44 749

901

2225 60

911

6319

5424

92 65

2413

2198

233

974

1104

1300

2013 26

539

893

015

0718

22 8918

9723

0125

7825

276

214

070

511

1725

73 329

2130

1309

1904

1108

1851

1651

1846

1111

1159

1160 1

215

6011

28 545

708

2035

1022

2363

1549

2180

2594 36

310

7920

58 610

1127

1067

1310 35

055

220

14 831

1066

1075

1571

2137

2443

2539

2156

659

2160 95

914

726

865

610

5810

7011

6817

8625

0217

58 451

615

546

1102 20 73

275

715

8220

8020

5021

74 630

387

1674

2140

1530

1961 34 109

694

734

823

990

1170

1213

1655

2119 14

919

428

737

165

715

9616

2616

5824

4010

8019

2219

322

729

039

374

210

6811

9111

9517

0618

1124

0924

29 126

2139

1941

2131

1518

2007

2469 13

576

598

925

40 288

832

1278

1448

2541 13

618

120

733

571

975

690

211

9914

2215

4515

5816

1616

2516

9818

1018

2720

1921

2221

7724

7925

71 942

2532

1323 53 148

262

370

392

636

870

947

953

966

1294

1299

1345

1430

1441

1659

1715

1747

1847

2018

2049

2176

2179

2408

2458

2536 62 10

815

119

936

643

072

812

4512

8312

8614

1716

3017

3017

9119

1419

8421

7322

7323

8624

2524

44 63 687

693

949

1027

1118

1370

1523

1525

1603

2104

2402

2509

2545 20

647

160

567

294

694

895

412

1713

1513

6614

2914

8215

8016

3916

4620

4521

7521

9623

3623

4724

2025

7025

92 276

1782 17

532

780

618

2612

1226

173

515

3922

55 245

629

1183 33

764

557

410

7719

08 772

1534 85

221

1822

6221

2915

3621

02 640

889

116

49 556

1389

1553 71

740

715

0117

5923

13 722

1274

1403

1688

1921

1987

2302

2404 12

213

424

915

6916

19 222

1687

2462

1359

1834

2431

101

485

551

825

917

1165

1524

1600

1631

1644 24

410

5241

616

2917

9918

5718

9323

7923

8525

3011

2024

6825

8811

5815

6115

5125

525

770

687

420

0031

816

43 510

768

1717 68

015

1422

09 298

851

972

1635

1817

2232

1891

2025 46

970

922

40 572

9716

17 128

204

2238 17

920

1522

5343

422

8421

8543

812

72 616

1618

1218

1291

1421

1815

1868

1363

1929

2337

117

1719 16

512

1525

72 482

1951 589

1878 984

2201

1460

1814 48

913

6511

8118

53 421

542

1911 77

914

9818

8820

9314

5915

0411

913

2017

51 343

590

380

1461

686

2407 61

725

5793

513

4821

1434

219

3024

9815

2183

918

6421

13 453

520

1909 547

587

690

359

1407

1055

2294

2344

644

941

1298 957

1083

2239

2518

2107

2108 54

111

15 667

1963

1854

1962 37

525

5121

1725

2645

745

815

4725

6212

6214

4537

671

350

625

19 527

714

280

2086

1222

1781

2039

2040

2224

720

2087

1981

1256

1231

1261

963

1221

1362

2081

2082

1848

1849 7

214

317

704

804

975

1232

1321

1404

1861

2452

2453 30

135

517

31 565

1275 65

092

010

3913

0214

2524

85 625

808

1499

1586

1724

2265

2463 16 88 41

043

169

569

811

7614

4714

8615

7016

5019

1322

2922

4624

1924

72 50 6818

3519

16 52 374

394

809

1492

1889

2046 36

2415

1869

2285 18

543

510

9712

9213

1917

04 685

1563

2095

2126

2161

2162

2378 15

619

3422

04 190

1682

356

1166

1185

1273

1566

1627

1824

2298

2451 29

551

163

871

882

211

3712

0312

0515

2915

9318

4520

56 526

1543 62

374

410

9017

6122

2825

98 232

400

486

576

988

1304

1349

1517

1685

1726

1744

2184

2274

2493 85

915

5610

0523

4620

6221

52 305

306

1235

2558 11

1178

2283 34

977

612

69 824

1225

1074 962

2169

1451

1906

154

529

218

1946 75

219

5613

6014

75 237

271

553

1395 79 397

1675

2183

2356

560

2330

1614

1965 96

1161

2583

1405

2249 93

811

6212

8819

8635

713

12 922

893

1721

2109

2433 93

912

7118

0823

33 174

334

664

1548

1479

1966

1737

461

462

581

1358

2072

2520 84

713

0370

223

9015

0819

47 325

1237

1493

2011

2331

1533

2549 9

202

279

308

328

470

745

786

801

1136

1141

1382

1410

1481

1585

1819

1953

2116

2242

2289

2488

2601 11

410

7822

91 33 314

415

459

478

480

573

858

1026

1135

1167

1257

1282

1568

1597

1693

1718

1746

1777

1872

1995

2153

2380

2508

2576

2577

2585 11

533

133

637

277

184

810

9311

9813

9215

1315

5718

1820

0324

5625

4225

8626

0013

91 23 24 72 116

188

389

450

639

682

1153

1226

1281

1318

1374

1578

1778

2286

2360

2501 56 63

269

989

289

911

8913

2815

0317

7017

7517

7617

9619

7521

4923

8323

8424

61 186

875

2101

1850

2158

2362

2544

1099

1700

2450 19 247

313

368

388

474

479

490

491

495

524

681

725

726

1152

1194

1347

1376

1483

1484

1544

1577

1676

1686

1821

1940

1983

1998

1999

2103

2219

2374

2422 95 12

321

621

741

142

577

011

7715

0216

4117

3218

2321

48 158

1335

1802 21

542

672

179

589

810

4913

3714

0923

42 39 160

161

315

840

883

1836

1991

2591

1184

1805

683

684

733

1140

1602

1632

1716

1971

1972

2516 80

011

06 58 157

178

330

433

509

679

712

785

869

1107

1306

1833

2042

2043

2091

2271

2312

2315

2397

2398

2424

2428

2599 56

210

0010

5721

65 14 846

1426

1890

2065

2164

2275 10

514

58 978

2266

2334 17 20

350

173

086

310

4310

4417

3618

9819

69 138

219

862

1511

1587

1860

1190 69 956

1522

192

211

1035

2168

2277 98 56

916

3718

16 437

467

2377

1564

1060

2566

1809

1105 48 269

466

845

964

1725

1742 93

219

1721

59 955

1069

1695

1696

1705

2064

2167

2276

2292 12

019

180

720

9624

0024

01 409

894

923

643

1453

2565 951

2106

1955 13

2581

504

1109

1881 77

811

3215

8916

06 9010

4119

05 621

1375 67

091

675

371 50

067

392

112

8720

17 382

2484

1388

198

751

1248

1624

1790

2270

2523 59

125

63 727

1880 7

024

9912

6014

1819

9220

30 236

723

1063

1567

1623 58

317

3919

26 933

118

446

1515 44

512

519

3220

69 208

209

383

2181

1131

1931

2403

2564 45

660

310

96 180

934

205

1709

2550 66

625

6117

0811

1418

321

022

1721

4714

3424

4717

5022

02 691

1270

2186

2187

2491 74

714

20 9417

8917

4517

80 724

1505

1607

2254 14

414

516

4717

07 251

977

1720

1219

1336

505

514

969

1369

1416 69

681

188

622

318

4413

0111

1016

4022

9511

6411

6920

33 369

1516 669

1820 76

917

8816

3422

3414

0237

312

3617

14 838

2141 98

619

6711

2415

9217

6622

2123

5524

39 153

497

1259

1576 481

2178

1148

2423

275

2317 447

2041

1765

1230

2304

2395

1346

2324

2376

1749

1748

1855

1912

2121 498

1985 748

1735 65

815

6519

80 531

1838

1928

2146

543

1201 84

411

9712

0714

5012

0010

8418

8221

8810

95 985

550

904

2449

2022

2416

926

2455 98

310

1510

3411

2112

96

020

040

060

080

010

00

Dendrogram of agnes(x = dist.om1, diss = TRUE, method = "ward")

Agglomerative Coefficient = 1dist.om1

Hei

ght

[Figure: Dendrogram of agnes(x = dist.om3, diss = TRUE, method = "ward"). Axis labels: dist.om3, Height. Agglomerative Coefficient = 1.]

Panel labels: OM1, OM3

Page 45: Visualizing and clustering sequence data

Distances between sequences: Clustering

State distribution by age, within cluster

[Figure: state distribution plots, one panel per cluster (Group 1 to Group 6). x-axis: Age (A15 to A45); y-axis: Frequency (0.0 to 1.0).]
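The computation behind such a plot can be sketched as follows. This is a minimal illustration in Python, not the code used for the slides: the function name, the toy sequences, the state codes and the cluster labels are all hypothetical, and all sequences are assumed to have the same length (one state per age).

```python
from collections import Counter, defaultdict


def state_distribution_by_cluster(sequences, clusters):
    """Return {cluster: [{state: share}, ...]} with one dict per age position.

    Assumes every sequence has the same length (one state per age).
    """
    # Group the sequences by their cluster label.
    grouped = defaultdict(list)
    for seq, cl in zip(sequences, clusters):
        grouped[cl].append(seq)

    result = {}
    for cl, seqs in grouped.items():
        per_age = []
        for t in range(len(seqs[0])):
            # Cross-sectional distribution of states at position t.
            counts = Counter(seq[t] for seq in seqs)
            n = sum(counts.values())
            per_age.append({state: c / n for state, c in counts.items()})
        result[cl] = per_age
    return result


# Hypothetical toy data: 4 individuals observed at 4 successive ages,
# states coded 0..2, already assigned to 2 clusters.
sequences = [
    [0, 0, 1, 1],
    [0, 1, 1, 2],
    [0, 0, 0, 1],
    [0, 0, 2, 2],
]
clusters = [1, 1, 2, 2]

dist = state_distribution_by_cluster(sequences, clusters)
print(dist[1][1])  # shares of each state at the second age in cluster 1
```

Each panel of the figure then stacks these shares for one cluster, age by age.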

Page 46: Visualizing and clustering sequence data

Distances between sequences: Clustering

Most frequent sequences by cluster

[Figure: the ten most frequent sequences in each cluster, with their relative frequencies. x-axis: Age (A15 to A43).]

Relative frequencies of the ten most frequent sequences, by cluster:

Group 1: 11.3 %, 9.1 %, 8.4 %, 8.4 %, 8 %, 8 %, 6.9 %, 6.5 %, 6.5 %, 5.1 %
Group 2: 5 %, 4.1 %, 4.1 %, 3.5 %, 3.2 %, 2.9 %, 2.6 %, 2.6 %, 2.3 %, 2.3 %
Group 3: 2.3 %, 1.9 %, 1.8 %, 1.7 %, 1.6 %, 1.6 %, 1.5 %, 1.5 %, 1.3 %, 1.2 %
Group 4: 57.5 %, 3.9 %, 1.6 %, 1.6 %, 0.8 %, 0.8 %, 0.8 %, 0.8 %, 0.8 %, 0.8 %
Group 5: 1.3 %, 0.8 %, 0.8 %, 0.8 %, 0.8 %, 0.8 %, 0.8 %, 0.8 %, 0.8 %, 0.8 %
Group 6: 10.2 %, 8.2 %, 7.8 %, 4.8 %, 4.8 %, 4.8 %, 4.4 %, 3.4 %, 1.9 %, 1.9 %

Page 47: Visualizing and clustering sequence data

Distances between sequences: Clustering

I-plot by cluster

[Figure: I-plot of the sequences by cluster. Axis label: Age; legend entries 0 to 7.]

Page 48: Visualizing and clustering sequence data

Distances between sequences: Clustering

Distribution by birth cohort within each cluster

[Figure: histograms of year of birth, one panel per cluster (Group 1 to Group 6). x-axis: year, 1910 to 1960; y-axis: Frequency.]

Page 49: Visualizing and clustering sequence data

Multidimensional Scaling representation of sequences

Multidimensional Scaling: Principle

Let D be a distance matrix between sequences, computed using OM, LPS, LCS, ... metrics.

Multidimensional Scaling consists in

  finding a set of real-valued variables (f1, f2) such that the Euclidean distances

  $\delta_{ij} = \sqrt{(f_{i1} - f_{j1})^2 + (f_{i2} - f_{j2})^2}$

  best approximate the distances $d_{ij}$ between sequences;

  plotting the points in the (f1, f2) space.
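How well a given set of coordinates approximates D can be made concrete with a small sketch. The slide does not say which fit criterion is used, so the raw (least-squares) stress below is an assumption, as are the function names and the toy matrix; classical MDS as implemented in common statistical software obtains the coordinates by an eigen-decomposition rather than by minimizing this quantity directly.

```python
import math


def euclidean_2d(p, q):
    """Distance delta_ij between two points of the (f1, f2) plane."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)


def raw_stress(D, points):
    """Sum over pairs i < j of (d_ij - delta_ij)^2.

    D is a symmetric distance matrix (list of lists) and points[i] holds
    the 2-D coordinates (f_i1, f_i2) assigned to sequence i.
    Raw stress is an assumed criterion, not one stated on the slide.
    """
    total = 0.0
    n = len(D)
    for i in range(n):
        for j in range(i + 1, n):
            total += (D[i][j] - euclidean_2d(points[i], points[j])) ** 2
    return total


# Hypothetical distances between three sequences (a 3-4-5 triangle).
D = [
    [0, 3, 4],
    [3, 0, 5],
    [4, 5, 0],
]
exact = [(0, 0), (3, 0), (0, 4)]  # reproduces D exactly
poor = [(0, 0), (1, 0), (0, 1)]   # distorts every distance

print(raw_stress(D, exact))
print(raw_stress(D, poor))
```

A configuration with lower stress gives a more faithful two-dimensional picture of the distances between sequences.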

Page 50: Visualizing and clustering sequence data

Multidimensional Scaling representation of sequences

Multidimensional Scaling

[Figure: scatter plot of the sequences on the first two MDS coordinates. x-axis: dist.om.mds$points[,1]; y-axis: dist.om.mds$points[,2]; legend: Group 1 to Group 6.]

Page 51: Mining Frequent Episodes

Mining Frequent Episodes

What can we expect from frequent episode mining?
- GSP (Srikant and Agrawal, 1996)
- MINEPI, WINEPI (Mannila et al., 1997)
- TCG, TAG (Bettini et al., 1996)
- SPADE (Zaki, 2001)

Are there specific issues when applying these methods in social sciences?


What Is It About?

Frequent episodes: what are they?

Episode: collection of events occurring frequently together.

Mining typical episodes:
- Specialized case of mining frequent itemsets.
- Time dimension ⇒ partially ordered events.

More complex than unordered itemsets. The user must
- specify time constraints (and episode structure constraints);
- select a counting method.


Episode structure constraints

For people who leave home within 2 years of age 17, what are the typical events occurring until they get married and have a first child?

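[Figure: graph of the episode structure. Node labels: LH,17; ??; M; C1. Window labels: w = 2, w = 1. Edge labels: (0, 4), (0, 3), (0, 1, 10). Annotations: elastic, parallel, event constraints, node constraint, edge constraints.]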


Counting methods (Joshi et al., 2001)

Time axis: 20 21 22 23 24
Events: U UUC C C

Searching (U,C) with min gap = 1, max gap = 2, win size = 2:

- individuals with episode: COBJ = 1
- windows with episode: CWIN = 3
- minimal windows with episode: CminWIN = 2
- distinct occurrences: CDIS_o = 5
- distinct occurrences without overlap: CDIS = 3
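The five counts can be reproduced with a short sketch. Two things here are assumptions, not taken from the slide: the event placement (U at times 20, 21, 22 and C at times 22, 23, 24, chosen because it yields exactly the counts listed above) and the operational definition given to each counting method, which is one plausible reading of the labels rather than the formal definitions of Joshi et al. (2001). The function names are illustrative.

```python
def occurrences(events, first, second, min_gap, max_gap):
    """All (t_first, t_second) pairs whose time gap lies in [min_gap, max_gap]."""
    firsts = sorted(t for t, e in events if e == first)
    seconds = sorted(t for t, e in events if e == second)
    return [(a, b) for a in firsts for b in seconds if min_gap <= b - a <= max_gap]

def count_episode(events, first, second, min_gap, max_gap, win_size):
    """Counts for one individual's (non-empty) event list under five methods."""
    occ = occurrences(events, first, second, min_gap, max_gap)
    times = [t for t, _ in events]
    # COBJ: 1 if the individual shows the episode at least once.
    cobj = 1 if occ else 0
    # CWIN: sliding windows [s, s + win_size] containing an occurrence.
    cwin = sum(
        any(s <= a and b <= s + win_size for a, b in occ)
        for s in range(min(times), max(times) - win_size + 1)
    )
    # CminWIN: occurrence spans that contain no other occurrence span.
    spans = {(a, b) for a, b in occ if b - a <= win_size}
    cminwin = sum(
        not any((c, d) != (a, b) and a <= c and d <= b for c, d in spans)
        for a, b in spans
    )
    # CDIS_o: distinct occurrences, events may be shared between them.
    cdis_o = len(occ)
    # CDIS: occurrences sharing no event (greedy earliest-match pairing,
    # sufficient for this example).
    used, cdis = set(), 0
    for a in sorted({a for a, _ in occ}):
        for b in sorted(b for x, b in occ if x == a):
            if b not in used:
                used.add(b)
                cdis += 1
                break
    return {"COBJ": cobj, "CWIN": cwin, "CminWIN": cminwin,
            "CDIS_o": cdis_o, "CDIS": cdis}

# Assumed placement of the events on the 20 to 24 time axis.
events = [(20, "U"), (21, "U"), (22, "U"), (22, "C"), (23, "C"), (24, "C")]
counts = count_episode(events, "U", "C", min_gap=1, max_gap=2, win_size=2)
# counts == {"COBJ": 1, "CWIN": 3, "CminWIN": 2, "CDIS_o": 5, "CDIS": 3}
```

The same sequence therefore supports the episode (U,C) once, twice, three times or five times depending on the method, which is why the choice has to be made explicitly.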


Example: Counting Alternate Episode Structures

Example: Counting alternate structures (COBJ, no max gap)

[Figure: bar chart of the share of individuals (y-axis: 0% to 30%) for each ordering of two events. Categories: Child < Marriage, Marriage < Child, Child = Marriage; Child < Job, Job < Child, Child = Job; Child < Educ end, Educ end < Child, Child = Educ end; Marriage < Job, Job < Marriage, Marriage = Job; Marriage < Educ end, Educ end < Marriage, Marriage = Educ end; Job < Educ end, Educ end < Job, Job = Educ end.]

Switzerland, SHP 2002 biographical survey (n = 5560).
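A minimal sketch of this kind of count, assuming each individual is described by the year of first occurrence of each event and that COBJ counts an individual once per ordering. The four records below are invented for illustration and are not SHP data; `order_shares` is a hypothetical helper.

```python
from collections import Counter
from itertools import combinations

def order_shares(individuals, events):
    """Share of individuals showing each ordering (A < B, B < A, A = B)."""
    counts = Counter()
    for dates in individuals:
        for a, b in combinations(events, 2):
            if a in dates and b in dates:
                if dates[a] < dates[b]:
                    counts[f"{a} < {b}"] += 1
                elif dates[a] > dates[b]:
                    counts[f"{b} < {a}"] += 1
                else:
                    counts[f"{a} = {b}"] += 1
    n = len(individuals)
    return {pattern: c / n for pattern, c in counts.items()}

# Hypothetical records: year of first occurrence of each event.
individuals = [
    {"Child": 1990, "Marriage": 1988, "Job": 1985},
    {"Child": 1992, "Marriage": 1992},
    {"Marriage": 1980, "Job": 1982},
    {"Job": 1990},
]
shares = order_shares(individuals, ["Child", "Marriage", "Job"])
```

Because no maximum gap is imposed, any pair of dated events contributes, however far apart the two events are.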


Issues Regarding Episode Rules

Rules between episodes

Social scientists like causal explanations. Empirically assessed rules are valuable material in that respect.

Little attention is paid to this aspect in the literature on frequent subsequences.

Mined episodes are already structured: if (U,C) is a frequent episode, then we know that C often follows U. Deriving association rules from frequent ordered patterns is similar to what is done with unordered itemsets.

Rule relevance criteria: confidence, surprisingness, implication strength, ... Their value depends on the selected counting method.
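As an illustration of one such criterion, the sketch below computes the support and confidence of a rule U ⇒ C under an individual-based (COBJ-like) count. The five sequences are invented, and the definition used (C occurs strictly after some U, with no gap constraint) is an assumption made for the example; other counting methods would give other values.

```python
def has_episode(seq, first, second):
    """True if some `second` event occurs strictly after some `first` event."""
    firsts = [t for t, e in seq if e == first]
    seconds = [t for t, e in seq if e == second]
    return bool(firsts) and bool(seconds) and min(firsts) < max(seconds)

def rule_stats(sequences, first, second):
    """Support and confidence of the rule first => second, counted per individual."""
    n_first = sum(any(e == first for _, e in s) for s in sequences)
    n_both = sum(has_episode(s, first, second) for s in sequences)
    support = n_both / len(sequences)
    confidence = n_both / n_first if n_first else 0.0
    return support, confidence

# Hypothetical individuals, each a list of (time, event) pairs.
sequences = [
    [(20, "U"), (22, "C")],
    [(21, "U")],
    [(19, "C"), (23, "U")],
    [(20, "C")],
    [(18, "U"), (19, "U"), (25, "C")],
]
support, confidence = rule_stats(sequences, "U", "C")
# support == 0.4, confidence == 0.5
```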


Issues with episode rules in social sciences

Parallel life courses:
- Family events and professional life course.
- Life courses of each partner of a couple.

Mining associations between frequent episodes of a sequence and those of its parallel sequence:
- Mine frequent episodes from the mix of the 2 sequences, then restrict the search for rules to candidates whose premise and consequence belong to different sequences.
- Mine frequent episodes from each sequence, then search for rules among candidates obtained by combining frequent episodes from each sequence.

Accounting for multi-level effects when validating rules: is a rule relevant among groups, or within groups?


Summary

Data mining approaches (survival trees, clustering sequences, frequent episodes) have a promising future in life course analysis.

They complement classical statistical outcomes with new insights.

Their use within social sciences raises specific issues:
- Accounting for multi-level effects when growing survival trees or mining association rules.
- Handling time-varying predictors in survival trees.
- Selecting relevant counting methods (event dependent)?
- Suitable criteria for measuring association strength between frequent episodes.
- ...


Our TraMineR R-package

Let me finish with an ad ...

TraMineR, a free life trajectory mining tool for the free open source R statistical environment, downloadable from http://mephisto.unige.ch/biomining and soon from the CRAN.


Thank You!


Appendix

Zoomed tree

Divorce, Switzerland, Differences in KM Survival Curves I

[Figure: zoomed view of the survival tree.]


Sub-sequences

Clusters and subsequences

[Figure: one bar chart per cluster showing subsequence frequencies on a 0.0 to 1.0 scale. Subsequence labels by panel:
Group 1: m1, e1, m_e10, m_e5, m_e1, em1, s1, c1
Group 2: m1, d1, d_m10, dm1, c1, d_m5, c_m10, c_m5
Group 3: m1, d1, e1, m_e10, d_e10, m_e5, dm1, d_e5
Group 4: m1, d1, e1, dm1, m_e10, d_e10, m_e5, c1
Group 5: m1, s1, d1, e1, m_s10, dm1, d_e10, d_m10
Group 6: d1, c1, cd1, e1, d_c10, m1, d_e10, ce1]


Biofam data: Legend

[Figure: state distribution by age for Group 1; x-axis: age (A15 to A45), y-axis: frequency (0.0 to 1.0). Legend: no event; left home; married with/without child; left home, married; with child; left home, with child; left home, married, child; divorced.]


For Further Reading


Abbott, A. and J. Forrest (1986). Optimal matching methods for historical sequences. Journal of Interdisciplinary History 16, 471–494.

Bettini, C., X. S. Wang, and S. Jajodia (1996). Testing complex temporal relationships involving multiple granularities and its application to data mining (extended abstract). In PODS ’96: Proceedings of the fifteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, New York, pp. 68–78. ACM Press.


Billari, F. C. (2005). Life course analysis: Two (complementary) cultures? Some reflections with examples from the analysis of transition to adulthood. In P. Ghisletta, J.-M. Le Goff, R. Levy, D. Spini, and E. Widmer (Eds.), Towards an Interdisciplinary Perspective on the Life Course, Advancements in Life Course Research, Vol. 10, pp. 267–288. Amsterdam: Elsevier.

Blossfeld, H.-P. and G. Rohwer (2002). Techniques of Event History Modeling, New Approaches to Causal Analysis (2nd ed.). Mahwah NJ: Lawrence Erlbaum.

Elzinga, C. H. (2008). Sequence analysis: Metric representations of categorical time series. Sociological Methods and Research. Forthcoming.


Elzinga, C. H. and A. C. Liefbroer (2007). De-standardization of family-life trajectories of young adults: A cross-national comparison using sequence analysis. European Journal of Population 23, 225–250.

Huang, X., S. Chen, and S. Soong (1998). Piecewise exponential survival trees with time-dependent covariates. Biometrics 54, 1420–1433.

Joshi, M. V., G. Karypis, and V. Kumar (2001). A universal formulation of sequential patterns. In Proceedings of the KDD’2001 workshop on Temporal Data Mining, San Francisco, August 2001.

Leblanc, M. and J. Crowley (1992). Relative risk trees for censored survival data. Biometrics 48, 411–425.


Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10, 707–710.

Mannila, H., H. Toivonen, and A. I. Verkamo (1997). Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery 1(3), 259–289.

Needleman, S. and C. Wunsch (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48, 443–453.

Segal, M. R. (1988). Regression trees for censored data. Biometrics 44, 35–47.


Segal, M. R. (1992). Tree-structured methods for longitudinal data. Journal of the American Statistical Association 87(418), 407–418.

Srikant, R. and R. Agrawal (1996). Mining sequential patterns: Generalizations and performance improvements. In P. M. G. Apers, M. Bouzeghoub, and G. Gardarin (Eds.), Advances in Database Technologies – 5th International Conference on Extending Database Technology (EDBT’96), Avignon, France, Volume 1057, pp. 3–17. Springer-Verlag.

Zaki, M. J. (2001). SPADE: An efficient algorithm for mining frequent sequences. Machine Learning 42(1/2), 31–60.
