pattern mining and machine learning for demographic sequences

21
Pa#ern Mining and Machine Learning for Demographic Sequences Dmitry I. Ignatov 1 , Ekaterina Mitrofanova 2 , Anna Muratova 1 , and Danil Gizdatulin 1 1 Computer Science Faculty & 2 Institute of Demography National Research University Higher School of Economics, Moscow, Russia KESW 2015, Moscow

Upload: dmitrii-ignatov

Post on 06-Apr-2017

442 views

Category:

Science


0 download

TRANSCRIPT

Page 1: Pattern Mining and Machine Learning for Demographic Sequences

Pa#ernMiningandMachineLearningforDemographicSequences

DmitryI.Ignatov1,EkaterinaMitrofanova2,AnnaMuratova1,andDanilGizdatulin1

1Computer Science Faculty & 2Institute of Demography

National Research University Higher School of Economics, Moscow, Russia

KESW 2015, Moscow

Page 2: Pattern Mining and Machine Learning for Demographic Sequences

Outline

KESW 2015, Moscow

• ProblemStatement• DemographicData• MethodsandResults

Ø DecisionTrees§ FirstlifecourseeventpredicGon§ NextlifecourseeventpredicGon§ Rulesfor‘gender’aLribute

Ø FrequentPaLerns§ Frequentclosedsequences§ Emergingsequencesformenandwomen

• ConclusionandFutureWork

2

Page 3: Pattern Mining and Machine Learning for Demographic Sequences

•  Analysisofdemographicdatabymeansofmachinelearninganddatamining[Blockeel et al, 2001; Billari et al., 2006]•  DemographersquesGons:

§ Whatarethedifferencesbetweendemographic behaviour of menandwomen?

§ Whatarethetypical(frequent)eventsequencesthatappearinlifecoursetrajectories?

§ WhatarethefirstandthenextstarGngeventsaparGcularpersonmayhave?

§ Andmanymore…

• SimpleDM&MLtoolspreferablywithGUI:§ Orange http://orange.biolab.si/§ SPMFhLp://www.philippe-fournier-viger.com/spmf/§ Ad-hocscriptsinPython

ProblemStatement

3KESW 2015, Moscow

Page 4: Pattern Mining and Machine Learning for Demographic Sequences

Thesurvey«Parentsandchildren,menandwomeninfamilyandsociety»www.socpol.ru/gender/RIDMIZ.shtml•  4857peopleincluding1545men3312women•  11generaGonsaresplitin5yearsintervalsfrom1930Gll1984

DemographicData

4KESW 2015, Moscow

Page 5: Pattern Mining and Machine Learning for Demographic Sequences

Comparisonofclassifiersforthefirsteventpredic?on

5KESW 2015, Moscow

Page 6: Pattern Mining and Machine Learning for Demographic Sequences

WhyDecisionTrees?

5KESW 2015, Moscow

•  Notablack-boxapproach•  Simpleif-thenrulebasedrepresentaGon•  However…

Page 7: Pattern Mining and Machine Learning for Demographic Sequences

Thefirsteventpredic?onbyDecisionTrees

ThetreeisbuiltinOrangeThefirsteventmainlydependsoneducaGontype

6KESW 2015, Moscow

Page 8: Pattern Mining and Machine Learning for Demographic Sequences

FeatureEncodingSchemes

General attributes: •  Gender •  Generation •  Type of education •  Locality •  Religious views •  Religious activity

Events encoding: •  Binary encoding (0 - absence, 1 - presence) •  Time-base encoding (age in months) •  Pairs of events ordered by precedence relations (<, >, =, n/a)

7KESW 2015, Moscow

The anonymised datasets for each experiment are freely available in CSV files: http://bit.ly/KESW2015seqdem

Page 9: Pattern Mining and Machine Learning for Demographic Sequences

Influenceoffeatureencodingonthenexteventpredic?on

Encodingtype

Classifica?onAccuracy

Imbalanceddata Balanceddata

Binary(BE) 0.8498 0.8780(*)

Time-based(TE) 0.3516 0.3591

Pairsofevents(PE) 0.7076 0.7013

BE+TE 0.7293(~) 0.7459

BE+PE 0.8407 0.8438

TE+PE 0.5465 0.4959

BE+TE+PE 0.7295(~) 0.7503

(*)meansthebestresult,(~)meansalmostequivalentresults

8KESW 2015, Moscow

Page 10: Pattern Mining and Machine Learning for Demographic Sequences

Theconfusionmatrixforthenexteventpredic?on

8KESW 2015, Moscow

– binary encoding for balanced dataset – “br” means “break up” and “div” means “divorce” events

Page 11: Pattern Mining and Machine Learning for Demographic Sequences

Examplesofrulesforthenexteventpredic?on

9KESW 2015, Moscow

Page 12: Pattern Mining and Machine Learning for Demographic Sequences

Thenexteventpredic?onbasedongenerala#ributes

A decision tree diagram built only for general attributes. The next event is influenced by the type of education.

10KESW 2015, Moscow

Page 13: Pattern Mining and Machine Learning for Demographic Sequences

Classifica?onaccuracyofdifferentencodingschemesforgenderpredic?on

(∼) means very close results, and (*) means the best result in the column.

11KESW 2015, Moscow

Page 14: Pattern Mining and Machine Learning for Demographic Sequences

Men:Women:

Examplesofrulesforpredic?onoftargeta#ribute«gender»

Antecedent Confidence

Firstjobaper19.9years,marriagein20.6-22.4,educaGonbefore20.7,breakupaper27.6,divorcebefore30.5

65.9%

Firstjobaper19.9,marriagein20.6-22.4,break-upbefore27.6 61.1%

Firstjobbefore17.2,marriagein20.6-22.4,break-upbefore27.6 61.3%

Firstjobaper21,marriageaper29.5 70.2%

Antecedent Confidence

Firstjobin18.2-19.9,marriagein20.6-22.4,break-upaper27.6,divorceaper30.5

71.9%

Firstjobin18.2-19.9,marriagein20.6-22.4,break-upaper27.6,divorcebefore30.5

70.9%

Firstjobin17.2-19.9,marriagein20.6-22.4,break-upbefore27.6 62.8%

Firstjobin17.7-21,marriageaper29.5 62.8%

12KESW 2015, Moscow

Page 15: Pattern Mining and Machine Learning for Demographic Sequences

•  Asequenceisanorderedsetofelements(events)•  ThesupportofsequencerindatabaseisthenumberofsequencesinDthatcontainr

•  Therela?vesupportofristhefracGonofsequencesthatcontainr•  AsequencerisfrequentinadatabaseDif,whereminsupisaminimalsupportthreshold•  Afrequentclosedsequenceis a sequence such that there is no any its

supersequence with the same support

SequenceMining

13KESW 2015, Moscow

[Zaki & Meira, 2014; Agrawal & Srikant, 1995]

sup(r) = #{si ∈ D | r is subsequence of si}

Page 16: Pattern Mining and Machine Learning for Demographic Sequences

•  AnemergingsequenceisasequencesuchthatitssupportgrowthsdrasGcallyinthetransiGonfromonedatabasetoanotherone

•  ThegrowthrateofasequencesinthetransiGonfromdatabasesD1toD2: (1) •  Thecontribu?onofasequencetoaparGcularclass:,(2)whereeisasubsequenceofsTheideaisbasedon[Mill,1843;Finn,1983;Dong&Li,1999]

EmergingSequences

14KESW 2015, Moscow

Page 17: Pattern Mining and Machine Learning for Demographic Sequences

•  An SPMFoutputforfrequentclosedsequencesmining

Frequentclosedsequences

15KESW 2015, Moscow

Page 18: Pattern Mining and Machine Learning for Demographic Sequences

•  ImplementedinPython2.7Input:twodatasetsformenandwomenwithageindicaGonfordemographicevents1.  TransformaGonofeventstosequences(80:20test-to-trainingraGo)2.  PassingthetestsettoSPMFandfindingfrequentsequences.3.  Findingemergingsequences(classificaGonrules)andtheircontribuGonsto

classes.4.  DefiningtheclassofarulebyitscontribuGonandthencontribuGon

normalisaGon.5.  AccuracycalculaGonforthetestset.Output:fileswithclassificaGonrulesformenandwomen

• minsup = 0.005; for each class we use 3312 sequences after oversampling.

•  The best classification accuracy (0.936) has been reached at minimal growth rate 1.0, with 577 rules for men and 1164 for women, and 3 non-covered objects.

EmergingSequenceMining

16KESW 2015, Moscow

Page 19: Pattern Mining and Machine Learning for Demographic Sequences

•  The list of emerging sequences for men(with their class contribution):•  Thelistofemergingsequencesforwomen:

Emergingsequences

17KESW 2015, Moscow

Page 20: Pattern Mining and Machine Learning for Demographic Sequences

• We have shown that decision trees and sequence mining could become the tools of choice for demographers.

• Machine learning and data mining tools can help in finding regularities

and dependencies that are hidden in voluminous demographic datasets. However, these methods need to be properly tuned and adapted to the domain needs.

•  In the near future we are planning:

•  to implement emerging prefix-string mining and learning to deal with sequences without gaps

•  to use different rule-based techniques that are able to cope with unbalanced multi-class data (Cerf et al., 2013)

•  to apply Pattern Structures to demographic sequence mining (Kuznetsov et al., 2013).

•  andmanymoreideasandtricks…

ConclusionandFutureWork

18KESW 2015, Moscow

Page 21: Pattern Mining and Machine Learning for Demographic Sequences

Thankyou.Ques?ons?

KESW 2015, Moscow