TRANSCRIPT
Neural Hawkes Process Memory
Michael C. Mozer, University of Colorado, Boulder
Robert V. Lindsey, Imagen Technologies
Denis Kazakov, University of Colorado, Boulder
Temporally Distributed Study and Memory Retention
• Massed study
• Spaced study
• Memory decays more slowly with spaced study
[Figure: timelines contrasting massed and spaced study episodes, with retention probed at a later time]
Time Scale and Temporal Distribution of Behavior
• Critical for modeling and predicting many human activities:
§ Retrieving from memory
§ Purchasing products
§ Selecting music
§ Making restaurant reservations
§ Posting on web forums and social media
§ Gaming online
§ Engaging in criminal activities
§ Sending email and texts
[Figure: event stream x1, x2, x3, x4 occurring at times t1, t2, t3, t4]
Recent Research Involving Temporally Situated Events
• Discretize time and use tensor factorization or RNNs
§ e.g., X. Wang et al. (2016), Y. Song et al. (2016), Neil et al. (2016)
• Hidden semi-Markov models and survival analysis
§ Kapoor et al. (2014, 2015)
• Include time tags as RNN inputs and treat as a sequence-processing task
§ Du et al. (2016)
• Temporal point processes
§ Du et al. (2015), Y. Wang et al. (2015, 2016)
• Our approach
§ incorporate time into the RNN dynamics
Temporal Point Processes
• Produces a sequence of event times $\mathcal{T} = \{t_i\}$
• Characterized by a conditional intensity function $h(t)$:
$h(t) = \Pr(\text{event in interval } \mathrm{d}t \mid \mathcal{T}) / \mathrm{d}t$
• E.g., Poisson process
§ constant intensity function: $h(t) = \mu$
• E.g., inhomogeneous Poisson process
§ time-varying intensity $h(t)$
Hawkes Process
• Intensity depends on event history
§ Self-excitatory
§ Decaying
• Intensity decays over time
§ Used to model earthquakes, financial transactions, crimes
§ Decay rate determines time scale of persistence
[Figure: an event stream and the resulting intensity h(t) over time]
$h(t) = \mu + \alpha \sum_{t_j < t} e^{-\gamma (t - t_j)}$
Hawkes Process
• Conditional intensity function
with $\mathcal{T} \equiv \{t_1, \ldots, t_j, \ldots\}$ the times of past events
• Incremental formulation with discrete updates
$h_k = \mu + e^{-\gamma \Delta t_k} (h_{k-1} - \mu) + \alpha x_k$
$h_0 = \mu$ and $t_0 = 0$
$\Delta t_k \equiv t_k - t_{k-1}$, and $x_k = 1$ if an event occurs, $0$ if no event
[Diagram: unit with inputs $\Delta t$ and $x$, state $h$, and parameters $\mu$, $\gamma$, $\alpha$]
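The incremental update above is straightforward to implement. A minimal sketch in Python; the parameter values for mu, alpha, and gamma are illustrative assumptions, not values from the slides:

```python
import math

def hawkes_update(h_prev, dt, x, mu=0.1, alpha=0.5, gamma=0.25):
    """One discrete Hawkes update:
    h_k = mu + exp(-gamma * dt_k) * (h_{k-1} - mu) + alpha * x_k,
    where x is 1 if an event occurred at this step and 0 otherwise."""
    return mu + math.exp(-gamma * dt) * (h_prev - mu) + alpha * x

# Run over a toy stream of (time since last step, event indicator) pairs.
mu = 0.1
h = mu                              # h_0 = mu
for dt, x in [(1.0, 1), (2.0, 0), (0.5, 1)]:
    h = hawkes_update(h, dt, x, mu=mu)
```

Between events the intensity relaxes exponentially toward the baseline mu; each event bumps it up by alpha.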
Prediction
Observe a time series and predict what comes next.
Given model parameters, compute intensity from observations:
$h_k = \mu + e^{-\gamma \Delta t_k} (h_{k-1} - \mu) + \alpha x_k$
Given intensity, compute event likelihood in a $\Delta t$ window:
$\Pr(t_k \le t_{k-1} + \Delta t, x_k = 1 \mid t_1, \ldots, t_{k-1}) = 1 - e^{-(h_{k-1} - \mu)(1 - e^{-\gamma \Delta t})/\gamma \, - \, \mu \Delta t} \equiv Z_k(\Delta t)$
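The window probability follows from integrating the intensity over the window. A sketch under the same illustrative parameters (the function name is mine, not from the slides):

```python
import math

def window_event_prob(h_prev, dt, mu=0.1, gamma=0.25):
    """Z_k(dt): probability of at least one event in the next dt window.
    The integrated intensity over the window is
    (h_prev - mu) * (1 - exp(-gamma*dt)) / gamma + mu*dt."""
    integrated = (h_prev - mu) * (1.0 - math.exp(-gamma * dt)) / gamma + mu * dt
    return 1.0 - math.exp(-integrated)
```

The probability grows monotonically with the window size and with the current intensity, and vanishes as the window shrinks to zero.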
[Figure: a generated event stream and candidate intensity functions h(t) at time scales 2, 8, and 32; which time scale best explains the observations?]
Key Premise
• The time scale for an event type may vary from sequence to sequence
• Therefore, we want to infer the time-scale parameter $\gamma$ appropriate for each event and for each sequence.
Bayesian Inference of Time Scale
• Treat $\gamma$ as a discrete random variable to be inferred from observations.
§ $\gamma \in \{\gamma_1, \gamma_2, \ldots, \gamma_S\}$, where $S$ is the number of candidate scales
§ log-linear scale to cover a large dynamic range
• Specify a prior on $\gamma$
§ $\Pr(\gamma = \gamma_s)$
• Given the next event $x_k$ (present or absent) at $t_k$, perform a Bayesian update:
§ $\Pr(\gamma_s \mid x_{1:k}, t_{1:k}) \sim p(x_k, t_k \mid x_{1:k-1}, t_{1:k-1}, \gamma_s) \Pr(\gamma_s \mid x_{1:k-1}, t_{1:k-1})$
with likelihood term $[\mu + e^{-\gamma_s \Delta t_k}(h_{k-1,s} - \mu)]^{x_k} Z_{ks}(\Delta t_k)$
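One step of this per-scale update can be sketched as follows. This uses the standard point-process likelihood (intensity at the event time multiplied by the survival term exp(−integrated intensity)), which may differ in detail from the slides' exact formulation; the function name, parameter values, and candidate scales are illustrative:

```python
import math

def scale_posterior_step(post, h, dt, x, mu=0.05, alpha=0.5,
                         gammas=(1/4, 1/64, 1/1024)):
    """One Bayesian update of Pr(gamma_s | x_{1:k}, t_{1:k}).
    post: current posterior over candidate scales; h: per-scale intensities."""
    new_post, new_h = [], []
    for p, h_s, g in zip(post, h, gammas):
        decayed = mu + math.exp(-g * dt) * (h_s - mu)   # intensity just before t_k
        integrated = (h_s - mu) * (1 - math.exp(-g * dt)) / g + mu * dt
        lik = (decayed if x else 1.0) * math.exp(-integrated)
        new_post.append(p * lik)
        new_h.append(decayed + alpha * x)               # h_{k,s} after the event
    norm = sum(new_post)
    return [q / norm for q in new_post], new_h

# Uniform prior over three candidate scales; feed a toy event stream.
# Two closely spaced events followed by a long gap should shift posterior
# mass toward the fastest-decaying scale.
post, h = [1/3] * 3, [0.05] * 3
for dt, x in [(1.0, 1), (1.0, 1), (50.0, 0)]:
    post, h = scale_posterior_step(post, h, dt, x)
```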
[Figure: input sequences at short, medium, and long time scales, with the intensity inferred at each candidate scale]
[Figure: input sequences generated with $\gamma = 1/4$, $1/64$, and $1/1024$; induction of time scale over candidate scales 1 through 4096; recovered (marginal) log-intensity function compared against the intensity function from the generative process]
Effect of Spacing
[Figure: input sequences with different spacing, the induced time scale over candidate scales 1 through 1024, and the resulting log intensity over time]
Human data: Cepeda, Vul, Rohrer, Wixted, & Pashler (2008)
Neural net model related to HPM: Mozer, Pashler, Cepeda, Lindsey, & Vul (2009)
[Figure: % recall vs. intersession interval (days) for 7-, 35-, 70-, and 350-day retention intervals]
Two Alternative Characterizations of Environment
• All events are observed
§ e.g., students practicing retrieval of foreign-language vocabulary
§ Likelihood function should reflect the absence of events between inputs
$\Pr(\gamma_s \mid x_{1:k}, t_{1:k}) \sim p(x_k, t_k \mid x_{1:k-1}, t_{1:k-1}, \gamma_s) \Pr(\gamma_s \mid x_{1:k-1}, t_{1:k-1})$
with likelihood term $[\mu + e^{-\gamma_s \Delta t_k}(h_{k-1,s} - \mu)]^{x_k} Z_{ks}(\Delta t_k)$
• Some events are unobserved
§ e.g., shoppers making purchases on amazon.com (but purchases also made on target.com and jet.com)
§ Likelihood function should marginalize over unobserved events and reflect the expected intensity
$\Pr(\gamma_s \mid x_{1:k}, t_{1:k}) \sim p(x_k, t_k \mid x_{1:k-1}, t_{1:k-1}, \gamma_s) \Pr(\gamma_s \mid x_{1:k-1}, t_{1:k-1})$
with likelihood term $\left[\frac{\mu}{1 - \alpha/\gamma_s} + \left(h_{k-1} - \frac{\mu}{1 - \alpha/\gamma_s}\right) e^{-\gamma_s (1 - \alpha/\gamma_s) \Delta t_k}\right]^{x_k}$
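The expected-intensity recursion in the unobserved-events case has a closed form: marginalizing over unobserved self-excitation effectively slows the decay to rate gamma − alpha and raises the baseline to mu / (1 − alpha/gamma). A sketch, with illustrative parameter values; note it assumes alpha < gamma (a subcritical process), otherwise the steady state diverges:

```python
import math

def expected_intensity(h_prev, dt, mu=0.05, alpha=0.1, gamma=0.25):
    """E[h after dt] with intervening events unobserved and marginalized out:
    relaxes toward mu / (1 - alpha/gamma) at effective rate
    gamma * (1 - alpha/gamma) = gamma - alpha. Assumes alpha < gamma."""
    steady = mu / (1.0 - alpha / gamma)
    return steady + (h_prev - steady) * math.exp(-(gamma - alpha) * dt)
```

With alpha = 0 this reduces to the fully observed decay toward mu.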
Hawkes Process Memory (HPM) Unit
Does LSTM share each property?
• Holds a history of past inputs (events) ✔
• Memory persistence depends on input history ✘
• No explicit 'input' or 'forget' gate ✘
• Captures continuous-time dynamics ✘
[Diagram: HPM unit with inputs $\Delta t$ and $x$, state $h$, and parameters $\mu$, $\gamma$, $\alpha$]
Embedding HPM in an RNN
• Because event representations are learned, input $x$ denotes $\Pr(\text{event})$ rather than a truth value
§ Activation dynamics are a mean-field approximation to HP inference, marginalizing over belief about event occurrence:
$\Pr(\gamma_s \mid x_{1:k}, t_{1:k}) \sim \sum_{x_k=0}^{1} p(x_k, t_k \mid x_{1:k-1}, t_{1:k-1}, \gamma_s) \Pr(\gamma_s \mid x_{1:k-1}, t_{1:k-1})$
• Output (to the next layer and through recurrent connections) must be bounded.
§ Quasi-hyperbolic function: $h(t + \Delta \tilde{t}) \, / \, (h(t + \Delta \tilde{t}) + \nu)$
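The bounded output is a one-liner; the value of nu below is an illustrative assumption:

```python
def bounded_output(h, nu=1.0):
    """Quasi-hyperbolic squashing h / (h + nu): maps a nonnegative
    intensity to [0, 1), with output 0.5 when h == nu."""
    return h / (h + nu)
```

Unlike tanh or a logistic sigmoid, this function approaches 1 only hyperbolically as h grows, so large intensities remain discriminable.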
HPM RNN
[Diagram: unrolled chain of HPM units; inputs are the current event (..., A, B, C) and the time $\Delta t$ since the last event; outputs are the predicted event and the time $\Delta \hat{t}$ to the predicted event]
Word Learning Study (Kang et al., 2014)
• Data
§ 32 human subjects
§ 60 Japanese-English word associations
§ each association tested 4-13 times over intervals ranging from minutes to several months
§ 655 trials per sequence on average
• Task
§ Given the study history up to trial t, predict accuracy (retrievable from memory or not) for the next trial.
[Figure: recall proportion by day under an expanding-interval schedule (1, 3, 9, 28, 84 days) and an equal-interval schedule (1, 10, 19, 28, 84 days), with tests T1-T3 at each session]
Word Learning Results
• Majority class (correct): 62.7%
• Traditional Hawkes process: 64.6%
• Next = previous: 71.3%
• LSTM: 77.9%
• HPM: 78.3%
• LSTM with $\Delta t$ inputs: 78.3%
Reddit Postings
[Figure: example users' posting sequences over time]
• Data
§ 30,733 users
§ 32-1024 posts per user
§ 1-50 forums
§ 15,000 users for training (and validation), remainder for testing
• Task
§ Predict the forum to which a user will post next, given the time of the posting
Reddit Results
• Next = previous: 39.7%
• Hawkes process: 44.8%
• HPM: 53.6%
• LSTM (with $\Delta t$ inputs): 53.5%
• LSTM, no input or forget gate: 51.1%
• Human behavior and preferences have dynamics that operate across a range of time scales.
• It seems like a model based on these dynamics should be a good predictor of human behavior.
• ...and hopefully also a good predictor of other multiscale time series.
Key Idea of Hawkes Process Memory
• Represent memory of sequences at multiple time scales simultaneously
• Output the 'appropriate' time scale based on input history
[Figure: event sequences and the intensity functions maintained at multiple time scales (2, 8, 32)]
State of the Research
• LSTM is pretty darn robust
• Some evidence that HPM and LSTM are picking up on distinct information in the sequences
§ Possibility that mixing unit types will obtain benefits
§ Can also consider other types of units premised on alternative temporal point processes, e.g., self-correcting processes
• Potential for using event-based models even for traditional sequence-processing tasks
Novelty of Approach
• The neural Hawkes process memory belongs to two new classes of neural net models that are emerging.
§ Models that perform dynamic parameter inference as a sequence is processed (vs. stochastic gradient-based adaptation)
see also the Fast Weights paper by Ba, Hinton, Mnih, Leibo, & Ionescu (2016) and the TauNet paper by Nguyen & Cottrell (1997)
§ Models that operate in a continuous-time environment
see also the Phased LSTM paper by Neil, Pfeiffer, & Liu (2016)