Neural Hawkes Process Memory

Michael C. Mozer, University of Colorado, Boulder
Robert V. Lindsey, Imagen Technologies
Denis Kazakov, University of Colorado, Boulder



Predicting Student Knowledge

[Figure: a student's practice events on a timeline, with a '?' marking the next, to-be-predicted trial]

Temporally Distributed Study and Memory Retention

- Massed study
- Spaced study
- Memory decays more slowly with spaced study

[Figure: massed vs. spaced study events on two timelines, each followed by a '?' retention test]

Product Recommendation

[Figure: a user's interaction events on a timeline, with a '?' marking the next event to recommend]

Time Scale and Temporal Distribution of Behavior

- Critical for modeling and predicting many human activities:
  - Retrieving from memory
  - Purchasing products
  - Selecting music
  - Making restaurant reservations
  - Posting on web forums and social media
  - Gaming online
  - Engaging in criminal activities
  - Sending email and texts

[Diagram: an event sequence x1, x2, x3, x4 occurring at times t1, t2, t3, t4]

Recent Research Involving Temporally Situated Events

- Discretize time and use tensor factorization or RNNs
  - e.g., X. Wang et al. (2016), Y. Song et al. (2016), Neil et al. (2016)
- Hidden semi-Markov models and survival analysis
  - Kapoor et al. (2014, 2015)
- Include time tags as RNN inputs and treat as a sequence processing task
  - Du et al. (2016)
- Temporal point processes
  - Du et al. (2015), Y. Wang et al. (2015, 2016)
- Our approach: incorporate time into the RNN dynamics

Temporal Point Processes

- Produces a sequence of event times $\mathcal{T} = \{t_i\}$
- Characterized by a conditional intensity function $h(t)$:

  $h(t) = \Pr(\text{event in interval } dt \mid \mathcal{T}) / \mathrm{d}t$

- E.g., Poisson process (a small simulation sketch follows)
  - Constant intensity function: $h(t) = \mu$
- E.g., inhomogeneous Poisson process
  - Time-varying intensity $h(t)$
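As a concrete illustration (not part of the original deck), here is a minimal sketch of sampling from a homogeneous Poisson process; the function and parameter names (`sample_poisson_process`, `mu`, `t_max`) are our own.

```python
import numpy as np

def sample_poisson_process(mu, t_max, seed=0):
    """Sample event times on [0, t_max] from a homogeneous Poisson process
    with constant intensity h(t) = mu; inter-event gaps ~ Exponential(rate=mu)."""
    rng = np.random.default_rng(seed)
    times = []
    t = rng.exponential(1.0 / mu)
    while t <= t_max:
        times.append(t)
        t += rng.exponential(1.0 / mu)
    return np.array(times)

events = sample_poisson_process(mu=0.1, t_max=1000.0)
print(len(events) / 1000.0)  # empirical rate; should be close to mu = 0.1
```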

Hawkes Process

- Intensity depends on event history
  - Self-excitatory
  - Decaying
- Intensity decays over time
  - Used to model earthquakes, financial transactions, crimes
  - Decay rate determines the time scale of persistence

[Figure: an event stream and its intensity h(t) over time]

$h(t) = \mu + \alpha \sum_{t_j < t} e^{-\gamma (t - t_j)}$
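A direct transcription of this intensity function, as a sketch (parameter names are ours):

```python
import numpy as np

def hawkes_intensity(t, event_times, mu, alpha, gamma):
    """h(t) = mu + alpha * sum over past events t_j < t of exp(-gamma * (t - t_j)):
    baseline mu, excitation alpha per event, decay rate gamma."""
    past = np.asarray(event_times, dtype=float)
    past = past[past < t]
    return mu + alpha * np.exp(-gamma * (t - past)).sum()

# Each event bumps the intensity by alpha, and the bump decays at rate gamma:
print(hawkes_intensity(10.0, [2.0, 7.0, 9.5], mu=0.1, alpha=0.5, gamma=0.2))
```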

Hawkes Process

- Conditional intensity function (above)
- Incremental formulation with discrete updates (a sketch follows the definitions):

$h_k = \mu + e^{-\gamma \Delta t_k} (h_{k-1} - \mu) + \alpha x_k$

with $\mathcal{T} \equiv \{t_1, \dots, t_j, \dots\}$ the times of past events,

$h_0 = \mu$ and $t_0 = 0$,

$\Delta t_k \equiv t_k - t_{k-1}$, and $x_k = \begin{cases} 1 & \text{if an event occurs at } t_k \\ 0 & \text{if no event} \end{cases}$
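The same intensity can be maintained recursively, visiting only the observation times. A minimal sketch of the update above; decaying to t = 10 with no event reproduces the direct sum from the previous sketch:

```python
import numpy as np

def hawkes_update(h_prev, dt, x, mu, alpha, gamma):
    """h_k = mu + exp(-gamma * dt_k) * (h_{k-1} - mu) + alpha * x_k."""
    return mu + np.exp(-gamma * dt) * (h_prev - mu) + alpha * x

mu, alpha, gamma = 0.1, 0.5, 0.2
h, t_prev = mu, 0.0                        # h_0 = mu, t_0 = 0
for t_j in [2.0, 7.0, 9.5]:                # replay the observed events
    h = hawkes_update(h, t_j - t_prev, x=1, mu=mu, alpha=alpha, gamma=gamma)
    t_prev = t_j
# Pure decay (x = 0) up to t = 10 matches hawkes_intensity(10.0, ...):
print(hawkes_update(h, 10.0 - t_prev, x=0, mu=mu, alpha=alpha, gamma=gamma))
```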

[Diagram: HPM unit with inputs $x$ and $\Delta t$, internal intensity $h$, and parameters $\mu$, $\gamma$, $\alpha$]

Prediction

Observe a time series and predict what comes next.

Given model parameters, compute the intensity from observations:

$h_k = \mu + e^{-\gamma \Delta t_k} (h_{k-1} - \mu) + \alpha x_k$

Given the intensity, compute the event likelihood in a $\Delta t$ window (a sketch follows):

$\Pr(t_k \le t_{k-1} + \Delta t,\ x_k = 1 \mid t_1, \dots, t_{k-1}) = 1 - e^{-\left[ (h_{k-1} - \mu)\,(1 - e^{-\gamma \Delta t})/\gamma \,+\, \mu \Delta t \right]} \equiv Z_k(\Delta t)$
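In code, $Z_k(\Delta t)$ follows from integrating the decaying intensity over the window; a sketch under the same parameterization:

```python
import numpy as np

def event_prob_in_window(h_prev, dt, mu, gamma):
    """Z_k(dt) = 1 - exp(-integrated intensity over the window), where
    int_0^dt [mu + (h_{k-1} - mu) * e^{-gamma s}] ds
        = mu * dt + (h_{k-1} - mu) * (1 - e^{-gamma dt}) / gamma."""
    integrated = mu * dt + (h_prev - mu) * (1.0 - np.exp(-gamma * dt)) / gamma
    return 1.0 - np.exp(-integrated)

print(event_prob_in_window(h_prev=0.6, dt=5.0, mu=0.1, gamma=0.2))
```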

[Figure: an observed event stream and its inferred intensity over time, with a '?' marking the next-event prediction]

Key Premise

- The time scale for an event type may vary from sequence to sequence.
- Therefore, we want to infer the time scale parameter $\gamma$ appropriate for each event and for each sequence.

Bayesian Inference of Time Scale

- Treat $\gamma$ as a discrete random variable to be inferred from observations.
  - $\gamma \in \{\gamma_1, \gamma_2, \dots, \gamma_S\}$, where $S$ is the number of candidate scales
  - log-linear spacing to cover a large dynamic range
- Specify a prior on $\gamma$: $\Pr(\gamma = \gamma_s)$
- Given the next event $x_k$ (present or absent) at $t_k$, perform a Bayesian update (a sketch follows):

$\Pr(\gamma_s \mid x_{1:k}, t_{1:k}) \propto p(x_k, t_k \mid x_{1:k-1}, t_{1:k-1}, \gamma_s)\,\Pr(\gamma_s \mid x_{1:k-1}, t_{1:k-1})$

where the per-scale likelihood is built from the decayed intensity $\mu + e^{-\gamma_s \Delta t_k} (h_{k-1,s} - \mu)$, raised to $x_k$, and the window probability $Z_{ks}(\Delta t_k)$.
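A minimal sketch of this update. The deck does not fully specify the likelihood's form, so we assume a Bernoulli-style factorization $Z^{x}(1-Z)^{1-x}$ built from the window probability $Z_{ks}$; all names are our own.

```python
import numpy as np

def update_scale_posterior(prior, h_prev, dt, x, mu, alpha, gammas):
    """Update Pr(gamma_s | x_{1:k}, t_{1:k}) over S candidate decay rates.
    prior, h_prev, gammas: length-S arrays; x: 1 (event) or 0 (no event)."""
    # Window probability Z_{ks}(dt) under each candidate scale gamma_s:
    Z = 1.0 - np.exp(-(mu * dt + (h_prev - mu) * (1.0 - np.exp(-gammas * dt)) / gammas))
    lik = Z if x == 1 else 1.0 - Z         # assumed Bernoulli-style likelihood
    post = prior * lik
    post /= post.sum()
    # Per-scale incremental Hawkes update of the intensities:
    h_new = mu + np.exp(-gammas * dt) * (h_prev - mu) + alpha * x
    return post, h_new

S = 12
gammas = 1.0 / 2.0 ** np.arange(S)         # log-linear grid of candidate scales
post, h = np.full(S, 1.0 / S), np.full(S, 0.1)   # uniform prior; h_0 = mu
post, h = update_scale_posterior(post, h, dt=5.0, x=1, mu=0.1, alpha=0.5, gammas=gammas)
print(post.round(3))
```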

[Figure: event streams with short, medium, and long time scales (top) and the corresponding per-scale intensities vs. time]

[Figure: induction of time scale — input sequences, the intensity function from the generative process ($\gamma$ = 1/4, 1/64, 1/1024), the per-scale intensities, and the recovered (marginal over $\gamma$) intensity function]

Effect of Spacing

[Figure: effect of spacing — closely vs. widely spaced event streams and the resulting log-intensity traces over time]

Human data: Cepeda, Vul, Rohrer, Wixted, & Pashler (2008). Neural net model related to HPM: Mozer, Pashler, Cepeda, Lindsey, & Vul (2009).

[Figure: % recall vs. intersession interval (1-105 days), with curves for 7-, 35-, 70-, and 350-day retention intervals]

Two Alternative Characterizations of the Environment

- All events are observed
  - e.g., students practicing retrieval of foreign-language vocabulary
  - The likelihood function should reflect the absence of events between inputs:

  $\Pr(\gamma_s \mid x_{1:k}, t_{1:k}) \propto p(x_k, t_k \mid x_{1:k-1}, t_{1:k-1}, \gamma_s)\,\Pr(\gamma_s \mid x_{1:k-1}, t_{1:k-1})$

  with the per-scale likelihood built from $\mu + e^{-\gamma_s \Delta t_k} (h_{k-1,s} - \mu)$ and $Z_{ks}(\Delta t_k)$, as above

- Some events are unobserved
  - e.g., shoppers making purchases on amazon.com (but purchases are also made on target.com and jet.com)
  - The likelihood function should marginalize over unobserved events and reflect the expected intensity (a sketch follows):

  $\left[ \frac{\mu}{1 - \alpha/\gamma_s} + \left( h_{k-1} - \frac{\mu}{1 - \alpha/\gamma_s} \right) e^{-\gamma_s (1 - \alpha/\gamma_s)\, \Delta t_k} \right]^{x_k}$
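A sketch of the expected-intensity relaxation used in the unobserved-events case: the intensity decays toward the stationary mean $\mu/(1 - \alpha/\gamma)$ at the effective rate $\gamma(1 - \alpha/\gamma) = \gamma - \alpha$ (this requires $\alpha < \gamma$); names are ours.

```python
import numpy as np

def expected_intensity(h_prev, dt, mu, alpha, gamma):
    """E[h(t_{k-1} + dt)] when intervening events may be unobserved:
    relax toward the stationary mean instead of the baseline mu."""
    h_inf = mu / (1.0 - alpha / gamma)     # stationary mean; needs alpha < gamma
    return h_inf + (h_prev - h_inf) * np.exp(-gamma * (1.0 - alpha / gamma) * dt)

# With alpha = 0 this reduces to the fully-observed decay toward mu:
print(expected_intensity(h_prev=0.6, dt=5.0, mu=0.1, alpha=0.05, gamma=0.2))
```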

Hawkes Process Memory (HPM) Unit

How does a generic LSTM unit compare on the properties we want?

- Holds a history of past inputs (events): ✔
- Memory persistence depends on input history: ✗
- No explicit 'input' or 'forget' gate: ✗
- Captures continuous-time dynamics: ✗

[Diagram: HPM unit with inputs $x$ and $\Delta t$, internal intensity $h$, and parameters $\mu$, $\gamma$, $\alpha$]

Embedding HPM in an RNN

- Because event representations are learned, the input $x$ denotes Pr(event) rather than a truth value.
  - Activation dynamics are a mean-field approximation to Hawkes process inference. Marginalizing over the belief about event occurrence:

  $\Pr(\gamma_s \mid x_{1:k}, t_{1:k}) \propto \sum_{x_k = 0}^{1} p(x_k, t_k \mid x_{1:k-1}, t_{1:k-1}, \gamma_s)\,\Pr(\gamma_s \mid x_{1:k-1}, t_{1:k-1})$

- Output (to the next layer and through recurrent connections) must be bounded.
  - Quasi-hyperbolic function (sketch below):

  $h(t + \Delta \tilde{t}) \,/\, \big( h(t + \Delta \tilde{t}) + \nu \big)$
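A one-liner for the bounded output. The deck does not give a value for $\nu$, so `nu=1.0` is a placeholder:

```python
def bounded_output(h_ahead, nu=1.0):
    """Quasi-hyperbolic squashing h / (h + nu): maps an intensity in [0, inf)
    to an activation in [0, 1) for the recurrent/output connections."""
    return h_ahead / (h_ahead + nu)
```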

Generic LSTM RNN

[Diagram: an unrolled LSTM RNN; at each step the current event (... A, B, C ...) is input and the next event is predicted]

HPM RNN

[Diagram: the same architecture with HPM units in place of LSTM units; each step additionally receives $\Delta t$, the time since the last event, and $\Delta \hat{t}$, the time to the predicted event. A sketch of one HPM-unit step follows.]
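Pulling the pieces together, here is a minimal sketch of one step of a single HPM unit as we understand it from the slides: per-scale intensity updates, a soft (mean-field) marginalization over the event belief, and a bounded marginal intensity looked up $\Delta \hat{t}$ ahead. This is our reconstruction, not the authors' code; the Bernoulli-style likelihood and all names are assumptions.

```python
import numpy as np

def hpm_unit_step(post, h, x_prob, dt, dt_hat, mu, alpha, gammas, nu=1.0):
    """One step of an HPM unit (sketch). post, h, gammas: length-S arrays;
    x_prob in [0, 1] is the learned belief that this unit's event occurred."""
    decayed = mu + np.exp(-gammas * dt) * (h - mu)
    Z = 1.0 - np.exp(-(mu * dt + (h - mu) * (1.0 - np.exp(-gammas * dt)) / gammas))
    # Marginalize the likelihood over x_k in {0, 1}, weighted by the belief:
    post = post * (x_prob * Z + (1.0 - x_prob) * (1.0 - Z))
    post /= post.sum()
    h = decayed + alpha * x_prob                   # mean-field intensity update
    # Marginal intensity dt_hat ahead (time of the predicted next event):
    h_ahead = post @ (mu + np.exp(-gammas * dt_hat) * (h - mu))
    return post, h, h_ahead / (h_ahead + nu)       # quasi-hyperbolic output

S = 12
gammas = 1.0 / 2.0 ** np.arange(S)
post, h = np.full(S, 1.0 / S), np.full(S, 0.1)
post, h, out = hpm_unit_step(post, h, x_prob=0.8, dt=3.0, dt_hat=1.0,
                             mu=0.1, alpha=0.5, gammas=gammas)
print(out)
```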

Word Learning Study (Kang et al., 2014)

- Data
  - 32 human subjects
  - 60 Japanese-English word associations
  - each association tested 4-13 times over intervals ranging from minutes to several months
  - 655 trials per sequence on average
- Task
  - Given the study history up to trial t, predict accuracy (retrievable from memory or not) on the next trial.

[Figure: recall proportion over days under an expanding interval schedule (tests at 1, 3, 9, 28, 84 days) and an equal interval schedule (tests at 1, 10, 19, 28, 84 days), with tests T1-T3 per session]

Word Learning Results

- Majority class (correct): 62.7%
- Traditional Hawkes process: 64.6%
- Next = previous: 71.3%
- LSTM: 77.9%
- HPM: 78.3%
- LSTM with Δt inputs: 78.3%

Reddit Postings

[Figure: example users' posting event streams over time]

Reddit

- Data
  - 30,733 users
  - 32-1024 posts per user
  - 1-50 forums
  - 15,000 users for training (and validation), remainder for testing
- Task
  - Predict the forum to which a user will post next, given the time of the posting.

Reddit Results

- Next = previous: 39.7%
- Hawkes process: 44.8%
- HPM: 53.6%
- LSTM (with Δt inputs): 53.5%
- LSTM, no input or forget gate: 51.1%

- Human behavior and preferences have dynamics that operate across a range of time scales.
- It seems that a model based on these dynamics should be a good predictor of human behavior...
- ...and hopefully also a good predictor of other multiscale time series.

Key Idea of Hawkes Process Memory

- Represent memory of sequences at multiple time scales simultaneously
- Output the 'appropriate' time scale based on input history


State of the Research

- LSTM is pretty darn robust
- Some evidence that HPM and LSTM are picking up on distinct information in the sequences
  - Possibility that mixing unit types will yield benefits
  - Can also consider other types of units premised on alternative temporal point processes, e.g., self-correcting processes
- Potential for using event-based models even for traditional sequence processing tasks

Novelty of Approach

- The neural Hawkes process memory belongs to two new classes of neural net models that are emerging.
  - Models that perform dynamic parameter inference as a sequence is processed (vs. stochastic gradient-based adaptation)
    - see also the Fast Weights paper by Ba, Hinton, Mnih, Leibo, & Ionescu (2016) and the TauNet paper by Nguyen & Cottrell (1997)
  - Models that operate in a continuous-time environment
    - see also the Phased LSTM paper by Neil, Pfeiffer, & Liu (2016)