computational models of eye movements in reading: a data...


Page 1: Computational Models of Eye Movements in Reading: A Data ...uu.diva-portal.org/smash/get/diva2:484385/FULLTEXT01.pdfMattias Nilsson and Joakim Nivre In Proceedings of the 2nd Workshop

To my family


List of papers

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I Learning Where to Look: Modeling Eye Movements in Reading
Mattias Nilsson and Joakim Nivre
In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL), Boulder, Colorado, USA, pp. 93–101, June 4–5, 2009.

II Towards a Data-Driven Model of Eye Movement Control in Reading
Mattias Nilsson and Joakim Nivre
In Proceedings of the 2010 Workshop on Cognitive Modeling and Computational Linguistics (ACL 2010), Uppsala, Sweden, pp. 63–71, July 15, 2010.

III Entropy-Driven Evaluation of Models of Eye Movement Control in Reading
Mattias Nilsson and Joakim Nivre
In Proceedings of the 8th International NLPCS Workshop, Copenhagen, Denmark, pp. 201–212, August 20–21, 2011.

IV A Survival Analysis of Fixation Times in Reading
Mattias Nilsson and Joakim Nivre
In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics (ACL 2011), Portland, Oregon, USA, pp. 107–115, June 23, 2011.

V Time-Varying Effects on Eye Movements during Reading
Mattias Nilsson and Joakim Nivre
Submitted

Reprints were made with permission from the publishers.


Acknowledgements

Many people contributed to this thesis. In particular, I want to thank the following.

I am very grateful to my supervisor Joakim Nivre for his encouragement and guidance over the last few years and not least for the remarkable patience with which he helped me to structure this thesis into what I hope is a coherent form. I am also very thankful to my co-supervisors Ina Bornkessel and Shravan Vasishth for inviting me to join their research groups in Leipzig and Potsdam respectively, and for reading and commenting on an earlier draft of this thesis. Thanks are also due to my colleagues in the computational linguistics group at Uppsala University who read and provided useful comments on an earlier draft. I am also grateful to Erik Reichle for encouraging me in the first place, responding so thoughtfully to my questions and for digging up the old models.

This thesis would not have been written without the love and support my family has always given me. Thank you. Finally, I want to express my love to Chiara for her compassion, generosity, and appreciation of what is important in our lives.


Contents

1 Introduction
  1.1 Research Questions
  1.2 Outline of the Thesis

2 Eye Movements in Reading: Data and Models
  2.1 Basic Characteristics
  2.2 Where the Eyes Move
  2.3 When the Eyes Move
  2.4 Models of Eye Movement Control in Reading

3 Modeling Eye Movements in Reading
  3.1 Levels of Description
    3.1.1 Computational
    3.1.2 Algorithmic
    3.1.3 Implementational
  3.2 Models and Predictions

4 Where the Eyes Move
  4.1 Background
  4.2 Saccade Target Selection as Classification
    4.2.1 Transition System
    4.2.2 Learning Transitions
    4.2.3 Search Algorithm
    4.2.4 Experimental Results
  4.3 Entropy-Based Model Assessment

5 When the Eyes Move
  5.1 Background
  5.2 Saccade Timing as Time-to-Event Modeling
    5.2.1 The Survival Function
    5.2.2 Kaplan-Meier Survival Estimate
    5.2.3 The Hazard Function
    5.2.4 The Cox Proportional Hazards Model
    5.2.5 Extended Cox Model: Time-Varying Effects
    5.2.6 Prediction Error
    5.2.7 Experimental Results

6 Conclusion
  6.1 Main Contributions
  6.2 Future Directions

7 Overview of the Papers

References


1. Introduction

The eye movements you make, as you read sentences like this one, are one of the few behavioral means by which we can study and make inferences about the nature of the cognitive processes that support language comprehension. Understanding the links between the observed data – eye movement behavior – and the underlying processes is one of the great challenges of cognitive science. The overall goal of this thesis is to find novel methods and models for understanding these links. In a general sense, our work builds on more than three decades of eye movement research and psycholinguistic studies using eye movement data as a means for studying cognitive processes. The methodological intuition for this research is well captured in the immediacy-of-processing and eye-mind assumptions of Just and Carpenter (1980). These assumptions hold that a reader carries out the processes required to understand each word and its relationship to previous words in a sentence as soon as the word is encountered, and for as long as the eyes remain on the word. That is, it is assumed that eye movements are tightly controlled and coordinated with the cognitive processes of reading. Thus, if eye movement behavior varies as a function of the cognitive processes associated with language comprehension, as suggested by Just and Carpenter, it may be possible to make inferences from the data to the processes involved. Although it is now generally agreed that this relationship is not always as tight as Just and Carpenter assumed, numerous studies since have shown that cognitive processes have an essentially immediate influence on eye movements in reading.

1.1 Research Questions

The methods and models presented in this thesis focus on the questions of when and where the eyes move during reading. These questions have been at the center of attention for decades in research on eye movements in reading and refer to the two central, generally unconscious, decision processes involved in reading. In asking when the eyes move, we are asking a question about time: what determines the duration of fixations in reading? In asking where the eyes move, we are asking a question about space: what determines the location of fixations in reading, or, from a slightly different perspective, what determines the length and direction of eye movements from one fixation to the next? Much of what is known about these processes is due to experimental research carried out since the 1970s using a variety of experimental techniques, such as the moving window paradigm (Reder, 1973; McConkie and Rayner, 1975), the moving mask paradigm (Rayner and Bertera, 1979), and the boundary paradigm (Rayner, 1975; Balota et al., 1985).

Current research addressing these questions, however, is centered on computational modeling and simulation of eye movements in reading. A number of models have emerged to date which make explicit assumptions about the time course of cognitive processes, and produce detailed predictions of when and where the eyes move, which can be compared to human reading data. These models generally develop from a set of assumptions about the algorithmic, or procedural, relationship between the perceptual, cognitive, and motor processes that guide the eyes during reading. Theoretical constraints, in accord with experimental evidence, are then imposed on the model parameters.

In this thesis, we take a different approach based on the use of empirical eye movement data, as recorded in large eye tracking corpora, and data-driven modeling methods. We explore the idea that empirical data carries rich information about the processes that guide eye movement decisions. The central research question we address is how data-driven methods can be used to model eye movements in reading and to recover characteristics of the underlying processes. We also emphasize the role of prediction in eye movement modeling, and an additional research question we address, therefore, is how to evaluate whether the models make good predictions about human reading behavior.

A great deal of experimental evidence suggests that the decisions of when and where to move the eyes are made independently from one another and depend on different mechanisms. We maintain this separation between when and where and explore two data-driven modeling methods, each addressing a different aspect of eye movement behavior, and each paired with a different evaluation method. The spatial question – where the eyes move – is approached using standard machine learning and classification methods. We present a flexible model, with few fixed assumptions, that allows for a number of parameters to be explored empirically, and we show how the predictions of the model can be evaluated on held-out data using the notion of entropy. The temporal question – when the eyes move – is explored using time-to-event modeling (also known as survival analysis). We show how this method can be used to study the timing and strength of processes that influence the decision of when to move the eyes, and, again, we use an entropy-related method to evaluate the predictions of our model on held-out data.

1.2 Outline of the Thesis

The remainder of this thesis is structured as follows.

Chapter 2 provides background information about eye movements and reading research. We review some basic characteristics of eye movements in reading, factors which influence when and where the eyes move, and models of eye movement control in reading.

Chapter 3 broadens the discussion of models of eye movements in reading. We show that different models are driven by different motivations and concerns, which reflects that work is progressing at different descriptive levels. We further discuss the role of prediction and evaluation.

Chapter 4 demonstrates that the problem of where the eyes move can be approached as a machine learning problem. We present a transition-based model that is guided by a classifier to select saccade targets during reading. We show how such a model can learn to produce eye movements that are similar to human readers. Further, we discuss problems in treating saccade target selection as classification and present an alternative approach based on probabilistic saccade models. For these models, we propose a probabilistic evaluation method based on measuring the entropy, relative to a model, with respect to independently observed eye movement behavior.

Chapter 5 demonstrates that the problem of when the eyes move can be approached as a time-to-event modeling problem. We motivate the use of time-to-event modeling, review the basic concepts and methods, and show how Cox hazards models can be applied to model the influence of covariates on the decision to move the eyes. We further propose that Cox hazards modeling with time-varying effects can be used to recover some basic time course characteristics of the short-lived processes that influence eye movements. Some results supporting this hypothesis are presented. In addition, we propose a probabilistic evaluation method for hazards models of eye movements based on the Brier score.

Chapter 6 summarizes the main results and contributions of the thesis. We conclude with a discussion of promising directions for future research.

Chapter 7 provides a brief overview of the papers on which this thesis is based.


2. Eye Movements in Reading: Data and Models

This chapter provides background information on eye movements during reading. We review basic characteristics of eye movements in reading, research on where and when the eyes move, including measures of eye movements, and models of eye movement control in reading. This review does not attempt to cover all relevant issues and is by no means exhaustive. Instead, we focus the discussion primarily on aspects that have some bearing on papers I–V. Comprehensive summaries of research on eye movements in reading are provided by Rayner (1998, 2009a).

2.1 Basic Characteristics

The two basic components of eye movements in reading are fixations and saccades. Fixations are the short periods of time during which the eyes remain fairly still, and saccades are the rapid movements the eyes make between fixations. Because vision is suppressed during saccades, information from the visual array is obtained only during fixations. The amount of visual information that a reader is able to make effective use of during a fixation is referred to as the perceptual span. The size of the perceptual span is asymmetric to the right and left of the fixation, and the size as well as the direction of the asymmetry vary as a function of the writing system used. For readers of left-to-right orthographies, like English, the perceptual span extends 14–15 letter spaces to the right of the fixation but only 3–4 letter spaces to the left (McConkie and Rayner, 1975, 1976; Rayner and Bertera, 1979). The size of the perceptual span is also influenced by reading skill. Beginning readers and readers with dyslexia generally have a smaller span than more skilled readers. The perceptual span is distinct from the word identification span, which is the area from which a word can be identified during a fixation. The word identification span is smaller than the perceptual span and usually extends only about 7–8 letter spaces to the right of the fixation (Rayner et al., 1982).

The average fixation duration in reading is on the order of 225–250 ms, and the average saccade length is 7–9 letter spaces for readers of alphabetic writing systems (Rayner, 2009a). Hence, a skilled reader typically moves the eyes about 7–9 letter spaces every quarter of a second. There is, however, a considerable spread around these averages. Fixation durations for an individual reader often range from 50 ms to about 600 ms, and saccade lengths from 1 letter space to about 20 letter spaces.

Three other basic characteristics of reading behavior need to be considered. First, not all words are fixated. About 20–30% of the words in a text are skipped during reading. A word that is skipped does not receive any fixation. There is no reason, however, to assume that the word is not being processed. On the contrary, there is evidence that skipped words do get processed (Fisher and Shebilske, 1985). The majority of skipped words are short function words. Second, readers do not invariably progress in the direction of the text. About 10–15% of all saccades in reading are regressions, that is, saccades that move backwards in the text. Most regressions are to the immediately preceding one or two words but they also often stretch a longer distance. Third and finally, the same word is often fixated more than once in succession, in particular if it is a long word. That is, many words get refixated during reading.

2.2 Where the Eyes Move

The question of where the eyes move during reading may be decomposed into two problems: (a) which word is going to be fixated next, and (b) where in the word the eyes (i.e., the saccade) are going to land. Another distinction, however, is often made between where the eyes intend to move, and where they actually go. Although there is some variability in where the eyes land in a word, they tend to land halfway between the beginning of a word and the middle of a word. This is referred to as the preferred viewing location (Rayner, 1979). It is generally argued, however, that saccades are aimed at the center of words but fall short due to different sources of error arising from the fact that saccades are motor movements and require muscle control. Given that most saccades tend to land at or around the preferred viewing location, the critical question is which word is going to be the target for the next saccade. This problem is referred to as saccade target selection (McConkie et al., 1994). As the intended saccade targets are unknown, we focus the discussion here on the observed saccade targets, in the empirical distribution of eye movements. The majority of saccades in reading land on one of the three following words, with a probability which decreases with the distance from the fixated word. The main factors influencing whether a word will be skipped are the length, frequency, and predictability of the word. Thus, short, frequent and predictable words are often skipped. By predictability we refer generally to the extent to which words are contextually constrained and thus more easily predicted from the preceding context.¹ Generally, word length has the greatest influence on skipping, and predictability is more important than frequency (Brysbaert and Vitu, 1998; Rayner, 1998). As noted earlier, there are two other basic types of eye movements in reading, refixations and regressions. The probability of refixating a word increases if the word is longer than average or if the initial fixation on the word lands near the end of a word, rather than near the center (McConkie et al., 1989). In the latter case, it is argued that a refixation is triggered to move the eyes to a better viewing location. Concerning regressions, short-range regressions to a previous word might be due to incomplete lexical processing of the word (Vitu et al., 1998). This could happen when a reader moves the eyes from a word too early and then regresses back to the word to continue processing it. Long-range regressions, instead, are generally argued to be triggered by comprehension failures in syntactic parsing or semantic interpretation.

¹ In experimental reading studies, predictability is often estimated using a cloze task (Taylor, 1953). The cloze score for a given word then corresponds to the proportion of subjects who correctly guess the word when presented with the preceding sentence context.
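The cloze score mentioned in the discussion of predictability is a simple proportion, which a minimal sketch can make concrete. The function name `cloze_score`, the case-insensitive matching, and the example guesses are assumptions made for illustration only; they are not taken from the thesis or from Taylor (1953).

```python
def cloze_score(target, responses):
    """Proportion of subjects whose guess matches the target word.

    Matching is case-insensitive and ignores surrounding whitespace,
    which is an assumption of this sketch.
    """
    if not responses:
        raise ValueError("need at least one response")
    wanted = target.strip().lower()
    guesses = [r.strip().lower() for r in responses]
    return guesses.count(wanted) / len(guesses)


# Hypothetical guesses from five subjects for the word following
# the context "The cat chased the ..."
responses = ["mouse", "mouse", "dog", "Mouse", "bird"]
score = cloze_score("mouse", responses)
print(score)  # three of five guesses match
```

A word with a score near 1 is highly predictable from its context, and a word with a score near 0 is unpredictable.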

The spatial characteristics of eye movement behavior in reading are often summarized in terms of probabilities for different types of saccades calculated over a group of readers. Commonly reported measures include the probability of a word being skipped, fixated once, and fixated more than once. These measures are typically calculated based on the first pass reading through the text. Similar measures, however, are also sometimes reported for the probability of regressing to or from a word. Often, the measures are further averaged over classes of words with different frequency and length to describe the effect of these variables on spatial saccade behavior.
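As an illustration of how such summary probabilities might be computed, the following sketch assumes that each reader's first-pass behavior has already been reduced to a count of fixations per word. The data layout, the function name `first_pass_probabilities`, and the example counts are hypothetical.

```python
def first_pass_probabilities(counts_per_reader):
    """Per-word probabilities of skipping, single fixation, and refixation.

    counts_per_reader[r][w] is the number of first-pass fixations that
    reader r made on word w (an assumed input format).
    """
    n_readers = len(counts_per_reader)
    n_words = len(counts_per_reader[0])
    result = []
    for w in range(n_words):
        counts = [reader[w] for reader in counts_per_reader]
        result.append({
            "skip": sum(c == 0 for c in counts) / n_readers,
            "single": sum(c == 1 for c in counts) / n_readers,
            "multiple": sum(c >= 2 for c in counts) / n_readers,
        })
    return result


# Four hypothetical readers, four words.
counts = [
    [1, 0, 2, 1],
    [1, 0, 1, 1],
    [1, 1, 2, 0],
    [1, 0, 3, 1],
]
probs = first_pass_probabilities(counts)
for w, p in enumerate(probs):
    print(w, p)
```

Averaging these per-word values within word-length or word-frequency classes would give the class-level summaries described above.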

2.3 When the Eyes Move

The amount of time spent fixating a word is influenced by a number of variables related to lexical word identification, syntactic parsing, semantic interpretation and discourse representation. A particularly relevant finding, which attests to the fact that cognitive processes may influence fixation durations, is that information gets into the processing system very early during a fixation. Experiments have shown that if readers are given just 50–60 ms on each word before the word is withheld from view, then reading proceeds quite normally (Ishida and Ikeda, 1989; Rayner et al., 1981, 2003). If the word is withheld earlier than that, however, reading is disrupted. These results do not imply that a reader completes the processing of a word within 50–60 ms. Rather, the results suggest that this amount of time is sufficient to encode the information needed for reading, thus leaving time for other cognitive processes to develop. Furthermore, in these experiments it is shown that after a word is withheld, the time the eyes remain in place depends on the frequency of the word. The eyes tend to remain longer if the word is a low-frequency word when compared to a high-frequency word, even though the word is no longer in view. More generally, the influence of word frequency on fixation durations is well documented (e.g., Inhoff and Rayner, 1986; Rayner and Duffy, 1986; Schilling et al., 1998; Juhasz and Rayner, 2003).


In addition to word frequency, word length also influences fixation durations. Longer words are fixated for a longer time than shorter words. An additional important variable is contextual constraint or word predictability. Words are fixated for less time if they are contextually constrained and easier to predict than if they occur in more neutral contexts (Ehrlich and Rayner, 1981; Rayner and Well, 1996). It is important to note that word predictability, in general, may in fact incorporate a host of different lexical, syntactic, semantic and discourse variables. Readers presumably draw on all kinds of available linguistic evidence in order to constrain the possible interpretations of an upcoming word. Other variables which have been shown to influence the time spent on a word include: lexical ambiguity (Duffy et al., 1988; Rayner and Frazier, 1989; Sereno et al., 2006), syntactic ambiguity (Frazier and Rayner, 1982; Altman et al., 1992; Clifton et al., 2007), semantic relations between words in a sentence (Carroll and Slowiaczek, 1986; Sereno and Rayner, 1992), and anaphora and coreference (Ehrlich and Rayner, 1983; Duffy and Rayner, 1990). More recently, information-theoretical measures of complexity, such as surprisal and entropy, have been used to approximate the ease or difficulty associated with processing individual words (Hale, 2001, 2003, 2006; Levy, 2008). These measures have also been demonstrated to affect fixation times in reading (Boston et al., 2008; Demberg and Keller, 2008; Frank, 2010; Boston et al., 2011).

There are two additional effects that need to be mentioned: spillover effects and parafoveal preview effects. The spillover effect is another type of word frequency effect. It has been demonstrated that there is a tendency for the eyes to remain longer on a word when the previous word is a low-frequency word (Rayner and Duffy, 1986). Thus, it appears that the decision of when to move the eyes is modulated not only by the frequency of the fixated word but also, to some extent, by the frequency of the previous word. While the spillover effect relates to how the previous word may influence the fixation time on the current word, the parafoveal preview effect relates to how information about the upcoming word may influence the subsequent fixation duration on that word. More specifically, it has been shown that readers often obtain information from the word to the right of the fixation (the parafoveal word) (Balota et al., 1985; Pollatsek et al., 1992). This subsequently facilitates the processing of that word once it is fixated, in the sense that the fixation duration on the word is slightly reduced. Overall, parafoveal preview effects suggest that the processing of a word often begins before the word is actually fixated.
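For readers unfamiliar with the information-theoretic measures mentioned above, the sketch below shows the standard definitions: the surprisal of an event with probability p is −log₂ p, and Shannon entropy is the expected surprisal under a distribution. The next-word distribution used here is invented for illustration. The cited studies estimate these quantities from probabilistic language models or grammars and differ in the details of how they do so.

```python
import math


def surprisal(prob):
    """Surprisal in bits of an event with probability `prob`."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def entropy(distribution):
    """Shannon entropy in bits: the expected surprisal of the distribution."""
    return sum(p * surprisal(p) for p in distribution.values() if p > 0)


# Hypothetical conditional distribution over the next word in some context.
next_word = {"mouse": 0.5, "dog": 0.25, "bird": 0.25}

print(surprisal(next_word["mouse"]))  # a likely word carries little surprisal
print(surprisal(next_word["bird"]))   # a less likely word carries more
print(entropy(next_word))             # uncertainty about the next word
```

Under these definitions, a highly predictable word has low surprisal, which is consistent with the shorter fixation times reported for contextually constrained words.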

The temporal characteristics of eye movements in reading are commonly summarized in terms of different word-based measures of fixation durations, generally assumed to reflect processing time. It is worth noting that the average fixation duration is not used. While the average fixation duration may be a useful descriptive statistic as such, it is not an appropriate measure of the processing time of words. Consider, for example, a word which is refixated. The mean fixation duration (i.e., the mean of the individual fixations on the word) clearly underestimates the time the eyes actually remain on the word. Moreover, the fact that some words are refixated and others are not further complicates the use of average fixation duration. Instead, three other word-based measures are typically used. These measures are single fixation duration, first fixation duration and gaze duration. Single fixation duration is the duration of the fixation on a word which is fixated exactly once on the first pass through the text. First fixation duration is the duration of the first fixation on a word on the first pass, irrespective of whether the word receives additional fixations. Gaze duration is the sum of all fixation durations on a word on the first pass (i.e., including refixations), prior to moving to another word. These measures are typically averaged over a group of readers to produce mean durations over words. In turn, words are often categorized into different word frequency or word length classes and the average durations of these classes are then calculated. The three measures mentioned here (in particular single and first fixation duration) are generally assumed to reflect early cognitive processing activities of a word as they are calculated only for fixations made during the first pass and thus exclude regressions to previous words. However, additional measures which are assumed to reflect later cognitive processing activities are also frequently used. One such measure is the total fixation duration, which is the sum of all fixation durations on a word, including regressions to the word.
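The four word-based measures defined above can be sketched for a single reader as follows. The input format (a temporally ordered list of word index and duration pairs) and the function name `word_measures` are assumptions. The sketch also adopts one common operationalization of "first pass", namely the first uninterrupted run of fixations on a word that is reached before any word further to the right has been fixated; the text does not spell out this detail.

```python
from itertools import groupby


def word_measures(fixations, n_words):
    """Compute word-based duration measures for one reader.

    fixations: list of (word_index, duration_ms) in temporal order.
    Returns one dict per word; first-pass measures stay None for words
    that received no first-pass fixation.
    """
    measures = [
        {"first_fixation": None, "single_fixation": None, "gaze": None, "total": 0}
        for _ in range(n_words)
    ]
    furthest = -1  # rightmost word fixated so far
    # Each group is an uninterrupted run of fixations on the same word.
    for word, run in groupby(fixations, key=lambda f: f[0]):
        durations = [d for _, d in run]
        m = measures[word]
        m["total"] += sum(durations)  # counts regressions to the word as well
        if word > furthest:  # assumed first-pass criterion
            m["first_fixation"] = durations[0]
            m["gaze"] = sum(durations)
            if len(durations) == 1:
                m["single_fixation"] = durations[0]
            furthest = word
    return measures


# Hypothetical scanpath: word 1 is refixated, word 2 is skipped on the first
# pass and later regressed to, and word 3 is then fixated a second time.
fixations = [(0, 210), (1, 180), (1, 150), (3, 240), (2, 200), (3, 190)]
result = word_measures(fixations, n_words=4)
for w, m in enumerate(result):
    print(w, m)
```

For the refixated word 1, the mean of the individual fixations (165 ms) is well below the gaze duration (330 ms), which illustrates why the average fixation duration underestimates the time spent on a word.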

2.4 Models of Eye Movement Control in Reading

As we noted in the previous chapter, much current research is centered on computational models that simulate eye movements during reading. A number of computational models of eye movement control in reading have emerged in recent years (see Reichle, 2006, for an overview of five models). In general, these models attempt to explain the processes which underlie observed reading behavior. As noted by Rayner (2009b), a strength of the models is that they produce predictions about actual fixation durations rather than about processing costs or other more indirect measures. Likewise, they produce precise predictions for saccade lengths and fixation locations. Thus, the simulation data which the models produce can be directly compared to human eye movement data. The differences between these models can be understood in terms of the assumptions made about how perceptual, cognitive, and motor control processes guide the eyes through the text. With respect to the decision of when and where the eyes move, the differences between the models tend to be greater in the assumptions about when the eyes move. Two assumptions in particular differ between the models. The first concerns the extent to which language and cognitive processes influence the decision of when to move the eyes, and the second concerns the way in which attention is allocated to words during reading. To exemplify how models may differ on these issues, we outline the basic assumptions of the most influential model to date, E-Z Reader (Reichle et al., 1998, 2003, 2009), and then mention how other models differ.

E-Z Reader is built on the assumption that the process of identifying words is “the engine that drives eye movements” through the text. More specifically, lexical access occurs in two stages in E-Z Reader. The duration of these stages is assumed to be a function of the word frequency and contextual predictability of the word being processed (the effects of these variables, however, differ between the two stages). The completion of the first stage, called the “familiarity check”, is the signal to start planning a new saccade to the next word (the word to the right of the currently processed word). The completion of the second stage, called the “completion of lexical access”, is the signal to shift the covert attention (the mental focus) to the next word and begin the familiarity check on that word via parafoveal preview. At this point, one of two things will happen. If a preliminary stage of saccade planning, called the “labile stage”, finishes before the familiarity check on the word now being processed, a saccade will be executed and the next word will be fixated. If, however, the familiarity check finishes first, the current saccade being planned will be canceled and a new saccade will be planned one word further downstream in the text. In this second case, the word being processed will be skipped. Thus, the main assumptions of E-Z Reader are that lexical access is the trigger to move the eyes (i.e., to start planning a saccade), and that attention is allocated strictly serially, such that the processing of the next word does not begin until the processing of the current word has completed. These assumptions differ from the assumptions of other models, like SWIFT (Engbert et al., 2002, 2005) and Glenmore (Reilly and Radach, 2006). In these models attention is allocated across words in the perceptual span, thus allowing parallel lexical processing of words. Furthermore, in SWIFT it is assumed that saccades are generated based on a random timer which initiates saccade planning at random intervals of time (though based on a preferred mean rate of saccades). Variables like word frequency and word predictability do influence fixation time in SWIFT but only indirectly, by inhibiting the random timer when words are difficult to process. In other words, saccades are initiated autonomously at times which are only occasionally influenced by current cognitive processing. Yet another model, the Competition/Activation model (Yang and McConkie, 2001; Yang, 2006), assumes that cognitive processes do not have any immediate influence on fixation times, but rather only affect very long fixations.
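The race in E-Z Reader between the labile stage of saccade planning and the familiarity check on the next word, as described above, can be illustrated with a small Python sketch. This is only a reading of the verbal description, not the published model: the function name, the parameter names and the example durations are hypothetical, all durations are taken as given rather than sampled, and the treatment of exact ties is an arbitrary choice.

```python
def next_word_is_skipped(labile_ms, lexical_completion_ms, next_familiarity_ms):
    """Decide whether the next word is skipped under the race described above.

    All durations are hypothetical inputs in milliseconds. The clock starts
    when the familiarity check on the fixated word completes, which is the
    moment saccade planning (the labile stage) begins.
    """
    # Attention shifts only once lexical access on the fixated word completes;
    # the familiarity check on the next word starts at that point.
    next_familiarity_done = lexical_completion_ms + next_familiarity_ms
    # If that check finishes before the labile stage does, the planned saccade
    # is canceled and re-planned one word further on, so the next word is skipped.
    # Ties are treated here as "not skipped" (an assumption of this sketch).
    return next_familiarity_done < labile_ms


# An easy next word: its familiarity check wins the race, so it is skipped.
easy = next_word_is_skipped(labile_ms=150, lexical_completion_ms=50, next_familiarity_ms=60)
# A harder next word: the labile stage finishes first, so the word is fixated.
hard = next_word_is_skipped(labile_ms=150, lexical_completion_ms=80, next_familiarity_ms=100)
```

Under strictly serial attention allocation, the familiarity check on the next word cannot start before lexical access on the current word completes, which is why the two durations are added.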

Most models of eye movement control in reading have been demonstrated to account for basic characteristics of eye movements during reading. The models generally predict word frequency and predictability effects on fixation times similarly to what is observed in human readers. They also account for saccade lengths, skipping rates and, to some extent, regressions, although some of these effects may be hard-wired in the models (Rayner, 2009b).


In the next chapter we broaden the scope of discussion to include an overview of different approaches to eye movement modeling. This overview then serves as background to introduce some basic characteristics of our approach.


3. Modeling Eye Movements in Reading

In this chapter we broaden the discussion on eye movement modeling initiated in the previous chapter. This chapter is divided into two sections. In section 3.1, we discuss different approaches to eye movement modeling and make an attempt at separating them using Marr’s notion of different levels of description for cognitive systems (Marr, 1982). In section 3.2, we comment on models and predictions and introduce some recurring themes in later chapters.

3.1 Levels of Description

Models of eye movements in reading can be compared and contrasted to one another in a number of different ways. In the previous chapter, we noted, for example, that models of eye movement control make different assumptions about how language processing affects fixation times and about the nature of attention allocation, that is, whether more than one word may be processed at any given time. At a more general level, it is also possible to distinguish between different approaches to modeling, such as inside-out and outside-in approaches (Feng, 2006), or theory-driven and data-driven modeling (Feng, 2001). These terms are generally meant to distinguish models that emphasize a particular theory of eye movements (inside-out or theory-driven) from models that emphasize the use of empirical data to model eye movement behavior (outside-in or data-driven). Although these terms may convey the general idea, they need to be nuanced in specific contexts. The distinction between theory-driven and data-driven modeling, for example, may appear to imply that data-driven models are theory-independent or atheoretical, or perhaps that theory-driven models are non-empirical. Neither of these implications is valid. Models of eye movements in reading are driven by different motivations and concerns. To understand these, it may be more fruitful to distinguish between models at different levels of description, in line with David Marr’s celebrated account of explanatory levels of cognitive systems. Marr argues that cognitive systems, or information processing systems in general, may be described at three independent though complementary levels: the computational, the algorithmic, and the implementational level. Models or descriptions at the computational level focus on what functions the system is computing, and why those computations are required to achieve the goals of the system. Broadly, descriptions at this level are consistent with accounts that help identify conditions and constraints associated with the function being computed, but abstract away from how the computation is actually carried out. In contrast, the algorithmic and implementational levels address the question of how computations are performed. The algorithmic level specifies the algorithms and representations involved in the computations, while the implementational level is concerned with how algorithms and representations may be realized physically, for example by neural processes in the brain.

While Marr emphasized the importance of the computational level, the strength of the proposal is that it motivates research at all levels and allows researchers to abstract away from specificities which are extraneous to their primary concern. For example, we may propose theories of which algorithms implement particular higher brain functions without making claims about the neural realization of the algorithms.

While we do not want to argue that all models of eye movements in reading map neatly onto one or the other of Marr’s levels, it is clear that work is now carried out at multiple levels of description, largely corresponding to Marr’s three levels. Thinking in terms of Marr’s levels may help to understand how different models relate or may be compared to one another, both across and within levels. In the following subsections, we briefly exemplify modeling work at different descriptive levels and discuss some basic methodological concerns that characterize models at each level.

3.1.1 Computational

Much work at the computational level draws inspiration, methodologically, from the mathematical theory of human response time (Luce, 1986; Van Zandt, 2002). Studies in this context typically treat eye movements as observable responses, or outcomes, of an underlying stochastic process. Mathematical and statistical methods are applied to model the empirical distribution of eye movements on the assumption that essential properties of the underlying stochastic process can be characterized that way. In turn, it is generally assumed that knowledge about the precise distribution functions can inform or place constraints on theories about mechanisms of eye movements in reading. This line of research thus emphasizes the use of data in deriving mathematical models of eye movement behavior that characterize the basic constraints and conditions the underlying processes must satisfy. Generally, these models have tended to downplay the role of cognition, often assuming that language processing only influences very long fixations when the initially planned saccade is delayed or canceled due to processing difficulty.

Model assessment is typically based on comparing the fitted distributions to the observed distributions using some goodness-of-fit statistic. The assessment is generally based on the same set of data that is used to estimate the parameters of the model. Analyses of this kind include models of within-word landing positions (McConkie et al., 1988; Radach and McConkie, 1998), fixation durations (McConkie et al., 1994; McConkie and Dyre, 2000), regressions (Vitu et al., 1998; Vitu and McConkie, 2000), refixations (McConkie et al., 1989; Radach and McConkie, 1998), and skipping rates (McConkie et al., 1994). More recent work in this tradition is exemplified by Feng (2006, 2009). Other work at the computational level, addressing how eye movements in reading can be explained in terms of more general goals of human behavior, is exemplified by the work of Legge et al. (1997) and, more recently, by Bicknell and Levy (2010).

3.1.2 Algorithmic

Models at the algorithmic level largely correspond to the models of eye movement control discussed in the previous chapter. Here, we focus on the basic modeling approach, first introduced by the development of E-Z Reader. This approach starts with a set of assumptions concerning the underlying control structure of eye movements in reading. This structure, often implemented as a state machine, defines the set of perceptual, cognitive, and motor processes which support eye movement control, and specifies the relationship between the processes. These processes are internal to the system and not directly observable from empirical data. Theoretical constraints are therefore typically imposed on the parameters involved in executing the processes. Typically, the mean and standard deviation of the distribution for a given process (e.g., lexical processing time) are specified; variability is then built into the model by random sampling from the assumed distribution whenever a process is executed. As noted by Reichle et al. (1998), an important concern is that the parameter values defining the distributions should be based on plausible estimates given experimental evidence.

In assessing the models, a parameter search is performed to determine the best fitting values for the model parameters given the observed data. The assessment is based on the same data that is used to tune the parameters. In other words, one attempts to identify the estimates which minimize the error of the model on the fitted data. Typically, the best fitting parameter values are assessed qualitatively so that they are in reasonable agreement with other evidence. The error that is minimized is usually the root mean squared error between observed and predicted mean values for different eye movement measures such as gaze duration and skipping probability. The means are typically calculated over a small set of word frequency classes in order to assess the models’ capacity to reproduce frequency effects on eye movements in reading. Additionally, more subtle effects (e.g., spillover, preview and skipping effects) are often also assessed based on the simulation data.
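As an illustration of the error measure mentioned above, the following Python sketch computes a plain root mean squared error between observed and predicted class means. The function name `rmse` and the gaze duration values for five frequency classes are invented for the example, and the sketch applies no normalization or weighting, which specific models may well add.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between two equally long sequences of means."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical mean gaze durations (ms) for five word frequency classes,
# ordered from low-frequency to high-frequency words.
observed_gaze = [293.0, 272.0, 256.0, 234.0, 214.0]
predicted_gaze = [285.0, 270.0, 260.0, 240.0, 210.0]
error = rmse(observed_gaze, predicted_gaze)
```

A parameter search of the kind described would repeat this computation for many candidate parameter settings and keep the setting with the smallest error on the fitted data.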


3.1.3 Implementational

Given that little is known about the underlying neural mechanisms of eye movements in reading, it is not surprising that there has been less effort at the implementational level. While some ideas for a possible layout of the neural implementation of E-Z Reader have been discussed (Reichle et al., 2003), the first biologically inspired neural network model of eye movement control in reading was presented only recently (Heinzle et al., 2010). The basic goal of this model is to shed some light on how the brain may control eye movements in reading. The model is implemented as a layered network of spiking neurons that can control sequences of eye movements in reading. Interaction with a cortical word processing module allows the model to read sequences of x-strings (as a simple representation of normal words).

The performance of the model is assessed rather qualitatively by examining how a set of simulation-based descriptive statistics compares with those observed in human readers in general. The evaluation focuses primarily on demonstrating the model’s capacity to account for some basic aspects of reading behavior, such as total fixation times and skipping rates as a function of word length.

3.2 Models and Predictions

Under the usual view of eye movements in reading, the two primary behaviors of the underlying perceptual, cognitive and motor processes we observe are fixation durations and fixation locations. The problem, as approached in our work, is to generate good predictions of eye movement behavior, that is, fixation durations and fixation locations that are in good agreement with the observed performance. In this sense, our primary concern is to characterize eye movement performance, rather than eye movement control. Thus, overall, the methods employed and models presented in this thesis are best understood at the computational level of description in Marr’s taxonomy.

Assuming the central problem is to generate good predictions of eye movement behavior, an important issue arises: how do we know if a model makes good predictions? As is clear from the review in the previous section, the evaluation of models, across levels, is generally based on comparing model predictions to empirical observations. The accuracy of the predictions, however, is usually only assessed on the same set of data that is used to fit the parameters of the model. In other words, models are rarely evaluated on held-out data as is standard in other modeling fields (e.g., artificial intelligence, natural language processing and machine learning). The potential problem of not separating data for parameter estimation (training data) and model evaluation (test data) is that the model being evaluated is then selected purposively to lie near the observed data points, which is what fitting means but generally not what prediction means. By prediction we rather refer to a statement about the outcome of observations we have not seen previously. In principle it is of course possible to use the results of our best parameter estimates on the fitted data also as an estimate of how accurate the model is for predicting novel outcomes. However, we risk that this estimate is overly optimistic, because the error on the training data tends to underestimate the true error of the model. If our goal is prediction, we should instead consider how well the model generalizes to previously unseen behavior, as measured by the prediction error of the model. The prediction error is the observed inaccuracy of the fitted model applied to a new, representative, set of data points not used during model development.

However, saying that models are assessed with respect to their prediction error is only half an answer to the general question, if it is not also specified how predictions are actually compared to the data. In contrast to much current work, we propose to assess the precision of probability predictions, rather than the average of count predictions. The idea follows from the fact that the models we consider yield a probability distribution over space and time. The basic intuition is that good models of eye movement behavior assign high probability to empirically observed data. Measuring the probability of human behavior, relative to a model, is one way of assessing the similarity between predictions and the observed data. This measurement allows for a finer assessment of a model’s ability to generate accurate predictions than mere averages of count predictions. Using it in practice also presumes that the model is evaluated against a test sample which is drawn independently of the data the model is fitted to. In chapter 4 we elaborate further on the idea and show how entropy, a measure based on probability, can be used to assess probabilistic models of where the eyes move during reading. In chapter 5, we relate the same idea to the Brier score (Brier, 1950), which is employed in order to assess the accuracy of probabilistic temporal predictions of when the eyes move.
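The two kinds of probabilistic assessment mentioned above can be sketched in Python as follows. Both functions are generic textbook formulations and not necessarily the exact definitions used in chapters 4 and 5: the first computes the mean negative log2 probability that a model assigns to the observed outcomes (an entropy-style measure in bits), and the second the Brier score in its common binary form. The function names and the example numbers are assumptions for illustration, and the inputs are meant to come from held-out test data.

```python
import math


def mean_negative_log2_probability(probabilities_of_observed):
    """Average number of bits the model needs per observed outcome.

    Each entry is the probability the model assigned to what actually
    happened. Lower is better; a probability of 0 raises ValueError.
    """
    if not probabilities_of_observed:
        raise ValueError("need at least one observation")
    bits = [-math.log2(p) for p in probabilities_of_observed]
    return sum(bits) / len(bits)


def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(f - o) ** 2 for f, o in zip(forecasts, outcomes)]
    return sum(squared) / len(squared)


# Hypothetical held-out data: probabilities assigned to the observed saccade
# targets, and forecasts that a saccade occurs within some time window.
bits_per_saccade = mean_negative_log2_probability([0.5, 0.25, 0.5, 0.125])
timing_error = brier_score([0.9, 0.2, 0.6], [1, 0, 1])
```

Both measures reward a model for placing probability mass on what readers actually did, which a comparison of averaged counts cannot capture.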


4. Where the Eyes Move

In this chapter we demonstrate that the problem of determining where the eyes move during reading may be approached as a machine learning problem.¹ The models we introduce learn, from empirical data, where to move the eyes under different conditions associated with the words being read.

This chapter is divided into three parts. In the first part, section 4.1, we give some background to the models discussed in this chapter. In the second part, section 4.2, we present a transition-based model of saccade target selection as a classification task. We show how this model can be used to predict some basic characteristics of human saccade behavior. This section is based on the content of paper I. In the third part, section 4.3, we discuss some problems in treating saccade target selection as a classification task. In particular, given the large inherent variability in saccade behavior, exact categorical predictions for where the eyes move may be of limited value. Therefore, we propose that target selection models be defined as probabilistic models yielding a probability distribution over the outcome variable. This allows for a more fine-grained evaluation method, measuring the entropy, relative to a model, on unseen eye movement data. This section is based on the content of paper III.

¹ It should be pointed out that McDonald (2003) provides the first attempt, to our knowledge, at using classification methods for target selection. The work we present is a considerable extension to this work and classification is used here as one component in a larger and more flexible model that allows for the generation of complete fixation sequences over text.

4.1 Background

It has been argued that target selection during reading is based on strategies which have developed and become automated with years of reading experience. The model by Reilly and O’Regan (1998), for example, suggests that the eyes are guided by a simple strategy based on targeting the longest word in a right parafoveal window extending 20 characters to the right of the fixated word. This simple strategy, they show, gives better fit to empirical data than a more linguistically oriented strategy based on skipping high-frequency words. In the model by Brysbaert and Vitu (1998), the decision to skip a word is based on the length of the word as well as on how often a word of a certain length and at a certain distance can be skipped without inhibiting comprehension. This strategy, they assume, is learned from experience. The SERIF model (McDonald et al., 2005) extends this proposal beyond skipping to the selection of one of three candidate words in the forward perceptual span. A target is selected based on random sampling from a cumulative probability distribution, whose parameters are estimated empirically based on a logistic regression model. The classifiers and probabilistic models we present in this chapter can, similarly, be viewed as simple approximations to the accumulated experience a reader’s eye guidance system has built up over the years. The basic model, however, does not involve any particular assumption about the strategies or the variables that influence saccade target selection. These are instead parameters that should be explored experimentally.

4.2 Saccade Target Selection as Classification

We review the basic task before we consider the model. We use the following simple representations of a text and a fixation sequence. Let a text T be represented as a sequence of word tokens $w_1, w_2, \ldots, w_n$, and a fixation sequence F for T be represented as a sequence of token positions in T, $i_1, i_2, \ldots, i_m$. For example, the short text John gave Mary the book is represented by T = John, gave, Mary, the, book; and a fixation sequence over this text corresponding to John – gave – John – Mary – Mary – book is represented by F = 1, 2, 1, 3, 3, 5. The task we now consider is to predict the fixation sequence F for a given reader R on some text T. Generally, we also assume that this prediction is over a particular reading of the text T, as different readings of T by the same reader R would generally yield different fixation sequences. In the following, we outline a model for this task involving three basic components: a transition system for saccades in reading; a classifier that predicts the next transition and hence the saccade target; and a search algorithm that generates saccades over the reading of text. We discuss these components in turn.

4.2.1 Transition System

A transition system is an abstract machine consisting of a set of configurations (or states) and transitions between configurations. In the transition system here, a configuration represents a fixation state while a transition represents a saccade (fixation state update). A fixation state in this system is essentially a representation of the text relative to the fixated word, which may or may not be the word currently attended to. Given that a fixated word is generally also attended to at some point during the fixation, we may however assume that this is usually the case. A fixation configuration in the transition system is a triple C = (L, R, F), where

1. L is a list of tokens representing the left context, including the currently fixated token and all preceding tokens in the text.

2. R is a list of tokens representing the right context, including all tokens following the currently fixated token in the text.

3. F is a list of token positions, representing the fixation sequence so far, including the currently fixated token.

For example, assuming the text to be read is John gave Mary the book, then the configuration

([John, gave, Mary], [the, book], [1, 2, 1, 3])

represents the state where Mary is currently fixated, and John, gave, John were previously fixated, in that order.

For any text $T = w_1 \ldots w_n$, we define initial and terminal configurations as follows:

1. Initial: $C = ([\,], [w_1, \ldots, w_n], [\,])$

2. Terminal: $C = ([w_1, \ldots, w_n], [\,], F)$ (for any F)

In the initial configuration, all tokens remain to be read (i.e., L and F are empty). In the terminal configuration, all tokens have been read (i.e., R is empty), and at this point F may correspond to any fixation sequence.

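We then define the following transitions:²

1. Progress(n): $([\lambda \mid w_i], [w_{i+1}, \ldots, w_{i+n} \mid \rho], [\phi \mid i]) \Rightarrow ([\lambda \mid w_i, w_{i+1}, \ldots, w_{i+n}], \rho, [\phi \mid i, i+n])$

2. Regress(n): $([\lambda \mid w_{i-n}, \ldots, w_{i-1}, w_i], \rho, [\phi \mid i]) \Rightarrow ([\lambda \mid w_{i-n}], [w_{i-n+1}, \ldots, w_i \mid \rho], [\phi \mid i, i-n])$

3. Refixate: $([\lambda \mid w_i], \rho, [\phi \mid i]) \Rightarrow ([\lambda \mid w_i], \rho, [\phi \mid i, i])$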

This transition system accounts for the possible interword and intraword (refixation) saccades in reading, while also keeping track of the fixation sequence. The transition Progress(n) generates a progressive saccade of length n, which means that the next fixated word is n positions forward with respect to the currently fixated word (i.e., n − 1 words are skipped). Similarly, the transition Regress(n) generates a regressive saccade of length n. If the parameter n of either Progress(n) or Regress(n) is greater than the number of words remaining in the relevant direction, then the longest possible movement is made instead, in which case Regress(n) leads to a configuration that is similar to the initial configuration in that it has an empty L list, while Progress(n) leads to a terminal configuration. The transition Refixate, finally, generates a refixation, which means that the next word fixated is the same as the current one.

² We use the variables λ, ρ and φ for arbitrary sublists of L, R and F, respectively, and we write the L and F lists with their tails to the left, to maintain the natural order of words.


To give a simple illustration of how this system works, we may consider the transition sequence corresponding to the reading of the text John gave Mary the book used as an example in Section 4.2:

Initial      ⇒  ([ ], [John, gave, Mary, the, book], [ ])
Progress(1)  ⇒  ([John], [gave, Mary, the, book], [1])
Progress(1)  ⇒  ([John, gave], [Mary, the, book], [1,2])
Regress(1)   ⇒  ([John], [gave, Mary, the, book], [1,2,1])
Progress(2)  ⇒  ([John, gave, Mary], [the, book], [1,2,1,3])
Refixate     ⇒  ([John, gave, Mary], [the, book], [1,2,1,3,3])
Progress(2)  ⇒  ([John, gave, Mary, the, book], [ ], [1,2,1,3,3,5])
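The transition system and the example derivation above can be reproduced with a short Python sketch. The class and function names (`Config`, `progress`, `regress`, `refixate`, `replay`) are illustrative choices of ours. One boundary case is handled differently from the text: an over-long Regress(n) is clamped here so that the first token is fixated, whereas the text describes the result as having an empty L list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    left: tuple       # L: tokens up to and including the fixated token
    right: tuple      # R: tokens after the fixated token
    fixations: tuple  # F: 1-based positions fixated so far


def initial(tokens):
    return Config((), tuple(tokens), ())


def progress(c, n):
    n = min(n, len(c.right))   # longest possible forward movement
    target = len(c.left) + n
    return Config(c.left + c.right[:n], c.right[n:], c.fixations + (target,))


def regress(c, n):
    if not c.left:
        raise ValueError("no fixated token to regress from")
    n = min(n, len(c.left) - 1)  # simplification: never move before token 1
    target = len(c.left) - n
    return Config(c.left[:target], c.left[target:] + c.right, c.fixations + (target,))


def refixate(c):
    if not c.left:
        raise ValueError("no fixated token to refixate")
    return Config(c.left, c.right, c.fixations + (len(c.left),))


def replay(tokens, steps):
    """Apply a list of transitions in order and return every configuration."""
    configs = [initial(tokens)]
    for step in steps:
        configs.append(step(configs[-1]))
    return configs


tokens = ["John", "gave", "Mary", "the", "book"]
steps = [
    lambda c: progress(c, 1),
    lambda c: progress(c, 1),
    lambda c: regress(c, 1),
    lambda c: progress(c, 2),
    refixate,
    lambda c: progress(c, 2),
]
trace = replay(tokens, steps)
```

The last configuration in `trace` has an empty right context and the fixation sequence (1, 2, 1, 3, 3, 5), matching the final line of the derivation.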

4.2.2 Learning Transitions

The transition system we have described defines the set of possible saccade transitions in reading but involves no mechanism for deciding which transition to make in a given configuration. One way to introduce such a mechanism, however, is to use empirical data to train a classifier to predict the next transition, given any configuration. In order to train such a classifier we need to define a set of features to represent the data. In other words, we must specify the kind of information we assume to be relevant for the task. A useful starting point, then, is the evidence reviewed in chapter 2 about variables influencing where the eyes move during reading. Thus, we might want to include features, or predictors, for the length, frequency and predictability of a few words to the right of fixation, and perhaps also of the currently fixated word (we noted, for example, that the likelihood of refixating a word depends on the length of the word). Assuming that the empirical eye movement data is overlaid with this kind of information, we can create training data for classifiers. This is done by reconstructing the transition sequence for the reading of a text and by extracting the feature information associated with each resulting configuration. A training instance for the classifier then consists of a feature vector representation of a fixation state, and the observed saccade transition out of that state. At this stage, the problem has been reduced to a standard supervised machine learning problem and it is then straightforward to train a classifier using some learning algorithm to predict the most probable transition out of any configuration. There is a wide variety of learning algorithms that could be used for this purpose, including logistic regression, support vector machines, neural networks and decision trees.
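The construction of training instances described above can be sketched in Python as follows. The sketch reconstructs the transition labels from an observed fixation sequence and pairs each with a small feature dictionary. The feature set (word lengths only), the feature names and the function names are hypothetical simplifications; frequency and predictability features would require external resources that are not modeled here.

```python
def transition_label(current, target):
    """Name the transition that moves fixation from one position to another."""
    offset = target - current
    if offset > 0:
        return f"Progress({offset})"
    if offset < 0:
        return f"Regress({-offset})"
    return "Refixate"


def word_length(tokens, position):
    """Length of the token at a 1-based position, or 0 outside the text."""
    return len(tokens[position - 1]) if 1 <= position <= len(tokens) else 0


def training_instances(tokens, fixations):
    """Return (features, label) pairs, one per observed saccade."""
    instances = []
    current = 0  # position 0 stands for the initial configuration
    for target in fixations:
        features = {
            "current_length": word_length(tokens, current),
            "next1_length": word_length(tokens, current + 1),
            "next2_length": word_length(tokens, current + 2),
        }
        instances.append((features, transition_label(current, target)))
        current = target
    return instances


tokens = ["John", "gave", "Mary", "the", "book"]
instances = training_instances(tokens, [1, 2, 1, 3, 3, 5])
labels = [label for _, label in instances]
```

The resulting pairs can be passed to any standard supervised learner, after encoding the feature dictionaries in whatever vector format that learner expects.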

4.2.3 Search Algorithm

Once we have trained a classifier f that predicts the next transition t out of any configuration C, we can use it as an oracle for modeling target selection through the reading of a text $T = (w_1, \ldots, w_n)$. Let $R_c$ be the list R of configuration c and $F_c$ be the list F of configuration c. The following algorithm achieves this:

Read($w_1, \ldots, w_n$)
    $c \leftarrow ([\,], [w_1, \ldots, w_n], [\,])$
    while $R_c \neq [\,]$
        $t \leftarrow f(c)$
        $c \leftarrow t(c)$
    return $F_c$

We start in the initial configuration and, as long as there are more words to the right of fixation, we apply f(c) to get t, and then apply t(c) to update the current configuration. When the search terminates, we return the fixation sequence F of the current configuration c.
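A runnable Python rendering of the Read procedure might look as follows. It is a sketch under several assumptions of ours: a transition is encoded as a signed offset (positive for Progress, negative for Regress, zero for Refixate), moves are clamped to the text so that the first token is the earliest possible target, and a `max_steps` guard is added because an arbitrary classifier is not guaranteed to reach a terminal configuration. The two stand-in classifiers are hand-written rules, not trained models.

```python
def apply_transition(config, offset):
    """Apply a signed-offset transition to a (left, right, fixations) triple."""
    left, right, fixations = config
    words = left + right
    # Clamp the target to valid 1-based positions in the text.
    target = max(1, min(len(words), len(left) + offset))
    return words[:target], words[target:], fixations + [target]


def read(tokens, classify, max_steps=10_000):
    """Generate a fixation sequence by repeatedly querying the classifier."""
    config = ([], list(tokens), [])  # initial configuration
    steps = 0
    while config[1]:                 # while the right context is non-empty
        if steps == max_steps:
            raise RuntimeError("no terminal configuration reached")
        config = apply_transition(config, classify(config))
        steps += 1
    return config[2]


def always_next(config):
    """Stand-in classifier: always predict Progress(1)."""
    return 1


def skip_short(config):
    """Stand-in classifier: skip the next word if it is short and not the last."""
    _, right, _ = config
    return 2 if len(right) > 1 and len(right[0]) <= 3 else 1


tokens = ["John", "gave", "Mary", "the", "book"]
word_by_word = read(tokens, always_next)
skipping = read(tokens, skip_short)
```

With the second rule the word "the" is skipped, so the generated fixation sequence is [1, 2, 3, 5].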

4.2.4 Experimental Results

The model we have outlined so far is flexible enough to allow exploration of a range of different models of saccade target selection. As it stands, we have made no specific assumptions about what feature information to bring to the learning task, or which learning algorithm to use for guiding the decisions of the classifier. Rather, these are parameters to be explored empirically. We finish this section with an overview of some experimental results using a particular instantiation of the general model. In these experiments, we train separate classifiers for different readers and then compare the predicted and the observed fixation sequences on held-out test data.

The results we report are based on data from the English section of the Dundee eye tracking corpus (Kennedy and Pynte, 2005). This data set contains the eye tracking record of 10 native English-speaking adults reading 20 newspaper articles collected from The Independent newspaper. The newspaper articles consist of roughly 2500 words each, giving about 50000 words total. The eye movements were recorded using a Dr. Bouis eye-tracker, sampling the position of the right eye at a rate of 1000 Hz (once per millisecond).

The parameters of the classifiers are estimated on the first 16 texts in the Dundee corpus while the remaining four texts (17-20) are held out for validation and evaluation purposes. The results we report are based on the blind test data comprising the last two Dundee texts (19-20). The classifiers are trained using an off-the-shelf implementation of logistic regression available in Weka, a publicly available collection of machine learning software written in Java (Witten and Eibe, 2005). We limit the task in these experiments to that of predicting saccades falling roughly within the perceptual span of the fixated word (98.3% of all saccades in training data).

The classifiers are trained using only a few features known to influence saccade decisions, such as the word length and word frequency (broken down into five frequency classes) of the currently fixated word and of words to the right of fixation. The distance between the current fixation and recent fixations is also included as it influences regression probability. Regressions occur more often after skipping, for example. The accuracy of the models is measured as follows. First, we compute the fixation accuracy, that is, the proportion of words that are correctly fixated or skipped by the model, which we also break down into precision and recall for fixations and skips separately.³ Secondly, we compare the predicted fixation distributions to the observed fixation distributions, both over all words and broken down into five frequency classes of words.

| Reader | # sentences | Baseline accuracy | Model accuracy | Fixation Prec | Fixation Rec | Fixation F1 | Skip Prec | Skip Rec | Skip F1 |
|---|---|---|---|---|---|---|---|---|---|
| a | 136 | 53.3 | 70.0 | 69.9 | 73.8 | 71.8 | 69.0 | 65.8 | 67.4 |
| b | 156 | 55.7 | 66.5 | 65.2 | 85.8 | 74.1 | 70.3 | 80.4 | 75.0 |
| c | 151 | 59.9 | 70.9 | 72.5 | 82.8 | 77.3 | 67.4 | 53.1 | 59.4 |
| d | 162 | 69.0 | 78.9 | 84.7 | 84.8 | 84.7 | 66.0 | 65.8 | 65.9 |
| e | 182 | 51.7 | 71.8 | 69.1 | 78.4 | 73.5 | 75.3 | 65.2 | 69.9 |
| f | 157 | 63.5 | 67.9 | 70.9 | 83.7 | 76.8 | 58.7 | 40.2 | 47.7 |
| g | 129 | 43.3 | 56.6 | 49.9 | 80.8 | 61.7 | 72.2 | 38.1 | 49.9 |
| h | 143 | 57.6 | 66.9 | 69.4 | 76.3 | 72.7 | 62.8 | 54.3 | 58.2 |
| i | 196 | 56.4 | 69.1 | 69.6 | 80.3 | 74.6 | 68.2 | 54.7 | 60.7 |
| j | 166 | 66.1 | 76.3 | 82.2 | 81.9 | 82.0 | 65.0 | 65.4 | 65.2 |
| Average | 157.8 | 57.7 | 69.5 | 70.3 | 80.9 | 75.2 | 67.5 | 58.3 | 62.6 |

Table 4.1. Fixation and skipping accuracy on test data; Prec = precision, Rec = recall, F1 = balanced F measure.

Table 4.1 shows the fixation accuracy, and precision, recall and F1 (unweighted harmonic mean of precision and recall) for fixations and skips, for each of the ten different models and the average across all models. The fixation accuracy is compared to baseline models which always predict the most common type of saccade for a given reader (Progress(2) for readers a and e, and Progress(1) for the others).
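The measures reported in Table 4.1 can be illustrated with a small Python sketch. It assumes that each token is labelled `True` if fixated and `False` if skipped, and it uses toy data; the function names are chosen here for illustration and are not taken from the thesis.

```python
from typing import Dict, List


def fixation_accuracy(predicted: List[bool], observed: List[bool]) -> float:
    """Proportion of tokens that the model correctly fixates or skips."""
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


def prf(predicted: List[bool], observed: List[bool], label: bool) -> Dict[str, float]:
    """Precision, recall and F1 for one class (True = fixation, False = skip)."""
    both = sum(p == label and o == label for p, o in zip(predicted, observed))
    n_pred = sum(p == label for p in predicted)   # tokens the model assigns to the class
    n_obs = sum(o == label for o in observed)     # tokens the reader assigns to the class
    prec = both / n_pred if n_pred else 0.0
    rec = both / n_obs if n_obs else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"prec": prec, "rec": rec, "f1": f1}


# Toy data: True = token fixated, False = token skipped.
observed = [True, False, True, True, False, True]
predicted = [True, True, True, False, False, True]

print(fixation_accuracy(predicted, observed))
print(prf(predicted, observed, label=True))    # fixations
print(prf(predicted, observed, label=False))   # skips
```

The harmonic-mean relation can also be checked against the table itself: for reader a, 2 · 69.9 · 73.8 / (69.9 + 73.8) ≈ 71.8, which matches the reported fixation F1.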

If we consider the fixation accuracy, we see that all models improve substantially on the baseline models. The variation in the improvement between models is however quite large, ranging from 4.4 percentage points for the model of reader f to 20.1 percentage points for the model of reader e. Averaged over all models, the results improve over the baseline by 11.8 percentage points. Comparing the precision and recall for fixations and skips, we see that while precision tends to be about the same for both categories (with a few notable exceptions), recall is consistently higher for fixations than for skips. This is likely due to a tendency of the model to overpredict fixations, especially for low-frequency words. This has a great impact on the F1 measure, which is considerably higher for fixations than for skips.

³ Fixation/skip precision is the proportion of tokens fixated/skipped by the model that were also fixated/skipped by the reader; fixation/skip recall is the proportion of tokens fixated/skipped by the reader that were also fixated/skipped by the model.

[Figure omitted. x-axis: readers a to j; y-axis: Proportion (0.0 to 0.8); legend: Reader, Model.]

Figure 4.1. Proportion of fixated tokens grouped by reader and model

Figure 4.1 shows the distributions of fixations grouped by reader and model. The models are reasonably good at modeling the empirical fixation distributions of the readers. However, the models tend to overestimate the fixation rate, looking at more words than the readers, as noted above. This suggests that the models lack sufficient information to learn to skip words more often. As noted in chapter 2, an additional important determinant of the decision to skip a word, which we have not included in these models, is word predictability. It is possible that the inclusion of some estimate of word predictability, such as the predictions of an n-gram language model, would increase the skipping rate and improve the results.

Figure 4.2, finally, shows the mean observed and predicted fixation and skipping probability as a function of word frequency class, averaged over all models. The effect of word frequency on the models is comparable to its effect on the readers, although the models tend to exaggerate the observed effect. Thus, too few words are skipped in the lower to medium frequency classes (1–3), while too many of the most frequent words (5) are skipped. The skipping rate (and thus fixation rate) is, however, very well predicted for words in frequency class 4, that is, for words with an estimated 1001–10000 occurrences per million words.


[Figure omitted. x-axis: Frequency class (1 to 5); y-axis: Fixation probability (0.0 to 1.0); legend: Fixation (observed), Fixation (predicted), Skipping (observed), Skipping (predicted).]

Figure 4.2. Mean observed and predicted fixation and skipping probability for five frequency classes of words

4.3 Entropy-Based Model Assessment

In this section we discuss some problems in treating saccade target selection as a classification task and propose an alternative view of the problem which leads to a more refined model assessment based on entropy.

The classification-based approach to saccade modeling we outlined in the previous section relies on hard classification decisions. In other words, the model always gives a single discrete categorical prediction, which defines the most likely saccade target given the current fixation state. We showed that these predictions can be used to assess the accuracy of a model by comparing them to empirical observations using a 0-1 loss function which counts the number of correct (or incorrect) predictions. One limitation of this approach, however, is that some information is lost, such as how the predicted probability is partitioned between the possible classes. When an instance is misclassified, for example, we do not know how “close” the prediction may be to being correct. This is, however, relevant information since the large variability inherent in human saccade behavior renders hard classifications and exact predictions highly uncertain. The classification error, therefore, appears to be a crude measure of model performance.

If instead of stating the problem as a classification task, we consider it to be one of assigning probabilities to fixation sequences, we may assess models differently. In principle, we may then measure the distance between the predicted distribution of saccade targets given the model and the actual, or true, distribution. In practice, this may be achieved by measuring the entropy, or more precisely, by measuring an approximation to a quantity known as the cross entropy between probability distributions. Rather than asking how often the model agrees (exactly) with the observed human behavior, we may then ask how probable the observed behavior is from the model’s point of view. Intuitively, the entropy of a test sample relative to a model measures the uncertainty or average surprise associated with the observations in the test sample, as perceived by the model. A “good” model of human saccade behavior, then, is a model that assigns high probability and low entropy to representative test data. The lower the uncertainty, the better the model, being less surprised on average by the observed behavior. Or, from a slightly different view, by measuring the entropy of a model, we assess how similar the behavior that we can expect from the model is to the observed human behavior.⁴

The use of entropy presupposes a probabilistic model which defines a probability distribution over the outcome, that is, over the possible saccade transitions. Under this view, there are a number of candidate words at each fixation, where each candidate has a certain probability of being selected as the target for the subsequent saccade. Thus, the basic constraint we impose is that, given a text T, a saccade model can assign a probability to any arbitrary sequence of fixations F over the text:

$$P(F \mid T) = P(i_1, i_2, \ldots, i_m \mid T)$$

Given any such model, we may assess the entropy of the model using essentially the same transition-based model as we outlined in the previous section, even though we are not doing classification, but only estimating probabilities. We describe how this is done below but first we outline the notion of entropy and how we may apply it. For a random variable X with n outcomes, the entropy H(X) is

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$

where $p(x_i)$ is the probability of outcome $x_i$. The quantity H(X) is a measure of the uncertainty associated with the variable X, quantifying the expected or average surprise over all possible outcomes. Lower values of entropy indicate less average surprise, or likewise, lower uncertainty. Essentially, uncertainty is greatest when the chances of one outcome are no different from the chances of another. In other words, entropy is maximal when the probability mass is evenly distributed so that all outcomes are equally likely.
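As a small numerical illustration of the definition, the following Python sketch computes the entropy of two made-up distributions over four outcomes. The distributions are illustrative only and are not taken from the experiments.

```python
import math
from typing import Sequence


def entropy(p: Sequence[float]) -> float:
    """Entropy in bits; outcomes with zero probability contribute nothing."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


uniform = [0.25, 0.25, 0.25, 0.25]   # all outcomes equally likely
skewed = [0.70, 0.10, 0.10, 0.10]    # one outcome dominates

print(entropy(uniform))   # maximal for four outcomes: log2(4) = 2 bits
print(entropy(skewed))    # lower, since the outcome is less uncertain
```

The uniform distribution attains the maximum of log2(4) = 2 bits, while concentrating the probability mass on one outcome lowers the entropy.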

An often used variant of entropy, known as the cross entropy, allows us to compare probability distributions of a random variable X. By replacing the surprise value in the definition of entropy with an estimate derived from a model m, we get the cross entropy of the model with respect to the true model p:

$$H_c(p, m) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i \mid m)$$

⁴ In this context, it is worth noting that entropy-based measures of probabilistic models are widely used in many modeling fields, for example, in artificial intelligence (AI) and natural language processing (NLP).

The cross entropy $H_c(p, m)$ is an upper bound on the true entropy H(p), which means that the cross entropy of a model m(x) on some distribution p(x) is greater than or equal to the actual entropy of the true distribution H(p):

$$H(p) \leq H_c(p, m)$$

In principle, then, the cross entropy can be used as a model evaluator on the assumption that the lower the cross entropy, the better the model (i.e., the closer it is to the true entropy). This presupposes however that the true probability distribution p is known, which is not the case in practice. The solution is to estimate how well the model predicts a separate test sample drawn from the same distribution p. Better models of p will tend to assign higher probabilities, and thus lower surprisal, to the observed events in the test sample.
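The upper-bound property can be checked numerically. The sketch below uses a made-up "true" distribution p and two hypothetical model distributions; it assumes that every model assigns non-zero probability to each outcome that p can produce.

```python
import math
from typing import Sequence


def cross_entropy(p: Sequence[float], m: Sequence[float]) -> float:
    """Cross entropy in bits of model m with respect to the distribution p."""
    return -sum(pi * math.log2(mi) for pi, mi in zip(p, m) if pi > 0)


p = [0.5, 0.3, 0.2]             # illustrative "true" distribution
good = [0.45, 0.35, 0.20]       # hypothetical model close to p
poor = [0.10, 0.10, 0.80]       # hypothetical model far from p

true_entropy = cross_entropy(p, p)   # with m = p this is the entropy H(p)
print(true_entropy, cross_entropy(p, good), cross_entropy(p, poor))
```

Both models score at or above H(p), and the model closer to p comes closer to the bound.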

The cross entropy of the test data, given the model, is known as the log-probability (log-prob or LP). Given a saccade model M and a test sample F (for some text T), the log-prob is:

$$LP(M) = -\frac{1}{n} \log_2 p(F \mid M)$$

This formula gives the average surprise associated with an observed saccade behavior over some text, as perceived by the model. Intuitively, a better saccade model assigns lower log-prob to the test sample, being less surprised on average. Importantly, in order for the log-prob to be a reliable performance measure, we must use a training sample to estimate the parameters of the model and then a different but representative sample to test or evaluate the model.

Once we have trained a probabilistic model, or probabilistic classifier, we can derive the necessary probabilities for assessing the entropy of a test sample relative to the model using the same transition-based model as before. The important difference is that we do not use the classifier to return the most probable saccade transition given the model, but, instead, to return the probability assigned by the model to the observed saccade transition in the test data. We sum the surprise for each observed saccade target in the data (the negative log of the probability) and the log-prob is then given by the average.
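The procedure just described can be sketched as follows. The sketch assumes that the probability of the whole fixation sequence factors into one probability per observed transition, and it takes n to be the number of observed transitions, which is consistent with reporting entropy in bits per fixation. The `predict_proba` interface and the toy model are hypothetical.

```python
import math
from typing import Any, Callable, Dict, Iterable, Tuple


def log_prob(observations: Iterable[Tuple[Any, str]],
             predict_proba: Callable[[Any], Dict[str, float]]) -> float:
    """Average surprise in bits of the observed transitions under the model."""
    total, n = 0.0, 0
    for config, transition in observations:
        p = predict_proba(config)[transition]   # probability given to what the reader did
        total += -math.log2(p)                  # surprise of this observation
        n += 1
    return total / n


def toy_model(config: Any) -> Dict[str, float]:
    """Hypothetical model that ignores the configuration entirely."""
    return {"forward": 0.5, "skip": 0.25, "regress": 0.125, "refixate": 0.125}


# Toy test sample: (configuration, observed transition) pairs.
sample = [(None, "forward"), (None, "forward"), (None, "skip"), (None, "regress")]
print(log_prob(sample, toy_model))   # (1 + 1 + 2 + 3) / 4 bits
```

A model that assigned higher probability to the observed transitions would obtain a lower value.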

Figure 4.3 shows the result of one experiment using entropy-based model assessment. In this experiment, saccades of different lengths are grouped into five saccade types⁵ and the model is assessed with respect to the entropy assigned to each of these types in the test data. The entropy is in bits per fixation, averaged over all ten models (of the ten readers in the Dundee corpus). The results identify the relative difficulty in modeling different types of saccades. We see that the model is more surprised, on average, when readers refixate a word or regress to a previous word than when they move their eyes to the next word or skip over a word. We also see, as we would expect, that regressive saccades are much more surprising than any other saccade type.

⁵ The types are forward (move forward to next token); regress (regress to any previous token); refixate (fixate current token); skip (move forward to the second next token); and other (move forward to any other token).

[Figure omitted. x-axis: Saccade Type (Regress, Refixate, Forward, Skip, Other); y-axis: Entropy per Fixation (axis ticks 2 to 4).]

Figure 4.3. Entropy per fixation grouped by saccade type averaged over all models of readers a-j.

More generally, this is an example of a particular setup of the model that provides a way to test different learning strategies and predictors with respect to, for example, how well they reduce the entropy of regressive eye movements in reading. Analyzing the conditions under which regressions occur is considerably more difficult in the classification-based version of the model since only very few of the classifications result in regressive eye movements.


5. When the Eyes Move

In this chapter we demonstrate that the problem of determining when the eyes move during reading may be approached as a time-to-event modeling problem.¹ More specifically, given their wide use in studying events that occur over time, we introduce Cox hazards models for eye movement modeling and show how these models can be applied to address questions about the strength, as well as the timing, of processes that influence the decision to move the eyes. We further present an evaluation metric, commonly applied to time-to-event models, for assessing the predictive accuracy of the models. This metric is based on precisely the same intuition as the measure of entropy we developed in chapter 4: good models assign high probability to empirically observed behavior. Here, this intuition is expressed differently in order to assess probabilistic predictions about temporal events.

The chapter is divided into two parts. In the first part, section 5.1, we provide background that serves both as an introduction to the methods and models later presented and as a concise summary of the main results of this chapter. In the second part, section 5.2, we first review the basic concepts and methods we use, and then summarize some of the experimental results from applying these methods to eye tracking corpus data.

This chapter is based on the contents of papers IV and V.

5.1 Background

Time-to-event modeling is concerned with studying the length of time before an event of interest occurs and, more generally, addresses the challenges involved in modeling duration data. Although predominantly used in biomedical sciences, where the approach is referred to as survival analysis,² the notion of time as involving a past, a present, and a future makes time-to-event modeling generally useful for analyzing events occurring in time (Aalen et al., 2008). Essentially, the methods can be applied to any data involving temporal observations with a well-defined starting point (point when the “clock” starts) and end point (point when the “clock” stops). The distance on the time scale between these points is generally referred to as survival time (or time-to-event), regardless of the nature of the event (Hosmer et al., 1999).

¹ We note that paper II also addresses the problem of determining when the eyes move during reading (as well as where). Although introducing some novel ideas, the model presented in that paper is simpler but similar in style to more sophisticated models of eye movement control like E-Z Reader. Because papers IV and V offer more novelty to the field, we focus the discussion in this chapter exclusively on the material presented in these papers.

² In survival analysis, the methods are used to study lifetime data (hence the name) and thus often, literally, the time until death occurs.

Central notions of time-to-event modeling, such as survival and hazard to describe the temporal nature of events, are frequently used also in experimental and mathematical psychology, in particular in the study of human response time (see, for example, Luce, 1986; Van Zandt, 2002), and, to a lesser extent, in eye movement and reading research since the work by McConkie and colleagues (McConkie et al., 1994; McConkie and Dyre, 2000; Yang and McConkie, 2001; Feng, 2009). Here, we extend this latter line of work by the use of Cox hazards modeling,³ which is, by a wide margin, the most common method in time-to-event analysis for analyzing the relationship of explanatory variables to survival time.

In applying Cox modeling to reading time data, it makes sense to focus on the survival function of the data, as Cox models are typically used to estimate the chances of survival over time under different conditions. Hence, we set ourselves the overall goal to model the survival function of fixation durations in reading. In other words, we aim to model, as accurately as possible, the probability that a fixation lasts beyond a given length of time. The Cox proportional hazards model is introduced for this purpose (Cox, 1972) and we show how this model can be applied to account for covariate effects, such as the length and frequency of the word fixated, on the chances of survival. It is shown that this model predicts survival better than a simpler model which does not take such effects into account.

However, the Cox proportional hazards model is based on one particular assumption, the assumption of proportional hazards. This means that covariate effects on the hazard, or risk, of making a saccade at any time are constant over time. In many applied settings of survival analysis this assumption does not hold.⁴ We evaluate this assumption of the Cox proportional hazards model and the results suggest that the hazards are not proportional. Instead, the influence of covariates typically changes over time and decays in the long run. Moreover, in a second model which relaxes the proportional hazards assumption, we demonstrate that by partitioning the time-axis and allowing the strength of covariates to vary over time, we reduce the prediction error notably, when compared both to the previous simpler estimate and the covariate-adjusted model of the survival function based on the Cox proportional hazards model. This supports the initial result that the effect on the hazard is different at different points in time. While we do not provide a concise explanation why such time-varying effects should occur, these findings resonate, from a psycholinguistic point of view, with the idea that different kinds of information are made available to the processing system at different times. If certain processes (e.g., lexical processing) act on the output of others (e.g., visual processing) during the course of a fixation, we would, intuitively, expect the strength of covariates to vary with time. Taken together, our results suggest that Cox hazards modeling with time-varying effects can recover some basic time course characteristics of the short-lived processes that influence the decision to move the eyes. Further research, however, is required to better understand the nature of these results.

³ The proposal to use Cox models for eye movement modeling is due to Feng (2009).

⁴ For example, a new drug may work well initially but then gradually lose its efficacy over time. Or conversely, a drug may be effective in the long run but perhaps have some small adverse effect early after administration.

5.2 Saccade Timing as Time-to-Event Modeling

In this section, we first review the basic concepts and methods we make use of in applying time-to-event modeling to the timing of saccades during reading. This includes an outline of the survival and the hazard functions, the Kaplan-Meier estimate of the survival function, the Cox proportional hazards model, the extended Cox model with time-varying effects, and the Brier score for model evaluation. Subsequently, we summarize experimental results from applying these methods to data.

5.2.1 The Survival Function

We denote by T the random variable for the survival time, or duration, of a reading fixation.⁵ The saccade terminating a fixation then defines the event of interest for a given observation (fixation). Since T denotes time, its possible values include all non-negative numbers, that is, T ≥ 0. Next, we denote by t any specific value of interest for the random variable T. The survival function, denoted S(t), gives the probability that the survival time is longer than some specified time t. In other words, S(t) is the probability that the random variable T exceeds the specified time t:

$$S(t) = P(T > t)$$

The survival function has the following property:

$$S(t) \leq S(t') \quad \text{if } t \geq t'$$

where t and t′ denote two specified values of time. Thus, S(t) is a non-increasing function of time, heading downward as t increases. Furthermore, the following properties are usually assumed:

$$S(t) = \begin{cases} 1 & \text{for } t = 0 \\ 0 & \text{for } t = \infty \end{cases}$$

⁵ In the previous chapter, T denotes a text. Throughout this chapter, T denotes survival time.


Given these properties the survival function can, theoretically, be graphed against time as a decreasing smooth curve that starts at one and approaches zero as time tends to infinity. In practice, however, the value of an estimated survival function is constant between observed survival times $t_1, t_2, \ldots, t_n$, yielding a step function, rather than a smooth curve.

5.2.2 Kaplan-Meier Survival Estimate

The Kaplan-Meier estimate (Kaplan and Meier, 1958) is the most frequently used method for estimating the survival function from time-to-event data in the presence of censored observations.⁶ The Kaplan-Meier estimate is based on the simple intuition that in order to be alive at some point in time, it is necessary to survive all previous time points. Let the observed survival times in a sample of n observations be ordered such that:

$$t_1 \leq t_2 \leq \ldots \leq t_n$$

The Kaplan-Meier estimate of the probability of surviving beyond time t is:

$$S(t) = \prod_{t_i \leq t} \frac{n_i - d_i}{n_i}$$

where $n_i$ is the number of observations at risk (alive and not censored) just prior to time $t_i$, and $d_i$ is the number of deaths at $t_i$ (i.e., the number of observations with survival time $t_i$).

When there are no censored observations, the Kaplan-Meier estimate reduces to the proportion of survival times greater than t. In other words, the survival function at time t is given by the number of fixations still at risk of terminating at time t, divided by the total number of fixations:

$$S(t) = \frac{n_i}{n}$$
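The product formula, and its reduction to a simple proportion when nothing is censored, can be illustrated with a short Python sketch. The fixation durations are made up, and the function is a minimal illustration rather than the implementation used in the experiments.

```python
from typing import Dict, List


def kaplan_meier(times: List[float], events: List[bool]) -> Dict[float, float]:
    """Kaplan-Meier estimate S(t_i) at each distinct event time t_i.

    events[j] is True if observation j ended in the event (a saccade),
    and False if it was censored at times[j].
    """
    surv = 1.0
    estimate = {}
    for t in sorted({tj for tj, e in zip(times, events) if e}):
        n_i = sum(tj >= t for tj in times)                         # at risk just prior to t
        d_i = sum(tj == t and e for tj, e in zip(times, events))   # events exactly at t
        surv *= (n_i - d_i) / n_i
        estimate[t] = surv
    return estimate


durations = [150, 200, 200, 250, 300]   # toy fixation durations in ms
uncensored = kaplan_meier(durations, [True] * 5)
print(uncensored)

# Without censoring the estimate equals the proportion of durations greater than t.
for t, s in uncensored.items():
    assert abs(s - sum(d > t for d in durations) / len(durations)) < 1e-12

# With one censored observation at 200 ms the two quantities no longer coincide.
print(kaplan_meier(durations, [True, True, False, True, True]))
```

In the uncensored case the estimate steps through 0.8, 0.4, 0.2 and 0.0, which is the proportion of durations exceeding each observed time.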

The Kaplan-Meier estimate does not take account of the additional influence that covariates may have on survival time. To describe how other factors may influence survival, the Cox proportional hazards model is typically used instead. Mathematically, this model is specified in terms of the hazard function rather than the survival function. There is, however, a defined relationship between the two functions such that it is always possible to derive one from the other. We introduce the hazard function next, before turning to the Cox proportional hazards model.

⁶ Censoring in time-to-event analysis occurs when the precise survival time for an observation is not known, only that the observation remained alive for as long as it was observed.


5.2.3 The Hazard Function

Intuitively, the hazard function, denoted h(t), gives the risk for the event to occur in the next instant given that it has not yet happened. It is defined as:

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T < t + \Delta t \mid T \geq t)}{\Delta t}$$

The conditional probability in the numerator of the hazard function formula gives the probability that the survival time, T, will lie in the interval between t and t + ∆t, where ∆t denotes an infinitesimally small interval of time, given that the survival time is greater than or equal to t. More intuitively, it is described by:

$$h(t) = \frac{f(t)}{S(t)}$$

where f(t) denotes the probability density function (pdf) of T, and S(t) denotes the survival function of T. In other words, it is equal to the unconditional probability density of a saccade at time t, f(t), divided by the probability that a saccade does not occur before t, S(t). In eye movement research, the hazard has been referred to as the momentary, or instantaneous saccade likelihood, or as the conditional saccade rate (Yang and McConkie, 2005).
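The step from the limit definition of the hazard to the ratio of density and survival function is not spelled out in the text. A brief derivation, assuming that T is continuous with density f (so that P(T ≥ t) = S(t)), runs as follows:

```latex
\begin{align*}
h(t) &= \lim_{\Delta t \to 0} \frac{P(t \leq T < t + \Delta t)}{\Delta t \, P(T \geq t)} \\
     &= \frac{1}{S(t)} \lim_{\Delta t \to 0} \frac{S(t) - S(t + \Delta t)}{\Delta t} \\
     &= \frac{-S'(t)}{S(t)} = \frac{f(t)}{S(t)}
\end{align*}
```

Since $h(t) = -\frac{d}{dt} \log S(t)$, integrating gives $S(t) = \exp\left(-\int_0^t h(u)\,du\right)$, which is one way of seeing that each function can be derived from the other. As a simple check, a constant hazard $h(t) = \lambda$ corresponds to $S(t) = e^{-\lambda t}$ and $f(t) = \lambda e^{-\lambda t}$.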

5.2.4 The Cox Proportional Hazards Model

The Cox proportional hazards model regresses the hazard function, h(t), on a set of covariates. Let $x_1, x_2, \ldots, x_p$ be the values of p covariates $X_1, X_2, \ldots, X_p$, and let $\beta_1, \beta_2, \ldots, \beta_p$ be the corresponding regression parameters. According to the Cox proportional hazards model, the hazard function, h(t), for an observation with covariate values $x_1, x_2, \ldots, x_p$, is given as:

$$h(t) = h_0(t) \exp\left(\sum_{i=1}^{p} \beta_i x_i\right)$$

The model states that the hazard function at any time t is the product of two quantities: the baseline hazard function, denoted $h_0(t)$, and the exponentiated linear sum $\beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p$. The baseline hazard, $h_0(t)$, defines the hazard at time t for an observation whose covariate values are equal to zero, i.e., if $x_1 = x_2 = \ldots = x_p = 0$, then $h(t) = h_0(t) \exp(0) = h_0(t)$. That is, when there are no covariates in the model, the hazard function reduces to the baseline hazard. While the baseline hazard involves t and varies with time, the exponential expression does not. This is the assumption of proportional hazards. Thus, although the hazard rate may vary over time, the assumption is that covariate effects multiply the baseline hazard by a constant factor at all times.


Fitting a Cox proportional hazards model yields an equation for the hazard at time t as a function of one or more explanatory variables. The size of the effects of the explanatory variables is usually interpreted in terms of their hazard ratios, which are obtained by exponentiating the values of the estimated regression coefficients. For covariates with hazard ratios less than 1, increasing values of the covariates are associated with lower hazard and longer survival times. Conversely, when hazard ratios are greater than 1, increasing values of the covariates are associated with higher hazard and shorter survival times. More specifically, the hazard ratio measures the change in the risk for the event to happen relative to a unit increase in the value of the covariate, assuming the values of the other covariates in the model are held constant. A hazard ratio of 1 means no effect, a hazard ratio below 1 means a reduced relative risk and a hazard ratio above 1 means an increased relative risk for the event. Under the assumption of proportional hazards, the hazard ratio is constant and independent of time.
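The time-independence of the hazard ratio can be seen in a small numerical sketch. The baseline hazard and the coefficient values below are invented for illustration and are not estimates from the thesis.

```python
import math
from typing import Callable, Sequence


def cox_hazard(t: float, x: Sequence[float], beta: Sequence[float],
               h0: Callable[[float], float]) -> float:
    """Hazard at time t under the Cox proportional hazards model."""
    return h0(t) * math.exp(sum(b * xi for b, xi in zip(beta, x)))


def baseline(t: float) -> float:
    """Hypothetical baseline hazard that increases with time (t in ms)."""
    return 0.001 * t


beta = [-0.05, 0.30]        # hypothetical coefficients: word length, log frequency
long_word = [8.0, 2.0]      # covariate values for two hypothetical fixated words
short_word = [3.0, 2.0]

for t in (100.0, 300.0):
    ratio = cox_hazard(t, long_word, beta, baseline) / cox_hazard(t, short_word, beta, baseline)
    print(t, ratio)         # the same ratio at both times: exp(-0.05 * 5)

print(math.exp(beta[0]))    # hazard ratio for a one-unit increase in word length
```

The baseline hazard cancels in the ratio, so the ratio depends only on the coefficients and the difference in covariate values.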

The primary diagnostic test for assessing whether the hazard is proportional in covariates is based on the Schoenfeld residuals (Schoenfeld, 1982) from the fitted Cox proportional hazards model. The Schoenfeld residual for an observation with survival time $t_i$ for a given covariate is the value of the covariate, minus a weighted average of the values of the covariate for the remaining observations that are still at risk of being terminated at the same time $t_i$. If the proportional hazards assumption holds for the covariate, the Schoenfeld residuals for that covariate will be independent of time. This can be tested by correlating the residuals with the survival times. A p-value less than .05 then indicates a departure from proportionality and hence that the effect on the hazard changes over time. In this case, we may still use the Cox proportional hazards model, at the risk of obtaining biased estimates, or we can use an extended version of the Cox proportional hazards model to try to account for the time-varying effects.
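A minimal sketch of the residual computation for a single covariate is given below. It rests on several assumptions that are not stated in the text: there are no censored observations, the coefficient `beta` has already been estimated, the risk set at $t_i$ contains every observation with survival time at least $t_i$ (including the observation itself), and the weights are $\exp(\beta x_j)$. Only the correlation is computed; the significance test is left out.

```python
import math
from typing import List


def schoenfeld_residuals(times: List[float], x: List[float], beta: float) -> List[float]:
    """Covariate value minus its weighted average over the risk set at each time."""
    residuals = []
    for t_i, x_i in zip(times, x):
        at_risk = [x_j for t_j, x_j in zip(times, x) if t_j >= t_i]
        weights = [math.exp(beta * x_j) for x_j in at_risk]
        expected = sum(w * x_j for w, x_j in zip(weights, at_risk)) / sum(weights)
        residuals.append(x_i - expected)
    return residuals


def pearson(a: List[float], b: List[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var_a = sum((ai - ma) ** 2 for ai in a)
    var_b = sum((bi - mb) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


times = [1.0, 2.0, 3.0]        # toy survival times
x = [1.0, 0.0, 1.0]            # toy covariate values
res = schoenfeld_residuals(times, x, beta=0.0)
print(res, pearson(res, times))
```

A correlation close to zero would be in line with proportional hazards for the covariate; in practice the test is carried out on far more observations than in this toy example.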

5.2.5 Extended Cox Model: Time-Varying Effects

The extended Cox model is based on the intuition that, even though the proportional hazards assumption does not hold over the whole time period, it may still hold over shorter time periods. In other words, the idea is to partition the time axis into shorter time intervals, over which the effect on the hazard remains constant. Separate effects can then be estimated for each time interval.

The extended Cox model can be expressed as:

$$h(t) = h_0(t) \exp\left(\sum_{i=1}^{p} \beta_i x_i + \sum_{i=1}^{p} \gamma_i x_i g_i(t)\right)$$

where $\beta_i$ and $\gamma_i$ are unknown regression parameters and $g_i(t)$ is some specified function of time for $x_i$. The extended Cox model, like the Cox proportional hazards model, gives the hazard at time t for an observation with a given specification of covariate values. Note that $\gamma_i$ may be equal to zero, in which case there is no defined interaction with time for covariate $x_i$. Thus, the extended Cox model allows the inclusion of covariates with as well as without time-varying effects.

A common form for gi(t) is to let it be a “heaviside function” of time, wheregi(t) = 1 when t is greater than some specified time, t0, and gi(t) = 0 when tis less than or equal to t0:

$$
g_i(t) =
\begin{cases}
1 & \text{if } t > t_0 \\
0 & \text{if } t \le t_0
\end{cases}
$$

Based on the idea that the proportional hazards assumption holds at least over shorter time periods, separate effects can then be estimated for each period $t \le t_0$ and $t > t_0$. This use of heaviside functions may be extended to give separate estimated effects over several time intervals. In other words, each covariate can then relate differently to the hazard at several different points in time. The choice of the cut-off value $t_0$ is typically based either on measures of central tendency or on change-points in the shape of the hazard or survival curve.

The hazard ratios from the model are then interpreted as functions of time. We consider a simple model involving only one explanatory variable $X$ and a single heaviside function $g(t)$. The extended Cox model is then given as follows:

$$
h(t) = h_0(t)\exp(\beta x + \gamma x g(t))
$$

where

$$
g(t) =
\begin{cases}
1 & \text{if } t > t_0 \\
0 & \text{if } t \le t_0
\end{cases}
$$

We then obtain two different hazard ratios for the effect of $X$, depending on the value of $t$: one for the effect when $t$ is less than or equal to $t_0$ and another when $t$ is greater than $t_0$:

$$
\begin{aligned}
t \le t_0 &: \quad HR = \exp(\beta) \\
t > t_0 &: \quad HR = \exp(\beta + \gamma)
\end{aligned}
$$
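As a minimal sketch of how these two hazard ratios arise, the following Python snippet evaluates the heaviside function and the resulting hazard ratio for a one-unit increase in $X$. The function names and the numeric values of `beta`, `gamma` and `t0` are illustrative assumptions, not estimates from the thesis.

```python
import math

def heaviside(t, t0):
    """g(t): 1 if t is greater than the cut-off t0, otherwise 0."""
    return 1 if t > t0 else 0

def hazard_ratio(t, beta, gamma, t0):
    """Hazard ratio for a one-unit increase in x at time t: exp(beta + gamma * g(t))."""
    return math.exp(beta + gamma * heaviside(t, t0))

# Hypothetical coefficients and cut-off (in ms), chosen only for illustration.
beta, gamma, t0 = 0.30, -0.25, 200.0

early = hazard_ratio(150.0, beta, gamma, t0)  # t <= t0, equals exp(beta)
late = hazard_ratio(250.0, beta, gamma, t0)   # t > t0, equals exp(beta + gamma)
```

With a negative `gamma`, as assumed here, the effect of the covariate on the hazard is weaker after the cut-off than before it.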

There is a mathematically equivalent way to write this model that uses two heaviside functions instead of one and no main effect for the explanatory variable $X$. Although less intuitive in its specification, this version is sometimes preferred in practice as it eliminates the need to add up coefficients in the final model.
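Written out, the alternative specification can be sketched as follows. The symbols $\delta_1$, $\delta_2$, $g_1$ and $g_2$ are introduced here for illustration only and do not appear in the original notation.

```latex
h(t) = h_0(t)\exp\bigl(\delta_1 x\, g_1(t) + \delta_2 x\, g_2(t)\bigr),
\qquad
g_1(t) = \begin{cases} 1 & \text{if } t \le t_0 \\ 0 & \text{otherwise} \end{cases}
\qquad
g_2(t) = \begin{cases} 1 & \text{if } t > t_0 \\ 0 & \text{otherwise} \end{cases}

% Since g_2(t) = g(t) and g_1(t) = 1 - g(t):
\delta_1 x\, g_1(t) + \delta_2 x\, g_2(t)
  = \delta_1 x + (\delta_2 - \delta_1)\, x\, g(t)
\quad\Longrightarrow\quad
\beta = \delta_1, \qquad \gamma = \delta_2 - \delta_1
```

Under this parameterization the two hazard ratios are read off directly as $\exp(\delta_1)$ for $t \le t_0$ and $\exp(\delta_2)$ for $t > t_0$, without summing coefficients.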


5.2.6 Prediction Error

We have described three different approaches that can be used to derive estimates of the survival function for a fixation duration distribution: the Kaplan-Meier estimate, the Cox proportional hazards model, and the extended Cox model with time-varying effects. Given such models we want to assess how accurate they are in predicting the chances of survival over time. For this purpose we use the Brier score, a commonly used evaluation metric for survival models.

The basic intuition behind the Brier score is as follows. If an observation is still alive at some specified time $t$, the predicted survival probability for that observation should be high at that time, and conversely, if the observation is not alive (i.e., the saccade has occurred by time $t$), the survival probability should be low. It is worth noting that in using the Brier score, we once again apply the idea of using entropy as a measure of how similar a model is to observed data. Specifically, we expect that a good model estimates high chances (low entropy) of being alive for those observations that are alive at some stated value $t$, and low chances (high entropy) for those observations which are not.

Let $Y_i(t)$ denote the survival status of observation $i$ at some specified time $t$, which is equal to 1 if the observation is still alive at time $t$ and 0 otherwise. Let further $S_i(t)$ denote the predicted survival probability, given a model, for observation $i$ at the same time $t$. The Brier score at time $t$ for observation $i$ is defined as the squared difference between the observed survival status (0 or 1) and the predicted survival probability at time $t$. Calculated over all observations $i_1, i_2, \ldots, i_n$ in a test sample, the Brier score gives the mean squared error between the observed survival status and the predicted survival probability at some specified point in time $t$:

$$
BS(t) = \frac{1}{n}\sum_{i=1}^{n} \bigl(Y_i(t) - S_i(t)\bigr)^2
$$

Note that the Brier score is a function of time and may be used to evaluate the prediction error of a model of $S(t)$ at any stated value of time $t$. At each value of $t$, the Brier score is based on an average computed over all observations in the sample. The lower the Brier score, the lower the prediction error. Possible values for the Brier score range from 0 to 1. A non-informative model that assigns a survival probability of 0.5 to every observation gives a Brier score of 0.25 at time $t$. Often, however, the Kaplan-Meier estimate, which does not depend on covariate information, is used as the basis for comparison. The Brier score is usually followed over time, from the shortest to the longest survival time in the test data. In this way we obtain a prediction error curve that visualizes how accurate a model is in predicting survival over the whole distribution.
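The definition above can be made concrete with a short Python sketch. It assumes that every fixation ends in an observed saccade (no censoring) and that an observation counts as alive at time $t$ when its duration is strictly greater than $t$; both are simplifying assumptions, and the function name is hypothetical.

```python
def brier_score(durations, predicted_survival, t):
    """Mean squared error between survival status and predicted survival at time t.

    durations: observed fixation durations (same time unit as t)
    predicted_survival: model-predicted S_i(t), one value per observation
    """
    if len(durations) != len(predicted_survival):
        raise ValueError("one prediction is required per observation")
    total = 0.0
    for duration, s_hat in zip(durations, predicted_survival):
        alive = 1.0 if duration > t else 0.0  # observed survival status Y_i(t)
        total += (alive - s_hat) ** 2
    return total / len(durations)
```

Evaluating this function on a grid of time points, from the shortest to the longest duration in the test data, yields the prediction error curve described in the text.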


Variable

word length
log word frequency
log bigram probability
syntactic surprisal
saccade distance
eccentricity

Table 5.1. Covariates in the Cox proportional hazards model (time-constant effects model).

5.2.7 Experimental Results

We now summarize some experimental results reported in paper V. The overall purpose of these experiments is to compare the predictive performance of models with time-constant effects to models with time-varying effects using the methods we have described. Overall, these experiments show that (i) the use of the Cox proportional hazards model reduces prediction error slightly over the Kaplan-Meier survival estimate; (ii) some covariates have time-varying effects on the hazard; and (iii) a model with time-varying effects reduces prediction error notably over the previous two models. In this summary, we focus exclusively on the prediction error of the models. However, a detailed summary of the regression coefficients, hazard ratios, p-values, and 95% confidence intervals, along with Schoenfeld plots of time-varying effects, is contained in paper V.

The results are based on first fixation duration data from the Dundee Corpus. The covariates we use in fitting the Cox proportional hazards model are shown in Table 5.1.⁷ Word length is measured in number of character spaces; log word frequency is the natural logarithm of the word’s estimated frequency in language use; bigram probability refers to the conditional probability of the word given the preceding word. Syntactic, or structural, surprisal (originally due to Hale, 2001) is a measure of the uncertainty, in the information-theoretic sense, of a word in its syntactic context. Intuitively, surprisal increases when strongly held structural expectations turn out to be incorrect at a word. Saccade distance refers to the distance between the previous and the current within-word fixation position, and eccentricity to the number of characters by which the current fixation position deviates from the center of the word. This collection represents a fairly standard set of covariates often used in regression analysis of reading time data.

⁷ Since the goal of this study is not to motivate or provide evidence for novel covariate influences on eye movements, we keep our review of covariates short here. A longer discussion of the covariates, in the context of time-varying effects, is found in paper V.


Word frequency and bigram probability estimates are based on occurrences in the Google 5-gram corpus (Brants and Franz, 2006), while surprisal estimates are based on a probabilistic parser (Roark et al., 2009) trained on the Wall Street Journal section of the Penn Treebank (Marcus et al., 1993).

The model parameters are estimated on the first 16 texts in the corpus. The remaining four texts (17-20) are held out for evaluation, giving roughly an 80/20 percent training/test split of the data. The data consist in total of 175884 first fixations, 141692 for training and 34192 for evaluation.

Models

We introduce the following shorthand notation for the covariates in the model. Let $X_1$ = word length, $X_2$ = log word frequency, $X_3$ = log bigram probability, $X_4$ = syntactic surprisal, $X_5$ = saccade distance, and $X_6$ = eccentricity. The Cox proportional hazards model for the hazard function then takes the form:

$$
h(t) = h_0(t) \times \exp\left(\sum_{i=1}^{6} \beta_i x_i\right)
$$

We assess the assumption of proportional hazards for this model by correlating the scaled Schoenfeld residuals for each individual covariate against rank-ordered survival time. Following standard use of this test, a rejection of the null hypothesis at the 5% level indicates that the hazard ratio for a given covariate is non-proportional and thus that the covariate’s effect on the hazard is varying over time. The results indicate non-proportional hazards in four covariates, namely word length, log word frequency, saccade distance and eccentricity. No evidence of non-proportionality is found for log bigram probability or syntactic surprisal. While we do not discuss the time-varying effects in any detail in this summary, we note that the main result based on our analysis of the Schoenfeld test is that all time-varying covariates show similar decaying effects over time. For example, the (positive) effect of word frequency on the hazard appears to be strong between 180 and 210 ms. During this time, the frequency of the word has a relatively strong impact on the hazard. For fixations still surviving at around 300 ms, however, the risk of making a saccade in the next instant does not appear to depend much on the frequency of the word.

To account for these time-varying effects we consider an extended Cox model with heaviside functions of time for covariates with non-proportional hazards. In other words, we define $g(t)$ as heaviside functions of time, taking the value 1 if $t$ is greater than some specified value, $t_0$, and the value 0 when $t$ is less than or equal to $t_0$. Our choice of cut-off values for $t_0$ is based on the shape of the empirical hazard function for reading fixations. As described in Feng (2009), the hazard function for the distribution of reading fixations is characterized by four periods: an early slowly rising period until about 130 ms, a second rapidly rising period until about 180 ms, a third less rapidly rising period until about 250 ms, and then a slow decline with a long right tail.


We extend the Cox proportional hazards model so that we obtain four hazard ratios for each of the time-varying covariates, corresponding to four different time intervals identified by three change-points at 130, 180, and 250 ms. As in the original model, a single hazard ratio is obtained for each of the covariates bigram probability and syntactic surprisal. The time-varying hazards model is specified as follows:⁸

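$$
h(t) = h_0(t) \times \exp\Bigl(
\Bigl[\sum_{i=1}^{4} \beta_{1i} x_1 \times g_i(t)\Bigr]
+ \Bigl[\sum_{i=1}^{4} \beta_{2i} x_2 \times g_i(t)\Bigr]
+ \beta_3 x_3 + \beta_4 x_4
+ \Bigl[\sum_{i=1}^{4} \beta_{5i} x_5 \times g_i(t)\Bigr]
+ \Bigl[\sum_{i=1}^{4} \beta_{6i} x_6 \times g_i(t)\Bigr]
\Bigr)
$$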

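where

$$
g_1(t) = \begin{cases} 1 & \text{if } 0 < t \le t_1 \\ 0 & \text{otherwise} \end{cases}
\qquad
g_2(t) = \begin{cases} 1 & \text{if } t_1 < t \le t_2 \\ 0 & \text{otherwise} \end{cases}
$$

$$
g_3(t) = \begin{cases} 1 & \text{if } t_2 < t \le t_3 \\ 0 & \text{otherwise} \end{cases}
\qquad
g_4(t) = \begin{cases} 1 & \text{if } t > t_3 \\ 0 & \text{otherwise} \end{cases}
$$

and

$$
t_1 = 130, \qquad t_2 = 180, \qquad t_3 = 250
$$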
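To illustrate how the four heaviside functions select one coefficient per time-varying covariate, the following Python sketch evaluates the indicators and the exponent of the model at a given time. The change-points are those given in the text; the function names, the dictionary layout and any coefficient values passed in are hypothetical.

```python
CHANGE_POINTS = (130.0, 180.0, 250.0)  # t1, t2, t3 in ms, as given in the text

def interval_indicators(t, change_points=CHANGE_POINTS):
    """Return [g1(t), g2(t), g3(t), g4(t)]; exactly one entry is 1 for t > 0."""
    t1, t2, t3 = change_points
    return [
        1 if 0 < t <= t1 else 0,
        1 if t1 < t <= t2 else 0,
        1 if t2 < t <= t3 else 0,
        1 if t > t3 else 0,
    ]

def linear_predictor(t, x, beta_tv, beta_const):
    """Exponent of the time-varying hazards model at time t.

    x: dict mapping covariate index (1..6) to its value
    beta_tv: dict mapping a time-varying covariate index to four coefficients,
             one per time interval
    beta_const: dict mapping a time-constant covariate index to one coefficient
    """
    g = interval_indicators(t)
    total = 0.0
    for j, coefs in beta_tv.items():
        # Only the coefficient of the interval containing t contributes.
        total += sum(b * x[j] * g_i for b, g_i in zip(coefs, g))
    for j, b in beta_const.items():
        total += b * x[j]
    return total
```

The hazard at time $t$ is then the baseline hazard multiplied by the exponential of this value.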

The hazard ratio for the variables word length, word frequency, saccade distance and eccentricity now varies with time. For each of these variables it assumes four distinct values depending on the value of $t$. We obtain the hazard ratios from the fitted model by separately exponentiating each of the estimated coefficients in each time interval:

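$$
\begin{aligned}
0 < t \le 130 &: \quad HR = \exp(\beta_{i1}) \\
130 < t \le 180 &: \quad HR = \exp(\beta_{i2}) \\
180 < t \le 250 &: \quad HR = \exp(\beta_{i3}) \\
t > 250 &: \quad HR = \exp(\beta_{i4})
\end{aligned}
\qquad i \in \{1, 2, 5, 6\}
$$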

The hazard ratios for bigram probability and syntactic surprisal do not vary with time. These covariates occur in the model only as main effects and their hazard ratios are given by:

$$
HR = \exp(\beta_i), \qquad i \in \{3, 4\}
$$

⁸ Note that we use the alternative formalization of the extended Cox model mentioned in section 5.2.5.


Figure 5.1. Prediction error curves (Brier score) on test data between the observed survival status and the predicted survival probability, for the Kaplan-Meier estimate, the Cox proportional hazards model (time-constant effects) and the extended Cox model (time-varying effects).

Once again, we assess the proportional hazards assumption, now for the refitted time-varying effects model. The evidence for non-proportional hazards is now much weaker in comparison to the original Cox model. In particular, there is no longer evidence for non-proportional hazards in the covariates word length, word frequency and eccentricity, in any of the time intervals. This suggests that the effect of these covariates now varies across the time intervals but remains constant within each of them: prior to 130 ms, between 130 and 180 ms, between 180 and 250 ms, and after 250 ms. We still find non-proportional hazards in the covariate saccade distance, in the two intervals between 130 and 250 ms. Overall, however, we conclude that the time-varying hazards model gives piecewise constant effects over the whole time axis.

Results

Figure 5.1 shows the prediction error curves for the survival functions based on the Cox proportional hazards model, the extended Cox model with time-varying effects, and, as a basis for comparison, the Kaplan-Meier estimate of the survival function. The prediction error at any time $t$ is measured by the Brier score, which gives the mean squared error over all fixations between the observed survival status and the predicted survival probability at time $t$. The survival status refers in this case to whether a fixation is still alive (i.e., the word is still fixated) at the given time $t$ or not (i.e., a saccade has been launched). The Kaplan-Meier estimate at time $t$ is equal to the proportion of surviving fixations in the training data at time $t$.
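When no observations are censored, the Kaplan-Meier estimate reduces to this empirical proportion, which can be sketched in a few lines of Python. The function name is hypothetical, and treating a fixation as surviving when its duration is strictly greater than $t$ is an assumption of this sketch.

```python
def empirical_survival(train_durations, t):
    """Proportion of training fixations still alive (duration > t) at time t."""
    alive = sum(1 for d in train_durations if d > t)
    return alive / len(train_durations)
```

Used as a baseline, the same value is predicted for every fixation in the test data, regardless of the properties of the fixated word.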

The prediction error curves resemble positively skewed Gaussian distributions, with lower prediction error towards the origin and the end point of the time axis, where there is also less variation in the data. For example, it is easier to predict whether a fixation is still alive at 100 ms and at 300 ms than at 200 ms, because most fixations are longer than 100 ms and shorter than 300 ms. The variation is much greater around 200 ms.
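The shape of the baseline curve can be motivated by a short calculation. Suppose, as a simplifying assumption, that a covariate-free model predicts the same survival probability $p$ for every fixation at time $t$, and that a proportion $p$ of the test fixations is in fact still alive at that time. The Brier score is then:

```latex
BS(t) = p\,(1 - p)^2 + (1 - p)\,(0 - p)^2
      = p\,(1 - p)\bigl[(1 - p) + p\bigr]
      = p\,(1 - p)
```

This quantity is close to zero when $p$ is near 1 (early in the fixation) or near 0 (far out in the right tail), and reaches its maximum of 0.25 at $p = 0.5$, that is, around the median fixation duration.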

When we first compare the time-constant effects model to the Kaplan-Meier curve, we see that this model makes slightly better predictions when there is greater variation in the data, roughly between 175 and 225 ms. Overall, however, it seems that the improvement resulting from taking covariates into account in predicting the chances of survival is relatively marginal. For fixations shorter than 175 ms and longer than 225 ms, covariate information does not seem to be very useful for predicting survival. The time-varying effects model, however, does much better, in comparison to both previous models. Just as for the time-constant model, there is not much improvement before 175 ms, but around this point in time the prediction error decreases for the time-varying effects model, while it continues to increase for the other models. The prediction error remains substantially lower until about 400 ms. In sum, the improvement offered by the time-constant effects model appears only marginal and limited to a short time window around 200 ms. By contrast, the improvement of the time-varying hazards model seems fairly substantial and extends much further out in the right tail of the fixation distribution.

To conclude, we have demonstrated that eye movement modeling can be addressed as a time-to-event, or survival analysis, problem. The approach we have outlined is based on three components: survival function estimation (goal of analysis), Cox hazards modeling (method of analysis), and Brier score evaluation (assessment of analysis). We have further demonstrated results which indicate that the assumption of proportional hazards, or time-stationary covariate effects, is violated for first fixation reading time data. These results imply that time itself is an important predictor of survival time, since covariates may relate differently to the hazard at different points in time. More generally, the results also suggest that Cox hazards modeling can recover some basic time course characteristics of the short-lived processes that influence saccade timing during reading. Further research into this matter is required, however, before any safe conclusions can be drawn.


6. Conclusion

In this chapter, we summarize what we take to be the main contributions of the thesis and point to promising directions for future research.

6.1 Main Contributions

This thesis presents new methods and models for understanding eye movements in reading based on the use of eye tracking corpora and data-driven modeling. The methods and models proposed address the central decisions that underlie eye movement behavior: when and where to move the eyes. We explore the idea that empirical eye movement data carries rich information about the processes that guide these decisions.

Throughout the thesis we emphasize the role of prediction in eye movement modeling and propose that models should be evaluated with respect to how well they account for previously unseen data. By assessing the prediction error, rather than the training error, we reduce the risk of making overly optimistic assumptions about the ability of models to generate accurate predictions. In close connection with this proposal, we further propose a new intuition on which the assessment of eye movement models can be based: better models assign higher probability to representative, but independently observed, behavior. This intuition is made explicit in the form of two evaluation metrics for spatial and temporal predictions, respectively.

The decision of where to move the eyes is approached using standard machine learning methods. The model proposed learns where to move the eyes under different conditions associated with the words being read. Applied to new text, the model moves the eyes in ways it has learnt, showing similar characteristics to human readers. We demonstrate, for example, that this model can be trained on individual readers to predict their individual eye movement behavior reasonably well on new data. The model is flexible, contains few fixed parameters, and can be used to explore a range of different learning strategies and factors influencing eye movement decisions. Further details on these methods and models are found in paper I, to some extent in paper II, and in paper III.

The decision of when to move the eyes is approached using time-to-event modeling. The modeling strategy we present is based on three components: survival function estimation, Cox hazards modeling, and Brier score evaluation. The models proposed learn the timing of eye movements under different conditions associated with the words being read, and, applied to new text, estimate the probability that a fixation falling on a word survives for a given length of time. We demonstrate that the Cox proportional hazards model is better at estimating survival than a simpler model which does not take covariate effects into account. Furthermore, we also show that covariate effects on the hazard of making a saccade change over the time course. By partitioning the time axis and allowing the strength of covariates to vary over time, we reduce the prediction error notably over the Cox proportional hazards model. This result attests to the influence of time-varying effects on the decision of when to move the eyes. More generally, these results suggest that Cox hazards modeling can recover some basic time course characteristics of the short-lived processes that influence saccade timing during reading. If this is the case, Cox hazards modeling may open up novel ways for studying cognitive processes from corpus data and for deriving constraints on computational models of eye movement control in reading. Further details about our work on when the eyes move in reading are found in papers III, IV and V.

6.2 Future Directions

The methods and models presented in this thesis open up a number of directions for future investigation. With respect to the methods and models of where the eyes move in chapter 4, we have already pointed out that they allow for further exploration. Most obviously perhaps, the basic model allows for different learning algorithms and feature models to be tested experimentally. Our own experiments involved a logistic regression model and a small set of features, or predictors, motivated by experimental findings on saccade target selection during reading. However, this is only one possible setup and not a restriction of the model. Different learning algorithms may, for example, be compared in order to better understand what types of learning methods perform well in modeling the eye movement decisions. Perhaps less obviously, the transition system of the model is also a parameter that can be varied, and alternative formulations can be explored which may, in turn, affect the learning process and the results.

With respect to the methods and models of when the eyes move in chapter 5, we made the interesting finding that covariate effects change over the time course and that by modeling these changes we improve our predictions of survival time. We consider it an important task for the future to learn the cause of these time-varying effects on the hazard and how to interpret them. While we have suggested that they may relate to different stages of cognitive processing during a fixation, further control experiments, as well as theoretical investigations, are necessary to evaluate the validity of this proposal. In addition to the models we have explored, the time-to-event modeling approach itself allows for further exploration. The analysis of “competing risks” for different types of saccades (e.g., regressions, refixations, and progressive saccades) is an interesting example. Competing risk analysis is an extension to time-to-event analysis which allows for more than one type of terminating event (but only one event can occur for each observation). This method provides dedicated techniques to explore how the hazard for competing saccade events evolves over time under different covariate conditions. Finally, it is worth noting that we have not addressed individual differences in temporal reading behavior, which are known to be substantial. We plan to address this issue by introducing frailties in the models. Frailties are introduced in hazard models to account for unobserved heterogeneity, or random effects, between individuals or subgroups of individuals, and allow for individual differences in the hazard functions.

The models explored in this thesis are best described at Marr’s computational level and do not address the specific mechanisms involved in eye movement control during reading. We hope, however, that some of our results may prove useful for identifying basic constraints on lower level models, such as the time-varying effects of different covariates on fixation durations. In continuing our investigations along these lines, we hope to match some of the major progress in eye movement modeling seen at other levels of description, and more generally to contribute to the wider understanding of eye movements and cognitive processes.


7. Overview of the Papers

I. This paper introduces the use of supervised machine learning methods to predict the saccade behavior of individual readers. The model we present is based on three components: a simple transition system for saccades in reading; a log-linear classifier that predicts the next transition, and hence the saccade target; and a simple search algorithm that derives a fixation sequence over any text, guided by the classifier. We train separate models for different readers and assess each model with respect to its capacity to predict the saccade behavior of the same reader the model is trained on but on other texts. We show that the models do fairly well on this task, largely reproducing the fixation distributions and fixating the same words as the individual readers. The models respond to word frequency in ways similar to human readers, often skipping over more frequent words.

II. This paper builds on paper I and presents a model that predicts the time course of eye movements, in addition to where the eyes move. This model introduces a set of processes assumed to control the timing of eye movements and imposes theoretical constraints on their durations based on empirical and experimental estimates. The decision to move the eyes is delayed as a function of the ease or difficulty in processing the fixated word. The decision of where to move the eyes is construed as an automated low-level process, approximated here using an induced classifier based on the model presented in paper I. Regressions occur in the model as a result of occasional desynchronization between the processes of when and where to move the eyes. In evaluating the model against held-out data, we show that it predicts observed mean gaze durations and mean skipping probabilities over different word frequency classes with good accuracy.

III. This paper presents a method for the evaluation of probabilistic saccade models based on computing the test sample entropy, relative to a model, instead of the accuracy of argmax-prediction. The basic intuition of the proposal is that a model that approximates human behavior well is a model that assigns high probability, and thus low entropy, to some representative and independently observed behavior. Given the large variation that exists in eye movement behavior, both between and within readers, this kind of probabilistic assessment of the prediction capacity of a model seems appropriate. An example of how it may apply is demonstrated on essentially the same saccade model as presented in paper I but with a different interpretation. The entropy associated with the observed eye movement behavior is measured and reported.
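One common way to compute such a quantity is as the average negative log probability that the model assigns to each observed event in the test sample. The Python sketch below follows that reading; the exact definition used in paper III may differ, and the function name is hypothetical.

```python
import math

def sample_entropy(probabilities):
    """Average negative log2 probability (bits per observed event).

    probabilities: the probability the model assigned to each event that was
    actually observed in the test sample (e.g. each saccade made by a reader).
    """
    if any(p <= 0.0 or p > 1.0 for p in probabilities):
        raise ValueError("each probability must lie in the interval (0, 1]")
    return -sum(math.log2(p) for p in probabilities) / len(probabilities)
```

Lower values indicate that the model assigned higher probability to the observed behavior, which is the sense in which a better model has lower entropy on held-out data.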

IV. This paper construes fixation times in reading as time-to-event data and uses methods from survival analysis to model the time course of eye movements. We discuss the motivation for this approach and focus on modeling the survival function of fixation time. We derive estimates of the survival function using the Cox proportional hazards model to adjust for the influence of linguistic effects on the empirical estimate. The survival models are assessed using the time-dependent Brier score, which can be used to evaluate the prediction error of a model at any stated value of time. The adjusted model, averaged over all readers, is shown to reduce prediction error within a limited time window, roughly between 150 and 250 ms following the onset of a fixation.

V. This paper extends paper IV. We further motivate the use of survival analytical methods for modeling the time course of eye movements in reading and introduce models with time-varying effects of the covariates. Such models, we show, afford a detailed analysis of the time course. We model the survival function of fixation time and demonstrate, on the one hand, that a time-fixed model that adjusts for the influence of cognitive effects reduces prediction error over the empirical estimate which disregards such effects and, on the other, that a model with time-varying effects, employing heaviside functions of time, improves considerably over the time-fixed model.


References

Aalen, O. O., Borgan, O., and Gjessing, H. K. (2008). Survival and Event History Analysis: A Process Point of View. Springer.

Altman, G. T., Garnham, A., and Dennis, Y. I. L. (1992). Avoiding the garden path: Eye movements in context. Journal of Memory and Language, 31:685–712.

Balota, D. A., Pollatsek, A., and Rayner, K. (1985). The interaction of contextual constraints and parafoveal visual information in reading. Cognitive Psychology, 17:364–390.

Bicknell, K. and Levy, R. (2010). A rational model of eye movement control in reading. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1168–1178.

Boston, M. F., Hale, J., Kliegl, R., Patil, U., and Vasishth, S. (2008). Parsing costs as predictors of reading difficulty: An evaluation using the Potsdam Sentence Corpus. Journal of Eye Movement Research, 2:1–12.

Boston, M. F., Hale, J. T., Vasishth, S., and Kliegl, R. (2011). Parallel processing and sentence comprehension difficulty. Language and Cognitive Processes, 26:301–349.

Brants, T. and Franz, A. (2006). Web 1T 5-gram Version 1. Linguistic Data Consortium.

Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78:1–3.

Brysbaert, M. and Vitu, F. (1998). Word skipping: Implications for theories of eye movement control in reading. In Underwood, G., editor, Eye Guidance in Reading and Scene Perception, pages 124–147. Elsevier Science Ltd.

Carroll, P. and Slowiaczek (1986). Constraints on semantic priming in reading: A fixation time analysis. Memory and Cognition, 14:509–522.

Clifton, C., Staub, A., and Rayner, K. (2007). Eye movements in reading words and sentences. In van Gompel, R., editor, Eye movements: A window on mind and brain, pages 341–372. Amsterdam: Elsevier.

Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society. Series B (Methodological), 34:187–220.

Demberg, V. and Keller, F. (2008). Data from eye-tracking corpora as evidence for theories of syntactic processing complexity. Cognition, 109:193–210.

Duffy, S. A., Morris, R. K., and Rayner, K. (1988). Lexical ambiguity and fixation times in reading. Journal of Memory and Language, 27:429–446.

Duffy, S. A. and Rayner, K. (1990). Eye movements and anaphor resolution: Effects of antecedent typicality and distance. Language and Speech, 33:103–119.

Ehrlich, S. F. and Rayner, K. (1981). Contextual effects on word perception and eye movements during reading. Journal of Verbal Learning and Verbal Behavior, 20:641–655.


Ehrlich, S. F. and Rayner, K. (1983). Pronoun assignment and semantic integrationduring reading: Eye movements and immediacy of processing. Journal of VerbalLearning and Verbal behavior, 22:75–87.

Engbert, R., Longtin, A., and Kliegl, R. (2002). A dynamical model of saccadegeneration in reading based on spatially distributed lexical processing. VisionResearch, 42:621–636.

Engbert, R., Nuthmann, A., Richter, E., and Kliegl, R. (2005). SWIFT: A dynamicalmodel of saccade generation during reading. Psychological Review, 112:777–813.

Feng, G. (2001). SHARE: A stochastic, hierarchical architecture for reading eye-movement. PhD thesis, University of Illinois at Urbana-Champaign.

Feng, G. (2006). Eye movements as time-series random variables: A stochastic model of eye movement control in reading. Cognitive Systems Research, 7:70–95.

Feng, G. (2009). Time course and hazard function: A distributional analysis of fixation duration in reading. Journal of Eye Movement Research, 3:1–23.

Fisher, D. F. and Shebilske, W. L. (1985). There is more that meets the eye than the eye mind assumption. In Groner, R., McConkie, G. W., and Menz, C., editors, Eye movements and human information processing. Amsterdam: Elsevier.

Frank, S. L. (2010). Uncertainty reduction as a measure of cognitive processing effort. In Proceedings of the ACL Workshop on Cognitive Modeling and Computational Linguistics.

Frazier, L. and Rayner, K. (1982). Making and correcting errors during sentence comprehension: Eye movements in the analysis of structurally ambiguous sentences. Cognitive Psychology, 14:178–210.

Hale, J. (2001). A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the second conference of the North American chapter of the Association for Computational Linguistics, volume 2, pages 159–166.

Hale, J. (2003). The information conveyed by words. Journal of Psycholinguistic Research, 32:101–123.

Hale, J. (2006). Uncertainty about the rest of the sentence. Cognitive Science, 30:643–672.

Heinzle, J., Hepp, K., and Martin, K. A. C. (2010). A biologically realistic cortical model of eye movement control in reading. Psychological Review, 117:808–830.

Hosmer, D. W., Lemeshow, S., and May, S. (1999). Applied Survival Analysis: Regression Modeling of Time-To-Event Data. New York: Wiley.

Inhoff, A. W. and Rayner, K. (1986). Parafoveal word processing during eye fixations in reading: Effects of word frequency. Perception & Psychophysics, 40:431–439.

Ishida, T. and Ikeda, M. (1989). Temporal properties of information extraction in reading by a text-mask replacement technique. Journal of the Optical Society of America A, 6:1624–1632.

Juhasz, B. J. and Rayner, K. (2003). Investigating the effects of a set of intercorrelated variables on eye fixation durations in reading. Journal of Experimental Psychology: Learning, Memory & Cognition, 29:1312–1318.

Just, M. A. and Carpenter, P. A. (1980). A theory of reading: From eye fixations to comprehension. Psychological Review, 87:329–354.

Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53:457–481.

Kennedy, A. and Pynte, J. (2005). Parafoveal-on-foveal effects in normal reading. Vision Research, 45:153–168.

Legge, G. E., Klitz, T. S., and Tjan, B. S. (1997). Mr. Chips: An ideal-observer model of reading. Psychological Review, 104:524–553.

Levy, R. (2008). Expectation-based syntactic comprehension. Cognition, 29:375–419.

Luce, R. D. (1986). Response Times: Their role in inferring elementary mental organization. New York: Oxford University Press.

Marcus, M. P., Santorini, B., and Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19:313–330.

Marr, D. (1982). Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. New York: Freeman.

McConkie, G., Kerr, P., Reddix, M., and Zola, D. (1988). Eye movement control during reading: I. The location of initial eye fixations on words. Vision Research, 28:1107–1118.

McConkie, G., Kerr, P., Reddix, M., Zola, D., and Jacobs, A. (1989). Eye movement control during reading: II. Frequency of refixating a word. Perception & Psychophysics, 46:245–253.

McConkie, G. W. and Dyre, B. P. (2000). Eye fixation durations in reading: Models of frequency distributions. In Kennedy, A., Heller, D., and Pynte, J., editors, Reading as a perceptual process, pages 683–700. Oxford: Elsevier.

McConkie, G. W., Kerr, P. W., and Dyre, B. P. (1994). What are normal eye movements during reading: Toward a mathematical description. In Ygge, J. and Lennerstrand, G., editors, Eye movements in reading: Perceptual and language processes, pages 315–327. Oxford: Elsevier.

McConkie, G. W. and Rayner, K. (1975). The span of the effective stimulus during a fixation in reading. Perception & Psychophysics, 17:578–586.

McConkie, G. W. and Rayner, K. (1976). Asymmetry of the perceptual span in reading. Bulletin of the Psychonomic Society, 8:365–368.

McDonald, S. A. (2003). Saccade target selection as a classification problem. Poster at the XIII Conference of the European Society for Cognitive Psychology (ESCOP), Granada, Spain. September 17–20.

McDonald, S. A., Carpenter, R., and Shillcock, R. C. (2005). An anatomically-constrained, stochastic model of eye movement control in reading. Psychological Review, 112:814–840.

Pollatsek, A., Lesch, M., Morris, R. K., and Rayner, K. (1992). Phonological codes are used in integrating information across saccades in word identification and reading. Journal of Experimental Psychology: Human Perception and Performance, 18:148–162.

Radach, R. and McConkie, G. (1998). Determinants of fixation positions in reading. In Underwood, G., editor, Eye Guidance in Reading and Scene Perception, pages 77–100. Oxford, England: Elsevier.

Rayner, K. (1975). The perceptual span and peripheral cues in reading. Cognitive Psychology, 7:65–81.

Rayner, K. (1979). Eye guidance in reading: Fixation location in words. Perception, 8:21–30.

Rayner, K. (1998). Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124:372–422.

Rayner, K. (2009a). Eye movements and attention in reading, scene perception, and visual search. The Quarterly Journal of Experimental Psychology, 62:1457–1506.

Rayner, K. (2009b). Eye movements in reading: Models and data. Journal of Eye Movement Research, 2:1–10.

Rayner, K. and Bertera, J. H. (1979). Reading without a fovea. Science, 206:468–469.

Rayner, K. and Duffy, S. A. (1986). Lexical complexity and fixation times in reading: Effects of word frequency, verb complexity, and lexical ambiguity. Memory & Cognition, 14:191–201.

Rayner, K. and Frazier, L. (1989). Selection mechanisms in reading lexically ambiguous words. Journal of Experimental Psychology: Learning, Memory, and Cognition, 15:779–790.

Rayner, K., Inhoff, A. W., Morrison, R. E., Slowiaczek, M. L., and Bertera, J. H. (1981). Masking of foveal and parafoveal vision during eye fixations in reading. Journal of Experimental Psychology: Human Perception and Performance, 7:167–179.

Rayner, K., Liversedge, S. P., White, S. J., and Vergilino-Perez, D. (2003). Reading disappearing text: Cognitive control of eye movements. Psychological Science, 14:385–388.

Rayner, K. and Well, A. D. (1996). Effects of contextual constraint on eye movements in reading: A further examination. Psychonomic Bulletin & Review, 3:504–509.

Rayner, K., Well, A. D., Pollatsek, A., and Bertera, J. H. (1982). The availability of useful information to the right of fixation in reading. Perception & Psychophysics, 31:537–550.

Reder, S. M. (1973). On-line monitoring of eye position signals in contingent and noncontingent paradigms. Behaviour Research Methods & Instrumentation, 5:218–228.

Reichle, E., editor (2006). Cognitive Systems Research, 7:1–96. Special issue on models of eye-movement control in reading.

Reichle, E., Pollatsek, A., Fisher, D., and Rayner, K. (1998). Toward a model of eye movement control in reading. Psychological Review, 105:125–157.

Reichle, E., Rayner, K., and Pollatsek, A. (2003). The E-Z Reader model of eye-movement control in reading: Comparisons to other models. Behavioral and Brain Sciences, 26:445–476.

Reichle, E., Warren, T., and McConnell, K. (2009). Using E-Z Reader to model the effects of higher-level language processing on eye movements during reading. Psychonomic Bulletin & Review, 16:1–21.

Reilly, R. G. and O’Regan, J. K. (1998). Eye movement control during reading: a simulation of some word-targeting strategies. Vision Research, 38:303–317.

Reilly, R. G. and Radach, R. (2006). Some empirical tests of an interactive activation model of eye movement control in reading. Cognitive Systems Research, 7:34–55.

Roark, B., Bachrach, A., Cardenas, C., and Pallier, C. (2009). Deriving lexical and syntactic expectation-based measures for psycholinguistic modeling via incremental top-down parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 324–333.

Schilling, H. E. H., Rayner, K., and Chumbley, J. I. (1998). Comparing naming, lexical decision, and eye fixation times: Word frequency effects and individual differences. Memory & Cognition, 26:1270–1281.

Schoenfeld, D. (1982). Partial residuals for the proportional hazards model. Biometrika, 69:51–55.

Sereno, S. C., O’Donnell, P. J., and Rayner, K. (2006). Eye movements and lexical ambiguity resolution: Investigating the subordinate bias effect. Journal of Experimental Psychology: Human Perception and Performance, 32:335–350.

Sereno, S. C. and Rayner, K. (1992). Fast priming during eye fixations in reading. Journal of Experimental Psychology: Human Perception and Performance, 18:173–184.

Taylor, W. (1953). Cloze procedure: a new tool for measuring readability. Journalism Quarterly, 30:415–433.

Van Zandt, T. (2002). Analysis of response time distributions. In Wixted, J. T. and Pashler, H., editors, Stevens’ Handbook of Experimental Psychology (3rd Edition), Volume 4: Methodology in Experimental Psychology, pages 461–516. New York: Wiley.

Vitu, F. and McConkie, G. W. (2000). Regressive saccades and word perception in adult reading. In Kennedy, A., Radach, R., Heller, D., and Pynte, J., editors, Reading as a perceptual process, pages 301–326. Oxford: Elsevier.

Vitu, F., McConkie, G. W., and Zola, D. (1998). About regressive saccades in reading and their relation to word identification. In Underwood, G., editor, Eye Guidance in Reading and Scene Perception. Oxford: Elsevier.

Witten, I. H. and Frank, E. (2005). Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann.

Yang, S. N. (2006). An oculomotor-based model of eye movements in reading: The competition/interaction model. Cognitive Systems Research, 7:56–69.

Yang, S. N. and McConkie, G. W. (2001). Eye movements during reading: A theory of saccade initiation times. Vision Research, 41:3567–3585.

Yang, S. N. and McConkie, G. W. (2005). New directions in theories of eye-movement control during reading. In Underwood, G., editor, Cognitive processes in eye guidance, pages 105–130. Oxford University Press.
