

The Rhythms of Head, Eyes and Hands at Intersections

Sujitha Martin, Akshay Rangesh, Eshed Ohn-Bar and Mohan M. Trivedi

Abstract— In this paper, we study the complex coordination of head, eyes and hands as the driver approaches a stop-controlled intersection. The proposed framework is made up of three major parts. The first part is the naturalistic driving dataset collection: synchronized capture of sensors looking-in and looking-out, multiple drivers driving in an urban environment, and segmenting events at stop-controlled intersections. The second part is extracting reliable features from purely vision-based sensors looking in at the driver: eye movements, head pose and hand location relative to the wheel. The third part is the design of appropriate temporal features for capturing coordination. A random forest algorithm is employed for studying relevance and understanding the temporal evolution of head, eye, and hand cues. Using 24 different events (from 5 drivers, resulting in ∼12,200 frames analyzed) of three different maneuvers at stop-controlled intersections, we found that preparatory motions appear from a few seconds down to a few milliseconds before the event occurs, depending on the modality (i.e., eyes, head, hand).

I. INTRODUCTION

In a recent report, NHTSA revealed that urban automotive collisions accounted for some 46% of the 33,782 fatal crashes in the United States in 2012; intersection-related collisions accounted for 26% of fatal crashes [1]. Furthermore, intersections, especially stop-controlled ones, have been found to be the most demanding driving scenario among all typical driving scenarios [2]. Hence, understanding, extracting and temporally modeling preparatory actions, such as head movements, eye glances and hand movements, before the onset of the respective maneuver has tremendous benefits for the development of intelligent vehicles.

Head, eye and hand coordination for driver hand activity classification has been studied before in [3]. Unlike [3], however, this work studies maneuvers at stop-controlled intersections and uses a different temporal feature extraction process. Maneuver intent studies [4], [5], [6] generally emphasize highway driving and do not provide a thorough analysis of feature selection and relevance.

This paper investigates how, and to what extent, information from the driver's face and hands can be used to differentiate between maneuvers at a stop-controlled intersection before the maneuver is executed. Fig. 1 shows sample instances of the driver's preparatory motions prior to the start of the respective maneuvers. Note that while external views are shown and discussed throughout the paper, they are used only for the sake of context. Fig. 2 illustrates the three important cues of interest, iris deviation, head pose in yaw, and hand location with respect to the center of the steering wheel, for an example event. With time 0 representing the start of the right-turn maneuver, evident head motions can be observed 10 seconds before, iris deviations 6 seconds before, and hand motions 1 second before. Inspired by this observation, this paper introduces the following contributions:

• Naturalistic driving dataset collection: synchronized capture of sensors looking-in and looking-out, multiple drivers driving in an urban environment, and segmenting events at stop-controlled intersections.

• Extracting reliable features: eye movements [7], head pose [8] and hand location [9] relative to the wheel, from purely vision-based sensors looking in at the driver.

• Data-driven processing: construction of features using temporal pyramids and use of the random forest algorithm to extract an optimal feature subset, where the optimal features reveal when and which cue is most relevant for representing the preparatory motions.

The authors are with the Laboratory for Intelligent and Safe Automobiles, University of California, San Diego, La Jolla, CA 92092, USA {scmartin, arangesh, eohnbar, mtrivedi}@ucsd.edu

Fig. 1. Sample instances of (a) head and eye movements, (b) hand positioning and (c) external context prior to the start of the respective maneuvers (i.e., stop and right/left turn, stop and go straight).

II. NATURALISTIC DRIVING DATASET

A large corpus of naturalistic driving data was collected using a highly instrumented test vehicle dubbed LISA-A [10]. The testbed was built to provide a near-panoramic sensing field of view with synchronized internal vision, external vision, radar, lidar and GPS. Of the vision sensors, four camera views are of relevance to this work, three of which are depicted in Fig. 2; the fourth camera gives another perspective of the driver's face.

Using this testbed, multiple drivers were asked to drive naturally on local streets and freeways in Southern California. The subjects were given no instructions on how to drive, resulting in naturalistic driving data. From the collected data, we selected events where the driver passes through a stop-controlled intersection, either turning right/left or going straight. In particular, we narrowed down the events to the stop-controlled intersections geographically illustrated in Fig. 3. The accumulated events were gathered from 7 unique drives, each lasting at least 60 minutes, of 5 different drivers (4 male and 1 female). The table in Fig. 3 shows the events considered, their respective counts and the total frames analyzed with respect to one vision sensor. How long a driver stops varies depending on the intersection, the driver and the maneuver executed at the intersection. In Section IV.A, we define epochs that consistently exist in all stop-controlled-intersection maneuvers. These epochs are necessary to properly extract temporal features for analysis (see Section IV.B for more details).

Fig. 2. The interesting coordination of head, eye and hand movements prior to the driver starting a right turn at the stop-controlled intersection. The bottom part is a time-series plot of the cues: eye deviation, head pose and relative hand location. The top part presents instances from looking-in and looking-out. The color-coded boxes in the two parts show where the images are taken from in the time series.

III. EXTRACTION, REPRESENTATION AND REDUCTION OF SPATIO-TEMPORAL FEATURES

Drivers exhibit preparatory motions in multiple ways, using any or all parts of the body. In this framework, we are interested in preparatory motions observable with the head, eyes and hands in the context of preparing for a maneuver at stop-controlled intersections (Fig. 2). This section describes how to objectively extract features from head, eyes and hands as independent and joint modalities.

A. Head and Eye Features

In this work, the state of the head and eyes is represented using head pose and normalized iris deviation. Fig. 4 illustrates the overall flow from a set of input video frames to the output. Given an image, the pipeline starts with face detection [11], followed by landmark estimation. A total of 68 landmarks, as defined originally in the CMU Multi-PIE dataset [12], and 2 iris locations [7] are estimated in the current framework. We use our own implementation of the cascade of regression models [13], [14] for facial landmark estimation. The idea is that, given an initial estimate of the facial landmark locations, say $p_0$ (generally a mean shape), and a learned sequence of regression models $(R_1, \ldots, R_K)$, the facial landmark locations at the $k$-th iteration are computed as follows:

$p_k = p_{k-1} + R_k \cdot F(p_{k-1}, M)$

where $k \in [1, K]$, $M$ represents the image on which the landmarks are being estimated, and $F(p_{k-1}, M)$ is a vector of features extracted at the landmark positions found at the $(k-1)$-th iteration on image $M$. Note that there are two camera views looking at the driver's face; this is in order to use the view with minimal self-occlusion under large head movements [8].
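To make the update above concrete, a minimal sketch of the iterative refinement is given below. It is illustrative only: the regression matrices and the feature extractor are assumed to be pre-trained and are passed in as `regressors` and `extract_features` (both hypothetical names); this is not the authors' implementation of [13], [14].

```python
import numpy as np

def refine_landmarks(image, p0, regressors, extract_features):
    """Cascaded-regression landmark refinement (illustrative sketch).

    image            -- frame M on which landmarks are estimated
    p0               -- initial landmark estimate (e.g. a mean shape), shape (2 * (68 + 2),)
    regressors       -- pre-trained matrices R_1, ..., R_K (hypothetical input)
    extract_features -- callable implementing F(p, M): returns a feature vector
                        computed at the current landmark positions (hypothetical input)
    """
    p = np.asarray(p0, dtype=float).copy()
    for R_k in regressors:
        # p_k = p_{k-1} + R_k * F(p_{k-1}, M)
        p = p + R_k @ extract_features(p, image)
    return p
```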


Fig. 3. Geographical locations of the stop-controlled intersections from which data is extracted, and statistics on the events analyzed.

For head pose estimation, we use a geometric method where local features, such as eye corners, nose corners and the nose tip, and their relative 3-D configurations, determine the pose [15]. In addition to head pose, the framework also outputs normalized iris deviation. Iris deviation is defined as the distance from the iris to one of the eye corners, normalized by the distance between the eye corners. For example, the iris deviation of the left eye is calculated as follows:

$\Delta d_{LEI} = \dfrac{d(q_{LELC}, q_{LEI})}{d(q_{LELC}, q_{LERC})}$   (1)

with $d$ as the Euclidean distance. The vectors $q_{LELC}$, $q_{LERC}$ and $q_{LEI}$ are the left eye's left corner coordinates, left eye's right corner coordinates and left eye's iris coordinates, respectively. The iris deviation of the right eye is calculated similarly.
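As a concrete illustration of Eq. (1), the normalized iris deviation of one eye can be computed as below; the function and argument names are ours, and the corner/iris coordinates are assumed to come from the landmark estimation step.

```python
import numpy as np

def iris_deviation(left_corner, right_corner, iris):
    """Normalized iris deviation of one eye, per Eq. (1):
    distance from the iris to the eye's left corner, divided by the
    distance between the two eye corners."""
    left_corner = np.asarray(left_corner, dtype=float)
    right_corner = np.asarray(right_corner, dtype=float)
    iris = np.asarray(iris, dtype=float)
    return np.linalg.norm(iris - left_corner) / np.linalg.norm(right_corner - left_corner)

# Example: an iris roughly midway between the corners gives a deviation near 0.5.
# iris_deviation((100, 52), (130, 52), (114, 51))  ->  ~0.47
```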

Given a video sequence $V^T = \{I_1, I_2, \ldots, I_T\}$ of length $T$, the final head and eye features are represented as:

$\mathcal{X}^f_{head} = \{\phi_1, \phi_2, \ldots, \phi_T\}, \quad \mathcal{X}^f_{eye} = \{\Delta d_1, \Delta d_2, \ldots, \Delta d_T\}$

where $\phi$ is the yaw rotation of the head and $\Delta d$ is the iris deviation of the eye with the most visibility; the visibility is determined by head pose with respect to the camera perspective. Note that due to the natural challenges of lighting conditions, occlusions and large head movements, the system required input from the user.

B. Hand Analysis

Hand motion patterns can be leveraged as an important cue to understand driver activity. To this end, driver hands are detected and tracked in order to produce continuous trajectories, using the tracker proposed in [9], [16].

Fig. 4. A flowchart of the face analysis process, which yields the driver's head pose and eye movements.

Given a video sequence $V^T = \{I_1, I_2, \ldots, I_T\}$ of length $T$, the hand tracker outputs the trajectory corresponding to each hand of the driver:

$X_{left} = \{\mathbf{x}^l_1, \mathbf{x}^l_2, \ldots, \mathbf{x}^l_T\}, \quad X_{right} = \{\mathbf{x}^r_1, \mathbf{x}^r_2, \ldots, \mathbf{x}^r_T\}$   (2)

where each

$\mathbf{x}_i = (x_i, y_i), \ \forall i \in \{1, 2, \ldots, T\}$   (3)

represents the location corresponding to the center of the hand in the image plane.

The trajectories obtained as such are not informative in their original form. In addition, the tracks may be fragmented and subject to spatial noise or jitter. To solve these issues, we propose to encode the spatial location in an intuitive yet discriminative manner. First, we change the origin of the image plane to the center of the wheel. Given the location of this center in the image plane ($\mathbf{x}_c$), the transformed locations of the hands are calculated as follows:

$\mathbf{x}'_i = \mathbf{x}_i - \mathbf{x}_c$   (4)

To make the tracks robust against noise and fragmentation, we propose a region-based mapping,

$\mathcal{M} : \mathbb{R}^2 \mapsto \{-4, -3, -2, -1, 1, 2, 3, 4\}$

to encode tracks meaningfully. This mapping is visualized in Fig. 5, along with sample plots for left- and right-hand tracks. The regions are assigned labels so that distinct motions are associated with larger jumps in region labels. For instance, keeping one hand on each side of the wheel with little or no movement would constrain the hand tracks to regions 2 and 3 for the left hand and -2 and -3 for the right hand, respectively. Substantial hand motions, however, result in large changes in the region labels over a given temporal window.

With the above mapping in place, the final trajectories may then be represented as:

$\mathcal{X}^f_{left} = \{c^l_1, c^l_2, \ldots, c^l_T\}, \quad \mathcal{X}^f_{right} = \{c^r_1, c^r_2, \ldots, c^r_T\}$   (5)

where

$c_i = \mathcal{M}(\mathbf{x}'_i), \ \forall i \in \{1, 2, \ldots, T\}$   (6)
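A sketch of the wheel-centered encoding of Eqs. (4)-(6) is shown below. The exact region geometry comes from Fig. 5 and is not fully specified in the text, so the angular-sector layout used here (positive labels on the left half of the wheel, negative on the right) is an assumption for illustration, as are the function names.

```python
import numpy as np

def region_label(hand_xy, wheel_center_xy):
    """Map an image-plane hand location to a region label in {-4,...,-1, 1,...,4}.

    Implements the shift of Eq. (4) followed by an assumed angular-sector
    version of the mapping M: positive labels on the left half of the wheel,
    negative labels on the right half (the paper's exact layout is in Fig. 5).
    """
    # Eq. (4): move the origin to the center of the wheel.
    dx, dy = np.asarray(hand_xy, dtype=float) - np.asarray(wheel_center_xy, dtype=float)
    # Angle measured clockwise from the top of the wheel (image y-axis points down).
    angle = np.degrees(np.arctan2(dx, -dy)) % 360.0
    if angle >= 180.0:                              # left half of the wheel
        return 1 + int((angle - 180.0) // 45.0)     # labels 1..4
    return -(1 + int(angle // 45.0))                # right half, labels -1..-4

def encode_track(track_xy, wheel_center_xy):
    """Eqs. (5)-(6): encode a whole hand trajectory as a sequence of region labels."""
    return [region_label(x, wheel_center_xy) for x in track_xy]
```

With an encoding of this kind, small grip adjustments stay within one or two adjacent labels, while motions across the wheel produce large label jumps, matching the intent described above.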


Fig. 5. The left panel depicts the labels assigned to regions around the wheel. These are then used to map hand locations to the corresponding label values. The right panel shows exemplar hand track sequences encoded using the method described.

C. Temporal Modeling and Feature Ranking

In this subsection, we describe the feature generation and ranking process given temporal sequences from all three modalities: head, hand and eyes. From here on, we denote any generic temporal sequence as $X$. For any given $X$, we extract a set of statistical features (detailed in Table I) and concatenate them to create a feature vector.

The above feature representation ignores any temporal structure in the data. This makes it hard to decipher the evolution of an event in time (starting 10-12 seconds before the event). To encode temporal information into our framework, we represent features in a temporal pyramid [17], where at the top level features are extracted over the full temporal extent of a video sequence, the next level is the concatenation of features extracted by temporally segmenting the video into two halves, and so on. We obtain a coarse-to-fine representation by concatenating all such features together to generate a feature vector $f$.
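The construction can be sketched as follows. The statistics shown are a representative subset of Table I (the modality-specific choices such as hist and mode are omitted), and the function names are ours.

```python
import numpy as np

def segment_stats(x):
    """A representative subset of the per-segment statistics in Table I."""
    x = np.asarray(x, dtype=float)
    return np.array([
        x.mean(), x.std(), x.min(), x.max(),
        x.max() - x.min(),                     # range
        np.percentile(x, 25),                  # q1
        np.percentile(x, 50),                  # q2
        np.percentile(x, 75),                  # q3
    ])

def temporal_pyramid(x, levels=3):
    """Coarse-to-fine feature vector f: level 0 spans the whole sequence,
    level 1 its two halves, level 2 its four quarters, and so on; the
    per-segment statistics are concatenated across all levels."""
    x = np.asarray(x, dtype=float)
    feats = [segment_stats(seg) for level in range(levels)
             for seg in np.array_split(x, 2 ** level)]
    return np.concatenate(feats)
```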

Consider a set of feature vectors $F = \{f_1, f_2, \ldots, f_N\}$ and their corresponding class labels $Y = \{y_1, y_2, \ldots, y_N\}$. In the context of our study, the class labels are one of the following: stop and turn right, stop and turn left, or stop and go straight. Given $F$ and $Y$, we train a random forest (RF) with an ensemble of 1000 decision trees on the entire corpus. The maximum depth of each tree is restricted to 10 to prevent overfitting. This RF is then used to perform feature selection across modalities. The advantage in choosing RFs over other feature selection techniques is threefold: first, we sidestep the entire process of tuning hyper-parameters through cross-validation; second, RFs make no assumption about the linear separability of the data; third, RFs are capable of handling different data formats such as integers, floats or labels, so no standardization or regularization is necessary.

TABLE I
CANDIDATE FEATURES FOR TEMPORAL SEQUENCES

Feature   Description
hist¹     Histogram
mean²     Sample mean
std²      Sample standard deviation
min       Minimum value
max       Maximum value
mode³     Mode
range     Range (max - min)
q1        25th percentile
q2        50th percentile
q3        75th percentile

¹ only for head and hand sequences
² only for head and eye sequences
³ only for hand sequences

The trained RF measures the importance of each feature as the averaged impurity decrease computed over all decision trees in the forest. The impurity measure used is the Gini impurity [18], which acts as a criterion to minimize the probability of misclassification. Using these importance scores, we rank the set of all features in decreasing order of discriminative power.
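One way to realize this selection step is sketched below with scikit-learn (our choice of library, not necessarily the authors'), using the same settings reported above: 1000 trees, maximum depth 10, Gini impurity. The `feature_names` argument is a hypothetical convenience for reporting.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rank_features(F, y, feature_names):
    """Train an RF and rank features by mean decrease in Gini impurity.

    F              -- array of feature vectors, shape (N, D)
    y              -- maneuver label per event (right turn / left turn / straight)
    feature_names  -- D human-readable names, used only for reporting (hypothetical)
    """
    rf = RandomForestClassifier(n_estimators=1000, max_depth=10,
                                criterion="gini", random_state=0)
    rf.fit(F, y)
    order = np.argsort(rf.feature_importances_)[::-1]
    return [(feature_names[i], float(rf.feature_importances_[i])) for i in order]
```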

IV. ON-ROAD EXPERIMENT AND ANALYSIS

A. Event Description

With the intent of analyzing driver behavior at stop-controlled intersections, we define the following epochs during each event:
• stop: the time at which the vehicle approaching an intersection comes to a complete halt.
• start: the time at which the vehicle starts moving after the stop epoch.
• mid: the time halfway between the stop and start epochs.
• end: the time at which the driver completes the entire maneuver.
With the above definitions in place, we analyze each event for a period starting 12 seconds prior to the stop epoch and ending with the end epoch.
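The epoch timestamps and the resulting analysis window can be captured in a small container. The sketch below is illustrative only: the Event class and its field names are ours, and timestamps are assumed to be in seconds.

```python
from dataclasses import dataclass

@dataclass
class Event:
    """Epochs of one stop-controlled-intersection event (illustrative container)."""
    stop: float   # vehicle comes to a complete halt
    start: float  # vehicle starts moving again
    end: float    # maneuver completed

    @property
    def mid(self) -> float:
        # halfway between the stop and start epochs
        return 0.5 * (self.stop + self.start)

    def analysis_window(self) -> tuple:
        # each event is analyzed from 12 s before the stop epoch to the end epoch
        return (self.stop - 12.0, self.end)
```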

B. Data Driven Event Analysis

To understand the interplay between the head, hand and eye modalities, we extract a set of features for each of the 24 stop-controlled-intersection events described in Section II. These features are labeled based on the end maneuver performed by the driver (i.e., right turn, left turn, go straight).

To see how the feature importances change as a function of time, we train one RF for every 0.25-second increment starting from 10 seconds prior to the stop epoch, up until and including the stop epoch. Each forest is trained only on features extracted up to that given point in time. Similarly, we train one RF for every 0.25 seconds after and including the start epoch. Since the time interval between the stop and start epochs is variable, we simply train an RF at the mid epoch for each event. To encode temporal structure into the features, we use a temporal pyramid of three levels to train each RF.

Fig. 6. A plot of the histogram of the top 25 features with respect to the modality each originates from, as a function of time (bottom panel). Exemplar image sequences extracted from different time intervals for each of the 3 maneuvers (Stop & Turn Right, Stop & Go Straight, Stop & Turn Left). Best viewed in color.

For each time instant considered above, all features are ranked based on their importance scores. We consider the top 25 features and plot a histogram based on their modality. Fig. 6 depicts the evolution of the histogram through time. In addition to the plot, we provide exemplar image sequences extracted from different time periods for each maneuver type.
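A rough sketch of this per-time-instant analysis, reusing the temporal_pyramid and rank_features sketches above, is given below. The helpers seq_up_to (truncate a modality sequence at a cutoff time) and modality_of (map a feature index back to its modality) are hypothetical, as is the per-event data layout; this is not the authors' pipeline, only an illustration of the procedure described in the text.

```python
import numpy as np

def modality_histograms_over_time(events, labels, times, seq_up_to, modality_of):
    """For each cutoff time t, train an RF on features extracted only up to t
    and count how many of the 25 highest-ranked features come from each modality.

    events      -- list of dicts: modality name -> temporal sequence (assumed layout)
    labels      -- maneuver label per event
    times       -- cutoff instants, e.g. every 0.25 s from -10 s up to the stop epoch
    seq_up_to   -- hypothetical helper truncating a sequence at time t
    modality_of -- hypothetical helper mapping a feature index to 'head'/'eye'/'hand'
    """
    histograms = {}
    modalities = ("head", "eye", "hand")
    for t in times:
        # Feature matrix: per event, concatenate temporal-pyramid features of
        # each modality, computed from data observed up to time t only.
        F = np.vstack([
            np.concatenate([temporal_pyramid(seq_up_to(e[m], t)) for m in modalities])
            for e in events
        ])
        ranked = rank_features(F, labels, feature_names=list(range(F.shape[1])))
        top25 = [idx for idx, _ in ranked[:25]]
        histograms[t] = {m: sum(modality_of(i) == m for i in top25) for m in modalities}
    return histograms
```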

The eye modality is seen to hold significance very early in the event. It provides discriminative features up to a few seconds before the stop epoch, which indicates long-term planning on the drivers' part. The head modality is clearly the most discriminative of the three modalities in this context, as indicated by a large number of highly ranked head features throughout the entire event. This modality seems to be most effective from just as the vehicle is coming to a stop up until the start epoch. This corresponds to the side-to-side head movements of a driver scanning the intersection for pedestrians or vehicles, and is reflected in the corresponding image sequences from this temporal region. Finally, the hand modality comes into play later in the event. This modality is short-sighted in the sense that its prediction window is very small in relation to the other modalities. However, it provides the most reliable features for classification late in the event, as verified by the rapid increase in discriminative features obtained from the hands right after the start epoch. Additionally, the hand modality is a very strong indicator of preparatory movements before the actual maneuver is executed (seen in the corresponding image sequences).

Fig. 7 shows the frequency of occurrence of each individual feature in the top 25, for each modality, across all time intervals. A higher count generally corresponds to a higher relevance for coordination analysis in the dataset. It can be observed that the range, std, min and max features are useful across the different modalities studied. These motion cues are useful for capturing the extent of the motion in the temporal window, such as the side-to-side movements of the head (captured in the yaw). Because hand movement is most active towards the end of the event, hand cues are selected less frequently. Among those selected, the hist features seem to work best for the left hand, while range is of most use for the right hand. This may indicate that the right hand is more prone to crossing over from one side of the wheel to the other than the left hand.

Fig. 7. A plot of the relative frequency with which each feature occurs in the top 25, across all modalities and for all time intervals.

V. CONCLUDING REMARKS

In this study, we provide a data-driven approach to understand and analyze the temporal interplay between three modalities, the head, hands and eyes of the driver, in the context of stop-controlled intersections. A naturalistic driving dataset was employed to show that preparatory motions occur from a few seconds down to a few milliseconds before maneuver events at stop-controlled intersections, depending on the modality. Features generated from the head are seen to be most useful in terms of predictive power. Eye-based features play an important role in prediction at a very early stage of the maneuver, while hand features dominate towards the end. These findings are in line with the general flow of most human activities: first see, then perceive, and finally actuate.

Future work encompasses extending this understanding of the temporal influence of each modality to different maneuvers performed under different contexts. Additionally, this study will serve as a basis for developing a strong predictive algorithm for intersections. Other information sources, such as vehicle dynamics and external camera sensors, can also be integrated.

ACKNOWLEDGMENT

The authors would like to thank the UC Discovery program and associated industry partners for their research support.

REFERENCES

[1] "Traffic safety facts," Department of Transportation - National Highway Traffic Safety Administration, 2014.

[2] Y. Liao, S. E. Li, W. Wang, Y. Wang, G. Li, and B. Cheng, "Detection of driver cognitive distraction: A comparison study of stop-controlled intersection and speed-limited highway," IEEE Trans. Intelligent Transportation Systems, 2016.

[3] S. Martin, E. Ohn-Bar, A. Tawari, and M. M. Trivedi, "Understanding head and hand activities and coordination in naturalistic driving videos," in IEEE Intelligent Vehicles Symposium, 2014.

[4] A. Jain, H. S. Koppula, B. Raghavan, S. Soh, and A. Saxena, "Car that knows before you do: Anticipating maneuvers via learning temporal driving models," in IEEE Intl. Conf. Computer Vision, 2015.

[5] A. Doshi, B. Morris, and M. Trivedi, "On-road prediction of driver's intent with multimodal sensory cues," IEEE Pervasive Computing, 2011.

[6] E. Ohn-Bar, A. Tawari, S. Martin, and M. M. Trivedi, "On surveillance for safety critical events: In-vehicle video networks for predictive driver assistance systems," Computer Vision and Image Understanding, 2015.

[7] A. Tawari, K. H. Chen, and M. M. Trivedi, "Where is the driver looking: Analysis of head, eye and iris for robust gaze zone estimation," in IEEE Conf. Intelligent Transportation Systems, 2014.

[8] A. Tawari, S. Martin, and M. M. Trivedi, "Continuous head movement estimator for driver assistance: Issues, algorithms, and on-road evaluations," IEEE Trans. Intelligent Transportation Systems, 2014.

[9] A. Rangesh, E. Ohn-Bar, and M. M. Trivedi, "Long-term, multi-cue tracking of hands in vehicles," IEEE Trans. Intelligent Transportation Systems, 2016.

[10] A. Tawari, S. Sivaraman, M. M. Trivedi, T. Shannon, and M. Tippelhofer, "Looking-in and looking-out vision for urban intelligent assistance: Estimation of driver attentive state and dynamic surround for safe merging and braking," in IEEE Intelligent Vehicles Symposium, 2014.

[11] P. Dollar, R. Appel, S. Belongie, and P. Perona, "Fast feature pyramids for object detection," IEEE Trans. Pattern Analysis and Machine Intelligence, 2014.

[12] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, "Multi-PIE," Image and Vision Computing, 2010.

[13] X. P. Burgos-Artizzu, P. Perona, and P. Dollar, "Robust face landmark estimation under occlusion," in IEEE Intl. Conf. Computer Vision, 2013.

[14] X. Xiong and F. De la Torre, "Supervised descent method and its applications to face alignment," in IEEE Conf. Computer Vision and Pattern Recognition, 2013.

[15] S. Martin, A. Tawari, E. Murphy-Chutorian, S. Y. Cheng, and M. Trivedi, "On the design and evaluation of robust head pose for visual user interfaces: algorithms, databases, and comparisons," in Automotive User Interfaces and Interactive Vehicular Applications, 2012.

[16] A. Rangesh, E. Ohn-Bar, and M. Trivedi, "Hidden hands: Tracking hands with an occlusion aware tracker," in IEEE Conf. Computer Vision and Pattern Recognition Workshops - HANDS, 2016.

[17] H. Pirsiavash and D. Ramanan, "Detecting activities of daily living in first-person camera views," in IEEE Conf. Computer Vision and Pattern Recognition, 2012.

[18] K. J. Archer and R. V. Kimes, "Empirical characterization of random forest variable importance measures," Computational Statistics & Data Analysis, vol. 52, no. 4, pp. 2249-2260, 2008.
