[2014 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), Chengdu, China...



RECOGNITION OF WHO IS DOING WHAT FROM THE STICK MODEL OBTAINED FROM KINECT

V Ramu Reddy, Kingshuk Chakravarty, T Chattopadhyay, Aniruddha Sinha, and Arpan Pal

TCS Innovation Labs, Kolkata, India
{ramu.vempada, kingshuk.chakravarty, t.chattopadhyay, aniruddha.s, arpan.pal}@tcs.com

ABSTRACT

In this demo, the authors demonstrate a method of identifying a person and his/her activities, such as sitting, standing, and walking, using the skeleton information (stick model) obtained from Kinect. The setup is deployed in a drawing room for real-time Television Rating Point (TRP) measurement.

Index Terms— Television Rating Point, Activity Monitoring, Person Identification, Kinect, Stick Model

1. INTRODUCTION

Nowadays, the marketing teams of commercial organizations spend billions of USD advertising their products on television (TV) [1]. The advertising cost of the same product varies from one show to another based on the popularity of the show. The Television Rating Point (TRP) is a measure of the popularity of a particular TV program. There is therefore a need for precise TRP measurement in the TV industry, but people are reluctant to be monitored by an optical sensor because of privacy concerns, and it is also difficult to ask people to wear a number of sensors because of the obtrusiveness. Currently, TRP is usually measured using a People Meter [2], which uses one or a combination of four techniques, namely (i) audio matching, (ii) frequency of the tuned channel, (iii) audio watermarking, and (iv) visual pattern recognition. None of these techniques, however, can identify who is actually watching TV and for how long, and they cannot capture the exact viewership when multiple persons watch the same TV program. Moreover, a person may leave the room even when the TV is tuned to a particular channel, or may talk over the phone or chat with others while a program is playing. So, there is a need for a precise measure of how many persons are really watching a particular TV show. In addition, the performance of any camera-based TRP monitoring system is not satisfactory in low-light conditions, which are typical of TV rooms, and such systems do not address viewers' privacy and security concerns.

One possible solution is offered by a new sensor that has taken the commercial market by storm in the last few years, namely Kinect. Kinect can generate a stick model that represents the human body in skeleton form by 20 joints, namely Hip Center (1), Spine (2), Shoulder Center (3), Head (4), etc. [3][4]. To address the obtrusiveness and privacy concerns, a real-time TRP measurement system is developed that identifies "the viewer" based on activity, using the skeleton joint information obtained from a Kinect mounted over the TV set.

In general, when a person wants to watch a TV show, he/she walks into the room, switches on the TV set, and then sits on a couch/sofa in front of the TV. In some scenarios, people also prefer to stand while watching certain crucial moments of a TV program. We have assumed this scenario in our system, and have therefore developed a person identification (PI) system based on activity identification (AI). It should be mentioned that Kinect can identify a person within a 10-foot field of view, and the TV is likewise a 10-ft human-computer interface (HCI). As the Kinect is mounted over the TV, we assume that a person is watching TV if he/she is sitting on a sofa/couch or standing in the field of view of the Kinect. Keeping this in view, the PI system is developed for the static activities of sitting and standing, but not for walking. This is very useful for precise TRP measurement: once the activities are identified, we count only the time during which the person is static (i.e., standing or sitting) and ignore the rest.

2. PROPOSED METHODOLOGY

Our proposed system has four modules in its architecture, as shown in Fig. 1.

Fig. 1. System Architecture. (Block diagram: the Data Acquisition (DA) module passes the {x, y, z} joint co-ordinates to the Feature Generator (FG), which produces 𝐹𝑎𝑐𝑡 for Activity Identification (AI) and 𝐹𝑃𝐼 for Person Identification (PI).)

The "Data Acquisition" (DA) module is responsible for capturing skeleton information. It records the {𝑥, 𝑦, 𝑧} co-ordinates (in meters) of the different skeleton joints for a particular subject. The "Feature Generator" (FG) module extracts the necessary feature sets from this skeleton data for activity detection as well as person identification. Finally, these features drive supervised-learning-based person identification built on activity pattern analysis.

In this work, the feature vector 𝐹𝑎𝑐𝑡, comprising the average values (𝐹𝑚𝑒𝑎𝑛 ∈ R12) and the range (𝐹𝑟𝑎𝑛𝑔𝑒 ∈ R12) of the joint co-ordinates (x, y, z) extracted from the skeleton joints Head, ShoulderCenter, ShoulderLeft, and ShoulderRight over f frames, is used for identifying the activities [5]. In the case of static poses like sitting or standing, only the structural characteristics or physical build of a person (e.g., height, length of limbs, etc.) can provide meaningful information about the person's identity or uniqueness. Keeping this fact in mind, we use the differences of the 3D world co-ordinates between every pair of joints in each frame as the candidate feature (𝐹𝑃𝐼). It should be mentioned that 𝐹𝑃𝐼 contains all the necessary information about the characteristics of the physical build of the subject.
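The two feature sets can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes skeleton data arrives as a NumPy array of shape (f, 20, 3), and the joint ordering and function names are our own.

```python
import numpy as np

# Assumed indices of the 4 upper-body joints used for F_act
# (0 = Head, 1 = ShoulderCenter, 2 = ShoulderLeft, 3 = ShoulderRight).
UPPER_JOINTS = [0, 1, 2, 3]

def f_act(frames):
    """Activity features: per-axis mean and range of 4 joints over f frames.

    frames: array of shape (f, num_joints, 3) holding {x, y, z} in meters.
    Returns a 24-dim vector: F_mean in R^12 followed by F_range in R^12.
    """
    sel = frames[:, UPPER_JOINTS, :]                 # (f, 4, 3)
    f_mean = sel.mean(axis=0).ravel()                # 4 joints * 3 axes = 12
    f_range = (sel.max(axis=0) - sel.min(axis=0)).ravel()
    return np.concatenate([f_mean, f_range])

def f_pi(frame):
    """Person-identification features: differences of the 3D world
    co-ordinates between every pair of the 20 joints in one frame."""
    i, j = np.triu_indices(len(frame), k=1)          # all 190 joint pairs
    return (frame[i] - frame[j]).ravel()             # 190 pairs * 3 axes = 570

frames = np.random.rand(30, 20, 3)                   # 30 frames of dummy data
print(f_act(frames).shape)                           # (24,)
print(f_pi(frames[0]).shape)                         # (570,)
```

The pairwise-difference feature is invariant to the subject's position in the room, since a common translation of all joints cancels in each difference.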

2.1. Performance evaluation

In our proposed system, 5-fold cross-validation is used for performance evaluation, where 4 folds are used for training and 1 fold for testing. In our implementation, we have used a supervised multi-class SVM with an RBF kernel as the classifier for both AI and PI. Our proposed method achieves 96.91% accuracy in identifying the sitting, standing, and walking activities, and 93.88% and 96.24% accuracy in identifying the person in the standing and sitting postures, respectively, among 5 subjects. Tables 1 and 2 show the confusion matrices of the AI system and of the PI system for the standing posture.
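The evaluation pipeline can be reproduced along these lines. This sketch uses scikit-learn, which is our choice of library (the paper does not name an implementation), and random stand-in data in place of the real feature matrices.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Dummy stand-ins for the real data: 300 samples of the 24-dim F_act
# vector, labelled 0 = stand, 1 = sit, 2 = walk.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 24))
y = rng.integers(0, 3, size=300)

# Multi-class SVM with an RBF kernel, evaluated by 5-fold cross
# validation (4 folds train, 1 fold test, rotated over all 5 folds).
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(clf, X, y, cv=5)
print(scores.shape)        # one accuracy value per fold: (5,)
```

The mean of `scores` corresponds to the overall accuracies quoted above; on random labels like these it will of course sit near chance.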

Table 1. Sample confusion matrix for activity identification (in %).

        Stand    Sit     Walk
Stand   96.70    0.00    3.30
Sit      1.20   95.15    3.65
Walk     0.00    1.10   98.90

Table 2. Sample confusion matrix for person identification in standing posture (in %).

       P1      P2      P3      P4      P5
P1    93.61    2.02    4.37    0.00    0.00
P2     1.61   88.08    9.25    1.06    0.00
P3     0.08    7.42   92.26    0.24    0.00
P4     1.96    0.08    0.00   97.96    0.00
P5     0.00    2.48    0.02    0.00   97.50

Fig. 2. (a) Activity pattern (walk, stand, sit) of a particular subject in the TV room over time (min), and (b) activity patterns (walking, standing, sitting) of different subjects (Person 1 to Person 5) in the TV room over time (min).

2.2. Viewing Habit

In this work, we have visualized the viewing habit of a person by plotting the person's activity pattern against time, as shown in Fig. 2(a). From Fig. 2(a), it is evident that the person initially walks for a while, then stands for a while, and then sits down to watch the TV show. Thus, it is possible to design an index based on the activities a person performs while watching TV; we have kept this as future work. In Fig. 2(b), we have plotted the activity patterns of different persons over time, where red, green, and blue indicate walking, standing, and sitting, respectively. This can be very useful for analyzing demography-wise TV program viewership. For example, Fig. 2(b) shows that Person 1 and Person 2 watch TV with more attention than the rest. Here, we have assumed that a person who is walking pays the least attention to the TV show, while a person who is sitting pays the most. In future work, we plan to add more actions, such as reading a book or talking over the phone, so that this assumption is no longer needed.
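One simple form such an attention index could take is a weighted average over the observed activity labels. This is purely illustrative: the paper leaves the index design as future work, so the weights below (sitting most attentive, walking least) are a hypothetical choice of ours.

```python
# Hypothetical per-activity attention weights, reflecting the paper's
# assumption that sitting implies the most attention and walking the least.
WEIGHTS = {"sitting": 1.0, "standing": 0.5, "walking": 0.0}

def attention_index(activity_log):
    """Average attention over a list of per-interval activity labels
    (e.g., one label per minute) for one viewer."""
    if not activity_log:
        return 0.0
    return sum(WEIGHTS[a] for a in activity_log) / len(activity_log)

# A viewer who walks in, stands briefly, then sits for the show.
log = ["walking", "standing", "sitting", "sitting", "sitting"]
print(attention_index(log))   # (0 + 0.5 + 3 * 1.0) / 5 = 0.7
```

Summed over a program's duration, such a score would let the TRP account not just for presence in the room but for how attentively each identified viewer watched.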

3. REFERENCES

[1] "http://en.wikipedia.org/wiki/Criticism_of_advertising". [Online; accessed 12-March-2014].

[2] D. Mukherjee, P. Mishra, S. Bhattacharya, T. Chattopadhyay, and A. Ghose, "An Architecture for Real Time Television Audience Measurement," ISCI 2011, Malaysia, March 2011, pp. 611-616.

[3] Microsoft Kinect SDK, "http://www.microsoft.com/en-us/kinectforwindows/develop/developer-downloads.aspx", 2012. [Online; accessed 12-March-2014].

[4] A. Sinha and K. Chakravarty, "Pose Based Person Identification Using Kinect," IEEE SMC 2013, pp. 497-503, DOI 10.1109/SMC.2013.91.

[5] V. Ramu Reddy and T. Chattopadhyay, "Human Activity Recognition from Kinect Captured Data using Stick Model," HCII 2014, Greece, June 2014.