
BSc Project Final Report

Immersive Computer Games Using Computer Vision

Submitted for the BSc in Computer Science

with Games Development

April 2011

by

Timothy Robert Orton


Table of Contents

1 Introduction
  1.1 Brief
    1.1.1 Initial Brief
    1.1.2 Agreed Brief
    1.1.3 Reasons for Change
  1.2 Context
  1.3 Aim and Objectives
2 Project Background
  2.1 Problem Context
    2.1.1 Depth Perception
    2.1.2 Face Detection
  2.2 Comparison of Technologies
    2.2.1 OpenCV
    2.2.2 VXL or LTI
    2.2.3 Face API
    2.2.4 AAM API or VOSM
    2.2.5 STASM
  2.3 Comparison of Algorithms
    2.3.1 Skin Colour Detection
    2.3.2 Motion Detection
    2.3.3 Haar-like Feature Detection
    2.3.4 AAMs
  2.4 Similar Projects
3 Technical Development
  3.1 System Architecture
  3.2 Head Tracking
  3.3 System Testing
4 Critical Evaluation
  4.1 Project Achievements
    4.1.1 Face Tracking
    4.1.2 Artificial Intelligence
  4.2 Further Development
  4.3 Personal Reflection
5 Conclusion
6 References
7 Bibliography
8 Appendices
  8.1 Initial Task Analysis
  8.2 Interim Class Diagram


Immersive Computer Games Using Computer Vision

1 Introduction

This project, as the title suggests, revolves around the idea of immersion in computer games. Many computer games are already very immersive, especially horror games, which are fast paced and can be quite frightening. Ritter illustrates "typical human-computer interaction by symbolic encoding. Analogue channels are not seen by the computer" (Ritter W, 2011).

Ritter suggests that "another way to enrich communication would be to simply tear another hole into the wall between humans and computers, for allowing more channels, many of them being analogue and unconscious by nature, to pass through". Through the use of a webcam, Ritter's metaphorical hole is torn, and most of the analogue channels of communication listed can be processed (including speech, as many webcams today include built-in microphones). The term computer vision describes machines that extract information from an image (image processing); in some sense, the computer can "see" the image concerned. In this project a webcam captures images many times a second, and a computer vision library extracts information from them to find the coordinates of the user's head. These coordinates are then used to manipulate the view perspective of the game in relation to the player. The overall effect is a game that feels "more 3D", where the screen acts like a window the user can see into: simply by moving his or her head to the side, the user can see around objects realistically, just as they would look around real objects. This links back to immersive gaming.


This windowing technique makes the game appear to lie behind the computer monitor, perceivable only through the screen; viewing the screen from a different angle changes what is perceived on the other side accordingly. Because of this metaphor, the term 'windowing' will be used throughout the rest of the report to describe the technique. One further convention used in this report: the target user of the game is referred to as the user, whereas the playable character in the game is referred to as the player.

1.1 Brief

1.1.1 Initial Brief

The original brief was to produce a graphically navigable map of the Semantic Web, or Web 2.0. Web 2.0 uses metadata, which a machine can read in order to understand the semantics, or meaning, of information on the web. This would change the way search engines work, for example, but the metadata could also be used to produce the proposed map. At the time the project commenced, Web 2.0 was still largely a concept proposed by Sir Tim Berners-Lee, the inventor of the World Wide Web (Berners-Lee T et al, 2001).

1.1.2 Agreed Brief

Following discussions with the project supervisor, the brief changed radically: instead, the project would produce a simple 3D game that uses a webcam to detect the features of the human face and implements head tracking, creating an immersive environment. The research question to be answered by this project is whether or not it is plausible for computer games to become more immersive using head tracking technology, without having to spend a large amount of money on specialist development tools. The game is a first-person horror game, a genre chosen for its already high level of immersion. The first-person aspect adds to the realism, as the user feels situated in the game rather than merely playing a detached character. The horror aspect also draws the user in: by creating a tense atmosphere or scaring them, it forces them to concentrate on what is happening in the game and to suspend their disbelief.

1.1.3 Reasons for Change

The initial research into graphical navigation led towards the question of interaction, specifically using a webcam to increase interactivity. Augmented reality could have been an excitingly innovative method of interaction, and head tracking to manoeuvre the perspective is an excellent way to add a third dimension to a map and make even a very complex map or chart much easier to navigate. As this is a newly emerging technology in many areas, including electronic games, the interaction and navigation themselves became more appealing than mapping the Semantic Web. Since the beginning of the project, a new direction has also been added in the form of 3D sound effects. These make the game more immersive and add to the tension of the horror-style game being created, by giving the user an incentive to move his or her head to see around certain areas. This channel of communication from the computer to the user is missing from Ritter's example.


However, to achieve maximum immersion in the game, communication between the user and the computer must be bidirectional.

1.2 Context

"Players engaged in video games often exhibit physical behaviours not necessary for game control, such as leaning when steering during racing game or dodging bullets in shooting-related scenarios. In these cases, upper body movement can be translated through head movements. Consequently, face tracking can be applied to reflect players' intentions." (Wang S, 2006). These subconscious movements would normally help the situation in real life, but are futile in today's games because they are not translated into the virtual world.

Many games console companies are investing in "vision" technology of some sort. Although it has been around for a while (for example Sega's Dreameye for the Dreamcast, or the better-known EyeToy for Sony's PlayStation 2), it has only recently become popular. Nearly all mainstream games consoles now have devices for tracking the user's movement, enabling them to interact physically with games (for example Sony's PlayStation Move and Microsoft's Kinect for the Xbox 360). The reason this technology did not take off in the past might be that the processing power of games consoles was not what it is now, or perhaps Nintendo's revolutionary Wii sparked a new desire to interact with games in a new way. According to Juul, the key to Nintendo's success in games is that its consoles "have physical interfaces that mimic the action in the games" (Juul, 2009). However, Sega brought out other hardware for the Dreamcast besides the Dreameye, for example the DC Fishing Rod, which, despite being released long before Nintendo's Wii, is comparable to the Wii Remote in many ways. The reason these early devices failed to take off may be that they were all additional hardware for the console: only some games supported each device, and fewer people wanted to buy the extension, whereas the Wii ships with its Remote and Nunchuk as standard, meaning that every game designed for the platform supports the same hardware.

Another reason games using this technology failed to take off may be a matter of integration. The games typically produced to support it rely on movement to interact with the game; many Wii and Xbox Kinect games require a lot of movement space and exaggerated body movements to control the character directly, rather than using natural movement to aid game play as suggested in this project. Although webcams have been a standard computer peripheral for a long time, head tracking technology has not really branched out into computer games. The method of game play has to differ from that of games consoles using similar technology, because there is no controller that can be used to detect movement (such as the Wii Remote or PlayStation 3 controllers, which use accelerometers, among other sensors). Also, because the user sits at the computer, the whole body cannot be detected by a webcam (as it can by the Xbox Kinect), which leads naturally to the idea of head tracking.


1.3 Aim and Objectives

This project is a proof of concept concerning the level of immersion achievable in games using standard hardware and accessible software. That is to say, it does not require (for example) any expensive Application Programming Interface (API), nor does it depend on specialist headgear or other such equipment, as most everyday gamers will not have these items and requiring them would limit the playability and marketability of the game. To create an immersive environment, the project uses head tracking that requires no set-up before game play, as some face tracking systems do: the user can start playing immediately without having to train the classifier to recognise their individual face. During game play the viewing position of the camera changes according to the movement of the user's head. Being in first person, this effect should make the user's physical head movements and those of the game character feel as one. "In traditional video games, actions need to be mapped to a controller. In the real world, users have expectations of how their surrounding environment works. The game world should match such a model." (Wang et al, 2006). This statement supports the idea that coupling the user's head movements with the camera position in a game is more natural than not doing so.

Another important feature of this project is surround sound, which is used to drastically improve the feeling of immersion. If the user has stereo speakers or headphones, an atmosphere can be created through creepy sounds, and the game instantly becomes more realistic, drawing (or immersing) the user into the game. This effect is used not only for atmosphere but also to make the user aware of incoming enemies that might be just off-screen: by hearing a monster approaching, the player can use the direction of the sound to turn and face it. Together, head tracking and 3D sound work like real-life stimulus and response, whereby the stimulus is hearing something on the left (for instance) and the response is to look in that direction, making full use of the head tracking feature on which this project is based. This communication with the machine makes the game more immersive.

Another objective is enemy artificial intelligence. The aim was to create monsters that try to attack the player from either side or from behind rather than from the direction the player is looking. This not only makes the enemy more believable, but also forces the user to rely on the audio cues, which in turn promotes use of the virtualised head movements; the user hears the enemy sneaking up from one side and can then move their head in order to see it. This also makes the game scarier, and therefore more immersive.

Other aspects of the game not directly related to immersion can be found in Appendix 8.1. This list is the original task assessment, which has changed since the beginning of the project. The game itself comprises a graphical display (the menu screen) which introduces the user to the game; the environment, including the terrain and atmosphere; trees, which are scattered to add detail and serve as obstacles the player can hide behind; monsters, as described earlier, which are meant to be scary; and the playable character, who can run around the map shooting the monsters.


2 Project Background

Head detection and tracking can be very useful in areas other than entertainment, the most obvious being security: by detecting a person's face through a security camera, facial recognition can then be used to identify that person. Although facial recognition is similar to face detection, it requires additional algorithms and does not play a role in this project. Viewing media using the windowing functionality described earlier would also be more immersive; viewing pictures, examining graphs or charts, or developing 3D models could all use windowing to let the user see around objects in an easy and natural manner. Another interesting application is in psychology. According to the observer effect, people change their behaviour when they are aware of being watched. A method called gaze estimation can work out where the user is looking on the screen (either from head orientation or by calculating the convergence of the two eyes), and because the head tracking requires no set-up (it automatically detects a face and tracks it), the user would be unaware they are being tracked. This could be useful in experiments to determine what attracts the attention of certain computer users, which would be valuable to the marketing and web design industries. The recognition of facial expressions is another reason this project might interest psychologists. Although related, this project does not concern gaze estimation: when looking through a window, the view does not change with the viewer's gaze direction; only when they move their head does the perceived orientation of objects alter. Another argument against using gaze estimation to change the camera angle is that the user must always face towards the camera for their facial features to be recognised; moreover, the user must be discouraged from turning their head away from the screen to turn around in the game, as they would then be unable to see the screen.

2.1 Problem Context

2.1.1 Depth Perception

Nearly all games today have 3D graphics of some sort. We call them 3D even though we know the image being displayed is not actually three dimensional; they look three dimensional because they incorporate monocular cues to depth such as perspective (by which parallel lines seem to converge in the distance) and relative size (objects seem smaller the further away they are). These cues are described as monocular because only one eye is needed to observe the optical illusion. As humans, we have two eyes facing the same direction, offset by a few inches, presenting two slightly different views simultaneously; the brain uses this difference to perceive a single image with depth. This process is called stereopsis. Headgear of many forms uses stereoscopy to simulate three-dimensional space, for example Virtual Reality (VR) goggles or anaglyph (red/cyan) stereoscopic glasses; these all work by sending slightly different images to each eye. An early example is the Virtual Boy, which uses a head-mounted display that looks like a cross between a Game Boy and a stereoscope. It has "2.5D" graphics and only one colour channel, but depth is simulated using stereopsis in this way. The problem with the Virtual Boy, and similar equipment, is that some users find they become nauseous after playing, possibly due to motion sickness caused by a disagreement between the visually perceived movement and the motion sensed in the inner ear.


More advanced virtual reality equipment exists, but it can be very expensive and is not commonly used by everyday game players. An asymmetric frustum could be used in the game, employing two cameras facing the same direction but with a horizontal offset similar to that of human eyes. By tinting one view cyan and the other red, the two views can be merged into an overlapped image, which can then be seen through anaglyph glasses as a single image with depth. The problem with anaglyph glasses is obviously that the game would be totally discoloured, and it would be unrealistic to expect all users to buy and wear special glasses to have a true 3D or immersive gaming experience. Without using any headgear, we cannot use stereopsis to gather information about depth in the virtual environment. "Pigeons' eyes do not have overlapping fields of view and thus cannot use stereopsis. Instead, they bob their heads up and down to perceive depth" (Steinman and Garzia, 2000, p. 180). This process, called motion parallax, works by moving one's head to gain different viewpoints. By continuously tracking the user's head movement, motion parallax can be emulated by changing the viewing position in the game according to the position of the user's head. Lee's work (Lee, no date) demonstrates the parallax effect and how this technique can give the impression that some objects are closer than the screen would normally allow. Lee uses infra-red-emitting glasses and a Wii Remote to track their position, but it is possible to do head tracking without any headgear at all, by detecting the user's face with an ordinary webcam.
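To make the windowing idea concrete, the sketch below shows one way an OpenGL frustum could be offset according to the tracked head position so that the screen behaves like a fixed window onto the scene. It is an illustrative assumption rather than the code used in this project; the function name, scale constants and coordinate conventions are all hypothetical, and it assumes the head is in front of the screen (headZ > 0).

```cpp
#include <GL/gl.h>

// Illustrative off-axis ("windowing") projection driven by the tracked head
// position. headX/headY are the head offset from the screen centre and
// headZ the distance from the screen, all in the same arbitrary units.
void applyWindowFrustum(float headX, float headY, float headZ,
                        float aspect, float nearPlane, float farPlane)
{
    // Half-extents of the near plane when the head is centred.
    float halfH = 0.5f * nearPlane;
    float halfW = halfH * aspect;

    // Shift the frustum opposite to the head offset, scaled down to the
    // near plane, so the screen acts like a window frame that stays put.
    float shiftX = -headX * nearPlane / headZ;
    float shiftY = -headY * nearPlane / headZ;

    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    glFrustum(-halfW + shiftX, halfW + shiftX,
              -halfH + shiftY, halfH + shiftY,
              nearPlane, farPlane);

    // Translate the eye with the head so nearby objects exhibit motion
    // parallax relative to distant ones.
    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();
    glTranslatef(-headX, -headY, 0.0f);
}
```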

2.1.2 Face Detection

This painting by Bev Doolittle addresses quite nicely the question of how we detect faces. There are thirteen faces depicted within the forest, but we know that the only real face in the picture is that of the rider, which cannot be seen properly. Psychologists have spent a long time researching our ability to recognise faces, and this is the main problem: how do we detect a face?

We know the rider has a face because we know that humans have faces, and we can see there is a person on a horse in the picture. This is context (i.e. we know that even realistic-looking faces are not real if there is no person there).


Assuming the webcam is placed in front of the user (which it should be), the problem of context is not an issue; the webcam, in any case, does not share our knowledge or world experience and so cannot use context to assume that a face is present. The difficulty we have in finding all thirteen faces in the picture relates to the fact that they are partially camouflaged by the artist's use of colour (the faces are not what we call "skin colour") and that not all of the hidden faces are complete. When detecting real faces we can assume they will all be complete, but there may be an issue with occlusion: when a person's face is partially hidden, detection may be more difficult, for example when a user covers a portion of their face by lifting a cup to their mouth to drink. This would be a rare but major form of occlusion, as the cup covers the nose and mouth; there are also many smaller but more frequent forms of occlusion, such as headphone/microphone sets, glasses, facial hair, or even make-up. In the picture above, the direction and expressions of the faces change their appearance enormously, and such variation makes face detection trickier for a computer. Another major issue for computers detecting faces is the lighting of the room. Lighting from different directions can make a face look dramatically different, and the situation is of course worse if no lighting is available. If the user were sat in a dark room (perhaps to intensify the scariness of the game), the glare from the screen alone might not provide enough light for any facial features to be detected; this is a problem that depends entirely on the quality of the user's webcam. Humans recognise faces very easily despite all of these factors, with an incredibly small failure rate (i.e. thinking a face is not a face, and vice versa), and almost immediately at that. However, it is quite another task to identify how we do this, or to write down a fixed set of rules that detect faces as easily as we can.


2.2 Comparison of Technologies

This chart is a recreation from Learning OpenCV (Bradski, 2008) which compares the three libraries mentioned below (LTI, VXL and OpenCV with and without IPP) on four different benchmarks. The scores indicated are proportional to run time.

2.2.1 OpenCV

OpenCV stands for Open Source Computer Vision; it is a library of functions for real-time computer vision, i.e. image processing, where in this case the images come from a webcam. It is the library used in this project because it is fairly well documented, and as computer vision has not been taught at university, the amount of documentation and tutorials available is a major factor. In addition, the Haar-like feature detection algorithm (described later) is implemented in OpenCV. As shown in the chart above, OpenCV is consistently faster than the other two libraries, and faster still with Intel's Integrated Performance Primitives (IPP).

2.2.2 VXL or LTI

Like OpenCV, VXL and LTI are libraries for computer vision. VXL is a collection of C++ vision libraries; the "X" stands for the various libraries in the collection (geometry, image processing, numerics, streaming, etc.), with "V" and "L" standing for Vision Libraries. Its main benefit is portability. LTI-Lib is an object-oriented library for computer vision and image processing, also implemented in C++, and similar in scope to OpenCV and VXL.

[Chart: benchmark run times for LTI, VXL, OpenCV and OpenCV + IPP across four tests. Test station: Pentium M, 1.7 GHz. Libraries: OpenCV 1.0pre, IPP 5.0, LTI 1.9.14, VXL 1.4.0. Benchmarks: 2D DFT (forward Fourier transform of a 512x512 image); Resize (512x512 to 384x384, bilinear interpolation, 8-bit 3-channel image); Optical Flow (520 points tracked with a 41x41 window, 4 pyramid levels); Neural Net (mushroom benchmark from FANN).]


2.2.3 Face API

AAMs (described below) would probably be the best face detection and tracking method for this project. Face API is a commercial, closed-source API which implements AAMs. It would be a good API to use, as it appears to be fast and robust and offers many features that would answer the research question asked in this report; however, its cost (approximately US $4,000) is beyond the budget of this project.

2.2.4 AAM API or VOSM

AAMs are also implemented in AAM-API (Stegmann, no date a) and VOSM (Visual Open Statistical Models), both of which exist specifically for the implementation of AAMs. Unfortunately, neither is very easy to understand, and both are quite poorly documented.

2.2.5 STASM

STASM is a library for detecting facial features which could be used to create an AAM (Milborrow S, no date). Unfortunately, it is designed to work on passport-style photographs and would not work well in real time on a moving head: by the time a face had been detected in one frame, the rest of the game would have moved on. This is why the speed of detection has to be so high (real time). The main reason for including this library in the report is to show some of the research that went into AAM technology during the early stages of this project.

2.3 Comparison of Algorithms

The algorithms below are concerned with finding the x and y coordinates of the user's head position. With a single webcam, the z position has to be estimated from the size of the detected head. This project is not concerned with the angle at which the user rotates their head (or gaze direction, as briefly described earlier), for two reasons. First, when a person looks through a window (imagining the window frame as the boundary of a computer screen), the view does not change with their head rotation or view direction, only with their head position. Second, the user would otherwise have to look away from the screen in order to turn 180 degrees in the game. This kind of rotation is instead achieved using the mouse, as in many current computer games.

2.3.1 Skin Colour Detection

This is a fairly simple method in principle: the computer detects areas whose colour is near an average skin colour. The problems with this method (assuming the user has a full-colour webcam) are immediately obvious. Different ethnicities widen the spectrum of possible skin colours (there have been recent articles questioning whether the Xbox Kinect is racist, following problems detecting darker-skinned employees); different types of lighting change the apparent or perceived colour of the face; and some wooden furniture can be mistaken for skin because of its colour. Skin colour detection alone will remain problematic for face detection for the foreseeable future, but as with any of the algorithms discussed, the best results are obtained in good lighting.
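To illustrate the idea, the fragment below sketches a naive skin-colour segmentation using OpenCV's C++ interface. The HSV threshold range is an illustrative assumption only and, as argued above, will fail for many skin tones and lighting conditions.

```cpp
#include <opencv2/opencv.hpp>

// Naive skin-colour segmentation: threshold the frame in HSV space against
// a crude, illustrative "skin" range. White pixels in the returned mask are
// candidate skin regions.
cv::Mat detectSkin(const cv::Mat& frameBGR)
{
    cv::Mat hsv, mask;
    cv::cvtColor(frameBGR, hsv, cv::COLOR_BGR2HSV);
    cv::inRange(hsv, cv::Scalar(0, 40, 60), cv::Scalar(25, 180, 255), mask);
    cv::medianBlur(mask, mask, 5);   // remove speckle noise
    return mask;
}
```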

2.3.2 Motion Detection

Many camera-based games for the Sony PlayStation 2 use motion detection as the sole method of extracting data from the camera input, which limits the mechanism to interaction by hitting hotspots on the screen. The game Virtua Fighter (which uses the EyeToy) interprets motion detected near the enemy's head as an attempt by the user to punch the enemy. The game Wishi Washi relies on the novelty of this mechanic: game play consists of virtually cleaning a dirty screen by making wiping motions with one's hands.


“... the motion only system does not have enough information to answer questions below: Is the player left or standing still in front of the camera? Is it a human player who is producing the detected motion? Is the player overreacting and falling out of the camera‟s view? How many people are in the game? Is the player cheating with helpers? Is the player too close to or too far from the camera?” (Wang S et al, 2006).

2.3.3 Haar-like Feature Detection

This method was first proposed by Paul Viola and Michael Jones in 2001 (Viola and Jones, 2001), using four basic scalar features. The number of features was later extended, and the algorithm improved by Rainer Lienhart (Lienhart, 2002).

Haar-like features (as depicted above) are simple rectangular features, similar to, but not exactly like, Haar wavelets (square waves with one high interval and one low interval). A cascade of boosted classifiers using Haar-like features is trained on hundreds of sample face images, called positive examples, and on arbitrary images used as negative examples, all scaled down to the same size (e.g. 24x24 pixels). The trained classifier is then applied to the input images (from the webcam in this case): a search window is moved across the image, checking every location at different scales when the face size is unknown. This algorithm is used in this project, as it is implemented in OpenCV; the detection routine used here returns 0 if no faces are found and 1 otherwise. The main advantage of the algorithm is its speed, which is vital in gaming, because even a small amount of noticeable latency can be distracting and off-putting during play.
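The per-frame detection loop implied by this description can be sketched roughly as follows. This uses the modern OpenCV C++ interface rather than the C interface available at the time of the project; the cascade file ships with OpenCV, but the detection parameters and window sizes shown are illustrative assumptions.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

int main()
{
    cv::CascadeClassifier cascade;
    if (!cascade.load("haarcascade_frontalface_alt.xml"))
        return 1;                               // cascade file not found

    cv::VideoCapture webcam(0);
    cv::Mat frame, grey;

    while (webcam.read(frame))
    {
        cv::cvtColor(frame, grey, cv::COLOR_BGR2GRAY);
        cv::equalizeHist(grey, grey);           // reduce lighting variation

        std::vector<cv::Rect> faces;
        cascade.detectMultiScale(grey, faces, 1.1, 3,
                                 cv::CASCADE_SCALE_IMAGE, cv::Size(80, 80));

        if (!faces.empty())
        {
            // The centre of the detection would drive the in-game camera;
            // the rectangle's size gives a rough depth (z) estimate.
            cv::Point centre(faces[0].x + faces[0].width / 2,
                             faces[0].y + faces[0].height / 2);
            cv::circle(frame, centre, 4, cv::Scalar(0, 255, 0), -1);
        }

        cv::imshow("webcam", frame);
        if (cv::waitKey(1) == 27) break;        // Esc to quit
    }
}
```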

2.3.4 AAMs

Standing for Active Appearance Models, this is the most complex method considered, but also the most desirable. Like an Active Shape Model (ASM), it works using a set of coordinates of landmarks around the face, but it also takes into account the texture within these points, as demonstrated in the picture. This technique could be used in a project such as this to implement more advanced features, for example mapping the user's face onto a character (both coordinates and texture), but unfortunately it would take too long to understand fully and implement. During the early stages of this project a lot of research was done on AAMs as a method of face detection, yet it soon became apparent that the technique was too ambitious, and Haar-like feature detection has been used instead, as it is more suitable within the scope of the available resources.


2.4 Similar Projects

One similar project that influenced this one is Johnny Lee's Head Tracking for Desktop VR (Virtual Reality) Displays using the Wii Remote (Lee, no date), as mentioned earlier. This concept work allows the user to visually move around the screen by wearing infra-red-transmitting headgear, with a Wii Remote propped up near the screen to pick up the user's location. Torben Sko's work was referenced in the initial report, following Sko's PhD work on head tracking in first-person games (Sko and Gardener, 2009). Working closely with the company behind Face API, he used the Half-Life 2 engine to implement head tracking. Since then, Sko has brought out a game called Face-Off Paintball (Sko, 2010), which also uses Face API and the Source engine. Tech48 (Teatime, 2009) is a Japanese 'eroge' game which uses Teatime's special webcam, the T-CAM, to track the user's head and hand gestures. The game uses the technology to interact with anime-style girls, the main attraction being the ability to look up their skirts.


3 Technical Development

The initial task analysis has changed somewhat since it was written for the initial report; the list can be found in Appendix 8.1. The third task listed (after research and system design) describes the environment for the game. The terrain uses a height map: a picture (greyscale in this case, though separate colour channels can be used) whose colour-channel intensity determines the height. Here, dark grey areas represent valleys and lighter shades represent hills. The image used is black and white, but only the red colour channel is read, which means the blue and green channels could still be used for something else, such as placing objects in the scene.

A commonly used method called multitexturing makes the ground more realistic: one texture is stretched to the size of the terrain (the texture map) and shows its overall look, while another texture (the detail map) is tiled repeatedly over the terrain to add grassy detail. A skybox was then developed, simply by creating a textured cube positioned relative to the viewing position so that the camera is always in the middle. Depth is not calculated for the skybox, so it is occluded by even the furthest objects, which makes the sky look infinitely far away. Textures of a cloudy night sky are used; they are made specifically for skyboxes (viewed normally they look warped), so that when they fit together to form the cube the user is not aware it is a cube at all. To add to the atmosphere of the scene, fog is simulated, rendering far-away objects more grey.

The fifth task in the initial task analysis was to make 3D models. 3D models are used in this project, though developing them from scratch would have been outside its scope. Instead, the models are taken from Doom, which became freely available to the public domain in 1997. To load the MD2-format models into the game, third-party code is used (Jacobs, no date); this is referenced within the program, with comments on the changes made to it.
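As a sketch of the height-map idea (not the project's actual loader), the function below reads the red channel of an image and converts it to a grid of terrain heights. Using OpenCV for the image load, the world-space scaling and the function name are assumptions for illustration.

```cpp
#include <opencv2/opencv.hpp>
#include <string>
#include <vector>

// Build a width x depth grid of heights from the red channel of a
// height-map image: darker values become valleys, lighter values hills.
std::vector<float> buildHeights(const std::string& heightMapFile,
                                float maxHeight, int& width, int& depth)
{
    cv::Mat img = cv::imread(heightMapFile);    // 8-bit, 3-channel (BGR)
    if (img.empty())
        return {};                              // file missing or unreadable

    width = img.cols;
    depth = img.rows;

    std::vector<float> heights(width * depth);
    for (int z = 0; z < depth; ++z)
        for (int x = 0; x < width; ++x)
        {
            // OpenCV stores pixels as BGR, so index 2 is the red channel.
            unsigned char red = img.at<cv::Vec3b>(z, x)[2];
            heights[z * width + x] = (red / 255.0f) * maxHeight;
        }
    return heights;
}
```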

3.1 System Architecture

During the early stages of the project, the programming architecture was not addressed as thoroughly as it might have been; the program was initially developed incrementally, adding classes as they became necessary rather than according to a system design. This structure (outlined in Appendix 8.2) was replaced very late in the project with the MVC (Model-View-Controller) architecture.


The MVC architecture was designed to keep the data and the representation of an object separate (Reenskaug, 1979). Data and logic are kept in the model layer, while all the representation code is kept in the view layer; a controller sends messages to the model and view via interfaces. Implementing this architecture meant splitting each object (which previously had one class) into three classes: all of the rendering and OpenGL code was put into the View classes, and the object data and other methods were put into the Model classes.

The MVC architecture is known for its use in business software development, though it can also be very useful in games. The separation of concerns (SoC) enables each layer to be debugged and developed independently, and keeping the logic and representation separate allows multiple views of the same model. Using Model-View-Controller also makes the project more flexible for later use: by changing only the rendering and input, the project could more easily be ported to another platform. Once the project conformed to the MVC architecture, management became much easier and the frame rate improved dramatically, because the 3D model no longer needed to be loaded for every instance of an object. However, because the architecture was not introduced until almost all of the programming had been finished, implementing it took longer than necessary, as the methods had previously been optimised to fit the original structure. Nevertheless, the time spent was not wasted: the code is now much more understandable and manageable, making future development of the game a great deal easier.
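A minimal sketch of this split for a single game object might look like the following. The class and member names are hypothetical and the rendering body is elided; the point is only to show where data, drawing and coordination each live.

```cpp
#include <cmath>

struct MonsterModel {                 // model: data and game logic only
    float x = 0, z = 0, health = 100;
    void moveTowards(float tx, float tz, float step) {
        float dx = tx - x, dz = tz - z;
        float len = std::sqrt(dx * dx + dz * dz);
        if (len > 0.001f) { x += step * dx / len; z += step * dz / len; }
    }
};

class MonsterView {                   // view: rendering code only
public:
    void render(const MonsterModel& m) {
        // e.g. translate to (m.x, terrain height, m.z) and draw the MD2
        // model here; kept as a comment so the sketch stays self-contained.
        (void)m;
    }
};

class MonsterController {             // controller: ties model and view
public:
    MonsterController(MonsterModel& m, MonsterView& v) : model(m), view(v) {}
    void update(float playerX, float playerZ, float dt) {
        model.moveTowards(playerX, playerZ, 2.0f * dt);
    }
    void draw() { view.render(model); }
private:
    MonsterModel& model;
    MonsterView&  view;
};
```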

3.2 Head Tracking

A lot of research was necessary throughout the development of the project. The biggest topic was computer vision, owing to a lack of prior experience and the difficulty of finding appropriate expertise. The fundamentals of image processing had to be learnt before starting any work on the immersive side of the game, and after reading various materials on the subject, further research was needed to bridge the gap between basic image processing and face detection and tracking algorithms. Implementing Active Appearance Models was a highly desirable task that was not realised: after approximately six weeks of research and development testing (using AAM-API and VOSM separately to establish which would achieve the best results), it was decided that the feature was not feasible to implement. Researching a different method of face detection was the next step. After weighing up the other possible libraries (described above), the decision was made to use OpenCV. This was a new technology that had to be learnt before the head tracking system could be developed, but the process was not too strenuous, as OpenCV is fairly well documented.

3.3 System Testing

An incremental testing method was used throughout the development of this project: at runtime, the program prints its current process to the console, which was used to make sure there were no errors or interruptions when initialising objects or loading files. Testing the graphical side of the program was also incremental, altering values by trial and error to improve the appearance of the game. Integration testing was used to isolate certain tasks and ensure they worked independently; the face detection and tracking class was developed and tested separately before being integrated into the rest of the game.


The collision detection and response code was also developed separately, in a 2D world using circles of varying radius as objects. To confirm that the collision code still worked as bounding cylinders in 3D space, the distance between test objects was printed to the console every time a collision was detected. At each milestone, more extensive testing was done before moving on to the next stage. The face tracking system is designed to ignore other people who might be in the background, by finding the biggest face and discarding all other detections; the user's head should be the biggest from the webcam's perspective. However, if the user were sat next to or in front of a poster or statue showing a person's face, the algorithm might decide that is the user. To test this scenario, exactly that situation was set up.

This test shows that if a larger 'head' is located in the background, the system will ignore the user. The only way to combat this is to cover up such background interference. Notably, this should not happen with non-photorealistic faces, as the classifier was trained only on real faces; there remains, however, the possibility of false positives (wrong detections), comparable to the phenomenon of pareidolia (perceiving an imaginary image in a pattern, for example a man in the moon). Even these false positives should be ignored unless the pattern is a blatant representation of a face. One problem that can arise is caused by the positioning of the user's webcam. During the production of this game, the webcam was placed centrally on top of the monitor, which is the ideal position. Placing the camera at the bottom of the monitor can also work, so long as the user does not look up too much. Problems arise when the user places the camera to the side of the monitor, or away from the monitor altogether: the webcam is then unlikely to detect the user's face because of the angle, since Haar-like feature detection works best with front-on faces. Even if the webcam can detect the user's face, the parallax motion might not work properly, given how the effect is produced.
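The "keep only the biggest detection" heuristic tested here can be expressed in a few lines; the following is a sketch of the idea rather than the project's exact code.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Among all face detections in a frame, the rectangle with the largest
// area is assumed to be the user; everything else is discarded.
bool pickUserFace(const std::vector<cv::Rect>& detections, cv::Rect& user)
{
    if (detections.empty())
        return false;                   // no face found this frame

    user = detections[0];
    for (const cv::Rect& r : detections)
        if (r.area() > user.area())
            user = r;                   // keep the largest detection
    return true;
}
```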


4 Critical Evaluation

4.1 Project Achievements

4.1.1 Face Tracking

A fast and robust face detection system has been built which, in itself, works well, detecting a wide variety of faces at high speed. However, tracking the detected face causes a small amount of jitter in the camera, which impedes the functionality of the game. During development of the face tracking system, the biggest problem was speed; speed is important in every aspect of a game to maintain an enjoyable experience. However, by focusing chiefly on efficiency, precision was overlooked. The algorithm used is not best suited to finding precise coordinates, but to quickly finding the position of a head and an approximation of its coordinates. To find a face, a search window is moved over the image from the camera, and because the search inevitably finds the same face multiple times, these detections are merged to form the approximation. As a result, when the head position is coupled to the camera, the viewpoint jumps around and looks "glitchy" rather than moving smoothly with the user's head. Unfortunately, the level of jitter is almost enough to counteract the benefits of the system. This may partly be a consequence of the webcam's frame rate: most webcams deliver thirty frames a second in good lighting, which means face detection can only be performed every thirtieth of a second, and this may cause noticeable latency. More likely, however, the problem lies in the cascade being used. The cascade is one native to OpenCV, haarcascade_frontalface_alt.xml, which is trained to detect many different faces, but little more information is given about it. Aside from the jitter, the parallax effect works quite well: when the user moves their head left, the camera pans left, and so on. At first the camera simply translated with the user's head movement (when the user moved their head left, the camera also moved left), but the effect was not realistic. Because the user tends to focus on the centre of the screen (especially when aiming, as the crosshair is in the centre), when they move their head left they are also turning slightly to look right. An experiment has been conducted to determine whether there is any benefit to artificially exaggerating head movements in a virtual environment such as this: "Although no significant differences were found for speed or accuracy by level of exaggeration, subjective impressions from the participants suggested that they preferred at least a modest amount of exaggeration. In addition, under one level of exaggeration (an exaggeration factor of 2), users seemed to get better significantly faster than in the other conditions." (Teather R J and Stuerzlinger W, 2008). The paper suggests that with an exaggeration factor of double, users improved at the tasks much faster than the others. This could be an interesting further development for this project, to establish whether a user's performance in the game can be increased using this method.
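One way to express the coupling between the detected head position and the in-game camera, including the exaggeration factor investigated by Teather and Stuerzlinger, is sketched below. The normalisation scheme, the mirroring and the structure names are assumptions for illustration, not the project's actual code.

```cpp
struct HeadPose { float x, y, size; };      // as reported by the detector

// Map the detected head position (in image pixels) to a camera offset.
// exaggeration = 1 gives a one-to-one mapping; 2 doubles the movement.
void updateCamera(const HeadPose& head, int frameW, int frameH,
                  float exaggeration, float& camX, float& camY)
{
    // Normalise to [-1, 1] with the frame centre as the origin.
    float nx = (head.x - frameW * 0.5f) / (frameW * 0.5f);
    float ny = (head.y - frameH * 0.5f) / (frameH * 0.5f);

    // Mirror so that moving the head one way pans the view the same way
    // on screen, then scale by the exaggeration factor.
    camX = -nx * exaggeration;
    camY = -ny * exaggeration;
}
```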

4.1.2 Artificial Intelligence

Artificial intelligence is a huge area in games programming, and a lot of research was put into making the enemies more believable and improving the game by increasing its verisimilitude. There are many algorithms that could be used to create believably intelligent enemies, and this area alone could be a separate project.


An A* algorithm (Rabin S, 2002, pp 103-153) was first considered for this project. The trees' coordinates could be used to divide the terrain into segments using Delaunay triangulation; these segments would then be used to calculate the fastest route for each monster to reach the player by moving from segment to segment. The tree positions could also be used to divide the terrain with the Voronoi algorithm (de Borg M et al, 2000, chapter 7), with the monsters calculating the shortest path to the player by moving along the edges between regions. Both the Delaunay and Voronoi algorithms are implemented in OpenCV, using virtual points to calculate the segmentation. With A*, however, the monsters move towards waypoints rather than following the player directly, which could take some of the realism away from the game. A more fluid method of movement is a Boids-like intelligence. Boids is an artificial life program developed to simulate bird flocking in three dimensions (Reynolds C, 1987); for this project, movement would be two dimensional (along the x-z ground plane), reacting to nearby objects. Flocking is not necessary for this game, but the goal-seeking and obstacle-avoidance parts of Boids are desirable. In the end, the artificial intelligence implemented for the enemy takes the form of a Finite State Machine (FSM) (Carlisle P, 2002). This has the freedom of movement that a Boids-like algorithm offers, but lacks the path planning of A*. The enemy starts off standing at a randomly generated position, where it remains until the player comes within seeing distance; it then walks directly towards the player unless a tree is in the way, in which case it tries, rather unintelligently, to walk around the tree until it is no longer an obstacle. When the enemy is close enough, it starts to attack the player, until one of them is dead.
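The enemy behaviour just described maps naturally onto a three-state machine. The sketch below is illustrative (the state names, distance thresholds and the omitted tree-avoidance step are assumptions), not the project's actual class.

```cpp
#include <cmath>

enum class MonsterState { Idle, Chase, Attack };

struct Monster {
    float x, z;
    MonsterState state = MonsterState::Idle;

    void update(float playerX, float playerZ, float dt) {
        float dx = playerX - x, dz = playerZ - z;
        float dist = std::sqrt(dx * dx + dz * dz);

        switch (state) {
        case MonsterState::Idle:                 // stand until the player is seen
            if (dist < 50.0f) state = MonsterState::Chase;
            break;
        case MonsterState::Chase:                // walk directly at the player
            if (dist > 0.01f) {
                x += 2.0f * dt * dx / dist;
                z += 2.0f * dt * dz / dist;
                // (tree avoidance would deflect this direction here)
            }
            if (dist < 2.0f)  state = MonsterState::Attack;
            if (dist > 60.0f) state = MonsterState::Idle;
            break;
        case MonsterState::Attack:               // hit the player until one dies
            if (dist > 3.0f)  state = MonsterState::Chase;
            break;
        }
    }
};
```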

4.2 Further Development

To reduce the error in the head tracking system, a cascade could be trained specifically for this purpose, with more attention given to the dimensions of the training data. A more favourable method, however, would be to implement optical flow, such as the Lucas-Kanade algorithm (Baker and Mathews, 2002). Given another nine months or so, this project could certainly be turned into a functional game engine, usable for many other games using head tracking technology, and/or into a fully marketable first-person horror game, the first of its kind to take advantage of head tracking. There are other features related to face detection that would be great additions, such as extracting the texture of the user's face and mapping it onto the character. This would work best on a networked multiplayer platform, where each user could see the others in the game. It would also be possible to animate the character's head movements and facial movements to match those of the user. Such features would bring analogue data from the user into the game, helping to break down the barrier between our reality and virtual reality. However, these features would be best achieved using model-based face tracking such as AAMs. The game could even include face recognition, allowing secure saved games and automatic logging on. Additionally, certain gestures could be recognised; for example, in Sko's work the user can look down the gun sight by tilting their head to one side, as if preparing to look down a scope (Sko, 2010). The idea of recognising facial expressions was briefly mentioned earlier in the report, and would be a very interesting topic to pursue in a game such as this.


By recognising when the user is shocked or scared, the game could use the same techniques to keep scaring them, keeping them immersed in the atmosphere. Using machine learning, the game could actively learn what scares the current user; if the user becomes desensitised to a particular trick, the game could stop using it for a while. To give a better sense of depth, an asymmetric frustum could be developed, adding stereoscopic vision to the parallax motion currently used and making a truly three-dimensional game. These are all feasible features which could be introduced to this game or ones like it, though including all of them in a full game would perhaps require more processing power than is commonly available. According to Moore's law, the processing speed of hardware increases exponentially (doubling approximately every two years), so if it is not feasible to include head tracking in today's mainstream games, it will be very soon.

4.3 Personal Reflection

This has been a highly ambitious and challenging project. Many areas could have been developed or expanded further with more time, which clearly indicates that further work and experimentation may yield more satisfactory results. Similarly, many strands of the project suggest areas that could be projects in themselves; the artificial intelligence for enemies in a game, for example, has a vast scope and is a very interesting topic. Focusing more on head tracking algorithms and implementing Active Appearance Models would have been advantageous, but for this project the decision was made that AAMs were simply too complex to implement with no supervisors in the department having relevant experience of computer vision. If the project were started again, the first major change would be the amount of time spent on system architecture: with a solid fundamental structure, the rest of the game can be developed, tested and debugged far more easily. In fact, more effort would be put into all aspects of design before any code is written. The direction of the project would also be steered more towards one area of interest; for example, by using an existing game engine, more focus could be given to the head tracking feature. Some things would not be changed, such as the methodology: putting a lot of time into the research behind the project proved beneficial, and the prototyping process of developing, testing, debugging and integrating individual sections of code worked well. Using MVC from the beginning would also make this process easier.

5 Conclusion

Because of the irregularity in the Haar-like feature detection algorithm, this project does not directly prove the concept of head tracking as a means of manipulating the view perspective. The concept, however, has not been disproved either. The research question posed at the beginning of the project was whether or not it is plausible for computer games to become more immersive using head tracking technology, without having to spend a large amount of money on specialist development tools. Whilst this particular implementation demonstrates some of the inherent weaknesses, the project suggests that, with more time and research, the answer would be in the affirmative.

6 References

Baker S and Matthews I, 2002, Lucas-Kanade 20 Years On: A Unifying Framework, International Journal of Computer Vision, volume 56, USA: Springer.
Berners-Lee T et al, 2001, The Semantic Web, Scientific American, May 2001, pp 29-37.
Bradski G and Kaehler A, 2008, Learning OpenCV: Computer Vision with the OpenCV Library, edited by Mike Loukides, First Edition, California: O'Reilly Media.
Carlisle P, 2002, Designing a GUI Tool to Aid in the Development of Finite-State Machines, in Rabin S, AI Game Programming Wisdom, Massachusetts: Charles River Media, pp 71-78.
de Berg M et al, 2000, Computational Geometry: Algorithms and Applications, Second Edition, Germany: Springer.
Edwards G J, Taylor C J and Cootes T F, 1998, Active Appearance Models [conference], Available: http://personalpages.manchester.ac.uk/staff/timothy.f.cootes/refs_by_subject.html#AAM [Accessed 12 October 2010].
Higgins D, 2002, Generic A* Pathfinding, in Rabin S, AI Game Programming Wisdom, Massachusetts: Charles River Media, pp 114-122.
Jacobs B, no date, Video Tutorials Rock [online], Available: http://www.videotutorialsrock.com/opengl_tutorial/animation/text.php [Accessed 12 July 2010].
Juul J, 2009, A Casual Revolution: Reinventing Video Games and Their Players, Cambridge: MIT Press.
Lee J C, no date, Head Tracking for Desktop VR Displays using the Wii Remote [online], Available: http://johnnylee.net/projects/wii/ [Accessed April 2010].
Lienhart R and Maydt J, 2002, An Extended Set of Haar-like Features for Rapid Object Detection [online], Available: http://reference.kfupm.edu.sa/content/e/x/an_extended_set_of_haar_like_features_fo_76939.pdf [Accessed 05 January 2011].
Milborrow S, no date, STASM Homepage [online], Available: http://www.milbo.users.sonic.net/stasm/ [Accessed 06 November 2010].
Rabin S, 2002, AI Game Programming Wisdom, Massachusetts: Charles River Media.
Reenskaug T, 1979, Models-Views-Controllers [online], Available: http://heim.ifi.uio.no/~trygver/themes/mvc/mvc-index.html [Accessed 07 February 2011].
Reynolds C, 1987, Flocks, Herds and Schools: A Distributed Behavioural Model, Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, 1987, pp 25-34.
Ritter W, 2011, Benefits of Subliminal Feedback Loops in Human-Computer Interaction, Advances in Human-Computer Interaction, volume January 2011, New York: Hindawi Publishing Corp.

Seeing Machines, no date, FaceAPI [online], Available: http://www.seeingmachines.com/product/faceapi/ [Accessed 05 September 2010].
Sko T and Gardner H J, 2009, Head Tracking in First-Person Games: Interaction Using a Web-Camera, in Gross T et al, Human-Computer Interaction – INTERACT 2009, part 1, volume 5726, pp 342-355.
Sko T, 2010, FaceOff Paintball [online], Available: http://torbensko.com/faceoff/ [Accessed 24 April 2011].
Stegmann M B, no date, AAM-API [online], Available: http://www2.imm.dtu.dk/~aam/ [Accessed 06 November 2010].
Steinman S B and Garzia R P, 2000, Foundations of Binocular Vision: A Clinical Perspective, United States of America: McGraw-Hill Professional.
Teather R J and Stuerzlinger W, 2008, Exaggerated Head Motions for Game Viewpoint Control, in FuturePlay 2008, International Academic Conference on the Future of Game Design and Technology, Toronto: FuturePlay, pp 240-243.
Teatime, 2009, Tech48 [online], Available: http://www.teatime.ne.jp/infor/tech48/tech48_index.htm [Accessed 28 October 2010].
Wang S et al, 2006, Face Tracking as an Augmented Input in Video Games: Enhancing Presence, Role-playing and Control, New York: ACM.


8 Appendices

8.1 Initial Task Analysis

Number | Task | Task Description | Dependencies | Time Needed (Weeks)
1 | Research | Head tracking and its use in games | 0 | 2
2 | Class design diagram | Planning what classes need to be included | 0 | 1
3 | Create terrain | This will include a textured ground and a skybox to contain the game, and lighting to view it | 2 | 1
4 | Making models | 3D models to be used for the character, and enemies. This can be simple to start with, and make better ones depending on time. | 2 | 2
5 | Import model into program | Make a class to load and draw the model(s) in OpenGL | 3, 4 | 2
6 | Keyboard/Mouse handling | Program keyboard and mouse functions to control the character | 2 | 1
7 | Create GUI | A simple menu/title screen should be sufficient, and on-screen text during game play to show lives etc. | 2 | 2
8 | Research computer vision methods | I will need to look at frameworks/APIs that support computer vision and compare them | 0 | 3
9 | Research methods of face detection | I will need to look at various algorithms to recognise faces | 8 | 4
10 | Create program to detect faces | This includes implementing the algorithm into a new program, and probably customising it | 9 | 2
11 | Research methods of face tracking | I will need to look at various algorithms to achieve this | 8 | 2
12 | Update program to track faces | This includes implementing the algorithm into the face detecting program, and probably customising it | 11 | 2
13 | Implement face tracking into 3D game | I will need to import the face detecting/tracking code into the game previously made, and make sure it still works | 3, 4, 9, 11 | 2
14 | Make camera move according to head position | Using the face tracking algorithm, I will make the camera move accordingly, to give a three-dimensional feel | 13 | 1

8.2 Interim Class Diagram