MOBILE3DTV

Project No. 216503

Results of user-centered quality evaluation experiments and usability tests of prototype

Dominik Strohmeier, Kristina Kunze, Satu Jumisko-Pyykkö

Abstract: In this report we present our work towards finalizing the user-centered Quality of Experience evaluation framework. During the standardization activities for our framework, we identified the need to compare OPQ with related methods and evaluation approaches to increase its validity. The first study compares OPQ evaluations conducted in the laboratory and in the context of use and shows comparable results in the statistical analysis. The second study introduces the Extended-OPQ approach, which allows deriving components of Quality of Experience from a series of OPQ studies. The QoE terminology developed for mobile 3D video was applied in this second study and the results are compared to an OPQ evaluation. The comparison again shows comparable results for OPQ and the descriptive evaluation with a fixed vocabulary, but also reveals the need for further work towards an optimized terminology. Altogether, the results of the studies contribute to the validity of OPQ and finalize its methodological development. The second part of the report covers the preparation of a final prototype evaluation for MOBILE3DTV. Although we were not able to conduct the study because the prototype was unavailable at the planned time, we report the test procedure and the selection of the independent variables in detail to provide a valid research plan.

Keywords: 3DTV, mobile video, Open Profiling of Quality, descriptive evaluation, comparison model, conventional profiling, terminology, component model


Executive Summary

Open Profiling of Quality (OPQ) has become a well-established tool during the quality evaluations of MOBILE3DTV. OPQ is a mixed-methods research approach which extends the commonly applied psychoperceptual, quantitative evaluations of perceived quality with a descriptive evaluation approach. This descriptive or sensory evaluation enables researchers to identify the underlying rationale of a quantitative quality rating based on test participants' individual attributes.

The application of OPQ in a series of studies on different research questions within the MOBILE3DTV project has shown good validity of OPQ through complementary research results among the studies. Together with the evaluations of quality in the context of use, it now forms the User-centered Quality of Experience (UC-QoE) evaluation framework. This evaluation framework was accepted as a contribution to the standardization activities of ITU-T SG12, where both methods were presented in the general meeting.

During the work towards the proposal, we identified several issues concerning OPQ that still needed to be studied. In this report, we present the results of two studies which targeted an increased validity of the OPQ approach. The goal of the first study was to extend and validate the use of the OPQ method in field circumstances. We conducted the first experiment in two different evaluation contexts, a laboratory and a café, with varying video qualities under assessment. The second study targets the comparison of OPQ with related research methods. For this comparison, we included two new approaches in the UC-QoE evaluation framework.

Firstly, we introduced the Extended-OPQ approach (Ext-OPQ) as an additional fourth step within OPQ to derive a general component model from the individual attributes of a series of studies in the same domain of research. We describe the research method of Extended-OPQ and the results of applying the Ext-OPQ approach to mobile 3D video. The result consists of 19 general descriptive attributes for mobile 3D video, which we then utilized further in an adaptation of OPQ based on a fixed vocabulary. A descriptive evaluation of quality based on a vocabulary that is the same for all participants is known as conventional profiling (CP) and constitutes a first approach towards operationalizing the Ext-OPQ component model. We compare the results of the CP evaluation to an OPQ evaluation in the second study reported here.

For this comparison, we further introduce our comparison model, the result of our work towards a holistic comparison of research methods. The comparison model was developed based on a literature review covering different research areas, from which we collected the applied comparison criteria. The comparison model structures these criteria into four different classes and allows for a systematic and holistic comparison of research methods. The comparison of the CP and OPQ approaches is based on a subset of attributes from the comparison model as a first step towards a deeper understanding of the benefits and shortcomings of different research methods in subjective quality evaluations.

Overall, the results presented in this report finalize the methodological work on the User-centered Quality of Experience evaluation framework within the MOBILE3DTV project and provide a final validation of OPQ as a well-established research method for mixed-methods evaluation. A second part of the report is dedicated to the final overall evaluation of the whole MOBILE3DTV prototype system. A study which combines different evaluation approaches with optimized components of the system has been planned. The research plan, the selected content, and a detailed description of the research method are provided in this report.


Table of Contents

1 Introduction
2 Simulator Sickness on mobile autostereoscopic screens
  2.1 Research Method
    2.1.1 Simulator Sickness Questionnaire
    2.1.2 Procedure
    2.1.3 Characteristics of the experiments
    2.1.4 Apparatus – displays
  2.2 Results
  2.3 Discussions and Conclusion
3 Probing OPQ in the context of use
  3.1 Research method
  3.2 Results
    3.2.1 Psychoperceptual evaluation
    3.2.2 Sensory evaluation
    3.2.3 Comparison of results
  3.3 Discussion and Conclusion
4 Extended OPQ
  4.1 Fixed vocabulary and terminologies in descriptive analysis
  4.2 The component model as extension of the OPQ method
    4.2.1 Open definition task and qualitative descriptions
    4.2.2 Components of Quality of Experience for mobile 3D video
5 Comparison of OPQ and CP
  5.1 Introduction and research problem
  5.2 Comparison criteria and comparison model
  5.3 Comparison Study: Comparing OPQ and CP
    5.3.1 Research Method
    5.3.2 Results
  5.4 Systematic Comparison of Methods
    5.4.1 Test results
    5.4.2 Test procedure
    5.4.3 Amount of time
    5.4.4 Costs
    5.4.5 Research purpose
  5.5 Discussion
6 Prototype Study: Usability and quality experience with the final mobile 3D prototype in the context of use
  6.1 Preliminary planning
    6.1.1 Participants
    6.1.2 Test design
    6.1.3 Test procedure
    6.1.4 Test Material and Apparatus
  6.2 Actual planning
    6.2.1 Participants
    6.2.2 Test design
    6.2.3 Test procedure
    6.2.4 Test Material and Apparatus


1 Introduction

The User-centered Quality of Experience (UC-QoE) evaluation framework [42] is the methodological result of the subjective quality evaluation studies in the MOBILE3DTV project. The evaluation framework consists of two main evaluation approaches that extend conventional quantitative profiling. The evaluations in the context of use extend quality assessment in controlled laboratory environments with an exploration of the impact of different evaluation contexts on users' experienced quality [46][42][29]. The contextual evaluations increase the external validity of the research results. The approach has been applied successfully in the quality evaluations of mobile 3D video and television [29][55].

The second methodological approach within the UC-QoE evaluation framework has been the development of a research method for the evaluation of individual quality factors by applying a descriptive evaluation method in parallel to common quantitative, psychoperceptual evaluations. Open Profiling of Quality (OPQ) is a mixed method that combines the evaluation of quality preferences and the elicitation of idiosyncratic experienced quality factors. It therefore uses quantitative psychoperceptual evaluation and, subsequently, an adaptation of Free-Choice Profiling [10][12]. The application of OPQ in the quality evaluations of mobile 3DTV has shown complementary results in a series of studies on different research questions (Table 1).

The complementary results and the additional knowledge obtained from Open Profiling of Quality, in contrast to pure psychoperceptual evaluations, have made Open Profiling of Quality a well-validated tool for mixed-methods evaluations in the UC-QoE evaluation framework.

Both methodological approaches of the UC-QoE evaluation framework were accepted as proposals for standardization in ITU-T SG12 [57][58]. Within the standardization activities for the OPQ approach, we identified two issues that needed to be addressed towards the finalization and validation of the research method. The first concerns the internal comparison of the different methods of analysis that can be applied to OPQ data sets [10] and the confirmation of its internal validity. The second concerns the external comparison of OPQ with related research methods in the field of descriptive quality evaluations.

In the first part of this deliverable we present results targeted at complementing the existing results. First, we present, in section 2, the results of an analysis of Simulator Sickness data which was collected in the previous studies. Then, we present the results of a study in which we conducted an OPQ evaluation in the context of use and compared the results of this evaluation to an assessment in a controlled laboratory environment. The results are presented in section 3. This work finalizes the development of the OPQ approach. In section 4, we introduce the Extended-OPQ (Ext-OPQ) approach [12]. During the applications of OPQ in the MOBILE3DTV project, we identified the need to transform the individual quality factors into a fixed vocabulary of components of Quality of Experience for mobile 3D video [12]. Ext-OPQ introduces the component model as an extension of the OPQ method which makes it possible to derive a common terminology from a set of OPQ studies in a specified field of research. The terminology obtained with Ext-OPQ was then used in an external comparison study in which the results for OPQ were compared to Conventional Profiling. Conventional Profiling (CP) is commonly understood as a sensory evaluation based on a fixed vocabulary [1][2][18][52]. We operationalized our QoE components for a CP approach. For a systematic comparison of related research methods, we first introduce our comparison model in section 5. This model was built based on a literature review of different comparison criteria for research methods in different domains of research. Our comparison model is the first step towards a holistic comparison of research methods. A subset of comparison criteria is then used to compare OPQ with the conventional profiling approach.


Table 1 Series of OPQ studies during the method's development, applied to different research questions on the experienced quality of mobile 3D video and television

Study 1: Experienced Quality of Audiovisual Depth [10][14]
Research question: How does the perception of audiovisual content change when it is presented in 2D and 3D?
Summary of results: Although the results of the psychoperceptual evaluations did not reveal any significant differences between the 2D and 3D conditions, the OPQ results show that the independent variables in the test were identified and evaluated. In addition, we were able to identify differences in which modality participants derive their quality attributes from.

Study 2: Experienced Quality of Audiovisual Depth in Mobile 3D Television and Video [10]
Research question: How does the perception of audiovisual mobile 3D videos change when they are presented either in 2D or on an autostereoscopic mobile device?
Summary of results: The results underlined the dominance of visual quality factors over audio factors and their interaction in the experienced quality. The results also showed a contradictory impact of the 3D presentation mode on overall quality and depth impression: while the use of 3D mode increased the depth impression, it decreased the overall satisfaction. 3D quality was often described in terms of artifacts. However, our results also showed that in artifact-free cases, 3D can reach higher perceived quality than 2D.

Study 3: Experienced Quality of Video Coding Methods for Mobile 3D Television [8][10]
Research question: What is the optimum coding method for mobile 3D video?
Summary of results: Our psychoperceptual evaluation showed that Multiview Coding and Video + Depth provide the highest experienced quality among the tested coding methods. The results of the sensory profiling showed that artifacts are still the determining quality factor for 3D. The expected added value through depth perception was rarely mentioned by the test participants; when mentioned, it was connected to artifact-free video. We identified a hierarchical dependency between depth perception and artifacts: when the visibility of artifacts is low, depth perception seems to contribute to the added value of 3D.

Study 4: Experienced Quality of Mobile 3D Video Broadcasting over DVB-H [11][12]
Research question: What are the optimum transmission conditions for transmitting mobile 3D videos over a DVB-H channel?
Summary of results: The results show that the provided quality level of videos with a low error rate is clearly above 50%. Still, the different coding methods had the highest impact on the experienced quality of the test participants. In the sensory evaluation we were able to show again that the expected descriptions of judder, as a contrast to fluency of the test items, were found rarely, and descriptions were dominated by artifacts relating to blockiness or blur. Again, an impact of 3D perception was only identified for artifact-free videos.

The second part of this deliverable targets a final holistic evaluation of the MOBILE3DTV prototype system. During its development process, the different stages of the system were optimized within a user-centered optimization approach along the production chain of mobile 3D video [10][11][12][45]. The final system, from optimized content to the use of the final prototype device, is targeted for evaluation in a scenario-based approach in different contexts of use. However, we were not able to conduct this study in time due to problems in the availability of a final prototype device. Section 6 presents the research plan for the study, including the selected content, the independent parameters, and a detailed description of the developed test procedure in the scenario-based evaluation.


2 Simulator Sickness on mobile autostereoscopic screens

Previous research into the subjective quality of autostereoscopic displays suggests that visual discomfort is a common problem for 3D media [60]. Visual discomfort on autostereoscopic displays is often caused by impairments in stereoscopy, e.g. crosstalk or keystone distortion [61]. The experienced visual discomfort may degrade the perceived image quality and cause annoyance, which can result in a lower acceptance of the novel technology [60]. Three main approaches for measuring visual discomfort exist: 1) explorative studies, 2) psychophysical scaling, and 3) questionnaires [60]. Questionnaires are commonly applied to subjectively study the degree of visual discomfort.

Kennedy et al. (1993) [62] originally developed the Simulator Sickness Questionnaire (SSQ) to study sickness-related symptoms induced by aviation simulator displays. In the questionnaire, symptoms that contribute to nausea, oculomotor symptoms, and disorientation are measured individually, and a combined total severity score is finally calculated to subjectively quantify the symptoms experienced by the participant. Since its conception, the SSQ has also been applied in several fields outside the aviation research community. Jaeger & Mourant (2001) [64] compared simulator sickness symptoms in static and dynamic virtual environments. They concluded that increased duration of exposure intensifies sickness symptoms. During their longest session of 23 minutes, they did not observe physiological adaptation that would lessen the symptoms during prolonged exposure. In contrast, Häkkinen et al. (2002) [63] applied the SSQ to study stability and sickness symptoms after Head-Mounted Display (HMD) use and noted that stereoscopic gaming induced strong nausea and disorientation symptoms, with the worst symptoms experienced within 10 minutes after task completion. Oculomotor symptoms were experienced independently of the used stimuli. In a similar study, Pölönen & Häkkinen (2009) [65] used the SSQ to measure sickness symptoms in three different applications while using a Near-to-Eye Display (NED). They compared SSQ scores while watching a movie, playing a game, and reading. The results show that discomfort was experienced with all applications, especially when reading with the NED. Disorientation was experienced especially when playing games with strong motion scenes. Movie viewing invoked the fewest symptoms of the three applications. Lambooij et al. [60] note in their review on stereoscopic displays and visual comfort that induced blur may cause unnatural depth perceptions. They also emphasize that spatial and temporal inconsistencies and conflicting depth cues are a cause of annoyance and visual discomfort.

2.1 Research Method

2.1.1 Simulator Sickness Questionnaire

In this study, we analyze the data of five different experiments in which we collected SSQ data. The SSQ as applied contains 16 physical symptoms [62]. Each symptom is rated by the test participant on a categorical labeled scale (none, slight, moderate, severe). Each symptom score then contributes to the groups of 1) nausea (e.g. stomach awareness), 2) oculomotor (e.g. eyestrain), and 3) disorientation (e.g. dizziness). A total score can be calculated as a weighted contribution of the subcategories. The weights are: nausea = 9.54, oculomotor = 7.58, disorientation = 13.92, and total score = 3.74. The SSQ was collected prior to immersion and several times after immersion (after 0, 4, 8, and 12 minutes) in experiments 1-4. In experiment 5, only immediate post-immersion data was collected, after each of two immersive sessions. In this report, the absolute values are presented.
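As an illustration of this scoring scheme, the following sketch computes the three factor scores and the weighted total from one set of symptom ratings. The weights are those given above; the mapping of the 16 symptoms onto the three factors is the standard one from Kennedy et al. (1993) [62] as we recall it, and should be checked against the original questionnaire before reuse.

```python
# Sketch of SSQ scoring; symptom ratings are coded none=0, slight=1,
# moderate=2, severe=3. The symptom-to-factor mapping below is the
# assumed standard Kennedy et al. (1993) assignment (several symptoms
# load on more than one factor).

SYMPTOMS = [
    "general_discomfort", "fatigue", "headache", "eyestrain",
    "difficulty_focusing", "increased_salivation", "sweating", "nausea",
    "difficulty_concentrating", "fullness_of_head", "blurred_vision",
    "dizziness_eyes_open", "dizziness_eyes_closed", "vertigo",
    "stomach_awareness", "burping",
]

FACTORS = {
    "nausea": ["general_discomfort", "increased_salivation", "sweating",
               "nausea", "difficulty_concentrating", "stomach_awareness",
               "burping"],
    "oculomotor": ["general_discomfort", "fatigue", "headache", "eyestrain",
                   "difficulty_focusing", "difficulty_concentrating",
                   "blurred_vision"],
    "disorientation": ["difficulty_focusing", "nausea", "fullness_of_head",
                       "blurred_vision", "dizziness_eyes_open",
                       "dizziness_eyes_closed", "vertigo"],
}

# Factor weights as given in the text.
WEIGHTS = {"nausea": 9.54, "oculomotor": 7.58, "disorientation": 13.92}
TOTAL_WEIGHT = 3.74


def ssq_scores(ratings: dict[str, int]) -> dict[str, float]:
    """Compute the three SSQ factor scores and the weighted total score."""
    raw = {factor: sum(ratings[s] for s in symptoms)
           for factor, symptoms in FACTORS.items()}
    scores = {factor: raw[factor] * WEIGHTS[factor] for factor in raw}
    # Total severity: sum of the three unweighted factor sums, times 3.74.
    scores["total"] = sum(raw.values()) * TOTAL_WEIGHT
    return scores


# Example: a participant reporting slight eyestrain and a slight headache.
print(ssq_scores({s: 0 for s in SYMPTOMS} | {"eyestrain": 1, "headache": 1}))
```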

2.1.2 Procedure

The structure of all experiments was similar. A pre-immersive evaluation of the SSQ was collected at the beginning of each test. During the immersion, test participants conducted a psychoperceptual quality evaluation task [24]. During the rating, participants gazed off screen, partly resembling a typical mobile video viewing situation [72]. Post-task SSQ data was collected directly after completion of the immersive psychoperceptual evaluation task.

2.1.3 Characteristics of the experiments

Experiment 1 targeted the evaluation of a suitable surround sound setup for a 15” autostereoscopic laptop [10]. Experiment 2 aimed at identifying the audiovisual experienced quality under monoscopic and stereoscopic video presentation [10]. Experiment 3 [10] explored the influence of video coding methods, and experiment 4 [12] the influence of transmission parameters, on the experienced quality of mobile 3D television. Finally, experiment 5 studied different video coding parameters. Experiments 1-4 were conducted in controlled laboratory environments. Experiment 5 was conducted in both controlled and indoor quasi-experimental settings [72]. All characteristics of the experiments are summarized in Table 2.

Table 2 Characteristics of the experiments

EXP 1 [10]
Immersion: viewing 4 min, total 6.7 min
Display: Actius AL-3DU, 512x768px at 42.5 DPI
Content: length 15 sec; videos: synthetic; motion: moderate; 3D: 100% of time; described impairments: N/A; quality level: highly acceptable
Sample: N = 32
Effect of time (post-immersion): nausea: FR=6.42, df=3, p=.93, ns; oculomotor: FR=10.95, df=3, p<.05; disorientation: FR=17.73, df=3, p<.001; total: FR=27.52, df=3, p<.001

EXP 2 [10]
Immersion: viewing 13.9 min, total 23 min
Display: HDDP, 427x240px at 155 DPI
Content: length ~18 s; videos: synthetic and natural; motion: variable; 3D: 50% of time, 2D: 50% of time; described impairments: depth, spatial; quality level: highly acceptable
Sample: N = 42
Effect of time (post-immersion): nausea: FR=14.89, df=3, p<.01; oculomotor: FR=31.04, df=3, p<.001; disorientation: FR=39.89, df=3, p<.001; total: FR=51.17, df=3, p<.001

EXP 3 [10]
Immersion: viewing 7.9 min, total 15.8 min
Display: HDDP, 427x240px at 155 DPI
Content: length ~10 s; videos: synthetic and natural; motion: variable; 3D: 100% of time; described impairments: spatial; quality level: highly acceptable
Sample: N = 38
Effect of time (post-immersion): nausea: FR=30.29, df=3, p<.001; oculomotor: FR=29.52, df=3, p<.001; disorientation: FR=48.41, df=3, p<.001; total: FR=61.92, df=3, p<.001

EXP 4 [12]
Immersion: viewing 32 min, total 37.3 min
Display: HDDP, 427x240px at 155 DPI
Content: length ~60 s; videos: synthetic and natural; motion: variable; 3D: 100% of time; described impairments: N/A; quality level: highly acceptable
Sample: N = 77
Effect of time (post-immersion): nausea: FR=13.5, df=3, p<.01; oculomotor: FR=48.55, df=3, p<.001; disorientation: FR=27.49, df=3, p<.001; total: FR=53.14, df=3, p<.001

EXP 5 [72]
Immersion: viewing 46 min [23 + 23], total 54.8 min [27.4 + 27.4]
Display: 3D LCD, 400x480px at 100 DPI
Content: length ~30 s; videos: synthetic and natural; motion: variable; 3D: 80% of time, 2D: 20% of time; described impairments: depth, spatial; quality level: mainly unacceptable
Sample: N = 30
Effect of time (post-immersion): not tested; only two post-immersion measurements were collected (see Section 2.2)

Heterogeneous stimulus material was used in the experiments. The stimuli contained synthetic and natural video scenes, variable depth levels, motion, and impairments. Based on the descriptive quality evaluation tasks that were included in the evaluations (and conducted after the post-task SSQ data elicitation), experiments 2 and 5 contained detectable impairments in the spatial and depth domains, while experiment 3 resulted in only spatial impairments. Finally, the rated overall quality in the psychoperceptual studies was highly acceptable except in experiment 5.


2.1.4 Apparatus – displays

Three dual-view autostereoscopic displays were used in the experiments. Such displays work by showing a different image to each eye of the observer. Dual-view displays generate two images which are spatially interleaved – half of the sub-pixels are visible from one direction and the other half from another direction. The light of the display is redistributed by an optical filter – either a parallax barrier, which selectively blocks the light, or a lenticular sheet, which refracts the light in different directions [68]. When the display is correctly positioned with respect to the observer's eyes, it is possible to perceive 3D objects as freely floating in front of the display. In some autostereoscopic displays the optical layer can be turned off, which allows the display to be used for 2D images. In other displays, where the optical layer is static, the only option is to duplicate the visual information and make the same image visible to each eye of the observer.
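To illustrate the dual-view principle, the following sketch column-interleaves two views into a single frame and implements the pixel-doubling 2D fallback mentioned above. It assumes a simple pixel-column-interleaved layout; real panels interleave at sub-pixel granularity in panel-specific patterns, so this is an illustration of the idea rather than a driver for any of the displays described below.

```python
# Minimal sketch of dual-view spatial interleaving (assumed column layout:
# even pixel columns carry the left view, odd columns the right view).
import numpy as np


def interleave_views(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """Merge two equally sized HxWx3 views into one interleaved frame."""
    assert left.shape == right.shape
    frame = left.copy()
    frame[:, 1::2, :] = right[:, 1::2, :]  # odd columns from the right view
    return frame


def pixel_double(view: np.ndarray) -> np.ndarray:
    """2D fallback on a static barrier: show the same view to both eyes."""
    return interleave_views(view, view)


# Example: two 480x640 RGB test frames (black left view, white right view).
left = np.zeros((480, 640, 3), dtype=np.uint8)
right = np.full((480, 640, 3), 255, dtype=np.uint8)
frame = interleave_views(left, right)
```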

The first display used in the experiments is the Actius AL-3DU by Sharp, which uses a switchable parallax barrier [68]. Every other sub-pixel of this display belongs to the alternative view. As each view is visible from multiple angles, and the angle of visibility of one view is quite narrow, it is possible for an observer to perceive a reversed stereo image. The visual quality of the 3D scene is very sensitive to the observation angle – except for three narrow observation spots, the display exhibits noticeable moiré and ghosting artifacts. In 3D mode, the resolution per view is 512x768px at 42.5 DPI, with a pixel aspect ratio of 2:1. The viewing distance in the experiments was ~55 cm.

The second display is a 3D display by NEC with a horizontally double-density pixel arrangement, also known as an HDDP display [70]. Due to the special pixel arrangement, it has the same resolution in 2D and 3D mode, namely 427x240px at 155 DPI. Its optical layer is lens-based and cannot be turned off; the 2D display mode is achieved through pixel doubling. Of the three displays, the HDDP display has the lowest crosstalk and the highest visual quality. The 3D effect can be observed from a wide range of angles and distances. The viewing distance in the experiments was ~45 cm.

The third display is the Stereoscopic 3D LCD MB403M0117135, produced by masterImage [69]. It uses a cell-matrix-type parallax barrier which can be switched between portrait 3D and landscape 3D mode. The display is 2D/3D switchable, with a 2D resolution of 800x480px at 200 DPI and a 3D resolution of 400x480px at 100 DPI. The views of this display alternate every three sub-pixels – every second full pixel belongs to the alternative view. This creates specific color tint artifacts in 3D mode, caused by sub-pixels of a certain color being partially covered by the barrier. As the 100 DPI resolution was deemed high enough, 2D images were shown using 3D mode and pixel doubling. The viewing distance in the experiments was ~40 cm.

2.2 Results

The analysis targeted three main aspects: 1) the influence of immersion, explored by comparing pre- and post-immersive evaluations; 2) the influence of post-immersive time on the evaluations; and 3) the time at which post-immersive symptoms are reduced back to the pre-immersive level. As an overall tendency, the immersive period caused a short-term peak in the total simulator sickness score and its factors.

Pre vs. post immersion – The immersion significantly increased the symptoms. Wilcoxon pairwise comparisons showed a significant difference between pre and post evaluations in 14 out of 15 cases (p<.01, cf. Figure 1). There are two exceptions to this main result. In the first experiment, the immersion did not influence the severity of nausea symptoms (Z=-1.46, p>.05, ns). In the second experiment, a lower level of nausea was reported after the immersion, in contradiction to the main tendency of the results (Z=-2.30, p<.05).
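A minimal sketch of this pre- vs post-immersion comparison, using scipy's Wilcoxon signed-rank test in place of PASW; the score arrays are placeholders standing in for one factor's paired per-participant scores.

```python
# Paired comparison of pre- and post-immersion SSQ scores for one factor.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
pre = rng.integers(0, 4, size=32) * 3.74         # placeholder baseline scores
post = pre + rng.integers(0, 3, size=32) * 3.74  # symptoms tend to increase

stat, p = wilcoxon(pre, post)
print(f"Wilcoxon W={stat:.2f}, p={p:.3f}",
      "(significant)" if p < .01 else "(ns)")
```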

Page 12: Results of user-centered quality evaluation experiments and usability …sp.cs.tut.fi/mobile3dtv/results/tech/D4.5_Mobile3DTV_v2... · 2011-05-09 · MOBILE3DTV Project No. 216503

MOBILE3DTV D4.5 Results of user-centered quality evaluation experiments and usability tests of prototype

Influence of post-immersive time – After the immersion, both the individual and total symptoms reduced over time (Figure 1). Time (see Table 2) had a significant influence on the symptoms in experiments 1-4. As an exception, this influence was not found for nausea in the first experiment.

Reduction of post-immersive symptoms – Immersion caused a short-term peak in the simulator sickness score and its factors, and the starting level of the pre-immersive scores was mainly reached within eight minutes after immersion. The level of the pre-immersive total and oculomotor scores was reached four minutes after immersion in three experiments (1, 2, 3), while it was not reached within twelve minutes in experiment 4 (p>.05). In terms of disorientation, recovery from immersion required four to twelve minutes (Exp 1: 8 min, Exp 2-3: 4 min, Exp 4: 12 min, p>.05).
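The recovery analysis can be sketched as a search for the earliest post-immersion measurement that no longer differs significantly from the pre-immersion baseline; the data layout below is a hypothetical one.

```python
# Find the first post-immersion time whose scores no longer differ
# significantly from the pre-immersion baseline (Wilcoxon, p > alpha).
import numpy as np
from scipy.stats import wilcoxon


def recovery_time(baseline: np.ndarray, scores: dict[int, np.ndarray],
                  alpha: float = .05) -> int | None:
    """Earliest measurement time (minutes) not significantly above baseline.

    `scores` maps post-immersion time in minutes (0, 4, 8, 12) to arrays
    of per-participant scores, paired with `baseline`.
    """
    for t in sorted(scores):
        diffs = scores[t] - baseline
        if not diffs.any():          # identical scores: trivially recovered
            return t
        _, p = wilcoxon(baseline, scores[t])
        if p > alpha:
            return t
    return None  # not recovered within the window (cf. experiment 4)
```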

Figure 1 Results for the simulator sickness score and its factors in experiments 1-4, shown as total scores. The results of experiment 5 show the post-immersive measures after two different lengths of exposure. Pre = pre-immersive measure; time 0-12 = time of post-immersive measurement in minutes. The bars show the 95% CI of the mean.

Finally, the pre-immersive level of nausea was either equaled after immersion (Exp 1; p>.05), became lower after immersion (Exp 2; p<.05), or was reached within four minutes (Exp 3-4; p>.05). In total, the results showed that the recovery time was prolonged after the long immersive period of 3D viewing in experiment 4 compared to the other experiments. This result is especially visible in the oculomotor factor, as well as in total simulator sickness and disorientation.

In experiment 5, the post-immersive evaluations were collected twice, immediately after viewing at 27.4 min and at 54.8 min (Figure 2). There were neither differences in the total simulator sickness (Wilcoxon: Z=-1.301, p=.193) nor in its factors (nausea: Z=-1.789, p=.074; oculomotor: Z=-1.85, p=.064; disorientation: Z=-.794, p=.43) between the post-immersive measurements. Overall, experiment 5 showed a higher level of symptoms in all evaluations compared to the other experiments.

2.3 Discussions and Conclusion

The goal of this study was to explore the severity of simulator sickness symptoms in five quality evaluation experiments conducted on three mid-sized or small mobile autostereoscopic displays. Our five experiments were characterized by variable lengths and structures of video viewing, overall quality of stimuli, and nature of perceivable impairments. Although the variable characteristics of the settings can limit direct between-experiment comparisons, our results are beneficial for showing the general tendency of visual comfort in these experiments.

The results showed a slight and mainly short-term increase in the symptoms after immersion. In these studies, the overall quality level was acceptable for the prospective consumer [10][11][12]. Firstly, the reported level of symptoms in our studies with the HDDP and Actius AL-3DU displays is equal to or lower than in previous studies using CRT or Head-Mounted Displays after 40 minutes of fast-paced gaming [63]. Secondly, the pre-immersive level of simulator sickness symptoms was mainly reached within the first four minutes when the active viewing time was less than 14 minutes. For a longer viewing time (more than 30 min) on the small-sized HDDP display, the recovery time was slightly prolonged. These results indicate that short-term video viewing (e.g. typical for mobile television and video) on these autostereoscopic dual-view displays is not problematic.

The results also showed a relatively high level of symptoms after a long viewing task on the 3D LCD display. In this study, the overall quality level was low for users and some of the used stimuli were highly impaired, causing perceptual problems such as crosstalk. In this study, the viewing time (23 or 46 min) did not increase the symptoms. This might be explained by physiological adaptation: according to Stanney et al. [71], prolonged exposure of near or over 30 minutes can even lessen the symptoms due to physiological adaptation.


3 Probing OPQ in the context of use

In recent years, subjective quality evaluation research has slowly started to shift its focus towards user-centric assessments with a rich use of methods. It has become important to understand quality of experience more extensively than only as sensorial satisfaction or as a ratio of erroneous and error-free system quality, and to further utilize this information in the design of novel systems [10][42]. Methodologically, this change has introduced not only the application of descriptive evaluation methods in parallel to traditional quantitative psychoperceptual quality evaluation tools (e.g. ITU-T P.910 [24]), but also a more detailed analysis of the factors of external validity referring to the use of the end product, such as user characteristics, necessary system components, and context of use (overview in [42]).

Descriptive quality evaluation methods are used to identify the critical quality attributes and to complement and deepen the understanding of the attributes beyond conventional psychoperceptual assessment methods [10][12][28][25]. There are two main approaches among the descriptive methods. 1) Interview-based methods involve a relatively fast data-collection phase and a data-driven analysis procedure which can also utilize statistical techniques. These methods have been applied in the evaluation of unimodal and multimodal stimuli with naïve participants, mainly in controlled circumstances [25][44]. 2) Vocabulary-based methods are characterized by a multistep procedure to develop either an individual or a consensus vocabulary and to rate quality using that vocabulary [17][41]. The analysis emphasizes the use of statistical techniques, and these methods have been applied in the evaluation of heterogeneous stimuli with naïve and trained participants in controlled circumstances. Within this branch of descriptive methods, Open Profiling of Quality (OPQ) is a mixed method combining conventional quantitative psychoperceptual quality evaluation and qualitative descriptive quality evaluation based on the individual's own vocabulary [10]. Although extensive work has been done to develop different descriptive methods, their applicability and validity outside laboratory circumstances is unknown.

There are only a few previous studies which have examined the quality of experience of time-varying media in the context of use. The context of use represents the circumstances in which the activity takes place; it is characterized by physical, temporal, task, social, and technical and information context, and their variable properties (overview in [46]). The focus of the previous studies has been on identifying context-dependent quality requirements, comparing these requirements to laboratory evaluations, and maximizing ecological validity. These perspectives are highly relevant to mobile video and television services, which are used (or expected to be used) in heterogeneous usage contexts. However, conducting this type of study requires a change in research paradigm from experimental to quasi-experimental research [45]. It requires an identification of threats to validity, tracking of the circumstances on multiple levels in different phases of the study, and the use of different research methods to be able to conclude causal effects (ibid). Even though these studies have contributed to a novel way of evaluating quality in the context of use, their focus has not been to validate the descriptive or mixed methods outside laboratory circumstances.

The goal of this study is to extend and validate the use of the Open Profiling of Quality (OPQ) method in field circumstances. In the following, we present the research method of a comparison study between controlled laboratory circumstances and a context of use. Furthermore, we report and discuss the results.

3.1 Research method

Participants – A between-subject design was used in the study. A total of 42 untrained participants (age: 19-52 years; 18 female, 24 male) took part in the study. A control group of 21 participants was tested under laboratory conditions, and 21 participants were tested in a user context situation. 15 participants per group were selected randomly for the sensory evaluations. All test participants were screened for visual acuity (myopia and hyperopia: Snellen index 20/40), color vision (Ishihara test), and stereo vision (Randot Stereotest 0.6). Five of the participants had been working in the field of video editing or video applications. One participant had prior experience in subjective quality evaluation, but none had experience with 3D video. All other test participants can be classified as naïve participants.

Stimuli – We chose six different audiovisual clips with a length of 20 seconds for the test according to their audiovisual characteristics and the user requirements for mobile 3D television and video [30]. The clips were edited using Premiere Pro CS4 and exported with a resolution of 640x480px for each channel. Audio was sampled at 44.1 kHz and 16 bit. The videos were encoded at three different quantization parameters (QP), corresponding to three quality levels: high at QP 30, medium at QP 40, and low at QP 45. For encoding, we used x264 for video and Nero AAC for audio. Finally, we used Stereo Movie Maker to multiplex the prepared clips into the 3D-avi format which was needed for the presentation of the stimulus material.
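A sketch of the video encoding step, assuming an x264 command-line build with Y4M input support; the per-view file naming and paths are hypothetical, and the Nero AAC audio encoding and Stereo Movie Maker multiplexing steps are omitted.

```python
# Encode each view of one clip at the three constant-QP levels from the text.
import subprocess
from pathlib import Path

QPS = {"high": 30, "medium": 40, "low": 45}


def encode_views(clip_stem: str, src_dir: Path, out_dir: Path) -> None:
    """Run x264 in constant-QP mode on the left/right channels of one clip."""
    for level, qp in QPS.items():
        for view in ("left", "right"):
            src = src_dir / f"{clip_stem}_{view}.y4m"        # hypothetical naming
            dst = out_dir / f"{clip_stem}_{view}_qp{qp}.264"
            subprocess.run(
                ["x264", "--qp", str(qp), "--output", str(dst), str(src)],
                check=True,
            )


encode_views("dracula", Path("source"), Path("encoded"))
```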

Table 3 The six contents under assessment (VSD=visual spatial details, VTD=temporal motion, VD=amount of depth, VDD=depth dynamism, VSC=amount of scene cuts, A=audio characteristics; screenshots omitted)

Animation – Dracula: VSD: med, VTD: high, VD: high, VDD: high, VSC: high; A: music, effects
Documentary – Macroshow: VSD: high, VTD: med, VD: high, VDD: low, VSC: med; A: orchestral music, ambience
Sports – Skydiving: VSD: low, VTD: med, VD: med, VDD: low, VSC: low; A: music
User-created Content – Street Dance: VSD: med, VTD: high, VD: high, VDD: med, VSC: low; A: music, ambience
Documentary – The Eye: VSD: med, VTD: med, VD: med, VDD: med, VSC: med; A: music
Sports – 24h: VSD: med, VTD: high, VD: med, VDD: high, VSC: high; A: ambient music

Stimuli presentation – The tests were conducted in two different contexts. The first context was a controlled laboratory environment [4] at Ilmenau University of Technology. We chose a café as the context of use according to the most mentioned usage situations for mobile 3DTV [43]. In the café, we used the same time slot during the day and the same place for each participant to obtain similar conditions for the study, as defined for quasi-experimental settings. The clips were presented on an 8'' FinePix Real 3D V1 display based on parallax barrier technology. Test participants were allowed to adjust their initial viewing distance of 45 cm. The two built-in stereo speakers of the device were used for audio playback. The order of the clips was randomized.

Table 4 Characteristics of the contexts, described based on the Model of Context of Use for Mobile Human-Computer Interaction [46], operationalized in [45][42]

Physical context
- Functional place – Lab: laboratory conditions; Café: student café at TU Ilmenau
- Sensed attributes (Audio, Visual) – Lab: A: quiet, V: calm, indoor; Café: A: noisy, V: noisy, indoor
- Movements (Movement, Position) – Lab: M: none, P: straight; Café: M: none, P: lean
- Artifacts (other than answer sheet) – Lab: none; Café: tea cup

Temporal context
- Duration – Lab: 1.5-2 hours; Café: 1.5-2 hours
- Time of day – Lab: varies; Café: between 11.45 am and 3 pm
- Actions-time – Lab: extra time; Café: extra time

Task context
- Multitask 1 – Lab: quality evaluation; Café: quality evaluation
- Multitask 2 – Lab: none; Café: relax, drink tea/coffee
- Interruptions – Lab: none; Café: possible
- Task type – Lab: entertain; Café: entertain

Social context
- Persons present – Lab: moderator; Café: moderator, other guests
- Interpersonal actions – Lab: none; Café: possible

Technical and informational context
- Other systems – Lab: none; Café: none

Properties
- Level of dynamism – Lab: static; Café: dynamic

Other related factors
- Motivations – Lab: *; Café: entertain, pass time, relax
- Viewing distance – Lab: freedom to adjust; Café: freedom to adjust
- Device volume – Lab: freedom to adjust; Café: freedom to adjust

Test procedure – Overall, the test procedure of the study followed the Open Profiling of Quality approach [10][12], and all evaluations were organized in one single session.

The psychoperceptual evaluation started with the visual screening and the explanation of the test procedure. In the following training and anchoring, we presented a subset of test items which covered the full range of quality. Test participants were asked to find their best viewing position and to practice the evaluation task. Then, an Absolute Category Rating (ACR) according to ITU-T P.910 [24] was conducted to evaluate the overall quality quantitatively. The stimuli were presented one by one and the participants retrospectively rated the acceptance of the quality on a binary (yes/no) scale [27] and the overall satisfaction on an unlabeled 11-point scale [47]. Each stimulus was assessed twice. After a short break of about 10 minutes, in which the participants filled out a demographic data questionnaire, the sensory evaluation was conducted.

The sensory evaluation started with an introduction of the participants to the sensory evaluation task. Then, in the attribute elicitation, the participants watched a second subset of test items to develop their individual quality attributes [10]. In the attribute refinement, they were asked to define their quality attributes and, if they perceived some of the attributes as not unique or could not define them precisely, to reconsider them. At the end of the refinement, each of the final attributes was attached to a 10 cm long line with the labels 'min' and 'max' at its ends. In the final sensory evaluation, the stimuli were again presented one after the other and the participants rated each test item on all of their attributes. The participants were instructed to mark the sensation of an attribute on the line: 'min' for no sensation of this attribute at all and 'max' for the maximum sensation of this attribute.

Methods of Analysis – The quantitative data was analyzed using non-parametric statistical analysis, as no normal distribution was given for the test items (Kolmogorov-Smirnov: P<.05). The Friedman test was applied to check whether the independent variables impacted the dependent one. Significant differences between two related items can then be measured using the Wilcoxon test. To compare the binary, non-related acceptance data between the contexts, Pearson's Chi-Square test was applied. For the pairwise comparison of satisfaction data between contexts we applied the Mann-Whitney U test. All quantitative data analysis was performed using PASW Statistics 18.
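The same analysis chain can be sketched with scipy in place of PASW Statistics 18; all data below are placeholders with the dimensions of this study (21 participants per context, three QP levels), and the contingency counts are invented for illustration.

```python
# Sketch of the non-parametric analysis chain described above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sat_lab = rng.integers(0, 11, size=(21, 3))   # 21 participants x 3 QP levels
sat_cafe = rng.integers(0, 11, size=(21, 3))

# Normality check per condition (Kolmogorov-Smirnov on standardized scores).
print(stats.kstest(stats.zscore(sat_lab[:, 0]), "norm"))

# Friedman test: do the QP levels influence satisfaction within one context?
print(stats.friedmanchisquare(sat_lab[:, 0], sat_lab[:, 1], sat_lab[:, 2]))

# Wilcoxon test: pairwise comparison of two related conditions.
print(stats.wilcoxon(sat_lab[:, 0], sat_lab[:, 1]))

# Mann-Whitney U: compare one condition between the two unrelated contexts.
print(stats.mannwhitneyu(sat_lab[:, 2], sat_cafe[:, 2]))

# Chi-square: compare binary acceptance between contexts for one item,
# from an accepted/rejected contingency table (counts are made up).
table = np.array([[18, 3],    # lab: accepted, rejected
                  [12, 9]])   # café: accepted, rejected
print(stats.chi2_contingency(table))
```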

The sensory data was transformed into a set of quantitative measures by measuring the distance from the 'min' label to the participant's tick on the line, per attribute and stimulus. This results in one item-by-attribute matrix per participant, called the individual configuration (Figure 2). The individual configurations can be analyzed by applying Multiple Factor Analysis (MFA), as suggested in the Extended-OPQ approach [12]. MFA first conducts a principal component analysis (PCA) for each configuration and scales each configuration by the PCA's first singular value. In a second step, all configurations are merged into a single matrix and another PCA is conducted. R and its FactoMineR package were used for the sensory analysis. Hierarchical Multiple Factor Analysis (HMFA), finally, is a method that analyzes data sets having a hierarchical structure (Figure 2). We applied HMFA to compare the two sensory data sets from the laboratory and the context for similar information, as a final step of checking validity.
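The MFA weighting step described above can be sketched in a few lines of linear algebra (the report itself uses FactoMineR; this bare-bones version ignores the per-variable standardization options MFA implementations offer):

```python
# Minimal MFA sketch: weight each participant's configuration by its first
# singular value, concatenate, and run a global PCA via SVD.
import numpy as np
from numpy.linalg import svd


def mfa_scores(configurations: list[np.ndarray], n_components: int = 2):
    """Return item scores on the first MFA components.

    Each configuration is an (items x attributes) matrix from one
    participant; all participants rated the same items.
    """
    weighted = []
    for X in configurations:
        Xc = X - X.mean(axis=0)            # column-center each configuration
        s1 = svd(Xc, compute_uv=False)[0]  # first singular value
        weighted.append(Xc / s1)           # equalize participants' influence
    merged = np.hstack(weighted)           # items x (all attributes)
    U, S, Vt = svd(merged, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S**2 / np.sum(S**2))[:n_components]
    return scores, explained


# Example: 15 participants, 18 items, 4-8 attributes each (random data).
rng = np.random.default_rng(2)
configs = [rng.random((18, rng.integers(4, 9))) for _ in range(15)]
scores, explained = mfa_scores(configs)
```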

Figure 2 The hierarchical structure of the data set with the two hierarchical levels of individual configurations (green) and the evaluation in different contextual settings (red).

3.2 Results

3.2.1 Psychoperceptual evaluation

Acceptance of overall quality – On average, all stimuli presented at QP30 provided a highly acceptable quality level of 93%, QP40 stimuli reached an average level just above 50%, and QP45 reached an acceptance rate of 12%. The comparison between the results of the two contexts did not reveal significant differences, with the exception of the contents streetdance and theeye at the medium quality level (Pearson's χ2: P < .05). Figure 3 presents the results of the overall acceptance scores content by content and for the two contexts.


Figure 3 Overall acceptance scores for the items under test

Satisfaction with overall quality – The parameter combinations influenced overall quality satisfaction when averaged over the contents for each of the contexts (laboratory: Fr = 627.705, df = 2, P < .001; café: Fr = 419.846, df = 2, P < .001). The comparison of QPs between the contexts revealed slightly better results for QP45 in the café (Mann-Whitney U: U = -2.485, P<.05). The comparisons of QP30 and QP40 did not show significant differences (all comparisons: P>.05). In a content-by-content comparison for the QPs, significant differences in the satisfaction scores were found only for theeye, which was rated slightly better in the café than in the laboratory (Mann-Whitney U: U = -2.305, P<.05; other comparisons: P>.05). Figure 4 shows the overall quality scores averaged over the contents and content by content for the different QPs and contexts. QP30 provided the significantly highest quality satisfaction and the ratings for QP45 were worst (all comparisons: P<.001). The results of the content-by-content analysis follow the overall tendency for the different QPs (all comparisons: P<.001).


Figure 4 Mean satisfaction scores for the items under test. Error bars show 95% CI of mean.

3.2.2 Sensory evaluation

Test participants developed a total of 91 individual quality attributes in the laboratory (mean: 6, min: 4, max: 7) and 78 attributes in the café (mean: 5, min: 4, max: 8).

Laboratory – The results for the sensory data from the laboratory are shown as item and correlation plots in Figure 5 and Figure 6. The item plot (Figure 5) shows the loadings of the test items on the first and second components of the MFA. The first two components of the MFA explain 65.42% of the variance in the individual data (also called explained variance), with 44.25% and 12.17%, respectively.

Along the first component, the items separate according to the different QPs. Items close to the origin have less impact on the component than those with high (positive or negative) loadings. Along the second component, a clear separation of all items of the content dracula can be found. These items have a high impact on the second component of the model.

Further insight can be obtained from the correlation plot (Figure 6). This plot shows the correlation of each individual attribute with the first and second components of the MFA model. The first component is mainly described by attributes like 'blocky' or 'artifacts' on its negative polarity, while the positive one correlates strongly with attributes like 'clear', 'sharpness of edges', or '3D effect'. These attributes describe the differences in the perception of video quality and are in accordance with the separation of QPs along the first dimension. The second dimension correlates with attributes such as 'double images' on the one polarity and with a few attributes like 'color-fast' and 'perceivable as one image' on the other. Bearing in mind that the content dracula separated from the other contents along this component, we see that there was a problem with obtaining a proper 3D perception. The double images may have been caused by a high disparity of this content. A few attributes correlate with both dimensions, such as '3D effect', 'depth', and 'amount of 3D'. However, the partial plot (Figure 5), in which the impact of each individual configuration is illustrated, shows that this problem only occurred with some participants. While some participants show very high loadings for their individual configuration on dimension 2, others show none.

Figure 5 Item plot and partial loadings for the laboratory. The partial loadings show individual participant’s impact.



Figure 6 Correlation plot of the laboratory evaluation.

Café – The sensory results for the café are comparable to those obtained in the laboratory (Figure 7 and Figure 8). The first two components of the MFA model account for 47.76% of explained variance (component 1: 33.48%, component 2: 14.28%). As in the laboratory results, the items of the café separate along the first dimension according to their QPs (Figure 7). The separation of content dracula along the second component can be found as well. The correlation plot (Figure 8) shows high correlation of attributes like 'blocky' or 'blurry' and, in contrast, of attributes like 'spacious', 'rich in details', and 'clear' with the first component. The second component correlates with attributes like 'double effect', 'dark', and 'annoying' on one polarity; the other polarity correlates with 'bright' and 'realistic'. Again, the partial plots show the differences in individual contributions to the second dimension.



Figure 7 Item plot and partial loadings for the café. The partial loadings show each individual participant's impact.



Figure 8 Correlation plot of the evaluation in the context of use.

3.2.3 Comparison of results

The final step of the analysis is the comparison of the two MFA models obtained from laboratory and café. A simple approach to comparison is the description of differences and similarities between the results of the individual models. Overall, the separate analyses have shown that the first two dimensions of each model describe similar constructs. While the first component relates to video quality, the second component refers to quality factors related to display and disparity problems. In addition, the attributes, although coming from different participants, are very similar in describing the two components. However, a difference can be found in the number of important attributes. In general, attributes with an absolute correlation above 0.5 with one of the components, i.e. attributes for which a component accounts for at least 25% of their variance, are regarded as more important than the rest. For the laboratory, 61.5% of all attributes meet this criterion, while for the café only 44.6% of the attributes do.
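Given such an MFA result, the 0.5 criterion can be applied directly to the attribute correlations. A short sketch, assuming res is a FactoMineR MFA object as in the sketch in Section 3.2.2:

    # correlations of all individual attributes with the first two components
    cors <- res$quanti.var$cor[, 1:2]

    # an attribute counts as important if |r| > 0.5 on at least one component
    important <- apply(abs(cors) > 0.5, 1, any)
    round(mean(important) * 100, 1)    # percentage of important attributes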

A majority of the attributes of the laboratory MFA model show high correlation with the first dimension, while only few correlate with the second one. For the café MFA results, it is noticeable that the number of attributes along the first component is lower. In addition, there are more inter-dimensional attributes, which means that the dimensions are not as well separated as in the laboratory model.



Although this descriptive comparison already shows differences, we want to test whether the models actually differ. The HMFA results confirm the previous findings and allow modelling the comparison in a joint analysis of both data sets. In the HMFA result, each test item is plotted at the center of gravity between both data sets. In addition, the partial clouds for each data set are plotted to show the separate impact of the laboratory and café data. The HMFA model is comparable to the separate models in terms of explained variance (51.12%; 37.93% for component 1 and 13.19% for component 2) and loadings of the items (Figure 9). In this joint model, the different QPs again separate along the first component. The analysis shows that the deviation between the data sets along the first component of the HMFA model is low. Along the second dimension, we can identify differences in the impact of the partial clouds. The deviation between the data sets along this component is much higher, especially for content dracula, and the café data shows higher loadings than the laboratory data set.
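The joint model corresponds to FactoMineR's HMFA function, which takes the hierarchy of groups (participants nested in contexts) as input. A minimal, self-contained sketch on synthetic data; all sizes and names are illustrative.

    library(FactoMineR)

    set.seed(1)
    n_items     <- 18
    n_attr_lab  <- c(6, 5, 7)      # attributes per laboratory participant (illustrative)
    n_attr_cafe <- c(4, 6)         # attributes per café participant (illustrative)
    joint <- as.data.frame(matrix(runif(n_items * sum(n_attr_lab, n_attr_cafe)),
                                  nrow = n_items))

    H <- list(c(n_attr_lab, n_attr_cafe),                  # level 1: participants
              c(length(n_attr_lab), length(n_attr_cafe)))  # level 2: contexts

    res_h <- HMFA(joint, H = H,
                  type = rep("s", length(c(n_attr_lab, n_attr_cafe))),
                  graph = FALSE)

    res_h$eig[1:2, ]            # explained variance of the joint model
    plot(res_h, choix = "ind")  # items at their centers of gravity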

Summarizing, the statistical comparison between the profiles revealed comparable discrimination of the test items for visual quality based on similar factors. A higher sensitivity for the quality of the 3D effect, in terms of crosstalk between the channels, can be identified for the café context.

3.3 Discussion and Conclusion

The goal of this study was to validate the quality models that can be obtained using the Open Profiling of Quality approach in a comparison between data from the laboratory and the context of use. Within the User-centered Quality of Experience approach [26], the descriptive evaluation of quality and its evaluation in the context of use are two key approaches that were combined in this study.

The OPQ approach allows for a combined evaluation of quality using quantitative evaluation and qualitative descriptive sensory profiling methods, and a high degree of complementarity between the results of both methods has been found. The comparison of the two contexts has shown that individual quality factors and their impact on users' perceived quality are very stable across contexts. From the models, we were able to identify two main components that describe users' perceived quality. The most important component is video quality. As in previous studies, good video quality also includes descriptions of 3D perception [10][12]. This can explain the content-dependency among different QPs found in the quantitative results. Contents makro and 24h are rich in details and offer better 3D perception than other contents do. When video quality decreases, the added value is not perceived anymore, and satisfaction with the quality of these two contents decreases accordingly with the loss of details. Besides video quality, the impact of perceivable double images was found. These descriptions mainly correlate with content dracula and arise from a high amount of disparity and the resulting crosstalk between the left and right channel. For this dimension, two important findings were made. First, the analysis shows that the double images were perceived by only some test participants, although all participants were screened for the same visual abilities at the beginning of the test. This finding underlines the need for tools for better screening and description of the test sample [14]. Second, test participants show higher sensitivity for this component in the context of use. This finding is confirmed by a previous study in which the ease of use and the viewing comfort of mobile 3D television were identified as important components of quality in the context of use [42][45].


Figure 9 Result of the Hierarchical Multiple Factor Analysis. The partial clouds show the impact of the laboratory data set (black) and the data set from the café (red) on the joint model.

A limitation of this study is the missing detailed recording of the characteristics of the context of use during the test conduction. Although we tracked special events by writing them down, a detailed recording over time, e.g. on video, is missing. This does not allow for a deeper analysis of shared attention and makes it impossible to report it as accurately as in other studies in the context of use [42][45]. However, the results of the study provide valuable knowledge for a deeper understanding of quality perception and its interaction with shared attention in the context of use.

Summarizing the comparison, we have shown that the descriptive quality models obtained by applying the OPQ method are, overall, very similar between the two contextual settings. In both models, two main components, video quality and crosstalk, were identified, and the loadings of the test items as well as the correlations of the individual attributes were highly similar. This finding contributes to the overall validity of OPQ results, which has now been established in several studies [10][12]. Open Profiling of Quality has become a validated tool for measuring experienced quality factors. However, differences were still identified between the evaluation in the laboratory and the context of use which OPQ was not able to measure and monitor, e.g. shared attention.


This underlines the importance of contextual evaluations within the User-centered Quality of Experience framework, which we see as a fruitful addition to the holistic evaluation of user-centered Quality of Experience of multimedia systems.


4 Extended OPQ

4.1 Fixed vocabulary and terminologies in descriptive analysis

In contrast to individual descriptive methods, fixed vocabulary approaches evaluate perceived quality based on a predefined set of quality factors. Descriptive evaluation with fixed vocabularies has a long tradition, and several methods have been introduced and applied successfully to different research questions [18][52]. In general, a fixed vocabulary (also called objective language [6], lexicon [15], terminology [36], or consensus vocabulary [1]) is regarded as a more effective way of communicating research results between the quality evaluators and other parties (e.g. development, marketing) involved in the development process of a product [6], compared to individual quality factors. Fixed vocabularies also allow for direct comparison of different studies and easier correlation of results with other data sets like instrumental measures [5]. In general, vocabularies include a list of quality attributes that describe the specific characteristics of the product to which they refer. These quality attributes are usually structured hierarchically into categories or broader classes of descriptors. In addition, vocabularies provide definitions or references for each of the quality attributes [6][15]. Some terminologies in the field of sensory evaluation have become very popular as they allowed defining a common understanding of underlying quality structures. Popular examples are the wine aroma wheel by Noble et al. [36] and Meilgaard et al.'s beer flavor wheel [34], which share the common wheel structure to organize the different quality terms.

A fixed vocabulary in sensory evaluation needs to satisfy the quality criteria introduced by Civille and Lawless [5]. Especially the criteria of discrimination and non-redundancy need to be fulfilled, so that no quality descriptor overlaps with another term. In descriptive evaluation methods which apply these vocabularies, a consensus about the meaning of each of the attributes is needed among assessors [18]. While sensory evaluation methods like the Texture Profile [3] or the Flavour Profile (see [35]) apply vocabularies that have been defined by underlying physical or chemical properties of the product, Quantitative Descriptive Analysis (QDA) (see [52]) makes use of extensive group discussions and training of assessors to develop and sharpen the meaning of, and the consensus on, the set of quality factors. Relating to audiovisual quality evaluations, Bech and Zacharov [1] provide an overview of existing quality attributes obtained in several descriptive analysis studies. Although these attributes show common structures, Bech and Zacharov point out that they must be regarded as highly application-specific, so that they cannot be taken as a terminology for audio quality in general [1]. A consensus vocabulary for video quality evaluation was developed in Bech et al.'s RaPID approach [2]. RaPID adapts the ideas of QDA and uses extensive group discussions in which experts develop a consensus vocabulary of quality attributes for image quality. The attributes are refined in a second round of discussions in which the panel agrees on the important attributes and the extremes of the intensity scale for a specific test, according to the available test stimuli.

4.2 The component model as extension of the OPQ method

4.2.1 Open definition task and qualitative descriptions

Within a set of OPQ studies in a specific research area, test participants develop a large number of attributes that all relate to their individual descriptions of perceived quality in the specific domain. As descriptive analysis targets a broad evaluation of a specific research area with respect to different research problems [52], these descriptors cover a multifaceted view on experienced quality in this domain. During the development of Open Profiling of Quality, the question arose whether it was possible to develop a common vocabulary from these individual attributes for the description and evaluation of experienced quality of audiovisual 3D media.


In fact, OPQ is a suitable approach to investigate and model individual experienced quality factors, but higher-level descriptions of these quality factors, which would allow communicating the main impacting factors to engineers or designers, have been missing.

As a related approach, Samoylenko et al. [16] introduced the Verbal Protocol Analysis method. The goal of this approach was to analyze descriptions of the timbres of musical sounds into a common structure. The approach contains three levels of classification. In the first level, Samoylenko et al. classify each verbal descriptor according to its 'logical sense', i.e. whether it describes similarities or differences between two stimuli. The second phase clusters each descriptor according to its 'stimulus relatedness', which refers to either global or specific descriptions. The third hierarchical level finally groups each descriptor according to its 'semantic aspect'; this level differentiates descriptors into either single features or holistic, conceptual descriptions. In total, ten different classifications are made for each descriptor along the three levels. The final result is a classification of each descriptor according to these ten classes. Samoylenko et al. use this classification to switch from a descriptor-related analysis to a more general analysis of results within the different groups of the classification. Although this approach is promising as a generalizing step in the analysis of data obtained from free verbalization tasks, it does not allow developing a general vocabulary which can be used in prospective evaluation studies.

The component model is a qualitative data extension that allows identifying the main components of Quality of Experience in an OPQ study and structuring these components into a logical structure of categories and subcategories. The component model is included in the Extended-OPQ approach [12] and extends OPQ with a fourth step of data analysis. It uses data that is collected during the OPQ test. Within the attribute refinement task of the sensory evaluation, a free definition task is conducted. This task completes the attribute refinement: test participants are asked to define each of their idiosyncratic attributes. As during the attribute elicitation, they are free to use their own words, but the definitions must make clear what an attribute means to them or to which aspect of experienced quality it relates. In addition, participants are asked to define a minimum and a maximum value of sensation for each attribute, if possible. Our experience has shown that this task is rather simple for the test participants compared to the attribute elicitation; after the attribute refinement task, they were all able to define their attributes very precisely (Table 5). Collecting definitions of the individual attributes is not new within existing Free-Choice Profiling approaches, and definitions are collected in related methods [56]. However, those definitions have only served to interpret the attributes in the sensory data analysis [10]. In the Extended-OPQ approach, we see the definitions obtained in the free definition task as a second level of descriptions of the experienced quality factors. These descriptions are short (one sentence), well defined, and precise. While the individual attributes are used for the sensory analysis, the component model extension applies these qualitative descriptors to form a framework of components of Quality of Experience. By applying the principles of the Grounded Theory framework [59] through systematic steps of open coding, concept development, and categorization, researchers obtain a descriptive Quality of Experience framework which shows the underlying main components of QoE in relation to the developed individual quality factors. Comparable approaches have been used in the interview-based mixed methods approaches which are also included in the UC-QoE evaluation framework [26][29]. This similarity makes it possible to directly compare (and combine) the outcomes of the different methods into a joint model. In the following, we present an application of the component model to the data obtained within the holistic analysis of mobile 3D video [10][11].


Table 5 Examples of attributes and their definitions obtained in the transmission study of Mobile3DTV [11]

Attribute | Participant's Definition | Minimum | Maximum
Fluent movement | Movement and action get blurry and get stuck in the background | Movements get very blurry | N/a
Image blurred | Frames are not layered correctly | Image not displaced | Image seems to be highly displaced
Constant background | Background does not change when there is a non-moving image | N/a | Colours and outlines do not change at all

4.2.2 Components of Quality of Experience for mobile 3D video

From the data sets obtained in the evaluations of mobile 3D television, we chose three studies which represent a large variety of research problems. The characteristics of these studies are summarized in Table 6.

For each of these studies, test participants developed a set of individual definitions in the free definition task at the end of OPQ's attribute refinement task. These definitions were taken as independent descriptive data sets for experienced quality and analyzed in a data-driven way, in accordance with the principles of Grounded Theory [59] and the instructions given by Jumisko-Pyykkö [26], as follows:

1. Open coding towards concepts: Usually, this step starts by extracting meaningful pieces of data from the transcribed data sets. In the analysis of the free definition data, each definition can be treated directly as a code, as the definitions are short, well defined, and precise in comparison to, e.g., interview data. From these codes, concepts and their properties are identified.

2. Categorization: All developed concepts are further categorized into major categories and, where applicable, subcategories.

3. Frequencies of mention: The frequency of each category is determined by counting the number of participants who mentioned it. Several mentions of the same concept by the same participant are counted only once.

4. Interrater reliability: A second researcher performs the coding and categorization for a randomly selected 20% of each data set, and interrater reliability is calculated using Cohen's Kappa [7], as sketched below.
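For illustration, Cohen's Kappa in step 4 can be computed directly from the two researchers' category assignments; the following R sketch uses invented category labels.

    # category assignments of the two coders for the same 20% subsample (invented)
    coder1 <- c("depth", "spatial", "spatial", "temporal", "depth", "content")
    coder2 <- c("depth", "spatial", "temporal", "temporal", "depth", "content")

    tab <- table(coder1, coder2)
    po  <- sum(diag(tab)) / sum(tab)                      # observed agreement
    pe  <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2  # agreement expected by chance
    (po - pe) / (1 - pe)                                  # Cohen's Kappa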


Table 6 Characteristics of the experiments chosen for development of the QoE component model

Experiment 1: 2D-3D Comparison [10] (sample size: 15)
Variables: video presentation mode (2D/3D); audio presentation mode (mono/stereo); content: 6 contents; length: ~18 s
Stimuli characteristics: synthetic and natural videos; presentation mode: 2D and 3D; quality level: highly acceptable; video: mp4, 10-22 Mbit/s, 25 fps; audio: WMA 9, 48 kHz, 16 bit

Experiment 2: 3D Coding Methods [10] (sample size: 15)
Variables: video: 4 coding schemes, 2 quality levels (low: 74-160 kbps, high: 160-452 kbps bit rate); content: 6 contents; length: ~10 s
Stimuli characteristics: synthetic and natural videos; presentation mode: 3D; quality level: highly acceptable; video: H.264/AVC (JMVC 5.0.5); audio: none

Experiment 3: 3D DVB-H Transmission [12] (sample size: 17)
Variables: video: 3 coding schemes @ slice and no-slice mode, 2 MFER rates (10%, 20%); audio: clean audio; content: 4 contents; length: ~60 s
Stimuli characteristics: synthetic and natural videos; presentation mode: 3D; quality level: highly acceptable; video: H.264/AVC (JM 14.2), MVC (JMVC 5.0.5); audio: WMA 9, 11/44.1 kHz

The results of the data-driven analysis of the free definition task data show that, in general, experienced quality for mobile 3DTV is constructed from components of visual quality (depth, spatial, temporal), viewing experience, content, audio, and audiovisual quality (Table 7).


Table 7 Components of Quality of Experience, their definitions and percentage of participants' attributes in each category per study (values in %; – indicates no mention)

COMPONENT (major and sub) | DEFINITION (examples) | Exp1 (N=15) | Exp2 (N=15) | Exp3 (N=17)
VISUAL DEPTH | Descriptions of depth in video | | |
3D effect in general | General descriptions of a perceived 3D effect and its detectability | 86.7 | 80.0 | 58.8
Excellence of 3D effect | Artificial, strange, erroneous 3D descriptions (too much depth, flat planes) | 66.7 | 6.7 | –
Layered 3D | Depth is described as having multiple layers or structure | 26.7 | 33.3 | 23.5
Foreground | Foreground-related descriptions | 46.7 | 26.7 | 17.6
Background | Background-related descriptions | 33.3 | 66.7 | 35.3
VISUAL SPATIAL | Descriptions of spatial video quality factors | | |
Clarity | Good spatial quality (clarity, sharpness, accuracy, visibility, error-free) | 73.3 | 80.0 | 76.5
Color | Colors in general, their intensity, hue, and contrast | 66.7 | 100.0 | 52.9
Brightness | Brightness and contrast | 26.7 | 80.0 | 17.6
Blurry | Blurry, inaccurate, not sharp | 46.7 | 40.0 | 47.1
Visible pixels | Impairments with visible structure (e.g. blockiness, graininess, pixels) | 33.3 | 73.3 | 70.6
Detection of objects | Ability to detect details, their edges, outlines | 73.3 | 80.0 | 47.1
VISUAL TEMPORAL | Descriptions of temporal video quality factors | | |
Motion in general | General descriptions of motion in the content or camera movement | 26.7 | 53.3 | 29.4
Fluent motion | Good temporal quality (fluency, dynamic, natural movements) | – | 60.0 | 52.9
Influent motion | Impairments in temporal quality (cut-offs, stops, jerky motion, judder) | 6.7 | 40.0 | 88.2
Blurry motion | Experience of blurred motion under fast motion | 20.0 | 6.7 | 17.6
VIEWING EXPERIENCE | User's high-level constructs of experienced quality | | |
Eye strain | Feeling of discomfort in the eyes | 20.0 | 20.0 | 35.5
Ease of viewing | Ease of concentration, focusing on viewing, free from interruptions | 40.0 | 6.7 | 52.9
Interest in content | Interest in viewing the content | 40.0 | 13.3 | 11.8
3D added value | Added value of the 3D effect (advantage over current system, fun, worth seeing, touchable, involving) | 53.3 | 33.3 | 17.6
Overall quality | Experience of quality as a whole without emphasizing one certain factor | 20.0 | 40.0 | 11.8
CONTENT | Content and content-dependent descriptions | 13.3 | 6.7 | 17.6
AUDIO | Mentions of audio and its excellence | 13.3 | – | 11.8
AUDIOVISUAL | Audiovisual quality (synchronism and fitness between media) | – | – | 29.4

The component model provides results that converge with those obtained in the sensory evaluations in terms of components and their importance. The most important category of the component model is visual quality, which confirms the findings of the sensory analysis. Although the weighting of its subcomponents spatial, temporal, and depth differs among the studies, the overall findings show that especially the artifact-free perception of the video (clarity, fluency, excellence of 3D) determines participants' components of quality. In addition, the model shows that test participants often use complementary descriptions of quality, which leads to contrary subcategories comparable to the descriptions along dimensions in the sensory results. Visual spatial quality, for example, is described positively in terms of the detection of objects and their details, while the same effect is described negatively as different structural imperfections such as blocking impairments and visible pixels. This juxtaposition can also be identified for other components, e.g. fluent motion vs. influent motion or eye strain vs. ease of viewing. Finally, it is remarkable that the results of the component framework analysis confirm the findings of the sensory evaluations for the audio and audiovisual components. While in the sensory analysis one could still argue that the audio-related components are just overwhelmed by the high impact of the visual components, the model shows that only few attributes were developed in relation to audio and audiovisual quality. This confirms the sensory results. Nevertheless, the inclusion of audio and audiovisual quality as separate components is important for the holistic view of the developed component model.

Within the work on the UC-QoE framework development, the component model was used in a joint analysis with qualitative data obtained from interviews in contextual studies [45].


The comparable characteristics of the data and of the resulting separate component models allowed combining the different data sets into one descriptive Quality of Experience model for mobile 3D video. The results presented by Jumisko-Pyykkö et al. [28] confirm the results of the OPQ component model and generalize the model. Table 8 lists the final components of QoE for mobile 3D video [28]. The contextual data, in particular, puts new emphasis on context-dependent components within the category of Viewing Experience. Overall, the developed joint results present a general descriptive model of QoE for mobile 3D media. Jumisko-Pyykkö et al. [28] conclude that the important next steps for the descriptive model are validation and operationalization.

Table 8 Components of Quality of Experience for 3D video on mobile devices and their definitions

COMPONENT (major and sub), bipolar impressions | DEFINITION (examples)

VISUAL QUALITY – Descriptions of the quality of the visual modality, divided into depth, spatial and motion quality

DEPTH – Descriptions of depth quality in video, characterized by perceivable depth, its natural impression, composition of foreground and background layers, and balance of their quality
- Perceivable depth (Perceivable/Not perceivable): Ability to detect depth or a variable amount of depth as a part of the presentation
- Impression of depth (Natural/Artificial): The 3D effect creates a natural, realistic and error-free impression instead of an artificial and erroneous impression (e.g. too much depth, double objects, shadows, seeing through objects)
- Foreground-background layers (Smoothly combined layers/Separate layers): Depth is composed of foreground and background layers, and the impression of the transitions between these layers can vary from smooth to distinguishable separate layers
- Balance of foreground-background quality (Balanced/Unbalanced): Balance between the excellence of foreground and background image quality (e.g. sharp foreground, blurry background or vice versa, or they are otherwise not in balance)

SPATIAL – Descriptions of spatial image quality of video, characterized by clarity, block-freeness, colors, brightness, contrast and the ability to detect objects and edges
- Clarity of image (Clear/Blurred): Clarity of the image overall – clear (synonyms: sharpness, accuracy, visibility) vs. unclear (synonyms: blurred, inaccurate, not sharp)
- Block-free image (Block-free/Visible blocks): Existence of impairments with visible structure in the image (e.g. blockiness, graininess, pixels)
- Color, brightness and contrast (Good/Poor): Excellence of colors, brightness and contrast
- Objects and edges (Accurate/Inaccurate): Ability to detect necessary objects and details, their edges and outlines

MOTION – Descriptions of motion of video, characterized by fluency, clarity and nature of motion
- Fluency of motion (Fluent/Influent): Excellence of natural fluency of motion – fluent (dynamic, natural) vs. influent (cut-offs, stops, jerky)
- Clarity of motion (Clear/Blurry): Excellence of clarity of motion (e.g. accuracy under fast movement or movement out of the screen) – clear, sharp vs. blurred, pixelated
- Nature of motion (Static/Dynamic): Nature of motion in the content or camera movements – static (synonym: slow) vs. dynamic (synonym: fast)

VIEWING EXPERIENCE – Descriptions of viewing experience, characterized by ease and pleasantness of viewing, enhanced immersion, visual discomfort and the impression of improved technology and overall quality
- Ease of viewing (Easy/Difficult): Easy to concentrate on viewing (e.g. free from extra effort and learning, viewing angle does not interrupt viewing)
- Pleasantness of viewing (Pleasant/Unpleasant): Pleasurable viewing experience, also for a longer period of time (e.g. 15 min)
- Enhanced immersion (Enhanced/Not enhanced): Feeling of enhanced immersion into the viewing experience (impression of becoming a part of the events in the content, involvement, fun and an improved impression of naturalness, life-likeness, tangibility and realism)
- Visual discomfort (Experienced/Not experienced): Feeling of visual discomfort (eye strain) and descriptions of related discomfort symptoms (headache, general discomfort)
- Comparison to existing technology (Improved/Not improved): Impression that the provided quality of the new technology (3D) is higher than the quality of comparable existing technology (e.g. 2D video on a mobile device)
- Overall quality (Good/Bad): Impression of the excellence of quality as a whole without emphasizing a certain factor (e.g. excellence over time, relation between erroneous/error-free)

CONTENT – Descriptions of content, content-dependency and interest in viewing the content

OTHER MODALITIES AND INTERACTIONS – Descriptions of the quality of the audio modality and of the interaction between the quality of the audio and visual modalities
- Audio: Audio and its excellence
- Audiovisual: Bimodal audiovisual quality (synchronism and fitness between media) and its excellence

The following section presents the results of a comparison study between Open Profiling of Quality and a newly introduced method called Conventional Profiling, in which the Free-Choice Profiling task is substituted with a sensory evaluation that uses the QoE component model as a fixed vocabulary.


5 Comparison of OPQ and CP

5.1 Introduction and research problem

Systematic comparison of different research approaches is an important issue when selecting a proper research method for a specific research problem. In addition, it is a key aspect in the methodological work on new research approaches. Comparisons between research methods are needed to provide practitioners with guidelines for the effective use of these tools. Research methods are composed of a collection of methods or techniques which aim at producing information with as small a probability of error as possible [49]. For example, subjective quality evaluation methods contain components such as sample selection, scaling, evaluation task, moment of rating, and stimuli and their presentation [26][39]. Some of these components have been compared independently to estimate reliability and validity (e.g. [40]), but such comparisons provide only a limited view of the benefits and weaknesses of the whole method and cover only few dimensions to guide the selection between methods. Method comparisons can cover performance-related aspects (e.g. accuracy in different quality ranges, validity, reliability, and costs), complexity (e.g. ease of planning, conducting, analyzing, and interpreting results), and evaluation factors (e.g. number of stimuli, knowledge of research personnel) (e.g. [32][33]). However, no extensive criteria for method comparisons are available in multimedia quality assessment research to guide practitioners' work.

Recent research has proposed novel methods and techniques to capture Quality of Experience, but their applicability is unknown. These methods extend the quantitative evaluations [39] by the use of parallel descriptive tasks, psychophysiological measures (e.g. eye tracking, galvanic skin response), and hybrid methods to assess quality in the natural context of use (for an overview see [26]). Among these, mixed methods, which combine quantitative and descriptive methods in one study, have slowly started to gain popularity in subjective evaluation research, and several competing methods exist.

In these mixed methods approaches, quantitative preferences are collected using the conventional methods provided by the standardization bodies (e.g. ACR [39]). The descriptive part of these methods is either based on interview techniques or on vocabulary-based approaches (for an overview see [10]). We introduced a method called Open Profiling of Quality (OPQ) [13] in which we extend quantitative assessments with an adaptation of Free-Choice Profiling (FCP): naïve participants generate their own vocabularies to describe and evaluate their quality perceptions. While OPQ and its individual vocabulary approach have provided good results in the evaluation of multimedia quality by complementing and explaining the quantitative-only evaluations [10][12], the extension also increases the costs of studies (overview in [26]). A possibility to shorten the sensory evaluation is to use a fixed vocabulary for all participants instead of individual attributes, as proposed in different consensus-based approaches (e.g. [2]). Our Conventional Profiling approach therefore operationalizes the set of components of Quality of Experience for mobile 3D video [28] (see Section 4.2.2). These components are taken as the fixed vocabulary with which the test participants evaluate their perceived quality in the sensory evaluation. Although this sounds promising in terms of easier implementation, the benefits and weaknesses need to be compared systematically. In our mixed methods research approach, 'the long-term goal is to support the idea of safe development of these instruments by understanding their benefits and limitations when capturing deeper understanding of experienced multimedia quality' [13].

5.2 Comparison criteria and comparison model

A literature review across several fields of research shows that the criteria applied to describe different abilities of research methods vary heavily among these fields.


Comparable approaches exist for different psychoperceptual quality evaluation methods in the ITU recommendations [39][38]. In recommendations like ITU-T P.910 [39], different research methods are described and short guidelines are offered for the purpose-directed selection of the appropriate method. Within these guidelines, mostly stimulus-related factors, e.g. perceivable quality range or discrimination power, are taken into account to direct the selection process. Similar approaches can be found in the juxtaposition of different sensory evaluation methods in the food sciences [31][35]. Here, the offered guidelines are oriented along three main criteria in a more research-problem-related approach and differentiate in accordance with the 'three primary questions about products': 1) questions about acceptability, 2) questions about sensory analysis, and 3) questions about the nature of differences [31]. These two approaches provide first guidelines for key aspects in the comparison of research methods. Further comparison criteria can be identified from other fields of research beyond the domain of quality evaluations.

The most general comparison criteria are described in the social sciences. Here, performance indices are well-established tools to measure differences between methods in terms of their degree of scientific rigor. These criteria are primarily validity and reliability [4][22], but generalization, replication, and objectivity are also found to be important criteria in related research approaches. Within the social sciences, validity and reliability are considered principles of good research. These criteria offer a very general way of comparing methods but do not offer specific guidelines with respect to the effort or costs per method.

Studies on usability extend the criteria of validity and reliability with other performance-related criteria like effectiveness, efficiency, and robustness, related to economic aspects [32]. In addition, Markopoulos and Bekker [32] list criteria for describing different usability methods: purpose of the test, the artifact tested, the interaction tasks, participants, facilitator, environment/context, procedure, capture of data, and the characteristics of the test participants. Other studies on the comparison of usability tests extend the definition of effectiveness and describe it in terms of cost-effectiveness and effectiveness in terms of results (e.g. the number of usability problems identified) [21][50].

In the food sciences, some effort has also been made to compare different sensory evaluation methods. While many of the comparisons focus on the pure juxtaposition of results [53][37], McTigue et al. [33] describe a set of requirements for the holistic comparison of descriptive methods. In a comparison of four descriptive analysis approaches, they applied the following criteria: subject selection, number of subjects, training, samples evaluated, replications, method of measurement, analysis of data, outcome, and professional personnel [33]. In addition to results, their comparison criteria describe similarities and differences in terms of requirements in time, test items, personnel, and the need for technical equipment. The importance of including test personnel and technical equipment in a systematic comparison model was also found by Yokum and Armstrong [54]. In a comparison of several forecasting methods, implementation-related criteria like ease of use, ease of interpretation, and cost/time were rated as very important for the overall comparison, beyond validity, reliability, or objectivity. Stecher et al. [51] found similar criteria when comparing assessments in vocational education.

This short review shows the different nature of comparison criteria in different fields of research and underlines the need for an extensive comparison model to guide between-method comparisons. In a further step, we categorized the collected comparison criteria to build up a comparison model. The model is structured from particular criteria up to four more general categories that we identified during the development process: excellence-related, economy-related, implementation-related, and assessment-related criteria (Figure 10). This structure is beneficial for the comparison of methods, as comparisons can be made according to the general categories as well as by means of particularly selected criteria. In the following, the four categories and the most relevant criteria in each category are described in more detail.


Figure 10 Comparison Model with the four categories and corresponding sub-criteria

Economy related criteria - This category comprises criteria that measure the economic potential of a method: by relating the amount of time and the costs of a method to its results, its efficiency can be estimated and compared to other methods. Furthermore, effectiveness assesses the performance of a method, its completeness and accuracy, and whether the desired goals are achieved [23].

Excellence related criteria - Excellence related criteria measure the quality of a test. The criteria validity and reliability are known as quality criteria in the social sciences [4]; they are the main prerequisites of good research practice. The general practice for ensuring validity is a careful and thoughtful test design, in which it is important to consider the factors that could bias the test and threaten validity. Coolican [7] gives a good overview of threats to validity, like history effects, sampling bias, or confounding variables, which require thorough consideration in the test development. Reliability is another very important test quality criterion. Correlation coefficients are used to measure reliability [7], and coefficients above 0.75/0.8 are considered to represent good reliability. Beyond validity and reliability, further excellence related criteria are included in the comparison model (Figure 10). Best practice for a good test design is to describe them carefully in the test development process and to discuss and interpret them.

Assessment related criteria - Assessment related criteria concern the global characteristics of the test. The whole test design depends on the purpose of the test. When designing a test to answer a particular research question, one has to think about the context of the test and an appropriate environment. Moreover, the test participants, their gender, age, expertise, and group composition are relevant, as are the personnel and their demands and duties.

Implementation related criteria - Implementation related criteria concern the implementation of a test. A detailed description of the test procedure is very important to allow the test to be reconstructed. Within the test description, it is also necessary to describe the test items, their production, and how data was captured in the test. In the context of personnel demands and costs, the complexity of a test is an issue to be considered. It can be split into four subcategories: the ease of implementing the test, the ease of using it, the ease of using the data, and the ease of interpreting the results.

Table 9 Selected components for method comparisons

Category | Component | Definition | How to compare/measure
Implementation related | Test procedure | Test procedure and methods of analysis | Detailed description of the procedure (number of sessions, method of measurement, data analysis, and outcome)
Excellence/Economy related | Test results | Results/outcome of the test; interpretation of the data | Describe and interpret results and differences – what does the data tell us
Economy related | Amount of time | The time that it takes to develop, implement, conduct, and analyze the test and to publish the results | Measure the time in minutes and compare between methods
Economy related | Costs | Costs of the test, depending on the time and complexity of the test | Calculate the costs depending on task demands and amount of time

Summarizing, the model comprises many criteria originating from different research areas. Although we have collected a substantial set of criteria, the next step towards making the model a well-validated tool is the operationalization of its components based on defined measures. In the following, we present the results of our work on comparing two different mixed methods approaches for multimodal quality assessment based on an initial set of comparison attributes – descriptions and measures of the test procedure, the results, the amount of time, and the costs – which we see as key criteria for a method comparison (Table 9).


5.3 Comparison Study: Comparing OPQ and CP

5.3.1 Research Method

We applied Open Profiling of Quality (OPQ) and a new variation called Conventional Profiling (CP) in which OPQ's Free-Choice Profiling approach is substituted with a sensory evaluation based on a fixed vocabulary (see Section 4).

5.3.1.1 Participants

A total of 63 test participants took part in the study. All test participants were screened for normal or corrected-to-normal visual acuity, color vision, and 3D vision. All of them can be classified as naïve assessors, as they had experience neither in the domain of research nor in subjective quality evaluation studies. Each test participant completed the psychoperceptual evaluation. For the qualitative part of the study, 15 randomly selected participants were assigned to Conventional Profiling (CP) and 16 participants to OPQ.

5.3.1.2 Variables and their production

The same contents and variables as in our context study (Section 2) were used.

5.3.1.3 Stimuli presentation

The tests were conducted in a laboratory at Ilmenau University of Technology, and the test conditions were arranged according to the specifications in ITU-T P.910 [39]. A FUJIFILM FinePix REAL 3D V1 digital viewer with a resolution of 640x480 pixels was used for playback of the 3D videos. The viewing distance was set to 50 cm initially, but test participants were allowed to adjust it for the best stereoscopic experience. The integrated loudspeakers of the FinePix V1 were used for audio playback due to a missing headphone connection; matching the speakers' maximum sampling rate, audio was reproduced at 11 kHz. Playlists in different pseudo-randomized orders were used for video presentation. During the psychoperceptual evaluation each test item was presented twice; in the OPQ and CP tasks each video was presented once.
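For illustration, such playlists can be generated with a simple constraint-checking loop. The sketch below is written in R; the concrete constraint, that the same content never plays twice in succession, is our illustrative assumption of what the pseudo-randomization enforces.

    set.seed(42)
    contents <- c("24h", "dracula", "makro", "skydive", "streetdance", "theeye")
    qps      <- c("qp30", "qp40", "qp45")
    items    <- as.vector(outer(contents, qps, paste, sep = "_"))

    # resample until no content appears twice in succession
    repeat {
      playlist <- sample(items)
      cnt <- sub("_qp[0-9]+$", "", playlist)
      if (all(head(cnt, -1) != tail(cnt, -1))) break
    }
    playlist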

5.3.1.4 Test Procedure

Open Profiling of Quality (OPQ) and Conventional Profiling (CP) are both methods that extend a psychoperceptual evaluation with a descriptive profiling task. The psychoperceptual evaluation started with training and anchoring: test participants practiced watching the scenes in 3D and carrying out the evaluation task. During this training, the whole range of constructed qualities and contents was presented. Absolute Category Rating (ACR) was applied in the subsequent psychoperceptual evaluation to rate the overall quality on an unlabeled 11-point scale [39]. In addition, the acceptance of the overall quality was rated on a binary (yes/no) scale [27]. After a short break, test participants conducted either a Free-Choice Profiling according to Open Profiling of Quality or the Conventional Profiling.

In Open Profiling of Quality, the psychoperceptual evaluation is complemented with an adaptation of Free-Choice Profiling [13]. It consists of four consecutive steps: 1) introduction, 2) attribute elicitation, 3) attribute refinement, and 4) sensory evaluation. We followed these steps as described in [13]: the introductory 'apple task' and an extensive vocabulary elicitation and refinement are followed by the sensory evaluation. OPQ allows each test participant to develop their own quality attributes, which they then use to evaluate perceived quality. In contrast to its original description [13], the whole OPQ study was conducted in one session.

For the CP study, we applied all attributes included in the framework of components of Quality of Experience for 3D video [28] (see Section 4 Extended OPQ), except the content component. In the test, participants received a list with the quality components, their descriptions, and the scale labels [28] (Table 8) to become familiar with the attributes. They used them in a training task, in which they watched and rated 6 videos on a scoring card comparable to that of OPQ, but with the fixed attributes. Finally, they carried out the quality evaluation of all test items with the predefined quality components.

5.3.1.5 Methods of Analysis

As the data deviated significantly from a normal distribution (Kolmogorov-Smirnov: p < .05), nonparametric methods were used for the analysis of the psychoperceptual evaluation. The ordinal dependent variables were analyzed using Friedman's test, and pairwise comparisons were analyzed with Wilcoxon's test [7]. Frequencies were counted for the acceptance ratings. PASW Statistics 18 was used for the quantitative data analysis. The sensory data was analyzed using R with the FactoMineR package: a Multiple Factor Analysis (MFA) and a joint Hierarchical MFA (HMFA) were calculated from the OPQ and CP data sets [12].
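A minimal sketch of the distribution check that motivated the nonparametric tests, on synthetic 11-point ACR-style scores (illustrative only; ties in discrete scores produce a warning):

    set.seed(1)
    x <- sample(0:10, 120, replace = TRUE)   # synthetic 11-point ACR-style ratings
    z <- (x - mean(x)) / sd(x)               # standardize before comparing to N(0,1)
    ks.test(z, "pnorm")                      # p < .05 -> use nonparametric tests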

5.3.2 Results

5.3.2.1 Psychoperceptual evaluation

The presented stimuli reached an acceptance level of 52.8% in total. Items with qp30 were accepted with a minimum of 82.5% per content and with 94.2% over all contents. Items with qp40 reached an acceptance level of 52.4% over all contents, whereas items with qp45 were not acceptable at all (11.9%) (Figure 11). The coding quality parameters influenced the overall quality perception when averaged over all contents (Fr = 1363.028, df = 2, p < .001). Figure 12 shows the mean satisfaction scores averaged over all participants for the different contents and quality parameters. Videos with qp30 provided the most satisfying quality compared to qp40 and qp45 (all comparisons: p < .001).

Figure 11 Overall Acceptance scores for the items under test


Figure 12 Mean Satisfaction scores for all contents and quality parameters

5.3.2.2 Sensory evaluation OPQ data

The first two components of the MFA model from the OPQ data account for 56.42% of explained variance (dimension 1: 44.25%, dimension 2: 12.17%). The correlation plot in Figure 13 presents the attribute distribution in the perceptual space. The negative polarity of dimension 1 is described by attributes like grainy (e.g. p34.20, p31.12), artifacts (p1.39), or stumbling (p47.60). The positive polarity of dimension 1 is described by attributes like 3D effect (e.g. p37.67, p22.4), sharpness (e.g. p38.78, p47.57), naturalness (p38.81), or details (p22.7). Items with qp30 (see Figure 14: items with black dots) lie at the positive polarity of dimension 1 and are described with positive attributes, whereas items with qp45 lie at the opposite polarity and are described with negative video quality attributes. Dimension 2, in its positive polarity, mainly describes the ghosting effect of some items with attributes like double pictures (p29.65, p37.70, p27.25, p2.31). Its negative polarity shows partial correlation with attributes like bright (p49.54), perceivable as one picture (p17.38), and nice colors (p29.63). The test items are distributed along dimension 1 according to their quality parameter, as can be seen in Figure 14. Dimension 2 mainly has an influence on content dracula, which can be explained by the ghosting effects perceived within this content.

Figure 13 Correlation plot of OPQ results – only attributes having more than 50% of explained variance are shown (attributes are numbered consecutively and presented as participant number.attribute number, e.g. p27.24)

5.3.2.3 Sensory evaluation CP data

For the CP dataset we likewise calculated an MFA over all participants. The CP model resulting from the MFA accounts for 53.46% of explained variance (dimension 1: 44.04%, dimension 2: 9.42%). The item plot (Figure 14, CP: red items) shows that the test items separate along the first component according to the different QPs. Along the second component, especially the items of the content 'Dracula' separate from the other test items. For the sake of clarity we averaged the resulting correlations for all 19 attributes over the participants and present the averaged attribute correlations in Figure 15. While all these attributes show high correlation with dimension 1 and low correlation with dimension 2, the non-averaged results (gray arrows) also reveal high correlation of some attributes with dimension 2. For dimension 1, the highest correlation is given for attributes like 'clarity of motion', 'objects and edges', 'color, brightness and contrast', and 'clarity of image'. For dimension 2, the correlation plot shows that the attributes with high correlation differ and are of classes like 'ease of viewing', 'fluency of motion', or 'perceivable depth'. We furthermore calculated the correlations of each individual PCA result with the overall MFA model (Figure 16). For all participants, the individual dimension 1 (F1) correlates with MFA Dim1, and many of the individual F2 correlate with MFA Dim2. So the structure of the individual data sets seems to be comparable. However, no correlation of the overall attributes with Dim2 can be found within the averaged MFA results (Figure 15).

Considering the individual attributes (gray arrows in Figure 15), this suggests that participants may have understood and used the attributes in different ways, and that the given attributes may not have been adequate for the description of all perceived quality characteristics.
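The averaging step can be sketched as follows in R; the list cor_list of per-participant 19 x 2 correlation matrices is hypothetical (in practice such matrices could be taken from the participants' blocks of the MFA attribute correlations):

  set.seed(2)
  # Hypothetical per-participant correlations of the 19 attributes with
  # dimensions 1 and 2, for 15 participants.
  cor_list <- replicate(15,
                        matrix(runif(19 * 2, -1, 1), nrow = 19,
                               dimnames = list(paste0("a", 1:19), c("Dim1", "Dim2"))),
                        simplify = FALSE)

  cor_array <- simplify2array(cor_list)          # 19 x 2 x 15 array
  cor_avg   <- apply(cor_array, c(1, 2), mean)   # mean correlation per attribute
  round(cor_avg, 2)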

Figure 14 MFA item plot (OPQ items in black and CP items in red)

Figure 15 Correlation plot of averaged MFA correlations over all participants and non-averaged results (gray arrows) with some exemplary labels (a1: perceivable depth, a2: impression of depth, a3: fore/background layers, a4: balance of fore/background quality, a5: clarity of image, a6: block-free image, a7: color, brightness and contrast, a8: objects and edges, a9: fluency of motion, a10: clarity of motion, a11: nature of motion, a12: ease of viewing, a13: pleasantness of viewing, a14: enhanced immersion, a15: visual discomfort, a16: comparison to existing technologies, a17: overall quality, a18: audio, a19: audiovisual)

Figure 16 Individual configurations and their partial plots in the overall MFA model

5.4 Systematic Comparison of Methods

Open Profiling of Quality and Conventional Profiling both provide valuable results for understanding the underlying quality factors of the psychoperceptual evaluation. However, we aim at a holistic understanding of similarities and differences among the methods. In the following, we compare the two methods based on our selected comparison criteria. For the whole comparison, we assume that both methods are valid and reliable thanks to a careful test design and the usage of both methods in earlier evaluations (described in [10][12]).

5.4.1 Test results

To be able to compare both results on a statistical basis, we conducted an HMFA (Figure 17). The result and the partial clouds of the OPQ and CP data sets show good agreement in the discrimination of the test items. Especially the deviation along the first dimension is low, so that both methods seem to be able to classify video quality similarly. Small differences can be found along dimension 2. Here, the OPQ data set seems to be more sensitive in capturing the impact of ghosting effects and double pictures caused by crosstalk.
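A sketch of how such a joint HMFA could be set up with FactoMineR is given below; the item count, the per-participant attribute counts, and the random ratings are all made up for illustration, and the two-level hierarchy (participants within methods) is our assumption about a reasonable grouping:

  library(FactoMineR)

  set.seed(3)
  # Hypothetical ratings: 9 test items in rows; three OPQ participants with
  # 4, 6, and 5 own attributes, and two CP participants with 19 fixed
  # attributes each (all column counts are made up).
  n_attr_opq <- c(4, 6, 5)
  n_attr_cp  <- c(19, 19)
  joint <- as.data.frame(matrix(runif(9 * sum(n_attr_opq, n_attr_cp)), nrow = 9))

  # Two-level hierarchy: level 1 groups the columns per participant,
  # level 2 collects the participant groups into the two methods (OPQ, CP).
  hierarchy <- list(c(n_attr_opq, n_attr_cp),
                    c(length(n_attr_opq), length(n_attr_cp)))

  res_hmfa <- HMFA(joint, H = hierarchy,
                   type = rep("s", length(hierarchy[[1]])),  # standardized data
                   graph = FALSE)
  res_hmfa$eig   # explained variance of the joint model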

5.4.2 Test procedure

Both methods follow the same overall procedure of psychoperceptual evaluation and sensory analysis. OPQ and CP differ in the important steps of attribute generation and refinement (OPQ) versus attribute familiarization and training (CP). In OPQ, participants generate and refine their own vocabulary for the quality evaluation, whereas in CP we provide a consensus vocabulary and participants have to familiarize themselves with the predefined quality attributes in terms of meaning and usage.

Figure 17 Superimposed representation of the partial clouds of the HMFA of the OPQ dataset (black dots) and the CP dataset (red dots)

5.4.3 Amount of time

We measured the time per participant for conducting the whole test session in minutes and calculated the average amount of time. The mean test duration for the psychoperceptual evaluation at the beginning of both methods was 31.9 minutes (standard deviation (sd) = 5.5). On average, participants needed 51.1 minutes (sd = 5.3) for the CP task and 40.5 minutes (sd = 7.6) for the OPQ task. This shows that participants needed significantly more time for the CP task than for the OPQ task (t-test: T = 4.493, df = 29, p ≤ 0.001). The large set of 19 attributes and the inexperience of our participants may have made the CP task long-lasting. This time could be shortened by using more experienced participants and fewer attributes.
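A minimal sketch of this comparison in R; the simulated duration vectors are hypothetical, and the related-samples form of the test is our assumption, matching the reported df = 29 for 30 participants:

  set.seed(42)
  # Hypothetical task durations in minutes, one value per participant and method.
  t_cp  <- rnorm(30, mean = 51.1, sd = 5.3)
  t_opq <- rnorm(30, mean = 40.5, sd = 7.6)

  # Related-samples t-test on the data-collection time (assumption: each
  # participant contributes one CP and one OPQ duration).
  t.test(t_cp, t_opq, paired = TRUE)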

5.4.4 Costs

Costs of a study depend on the personnel demands for different tasks and the amount of time that is needed for the study from planning to reporting of its results. Within our study, the conduction time was higher for CP than for OPQ, and therefore OPQ produced lower costs. However, carefully conducted OPQ sessions demand more experienced researchers than CP: the guidance of test participants through attribute elicitation and refinement, important steps in individual vocabulary profiling, requires more experience for correct test conduction. We have found that it is very difficult to compare the global costs of a method, as the measurement of task demands and of the amount of time for certain tasks is not straightforward and depends on the prior knowledge and experience of the researchers with one or the other method.

5.4.5 Research purpose

Beyond the comparison of results and costs, the research purpose itself is a crucial criterion for a holistic comparison. The research purpose of a study is usually the starting point when investigating a certain research question. OPQ is especially suitable for research areas in which quality perception is not yet fully understood and no consensus vocabulary exists. OPQ studies help to identify crucial individual quality attributes, but communication of results to other parties of a development process is hard due to the individual nature of the attributes. In contrast, CP is useful in an already explored research area with a defined set of quality components. If these components are well defined and validated, CP offers a good method for the discrimination of different test stimuli and for good communication of results based on the fixed components.

5.5 Discussion

We developed an extensive comparison model to guide between-method comparisons based on a literature review, and we compared two mixed methods using a subset of criteria of the model. The model contains four main criteria, called economy, excellence, implementation, and assessment, including 24 sub-criteria. At the current stage, the model gives an overview of the evaluation criteria, but further definition and operationalization are needed on the way towards an applicable comparison model. Furthermore, it is essential to reflect the different parts of the model in relation to the central methods in multimodal quality evaluation research (e.g. [24][10][26]). This would assist practitioners in their choice of a suitable research method.

The comparison of the mixed methods on a subset of implementation, effectiveness and economy-related criteria showed benefits and weaknesses of the methods depending on the measured dimensions. In the implementation, conventional profiling provides a consensus vocabulary for the intended research problem and requires training for each participant to use it. OPQ requires a multistep procedure in which participants develop their own vocabulary to be used in rating, but it does not require prior knowledge of the research problem. Analysis of this criterion suggests that OPQ is more suitable for identifying the quality of a novel phenomenon. According to the economy-related criteria, conventional profiling is slightly more time-consuming (26%) in the data-collection phase than individual profiling. However, this conclusion is highly dependent on the number of rated attributes used as well as on the presentation of stimuli. The result of the joint HMFA analysis was similar in the dominating dimension (positive-negative quality), indicating a good accordance of both methods, and is validated by comparable results from previous studies [10]. However, small differences appeared between the descriptions of crosstalk, which can be explained by an inconsistent use of the fixed vocabulary to describe this artifact. It seems that the OPQ method can capture this phenomenon in more detail. However, the currently used subset of comparison criteria does not allow drawing strong conclusions about universal preferences between the methods. Further work needs to address the comparison between the methods more holistically, utilizing the comparison model developed.

6 Prototype Study: Usability and quality experience with the final mobile 3D prototype in the context of use

6.1 Preliminary planning

Different coding methods and transmission parameters were developed and examined within the core technology development of Mobile3DTV. Information about the best coding methods and transmission parameter combinations was already gathered in previous experiments [10][11][12] and used for further optimizations of the Mobile3DTV system. In a final step, the developed technologies were combined into one end-product. This end-product is developed according to the user requirements for mobile 3D television and video [30] and the quality evaluation results from previous experiments.

The goal of the study is to validate the optimized Mobile3DTV system and the prototype in field settings under usage conditions as natural as possible.

6.1.1 Participants

We plan to conduct the study with at least 30 participants. Participants should reflect the mobile 3DTV user profiles drawn from the previous studies. An end-user group aged from 20 to 45 is assumed. All participants have to be screened for normal or corrected-to-normal visual acuity (myopia and hyperopia, Snellen index 20/30), color vision using the Ishihara test, and stereo vision using the Randot Stereo test (≤ 60 arcsec).

6.1.2 Test design

A factorial, related design [7] is to be applied to the experiment. The within-subject variables are the content and the presentation modes. Participants should do the quality evaluation in three contexts.

6.1.3 Test procedure

A combination of a psychoperceptual quality evaluation and descriptive interviews is chosen. The psychoperceptual quality evaluation consists of a pre-test, a training session, and the quality evaluation. For the descriptive interview we use a semi-structured interview after the psychoperceptual evaluation in each context and at the end of the tests. Furthermore, participants have to do post-task tests, including a demographic and a workload questionnaire at the end of the tests. Quality evaluations are carried out in every context (laboratory, bus, and café), including quantitative ratings and qualitative interviews. An overall interview concludes the experiment after all three quality ratings in the three contexts.

6.1.3.1 Story

The whole study is planned as one big mobile 3D video usage story. Participants arrive at the laboratory, do the pre-tests, choose the content they want to watch, and do the first quality evaluation in the laboratory as our control setting. According to a short story, 15 participants take a bus to a café and do the quality evaluation in both contexts, first in the bus and afterwards in the café. The task for these participants is to imagine that they take the bus to meet a friend in the café and wait for him there. The other 15 participants walk to the café to meet a friend there and go home by bus afterwards.

6.1.3.2 Psychoperceptual evaluation

Pre-test

Before the evaluation starts, participants are to be introduced to the test and sign a data protection policy. The screening of participants for myopia, hyperopia, color vision, and stereo vision is done before the actual start of the test session.

Accommodation and training

Accommodation and training takes place only in the laboratory setting to familiarize the participants with the device and the stereoscopic videos. All participants watch some high quality stereoscopic videos to get used to the device and to find a good viewing position for an optimal three-dimensional viewing experience.

In the training and anchoring task we show the participants a subset of the test items. This subset represents the extreme quality levels of the items and all contents. The intention of the training is to familiarize participants with the evaluation task and the usage of the quality scales. Furthermore, participants are not told about the quality factors under test, so that they are expected to use their own quality reference. In the evaluation task they rate the quality acceptance and the overall quality of each test item.

Quality evaluation

As in previous studies [45] we will use Absolute Category Rating (ACR) according to ITU-T P.910 [39] for the evaluation of the overall quality. Additionally, we will use the Acceptance Threshold according to Jumisko-Pyykkö et al. [27] to measure the general quality acceptance of the test items. The general acceptance is rated on a binary yes-no scale. The overall quality is evaluated on an 11-point unlabeled scale. All ratings are given on a small scoring card (A6) that hangs around the user's neck when moving between the contexts (see Figure 18). Test items are presented in randomized orders and participants rate all items twice. The quantitative session takes about 40 minutes on average in the laboratory, including the pre-tests and training, and 20 minutes in the contexts bus and café, which comprise only the quality evaluation without repeating the pre-tests and training.
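The generation of such randomized presentation orders could be sketched as follows in R; the item grid (20 contents in two playback modes) and the helper function make_playlist are hypothetical:

  # Hypothetical item grid: 20 contents in two playback modes.
  items <- expand.grid(content = paste0("content", 1:20),
                       mode    = c("2D", "3D"))

  # Randomized playlist for one participant; every item appears twice.
  make_playlist <- function(seed) {
    set.seed(seed)
    twice <- items[rep(seq_len(nrow(items)), times = 2), ]
    twice[sample(nrow(twice)), ]
  }

  head(make_playlist(seed = 1))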

Figure 18 Scoring card (A6) used for the quality ratings in the different contexts

6.1.3.3 Qualitative, descriptive Interviews

A semi-structured interview is to be done after the quality evaluation in each context. This type of interview uses main and supporting questions to ask the participant for detailed explanations of previous answers, but still uses an overall interview guideline. Such a guideline enables the later comparison of the interviews. The following questions are suggested for the interviews.

Main Questions:

What kind of factors did you pay attention to while viewing in this situation?

What kind of thoughts, feelings and ideas came to your mind while viewing?

How did you experience your surroundings?

Did you notice any positive or negative things, things you liked, didn't like, things that disturbed you?

Which were the quality characteristics you paid attention to?

Supporting Questions:

Please, could you describe in more detail what you mean by X (answer of main question)?

Please, could you describe in more detail when/how X appeared?

Please, could you clarify if X was among annoying – acceptable – pleasurable/negative – positive factors?

At the end of the whole test, another semi-structured interview following the same guideline is done.

6.1.3.4 Questionnaires

Following previous studies by Jumisko-Pyykkö et al. [29], we suggest administering two questionnaires in our study, to be filled out by the participants at the end of the experiment. In the first questionnaire we ask about demographic data of the participants as well as about their interest in the content and knowledge about the technology. A workload questionnaire (based on the NASA-TLX questionnaire [19]) is to be used to evaluate the demands of the evaluation task in the context of use.
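For illustration, the weighted NASA-TLX workload score [19] can be computed from the six subscale ratings and the pairwise-comparison weights as sketched below; the values for this single hypothetical participant are made up:

  # Hypothetical NASA-TLX data for one participant: six subscale ratings (0-100)
  # and the number of times each subscale was chosen in the 15 pairwise
  # comparisons (the weights sum to 15).
  ratings <- c(mental = 70, physical = 20, temporal = 55,
               performance = 40, effort = 60, frustration = 35)
  weights <- c(mental = 4, physical = 1, temporal = 3,
               performance = 2, effort = 4, frustration = 1)

  stopifnot(sum(weights) == 15)

  # Overall weighted workload: rating-by-weight products averaged over the
  # 15 pairwise comparisons.
  workload <- sum(ratings * weights) / 15
  workload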

6.1.4 Test Material and Apparatus

6.1.4.1 Selection of Test Sequences

We have collected 20 different contents from which to make a selection for the study. The contents were selected according to the user requirements of mobile 3D television and video ([30][13][9]). Contents were selected from four categories: animation, sports, user-created, and documentation, with five different contents in each category. The contents differ in their length from 40 seconds to approximately 3 minutes, depending on a meaningful plot and the available length of the video.

Additionally, contents were chosen to represent a variation of different content parameters like spatial details, temporal details, depth complexity, scene cuts, and audio. Table 10 illustrates the 20 contents with screenshots and provides the duration and a short description of the content parameters.

6.1.4.2 Selection of Test Parameters

For the evaluation of the prototype we want to present video sequences in high quality. For this purpose we selected MVC as the coding method, because this method proved to be one of the best for mobile television and video and it can be used to encode videos consisting of left and right video streams ([11][8]). To achieve comparably high quality for all sequences, the quantization parameter (QP) of the encoder should be set no higher than 30, the level that provided high quality for simulcast coding. For an optimized transmission simulation of the video sequences we choose two different transmission scenarios: on the one hand, the sequences can be optimized for stationary transmission for the café and laboratory contexts; on the other hand, the sequences can be optimized for the moving context in the bus. The prototype device allows switching the playback mode between stereoscopic (3D) and monoscopic (2D). In 2D playback mode the right video stream is substituted by the left stream, so that the sequence consists of two left video streams. Participants in the test are not allowed to choose the playback mode themselves, as they must not know the actual playback mode. For this reason we generate 3D and 2D versions of each video sequence beforehand.

6.1.4.3 Apparatus and Test Setup

The tests will be conducted in three different context settings. The first context is the Listening Lab at Ilmenau University of Technology. This laboratory offers a controlled test environment. The laboratory settings for the study are set according to ITU-R BT.500 [38]. As another context we chose the student café on the university campus. The café is in the basement of one of the student hostels and is used by students and employees on weekdays from 11 to 17 o'clock. The café consists of a bar, a lounge with comfortable sofas, and another room with tables. To ensure comparable test conditions, the test is conducted in the café only between 12 and 15 o'clock, when approximately the same number of people is present. Participants are always placed at the same table, from which they can see the entrance. Lighting conditions at this table are comparable, and participants are always told to find a good viewing position with as few light reflections on the display as possible. The background noise is generated by talking people, music, and kitchen noise. We chose a bus ride as the third context for our study to have a "moving" context of use. Participants take the bus from the laboratory to the café or from the café to their home. The bus is a regular local public transport bus in Ilmenau. Participants are asked to always use the same seat in the back of the bus. Both bus routes (laboratory to café and café to the participant's home) have approximately the same length of about 15 minutes.

Table 10 The 20 contents under assessment (VSD: visual spatial details, VTD: temporal motion, VD: amount of depth, VDD: depth dynamism, VSC: amount of scene cuts, A: audio characteristics)

Animation:
Knight's quest (1:00 min) – VSD: medium, VTD: high, VD: high, VDD: high, VSC: medium, A: movie sounds, atmosphere sounds
Dracula (40 sec) – VSD: medium, VTD: high, VD: high, VDD: high, VSC: high, A: movie sounds, atmosphere sounds
Shrek the Third Movie Part (3:00 min) – VSD: medium, VTD: medium, VD: medium, VDD: medium, VSC: low, A: movie sounds, engl. speech
Cloudy with a chance of meatballs Trailer (1:20 min) – VSD: high, VTD: medium, VD: high, VDD: medium, VSC: low, A: movie sounds, engl. speech
Ice Age 3 Movie Part (2:20 min) – VSD: medium, VTD: medium, VD: medium, VDD: medium, VSC: medium, A: movie sounds, atmosphere sounds

Sports:
24h Race (1:50 min) – VSD: medium, VTD: high, VD: medium, VDD: high, VSC: high, A: racing sounds, atmosphere sounds, music
Skydive (2:20 min) – VSD: low, VTD: medium, VD: medium, VDD: medium, VSC: low, A: music, atmosphere sounds
Motocross Race (1:20 min) – VSD: high, VTD: high, VD: high, VDD: high, VSC: medium, A: racing sounds, music
FIFA soccer WM ARG-NIG (2:30 min) – VSD: high, VTD: medium, VD: medium, VDD: medium, VSC: low, A: stadium sounds, engl. speaker
Downhill skiing (2:00 min) – VSD: medium, VTD: high, VD: high, VDD: medium, VSC: low, A: skiing sounds, atmosphere sounds, music

User-created:
NY-Street Dance (2:50 min) – VSD: medium, VTD: high, VD: medium, VDD: low, VSC: no, A: music, clapping hands, voices
Mountain Bike Race (2:00 min) – VSD: medium, VTD: high, VD: high, VDD: medium, VSC: medium, A: music
Cave (2:40 min) – VSD: medium, VTD: low, VD: medium, VDD: low, VSC: low, A: music, atmosphere sounds
Berlin in 3D (2:10 min) – VSD: medium, VTD: medium, VD: medium, VDD: medium, VSC: low, A: music, atmosphere sounds
Rhine Valley (1:50 min) – VSD: medium, VTD: low, VD: medium, VDD: medium, VSC: low, A: music, atmosphere sounds

Documentation:
Wildearth Safari (2:30 min) – VSD: medium, VTD: medium, VD: medium, VDD: medium, VSC: medium, A: atmosphere sounds
Rhine Valley (1:30 min) – VSD: medium, VTD: medium, VD: high, VDD: medium, VSC: low, A: music, atmosphere sounds
The eye (2:20 min) – VSD: medium, VTD: low, VD: medium, VDD: low, VSC: medium, A: music
Heidelberg (2:40 min) – VSD: medium, VTD: medium, VD: medium, VDD: medium, VSC: low, A: music, atmosphere sounds
Makroshow (1:50 min) – VSD: medium, VTD: high, VD: medium, VDD: high, VSC: medium, A: music

6.2 Actual planning

Planning of the final usability tests of the prototype and system was transferred to WP6, to be held as part of the physical end-to-end system set-up. Therefore, the plan was modified and reduced accordingly. We also refer to Deliverable D6.8 "Complete end-to-end 3DTV system over DVB-H" for the actual conduct of the tests and their results.

6.2.1 Participants

We plan to conduct the study with at least 30 participants. Participants should reflect the mobile 3DTV user profiles drawn from the previous studies. An end-user group aged from 20 to 45 is assumed. However, due to the limited time of the tests, the group of participants categorized as 'early adopters' should be favored. All participants have to be screened for normal or corrected-to-normal visual acuity (myopia and hyperopia, Snellen index 20/30), color vision using the Ishihara test, and stereo vision using the Randot Stereo test (≤ 60 arcsec).

6.2.2 Test design

A factorial, related design [7] is to be applied to the experiment. The within-subject variables are the content and the presentation modes. Context is reduced to free watching in a cafeteria.

6.2.3 Test procedure

The procedure will include descriptive interviews. A pre-test, a training session, and free watching follow each other. For the descriptive interview we use a semi-structured interview after the free watching in the chosen context. Furthermore, participants have to do post-task tests, including a demographic and a workload questionnaire at the end of the tests.

6.2.3.1 Story

The whole study is planned as a mobile 3D video usage story. Participants arrive at the laboratory and do the pre-tests. Then they are familiarized with the prototype and its functionality. The device allows for watching three thematic TV channels including documentary, cartoon, and sport. According to the story, the participant comes from a lecture and has some time to spend in the cafeteria while waiting to meet a friend. That spare time varies between 30 and 40 minutes.

6.2.3.2 Pre-test

Before the evaluation starts, participants are to be introduced to the test and sign a data protection policy. The screening of participants for myopia, hyperopia, color vision, and stereo vision is done before the actual start of the test session.

Accommodation and training takes place in the laboratory to familiarize the participants with the device and the stereoscopic videos. All participants watch some high quality stereoscopic videos and play with the device controls to get used to the device and to find a good viewing position for an optimal three-dimensional viewing experience.

6.2.3.3 Qualitative, descriptive Interviews

A semi-structured interview is to be done after the free watching in the cafeteria context. This type of interview uses main and supporting questions to ask the participant for detailed explanations of previous answers, but still uses an overall interview guideline. Such a guideline enables the later comparison of the interviews. The following questions are suggested for the interviews.

Main Questions:

What kind of factors did you pay attention to while viewing in this situation?

What kind of thoughts, feelings and ideas came to your mind while viewing?

How did you experience your surroundings?

Did you notice any positive or negative things, things you liked, didn't like, things that disturbed you?

Which were the quality characteristics you paid attention to?

Supporting Questions:

Please, could you describe in more detail what you mean by X (answer of main question)?

Please, could you describe in more detail when/how X appeared?

Please, could you clarify if X was among annoying – acceptable – pleasurable/negative – positive factors?

6.2.4 Test Material and Apparatus

6.2.4.1 Selection of Test Sequences

The test sequences are to be selected from the collection of 20 different contents described in the preliminary planning section. They should form three TV channels: animation, sports, and documentary (including user-created content). The selection should be sufficient for 30-40 minutes of watching.

6.2.4.2 Selection of Test Parameters

The video sequences are to be encoded with MVC ([11], [8]) at a favorable quantization parameter (QP), e.g. 29. The channel settings are to be selected to be optimal for the anticipated stationary transmission in the cafeteria scenario, and the corresponding transport streams are to be prepared and stored for real-time transmission on the Cardinal play-out. Only the 3D viewing mode is assumed.

6.2.4.3 Apparatus and Test Setup

Pre-tests are to be accomplished in the 3D Media Lab of Tampere University of Technology. As the second context we chose the student café ROM on the TUT campus. The cafeteria is on the first floor of the information technology building (Tietotalo) and is used by students and employees on weekdays from 8:30 to 17 o'clock. The cafeteria consists of a coffee shop, a lounge with comfortable sofas, and another area with tables. Approximately the same number of people is normally encountered during working hours, so the tests are scheduled between 9 and 16 o'clock. Participants are always placed at the same sofa, from which they can see the entrance. Lighting conditions at this sofa are comparable, and participants are always told to find a good viewing position with as few light reflections on the display as possible. The background noise is generated mainly by talking people.

For the actual tests and the analysis of their results we refer to D6.8.

References

[1] Bech, S. and Zacharov, N. 2006. “Perceptual Audio Evaluation - Theory, Method and Application”. Wiley, Chichester, England

[2] Bech, S., Hamberg, R., Nijenhuis, M., et al., “Rapid perceptual image description (RaPID) method”, Proc. SPIE 2657, pp. 317-328, doi:10.1117/12.238728, 1996

[3] Brandt, M. A., Skinner, E. Z. and Coleman, J. A. (1963), Texture Profile Method. Journal of Food Science, 28: 404–409. doi: 10.1111/j.1365-2621.1963.tb00218.x

[4] Bryman, A., “Social Research Methods”, 3rd ed., Oxford University Press, Oxford, UK, 2008

[5] Civille, G. V. and Lawless, H. T. (1986), The Importance of Language in Describing Perceptions. Journal of Sensory Studies, 1: 203–215. doi: 10.1111/j.1745-459X.1986.tb00174.x

[6] Cliff, M. A., Wall, K., Edwards, B. J. and King, M. C. (2000), Development of a Vocabulary for Profiling Apple Juices. Journal of Food Quality, 23: 73–86. doi: 10.1111/j.1745-4557.2000.tb00197.x

[7] Coolican, H. “Research methods and statistics in psychology”, 5th ed., London: Hodder Education, 2009

[8] D. Strohmeier and G. Tech, "Sharp, bright, three-dimensional: open profiling of quality for mobile 3DTV coding methods", Proc. SPIE 7542, 75420T, 2010, doi:10.1117/12.848000

[9] D. Strohmeier, M. Weitzel, S. Jumisko-Pyykkö, "Use scenarios - mobile 3D television and video", special session 'Delivery of 3D Video to Mobile Devices' at the conference 'Multimedia on Mobile Devices', a part of the Electronic Imaging Symposium 2009 in San Jose, California, USA, January 2009.

[10] D. Strohmeier, S. Jumisko-Pyykkö, and K. Kunze, “Open Profiling of Quality: A Mixed Method Approach to Understanding Multimodal Quality Perception,” Advances in Multimedia, vol. 2010, Article ID 658980, 28 pages, 2010. doi:10.1155/2010/658980

[11] D. Strohmeier, S. Jumisko-Pyykkö, K. Kunze, G. Tech, D. Buğdayci, and M. O. Bici, “Results of quality attributes of coding, transmission, and their combinations”, Deliverable 4.3, MOBILE3DTV, Project No. 216503, 2010

[12] D. Strohmeier, S. Jumisko-Pyykkö, K. Kunze, and M. O. Bici, “The Extended-OPQ method for User-centered Quality of Experience evaluation: A study for mobile 3D video broadcasting over DVB-H,” special issue “Quality of Multimedia Experience”, EURASIP Journal on Image and Video Processing, vol. 2011, Article ID 538294, 24 pages, 2011. doi:10.1155/2011/538294

[13] D. Strohmeier, S. Jumisko-Pyykkö, M. Weitzel, S. Schneider, “Report on User Needs and Expectations for Mobile Stereo-video”. Tampere University of Technology, 2008.

[14] D. Strohmeier, S. Jumisko-Pyykkö, U. Reiter, “Profiling experienced quality factors of audiovisual 3D perception”, Proc. of the International Workshop on Quality of Multimedia Experience (QoMEX 2010), Trondheim, Norway, June 2010

[15] Drake, M. and Civille, G. (2003), Flavor Lexicons. Comprehensive Reviews in Food Science and Food Safety, 2: 33–40. doi: 10.1111/j.1541-4337.2003.tb00013.x

[16] E. Samoylenko, S. McAdams, and V. Nosulenko, “Systematic Analysis of Verbalizations Produced in Comparing Musical Timbres,” International Journal of Psychology, vol. 31, no. 6, pp. 255–278, 1996, doi:10.1080/002075996401025.

[17] G. Lorho, “Perceived Quality Evaluation: An Application to Sound Reproduction over Headphones”, PhD thesis, Helsinki University of Technology, Helsinki, Finland, 2010

[18] H. T. Lawless and H. Heymann, Sensory evaluation of food: principles and practices, 1st ed. New York: Chapman & Hall, 1999.

[19] Hart, S. G., Staveland, L. E., “Development of NASA-TLX (Task Load Index): results of empirical and theoretical research”, In Hancook, P. A., Meshkati, N. (eds) “Human mental workload”, North-Holland, Amsterdam, pp 139-183, 1988

[20] Hart, S. G., Staveland, L. E., “Development of NASA-TLX (Task Load Index): results of empirical and theoretical research”, In Hancook, P. A., Meshkati, N. (eds) “Human mental workload”, North-Holland, Amsterdam, pp 139-183, 1988

[21] Hartson, H. R., Andre, T. S., Williges, R. C., “Criteria for evaluating usability evaluation methods”, International Journal of Human-Computer Interaction, Vol. 15, No. 1, pp. 145-181, 2003

[22] Haslam, S., McGarty, C., “Research methods and statistics in psychology”, Sage Publications, London, UK, 2003

[23] ISO 9241-11, “Ergonomic requirements for office work with visual display terminals (VDTs) – Part 11: Guidance on usability”, International Standards Organization, 1998

[24] ITU-T REC. P.910 “Subjective video quality assessment methods for multimedia applications”, Switzerland, 1999.

[25] J. Radun, T. Leisti, J. Häkkinen, H. Ojanen, J. Olives, T. Vuori, G. Nyman “Content and Quality: Interpretation-Based Estimation of Image Quality”, ACM Trans. Appl. Percept. 4, 4, 2008

[26] Jumisko-Pyykkö, S., “User-Centered Quality of Experience and its Evaluation Methods for Mobile Television”, PhD thesis, Tampere University of Technology, 2011, in press

[27] Jumisko-Pyykkö, S., Kumar Malamal Vadakital, V., Hannuksela, M. M., “Acceptance Threshold: Bidimensional Research Method for User-Oriented Quality Evaluation Studies”, International Journal of Digital Multimedia Broadcasting, 2008

[28] Jumisko-Pyykkö, S., Strohmeier, D., Utriainen, T., Kunze, K., “Descriptive quality of experience for mobile 3D video”, In: Proceedings of the 6th Nordic Conference on Human-Computer Interaction: Extending Boundaries (NordiCHI '10), ACM, New York, NY, USA, pp. 266-275, 2010

[29] Jumisko-Pyykkö, S., Utriainen, T., “A Hybrid Method for Quality Evaluation in the Context of Use for Mobile (3D) Television”, Multimedia Tools and Applications, Springer Netherlands, pp. 1-41, 2010

[30] Jumisko-Pyykkö, S., Weitzel, M., Strohmeier, D., “Designing for User Experience: What to Expect from Mobile 3D TV and Video?”, First International Conference on Designing Interactive User Experience for TV and Video, October 22-24, 2008, Silicon Valley, California, USA

[31] Lawless, H. T., Heymann, H., “Sensory Evaluation of Food: Principles and Practices”, Springer Verlag, 1999

[32] Markopoulos, P., Bekker, M., “How to Compare Usability Testing Methods with Children Participants”, In: Interaction Design and Children, Shaker Publisher, pp. 153-159, 2002

[33] McTigue, M. C., Koehler, H. H., Silbernagel, M. J., “Comparison of Four Sensory Evaluation Methods for Assessing Cooked Dry Beans”, Journal of Food Science, Vol. 54, No. 5, pp. 1278-1283, 1989

[34] Meilgaard, M. C., Dalgliesh, C. E., Clapperton, J. F., “Beer flavour terminology”, J Inst Brew, 85:38-42, 1979

[35] Meilgaard, M., Civille, G. V., Carr, B. T., “Sensory Evaluation Techniques”, 3rd ed., CRC Press, 387 pp, ISBN 0-8493-0276-5, 1999

[36] Noble, A.C., Arnold, R.A., Masuda, B.M., Pecore, S.D., Schmidt, J.O., Stern, P.M. 1984. Progress towards a standardized system of wine aroma terminology. Am. J. Enol. Vitic. 35 (2), 76-77.

[37] Perrin, L., Symoneaux, R., Maître, I., Asselin, C., Jourjon, F., Pagès, J., “Comparison of three sensory methods for use with the napping procedure: Case of ten wines from loire valley”, Food Quality and Preference, Vol. 19, No. 1, pp. 1-11, 2008

[38] Recommendation ITU-R BT.500-11. 2002. Methodology for the Subjective Assessment of the Quality of Television Pictures, Recommendation ITU-R BT.500-11. ITU Telecom. Standardization Sector of ITU.

[39] Recommendation ITU-T P.910 “Subjective video quality assessment methods for multimedia applications”, Recommendation ITU-T P.910, ITU Telecom. Standardization Sector of ITU, 1999

[40] Rouse, D., Pépion, R., Hemami, S., Le Callet, P., “Tradeoffs in subjective testing methods for image and video quality assessment”, Human Vision and Electronic Imaging, 7527, 2010

[41] S. Bech, R. Hamberg, M. Nijenhuis, C. Teunissen, H. de Jong, P. Houben, S. Pramanik, “The RaPID perceptual image description method (RaPID)”. In Proc. SPIE. Vol. 2657. 317-328. 1996

[42] S. Jumisko-Pyykkö, “User-Centered Quality of Experience and its Evaluation Methods for Mobile Television”, PhD thesis, Tampere University of Technology, Tampere, Finland, 2011

[43] S. Jumisko-Pyykkö, and M.M. Hannuksela, “Does context matter in quality evaluation of mobile television?”, MobileHCI, Amsterdam, The Netherlands, 2008.

[44] S. Jumisko-Pyykkö, J. Häkkinen, G. Nyman, “Experienced Quality Factors – Qualitative Evaluation Approach to Audiovisual Quality”, Proceedings of the IS&T/SPIE 19th Annual Symposium of Electronic Imaging, Convention Paper 6507-21, 2007

[45] S. Jumisko-Pyykkö, T. Utriainen, “A Hybrid Method for Quality Evaluation in the Context of Use for Mobile (3D) Television”, Multimedia Tools and Applications, 2010

[46] S. Jumisko-Pyykkö, T. Vainio, “Framing the context of use for mobile HCI”, International Journal of Mobile-Human-Computer-Interaction (IJMHCI), 2010

[47] S. Jumisko-Pyykkö, V. K. M. Vadakital, M. M. Hannuksela, “Acceptance Threshold: Bidimensional Research Method for User-Oriented Quality Evaluation Studies.” International Journal of Digital Multimedia Broadcasting, 2008

[48] S. Tamminen, A. Oulasvirta, K. Toiskallio, and A. Kankainen, “Understanding mobile contexts”, Pers Ubiquit Comput, Volume 8, pp. 135-143, 2003.

[49] Shadish, W. R., Cook, T. D., Campbell, D. T., “Experimental and quasi-experimental Designs for Generalized Causal Inference”, Houghton Mifflin Company, 2002

[50] Smilowitz, E. D., Darnell, M. J., Benson, A. E., “Are we overlooking some usability testing methods? A Comparison of Lab, Beta, and Forum Tests”, Proceedings of the Human Factors and Ergonomics Society 37th Annual Meeting, 1993

[51] Stecher, B. M., Rahn, L. M., Ruby, A., Alt, M. N., Robyn, A., “Using Alternative Assessments in Vocational Education”, RAND, 176 pp, ISBN 0-8330-2489-2, 1997

[52] Stone, H. and Sidel, J. L. 2004. Sensory evaluation practices. 3rd ed. Academic Press, San Diego

[53] Williams, A. A., Arnold, G. M., “Comparison of the aromas of six coffees characterized by conventional profiling, free-choice profiling and similarity scaling methods”, Journal of the Sciences of Food and Agriculture, Vol. 36, No. 3, pp. 204-214, 1985

[54] Yokum, J. T., Armstrong, J. S., “Beyond Accuracy: Comparison of Criteria Used to select Forecasting Methods”, International Journal of Forecasting, Vol. 11, pp. 591-597, 1995

[55] S. Jumisko-Pyykkö and T. Utriainen, “Results of the user-centred quality experiments”, Technical Report D4.4 Mobile3DTV, 2009

[56] Lorho, G., “Perceptual evaluation of mobile multimedia loudspeakers”, Proceedings of the Audio Engineering Society 122nd Convention, 2007

[57] D. Strohmeier and S. Jumisko-Pyykkö, “Proposal on open profiling of quality as a mixed method evaluation approach for audiovisual quality assessment”, Proposal no. 181, ITU-T SG12, Question Q13/12, International Telecommunication Union, Switzerland, 2011

[58] S. Jumisko-Pyykkö, “Hybrid method for quality evaluation in the context of use”, Proposal no. 180, ITU-T SG12, Question Q13/12, International Telecommunication Union, Switzerland, 2011

[59] Strauss, A., Corbin, J., “Basics of qualitative research: Techniques and procedures for developing grounded theory”, Sage, Thousand Oaks, CA, USA, Vol. 2, 1998

[60] M. Lambooij, W. IJsselsteijn, M. Fortuin, and I. Heynderickx, “Visual discomfort and visual fatigue of stereoscopic displays: A review,” J. Imaging Science and Technology 53(3), 030201-030201-14, 2009

[61] L. M. J. Meesters, W. A. IJsselsteijn, and P. J. H. Seuntiens, “A survey of perceptual evaluations and requirements of three-dimensional TV,” IEEE Trans. Circuits Syst. Video Tech. vol. 14, no. 3, pp. 381-391, Mar. 2004

[62] R. Kennedy, N. Lane, K. Berbaum, and M. Lilienthal, “Simulator sickness questionnaire: An enhanced method for quantifying simulator sickness,” Int. J. Aviation Psychology 3(3), pp. 203-220, 1993

[63] J. Häkkinen, T. Vuori, and M. Puhakka, “Postural stability and sickness symptoms after HMD use,” in Proc. SMC Symp., 2002, pp. 147–152.

[64] B. K. Jaeger and R. R. Mourant, “Comparison of simulator sickness using static and dynamic walking simulators,” Human Factors and Ergonomics Society Annual Meeting Proc., Virtual Environments, pp. 1896-1900(5), 2001.

[65] M. Pölönen, and J. Häkkinen, “Near-to-Eye Display - An accessory for handheld multimedia devices: Subjective studies,” Journal of Display Technology, vol. 5, no. 9, pp. 358-367, Sep. 2009.

[66] M. Lambooij, W. IJsselsteijn, M. Fortuin, and I. Heynderickx, “Visual discomfort and visual fatigue of stereoscopic displays: A review,” J. Imaging Science and Technology 53(3), 030201-030201-14, 2009.

[67] M. Lambooij, W. IJsselsteijn, I. Heynderickx, “Visual discomfort in stereoscopic displays: a review,” in Proc. SPIE 6490: 64900I (2007).

[68] “Actius AL-3DU Laptop”, Product brochure, Sharp. 2005, Available: www.sharpsystems.com/products/pc_notebooks/actius/al/3du/

[69] “Stereoscopic 3D LCD Display module”, Product Brochure, masterImage, 2009, Available: www.masterimage.co.kr/new_eng/product/module.htm

[70] S. Uehara, T. Hiroya, H. Kusanagi, K. Shigemura, and H. Asada, “1-inch diagonal transflective 2D and 3D LCD with HDDP arrangement,” in Proc. SPIE-IS&T Electronic Imaging 2008, Stereoscopic Displays and Applications XIX, Vol. 6803, San Jose, USA, Jan. 2008.

[71] K. Stanney, R. S. Kennedy, and J. M. Drexler, “Cybersickness is not simulator sickness,” in Proc. 41st Human Factors and Ergonomics Society, 1997, pp. 1138-1142

[72] S. Jumisko-Pyykkö and T. Utriainen, “D4.4 v2.0 Results of the user-centred quality evaluation experiments”, Technical report, MOBILE3DTV, November 2009.

Mobile 3DTV Content Delivery Optimization over DVB-H System

MOBILE3DTV - Mobile 3DTV Content Delivery Optimization over DVB-H System - is a three-year project which started in January 2008. The project is partly funded by the European Union 7th RTD Framework Programme in the context of the Information & Communication Technology (ICT) Cooperation Theme.

The main objective of MOBILE3DTV is to demonstrate the viability of the new technology of mobile 3DTV. The project develops a technology demonstration system for the creation and coding of 3D video content, its delivery over DVB-H and display on a mobile device, equipped with an auto-stereoscopic display.

The MOBILE3DTV consortium is formed by three universities, a public research institute and two SMEs from Finland, Germany, Turkey, and Bulgaria. Partners span diverse yet complementary expertise in the areas of 3D content creation and coding, error resilient transmission, user studies, visual quality enhancement and project management.

For further information about the project, please visit www.mobile3dtv.eu.

Tuotekehitys Oy Tamlink – Project coordinator – FINLAND
Tampereen Teknillinen Yliopisto – Visual quality enhancement, Scientific coordinator – FINLAND
Fraunhofer Gesellschaft zur Förderung der angewandten Forschung e.V. – Stereo video content creation and coding – GERMANY
Middle East Technical University – Error resilient transmission – TURKEY
Technische Universität Ilmenau – Design and execution of subjective tests – GERMANY
MM Solutions Ltd. – Design of prototype terminal device – BULGARIA

MOBILE3DTV project has received funding from the European Community's ICT programme in the context of the Seventh Framework Programme (FP7/2007-2011) under grant agreement n° 216503. This document reflects only the authors' views and the Community or other project partners are not liable for any use that may be made of the information contained therein.